AI/ML

Notes and experiments on AI/ML and design.

Why I changed my mind about weak labelling for ML

A clear and accessible overview by Raza Habib of weak labelling, a technique that generates labelled data from unlabelled data and sounds almost too good to be true.

When I first heard about weak labelling a little over a year ago, I was initially sceptical. The premise of weak labelling is that you can replace manually annotated data with data created from heuristic rules written by domain experts. This didn’t make any sense to me. If you can create a really good rules-based system, then why wouldn’t you just use that system?!

Over the past year, I’ve had my mind completely changed.

The technique works by:

  1. an expert user writing a number of simple rules that classify samples in a broadly reasonable way but don’t have to be 100% correct (e.g. one rule might be “if a headline is in all caps, it is clickbait”, and another might be “if a headline starts with a number, it is clickbait”)
  2. a model then being trained on how the different rules agree, disagree and interact with each other when they are applied to unlabelled samples.

Then you just throw huge amounts of unlabelled data at this process, and it seems to create a classifier that is almost as good as one created from a carefully curated dataset, with no labelling required.
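To make the two steps concrete, here is a minimal sketch in Python. It is not the method from the article: the first two rules come from the examples above, `lf_reporting_verb` is a hypothetical third rule added so the rules can disagree, and the learned model from step 2 is swapped for a plain majority vote, which is the simplest possible way to combine rules. All function names are made up for illustration.

```python
from collections import Counter

# Label values; ABSTAIN means "this rule has no opinion on this headline".
CLICKBAIT, NOT_CLICKBAIT, ABSTAIN = 1, 0, -1


def lf_all_caps(headline):
    """Rule from the text: a headline in all caps is clickbait."""
    letters = [c for c in headline if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return CLICKBAIT
    return ABSTAIN


def lf_starts_with_number(headline):
    """Rule from the text: a headline starting with a number is clickbait."""
    return CLICKBAIT if headline[:1].isdigit() else ABSTAIN


def lf_reporting_verb(headline):
    """Hypothetical rule (an assumption): news-style wording is not clickbait."""
    words = headline.lower().split()
    if "says" in words or "report" in words:
        return NOT_CLICKBAIT
    return ABSTAIN


LABELLING_FUNCTIONS = [lf_all_caps, lf_starts_with_number, lf_reporting_verb]


def majority_vote(votes):
    """Stand-in for the trained model: pick the most common non-abstain vote."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return ABSTAIN  # the rules disagree evenly, so refuse to label
    return top[0][0]


def weak_label(headline):
    """Apply every rule to one unlabelled headline and combine the votes."""
    return majority_vote([lf(headline) for lf in LABELLING_FUNCTIONS])
```

Calling `weak_label` on each unlabelled headline gives a noisy label (or an abstention) that a normal classifier could then be trained on. A learned combiner would go further than the vote here by estimating how much to trust each rule from the pattern of agreements and disagreements.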

Given that access to sufficient high-quality labelled data is such a bottleneck to training reliable models, this appears to be a huge deal.

Markov Guybrush

Greg tweeted ‘Markov Guybrush Threepwood’ the other week. I thought that was a good idea so I spent a bit of Sunday making Markov Guybrush. It’s a Twitter bot that generates random Guybrush-ish Tweets.

All the important bits are taken from the code for Tom’s lovely Markov Chocolates and have been made worse. (I asked nicely first.) His blog post has more information on how it all works.

The dialogue was taken from this play-through of The Secret of Monkey Island and then hit with the Vim-hammer until it looked like this. It’s not everything he says in the game, but it seems to be enough for some randomly generated quips.
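For anyone curious about the general idea, here is a rough sketch of a word-level Markov chain generator in Python. It is not the bot's code or Tom's: `build_chain`, `generate`, and the `CORPUS` lines are stand-ins made up for illustration, and the real thing may well work differently.

```python
import random
from collections import defaultdict

# Stand-in lines for illustration only, not the bot's actual corpus.
CORPUS = [
    "I want to be a pirate",
    "I want to find the treasure",
    "I am selling these fine leather jackets",
]


def build_chain(lines, order=1):
    """Map each run of `order` words to the words seen right after it."""
    chain = defaultdict(list)
    for line in lines:
        words = line.split()
        for i in range(len(words) - order):
            key = tuple(words[i:i + order])
            chain[key].append(words[i + order])
    return chain


def generate(chain, start, max_words=20, rng=random):
    """Walk the chain from `start` until a dead end or `max_words` is hit."""
    out = list(start)
    order = len(start)
    while len(out) < max_words:
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            break  # nothing ever followed this word in the corpus
        out.append(rng.choice(followers))
    return " ".join(out)


if __name__ == "__main__":
    print(generate(build_chain(CORPUS), ("I",)))
```

Each word is picked at random from the words that followed the previous one somewhere in the corpus, which is why the output sounds roughly like the source without repeating it exactly. A bigger `order` gives more coherent but less surprising sentences.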

That’s it. Here’s the code, and here’s the bot. As Guybrush might say: