A clear and accessible overview by Raza Habib of weak labelling, a technique for generating labelled data from unlabelled data that sounds almost too good to be true.

When I first heard about weak labelling a little over a year ago, I was initially sceptical. The premise of weak labelling is that you can replace manually annotated data with data created from heuristic rules written by domain experts. This didn’t make any sense to me. If you can create a really good rules-based system, then why wouldn’t you just use that system?!

Over the past year, I’ve had my mind completely changed.

The technique works by:

  1. an expert user writing a number of simple rules for classifying samples; the rules should seem broadly reasonable but don’t have to be 100% correct (e.g. one rule might be “if a headline is in all caps, it is clickbait”, and another might be “if a headline starts with a number, it is clickbait”)
  2. then training a model to learn how the different rules agree, disagree and interact with each other when they are applied to unlabelled samples.
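
As a rough illustration of the two steps above, here is a minimal Python sketch. The two clickbait rules come from the examples in the list; the third rule (`lf_long_and_plain`), the `ABSTAIN` convention, the sample headlines and all function names are assumptions made up for this sketch. Note also that step 2 is replaced here by a plain majority vote, which is a much cruder stand-in for a trained model that learns how the rules interact.

```python
from collections import Counter

CLICKBAIT, NOT_CLICKBAIT, ABSTAIN = 1, 0, -1


def lf_all_caps(headline: str) -> int:
    """Rule from the text: an all-caps headline is clickbait."""
    return CLICKBAIT if headline.isupper() else ABSTAIN


def lf_starts_with_number(headline: str) -> int:
    """Rule from the text: a headline starting with a number is clickbait."""
    return CLICKBAIT if headline[:1].isdigit() else ABSTAIN


def lf_long_and_plain(headline: str) -> int:
    """Hypothetical extra rule: a long headline without '!' or '?' is not clickbait."""
    is_long = len(headline.split()) >= 8
    is_plain = not any(ch in headline for ch in "!?")
    return NOT_CLICKBAIT if is_long and is_plain else ABSTAIN


RULES = [lf_all_caps, lf_starts_with_number, lf_long_and_plain]


def apply_rules(headlines):
    """Build a label matrix: one row per headline, one column per rule."""
    return [[rule(h) for rule in RULES] for h in headlines]


def majority_vote(votes):
    """Simplified stand-in for the learned model: most common non-abstain vote."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN  # no rule fired
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # rules disagree evenly, so leave the sample unlabelled
    return ranked[0][0]


# Made-up unlabelled headlines, purely for illustration.
headlines = [
    "10 TRICKS DOCTORS HATE",
    "7 things you did not know about your cat",
    "Central bank raises interest rates by a quarter point",
    "Local council meets",
]

label_matrix = apply_rules(headlines)
weak_labels = [majority_vote(row) for row in label_matrix]

for headline, row, label in zip(headlines, label_matrix, weak_labels):
    print(f"{label:>2}  {row}  {headline}")
```

The second headline shows why a smarter combiner is useful: one rule votes clickbait and another votes not clickbait, and a majority vote can only give up, whereas a model that has learned which rules tend to be reliable could still pick a side.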

Then you just throw huge amounts of unlabelled data at this process, and it seems to create a classifier that is almost as good as one created from a carefully curated dataset, with no manual labelling required.
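
To make that last step a bit more concrete, here is a small sketch of training a classifier on weak labels instead of hand-annotated ones. Everything in it is an assumption for illustration: the headlines and their weak labels are made up (in practice they would come out of the rule-combining step), and the tiny naive Bayes classifier is just a convenient choice, not the model used in the overview.

```python
import math
from collections import Counter

# Hypothetical weak labels (1 = clickbait, 0 = not clickbait), standing in
# for the output of the rules; nobody annotated these by hand.
weak_labelled = [
    ("10 tricks doctors hate", 1),
    ("you won't believe these 7 tricks", 1),
    ("shocking tricks you must see", 1),
    ("central bank raises interest rates", 0),
    ("parliament passes budget bill", 0),
    ("bank reports quarterly earnings", 0),
]


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count words per class; returns (word_counts, doc_counts, vocab)."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the class with the higher log-probability (Laplace smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                count = word_counts[label][token] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(weak_labelled)
print(predict(model, "tricks you must believe"))
print(predict(model, "bank passes budget"))
```

The point of training a separate classifier is that it can pick up on words the rules never mentioned, so it may generalise to headlines on which every rule abstains.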

Given that access to sufficient high-quality labelled data is such a bottleneck to training reliable models, this appears to be a huge deal.