Snorkel: The One-Stop Solution for the Lack of Large Labelled Datasets

Abhijeet Sahoo · Published in The Startup · 7 min read · Jun 26, 2020


Designed by author

“Data is the new oil”

This is a phrase most of you must have heard, right? Well, this was the statement that introduced me to Data Science and got me hooked on exploring this vast field. Like any other novice Data Science enthusiast, I started learning about different machine learning models and moved on to applying them to datasets I downloaded from Kaggle. It felt really good, being able to handle those datasets by doing some pre-processing like dealing with null values and then using the labelled datasets to train models and get my predictions. But as I explored more pragmatic use-cases, I couldn’t help but recall this scene from F.R.I.E.N.D.S —

Still from F.R.I.E.N.D.S — TV show

I was blindsided, and gradually, for me, the statement about data being the new oil transformed into —

“Labelled Training Data: The New New Oil.”

And yes, there are two “new”s in that statement, not a typo, people! 😅 Why, you ask? Let’s get started then —

Today’s state-of-the-art machine learning models are both more powerful and easier to spin up than ever before. Whereas practitioners used to spend the bulk of their time carefully engineering features for their models, we can now feed in raw data — images, text, genomic sequences, etc. — to systems that learn their own features. These powerful models, like deep neural networks, produce state-of-the-art results on many tasks. This new power and flexibility have sparked excitement about machine learning in fields ranging from medicine to business to law.

There is a hidden cost to this success, however: these models require massive labelled training sets. And while machine learning researchers can use carefully manicured training sets to benchmark their new models, such labelled training sets do not exist for most real-world tasks. Creating sufficiently large labelled training datasets is extremely expensive and slow in practice. This is exacerbated when domain expertise is required to label the data, such as a radiologist labelling MRI images as containing malignant tumours or not. In addition, hand-labelled training data is not adaptable or flexible, and is thus entirely unsuitable for learning tasks that change over time. And lastly, hand-labelled data does not grow: if the training data fed into your AI is a fixed-size dataset, then your AI is not going to get smarter over time. What you really want is a system where the AI learns from data labelled in the wild; it gets smarter while you sleep, and you don’t pay a penny. These are the reasons why labelled training data is the real new new oil.

Hence, considering all this, we really needed an alternative. That’s when a team at Stanford developed a set of approaches broadly termed “weak supervision” to address this data-labelling bottleneck. The idea is to programmatically label millions of data points.

Programmatic Labeling

What if there were a shortcut to labelling data? What if subject matter experts could write data-labelling programs, each producing weak labels, that an unsupervised model (one requiring no ground-truth labels) could combine into strong labels? That is the promise of the Snorkel Project that emerged from Stanford’s HazyResearch group. “Back in 2016, we were surprised to notice that a lot of our collaborators in ML were starting to spend the majority of their time building, managing, cleaning, and most of all, labelling massive training datasets — and we asked why there wasn’t a system where practitioners could label and manage their training data in higher-level, programmatic, and ultimately faster ways?” (a member of the research group)

Snorkel team member Henry Ehrenberg calls it an effort to turn the onerous, ad-hoc tasks of hand labelling and manually managing a messy collection of training sets into an iterative development process, guided by a common set of interfaces (like labelling functions) and algorithms (like Snorkel’s label model). Weak supervision through Snorkel was introduced in the landmark 2016 paper Data Programming: Creating Large Training Sets, Quickly and the 2017 paper Snorkel: Rapid Training Data Creation with Weak Supervision, which calls it “a first-of-its-kind system that enables users to train state-of-the-art models without hand labelling any training data.”

End-to-end flow diagram showing how Snorkel is used

Let me first introduce Snorkel a bit. Snorkel uses an unsupervised generative model that learns the dependency graph among the weak label sets produced by small user-written “labelling functions”, by looking at each labelling function’s coverage, overlap, and conflicts with the other labelling functions. This enables it to determine their accuracies as well as their correlations, and the only underlying assumption is that the average labelling function is better than a random coin flip. This seems like mathematical magic, matrix factorization of strange geometries from the Necronomicon. But I was pleasantly surprised to find how well it really does work. And it scales: Google uses it with PySpark to label vast amounts of data for some of its most critical systems [Ratner, Hancock et al., 2018]. Weak supervision has become a fundamental part of Software 2.0. Another important ability of data programming with Snorkel is that it can label data without ever exposing it to human eyes — a critical feature in industries like healthcare and legal services.
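
To make those coverage, overlap, and conflict statistics concrete, here is a minimal sketch using Snorkel’s LFAnalysis utility on a toy label matrix (the matrix values are made up for illustration; by Snorkel’s convention, -1 means the labelling function abstained):

```python
import numpy as np
from snorkel.labeling import LFAnalysis

# A toy label matrix: one row per data point, one column per labelling
# function. -1 means the LF abstained on that data point.
L = np.array([
    [1, 1, -1],
    [0, 1, 0],
    [1, -1, 1],
    [-1, 0, 0],
    [1, 1, 1],
])

# Summarize each LF's coverage (fraction of points it labels), overlaps
# (points where another LF also voted) and conflicts (points where
# another LF disagreed) -- the statistics the label model learns from.
print(LFAnalysis(L=L).lf_summary())
```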

There are three cornerstones of Snorkel’s design:

  1. Its ability to use labels from different weak supervision sources.
  2. Its output of probabilistic labels, which are used to train popular classifiers that generalize beyond the noisy labels.
  3. Its support for users supervising and interacting with the system.

Another important feature of Snorkel is its ability to learn not only the accuracies of the labelling functions but also their dependencies and correlations. Furthermore, while building its generative model, it can either use the correlations between functions or fall back to a simple majority vote, whichever better optimizes the results.
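
For comparison, Snorkel ships a simple majority-vote baseline. A minimal sketch, reusing the toy label matrix L from the sketch above:

```python
from snorkel.labeling.model import MajorityLabelVoter

# Majority vote: each data point gets the label most LFs voted for,
# ignoring abstains; ties come back as abstain (-1) by default.
majority_model = MajorityLabelVoter(cardinality=2)
preds = majority_model.predict(L=L)
print(preds)
```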

Snorkel Architecture


Snorkel’s architecture consists of 3 stages:

  1. Writing labelling functions:
    LFs can wrap various weak supervision sources in a flexible interface.
  2. Modelling accuracies and correlations:
    Snorkel creates a generative model based on the labelling functions’ correlations, that is, where they agree or disagree.
  3. Training a discriminative model:
    While the output of the previous stage is a set of probabilistic labels, the ultimate goal is to train a discriminative model (any popular ML model) that can generalize beyond the noisy output of the generative model.

Humans can use real-world knowledge, context and common-sense heuristics to assign a label to a candidate. We can encode these abilities in a number of labelling functions. Some of the approaches for labelling functions are listed below (the first two are sketched in code after the list):

  • Pattern-Based: Use heuristics and common patterns to write labelling functions (e.g., regular expressions).
  • Distant Supervision: Use an existing knowledge base to label candidates, e.g., a database of drugs and their known adverse drug reactions (ADRs) to label tweets as reporting an ADR or not.
  • Weak Classifiers: Classifiers that are insufficient for our task on their own and provide us with noisy labels.
  • Labelling Function Generator: A built-in Snorkel tool that generates multiple labelling functions from a single resource, such as crowd-sourced labels.
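
To give a flavour of the first two approaches, here is a minimal sketch of a pattern-based LF and a distant-supervision-style LF using Snorkel’s @labeling_function decorator. The tweet field x.text and the KNOWN_ADRS set are hypothetical stand-ins for real data and a real knowledge base:

```python
import re
from snorkel.labeling import labeling_function

# Label values: -1 (abstain) is the Snorkel convention.
ABSTAIN, NO_ADR, ADR = -1, 0, 1

# Hypothetical mini knowledge base of known adverse drug reactions.
KNOWN_ADRS = {"nausea", "headache", "dizziness", "rash"}

@labeling_function()
def lf_pattern_side_effect(x):
    # Pattern-based: a regex heuristic over the raw tweet text.
    return ADR if re.search(r"side effect", x.text, flags=re.I) else ABSTAIN

@labeling_function()
def lf_distant_supervision(x):
    # Distant supervision: vote ADR if the tweet mentions a known ADR term.
    return ADR if any(term in x.text.lower() for term in KNOWN_ADRS) else ABSTAIN
```

Each LF votes or abstains independently; applied over a dataset of tweets, the LFs produce one column each of the label matrix used in the next stage.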

While writing our LFs, we need to keep the following in mind: we want high-coverage, high-accuracy labelling functions, but any labelling function that performs better than random chance will work and improve the final result. Having conflicts is actually good, because it allows our model to learn information about the labelling functions.

Next, Snorkel trains a generative model over this large matrix of labels. The generative model is a probability distribution over the latent variable (the unobservable true label, since our data is unlabelled). It estimates the accuracy of each labelling function while automatically taking into account the pairwise correlations between functions and their labelling propensity (how often a function actually emits a label). By looking at how often the labelling functions agree or disagree with one another, Snorkel learns an estimated accuracy for each supervision source (e.g., an LF that all the other LFs tend to agree with will have a high learned accuracy, whereas an LF that disagrees with all the others whenever they vote on the same example will have a low learned accuracy). Once the model is trained, it can be used to estimate the true label for each candidate. These labels are numbers between 0 and 1 representing the probability of the positive class (a fuzzy, “noise-aware” label instead of a hard 0-or-1 label) and are known as probabilistic labels. Finally, the goal is to train a classifier that can generalize beyond our LFs.
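
As a minimal sketch of this stage, continuing the hypothetical tweet example from the LF sketch above (df_train is an assumed pandas DataFrame with a text column):

```python
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

# Apply the LFs over the unlabelled DataFrame to build the label matrix:
# one row per tweet, one column per LF.
lfs = [lf_pattern_side_effect, lf_distant_supervision]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

# Fit the generative label model on the matrix alone -- no ground truth.
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)

# Probabilistic ("noise-aware") labels: P(positive class) per data point.
probs_train = label_model.predict_proba(L=L_train)
```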

The last part of the pipeline is the fastest one. The output of the generative model is a set of probabilistic training labels; we now use these to train our final noise-aware discriminative model. We can use techniques like logistic regression, SVMs, or LSTMs at this stage. The discriminative model learns a feature representation of the data that goes beyond our labelling functions, which makes it better able to generalize to unseen candidates. This increases recall significantly and gives the final output.
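
A minimal sketch of this final step, assuming the L_train and probs_train from the previous sketch; the bag-of-words featurization and logistic regression are just illustrative choices:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from snorkel.labeling import filter_unlabeled_dataframe
from snorkel.utils import probs_to_preds

# Drop data points on which every LF abstained -- they carry no signal.
df_filtered, probs_filtered = filter_unlabeled_dataframe(
    X=df_train, y=probs_train, L=L_train
)

# Featurize the raw text; the classifier sees features of the data, not
# the LFs, which is what lets it generalize to unseen candidates.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(df_filtered.text)

# Collapse probabilistic labels to hard predictions for a standard
# classifier (a loss consuming the probabilities directly also works).
clf = LogisticRegression(max_iter=1000)
clf.fit(X, probs_to_preds(probs=probs_filtered))
```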

The whole Snorkel framework is based on the data programming paradigm, a vast concept involving a lot of mathematics that I have not covered here because it is beyond the scope of this article (and, honestly, I am not competent enough to explain it). This article is just an overview, and I would encourage you to explore further by reading the papers Data Programming: Creating Large Training Sets, Quickly and Snorkel: Rapid Training Data Creation with Weak Supervision.

Want to get your hands dirty with this tool? Refer to these YouTube videos:

— From someone who likes to explore new ways to deal with the increasing amount of dark data. 😄
