A Brief Introduction to Data Labeling

Everyone and everything needs training, even machine learning (ML) algorithms. That’s essentially what data labeling is: the process of adding context to raw data so that algorithms can eventually learn to interpret it without human help. This manual training period provides the layer of meaning that algorithms need to find patterns and make predictions. As machine learning continues to expand in scope and capabilities, it’s only a matter of time before more roles in tech participate in data labeling.

Michael Malyuk, Heartex founder and CEO, was recently featured on a podcast episode of Software Engineering Daily discussing data labeling. This article is based on that conversation and includes quotes from the recording.

What Is Data Labeling?

Malyuk defines data labeling as the process used “to prepare the datasets to train your machine learning models or to improve the accuracy of your existing models.”

On a practical level, it involves real people sitting in front of their computers. The setup varies from platform to platform. Heartex’s Label Studio, for example, runs in a browser: data annotators work in a browser window, clicking tags or creating bounding boxes on images.

The industry is still new, so there’s not yet one standard for handling data labeling. It can be a central part of your pipeline or a supplementary add-on, though Malyuk notes it’s becoming a more central part of the pipeline.

“I think we're moving, as an industry, toward what’s called right now data-centric AI,” Malyuk says. “With data-centric AI, the data labeling solution becomes one of the central, most important pieces of the whole workflow.”

Who Are the Data Labelers?

Data labeling is done by many different people, but they mostly fall into three categories: professional annotators, data scientists, and business users.

Professional annotators are people who do data labeling as a core function of their job—it might even be their whole job. They spend all day on a platform labeling data.

Data scientists sometimes label part or all of a dataset themselves as a part of preparing datasets for machine learning.

But business users are the category of data labelers that Malyuk is most excited about because these users bring a lot of subject matter expertise. “We're seeing a lot of those business people get more into data labeling because the type of knowledge that they can provide through the data labeling process is really, really, really valuable.”

A Data Labeling Workflow

To illustrate what a data labeling workflow can look like, consider a project to label all the questions and answers in a Zoom call.

To get started, the data engineer sets up the project and invites the annotators. Label Studio has a Django backend that serves a React app, which opens in a browser. As soon as the data engineer doing the initial setup logs in, Label Studio automatically connects to cloud storage (S3 or Azure Blob storage). When the project is set up, data scientists can email invite links to annotators from inside Label Studio.

As soon as annotators click the emailed links, they’re prompted to create accounts. After registration, annotators are presented with a button that simply says, “Start Labeling.” When they press it, they see instructions for their task and can then start labeling data.

The data itself is a JSON file. Inside it, there’s a URL pointing to the audio file and an object that encodes the transcripts with the start and end positions of questions and answers. Once the annotators are done, the labeled data is sent to verification. When you’ve annotated and verified your data, you’re ready to retrain your ML model.
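
To make the shape of such a file more concrete, here is a minimal Python sketch of one task. The field names (`audio`, `transcript`, `start`, `end`, `label`, `text`), the URL, and the `spans_by_label` helper are illustrative assumptions, not Label Studio’s actual schema.

```python
import json

# Hypothetical task layout: the field names and URL below are illustrative
# assumptions, not Label Studio's actual schema.
task = {
    "audio": "https://example.com/calls/zoom-call-001.mp3",
    "transcript": [
        {"start": 12.4, "end": 15.9, "label": "Question",
         "text": "How do we export the labels?"},
        {"start": 16.2, "end": 21.0, "label": "Answer",
         "text": "Use the export button on the project page."},
    ],
}


def spans_by_label(task, label):
    """Return (start, end) pairs, in seconds, for every segment with `label`."""
    return [(seg["start"], seg["end"])
            for seg in task["transcript"]
            if seg["label"] == label]


# Round-trip through JSON to confirm the structure serializes cleanly.
restored = json.loads(json.dumps(task))
questions = spans_by_label(restored, "Question")
answers = spans_by_label(restored, "Answer")
```

Keeping the audio as a URL rather than embedding it means the JSON stays small, and the start and end positions are enough to locate each question or answer in the recording.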

The Parts of a Machine Learning Pipeline

The machine learning pipeline consists of four major parts:

Infrastructure and hardware
Algorithms
Datasets
People

“I think over the years, what we have seen is companies and the industry as a whole were invested into the infrastructure and algorithms,” Malyuk says. “First, we had some advancements into the hardware, then algorithms, then hardware, then algorithms. And where we stand right now, it seems that the infrastructure and the algorithms are becoming more of a commodity.”

Because so many of these algorithms are published for free, most companies use the same ones. And they essentially run on the same tools and platforms: TensorFlow, Python, AWS, Azure, and GCP.

“But what’s not that easy to commoditize is the actual data sets that are specific to the company and the people that are employed by this company.” Malyuk explains that software built around datasets and people gives companies a competitive advantage over those that keep the focus on hardware and algorithms. Data-centric AI helps companies leverage their uniqueness, Malyuk says, helping “to build the most competitive machine learning models.”

Maintaining Data Label Quality and Accuracy

Maintaining quality and accuracy is a big concern for organizations that want to start using data labeling. Both technological solutions and approaches to the human element can improve accuracy.

How Do You Know Your Data Is Accurately Labeled?

Initially, there weren’t many ways to check if data was accurately labeled. And while the entire data pipeline is now more sophisticated, reviewing annotator performance is still a new feature. Before Label Studio, data labeling teams largely had to hope their labelers were accurately labeling their data. Label Studio incorporated review workflows and agreement matrices so experts, data scientists, or other project leaders could monitor data labeling quality and prevent the skewing of training data.
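
As a rough illustration of what an agreement matrix captures, the Python sketch below computes simple pairwise percent agreement between annotators. This is an assumed, simplified metric chosen for clarity; it is not necessarily how Label Studio calculates its own agreement scores, and the annotator names and labels are made up.

```python
from itertools import combinations


def agreement_matrix(labels_by_annotator):
    """Pairwise share of commonly labeled items on which two annotators agree.

    `labels_by_annotator` maps annotator name -> {item_id: label}.
    Pairs with no items in common are left as None.
    """
    names = sorted(labels_by_annotator)
    matrix = {a: {b: 1.0 if a == b else None for b in names} for a in names}
    for a, b in combinations(names, 2):
        labels_a, labels_b = labels_by_annotator[a], labels_by_annotator[b]
        shared = labels_a.keys() & labels_b.keys()
        if not shared:
            continue
        same = sum(labels_a[item] == labels_b[item] for item in shared)
        matrix[a][b] = matrix[b][a] = same / len(shared)
    return matrix


# Made-up example: three annotators labeling four transcript segments.
annotations = {
    "alice": {1: "Question", 2: "Answer", 3: "Question", 4: "Answer"},
    "bob":   {1: "Question", 2: "Answer", 3: "Answer",   4: "Answer"},
    "carol": {1: "Question", 2: "Question", 3: "Answer", 4: "Answer"},
}
matrix = agreement_matrix(annotations)
```

A reviewer could scan such a matrix for an annotator whose scores are low against everyone else, which is one signal that their labels deserve a closer look.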

The review step is just one more in a plethora of safety measures added to the data pipeline. “You have all sorts of verifications, checks, clues, and procedures along the way before you actually want to store the data that you're getting,” Malyuk points out. “So, I think if you look at all the steps over the pipeline or the typical machine learning pipeline, on every step, there have been some advancements in terms of what type of software and what type of frameworks we use to go through that step.” All these advancements are designed to improve the accuracy of machine learning algorithms.

What Happens When Data Labelers Disagree?

If multiple people review data and come to different conclusions, there are a few options available. First, you can simply ask more people. If three people initially reviewed the data, you might expand it to six.
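
One way to decide when to ask more people is a majority vote with a threshold. The sketch below is an assumption about how such a rule could look, not a description of any particular platform; the `consensus` function and the 60% threshold are hypothetical.

```python
from collections import Counter


def consensus(votes, min_share=0.6):
    """Return the most common label, or None if it falls short of `min_share`.

    A result of None signals that more annotators should review the item.
    The 0.6 default is an arbitrary illustrative threshold.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_share else None


# Three reviewers disagree completely, so no label wins.
first_round = ["Question", "Answer", "Other"]
needs_more_reviewers = consensus(first_round) is None

# Three more reviewers weigh in, and a clear majority emerges.
second_round = first_round + ["Question", "Question", "Question"]
final_label = consensus(second_round)
```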

You can also seek out an expert opinion, someone who knows more and who can provide what’s called a “ground truth annotation” for everyone to work from. This can be particularly useful in situations where a lot of background information is helpful in making the decision. Malyuk brought up the issue of hate speech. Because that’s such a complex issue with so much historical context, Malyuk says bringing in an expert on hate speech to teach the algorithm can increase the model’s accuracy.

And finally, you can always lean on metrics to resolve the disagreement. Which metric you use depends on the type of data you’re labeling. If you’re labeling an image, for example, by drawing bounding boxes over certain objects, you can use the IoU (Intersection over Union) metric. IoU measures how similar two bounding boxes are by dividing the area where they overlap by the total area they cover together. A score near 1 means the annotators drew nearly the same box, while a score near 0 means the boxes barely overlap.
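
For axis-aligned boxes, IoU takes only a few lines to compute. The Python sketch below assumes each box is given as `(x_min, y_min, x_max, y_max)`; that coordinate convention and the function name are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = both areas minus the overlap, which would otherwise count twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two annotators draw 10x10 boxes that are offset by 5 pixels in each direction.
score = iou((0, 0, 10, 10), (5, 5, 15, 15))
```

Here the overlap is 25 square pixels and the union is 175, so the score is 25/175, or roughly 0.14, which most teams would treat as a disagreement worth reviewing.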

What to Look for in Your Data Labeling Platform

When it comes to choosing a platform for data labeling, look for software that’s flexible, Malyuk advises, because you don’t know what data labeling tasks you’ll be faced with in the future. “You may not know right away how you want to label your data. Sometimes you have what we call a policy for the data label, and it may change over time because you're realizing that you need to be labeling the datasets in a slightly different way.”

Invest in a platform that supports many different data types. Not only will you save money by not buying a new platform for every data type you need to label, but your data team will also have an easier time if they only have to learn one platform. Much like software developers write in JavaScript, HTML, and Rust but use the same text editor for all of them, you can give your data team one platform to label every type of data.

Finally, your data labeling platform should be easy to use. For example, Label Studio’s interface allows users to simply click on tags to label data. It also provides templates for as many types of data labeling needs as possible, so users don’t have to waste time manually setting up their labeling projects.

If you want to learn more about data labeling, download the Heartex Data Labeling Guide.

If you’d like to try Label Studio Enterprise, sign up for our 14-day free trial.