Data Labeling Overview
for Machine Learning and Data Science
Data labeling, or annotation, is the process that gives meaning to your raw data. It adds a critical layer of metadata that connects the raw data to the prediction your model is learning to make.
In this data labeling overview, we will outline the core aspects of data labeling, including data, process, people, and technology. After reading this overview, you should understand the key components of data labeling and how to combine them to build a successful, efficient, and repeatable data labeling system for your organization.
Data labeling is, in and of itself, a process. Think of it as an assembly line that takes raw source data as input and produces meaningful metadata as output, in a format that machine learning algorithms can understand and use to make predictions. For a machine learning or data science project to succeed, you need a well-designed, efficient, and scalable process that can be actively monitored to ensure high-quality, accurate results.
While every project is unique, most projects fall into one of three common categories of data labeling:
- Initial model training
- Model fine-tuning
- Human in the loop
Most aspects of the data labeling process are common across all three categories. However, there are important differences in each category that need to be understood and factored into your process. We’ll discuss those shortly, but first, let’s define the common attributes and components.
- Unlabeled dataset: This is your raw input. It consists of source data that, once labeled, will be used to train your machine learning model.
- Instructions: Clearly document instructions for your data labeling team. Most importantly, describe the data they will be reviewing and the decision(s) they are asked to make. Additionally, list the distinct steps that annotators must follow and document how to use any relevant tools.
- Labeling tasks: Individual samples from your unlabeled dataset are the labeling tasks.
- Annotators: Labeling tasks are assigned to people, or annotators, on your data labeling team.
- Labeled dataset: The aggregated results of your labeling tasks make up your labeled dataset.
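To make the relationship between these components concrete, here is a minimal Python sketch of the assembly line described above. All names (`LabelingTask`, `create_tasks`, `assign_tasks`, `aggregate`) are hypothetical and chosen for illustration only. The round-robin assignment and the majority-vote aggregation are assumptions as well; this overview does not prescribe a particular assignment or aggregation strategy.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class LabelingTask:
    """One sample from the unlabeled dataset, plus the labels collected for it."""
    task_id: int
    sample: str
    instructions: str
    labels: dict[str, str] = field(default_factory=dict)  # annotator -> label


def create_tasks(unlabeled_dataset, instructions):
    """Turn every sample in the unlabeled dataset into a labeling task."""
    return [
        LabelingTask(task_id=i, sample=sample, instructions=instructions)
        for i, sample in enumerate(unlabeled_dataset)
    ]


def assign_tasks(tasks, annotators, per_task=1):
    """Assign each task to `per_task` annotators in round-robin order (an assumption)."""
    if per_task > len(annotators):
        raise ValueError("per_task cannot exceed the number of annotators")
    rotation = cycle(annotators)
    return {task.task_id: [next(rotation) for _ in range(per_task)] for task in tasks}


def aggregate(tasks):
    """Build the labeled dataset by majority vote (an assumption).

    Tasks with no labels are skipped. On a tie, the label submitted first wins.
    """
    labeled_dataset = []
    for task in tasks:
        if not task.labels:
            continue
        label, _count = Counter(task.labels.values()).most_common(1)[0]
        labeled_dataset.append((task.sample, label))
    return labeled_dataset
```

A short usage example with made-up data:

```python
tasks = create_tasks(
    ["great product", "terrible service"],
    instructions="Label each review as positive or negative.",
)
assignment = assign_tasks(tasks, ["ann", "bob", "cho"], per_task=3)

# In a real system the annotators would submit these through a labeling tool.
tasks[0].labels = {"ann": "positive", "bob": "positive", "cho": "negative"}
tasks[1].labels = {"ann": "negative", "bob": "negative", "cho": "negative"}

labeled_dataset = aggregate(tasks)
```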