Data-Centric AI December 27, 2022

Report: Data science teams shift from model development to dataset development

As we wrap up 2022, many teams are reflecting on challenges and accomplishments and thinking ahead to new projects, better ways of working, and how to allocate resources in the coming year. The recent Label Studio Community survey reveals trends, challenges and shifting investments for data science teams that will resonate as you step into 2023.

In September, we asked the global Label Studio open source community to tell us about their data labeling operations and how they’ve integrated them into their machine learning workflows. Label Studio is the most popular open source data labeling platform, with more than 150,000 users worldwide, 100,000,000+ annotations created and over 11,000 stars on GitHub. Community members from more than 40 countries participated in the survey. Of the respondents, 75% currently have ML/AI models in production, and another 15% plan to have models in production soon.

One major takeaway from the report is clear: as organizations invest more in their data science operations, the quality of their data will determine how successful their efforts are.

Key findings from the survey include:

Machine Learning and AI are becoming increasingly strategic.

73% of respondents noted their organizations will make a higher level of investment in their ML/AI initiatives in the coming year.

Data poses the biggest challenge to putting ML/AI models into production.

80% of respondents cited accurately labeled data as one of the biggest challenges to getting ML/AI models into production (the top response), while 46% cited lack of data (the second most common response).

Data science teams now spend the majority of their time on dataset preparation, management and iteration, known as dataset development.

72% of respondents reported spending 50% or more of their time on data preparation, iteration and management, while more than one-third (34%) of respondents said they spend 75% or more of their time on the data.

Data preparation and labeling are becoming increasingly cross-functional.

While most respondents have the traditional roles of data scientists and data engineers, the responsibility for data labeling is broad, requiring engagement across the organization, from interns to executives and business leaders. Notably, 20% reported that a mix of roles shared responsibility for data prep, including subject matter experts, who accounted for 5% of responses, and business analysts, who accounted for 3%.

You can read the 2022 Label Studio Community Survey Report here and share your insights and experiences in the Label Studio Slack.