Don’t Sleep On Data Operations

When most people think about deep learning practitioners, they think of data scientists who whisper to machine learning models using special powers they learned during their PhDs.

While that may be true for some organizations, the reality of most practical deep learning applications is more banal. The biggest determinant of model performance is now the data, not the model code. And when data is supreme, **data operations** becomes the most important part of your ML team.

An Intro To Data Operations

Fundamentally, data operations teams are responsible for the maintenance and improvement of the datasets that models train on. Some of their responsibilities include:

  • Ensuring the data and labels are clean and consistent. Bad data in the training set means that models will be confused at train time and learn the wrong thing. Bad data in the test set means you can’t trust your model performance metrics to be accurate.
  • Tracing errors in the ML system back to the datapoints (or lack of datapoints) that caused those errors. A good understanding of error cases makes it easier to fix them.
  • Sourcing, labeling, and adding data to the dataset based on current priorities: fixing critical customer problems, addressing deficiencies in the model performance, or expanding model functionality to new tasks / domains.
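As a rough illustration of the first responsibility, one simple consistency check is to look for datapoints that appear more than once with different labels. The sketch below assumes each record is a hypothetical `(item_key, label)` pair, where `item_key` might be a content hash or file ID; the names and sample data are invented for illustration and are not taken from any particular tool.

```python
from collections import defaultdict


def find_label_conflicts(records):
    """Return {item_key: sorted labels} for items that carry more than one distinct label.

    `records` is an iterable of (item_key, label) pairs. The key format is an
    assumption made for this sketch (for example, a hash of the image content).
    """
    labels_by_key = defaultdict(set)
    for item_key, label in records:
        labels_by_key[item_key].add(label)
    # Keep only the items whose copies disagree with each other.
    return {
        item_key: sorted(labels)
        for item_key, labels in labels_by_key.items()
        if len(labels) > 1
    }


# Hypothetical sample data: img_001 was labeled twice with different answers.
records = [
    ("img_001", "plastic"),
    ("img_002", "glass"),
    ("img_001", "glass"),
    ("img_003", "plastic"),
]

conflicts = find_label_conflicts(records)
print(conflicts)  # {'img_001': ['glass', 'plastic']}
```

Conflicts found this way are only candidates for review. A domain expert still has to decide which label is correct, or whether the labeling instructions themselves are ambiguous.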

A data operations team member is often an expert in their domain. Think about a recycling specialist who can distinguish between plastic and glass containers on sight, or a translator who can convert Chinese to Portuguese, or a radiologist who can navigate an MRI and tell you whether a patient has cancer or not.

Data operations personnel can also come from consulting or business backgrounds. It helps to be organized and methodical when working on any operations task, but especially with data. Knowledge of the business goals and the technology’s capabilities can also inform how best to prioritize data curation in order to improve the ML system.

Within data operations teams, team members can be assigned based on the data / model type that they are responsible for (for example, in a self-driving application, different teammates owning the radar, lidar, and image detection systems) or based on the customer / geography that they serve (for example, one team member handling North American deployments and another handling APAC).

Data operations team members often work with offshore labeling teams to help scale the throughput of data labeling. The offshore team deals with tasks that are simpler but take more manual effort, such as adjusting bounding box labels to fit exactly around a variety of objects or labeling pictures of apples vs oranges. In contrast, in-house data operations teammates act as experts who define labeling instructions, inspect the work of the offshore team, and decide how to handle difficult or ambiguous scenarios.

Data operations teams are best suited for jobs that require a smaller quantity of high quality work with relatively low turnaround time. Offshore teams are suited for large amounts of simpler jobs, tasks where quality is not as important as quantity, or situations where labeling throughput is more important than latency.
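One way the in-house team might inspect offshore work is to re-label a small random sample and measure how often the two sets of labels agree. The sketch below is a minimal version of that idea under stated assumptions: labels are stored as plain dictionaries mapping a hypothetical item ID to a label, and simple percent agreement stands in for whatever quality metric a real team would choose.

```python
import random


def sample_for_audit(item_ids, sample_size, seed=0):
    """Pick a reproducible random subset of item IDs for expert review."""
    rng = random.Random(seed)
    ids = list(item_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def agreement_rate(vendor_labels, expert_labels):
    """Fraction of items labeled by both parties where the labels match."""
    shared = vendor_labels.keys() & expert_labels.keys()
    if not shared:
        raise ValueError("no overlapping items to compare")
    matches = sum(vendor_labels[i] == expert_labels[i] for i in shared)
    return matches / len(shared)


# Hypothetical labels from the offshore team and from an in-house expert.
vendor_labels = {"a": "apple", "b": "orange", "c": "apple", "d": "orange"}
expert_labels = {"a": "apple", "b": "orange", "c": "orange", "d": "orange"}

to_review = sample_for_audit(vendor_labels, sample_size=2)
print(to_review)
print(agreement_rate(vendor_labels, expert_labels))  # 0.75
```

A low agreement rate does not by itself say who is wrong. It is a prompt to look at the disagreements and, often, to tighten the labeling instructions.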

