Artificial Intelligence (AI) and Machine Learning (ML) are being adopted by businesses in almost every industry. Many businesses are looking to ML Infrastructure platforms to accelerate their adoption of AI. Understanding the various platforms and offerings can be a challenge. **The ML Infrastructure space is crowded, confusing, and complex.** There are a number of platforms and tools spanning a variety of functions across the model building workflow.

To understand the ecosystem, we broadly break up the machine learning workflow into three stages: data preparation, model building, and production. Understanding the goals and challenges of each stage of the workflow can help you make an informed decision about which ML Infrastructure platforms are best suited to your business’s needs.

ML Infrastructure Platforms Diagram by Author

Each of these broad stages of the Machine Learning workflow (Data Preparation, Model Building, and Production) has a number of vertical functions. Some of these functions are part of a larger end-to-end platform, while others are the main focus of a dedicated platform.

In our last post, we took a deeper look at the Data Preparation part of the ML workflow and discussed ML Infrastructure platforms that focus on data preparation functions. In this post, we will dive deeper into Model Building.

What is Model Building?

The first step of model building is understanding the business needs. What business need is the model addressing? This step starts much earlier, in the planning and ideation phase of the ML workflow. During that phase, as in the software development lifecycle, data scientists gather requirements, consider feasibility, and create a plan for data preparation, model building, and production. In the model building stage, they use the data to run the various modelling experiments they considered during planning.

Feature Exploration and Selection

As part of this experimental process, data scientists explore various data input options to select features. Feature selection is the process of finding the feature inputs for machine learning models. For a new model, this can be a lengthy process of understanding the data inputs available, the importance of each input, and the relationships between different feature candidates. The decisions made here affect how interpretable the model is, how long it takes to train, how much it costs to acquire the features, and how prone the model is to overfitting. Figuring out the right features is a continual, iterative process.

ML Infrastructure companies in Feature Extraction: Alteryx/Feature Labs, Paxata (DataRobot)
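
To make the feature selection step above a little more concrete, here is a minimal sketch of one simple approach, a correlation-based filter. It is only an illustration under stated assumptions, not how any of the platforms named here work: the function names, the toy housing data, and the `redundancy_cutoff` threshold are all hypothetical. It ranks candidate features by their correlation with the target and skips candidates that nearly duplicate a feature already chosen.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, max_features=2, redundancy_cutoff=0.9):
    """Greedy filter: rank by |correlation with target|, skip near-duplicates."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    selected = []
    for name in ranked:
        # A candidate is redundant if it is highly correlated with a kept feature.
        redundant = any(
            abs(pearson(columns[name], columns[kept])) > redundancy_cutoff
            for kept in selected
        )
        if not redundant:
            selected.append(name)
        if len(selected) == max_features:
            break
    return selected


# Hypothetical toy data: three candidate features and a numeric target.
columns = {
    "sqft": [50, 80, 120, 200, 65],
    "sqft_scaled": [500, 800, 1200, 2000, 650],  # rescaled copy of sqft
    "age_years": [30, 5, 12, 1, 40],
}
price = [150, 260, 340, 610, 170]

chosen = select_features(columns, price)
print(chosen)
```

On this toy data the filter keeps one of the two size columns plus `age_years`, since the rescaled copy adds no new information. A linear correlation filter misses non-linear relationships and feature interactions, which is one reason feature selection stays iterative in practice.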

Model Management

There are a number of modelling approaches that a data scientist can try. Some types of models are better suited to certain tasks than others (e.g., tree-based models are more interpretable). By the end of the ideation phase, it will be evident whether the problem is supervised or unsupervised, classification or regression, and so on. However, deciding which modelling approach, which hyperparameters, and which features to use depends on experimentation. Some AutoML platforms will try a number of different models with various parameters, which can be helpful for establishing a baseline. Even when done manually, exploring various options can give the model builder insight into model interpretability.
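
The idea of trying several models and parameter settings to establish a baseline can be sketched in a few lines. The example below is a hedged illustration, not an AutoML implementation: the two candidate "models" (a majority-class baseline and a one-feature k-nearest-neighbours classifier), the toy data, and the hand-made train/validation split are all assumptions chosen to keep the code self-contained.

```python
from collections import Counter


def majority_class(train, _params):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def knn(train, params):
    """k-nearest neighbours on a single numeric feature."""
    k = params["k"]

    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        return Counter(y for _, y in nearest).most_common(1)[0][0]

    return predict


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


# Hypothetical toy data: (feature, label) pairs with one noisy point at 2.6.
train = [
    (1.0, "low"), (1.5, "low"), (2.0, "low"), (2.2, "high"), (2.6, "low"),
    (3.0, "high"), (3.5, "high"), (4.0, "high"), (4.5, "high"),
]
valid = [(1.2, "low"), (1.9, "low"), (2.7, "high"), (3.8, "high")]

# Each candidate is (name, fit function, hyperparameters).
candidates = [("majority_class", majority_class, {})]
candidates += [("knn", knn, {"k": k}) for k in (1, 3, 5)]

results = []
for name, fit, params in candidates:
    model = fit(train, params)
    results.append(
        {"model": name, "params": params, "accuracy": accuracy(model, valid)}
    )

for row in sorted(results, key=lambda r: r["accuracy"], reverse=True):
    print(row)
```

Here the majority-class baseline scores 0.5 on the validation rows, `k=1` is thrown off by the noisy training point and scores 0.75, and `k=3` and `k=5` both score 1.0. With a validation set this small the numbers mean little on their own; the point is the loop structure, where every candidate is fitted and scored the same way so the results are comparable.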

Experiment Tracking

While there are advantages and tradeoffs among the various types of models, in general this phase involves a large number of experiments. There are a number of platforms that track these experiments, their modelling dependencies, and the stored models. These functions are broadly categorized as model management. Some platforms focus primarily on experiment tracking. Other companies that have training and/or serving components also offer model management components for comparing the performance of various models, tracking training/test datasets, tuning and optimizing hyperparameters, storing evaluation metrics, and enabling detailed lineage and version control. Similar to GitHub for software, these model management platforms should enable version control, historical lineage, and reproducibility.

A key tradeoff between these model management platforms is the cost of integration. Some lightweight platforms offer only experiment tracking, but they integrate easily with the current environment and can be imported into data science notebooks. Others require heavier integration work and ask model builders to move onto their platform so that model management is centralized.
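
To illustrate the lightweight end of that tradeoff, here is a minimal sketch of what an importable experiment tracker might record. It is a hypothetical example, not the API of any real platform: the `ExperimentLog` class, its method names, and the JSON Lines file format are assumptions. Each run stores its hyperparameters, its metrics, and a hash of the training data, so that results can later be compared and traced back to the data they came from.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


class ExperimentLog:
    """Append-only run log stored as JSON Lines (one run per line)."""

    def __init__(self, path):
        self.path = Path(path)

    def log_run(self, params, metrics, dataset_rows, code_version="unknown"):
        # Fingerprint the training data so a run can be tied to an exact dataset.
        payload = json.dumps(dataset_rows, sort_keys=True).encode("utf-8")
        record = {
            "timestamp": time.time(),
            "code_version": code_version,
            "dataset_sha256": hashlib.sha256(payload).hexdigest(),
            "params": params,
            "metrics": metrics,
        }
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
        return record

    def runs(self):
        """Load every logged run, oldest first."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]

    def best_run(self, metric):
        """Return the logged run with the highest value of `metric`."""
        return max(self.runs(), key=lambda run: run["metrics"][metric])


# Usage with hypothetical values, written to a throwaway temporary directory.
log = ExperimentLog(Path(tempfile.mkdtemp()) / "runs.jsonl")
data = [[1.0, "low"], [3.0, "high"]]
log.log_run({"model": "knn", "k": 1}, {"accuracy": 0.75}, data)
log.log_run({"model": "knn", "k": 3}, {"accuracy": 1.0}, data)

best = log.best_run("accuracy")
print(best["params"], best["dataset_sha256"][:12])
```

A file-based log like this is easy to drop into a notebook, but it leaves out most of what the fuller platforms provide, such as model artifact storage, access control, and a shared view across a team.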

ML Infrastructure Tools for Model Building