There are many different ways to train machine learning models; one of them is to run the training on a Kubernetes cluster. The advantage is that most companies already have a Kubernetes cluster at hand, and large resources or GPUs can be allocated easily and on demand. In general it makes sense to schedule your model training remotely, for example with Airflow, where you can use the KubernetesPodOperator. But if you are developing new features or general model improvements, or simply want to debug your model training under production conditions, it sometimes makes sense to run it off-schedule, for example by using Kubernetes Jobs.
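To make this concrete, here is a minimal sketch of how such an off-schedule training run could be submitted as a Kubernetes Job via the Kubernetes python client; the image name, command, arguments and resource requests are placeholder assumptions, not values from an actual project.

```python
from kubernetes import client, config

# Load credentials from ~/.kube/config (use load_incluster_config() when running inside the cluster)
config.load_kube_config()

# Container that runs the dockerized training entrypoint (image and args are hypothetical)
container = client.V1Container(
    name="model-training",
    image="my-registry/my-model:latest",
    command=["python", "train_model.py"],
    args=["--input-path", "s3://my-bucket/training-data"],
    resources=client.V1ResourceRequirements(requests={"cpu": "4", "memory": "8Gi"}),
)

# A Job runs the pod to completion instead of keeping it alive like a Deployment
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="train-model"),
    spec=client.V1JobSpec(
        backoff_limit=0,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

Setting restart_policy="Never" and backoff_limit=0 keeps Kubernetes from silently retrying a failed training run, which is usually what you want while debugging.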

A common pattern for data science projects is to dockerize (or containerize) your model: the model training can then be done by running the model container with an entrypoint like train_model.py, passing arguments such as the S3 folder paths for the input data or hyperparameters for the model. Ideally the Python part is also abstracted away behind a common command line interface with a train-model command (as suggested in one of my previous posts). This script then pulls the data from a data lake, trains the model and finally puts the trained model artefact into a model store (e.g. S3). Serving the model can be done using the same image again, by pulling the artefact into the container and spawning the endpoint on the cluster.
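As a rough illustration, such a command line interface could look like the sketch below, built with click; the option names, defaults and the placeholder body are assumptions, not the exact interface from the earlier post.

```python
# Hypothetical CLI sketch for the model container (option names are assumptions)
import click


@click.group()
def cli():
    """Entrypoint for all model-related commands inside the container."""


@cli.command("train-model")
@click.option("--input-path", required=True, help="S3 folder containing the training data")
@click.option("--model-path", required=True, help="S3 folder to upload the trained artefact to")
@click.option("--learning-rate", default=0.01, type=float, help="Example hyperparameter")
def train_model(input_path, model_path, learning_rate):
    """Pull data, train the model, and push the artefact to the model store."""
    # 1. pull the training data from the data lake (input_path)
    # 2. train the model with the given hyperparameters
    # 3. upload the trained artefact to the model store (model_path)
    click.echo(f"training model on {input_path} with lr={learning_rate}, writing to {model_path}")


if __name__ == "__main__":
    cli()
```

Inside the container the training could then be started with something like `python cli.py train-model --input-path s3://my-bucket/data --model-path s3://my-bucket/models` (paths are illustrative), which is exactly the kind of command a Kubernetes Job can run to completion.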

#machine-learning #data-science #kubernetes #cluster #model-training

Using Kubernetes Jobs and the Kubernetes python client to train your models