For the past few days, I’ve been exploring MLflow and its application to our machine learning pipeline.
MLflow has four main components: Tracking, Projects, Models, and the Model Registry. The core of MLflow is its Tracking component, which lets you log code versions, parameters, metrics, models, and other artefacts.
MLflow helps developers manage and reproduce experiments and models with their own choice of tools and platforms, whether that is an Apache Spark ML pipeline, a TensorFlow model, or a scikit-learn pipeline, deployed to their own instance, Amazon SageMaker, GCP, and so on.
However, the only way to really grasp MLflow’s features is to play with it. Setting it up locally is a good start, as it gives you a feel for the orchestration and resource provisioning involved once you decide to use it in production.
In this post, I am going to demonstrate how to make MLflow work with MinIO, since in most cases you will want to store your artefacts in an S3-compatible service like Amazon S3. We are going to set everything up locally using Docker: essentially, we train our (Spark ML) model in one container, while the MLflow server runs in another container and uses MinIO as its artefact store.
1. Create a Docker network so that the containers can talk to each other.
docker network create mlflow
2. Run MinIO with your desired access/secret keys. Here, we also create a bucket named ml-bucket by pre-creating a directory under the host path that MinIO will use as its data directory.
mkdir -p /buckets/ml-bucket
docker run --rm --net mlflow --name s3 \
-e "MINIO_ACCESS_KEY=xxxx" \
-e "MINIO_SECRET_KEY=xxxx" \
-p "9000:9000" \
-v "/buckets:/data:consistent" \
minio/minio:RELEASE.2020-07-27T18-37-02Z server /data
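Before moving on, it is worth checking that MinIO is actually reachable. One quick way (assuming the port mapping above) is to hit MinIO’s liveness endpoint from the host:
curl -f http://localhost:9000/minio/health/live
If MinIO is up, this returns HTTP 200; otherwise curl exits with an error.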
3. Start the MLflow server. In this step, we’re going to manually install MLflow in a Python container. The important part is to set the environment variable MLFLOW_S3_ENDPOINT_URL so that it points to your MinIO server.
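Since the exact command depends on your setup, here is a minimal sketch of what this step can look like. The container name mlflow-server, the python:3.8 image, and port 5000 are my own choices; the access/secret keys and the s3 host name have to match the MinIO container from step 2, and ml-bucket is the bucket we created earlier.
docker run --rm --net mlflow --name mlflow-server \
-e "AWS_ACCESS_KEY_ID=xxxx" \
-e "AWS_SECRET_ACCESS_KEY=xxxx" \
-e "MLFLOW_S3_ENDPOINT_URL=http://s3:9000" \
-p "5000:5000" \
python:3.8 \
bash -c "pip install mlflow boto3 && \
mlflow server --host 0.0.0.0 --port 5000 --default-artifact-root s3://ml-bucket/"
Here boto3 is needed for the S3 artefact store, AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are the MinIO credentials, and --default-artifact-root tells MLflow to put artefacts in our MinIO bucket. With no --backend-store-uri given, run metadata is kept in a local ./mlruns directory inside the container; point it at a database or a mounted volume if you want it to survive restarts. The MLflow UI should then be reachable at http://localhost:5000.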
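Finally, remember that the model training itself happens in yet another container on the same mlflow network. Because the MLflow client uploads artefacts to the artefact store directly, that container needs roughly the following environment to log runs to the MLflow server and push artefacts to MinIO (the values are again just placeholders matching the containers above):
export MLFLOW_TRACKING_URI=http://mlflow-server:5000
export MLFLOW_S3_ENDPOINT_URL=http://s3:9000
export AWS_ACCESS_KEY_ID=xxxx
export AWS_SECRET_ACCESS_KEY=xxxx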