As a data scientist, I don’t have a lot of software engineering experience but I have certainly heard a lot of great comments about containers. I have heard about how lightweight they are compared to traditional VMs and how good they are at ensuring a safe consistent environment for your code.
However, when I tried to Dockerize my own model, I soon realized it is not that intuitive. It is not at all as simple as putting RUN in front of your EC2 bootstrap script. I found that inconsistencies and unpredictable behaviors happen quite a lot and it can be frustrating to learn to debug a new tool.
All of this motivated me to write this post, with all the code snippets you need to package your Python ML model into a Docker container. I will guide you through installing all the pip packages you need and building your first container image. In the second part of this post, we will set up the necessary AWS environment and kick off the container as a Batch job.
Disclaimer: The model I am talking about here is a batch job on a single instance, NOT a web service with API endpoints and NOT distributed parallel jobs. If you follow this tutorial, the whole process of putting your code into a container should not take more than 25 minutes.
Prerequisites
- an AWS account
- AWS CLI installed
- Docker installed, and username set up
- Python 3 installed
Step 1: Create a Dockerfile
To get your code into a container, you need to create a Dockerfile, which tells Docker what you need in your application.
FROM python:3.6-stretch
# MAINTAINER is deprecated; a maintainer label is the current equivalent
LABEL maintainer="Tina Bu <tina.hongbu@gmail.com>"
# install build utilities
RUN apt-get update && \
    apt-get install -y gcc make apt-transport-https ca-certificates build-essential
# check our python environment
RUN python3 --version
RUN pip3 --version
# set the working directory for containers
WORKDIR /usr/src/<app-name>
# Installing python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the source code from the project's src/ folder to /src/ in the image
COPY src/ /src/
RUN ls -la /src/*
# Running Python Application
CMD ["python3", "/src/main.py"]
minimal Dockerfile for a Python application
In the Dockerfile above, I started with the base Python 3.6 stretch image, ran apt-get update to refresh the package lists, installed some make and build tools, checked my Python and pip versions to make sure they are good, set up my working directory, copied requirements.txt to the container and pip installed all the libraries in it, and finally copied all the other code files to the container, listed them to make sure everything I need is there, and set my entrypoint main.py file to run when the container starts.
This Dockerfile should work for you if your code folder structure looks like this:
- app-name
  |-- src
      |-- main.py
      |-- other_module.py
  |-- requirements.txt
  |-- Dockerfile
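For reference, here is a minimal sketch of what the src/main.py entrypoint of a single-instance batch job could look like. The function names (load_rows, predict) and the toy "model" are purely hypothetical placeholders, not part of any real project; the only thing the Dockerfile assumes is that running python3 /src/main.py kicks off the whole job.

```python
"""src/main.py: hypothetical entrypoint for a single-instance batch job."""


def load_rows():
    # Placeholder input; a real job would read from disk or object storage.
    return [1.0, 2.0, 3.0]


def predict(rows):
    # Stand-in for the model: simply doubles each value.
    return [2 * x for x in rows]


def main():
    predictions = predict(load_rows())
    print(f"scored {len(predictions)} rows")
    return predictions


if __name__ == "__main__":
    # An uncaught exception makes Python exit with a non-zero code,
    # which is also the exit code reported for the container run.
    main()
```

Keeping everything behind a single main() makes the container's CMD trivial: one command, one process, one exit code.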
All you need to do is change <app-name> to your application name, and we are ready to build an image from it.
There are a lot of best practices for making a Dockerfile smaller and more efficient, but most of them are out of scope for this post. However, a few things you may want to be mindful of are:
People say that instead of starting with a generic Ubuntu image, you should use an official base image like Alpine Python. But I have found Alpine extremely difficult to work with, especially for installing packages (Docker experts, please do teach me in the comments below how it should be done properly, but while I am on my own here, I am not going to waste more time fixing NumPy installation errors). An Ubuntu base image will provide predictable behavior, but I suggest you start with Python 3.6 stretch, which is the official Python image based on Debian 9 (aka stretch). Python stretch comes with the Python environment and pip installed and up to date, all of which you would need to figure out how to install yourself if you chose Ubuntu.
It’s also very tempting to copy-paste some Dockerfile template, especially if this is your first Docker project. But it’s better to install only the things you actually need, to control the size of the image. If you see a whole bunch of make and build tools other people installed, try leaving them out first and see if your container still works. A smaller image generally means it’s faster to build and deploy. (Another reason you should try my minimalist template above!)
Also, to keep the image as lean as possible, use .dockerignore, which works much like .gitignore, to ignore files that won’t impact the model.
.git
.gitignore
README.md
LICENSE
Dockerfile*
docker-compose*
data/*
test/*
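To get a feel for what those patterns exclude, here is a rough sketch in Python. It is only an approximation: Docker uses its own matcher, while this uses fnmatch from the standard library plus a simple "parent directory is ignored" rule, and the example paths are hypothetical. Note that requirements.txt must not be ignored, since the Dockerfile copies it into the image.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# The patterns from the .dockerignore above.
PATTERNS = [
    ".git", ".gitignore", "README.md", "LICENSE",
    "Dockerfile*", "docker-compose*", "data/*", "test/*",
]


def is_ignored(path, patterns=PATTERNS):
    """Approximate check: is the path, or one of its parent folders, matched?"""
    p = PurePosixPath(path)
    # Check the path itself and every parent directory except the root ".".
    candidates = [str(p)] + [str(d) for d in p.parents if str(d) != "."]
    return any(fnmatchcase(c, pat) for c in candidates for pat in patterns)


for example in ["src/main.py", "requirements.txt", "data/train.csv", ".git/config"]:
    print(example, "->", "ignored" if is_ignored(example) else "sent to Docker")
```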
requirements.txt before code
In your Dockerfile, always add your requirements.txt file before copying the source code. That way, when you change your code and rebuild the container, Docker will reuse the cached layers up to and including the installed packages, instead of executing the pip install command on every build even if the required packages never changed. No one wants to wait an extra minute just because you added an empty line to your code.
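The caching argument can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not Docker's actual implementation: it assumes each layer's cache key depends on the previous layer's key plus either the instruction text or the content being copied, so a change invalidates that layer and everything after it. The instruction lists and file contents are made up for the example.

```python
import hashlib


def layer_keys(instructions, file_contents):
    """Return one cache key per instruction; each key chains the previous one."""
    keys, prev = [], ""
    for kind, arg in instructions:
        # COPY layers depend on the copied content, other layers on the command text.
        payload = file_contents[arg] if kind == "COPY" else arg
        prev = hashlib.sha256(f"{prev}|{kind}|{payload}".encode()).hexdigest()
        keys.append(prev)
    return keys


def first_rebuilt_layer(instructions, before, after):
    """Index of the first layer whose key changed, or None if all are cached."""
    old, new = layer_keys(instructions, before), layer_keys(instructions, after)
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    return None


PIP = ("RUN", "pip install --no-cache-dir -r requirements.txt")
REQS_FIRST = [("COPY", "requirements.txt"), PIP, ("COPY", "src/")]
CODE_FIRST = [("COPY", "src/"), ("COPY", "requirements.txt"), PIP]

before = {"requirements.txt": "numpy\n", "src/": "print('v1')"}
after = {"requirements.txt": "numpy\n", "src/": "print('v1')\n"}  # one empty line added

print(first_rebuilt_layer(REQS_FIRST, before, after))  # 2: pip install stays cached
print(first_rebuilt_layer(CODE_FIRST, before, after))  # 0: pip install runs again
```

With requirements.txt copied first, only the final COPY layer is rebuilt; with the code copied first, every layer after it, including the pip install, is rebuilt.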
If you are interested in learning more about the Dockerfile, the appendix has a quick summary of the few basic commands we used. Feel free to jump to Step 2 to build a container with the Dockerfile you just created.
Step 2: Build an image
docker build creates an image according to the instructions given in the Dockerfile. All you need to do is give your image a name.
docker build -t ${IMAGE_NAME}:${VERSION} .
Check that your image exists locally with:
docker images
You can also choose to tag your image with a human-friendly name instead of using the hash ID.
docker tag ${IMAGE_ID} ${IMAGE_NAME}:${TAG}
# or
docker tag ${IMAGE_NAME}:${VERSION} ${IMAGE_NAME}:${TAG}
Now you should test your container locally to make sure everything works fine.
docker run ${IMAGE_NAME}:${TAG}
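If you would rather script these steps than type them, here is one possible sketch in Python. The helper names and the example image name are hypothetical; the docker build, docker tag, and docker run commands are the same ones shown above. The helpers only assemble the command lines, and build_tag_run (which does require Docker) executes them in order and stops at the first failure.

```python
import subprocess


def build_cmd(image, version, context="."):
    # Same as: docker build -t ${IMAGE_NAME}:${VERSION} .
    return ["docker", "build", "-t", f"{image}:{version}", context]


def tag_cmd(image, version, tag):
    # Same as: docker tag ${IMAGE_NAME}:${VERSION} ${IMAGE_NAME}:${TAG}
    return ["docker", "tag", f"{image}:{version}", f"{image}:{tag}"]


def run_cmd(image, tag):
    # Same as: docker run ${IMAGE_NAME}:${TAG}
    return ["docker", "run", f"{image}:{tag}"]


def build_tag_run(image, version, tag):
    """Run the three steps in order; check=True raises on the first failure."""
    for cmd in (build_cmd(image, version), tag_cmd(image, version, tag), run_cmd(image, tag)):
        subprocess.run(cmd, check=True)


# Print the command line without running it (the image name is just an example).
print(" ".join(build_cmd("my-model", "0.1")))
```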
Congratulations! You just baked your model into a container that can be run anywhere Docker is installed. Join me for the second part of this post to learn how to schedule your container as a Batch job!
Appendix: Dockerfile commands
FROM starts the Dockerfile. The Dockerfile must start with a FROM instruction (only comments, parser directives, and ARG may come before it). Images are created in layers, which means you can use another image as the base image for your own. The FROM instruction defines your base layer. As its argument, it takes the name of the image. Optionally, you can add the Docker Hub username of the maintainer and the image version, in the format username/imagename:version.
RUN is used to build up the image you’re creating. For each RUN instruction, Docker runs the command and then creates a new layer of the image. This way you can easily roll back your image to previous states. The syntax is to place the full text of the shell command after RUN (e.g., RUN mkdir -p /usr/local/foo). By default, the command runs in a /bin/sh shell. You can use a different shell like this: RUN /bin/bash -c 'mkdir -p /usr/local/foo'
COPY copies local files into the image.
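CMD defines the command that will run in the container at start-up. Unlike RUN, it does not create a new layer for the image; it simply records the command to run when a container starts. Only one CMD takes effect per Dockerfile; if you list several, only the last one is used. If you need to run multiple commands, the best way to do that is to have the CMD run a script. In its exec form, CMD does not go through a shell, so you spell out the executable and each argument as a JSON array, unlike the shell form of RUN. So an example CMD command, taken from the Dockerfile above, would be:
CMD ["python3", "/src/main.py"]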
EXPOSE creates a hint for users of an image about which ports provide services. It is included in the information that can be retrieved via docker inspect <container-id>.
Note: The EXPOSE command does not actually make any ports accessible to the host! Instead, this requires publishing ports by means of the -p flag when using docker run.
docker push pushes your image to a private or cloud registry. Note that this is a Docker CLI command, not a Dockerfile instruction.
Thanks for reading!