How to use Docker containers for new Data Scientists

Docker is a tool for containerising code. There are a ton of Docker introductions out there, but I want to emphasise one important point: in 2020 you will want (or even need) to know Docker as a data scientist. Before we start with Docker, however, we need to talk about containers.

What is a container?

A container is a more intuitive concept than you’d think. Basically, it wraps up your code so that it has everything it needs to run, all in one neat little package. This is important because it makes code scalable and portable.

“A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.” – Docker.com

So, what do they mean by that? Say you’re building a (useless) application in Python that takes in text, labels it with part-of-speech (POS) tags and returns the result. You write all the code for your application, then commit it to GitHub. Your friend wants to use your application, so he clones your repository to his computer and tries to run it. Bam – error! He’s using Python 3.5 and you used 2.7. No big deal, he downloads 2.7 and tries to run it again. Bam – another error! He doesn’t have the spaCy library you used installed. He tries to install it, but again he receives an error because it’s not the same version of the library!

After a long, painstaking process he finally manages to get it working. However, this was neither efficient nor effective. All these problems could have been avoided if you’d both been running the code in the exact same environment. Since our application only contains Python, we could just use the virtualenv package for this, as sketched below. However, what happens if our application also has bash scripts? Some Java code? Maybe a local database or other services running on some port? At that point we need more.
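
For that pure-Python case, a sketch of the virtualenv route might look like this (the pinned spaCy version is just an example):

virtualenv venv                      # create an isolated Python environment
source venv/bin/activate             # activate it
pip install spacy==2.0.18            # pin the exact library version you used
pip freeze > requirements.txt        # your friend installs from this file

That solves the Python-level mismatch, but nothing outside Python – which is exactly the gap containers fill.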

Why use Containers?

Containers allow us to run all our code with all its bells, whistles, languages and services across many different hosts consistently. If your friend had container software installed (like Docker), he could easily run your application on his own computer without needing to weed through the variety of dependencies necessary for it.

Note the difference in infrastructure between using containers and VMs: containers share the host’s OS kernel, while each VM carries a full guest OS.

Unlike virtual machines, which install an entire operating system (OS) on your host machine, containers are lightweight and can be spun up with ease. Unlike language-specific virtual environments (like Python’s virtualenv), they also have far more functionality. You can have containers running Python, different Linux distributions, and Windows, along with a number of other software and services.

Containers become even more important in production, where you could be managing tons of instances of your code across large clusters of nodes. This can be an abstract concept for a new data scientist, but I can tell you from experience that companies are using Docker more and more for machine learning. Those running deep learning workflows in libraries like TensorFlow are strongly encouraged to do so in containers for reproducibility and portability.

So what is Docker?

As you have no doubt guessed, Docker is open source and enterprise software for containerising code. There are a couple of offerings out there, but Docker is definitely the most popular and therefore the most desired on a resume. Now that you understand the goal of containerising code, we can dive into some Docker-specific terms:

Docker Image – An image is a definition of a container. Docker images contain the plans for what software will be in the container, and what code it will execute when it runs. For example, the aforementioned application would use a base image of Python 2.7.

Dockerfile – This is the file in which we define our image. You can see an example further below, but it’s the file that contains the instructions on how to build your image. This includes things like the base image (Python, Ubuntu, etc…), other packages we might need (pip, Python libraries), and what code you want to execute when the container runs (my_app.py).

Docker Container – We know what a container does, but what a Docker container actually is in computer science terms is an instance of an image. We define an image in our Dockerfile, build that Dockerfile into an actual image, and then run the instructions from that image in a container. An image definition is like a class definition – we can create multiple containers from the same image just as we can have multiple instances of the same class.
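
To make the class/instance analogy concrete, here’s a sketch of what that looks like once an image is built (building and running are covered below; my_image_name is just a placeholder):

docker run my_image_name   # starts container #1 from the image
docker run my_image_name   # starts container #2 – a separate instance of the same image
docker ps -a               # lists both containers, each with its own ID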

The general, simplified trajectory of how to use Docker: define an image in a Dockerfile, build the Dockerfile into an image, then run that image as a container.

Getting Started with Docker
I see a lot of Docker tutorials out there – but they often focus on deploying an application, multiple services or something more complex. For demonstration purposes, and for those completely new to Docker and data science, I want to demonstrate the most minimal (and useless) Dockerfile, image and container that I can, using the example above. Let’s start.

Install Docker

You can install Docker from the Docker website, or use Homebrew. I prefer the latter for managing dependencies and such, but it’s up to you!

Let’s also start by creating a new directory for our work. I prefer to use the terminal whenever possible, and encourage beginners to do so to gain comfort with it. Make a new directory, and populate it with a file named Dockerfile, as shown below.
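
A minimal sketch of that setup in the terminal (pos_app is just a placeholder name for the directory):

mkdir pos_app
cd pos_app
touch Dockerfile        # we’ll also add app.py here in the next step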

Build a Quick App

Before we decide what we’ll need in our Docker image, let’s build our app! Like I said, it’ll be a simple app that takes in text and returns POS tags, using Python 2.7. Normally you might use a web framework like Flask for this, but because I want this to be as bare-bones as possible, we’re going to take terminal input. I named this file app.py.

import spacy

# load spaCy's small English model (installed via the Dockerfile below)
nlp = spacy.load('en_core_web_sm')

# raw_input and unicode are Python 2.7 built-ins
text = raw_input("Please enter a phrase: ")
doc = nlp(unicode(text))

# print each token alongside its part-of-speech tag
for token in doc:
    print(token, token.pos_)
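
As a quick sanity check outside Docker (assuming you have spaCy and its English model installed locally), a session looks roughly like this – the exact tags and formatting depend on your spaCy version, so treat the output as illustrative:

python app.py
Please enter a phrase: Docker ships code
(Docker, u'PROPN')
(ships, u'VERB')
(code, u'NOUN')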

Create your Dockerfile

Feel free to try out these lines of code outside Docker – but remember, the point is to containerise this code! We now need to create our image. Open up your Dockerfile and put the following lines in it:

# Use an official Python runtime as a parent image
FROM python:2.7

# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app
# Install the only library we really need for this
RUN pip install spacy
# Install spaCy's English model
RUN python -m spacy download en

# Run your application!
CMD ["python","app.py"]

It should be pretty intuitive, but I’ll run through it. First we specify a base image with the Docker FROM command. It’s good practice to keep these images as lightweight as possible. Normally I’d use the “slim” or “alpine” versions of Python, which are minimal, but we need the full Python image to work with spaCy.
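
For reference, the slim variant is just a different tag of the same official image – whether spaCy installs cleanly on it depends on which compilers and system libraries the image ships, so treat this as an option to test rather than a guarantee:

# A lighter-weight base image, at the cost of fewer preinstalled system packages
FROM python:2.7-slim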

After that, WORKDIR /app sets the working directory to /app (creating the directory if it doesn’t exist) – note that this directory lives inside our Docker container. The next command is key – COPY . /app copies everything from our current directory on the host (specified by the .) and pastes it into the /app directory inside our Docker image! So if you created a new directory with app.py and this Dockerfile in it, building the image will copy that Python code into your Docker image.

Lastly, we use the Docker RUN command to install the packages we need (these run at build time), and the CMD command to tell Docker we want to run our Python script when the container starts.

Build Your Image

Now that we’ve defined our image, we have to build it! First, open up a terminal and cd into the directory where you have your Dockerfile and app.py. Make sure Docker is in your $PATH by typing a simple command like docker --help, and confirm the daemon is running (docker info will tell you). Then run the command below.

docker build -t my_image_name .

The first part of the command is standard whenever you use Docker. build, unsurprisingly, builds the image. The -t flag allows us to name the image for convenience – here I chose the placeholder my_image_name, but it can be whatever you want. The . specifies that the Dockerfile in the current directory should be used.

When you run the command above, you’ll notice it pulls the Python image from the official Docker registry online, and then begins installing all of the things you told it to! It takes a little while to build the first time, but it’s worth the wait. You can check to make sure the image was built by running docker image ls.
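
The listing will look something like this – the IDs, timestamps and sizes below are illustrative, not real values:

docker image ls
REPOSITORY      TAG      IMAGE ID       CREATED          SIZE
my_image_name   latest   3f1c9a2b8d47   2 minutes ago    1.15GB
python          2.7      68e7be49c28c   3 weeks ago      902MB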

Run the Container

Now that our image is built, we’re all set to actually run our container! The command for this is simple.

docker run -it my_image_name

There ya go. The -i flag is for interactive – it keeps STDIN open so Docker waits for you to enter input – and -t allocates a terminal for the session. If you’re running a Flask application that doesn’t read from the terminal, you won’t need the -i.

Upon running this, you should be met with your input prompt, and can enter your text and have it spit back POS tags! If you want it to keep going instead of shutting down the container after one go, just wrap the input logic in a while True: loop, as sketched below. Of course, in production you’d want to wrap an API around this and expose it on a port – but we’ll cover that at a later date.
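
Here’s a minimal sketch of that looping version of app.py – still Python 2.7, and Ctrl-C stops both the loop and the container:

import spacy

nlp = spacy.load('en_core_web_sm')

# keep prompting for text until the user interrupts with Ctrl-C
while True:
    text = raw_input("Please enter a phrase: ")
    doc = nlp(unicode(text))
    for token in doc:
        print(token, token.pos_)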

Docker for Data Science

In this tutorial, spaCy did what little data science there was for us. However, Docker is equally important for managing, training and deploying your models. I always train frameworks like TensorFlow in a Docker image to avoid dependency and versioning issues. It’s actually recommended to use the TensorFlow Docker image for GPU support: https://www.tensorflow.org/install/docker.
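
A rough sketch of that workflow, using image tags from the TensorFlow docs linked above (check that page for current tags and GPU flags):

docker pull tensorflow/tensorflow           # pull the latest stable TensorFlow image
docker run -it tensorflow/tensorflow bash   # open a shell inside a TensorFlow container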

With that – we’re still only at the tip of the iceberg. Companies often use Kubernetes (K8s) to manage all the Docker containers running at scale across a number of clusters: https://kubernetes.io/. There is even a machine learning framework built for K8s called Kubeflow that is growing in popularity.

Docker’s role in data science is becoming more prevalent every day – especially as data science moves from your local machine into the cloud at scale. Data science is no longer completely isolated from software engineering, and the best data scientists of the next century will know not only statistics and mathematics but also scalable training and deployment.
