Docker for Data Science — A Step by Step Guide

By the end of this post, you will have an ML workspace running on your machine via Docker, packed with the ML libraries you need, VSCode, Jupyter Lab + Hub, and a lot of other goodies.

A lot has already been said about why Docker can improve your life as a data scientist. I was working on an (un-)cool depth estimation project with a few friends when I stumbled upon this tweet by @jeremyphoward.

It so happens that we were using Docker to create our data science workspace for the project, so I thought it would make sense to address Jeremy’s questions and share this knowledge with the community.

I’ll very briefly review the core concepts and advantages of Docker, and then show a step-by-step example for setting up an entire data science workspace using Docker.

If you already know what Docker is and why it’s awesome, skip to the step-by-step tutorial.

What is Docker?

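Docker is a tool for creating and deploying isolated environments for running applications together with their dependencies. These environments behave much like lightweight virtual machines, although containers share the host's kernel rather than running a full guest operating system.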

A few terms you should be familiar with (including a baking analogy for ease of understanding):

  • Docker Container — A single, live, running instance of the application. In our analogy, this is a cookie.

  • Docker Image — A blueprint for creating containers. Images are immutable, and all containers created from the same image are exactly alike. In our analogy, this is the cookie-cutter mould.
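To make the cookie analogy concrete, here is a minimal sketch of how one image can produce several containers from the command line. The image name `python:3.11-slim`, the helper functions `bake_cookie` and `eat_cookie`, and the container names are all assumptions chosen for illustration. They are not part of the workspace built later in this post.

```shell
# Assumption: any public image would do; python:3.11-slim is only an example.
IMAGE="python:3.11-slim"

# Hypothetical helper: start one container (a "cookie") from the image
# (the "cookie cutter"). The container just sleeps so that it stays running.
bake_cookie() {
  # $1 = name to give the new container
  docker run --detach --name "$1" "$IMAGE" sleep 300
}

# Hypothetical helper: remove a container. The image it came from is untouched.
eat_cookie() {
  # $1 = name of the container to remove
  docker rm --force "$1"
}

# Example session (requires a running Docker daemon):
#   docker pull "$IMAGE"                   # fetch the blueprint once
#   bake_cookie cookie-1                   # first container
#   bake_cookie cookie-2                   # second container, same image
#   docker ps --filter "ancestor=$IMAGE"   # list containers started from the image
#   eat_cookie cookie-1
#   eat_cookie cookie-2
```

Both containers start from the same immutable image, so they begin life identical. Anything you change inside `cookie-1` stays in `cookie-1` and never touches the image or `cookie-2`.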


