Docker + Jupyter for Machine Learning

Docker + Jupyter for Machine Learning

Many data science applications require an isolated model training/development environment from your host environment. The lightweight solution for this would be to integrate Jupyter with Docker.

The best practice for setting up such a container is using a docker file, which I have written following the best practices in less than 1 minute. I hope this would help anyone engaging in data science applications with docker.
The project is structured as follows

├── Prject_folder           
│ ├── Dockerfile           #Primary imga building 
│ ├── docker-compose.yml   #Describing the files to mount etc     
│ ├── requirements.txt     #Required python packages for the image
Step 1:

Download the following 3 scripts to your project directory on your computer. The GitHub repo can be found here.

FROM "ubuntu:bionic"

MAINTAINER [email protected]

RUN useradd -ms /bin/bash docker
RUN su docker

ENV LOG_DIR_DOCKER="/root/dockerLogs"
ENV LOG_INSTALL_DOCKER="/root/dockerLogs/install-logs.log"

RUN mkdir -p ${LOG_DIR_DOCKER} \
 && touch ${LOG_INSTALL_DOCKER}  \
 && echo "Logs directory and file created"  | sed -e "s/^/$(date +%Y%m%d-%H%M%S) :  /" 2>&1 | tee -a ${LOG_INSTALL_DOCKER}

RUN apt-get update | sed -e "s/^/$(date +%Y%m%d-%H%M%S) :  /" 2>&1 | tee -a ${LOG_INSTALL_DOCKER} \
  && apt-get install -y python3-pip python3-dev | sed -e "s/^/$(date +%Y%m%d-%H%M%S) :  /" 2>&1 | tee -a ${LOG_INSTALL_DOCKER} \
  && ln -s /usr/bin/python3 /usr/local/bin/python | sed -e "s/^/$(date +%Y%m%d-%H%M%S) :  /" 2>&1 | tee -a ${LOG_INSTALL_DOCKER} \
  && pip3 install --upgrade pip | sed -e "s/^/$(date +%Y%m%d-%H%M%S) :  /" 2>&1 | tee -a ${LOG_INSTALL_DOCKER}

COPY requirements.txt /root/datascience/requirements.txt
WORKDIR /root/datascience
RUN pip3 install -r requirements.txt
RUN jupyter notebook --generate-config --allow-root
RUN echo "c.NotebookApp.password = u'sha1:6a3f528eec40:6e896b6e4828f525a6e20e5411cd1c8075d68619'" >> /root/.jupyter/

CMD ["jupyter", "notebook", "--allow-root", "--notebook-dir=.", "--ip=", "--port=8888", "--no-browser"]


  image: datascience
  container_name: datascience
  restart: always
     - TERM=xterm
  hostname: ''
     - "8888:8888"         #JupyterNB

     - /Users/chamalgomes/Documents/Python/GitHub/medium/JupyterWithDocker/test.txt:/root/datascience/test.txt


Update it as {absolutePATH_to_yourFile}/{fileNameame}:/root/datascience/{fileName} to mount you preferred files.



Update it to include the packages you require i.e matplotlib==1.0.0

Step 2:

Install Docker on your computer,

Step 3:
docker build -t datascience .
Step 4:

Update the docker-compose.yml file to mount files you require from your host to the container following the example I have given already with the test.txt file.

Step 5:
docker-compose up 

DONE Now Visit http://localhost:8888, The default password is set as “root”. Feel free to change it.


If you run into trouble run the following command to download the log file to your current directory. Post this as a response to this article and I will get back to with a solution as soon as possible.

docker cp <container-name>:/root/dockerLogs/install-logs.log .

If you want to start an interactive session inside the container type in the following command

docker exec -it <container name> /bin/bash

If you want delete the image completely from your computer

Step 1: Stop the container

docker container stop <container-id>

Step 2: Remove the container

docker container rm <container-id>

Step 3: Delete the image

docker image rm datascience

Thanks for reading !

Originally published by Chamal Gomes at

Kubernetes Fundamentals - Learn Kubernetes from the Beginning

Kubernetes Fundamentals - Learn Kubernetes from the Beginning

Kubernetes is about orchestrating containerized apps. Docker is great for your first few containers. As soon as you need to run on multiple machines and need to scale/up down and distribute the load and so on, you need an orchestrator - you need Kubernetes

Kubernetes is about orchestrating containerized apps. Docker is great for your first few containers. As soon as you need to run on multiple machines and need to scale/up down and distribute the load and so on, you need an orchestrator - you need Kubernetes

This is the first part of a series of articles on Kubernetes, cause this topic is BIG!.

  • Part I - From the beginning, Part I, Basics, Deployment and Minikube: In this part, we cover why Kubernetes, some history and some basic concepts like deploying, Nodes, Pods.
  • Part II - Introducing Services and Labeling: In this part, we deepen our knowledge of Pods and Nodes. We also introduce Services and labeling using labels to query our artifacts.
  • Part III - Scaling: Here we cover how to scale our app
  • Part IV - Auto scaling: In this part we look at how to set up auto-scaling so we can handle sudden large increases of incoming requests
Resources Kubernetes

So what do we know about Kubernetes?

It's an open-source system for automating deployment, scaling, and management of containerized applications

Let'start with the name. It's Greek for Helmsman, the person who steers the ship. Which is why the logo looks like this, a steering wheel on a boat:

It's Also called K8s so K ubernete s, 8 characters in the middle are removed. Now you can impress your friends that you know why it's referred to as K8.

Here is some more Jeopardy knowledge on its origin. Kubernetes was born out of systems called Borg and Omega. It was donated to CNCF, Cloud Native Computing Foundation in 2014. It's written in Go/Golang.

If we see past all this trivia knowledge, it was built by Google as a response to their own experience handling a ton of containers. It's also Open Source and battle-tested to handle really large systems, like planet-scale large systems.

So the sales pitch is:

Run billions of containers a week, Kubernetes can scale without increasing your ops team

Sounds amazing right, billions of containers cause we are all Google size. No? :) Well even if you have something like 10-100 containers, it's for you.

In this part I hope to cover the following:

  • Why Kubernetes and Orchestration in General
  • Hello world: Minikube basics, talking through Minikube, simple deploy example
  • Cluster and basic commands, Nodes,
  • Deployments, what it is and deploying an app
  • Pods and Nodes, explain concepts and troubleshooting
Part I - From the beginning, Part I, Basics, Deployment and Minikube Why Orchestration

Well, it all started with containers. Containers gave us the ability to create repeatable environments so dev, staging, and prod all looked and functioned the same way. We got predictability and they were also light-weight as they drew resources from the host operating system. Such a great breakthrough for Developers and Ops but the Container API is really only good for managing a few containers at a time. Larger systems might consist of 100s or 1000+ containers and needs to be managed as well so we can do things like scheduling, load balancing, distribution and more.

At this point, we need orchestration the ability for a system to handle all these container instances. This is where Kubernetes comes in.

Getting started

Ok ok, let's say I buy into all of this, how do I get started?

Impatient ey, sure let's start to do something practical with Minikube

Ok, sounds good I'm a coder, I like practical stuff. What is Minikube?

Minikube is a tool that lets us run Kubernetes locally

Oh, sweet, millions of containers on my little machine?

Well, no, let's start with a few and learn Kubernetes basics while at it.


To install Minikube lets go to this installation page

It's just a few short steps that means we install

  • a Hypervisor
  • Kubectl (Kube control tool)
  • Minikube


Get that thing up and running by typing:

minikube start

It should look something like this:

You can also ensure that kubectl have been correctly installed and running:

kubectl version

Should give you something like this in response:

Ok, now we are ready to learn Kubernetes.

Learning kubectl and basic concepts

In learning Kubernetes lets do so by learning more about kubectl a command line program that lets us interact with our Cluster and lets us deploy and manage applications on said Cluster.

The word Cluster just means a group of similar things but in the context of Kubernetes, it means a Master and multiple worker machines called Nodes. Nodes were historically called Minions

, but not so anymore.

The master decides what will run on the Nodes, which includes things like scheduled workloads or containerized apps. Which brings us to our next command:

kubectl get nodes

This should give us a result like this:

What this tells us what Nodes we have available to do work.

Next up let's try to run our first app on Kubernetes with the run command like so:

kubectl run kubernetes-first-app --port=8080

This should give us a response like so:

Next up lets check that everything is up and running with the command:

kubectl get deployments

This shows the following in the terminal:

In putting our app on the Kluster, by invoking the run command, Kubernetes performed a few things behind the scenes, it:

  • searched for a suitable node where an instance of the application could be run, there was only one node so it got chosen
  • scheduled the application to run on that Node
  • configured the cluster to reschedule the instance on a new Node when needed

Next up we are going to introduce the concept Pod, so what is a Pod?

A Pod is the smallest deployable unit and consists of one or many containers, for example, Docker containers. That's all we are going to say about Pods at the moment but if you really really want to know more have a read here

The reason for mentioning Pods at this point is that our container and app is placed inside of a Pod. Furthermore, Pods runs in a private isolated network that, although visible from other Pods and services, it cannot be accessed outside the network. Which means we can't reach our app with say a curl command.

We can change that though. There is more than one way to expose our application to the outside world for now however we will use a proxy.

Now open up a 2nd terminal window and type:

kubectl proxy

This will expose the kubectl as an API that we can query with HTTP request. The result should look like:

Instead of typing kubectl version we can now type curl http://localhost:8001/version and get the same results:

The API Server inside of Kubernetes have created an endpoint for each pod by its pod name. So the next step is to find out the pod name:

kubectl get pods

This will list all the pods you have, it should just be one pod at this point and look something like this:

Then you can just save that down to a variable like so:

Lastly, we can now do an HTTP call to learn more about our pod:

curl http://localhost:8001/api/v1/namespaces/default/pods/$POD_NAME

This will give us a long JSON response back (I trimmed it a bit but it goes on and on...)

Maybe that's not super interesting for us as app developers. We want to know how our app is doing. Best way to know that is looking at the logs. Let's do that with this command:

kubectl logs $POD_NAME

As you can see below we know get logs from our app:

Now that we know the Pods name we can do all sorts of things like checking its environment variables or even step inside the container and look at the content.

kubectl exec $POD_NAME env

This yields the following result:

Now lets step inside the container:

kubectl exec -ti $POD_NAME bash

We are inside! This means we can see what the source code looks like even:

cat server.js

Inside of our container, we can now reach the running app by typing:

curl http://localhost:8080

Summary Part I

This is where we will stop for now.
What did we actually learn?

  • Kubernetes, its origin what it is
  • Orchestration why you will soon need it
  • Concepts like Master, Nodes and Pods
  • Minikube, kubectl and how to deploy an image onto our Cluster
Part II - Introducing Services and Labeling

In this part we will cover the following:

  • Deepen our knowledge on Pods and Nodes
  • Introduce Services and labeling
  • Perform an exercise, involving setting labels on Pods and use labels to query our artifacts
Concepts revisited

When we create a Deployment on Kubernetes, that Deployment creates Pods with containers inside them. So Pods are tied to Nodes and will continue to exist until terminated or deleted. Let's try to educate ourselves a bit more on Pods, Nodes and let's also introduce a new topic Services.


Pods are the atomic unit on the Kubernetes platform, i.e smallest possible deployable unit

We've stated the above before but it's worth mentioning again.

What else is there to know?

A Pod is an abstraction that represents a group of one or more containers, for example, Docker or rkt, and some shared resources for those containers. Those resources include:

  • Shared storage, as Volumes
  • Networking, as a unique cluster IP address
  • Information about how to run each container, such as the container image version or specific ports to use

A Pod can have more than one container. If it does contain more than one container it is so the other containers can support the primary application.
Typical examples of helper applications are data pullers, data pushers, and proxies. You can read more on that use case here

  1. The containers in a Pod share an IP Address and port space and are:
  2. Always co-located
  3. Co-scheduled

Let me show you an image to make it easier to visualize:

As we can see above a Pod can have a lot of different artifacts in them that are able to communicate and support the app in some way.


A Pod always runs on a Node

So Node is the Pods parent?


A Node is a worker machine and may be either a virtual or a physical machine, depending on the cluster

Each Node is managed by the Master. A Node can have multiple pods.

So it's a one to many relationship

The Kubernetes master automatically handles scheduling the pods across the Nodes in the cluster

Every Kubernetes Node runs at least a:

  • Kubelet, is responsible for the pod spec and talks to the cri interface

  • Kube proxy, is the main interface for coms between nodes

  • A container runtime, (like Docker, rkt) responsible for pulling the container image from a registry, unpacking the container, and running the application.

Ok so a Node contains a Kubelet and container runtime and one to many Pods. I think I got it.

Let's show an image to make this info stick, cause it's quite important that we know what goes on, at least at a high level:


Pods are mortal, they can die. Pods, in fact, have a lifecycle.

When a worker node dies, the Pods running on the Node are also lost.

What happens to our apps? :(

You might think them and their data are lost but not so. The whole point with Kubernetes is to not let that happen. We normally deploy something like a ReplicaSet.

A ReplicaSet, what do you mean?

A ReplicaSet is a high-level artifact that can drive the cluster back to desired state via the creation of new Pods to keep your application running.

Ok so if a Pod goes down the ReplicaSet just creates a new Pod/s in its place?

Yes, exactly that. If you focus on defining a desired state the rest is up to Kubernetes.

Phew sounds really great then.

This concept of desired state is a very important one. You need to specify how many containers you want of each kind, at all times.

Oh so 4 database containers, 3 services etc?

Yes exactly.

So you don't have to care about the details just tell Kubernetes what state you want and it does the rest. If something goes up, Kubernetes ensures it comes back up again to desired state.

Each Pod in a Kubernetes cluster has a unique IP address, even Pods on the same Node, so there needs to be a way of automatically reconciling changes among Pods so that your applications continue to function.


Yea, think like this. If a Pod containing your app goes down and another Pod is created in its place, running your app. Users should still be able to use your app after that.

Ok I got it. Makes me think...

The motivation for a Service

You should never refer to a Pod by it's IP address, just think what happens when a Pod goes down and comes back up again but this time with a different IP. It is for that reason a Service exists.

A Service in Kubernetes is an abstraction which defines a logical set of Pods and a policy by which to access them.

Makes me think of a routers and subnets

Yea I guess you can say there is a resemblance in there somewhere.

Services enable a loose coupling between dependent Pods and are defined using YAML or JSON file, just like all Kubernetes objects.

That's handy, just JSON and YAML :)

Services and Labels

The set of Pods targeted by a Service is usually determined by a LabelSelector.

Although each Pod has a unique IP address, those IPs are not exposed outside the cluster without a Service. We can expose them through a proxy though as we showed in part I.

Wait, go back a second here, you said LabelSelector. I wasn't quite following?

Remember how we couldn't refer to Pods by IP, cause Pods might go down and a new Pod could come back in its place?


Well, labels are the answer to how Services and Pods are able to communicate. This is what we mean by loose coupling. By applying labels like for example frontend, backend, release and so on to Pods, we are able to refer to Pods by their logical name rather than their specifics, i.e IP number.

Oh I get it, so it's a high-level domain language

Mm, kind of.

Services and Traffic

Services allow your applications to receive traffic.

Services can be exposed in different ways by specifying a type in ServiceSpec, service specification.

  • ClusterIP (default) - Exposes the Service on an internal IP in the cluster. This type makes the Service only reachable from within the cluster.
  • NodePort - Exposes the Service on the same port of each selected Node in the cluster using NAT. Makes a Service accessible from outside the cluster using :. Superset of ClusterIP.
  • LoadBalancer - Creates an external load balancer in the current cloud (if supported) and assigns a fixed, external IP to the Service. Superset of NodePort.
  • ExternalName - Exposes the Service using an arbitrary name (specified by externalName in the spec) by returning a CNAME record with the name. No proxy is used. This type requires v1.7 or higher of kube-dns.

Ok I think I get it. Ensure I'm speaking externally to a Service instead of specific Pods. Depending on what I expose the Service as, that leads to different behavior?

Yea that's correct.

You said something about labels though, how do we create and apply them to Pods?

Yea lets talk about that next.


As we just mentioned, Services are the abstraction that allows pods to die and replicate in Kubernetes without impacting your application.

Now, Services match a set of Pods using labels and selectors, it allows us to operate on Pods like a group.

Labels are key/value pairs attached to objects and can be used in any number of ways:

  • Designate objects for development, test, and production
  • Embed version tags
  • Classify an object using tags

Labels can be attached to objects at creation time or later on. They can be modified at any time.

Lab - Fun with Labels and kubectl

It's a good idea to have read the first part of this series where we create a deployment. If you haven't you need to first create a deployment like so:

kubectl run kubernetes-first-app --port=8080

Now we should be good to go.

Ok. I know you are probably all tired from all theory by now.

I bet you are just itching to learn more hands on Kubernetes with kubectl.

Well, the time for that has come :). We will do two things:

  1. Create a Service and learn how we can expose our app using said Service
  2. Learn about Labeling and how we can improve our querying game by having appropriate labels on our artifacts.

Let's create a new service.

We will get acquainted with the expose command.

Let's check for existing pods,

kubectl get pods

Next let's see what services we have:

kubectl get services

Next lets create a Service like so:

kubectl expose deployment/kubernetes-first-app --type="NodePort" --port 8080

As you can see above we are just targeting one of our deployments kubernetes-first-app and referring to it with [type]/[deployment name] and type being deployment.

We expose it as service of type NodePort and finally, we choose to expose it at port 8080.

Now run kubectl get services again and see the results:

As you can see we now have two services in use, our basic kubernetes service and our newly created kubernetes-first-app.

Next up we need to grab the port of our service and assign that to a variable:

export NODE_PORT=$(kubectl get services/kubernetes-first-app -o go-template='{{(index .spec.ports 0).nodePort}}')

We now have a our port stored on environment variable NODE_PORT and we are ready to start communicating with our service like so:

curl $(minikube ip):$NODE_PORT

Which leads to the following output:

Creating and applying Labels

When we created our deployment and our Pod, it was automatically assigned with a label.

By typing

kubectl describe deployment

we can see the name of said label.

Next up we can query the pods by that same label

kubectl get pods -l run=kubernetes-first-app

Above we are using -l to query for a specific label and kubernetes-bootcamp as the name of the label. This gives us the following result:

You can do a similar query to your services:

kubectl get services -l run=kubernetes-first-app

That just shows that you can query on different levels, for specific Pods or Services that have Pods with that label.

Next up we will look at how to change the label

First let's get the name of the pod, like so:


Above I'm just assigning what my Pod is called to a variable POD_NAME. Check with a kubectl getpods what your Pod is called.

Then we can add/apply the new label like so:

kubectl label pod $POD_NAME app=v1

Verify that the new label have been set, like so:

kubectl describe pod


kubectl describe pods $POD_NAME

As you can see from the result our new label app=v1 has been appended to existing labels.

Now we can query like so:

kubectl get pods -l app=v1

That's pretty much how labeling works, how to get available labels, apply them and use them in a query. Ensure to give them a descriptive name like an app version, a certain environment or a name like frontend or backend, something that makes sense to your situation.

Clean up

Ok, so we created a service. We should learn how to clean up after ourselves. Run the following command to remove our service:

kubectl delete service -l run=kubernetes-bootcamp

Verify the service is no longer there with:

kubectl get services

also, ensure our exposed IP and port can no longer be reached:

curl $(minikube ip):$NODE_PORT

Just because the service is gone doesn't mean the app is gone. The app should still be reachable on:

kubectl exec -ti $POD_NAME curl localhost:8080

Summary Part II

So what did we learn? We learned a bit more on Pods and Nodes. Furthermore, we learned that we shouldn't speak directly to Pods but rather use a high-level abstraction such as Services. Services use labels as a way to define a domain language and apply those to different Pods.

Ok, so we understand a bit more on Kubernetes and how different concepts relate. We mentioned something called desired state a number of times but we didn't go into detail on how to set such a state. That's our next part in this series where we will cover how to set the desired state and how Kubernetes maintains it, so stay tuned.

Part III - Scaling

This third part aims to show how you scale your application. We can easily set the number of Replicas we want of a certain application and let Kubernetes figure out how to do that. This is us defining a so-called desired state.

When traffic increases, we will need to scale the application to keep up with user demand. We've talked about deployments and services, now lets talk scaling.

What does scaling mean in the context of Kubernetes?

We get more Pods. More Pods that are scheduled to nodes.

Now it's time to talk about desired state again, that we mentioned in previous parts.

This is where we relinquish control to Kubernetes. All we need to do is tell Kubernetes how many Pods we want and Kubernetes does the rest.

So we tell Kubernetes about the number of Pods we want, what does that mean? What does Kubernetes do for us?

It means we get multiple instances of our application. It also means traffic is being distributed to all of our Pods, ie. load balancing.

Furthermore, Kubernetes, or more specifically, services within Kubernetes will monitor which Pods are available and send traffic to those Pods.

Scaling demo Lab

If you haven't followed the first two parts I do recommend you go back and have a read. What you need for the following to work is at least a deployment. So if you haven't created one, here is how:

kubectl run kubernetes-first-app --port=8080

Let's have a look at our deployments:

kubectl get deployments

Let's look closer at the response we get:

We have three pieces of information that are important to us. First, we have the READY column in which we should read the value in the following way, CURRENT STATE/DESIRED STATE. Next up is the UP_TO_DATE column which shows the number of replicas that were updated to match the desired state.
Lastly, we have the AVAILABLE column that shows how many replicas we have available to do work.

Let's scale

Now, let's do some scaling. For that we will use the scale command like so:

kubectl scale deployments/kubernetes-first-app --replicas=4 

as we can see above the number of replicas was increased to 4 and kubernetes is thereby ready to load balance any incoming requests.

Let's have a look at our Pods next:

When we asked for 4 replicas we got 4 Pods.

We can see that this scaling operation took place by using the describe command, like so:

kubectl describe deployments/kubernetes-first-app

In the above image, we are given quite a lot of information on our Replicas for example, but there is some other information in there that we will explain later on.

Does it load balance?

The whole point with the scaling was so that we could balance the load on incoming requests. That means that not the same Pod would handle all the requests but that different Pods would be hit.
We can easily try this out, now that we have scaled our app to contain 4 replicas of itself.

So far we used the describe command to describe the deployment but we can use it to describe the IP and port of. Once we have the IP and port we can then hit it with different HTTP requests.

kubectl describe services/kubernetes-first-app

Especially look at the NodePort and the Endpoints. NodePort is the port value that we want to hit with an HTTP request.

Now we will actually invoke the cURL command and ensure that it hits a different port each time and thereby prove our load balancing is working. Let's do the following:


Next up the cURL call:

curl $(minikube ip):$NODE_PORT

As you can see above we are doing the call 4 times. Judging by the output and the name of the instance we see that we are hitting a different Pod for each request. Thereby we see that the load balancing is working.

Scaling down

So far we have scaled up. We managed to go from one Pod to 4 Pods thanks to the scale command. We can use the same command to scale down, like so:

kubectl scale deployments/kubernetes-first-app --replicas=2 

Now if we are really fast adding the next command we can see how the Pods are being removed as Kubernetes is trying to adjust to desired state.

2 out of 4 Pods are saying Terminating as only 2 Pods are needed to maintain the new desired state.

Running our command again we see that only 2 Pods remain and thereby our new desired state have been reached:

We can also look at our deployment to see that our scale instruction has been parsed correctly:


Self-healing is Kubernetes way of ensuring that the desired state is maintained. Pods don't self heal cause Pods can die. What happens is that a new Pod appears in its place, thanks to Kubernetes.

So how do we test this?

Glad you asked, we can delete a Pod and see what happens. So how do we do that? We use the delete command. We need to know the name of our Pod though so we need to call get pods for that. So let's start with that:

kubectl get pods

Then lets pick one of our two Pods kubernetes-first-app-669789f4f8-6glpx and assign it to a variable:


Now remove it:

kubectl delete pods $POD_NAME

Let's be quick about it and check our Pod status with get pods. It should say Terminating like so:

Wait some time and then echo out our variable $POD_NAME followed by get pods. That should give you a result similar to the below.

So what does the above image tell us? It tells us that the Pod we deleted is truly deleted but it also tells us that the desired state of two replicas has been achieved by spinning up a new Pod. What we are seeing is * self-healing* at work.

Different ways to scale

Ok, we looked at a way to scale by explicitly saying how many replicas we want of a certain deployment. Sometimes, however, we might want a different way to scale namely auto-scaling. Auto-scaling is about you not having to set the exact number of replicas you want but rather rely on Kubernetes to create the number of replicas it thinks it needs. So how would Kubernetes know that? Well, it can look at more than one thing but a common metric is CPU utilization. So let's say you have a booking site and suddenly someone releases Bruce Springsteen tickets you are likely to want to rely on auto-scaling, cause the next day when the tickets are all sold out you want the number of Pods to go back to normal and you wouldn't want to do this manually.

Auto-scaling is a topic I plan to cover more in detail in a future article so if you are really curious how that is done I recommend you have a look here

Summary Part III

Ok. So we did it. We managed to scale an app by creating replicas of it. It wasn't so hard to accomplish. We showed how we only needed to provide Kubernetes with a desired state and it would do its utmost to preserve said state, also called * self-healing*. Furthermore, we mentioned that there was another way to scale, namely auto-scaling but decided to leave that topic for another article. Hopefully, you are now more in awe of how amazing Kubernetes is and how easy it is to scale your app.

Part IV - Auto scaling

In this article, we will cover the following:

  • Why auto scaling, we will discuss different scenarios in which it makes sense to rely on auto scaling over defining it statically like we do with desired state
  • How, lets talk about Horizontal Auto Scaling the concept/feature that allows us to scale in an elastic way.
  • Lab - lets scale, we will look at how to actually set this up in kubectl and simulate a ton of incoming requests. We will then inspect the results and see that Kubernetes acts the way we think

So in our last part, we talked about desired state. That's an OK strategy until something unforeseen happens and suddenly you got a great influx of traffic. This is likely to happen to businesses such as e-commerce around a big sale or a ticket vendor when you release tickets to a popular event.

Events like these are an anomaly which forces you to quickly scale up. The other side of the coin though is that at some point you need to scale down or you suddenly have overcapacity you might need to pay for. What you really want is for the scaling to act in an elastic way so it scaled up when you need it to and scales down when there is less traffic.


Horizontal auto-scaling, what does it mean?

It's a concept in Kubernetes that can scale the number of Pods we need. It can do so on a replication controller, deployment or replica set. It usually looks at CPU utilization but can be made to look at other things by using something called custom metrics support, so it's customizable.

It consists of two parts a resource and a controller. The controller checks utilization, or whatever metric you decided, to ensure that the number of replicas matches your specification. If need be it spins up more Pods or removes them. The default is checking every 15 seconds but you can change that by looking at a flag called --horizontal-pod-autoscaler-sync-period.

The underlying algorithm that decides the number of replicas looks like this:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

Lab - lets scale

Ok, the first thing we need to do is to scale our deployment to use something other than desired state.

We have two things we need to specify when we do autoscaling:

  • min/max, we define a minimum and maximum in terms of how many Pods we want
  • CPU, in this version we set a certain CPU utilization percentage. When it goes above that it scales out as needed. Think of this one as an IF clause, if CPU value greater than the threshold, try to scale

Set up

Before we can attempt our scaling experiment we need to make sure we have the correct add-ons enabled. You can easily see what add-ons you have enabled by typing:

minikube addons list

If it looks like the above we are all good. Why am I saying that? Well, what we need, to be able to auto-scale, is that heapster and metrics-server add ons are enabled.

Heapster enables Container Cluster Monitoring and Performance Analysis.

Metrics server provide metrics via the resource metrics API. Horizontal Pod Autoscaler uses this API to collect metrics

We can easily enable them both with the following commands (we will need to for auto-scaling to show correct data):

minikube addons enable heapster


minikube addons enable metrics-server

We need to do one more thing, namely to enable Custom metrics, which we do by starting minikube with such a flag like so:

minikube start --extra-config kubelet.EnableCustomMetrics=true

Ok, now we are good to go.

Running the experiment

We need to do the following to run our experiment

  • Create a deployment
  • Apply autoscaling
  • Bombard the deployment with incoming requests
  • Watch the auto scaling how it changes

Create a deployment

kubectl run php-apache --requests=cpu=200m --expose --port=80

Above we are creating a deployment php-apache and expose it as a service on port 80. We can see that we are using the image

It should tell us the following:

service/php-apache created
deployment.apps/php-apache created


Next up we will use the command autoscale. We will use it like so:

kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10

It should say something like:

horizontalpodautoscaler.autoscaling/php-apache autoscaled

Above we are applying the auto-scaling on the deployment php-apache and as you can see we are applying both min-max and cpu based auto scaling which means we give a rule for how the auto scaling should happen:

If CPU load is >= 50% create a new Pod, but only maximum 10 Pods. If the load is low go back gradually to one Pod

Bombard with requests

Next step is to send a ton of requests against our deployment and see our auto-scaling doing its work. So how do we do that?

First off let's check the current status of our horizontal pod auto-scaler or hpa for short by typing:

kubectl get hpa

This should give us something like this:

The above shows us two pieces of information. The first is the TARGETS column which shows our CPU utilization, actual usage/trigger value. The next bit of interest is the column REPLICAS that shows us the number of copies, which is 1 at the moment.

For our next trick open up a separate terminal tab. We need to do is to set things up so we can send a ton of requests.

Next up we create a container using this command:

kubectl run -i --tty load-generator --image=busybox /bin/sh

This should take us to a prompt within the container. This is followed by:

while true; do wget -q -O- http://php-apache.default.svc.cluster.local; done

The command above should result in something looking like this.

This will just go on and on until you hit CTRL+C, but leave it be for now.

This throws a ton on requests in while true loop.

I thought while true loops were bad?

They are but we are only going to run it for a minute so that the auto scaling can happen. Yes, the CPU will sound a lot but don't worry :)

Let this go on for a minute or so, then enter the following command into the first terminal tab (not the one running the requests), like so:

kubectl get hpa

It should now show something like this:

As you can see from the above the column TARGETS looks different and now says 339%/50% which means the current load on the CPU and REPLICAS is 7 which means it has gone from 1 to 7 replicas. So as you can see we have been bombarding it pretty hard.

Now go to the second terminal and hit CTRL+C or you will have a situation like this:

It will actually take a few minutes for Kubernetes to cool off and get the values back to normal. A first look at the Pods situation shows us the following:

kubectl get pods

As you can see we have 7 Pods up and running still but let's wait a minute or two and it should look like this:

Ok, now we are back to normal.

Summary Part IV

Now we did some great stuff in this article. We managed to set up auto-scaling, bombard it with requests and without frying our CPU, hopefully ;)

We also managed to learn some new Kubernetes commands while at it and got to see auto-scaling at work giving us new Pods based on our specification.

Docker Best Practices for Node Developers

Docker Best Practices for Node Developers

Welcome to the "Docker Best Practices for Node Developers"! With your basic knowledge of Docker and Node.js in hand, Docker Mastery for Node.js is a course for anyone on the Node.js path. This course will help you master them together.

Welcome to the best course on the planet for using Docker with Node.js! With your basic knowledge of Docker and Node.js in hand, Docker Mastery for Node.js is a course for anyone on the Node.js path. This course will help you master them together.

My talk on all the best of Docker for Node.js developers and DevOps dealing with Node apps. From DockerCon 2019. Get the full 9-hour training course with my coupon at

Get the source code for this talk at

Some of the many cool things you'll do in this course
  • Build Node.js Images that auto-scan for security vulnerabilities
  • Use Docker's cutting-edge BuildKit with SSH Agents and NPM Caches for better image building
  • Use docker-compose with Visual Studio Code for full Node.js debug support
  • Use BuildKit and Multi-stage Builds to create minimal and flexible Dockerfiles
  • Build custom Node.js images using distro's like CentOS and Alpine
  • Test Docker init, tini, and Node.js as a PID 1 process in containers
  • Create Node.js apps that properly startup and respond to healthchecks
  • Develop ARM based Node.js apps with Docker Desktop, and deploy to AWS A1 Servers
  • Build graceful shutdown code into your apps for zero-downtime deploys
  • Dig into HTTP connections with orchestration, and how Proxies can help
  • Study examples of Docker Swarm and Kubernetes deployments for Node.js
  • Spend time Migrating traditional (legacy) Node.js apps into containers
  • Simplify your microservice solutions with advanced Docker Compose features
What you will learn in this course

You'll start with a quick review about getting set up with Docker, as well as Docker Compose basics. That way we're on the same page for the basics.

Then you'll jump into Node.js Dockerfile basics, that way you'll have a good Dockerfile foundation for new features we'll add throughout the course.

You'll be building on all the different things you learn from each Lecture in the course. Once you have the basics down of Compose, Dockerfile, and Docker Image, then you'll focus on nuances like how Docker and Linux control the Node process and how Docker changes that to make sure you know what options there are for starting up and shutting down Node.js and the right way to do it in different scenarios.

We'll cover advanced, newer features around making the Dockerfile the most efficient and flexible as possible using things like BuildKit and Multi-stage.

Then we'll talk about distributed computing and cloud design to ensure your Node.js apps have 12-factor design in your containers, as well as learning how to migrate old apps into this new way of doing things.

Next we cover Compose and its awesome features to get really efficient local development and test set-up using the Docker Compose command line and Docker Compose YAML file.

With all this knowledge, you'll progress to production concerns and making images production-ready.

Then we'll jump into deploying those containers and running them in production. Whether you use Docker Engine or orchestration with Kubernetes or Swarm, I've got you covered. In addition, we'll cover HTTP connections and reverse proxies for connection handling and routing with multi-container systems.

Lastly, you'll get a final, big assignment where you'll be building and deploying a large, complex solution, including multiple Node.js containers that are doing different things. You'll build Docker images, Dockerfiles, and compose files, and deploy them to a server to test. You'll need to check whether connections failover properly. You'll basically take everything you've learned and apply it in one big project!

A Complete Machine Learning Project Walk-Through in Python

A Complete Machine Learning Project Walk-Through in Python

A Complete Machine Learning Project Walk-Through in Python: Putting the machine learning pieces together; Model Selection, Hyperparameter Tuning, and Evaluation; Interpreting a machine learning model and presenting results

Reading through a data science book or taking a course, it can feel like you have the individual pieces, but don’t quite know how to put them together. Taking the next step and solving a complete machine learning problem can be daunting, but preserving and completing a first project will give you the confidence to tackle any data science problem. This series of articles will walk through a complete machine learning solution with a real-world dataset to let you see how all the pieces come together.

We’ll follow the general machine learning workflow step-by-step:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

Along the way, we’ll see how each step flows into the next and how to specifically implement each part in Python. The complete project is available on GitHub, with the first notebook here.

(As a note, this problem was originally given to me as an “assignment” for a job screen at a start-up. After completing the work, I was offered the job, but then the CTO of the company quit and they weren’t able to bring on any new employees. I guess that’s how things go on the start-up scene!)

Problem Definition

The first step before we get coding is to understand the problem we are trying to solve and the available data. In this project, we will work with publicly available building energy data from New York City.

The objective is to use the energy data to build a model that can predict the Energy Star Score of a building and interpret the results to find the factors which influence the score.

The data includes the Energy Star Score, which makes this a supervised regression machine learning task:

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

We want to develop a model that is both **accurate **— it can predict the Energy Star Score close to the true value — and interpretable — we can understand the model predictions. Once we know the goal, we can use it to guide our decisions as we dig into the data and build models.

Data Cleaning

Contrary to what most data science courses would have you believe, not every dataset is a perfectly curated group of observations with no missing values or anomalies (looking at you mtcars and iris datasets). Real-world data is messy which means we need to clean and wrangle it into an acceptable format before we can even start the analysis. Data cleaning is an un-glamorous, but necessary part of most actual data science problems.

First, we can load in the data as a Pandas DataFrame and take a look:

import pandas as pd
import numpy as np

# Read in data into a dataframe 
data = pd.read_csv('data/Energy_and_Water_Data_Disclosure_for_Local_Law_84_2017__Data_for_Calendar_Year_2016_.csv')

# Display top of dataframe

This is a subset of the full data which contains 60 columns. Already, we can see a couple issues: first, we know that we want to predict the ENERGY STAR Score but we don’t know what any of the columns mean. While this isn’t necessarily an issue — we can often make an accurate model without any knowledge of the variables — we want to focus on interpretability, and it might be important to understand at least some of the columns.

When I originally got the assignment from the start-up, I didn’t want to ask what all the column names meant, so I looked at the name of the file,

and decided to search for “Local Law 84”. That led me to this page which explains this is an NYC law requiring all buildings of a certain size to report their energy use. More searching brought me to all the definitions of the columns. Maybe looking at a file name is an obvious place to start, but for me this was a reminder to go slow so you don’t miss anything important!

We don’t need to study all of the columns, but we should at least understand the Energy Star Score, which is described as:

A 1-to-100 percentile ranking based on self-reported energy usage for the reporting year. The Energy Star score is a relative measure used for comparing the energy efficiency of buildings.

That clears up the first problem, but the second issue is that missing values are encoded as “Not Available”. This is a string in Python which means that even the columns with numbers will be stored as object datatypes because Pandas converts a column with any strings into a column of all strings. We can see the datatypes of the columns using the

# See the column data types and non-missing values

Sure enough, some of the columns that clearly contain numbers (such as ft²), are stored as objects. We can’t do numerical analysis on strings, so these will have to be converted to number (specifically float) data types!

Here’s a little Python code that replaces all the “Not Available” entries with not a number ( np.nan), which can be interpreted as numbers, and then converts the relevant columns to the float datatype:

# Replace all occurrences of Not Available with numpy not a number
data = data.replace({'Not Available': np.nan})

# Iterate through the columns
for col in list(data.columns):
    # Select columns that should be numeric
    if ('ft²' in col or 'kBtu' in col or 'Metric Tons CO2e' in col or 'kWh' in 
        col or 'therms' in col or 'gal' in col or 'Score' in col):
        # Convert the data type to float
data[col] = data[col].astype(float)

Once the correct columns are numbers, we can start to investigate the data.

Missing Data and Outliers

In addition to incorrect datatypes, another common problem when dealing with real-world data is missing values. These can arise for many reasons and have to be either filled in or removed before we train a machine learning model. First, let’s get a sense of how many missing values are in each column (see the notebook for code).

(To create this table, I used a function from this Stack Overflow Forum).

While we always want to be careful about removing information, if a column has a high percentage of missing values, then it probably will not be useful to our model. The threshold for removing columns should depend on the problem (here is a discussion), and for this project, we will remove any columns with more than 50% missing values.

At this point, we may also want to remove outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values. For this project, we will remove anomalies based on the definition of extreme outliers:

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

(For the code to remove the columns and the anomalies, see the notebook). At the end of the data cleaning and anomaly removal process, we are left with over 11,000 buildings and 49 features.

Exploratory Data Analysis

Now that the tedious — but necessary — step of data cleaning is complete, we can move on to exploring our data! Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data.

In short, the goal of EDA is to learn what our data can tell us. It generally starts out with a high level overview, then narrows in to specific areas as we find interesting parts of the data. The findings may be interesting in their own right, or they can be used to inform our modeling choices, such as by helping us decide which features to use.

Single Variable Plots

The goal is to predict the Energy Star Score (renamed to score in our data) so a reasonable place to start is examining the distribution of this variable. A histogram is a simple yet effective way to visualize the distribution of a single variable and is easy to make using matplotlib.

import matplotlib.pyplot as plt

# Histogram of the Energy Star Score'fivethirtyeight')
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k');
plt.xlabel('Score'); plt.ylabel('Number of Buildings'); 
plt.title('Energy Star Score Distribution');

This looks quite suspicious! The Energy Star score is a percentile rank, which means we would expect to see a uniform distribution, with each score assigned to the same number of buildings. However, a disproportionate number of buildings have either the highest, 100, or the lowest, 1, score (higher is better for the Energy Star score).

If we go back to the definition of the score, we see that it is based on “self-reported energy usage” which might explain the very high scores. Asking building owners to report their own energy usage is like asking students to report their own scores on a test! As a result, this probably is not the most objective measure of a building’s energy efficiency.

If we had an unlimited amount of time, we might want to investigate why so many buildings have very high and very low scores which we could by selecting these buildings and seeing what they have in common. However, our objective is only to predict the score and not to devise a better method of scoring buildings! We can make a note in our report that the scores have a suspect distribution, but our main focus in on predicting the score.

Looking for Relationships

A major part of EDA is searching for relationships between the features and the target. Variables that are correlated with the target are useful to a model because they can be used to predict the target. One way to examine the effect of a categorical variable (which takes on only a limited set of values) on the target is through a density plot using the seaborn library.

A density plot can be thought of as a smoothed histogram because it shows the distribution of a single variable. We can color a density plot by class to see how a categorical variable changes the distribution. The following code makes a density plot of the Energy Star Score colored by the the type of building (limited to building types with more than 100 data points):

# Create a list of buildings with more than 100 measurements
types = data.dropna(subset=['score'])
types = types['Largest Property Use Type'].value_counts()
types = list(types[types.values > 100].index)

# Plot of distribution of scores for building categories
figsize(12, 10)

# Plot each building
for b_type in types:
    # Select the building type
    subset = data[data['Largest Property Use Type'] == b_type]
    # Density plot of Energy Star scores
               label = b_type, shade = False, alpha = 0.8);
# label the plot
plt.xlabel('Energy Star Score', size = 20); plt.ylabel('Density', size = 20); 
plt.title('Density Plot of Energy Star Scores by Building Type', size = 28);

We can see that the building type has a significant impact on the Energy Star Score. Office buildings tend to have a higher score while Hotels have a lower score. This tells us that we should include the building type in our modeling because it does have an impact on the target. As a categorical variable, we will have to one-hot encode the building type.

A similar plot can be used to show the Energy Star Score by borough:

The borough does not seem to have as large of an impact on the score as the building type. Nonetheless, we might want to include it in our model because there are slight differences between the boroughs.

To quantify relationships between variables, we can use the Pearson Correlation Coefficient. This is a measure of the strength and direction of a linear relationship between two variables. A score of +1 is a perfectly linear positive relationship and a score of -1 is a perfectly negative linear relationship. Several values of the correlation coefficient are shown below:

While the correlation coefficient cannot capture non-linear relationships, it is a good way to start figuring out how variables are related. In Pandas, we can easily calculate the correlations between any columns in a dataframe:

# Find all correlations with the score and sort 
correlations_data = data.corr()['score'].sort_values()

The most negative (left) and positive (right) correlations with the target:

There are several strong negative correlations between the features and the target with the most negative the different categories of EUI (these measures vary slightly in how they are calculated). The EUI — Energy Use Intensity — is the amount of energy used by a building divided by the square footage of the buildings. It is meant to be a measure of the efficiency of a building with a lower score being better. Intuitively, these correlations make sense: as the EUI increases, the Energy Star Score tends to decrease.

Two-Variable Plots

To visualize relationships between two continuous variables, we use scatterplots. We can include additional information, such as a categorical variable, in the color of the points. For example, the following plot shows the Energy Star Score vs. Site EUI colored by the building type:

This plot lets us visualize what a correlation coefficient of -0.7 looks like. As the Site EUI decreases, the Energy Star Score increases, a relationship that holds steady across the building types.

The final exploratory plot we will make is known as the Pairs Plot. This is a great exploration tool because it lets us see relationships between multiple pairs of variables as well as distributions of single variables. Here we are using the seaborn visualization library and the PairGrid function to create a Pairs Plot with scatterplots on the upper triangle, histograms on the diagonal, and 2D kernel density plots and correlation coefficients on the lower triangle.

# Extract the columns to  plot
plot_data = features[['score', 'Site EUI (kBtu/ft²)', 
                      'Weather Normalized Source EUI (kBtu/ft²)', 
                      'log_Total GHG Emissions (Metric Tons CO2e)']]

# Replace the inf with nan
plot_data = plot_data.replace({np.inf: np.nan, -np.inf: np.nan})

# Rename columns 
plot_data = plot_data.rename(columns = {'Site EUI (kBtu/ft²)': 'Site EUI', 
                                        'Weather Normalized Source EUI (kBtu/ft²)': 'Weather Norm EUI',
                                        'log_Total GHG Emissions (Metric Tons CO2e)': 'log GHG Emissions'})

# Drop na values
plot_data = plot_data.dropna()

# Function to calculate correlation coefficient between two columns
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.2, .8), xycoords=ax.transAxes,
                size = 20)

# Create the pairgrid object
grid = sns.PairGrid(data = plot_data, size = 3)

# Upper is a scatter plot
grid.map_upper(plt.scatter, color = 'red', alpha = 0.6)

# Diagonal is a histogram
grid.map_diag(plt.hist, color = 'red', edgecolor = 'black')

# Bottom is correlation and density plot
grid.map_lower(sns.kdeplot, cmap =

# Title for entire plot
plt.suptitle('Pairs Plot of Energy Data', size = 36, y = 1.02);

To see interactions between variables, we look for where a row intersects with a column. For example, to see the correlation of Weather Norm EUI with score, we look in the Weather Norm EUI row and the score column and see a correlation coefficient of -0.67. In addition to looking cool, plots such as these can help us decide which variables to include in modeling.

Feature Engineering and Selection

Feature engineering and selection often provide the greatest return on time invested in a machine learning problem. First of all, let’s define what these two tasks are:

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

A machine learning model can only learn from the data we provide it, so ensuring that data includes all the relevant information for our task is crucial. If we don’t feed a model the correct data, then we are setting it up to fail and we should not expect it to learn!

For this project, we will take the following feature engineering steps:

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

One-hot encoding is necessary to include categorical variables in a model. A machine learning algorithm cannot understand a building type of “office”, so we have to record it as a 1 if the building is an office and a 0 otherwise.

Adding transformed features can help our model learn non-linear relationships within the data. Taking the square root, natural log, or various powers of features is common practice in data science and can be based on domain knowledge or what works best in practice. Here we will include the natural log of all numerical features.

The following code selects the numeric features, takes log transformations of these features, selects the two categorical features, one-hot encodes these features, and joins the two sets together. This seems like a lot of work, but it is relatively straightforward in Pandas!

# Copy the original data
features = data.copy()

# Select the numeric columns
numeric_subset = data.select_dtypes('number')

# Create columns with log of numeric columns
for col in numeric_subset.columns:
    # Skip the Energy Star Score column
    if col == 'score':
        numeric_subset['log_' + col] = np.log(numeric_subset[col])
# Select the categorical columns
categorical_subset = data[['Borough', 'Largest Property Use Type']]

# One hot encode
categorical_subset = pd.get_dummies(categorical_subset)

# Join the two dataframes using concat
# Make sure to use axis = 1 to perform a column bind
features = pd.concat([numeric_subset, categorical_subset], axis = 1)

After this process we have over 11,000 observations (buildings) with 110 columns (features). Not all of these features are likely to be useful for predicting the Energy Star Score, so now we will turn to feature selection to remove some of the variables.

Feature Selection

Many of the 110 features we have in our data are redundant because they are highly correlated with one another. For example, here is a plot of Site EUI vs Weather Normalized Site EUI which have a correlation coefficient of 0.997.

Features that are strongly correlated with each other are known as collinear and removing one of the variables in these pairs of features can often help a machine learning model generalize and be more interpretable. (I should point out we are talking about correlations of features with other features, not correlations with the target, which help our model!)

There are a number of methods to calculate collinearity between features, with one of the most common the variance inflation factor. In this project, we will use thebcorrelation coefficient to identify and remove collinear features. We will drop one of a pair of features if the correlation coefficient between them is greater than 0.6. For the implementation, take a look at the notebook (and this Stack Overflow answer)

While this value may seem arbitrary, I tried several different thresholds, and this choice yielded the best model. Machine learning is an empirical field and is often about experimenting and finding what performs best! After feature selection, we are left with 64 total features and 1 target.

# Remove any columns with all na values
features  = features.dropna(axis=1, how = 'all')

(11319, 65)

Establishing a Baseline

We have now completed data cleaning, exploratory data analysis, and feature engineering. The final step to take before getting started with modeling is establishing a naive baseline. This is essentially a guess against which we can compare our results. If the machine learning models do not beat this guess, then we might have to conclude that machine learning is not acceptable for the task or we might need to try a different approach.

For regression problems, a reasonable naive baseline is to guess the median value of the target on the training set for all the examples in the test set. This sets a relatively low bar for any model to surpass.

The metric we will use is mean absolute error (mae) which measures the average absolute error on the predictions. There are many metrics for regression, but I like Andrew Ng’s advice to pick a single metric and then stick to it when evaluating models. The mean absolute error is easy to calculate and is interpretable.

Before calculating the baseline, we need to split our data into a training and a testing set:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

We will use 70% of the data for training and 30% for testing:

# Split into 70% training and 30% testing set
X, X_test, y, y_test = train_test_split(features, targets, 
                                        test_size = 0.3, 
                                        random_state = 42)

Now we can calculate the naive baseline performance:

# Function to calculate mean absolute error
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

baseline_guess = np.median(y)

print('The baseline guess is a score of %0.2f' % baseline_guess)
print("Baseline Performance on the test set: MAE = %0.4f" % mae(y_test, baseline_guess))
The baseline guess is a score of 66.00
Baseline Performance on the test set: MAE = 24.5164

The naive estimate is off by about 25 points on the test set. The score ranges from 1–100, so this represents an error of 25%, quite a low bar to surpass!


In this article we walked through the first three steps of a machine learning problem. After defining the question, we:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

Finally, we also completed the crucial step of establishing a baseline against which we can judge our machine learning algorithms.

A Complete Machine Learning Walk-Through in Python (Part Two): Model Selection, Hyperparameter Tuning, and Evaluation

Model Evaluation and Selection

As a reminder, we are working on a supervised regression task: using New York City building energy data, we want to develop a model that can predict the Energy Star Score of a building. Our focus is on both accuracy of the predictions and interpretability of the model.

There are a ton of machine learning models to choose from and deciding where to start can be intimidating. While there are some charts that try to show you which algorithm to use, I prefer to just try out several and see which one works best! Machine learning is still a field driven primarily by empirical (experimental) rather than theoretical results, and it’s almost impossible to know ahead of time which model will do the best.

Generally, it’s a good idea to start out with simple, interpretable models such as linear regression, and if the performance is not adequate, move on to more complex, but usually more accurate methods. The following chart shows a (highly unscientific) version of the accuracy vs interpretability trade-off:

We will evaluate five different models covering the complexity spectrum:

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

In this post we will focus on implementing these methods rather than the theory behind them. For anyone interesting in learning the background, I highly recommend An Introduction to Statistical Learning (available free online) or Hands-On Machine Learning with Scikit-Learn and TensorFlow. Both of these textbooks do a great job of explaining the theory and showing how to effectively use the methods in R and Python respectively.

Imputing Missing Values

While we dropped the columns with more than 50% missing values when we cleaned the data, there are still quite a few missing observations. Machine learning models cannot deal with any absent values, so we have to fill them in, a process known as imputation.

First, we’ll read in all the data and remind ourselves what it looks like:

import pandas as pd
import numpy as np

# Read in data into dataframes 
train_features = pd.read_csv('data/training_features.csv')
test_features = pd.read_csv('data/testing_features.csv')
train_labels = pd.read_csv('data/training_labels.csv')
test_labels = pd.read_csv('data/testing_labels.csv')

Training Feature Size:  (6622, 64)
Testing Feature Size:   (2839, 64)
Training Labels Size:   (6622, 1)
Testing Labels Size:    (2839, 1)

Every value that is NaN represents a missing observation. While there are a number of ways to fill in missing data, we will use a relatively simple method, median imputation. This replaces all the missing values in a column with the median value of the column.

In the following code, we create a Scikit-Learn Imputer object with the strategy set to median. We then train this object on the training data (using and use it to fill in the missing values in both the training and testing data (using imputer.transform). This means missing values in the test data are filled in with the corresponding median value from the training data.

(We have to do imputation this way rather than training on all the data to avoid the problem of test data leakage, where information from the testing dataset spills over into the training data.)

# Create an imputer object with a median filling strategy
imputer = Imputer(strategy='median')

# Train on the training features

# Transform both training data and testing data
X = imputer.transform(train_features)
X_test = imputer.transform(test_features)

Missing values in training features:  0
Missing values in testing features:   0

All of the features now have real, finite values with no missing examples.

Feature Scaling

Scaling refers to the general process of changing the range of a feature. This is necessary because features are measured in different units, and therefore cover different ranges. Methods such as support vector machines and K-nearest neighbors that take into account distance measures between observations are significantly affected by the range of the features and scaling allows them to learn. While methods such as Linear Regression and Random Forest do not actually require feature scaling, it is still best practice to take this step when we are comparing multiple algorithms.

We will scale the features by putting each one in a range between 0 and 1. This is done by taking each value of a feature, subtracting the minimum value of the feature, and dividing by the maximum minus the minimum (the range). This specific version of scaling is often called normalization and the other main version is known as standardization.

While this process would be easy to implement by hand, we can do it using a MinMaxScaler object in Scikit-Learn. The code for this method is identical to that for imputation except with a scaler instead of imputer! Again, we make sure to train only using training data and then transform all the data.

# Create the scaler object with a range of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data

# Transform both the training and testing data
X = scaler.transform(X)
X_test = scaler.transform(X_test)

Every feature now has a minimum value of 0 and a maximum value of 1. Missing value imputation and feature scaling are two steps required in nearly any machine learning pipeline so it’s a good idea to understand how they work!

Implementing Machine Learning Models in Scikit-Learn

After all the work we spent cleaning and formatting the data, actually creating, training, and predicting with the models is relatively simple. We will use the Scikit-Learn library in Python, which has great documentation and a consistent model building syntax. Once you know how to make one model in Scikit-Learn, you can quickly implement a diverse range of algorithms.

We can illustrate one example of model creation, training (using .fit ) and testing (using .predict ) with the Gradient Boosting Regressor:

from sklearn.ensemble import GradientBoostingRegressor

# Create the model
gradient_boosted = GradientBoostingRegressor()

# Fit the model on the training data, y)

# Make predictions on the test data
predictions = gradient_boosted.predict(X_test)

# Evaluate the model
mae = np.mean(abs(predictions - y_test))

print('Gradient Boosted Performance on the test set: MAE = %0.4f' % mae)
Gradient Boosted Performance on the test set: MAE = 10.0132

Model creation, training, and testing are each one line! To build the other models, we use the same syntax, with the only change the name of the algorithm. The results are presented below:

To put these figures in perspective, the naive baseline calculated using the median value of the target was 24.5. Clearly, machine learning is applicable to our problem because of the significant improvement over the baseline!

The gradient boosted regressor (MAE = 10.013) slightly beats out the random forest (10.014 MAE). These results aren’t entirely fair because we are mostly using the default values for the hyperparameters. Especially in models such as the support vector machine, the performance is highly dependent on these settings. Nonetheless, from these results we will select the gradient boosted regressor for model optimization.

Hyperparameter Tuning for Model Optimization

In machine learning, after we have selected a model, we can optimize it for our problem by tuning the model hyperparameters.

First off, what are hyperparameters and how do they differ from parameters?

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

Controlling the hyperparameters affects the model performance by altering the balance between underfitting and overfitting in a model. Underfitting is when our model is not complex enough (it does not have enough degrees of freedom) to learn the mapping from features to target. An underfit model has high bias, which we can correct by making our model more complex.

Overfitting is when our model essentially memorizes the training data. An overfit model has high variance, which we can correct by limiting the complexity of the model through regularization. Both an underfit and an overfit model will not be able to generalize well to the testing data.

The problem with choosing the right hyperparameters is that the optimal set will be different for every machine learning problem! Therefore, the only way to find the best settings is to try out a number of them on each new dataset. Luckily, Scikit-Learn has a number of methods to allow us to efficiently evaluate hyperparameters. Moreover, projects such as TPOT by Epistasis Lab are trying to optimize the hyperparameter search using methods like genetic programming. In this project, we will stick to doing this with Scikit-Learn, but stayed tuned for more work on the auto-ML scene!

Random Search with Cross Validation

The particular hyperparameter tuning method we will implement is called random search with cross validation:

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

The idea of K-Fold cross validation with K = 5 is shown below:

The entire process of performing random search with cross validation is:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

Of course, we don’t do actually do this manually, but rather let Scikit-Learn’s RandomizedSearchCV handle all the work!

Slight Diversion: Gradient Boosted Methods

Since we will be using the Gradient Boosted Regression model, I should give at least a little background! This model is an ensemble method, meaning that it is built out of many weak learners, in this case individual decision trees. While a bagging algorithm such as random forest trains the weak learners in parallel and has them vote to make a prediction, a boosting method like Gradient Boosting, trains the learners in sequence, with each learner “concentrating” on the mistakes made by the previous ones.

Boosting methods have become popular in recent years and frequently win machine learning competitions. The Gradient Boosting Method is one particular implementation that uses Gradient Descent to minimize the cost function by sequentially training learners on the residuals of previous ones. The Scikit-Learn implementation of Gradient Boosting is generally regarded as less efficient than other libraries such as XGBoost , but it will work well enough for our small dataset and is quite accurate.

Back to Hyperparameter Tuning

There are many hyperparameters to tune in a Gradient Boosted Regressor and you can look at the Scikit-Learn documentation for the details. We will optimize the following hyperparameters:

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

I’m not sure if there is anyone who truly understands how all of these interact, and the only way to find the best combination is to try them out!

In the following code, we build a hyperparameter grid, create a RandomizedSearchCV object, and perform hyperparameter search using 4-fold cross validation over 25 different combinations of hyperparameters:

# Loss function to be optimized
loss = ['ls', 'lad', 'huber']

# Number of trees used in the boosting process
n_estimators = [100, 500, 900, 1100, 1500]

# Maximum depth of each tree
max_depth = [2, 3, 5, 10, 15]

# Minimum number of samples per leaf
min_samples_leaf = [1, 2, 4, 6, 8]

# Minimum number of samples to split a node
min_samples_split = [2, 4, 6, 10]

# Maximum number of features to consider for making splits
max_features = ['auto', 'sqrt', 'log2', None]

# Define the grid of hyperparameters to search
hyperparameter_grid = {'loss': loss,
                       'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}

# Create the model to use for hyperparameter tuning
model = GradientBoostingRegressor(random_state = 42)

# Set up the random search with 4-fold cross validation
random_cv = RandomizedSearchCV(estimator=model,
                               cv=4, n_iter=25, 
                               scoring = 'neg_mean_absolute_error',
                               n_jobs = -1, verbose = 1, 
                               return_train_score = True,

# Fit on the training data, y)

After performing the search, we can inspect the RandomizedSearchCV object to find the best model:

# Find the best combination of settings

GradientBoostingRegressor(loss='lad', max_depth=5,

We can then use these results to perform grid search by choosing parameters for our grid that are close to these optimal values. However, further tuning is unlikely to significant improve our model. As a general rule, proper feature engineering will have a much larger impact on model performance than even the most extensive hyperparameter tuning. It’s the law of diminishing returns applied to machine learning: feature engineering gets you most of the way there, and hyperparameter tuning generally only provides a small benefit.

One experiment we can try is to change the number of estimators (decision trees) while holding the rest of the hyperparameters steady. This directly lets us observe the effect of this particular setting. See the notebook for the implementation, but here are the results:

As the number of trees used by the model increases, both the training and the testing error decrease. However, the training error decreases much more rapidly than the testing error and we can see that our model is overfitting: it performs very well on the training data, but is not able to achieve that same performance on the testing set.

We always expect at least some decrease in performance on the testing set (after all, the model can see the true answers for the training set), but a significant gap indicates overfitting. We can address overfitting by getting more training data, or decreasing the complexity of our model through the hyerparameters. In this case, we will leave the hyperparameters where they are, but I encourage anyone to try and reduce the overfitting.

For the final model, we will use 800 estimators because that resulted in the lowest error in cross validation. Now, time to test out this model!

Evaluating on the Test Set

As responsible machine learning engineers, we made sure to not let our model see the test set at any point of training. Therefore, we can use the test set performance as an indicator of how well our model would perform when deployed in the real world.

Making predictions on the test set and calculating the performance is relatively straightforward. Here, we compare the performance of the default Gradient Boosted Regressor to the tuned model:

# Make predictions on the test set using default and final model
default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)

Default model performance on the test set: MAE = 10.0118.
Final model performance on the test set:   MAE = 9.0446.

Hyperparameter tuning improved the accuracy of the model by about 10%. Depending on the use case, 10% could be a massive improvement, but it came at a significant time investment!

We can also time how long it takes to train the two models using the %timeit magic command in Jupyter Notebooks. First is the default model:

%%timeit -n 1 -r 5, y)

1.09 s ± 153 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

1 second to train seems very reasonable. The final tuned model is not so fast:

%%timeit -n 1 -r 5, y)

12.1 s ± 1.33 s per loop (mean ± std. dev. of 5 runs, 1 loop each)

This demonstrates a fundamental aspect of machine learning: it is always a game of trade-offs. We constantly have to balance accuracy vs interpretability, bias vs variance, accuracy vs run time, and so on. The right blend will ultimately depend on the problem. In our case, a 12 times increase in run-time is large in relative terms, but in absolute terms it’s not that significant.

Once we have the final predictions, we can investigate them to see if they exhibit any noticeable skew. On the left is a density plot of the predicted and actual values, and on the right is a histogram of the residuals:

The model predictions seem to follow the distribution of the actual values although the peak in the density occurs closer to the median value (66) on the training set than to the true peak in density (which is near 100). The residuals are nearly normally distribution, although we see a few large negative values where the model predictions were far below the true values.


In this article we covered several steps in the machine learning workflow:

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

The results of this work showed us that machine learning is applicable to the task of predicting a building’s Energy Star Score using the available data. Using a gradient boosted regressor we were able to predict the scores on the test set to within 9.1 points of the true value. Moreover, we saw that hyperparameter tuning can increase the performance of a model at a significant cost in terms of time invested. This is one of many trade-offs we have to consider when developing a machine learning solution.

A Complete Machine Learning Walk-Through in Python (Part Three): Interpreting a machine learning model and presenting results

As a reminder, we are working through a supervised regression machine learning problem. Using New York City building energy data, we have developed a model which can predict the Energy Star Score of a building. The final model we built is a Gradient Boosted Regressor which is able to predict the Energy Star Score on the test data to within 9.1 points (on a 1–100 scale).

Model Interpretation

The gradient boosted regressor sits somewhere in the middle on the scale of model interpretability: the entire model is complex, but it is made up of hundreds of decision trees, which by themselves are quite understandable. We will look at three ways to understand how our model makes predictions:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

The first two methods are specific to ensembles of trees, while the third — as you might have guessed from the name — can be applied to any machine learning model. LIME is a relatively new package and represents an exciting step in the ongoing effort to explain machine learning predictions.

Feature Importances

Feature importances attempt to show the relevance of each feature to the task of predicting the target. The technical details of feature importances are complex (they measure the mean decrease impurity, or the reduction in error from including the feature), but we can use the relative values to compare which features are the most relevant. In Scikit-Learn, we can extract the feature importances from any ensemble of tree-based learners.

With model as our trained model, we can find the feature importances usingmodel.feature_importances_. Then we can put them into a pandas DataFrame and display or plot the top ten most important:

import pandas as pd

# model is the trained model
importances = model.feature_importances_

# train_features is the dataframe of training features
feature_list = list(train_features.columns)

# Extract the feature importances into a dataframe
feature_results = pd.DataFrame({'feature': feature_list, 
                                'importance': importances})

# Show the top 10 most important
feature_results = feature_results.sort_values('importance', 
                                              ascending = False).reset_index(drop=True)


The Site EUI(Energy Use Intensity) and the Weather Normalized Site Electricity Intensity are by far the most important features, accounting for over 66% of the total importance. After the top two features, the importance drops off significantly, which indicates we might not need to retain all 64 features in the data to achieve high performance. (In the Jupyter notebook, I take a look at using only the top 10 features and discover that the model is not quite as accurate.)

Based on these results, we can finally answer one of our initial questions: the most important indicators of a building’s Energy Star Score are the Site EUI and the Weather Normalized Site Electricity Intensity. While we do want to be careful about reading too much into the feature importances, they are a useful way to start to understand how the model makes its predictions.

Visualizing a Single Decision Tree

While the entire gradient boosting regressor may be difficult to understand, any one individual decision tree is quite intuitive. We can visualize any tree in the forest using the Scikit-Learn function [export_graphviz]([]( ""). We first extract a tree from the ensemble then save it as a dot file:

from sklearn import tree

# Extract a single tree (number 105)
single_tree = model.estimators_[105][0]

# Save the tree to a dot file
tree.export_graphviz(single_tree, out_file = 'images/', 
feature_names = feature_list)

Using the Graphviz visualization software we can convert the dot file to a png from the command line:

dot -Tpng images/ -o images/tree.png

The result is a complete decision tree:

This is a little overwhelming! Even though this tree only has a depth of 6 (the number of layers), it’s difficult to follow. We can modify the call to export_graphviz and limit our tree to a more reasonable depth of 2:

Each node (box) in the tree has four pieces of information:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

(Leaf nodes only have 2.–4. because they represent the final estimate and do not have any children).

A decision tree makes a prediction for a data point by starting at the top node, called the root, and working its way down through the tree. At each node, a yes-or-no question is asked of the data point. For example, the question for the node above is: Does the building have a Site EUI less than or equal to 68.95? If the answer is yes, the building is placed in the right child node, and if the answer is no, the building goes to the left child node.

This process is repeated at each layer of the tree until the data point is placed in a leaf node, at the bottom of the tree (the leaf nodes are cropped from the small tree image). The prediction for all the data points in a leaf node is the value. If there are multiple data points ( samples ) in a leaf node, they all get the same prediction. As the depth of the tree is increased, the error on the training set will decrease because there are more leaf nodes and the examples can be more finely divided. However, a tree that is too deep will overfit to the training data and will not be able to generalize to new testing data.

In the second article, we tuned a number of the model hyperparameters, which control aspects of each tree such as the maximum depth of the tree and the minimum number of samples required in a leaf node. These both have a significant impact on the balance of under vs over-fitting, and visualizing a single decision tree allows us to see how these settings work.

Although we cannot examine every tree in the model, looking at one lets us understand how each individual learner makes a prediction. This flowchart-based method seems much like how a human makes decisions, answering one question about a single value at a time. Decision-tree-based ensembles combine the predictions of many individual decision trees in order to create a more accurate model with less variance. Ensembles of trees tend to be very accurate, and also are intuitive to explain.

Local Interpretable Model-Agnostic Explanations (LIME)

The final tool we will explore for trying to understand how our model “thinks” is a new entry into the field of model explanations. LIME aims to explain a single prediction from any machine learning model by creating a approximation of the model locally near the data point using a simple model such as linear regression (the full details can be found in the paper ).

Here we will use LIME to examine a prediction the model gets completely wrong to see what it might tell us about why the model makes mistakes.

First we need to find the observation our model gets most wrong. We do this by training and predicting with the model and extracting the example on which the model has the greatest error:

from sklearn.ensemble import GradientBoostingRegressor

# Create the model with the best hyperparamters
model = GradientBoostingRegressor(loss='lad', max_depth=5, max_features=None,
                                  min_samples_leaf=6, min_samples_split=6, 
                                  n_estimators=800, random_state=42)

# Fit and test on the features, y)
model_pred = model.predict(X_test)

# Find the residuals
residuals = abs(model_pred - y_test)
# Extract the most wrong prediction
wrong = X_test[np.argmax(residuals), :]

print('Prediction: %0.4f' % np.argmax(residuals))
print('Actual Value: %0.4f' % y_test[np.argmax(residuals)])
Prediction: 12.8615
Actual Value: 100.0000

Next, we create the LIME explainer object passing it our training data, the mode, the training labels, and the names of the features in our data. Finally, we ask the explainer object to explain the wrong prediction, passing it the observation and the prediction function.

import lime 

# Create a lime explainer object
explainer = lime.lime_tabular.LimeTabularExplainer(training_data = X, 
                                                   mode = 'regression',
                                                   training_labels = y,
                                                   feature_names = feature_list)

# Explanation for wrong prediction
exp = explainer.explain_instance(data_row = wrong, 
                                 predict_fn = model.predict)

# Plot the prediction explaination

The plot explaining this prediction is below:

Here’s how to interpret the plot: Each entry on the y-axis indicates one value of a variable and the red and green bars show the effect this value has on the prediction. For example, the top entry says the Site EUI is greater than 95.90 which subtracts about 40 points from the prediction. The second entry says the Weather Normalized Site Electricity Intensity is less than 3.80 which adds about 10 points to the prediction. The final prediction is an intercept term plus the sum of each of these individual contributions.

We can get another look at the same information by calling the explainer .show_in_notebook() method:

# Show the explanation in the Jupyter Notebook

This shows the reasoning process of the model on the left by displaying the contributions of each variable to the prediction. The table on the right shows the actual values of the variables for the data point.

For this example, the model prediction was about 12 and the actual value was 100! While initially this prediction may be puzzling, looking at the explanation, we can see this was not an extreme guess, but a reasonable estimate given the values for the data point. The Site EUI was relatively high and we would expect the Energy Star Score to be low (because EUI is strongly negatively correlated with the score), a conclusion shared by our model. In this case, the logic was faulty because the building had a perfect score of 100.

It can be frustrating when a model is wrong, but explanations such as these help us to understand why the model is incorrect. Moreover, based on the explanation, we might want to investigate why the building has a perfect score despite such a high Site EUI. Perhaps we can learn something new about the problem that would have escaped us without investigating the model. Tools such as this are not perfect, but they go a long way towards helping us understand the model which in turn can allow us to make better decisions.

Documenting Work and Reporting Results

An often under-looked part of any technical project is documentation and reporting. We can do the best analysis in the world, but if we do not clearly communicate the results, then they will not have any impact!

When we document a data science project, we take all the versions of the data and code and package it so it our project can be reproduced or built on by other data scientists. It’s important to remember that code is read more often than it is written, and we want to make sure our work is understandable both for others and for ourselves if we come back to it a few months later. This means putting in helpful comments in the code and explaining your reasoning. I find Jupyter Notebooks to be a great tool for documentation because they allow for explanations and code one after the other.

Jupyter Notebooks can also be a good platform for communicating findings to others. Using notebook extensions, we can hide the code from our final report , because although it’s hard to believe, not everyone wants to see a bunch of Python code in a document!

Personally, I struggle with succinctly summarizing my work because I like to go through all the details. However, it’s important to understand your audience when you are presenting and tailor the message accordingly. With that in mind, here is my 30-second takeaway from the project:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

Originally, I was given this project as a job-screening “assignment” by a start-up. For the final report, they wanted to see both my work and my conclusions, so I developed a Jupyter Notebook to turn in. However, instead of converting directly to PDF in Jupyter, I converted it to a Latex .tex file that I then edited in texStudio before rendering to a PDF for the final version. The default PDF output from Jupyter has a decent appearance, but it can be significantly improved with a few minutes of editing. Moreover, Latex is a powerful document preparation system and it’s good to know the basics.

At the end of the day, our work is only as valuable as the decisions it enables, and being able to present results is a crucial skill. Furthermore, by properly documenting work, we allow others to reproduce our results, give us feedback so we can become better data scientists, and build on our work for the future.


Throughout this series of posts, we’ve walked through a complete end-to-end machine learning project. We started by cleaning the data, moved into model building, and finally looked at how to interpret a machine learning model. As a reminder, the general structure of a machine learning project is below:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the testing set
  7. Interpret the model results
  8. Draw conclusions and document work

While the exact steps vary by project, and machine learning is often an iterative rather than linear process, this guide should serve you well as you tackle future machine learning projects. I hope this series has given you confidence to be able to implement your own machine learning solutions, but remember, none of us do this by ourselves! If you want any help, there are many incredibly supportive communities where you can look for advice.

A few resources I have found helpful throughout my learning process:

  • Supervised: we have access to both the features and the target and our goal is to train a model that can learn a mapping between the two
  • Regression: The Energy Star score is a continuous variable

*Originally published at *

Thanks for reading

If you liked this post, share it with all of your programming buddies!

Follow us on Facebook | Twitter

Further reading

Machine Learning A-Z™: Hands-On Python & R In Data Science

Python for Data Science and Machine Learning Bootcamp

Machine Learning, Data Science and Deep Learning with Python

Deep Learning A-Z™: Hands-On Artificial Neural Networks

Artificial Intelligence A-Z™: Learn How To Build An AI

A Complete Machine Learning Project Walk-Through in Python

Machine Learning: how to go from Zero to Hero

Top 18 Machine Learning Platforms For Developers

10 Amazing Articles On Python Programming And Machine Learning

100+ Basic Machine Learning Interview Questions and Answers

How to using Docker Compose for NodeJS Development

How to using Docker Compose for NodeJS Development

Docker is an amazing tool for developers. It allows us to build and replicate images on any host, removing the inconsistencies of dev environments and reducing onboarding timelines considerably.

Docker is an amazing tool for developers. It allows us to build and replicate images on any host, removing the inconsistencies of dev environments and reducing onboarding timelines considerably.

To provide an example of how you might move to containerized development, I built a simple todo API using NodeJS, Express, and PostgreSQL using Docker Compose for development, testing, and eventually in my CI/CD pipeline.

In a two-part series, I will cover the development and pipeline creation steps. In this post, I will cover the first part: developing and testing with Docker Compose.

Requirements for This Tutorial

This tutorial requires you to have a few items before you can get started.

The todo app here is essentially a stand-in, and you could replace it with your own application. Some of the setup here is specific for this application, and the needs of your application may not be covered, but it should be a good starting point for you to get the concepts needed to Dockerize your own applications.

Once you have everything set up, you can move on to the next section.

Creating the Dockerfile

At the foundation of any Dockerized application, you will find a Dockerfile. The Dockerfile contains all of the instructions used to build out the application image. You can set this up by installing NodeJS and all of its dependencies; however the Docker ecosystem has an image repository (the Docker Store) with a NodeJS image already created and ready to use.

In the root directory of the application, create a new Dockerfile

/> touch Dockerfile

Open the newly created Dockerfile in your favorite editor. The first instruction, FROM, will tell Docker to use the prebuilt NodeJS image. There are several choices, but this project uses the node:7.7.2-alpine image.

FROM node:7.7.2-alpine

If you run docker build ., you will see something similar to the following:

Sending build context to Docker daemon 249.3 kB
Step 1/1 : FROM node:7.7.2-alpine
7.7.2-alpine: Pulling from library/node
709515475419: Pull complete
1a7746e437f7: Pull complete
662ac7b95f9d: Pull complete
Digest: sha256:6dcd183eaf2852dd8c1079642c04cc2d1f777e4b34f2a534cc0ad328a98d7f73
Status: Downloaded newer image for node:7.7.2-alpine
 ---> 95b4a6de40c3
Successfully built 95b4a6de40c3

With only one instruction in the Dockerfile, this doesn’t do too much, but it does show you the build process without too much happening. At this point, you now have an image created, and running docker images will show you the images you have available:

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
node                7.7.2-alpine        95b4a6de40c3        6 weeks ago         59.2 MB

The Dockerfile needs more instructions to build out the application. Currently it’s only creating an image with NodeJS installed, but we still need our application code to run inside the container. Let’s add some more instructions to do this and build this image again.

This particular Docker file uses RUN, COPY, and WORKDIR. You can read more about those on Docker’s reference page to get a deeper understanding.

Let’s add the instructions to the Dockerfile now:

FROM node:7.7.2-alpine

WORKDIR /usr/app

COPY package.json .
RUN npm install --quiet

COPY . .

Here is what is happening:

  • Set the working directory to /usr/app
  • Copy the package.json file to /usr/app
  • Install node_modules
  • Copy all the files from the project’s root to /usr/app

You can now run docker build . again and see the results:

Sending build context to Docker daemon 249.3 kB
Step 1/5 : FROM node:7.7.2-alpine
  ---> 95b4a6de40c3
Step 2/5 : WORKDIR /usr/app
 ---> e215b737ca38
Removing intermediate container 3b0bb16a8721
Step 3/5 : COPY package.json .
 ---> 930082a35f18
Removing intermediate container ac3ab0693f61
Step 4/5 : RUN npm install --quiet
 ---> Running in 46a7dcbba114


 ---> 525f662aeacf
 ---> dd46e9316b4d
Removing intermediate container 46a7dcbba114
Step 5/5 : COPY . .
 ---> 1493455bcf6b
Removing intermediate container 6d75df0498f9
Successfully built 1493455bcf6b

You have now successfully created the application image using Docker. Currently, however, our app won’t do much since we still need a database, and we want to connect everything together. This is where Docker Compose will help us out.

Docker Compose Services

Now that you know how to create an image with a Dockerfile, let’s create an application as a service and connect it to a database. Then we can run some setup commands and be on our way to creating that new todo list.

Create the file docker-compose.yml:

/> touch docker-compose.yml

The Docker Compose file will define and run the containers based on a configuration file. We are using compose file version 2 syntax, and you can read up on it on Docker’s site.

An important concept to understand is that Docker Compose spans “buildtime” and “runtime.” Up until now, we have been building images using docker build ., which is “buildtime.” This is when our containers are actually built. We can think of “runtime” as what happens once our containers are built and being used.

Compose triggers “buildtime” — instructing our images and containers to build — but it also populates data used at “runtime,” such as env vars and volumes. This is important to be clear on. For instance, when we add things like volumes and command, they will override the same things that may have been set up via the Dockerfile at “buildtime.”

Open your docker-compose.yml file in your editor and copy/paste the following lines:

version: '2'
    build: .
    command: npm run dev
      - .:/usr/app/
      - /usr/app/node_modules
      - "3000:3000"
      - postgres
      DATABASE_URL: postgres://[email protected]/todos
    image: postgres:9.6.2-alpine
      POSTGRES_USER: todoapp
      POSTGRES_DB: todos

This will take a bit to unpack, but let’s break it down by service.

The web service

The first directive in the web service is to build the image based on our Dockerfile. This will recreate the image we used before, but it will now be named according to the project we are in, nodejsexpresstodoapp. After that, we are giving the service some specific instructions on how it should operate:

  • command: npm run dev – Once the image is built, and the container is running, the npm run dev command will start the application.

  • volumes: – This section will mount paths between the host and the container.

  • .:/usr/app/ – This will mount the root directory to our working directory in the container.

- /usr/app/node_modules – This will mount the node_modules directory to the host machine using the buildtime directory.

  • environment: – The application itself expects the environment variable DATABASE_URL to run. This is set in db.js.

  • ports: – This will publish the container’s port, in this case 3000, to the host as port 3000.

The DATABASE_URL is the connection string. postgres://[email protected]/todos connects using the todoapp user, on the host postgres, using the database todos.

The Postgres service

Like the NodeJS image we used, the Docker Store has a prebuilt image for PostgreSQL. Instead of using a build directive, we can use the name of the image, and Docker will grab that image for us and use it. In this case, we are using postgres:9.6.2-alpine. We could leave it like that, but it hasenvironmentvariables to let us customize it a bit.

environment: – This particular image accepts a couple environment variables so we can customize things to our needs. POSTGRES_USER: todoapp – This creates the user todoapp as the default user for PostgreSQL. POSTGRES_DB: todos – This will create the default database as todos.

Running The Application

Now that we have our services defined, we can build the application using docker-compose up. This will show the images being built and eventually starting. After the initial build, you will see the names of the containers being created:

Pulling postgres (postgres:9.6.2-alpine)...
9.6.2-alpine: Pulling from library/postgres
627beaf3eaaf: Pull complete
e351d01eba53: Pull complete
cbc11f1629f1: Pull complete
2931b310bc1e: Pull complete
2996796a1321: Pull complete
ebdf8bbd1a35: Pull complete
47255f8e1bca: Pull complete
4945582dcf7d: Pull complete
92139846ff88: Pull complete
Digest: sha256:7f3a59bc91a4c80c9a3ff0430ec012f7ce82f906ab0a2d7176fcbbf24ea9f893
Status: Downloaded newer image for postgres:9.6.2-alpine
Building web
Creating nodejsexpresstodoapp_postgres_1
Creating nodejsexpresstodoapp_web_1
web_1       | Your app is running on port 3000

At this point, the application is running, and you will see log output in the console. You can also run the services as a background process, using docker-compose up -d. During development, I prefer to run without -d and create a second terminal window to run other commands. If you want to run it as a background process and view the logs, you can run docker-compose logs.

At a new command prompt, you can run docker-compose ps to view your running containers. You should see something like the following:

            Name                            Command              State           Ports
nodejsexpresstodoapp_postgres_1 postgres   Up      5432/tcp
nodejsexpresstodoapp_web_1        npm run dev                     Up>3000/tcp

This will tell you the name of the services, the command used to start it, its current state, and the ports. Notice nodejsexpresstodoapp_web_1 has listed the port as>3000/tcp. This tells us that you can access the application using localhost:3000/todos on the host machine.

/> curl localhost:3000/todos


The package.json file has a script to automatically build the code and migrate the schema to PostgreSQL. The schema and all of the data in the container will persist as long as the postgres:9.6.2-alpine image is not removed.

Eventually, however, it would be good to check how your app will build with a clean setup. You can run docker-compose down, which will clear things that are built and let you see what is happening with a fresh start.

Feel free to check out the source code, play around a bit, and see how things go for you.

Testing the Application

The application itself includes some integration tests built using jest. There are various ways to go about testing, including creating something like Dockerfile.test and docker-compose.test.ymlfiles specific for the test environment. That’s a bit beyond the current scope of this article, but I want to show you how to run the tests using the current setup.

The current containers are running using the project name nodejsexpresstodoapp. This is a default from the directory name. If we attempt to run commands, it will use the same project, and containers will restart. This is what we don’t want.

Instead, we will use a different project name to run the application, isolating the tests into their own environment. Since containers are ephemeral (short-lived), running your tests in a separate set of containers makes certain that your app is behaving exactly as it should in a clean environment.

In your terminal, run the following command:

/> docker-compose -p tests run -p 3000 --rm web npm run watch-tests

You should see jest run through integration tests and wait for changes.

The docker-compose command accepts several options, followed by a command. In this case, you are using -p tests to run the services under the tests project name. The command being used is run, which will execute a one-time command against a service.

Since the docker-compose.yml file specifies a port, we use -p 3000 to create a random port to prevent port collision. The--rmoption will remove the containers when we stop the containers. Finally, we are running in the web service npm run watch-tests.

Thanks for reading !

Learn Docker from Beginner to Advanced

Learn Docker from Beginner to Advanced

In this article, you'll learn Docker in simple and easy steps starting from beginner to advanced with examples

The series:

  • Learn Docker from Beginner to Advanced part I - Basics: This part covers what Docker is and why I think you should use it. It brings up concepts such as images and containers and takes you through building and running your first container
  • Learn Docker from Beginner to Advanced part II - Lolumes: This is about Volumes and how we can use volumes to persist data but also how we can turn our development environment into a Volume and make our development experience considerably better
  • Learn Docker from Beginner to Advanced part III - Databases, linking and networks: This is about how to deal with Databases, putting them into containers and how to make containers talk to other containers using legacy linking but also the new standard through networks
  • Learn Docker from Beginner to Advanced part IV - Introducing Docker Compose: This is how we manage more than one service using Docker Compose ( this is 1/2 part on Docker Compose)
  • Learn Docker from Beginner to Advanced part V- Going deeper with Docker Compose: This part is the second and concluding part on Docker Compose where we cover Volumes, Environment Variables and working with Databases and Networks

Now there are a ton of articles out there for Docker but I struggle with the fact that none of them are really thorough and explains what goes on, or rather that’s my impression, feel free to disagree :). I should say I’m writing a lot of these articles for me and my own understanding and to have fun in the process :). I also hope that it can be useful for you as well.

So I decided to dig relatively deep so that you all hopefully might benefit. TLDR, this is the first part in a series of articles on Docker, this part explains the basics and the reason I think you should use Docker.

This article really is Docker from the beginning, I assume no pre-knowledge, I assume nothing. Enjoy :)

alt text


Using Docker and containerization is about breaking apart a monolith into microservices. Throughout this series, we will learn to master Docker and all its commands. Sooner or later you will want to take your containers to a production environment. That environment is usually the Cloud. When you feel you've got enough Docker experience have a look at these links to see how Docker can be used in the Cloud as well:

  • Sign up for a free Azure account To use containers in the Cloud like a private registry you will need a free Azure account
  • Containers in the Cloud Great overview page that shows what else there is to know about containers in the Cloud
  • Deploying your containers in the Cloud Tutorial that shows how easy it is to leverage your existing Docker skill and get your services running in the Cloud
  • Creating a container registry Your Docker images can be in Docker Hub but also in a Container Registry in the Cloud. Wouldn't it be great to store your images somewhere and actually be able to create a service from that Registry in a matter of minutes?
Learn Docker from Beginner to Advanced part I - Basics

In this article, we will attempt to cover the following topics

  • Why Docker and what is it, this is probably the most important part of the article, why Docker, why not some other technology or status quo? I will attempt to explain what Docker is and what it consists of.
  • Docker in action, we will dockerize an application to showcase we understand and can use the core concepts that make out Docker.
  • Improving our set up, we should ensure our solution does not rely on static values. We can ensure this by creating and setting environment variables whose value we can read from inside of our application.
  • Managing our container, now it’s fairly easy to get a container up and running but let’s look how to manage it, after all, we don’t want the container to be up and running forever. Even it’s a very lightweight thing, it adds up and it can block ports that you want to use for other things.

Remember that this is the first part of a series and that we will look into other things in this series on Docker such as Volumes, Linking, Micro Services, and Orchestration, but that will be covered in future parts.

Why Docker and what is it

Docker helps you create a reproducible environment. You are able to specify the exact version of different libraries, different environment variables and their values among other things. Most importantly you are able to run your application in isolation inside of that environment.

The big question is why we would want that?

  • onboarding, every time you onboard a new developer in a project they need to set up a lot of things like installing SDKs, development tools, databases, add permissions and so on. This a process that can take from one day to up to 2 weeks
  • environments look the same, using Docker you can create a DEV, STAGING as well as PRODUCTION environment that all look the same. That is really great as before Docker/containerization you could have environments that were similar but there might have been small differences and when you discovered a bug you could spend a lot of time chasing down the root cause of the bug. Sometimes the bug was in the source code itself but sometimes it was due to some difference in the environment and that usually took a long time to determine.
  • works on my machine, this point is much like the above but because Docker creates these isolated containers, where you specify exactly what they should contain, you can also ship these containers to the customers and they will operate in the exact same way as they did on your development machine/s.
What is it

Ok, so we’ve mentioned some great reasons above why you should look into Docker but let's dive more into what Docker actually is. We’ve established that it lets us specify an environment like the OS, how to find and run the apps and the variables you need, but what else is there to know about Docker?

Docker creates stand-alone packages called containers that contain everything that is needed for you to run your application. Each container gets its own CPU, memory and network resources and does not depend on a specific operating system or kernel. The first that comes to mind when I describe the above is a Virtual Machine, but Docker differs in how it shares or dedicates resources. Docker uses a so-called layered file system which enables the containers to share common parts and the end result is that containers are way less of resource-hog on the host system than a virtual machine.

In short, the Docker containers, contain everything you need to run an application, including the source code you wrote. Containers are also isolated and secure light-weight units on your system. This makes it easy to create multiple micro-services that are written in different programming languages and that are using different versions of the same lib and even the same OS.

If you are curious about how exactly Docker does this I urge to have a look at the following links on layered file system and the library runc and also this great wikipedia overview of Docker.

Docker in action

Ok, so we covered what Docker is and some benefits. We also understood that the thing that eventually runs my application is called a container. But how do we get there? Well, we start out with a description file, called a Dockerfile. In this Dockerfile, we specify everything we need in terms of OS, environment variables and how to get our application in there.

Now we will jump in at the deep end. We will build an app and Dockerize it, so we will have our app running inside of a container, isolated from the outside world but reachable on ports that we explicitly open up.

We will take the following steps:

  • create an application, we will create a Node.js Express application, which will act as a REST API.
  • create a Dockerfile, a text file that tells Docker how to build our application
  • build an image, the pre-step to having our application up and running is to first create a so-called Docker image
  • create a container, this is the final step in which we will see our app up and running, we will create a container from a Docker image

Creating our app

We will now create an Express Node.js project and it will consist of the following files:

  • app.js, this is the file that spins up our REST
  • package.json, this is the manifest file for the project, here we will see all the dependencies like express but we will also declare a script start so we can easily start our application
  • Dockerfile, this is a file we will create to tell Docker how to Dockerize our application

To generate our package.json we just place ourselves in the projects directory and type:

npm init -y

This will produce the package.json file with a bunch of default values.

Then we should add the dependencies we are about to use, which is the library express , we install it by typing like this:

npm install express —-save

Let’s add some code

Now when we have done all the prework with generating a package.json file and installing dependencies, it’s time to add the code needed for our application to run, so we add the following code to app.js:

// app.js
const express = require('express')

const app = express()

const port = 3000

app.get('/', (req, res) => res.send('Hello World!'))

app.listen(port, () => console.log(`Example app listening on port ${port}!`))

We can try and run this application by typing:

node app.js

Going to a web browser on http://localhost:3000 we should now see:

alt text

Ok so that works, good :)

One little comment though, we should take note of the fact that we are assigning the port to 3000 when we later create our Dockerfile.

Creating a Dockerfile

So the next step is creating our Dockerfile. Now, this file acts as a manifest but also as a build instruction file, how to get our app up and running. Ok, so what is needed to get the app up and running? We need to:

  • copy all app files into the docker container
  • install dependencies like express
  • open up a port, in the container that can be accessed from the outside
  • instruct, the container how to start our app

In a more complex application, we might need to do things like setting environment variables or set credentials for a database or run a database seed to populate the database and so on. For now, we only need the things we specified in our bullet list above. So let’s try to express that in our Dockerfile:

// Dockerfile

FROM node:latest


COPY . .

RUN npm install


ENTRYPOINT ["node", "app.js"]

Let’s break the above commands down:

  • FROM, this is us selecting an OS image from Docker Hub. Docker Hub is a global repository that contains images that we can pull down locally. In our case we are choosing an image based on Ubuntu that has Node.js installed, it’s called node. We also specify that we want the latest version of it, by using the following tag :latest
  • WORKDIR, this simply means we set a working directory. This is a way to set up for what is to happen later, in the next command below
  • COPY, here we copy the files from the directory we are standing into the directory specified by our WORKDIR command
  • RUN, this runs a command in the terminal, in our case we are installing all the libraries we need to build our Node.js express application
  • EXPOSE, this means we are opening up a port, it is through this port that we communicate with our container
  • ENTRYPOINT, this is where we should state how we start up our application, the commands need to be specified as an array so the array [“node”, “app.js”] will be translated to the node app.js in the terminal

Quick overview

Ok, so now we have created all the files we need for our project and it should look like this:

app.js // our express app
Dockerfile // our instruction file that Docker will read from
/node_modules // directory created when we run npm install
package.json // npm init created this
package-lock.json // created when we installed libraries from NPM

Building an image

There are two steps that need to be taken to have our application up and running inside of a container, those are:

  • creating an image, with the help of the Dockerfile and the command docker build we will create an image
  • start the container, now that we have an image from the action we took above we need to create a container

First things first, let’s create our image with the following command:

docker build -t chrisnoring/node:latest .

The above instruction creates an image. The . at the end is important as this instructs Docker and tells it where your Dockerfile is located, in this case, it is the directory you are standing in. If you don’t have the OS image, that we ask for in the FROM command, it will lead to it being pulled down from Docker Hub and then your specific image is being built.

Your terminal should look something like this:

alt text

What we see above is how the OS image node:latest is being pulled down from the Docker Hub and then each of our commands is being executed like WORKDIR, RUN and so on. Worth noting is how it says removing intermediate container after each step. Now, this is Docker being smart and caching all the different file layers after each command so it goes faster. In the end, we see successfully built which is our cue that everything was constructed successfully. Let’s have a look at our image with:

docker images

alt text

We have an image, success :)

Creating a container

Next step is to take our image and construct a container from it. A container is this isolated piece that runs our app inside of it. We build a container using docker run . The full command looks like this:

docker run chrisnoring/node

That’s not really good enough though as we need to map the internal port of the app to an external one, on the host machine. Remember this is an app that we want to reach through our browser. We do the mapping by using the flag -p like so:

-p [external port]:[internal port]

Now the full command now looks like this:

docker run -p 8000:3000 chrisnoring/node

Ok, running this command means we should be able to visit our container by going to http://localhost:8000, 8000 is our external port remember that maps to the internal port 3000. Let’s see, let’s open up a browser:

alt text

There we have it folks, a working container :D

alt text

Improving our set up with Environment Variables

Ok, so we’ve learned how to build our Docker image, we’ve learned how to run a container and thereby our app inside of it. However, we could be handling the part with PORT a bit nicer. Right now we need to keep track of the port we start the express server with, inside of our app.js , to make sure this matches what we write in the Dockerfile. It shouldn’t have to be that way, it’s just static and error-prone.

To fix it we could introduce an environment variable. This means that we need to do two things:

  • add an environment variable to the Dockerfile
  • read from the environment variable in app.js

Add an environment variable

For this we need to use the command ENV, like so:


Let’s add that to our Dockerfile so it now looks like so:

FROM node:latest


COPY . .


RUN npm install


ENTRYPOINT ["node", "app.js"]

Let’s do one more change namely to update EXPOSE to use our variable, so we git rid of static values and rely on variables instead, like so:

FROM node:latest


COPY . .


RUN npm install


ENTRYPOINT ["node", "app.js"]

Note above how we change our EXPOSE command to $PORT, any variables we use needs to be prefixed with a $ character:


Read the environment variable value in App.js

We can read values from environment variables in Node.js like so:


So let’s update our app.js code to this:

// app.js
const express = require('express')

const app = express()

const port = process.env.PORT

app.get('/', (req, res) => res.send('Hello World!'))

app.listen(port, () => console.log(`Example app listening on port ${port}!`))

NOTE, when we do a change in our app.js or our Dockerfile we need to rebuild our image. That means we need to run the docker build command again and prior to that we need to have torn down our container with docker stop and docker rm. More on that in the upcoming sections.

Managing our container

Ok, so you have just started your container with docker run and you notice that you can’t shut it off in the terminal. Panic sets in ;) At this point you can go to another terminal window and do the following:

docker ps

This will list all running containers, you will be able to see the containers name as well as its id. It should look something like this:

alt text

As you see above we have the column CONTAINER_ID or NAMES column, both these values will work to stop our container, cause that is what we need to do, like so:

docker stop f40

We opt for using CONTAINER_ID and the three first digits, we don’t need more. This will effectively stop our container.

Daemon mode

We can do like we did above and open a separate terminal tab but running it in Daemon mode is a better option. This means that we run the container in the background and all output from it will not be visible. To make this happen we simply add the flag -d . Let’s try that out:

alt text

What we get now is just the container id back, that’s all we’re ever going to see. Now it’s easier for us to just stop it if we want, by typing docker stop 268 , that’s the three first digits from the above id.

Interactive mode

Interactive mode is an interesting one, this allows us to step into a running container and list files, or add/remove files or just about anything we can do for example bash. For this, we need the command docker exec, like so:

alt text

Above we run the command:

docker exec -it 268 bash

NOTE, the container needs to be up and running. If you've stopped it previously you should start it with docker start 268. Replace 268 with whatever id you got when it was created when you typed docker run.

268 is the three first digits if our container and -it means interactive mode and our argument bash at the end means we will run a bash shell.

We also run the command ls, once we get the bash shell up and running so that means we can easily list what’s in the container so we can verify we built it correctly but it’s a good way to debug as well.

If we just want to run something on the container like a node command, for example, we can type:

docker exec 268 node app.js

that will run the command node app.js in the container

Docker kill vs Docker stop

So far we have been using docker stop as way to stop the container. There is another way of stopping the container namely docker kill , so what is the difference?

  • docker stop, this sends the signal SIGTERM followed by SIGKILL after a grace period. In short, this is a way to bring down the container in a more graceful way meaning it gets to release resources and saving state.
  • docker kill, this sends SIGKILL right away. This means resource release or state save might not work as intended. In development, it doesn’t really matter which one of the two commands are being used but in a production scenario it probably wiser to rely on docker stop

Cleaning up

During the course of development you will end up creating tons of container so ensure you clean up by typing:

docker rm id-of-container

Summary Part I

Ok, so we have explained Docker from the beginning. We’ve covered motivations for using it and the basic concepts. Furthermore, we’ve looked into how to Dockerize an app and in doing so covered some useful Docker commands. There is so much more to know about Docker like how to work with Databases, Volumes, how to link containers and why and how to spin up and manage multiple containers, also known as orchestration.

Learn Docker from Beginner to Advanced Part II

Welcome to the second part of this series about Docker. Hopefully, you have read the first part to gain some basic understanding of Dockers core concepts and its basic commands or you have acquired that knowledge elsewhere.

In this article, we will attempt to cover the following topics

  • recap and problem introduction , let’s recap on the lessons learned from part I and let’s try to describe how not using a volume can be quite painful
  • persist data , we can use Volumes to persist files we create or Databases that we change ( e.g Sqllite).
  • turning our workdir into a volume , Volumes also give us a great way to work with our application without having to set up and tear down the container for every change.
Recap and the problem of not using a volume

Ok, so we will keep working on the application we created in the first part of this series, that is a Node.js application with the library express installed.

We will do the following in this section:

  • run a container, we will start a container and thereby repeat some basic Docker commands we learned in the first part of this series
  • update our app, update our source code and start and stop a container and realize why this way of working is quite painful

Run a container

As our application grows we might want to do add routes to it or change what is rendered on a specific route. Let’s show the source code we have so far:

// app.js

const express = require('express')

const app = express()

const port = process.env.PORT

app.get('/', (req, res) => res.send('Hello World!'))

app.listen(port, () => console.log(`Example app listening on port ${port}!`))

Now let’s see if we remember our basic commands. Let’s type:

docker ps

Ok, that looks empty. So we cleaned up last time with docker stop or docker kill , regardless of what we used we don’t have a container that we can start, so we need to build one. Let’s have a look at what images we have:

docker images

Ok, so we have our image there, let’s create and run a container:

docker run -d -p 8000:3000 chrisnoring/node

That should lead to a container up and running at port 8000 and it should run in detached mode, thanks to us specifying the -d flag.

We get a container ID above, good. Let’s see if we can find our application at http://localhost:8000:

Ok, good there it is. Now we are ready for the next step which is to update our source code.

Update our app

Let’s start by changing the default route to render out hello Chris , that is add the following line:

app.get('/', (req, res) => res.send('Hello Chris!'))

Ok, so we save our change and we head back to the browser and we notice it is still saying Hello World. It seems the container is not reflecting our changes. For that to happen we need to bring down the container, remove it, rebuild the image and then run the container again. Because we need to carry out a whole host of commands, we will need to change how we build and run our container namely by actively giving it a name, so instead of running the container like so:

docker run -d -p 8000:3000 chrisnoring/node

We now type:

docker run -d -p 8000:3000 --name my-container chrisnoring/node

This means our container will get the name my-container and it also means that when we refer to our container we can now use its name instead of its container ID, which for our scenario is better as the container ID will change for every setup and tear down.

docker stop my-container // this will stop the container, it can still be started if we want to

docker rm my-container // this will remove the container completely

docker build -t chrisnoring/node . // creates an image

docker run -d -p 8000:3000 --name my-container chrisnoring/node

You can chain these commands to look like this:

docker stop my-container && docker rm my-container && docker build -t chrisnoring/node . && docker run -d -p 8000:3000 --name my-container chrisnoring/node

My first seeing thought seeing that is WOW, that’s a lot of commands. There has got to be a better way right, especially when I’m in the development phase?

Well yes, there is a better way, using a volume. So let’s look at volumes next.

Using a volume

Volumes or data volumes is a way for us to create a place in the host machine where we can write files so they are persisted. Why would we want that? Well, when we are under development we might need to put the application in a certain state so we don’t have to start from the beginning. Typically we would want to store things like log files, JSON files and perhaps even databases (SQLite ) on a volume.

It’s quite easy to create a volume and we can do so in many different ways, but mainly there are two ways:

  • before you create a container
  • lazily, e.g while creating the container

Creating and managing a volume

To create a volume you type the following:

docker volume create [name of volume]

we can verify that our volume was created by typing:

docker volume ls

This will list all the different volumes we have. Now, this will after a while lead to you having tons of volumes created so it’s good to know how to keep down the number of volumes. For that you can type:

docker volume prune

This will remove all the volumes you currently are not using. You will be given a question if you want to proceed.

If you want to remove a single volume you can do so by typing:

docker volume rm [name of volume]

Another command you most likely will want to know about is the inspect command that allows us to see more details on our created volume and probably most important where it will place the persisted files.

docker inspect [name of volume]

A comment on this though is that most of the time you might not care where Docker place these files but sometimes you would want to know due to debugging purposes. As we will see later in this section controlling where files are persisted can work to our advantage when we develop our application.

As you can see the Mountpoint field is telling us where Docker is planning to persist your files.

Mounting a volume in your application

Ok, so we have come to the point that we want to use our volume in an application. We want to be able to change or create files in our container so that when we pull it down and start it up again our changes will still be there.

For this we can use two different commands that achieve relatively the same thing with a different syntax, those are:

  • -v, —-volume, the syntax looks like the following -v [name of volume]:[directory in the container], for example -v my-volume:/app
  • --mount, the syntax looks like the following--mount source=[name of volume],target=[directory in container] , for example —-mount source=my-volume,target=/app

Used in conjuncture with running a container it would look like this for example:

docker run -d -p 8000:3000 --name my-container --volume my-volume:/logs chrisnoring/node

Let’s try this out. First off let’s run our container:

Then let’s run our inspect command to ensure our volume has been correctly mounted inside of our container. When we run said command we get a giant JSON output but we are looking for the Mounts property:

Ok, our volume is there, good. Next step is to locate our volume inside of our container. Let’s get into our container with:

docker exec -it my-container bash

and thereafter navigate to our /logs directory:

Ok, now if we bring down our container everything we created in our volume should be persisted and everything that is not placed in the volume should be gone right? Yep, that’s the idea. Good, we understand the principle of volumes.

Mounting a subdirectory as a volume

So far we have been creating a volume and have let Docker decide on where the files are being persisted. What happens if we decide where these files are persisted?

Well if we point to a directory on our hard drive it will not only look at that directory and place files there but it will pick the pre-existing files that are in there and bring them into our mount point in the container. Let’s do the following to demonstrate what I mean:

  • create a directory, let’s create a directory /logs
  • create a file, let’s create a file logs.txt and write some text in it
  • run our container, let’s create a mount point to our local directory + /logs

The first two commands lead to us having a file structure like so:

 logs.txt // contains 'logging host...'

Now for the run command to get our container up and running:

Above we observe that our --volume command looks a bit different. The first argument is $(pwd)/logs which means our current working directory and the subdirectory logs. The second argument is /logs which means we are saying mount our host computers logs directory to a directory with the same name in the container.

Let’s dive into the container and establish that the container has indeed pulled in the files from our host computers logs directory:

As you we can see from the above set of commands we go into the container with docker exec -it my-container bash and then we proceed to navigate ourselves to the logs directory and finally we read out the content of logs.txt with the command cat logs.txt. The result is logging host... e.g the exact file and content that we have on the host computer.

But this is a volume however which means there is a connection between the volume in the host computer and the container. Let’s edit the file next on the host computer and see what happens to the container:

Wow, it changed in the container as well without us having to tear it down or restarting it.

Treating our application as a volume

To make our whole application be treated as a volume we need to tear down the container like so:

docker kill my-container && docker rm my-container

Why do we need to do all that? Well, we are about to change the Dockerfile as well as the source code and our container won’t pick up these changes, unless we use a Volume, like I am about to show you below.

Thereafter we need to rerun our container this time with a different volume argument namely --volume $(PWD):/app.

NOTE, if your PWD consists of a directory with space in it you might need to specify the argument as "$(PWD)":/app instead, i.e we need to surround $(PWD) with double quotes. Thank you to Vitaly for pointing that out :)

The full command looks like this:

This will effectively make our entire app directory a volume and every time we change something in there our container should reflect the changes.

So let’s try adding a route in our Node.js Express application like so:

app.get("/docker", (req, res) => {

  res.send("hello from docker");


Ok, so from what we know from dealing with the express library we should be able to reach http://localhost:8000/docker in our browser or?

Sad face :(. It didn’t work, what did we do wrong? Well here is the thing. If you change the source in a Node.js Express application you need to restart it. This means that we need to take a step back and think how can we restart our Node.js Express web server as soon as there is a file change. There are several ways to accomplish this like for example:

  • install a library like nodemon or forever that restarts the web server
  • run a PKILL command and kill the running node.js process and the run node app.js

It feels a little less cumbersome to just install a library like nodemon so let’s do that:

This means we now have another library dependency in package.json but it means we will need to change how we start our app. We need to start our app using the command nodemon app.js. This means nodemon will take care of the whole restart as soon as there is a change. While we are at it let’s add a start script to package.json, after all, that is the more Node.js -ish way of doing things:

Let's describe what we did above, in case you are new to Node.js. Adding a start script to a package.json file means we go into a section called "scripts" and we add an entry start, like so:

// excerpt package.json
"scripts": {
  "start": "nodemon app.js"

By default a command defined in "scripts" is run by you typing npm run [name of command]. There are however known commands, like start and test and with known commands we can omit the keyword run, so instead of typing npm run start, we can type npm start. Let's add another command "log" like so:

// excerpt package.json

"scripts": {
  "start": "nodemon app.js",
  "log": "echo \"Logging something to screen\""

To run this new command "log" we would type npm run log.

Ok, one thing remains though and that is changing the Dockerfile to change how it starts our app. We only need to change the last line from:

ENTRYPOINT ["node", "app.js"]


ENTRYPOINT ["npm", "start"]

Because we changed the Dockerfile this leads to us having to rebuild the image. So let’s do that:

docker build -t chrisnoring/node .

Ok, the next step is to bring up our container:

docker run -d -p 8000:3000 --name my-container --volume $(PWD):/app chrisnoring/node

Worth noting is how we expose the entire directory we are currently standing in and mapping that to /app inside the container.

Because we’ve already added the /docker route we need to add a new one, like so:

app.get('/nodemon', (req, res) => res.send('hello from nodemon'))

Now we hope that nodemon has done it’s part when we save our change in app.js :

Aaaand, we have a winner. It works to route to /nodemon . I don’t know about you but the first time I got this to work this was me:

Summary Part II

This has brought us to the end of our article. We have learned about Volumes which is quite a cool and useful feature and more importantly I’ve shown how you can turn your whole development environment into a volume and keep working on your source code without having to restart the container.

Learn Docker from Beginner to Advanced Part III

This the third part of our series. In this part, we will focus on learning how we work with Databases and Docker together. We will also introduce the concept of linking as this goes tightly together with working with Databases in a containerized environment.

In this article we will cover the following:

  • Cover some basics about working with MySql , it’s always good to cover some basics on managing a database generally and MySql, particularly as this, is the chosen database type for this article
  • Understand why we need to use a MySql Docker image , we will also cover how to create a container from said image and what environment variables we need to set, for it to work
  • Learn how to connect to our MySql container , from our application container using linking, this is about realizing how to do basic linking between two containers and how this can be used to our advantage when defining the database startup configuration
  • Expand our knowledge on linking , by covering the new way to link containers, there are two ways of doing linking, one way is more preferred than the other, which is deprecated, so we will cover how the new way of doing things is done
  • Describe some good ways of managing our database , such as giving it an initial structure and seed it
Working with databases in general and MySql in particular

With databases in general we want to be able to do the following:

  • read , we want to be able to read the data by using different kinds of queries
  • alter/create/delete , we need to be able to change the data
  • add structure , we need a way to create a structure like tables or collections so our data is saved in a specific format
  • add seed/initial data , in a simple database this might not be needed at all but in more complex scenario you would need some tables to be prepopulated with some basic data. Having a seed is also great to have under the development phase as it makes it easy to render certain view or test different scenarios if there is already pre existing data or that the data is put in a certain state, e.g a shopping cart has items and you want to test the checkout page.

There are many more things we want to do to a database like adding indexes, adding users with different access rights and much much more but let’s focus on these four points above as a references for what we, with the help of Docker should be able to support.

Installing and connecting to MySql

There are tons of ways to install MySql not all of them are fast :/. One of the easier ways on a Linux system is typing the following:

sudo apt-get install mysql-server

On a Mac you would be using brew and then instead type:

brew install mysql

In some other scenarios you are able to download an installer package and follow a wizard.

Once you are done installing MySql you will get some information similar to this, this will of course differ per installation package:

The above tells us we don’t have a root password yet, YIKES. We can fix that though by running mysql-secure-installation . Let’s for now just connect to the database running the suggested mysql -uroot .

NOOO, what happened? We actually got this information in the larger image above, we needed to start MySql either by running brew services start mysql , which would run it as a background service or using mysql-server start which is more of a one off. Ok, let’s enter brew services start :

The very last thing is it says Successfully started mysql :

Let’s see if Matthew is correct, can we connect ?

And we get a prompt above mysql>, we are in :D

Ok so we managed to connect using NO password, we should fix that and we dont have any database created, we should fix that too :)

Well it’s not entirely true, we do have some databases, not just any databases with content created by yourself, but rather supportive ones that we shouldn’t touch:

So next up would be to create and select the newly created database, so we can query from it:

Ok, great, but wait, we don’t have any tables? True true, we need to create those somehow. We could be creating them in the terminal but that would just be painful, lots and lots of multiline statements, so let’s see if we can feed MySql a file with the database and all the tables we want in it. Let’s first define a file that we can keep adding tables to:

// database.sql

// creates a table `tasks`


title VARCHAR(255) NOT NULL,

start_date DATE,

due_date DATE,



description TEXT,

PRIMARY KEY (task_id)

// add more tables below and indeces etc as our solution grows

Ok then, we have a file with database structure, now for getting the content in there we can use the source command like so ( masking over the user name):

If you want the full path to where your file is located just type PWD where your file is at and type source [path to sql file]/database.sql. As you can see above, we need to select a database before we run our SQL file so it targets a specific database and then we verify that the table has been created with SHOW TABLES; We can also use the same command to seed our database with data, we just give it a different file to process, one containing INSERT statements rather than CREATE TABLE...

Ok then. I think that’s enough MySql for now, let’s talk MySql in the context of Docker next.

Why we need a MySql Docker image and how to get it up and running as a container

Ok so let’s now say we want a container for our application. Let’s furthermore say that our solution needs a database. It wouldn’t make much sense to install MySql on the host the computer is running on. I mean one of the mantras of containers is that we shouldn’t have to care about the the host system the containers are running on. Ok so we need MySql in a container, but which container, the apps own container or a separate container? That’s a good question depending on your situation, you could either install MySql with your app or run it in a separate container. Some argument for that is:

  • rolling updates, your database can be kept online while your application nodes do individual restarts, you won’t experience downtime.
  • scalability, you can add nodes to scale in a loosely coupled fashion

There are a lot of arguments for and against and only you know exactly what works best for you — so you do you :)

MySql as stand alone image

Let’s talk about the scenario in which we pull down a MySql image. Ok, we will take this step by step so first thing we do is try to run it and as we learned from previous articles the image will be pulled down for us if we don’t have it, so here is the command:

docker run --name=mysql-db mysql

Ok, so what’s the output of that?

Pulling down the image, good, aaaand error.

We are doing all kinds of wrong :(.

Database is uninitialized, password option is not specified and so on. Let’s see if we can fix that:

docker run --name mysql-db -e MYSQL_ROOT_PASSWORD=complexpassword -d -p 8000:3306 mysql

and the winner is:

Argh, our container we started before is up and running despite the error message it threw. Ok, let’s bring down the container:

docker rm mysql-db

and let’s try to run it again with the database set like above:

Ok, we don’t get a whole bunch of logging to the terminal, because we are running in Daemon mode so let’s run docker ps to check:

At this point we want to connect to it from the outside. Our port forwarding means we need to connect to it on :

mysql -uroot -pcomplexpassword -h -P 8001

Ok, just a comment above we can either specify the password or just write -p and we will be prompted for the password on the next row. Let’s have a look at the result:

Ok, we managed to connect to our database, it is reachable, great :).

But wait, can we reach a database, inside of container, from another container? Well here is where it gets tricky. This is where we need to link the two containers.

Connecting to database from Node.js

Ok let’s first add some code in our app that tries to connect to our database. First off we need to install the NPM package for mysql :

npm install mysql

Then we add the following code to the top of our app.js file:

// app.js

const mysql = require('mysql');

const con = mysql.createConnection({

host: "localhost",

port: 8001,

user: "root",

password: "complexpassword",

database: 'Customers'


con.connect(function (err) {

if (err) throw err;

So let’s try this in the terminal:

Pain and misery :(

So why won’t it work.

This is because caching_sha2_password is introduced in MySQL 8.0, but the Node.js version is not implemented yet.

Ok so what, we try Postgres or some other database? Well we can actually fix this by connecting to our container like so:

mysql -uroot -pcomplexpassword -h -P 8001

and once we are at the mysql prompt we can type the following:

mysql> ALTER USER 'root' IDENTIFIED WITH mysql_native_password BY 'complexpassword';

Let’s try to run our node app.js command again and this time we get this:


Ok, so some call to this mysql_native_password seems to fix to whole thing. Let’s go deeper into the rabbit hole, what is that?

MySQL includes a mysql_native_password plugin that implements native authentication; that is, authentication based on the password hashing method in use from before the introduction of pluggable authentication

Ok, so that means MySql 8 have switched to some new pluggable authentication that our Node.js mysql library hasn’t been able to catch up on. That means we can either pull down an earlier version of MySql or revert to native authentication, your call :)


The idea of linking is that a container shouldn’t have to know any details on what IP or PORT the database, in this case, is running on. It should just assume that for example the app container and the database container can reach each other. A typical syntax looks like this:

docker run -d -p 5000:5000 --name product-service --link mypostgres:postgres chrisnoring/node

Let’s break the above down a bit:

  • app container, we are creating an app container called product-service
  • --link we are calling this command to link our product-service container with the existing mypostgres container
  • link alias, the statement --link mypostgres:postgres means that we specify what container to link with mypostgres and gives it an alias postgres. We will use this alias internally in product-service container when we try to connect to our database

Ok, so we think we kind of get the basics of linking, let’s see how that applies to our existing my-container and how we can link it to our database container mysql-db. Well because we started my-container without linking it we need to tear it down and restart it with our --link command specified as an additional argument like so:

docker kill my-container && docker rm my-container

That brings down the container. Before we bring it up though, we actually need to change some code, namely the part that has to do with connecting to our database. We are about to link it with our database container using the --link mysql-db:mysql argument which means we no longer need the IP or the PORT reference so our connection code can now look like this:

// part of app.js, the rest omitted for brevity

const con = mysql.createConnection({

**host: "mysql",**

user: "root",

password: "complexpassword",

database: 'Customers'


// and the rest of the code below

The difference is now our host is no longer stating localhost and the port is not explicitly stating 8001 cause the linking figures that out, all we need to focus on is knowing what alias we gave the database container when we linked e.g mysql. Because we both changed the code and we added another library myql we will need to rebuild the image, like so:

docker build - t chrisnoring/node .

now let’s get it started again, this time with a link:

docker run -d -p 8000:3000 --name my-container --link mysql-db:mysql chrisnoring/node

and let’s have a look at the output:

We use the docker logs my-container to see the logs from our app container and the very last line says Connected! which comes from the callback function that tells us we successfully connected to MySql.

You know what time it is ? BOOOM!

Expand our knowledge on linking — there is another way

Ok, so this is all well and good. We managed to place the application in one container and the database in another container. But… This linking we just looked at is considered legacy which means we need to learn another way of doing linking. Wait… It’s not as bad as you think, it actually looks better syntactically.

This new way of doing this is called creating networks or custom bridge networks. Actually, that’s just one type of network we can create. The point is that they are dedicated groups that you can say that your container should belong to. A container can exist in multiple networks if it has cross cutting concerrns for more than one group. That all sounds great so show me some syntax right?

The first thing we need to do is to create the network we want the app container and database container to belong to. We create said network with the following command:

docker network create --driver bridge isolated_network

Now that we have our network we just need to create each container with the newly created network as an additional argument. Let’s start with the app container:

docker run -d -p 8000:3000 --net isolated_network --name my-container chrisnoring-node

Above we are now using the --net syntax to create a network that we call isolated_network and the rest is just the usual syntax we use to create and run our container. That was easy :)

What about the database container?

docker run -p 8000:3306 --net isolated_network --name mysql-db -e MYSQL_ROOT_PASSWORD=complexpassword -d mysql

As you can see above we just create and run our database container but with the added --net isolated_network .

Looking back at this way of doing it, we no longer need to explicitly say that one container needs to be actively linked to another, we only place a container in a specific network and the containers in that network knows how to talk to each other.

There is a lot more to learn about networks, there are different types and a whole host of commands. I do think we got the gist of it, e.g how it differs from legacy linking. Have a look at this link to learn more docs on networking

General database management in a container setting

Ok, so we talked at the beginning of this section how we can create the database structure as a file and how we even can create our seed data as a file and how we can run those once we are at the MySql prompt after we connected to it. Here is the thing. How we get that structure and seed data in there is a little bit up to you but I will try to convey some general guidelines and ideas:

  • structure, you want to run this file at some point. To get in into the database you can either connect using the mysql client and refer to the file being outside in the host computer. Or you can place this file as part of your app repo and just let the Dockerfile copy the file into the container and then run it. Another way is looking at libraries such as Knex or Sequelize that supports migrations and specify the database structure in migrations that are being run. This is a bit different topic all together but I just wanted to show that there are different ways for you to handle this
  • seed, you can treat the seed the same way as you do the structure. This is something that needs to happen after we get our structure in and it is a one-off. It’s not something you want to happen at every start of the app or the database and depending on whether the seed is just some basic structural data or some data you need for testing purposes you most likely want to do this manually, e.g you connect to mysql using the client and actively runs the source command
Summary Part III

Ok, this article managed to cover some general information on MySql and discussed how we could get some structural things in there like tables and we also talked about an initial/test seed and how to get it into the database using standalone files.

Then we discussed why we should run databases in Docker container rather than installed on the host and went on to show how we could get two containers to talk to each other using a legacy linking technique. We did also spend some time cursing at MySql 8 and the fact the mysql Node.js module is not in sync which forced us to fix it.

Once we covered the legacy linking bit we couldn’t really stop there but we needed to talk about the new way we link containers, namely using networks.

Learn Docker from Beginner to Advanced Part IV

This part is about dealing with more than two Docker containers. You will come to a point eventually when you have so many containers to manage it feels unmanageable. You can only keep typing docker run to a certain point, as soon as you start spinning up multiple containers it just hurts your head and your fingers. To solve this we have Docker Compose.

TLDR; Docker Compose is a huge topic, for that reason this article is split into two parts. In this part, we will describe why Docker Compose and show when it shines. In the second part on Docker Compose we will cover more advanced topics like Environment Variables, Volumes and Databases.

In this part we will cover:

  • Why docker compose, it's important to understand, at least on a high level that there are two major architectures Monolith and Microservices and that Docker Compose really helps with managing the latter
  • Features, we will explain what feature Docker Compose supports so we will come to understand why it's such a good fit for our chosen Microservice architecture
  • When Docker isn't enough, we will explain at which point using Docker commands becomes tedious and painful and when using Docker Compose is starting to look more and more enticing
  • In action, lastly we will build a docker-compose.yaml file from scratch and learn how to manage our containers using Docker Compose and some core commands
Why Docker Compose

Docker Compose is meant to be used when we need to manage many services independently. What we are describing is something called a microservice architecture.

Microservice architecture

Let's define some properties on such an architecture:

  • Loosely coupled, this means they are not dependent on another service to function, all the data they need is just there. They can interact with other services though but that's by calling their external API with for example an HTTP call
  • Independently deployable, this means we can start, stop and rebuild them without affecting other services directly.
  • Highly maintainable and testable, services are small and thus there is less to understand and because there are no dependencies testing becomes simpler
  • Organized around business capabilities, this means we should try to find different themes like booking, products management, billing and so on

We should maybe have started with the question of why we want this architecture? It's clear from the properties listed above that it offers a lot of flexibility, it has less to no dependencies and so on. That sounds like all good things, so is that the new architecture that all apps should have?

As always it depends. There are some criteria where Microservices will shine as opposed to a Monolithic architecture such as:

  • Different tech stacks/emerging techs, we have many development teams and they all want to use their own tech stack or want to try out a new tech without having to change the entire app. Let each team build their own service in their chosen tech as part of a Microservice architecture.
  • Reuse, you really want to build a certain capability once, like for example billing, if that's being broken out in a separate service it makes it easier to reuse for other applications. Oh and in a microservices architecture you could easily combine different services and create many apps from it
  • Minimal failure impact, when there is a failure in a monolithic architecture it might bring down the entire app, with microservices you might be able to shield yourself better from failure

There are a ton more arguments on why Micro services over Monolithic architecture. The interested reader is urged to have a look at the following link .

The case for Docker Compose

The description of a Microservice architecture tells us that we need a bunch of services organized around business capabilities. Furthermore, they need to be independently deployable and we need to be able to use different tech stacks and many more things. In my opinion, this sounds like Docker would be a great fit generally. The reason we are making a case for Docker Compose over Docker is simply the sheer size of it. If we have more than two containers the amount of commands we need to type suddenly grows in a linear way. Let's explain in the next section what features Docker Compose have that makes it scale so well when the number of services increase.

Docker Compose features overview

Now Docker Compose enables us to scale really well in the sense that we can easily build several images at once, start several containers and many more things. A complete listing of features is as follows:

  • Manages the whole application life cycle.
  • Start, stop and rebuild services
  • View the status of running services
  • Stream the log output of running services
  • Run a one-off command on a service

As we can see it takes care of everything we could possibly need when we need to manage a microservice architecture consisting of many services.

When plain Docker isn't enough anymore

Let's recap on how Docker operates and what commands we need and let's see where that takes us when we add a service or two.

To dockerize something, we know that we need to:

  • define a Dockerfile that contains what OS image we need, what libraries we need to install, env variables we need to set, ports that need opening and lastly how to - start up our service
  • build an image or pull down an existing image from Docker Hub
  • create and run a container

Now, using Docker Compose we still need to do the part with the Dockerfile but Docker Compose will take care of building the images and managing the containers. Let's illustrate what the commands might look like with plain Docker:

docker build -t some-image-name .

Followed by

docker run -d -p 8000:3000 --name some-container-name some-image-name

Now that's not a terrible amount to write, but imagine you have three different services you need to do this for, then it suddenly becomes six commands and then you have the tear down which is two more commands and, that doesn't really scale.

alt text

Enter docker-compose.yaml

This is where Docker Compose really shines. Instead of typing two commands for every service you want to build you can define all services in your project in one file, a file we call docker-compose.yaml. You can configure the following topics inside of a docker-compose.yaml file:

  • Build, we can specify the building context and the name of the Dockerfile, should it not be called the standard name
  • Environment, we can define and give value to as many environment variables as we need
  • Image, instead of building images from scratch we can define ready-made images that we want to pull down from Docker Hub and use in our solution
  • Networks, we can create networks and we can also for each service specify which network it should belong to, if any
  • Ports, we can also define the port forwarding, that is which external port should match what internal port in the container
  • Volumes, of course, we can also define volumes
Docker compose in action

Ok so at this point we understand that Docker Compose can take care of pretty much anything we can do on the command line and that it also relies on a file docker-compose.yaml to know what actions to carry out.

Authoring a docker-compose.yml file

Let's actually try to create such a file and let's give it some instructions. First, though let's do a quick review of a typical projects file structure. Below we have a project consisting of two services, each having their own directory. Each directory has a Dockerfile that contains instructions on how to build a service.

It can look something like this:


Worth noting above is how we create the docker-compose.yaml file at the root of our project. The reason for doing so is that all the services we aim to build and how to build and start them should be defined in one file, our docker-compose.yml.
Ok, let's open docker-compose.yaml and enter our first line:

// docker-compose.yaml
version: '3'

Now, it actually matters what you specify here. Currently, Docker supports three different major versions. 3 is the latest major version, read more here how the different versions differ, cause they do support different functionality and the syntax might even differ between them Docker versions offical docs
Next up let's define our services:

// docker-compose.yaml
version: '3'
      context: ./product-service
      - "8000:3000"

Ok, that was a lot at once, let's break it down:

  • services:, there should only be one of this in the whole docker-compose.yaml file. Also, note how we end with :, we need that or it won't be valid syntax, that is generally true for any command
  • product-service, this is a name we choose ourselves for our service
  • build:, this is instructing Docker Compose how to build the image. If we have a ready-made image already we don't need to specify this one
  • context:, this is needed to tell Docker Compose where our Dockerfile is, in this case, we say that it needs to go down a level to the product-service directory
  • ports:, this is the port forwarding in which we first specify the external port followed by the internal port

All this corresponds to the following two commands:

docker build -t [default name]/product-service .
docker run -p 8000:3000 --name [default name]/product-service

Well, it's almost true, we haven't exactly told Docker Compose yet to carry out the building of the image or to create and run a container. Let's learn how to do that starting with how to build an image:

docker-compose build

The above will build every single service you have specified in docker-compose.yaml. Let's look at the output of our command:

alt text

Above we can see that our image is being built and we also see it is given the full name compose-experiments_product-service:latest, as indicated by the last row. The name is derived from the directory we are in, that is compose-experiments and the other part is the name we give the service in the docker-compose.yaml file.
Ok as for spinning it up we type:

docker-compose up

This will again read our docker-compose.yaml file but this time it will create and run a container. Let's also make sure we run our container in the background so we add the flag -d, so full command is now:

docker-compose up -d

Ok, above we can see that our service is being created. Let's run docker ps to verify the status of our newly created container:
It seems to be up and running on port 8000. Let's verify:
Ok, so went to the terminal and we can see we got a container. We know we can bring it down with either docker stop or docker kill but let's do it the docker-compose way:

docker-compose down

As we can see above the logs is saying that it is stopping and removing the container, it seems to be doing both docker stop [id] and docker rm [id] for us, sweet :)
It should be said if all we want to do is stop the containers we can do so with:

docker-compose stop

I don't know about you but at this point, I'm ready to stop using docker build, docker run, docker stop and docker rm. Docker compose seems to take care of the full life cycle :)

Docker compose showing off

Let's do a small recap so far. Docker compose takes care of the full life cycle of managing services for us. Let's try to list the most used Docker commands and what the corresponding command in Docker Compose would look like:

  • docker build becomes docker-compose build, the Docker Compose version is able to build all the services specified in docker-compose.yaml but we can also specify it to build a single service, so we can have more granular control if we want to
  • docker build + docker run becomes docker-compose up, this does a lot of things at once, if your images aren't built previously it will build them and it will also create containers from the images
  • docker stop becomes docker-compose stop, this is again a command that in Docker Compose can be used to stop all the containers or a specific one if we give it a single container as an argument
  • docker stop && docker rm becomes docker-compose down, this will bring down the containers by first stopping them and then removing them so we can start fresh

The above in itself is pretty great but what's even greater is how easy it is to keep on expanding our solution and add more and more services to it.

Building out our solution

Let's add another service, just to see how easy it is and how well it scales. We need to do the following:

  • add a new service entry in our docker-compose.yaml
  • build our image/s docker-compose build
  • run docker-compose up

Let's have a look at our docker-compose.yaml file and let's add the necessary info for our next service:

// docker-compose.yaml

version: '3'
      context: ./product-service
      - "8000:3000"
      context: ./inventory-service
        - "8001:3000"

Ok then, let's get these containers up and running, including our new service:

docker-compose up

Wait, aren't you supposed to run docker-compose build ? Well, actually we don't need to docker-compose up does it all for us, building images, creating and running containers.

alt text

CAVEAT, it's not so simple, that works fine for a first-time build + run, where no images exist previously. If you are doing a change to a service, however, that needs to be rebuilt, that would mean you need to run docker-compose build first and then you need to run docker-compose up.

Summary Part IV

Here is where we need to put a stop to the first half of covering Docker Compose, otherwise it would just be too much. We have been able to cover the motivation behind Docker Compose and we got a lightweight explanation to Microservice architecture. Furthermore, we talked about Docker versus Docker Compose and finally, we were able to contrast and compare the Docker Compose command to plain Docker commands.
Thereby we hopefully were able to show how much easier it is to use Docker Compose and specify all your services in a docker-compose.yaml file.

Learn Docker from Beginner to Advanced Part V

We will keep working on our project introduced in Part IV and in doing so we will showcase more Docker Compose features and essentially build out our project to cover everything you might possibly need.

In this part, we will cover:

  • Environment variables , now we have covered those in previous parts so this is mostly about how we set them in Docker Compose
  • Volumes , same thing with volumes, this has been covered in previous articles even though we will mention their use and how to work with them with Docker Compose
  • Networks and Databases , finally we will cover Databases and Networks, this part is a bit tricky but hopefully, we managed to explain it thoroughly

If you at any point should feel confused here is the repo this article is based on:

Environment variables

One of the things I’ve shown you in previous articles is how we can specify environment variables. Now variables can be set in the Dockerfile but we can definitely set them on the command line and thereby also in Docker Compose and specifically in docker-compose.yaml:

// docker-compose.yaml

version: '3'
     context: ./product-service
     - "8000:3000"
     - test=testvalue 
     context: ./inventory-service
   - "8001:3000"

Above we are creating an environment variable by defining environment followed by -test=testvalue, which means we create the variable test with value, testvalue.

We can easily test that this works by reading from process.env.test in our app.js file for the product-service directory.

Another way to test this is to run Docker compose and query for what environment variables are available, like so:

As you can see above we first run docker-compose ps and get the containers that are part of this Docker Compose session and then we run docker exec [container name] env to list the environment variables. A third option is to run docker exec -it [container name] bash and enter the container and use bash to echo out the variable value. There are quite a few ways to manage environment variables with Docker compose so have a read in the official docs, what else you can do.


We’ve covered volumes in an earlier part of this series and we found them to be a great way to:

  • create a persistent space , this is ideal to create log files or output from a Database that we want to remain, once we tear down and run our containers
  • turn our development environment into a Volume , the benefits of doing so meant that we could start up a container and then change our code and see those changes reflected without having to rebuild or tear down our container, a real time saver.

Create a persistent space

Let’s see how we can deal with Volumes in Docker compose:

// docker-compose.yml

version: '3'
     context: ./product-service
     - "8000:3000"
     - test=testvalue
     context: ./inventory-service
     - "8001:3000"
    - my-volume:/var/lib/data


Above we are creating a volume by the command volumes at the end of the file and on the second row we give it the name my-volume. Furthermore, in the inventory-service portion of our file, we refer to the just created volume and create a mapping to /var/lib/data which is a directory in the volume that will be persisted, through teardowns. Let’s look that it is correctly mapped:

As can be seen, by the above command, we first enter the container with docker exec followed by us navigating to our mapping directory, it is there, great :).

Let’s create a file in the data directory so we can prove that our volume mapping really works:

echo persist > persist.log

The above command creates a file persist.log with the content persist . Nothing fancy but it does create a file that we can look for after tearing down and restarting our container.

Now we can exit the container. Next, let’s recap on some useful Volume commands:

docker volume ls

The above lists all the currently mounted volumes. We can see that our created Volume is there compose-experiments_my-volume .

We can dive into more details with:

docker volume inspect compose-experiments_my-volume

Ok, so it’s giving us some details about our volume such as Mountpoint, which is where files will be persisted when we write to our volume mapping directory in the container.

Let’s now bring down our containers with:

docker-compose down

This means that the Volume should still be there so let’s bring them all up with:

docker-compose up -d

Let’s enter the container next and see if our persist.log file is there:

Oh yeah, it works.

Turn your current directory into a Volume

Ok, for this we need to add a new volume and we need to point out a directory on our computer and a place in the container that should be in sync. Your docker-compose.yaml file should look like the following:

// docker-compose.yaml

version: '3'
      context: ./product-service
      - "8000:3000"
      - test=testvalue
      - type: bind  
      source: ./product-service  
      target: /app  
      context: ./inventory-service
      - "8001:3000"
      - my-volume:/var/lib/data


The new addition is added to the product-service. We can see that we are specifying a volumes command with one entry. Let’s break down that entry:

  • type: bind , this creates a so-called bind mount, a type of volume more fit for purpose of syncing files between your local directory and your container
  • source , this is simply where your files are, as you can see we are pointing out ./product-service. This means that as soon as we change a file under that directory Docker will pick up on it.
  • target , this is the directory in the container, source and target will now be in sync we do a change in source, the same change will happen to target
Networks and databases

Ok then, this is the last part we aim to cover in this article. Let’s start with databases. All major vendors have a Docker image like Sql Server, Postgres, MySQL and so on. This means we don’t need to do the build-step to get them up and running but we do need to set things like environment variables and of course open up ports so we can interact with them. Let’s have a look at how we can add a MySQL database to our solution, that is our docker-compose.yml file.

Adding a database

Adding a database to docker-compose.yaml is about adding an already premade image. Lucky for us MySQL already provides a ready-made one. To add it we just need to add another entry under services: like so:

// docker-compose.yaml

  image: mysql
    - MYSQL_ROOT_PASSWORD=complexpassword
    - 8002:3306

Let’s break it down:

  • product-db is the name of our new service entry, we choose this name
  • image is a new command that we are using instead of build , we use this when the image is already built, which is true for most databases
  • environment , most databases will need to have a certain number of variables set to be able to connect to them like username, password and potentially the name of the database, this varies per type of database. In this case, we set MYSQL_ROOT_PASSWORD so we instruct the MySQL instance what the password is for the root user. We should consider creating a number of users with varying access levels
  • ports, this is exposing the ports that will be open and thereby this is our entrance in for talking to the database. By typing 8002:3306 we say that the container's port 3306 should be mapped to the external port 8002

Let’s see if we can get the database and the rest of our services up and running:

docker-compose up -d

Let’s verify with:

docker-compose ps OR docker ps

Looks, good, our database service experiments_product-db_1 seems to be up and running on port 8002. Let’s see if we can connect to the database next. The below command will connect us to the database, fingers crossed ;)

mysql -uroot -pcomplexpassword -h -P 8002

and the winner is…

Great, we did it. Next up let’s see if we can update one of our services to connect to the database.

Connecting to the database

There are three major ways we could be connecting to the database:

  • using docker client, we’ve tried this one already with mysql -uroot -pcomplexpassword -h -P 8002
  • enter our container, we do this using docker exec -it [name of container] bash and then we type mysql inside of the container
  • connecting through our app, this is what we will look at next using the NPM library mysql

We will focus on the third choice, connecting to a database through our app. The database and the app will exist in different containers. So how do we get them to connect? The answer is:

  • needs to be in the same network , for two containers to talk to each other they need to be in the same network
  • database needs to be ready , it takes a while to start up a database and for your app to be able to talk to the database you need to ensure the database have started up correctly, this was a fun/interesting/painful til I figured it out, so don’t worry I got you, we will succeed :)
  • create a connection object , ensure we set up the connection object correctly in app.js for product-service

Let’s start with the first item here. How do we get the database and the container into the same network? Easy, we create a network and we place each container in that network. Let’s show this in docker-compose.yaml:

// excerpt from docker-compose.yaml


We need to assign this network to each service, like so:

// excerpt from docker-compose.yaml

      - products

Now, for the second bullet point, how do we know that the database is finished initializing? Well, we do have a property called depends_on, with that property, we are able to specify that one container should wait for another container to start up first. That means we can specify it like so:

// excerpt from docker-compose.yaml

   depends_on: db
   image: mysql

Great so that solves it or? Nope nope nope, hold your horses:

So in Docker compose version 2 there used to be an alternative where we could check for a service’s health, if health was good we could process to spin up our container. It looked like so:

   condition: service_healthy

This meant that we could wait for a database to initialize fully. This was not to last though, in version 3 this option is gone. Here is doc page that explains why, control startup and shutdown order. The gist of it is that now it’s up to us to find out when our database is done and ready to connect to. Docker suggests several scripts for this:

All these scripts have one thing in common, the idea is to listen to a specific host and port and when that replies back, then we run our app. So what do we need to do to make that work? Well let’s pick one of these scripts, namely wait-for-it and let’s list what we need to do:

  • copy this script into your service container
  • give the script execution rights
  • instruct the docker file to run the script with database host and port as args and then to run the service once the script OKs it

Let’s start with copying the script from GitHub into our product-service directory so it now looks like this:


Now let’s open up the Dockerfile and add the following:

// Dockerfile

FROM node:latest



COPY . .

RUN npm install



RUN chmod +x /

Above we are copying the file to our container and on the line below we are giving it execution rights. Worth noting is how we also remove the ENTRYPOINT from our Dockerfile, we will instead instruct the container to start from the docker-compose.yaml file. Let’s have a look at said file next:

// excerpt from docker-compose.yaml

 command: ["/", "db:8002", "--", "npm", "start"]
 // definition of db service below

Above we are telling it to run the file and as an argument use db:8002 and after it gets a satisfactory response then we can go on to run npm start which will then start up our service. That sounds nice, will it work?

For full disclosure let’s show our full docker-compose.yaml file:

version: '3.3'
        - "db"
        context: ./product-service
      command: ["/", "db:8002", "--", "npm", "start"]
      - "8000:3000"
      - test=testvalue
      - DATABASE_PASSWORD=complexpassword
      - DATABASE_HOST=db
      - type: bind
      source: ./product-service
      target: /app
      - products
     build: ./product-db
       restart: always
       - "MYSQL_ROOT_PASSWORD=complexpassword"
       - "MYSQL_DATABASE=Products"
       - "8002:3306"
       - products
       context: ./inventory-service
       - "8001:3000"
       - my-volume:/var/lib/data



Ok, so to recap we placed product-service and db in the network products and we downloaded the script and we told it to run before we spun up the app and in the process listen for the host and port of the database that would respond as soon as the database was ready for action. That means we have one step left to do, we need to adjust the app.js file of the product-service, so let’s open that file up:

// app.js

const express = require('express')
const mysql = require('mysql');
const app = express()
const port = process.env.PORT || 3000;
const test = process.env.test;

let attempts = 0;

const seconds = 1000;

function connect() {

  console.log('password', process.env.DATABASE_PASSWORD);
  console.log('host', process.env.DATABASE_HOST);
  console.log(`attempting to connect to DB time: ${attempts}`);

 const con = mysql.createConnection({  
   host: process.env.DATABASE_HOST,  
   user: "root",  
   password: process.env.DATABASE_PASSWORD,  
   database: 'Products'  

  con.connect(function (err) {  
   if (err) {  
     console.log("Error", err);  
     setTimeout(connect, 30 * seconds);  
   } else {  


  conn.on('error', function(err) {  
    if(err) {  
      console.log('shit happened :)');  


app.get('/', (req, res) => res.send(`Hello product service, changed ${test}`))

app.listen(port, () => console.log(`Example app listening on port ${port}!`))

Above we can see that we have defined a connect() method that creates a connection by invoking createConnection() with an object as an argument. That input argument needs to know host, user, password and database. That seems perfectly reasonable. We also add a bit of logic to the connect() method call namely we invoke setTimeout(), this means that it will attempt to do another connection after 30 seconds. Now, because we use that functionality isn’t really needed but we could rely on application code alone to ensure we get a connection. However, we also call conn.on('error') and the reason for doing so is that we can loose a connection and we should be good citizens and ensure we can get that connection back.

Anyway, we’ve done everything in our power, but because we’ve introduced changes to Dockerfile let’s rebuild everything with docker-compose build and then let’s bring everything up with:

docker-compose up


There it is, Houston WE HAVE A CONNECTION, or as my friend Barney likes to put it:

Setting up the database — fill it with structure and data

Ok, maybe you were wondering about the way we built the service db ? That part of docker-compose.yaml looked like this:

// docker-compose.yaml

  build: ./product-db
  restart: always
    - "MYSQL_ROOT_PASSWORD=complexpassword"
    - "MYSQL_DATABASE=Products"
    - "8002:3306"
    - products

I would you to look at build especially. We mentioned at the beginning of this article that we can pull down ready-made images of databases. That statement is still true but by creating our own Dockerfile for this, we can not only specify the database we want but we can also run commands like creating our database structure and insert seed data. Let’s have a close look at the directory product-db:


Ok, we have a Dockerfile, let’s look at that first:

// Dockerfile

FROM mysql:5.6

ADD init.sql /docker-entrypoint-initdb.d

We specify that init.sql should be copied and renamed to docker-entrypoint-initdb.d which means it will run the first thing that happens. Great, what about the content of init.sql?

// init.sql


# create tables here
# add seed data inserts here

As you can see it doesn’t contain much for the moment but we can definitely expand it, which is important.

Summary Part V

We have now come full circle in this series, we have explained everything from the beginning. The basic concepts, the core commands, how to deal with volumes and databases and also how to be even more effective with Docker Compose. This series will continue of course and go deeper and deeper into Docker but this should hopefully take you a bit on the way. Thanks for reading this far.