Containers for Machine Learning, from scratch to Kubernetes
Blair Hudson
Posted on September 16, 2019
This article is for all those who keep hearing about the magical concept of containers from the world of DevOps, and wonder what it might have to do with the equally magical (but perhaps more familiar) concept of machine learning from the world of Data Science.
Well, wonder no more: in this article we're going to take a look at using containers for machine learning from scratch, why they actually make such a good match, and how to run them at scale in both the lightweight Docker Swarm and its popular alternative, Kubernetes!
(No, container people... not FROM scratch, although you can read all about that in my follow-on post.)
A primer on machine learning in Python
If you've been working with Python for data science for a while, you will already be well-acquainted with tools like Jupyter, Scikit-Learn, Pandas and XGBoost. If not, you'll just have to take my word for it that these are some of the best open source projects out there for machine learning right now.
For this article, we're going to pull some sample data from everyone's favourite online data science community, Kaggle.
Assuming you already have Python 3 installed, let's go ahead and install our favourite tools (though you'll probably have most of these already):
pip install jupyterlab pandas scikit-learn xgboost kaggle
(If you’ve had any troubles installing Python 3 or the above package requirements you might like to skip straight to the next section.)
Once we've configured our local Kaggle credentials, change to a suitable directory and download and unzip the bank loan prediction dataset (or any other dataset you prefer)!
kaggle datasets download -d omkar5/dataset-for-bank-loan-prediction
unzip dataset-for-bank-loan-prediction.zip
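(If you haven't used the Kaggle CLI before, it typically expects an API token saved to ~/.kaggle/kaggle.json. As a rough sketch, with YOUR_USERNAME and YOUR_KEY standing in for the credentials from your Kaggle account page, the setup looks something like this:)
mkdir -p ~/.kaggle
echo '{"username":"YOUR_USERNAME","key":"YOUR_KEY"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json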
With our data ready to go, let's run Jupyter Lab and start working on our demonstration model. Use the command jupyter lab
to start the service, which will open http://localhost:8888
in your browser.
Create a new notebook from the launcher, and call it notebook.ipynb
. You can copy the following code into each cell of your notebook.
First, we read the Kaggle data into a DataFrame object.
import pandas as pd
path_in = './credit_train.csv'
print('reading csv from %s' % path_in)
df = pd.read_csv(path_in)
Now we quickly divide our DataFrame into features and a target (but don't try this at home...)
def prep_data(df):
    X = df.drop(['Number of Credit Problems'], axis=1).select_dtypes(include=['number', 'bool'])
    y = df['Number of Credit Problems'] > 1
    return X, y
print("preparing data")
X_train, y_train = prep_data(df)
With our data ready, let's fit an XGBoost classifier with all of the default hyper-parameters.
from xgboost import XGBClassifier
model = XGBClassifier()
print("training model")
model.fit(X_train, y_train)
When that finishes running, we now have ... a model? Admittedly not a very good one, but this article is about containers, not tuning XGBoost! Let's save our model so we can use it later on if necessary.
import joblib
path_out = './model.joblib'
print("dumping trained model to %s" % path_out)
joblib.dump(model, path_out)
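If you ever want to sanity-check the saved file, here is a minimal sketch of loading the model back and scoring a few rows. It assumes you run it in the same notebook, so the prep_data function and credit_train.csv from above are still available:
import joblib
import pandas as pd

# load the trained model back from disk
model = joblib.load('./model.joblib')

# reuse the same feature preparation as before
X_new, _ = prep_data(pd.read_csv('./credit_train.csv'))

# predicted class probabilities for the first few rows
print(model.predict_proba(X_new)[:5])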
Using Docker for managing your data science environment and executing notebooks
So we just did all of that work to set up our Jupyter environment with the right packages. Depending on our operating system and previous installations we may have even had some unexpected errors. (Did anyone else fail to install XGBoost the first time?) Hopefully you found a workaround for installing everything and I hope you took notes of the process — since we’ll want to be able to repeat that when we take our machine learning project to production later...
Ok, here comes the juicy part.
Docker solves this problem for us by allowing us to specify our entire environment (including the operating system and all the installation steps) as a reproducible script, so that we can easily move our machine learning project around without having to resolve the installation challenges ever again!
You'll need to install Docker. Luckily Docker Desktop for Mac and Windows includes everything we need for this tutorial. Linux users can find Docker in their favourite package manager — but you might need to configure the official Docker repository to get the latest version.
Once installed, make sure the Docker daemon is running, then run your first container!
This command will pull the official CentOS 7 Docker image and run an interactive terminal session. (Why CentOS 7? Because of its similarities to Amazon Linux and Red Hat Enterprise Linux, which you'll often encounter in enterprise environments. With some tweaking of the yum installation commands, you could use any base operating system.)
docker run -it --rm centos:7
- -it tells Docker to make your container interactive (as opposed to detached) and attaches a tty (terminal) session so you can actually interact with it
- --rm tells Docker to remove your container as soon as we stop it with ctrl-c
Now we want to find the right commands to install Python, Jupyter and our other packages, and as we do we'll write them into a Dockerfile to develop our new container on top of centos:7.
Create a new file and name it Dockerfile; the contents should look a little something like this:
FROM centos:7
# install python and pip
RUN yum install -y epel-release
RUN yum install -y python36-devel python36-pip
# install our packages
RUN pip3 install jupyterlab kaggle pandas scikit-learn xgboost
# turns out xgboost needs this
RUN yum install -y libgomp
# create a user to run jupyterlab
RUN adduser jupyter
# switch to our user and their home dir
USER jupyter
WORKDIR /home/jupyter
# tell docker to listen on port 8888 and run jupyterlab
EXPOSE 8888
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888"]
To build your new container, run this command from the directory where your Dockerfile exists:
docker build -t jupyter .
This will run each of the commands in the Dockerfile except for the last CMD instruction, which is the default command to be executed when you launch the container, and then tag the built image with the name jupyter.
Once the build is complete, we can run a container based on our new jupyter image using the default CMD we provided (which will hopefully start our Jupyter server!):
docker run -it --rm jupyter
Done? Not quite.
So it turns out we also need to map the container port to our host computer so we can reach it in the browser. While we're at it, let's also map the current directory to the container user's home directory so we can access our files when Jupyter is launched:
docker run -it --rm -p "8888:8888" -v "$(pwd):/home/jupyter" jupyter
- -p "HOST_PORT:CONTAINER_PORT" tells Docker to map a port on our host computer to a port on the container (in this case 8888 to 8888, but they need not be the same)
- -v "/host/path/or/file:/container/path/or/file" tells Docker to map a path or file on our host so that the container can access it (and $(pwd) simply outputs the current host path)
Using the same notebook cell code as above, write and execute a new notebook.ipynb
using the "containerised" Jupyter service.
Now we need to automate our notebook execution. In the Jupyter terminal prompt enter:
jupyter nbconvert --to notebook --inplace --execute notebook.ipynb
This calls a Jupyter utility to execute our notebook and update it in place, so any cell outputs/tables/charts are refreshed, along with any files the code writes (like our saved model).
When you're done, Ctrl-C a few times to quit Jupyter (and in doing so, this will exit and remove our container since we set the --rm
option in the previous docker run
command).
To make things easier to automate, it turns out we can override the default CMD without creating a new Dockerfile. With this, we can skip running JupyterLab and instead run our nbconvert
command:
docker run -it --rm -p "8888:8888" -v "$(pwd):/home/jupyter" jupyter jupyter nbconvert --to notebook --inplace --execute notebook.ipynb
Notice that we simply specify our custom command (CMD) by specifying the command and any arguments at the end of our docker run
command. (Note the first jupyter is the image tag, while the second is the command to trigger our process.)
For the curious, this is the same as modifying our Dockerfile CMD to the following:
#...
CMD ["jupyter", "nbconvert", "--to", "notebook", "--inplace", "--execute", "notebook.ipynb"]
Once the container has exited, check model.joblib, which should have been modified seconds ago.
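One quick way to check on macOS or Linux is to look at the file's modification time:
ls -l model.joblib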
Success!
Scaling your environments with Docker Swarm
Running a container on your computer is one thing — but what if you want to speed up your machine learning workflows beyond what your computer alone can achieve? What if you want to run many of these services at the same time? What if all your data is stored in a remote environment and you don't want to transmit gigabytes of data over the Internet?
There are loads of great reasons why running containers in a cluster environment is beneficial, but whatever the reason, I'm going to show you just how easy this is by introducing Docker Swarm.
Conveniently, Docker Swarm is a built-in capability of Docker, so to keep following this article you don't need to install anything else. Of course, in reality you would more likely choose to provision multiple compute resources in the cloud and initialise and join your cluster there. In fact, assuming network connectivity between them, you could even set up a cluster that spans multiple cloud providers! (How's that for high availability!? 👊)
To start a single node cluster, run docker swarm init
. This designates that host as a manager node in your 'swarm', which means it is responsible for scheduling services to run across all of the nodes in your cluster. If your manager node goes offline you lose access to your cluster, so if resiliency is important it's good practice to run 3 or 5 managers to maintain consensus if 1 or 2 of them fail.
This command will output another command starting with docker swarm join
which when run on another host, joins that host as a worker node in your swarm. You can run this on as many worker nodes as you want, or even in an auto-scaling arrangement to ensure your cluster always has enough capacity — but we won't need it for now.
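To see which hosts have joined your swarm (and which are managers), you can list them from a manager node with:
docker node ls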
To run Jupyter as a service, Docker Swarm has a special command which is similar to docker run
above. The key difference is that this publishes (exposes) port 8888 across every node in your cluster, regardless of where the container itself is actually running. This means if you send traffic to port 8888 on any node in your cluster, Docker will automatically forward it to the correct host like magic! In certain use cases (such as stateless REST APIs or static application front-ends), you can use this to automatically load balance your services. Cool!
On a manager node in your cluster (which is your computer for now), run
docker service create --name jupyter --mount type=bind,source=$(pwd),destination=/home/jupyter --publish 8888:8888 jupyter
- --name gives the service a nickname to easily reference it later (for example, to stop it)
- --mount allows you to bind data into the container
- --publish exposes the specified port across the cluster
(Note that in this case bind-mounting a host directory will work since we only have a single-node swarm. In multi-node clusters this won't work so well unless you can guarantee that the data at the mount point on each host is in sync. How to achieve this is not discussed here.)
After running the command, the service will output various status messages until it converges to a stable state (which basically means that no errors have occurred for 5 seconds once the container command is executed).
You can run docker service logs -f jupyter
to check the logs (I told you that naming our service would come in handy), and if you want to access Jupyter in your browser, you'll need to do this to retrieve the access token.
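One rough way to pull the token straight out of the logs (the exact log line format may vary between Jupyter versions) is:
docker service logs jupyter 2>&1 | grep token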
Now you can remove the service by running
docker service rm jupyter
What about our notebook execution? Try running this:
docker service create --name jupyter --mount type=bind,source=$(pwd),destination=/home/jupyter --restart-condition none jupyter jupyter nbconvert --to notebook --inplace --execute notebook.ipynb
- --restart-condition none is important here to prevent your container from restarting once it has finished executing
- jupyter jupyter [params] is the name of the image, followed by the custom command to run and its subsequent parameters (nbconvert ...)
These commands are getting pretty complex now, so it might be a good idea to start documenting them so we can easily reproduce our services later on. Luckily we have Docker Compose, a configuration-based way of doing just that. Here is what the first service command looks like as a compose.yaml file:
version: "3.3"
services:
  jupyter:
    image: jupyter
    volumes:
      - ${PWD}:/home/jupyter
    ports:
      - "8888:8888"
If you save this, you can run it as a "stack" of services (even though it only describes one right now), using the command:
docker stack deploy --compose-file compose.yaml jupyter
Much neater. It turns out you can include many related services in a single Docker Compose stack, and when you deploy one the services are named stackname_servicename, so to retrieve the logs enter:
docker service logs -f jupyter_jupyter
This is the Docker Compose configuration for running our Jupyter notebook. Note the introduction of the restart_policy
. This is super important for running our job: we expect it to finish, and by default Docker Swarm will automatically restart stopped containers, which would execute your notebook over and over again.
version: "3.3"
services:
  jupyter:
    image: jupyter
    deploy:
      restart_policy:
        condition: none
    volumes:
      - ${PWD}:/home/jupyter
    command: jupyter nbconvert --to notebook --inplace --execute notebook.ipynb
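If you save this variant to its own file (say compose-job.yaml, a name chosen here just for illustration), you can deploy it in exactly the same way as the first stack:
docker stack deploy --compose-file compose-job.yaml jupyter-job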
Getting started with Kubernetes
Docker Desktop for Mac and Windows also includes a single-node Kubernetes cluster, so in the settings for Docker Desktop you'll want to switch that on. Starting up Kubernetes can take a while, since it is a pretty heavyweight cluster designed for running massive workloads. Think thousands and thousands of containers at once!
In practice, you'll want to configure your Kubernetes cluster over multiple hosts, and with the introduction of tools like kubeadm
that process is similar to configuring Docker Swarm as we did earlier. We won't be discussing setting up Kubernetes any further in this article, but if you're interested you can read more about kubeadm
here. If you are planning to use Kubernetes, you might also consider using one of the cloud vendor managed services such as AWS Elastic Kubernetes Service or Google Kubernetes Engine on Google Cloud.
In recent versions of Docker and Kubernetes, you can actually deploy a Docker stack straight to Kubernetes, using the same Docker Compose files we created earlier! (Though not without some gotchas, such as the convenient bind-mounted host directory we deployed without fear earlier.)
To target the locally configured Kubernetes cluster, simply update your command to add --orchestrator kubernetes
:
docker stack deploy --compose-file compose.yaml --orchestrator kubernetes jupyter
This will deploy a Kubernetes stack just as it deployed a Docker Swarm stack, containing your services (no pun intended). In Kubernetes, the rough equivalent of a Docker Swarm "service" is a "pod".
To see what pods are running, and to confirm that our Jupyter stack is one of them, just run this and take note of the exact name of your Jupyter pod (such as jupyter-54f889fdf6-gcshl
).
kubectl get pods
As usual you'll need to grab the Jupyter token to access your notebooks, and the equivalent command to access the logs is below. Note that you'll need to use the exact name of the pod from the above command.
kubectl logs -f jupyter-54f889fdf6-gcshl
And when you're all done with Jupyter on Kubernetes, you can tear down the stack with:
kubectl delete stack jupyter