Barbara
Posted on January 7, 2022
When we work with Spark, we usually want to prototype first to see if everything works as expected, before we spin up big machines.
I spent an afternoon googling and starting and stopping the Docker container before I finally got a few lines of configuration right. So I want to share my basic local setup here; maybe it will help someone save some time.
When looking for a Docker image with Spark and Jupyter, we find the pyspark-notebook. In my case I need to access AWS, so I need some additional libraries in the Docker image.
To add them, I created a new Dockerfile based on the pyspark-notebook.
The additional libraries needed are boto3 for AWS and python-dotenv to access environment variables.
I decided to install boto3 with apt-get, as this installs it at the operating-system level. Make sure to add -y: if the operating system asks a question during the install process, we answer with yes.
python-dotenv is added via a requirements.txt, so it will be installed via pip, the Python package manager.
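For this setup the requirements.txt only needs a single line:

python-dotenv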
Normally you need a token to access the notebooks, but when we develop locally, we want to access the Jupyter notebook quickly and stay on the same site, without having to look for the new token every time we change something. So we need a custom configuration for that:
{
  "NotebookApp": {
    "allow_root": true,
    "token": ""
  }
}
In the Dockerfile we copy everything we need into the /home/jovyan/ directory. After some more googling I found out that the user jovyan stands for a Jupyter-like environment. Just in case you were also wondering.
The final Dockerfile looks like this:
FROM jupyter/pyspark-notebook
USER root
# add needed packages
RUN apt-get update && apt-get install python3-boto3 -y
# Install Python requirements
COPY requirements.txt /home/jovyan/
RUN pip install -r /home/jovyan/requirements.txt
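# Copy the custom Jupyter configuration (token disabled, see above)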
COPY jupyter_lab_config.json /home/jovyan/
In the docker-compose.yaml we need to:
- map the ports,
- map the volumes to save the notebooks locally, otherwise everything would be lost once we shut down the container,
- tell Docker where the .env file is located (a sketch of such a file follows below),
- tell Docker to build the Dockerfile in the same folder, instead of using an image.
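A minimal .env sketch with placeholder values could look like this (the variable names are the standard AWS ones that boto3 reads from the environment):

AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
AWS_DEFAULT_REGION=eu-central-1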
The final docker-compose.yaml looks like this:
version: "3.7"
services:
  # jupyterlab with pyspark
  pyspark:
    # image: jupyter/pyspark-notebook
    build: .
    env_file:
      - .env
    environment:
      JUPYTER_ENABLE_LAB: "yes"
    ports:
      - "8888:8888"
    volumes:
      - ./data:/home/jovyan/work
# docker run --rm -p 10000:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
To start the container, use docker-compose up. If you changed something in the config, use docker-compose up --force-recreate --build to make sure the changes are built.
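Once the container is running, a quick smoke test in a notebook cell could look like the following minimal sketch. It assumes your .env defines the standard AWS credential variables and that your account has at least one S3 bucket; the bucket listing and the appName are just illustrations:

import boto3
from dotenv import load_dotenv
from pyspark.sql import SparkSession

# Load variables from the .env file. docker-compose already injects them
# via env_file, so this mainly matters when running outside the container.
load_dotenv()

# boto3 reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment.
s3 = boto3.client("s3")
print([bucket["Name"] for bucket in s3.list_buckets()["Buckets"]])

# Verify that the local Spark session starts.
spark = SparkSession.builder.appName("local-prototype").getOrCreate()
spark.range(5).show()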
Have fun.
You can also find the code here.