Mutable Ideas

Notes and ideas about Java, Scala, Big Data, NoSQL, Quality and Software Deploy

A Data Science Toolkit Inside a Docker Image, Build It Once, Run Everywhere

If you have never heard of Jupyter Notebook, I highly recommend checking it out. It has been my primary platform for building reports and data-driven case studies. In this post I’d like to show how I create a simple, isolated environment for running JupyterLab with a Bash script and Docker.

Recently Jupyter Notebook received a major overhaul and became JupyterLab. It is currently in beta, but the new platform looks fresh and very powerful. Here is what it is about:

JupyterLab is an interactive development environment for working with notebooks, code and data … JupyterLab has full support for Jupyter notebooks.

The major downside of this platform is its dependency on specific Python packages and versions. That is not a problem per se, but managing Python packages on my Mac can be annoying. Of course I could use VirtualEnv or Conda, but to me that felt a little bit off.

Enter Docker

I’ve been using Docker to isolate environments and guarantee reproducible builds and consistency among systems, even between our production server and my MacBook. Notebooks that depend on specific packages and a specific Python version feel like exactly the kind of problem Docker can help with.

I love Docker and the simplicity it brings to the table - once you understand it and learn a few commands and best practices.

If you don’t have Docker installed on your machine, follow the installation steps for Mac or Windows. Note that the Bash script works only on Mac.

How it works

Check out this repository; you’ll find the launcher Bash script and a requirements.txt file.

The launcher is quite simple. It registers a function that:

  1. Checks the last-modified attribute of requirements.txt.
  2. Converts it to a human-readable date, which becomes the image tag.
  3. Looks locally for an image with that same tag.
  4. If no such image is available locally, builds it using the new requirements.txt.
  5. Once the image is available locally, launches it, serving on http://localhost:8888 and mapping the current folder as the notebook folder.
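The steps above can be sketched roughly as follows. This is not the actual launcher from the repository, only a minimal illustration of the idea. The function names, the NOTEBOOK_HOME variable, the image name and the tag format are all assumptions, and it presumes a Dockerfile sits next to requirements.txt. The /root/notebook mount point is taken from the server log shown later in this post.

```shell
# Sketch only: requirements_tag, notebook_sketch, NOTEBOOK_HOME and the
# image name are hypothetical; they are not taken from the real launcher.

# Steps 1-2: turn the file's last-modified time into a readable tag.
requirements_tag() {
  local file="$1" epoch
  case "$(uname -s)" in
    Darwin|*BSD)
      # BSD/macOS: stat -f %m prints the mtime in epoch seconds
      epoch=$(stat -f %m "$file") || return 1
      date -u -r "$epoch" +%Y%m%d-%H%M%S
      ;;
    *)
      # GNU coreutils (Linux): stat -c %Y prints the mtime in epoch seconds
      epoch=$(stat -c %Y "$file") || return 1
      date -u -d "@${epoch}" +%Y%m%d-%H%M%S
      ;;
  esac
}

notebook_sketch() {
  # Assumed location of the cloned repository (Dockerfile + requirements.txt)
  local home="${NOTEBOOK_HOME:-$HOME/.bashrc.d/notebook}" tag image
  tag=$(requirements_tag "$home/requirements.txt") || return 1
  image="jupyterlab-notebook:${tag}"   # hypothetical image name

  # Steps 3-4: build only when no local image carries this tag.
  if [ -z "$(docker images -q "$image")" ]; then
    docker build -t "$image" "$home" || return 1
  fi

  # Step 5: publish port 8888 and mount the current folder for notebooks.
  docker run --rm -it -p 8888:8888 -v "$PWD":/root/notebook "$image"
}
```

Because the tag is derived from the modification time, touching or editing requirements.txt yields a new tag, which in turn forces a rebuild on the next launch. Older images are left behind and need to be pruned by hand.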

Install and Usage

Create a local folder and clone the repository. I have several similar hacks in this folder that I intend to share in the future.

mkdir -p ~/.bashrc.d
cd ~/.bashrc.d
git clone git@github.com:arjones/jupyterlab-notebook.git notebook

Add this to your ~/.bashrc:

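# Source every readable *.rc file (glob directly instead of parsing ls output)
for FILE in ~/.bashrc.d/*.rc; do
  if [ -r "${FILE}" ]; then source "${FILE}"; fi
done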

Now reload and you’re good to go:

source ~/.bashrc

# switch to your working dir and run it
cd ~/Projects/analytics/
notebook

After a few seconds you will see this:

[I 21:38:37.536 LabApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[W 21:38:37.971 LabApp] All authentication is disabled.  Anyone who can connect to this server will be able to run code.
[I 21:38:37.998 LabApp] JupyterLab beta preview extension loaded from /usr/local/lib/python3.6/site-packages/jupyterlab
[I 21:38:37.998 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[W 21:38:38.001 LabApp] JupyterLab server extension not enabled, manually loading...
[I 21:38:38.002 LabApp] JupyterLab beta preview extension loaded from /usr/local/lib/python3.6/site-packages/jupyterlab
[I 21:38:38.002 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 21:38:38.007 LabApp] Serving notebooks from local directory: /root/notebook
[I 21:38:38.008 LabApp] 0 active kernels
[I 21:38:38.008 LabApp] The Jupyter Notebook is running at:
[I 21:38:38.008 LabApp] http://0.0.0.0:8888/
[I 21:38:38.008 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

Open your browser at http://localhost:8888/ and happy hacking :D

Improvements & variations

My idea was to create something as transparent as possible. I like that the notebook command feels like locally installed software, but there is room for improvement, for example:

  • Better error handling when Docker engine is not running.
  • Find another tagging strategy: instead of the stat result, a git tag could be a candidate. PR anyone? 🤓
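As a starting point for both ideas, here is a small sketch. The function names are hypothetical and none of this exists in the launcher today. It relies on docker info exiting with a non-zero status when the daemon cannot be reached, and on git describe to derive a tag from the repository instead of the file timestamp.

```shell
# Sketch only: docker_is_running, require_docker and git_image_tag are
# hypothetical helpers, not part of the current launcher.

# Succeeds only if the docker CLI exists and the daemon answers.
docker_is_running() {
  command -v docker >/dev/null 2>&1 || return 1
  docker info >/dev/null 2>&1
}

# Prints a friendly message instead of a raw Docker error.
require_docker() {
  if ! docker_is_running; then
    echo "Docker does not seem to be running; start it and try again." >&2
    return 1
  fi
}

# Alternative tag: nearest git tag of the repository in directory $1,
# falling back to the abbreviated commit hash when no tag exists.
git_image_tag() {
  git -C "$1" describe --tags --always
}
```

Calling require_docker at the top of the launcher function would stop early with a readable message. One caveat of the git-based tag is that uncommitted edits to requirements.txt would not change it.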
