JupyterHub

JupyterHub#

Introduction to JupyterHub#

Jupyterhub is a web application that allows users to analyze and create reports on warehouse data (or a number of data sources).

Analyses on JupyterHub are accomplished using notebooks, which allow users to mix narrative with analysis code.

You can access JuypterHub using this link: notebooks.calitp.org.

Using JupyterHub#

For Python users, we have deployed a cloud-based instance of JupyterHub to make creating, using, and sharing notebooks easy.

This avoids the need to set up a local environment, provides dedicated storage, and allows you to push to GitHub.

Logging in to JupyterHub#

JupyterHub currently lives at notebooks.calitp.org.

Note: you need be added to the Cal-ITP organization on GitHub to use JupyterHub. Tag Evan Siroky on the Data Analysis channel on Teams to ask for access.

Default User vs Power User#

Default User#

Designed for general use and is ideal for less resource-intensive tasks. It’s a good starting point for most users who don’t expect to run very large, memory-hungry jobs.

Default User profile offers quick availability since it uses less memory and can allocate a smaller node, allowing you to start tasks faster. However, if your task grows in memory usage over time, it may exceed the node’s capacity, potentially causing the system to terminate your job. This makes the Default profile best for smaller or medium-sized tasks that don’t require a lot of memory. If your workload exceeds these limits, you might experience instability or crashes.

Power User#

Intended for more demanding, memory-intensive tasks that require more resources upfront. This profile is suitable for workloads that have higher memory requirements or are expected to grow during execution.

Power User profile allocates a full node or a significant portion of resources to ensure your job has enough memory and computational power, avoiding crashes or delays. However, this comes with a longer wait time as the system needs to provision a new node for you. Once it’s ready, you’ll have all the resources necessary for memory-intensive tasks like large datasets or simulations. The Power User profile is ideal for jobs that might be unstable or crash on the Default profile due to higher resource demands. Additionally, it offers scalability—if your task requires more resources than the initial node can provide, the system will automatically spin up additional nodes to meet the demand.

Connecting to the Warehouse#

Connecting to the warehouse requires a bit of setup after logging in to JupyterHub, but allows users to query data in the warehouse directly.

See the screencast below for a walkthrough of a similar process:

Copy the /iac/login.json from data-infra or the data-analyses repo to the root of your environment. Then set up this alias in your .bash_profile:

# An alias to run application-default login
alias glogin='gcloud auth application-default login --login-config=$HOME/iac/login.json'

Now when you type in glogin it should initiate a login script.

If you work in jupyterhub, consider modifying your .bash_profile further with code from this gist. This will make it attempt to login automatically if you open terminal.

If you are developing in the warehouse, you may need to set your project to staging with:

gcloud config set project cal-itp-data-infra-staging

For work in prod:

gcloud config set project cal-itp-data-infra

If you want to view cloud resources use this link.

If you are still not able to connect, make sure you have the suite of permissions associated with other analysts.

Increasing the Query Limit#

By default, there is a query limit set within the Jupyter Notebook. Most queries should be within that limit, and running into DatabaseError: 500 Query exceeded limit for bytes billed should be a red flag to investigate whether such a large query is needed for the analysis. To increase the query limit, add and execute the following in your notebook:

from calitp_data_analysis.tables import tbls

import os
os.environ["CALITP_BQ_MAX_BYTES"] = str(20_000_000_000)

tbls._init()

Increasing the Storage Limit#

By default, new JupyterHub instances are subject to a 10GB storage limit. This setting comes from the underlying infrastructure configuration, and requires some interaction with Kubernetes Engine in Google Cloud to modify.

If a JupyterHub user experiences an error indicating no space left on device or similar, their provisioned storage likely needs to be increased. This can be done from within the Storage section of the Google Kubernetes Engine web UI. Click into the “claim-[username]” entry associated with the user (not the “pvc-[abc123]” persistent volume associated with that entry), navigate to the “YAML” tab, and change the storage resource request under spec and the storage capacity limit under status.

After making the configuration change in GKE, shut down and restart the user’s JupyterHub instance. If the first restart attempt times out, try again once or twice - it can take a moment for the scaleup to complete and properly link the storage volume to the JupyterHub instance.

100GB should generally be more than enough for a given user - if somebody’s storage has already been set to 100GB and they hit a space limit again, that may indicate a need to clean up past work rather than a need to increase available storage.

Querying with SQL in JupyterHub#

JupyterHub makes it easy to query SQL in the notebooks.

To query SQL, simply import the below at the top of your notebook:

import calitp_data_analysis.magics

And add the following to the top of any cell block that you would like to query SQL in:

%%sql

Example:

import calitp_data_analysis.magics

%%sql

SELECT
    COUNT(*)
FROM `mart_gtfs.dim_schedule_feeds`
WHERE
    key = "db58891de4281f965b4e7745675415ab"
LIMIT 10

Saving Code to Github#

Use this link to navigate to the Saving Code section of the docs to learn how to commit code to GitHub from the Jupyter terminal. Once there, you will need to complete the instructions in the following sections:

Environment Variables#

Sometimes if data access is expensive, or if there is sensitive data, then accessing it will require some sort of credentials (which may take the form of passwords or tokens).

There is a fundamental tension between data access restrictions and analysis reproducibility. If credentials are required, then an analysis is not reproducible out-of-the-box. However, including these credentials in scripts and notebooks is a security risk.

Most projects should store the authentication credentials in environment variables, which can then be read by scripts and notebooks. The environment variables that are required for an analysis to work should be clearly documented.

Analysts should store their credentials in a _env file, a slight variation of the typical .env file, since the .env won’t show up in the JupyterHub filesystem.

Some credentials that need to be stored within the _env file may include GitHub API key, Census API key, Airtable API key, etc. Store them in this format:

GITHUB_API_KEY=ABCDEFG123456789
CENSUS_API_KEY=ABCDEFG123456789
AIRTABLE_API_KEY=ABCDEFG123456789

To pass these credentials in a Jupyter Notebook:

import dotenv
import os

# Load the env file
dotenv.load_dotenv("_env")

# Import the credential (without exposing the password!)
GITHUB_API_KEY = os.environ["GITHUB_API_KEY"]

Jupyter Notebook Best Practices#

External resources:

Developing warehouse models in JupyterHub#

See the warehouse README for warehouse project setup instructions.