JupyterHub
Contents
JupyterHub¶
Introduction to JupyterHub¶
Jupyterhub is a web application that allows users to analyze and create reports on warehouse data (or a number of data sources).
Analyses on JupyterHub are accomplished using notebooks, which allow users to mix narrative with analysis code.
You can access JuypterHub using this link: notebooks.calitp.org.
Table of Contents¶
Using JupyterHub¶
For Python users, we have deployed a cloud-based instance of JupyterHub to make creating, using, and sharing notebooks easy.
This avoids the need to set up a local environment, provides dedicated storage, and allows you to push to GitHub.
Logging in to JupyterHub¶
JupyterHub currently lives at notebooks.calitp.org.
Note: you will need to have been added to the Cal-ITP organization on GitHub to obtain access. If you have yet to be added to the organization and need to be, ask in the #services-team
channel in Slack.
Connecting to the Warehouse¶
Connecting to the warehouse requires a bit of setup after logging in to JupyterHub, but allows users to query data in the warehouse directly. To do this, you will need to download and install the gcloud commandline tool from the app.
See the screencast below for a full walkthrough.
The commands required:
# init will both authenticate and do basic configuration
# You do not have to set a default compute region/zone
gcloud init
# Optionally, you can auth and set the project separately
gcloud auth login
gcloud config set project cal-itp-data-infra
# Regardless, set up application default credentials
gcloud auth application-default login
If you are still not able to connect, make sure you have the suite of permissions associated with other analysts.
Increasing the Query Limit¶
By default, there is a query limit set within the Jupyter Notebook. Most queries should be within that limit, and running into DatabaseError: 500 Query exceeded limit for bytes billed
should be a red flag to investigate whether such a large query is needed for the analysis. To increase the query limit, add and execute the following in your notebook:
from calitp_data_analysis.tables import tbls
import os
os.environ["CALITP_BQ_MAX_BYTES"] = str(20_000_000_000)
tbls._init()
Querying with SQL in JupyterHub¶
JupyterHub makes it easy to query SQL in the notebooks.
To query SQL, simply import the below at the top of your notebook:
import calitp_data_analysis.magics
And add the following to the top of any cell block that you would like to query SQL in:
%%sql
Example:
import calitp_data_analysis.magics
%%sql
SELECT
COUNT(*)
FROM `mart_gtfs.dim_schedule_feeds`
WHERE
key = "db58891de4281f965b4e7745675415ab"
LIMIT 10
Saving Code to Github¶
Use this link to navigate to the Saving Code
section of the docs to learn how to commit code to GitHub from the Jupyter terminal. Once there, you will need to complete the instructions in the following sections:
Environment Variables¶
Sometimes if data access is expensive, or if there is sensitive data, then accessing it will require some sort of credentials (which may take the form of passwords or tokens).
There is a fundamental tension between data access restrictions and analysis reproducibility. If credentials are required, then an analysis is not reproducible out-of-the-box. However, including these credentials in scripts and notebooks is a security risk.
Most projects should store the authentication credentials in environment variables, which can then be read by scripts and notebooks. The environment variables that are required for an analysis to work should be clearly documented.
Analysts should store their credentials in a _env
file, a slight variation of the typical .env
file, since the .env
won’t show up in the JupyterHub filesystem.
Some credentials that need to be stored within the _env
file may include GitHub API key, Census API key, Airtable API key, etc. Store them in this format:
GITHUB_API_KEY=ABCDEFG123456789
CENSUS_API_KEY=ABCDEFG123456789
AIRTABLE_API_KEY=ABCDEFG123456789
To pass these credentials in a Jupyter Notebook:
import dotenv
import os
# Load the env file
dotenv.load_dotenv("_env")
# Import the credential (without exposing the password!)
GITHUB_API_KEY = os.environ["GITHUB_API_KEY"]
Jupyter Notebook Best Practices¶
External resources:
Developing warehouse models in JupyterHub¶
See the warehouse README for warehouse project setup instructions.