Introduction to JupyterHub¶
JupyterHub is a web application that allows users to analyze warehouse data (or data from a number of other sources) and create reports on it.
Analyses on JupyterHub are accomplished using notebooks, which allow users to mix narrative with analysis code.
You can access JupyterHub using this link: notebooks.calitp.org.
For Python users, we have deployed a cloud-based instance of JupyterHub to make creating, using, and sharing notebooks easy.
This avoids the need to set up a local environment, provides dedicated storage, and allows you to push to GitHub.
Logging in to JupyterHub¶
JupyterHub currently lives at notebooks.calitp.org.
Note: you will need to have been added to the Cal-ITP organization on GitHub to obtain access. If you have yet to be added to the organization and need to be, ask in the #services-team channel in Slack.
Connecting to the Warehouse¶
Connecting to the warehouse requires a bit of setup after logging in to JupyterHub, but it allows users to query data in the warehouse directly. To do this, you will need to download and install the gcloud command-line tool from the app.
See the screencast below for a full walkthrough.
The commands required:
```bash
# init will both authenticate and do basic configuration
# You do not have to set a default compute region/zone
gcloud init

# Optionally, you can auth and set the project separately
gcloud auth login
gcloud config set project cal-itp-data-infra

# Regardless, set up application default credentials
gcloud auth application-default login
```
If you are still not able to connect, make sure you have the same set of permissions granted to other analysts.
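As a quick sanity check after running the commands above, you can confirm that application default credentials were created. This is a hypothetical snippet using the `google-auth` Python library (installed alongside the BigQuery client libraries), not part of the official setup:

```python
# Hypothetical check: do application default credentials exist?
# google.auth.default() searches the well-known locations that
# `gcloud auth application-default login` writes to.
try:
    import google.auth
    import google.auth.exceptions

    try:
        credentials, project = google.auth.default()
        status = f"credentials found (project: {project})"
    except google.auth.exceptions.DefaultCredentialsError:
        status = "no credentials -- run `gcloud auth application-default login`"
except ImportError:
    status = "google-auth is not installed in this environment"

print(status)
```

If this reports missing credentials, re-run the gcloud commands above before querying the warehouse.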
Increasing the Query Limit¶
By default, there is a query limit set within the Jupyter notebook. Most queries should fall within that limit, and running into `DatabaseError: 500 Query exceeded limit for bytes billed` should be a red flag to investigate whether such a large query is needed for the analysis. To increase the query limit, add and execute the following in your notebook:
```python
import os

from calitp_data_analysis.tables import tbls

# Raise the per-query limit (in bytes) before re-initializing the tables
os.environ["CALITP_BQ_MAX_BYTES"] = str(20_000_000_000)

tbls._init()
```
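Before re-running an expensive query, you can confirm the new limit took effect. This is a hypothetical sanity check, not part of the calitp API:

```python
import os

# (Assumes you already ran the cell above; repeated here so the
# snippet is self-contained)
os.environ["CALITP_BQ_MAX_BYTES"] = str(20_000_000_000)

# Confirm the limit (in bytes) that queries will be billed against
max_bytes = int(os.environ["CALITP_BQ_MAX_BYTES"])
print(f"Max bytes billed per query: {max_bytes:,}")  # → 20,000,000,000
```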
Querying with SQL in JupyterHub¶
JupyterHub makes it easy to query warehouse data with SQL inside notebooks.
To query SQL, simply import the below at the top of your notebook:
And add the following to the top of any cell block that you would like to query SQL in:
```sql
%%sql
SELECT COUNT(*)
FROM `mart_gtfs.dim_schedule_feeds`
WHERE key = "db58891de4281f965b4e7745675415ab"
LIMIT 10
```
Saving Code to GitHub¶
Use this link to navigate to the Saving Code section of the docs to learn how to commit code to GitHub from the Jupyter terminal. Once there, you will need to complete the instructions in the following sections:
Environment Variables¶
Sometimes, if data access is expensive or if there is sensitive data, accessing it will require credentials (which may take the form of passwords or tokens).
There is a fundamental tension between data access restrictions and analysis reproducibility. If credentials are required, then an analysis is not reproducible out-of-the-box. However, including these credentials in scripts and notebooks is a security risk.
Most projects should store the authentication credentials in environment variables, which can then be read by scripts and notebooks. The environment variables that are required for an analysis to work should be clearly documented.
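One lightweight way to document required variables is to fail fast with a clear message when one is missing. The `require_env` helper below is a hypothetical sketch, not part of any Cal-ITP library:

```python
import os

def require_env(name: str) -> str:
    """Return the value of a required environment variable, or raise
    a clear error naming the missing variable."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Example usage, with a dummy value set purely for illustration
os.environ["GITHUB_API_KEY"] = "ABCDEFG123456789"
token = require_env("GITHUB_API_KEY")
```

Calling `require_env` at the top of a notebook makes the analysis's requirements explicit to anyone trying to reproduce it.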
Analysts should store their credentials in a `_env` file, a slight variation of the typical `.env` file, since a `.env` file won't show up in the JupyterHub filesystem.
Credentials that may need to be stored in the `_env` file include a GitHub API key, a Census API key, an Airtable API key, etc. Store them in this format:
```
GITHUB_API_KEY=ABCDEFG123456789
CENSUS_API_KEY=ABCDEFG123456789
AIRTABLE_API_KEY=ABCDEFG123456789
```
To pass these credentials in a Jupyter Notebook:
```python
import os

import dotenv

# Load the env file
dotenv.load_dotenv("_env")

# Import the credential (without exposing the password!)
GITHUB_API_KEY = os.environ["GITHUB_API_KEY"]
```
Jupyter Notebook Best Practices¶