Getting Notebooks Ready for the Portfolio#

Packages to include#

  • Order matters, %%capture must go first.

  • warnings.filterwarnings('ignore') warnings from displaying in the portfolio site (shared_utils).

# Include this in the cell where packages are imported

%%capture

import warnings
warnings.filterwarnings('ignore')

import calitp_data_analysis.magics

Headers#

Parameterized Titles#

  • When parameterizing a notebook, the first Markdown cell must include parameters to inject.

    • Ex: If district is one of the parameters in your sites/my_report.yml, a header Markdown cell could be # District {district} Analysis.

    • Note: The site URL is constructed from the original notebook name and the parameter in the JupyterBook build: 0_notebook_name__district_x_analysis.html

Consecutive Headers#

  • Headers must move consecutively in Markdown cells or the parameterized notebook will not generate. No skipping!

# Notebook Title
## First Section
## Second Section
### Another subheading
  • To get around consecutive headers, you can use display(HTML()).

    display(HTML(<h1>First Header</h1>) display(HTML(<h3>Next Header</h3>))
    

Capturing Parameters#

  • Create a code cell in which your parameter will be captured. Make sure the parameter tag for the cell is turned on.

 district_number = "4"
  • If you’re using a heading, you can either use HTML or capture the parameter and inject.

  • HTML - this option works when you run your notebook locally.

    from IPython.display import HTML
    
    display(HTML(f"<h3>Header with {variable}</h3>"))
    
  • Capture parameters - this option won’t display locally in your notebook (it will still show {district_number}), but will be injected with the value when the JupyterBook is built.

    In a code cell:

    %%capture_parameters
    
    district_number = f"{df.caltrans_district.iloc[0].split('-')[0].strip()}"
    

    In a Markdown cell:

    ## District {district_number}
    

Narrative#

  • Narrative content can be done in Markdown cells or code cells.

    • Markdown cells should be used when there are no variables to inject.

    • Code cells should be used to write narrative whenever variables constructed from f-strings are used.

  • For papermill, add a parameters tag to the code cell Note: Our portfolio uses a custom papermill engine and we can skip this step.

  • Markdown cells can inject f-strings if it’s plain Markdown (not a heading) using display(Markdown()) in a code cell.

from IPython.display import Markdown

display(Markdown(f"The value of {variable} is {value}."))
  • Use f-strings to fill in variables and values instead of hard-coding them

    • Turn anything that runs in a loop or relies on a function into a variable.

    • Use functions to grab those values for a specific entity (operator, district), rather than hard-coding the values into the narrative.

n_routes = (df[df.calitp_itp_id == itp_id]
            .route_id.nunique()
            )


n_parallel = (df[
            (df.calitp_itp_id == itp_id) &
            (df.parallel==1)]
            .route_id.nunique()
            )

display(
    Markdown(
        f"**Bus routes in service: {n_routes}**"
        "<br>**Parallel routes** to State Highway Network (SHN): "
        f"**{n_parallel} routes**"
        )
)
  • Stay away from loops if you need to use headers.

    • You will need to create Markdown cells for headers or else JupyterBook will not build correctly. For parameterized notebooks, this is an acceptable trade-off.

    • For unparameterized notebooks, you may want use display(HTML()).

    • Caveat: Using display(HTML()) means you’ll lose the table of contents navigation in the top right corner in the JupyterBook build.

Writing Guide#

These are a set of principles to adhere to when writing the narrative content in a Jupyter Notebook. Use your best judgment to decide when there are exceptions to these principles.

  • Decimals less than 1, always prefix with a 0, for readability.

    • 0.05, not .05

  • Integers when referencing dates, times, etc

    • 2020 for year, not 2020.0 (coerce to int64 or Int64 in pandas; Int64 are nullable integers, which allow for NaNs to appear alongside integers)

    • 1 hr 20 min, not 1.33 hr (use best judgment to decide what’s easier for readers to interpret)

  • Round at the end of the analysis. Use best judgment to decide on significant digits.

    • Too many decimal places give an air of precision that may not be present.

    • Too few decimal places may not give enough detail to distinguish between categories or ranges.

    • A good rule of thumb is to start with 1 extra decimal place than what is present in the other columns when deriving statistics (averages, percentiles), and decide from there if you want to round up.

      • An average of $100,000.0 can simply be rounded to $100,000.

      • An average of 5.2 mi might be left as is.

    • National Institutes of Health Rounding Rules (full article)

  • Additional references: American Psychological Association (APA) style, and Purdue

Standard Names#

  • GTFS data in our warehouse stores information on operators, routes, and stops.

  • Analysts should reference the operator name, route name, and Caltrans district the same way across analyses.

    • ITP ID: 182 is Metro (not LA Metro, Los Angeles County Metropolitan Transportation Authority, though those are all correct names for the operator)

    • Caltrans District: 7 is 07 - Los Angeles

    • Between route_short_name, route_long_name, route_desc, which one should be used to describe route_id? Use shared_utils.portfolio_utils, which relies on regular expressions, to select the most human-readable route name.

  • Before deploying your portfolio, make sure the operator name you’re using is what’s used in other analyses in the portfolio.

    • Use shared_utils.portfolio_utils to help you grab the right names to use.

      from shared_utils import portfolio_utils
      
      route_names = portfolio_utils.add_route_name()
      
      # Merge in the selected route name using route_id
      df = pd.merge(df,
                  route_names,
                  on = ["calitp_itp_id", "route_id"]
      )
      
      
      agency_names = portfolio_utils.add_agency_name()
      
      # Merge in the operator's name using calitp_itp_id
      df = pd.merge(df,
                  agency_names,
                  on = "calitp_itp_id"
      )