Skip to main content
Ctrl+K
Cal-ITP Data Services - Home
  • Data Services Documentation

Analysts

  • Welcome!
    • Cal-ITP Project Information
    • How We Work
  • Technical Onboarding
  • Introduction to Analytics Tools
    • GitHub Setup
    • Tools Quick Links
    • Business Insights & Dashboards
    • JupyterHub
    • Local Oracle Database Connections
    • Useful Python Libraries
    • Scripts
    • RT Analysis Module
    • Saving Code
    • Storing Data During Analysis
    • Using Data Catalogs
    • Helpful Links
  • Tutorials for New Python Users
    • Data Analysis: Intro
    • Data Analysis: Intermediate
    • Data Management
    • Working with Jupyter notebooks
    • Working with Geospatial Data: Basics
    • Working with Geospatial Data: Intro
    • Working with Geospatial Data: Intermediate
    • Working with Geospatial Data: Advanced
  • Introduction to the Warehouse
    • Warehouse: Where to Begin
    • Navigating the dbt Docs
    • What is an agency?
    • Developing models in dbt
    • Adding Ad-Hoc Data to the Warehouse
    • What is GTFS, anyway?
  • Where can I publish data?
    • Data Publishing Principles
    • Static Visualizations
    • HTML Visualizations
    • Getting Notebooks Ready for the Portfolio
    • The Cal-ITP Analytics Portfolio
    • Metabase
    • GCS
    • Publishing data to California Open Data aka CKAN
    • Publishing data to California State Geoportal

Developers

  • Architecture Overview
    • Deployed Services and Sites
    • Published Images and Packages
    • Data pipelines
  • Airflow Operational Considerations
  • Transit Database (Airtable)

Contribute to the Docs!

  • Getting Started
    • Best Practices
    • Submitting Changes
    • Common Content
  • Repository
  • Suggest edit
  • Open issue
  • .md

Helpful Links

Contents

  • Data Analysis
    • Python
    • Pandas
    • Summarizing
    • Merging
    • Dates
    • Monetary Values
    • Tidy Data
  • Visualization
    • Charts
      • Altair
    • Maps
    • DataFrames
    • ipywidgets
      • Tabs
    • Markdown
    • ReviewNB on GitHub

Helpful Links#

Here are some resources data analysts have collected and referenced, that will hopefully help you out in your work.

  • Data Analysis

    • Python

    • Pandas

    • Summarizing

    • Merging

    • Dates

    • Monetary Values

    • Tidy Data

  • Visualizations

    • Charts

    • Maps

    • DataFrames

    • Ipywidgets

    • Markdown

    • ReviewNB

Data Analysis#

Python#

  • Composing Programs: comprehensive Python course

  • Intermediate Python: tips for improving your programs

  • Stop Python from executing code when a module is imported.

  • Loop through 2 lists with zip in parallel.

  • Find the elements that are in one list, but not in another list.

  • What does += do?

Pandas#

  • Turn columns into dummy variables.

  • Export multiple dataframes into their own sheets to a single Excel workbook.

  • Display multiple dataframes side by side.

  • Display all rows or columns of a dataframe in the notebook

Summarizing#

  • Groupby and calculate a new value, then use that value within your DataFrame.

  • Explanation of the split-apply-combine paradigm.

  • Pandas profiling tool: creates html reports from DataFrames.

    • Examples

Merging#

  • When working with data sets where the “merge on” column is a string data type, it can be difficult to get the DataFrames to join. For example, df1 lists County of Sonoma, Human Services Department, Adult and Aging Division, but df2 references the same department as: County of Sonoma (Human Services Department) .

    • Potential Solution #1: fill in a column in one DataFrame that has a partial match with the string values in another one.

    • Potential Solution #2: use the package fuzzymatcher. This will require you to carefully comb through for any bad matches.

Dates#

  • Use shift to calculate the number of days between two dates.

df['n_days_between'] = (df['prepared_date'] - df.shift(1)['prepared_date']).dt.days
  • Assign fiscal year to a date.

# Make sure your column is a date time object
df['financial_year'] = df['base_date'].map(lambda x: x.year if x.month > 3 else x.year-1)

Monetary Values#

  • Reformat values that are in scientific notation into millions or thousands.

    • Example in notebook.

    x=alt.X("Funding Amount", axis=alt.Axis(format="$.2s", title="Obligated Funding ($2021)"))
  • Reformat values from 19000000 to $19.0M.

  • Adjust for inflation.

# Must install and import cpi package for the function to work.
def adjust_prices(df):
    cols =  ["total_requested",
           "fed_requested",
           "ac_requested"]

    def inflation_table(base_year):
        cpi.update()
        series_df = cpi.series.get(area="U.S. city average").to_dataframe()
        inflation_df = (series_df[series_df.year >= 2008]
                        .pivot_table(index='year', values='value', aggfunc='mean')
                        .reset_index()
                       )
        denominator = inflation_df.value.loc[inflation_df.year==base_year].iloc[0]

        inflation_df = inflation_df.assign(
        inflation = inflation_df.value.divide(denominator)
        )

        return inflation_df

    ##get cpi table
    cpi = inflation_table(2021)
    cpi.update
    cpi = (cpi>>select(_.year, _.value))
    cpi_dict = dict(zip(cpi['year'], cpi['value']))


    for col in cols:
        multiplier = df["prepared_y"].map(cpi_dict)

        ##using 270.97 for 2021 dollars
        df[f"adjusted_{col}"] = ((df[col] * 270.97) / multiplier)
    return df

Tidy Data#

Tidy Data follows a set of principles that ensure the data is easy to work with, especially when using tools like pandas and matplotlib. Primary rules of tidy data are:

  • Each variable must have its own column.

  • Each observation must have its own row.

  • Each value must have its own cell.

Tidy data ensures consistency, making it easier to work with tools like pandas, matplotlib, or seaborn. It also simplifies data manipulation, as functions like groupby(), pivot(), and melt() work more intuitively when the data is structured properly. Additionally, tidy data enables vectorized operations in pandas, allowing for efficient analysis on entire columns or rows at once.

Learn more about Tidy Data here.

Visualization#

Charts#

Altair#

  • Manually concatenate a bar chart and line chart to create a dual axis graph.

  • Adjust the time units of a datetime column for an axis.

  • Label the lines on a line chart.

  • Layer altair charts, lose color with no encoding, workaround to get different colors to appear on legend.

  • Add regression line to scatterplot.

  • Adjust scales for axes to be the min and max values.

  • Resolving the error ‘TypeError: Object of type ‘Timestamp’ is not JSON serializable’

  • Manually sort a legend.

  • Add tooltip to chart functions.

def add_tooltip(chart, tooltip1, tooltip2):
    chart = (
        chart.encode(tooltip= [tooltip1,tooltip2]))
    return chart

Maps#

  • Quick interactive maps with Geopandas.gdf.explore()

DataFrames#

  • Styling dataframes with HTML.

  • After styling a DataFrame, you will have to access the underlying data with .data.

ipywidgets#

Tabs#

  • Create tabs to switch between different views.

  • Stack Overflow Help.

    • Notebook example.

    • Example on Ipywidgets docs page.

Markdown#

  • Create a markdown table.

  • Add a table of content that links to headers throughout a markdown file.

  • Add links to local files.

  • Direct embed an image.

ReviewNB on GitHub#

  • Tool designed to facilitate reviewing Jupyter Notebooks in a collaborative setting on GitHub

  • Shows side-by-side diffs of Jupyter Notebooks, including changes to both code cells and markdown cells and allows reviewers to comment on specific cells

previous

Using Data Catalogs

next

Tutorials for New Python Users

Contents
  • Data Analysis
    • Python
    • Pandas
    • Summarizing
    • Merging
    • Dates
    • Monetary Values
    • Tidy Data
  • Visualization
    • Charts
      • Altair
    • Maps
    • DataFrames
    • ipywidgets
      • Tabs
    • Markdown
    • ReviewNB on GitHub

By Cal-ITP

© Copyright 2023.