Tutorial - Querying the CALITP Data Warehouse

Introduction

The following content represents a tutorial introduction to simple queries that can be made to the calitp data warehouse, and the methods that can be used to perform them.

Example Queries

The queries represented in the following tutorial are as follows:

Available Tools

The tools that we can use to answer them are:

Relevant Tables

Fact Tables

These tables contain measurements, metrics, and facts used to answer the questions from the following perspectives:

Feeds

Routes

Stops

Trips

Dimensional Tables

These tables compliment the fact tables by providing additional descriptive attributes:

Important Column Types and Columns

Query Examples

Importing Relevant Python Libraries

1. Number of Routes for a Given Agency Over Time

SQL

Primary Fact Table → views.gtfs_schedule_fact_daily_feed_routes
Secondary Table → views.gtfs_schedule_dim_feeds

Timedate (GROUP BY)
Geographyroute_key (the unique identifier for each record, to COUNT by)
Agency → Join with table views.gtfs_schedule_dim_feeds on variable feed_key for calitp_feed_name (GROUP BY)

Metabase

Primary Fact Table → Gtfs Schedule Fact Daily Feed Routes
Secondary Table → Gtfs Schedule Dim Feeds

TimeDate (FILTER)
GeographyRoute Key (the unique identifier for each record, to COUNT by)
Agency → Metabase automatically joins with table Gtfs Schedule Dim Feeds on variable Feed Key to get Calitp Feed Name (FILTER)

Metabase Configuration Screenshots

Siuba

Primary Fact Table → views.gtfs_schedule_fact_daily_feed_routes
Secondary Table → views.gtfs_schedule_dim_feeds

Timedate (COUNT by)
Geographyroute_key (the unique identifier for each record)
Agency → Join with table views.gtfs_schedule_dim_feeds on variable feed_key for calitp_feed_name (FILTER by)

2. Number of Stops for a Given Agency Over Time

SQL

Primary Fact Table → views.gtfs_schedule_fact_daily_feed_stops
Secondary Table → views.gtfs_schedule_dim_feeds

Timedate (GROUP BY)
Geographystop_key (the unique identifier for each record, to COUNT by)
Agency → Join with table views.gtfs_schedule_dim_feeds on variable feed_key for calitp_feed_name (GROUP BY)

Metabase

Primary Fact Table → Gtfs Schedule Fact Daily Feed Stops
Secondary Table → Gtfs Schedule Dim Feeds

TimeDate (COUNT by)
GeographyStop Key (the unique identifier for each record, to COUNT by)
Agency → Metabase automatically joins with table Gtfs Schedule Dim Feeds on variable Feed Key to get Calitp Feed Name (FILTER by)

Metabase Configuration Screenshots

Siuba

Primary Fact Table → views.gtfs_schedule_fact_daily_feed_stops
Secondary Table → views.gtfs_schedule_dim_feeds

Timedate (GROUP BY)
Geographystop_key (the unique identifier for each record, to COUNT by)
Agency → Join with table views.gtfs_schedule_dim_feeds on variable feed_key for calitp_feed_name (GROUP BY)

3. Number of Stops Made Across all Trips for an Agency

SQL

Primary Fact Table → views.gtfs_schedule_data_feed_trip_stops_latest

Timeno variable - this table only has information for the current day
Geographystop_time_key (the unique identifier for each record, to COUNT by)
Agencycalitp_feed_name (GROUP BY)

Metabase

Primary Fact Table → Gtfs Schedule Data Feed Trip Stops Latest
Secondary Table → Gtfs Schedule Dim Feeds

Timeno variable - this table only has information for the current day
GeographyStop Time Key (the unique identifier for each record, to COUNT by)
Agency → Metabase automatically joins with table Gtfs Schedule Dim Feeds on variable Feed Key to get Calitp Feed Name (FILTER by)

Count of Trip Stops Made Across all Trips for an Agency

Metabase Configuration Screenshots

Distinct Trips in Trip Stops

Metabase Configuration Screenshots

Distinct Stops in Trip Stops

Metabase Configuration Screenshots

Siuba

Primary Fact Table → views.gtfs_schedule_data_feed_trip_stops_latest

Timeno variable - this table only has information for the current day
Geographystop_time_key (the unique identifier for each record, to COUNT by)
Agencycalitp_feed_name (GROUP BY)

4. For a Given Agency, on Each Day, Days Until the Feed Expires

SQL

Primary Fact Table → views.gtfs_schedule_fact_daily_feeds
Secondary Table → views.gtfs_schedule_dim_feeds

Timedate, feed_end_date
Measuredays_until_feed_end_date
Agency → Join with table views.gtfs_schedule_dim_feeds on variable feed_key for calitp_feed_name (GROUP BY)

Metabase

Primary Fact Table → Gtfs Schedule Fact Daily Feeds
Secondary Table → Join: Gtfs Schedule Dim Feeds

TimeDate, Feed End Date
MeasureDays Until Feed End Date
Agency → Join with table Gtfs Schedule Dim Feeds on variable feed_key for Calitp Feed Name (FILTER by) and Feed End Date

Columns to Select:

Metabase Configuration Screenshots

Siuba

Primary Fact Table → views.gtfs_schedule_fact_daily_feeds
Secondary Table → views.gtfs_schedule_dim_feeds

Timedate (FILTER by), feed_end_date
Measuredays_until_feed_end_date
Agency → Join with table views.gtfs_schedule_dim_feeds on variable feed_key for calitp_feed_name (FILTER by)

5. Max Number of Stops a Trip Can Have, Per Agency

SQL

Primary Fact Table → views.gtfs_schedule_data_feed_trip_stops_latest
Secondary Table → views.gtfs_schedule_dim_feeds

Timeno variable, finding max across all days
Geographytrip_id (the unique identifier for each record, to GROUP BY)
Agency → Join with table views.gtfs_schedule_dim_feeds on variable feed_key for calitp_feed_name (GROUP BY)

Metabase

Primary Fact Table → Gtfs Schedule Data Feed Trip Stops Latest
Secondary Table → Gtfs Schedule Dim Feeds

Timeno variable, finding max across all days
GeographyTrip ID (the unique identifier for each record, to COUNT by)
Agency → Metabase automatically joins with table Gtfs Schedule Dim Feeds on variable Feed Key to get Calitp Feed Name (COUNT by)

Metabase Configuration Screenshots

Siuba

Primary Fact Table → views.gtfs_schedule_data_feed_trip_stops_latest
Secondary Table → views.gtfs_schedule_dim_feeds

Timeno variable, finding max across all days
Geographytrip_id (the unique identifier for each record, to GROUP BY)
Agency → Join with table views.gtfs_schedule_dim_feeds on variable feed_key for calitp_feed_name (GROUP BY)