Developing models in dbt#
Information related to contributing to the Cal-ITP dbt project.
What is dbt?#
dbt is the tool we use to manage the data models (tables and views) in our BigQuery data warehouse. We use dbt to define the SQL code that creates our data models, to manage dependencies between models, to document models, and to test models.
The Cal-ITP dbt project runs every day, orchestrated by Airflow. When the dbt project runs, it refreshes all the data in the warehouse by running all the queries within the project in order based on the dependencies defined between models. For example, if model B depends on model A, the project will execute model A before model B.
dbt project structure#
The dbt project (specifically the models/ directory) is broken out into data processing stages, broadly according to the standard outlined by dbt. You can read dbt documentation on the purpose of staging, intermediate, and mart models.
Project-level configuration lives in two separate places:
dbt_project.yml (dbt documentation, our file): This contains information about the dbt project structure, for example configuring how the folder structure should be interpreted.
profiles.yml (dbt documentation, production file (should be used for prod runs only), template for local/testing use): This contains information about how the dbt project should connect to our cloud warehouse (BigQuery) and configures settings associated with that database connection, for example, what the query timeout should be.
Note
The relationship between a folder in warehouse/models/ and a dataset/schema in BigQuery is configured via dbt_project.yml in the models section. The warehouse/models/some_data/my_table.sql file will not correspond to cal-itp-data-infra.some_data.my_table in BigQuery unless configured to do so; all models are created in the staging dataset by default (so, without that configuration, the table would be created in cal-itp-data-infra.staging.my_table).
For an example creating a new entry in dbt_project.yml for a new mart dataset, see data-infra PR #3172.
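To make the mapping concrete, here is a minimal, hypothetical sketch of what such an entry can look like (the project, folder, and dataset names below are invented for illustration and are not our actual configuration; the exact dataset name also depends on the project's generate_schema_name behavior):

# hypothetical excerpt from dbt_project.yml
models:
  calitp_warehouse:             # assumed dbt project name
    mart:
      new_mart:
        +schema: mart_new_mart  # models under models/mart/new_mart/ build into this dataset
        +materialized: table    # default materialization for the folder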
Resources#
If you have questions specific to our project or you encounter any issues when developing, please bring those questions to the #data-warehouse-devs or #data-office-hours Cal-ITP Slack channels. Working through questions “in public” helps build shared knowledge that’s searchable later on.
For Cal-ITP-specific data warehouse documentation, including high-level concepts and naming conventions, see our Cal-ITP dbt documentation site. This documentation is automatically generated by dbt, and incorporates the table- and column-level documentation that developers enter in YAML files in the dbt project.
For general dbt concepts (for example, models, dbt Jinja or tests), see the general dbt documentation site.
For general SQL or BigQuery concepts (for example, tables, views, or window functions), see the BigQuery docs site.
How to contribute to the dbt project#
Getting started#
To get set up to contribute to the dbt project via JupyterHub, follow the README in the data-infra repo warehouse folder. If you hit any trouble with setup, let folks know in the #data-warehouse-devs or #data-office-hours Cal-ITP Slack channels.
We recommend that everyone who does dbt development joins the #data-warehouse-devs channel in the Cal-ITP Slack to ask questions, collaborate, and build shared knowledge.
Developer workflow#
To develop dbt models, you can edit the .sql files for your models, save your changes, and then run the model from the command line to execute the SQL you updated.
To inspect tables as you are working, the fastest method is usually to run some manual test queries or “preview” the tables in the BigQuery user interface. You can also use something like pandas.read_gbq to perform example queries in a notebook.
When you run dbt commands locally or on JupyterHub, your models will be created in the cal-itp-data-infra-staging.<your name>_<dbt folder name, like mart_gtfs> BigQuery dataset. Note that this is in the cal-itp-data-infra-staging Google Cloud Platform project, not the production cal-itp-data-infra project.
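As a sketch of that edit-run-inspect loop (the model name below is just a placeholder), the command-line side might look like this:

# run from the warehouse/ directory of the data-infra repo
poetry run dbt run -s stg_example__my_table    # run only the model you edited
poetry run dbt run -s +stg_example__my_table   # also rebuild its upstream models
poetry run dbt test -s stg_example__my_table   # run any dbt tests defined on the model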
Once your models are working the way you want and you have added all necessary documentation and tests in YAML files (see below for more on modeling, documentation, and testing considerations), you are ready to merge.
Because the warehouse is collectively maintained and changes can affect a variety of users, please open PRs against main when work is ready to merge, and keep an eye out for comments and questions from reviewers, who may request tweaks before merging.
Video walkthrough#
For an example of working with dbt in JupyterHub, see the recording of the original onboarding call in April 2023 (requires Cal-ITP Google Drive access). A few notes on this video:
The documentation shown is an older version of this docs page; the information shared verbally is correct but the page has been updated.
The bug encountered towards the end of the video (that prevented us from running dbt tests) has been fixed.
The code owners mentioned in the video have changed; consult in Slack for process guidance.
Modeling considerations#
When developing or updating dbt models, there are some considerations which may differ from a notebook-based analysis. These can be thought of as a checklist or decision tree of questions that you should run through whenever you are editing or creating a dbt model. Longer explanations of each item are included below.
Identify the cause of your bug#
Example bug troubleshooting walkthroughs
Here is a series of recordings showing a workflow for debugging a failing dbt test. The resulting PR is #2892.
Here’s a recording of a meeting on August 17, 2023 demonstrating the investigation of some failing dbt tests that day. The notes shown in the video live here. The associated PR is #2899.
Usually, bugs are caused by:
New or historical data issues. For example, an agency may be doing something in their GTFS data that we didn’t expect and this may have broken one of our models. This can happen with brand new data that is coming in or in historical data that wasn’t included in local testing (this is especially relevant for RT data, where local testing usually includes a very small subset of the full data.)
GTFS or data schema bugs. Sometimes we may have misinterpreted the GTFS spec (or another incoming data model) and modeled something incorrectly.
SQL bugs. Sometimes we may have written SQL incorrectly (for example, used the wrong kind of join.)
How to investigate the bug depends on how the bug was noticed.
If there was a failing dbt test, you can run dbt compile locally to compile the project SQL. You can then find the SQL for the failing test (follow the dbt testing FAQ under “one of my tests failed, how can I debug it?” to find the compiled test SQL). Run that SQL in BigQuery to see the rows that are failing.
Note
When you run dbt compile locally, you will compile SQL that’s pointed at the staging project and your namespaced dataset. Make sure to change those references when you run the compiled SQL. So, cal-itp-data-infra-staging.laurie_mart_gtfs.fct_scheduled_trips would become cal-itp-data-infra.mart_gtfs.fct_scheduled_trips.
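As a sketch of that flow (the path shown assumes dbt's default target/ output directory; names are placeholders):

poetry run dbt compile    # compile the project SQL without running it
# compiled model and test SQL is written under target/compiled/, for example:
#   target/compiled/<dbt project name>/.../<model or test name>.sql
# paste that SQL into the BigQuery console, swapping the staging project and dataset
# references for production ones as described in the note above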
If you noticed an issue that wasn’t caused by a failing test, you can start with the model that you noticed the problem in.
In either case, you may need to consider upstream models. To identify your model’s parents, you can look at the dbt docs website page for your model. See the dbt docs for how to look at the model’s lineage. You can modify the model selector in the bottom middle to just +<your model name> to only see the model’s parents. You can also run poetry run dbt ls -s +<your model> --resource-type model to see a model’s parents just on the command line. Try to figure out where the root cause of the problem is occurring. This may involve running ad-hoc SQL queries to inspect the models involved.
Should it be a dbt model?#
Changes to dbt models are likely to be appropriate when one or more of the following is true:
There is a consistent or ongoing need for this data. dbt can ensure that transformations are performed consistently at scale, every day.
The data is big. Doing transformations in BigQuery can be more performant than doing them in notebooks or any workflow where the large data must be loaded into local memory.
We want to use the same data across multiple domains or tools. The BigQuery data warehouse is the easiest way to provide consistent data throughout the Cal-ITP data ecosystem (in JupyterHub, Metabase, open data publishing, the reports site, etc.)
dbt models may not be appropriate when:
You are doing exploratory data analysis, especially on inconsistently-constructed data. It will almost always be faster to do initial exploration of data via Jupyter/Python than in SQL. If you only plan to use the data for a short period of time, or plan to reshape it speculatively many times before you settle on a more long-lived form, you probably don’t need to represent it with a dbt model quite yet.
You want to apply a simple transformation (for example, a grouped summary or filter) to answer a specific question. In this case, it may be more appropriate to create a Metabase dashboard with the desired transformations.
What is the grain of your model?#
Grain means “what does a row represent?” For example: Do you want one row per route per day? One row per fare transaction? One row per organization per month?
This concept of grain can be one of the biggest differences between notebook-based analysis and warehouse analytics engineering. In notebooks, you may be making a lot of transformations and saving each step out as its own dataframe, and you may use functions for reusable transformation steps. In warehouse development, we want to be focused on making reusable models where the data itself is the common building block across analyses. That often means trying to make only one table in the warehouse for each grain, regardless of how many different types of analysis it might be used for.
Is there already a model with this grain?#
If there is already a model with the grain you are targeting, you should almost always add new columns to that existing model rather than making a new model with the same grain.
Example: fct_scheduled_trips
Consider fct_scheduled_trips. This is our core trip-level table. Every scheduled trip should have a row in this model and attributes that you might want from that trip should be present for easy access. As a result, this table has a lot of columns, because when we need new information about trips, we add it here. For example, when we wanted to fix time zone handling for trips, we added those columns instead of creating a new model.
To figure out if there is a model with your desired grain, you can search the dbt docs for relevant terms. For example, if you want a table of routes, you can search “routes” to see what models already exist. You can also explore the dependency tree for a related table (like dim_routes) to see if you can find a table that looks like it has the right grain. You can also see our dbt docs homepage for a discussion of table naming conventions to interpret dimension, fact, and bridge tables.
For dimensions you may need to think more about whether you are truly making a new dimension, or whether you are simply applying a filter on an existing dimension (for example, dim_bus_vehicles could be a subset of an existing dim_vehicles dimension model, in which case you could just add a boolean column is_bus on dim_vehicles rather than making a new dedicated model).
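To make that concrete, here is a hedged sketch of the add-a-boolean-column approach using the hypothetical vehicle example above (all model and column names are invented):

-- hypothetical addition inside dim_vehicles.sql, instead of creating a separate dim_bus_vehicles
SELECT
    *,
    vehicle_type = 'bus' AS is_bus  -- flag the subset rather than materializing it separately
FROM {{ ref('int_vehicles') }}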
Make your changes#
The kind of changes you make will depend on what you’ve discovered in previous steps. Fixing a bug might involve changing model SQL, changing tests to reflect a new understanding, or something else. Adding a column might involve a simple change on one model or require updating several parent models. Creating a new model for brand new data may involve a new external table or it might be a straightforward transformation just in the mart. Some examples of different types of changes are listed below.
If you find yourself making big changes that seem likely to significantly affect other users, you may need to step back and convene a conversation to make sure that everyone is on board; see for example this Google Doc about Airtable schema changes where stakeholders confirmed how they wanted to handle schema changes in the warehouse. The downstream impacts section below has suggestions for how to assess the impacts of your changes.
Example bug fix PRs#
Here are a few example data-infra PRs that fixed past bugs:
PR #2076 fixed two bugs: there was a hardcoded incorrect value in our SQL that was causing Sundays to not appear in our scheduled service index (SQL syntax bug), and there was a bug in how we were handling the relationship between calendar_dates and calendar (GTFS logic bug).
PR #2623 fixed bugs caused by unexpected calendar data from a producer.
Example new column PRs#
Here are a few example data-infra PRs that added columns to existing models:
PR #2778 is a simple example of adding a column that already exists in staging to a mart table.
For intermediate examples of adding a column in a staging table and propagating it through a few different downstream models, see
PR #2383 adds a column to Airtable data end-to-end (starting from the raw data/external tables; this involves non-dbt code).
Example new model PRs#
Here are a few data-infra PRs that created brand new models:
Test your changes#
Once you have made some changes, it is important to test them.
Different types of testing
Functional testing during development is different than adding dbt tests (described below). dbt tests ensure some floor of model validity over time; while developing, you should run more holistic tests to ensure that your code is working as expected.
The first step is running your changes in the test/staging environment. You can run a command like poetry run dbt run -s +<your model> to run your model and its antecedents. Your models will be created in the cal-itp-data-infra-staging.<your name>_<dbt folder name, like mart_gtfs> BigQuery dataset. Note that this is in the cal-itp-data-infra-staging Google Cloud Platform project, not the production cal-itp-data-infra project.
What to test/check will vary based on what you’re doing, but below are some example things to consider.
Column values#
Are the values in your column/model what you expect? For example, are there nulls? Does the column have all the values you anticipated (for example, if you have a day of the week column, is data from all 7 days present)? If it’s numeric, what are the minimum and maximum values; do they make sense (for example, if you have a percentage column, is it always between 0 and 100)? What is the most common value?
To check nulls:
SELECT * FROM <your model> WHERE <your column> IS NULL
To check distinct values in a column:
SELECT DISTINCT <your column> FROM <your model>
To check min/max:
SELECT MIN(<your column>), MAX(<your_column>) FROM <your model>
To check most common values:
SELECT <your_column>, COUNT(*) AS ct FROM <your model> GROUP BY 1 ORDER BY ct DESC
Row count and uniqueness#
To confirm that the grain is what you expect, you should check whether an anticipated unique key is actually unique. For example, if you were making a daily shapes table, you might expect that date + feed_key + shape_id would be unique. Similarly, you should have a ballpark idea of the order of magnitude of the number of rows you expect. If you’re making a yearly organizations table and your table has a million rows, something is likely off. Some example queries could be:
To check row count:
SELECT COUNT(*) FROM <your model>
To check row count by some attribute (for example, rows per date):
SELECT <your column>, COUNT(*) AS ct FROM <your model> GROUP BY 1 ORDER BY 1
To check uniqueness based on a combination of a few columns:
WITH tbl AS (
    SELECT * FROM <your model>
),

dups AS (
    SELECT
        <column 1>,
        <column 2>,
        <column 3>,
        COUNT(*) AS ct
    FROM tbl
    -- adjust this based on the number of columns that make the composite unique key
    GROUP BY 1, 2, 3
    HAVING ct > 1
)

SELECT * FROM dups
LEFT JOIN tbl USING (<column 1>, <column 2>, <column 3>)
ORDER BY <column 1>, <column 2>, <column 3>
Performance#
While testing, you should keep an eye on the performance (cost/data efficiency) of the model:
When you run the dbt model locally, look at how many bytes are billed to build the model(s).
Before you run test queries, check the estimated bytes shown in the BigQuery console (these estimates may not be accurate for queries on views or clustered tables).
After you run test queries, look at the total bytes billed after the fact in the Job Information tab in the Query results section of the BigQuery console.
If the model takes more than 100 GB to build, or if test queries seem to be reading a lot of data (this is subjective; it’s ok to build a sense over time), you may want to consider performance optimizations.
Below are a few options to improve performance. Data infra PR #2711 has examples of several different types of performance interventions.
If the model is expensive to build: First, try to figure out what specific steps are expensive. You can run individual portions of your model SQL in the BigQuery console to assess the performance of individual CTEs.
If the model involves transformations on a lot of data that doesn’t need to be reprocessed every day, you may want to make the model incremental. You can run poetry run dbt ls -s config.materialized:incremental --resource-type model to see examples of other incremental models in the repo.
If the model reads data from an expensive parent table, you may want to consider leveraging clustering or partitioning on that parent table to make a join or select more efficient. See this comment on data infra PR #2743 for an example of a case where changing a join condition was a more appropriate performance intervention than making the table incremental.
If the model is expensive to query: The main interventions to make a model more efficient to query involve changing the data storage.
Consider storing it as a table rather than a view.
If the model is already a table, you can consider partitioning or clustering on columns that will commonly be used as filters.
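For instance, here is a hedged sketch of a dbt config block that builds a model as a date-partitioned, clustered table (service_date and feed_key are placeholder column names, not a recommendation for any particular model):

-- partition on a date column and cluster on a column commonly used in joins/filters
{{ config(
    materialized = "table",
    partition_by = {
        "field": "service_date",
        "data_type": "date",
        "granularity": "day"
    },
    cluster_by = ["feed_key"]
) }}
-- model SQL follows here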
Warning
Incremental models have two different run modes: full refreshes (which re-process all historical data available) and incremental runs that load data in batches based on your incremental logic. These two modes run different code.
If you make your table incremental, you should make sure to run both a full refresh (use the --full-refresh flag) and an incremental run (after the table has already been built once; no flag) in your testing to ensure that both are working as expected.
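For example (the model name is a placeholder):

poetry run dbt run -s my_incremental_model --full-refresh   # full refresh: rebuild from scratch
poetry run dbt run -s my_incremental_model                  # incremental run: process only new batches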
Downstream impacts#
Another important consideration is the potential downstream impacts of your changes, particularly if you are changing existing models.
You can run dbt tests on the downstream models using poetry run dbt test -s <your_model>+. You should make sure that your changes do not cause new test failures in downstream models.
Check which models are downstream of your changes using poetry run dbt ls -s <your_model>+ --resource-type model. If your model has a lot of descendants, consider performing additional tests to ensure that your changes will not cause problems downstream.
To check for impacts on defined downstream artifacts (like the reports site and open data publishing), you can check which exposures are downstream of your model using poetry run dbt ls -s <your_model>+ --resource-type exposure.
Other considerations#
Other questions will be more specific to your changes or goals, but it’s usually a good idea to take a second and brainstorm things that you would expect to be true and check whether your model reflects them. For example, we expect more trip activity during AM/PM peak periods than in the middle of the night; is that true in your model? What is the balance of weekend to weekday activity in your model, and does it make sense for the context?
Add dbt tests and documentation#
Once you are satisfied with your changes, you should add tests and documentation, both of which are vital to keeping the project maintainable over time.
dbt tests#
dbt tests help us ensure baseline model validity and guarantees over time (for example: “this ID is unique”). A dbt test failure should be something that you’d want to fix quickly to ensure models work for downstream users. So, a test failure should be something you’d want to act on by doing something like fixing, dropping, or adding a warning flag on failing rows. A test can also be thought of as an assertion: a not-null test on a column asserts that the column is never null.
dbt tests are run every day in Airflow and alert when models fail. Because they run every day and execute SQL code, there is some tradeoff with cost: we don’t want to test too excessively because that could become wasteful.
We usually prefer to have tests on tables (rather than views) for cost reasons. Most tables, especially in mart datasets, should have at least a primary key test that tests that there is a unique, non-null column; this is one way to monitor that the grain of the model is stable and is not being violated.
You may want to find a model similar to the one you’re changing and see what tests that other model has.
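As a hedged sketch (the model, column, and test choices are hypothetical), a primary key test in a model's YAML file can look like this:

# hypothetical excerpt from the YAML file alongside the model's SQL
version: 2
models:
  - name: fct_example_daily_things
    columns:
      - name: key
        description: Synthetic primary key for this table.
        tests:
          - unique
          - not_null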
Make sure your tests pass!
After you add your tests, you should make sure they pass by running poetry run dbt test -s <your_model>. If your tests don’t pass, you should figure out why and make changes until they do.
Documentation#
Documentation in dbt helps different data users collaborate by explaining what a given model or column is. Documentation should answer the question: What information would someone else need to use this model/column effectively? All models should have a description and most columns should too.
Try to write a description that will make sense to future readers. It is helpful to be specific, for example saying “April 2023” instead of “now” or “current”.
Model documentation should make the grain clear.
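For example, a hypothetical model entry whose description states the grain explicitly might look like this:

version: 2
models:
  - name: fct_daily_route_summaries   # hypothetical model
    description: One row per route per service date, covering data collected since April 2023.
    columns:
      - name: service_date
        description: The date on which the route operated, in the Pacific time zone.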
Merge your changes#
Once you have finished work, you should make a PR to get your changes merged into main. PRs that sit open and become stale can be problematic: if other people change models before your PR merges, your models may behave unexpectedly.
Once your changes merge, if they will impact other users (for example by changing a high-traffic model), you may want to announce your changes on Slack in #data-warehouse-devs, #data-analysis, or a similar channel.
Warning
Incremental models downstream of your changes may require a full refresh after your changes merge.
To check for incremental models downstream of your model, run poetry run dbt ls -s <your_model>+,config.materialized:incremental --resource-type model. If you need to refresh incremental models:
Wait for the build-dbt GitHub action associated with your PR to complete after you merge.
In the Airflow UI, open the transform_warehouse_full_refresh DAG. Specify appropriate model selectors so that only models affected by your changes are refreshed, then run the DAG task.
Incremental models#
Incremental models are one of the trickier concepts to learn in dbt and in our warehouse. By default, dbt models process all of the available data identified by the SQL queries within them and essentially re-create each model from scratch every day. Incremental models are designed to process smaller batches of data and update existing models, rather than creating the model from scratch. This is helpful for large data (like GTFS RT and GTFS schedule stop times) where re-processing all of history every day is expensive.
Computational cost vs. complexity cost trade-off
Incremental models basically trade off a simple but computationally expensive approach (just process all of history every day) for a more complex approach, where daily processing is cheaper computationally but there is a higher cost in terms of time spent understanding or troubleshooting the model.
Most of the Cal-ITP incremental models use a shared incremental_where macro to handle the incremental logic; by default in dbt you could use if is_incremental() checks directly in each incremental model, but we use the incremental_where macro to store some shared handling for things like the GTFS RT and GTFS schedule data start dates.
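As a hedged sketch of the general pattern, here is what dbt's built-in approach looks like with placeholder names (our models typically route this logic through the incremental_where macro instead; its exact arguments are not shown here):

{{ config(materialized = "incremental", unique_key = "key") }}

SELECT *
FROM {{ ref('stg_example__some_large_table') }}
{% if is_incremental() %}
-- on incremental runs, only bring in rows newer than what is already in this table
WHERE _extract_ts > (SELECT MAX(_extract_ts) FROM {{ this }})
{% endif %}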
The core question to ask when working with incremental models is: How do I identify the new rows that should be brought in on each new run?
You can compile the SQL for an incremental model and run it directly in BigQuery to inspect what rows are identified for addition in an incremental run (see the section on identifying bugs above for information on how to find compiled SQL).
Working with incremental models can affect how you approach various dbt-related workflows. See callouts in the individual step sections above related to incremental models for more details.
Helpful talks and presentations#
dbt at Cal-ITP introduction#
In 2022, Laurie gave a lunch and learn about why we are using dbt at Cal-ITP.
Coalesce#
Some folks from Data Services attended Coalesce (dbt’s conference) in 2022 and thought the following talks may be of interest:
The accidental analytics engineer by Michael Chow - this talk outlines some differences Michael has experienced between R/tidyverse and dbt/MDS (modern data stack) approaches to working with data
dbt and MDS in small-batch academic research - this talk outlines some benefits this researcher found to using dbt in an academic context; note that he uses DuckDB (instead of BigQuery)