Developing models in dbt
Developing models in dbt¶
Information related to contributing to the Cal-ITP dbt project.
If you have questions specific to our project or you encounter any issues when developing, please bring those questions to the
#data-office-hoursCal-ITP Slack channels. Working through questions “in public” helps build shared knowledge that’s searchable later on.
For Cal-ITP-specific data warehouse documentation, including high-level concepts and naming conventions, see our Cal-ITP dbt documentation site. This documentation is automatically generated by dbt, and incorporates the table- and column-level documentation that developers enter in YAML files in the dbt project.
How to contribute to the dbt project¶
To get set up to contribute to the dbt project via JupyterHub, follow the README in the data-infra repo warehouse folder. If you hit any trouble with setup, let folks know in the
#data-office-hours Cal-ITP Slack channels.
We recommend that everyone who does dbt development joins the
#data-warehouse-devs channel in the Cal-ITP Slack to ask questions, collaborate, and build shared knowledge.
See next section for modeling considerations
This section describes the high-level mechanics/process of the developer workflow to edit the dbt project. Please read the next section for things you should consider from the data modeling perspective.
To test your work while developing dbt models, you can edit the
.sql files for your models, save your changes, and then run the model from the command line to execute the SQL you updated.
To inspect tables as you are working, the fastest method is usually to run some manual test queries or “preview” the tables in the BigQuery user interface. You can also use something like
pandas.read_gbq to perform example queries in a notebook.
When you run dbt commands locally on JupyterHub, your models will be created in the
cal-itp-data-infra-staging.<your name>_<dbt folder name, like mart_gtfs> BigQuery dataset. Note that this is in the
cal-itp-data-infra-staging Google Cloud Platform project, not the production
Once your models are working the way you want and you have added all necessary documentation and tests in YAML files (see below for more on modeling, documentation, and testing considerations), you are ready to merge.
Because the warehouse is collectively maintained and changes can affect a variety of users, please open PRs against
main when work is ready to merge and keep an eye out for comments and questions from reviewers, who might require tweaks before merging.
For an example of working with dbt in JupyterHub, see the recording of the original onboarding call in April 2023 (requires Cal-ITP Google Drive access). A few notes on this video:
The documentation shown is an older version of this docs page; the information shared verbally is correct but the page has been updated.
The bug encountered towards the end of the video (that prevented us from running dbt tests) has been fixed.
The code owners mentioned in the video have changed; consult in Slack for process guidance.
When developing or updating dbt models, there are some considerations which may differ from a notebook-based analysis. These can be thought of as a checklist or decision tree of questions that you should run through whenever you are editing or creating a dbt model. Longer explanations of each item are included below.
Identify the cause of your bug¶
Example bug troubleshooting walkthroughs
Here is a series of recordings showing a workflow for debugging a failing dbt test. The resulting PR is #2892.
Usually, bugs are caused by:
New or historical data issues. For example, an agency may be doing something in their GTFS data that we didn’t expect and this may have broken one of our models. This can happen with brand new data that is coming in or in historical data that wasn’t included in local testing (this is especially relevant for RT data, where local testing usually includes a very small subset of the full data.)
GTFS or data schema bugs. Sometimes we may have misinterpreted the GTFS spec (or another incoming data model) and modeled something incorrectly.
SQL bugs. Sometimes we may have written SQL incorrectly (for example, used the wrong kind of join.)
How to investigate the bug depends on how the bug was noticed.
If there was a failing dbt test, you can
dbt compile locally to compile the project SQL. You can then find the SQL for the failing test (follow the dbt testing FAQ under “one of my tests failed, how can I debug it?” to find the compiled test SQL). Run that SQL in BigQuery to see the rows that are failing.
dbt compile locally, you will compile SQL that’s pointed at the staging project and your namespaced dataset. Make sure to change those references when you run the compiled SQL. So,
cal-itp-data-infra-staging.laurie_mart_gtfs.fct_scheduled_trips would become
If you noticed an issue that wasn’t caused by a failing test, you can start with the model that you noticed the problem in.
In either case, you may need to consider upstream models. To identify your model’s parents, you can look at the dbt docs website page for your model. See the dbt docs for how to look at the model’s lineage. You can modify the model selector in the bottom middle to just
+<your model name> to only see the model’s parents. You can also run
poetry run dbt ls -s +<your model> --resource-type model to see a model’s parents just on the command line. Try to figure out where the root cause of the problem is occurring. This may involve running ad-hoc SQL queries to inspect the models involved.
Should it be a dbt model?¶
Changes to dbt models are likely to be appropriate when one or more of the following is true:
There is a consistent or ongoing need for this data. dbt can ensure that transformations are performed consistently at scale, every day.
The data is big. Doing transformations in BigQuery can be more performant than doing them in notebooks or any workflow where the large data must be loaded into local memory.
We want to use the same data across multiple domains or tools. The BigQuery data warehouse is the easiest way to provide consistent data throughout the Cal-ITP data ecosystem (in JupyterHub, Metabase, open data publishing, the reports site, etc.)
dbt models may not be appropriate when:
You are doing exploratory data analysis, especially on inconsistently-constructed data. It will almost always be faster to do initial exploration of data via Jupyter/Python than in SQL. If you only plan to use the data for a short period of time, or plan to reshape it many speculatively before you settle on a more long-lived form, you probably don’t need to represent it with a dbt model quite yet.
You want to apply a simple transformation (for example, a grouped summary or filter) to answer a specific question. In this case, it may be more appropriate to create a Metabase dashboard with the desired transformations.
What is the grain of your model?¶
Grain means “what does a row represent”. For example, for: Do you want one row per route per day? One row per fare transaction? One row per organization per month?
This concept of grain can be one of the biggest differences between notebook-based analysis and warehouse analytics engineering. In notebooks, you may be making a lot of transformations and saving each step out as its own dataframe, and you may use functions for reusable transformation steps. In warehouse development, we want to be focused on making reusable models where the data itself is the common building block across analyses. That often means trying to make only one table in the warehouse for each grain, regardless of how many different types of analysis it might be used for.
Is there already a model with this grain?¶
If there is already a model with the grain you are targeting, you should almost always add new columns to that existing model rather than making a new model with the same grain.
fct_scheduled_trips. This is our core trip-level table. Every scheduled trip should have a row in this model and attributes that you might want from that trip should be present for easy access. As a result, this table has a lot of columns, because when we need new information about trips, we add it here. For example, when we wanted to fix time zone handling for trips, we added those columns instead of creating a new model.
To figure out if there is a model with your desired grain, you can search the dbt docs for relevant terms. For example, if you want a table of routes, you can search “routes” to see what models already exist. You can also explore the dependency tree for a related table (like
dim_routes) to see if you can find a table that looks like it has the right grain. You can also see our dbt docs homepage for a discussion of table naming conventions to interpret dimension, fact, and bridge tables.
For dimensions you may need to think more about whether you are truly making a new dimension, or whether you are simply applying a filter on an existing dimension (for example,
dim_bus_vehicles could be a subset of an existing
dim_vehicles dimension model, in which case you could just add a boolean column
dim_vehicles rather than making a new dedicated model.)
Make your changes¶
The kind of changes you make will depend on what you’ve discovered in previous steps. Fixing a bug might involve changing model SQL, changing tests to reflect a new understanding, or something else. Adding a column might involve a simple change on one model or require updating several parent models. Creating a new model for brand new data may involve a new external table or it might be a straightforward transformation just in the mart. Some examples of different types of changes are listed below.
If you find yourself making big changes that seem likely to significantly affect other users, you may need to step back and convene a conversation to make sure that everyone is on board; see for example this Google Doc about Airtable schema changes where stakeholders confirmed how they wanted to handle schema changes in the warehouse. The downstream impacts section below has suggestions for how to assess the impacts of your changes.
Example bug fix PRs¶
Here are a few example
data-infra PRs that fixed past bugs:
PR #2076 fixed two bugs: There was a hardcoded incorrect value in our SQL that was causing Sundays to not appear in our scheduled service index (SQL syntax bug), and there was a bug in how we were handling the relationship between
calendar(GTFS logic bug).
PR #2623 fixed bugs caused by unexpected calendar data from a producer.
Example new column PRs¶
Here are a few example
data-infra PRs that added columns to existing models:
PR #2778 is a simple example of adding a column that already exists in staging to a mart table.
For intermediate examples of adding a column in a staging table and propagating it through a few different downstream models, see
PR #2383 adds a column to Airtable data end-to-end (starting from the raw data/external tables; this involves non-dbt code).
Test your changes¶
Once you have made some changes, it is important to test them.
Different types of testing
Functional testing during development is different than adding dbt tests (described below). dbt tests ensure some floor of model validity over time; while developing, you should run more holistic tests to ensure that your code is working as expected.
The first step is running your changes in the test/staging environment. You can run a command like
poetry run dbt run -s +<your model> to run your model and its antecedents. Your models will be created in the
cal-itp-data-infra-staging.<your name>_<dbt folder name, like mart_gtfs> BigQuery dataset. Note that this is in the
cal-itp-data-infra-staging Google Cloud Platform project, not the production
What to test/check will vary based on what you’re doing, but below are some example things to consider.
Are the values in your column/model what you expect? For example, are there nulls? Does the column have all the values you anticipated (for example, if you have a day of the week column, is data from all 7 days present)? If it’s numeric, what are the minimum and maximum values; do they make sense (for example, if you have a percentage column, is it always between 0 and 100)? What is the most common value?
To check nulls:
SELECT * FROM <your model> WHERE <your column> IS NULL
To check distinct values in a column:
SELECT DISTINCT <your column> FROM <your model>
To check min/max:
SELECT MIN(<your column>), MAX(<your_column>) FROM <your model>
To check most common values:
SELECT <your_column>, COUNT(*) AS ct FROM <your model> GROUP BY 1 ORDER BY ct DESC
Row count and uniqueness¶
To confirm that the grain is what you expect, you should check whether an anticipated unique key is actually unique. For example, if you were making a daily shapes table, you might expect that
date + feed_key + shape_id would be unique. Similarly, you should have a ballpark idea of the order of magnitude of the number of rows you expect. If you’re making a yearly organizations table and your table has a million rows, something is likely off. Some example queries could be:
To check row count:
SELECT COUNT(*) FROM <your model>
To check row count by some attribute (for example, rows per date):
SELECT <your column>, COUNT(*) AS ct FROM <your model> GROUP BY 1 ORDER BY 1
To check uniqueness based on a combination of a few columns:
WITH tbl AS ( SELECT * FROM <your model> ), dups AS ( SELECT <column 1>, <column 2>, <column 3>, COUNT(*) AS ct FROM tbl -- adjust this based on the number of columns that make the composite unique key GROUP BY 1, 2, 3 HAVING ct > 1 ) SELECT * FROM dups LEFT JOIN tbl USING (<column 1>, <column 2>, <column 3>) ORDER BY <column 1>, <column 2>, <column 3>
While testing, you should keep an eye on the performance (cost/data efficiency) of the model:
When you run the dbt model locally, look at how many bytes are billed to build the model(s).
After you run test queries, look at the total bytes billed after the fact in the Job Information tab in the Query results section of the BigQuery console.
If the model takes more than 100 GB to build, or if test queries seem to be reading a lot of data (this is subjective; it’s ok to build a sense over time), you may want to consider performance optimizations.
Below are a few options to improve performance. Data infra PR #2711 has examples of several different types of performance interventions.
If the model is expensive to build: First, try to figure out what specific steps are expensive. You can run individual portions of your model SQL in the BigQuery console to assess the performance of individual CTEs.
If the model involves transformations on a lot of data that doesn’t need to be reprocessed every day, you may want to make the model incremental. You can run
poetry run dbt ls -s config.materialized:incremental --resource-type modelto see examples of other incremental models in the repo.
If the model reads data from an expensive parent table, you may want to consider leveraging clustering or partitioning on that parent table to make a join or select more efficient. See this comment on data infra PR #2743 for an example of a case where changing a join condition was a more appropriate performance intervention than making the table incremental.
If the model is expensive to query: The main interventions to make a model more efficient to query involve changing the data storage.
Incremental models have two different run modes: full refreshes (which re-process all historical data available) and incremental runs that load data in batches based on your incremental logic. These two modes run different code.
If you make your table incremental, you should make sure to run both a full refresh (use the
--full-refresh flag) and an incremental run (after the table has already been built once; no flag) in your testing to ensure that both are working as expected.
Another important consideration is the potential downstream impacts of your changes, particularly if you are changing existing models.
You can run dbt tests on the downstream models using
poetry run dbt test -s <your_model>+. You should make sure that your changes do not cause new test failures in downstream models.
Check which models are downstream of your changes using
poetry run dbt ls -s <your_model>+ --resource-type model. If your model has a lot of descendents, consider performing additional tests to ensure that your changes will not cause problems downstream.
To check for impacts on defined downstream artifacts (like the reports site and open data publishing), you can check which exposures are downstream of your model using
poetry run dbt ls -s <your_model>+ --resource-type exposure.
Other questions will be more specific to your changes or goals, but it’s usually a good idea to take a second and brainstorm things that you would expect to be true and check whether your model reflects them. For example, we expect more trip activity during AM/PM peak periods than in the middle of the night; is that true in your model? What is the balance of weekend to weekday activity in your model, and does it make sense for the context?
Add dbt tests and documentation¶
Once you are satisfied with your changes, you should add tests and documentation, both of which are vital to keeping the project maintainable over time.
dbt tests help us ensure baseline model validity and guarantees over time (for example: “this ID is unique”). A dbt test failure should be something that you’d want to fix quickly to ensure models work for downstream users. So, a test failure should be something you’d want to act on by doing something like fixing, dropping, or adding a warning flag on failing rows. A test can also be thought of an assertion: a not-null test on a column asserts that that column is never null.
dbt tests are run every day in Airflow and alert when models fail. Because they run every day and execute SQL code, there is some tradeoff with cost: we don’t want to test too excessively because that could become wasteful.
We usually prefer to have tests on tables (rather than views) for cost reasons. Most tables, especially in mart datasets, should have at least a primary key test that tests that there is a unique, non-null column; this is one way to monitor that the grain of the model is stable and is not being violated.
You may want to find a model similar to the one you’re changing and see what tests that other model has.
Documentation in dbt helps different data users collaborate by explaining what a given model or column is. Documentation should answer the question: What information would someone else need to use this model/column effectively? All models should have a description and most columns should too.
Try to write a description that will make sense to future readers. It is helpful to be specific, for example saying “April 2023” instead of “now” or “current”.
Model documentation should make the grain clear.
Merge your changes¶
Once you have finished work, you should make a PR to get your changes merged into
main. PRs that sit and become stale may become problematic if other people make changes to models before they merge that cause them to behave unexpectedly.
Once your changes merge, if they will impact other users (for example by changing a high-traffic model), you may want to announce your changes on Slack in
#data-analysis, or a similar channel.
Incremental models downstream of your changes may require a full refresh after your changes merge.
To check for incremental models downstream of your model, run
poetry run dbt ls -s <your_model>+,config.materialized:incremental --resource-type model. If you need to refresh incremental models:
Incremental models are one of the trickier concepts to learn in dbt and in our warehouse. By default, dbt models process all of the available data identified by the SQL queries within them and essentially re-create each model from scratch every day. Incremental models are designed to process smaller batches of data and update existing models, rather than creating the model from scratch. This is helpful for large data (like GTFS RT and GTFS schedule stop times) where re-processing all of history every day is expensive.
Computational cost vs. complexity cost trade-off
Incremental models basically trade off a simple but computationally expensive approach (just process all of history every day) for a more complex approach, where daily processing is cheaper computationally but there is a higher cost in terms of time spent understanding or troubleshooting the model.
Most of the Cal-ITP incremental models use a shared
incremental_where macro to handle the incremental logic; by default in dbt you could use the
if is_incremental() checks directly in each incremental model, but we use the
incremental_where macro to store some shared handling for things like the concept of the GTFS RT and GTFS schedule data start dates.
The core question to ask when working with incremental models is: How do I identify the new rows that should be brought in on each new run?
You can compile the SQL for an incremental model and run it directly in BigQuery to inspect what rows are identified for addition in an incremental run (see the section on identifying bugs above for information on how to find compiled SQL).
Working with incremental models can affect how you approach various dbt-related workflows. See callouts in the individual step sections above related to incremental models for more details.
Helpful talks and presentations¶
dbt at Cal-ITP introduction¶
In 2022, Laurie gave a lunch and learn about why we are using dbt at Cal-ITP.
Some folks from Data Services attended Coalesce (dbt’s conference) in 2022 and thought the following talks may be of interest:
The accidental analytics engineer by Michael Chow - this talk outlines some differences Michael has experienced between R/tidyverse and dbt/MDS (modern data stack) approaches to working with data
dbt and MDS in small-batch academic research - this talk outlines some benefits this researcher found to using dbt in an academic context; note that he uses DuckDB (instead of BigQuery)