Ideal Transformation Steps in dbt

otasium

Otwoma E.

Posted on October 11, 2022

Ideal Transformation Steps in dbt

Analytics is a cornerstone of decision making at many of the world's most successful institutions. Having the right tools to make the most of your data is therefore key in maintaining an advantage over competitors and innovating at the cutting edge. Everyday, more than 2.5 quintillion bytes of raw data is produced; that's 18 zeros. Transforming this massive amount of data is therefore a key step towards making it useful for any practical applications.
dbt (data build tool) is a powerful tool to transform raw data into various forms to allow the generation of insights for any organization. The modern data stack can be broken down to three key steps, Extraction, Transformation and Loading, aka ETL. dbt handles the T in ETL. In order to create an efficient data warehouse, dbt enables a stepwise data transformation process that allows for flexibility and modularity that supports version control like conventional software engineering.

Sources (src_) - Raw Data

The first layer in your transformation process is the ingestion of raw data from your production systems. This data can be stored on Snowflake, S3 or BigQuery, dbt allows for integration from a variety of different sources. No additional transformation and are essentially a base for the rest of the transformation process. These tables have names with the prefix src_... .sql and follow the dbt snake-case naming convention.

Staging (stg_) - Lightly Transformed Raw Data

Typically materialized as views, these are tables with a one-to-one relationship with sources tables. They contain very light transformations that clean and standardize the raw data before any further modifications are done downstream. For instance, you may want to change the names of certain columns to match business logic or avoid confusion with existing logic, truncate dates or change the currency or a payment column.
Make sure to limit your transformations to these four:

  • Type casting
  • Simple calculations
  • Categorizing data (think CASE WHEN ... statements)
  • Renaming columns If a staging table is built on top of a source table names src_farmer_payments.sql then the staging table built on top of it can be called stg_farmer_payments.sql.

Intermediate (int_) Models

After transforming data to the staging models, you may find it useful to have another layer before creating the final fact or dimension models that are the final marts that go in yoru data warehouse. Intermediate models are useful for performing slightly more intricate calculations meant to enrich the data before a mart is created. For instance, based on stg_farmer_payments.sql from the examples above, we can have int_farmer_payments_pivoted.sql which provides a different look at the data and can be used to create a different mart.
Intermediate models should be built on staging models to leverage the transformations already done on raw data. You can use intermediate models for:

  • Isolating complex processes in your transformation process for easier debugging later
  • Collapse a column to constituent parts before using it to create a mart, e.g ungrouping payments or undoing a pivot.
  • Simplifying data marts - put some joins in intermediate models to have simpler data marts.

Fact (fct_) Models

The next step in the dbt transformation process is creating fact models that contain immutable data. This is typically data about events that have occurred or are currently happening. For instance, this can be data about user sessions, transactions, orders, votes, stories, sales calls etc. They are typically narrow and long tables. The naming convention is fct_ ... .sql so we could have fct_payment_receipts.sql to track each payment received by a company.

Dimensions (dim_) Models

These models are based on previous models that are already transformed. Typically used to represent data about a person, place or thing. These models are very wide and short. For instance, you can use these to store all the static information about a patient in a hospital, or about a bank branch or a city landmark. Naming convention is as follows dim_patients.sql, dim_students.sql, you get the idea.

Custom Models

It is entirely possible that your organization has special data and business logic that would necessitate the creation of a new model with a different name prefix. dbt allows you to do this, however, you should endeavour to maintain dbt folder structure recommendations to ensure your project is clean.

Conclusion

These transformation steps are inline with the recommended dbt folder structure. You can expected to see src_ tables in the Sources folder, dim_ in the Dimensions folder and so forth. You can read more about dbt folder structure and transformations on the dbt docs page.

💖 💪 🙅 🚩
otasium
Otwoma E.

Posted on October 11, 2022

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related