dbt Fundamentals

dbt stands for data build tool, and it is designed to simplify the management of data warehouses and transform the data within.

Primarily defines data models and transformations using SQL:
- Defines relationships between data models
- Runs the data transformation processes
- Test for data quality requirements

It primarily handles the T in the ELT/ETL

It provides the ability to use SQL across teams of multiple users

Allows easy switching between data warehouses: dbt provides SQL dialects translations which facilitate the connections

There is dbt core which is open-source and dbt-cloud which is a SaaS

Dbt structure

What is a dbt project?

Projects encompass all the components for working with data within dbt.

A project is a structured collection of files that define how to transform and organize your data.

A project includes:

Project name
Folder name
Data sources
Data destinations
SQL queries and templates – how to acces and transform the data into the desired formats
Documentation for the data
Relationships within it

A project is implemented as a folder structure placed into a source control as needed

To start a project, you only need to type dbt init, and you will get asked:

Project name
Which database/data warehouse type

This command creates the top level project folder and all needed structure

Project Profile

Within dbt, a profile is like a deployment scenario. This includes:

Development
Staging
Testing
Production

A dbt project can have multiple profiles allowing multiple configurations for deployment scenario. The profiles are defined in the profiles.yml

yml – yet another markup language. Text file but spacing matters
DuckDB – opensource severless database like sqlite

The command dbt debug can be used to test your dbt configurations and verify connectivity with your data sources.

Workflow for dbt

Create project (dbt init)
Define configuration (profiles.yml)
Define data models -> you will spend more of the time here
Instantiate models (dbt run)
1. This command takes your source sql code and translate it as necessary for your deployment (profile) target
1. Executes the transformation process
Verify and test your data

dbt run

dbt run – performs the data transformations and pushes updates to the warehouse

Dbt model

What is a data model?

Defines the logical organization and interpretation of a dataset whether a database, table, Dataframe … It also represents how a set of data and its components relate to each other. The primary purpose of a data model is to help users collaborate and understand the data in a common way

A model in dbt represents something more specific than a basic data model. It represents the various transformations performed on the raw data source datasets.

The transformations are typically written in sql, some new versions allow python.

Each model contains a select query to transform the source data as desired

These queries are saved in a text file, with a sql extension

DBT automatically uses these files when tasked with operations such as dbt run

How to create a dbt model?

Create a directory in the models directory
Create a .sql file in above directory
Add the sql statement to the newly created file
Run dbt run to materialize the model

How to update a dbt model?

An advantage of working with dbt is to easily make changes to a project without writing new queries/models from scratch each time

Check out from source control
1. Git clone dbt_project
Find the model in question
Update query contents – you will update the .sql file
Regenerate with
1. dbt run or
1. dbt run -f (Force full refresh)
Check changes back to source control

You can also modify:

YAML files
- Dbt_project.yml
- Model_properties.yml

Creating documentation in dbt

What elements are included in documentation?

Model definitions
Details about columns within models
Show data lineage / DAG
Document any test/validations
View generated warehouse information:
- Column data types
- Data sizes

To generate documentation, dbt provides a subcommand:

dbt docs

To create the documentation website based on a project, we use this command:

dbt docs generate

This will update as we add models, tests, etc. We should run this command after dbt run

To see the documentation, you can run:

 dbt docs serve

Jinga

To simplify the creation of our dbt models and other objects

What is a template?

A defined format that allows dbt to substitute information automatically.

What is jinja?

A simple text-based templating system used in many tools.

To define a jinja template, simply put the desire content between two opening and closing curly braces within your text files. When dbt is run, it will replace the contents of the braces with the correct result.

{{ to_replace }}

Hierarchical models in dbt

It represents the dependencies within a dbt project, meaning the relationship between source and transformed data.

A hierarchical model in dbt is known as a DAG or lineage graph.

The purpose of the DAG or hierarchy is to allow models to be built and updated with their dependencies in mind. dbt must determine the order that models be built and run accordingly.

Why do we need hierarchy on dbt?

Imagine you want to create a view, but to create a view, you need the tables to be created first. Having a lineage graphs help you define that the table should be created first and then the tables

How are hiercarchies defined?

Built using the jinja template langiagie in the model file
Most often using the ref function
Replace table name with {{ ref(‘model_name’) }} in SQL
(Re-) materialize table with dbt run
Ref templates are substituted with actual table names