dbt stands for data build tool, and it is designed to simplify the management of data warehouses and transform the data within.
- Primarily defines data models and transformations using SQL:
- Defines relationships between data models
- Runs the data transformation processes
- Test for data quality requirements
It primarily handles the T in the ELT/ETL
- It provides the ability to use SQL across teams of multiple users
- Allows easy switching between data warehouses: dbt provides SQL dialects translations which facilitate the connections
- There is dbt core which is open-source and dbt-cloud which is a SaaS
Dbt structure
What is a dbt project?
Projects encompass all the components for working with data within dbt.
A project is a structured collection of files that define how to transform and organize your data.
A project includes:
- Project name
- Folder name
- Data sources
- Data destinations
- SQL queries and templates – how to acces and transform the data into the desired formats
- Documentation for the data
- Relationships within it
A project is implemented as a folder structure placed into a source control as needed
To start a project, you only need to type dbt init, and you will get asked:
- Project name
- Which database/data warehouse type
This command creates the top level project folder and all needed structure
Project Profile
Within dbt, a profile is like a deployment scenario. This includes:
- Development
- Staging
- Testing
- Production
A dbt project can have multiple profiles allowing multiple configurations for deployment scenario. The profiles are defined in the profiles.yml
yml – yet another markup language. Text file but spacing matters
DuckDB – opensource severless database like sqlite
The command dbt debug can be used to test your dbt configurations and verify connectivity with your data sources.
Workflow for dbt
- Create project (
dbt init) - Define configuration (profiles.yml)
- Define data models -> you will spend more of the time here
- Instantiate models (
dbt run)- This command takes your source sql code and translate it as necessary for your deployment (profile) target
- Executes the transformation process
- Verify and test your data
dbt run
dbt run – performs the data transformations and pushes updates to the warehouse
Dbt model
What is a data model?
Defines the logical organization and interpretation of a dataset whether a database, table, Dataframe … It also represents how a set of data and its components relate to each other. The primary purpose of a data model is to help users collaborate and understand the data in a common way
A model in dbt represents something more specific than a basic data model. It represents the various transformations performed on the raw data source datasets.
The transformations are typically written in sql, some new versions allow python.
Each model contains a select query to transform the source data as desired
These queries are saved in a text file, with a sql extension
DBT automatically uses these files when tasked with operations such as dbt run
How to create a dbt model?
- Create a directory in the models directory
- Create a .sql file in above directory
- Add the sql statement to the newly created file
- Run
dbt runto materialize the model
How to update a dbt model?
An advantage of working with dbt is to easily make changes to a project without writing new queries/models from scratch each time
- Check out from source control
- Git clone dbt_project
- Find the model in question
- Update query contents – you will update the .sql file
- Regenerate with
dbt runor
dbt run -f(Force full refresh)
- Check changes back to source control
You can also modify:
- YAML files
- Dbt_project.yml
- Model_properties.yml
Creating documentation in dbt
What elements are included in documentation?
- Model definitions
- Details about columns within models
- Show data lineage / DAG
- Document any test/validations
- View generated warehouse information:
- Column data types
- Data sizes
To generate documentation, dbt provides a subcommand:
dbt docs
To create the documentation website based on a project, we use this command:
dbt docs generate
This will update as we add models, tests, etc. We should run this command after dbt run
To see the documentation, you can run:
dbt docs serve
Jinga
To simplify the creation of our dbt models and other objects
What is a template?
A defined format that allows dbt to substitute information automatically.
What is jinja?
A simple text-based templating system used in many tools.
To define a jinja template, simply put the desire content between two opening and closing curly braces within your text files. When dbt is run, it will replace the contents of the braces with the correct result.
{{ to_replace }}
Hierarchical models in dbt
It represents the dependencies within a dbt project, meaning the relationship between source and transformed data.
A hierarchical model in dbt is known as a DAG or lineage graph.
The purpose of the DAG or hierarchy is to allow models to be built and updated with their dependencies in mind. dbt must determine the order that models be built and run accordingly.
Why do we need hierarchy on dbt?
Imagine you want to create a view, but to create a view, you need the tables to be created first. Having a lineage graphs help you define that the table should be created first and then the tables
How are hiercarchies defined?
- Built using the jinja template langiagie in the model file
- Most often using the ref function
- Replace table name with {{ ref(‘model_name’) }} in SQL
- (Re-) materialize table with dbt run
- Ref templates are substituted with actual table names