blog-bg.jpg

BLOG

BLOG

DBT. New Pipeline Technology

With the advent of such productive and scalable analytical databases as BigQuery, Redshift, Snowflake, there is no point in making transformations outside the Data Warehouse. DBT (Data Build Tool) provides a common foundation for data engineers, allowing quality data transformation pipelines to be built. DBT is a multifunctional tool for creating transformations in a warehouse. It is designed to work with data that is already loaded into your storage, but you need to make some transformations on it to prepare for use. DBT does not upload data from sources. It provides vast opportunities to work with data already loaded into the Warehouse.


How DBT Works

DBT allows users to describe the necessary transformations as conveniently and quickly as possible and fill Data Mart with data. Almost all that is required of the user is to define the required data structures with SELECT-queries. All the rest of the framework will be dealt with by itself.



The DBT framework works as follows:

  • the user creates model code in the development environment;

  • models are run via the CLI-interface;

  • DBT compiles model code into SQL-queries;

  • each SQL model includes a SELECT query that defines the result - the final set of data;

  • model code is a mixture of SQL and Jinja template language;

  • compiled SQL-code is executed in the repository in the form of a task graph or a model dependency tree, DAG (Directed Acyclic Graph);

  • DBT builds a graph based on the configurations of all models of the project, taking into account their references to each other to run the models in the desired sequence and generate data marts in parallel.


In addition to forming the models themselves, DBT allows users to test assumptions about the result dataset, including uniqueness, referential integrity, compliance with the list of valid values, and Not Null. Custom data tests can also be added as SQL-queries.

Thanks to adapters, DBT supports the following databases and data stores: Postgres, Redshift, BigQuery, Snowflake, Presto, Apache Spark SQL, Databricks, etc. Users can also create their own adapter to integrate with another storage using a materialization strategy.

DBT provides mechanisms for adding, versioning, and distributing metadata and comments at the model and even individual attributes level. Macros as a set of constructs and expressions that can be invoked as functions within models allow you to reuse SQL between models and projects. DBT's built-in package manager allows users to publish and reuse individual modules and macros.


In combination with Cloud Data Warehouse, DBT is a revolutionary technology because it allows users to prepare data for BI (Business Intelligence) using SQL alone.



Connect with DataEngi


54 views2 comments