How to Integrate dbt and Great Expectations

This blog post provides a step-by-step guide to integrating dbt (Data Build Tool) and Great Expectations to improve data quality, using e-commerce analytics as an example. It covers the entire process, from setting up a project and transforming raw data into analytics-ready tables to implementing data quality tests against a sample e-commerce dataset in Snowflake.


Data quality is crucial for making good data-driven decisions. As data volumes grow exponentially, automating data testing and documentation is key to maintaining high-quality data in analytics databases. dbt (Data Build Tool) and Great Expectations are two popular open-source frameworks that tackle different aspects of managing analytics engineering pipelines.
In this post, we will walk through an end-to-end example of integrating dbt and Great Expectations using an e-commerce dataset. We will load sample e-commerce order data into a Snowflake data warehouse. dbt will help us transform the raw data into analytics-ready tables and views. Great Expectations will then be used to create data quality test suites that validate the analytics dbt models.

Prerequisites for integrating dbt and Great Expectations

Before we begin, let's outline the key items we need:

  • Access to a Snowflake account where the raw e-commerce data is loaded
  • A database called ECOMM_ANALYTICS created in Snowflake
  • dbt installed and configured to connect to Snowflake
  • Great Expectations installed and connected to Snowflake
  • A Git/GitHub repository to manage the dbt and Great Expectations config as code
  • A basic understanding of SQL, dbt, and Great Expectations

Sample E-Commerce Data Model

Our raw sample e-commerce data has information on customer orders, products, payments and order reviews stored in Snowflake stages.

Here is a snippet of what the raw orders data looks like:

order_id | customer_id | order_date | order_value
1        | 100         | 2022-01-01 | 99.99
2        | 200         | 2022-01-05 | 149.99

And the products data:

product_id | product_name | category | cost_price | selling_price
1001       | T-shirt      | Apparel  | 20         | 49.99
1002       | Jeans        | Apparel  | 30         | 99.99

We want to transform this raw data into an analytics dataset with clean, derived metrics such as order revenue and product margins.
Here is what our target analytics data model looks like:

  • stg_orders - staged orders data
  • stg_products - staged products data
  • stg_payments - staged payments data
  • stg_reviews - staged reviews data
  • fct_orders - derived metrics like order revenue, profit
  • dim_customers - clean customer dimension
  • dim_products - clean product dimension

Together, these models form the end-to-end analytics pipeline.

How to set up the dbt project

We first set up a dbt project to build the transformations and generate the analytics dataset.
Initialize a dbt project called ecomm_dbt and configure its connection to the Snowflake database ECOMM_ANALYTICS:


# dbt_project.yml

name: 'ecomm_dbt' 
version: '1.0.0'
config-version: 2

profile: ecomm_snowflake

Define the source and target schemas in the schema.yml file:


# schema.yml

version: 2

sources:
  - name: ecommerce
    database: ECOMM_ANALYTICS
    schema: raw
    tables:
      - name: orders
      - name: order_costs
      - name: customers
      - name: products

models:
  - name: stg_orders
    description: Staged raw orders data
    columns:
      - name: order_id
        description: Primary key
        tests:
          - unique
          - not_null

  - name: dim_customers
    description: Clean customer dimension
    columns:
      - name: customer_id
        description: Customer ID
        tests:
          - unique
          - not_null

  - name: fct_orders
    description: Order fact table
    columns:
      - name: order_profit
        description: Order profit margin

Build the dbt SQL models to transform raw data:


-- stg_orders.sql 

select * from {{ source('ecommerce','orders') }}


-- dim_customers.sql

select
    customer_id,
    first_name,
    last_name
from {{ source('ecommerce','customers') }}


-- fct_orders.sql
-- order_costs is assumed to share the order_id key with orders

select
    o.order_id,
    o.order_date,

    o.order_value as order_revenue,

    o.order_value - c.order_cost as order_profit

from {{ source('ecommerce','orders') }} o

inner join {{ source('ecommerce','order_costs') }} c
    on o.order_id = c.order_id

Run tests on the dbt project:

  • dbt compile to catch SQL errors
  • dbt test to run the schema tests defined in schema.yml
  • A sample singular test:

-- test_fct_orders.sql
-- Singular test: returns a row (and therefore fails) if fct_orders
-- is empty or the average order profit is not positive

select
    count(*) as order_count,
    round(avg(order_profit), 2) as profit_avg,
    max(order_profit) as profit_max
from {{ ref('fct_orders') }}
having count(*) = 0 or avg(order_profit) <= 0

Great Expectations Integration

Next, we set up Great Expectations to define data quality expectations and validate the analytics dataset.

Initialize a Great Expectations project alongside the dbt project:

great_expectations --v3-api init

Create a datasource pointing to the ECOMM_ANALYTICS database so Great Expectations can read the tables built by dbt, then create an expectation suite for the models:


great_expectations --v3-api datasource new

great_expectations --v3-api suite new
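
If you prefer to keep the datasource config in code, the same setup can be done with the Python API. Here is a minimal sketch assuming a SQLAlchemy-based Snowflake connection; the file name, datasource name, connection string, and schema are illustrative and should be adjusted to your account:


# configure_datasource.py -- sketch with illustrative names and credentials
import great_expectations as ge

context = ge.get_context()

datasource_config = {
    "name": "ecomm_snowflake",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        # placeholder connection string: substitute your user, password, account, warehouse and role
        "connection_string": "snowflake://USER:PASSWORD@ACCOUNT/ECOMM_ANALYTICS/ANALYTICS?warehouse=COMPUTE_WH&role=ANALYST",
    },
    "data_connectors": {
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetSqlDataConnector",
            "include_schema_name": True,
        }
    },
}

context.add_datasource(**datasource_config)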

Auto-generate expectations for dim_customers:


expect_column_values_to_not_be_null -> customer_id
expect_column_values_to_be_unique -> customer_id
expect_column_values_to_be_in_type_list -> all columns
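
These map to entries in an expectation suite. Here is a minimal sketch of building the dim_customers suite with the Python API, reusing the context and datasource from the snippet above; the suite name and data asset name are assumptions and depend on how your data connector names the tables:


# build_dim_customers_suite.py -- sketch, reuses the context and datasource above
from great_expectations.core.batch import BatchRequest

suite_name = "dim_customers.suite"
context.create_expectation_suite(suite_name, overwrite_existing=True)

validator = context.get_validator(
    batch_request=BatchRequest(
        datasource_name="ecomm_snowflake",
        data_connector_name="default_inferred_data_connector_name",
        data_asset_name="analytics.dim_customers",  # assumed schema-qualified asset name
    ),
    expectation_suite_name=suite_name,
)

validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_unique("customer_id")
validator.save_expectation_suite(discard_failed_expectations=False)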

Add custom expectations for the metrics in fct_orders:


expect_column_min_to_be_between -> order_profit: min 5

expect_column_max_to_be_between -> order_revenue: between 500 and 50000

expect_column_kl_divergence_to_be_less_than -> order_profit: 0.6
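
The same pattern works for the fct_orders metrics. A short sketch mirroring the thresholds above (the KL-divergence expectation also requires a reference partition object, so it is left out here; the suite and asset names are again assumptions):


# build_fct_orders_suite.py -- sketch, same context and BatchRequest import as above
context.create_expectation_suite("fct_orders.suite", overwrite_existing=True)

validator = context.get_validator(
    batch_request=BatchRequest(
        datasource_name="ecomm_snowflake",
        data_connector_name="default_inferred_data_connector_name",
        data_asset_name="analytics.fct_orders",  # assumed asset name
    ),
    expectation_suite_name="fct_orders.suite",
)

validator.expect_column_min_to_be_between("order_profit", min_value=5)
validator.expect_column_max_to_be_between("order_revenue", min_value=500, max_value=50000)
validator.save_expectation_suite(discard_failed_expectations=False)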

Execute validation for the dim_customers expectation suite by creating and running a checkpoint:


great_expectations --v3-api checkpoint new dim_customers_checkpoint

great_expectations --v3-api checkpoint run dim_customers_checkpoint
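
The same validation can also be triggered from Python, which is handy in CI. A minimal sketch, assuming the checkpoint created above:


# run_validation.py -- sketch, assumes dim_customers_checkpoint has been configured
import great_expectations as ge

context = ge.get_context()
result = context.run_checkpoint(checkpoint_name="dim_customers_checkpoint")

if not result["success"]:
    raise SystemExit("dim_customers validation failed")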

Next Steps

Here are some ways we can build on this data quality framework:

  1. Schedule a batch workflow for daily validation runs (a minimal sketch follows this list)
  2. Set up Data Docs to publish data quality results and metrics
  3. Expand test coverage to more edge cases
  4. Enable data consumers to view data quality issues and fixes
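
For step 1, here is a minimal sketch of a daily validation job that rebuilds the dbt models and then runs the Great Expectations checkpoint. It assumes dbt and the great_expectations CLI are on the PATH and the checkpoint name from above, and would be triggered by cron, Airflow, or another orchestrator:


# daily_validation.py -- sketch of a scheduled data quality run
import subprocess

def run(cmd):
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

def daily_validation():
    run(["dbt", "run"])   # rebuild staging, dim and fct models
    run(["dbt", "test"])  # schema tests and singular tests
    run(["great_expectations", "--v3-api", "checkpoint", "run", "dim_customers_checkpoint"])

if __name__ == "__main__":
    daily_validation()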

Conclusion

In this post, we walked through a detailed example of integrating dbt and Great Expectations for an e-commerce pipeline using code snippets. Defining rigorous data quality tests is key to identifying issues early and maintaining high-quality datasets.
Hopefully this gives you a blueprint for implementing automated data testing and documentation in your own data transformation pipelines!
