How to Integrate dbt and Great Expectations

This blog post provides a step-by-step guide to integrating dbt (Data Build Tool) and Great Expectations to improve data quality, using e-commerce analytics as an example. It covers the entire process, from setting up a project and transforming raw data into analytics-ready tables to implementing data quality tests against a sample e-commerce dataset in Snowflake.


Data quality is crucial for making good data-driven decisions. As data volumes grow exponentially, automating data testing and documentation is key to maintaining high-quality data in analytics databases. dbt (Data Build Tool) and Great Expectations are two popular open-source frameworks that tackle different aspects of managing analytics engineering pipelines.
In this post, we will walk through an end-to-end example of integrating dbt and Great Expectations using an e-commerce dataset. We will load sample e-commerce order data into a Snowflake data warehouse. dbt will help us transform the raw data into analytics-ready tables and views. Great Expectations will then be used to create data quality test suites that validate the analytics dbt models.

Prerequisites for integrating dbt and Great Expectations

Before we begin, let's outline the key items we need:

  • Access to a Snowflake account where the raw e-commerce data is loaded
  • A database called ECOMM_ANALYTICS created in Snowflake
  • dbt installed and configured to connect to Snowflake
  • Great Expectations installed and connected to Snowflake
  • A Git/GitHub repository to manage the dbt and Great Expectations config as code
  • A basic understanding of SQL, dbt, and Great Expectations

Sample E-Commerce Data Model

Our raw sample e-commerce data has information on customer orders, products, payments and order reviews stored in Snowflake stages.

Here is a snippet of what the raw orders data looks like:

order_id | customer_id | order_date | order_value
1        | 100         | 2022-01-01 | 99.99
2        | 200         | 2022-01-05 | 149.99

And the products data:

product_id | product_name | category | cost_price | selling_price
1001       | T-shirt      | Apparel  | 20         | 49.99
1002       | Jeans        | Apparel  | 30         | 99.99

We want to transform this raw data into an analytics dataset with clean, derived metrics such as order revenue and product margins.
Here is what our target analytics data model looks like:

  • stg_orders - staged orders data
  • stg_products - staged products data
  • stg_payments - staged payments data
  • stg_reviews - staged reviews data
  • fct_orders - derived metrics like order revenue, profit
  • dim_customers - clean customer dimension
  • dim_products - clean product dimension

Together, these models form the end-to-end analytics pipeline.

How to set up the dbt project

We first set up a dbt project to build the transformations and generate the analytics dataset.
Initialize a dbt project called ecomm_dbt and configure its connection to the Snowflake database ECOMM_ANALYTICS:


# dbt_project.yml

name: 'ecomm_dbt' 
version: '1.0.0'
config-version: 2

profile: ecomm_snowflake

Define the source and target schemas in the schema.yml file:


# schema.yml

version: 2

sources:
  - name: ecommerce
    database: ECOMM_ANALYTICS
    schema: raw
    tables:
      - name: orders
      - name: order_costs
      - name: customers
      - name: products

models:
  - name: stg_orders
    description: Staged raw orders data
    columns:
      - name: order_id
        description: Primary key
        tests:
          - unique
          - not_null

  - name: dim_customers
    description: Clean customer dimension
    columns:
      - name: customer_id
        description: Customer ID
        tests:
          - unique
          - not_null

  - name: fct_orders
    description: Order fact table
    columns:
      - name: order_profit
        description: Order profit margin

Build the dbt SQL models to transform raw data:


-- stg_orders.sql 

select * from {{ source('ecommerce','orders') }}


-- dim_customers.sql

select
    customer_id,
    first_name,
    last_name
from {{ source('ecommerce','customers') }}


-- fct_orders.sql
-- order_costs is assumed to share the order_id key with orders

select
    o.order_id,
    o.order_date,

    o.order_value as order_revenue,

    o.order_value - c.order_cost as order_profit

from {{ source('ecommerce','orders') }} o

inner join {{ source('ecommerce','order_costs') }} c
    on o.order_id = c.order_id

Run tests on the dbt project:

  • dbt compile to catch SQL errors
  • dbt test to run the schema tests defined in schema.yml
  • A sample singular test:

-- test_fct_orders.sql
-- Singular test: returns a row (and therefore fails) if fct_orders
-- is empty or the average order profit is not positive

select
    count(*) as order_count,
    round(avg(order_profit), 2) as profit_avg,
    max(order_profit) as profit_max
from {{ ref('fct_orders') }}
having count(*) = 0 or avg(order_profit) <= 0

Great Expectations Integration

Next, we set up Great Expectations to define data quality expectations and validate the analytics dataset.

Initialize a Great Expectations project alongside the dbt project:

great_expectations --v3-api init

Create a datasource pointing to the ECOMM_ANALYTICS database so Great Expectations can read the tables built by dbt, then create an expectation suite for the models:


great_expectations --v3-api datasource new

great_expectations --v3-api suite new
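
If you prefer to keep the datasource config in code, the same setup can be done with the Python API. Here is a minimal sketch assuming a SQLAlchemy-based Snowflake connection; the file name, datasource name, connection string, and schema are illustrative and should be adjusted to your account:


# configure_datasource.py -- sketch with illustrative names and credentials
import great_expectations as ge

context = ge.get_context()

datasource_config = {
    "name": "ecomm_snowflake",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        # placeholder connection string: substitute your user, password, account, warehouse and role
        "connection_string": "snowflake://USER:PASSWORD@ACCOUNT/ECOMM_ANALYTICS/ANALYTICS?warehouse=COMPUTE_WH&role=ANALYST",
    },
    "data_connectors": {
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetSqlDataConnector",
            "include_schema_name": True,
        }
    },
}

context.add_datasource(**datasource_config)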

Auto-generate expectations for dim_customers:


expect_column_values_to_not_be_null -> customer_id
expect_column_values_to_be_unique -> customer_id
expect_column_values_to_be_in_type_list -> all columns
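
These map to entries in an expectation suite. Here is a minimal sketch of building the dim_customers suite with the Python API, reusing the context and datasource from the snippet above; the suite name and data asset name are assumptions and depend on how your data connector names the tables:


# build_dim_customers_suite.py -- sketch, reuses the context and datasource above
from great_expectations.core.batch import BatchRequest

suite_name = "dim_customers.suite"
context.create_expectation_suite(suite_name, overwrite_existing=True)

validator = context.get_validator(
    batch_request=BatchRequest(
        datasource_name="ecomm_snowflake",
        data_connector_name="default_inferred_data_connector_name",
        data_asset_name="analytics.dim_customers",  # assumed schema-qualified asset name
    ),
    expectation_suite_name=suite_name,
)

validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_unique("customer_id")
validator.save_expectation_suite(discard_failed_expectations=False)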

Add custom expectations for the metrics in fct_orders:


expect_column_min_to_be_between -> order_profit: min 5

expect_column_max_to_be_between -> order_revenue: between 500 and 50000

expect_column_kl_divergence_to_be_less_than -> order_profit: 0.6
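
The same pattern works for the fct_orders metrics. A short sketch mirroring the thresholds above (the KL-divergence expectation also requires a reference partition object, so it is left out here; the suite and asset names are again assumptions):


# build_fct_orders_suite.py -- sketch, same context and BatchRequest import as above
context.create_expectation_suite("fct_orders.suite", overwrite_existing=True)

validator = context.get_validator(
    batch_request=BatchRequest(
        datasource_name="ecomm_snowflake",
        data_connector_name="default_inferred_data_connector_name",
        data_asset_name="analytics.fct_orders",  # assumed asset name
    ),
    expectation_suite_name="fct_orders.suite",
)

validator.expect_column_min_to_be_between("order_profit", min_value=5)
validator.expect_column_max_to_be_between("order_revenue", min_value=500, max_value=50000)
validator.save_expectation_suite(discard_failed_expectations=False)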

Execute validation for the dim_customers expectation suite by creating and running a checkpoint:


great_expectations --v3-api checkpoint new dim_customers_checkpoint

great_expectations --v3-api checkpoint run dim_customers_checkpoint
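
The same validation can also be triggered from Python, which is handy in CI. A minimal sketch, assuming the checkpoint created above:


# run_validation.py -- sketch, assumes dim_customers_checkpoint has been configured
import great_expectations as ge

context = ge.get_context()
result = context.run_checkpoint(checkpoint_name="dim_customers_checkpoint")

if not result["success"]:
    raise SystemExit("dim_customers validation failed")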

Next Steps

Here are some ways we can build on this data quality framework:

  1. Schedule a batch workflow for daily validation runs (a minimal sketch follows this list)
  2. Set up Data Docs to publish data quality results and metrics
  3. Expand test coverage to more edge cases
  4. Enable data consumers to view data quality issues and fixes
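
For step 1, here is a minimal sketch of a daily validation job that rebuilds the dbt models and then runs the Great Expectations checkpoint. It assumes dbt and the great_expectations CLI are on the PATH and the checkpoint name from above, and would be triggered by cron, Airflow, or another orchestrator:


# daily_validation.py -- sketch of a scheduled data quality run
import subprocess

def run(cmd):
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

def daily_validation():
    run(["dbt", "run"])   # rebuild staging, dim and fct models
    run(["dbt", "test"])  # schema tests and singular tests
    run(["great_expectations", "--v3-api", "checkpoint", "run", "dim_customers_checkpoint"])

if __name__ == "__main__":
    daily_validation()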

Conclusion

In this post, we walked through a detailed example of integrating dbt and Great Expectations for an e-commerce pipeline using code snippets. Defining rigorous data quality tests is key to identifying issues early and maintaining high-quality datasets.
Hopefully this gives you a blueprint for implementing automated data testing and documentation in your own data transformation pipelines!
