DBT expectations w/postgres

Goal

Create a process with dbt where it has been used great expectations, with this package. Here is the main repository

Description

Source

For this project, we are going to use as a source Soccer Events

Environment

Environment with python

For this project we are going to install a virtual enviorment under the name env_dbt_ge:

python3 -m venv env_dbt_ge

... and activate
source env_dbt_ge/bin/activate

With this enviorment we could export our requirements with this script:

python -m pip freeze > requirements.txt

Requirements library

For this project only we need these requirements:

dbt-core==1.5.1
dbt-extractor==0.4.1
dbt-postgres==1.5.1
markupsafe==2.0.1

Environment with docker

Actually, I'm working with a MacOs with TMUX for this reason, my docker's preference should be changed like this:

{
  "builder": {
    "gc": {
      "defaultKeepStorage": "20GB",
      "enabled": true
    }
  },
  "experimental": false,
  "features": {
    "buildkit": false
  }
}

Steps

0 - Create an env file

I've used this .env file with this project:

DBT_PROFILES_DIR=<YOUR_FOLDER>/projects/dbt_great_expectations
PS_VERSION=15
PS_USER=ps
PS_PASSWORD=ps_2023
PS_DB=RAW 
PY_VERSION=3.9-slim-buster

1 - Create our Postgres with dbt

As we've seen it in last article, we could create our docker file for postgres as this way:

1.1 Docker postgres

Dockerfile for postgres

# Dockerfile for postgres
ARG PS_VERSION
FROM postgres:${PS_VERSION}

1.2. Docker dbt

For this case, only we are going to add our new comand: dbt test (to run our desired test) & dbt deps (to download the last version of our package)

ARG PY_VERSION
FROM python:${PY_VERSION}

# Working directory
WORKDIR /app

# Install OS dependencies
RUN apt-get update && apt-get install -qq -y \
    git gcc build-essential libpq-dev --fix-missing --no-install-recommends \ 
    && apt-get clean

# Make sure we are using latest pip
RUN pip install --upgrade pip

# Create directory for dbt config
RUN mkdir -p /root/.dbt

# Copy requirements.txt
COPY requirements.txt requirements.txt

# Install dependencies
RUN pip install -r requirements.txt

# Copy dbt profile
COPY profiles.yml /root/.dbt/profiles.yml


# Copy source code
COPY dbt_project/ /app

WORKDIR /app/dbt_post


# Run 1- Install package + import data + check data 
CMD ["/bin/bash", "-c", "dbt deps;dbt seed;dbt test"]

1.3. Docker compose

Then we could have this docker compose to work with our environment:



version: '3.9'
volumes:
  db-data:
    name: "data_postres"
    driver: local

services:
  post_db:
    build: 
      context: .
      dockerfile: Dockerfile.postgres
      args:
        - PS_VERSION=${PS_VERSION}
    user: postgres
    image: postgres/test:v1
    environment:
      - POSTGRES_USER=${PS_USER}
      - POSTGRES_PASSWORD=${PS_PASSWORD}
      - PG_DATA:/var/lib/postgresql/data/pgdata
      - POSTGRES_DB=${PS_DB}
    healthcheck:
      test: ["CMD-SHELL","pg_isready -U ${PS_USER} ${PS_PASSWORD}"]
      interval: 10s
      timeout: 5s
      retries: 5
    ports:
      - "5432:5432"
    volumes:
      - db-data:/var/lib/postgresql/data
    restart: unless-stopped
  dbt_python:
    build:
      context: .
      dockerfile: Dockerfile.python
      args:
        - PY_VERSION=${PY_VERSION}
    depends_on:
      - post_db

2 - DBT

2.1. - Import csv in our folder:

In this step we should incloude in our folder of dbt_project/seed our csv

2.2 - Add the package

In our folder of dbt_project, we should create a file with name "package.yml" with this code:

packages:
  - package: calogica/dbt_expectations
    version: [">=0.6.0", "<0.7.0"]

2.3. - Add schema in seed folder

In our folder of dbt_project/seed, we should create a file "schema.yml"

In there, we could include which type of column it is going to be (char, int, ...), and also we could include some test.

columns:
 - name: season
        tests:
          - unique:
              config:
                store_failures: true

If we choose the preset's test we could use the option of config, and also store_failures. These store_failure's checks it's going to be saved on our desired database under schema: "dbt_test__audit"."name_of_our_test"

The main purpouse is sees how its load in our table in case something goes wrong. For this reason this test should fail.

These tests could be working with great expectation's package:

columns:
      - name: id_odsp
        tests:
          - dbt_expectations.expect_column_values_to_be_unique

In this case, I couldn't see how canonical [the main package] could store faliures in case if something goes wrong.

Example

In here, we could see an example of this case of use:

version: 2

seeds:
  - name: ginf
    column_types:
      id_odsp: varchar(10)
      link_odsp: varchar(90)
      adv_stats: bool
      date: date
      league: varchar(4)
      season: int
      ht: varchar(30)
      at: varchar(30)
      odd_h: float
      odd_d: float
      odd_a: float
      odd_over: varchar(5)
      odd_under: varchar(5)
      odd_bts: varchar(5)
      odd_bts_n: varchar(5)
    columns:
      - name: id_odsp
        tests:
          - dbt_expectations.expect_column_values_to_be_unique
          - not_null
      - name: season
        tests:
          - unique:
              config:
                store_failures: true

2.4. - Save it in other schema without "_"

In case we want to create our tables in other schema, DBT as preset's behavior create an schema like this : "presetschema_createdschema". And we could avoid with this solution

2.4.1 - Add a macro

In the folder macro, we should create a sql file with this code (it doesn't care too much the name of the file):

{% macro generate_schema_name(custom_schema_name, node) -%}

    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}

        {{ default_schema }}

    {%- else -%}

        {{ custom_schema_name | trim }}

    {%- endif -%}

{%- endmacro %}

2.4.2 - Set new schema in dbt_project

We should add in dbt_project

seeds:
  +schema: src

Conclusion

It's really amazing the work of this dbt package, it brings a lot of features of great expecations to DBT. However, I've seen a little missing details to achive a greate feature as could be:

Lack of documentation: Currently if you use this package to try to document your database. You are going to see something like this:

It is not really clear to share with the main customere this kind of input.

Revision check: in case you use the presets factory of dbt, you could see the data's behavior. Otherwise, in case you are going to work with this great expecation package, you will need some help too see where is the error.
PySpark no compatibility: in the case of great expecations, you could work with a Pyspark's cluster. And this is little sad, because this could be helpful in dbt's world.

Also, I feel this package have a lot of cool features and in the future it is going to be an essential one.

Blog