Data Pipelines with Great Expectations | Step 1: Setup

samuelearl

Samuel Earl

Posted on January 31, 2023

GitHub Repo

Before we go any further, here is the finished project that contains the files used in this tutorial: gx-getting-started

Set up your folder structure and virtual environment

Create your folder structure

Create a folder called gx-getting-started.


VS Code Recommendation
I recommend opening your gx-getting-started folder in VS Code. GX uses Jupyter Notebooks to configure your project. For me it is helpful to understand where files are located and what they do. Jupyter Notebook does not have a good file explorer, which makes it difficult for me to see where files sit within my project. VS Code, however, does have a good file explorer.

If you open your gx-getting-started folder in VS Code you can run any GX CLI commands from VS Code’s built-in terminal and look at the project files in VS Code's file explorer. This might help you understand the structure of GX projects better and it could help you make the connection between the scripts inside the Jupyter Notebook files and your GX project configs. More on that later.


Inside the gx-getting-started folder, create another folder called data and copy the two data files from this data folder into it.
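If you would rather script the folder creation than click through a file explorer, here is a minimal sketch that uses only Python's standard library. The function name create_project_folders is my own and is not part of GX. It only creates the empty folders, so you still need to copy the two data files yourself.

```python
from pathlib import Path


def create_project_folders(base_dir="."):
    """Create gx-getting-started/ and its data/ subfolder under base_dir."""
    project_root = Path(base_dir) / "gx-getting-started"
    data_dir = project_root / "data"
    # parents=True also creates gx-getting-started/ if it is missing;
    # exist_ok=True makes it safe to run the function more than once.
    data_dir.mkdir(parents=True, exist_ok=True)
    return project_root
```

Call create_project_folders() from the directory where you want the project to live. It returns the path to the new gx-getting-started folder.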

Create an environment.yml file in your project root directory

You can create and activate virtual environments with conda. (See Managing environments.)

Create the following environment.yml file in your gx-getting-started directory:

# /environment.yml

name: gx-env
channels:
  - conda
  - conda-forge
dependencies:
  - python=3.9.5
  - great-expectations==0.15.46

Install Anaconda3

  1. Go to Anaconda's website and install Anaconda for your operating system.
  2. After Anaconda is installed, make sure that python3 and pip3 are on your PATH. In a terminal, run which python3 and then which pip3. Both commands should return a file path inside your Anaconda installation.
  3. conda is Anaconda's package manager. Confirm that it was installed by running conda --version. If that returns a version number, then Anaconda3 and conda are installed correctly.
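The PATH checks in steps 2 and 3 can also be done from Python. This is a rough sketch using shutil.which from the standard library, which looks up an executable on your PATH much like the which command does. The helper name find_tools is hypothetical.

```python
import shutil


def find_tools(names=("python3", "pip3", "conda")):
    """Map each tool name to its full path, or to None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If a tool shows up as not found, or its path does not point into your Anaconda installation, your PATH probably needs adjusting before you continue.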

Create your virtual environment

In a terminal window cd into your project root directory and run:

conda env create -f environment.yml

This will create the virtual environment that is specified in your environment.yml file. If you see a prompt asking you to confirm before proceeding, type y and press Enter to continue creating the environment. Depending on your system configuration, it may take a while for the process to complete.

Verify that the new virtual environment was installed correctly

In your terminal type:

conda env list

You should see gx-env in that list, which is the name specified in your environment.yml file.
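If you want to do the same check in a script, conda can print the environment list as JSON with conda env list --json. The sketch below assumes that output contains an "envs" key holding full environment paths. The helper names env_names and has_env are my own.

```python
import json
import shutil
import subprocess
from pathlib import Path


def env_names(conda_json_text):
    """Extract environment names from `conda env list --json` output."""
    env_paths = json.loads(conda_json_text).get("envs", [])
    # Each entry is a full path, so the last path component is the name.
    # The base environment shows up under its install folder name.
    return [Path(p).name for p in env_paths]


def has_env(name="gx-env"):
    """Return True if conda reports an environment called `name`."""
    conda = shutil.which("conda")
    if conda is None:
        raise RuntimeError("conda was not found on PATH")
    result = subprocess.run(
        [conda, "env", "list", "--json"],
        capture_output=True, text=True, check=True,
    )
    return name in env_names(result.stdout)
```

Calling has_env() should return True once the environment from environment.yml has been created.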

Activate your virtual environment

conda activate gx-env

Once your virtual environment has been activated, your command prompt should be prefixed with (gx-env). For example:

(gx-env) ~/gx-getting-started$

NOTE: If you need to deactivate or delete a virtual environment, see the Additional Info section at the end of this post for instructions.

Set up a Great Expectations project

Check that Great Expectations is installed

GX should already be available in your virtual environment, so you don’t need to install it. But you can confirm that you have Great Expectations installed by running:

great_expectations --version

The output should look like this:

great_expectations, version 0.15.46
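You can run a similar check from inside Python, which is handy later when you are working in a notebook. This is a small sketch that tries to import the great_expectations module and reads its __version__ attribute. The helper name installed_version and the EXPECTED constant are my own additions.

```python
import importlib


def installed_version(module_name="great_expectations"):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


EXPECTED = "0.15.46"  # the version pinned in environment.yml
found = installed_version()
if found is None:
    print("great_expectations is not installed in this environment")
elif found != EXPECTED:
    print(f"Found version {found}, but environment.yml pins {EXPECTED}")
else:
    print(f"great_expectations {found} is installed")
```

Run it with the gx-env environment activated. If it reports that the package is missing, you are most likely using a different Python interpreter than the one inside gx-env.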

Create a Data Context

What is a Data Context? In web development, a project is a folder that contains all of the packages, config files, code, etc. to develop and run a web project. A Data Context in GX is similar. It contains all of the packages, config files, code, etc. to configure and run data validations in your pipeline.

To create a Data Context, open your terminal, make sure that you are inside your gx-getting-started folder, and run the following command:

great_expectations init

When you are asked “OK to proceed?” type y and press Enter.

This will create a subdirectory called great_expectations with a bunch of files and folders inside. I will show you some of those files and folders in this tutorial and explain their purpose.
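To get a quick overview of what the command generated, without relying on a file explorer, you could print the top levels of the new folder. The following is a minimal sketch using pathlib. The function name list_tree and the max_depth parameter are hypothetical choices of mine.

```python
from pathlib import Path


def list_tree(root, max_depth=2):
    """Return indented lines for the folders and files under root."""
    root = Path(root)
    lines = []
    # Sorting the paths keeps each folder's contents directly below it.
    for path in sorted(root.rglob("*")):
        depth = len(path.relative_to(root).parts)
        if depth > max_depth:
            continue
        suffix = "/" if path.is_dir() else ""
        lines.append("  " * (depth - 1) + path.name + suffix)
    return lines
```

From your gx-getting-started folder, print("\n".join(list_tree("great_expectations"))) shows the first two levels of the Data Context. Raise max_depth if you want to see deeper.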


Additional Info

How to deactivate a virtual environment

When you are done working on a particular project you can deactivate the virtual environment with:

conda deactivate

See Deactivating an environment.

How to delete a virtual environment

If you need to delete a virtual environment for any reason, this is how:

First make sure your environment is deactivated. Then run this command:

conda env remove --name gx-env

You can verify that your virtual environment has been deleted by running:

conda env list

You should not see your virtual environment listed.
