Data Pipelines with Great Expectations | Step 1: Setup
Samuel Earl
Posted on January 31, 2023
GitHub Repo
Before we go any further, here is the finished project that contains the files used in this tutorial: gx-getting-started
Set up your folder structure and virtual environment
Create your folder structure
Create a folder called gx-getting-started.
VS Code Recommendation
I recommend opening your gx-getting-started folder in VS Code. GX uses Jupyter Notebooks to configure your project, and I find it helpful to see where files are located and what they do. Jupyter Notebook does not have a good file explorer, which makes it hard for me to see where files sit within my project. VS Code does.
If you open your gx-getting-started folder in VS Code, you can run any GX CLI commands from VS Code's built-in terminal and browse the project files in VS Code's file explorer. This might help you understand the structure of GX projects better, and it could help you connect the scripts inside the Jupyter Notebook files to your GX project configs. More on that later.
Inside the gx-getting-started folder, create another folder called data and copy the two data files from this data folder into it.
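If you prefer the terminal, the folder layout described above can be created with a single command. This is a minimal sketch that assumes a Unix-like shell and that you run it from the directory where you want the project to live:

```shell
# Create the project folder and its data subfolder in one step.
# -p creates missing parent folders and does not fail if they already exist.
mkdir -p gx-getting-started/data

# Confirm that both folders are there.
ls -d gx-getting-started gx-getting-started/data
```

The two data files still need to be copied into gx-getting-started/data by hand.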
Create an environment.yml file in your project root directory
You can create and activate virtual environments with conda. (See Managing environments.)
Create the following environment.yml file in your gx-getting-started directory:
# /environment.yml
name: gx-env
channels:
  - conda
  - conda-forge
dependencies:
  - python=3.9.5
  - great-expectations==0.15.46
Install Anaconda3
- Go to Anaconda's website and install Anaconda for your operating system.
- After Anaconda is installed, make sure that python3 and pip3 are on your PATH. In a terminal, run which python3 and then which pip3. Both commands should return a file path inside your Anaconda installation.
- conda is Anaconda's package manager. Check that your conda installation worked by running conda --version. If that returns a version number, you have installed Anaconda3 and conda correctly.
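The checks above can be rolled into one small script. This is only a sketch: check_tool is a helper name I made up for illustration, and it uses the shell built-in command -v, which reports the same kind of path that which does:

```shell
# Hypothetical helper: print where a tool resolves on PATH, or flag it as missing.
check_tool() {
  if location=$(command -v "$1"); then
    echo "$1 -> $location"
  else
    echo "$1 -> not found on PATH"
  fi
}

# Check the three tools this tutorial relies on.
for tool in python3 pip3 conda; do
  check_tool "$tool"
done
```

If python3 or pip3 resolves to a path outside your Anaconda installation, another Python install is probably ahead of Anaconda on your PATH.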
Create your virtual environment
In a terminal window, cd into your project root directory and run:
conda env create -f environment.yml
This will create the virtual environment that is specified in your environment.yml file. If you see a prompt asking you to confirm before proceeding, type y and press Enter to continue creating the environment. Depending on your system configuration, it may take a while for the process to complete.
Verify that the new virtual environment was installed correctly
In your terminal type:
conda env list
You should see gx-env in that list, which is the name specified in your environment.yml file.
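If you would rather not scan the list by eye, the same check can be scripted. This is a sketch that assumes conda env list prints each environment name at the start of a line followed by whitespace; env_listed is a helper name I made up for illustration:

```shell
# Hypothetical helper: reads `conda env list` output on stdin and
# succeeds if some line starts with the environment name given in $1.
env_listed() {
  grep -q "^$1[[:space:]]"
}

if conda env list 2>/dev/null | env_listed gx-env; then
  echo "gx-env exists"
else
  echo "gx-env not found"
fi
```

The same check also works after deleting an environment, where you would expect the "not found" branch.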
Activate your virtual environment
conda activate gx-env
Once your virtual environment has been activated, your command prompt should be prefixed with (gx-env). For example:
(gx-env) ~/gx-getting-started$
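If your prompt is customized and does not show the prefix, you can check the active environment another way. This sketch relies on the CONDA_DEFAULT_ENV variable that conda sets on activation; env_status is a helper name I made up for illustration:

```shell
# Hypothetical helper: report whether the environment named in $1 is the active one.
env_status() {
  if [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]; then
    echo "$1 is active"
  else
    echo "$1 is not active (current: ${CONDA_DEFAULT_ENV:-none})"
  fi
}

env_status gx-env
```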
NOTE: If you need to deactivate or delete a virtual environment, look at the end of this post for instructions.
Set up a Great Expectations project
Check that Great Expectations is installed
GX is listed as a dependency in environment.yml, so it should already be available in your virtual environment and you don't need to install it separately. You can confirm that Great Expectations is installed by running:
great_expectations --version
This should produce output like this:
great_expectations, version 0.15.46
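The CLI check tells you the command is on your PATH. If you also want to confirm that the Python interpreter in your environment can import the package, something like the following should work. It is a sketch that assumes python resolves to the interpreter inside gx-env; gx_python_version is a helper name I made up for illustration:

```shell
# Hypothetical helper: print the GX version as seen by the Python on PATH,
# or a hint if the package cannot be imported.
gx_python_version() {
  if python -c "import great_expectations" 2>/dev/null; then
    python -c "import great_expectations as gx; print(gx.__version__)"
  else
    echo "great_expectations is not importable; is gx-env activated?"
  fi
}

gx_python_version
```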
Create a Data Context
What is a Data Context? In web development, a project is a folder that contains all of the packages, config files, code, etc. to develop and run a web project. A Data Context in GX is similar. It contains all of the packages, config files, code, etc. to configure and run data validations in your pipeline.
To create a Data Context, open your terminal, make sure that you are inside your gx-getting-started folder, and run the following command:
great_expectations init
When you are asked “OK to proceed?”, type y and press Enter.
This will create a subdirectory called great_expectations with a number of files and folders inside. I will show you some of those files and folders in this tutorial and explain their purpose.
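To get a quick look at what was generated without leaving the terminal, you can list the top two levels of the new folder. This is a sketch; show_context is a helper name I made up for illustration. In the 0.15 releases I would expect to see great_expectations.yml along with folders such as expectations, checkpoints, plugins, and uncommitted, but treat that list as an assumption and trust what find prints on your machine:

```shell
# Hypothetical helper: list the top two levels of a Data Context folder.
# $1: path to the folder (defaults to ./great_expectations)
show_context() {
  dir="${1:-great_expectations}"
  if [ -d "$dir" ]; then
    find "$dir" -maxdepth 2 | sort
  else
    echo "$dir not found; run 'great_expectations init' from the project root first."
  fi
}

show_context
```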
Additional Info
How to deactivate a virtual environment
When you are done working on a particular project you can deactivate the virtual environment with:
conda deactivate
See Deactivating an environment.
How to delete a virtual environment
If you need to delete a virtual environment for any reason, this is how:
First make sure your environment is deactivated. Then run this command:
conda env remove --name gx-env
You can verify that your virtual environment has been deleted by running:
conda env list
You should not see your virtual environment listed.