BigQuery Schema Generation Made Easier with PyPI’s bigquery-schema-generator

noela_tenku

Noela Tenku

Posted on July 9, 2024

BigQuery Schema Generation Made Easier with PyPI’s bigquery-schema-generator

When importing data into BigQuery, a crucial step is defining the table's structure - its schema. This schema can be auto-detected or defined manually.

Auto-Detection with BigQuery’s LoadJobConfig Method (for Smaller Datasets)

When we load data from a CSV file, we use the LoadJobConfig method with the autodetect parameter set to True. This tells BigQuery's data importer (bq load) to peek at the first 500 records of your data to guess its schema. This works well for smaller datasets, especially if the data originates from a well-defined source like a pre-existing database.

Image description

Manual Definition: Tedious for Large & Evolving Data

When dealing with data extracted from a service like a REST API, that might have thousands of records, or where older records may have different fields compared to newer ones, auto-detection falls short. Here, manually defining the schema becomes necessary. The traditional approach is to manually create a schema.json file.

Sure, you could skim through the JSON data, assuming you possess reading skills like Josh2funny (Checkout the Nigerian comedian), and define the schema based on that. But wouldn't it be nice to have a more reliable approach?

PyPI’s bigquery-schema-generator

This Python package is a lifesaver when it comes to generating BigQuery schemas. It works with newline-delimited data, whether it's in JSON, CSV format, or even a list of Python dictionaries and csv.DictReader objects. Unlike BigQuery's data importer, this package analyzes all your data to create a more accurate and error-free schema. Plus, it spares you the confusion that often arises with BigQuery's Repeated Mode and RECORD data type.

Installation: Getting Started
To install bigquery-schema-generator within your virtual environment, simply run:
pip3 install bigquery-schema-generator

Usage
The package offers various ways to integrate it into your workflow:

  • Command Line: You can use it directly from the command line by running the included shell script, invoking the Python module, or using a Python script. The PyPI documentation provides all the details on these methods.
  • As a Library : This is the method I used as indicated in the code below.

Image description

(Checkout the project I implemented with this tool.)

💖 💪 🙅 🚩
noela_tenku
Noela Tenku

Posted on July 9, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related