How to combine BigQuery with DuckDB
Marcelo Costa
Posted on April 27, 2024
This blog post will discuss the benefits of integrating Google BigQuery, a leading data warehouse solution, with DuckDB, an embedded analytical database. This powerful combination can enhance your data analysis processes by offering the best of both worlds: BigQuery's massive scalability and DuckDB's agility for quick and on-the-fly queries.
Before we start, here is a quick summary of the key features for each:
BigQuery
Key Features:
Serverless Architecture: BigQuery manages infrastructure automatically, scaling to meet query demands without manual resource provisioning.
Storage and Computation Separation: Users can store large amounts of data independently, reducing costs and optimizing performance.
Real-time Analytics: Supports real-time analysis with the capability to stream and query data almost instantaneously.
Machine Learning Integration: BigQuery ML offers machine learning capabilities inside the database, allowing SQL practitioners to build and deploy models using SQL commands.
DuckDB
Key Features:
In-Process Database: Designed for embedded processes, it is ideal for applications and analytics tools requiring a built-in database.
Simple Integration: Easy to set up; a single pip install (or equivalent) with no server or external dependencies to manage.
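As a quick illustration of that simplicity, here is a minimal in-process session (after pip install duckdb); the query itself is just a placeholder:
import duckdb

# DuckDB runs inside the Python process; there is no server to start or connect to.
con = duckdb.connect()  # an in-memory database by default
print(con.execute("SELECT 42 AS answer").fetchall())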
Let's review two easy options for bringing your BigQuery data to DuckDB.
Export Data From BigQuery to DuckDB
Export the data to Cloud Storage, then download it manually or with gsutil.
EXPORT DATA
  OPTIONS (
    uri = 'gs://bq_export_demo/export/*.parquet',
    format = 'PARQUET',
    overwrite = true)
AS (
  SELECT ssn, user_name
  FROM `demo-project.bq_dataset_0024.org_extend_rich_schemas_2890`
  ORDER BY user_name
);
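The download can be done with gsutil, or, staying in Python, with a short google-cloud-storage script. Here is a minimal sketch, assuming the bucket and prefix from the export above:
from google.cloud import storage

# Download every Parquet file written by the EXPORT DATA statement above.
client = storage.Client()
for blob in client.list_blobs("bq_export_demo", prefix="export/"):
    if blob.name.endswith(".parquet"):
        # e.g. "export/000000000000.parquet" becomes "export_000000000000.parquet"
        blob.download_to_filename(blob.name.replace("/", "_"))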
Alternatively, DuckDB can read the exported files directly from Cloud Storage through its httpfs extension, skipping the manual download entirely.
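Here is a minimal sketch of that direct read, assuming a recent DuckDB with the httpfs extension and secrets manager, and a GCS HMAC key (the key id and secret below are placeholders):
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# The HMAC key id and secret are placeholders; create one for your bucket in the GCS console.
con.execute("""
    CREATE SECRET gcs_secret (
        TYPE GCS,
        KEY_ID 'my_hmac_key_id',
        SECRET 'my_hmac_secret'
    )
""")
print(con.execute(
    "SELECT ssn, user_name "
    "FROM read_parquet('gs://bq_export_demo/export/*.parquet')"
).fetchall())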
BigQuery Client Library
Make sure your environment has the following libraries installed:
pip install duckdb
pip install pyarrow
pip install google-cloud-bigquery
pip install google-cloud-bigquery-storage
With those installed, an efficient way to query the data is through the BigQuery Storage client, which maps the rows directly to a PyArrow table instead of downloading them row by row:
import duckdb
from google.cloud import bigquery

bqclient = bigquery.Client()

# Reference to the BigQuery table we want to read.
table = bigquery.TableReference.from_string(
    "demo-project.bq_dataset_0024.org_extend_rich_schemas_2890"
)

# Fetch the rows and materialize them as a PyArrow table; create_bqstorage_client=True
# routes the download through the faster BigQuery Storage API.
rows = bqclient.list_rows(table)
org_extend_rich_schemas_2890 = rows.to_arrow(create_bqstorage_client=True)

# DuckDB's replacement scans let SQL reference the PyArrow table by its Python variable name.
con = duckdb.connect()
print(con.execute('SELECT * FROM org_extend_rich_schemas_2890').fetchall())
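From there, a single statement materializes the Arrow data as a regular DuckDB table you can keep querying. Continuing from the script above (the org_extend table name is a placeholder of mine):
# Persist the Arrow table as a regular DuckDB table via the replacement scan.
con.execute(
    "CREATE OR REPLACE TABLE org_extend AS "
    "SELECT * FROM org_extend_rich_schemas_2890"
)
print(con.execute("SELECT COUNT(*) FROM org_extend").fetchall())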
Push Data from DuckDB to BigQuery
DuckDB has the advantage of allowing you to run everything on your local machine without having to worry about costs. However, it is important to keep in mind that if you are dealing with sensitive or customer-related data, you should take appropriate security measures to protect it.
DuckDB: Transform Data and Export to Parquet
-- Load the Parquet file
CREATE OR REPLACE TABLE original_data AS
SELECT *
FROM read_parquet('/path/bq_export_demo/export/*.parquet');
-- Perform transformations
CREATE OR REPLACE TABLE transformed_data AS
SELECT
    column1,
    column2,
    column3 + 10 AS new_column3,
    UPPER(column4) AS new_column4
FROM original_data;
-- Export the transformed data to a new Parquet file
COPY transformed_data
TO '/path/to/output_file.parquet' (FORMAT 'parquet');
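Before loading, a quick sanity check on the output file is cheap, and DuckDB itself can do it; a small sketch using the same placeholder path:
import duckdb

# Confirm the transformed file is readable and the row count looks right before loading it.
print(duckdb.sql(
    "SELECT COUNT(*) AS row_count FROM read_parquet('/path/to/output_file.parquet')"
))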
Once you have your transformed Parquet file, copy it to a Cloud Storage bucket and load it into BigQuery with a load job:
gsutil cp /path/to/output_file.parquet gs://your_bucket/path/to/output_file.parquet
bq load --source_format=PARQUET --autodetect \
  mydataset.new_table \
  'gs://your_bucket/path/to/output_file.parquet'
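If you would rather stay in Python, the BigQuery client library can also load the local file directly, with no Cloud Storage staging; a minimal sketch, reusing the placeholder table name from above:
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    autodetect=True,
)
# Stream the local Parquet file straight into a BigQuery load job.
with open("/path/to/output_file.parquet", "rb") as f:
    load_job = client.load_table_from_file(
        f, "mydataset.new_table", job_config=job_config
    )
load_job.result()  # wait for the load to complete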
And that's it! This combination is firmly in my data toolkit, and it helps me with my day-to-day work.
Having said that, here are some final caveats:
Don't Overload DuckDB with Big Data Tasks:
DuckDB is not designed to handle data at BigQuery's scale. Avoid using DuckDB for large datasets better suited to BigQuery's infrastructure.
Don't Neglect the Cost Implications:
Be mindful of the costs of data storage and transfer, especially when moving large amounts of data between BigQuery and DuckDB.
Don't Forget to Scale Appropriately:
As your data grows or your analytical needs change, revisit how you split work between BigQuery and DuckDB. Scalability is a crucial concern, and what works at one scale may not work well at another.
Do Worry About Security:
Moving sensitive data from a secure production warehouse to your local environment, or to any environment where DuckDB runs embedded, can raise security concerns, so handle sensitive data with care.
I hope this helps!