How to pass an Array of Structs in Bigquery's parameterized queries

matthieucham

matthieucham

Posted on October 15, 2024

How to pass an Array of Structs in Bigquery's parameterized queries

In Google's Bigquery, SQL queries can be parameterized. If you're not familiar with this concept, it basically means that you can write SQL queries as parameterized templates like this:

INSERT INTO mydataset.mytable(columnA, columnB)
    VALUES (@valueA, @valueB)
Enter fullscreen mode Exit fullscreen mode

And pass the values separately. This has numerous benefits:

  • The query is more readable than when it's built by string concatenation
  • The code is more robust and industrialized
  • It's a great protection against SQL injection attacks (mandatory XKCD)

The passing of query parameters from a Python script appears straightforward... at first sight. For example:

from google.cloud.bigquery import (
    Client,
    ScalarQueryParameter,
    ArrayQueryParameter,
    StructQueryParameter,
    QueryJobConfig,
)

client=Client()

client.query("
INSERT INTO mydataset.mytable(columnA, columnB)
    VALUES (@valueA, @valueB)
", job_config=QueryJobConfig(
    query_parameters=[
        ScalarQueryParameter("valueA","STRING","A"), 
        ScalarQueryParameter("valueB","STRING","B")
])
Enter fullscreen mode Exit fullscreen mode

The example above inserts simple ("Scalar") values in columns A and B. But you can also pass more complex parameters:

  • Arrays (ArrayQueryParameter)
  • Structs (StructQueryParameter)

Problems arise when you want to insert arrays of structs : there are many gotchas, almost no documentation and very few resources on the subject on the web. The goal of this article is to fill this gap.

How to persist an array of structs in bigquery using parameterized queries

Let's define the following object that we want to store in our destination table

from dataclasses import dataclass

@dataclass
class Country:
    name: str
    capital_city: str

@dataclass
class Continent:
    name: str
    countries: list[Country]
Enter fullscreen mode Exit fullscreen mode

by invoking this parameterized query

query = UPDATE continents SET countries=@countries WHERE name="Oceania"
Enter fullscreen mode Exit fullscreen mode

The first try by following the shallow documentation would be

client.query(query, 
    job_config=QueryJobConfig(query_parameters=[
        ArrayQueryParameter("countries", "RECORD", [
             {name="New Zealand", capital_city="Wellington"},
             {name="Fiji", capital_city="Suva"} ...]
]))
Enter fullscreen mode Exit fullscreen mode

which would fail miserably

AttributeError: 'dict' object has no attribute 'to_api_repr'

Gotcha n°1: ArrayQueryParameter's values must be instances of StructQueryParameter

It turns out that the third argument of the constructor - values- must be a collection of StructQueryParameter instances, not the wanted values directly. So let's build them:

client.query(query, 
job_config=QueryJobConfig(query_parameters=[
    ArrayQueryParameter("countries", "RECORD", [
    StructQueryParameter("countries",
        ScalarQueryParameter("name", "STRING", ct.name), 
        ScalarQueryParameter("capital_city", "STRING", ct.capital_city)
    )
    for ct in countries])
]))
Enter fullscreen mode Exit fullscreen mode

This time it works... Until you try to set an empty array

client.query(query, 
    job_config=QueryJobConfig(
    query_parameters=[
        ArrayQueryParameter("countries", "RECORD", [])
]))
Enter fullscreen mode Exit fullscreen mode

ValueError: Missing detailed struct item type info for an empty array, please provide a StructQueryParameterType instance.

Gotcha n°2: Provide the full structure type as second argument

The error message is pretty clear: "RECORD" is not enough for Bigquery to know what to do with your empty array. It needs the fully detailed structure. So be it

client.query(query, job_config=QueryJobConfig(query_parameters=[
    ArrayQueryParameter("countries",
        StructQueryParameterType(
            ScalarQueryParameterType("STRING","name"),
            ScalarQueryParameterType("STRING","capital_city")
        ), [])
]))
Enter fullscreen mode Exit fullscreen mode

(Notice how the order of the arguments of the ...ParameterType constructor is the reverse of ...Parameter constructor. Just another trap on the road...)

And now it works for empty arrays too, yay !

One last gotcha to be aware of: every subfield of a StructQueryParameterType must have a name, even if the second parameter (name) is optional in the constructor. It's actually mandatory for subfields, otherwise you'll get a new kind of error

Empty struct field name

I think that's all we need to know to complete the usage of arrays of records in query parameters, I hope this helps !


Thanks for reading! I’m Matthieu, data engineer at Stack Labs.
If you want to discover the Stack Labs Data Platform or join an enthousiast Data Engineering team, please contact us.


Photo de Denys Nevozhai sur Unsplash

💖 💪 🙅 🚩
matthieucham
matthieucham

Posted on October 15, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related