Embarking on the BigQuery Quest: Exploring the Depths of Its Inner Workings
Matheus Tramontini
Posted on September 24, 2024
BigQuery is a powerful serverless data warehouse available on Google Cloud Platform. It lets organizations analyze large amounts of data quickly and efficiently, making it, along with Redshift, a common choice for data-driven businesses. However, understanding how it actually works internally can be challenging for newcomers and experienced users alike. For this reason, in this article I aim to demystify how BigQuery operates internally and provide a straightforward summary of how it can process vast amounts of data with ease. I hope this post helps people not to feel overwhelmed when they see, for example, the image below, and that by understanding how the data is processed they can make more informed technical decisions when designing their structures in BigQuery.
Dremel
Dremel was born out of Google’s need for its internal teams to analyze massive amounts of data stored in distributed file systems or other types of storage (such as BigTable) quickly and on-demand, using something as close to SQL as possible.
So, what did Dremel do differently? The first aspect was how it handles semi-structured and nested data, using columnar storage: data is read by columns rather than rows, which already reduced how much data had to be scanned for each query compared with running MapReduce jobs over entire records. Another important aspect was the use of a tree structure for query execution, eliminating the need to recombine data into the conventional row-by-row shape of a database for each call. Finally, it relied on parallel execution across multiple shared clusters.
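To see why reading by column matters, here is a minimal sketch in plain Python (the table and column names are invented for illustration, and no BigQuery API is involved): with a row layout, answering a single-column question touches every field of every row, while a columnar layout only touches the one column the query needs.

```python
# Row layout: every query touches whole rows, even for a single column.
rows = [
    {"user_id": 1, "country": "BR", "revenue": 10.0},
    {"user_id": 2, "country": "US", "revenue": 25.0},
    {"user_id": 3, "country": "BR", "revenue": 7.5},
]
total_row_layout = sum(row["revenue"] for row in rows)  # scans all 9 fields

# Columnar layout: the same table stored column by column.
columns = {
    "user_id": [1, 2, 3],
    "country": ["BR", "US", "BR"],
    "revenue": [10.0, 25.0, 7.5],
}
total_columnar = sum(columns["revenue"])  # scans only the 3 values we need

assert total_row_layout == total_columnar == 42.5
```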
Now, as the last point about Dremel, we need to understand how it uses the tree concept to execute queries. First, a query dispatcher sends the query to the root node. The root node has two main functions: sending the work down to the next level, which contains the Mixers, and returning the final result to the client that made the request. The Mixers rewrite the query to reduce and optimize it, primarily by applying partitioning and shuffling, and then send the modified query to the leaf nodes, which do the heavy lifting: reading data from Colossus, applying filters and aggregations, dropping unused columns, and so on. It's essential to note that each leaf does its processing in slots (processing units), and, on average, each slot reads about 5 GB of data.
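To make the tree-shaped execution a little more concrete, here is a purely conceptual sketch (the function names and the shape of the data are my own, not BigQuery's API): the root dispatches work to mixers, mixers fan out to leaves that scan their shard and pre-aggregate, and the partial results are merged on the way back up to the client.

```python
# Conceptual sketch of Dremel-style tree execution (not real BigQuery code).
from collections import Counter

def leaf_execute(shard, column, predicate):
    """Leaf node: scan only the needed column of its shard and pre-aggregate."""
    counts = Counter()
    for row in shard:
        value = row[column]
        if predicate(value):
            counts[value] += 1
    return counts

def mixer_execute(shards, column, predicate):
    """Mixer: fan the work out to its leaves and merge their partial results."""
    merged = Counter()
    for shard in shards:
        merged.update(leaf_execute(shard, column, predicate))
    return merged

def root_execute(shard_groups, column, predicate):
    """Root: dispatch to mixers, combine their results, return to the client."""
    result = Counter()
    for shards in shard_groups:
        result.update(mixer_execute(shards, column, predicate))
    return dict(result)

# Tiny example: COUNT(*) GROUP BY country, filtering out 'N/A'.
shard_groups = [
    [[{"country": "BR"}, {"country": "US"}], [{"country": "BR"}]],
    [[{"country": "DE"}, {"country": "N/A"}]],
]
print(root_execute(shard_groups, "country", lambda v: v != "N/A"))
# {'BR': 2, 'US': 1, 'DE': 1}
```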
Another place where Dremel makes use of tree structures is nested fields. In a nested field, the root is the top level of the record and the leaves are the deepest fields. When executing a query, Dremel selects only the column paths that will actually be used and discards the unnecessary ones by exploiting its knowledge of the tree's levels, making the processing of this structure faster and lighter.
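As a toy illustration of the idea (this is a simplification; Dremel's actual record shredding also tracks repetition and definition levels), nested records can be stored as one column strip per field path, so a query that only needs one path never reconstructs the full records:

```python
# Toy illustration of storing nested records as per-field-path column strips.
records = [
    {"name": "order-1", "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"name": "order-2", "items": [{"sku": "C", "qty": 5}]},
]

def shred(records):
    """Flatten each leaf field path (e.g. 'items.sku') into its own column strip."""
    columns = {"name": [], "items.sku": [], "items.qty": []}
    for record in records:
        columns["name"].append(record["name"])
        for item in record["items"]:
            columns["items.sku"].append(item["sku"])
            columns["items.qty"].append(item["qty"])
    return columns

columns = shred(records)

# A query that only touches 'items.qty' reads a single strip and never
# rebuilds the nested structure of the original records.
print(sum(columns["items.qty"]))  # 8
```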
If you’re curious and enjoy reading more theoretical material, check out the official Dremel paper, which is quite interesting.
Capacitor
Capacitor is the successor to ColumnarIO. It also stores data in columnar form, but with an improvement that allows it to perform operations on compressed files without needing to decompress them first. Before explaining how Capacitor does this, I'd like to mention two advantages that I think are important for understanding the appeal of columnar formats.
- Reduced Scans: In a columnar format, since values are referenced by column, there is no need to scan every row with all of its columns; only the selected columns are read.
- Better Compression Ratio: Columnar storage achieves a compression ratio of around 1:10, while row-based storage achieves approximately 1:3.

Google created Capacitor as an internal columnar format that heavily applies variations and improvements on the compression and decompression methods presented in this research, such as Run-Length Encoding (RLE) and dictionary encoding.
Using RLE as an example, as shown in the image below, we would need about 25 iterations to compress the information.
However, if we apply some sorting logic and reorder the table, we can reduce the number of iterations to 12, improving both storage and processing.
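As a rough sketch of the idea (not Capacitor's actual encoder), run-length encoding stores each value once along with the length of its consecutive run, so sorting a low-cardinality column first produces far fewer runs:

```python
from itertools import groupby

def run_length_encode(values):
    """Store each value once together with the length of its consecutive run."""
    return [(value, len(list(group))) for value, group in groupby(values)]

unsorted_column = ["BR", "US", "BR", "BR", "US", "DE", "BR", "US", "US", "BR"]
print(run_length_encode(unsorted_column))
# 8 runs: [('BR', 1), ('US', 1), ('BR', 2), ('US', 1), ('DE', 1), ('BR', 1), ('US', 2), ('BR', 1)]

print(run_length_encode(sorted(unsorted_column)))
# 3 runs: [('BR', 5), ('DE', 1), ('US', 4)]
```

In practice, the reordering has to be chosen for the table as a whole, since every column must keep its rows aligned, which is part of what Capacitor's models have to weigh.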
Of course, internally BigQuery applies more intelligence to this compression idea, using, for example, models that pick different approaches for longs and strings and that account for which columns are more likely to appear in a SELECT list or be filtered in a WHERE clause. Another interesting point is that, while applying this intelligence and these rules, it collects statistics for each column, allowing it to choose the best path during query planning. At the end of the whole process, it persists both this information and the compressed tables in Colossus.
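A heavily simplified sketch of that kind of decision, using made-up thresholds (Capacitor's real models are far more sophisticated and are not public):

```python
def choose_encoding(column_values):
    """Pick an encoding from simple column statistics (illustrative thresholds only)."""
    total = len(column_values)
    distinct = len(set(column_values))
    stats = {"total": total, "distinct": distinct}

    if distinct <= total * 0.01:
        # Very few distinct values: long runs are likely, RLE pays off.
        return "run_length", stats
    if distinct <= total * 0.2:
        # Moderate cardinality: store each value once and reference it by index.
        return "dictionary", stats
    # High cardinality: fall back to a plain encoding.
    return "plain", stats

encoding, stats = choose_encoding(["BR"] * 900 + ["US"] * 90 + ["DE"] * 10)
print(encoding, stats)  # run_length {'total': 1000, 'distinct': 3}
```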
Colossus
Colossus is Google's latest-generation distributed file system, the successor to Google File System (GFS). Each Google data center has its own Colossus cluster, which applies rules to handle replication, error recovery, and distributed management. To reduce cost and make replication between clusters easier, data is typically stored using Reed-Solomon encoding, and the metadata is kept in Google's NoSQL database, BigTable. Notably, this is what allowed Colossus to scale to clusters roughly 100 times larger than GFS could manage.
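To get a feel for why erasure coding is cheaper than plain replication, here is a small back-of-the-envelope comparison (the 6-data/3-parity layout is just an illustrative choice, not Colossus's actual configuration):

```python
def storage_overhead(data_chunks, parity_chunks):
    """Raw bytes stored per byte of user data with (data + parity) erasure coding."""
    return (data_chunks + parity_chunks) / data_chunks

# 3-way replication: every byte is stored three times.
print(storage_overhead(1, 2))   # 3.0x

# Reed-Solomon with 6 data chunks + 3 parity chunks: any 3 chunks can be lost,
# yet only 1.5x the raw storage is needed.
print(storage_overhead(6, 3))   # 1.5x
```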
Borg
To execute all this processing, BigQuery relies on Borg, Google's large-scale cluster management system, which abstracts away the complexity of machine scaling, recovery processes, and error handling from its users. In simplified terms, users (in this case, BigQuery) submit workloads in the form of jobs. Each job runs in a Borg cell, a set of machines managed as a single unit.
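In very rough terms, the interaction looks something like the sketch below (the job fields and the admission logic are invented for illustration; Borg's real job specification and scheduler are far richer):

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    tasks: int      # how many replicas of the work to run
    cpus: float     # CPUs required per task
    ram_gb: float   # memory required per task

@dataclass
class Cell:
    """A Borg cell: a set of machines managed as a single unit."""
    name: str
    free_cpus: float
    free_ram_gb: float

    def admit(self, job: Job) -> bool:
        """Admit the job only if the cell still has room for all of its tasks."""
        need_cpus = job.tasks * job.cpus
        need_ram = job.tasks * job.ram_gb
        if need_cpus <= self.free_cpus and need_ram <= self.free_ram_gb:
            self.free_cpus -= need_cpus
            self.free_ram_gb -= need_ram
            return True
        return False

cell = Cell(name="cell-a", free_cpus=10_000, free_ram_gb=40_000)
query_job = Job(name="bq-query-1234", tasks=2_000, cpus=1.0, ram_gb=4.0)
print(cell.admit(query_job))  # True: the "user" (BigQuery) got capacity in this cell
```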
Borg’s structure is so scalable and functional that it served as the foundation for the creation of Kubernetes and would warrant a dedicated post to discuss its entire operation.
For those curious, you can access the original paper as well as a more recent paper analyzing newer Borg clusters.
Jupiter
Finally, to run all these tools and service loads, BigQuery uses Google's Jupiter network, which can deliver about 1 petabit per second of total bisection bandwidth. To put this into perspective, that is enough bandwidth for 100,000 machines to communicate with one another at 10 Gb/s each.
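The arithmetic behind that comparison is simple enough to check (treating a petabit as 10^15 bits):

```python
total_bisection_bandwidth = 1e15     # Jupiter: ~1 petabit per second
machines = 100_000
per_machine_rate = 10e9              # each machine talking at 10 Gb/s

aggregate_demand = machines * per_machine_rate   # 1e15 bits per second
print(aggregate_demand / total_bisection_bandwidth)  # 1.0 -> exactly fills 1 Pb/s
```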
This post delves a bit more into the concepts behind Jupiter.
Conclusion
I hope that with this post you can better understand the inner workings of BigQuery and even adapt how you think about the solution based on this information. My own perspective on how it operates changed once I understood how this entire system fits together.