How to integrate cloud data into your analytics stack with Cloud2SQL

Cloud infrastructure can be really complex, and finding resources that match certain criteria can be like searching for the proverbial needle in the haystack.

Why would you want to search for a specific resource? For example, to find:

vulnerabilities
misconfigurations
dependencies

Usually, that requires direct access to your cloud environment, and that privilege is reserved to a few select people.

But what if you could just query your cloud infrastructure with SQL? For example, this query would return the name of an AWS ELB and the name of the VPC it is connected to.

SELECT aws_elb.name, aws_vpc.name
 FROM aws_elb
 INNER JOIN link_aws_vpc_aws_elb ON aws_elb._id = link_aws_vpc_aws_elb.to_id
 INNER JOIN aws_vpc ON aws_vpc._id = link_aws_vpc_aws_elb.from_id
 LIMIT 1;

With raw cloud data, you could use your existing analytics toolchain to transform and visualize your cloud data - think Airflow, dbt, Metabase, etc. By extracting resource metadata, and loading it into a separate SQL database (say a cloud warehouse like Snowflake) for analysis - we're widening the number of people who can ask smart questions about your cloud.

The first step for that is to build the "EL" part of "ELT" - extract, load, and transform. Extracting data from the clouds like AWS and GCP can be quite cumbersome - the APIs are all fragmented, and use different data models, even within a single cloud.

And that's why the vast majority of infrastructure and security engineers always fall back to an existing tool with a UI. That tool acquires and stores data in a proprietary format, in yet another infrastructure data silo, in yet another dashboard. And then charges and arm and a leg for it.

We think that data integration for infrastructure engineers should be a commodity. Cloud data should be easily accessible and separated from the underlying infrastructure, so that data professionals like analytics engineers can work with it.

That's why we've created Cloud2SQL. Cloud2SQL is an open source tool that extracts resource metadata from your cloud (currently with support for AWS, GCP and DigitalOcean) and syncs it to a destination database. Cloud2SQL flattens that data into tables, complete with foreign keys and link tables.

The image below shows what the fields in a table for an AWS ELB look like (screenshot taken from Metabase).

A link table is a special type of table that allows you to easily find relationships between different resources.

Dependencies matter in the cloud, you need to understand how your resources are connected. The relationships among your resources are 1st class citizens, and as important as the resources themselves. Link tables capture those dependencies.

Each link table is prefixed with link_ followed by the two resource kind names. For example, a link table connecting an AWS VPC to an AWS ELB would be named link_aws_vpc_aws_elb.

Link tables only have two fields: from_id and to_id, which can be easily JOINed on.

By using link tables, you can find dependent resources without needing to know the specific details of each resource's API or how the resources are related. Much like when working with a graph you can also find resources based on the state of another resource.

All of this allows users who are familiar with SQL to easily work with the data collected by Cloud2SQL, using the analytics toolchain and apps they are already familiar. You can now build your own transformations, and write your own queries - custom to your business. We think about it as the first step to replace your "XOps" tools, which only give you 80% of the data you need to start with.

Cloud2SQL already has support for SQLite, MySQL, MariaDB, and PostgreSQL, and Snowflake, as well as Parquet columnar structure files.

To install Cloud2SQL, all you need is Python 3.9 or newer. Create a new virtual environment and install the cloud2sql[all] package:

$ pip3 install --user cloud2sql[all]

See the full installation instructions for Cloud2SQL here.

In the next few posts, we'll publish example queries and transformations. To stay informed, best is to star and follow the GitHub repo for Cloud2SQL.

Blog

How to integrate cloud data into your analytics stack with Cloud2SQL

Lars Kamp

Join Our Newsletter. No Spam, Only the good stuff.

Related