10 AWS Services that use SQL

Amazon Web Services (AWS) is the biggest cloud platform in the world, with over 200 features. In this article, we break down 10 AWS services that support at least some SQL syntax, talk about their use cases, and give examples of how to write queries.

Service	Description	SQL Support	Use Case
RDS	Postgres, MySQL, etc.	Full	Small-medium web apps
Aurora	Serverless databases	Full	Serverless apps
Redshift	Data warehouse	Full	OLAP, Petabytes of data, analytics
DynamoDB	NoSQL database	Some - PartiSQL	Ecommerce, building fast
Keyspaces	Managed Cassandra (key value)	Some - CQL	Messaging
Neptune	Graph database	Some - openCypher	Social networks
Timestream	Time series database	Partial	IOT, Logging
Quantum Ledger	Cryptographically verified transactions	Some - PartiSQL	Finance
Athena	Ad-hoc queries on S3	Some - CTAS	Historical data
Babelfish	MSFT SQL Server on Aurora	Full	.NET

The table above shows how SQL support varies between the services. A graph database cannot be queried in the same way as a classic relational database, and various subsets of SQL, like PartiQL, have emerged to fit these models. In fact, even within standard SQL, there are many SQL dialects for different companies like Oracle and Microsoft.

AWS database services

From relational, to graph, to ledger, to time series, the services below cover almost every conceivable hosted database use case. Let's examine what each is for and how to write a SQL query against them!

RDS

Amazon RDS is one of the most basic AWS database services, used mainly for offloading your database management operations to a platform. Therefore, it is used for small or medium enterprises where the data volume is limited, and the functionalities required for company operations are not too complex.

Amazon RDS supports database engines such as MySQL, MariaDB, PostgreSQL, Oracle, and Microsoft SQL Server. It comes with workflows to secure your RDS instance using SSH and offers a straightforward cloud console for connecting.

Benefits

Amazon RDS is the most inexpensive service, thanks to its ease of usage and lack of complexity.
It is highly scalable and allows you to scale up to 32 vCPUs and 244 Gb of RAM.
This service is also easy to use and pretty fast.

Example SQL

RDS is just some nice management options around the core databases so it supports the exact SQL that each database supports. Here would be a subquery that could be run against Postgres, MySQL, or Oracle.

SELECT
    *
FROM
    users
WHERE
    id IN (
        SELECT
            user_id
        FROM
            transactions
        WHERE
            amount > 1000
    );

AWS Aurora

Amazon Aurora is an elevated version of Amazon RDS. Large enterprises use this since their data volume and complexity of operations are much higher. It doesn’t support all the same database engines as Amazon RDS, and instead only supports MySQL and PostgreSQL. Aurora scales up and down as the load on your database increases and decreases. Newer providers like PlanetScale also offer this capability with additional schema migration features and lower costs.

Amazon Aurora, like RDS, can perform replication. It actually offers about 15 different types of replications, and one replication can be done within milliseconds. On the other hand, RDS can perform only five types of replications, taking more time.

Some of the use cases that can depict the strength of Amazon Aurora are enterprise applications, SaaS applications, and web/mobile gaming.

Benefits:

Auto-scaling allows scaling operations to be carried out automatically without affecting the DB performance. It allows up to 128 TB per database instance.
Aurora backup operations are automatic, continuous, incremental, and have about 99.99999999% durability.
Aurora can detect database failure and recover in less than a minute. Furthermore, in case of a permanent failure, it can automatically move to a replica without data loss.

Redshift

Like Amazon Aurora, Amazon Redshift is used by large enterprises. However, Redshift is more complex, can handle more data, and is referred to as a data warehouse. This is because Redshift is built for OLAP (Online Analytical Processing).

Furthermore, Redshift can scale up to Petabytes of data and supports up to 60 user-defined databases per cluster. On the other hand, Aurora can only scale till Terrabytes and support up to 40. Besides this, the security and the maintenance of both the database services are pretty much the same.

A few use cases of Amazon Redshift are creating machine models for forecasting operations, optimizing your company’s business intelligence, and increasing developer productivity.

Benefits

Redshift possesses the highest scaling capabilities amongst the three options we've examined.
Its performance is much faster and more durable.
Amazon Redshift can also handle a more significant amount of data and analyze it within a shorter period.

Example SQL

Redshift supports some SQL functions and queries which would generally only be necessary with large data warehouse applications. For example, PERCENTILE_CONT computes a linear interpolation to return a percentile.

SELECT
    TOP 10 salesid,
    sum(pricepaid),
    percentile_cont(0.6) WITHIN GROUP (
        ORDER BY 
            salesid
    ),
    median (salesid)
FROM
    sales
GROUP BY
    salesid,
    pricepaid;

DynamoDB

DynamoDB is Amazon's answer to MongoDB, a NoSQL database that works on JSON documents. These databases rely heavily on nested data and do not enforce any strict schema, unless the developer turns that option on. That means that DynamoDB is great for high-volume sites like a CMS or mobile apps with a lot of traffic. For example, both Major League Baseball and Duolingo make use of DynamoDB.

Example SQL

Because DynamoDB is not relational and does not enforce ACID by default, it must use a modified version of standard SQL. Amazon has developed a query language called PartiQL which uses many SQL concepts but is built for highly nested data. The query below takes advantage of the key-value underpinnings of DynamoDB in a relatively SQL standard way.

UPDATE
    Music
SET
    AwardsWon = 1
SET
    AwardDetail = { 'Grammys': [ 2020, 2018 ] }
WHERE
    Artist = 'Acme Band'
    AND SongTitle = 'PartiQL Rocks'

Keyspaces

Cassandra is a famous distributed key-value database. iMessage, while it uses SQLite on the client side, sends all its messages through a massive Cassandra cluster at Apple. Other developers building distributed services with high fault tolerance needs may also look to Cassandra, albeit without wanting the headache of managing a large cluster. This is where AWS Keyspaces comes in by proving managed Cassandra.

Example SQL

The Cassandra project has developed its own SQL subset called CQL which uses the concepts of Keyspaces, Tables, Partitions, Rows, and Columns to query data. The example below is a CQL instruction to use Cassandra's famous fault tolerance to restore a keyspace at a point in time.

RESTORE TABLE mykeyspace.mytable_restored
FROM TABLE mykeyspace.my_table 
WITH restore_timestamp = '2020-06-30T04:05:00+0000'
AND custom_properties = { 
    'point_in_time_recovery': { 'status' :'disabled' },
    'capacity_mode': { 
        'throughput_mode': 'PROVISIONED',
        'read_capacity_units': 10,
        'write_capacity_units': 20 
        }
    }
AND TAGS = { 
    'key1' :'val1',
    'key2' :'val2' 
};

AWS Neptune

The most well-known graph database is probably Neo4j, but for developers in the AWS ecosystem, Neptune can fill that role. Graph databases are very helpful for modeling relationships between N items in a network. Social networks are an obvious application, but fraud detection and even infrastructure management are good use cases for graph DBs too. Neo4j developed an open-source graph query language called openCypher, which AWS Neptune supports.

Example SQL

As you can see, it's a little charitable to call the below query a SQL query. In fact, the openCypher language is probably the most exotic SQL language here. In this query, we define a route between two airports. This is basically an INSERT. Later on, we could query for the best route options between airports in a ticketing system.

MATCH (a:airport {code:'SEA'}), (b:airport {code:'ANC'}),
CREATE (a)-[r:route]->(b)
RETURN r

Timestream

Time-series databases are increasingly popular thanks to the emergence of unlimited logging, IoT, blockchain, and fintech. A time series database attempts to hide the pain of querying across date ranges, storing unlimited historical data, and keeping current data up to date. Compression is often a key metric for the use of a time-series database.

AWS Timestream specifically helps with managing the lifecycle of this data and also integrates with other AWS services you might expect, including AWS IoT Core!

Example SQL

When making queries over time-series data, we make heavy use of common table expressions (CTEs) and aggregates. AWS Timestream supports a number of useful functions which make these queries easier. Because do you want to implement linear interpolation in SQL? In the query below we find the average CPU utilization binned at 30-second intervals for a specific EC2 host over the past 2 hours, filling in the missing values using linear interpolation. This data could then be displayed on a handy line chart or otherwise used for service monitoring.

WITH binned_timeseries AS (
    SELECT
        hostname,
        BIN(time, 30s) AS binned_timestamp,
        ROUND(AVG(measure_value: :double), 2) AS avg_cpu_utilization
    FROM
        "sampleDB" .DevOps
    WHERE
        measure_name = 'cpu_utilization'
        AND hostname = 'host-Hovjv'
        AND time > ago(2h)
    GROUP BY
        hostname,
        BIN(time, 30s)
),
interpolated_timeseries AS (
    SELECT
        hostname,
        INTERPOLATE_LINEAR(
            CREATE_TIME_SERIES(binned_timestamp, avg_cpu_utilization),
            SEQUENCE(min(binned_timestamp), max(binned_timestamp), 15s)
        ) AS interpolated_avg_cpu_utilization
    FROM
        binned_timeseries
    GROUP BY
        hostname
)
SELECT
    time,
    ROUND(value, 2) AS interpolated_cpu
FROM
    interpolated_timeseries
    CROSS JOIN UNNEST(interpolated_avg_cpu_utilization)

Quantum Ledger (QLDB)

QLDB is a massive ledger, or a history of financial events. Organizations like banks or the Census Bureau use ledgers to ensure that no matter what happens with their data, they can be absolutely sure of the record. We will note that the AWS docs work hard not to use the term blockchain. Indeed, if QLDB was distributed around the world and open access, it would be very close to Bitcoin.

The key properties of QLDB are that it is immutable. No way, no how can the commit log for the database be altered. And if it is the ledger can report this or refuse to process additional transactions. The underpinnings of QLDB are a JSON-like data structure called AWS Ion. We saw earlier how PartiQL provides a SQL-esque query language for highly nested data. The example below inserts a record into QLDB to track vehicle registrations.

Example SQL

Once this record is entered into the ledger it becomes immutable and cannot be altered without inserting a new record. In other words, there is really no concept of UPDATE for QLDB.

INSERT INTO VehicleRegistration VALUE
{
    'VIN' : 'KM8SRDHF6EU074761', --string
    'RegNum' : 1722, --integer
    'State' : 'WA',
    'City' : 'Kent',
    'PendingPenaltyTicketAmount' : 130.75, --decimal
    'Owners' : { --nested struct
        'PrimaryOwner' : { 'PersonId': '294jJ3YUoH1IEEm8GSabOs' },
        'SecondaryOwners' : [ --list of structs
            { 'PersonId' : '1nmeDdLo3AhGswBtyM1eYh' },
            { 'PersonId': 'IN7MvYtUjkp1GMZu0F6CG9' }
        ]
    },
    'ValidFromDate' : `2017-09-14T`, --Ion timestamp literal with day precision
    'ValidToDate' : `2020-06-25T`
}

AWS Athena

Do you ever just shove data in S3 and forget about it? S3 often stores terabytes of JSON, CSV, text, or other unstructured data. AWS Athena is a SQL service to pull that data out of S3 and push it into a relational structure. It's great for inspecting buckets, transforming data through ETL processes, or cleaning data to send to services like Hadoop. It's also compatible with DataStation.

There is also another way to query S3 with a service called S3 Select. More information is coming to the Arctype blog on this topic soon! Athena is more flexible than S3 Select because it can touch multiple objects and buckets in the same query. This is also known as making "ad-hoc" queries.

Example SQL

The key concept of AWS Athena is called Creating a Table from Query Results (CTAS). Obviously, there are no tables that exist before S3 is queried, so we need to make use of INSERT INTO when we pull data out to create the table. In the query below, we graph dates from before 2015 from CSVs in our S3 bucket.

INSERT INTO new_parquet
SELECT id,
         date,
         element,
         datavalue,
         mflag,
         qflag,
         sflag,
         obstime,
         substr("date",1,4) AS year
FROM original_csv
WHERE cast(substr("date",1,4) AS bigint) < 2015

Babelfish for Aurora PostgreSQL

We would be remiss not to mention Babelfish as the last SQL service on this list. Technically Babelfish is more of a technology than a fully-fledged database but it accepts SQL queries so here we go!

With Babelfish, you can connect applications that you have written within the Microsoft ecosystem, for example using .NET and TSQL, to AWS Aurora which is backed by Postgres. Although we will note that the list of limitations and gotchas seems fairly long, for organizations with massive codebases making use of TSQL, this will in fact be an easier migration path than rewriting every query to support Postgres.

Example query

Just a simple one here. This query would not run inside Postgres because it uses Microsoft-specific syntax. Yet with Babelfish backed by Aurora, it will return the correct result.

SELECT @@VERSION AS version;

Fully managed databases vs. managing servers

All of the options above fit into a category where AWS software is doing the heavy lifting of provisioning, backups, scale-out, upgrades, and other tasks associated with operating a database. You can, of course, host all of the software above on your own servers and do this yourself.

The Elastic Compute Cloud is almost the opposite of AWS Managed Services (AMS). It was created as a virtual machine to create multiple instances simultaneously. Each instance can run operating systems or software packages based on your requirements and needs. While features such as Auto Scaling are involved, users are left on their own to carry out significant functions for security, resource utilization, maintenance, and more. This leads to another set of pros and cons:

Pros:

It is highly scalable
It is integrated with most AWS services.
The presence of the API makes your operations very easy
You have complete control of the instances.

Cons:

You’ll have to look at the security of the public instances.
Managing resource utilization, maintenance, and management of the lifecycle can become pretty tedious.
In specific scenarios, EC2 is not cost-efficient for companies.

Conclusion

This is a (fairly) complete overview of the AWS database services for databases and for writing SQL. All of these allow you to manipulate and query data via SQL, which makes them easy to use while also being highly scalable. One thing that stands out is the sheer variety of architectures, syntax, and targeted use cases. This speaks to both the depth of Amazon's offerings and the size of the cloud computing market. Take heart, you're working in an industry that's growing 40% year over year!

Blog

10 AWS Services that use SQL

Arctype Team

AWS database services

RDS

AWS Aurora

Redshift

DynamoDB

Keyspaces

AWS Neptune

Timestream

Quantum Ledger (QLDB)

AWS Athena

Babelfish for Aurora PostgreSQL

Fully managed databases vs. managing servers

Conclusion

Join Our Newsletter. No Spam, Only the good stuff.

Related