axurcio
Posted on April 11, 2021
Insight delivered a customer solution leveraging Cosmos DB for quick retrieval of fuel prices within a radius of the user’s selected location. This article looks at some interesting findings the team encountered working with Cosmos DB.
What storage is the right storage?
Before starting this project we had to decide on the storage platform for price data. The platform needed to support geospatial functions to find nearby fuel prices and have scalable performance.
- Azure SQL Server gave us great storage capacity and supported geospatial functions, but the price increased dramatically with performance. Performance is also difficult to scale for SQL (autoscale was a preview feature at the time of writing and incurs a delay on first access after sleep). We knew the data set was small, so SQL's major benefit was of little use to us.
- Azure Cache for Redis provided great performance, ample storage, a good price point and supported geospatial functions. When items are added using GEOADD, we can search for nearby items using GEOSEARCH with the BYRADIUS parameter [1] (a brief sketch follows this list).
- Azure Cosmos DB at the lower price points had limited storage, but still enough for the expected data set. Great performance was on offer starting at 400 RUs, with phenomenally low latency plus geospatial indexing and functions.
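For context, this is roughly what the Redis approach would have looked like with the StackExchange.Redis client. The key name, member and coordinates are made up, and GeoRadiusAsync maps to the older GEORADIUS command (newer client and server versions also expose GEOSEARCH with BYRADIUS):

using StackExchange.Redis;

var redis = await ConnectionMultiplexer.ConnectAsync("localhost:6379");
var db = redis.GetDatabase();

// GEOADD: store a fuel station in a geo set (longitude first, then latitude)
await db.GeoAddAsync("stations", 147.3, -42.8, "ace-fuel-hobart");

// Find members within 10 km of the user's location
var nearby = await db.GeoRadiusAsync("stations", 147.33, -42.88, 10, GeoUnit.Kilometers);
foreach (var result in nearby)
{
    Console.WriteLine(result.Member);
}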
What tipped the balance towards Cosmos DB was the ability to query all properties within an item. Redis, while it does have geospatial functions, is a key-value database. Cosmos gives us the flexibility to query on other properties in the future.
Should I use Cosmos DB?
We were able to select Cosmos DB because we understood the:
- current and future requirements of the data
- shape of the data and whether an item is self-contained
- performance and storage requirements
Many of us are familiar with large relational data sources with normalized table structures. We can update reference tables and those changes are immediately available to all tables that reference them. This kind of data is poorly suited to NoSQL. Changes to fields such as Category mean traversing all objects to find and replace the Category value. This is a costly operation which may have performance impacts on the system.
If your data can be represented as a self-contained document it may be a good candidate for NoSQL databases such as Cosmos DB. Also remember to consider factors such as cost, performance and read/write frequency. This will ensure that the data platform is the right choice for your solution.
The structure of Cosmos DB
Before getting started with Cosmos DB, it's important to understand the components we'll work with. Cosmos DB consists of:
- Account : Management of Keys, firewall settings, virtual network integration
- Database : Defines the type of database API e.g. SQL, Mongo, Gremlin. You can also set the throughput at the database level
- Container : Responsible for storing the schema-agnostic items. You can set the scale as shared (inherits from the database and shares with other containers within the database) or dedicated. The container is also where indexing, stored procedures and item TTL can be defined.
We selected the SQL API and used a single container for all our objects. Remember, containers are schema-agnostic, so you can put any object in a container. How you partition and index is important and we'll explore this in the next section.
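As a rough sketch (not our production code), creating the database and container with the .NET SDK looks something like the following. The account endpoint, key, names and the /suburb partition key (covered in the next section) are illustrative:

using Microsoft.Azure.Cosmos;

// Account: authenticate against the Cosmos DB account
var client = new CosmosClient("https://<account>.documents.azure.com:443/", "<key>");

// Database: provisioned with shared throughput its containers can inherit
var databaseResponse = await client.CreateDatabaseIfNotExistsAsync("fuel", throughput: 500);
Database database = databaseResponse.Database;

// Container: schema agnostic, partitioned by /suburb in this example
var containerResponse = await database.CreateContainerIfNotExistsAsync(
    new ContainerProperties("prices", "/suburb"));
Container container = containerResponse.Container;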
Understanding Partitions
This is a large topic for which there are many great resources. I'll only introduce the core concepts and how they applied to our project.
The primary concepts are:
- Item ID : The unique identifier for your item
- Partition key : Unique values define logical partitions
- Logical partition : Items that have the same partition key
- Physical partition : Azure’s internal physical distribution of logical partitions. There may be 1 or more logical partitions in a physical partition. You don’t have control of the physical partitions.
- Regions : Geographic distribution of physical partitions (if configured for your containers)
The following diagram shows how Cosmos DB implements the hierarchy of partitions and regions:
Consider the following fuel price item:
{
  "id": "64e682ae-fd8c-4350-ba96-1f92978ac8a7",
  "suburb": "Hobart",
  "brand": "Ace Fuel",
  "lat": -42.8,
  "long": 147.3,
  "prices": [{"ulp": 128.0}, {"diesel": 136.9}]
}
A partition key of suburb will create a separate Logical Partition for each suburb in my dataset. In the following query, Cosmos DB will distribute my query across the Logical Partitions and aggregate the response:
SELECT * FROM c WHERE c.brand = 'BP'
In the following query, Cosmos DB doesn’t have to distribute my query across partitions and can restrict it to my “Hobart” Logical Partition.
SELECT * FROM c WHERE c.brand = 'BP' AND c.suburb = 'Hobart'
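For illustration, this is roughly how the second query is pinned to a single logical partition with the .NET SDK by supplying the partition key in QueryRequestOptions (container setup omitted):

var query = new QueryDefinition(
        "SELECT * FROM c WHERE c.brand = @brand AND c.suburb = @suburb")
    .WithParameter("@brand", "BP")
    .WithParameter("@suburb", "Hobart");

// Supplying the partition key keeps the query on the "Hobart" logical partition
var iterator = container.GetItemQueryIterator<dynamic>(
    query,
    requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey("Hobart") });

while (iterator.HasMoreResults)
{
    var page = await iterator.ReadNextAsync();
    Console.WriteLine($"Returned {page.Count} items for {page.RequestCharge} RUs");
}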
Where things get interesting is when you provision capacity that exceeds the limits of containers (10,000 RUs and 20 GB [4]). Cosmos DB will allocate your logical partitions to different physical partitions and the first query will be issued across each physical partition at a cost of 2.5 RUs per partition.
Our dataset was small (< 250 records) and, because we were predominantly querying by lat / long, there wasn't a property that would satisfy most queries. Selecting suburb as the partition key would allow insert parallelism and, because our RU and storage requirements were well clear of the limits, we had a high degree of confidence that our logical partitions would reside on a single physical partition.
Indexing
As we couldn't rely on the partition key to improve query performance, we leveraged indexes. There are a number of indexing strategies available in Cosmos DB. Our queries would filter by lat / long, so Spatial indexes (docs.microsoft.com) were applicable.
For a spatial index we need the:
- The path to our latitude and longitude, represented as a Point
- A bounding box
- The container's Geospatial Configuration (under Container Settings) set to Geography
The following JSON defines our index. Within spatialIndexes we've set the path to our point and the boundingBox for our fuel station locations.
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*"
    }
  ],
  "excludedPaths": [
    {
      "path": "/\"_etag\"/?"
    }
  ],
  "spatialIndexes": [
    {
      "path": "/locations/*",
      "types": [
        "Point",
        "LineString",
        "Polygon",
        "MultiPolygon"
      ],
      "boundingBox": {
        "xmin": 143.423677,
        "ymin": -44.015256,
        "xmax": 148.747318,
        "ymax": -39.385084
      }
    }
  ]
}
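With the index in place, a radius query can use the built-in ST_DISTANCE function. The sketch below assumes each item stores a GeoJSON Point under a locations property to match the /locations/* path above (the flat lat / long fields in the earlier sample item would need to be stored this way for the index to apply); the coordinates and radius are made up:

// Find fuel prices within 5 km (5,000 m) of the user's location.
// ST_DISTANCE works on GeoJSON, where coordinates are [longitude, latitude].
var query = new QueryDefinition(
    "SELECT * FROM c WHERE ST_DISTANCE(c.locations, " +
    "{'type': 'Point', 'coordinates': [147.33, -42.88]}) < 5000");

var iterator = container.GetItemQueryIterator<dynamic>(query);
while (iterator.HasMoreResults)
{
    foreach (var item in await iterator.ReadNextAsync())
    {
        Console.WriteLine(item);
    }
}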
Performance
Cosmos DB measures performance in Request Units per second, or RUs. The cost to do a point read (i.e. fetching a single item by its ID and partition key value) for a 1 KB item is 1 Request Unit (1 RU).
As item sizes increase, queries span multiple physical partitions and more indexes need to be maintained, so the cost of our operations increases too.
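For example, a point read with the .NET SDK looks like the following, and the response exposes the charge so the ~1 RU cost can be verified (the id and partition key values are taken from the sample item above):

var response = await container.ReadItemAsync<dynamic>(
    "64e682ae-fd8c-4350-ba96-1f92978ac8a7",
    new PartitionKey("Hobart"));

Console.WriteLine($"Point read cost: {response.RequestCharge} RUs");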
Capacity is set as RUs with the minimum available being 400 RUs. Capacity can be set at the:
- Database : shared across all containers
- Container : dedicated to the container
Reads
We know that our fuel price solution has a small item size, one spatial index, a small number of items and moderate usage. Testing indicated that provisioning 500 RUs at the database level would cater for the expected demand.
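The provisioned value can also be inspected or adjusted from the SDK. A minimal sketch, assuming throughput is provisioned at the database level:

// Read the current database-level throughput (null if it's set on the containers instead)
int? currentRUs = await database.ReadThroughputAsync();
Console.WriteLine($"Provisioned: {currentRUs} RUs");

// Scale up or down as demand changes
await database.ReplaceThroughputAsync(500);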
Writes
Another requirement was to periodically refresh the fuel price data, which was provided every 20 minutes for all prices. We addressed this with an Azure Function that retrieved the data and upserted the values into our container.
Did we mention Cosmos DB was fast? We utilised the .NET SDK and used the Bulk support [6] to implement our periodic loads. Bulk support is optimised for throughput, and the impact was that our 20 minute job consumed all available RUs… we needed another strategy.
When you hit the capacity threshold, Cosmos DB will return an HTTP status code of 429 Too Many Requests. The Bulk support is smart: when it detects a 429 it will back off and retry. The downside is that you're also causing outages for your read clients, who will encounter a 429 and issue their own retries.
Note that retries can be configured for the CosmosClient using CosmosClientOptions.MaxRetryAttemptsOnRateLimitedRequests. This value defaults to 9 and shouldn't normally need to be changed.
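For reference, both bulk mode and the retry behaviour are set through CosmosClientOptions when the client is created. A sketch (endpoint and key are placeholders):

var client = new CosmosClient(
    "https://<account>.documents.azure.com:443/",
    "<key>",
    new CosmosClientOptions
    {
        // Optimises the SDK for throughput over latency
        AllowBulkExecution = true,

        // Defaults shown; rarely need changing
        MaxRetryAttemptsOnRateLimitedRequests = 9,
        MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30)
    });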
Despite retries in the API, we could still see issues in our testing. To reduce load, Bulk support was reverted to sequential upserts. While this ran slower and consumed fewer RUs, we were still hitting our configured capacity. We implemented an artificial pause to ensure the RUs consumed by the load wouldn't exceed half the maximum capacity of our database. The example below shows one way this can be handled. Note, this is a sample only: capacity may be defined on the Container instead, so Throughput should be checked as it is nullable.
public async Task UpsertWithDelayAsync<T>(IList<T> items)
{
    // Database and Container are assumed to be initialised elsewhere.
    // Read the provisioned throughput; Throughput is nullable because capacity
    // may be defined on the container rather than the database.
    var throughputResponse = await Database.ReadThroughputAsync(new RequestOptions());
    var maxRUs = throughputResponse.Resource.Throughput ?? 400;

    var startTime = DateTime.UtcNow;
    double totalRU = 0;

    foreach (var item in items)
    {
        var response = await Container.UpsertItemAsync(item);
        totalRU += response.RequestCharge;

        // Once half the provisioned RUs have been consumed, wait out the
        // remainder of the current one-second window before continuing.
        if (totalRU > maxRUs / 2.0)
        {
            var elapsed = DateTime.UtcNow - startTime;
            if (elapsed.TotalSeconds < 1)
            {
                await Task.Delay(TimeSpan.FromSeconds(1) - elapsed);
            }
            totalRU = 0;
            startTime = DateTime.UtcNow;
        }
    }
}
By ensuring the sum of RequestCharge per second doesn't exceed half our Throughput, we were able to keep consumed RUs below 50% during the 20 minute load and prevent any impact on end users.
Serverless
Cosmos DB also has a serverless tier, which allows you to select an upper bound in RUs (the minimum is 4,000 RUs). With no load, Cosmos DB will provision 10% of the upper bound. This seemed like a great solution; however, there are two considerations:
- Serverless is charged at 1.5 x Max RU/s used each hour
- At the time of writing, it's still in preview
For our scenario, our 20 minute Function would have peaked at ~1,000 RU/s every hour. As price is based on peak and not average, this would have resulted in 1.5 x 1,000 RUs. Our nominal load is significantly less than 1,000 RUs, which means we would pay for far more than we need!
Azure Cosmos DB serverless (Preview)
Summary
There are a myriad of data storage options available in Azure, including but not limited to Azure SQL Server, Azure Cache for Redis, OSS PaaS options such as MySQL, PostgreSQL and MariaDB, Azure Storage Accounts and Azure Cosmos DB.
Selecting the right platform for your data needs can set your solution up for success and ensure you meet performance, availability and pricing requirements.
Once we understood the right way to work with Cosmos DB, it was truly a great platform for our data needs. Recent pricing changes have also put this product within reach of most organisations.
We're also looking forward to using the serverless option once it's GA. It will provide automated scalability for clients with unpredictable load patterns. I would like to see revisions to how pricing is calculated, though.
References