Will Velida
Posted on March 1, 2020
A gentle introduction to performing graph queries using the Azure Cosmos DB Gremlin API
Thanks to the Gremlin API offering of Azure Cosmos DB, we can build globally distributed databases that store and operate on graph data. We can perform graph queries in Cosmos DB using the Gremlin query language.
In this article, I’m going to show you how to set up a Cosmos DB account that supports the Gremlin API and how you can perform some simple graph queries in that account.
If you want to follow along, you’ll need to set up an Azure subscription. Check out this link on how to do that. Otherwise, let’s begin!
Setting up our Graph Database
Let’s start by creating our graph database. Head to your Azure Portal and click “Create a new resource”. Search for Azure Cosmos DB and click “Create New”.
In order to create our graph database, we’ll need to provide some basic information:
- Subscription — Essentially the subscription that your Cosmos DB account will be charged to.
- Resource Group — Resource groups are great for managing a collection of Azure resources. At the end of this tutorial, we can delete all the resources within a group, rather than deleting resources one by one. You can assign your Cosmos DB account to a pre-existing resource group or create a new one for it. I’ve decided that I’m going to create a new one for this tutorial.
- API — We can create different kinds of Cosmos DB datastores depending on the API that we assign it. In this case, we’re creating a Graph database, so we’ll be picking the Gremlin (Graph) API.
- Notebooks — Love Jupyter notebooks? Cosmos DB supports that! It’s out of scope for this tutorial so I’m not going to enable it.
- Location — Cosmos DB is a foundational service in Azure, meaning that it’s available in all Azure regions. Pick the region that’s closest to where you are right now. In my case, I’m based in Auckland, New Zealand, so I’m deploying my database in Australia East.
- Account Type — This is new?! I’ve never experienced this before. Hovering over the tooltip, it looks like it has something to do with how the UI experience works. I’ll have to do some more investigation into this, but for now I’ve just set it to Production. You can change it and it doesn’t impact the engine.
- Geo-redundancy and multi-region writes — I’ve disabled this for now. Look at the screenshot below for an example:
Click “Review + Create” and you’ll be redirected to a validation page. We don’t need to worry about networking or tags just for the moment. If your configuration is valid, you’ll be shown a success message and you can click “Create” to provision your graph database!
For this tutorial, we’re going to use the sample graph that Cosmos DB provides for all new Graph accounts. In order to do this, click “Quick Start”. You’ll be asked to choose a platform for creating a sample app, but all we’re doing here is creating a container with some sample data.
The Cosmos DB team have a Persons container that we can use, so click on Create ‘Persons’ container and that should be enough.
Click on “Data Explorer” to navigate to your graph.
This is new? They’ve changed the way that the data explorer works! It looks like we have to add our current IP address to our firewall rules in order to see our data? (Very Azure SQL, if you’ve used Azure SQL before).
Click on the notification and you’ll be redirected to the firewall settings tab. Your IP address should be prepopulated in the Firewall section below, so just click “Save” to add your IP address to the allow access list. After a few minutes you should be good to go.
Once this is all done, navigate back to the Data Explorer and we should see our new Persons Graph in our Cosmos DB account.
Alright, that’s all the admin setup that we need to do. Let’s start diving into some awesome graph queries!
Performing queries in our Graph Database
Let’s start by inserting a couple of items into our Persons Graph. Head to your data explorer and you should see a text box that allows us to execute Gremlin queries.
We’ll add five people to our graph. Type in the following Gremlin queries:
g.addV('person').property('firstName', 'Will').property('lastName', 'Velida').property('age', 28).property('hairColor', 'blonde').property('userId', 1).property('pk', 'pk')
g.addV('person').property('firstName', 'Alex').property('lastName', 'Smith').property('age', 22).property('hairColor', 'brown').property('userId', 2).property('pk', 'pk')
g.addV('person').property('firstName', 'Mike').property('lastName', 'Jones').property('hairColor', 'black').property('userId', 2).property('pk', 'pk')
g.addV('person').property('firstName', 'Sarah').property('lastName', 'Smith').property('hairColor', 'blonde').property('userId', 4).property('pk', 'pk')
g.addV('person').property('firstName', 'Debbie').property('lastName', 'Stevens').property('hairColor', 'black').property('age', 57).property('userId', 5).property('pk', 'pk')
What we’ve done here is add five person vertices. Vertices are discrete entities in our graphs. In this example, we have used people as our vertices, but they could be places or events.
Let’s take a bit of a deeper dive as to what each Gremlin command is doing:
- addV() — This adds a vertex (our discrete entities) to our graph.
- property() — This adds a property to our vertices.
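To make the shape of these queries concrete, here is a minimal Python sketch that assembles an addV() query string from a label and a dictionary of properties. The helper name add_vertex_query is hypothetical and not part of any Cosmos DB or Gremlin library; it only builds text that you could paste into the Data Explorer.

```python
def add_vertex_query(label, properties):
    """Build a Gremlin addV() query string (hypothetical helper, for illustration only)."""

    def literal(value):
        # Numbers are written bare; everything else becomes a single-quoted string.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return str(value)
        escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"

    # One addV() step for the label, then one property() step per key/value pair.
    steps = [f"g.addV({literal(label)})"]
    for key, value in properties.items():
        steps.append(f".property({literal(key)}, {literal(value)})")
    return "".join(steps)


will = {
    "firstName": "Will",
    "lastName": "Velida",
    "age": 28,
    "hairColor": "blonde",
    "userId": 1,
    "pk": "pk",
}
query = add_vertex_query("person", will)
print(query)
```

For the dictionary shown, this prints the same text as the first query above. Building queries by string concatenation is fine for experimenting, but in application code you would probably prefer whatever parameter binding your Gremlin client offers.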
Now let’s add some relationships between our vertices. These are referred to as edges in graph databases. Let’s write the following queries:
g.V().hasLabel('person').has('firstName', 'Will').addE('knows').to(g.V().hasLabel('person').has('firstName', 'Alex'))
g.V().hasLabel('person').has('firstName', 'Alex').addE('knows').to(g.V().hasLabel('person').has('firstName', 'Mike'))
Let’s examine the new Gremlin queries:
- addE() — adds an edge (relationship) between two vertices.
- has() and hasLabel() — used to filter properties, vertices and edges. In this example, we are filtering on our ‘firstName’ property.
I've messed up one of my entities! Luckily, we can update our vertices with a command like so:
g.V().hasLabel('person').has('firstName', 'Mike').property('userId', 3)
Let’s try filtering our Persons collection. Let’s retrieve all the vertices that have an age less than 30:
g.V().hasLabel('person').has('age', lt(30))
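Here, we get two results back, Will (28) and Alex (22). Mike and Sarah were added without an age property, so the filter skips them.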
We can refine this query further to return just the name properties:
g.V().hasLabel('person').has('age', lt(30)).values('firstName')
which returns the following result:
Finally, let’s perform a simple traversal in our graph to find out all the people that Alex knows:
g.V().hasLabel('person').has('firstName', 'Alex').outE('knows').inV().hasLabel('person')
As we can see, he knows one person, Mike Jones:
[
{
"id": "c2260feb-207b-403b-82cb-30cd47006912",
"label": "person",
"type": "vertex",
"properties": {
"firstName": [
{
"id": "1a383c60-82ce-414b-a610-8c2b9ab08759",
"value": "Mike"
}
],
"lastName": [
{
"id": "1a1a762a-dace-4912-aa16-e87f4327d1a7",
"value": "Jones"
}
],
"hairColor": [
{
"id": "5f71ca42-1cf0-4da4-93a9-c24693975e56",
"value": "black"
}
],
"userId": [
{
"id": "9e06cf3a-e58a-494f-aaf2-e0f1e2940cce",
"value": 3
}
],
"pk": [
{
"id": "4f27fd02-1861-480a-8887-4b40c8c7d6c6",
"value": "pk"
}
]
}
}
]
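If it helps to see what the filter and the traversal are doing conceptually, here is a rough plain-Python model of the sample graph. It is only a sketch: the dictionaries, the integer keys (borrowed from userId) and the function names are assumptions made for illustration, and this is not how Cosmos DB stores or evaluates anything.

```python
# Plain-Python stand-in for the five people added earlier (hypothetical structure).
vertices = {
    1: {"label": "person", "firstName": "Will", "age": 28},
    2: {"label": "person", "firstName": "Alex", "age": 22},
    3: {"label": "person", "firstName": "Mike"},   # added without an age
    4: {"label": "person", "firstName": "Sarah"},  # added without an age
    5: {"label": "person", "firstName": "Debbie", "age": 57},
}

# Each edge points from its "out" vertex to its "in" vertex.
edges = [
    {"label": "knows", "out": 1, "in": 2},  # Will knows Alex
    {"label": "knows", "out": 2, "in": 3},  # Alex knows Mike
]


def people_younger_than(limit):
    # Mirrors has('age', lt(limit)): a vertex with no age property never matches.
    return [
        v["firstName"]
        for v in vertices.values()
        if "age" in v and v["age"] < limit
    ]


def known_by(first_name):
    # Mirrors has('firstName', ...).outE('knows').inV():
    # find the start vertices, follow outgoing 'knows' edges, return the targets.
    starts = {vid for vid, v in vertices.items() if v["firstName"] == first_name}
    return [
        vertices[e["in"]]["firstName"]
        for e in edges
        if e["label"] == "knows" and e["out"] in starts
    ]


print(people_younger_than(30))
print(known_by("Alex"))
```

The second function shows why direction matters: Will knows Alex, but that edge points into Alex, so following Alex's outgoing edges only reaches Mike.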
Assessing performance of graph queries
This is pretty cool stuff, but I can already hear some of you say ‘how can I measure the performance of these queries?’.
Cosmos DB’s Gremlin API provides a step called executionProfile() that lets us look at metrics on how our query performed. Let’s append it to our previous query and see how it performs:
g.V().hasLabel('person').has('firstName', 'Alex').outE('knows').inV().hasLabel('person').executionProfile()
This is the response we get back:
[
{
"gremlin": "g.V().hasLabel('person').has('firstName', 'Alex').outE('knows').inV().hasLabel('person').executionProfile()",
"activityId": "983b9301-f94a-4e0d-a743-92c3f53ffcff",
"totalTime": 21,
"totalResourceUsage": 9.17,
"metrics": [
{
"name": "GetVertices",
"time": 12,
"stepResourceUsage": 3.06,
"annotations": {
"percentTime": 57.14,
"percentResourceUsage": 33.37
},
"counts": {
"resultCount": 1
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 1,
"size": 761,
"storageCount": 1,
"storageSize": 704,
"time": 8.66,
"storeResourceUsage": 3.06
}
]
},
{
"name": "GetEdges",
"time": 4,
"stepResourceUsage": 3.21,
"annotations": {
"percentTime": 19.05,
"percentResourceUsage": 35.01
},
"counts": {
"resultCount": 1
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 1,
"size": 481,
"storageCount": 1,
"storageSize": 432,
"time": 3.48,
"storeResourceUsage": 3.21
}
]
},
{
"name": "GetNeighborVertices",
"time": 4,
"stepResourceUsage": 2.9,
"annotations": {
"percentTime": 19.05,
"percentResourceUsage": 31.62
},
"counts": {
"resultCount": 1
},
"storeOps": [
{
"fanoutFactor": 1,
"count": 1,
"size": 695,
"storageCount": 1,
"storageSize": 638,
"time": 3.81,
"storeResourceUsage": 2.9
}
]
},
{
"name": "FilterInBatchOperator",
"time": 0,
"stepResourceUsage": 0,
"annotations": {
"percentTime": 0,
"percentResourceUsage": 0
},
"counts": {
"resultCount": 1
}
},
{
"name": "ProjectOperator",
"time": 1,
"stepResourceUsage": 0,
"annotations": {
"percentTime": 4.76,
"percentResourceUsage": 0
},
"counts": {
"resultCount": 1
}
}
]
}
]
Let’s go through the following properties:
- gremlin — The statement that was executed.
- totalTime — The time, in milliseconds, that the whole operation took.
- metrics — Metrics for each of the steps that were executed in our Gremlin query. The storage-backed steps are GetVertices (fetching the vertices that match our filters), GetEdges (fetching the edges attached to those vertices) and GetNeighborVertices (fetching the vertices on the other end of those edges). Each step reports the time it took in milliseconds, its percentTime share of the total execution time, how many results it returned and, under storeOps, the count and size of what was read from storage. The last two steps, FilterInBatchOperator and ProjectOperator, report no storeOps in this response.
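Once you have a profile like this, a few lines of Python can rank the steps by cost. The sketch below is an assumption-laden illustration: it uses a trimmed copy of the response above (only the fields it reads), and the summarize function is a hypothetical helper, not something Cosmos DB ships.

```python
import json

# Trimmed copy of the executionProfile() response shown earlier.
profile_json = """
[{"totalTime": 21, "totalResourceUsage": 9.17, "metrics": [
  {"name": "GetVertices", "time": 12, "stepResourceUsage": 3.06},
  {"name": "GetEdges", "time": 4, "stepResourceUsage": 3.21},
  {"name": "GetNeighborVertices", "time": 4, "stepResourceUsage": 2.9},
  {"name": "FilterInBatchOperator", "time": 0, "stepResourceUsage": 0},
  {"name": "ProjectOperator", "time": 1, "stepResourceUsage": 0}
]}]
"""


def summarize(profile):
    """Return (name, time_ms, percent_of_total_time, resource_usage) rows, slowest first."""
    total_time = profile["totalTime"]
    rows = []
    for step in profile["metrics"]:
        # Recompute each step's share of the total time, guarding against a zero total.
        share = round(100 * step["time"] / total_time, 2) if total_time else 0.0
        rows.append((step["name"], step["time"], share, step["stepResourceUsage"]))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# The response is a list with one profile object in it.
profile = json.loads(profile_json)[0]
rows = summarize(profile)
for name, ms, share, usage in rows:
    print(f"{name:<24}{ms:>4} ms {share:>6.2f}% {usage:>5.2f}")
```

The recomputed shares line up with the percentTime values in the response, and GetVertices comes out on top, so the initial vertex lookup is the first place to look if this query ever gets slow.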
Conclusion
In this post, we set up a new Cosmos DB Graph database and performed some basic queries in it. We also looked at how we can assess the performance of those queries. If you’re done with this graph db, feel free to delete it. Otherwise, why not add some more properties and take a look at the Cosmos DB Graph documentation to discover what else you can do with Gremlin queries?
In a future post, I’ll talk about how we can model our data for graph databases and look at some best practices. I’ll also talk about how we can apply these concepts in the context of developing applications for Azure Cosmos DB.
If you have any questions, please feel free to ask them in the comment section below.
Until next time!