Tariq Abughofa
Posted on December 3, 2019
MongoDB is a document-based NoSQL database that stores data in JSON-like documents. Because of its nested document data structure, tables (or collections, as they are called in MongoDB) can hold more records than their SQL counterparts, which makes efficient pagination important. Pagination can be used for batch data processing or for showing data on user interfaces. In this article, I go through the approaches that MongoDB provides for this problem.
Using the cursor API's skip and limit
The cursor API in MongoDB provides two functions that help with implementing pagination:
- cursor.skip(n) which returns a cursor that begins returning results after skipping the first n documents.
- cursor.limit(m) which constrains the size of a cursor's result set to m documents.
This is how you paginate using the MongoDB shell:
// 1st page
db.users.find().limit(5)
// 2nd page
db.users.find().skip(5).limit(5)
// 3rd page
db.users.find().skip(10).limit(5)
Two things to note here: MongoDB cursors are not the same as cursors in SQL databases, which do stateful, server-side pagination over a result set; this combination is actually equivalent to OFFSET and LIMIT in SQL. Also, skip(n) doesn't use an index to jump ahead: the server still iterates over the skipped documents, so this approach gets slower the deeper you page into a large collection.
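To make the page arithmetic reusable, here is a minimal sketch of a shell helper; getPage, pageNumber, and pageSize are names I'm introducing for illustration, not part of the MongoDB API:
// a minimal sketch: fetch 0-based page `pageNumber` with `pageSize` documents per page
// note: the server still walks over every skipped document
function getPage(pageNumber, pageSize) {
    return db.users.find()
        .sort({ _id: 1 })   // a stable sort keeps page boundaries consistent
        .skip(pageNumber * pageSize)
        .limit(pageSize)
        .toArray()
}
// equivalent to the 3rd-page query above
getPage(2, 5)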
Using the _id field
The _id field is part of every MongoDB document by default. It has the data type ObjectId. ObjectIds are unique, roughly time-ordered, auto-generated 12-byte values that act as an identifier for the document, much like primary keys in SQL databases. The important feature of the _id field is that it is ordered and indexed by default, which makes it suitable for pagination when combined with the limit function:
// 1st page
set = db.users.find().sort({'_id': 1}).limit(1000).toArray()
max_id = set[set.length - 1]._id
// 2nd page
set = db.users.find({'_id': {'$gt': max_id}}).sort({'_id': 1}).limit(1000).toArray()
max_id = set[set.length - 1]._id
// 3rd page
db.users.find({'_id': {'$gt': max_id}}).sort({'_id': 1}).limit(1000)
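The same idea wrapped in a helper; nextPage, last_id, and page_size are names I'm introducing for illustration:
// a sketch of keyset pagination on _id; pass null to get the first page
function nextPage(last_id, page_size) {
    var filter = last_id === null ? {} : { _id: { $gt: last_id } }
    return db.users.find(filter).sort({ _id: 1 }).limit(page_size).toArray()
}
var first = nextPage(null, 1000)
var second = nextPage(first[first.length - 1]._id, 1000)
Because each query seeks straight to max_id through the index instead of scanning the skipped documents, page latency stays flat no matter how deep you go.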
Using an indexed field
If you have an indexed field and you want to return the pages sorted on that field instead, a good solution is to combine cursor.sort() and cursor.limit(n) with a comparison query operator ($gt, $lt) to skip the previous pages. This way the query uses the index to skip the unwanted documents and then reads only the n wanted documents. The query looks like this:
db.users.find({ created_date: { $gt: ISODate("2018-07-21T12:01:35") } })
.sort({ created_date: 1 })
.limit(1000)
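In practice, you carry the boundary value from one page to the next, just as we did with max_id. A sketch, assuming the created_date field from the example above; if created_date is not unique, add _id as a secondary sort key and tie-breaker so documents are neither skipped nor repeated:
// sketch: page forward on an indexed field, carrying the last seen value
var last_seen = ISODate("2018-07-21T12:01:35")
var page = db.users.find({ created_date: { $gt: last_seen } })
                   .sort({ created_date: 1 })
                   .limit(1000)
                   .toArray()
// boundary value for the next page
last_seen = page[page.length - 1].created_date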
The downside is that we can't jump directly to a specific page. If that is necessary, this approach doesn't work.
Using the Bucket Pattern
This is a unique storage/pagination technique that can only be used with document-based NoSQL databases. It scales well in terms of both the size of the stored data and the index, while still allowing navigation to any page directly. However, this method starts with the way we store the data.
A good use case for this pattern is time-series data. Let's say we're getting location updates through GPS each minute and we store each reading as a document this way:
{
  _id: ObjectId(...),
  source_id: 12345,
  timestamp: ISODate("2019-09-28T02:00:00.000Z"),
  latitude: -8.80173,
  longitude: -20.63476
}
{
  _id: ObjectId(...),
  source_id: 12345,
  timestamp: ISODate("2019-09-28T02:01:00.000Z"),
  latitude: -8.80175,
  longitude: -20.63478
}
{
  _id: ObjectId(...),
  source_id: 12345,
  timestamp: ISODate("2019-09-28T02:02:00.000Z"),
  latitude: -8.80178,
  longitude: -20.63486
}
Convenient indices to have here are one on source_id and another on timestamp. Next, we can paginate the data sorted on timestamp as we saw in the previous method. However, the scalability of this solution is questionable, as the timestamp index and the collection grow huge really fast.
Here the Bucket Pattern comes to the rescue. Instead of saving each data point as its own document, we leverage the document data model that MongoDB uses: we save all the data points that fall within, say, one hour as a list inside a single document we refer to as a bucket. We also add extra attributes to the document, such as start_timestamp, the date at which the bucket's data points start, and maybe some aggregated data. The bucket would look like this:
{
  source_id: 12345,
  start_timestamp: ISODate("2019-09-28T02:00:00.000Z"),
  locations: [
    {
      latitude: -8.80173,
      longitude: -20.63476
    },
    {
      latitude: -8.80175,
      longitude: -20.63478
    },
    …
    {
      latitude: -8.80378,
      longitude: -20.63786
    }
  ],
  average_speed: 56.056
}
Using the Bucket Pattern, we went down from 60 documents per hour to just one. As for the index, we can now index on start_timestamp instead of timestamp, making it roughly 60 times smaller.
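As background, here's a hedged sketch of how such buckets might be filled at ingest time, pushing each reading into its hour's bucket with an upsert; I'm reusing the db.users collection name from the queries below, though a real system would likely use a dedicated collection:
// sketch: append a GPS reading to its one-hour bucket, creating the bucket if missing
var reading = { timestamp: ISODate("2019-09-28T02:01:00.000Z"), latitude: -8.80175, longitude: -20.63478 }
// truncate the timestamp to the start of its hour (3600000 ms per hour)
var hour = new Date(Math.floor(reading.timestamp.getTime() / 3600000) * 3600000)
db.users.updateOne(
    { source_id: 12345, start_timestamp: hour },
    { $push: { locations: { latitude: reading.latitude, longitude: reading.longitude } } },
    { upsert: true }
)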
Now you might ask: "how does this help with pagination, though?" The answer is that by pre-aggregating the data per hour, we have implemented built-in pagination for the collection. To get the 10th page, we just need to fetch the 10th bucket document:
// the date at which our measurement started
data_start_point = ISODate("2019-01-01T01:00:00.000Z")
// the 10th one-hour page starts 9 hours later, which is 9*60*60*1000 in milliseconds
page_timestamp = new Date(data_start_point.getTime() + 9*60*60*1000)
db.users.find({ start_timestamp: page_timestamp })
If you want the 10th page but prefer each page to cover 3 hours instead of one, it's just a matter of math:
// the date at which our measurement started
data_start_point = ISODate("2019-01-01T01:00:00.000Z")
// the 10th 3-hour page starts 9 pages later, which is 9*3*60*60*1000 in milliseconds
page_start_timestamp = new Date(data_start_point.getTime() + 9*3*60*60*1000)
// use "limit" to read the page's three one-hour buckets
db.users.find({ start_timestamp: { $gte: page_start_timestamp } })
    .sort({ start_timestamp: 1 })
    .limit(3)
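If you'd rather bound the page with an explicit time range instead of relying on limit, an equivalent sketch would look like this; page_end_timestamp is a name I'm introducing:
// each page spans exactly 3 one-hour buckets
page_end_timestamp = new Date(page_start_timestamp.getTime() + 3*60*60*1000)
db.users.find({ start_timestamp: { $gte: page_start_timestamp, $lt: page_end_timestamp } })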