Pagination in MongoDB and the Bucket Pattern

Tariq Abughofa

Posted on December 3, 2019

MongoDB is a document-based NoSQL database that stores data in a JSON-like format. Because of its nested document structure, tables (or collections, as they are called in MongoDB) can hold more records than their SQL counterparts, which makes efficient pagination important. Pagination is useful for batch data processing as well as for displaying data in user interfaces. In this article, I go through the approaches that MongoDB provides for this problem.

Using the cursor API's skip and limit

The cursor API in MongoDB provides two functions that help implement pagination:

  • cursor.skip(n), which returns a cursor that begins returning results after skipping the first n documents.
  • cursor.limit(m), which constrains the size of a cursor’s result set to m documents.

This is how you paginate using the MongoDB shell:

// 1st page
db.users.find().limit(5)
// 2nd page
db.users.find().skip(5).limit(5)
// 3rd page
db.users.find().skip(10).limit(5)

Two things to note here. First, MongoDB cursors are not the same as cursors in SQL databases, which do stateful, server-side pagination over a result set; skip and limit work just like OFFSET and LIMIT in SQL. Second, the server still iterates over all the skipped documents, so this approach gets slower the deeper you page.
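
To make the arithmetic explicit, here is a minimal sketch that wraps this scheme in a helper; fetchPage is an illustrative name, not part of the MongoDB API, and pages are 1-indexed:

// a minimal sketch: offset-based pagination wrapped in a helper function
function fetchPage(pageNumber, pageSize) {
    return db.users.find()
        .sort({ _id: 1 })                   // stable order so pages don't overlap
        .skip((pageNumber - 1) * pageSize)  // the server still walks the skipped documents
        .limit(pageSize)
        .toArray()
}

// returns the same documents as db.users.find().skip(10).limit(5)
fetchPage(3, 5)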

Using the _id field

The _id field is part of every MongoDB document by default. It has the data type ObjectId. ObjectIds are 12-byte, unique, roughly time-ordered, auto-generated values that act as an identifier for the document, much like primary keys in SQL databases. The important features of the _id field are that it is ordered and indexed by default, which makes it suitable for pagination when combined with the limit function:

// 1st page
set = db.users.find().sort({'_id': 1}).limit(1000).toArray()
// remember the _id of the last document on the page
max_id = set[set.length - 1]._id
// 2nd page
set = db.users.find({'_id': {'$gt': max_id}}).sort({'_id': 1}).limit(1000).toArray()
max_id = set[set.length - 1]._id
// 3rd page
db.users.find({'_id': {'$gt': max_id}}).sort({'_id': 1}).limit(1000)
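
Wrapped in a loop, this scheme can walk the entire collection page by page. Here is a sketch, where forEachPage and its processPage callback are illustrative names rather than MongoDB API:

// a sketch: keyset pagination as a loop over the whole collection
function forEachPage(pageSize, processPage) {
    var lastId = null
    while (true) {
        var filter = lastId === null ? {} : { '_id': { '$gt': lastId } }
        var page = db.users.find(filter).sort({ '_id': 1 }).limit(pageSize).toArray()
        if (page.length === 0) break         // no documents left
        processPage(page)
        lastId = page[page.length - 1]._id   // resume after the last seen _id
    }
}

// print the size of every page
forEachPage(1000, function (docs) { print(docs.length) })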

Using an indexed field

If you have an indexed field and you want the pages sorted on that field instead, a good solution is to combine cursor.sort() and cursor.limit(n) with a comparison query operator ($gt, $lt) to skip the previous pages. This way the query uses the index to skip the unwanted documents and then reads only the n wanted documents. The query looks like this:

db.users.find({ created_date: { $gt: ISODate("2018-07-21T12:01:35") } })
    .sort({ created_date: 1 })
    .limit(1000)

The downside is that we can't jump directly to a specific page; if that is necessary, this approach doesn't work. Another caveat is that the sort field should be unique: otherwise, documents that share the boundary value of created_date can be skipped or repeated between pages.
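
A common workaround is to use _id as a tie-breaker. Here is a sketch, assuming last_date and last_id are variables holding the values from the last document of the previous page:

// a sketch: keyset pagination over a non-unique sort field,
// with _id as a tie-breaker so boundary documents are neither skipped nor repeated
db.users.find({
    $or: [
        { created_date: { $gt: last_date } },
        { created_date: last_date, _id: { $gt: last_id } }
    ]
})
    .sort({ created_date: 1, _id: 1 })
    .limit(1000)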

Using the Bucket Pattern

This is a storage/pagination technique unique to document-based NoSQL databases. It scales well in terms of both the size of the stored data and the size of the index, and at the same time it allows navigating to any page directly. However, this method starts with the way we store the data.

A good use case for this pattern is time-series data. Let's say we're receiving location updates through GPS every minute and we store each update as its own document:

{
   _id: ObjectId(...),
   source_id: 12345,
   timestamp: ISODate("2019-09-28T02:00:00.000Z"),
   latitude: -8.80173,
   longitude: -20.63476
}

{
   _id: ObjectId(...),
   source_id: 12345,
   timestamp: ISODate("2019-09-28T02:01:00.000Z"),
   latitude: -8.80175,
   longitude: -20.63478
}

{
   _id: ObjectId(...),
   source_id: 12345,
   timestamp: ISODate("2019-09-28T02:02:00.000Z"),
   latitude: -8.80178,
   longitude: -20.63486
}

Convenient indexes to have here are one on source_id and another on timestamp (or a compound index over both). With those, we can paginate the data sorted on timestamp as we saw in the previous method. However, the scalability of this solution is questionable: at one document per minute per source, the timestamp index and the collection get huge really fast, more than half a million documents per source per year.

Here the Bucket Pattern comes to the rescue. Instead of saving each data point as its own document, we leverage the document data model that MongoDB uses: we save all the data points that arrive within, let's say, an hour as a list inside one single document we refer to as a bucket. We also add extra attributes to the document, such as start_timestamp, the date at which the bucket's data points start, and maybe some pre-aggregated data. The bucket would look like this:

{
    source_id: 12345,
    start_timestamp: ISODate("2019-09-28T02:00:00.000Z"),
    locations: [
       {
           latitude: -8.80173,
           longitude: -20.63476
       },
       {
           latitude: -8.80175,
           longitude: -20.63478
       },
       // ... one point per minute, 60 points in the full bucket ...
       {
           latitude: -8.80378,
           longitude: -20.63786
       }
    ],
    average_speed: 56.056
}
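
For completeness, here is a minimal sketch of how such buckets might be built on write, using an upsert that appends each point to the current hour's bucket; truncateToHour is a hypothetical helper and db.locations is the collection name assumed in the queries below:

// a hypothetical helper that floors a date to the start of its hour
function truncateToHour(ts) {
    var d = new Date(ts.getTime())
    d.setUTCMinutes(0, 0, 0)   // zero out minutes, seconds and milliseconds
    return d
}

// upsert: creates the hour's bucket if it doesn't exist yet, then appends the point
db.locations.updateOne(
    { source_id: 12345, start_timestamp: truncateToHour(new Date()) },
    { $push: { locations: { latitude: -8.80173, longitude: -20.63476 } } },
    { upsert: true }
)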

Using the Bucket Pattern, we went down from 60 documents per hour to only one. As for the index, we can now index start_timestamp instead of timestamp, so it is roughly 60 times smaller.

Now you might ask "how does this help with pagination though?". The answer is that by pre-aggregating the data per hour, we get built-in pagination for the collection. To get the 10th page, we just need to fetch the 10th bucket document:

// the date at which our measurements started
data_start_point = ISODate("2019-01-01T01:00:00.000Z")
// the 10th one-hour page starts 9 hours after the first bucket,
// i.e. 9*60*60*1000 milliseconds later
page_timestamp = new Date(data_start_point.getTime() + 9*60*60*1000)
db.locations.find({ start_timestamp: page_timestamp })
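
If several GPS sources share the collection, the same lookup would presumably also filter on source_id (12345 is the sample source from earlier):

// hypothetical: the same page, restricted to a single GPS source
db.locations.find({ source_id: 12345, start_timestamp: page_timestamp })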

If you want to get the 10th page but prefer each page to span 3 hours instead of one, it's just a matter of math:

// the date at which our measurements started
data_start_point = ISODate("2019-01-01T01:00:00.000Z")
// the 10th 3-hour page starts 9*3 = 27 hours after the first bucket,
// i.e. 27*60*60*1000 milliseconds later
page_start_timestamp = new Date(data_start_point.getTime() + 27*60*60*1000)
// use "limit" to fetch the three one-hour buckets that make up the page
db.locations.find({ start_timestamp: { $gte: page_start_timestamp }})
    .sort({ start_timestamp: 1 })
    .limit(3)
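
Generalizing the arithmetic, a small sketch of a helper (bucketPage is an illustrative name; pages are 1-indexed) makes the page math explicit:

// a sketch: fetch the n-th page when each page spans hoursPerPage one-hour buckets
function bucketPage(dataStart, n, hoursPerPage) {
    var offsetMs = (n - 1) * hoursPerPage * 60 * 60 * 1000
    var pageStart = new Date(dataStart.getTime() + offsetMs)
    return db.locations.find({ start_timestamp: { $gte: pageStart } })
        .sort({ start_timestamp: 1 })
        .limit(hoursPerPage)
}

// the 10th page of 3-hour pages
bucketPage(ISODate("2019-01-01T01:00:00.000Z"), 10, 3)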