Pagination with an OFFSET is better without OFFSET


Franck Pachot

Posted on December 9, 2023


In a previous post discussing joins and pagination, I read the first 10000 rows from a large table with some filtering. In YugabyteDB this was executed in 20 read requests, visible as loops=1 with Storage Index Read Requests: 10 for the outer table "one" and loops=10 with Storage Index Read Requests: 1 for the inner table "many":

explain (costs off, analyze, dist)
/*+ Set( yb_bnl_batch_size 1024 ) */
select one.category, one.created_at, many.value
 from (
    select * from one
    where one.category='🍓'
    order by one.created_at desc
    fetch first 10000 rows only
 ) one
 left outer join many using(one_id)
 order by one.created_at desc
;

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Sort (actual time=55.449..56.315 rows=10000 loops=1)
   Sort Key: one.created_at DESC
   Sort Method: quicksort  Memory: 853kB
   ->  YB Batched Nested Loop Left Join (actual time=6.767..51.799 rows=10000 loops=1)
         Join Filter: (one.one_id = many.one_id)
         ->  Limit (actual time=1.567..8.961 rows=10000 loops=1)
               ->  Index Only Scan using one_category_hash_created_desc_id on one (actual time=1.565..6.822 rows=10000 loops=1)
                     Index Cond: (category = '🍓'::text)
                     Heap Fetches: 0
                     Storage Index Read Requests: 10
                     Storage Index Read Execution Time: 1.420 ms
         ->  Index Only Scan using many_one_asc on many (actual time=3.179..3.179 rows=0 loops=10)
               Index Cond: (one_id = ANY (ARRAY[one.one_id, $1, $2, ..., $1023]))
               Heap Fetches: 0
               Storage Index Read Requests: 1
               Storage Index Read Execution Time: 2.251 ms

Now I want to read the next page with the same performance.

Bad solution: OFFSET

To read the next page I may be tempted to simply add offset 10000:

explain (costs off, analyze, dist)
/*+ Set( yb_bnl_batch_size 1024 ) */
select one.category, one.created_at, many.value
 from (
    select * from one
    where one.category='🍓'
    order by one.created_at desc
    offset 10000 fetch first 10000 rows only
 ) one
 left outer join many using(one_id)
 order by one.created_at desc
;

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Sort (actual time=64.970..65.990 rows=10000 loops=1)
   Sort Key: one.created_at DESC
   Sort Method: quicksort  Memory: 853kB
   ->  YB Batched Nested Loop Left Join (actual time=18.728..61.328 rows=10000 loops=1)
         Join Filter: (one.one_id = many.one_id)
         ->  Limit (actual time=13.184..21.144 rows=10000 loops=1)
               ->  Index Only Scan using one_category_hash_created_desc_id on one (actual time=1.541..17.279 rows=20000 loops=1)
                     Index Cond: (category = '🍓'::text)
                     Heap Fetches: 0
                     Storage Index Read Requests: 20
                     Storage Index Read Execution Time: 7.081 ms
         ->  Index Only Scan using many_one_asc on many (actual time=2.916..2.916 rows=0 loops=10)
               Index Cond: (one_id = ANY (ARRAY[one.one_id, $1, $2, ..., $1023]))
               Heap Fetches: 0
               Storage Index Read Requests: 1
               Storage Index Read Execution Time: 2.022 ms

This is not efficient. OFFSET is never a good idea in SQL because it has to read all the rows up to the offset plus the limit (Index Only Scan ... rows=20000) and discard the rows below the offset to reduce the result (Limit ... rows=10000). This cost grows with every subsequent page.

Good solution: start with unique last value

The correct way is to keep track of the last row that was read and start the next page from it. Here, because I may have multiple rows with the same created_at and the limit may have stopped in the middle of those duplicates, I need an additional unique column to identify the exact restart point.

One solution is to use the unique key. I already have one_id in the index:

yugabyte=# \d one
                                Table "public.one"
   Column   |           Type           | Collation | Nullable |      Default
------------+--------------------------+-----------+----------+-------------------
 one_id     | uuid                     |           | not null | gen_random_uuid()
 category   | text                     |           |          |
 created_at | timestamp with time zone |           |          | clock_timestamp()
 xxx        | integer                  |           |          |
Indexes:
    "one_pkey" PRIMARY KEY, lsm (one_id HASH)
    "one_category_hash_created_desc_id" lsm (category HASH, created_at DESC, one_id ASC)
Referenced by:
    TABLE "many" CONSTRAINT "many_one_id_fkey" FOREIGN KEY (one_id) REFERENCES one(one_id)

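For reference, here is a sketch of how such a covering index could be defined (the exact DDL isn't shown in this post): hash-sharded on the filter column, range-ordered on the pagination column, with the unique key as the final tie-breaker.

create index one_category_hash_created_desc_id
    on one (category hash, created_at desc, one_id asc);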

Then I change the ORDER BY accordingly (order by one.created_at desc, one.one_id asc) and return one_id, in addition to created_at, to the application. For the pagination, I add a WHERE clause that matches the ORDER BY clause ( one.created_at < $1::timestamptz or (one.created_at = $1::timestamptz and one_id > $2::uuid) ) so that the application runs it with the last (created_at, one_id) as ($1,$2). Unfortunately, this OR condition cannot use the index, so I add a redundant predicate: one.created_at <= $1::timestamptz. If I want to use the same query for the first and the next pages, an additional clause checks whether $1 and $2 are null:


Here, $1 is '2023-12-07 20:56:15.3813+00' and $2 is '6b7dcda2-a1de-4578-8252-fe86b44f6cbc', the created_at and one_id of the last row from the previous page:

yugabyte=# explain (costs off, analyze, dist)
/*+ Set( yb_bnl_batch_size 1024 ) */
select one.category, one.created_at, many.value, one.one_id
 from (
    select * from one
    where one.category='🍓'
    and (
      -- first page starts at the beginning of the index
      ( $1::timestamptz is null and $2::uuid is null )
      or
      (
       -- where the index scan starts for the next page
            one.created_at <= $1::timestamptz
       -- precise row to start
       and ( 
            one.created_at <   $1::timestamptz 
        or (one.created_at  = $1::timestamptz and one_id > $2::uuid ) 
       )
      )
    )
    order by one.created_at desc, one.one_id asc
    fetch first 10000 rows only
 ) one
 left outer join many using(one_id)
 order by one.created_at desc
;

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Sort (actual time=53.851..55.106 rows=10000 loops=1)
   Sort Key: one.created_at DESC
   Sort Method: quicksort  Memory: 1166kB
   ->  YB Batched Nested Loop Left Join (actual time=6.434..49.937 rows=10000 loops=1)
         Join Filter: (one.one_id = many.one_id)
         ->  Limit (actual time=1.897..9.530 rows=10000 loops=1)
               ->  Index Only Scan using one_category_hash_created_desc_id on one (actual time=1.895..7.235 rows=10000 loops=1)
                     Index Cond: ((category = '🍓'::text) AND (created_at <= '2023-12-07 20:56:15.20535+00'::timestamp with time zone))
                     Heap Fetches: 0
                     Storage Index Read Requests: 10
                     Storage Index Read Execution Time: 1.739 ms
         ->  Index Only Scan using many_one_asc on many (actual time=2.944..2.944 rows=0 loops=10)
               Index Cond: (one_id = ANY (ARRAY[one.one_id, $1, $2, ..., $1023]))
               Heap Fetches: 0
               Storage Index Read Requests: 1
               Storage Index Read Execution Time: 2.104 ms
 Planning Time: 0.807 ms
 Execution Time: 57.239 ms

This is more complex to write than OFFSET but also gives more control and is optimized to read only what is necessary.
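For example, the application can run it as a prepared statement and pass the created_at and one_id of the last row of the previous page. Here is a minimal sketch (the statement name is mine, and I've left out the explain and the yb_bnl_batch_size hint):

prepare next_page(timestamptz, uuid) as
select one.category, one.created_at, many.value, one.one_id
 from (
    select * from one
    where one.category='🍓'
    and (
      ( $1::timestamptz is null and $2::uuid is null )
      or (
            one.created_at <= $1::timestamptz
       and (
            one.created_at <  $1::timestamptz
        or (one.created_at  = $1::timestamptz and one_id > $2::uuid )
       )
      )
    )
    order by one.created_at desc, one.one_id asc
    fetch first 10000 rows only
 ) one
 left outer join many using(one_id)
 order by one.created_at desc;

-- first page: both parameters are null
execute next_page(null, null);
-- next pages: bind the last (created_at, one_id) returned by the previous page
execute next_page('2023-12-07 20:56:15.3813+00', '6b7dcda2-a1de-4578-8252-fe86b44f6cbc');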

I've already written about this technique in the past.

Alternative solution: last value + relative offset

Without adding a unique column from the table, you can create one with a window function that has the same ordering as the ORDER BY ... FETCH FIRST ... ROWS. Here's an example: row_number() over(partition by created_at order by created_at desc). This way, no additional sort is needed (check it in the execution plan) and each duplicate row is numbered, so the pagination can restart from the last (created_at, row_number). However, in my opinion, all tables should have a unique key, and adding it to the end of the index shouldn't be an issue.
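For illustration, here is a minimal sketch of that window-function alternative, assuming the same one table as above (this query isn't in the original post):

-- number the duplicates within each created_at value; (created_at, rn)
-- then plays the role that (created_at, one_id) plays in the previous section
select one.category, one.created_at,
       row_number() over (partition by one.created_at
                          order by one.created_at desc) as rn
  from one
 where one.category = '🍓'
 order by one.created_at desc
 fetch first 10000 rows only;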
