ORDER BY is mandatory in SQL to get a sorted result
Franck Pachot
Posted on November 15, 2021
In IT, just like in math, a negative test can prove that an assertion is wrong, but a positive test doesn't prove it right. This is even trickier in IT, because algorithms can return apparently correct results far more often than chance alone would explain. Simply put: never draw conclusions from results alone. It's essential to understand how it works.
Here's a classic example:
postgres=# create table demo (n int primary key);
CREATE TABLE
postgres=# insert into demo
select n from generate_series(1,1e6) n;
INSERT 0 1000000
postgres=# select n from demo limit 10;
n
----
1
2
3
4
5
6
7
8
9
10
(10 rows)
Using this example, one might assume that a SELECT query returns rows in a specific order. However, this is a misconception. SQL is a declarative language, and without an ORDER BY clause, the result is essentially unordered. The apparent sorting in this scenario is merely a consequence of inserting data into heap tables by appending it at the end of the file and reading the file sequentially with a single thread. Rows are displayed as they are retrieved.
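The heap behavior can be sketched outside the database. This is a toy Python model, not PostgreSQL internals: the point is only that insertion order and scan order coincide by accident, not by contract.

```python
# Toy model of a heap table (not PostgreSQL internals): an append-only
# list, scanned front to back by a single process.
heap = []
for n in range(1, 11):
    heap.append(n)        # INSERT appends rows at the end of the file

def seq_scan(table):
    # a serial Seq Scan reads pages sequentially, with one worker
    yield from table

# rows come back in insertion order only as a side effect of the scan
print(list(seq_scan(heap)))
```

Change the physical storage or the access path, and this accidental ordering disappears, as the next examples show.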
When inserting the rows in a different order:
postgres=# drop table if exists demo;
DROP TABLE
postgres=# create table demo (n int primary key);
CREATE TABLE
postgres=# insert into demo select 1e6-n from generate_series(1,1e6) n;
INSERT 0 1000000
postgres=# select n from demo limit 10;
n
--------
999999
999998
999997
999996
999995
999994
999993
999992
999991
999990
(10 rows)
they appear as they were originally stored. This behavior is typical of a serial SeqScan:
postgres=# explain select n from demo limit 10;
QUERY PLAN
--------------------------------------------------------
Limit (cost=0.00..0.14 rows=10 width=4)
-> Seq Scan on demo (cost=0.00..14425.00 rows=1000000 width=4)
(2 rows)
This behavior cannot be relied upon. It can change with a different execution plan, and that can happen for many reasons: a different data distribution, new indexes, parallel execution, or simply a configuration change:
postgres=# set enable_seqscan=false;
SET
postgres=# select n from demo limit 10;
n
---
0
1
2
3
4
5
6
7
8
9
(10 rows)
postgres=# explain select n from demo limit 10;
QUERY PLAN
--------------------------------------------------------
Limit (cost=0.42..0.77 rows=10 width=4)
-> Index Only Scan using demo_pkey on demo (cost=0.42..34712.43 rows=1000000 width=4)
(2 rows)
Several factors can influence this behavior. For example, an index scan instead of a full table scan, parallel query execution, and updates can all alter the way rows are read and their physical location:
postgres=# set enable_seqscan=true;
SET
postgres=# alter table demo add column x char;
ALTER TABLE
postgres=# update demo set x=1 where mod(n,3)=0;
UPDATE 333334
postgres=# select n from demo limit 10;
n
--------
999998
999997
999995
999994
999992
999991
999989
999988
999986
999985
(10 rows)
The only way to obtain a reliably sorted result is by using an ORDER BY clause:
postgres=# select n from demo order by n limit 10;
n
---
0
1
2
3
4
5
6
7
8
9
(10 rows)
Don't assume that an ORDER BY clause will necessarily add a sorting operation. The query planner may choose a physical structure that returns the rows in the desired order without an explicit sorting step:
postgres=# explain select n from demo order by n limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------
Limit (cost=0.42..0.77 rows=10 width=4)
-> Index Only Scan using demo_pkey on demo (cost=0.42..34716.43 rows=1000000 width=4)
(2 rows)
In SQL, you declare the desired order using an ORDER BY clause, and the query planner selects the appropriate access methods to retrieve the data in that order. The optimizer has this knowledge, but it's typically hidden from you unless you've read and comprehended the source code for the exact version you're using. That's why you must add an ORDER BY if you expect a sorted result.
YugabyteDB
In a distributed SQL database, it's highly likely that the default order won't align with your expectations:
yugabyte=# create table demo (n int primary key);
CREATE TABLE
yugabyte=# insert into demo select 1e6-n from generate_series(1,1e6) n;
INSERT 0 1000000
yugabyte=# select n from demo limit 10;
n
--------
110359
192735
219128
237047
310517
593962
627995
651891
669921
790562
(10 rows)
This may appear random at first glance: it exhibits an increasing order that is not tied to the insertion sequence, unlike PostgreSQL. However, there is an underlying logic to it. Here, the default sharding method applies a hash function to the first column of the primary key. We can query this hash function ourselves with yb_hash_code():
yugabyte=# select min(h),max(h),avg(h),
count(*)::float/count(distinct h) per_hash from (
select yb_hash_code( generate_series(1,1e6) ) h
) v;
min | max | avg | per_hash
-----+-------+--------------------+---------------
0 | 65535 | 32774.509179000000 | 15.2587890625
(1 row)
This function can return one of 65536 values, ranging from 0 to 65535. These values are used to distribute rows into tablets within the distributed storage system of YugabyteDB, called DocDB.
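The per_hash figure above is exactly what the arithmetic predicts. A quick check in Python (not database code):

```python
buckets = 2 ** 16      # yb_hash_code returns a 16-bit value
rows = 1_000_000
print(buckets)         # 65536 possible hash codes, 0 to 65535
print(rows / buckets)  # 15.2587890625 rows per hash code on average
```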
Without an ORDER BY clause, the rows are returned in the order of their hash code:
yugabyte=# select yb_hash_code(n), n from demo limit 10;
yb_hash_code | n
--------------+--------
0 | 110359
0 | 192735
0 | 219128
0 | 237047
0 | 310517
0 | 593962
0 | 627995
0 | 651891
0 | 669921
0 | 790562
(10 rows)
Indeed, when querying all rows with a specific hash code, they are returned in order, because for each hash value, in each DocDB partition, the rows are clustered and sorted on their primary key.
yugabyte=# select n from demo where yb_hash_code(n)=0;
n
--------
110359
192735
219128
237047
310517
593962
627995
651891
669921
790562
792363
819768
891493
984191
(14 rows)
When you delve into the physical storage and retrieval process using a SeqScan, the initial order may seem random, but it's not mysterious. With 1 million rows and 65536 hash values, there are roughly 15.2 rows per hash code on average. What appears to be an order when selecting a small number of rows exhibits a different pattern when selecting more rows than the number of rows per hash code:
yugabyte=# select n, yb_hash_code(n) from demo limit 50;
n | yb_hash_code
--------+--------------
110359 | 0
192735 | 0
219128 | 0
237047 | 0
310517 | 0
593962 | 0
627995 | 0
651891 | 0
669921 | 0
790562 | 0
792363 | 0
819768 | 0
891493 | 0
984191 | 0
17012 | 1
24685 | 1
153595 | 1
186378 | 1
219742 | 1
258869 | 1
271029 | 1
547922 | 1
565568 | 1
763430 | 1
766123 | 1
772002 | 1
781840 | 1
840822 | 1
844655 | 1
953917 | 1
162485 | 2
168413 | 2
271551 | 2
285516 | 2
407063 | 2
420509 | 2
440160 | 2
572540 | 2
585722 | 2
589471 | 2
628271 | 2
719191 | 2
837125 | 2
866379 | 2
951013 | 2
976519 | 2
994652 | 2
854 | 3
57757 | 3
70079 | 3
(50 rows)
Internally, within YugabyteDB, table rows are organized in order based on the DocDB key. This key is essentially the primary key with a prefix of the hash code. This design facilitates fast access to specific key points or ranges: starting from the hash code, it identifies the correct tablet, and subsequently locates the desired row within the SSTable (Sorted Sequence Table) structure.
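This composite-key ordering can be simulated with a toy 16-bit hash standing in for the real yb_hash_code (the hash values below are illustrative, not YugabyteDB's):

```python
import hashlib

def toy_hash(n: int) -> int:
    # stand-in for yb_hash_code: any deterministic 16-bit hash will do
    return int.from_bytes(hashlib.md5(str(n).encode()).digest()[:2], "big")

# DocDB keys each row by (hash_code, primary_key) and stores rows
# sorted on that composite key
rows = list(range(1, 1001))
docdb_order = sorted(rows, key=lambda n: (toy_hash(n), n))

# a SeqScan walks this order: grouped by hash code, and sorted by
# primary key within each hash code
print([(toy_hash(n), n) for n in docdb_order[:5]])
```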
If I choose to shard by range instead, even a SeqScan returns the results in order:
yugabyte=# drop table demo;
DROP TABLE
yugabyte=# create table demo (n int, primary key(n asc))
split at values ( (333333),(666666) );
CREATE TABLE
yugabyte=# insert into demo select 1e6-n from generate_series(1,1e6) n;
INSERT 0 1000000
yugabyte=# select n from demo limit 10;
n
---
0
1
2
3
4
5
6
7
8
9
(10 rows)
yugabyte=# explain analyze select n from demo limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.00 rows=10 width=4) (actual time=0.788..0.796 rows=10 loops=1)
-> Seq Scan on demo (cost=0.00..100.00 rows=1000 width=4) (actual time=0.787..0.791 rows=10 loops=1)
Planning Time: 0.047 ms
Execution Time: 0.855 ms
(4 rows)
The Seq Scan displays an order similar to that of an Index Scan on the primary key:
yugabyte=# explain analyze select n from demo order by n limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.14 rows=10 width=4) (actual time=0.809..0.816 rows=10 loops=1)
-> Index Scan using demo_pkey on demo (cost=0.00..114.00 rows=1000 width=4) (actual time=0.808..0.813 rows=10 loops=1)
Planning Time: 0.066 ms
Execution Time: 0.843 ms
(4 rows)
This occurs because the table is stored in the primary key structure, similar to Oracle's Index Organized Tables, SQL Server's Clustered Indexes, or MySQL's InnoDB tables.
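The split clause used above can be sketched as a lookup table. This assumes, as I understand YugabyteDB's range sharding, that each split value is the inclusive lower bound of the next tablet's range:

```python
from bisect import bisect_right

# SPLIT AT VALUES ((333333),(666666)) defines three tablets; assumption:
# each split value starts the next tablet's range
split_points = [333333, 666666]

def tablet_for(n: int) -> int:
    # binary search on the split points locates the tablet
    return bisect_right(split_points, n)

print(tablet_for(100))      # 0: first tablet, n < 333333
print(tablet_for(333333))   # 1: second tablet
print(tablet_for(999999))   # 2: last tablet
```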
I mentioned earlier that the hash code has 65536 possible values:
yugabyte=# drop table demo;
DROP TABLE
yugabyte=# create table demo (n int primary key)
split into 16 tablets;
CREATE TABLE
yugabyte=# insert into demo
select n from generate_series(1,1e6) n;
INSERT 0 1000000
yugabyte=# select count(distinct yb_hash_code(n)) from demo;
count
-------
65536
(1 row)
However, it's important to note that the number of tablets is typically smaller, with just a few per node, to facilitate rebalancing when adding nodes.
To determine the number of tablets from the execution plan, I execute an aggregate function that gets pushed down to each tablet. This way, the number of rows in the execution plan corresponds to the number of aggregations performed:
yugabyte=# explain analyze select count(*) from demo;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Aggregate (cost=102.50..102.51 rows=1 width=8) (actual time=984.031..984.031 rows=1 loops=1)
-> Seq Scan on demo (cost=0.00..100.00 rows=1000 width=0) (actual time=357.419..984.010 rows=16 loops=1)
Planning Time: 0.055 ms
Execution Time: 984.145 ms
(4 rows)
The rows=16 corresponds to the 16 count(*) results obtained from the tablets. In this case, there are 16 tablets, and each tablet handles 4096 hash codes.
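The arithmetic behind these figures, assuming (as is the default for hash-sharded tables with a fixed tablet count) that the 16-bit hash space is split evenly into contiguous per-tablet ranges:

```python
hash_codes = 2 ** 16          # yb_hash_code range: 0..65535
tablets = 16                  # from SPLIT INTO 16 TABLETS
codes_per_tablet = hash_codes // tablets
print(codes_per_tablet)               # 4096 hash codes per tablet
print(0 // codes_per_tablet)          # hash code 0 lands in tablet 0
print(65535 // codes_per_tablet)      # hash code 65535 lands in tablet 15
```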
What about GROUP BY or DISTINCT?
Because many databases implement deduplication by sorting the row set, it frequently leads to the misconception that GROUP BY or DISTINCT queries yield ordered results. This misconception even shows up in polls on Twitter, where such questions gather numerous incorrect responses.
To truly master SQL, you should disregard the myths disseminated by self-proclaimed experts who make broad generalizations based on isolated observations. Instead, prioritize understanding execution plans and remember these key points that apply to all SQL databases:
- SQL is a declarative language, so you can't predict how rows are processed without examining the execution plan
- The use of ORDER BY is the sole declaration in a SQL query that ensures a sorted result.
Back to my example on YugabyteDB (you can observe the same on PostgreSQL): without an ORDER BY, the result is not sorted:
yugabyte=> select n from demo group by n limit 10;
n
--------
790
662718
694339
233338
129976
164125
756002
502084
157514
503514
(10 rows)
yugabyte=> explain analyze
select n from demo group by n limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Limit (cost=102.50..102.60 rows=10 width=4) (actual time=2811.675..2811.680 rows=10 loops=1)
-> HashAggregate (cost=102.50..112.50 rows=1000 width=4) (actual time=2811.673..2811.676 rows=10 loops=1)
Group Key: n
-> Seq Scan on demo (cost=0.00..100.00 rows=1000 width=4) (actual time=2.195..2463.991 rows=1000000 loops=1)
Planning Time: 0.064 ms
Execution Time: 2818.669 ms
Peak Memory Usage: 139813 kB
(7 rows)
The execution plan makes it clear that a Hash Aggregate was used.
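A minimal sketch of why a hash aggregate produces no useful order. This is toy Python with modulo as a stand-in hash function: group keys land in buckets, and results are emitted bucket by bucket, so the output order reflects the hash layout, not the key order.

```python
# Toy hash aggregate: modulo stands in for the real hash function
def hash_aggregate(keys, n_buckets=7):
    buckets = [[] for _ in range(n_buckets)]
    for k in keys:
        if k not in buckets[k % n_buckets]:   # deduplicate per bucket
            buckets[k % n_buckets].append(k)
    # emit bucket by bucket, not in key order
    return [k for b in buckets for k in b]

groups = hash_aggregate(range(20))
print(groups[:8])   # [0, 7, 14, 1, 8, 15, 2, 9] -- not sorted
```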
I can disable the hash aggregate and enforce a Sort Aggregate using pg_hint_plan. This extension is installed by default in YugabyteDB and is handy to restrict the setting to the scope of a single query, but you can also set the parameter at the session level:
yugabyte=> /*+ Set(enable_hashagg off) */
select n from demo group by n limit 10;
n
----
1
2
3
4
5
6
7
8
9
10
(10 rows)
yugabyte=> /*+ Set(enable_hashagg off) */
explain analyze
select n from demo group by n limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=149.83..149.88 rows=10 width=4) (actual time=2894.237..2894.244 rows=10 loops=1)
-> Group (cost=149.83..154.83 rows=1000 width=4) (actual time=2894.236..2894.240 rows=10 loops=1)
Group Key: n
-> Sort (cost=149.83..152.33 rows=1000 width=4) (actual time=2894.234..2894.235 rows=10 loops=1)
Sort Key: n
Sort Method: external merge Disk: 13800kB
-> Seq Scan on demo (cost=0.00..100.00 rows=1000 width=4) (actual time=2.306..2621.073 rows=1000000 loops=1)
Planning Time: 0.085 ms
Execution Time: 2896.456 ms
Peak Memory Usage: 6495 kB
(10 rows)
The ascending order of the result is merely a result of the grouping algorithm used. It's essential to remember that without an ORDER BY clause, you cannot depend on obtaining ordered results.
Do you believe that adding an ORDER BY clause will impact response time? Simply inspect the execution plan, and you'll observe that it remains unchanged. The query planner recognizes that the Group operation retains the order and doesn't require an additional sorting step:
yugabyte=> /*+ Set(enable_hashagg off) */
explain analyze
select n from demo group by n order by n limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=149.83..149.88 rows=10 width=4) (actual time=2870.805..2870.812 rows=10 loops=1)
-> Group (cost=149.83..154.83 rows=1000 width=4) (actual time=2870.804..2870.808 rows=10 loops=1)
Group Key: n
-> Sort (cost=149.83..152.33 rows=1000 width=4) (actual time=2870.802..2870.804 rows=10 loops=1)
Sort Key: n
Sort Method: external merge Disk: 13800kB
-> Seq Scan on demo (cost=0.00..100.00 rows=1000 width=4) (actual time=2.242..2603.396 rows=1000000 loops=1)
Planning Time: 0.089 ms
Execution Time: 2872.848 ms
Peak Memory Usage: 6495 kB
(10 rows)
To Summarize...
It's crucial to understand how rows are stored and fetched to optimize performance, and that's where the execution plan comes in. However, when executing application queries, remember that SQL is declarative, and the actual plan can vary. If you require a sorted result, explicitly declare it using ORDER BY, and the query planner will determine the appropriate steps. Any claim about an expected ordering without an ORDER BY is a mistake, by definition.