Troubleshooting a Slow Cluster with a Hot Node
Fabio Ghirardello
Posted on January 8, 2021
This blog walks through the process of troubleshooting a problematic CockroachDB cluster.
Understanding the Problem
Your cluster is experiencing high latency and CPU usage spikes on some nodes during a load test.
You need to lower latencies and improve CPU utilization to achieve higher throughput.
Take a look at the cluster to gather:
- the environment configuration: CPUs, MEM, Storage, Networking, node count, node location, CockroachDB version, etc.
- the database schemas.
- the SQL queries run as part of the load test.
The Environment
The UAT environment runs on 12 nodes across 4 AZs in 2 regions, US East and US West.
The cluster is on the latest CockroachDB version, on 4 vCPUs/16GB Mem instances with standard storage.
Verify that the latency is minimal (less than 1ms) within zones of the same region.
Database Schemas
Here's the schema
CREATE TABLE credits (
id INT2 NOT NULL,
code UUID NOT NULL,
channel STRING(1) NOT NULL,
pid INT4 NOT NULL,
end_date DATE NOT NULL,
status STRING(1) NOT NULL,
start_date DATE NOT NULL,
CONSTRAINT "primary" PRIMARY KEY (id ASC, code ASC),
INDEX credits_pid_idx (pid ASC),
INDEX credits_code_id_idx (code ASC, id ASC) STORING (channel, status, end_date, start_date),
FAMILY "primary" (id, code, channel, pid, end_date, status, start_date)
);
CREATE TABLE offers (
id INT4 NOT NULL,
code UUID NOT NULL,
token UUID NOT NULL,
start_date DATE,
end_date DATE,
CONSTRAINT "primary" PRIMARY KEY (id ASC, code ASC, token ASC),
INDEX offers_token_idx (token ASC),
FAMILY "primary" (id, code, token)
);
Check how the data is distributed into ranges
SHOW RANGES FROM TABLE credits;
start_key | end_key | range_id | range_size_mb | lease_holder | lease_holder_locality | replicas | replica_localities
--------------------------------------------------------+-------------------------------------------------------+----------+---------------+--------------+-------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------------------------
NULL | /1/"\x00\x05\x16\x80\xf7\xcbL䣵w\x81\x1a\x1d\xd6\xf6" | 38 | 0.000728 | 3 | cloud=gce,region=us-east1,zone=us-east1-b | {3,4,8} | {"cloud=gce,region=us-east1,zone=us-east1-b","cloud=gce,region=us-east1,zone=us-east1-c","cloud=gce,region=us-west1,zone=us-west1-b"}
/1/"\x00\x05\x16\x80\xf7\xcbL䣵w\x81\x1a\x1d\xd6\xf6" | /15/"\x15\xd3\xe7\xb3_\xa9A\"\xb5p\xa2\xf5\xb9Ba'" | 40 | 225.468243 | 3 | cloud=gce,region=us-east1,zone=us-east1-b | {3,4,10} | {"cloud=gce,region=us-east1,zone=us-east1-b","cloud=gce,region=us-east1,zone=us-east1-c","cloud=gce,region=us-west1,zone=us-west1-c"}
/15/"\x15\xd3\xe7\xb3_\xa9A\"\xb5p\xa2\xf5\xb9Ba'" | NULL | 41 | 222.721298 | 8 | cloud=gce,region=us-west1,zone=us-west1-b | {3,4,8} | {"cloud=gce,region=us-east1,zone=us-east1-b","cloud=gce,region=us-east1,zone=us-east1-c","cloud=gce,region=us-west1,zone=us-west1-b"}
Note that credits has 2 secondary indexes. Notice how the leaseholders of the ranges are spread across both regions (check the lease_holder_locality column).
SHOW RANGES FROM TABLE offers;
start_key | end_key | range_id | range_size_mb | lease_holder | lease_holder_locality | replicas | replica_localities
------------+---------+----------+---------------+--------------+-------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------------------------
NULL | NULL | 37 | 0 | 3 | cloud=gce,region=us-east1,zone=us-east1-b | {3,4,12} | {"cloud=gce,region=us-east1,zone=us-east1-b","cloud=gce,region=us-east1,zone=us-east1-c","cloud=gce,region=us-west1,zone=us-west1-c"}
(1 row)
Notice how table offers has 1 secondary index, and that the table is empty (range_size_mb is 0).
SQL Queries
Here is the SQL query
SELECT DISTINCT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c
WHERE c.status = 'A'
AND c.end_date >= '2020-11-20'
AND c.start_date <= '2020-11-20'
AND c.pid = '000000'
UNION
SELECT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c, offers AS o
WHERE c.id = o.id
AND c.code = o.code
AND c.status = 'A'
AND c.end_date >= '2020-11-20'
AND c.start_date <= '2020-11-20'
AND o.token = 'c744250a-1377-4cdf-a1f4-5b85a4d29aaa';
Run the load test
You start the app load test.
While it runs, check the Metrics in the DB Console. Open the Hardware dashboard to see if you can replicate the spike in high CPU usage.
Notice how 2 nodes have very high CPU usage compared to all the other nodes. In the Summary panel, also note the values for QPS (4,046) and P99 latency (402ms).
Check the Service Latency charts in the SQL dashboard for a better understanding.
Stop the workload: the load test indeed reproduces the high CPU spikes and the high latency.
Analyze the Queries
Let's break the query down into its 2 parts, and pull the query plan for the 1st part. The value 000000 is a placeholder for a value passed by the application.
EXPLAIN (VERBOSE) SELECT DISTINCT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c
WHERE c.status = 'A'
AND c.end_date >= '2020-11-20'
AND c.start_date <= '2020-11-20'
AND c.pid = '000000';
tree | field | description | columns | ordering
-----------------------+---------------------+----------------------------------------------------------------------------------+--------------------------------------------------------+-----------
| distribution | local | |
| vectorized | false | |
project | | | (id, code, channel, status, end_date, start_date) |
│ | estimated row count | 0 | |
└── filter | | | (id, code, channel, pid, end_date, status, start_date) |
│ | estimated row count | 0 | |
│ | filter | ((status = 'A') AND (end_date >= '2020-11-20')) AND (start_date <= '2020-11-20') | |
└── index join | | | (id, code, channel, pid, end_date, status, start_date) |
│ | estimated row count | 0 | |
│ | table | credits@primary | |
│ | key columns | id, code | |
└── scan | | | (id, code, pid) |
| estimated row count | 0 | |
| table | credits@credits_pid_idx | |
| spans | /0-/1 | |
(15 rows)
Time: 77ms total (execution 77ms / network 0ms)
So the optimizer is leveraging index credits@credits_pid_idx to filter rows that have that specific pid, but then it has to do a join with primary to fetch status, end_date and start_date to satisfy the rest of the WHERE and SELECT clauses.
Wouldn’t it be better if it didn’t have to do this join and instead access a single index?
Let's now pull the plan for the second part
EXPLAIN (VERBOSE) SELECT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c, offers AS o
WHERE c.id = o.id
AND c.code = o.code
AND c.status = 'A'
AND c.end_date >= '2020-11-20'
AND c.start_date <= '2020-11-20'
AND o.token = 'c744250a-1377-4cdf-a1f4-5b85a4d29aaa';
tree | field | description | columns | ordering
---------------------------+-----------------------+-------------------------------------------------------------------------------------+--------------------------------------------------------------------+-----------
| distribution | full | |
| vectorized | false | |
project | | | (id, code, channel, status, end_date, start_date) |
│ | estimated row count | 0 | |
└── lookup join (inner) | | | (id, code, token, id, code, channel, end_date, status, start_date) |
│ | estimated row count | 0 | |
│ | table | credits@credits_code_id_idx | |
│ | equality | (code, id) = (code,id) | |
│ | equality cols are key | | |
│ | pred | ((status = 'A') AND (end_date >= '2020-11-20')) AND (start_date <= '2020-11-20') | |
└── scan | | | (id, code, token) |
| estimated row count | 1 | |
| table | offers@offers_token_idx | |
| spans | /"\xc7D%\n\x13wLߡ\xf4[\x85\xa4Қ\xaa"-/"\xc7D%\n\x13wLߡ\xf4[\x85\xa4Қ\xaa"/PrefixEnd | |
Here we see that the optimizer is choosing an index to filter from the offers table and join with credits, which is fine.
Addressing the Hotspot
Let's tackle the high CPU usage issue first. Why is one node, n3 in this case, using all the CPU?
You restart your workload and execute hot.py to find the hottest ranges.
$ python3 hot.py --numtop 10 --host myhost.mydomain.com --adminport 26258 --dbport 26257
rank rangeId QPS Nodes leaseHolder DBname, TableName, IndexName
1: 37 2006.722472 [4, 3, 12] 3 ['defaultdb', 'offers', '']
2: 39 857.900812 [5, 2, 8] 5 ['defaultdb', 'credits', 'credits_pid_idx']
3: 6 48.688882 [1, 6, 3, 11, 9] 9 ['', '', '']
4: 26 17.644921 [1, 4, 11, 12, 7] 4 ['system', 'namespace2', '']
5: 4 15.409775 [1, 9, 12] 12 ['', '', '']
6: 35 12.764951 [4, 3, 9, 8, 10] 10 ['system', 'sqlliveness', '']
7: 11 3.369730 [6, 2, 9, 12, 10] 2 ['system', 'jobs', '']
8: 2 2.697347 [1, 5, 3, 12, 7] 12 ['', '', '']
9: 3 1.976292 [6, 4, 3, 11, 7] 6 ['', '', '']
10: 7 1.373621 [1, 2, 4, 12, 7] 4 ['system', 'lease', '']
So it looks like Range ID 37 on n3 is hot. What's in that range, why that range?
Back to your SQL terminal, show the ranges for offers@offers_token_idx, since the query plan showed it's using this index:
SHOW RANGES FROM INDEX offers@offers_token_idx;
start_key | end_key | range_id | range_size_mb | lease_holder | lease_holder_locality | replicas | replica_localities
------------+---------+----------+---------------+--------------+-------------------------------------------+----------+----------------------------------------------------------------------------------------------------------------------------------------
NULL | NULL | 37 | 0 | 3 | cloud=gce,region=us-east1,zone=us-east1-b | {3,4,12} | {"cloud=gce,region=us-east1,zone=us-east1-b","cloud=gce,region=us-east1,zone=us-east1-c","cloud=gce,region=us-west1,zone=us-west1-c"}
(1 row)
Time: 2.517s total (execution 2.516s / network 0.000s)
Bingo! We found Range ID 37.
As offers is empty, so is index offers@offers_token_idx, and thus there is just one range for that table. Any join against the table inevitably has to access that single range on its one leaseholder, causing the hotspot.
We need to ask our DBA:
- Why is a join operation sent against an empty table?
- Why is the table empty?
There are 4 possible ways to resolve a hot range situation:
- Add data to the table, so that more ranges are created. You can insert some junk data, or keep old, obsolete data, so that the table holds enough data to split into multiple ranges.
- Use the Follower Reads pattern, where you relax the requirement to always go to the leaseholder to access the data and instead can read from any of the replicas. Range replicas are spread across the cluster, balancing the load across other nodes.
- Use the Duplicate Indexes pattern, where you duplicate the table/index and thus its ranges, resulting in multiple ranges that can serve the data.
- Use the CONFIGURE ZONE command to decrease the range size from the default 512MB to something smaller, 64MB for example. This creates many small, scattered ranges that help balance the load.
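To illustrate that last option: range sizes are set in bytes through a zone configuration. A minimal sketch, with illustrative values (64MiB max, 16MiB min), not a recommendation; the exact minimum allowed depends on the CockroachDB version:

```sql
-- shrink the ranges of the offers table to ~64MiB each;
-- values are illustrative only
ALTER TABLE offers CONFIGURE ZONE USING
    range_min_bytes = 16777216,  -- 16 MiB
    range_max_bytes = 67108864;  -- 64 MiB
```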
In this case, you add a lot of data to the table. As a result, 4 ranges are created, and these will be spread across the nodes.
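The data load itself can be sketched with synthetic rows; the row count and column values below are illustrative placeholders, not the actual application dataset:

```sql
-- bulk-insert synthetic rows so the table grows large enough
-- to split into multiple ranges
INSERT INTO offers (id, code, token, start_date, end_date)
SELECT g.i::INT4,
       gen_random_uuid(),
       gen_random_uuid(),
       '2020-01-01'::DATE,
       '2021-12-31'::DATE
FROM generate_series(1, 1000000) AS g (i);
```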
Re-run the workload, then check the Hardware dashboard again
Much better, good job! Let's see how the ranges for the index are spread out:
SELECT lease_holder, lease_holder_locality
FROM [
SHOW RANGES
FROM INDEX offers@offers_token_idx
];
lease_holder | lease_holder_locality
---------------+--------------------------------------------
5 | cloud=gce,region=us-east1,zone=us-east1-c
4 | cloud=gce,region=us-east1,zone=us-east1-c
3 | cloud=gce,region=us-east1,zone=us-east1-b
2 | cloud=gce,region=us-east1,zone=us-east1-b
Better! On average we can expect the load to be spread across 4 ranges in 4 different nodes.
Addressing the Latency
Understanding where the latency comes from
On the SQL terminal, you run a few queries to verify the Response Time.
Show my locality first
SHOW LOCALITY;
locality
---------------------------------------------
cloud=gce,region=us-east1,zone=us-east1-b
(1 row)
Time: 2ms total (execution 1ms / network 0ms)
Ok, I'm in US East. Let's run the first part of the query using a randomly picked, valid c.pid.
SELECT DISTINCT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c
WHERE c.status = 'A'
AND c.end_date >= '2020-11-20'
AND c.start_date <= '2020-11-20'
AND c.pid = '1109619';
id | code | channel | status | end_date | start_date
-----+--------------------------------------+---------+--------+---------------------------+----------------------------
9 | f99e6553-18fb-475b-910e-eae4287e7ffa | O | A | 2020-12-19 00:00:00+00:00 | 2020-05-04 00:00:00+00:00
(1 row)
Time: 67ms total (execution 67ms / network 0ms)
The response time is 67ms, a little too high. Why is that? Let's check where the range that holds this row is located.
From the query plan we pulled above, we see that it's using index credits_pid_idx. Find the key of the index:
SHOW INDEX FROM credits;
table_name | index_name | non_unique | seq_in_index | column_name | direction | storing | implicit
-------------+---------------------+------------+--------------+-------------+-----------+---------+-----------
[...]
credits | credits_pid_idx | true | 1 | pid | ASC | false | false
credits | credits_pid_idx | true | 2 | id | ASC | false | true
credits | credits_pid_idx | true | 3 | code | ASC | false | true
[...]
Cool, for credits@credits_pid_idx the key is (pid, id, code).
Let's pull the correct range
SELECT lease_holder_locality
FROM [
SHOW RANGE FROM INDEX credits@credits_pid_idx
FOR ROW(1109619, 9, 'f99e6553-18fb-475b-910e-eae4287e7ffa')
];
lease_holder_locality
---------------------------------------------
cloud=gce,region=us-east1,zone=us-east1-b
Ok, the range is local (us-east1), so this should only take about 1ms. From the query plan we see that there is a join with credits@primary to fetch the other columns.
Let's see how long that takes
SELECT *
FROM credits@primary
WHERE id = 9 AND code = 'f99e6553-18fb-475b-910e-eae4287e7ffa';
id | code | channel | pid | end_date | status | start_date
-----+--------------------------------------+---------+---------+---------------------------+--------+----------------------------
9 | f99e6553-18fb-475b-910e-eae4287e7ffa | O | 1109619 | 2020-12-19 00:00:00+00:00 | A | 2020-05-04 00:00:00+00:00
(1 row)
Time: 66ms total (execution 66ms / network 0ms)
66ms! Let's do the same exercise as before and find out where this range is located.
SELECT lease_holder_locality
FROM [
SHOW RANGE
FROM TABLE credits
FOR ROW(9, 'f99e6553-18fb-475b-910e-eae4287e7ffa')
];
lease_holder_locality
-------------------------------------------
cloud=gce,region=us-west1,zone=us-west1-c
A-ha! This range is in US West, so we're paying the latency price of a cross-region read to fetch the data.
The problem is twofold: sub-optimal tables/indexes and cross-regional reads.
Part 1 - Optimize the table primary and secondary indexes
Let's optimize the top part of the query by removing the need to join with credits@primary.
We need to create an index similar to credits@credits_pid_idx that stores the fields required by the query.
Also, index credits@credits_code_id_idx seems to be redundant, so we drop it.
DROP INDEX credits@credits_code_id_idx;
DROP INDEX credits@credits_pid_idx;
CREATE INDEX credits_pid_idx
ON credits(pid ASC)
STORING (channel, end_date, status, start_date);
Pull the query plan to confirm no join is required
EXPLAIN (VERBOSE) SELECT DISTINCT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c
WHERE c.status = 'A'
AND c.end_date >= '2020-11-20'
AND c.start_date <= '2020-11-20'
AND c.pid = '12';
tree | field | description | columns | ordering
-----------------+---------------------+----------------------------------------------------------------------------------+--------------------------------------------------------+-----------
| distribution | local | |
| vectorized | false | |
project | | | (id, code, channel, status, end_date, start_date) |
│ | estimated row count | 0 | |
└── filter | | | (id, code, channel, pid, end_date, status, start_date) |
│ | estimated row count | 0 | |
│ | filter | ((status = 'A') AND (end_date >= '2020-11-20')) AND (start_date <= '2020-11-20') | |
└── scan | | | (id, code, channel, pid, end_date, status, start_date) |
| estimated row count | 0 | |
| table | credits@credits_pid_idx | |
| spans | /12-/13 | |
Good stuff, we eliminated a join operation!
As for the second part, we see that the optimizer never uses index offers@primary. Let's alter the primary key of that table to make it similar to index offers@offers_token_idx.
DROP INDEX offers_token_idx;
-- this will take some time as we're basically rewriting the entire table
BEGIN;
ALTER TABLE offers DROP CONSTRAINT "primary";
ALTER TABLE offers ADD CONSTRAINT "primary" PRIMARY KEY (token, id, code);
COMMIT;
Check the docs for how to change the primary key while still keeping the database online.
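As a pointer: since v20.1 CockroachDB also offers a single-statement, online primary key change, which avoids the explicit DROP/ADD transaction above. A sketch of the equivalent statement:

```sql
-- online primary key change; the new primary index is built
-- in the background while the table stays available
ALTER TABLE offers ALTER PRIMARY KEY USING COLUMNS (token ASC, id ASC, code ASC);
```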
Review the schema after these changes.
SHOW CREATE TABLE offers;
table_name | create_statement
-------------+----------------------------------------------------------------------
offers | CREATE TABLE public.offers (
| id INT4 NOT NULL,
| code UUID NOT NULL,
| token UUID NOT NULL,
| start_date DATE NULL,
| end_date DATE NULL,
| CONSTRAINT "primary" PRIMARY KEY (token ASC, id ASC, code ASC),
| FAMILY "primary" (id, code, token, start_date, end_date)
| )
Awesome, much more efficient!
Part 2 - Create duplicate indexes and pin to region
Now that we have our tables well organized, we need to resolve the latency issue.
We need our read latency to be the same regardless of where the query originates.
Implementing the Follower Reads pattern requires an application change, so while ideal, it is unfortunately not an option here.
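For completeness, that application change would look something like the sketch below: the query reads slightly stale (but consistent) data, which lets the gateway serve it from the nearest replica. Note that follower_read_timestamp() requires an enterprise license:

```sql
-- serve the read from the closest replica at a slightly
-- stale timestamp instead of going to the leaseholder
SELECT id, code, token
FROM offers AS OF SYSTEM TIME follower_read_timestamp()
WHERE token = 'c744250a-1377-4cdf-a1f4-5b85a4d29aaa';
```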
The best solution thus is to follow the Duplicate Index pattern: we create a copy of each index and table.
Then, we pin a copy to US West and another to US East.
We create the indexes first
-- copy of credits@primary
CREATE INDEX primary_copy
ON credits(id ASC, code ASC)
STORING (channel, pid, end_date, status, start_date);
-- copy of credits_pid_idx
CREATE INDEX credits_pid_idx_copy
ON credits(pid ASC)
STORING (channel, end_date, status, start_date);
-- copy of offers@primary
CREATE INDEX primary_copy
ON offers(token ASC, id ASC, code ASC)
STORING (start_date, end_date);
Excellent, we now have a copy of each index (primary included)!
Next, pin a copy to East, and another to West.
-- credits
-- pin to East
ALTER TABLE credits CONFIGURE ZONE USING
num_replicas = 3,
constraints = '{+region=us-east1: 1}',
lease_preferences = '[[+region=us-east1]]';
ALTER INDEX credits@credits_pid_idx CONFIGURE ZONE USING
num_replicas = 3,
constraints = '{+region=us-east1: 1}',
lease_preferences = '[[+region=us-east1]]';
-- pin to West
ALTER INDEX credits@primary_copy CONFIGURE ZONE USING
num_replicas = 3,
constraints = '{+region=us-west1: 1}',
lease_preferences = '[[+region=us-west1]]';
ALTER INDEX credits@credits_pid_idx_copy CONFIGURE ZONE USING
num_replicas = 3,
constraints = '{+region=us-west1: 1}',
lease_preferences = '[[+region=us-west1]]';
-- offers
-- pin to East
ALTER TABLE offers CONFIGURE ZONE USING
num_replicas = 3,
constraints = '{+region=us-east1: 1}',
lease_preferences = '[[+region=us-east1]]';
-- pin to West
ALTER INDEX offers@primary_copy CONFIGURE ZONE USING
num_replicas = 3,
constraints = '{+region=us-west1: 1}',
lease_preferences = '[[+region=us-west1]]';
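To double-check that the pinning was applied as intended, you can inspect any of the zone configurations; for example:

```sql
-- show the effective zone configuration for the West copy
SHOW ZONE CONFIGURATION FOR INDEX offers@primary_copy;
```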
Part 3 - Validate the theory
Re-run the first part of the query from both regions, and check the query plan using EXPLAIN (VERBOSE).
From n1 (US East region):
SELECT DISTINCT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c
WHERE c.status = 'A'
AND c.end_date >= '2020-11-20'
AND c.start_date <= '2020-11-20'
AND c.pid = '12';
id | code | channel | status | end_date | start_date
-----+--------------------------------------+---------+--------+---------------------------+----------------------------
19 | 468750f4-cb58-4707-9fd3-bd5f99111855 | O | A | 2020-12-18 00:00:00+00:00 | 2020-09-21 00:00:00+00:00
(1 row)
Time: 1ms total (execution 1ms / network 0ms)
EXPLAIN (VERBOSE) SELECT DISTINCT c.id, c.code, c.channel, c.status, c.end_date, c.start_date
FROM credits AS c
WHERE c.status = 'A'
AND c.end_date >= '2020-11-20'
AND c.start_date <= '2020-11-20'
AND c.pid = '12';
tree | field | description | columns | ordering
-----------------+---------------------+---------------------------------------------------------------------------------------+--------------------------------------------------------+-----------
| distribution | local | |
| vectorized | false | |
project | | | (id, code, channel, status, end_date, start_date) |
│ | estimated row count | 0 | |
└── filter | | | (id, code, channel, pid, end_date, status, start_date) |
│ | estimated row count | 0 | |
│ | filter | ((status = 'A') AND (end_date >= '2020-11-20')) AND (start_date <= '2020-11-20') | |
└── scan | | | (id, code, channel, pid, end_date, status, start_date) |
| estimated row count | 0 | |
| table | credits@credits_pid_idx | |
                 | spans               | /3124791208-/3124791209                                                               |                                                        |
Same queries above run on n12 (US West region):
id | code | channel | status | end_date | start_date
-----+--------------------------------------+---------+--------+---------------------------+----------------------------
19 | 468750f4-cb58-4707-9fd3-bd5f99111855 | O | A | 2020-12-18 00:00:00+00:00 | 2020-09-21 00:00:00+00:00
(1 row)
Time: 2ms total (execution 1ms / network 0ms)
tree | field | description | columns | ordering
-----------------+---------------------+---------------------------------------------------------------------------------------+--------------------------------------------------------+-----------
| distribution | local | |
| vectorized | false | |
project | | | (id, code, channel, status, end_date, start_date) |
│ | estimated row count | 1 | |
└── filter | | | (id, code, channel, pid, end_date, status, start_date) |
│ | estimated row count | 1 | |
│ | filter | ((status = 'A') AND (end_date >= '2020-11-19')) AND (start_date <= '2020-11-19') | |
└── scan | | | (id, code, channel, pid, end_date, status, start_date) |
| estimated row count | 2 | |
| table | credits@credits_pid_idx_copy | |
| spans | /3124791208-/3124791209 | |
Perfect, we have low latency from both regions! Now start the workload again and let's measure the overall latency.
Compare to the initial result: a huge improvement in performance! We doubled the QPS and halved the latency!
Congratulations, your cluster is performing much better now!
We hope you now have a better understanding of the process of troubleshooting an underperforming cluster.
References
Official CockroachDB Documentation
- Troubleshooting Overview
- Troubleshooting Slow Query
- Troubleshooting Hot Node/Range
- DB Console Overview
- DB Console Hardware Dashboard
- DB Console SQL Dashboard