Hello,
This issue has been created to track down a problem I'm having with Mercure:
Mercure uses almost 100% CPU, no matter what specs you have.
The reason is probably the large number of subscribers and the volume of updates published to them:
# TYPE mercure_subscribers gauge
mercure_subscribers 4299
# HELP mercure_subscribers_total Total number of handled subscribers
# TYPE mercure_subscribers_total counter
mercure_subscribers_total 2.939146e+06
# HELP mercure_updates_total Total number of handled updates
# TYPE mercure_updates_total counter
mercure_updates_total 7.891176e+06
These numbers cover just two days, and this is already a reduced load: normally we publish up to 4x as many updates, which are currently disabled.
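As a rough back-of-envelope check (assuming the counters above cover the full two days):

7,891,176 updates / 172,800 s ≈ 46 updates per second
46 updates/s × 4,299 concurrent subscribers ≈ 200,000 per-subscriber dispatch checks per second

So even a small fixed cost per subscriber/update pair adds up to a lot of CPU.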
Here is the CPU debug profile:
File: caddy
Type: cpu
Time: Aug 26, 2021 at 6:53am
Duration: 300.16s, Total samples = 461.92s (153.89%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 231.78s, 50.18% of 461.92s total
Dropped 1620 nodes (cum <= 2.31s)
Showing top 10 nodes out of 141
flat flat% sum% cum cum%
55.77s 12.07% 12.07% 191.39s 41.43% runtime.selectgo
41.19s 8.92% 20.99% 52.73s 11.42% runtime.sellock
24.36s 5.27% 26.26% 24.40s 5.28% runtime.(*waitq).dequeue (inline)
22.55s 4.88% 31.15% 114.28s 24.74% github.com/dunglas/mercure.(*Subscriber).CanDispatch
18.83s 4.08% 35.22% 18.83s 4.08% go.uber.org/zap.(*Logger).check
16.90s 3.66% 38.88% 17.25s 3.73% runtime.unlock2
14.31s 3.10% 41.98% 32.20s 6.97% github.com/dunglas/mercure.canReceive
13.21s 2.86% 44.84% 13.21s 2.86% runtime.empty
12.48s 2.70% 47.54% 44.65s 9.67% runtime.schedule
12.18s 2.64% 50.18% 12.64s 2.74% runtime.gopark
Raw profile: pprof.caddy.samples.cpu.006.pb.gz
Here is the image of it:
The problematic lines are:
https://github.com/dunglas/mercure/blob/ca2df66f27356aa2511ef4f959fc4156a0f69a94/subscriber.go#L68-L102
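To make the hot path easier to discuss, here is a minimal sketch of the pattern I understand those lines to implement (my own simplification, not the actual Mercure code; the types and fields below are only illustrative): every published update is offered to every open connection through a select on that subscriber's channels, so with ~4,300 subscribers each publish goes through thousands of selects, which matches the runtime.selectgo / runtime.sellock time in the profile above.

package main

import "fmt"

// Update is a published Mercure update (simplified).
type Update struct {
    Topics []string
    Data   string
}

// Subscriber holds one open SSE connection (simplified).
type Subscriber struct {
    out          chan *Update  // updates waiting to be written to the connection
    disconnected chan struct{} // closed when the client goes away
}

// Dispatch offers the update to the subscriber; it returns false once the
// subscriber has disconnected. This select runs once per subscriber for every
// published update, which is why runtime.selectgo dominates the profile when
// thousands of connections are open.
func (s *Subscriber) Dispatch(u *Update) bool {
    select {
    case <-s.disconnected:
        return false
    case s.out <- u:
        return true
    }
}

func main() {
    subscribers := make([]*Subscriber, 4300)
    for i := range subscribers {
        subscribers[i] = &Subscriber{
            out:          make(chan *Update, 1),
            disconnected: make(chan struct{}),
        }
    }
    u := &Update{Topics: []string{"/users/613193cee6be0e1016042334"}, Data: "{}"}
    // One publish means one select per open subscriber.
    for _, s := range subscribers {
        s.Dispatch(u)
    }
    fmt.Println("dispatched to", len(subscribers), "subscribers")
}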
At first, we thought the problem was with zap.(*Logger).check,
but PR https://github.com/dunglas/mercure/pull/554 should fix that.
Anyway, after completely removing Zap logging, the situation didn't improve; instead, the profile showed even more time spent in runtime.selectgo.
Here is the CPU debug profile after that change:
File: caddy
Type: cpu
Time: Aug 31, 2021 at 12:27pm
Duration: 300.10s, Total samples = 568.67s (189.49%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 347.87s, 61.17% of 568.67s total
Dropped 1614 nodes (cum <= 2.84s)
Showing top 10 nodes out of 116
flat flat% sum% cum cum%
76.58s 13.47% 13.47% 263.92s 46.41% runtime.selectgo
50.10s 8.81% 22.28% 50.28s 8.84% runtime.(*waitq).dequeue (inline)
47.13s 8.29% 30.56% 59.44s 10.45% runtime.sellock
29.29s 5.15% 35.71% 29.29s 5.15% memeqbody
29.05s 5.11% 40.82% 86.37s 15.19% github.com/dunglas/mercure.(*Subscriber).CanDispatch
28.25s 4.97% 45.79% 166.05s 29.20% github.com/dunglas/mercure.(*Subscriber).Dispatch
25.76s 4.53% 50.32% 57.32s 10.08% github.com/dunglas/mercure.canReceive
25.14s 4.42% 54.74% 25.14s 4.42% runtime.empty (inline)
18.58s 3.27% 58.01% 19.21s 3.38% runtime.gopark
17.99s 3.16% 61.17% 17.99s 3.16% runtime.(*gQueue).pop
Here is the image of it:
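The memeqbody and canReceive entries suggest a second cost on the same hot path: matching each update's topics against every subscriber's topic selectors by plain string comparison. A minimal sketch of that kind of matching (again my own simplification; the real mercure.canReceive also handles URI templates, which I'm leaving out):

package main

import "fmt"

// canReceive reports whether an update published to updateTopics matches one
// of a subscriber's topic selectors. With plain string equality, the byte
// comparison itself shows up as memeqbody in the profile, and it runs for
// every subscriber on every publish.
func canReceive(updateTopics, subscriberTopics []string) bool {
    for _, ut := range updateTopics {
        for _, st := range subscriberTopics {
            if ut == st {
                return true
            }
        }
    }
    return false
}

func main() {
    updateTopics := []string{"/users/613193cee6be0e1016042334"}
    subscriberTopics := []string{"/users/613193cee6be0e1016042334"}
    fmt.Println(canReceive(updateTopics, subscriberTopics))
}

With 4,299 subscribers and roughly 46 updates per second, this inner loop runs constantly, which would explain why removing the logging only shifted the profile towards the channel and comparison work.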
Sorry, I can't write a benchmark for this, but I'm ready to test whatever changes are needed.
The data posted to Mercure is always published to different topics.
Posted to topics:
/users/613193cee6be0e1016042334
Private: true
Data:
{
"@context": "/contexts/User",
"@id": "/users/613193cee6be0e1016042334",
"@type": "User",
"id": "613193cee6be0e1016042334",
"quota": 40000,
"used": 7186,
"isDisabled": false,
"isDeleted": false,
"createdAt": "2021-09-03T05:17:34+02:00",
"updatedAt": "2021-09-03T05:18:29+02:00"
}
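For reference, an update like the one above is published to the hub with the standard Mercure publish request: a form-encoded POST with topic, data, and private fields plus a publisher JWT. A minimal Go sketch, where the hub URL and the JWT are placeholders rather than our real values:

package main

import (
    "fmt"
    "net/http"
    "net/url"
    "strings"
)

func main() {
    // Placeholders: replace with the real hub endpoint and publisher JWT.
    hubURL := "https://example.com/.well-known/mercure"
    jwt := "<publisher JWT>"

    form := url.Values{}
    form.Set("topic", "/users/613193cee6be0e1016042334")
    form.Set("private", "on")
    form.Set("data", `{"@context":"/contexts/User","@id":"/users/613193cee6be0e1016042334","@type":"User","id":"613193cee6be0e1016042334","quota":40000,"used":7186,"isDisabled":false,"isDeleted":false,"createdAt":"2021-09-03T05:17:34+02:00","updatedAt":"2021-09-03T05:18:29+02:00"}`)

    req, err := http.NewRequest(http.MethodPost, hubURL, strings.NewReader(form.Encode()))
    if err != nil {
        panic(err)
    }
    req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
    req.Header.Set("Authorization", "Bearer "+jwt)

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("hub responded with", resp.Status)
}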
Thanks!