Ingvar
Posted on July 22, 2020
Hello everyone, in this post I'm going to write about a garbage collection freeze issue in .NET Core 3 that we found in our production environment. This post is aimed mostly at dotnet experts. I doubt this issue can be reproduced on small projects, so it will be most interesting for developers working on enterprise-scale projects.
Introduction
After the migration to .NET Core described in the previous post we finally got our production running on Linux. We had a lot of Linux machines with our services and they worked pretty well. But after some time we faced an issue: every now and then one of our services would stop working on a few machines at random. It started to happen pretty often (~20 times per day), so we decided to investigate.
Initial look on issue
Initially we thought that our services were crashing on these machines. When it happened again we connected to one of them via ssh and it turned out that the dotnet process was still alive. Yet none of our requests (not even the health check) worked, because the process was stuck and didn't respond to incoming requests.
We created a dump from the stuck process. Dotnet provides tools for creating and analyzing dumps, such as dotnet-dump. The command we used for dump creation:
createdump <pid> -u -f <dump_file>
Linux dumps can be analyzed on Linux only. I prefer to use dotnet-dump itself for analysis, but you can consider lldb if you need more details. Keep in mind that it can be hard to find the exact lldb version that works with a specific .NET Core version and the SOS plugin. For analysis I use a Docker container with the dump directory mounted into it.
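For reference, here is a rough sketch of that setup (the image tag, paths and dump file name are placeholders, not the exact ones we used):
# start a container with the dump directory mounted
docker run -it --rm -v "$(pwd):/dumps" mcr.microsoft.com/dotnet/core/sdk:3.1 bash
# inside the container: install the analyzer and open the dump
dotnet tool install --global dotnet-dump
export PATH="$PATH:/root/.dotnet/tools"
dotnet-dump analyze /dumps/<dump_file>
Once the interactive session is open you can run the usual SOS-style commands (clrthreads, clrstack, dumpheap and so on).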
Dump investigation
We collected a few dumps from stuck machines and started to investigate them. Grabbing a dump was a bit problematic because we had automatic restarts for non-responding services. It also takes some time, which is not always acceptable under high load in production. Note that we didn't use gcdump because it triggers a full GC and doesn't work on stuck processes like ours. Even for healthy processes it sometimes fails (especially if the process uses a lot of RAM).
For dump analysis I used the dotnet-dump tool itself. dotnet-dump is a kind of old-school SOS analyzer: it uses the same commands, but it supports .NET Core and multiple platforms. This is how I checked it:
1) Initially I thought we had a deadlock somewhere in the code. So I used the clrthreads command, which showed me the list of all CLR threads. After that I used clrstack -all to print the stack traces of all threads. I checked a few dumps and most of them had completely different stack traces! That wasn't the result I expected: no sign of a deadlock or a similar issue.
Example of clrthreads output
2) I also checked the objects in memory. I used dumpheap for this purpose. For example, dumpheap -stat -min 10000 showed me per-type statistics for objects larger than 10,000 bytes. It's the fastest way to detect large objects. I found some large objects (2-4 MB each) but nothing enormous.
3) The dump had some signs of an upcoming GC: objects that were ready to be collected, GCFrame entries in the stack traces. But what if it was just a coincidence?
Basic performance metrics
In order to check that GC is the issue and to look back in past before freeze I added more metrics. For this purpose I used Prometheus. It was used because it allows to grab metrics with different intervals w/o any need to send batches of metrics somewhere because it uses pull model instead of push. Also dotnet version has a lot of good built-in metrics that I used for investigation. Later in one of my future posts I will explain details of using Prometheus in dotnet apps.
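For reference, a minimal setup with the prometheus-net.AspNetCore package looks roughly like this (the Gen 2 gauge here is purely an illustration and its metric name is made up):
using System;
using Microsoft.AspNetCore.Builder;
using Prometheus;

public class Startup
{
    // example gauge tracking full (Gen 2) collections; metric name is illustrative
    private static readonly Gauge Gen2Collections =
        Metrics.CreateGauge("dotnet_gen2_collections_total", "Number of Gen 2 garbage collections");

    public void Configure(IApplicationBuilder app)
    {
        app.UseMetricServer();   // exposes /metrics for Prometheus to scrape
        app.UseHttpMetrics();    // built-in HTTP request metrics

        // refresh the gauge right before each scrape
        Metrics.DefaultRegistry.AddBeforeCollectCallback(() =>
            Gen2Collections.Set(GC.CollectionCount(2)));
    }
}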
I collected a lot of metrics from production with a 1-second scrape interval. The metrics showed me the following:
1) The interval between full garbage collections was too short. Every few seconds the GC collected garbage in all generations! Obviously something had gone wrong.
Every time memory drops after GC to its minimal value, which means a full GC happens each time
2) A lot of Gen 2 objects were created during the app's lifetime. Most of them were obviously objects from the LOH. It looked weird because I didn't expect us to be creating so many large objects.
3) Each time before a freeze the memory stats showed growth or were even at a peak, like in the image above. GC started a collection at approximately the same memory usage levels, so I considered this good evidence that GC was the reason for our issue.
Improving performance
The freezing GC seems to be a bug in the GC itself, but what exactly triggers it? My assumption was large objects that force a full GC every time. I opened a dump again and checked the largest objects there. This is what I found:
1) Large strings - 4 MB in UTF-16.
2) Large byte arrays - 2 MB each.
These objects looked suspicious, so I decided to check them. It turned out that we used a two-level cache for large objects that were accessed pretty often. The first level was an in-memory cache (IMemoryCache) and the second one was a compressed cache in the database. These objects were shared: they contained some configuration for users, and different users could have the same configuration, so we also used an additional intermediate entity. It looked like this:
public class UserSettings
{
public string UserId { get; set; }
public string SettingsHash { get; set; }
}
SettingsHash is a hash of the large settings object that was used as the primary key for looking up settings in the cache and the database. So first we retrieved this hash value from the cache by user id and then loaded the settings by that hash. This way of caching helped us reuse our large objects, but it's a bit complicated. As you know, complicated code has more bugs - and that is exactly what happened to us.
I started to play with our caching code. I asked myself the following questions:
1) Does it work at all, or is the cache miss rate too high?
2) If it works, does it need additional configuration, for example TTL tuning?
3) Is there a way to simplify or improve it?
4) We compressed the settings in the database using LZ4 - I decided to check that we were compressing everything properly too
5) Is our database cache too slow? It required loading a large byte array, decompressing it into a string and deserializing it - that sounds like a good reason to be slow
I added metrics for the cache so I could get exact cache stats. I used Prometheus for this as before; it has nice syntax for such checks. I also measured the speed of our (de)serialization and (de)compression operations with it. Here is an example of the neat Prometheus syntax:
using (_histogram.NewTimer())
{
// some code here
}
This example tracks the execution time of the code inside the using block and allows you to build a histogram from it later. Looks easy, right? No need for a Stopwatch or similar.
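For completeness, the _histogram used above can be declared like this (the class name, metric name and help text are just examples):
using Prometheus;

public class CachedSettingsProvider
{
    // histogram used with NewTimer() above; metric name and help text are illustrative
    private static readonly Histogram _histogram = Metrics.CreateHistogram(
        "cache_operation_duration_seconds",
        "Duration of cache lookup and (de)serialization operations");
}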
I checked our compression code. The code itself was almost fine, but it was missing a few using blocks - it created memory streams without calling Dispose. OK, that's not such a serious issue. But after looking closer I found a much more serious one. Before compression the object was converted to JSON, but the serialization settings were incorrect and the object was saved with all its formatting! So after serialization and compression it was even larger than after proper serialization without formatting. I fixed that and compression started to work as expected.
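Here is a minimal sketch of the fixed path; Formatting.None drops the indentation, every stream gets a using, and GZipStream stands in for the LZ4 stream we actually used (the helper name is made up):
using System.IO;
using System.IO.Compression;
using System.Text;
using Newtonsoft.Json;

public static class SettingsCompressor
{
    public static byte[] Compress(object settings)
    {
        // serialize without indentation - this was the main fix
        var json = JsonConvert.SerializeObject(settings, Formatting.None);
        var bytes = Encoding.UTF8.GetBytes(json);

        // wrap the streams in using so they are disposed (the second fix)
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionLevel.Fastest))
            {
                gzip.Write(bytes, 0, bytes.Length);
            }
            return output.ToArray();
        }
    }
}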
I wasn't able to improve the TTLs for our cache because they were already pretty good. Large objects had a long TTL and small objects were recreated much more often. Large objects change less frequently, so that's reasonable. Looks good, let's move on.
The last thing I checked was the caching logic itself. We used the following algorithm:
1) Check for the UserSettings object in the local cache
2) If it's in the cache, return it and look for the large settings object in the local cache or the db
3) If it's not in the cache, load it from the db (or calculate it) and then load the settings from the db
See the bug? If the UserSettings object is not in the cache, we don't even try to look for the large settings object in the local cache! And because the TTLs of these two types of objects are completely different and the large objects are reused across user settings, that's a serious mistake. We created a lot of large objects on every request and loaded them into the local cache, replacing identical objects already there. A good way to waste your resources 😃 I fixed that bug by adding a check of the local cache for the settings object before asking the external cache; a sketch of the fixed lookup is below.
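Roughly, the fixed lookup looks like this; IMemoryCache is real, but the repository interface, key formats and TTLs below are illustrative, not our actual code:
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public class SettingsCache
{
    private readonly IMemoryCache _cache;
    private readonly ISettingsRepository _db; // hypothetical access to the compressed db cache

    public SettingsCache(IMemoryCache cache, ISettingsRepository db)
    {
        _cache = cache;
        _db = db;
    }

    public async Task<LargeSettings> GetSettingsAsync(string userId)
    {
        // 1) user -> hash mapping (small object, short TTL)
        if (!_cache.TryGetValue($"user:{userId}", out UserSettings userSettings))
        {
            userSettings = await _db.LoadUserSettingsAsync(userId);
            _cache.Set($"user:{userId}", userSettings, TimeSpan.FromMinutes(5));
        }

        // 2) the fix: always check the local cache for the large shared object first
        if (_cache.TryGetValue($"settings:{userSettings.SettingsHash}", out LargeSettings settings))
            return settings;

        // 3) only then fall back to the compressed cache in the database
        settings = await _db.LoadSettingsByHashAsync(userSettings.SettingsHash);
        _cache.Set($"settings:{userSettings.SettingsHash}", settings, TimeSpan.FromHours(6));
        return settings;
    }
}

// hypothetical types used above
public class LargeSettings { /* the big shared configuration object */ }

public interface ISettingsRepository
{
    Task<UserSettings> LoadUserSettingsAsync(string userId);
    Task<LargeSettings> LoadSettingsByHashAsync(string hash);
}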
APM
Right at that moment we were trying out a new APM system in production. After some time it detected our GC problem as well:
I strongly recommend using a (good) APM in production because it can save you a lot of time.
Fix results
During the deployment to production I immediately started monitoring performance. I was shocked: right after the deploy, service latency went down by 50%:
Metrics also confirmed that the cache was working well (the cache miss count was close to 0):
GC stats also showed that we started creating far fewer Gen 2 objects:
Right after the deploy the large object count dropped close to 0
Second generation on a few machines
Cache lookup times became much faster too, because the local cache started working like a charm. Wow!
We got such a high cache hit rate that more than 90% of our requests were almost instant!
After some time we checked our restart job. It turned out that in the 72 hours after the deploy we didn't have a single stuck process! An amazing result after 20+ freezes per 24 hours. Later we had a few freezes caused by other performance issues, but the rate was at least 10x lower than ever before.
Conclusion
After all the improvements the GC freeze became a rare issue for us (yes, it is still reproducible, because we have another type of oversized objects that are extremely difficult to get rid of; we will try to do that in the near future). The performance of our hottest endpoint also improved significantly. We contacted Microsoft regarding this issue but are still waiting for their response. The GC freeze is not fixed in dotnet itself, so it could happen in your project too. Are you ready?