Posted on December 6, 2021
We often see companies publish a short incident recap after a major outage; a couple of recent examples came from Facebook and Cloudflare.
Although knowing the actual root cause is interesting, I think there is a lot to learn from how the engineers managed to troubleshoot the issue.
So I decided to take a stab at it and walk you through the troubleshooting process that took place in one of our recent incidents.
A few weeks ago, my team was preparing for the launch of an alpha version of a new product. Engineers were heads down wrapping up the last features and smashing bugs when our QA raised a bug in production.
Users should have been redirected to the payment page after clicking the Buy button. However, quite often, the underlying request would time out instead.
This was a major regression, so we decided to jump on it.
The first step was to dive into the backend app logs.
Judging by the logs, it looked like the request froze 🧊 at some point during the execution path. This was hard to believe, since the app is simple enough that it shouldn't run into any deadlocks. Even so, we would have expected to see an Exception somewhere, and that wasn't the case here.
What made it even more interesting was that some requests executed normally.
Nevertheless, since the bug was fairly easy to reproduce, we needed to keep digging.
The next step in the investigation was to analyze the log lines in the code and pinpoint 🎯 the section that was causing the request thread to freeze. This can be done simply by looking at the last log line reported and the next log line that should have been reported.
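For example (hypothetical log lines and names, purely to illustrate the technique):

log.info("Loaded config");            // the last line we saw in the logs
var data = cache.get(environment);    // prime suspect: the thread never gets past here
log.info("Fetched pricing data");     // the next expected line, never reported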
This exercise brought to light the first suspect. As part of the request, we fetched a large JSON file from an external service. Since this file didn't change very often, we added a caching layer to save some network requests and reduce the latency of the request.
To implement this caching layer, we decided to use a Guava cache. The cache was configured so that it would return old values while the new values were being fetched.
Here is a fragment of code showing how the cache was configured:
var cache = CacheBuilder.newBuilder()
    .refreshAfterWrite(Constants.CACHE_REFRESH_TIME_MS, TimeUnit.MILLISECONDS)
    .build(buildCacheLoader(baseUrl));

private static CacheLoader<String, Data> buildCacheLoader(String baseUrl) {
  return new CacheLoader<>() {
    @Override
    public Data load(final String environment) throws Exception {
      log.info("Fetching data");
      var client = HttpClient.newBuilder()
          .connectTimeout(Duration.ofSeconds(Constants.TIMEOUT_IN_SECONDS))
          .build();
      final HttpRequest request = HttpRequest.newBuilder()
          .uri(URI.create(baseUrl))
          .timeout(Duration.ofSeconds(Constants.TIMEOUT_IN_SECONDS))
          .build();
      final HttpResponse<String> response =
          client.send(request, HttpResponse.BodyHandlers.ofString());
      // The loader returns the deserialized payload, so both the CacheLoader
      // and the load method are typed with Data rather than String.
      return mapper.readValue(response.body(), Data.class);
    }
  };
}
At this point, it was clear that the cache wasn't behaving as we expected, since there was no Exception in the logs that would explain what was causing the request thread to hang. Thus, I decided to spend some time trying to understand in depth how Guava caches work.
In situations like this one, I usually create a playground project where I can remove all the noise (business logic) and focus on what matters 🧐.
I had to come up with a setup as close to production as possible to ensure I could reproduce the exact same issue.
To achieve that, I decided to:
- mock the external service with WireMock, so I could control its responses (payload, status code, and artificial delays);
- build the cache with the same configuration we use in production;
- simulate concurrent requests with plain threads, using the helpers sketched below.
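The post doesn't show the small test helpers (setupRemoteResponse, setupFailedResponse, waitForCacheToExpire, runAndWait, runManyAndWait), so here is a minimal sketch of what they could look like. The bodies are my assumptions, though the WireMock calls (stubFor, get, aResponse, withFixedDelay) are the library's real API:

private void setupRemoteResponse(Data data) throws JsonProcessingException {
  setupRemoteResponse(data, 0);
}

private void setupRemoteResponse(Data data, int delayInSeconds) throws JsonProcessingException {
  // Stub the external service to return the serialized payload, optionally
  // delayed past the client timeout to simulate a hanging dependency.
  wireMockServer.stubFor(
      get(urlEqualTo("/"))
          .willReturn(
              aResponse()
                  .withBody(mapper.writeValueAsString(data))
                  .withFixedDelay(delayInSeconds * 1000)));
}

private void setupFailedResponse() {
  // A 500 with an empty body, which later trips the JSON deserialization.
  wireMockServer.stubFor(get(urlEqualTo("/")).willReturn(aResponse().withStatus(500)));
}

private void waitForCacheToExpire() throws InterruptedException {
  // Sleep just past refreshAfterWrite so the next read triggers a reload.
  Thread.sleep(Constants.CACHE_REFRESH_TIME_MS + 100);
}

private void runAndWait(LoadingCache<String, Data> cache) throws InterruptedException {
  runManyAndWait(cache, 1);
}

private void runManyAndWait(LoadingCache<String, Data> cache, int threadCount)
    throws InterruptedException {
  var threads = new ArrayList<Thread>();
  for (int i = 0; i < threadCount; i++) {
    threads.add(new Thread(new CacheReader(cache)));
  }
  threads.forEach(Thread::start);
  for (var thread : threads) {
    thread.join();
  }
}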
To ensure we have visibility into everything that happens during the different test cases, the threads will run the following Runnable, which includes several log lines:
@Slf4j
public class CacheReader implements Runnable {
  private final LoadingCache<String, Data> cache;

  public CacheReader(LoadingCache<String, Data> cache) {
    this.cache = cache;
  }

  @Override
  public void run() {
    try {
      log.info("Reading value at {}", System.currentTimeMillis());
      log.info("Value {} at {}", this.cache.get("something"), System.currentTimeMillis());
    } catch (ExecutionException e) {
      log.error("Exception while loading value at {}", System.currentTimeMillis(), e);
    }
  }
}
The first mystery to solve was the missing Exception. For that, there are two potential failures we could have experienced in production that should have reported an Exception:
@Test
@SneakyThrows
void givenEmptyCache_whenRemoteServiceTimeout_shouldThrowException() {
  var cache = CacheFactory.buildCacheWithoutExceptionHandling(wireMockServer.baseUrl());
  setupRemoteResponse(new Data("value"), 2 * Constants.TIMEOUT_IN_SECONDS);

  var thread = new Thread(new CacheReader(cache));
  thread.start();
  thread.join();
}
Which reports the following logs:
1861 [Thread-8] INFO org.example.CacheReader [] - Reading value at 1637941200937
1872 [Thread-8] INFO org.example.CacheFactory [] - Fetching data
2925 [Thread-8] ERROR org.example.CacheReader [] - Exception while loading value at 1637941202006
java.util.concurrent.ExecutionException: java.net.http.HttpTimeoutException: request timed out
... <omitted stacktrace> ...
When the cache is empty, the thread loading the cache seems to be the one that receives the Exception.
Since we didn't see any Exception in our logs, this wasn't the failure we observed in production. Let's keep going with the second potential failure: a serialization Exception when the remote service returns an error.
@Test
void givenEmptyCache_whenRemoteServiceReturns500_shouldThrowException() throws InterruptedException {
  var cache = CacheFactory.buildCacheWithoutExceptionHandling(wireMockServer.baseUrl());
  setupFailedResponse();

  var thread = new Thread(new CacheReader(cache));
  thread.start();
  thread.join();
}
Which reports the following logs:
1793 [Thread-8] INFO org.example.CacheReader [] - Reading value at 1637944243058
1807 [Thread-8] INFO org.example.CacheFactory [] - Fetching data
2070 [Thread-8] ERROR org.example.CacheReader [] - Exception while loading value at 1637944243341
java.util.concurrent.ExecutionException: com.fasterxml.jackson.databind.exc.MismatchedInputException: No content to map due to end-of-input
at [Source: (String)""; line: 1, column: 0]
... <omitted stacktrace> ...
In this case, the calling thread gets a serialization Exception, since the response body was not the expected one.
Again, since we didn't see any Exception in our logs, we need to continue the investigation. The remaining suspect is the refresh path: what happens when the cache already holds a value that is due for refresh and the reload fails?
@Test
@SneakyThrows
void givenCacheDueForRefresh_whenRemoteServiceTimeout_shouldGetOldValue() {
  var cache = CacheFactory.buildCacheWithoutExceptionHandling(wireMockServer.baseUrl());
  setupRemoteResponse(new Data("old_value"));

  var thread = new Thread(new CacheReader(cache));
  thread.start();
  thread.join();

  waitForCacheToExpire();
  setupRemoteResponse(new Data("new_value"), 2 * Constants.TIMEOUT_IN_SECONDS);

  thread = new Thread(new CacheReader(cache));
  thread.start();
  thread.join();
}
Which reports the following logs:
1834 [Thread-8] INFO org.example.CacheReader [] - Reading value at 1637945326283
1846 [Thread-8] INFO org.example.CacheFactory [] - Fetching data
2099 [Thread-8] INFO org.example.CacheReader [] - Value Data(value=old_value) at 1637945326554
3114 [Thread-9] INFO org.example.CacheReader [] - Reading value at 1637945327569
3115 [Thread-9] INFO org.example.CacheFactory [] - Reloading
3115 [Thread-9] INFO org.example.CacheFactory [] - Fetching data
com.google.common.cache.LocalCache$Segment$1 run
WARNING: Exception thrown during refresh
java.util.concurrent.ExecutionException: java.net.http.HttpTimeoutException: request timed out
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:566)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:527)
... <omitted stacktrace> ...
4296 [Thread-9] INFO org.example.CacheReader [] - Value old_value at 1637941240139
In these logs, we see that the thread that attempts to read the value after the cache is due for refresh will block (we will follow up on this later) in an attempt to reload the new value. However, if that fails, it will get the old value, and an Exception will be logged in the background.
When we look into the Guava documentation for the type of cache we are using (refreshAfterWrite), we find the following:
If an exception is thrown while refreshing, the old value is kept, and the exception is logged and swallowed.
This explains why we never saw the Exception in our logs: it was logged by Guava and swallowed, and that log line was filtered out when our app shipped its logs to NewRelic. We had decided to filter it to avoid polluting NewRelic with logs we didn't care about, but it ended up biting us in the arse 🤦
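As an aside, the playground's factory method is called buildCacheWithoutExceptionHandling, which hints at the obvious mitigation: catching and logging inside the loader itself, so the failure goes through our own logger (and reaches NewRelic) before Guava swallows it. A minimal sketch, with fetchData standing in for the HTTP call shown earlier:

@Override
public Data load(final String environment) throws Exception {
  try {
    return fetchData(environment);
  } catch (Exception e) {
    // Logged through our own logger, so the failure is visible even though
    // Guava swallows the exception during a refresh.
    log.error("Failed to fetch data", e);
    throw e;
  }
}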
Mystery 1 solved.
While troubleshooting the previous issue, we found that, when the cache is due for refresh, the thread that attempts to fetch the new value will block and, if it fails, it will eventually get the old value.
The next thing to figure out is what happens when many threads attempt to get the value from the cache at the same time while it is due for refresh.
@Test
@SneakyThrows
void givenCacheDueForRefresh_whenRemoteServiceTimeout_allThreadsShouldGetOldValue() {
  var cache = CacheFactory.buildCacheWithoutExceptionHandling(wireMockServer.baseUrl());
  setupRemoteResponse(new Data("old_value"));

  var thread = new Thread(new CacheReader(cache));
  thread.start();
  thread.join();

  waitForCacheToExpire();
  setupRemoteResponse(new Data("new_value"), 2 * Constants.TIMEOUT_IN_SECONDS);

  runManyAndWait(cache, 3);
}
Which reports the following logs:
1861 [Thread-8] INFO org.example.CacheReader [] - Reading value at 1637945351638
1871 [Thread-8] INFO org.example.CacheFactory [] - Fetching data
2118 [Thread-8] INFO org.example.CacheReader [] - Value Data(value=old_value) at 1637945351899
3132 [Thread-11] INFO org.example.CacheReader [] - Reading value at 1637945352913
3132 [Thread-10] INFO org.example.CacheReader [] - Reading value at 1637945352913
3132 [Thread-9] INFO org.example.CacheReader [] - Reading value at 1637945352913
3132 [Thread-9] INFO org.example.CacheFactory [] - Reloading
3132 [Thread-10] INFO org.example.CacheReader [] - Value Data(value=old_value) at 1637945352913
3132 [Thread-9] INFO org.example.CacheFactory [] - Fetching data
3132 [Thread-11] INFO org.example.CacheReader [] - Value Data(value=old_value) at 1637945352913com.google.common.cache.LocalCache$Segment$1 run
4185 [Thread-9] INFO org.example.CacheReader [] - Value old_value at 1637942442820
WARNING: Exception thrown during refresh
java.util.concurrent.ExecutionException: java.net.http.HttpTimeoutException: request timed out
... <omitted stacktrace> ...
Here, we observe that the unlucky Thread-9 blocks to fetch the new value, while all the other threads get the old value and move on with their lives.
This gives us enough information to conclude what happened in production!
When our QA was running their regression test, their request would often be served by the unlucky thread that gets blocked while fetching the new value. And since we run multiple instances of this application, each with its own cache due for refresh, the bug was easy to reproduce.
Funnily enough, when multiple people were trying at the same time and were lucky enough to hit the same instance, everything would work just fine for all but one of them, which is what we saw in the logs.
Mystery 2 solved!
Not quite!
We have learnt that, when the cache is due for refresh, there is always an unlucky thread that has to pay the latency cost of refreshing the value.
To provide the best experience for our users, we should make sure the cache refresh takes place asynchronously and that all requests get the old value while the new one is being fetched.
So ... how do we do that? 🤔
If we look again into Guava's documentation we can see a couple of comments about asynchronous reload:
[...] As specified in LoadingCache.refresh(K), refreshing a key loads a new value for the key, possibly asynchronously [...]
[...] If CacheLoader.reload is implemented to be asynchronous, then the query will not be slowed down by the refresh [...]
Unfortunately, they forgot to add an example of how to do it! 😅
It's alright, 💩 happens. A couple of DuckDuckGo searches later, we found that we could achieve this by configuring the cache to use asyncReloading:
public static LoadingCache<String, Data> buildCacheWithAsyncLoading(
    String baseUrl, ExecutorService pool) {
  return CacheBuilder.newBuilder()
      .refreshAfterWrite(Constants.CACHE_REFRESH_TIME_MS, TimeUnit.MILLISECONDS)
      .build(CacheLoader.asyncReloading(CacheFactory.buildCacheLoader(baseUrl), pool));
}
In this case, the refresh takes place on a thread from a separate thread pool that we have to pass in during instantiation.
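For the curious, CacheLoader.asyncReloading essentially wraps our loader so that reload hands the work to the executor and returns a future immediately. Roughly like this (a simplified sketch based on Guava's documented behavior, not its actual source):

@Override
public ListenableFuture<Data> reload(final String key, final Data oldValue) {
  // The reading thread is not blocked: it keeps serving the old value while
  // the executor performs the load in the background.
  ListenableFutureTask<Data> task = ListenableFutureTask.create(() -> load(key));
  executor.execute(task);
  return task;
}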
To validate this works as expected, we write another test 🧪:
@Test
@SneakyThrows
void solution_whenFutureResponsesTimeoutAndAsyncLoading_returnOldValue() {
  ExecutorService pool = Executors.newFixedThreadPool(10);
  try {
    var cache = CacheFactory.buildCacheWithAsyncLoading(wireMockServer.baseUrl(), pool);
    setupRemoteResponse(new Data("old_value"));
    runAndWait(cache);

    waitForCacheToExpire();
    setupRemoteResponse(new Data("new_value"), 2 * Constants.TIMEOUT_IN_SECONDS);
    runManyAndWait(cache, THREAD_COUNT);
  } finally {
    // Give the background refresh time to finish (and log) before shutting down.
    Thread.sleep(10000);
    pool.shutdown();
  }
}
Which reports the following logs:
1772 [Thread-8] INFO org.example.CacheReader [] - Reading value at 1637945095606
1783 [Thread-8] INFO org.example.CacheFactory [] - Fetching data
2032 [Thread-8] INFO org.example.CacheReader [] - Value Data(value=old_value) at 1637945095871
3049 [Thread-9] INFO org.example.CacheReader [] - Reading value at 1637945096888
3049 [Thread-10] INFO org.example.CacheReader [] - Reading value at 1637945096888
3050 [Thread-11] INFO org.example.CacheReader [] - Reading value at 1637945096889
3050 [Thread-11] INFO org.example.CacheReader [] - Value Data(value=old_value) at 1637945096889
3050 [Thread-10] INFO org.example.CacheReader [] - Value Data(value=old_value) at 1637945096889
3053 [pool-2-thread-1] INFO org.example.CacheFactory [] - Reloading
3053 [pool-2-thread-1] INFO org.example.CacheFactory [] - Fetching data
3065 [Thread-9] INFO org.example.CacheReader [] - Value Data(value=old_value) at 1637945096904
4063 [pool-2-thread-1] ERROR org.example.CacheFactory [] - Loading Exception at 1637945097902
... <omitted stacktrace> ...
We can see that there isn't an unlucky thread anymore and that the Exception is reported by a thread from the pool.
When we encounter an issue in production, depending on its severity and impact, we often perform an RCA (root cause analysis) investigation.
The goal of this process is to find the root cause of the problem, fix it, and come up with ways to prevent similar issues in the future or, at least, improve our automated monitoring so we catch them as soon as they happen.
For this particular problem, apart from fixing the bug, we decided to revisit the log filtering that kept Guava's swallowed Exception out of NewRelic, so that failures like this one don't go unnoticed again.
And that's the end of it all!
If you are curious to see the playground code, you can find it in the repository below:
This repository contains some tests that I used to investigate a tricky caching bug.
You can find more information about it in my blog albertowar.me.