Analyzing Weird Spikes in Cloud Function Require Latency
Eric Hacke
Posted on July 9, 2020
Code to reproduce this is available on GitHub
The whole idea of Asserted is that it allows you to run custom test code against your application. At the time I started building it, I figured the fastest and easiest way to do that was using GCP Cloud Functions. Cloud Functions have been around for years, and have well known performance and security characteristics, so it seemed like a safe bet.
At its core, the implementation was simple: copy the code into a Cloud Function, then use child_process to execute it safely with a timeout.
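Roughly speaking, it looked something like the sketch below. This isn't the exact Asserted implementation, and names like runUserCode and userCodeDir are just placeholders, but the shape is the same: exec the copied code with a working directory and a timeout.

```js
// Minimal sketch of the approach (illustrative names, not the real code)
const { exec } = require('child_process');

function runUserCode(userCodeDir, timeoutMs = 30000) {
  return new Promise((resolve, reject) => {
    // Run the copied user code as a child process; if it exceeds the timeout,
    // exec kills it and calls back with an error.
    exec('node index.js', { cwd: userCodeDir, timeout: timeoutMs }, (error, stdout, stderr) => {
      if (error) return reject(error);
      resolve({ stdout, stderr });
    });
  });
}
```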
This seemed to work great at first. Relatively low-latency, and easy to maintain.
But this code runs continuously, as often as every minute, forever. Within less than a day, I got a timeout on the child_process.exec.
The Mystery Begins
Logically, I assumed it was my fault, because most things are.
The code I was executing was calling API endpoints, and maybe they were holding the connection open too long or something. I ignored it at first, but then I noticed that when I ran the code locally on my machine for extended periods, the timeouts didn't happen. So it wasn't the code exactly, and it wasn't the API I was calling from inside that code.
I started investigating. I did the usual debugging steps of basically adding console.log statements everywhere to see where the holdup was, and set the exec to inherit stdio so I could easily see the logs.
I added some around child_process.exec:
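Something along these lines (a reconstruction, not the exact code; userCodeDir and timeoutMs are stand-ins, and here I pipe the child's output through rather than inheriting stdio, which surfaces its log lines the same way):

```js
const { exec } = require('child_process');

console.log(`exec starting: ${new Date().toISOString()}`);

const child = exec('node index.js', { cwd: userCodeDir, timeout: timeoutMs }, (error) => {
  console.log(`exec finished: ${new Date().toISOString()}`);
  if (error) console.error('exec failed or timed out', error);
});

// Forward the child's output so its log lines show up alongside these timestamps
child.stdout.pipe(process.stdout);
child.stderr.pipe(process.stderr);
```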
And others inside the user code itself:
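Again a reconstruction rather than the original code; the specific libraries below are just examples of the kind of common dependencies the user code pulled in:

```js
console.log(`user code started: ${new Date().toISOString()}`);

// Ordinary, widely used dependencies (examples, not necessarily the exact set)
const axios = require('axios');
const { expect } = require('chai');

console.log(`requires finished: ${new Date().toISOString()}`);

// ...the actual test logic follows
```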
After running the function a number of times, I looked into GCP Logging where I could see the log lines and the time they occurred.
I was surprised to see that the delay wasn't happening within the bulk of the user code; it was happening between the exec starting and the require statements finishing.
There was huge variance in how long the require statements took to finish. Sometimes they would complete within 100 ms; other times they would take over 2 seconds, or not complete before the timeout at all.
That definitely seemed weird. These aren't weird esoteric dependencies. They are some of the most commonly used libraries on NPM.
Profiling these require statements on my own machine showed negligible impact, so maybe it was something about Cloud Functions itself that was weird?
I decided to come up with a more formal test to see if I could track it down.
The Experiment
Environments
I had tried out Cloud Run around the same time and knew that I didn't see the issue there, only in Cloud Functions. So I decided to do a three-way comparison. I would run the same code in three environments and compare the results:
- Cloud Function - 2048 MB Memory - single 2.4 GHz CPU
- Cloud Run - 2048 MB Memory - single vCPU
- Local Docker - 2048 MB Memory - single CPU
Code
In terms of the code I was running, I didn't want to rely on a specific pre-existing library. While a real dependency is where I originally noticed the problem, I didn't want to suggest that some particular dependency was the issue.
So I wrote a bit of code that randomly generates node modules, each containing a single object with up to 100 randomly created properties.
Then I used that to create a folder containing 1000 randomly generated libraries, and a single index.js file that requires all of those libraries and exports them in a single giant object.
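A condensed sketch of what the generator does; the full version in the linked repo handles more details, but the idea is this:

```js
const fs = require('fs');
const path = require('path');

function generateModules(targetDir, count = 1000) {
  fs.mkdirSync(targetDir, { recursive: true });

  const requires = [];

  for (let i = 0; i < count; i++) {
    const name = `lib-${i}`;

    // Each module exports a single object with up to 100 random properties
    const propCount = Math.floor(Math.random() * 100) + 1;
    const props = {};
    for (let j = 0; j < propCount; j++) {
      props[`prop${j}`] = Math.random().toString(36).slice(2);
    }

    fs.writeFileSync(path.join(targetDir, `${name}.js`), `module.exports = ${JSON.stringify(props)};`);
    requires.push(`  '${name}': require('./${name}'),`);
  }

  // index.js requires every generated module and exports them as one big object
  fs.writeFileSync(path.join(targetDir, 'index.js'), `module.exports = {\n${requires.join('\n')}\n};`);
}
```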
1000 dependencies might sound like a lot, but if you run ls -al node_modules | wc -l inside an arbitrary Node project, you'll see that it's actually pretty reasonable. Maybe even conservative.
As mentioned at the beginning of the post, you can see the full codebase for this experiment here.
Scenarios
Beyond just calling require on 1000 dependencies, I wanted to contrast it with a few different scenarios to give some context to the issue. So I came up with three scenarios that I'd run in each of the three environments:
- Normal Require - Load 1000 dependencies from the default directory
- Regenerate and Require - Regenerate and load 1000 dependencies in /tmp
- CPU - Just eat CPU for 1 second
The idea here is that Cloud Functions loads the code you provide from a read-only directory. I don't know much at all about the underlying implementation of Cloud Functions, but I wanted to control for the possibility that this read-only directory was somehow affecting things. So I added a second scenario where I regenerate all of the dependencies during the request into /tmp, and then load them from there.
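As a sketch, reusing the hypothetical generateModules from above, that scenario boils down to:

```js
const path = require('path');

// /tmp is the writable location in a Cloud Function, so regenerate the modules there
const TMP_DIR = '/tmp/generated-libs';

generateModules(TMP_DIR, 1000);

// Then load them from the writable directory instead of the read-only deployment directory
const libs = require(path.join(TMP_DIR, 'index.js'));
console.log(`loaded ${Object.keys(libs).length} regenerated libraries`);
```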
And the last scenario is a simple control group, where I just spin in place for 1000 ms and then exit.
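The control is about as simple as it sounds, something like:

```js
// Busy-loop for roughly 1000 ms, then exit; no disk access or require() involved
const end = Date.now() + 1000;
let counter = 0;

while (Date.now() < end) {
  counter += Math.sqrt(counter + 1); // arbitrary arithmetic to keep the CPU occupied
}

console.log(`spun for ~1000 ms, counter: ${counter}`);
```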
The Results
I ran each of these scenarios 1000 times in each of the three environments and collected the results. The times shown in all of these charts are not the HTTP request latency, but the amount of time it takes for the child_process.exec to complete loading the giant dependency.
Require Time
As you can see in the chart, there is a huge variation in the amount of time it takes for the fake dependencies to load within the Cloud Function. From 2.5 seconds to well over 10 seconds.
The Cloud Run instance shows some variation, but it stays quite reasonable. And the local Docker instance is basically unchanged, which is what you'd expect.
Statistics:
- Cloud Function - Standard Deviation: 862 ms - Median: 4015 ms
- Cloud Run - Standard Deviation: 207 ms - Median: 2265 ms
- Local Docker - Standard Deviation: 30 ms - Median: 1213 ms
The chart above shows the distribution of latencies with the outlier 1% stripped. The local Docker distribution is very tight, Cloud Run shows some variation, and the Cloud Function varies widely.
Regenerate and Require Time
This scenario has more going on, so the numbers are bigger, but the pattern is essentially the same. Cloud Function performs worst, Cloud Run has some variation but is reasonable, and local Docker is tight.
Statistics:
- Cloud Function - Standard Deviation: 1664 ms - Median: 7198 ms
- Cloud Run - Standard Deviation: 524 ms - Median: 5895 ms
- Local Docker - Standard Deviation: 36 ms - Median: 3245 ms
The distribution is similar to the simpler require scenario. The local Docker distribution is tight, Cloud Run is wider (with an outlier), and the Cloud Function is wider still.
CPU Time (control)
The vertical axis on this chart has been adjusted to match the first scenario to give a better visual comparison.
You can see that when it's just doing straight CPU work, all environments are close to the same. There are some spikes in the Cloud Function times, but nothing significant.
Statistics:
- Cloud Function - Standard Deviation: 23 ms - Median: 1172 ms
- Cloud Run - Standard Deviation: 20 ms - Median: 1095 ms
- Local Docker - Standard Deviation: 2 ms - Median: 1045 ms
I could not seem to adjust the horizontal axis in this case, but note that the overall variation shown here is narrow, even if the Cloud Function is broader than the other two.
Conclusion
You: This is interesting Eric, but what does this mean?
Me: I have no idea.
I don't know enough about how Cloud Functions are implemented to speculate about why this is happening.
At a glance, it seems likely that for some reason, large reads from disk (or disk-in-memory?) for Cloud Functions seem to have unpredictable performance characteristics.
I can't say why exactly this is happening. But I can say that it was a big enough problem for me that I switched everything over to using Cloud Run instead.
I'd be really curious to know if any Google people have a guess as to why this might be the case, and I'd definitely post it here if I hear anything.