2x Faster, 40% less RAM: The Cloud Run stdout logging hack
Siddhant Khare
Posted on November 24, 2024
Sometimes, the simplest solutions yield the most dramatic improvements. In a recent private project, I discovered that changing how I handled logging in Cloud Run led to significant performance gains and memory savings. Here's the full story of what happened, what I learned, and how you can implement these improvements in your projects.
The setup
I've been running a Node.js (v18.x) application with NestJS (v10.x) on Cloud Run for about 8 months now, handling around 50k requests per day. I chose @google-cloud/logging-winston (v6.0.0) for logging because of its promised seamless integration with Google Cloud's ecosystem. Initially, I was using the default configuration, which directly calls the Cloud Logging API. Everything seemed fine at first glance - the logs appeared in Cloud Logging, and the application was running. But then things got interesting.
The mystery of the memory spikes
My application was configured with 1 vCPU and 4 GiB RAM, which should have been plenty. However, I noticed some concerning patterns in my metrics:
- Memory usage would occasionally spike to 100%
- CPU utilization would hit maximum capacity during these spikes
- Container instance time was higher than expected
When these memory spikes occurred, I started seeing errors like:
Error: Quota exceeded for quota metric 'Write requests' and limit 'Write requests per minute' of service 'logging.googleapis.com'
And the dreaded Cloud Run timeout messages:
Error: The request has been terminated because it has reached the maximum request timeout.
While these metrics were concerning, the application seemed to be functioning normally most of the time, so I initially assigned it a lower priority for investigation. That was until I encountered something that made me sit up and take notice.
The three-hour log delay incident
One day, while monitoring my application, I noticed something bizarre. An alert was triggered based on a log condition, but when I investigated, I discovered the log was from more than three hours ago!
Here's what the delayed log looked like:
{
  "severity": "ERROR",
  "message": "Database connection failed",
  "timestamp": "2024-10-19T10:15:00.000Z",
  "receiveTimestamp": "2024-10-19T13:15:03.421Z"
  // Over 3 hours difference!
}
This wasn't just a minor inconvenience - it had serious implications:
- Log-based alerts were delayed, making them less useful for real-time monitoring
- Backed-up logs were consuming memory
- Some instances were failing to accept requests with timeout errors
During investigation, these commands were invaluable:
# Check container logs for OOM issues
gcloud logging read "resource.type=cloud_run_revision AND textPayload:\"OOMKilled\""
# Check the service's status conditions
gcloud run services describe my-service --format='get(status.conditions)'
The root cause analysis
After investigation, I pieced together what was happening:
- Logs weren't being sent immediately to Cloud Logging
- They were accumulating in memory
- Eventually, there wasn't enough memory to even handle new requests
- The application would start throwing timeout errors
The worst part? If an instance terminated while logs were stuck in memory, those logs would be lost forever. It's not ideal for observability!
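If you suspect the same thing is happening in your own service, one cheap way to see the pattern is to sample process.memoryUsage() on a timer and watch whether heap usage keeps climbing between requests. Here's a minimal sketch - the 30-second interval and the field names are just placeholders, not something from my production setup:
const MB = 1024 * 1024;

// Rough memory sampler: prints heap usage every 30 seconds so you can
// spot steady growth that request traffic alone doesn't explain.
setInterval(() => {
  const { rss, heapUsed, heapTotal, external } = process.memoryUsage();
  // Writing directly to stdout on purpose: it costs almost nothing and
  // doesn't depend on the logging pipeline we're trying to debug.
  console.log(
    JSON.stringify({
      severity: "DEBUG",
      message: "memory-sample",
      rssMb: Math.round(rss / MB),
      heapUsedMb: Math.round(heapUsed / MB),
      heapTotalMb: Math.round(heapTotal / MB),
      externalMb: Math.round(external / MB),
    })
  );
}, 30_000);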
The solution: stdout to the rescue
The fix turned out to be surprisingly simple. Instead of using the Cloud Logging API directly, I modified the configuration to write logs to standard output. Cloud Run automatically forwards stdout to Cloud Logging, making this a more efficient approach.
Here's how to implement it:
import { LoggingWinston } from "@google-cloud/logging-winston";
import { createLogger } from "winston";

const logger = createLogger({
  transports: [
    new LoggingWinston({
      // This redirects logs to Cloud Run's stdout
      redirectToStdout: true,
      // Default is true, which affects the log structure;
      // setting it to false prevents wrapping logs unnecessarily
      useMessageField: false,
    }),
  ],
});
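If you're running NestJS like I am, this transport also needs to be hooked into Nest's own logger. One way to do that is with the nest-winston package - I haven't shown my exact wiring above, so treat this as a sketch (AppModule and the port fallback are placeholders):
import { NestFactory } from "@nestjs/core";
import { WinstonModule } from "nest-winston";
import { LoggingWinston } from "@google-cloud/logging-winston";
import { AppModule } from "./app.module"; // placeholder for your root module

async function bootstrap() {
  const app = await NestFactory.create(AppModule, {
    // Replace Nest's default logger with winston + the stdout transport
    logger: WinstonModule.createLogger({
      transports: [
        new LoggingWinston({
          redirectToStdout: true,
          useMessageField: false,
        }),
      ],
    }),
  });
  await app.listen(process.env.PORT ?? 8080);
}

bootstrap();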
The results were crazy!!!
After implementing this change, the improvements were immediate and significant:
Memory metrics:
- Before:
  - Peak memory usage: 3.8GB of 4GB (95%)
  - Average memory usage: 3.2GB (80%)
  - Frequent OOM risks
- After:
  - Peak memory usage: 1.6GB (40%)
  - Average memory usage: 1.2GB (30%)
  - Zero OOM incidents in 30 days
Performance metrics:
- p95 latency decreased from 800ms to 300ms
- Error rate dropped from 0.5% to 0.02%
- Average CPU utilization dropped from 85% to 40%
Cost impact:
- 40% reduction in Cloud Run instance hours
- 35% decrease in logging costs
- Estimated annual savings: $2,400
I validated the improvement with a load test:
- Tool: Artillery
- Duration: 30 minutes
- Load: 0 to 500 RPS ramp-up
- Payload: 1KB JSON messages
The stdout version handled 2x the load with half the memory usage.
The blue line in the center of the following images indicates when the change was applied.
Important considerations for structured logging
When using stdout logging with Cloud Run, there are some crucial points to keep in mind:
- Handle null values in labels carefully: If any label contains null, it gets stored in a jsonPayload field under logging.googleapis.com/labels. I solve this by converting all label values to strings before logging (see the sketch after this list).
- Timestamp precision: Always pass timestamps as Date objects. Otherwise, the logger might override your timestamp with its processing time.
- Trace configuration: For request tracing, include the X-Cloud-Trace-Context header value in your logs. This lets you filter logs by request, which is invaluable for debugging. Learn more in the official docs.
- Instance identification: Cloud Run automatically includes labels.instanceId, which is extremely useful for debugging instance-specific issues.
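To make the first three points concrete, here's roughly what a log call can look like. The helpers (toStringLabels, requestLogMeta) are invented for illustration, the project ID is assumed to be available via a GOOGLE_CLOUD_PROJECT environment variable (Cloud Run doesn't set one for you), and the trace handling assumes the usual TRACE_ID/SPAN_ID;o=1 header format:
import { LoggingWinston } from "@google-cloud/logging-winston";
import type { Request } from "express";

// Illustration only: coerce every label value to a string so that
// null/undefined never end up under logging.googleapis.com/labels
// inside jsonPayload.
function toStringLabels(labels: Record<string, unknown>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(labels)) {
    out[key] = String(value ?? "");
  }
  return out;
}

// Build per-request metadata: string-only labels, an explicit Date
// timestamp, and a trace field derived from X-Cloud-Trace-Context.
function requestLogMeta(req: Request, labels: Record<string, unknown>) {
  const meta: Record<string, unknown> = {
    labels: toStringLabels(labels),
    timestamp: new Date(), // a Date object, not a string
  };

  const traceHeader = req.header("X-Cloud-Trace-Context");
  if (traceHeader && process.env.GOOGLE_CLOUD_PROJECT) {
    const [traceId] = traceHeader.split("/");
    // LOGGING_TRACE_KEY resolves to "logging.googleapis.com/trace"
    meta[LoggingWinston.LOGGING_TRACE_KEY] =
      `projects/${process.env.GOOGLE_CLOUD_PROJECT}/traces/${traceId}`;
  }
  return meta;
}

// Usage with the logger from earlier:
// logger.info("Order created", requestLogMeta(req, { orderId: order.id }));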
Alternative approaches I considered
Before settling on stdout logging, I evaluated several other options:
- Optimizing API calls:
  - Adjusting API call frequency
  - Batch processing logs
  - Implementing retry logic
- Using logging agents:
  - Exploring the Logging agent (though it's no longer in active development)
  - Investigating the Ops Agent (more suited for Compute Engine)
- Custom buffering solutions:
  - Implementing an in-memory buffer with size limits
  - Creating a separate logging microservice
The stdout approach won out due to its simplicity, effectiveness, and built-in support in Cloud Run.
Have you encountered similar logging challenges in your Cloud Run applications? I'd love to hear about your experiences and solutions in the comments below!
For more tips and insights, follow me on Twitter @Siddhant_K_code and stay updated with the latest & detailed tech content like this.