2x Faster, 40% less RAM: The Cloud Run stdout logging hack

siddhantkcode

Siddhant Khare

Posted on November 24, 2024

Sometimes, the simplest solutions yield the most dramatic improvements. In a recent private project, I discovered that changing how I handled logging in Cloud Run led to significant performance gains and memory savings. Here's the full story of what happened, what I learned, and how you can implement these improvements in your projects.

High-level diagram

The setup

I've been running a Node.js (v18.x) application with NestJS (v10.x) on Cloud Run for about 8 months now, handling around 50k requests per day. I chose @google-cloud/logging-winston (v6.0.0) for logging because of its promised seamless integration with Google Cloud's ecosystem. Initially, I was using the default configuration, which directly calls the Cloud Logging API. Everything seemed fine at first glance - the logs appeared in Cloud Logging, and the application was running. But then things got interesting.
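
For reference, the original setup looked roughly like this (a reconstruction for illustration, not the exact project code):

import { LoggingWinston } from "@google-cloud/logging-winston";
import { createLogger } from "winston";

// Default configuration: every log entry is shipped to the
// Cloud Logging API over the network from inside the container.
const logger = createLogger({
  transports: [new LoggingWinston()],
});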

The mystery of the memory spikes

My application was configured with 1 vCPU and 4 GiB RAM, which should have been plenty. However, I noticed some concerning patterns in my metrics:

  • Memory usage would occasionally spike to 100%
  • CPU utilization would hit maximum capacity during these spikes
  • Container instance time was higher than expected

When these memory spikes occurred, I started seeing errors like:

Error: Quota exceeded for quota metric 'Write requests' and limit 'Write requests per minute' of service 'logging.googleapis.com'

And the dreaded Cloud Run timeout messages:

Error: The request has been terminated because it has reached the maximum request timeout.

While these metrics were concerning, the application seemed to be functioning normally most of the time, so I initially assigned it a lower priority for investigation. That was until I encountered something that made me sit up and take notice.

The three-hour log delay incident

One day, while monitoring my application, I noticed something bizarre. An alert was triggered based on a log condition, but when I investigated, I discovered the log was from more than three hours ago!

Here's what the delayed log looked like:

{
  "severity": "ERROR",
  "message": "Database connection failed",
  "timestamp": "2024-10-19T10:15:00.000Z",
  "receiveTimestamp": "2024-10-19T13:15:03.421Z"
  // Over 3 hours between timestamp and receiveTimestamp!
}

This wasn't just a minor inconvenience - it had serious implications:

  • Log-based alerts were delayed, making them less useful for real-time monitoring
  • Backed-up logs were consuming memory
  • Some instances were failing to accept requests with timeout errors

During investigation, these commands were invaluable:

# Check container logs for OOM issues
gcloud logging read "resource.type=cloud_run_revision AND textPayload:\"OOMKilled\""

# Monitor memory usage in real-time
gcloud run services describe my-service --format='get(status.conditions)'
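
If you want to quantify the lag yourself, compare timestamp (when the entry was produced) with receiveTimestamp (when Cloud Logging ingested it). Something along these lines should work, with the filter adjusted to your own service:

# List recent errors with both timestamps side by side
gcloud logging read 'resource.type=cloud_run_revision AND severity>=ERROR' \
  --limit=20 \
  --format='table(timestamp, receiveTimestamp, severity)'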

The root cause analysis

After investigation, I pieced together what was happening:

  1. Logs weren't being sent immediately to Cloud Logging
  2. They were accumulating in memory
  3. Eventually, there wasn't enough memory to even handle new requests
  4. The application would start throwing timeout errors

The worst part? If an instance terminated while logs were stuck in memory, those logs would be lost forever. It's not ideal for observability!

Root cause analysis diagram

The Solution: stdout to the rescue

The fix turned out to be surprisingly simple. Instead of using the Cloud Logging API directly, I modified the configuration to write logs to standard output. Cloud Run automatically forwards stdout to Cloud Logging, making this a more efficient approach.

stdout solution diagram for Cloud Run logs

Here's how to implement it:

import { LoggingWinston } from "@google-cloud/logging-winston";
import { createLogger } from "winston";

const logger = createLogger({
  transports: [
    new LoggingWinston({
      // Write entries to stdout instead of calling the Cloud Logging API;
      // Cloud Run forwards stdout to Cloud Logging automatically.
      redirectToStdout: true,
      // Defaults to true; false keeps structured fields at the top level
      // instead of wrapping the payload in a message field.
      useMessageField: false,
    }),
  ],
});
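
Application code doesn't need to change: you keep calling winston as before, and the transport emits structured JSON on stdout, which Cloud Run forwards to Cloud Logging. A quick illustration (field names here are just examples):

// Plain winston calls; metadata shows up as structured fields in Cloud Logging.
logger.info("Order processed", { orderId: "12345" });
logger.error("Database connection failed", { retryCount: 3 });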

The results were crazy!!!

After implementing this change, the improvements were immediate and significant:

Memory metrics:

  • Before:
    • Peak memory usage: 3.8 GB of 4 GB (95%)
    • Average memory usage: 3.2 GB (80%)
    • Frequent OOM risks
  • After:
    • Peak memory usage: 1.6 GB (40%)
    • Average memory usage: 1.2 GB (30%)
    • Zero OOM incidents in 30 days

Performance metrics:

  • p95 latency decreased from 800ms to 300ms
  • Error rate dropped from 0.5% to 0.02%
  • Average CPU utilization dropped from 85% to 40%

Cost impact:

  • 40% reduction in Cloud Run instance hours
  • 35% decrease in logging costs
  • Estimated annual savings: $2,400

I validated the improvement with a load test:

  • Tool: Artillery
  • Duration: 30 minutes
  • Load: 0 to 500 RPS ramp-up
  • Payload: 1KB JSON messages

The stdout version handled 2x the load with half the memory usage.

The blue line in the center of the following images indicates when the change was applied.

Memory usage

CPU usage

Billed container instance hours

Important considerations for structured logging

When using stdout logging with Cloud Run, there are some crucial points to keep in mind:

  1. Handle null values in labels carefully: If any label contains null, it gets stored in a jsonPayload field under logging.googleapis.com/labels. I solve this by converting all label values to strings before logging (see the sketch after this list).

  2. Timestamp precision: Always pass timestamps as Date objects. Otherwise, the logger might override your timestamp with its processing time.

  3. Trace configuration: For request tracing, include the X-Cloud-Trace-Context header value in your logs. This lets you filter logs by request, which is invaluable for debugging. Learn more in the official docs.

  4. Instance identification: Cloud Run automatically includes labels.instanceId, which is extremely useful for debugging instance-specific issues.
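
To make points 1-3 concrete, here's a minimal sketch. The helper names are hypothetical, and the exact metadata keys can differ between logging-winston versions, so treat it as a starting point rather than a drop-in:

import { LoggingWinston } from "@google-cloud/logging-winston";
import { createLogger } from "winston";

const logger = createLogger({
  transports: [
    new LoggingWinston({ redirectToStdout: true, useMessageField: false }),
  ],
});

// 1. Coerce every label value to a string so null/undefined never leak
//    into jsonPayload under logging.googleapis.com/labels.
function sanitizeLabels(labels: Record<string, unknown>): Record<string, string> {
  return Object.fromEntries(
    Object.entries(labels).map(([key, value]): [string, string] => [key, String(value ?? "")]),
  );
}

// 3. Build the Cloud Logging trace value from an incoming
//    X-Cloud-Trace-Context header ("TRACE_ID/SPAN_ID;o=1").
function traceFromHeader(
  header: string | undefined,
  projectId: string,
): string | undefined {
  const traceId = header?.split("/")[0];
  return traceId ? `projects/${projectId}/traces/${traceId}` : undefined;
}

// Example call (header and project ID are sample values):
logger.error("Database connection failed", {
  labels: sanitizeLabels({ retryCount: 3, userId: null }), // 1. strings only
  timestamp: new Date(),                                   // 2. a real Date object
  "logging.googleapis.com/trace": traceFromHeader(
    "105445aa7843bc8bf206b12000100000/1;o=1",
    "my-gcp-project",
  ),
});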

Alternative approaches I considered

Before settling on stdout logging, I evaluated several other options:

  1. Optimizing API calls:
    • Adjusting API call frequency
    • Batch processing logs
    • Implementing retry logic
  2. Using logging agents:
    • Exploring the Logging agent (though it's no longer in active development)
    • Investigating the Ops agent (more suited for Compute Engine)
  3. Custom buffering solutions:
    • Implementing an in-memory buffer with size limits (sketched below)
    • Creating a separate logging microservice
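
For comparison, the custom buffering route (option 3) would have meant owning something like this hypothetical sketch, plus retry and shutdown handling on top:

// Hypothetical bounded buffer that flushes logs in batches; this is the
// kind of extra machinery that writing to stdout makes unnecessary.
class BoundedLogBuffer {
  private entries: object[] = [];

  constructor(
    private readonly maxEntries: number,
    private readonly flush: (batch: object[]) => Promise<void>,
  ) {}

  async add(entry: object): Promise<void> {
    this.entries.push(entry);
    if (this.entries.length >= this.maxEntries) {
      const batch = this.entries;
      this.entries = [];
      await this.flush(batch); // still need a policy for failed flushes
    }
  }
}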

The stdout approach won out due to its simplicity, effectiveness, and built-in support in Cloud Run.

Have you encountered similar logging challenges in your Cloud Run applications? I'd love to hear about your experiences and solutions in the comments below!


For more tips and insights, follow me on Twitter @Siddhant_K_code and stay up to date with detailed tech content like this.
