Serverless Logs

mstn

Marco

Posted on March 5, 2020


Logging is important for gaining insight into a system's behavior. Logs are useful not only for debugging a system in a development environment, but also in production, provided they are enabled in a controlled manner.

We would like logging to be reliable without adding too much overhead to the main computation.

When it comes to serverless, building a logging solution that is both reliable and fast is not trivial.

Here, we discuss some common problems that arise when integrating an AWS serverless application with an external logging service.

  • Lambdas can be executed multiple times on failure. This means that we can end up with duplicate log messages.
  • Lambdas are "frozen" when a value is returned and the event loop is empty. This may cause a loss of log data.
  • Lambdas are terminated with a failure on timeout. If an external logging service is too slow, we might hit timeouts, but we do not want to fail the whole computation because of logging.
  • Lambdas fail, and might be executed again, when an unhandled exception occurs. We would like to catch unhandled exceptions in log commands.

Let's go deeper into the technical details. In the rest of this post I will present some toy examples to explain the main concepts. You can have a look at this GitHub repository if you want to play around on your own.

AWS Lambda Retries

In some cases, depending on the event source, AWS Lambda might retry the execution of a Lambda function that failed.

Since there is no roll-back mechanism, side effects are not undone on failure. This problem affects logging, as well. Fortunately, the solution is simple.

The general rule is to add the awsRequestId to log data (CloudWatch does this for you, but here we want to rely on an external service). A Lambda's context has an awsRequestId property that is unique per invocation. In other words, if a Lambda fails and is retried, it is executed again with the same awsRequestId.

Adding the awsRequestId to log data allows us to detect whether a log message is a duplicate.
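As a minimal sketch, a helper can stamp every log entry with the id taken from the handler's context object. Here `logEntry` is a hypothetical helper of our own, not part of any AWS SDK:

```javascript
// Attach the invocation's awsRequestId to every log entry so that
// duplicates caused by Lambda retries can be detected downstream.
// `logEntry` is a hypothetical helper, not an AWS API.
function logEntry(context, message, data = {}) {
  return {
    requestId: context.awsRequestId, // same value on every retry of this invocation
    timestamp: new Date().toISOString(),
    message,
    ...data,
  };
}

// Inside a handler, every log line then carries the id:
// exports.handler = async (event, context) => {
//   send(logEntry(context, 'Create user', { userId: event.id }));
//   ...
// };
```

Downstream, two entries with the same requestId and message can be treated as one.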


Frozen execution context

Ideally, we do not want to block the main computation while waiting for the logging service's responses. Creating a user is more important than logging the fact itself.

In addition, if we do not wait, we actually run each log command in parallel, making the execution faster.

Basically, what we can do is send HTTP requests to our favorite logging service without waiting for the promises to resolve.

This could work in traditional NodeJS, but it is doomed to fail in serverless.
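The fire-and-forget pattern looks roughly like this sketch, where `sendLog` is a stand-in for an HTTP POST to the logging service:

```javascript
// Fire-and-forget: the log promise is created but deliberately never
// awaited. `sendLog` is a hypothetical stand-in for an HTTP request.
function sendLog(message) {
  return new Promise((resolve) => setTimeout(() => resolve(message), 100));
}

function handler(event) {
  sendLog('Create user'); // not awaited: left pending on purpose
  return { statusCode: 200 }; // returns immediately
}
```

In plain NodeJS the process stays alive until the pending work completes; in Lambda, as we will see, the pending request may be frozen along with the execution context.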

You can try it yourself with the code in the repo. If the code is run in plain NodeJS, you should see something similar to this:

$ node delayed-logging/run.js
[2020-03-02T10:06:27.912Z] STARTED LAMBDA
[2020-03-02T10:06:27.914Z] before log
[2020-03-02T10:06:27.920Z] after log
[2020-03-02T10:06:27.920Z] before log
[2020-03-02T10:06:27.920Z] after log
[2020-03-02T10:06:27.921Z] END LAMBDA
node delayed-logging/run.js  0,08s user 0,01s system 1% cpu 5,113 total


The business logic of the Lambda function is executed immediately, but the computation ends only after 5 seconds, when the HTTP server returns its responses. Under the hood, NodeJS keeps the process alive while requests are pending.

The log messages are indeed printed in the server's logs, as you can easily verify.

❯ node server.js
Listening on port 3000!
[2020-03-02T10:06:32.939Z] LOGGED: {"message":"Create user"}
[2020-03-02T10:06:32.952Z] LOGGED: {"message":"Return"}

Let's see what happens in AWS.

# invoke and forget log promises
$ curl https://XXX.execute-api.YOUR-REGION.amazonaws.com/dev/functions/1

In the following picture, we show the logs for the lambda as well as for the logger. As you can see, the logger did not print any message. More precisely, the logger lambda did not start at all!

We lost log data.

Basically, the problem is bound to the execution semantics of AWS Lambdas.

The execution context is the runtime environment of a Lambda function and includes external HTTP endpoints, timeout handlers and other background processes.

When a Lambda function returns a value and the event loop is empty, the execution context is frozen for future reuse for some time.

Freezing the execution context is useful in many cases and leads to better performances because the initialization phase of external dependencies can be skipped. For example, we can take advantage of this feature to avoid recreating database connections at every invocation (from here).
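The connection-reuse idea can be sketched as follows. Anything initialized outside the handler body survives across warm invocations of the same execution context; `createConnection` here is a hypothetical, expensive initializer:

```javascript
// Execution-context reuse: module-level state persists across warm
// invocations. `createConnection` stands in for an expensive setup
// step such as opening a real database connection.
let connection = null;

function createConnection() {
  return { id: Math.random() }; // placeholder for a real DB client
}

function getConnection() {
  if (!connection) {
    connection = createConnection(); // runs only on a cold start
  }
  return connection;
}

// A handler would call getConnection() on every invocation and pay the
// initialization cost only once per execution context.
```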

If we have pending HTTP requests to an external service, as in the case of logging, the event loop is empty when the Lambda returns: the HTTP request callbacks have not been pushed onto the event loop yet. However, the execution context is not blank: the HTTP requests are still pending in the background.

So, the Lambda is frozen. If it is invoked again, the old execution context is resumed and the pending HTTP requests are completed during the second call. Maybe... Unfortunately, we cannot be sure about that! In fact, if a lambda stays idle for too long, it is killed and its execution context deleted.

In practice, we might lose log data, or see it in subsequent invocations, which is quite confusing and unpredictable. Thus, the general recommendation is:

You should make sure any background processes or callbacks in your code are complete before the code exits. source

We can verify this. If we invoke our lambda several times in quick succession, we see some logs, but others are still missing (though I am not sure whether this depends on some concurrency policy of API Gateway; any idea?).

$ repeat 5 { curl  https://XXX.execute-api.YOUR-REGION.amazonaws.com/dev/functions/1 }

Instead, if we add an await statement before the fetch requests (as the AWS documentation recommends), then all the logs pass through. (Remember to decrease the value of the LOGGER_TIMEOUT environment variable for the logger lambda in the AWS console; otherwise you get a timeout.)

# invoke with blocking logs
$ curl https://XXX.execute-api.YOUR-REGION.amazonaws.com/dev/functions/2

However, the main computation is blocked until the logger returns something, which is exactly what we wanted to avoid.

The solution is to wait for the logging promises before the end of the Lambda function. In other words, we need to collect the promises and wait for them all together (e.g. with Promise.all) just before returning a value.
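The promise-collector idea can be sketched like this; `sendLog` is again a hypothetical stand-in for the HTTP request to the logging service:

```javascript
// Promise collector: log calls are not awaited one by one. Instead,
// every log promise is collected, and the whole batch is flushed with
// Promise.all just before the handler returns.
const pending = [];

// Hypothetical stand-in for an HTTP request to the logging service.
function sendLog(message) {
  return new Promise((resolve) => setTimeout(() => resolve(message), 10));
}

function log(message) {
  pending.push(sendLog(message)); // collect, do not await
}

async function handler(event) {
  log('Create user');
  const user = { id: 1 }; // the main computation proceeds without blocking
  log('Return');
  await Promise.all(pending); // flush before the event loop empties
  return user;
}
```

The log requests run concurrently with the business logic and with each other; the handler only pays for the slowest one, right at the end.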

# invoke with promise collector and await at the end
$ curl https://XXX.execute-api.YOUR-REGION.amazonaws.com/dev/functions/3

This solution is also more efficient. If you run fun2 and fun3 with a slow logger (LOGGER_TIMEOUT equal to 5 seconds, for example), fun3 returns a value while fun2 times out. If LOGGER_TIMEOUT is low, you can measure latency more accurately using X-Ray or artillery. Not surprisingly, fun3 is faster than fun2.

If you want to understand this topic better, I found the following blog posts very useful; the information presented in this section comes from there. I recommend them also if you want to test your knowledge of the event loop!

Timeouts

If you have tried the code above, you might have experienced one or more timeouts. Indeed, lambdas are interrupted after a configurable timeout.

Our logging service is quite slow (by design), so the main Lambda function might time out if it waits for the logger's responses at the end of its body.

We do not want to cause a timeout because the logging service is too slow. In this case, we would like to interrupt logging preemptively if we are running out of time.

You may object that logging services are fast. But if you want to build a resilient distributed system, you should not trust other parties. You have no control over an external logging service. If it becomes unavailable for any reason, your serverless application will start behaving unpredictably.

If a timeout occurs, a Lambda might be executed again. So losing some log data is preferable to rerunning the whole computation.

However, interrupting logging does not mean that the log commands have had no effect. For all we know, the external service might already have received the HTTP request. A pending promise on our side only means that we have not received an HTTP response, or that the HTTP callback has not been executed yet.

Unfortunately, promises cannot be canceled in NodeJS. A practical solution is to race the pending log promises against a timer with the native Promise.race, as in the following snippet:

  let timeoutHandle;
  await Promise.race([
    // Either all pending log promises settle in time...
    Promise.all(promises).then(() => {
      if (timeoutHandle) clearTimeout(timeoutHandle);
    }),
    // ...or we give up on logging before the Lambda itself times out.
    new Promise((_resolve, reject) => {
      timeoutHandle = setTimeout(() => {
        reject(new Error('logging timed out'));
      }, LOGGER_TIMEOUT);
    })
  ]);


In this way, if the log promises settle in time, the timer is cleared and no stray handler is left in the execution context; otherwise, the setTimeout callback rejects the race and we stop waiting for Promise.all (credits).

Unhandled exceptions

Lambdas can fail for the following reasons:

  1. Timeouts
  2. Unhandled exceptions, e.g. TypeError or ReferenceError
  3. Running out of memory

I have not seen much of (3), and we discussed (1) previously. Here, I want to talk about unhandled exceptions.

Firstly, it is good practice to wrap a Lambda's body in a try-catch block. In this way, we can handle uncaught exceptions gracefully.

However, we do not want to fail the whole computation because of a TypeError in a log command.

We should try-catch individual log commands.
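A small wrapper makes this systematic. Here `safeLog` is a hypothetical helper of our own, not part of any library:

```javascript
// Swallow logging errors so that a TypeError inside a log command can
// never fail the main computation. `safeLog` is a hypothetical wrapper.
function safeLog(logFn, message) {
  try {
    return logFn(message);
  } catch (err) {
    // Logging must never break the business logic: drop the error.
    return undefined;
  }
}
```

Any bug inside the wrapped log command is absorbed, while well-behaved log calls go through unchanged.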

Bonus: I have a playground repo for testing the behavior of unhandled exceptions in AWS.

Final remarks

Building a logging solution in serverless is not easy. We need to take at least these steps:

  • Add context information to log messages (i.e. awsRequestId)
  • "Flush" pending log promises with Promise.all before returning a value
  • Preemptively interrupt logging if we are running out of time
  • Wrap log commands in try-catch

A lot of work for a lazy programmer like me.

Fortunately, it is a rote job that can be encoded in a library without rewriting the basics every time.

In an upcoming post, I will show you an elegant functional solution to this problem.
