AWS Lambda SnapStart - Part 1 Initial measuring of Java 11 Lambda cold starts

vkazulkin

Vadym Kazulkin

Posted on December 4, 2022

This article was updated on December 6 (see below)

Introduction

In recent years I have talked a lot about adopting Java for the serverless world on AWS, as I'm very passionate about both Java and AWS Serverless. You can watch one of these talks. The basic message was that cold starts are pretty significant in this area and may impact many applications depending on their architecture (certainly public-facing APIs). With GraalVM and its ahead-of-time compilation you can improve those cold starts a lot, but GraalVM has its own set of challenges: not every dependency that you'd like to use in your application may be GraalVM-ready, and Native Image build times are quite high (up to several minutes), which hurts the developer experience. So I was pretty excited when SnapStart was announced at the re:Invent conference this week, and I wanted to give it a try.

How SnapStart works

Mark Sailes from AWS wrote about this in great detail, so please take a look at his post. In general, a lot of innovation was required around fast Firecracker VM restores and a JVM technology called Coordinated Restore at Checkpoint (CRaC) to make it work for Java. Currently SnapStart is only available for the Java 11 (Corretto) runtime.

Current limitations of SnapStart

The list of limitations is quite long and worth reading. SnapStart does not currently support provisioned concurrency, the arm64 architecture, Amazon Elastic File System (Amazon EFS), AWS X-Ray, or ephemeral storage greater than 512 MB. Additionally, you can enable SnapStart only for a published version of the Lambda function and not for $LATEST.

Project Setup

I created a very basic project with AWS SAM for a first test.

I wrote the Lambda function GetProductByIdWithSnapStart, which makes a DynamoDB read to get a product by its id, and gave it 1024 MB of memory. During static initialization I created the DynamoDB client like this:

private static final DynamoDbClient dynamoDbClient = DynamoDbClient.builder()
    .credentialsProvider(DefaultCredentialsProvider.create())
    .region(Region.EU_CENTRAL_1)
    .overrideConfiguration(ClientOverrideConfiguration.builder()
      .build())
    .build();

Then I enabled SnapStart for this Lambda function in its configuration section.

After deploying the Lambda function with this change, you'll see that a series of INIT invocations has been executed and logged in the function's CloudWatch log group. They are run to take a snapshot of the JVM state after the function's static initializer block has run.
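To illustrate why the static initializer matters here, the following is a minimal sketch in plain Java with no AWS dependencies. `ExpensiveClient` is a hypothetical stand-in for something like the DynamoDB client above, and `INIT_COUNT` exists only for demonstration. It shows that a static field is initialized once when the class is loaded and is then reused by every call to the handler, which is the kind of state a snapshot taken after static initialization would already contain.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class StaticInitSketch {
    // Counts how often the expensive setup runs (for illustration only).
    static final AtomicInteger INIT_COUNT = new AtomicInteger();

    // Hypothetical stand-in for an expensive client such as DynamoDbClient.
    static final class ExpensiveClient {
        ExpensiveClient() {
            INIT_COUNT.incrementAndGet();
        }

        String getProduct(String id) {
            return "product-" + id;
        }
    }

    // Static field: initialized once when the JVM loads this class,
    // before the first handler invocation.
    private static final ExpensiveClient CLIENT = new ExpensiveClient();

    // Simplified handler; a real Lambda handler would implement RequestHandler.
    public static String handleRequest(String productId) {
        return CLIENT.getProduct(productId);
    }

    public static void main(String[] args) {
        System.out.println(handleRequest("1"));
        System.out.println(handleRequest("2"));
        // The client was constructed only once, no matter how many requests ran.
        System.out.println("init count: " + INIT_COUNT.get());
    }
}
```

Both calls reuse the same client instance, so the setup cost is paid once per JVM rather than once per request.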

Measuring the cold starts

First of all, before enabling SnapStart for the Lambda function GetProductByIdWithSnapStart, I measured its average cold start, which was around 4.5 seconds.

It's currently very tricky to measure cold starts for a SnapStart-enabled function because X-Ray tracing is not supported. So I wrote another Lambda function, GetProductByIdWithOutSnapStart, and synchronously called GetProductByIdWithSnapStart from it like this:

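// Build a synchronous invocation request for a specific published version
InvokeRequest invokeRequest = InvokeRequest.builder()
    .functionName("GetProductByIdWithSnapStart")
    .qualifier("7") // version number of the published function
    .build();
InvokeResponse invokeResponse = lambdaClient.invoke(invokeRequest);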

Please note that you have to define the qualifier (in this case 7), which is used to build the full ARN of the Lambda function including the version number. I then simply wrote a log entry right before invoking GetProductByIdWithSnapStart and another one at the beginning of the handleRequest method of the GetProductByIdWithSnapStart function. The difference between the two timestamps can be considered roughly the cold start time. Of course there is latency involved in invoking a Lambda function from another function, but in my experiments it was around 0.1 seconds, so it can be neglected.
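The timestamp arithmetic can be sketched in a few lines of Java. This is only an illustration under assumptions: the class and method names are hypothetical, the timestamps in `main` are made up, and it assumes the clocks behind both log entries are close enough to be compared. The invocation overhead is a parameter, so passing `Duration.ZERO` corresponds to neglecting it as described above.

```java
import java.time.Duration;
import java.time.Instant;

public class ColdStartEstimate {

    // Estimated cold start = (handler start - time before invoke) - invocation overhead.
    static Duration estimate(Instant beforeInvoke, Instant handlerStart, Duration invokeOverhead) {
        return Duration.between(beforeInvoke, handlerStart).minus(invokeOverhead);
    }

    public static void main(String[] args) {
        // Made-up timestamps standing in for the two log entries.
        Instant beforeInvoke = Instant.parse("2022-12-04T10:00:00.000Z");
        Instant handlerStart = Instant.parse("2022-12-04T10:00:01.700Z");

        // Subtract an assumed 0.1 s function-to-function invocation latency.
        Duration coldStart = estimate(beforeInvoke, handlerStart, Duration.ofMillis(100));
        System.out.println(coldStart.toMillis() + " ms"); // prints "1600 ms"
    }
}
```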

I invoked GetProductByIdWithOutSnapStart once per hour so that each invocation ran into a cold start, and measured the cold starts as described above.
The average cold start time was around 1.6 seconds.

I also ran a test to figure out whether there is a difference between the first cold start after publishing a new function version and subsequent cold starts without publishing, but I couldn't identify any significant difference.

In the CloudWatch Logs of the function GetProductByIdWithSnapStart, you'll see JVM restore entries for the cold starts. In my case the restore took only 235 ms. So the more interesting question is what took the remaining 1.365 seconds. Maybe it's the Firecracker VM restore, maybe something else.

Conclusions and next steps

The average cold start time was around 1.6 seconds, which is a huge win compared to 4.5 seconds without SnapStart. There is surely a lot of room for further optimization, as the technologies are quite new. Most important for me is that in most cases I don't need to make any changes to the existing function's code.

Currently you can achieve cold starts of around 600 ms with GraalVM Native Image for a Lambda function with 1024 MB of memory and a similar architecture.

I'll experiment more (e.g. trying to use other AWS services like SQS and SNS) and contact AWS experts to get more details and visibility into what exactly is happening under the hood.

I'd also like to make similar measurements with SnapStart-enabled Lambda functions written using frameworks like Quarkus, Micronaut and Spring and compare the results. More to come in the upcoming blog posts!

Update on December 6

There is a more precise way to measure the cold start of the function: executing a CloudWatch Logs Insights query that calculates the cold start duration as the sum of Restore Duration (if there is one) and Duration.

The results of running this query (in milliseconds) are:
p50 1266.05
p90 1306.85
p99 1326.81

These numbers are even more impressive. With this approach there is no need to write an extra Lambda function that calls the SnapStart-enabled Lambda function.
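For readers who prefer to see the sum spelled out, here is a rough sketch of the same calculation in plain Java for a single REPORT log line. It is not the Logs Insights query itself. The class name, the regular expressions and the sample line are assumptions made for illustration, and the assumed field layout should be checked against your own log output.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReportLineColdStart {
    // "Duration: <n> ms" that is not part of "Billed Duration" or "Restore Duration".
    private static final Pattern DURATION =
        Pattern.compile("(?<!Billed )(?<!Restore )Duration: ([0-9.]+) ms");
    // "Restore Duration: <n> ms"; only present when a snapshot was restored.
    private static final Pattern RESTORE =
        Pattern.compile("(?<!Billed )Restore Duration: ([0-9.]+) ms");

    // Returns the first numeric match of the pattern, if any.
    private static OptionalDouble first(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find()
            ? OptionalDouble.of(Double.parseDouble(m.group(1)))
            : OptionalDouble.empty();
    }

    // Cold start = Restore Duration (0 if absent) + Duration.
    static double coldStartMillis(String reportLine) {
        double duration = first(DURATION, reportLine)
            .orElseThrow(() -> new IllegalArgumentException("no Duration field found"));
        double restore = first(RESTORE, reportLine).orElse(0.0);
        return restore + duration;
    }

    public static void main(String[] args) {
        // Made-up sample line in the assumed REPORT format.
        String sample = "REPORT RequestId: example\tDuration: 1000.50 ms\t"
            + "Billed Duration: 1001 ms\tMemory Size: 1024 MB\t"
            + "Max Memory Used: 150 MB\tRestore Duration: 235.25 ms";
        System.out.println(coldStartMillis(sample)); // prints 1235.75
    }
}
```

A line without a Restore Duration field simply yields its Duration, which mirrors the "if there is one" condition above.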

Update 1: You can further reduce the cold start times of a SnapStart-enabled Lambda function significantly by applying an optimization technique called priming. Learn more about it in my article.
Update 2: Because of the AWS fix for correctly displaying the snapshot restore time, and the insights from measuring end-to-end API Gateway latency in part 5, I re-measured the cold start times in part 7, Re-measuring of Java 11 Lambda cold starts.
