Spring Boot performance benchmarks with Tomcat, Undertow and Webflux

Julien Dubois

Posted on April 6, 2020

Tomcat vs Undertow vs Webflux

JHipster is used by thousands of people to generate production-ready Spring Boot applications. We've been using Undertow for years, with great success, but as we are planning for JHipster 7 we started discussing migrating away from Undertow.

This is the kind of discussion which happens very frequently in the JHipster community: with so many people contributing, we often test and try alternatives to our current approach.

We ended up discussing 3 different application servers:

  • Undertow, from Red Hat/IBM: it is known for being lightweight, and we have years of (good) experience with it.
  • Tomcat, from the Apache Software Foundation: by far the most popular option, and the default that ships with Spring Boot.
  • Webflux, from VMware: this isn't really an application server; it is Spring Webflux running on top of Netty. This implies using reactive APIs, which are supposed to provide better performance and scalability. It's a whole different approach, which is also supported by JHipster.

The test applications

I created a performance benchmarks GitHub repository, with applications generated by JHipster.

Those applications are more complex and realistic than simple "hello, world" applications created with Spring Boot. For instance, they use Spring Security and Spring Boot Actuator: those libraries will impact application start-up time and performance, but they are what you would use in the real world.

That said, they are not as complex as they could be: I didn't configure a database or an application cache. Those would make running the performance tests a lot more complicated, and wouldn't add any specific value: as we would use the same drivers or caching solution with all three application servers, we would end up testing the same things.

In the GitHub repository, you will find 4 directories: one for each application, and one with the performance tests.

Using Azure Virtual Machines for the test

As we needed to set up a test environment, using the cloud was obviously the best solution! For this test, I created 2 virtual machines with their own private network.

One of the machines hosted the application server, and the other one ran the performance test suite.

The VMs were created using the Ubuntu image provided by default by Azure, on which I installed the latest LTS AdoptOpenJDK JVM (openjdk version "11.0.6" 2020-01-14).

To configure your virtual machines easily, I maintain this script which you might find useful: https://github.com/jdubois/jdubois-configuration.

Start-up time

Start-up time is all the rage now, but for JHipster we usually give more importance to runtime performance. That's why we use Afterburner, for example: it slows down our start-up time, but gives about 10% better runtime performance than a "normal" Spring Boot application.

Here are the start-up times, in seconds, after 10 rounds for each application server (the "Difference" row compares each mean to Undertow's):

Round       Undertow    Tomcat     Webflux
1           4.879       5.237      4.285
2           4.847       5.125      4.225
3           4.889       5.103      4.221
4           5.013       5.129      4.232
5           4.840       5.134      4.271
6           5.007       5.141      4.191
7           4.868       5.214      4.147
8           4.826       5.032      4.251
9           4.856       5.069      4.274
10          4.908       5.078      4.128
Mean        4.8933      5.1262     4.2225
Difference  (baseline)  +4.76%     -13.71%
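
If you want to double-check the "Difference" row, here is a quick Scala sketch (not part of the benchmark repository) that recomputes the means and the relative differences from the values above:

```scala
// Quick sketch: recompute the means and the "Difference" row
// from the start-up times above, expressed in seconds.
object StartupStats extends App {
  val undertow = List(4.879, 4.847, 4.889, 5.013, 4.840, 5.007, 4.868, 4.826, 4.856, 4.908)
  val tomcat   = List(5.237, 5.125, 5.103, 5.129, 5.134, 5.141, 5.214, 5.032, 5.069, 5.078)
  val webflux  = List(4.285, 4.225, 4.221, 4.232, 4.271, 4.191, 4.147, 4.251, 4.274, 4.128)

  def mean(xs: List[Double]): Double = xs.sum / xs.size

  // Relative difference against the Undertow mean, in percent.
  def diffVsUndertow(xs: List[Double]): Double =
    (mean(xs) - mean(undertow)) / mean(undertow) * 100

  println(f"Tomcat:  ${diffVsUndertow(tomcat)}%+.2f%%")  // prints +4.76%
  println(f"Webflux: ${diffVsUndertow(webflux)}%+.2f%%") // prints -13.71%
}
```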

As expected, Undertow is lighter than Tomcat, but the difference is quite small; Webflux, which runs on Netty rather than a Servlet container, started the fastest of all.

The runtime performance test

To measure runtime performance, we needed a specific test suite.

The performance tests were written in Scala for the Gatling load-testing tool. They are pretty simple (we are just doing POST and GET requests), and they are available here.

This test does the following:

  • Each user does 100 GET requests and 100 POST requests, sending one request every 1.5 seconds.
  • We will have 10,000 users doing those requests, with a ramp-up of 1 minute.

The objective here is to stay under 5,000 requests per second, as when you go above that level you will usually need to do some specific OS tuning.
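
The actual simulation is in the GitHub repository linked above; as a minimal sketch of the same load model, it looks roughly like the code below. The base URL, entity endpoint, JSON payload and status codes are placeholders, and the authentication that the real tests need (the generated applications use Spring Security) is left out:

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

// Minimal sketch of the load model described above (placeholder endpoints and
// payload; the repository's simulations also handle authentication).
class EntityStressTest extends Simulation {

  val httpProtocol = http
    .baseUrl("http://10.0.0.4:8080") // placeholder: private IP of the application VM
    .acceptHeader("application/json")

  val scn = scenario("GET and POST on an entity")
    .repeat(100) { // each user does 100 GET and 100 POST requests
      exec(http("Get all entities").get("/api/entities").check(status.is(200)))
        .pause(1500.milliseconds) // one request every 1.5 seconds
        .exec(
          http("Create entity")
            .post("/api/entities")
            .body(StringBody("""{"name": "test"}""")).asJson
            .check(status.is(201))
        )
        .pause(1500.milliseconds)
    }

  // 10,000 users, ramped up over 1 minute
  setUp(scn.inject(rampUsers(10000).during(1.minute))).protocols(httpProtocol)
}
```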

Undertow performance benchmarks

The Undertow results were quite good: it handled the whole load without losing a single request, and it delivered about 2,700 requests per second:

[Graph: Undertow requests per second]

The response time was quite high, with nearly all users waiting about 3 seconds for a response. But it was also quite stable (or "fair" to all users), as the 50th percentile is not that far from the 99th percentile (or even the "max" time!):

[Graph: Undertow response time percentiles]

Tomcat performance benchmarks

Tomcat had about 5% of its requests failing:

[Graph: Tomcat request errors]

Those failures explain why the graph below doesn't look very good. Tomcat was also delivering only about 2,100 requests per second (compared to 2,700 requests per second for Undertow):

[Graph: Tomcat requests per second]

Last but not least, the response time was good for about 10% of the requests, but it was much worse than Undertow's at the 95th and 99th percentiles, which shows that Tomcat could not handle all requests correctly. That's also why it had such a bad standard deviation (2.76 seconds!):

[Graph: Tomcat response time percentiles]

Webflux performance benchmarks

Webflux had about 1% of its requests failing:

[Graph: Webflux request errors]

Those failures happened at the beginning of the test, and then the server handled the load correctly: it looks like it had trouble coping with the traffic growth because the ramp-up was quite sudden, and then it stabilized.

We can also notice that, once stabilized, Webflux had some strange variations - this is why we see all those peaks in the blue graph below: it would suddenly go from handling nearly 5,000 requests/second to less than 1,000 requests/second. On average, it was handling a bit more than 2,700 requests/second, so that's the same as Undertow, but with big variations that Undertow didn't have.

[Graph: Webflux requests per second]

The variations that we noticed in the previous graph also explain why, compared to Undertow, Webflux has a lower 50th percentile, but a higher 95th percentile. And that's also why its standard deviation is much worse:

[Graph: Webflux response time percentiles]

Conclusion

Undertow definitely had impressive results in those tests! Compared to Tomcat, it started up faster, handled more load, and had far more stable throughput. Compared to Webflux, which has a completely different programming model, the difference was smaller: Webflux started faster, but had 1% of its requests fail at the beginning of the test - it looks like it had some trouble handling the load at first, but that wasn't a huge issue.

For JHipster, this is probably one of the many choices we have made that make JHipster applications much faster and more stable than standard Spring Boot applications. So this performance test will definitely weigh heavily in our decision to keep Undertow or move away from it. If you'd like to participate in the discussion, because you ran more tests or have some good insight, please don't hesitate to comment on our "migrating away from Undertow" ticket, or on this blog post.
