Pooling in aioredis/redis-py may be dangerous
Denis Anikin
Posted on June 6, 2022
This story happened to me a couple of days ago. It taught me a few things, and I'm ready to share them. You can recognize them by the bold highlights throughout the text. The story may look terrifying or entertaining to you, depending on your point of view.
It was a regular Friday. I was prepared to spend the day as usual: meetings, since I'm a certified Zoom expert (team lead), and some coding, because, you know, I'm still a programmer too.
Suddenly, I heard a loud boom from production: our beautiful microservice, distributed, event-driven, almost reactive architecture said «whoops, a couple of services are dead, bye, that was fun». What followed were 14 head-crushing hours of Zoom mob debugging, which led us to an impressive conclusion: the keydb (a performance-oriented Redis fork) cluster was dead and could not be restarted properly, because every time we started a new master node, we got tens of thousands of connections from our services in the blink of an eye. Then we hit the connection limit overflow and… got a dead master node again. And it couldn't be fixed with a reboot.
And here I have to start the list of things I learned that day. The first point of my story: keydb has really strange mechanics for handling connection overflow. It can crash the database. I don't know how to properly handle this kind of situation other than increasing the maximum number of connections on the server side and limiting the pool size on the client side. But perhaps it will lead you to your own conclusions.
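For illustration, here is a minimal sketch of both knobs, assuming a plain aioredis 2.x client; the hostnames and the numbers are placeholders, not recommendations:

import asyncio
import aioredis  # aioredis 2.x

async def check_limits() -> None:
    # Client-side cap: never let this process open more than N connections.
    client = aioredis.Redis(host="localhost", port=6379, max_connections=100)
    # Server-side cap: the keydb/redis `maxclients` setting.
    print(await client.config_get("maxclients"))
    await client.config_set("maxclients", 20000)  # placeholder value
    await client.close()

asyncio.run(check_limits())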
The second point: you should monitor connections everywhere in your cluster/services, regardless of their type, and especially if we are talking about any kind of database. We didn't, and it cost us a lot of stress.
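A trivial watchdog is often enough. This is just a sketch, assuming aioredis 2.x and printing instead of exporting to a real metrics system:

import asyncio
import aioredis

async def report_connections() -> None:
    client = aioredis.Redis(host="localhost", port=6379)
    try:
        while True:
            # INFO clients returns, among other things, the current client count.
            info = await client.info("clients")
            print("connected_clients:", info["connected_clients"])
            await asyncio.sleep(5)
    finally:
        await client.close()

asyncio.run(report_connections())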
So, back to our tragic story. Why did we get so many connections on each restart? It looks insane: our project does not handle many requests per second, about 40-50 at peak times. I must admit there is some multiplication because of «hops» between microservices, but that can't produce tens of thousands of connections at once, no matter what. So, how and why?
Well, it was hard to determine, but we found a «superposition of mistakes», as I call such things.
First, there was the aioredis library. We use the sentinel-based client because it gives us failover easily. Aioredis spawns a pool of connections that transparently reconnects (and here is the third thing: FOREVER, hello DDoS) to our sentinel nodes, and then to the master node. It is supposed to do so. We also found that if you don't limit the maximum number of connections, the library will do it for you and set it to 2 ** 31 (here you can see it); this is the fourth thing. Furthermore, the pool in our version (2.0.1) is not closed automatically, which makes the problem worse.
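To make the defaults concrete, here is a small sketch assuming the aioredis 2.0.x API; it shows the 2 ** 31 default and the explicit cleanup you have to do yourself:

import asyncio
import aioredis  # aioredis 2.0.x

async def main() -> None:
    pool = aioredis.ConnectionPool.from_url("redis://localhost:6379")
    print(pool.max_connections)  # 2147483648, i.e. 2 ** 31, unless you set it

    client = aioredis.Redis(connection_pool=pool)
    await client.ping()

    # Nothing closes the pool for you in 2.0.1: do it explicitly.
    await client.close()
    await pool.disconnect()

asyncio.run(main())

In our case nothing ever called disconnect(), so every new pool simply added to the pile.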
Upd. If you are concerned about pool auto-closing/cleanup, just check redis-py: it now includes aioredis, and this issue is being solved there. It is a complete drop-in replacement for aioredis itself:
from redis import asyncio as aioredis
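The rest of your code can stay the same. Here is a small sketch of the drop-in usage, assuming redis-py 4.2+ (explicit cleanup is still a good habit):

import asyncio
from redis import asyncio as aioredis  # redis-py 4.2+

async def main() -> None:
    client = aioredis.Redis(host="localhost", port=6379, max_connections=100)
    await client.set("greeting", "hello")
    print(await client.get("greeting"))
    await client.close()  # still release the pooled connections explicitly

asyncio.run(main())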
However, back to our code. As if all of this were not enough for us, another piece of the puzzle was the simple fact that we try to follow defensive programming techniques. Because of them, we used the connection pool in a very specific way: as a context manager, creating a new connection pool every time in every coroutine. On top of that, we had added the backoff library everywhere we work with keydb. In other words, the power of the DDoS on our database was just epic, all thanks to our good intentions, of course. And poor keydb couldn't handle it in any way, I guess.
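Roughly, the shape of the mistake (and the obvious fix) looks like this. This is only a sketch: the master name, hosts, and handler names are made up, and the real code was more involved:

import backoff
import aioredis

# Anti-pattern (roughly what we had): every coroutine builds its own pool,
# and backoff multiplies the reconnect storm on every failure.
@backoff.on_exception(backoff.expo, Exception, max_tries=5)
async def handle_event_bad(key: str) -> None:
    sentinel = aioredis.sentinel.Sentinel([("localhost", 26379)])
    master = sentinel.master_for("mymaster")  # fresh pool on every call
    await master.set(key, "processed")        # ...and nobody closes it

# Safer shape: one client (and therefore one bounded pool) per process.
_sentinel = aioredis.sentinel.Sentinel([("localhost", 26379)], max_connections=100)
_master = _sentinel.master_for("mymaster")

@backoff.on_exception(backoff.expo, Exception, max_tries=5)
async def handle_event_good(key: str) -> None:
    await _master.set(key, "processed")

The second variant keeps the pool bounded and reused, so backoff retries the command instead of rebuilding connections on every attempt.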
A couple more words about aioredis/redis-py pooling. Let's look at the code extracted from the library:
def __init__(
    self,
    connection_class: Type[Connection] = Connection,
    max_connections: Optional[int] = None,
    **connection_kwargs,
):
    max_connections = max_connections or 2**31
    if not isinstance(max_connections, int) or max_connections < 0:
        raise ValueError('"max_connections" must be a positive integer')
Be prepared: there is not a word in the documentation about the max_connections option. But you should pass it if you don't want the kind of trouble we got:
sentinel = aioredis.sentinel.Sentinel(
    [("localhost", 26379), ("sentinel2", 26379)],
    max_connections=1000
)
Moreover, while we were trying to stop the fire from spreading, the crashed consumer where our keydb-related code lived was unable to handle (it's hard to handle anything when you are dead) around 100k messages from Kafka. That also amplified the DDoS and blocked us from fixing the service.
You may ask: how did we fix this horrible mess? We wrote another consumer that handled those messages from Kafka (it just threw them away; they were unimportant), patched the library, and revived our shiny master. It sounds boring, but it was hard as hell.
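If you are curious, a throwaway consumer like that can be very small. This is just a sketch assuming aiokafka; the topic, broker, and group names are placeholders for whatever your dead service was consuming:

import asyncio
from aiokafka import AIOKafkaConsumer

async def drain_topic() -> None:
    consumer = AIOKafkaConsumer(
        "events-topic",                  # placeholder topic
        bootstrap_servers="kafka:9092",  # placeholder broker
        group_id="dead-service-group",   # same group id as the dead service
        auto_offset_reset="earliest",
    )
    await consumer.start()
    try:
        async for _msg in consumer:
            pass  # drop the message; auto-commit acknowledges the offset
    finally:
        await consumer.stop()

asyncio.run(drain_topic())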
I hope the things from this article make your life a bit easier if you ever come across similar architectures and circumstances. Peace!
Upd. A colleague of mine opened an issue in redis-py (aioredis has been merged into this project recently): https://github.com/redis/redis-py/issues/2220. Please vote for this issue, it's a serious problem.