Self healing code

naarok

John Zittlau

Posted on April 19, 2022

Self healing code

Yup, a grandiose title, but something to think about and try for when it makes sense.

What I mean by "self-healing code" is writing code such that when a problem happens the code automatically reacts in such a way that the current user is unaware of the problem and future users do not trigger the problem.

A pretty common pattern that does this something like this is the Circuit Breaker, although I suggest taking it further. The circuit breaker simply returns an error once it trips. Self-healing code ideally does something more helpful to the user.

Let's say you are writing code to find an optimal route between two points. You have a trivial solution that you wrote, but then find the super-duper-always-perfect route as an API online. Now suppose that API can be unstable and not always return a result, especially under load.

A self-healing solution could be to fall back to your trivial solution when you detect a failure. In addition, your application could remember that the super-duper solution is having problems and maybe not send requests its way until a cooldown period has passed.

Your customers still get a route. Maybe not the best route, but some route is likely better than no route. Additionally, except for the first failed call, remaining calls don't waste time going to a broken API and so the response to your customer is faster. Finally, the super-duper solution is given a break to recover. Eventually you start calling the super-duper solution again and all is good. This is basically the concept of Graceful Degradation. Graceful degradation fits into what I'm thinking, but what if there are scenarios where you can return the exact results back to the user after an error rather than maybe not the best route as above? This is the ultimate dream of self-healing code.

Here is another example that I actually went through that doesn't fit the Circuit Breaker pattern and goes further than Graceful Degradation. This led to my thinking on self-healing code.

Since HTTP GET requests should only be doing reads from the database, we figured we could easily distribute traffic between our primary database and our read-replica database by automatically sending DB reads from GET requests to the read-replica.

The problem. We discovered that we were actually writing to the database in our GET requests. Not all of them, but enough to make it an issue. We decided to fix the GET requests to do the right thing so we could go forward with this plan.

The problem. There were enough GETs that wrote to the DB to make it too large of an effort to fix them all. The benefits of the project wouldn’t balance the costs.

The insight. We could keep a "skip list" of GET routes that do a write to the DB. Then, we could automatically send GET requests to the read-replica database unless they are in the skip list.

The problem. Again, we have many GETs that write to the database and no easy search patterns that would assure us that we could identify them all in our codebase.

The self-healing insight: We can default to sending all GET requests to the read-replica database. If a write happens within the processing of that request, it will error out since it can't write to the read-only replica database. Then, we can detect that error and re-run the full request against the primary database. The user will be oblivious to the problem except for a slightly longer response time. The self-healing part is that along with re-running the request, we record this route into the skip-list. Now at most one user (roughly, threading complexities aside) will see a delayed response. All other users will automatically just go to the primary database because the route is on the skip-list.

The extra win. This becomes a comprehensive list of routes that need fixing. As we fix the routes, we can remove them from the skip-list.

This allowed us to immediately start seeing benefits from our work to move traffic to the read-replica database. We can focus on the most common requests that will have the biggest lift, and requests that are so rare they are maybe used a handful of times a day can be deprioritized. We'll fix it eventually because writing to the database on a GET is just wrong, but we don't have to fix every single bad call before our database can breathe a sigh-of-relief.

Image description

In the end, this was a big win, and this way of thinking can likely be applied in many other places. The concept of letting the code both gracefully detect an error and find another way of solving the problem is huge. Coupling this with the code remembering the error so it doesn't keep trying takes it to the next level. This can be leveraged in all sorts of refactoring attempts, particularly complicated cross-cutting concerns. Keep this in your back pocket! Any time you can simplify a big-bang solution to small bites, it is almost always worth the effort to do so.

About Jobber
We're hiring for remote positions across Canada at all software engineering levels!
Our awesome Jobber technology teams span across Payments, Infrastructure, AI/ML, Business Workflows & Communications. We work on cutting edge & modern tech stacks using React, React Native, Ruby on Rails, & GraphQL.
If you want to be a part of a collaborative work culture, help small home service businesses scale and create a positive impact on our communities, then visit our careers site to learn more!

💖 💪 🙅 🚩
naarok
John Zittlau

Posted on April 19, 2022

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related