Developer Fears: Breaking Production


Victor Guzman

Posted on May 27, 2020


About the series

This post is part of a series dedicated to talking about the biggest fears we face as developers.

There’s always a first time

If you’re a new dev, you might be thinking: “Ha! It’s never happened to me.” Well, sorry to ruin your moment, but it’s just a matter of time.

On the other hand, people who have been coding for a while have probably already done it, or know someone who did and has some great stories to tell!

But don’t get me wrong, this can be a good thing. Besides giving you funny stories to tell at the office, whether it was you or someone else, breaking production is one of the best opportunities you have to learn and grow.

And speaking of funny stories, let me tell you three of my favorites. More importantly, let’s see what we all learned from them and how that helped us improve.


The junior hire who deployed on their first day 😱

It was the first day for a new junior dev. As part of the onboarding process, we let them follow the README, which has instructions on how to pull and set up the project locally.

We were focused on our own stuff when we suddenly received a notification that a production deploy had been made... by the new guy.

After a few seconds of panic, we determined that the deploy had been triggered by a push to the canonical repo with no changes. Fortunately, nothing broke and there were no casualties. What happened next?

  1. Created a permissions policy: we realized we kept no track of who had access to what, which is what allowed the new hire to push to the canonical repo when he shouldn't have been able to. After that, all grants were revoked and a new process was set up to request access as needed.

  2. Improved the README: we also noticed that the root of the problem was that the document wasn’t clear on how or where to run the command. So we updated it, and we also started encouraging people to update it during onboarding if they notice something wrong with it.

The SQL query without a WHERE clause 😬

This is a common one, especially if you work with data.

There were a bunch of queries that needed to be executed to update records in the database. The guy was selecting and running one query at a time, and at some point he started screaming: "Rollback, rollback!!".

He had only half-selected the last query, leaving out the WHERE clause, and ended up updating ALL the records in the database.

What did we learn from it?

  1. Backups are really important: thankfully, a backup had been created before running the queries, so it was easy to restore. However, especially when it's a routine process, it's really easy to forget about backups and how important they are. Always make sure to create copies before starting any risky process.

  2. Always test before running it live: it doesn't matter if it's a query, a command, or a script. It's important to have another environment to test in before doing it in production.

💡 Pro tip: start your queries by writing the WHERE clause; that way you make sure you don't forget it (see the sketch below).
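
To make that concrete, here's a minimal sketch of how that last update could have been run more defensively. The table and column names are made up for illustration, and the exact transaction syntax varies by database engine; the idea is simply to wrap the statement in a transaction, sanity-check the affected row count, and only commit once it looks right.

```sql
-- Hypothetical example: table and column names are invented for illustration.
BEGIN;                      -- start a transaction so nothing is permanent yet

UPDATE users
SET    status = 'inactive'
WHERE  id = 42;             -- write this WHERE line first, then fill in the SET

-- The client reports how many rows were updated. If it says thousands
-- instead of one, something went wrong: undo it with ROLLBACK;
-- If the row count looks right, make it permanent:
COMMIT;
```

Running the same script against a staging copy of the data first, as mentioned in the list above, gives you one more chance to catch a missing WHERE before it matters.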

The day that rollback didn’t fix it 💀

This one actually happened to me.

We were seeing an issue in the staging environment that caused the page not to load correctly, leaving the application on a useless blank page.

We found the issue (or so we thought) and boom! It worked on staging. We immediately deployed to production, which was supposed to be safe, but then we got the same error.

So we did what we always do: revert the last deploy. Makes sense, right? Well, it didn't work!!

It took us an hour to figure out the problem, which means that users were unable to use the site for an hour, and that's bad, really bad. Here's what we learned from it:

  1. The backup plan won't always work: we learned this the hard way. The revert was our backup plan in case of fire, but it's not bulletproof. It's useful to have a backup plan for your backup plan. In our case, it was to call the DevOps team.

  2. It's not always the code's fault: in the end, we discovered that the issue wasn't in the code we deployed. Instead, it was a configuration in the deployment process that caused the dependencies not to update properly. So don't assume it's always the code; try to see the whole picture instead.

Conclusions

  • Be careful!! Always double-check what you're doing and, if possible, ask someone else to look at it.

  • See something? Say something! The sooner you report the issue and everyone is aware of it, the sooner it can be solved and stopped from getting worse.

  • Always learn as much as you can from it. I'm not saying that breaking production is a good thing, nor am I encouraging you to do it. But if it happens, as long as we come out of it with a lesson learned or a process improved, it should be way less painful.

Have you ever broken production? Do you have any funny stories to share? Tell us!
