From Production Outage to the Front Page of Hacker News

Adam Tal

Posted on May 20, 2023

Hello, everyone! I'm Adam, currently the Director of Payments Engineering at Vimeo. It's graduation season, meaning many are about to embark on their own professional journeys. It's an exciting time, and in the spirit of this, I'd like to share a story from my career, which spans over a decade in engineering.

We often celebrate our wins and milestones, as we should, but we tend to gloss over the challenging times, the missteps, and the lessons they bring. I've experienced my fair share of these lessons; I've brought down production at least once at every company I've worked for.

While that might sound alarming, it's in those moments of crisis that I've learned the most, grown exponentially as an engineer, and come away with valuable insights.

Today, I'd like to share one such story.

To set the stage, I was leading the Platform Engineering team at a startup credit card company. Buried in refactoring work, I discovered an empty database (DB) table named "cards." It was a mystery – not referenced anywhere in the code, devoid of data, and mentioned in dusty old documents earmarked for deletion.

After discussing it with colleagues and doing some due diligence, I decided to axe it. A DB migration to drop the table was prepared and launched. A simple enough task. But as we're aware in tech, simple rarely equates to trouble-free.

Within 15 minutes post-deployment, our production environment buckled and crashed. A seemingly innocuous migration had spiraled into an engineering catastrophe. As the orchestrator of this chaos, I was now on a mission for answers.

Diving into DB logs and audit trails, the puzzle began to unravel. The empty table had a foreign key (FK) to another table. To drop it, Postgres needed an exclusive lock on the referenced table as well. That table? ... "users." A relatively essential table, hard to get a lock on.

What followed behaved like a database deadlock. The migration query could not secure the locks it needed because of hostage-taking sessions that already held locks on "users." Meanwhile, its own pending request for an exclusive lock on that main application table forced every later query to queue up behind it, forming a conga line that would not move.

The application, overwhelmed by the surge of waiting queries, went down. The immediate resolution? Terminate the migration query.
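The queueing behavior described above can be sketched with a toy model. This is not how Postgres is implemented; it is a simplified simulation with only two lock modes, and the session names (`long_report`, `migration`, `web_0`, and so on) are hypothetical, since the original incident logs are not shown here. It assumes the rule that a new request must wait if it conflicts with either a granted lock or a request already waiting in line.

```python
from collections import deque
from dataclasses import dataclass, field

ACCESS_SHARE = "AccessShareLock"          # taken by plain SELECTs
ACCESS_EXCLUSIVE = "AccessExclusiveLock"  # taken by DROP TABLE


def conflicts(a: str, b: str) -> bool:
    """In this two-mode model, AccessExclusiveLock conflicts with everything."""
    return ACCESS_EXCLUSIVE in (a, b)


@dataclass
class TableLock:
    """Toy lock manager for a single table (hypothetical, for illustration)."""
    granted: list = field(default_factory=list)    # (session, mode) pairs
    waiting: deque = field(default_factory=deque)  # FIFO queue of (session, mode)

    def request(self, session: str, mode: str) -> bool:
        """Grant the lock, or enqueue the request if it conflicts with a holder or a waiter."""
        blocked = any(conflicts(mode, m) for _, m in self.granted) or any(
            conflicts(mode, m) for _, m in self.waiting
        )
        if blocked:
            self.waiting.append((session, mode))
            return False
        self.granted.append((session, mode))
        return True

    def cancel(self, session: str) -> None:
        """Remove a waiting request (like terminating the query), then wake the queue."""
        self.waiting = deque((s, m) for s, m in self.waiting if s != session)
        self._wake()

    def _wake(self) -> None:
        """Grant waiters in FIFO order until one still conflicts with a holder."""
        while self.waiting:
            _, mode = self.waiting[0]
            if any(conflicts(mode, m) for _, m in self.granted):
                break
            self.granted.append(self.waiting.popleft())


def simulate():
    users = TableLock()
    users.request("long_report", ACCESS_SHARE)    # an existing reader holds a lock
    users.request("migration", ACCESS_EXCLUSIVE)  # DROP TABLE has to wait for it
    for i in range(3):
        users.request(f"web_{i}", ACCESS_SHARE)   # new reads queue behind the migration
    queued_before = [s for s, _ in users.waiting]
    users.cancel("migration")                     # terminate the migration query
    queued_after = [s for s, _ in users.waiting]
    return queued_before, queued_after


if __name__ == "__main__":
    before, after = simulate()
    print("waiting while migration is queued:", before)
    print("waiting after migration is cancelled:", after)
```

In this model the ordinary reads are compatible with the lock that `long_report` holds, yet they still wait because the exclusive request is ahead of them in line. Cancelling the migration lets them all through at once, which matches the immediate resolution in the story.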

From this tumultuous experience, the idea for my tool was born.


pg_explain_locks "DROP TABLE cards"

+-------------+---------------+---------------------+
| Relation ID | Relation Name | Lock Type           |
+-------------+---------------+---------------------+
| 16415       | cards         | AccessExclusiveLock |
| 16416       | users         | AccessExclusiveLock |
+-------------+---------------+---------------------+

This ordeal became an intensive crash course in database fundamentals, a challenging yet enriching learning experience that left an indelible mark on my professional journey.

In the aftermath, I developed a tool to better understand how DB locks interact and how they affect an application. Four years later, this humble tool found itself on the front page of Hacker News.

Continue to build, stumble, and learn. The journey is long, and each step, misstep, or fall is a stepping stone toward growth. You never know what your next mistake might teach you.
