POST-MORTEM
John Otienoh
Posted on June 28, 2024
Database Connection
Issue Summary
Duration: 2 hours 55 minutes (14:55 UTC - 17:50 UTC).
Impact: The website's search functionality was affected. 20% of users experienced slow load times, with average page load times increasing by 7 seconds.
Root cause: Misconfigured database connection pooling, which led to connection timeouts.
Timeline
- 14:55 UTC: Monitoring alerts triggered for high response times on the search functionality.
- 15:00 UTC: Engineer on call investigates and notices high CPU usage on the database server.
- 15:10 UTC: Initial assumption is that the issue is related to a recent code deployment, and the team begins reviewing code changes.
- 15:20 UTC: Investigation reveals no issues with the code deployment, and attention turns to the database server.
- 15:40 UTC: Misleading investigation path: The team explores the possibility of a database query optimization issue.
- 16:00 UTC: Escalation to the database administration team.
- 16:20 UTC: Root cause identified: misconfigured database connection pooling.
- 16:55 UTC: Configuration changes made to the database connection pooling.
- 17:50 UTC: Issue resolved, page load times return to normal.
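To see where the time went, the timeline can be broken into phases. The sketch below is only arithmetic on the timestamps listed above. The phase names (`time_to_diagnose`, `time_to_fix`, `time_to_recover`) are labels chosen for this illustration, not terms used by the team.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC, same day).
EVENTS = {
    "alert": "14:55",
    "root_cause": "16:20",
    "fix_applied": "16:55",
    "resolved": "17:50",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Alert fired -> root cause identified.
time_to_diagnose = minutes_between(EVENTS["alert"], EVENTS["root_cause"])
# Root cause identified -> configuration change made.
time_to_fix = minutes_between(EVENTS["root_cause"], EVENTS["fix_applied"])
# Configuration change made -> load times back to normal.
time_to_recover = minutes_between(EVENTS["fix_applied"], EVENTS["resolved"])
# Whole incident, which should match the stated 2 hours 55 minutes.
total = minutes_between(EVENTS["alert"], EVENTS["resolved"])
```

Diagnosis took 85 of the 175 minutes, so nearly half of the incident was spent finding the cause.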
Root Cause and Resolution
The root cause of the issue was a misconfigured database connection pooling setting, which led to connection timeouts and increased CPU usage on the database server. This caused slow load times for 20% of users, affecting the website's search functionality.
The issue was resolved by adjusting the database connection pooling settings to optimize connection reuse and reduce timeouts. This change was made in collaboration with the database administration team.
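This post does not name the pooling library or the exact setting that was wrong, so the following is only a minimal sketch of the general mechanism, written with the Python standard library. The names `ConnectionPool`, `PoolTimeout`, `size`, and `acquire_timeout` are hypothetical. It shows how a pool that is too small for the load makes callers wait and then fail with a timeout.

```python
import queue


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the acquire timeout."""


class ConnectionPool:
    """A fixed-size pool that hands out pre-created connections (illustrative only)."""

    def __init__(self, factory, size: int, acquire_timeout: float):
        self._idle = queue.Queue(maxsize=size)
        self._timeout = acquire_timeout
        for _ in range(size):
            self._idle.put(factory())

    def acquire(self):
        # Block until a connection is free, or give up after the timeout.
        try:
            return self._idle.get(timeout=self._timeout)
        except queue.Empty:
            raise PoolTimeout("no free connection in the pool") from None

    def release(self, conn) -> None:
        # Return the connection so that another caller can reuse it.
        self._idle.put_nowait(conn)


if __name__ == "__main__":
    # `object` stands in for a real database connection factory.
    pool = ConnectionPool(factory=object, size=1, acquire_timeout=0.1)
    first = pool.acquire()
    try:
        pool.acquire()  # the pool is exhausted, so this waits and then fails
    except PoolTimeout as exc:
        print(f"timed out: {exc}")
    pool.release(first)
```

In a sketch like this, a larger `size` or faster release of connections reduces the chance that `acquire` times out, which is the kind of connection reuse the fix aimed for.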
Corrective and Preventative Measures
- Improve database connection pooling configuration and monitoring.
- Implement automated testing for database connection pooling settings.
- Enhance monitoring for CPU usage on the database server.
- Conduct regular reviews of database server performance and configuration.
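As one possible shape for the automated testing item above, a configuration check can run before deployment and reject obviously unsafe pool settings. This is a sketch under assumptions: the field names (`pool_size`, `acquire_timeout_s`, `app_instances`, `db_max_connections`) and the rules themselves are hypothetical, not the team's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSettings:
    """Hypothetical pool settings for one application deployment."""

    pool_size: int             # connections held by each application instance
    acquire_timeout_s: float   # how long a caller waits for a free connection
    app_instances: int         # number of application instances sharing the database
    db_max_connections: int    # connection limit configured on the database server


def find_pool_problems(s: PoolSettings) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if s.pool_size < 1:
        problems.append("pool_size must be at least 1")
    if s.acquire_timeout_s <= 0:
        problems.append("acquire_timeout_s must be positive")
    if s.pool_size * s.app_instances > s.db_max_connections:
        problems.append("total pooled connections exceed db_max_connections")
    return problems
```

A test suite or deployment pipeline could fail the build whenever `find_pool_problems` returns a non-empty list.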
TODO List
- Patch database connection pooling configuration to optimize connection reuse.
- Add monitoring for database connection timeouts.
- Implement automated testing for database connection pooling settings.
- Schedule regular database server performance reviews.
- Develop a playbook for troubleshooting database connection issues.
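The timeout monitoring item above could start as a simple sliding-window counter. The sketch below assumes the application can call a hook each time a connection attempt times out. The class name `TimeoutRateMonitor` and its window and threshold parameters are illustrative, not part of any existing monitoring tool.

```python
from collections import deque


class TimeoutRateMonitor:
    """Counts connection timeouts inside a sliding time window (illustrative only)."""

    def __init__(self, window_s: float, threshold: int):
        self._window_s = window_s
        self._threshold = threshold
        self._events = deque()  # timestamps of recent timeouts, oldest first

    def record_timeout(self, now: float) -> None:
        """Record one connection timeout that happened at time `now` (seconds)."""
        self._events.append(now)

    def should_alert(self, now: float) -> bool:
        """Return True if the window ending at `now` holds at least `threshold` timeouts."""
        # Drop timestamps that have fallen out of the window.
        while self._events and self._events[0] <= now - self._window_s:
            self._events.popleft()
        return len(self._events) >= self._threshold
```

In real use the caller would pass `time.monotonic()` as `now`. Taking the time as an argument keeps the class easy to test with fixed timestamps.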
This postmortem highlights the importance of thorough investigation and collaboration between teams to resolve complex issues. By identifying and addressing the root cause of the problem, we can prevent similar issues from occurring in the future and improve the overall reliability of our services.