The Intriguing Issue of 429 Errors in Cloud Infrastructure
Jacob Schatz
Posted on April 26, 2023
In cloud infrastructure, a problem in one component often shows up as a confusing error in another. One such instance occurred in Calendar.dev's Kubernetes deployment, where a peculiar bug caused HTTP 429 (Too Many Requests) errors from DigitalOcean's API. This article details the investigation and shares what was learned, for both education and amusement.
Upon deploying Calendar.dev via CI/CD, 429 errors were received from DigitalOcean's API, implying that the limit of 250 requests per minute had been exceeded. Despite confidence that far fewer requests were being made during deployment, DigitalOcean representatives insisted otherwise. In search of a resolution, the number of replicas was reduced, but even with just one, Kubernetes reported an image pull error. A careful count revealed no more than 10 requests made to DigitalOcean within a 10-minute window, which left the situation perplexing.
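The count described above can be sketched in Python. This is a minimal illustration, not the tooling used during the investigation: it assumes request timestamps (in seconds) can be exported from your own logs, and the function names are hypothetical. The 250-per-minute figure is the limit mentioned above.

```python
from bisect import bisect_right


def max_requests_in_window(timestamps, window_seconds=60.0):
    """Return the largest number of requests seen in any window of the given length."""
    times = sorted(timestamps)
    best = 0
    for end_index, end_time in enumerate(times):
        # Index of the first request inside the window (end_time - window, end_time].
        start_index = bisect_right(times, end_time - window_seconds)
        best = max(best, end_index - start_index + 1)
    return best


def exceeds_limit(timestamps, limit=250, window_seconds=60.0):
    """True if any window contains more requests than the limit allows."""
    return max_requests_in_window(timestamps, window_seconds) > limit


if __name__ == "__main__":
    # Hypothetical data: ten requests, one per minute, as in the count above.
    observed = [i * 60.0 for i in range(10)]
    print(max_requests_in_window(observed), exceeds_limit(observed))
```

With ten requests spread over ten minutes, no 60-second window comes anywhere near the limit, which is why the 429 responses made no sense at this stage.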
Further investigation revealed that the rate limit error stemmed from domain checking. The primary issue was traced to a recent domain change to calendar.dev. The Kubernetes (k8s) cert-manager setup was outdated and had not been updated to refresh the HTTPS certificate after the domain switch. Consequently, cert-manager attempted to provision new certificates but failed to validate the domain, because the domain was incorrectly pointing to the external load balancer IP.
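A quick way to test this kind of suspicion is to compare what a hostname actually resolves to with the addresses you expect. The sketch below is an assumption-laden illustration rather than the check that was run at the time: the function names are hypothetical, and the addresses in the usage example are placeholders from the documentation range, not real infrastructure.

```python
import socket


def resolve_ips(hostname):
    """Return the set of IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def compare_dns(hostname, expected_ips, resolver=resolve_ips):
    """Compare resolved addresses with the expected ones and report the difference."""
    actual = set(resolver(hostname))
    expected = set(expected_ips)
    return {
        "matches": actual == expected,
        "unexpected": sorted(actual - expected),  # resolved but not expected
        "missing": sorted(expected - actual),     # expected but not resolved
    }


if __name__ == "__main__":
    # Placeholder resolver and addresses so the example runs without network access.
    fake_resolver = lambda name: {"203.0.113.10"}
    print(compare_dns("calendar.dev", ["198.51.100.7"], resolver=fake_resolver))
```

Passing the resolver as a parameter keeps the comparison testable offline; calling `compare_dns` without it uses the system resolver.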
DNS was initially dismissed as a potential cause, because the SSL certificate appeared to work without issue and the error only surfaced when the number of replicas was changed. The absence of an invalid SSL certificate warning turned out to be due to Cloudflare, which had recently been added to the setup and automatically provided an SSL certificate of its own. Unaware of this automatic provisioning, cert-manager continued to run, leading to an unintended overlap of SSL certificates.
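One way to notice an overlap like this is to look at who issued the certificate that visitors are actually served. The following is a minimal sketch using Python's standard `ssl` module; the helper names are hypothetical, and the sample certificate in the usage example is made-up data in the shape returned by `getpeercert()`, not a real certificate from this incident.

```python
import socket
import ssl


def fetch_peer_cert(hostname, port=443, timeout=5.0):
    """Connect over TLS and return the server certificate as a dict."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()


def issuer_fields(cert):
    """Flatten the nested issuer tuples from getpeercert() into a plain dict."""
    return {name: value for rdn in cert.get("issuer", ()) for name, value in rdn}


def issuer_organization(cert):
    """Return the issuer's organization name, or None if it is absent."""
    return issuer_fields(cert).get("organizationName")


if __name__ == "__main__":
    # Made-up certificate data so the example runs without network access.
    sample_cert = {
        "issuer": (
            (("countryName", "US"),),
            (("organizationName", "Example CA Inc."),),
            (("commonName", "Example Issuing CA"),),
        )
    }
    print(issuer_organization(sample_cert))
    # Against a live site: issuer_organization(fetch_peer_cert("calendar.dev"))
```

If the issuer is not the one your cluster's certificate tooling is configured to use, something in front of the cluster is terminating TLS with its own certificate.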
In summary, each production deployment was met with 429 errors as Kubernetes attempted to spin up new pods. The excessive API requests to DigitalOcean were a result of Kubernetes, through cert-manager, repeatedly trying to renew an improperly configured SSL certificate for a domain that did not point where it expected. Cloudflare's automatic SSL certificate provisioning meant there was no visible certificate error, which masked the issue.
In hindsight, the DNS problem was difficult to identify because nothing appeared to be wrong with DNS, so attention was directed toward potential Kubernetes configuration mistakes instead. This was an interesting experience, and it underscores the importance of understanding how cloud infrastructure components interact, such as Cloudflare and SSL certificates, to prevent similar problems.
There's an old saying:
No, it is not a compiler error. It is never a compiler error.
When you're at your wits' end trying to solve a bug, it is easy to assume the worst, which can stop you from exploring other areas and simply asking "what if?"