Why Running Databases on Kubernetes is a Recipe for Disaster: The Case for a New Platform Designed for Stateful Workloads

alialp

Ali Alp

Posted on October 23, 2024

Kubernetes has become a powerhouse for managing containerized applications, especially stateless workloads, because of its scalability and automation. However, running databases, which are critical, stateful systems, is much more complicated. Kubernetes support for stateful workloads has improved, but significant challenges remain in storage handling, failover, and replication. These issues suggest that perhaps we need a completely new platform built specifically for stateful workloads like databases, rather than trying to make Kubernetes do something it wasn’t originally designed for.

Let’s explore why running databases on Kubernetes remains risky and why we might need a platform tailored to handle databases' unique demands.

1. CSI Crashes and Storage Attachments: Still a Risk

The Issue: The Container Storage Interface (CSI) is the standard interface through which storage drivers provision, attach, and mount volumes for Kubernetes. CSI drivers have gotten better over time, but they can still fail, and a failure at the wrong moment can contribute to data loss or corruption in databases.

Why It’s a Problem: Databases rely on constant, stable access to storage. If a CSI driver crashes while storage is being reattached, for example after a node failure or pod eviction, the database can be left without its volume, and in the worst case it might lose data or get corrupted.

A Better Solution: A new platform specifically built for databases could provide better storage management, ensuring that databases always stay connected to their storage even during failures. This would reduce the risk of data loss or corruption significantly.

2. Immature Database Operators: Still Not Perfect

The Issue: Database operators in Kubernetes are responsible for automating tasks like setting up the database, handling backups, and managing failovers. While some operators are now quite robust, others are still maturing, and things can go wrong, especially in complex scenarios like failovers.

Why It’s a Problem: Even with improved operators, there’s still the risk of errors, especially during critical moments like failovers or upgrades. This could lead to data inconsistencies or even corruption, which is unacceptable for production databases.

A Better Solution: A platform built specifically for databases could come with native tools that handle these tasks reliably, without the need for third-party operators. This would simplify database management and reduce the risks of running critical operations.

3. Data Loss and Corruption: Risks from Pod Evictions, Node Failures, and Network Issues

The Issue: Kubernetes isn’t great at handling stateful applications when things go wrong, such as when nodes fail, pods get evicted, or network issues occur. These events can disrupt databases and cause data loss or corruption if they aren’t carefully managed.

Why It’s a Problem: Without careful tuning, databases running on Kubernetes can face serious risks from these kinds of disruptions. For example, network partitions can cause a "split-brain" scenario, where two database replicas think they are the primary, leading to conflicting data.

A Better Solution: A dedicated platform for databases would handle these situations better by providing built-in mechanisms to ensure data consistency and prevent issues like split-brain from occurring in the first place.
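
To make the split-brain risk concrete, here is a minimal Python sketch of the majority-quorum rule that replicated systems commonly use to avoid it. The function name `can_act_as_primary` and the three-node cluster are illustrative assumptions, not the API of any particular database or operator.

```python
def can_act_as_primary(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True only if this node can reach a strict majority of the cluster.

    reachable_nodes counts the node itself plus every peer it can contact.
    """
    return reachable_nodes > cluster_size // 2


# A 3-node cluster split by a network partition into groups of 2 and 1.
cluster_size = 3
majority_side = can_act_as_primary(reachable_nodes=2, cluster_size=cluster_size)
minority_side = can_act_as_primary(reachable_nodes=1, cluster_size=cluster_size)

print(majority_side)  # True
print(minority_side)  # False
```

Because two disjoint groups can never both hold a strict majority, at most one side of a partition is allowed to keep a primary. An even split (for example 2 and 2 in a 4-node cluster) leaves no side with a majority, which is one reason odd cluster sizes are usually preferred.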

4. Replica Lag and Network Bottlenecks: A Constant Struggle

The Issue: In distributed databases, replication is key to keeping data in sync across multiple instances. On Kubernetes, network congestion and I/O bottlenecks can lead to replication delays (also known as replica lag), which can cause major problems.

Why It’s a Problem: If the network gets too congested, replication may fall behind, meaning that if the primary database fails, the replicas may not have the latest data. This could result in data loss or inconsistencies during a failover.
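
The data-loss window during a failover can be estimated with a small Python sketch. It assumes each replica reports how many bytes of the primary's write log it has applied; the `Replica` class, the `pick_failover_candidate` function, and the numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    replayed_bytes: int  # how much of the primary's write log this replica has applied


def pick_failover_candidate(
    primary_written_bytes: int, replicas: list[Replica]
) -> tuple[Replica, int]:
    """Return the most caught-up replica and the bytes of writes it is missing."""
    if not replicas:
        raise ValueError("no replicas available for failover")
    best = max(replicas, key=lambda r: r.replayed_bytes)
    return best, primary_written_bytes - best.replayed_bytes


replicas = [Replica("replica-a", 9_500), Replica("replica-b", 10_000)]
candidate, missing = pick_failover_candidate(10_240, replicas)
print(candidate.name, missing)  # replica-b 240
```

In this example even the best replica is 240 bytes behind, so promoting it would lose those writes unless replication is synchronous or the missing log can be recovered from the failed primary.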

A Better Solution: A platform built specifically for databases would prioritize network and I/O resources for replication, ensuring that databases always stay in sync, even when other workloads are running on the same infrastructure.

5. Kubernetes Wasn’t Built for Databases

The Issue: Kubernetes was originally designed for stateless applications. While it now supports stateful workloads with features like StatefulSets and PersistentVolumes, these were added later and aren’t ideal for databases. Running a database well requires specialized handling of things like backups, disaster recovery, and failovers.

Why It’s a Problem: Without native support for these critical database tasks, organizations end up using a mix of third-party tools and custom scripts to manage things like backups and failovers. This adds complexity and increases the chance of errors, making it harder to ensure database reliability.

A Better Solution: A new platform could offer all of these features out of the box, specifically designed with databases in mind. That means built-in support for backups, disaster recovery, and seamless failover, reducing the need for custom solutions and making databases easier to manage.

6. DBAs Now Need to Be Kubernetes Experts

The Issue: Running databases on Kubernetes has blurred the line between traditional database administrators (DBAs) and Kubernetes administrators (referred to here as CKAs, after the Certified Kubernetes Administrator certification). DBAs now need to understand Kubernetes deeply, or CKAs need to learn how to manage databases.

Why It’s a Problem: Managing both Kubernetes infrastructure and databases is a complex task. Expecting a DBA to also become a Kubernetes expert—or expecting a CKA to know the intricacies of databases—adds a lot of complexity. This skill gap can lead to operational issues and downtime if not handled properly.

A Better Solution: A new platform designed for stateful workloads could abstract away much of the complexity of Kubernetes, allowing DBAs to focus on managing databases, without needing to learn the ins and outs of Kubernetes infrastructure. This would simplify the skill requirements and reduce operational risks.

Points to Consider:

1. Kubernetes Ecosystem Maturity

Kubernetes has come a long way in supporting stateful workloads. Tools like StatefulSets and CSI drivers are maturing, and many database operators are becoming more reliable. However, the complexity and learning curve involved in running databases on Kubernetes remain high.

The Takeaway: Even though Kubernetes is evolving, it still wasn’t designed with databases in mind. For teams without deep Kubernetes and database expertise, a simpler, purpose-built platform could offer a better solution with less overhead.

2. Building a New Platform Adds Complexity

While building a new platform for stateful workloads might solve some of these issues, it could also create its own set of problems. A new platform means new learning curves, migration challenges, and the risk of fragmenting the ecosystem.

The Takeaway: While a dedicated platform for databases could be ideal, it’s important to consider the overhead of learning and migrating to a new system. The trade-offs between short-term complexity and long-term reliability need to be weighed carefully.

3. Can We Integrate These Features into Kubernetes?

Instead of building an entirely new platform, we could explore whether the features needed for stateful workloads, like better backup and failover handling, could be integrated into Kubernetes itself or provided as extensions.

The Takeaway: While Kubernetes is general-purpose by nature, it has a strong ecosystem of extensions. It’s worth exploring whether we can enhance Kubernetes to better handle stateful workloads rather than starting from scratch.

Conclusion: Time for a Stateful Revolution

Running databases on Kubernetes can work, but it still comes with significant risks and challenges. Kubernetes was not designed with stateful workloads like databases in mind, and while the ecosystem has improved, the complexity of managing databases on Kubernetes remains high. This has led to an ongoing debate: Should we continue to push Kubernetes to do something it wasn’t originally designed for, or is it time to build a new platform specifically for stateful workloads like databases?

While Kubernetes will continue to evolve, a dedicated platform designed for databases could offer a simpler, more reliable solution. Such a platform would be optimized for the needs of stateful workloads, reducing the complexity and risks of running critical databases in production.

Whether through a new platform or better integration within Kubernetes, the future of managing databases needs to focus on reducing operational complexity, ensuring data reliability, and allowing teams to focus on what matters—keeping their databases secure, scalable, and highly available.


Appendix A: Responding to the Challenges: Practical Approaches to Achieve Better Solutions

Addressing these database challenges on a purpose-built platform rather than on Kubernetes would require more than theoretical solutions; it would involve dedicated, database-specific infrastructure and processes to mitigate risks. Here’s how some of these "better solutions" could be implemented:

Enhanced Storage Orchestration and Persistent Connectivity

In cloud environments, ensuring databases maintain constant connectivity to storage—even during failures—demands specialized orchestration. A platform built for databases could prioritize this by:

  • Implementing persistent connection protocols specifically for database storage to facilitate rapid reconnection after temporary disruptions.
  • Offering automated storage fallback mechanisms, where replicas or secondary storage instances stay available and in sync to minimize downtime and data loss during a primary storage disconnection.
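
The rapid-reconnection idea in the first bullet can be sketched as a retry loop with exponential backoff. This is a minimal Python illustration under stated assumptions: `attach` stands in for whatever call would reattach the volume, and `reconnect_with_backoff` and its parameters are hypothetical names, not part of any real storage API.

```python
import time
from typing import Callable


def reconnect_with_backoff(
    attach: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call attach() until it succeeds, doubling the wait after each failure.

    Returns the number of attempts used; raises if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if attach():
            return attempt
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    raise ConnectionError(f"storage still unavailable after {max_attempts} attempts")


# Simulated volume that becomes reachable on the third attempt.
outcomes = iter([False, False, True])
delays: list[float] = []
attempts = reconnect_with_backoff(lambda: next(outcomes), sleep=delays.append)
print(attempts, delays)  # 3 [0.1, 0.2]
```

The `sleep` parameter is injected so the example runs without actually waiting. A real platform would also need to decide what the database does while the volume is detached, which this sketch does not model.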

Dedicated Networking for Database Traffic

Network congestion and I/O bottlenecks often disrupt replication and failover processes. To address this, a dedicated platform could prioritize database communications through:

  • Network segmentation, which reserves specific channels for database operations, reducing competition for bandwidth.
  • Quality-of-service (QoS) policies that prioritize database replication traffic, helping maintain synchronization across instances even when other workloads are active.
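
One simple way to picture a QoS policy that favors replication is a strict-priority queue. The Python sketch below is an assumption-laden illustration: the traffic classes, the `PriorityLink` class, and the payload strings are invented for the example, and real network QoS is enforced by the network stack or hardware rather than by application code like this.

```python
import heapq
import itertools

# Lower number = higher priority. These traffic classes are illustrative.
PRIORITY = {"replication": 0, "client": 1, "batch": 2}


class PriorityLink:
    """Strict-priority queue: replication payloads always leave first."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a class

    def enqueue(self, traffic_class: str, payload: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[traffic_class], next(self._order), payload))

    def dequeue(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


link = PriorityLink()
link.enqueue("batch", "nightly-report")
link.enqueue("client", "SELECT 1")
link.enqueue("replication", "wal-segment-42")
link.enqueue("replication", "wal-segment-43")

sent = [link.dequeue() for _ in range(len(link))]
print(sent)
# ['wal-segment-42', 'wal-segment-43', 'SELECT 1', 'nightly-report']
```

Strict priority keeps replication moving under load, but it can starve lower classes entirely, so a practical policy would likely reserve some share of bandwidth for other traffic.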

Native Failover, Backup, and Split-Brain Prevention

Unlike Kubernetes, which requires third-party operators and configurations for failover and backup, a database-centric platform could incorporate these capabilities natively:

  • Built-in split-brain prevention mechanisms could ensure that only one database instance can function as the primary, using quorum-based protocols to maintain consistency.
  • Automated failover tools would reassign roles across replicas promptly in response to node or network failures, ensuring continuity and minimizing inconsistencies.
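
The two bullets above can be combined in a small sketch: promotion requires a majority vote, and each promotion bumps an epoch number that acts as a fencing token, so a deposed primary cannot keep writing. The `Cluster` class below is a hypothetical in-memory stand-in for what would realistically be a consensus-backed store, and the node names are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    members: set[str]
    primary: Optional[str] = None
    epoch: int = 0  # incremented on every promotion; acts as a fencing token

    def promote(self, candidate: str, votes: set[str]) -> bool:
        """Promote candidate only if a strict majority of members voted for it."""
        valid_votes = votes & self.members
        if candidate not in self.members or len(valid_votes) <= len(self.members) // 2:
            return False
        self.primary = candidate
        self.epoch += 1
        return True

    def accept_write(self, sender: str, sender_epoch: int) -> bool:
        """Reject writes from any node that is not the primary of the current epoch."""
        return sender == self.primary and sender_epoch == self.epoch


cluster = Cluster(members={"db-0", "db-1", "db-2"})
cluster.promote("db-0", votes={"db-0", "db-1"})
old_epoch = cluster.epoch  # db-0 remembers epoch 1

# db-0 becomes isolated; the remaining majority elects db-1.
cluster.promote("db-1", votes={"db-1", "db-2"})

print(cluster.accept_write("db-0", old_epoch))      # False
print(cluster.accept_write("db-1", cluster.epoch))  # True
print(cluster.promote("db-2", votes={"db-2"}))      # False
```

The stale primary is rejected because its epoch is out of date, and a single node cannot promote itself without a majority. Failure detection and the voting protocol itself are left out of this sketch.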

Database-Focused Management Tools

One of the primary benefits of a specialized platform would be its alignment with the skills and workflows of database administrators, minimizing the need for Kubernetes-specific expertise:

  • A dedicated management console for database configuration, monitoring, and troubleshooting would streamline database-specific tasks, removing Kubernetes abstractions.
  • Automated backup and recovery capabilities, modeled after traditional RDBMS management, would reduce the need for custom scripts, further lowering the risk of errors during critical operations.

These focused solutions show that a platform designed specifically for stateful workloads could provide significant advantages over Kubernetes by addressing databases’ unique demands for reliability and consistency. Whether through enhancing Kubernetes or developing a new platform, the future of stateful workload management will likely focus on reducing operational complexity and maintaining data integrity, ensuring that teams can manage databases efficiently and securely.
