Fauna Deep Dive: Architecting a Distributed Serverless Database


Luis Eduardo Colon

Posted on December 15, 2022


The distributed serverless nature of the Fauna database architecture illustrates how well-architected composability makes modern system complexity a manageable endeavor.

Today’s serverless application developers have learned many lessons from the complex monolithic systems of the past, driving the popularity of composable systems built from microservices. This reality further validates Gall's Law, a rule of thumb for systems design formulated by John Gall in the 1986 book Systemantics: How Systems Really Work and How They Fail. The law states: “A complex system that works is invariably found to have evolved from a simple system that worked.” Indeed, when these serverless components are well implemented, they abstract and delegate complexity in a fashion that makes them ideal building blocks for modern application architectures, where evolving complexity is far more manageable than it is with monoliths.

“A complex system that works is invariably found to have evolved from a simple system that worked.” - Gall's Law

Gall’s Law came to mind as I recently reviewed an architectural overview authored by our engineers here at Fauna. Fauna is a distributed serverless document-relational database, delivered as a global API endpoint, that meets the requirements of both simple and complex modern applications. The architectural overview, available here, outlines the versatility of its design. The engineering team’s work in building and operating Fauna is based on state-of-the-art research and industry progress in databases and operating systems, in addition to a wealth of hands-on operational experience gained over several years.

The Fauna database service has been rigorously battle-tested over time – tens of thousands of customers have created hundreds of thousands of databases, stored hundreds of terabytes of data, and sent billions of requests to the service in production. Considering how difficult it has been to distribute data in traditional operational databases over the years, I believe the architectural overview illustrates how Fauna’s core architecture layers enable the creation of massively scaled, complex, and performant database applications while efficiently abstracting operational complexity from developers.

As a fully featured, general-purpose database service, Fauna balances the capabilities expected by the most demanding database application developers with the core guarantees of traditional database systems. Before diving deep into its architecture, we’ll briefly summarize these characteristics to better understand the broad feature set that the underlying architecture empowers.


The Fauna Serverless Database: A 3-minute Primer

At the heart of Fauna is a document-relational data model that combines the flexibility and familiarity of JSON documents with the relationships and querying power of a traditional relational database. Fauna's fundamental data building blocks are structured documents grouped in collections. Compared to traditional relational databases, these documents are equivalent to rows, while collections are equivalent to tables.
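To make the document-relational model concrete, here is a minimal sketch using the classic FQL (v4) JavaScript driver. The collection names, fields, and the FAUNA_SECRET environment variable are illustrative assumptions, not details from the architectural overview:

```typescript
import faunadb from "faunadb";

const q = faunadb.query;
// Assumes a FAUNA_SECRET environment variable holding a database key.
const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET! });

async function createOrder() {
  // Collections are roughly analogous to tables; documents to rows.
  await client.query(q.CreateCollection({ name: "orders" }));

  // Documents are schemaless JSON-like objects stored under "data".
  // The "customer" field holds a reference to a document in another collection.
  return client.query(
    q.Create(q.Collection("orders"), {
      data: {
        customer: q.Ref(q.Collection("customers"), "101"),
        items: ["sku-1", "sku-2"],
        total: 250,
      },
    })
  );
}
```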

Developers can execute joins across document collections by using references to other documents. Fauna indexes behave like SQL views: they support searching on one or more document fields and define the return values for matching documents. The return values can also enforce uniqueness constraints and determine the order of matching documents, and indexes can define bindings that perform logic on field values during indexing. Like SQL indexes, Fauna indexes make queries faster than a table scan. Rather than fighting an optimizer that can become unpredictable under subtle dataset or query changes, all queries name an index explicitly.
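As a sketch of how indexes and reference-based joins might look, again with the v4 JavaScript driver and a hypothetical orders_by_customer index:

```typescript
// Hypothetical index: look up orders by their customer reference and
// return each order's total along with its document reference.
await client.query(
  q.CreateIndex({
    name: "orders_by_customer",
    source: q.Collection("orders"),
    terms: [{ field: ["data", "customer"] }],                   // search term
    values: [{ field: ["data", "total"] }, { field: ["ref"] }], // return values
  })
);

// Queries name the index explicitly rather than relying on an optimizer.
// Mapping Get over the matches dereferences each order, i.e. a simple join.
const orders = await client.query(
  q.Map(
    q.Paginate(
      q.Match(q.Index("orders_by_customer"), q.Ref(q.Collection("customers"), "101"))
    ),
    q.Lambda(["total", "ref"], q.Get(q.Var("ref")))
  )
);
```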

Fauna has a GraphQL endpoint that provides a simple way to execute GraphQL queries. GraphQL schema resolvers can be implemented as user-defined functions (UDFs). Any custom complex query and UDF can be constructed in the native, functional, and Turing-complete FQL (Fauna Query Language). FQL’s functional nature is well suited to building composable queries; think of this composability as being as powerful as chaining Linux shell commands with pipes.
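A hedged sketch of a UDF and a composed call, with the function name and parameters invented for illustration:

```typescript
// A UDF can back a GraphQL resolver or be called directly from FQL.
await client.query(
  q.CreateFunction({
    name: "submit_order",
    body: q.Query(
      q.Lambda(
        ["customerId", "total"],
        q.Create(q.Collection("orders"), {
          data: {
            customer: q.Ref(q.Collection("customers"), q.Var("customerId")),
            total: q.Var("total"),
          },
        })
      )
    ),
  })
);

// FQL expressions compose like piped shell commands: the output of one
// expression feeds directly into the next.
const receipt = await client.query(q.Call(q.Function("submit_order"), ["101", 250]));
```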

Fauna provides elegant options to secure data, including restricting access based on user identity and the attributes (that is, the fields or columns) being accessed, while allowing a hierarchical database structure to serve as an additional permission boundary. Underlying temporal document versions preserve all content for configured retention periods.
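Attribute-based access control can be sketched along these lines; the role name, collections, and predicate below are hypothetical and assume customers authenticate as documents in a customers collection:

```typescript
// Hypothetical role: authenticated customers may only read their own orders.
await client.query(
  q.CreateRole({
    name: "customer_read_own_orders",
    membership: [{ resource: q.Collection("customers") }],
    privileges: [
      { resource: q.Index("orders_by_customer"), actions: { read: true } },
      {
        resource: q.Collection("orders"),
        actions: {
          // Predicate-based privilege: read is allowed only when the order's
          // customer reference matches the caller's identity document.
          read: q.Query(
            q.Lambda(
              "ref",
              q.Equals(
                q.CurrentIdentity(),
                q.Select(["data", "customer"], q.Get(q.Var("ref")))
              )
            )
          ),
        },
      },
    ],
  })
);
```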


Providing Consistency, Availability, Durability, and Performance

Inspired by the Calvin transaction protocol, Fauna delivers strictly serializable transactions from anywhere in the world, unlike most distributed databases. Strict serializability is widely considered the optimal consistency model for databases: it guarantees that all transaction operations (including sub-operations) take place atomically and that this guarantee holds across the entire system as a whole.

The Fauna consistency model is designed to deliver strict serializability across transactions in a globally-distributed cluster without compromising scalability, throughput, or read latency. All read-write transactions are strictly serializable based on their position in a global transaction log because the order reflects the real-time processing order. Read-only transactions are serializable with an additional Read Your Own Writes (RYOW) guarantee, which is facilitated by the driver maintaining a high watermark of the most recent logical transaction time observed from any prior transaction.
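Conceptually, the RYOW watermark can be pictured as follows. This is not the actual driver code, only an illustration of a client that remembers the highest transaction time it has observed and declines to read from an older snapshot:

```typescript
// Conceptual sketch only: how a driver could maintain a "read your own
// writes" watermark. Names and shapes here are illustrative, not Fauna's API.
class WatermarkingClient {
  private lastSeenTxnTime = 0; // logical transaction time, monotonically raised

  async query(expr: unknown): Promise<unknown> {
    const response = await this.send({
      expr,
      // Ask the service not to answer from a snapshot older than this.
      minTxnTime: this.lastSeenTxnTime,
    });
    // Raise the watermark using the transaction time observed in the response.
    this.lastSeenTxnTime = Math.max(this.lastSeenTxnTime, response.txnTime);
    return response.data;
  }

  private async send(req: { expr: unknown; minTxnTime: number }) {
    // Placeholder transport; a real driver would POST to the global endpoint.
    return { data: null as unknown, txnTime: Date.now() * 1000 };
  }
}
```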

Fig. 1. Consistency models of distributed database systems. Source: Jepsen.io

All cloud providers offer availability zones and regions, enabling them to contain the blast radius of service interruptions. Fauna replicates data across zones in a single region, across regions in a region group, as well as across region groups around the world. Read requests can be served regardless of whether other zones or regions are accessible. Write requests require communication between a majority of zones or regions in the deployment to succeed, which can be scaled with added regions in a region group.

Since compute workloads are increasingly moving towards edge servers, applications need to access data across wide geographies with low latency. Fauna makes this possible by replicating data across multiple availability zones or regions. An intelligent routing layer at Fauna’s edge directs requests sent to the API to the closest region where requests can be served without any client configuration. Fauna exposes data through a single global API endpoint, natively supported by edge computing providers.

While Fauna is technically a CP system according to the criteria put forth in the CAP Theorem, in that it guarantees consistency across the system at the cost of availability in the event of a network partition, it implements robust replication that minimizes this effect. Network partition failures are extraordinarily rare in today’s redundant cloud network topologies, with these networks ensuring many 9’s of availability. Because of its ability to coordinate across regions and region groups, Fauna is not vulnerable to a single point of failure. It is designed to tolerate temporary or permanent node unavailability, increased node latency, or a network partition that isolates a zone or region, making it unique compared to other distributed database implementations.

This is achieved by replicating data within or across regions to bring it closer to the end user and by optimally routing requests from ingress to the data. Requests are routed to the closest zone or region where the data lives by default, even in the case of complete zonal or regional failure. Write requests must be replicated to a majority of log leaders in the log segment before a response can be sent. Fauna’s public region groups typically exhibit single-digit millisecond latency for reads and double-digit millisecond latency for basic writes.

Fauna is designed with a stateless compute layer that can be scaled horizontally and vertically at any time, with no single point of failure. Any query coordinator in any region can receive any request, and coordinator nodes can communicate with log and data nodes in any other region. Public region groups typically handle hundreds of thousands of requests per minute and burst to millions of requests per minute on current hardware. The query coordinator layer can be scaled rapidly to handle orders of magnitude more traffic based on demand.

The Fauna storage engine is implemented as a compressed log-structured merge (LSM) tree, similar to the primary storage engine in Bigtable. Transactions are committed in batches to the global transaction log. Replicas process the log and apply relevant write effects atomically in bulk. This model maintains very high throughput and avoids the need to accumulate and sort incoming write effects. Atomicity is maintained, and data is preserved even in the event of a minority node or replica loss. Because the Fauna temporal data model is composed of immutable versions, there are no problematic synchronous overwrites. Logical layouts in data storage are partitioned across all nodes: documents and their history are partitioned by primary key, while indexes are partitioned by lookup term.
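As a rough, generic illustration of the log-structured merge idea (not Fauna's storage engine), writes accumulate in an in-memory buffer that is periodically flushed into immutable sorted runs, and reads consult the newest data first:

```typescript
// Conceptual LSM sketch: an in-memory memtable absorbs writes and is flushed
// to immutable, sorted runs; nothing is overwritten in place.
class TinyLsm {
  private memtable = new Map<string, string>();
  private runs: Array<Map<string, string>> = []; // newest first

  put(key: string, value: string): void {
    this.memtable.set(key, value);
    if (this.memtable.size >= 4) this.flush(); // tiny threshold for illustration
  }

  get(key: string): string | undefined {
    if (this.memtable.has(key)) return this.memtable.get(key);
    for (const run of this.runs) {
      if (run.has(key)) return run.get(key); // newest run wins
    }
    return undefined;
  }

  private flush(): void {
    // Freeze the memtable into a sorted, immutable run.
    const sorted = new Map(
      [...this.memtable.entries()].sort(([a], [b]) => a.localeCompare(b))
    );
    this.runs.unshift(sorted);
    this.memtable = new Map();
  }
}
```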


Fauna’s Core Architectural Layers

Fauna implements Calvin to schedule transactions and replicate data in a way that minimizes contention costs, along with other layers to coordinate and implement database operations. Every replica is guaranteed to see the same transaction log, so its final state is equivalent to the result of executing all transactions one by one, and therefore equivalent to the state of every other replica. Fauna’s core functionality is implemented by four primary layers: a routing layer, a query coordination layer, a transaction logging layer, and a data storage layer. Within every zone or region deployed, all nodes in all layers understand the full deployment topology and can forward requests to nodes in other zones or regions if a local node is not responding. The diagram below summarizes a typical query path, and the following paragraphs summarize the request tasks at a high level.

Fig 2. How Fauna’s architectural layers handle read and/or write database requests.

The client sends the query, along with a key or token, to a single global endpoint. The request is first received by a highly available DNS service and routed, using latency-based forwarding, to routing nodes in the closest zone or region. The routing nodes use the key to route the request to the correct region group. The routing layer also protects response times by identifying whether a large burst of requests requires throttling or whether any operations exceed defensive rate limits.

As the request reaches the query coordinator, some of the key benefits of the Calvin protocol implementation emerge. The protocol removes the need for per-transaction locks by pre-computing the inputs and effects of each transaction ahead of time. All coordinator nodes are stateless and can be scaled horizontally with ease. A snapshot time is selected, either the current time or a time specified in the request. With values in hand from data storage nodes, the coordinator optimistically executes all transaction operations without committing the writes. The output of this execution is a flattened set of reads and a set of writes to be committed if there is no contention. If there are no writes, the request is completed at this point, often within very low double-digit milliseconds for the entire request.
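A conceptual sketch of this optimistic pass, with all names and shapes invented for illustration: the transaction runs against a snapshot, reads are recorded with their versions, and writes are buffered rather than applied:

```typescript
// Conceptual sketch of the coordinator's optimistic execution, not Fauna's code.
interface ReadRecord { key: string; version: number }
interface WriteEffect { key: string; value: unknown }

interface Snapshot {
  get(key: string): Promise<unknown>;
  set(key: string, value: unknown): void;
}

async function optimisticExecute(
  txn: (s: Snapshot) => Promise<void>,
  snapshotTime: number,
  // Stand-in for fetching a versioned value from data storage nodes.
  fetchAt: (key: string, time: number) => Promise<{ value: unknown; version: number }>
): Promise<{ reads: ReadRecord[]; writes: WriteEffect[] }> {
  const reads: ReadRecord[] = [];
  const writes: WriteEffect[] = [];

  const snapshot: Snapshot = {
    async get(key) {
      const { value, version } = await fetchAt(key, snapshotTime);
      reads.push({ key, version }); // remember what was read, and at what version
      return value;
    },
    set(key, value) {
      writes.push({ key, value }); // buffered, never committed here
    },
  };

  await txn(snapshot);
  // Read-only transactions are done at this point; otherwise the read and
  // write sets are forwarded to the transaction log for sequencing.
  return { reads, writes };
}
```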

If there are writes to handle, the transaction logging layer is engaged to sequence and commit them. If the log or storage nodes are unavailable in the local zone or region, the coordinator finds appropriate nodes in other zones or regions. The log functions as a write-ahead log and is the only place where cross-replica coordination is necessary. It is split into multiple segments that span replicas, and segments can be readily increased to improve throughput.

Each segment runs an optimized version of the Raft consensus algorithm. A node is elected as leader, and all non-leader nodes forward transactions to it. If a current leader becomes unavailable, a new leader election is triggered. The leader periodically assembles the received transactions into batches to commit based on a configurable time interval called the epoch interval. The leader communicates with other leaders in the Raft ring to agree on the full set of transactions in the epoch, the batch is replicated with Raft, and the system has enough information and copies to declare its transactions optimistically committed. However, the write effects have yet to be finally applied at this point. Once all log segments have committed their respective transaction batches for a given epoch, all the epoch’s transactions are available to downstream data storage nodes.
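The epoch batching step can be sketched roughly as follows; the class and its parameters are hypothetical and stand in for the leader's role of turning a stream of forwarded transactions into consensus-replicated batches:

```typescript
// Illustrative only: a log-segment leader batches incoming transactions into
// epochs on a fixed interval; each batch is then replicated via consensus.
interface Txn { id: string; reads: unknown[]; writes: unknown[] }

class EpochBatcher {
  private pending: Txn[] = [];

  constructor(
    private epochIntervalMs: number,
    private replicate: (epoch: number, batch: Txn[]) => Promise<void> // e.g. a Raft append
  ) {}

  submit(txn: Txn): void {
    this.pending.push(txn); // non-leader nodes forward transactions to the leader
  }

  start(): void {
    let epoch = 0;
    setInterval(async () => {
      if (this.pending.length === 0) return;
      const batch = this.pending;
      this.pending = [];
      // Once the batch is replicated, its transactions are optimistically
      // committed; write effects are applied downstream by storage nodes.
      await this.replicate(epoch++, batch);
    }, this.epochIntervalMs);
  }
}
```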

It should be noted that real-time global clock synchronization in Fauna is not required to guarantee correctness. Clock synchronization occurs only on the log nodes, which are the only nodes that generate epochs, so epochs are generated at about the same time. Based on epoch order, a timestamp is applied to every transaction that reflects both its real commit time, within milliseconds of real time, and its logical, strictly serializable order with respect to other transactions.

Data storage nodes maintain a persistent connection with each local log node, listening for transactions in the key ranges they cover; each node is assigned a range of keys to monitor. All data is stored in each zone or region, and every document is stored on at least three nodes. The storage nodes validate that no values read during the transaction's execution have changed between the execution snapshot time and the final commit time, checking with peer nodes to obtain the state of values they do not cover. If there are no conflicts, the values the node covers are updated; if there are conflicts, the transactional writes are dropped. These checks on the data set are deterministic, so either all nodes apply the transaction or none of them do.
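A simplified sketch of that storage-side validation, with invented helper functions standing in for version lookups and version inserts:

```typescript
// Conceptual sketch of the storage-side conflict check, not Fauna's code:
// a transaction applies only if none of its read values changed between the
// snapshot it executed against and its committed position in the log.
interface ValidatedRead { key: string; versionAtSnapshot: number }
interface PendingWrite { key: string; value: unknown }

function applyIfNoConflict(
  reads: ValidatedRead[],
  writes: PendingWrite[],
  commitTime: number,
  latestVersion: (key: string) => number,                        // peers answer for uncovered keys
  insertVersion: (key: string, value: unknown, ts: number) => void
): boolean {
  const hasConflict = reads.some((r) => latestVersion(r.key) > r.versionAtSnapshot);
  if (hasConflict) return false; // every node reaches the same deterministic verdict

  for (const w of writes) {
    insertVersion(w.key, w.value, commitTime); // new immutable version, no overwrite
  }
  return true;
}
```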

Written documents in applied transactions are not overwritten. Instead, a new document version at the current transaction timestamp is inserted into the document history as a create, update, or delete event. Fauna’s storage system supports temporal queries, so all transactions can be executed consistently at any point in the past. This is useful for auditing, rollback, cache coherency, and synchronization to other systems and forms a fundamental part of the Fauna isolation model. The storage system also facilitates event streaming, which allows customers to subscribe to notifications when a document or a collection is updated.
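For example, temporal reads and document streams might look like the following with the v4 JavaScript driver; the timestamp and document reference are made up for illustration:

```typescript
// `client` is the faunadb Client from the earlier sketches.

// Temporal read: execute a query consistently at a point in the past.
const orderLastWeek = await client.query(
  q.At(
    q.Time("2022-12-08T00:00:00Z"),
    q.Get(q.Ref(q.Collection("orders"), "337"))
  )
);

// Event streaming: subscribe to changes on a single document. The handler
// shape shown here is simplified from the driver's streaming API.
const stream = client.stream
  .document(q.Ref(q.Collection("orders"), "337"))
  .on("version", (event) => console.log("new document version", event))
  .start();

// Later, when the subscription is no longer needed: stream.close();
```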

Beyond these core layers, additional services exist for activities like metrics, billing, backup/restores, user authentication and authorization, and a general-purpose, topology-aware, journaled task scheduler similar to Hadoop YARN.


Summing Up

As I first read and understood the nature of the Fauna architecture, I was immediately struck by how the combination of these layers, which are themselves built atop proven technologies and rigorously tested algorithms and protocols, still allows the overall system to be malleable. The clusters of nodes can be horizontally scaled with relative ease in current clouds, the efficient network pathing offered by current providers is fully exploited, and many components can operate with minimal to no state, optimized in memory as appropriate. They can be elegantly transitioned to take advantage of new hardware improvements, and they can leverage new points of presence (PoPs) at content delivery networks as those continue to proliferate.

And yet it is all abstracted as a serverless API endpoint that, other than an application key, requires no ongoing configuration to indicate the location of the database, replicas, shards, or any other detail about how the database is distributed. That level of abstraction keeps the promise of composable systems for today’s developers, making it easier to adapt, refactor, and avoid technical architecture debt over the long term.

About the author

Luis Colon is a data scientist who focuses on modern database applications and best practices, as well as other serverless, cloud, and related technologies. He currently serves as a Senior Technology Evangelist at Fauna, Inc. You can reach him at @luiscolon1 on Twitter and Reddit.
