Storage and retrieval for system design interview
Daniel Lee
Posted on September 11, 2024
The following content is my own notes from the book, "Designing Data-Intensive Applications". The writing is intended for people who want to dash through the book quickly.
There are different ways to store data (storage engines) for transactional workloads and for analytics. They are called OLTP (optimized for transaction processing) and OLAP (optimized for analytics).
- OLTP is user-facing, meaning request volume is high, so strategies such as indexing are needed to keep queries fast.
- OLAP, on the other hand, is computationally heavy: each query may scan millions of records to compute aggregate functions (SUM, COUNT, AVG, etc.). Therefore, column-oriented storage is generally desirable. The sketch below contrasts the two access patterns.
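As a toy illustration (the data and field names are made up, and this is not a real storage engine), the difference in access pattern looks roughly like this:

```python
# Toy data, not a real storage engine: the same records queried in an
# OLTP style (point lookup) and in an OLAP style (full scan + aggregate).
orders = [
    {"order_id": 1, "customer": "alice", "amount": 30.0},
    {"order_id": 2, "customer": "bob", "amount": 12.5},
    {"order_id": 3, "customer": "alice", "amount": 99.9},
]

# OLTP-style: fetch one record by key (cheap, happens very often).
order = next(o for o in orders if o["order_id"] == 2)

# OLAP-style: aggregate over every record (expensive, happens rarely).
total_revenue = sum(o["amount"] for o in orders)

print(order, total_revenue)
```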
Where does it start from?
In order for humans to understand what's happening in applications or machines, we need some kind of record. We call it a log.
In the storage sense, a log is simply an append-only sequence of records describing the state of a machine or an application. As long as a machine operates, log entries are created endlessly, and storing them is constrained by disk or RAM capacity. So, strategies such as compaction are used to avoid running out of disk space.
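A minimal sketch of such an append-only log store, assuming a simple "key,value" line format (my own toy example, not production code):

```python
# My own toy version of an append-only key-value log: writes always
# append a "key,value" line; reads scan the whole file and keep the
# last value seen for the key.
def db_set(path, key, value):
    with open(path, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(path, key):
    result = None
    with open(path) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v  # later entries override earlier ones
    return result

db_set("data.log", "user:1", "alice")
db_set("data.log", "user:1", "alicia")   # an update is just another append
print(db_get("data.log", "user:1"))      # -> alicia
```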
Compaction breaks a large log down into smaller segments and merges them to keep storage usage efficient. It works well because it throws away duplicate keys in the log, keeping only the most recent update for each key.
Segments are never modified after they have been written, so reads can be served from them without worrying about in-place updates. Writes always go to the latest segment file; after a while, old segments are merged in the background, read requests are switched over to the new merged segment, and the old segment files are deleted to keep storage efficient. A sketch of this follows.
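Compacting one closed segment could look roughly like this, reusing the comma-separated format from the sketch above:

```python
# Compacting one closed segment: keep only the most recent value per key
# and write a new, smaller segment file. Assumes the "key,value" lines
# from the sketch above.
def compact(segment_path, compacted_path):
    latest = {}
    with open(segment_path) as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition(",")
            latest[key] = value              # later lines win
    with open(compacted_path, "w") as f:
        for key, value in latest.items():
            f.write(f"{key},{value}\n")

# Merging several old segments works the same way: feed them into the
# dict from oldest to newest so the newest value for each key survives.
```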
There are also different ways to lay data out on disk, such as column-oriented storage.
In a relational (row-oriented) schema, each row consists of one or more columns, but not every query needs all of those column values. Indexing is often used to optimize performance to some degree.
An index is a data structure that lets you efficiently find the value for a particular key in the database. The general idea is to keep some additional metadata on the side to help you locate the data you want. An index is derived from the primary data, and it affects query performance: reads get faster, but every write also has to update the index, which slows writes down. So developers need to be mindful when creating indexes.
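As a rough sketch of that trade-off (the helper names indexed_set and indexed_get are hypothetical), an in-memory hash index can map each key to the byte offset of its latest entry in the log, so reads jump straight to the record while every write pays the cost of updating the map:

```python
# Hypothetical helpers, not a real library API: an in-memory hash index
# mapping each key to the byte offset of its latest entry in the log.
index = {}

def indexed_set(path, key, value):
    with open(path, "ab") as f:
        offset = f.tell()                      # end of file = where this entry lands
        f.write(f"{key},{value}\n".encode())
    index[key] = offset                        # extra work on every write

def indexed_get(path, key):
    if key not in index:
        return None
    with open(path, "rb") as f:
        f.seek(index[key])                     # jump straight to the record
        _, _, v = f.readline().decode().rstrip("\n").partition(",")
        return v
```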
The most widely used indexing data structure is the B-tree, which stores key-value pairs sorted by key.
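A real B-tree keeps its pages on disk, but as a rough illustration of why sorted keys matter (not an actual B-tree), binary search over a sorted list already gives logarithmic lookups and cheap range scans:

```python
import bisect

# Not an actual B-tree, only an illustration of sorted keys: binary search
# finds a key in O(log n), and a range query becomes a contiguous slice.
keys = ["apple", "banana", "cherry", "grape", "mango", "peach"]  # kept sorted
vals = [10, 20, 30, 40, 50, 60]

def lookup(key):
    i = bisect.bisect_left(keys, key)
    return vals[i] if i < len(keys) and keys[i] == key else None

def range_scan(lo, hi):
    i, j = bisect.bisect_left(keys, lo), bisect.bisect_right(keys, hi)
    return list(zip(keys[i:j], vals[i:j]))

print(lookup("cherry"))              # -> 30
print(range_scan("banana", "mango")) # -> [("banana", 20), ..., ("mango", 50)]
```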
There are also many other indexing strategies, such as hash indexes. In column-oriented storage, on the other hand, all the values from each column are stored together rather than row by row; to access the nth record, you simply take the nth value from every column. Since columns often contain many duplicate values, they compress well with techniques such as bitmap encoding.
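A small sketch of that layout (the column names are made up): values of one column live together in row order, and a low-cardinality column can be turned into bitmaps, one per distinct value:

```python
# Made-up columns: each column's values are stored together, in the same
# row order, so the nth record is rebuilt from the nth value of every column.
product = ["apples", "bananas", "apples", "cherries", "apples"]
quantity = [5, 3, 7, 2, 9]

row_2 = (product[2], quantity[2])    # -> ("apples", 7)

# Bitmap encoding: one bit per row for each distinct value of a column.
# Columns with few distinct values compress very well this way.
bitmaps = {
    value: [1 if v == value else 0 for v in product]
    for value in sorted(set(product))
}
print(row_2, bitmaps["apples"])      # bitmaps["apples"] -> [1, 0, 1, 0, 1]
```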
Reference
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann