Unpacking LSM-Trees: The Powerhouse Behind Modern Databases
zhengxin
Posted on December 2, 2023
Introduction
Log-Structured Merge-trees (LSM-trees) are virtually omnipresent in today's database systems, spanning both SQL and NoSQL architectures. They are the storage layer backbone of various high-profile databases like BigTable, Dynamo, HBase, Cassandra, LevelDB, RocksDB, and AsterixDB, to name a few.
Designed for write-intensive workloads, an LSM-tree is essentially made up of two segments: an in-memory component and an on-disk component. The in-memory part gets updated with every operation, and data gets flushed to the on-disk component once certain conditions are met.
In this post, we'll walk you through the evolution of LSM-trees, discuss why they're all the rage, delve into their key components, explore common optimizations, and take a closer look at LevelDB's implementation.
History of LSM-trees
Discussing database systems often boils down to two primary data update strategies: in-place and out-of-place updates.
In-place updates involve direct modifications to the data on disk, while out-of-place updates append new data values to a different disk location, leaving the original values unchanged.
In the above diagram, you'll notice how updating the value of key k1 from v1 to v4 works differently depending on the update strategy employed. When updating the value of key k1
from v1
to v4
, the in-place update updates the value directly, while the out-of-place create a new key-value pair and store the (k1,v4)
there.
LSM-tree takes the out-of-place strategy and is designed to be write-optimized. With such context, you can understand the rise of LSM-trees, which is influenced by three key trends:
- Modern applications are storing increasingly more data, which is further fueled by the decreasing price of storage and memory.
- Hence more applications are performing more insertions than read queries in their business logics.
- The global trend move to cloud-based datat management further supports immutability-based systems.
Yes, the fact of "more write operation than read" on modern data store systems is the root cause why LSM-tree is so perferred nowadays.
The concept of LSM-trees was introduced in 1996 by O'Neil and later popularized by Google's groundbreaking BigTable paper in 2006. Since then, a multitude of SQL and NoSQL databases have integrated LSM-trees, sparking active research in this field.
The LSM-tree model was first introduced by O'Neil in his 1996 research paper titled "The Log-Structured Merge-Tree (LSM-Tree)." A decade later, in 2006, Google released a pivotal paper on Bigtable, which had a profound impact on the database and big data communities. Bigtable employs LSM-trees to manage the tablets that store the actual data. Since then, LSM-trees have gained widespread adoption as the storage layer in numerous NoSQL and even some SQL databases. Research into LSM-trees continues to be an active area of study.
Today's LSM-trees
Today's LSM-trees generally consist of two main components: MemTable and SSTable.
Note: While not all LSM-trees are built with MemTables and SSTables, these components are the most commonly used. This section outlines a typical LSM-tree architecture, using LevelDB as an illustrative example. Each data storage system may have its own unique implementation, but the core concepts remain consistent.
In the LSM-tree structure, the MemTable serves as a temporary in-memory buffer for incoming write operations. This write buffer stores data in RAM, allowing for quick writes. When the MemTable fills up or a specific trigger occurs, its data is sorted and written to disk as an SSTable (Sorted String Table). This batch processing minimizes disk I/O and is particularly advantageous in write-heavy applications. After flushing, a new MemTable is created to host the newly incomming requests.
Generally, there's only one active MemTable in the system at any given time, as depicted in the figure below.
SSTables are immutable files containing a sequence of sorted key-value pairs. They offer several features:
- Key Ordering: Keys are sorted, facilitating efficient range queries and ordered data retrieval.
- Indexed: An index often accompanies an SSTable, enabling fast key lookups by pointing to specific data positions within the file.
SSTables are grouped into "levels" on disk. A higher level usually contains more SSTables than the one below it. When a MemTable is flushed to disk, it creates a new SSTable at level 0. Multiple SSTables can exist at this level, and their key ranges may overlap, as shown above.
A compaction is triggered when the number of SSTables in a level reaches a certain threshold. In such a case, an SSTable from level L is chosen, along with any overlapping SSTables from level L+1. These are merged to form new SSTables at level L+1.
Compactions are performed in the background, ensuring that read and write operations aren't interrupted.
For instance, if an SSTable in level L has a key range of 101-150, any SSTables in level L+1 with overlapping key ranges would also be selected for merging, whose key ranges are 31-120
and 121-150
.
New SSTables are generated at level L+1, based on the merge result of the 3 selected SSTables. The count of the newly generated SSTables depends not only on key ranges but also on the data size for each key. The primary goal is to maintain uniform SSTable sizes. Therefore, after compaction, the SSTable at level L was removed, and its data is now merged in level L+1.
As you can see, except for level 0, SSTables at each level have non-overlapping key ranges. This design allows for efficient point or range queries, as each key can be located without scanning multiple SSTables.
Optimization on LSM-trees
While each storage system may have its unique optimizations, several common techniques are often employed:
- Bloom Filters: Used to minimize disk reads by quickly determining if a key exists in a certain level.
- Compaction Strategies: Tailored to reduce write amplification and read latency.
- Partitioning: Enhances concurrency and reduces contention by dividing data.
- Caching: Strategies are in place to limit disk access and boost performance.
- Indexing: Auxiliary structures are employed to speed up lookups.
- Compression: Saves space and improves I/O efficiency.
- Tuning Size Ratios: Balances read and write performance by adjusting the size ratios between levels.
Summary
In today's data-rich environment, LSM-trees serve as a sturdy foundation for modern storage systems. Understanding LSM-trees is more than just a primer; it's essential for grasping how contemporary data storage operates. Through this blog, my aim is to equip you with the fundamentals of LSM-trees, so you'll have a clearer understanding the next time you encounter them.
References
[1] Chen Luo and Michael J. Carey. 2020. LSM-based Storage Techniques: A Survey.The VLDB Journal 29, 1 (January 2020), 393–418. DOI:https://doi.org/10.1007/s00778-019-00555-y
[2] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (June 1996), 351–385. DOI:https://doi.org/10.1007/s002360050048
[3] Subhadeep Sarkar and Manos Athanassoulis. 2022. Dissecting, Designing, and Optimizing LSM-based Data Stores. In Proceedings of the 2022 International Conference on Management of Data, ACM, Philadelphia PA USA, 2489–2497. DOI:https://doi.org/10.1145/3514221.3522563
[4] leveldb/doc/impl.md at main · google/leveldb. GitHub. Retrieved October 21, 2023 from https://github.com/google/leveldb/blob/main/doc/impl.md
Posted on December 2, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.