LSM-tree and block storage
Frits Hoogland
Posted on February 3, 2023
YugabyteDB implements LSM-tree storage, which fundamentally differs from block-based storage. Block-based storage is the main storage implementation for many well-known databases, such as PostgreSQL, Oracle, SQL Server, DB2 and many others.
Block-based storage 101
In a common block-based storage implementation, data is stored in blocks inside a datafile. A change is first written to a write-ahead (redo) log, and then applied to the block in the datafile, potentially overwriting the previous version of the data. The key observation is that the datafile is the fundamental and singular place where the data is stored.
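To make this concrete, here is a minimal Python sketch of the block-based write path, under the simplifying assumption that a 'block' is a dict entry; the class and names are illustrative, not any real database's API:

```python
# Minimal sketch of a block-based write path (illustrative only).
class BlockStore:
    def __init__(self):
        self.wal = []        # write-ahead (redo) log
        self.datafile = {}   # block number -> block contents

    def write(self, block_no, new_contents):
        # 1. Log the change first, so it can be replayed after a crash.
        self.wal.append((block_no, new_contents))
        # 2. Apply the change in place, overwriting the previous version:
        #    the datafile is the single place where the data lives.
        self.datafile[block_no] = new_contents
```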
LSM-tree storage 101
In common LSM-tree based databases, data is stored as a sequential list of key-value pairs. From a data perspective, no data is ever overwritten: any change is an addition to the stream of key-value pairs.
LSM-tree based databases use a write-ahead log too, which contains the records added for inserts, but also for updates and even deletes. A delete is an addition indicating that a previously inserted record is now deleted.
Upon write-ahead log addition, the changes are inserted into the first level of storage: a memtable. A memtable is memory-only storage, and thus will be gone when the database stops.
This has some consequences:
- Creating a table and inserting data will not (immediately) create a datafile; in LSM-tree storage the datafiles are called 'SST files'.
- Creating a table and inserting data will result in a memtable holding the data.
- To survive stopping and starting the database, or anything else that removes the memtable, such as a crash, the write-ahead log can be used to reconstruct the memtable (see the sketch after this list).
- This means that right after table creation and (limited) data insertion, there will be no datafiles (SST files).
- This looks really weird when you come from a block-based database and look into this for the very first time.
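These mechanics can be sketched in a few lines of Python. This is a hedged, minimal illustration, assuming a plain dict as the memtable and a list as the write-ahead log; the names are made up and do not reflect YugabyteDB's actual implementation:

```python
# Minimal sketch of LSM-tree writes, deletes and crash recovery.
TOMBSTONE = object()         # marker record for a deleted key

class LSMStore:
    def __init__(self):
        self.wal = []        # write-ahead log: append-only record stream
        self.memtable = {}   # memory-only first level; gone after a stop

    def put(self, key, value):
        self.wal.append((key, value))
        self.memtable[key] = value

    def delete(self, key):
        # A delete is an addition: a tombstone is appended,
        # nothing is overwritten.
        self.wal.append((key, TOMBSTONE))
        self.memtable[key] = TOMBSTONE

    def recover(self):
        # After a crash or restart the memtable is gone; replaying the
        # write-ahead log reconstructs it.
        self.memtable = {}
        for key, value in self.wal:
            self.memtable[key] = value
```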
Of course a memtable cannot grow without limit: eventually any computer system will run out of memory, and the write-ahead log storage needed to reconstruct such a memtable would be huge. Therefore, a memtable has a configurable, limited size. Once a memtable reaches the set maximum size, the following happens:
- The full memtable is made 'immutable'.
- A new memtable is allocated, and will be used by the database.
- The full memtable is written to disk as an SST file.
The entire database now has its data split into two chunks: the memtable and the SST file. When the memtable reaches its maximum size again, the same procedure will happen, and the database is split into a memtable and two SST files, and so on.
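Continuing the sketch above, the flush procedure could look as follows; the limit of 4 entries is arbitrary, chosen only to keep the example small:

```python
class LSMStoreWithFlush(LSMStore):
    MEMTABLE_LIMIT = 4           # stand-in for the memtable size limit

    def __init__(self):
        super().__init__()
        self.sst_files = []      # every flush adds one immutable SST file

    def put(self, key, value):
        super().put(key, value)
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # The full memtable is made immutable and written to disk as an
        # SST file, sorted by key...
        self.sst_files.append(sorted(self.memtable.items()))
        # ...and a new, empty memtable is allocated for further writes.
        self.memtable = {}
```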
The YugabyteDB implementation
This is all generic. If you want to know and see how this is implemented in YugabyteDB, read on.
I created a small and simple RF1 one-node YugabyteDB universe, which makes this the easiest to see and understand. The same principles apply to RF3 clusters, where a tablet has 3 replicas instead of the 1 with RF1.
- Create a table with a single tablet:

```sql
create table test ( id int, f1 text ) split into 1 tablets;
```
- See what happened in YugabyteDB:

```
➜ yb_stats --details-enable --gauges-enable --hostname-match 9000 --table-name-match test
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 1.306 seconds
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test is_raft_leader 1 y/n +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker 4096 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_IntentsDB 2048 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_IntentsDB_MemTable 2048 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_RegularDB 2048 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_RegularDB_MemTable 2048 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test raft_term 1 terms +0
```
This shows that the table creation created a tablet with UUID 93bf6aa88e4841c1bbea28eb3bf0bcca, and that TWO memtables were created: one for IntentsDB and one for RegularDB. This is a YugabyteDB specific implementation detail: we use an IntentsDB rocksdb (LSM-tree) database that holds locks and intermediate data, and a RegularDB rocksdb (LSM-tree) database that will eventually hold the actual table data. Both LSM-tree databases have allocated a memtable, which must be empty because we have not inserted any data.
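As a toy illustration of this two-databases-per-tablet layout, reusing the LSMStore sketch from above (the method names and the commit flow shown here are simplified assumptions, not YugabyteDB's actual API):

```python
# Each tablet carries two LSM-tree (rocksdb) databases.
class Tablet:
    def __init__(self):
        self.intents_db = LSMStore()   # locks and intermediate records
        self.regular_db = LSMStore()   # committed table data

    def write_intent(self, txn_id, key, value):
        # Intermediate (provisional) data first lands in IntentsDB.
        self.intents_db.put((txn_id, key), value)

    def apply_commit(self, txn_id, key, value):
        # On commit the data eventually moves to RegularDB, and the
        # intent record is deleted (which is itself an addition).
        self.regular_db.put(key, value)
        self.intents_db.delete((txn_id, key))
```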
- Insert some data:

```
insert into test select id, repeat('x',1000) from generate_series(1,100) id;
INSERT 0 100
```
- Look at the statistics:

```
➜ yb_stats --details-enable --gauges-enable --hostname-match 9000 --table-name-match test
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 5.974 seconds
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test is_raft_leader 1 y/n +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker 331776 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_IntentsDB 198656 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_IntentsDB_MemTable 198656 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_RegularDB 133120 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_RegularDB_MemTable 133120 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test raft_term 1 terms +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test replicated_retryable_request_ranges 1 reqs +0
```
Because we inserted more data, the memtables have simply grown. The obvious next question is how far they will grow. This depends on the parameter (gflag) memstore_size_mb, which is 128 (MB) by default.
- Let's insert more data so we fill up the memtable:
```
insert into test select id, repeat('x',1000) from generate_series(101,100000) id;
INSERT 0 99900
```
- And look at the statistics again:

```
➜ yb_stats --details-enable --gauges-enable --hostname-match 9000 --table-name-match test
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 1.117 seconds
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test is_raft_leader 1 y/n +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test log_wal_size 75077698 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker 11800576 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_BlockBasedTable_IntentsDB 50239421 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_BlockBasedTable_RegularDB 65618 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_IntentsDB 11798528 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_IntentsDB_MemTable 11798528 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_RegularDB 2048 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test mem_tracker_RegularDB_MemTable 2048 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test raft_term 1 terms +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test replicated_retryable_request_ranges 1 reqs +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test rocksdb_current_version_num_sst_files 3 files +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test rocksdb_current_version_sst_files_size 11854842 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test rocksdb_current_version_sst_files_uncompressed_size 119355043 bytes +0
192.168.66.80:9000 tablet 93bf6aa88e4841c1bbea28eb3bf0bcca yugabyte.test rocksdb_total_sst_files_size 11854842 bytes +0
```
The addition of more data filled up the memtable and made it flush to disk. The IntentsDB memtable now holds approx. 11 MB, the RegularDB memtable holds 2 kB.
The first thing to look at is the statistic log_wal_size: this is the write-ahead log data needed to reconstruct the memtable. If you look back at the previous statistics, you will notice it was not present there. This is a known issue: Change statistic log_wal_size to include both current and previous files. #14480.
The second thing to look at are the statistics that report the SST file information. rocksdb_current_version_num_sst_files is 3: we got 3 SST files. rocksdb_current_version_sst_files_size and rocksdb_current_version_sst_files_uncompressed_size give an approximation of the size on disk, and an approximation of the size of the files if they were uncompressed: here that is 11854842 versus 119355043 bytes, roughly a factor 10 compression.
The last thing that is interesting to see are the mem_tracker_BlockBasedTable_IntentsDB and mem_tracker_BlockBasedTable_RegularDB statistics. These are the memory areas that are used as a buffer cache: in principle, once a memtable is flushed to disk in an SST file, its data sits on disk. The BlockBasedTable area is a cache to hold SST file data in memory.
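To see the role of this cache in context, here is a final hedged sketch of a read path on top of the earlier classes: a read checks the memtable first, then the cache, and only then the SST files on disk. Real implementations cache at block granularity and use indexes and bloom filters; this version is deliberately naive:

```python
class LSMStoreWithCache(LSMStoreWithFlush):
    def __init__(self):
        super().__init__()
        self.block_cache = {}    # plays the role of the BlockBasedTable area

    def get(self, key):
        # 1. Newest data first: the current memtable.
        if key in self.memtable:
            value = self.memtable[key]
        # 2. Then SST file data that is already cached in memory.
        elif key in self.block_cache:
            value = self.block_cache[key]
        else:
            # 3. Finally the SST files on disk, newest file first.
            value = None
            for sst in reversed(self.sst_files):
                hits = [v for k, v in sst if k == key]
                if hits:
                    value = hits[0]
                    self.block_cache[key] = value   # populate the cache
                    break
        return None if value is TOMBSTONE else value
```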