Introduction to LSM Trees

Databases are an integral part of almost any system. Choosing the right database for your application, and using it correctly, can improve your application's performance, consistency, and reliability. In this article, we will briefly explore Log-Structured Merge Trees: how they work and why databases use them.

What is an LSM Tree?

A Log-Structured Merge Tree (LSM Tree) is a data structure for storing and organizing data efficiently, particularly under write-heavy workloads. It is widely used in data-intensive databases like Amazon's DynamoDB, Cassandra, and ScyllaDB.

Internally, an LSM Tree uses two data structures: the memtable and SS Tables. Let's look at each of them in detail.

Memtable

Memtables are in-memory data structures, and all reads and writes are directed to them first. Generally, red-black trees are used as memtables to keep the keys in sorted order. Since the memtable lives entirely in memory, it supports high read and write throughput.

Memtables are volatile, so all of their data would be lost on a power or hardware failure. To ensure durability, they are backed by a write-ahead log (WAL) that records every operation before it is applied to the memtable.
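To make this concrete, here is a minimal sketch of a WAL-backed memtable in Python. The class and file names are hypothetical, and a sorted list (via the bisect module) stands in for the red-black tree a real engine would use.

```python
import bisect
import os

class Memtable:
    """A toy memtable: a sorted list of (key, value) pairs backed by a WAL."""

    def __init__(self, wal_path="memtable.wal"):
        self.entries = []            # kept sorted by key (stand-in for a red-black tree)
        self.wal_path = wal_path
        self._replay_wal()           # recover entries written before a crash
        self.wal = open(wal_path, "a", encoding="utf-8")

    def _replay_wal(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition("\t")
                self._apply(key, value)

    def _apply(self, key, value):
        i = bisect.bisect_left(self.entries, (key,))
        if i < len(self.entries) and self.entries[i][0] == key:
            self.entries[i] = (key, value)        # overwrite an existing key
        else:
            self.entries.insert(i, (key, value))  # insert, keeping sorted order

    def put(self, key, value):
        # Durability first: record the operation in the WAL, then apply it.
        self.wal.write(f"{key}\t{value}\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        self._apply(key, value)

    def get(self, key):
        i = bisect.bisect_left(self.entries, (key,))
        if i < len(self.entries) and self.entries[i][0] == key:
            return self.entries[i][1]
        return None
```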

Since we cannot store more data than the available memory, the memtable is flushed to disk when it reaches a threshold size, and a new memtable is created. The data from each flushed memtable is stored in an SS Table.
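Continuing the sketch above, a flush could look something like this: the sorted entries are written out as an SS Table (here just a tab-separated text file, purely for illustration), the old WAL is discarded, and a fresh memtable takes over. The function name and threshold are illustrative.

```python
import os

MEMTABLE_THRESHOLD = 1_000   # illustrative flush threshold (number of entries)

def flush_memtable(memtable, sstable_path):
    """Write the toy Memtable's sorted entries to an immutable SS Table file.

    A real SS Table would use a binary block format; a tab-separated text
    file keeps the idea visible.
    """
    with open(sstable_path, "w", encoding="utf-8") as f:
        for key, value in memtable.entries:   # already sorted by key
            f.write(f"{key}\t{value}\n")
    # Once the data is safely on disk, the old WAL is no longer needed.
    memtable.wal.close()
    os.remove(memtable.wal_path)
    return Memtable(wal_path=memtable.wal_path)   # start a fresh, empty memtable

# Usage: flush when the memtable grows past the threshold.
# if len(memtable.entries) >= MEMTABLE_THRESHOLD:
#     memtable = flush_memtable(memtable, "sstable-0001.txt")
```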

SS Tables

A Sorted String Table (SS Table) is a persistent file format that stores data on disk in sorted order and is immutable: it cannot be modified once it is written. Since every memtable is eventually flushed into its own SS Table, there can be many SS Tables present at a time, some of which may hold stale values for the same keys. To improve read performance and free up disk space, SS Tables are periodically compacted.

During compaction, multiple SS Tables are merged into one using a process similar to the merge step of the merge-sort algorithm, which merges sorted runs. Duplicate keys are resolved by keeping only the most recent entry and discarding the older ones, and the sorted order of the data is preserved throughout.
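Here is a rough sketch of that merge, assuming the tab-separated SS Table format from the flush sketch above. heapq.merge performs the k-way merge of the sorted files, and for duplicate keys the entry from the newest table wins.

```python
import heapq

def compact(sstable_paths, output_path):
    """Merge several SS Table files (given oldest first) into one, keeping
    only the newest value for each key. Assumes the tab-separated format
    from the flush sketch above.
    """
    def read_entries(path, age):
        with open(path, encoding="utf-8") as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition("\t")
                # Sort by key, then by age, so the newest entry comes first.
                yield (key, age, value)

    # Age 0 is the newest table; older tables get larger age numbers.
    iterators = [
        read_entries(path, age)
        for age, path in enumerate(reversed(sstable_paths))
    ]

    with open(output_path, "w", encoding="utf-8") as out:
        previous_key = None
        for key, age, value in heapq.merge(*iterators):
            if key == previous_key:
                continue            # an entry from a newer table already won
            out.write(f"{key}\t{value}\n")
            previous_key = key
```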

A sparse in-memory index is also maintained for each SS Table to improve read performance. For every few entries (say, every 128), we store the first key of that block along with its byte offset in the file. Now we don't have to scan every entry in the SS Table; instead, we can binary-search the sparse index and then linearly search only within that block of 128 entries.

You might have noticed that we do not binary-search the SS Table directly, even though its entries are sorted. This is because entries are variable-length, so there is no way to jump into the middle of the file and know where one record ends and the next one starts. Hence, we perform a linear search within the block.
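The sketch below illustrates the idea with the same toy file format: build_sparse_index records the first key and byte offset of every 128th entry, and sstable_get binary-searches that index before linearly scanning a single block. Both function names and the block size are illustrative.

```python
import bisect

BLOCK_SIZE = 128   # index one entry out of every 128

def build_sparse_index(sstable_path):
    """Record (first key, byte offset) for every 128th entry in the file."""
    index_keys, index_offsets = [], []
    with open(sstable_path, "rb") as f:
        entry_number = 0
        offset = f.tell()
        line = f.readline()
        while line:
            if entry_number % BLOCK_SIZE == 0:
                key = line.split(b"\t", 1)[0].decode("utf-8")
                index_keys.append(key)
                index_offsets.append(offset)
            entry_number += 1
            offset = f.tell()
            line = f.readline()
    return index_keys, index_offsets

def sstable_get(sstable_path, index_keys, index_offsets, target):
    """Binary-search the sparse index, then scan one block linearly."""
    i = bisect.bisect_right(index_keys, target) - 1
    if i < 0:
        return None                        # target sorts before every key in the file
    with open(sstable_path, "rb") as f:
        f.seek(index_offsets[i])
        for _ in range(BLOCK_SIZE):
            line = f.readline()
            if not line:
                break
            key, _, value = line.rstrip(b"\n").partition(b"\t")
            key = key.decode("utf-8")
            if key == target:
                return value.decode("utf-8")
            if key > target:
                break                      # keys are sorted, so we can stop early
    return None
```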

Optimizing read performance using Bloom Filters

For reads, we first check whether the key is present in the memtable. If it is not, we search the SS Tables in descending order of their creation time (newest first) until the key is found. In the worst case, when the key does not exist at all, we end up reading every SS Table, which negatively impacts read performance. To avoid this problem, LSM Trees use Bloom filters.

A Bloom filter is a probabilistic data structure used to check whether an element is present in a set. It is exceptionally space-efficient and fast. It may return false positives, but never false negatives: it will answer either that an element is definitely not in the set or that it may be in the set. We will not discuss Bloom filters in depth in this blog, but you can check out this article for more details.
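For illustration, here is a toy Bloom filter that derives a few bit positions from a SHA-256 digest of the key. The sizes and hashing scheme are arbitrary choices for this sketch, not what any particular database uses.

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter: a bit array plus k hash positions per key.

    m (bits) and k (hashes) are illustrative defaults; real systems size them
    based on the expected number of keys and the target false-positive rate.
    """

    def __init__(self, m=8192, k=4):
        self.m = m
        self.k = k
        self.bits = bytearray(m // 8)

    def _positions(self, key):
        # Derive k bit positions from a single SHA-256 digest of the key.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        for i in range(self.k):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```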

LSM Trees can maintain a Bloom filter for every SS Table, which lets us skip any SS Table whose filter says the target key is definitely not present. This saves a lot of disk I/O and results in better read performance.
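Putting the pieces together, a read path that consults the memtable first and uses a per-SS-Table Bloom filter to skip files might look like this, reusing the hypothetical Memtable, sstable_get, and BloomFilter sketches above.

```python
def read(key, memtable, sstables):
    """Look up a key: memtable first, then SS Tables from newest to oldest.

    `sstables` is assumed to be a list of (path, sparse_index, bloom_filter)
    tuples built with the sketches above, ordered newest first.
    """
    value = memtable.get(key)
    if value is not None:
        return value

    for path, (index_keys, index_offsets), bloom in sstables:
        if not bloom.might_contain(key):
            continue          # definitely not here: skip the disk read entirely
        value = sstable_get(path, index_keys, index_offsets, key)
        if value is not None:
            return value
    return None
```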

Wrapping up

LSM Trees are a popular choice for databases with high write throughput requirements. For writes, we first write data to the memtable (after recording it in the WAL), and when the memtable reaches a threshold size, it is flushed into an SS Table. Compaction periodically merges multiple SS Tables into one. For reads, we first check the memtable and then the SS Tables, from newest to oldest, until the key is found, using Bloom filters to skip SS Tables that cannot contain it. I hope this article was helpful and you learned something new.

Thanks for reading!