title: Reducing Lock Contention in RocksDB
author: sdong
In this post, we briefly introduce recent changes to RocksDB that reduce the cost of lock contention.
RocksDB has a simple thread synchronization mechanism (see the RocksDB Architecture Guide for the terms used below, such as SST tables and mem tables). SST tables are immutable after being written, and mem tables are lock-free data structures supporting a single writer and multiple readers. There is only one major lock, the DB mutex (DBImpl.mutex_), which protects all the meta operations, including:
Increase or decrease reference counters of mem tables and SST tables
Change and check meta data structures, before and after finishing compactions, flushes and new mem table creations
Coordinate writers
This DB mutex used to be a scalability bottleneck that prevented us from scaling to more than 16 threads. To address the issue, we improved RocksDB in several ways.
Consolidate reference counters into a “super version”. We created a meta-meta data structure called “super version”, which holds references to all the live mem tables and SST tables, so that readers only need to increase the reference counter of this single data structure. In RocksDB, the list of live mem tables and SST tables changes infrequently: only when a new mem table is created or a flush or compaction happens. At those times, a new super version is created and the reference counters of the tables it lists are increased. Because a super version lists the live mem tables and SST tables, a reader only needs to acquire the lock to find the latest super version and increase its reference counter. From the super version, the reader can find all the mem tables and SST tables, which are safe to access as long as the reader holds its reference to the super version.
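The idea can be sketched in a few lines of standard C++. This is a minimal illustration under stated assumptions, not RocksDB's implementation: `Table`, `SuperVersionSketch`, and `DbSketch` are hypothetical names, and `std::shared_ptr` stands in for the hand-managed reference counters the post describes.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mem table or SST table; not a RocksDB type.
struct Table {
  std::string name;
};

// One snapshot of every live table. Holding a reference to the snapshot
// keeps all of the tables it lists alive.
struct SuperVersionSketch {
  std::vector<std::shared_ptr<const Table>> mem_tables;
  std::vector<std::shared_ptr<const Table>> sst_tables;
};

class DbSketch {
 public:
  DbSketch() : current_(std::make_shared<SuperVersionSketch>()) {}

  // Reader path: one short critical section and one reference-count bump,
  // however many tables are live.
  std::shared_ptr<const SuperVersionSketch> Acquire() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return current_;
  }

  // Rare path (flush, compaction, new mem table): build a new snapshot and
  // swap it in. Readers still holding the old snapshot are unaffected.
  void InstallSstTable(std::shared_ptr<const Table> table) {
    assert(table != nullptr);
    std::lock_guard<std::mutex> guard(mutex_);
    auto next = std::make_shared<SuperVersionSketch>(*current_);
    next->sst_tables.push_back(std::move(table));
    current_ = std::move(next);
  }

 private:
  mutable std::mutex mutex_;
  std::shared_ptr<const SuperVersionSketch> current_;
};
```

A reader calls `Acquire()` once and then walks `mem_tables` and `sst_tables` without touching the mutex again; the old snapshot is destroyed when its last holder lets go.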
We replaced some reference counters with std::atomic objects, so that decreasing the reference count of an object usually doesn’t need to happen inside the mutex any more.
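A minimal sketch of such a counter, assuming a hypothetical `RefCounted` class (the name and the memory orderings are illustrative choices, not taken from RocksDB):

```cpp
#include <atomic>
#include <cassert>

// Hypothetical reference-counted object; the name is illustrative only.
class RefCounted {
 public:
  // Take a reference. Relaxed ordering suffices here because the caller
  // already holds a valid pointer to the object.
  void Ref() { refs_.fetch_add(1, std::memory_order_relaxed); }

  // Drop a reference without holding any mutex. Returns true when the
  // caller released the last reference and is responsible for cleanup.
  bool Unref() {
    int previous = refs_.fetch_sub(1, std::memory_order_acq_rel);
    assert(previous > 0);
    return previous == 1;
  }

 private:
  std::atomic<int> refs_{0};
};
```

Only the thread that sees `Unref()` return true has cleanup work to do, which is one plausible reading of why the post says the decrement "usually" stays outside the mutex.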
Make fetching the super version and reference counting lock-free in read queries. After consolidating reference counting into a single super version and removing the locking for decreasing reference counts, the read path acquires the mutex for only one thing: fetching the latest super version and increasing its reference count (the reference is released with an atomic decrement). We designed and implemented a (mostly) lock-free approach to do this. See details. We will write a separate blog post about it.
Avoid disk I/O inside the mutex. Each disk I/O to a hard drive takes several milliseconds, and it can take even longer if the file system journal is involved or I/Os are queued. Even occasional disk I/O inside the mutex can cause huge performance outliers. We identified two situations in which we might do disk I/O inside the mutex, and we removed both: (1) Opening and closing transactional log files. We moved those operations out of the mutex. (2) Information logging. In multiple places we wrote to logs within the mutex. A file write may wait for disk I/O to finish before returning even if fsync() is not issued, especially on EXT file systems. We occasionally see 100+ millisecond write() latency on EXT. Instead of removing the logging, we came up with delayed logging: inside the mutex, instead of writing directly to the log file, we write to a log buffer along with the timing information. As soon as the mutex is released, we flush the log buffer to the log files.
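The delayed-logging pattern can be sketched as follows. This is a simplified illustration, not RocksDB's log buffer: `LogBufferSketch` and `InstallResultSketch` are hypothetical names, and the timestamp format is an arbitrary choice.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical buffer; collects messages in memory while a mutex is held.
class LogBufferSketch {
 public:
  using Clock = std::chrono::system_clock;

  // Safe to call inside the critical section: no file I/O happens here.
  void Add(std::string message) {
    entries_.push_back({Clock::now(), std::move(message)});
  }

  // Call only after the mutex is released. Each line carries the time the
  // event was recorded, not the time it reached the file.
  void FlushTo(std::FILE* out) {
    assert(out != nullptr);
    for (const Entry& e : entries_) {
      auto micros = std::chrono::duration_cast<std::chrono::microseconds>(
                        e.when.time_since_epoch()).count();
      std::fprintf(out, "%lld %s\n", static_cast<long long>(micros),
                   e.message.c_str());
    }
    entries_.clear();
  }

  std::size_t pending() const { return entries_.size(); }

 private:
  struct Entry {
    Clock::time_point when;
    std::string message;
  };
  std::vector<Entry> entries_;
};

// Usage pattern: record under the lock, write after unlocking.
void InstallResultSketch(std::mutex& db_mutex, std::FILE* info_log) {
  LogBufferSketch buffer;
  {
    std::lock_guard<std::mutex> guard(db_mutex);
    // ... update metadata here ...
    buffer.Add("installed new version");
  }
  buffer.FlushTo(info_log);  // the slow write happens outside the mutex
}
```

Recording the timestamp in `Add()` rather than in `FlushTo()` keeps the log lines accurate even though the write is deferred.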
Reduce object creation inside the mutex. Object creation can be slow because it involves malloc (in our case). Malloc is sometimes slow because it needs to lock shared data structures. Allocation can also be slow because some of our classes do expensive work in their constructors. For these reasons, we try to reduce object creation inside the mutex. Here are two examples:
(1) std::vector uses malloc internally. We introduced the “autovector” data structure, in which memory for the first few elements is pre-allocated as members of the autovector class. When an autovector is used as a stack variable, no malloc is needed unless the pre-allocated buffer is used up. This autovector is quite useful for manipulating the meta data structures, which is often done while holding the DB mutex.
(2) When building an iterator, we used to create an iterator for every live mem table and SST table within the mutex, plus a merging iterator on top of them. Besides malloc, some of those iterators can be quite expensive to create, for example when creation involves sorting. Now we simply increase the reference counters of the tables inside the mutex and release the mutex before creating any iterator.
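The autovector idea from example (1) can be sketched as a small-buffer container. This is an illustration only, not RocksDB's autovector: `SmallVectorSketch` is a hypothetical name, and for brevity it requires the element type to be default-constructible and copy-assignable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative small-buffer vector. The first kInline elements live inside
// the object itself, so a stack instance needs no heap allocation until
// that buffer is full.
template <typename T, std::size_t kInline = 8>
class SmallVectorSketch {
 public:
  void push_back(const T& value) {
    if (size_ < kInline) {
      inline_[size_] = value;      // no malloc on this path
    } else {
      overflow_.push_back(value);  // heap is used only past kInline items
    }
    ++size_;
  }

  T& operator[](std::size_t i) {
    assert(i < size_);
    return i < kInline ? inline_[i] : overflow_[i - kInline];
  }

  std::size_t size() const { return size_; }
  bool on_heap() const { return !overflow_.empty(); }

 private:
  std::array<T, kInline> inline_{};
  std::vector<T> overflow_;
  std::size_t size_ = 0;
};
```

Declared as a local variable inside a critical section, such a container handles the common case of a handful of elements without calling malloc while the mutex is held.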
(2) New PlainTable format (optimized for SST in ramfs/tmpfs) does not organize data by blocks. Data are located by memory addresses so no block cache is needed.
With all of those improvements, lock contention is no longer a bottleneck, as shown in our memory-only benchmark. Furthermore, lock contention no longer causes the huge (50+ millisecond) latency outliers it used to cause.
Please post an example of reading the same rocksdb concurrently.
We are using the latest 3.0 rocksdb; however, when two separate processes try to open the same rocksdb for reading, only one of the open requests succeeds. The other open always fails with “db/LOCK: Resource temporarily unavailable”. So far we have not found an option that allows sharing the rocksdb for reads. An example would be most appreciated.
Sorry for the delay. We don’t have feature support for this scenario yet. Here is a way to work around the problem: you can build a snapshot of the DB by doing this:
DB::DisableFileDeletions()
DB::GetLiveFiles() to get a full list of the files.
(see DB::GetLiveFiles() for more information)
DB::EnableFileDeletions()
By the way, the best way to ask those questions is in our Facebook group. Let us know if you need any further help.
Does this consistency problem in RocksDB also occur in the case of a single put/write? Which ACID properties does RocksDB support: only durability, irrespective of single or batch write?
We recently introduced optimistic transactions, which can help you ensure all of the ACID properties.
This blog post is mainly about optimizations in the implementation. The RocksDB consistency semantics are not changed.