title: "Parallel Compression Revamp: Dramatically Reduced CPU Overhead" layout: post author: peterd
The upcoming RocksDB 10.7 release includes a major revamp of parallel compression that dramatically reduces the feature's CPU overhead by up to 65% while maintaining or improving throughput for compression-heavy workloads. We expect this to broaden the set of workloads that could benefit from parallel compression, especially for bulk SST generation and remote compaction use cases that are less sensitive to CPU responsiveness.
Parallel compression in RocksDB (CompressionOptions::parallel_threads > 1) allows multiple threads to compress different blocks simultaneously during SST file generation, which can significantly improve compaction throughput for workloads where compression is a bottleneck. However, the original implementation had substantial CPU overhead that often outweighed the benefits, limiting its practical adoption.
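As a quick refresher, the feature is configured through the existing public options. Here is a minimal sketch of enabling it when opening a DB; the specific values (ZSTD level 8, three threads, the path) are illustrative, not recommendations:

```cpp
#include <cassert>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.compression = rocksdb::kZSTD;
  options.compression_opts.level = 8;
  // Values > 1 enable parallel compression during SST file generation.
  options.compression_opts.parallel_threads = 3;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/parallel_compression_demo", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```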
The parallel compression framework has been completely rewritten from the ground up in pull request #13910 to address the core inefficiencies:
Instead of separate compression and write queues with complex thread coordination, the new implementation uses a ring buffer of blocks-in-progress that enables efficient work distribution across threads. This bounds working memory while enabling high throughput with minimal cross-thread synchronization.
Previously, the calling thread could only generate uncompressed blocks, dedicated compression threads could only compress, and a writer thread could only write the SST file to storage. Now, all threads can participate in compression work in a quasi-work-stealing manner, dramatically reducing the need for threads to block waiting for work. While only one thread (the calling thread, or "emit thread") can generate uncompressed SST blocks in the new implementation, feeding compression work to other threads and itself, any of the other threads can also write compressed blocks to storage.
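To make the division of labor concrete, here is a deliberately simplified model of the idea, not the actual RocksDB code: each ring slot carries a small atomic state machine, so any thread can claim compression work with a compare-and-swap, and the block at the tail of the ring is written out in order as soon as it finishes.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Per-block progress states in the ring.
enum BlockState : uint8_t { kEmpty, kEmitted, kCompressing, kCompressed, kWritten };

struct Slot {
  std::atomic<uint8_t> state{kEmpty};
  // Uncompressed/compressed block payloads omitted for brevity.
};

constexpr size_t kRingSize = 16;  // Bounds memory for blocks in progress.
std::array<Slot, kRingSize> ring;

// Any thread may claim an emitted block for compression (work stealing).
bool TryClaimForCompression(Slot& s) {
  uint8_t expected = kEmitted;
  return s.state.compare_exchange_strong(expected, kCompressing,
                                         std::memory_order_acq_rel);
}

// Whichever thread reaches the tail writes blocks to the file strictly in
// emission order, as soon as each one has finished compressing.
bool TryWriteInOrder(size_t tail) {
  uint8_t expected = kCompressed;
  return ring[tail % kRingSize].state.compare_exchange_strong(
      expected, kWritten, std::memory_order_acq_rel);
}
```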
The ring buffer enables another key feature: auto-scaling of active threads based on ring buffer utilization. The framework intelligently wakes up idle worker threads only when there's sufficient work to justify the overhead, achieving near-maximum throughput while minimizing CPU waste from unnecessary thread wake-ups.
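The wake-up decision can be thought of as a heuristic roughly like the sketch below; the real policy is driven by ring buffer utilization and is more refined than this illustration:

```cpp
#include <cstddef>

// Illustrative auto-scaling check: only wake an idle worker when the backlog
// of blocks awaiting compression justifies the cost of a thread wake-up.
bool ShouldWakeIdleWorker(size_t blocks_awaiting_compression,
                          size_t active_workers, size_t max_workers) {
  if (active_workers >= max_workers) {
    return false;  // Already at the configured parallel_threads limit.
  }
  return blocks_awaiting_compression > active_workers + 1;
}
```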
The entire framework is now lock-free (and wait-free as long as compatible work units are available for each thread), based primarily on atomic operations. To cleanly pack many data fields into a single atomic value and work with them efficiently, I've developed a new BitFields utility API. This is proving useful for cleaning up the HyperClockCache implementation as well, and will be the topic of a later blog post.
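The BitFields API itself will be covered in that later post; as a rough, hand-rolled illustration of the underlying pattern (with a hypothetical field layout), several logical counters can live in one 64-bit word and be updated together with a single compare-and-swap:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout: [ emitted:24 | compressed:24 | active_workers:8 | flags:8 ]
constexpr uint64_t kEmittedShift = 40;
constexpr uint64_t kCompressedShift = 16;
constexpr uint64_t kWorkersShift = 8;

std::atomic<uint64_t> state{0};

void IncrementCompressedCount() {
  uint64_t old_state = state.load(std::memory_order_acquire);
  uint64_t new_state;
  do {
    // Bump only the "compressed" field; other fields ride along unchanged.
    new_state = old_state + (uint64_t{1} << kCompressedShift);
  } while (!state.compare_exchange_weak(old_state, new_state,
                                        std::memory_order_acq_rel));
}
```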
Semaphores are used for lock-free management of idle threads (assuming a lock-free semaphore implementation, which is likely the case with ROCKSDB_USE_STD_SEMAPHORES, though some implementations have proven untrustworthy; see below).
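For illustration, the idle-thread pattern looks roughly like this when built on C++20 std::counting_semaphore (which is what ROCKSDB_USE_STD_SEMAPHORES opts into); the actual RocksDB semaphore abstraction may differ in its details:

```cpp
#include <semaphore>

// Idle workers park on a counting semaphore; waking one is a single release().
std::counting_semaphore<> idle_workers{0};

void ParkIdleWorker() {
  // Blocks without holding any lock until another thread posts work.
  idle_workers.acquire();
}

void WakeOneIdleWorker() {
  // Called only when the auto-scaling heuristic decides another compression
  // thread is worth the wake-up cost.
  idle_workers.release();
}
```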
The results speak for themselves. Here's a comparison using db_bench fillseq benchmarks with various compression configurations:
Note:
Before:
After:
For ZSTD compression level 8, the improvements are even more dramatic:
Before:
After:
Alongside the parallel compression revamp, some optimizations have gone into the underlying compression implementations/integrations. Most notably, LZ4HC received dramatic performance improvements through better reuse of internal data structures between compression calls (detailed in pull request #13805). A small regression in LZ4 performance from that change was fixed in pull request #14017.
While ZSTD remains the gold standard for medium-to-high compression ratios in RocksDB, these LZ4HC optimizations make it an increasingly attractive option for read-heavy workloads where LZ4's faster decompression can provide overall performance benefits.
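As a purely illustrative example of leaning on LZ4/LZ4HC for such a workload, one might use fast LZ4 on the upper levels and LZ4HC only at the bottommost level; the specific settings here are assumptions to validate against your own data and hardware:

```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeLz4ReadHeavyOptions() {
  rocksdb::Options options;
  // Fast LZ4 for the frequently rewritten upper levels.
  options.compression = rocksdb::kLZ4Compression;
  // Spend more CPU once, at the bottommost level, with LZ4HC.
  options.bottommost_compression = rocksdb::kLZ4HCCompression;
  options.bottommost_compression_opts.enabled = true;
  options.bottommost_compression_opts.level = 6;  // Illustrative LZ4HC effort level.
  return options;
}
```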
With these efficiency improvements, parallel compression is now considered production-ready. The feature has been thoroughly tested in both unit tests and stress testing, including validation on high-load scenarios with hundreds of concurrent compression jobs and thousands of threads.
Some notes on current limitations:
* Parallel compression is not currently supported in combination with UserDefinedIndex or with the deprecated decouple_partitioned_filters=false setting.
* C++ standard library semaphores can be enabled with -DROCKSDB_USE_STD_SEMAPHORES at compile time, though this is not currently recommended due to reported bugs in some implementations of C++20 semaphores.

The dramatically reduced CPU overhead means parallel compression is now viable for a broader range of workloads, particularly those using higher compression levels or compression-heavy scenarios like time-series data. However, simply enabling parallel compression could result in more spiky CPU loads for hosts serving live DB data. Parallel compression might be most useful for bulk SST file generation and/or remote compaction workloads because they are less sensitive to CPU responsiveness. In these scenarios there is little danger in setting parallel_threads=8 even with the possibility of over-subscribing CPU cores, though the potentially safer "sweet spot" is typically around parallel_threads=3, depending on compression level, etc.
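For the bulk SST generation case, a sketch using SstFileWriter might look like the following; the file path, key, and thread count are illustrative:

```cpp
#include <cassert>
#include "rocksdb/env.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_writer.h"

void WriteBulkSst() {
  rocksdb::Options options;
  options.compression = rocksdb::kZSTD;
  options.compression_opts.level = 8;
  // Offline bulk generation tolerates heavier CPU bursts.
  options.compression_opts.parallel_threads = 8;

  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  rocksdb::Status s = writer.Open("/tmp/bulk_000001.sst");
  assert(s.ok());
  s = writer.Put("key1", "value1");
  assert(s.ok());
  s = writer.Finish();
  assert(s.ok());
}
```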
Although this offers a great improvement in the implementation of an existing option, we recognize that this setup is suboptimal in a number of ways:
The parallel compression revamp will be available in RocksDB 10.7. As always, we recommend testing in your specific environment to determine the optimal configuration for your workload.