NOTE: Entries for next release do not go here. Follow instructions in
unreleased_history/README.txt
OpenAndCompactOptions::allow_resumption for resumable compaction that persists progress during OpenAndCompact(), allowing interrupted compactions to resume from the last persisted progress. The default behavior is to not persist progress.
DB::FlushWAL(const FlushWALOptions&) as an alternative to DB::FlushWAL(bool sync), where FlushWALOptions includes a new rate_limiter_priority field (default Env::IO_TOTAL) that allows rate limiting and priority passing of manual WAL flush's IO operations.
kChangeTemperature FIFO compaction will now honor compaction_target_temp to all levels regardless of cf_options::last_level_temperature
MultiScanArgs::max_prefetch_size that limits the memory usage of per file pinning of prefetched blocks.
sst_dump by allowing standalone file and directory arguments without --file=. Also added new options and better output for sst_dump --command=recompress. See sst_dump --help
estimated_entry_charge is now production-ready and is the preferred block cache implementation vs. LRUCache. Please consider updating your code to minimize the risk of hitting performance bottlenecks or anomalies from LRUCache. See cache.h for more detail.
block_cache is nullptr (default) and no_block_cache==false (default). We recommend explicitly creating a HyperClockCache block cache based on memory budget and sharing it across all column families and even DB instances. This change could expose previously hidden memory or resource leaks.
CompressionOptions::parallel_threads > 1), though this efficiency improvement makes parallel compression currently incompatible with UserDefinedIndex and with the old setting of decouple_partitioned_filters=false. Parallel compression is now considered a production-ready feature. Maximum performance is available with -DROCKSDB_USE_STD_SEMAPHORES at compile time, but this is not currently recommended because of reported bugs in implementations of std::counting_semaphore/binary_semaphore.
cf_allow_ingest_behind.
This option aims to replace DBOptions::allow_ingest_behind to enable ingest behind at the per-CF level. DBOptions::allow_ingest_behind is deprecated.MultiScanArgs::io_coalesce_threshold to allow a configurable IO coalescing threshold.IngestExternalFileOptions::allow_db_generated_files now allows files ingestion of any DB generated SST file, instead of only the ones with all keys having sequence number 0.decouple_partitioned_filters = true is now the default in BlockBasedTableOptions.allow_ingest_behind is enabled, compaction will no longer drop tombstones based on the absence of underlying data. Tombstones will be preserved to apply to ingested files.GetFileChecksumsFromCurrentManifest.memtable_avg_op_scan_flush_trigger that supports triggering memtable flush when an iterator scans through an expensive range of keys, with the average number of skipped keys from the active memtable exceeding the threshold.TransactionOptions::large_txn_commit_optimize_byte_threshold to enable optimizations for large transaction commit by transaction batch data size.CompactionOptionsUniversal::reduce_file_locking and if it's true, auto universal compaction picking will adjust to minimize locking of input files when bottom priority compactions are waiting to run. This can increase the likelihood of existing L0s being selected for compaction, thereby improving write stall and reducing read regression.format_version=7 to aid experimental support of custom compression algorithms with CompressionManager and block-based table. This format version includes changing the format of TableProperties::compression_name.CompactOnDeletionCollectorFactory.TransactionOptions::large_txn_commit_optimize_threshold now has default value 0 for disabled. 
TransactionDBOptions::txn_commit_bypass_memtable_threshold now has no effect on transactions.CompactionOptionsFIFO::allow_trivial_copy_when_change_temperature along with CompactionOptionsFIFO::trivial_copy_buffer_size to allow optimizing FIFO compactions with tiering when kChangeTemperature to move files from source tier FileSystem to another tier FileSystem via trivial and direct copying raw sst file instead of reading thru the content of the SST file then rebuilding the table files.TransactionOptions::large_txn_commit_optimize_threshold to enable optimizations for large transaction commit with per transaction threshold. TransactionDBOptions::txn_commit_bypass_memtable_threshold is deprecated in favor of this transaction option.request_id to be passed to RocksDB and propagated to the filesystem via IODebugContextCompressionOptions::parallel_threads > 1 and a TablePropertiesCollector overriding BlockAdd().COMPACTION_PREFETCH_BYTES to measure number of bytes for RocksDB's prefetching (as opposed to file
system's prefetch) on SST file during compaction readIngestWriteBatchWithIndex() for ingesting updates into DB while bypassing memtable writes. This improves performance when writing a large write batch to the DB.memtable_op_scan_flush_trigger that triggers a flush of the memtable if an iterator's Seek()/Next() scans over a certain number of invisible entries from the memtable.DB::MaxMemCompactionLevel().ReadOptions::ignore_range_deletions.experimental::PromoteL0().PREFETCH_BYTES_USEFUL, PREFETCH_HITS, PREFETCH_BYTES only account for prefetching during user initiated scanFileSystem::ReopenWritableFile internally does not track the correct file size.DBOptions.calculate_sst_write_lifetime_hint_set setting that allows to customize which compaction styles SST write lifetime hint calculation is allowed on. Today RocksDB supports only two modes kCompactionStyleLevel and kCompactionStyleUniversal.num_l0_files in CompactionJobInfo about the number of L0 files in the CF right before and after the compactionGetAllKeyVersions() now interprets empty slices literally, as valid keys, and uses new OptSlice type default value for extreme upper and lower range limits.DeleteFilesInRanges() now takes RangeOpt which is based on OptSlice. The overload taking RangePtr is deprecated.CompressedSecondaryCacheOptions::compress_format_version == 1 is also deprecated.ldb now returns an error if the specified --compression_type is not supported in the build.auto_refresh_iterator_with_snapshot opt-in knob that (when enabled) will periodically release obsolete memory and storage resources for as long as the iterator is making progress and its supplied read_options.snapshot was initialized with non-nullptr value.FaissIVFIndex and SecondaryIndex for more details. 
Note: the FAISS integration currently requires using the BUCK build.
num_running_compaction_sorted_runs that tracks the number of sorted runs being processed by currently running compactions
SimpleSecondaryIndex and SecondaryIndex for more details.
TransactionDBOptions::txn_commit_bypass_memtable_threshold, which enables optimized transaction commit (see TransactionOptions::commit_bypass_memtable) when the transaction size exceeds a configured threshold.
SecondaryIndex::NewIterator virtual and adding a SecondaryIndexIterator class that can be utilized by applications to find the primary keys for a given search target.
SecondaryIndex::GetSecondary{KeyPrefix,Value} as well as the addition of a new method SecondaryIndex::FinalizeSecondaryKeyPrefix. See the API comments for more details.
CompressionType kZSTDNotFinalCompression is also removed.
VerifyBackup in verify_with_checksum=true mode will now evaluate checksums in parallel. As a result, unlike the original implementation, the API won't bail out on the very first corruption / mismatch and instead will iterate over all the backup files, logging success / degree of failure for each.
overwrite_key=true, this affects the output only if Merge is used (#13387).
TransactionOptions::commit_bypass_memtable.
GetMergeOperands() that can return incorrect status (MergeInProgress) and incorrect number of merge operands. This can happen when GetMergeOperandsOptions::continue_cb is set, both active and immutable memtables have merge operands, and the callback stops the lookup at the immutable memtable.
SecondaryIndex interface. See the SecondaryIndex API comments for more details. Note: this feature is currently only available in conjunction with write-committed pessimistic transactions, and Merge is not yet supported.
track_and_verify_wals to track and verify various information about WAL during WAL recovery.
This is intended to be a better replacement to track_and_verify_wals_in_manifest.io_buffer_size to BackupEngineOptions to enable optimal configuration of IO sizerandom_access_max_buffer_size, related rules and all the clients wrappers. This option has been officially deprecated in 5.4.0.file_ingestion_nanos and file_ingestion_blocking_live_writes_nanos in PerfContext to observe file ingestionsstd::unique_ptr<DB>* output parameters and deprecate the old versions that use DB** output parameters.max_compaction_bytes. This prevents overly large compactions in some cases (#13306).preclude_last_level_data_seconds and preserve_internal_time_seconds are now mutable with SetOptions(). Some changes to handling of these features along with long-lived snapshots and range deletes made this possible.TransactionOptions::commit_bypass_memtable to enable transaction commit to bypass memtable insertions. This can be beneficial for transactions with many operations, as it reduces commit time that is mostly spent on memtable insertion.value parameter can be null, and it will be set only if value_found is passed in.Transaction::GetAttributeGroupIterator that can be used to create a multi-column-family attribute group iterator over the specified column families, including the data from both the transaction and the underlying database. This API is currently supported for optimistic and write-committed pessimistic transactions.Transaction::GetCoalescingIterator that can be used to create a multi-column-family coalescing iterator over the specified column families, including the data from both the transaction and the underlying database. This API is currently supported for optimistic and write-committed pessimistic transactions.BaseDeltaIterator now honors the read option allow_unprepared_value.BaseDeltaIterator now calls PrepareValue on the base iterator in case it has been created with the allow_unprepared_value read option set. 
Earlier, such base iterators could lead to incorrect values being exposed from BaseDeltaIterator.DB::Close()) that are written but never become live due to various failures. (We now have a check for such leaks with no outstanding issues.)Options::compaction_readahead_size greater than max_sectors_kb (i.e, largest I/O size that the OS issues to a block device defined in linux)block_cache options in BlockBasedTableOptions are now mutable with DB::SetOptions(). See also Bug Fixes below.allow_unprepared_value and the iterator API PrepareValue.IngestExternalFileOptions::fill_cache to support not adding blocks from ingested files into block cache during file ingestion.allow_unprepared_value is now also supported for multi-column-family iterators (i.e. CoalescingIterator and AttributeGroupIterator).Seek() are not prefetched when ReadOptions::auto_readahead_size=true (default value) and ReadOptions::prefix_same_as_start = trueblock_based_table_factory options. The fix has some subtle behavior changes because of copying and replacing the TableFactory on a change with SetOptions, including requiring an Iterator::Refresh() for an existing Iterator to use the latest options.GetApproximateMemTableStats() could return disastrously bad estimates 5-25% of the time. The function has been re-engineered to return much better estimates with similar CPU cost.GetOptionsFromString(), possibly elsewhere as well. Affected options, now fixed:background_close_inactive_wals, write_dbid_to_manifest, write_identity_file, prefix_seek_opt_in_onlyprefix_seek_opt_in_only that makes iterators generally safer when you might set a prefix_extractor. When prefix_seek_opt_in_only=true, which is expected to be the future default, prefix seek is only used when prefix_same_as_start or auto_prefix_mode are set. 
Also, prefix_same_as_start and auto_prefix_mode now allow prefix filtering even with total_order_seek=true.blob_garbage_collection_force_threshold to define a threshold for the overall garbage ratio of all blob files currently eligible for garbage collection (according to blob_garbage_collection_age_cutoff). This can provide better control over space amplification at the cost of slightly higher write amplification.write_dbid_to_manifest=true by default. This means DB ID will now be preserved through backups, checkpoints, etc. by default. Also add write_identity_file option which can be set to false for anticipated future behavior.file_temperature_age_thresholds) will compact one file at a time, instead of merging multiple eligible file together (#13018).IngestExternalFileOptions::link_files to hard link input files and preserve original files links after ingestion.prefix_extractor with memtable prefix filter. Previously, prefix seek could mix different prefix interpretations between memtable and SST files. Now the latest prefix_extractor at the time of iterator creation or refresh is respected.BlockBasedTableOptions::decouple_partitioned_filters should improve efficiency in serving read queries because filter and index partitions can consistently target the configured metadata_block_size. This option is currently opt-in.paranoid_memory_checks. It enables additional validation on data integrity during reads/scanning. Currently, skip list based memtable will validate key ordering during look up and scans.SstFileManager. 
The slow deletion is subject to the configured rate_bytes_per_sec, but not subject to the max_trash_db_ratio.unordered_write mode.IngestExternalFileOptions::allow_db_generated_files.log_size_for_flush argument in CreateCheckpoint API, the size of the archived log will not be included to avoid unnecessary flushwhole_key_filtering=false and partition_filters=true.OnErrorRecoveryBegin() is not called before auto recovery starts.bg_error_ member without holding db mutex(#12803).CompactForTieringCollectorFactory to auto trigger compaction for tiering use case.GetEntityForUpdate API.rocksdb_writebatch_update_timestamps, rocksdb_writebatch_wi_update_timestamps in C API.rocksdb_iter_refresh in C API.rocksdb_writebatch_create_with_params, rocksdb_writebatch_wi_create_with_params to create WB and WBWI with all options in C APILogFile and VectorLogPtr in favor of new names WalFile and VectorWalPtr.level0_file_num_compaction_trigger) #12477.background_close_inactive_wals.ldb dump_wal command for PutEntity records so it prints the key and correctly resets the hexadecimal formatting flag after printing the wide-column entity.PutEntity records were handled incorrectly while rebuilding transactions during recovery.GetEntity API.Iterator property, "rocksdb.iterator.is-value-pinned", for checking whether the Slice returned by Iterator::value() can be used until the Iterator is destroyed.MultiGetEntity API.PutEntity API. Support for read APIs and other write policies (WritePrepared, WriteUnprepared) will be added later.DBOptions::allow_2pc == true (all TransactionDBs except OptimisticTransactionDB) that have exactly one column family. 
Due to a missing WAL sync, attempting to open the DB could have returned a Status::Corruption with a message like "SST file is ahead of WALs".
ColumnFamilyOptions::inplace_update_support == true between user overwrites and reads on the same key.
CompactFiles() can compact files whose ranges conflict with other ongoing compactions when preclude_last_level_data_seconds > 0 is used
Status::Corruption reported when reopening a DB that used DBOptions::recycle_log_file_num > 0 and DBOptions::wal_compression != kNoCompression.
deadline and max_size_bytes for CacheDumper to exit early
GetEntityFromBatchAndDB to WriteBatchWithIndex that can be used for wide-column point lookups with read-your-own-writes consistency. Similarly to GetFromBatchAndDB, the API can combine data from the write batch with data from the underlying database if needed. See the API comments for more details.
MultiGetEntityFromBatchAndDB to WriteBatchWithIndex that can be used for batched wide-column point lookups with read-your-own-writes consistency. Similarly to MultiGetFromBatchAndDB, the API can combine data from the write batch with data from the underlying database if needed. See the API comments for more details.
SstFileReader::NewTableIterator API to support programmatically reading an SST file as a raw table file.
WaitForCompactOptions - wait_for_purge to make WaitForCompact() API wait for background purge to complete
CompactionOptions::compression since CompactionOptions's API for configuring compression was incomplete, unsafe, and likely unnecessary
OptionChangeMigration() to migrate from non-FIFO to FIFO compaction with Options::compaction_options_fifo.max_table_files_size > 0 can cause the whole DB to be dropped right after migration if the migrated data is larger than max_table_files_size
BlockBasedTableOptions::block_align is now incompatible (i.e., APIs will return Status::InvalidArgument) with more ways of enabling compression: CompactionOptions::compression, ColumnFamilyOptions::compression_per_level, and ColumnFamilyOptions::bottommost_compression.
CompactionOptions::compression to kDisableCompressionOption, which means the compression type is determined by the ColumnFamilyOptions.
BlockBasedTableOptions::optimize_filters_for_memory is now set to true by default. When partition_filters=false, this could lead to somewhat increased average RSS memory usage by the block cache, but this "extra" usage is within the allowed memory budget and should make memory usage more consistent (by minimizing internal fragmentation for more kinds of blocks).
SetDumpFilter() is not called
CompactRange() with CompactRangeOptions::change_level = true and CompactRangeOptions::target_level = 0 that ends up moving more than 1 file from non-L0 to L0 will return Status::Aborted().
VerifyFileChecksums() to return false-positive corruption under BlockBasedTableOptions::block_align=true
NewIterators() API.
DeleteRange() together with ColumnFamilyOptions::memtable_insert_with_hint_prefix_extractor. The impact of this bug would likely be corruption or crashing.
DisableManualCompactions() where compactions waiting to be scheduled due to conflicts would not be canceled promptly
ColumnFamilyOptions::max_successive_merges > 0 where the CPU overhead for deciding whether to merge could have increased unless the user had set the option ColumnFamilyOptions::strict_max_successive_merges
MultiGet() and MultiGetEntity() together with blob files (ColumnFamilyOptions::enable_blob_files == true).
An error looking up one of the keys could cause the results to be wrong for other keys for which the statuses were Status::OK.
DataVerificationInfo::checksum upon file creation
PinnableWideColumns.
SstFileManager's slow deletion feature even if it's configured.
GetMergeOperandsOptions::continue_cb, to give users the ability to end GetMergeOperands()'s lookup process before all merge operands were found.
default_write_temperature CF option and opening an SstFileWriter with a temperature.
WriteBatchWithIndex now supports wide-column point lookups via the GetEntityFromBatch API. See the API comments for more details.
Iterator::GetProperty("rocksdb.iterator.write-time") to allow users to get data's approximate write unix time and write data with a specific write time via WriteBatch::TimedPut API.
best_efforts_recovery == true) may now be used together with atomic flush (atomic_flush == true). The all-or-nothing recovery guarantee for atomically flushed data will be upheld.
bottommost_temperature, already replaced by last_level_temperature
WriteCommittedTransaction::GetForUpdate, if the column family enables user-defined timestamp, it was mandated that argument do_validate cannot be false, and UDT based validation has to be done with a user set read timestamp. It's updated to make the UDT based validation optional if the user sets do_validate to false and does not set a read timestamp. With this, GetForUpdate skips UDT based validation and it is the users' responsibility to enforce the UDT invariant. SO DO NOT skip this UDT-based validation if users do not have ways to enforce the UDT invariant.
Ways to enforce the invariant on the users side include manage a monotonically increasing timestamp, commit transactions in a single thread etc.kEnableWait to measure time spent by user threads blocked in RocksDB other than mutex, such as a write thread waiting to be added to a write group, a write thread delayed or stalled etc.RateLimiter's API no longer requires the burst size to be the refill size. Users of NewGenericRateLimiter() can now provide burst size in single_burst_bytes. Implementors of RateLimiter::SetSingleBurstBytes() need to adapt their implementations to match the changed API doc.write_memtable_time to the newly introduced PerfLevel kEnableWait.RateLimiters created by NewGenericRateLimiter() no longer modify the refill period when SetSingleBurstBytes() is called.ColumnFamilyOptions::max_successive_merges when the key's merge operands are all found in memory, unless strict_max_successive_merges is explicitly set.kBlockCacheTier reads to return Status::Incomplete when I/O is needed to fetch a merge chain's base value from a blob file.kBlockCacheTier reads to return Status::Incomplete on table cache miss rather than incorrectly returning an empty value.multiGet() variants now take advantage of the underlying batched multiGet() performance improvements.
Before
Benchmark                          (columnFamilyTestType)  (keyCount)  (keySize)  (multiGetSize)  (valueSize)  Mode   Cnt  Score     Error     Units
MultiGetBenchmarks.multiGetList10  no_column_family        10000       16         100             64           thrpt  25   6315.541  ± 8.106   ops/s
MultiGetBenchmarks.multiGetList10  no_column_family        10000       16         100             1024         thrpt  25   6975.468  ± 68.964  ops/s
After
Benchmark                          (columnFamilyTestType)  (keyCount)  (keySize)  (multiGetSize)  (valueSize)  Mode   Cnt  Score     Error     Units
MultiGetBenchmarks.multiGetList10  no_column_family        10000       16         100             64           thrpt  25   7046.739  ± 13.299  ops/s
MultiGetBenchmarks.multiGetList10  no_column_family        10000       16         100             1024         thrpt  25   7654.521  ± 60.121  ops/s
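To put the Before and After scores in perspective, the relative throughput change can be computed directly from the reported ops/s values. The sketch below is only illustrative arithmetic; the helper name PercentImprovement is hypothetical and is not part of RocksDB or JMH.

```cpp
#include <cassert>

// Hypothetical helper (not a RocksDB or JMH API): relative change between a
// "before" and an "after" throughput score, expressed in percent.
double PercentImprovement(double before_ops, double after_ops) {
  assert(before_ops > 0.0);  // a zero baseline would make the ratio undefined
  return (after_ops - before_ops) / before_ops * 100.0;
}

// Scores (ops/s) copied from the benchmark tables above.
constexpr double kBefore64 = 6315.541;
constexpr double kAfter64 = 7046.739;
constexpr double kBefore1024 = 6975.468;
constexpr double kAfter1024 = 7654.521;
```

With these inputs, PercentImprovement(kBefore64, kAfter64) is roughly 11.6 and PercentImprovement(kBefore1024, kAfter1024) is roughly 9.7, i.e. about a 10% throughput gain for both value sizes in this particular benchmark.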
SstFileWriter create SST files without persisting user defined timestamps when the Option.persist_user_defined_timestamps flag is set to false.DeleteFilesInRanges and GetPropertiesOfTablesInRange.access_hint_on_compaction_startColumnFamilyOptions::check_flush_compaction_key_orderWritableFile::GetFileSize and FSWritableFile::GetFileSize implementation that returns 0 and make it pure virtual, so that subclasses are enforced to explicitly provide an implementation.ColumnFamilyOptions::level_compaction_dynamic_file_sizeEnableFileDeletions API because it is unsafe with no known legitimate use.ColumnFamilyOptions::ignore_max_compaction_bytes_for_inputsst_dump --command=check now compares the number of records in a table with num_entries in table property, and reports corruption if there is a mismatch. API SstFileDumper::ReadSequential() is updated to optionally do this verification. (#12322)DBImpl::RenameTempFileToOptionsFile.rocksdb.sst.write.micros measures time of each write to SST file; rocksdb.file.write.{flush|compaction|db.open}.micros measure time of each write to SST table (currently only block-based table format) and blob file for flush, compaction and db open.kVerify to enum class FileOperationType in listener.h. Update your switch statements as needed.level_compaction_dynamic_file_size, ignore_max_compaction_bytes_for_input, check_flush_compaction_key_order, flush_verify_memtable_count, compaction_verify_record_count, fail_if_options_file_error, and enforce_single_del_contractsrocksdb.blobdb.blob.file.write.micros expands to also measure time writing the header and footer. Therefore the COUNT may be higher and values may be smaller than before. 
For stacked BlobDB, it no longer measures the time of explicitly flushing blob file.
rocksdb.blobdb.blob.file.synced includes blob files that failed to get synced and rocksdb.blobdb.blob.file.bytes.written includes blob bytes that failed to get written.
BackupEngine, sst_dump, or ldb.
preclude_last_level_data_seconds option that could interfere with expected data tiering.
WriteBatchWithIndex. This includes the PutEntity API and support for wide columns in the existing read APIs (GetFromBatch, GetFromBatchAndDB, MultiGetFromBatchAndDB, and BaseDeltaIterator).
TablePropertiesCollectorFactory may now return a nullptr collector to decline processing a file, reducing callback overheads in such cases.
HyperClockCacheOptions::eviction_effort_cap controls the space-time trade-off of the response. The default should be generally well-balanced, with no measurable effect on normal operation.
1 Extended RocksDB.get([ColumnFamilyHandle columnFamilyHandle,] ReadOptions opt, ByteBuffer key, ByteBuffer value) which now accepts indirect buffer parameters as well as direct buffer parameters
2 Extended RocksDB.put( [ColumnFamilyHandle columnFamilyHandle,] WriteOptions writeOpts, final ByteBuffer key, final ByteBuffer value) which now accepts indirect buffer parameters as well as direct buffer parameters
3 Added RocksDB.merge([ColumnFamilyHandle columnFamilyHandle,] WriteOptions writeOptions, ByteBuffer key, ByteBuffer value) methods with the same parameter options as put(...) - direct and indirect buffers are supported
4 Added RocksIterator.key( byte[] key [, int offset, int len]) methods which retrieve the iterator key into the supplied buffer
5 Added RocksIterator.value( byte[] value [, int offset, int len]) methods which retrieve the iterator value into the supplied buffer
6 Deprecated get(final ColumnFamilyHandle columnFamilyHandle, final ReadOptions readOptions, byte[]) in favour of get(final ReadOptions readOptions, final ColumnFamilyHandle columnFamilyHandle, byte[]) which has consistent parameter ordering with other methods in the same class
7 Added Transaction.get( ReadOptions opt, [ColumnFamilyHandle columnFamilyHandle, ] byte[] key, byte[] value) methods which retrieve the requested value into the supplied buffer
8 Added Transaction.get( ReadOptions opt, [ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value) methods which retrieve the requested value into the supplied buffer
9 Added Transaction.getForUpdate( ReadOptions readOptions, [ColumnFamilyHandle columnFamilyHandle, ] byte[] key, byte[] value, boolean exclusive [, boolean doValidate]) methods which retrieve the requested value into the supplied buffer
10 Added Transaction.getForUpdate( ReadOptions readOptions, [ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value, boolean exclusive [, boolean doValidate]) methods which retrieve the requested value into the supplied buffer
11 Added Transaction.getIterator() method as a convenience which defaults the ReadOptions value supplied to existing Transaction.iterator() methods. This mirrors the existing RocksDB.iterator() method.
12 Added Transaction.put([ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value [, boolean assumeTracked]) methods which supply the key, and the value to be written in a ByteBuffer parameter
13 Added Transaction.merge([ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value [, boolean assumeTracked]) methods which supply the key, and the value to be written/merged in a ByteBuffer parameter
14 Added Transaction.mergeUntracked([ColumnFamilyHandle columnFamilyHandle, ] ByteBuffer key, ByteBuffer value) methods which supply the key, and the value to be written/merged in a ByteBuffer parameterEnableFileDeletion API not default to force enabling. For users that rely on this default behavior and still
want to continue to use force enabling, they need to explicitly pass a true to EnableFileDeletion.daily_offpeak_time_utc, the compaction picker will select a larger number of files for periodic compaction. This selection will include files that are projected to expire by the next off-peak start time, ensuring that these files are not chosen for periodic compaction outside of off-peak hours.DB::StartTrace(), the subsequent trace writes are skipped to avoid writing to a file that has previously seen error. In this case, DB::EndTrace() will also return a non-ok status with info about the error occurred previously in its status message.TablePropertiesCollector::Finish() once.WAL_ttl_seconds > 0, we now process archived WALs for deletion at least every WAL_ttl_seconds / 2 seconds. Previously it could be less frequent in case of small WAL_ttl_seconds values when size-based expiration (WAL_size_limit_MB > 0) was simultaneously enabled.rocksdb.fifo.{max.size|ttl}.compactions to count FIFO compactions that drop files for different reasonsDBOptions::daily_offpeak_time_utc in "HH:mm-HH:mm" format. This information will be used for resource optimization in the futureSetSingleBurstBytes() for RocksDB rate limiterDBOptions::fail_if_options_file_error changed from false to true. Operations that set in-memory options (e.g., DB::Open*(), DB::SetOptions(), DB::CreateColumnFamily*(), and DB::DropColumnFamily()) but fail to persist the change will now return a non-OK Status by default.Options::compaction_readahead_size is 0Status::NotSupported()max_successive_merges logic.create_missing_column_families=true and many column families.COMPACTION_CPU_TOTAL_TIME that records cumulative compaction cpu time. This ticker is updated regularly while a compaction is running.GetEntity() API for ReadOnly DB and Secondary DB.Iterator::Refresh(const Snapshot *) that allows iterator to be refreshed while using the input snapshot to read.merge_operand_count_threshold. 
When the number of merge operands applied during a successful point lookup exceeds this threshold, the query will return a special OK status with a new subcode kMergeOperandThresholdExceeded. Applications might use this signal to take action to reduce the number of merge operands for the affected key(s), for example by running a compaction.NewRibbonFilterPolicy(), made the bloom_before_level option mutable through the Configurable interface and the SetOptions API, allowing dynamic switching between all-Bloom and all-Ribbon configurations, and configurations in between. See comments on NewRibbonFilterPolicy()NewTieredCache() API in rocksdb/cache.h..FullMergeV3 to MergeOperator. FullMergeV3 supports wide columns both as base value and merge result, which enables the application to perform more general transformations during merges. For backward compatibility, the default implementation implements the earlier logic of applying the merge operation to the default column of any wide-column entities. Specifically, if there is no base value or the base value is a plain key-value, the default implementation falls back to FullMergeV2. If the base value is a wide-column entity, the default implementation invokes FullMergeV2 to perform the merge on the default column, and leaves any other columns unchanged.CompactionFilter::Context. See CompactionFilter::Context::input_start_level,CompactionFilter::Context::input_table_properties for more.Options::compaction_readahead_size 's default value is changed from 0 to 2MB.acceleration parameter is configurable by setting the negated value in CompressionOptions::level. For example, CompressionOptions::level=-10 will set acceleration=10NewTieredCache API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in TieredCacheOptions. 
Any capacity specified in LRUCacheOptions, HyperClockCacheOptions and CompressedSecondaryCacheOptions is ignored. A new API, UpdateTieredCache is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy.NewTieredVolatileCache() API in rocksdb/cache.h has been renamed to NewTieredCache().Options::compaction_readahead_size is explicitly set to 0IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_87732_4230653031040984171/000013.sst.LRUCache before returning, thus incurring locking overhead. With this fix, inserts and lookups are no-ops and do not add any overhead.MultiGet for cleaning up SuperVersion acquired with locking db mutex.GenericRateLimiter that could cause it to stop granting requestsrocksdb.file.read.verify.file.checksums.micros is not populatedcompaction_verify_record_count is introduced for this purpose and is enabled by default.bottommost_file_compaction_delay to allow specifying the delay of bottommost level single-file compactions.memtable_max_range_deletions that limits the number of range deletions in a memtable. RocksDB will try to do an automatic flush after the limit is reached. (#11358)timeout in microsecond option to WaitForCompactOptions to allow timely termination of prolonged waiting in scenarios like recurring recoverable errors, such as out-of-space situations and continuous write streams that sustain ongoing flush and compactionsrocksdb.file.read.{get|multiget|db.iterator|verify.checksum|verify.file.checksums}.micros measure read time of block-based SST tables or blob files during db open, Get(), MultiGet(), using db iterator, VerifyFileChecksums() and VerifyChecksum(). 
They require stats level greater than StatsLevel::kExceptDetailedTimers.
WaitForCompactOptions to call Close() after waiting is done.
CompressionOptions::checksum for enabling ZSTD's checksum feature to detect corruption during decompression.
Options::access_hint_on_compaction_start related APIs as deprecated. See #11631 for alternative behavior.
rocksdb.sst.read.micros now includes time spent on multi read and async read into the file
periodic_compaction_seconds) will be set to 30 days by default if block based table is used.
GeneralCache and MakeSharedGeneralCache() as our plan changed to stop exposing a general-purpose cache interface. The old forms of these APIs, Cache and NewLRUCache(), are still available, although general-purpose caching support will be dropped eventually.
periodic_compaction_seconds no longer supports FIFO compaction: setting it has no effect on FIFO compactions. FIFO compaction users should only set option ttl instead.
AdvancedColumnFamilyOptions.persist_user_defined_timestamps in the Manifest and table properties for a SST file when it is created. And use the recorded flag when creating a table reader for the SST file. This flag is only explicitly recorded if it's false.
rocksdb.files.marked.trash.deleted to track the number of trash files deleted by background thread from the trash queue.
WaitForCompact() to wait for all flush and compaction jobs to finish. Jobs to wait include the unscheduled (queued, but not scheduled yet).
WriteBatch::Release() that releases the batch's serialized data to the caller.
rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio.
rocksdb.error.handler.bg.error.count, rocksdb.error.handler.bg.io.error.count, rocksdb.error.handler.bg.retryable.io.error.count to replace the misspelled ones: rocksdb.error.handler.bg.errro.count, rocksdb.error.handler.bg.io.errro.count, rocksdb.error.handler.bg.retryable.io.errro.count ('error' instead of 'errro').
Users should switch to use the new tickers before 9.0 release as the misspelled old tickers will be completely removed then.
level_compaction_dynamic_level_bytes to true. This affects users who use leveled compaction and do not set this option explicitly. These users may see additional background compactions following DB open. These compactions help to shape the LSM according to level_compaction_dynamic_level_bytes such that the size of each level Ln is approximately the size of Ln-1 * max_bytes_for_level_multiplier. Turning on this option has other benefits too: see more detail in wiki: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size and in option comment in advanced_options.h (#11525).
CompactRange() will now always try to compact to the last non-empty level. (#11468)
For Leveled Compaction users, CompactRange() with bottommost_level_compaction = BottommostLevelCompaction::kIfHaveCompactionFilter will behave similarly to kForceOptimized in that it will skip files created during this manual compaction when compacting files in the bottommost level. (#11468)
allow_ingest_behind=true (currently only Universal compaction is supported), files in the last level, i.e. the ingested files, will not be included in any compaction. (#11489)
rocksdb.sst.read.micros scope is expanded to all SST reads except for file ingestion and column family import (some compaction reads were previously excluded).
block_protection_bytes_per_key, which can be used to enable per key-value integrity protection for in-memory blocks in block cache (#11287).
JemallocAllocatorOptions::num_arenas. Setting num_arenas > 1 may mitigate mutex contention in the allocator, particularly in scenarios where block allocations commonly bypass jemalloc tcache.
ShardedCacheOptions::hash_seed, which also documents the solved problem in more detail.
CompactionOptionsFIFO::file_temperature_age_thresholds that allows FIFO compaction to compact files to different temperatures based on key age (#11428).
BLOCK_CHECKSUM_MISMATCH_COUNT.
rocksdb.file.read.db.open.micros that measures read time of block-based SST tables or blob files during db open.
_LEVEL_SEEK_*. (#11460)
DB::ClipColumnFamily to clip the key in CF to a certain range. It will physically delete all keys outside the range including tombstones.
MakeSharedCache() construction functions to various cache Options objects, and deprecated the NewWhateverCache() functions with long parameter lists.
_LEVEL_SEEK_*. stats. (#11460)
SstFileWriter::DeleteRange() now returns Status::InvalidArgument if the range's end key comes before its start key according to the user comparator.
Previously the behavior was undefined.
multi_get_for_update to C API.
level_compaction_dynamic_level_bytes=true, RocksDB now trivially moves levels down to fill LSM starting from bottommost level during DB open. See more in comments for option level_compaction_dynamic_level_bytes (#11321).
ReadOptions take effect for more reads of non-CacheEntryRole::kDataBlock blocks.
level_compaction_dynamic_level_bytes=true, RocksDB now drains unnecessary levels through background compaction automatically (#11340). This together with #11321 makes it automatic to migrate other compaction settings to level compaction with level_compaction_dynamic_level_bytes=true. In addition, a live DB that becomes smaller will now have unnecessary levels drained which can help to reduce read and space amp.
CompactRange() is called with CompactRangeOptions::bottommost_level_compaction=kForce* to compact from L0 to L1, RocksDB now will try to do trivial move from L0 to L1 and then do an intra L1 compaction, instead of a L0 to L1 compaction with trivial move disabled (#11375).
PerfContext counters iter_{next|prev|seek}_count for db iterator, each counting the times of corresponding API being called.
WriteBufferManager allows stall or not by calling SetAllowStall()
rocksdb.file.read.{flush|compaction}.micros that measure read time of block-based SST tables or blob files during flush or compaction.
internal_merge_count PerfContext counter.
WriteOptions::disableWAL == true (#11148).
internal_merge_point_lookup_count which tracks the number of Merge operands applied while serving point lookup queries.
HyperClockCacheOptions inherits secondary_cache option from ShardedCacheOptions)
rocksdb.cf-write-stall-stats, rocksdb.db-write-stall-stats and APIs to examine them in a structured way.
In particular, users of GetMapProperty() with property kCFWriteStallStats/kDBWriteStallStats can now use the functions in WriteStallStatsMapKeys to find stats in the map.
Cache that are mostly relevant to custom implementations or wrappers. Especially, asynchronous lookup functionality is moved from Lookup() to a new StartAsyncLookup() function.
ReadOptions::verify_checksums=false disables checksum verification for more reads of non-CacheEntryRole::kDataBlock blocks.
ColumnFamilyData::flush_reason caused by concurrent flushes.
Get and MultiGet when user-defined timestamps are enabled in combination with BlobDB.
LockWAL() such as allowing concurrent/recursive use and not expecting UnlockWAL() after non-OK result. See API comments.
GetEntity would expose the blob reference instead of the blob value.
DisableManualCompaction() and CompactRangeOptions::canceled to cancel compactions even when they are waiting on conflicting compactions to finish
GetMergeOperands() could transiently return Status::MergeInProgress()
LoadOptionsFromFile, LoadLatestOptions, CheckOptionsCompatibility.
BLOCK_CACHE_INDEX_BYTES_EVICT, BLOCK_CACHE_FILTER_BYTES_EVICT, BLOOM_FILTER_MICROS, NO_FILE_CLOSES, STALL_L0_SLOWDOWN_MICROS, STALL_MEMTABLE_COMPACTION_MICROS, STALL_L0_NUM_FILES_MICROS, RATE_LIMIT_DELAY_MILLIS, NO_ITERATORS, NUMBER_FILTERED_DELETES, WRITE_TIMEDOUT, BLOB_DB_GC_NUM_KEYS_OVERWRITTEN, BLOB_DB_GC_NUM_KEYS_EXPIRED, BLOB_DB_GC_BYTES_OVERWRITTEN, BLOB_DB_GC_BYTES_EXPIRED, BLOCK_CACHE_COMPRESSION_DICT_BYTES_EVICT as well as the histograms STALL_L0_SLOWDOWN_COUNT, STALL_MEMTABLE_COMPACTION_COUNT, STALL_L0_NUM_FILES_COUNT, HARD_RATE_LIMIT_DELAY_COUNT, SOFT_RATE_LIMIT_DELAY_COUNT, BLOB_DB_GC_MICROS, and NUM_DATA_BLOCKS_READ_PER_LEVEL. Note that as a result, the C++ enum values of the still supported statistics have changed.
Developers are advised to not rely on the actual numeric values.
GetColumnFamilyOptionsFrom{Map|String}(const ColumnFamilyOptions&, ..), GetDBOptionsFrom{Map|String}(const DBOptions&, ..), GetBlockBasedTableOptionsFrom{Map|String}(const BlockBasedTableOptions& table_options, ..) and GetPlainTableOptionsFrom{Map|String}(const PlainTableOptions& table_options,..).
Status::Corruption, Status::SubCode::kMergeOperatorFailed, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions
make build now builds a shared library by default instead of a static library. Use LIB_MODE=static to override.
FilterV3 API. See the comment of the API for more details.
do_not_compress_roles to CompressedSecondaryCacheOptions to disable compression on certain kinds of block. Filter blocks are now not compressed by CompressedSecondaryCache by default.
MultiGetEntity API that enables batched wide-column point lookups. See the API comments for more details.
epoch_number and sort L0 files by epoch_number instead of largest_seqno. epoch_number represents the order of a file being flushed or ingested/imported. Compaction output file will be assigned with the minimum epoch_number among input files'. For L0, larger epoch_number indicates newer L0 file.
iterate_upper_bound is processed.
CompactionOptionsFIFO::max_table_files_size is not exceeded since #10348 or 7.8.0.
DB::SyncWAL() affecting track_and_verify_wals_in_manifest. Without the fix, application may see "open error: Corruption: Missing WAL with log number" while trying to open the db. The corruption is a false alarm but prevents DB open (#10892).
epoch_number. Before the fix, force_consistency_checks=true may catch the corruption before it's exposed to readers, in which case writes returning Status::Corruption would be expected.
Also replace the previous incomplete fix (#5958) to the same corruption with this new and more complete fix.
CompactRange() under change_level=true acts on overlapping range with an ongoing file ingestion for level compaction. This will either result in overlapping file ranges corruption at a certain level caught by force_consistency_checks=true or potentially two same keys both with seqno 0 in two different levels (i.e., new data ends up in lower/older level). The latter will be caught by assertion in debug build but go silently and result in read returning wrong result in release build. This fix is general so it also replaced previous fixes to a similar problem for CompactFiles() (#4665), general CompactRange() and auto compaction (commit 5c64fb6 and 87dfc1d).
CreateBackupOptions::exclude_files_callback. To restore the DB, the excluded files must be provided in alternative backup directories using RestoreOptions::alternate_dirs.
MergeOperationOutput::op_failure_scope for merge operator users to control the blast radius of merge operator failures. Existing merge operator users do not need to make any change to preserve the old behavior
Status::Corruption in case of force_consistency_checks=true (default). It affects use cases that enable both parallel flush (max_background_flushes > 1 or max_background_jobs >= 8) and non-default memtable count (max_write_buffer_number > 2).
READ_NUM_MERGE_OPERANDS ticker was not updated when the base key-value or tombstone was read from an SST file.
block_cache_compressed. block_cache_compressed no longer attempts to use SecondaryCache features.
PutEntity API, and retrieved using GetEntity and the new columns API of iterator. For compatibility, the classic APIs Get and MultiGet, as well as iterator's value API return the value of the anonymous default column of wide-column entities; also, GetEntity and iterator's columns return any plain key-values in the form of an entity which only has the anonymous default column.
Merge (and GetMergeOperands) currently also apply to the default column; any other columns of entities are unaffected by Merge operations. Note that some features like compaction filters, transactions, user-defined timestamps, and the SST file writer do not yet support wide-column entities; also, there is currently no MultiGet-like API to retrieve multiple entities at once. We plan to gradually close the above gaps and also implement new features like column-level operations (e.g. updating or querying only certain columns of an entity).
estimated_entry_charge option.
block_cache_compressed as a deprecated feature. Use SecondaryCache instead.
SecondaryCache::InsertSaved() API, with default implementation depending on Insert(). Some implementations might need to add a custom implementation of InsertSaved(). (Details in API comments.)
DeleteRange() now supports user-defined timestamp.
DB::Properties::kFastBlockCacheEntryStats, which is similar to DB::Properties::kBlockCacheEntryStats, except returns cached (stale) values in more cases to reduce overhead.
ignore_max_compaction_bytes_for_input to ignore max_compaction_bytes limit when adding files to be compacted from input level. This should help reduce write amplification. The option is enabled by default.
preserve_internal_time_seconds to preserve the time information for the latest data, which can be used to determine the age of data when preclude_last_level_data_seconds is enabled. The time information is attached with SST in table property rocksdb.seqno.time.map which can be parsed by tool ldb or sst_dump.
flush_opts.wait=false to stall when database has stopped all writes (#10001).
ldb update_manifest and ldb unsafe_remove_sst_file are not usable because they were requiring the DB files to match the existing manifest state (before updating the manifest to match a desired state).
AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size to false.
As a side effect, it can create SSTs larger than the target_file_size (capped at 2x target_file_size) or smaller files.
GetLiveFiles or CreateNewBackup is asked to trigger and wait for memtable flush on a read-only DB. Such indirect requests for memtable flush are now ignored on a read-only DB.
FlushWAL(true /* sync */) (used by GetLiveFilesStorageInfo(), which is used by checkpoint and backup) could cause parallel writes at the tail of a WAL file to never be synced.
SetOptions() fail to update periodical_task time like: stats_dump_period_sec, stats_persist_period_sec.
rocksdb_column_family_handle_get_id, rocksdb_column_family_handle_get_name to get name, id of column family in C API
CompressedSecondaryCache, we just insert a dummy block into the primary cache and don’t erase the block from CompressedSecondaryCache. A standalone handle is returned to the caller. Only if the block is found again from CompressedSecondaryCache before the dummy block is evicted, we erase the block from CompressedSecondaryCache and insert it into the primary cache.
CompressedSecondaryCache, we just insert a dummy block in CompressedSecondaryCache. Only if it is evicted again before the dummy block is evicted from the cache, it is treated as a hot block and is inserted into CompressedSecondaryCache.
malloc_usable_size is available (see #10583).
num_file_reads_for_auto_readahead is added in BlockBasedTableOptions which indicates after how many sequential reads internal auto prefetching should start (default is 2).
block_cache_standalone_handle_count, block_cache_real_handle_count, compressed_sec_cache_insert_real_count, compressed_sec_cache_insert_dummy_count, compressed_sec_cache_uncompressed_bytes, and compressed_sec_cache_compressed_bytes.
CompressedSecondaryCacheOptions::enable_custom_split_merge is added for enabling the custom split and merge feature, which splits the compressed value into chunks so that they may better fit jemalloc bins.
DeleteRange() users.
Internally, iterator will skip to the end of a range tombstone when possible, instead of looping through each key and checking individually if a key is range deleted.
PinnableSlice now only points to the blob value and pins the backing resource (cache entry or buffer) in all cases, instead of containing a copy of the blob value. See #10625 and #10647.
DeleteRange() users should see improvement in get/iterator performance from mutable memtable (see #10547).
prepopulate_blob_cache to ColumnFamilyOptions. If enabled, prepopulate warm/hot blobs which are already in memory into blob cache at the time of flush. On a flush, the blob that is in memory (in memtables) gets flushed to the device. If using Direct IO, additional IO is incurred to read this blob back into memory again, which is avoided by enabling this option. This further helps if the workload exhibits high temporal locality, where most of the reads go to recently written data. This also helps in case of the remote file system since it involves network traffic and higher latencies.
secondary_cache in LRUCacheOptions.
LRUCacheOptions::strict_capacity_limit = true), creation will fail with Status::MemoryLimit(). To opt in to this feature, enable charging CacheEntryRole::kBlobCache in BlockBasedTableOptions::cache_usage_options.
memtable_protection_bytes_per_key that turns on memtable per key-value checksum protection. Each memtable entry will be suffixed by a checksum that is computed during writes, and verified in reads/compaction. Detected corruption will be logged and a corruption status returned to the user.
low_pri_pool_ratio in LRUCacheOptions to configure the ratio of capacity reserved for low priority cache entries (and therefore the remaining ratio is the space reserved for the bottom level), or configuring the new argument low_pri_pool_ratio in NewLRUCache() to achieve the same effect.
CompactRangeOptions::exclusive_manual_compaction is now false by default.
This ensures RocksDB does not introduce artificial parallelism limitations by default.
bottommost_temperature to last_level_temperature. The old option name is kept only for migration, please use the new option. The behavior is changed to apply temperature for the last_level SST files only.
FSDirectory::Fsync or FSDirectory::Close after the first FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.)
GenericRateLimiter could revert the bandwidth set dynamically using SetBytesPerSecond() when a user configures a structure enclosing it, e.g., using GetOptionsFromString() to configure an Options that references an existing RateLimiter object.
GenericRateLimiter.
FIFOCompactionPicker::PickTTLCompaction where total_size calculating might cause underflow
best_efforts_recovery may fail to open the db with mmap read.
fill_cache read option set to false.
AllocateData() in CompressedSecondaryCache::SplitValueIntoChunks() and MergeChunksIntoValueTest.
FaultInjectionSecondaryCache.
CompressedSecondaryCache, the original block is split according to the jemalloc bin size in Insert() and then merged back in Lookup().
preclude_last_level_data_seconds is enabled, the size amplification is calculated within non last_level data only, which skips the last level and uses the penultimate level as the size base.
WriteBufferManager constructed with allow_stall == false will no longer trigger write stall implicitly by thrashing until memtable count limit is reached. Instead, a column family can continue accumulating writes while that CF is flushing, which means memory may increase. Users who prefer stalling writes must now explicitly set allow_stall == true.
CompressedSecondaryCache into the stress tests.
FragmentedRangeTombstoneList during every read operation, it is now constructed once and stored in immutable memtables.
This improves speed of querying range tombstones from immutable memtables.
experimental_mempurge_threshold is now a ColumnFamilyOptions and can now be dynamically configured using SetOptions().
ReadOptions::iter_start_ts is set.
blob_cache to enable/disable blob caching.
BlobSource for blob read logic gives all users access to blobs, whether they are in the blob cache, secondary cache, or (remote) storage. Blobs can be potentially read both while handling user reads (Get, MultiGet, or iterator) and during compaction (while dealing with compaction filters, Merges, or garbage collection) but eventually all blob reads go through Version::GetBlob or, for MultiGet, Version::MultiGetBlob (and then get dispatched to the interface -- BlobSource).
AdvancedColumnFamilyOptions::preclude_last_level_data_seconds, which makes sure the new data inserted within preclude_last_level_data_seconds won't be placed on cold tier (the feature is not complete).
rocksdb_get_column_family_metadata() and rocksdb_get_column_family_metadata_cf() to obtain rocksdb_column_family_metadata_t.
rocksdb_column_family_metadata_t and its get functions & destroy function.
rocksdb_level_metadata_t and its get functions & destroy function.
rocksdb_file_metadata_t and its get functions & destroy function.
LRUCache with strict_capacity_limit=true), DB operations now fail with Status code kAborted subcode kMemoryLimit (IsMemoryLimit()) instead of kIncomplete (IsIncomplete()) when the capacity limit is reached, because Incomplete can mean other specific things for some operations. In more detail, Cache::Insert() now returns the updated Status code and this usually propagates through RocksDB to the user on failure.
int ReserveThreads(int threads_to_be_reserved) and int ReleaseThreads(threads_to_be_released) into Env class. In the default implementation, both return 0.
Newly added xxxEnv class that inherits Env should implement these two functions for thread reservation/releasing features.
rocksdb_options_get_prepopulate_blob_cache and rocksdb_options_set_prepopulate_blob_cache to C API.
prepopulateBlobCache and setPrepopulateBlobCache to Java API.
protection_bytes_per_key > 0 on WriteBatch or WriteOptions, and configure inplace_callback != nullptr.
WritableFileWriter::filesize_ by DB::SyncWAL() and DB::Put() in two write queue mode.
SyncWAL() on the only WAL file of the db will not log the event in MANIFEST, thus allowing a subsequent DB::Open even if the WAL file is missing or corrupted.
index_type=kHashSearch and using SetOptions to change the prefix_extractor.
manual_wal_flush and/or FlushWAL(true /* sync */), together with track_and_verify_wals_in_manifest == true. For those users, losing unsynced data (e.g., due to power loss) could make future DB opens fail with a Status::Corruption complaining about missing WAL data.
WriteBatchInternal::Append() where WAL termination point in write batch was not considered and the function appends an incorrect number of checksums.
kDataBlockBinaryAndHash.
get_pinned and multi_get to C API.
rocksdb_transaction_get_writebatch_wi and rocksdb_transaction_rebuild_from_writebatch to C API.
rocksdb_options_get_blob_file_starting_level and rocksdb_options_set_blob_file_starting_level to C API.
blobFileStartingLevel and setBlobFileStartingLevel to Java API.
rocksdb_comparator_with_ts_create to create timestamp aware comparator
with_ts
auto_prefix_mode.
auto_prefix_mode now notes some corner cases in which it returns different results than total_order_seek, due to design bugs that are not easily fixed.
Users using built-in comparators and keys at least the size of a fixed prefix length are not affected.
WriteOptions::protection_bytes_per_key, which can be used to enable key-value integrity protection for live updates.
blob_garbage_collection_policy and blob_garbage_collection_age_cutoff to both force-enable and force-disable GC, as well as selectively override age cutoff when using CompactRange.
GetSortedWalFiles() (also used by GetLiveFilesStorageInfo(), BackupEngine, and Checkpoint) to reduce risk of successfully created backup or checkpoint failing to open because of missing WAL file.
blob_file_starting_level to enable writing blob files during flushes and compactions starting from the specified LSM tree level.
MultiGet() APIs
blob_cache_hit_count, blob_read_count, blob_read_byte, blob_read_time, blob_checksum_time and blob_decompress_time.
BLOB_DB_CACHE_MISS, BLOB_DB_CACHE_HIT, BLOB_DB_CACHE_ADD, BLOB_DB_CACHE_ADD_FAILURES, BLOB_DB_CACHE_BYTES_READ and BLOB_DB_CACHE_BYTES_WRITE.
mutex_ if this db instance does not need to switch wal and mem-table (#7516).
avoid_flush_during_recovery == true by removing valid WALs, leading to Status::Corruption with message like "SST file is ahead of WALs" when attempting to reopen.
Delete to remove keys, even if the keys should be removed with SingleDelete. Mixing Delete and SingleDelete may cause undefined behavior.
WritableFileWriter::WriteDirect and WritableFileWriter::WriteDirectWithChecksum. The rate_limiter_priority specified in ReadOptions was not passed to the RateLimiter when requesting a token.
verify_sst_unique_id_in_manifest is introduced to enable/disable the verification, if enabled all SST files will be opened during DB-open to verify the unique id (default is false), so it's recommended to use it with max_open_files = -1 to pre-open the files.
LRUCacheOptions::strict_capacity_limit = true), creation will fail with Status::MemoryLimit().
To opt in to this feature, enable charging CacheEntryRole::kFileMetadata in BlockBasedTableOptions::cache_usage_options.
SingleDelete to mark a key as removed.
BlockBasedTableOptions::cache_usage_options and use that to replace BlockBasedTableOptions::reserve_table_builder_memory and BlockBasedTableOptions::reserve_table_reader_memory.
GetUniqueIdFromTableProperties to return a 128-bit unique identifier, which will be the standard size now. The old functionality (192-bit) is available from GetExtendedUniqueIdFromTableProperties. Both functions are no longer "experimental" and are ready for production use.
prio as deprecated for future removal.
file_system.h, mark IOPriority as deprecated for future removal.
CompressionOptions::use_zstd_dict_trainer, to indicate whether zstd dictionary trainer should be used for generating zstd compression dictionaries. The default value of this option is true for backward compatibility. When this option is set to false, zstd API ZDICT_finalizeDictionary is used to generate compression dictionaries.
--try_load_options default to true if --db is specified and not creating a new DB, the user can still explicitly disable that by --try_load_options=false (or explicitly enable that by --try_load_options).
rocksdb.read.block.compaction.micros cannot track compaction stats (#9722).
file_type, relative_filename and directory fields returned by GetLiveFilesMetaData(), which were added in inheriting from FileStorageInfo.
track_and_verify_wals_in_manifest. Without the fix, application may see "open error: Corruption: Missing WAL with log number" while trying to open the db.
The corruption is a false alarm but prevents DB open (#9766).
BlockBasedTableOptions::reserve_table_reader_memory = true.
rocksdb.live-blob-file-garbage-size that exposes the total amount of garbage in the blob files in the current version.
initial_auto_readahead_size which now can be configured through BlockBasedTableOptions.
GetMapProperty() with property kBlockCacheEntryStats can now use the functions in BlockCacheEntryStatsMapKeys to find stats in the map.
fail_if_not_bottommost_level to IngestExternalFileOptions so that ingestion will fail if the file(s) cannot be ingested to the bottommost level.
is_in_sec_cache to SecondaryCache::Lookup(). It is to indicate whether the handle is possibly erased from the secondary cache after the Lookup.
TransactionDB layer APIs do not allow timestamps because we require that all user-defined-timestamps-aware operations go through the Transaction APIs.
ldb
BlockBasedTableOptions::detect_filter_construct_corruption can now be dynamically configured using DB::SetOptions.
UpdateManifestForFilesState or ldb update_manifest --update_temperatures).
versions_ between DBImpl::ResumeImpl() and threads waiting for recovery to complete (#9496)
options.compression even if options.compression_per_level is set.
DisableManualCompaction. Also DB close can cancel the manual compaction thread.
alive_log_files_ in non-two-write-queues mode. The race is between the write thread in WriteToWAL() and another thread executing FindObsoleteFiles(). The race condition will be caught if __glibcxx_requires_nonempty is enabled.
Iterator::Refresh() reads stale keys after DeleteRange() is performed.
options.compression_per_level is dynamically changeable with SetOptions().
WriteOptions::rate_limiter_priority. When set to something other than Env::IO_TOTAL, the internal rate limiter (DBOptions::rate_limiter) will be charged at the specified priority for writes associated with the API to which the WriteOptions was provided.
Currently the support covers automatic WAL flushes, which happen during live updates (Put(), Write(), Delete(), etc.) when WriteOptions::disableWAL == false and DBOptions::manual_wal_flush == false.
DB::GetMergeOperands().
std::vector instead of std::map for storing the metadata objects for blob files, which can improve performance for certain workloads, especially when the number of blob files is high.
ReadOptions::rate_limiter_priority. When set to something other than Env::IO_TOTAL, the internal rate limiter (DBOptions::rate_limiter) will be charged at the specified priority for file reads associated with the API to which the ReadOptions was provided.
BackupableDBOptions. Use backup_engine.h and BackupEngineOptions. Similar renamings are in the C and Java APIs.
UtilityDB::OpenTtlDB. Use db_ttl.h and DBWithTTL::Open.
Cache::CreateCallback from void* to const void*.
rocksdb_filterpolicy_create() from C API, as the only C API support for custom filter policies is now obsolete.
SizeApproximationOptions.include_memtabtles to SizeApproximationOptions.include_memtables.
CompactionService::Start() and CompactionService::WaitForComplete(). Please use CompactionService::StartV2(), CompactionService::WaitForCompleteV2() instead, which provides the same information plus extra data like priority, db_id, etc.
ColumnFamilyOptions::OldDefaults and DBOptions::OldDefaults are marked deprecated, as they are no longer maintained.
OnSubcompactionBegin() and OnSubcompactionCompleted().
FileOperationInfo in event listener API.
NewSequentialFile(). Backup and checkpoint operations need to open the source files with NewSequentialFile(), which will have the temperature hints. Other operations are not covered.
ReadOptions::total_order_seek no longer affects DB::Get().
The original motivation for this interaction has been obsolete since RocksDB has been able to detect whether the current prefix extractor is compatible with that used to generate table files, probably RocksDB 5.14.0.
BlockBasedTableOptions::detect_filter_construct_corruption for detecting corruption during Bloom Filter (format_version >= 5) and Ribbon Filter construction.
rocksdb.blob-stats DB property.
LAST_LEVEL_READ_*, NON_LAST_LEVEL_READ_*.
Note: The next release will be major release 7.0. See https://github.com/facebook/rocksdb/issues/9390 for more info.
TraceFilterType: kTraceFilterIteratorSeek, kTraceFilterIteratorSeekForPrev, and kTraceFilterMultiGet. They can be set in TraceOptions to filter out the operation types after which they are named.
TraceOptions::preserve_write_order. When enabled it guarantees write records are traced in the same order they are logged to WAL and applied to the DB. By default it is disabled (false) to match the legacy behavior and prevent regression.
Options::OldDefaults is marked deprecated, as it is no longer maintained.
BlockBasedTableOptions::block_size from size_t to uint64_t.
Iterator::Refresh() together with DB::DeleteRange(), which are incompatible and have always risked causing the refreshed iterator to return incorrect results.
AdvancedColumnFamilyOptions.bottommost_temperature dynamically changeable with SetOptions().
DB::DestroyColumnFamilyHandle() will return Status::InvalidArgument() if called with DB::DefaultColumnFamily().
Options::DisableExtraChecks() that can be used to improve peak write performance by disabling checks that should not be necessary in the absence of software logic errors or CPU+memory hardware errors. (Default options are slowly moving toward some performance overheads for extra correctness checking.)
fcntl(F_FULLFSYNC) on OS X and iOS.
ObjectRegistry. The bug could result in failure to save the OPTIONS file.
FaultInjectionTestFS.
FSRandomAccessFile::GetUniqueId() (previously used when available), so a filesystem recycling unique ids can no longer lead to incorrect result or crash (#7405). For files generated by RocksDB >= 6.24, the cache keys are stable across DB::Open and DB directory move / copy / import / export / migration, etc.
Although collisions are still theoretically possible, they are (a) impossible in many common cases, (b) not dependent on environmental factors, and (c) much less likely than a CPU miscalculation while executing RocksDB.
checker argument that performs additional checking on timestamp sizes.
TableProperties::properties_offsets with uint64_t property external_sst_file_global_seqno_offset to save table properties' memory.
TableProperties.getPropertiesOffsets() as it exposed internal details to external users.
BlockBasedTableOptions::reserve_table_builder_memory = true.
blob_compaction_readahead_size.
CompactRange() with CompactRangeOptions::change_level == true from possibly causing corruption to the LSM state (overlapping files within a level) when run in parallel with another manual compaction. Note that setting force_consistency_checks == true (the default) would cause the DB to enter read-only mode in this scenario and return Status::Corruption, rather than committing any corruption.
RecordTick(stats_, WRITE_WITH_WAL) (in 2 places); this fix removes the extra RecordTicks and fixes the corresponding test case.
GenericRateLimiter::Request.
BlockBasedTableOptions if insertion into one of {block_cache, block_cache_compressed, persistent_cache} can show up in another of these. (RocksDB expects to be able to use the same key for different physical data among tiers.)
Env::Priority::BOTTOM pool will no longer see RocksDB schedule automatic compactions exceeding the DB's compaction concurrency limit.
For details on per-DB compaction concurrency limit, see API docs of max_background_compactions and max_background_jobs.
GetSortedWalFiles() to fail randomly with an error like IO error: 001234.log: No such file or directory
NUM_FILES_IN_SINGLE_COMPACTION was only counting the first input level files; it now includes all input files.
TransactionUtil::CheckKeyForConflicts can also perform conflict-checking based on user-defined timestamps in addition to sequence numbers.
GenericRateLimiter's minimum refill bytes per period previously enforced.
WriteBufferManager as final because it is not intended for extension.
FSDirectory::FsyncWithDirOptions(), which provides extra information like directory fsync reason in DirFsyncOptions. File systems like btrfs use it to skip the directory fsync when creating a new file or, when renaming a file, to fsync the target file instead of the directory, which improves DB::Open() speed by ~20%.
DB::Open() is not going to be blocked by obsolete file purge if DBOptions::avoid_unnecessary_blocking_io is set to true.
gettid(), info log ("LOG" file) lines now print a system-wide thread ID from gettid() instead of the process-local pthread_self(). For all users, the thread ID format is changed from hexadecimal to decimal integer.
pthread_setname_np(), the background thread names no longer contain an ID suffix. For example, "rocksdb:bottom7" (and all other threads in the Env::Priority::BOTTOM pool) are now named "rocksdb:bottom". Previously large thread pools could breach the name size limit (e.g., naming "rocksdb:bottom10" would fail).
ReadOptions::iter_start_seqnum and DBOptions::preserve_deletes, please try using user defined timestamp feature instead.
The options will be removed in a future release; currently a warning message is logged when they are used.
BlockBasedTableBuilder for FullFilter and PartitionedFilter case (#9070)
NUM_FILES_IN_SINGLE_COMPACTION was only counting the first input level files; it now includes all input files.
DisableManualCompaction() to cancel compactions even when they are waiting on automatic compactions to drain due to CompactRangeOptions::exclusive_manual_compactions == true.
Env::ReopenWritableFile() and FileSystem::ReopenWritableFile() to specify any existing file must not be deleted or truncated.
IngestExternalFiles() with files for multiple column families. The bug could have introduced a delay in ingested file keys becoming visible after IngestExternalFiles() returned. Furthermore, mutations to ingested file keys while they were invisible could have been dropped (not necessarily immediately).
WriteBufferManager who constructed it with allow_stall == true. The race condition led to undefined behavior (in our experience, typically a process crash).
WriteBufferManager::SetBufferSize() with new_size == 0 to dynamically disable memory limiting.
DB::close() thread-safe.
BackupEngine where some internal callers of GenericRateLimiter::Request() do not honor bytes <= GetSingleBurstBytes().
REMOTE_COMPACT_READ_BYTES, REMOTE_COMPACT_WRITE_BYTES.
class CacheDumper and CacheDumpedLoader at rocksdb/utilities/cache_dump_load.h. Note that this feature is still experimental and subject to change in the future.
blob_garbage_collection_force_threshold, which can be used to trigger compactions targeting the SST files which reference the oldest blob files when the ratio of garbage in those blob files meets or exceeds the specified threshold. This can reduce space amplification with skewed workloads where the affected SST files might not otherwise get picked up for compaction.
GetUniqueIdFromTableProperties.
Only SST files from RocksDB >= 6.24 support unique IDs.
GetMapProperty() support for "rocksdb.dbstats" (DB::Properties::kDBStats). As a map property, it includes DB-level internal stats accumulated over the DB's lifetime, such as user write related stats and uptime.
file_temperature to IngestExternalFileArg such that when ingesting SST files, we are able to indicate the temperature of this batch of files.
DB::Close() failed with a non aborted status, calling DB::Close() again will return the original status instead of Status::OK.
lowest_used_cache_tier option to DBOptions (immutable) and pass it to BlockBasedTableReader. By default it is CacheTier::kNonVolatileBlockTier, which means we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileBlockTier). By setting it to CacheTier::kVolatileTier, the DB will not use the secondary cache.
keyMayExist() supports ByteBuffer.
prepopulate_block_cache = kFlushOnly to only apply to flushes rather than to all generated files.
db_name, db_id, session_id, which could help the user uniquely identify compaction job between db instances and sessions.
VerifyChecksum() and VerifyFileChecksums() queries.
rocksdb.num-blob-files, rocksdb.blob-stats, rocksdb.total-blob-file-size, and rocksdb.live-blob-file-size. The existing property rocksdb.estimate-live-data-size was also extended to include live bytes residing in blob files.
Env::IO_USER, Env::IO_MID. Env::IO_USER will have superior priority over all other RateLimiter IOPriorities without being subject to fair scheduling constraint.
SstFileWriter now supports Puts and Deletes with user-defined timestamps. Note that the ingestion logic itself is not timestamp-aware yet.
OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted in EventListener class of listener.h. It notifies listeners during creation/deletion of individual blob files in Integrated BlobDB.
It also logs blob file creation finished and deletion events in the LOG file.
DB::MultiGet using MultiRead.
CompactionServiceJobStatus::kUseLocal to instruct RocksDB to run the compaction locally instead of waiting for the remote compaction result.
RateLimiter::GetTotalPendingRequest(int64_t* total_pending_requests, const Env::IOPriority pri) for the total number of requests that are pending for bytes in the rate limiter.
strict_capacity_limit=true for the block cache, in addition to existing conditions that can trigger unbuffering.
SstFileMetaData::size from size_t to uint64_t.
FlushJobInfo and CompactionJobInfo in listener.h to provide information about the blob files generated by a flush/compaction and garbage collected during compaction in Integrated BlobDB. Added struct members blob_file_addition_infos and blob_file_garbage_infos that contain this information.
output_file_names of CompactFiles API to also include paths of the blob files generated by the compaction in Integrated BlobDB.
BackupEngine functions now return IOStatus instead of Status. Most existing code should be compatible with this change but some calls might need to be updated.
level_at_creation in TablePropertiesCollectorFactory::Context to capture the level at creating the SST file (i.e., table), of which the properties are being collected.
ColumnFamilyData objects. The earlier logic unlocked the DB mutex before destroying the thread-local SuperVersion pointers, which could result in a process crash if another thread managed to get a reference to the ColumnFamilyData object.
RenameFile() on a non-existent info log file ("LOG") when opening a new DB. Such a call was guaranteed to fail though did not impact applications since we swallowed the error.
We have now also stopped swallowing errors when renaming the "LOG" file.
OnFlushCompleted was not called for atomic flush.
MultiGet API when used with keys spanning multiple column families and sorted_input == false.
options.allow_fallocate=false.
ReplayOptions in Replayer::Replay(), or via --trace_replay_fast_forward in db_bench.
LiveSstFilesSizeAtTemperature to retrieve sst file size at different temperature.
BLOB_DB_BLOB_FILE_BYTES_READ, BLOB_DB_GC_NUM_KEYS_RELOCATED, and BLOB_DB_GC_BYTES_RELOCATED, as well as the histograms BLOB_DB_COMPRESSION_MICROS and BLOB_DB_DECOMPRESSION_MICROS.
rocksdb_filterpolicy_create_ribbon is unchanged but adds new rocksdb_filterpolicy_create_ribbon_hybrid.
DB::NewDefaultReplayer() to create a default Replayer instance. Added TraceReader::Reset() to restart reading a trace file. Created trace_record.h, trace_record_result.h and utilities/replayer.h files to access the decoded Trace records, replay them, and query the actual operation results.
SetDBOptions() does not change any option value.
StringAppendOperator additionally accepts a string as the delimiter.
Obsolete keys in the bottommost level that were preserved for a snapshot will now be cleaned upon snapshot release in all cases. This form of compaction (snapshot release triggered compaction) previously had an artificial limitation that multiple tombstones needed to be present.
Blob file checksums are now printed in hexadecimal format when using the manifest_dump ldb command.
GetLiveFilesMetaData() now populates the temperature, oldest_ancester_time, and file_creation_time fields of its LiveFileMetaData results when the information is available. Previously these fields always contained zero indicating unknown.
Fix mismatches of OnCompaction{Begin,Completed} in case of DisableManualCompaction().
Fix continuous logging of an existing background error on every user write.
Fix a bug where Get() returned Status::OK() and an empty value for a non-existent key when read_options.read_tier = kBlockCacheTier.
Fix a bug where stats in get_context were not accumulated into statistics when the query failed.
Fixed handling of DBOptions::wal_dir with LoadLatestOptions() or ldb --try_load_options on a copied or moved DB. Previously, when the WAL directory was the same as the DB directory (default), a copied or moved DB would reference the old path of the DB as the WAL directory, potentially corrupting both copies. Under this change, the wal_dir from DB::GetOptions() or LoadLatestOptions() may now be empty, indicating that the current DB directory is used for WALs. This is also a subtle API change.
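Because an empty wal_dir now means "use the DB directory", code that inspects loaded options may need to resolve the effective WAL location itself. The following is a minimal sketch of that rule using plain strings; `EffectiveWalDir` is a hypothetical helper for illustration, not a RocksDB API.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of RocksDB: resolves where WAL files live,
// given the path the DB was opened at and the wal_dir read from its options.
// An empty wal_dir is treated as "WALs are stored in the DB directory".
std::string EffectiveWalDir(const std::string& db_path,
                            const std::string& wal_dir) {
  assert(!db_path.empty());  // the caller must know which DB it opened
  return wal_dir.empty() ? db_path : wal_dir;
}
```

With this rule, a DB copied to a new directory resolves its WAL location against the new directory rather than the path it was originally created at, while an explicitly configured wal_dir is returned unchanged.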
list_live_files_metadata, that shows the live SST files, as well as their LSM storage level and the column family they belong to.
int to uint64_t to support sub-compaction id.
Added API comments clarifying safe usage of Disable/EnableManualCompaction and EventListener callbacks for compaction.
Fixed fs_posix.cc GetFreeSpace() always reporting the disk space available to root, even when running as non-root. Linux defaults often have disk mounts with 5 to 10 percent of total space reserved only for root, so non-root users could run out of space even though free space was reported.
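The distinction is visible directly in POSIX statvfs(), which reports free blocks both including (f_bfree) and excluding (f_bavail) the space reserved for privileged processes. The sketch below is illustrative only and is not the code in fs_posix.cc; the `FreeSpace` struct and `QueryFreeSpace` helper are hypothetical names.

```cpp
#include <sys/statvfs.h>

#include <cassert>
#include <cstdint>

// Hypothetical result type: free bytes as seen by root and by other users.
struct FreeSpace {
  uint64_t for_root;          // from f_bfree: includes root-reserved blocks
  uint64_t for_unprivileged;  // from f_bavail: excludes root-reserved blocks
};

// Hypothetical helper: queries the filesystem containing `path`.
// Returns false if statvfs() fails (for example, the path does not exist).
bool QueryFreeSpace(const char* path, FreeSpace* out) {
  assert(out != nullptr);
  struct statvfs sbuf;
  if (statvfs(path, &sbuf) != 0) {
    return false;
  }
  // POSIX defines both block counts in units of f_frsize.
  out->for_root = static_cast<uint64_t>(sbuf.f_bfree) * sbuf.f_frsize;
  out->for_unprivileged = static_cast<uint64_t>(sbuf.f_bavail) * sbuf.f_frsize;
  return true;
}
```

A process running as a non-root user that budgets against for_root may plan writes it cannot complete; for_unprivileged is the safer figure in that case.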
Subcompactions are now disabled when user-defined timestamps are used, since the subcompaction boundary picking logic is currently not timestamp-aware, which could lead to incorrect results when different subcompactions process keys that only differ by timestamp.
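To see why a timestamp-unaware boundary is a problem, consider keys whose timestamp is stored as a fixed-size suffix of the user key. The sketch below assumes an 8-byte suffix purely for illustration; `kTimestampSize`, `StripTimestamp`, and `BoundarySplitsUserKey` are hypothetical names, not RocksDB APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption for illustration: every key ends with an 8-byte timestamp.
constexpr size_t kTimestampSize = 8;

// Hypothetical helper: returns the key with its timestamp suffix removed.
std::string StripTimestamp(const std::string& key_with_ts) {
  assert(key_with_ts.size() >= kTimestampSize);
  return key_with_ts.substr(0, key_with_ts.size() - kTimestampSize);
}

// Hypothetical check: true if a subcompaction boundary placed between `left`
// and `right` would separate two versions of the same user key, i.e. the two
// keys differ only in their timestamp suffix.
bool BoundarySplitsUserKey(const std::string& left, const std::string& right) {
  return StripTimestamp(left) == StripTimestamp(right);
}
```

When such a boundary exists, each subcompaction sees only some versions of the key, so decisions that depend on seeing all versions together can go wrong. Disabling subcompactions avoids the situation entirely.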
Fix an issue where DeleteFilesInRange() could cause an ongoing compaction to report a corruption exception, or trigger an ASSERT in debug builds. We found no actual data loss or corruption.
Fixed confusingly duplicated output in LOG for periodic stats ("DUMPING STATS"), including "Compaction Stats" and "File Read Latency Histogram By Level".
Fixed performance bugs in background gathering of block cache entry statistics that could consume a lot of CPU when there are many column families with a shared block cache.
NewRibbonFilterPolicy in place of NewBloomFilterPolicy to use Ribbon filters instead of Bloom, or ribbonfilter in place of bloomfilter in configuration string.DBWithTTL to use DeleteRange api just like other DBs. DeleteRangeCF() which executes WriteBatchInternal::DeleteRange() has been added to the handler in DBWithTTLImpl::Write() to implement it.cancel field to CompactRangeOptions, allowing individual in-process manual range compactions to be cancelled.rocksdb.cur-size-active-mem-table, rocksdb.cur-size-all-mem-tables, and rocksdb.size-all-mem-tables.GetLiveFiles() output included a non-existent file called "OPTIONS-000000". Backups and checkpoints, which use GetLiveFiles(), failed on DBs impacted by this bug. Read-write DBs were impacted when the latest OPTIONS file failed to write and fail_if_options_file_error == false. Read-only DBs were impacted when no OPTIONS files existed.AdvancedColumnFamilyOptions.max_compaction_bytes is under-calculated for manual compaction (CompactRange()). Manual compaction is split to multiple compactions if the compaction size exceed the max_compaction_bytes. The bug creates much larger compaction which size exceed the user setting. On the other hand, larger manual compaction size can increase the subcompaction parallelism, you can tune that by setting max_compaction_bytes.CompactionFilters to apply in more table file creation scenarios such as flush and recovery. For compatibility, CompactionFilters by default apply during compaction. Users can customize this behavior by overriding CompactionFilterFactory::ShouldFilterTableFileCreation().stats_dump_period_sec.TableProperties::num_filter_entries, which can be used with TableProperties::filter_size to calculate the effective bits per filter entry (unique user key or prefix) for a table file.CompactionFilterContext.skip_filters parameter to SstFileWriter is now considered deprecated. 
Use BlockBasedTableOptions::filter_policy to control generation of filters.ApplyToAllEntries to Cache, to replace ApplyToAllCacheEntries. Custom Cache implementations must add an implementation. Because this function is for gathering statistics, an empty implementation could be acceptable for some applications.ColumnFamilyOptions::sample_for_compression now takes effect for creation of all block-based tables. Previously it only took effect for block-based tables created by flush.CompactFiles() can no longer compact files from lower level to up level, which has the risk to corrupt DB (details: #8063). The validation is also added to all compactions.strerror_r() to get error messages.Env has high-pri thread pool disabled (Env::GetBackgroundThreads(Env::Priority::HIGH) == 0)DBOptions::max_open_files to be set with a non-negative integer with ColumnFamilyOptions::compaction_style = kCompactionStyleFIFO.yield instead of wfe to relax cpu to gain better performance.TableProperties::slow_compression_estimated_data_size and TableProperties::fast_compression_estimated_data_size. When ColumnFamilyOptions::sample_for_compression > 0, they estimate what TableProperties::data_size would have been if the "fast" or "slow" (see ColumnFamilyOptions::sample_for_compression API doc for definitions) compression had been used instead.FlushReason::kWalFull, which is reported when a memtable is flushed due to the WAL reaching its size limit; those flushes were previously reported as FlushReason::kWriteBufferManager. Also, changed the reason for flushes triggered by the write buffer manager to FlushReason::kWriteBufferManager; they were previously reported as FlushReason::kWriteBufferFull.delayed_write_rate is actually exceeded, with an initial burst allowance of 1 millisecond worth of bytes. 
Also, beyond the initial burst allowance, delayed_write_rate is now more strictly enforced, especially with multiple column families.BackupableDBOptions::share_files_with_checksum to true and deprecated false because of potential for data loss. Note that accepting this change in behavior can temporarily increase backup data usage because files are not shared between backups using the two different settings. Also removed obsolete option kFlagMatchInterimNaming.FilterBlobByKey() to CompactionFilter. Subclasses can override this method so that compaction filters can determine whether the actual blob value has to be read during compaction. Use a new kUndetermined in CompactionFilter::Decision to indicated that further action is necessary for compaction filter to make a decision.WriteBatch through the write to RocksDB's in-memory update buffer (memtable). This is intended to detect some cases of in-memory data corruption, due to either software or hardware errors. Users can enable protection by constructing their WriteBatch with protection_bytes_per_key == 8.full_history_ts_low option in manual compaction, which is for old timestamp data GC.FileSystems) whose source code resides outside the RocksDB repo. See "plugin/README.md" for developer details, and "PLUGINS.md" for a listing of available plugins.rocksdb::DB API, as opposed to the separate rocksdb::blob_db::BlobDB interface used by the earlier version, and can be configured on a per-column family basis using the configuration options enable_blob_files, min_blob_size, blob_file_size, blob_compression_type, enable_blob_garbage_collection, and blob_garbage_collection_age_cutoff. It extends RocksDB's consistency guarantees to blobs, and offers more features and better performance. 
Note that some features, most notably Merge, compaction filters, and backup/restore are not yet supported, and there is no support for migrating a database created by the old implementation.TransactionDB returns error Statuses from calls to DeleteRange() and calls to Write() where the WriteBatch contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees. There are certain cases where range deletion can still be used on such DBs; see the API doc on TransactionDB::DeleteRange() for details.OptimisticTransactionDB now returns error Statuses from calls to DeleteRange() and calls to Write() where the WriteBatch contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees.WRITE_PREPARED, WRITE_UNPREPARED TransactionDB MultiGet() may return uncommitted data with snapshot.CompressionOptions::max_dict_buffer_bytes, to limit the in-memory buffering for selecting samples for generating/training a dictionary. The limit is currently loosely adhered to.DB::VerifyFileChecksums(), we now fail with Status::InvalidArgument if the name of the checksum generator used for verification does not match the name of the checksum generator used for protecting the file when it was created.ErrorHandler::SetBGError.WalAddition and WalDeletion, fixed this by changing the encoded format of them to be ignorable by older versions.merge_operator now fails immediately, causing the DB to enter read-only mode. Previously, failure was deferred until the merge_operator was needed by a user read or a background operation.WALRecoveryMode::kPointInTimeRecovery is used. Gaps are still possible when WALs are truncated exactly on record boundaries; for complete protection, users should enable track_and_verify_wals_in_manifest.read_amp_bytes_per_bit during OPTIONS file parsing on big-endian architecture. 
Without this fix, original code introduced in PR7659, when running on big-endian machine, can mistakenly store read_amp_bytes_per_bit (an uint32) in little endian format. Future access to read_amp_bytes_per_bit will give wrong values. Little endian architecture is not affected.CompactRange and GetApproximateSizes.Env::GetChildren and Env::GetChildrenFileAttributes will no longer return entries for the special directories . or ...track_and_verify_wals_in_manifest. If true, the log numbers and sizes of the synced WALs are tracked in MANIFEST, then during DB recovery, if a synced WAL is missing from disk, or the WAL's size does not match the recorded size in MANIFEST, an error will be reported and the recovery will be aborted. Note that this option does not work with secondary instance.rocksdb_approximate_sizes and rocksdb_approximate_sizes_cf in the C API now requires an error pointer (char** errptr) for receiving any error.format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partitions reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). 
The bug could cause index partitions to be mistakenly considered empty during reads leading to wrong read results.Status::Corruption failure when paranoid_file_checks == true and range tombstones were written to the compaction output files.WriteOptions.no_slowdown=true).ignore_unknown_options flag (used in option parsing/loading functions) changed.NotFound instead of InvalidArgument for option names not available in the present version.TableBuilder::NeedCompact() before TableBuilder::Finish() in compaction job. For example, the NeedCompact() method of CompactOnDeletionCollector returned by built-in CompactOnDeletionCollectorFactory requires BlockBasedTable::Finish() to return the correct result. The bug can cause a compaction-generated file not to be marked for future compaction based on deletion ratio.BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache and BlockBasedTableOptions::pin_top_level_index_and_filter. These options still take effect until users migrate to the replacement APIs in BlockBasedTableOptions::metadata_cache_options. Migration guidance can be found in the API comments on the deprecated options.DB::VerifyFileChecksums to verify SST file checksum with corresponding entries in the MANIFEST if present. Current implementation requires scanning and recomputing file checksums.ColumnFamilyOptions::compression_opts now additionally affect files generated by flush and compaction to non-bottommost level. Previously those settings at most affected files generated by compaction to bottommost level, depending on whether ColumnFamilyOptions::bottommost_compression_opts overrode them. 
Users who relied on dictionary compression settings in ColumnFamilyOptions::compression_opts affecting only the bottommost level can keep the behavior by moving their dictionary settings to ColumnFamilyOptions::bottommost_compression_opts and setting its enabled flag.enabled flag is set in ColumnFamilyOptions::bottommost_compression_opts, those compression options now take effect regardless of the value in ColumnFamilyOptions::bottommost_compression. Previously, those compression options only took effect when ColumnFamilyOptions::bottommost_compression != kDisableCompressionOption. Now, they additionally take effect when ColumnFamilyOptions::bottommost_compression == kDisableCompressionOption (such a setting causes bottommost compression type to fall back to ColumnFamilyOptions::compression_per_level if configured, and otherwise fall back to ColumnFamilyOptions::compression).CompactRange() with CompactRangeOptions::change_level set fails due to a conflict in the level change step, which caused all subsequent calls to CompactRange() with CompactRangeOptions::change_level set to incorrectly fail with a Status::NotSupported("another thread is refitting") error.BottommostLevelCompaction.kForce or kForceOptimized is set.CompactRange() for refitting levels (CompactRangeOptions::change_level == true) and another manual compaction are executed in parallel.recycle_log_file_num to zero when the user attempts to enable it in combination with WALRecoveryMode::kTolerateCorruptedTailRecords. Previously the two features were allowed together, which compromised the user's configured crash-recovery guarantees.SstFileWriter. Previously, the dictionary would be trained/finalized immediately with zero samples. Now, the whole SstFileWriter file is buffered in memory and then sampled.avoid_unnecessary_blocking_io=1 and creating backups (BackupEngine::CreateNewBackup) or checkpoints (Checkpoint::Create). 
With this setting and WAL enabled, these operations could randomly fail with non-OK status.std::string requested_checksum_func_name is added to FileChecksumGenContext, which enables the checksum factory to create generators for a suite of different functions.ldb unsafe_remove_sst_file, which removes a lost or corrupt SST file from a DB's metadata. This command involves data loss and must not be used on a live DB.kCompactionStyleLevel compaction style with level_compaction_dynamic_level_bytes set.share_files_with_checksum is used with kLegacyCrc32cAndFileSize naming (discouraged).
share_files_with_checksum, we are confident there is no regression (vs. pre-6.12) in detecting DB or backup corruption at backup creation time, mostly because the old design did not leverage this extra checksum computation for detecting inconsistencies at backup creation time.
share_table_files without "checksum" (not recommended), there is a regression in detecting fundamentally unsafe use of the option, greatly mitigated by file size checking (under "Behavior Changes"). Almost no reason to use share_files_with_checksum=false should remain.
DB::VerifyChecksum and BackupEngine::VerifyBackup with checksum checking are still able to catch corruptions that CreateNewBackup does not.
DB::DeleteFile() API describing its known problems and deprecation plan.
FSRandomAccessFile.Prefetch() default return status is changed from OK to NotSupported. If the user inherited file doesn't implement prefetch, RocksDB will create internal prefetch buffer to improve read performance.
EventListener in listener.h contains new callback functions: OnFileFlushFinish(), OnFileSyncFinish(), OnFileRangeSyncFinish(), OnFileTruncateFinish(), and OnFileCloseFinish().
FileOperationInfo now reports duration measured by std::chrono::steady_clock and start_ts measured by std::chrono::system_clock instead of start and finish timestamps measured by system_clock. Note that system_clock is called before steady_clock in program order at operation starts.
DB::GetDbSessionId(std::string& session_id) is added. session_id stores a unique identifier that gets reset every time the DB is opened. This DB session ID should be unique among all open DB instances on all hosts, and should be unique among re-openings of the same or other DBs. This identifier is recorded in the LOG file on the line starting with "DB Session ID:".
DB::OpenForReadOnly() now returns Status::NotFound when the specified DB directory does not exist. Previously the error returned depended on the underlying Env.
This change is available in all 6.11 releases as well.
verify_with_checksum is added to BackupEngine::VerifyBackup, which is false by default. If it is true, BackupEngine::VerifyBackup verifies checksums and file sizes of backup files. Pass false for verify_with_checksum to maintain the previous behavior and performance of BackupEngine::VerifyBackup, by only verifying sizes of backup files.
file_checksum_gen_factory is set to GetFileChecksumGenCrc32cFactory(), BackupEngine will compare the crc32c checksums of table files computed when creating a backup to the expected checksums stored in the DB manifest, and will fail CreateNewBackup() on mismatch (corruption). If the file_checksum_gen_factory is not set or set to any other customized factory, there is no checksum verification to detect if SST files in a DB are corrupt when read, copied, and independently checksummed by BackupEngine.
stats_dump_period_sec > 0, either as the initial value for DB open or as a dynamic option change, the first stats dump is staggered in the following X seconds, where X is an integer in [0, stats_dump_period_sec). Subsequent stats dumps are still spaced stats_dump_period_sec seconds apart.
db_id) and DB session identity (db_session_id) are added to table properties and stored in SST files. SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as DB::GetDbSessionId. The session ID for SstFileWriter (resp., Repairer) resets every time SstFileWriter::Open (resp., Repairer::Run) is called.
BackupableDBOptions::share_files_with_checksum_naming is added with new default behavior for naming backup files with share_files_with_checksum, to address performance and backup integrity issues.
See API comments for details.max_subcompactions can be set dynamically using DB::SetDBOptions().shared_checksum directory when using share_files_with_checksum_naming = kUseDbSessionId (new default), except on SST files generated before this version of RocksDB, which fall back on using kLegacyCrc32cAndFileSize.Status::InvalidArgument if the range's end key comes before its start key according to the user comparator. Previously the behavior was undefined.force_consistency_checks is false.pin_l0_filter_and_index_blocks_in_cache no longer applies to L0 files larger than 1.5 * write_buffer_size to give more predictable memory usage. Such L0 files may exist due to intra-L0 compaction, external file ingestion, or user dynamically changing write_buffer_size (note, however, that files that are already pinned will continue being pinned, even after such a dynamic change).Env::LowerThreadPoolCPUPriority(Priority, CpuPriority) is added to Env to be able to lower to a specific priority such as CpuPriority::kIdle.BlockBasedTableBuilder. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set CompressionOptions::parallel_threads greater than 1 to enable compression parallelism. 
This feature is experimental for now.max_background_flushes can be set dynamically using DB::SetDBOptions().--compression_level_from and --compression_level_to to report size of all compression levels and one compression_type must be specified with it so that it will report compressed sizes of one compression type with different levels.PerfContext::user_key_comparison_count for lookups in files written with format_version >= 3.COMMITTED, while the old misspelled COMMITED is still available as an alias.CreateBackupOptions is added to both BackupEngine::CreateNewBackup and BackupEngine::CreateNewBackupWithMetadata, you can decrease CPU priority of BackupEngine's background threads by setting decrease_background_thread_cpu_priority and background_thread_cpu_priority in CreateBackupOptions.WriteBatchWithIndex::DeleteRange returns Status::NotSupported. Previously it returned success even though reads on the batch did not account for range tombstones. The corresponding language bindings now cannot be used. In C, that includes rocksdb_writebatch_wi_delete_range, rocksdb_writebatch_wi_delete_range_cf, rocksdb_writebatch_wi_delete_rangev, and rocksdb_writebatch_wi_delete_rangev_cf. In Java, that includes WriteBatchWithIndex::deleteRange.BLOB_DB_GC_NUM_FILES (number of blob files obsoleted during GC), BLOB_DB_GC_NUM_NEW_FILES (number of new blob files generated during GC), BLOB_DB_GC_FAILURES (number of failed GC passes), BLOB_DB_GC_NUM_KEYS_RELOCATED (number of blobs relocated during GC), and BLOB_DB_GC_BYTES_RELOCATED (total size of blobs relocated during GC). On the other hand, the following statistics, which are not relevant for the new GC implementation, are now deprecated: BLOB_DB_GC_NUM_KEYS_OVERWRITTEN, BLOB_DB_GC_NUM_KEYS_EXPIRED, BLOB_DB_GC_BYTES_OVERWRITTEN, BLOB_DB_GC_BYTES_EXPIRED, and BLOB_DB_GC_MICROS.db_bench now supports value_size_distribution_type, value_size_min, value_size_max options for generating random variable sized value. 
Added blob_db_compression_type option for BlobDB to enable blob compression.
OptimisticTransactionDBOptions option that allows users to configure the OCC validation policy. The default policy changes from kValidateSerial to kValidateParallel to reduce mutex contention.
max_background_jobs dynamically through the SetDBOptions interface.
enable_garbage_collection is set to true in BlobDBOptions. Garbage collection is performed during compaction: any valid blobs located in the oldest N files (where N is the number of non-TTL blob files multiplied by the value of BlobDBOptions::garbage_collection_cutoff) encountered during compaction get relocated to new blob files, and old blob files are dropped once they are no longer needed. Note: we recommend enabling periodic compactions for the base DB when using this feature to deal with the case when some old blob files are kept alive by SSTs that otherwise do not get picked for compaction.
db_bench now supports the garbage_collection_cutoff option for BlobDB.
creation_time of new compaction outputs.
ColumnFamilyHandle pointers themselves instead of only the column family IDs when checking whether an API call uses the default column family or not.
GetLiveFilesMetaData and GetColumnFamilyMetaData now expose the file number of SST files as well as the oldest blob file referenced by each SST.
sst_dump command line tool recompress command now displays how many blocks were compressed and how many were not, in particular how many were not compressed because the compression ratio was not met (12.5% threshold for GoodCompressionRatio), as seen in the number.block.not_compressed counter stat since version 6.0.0.
db_bench now supports and by default issues non-TTL Puts to BlobDB. TTL Puts can be enabled by specifying a non-zero value for the blob_db_max_ttl_range command line parameter explicitly.
sst_dump now supports printing BlobDB blob indexes in a human-readable format.
This can be enabled by specifying the decode_blob_index flag on the command line.
creation_time table property for compaction output files is now set to the minimum of the creation times of all compaction inputs.
LevelAndStyleCustomFilterPolicy in db_bloom_filter_test.cc. While most existing custom implementations of FilterPolicy should continue to work as before, those wrapping the return of NewBloomFilterPolicy will require overriding new function GetBuilderWithContext(), because calling GetFilterBitsBuilder() on the FilterPolicy returned by NewBloomFilterPolicy is no longer supported.
snap_refresh_nanos option.
UINT64_MAX - 1 which allows RocksDB to auto-tune periodic compaction scheduling. When using the default value, periodic compactions are now auto-enabled if a compaction filter is used. A value of 0 will turn off the feature completely.
UINT64_MAX - 1 which allows RocksDB to auto-tune the ttl value. When using the default value, TTL will be auto-enabled to 30 days, when the feature is supported. To revert to the old behavior, you can explicitly set it to 0.
snap_refresh_nanos is set to 0.
memtable_insert_hint_per_batch to WriteOptions. If it is true, each WriteBatch will maintain its own insert hints for each memtable in concurrent write. See include/rocksdb/options.h for more details.
--secondary_path to ldb to open the database as the secondary instance. This would keep the original DB intact.
RegisterCustomObjects function. By linking the unit test binary with the static library, the unit test can execute this function.
snap_refresh_nanos (default to 0) to periodically refresh the snapshot list in compaction jobs. Assign to 0 to disable the feature.
unordered_write which trades snapshot guarantees for higher write throughput. When used with WRITE_PREPARED transactions with two_write_queues=true, it offers higher throughput without compromising guarantees.
failed_move_fall_back_to_copy (default is true) for external SST ingestion.
When move_files is true and hard link fails, ingestion falls back to copy if failed_move_fall_back_to_copy is true. Otherwise, ingestion reports an error.
list_file_range_deletes in ldb, which prints out tombstones in SST files.
Puts covered by range tombstones to reappear. Note Puts may exist even if the user only ever called Merge() due to an internal conversion during compaction to the bottommost level.
strict_bytes_per_sync that causes a file-writing thread to block rather than exceed the limit on bytes pending writeback specified by bytes_per_sync or wal_bytes_per_sync.
IsFlushPending() == true caused by one bg thread releasing the db mutex in ~ColumnFamilyData and another thread clearing flush_requested_ flag.
cache_index_and_filter_blocks == true, we now store dictionary data used for decompression in the block cache for better control over memory usage. For users of ZSTD v1.1.4+ who compile with -DZSTD_STATIC_LINKING_ONLY, this includes a digested dictionary, which is used to increase decompression speed.
GetStatsHistory API to retrieve these snapshots.
SstFileWriter will now use dictionary compression if it is configured in the file writer's CompressionOptions.
TableProperties::num_entries and TableProperties::num_deletions now also account for the number of range tombstones.
number.block.not_compressed now also counts blocks not compressed due to poor compression ratio.
CompactionOptionsFIFO. The option has been deprecated and ttl in ColumnFamilyOptions is used instead.
NotFound point lookup result when querying the endpoint of a file that has been extended by a range tombstone.
JemallocNodumpAllocator memory allocator. When it is in use, the block cache will be excluded from core dumps.
PerfContextByLevel as part of PerfContext which allows storing perf context at each level. Also replaced __thread with thread_local keyword for perf_context. Added per-level perf context for bloom filter and Get query.
atomic_flush.
If true, RocksDB supports flushing multiple column families and atomically committing the result to MANIFEST. Useful when WAL is disabled.
num_deletions and num_merge_operands members to TableProperties.
MemoryAllocator, which lets the user specify a custom memory allocator for block based table.
DeleteRange to prevent read performance degradation. The feature is no longer marked as experimental.
DBOptions::use_direct_reads now affects reads issued by BackupEngine on the database's SSTs.
NO_ITERATORS is divided into two counters NO_ITERATOR_CREATED and NO_ITERATOR_DELETE. Both of them are only increasing now, just as other counters.
NO_FILE_CLOSES ticker statistic, which was always zero previously.
OnTableFileCreated will now be called for empty files generated during compaction. In that case, TableFileCreationInfo::file_path will be "(nil)" and TableFileCreationInfo::file_size will be zero.
FlushOptions::allow_write_stall, which controls whether Flush calls start working immediately, even if it causes user writes to stall, or will wait until flush can be performed without causing write stall (similar to CompactRangeOptions::allow_write_stall). Note that the default value is false, meaning we add delay to Flush calls until stalling can be avoided when possible. This is a behavior change compared to previous RocksDB versions, where Flush calls didn't check whether they might cause a stall.
OnCompactionCompleted.
CompactFiles run with CompactionOptions::compression == CompressionType::kDisableCompressionOption. Now that setting causes the compression type to be chosen according to the column family-wide compression options.
MergeOperator::ShouldMerge in the reversed order relative to how they were merged (passed to FullMerge or FullMergeV2) for performance reasons.
max_num_ikeys.
CompressionOptions::zstd_max_train_bytes to a nonzero value) now requires ZSTD version 1.1.3 or later.
bottommost_compression_opts.
To remain backward compatible, a new boolean enabled is added to CompressionOptions. For compression_opts, it will always be used regardless of the value of enabled. For bottommost_compression_opts, it will only be used when the user sets enabled=true; otherwise, compression_opts will be used for bottommost_compression by default.
Statistics objects created via CreateDBStatistics(), the format of the string returned by its ToString() method has changed.
ColumnFamilyOptions::ttl via SetOptions().
bytes_max_delete_chunk to 0 in NewSstFileManager() as it doesn't work well with checkpoints.
DBOptions::use_direct_io_for_flush_and_compaction only applies to background writes, and DBOptions::use_direct_reads applies to both user reads and background reads. This conforms with Linux's open(2) manpage, which advises against simultaneously reading a file in buffered and direct modes, due to possibly undefined behavior and degraded performance.
CompressionOptions::kDefaultCompressionLevel, which is a generic way to tell RocksDB to use the compression library's default level. It is now the default value for CompressionOptions::level. Previously the level defaulted to -1, which gave poor compression ratios in ZSTD.
Env::LowerThreadPoolCPUPriority(Priority) method, which lowers the CPU priority of background (esp. 
compaction) threads to minimize interference with foreground tasks.
Env::SetBackgroundThreads(), compactions to the bottom level will be delegated to that thread pool.
prefix_extractor has been moved from ImmutableCFOptions to MutableCFOptions, meaning it can be dynamically changed without a DB restart.
BackupableDBOptions::max_valid_backups_to_open to not delete backup files when refcount cannot be accurately determined.
BlockBasedTableConfig.setBlockCache to allow sharing a block cache across DB instances.
ignore_unknown_options argument will only be effective if the option file shows it is generated using a higher version of RocksDB than the current version.
CompactRange() when the range specified by the user does not overlap unflushed memtables.
ColumnFamilyOptions::max_subcompactions is set greater than one, we now parallelize large manual level-based compactions.
include_end option to make the range end exclusive when include_end == false in DeleteFilesInRange().
CompactRangeOptions::allow_write_stall, which makes CompactRange start working immediately, even if it causes user writes to stall. The default value is false, meaning we add delay to CompactRange calls until stalling can be avoided when possible. Note this delay is not present in previous RocksDB versions.
Status::InvalidArgument; previously, it returned Status::IOError.
DeleteFilesInRanges() to delete files in multiple ranges at once for better performance.
DisableFileDeletions() followed by GetSortedWalFiles() to not return obsolete WAL files that PurgeObsoleteFiles() is going to delete.
autoTune and getBytesPerSecond() to RocksJava RateLimiter
make with environment variable USE_SSE set and PORTABLE unset, will use all machine features available locally.
Previously this combination only compiled SSE-related features.
NUMBER_ITER_SKIP, which returns how many internal keys were skipped during iterations (e.g., due to being tombstones or duplicate versions of a key).
key_lock_wait_count and key_lock_wait_time, which measure the number of times transactions wait on key locks and total amount of time waiting.
IngestExternalFile() affecting databases with a large number of SST files.
DeleteFilesInRange() deletes a subset of files spanned by a DeleteRange() marker.
BackupableDBOptions::max_valid_backups_to_open == 0 now means no backups will be opened during BackupEngine initialization. Previously this condition disabled limiting backups opened.
DBOptions::preserve_deletes is a new option that allows one to specify that the DB should not drop tombstones for regular deletes if they have a sequence number larger than what was set by the new API call DB::SetPreserveDeletesSequenceNumber(SequenceNumber seqnum). Disabled by default.
DB::SetPreserveDeletesSequenceNumber(SequenceNumber seqnum) was added. Users who wish to preserve deletes are expected to periodically call this function to advance the cutoff seqnum (all deletes made before this seqnum can be dropped by the DB). It is the user's responsibility to advance the seqnum so that tombstones are kept for the desired period of time, yet are eventually processed in time and don't take up too much space.
ReadOptions::iter_start_seqnum was added;
if set to a value greater than 0, the user will see two changes in iterator behavior: 1) only keys written with a sequence number larger than this parameter are returned, and 2) the Slice returned by iter->key() now points to memory that holds a user-oriented representation of the internal key, rather than the user key. New struct FullKey was added to represent internal keys, along with a new helper function ParseFullKey(const Slice& internal_key, FullKey* result).
crc32c_3way on supported platforms to improve performance. The system will choose to use this algorithm on supported platforms automatically whenever possible. If PCLMULQDQ is not supported it will fall back to the old Fast_CRC32 algorithm.
DBOptions::writable_file_max_buffer_size can now be changed dynamically.
DBOptions::bytes_per_sync, DBOptions::compaction_readahead_size, and DBOptions::wal_bytes_per_sync can now be changed dynamically; changing DBOptions::wal_bytes_per_sync will flush all memtables and switch to a new WAL file.
true to the auto_tuned parameter in NewGenericRateLimiter(). The value passed as rate_bytes_per_sec will still be respected as an upper-bound.
ColumnFamilyOptions::compaction_options_fifo.
EventListener::OnStallConditionsChanged() callback.
Users can implement it to be notified when user writes are stalled, stopped, or resumed.
ReadOptions::iterate_lower_bound.
DB::Open() will abort if column family inconsistency is found during PIT recovery.
DeleteRange().
Statistics::getHistogramString() will see fewer histogram buckets and different bucket endpoints.
Slice::compare and BytewiseComparator Compare no longer accept Slices containing nullptr.
Transaction::Get and Transaction::GetForUpdate variants with PinnableSlice added.
Env::SetBackgroundThreads(N, Env::Priority::BOTTOM), where N > 0.
MergeOperator::AllowSingleOperand.
DB::VerifyChecksum(), which verifies the checksums in all SST files in a running DB.
BlockBasedTableOptions::checksum = kNoChecksum.
rocksdb.db.get.micros, rocksdb.db.write.micros, and rocksdb.sst.read.micros.
EventListener::OnBackgroundError() callback. Users can implement it to be notified of errors causing the DB to enter read-only mode, and optionally override them.
DeleteRange() is used together with subcompactions.
max_background_flushes=0.
Instead, users can achieve this by configuring their high-pri thread pool to have zero threads.
Options::max_background_flushes, Options::max_background_compactions, and Options::base_background_compactions all with Options::max_background_jobs, which automatically decides how many threads to allocate towards flush/compaction.
IOStatsContext iostats_context with IOStatsContext* get_iostats_context(); replace global variable PerfContext perf_context with PerfContext* get_perf_context().
DB::IngestExternalFile() now supports ingesting files into a database containing range deletions.
max_open_files option via SetDBOptions()
GetAllKeyVersions to see internal versions of a range of keys.
allow_ingest_behind
stats_dump_period_sec option via SetDBOptions().
delete_obsolete_files_period_micros option via SetDBOptions().
delayed_write_rate and max_total_wal_size options via SetDBOptions().
delayed_write_rate option via SetDBOptions().
const WriteEntry&
make rocksdbjavastatic.
StackableDB::GetRawDB() to StackableDB::GetBaseDB().
WriteBatch::Data() const std::string& Data() const.
TableStats to TableProperties.
PrefixHashRepFactory. Please use NewHashSkipListRepFactory() instead.
EnableFileDeletions() and DisableFileDeletions().
DB::GetOptions().
DB::GetDbIdentity().
SliceParts - Variant of Put() that gathers output like writev(2)
Get() -- 1fdb3f -- 1.5x QPS increase for some workloads
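
Several entries above state numeric rules in prose: the BlobDB garbage collection cutoff (N oldest files, where N is the non-TTL blob file count multiplied by garbage_collection_cutoff), the 1.5 * write_buffer_size limit for pin_l0_filter_and_index_blocks_in_cache, and the 12.5% threshold for GoodCompressionRatio. The sketch below restates those rules as standalone C++ helpers. It does not use or reproduce RocksDB code. The function names are hypothetical, and the rounding and boundary behavior (truncation toward zero, strict versus non-strict comparison) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many of the oldest non-TTL blob files fall under
// the GC cutoff, i.e. N = file count * garbage_collection_cutoff.
// Assumptions: cutoff is in [0.0, 1.0] and the product is truncated toward zero.
inline std::size_t BlobGcEligibleFileCount(std::size_t num_non_ttl_blob_files,
                                           double cutoff) {
  return static_cast<std::size_t>(
      static_cast<double>(num_non_ttl_blob_files) * cutoff);
}

// Hypothetical helper: an L0 file stays eligible for filter/index pinning
// unless it is larger than 1.5 * write_buffer_size.
// Written as 2 * file_size <= 3 * write_buffer_size to avoid floating point;
// assumes sizes are small enough that the multiplications do not overflow.
inline bool L0FileEligibleForPinning(std::uint64_t file_size,
                                     std::uint64_t write_buffer_size) {
  return 2 * file_size <= 3 * write_buffer_size;
}

// Hypothetical helper: a block "meets the compression ratio" when the
// compressed form is smaller than the raw form by more than one eighth
// (12.5%). The strict comparison and integer division are assumptions.
inline bool MeetsCompressionRatio(std::uint64_t raw_size,
                                  std::uint64_t compressed_size) {
  return compressed_size < raw_size - raw_size / 8;
}
```

Under these assumptions, 10 non-TTL blob files with a cutoff of 0.25 give N = 2, a 150-byte L0 file with a 100-byte write buffer is still eligible for pinning while a 151-byte file is not, and an 800-byte block must compress to fewer than 700 bytes to count as compressed.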