Author: Denis Avetisyan
New latch-free techniques dramatically reduce thread contention and improve performance in key-value stores by enabling highly concurrent access to B-tree indexes.
This review details a novel concurrency control approach leveraging ‘notices’ and delta updates to minimize thread stalls and maximize scalability.
High-performance key-value stores are often bottlenecked by the overhead of thread contention and resulting stalls. This paper, ‘Avoiding Thread Stalls and Switches in Key-Value Stores: New Latch-Free Techniques and More’, addresses this challenge by introducing ‘notices’, a novel latch-free concurrency control mechanism coupled with delta record updating. These techniques significantly reduce wasted work and minimize thread switching, particularly within B-tree index maintenance. Could this approach unlock new levels of scalability and efficiency for modern data management systems?
The Inevitable Bottleneck: Why Key-Value Stores Struggle
Key-value stores, foundational components of modern data infrastructure, frequently encounter performance limitations as concurrent access increases. This stems from the inherent challenge of managing simultaneous requests for the same data, leading to contention where multiple processes or threads attempt to modify or read the same key at nearly the same time. Such contention isn’t merely a theoretical issue; it directly impacts responsiveness, causing delays as the system arbitrates access and ensures data consistency. While designed for speed, traditional architectures can become bogged down when faced with a high volume of competing requests, hindering their ability to scale efficiently and deliver consistently low latency – a critical requirement for many applications, including real-time analytics and high-frequency trading platforms. The resulting performance degradation necessitates careful consideration of synchronization strategies and architectural choices to mitigate the effects of concurrent access and maintain optimal operational speed.
Concurrent access to key-value stores frequently results in substantial thread switching, a phenomenon that significantly limits scalability. As multiple threads attempt to access and modify data, the operating system is compelled to rapidly alternate between them, incurring overhead with each context switch. This constant shifting prevents any single thread from maintaining prolonged execution, diminishing overall throughput and increasing latency. Modern applications, designed to handle a high volume of concurrent requests, are particularly susceptible to this performance degradation, as the cost of thread switching quickly outweighs the benefits of parallelism. The effect is a system that spends more time managing threads than actually processing data, ultimately hindering its ability to scale efficiently with increasing workloads.
Conventional key-value stores often rely on synchronization primitives like latches to manage concurrent access to shared data, but these mechanisms inherently introduce blocking. When a thread attempts to access a resource protected by a latch that is currently held by another thread, it is forced to wait, a process known as blocking. This waiting incurs significant overhead, as the operating system must context switch between threads, saving the state of the blocked thread and restoring the state of another. The accumulation of these context switches dramatically reduces overall throughput and increases latency, especially under high contention. Furthermore, the blocking nature of latches can lead to cascading delays, where multiple threads are forced to wait, exacerbating performance bottlenecks and limiting the scalability of the key-value store in demanding applications.
Beyond Locks: A Glimmer of Hope with Latch-Free Techniques
Latch-free techniques represent a concurrency control method designed to reduce contention and avoid blocking threads during data access. Traditional locking mechanisms, such as mutexes and semaphores, can introduce performance bottlenecks due to thread blocking and context switching. Latch-free approaches, conversely, utilize atomic operations to manipulate shared data without requiring threads to wait for exclusive access. This reduction in contention directly translates to improved throughput and scalability, particularly in high-contention scenarios. By eliminating blocking, these techniques allow multiple threads to make progress concurrently, maximizing CPU utilization and reducing latency. The core principle involves designing algorithms where threads can complete operations without requiring coordination or synchronization primitives that inherently introduce wait states.
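The retry loop at the heart of such designs can be sketched in a few lines. This is a minimal, illustrative Python example, not code from the paper: the `AtomicRef` class emulates a hardware CAS word with a lock purely so the sketch is runnable; a real store would use an actual atomic instruction (e.g. `std::atomic::compare_exchange` in C++ or x86 `CMPXCHG`).

```python
import threading

class AtomicRef:
    """Stand-in for a CAS-capable memory word. The internal lock only
    emulates the atomicity a hardware instruction provides; it is not
    part of the latch-free design itself."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically: install `new` only if the current value is still
        # `expected`; report whether the install succeeded.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def lock_free_increment(counter):
    # Classic latch-free pattern: read, compute, attempt to install.
    # On conflict, retry instead of blocking — no thread ever waits
    # on another thread's critical section.
    while True:
        old = counter.load()
        if counter.compare_and_swap(old, old + 1):
            return old + 1

c = AtomicRef(0)
threads = [
    threading.Thread(target=lambda: [lock_free_increment(c) for _ in range(1000)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because failed CAS attempts simply retry, every thread makes progress without ever being descheduled by a held latch, which is precisely the property that avoids the context-switch overhead described above.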
The Notices mechanism addresses state change management and redundant work reduction through a latch-free approach. It functions by associating each modified data structure with a notice, which is a small, immutable record of the change. Consumers of that data structure can then efficiently check the notice to determine if new work is required, avoiding full data structure scans. This is achieved using atomic operations to publish and consume notices, allowing multiple threads to concurrently observe changes without contention or blocking. The notices themselves contain sufficient information to indicate the type of change, enabling consumers to selectively re-execute only the affected work, rather than reprocessing the entire dataset.
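The publish-and-poll pattern described above can be sketched as follows. Note that the `Notice` fields, the sequence-number cursor, and the method names here are illustrative assumptions, not the paper's exact protocol; the point is only that consumers check small change records instead of rescanning the structure.

```python
import itertools

class Notice:
    """Small, immutable record of one change (fields are hypothetical)."""
    def __init__(self, seq, kind, key):
        self.seq, self.kind, self.key = seq, kind, key

class NoticedStructure:
    def __init__(self):
        self._data = {}
        self._notices = []               # append-only published notices
        self._seq = itertools.count(1)

    def update(self, key, value, kind="upsert"):
        self._data[key] = value
        # Publish a notice describing the change instead of forcing
        # consumers to rescan the whole structure.
        self._notices.append(Notice(next(self._seq), kind, key))

    def changes_since(self, last_seen):
        # Consumers poll only the notices they have not yet processed,
        # then re-execute work for just the affected keys.
        return [n for n in self._notices if n.seq > last_seen]

s = NoticedStructure()
s.update("a", 1)
s.update("b", 2)
cursor = 0
new = s.changes_since(cursor)   # two unprocessed notices
cursor = new[-1].seq
s.update("a", 3)                # only this change is pending now
```

In the real mechanism the publish and consume steps would use atomic operations so multiple threads can observe changes without blocking; the list append here is a single-threaded simplification.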
Data consistency in latch-free techniques is maintained through the utilization of atomic operations, specifically Compare-and-Swap (CAS) and the Epoch Mechanism. CAS allows for optimistic concurrency control by conditionally updating memory locations based on their current value, avoiding the need for locks. The Epoch Mechanism further enhances concurrency by allowing multiple readers and a single writer to access data concurrently, utilizing epoch numbers to track visibility and ensure that readers observe a consistent snapshot of data without blocking the writer. This combination of CAS and the Epoch Mechanism enables contention-free access and modification of shared data structures, eliminating the performance bottlenecks associated with traditional locking mechanisms.
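The Epoch Mechanism can be illustrated with a deliberately simplified, single-threaded model. The manager class, its method names, and the limbo-list layout are assumptions made for illustration; production schemes announce per-thread epochs with atomic stores rather than a shared dictionary. The invariant shown is the essential one: memory unlinked by a writer is only freed once no reader entered before the unlink.

```python
class EpochManager:
    """Toy epoch-based reclamation: retired objects sit in a limbo list
    until every active reader has moved past the epoch in which they
    were retired."""
    def __init__(self):
        self.global_epoch = 0
        self.active = {}     # reader id -> epoch it entered
        self.limbo = []      # (retirement epoch, object) pairs

    def enter(self, tid):
        # A reader pins the current epoch before traversing shared data.
        self.active[tid] = self.global_epoch

    def exit(self, tid):
        del self.active[tid]

    def retire(self, obj):
        # A writer unlinks an object but cannot free it yet: a reader
        # from an older epoch might still hold a reference.
        self.limbo.append((self.global_epoch, obj))

    def advance_and_collect(self):
        # Bump the epoch, then free anything retired before the oldest
        # epoch any active reader is still pinned to.
        self.global_epoch += 1
        oldest = min(self.active.values(), default=self.global_epoch)
        freed = [o for e, o in self.limbo if e < oldest]
        self.limbo = [(e, o) for e, o in self.limbo if e >= oldest]
        return freed

mgr = EpochManager()
mgr.enter("reader")                  # reader pins epoch 0
mgr.retire("old-node")               # writer unlinks a node
first = mgr.advance_and_collect()    # reader may still see it: nothing freed
mgr.exit("reader")
second = mgr.advance_and_collect()   # now provably unreachable: freed
```

This is what lets a writer proceed without blocking readers: the writer never waits, it merely defers physical reclamation until it is provably safe.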
Bw-Trees: Optimizing Storage for the Inevitable Write Storm
The system employs the Bw-tree data structure as its primary indexing and storage mechanism. This tree-based approach is characterized by a high fan-out, meaning each node can possess a significant number of child nodes, thereby reducing tree height and minimizing search times. Unlike traditional B-trees, the Bw-tree is specifically designed to optimize write performance in environments with frequent data modifications. This is achieved through techniques that prioritize efficient node splitting and merging, and by maintaining a balanced tree structure even with dynamic data insertion and deletion. The Bw-tree's architecture allows for predictable performance characteristics, crucial for maintaining consistent access times as the dataset scales.
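The effect of fan-out on tree height follows directly from the arithmetic: a tree with fan-out f needs roughly ⌈log_f N⌉ levels to index N keys, so raising the fan-out shrinks the number of node visits per lookup. A quick sketch:

```python
import math

def tree_height(num_keys, fanout):
    # Levels needed so that fanout ** height >= num_keys.
    return math.ceil(math.log(num_keys, fanout))

# One billion keys: a high-fan-out node (say 128 children) needs only
# 5 levels, while a binary structure would need 30.
print(tree_height(10**9, 128))  # → 5
print(tree_height(10**9, 2))    # → 30
```

The fan-out value 128 is illustrative, not a figure from the paper; the point is the logarithmic relationship between fan-out and search depth.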
Bw-tree node splitting and merging operations are engineered for high concurrency: each structural change is decomposed into small steps that are installed atomically, for example via a compare-and-swap on the affected mapping-table entry, so threads never block on latches while the tree reorganizes. This allows multiple threads to operate on different parts of the tree simultaneously, minimizing contention. Dynamic scaling is supported by algorithms that efficiently redistribute data during splits and merges, ensuring balanced tree structures even with frequent insertions and deletions. These operations are designed to keep the window for conflicting updates small, maintaining performance under high load and facilitating seamless expansion of the storage system's capacity.
The Log Structured Store (LSS) improves system performance by organizing data writes sequentially on flash storage. Rather than performing random writes across the storage medium, the LSS batches incoming write operations into larger, contiguous blocks before committing them to flash. This approach significantly reduces write amplification, a key performance bottleneck in flash storage, and minimizes the number of I/O operations required. By prioritizing sequential writes, the LSS leverages the inherent speed advantages of flash memory, resulting in increased throughput and reduced latency for write-intensive workloads. This is particularly effective because flash storage has limited endurance; reducing the number of write cycles extends the lifespan of the storage medium.
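The batching behavior can be made concrete with a toy model. The class below is an illustrative sketch, not the LSS implementation: the flush threshold, the in-memory index, and the list-of-segments "flash" are all stand-ins chosen to show the one essential idea, that many small writes are absorbed into one sequential append.

```python
class LogStructuredStore:
    """Toy log-structured store: puts are buffered and flushed as one
    sequential segment append, trading scattered random writes for a
    single batched sequential I/O."""
    def __init__(self, flush_threshold=4):
        self.log = []                    # simulated flash: append-only segments
        self.buffer = []                 # pending writes not yet on "flash"
        self.flush_threshold = flush_threshold
        self.index = {}                  # key -> (segment, offset)

    def put(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        seg = len(self.log)
        self.log.append(list(self.buffer))   # one sequential write
        for off, (k, _) in enumerate(self.buffer):
            self.index[k] = (seg, off)       # latest location wins
        self.buffer.clear()

    def get(self, key):
        for k, v in reversed(self.buffer):   # newest unflushed value first
            if k == key:
                return v
        if key in self.index:
            seg, off = self.index[key]
            return self.log[seg][off][1]
        return None

lss = LogStructuredStore(flush_threshold=2)
lss.put("x", 1)
lss.put("y", 2)   # hits the threshold: both land in one segment
lss.put("x", 3)   # still buffered; reads see it before the flushed copy
```

A real LSS must also garbage-collect dead versions left behind in old segments; that cleaning pass is omitted here for brevity.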
From Theory to Reality: A System That Actually Scales
This novel architecture isn’t merely a theoretical construct; it has been successfully integrated into the core of Cosmos DB, a globally distributed, multi-model database service. This practical implementation allows Cosmos DB to leverage the Bw-tree’s strengths across a geographically diverse infrastructure, ensuring both high availability and consistent performance for users worldwide. By deploying this system in a production environment handling substantial data volumes and user traffic, the design has proven its robustness and scalability. The integration demonstrates the feasibility of moving beyond traditional key-value stores and adopting a more efficient data management approach within a complex, real-world database system, ultimately enhancing data accessibility and query responsiveness for a broad range of applications.
Much of the performance gain comes from a carefully designed system of data management centered on Delta Updating and a Mapping Table. Instead of rewriting entire data pages upon modification, Delta Updating focuses solely on recording the changes – the ‘delta’ – minimizing disk I/O and computational overhead. This approach dramatically speeds up update operations, especially in high-volume scenarios. Complementing this, the Mapping Table acts as a highly efficient directory, swiftly locating the physical storage addresses of data pages. By avoiding costly sequential scans, the Mapping Table ensures rapid data retrieval, enabling quicker access to both current and historical data versions, and ultimately boosting the overall system responsiveness and throughput.
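The interplay of deltas and the mapping table can be sketched in a simplified, single-threaded form. The class and method names below are invented for illustration; a real Bw-tree installs a new chain head with a single compare-and-swap on the mapping-table slot rather than the plain assignment used here.

```python
class DeltaRecord:
    """A small change record prepended to a page's delta chain."""
    def __init__(self, key, value, next_rec):
        self.key, self.value, self.next = key, value, next_rec

class BasePage:
    """A consolidated page holding the folded-in record set."""
    def __init__(self, records):
        self.records = dict(records)

class MappingTable:
    """Maps logical page ids to the head of each page's delta chain."""
    def __init__(self):
        self.slots = {}

    def update(self, pid, key, value):
        # Prepend a delta instead of rewriting the whole page:
        # the write touches one small record, not the full page.
        self.slots[pid] = DeltaRecord(key, value, self.slots[pid])

    def read(self, pid, key):
        # Walk the chain newest-first, falling through to the base page.
        rec = self.slots[pid]
        while isinstance(rec, DeltaRecord):
            if rec.key == key:
                return rec.value
            rec = rec.next
        return rec.records.get(key)

    def consolidate(self, pid):
        # Periodically fold the delta chain into a fresh base page so
        # read chains stay short.
        rec, pending = self.slots[pid], {}
        while isinstance(rec, DeltaRecord):
            pending.setdefault(rec.key, rec.value)  # newest delta wins
            rec = rec.next
        merged = dict(rec.records)
        merged.update(pending)
        self.slots[pid] = BasePage(merged)

mt = MappingTable()
mt.slots[7] = BasePage({"a": 1})
mt.update(7, "a", 2)   # delta, not a page rewrite
mt.update(7, "b", 3)
```

Consolidation is the cost paid for cheap writes: reads slow down as chains grow, so the chain is periodically collapsed back into a single base page.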
Real-time indexing is central to the architecture’s performance, ensuring data is immediately available for querying as soon as it is written. This capability bypasses the delays associated with traditional batch indexing, which requires periodic scans and updates to reflect changes. Instead, the system maintains index structures that are updated concurrently with data modifications, allowing queries to access the most current information without waiting for indexing processes to complete. This approach significantly reduces query latency and enhances responsiveness, particularly critical for applications demanding up-to-the-second data accuracy and rapid retrieval, such as financial trading platforms or real-time analytics dashboards. The benefit extends beyond speed; it also simplifies data management by eliminating the need for complex synchronization between data and its corresponding indexes.
Benchmarking revealed a significant performance advantage for the implemented Bw-tree architecture when contrasted with established key-value stores like Berkeley DB and RocksDB. Experiments demonstrated substantial gains in throughput, reduced latency, and improved scalability, indicating the Bw-tree’s efficiency in handling large datasets and concurrent operations. Notably, the Bw-tree also presented a superior cost-performance profile compared to MassTree, achieving comparable results while requiring considerably less main memory; this optimized memory usage translates to lower operational expenses and broader applicability across diverse hardware configurations.
The pursuit of concurrency control, as demonstrated by this work on latch-free techniques, invariably reveals the limitations of even the most elegant designs. The paper’s focus on minimizing thread stalls and switches with ‘notices’ and delta updates, a seemingly stable solution for B-tree indexing, will inevitably encounter edge cases that production systems delight in exposing. As Claude Shannon observed, “Communication is the transmission of information, but to really communicate it must be received, understood, and acted upon.” This applies perfectly; a beautifully designed concurrency mechanism is useless if it doesn’t handle real-world contention and workload variations. The system might be stable now, but the moment it faces sustained load, the cracks will begin to show.
What’s Next?
The pursuit of latch-free designs, as demonstrated here, invariably runs into the question of what constitutes ‘progress’. It’s a familiar cycle: complexity is traded for theoretical gains in concurrency, and then production finds a novel way to deadlock on edge cases no simulation anticipated. The authors rightly focus on minimizing thread stalls, but one suspects the real bottleneck will migrate – perhaps to memory contention, or the increasingly expensive overhead of propagating these ‘notices’. Anything called scalable hasn’t been tested properly.
The reliance on B-trees, while pragmatic, feels like a tacit admission of limits. These structures remain stubbornly sequential in many operations, and the benefits of a latch-free approach may be incrementally absorbed by the inherent structure itself. The field will likely see a resurgence of interest in more radical indexing schemes – perhaps those dismissed years ago for their complexity – simply because simplicity, eventually, wins. Better one monolith than a hundred lying microservices.
Ultimately, the most interesting work won’t be about eliminating contention entirely, but about detecting it quickly and recovering gracefully. The logs will always tell the tale. And the next ‘revolution’ will be measured not in microseconds shaved off a benchmark, but in the reduction of late-night debugging sessions.
Original article: https://arxiv.org/pdf/2601.00208.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-01-06 03:19