Author: Denis Avetisyan
A new approach to concurrent stack design leverages sharding to unlock improved performance and scalability for multi-threaded applications.

This paper presents SEC, a concurrent stack implementation that unifies elimination and combining techniques with sharding, achieving linearizability together with state-of-the-art throughput.
Despite ongoing advancements in concurrent data structures, achieving both high throughput and scalability remains a significant challenge for modern multi-threaded applications. This paper introduces a novel approach to concurrent stack design, detailed in ‘Sharded Elimination and Combining for Highly-Efficient Concurrent Stacks’, which unifies sharding, elimination, and software combining to minimize contention and maximize parallelism. Experimental results demonstrate that this implementation outperforms existing state-of-the-art concurrent stacks by up to 2X, particularly under high-contention and large-scale threaded environments. Could this architecture represent a new paradigm for building highly-efficient, lock-free data structures?
The Inherent Bottleneck of Concurrent Stack Implementations
Traditional concurrent stack implementations often encounter performance bottlenecks as the number of threads accessing the stack increases, a phenomenon known as thread contention. This limitation arises because every push and pop targets the same hot spot, the top of the stack, forcing threads to serialize their updates even when the machine has idle cores. These waiting periods, while seemingly brief, accumulate rapidly under heavy workloads, significantly diminishing overall throughput. The core issue is not the speed of individual operations, but the time spent coordinating access and preventing data corruption, which ultimately limits the stack’s ability to handle concurrent requests efficiently and caps its scalability in multi-threaded environments.
While designed to avoid the blocking inherent in lock-based stacks, lock-free approaches like the Treiber Stack exhibit diminishing scalability as the number of concurrent threads increases. This performance bottleneck arises from the use of compare-and-swap (CAS) operations; under high contention, multiple threads repeatedly attempt to modify the stack’s top pointer simultaneously, resulting in frequent failed CAS attempts and wasted CPU cycles. Each failed attempt necessitates a retry, creating a contention loop where threads interfere with each other’s progress. Consequently, the theoretical benefits of lock-free design are eroded, and throughput plateaus or even declines as concurrency increases, highlighting the need for more sophisticated contention management strategies in concurrent stack implementations.
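To make this contention loop concrete, consider a minimal C++ sketch of a Treiber-style stack. This is an illustration of the classic algorithm, not code from the paper; memory reclamation is deliberately omitted, and a production version would need to add it (for example via hazard pointers or epoch-based reclamation).

```cpp
#include <atomic>
#include <utility>

// Minimal Treiber-style lock-free stack (sketch only).
template <typename T>
class TreiberStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> top_{nullptr};

public:
    void push(T value) {
        Node* node = new Node{std::move(value), nullptr};
        node->next = top_.load(std::memory_order_relaxed);
        // Under high contention this CAS fails repeatedly: each failure
        // reloads `top_` into node->next and retries, wasting cycles.
        while (!top_.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }

    bool pop(T& out) {
        Node* node = top_.load(std::memory_order_acquire);
        // Same retry pattern on the single shared top pointer.
        while (node && !top_.compare_exchange_weak(node, node->next,
                                                   std::memory_order_acquire,
                                                   std::memory_order_relaxed)) {
        }
        if (!node) return false;
        out = std::move(node->value);
        // Node is intentionally leaked: freeing it here is unsafe
        // without a reclamation scheme.
        return true;
    }
};
```

Every thread funnels through the single `top_` pointer, which is exactly the serialization point that elimination, combining, and sharding are designed to relieve.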
As modern computing increasingly relies on multi-core processors and concurrent applications, the efficient management of shared data structures becomes critically important. Traditional locking mechanisms, while ensuring data consistency, often introduce significant performance bottlenecks due to contention – situations where multiple threads attempt to access the same resource simultaneously. This contention wastes valuable CPU cycles and limits scalability, hindering the ability of applications to fully utilize available hardware. Consequently, research into novel techniques that minimize contention and maximize throughput in concurrent data structures, such as stacks, queues, and hash tables, is paramount. These advancements aren’t merely about incremental improvements; they represent a fundamental need to redesign core data structures for the realities of parallel processing, unlocking the potential for substantial gains in application performance and responsiveness.

The Elegance of Operation Elimination and Combination
Elimination, as implemented within the EB Stack, works by pairing a push with a concurrent pop and letting the two exchange values directly, bypassing the stack entirely. The pairing is correct because a push immediately followed by a pop leaves the stack in its original state, so both operations can be logically cancelled without ever updating shared stack memory. This directly reduces the number of operations that must execute against the central data structure and minimizes contention on its hot top pointer. The efficiency of elimination depends on opposing operations arriving close together in time, so it is most effective under workloads with a roughly balanced mix of pushes and pops.
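A single-slot exchanger illustrates the mechanism. This is a sketch of classic elimination under stated assumptions, not the EB Stack’s actual code; in particular the tagged state word used to defeat ABA is a choice of this example, and a real elimination array would hold many such slots.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <optional>

// One elimination slot. A push that collides with a concurrent pop
// hands its value over directly and neither touches the stack.
template <typename T>
class EliminationSlot {
    enum : uint64_t { EMPTY, BUSY, WAITING, DONE };  // low 2 bits of ctl_
    static uint64_t pack(uint64_t tg, uint64_t s) { return tg << 2 | s; }
    static uint64_t st(uint64_t c) { return c & 3; }
    static uint64_t tag(uint64_t c) { return c >> 2; }

    std::atomic<uint64_t> ctl_{pack(0, EMPTY)};      // tag defeats ABA
    T value_{};

public:
    // Pusher: offer `v` for `window`; true if a pop consumed it.
    bool try_eliminate_push(const T& v, std::chrono::nanoseconds window) {
        uint64_t c = ctl_.load(std::memory_order_acquire);
        if (st(c) != EMPTY ||
            !ctl_.compare_exchange_strong(c, pack(tag(c) + 1, BUSY)))
            return false;                    // slot busy: fall back to stack
        value_ = v;                          // safe: this thread owns the slot
        ctl_.store(pack(tag(c) + 2, WAITING), std::memory_order_release);
        auto deadline = std::chrono::steady_clock::now() + window;
        for (;;) {
            uint64_t cur = ctl_.load(std::memory_order_acquire);
            if (st(cur) == DONE) {           // a pop took the value
                ctl_.store(pack(tag(cur) + 1, EMPTY),
                           std::memory_order_release);
                return true;
            }
            if (std::chrono::steady_clock::now() >= deadline) {
                uint64_t expect = pack(tag(c) + 2, WAITING);
                if (ctl_.compare_exchange_strong(expect,
                                                 pack(tag(c) + 3, EMPTY)))
                    return false;            // withdrew: fall back to stack
                // Lost to a late pop; next iteration observes DONE.
            }
        }
    }

    // Popper: take a waiting pusher's value if one is present.
    std::optional<T> try_eliminate_pop() {
        uint64_t c = ctl_.load(std::memory_order_acquire);
        if (st(c) != WAITING) return std::nullopt;
        T v = value_;  // seqlock-style read, validated by the CAS below;
                       // production code would use std::atomic_ref here.
        if (!ctl_.compare_exchange_strong(c, pack(tag(c) + 1, DONE)))
            return std::nullopt;             // lost the race: fall back
        return v;
    }
};
```

If elimination fails within the window, both sides fall back to the underlying stack, so correctness never depends on a partner arriving.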
Software combining, as implemented in the CC and FC Stacks, reduces synchronization overhead by aggregating multiple push or pop operations into a single batch processed by a dedicated combiner thread. This amortizes the cost of synchronization, typically mutex locks or atomic operations, across numerous requests, since resources are acquired and released far less often than once per operation. Instead of each operation contending for synchronization primitives individually, the combiner thread serializes access, effectively batching requests and improving throughput, particularly under high contention with frequent small operations.
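The core amortization idea can be sketched as follows. This is illustrative rather than the CC or FC Stacks’ actual code: it omits the publication list and the spin-waiting of requesting threads, showing only how one lock acquisition services a whole batch.

```cpp
#include <mutex>
#include <stack>
#include <vector>

// Flat-combining sketch: whichever thread holds the combiner lock
// applies every pending request in one pass, so the lock is acquired
// once per batch instead of once per operation.
template <typename T>
class CombiningStack {
public:
    struct Request {
        enum Kind { PUSH, POP } kind;
        T value{};               // input for PUSH, output for POP
        bool popped_ok = false;  // set if a POP found an element
        bool done = false;       // real code: published with release semantics
    };

    // Called by the current combiner with the batch it collected.
    void combine(std::vector<Request*>& batch) {
        std::lock_guard<std::mutex> guard(combiner_lock_);
        for (Request* r : batch) {
            if (r->kind == Request::PUSH) {
                stack_.push(r->value);
            } else if (!stack_.empty()) {
                r->value = stack_.top();
                stack_.pop();
                r->popped_ok = true;
            }
            r->done = true;
        }
    }

private:
    std::mutex combiner_lock_;  // contended once per batch, not per op
    std::stack<T> stack_;       // plain sequential stack under the lock
};
```

A further benefit noted in the flat-combining literature is cache locality: the combiner walks the batch and the stack with a warm cache instead of bouncing the top-of-stack cache line among cores.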
Both elimination and combining techniques, while capable of reducing operational overhead, necessitate precise configuration to achieve optimal performance. The effectiveness of elimination is highly dependent on the rate of opposing operations and the granularity of cancellation; improper tuning can lead to increased complexity without commensurate benefits. Similarly, software combining introduces overhead related to thread synchronization and the management of the combiner thread itself. The size of combined operations, the scheduling of the combiner, and the handling of contention all require careful consideration and adjustment based on workload characteristics. These complexities can significantly increase the difficulty of implementation and maintenance, potentially offsetting the gains from reduced contention if not properly addressed.
The success of both elimination and combining techniques for reducing operation overhead is directly correlated to the efficiency with which related operations can be identified and processed. This identification requires mechanisms to detect opposing push and pop operations for elimination, or to aggregate multiple operations destined for the same data structure for combining. Inefficient identification introduces latency and diminishes the performance gains these techniques aim to provide; conversely, rapid and accurate identification of related operations allows for maximized cancellation or aggregation, minimizing contention and synchronization costs. The overhead of the identification process itself must remain lower than the cost of the operations it seeks to optimize to ensure a net performance benefit.

The SEC Stack: A Synthesis of Scalability
The SEC Stack achieves improved performance by integrating elimination, combining, and sharding techniques. Sharding divides processing threads into groups, reducing contention for shared resources. Elimination identifies and removes redundant or unnecessary operations within batches before processing, while combining aggregates operations intended for the same data, minimizing overall transactional load. These optimizations collectively result in a measured throughput increase of 1.8 to 2.5 times compared to current state-of-the-art stack implementations, demonstrating significant scalability improvements.
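As a sketch of the sharding ingredient alone (the paper’s actual policy is not reproduced here, and `kShards` is an assumed tunable), each thread can be pinned to one group, so that elimination partners and combiner batches are sought only among a small fixed set of peers:

```cpp
#include <cstddef>
#include <functional>
#include <thread>

constexpr std::size_t kShards = 8;  // assumed tunable, not from the paper

// Pin each thread to one shard; contention on any one elimination or
// combining structure is then limited to that shard's threads.
std::size_t my_shard() {
    // Hash the thread id once per thread; a real implementation might
    // instead assign shards round-robin at registration to balance load.
    static thread_local const std::size_t shard =
        std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
    return shard;
}
```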
Aggregators within the SEC Stack process operations in batches to reduce overhead. A dedicated combiner thread applies these operations on behalf of multiple requesting threads, thereby minimizing contention and maximizing throughput. Performance testing has demonstrated an average observed batch size of 41 operations when handling 100% update rates, indicating the typical scale at which these batches are processed before being applied to the underlying data structures.
The Freezer Thread within the SEC Stack architecture serves a critical role in coordinating the lifecycle of operation batches. This dedicated thread is responsible for signaling when a batch is complete and can be safely processed, initiating the elimination process whereby operations within the batch are marked for cancellation if no longer required. This coordination ensures that operations are not prematurely cancelled while still allowing for efficient reclamation of resources once the batch is fully processed, optimizing throughput and minimizing contention by controlling the scope and timing of elimination activities.
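A heavily simplified view of such a batch lifecycle is sketched below; the state names and the single compare-and-swap transition are assumptions of this example, not the paper’s protocol.

```cpp
#include <atomic>

// Assumed batch states: OPEN accepts new operations, FROZEN is a
// stable set over which elimination may run, APPLIED means the batch
// has been processed and its resources may be reclaimed.
enum class BatchState { OPEN, FROZEN, APPLIED };

struct Batch {
    std::atomic<BatchState> state{BatchState::OPEN};
};

// Freezer thread: close the batch so the combiner can eliminate and
// apply a stable set of operations; no operation is cancelled while
// the batch is still accepting new arrivals.
bool freeze(Batch& b) {
    BatchState expected = BatchState::OPEN;
    return b.state.compare_exchange_strong(expected, BatchState::FROZEN);
}
```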
The SEC Stack employs a Fetch&Increment mechanism for updating counters, minimizing contention during concurrent modifications. Memory management is handled through Epoch-Based Reclamation, which allows retired nodes to be freed safely without halting all operations. Performance analysis indicates an elimination degree of 78% within processed batches at a 100% update rate, meaning that 78% of operations are cancelled against a matching opposing operation in their batch and never touch the underlying stack, reducing overall computational load and improving throughput.
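Fetch-and-increment is attractive here because, unlike a CAS retry loop, it never fails and retries: on common hardware it compiles to a single atomic instruction (LOCK XADD on x86), so every caller makes progress in bounded time. A minimal example of the counter pattern (the names are illustrative):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> next_ticket{0};

// Each caller receives a unique, monotonically increasing value in
// one wait-free step, with no failed attempts under contention.
uint64_t claim_ticket() {
    return next_ticket.fetch_add(1, std::memory_order_acq_rel);
}
```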

Linearizability and the Future of Concurrent Data Structures
The SEC Stack prioritizes linearizability, a crucial property for concurrent data structures that guarantees operations appear to execute in a sequential order, as if only one thread accessed the stack at a time. This principle ensures predictable behavior, simplifying reasoning about the stack’s correctness and preventing unexpected data inconsistencies that can arise in concurrent environments. By adhering to this strict correctness criterion, the SEC Stack provides developers with a reliable foundation for building robust and scalable applications, even when faced with high levels of thread contention. The design choices within the SEC Stack are specifically geared towards upholding this guarantee, allowing for easier debugging, verification, and ultimately, more trustworthy concurrent systems.
The SEC Stack demonstrably enhances performance through a focused reduction of contention and a maximization of throughput when compared to conventional and existing lock-free stack implementations. Traditional lock-free stacks often experience bottlenecks as multiple threads attempt to access and modify the stack concurrently, leading to wasted cycles and diminished scalability. The SEC Stack mitigates this issue by strategically partitioning the stack and employing a novel access pattern that minimizes the probability of threads interfering with one another. This design allows for a greater proportion of operations to complete without delay, resulting in substantial improvements in overall performance, particularly in highly concurrent environments. Empirical evaluations reveal that the SEC Stack consistently achieves higher transaction rates and lower latency, establishing it as a compelling alternative for applications demanding efficient concurrent data access.
Researchers anticipate extending the principles demonstrated in the SEC Stack to a broader range of concurrent data structures, potentially revolutionizing designs beyond simple stack implementations. This includes examining applicability to queues, lists, and even more complex data types where contention often limits scalability. Furthermore, investigations will focus on adaptive sharding – a dynamic partitioning strategy – to optimize performance characteristics under diverse and fluctuating workloads. By intelligently adjusting the degree of sharding, the system aims to minimize latency during low-contention periods while maintaining high throughput when faced with intense concurrency, ultimately paving the way for more robust and efficient concurrent systems capable of handling increasingly complex computational demands.
The development of the SEC Stack signifies progress in the construction of concurrent systems engineered for scalability and efficiency. As application demands continue to escalate, traditional data structures often become bottlenecks, hindering performance and limiting the potential for parallel processing. The SEC Stack addresses these limitations through innovative design choices that minimize contention and maximize throughput, paving the way for systems capable of handling increasingly complex workloads. This advancement isn’t merely about improving stack performance; it represents a building block for broader architectural patterns, suggesting a future where highly concurrent operations can be executed reliably and efficiently across a multitude of cores and processing units, ultimately enabling more responsive and powerful applications.
The pursuit of highly-efficient concurrent data structures, as demonstrated by SEC, echoes a fundamental tenet of computational correctness. The paper’s innovative sharding approach, combined with elimination and combining, prioritizes not merely functional operation but provable linearizability, a mathematically rigorous guarantee of data integrity. This aligns with Dijkstra’s assertion: “It’s not enough to show something works in practice; you must prove why it works.” SEC doesn’t simply offer performance gains; it establishes a foundation for verifiable concurrency, addressing a critical need in modern, multi-threaded systems. The focus on a provable solution, rather than a merely ‘working’ one, highlights the elegance of a mathematically pure design.
What Remains Constant?
The pursuit of concurrent data structures invariably circles back to a fundamental question: Let N approach infinity – what remains invariant? This work, presenting Sharded Elimination and Combining (SEC), offers a compelling, though not conclusive, answer in the specific domain of stack implementations. The reported performance gains are noteworthy, yet they sidestep the inherent complexity of proving linearizability beyond a finite number of contending threads. Scalability, while demonstrated, feels less like a solved problem and more like a postponement of inevitable contention as N grows.
Future investigations should not focus solely on micro-optimizations, but on formal verification techniques capable of establishing guarantees even as concurrency intensifies. The blending of elimination and combining, while intuitively appealing, introduces a new class of potential race conditions that demand rigorous mathematical treatment. A truly elegant solution will not merely perform well, but will be demonstrably correct under all possible conditions, irrespective of the system load.
Ultimately, the true measure of progress lies not in achieving incremental speedups, but in reducing the complexity of reasoning about concurrent systems. SEC represents a step forward, but the horizon remains populated with unanswered questions about the limits of scalability and the true cost of concurrency. The invariant, it seems, is the enduring challenge of maintaining order amidst chaos.
Original article: https://arxiv.org/pdf/2601.04523.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/