Author: Denis Avetisyan
A new distributed system, Lotus, tackles performance limitations in disaggregated memory architectures by radically rethinking how transactions and locking are handled.

Lotus optimizes disaggregated transactions by decoupling locks from data and processing them entirely within compute nodes, leveraging RDMA for enhanced performance and fault tolerance.
While disaggregated memory (DM) architectures offer increased resource utilization, existing distributed transaction systems suffer performance bottlenecks due to lock contention at memory node network interfaces. This paper introduces Lotus: Optimizing Disaggregated Transactions with Disaggregated Locks, a novel system that decouples locks from data, executing them entirely on compute nodes to alleviate these network limitations. By employing an application-aware lock management mechanism and a lock-first transaction protocol, Lotus achieves significant improvements in throughput and latency. Could this lock disaggregation approach unlock further scalability and efficiency gains in emerging DM systems and beyond?
The Inevitable Memory Wall
Contemporary applications, from high-resolution video streaming and immersive gaming to complex scientific simulations and large language models, are relentlessly pushing the boundaries of memory system capabilities. These workloads require not only vast amounts of data storage – measured in terabytes and beyond – but also the ability to access that data with exceptional speed. Traditional memory architectures, however, are struggling to keep pace. The increasing disparity between processing power and memory access speeds creates a significant bottleneck, hindering overall system performance and limiting the potential of advanced computing. This strain is particularly evident in data-intensive tasks where the rate at which data can be moved to and from the processor becomes the primary constraint, effectively stalling progress despite advancements in CPU technology.
The fundamental constraint on computing speed, often termed the von Neumann bottleneck, arises from the physical separation of the central processing unit (CPU) and memory. This architecture necessitates data constantly traveling back and forth between these components – the CPU requesting data, and memory supplying it – a process that consumes significant time and energy. As CPU speeds have dramatically increased over decades, memory access speeds have lagged behind, creating an ever-widening performance gap. This limitation isn’t simply about speed; the constant data transfer also creates a significant energy drain, impacting battery life in portable devices and increasing operational costs in data centers. Consequently, modern applications – from complex simulations to real-time data analysis and artificial intelligence – are increasingly hampered by the rate at which the CPU can obtain the data it needs, regardless of how powerful the processor itself may be. The bottleneck effectively limits the realization of theoretical computing power, driving research into innovative memory architectures and processing paradigms.
Despite considerable innovation in memory technologies, current solutions attempting to bridge the performance gap between processors and memory often introduce substantial complexity. Techniques like multi-level caches and wider memory buses offer incremental improvements, but these come at the cost of increased hardware overhead, design intricacy, and power consumption. Furthermore, these approaches frequently address symptoms rather than the root cause – the fundamental separation of processing and storage. Emerging technologies such as 3D-stacked memory and high-bandwidth memory, while promising, face challenges in manufacturing, cost, and integration with existing systems. Consequently, the widening disparity between computational capabilities and memory access speeds continues to limit overall system performance, demanding more radical architectural shifts to truly overcome these limitations and unlock the full potential of modern computing.

Decoupling Compute and Storage: A Necessary Evolution
Disaggregated Memory (DM) architectures fundamentally alter traditional system design by removing the fixed coupling between processor and memory resources. In conventional systems, memory capacity scales proportionally with compute nodes, often resulting in underutilization or capacity mismatches. DM, conversely, pools memory into shared resources accessible via a high-speed interconnect, allowing for independent scaling of both compute and memory. This decoupling enables dynamic allocation of memory to workloads based on need, improving resource efficiency and reducing capital expenditure. Furthermore, the pooled nature of DM facilitates greater flexibility in supporting diverse workloads with varying memory footprints and allows for oversubscription of memory resources, leveraging the statistical multiplexing of memory demands across multiple applications.
Lock disaggregation represents an architectural refinement of disaggregated memory systems by physically locating lock management resources – the mechanisms controlling access to shared data – with the compute nodes processing the data. Traditionally, lock management is coupled with data storage, creating a potential bottleneck as compute nodes must remotely access these locks for every data operation. By co-locating locks with compute, lock disaggregation minimizes lock contention and significantly reduces inter-node communication latency associated with lock acquisition and release. This approach is particularly beneficial for workloads characterized by frequent, fine-grained data access and high concurrency, as it directly addresses communication overhead and improves overall transaction throughput.
Lock disaggregation addresses performance bottlenecks in high-throughput transaction processing by relocating lock management functions to the compute nodes themselves. Traditional architectures centralize lock management, creating contention as multiple compute nodes request access to shared lock resources. By distributing locks, the frequency of inter-node communication for lock acquisition and release is substantially reduced. This localized management minimizes latency associated with remote lock operations, thereby decreasing overall transaction processing time and increasing system throughput. The resulting reduction in communication overhead directly improves scalability, as fewer network resources are consumed by lock-related traffic.
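The effect of co-locating lock state with compute can be shown with a deliberately simplified sketch: lock acquisition and release become ordinary local operations on the compute node instead of network round trips to a memory node. The class and method names below are illustrative, not from the Lotus paper.

```python
import threading

class LocalLockTable:
    """Per-compute-node lock table (illustrative sketch).

    Lock state lives with the node that executes transactions, so
    acquire/release never crosses the network; contention is resolved
    locally instead of through remote retries at a memory-node NIC.
    """

    def __init__(self):
        self._guard = threading.Lock()   # protects the table itself
        self._held = set()               # keys currently locked here

    def try_acquire(self, key):
        with self._guard:
            if key in self._held:
                return False             # conflict detected locally
            self._held.add(key)
            return True

    def release(self, key):
        with self._guard:
            self._held.discard(key)
```

In a real system the table would be sharded across compute nodes (as the Lotus sections below describe); the point here is only that the acquire/release path involves no remote memory node.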

Lotus: A Scalable System Built on Decoupling
Lotus utilizes a disaggregated memory architecture, separating memory resources from compute nodes to enhance scalability and resource utilization. This disaggregation allows memory to be allocated dynamically to transaction processing tasks as needed. Complementing this is lock disaggregation: lock management is decoupled from the memory nodes that store the data and handled on the compute nodes themselves, reducing contention at the memory nodes' network interfaces and improving concurrency. This architecture is specifically designed to support high-performance transaction processing by minimizing resource bottlenecks and maximizing the throughput of concurrent transactions, avoiding the limitations of traditional, tightly coupled architectures in which data and lock management reside on the same machine.
The Lotus transaction protocol acquires locks before touching data, using one-sided Remote Direct Memory Access (RDMA) operations for efficiency. RDMA Read operations let a transaction inspect lock state directly in remote memory without involving the remote CPU; once its locks are held, RDMA Write operations handle both lock release and data modification, minimizing latency and maximizing throughput. This lock-first approach, combined with RDMA, avoids the abort-and-retry work that optimistic concurrency control incurs under contention, and it preserves transactional consistency by serializing access to shared resources.
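The ordering this protocol enforces, every lock before any data access, can be sketched as follows. The local set stands in for the disaggregated lock state and the callback stands in for the RDMA reads and writes, so this shows the shape of lock-first execution, not the paper's actual wire protocol.

```python
class ComputeNode:
    """Minimal lock-first executor (illustrative; not the Lotus API)."""

    def __init__(self, memory):
        self.memory = memory   # stands in for remote memory-node data
        self.locks = set()     # lock state held on the compute node

    def run(self, keys, fn):
        """Acquire every lock, then run the data phase, then release."""
        keys = sorted(keys)    # fixed global order avoids deadlock
        taken = []
        try:
            for k in keys:
                if k in self.locks:
                    return None            # abort on contention
                self.locks.add(k)
                taken.append(k)
            # All locks held: the data reads/writes (stand-ins for
            # RDMA READ/WRITE to memory nodes) are now safe.
            return fn(self.memory)
        finally:
            for k in taken:
                self.locks.discard(k)      # lock release
```

Note the deliberate asymmetry with optimistic schemes: there is no validation step at the end, because conflicting transactions were excluded before any data was touched.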
Lotus employs application-aware lock sharding to distribute locks across the system based on access patterns observed during application profiling. This technique assigns locks protecting frequently co-accessed data to different physical nodes, reducing the likelihood of contention. Complementing this is a two-level load balancing scheme: a global load balancer distributes transactions to nodes with available resources, while a local lock balancer within each node further distributes lock requests across available lock managers. This hierarchical approach ensures both transaction-level and lock-level requests are efficiently routed, maximizing throughput and minimizing latency by avoiding bottlenecks and contention points within the system.
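A minimal sketch of lock placement follows. The paper's application-aware sharding uses profiled access patterns; here a plain hash is the default placement function and a small override table stands in for the profiling-derived decisions, so the node names and keys are hypothetical.

```python
import hashlib

NODES = ["cn0", "cn1", "cn2"]   # hypothetical compute nodes

def shard_for(lock_key, nodes=NODES):
    """Default placement: deterministic hash of the lock key."""
    h = int(hashlib.sha256(lock_key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

# Application-aware overrides (illustrative): profiling showed these
# keys are hot and frequently co-accessed, so their locks are pinned
# to distinct nodes to spread contention.
PINNED = {"warehouse:1": "cn0", "warehouse:2": "cn1"}

def place_lock(lock_key):
    """Two-step placement: profiled pins first, hash otherwise."""
    return PINNED.get(lock_key) or shard_for(lock_key)
```

The two-level load balancing described above would sit on top of this: a global balancer chooses which node runs a transaction, while each node's local balancer spreads incoming lock requests across its lock managers.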
Lotus employs a Version Table Cache to reduce latency during data retrieval by utilizing Cacheline Versioning. This cache stores metadata indicating the version of each cacheline currently held by a transaction. By checking the Version Table, the system can determine data validity without accessing main memory, avoiding costly read operations if a transaction already holds the most recent version of a cacheline. This optimization is particularly effective in scenarios with high read contention, as it significantly reduces the number of remote memory accesses and improves transaction throughput. The Version Table Cache is designed to be lightweight and efficiently managed, minimizing its overhead and maximizing its impact on overall system performance.

Demonstrating Real-World Gains: Benchmarking and Validation
A rigorous evaluation of Lotus utilized established industry benchmarks – TPCC, TATP, SmallBank, and KVS – to assess its performance capabilities. These tests, representing diverse transactional workloads, provided a standardized method for comparing Lotus against existing database architectures and disaggregated memory systems. By employing these well-recognized benchmarks, researchers ensured the validity and reproducibility of performance results, allowing for a clear demonstration of Lotus’s advancements in throughput and latency under realistic conditions. The selection of these benchmarks reflects a commitment to demonstrating Lotus’s effectiveness across a spectrum of transactional applications, solidifying its position as a high-performing database solution.
Evaluations reveal that Lotus delivers substantial gains in operational efficiency when contrasted with conventional architectures and competing disaggregated memory solutions. Performance metrics indicate a peak throughput increase of 2.1x, signifying the system’s capacity to process a significantly larger volume of transactions within a given timeframe. Complementing this heightened throughput is a marked reduction in latency – up to 49.4% lower – which translates to faster response times and an improved user experience. These improvements aren’t merely theoretical; they represent a demonstrable advancement in system responsiveness and scalability, suggesting Lotus can effectively handle demanding workloads with greater speed and efficiency than existing alternatives.
Performance evaluations reveal Lotus significantly enhances transaction processing across several industry-standard benchmarks. The system demonstrates a 1.3x throughput increase on TATP, a telecommunications benchmark dominated by short, read-heavy transactions. Lotus further achieves a 1.5x throughput improvement on TPCC, a broad measure of online transaction processing performance, and a remarkable 2.1x increase on SmallBank, which simulates simple online banking operations. These gains suggest Lotus is particularly well suited to demanding transactional workloads and can deliver considerable performance benefits over existing systems.
Performance validation shows Lotus also reduces transaction latency across the same benchmarks. Measurements indicate a 36.7% decrease in P50 latency on TATP, a 5.2% reduction on TPCC, and a particularly significant 49.4% decrease on SmallBank, highlighting the system's efficiency on simple but high-volume operations. Together these reductions suggest Lotus offers a markedly more responsive experience for applications that depend on rapid data access and transaction completion.
Evaluations of the Lotus system demonstrate a compelling ability to scale performance alongside increased workload demands. As the number of nodes within the distributed system grows, and as the volume of concurrent transactions rises, Lotus consistently maintains high throughput levels. This scalability isn’t merely theoretical; rigorous testing reveals that the system doesn’t suffer the performance bottlenecks commonly seen in traditional architectures when faced with expanding datasets and user activity. The architecture’s design effectively distributes processing and memory access, preventing single points of contention and ensuring resources are utilized efficiently, even under peak loads. This sustained performance is critical for applications requiring consistent responsiveness and the capacity to handle growing user bases, solidifying Lotus as a robust and future-proof solution.
To ensure continuous operation and data integrity, Lotus integrates a Lease-Based Membership Service, a crucial component for robust fault tolerance and system stability. This service dynamically manages cluster membership, granting temporary “leases” to nodes that allow them to participate in data operations. Should a node fail or become unresponsive, its lease expires, automatically revoking its privileges and triggering a swift reassignment of responsibilities to healthy nodes. This proactive approach minimizes disruption and prevents failed nodes from impacting overall system performance, guaranteeing consistent and reliable operation even in the face of hardware failures or network instability. The Lease-Based Membership Service operates independently of application logic, providing a foundational layer of resilience that safeguards data consistency and maximizes uptime for critical operations.
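The core of such a service fits in a few lines: a node is a member only while its lease is unexpired, so a crashed or partitioned node that stops renewing drops out automatically. The sketch below is illustrative (the API is not from the paper) and takes an injectable clock so expiry can be demonstrated deterministically.

```python
import time

class LeaseMembership:
    """Sketch of a lease-based membership service (illustrative API)."""

    def __init__(self, lease_secs, clock=time.monotonic):
        self.lease_secs = lease_secs
        self.clock = clock
        self.expiry = {}   # node -> absolute lease expiry time

    def renew(self, node):
        """A live node calls this periodically to extend its lease."""
        self.expiry[node] = self.clock() + self.lease_secs

    def members(self):
        """Current membership: nodes whose leases have not lapsed."""
        now = self.clock()
        return {n for n, t in self.expiry.items() if t > now}
```

Because membership is derived from lease expiry rather than explicit failure reports, no unreachable node can linger in the cluster longer than one lease period, which is what bounds the window during which a failed node could still be assigned work.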

Looking Ahead: Expanding the Horizon for Disaggregated Systems
Current development efforts are heavily invested in refining the lock-first protocol within Lotus, aiming to minimize contention and maximize throughput under high concurrency. Researchers are also exploring the integration of advanced concurrency control mechanisms, notably multi-version concurrency control (MVCC) as realized in systems such as Motor, which promises to reduce locking overhead through optimistic execution: transactions proceed without acquiring locks and are validated for conflicts only at commit time. By strategically combining the simplicity of lock-first execution with the performance benefits of MVCC, the system anticipates substantial improvements in scalability and responsiveness, particularly for read-heavy workloads. Further optimization includes adaptive locking strategies and fine-grained lock management that tailor concurrency control to specific data access patterns and workload characteristics.
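For contrast with the lock-first path, the essence of validate-at-commit optimism is sketched below: the transaction records the version of everything it read, and commit succeeds only if none of those versions changed. This is a generic sketch of the technique, not the Motor or Lotus protocol.

```python
def commit(store, versions, read_set, write_set):
    """Optimistic commit (illustrative): validate reads, then install writes.

    - store:     key -> value (stand-in for shared data)
    - versions:  key -> version counter
    - read_set:  key -> version the transaction observed while running
    - write_set: key -> new value to install on success
    Returns True on commit, False on conflict (caller retries).
    """
    # Validation phase: abort if any read is stale.
    for key, seen_ver in read_set.items():
        if versions.get(key, 0) != seen_ver:
            return False
    # Write phase: install updates and bump versions.
    for key, value in write_set.items():
        store[key] = value
        versions[key] = versions.get(key, 0) + 1
    return True
```

The trade-off the section describes falls directly out of this shape: no locks are held during execution, so reads are cheap, but under contention the validation step fails and the whole transaction's work is discarded and redone.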
Future performance gains for systems like Lotus are increasingly reliant on intelligent memory management and hardware specialization. Current research investigates deviating from traditional, linear memory access, exploring techniques like sharding and prefetching to reduce latency and maximize throughput. Simultaneously, integration with emerging hardware accelerators – including GPUs, FPGAs, and specialized processing units – offers the potential to offload computationally intensive tasks from the CPU, drastically improving processing speeds. This synergistic approach, combining novel memory access patterns with hardware acceleration, could unlock significant performance improvements, allowing these systems to scale efficiently and handle increasingly complex workloads. The exploration of near-memory processing, where computation is performed directly within or very close to the memory modules, represents a particularly promising avenue for future development.
The inherent flexibility of Lotus extends beyond its current implementation, presenting a clear pathway toward supporting a wider spectrum of data models and application needs. Current development anticipates modular adaptations that would allow Lotus to efficiently manage graph databases, time-series data, and even complex document structures, rather than being limited to key-value storage. This expansion isn’t simply about accommodating different types of data, but also tailoring concurrency control and consistency guarantees to specific application requirements – offering relaxed consistency for read-heavy workloads, or stronger guarantees where data integrity is paramount. Ultimately, this adaptability promises to transform Lotus from a specialized system into a versatile foundation for building a diverse range of distributed applications, dramatically increasing its potential impact and usability.
The foundational principles underpinning Lotus – specifically its lock-first protocol and emphasis on streamlined transaction management – extend far beyond the initial scope of its design. These concepts are readily adaptable to a diverse array of distributed systems, including those managing financial transactions, coordinating sensor networks, or powering collaborative editing platforms. Researchers envision applying Lotus’s core tenets to improve the efficiency and reliability of blockchain technologies, enhance the scalability of cloud databases, and even optimize the performance of real-time gaming servers. By prioritizing concurrency and minimizing contention, the strategies employed within Lotus offer a versatile toolkit for building robust and scalable distributed applications across numerous domains, fostering a promising avenue for continued innovation in the field of distributed computing.

The pursuit of disaggregated memory, as explored in Lotus, invariably introduces complexity. One anticipates the elegant diagrams detailing lock disaggregation and RDMA-optimized transactions will, inevitably, become tangled in the realities of production. As Grace Hopper observed, “It’s easier to ask forgiveness than it is to get permission.” This sentiment rings particularly true when considering the system’s fault tolerance mechanisms; the theoretical benefits of decoupling locks from data are quickly overshadowed by the practical challenges of maintaining consistency across a distributed system when components fail. Lotus, while promising, will undoubtedly face a similar evolution, proving that even the most innovative designs are merely temporary reprieves from the relentless march of technical debt.
What’s Next?
The decoupling of locks and data, as demonstrated by Lotus, offers a predictable, if incremental, gain. Anyone who’s spent time in production will immediately recognize this as a shift of complexity, not a reduction. Network congestion is alleviated only to be replaced by potential hotspots within the compute nodes themselves. The paper carefully avoids addressing the operational realities of managing disaggregated memory – things like node failures, persistent storage guarantees, and the inevitable data corruption that occurs when hardware decides to be uncooperative. These are not theoretical limitations; they are simply the cost of doing business.
Future work will undoubtedly focus on scaling these disaggregated lock managers, and increasingly complex concurrency control schemes will be proposed. The pursuit of performance often leads to elegant architectures, but it’s worth remembering that ‘elegant’ rarely survives contact with actual workloads. The real challenge lies not in achieving peak throughput on a benchmark, but in maintaining acceptable latency under sustained, unpredictable load.
It’s also likely that the benefits of lock disaggregation will diminish as network technology improves. The paper rightly points to RDMA as an enabler, but the relentless march of bandwidth and decreasing latency will eventually erode the advantages. This isn’t a criticism; it’s simply an observation. Every ‘revolution’ becomes tomorrow’s tech debt, and if this code looks perfect, no one has deployed it yet.
Original article: https://arxiv.org/pdf/2512.16136.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2025-12-21 06:53