Beyond Orchestration: Building Resilient Microservices with Choreography

Author: Denis Avetisyan

A new runtime system, Accompanist, leverages decentralized sagas to improve the performance and scalability of microservice architectures.

Accompanist is a runtime for resilient choreographic programming that enables fault tolerance and decentralization in distributed systems.

While orchestrators simplify fault recovery in distributed microservices, they introduce centralization challenges and are not always feasible; this paper presents ‘Accompanist: A Runtime for Resilient Choreographic Programming’, a system enabling decentralized sagas through a novel runtime. Accompanist allows programmers to implement resilient transactions as choreographic programs deployed alongside existing services, leveraging determinism, idempotency, and durable messaging. By co-designing the programming interface with a replay-based runtime, Accompanist achieves correctness-by-construction without compiler modifications. But can this approach offer comparable scalability and performance to established orchestration techniques in complex distributed environments?

The Challenge of Distributed Complexity

Contemporary software development increasingly favors a microservice architecture, wherein applications are constructed as collections of independently deployable services. This approach, while offering benefits like scalability and resilience, inherently introduces the challenge of coordinating these disparate components. Each service operates autonomously, potentially utilizing different technologies and possessing varying levels of availability, necessitating a robust system for inter-service communication and workflow management. The demand for such coordination stems from the need to execute complex business processes that span multiple services, requiring reliable message passing, error handling, and transaction management across a distributed system. Effectively managing this complexity is crucial for realizing the full potential of microservices and delivering responsive, dependable applications.

Conventional orchestration systems, despite their proven ability to manage complex processes, inherently introduce bottlenecks due to their centralized nature. These systems typically rely on a single coordinator to dictate the flow of work between various microservices, creating a single point of failure and a source of increased latency. Each task completion and subsequent instruction must pass through this central control point, adding communication overhead and delaying overall process execution. While offering a clear and manageable workflow, this centralized approach struggles to scale efficiently with a growing number of services or increasing workloads, potentially hindering responsiveness and creating performance limitations in dynamic, distributed environments. The complexity associated with maintaining and troubleshooting a central orchestrator also adds to the operational burden of managing modern, microservice-based applications.

While platforms like Temporal have demonstrated considerable success in managing complex microservice-based workflows, a continuing need exists for alternative approaches that prioritize reduced operational overhead and heightened responsiveness. Temporal’s robust features, though valuable, introduce a degree of complexity that can impact latency-sensitive applications. Researchers are actively exploring methods to distribute workflow execution more effectively, leveraging techniques such as serverless functions and event-driven architectures to minimize centralized control and enhance scalability. The goal is to achieve comparable reliability and fault tolerance to Temporal, but with a lighter footprint and faster execution times, ultimately enabling a wider range of applications to benefit from the advantages of microservice architectures without sacrificing performance.

Embracing Decentralization: The Power of Choreography

Choreography-based integration represents a departure from centralized orchestration by eliminating the need for a dedicated central controller. In this pattern, services operate independently and react to events published by other services. Each service listens for events relevant to its business logic and autonomously performs actions or publishes new events, fostering direct collaboration. This distributed approach promotes loose coupling and increased resilience, as the failure of one service does not necessarily cascade to others; however, it shifts the complexity of managing interactions from a central point to the individual services and necessitates robust event handling and monitoring capabilities.
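To make the pattern concrete, here is a minimal in-process sketch in Python. The `EventBus` class, the event names (`order.placed`, `payment.captured`, `shipment.scheduled`), and the two service handlers are hypothetical illustrations, not part of Accompanist or of any real broker's API; a production system would use a durable broker rather than an in-memory queue.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[[dict], None]


class EventBus:
    """Toy in-memory publish/subscribe bus (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)
        self._queue: Deque[Tuple[str, dict]] = deque()
        self.log: List[str] = []  # event names in publication order

    def subscribe(self, event_name: str, handler: Handler) -> None:
        self._subscribers[event_name].append(handler)

    def publish(self, event_name: str, payload: dict) -> None:
        self.log.append(event_name)
        self._queue.append((event_name, payload))

    def run(self) -> None:
        # Deliver queued events until no service publishes anything new.
        while self._queue:
            event_name, payload = self._queue.popleft()
            for handler in list(self._subscribers[event_name]):
                handler(payload)


def wire_services(bus: EventBus) -> None:
    # Each "service" knows only the events it consumes and emits;
    # no single component holds the whole workflow.
    def payment_service(order: dict) -> None:
        bus.publish("payment.captured", {"order_id": order["order_id"]})

    def shipping_service(payment: dict) -> None:
        bus.publish("shipment.scheduled", {"order_id": payment["order_id"]})

    bus.subscribe("order.placed", payment_service)
    bus.subscribe("payment.captured", shipping_service)


bus = EventBus()
wire_services(bus)
bus.publish("order.placed", {"order_id": 42})
bus.run()
print(bus.log)  # ['order.placed', 'payment.captured', 'shipment.scheduled']
```

Note that the order of the workflow exists only implicitly, in the subscriptions. That is the trade-off the paragraph above describes: no central controller, but also no single place to read the process from.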

Effective implementation of Choreography relies on technologies designed for event-driven architectures. These systems necessitate languages and frameworks capable of handling asynchronous communication and event processing. Choral is one such language: a choreographic programming language built on Java, in which a single program describes the interactions among all participants and is compiled into a separate implementation for each of them. Beyond Choral, technologies like Kafka, RabbitMQ, and cloud-native event buses are commonly employed to build the underlying event infrastructure. The selection of these tools should consider factors such as scalability, fault tolerance, and support for event schema evolution to ensure the long-term maintainability of the Choreography-based system.

Choreography, while offering benefits in decentralized systems, introduces challenges to data consistency and reliability. Because services operate independently and react to events, maintaining transactional integrity across multiple services requires specific architectural patterns. The Saga Pattern addresses this by breaking down a single transaction into a sequence of local transactions, each performed by a single service. If one local transaction fails, the Saga executes a series of compensating transactions to undo the changes made by prior transactions, ensuring eventual consistency. Implementing Sagas necessitates careful design of compensating transactions and handling of potential failures during the compensation process, often involving mechanisms for idempotency and retries.
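The compensation logic can be sketched in a few lines of Python. This is a single-process illustration of the Saga Pattern in general, not Accompanist's decentralized implementation; the step names, the `state` dictionary, and the `run_saga` helper are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run local transactions in order; on failure, compensate in reverse order."""
    trace: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            trace.append(f"{name}:failed")
            # Undo every step that already committed, newest first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                trace.append(f"{done_name}:compensated")
            return trace
        completed.append(step)
        trace.append(f"{name}:done")
    return trace


state: Dict[str, int] = {"stock": 5, "charged": 0}


def reserve_stock() -> None:
    state["stock"] -= 1


def release_stock() -> None:
    state["stock"] += 1


def charge_card() -> None:
    state["charged"] += 10


def refund_card() -> None:
    state["charged"] -= 10


def book_shipping() -> None:
    raise RuntimeError("carrier unavailable")  # simulated failure


def cancel_shipping() -> None:
    pass  # nothing to undo: booking never succeeded


trace = run_saga([
    ("reserve_stock", reserve_stock, release_stock),
    ("charge_card", charge_card, refund_card),
    ("book_shipping", book_shipping, cancel_shipping),
])
print(trace)
print(state)  # {'stock': 5, 'charged': 0}
```

The sketch assumes compensations always succeed. In practice a compensation can fail too, which is why the paragraph above pairs sagas with idempotency and retries.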

Accompanist: A Hybrid Approach to Workflow Efficiency

Accompanist functions as a runtime environment designed to transition applications from traditional, high-latency orchestration to more efficient choreography patterns in an incremental manner. This is achieved through a hybrid approach, allowing existing orchestration-based systems to adopt choreographic elements without requiring a complete rewrite. Accompanist doesn’t mandate an all-or-nothing switch; developers can selectively implement choreographic workflows alongside existing orchestration, progressively optimizing for performance. This staged adoption minimizes disruption and risk while capitalizing on the benefits of choreography, such as reduced latency and increased scalability, particularly in distributed systems.

Accompanist utilizes Virtual Threads, a lightweight concurrency model, to facilitate the parallel execution of choreographic workflows. Unlike traditional thread management which incurs significant overhead for each concurrent operation, Virtual Threads minimize this overhead, allowing Accompanist to manage a substantially larger number of concurrent tasks with reduced resource consumption. This approach is particularly beneficial in choreographic systems, where workflows often involve numerous independent, short-lived operations that can be executed concurrently. By enabling high levels of concurrency with minimal resource impact, Virtual Threads contribute directly to Accompanist’s ability to reduce end-to-end latency and improve overall system throughput.

Accompanist’s Sidecar implementation addresses performance bottlenecks by deploying a lightweight runtime alongside each service, thereby minimizing network latency typically associated with centralized orchestration. This approach facilitates direct, local communication between services within a choreography, improving data locality and reducing the need for serialization and deserialization across network boundaries. By processing requests and coordinating state transitions closer to the data source, the Sidecar design significantly lowers round-trip times and enhances the overall responsiveness of distributed workflows. The implementation avoids a single point of failure and enables independent scaling of individual service components, contributing to increased system resilience and efficiency.

Accompanist’s performance benefits have been validated through case studies utilizing complex, distributed system simulations. Specifically, the Online Boutique application, when fully deployed across multiple availability zones, exhibited up to a 32% reduction in end-to-end latency when utilizing Accompanist. This improvement demonstrates Accompanist’s ability to mitigate performance degradation commonly experienced in geographically distributed microservice architectures. The Warehouse Saga also served as a test case, further illustrating Accompanist’s efficacy in coordinating complex, multi-service transactions.

Performance benchmarking of Accompanist using a hotel reservation workflow demonstrates significant latency reductions compared to traditional orchestration methods. Results indicate a maximum improvement of 55% in overall response time. More critically, Accompanist achieves a 6.7x improvement in 99th percentile latency, signifying a substantial decrease in the time taken to handle the slowest 1% of requests. This improvement is particularly impactful for user-facing applications where responsiveness under peak load is crucial, and demonstrates Accompanist’s ability to deliver consistently low latency even with high request volumes.
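For readers less familiar with tail-latency metrics, the following Python sketch shows how a median and a 99th percentile are computed from raw samples, and why the two can diverge sharply. The latency values are synthetic and chosen only for illustration; they are not data from the paper. The `percentile` helper uses the nearest-rank definition, which is one of several common conventions.

```python
import math
from statistics import median
from typing import List


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 98 fast requests and 2 slow outliers.
latencies_ms = [20.0] * 98 + [400.0] * 2

print(median(latencies_ms))          # 20.0
print(percentile(latencies_ms, 99))  # 400.0
```

Here the median hides the outliers entirely while the 99th percentile is dominated by them, which is why the two figures are reported separately.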

Benchmarking using the Warehouse Saga demonstrates Accompanist’s performance advantages over Temporal. Accompanist achieved a median response time of 32 milliseconds, representing a 5.9x improvement compared to Temporal’s performance in the same benchmark. Additionally, Accompanist exhibited a 6.7x reduction in 99th percentile latency, indicating significantly improved performance under high load conditions when compared to Temporal’s observed 99th percentile latency during the Warehouse Saga test.

Towards Reactive and Resilient Distributed Systems

Accompanist’s architectural emphasis on data locality and streamlined choreography presents a compelling pathway towards constructing distributed systems capable of both rapid response and robust resilience. By prioritizing the placement of computation close to the data it processes, and by coordinating the flow of messages directly between services, the framework minimizes latency and network congestion, two key factors in achieving high reactivity. This deliberate design not only accelerates processing but also inherently improves fault tolerance; if one service encounters an issue, the impact is localized, and the system can dynamically adapt by leveraging data readily available to other components. Consequently, Accompanist fosters a more dependable and performant distributed environment, offering a significant advantage in applications demanding real-time responsiveness and continuous operation, such as financial trading platforms or industrial control systems.

Distributed systems built on choreographic principles benefit significantly from the integration of checkpointing mechanisms, and tools like Choral streamline this process to guarantee both fault tolerance and data consistency. Checkpointing periodically saves the state of a distributed computation, allowing the system to recover from failures without restarting from scratch. When combined with choreography – where services communicate through event-driven interactions – checkpointing enables resilient recovery even if individual services fail mid-operation. Choral facilitates this by automating the capture of consistent snapshots across multiple services, ensuring that when a failure occurs, the system can revert to a known good state and continue processing without data corruption or loss. This proactive approach to failure mitigation is crucial for building highly available and dependable distributed applications, particularly those handling sensitive or critical data.
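One common way to realize this kind of recovery is to journal the result of each completed step and, after a crash, re-run the workflow so that finished steps are read back from the journal instead of being executed again. The Python sketch below illustrates that general idea under stated assumptions: the `Journal` class is an in-memory stand-in for durable storage, and the step names and return values are hypothetical. It is not the mechanism used by Choral or Accompanist.

```python
from typing import Callable, Dict, List


class Journal:
    """Stand-in for durable storage; a real system would persist entries to disk or a log."""

    def __init__(self) -> None:
        self.entries: Dict[str, int] = {}

    def step(self, key: str, effect: Callable[[], int]) -> int:
        # If the step completed before a crash, reuse its recorded result.
        if key in self.entries:
            return self.entries[key]
        result = effect()
        self.entries[key] = result  # checkpoint the result before moving on
        return result


calls: List[str] = []  # records which side effects actually executed


def workflow(journal: Journal, crash_after_reserve: bool) -> int:
    def reserve() -> int:
        calls.append("reserve")
        return 7  # hypothetical reservation ID

    reservation_id = journal.step("reserve", reserve)
    if crash_after_reserve:
        raise RuntimeError("simulated crash")

    def charge() -> int:
        calls.append("charge")
        return reservation_id * 10  # hypothetical receipt number

    return journal.step("charge", charge)


journal = Journal()
try:
    workflow(journal, crash_after_reserve=True)
except RuntimeError:
    pass  # the process "restarts" below with the same journal

receipt = workflow(journal, crash_after_reserve=False)
print(receipt, calls)  # 70 ['reserve', 'charge']
```

The reservation runs only once even though the workflow runs twice. This works only if the workflow is deterministic, that is, it requests the same steps in the same order on every run.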

In choreographic systems, where interactions unfold through message passing without central orchestration, idempotency emerges as a cornerstone of reliability. This principle dictates that an operation can be executed multiple times without altering the system’s state beyond the initial execution – essentially, repeating the action has the same effect as performing it once. This is critically important because message delivery isn’t always guaranteed; network issues or system failures can lead to messages being duplicated or replayed. By designing operations to be idempotent, the system can safely retry messages without the risk of unintended consequences, such as double-charging a customer or processing an order twice. This inherent resilience simplifies error handling and significantly enhances the overall robustness of distributed systems built on choreographic principles, ensuring consistent and predictable behavior even in the face of failures.
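A minimal Python sketch of one way to make a handler idempotent is to remember the IDs of messages already applied and ignore repeats. The `PaymentService` class, its method names, and the message IDs are hypothetical and exist only for this example.

```python
from typing import Dict, Set


class PaymentService:
    """Hypothetical service that applies each charge at most once per message ID."""

    def __init__(self) -> None:
        self.balances: Dict[str, int] = {}
        self._seen: Set[str] = set()  # IDs of messages already applied

    def handle_charge(self, message_id: str, customer: str, amount: int) -> bool:
        """Apply the charge; return False if this message was a duplicate."""
        if message_id in self._seen:
            return False  # duplicate delivery: state is left unchanged
        self.balances[customer] = self.balances.get(customer, 0) + amount
        self._seen.add(message_id)
        return True


service = PaymentService()
first = service.handle_charge("msg-1", "alice", 30)
retry = service.handle_charge("msg-1", "alice", 30)  # same message delivered again
print(first, retry, service.balances)  # True False {'alice': 30}
```

In a real service the set of seen IDs and the balance update would need to be written in the same durable transaction; otherwise a crash between the two writes could still produce a double charge.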

Accompanist’s architecture is uniquely positioned to capitalize on the emerging paradigm of In-Network Computing. By distributing processing capabilities directly within the network infrastructure – closer to the origin of data – latency can be dramatically reduced, exceeding the limitations of traditional client-server models. This extension envisions Accompanist not merely as a choreography engine, but as a facilitator of data-centric workflows where computations occur in transit, minimizing data movement and maximizing throughput. Such an approach promises significant benefits for real-time applications, edge computing scenarios, and large-scale data analytics, enabling systems to respond with unprecedented speed and efficiency. The framework’s focus on data locality and efficient choreography provides a strong foundation for intelligently partitioning workloads and directing computations to optimal locations within the network fabric.

The design of Accompanist, as detailed in the paper, prioritizes a shift away from centralized orchestration towards decentralized sagas. This approach inherently acknowledges the volatility of distributed systems and aims for resilience through choreography. It aligns directly with Tim Berners-Lee’s assertion: “The web is more a social creation than a technical one.” Accompanist isn’t merely a technical solution; it’s a system designed to facilitate interaction and maintain functionality even amidst component failures, mirroring the web’s original intent as a collaborative, robust medium. The system’s focus on minimizing central control points is a deliberate simplification, reflecting an understanding that unnecessary complexity introduces fragility.

Further Refinements

The pursuit of decentralized systems invariably reveals the stubborn persistence of centralization’s appeal. Accompanist offers a step toward choreographic programming’s potential, yet the challenges of observability and debugging in such architectures remain stark. The illusion of simplicity – of removing the central orchestrator – does not diminish the inherent complexity of managing distributed state and handling inevitable failures. Future work must address these practical concerns, moving beyond benchmark demonstrations toward robust, real-world deployments.

A critical, often overlooked, aspect is the human cost of decentralization. While systems may scale elegantly, the cognitive load on developers tasked with reasoning about sagas – sequences of events dispersed across services – is substantial. The field should investigate tools and methodologies that alleviate this burden, perhaps through more expressive saga definitions or automated verification techniques. True progress is not merely about building more complex systems; it is about building systems that are easier to understand.

Ultimately, the value proposition of choreographic programming hinges on its ability to adapt to change. Microservices, despite their promise of agility, often become brittle and tightly coupled. Accompanist’s contribution lies in offering a framework for building systems that are genuinely resilient, not merely fault-tolerant. The next step is to explore how this framework can support continuous evolution, allowing services to be added, removed, and modified without disrupting the overall system.

Original article: https://arxiv.org/pdf/2603.20942.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
