Author: Denis Avetisyan
A new architecture leverages the power of event sourcing to bring greater predictability and auditability to software developed with large language models.
This paper introduces ESAA, an event-sourced architecture utilizing Command Query Responsibility Segregation (CQRS) to ensure state reproducibility and deterministic replay for LLM-based software engineering and multi-agent systems.
While Large Language Models (LLMs) increasingly power autonomous agents for software engineering, limitations in state management and deterministic execution hinder reliable, auditable systems. This paper introduces ESAA (Event Sourcing for Autonomous Agents), an architecture inspired by the Event Sourcing pattern and Command Query Responsibility Segregation (CQRS) principles to address these challenges. ESAA separates agent intention from state mutation, ensuring state reproducibility via an immutable event log and verifiable materialized views. Does this approach unlock the potential for truly governed and scalable LLM-driven software development workflows?
Deconstructing the Software Stack: Limits and Levers
Contemporary software development is characterized by an unprecedented rate of change and escalating system intricacy. This rapid velocity, driven by evolving user expectations and the demand for constant innovation, places immense strain on traditional, sequential methodologies. Projects once measured in months now require iterative delivery within weeks, or even days, demanding a level of responsiveness that established processes struggle to accommodate. The sheer scale of modern applications – often comprising millions of lines of code and intricate dependencies – further exacerbates the challenges, making comprehensive planning and predictable execution increasingly difficult. Consequently, teams frequently encounter delays, budget overruns, and a growing risk of delivering software that fails to meet rapidly changing requirements, prompting a search for more agile and adaptive approaches.
The longstanding reliance on manual processes within software development frequently introduces critical bottlenecks, particularly as projects scale in complexity. Traditional testing methodologies, often characterized as "brittle" due to their sensitivity to even minor code changes, struggle to keep pace with rapid iteration cycles. This fragility means that tests require constant updating, consuming valuable development time and frequently failing to comprehensively validate the system's behavior under diverse conditions. Consequently, undetected errors can slip through the cracks, manifesting as runtime failures, security vulnerabilities, or diminished user experience – problems that become exponentially more challenging and costly to resolve once the software is deployed. The inherent limitations of these approaches highlight the need for automated, resilient testing frameworks capable of adapting to change and ensuring software quality throughout the development lifecycle.
Contemporary software development increasingly necessitates systems engineered for continuous change, pushing beyond the limitations of traditionally rigid methodologies. The escalating pace of innovation and user feedback cycles demand an architecture prioritizing adaptability and rapid iteration, not merely initial stability. This requires a fundamental shift towards resilient systems – those capable of absorbing unexpected inputs, gracefully degrading under stress, and evolving without catastrophic failure. Such systems aren't built on exhaustive upfront planning and monolithic codebases, but rather on modular designs, automated testing, and continuous integration/continuous delivery pipelines. This proactive approach minimizes the impact of inevitable changes, allowing software to not only respond to new demands but also learn and improve over time, ensuring sustained functionality and user satisfaction in a dynamic environment.
LLMs: Re-Engineering the Code Generation Process
Large Language Models (LLMs) demonstrate the capacity to automate tasks across multiple stages of the software development lifecycle, including code generation, bug detection, documentation, and testing. Specifically, LLMs can synthesize code from natural language descriptions, identify potential vulnerabilities in existing codebases, automatically generate API documentation, and create unit tests based on code functionality. While not fully autonomous, these capabilities allow developers to offload repetitive or complex tasks, potentially increasing development speed and reducing associated costs. Current implementations show promising results in generating boilerplate code, refactoring existing systems, and assisting with code completion, although validation and human oversight remain crucial for ensuring code quality and security.
Reliable code generation with Large Language Models (LLMs) is contingent upon controlled prompting and structured output formats. LLMs, while capable of producing syntactically correct code, frequently generate solutions that lack semantic correctness or fail to meet specific functional requirements without precise guidance. Strategies such as few-shot learning, where the LLM is provided with example input-output pairs, and the specification of detailed code schemas – including required functions, data structures, and error handling – are crucial for improving result accuracy. Furthermore, techniques like constraining the LLM's output to a specific programming paradigm or utilizing formal verification methods post-generation can enhance the dependability of the generated code and reduce the incidence of bugs or vulnerabilities.
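To make the few-shot idea concrete, here is a minimal sketch of a prompt builder that pairs example input-output pairs with an explicit output contract. The example tasks, the `build_prompt` helper, and the instruction wording are all illustrative assumptions, not drawn from the paper:

```python
# Minimal sketch of controlled prompting: few-shot examples plus an explicit
# output format. All tasks, names, and wording here are hypothetical.

FEW_SHOT_EXAMPLES = [
    {"task": "Return the larger of two ints",
     "code": "def max2(a: int, b: int) -> int:\n    return a if a >= b else b"},
    {"task": "Reverse a string",
     "code": "def reverse(s: str) -> str:\n    return s[::-1]"},
]

def build_prompt(task: str) -> str:
    """Assemble a prompt that pairs worked examples with a strict format."""
    parts = ["You are a code generator. Respond with Python code only."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Task: {ex['task']}\nCode:\n{ex['code']}")
    parts.append(f"Task: {task}\nCode:")   # the new task, same shape as examples
    return "\n\n".join(parts)

prompt = build_prompt("Sum a list of ints")
assert prompt.count("Task:") == 3   # two examples plus the new task
```

The point of the fixed shape is that the model's completion can then be checked mechanically against the same contract before being accepted.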
Successful integration of Large Language Models (LLMs) into software engineering workflows requires frameworks designed for multi-agent coordination. These frameworks must facilitate the division of complex tasks into smaller, manageable sub-tasks assigned to individual LLM agents. Crucially, they need mechanisms for inter-agent communication, allowing agents to share intermediate results and dependencies. Furthermore, effective frameworks incorporate strategies for conflict resolution when multiple agents propose competing solutions, and a centralized orchestration layer is necessary to monitor progress, handle failures, and ensure the overall system adheres to defined constraints and quality standards. This coordinated approach is essential for reliable code generation, testing, and debugging beyond the capabilities of single-agent LLM interactions.
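The coordination pattern described above can be sketched as a toy orchestration loop. The planner/worker split, the agent stubs, and the failure check are illustrative assumptions only, not the API of AutoGen, MetaGPT, or any other framework:

```python
# Toy orchestration layer: split a task into sub-tasks, dispatch each to a
# worker agent, and verify the result. All behavior here is stubbed.

def planner_agent(task: str) -> list[str]:
    """Stub: decompose a task into sub-tasks (an LLM call in a real system)."""
    return [f"{task}: design", f"{task}: implement", f"{task}: test"]

def worker_agent(subtask: str) -> str:
    """Stub: 'solve' a sub-task and report back in an agreed format."""
    return f"done({subtask})"

def orchestrate(task: str) -> list[str]:
    """Centralized loop: monitor progress and reject non-conforming output."""
    results = []
    for sub in planner_agent(task):
        out = worker_agent(sub)
        if not out.startswith("done("):   # crude stand-in for a quality gate
            raise RuntimeError(f"sub-task failed: {sub}")
        results.append(out)
    return results

assert len(orchestrate("build login page")) == 3
```

A real framework adds what this sketch omits: inter-agent messaging, conflict resolution between competing proposals, and retry/escalation on failure.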
ESAA: Architecting Resilience Through Eventual Consistency
The Event Sourcing for Autonomous Agents (ESAA) architecture pairs the Event Sourcing pattern with Command Query Responsibility Segregation (CQRS) to enhance the resilience of Large Language Model (LLM)-powered applications by decoupling state management from application logic. Traditional state-based systems are vulnerable to data loss or corruption; ESAA mitigates this by persisting all changes to the application's state as a sequence of immutable events in an Event Store. This event log serves as the single source of truth, allowing for the reconstruction of past states and enabling deterministic replay for debugging, auditing, and recovery from failures. By separating read (Query) and write (Command) operations, ESAA optimizes performance and scalability, allowing each side to be independently tailored to its specific needs. This architectural pattern facilitates building LLM agents capable of handling complex interactions and maintaining consistent behavior even in the face of errors or external disruptions.
The Event Store in an Event Sourcing architecture functions as a sequentially ordered, append-only log of all state changes. Each change is recorded as an immutable event, containing the data representing what happened, rather than how the system's state was modified. This immutability is critical; events are never updated or deleted, ensuring a complete historical record. Deterministic replay is achieved by replaying these events from the beginning, applying them in order to reconstruct any past state of the system. Auditability is a direct consequence, as the Event Store provides a verifiable, chronological trail of all actions, facilitating debugging, compliance reporting, and forensic analysis. The system's state at any point in time can therefore be reliably reconstructed and validated against the recorded event stream.
Canonical Artifacts within the ESAA architecture establish a consistent and predictable system state by defining specific, agreed-upon representations of data and interactions. Materialized Views represent derived data, pre-calculated and stored for efficient querying, ensuring a consistent interpretation of events. Boundary Contracts define the explicit interfaces between components – specifically, the expected input and output formats – thereby reducing integration issues and preventing unexpected behavior caused by differing data interpretations. These artifacts, when rigorously maintained, guarantee that all components operate on a shared understanding of data structures and permissible interactions, leading to increased system reliability and reduced debugging complexity.
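A materialized view can be sketched as a pure fold over the event log, kept entirely separate from the write path; the "TaskCompleted" event shape here is hypothetical:

```python
# Toy materialized view: a query-optimized read model derived from events.
# The event shape and the projection are illustrative assumptions.

events = [
    {"kind": "TaskCompleted", "agent": "planner"},
    {"kind": "TaskCompleted", "agent": "coder"},
    {"kind": "TaskCompleted", "agent": "coder"},
]

def project_completed_per_agent(event_log: list[dict]) -> dict[str, int]:
    """Fold the log into a per-agent completion count; recomputable at will."""
    view: dict[str, int] = {}
    for ev in event_log:
        if ev["kind"] == "TaskCompleted":
            view[ev["agent"]] = view.get(ev["agent"], 0) + 1
    return view

view = project_completed_per_agent(events)
assert view == {"planner": 1, "coder": 2}
```

Since the view is derived rather than authoritative, it can be dropped and rebuilt from the log at any time, and any discrepancy between view and log signals a projection bug rather than data loss.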
Canonicalization and JSON Schema are essential components for ensuring data integrity and predictable behavior within LLM agent systems. Canonicalization involves transforming diverse data inputs into a standardized, consistent format before processing, mitigating issues arising from variations in phrasing or structure. JSON Schema then provides a contract for validating this canonicalized data against a predefined structure, including data types, required fields, and allowable values. This validation process proactively identifies and rejects malformed or unexpected inputs, preventing unintended consequences or errors in downstream logic. By enforcing a strict data contract, JSON Schema minimizes the risk of LLM agents operating on invalid data, thereby increasing application reliability and reducing debugging complexity.
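A minimal sketch of the canonicalize-then-validate pipeline: a production system would use a real JSON Schema validator, and the required fields below are illustrative assumptions, not ESAA's actual contract.

```python
import json

# Canonicalize agent output, then check it against a contract before it can
# touch state. The hand-rolled check stands in for a JSON Schema validator;
# the field names are hypothetical.

REQUIRED = {"event": str, "path": str, "content": str}

def canonicalize(raw: str) -> str:
    """Normalize to a byte-stable form: parsed, key-sorted, re-serialized."""
    return json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))

def validate(raw: str) -> dict:
    """Reject malformed output proactively rather than failing downstream."""
    data = json.loads(canonicalize(raw))
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"contract violation on field {key!r}")
    return data

ok = validate('{"content": "print(1)", "event": "FileWritten", "path": "a.py"}')
assert ok["event"] == "FileWritten"

try:
    validate('{"event": "FileWritten"}')   # missing required fields
except ValueError:
    pass                                   # rejected at the boundary, as intended
```

Canonicalizing before validating means two differently formatted but semantically identical outputs hash and compare identically, which matters for deterministic replay.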
Validating the System: Benchmarking and Observability
AutoGen, MetaGPT, and LangGraph are software frameworks designed to simplify the construction of multi-agent systems, which involve coordinating multiple language model instances to achieve a common goal. These frameworks provide tools for defining agent roles, establishing communication protocols between agents, and managing the overall workflow of the system. AutoGen, for example, allows developers to specify agent configurations and define conversational exchanges. MetaGPT focuses on simulating a software company through agent interaction, while LangGraph offers a graph-based approach to managing agent flows and memory. By abstracting away the complexities of inter-agent communication and coordination, these frameworks enable developers to focus on defining agent behaviors and system objectives, thereby accelerating the development of complex, LLM-powered applications.
SWE-Bench is a benchmark designed to assess the capacity of agents to generate verifiable code patches. It provides a standardized and rigorous evaluation methodology, moving beyond simple code completion to focus on the correctness and reliability of the proposed solutions. The benchmark suite consists of a collection of programming problems with associated test cases, allowing for automated evaluation of agent-generated patches against established ground truth. Successful completion requires not only syntactically correct code, but also functional accuracy as determined by passing the provided tests, thereby offering a quantifiable metric for agent performance in code modification and repair tasks.
Evaluation of Event Sourcing for Autonomous Agents (ESAA) involved testing across two projects: a single-agent landing page and a more complex clinical dashboard system (CS2). During these tests, the ESAA implementation successfully completed a total of 50 individual tasks. Within the CS2 clinical dashboard project, the system progressed through 8 out of 15 defined phases, indicating partial completion of the more complex undertaking. These results demonstrate the architecture's capacity to manage and execute tasks within a multi-phase development lifecycle, albeit with varying degrees of success depending on project complexity.
During the clinical dashboard project (CS2), the implemented architecture generated 86 distinct events over a 15-hour period. These events represent a detailed log of system modifications and actions taken during development. The granularity of event capture allows for comprehensive traceability of changes, providing a complete history of the system's evolution throughout the project lifecycle. This level of detail facilitates debugging, auditing, and reproducibility of results, demonstrating the system's capability for detailed change management and version control within a complex application.
Analysis of two case studies – a single-agent landing page project and a clinical dashboard system – revealed a successful reduction in event vocabulary from an initial set of 15 event types to a consolidated set of 5, without compromising the system's ability to maintain complete traceability of actions. This simplification was achieved alongside a zero-rejection rate for system outputs in both evaluations. The absence of rejected outputs indicates adherence to predefined constraints and formal contracts governing the system's permissible outputs, validating the effectiveness of the implemented control mechanisms.
Toward a Future of Adaptive Systems: Embracing Change
Event-driven, service-oriented architectures, such as ESAA, represent a significant evolution in software design, moving away from monolithic structures towards systems built from independent, interacting components. This modularity is key to enhanced adaptability; changes to one service have minimal impact on others, allowing for rapid iteration and deployment of new features. Resilience is also dramatically improved, as the failure of a single service doesn't necessarily bring down the entire system – other services can continue functioning, and automated recovery mechanisms can restore the failed component. Critically, these architectures facilitate comprehensive audit trails; every interaction between services is logged, providing detailed insights into system behavior and simplifying debugging and security analysis. This inherent transparency, coupled with the ability to quickly respond to evolving needs, positions ESAA and similar approaches as foundational for building robust and trustworthy software in complex, dynamic environments.
Organizations increasingly face volatile environments demanding swift adaptation, and traditional software development often struggles to keep pace. Event-driven, service-oriented architectures, like ESAA, offer a solution by decoupling components and enabling independent updates and scaling. This flexibility dramatically reduces the time needed to respond to changing requirements, allowing businesses to iterate faster and seize new opportunities. Crucially, this architectural approach also inherently mitigates the risks associated with software failures; because services are isolated, a failure in one area is less likely to cascade and disrupt the entire system. This enhanced resilience translates to improved uptime, reduced recovery costs, and a more reliable experience for end-users, fostering greater trust and stability in a rapidly evolving digital landscape.
A fundamental shift towards data-centric software development promises to redefine innovation and operational efficiency. Traditionally, software design prioritizes application logic, treating data as a secondary concern; however, this paradigm is evolving. By placing data at the core of the development process – focusing on its structure, meaning, and relationships – systems can become inherently more flexible and responsive. This approach allows for easier integration of new data sources, faster adaptation to changing business needs, and the creation of more insightful analytics. The result is not simply faster development cycles, but a capacity for continuous improvement and the ability to extract maximum value from information assets, fostering a more dynamic and resilient technological landscape.
The pursuit of state reproducibility, central to ESAA's design, echoes a fundamental tenet of rigorous systems analysis. One might consider Paul Erdős's assertion: "A mathematician knows a lot of things, but he doesn't know everything." This highlights the inherent complexity within any system – even those meticulously constructed with principles like Event Sourcing. ESAA attempts to tame this complexity, not by achieving absolute omniscience, but by creating a verifiable, immutable audit trail. By embracing the idea of deterministic replay, the architecture doesn't seek to predict all possible states, but to reconstruct any given state with certainty, acknowledging the limitations of complete foresight while maximizing control and reliability within the software engineering process.
What Lies Ahead?
The pursuit of deterministic replay in LLM-assisted software engineering, as outlined by this work, isn't about achieving perfection – it's about rigorously defining the boundaries of imperfection. Event Sourcing offers a compelling framework, yet the true challenge isn't simply recording the chaos, but distilling meaningful signal from the inherent stochasticity of large language models. The architecture itself doesn't resolve the underlying unpredictability; it merely provides a detailed log of its manifestations. Future work must grapple with quantifying and managing that uncertainty – perhaps by treating LLM responses not as commands, but as probabilistic suggestions within a broader system of constraints.
The question isn't whether LLMs can produce reproducible code, but whether a system can tolerate their inherent variability. Focusing solely on state reproducibility risks building brittle systems, overly sensitive to minor fluctuations. A more robust approach might involve embracing controlled mutation, allowing for exploration of alternative solutions while maintaining a verifiable audit trail. Consider the implications of "undo" not as a return to a prior state, but as a branching point in a directed graph of possibilities.
Ultimately, this architecture is a provocation. It invites a dismantling of traditional notions of software "state" – a concept predicated on the illusion of control. The real innovation won't be in building more reliable LLMs, but in designing systems that are fundamentally indifferent to their unreliability, treating them as complex, unpredictable components within a larger, self-correcting mechanism. The future isn't about deterministic code; it's about resilient systems.
Original article: https://arxiv.org/pdf/2602.23193.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-01 18:40