Beyond Fail-Safe: Building Resilient Cyber-Physical Systems

Author: Denis Avetisyan


A new review synthesizes the latest approaches to ensuring the robustness of critical infrastructure in the face of growing threats.

This paper provides a systematic taxonomy of faults and attacks across layers of Cyber-Physical Systems, advocating for cross-layer defense strategies and formal verification techniques to enhance resilience.

Despite increasing reliance on Cyber-Physical Systems (CPS) across critical infrastructure, their inherent coupling of computation and physical dynamics introduces complex, cascading failure modes that often defy traditional safety analysis. This paper, ‘Systemization of Knowledge: Resilience and Fault Tolerance in Cyber-Physical Systems’, presents a comprehensive, cross-layer taxonomy, the Origin-Layer-Effect (OLE) taxonomy, to unify nearly two decades of CPS resilience research and reveal shared structural vulnerabilities. Our analysis, mapping diverse systems onto this taxonomy, identifies four critical gaps spanning multiple CPS domains: physical-model manipulation, unstable machine learning control, semantic inconsistencies, and limited forensic visibility. Can a holistic, system-level approach integrating robust control, runtime monitoring, and formal assurance effectively mitigate these persistent threats and build truly resilient CPS?


The Expanding Threat Surface of Interconnected Systems

Cyber-Physical Systems, which intricately blend computation, networking, and physical processes, face a growing susceptibility to both accidental malfunctions and deliberate malicious attacks. This vulnerability stems from their increasing complexity and connectivity, expanding the potential attack surface for adversaries. Unlike traditional IT systems, failures in CPS can have direct and dangerous consequences in the physical world – impacting critical infrastructure, transportation networks, and even human life. Consequently, a shift beyond conventional cybersecurity practices is essential; robust security measures must be integrated throughout the entire system lifecycle, encompassing not only software and networks, but also the physical components and control systems themselves. Addressing this challenge requires a holistic approach, prioritizing resilience, redundancy, and proactive threat detection to safeguard these increasingly vital systems.

Conventional safety engineering, designed to mitigate accidental failures in systems, proves inadequate when confronting intentional, adversarial attacks on cyber-physical systems. Modern exploits extend beyond simple malfunctions, actively deceiving critical sensors like LiDAR and GPS; attackers can, for example, flood a LiDAR system with false returns, creating phantom obstacles, or inject malicious signals to disrupt GPS positioning. Furthermore, vulnerabilities in communication networks – such as CAN bus injection – allow direct manipulation of control commands, potentially overriding safety protocols. These targeted attacks don’t merely cause failures; they induce them in a way that bypasses traditional safeguards, demanding a fundamental shift towards security-aware system design and robust intrusion detection mechanisms.

The interconnected nature of cyber-physical systems introduces a significant risk of cascading failures, where an initial compromise in one component rapidly propagates throughout the entire system. This isn’t simply a linear progression; vulnerabilities in one layer (such as the network, control, or physical process) can exploit weaknesses in adjacent layers, accelerating the spread of errors. A compromised sensor, for example, might inject false data into the control system, leading to actuator malfunction and ultimately, physical damage. The speed and complexity of these cross-layer interactions mean that even seemingly minor initial faults can quickly escalate into catastrophic consequences, exceeding the capacity of traditional, isolated safety mechanisms. Understanding and mitigating these propagation pathways is therefore crucial for ensuring the resilience of critical infrastructure and autonomous systems.

Resilient Systems: Methods for Fault Detection and Mitigation

Runtime monitoring frameworks, exemplified by M2MON, operate by continuously observing a system’s behavior during execution and comparing it against pre-defined properties to detect anomalies. These properties, typically expressed as logical assertions about program variables and control flow, define the expected system behavior; deviations from these specifications trigger alerts. Effective anomaly detection hinges on the accuracy and completeness of these properties, as both false positives and false negatives can occur if the specifications are poorly defined or incomplete. While M2MON and similar frameworks offer real-time detection capabilities, the initial effort of formally specifying these properties can be substantial, requiring deep understanding of the system’s intended functionality and potential failure modes.
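The property-checking loop at the heart of such frameworks can be sketched in a few lines. This is a minimal, hypothetical illustration of the idea (predicates evaluated against telemetry snapshots), not M2MON's actual architecture or API; the property names and thresholds are invented.

```python
# Minimal runtime-monitor sketch: properties are predicates over a telemetry
# snapshot; any property that evaluates false on an observation is flagged.

from typing import Callable, Dict, List

Property = Callable[[Dict[str, float]], bool]

class RuntimeMonitor:
    def __init__(self) -> None:
        self.properties: Dict[str, Property] = {}
        self.alerts: List[str] = []

    def add_property(self, name: str, check: Property) -> None:
        self.properties[name] = check

    def observe(self, snapshot: Dict[str, float]) -> List[str]:
        # Evaluate every registered property against the current snapshot
        # and record the names of any that are violated.
        violated = [n for n, p in self.properties.items() if not p(snapshot)]
        self.alerts.extend(violated)
        return violated

monitor = RuntimeMonitor()
monitor.add_property("altitude_bounded", lambda s: 0.0 <= s["altitude"] <= 120.0)
monitor.add_property("battery_ok", lambda s: s["battery"] > 0.1)

print(monitor.observe({"altitude": 50.0, "battery": 0.8}))    # no violations
print(monitor.observe({"altitude": 300.0, "battery": 0.05}))  # both violated
```

The sketch also makes the specification burden concrete: the monitor is only as good as the predicates supplied to it, which is exactly where the false-positive/false-negative trade-off mentioned above enters.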

Sensor fusion combines data from multiple sensors to produce a more accurate and reliable estimate of a system’s state than could be achieved using any single sensor alone. This technique is particularly valuable in noisy environments where individual sensor readings may be inaccurate or incomplete. Kalman Filtering is a recursive algorithm commonly employed in sensor fusion to optimally estimate the state of a dynamic system from a series of noisy measurements. It operates by predicting the system’s next state based on a mathematical model, then correcting that prediction using the available sensor data, weighting each measurement based on its estimated noise characteristics. The resulting estimate minimizes the mean squared error, providing a statistically optimal representation of the system’s true state and improving overall system robustness.
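The predict/correct cycle described above can be shown in its simplest form: a one-dimensional Kalman filter fusing measurements from two sensors with different noise levels. The noise variances and readings below are illustrative values, not drawn from the paper.

```python
# One-dimensional Kalman filter sketch: fuse noisy position measurements
# from two sensors (one precise, one noisy) into a single state estimate.

def kalman_step(x, P, z, R, Q=1e-4):
    """One predict/update cycle for a static-state model (x_k = x_{k-1})."""
    # Predict: state unchanged, uncertainty grows by process noise Q.
    P = P + Q
    # Update: the Kalman gain K weights the measurement z against the prior.
    K = P / (P + R)          # gain balances prior variance vs. noise R
    x = x + K * (z - x)      # corrected estimate
    P = (1 - K) * P          # reduced uncertainty after the update
    return x, P

x, P = 0.0, 1.0              # initial estimate and variance
# Alternate readings from a precise sensor (R=0.01) and a noisy one (R=1.0).
for z, R in [(10.2, 0.01), (9.1, 1.0), (10.1, 0.01), (11.0, 1.0)]:
    x, P = kalman_step(x, P, z, R)

print(round(x, 2))  # estimate settles near 10.1, dominated by the precise sensor
```

Note how the gain automatically discounts the noisy sensor: each measurement is weighted by its estimated noise, which is the mechanism behind the robustness claim in the paragraph above.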

Formal verification techniques, including those developed under HACMS (the DARPA High-Assurance Cyber Military Systems program) and the Simplex architecture, provide mathematical proof of system correctness against specified requirements. These methods utilize formal logic and model checking to exhaustively analyze all possible system states, guaranteeing the absence of specified errors. However, the computational complexity of these analyses increases significantly with system size and complexity. Specifically, the state-space explosion problem, where the number of states grows exponentially with the number of system components, limits the practical application of formal verification to relatively small and well-defined systems. While advancements in abstraction and compositional verification attempt to mitigate scalability issues, verifying large, complex systems remains a substantial challenge.
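The core idea of explicit-state model checking, and the reason state spaces explode, can be seen in miniature. The sketch below exhaustively explores every reachable state of a tiny invented model (a pump with a valve interlock) and verifies that a bad state is unreachable; real verification tools operate on far richer formalisms, and this toy is only meant to convey the mechanism.

```python
# Toy explicit-state model checker: breadth-first exploration of every
# reachable state, verifying that a "bad" state cannot be reached.
# The pump/valve model is invented for illustration.

from collections import deque

def successors(state):
    """Transition relation over (valve_open, pump_on) states."""
    valve, pump = state
    nv = not valve
    # Toggling the valve: closing it forces the pump off (the interlock).
    moves = [(nv, pump and nv)]
    # Toggling the pump is only possible while the valve is open.
    if valve:
        moves.append((valve, not pump))
    return moves

def check(initial, is_bad):
    """Exhaustively explore reachable states; return a bad state or None."""
    seen, frontier = {initial}, deque([initial])
    while frontier:
        s = frontier.popleft()
        if is_bad(s):
            return s
        for n in successors(s):
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return None  # the property holds on every reachable state

# Safety property: the pump never runs while the valve is closed.
bad = check(initial=(False, False), is_bad=lambda s: s == (False, True))
print(bad)  # None: the interlock makes the bad state unreachable
```

With two booleans the state space has four states; add components and it multiplies, which is the exponential growth the paragraph above refers to.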

Continuous Rejuvenation is a proactive security technique involving the periodic, complete reset of system components to a known good state. This process aims to eliminate accumulated malware or errors before they can cause harm, functioning as a preventative measure rather than a reactive one. However, implementation requires careful consideration of system availability; frequent rejuvenation cycles increase overhead and potentially disrupt service. The optimal rejuvenation interval represents a trade-off between enhanced security posture and acceptable downtime, dependent on factors such as system criticality, threat model, and the cost of service interruption. Strategies like redundant systems and graceful degradation can mitigate availability impacts during rejuvenation cycles.
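The interval trade-off described above can be made concrete with a toy cost model: downtime cost falls as the interval grows, while exposure risk rises with it. The cost function, rates, and weights below are all invented for illustration; a real deployment would calibrate them from the system's threat model and availability requirements.

```python
# Sketch of the rejuvenation-interval trade-off: shorter intervals shrink the
# window in which accumulated errors or malware can act, but cost more
# downtime. All parameters below are illustrative.

def combined_cost(interval_h, reset_min=2.0, compromise_rate=0.01,
                  downtime_weight=1.0, risk_weight=1.0):
    """Weighted sum of downtime fraction and exposure risk per cycle."""
    downtime_fraction = (reset_min / 60.0) / interval_h   # reset overhead share
    exposure_risk = compromise_rate * interval_h          # grows with the window
    return downtime_weight * downtime_fraction + risk_weight * exposure_risk

# Grid-search a range of intervals for the cheapest operating point.
candidates = [h / 2 for h in range(1, 49)]   # 0.5 h .. 24 h in half-hour steps
best = min(candidates, key=combined_cost)
print(best)  # interior minimum: neither constant resetting nor rare resets win
```

The point of the sketch is qualitative: under any model of this shape, the optimal interval sits strictly between "reset constantly" and "reset rarely", matching the trade-off the paragraph describes.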

Adaptive Control and Recovery: Safeguarding Dynamic Systems

Robust control techniques address system stability when faced with disturbances and model uncertainties. These methods prioritize maintaining performance bounds despite unpredictable external factors and inaccuracies in system representation. An extension of this is Model Predictive Control (MPC), which utilizes a dynamic model of the system to predict future behavior and optimize control actions over a finite time horizon. MPC explicitly incorporates constraints on both system states and control inputs, allowing for proactive adaptation to predicted disturbances and ensuring stability within defined operational limits. The optimization problem solved by MPC at each time step typically involves minimizing a cost function, often related to tracking a desired trajectory or minimizing control effort, subject to the aforementioned constraints.
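The receding-horizon optimization that MPC performs at each time step can be sketched with a deliberately tiny example: a scalar system, a hard input constraint, and brute-force enumeration of a discrete control set in place of a real optimizer. The dynamics, horizon, and cost weights are illustrative choices, not from the paper.

```python
# Receding-horizon MPC sketch for a scalar system x[k+1] = x[k] + u[k]
# with a hard input constraint |u| <= 1. At each step we enumerate a small
# discrete control set over the horizon and apply only the first move.

from itertools import product

def mpc_step(x, target, horizon=3, controls=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Pick the first control of the cheapest constrained control sequence."""
    best_cost, best_u0 = float("inf"), 0.0
    for seq in product(controls, repeat=horizon):
        xi, cost = x, 0.0
        for u in seq:
            xi = xi + u                                  # predicted dynamics
            cost += (xi - target) ** 2 + 0.1 * u ** 2    # tracking + effort
        if cost < best_cost:
            best_cost, best_u0 = cost, seq[0]
    return best_u0              # receding horizon: apply only the first move

x, target = 0.0, 3.0
trajectory = [x]
for _ in range(6):
    x = x + mpc_step(x, target)
    trajectory.append(x)
print(trajectory)  # climbs to the setpoint without ever exceeding |u| <= 1
```

Because the constraint set only contains moves with |u| ≤ 1, the constraint is satisfied by construction, which is the "explicitly incorporates constraints" property noted above; production MPC solves a continuous constrained optimization instead of enumerating.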

Learn2Recover is a reinforcement learning framework enabling unmanned aerial vehicles (UAVs) to autonomously acquire recovery behaviors in response to unexpected system failures. The system employs a staged training process, initially utilizing simulated failures to expose the UAV to a diverse range of fault conditions, and subsequently refining these behaviors through real-world flight testing. This allows UAVs to learn corrective actions – such as adjusting control surfaces or altering flight parameters – without explicit pre-programming for each potential failure mode. The learned policies are stored and deployed on-board the UAV, providing a degree of resilience against previously unseen or unpredictable failures that would otherwise necessitate manual intervention or result in loss of control. Performance metrics demonstrate improved recovery rates and reduced recovery times compared to traditional rule-based or pre-programmed recovery strategies.

Delayed Input Sharing functions as a proactive security measure by introducing a temporal buffer between command receipt and execution on unmanned aerial vehicles (UAVs). This technique operates by requiring a command to be acknowledged by multiple UAVs before being acted upon; if consensus is not reached within a predetermined timeframe, the command is discarded. This significantly reduces the impact of single malicious commands, as a compromised UAV cannot immediately dictate actions to the entire fleet. The delay imposed allows for anomaly detection and validation of commands against expected behaviors, providing a first line of defense against adversarial attacks and reducing the risk of compromised system integrity. The effectiveness of Delayed Input Sharing is directly related to the number of UAVs involved in the consensus process and the length of the imposed delay, balancing security with responsiveness.
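The buffer-acknowledge-decide cycle described above can be sketched as a small state machine. The class, quorum size, and deadline below are hypothetical illustrations of the mechanism, not an interface from the paper.

```python
# Sketch of delayed input sharing: a command is buffered and executed only
# if a quorum of peer UAVs acknowledges it before a deadline; otherwise it
# is discarded. Names and thresholds are illustrative.

class CommandBuffer:
    def __init__(self, fleet_size, quorum, deadline_s):
        self.fleet_size = fleet_size
        self.quorum = quorum
        self.deadline_s = deadline_s
        self.pending = {}   # cmd_id -> (issued_at, set of acknowledging UAVs)

    def receive(self, cmd_id, issued_at):
        self.pending[cmd_id] = (issued_at, set())

    def acknowledge(self, cmd_id, uav_id):
        if cmd_id in self.pending:
            self.pending[cmd_id][1].add(uav_id)

    def decide(self, cmd_id, now):
        """Execute on quorum; discard on deadline expiry; else keep waiting."""
        issued_at, acks = self.pending[cmd_id]
        if len(acks) >= self.quorum:
            del self.pending[cmd_id]
            return "execute"
        if now - issued_at > self.deadline_s:
            del self.pending[cmd_id]
            return "discard"
        return "pending"

buf = CommandBuffer(fleet_size=5, quorum=3, deadline_s=0.5)
buf.receive("turn_left", issued_at=0.0)
buf.acknowledge("turn_left", "uav1")
buf.acknowledge("turn_left", "uav2")
print(buf.decide("turn_left", now=0.2))   # pending: only 2 of 3 acks so far
buf.acknowledge("turn_left", "uav3")
print(buf.decide("turn_left", now=0.3))   # execute: quorum reached in time
```

The security/responsiveness balance noted above corresponds directly to the `quorum` and `deadline_s` parameters: raising either makes a single malicious command harder to push through, at the cost of slower reaction.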

Byzantine Fault Tolerance (BFT) is a fault-tolerance mechanism designed to operate reliably even when components within a system fail in arbitrary ways, including maliciously. Unlike typical fault tolerance which addresses crashes or simple errors, BFT addresses scenarios where components can actively lie or send incorrect information. This is achieved through redundancy and consensus algorithms, requiring a majority of nodes to agree on a single value to ensure correctness. Mathematically rigorous proofs demonstrate that BFT systems can achieve consensus despite the presence of up to f faulty nodes within a total of 3f + 1 nodes, guaranteeing system integrity and preventing malicious actors from compromising the overall operation.
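The 3f + 1 bound can be illustrated with a toy majority vote: with n replicas, at most f = ⌊(n − 1)/3⌋ Byzantine faults are tolerable, and a value is accepted only with a 2f + 1 supermajority. This is a voting sketch of the threshold arithmetic, not a full consensus protocol (which also needs multiple communication rounds).

```python
# Sketch of the BFT bound n >= 3f + 1: tolerate f = (n - 1) // 3 arbitrary
# (Byzantine) faults, accepting a value only on a 2f + 1 supermajority.

from collections import Counter

def max_byzantine_faults(n):
    """Largest f such that n >= 3f + 1."""
    return (n - 1) // 3

def bft_decide(votes):
    """Accept a value only if it has a 2f + 1 supermajority among n votes."""
    n = len(votes)
    f = max_byzantine_faults(n)
    value, count = Counter(votes).most_common(1)[0]
    return value if count >= 2 * f + 1 else None

# Four replicas tolerate one traitor: three honest "climb" votes outvote it.
print(max_byzantine_faults(4))                           # 1
print(bft_decide(["climb", "climb", "climb", "dive"]))   # climb
print(bft_decide(["climb", "climb", "dive", "dive"]))    # None: no quorum
```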

Classifying Failures: A Taxonomy for Enhanced Understanding

The Origin-Layer-Effect Taxonomy categorizes faults in Cyber-Physical Systems (CPS) by identifying the root origin of the failure – such as environmental factors, software errors, or malicious attacks – the layer within the CPS architecture where the fault manifests – including perception, planning, control, or physical processes – and the resulting effects observed in the system’s behavior. This framework facilitates a systematized approach to failure analysis across diverse domains including Unmanned Aerial Vehicles (UAVs), automotive systems, and industrial control systems. By classifying failures based on these three attributes, the taxonomy enables comparative analysis of vulnerabilities and mitigation strategies across different CPS implementations, offering a generalized model for understanding and addressing system faults.
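The three-attribute classification can be expressed directly as a data structure, which also shows how it supports comparative queries across incidents. The enum members mirror the categories named above; the incident records themselves are invented examples, not cases from the paper.

```python
# Sketch of the Origin-Layer-Effect (OLE) classification as a data structure,
# with a simple comparative query over invented incident records.

from dataclasses import dataclass
from enum import Enum

class Origin(Enum):
    ENVIRONMENTAL = "environmental"
    SOFTWARE_ERROR = "software error"
    MALICIOUS_ATTACK = "malicious attack"

class Layer(Enum):
    PERCEPTION = "perception"
    PLANNING = "planning"
    CONTROL = "control"
    PHYSICAL = "physical process"

@dataclass(frozen=True)
class OLEFault:
    origin: Origin
    layer: Layer
    effect: str          # observed system-level behavior

incidents = [
    OLEFault(Origin.MALICIOUS_ATTACK, Layer.PERCEPTION, "phantom LiDAR obstacle"),
    OLEFault(Origin.SOFTWARE_ERROR, Layer.CONTROL, "erroneous actuator command"),
    OLEFault(Origin.MALICIOUS_ATTACK, Layer.PHYSICAL, "resonance-induced rotor failure"),
]

# Comparative analysis: group incidents by origin to expose shared gaps.
attacks = [i for i in incidents if i.origin is Origin.MALICIOUS_ATTACK]
print(len(attacks))  # 2
```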

Effective mitigation of failures in Cyber-Physical Systems (CPS) necessitates a thorough understanding of the interaction between control-semantic faults and the resulting physical system dynamics. Control-semantic faults, encompassing errors in the intended logic of control algorithms, can manifest as incorrect actuator commands or misinterpreted sensor data. These faults propagate through the physical system, influencing its state and potentially leading to instability or unintended behavior. The severity of these effects is directly dependent on the system’s inherent dynamics – factors such as mass, inertia, friction, and resonance frequencies. Consequently, mitigation strategies must account for both the logical error within the control system and the physical response of the system, requiring a combined analysis of control software and physical modeling to predict and prevent hazardous outcomes.
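The coupling between a control-semantic fault and physical dynamics can be demonstrated with a minimal example: a single sign flip in a proportional controller (a pure logic error with no hardware fault) turns a stable loop into a divergent one. The plant and gains are invented for illustration.

```python
# Illustrative coupling of a control-semantic fault with physical dynamics:
# a sign flip in a proportional controller destabilizes the closed loop.

def simulate(gain, steps=20, dt=0.1):
    """Proportional control of x' = u toward setpoint 0, starting at x = 1."""
    x = 1.0
    for _ in range(steps):
        u = gain * (0.0 - x)    # correct logic: positive gain drives x to 0
        x = x + dt * u          # physical response to the commanded input
    return abs(x)

print(simulate(gain=2.0) < 0.1)    # healthy controller converges
print(simulate(gain=-2.0) > 10.0)  # sign-flipped fault diverges
```

This is the combined-analysis point in miniature: the fault is invisible from the code's structure alone (both gains are "valid" programs), and only the physical model reveals that one of them is hazardous.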

Crash analysis frameworks, such as MAYDAY, offer a systematic approach to understanding failures in cyber-physical systems (CPS) by integrating telemetry data with dynamic system models. These frameworks move beyond simple fault identification to reconstruct event sequences leading to failure, identifying causal relationships between software anomalies and physical consequences. The integration of telemetry – encompassing sensor readings, actuator commands, and communication logs – with a dynamic model of the physical system allows for the validation of hypotheses about failure mechanisms and the quantification of system states leading to the event. This combined analysis facilitates the identification of previously unknown failure modes, the assessment of system resilience, and the development of more effective mitigation strategies by providing a detailed reconstruction of the conditions that precipitated the failure.

The RockDrone attack, and similar exploits, highlight the vulnerability of Cyber-Physical Systems (CPS) to physical attacks that exploit mechanical resonances. These attacks involve the precise application of frequencies to induce amplified oscillations within the target system’s physical components, such as rotors, frames, or actuators. The induced vibrations can lead to component failure, loss of control, or complete system destruction. Successful exploitation doesn’t require compromising the system’s software; rather, it directly manipulates physical parameters, bypassing traditional cybersecurity measures. This attack vector is particularly concerning for systems with limited physical hardening or insufficient vibration damping, as the required energy to induce resonance can be relatively low.
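The physical mechanism behind such attacks, resonant amplification, can be reproduced numerically: a lightly damped oscillator driven at its natural frequency builds a far larger amplitude than one driven off-resonance, with modest input energy. The oscillator parameters below are illustrative, not a model of any specific drone component.

```python
# Toy resonance demonstration: integrate a driven, lightly damped oscillator
# x'' + 2*z*w*x' + w^2*x = sin(wd*t) and compare peak response on- vs.
# off-resonance. Parameters are illustrative.

import math

def peak_amplitude(drive_freq, natural_freq=10.0, damping=0.05,
                   duration=20.0, dt=0.001):
    """Euler-integrate the driven oscillator and track the peak |x|."""
    w, wd = natural_freq, drive_freq
    x, v, peak, t = 0.0, 0.0, 0.0, 0.0
    for _ in range(int(duration / dt)):
        a = math.sin(wd * t) - 2 * damping * w * v - w * w * x
        v += a * dt
        x += v * dt
        t += dt
        peak = max(peak, abs(x))
    return peak

on_res = peak_amplitude(10.0)    # driven exactly at the natural frequency
off_res = peak_amplitude(25.0)   # driven well off-resonance
print(on_res > 5 * off_res)      # same drive strength, far larger response
```

The same forcing amplitude produces an order-of-magnitude larger response at resonance, which is why the attack's energy requirement is low: the target's own dynamics do the amplification.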

Towards Self-Healing Cyber-Physical Systems: A Vision for the Future

Truly self-healing cyber-physical systems demand a synergistic approach to system management, necessitating the integration of adaptive control, formal verification, and continuous runtime monitoring. Adaptive control allows the system to dynamically reconfigure itself in response to detected anomalies or failures, while formal verification provides mathematical proof of system correctness and safety properties before deployment, minimizing vulnerabilities. However, pre-verified systems still encounter unforeseen circumstances; therefore, continuous runtime monitoring observes system behavior, detecting deviations from expected norms and triggering corrective actions. This layered defense – proactively verifying, adaptively responding, and continuously observing – moves beyond simple fault tolerance, enabling systems to not only recover from failures but also to learn from them and enhance future resilience. Ultimately, this holistic approach promises a new generation of robust and dependable cyber-physical infrastructure.

The future of cyber-physical system (CPS) resilience hinges on increasingly sophisticated applications of machine learning and artificial intelligence. Current fault management often relies on reactive responses to detected anomalies; however, emerging AI techniques promise a shift towards proactive mitigation. By analyzing vast datasets of system operational parameters, predictive models can anticipate potential failures before they manifest, enabling preemptive adjustments or resource allocation. This extends beyond simple anomaly detection to encompass complex pattern recognition indicative of developing issues, even those stemming from subtle environmental changes or previously unseen attack vectors. Furthermore, reinforcement learning algorithms can optimize control strategies in real-time, adapting to dynamic conditions and autonomously implementing corrective actions, ultimately reducing downtime and enhancing overall system robustness. The integration of these intelligent systems represents a crucial step towards truly self-healing CPS capable of maintaining critical functionality even in the face of adversity.

The pursuit of robust, self-healing cyber-physical systems (CPS) necessitates a unified approach to security and collaboration, demanding the development of standardized taxonomies and protocols. Currently, a lack of consistent terminology and shared security frameworks hinders effective communication and coordinated response to vulnerabilities across different CPS implementations. Establishing common definitions for fault types, attack vectors, and system states will allow for more precise threat modeling and risk assessment. Furthermore, standardized security protocols (covering areas like authentication, authorization, and data encryption) will enable seamless integration of security measures across diverse components and facilitate interoperability between systems. This collaborative framework not only streamlines the process of identifying and mitigating threats but also fosters a more resilient ecosystem, where shared knowledge and coordinated defenses significantly reduce the potential for widespread failures and enhance overall system dependability.

Cyber-physical systems (CPS) increasingly operate in complex, interconnected environments, blurring the lines between the digital and physical realms and creating novel attack vectors demanding focused investigation. Traditional cybersecurity measures, designed to protect data and software, often prove insufficient against threats that directly manipulate physical processes. Research must prioritize understanding how cyber intrusions can trigger or exacerbate physical failures – and conversely, how physical disturbances can be exploited to compromise cyber defenses. This necessitates exploring the synergistic effects of combined attacks, where, for example, a cyberattack disables safety mechanisms, allowing a physical disruption to cause catastrophic damage. Anticipating these interwoven threats requires developing new modeling techniques, risk assessment frameworks, and resilient system designs capable of withstanding both digital and physical assaults, ultimately bolstering the reliability and safety of critical infrastructure.

The pursuit of resilient Cyber-Physical Systems demands an elegance of design, a harmony between the various layers of operation. This paper’s taxonomy of faults and attacks, and its call for cross-layer defense, exemplifies this principle. It recognizes that complexity needn’t equate to fragility; rather, a deep understanding of systemic vulnerabilities can lead to robust, intuitively understandable architectures. As Carl Sagan observed, “Somewhere, something incredible is waiting to be known.” This sentiment resonates deeply with the challenge of securing these complex systems; uncovering potential failure points and crafting effective defenses is an ongoing quest for knowledge, a refinement of understanding that elevates design beyond mere functionality.

Beyond Robustness

The systematization offered by a layered taxonomy of faults and attacks in Cyber-Physical Systems is, predictably, only a first step. While identifying vulnerabilities is essential, the true challenge lies in anticipating those not yet conceived. Each categorization, each neatly defined attack surface, risks becoming a constraint on imagination. The pursuit of resilience should not devolve into a mere cataloging of known weaknesses; it demands an acknowledgement of inherent unpredictability. Formal verification, though a powerful tool, is ultimately limited by the models upon which it relies – simplified representations of profoundly complex reality.

Future work must address the chasm between formal guarantees and operational robustness. Cross-layer analysis, advocated here, is promising, but its implementation demands a degree of interdisciplinary collaboration that remains, frankly, uncommon. The elegance of a holistic defense isn’t merely a matter of technical integration; it requires a shared philosophical commitment to viewing the system as an indivisible whole.

One might posit that true resilience isn’t about preventing all failures (an impossible aspiration) but about gracefully accommodating them. A system that anticipates its own potential for imperfection, and designs for recovery rather than prevention, possesses a quality far beyond simple fault tolerance. This shift in perspective, from a defensive posture to an adaptive one, may well define the next generation of Cyber-Physical Systems.


Original article: https://arxiv.org/pdf/2512.20873.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
