Learning to Keep Systems Safe Without Knowing How They Work

Author: Denis Avetisyan


A new data-driven approach enables safe control of complex systems, even when their internal dynamics are unknown.

A safety filter, learned solely from system transitions, employs a safety critic that maps states and constraints to safety valuations and their derivatives. These quantities, together with the initial reference inputs, feed a quadratic programming solver that generates demonstrably safe control actions, as evidenced by the filter's ability to stabilize an inherently unstable system even when subjected to aggressive $\pm 1$ reference commands.

This work introduces the Deep QP Safety Filter, a model-free learning framework for reachability-based safety using control barrier functions and quadratic programming.

Ensuring safety in complex dynamical systems remains a significant challenge, particularly when accurate system models are unavailable. This paper introduces ‘Deep QP Safety Filter: Model-free Learning for Reachability-based Safety Filter’, a fully data-driven approach that learns a safety layer using Hamilton-Jacobi reachability without requiring prior knowledge of system dynamics. By constructing contraction-based losses and training neural networks, the method effectively learns a Quadratic Program (QP)-based safety filter that demonstrably reduces pre-convergence failures and accelerates learning in reinforcement learning tasks. Could this principled and practical approach unlock truly safe and robust model-free control for a wider range of real-world applications?


The Inherent Trade-off: Performance Versus Absolute Safety

Conventional safety protocols in control systems frequently prioritize worst-case scenarios, leading to overly cautious strategies that limit overall performance, particularly in intricate and dynamic environments. This reliance on conservative assumptions – such as underestimating system capabilities or overestimating potential disturbances – creates a safety margin, but at the cost of efficiency and responsiveness. Consequently, systems designed with these methods may operate far below their true potential, exhibiting sluggish reactions or failing to capitalize on opportunities for optimized control. This approach, while robust, can be demonstrably suboptimal in scenarios where precise and agile control is paramount, highlighting a fundamental trade-off between guaranteed safety and peak performance.

Conventional control strategies frequently prioritize safety by implementing stringent constraints, yet this often comes at the expense of achieving peak performance. The inherent difficulty lies in simultaneously satisfying rigorous safety demands and optimizing for the most efficient or desirable control policy; a system designed to rigidly avoid failure may operate far below its potential capabilities. This trade-off results in suboptimal outcomes where solutions, while secure, are neither agile nor responsive, hindering the system’s ability to effectively navigate complex environments or achieve ambitious goals. Consequently, a significant research focus centers on developing methodologies that can intelligently balance these competing priorities, enabling controllers to operate closer to their performance limits without compromising essential safety guarantees.

The learned safety filter consistently improves performance across multiple reinforcement learning tasks, including the Inverted Double Pendulum with base-position $|x|_{\text{base}}$ and base-velocity $|v|_{\text{base}}$ objectives as well as the standard Hopper environment, when compared to baseline PPO implementations with identical parameters.

Data-Driven Safety: A Learned Boundary

The Deep Quadratic Programming (QP) Safety Filter is a model-free approach to enforcing safety constraints in dynamic systems. Unlike traditional methods requiring detailed system identification or explicit modeling of dynamics, this filter learns directly from observed data. This data-driven characteristic allows it to adapt to complex and previously unseen scenarios without requiring manual adjustments to a pre-defined model. The filter operates as a layer within a larger control architecture, receiving control commands as input and outputting modified commands that adhere to learned safety boundaries. These boundaries are established through data analysis, identifying permissible states and actions based on historical observations of safe system behavior. The absence of a system model simplifies implementation and enhances robustness to modeling errors, providing a flexible and adaptable safety mechanism.
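
To make the layered arrangement concrete, the sketch below shows how such a filter could sit between a nominal policy and the plant: the nominal controller proposes a reference input, and the safety layer returns a minimally modified command. The class and function names are illustrative placeholders under assumed interfaces, not the paper's implementation.

```python
import numpy as np

class SafetyFilteredController:
    """Illustrative wrapper: a nominal policy followed by a learned safety layer."""

    def __init__(self, nominal_policy, safety_filter):
        self.nominal_policy = nominal_policy   # maps state -> reference input u_ref
        self.safety_filter = safety_filter     # maps (state, u_ref) -> safe input u

    def act(self, state):
        u_ref = self.nominal_policy(state)       # performance-driven command
        return self.safety_filter(state, u_ref)  # minimally modified, constraint-respecting command

# Usage with a trivial policy and a placeholder "filter" (a real filter would
# solve the QP described in the next paragraph instead of clipping):
controller = SafetyFilteredController(
    nominal_policy=lambda x: -0.5 * x.sum(keepdims=True),
    safety_filter=lambda x, u: np.clip(u, -1.0, 1.0),
)
print(controller.act(np.array([0.3, -0.2])))
```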

The Deep QP Safety Filter leverages Quadratic Programming (QP) to generate control commands that satisfy predefined safety constraints. QP is employed as an optimization technique to find the control action that minimizes a cost function – typically related to deviations from a desired trajectory – subject to linear inequality and equality constraints representing safety boundaries. This formulation ensures that the resulting control commands are not only optimal in terms of performance but also demonstrably satisfy the specified constraints, preventing unsafe actions. The QP solver outputs a control action $u$ that minimizes $\tfrac{1}{2} x^T Q x + \tfrac{1}{2} u^T R u$ subject to $A x + B u \leq c$ and $D x + E u = f$, where $x$ represents the system state, $Q$ and $R$ weight the cost, and $A, B, c, D, E, f$ define the inequality and equality constraints.
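
As a small, self-contained illustration of that optimization structure – with arbitrary placeholder matrices rather than the paper's learned quantities – the following snippet solves one instance of such a QP with cvxpy, treating the measured state $x$ as a constant and the input $u$ as the decision variable:

```python
# One instance of the QP form described above, solved with cvxpy.
# All numeric values are arbitrary placeholders for a 2-state, 1-input system.
import cvxpy as cp
import numpy as np

x = np.array([0.5, -0.1])            # measured state (a constant at solve time)
Q = np.eye(2)                        # state cost weight
R = np.array([[1.0]])                # input cost weight
A = np.array([[0.0, 1.0]])           # inequality constraint: A x + B u <= c
B = np.array([[1.0]])
c = np.array([0.8])

u = cp.Variable(1)                   # the control input is the decision variable
cost = 0.5 * float(x @ Q @ x) + 0.5 * cp.quad_form(u, R)
constraints = [A @ x + B @ u <= c]   # an equality block D x + E u = f would be added the same way
problem = cp.Problem(cp.Minimize(cost), constraints)
problem.solve()
print("safe input u =", u.value)
```

In a safety-filter setting, the inequality would encode the learned safety condition and the cost would penalize deviation from the reference input; the placeholder cost above simply mirrors the form stated in the text.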

The Deep QP Safety Filter incorporates Exponential Linear Unit (ELU) activation functions and Layer Normalization to improve performance and stability. ELU activations, defined as $f(x) = x$ for $x > 0$ and $f(x) = \alpha(e^x - 1)$ for $x \le 0$, where $\alpha$ is a hyperparameter, mitigate the vanishing gradient problem common in deep neural networks, enabling more effective training. Layer Normalization normalizes the activations across the features for each training example, reducing internal covariate shift and accelerating the learning process. This combination results in a more robust and efficient safety filter capable of learning complex constraints from data.
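
A minimal PyTorch sketch of a critic in this spirit, with ELU nonlinearities and LayerNorm between linear layers, might look as follows; the layer widths and depth are assumptions, not the paper's reported architecture.

```python
# Hypothetical safety-critic network combining ELU and LayerNorm.
import torch
import torch.nn as nn

class SafetyCritic(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),   # normalize activations per example
            nn.ELU(),                   # smooth, non-saturating for x > 0
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, 1),   # scalar safety value V(x)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

critic = SafetyCritic(state_dim=4)
value = critic(torch.randn(8, 4))     # batch of 8 states -> 8 safety values
print(value.shape)                    # torch.Size([8, 1])
```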

Formal Guarantees: Stability and Convergence Proofs

The Deep QP Safety Filter relies on the Bellman operator to iteratively refine the learned critic, and the stability of this process is directly linked to the operator’s contraction properties. Specifically, a contraction mapping ensures that successive iterations of the Bellman operator converge to a fixed point, representing the optimal or near-optimal critic. This convergence is formally guaranteed when the operator $T$ is a contraction, i.e., when $\|T V_1 - T V_2\| \le \gamma \|V_1 - V_2\|$ for some modulus $\gamma < 1$. Without this contraction property, the critic could oscillate or diverge, rendering the safety filter ineffective and potentially leading to unsafe behavior. The Deep QP formulation is designed to facilitate this contraction, allowing for a stable and reliable learning process.
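
Written out, the standard argument is the Banach fixed-point theorem: a $\gamma$-contraction in the sup-norm has a unique fixed point, and repeated application of the operator converges to it geometrically. This is the textbook statement, not a result specific to this paper's operator:

```latex
% Textbook contraction argument in the sup-norm:
\| T V_1 - T V_2 \|_\infty \;\le\; \gamma \, \| V_1 - V_2 \|_\infty,
\qquad 0 \le \gamma < 1,
% hence, by the Banach fixed-point theorem, there is a unique V^* with T V^* = V^*, and
\| T^k V_0 - V^* \|_\infty \;\le\; \gamma^k \, \| V_0 - V^* \|_\infty
\quad \text{for any initial } V_0 .
```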

The Deep QP Safety Filter’s performance analysis relies on extending the established theory of Hamilton-Jacobi Reachability (HJR) to a Time-Discounted Reachability (TDR) framework. Standard HJR, while providing guarantees of safety, does not explicitly account for the temporal cost associated with maintaining safe states; TDR addresses this by incorporating a discount factor γ into the cost function, penalizing future violations less than immediate ones. This extension allows for a more nuanced evaluation of the filter’s behavior, particularly in scenarios involving long-horizon planning. By analyzing the properties of the discounted cost-to-go function under the TDR framework, we can formally prove the stability of the learned safety constraints and guarantee that the filter converges to a safe policy, even in the presence of model uncertainties and disturbances.
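
One widely used form of this discounted backup, drawn from prior work on discounted Hamilton-Jacobi safety analysis, is sketched below; the paper's exact operator may differ in detail. Here $\ell(x)$ denotes a safety margin that is positive inside the constraint set.

```latex
% A common time-discounted reachability (safety) Bellman backup; details may
% differ from the paper's exact formulation.
V(x) \;=\; (1-\gamma)\,\ell(x)
\;+\; \gamma\,\min\Big\{ \ell(x),\; \max_{u \in \mathcal{U}} V\big(f(x,u)\big) \Big\},
% where \ell(x) > 0 inside the constraint set, \gamma \in [0,1) is the discount
% factor, and the fixed point approaches the undiscounted reachability value
% function as \gamma \to 1.
```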

The integration of Quadratic Programming (QP) within the Deep QP Safety Filter provides both enforceability and computational efficiency for learned safety constraints. QP allows the learned safety conditions to be encoded as affine inequality constraints on the control input, paired with a quadratic objective, a structure that is readily solvable using well-established optimization algorithms. This contrasts with more complex constraint formulations that might lack guaranteed feasibility or impose prohibitive computational costs. Specifically, QP enables the filter to efficiently compute control actions that minimize a cost function while adhering to these learned constraints, ensuring that the system remains within safe operating regions. The computational tractability of QP, with polynomial-time solution guarantees for standard convex formulations, is critical for real-time implementation and scalability to high-dimensional state spaces.
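
In barrier-function form, such a filter typically solves, at each measured state, a problem of the following shape. This is a generic template rather than the paper's exact constraint; in the model-free setting the derivative term is supplied by the learned critic rather than by a dynamics model.

```latex
% Generic barrier-style safety-filter QP (template, not the paper's exact form):
u^*(x) \;=\; \arg\min_{u}\;\tfrac{1}{2}\,\| u - u_{\mathrm{ref}} \|^2
\qquad \text{s.t.} \qquad \dot V(x,u) + \alpha\, V(x) \;\ge\; 0,
% where u_ref is the reference input, V is the learned safety value, and
% \alpha > 0 sets how aggressively the filter intervenes near the safety boundary.
```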

Empirical Validation: Robust Performance Across Systems

The Deep Quadratic Programming (QP) Safety Filter exhibited robust performance when tested on the Double Integrator System, offering crucial validation of the learned critic’s convergence. This system, characterized by its simple dynamics, served as an ideal benchmark for assessing the filter’s ability to consistently enforce safety constraints while allowing the controller to pursue optimal trajectories. Successful operation on this foundational system confirms that the learned critic is effectively mapping states to safe and desirable actions, providing a strong basis for evaluating performance on more complex and challenging robotic platforms. The consistent ability to maintain stability and adhere to predefined boundaries within the Double Integrator environment highlights the filter’s potential to significantly improve the reliability of reinforcement learning algorithms in safety-critical applications.
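
For reference, the double integrator is the canonical two-state benchmark written below; the box-shaped state and input bounds shown as the safe set are an assumption for illustration, since the paper's exact limits are not restated in this summary.

```latex
% Double integrator dynamics (standard form), with an assumed box-shaped safe set:
\dot{p} = v, \qquad \dot{v} = u, \qquad |p| \le p_{\max}, \quad |u| \le u_{\max}.
```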

The Deep QP Safety Filter’s capacity to maintain stability without overly restricting performance was rigorously tested using the Inverted Pendulum System. This dynamic model allowed researchers to assess the filter’s ‘aggressiveness’ – that is, how readily it intervenes to prevent unsafe actions. Results indicate a nuanced approach to safety; the filter successfully guided the pendulum toward a stable, upright position while simultaneously allowing it to explore control limits and achieve near-optimal performance. This balance is crucial in real-world applications where overly cautious safety mechanisms can hinder functionality, and the Inverted Pendulum System provided an ideal platform to demonstrate the filter’s ability to navigate this complex trade-off.

Simulations employing the Hopper system, a notoriously difficult benchmark due to its dynamic contact and hybrid nature, further establish the practical utility of the Deep QP Safety Filter. This complex system – requiring precise coordination of limbs and continuous contact with the ground – presents significant challenges for reinforcement learning algorithms. Despite this complexity, the filter exhibited remarkably robust performance, experiencing quadratic programming (QP) infeasibility in only 0.2343% of simulation instances. This low rate of infeasibility underscores the filter’s ability to consistently generate safe and feasible control actions, even within the constraints of a highly dynamic and contact-rich environment, and highlights its potential for real-world robotic applications.

Rigorous testing of the Deep QP Safety Filter across three distinct robotic systems – the Double Integrator, Inverted Pendulum, and Inverted Double Pendulum – yielded a remarkable result: zero instances of quadratic programming (QP) infeasibility. This consistently successful performance indicates the filter’s robust ability to generate feasible and safe control actions in diverse dynamical regimes. The absence of infeasibility is particularly significant as it suggests the method reliably avoids situations where a safe control solution cannot be computed, a critical requirement for real-world deployment in safety-sensitive applications. This finding underscores the filter’s potential to enhance the reliability and safety of reinforcement learning algorithms by consistently providing valid control signals, even in challenging scenarios.

Reinforcement learning algorithms often struggle with instability and high failure rates, particularly in complex control tasks. This work addresses this challenge by substantially reducing the incidence of catastrophic failures during the learning process. As visually demonstrated in figures 4a, 4b, and 4c, the implemented method consistently achieves a marked improvement in robustness across a range of simulated environments. By proactively mitigating unsafe actions, the algorithm allows agents to explore more effectively and converge to successful policies with significantly greater reliability, ultimately leading to more consistent and predictable performance during training and deployment.

The pursuit of a robust safety filter, as detailed in this work, resonates with a fundamental tenet of computational elegance. The Deep QP Safety Filter distinguishes itself by circumventing the need for explicit system models, relying instead on data-driven learning to guarantee reachability – a harmonious balance between ensuring safe operation and maximizing performance. This approach echoes Ada Lovelace’s observation: “The Analytical Engine has no pretensions whatever to originate anything.” While the engine – or, in this case, the control system – requires input, the ability to learn and adapt without pre-programmed dynamics demonstrates a powerful capacity for achieving safety through rigorously defined parameters, aligning with the pursuit of provable correctness rather than merely observed functionality.

What Lies Ahead?

The presented work offers a pragmatic, if not entirely satisfying, approach to the persistent problem of safe control. It sidesteps the demand for precise system models – a virtue born of necessity, given the intractable complexity of many real-world systems. However, one should not mistake empirical success for theoretical completeness. The ‘data-driven’ aspect, while convenient, merely shifts the burden of proof. The safety guarantees remain tethered to the quality and representativeness of the training data – a silent assumption that haunts all machine learning endeavors.

Future effort must address the question of certifiable robustness. Achieving safety through statistical correlation is… insufficient. A truly elegant solution will derive safety from fundamental principles, perhaps by integrating this data-driven filter with formal methods capable of verifying its behavior under adversarial conditions. The current formulation operates as a reactive shield; a proactive approach – one that anticipates and mitigates potential hazards before they manifest – remains a significant challenge.

Ultimately, the pursuit of ‘intelligent’ control demands more than skillful pattern recognition. It requires a return to first principles – a rigorous mathematical framework wherein safety isn’t a probability, but a provable consequence of the control law itself. The simplicity sought is not brevity, but non-contradiction – a logical completeness that transcends the limitations of empirical observation.


Original article: https://arxiv.org/pdf/2601.21297.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
