Author: Denis Avetisyan
Researchers have developed a method for training robots to operate safely in complex environments using only previously collected data.

This work introduces Value-Guided Offline Control Barrier Functions (V-OCBF), a framework that leverages value function estimation and expectile regression to synthesize safe controllers from offline datasets.
Ensuring the safety of autonomous systems remains a critical challenge, particularly when relying on data-driven control without online interaction. This paper introduces ‘V-OCBF: Learning Safety Filters from Offline Data via Value-Guided Offline Control Barrier Functions’, a novel framework that learns safety constraints directly from offline demonstrations by integrating value function estimation with control barrier function synthesis. The resulting approach achieves improved safety and performance by propagating safety information without requiring a dynamics model or hand-engineered barriers. Could this model-free learning of safety filters unlock truly robust and scalable safety-critical controllers for complex, real-world applications?
The Illusion of Control: Why Safety Can’t Be an Afterthought
While Reinforcement Learning (RL) has achieved remarkable success in areas like game playing and robotics, a fundamental limitation often lies in its disregard for safety during the learning process. Traditional RL algorithms prioritize maximizing cumulative reward, frequently leading agents to explore potentially dangerous states or exhibit unstable behaviors as they learn through trial and error. This presents a significant challenge when deploying RL in real-world applications – such as autonomous driving or healthcare – where even a single unsafe action can have severe consequences. The agent’s relentless pursuit of optimization, without explicit constraints, can result in the discovery of “shortcuts” that achieve high rewards but violate critical safety protocols, highlighting the need for algorithms that inherently prioritize safe exploration and stable performance alongside reward maximization.
The successful integration of reinforcement learning into practical applications, such as robotics, autonomous driving, and healthcare, hinges on the development of algorithms capable of consistently avoiding unsafe states and ensuring system stability. Unlike simulations where risks are limited, real-world deployments demand a demonstrable guarantee of safe behavior, even during the learning process. Consequently, researchers are actively exploring methods to constrain policy learning, incorporating safety layers or utilizing formal verification techniques to prove the absence of undesirable outcomes. These approaches aim to move beyond simply achieving a goal to achieving it safely, acknowledging that even a highly optimized policy is unacceptable if it poses a risk to the environment or the system itself. The challenge lies in balancing performance with robustness, creating intelligent agents that not only learn effectively but also operate predictably and reliably in complex, unpredictable environments.
Offline reinforcement learning presents a compelling strategy for training agents using previously collected datasets, circumventing the need for costly and potentially dangerous online exploration. However, a significant challenge arises from the distribution shift between the data used for training and the policy the algorithm ultimately learns – a mismatch that can lead to the agent venturing into unseen, and therefore potentially unsafe, states. Unlike online methods where an agent can learn from its mistakes in a controlled environment, offline RL algorithms must extrapolate beyond the confines of the existing data, making it difficult to guarantee safe behavior during policy improvement. Researchers are actively investigating techniques, such as constrained optimization and pessimistic policy iteration, to mitigate these risks and ensure that learned policies remain within acceptable safety boundaries, even when operating outside the observed data distribution.

Safety Through Design: The Illusion of Formal Guarantees
Control Barrier Functions (CBFs) formally define a ‘safe set’ – a region in the state space where system operation is considered acceptable – using a continuously differentiable function $h(x)$. The zero-superlevel set of $h(x)$, $\{x \mid h(x) \geq 0\}$, constitutes this safe set. CBF-based control then designs control actions that keep the system within this set for all time. In practice, a CBF is often paired with a control Lyapunov function (CLF) in a single optimization-based controller, so that the closed loop is both stable and safe with respect to the defined constraints. This mathematical framework allows for rigorous verification of safety properties, differentiating it from purely stability-focused control approaches.
Forward invariance, as applied within the framework of Control Barrier Functions (CBFs), ensures the persistence of initial safety conditions. Specifically, if a dynamical system, described by $\dot{x} = f(x, u)$, begins within a defined ‘safe set’ $C$, forward invariance guarantees that the trajectory $x(t)$ will remain within $C$ for all future times $t > 0$, provided the control input $u$ satisfies the conditions dictated by the CBF. In its simplest (Nagumo-style) form, this is formalized by requiring that the derivative of the barrier function along the dynamics, $\dot{h}(x) = \nabla h(x) \cdot f(x, u)$, be non-negative whenever the state reaches the boundary $h(x) = 0$; the CBF condition strengthens this to $\dot{h}(x) \geq -\alpha(h(x))$ throughout the safe set, for an extended class-$\mathcal{K}$ function $\alpha$, effectively preventing the system from escaping the safe set. The guarantee holds under the assumption that the initial condition $x(0)$ is within $C$ and the CBF constraint is consistently enforced during operation.
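For reference, the standard control-affine statement of this condition (textbook CBF material, not anything specific to this paper) is compact: for dynamics $\dot{x} = f(x) + g(x)u$ and safe set $C = \{x \mid h(x) \geq 0\}$, the function $h$ is a valid CBF if
$$\sup_{u \in U} \left[ L_f h(x) + L_g h(x)\,u \right] \;\geq\; -\alpha\big(h(x)\big)$$
for some extended class-$\mathcal{K}$ function $\alpha$, and any Lipschitz controller satisfying this inequality pointwise renders $C$ forward invariant.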
Real-time implementation of Control Barrier Functions (CBFs) frequently relies on optimization-based control to compute actions that satisfy the CBF condition. Quadratic Programming (QP) is a common choice because, for control-affine systems, the CBF condition $L_f h(x) + L_g h(x)u \geq -\alpha(h(x))$ is affine in the control input $u$, where $x$ denotes the system state. The QP minimizes a cost function, typically the deviation from a nominal or tracking controller, subject to this CBF constraint and any other input constraints. The result is a control input that prioritizes safety, as defined by the CBF, while staying as close as possible to the desired performance objective. The computational efficiency of QP is critical for real-time applications, particularly in systems with fast dynamics or limited computational resources.
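As a concrete illustration, the sketch below implements a minimal CBF-QP safety filter for a single-integrator system ($\dot{x} = u$) avoiding a circular obstacle. The dynamics, obstacle geometry, and gain are assumptions made purely for this example; with a single affine constraint, the QP has a closed-form projection, so no solver library is needed.

```python
import numpy as np

# Minimal CBF-QP safety filter for a single integrator (x_dot = u) avoiding a
# circular obstacle. Illustrative only: the obstacle, gain, and dynamics are
# assumptions for this sketch, not the setup used in the paper.

def h(x, center=np.array([2.0, 0.0]), radius=1.0):
    """Barrier: h(x) >= 0 outside the obstacle."""
    return np.dot(x - center, x - center) - radius**2

def grad_h(x, center=np.array([2.0, 0.0])):
    return 2.0 * (x - center)

def cbf_qp_filter(x, u_des, alpha=1.0):
    """Project u_des onto the half-space grad_h(x)^T u >= -alpha * h(x).

    For a single affine constraint, the QP
        min ||u - u_des||^2  s.t.  a^T u >= b
    has the closed-form solution below.
    """
    a = grad_h(x)                  # L_g h(x) for x_dot = u (L_f h is zero here)
    b = -alpha * h(x)              # right-hand side of the CBF condition
    if a @ u_des - b >= 0.0:       # nominal control already satisfies the constraint
        return u_des
    return u_des + (b - a @ u_des) / (a @ a) * a  # minimal correction onto the constraint

x = np.array([0.5, 0.1])
u_nominal = np.array([1.0, 0.0])   # drives straight toward the obstacle
print(cbf_qp_filter(x, u_nominal))
```

The filtered input deviates from the nominal command only when the CBF constraint would otherwise be violated, which is the defining behavior of a safety filter.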
Defining and verifying a safe set for Control Barrier Functions (CBFs) in high-dimensional systems presents significant computational challenges. The complexity of describing these sets grows exponentially with dimensionality, making exhaustive verification impractical. Consequently, research focuses on scalable representations, such as superlevel sets of learned or hand-designed barrier functions and polytopic approximations, to reduce the computational burden. Robustness is critical, as inaccuracies in defining the safe set can lead to constraint violations and unsafe behavior. Techniques like Sum of Squares (SOS) programming can certify the forward invariance of the safe set, but they often suffer from computational cost and scalability issues. Alternative approaches include sampling-based verification and the use of machine learning to approximate the safe set, though these methods typically offer probabilistic safety guarantees rather than formal certification.

Offline Safety: Learning to Stay Within the Lines
Value-Guided Offline Control Barrier Functions (V-OCBF) represent a distinct methodology for synthesizing safe control policies using exclusively offline datasets. Traditional control barrier function (CBF) methods often require online adaptation or access to a system model, limiting their applicability when real-time interaction or model identification is impractical. V-OCBF circumvents these limitations by learning a safety-critical region directly from a collection of expert or suboptimal demonstrations. This is achieved by incorporating a value function, learned from the same offline data, into the CBF formulation. The value function serves as a proxy for desired behavior, guiding the barrier to prioritize trajectories that not only satisfy safety constraints but also achieve high expected return as observed in the dataset. This approach enables the creation of safe controllers without explicit system identification, online learning, or access to a dynamically changing environment.
V-OCBF integrates a learned value function, $Q(s,a)$, into the standard Control Barrier Function (CBF) formulation to improve both safety and performance when learning from offline datasets. Traditionally, CBFs ensure safety by constraining control actions so that states remain within a defined safe set. V-OCBF extends this by incorporating the expected cumulative reward, as estimated by the value function, into the constrained action selection. This allows the algorithm to prioritize control actions that not only satisfy safety constraints but also maximize the expected return, guiding the learning process towards performant, safe behavior even with limited or imperfect data. The value function acts as a learned cost-to-go, enabling the framework to differentiate between safe actions based on their potential for achieving the task objective.
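As a rough, hypothetical illustration of this interplay (not the paper's exact objective), a learned state-action barrier can act as a filter and the learned value function as a tie-breaker among the actions that pass it. The names `barrier_q` and `q_value`, and the filter-then-maximize rule, are assumptions introduced only for this sketch.

```python
# Hypothetical value-guided action selection from a discrete candidate set.
# barrier_q(s, a) >= 0 is read as "certified safe"; q_value(s, a) is the task
# value. Both stand in for networks learned offline in this illustration.

def value_guided_filter(state, candidate_actions, barrier_q, q_value):
    safe = [a for a in candidate_actions if barrier_q(state, a) >= 0.0]
    if not safe:
        # No candidate is certified safe: fall back to the least-unsafe action.
        return max(candidate_actions, key=lambda a: barrier_q(state, a))
    # Among certified-safe actions, pick the one the value function prefers.
    return max(safe, key=lambda a: q_value(state, a))

# Toy usage with hand-written stand-ins for the learned functions.
barrier_q = lambda s, a: 1.0 - abs(s + a)      # "safe" while |s + a| <= 1
q_value = lambda s, a: -(s + a - 0.5) ** 2     # task prefers s + a near 0.5
print(value_guided_filter(0.0, [-1.5, -0.5, 0.25, 0.9, 2.0], barrier_q, q_value))
```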
Expectile regression is employed to enhance the robustness of the learned barrier when dealing with imperfect data. Ordinary regression minimizes the mean squared error and treats positive and negative residuals symmetrically, which makes it sensitive to outliers; expectile regression instead minimizes an asymmetric squared loss, $L_\alpha(u) = |\alpha - \mathbb{1}(u < 0)|\,u^2$, where $u$ is the residual and $0 < \alpha < 1$ controls the asymmetry. Choosing $\alpha$ away from $0.5$ biases the estimate toward one tail of the conditional distribution: a small $\alpha$ places more weight on negative residuals, yielding a conservative estimate that is robust to optimistic outliers in the data used to construct the barrier, while a large $\alpha$ approximates an in-distribution maximum without querying actions outside the dataset. Consequently, the resulting barrier function is less susceptible to violations caused by noisy or incomplete demonstrations, improving the safety and reliability of the learned policy.
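The loss itself is standard (it is the expectile analogue of quantile regression, widely used in offline RL); a minimal PyTorch version is sketched below. The choice of $\alpha$ and of the quantity being regressed are left to the specific method and are not taken from the paper.

```python
import torch

def expectile_loss(pred: torch.Tensor, target: torch.Tensor, alpha: float = 0.8) -> torch.Tensor:
    """Asymmetric squared loss |alpha - 1(u < 0)| * u^2 with residual u = target - pred.

    alpha = 0.5 recovers ordinary mean-squared error; alpha > 0.5 penalizes
    under-estimation more (pushing the fit toward an in-sample maximum), while
    alpha < 0.5 penalizes over-estimation, giving a conservative fit.
    """
    u = target - pred
    weight = torch.abs(alpha - (u < 0).float())
    return (weight * u.pow(2)).mean()

# Toy usage: fit a scalar to noisy targets; alpha = 0.9 pulls the fit above the mean.
targets = torch.randn(1000)
x = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([x], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = expectile_loss(x.expand_as(targets), targets, alpha=0.9)
    loss.backward()
    opt.step()
print(float(x))  # close to the 0.9-expectile of the targets
```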
Finite-difference barrier recursion and Lie derivatives provide a computationally efficient way to express and enforce safety constraints within the Control Barrier Function (CBF) framework. Lie derivatives, such as $L_f h(x)$, quantify the rate of change of a barrier function $h(x)$ along the system dynamics $f$. Finite-difference barrier recursion approximates this rate of change from sampled transitions and propagates the CBF constraint recursively, keeping the system within safe boundaries by favoring control inputs that maintain $h(x) \geq 0$. This avoids explicit analytical derivation of the Lie derivative and allows safety constraints to be applied to complex, potentially non-linear systems where closed-form solutions are unavailable, enabling real-time enforcement of safety criteria.
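To make the finite-difference idea concrete, the following sketch evaluates a discrete-time barrier-decrease condition on offline transitions $(s, a, s')$. The barrier function `h`, the decay rate `gamma`, the time step, and the hinge-style penalty are assumptions for illustration; the paper's exact recursion and loss may differ.

```python
import numpy as np

def finite_difference_cbf_violation(h, s, s_next, dt=0.05, gamma=1.0):
    """Penalty for violating  (h(s') - h(s)) / dt >= -gamma * h(s)  on a transition.

    The finite difference stands in for the Lie derivative L_f h + L_g h u,
    so no analytic model of the dynamics is required.
    """
    h_s, h_next = h(s), h(s_next)
    lie_estimate = (h_next - h_s) / dt
    return np.maximum(0.0, -(lie_estimate + gamma * h_s))

# Example with a toy barrier on 1-D states: safe while |s| <= 1.
h = lambda s: 1.0 - np.abs(s)
print(finite_difference_cbf_violation(h, np.array([0.2]), np.array([0.3])))
```

Averaged over a dataset, such a penalty can be used as a training signal that discourages barrier decrease along observed transitions, which is one plausible way to "propagate safety information" from data.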
Benchmarking the Illusion: Measuring What Matters (Or Doesn’t)
The development of Safety Gymnasium offers researchers a standardized and comprehensive platform for evaluating reinforcement learning algorithms designed for safety-critical applications. This suite of environments, built on the MuJoCo physics engine, includes benchmark tasks featuring robotic locomotion – such as controlling the Hopper, HalfCheetah, and Ant – which present significant challenges in maintaining stability and avoiding unintended consequences. By providing a consistent set of scenarios, Safety Gymnasium allows for direct comparison of different algorithms, enabling a more rigorous assessment of their performance and safety characteristics. This standardized approach is crucial for accelerating progress in the field, as it reduces the variability introduced by differing environment implementations and facilitates reproducible research in the pursuit of robust and reliable autonomous systems.
Within this suite – which spans the frequently used Hopper, HalfCheetah, and Ant locomotion tasks – researchers can systematically assess methods like V-OCBF against challenging scenarios and well-defined metrics. The controlled setting enables rigorous comparisons, giving a detailed picture of each algorithm’s strengths and weaknesses in respecting constraints and maintaining stability. Furthermore, the inclusion of tasks modeled on real-world challenges, such as autonomous vehicle collision avoidance under Dubins dynamics, helps ensure that evaluations translate into practical relevance, driving progress towards deployable and trustworthy safe RL systems.
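For orientation, interacting with one of these environments follows the Safety-Gymnasium API, in which each step returns a scalar safety cost alongside the reward. The snippet below rolls out a random policy; the specific environment id is an assumption and may need adjusting to the installed version of the benchmark.

```python
import safety_gymnasium

# Roll out a random policy in a constrained locomotion task and accumulate the
# safety cost. The environment id below is an assumption for this sketch.
env = safety_gymnasium.make("SafetyHopperVelocity-v1")
obs, info = env.reset(seed=0)
total_cost = 0.0
for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, cost, terminated, truncated, info = env.step(action)
    total_cost += cost                      # accumulated constraint violations
    if terminated or truncated:
        obs, info = env.reset()
print("episode cost:", total_cost)
```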
The development of truly safe autonomous systems demands evaluation beyond simple simulations; therefore, researchers are increasingly utilizing environments modeled after real-world challenges. Specifically, tasks centered around Autonomous Ground Vehicle (AGV) collision avoidance, governed by Dubins Dynamics – a model capturing the vehicle’s kinematic constraints – offer a compelling testbed for safe control strategies. This approach moves beyond idealized scenarios, forcing algorithms to account for the non-holonomic nature of vehicle motion and the complexities of maneuvering in cluttered spaces. By evaluating performance within this realistic framework, developers can rigorously validate the robustness and reliability of their algorithms before deployment, ensuring the system’s ability to navigate safely and effectively in complex, dynamic environments. These AGV simulations provide a crucial bridge between theoretical advancements and practical, real-world application.
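The kinematics behind this task are simple to state: a Dubins vehicle moves at a forward speed and is steered only through its turn rate. The sketch below integrates these dynamics with a basic Euler step; the speed, time step, and integration scheme are assumptions of this illustration rather than the benchmark's exact parameters.

```python
import numpy as np

def dubins_step(state, turn_rate, v=1.0, dt=0.1):
    """Euler step of  x_dot = v cos(theta),  y_dot = v sin(theta),  theta_dot = u."""
    x, y, theta = state
    return np.array([
        x + v * np.cos(theta) * dt,
        y + v * np.sin(theta) * dt,
        theta + turn_rate * dt,
    ])

print(dubins_step(np.array([0.0, 0.0, 0.0]), turn_rate=0.5))
```

The non-holonomic constraint (no sideways motion, bounded turn rate) is what makes collision avoidance under these dynamics a meaningful stress test for a learned safety filter.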
The evaluation of novel safety-critical reinforcement learning algorithms requires robust benchmarks and established baselines. Algorithms such as BEAR-Lag, COptiDICE, and FISOR serve this purpose by providing crucial points of comparison and, importantly, demonstrate the viability of applying offline reinforcement learning to domains where safety is paramount. These methods leverage previously collected data – avoiding potentially dangerous online exploration – to train control policies. Evaluating them within the Safety Gymnasium suite allows researchers to quantify improvements in safety and performance against established offline RL approaches. By demonstrating a foundation of safe learning from static datasets, these baselines pave the way for more advanced algorithms, like V-OCBF, to further push the boundaries of safe and reliable control in complex environments.
Evaluations within the Safety Gymnasium demonstrate that the V-OCBF algorithm consistently maintains a near-zero rate of safety violations across a diverse set of benchmark environments, including the challenging Hopper, HalfCheetah, and Ant simulations. This performance represents a substantial advancement over existing methods, as baseline algorithms consistently exhibit significantly higher rates of unsafe behavior during operation. The ability to reliably prevent safety breaches – effectively minimizing potentially damaging or catastrophic outcomes – is a core strength of V-OCBF, and its consistently low violation rate underscores its potential for deployment in real-world applications where safety is paramount. This achievement isn’t simply about avoiding failures; it suggests a fundamentally more robust and dependable approach to reinforcement learning control, offering a pathway towards trustworthy autonomous systems.
Recent research demonstrates a significant advancement in reinforcement learning, revealing that high performance and robust safety are not mutually exclusive goals. The proposed methodology achieves cumulative reward levels comparable to existing state-of-the-art algorithms, but crucially, it does so while maintaining a demonstrably superior safety profile. This is a departure from traditional approaches where prioritizing safety often came at the cost of performance; instead, this work establishes that both objectives can be simultaneously optimized. By consistently minimizing safety violations across a suite of challenging environments, the method offers a promising pathway toward deploying reliable and trustworthy autonomous systems in real-world applications where safety is paramount, without sacrificing the ability to achieve desired task objectives.
Evaluations within the Autonomous Ground Vehicle task reveal that the V-OCBF algorithm consistently recovers a substantially larger “safe set volume” than alternative control strategies. This metric represents the region of states from which the vehicle can be certified to operate without violating safety constraints; a larger volume indicates a less conservative filter, one that does not needlessly restrict the vehicle’s operating envelope. In effect, V-OCBF allows the vehicle to maintain safe operation across a wider spectrum of scenarios and initial conditions. This characteristic is critical in real-world applications where unpredictable disturbances and modeling inaccuracies are commonplace, and it suggests the learned barrier preserves maneuvering freedom without sacrificing safety. The expanded safe set volume therefore provides evidence that V-OCBF offers a more flexible yet dependable control strategy for autonomous vehicles navigating complex environments.
Evaluations conducted with intentionally inaccurate dynamics models reveal the inherent robustness of the V-OCBF methodology. While real-world robotic systems are invariably subject to modeling errors – arising from imprecise sensor data, unmodeled friction, or unforeseen environmental factors – V-OCBF demonstrates a remarkably stable safety profile even under such conditions. Specifically, the algorithm experiences only a modest increase in safety violation rates when tested with perturbed dynamics, a performance level significantly exceeding that of competing methods. This resilience suggests that V-OCBF is well-suited for deployment in practical applications where discrepancies between the simulated and actual system behavior are inevitable, offering a pathway toward reliable and safe robotic control in complex and uncertain environments.

The pursuit of guaranteed safety, as outlined in this work with Value-Guided Offline Control Barrier Functions, often feels like chasing a phantom. The framework attempts to synthesize safety from offline data, a laudable goal, yet one riddled with assumptions about the completeness and representativeness of that data. It’s a clever application of control barrier functions and value function estimation, certainly, but one quickly discovers the limits of formal verification when confronted with the unpredictable nature of real-world deployments. As Edsger W. Dijkstra observed, “Program testing can be a very effective technique for showing the presence of bugs, but it can never show their absence.” This sentiment rings true; elegant theoretical guarantees quickly erode when faced with the sheer complexity of production systems and the inevitable corner cases the offline data simply didn’t cover.
What’s Next?
This work, predictably, shifts the problem. The elegance of synthesizing safety constraints from offline data will, at some point, encounter the inherent messiness of real-world deployments. Any claim of ‘learned’ safety is simply a statement about the dataset; a beautifully optimized filter is still brittle when faced with inputs outside its training distribution. The true test isn’t theoretical guarantees, but the inevitable failure modes production will expose. Expectile regression, while a pragmatic choice, merely postpones the need for a genuinely robust uncertainty quantification – a task which has consistently resisted neat solutions.
The immediate direction will likely involve expanding the scope of ‘offline’ to encompass more heterogeneous and noisy data. The current reliance on value functions, while computationally convenient, introduces another layer of approximation, and a potential source of cascading errors. A more fundamental challenge lies in formalizing the notion of ‘safety’ itself. The paper implicitly assumes a well-defined, static safety criterion; real-world systems rarely afford such simplicity. Consider the implications when the ‘barrier’ itself becomes a moving target, or when multiple, conflicting safety objectives must be balanced.
Documentation for these systems will, as always, be a collective self-delusion. The true specification resides not in the code or the paper, but in the observed behavior – and, crucially, in the bugs that are not reproducible. If a bug is reproducible, it implies a stable system – a rare and increasingly improbable state of affairs. The pursuit of ‘safe’ control will remain, therefore, a perpetual game of whack-a-mole, disguised as scientific progress.
Original article: https://arxiv.org/pdf/2512.10822.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/