Author: Denis Avetisyan
Researchers have developed a new reinforcement learning framework to improve the stability and reliability of bus fleet management in the face of real-world uncertainties.

RE-SAC disentangles aleatoric and epistemic risks using ensemble methods and regularization to prevent value collapse and ensure consistent service.
Maintaining consistent service in dynamic public transportation systems is challenged by inherent stochasticity and the difficulty of distinguishing between irreducible noise and insufficient data. This paper introduces ‘RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach’, a novel reinforcement learning framework designed to explicitly address both aleatoric and epistemic uncertainties in bus fleet control. By combining Integral Probability Metric-based regularization with a diversified Q-ensemble, RE-SAC mitigates value collapse and improves robustness in high-variability traffic conditions, achieving a significant performance gain over standard methods. Could this approach unlock more reliable and efficient autonomous control strategies for complex transportation networks?
The Inherent Stochasticity of Urban Transit
Conventional bus fleet control systems are fundamentally designed around the expectation of predictable operations, yet the reality of urban transit is rarely so orderly. Passenger demand fluctuates considerably throughout the day, influenced by events, weather, and even spontaneous needs, while traffic congestion introduces unpredictable delays due to accidents, road work, or simply peak-hour density. This constant interplay of stochastic variables creates a dynamic environment that challenges the efficacy of systems built on fixed schedules; buses may arrive at stops with uneven headways, leading to overcrowding, extended wait times, and diminished passenger satisfaction. Consequently, the inherent uncertainties of real-world operations necessitate more robust and adaptive control strategies capable of responding effectively to unforeseen circumstances and optimizing service despite fluctuating conditions.
Bus fleet control systems, when implemented with conventional reinforcement learning, often falter due to the unpredictable nature of real-world transit. Passenger numbers fluctuate, traffic patterns shift unexpectedly, and unforeseen delays become commonplace – these stochastic elements introduce considerable uncertainty. Standard reinforcement learning algorithms, designed for more stable environments, struggle to adapt to such variability, resulting in policies that are far from optimal. This translates directly into passenger dissatisfaction – increased wait times, overcrowded buses, and missed connections become frequent occurrences. Consequently, systems aiming to improve public transportation can inadvertently worsen the commuter experience when faced with the inherent randomness of urban mobility, highlighting the need for more robust and adaptive control strategies.
The efficacy of reinforcement learning in dynamic bus fleet control is surprisingly vulnerable to a subtle yet critical issue: Q-value poisoning. This phenomenon arises when seemingly adequate training yields agents that consistently misjudge the long-term value of different actions, producing suboptimal control policies. Even when an agent appears to learn effectively during training – achieving high reward signals – the underlying value estimates can be fundamentally flawed due to the inherent stochasticity of passenger demand and traffic. The agent therefore consistently underestimates or overestimates the true benefit of certain maneuvers – such as holding a bus at a stop or diverting from a route – resulting in poor real-world performance despite apparent mastery of the training environment. This gap between theoretical learning and practical application demands value function estimators that are robust to the unpredictable nature of urban transit systems.

RE-SAC: A Framework for Robust Bus Fleet Control
RE-SAC is a reinforcement learning framework designed to enhance the reliability of bus fleet control by integrating several complementary techniques. It combines robust optimization – which targets policy performance under adverse conditions – with ensemble learning to stabilize value estimation, and incorporates regularization to prevent overfitting and promote generalization. Together, these components aim to produce a bus fleet control system that is less susceptible to disruptions and more consistently achieves strong performance in real-world operations.
The diversified Q-ensemble within RE-SAC quantifies uncertainty in value estimation by employing quantile regression to predict multiple quantiles of the return distribution. This moves beyond single-point estimates, providing a range of possible values and their associated probabilities, which is crucial for assessing the reliability of policy performance in dynamic, real-world bus systems. By estimating the full distribution, the ensemble enables risk-aware decision-making and improved robustness against unpredictable events compared with methods that rely solely on expected values, yielding a more accurate evaluation of policy effectiveness in scenarios with high variability or limited data.
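The paper's exact ensemble loss is not reproduced here; the sketch below only illustrates the quantile (pinball) loss that underlies quantile-regression value estimation, with all numbers hypothetical.

```python
import numpy as np

def pinball_loss(pred, target, tau):
    """Quantile (pinball) loss: an asymmetric penalty whose minimizer
    is the tau-quantile of the target distribution."""
    err = target - pred
    return np.mean(np.maximum(tau * err, (tau - 1) * err))

# Hypothetical sampled returns for one state-action pair.
rng = np.random.default_rng(0)
returns = rng.normal(loc=10.0, scale=2.0, size=10_000)

# For tau = 0.5 the loss is minimized near the true median (~10.0).
losses = {q: pinball_loss(q, returns, tau=0.5) for q in (6.0, 10.0, 14.0)}
best = min(losses, key=losses.get)
```

Training one such head per quantile level yields the distributional estimate described above, rather than a single expected value.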
IPM weight regularization, implemented within the RE-SAC framework, is a regularization technique based on Integral Probability Metrics that improves the generalization and stability of the learned policies. It penalizes large weights in the neural networks used for value function approximation. By keeping weight magnitudes small, the method encourages smoother value functions that are less sensitive to minor variations in input data. This mitigates overfitting – the tendency of a model to perform well on training data but poorly on unseen data – and enhances the robustness of the bus fleet control system under unexpected operating conditions.
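The precise form of RE-SAC's IPM-based penalty is not given in this summary. One generic way to encourage the smoothness that IPM duality rewards (Wasserstein-1 is a supremum over 1-Lipschitz functions) is to penalize the product of layer spectral norms, which upper-bounds a network's Lipschitz constant. The sketch below is an illustration of that idea, not the paper's implementation; `lipschitz_penalty` and its coefficient are invented for this example.

```python
import numpy as np

def lipschitz_penalty(weights, coef=1e-3):
    """Penalize the product of layer spectral norms, an upper bound on
    the Lipschitz constant of a linear/ReLU network. Keeping this bound
    small keeps the learned value function smooth."""
    bound = 1.0
    for W in weights:
        bound *= np.linalg.norm(W, 2)  # largest singular value of the layer
    return coef * bound

# Hypothetical weights for a tiny two-layer value network.
rng = np.random.default_rng(1)
layers = [rng.normal(size=(16, 8)), rng.normal(size=(8, 1))]
penalty = lipschitz_penalty(layers)  # added to the TD loss during training
```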
The RE-SAC framework utilizes a Robust Markov Decision Process (RMDP) to develop bus fleet control policies optimized for worst-case performance. Unlike standard Markov Decision Processes, which assume fully known environments, the RMDP explicitly accounts for potential uncertainties and disturbances within the bus system. This approach prioritizes policy resilience by minimizing the maximum possible loss, rather than maximizing average reward. In performance evaluations, the RE-SAC framework achieved a cumulative reward of -0.4 × 10⁶, the highest performance among all baseline methods tested under identical conditions.
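The defining operation of an RMDP is a robust Bellman backup: take the worst case over an uncertainty set of transition models, then the best case over actions. A minimal sketch with two states, two actions, and two hypothetical candidate kernels (all numbers invented):

```python
import numpy as np

gamma = 0.9
V = np.array([1.0, 0.0])             # current value estimates for 2 states
r = np.array([[0.5, 1.0],            # r[s, a]: immediate rewards
              [0.2, 0.0]])

# Uncertainty set: K candidate transition kernels, P[k, s, a, s'].
P = np.array([
    [[[0.9, 0.1], [0.5, 0.5]],
     [[0.8, 0.2], [0.6, 0.4]]],
    [[[0.6, 0.4], [0.3, 0.7]],
     [[0.5, 0.5], [0.4, 0.6]]],
])

# Robust Bellman backup: min over models, max over actions.
q = r + gamma * np.einsum('ksat,t->ksa', P, V)  # Q[k, s, a]
V_new = q.min(axis=0).max(axis=1)               # worst-case-optimal values
```

Iterating this backup converges to a value function whose greedy policy hedges against the least favorable model in the set, which is the resilience property described above.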

Demonstrating Robustness and Exploration in Action
The Wasserstein distance, also known as Earth Mover’s Distance, is utilized within the Robust Markov Decision Process framework to quantify the dissimilarity between probability distributions representing agent states or policies. This metric enables a formal definition of robustness by measuring the minimum ‘cost’ required to transform one distribution into another. Empirical results demonstrate that the implementation of this robust optimization technique, leveraging the Wasserstein distance, achieves a reduction in Wasserstein Distance by a factor of 2 when compared to the Soft Actor-Critic (SAC) algorithm, indicating improved resilience to distributional shifts and enhanced policy stability.
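For one-dimensional empirical distributions with equally many samples, the Wasserstein-1 distance reduces to the mean absolute difference between sorted samples. A minimal sketch (the data are hypothetical, not from the paper):

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance between equal-sized samples:
    the mean absolute difference between their sorted values."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

rng = np.random.default_rng(0)
nominal = rng.normal(0.0, 1.0, size=5000)   # e.g. nominal headway deviations
shifted = rng.normal(0.5, 1.0, size=5000)   # the same process after a shift

d = wasserstein_1d(nominal, shifted)        # close to the mean shift of 0.5
```

Because it measures the "cost" of moving probability mass, this metric registers a pure location shift that overlap-based divergences can underreport, which is why it suits the robustness definition above.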
Maximum Entropy Reinforcement Learning (MERL) promotes exploration during the learning process by incorporating an entropy term into the reward function. This term incentivizes the agent to select actions that maximize not only the expected reward but also the randomness of those actions, effectively increasing the diversity of experiences sampled. By explicitly encouraging exploration, MERL mitigates the risk of premature convergence to suboptimal policies that can occur when an agent focuses solely on exploiting known rewards, thereby preventing the agent from becoming trapped in local optima and improving the likelihood of discovering globally optimal solutions.
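The entropy-augmented objective can be written as expected reward plus a temperature-weighted entropy bonus. A minimal sketch for a categorical policy (the policies and temperature are illustrative, not from the paper):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a categorical policy, in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

def soft_objective(expected_reward, policy, alpha=0.1):
    """Maximum-entropy objective: reward plus a weighted entropy bonus."""
    return expected_reward + alpha * entropy(policy)

greedy = [0.98, 0.01, 0.01]      # nearly deterministic policy
uniform = [1/3, 1/3, 1/3]        # maximally random policy

# With equal expected reward, the entropy term favors the more
# exploratory (uniform) policy, discouraging premature convergence.
j_greedy = soft_objective(1.0, greedy)
j_uniform = soft_objective(1.0, uniform)
```

The temperature `alpha` sets the exploration–exploitation trade-off; Soft Actor-Critic, the base algorithm here, optimizes exactly this kind of entropy-augmented return.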
L2 regularization, integrated within the Maximum Entropy Reinforcement Learning (MERL) framework, functions by adding a penalty term to the loss function proportional to the square of the magnitude of the policy parameters. This penalization discourages excessively large parameter values, effectively simplifying the learned policy and reducing its sensitivity to noise in the training data. Consequently, L2 regularization enhances the agent’s ability to generalize to unseen states and prevents overfitting, leading to improved performance and stability in environments with limited or noisy data. The strength of the regularization is controlled by a hyperparameter, λ, which balances the trade-off between model complexity and generalization ability.
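The mechanics are simple: the training loss gains a term λ times the sum of squared parameters. A minimal sketch with invented numbers:

```python
import numpy as np

def l2_regularized_loss(base_loss, params, lam=0.01):
    """Add an L2 penalty: lam * sum of squared parameter magnitudes."""
    return base_loss + lam * sum(np.sum(w ** 2) for w in params)

# Hypothetical parameters and a base (e.g. TD) loss of 0.5.
weights = [np.array([[3.0, -2.0]]), np.array([1.0])]
loss = l2_regularized_loss(0.5, weights, lam=0.01)
# 0.5 + 0.01 * (9 + 4 + 1) = 0.64
```

Larger λ shrinks the weights harder, trading fitting capacity for the smoother, more generalizable policies described above.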
RE-SAC builds upon the exploration benefits of Maximum Entropy Reinforcement Learning through the implementation of ensemble learning and robust optimization strategies. This combination yields a demonstrable improvement in Q-value estimation accuracy, particularly in infrequent states; RE-SAC achieves an Oracle Mean Absolute Error (MAE) of 1647 in rare states. This represents a significant reduction in estimation error compared to the Soft Actor-Critic (SAC) algorithm, which exhibits an Oracle MAE of 4343 in the same conditions, and outperforms the Distributional Soft Actor-Critic (DSAC) algorithm, which has an Oracle MAE of 5945.
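One intuition behind ensemble methods for disentangling risk types: spread across ensemble members signals epistemic uncertainty (too little data), which is exactly what inflates value errors in rare states. A minimal sketch with hypothetical Q-estimates, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Q-estimates from a 5-member ensemble for two states:
# a frequent state (members agree) and a rare state (members disagree).
q_frequent = 10.0 + 0.1 * rng.standard_normal(5)
q_rare = 10.0 + 3.0 * rng.standard_normal(5)

# Ensemble mean serves as the value estimate; ensemble spread is a
# proxy for epistemic uncertainty, shrinking as data accumulates.
epistemic_frequent = q_frequent.std()
epistemic_rare = q_rare.std()
```

A distributional head (as in the quantile ensemble above) captures the aleatoric part, so the two risk sources can be read off separately.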
The Impact of RE-SAC: Deploying Intelligent Bus Systems
Traditional bus scheduling often relies on static timetables and historical data, proving inadequate when confronted with the unpredictable realities of urban transit – unexpected traffic, fluctuating passenger demand, or even inclement weather. RE-SAC distinguishes itself by embracing these uncertainties, leveraging robust optimization and ensemble learning to dynamically adjust schedules in real-time. This proactive approach minimizes disruptions and dramatically reduces passenger wait times, as the system continuously recalibrates to maintain efficiency even amidst unforeseen events. Unlike conventional methods that struggle to adapt, RE-SAC actively anticipates and mitigates potential issues, creating a more reliable and responsive public transportation experience and demonstrating a quantifiable advantage in dynamic, real-world conditions.
RE-SAC dramatically reduces passenger wait times and boosts bus fleet efficiency through a synergistic combination of robust optimization and ensemble learning techniques. Traditional bus scheduling often struggles with real-world unpredictability – traffic fluctuations, varying passenger demand, and unexpected delays. RE-SAC addresses this by not simply predicting the most likely scenario, but by proactively accounting for a range of possibilities, ensuring schedules remain viable even under adverse conditions. The ensemble learning component further refines this process, leveraging multiple ‘intelligent agents’ to learn and adapt to dynamic urban environments, collectively identifying optimal routes and frequencies. This combined approach minimizes disruptions, distributes resources effectively, and ultimately delivers a more reliable and responsive public transportation experience, offering a quantifiable improvement over static or reactive scheduling systems.
The RE-SAC framework distinguishes itself through a design prioritizing seamless integration into diverse urban landscapes. Unlike rigid, pre-programmed systems, RE-SAC dynamically adjusts to the unique characteristics of each city – accounting for variations in road networks, population density, and peak travel times. This adaptability isn’t merely about accommodating existing conditions; the framework actively learns from real-time data, refining bus schedules and route allocations to proactively mitigate congestion. Simulations and initial deployments demonstrate that RE-SAC can significantly reduce passenger wait times and improve on-time performance, even in cities with historically challenging traffic patterns, offering a scalable solution for enhancing public transportation efficiency and accessibility.
The RE-SAC framework leverages the power of Deep Reinforcement Learning, specifically employing Actor-Critic methods like Soft Actor-Critic, to address the complexities of modern urban transportation. This approach doesn’t rely on pre-programmed schedules but instead learns optimal bus deployment strategies through continuous interaction with a simulated or real-world environment. Crucially, RE-SAC’s scalability stems from a demonstrated theoretical sample complexity of O(H³), where H represents the planning horizon; this means the computational effort required to learn an effective policy grows at a manageable rate even as the complexity of the transportation network increases. By intelligently balancing exploration and exploitation, the system adapts to fluctuating demand and unforeseen disruptions, ultimately offering a robust and efficient solution for minimizing congestion and improving passenger experiences in diverse urban landscapes.
The pursuit of stability in complex systems, as demonstrated by RE-SAC’s disentanglement of aleatoric and epistemic uncertainties, echoes a fundamental mathematical principle. The framework’s regularization techniques, designed to prevent value collapse and maintain service regularity, are akin to establishing invariants as one considers the limit of increasing complexity. As Brian Kernighan observed, “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” This sentiment applies to the design of robust algorithms; elegant solutions, grounded in provable principles, offer a path to predictable behavior even when confronted with the inherent randomness (the aleatoric uncertainty) and incomplete knowledge (the epistemic uncertainty) of real-world bus fleet control.
Beyond the Horizon
The presented framework, while demonstrating a pragmatic mitigation of uncertainty in a complex system, merely addresses a symptom. The true challenge lies not in managing aleatoric and epistemic risks, but in fundamentally reducing them at the source. Future work must prioritize the development of models capable of genuine predictive power, rather than relying on ensembles to bracket potential failures. The current approach, though elegant in its separation of concerns, feels akin to building a more refined weather vane instead of controlling the wind.
A crucial limitation resides in the inherent assumption of stationarity within the bus network. Real-world systems are, demonstrably, not static. The influence of external, unmodeled events (construction, accidents, even shifts in passenger behavior) remains largely unaccounted for. Therefore, extending this line of inquiry demands exploration of continual learning methodologies, capable of adapting to dynamic environments without succumbing to catastrophic forgetting or, ironically, value collapse.
Ultimately, the pursuit of robustness should not be conflated with the acceptance of imprecision. The elegance of an algorithm, one suspects, will not be judged by its ability to tolerate error, but by its capacity to eliminate it. The question, then, is not simply how to build a more resilient bus fleet control system, but whether a truly predictive system is, in fact, attainable – or merely a beautiful, asymptotic dream.
Original article: https://arxiv.org/pdf/2603.18396.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-21 05:00