The Repetition Curse: Overloading AI with Clever Prompts

Author: Denis Avetisyan


New research reveals how carefully crafted, repetitive requests can cripple large language models by exploiting imbalances in their internal routing mechanisms.

Expert workload distribution, measured as the percentage of tokens routed to each expert, differs sharply between the RepetitionCurse attack and a balanced baseline, showing how the adversarial prompt concentrates computational demand and resource usage on a small set of experts.

This paper demonstrates a denial-of-service vulnerability in Mixture-of-Experts models, leading to increased latency and potential information leakage under stress.

While Mixture-of-Experts (MoE) architectures offer parameter efficiency for scaling large language models, their reliance on expert parallelism introduces a critical vulnerability during inference. This paper, ‘RepetitionCurse: Measuring and Understanding Router Imbalance in Mixture-of-Experts LLMs under DoS Stress’, demonstrates that adversarial prompts, crafted using simple repetitive patterns, can exploit inherent router behavior to induce severe routing imbalance. This imbalance concentrates computation on a limited set of experts, amplifying inference latency by up to 3.063x and effectively enabling a denial-of-service attack. Could this routing fragility represent a fundamental limitation in the deployment of MoE models, and what novel defenses are needed to ensure robust and reliable service?


The Inevitable Scaling Bottleneck: Architecting for Adaptability

The relentless pursuit of increasingly capable large language models (LLMs) consistently returns to the Transformer architecture, yet scaling this foundational structure presents formidable obstacles. While the Transformer’s parallelizable nature initially facilitated rapid growth, its computational demands increase quadratically with sequence length – a prohibitive factor as models ingest longer contexts and process more complex data. This scaling limitation stems from the attention mechanism, which requires every token to be compared with every other token, creating a computational bottleneck. Consequently, simply increasing model size – adding more layers or parameters – yields diminishing returns and quickly becomes unsustainable due to the escalating costs of training and inference. Researchers are therefore exploring innovative techniques to circumvent these limitations, seeking methods to expand model capacity without a corresponding explosion in computational resources, acknowledging that the future of LLMs relies on overcoming these inherent scaling challenges.
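
As a rough, back-of-the-envelope illustration of that quadratic growth, the sketch below simply counts pairwise attention scores for a few assumed sequence lengths (the lengths, and the single-layer, single-head simplification, are illustrative choices, not figures from the paper):

```python
# Rough illustration of quadratic attention cost: each token attends to every
# other token, so pairwise score computations grow with the square of the
# sequence length. The lengths below are illustrative only.
for n in (1_024, 4_096, 16_384):
    pairwise_scores = n * n
    print(f"sequence length {n:>6}: ~{pairwise_scores:,} attention scores per layer per head")
```

Quadrupling the context length multiplies the score count by sixteen.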

The relentless pursuit of increasingly powerful large language models has led to exploration of architectural innovations beyond simply increasing the size of dense parameter sets. Mixture-of-Experts (MoE) presents a compelling solution by strategically distributing parameters across multiple “expert” sub-networks. Instead of activating the entire network for every input, MoE systems employ a “gating” mechanism that dynamically selects only a small subset of experts – often just two or three – to process each specific token or input segment. This selective activation drastically reduces computational demands during inference, allowing models with significantly more total parameters to operate within practical resource constraints. Consequently, MoE enables a pathway to scaling model capacity – and potentially improving performance – without incurring a proportional increase in computational cost, a crucial advancement for deploying increasingly sophisticated AI systems.
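
The following is a minimal sketch of that gating step, assuming a randomly initialized router and toy dimensions (the hidden size, the 16-expert count, and top-2 selection are illustrative choices, not the configuration of any specific model):

```python
import numpy as np

def topk_route(hidden_states: np.ndarray, router_weights: np.ndarray, k: int = 2):
    """Minimal top-k gating sketch: score every expert per token, keep the k best."""
    logits = hidden_states @ router_weights                 # (tokens, experts)
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]          # indices of the k highest scores
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    gates = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)              # softmax over the k winners only
    return topk_idx, gates

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 64))      # 8 tokens with a toy hidden size of 64
router = rng.normal(size=(64, 16))     # gating matrix for 16 experts
experts, weights = topk_route(tokens, router, k=2)
print(experts)   # each token is dispatched to only 2 of the 16 experts
```

Only the two selected experts run their feed-forward computation for a given token while the rest stay idle, which is where the decoupling of total parameters from per-token compute comes from.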

Successfully leveraging Mixture-of-Experts (MoE) architectures isn’t simply about adding more parameters; it demands careful attention to system-level challenges. While MoE theoretically unlocks greater model capacity with reduced computational demands, practical implementation often encounters performance bottlenecks stemming from data routing and communication overhead. Efficiently distributing the workload across numerous expert networks requires sophisticated load balancing strategies to prevent any single expert from becoming overwhelmed, while minimizing inter-expert communication is crucial for maintaining speed. Furthermore, irregular memory access patterns inherent in sparse activation – where only a subset of experts processes each input – can strain hardware and necessitate specialized infrastructure or algorithmic optimizations to truly realize the promised scaling benefits. Ultimately, the full potential of MoE models depends on overcoming these deployment hurdles and achieving optimal resource utilization.

A comparison of GPU event timelines reveals that balanced routing reduces computational latency within a single mixture-of-experts layer compared to imbalanced routing.

The Pursuit of Immediate Response: Optimizing for First Token Time

The prefill phase of large language model (LLM) inference represents the initial, parallel processing of the complete input sequence. This phase is a primary contributor to overall latency because all input tokens are processed simultaneously to generate the initial hidden states. The duration of the prefill phase is directly proportional to the input sequence length; longer sequences necessitate more computation during prefill. Consequently, optimizing prefill performance, through techniques like efficient attention mechanisms or model parallelism, is essential for reducing the total time required to generate a response, particularly for applications requiring fast response times.

Modern Large Language Model (LLM) inference engines, such as vLLM and SGLang, utilize a prefill-decoding separation strategy to improve hardware utilization and throughput. This approach decouples the initial, parallel processing of the entire input sequence – the prefill phase – from the subsequent, iterative generation of output tokens – the decoding phase. By separating these processes, engines can dedicate resources more efficiently; for example, prefill can leverage the full parallel processing capabilities of GPUs, while decoding can be optimized for lower latency token generation. This separation allows for concurrent prefill and decoding of different requests, increasing overall system throughput and reducing average latency compared to systems where these phases are strictly sequential.

Time To First Token (TTFT) represents the duration from the initial request submission to the generation of the first output token, serving as a primary metric for assessing the responsiveness of Large Language Model (LLM) services designed for interactive applications. Unlike total generation time, which is influenced by both prefill and decoding latency, TTFT specifically isolates the prefill stage – the processing of the entire input sequence – and is therefore a strong indicator of perceived latency by the user. Lower TTFT values directly correlate with a more fluid and engaging user experience, particularly in conversational AI, chatbots, and real-time text completion scenarios where immediate feedback is essential. Consequently, optimizing infrastructure and model parameters to minimize TTFT is a key objective for deploying performant interactive LLM services.
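
One way to instrument this is to time how long it takes for the first token to emerge from a streaming generator. The sketch below uses a hypothetical `fake_stream` stand-in whose prefill cost grows linearly with prompt length; it is not the API of vLLM, SGLang, or the paper's measurement harness:

```python
import time

def measure_ttft(stream):
    """Time To First Token: wall-clock time from request submission until the
    first output token arrives; `stream` is any iterator yielding tokens."""
    start = time.perf_counter()
    first_token = next(stream)          # blocks for the whole prefill phase
    return first_token, time.perf_counter() - start

def fake_stream(prompt_tokens: int, prefill_cost_per_token: float = 1e-4):
    """Hypothetical stand-in: prefill cost grows linearly with prompt length."""
    time.sleep(prompt_tokens * prefill_cost_per_token)   # simulated prefill
    while True:
        yield "token"                                    # simulated decoding

for n in (512, 4_096):
    _, ttft = measure_ttft(fake_stream(n))
    print(f"{n} prompt tokens -> TTFT of roughly {ttft * 1000:.0f} ms")
```

The longer prompt yields a proportionally larger TTFT, reflecting the prefill behaviour described above.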

The RepetitionCurse attack concentrates processing on a single GPU by routing all tokens to the same top-k experts, creating a performance bottleneck while other GPUs remain idle.

The Shadow of Imbalance: Vulnerabilities in Distributed Systems

Router imbalance in Mixture-of-Experts (MoE) models occurs when the routing mechanism disproportionately assigns tokens to a subset of available experts. This uneven distribution results in some experts being heavily utilized while others remain largely idle. Consequently, performance degrades because the computational benefits of MoE – parallelization and specialization – are diminished; overloaded experts become bottlenecks, increasing latency and reducing throughput. The efficiency gains expected from distributing the workload across multiple experts are lost, and the model’s overall capacity is effectively limited by the most burdened units.

Top-k routing, a common technique used in Mixture-of-Experts (MoE) models to distribute tokens to the most relevant experts, does not guarantee balanced load distribution. While designed to select the k most suitable experts for each token, it remains vulnerable to scenarios where a disproportionate number of tokens are consistently routed to a small subset of experts. This can occur due to biases in the input data or model weights, leading to expert imbalance. Adversarial actors can exploit this by crafting inputs specifically designed to target these overloaded experts, effectively creating a denial-of-service condition and hindering overall model performance. Even with careful tuning of the k parameter, Top-k routing alone is insufficient to prevent targeted overload and requires supplementary load balancing strategies.
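
The toy experiment below makes the failure mode concrete. A random router (dimensions, expert count, and top-2 setting are arbitrary assumptions) processes a varied prompt and a prompt consisting of a single token repeated over and over; because identical inputs produce identical router logits, every repeated token selects the same two experts:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, num_tokens, k = 64, 16, 1_024, 2
router = rng.normal(size=(d_model, num_experts))   # toy gating matrix

def expert_load(tokens: np.ndarray) -> np.ndarray:
    """Count how many token assignments each expert receives under plain top-k routing."""
    logits = tokens @ router
    topk = np.argsort(logits, axis=-1)[:, -k:]
    return np.bincount(topk.ravel(), minlength=num_experts)

varied_prompt = rng.normal(size=(num_tokens, d_model))                       # diverse tokens
repeated_prompt = np.tile(rng.normal(size=(1, d_model)), (num_tokens, 1))    # one token repeated

print("varied prompt, busiest expert:  ", expert_load(varied_prompt).max(),
      "of", num_tokens * k, "assignments")
print("repeated prompt, busiest expert:", expert_load(repeated_prompt).max(),
      "of", num_tokens * k, "assignments")
```

In the varied case no expert receives much more than its fair share of the 2,048 assignments, whereas the repeated prompt funnels half of all assignments into a single expert, which is precisely the overload pattern a RepetitionCurse-style prompt induces.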

Router imbalance in Mixture-of-Experts (MoE) models introduces vulnerabilities to Denial-of-Service (DoS) attacks. When token distribution is uneven, certain experts receive disproportionately high request volumes, leading to resource exhaustion and performance degradation for those specific experts. This effect can be amplified by attacks like RepetitionCurse, where prompts built from repeated tokens drive up prefill latency; testing indicates a potential latency amplification of up to 4.728x under specific conditions. Consequently, attackers can exploit this imbalance by directing traffic specifically to overloaded experts, effectively denying service to legitimate users by saturating their capacity and increasing overall response times.

The RepetitionCurse attack concentrates processing on a single GPU by routing all tokens to the same top-k experts, creating a performance bottleneck while other GPUs remain idle.

Forging Resilience: Strategies for Efficient MoE Deployment

Efficient deployment of Mixture-of-Experts (MoE) models hinges on a principle known as expert parallelism, a technique that strategically distributes individual experts – the specialized sub-networks within the larger model – across multiple devices, typically GPUs. This distribution is not merely organizational; it fundamentally addresses the memory constraints inherent in scaling these models. By avoiding the need to load the entire model onto a single device, expert parallelism unlocks the potential to train and deploy significantly larger networks. Crucially, this approach also minimizes the intensive data transfer – and associated latency – between GPUs that would otherwise occur if experts resided on the same device and needed to share activations. The result is a substantial reduction in inter-GPU communication, allowing for faster processing and improved scalability, especially when dealing with extremely large models and datasets.
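
The sketch below illustrates that placement logic under assumed sizes, with 16 experts sharded round-robin across 8 GPUs and synthetic routing decisions; it counts how many tokens the all-to-all dispatch would send to each device:

```python
import numpy as np

num_experts, num_gpus = 16, 8
# Static round-robin placement: each GPU stores num_experts / num_gpus expert weights.
expert_to_gpu = np.arange(num_experts) % num_gpus

def tokens_per_gpu(expert_assignments: np.ndarray) -> np.ndarray:
    """Given each token's chosen expert, count the tokens dispatched to each GPU
    (the payload of the all-to-all step in expert parallelism)."""
    return np.bincount(expert_to_gpu[expert_assignments], minlength=num_gpus)

rng = np.random.default_rng(0)
balanced_routing = rng.integers(0, num_experts, size=4_096)   # roughly uniform assignments
skewed_routing = np.zeros(4_096, dtype=int)                   # every token hits expert 0

print("balanced routing:", tokens_per_gpu(balanced_routing))  # work spread over 8 GPUs
print("skewed routing:  ", tokens_per_gpu(skewed_routing))    # one GPU does all the work
```

With balanced routing each GPU receives a similar share of tokens, while skewed routing piles all of the work onto one device, exactly the bottleneck pattern shown in the timeline figures.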

Fused kernels represent a significant performance enhancement in Mixture-of-Experts (MoE) models by streamlining computations within a single GPU. Instead of processing each expert’s contribution sequentially, these kernels jointly execute the operations for multiple experts that reside on the same device. This co-execution minimizes kernel launch overhead and maximizes hardware utilization by fully exploiting the GPU’s parallel processing capabilities. The technique effectively reduces memory access and data transfer, leading to substantial speedups compared to traditional, separate kernel executions. Consequently, fused kernels are crucial for realizing the full potential of MoE models, allowing for faster inference and training times while maintaining computational efficiency.
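
The numpy sketch below contrasts a per-expert loop with a single grouped computation. It is only a CPU-side analogy for what fused GPU kernels achieve, and the expert count and layer sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model, d_ff, tokens_per_expert = 4, 64, 256, 32

expert_weights = rng.normal(size=(num_experts, d_model, d_ff))        # one matrix per expert
expert_inputs = rng.normal(size=(num_experts, tokens_per_expert, d_model))

# Naive path: one matmul per expert, analogous to launching separate GPU kernels.
looped = np.stack([expert_inputs[e] @ expert_weights[e] for e in range(num_experts)])

# "Fused" path: a single grouped computation over all co-located experts.
fused = np.einsum('etd,edf->etf', expert_inputs, expert_weights)

assert np.allclose(looped, fused)   # identical results, far fewer launches on real hardware
print(fused.shape)                  # (num_experts, tokens_per_expert, d_ff)
```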

Effective deployment of Mixture-of-Experts (MoE) models necessitates a thorough consideration of fundamental scaling laws, most notably Amdahl’s Law. This principle dictates that the overall performance improvement achievable by parallelizing a task is limited by the sequential portion of that task; in MoE systems, this translates to the overhead associated with routing tokens to experts and combining their outputs. While expert parallelism and fused kernels significantly reduce computation time, these gains can be nullified if routing or reduction steps become bottlenecks. Consequently, optimizing these sequential components, alongside maximizing the parallelizable expert computations, is crucial for achieving true scalability. Moreover, understanding this relationship is vital for building resilient systems; attacks targeting the routing infrastructure can disproportionately impact performance, highlighting the need for redundancy and efficient fallback mechanisms within the sequential components to maintain overall system integrity.
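
A quick calculation makes the limit concrete; the parallel fractions and device count below are illustrative assumptions rather than measurements from the paper:

```python
def amdahl_speedup(parallel_fraction: float, num_workers: int) -> float:
    """Amdahl's Law: achievable speedup when only `parallel_fraction` of the work
    (the expert computation) parallelizes across `num_workers` devices, while
    routing and output combination remain sequential."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_workers)

# Illustrative fractions, not measured values: a 10% sequential routing/combine
# step caps the speedup of 64 devices at under 9x.
for p in (0.90, 0.99):
    print(f"parallel fraction {p:.2f}: speedup on 64 devices = {amdahl_speedup(p, 64):.1f}x")
```

Even a modest sequential share in routing and reduction dominates the achievable speedup, which is why an attack that inflates that sequential share is so damaging.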

A comparison of GPU event timelines reveals that balanced routing reduces computational latency within a single mixture-of-experts layer compared to imbalanced routing.

Towards Adaptive Intelligence: Monitoring and Mitigating Future Threats

A thorough understanding of Mixture-of-Experts (MoE) model behavior necessitates examining how experts are selected during inference. Analyzing the distribution of these selections using Entropy provides a quantifiable measure of load balance – a low Entropy score suggests a concentration of requests on a small number of experts, potentially creating bottlenecks and vulnerabilities. Conversely, a high Entropy score indicates more uniform distribution, implying efficient resource utilization. This metric isn’t merely descriptive; significant deviations from expected Entropy levels can signal malicious activity, such as an attacker deliberately targeting specific experts to induce denial-of-service or exploit weaknesses. Consequently, monitoring expert selection Entropy offers a proactive approach to identifying and mitigating potential issues within MoE deployments, ensuring both performance and security are maintained.
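
A minimal monitoring sketch along these lines is shown below; the expert counts are hypothetical, and measuring entropy in bits is just one reasonable convention:

```python
import numpy as np

def routing_entropy(expert_counts: np.ndarray) -> float:
    """Shannon entropy (in bits) of the expert-selection distribution.
    log2(num_experts) means perfectly balanced routing; values near zero
    mean traffic is concentrated on a handful of experts."""
    p = expert_counts / expert_counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

num_experts = 16
balanced = np.full(num_experts, 256.0)    # every expert sees 256 tokens
attacked = np.zeros(num_experts)
attacked[:2] = 2_048                      # all traffic funnelled to 2 experts

print(f"balanced routing: {routing_entropy(balanced):.2f} bits "
      f"(maximum is {np.log2(num_experts):.2f})")
print(f"attacked routing: {routing_entropy(attacked):.2f} bits")
```

A deployment could flag requests or time windows whose observed entropy falls far below the balanced maximum, which is the signature a RepetitionCurse-style prompt would leave.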

Comprehensive benchmarking serves as a critical foundation for securing Mixture-of-Experts (MoE) models against increasingly sophisticated attack vectors. Tools like LongBench enable researchers and developers to rigorously evaluate model performance under diverse and challenging conditions, revealing potential vulnerabilities that might otherwise remain hidden. This proactive approach to testing extends beyond simple accuracy metrics, encompassing latency, throughput, and resource utilization to identify areas susceptible to denial-of-service or performance degradation attacks. By systematically stressing the model with realistic workloads and adversarial inputs, benchmarking allows for the development and implementation of robust defense mechanisms, ensuring reliable and scalable inference even under malicious conditions. The identification of vulnerabilities, such as those demonstrated by the RepetitionCurse attack which causes significant latency amplification, highlights the necessity of continuous and thorough evaluation as MoE models become increasingly prevalent.

The pursuit of resilient and scalable Mixture-of-Experts (MoE) inference necessitates a shift towards adaptive routing strategies, capable of responding to fluctuating workloads and actively neutralizing denial-of-service threats. Recent investigations, exemplified by the RepetitionCurse attack, reveal a significant vulnerability: under an eight-GPU configuration, malicious input can induce latency amplification exceeding 150% at the MoE kernel level. This underscores the inadequacy of static routing approaches, which fail to account for adversarial patterns or uneven expert utilization. Consequently, future work should prioritize dynamic routing algorithms that intelligently distribute tokens based on real-time load balancing, predicted request patterns, and proactive identification of potentially exploitative inputs, ultimately fortifying MoE models against performance degradation and ensuring consistently reliable service.

A comparison of GPU event timelines reveals that balanced routing reduces computational latency within a single mixture-of-experts layer compared to imbalanced routing.

The study of routing imbalance within Mixture-of-Experts models reveals a fundamental truth about complex systems: even elegant architectures are susceptible to unforeseen vulnerabilities as time progresses. The repetitive prompts exploited in this research aren’t merely attacks; they’re accelerants of inherent systemic decay. As Brian Kernighan observed, “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” This applies directly to MoE models; the very cleverness of expert parallelism creates a fragile equilibrium, exposed by seemingly innocuous repetition. The amplification of latency isn’t a bug, but rather the system revealing its limits under stress, a predictable consequence of increasing complexity and a testament to the arrow of time always pointing toward refactoring.

The Inevitable Tilt

The demonstrated susceptibility of Mixture-of-Experts systems to routing imbalance isn’t a flaw so much as a predictable consequence of complex architecture. Every architecture lives a life, and this one reveals its aging process under stress. The paper highlights a vulnerability, certainly, but also illuminates a fundamental truth: optimization for average-case performance often obscures brittle points susceptible to adversarial exploitation. The latency amplification observed isn’t merely a denial-of-service vector; it’s a signal of systemic strain, a measure of how far the system has drifted from its initial equilibrium.

Future work will likely focus on mitigation – more robust routing algorithms, request shaping, and perhaps even adversarial training to inoculate the system against these repetitive attacks. However, these are palliative measures. The underlying problem, the inherent difficulty of maintaining balance in a massively parallel system, will persist. Improvements age faster than one can understand them, and what appears as stability today will inevitably succumb to new forms of pressure.

The side-channel implications, though briefly touched upon, deserve further scrutiny. The information leakage exposed by this imbalance is a reminder that even seemingly innocuous performance characteristics can reveal sensitive details about the model and its internal state. The long game isn’t about preventing all attacks; it’s about understanding the contours of decay and anticipating the inevitable shifts in the system’s center of gravity.


Original article: https://arxiv.org/pdf/2512.23995.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
