Squeezing More Performance from Language Models
![Fusing key-value (KV) cache blocks reduces the computational footprint during batch decoding, and efficiency improves further by reusing computation across unified representations of data chunks, illustrated here by the shared computation of chunks 0, 1, and 2, which minimizes redundant matrix operations through KV cache management.](https://arxiv.org/html/2601.03067v1/figures/assets/shared_chunks.png)
A new technique efficiently compresses and reuses memory caches, significantly boosting the speed and scalability of large language model serving.
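To make the chunk-sharing idea in the figure concrete, here is a minimal sketch (names such as `ChunkKVCache` and `fake_compute_kv` are hypothetical, and the real system's cache layout and eviction policy will differ): KV blocks are keyed by chunk content, so identical chunks across batched requests are computed only once.

```python
import hashlib
from typing import Dict, List, Tuple

class ChunkKVCache:
    """Hypothetical content-addressed store of per-chunk KV blocks."""

    def __init__(self):
        self._store: Dict[str, Tuple[list, list]] = {}  # chunk hash -> (K, V)

    def _key(self, chunk_tokens: List[int]) -> str:
        return hashlib.sha256(str(chunk_tokens).encode("utf-8")).hexdigest()

    def get_or_compute(self, chunk_tokens, compute_kv):
        """Reuse the KV block for an identical chunk; compute it only once."""
        key = self._key(chunk_tokens)
        if key not in self._store:
            self._store[key] = compute_kv(chunk_tokens)
        return self._store[key]

    def __len__(self):
        return len(self._store)


def fake_compute_kv(chunk_tokens):
    # Stand-in for the attention projections; returns dummy K and V lists.
    return ([t * 2 for t in chunk_tokens], [t * 3 for t in chunk_tokens])


if __name__ == "__main__":
    cache = ChunkKVCache()
    # Two batched requests whose prompts share chunks 0, 1, and 2.
    request_a = [[1, 2], [3, 4], [5, 6], [7, 8]]
    request_b = [[1, 2], [3, 4], [5, 6], [9, 10]]
    for request in (request_a, request_b):
        kv_blocks = [cache.get_or_compute(c, fake_compute_kv) for c in request]
    print(len(cache))  # 5 distinct chunks computed, not 8
```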

A new deep learning framework intelligently combines code’s meaning and structure to pinpoint security flaws with improved accuracy.
As the Internet of Things expands, so does the need for robust, yet efficient, security solutions tailored for resource-constrained embedded systems.

New research pinpoints the key to bolstering question answering systems against adversarial manipulation, bridging the gap between clean and attacked performance.
![Dynamic quantization for encoder-decoder automatic speech recognition models addresses error propagation with a calibration method that uses layer-wise scaling factors $\alpha_{\ell}$, computed from error indicators, to correct the update direction. This refines standard post-training quantization (Eq. 1) by calibrating the encoder with audio data and the decoder with text together with quantized encoder outputs, as defined in Eq. 9.](https://arxiv.org/html/2601.02455v1/x3.png)
New research tackles the challenges of compressing automatic speech recognition models without sacrificing accuracy, focusing on how errors accumulate during quantization.
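As a rough illustration of the layer-wise calibration idea in the caption, the sketch below uses plain NumPy with hypothetical names; the scaling rule shown is a generic least-squares error indicator, not the paper's exact Eq. 9. It quantizes a layer's weights and then fits a per-layer factor $\alpha_{\ell}$ on calibration inputs to reduce the quantized output error.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Standard symmetric post-training quantization of a weight matrix."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q, scale

def calibrated_dequantize(w: np.ndarray, x: np.ndarray):
    """Dequantize with a layer-wise correction factor alpha_l.

    alpha_l is fit here by least squares to minimize the output error
    ||x @ w - alpha_l * x @ w_hat||^2 on calibration inputs x; this is a
    generic error-indicator rule, not the paper's exact formula.
    """
    q, scale = quantize_int8(w)
    w_hat = q * scale
    y_ref, y_q = x @ w, x @ w_hat
    alpha = float((y_ref * y_q).sum() / (y_q * y_q).sum())
    return alpha * w_hat, alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 64))
    x = rng.normal(size=(16, 64))  # calibration batch, e.g. audio features for the encoder
    w_corr, alpha = calibrated_dequantize(w, x)
    q, scale = quantize_int8(w)
    plain_err = np.linalg.norm(x @ w - x @ (q * scale))
    corr_err = np.linalg.norm(x @ w - x @ w_corr)
    print(f"alpha_l = {alpha:.4f}, output error plain vs corrected: {plain_err:.3f} vs {corr_err:.3f}")
```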

Researchers explore how quantum-enhanced neural networks can optimize contextual bandit algorithms, potentially offering performance gains with reduced computational demands.

New research reveals that quantum neural networks can match or exceed the robustness of traditional methods in noisy healthcare speech applications.
Researchers have discovered a powerful connection between shifted Yangians and the critical cohomology of quiver varieties, offering new insights into representation theory and quantum geometry.
![Current wireless systems rely on layered error correction, HARQ at the MAC layer and ARQ at the RLC layer, both dependent on feedback loops. Forward erasure correction, such as network coding, offers a path to significantly lower latency by preemptively protecting against potential errors rather than reacting to them.](https://arxiv.org/html/2601.01645v1/Figures/Final/intro_simplified.png)
A new look at error correction techniques reveals how network coding can dramatically improve latency and efficiency in next-generation wireless networks.
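To ground the forward erasure correction idea, the sketch below implements single-parity XOR coding over a block of equal-length packets, a simple stand-in for the network coding schemes discussed rather than the paper's specific design: the sender adds one redundant packet up front, and the receiver recovers any single lost packet without waiting on a feedback loop.

```python
from functools import reduce
from typing import List, Optional

def make_parity(packets: List[bytes]) -> bytes:
    """XOR all data packets into one repair packet (simple forward erasure code)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets)

def recover(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Recover at most one missing packet by XORing the parity with the survivors."""
    missing = [i for i, p in enumerate(received) if p is None]
    if not missing:
        return received
    if len(missing) > 1:
        raise ValueError("a single-parity code can only repair one erasure")
    survivors = [p for p in received if p is not None] + [parity]
    received[missing[0]] = make_parity(survivors)
    return received

if __name__ == "__main__":
    data = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]
    parity = make_parity(data)
    lost = [data[0], None, data[2], data[3]]  # packet 1 erased in transit
    print(recover(lost, parity))              # packet 1 restored, no retransmission needed
```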
Researchers have developed Bithoven, a formally verified language designed to make Bitcoin smart contracts both safer and easier to develop.