Author: Denis Avetisyan
A new study reveals that current information retrieval systems struggle significantly when faced with queries that blend multiple languages, exposing a critical weakness in real-world multilingual search.

Researchers benchmark the performance of existing systems on code-switched queries and identify limitations in current cross-lingual embeddings and retrieval methods.
Despite advances in multilingual natural language processing, current information retrieval systems struggle with the complexities of code-switching: the natural mixing of languages within a single conversation or document. This limitation is addressed in ‘Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers’, which presents a comprehensive analysis demonstrating that code-switching significantly degrades retrieval performance across various model architectures. Through the introduction of new benchmarks and extensive evaluation, the study reveals a substantial divergence in embedding spaces between monolingual and code-switched text, leading to performance declines of up to 27% on diverse tasks. These findings underscore a critical gap in current IR capabilities and raise the question of how to develop truly robust and representative multilingual models capable of handling the nuances of real-world language use.
The Inevitable Babel: Decoding Multilingual Search
Conventional information retrieval systems, designed primarily for monolingual queries, are increasingly challenged by the growing prevalence of multilingual search behavior. These systems typically rely on language-specific processing techniques, which struggle to effectively interpret queries that seamlessly blend multiple languages – a phenomenon known as code-switching. This creates a significant bottleneck, as the systems often fail to accurately identify relevant documents when confronted with mixed-language input, leading to diminished search results and a frustrating user experience. The core issue lies in the difficulty of simultaneously understanding the nuances of various languages within a single query, a task that demands more sophisticated linguistic analysis and cross-lingual understanding than most current IR architectures possess.
This code-switching behavior is common among multilingual populations and reflects natural communication styles rather than an edge case. Because conventional retrieval systems are built with a single language in mind, they struggle to parse the nuanced meaning embedded in mixed-language queries and often fail to retrieve relevant documents written in any of the languages a query contains. The result is diminished search quality and a frustrating user experience, underscoring the need for approaches that can process genuinely multilingual search behavior.
The escalating demand for information access across linguistic boundaries has exposed critical limitations in current Information Retrieval (IR) systems. While designed to efficiently process queries in a single language, these systems struggle with the increasing prevalence of code-switching – the practice of blending multiple languages within a single search – and genuinely multilingual information needs. Recent evaluations demonstrate a significant performance decline, up to 27% across diverse tasks and models, when confronted with such complexity. This degradation highlights a pressing need for more robust IR architectures capable of accurately interpreting and retrieving information from documents and queries expressed in multiple languages, suggesting a critical gap between existing technology and the realities of global information seeking.
Dense Vectors: A Foundation for Semantic Drift
Dense retrieval methods address information access by transforming both queries and documents into dense vector representations, typically utilizing neural networks. This contrasts with sparse retrieval techniques like TF-IDF or BM25 which rely on keyword matching. By embedding text into a continuous vector space, semantic similarity can be determined through calculations like cosine similarity or dot product, enabling the retrieval of documents conceptually related to a query, even if they lack shared keywords. The computational efficiency of these similarity calculations is significantly enhanced through the use of approximate nearest neighbor search algorithms and specialized hardware acceleration, allowing for rapid retrieval from large document collections. These vector representations capture contextual information, addressing limitations inherent in lexical matching approaches.
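To make the idea concrete, here is a minimal, self-contained sketch of dense retrieval under cosine similarity. A character-bigram hashing function stands in for a neural encoder; `embed`, `retrieve`, and the example documents are illustrative assumptions, not artifacts of the paper:

```python
import math

def embed(text):
    # Toy embedding: hash character bigrams into a fixed-size dense vector,
    # then L2-normalize. A real system would use a neural encoder.
    dim = 64
    vec = [0.0] * dim
    low = text.lower()
    for a, b in zip(low, low[1:]):
        vec[hash(a + b) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(u, v):
    # Vectors are unit-normalized, so the dot product is cosine similarity.
    return sum(a * b for a, b in zip(u, v))

def retrieve(query, docs, k=2):
    # Rank documents by similarity to the query in the shared vector space.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = ["dense retrieval with vectors",
        "keyword matching with BM25",
        "cats and dogs"]
top = retrieve("vector retrieval", docs, k=1)
```

In production, the document vectors would be precomputed and indexed with an approximate nearest neighbor structure rather than scored exhaustively.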
Several embedding models exhibit robust performance in multilingual semantic retrieval tasks. Specifically, e5-large-v2, mE5-large, bge-m3, and Arctic-Embed-m-v2.0 have been benchmarked to effectively encode text across multiple languages into dense vector representations. These models consistently achieve high recall rates on cross-lingual information retrieval datasets, indicating their ability to capture semantic similarity irrespective of language. Performance gains are attributed to pre-training on extensive multilingual corpora and the utilization of techniques such as masked language modeling and translation ranking, enabling the models to learn cross-lingual representations and generalize effectively to unseen languages and queries.
Contrastive learning is a key technique for training dense embedding models used in information retrieval. This approach learns representations in which similar inputs are pulled closer together in vector space while dissimilar inputs are pushed apart. InfoNCE, a widely used contrastive loss built on noise-contrastive estimation, frames the task as a discrimination problem: given a query, the model must distinguish the relevant document from a set of distractor, or negative, examples. The loss encourages high similarity scores between queries and their corresponding relevant documents and low scores between queries and irrelevant documents. This process effectively teaches the model to encode semantic meaning, enabling accurate retrieval even when queries and documents share no lexical overlap.
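A minimal sketch of the InfoNCE objective, assuming dot-product similarity and a single positive per query; the toy vectors and the `info_nce` helper are illustrative, not the paper's training setup:

```python
import math

def info_nce(query_vec, pos_vec, neg_vecs, temperature=0.05):
    # InfoNCE: softmax cross-entropy where the positive document must be
    # discriminated from the negatives; lower loss = positive ranked higher.
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    logits = [dot(query_vec, pos_vec) / temperature]
    logits += [dot(query_vec, n) / temperature for n in neg_vecs]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

q = [1.0, 0.0]
good = [0.9, 0.1]   # embedding similar to the query
bad = [[0.0, 1.0]]  # dissimilar negative

loss_good = info_nce(q, good, bad)       # positive is the similar doc
loss_bad = info_nce(q, bad[0], [good])   # swapped: positive is the wrong doc
```

Minimizing this loss over many (query, positive, negatives) triples is what shapes the embedding space for retrieval.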

Late Binding: Delaying the Inevitable Mismatch
Late interaction architectures, exemplified by ColBERT v2, depart from traditional dense retrieval methods by postponing the interaction between query and document representations until a later stage in the retrieval process. Traditional methods typically compute dense embeddings for both queries and documents independently, then rely on efficient similarity search for initial ranking. In contrast, ColBERT v2 represents both queries and documents as a collection of contextualized embeddings, enabling a more fine-grained and nuanced comparison. This late interaction allows the model to consider the semantic relationships between individual tokens in the query and document, leading to improved retrieval accuracy as compared to early interaction or coarse-grained matching techniques. The architecture facilitates a deeper understanding of context and semantic relevance, thereby enhancing the ability to identify truly relevant documents.
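The token-level comparison described above can be sketched as a ColBERT-style MaxSim sum: each query token embedding is matched against its best-scoring document token, and the maxima are summed. The 2-d token vectors below are toy values for illustration, not outputs of ColBERT v2:

```python
def maxsim_score(query_tokens, doc_tokens):
    # Late interaction: for each query token, take the maximum similarity
    # over all document tokens, then sum these maxima.
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy contextualized token embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.9]]  # covers both query tokens well
doc_b = [[0.9, 0.1], [0.8, 0.2]]  # matches only the first query token
```

Because scoring decomposes over tokens, a document that covers every query token (`doc_a`) outranks one that only matches part of the query (`doc_b`), even if the latter matches that part strongly.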
Cross-encoder rerankers operate on the initial set of documents retrieved by a first-stage retrieval model, and re-score them based on a more detailed comparison of the query and each document. Models such as jina-reranker-v3, bge-reranker-v2-m3, and the Qwen3-Reranker series (0.6B, 4B, and 8B parameters) achieve this by processing the query and document together as a single input, allowing for complex interactions between the two. This contrasts with initial retrieval methods which typically encode query and documents independently. The re-scoring process aims to identify the most relevant documents that may have been overlooked or under-ranked during the initial, faster retrieval phase, thereby improving overall retrieval performance.
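A hedged sketch of this two-stage pipeline: a cheap first-stage scorer prunes the collection, then a stand-in for a cross-encoder re-scores query and document jointly. All helpers here (`first_stage`, `rerank`, `toy_cross_score`) are hypothetical illustrations, not the APIs of the named rerankers:

```python
def first_stage(query, docs, k=3):
    # Cheap stage: score by word overlap (stand-in for an independent
    # bi-encoder over a large collection).
    def overlap(d):
        return len(set(query.split()) & set(d.split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def toy_cross_score(query, doc):
    # Stand-in for a cross-encoder forward pass over the concatenated
    # (query, document) pair: rewards exact phrase match plus word overlap.
    return (query in doc) * 2 + len(set(query.split()) & set(doc.split()))

def rerank(query, candidates, cross_score):
    # Expensive stage: jointly score each surviving candidate with the query.
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)

docs = ["cheap flights to paris",
        "paris travel guide cheap deals",
        "cooking pasta at home"]
candidates = first_stage("cheap flights", docs, k=2)
reranked = rerank("cheap flights", candidates, toy_cross_score)
```

The key design point is cost asymmetry: the joint scorer sees rich query-document interactions but is only run on the handful of candidates the fast stage lets through.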
Combining late interaction architectures with cross-encoder rerankers demonstrably improves information retrieval performance on benchmark datasets such as BEIR and BRIGHT. However, performance gains are not consistent across all tasks; specifically, utilizing Qwen3-Embedding-0.6B for CS-MTEB reranking tasks can substantially reduce overall accuracy, dropping from a score of 63.09 to 37.33. This indicates that embedding quality and task-specific characteristics significantly influence the effectiveness of these combined techniques and necessitate careful consideration when deploying them in practical applications.
The Mirage of Accuracy: Benchmarks and the Real World
Rigorous evaluation of information retrieval (IR) systems necessitates benchmarks specifically designed for code-switched queries – instances where searchers seamlessly blend multiple languages within a single request. Datasets like CSR-L and CS-MTEB fulfill this crucial role, offering a standardized means to measure performance degradation when systems encounter this increasingly common real-world search behavior. These benchmarks aren’t merely academic exercises; they provide quantifiable insights into how well IR models generalize beyond monolingual data, pinpointing vulnerabilities that might otherwise go unnoticed. By exposing these limitations, CSR-L and CS-MTEB drive innovation in techniques aimed at enhancing cross-lingual search capabilities and ensuring equitable access to information for a diverse user base.
Evaluations leveraging benchmarks such as CSR-L and CS-MTEB, constructed with tools like MiMo-V2-Flash, demonstrate a crucial gap in information retrieval system performance when faced with the complexities of real-world user search habits. These datasets aren’t simply synthetic; they’re designed to mirror how individuals actually query information, often seamlessly blending multiple languages within a single search. Analyses reveal that this practice, known as code-switching, presents a significant challenge for current models, routinely inducing performance declines of up to 27% across a range of tasks – from simple keyword matching to more nuanced semantic understanding. This substantial degradation highlights the necessity for targeted research and development of techniques capable of effectively processing and interpreting code-switched queries, ensuring equitable access to information for a multilingual user base.
The increasing prevalence of code-switching poses a significant challenge to information retrieval systems. Current evaluations show substantial performance declines on mixed-language inputs: English-focused bi-encoders lose up to 15 percentage points on datasets such as AILACaseDocs, Touché 2020, and TRECCOVID. This degradation highlights the need for specialized techniques; methods like Vocabulary Expansion aim to improve model coverage of diverse linguistic combinations. As code-switching becomes increasingly common in online search behavior, the development and implementation of such strategies are crucial for maintaining effective and accurate information access for a global user base.
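Vocabulary expansion can be illustrated with a minimal sketch: unseen tokens (for example, subwords from the second language in code-switched text) are appended to the vocabulary, and their embedding rows are warm-started from the mean of existing vectors. The `expand_vocab` helper and the mean-initialization heuristic are assumptions for illustration, not the paper's method:

```python
def expand_vocab(vocab, embeddings, new_tokens):
    # Append unseen tokens to the vocabulary and initialize each new
    # embedding row as the mean of the existing rows (a common warm-start
    # heuristic when resizing an embedding matrix).
    mean = [sum(col) / len(embeddings) for col in zip(*embeddings)]
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            embeddings.append(list(mean))
    return vocab, embeddings

vocab = {"the": 0, "search": 1}
emb = [[0.25, 0.5], [0.75, 0.0]]
vocab, emb = expand_vocab(vocab, emb, ["búsqueda"])
```

After expansion the model would typically be fine-tuned so the new rows move from their generic initialization toward useful positions in the shared space.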
The pursuit of seamless information access, as this paper details with its stark performance drops in code-switched queries, feels less like engineering and more like gardening. One anticipates inevitable decay, even in the most carefully cultivated systems. Andrey Kolmogorov observed, “The most important discoveries often come from asking the wrong questions.” This sentiment resonates deeply; current information retrieval approaches, optimized for monolingual contexts, implicitly ask the wrong question when confronted with the organic, unpredictable nature of code-switching. The benchmarks presented aren’t failures, but rather honest admissions that the existing architectural prophecies – of clean language boundaries and predictable query structures – are demonstrably false. Every deploy, then, is a small apocalypse, revealing the limitations of the current garden.
What’s Next?
The observed performance degradation is not a failing of retrieval systems, but a predictable symptom. Long stability in benchmark evaluations lulls the field into believing progress is linear, obscuring the inevitable collision with real-world complexity. Code-switching is not an edge case to be ‘handled’; it is the natural state of multilingual communication. To treat it as a deviation from some idealized monolingual norm is to build systems destined to fracture under the weight of actual usage.
The current emphasis on multilingual embeddings, while valuable, addresses only a surface-level symptom. The problem isn’t merely representation, but the fundamental architecture of information access. Systems designed to dissect queries into discrete, labelled components will always struggle with the fluid, interwoven nature of code-switching. Future work should not focus on ‘fixing’ existing retrievers, but on cultivating entirely new approaches: systems that expect ambiguity, that thrive on the unexpected interplay of languages.
The true measure of progress will not be higher scores on curated benchmarks, but the graceful degradation of performance as systems encounter increasingly complex, authentic queries. The goal isn’t perfect retrieval, but resilient evolution. The system doesn’t fail when it misinterprets a query; it learns. And that learning, ultimately, is the only metric that truly matters.
Original article: https://arxiv.org/pdf/2604.17632.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-23 02:39