Can Language Models Handle a Reworded Question?

Author: Denis Avetisyan


New research introduces a benchmark and metric to test how well AI systems maintain consistent answers when faced with paraphrased queries.

The system employs a defined prompting strategy to elicit paraphrased questions, ensuring consistent formatting of the output and facilitating controlled linguistic variation.

The RoParQ benchmark and XParaCon metric evaluate and enhance the robustness of large language models to paraphrasing, demonstrating improved consistency through supervised fine-tuning.

Despite advances in scale, Large Language Models (LLMs) often struggle with semantic consistency, exhibiting variable performance when presented with paraphrased questions. To address this limitation, we introduce RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions, a new benchmark accompanied by a metric, XParaCon, designed to rigorously evaluate cross-paraphrase consistency. Our work demonstrates that targeted supervised fine-tuning, focused on paraphrase awareness, significantly enhances model robustness, enabling smaller models to achieve consistency levels comparable to those of much larger pre-trained counterparts. Can this approach pave the way for more reliable and genuinely understanding LLMs, less susceptible to superficial variations in input phrasing?


The Fragility of Statistical Reasoning

Despite their impressive ability to generate human-like text, Large Language Models demonstrate a curious fragility in reasoning. Subtle alterations to a prompt – a change in wording, the addition of seemingly irrelevant details, or even minor grammatical shifts – can dramatically alter the model’s output, leading to inconsistent or incorrect conclusions. This sensitivity isn’t a matter of nuanced understanding being disrupted, but rather a reflection of the models’ core functioning: they excel at identifying statistical patterns in language, not at grasping underlying meaning. Consequently, a slight rephrasing can disrupt the learned correlations, causing the model to stumble where a human would easily maintain logical consistency. This poses a significant challenge for deploying these models in applications demanding reliability, as even minor variations in user input can undermine their performance and limit their practical utility.

Large Language Models, despite their impressive ability to generate human-like text, often demonstrate a fundamental weakness: a reliance on statistical correlations rather than genuine semantic comprehension. These models excel at identifying patterns within vast datasets, allowing them to predict the most probable sequence of words, but this approach lacks the nuanced understanding of meaning that characterizes human cognition. Consequently, even minor alterations in phrasing – those that do not change the underlying intent – can drastically alter a model’s output, revealing a fragility that limits its reliability in practical applications. This dependence on surface-level patterns, rather than deep reasoning, hinders performance in tasks requiring generalization, common sense, or adaptability to unforeseen contexts, ultimately restricting the robustness of these models beyond controlled laboratory settings.

Current evaluations of Large Language Models (LLMs) frequently present an optimistic, yet potentially misleading, picture of their true capabilities. Standard benchmarks, while useful for initial progress tracking, often prioritize performance on carefully curated datasets with limited linguistic variation. This narrow focus fails to expose the surprising fragility of these models; slight alterations in phrasing – synonymous substitutions or reordering of clauses – can lead to drastically different, and sometimes incorrect, outputs. Consequently, reported scores on these benchmarks may overestimate an LLM’s robustness in handling the messy, ambiguous language encountered in real-world applications, creating a disconnect between perceived performance and actual reliability. A more nuanced assessment, incorporating diverse and adversarial examples, is crucial for accurately gauging the limitations of these powerful, yet imperfect, systems.

This example demonstrates the LLM's vulnerability to producing incorrect responses when presented with paraphrased questions.

Introducing RoParQ: A Benchmark for Logical Consistency

RoParQ is a newly developed benchmark dataset intended to provide a robust evaluation of cross-paraphrase consistency in Large Language Models (LLMs). Unlike typical benchmarks that assess performance on a single question phrasing, RoParQ specifically measures a model’s ability to maintain answer stability when presented with multiple paraphrased versions of the same underlying question. The dataset is designed to be challenging for current LLMs and aims to highlight inconsistencies that might not be apparent in standard evaluations. It provides a quantifiable metric for assessing how reliably a model reasons and responds, regardless of superficial changes in question wording.

The RoParQ benchmark is built by programmatically generating paraphrased versions of questions sourced from five established datasets: CommonsenseQA, MathQA, ARC, MMLU, and UnifiedMCQA. This construction method involves applying diverse paraphrasing techniques to the original questions, creating multiple variations that retain the same semantic meaning but differ in phrasing. The resulting dataset consists of question sets – each set containing the original question and its paraphrases – designed to specifically test whether a model provides consistent answers across these variations. The inclusion of questions from diverse domains and reasoning types within these datasets ensures that the consistency tests are challenging and broadly applicable to various LLM capabilities.

Traditional evaluations of Large Language Models (LLMs) often assess performance on a single phrasing of a question, failing to account for potential instability in responses to semantically equivalent variations. RoParQ addresses this limitation by explicitly measuring cross-paraphrase consistency; the benchmark presents multiple paraphrased versions of the same question and evaluates whether the LLM provides the same answer across all versions. This approach goes beyond simply determining if an answer is correct; it assesses the reliability of the model’s reasoning process by quantifying the frequency with which it arrives at the same conclusion regardless of superficial changes in input phrasing. Consistency is measured as the percentage of question sets where the model provides a uniform response, providing a direct metric for answer stability.
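The consistency measure described above can be sketched in a few lines. Assuming each question set is recorded as the list of answers a model gave across the paraphrases of one question (the function and data names here are illustrative, not taken from the paper):

```python
def set_consistency(question_sets):
    """Fraction of question sets where the model gives the same
    answer to every paraphrase of the underlying question."""
    uniform = sum(1 for answers in question_sets if len(set(answers)) == 1)
    return uniform / len(question_sets)

# Each inner list holds one model's answers to paraphrases of one question.
sets = [
    ["B", "B", "B"],  # consistent across all three paraphrases
    ["A", "C", "A"],  # inconsistent: rewording flipped the answer
    ["D", "D", "D"],  # consistent
]
print(set_consistency(sets))  # → 0.6666666666666666 (2 of 3 sets uniform)
```

Note that this measure is orthogonal to correctness: a model that answers every paraphrase with the same wrong choice still counts as consistent, which is why the benchmark reports accuracy and consistency separately.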

The prompt shown was utilized for both fine-tuning and inference in paraphrase-aware multiple choice question answering within the math reasoning subset.

Paraphrase-Aware Alignment: Imposing Logical Rigor

Paraphrase-Aware Alignment addresses the issue of inconsistent responses from Large Language Models (LLMs) when presented with semantically equivalent but lexically different inputs. This technique utilizes Supervised Fine-Tuning (SFT) to specifically train models to recognize and respond consistently to paraphrased queries. The core principle involves constructing a training dataset containing question-answer pairs, alongside multiple paraphrases of each question. By exposing the LLM to these variations during fine-tuning, the model learns to focus on the underlying meaning of the input rather than the specific wording, thereby improving robustness and reducing sensitivity to phrasing changes.

Paraphrase-aware alignment utilizes Supervised Fine-Tuning (SFT) with a specific focus on training data designed to enhance model robustness to variations in phrasing. This is achieved by constructing datasets containing examples where the same underlying reasoning problem is presented using multiple, distinct paraphrases. By exposing the model to these paraphrase-aware examples during SFT, the training process explicitly encourages invariance to superficial linguistic changes, allowing the model to consistently generate similar outputs regardless of how the input prompt is phrased. The core principle is to decouple the model’s understanding of the meaning of a query from its specific lexical form, thereby improving generalization and reducing sensitivity to input phrasing.
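Such a paraphrase-aware SFT dataset can be assembled by pairing every phrasing of a question with the same gold answer, so that superficially different prompts carry identical targets. A minimal sketch, with illustrative field names (the paper's exact schema may differ):

```python
def build_sft_examples(question_sets):
    """Expand each (paraphrases, answer) record into one SFT example
    per phrasing, all sharing the same target answer."""
    examples = []
    for item in question_sets:
        for phrasing in item["paraphrases"]:
            examples.append({"prompt": phrasing, "target": item["answer"]})
    return examples

data = [{
    "paraphrases": [
        "What is the capital of France?",
        "Which city serves as France's capital?",
    ],
    "answer": "Paris",
}]
for ex in build_sft_examples(data):
    print(ex["prompt"], "->", ex["target"])
```

Because every paraphrase maps to the same target, the fine-tuning loss penalizes any divergence in output across phrasings, which is the invariance the technique aims for.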

LoRA (Low-Rank Adaptation) is employed as a parameter-efficient fine-tuning (PEFT) method within the paraphrase-aware alignment technique to reduce computational demands. Rather than updating all model parameters during supervised fine-tuning, LoRA introduces trainable low-rank decomposition matrices to the existing weights. This significantly decreases the number of trainable parameters – often by over 100x – while maintaining performance comparable to full fine-tuning. The reduced parameter count translates directly into lower GPU memory requirements and faster training times, enabling effective alignment of large language models on limited hardware and datasets. LoRA’s efficiency facilitates iterative refinement of alignment strategies without incurring substantial computational costs.
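The parameter savings behind LoRA come from replacing a full d_out × d_in weight update with two low-rank factors, B (d_out × r) and A (r × d_in), so only B and A are trained. A back-of-the-envelope calculation, with dimensions chosen purely for illustration:

```python
def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters for a LoRA update W + B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return d_out * rank + rank * d_in

d = 4096            # a typical hidden size, for illustration
full = d * d        # full fine-tuning of one square projection matrix
lora = lora_trainable_params(d, d, rank=8)
print(full, lora, full // lora)  # the low-rank update is 256x smaller here
```

The ratio grows with the hidden size and shrinks with the rank, which is why low ranks (commonly 4 to 64) suffice to make large models trainable on modest hardware.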

This prompt was employed for both fine-tuning and inference in a multiple-choice question answering system designed to assess general knowledge with paraphrase awareness.

Quantifying Consistency with XParaCon: A Metric for Logical Stability

XParaCon is utilized as an objective metric to assess model stability in the face of paraphrased inputs. The metric functions by calculating the standard deviation of a model’s accuracy across multiple question variants created through paraphrasing. A lower standard deviation, as quantified by XParaCon, indicates greater consistency in performance regardless of how a question is phrased, thereby providing a precise, numerical assessment of the model’s robustness to variations in input wording. This allows for direct comparison of model performance before and after the application of techniques, such as Paraphrase-Aware Alignment, designed to improve consistency.

XParaCon assesses cross-paraphrase consistency by quantifying the variability in a model’s performance across multiple reformulations of the same question. Specifically, it calculates the standard deviation of accuracy scores obtained when evaluating a model on a set of question variants. A lower standard deviation, as reflected in a smaller XParaCon value, indicates more consistent performance and greater robustness to changes in phrasing. This metric provides a precise, numerical measure of how reliably a model responds correctly regardless of superficial variations in input wording, offering an objective assessment of its paraphrasing capabilities and overall stability.
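Taking the description above at face value, XParaCon can be computed with the standard library as the standard deviation of per-variant accuracy. The accuracy values below are invented for illustration:

```python
from statistics import stdev

def xparacon(accuracies_per_variant):
    """Standard deviation of a model's accuracy across paraphrased
    variants of a benchmark (lower = more stable under rewording)."""
    return stdev(accuracies_per_variant)

# Hypothetical accuracy on five paraphrase variants of the same questions.
stable   = [0.82, 0.81, 0.83, 0.82, 0.82]
unstable = [0.90, 0.65, 0.88, 0.70, 0.85]
print(xparacon(stable) < xparacon(unstable))  # → True
```

A model whose accuracy barely moves across rewordings yields a value near zero, while one that swings between phrasings is penalized, regardless of its average accuracy.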

Evaluation using the XParaCon metric demonstrates that Paraphrase-Aware Alignment yields quantifiable improvements in model consistency across diverse question phrasing. Specifically, the Llama-3.1-8B-Instruct model exhibited an increase in XParaCon score from 2.186 to 2.629 following alignment, while the Qwen3-4B model improved from 4.489 to 4.856. Importantly, models fine-tuned with this approach achieved XParaCon scores competitive with those of significantly larger, pre-trained models, indicating an efficient means of enhancing robustness without substantial parameter increases.

Model accuracy and XParaCon scores consistently improve with scale and fine-tuning across all model families.

Towards Reliable and Consistent AI: A Foundation for Trustworthy Reasoning

The pursuit of genuinely reliable artificial intelligence necessitates tackling the subtle issue of cross-paraphrase inconsistency in Large Language Models. These models, while often proficient at generating human-like text, can exhibit unpredictable shifts in response when presented with semantically equivalent questions phrased differently. This inconsistency erodes trust, as a trustworthy system should provide stable outputs regardless of superficial input variations. Recent work demonstrates that by specifically training models to recognize and mitigate these paraphrasing-induced discrepancies, researchers are making significant strides towards more dependable AI. Addressing this challenge isn’t merely about improving accuracy; it’s about building systems that consistently reason in a predictable manner, a critical step for deploying these models in sensitive applications demanding unwavering performance and fostering user confidence.

The advancements detailed in this research extend far beyond theoretical improvements in language modeling, holding substantial promise for practical applications across numerous fields. More dependable AI, achieved through addressing inconsistencies, directly benefits question answering systems by ensuring more accurate and trustworthy responses, even when queries are subtly rephrased. Similarly, dialogue systems become more engaging and less prone to frustrating logical leaps, fostering more natural and coherent conversations. Perhaps most crucially, the ability to consistently reason – a hallmark of human intelligence – is significantly enhanced, allowing AI to tackle complex critical reasoning tasks with greater reliability and potentially unlocking solutions in areas like scientific discovery, legal analysis, and financial modeling. These improvements collectively suggest a pathway toward AI systems that are not just powerful, but also predictable and consistently aligned with human expectations.

Continued development necessitates broadening the scope of input variation considered during model training and evaluation. Current methods often focus on paraphrasing, but subtle shifts in phrasing, semantic nuance, or even the presence of adversarial examples can still destabilize Large Language Models. Consequently, future research should investigate techniques to enhance robustness against a wider spectrum of input perturbations, including those stemming from differing cultural contexts or levels of linguistic complexity. Simultaneously, refining alignment strategies – the methods used to ensure models adhere to intended goals and values – remains crucial. More robust alignment will not only improve reliability but also facilitate the creation of AI systems that are consistently predictable and trustworthy across diverse applications, ranging from complex reasoning to nuanced dialogue.

Models demonstrate varying levels of cross-paraphrase consistency in the general knowledge subset, as measured by their XParaCon score.

The pursuit of robustness in Large Language Models, as detailed in this work concerning RoParQ, echoes a fundamental tenet of mathematical rigor. The study champions a focus on cross-paraphrase consistency, striving for solutions that aren’t merely functional, but demonstrably correct across varied inputs. This aligns perfectly with Andrey Kolmogorov’s assertion: “Mathematics is the art of saying complicated things in a simple way.” The elegance of RoParQ lies in its ability to distill the complex problem of paraphrased question understanding into a measurable metric, XParaCon, and a targeted improvement strategy. The work highlights that even smaller models, when guided by mathematical principles of consistency, can achieve substantial gains in reliability, moving beyond empirical ‘success’ towards provable performance.

Beyond Paraphrase: The Pursuit of Invariant Representation

The introduction of RoParQ and the XParaCon metric represent a necessary, if belated, acknowledgement of a fundamental failing in much of contemporary Large Language Model evaluation. Consistency across superficial transformations, paraphrasing being merely one such instance, should not be considered a desirable feature, but rather a baseline expectation. The observed gains from supervised fine-tuning, even in smaller models, merely confirm that current architectures are, at their core, insufficiently grounded in semantic invariance. A truly robust system should exhibit identical reasoning regardless of lexical variation; the fact that this requires explicit training suggests a deeper structural deficiency.

Future work must move beyond identifying paraphrases and focus on constructing models that inherently disregard them. The challenge lies not in generating more paraphrases for testing, but in developing architectural principles that prioritize invariant representation. Consideration should be given to symbolic reasoning techniques, or to hybrid models that combine the strengths of neural networks with the rigor of formal logic. The current emphasis on scale appears increasingly unsustainable if it fails to address this fundamental issue of representational fragility.

Ultimately, the pursuit of robustness is not merely an engineering problem; it is a philosophical one. It demands a re-evaluation of what it means for a machine to ‘understand’ – to move beyond pattern recognition and towards a genuine capacity for abstract thought, independent of superficial linguistic variations. The elegance of a solution will not be judged by benchmark scores, but by its mathematical necessity.


Original article: https://arxiv.org/pdf/2511.21568.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-11-30 17:29