Author: Denis Avetisyan
As Large Language Models become increasingly integrated into software, traditional testing methods struggle to keep pace, necessitating innovative approaches to ensure reliable performance.
This review explores how Metamorphic Testing provides a cost-effective and scalable solution for validating LLM-powered systems, even when labeled data is limited.
The increasing reliance on Large Language Models (LLMs) presents a paradox: while offering unprecedented capabilities, their inherent unreliability complicates traditional software validation. This challenge is addressed in ‘From Untestable to Testable: Metamorphic Testing in the Age of LLMs’, which proposes a shift towards relation-based testing to overcome the scarcity of labeled ground truth data. The paper advocates for Metamorphic Testing (MT) as a scalable and cost-effective approach, transforming multiple test executions into executable oracles for LLM-powered systems. Could MT become a cornerstone of robust AI validation, enabling more reliable and trustworthy applications in a rapidly evolving landscape?
The Test Oracle's Collapse: LLMs and the Limits of Deterministic Validation
The proliferation of Large Language Models (LLMs) into diverse applications presents a significant hurdle for software testing, primarily concerning the "test oracle" problem – the method for determining if a system's output is correct. Historically, testing involved comparing an application's results against pre-defined, known-correct outputs; however, LLMs, by their very nature, generate probabilistic responses. This means a single input can yield multiple valid, yet potentially differing, outputs, effectively dismantling the foundation of traditional, deterministic testing approaches. Consequently, validating the correctness of LLM-powered systems becomes far more complex, requiring innovative strategies beyond simple output comparison to ensure reliability and prevent the propagation of subtle, yet critical, errors.
Conventional software testing fundamentally depends on the "test oracle" – a definitive source for verifying expected outputs. However, Large Language Models (LLMs) present a significant departure from this established paradigm; their inherent probabilistic nature means a single input can yield a multitude of plausible, yet potentially inaccurate, responses. Unlike deterministic systems where correctness is absolute, LLMs generate outputs based on statistical likelihoods, introducing a level of ambiguity that undermines traditional validation techniques. This isn't simply a matter of finding bugs, but of defining what constitutes a correct answer when creativity and nuance are central to the model's function; an output might be grammatically sound and contextually relevant, yet factually incorrect or subtly misleading, challenging the very foundation of automated testing and demanding novel approaches to evaluation.
The inherent unpredictability of Large Language Models presents a fundamental challenge to conventional software validation techniques. Established testing methodologies depend on comparing system outputs against pre-defined, unequivocally correct "oracles" – a benchmark absent in LLM applications, where many valid and countless plausible responses can exist. This discrepancy undermines the core principle of deterministic testing, as discerning between acceptable creativity and genuine errors becomes exceptionally difficult. Consequently, confidently verifying the reliability and safety of LLM-powered applications requires a paradigm shift, moving beyond simple pass/fail criteria towards more nuanced evaluation metrics and potentially embracing statistical measures of performance rather than absolute correctness.
Metamorphic Testing: A Relational Approach to Validation
Metamorphic Testing (MT) diverges from traditional software testing by prioritizing the validation of relationships between test cases rather than absolute correctness. This approach necessitates defining ‘metamorphic relations’ – rules specifying how changes to test inputs should predictably affect the corresponding outputs. Rather than requiring a definitive ‘oracle’ to assess output validity, MT confirms that if input A is modified to input A’, the resultant output should change in a defined, predictable manner relative to the original output. This relational verification circumvents the need for pre-computed, correct answers and focuses on the consistency of the system’s behavior under defined transformations.
Metamorphic testing circumvents the need for pre-defined "ground truth" outputs by establishing verifiable criteria based on input variations and their expected effects on the resulting output. Rather than asserting a specific output for a given input, MT defines "metamorphic relations," which describe how the output should change when the input is modified in a predictable way. For example, if a function calculates the average of a list, a metamorphic relation would state that doubling all values in the input list should result in a doubling of the calculated average. This allows for automated verification of LLM behavior without requiring access to correct answers, focusing instead on the consistency of responses to related inputs.
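The averaging relation described above can be sketched in a few lines of Python. The function names here are illustrative, not part of any particular framework; the point is that the test verifies a relation between two executions rather than a single known-correct output:

```python
def average(xs):
    """Average of a non-empty list of numbers."""
    return sum(xs) / len(xs)

def check_doubling_relation(xs, tol=1e-9):
    """Metamorphic check: doubling every input should double the average.

    No ground-truth value for average(xs) is needed; only the relation
    between the source execution and the follow-up execution is verified.
    """
    # Follow-up input: apply the metamorphic transformation.
    doubled = [2 * x for x in xs]
    # Oracle: compare the two outputs against the expected relation.
    return abs(average(doubled) - 2 * average(xs)) <= tol

# e.g. check_doubling_relation([1.0, 2.0, 3.0]) returns True
```

A faulty implementation of `average` (say, one that drops the last element) would generally violate this relation, flagging the bug without anyone ever specifying what the correct average is.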
The test oracle problem in Large Language Model (LLM) evaluation arises from the lack of definitive, pre-calculated correct answers against which to compare LLM outputs. Metamorphic Testing (MT) circumvents this by avoiding the need for a ground truth; instead, it focuses on verifying that specific, predictable relationships hold between the outputs of multiple LLM executions with related inputs. This relationship-based approach allows for effective testing without requiring absolute correctness, as the validity of the LLM's response is determined by its consistency with expected transformations of input data, rather than comparison to a single "correct" answer. Consequently, MT provides a viable method for assessing LLM behavior when establishing a traditional test oracle is impractical or impossible.
LLMORPH: A Framework for Implementing Relational Validation
LLMORPH is a specialized framework engineered for the implementation and execution of metamorphic testing (MT) on Large Language Models (LLMs). Unlike traditional testing methods, LLMORPH focuses on verifying that LLMs maintain consistent behavior across semantically equivalent inputs, rather than comparing outputs to pre-defined expected answers. The framework provides tools for generating these equivalent inputs, executing them against the target LLM, and analyzing the resulting outputs to identify violations of expected metamorphic relations. It is designed to be model-agnostic, supporting a range of LLM architectures and sizes, and provides infrastructure for automating the entire MT process, including test case generation, execution, and result analysis.
LLMORPH employs a BERT-based Similarity Score as a dynamic oracle to assess the validity of metamorphic relations. This involves generating a similarity score between the expected output, derived from the metamorphic relation, and the actual output produced by the Large Language Model. The BERT model is utilized to create contextualized embeddings of both outputs, and cosine similarity is then calculated between these embeddings. A threshold is applied to this similarity score; values below the threshold indicate a failure of the metamorphic relation, signaling a potential fault in the LLM's behavior. This dynamic oracle approach eliminates the need for pre-defined, static test cases, allowing for automated evaluation of a wide range of metamorphic properties.
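A minimal sketch of such a dynamic oracle is shown below. It assumes the two outputs have already been embedded by a BERT-style encoder (the embedding step is omitted to keep the example self-contained), and the threshold value is illustrative rather than one taken from LLMORPH:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def relation_holds(expected_emb, actual_emb, threshold=0.85):
    """Dynamic oracle: the relation passes when the embedding of the
    expected output and the embedding of the actual output are
    sufficiently similar; below the threshold, a violation is flagged."""
    return cosine_similarity(expected_emb, actual_emb) >= threshold
```

In practice the vectors would come from a contextual encoder rather than being hand-constructed, and the threshold would be tuned per task, since semantically equivalent texts rarely yield identical embeddings.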
LLMORPH was evaluated using GPT-4, LLAMA 3, and HERMES-2 to assess its practical utility and efficacy in detecting LLM flaws. Testing involved the implementation of 36 distinct metamorphic relations, and analysis revealed an average failure rate of 18% across these relations. This indicates that metamorphic testing, when implemented via LLMORPH, is capable of identifying faulty behaviors in state-of-the-art large language models with a measurable degree of success. The observed failure rate demonstrates the potential of MT as a supplementary testing method beyond traditional approaches.
Expanding the Scope: MT for Agentic Systems and Continuous Integration
Agentic AI systems, distinguished by their capacity for autonomous action and interaction, demand novel testing methodologies beyond traditional approaches. Metamorphic Testing (MT) offers a powerful solution by shifting the focus from verifying specific outputs to confirming the relationships between them. Instead of needing pre-defined "correct" answers – often impossible to establish for complex, evolving agents – MT verifies that predictable changes to input yield corresponding, predictable changes in the agent's behavior. This is particularly crucial for agentic systems where even seemingly minor input variations can trigger cascading effects and unforeseen outcomes. By evaluating these relational properties, MT provides a robust means of assessing consistency, reliability, and the overall quality of agentic AI, even in the absence of ground truth data and across diverse, dynamic environments.
Metamorphic testing offers a powerful approach to validating agentic AI systems by concentrating on the relationships within the data, rather than absolute correctness. Unlike traditional testing which verifies outputs against fixed ground truth, this method examines whether expected transformations of inputs yield corresponding changes in outputs – a crucial check for systems designed to interact and adapt. This relational focus is particularly well-suited to agentic AI because these systems are evaluated on their consistent and reliable behavior across a multitude of scenarios, not simply on producing a single ‘right’ answer. By defining metamorphic relations – rules that dictate how changes in input should affect output – researchers can automatically generate test cases that probe the agent’s ability to maintain consistency even when faced with altered or unexpected inputs, ultimately ensuring a more robust and trustworthy AI.
Metamorphic testing offers a practical solution for continuous quality assurance of large language models by embedding directly into existing CI pipelines. A comprehensive review of over one thousand research papers – encompassing 24 natural language processing tasks and yielding 191 metamorphic relations – demonstrated the scalability of this approach, with approximately 560,000 automated tests executed across three distinct LLMs. Detailed manual investigation of detected violations – numbering 937 – confirmed a 62% true positive rate, highlighting the methodās effectiveness at identifying genuine flaws. Notably, certain metamorphic relations exposed a substantial failure rate – peaking at 80% – underscoring the potential for MT to reveal systemic vulnerabilities in LLM-powered applications and facilitate robust automated regression testing.
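One way such checks could be wired into a CI step is sketched below. Everything here is an assumption for illustration: `call_llm` is a deterministic stub standing in for a real model endpoint, and the two invariance relations are simple examples rather than relations drawn from the paper's catalog of 191:

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; deterministic so the sketch
    # runs offline. In CI this would hit the deployed LLM endpoint.
    return prompt.lower().strip()

def normalize(text: str) -> str:
    # Collapse whitespace before comparing responses.
    return " ".join(text.split())

# Each relation: (name, input transformation, check on the two outputs).
RELATIONS = [
    ("whitespace-invariance",
     lambda p: f"  {p}  ",
     lambda a, b: normalize(a) == normalize(b)),
    ("case-invariance",
     lambda p: p.upper(),
     lambda a, b: normalize(a) == normalize(b)),
]

def run_suite(prompts):
    """Run every relation against every prompt; return violations.

    A non-empty result would fail the CI job, turning metamorphic
    relations into an automated regression gate."""
    violations = []
    for prompt in prompts:
        base = call_llm(prompt)
        for name, transform, holds in RELATIONS:
            follow_up = call_llm(transform(prompt))
            if not holds(base, follow_up):
                violations.append((name, prompt))
    return violations
```

Because the oracle is the relation itself, new prompts can be added to the suite without labeling effort, which is what makes the roughly 560,000 automated executions reported above feasible.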
The pursuit of reliable software, particularly when integrating Large Language Models, demands a shift from solely relying on labeled data. This paper champions Metamorphic Testing as a pragmatic solution, acknowledging the limitations of traditional methods in dynamic AI landscapes. It echoes a sentiment articulated by Brian Kernighan: "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." The inherent complexity of LLMs necessitates a testing approach focused on verifiable relationships – the 'metamorphic relations' – rather than absolute correctness. This aligns with the principle that a provable algorithm, defined by consistent boundaries, is more valuable than one that merely passes initial tests, ensuring robustness beyond superficial functionality.
What’s Next?
The appeal of Metamorphic Testing (MT) lies not in discovering every flaw, a task demonstrably impossible, but in shifting the burden of proof. Traditional testing, reliant on painstakingly curated oracles, frequently mistakes implementation details for logical errors. MT, by focusing on relationships between inputs and outputs, attempts to bypass this fragility. However, the application to Large Language Models (LLMs) exposes a fundamental tension. While MT circumvents the need for absolute truth, LLMs themselves are predicated on probabilities: approximations of meaning. The question becomes not whether an LLM fails a test, but whether its failures adhere to expected statistical distributions.
Future work must address this probabilistic uncertainty. Simply demonstrating a violation of a metamorphic relation isn’t sufficient; the magnitude and frequency of such violations require rigorous quantification. A truly elegant solution will move beyond ad-hoc relation design and embrace formal methods for generating and verifying metamorphic relations, ideally grounded in the underlying mathematical properties of the LLM itself. To claim progress, these methods must demonstrate provable guarantees, not merely empirical effectiveness on a limited set of benchmarks.
The current enthusiasm for LLMs often prioritizes scale over correctness. It remains to be seen whether the field will ultimately demand, and achieve, a more mathematically principled approach to validation. Until then, MT represents a pragmatic, if imperfect, step towards establishing some semblance of confidence in these increasingly pervasive systems. The pursuit of genuine, provable reliability, however, must remain the ultimate goal.
Original article: https://arxiv.org/pdf/2603.24774.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/