Checking the Reasoning of AI’s Black Boxes

Author: Denis Avetisyan


A new framework offers a practical way to verify the behavior of large language models without needing access to their internal workings.

The system tests model integrity by simulating typical user requests and demanding post-response verification from the model’s owner, effectively creating an adversarial audit of its outputs.

IMMACULATE leverages verifiable computation and logit distance distribution to audit black-box LLM APIs for increased trust and security.

Despite the increasing reliance on commercial large language models (LLMs), verifying their correct and honest execution remains a significant challenge due to their black-box nature. This paper introduces ‘IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation’, a novel approach that detects economically motivated deviations – such as model substitution or token overbilling – without requiring trusted hardware or internal model access. By leveraging verifiable computation to selectively audit requests via the logit distance distribution, IMMACULATE achieves strong detection guarantees with minimal throughput overhead. Could this framework usher in a new era of accountability and trust in the rapidly evolving landscape of LLM services?


The Inevitable Shadow of the API

The proliferation of Large Language Models (LLMs) is largely fueled by the widespread adoption of API-based inference services. This deployment method allows developers and businesses to integrate powerful AI capabilities without the substantial infrastructure and expertise required to host and maintain the models themselves. Such services provide remarkable scalability, effortlessly handling fluctuating demands and supporting a vast number of concurrent users. Moreover, API access dramatically lowers the barrier to entry, enabling rapid prototyping and innovation across diverse applications. This ease of access, however, comes with trade-offs, fundamentally shifting the relationship between those who utilize LLMs and those who control the underlying technology and infrastructure. The convenience of API-driven access has quickly become the standard for leveraging these advanced models, reshaping the landscape of artificial intelligence deployment.

The increasing prevalence of API-based Large Language Models (LLMs) has created a fundamental imbalance in trust between service providers and users. Individuals and organizations now routinely depend on these powerful AI systems without possessing meaningful insight into their internal workings or the processes that generate responses. This trust asymmetry stems from the ‘black-box’ nature of these models; users are typically unable to inspect the model’s architecture, training data, or even the specific computations performed during inference. Consequently, verifying the integrity, reliability, and security of the generated outputs becomes exceedingly difficult, forcing a dependence on the provider’s assurances – a situation ripe for potential exploitation or unforeseen errors. The lack of transparency doesn’t merely concern technical details; it extends to the very basis of decision-making, raising questions about accountability and the potential for biased or manipulated outcomes.

The increasing reliance on closed-source Large Language Models, accessed through API services, introduces substantial risks stemming from a lack of transparency. Because users cannot inspect the model’s internal workings, they are vulnerable to integrity attacks such as Model Substitution, where a cheaper, less capable model is swapped in without notification, and Token Overbilling, where usage is inaccurately reported to inflate costs. These attacks exploit the inherent trust asymmetry, potentially leading to unpredictable and unreliable outputs, alongside significant financial repercussions. Without the ability to verify the model’s identity or monitor token consumption, organizations are effectively operating with a ‘black box’, unable to guarantee the integrity of the service or the validity of the results, and are exposed to both performance degradation and unforeseen expenses.

By treating discrete selections <span class="katex-eq" data-katex-display="false">d_{i}</span> as continuous within the inference process, the workflow eliminates branching uncertainty and streamlines computation.

Auditing the Oracle: A Necessary Discomfort

Black-box Large Language Model (LLM) auditing establishes an independent verification process for deployed models, functioning as a secondary evaluation of outputs without reliance on the originating system’s internal checks. This is achieved by submitting identical prompts to both the deployed LLM and the auditing system, then comparing the responses – including generated text, token counts, and associated metadata – to identify discrepancies or violations of predefined security or functional constraints. The purpose of this independent validation is to mitigate risks associated with potentially compromised or malfunctioning models, ensuring consistent and predictable behavior in production environments, and providing an additional layer of defense against malicious outputs or unintended consequences.
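The comparison step described above can be sketched as a small check over the externally observable fields of two responses. This is a minimal illustration, not the paper’s actual interface; `audit_response` and the response dictionaries are hypothetical stand-ins for real API payloads.

```python
def audit_response(service_out: dict, reference_out: dict,
                   token_tolerance: int = 0) -> list:
    """Return a list of discrepancies between a service response and a
    reference response to the same prompt."""
    issues = []
    if service_out["text"] != reference_out["text"]:
        issues.append("output text differs from reference")
    if abs(service_out["tokens"] - reference_out["tokens"]) > token_tolerance:
        issues.append("reported token count deviates from reference")
    return issues

# A provider reporting 12 tokens for a response the reference tokenizes
# to 8 trips the token-count check while the text check passes.
service = {"text": "Paris is the capital of France.", "tokens": 12}
reference = {"text": "Paris is the capital of France.", "tokens": 8}
assert audit_response(service, reference) == [
    "reported token count deviates from reference"
]
```

In practice the text comparison would tolerate benign nondeterminism (sampling temperature, precision noise), which is exactly why the statistical machinery described later is needed.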

API-based Inference Services, while offering convenient access to Large Language Models (LLMs), introduce specific risks related to output quality and cost control. External auditing addresses these concerns by independently verifying the responses generated by the LLM and tracking token consumption. This validation process confirms that the model adheres to expected behavior, mitigating the potential for generating harmful, biased, or inaccurate outputs. Furthermore, monitoring token usage provides transparency and allows for cost management, preventing unexpected expenses due to excessive or inefficient prompting and response generation. By providing this external layer of verification, auditing enhances the reliability and predictability of LLM-powered applications relying on API access.

Black-box Large Language Model (LLM) auditing is uniquely positioned to assess models without requiring access to their internal parameters or architectural details. This characteristic is critical for evaluating proprietary and closed-source LLMs where model weights and structures are not publicly available or accessible due to intellectual property restrictions. The audit process operates solely on the inputs provided to the model and the corresponding outputs generated, allowing for external validation of behavior without necessitating any internal inspection. Consequently, organizations can implement security and reliability checks on LLMs even when they do not own or have control over the underlying model itself, focusing instead on verifying the externally observable performance characteristics.

Immaculate: A Framework for Observing the Unknowable

Immaculate is designed as a practical auditing framework for Large Language Models (LLMs) utilizing black-box auditing principles, meaning it does not require access to the model’s internal parameters or training data. This approach prioritizes efficiency and scalability by enabling audit processes to be applied to deployed models without necessitating modifications or specialized access. The framework is engineered to minimize computational cost and integration complexity, making it suitable for continuous monitoring in production environments. By focusing on external observations of model behavior – specifically, input-output relationships – Immaculate facilitates auditing without requiring knowledge of the model’s internal workings, thus broadening its applicability to a wider range of LLM deployments.

Randomized Auditing, as implemented in Immaculate, operates on the principle of stratified sampling to reduce the computational expense of comprehensive LLM auditing. Instead of evaluating all possible queries, a statistically representative subset is selected and analyzed. This approach maintains a high degree of reliability by ensuring the sampled queries accurately reflect the overall query distribution, enabling the detection of anomalous behavior with a significantly lower resource investment. The size of the query subset is determined by statistical methods to achieve a desired confidence level and minimize the margin of error, effectively balancing audit cost with the need for thoroughness.
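The audit budget implied by this principle follows from simple binomial reasoning: if a dishonest provider deviates on some fraction of requests and each request is independently audited with some probability, the chance of catching at least one deviation compounds geometrically with traffic. A minimal sketch, with rates and a confidence target that are illustrative rather than taken from the paper:

```python
import math

def detection_probability(cheat_rate: float, audit_rate: float,
                          n_requests: int) -> float:
    """Chance that at least one deviating request gets audited, assuming
    independent per-request deviation (cheat_rate) and auditing (audit_rate)."""
    per_request = cheat_rate * audit_rate
    return 1.0 - (1.0 - per_request) ** n_requests

def audits_needed(cheat_rate: float, audit_rate: float,
                  confidence: float) -> int:
    """Smallest request count whose detection probability meets `confidence`."""
    per_request = cheat_rate * audit_rate
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - per_request))

# Example: a provider substituting the model on half of requests, with 10%
# of requests audited, is caught with 99% probability within 90 requests.
print(audits_needed(0.5, 0.1, 0.99))
```

The same arithmetic runs in reverse: fixing a desired confidence and an assumed deviation rate yields the audit rate, which is how sampling cost is traded against thoroughness.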

Immaculate is designed for minimal performance impact, exhibiting less than 1% overhead in end-to-end processing time during auditing procedures. This efficiency is coupled with a very low false positive rate, measured at less than <span class="katex-eq" data-katex-display="false">10^{-5}</span>. The low false positive rate indicates a high degree of accuracy in identifying genuine security violations, minimizing unnecessary alerts or interventions. These metrics were achieved through careful optimization of the auditing process and a focus on reducing computational load without sacrificing reliability.

Immaculate utilizes a combined approach of randomized testing and Logit Distance Distribution (LDD) analysis to identify malicious behavior in Large Language Models. Randomized testing generates a representative subset of queries to reduce computational cost, while LDD analysis measures the divergence between the logit distributions of the audited model and a known good model. This combination allows for the detection of both Token Overbilling – where a malicious model reports inflated token usage – and Model Substitution attacks, where a less capable or compromised model is deployed. Testing demonstrates a detection rate exceeding 40% for Model Substitution attacks and over 1.3% for quantization attacks, indicating the framework’s efficacy in identifying these specific vulnerabilities.
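A toy version of the LDD idea can be sketched assuming per-prompt logit vectors are available from both the audited service and a reference model. The Euclidean distance and fixed threshold below are simplifications standing in for whatever calibrated statistic the framework actually uses:

```python
import math

def logit_distance(logits_a, logits_b):
    """Euclidean distance between two logit vectors for the same prompt."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(logits_a, logits_b)))

def flag_deviation(audited_logits, reference_logits, threshold=1.0):
    """Flag the service if the mean per-prompt logit distance exceeds a
    calibrated threshold (the value 1.0 here is purely illustrative)."""
    distances = [logit_distance(a, r)
                 for a, r in zip(audited_logits, reference_logits)]
    return sum(distances) / len(distances) > threshold

# An honest service tracks the reference closely; a substituted model drifts.
assert not flag_deviation([[2.1, 0.3, -1.0]], [[2.1, 0.3, -1.0]])
assert flag_deviation([[5.0, -2.0, 0.0]], [[2.1, 0.3, -1.0]])
```

The threshold is the crux: set it from the empirical distance distribution of the genuine model under benign variation, and anything beyond that band becomes evidence of substitution rather than noise.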

The Price of Precision: A Necessary Illusion

Large Language Model inference benefits significantly from employing lower precision numerical formats such as FP8 and BF16, which can dramatically accelerate processing speeds and reduce memory requirements. However, this performance gain isn’t without cost; the reduction in the number of bits used to represent model weights and activations inherently introduces a trade-off with potential accuracy. While full-precision models – those utilizing FP32 or even FP64 – offer the highest degree of numerical stability and precision, they demand substantially more computational resources. Consequently, transitioning to lower precision necessitates careful evaluation; the gains in speed must be weighed against any resulting degradation in model performance, requiring robust testing and validation to ensure acceptable levels of accuracy are maintained for specific applications.

Effective model auditing requires a nuanced approach when dealing with large language models employing reduced precision formats. Simply comparing outputs to a full-precision “gold standard” overlooks the inherent statistical differences introduced by lower precision calculations. These variations, though often subtle, can accumulate and manifest as deviations in model behavior, potentially leading to inaccurate assessments of integrity. Consequently, auditing methodologies must be designed to specifically account for the expected range of variation associated with each precision level – BF16, FP8, and others – treating these as intrinsic characteristics rather than outright errors. This necessitates the development of statistical tests sensitive enough to differentiate between acceptable precision-induced variation and genuine anomalies that signal a compromise in model functionality or security, ensuring a reliable evaluation of model performance across diverse computational landscapes.
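One way to realize such a precision-aware test is to calibrate a tolerance against the rounding error a given format can introduce. The sketch below round-trips a value through IEEE 754 half precision (used here as a stdlib-accessible stand-in for BF16/FP8, which Python cannot represent natively) and accepts only deviations of that order; the relative tolerance is illustrative, not a calibrated value:

```python
import struct

def to_half(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision (struct 'e' format),
    simulating the rounding a lower-precision inference pipeline introduces."""
    return struct.unpack('e', struct.pack('e', x))[0]

def within_precision_tolerance(full_logit: float, served_logit: float,
                               rel_tol: float = 1e-2) -> bool:
    """Accept deviations explainable by reduced precision; flag larger ones.
    In practice the tolerance would be calibrated per format (BF16, FP8, ...)."""
    return abs(full_logit - served_logit) <= rel_tol * max(abs(full_logit), 1.0)

logit = 3.14159
assert within_precision_tolerance(logit, to_half(logit))   # precision noise: OK
assert not within_precision_tolerance(logit, logit + 0.5)  # genuine anomaly
```

The per-format calibration is what separates this from a naive equality check: the acceptance band for FP8 must be wider than for BF16, and a deviation outside even the widest band is evidence of something other than quantization.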

Token overreporting, a nuanced form of token overbilling, presents a critical challenge in the responsible deployment of large language models. This phenomenon, where the number of tokens processed during inference exceeds the expected amount, can lead to inflated costs and obscured performance metrics. Meticulous token usage tracking is therefore paramount, not only for accurate billing and resource allocation, but also as a vital component of robust auditing procedures. Discrepancies between reported and actual token counts can signal subtle model deviations, potential vulnerabilities, or even malicious activity, demanding careful investigation. Addressing token overreporting necessitates precise instrumentation throughout the inference pipeline and the implementation of validation mechanisms during auditing, ensuring transparency and accountability in LLM operations.
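A minimal form of the validation mechanism described above recounts tokens locally and compares the result against the billed figure. Here `count_tokens` is a placeholder for a real tokenizer callback (e.g. one matched to the provider’s model); the whitespace splitter used in the example is deliberately crude:

```python
def overbilled(prompt: str, response: str, billed_tokens: int,
               count_tokens) -> bool:
    """True when the provider bills more tokens than a local recount finds.
    `count_tokens` stands in for a real tokenizer callback."""
    expected = count_tokens(prompt) + count_tokens(response)
    return billed_tokens > expected

# Whitespace splitting is a deliberately crude tokenizer for illustration.
naive = lambda text: len(text.split())
assert not overbilled("count my tokens", "three words here", 6, naive)
assert overbilled("count my tokens", "three words here", 9, naive)
```

A production check would also allow a small slack for special tokens (chat template markers, end-of-sequence tokens) before raising an alert, so that formatting overhead is not mistaken for overreporting.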

The pursuit of verifiable computation, as detailed in this framework, echoes a fundamental truth about complex systems. Immaculate attempts to establish trust not by dismantling the black box, but by observing its outputs and verifying consistency – a delicate dance with inevitable entropy. As Paul Erdős observed, “A mathematician knows a lot of things, but knows nothing deeply.” This mirrors the limitations of auditing LLMs; complete understanding remains elusive, yet statistical verification offers a pathway to pragmatic assurance. The framework doesn’t solve the problem of LLM trustworthiness, but rather establishes a method for continuous observation, acknowledging that every architectural choice – every prompt, every parameter – prophesies future vulnerabilities. It’s a system built not to be secure, but to reveal insecurity.

The Seeds of What Will Be

Immaculate proposes a method for peering into the veiled operations of large language models, not by dismantling them – for such things rarely yield understanding – but by observing the echoes of their decisions. This approach acknowledges a fundamental truth: any attempt to build trust is a temporary reprieve. The system will, inevitably, grow beyond the intentions of its creators. The Logit Distance Distribution offers a means of tracking this growth, of noting the subtle shifts in behavior that precede more substantial changes. But distributions alone do not prevent drift, only document it.

The true challenge lies not in verifying a model’s current state, but in anticipating its future ones. Immaculate offers a valuable diagnostic, a means of observing the symptoms of change, but the underlying causes will remain elusive. Future work will likely focus on correlating distributional shifts with external factors – the subtle pressures of data, the unintended consequences of fine-tuning. The framework will be stretched, prodded, and ultimately reveal its own limitations, for every refactor begins as a prayer and ends in repentance.

One suspects that the pursuit of “trustworthy AI” is a fool’s errand, or rather, a Sisyphean one. The goal isn’t to achieve trust, but to cultivate a more nuanced understanding of failure. Immaculate, in its careful measurements, offers a path toward that understanding. It is not a shield against the inevitable, but a means of charting the landscape of what will be.


Original article: https://arxiv.org/pdf/2602.22700.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-01 22:03