Author: Denis Avetisyan
Researchers have developed a novel method for reliably detecting when artificial intelligence systems encounter data outside of their training, improving safety and trustworthiness.

CORE decomposes feature representations into confidence and membership signals using orthogonal subspaces to achieve robust out-of-distribution detection.
Despite advances in deep learning, reliable out-of-distribution (OOD) detection remains a challenge, with methods often failing to generalize across diverse architectures and datasets. This paper introduces CORE (COnfidence + REsidual), a novel post-hoc OOD detection approach that disentangles classifier confidence from a class-specific membership signal via decomposition into orthogonal feature subspaces. By independently scoring these subspaces and combining them, CORE achieves robust detection where existing methods falter, demonstrating state-of-the-art performance across five architectures and benchmark configurations. Can this orthogonal decomposition strategy unlock further improvements in the reliability and generalization of deep learning systems beyond OOD detection?
The Fragility of Familiarity: When Models Stray from the Known
Contemporary neural networks demonstrate remarkable proficiency when processing data mirroring their training conditions, achieving high accuracy on familiar examples. However, this strength is sharply contrasted by a significant vulnerability: a pronounced struggle with "unseen" data – inputs that deviate from the distributions encountered during training. This limitation isn't merely a matter of reduced performance; it represents a core reliability issue, as these models can generate confidently incorrect outputs when confronted with novel situations. The issue stems from the network's tendency to overconfidently extrapolate from learned patterns, lacking the capacity to recognize the boundaries of its knowledge. Consequently, a system reliant on such a network faces substantial risk in real-world deployment, where unpredictable inputs are the norm, potentially leading to critical failures in applications ranging from autonomous driving to medical diagnosis.
The dependable function of modern artificial intelligence systems increasingly relies on their capacity to recognize when presented with data that diverges significantly from their training parameters. Identifying these "out-of-distribution" (OOD) samples isn't merely an academic exercise; it's a cornerstone of reliable deployment in real-world applications. Consider autonomous vehicles, for example, where encountering an unforeseen object or weather condition demands a system's ability to acknowledge its uncertainty and defer to a safer course of action. Similarly, in medical diagnosis, flagging an unusual patient case allows for expert review, preventing potentially harmful misdiagnoses. Without robust OOD detection, these systems risk confidently generating incorrect outputs, leading to failures with potentially severe consequences, and hindering the broader adoption of AI in critical infrastructure.
Conventional techniques for identifying anomalous data frequently demonstrate diminished efficacy as neural networks grow in intricacy and the characteristics of incoming data diverge from the training set. These methods, often reliant on pre-defined thresholds or assumptions about data normality, struggle to generalize to unseen scenarios, leading to a heightened risk of misclassification. Consequently, systems employing such approaches can experience catastrophic failures, misinterpreting novel inputs as familiar ones – a particularly concerning issue in safety-critical applications like autonomous driving or medical diagnosis, where even a single erroneous decision can have severe consequences. The increasing prevalence of adversarial attacks and continuously evolving data landscapes further exacerbate these vulnerabilities, underscoring the urgent need for robust and adaptive out-of-distribution detection strategies.

Dissecting the Feature Space: The CORE Decomposition
CORE operates as a post-hoc Out-of-Distribution (OOD) detection method, meaning it analyzes feature vectors after a model has been trained and made predictions. Its core innovation lies in applying orthogonal decomposition – specifically, a transformation to create mutually uncorrelated components – to the feature vector representing each input sample. This decomposition process aims to separate the feature space into components that indicate the model's confidence in its representation and components that relate to the sample's membership within the training distribution. By analyzing these decomposed components, CORE assesses the likelihood that a given sample originates from the data used during training, thereby enabling OOD detection without requiring access to the training data itself or modification of the original model.
CORE's orthogonal decomposition of the feature vector isolates components indicative of model confidence and data membership. The "confidence" component captures the variance explained by the model's primary decision-making process, effectively representing how strongly the model believes in its prediction. Conversely, the "membership" component isolates variance orthogonal to this primary decision, reflecting characteristics specific to the training data distribution. By separating these aspects of the feature space, CORE moves beyond a single, aggregated feature representation, enabling a more granular assessment of whether a given sample aligns with the learned data distribution and therefore, is likely in-distribution or out-of-distribution.
CORE distinguishes between in-distribution and out-of-distribution (OOD) samples by analyzing the orthogonal components resulting from feature vector decomposition. Specifically, the ‘confidence’ and ‘membership’ components provide quantifiable metrics for assessing sample validity; higher confidence and membership values generally indicate in-distribution data. Evaluation across five distinct model/dataset combinations – ResNet-18 on CIFAR-10, WRN-28-10 on SVHN, ResNet-50 on ImageNet, ViT-B/16 on CIFAR-100, and MobileNetV2 on FashionMNIST – demonstrates an average Area Under the Receiver Operating Characteristic curve (AUROC) of 84.9%, indicating a high degree of accuracy in OOD detection using this method.
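The orthogonal split described above can be sketched in a few lines of NumPy. This is a minimal illustration under a simplifying assumption – a single classifier weight vector per class – and the function names are illustrative, not the paper's implementation:

```python
import numpy as np

def decompose(feature, w):
    """Split a feature vector into its component along the classifier
    weight direction (the 'confidence' part) and the orthogonal
    residual (the 'membership' part)."""
    w_unit = w / np.linalg.norm(w)
    conf_component = np.dot(feature, w_unit) * w_unit  # projection onto w
    residual = feature - conf_component                # orthogonal complement
    return conf_component, residual

rng = np.random.default_rng(0)
f = rng.normal(size=8)
w = rng.normal(size=8)
conf, res = decompose(f, w)
# The two components are orthogonal and reconstruct the original feature.
assert abs(np.dot(conf, res)) < 1e-9
assert np.allclose(conf + res, f)
```

Because the two components live in orthogonal subspaces, they can be scored independently and then combined, which is the structural idea behind CORE's two signals.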

Unveiling Membership: The Residual Signature
The CORE framework determines a "Membership Score" by calculating the cosine similarity between a sample's residual feature vector and the mean residual direction for its predicted class. The residual feature vector is computed by projecting the original feature vector onto the space orthogonal to the classifier's weight vector, isolating class-specific information not captured by the primary classification decision. Cosine similarity, ranging from -1 to 1, quantifies the alignment between these vectors; higher scores indicate greater similarity to the expected residual pattern for that class, suggesting the sample is likely in-distribution data. This score is then used as a metric to assess the confidence of the classification and to identify potential out-of-distribution samples.
The Residual Feature Vector is derived by projecting the full feature vector onto a subspace orthogonal to the direction of the classifier's weight vector. This projection effectively removes the component of the feature vector that is most strongly associated with the classification decision, isolating information related to intra-class variations and class-specific characteristics not directly contributing to the primary classification. Mathematically, this is achieved through a dot product operation between the feature vector and the classifier's weight vector, subtracting the resulting scalar projection from the original feature vector, and normalizing the result. This residual vector then represents the portion of the feature space capturing nuanced details within each class, independent of the dominant classification signal.
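The residual projection and cosine-similarity scoring just described can be sketched as follows. The toy class statistics and all names here are illustrative assumptions, not CORE's actual code:

```python
import numpy as np

def residual_vector(feature, w):
    """Remove the component of `feature` along the classifier weight `w`,
    leaving only the part orthogonal to the decision direction."""
    w_unit = w / np.linalg.norm(w)
    return feature - np.dot(feature, w_unit) * w_unit

def membership_score(feature, w, mean_residual_dir):
    """Cosine similarity between a sample's residual and the mean
    residual direction of its predicted class (range [-1, 1])."""
    r = residual_vector(feature, w)
    return np.dot(r, mean_residual_dir) / (
        np.linalg.norm(r) * np.linalg.norm(mean_residual_dir))

rng = np.random.default_rng(1)
w = rng.normal(size=16)
# Toy in-distribution features for one class; the class's mean residual
# direction is estimated from them.
feats = rng.normal(size=(32, 16)) + 2.0
mean_dir = np.mean([residual_vector(f, w) for f in feats], axis=0)
score = membership_score(feats[0], w, mean_dir)
assert -1.0 - 1e-9 <= score <= 1.0 + 1e-9
```

Note that the online cost is just a handful of dot products over the feature dimension, which is consistent with the linear-time scoring discussed later in the article.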
CORE's ability to quantify membership relies on the principle that in-distribution data exhibits consistent clustering patterns within feature space. This is leveraged through the calculation of a Membership Score, and empirically demonstrated by Area Under the Receiver Operating Characteristic curve (AUROC) results of 86.7% when evaluated on the ResNet-50 architecture and 90.7% on the Swin-B architecture. These AUROC scores represent state-of-the-art performance in identifying in-distribution data, indicating the robustness and effectiveness of the residual-based approach for anomaly detection and out-of-distribution generalization.

The Geometry of Learning: Convergence and Its Implications
Neural collapse describes a striking phenomenon observed in the final layers of deep neural networks during training; as models learn, the features representing in-distribution samples progressively converge to a remarkably specific structure. This isn't simply a reduction in variance, but a systematic organization where features from the same class cluster tightly around class means, while simultaneously becoming increasingly orthogonal to features from different classes. As training progresses, within-class variance collapses toward the class means, and those means settle into a symmetric, predictable configuration. This convergence isn't a mere artifact of a particular optimizer, but appears to be a fundamental property of learned representations, enabling the network to efficiently discriminate between classes and generalize to unseen examples by establishing a highly organized feature space. The predictable structure resulting from neural collapse is crucial for understanding how models achieve robustness and reliability, particularly when evaluating membership scores and handling adversarial perturbations.
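One way to observe the collapse described above is to track within-class variance relative to the spread of the class means; the ratio shrinks toward zero as features cluster tightly. This toy measurement (data and names are illustrative assumptions) makes the effect concrete:

```python
import numpy as np

def within_to_between_ratio(features, labels):
    """Ratio of mean within-class variance to the variance of the class
    means. Values near zero indicate neural-collapse-like clustering."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    within = np.mean([features[labels == c].var() for c in classes])
    between = means.var()
    return within / between

rng = np.random.default_rng(2)
labels = np.repeat([0, 1, 2], 50)
centers = np.array([[5.0, 0.0], [0.0, 5.0], [-5.0, -5.0]])
# Tightly clustered features (late training) vs. diffuse ones (early training).
tight = centers[labels] + 0.1 * rng.normal(size=(150, 2))
diffuse = centers[labels] + 3.0 * rng.normal(size=(150, 2))
assert within_to_between_ratio(tight, labels) < within_to_between_ratio(diffuse, labels)
```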
The predictable convergence of neural network features isn't merely a structural curiosity; it directly enhances the precision of assessing sample authenticity. As features collapse towards a defined structure, the calculation of the "Class-Specific Mean Residual Direction" becomes significantly more reliable. This direction, representing the average deviation of a class's features from the overall mean, provides a robust basis for determining if a given sample genuinely belongs to that class. Consequently, the "Membership Score" – a metric quantifying this belonging – gains improved accuracy, offering a more trustworthy indicator of in-distribution status and bolstering the performance of anomaly detection systems. This refined ability to discern genuine samples from outliers is crucial for applications ranging from image recognition to fraud prevention, showcasing the practical benefits of understanding feature convergence.
Recent research indicates that the structure within the space orthogonal to the primary features extracted by transformer models isn't random noise, but rather contains substantial class-specific information. Analysis of Vision Transformer (ViT-B) and Swin Transformer (Swin-B) models reveals measurable "Alignment Gaps" – discrepancies between the expected and observed distributions – of 0.267 and 0.325 respectively, demonstrating this structured variance. These gaps suggest that significant portions of the orthogonal complement aren't uniformly distributed, but instead subtly encode class identities, offering potential avenues for enhanced classification and a more nuanced understanding of feature representation within these powerful models. This finding challenges the traditional view of orthogonal components as purely residual and highlights their potential contribution to discriminative power.
As deep neural networks become increasingly powerful, a phenomenon known as "Covariance Collapse" can undermine the reliability of distance-based metrics crucial for accurate classification. This collapse refers to the tendency of learned feature representations to concentrate towards the origin in feature space, effectively diminishing the variance within each class. Consequently, metrics like the Mahalanobis Distance – which rely on accurately capturing the shape and spread of data distributions – become less discriminative, hindering the network's ability to distinguish between different classes. This highlights a critical need for robust feature representation learning strategies that preserve sufficient inter-class variance and prevent the detrimental effects of covariance collapse, ensuring reliable performance even with highly concentrated feature spaces.
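The sensitivity of the Mahalanobis distance to a collapsing covariance can be shown directly: as the per-class covariance shrinks, the same small deviation produces an enormous distance, making the metric numerically fragile and poorly calibrated. The two-dimensional setup below is a toy assumption for illustration:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Standard Mahalanobis distance of `x` from a Gaussian (mean, cov)."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

mean = np.zeros(2)
x = np.array([0.5, 0.5])
healthy_cov = np.eye(2)            # well-spread class features
collapsed_cov = 1e-4 * np.eye(2)   # near-collapsed covariance
# The same offset looks two orders of magnitude more anomalous
# under the collapsed covariance.
assert mahalanobis(x, mean, collapsed_cov) > 50 * mahalanobis(x, mean, healthy_cov)
```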
Toward More Resilient Systems: Charting a Course for the Future
The effectiveness of outlier detection methods, such as the Energy Score, is often significantly enhanced through data preprocessing techniques like Z-Score Normalization. This standardization process rescales features to have a mean of zero and a standard deviation of one, ensuring that no single feature unduly influences distance calculations. Without consistent scaling, features with larger magnitudes can dominate the assessment of similarity, potentially masking subtle but important differences between in-distribution and out-of-distribution samples. By normalizing features, the Energy Score – and similar methods relying on distance metrics – can more accurately gauge the novelty of an observation, leading to improved robustness and a reduction in false positives when identifying unusual data points.
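A minimal sketch of z-score normalization as described above, applied to features whose dimensions live on very different scales; the data and names are illustrative assumptions:

```python
import numpy as np

def zscore(features, mean, std, eps=1e-8):
    """Standardize features with training-set statistics so that no
    single dimension dominates subsequent distance or score computations."""
    return (features - mean) / (std + eps)

rng = np.random.default_rng(3)
# One feature dimension has a much larger scale than the others.
train = rng.normal(size=(1000, 4)) * np.array([1.0, 1.0, 1.0, 100.0])
mu, sigma = train.mean(axis=0), train.std(axis=0)
normed = zscore(train, mu, sigma)
# After normalization every dimension has roughly zero mean and unit scale.
assert np.allclose(normed.mean(axis=0), 0.0, atol=1e-2)
assert np.allclose(normed.std(axis=0), 1.0, atol=1e-2)
```

In deployment, the same training-set statistics (`mu`, `sigma` here) would be reused to normalize incoming test samples, so in-distribution and out-of-distribution features are rescaled consistently.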
The reliability of outlier detection methods relying on K-Nearest Neighbors can be compromised by a phenomenon known as "hubness". This occurs when certain data points become excessively frequent neighbors to many others, effectively acting as "hubs" in the dataset. Consequently, distance calculations become distorted, as these hubs artificially appear close to a disproportionately large number of points, regardless of their actual similarity. This bias diminishes the accuracy of outlier identification, as genuine anomalies may be incorrectly classified as being near these central hubs, obscuring their true distance from the majority of the data. Addressing hubness is therefore crucial for improving the robustness and performance of distance-based outlier detection systems.
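Hubness is commonly quantified by counting how often each point appears among the k nearest neighbors of the others; a heavily skewed count distribution signals hub formation. This brute-force sketch (parameters and data are illustrative assumptions) computes those k-occurrence counts:

```python
import numpy as np

def k_occurrence(features, k=5):
    """For each point, count how many other points list it among
    their k nearest neighbors (excluding self-matches)."""
    dists = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbor
    counts = np.zeros(len(features), dtype=int)
    for row in dists:
        for idx in np.argsort(row)[:k]:
            counts[idx] += 1
    return counts

rng = np.random.default_rng(4)
pts = rng.normal(size=(200, 16))
counts = k_occurrence(pts, k=5)
# Neighbor slots are conserved: n points each pick k neighbors.
assert counts.sum() == 200 * 5
```

Points whose count far exceeds k are acting as hubs; inspecting the skew of `counts` is one simple diagnostic before trusting a KNN-based outlier score.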
The CORE method distinguishes itself through computational efficiency, achieving an online scoring complexity of O(d), where 'd' represents the dimensionality of the feature space. This linear complexity is a significant advancement, positioning CORE as the fastest feature-based Out-of-Distribution (OOD) detection method currently available. Unlike many approaches that require pre-computed nearest neighbor searches or incur higher computational costs with increasing feature dimensions, CORE's scoring process scales favorably, enabling real-time OOD detection even with high-dimensional data. This efficiency is achieved through a streamlined scoring function that directly leverages feature representations, avoiding the need for expensive distance calculations or complex data structures, thus making it particularly well-suited for deployment in resource-constrained environments or applications demanding rapid responses.
Continued research into out-of-distribution (OOD) detection necessitates a concentrated effort on bolstering the resilience of current methodologies. While advancements like CORE demonstrate promising efficiency, the persistence of challenges – such as hubness and the need for consistent data scaling – demands innovative solutions. Future development should prioritize techniques that not only address these inherent limitations but also enhance the adaptability of OOD detection systems to the complexities of real-world deployments. This includes exploring methods for dynamic calibration, improved feature representation learning, and the creation of benchmarks that more accurately reflect the diversity and ambiguity of practical data distributions, ultimately leading to more reliable and trustworthy applications of machine learning.
The pursuit of robust out-of-distribution (OOD) detection, as demonstrated by CORE, necessitates a rigorous testing of system boundaries. This work doesn't simply accept established feature spaces; it actively decomposes them into confidence and membership signals via orthogonal subspaces. It's a deliberate attempt to dismantle the conventional, to understand how a system fails before proclaiming its success. This aligns perfectly with the sentiment expressed by David Hilbert: "We must be able to answer the question: What are the prerequisites for the possibility of mathematics?" Hilbert's query, though framed for mathematics, echoes the CORE paper's underlying principle – that a thorough understanding demands a deconstruction of fundamental assumptions and a probing of limitations, especially when dealing with the complexities of neural networks and their inherent vulnerabilities to novel inputs.
Breaking the Mold
The pursuit of out-of-distribution (OOD) detection, as exemplified by CORE, isn't merely about flagging anomalies; it's about dissecting the very notion of "in-distribution". The method's decomposition of feature space into confidence and membership signals suggests a tacit acknowledgement that current neural networks aren't learning concepts, but rather exquisitely optimized input-output mappings. The residual, the "otherness", becomes as informative as the signal itself. Further investigation should, logically, focus on intentionally inducing and analyzing these orthogonal residuals – deliberately perturbing the input to force revelation of the network's underlying assumptions.
The demonstrated robustness across diverse architectures is encouraging, yet hints at a deeper, unsettling truth: that performance gains in OOD detection aren't necessarily stemming from improved understanding, but from increasingly clever exploitation of architectural weaknesses. A truly robust system shouldn't simply detect the unknown; it should gracefully degrade, offering a calibrated measure of its own uncertainty. This necessitates moving beyond post-hoc analysis and embedding inherent uncertainty quantification directly into the learning process.
Ultimately, the field needs to confront the implicit assumption that "in-distribution" is a stable, well-defined state. Real-world data rarely conforms to neat categories. The next challenge isn't simply better detection, but the construction of systems capable of adapting – of rewriting their internal representations in the face of genuine novelty. The goal isn't to build a perfect gatekeeper, but a perpetual student.
Original article: https://arxiv.org/pdf/2603.18290.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-21 23:28