Learning to Trust: Sharpening Aerial Imagery Analysis with Self-Supervised Learning

Author: Denis Avetisyan


A new approach to self-supervised learning leverages uncertainty estimation to build more robust representations from aerial imagery, even when data is incomplete or corrupted.

The framework introduces Trust-SSL, a self-supervised learning approach that refines contrastive alignment using per-factor trust weights derived from conflict and ignorance. An additive-residual selective alignment term adds a bounded, trust-aware correction to the standard contrastive gradient, preserving the base gradient while acknowledging and mitigating uncertainty in the learned representations. The framework pairs a ResNet-50 backbone with a standard SimCLR branch and an auxiliary corruption-family classifier.

This work introduces Trust-SSL, an additive-residual selective invariance method for enhancing the reliability of self-supervised learning in remote sensing applications.

Enforcing invariance is central to self-supervised learning, yet standard approaches struggle when aerial imagery is significantly degraded by real-world corruptions. This paper introduces ‘Trust-SSL: Additive-Residual Selective Invariance for Robust Aerial Self-Supervised Learning’, a novel training strategy and architectural modification that enhances robustness by incorporating a learned, per-sample trust weight into the alignment objective via an additive residual. Experiments across multiple datasets demonstrate that this additive approach, in contrast with multiplicative gating, improves representation learning and achieves state-of-the-art performance, particularly under severe data erasure. Could this uncertainty-aware formulation offer a generalizable principle for improving self-supervised learning across diverse data modalities and corruption types?


The Illusion of Perfect Representation

The escalating demand for insights from visual data – spanning Earth observation via remote sensing to the analysis of imagery in computer vision – frequently outpaces the availability of meticulously labeled datasets. This disparity has propelled Self-Supervised Learning (SSL) to the forefront of data science. SSL techniques ingeniously circumvent the need for exhaustive manual annotation by enabling models to learn meaningful representations directly from the inherent structure within unlabeled data. By formulating pretext tasks – such as predicting image rotations or completing missing patches – these models develop a deep understanding of visual features without explicit guidance. Consequently, SSL unlocks the potential of vast, readily available unlabeled datasets, significantly enhancing the performance and scalability of computer vision applications across diverse fields, from precision agriculture and urban planning to autonomous navigation and environmental monitoring.

Current self-supervised learning techniques, particularly those relying on contrastive methods, frequently encounter difficulties when applied to real-world remote sensing and computer vision tasks. These methods, while effective in controlled environments, exhibit a vulnerability to domain shifts – changes in data distribution between training and deployment – and the presence of corrupted inputs like noise or atmospheric distortions. This limitation stems from their reliance on learning feature similarities, which can be easily disrupted by even minor variations in input characteristics. Consequently, models trained with standard contrastive learning often experience a significant drop in performance when faced with data that deviates from the training distribution, hindering their practical applicability in dynamic and unpredictable environments.

The pursuit of genuinely robust representation learning in computer vision and remote sensing necessitates a shift towards models that don’t simply make predictions, but also assess their own confidence in those predictions. Traditional approaches often yield overconfident outputs when confronted with data differing from the training set, or when inputs are noisy or corrupted; a model capable of quantifying its uncertainty – expressing, for instance, a low confidence score for ambiguous or out-of-distribution samples – offers a crucial safeguard against erroneous decision-making. This ability to ‘know what it doesn’t know’ isn’t merely about flagging potentially incorrect results, but enables downstream systems to intelligently request further data, employ alternative algorithms, or defer to human oversight, thereby enhancing overall system reliability and safety in real-world applications. Such uncertainty estimates, often expressed as probabilities or variance, move beyond simple pattern recognition towards a more nuanced form of ‘intelligent’ data interpretation.

During Trust-SSL training, total loss decreases while conflict <span class="katex-eq" data-katex-display="false">\bar{K}</span> and ignorance <span class="katex-eq" data-katex-display="false">\bar{I}</span> diminish, alongside a reduction in auxiliary corruption-family classifier loss, all modulated by a selective term <span class="katex-eq" data-katex-display="false">\lambda_{\text{sel}}(e)</span> that increases between epochs 100 and 150.
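The ramp-up of the selective term between epochs 100 and 150 can be sketched as a simple schedule. Only the epoch window comes from the figure; the linear shape and the maximum value here are illustrative assumptions:

```python
import numpy as np

def lambda_sel(epoch, start=100, end=150, lam_max=1.0):
    """Linear ramp for the selective-alignment weight lambda_sel(e).

    Zero before `start`, ramps to `lam_max` by `end`, then stays flat.
    The 100-150 window follows the training curves; the linear shape
    and `lam_max` are illustrative assumptions.
    """
    t = np.clip((epoch - start) / (end - start), 0.0, 1.0)
    return lam_max * t

# off during warm-up, ramping mid-training, then saturated
print(lambda_sel(50), lambda_sel(125), lambda_sel(200))  # 0.0 0.5 1.0
```

Delaying the selective term like this lets the base contrastive objective settle before the trust-aware correction starts to shape the representation.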

Selective Invariance: A Façade of Robustness

Additive-Residual Selective Invariance presents a self-supervised learning (SSL) methodology that dynamically adjusts feature alignment based on learned confidence levels. This is achieved by introducing an additive residual connection alongside a mechanism that halts gradient propagation – the stop-gradient operation – allowing the network to selectively incorporate information from augmented views. Rather than rigidly enforcing feature alignment, the model learns to weigh the contribution of each augmented view based on a ‘trust’ signal, effectively prioritizing more reliable features during representation learning. This adaptive alignment facilitates a more robust and nuanced understanding of the input data, moving beyond traditional methods that treat all augmentations equally.

The Additive-Residual Selective Invariance (ARSI) method utilizes an additive residual connection in conjunction with a stop-gradient mechanism to control information propagation during self-supervised learning. Specifically, the residual branch allows the network to learn transformations independent of the primary alignment pathway. The stop-gradient operation, applied to the residual branch’s output before addition, prevents gradients from flowing back through this branch during the initial forward pass, effectively decoupling it from the core representation learning process. This selective blocking of gradients enables the residual to function as a learned trust signal, modulating the primary pathway’s output and allowing the model to selectively incorporate or ignore information based on the confidence derived from the residual branch; this promotes robust feature extraction by prioritizing reliable signals and mitigating the impact of noisy or irrelevant augmentations.
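The forward pass described above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation; `stop_grad` stands in for an autograd framework's detach/stop-gradient operation:

```python
import numpy as np

def stop_grad(x):
    # Identity in the forward pass; in an autograd framework this would be
    # torch.Tensor.detach() / jax.lax.stop_gradient(), which blocks the
    # backward pass through the residual branch.
    return x

def selective_align(z_base, z_residual, trust):
    """Additive-residual selective alignment (sketch, not the paper's exact code).

    z_base     : embedding from the primary pathway, shape (d,)
    z_residual : correction proposed by the residual branch, shape (d,)
    trust      : learned per-sample trust weight in [0, 1]

    The residual enters through a stop-gradient, so the base contrastive
    gradient is preserved and the correction stays a bounded add-on.
    """
    trust = np.clip(trust, 0.0, 1.0)             # keep the correction bounded
    return z_base + trust * stop_grad(z_residual)

z = selective_align(np.ones(4), np.full(4, 0.5), trust=0.8)
print(z)  # base 1.0 plus 0.8 * 0.5 -> [1.4 1.4 1.4 1.4]
```

The additive form matters: with `trust = 0` the output reduces exactly to the base pathway, which is what makes the correction a residual rather than a gate.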

The method decomposes the network’s output representation into multiple Factor Subspaces. Each subspace is designed to capture information relevant to a specific family of data augmentations. This decomposition isn’t a strict partitioning; rather, each subspace exhibits a weak association with a particular augmentation family, allowing for shared information and redundancy. The intention is to create a modular representation where changes induced by one augmentation family primarily affect the corresponding subspace, while leaving others relatively unchanged. This facilitates isolating the effects of individual augmentations and improving the model’s robustness and generalization capabilities by focusing learned features on specific data transformations.
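A minimal sketch of the decomposition, assuming equal-width subspaces (the paper's actual subspace construction may differ):

```python
import numpy as np

def split_factors(z, n_factors):
    """Partition an embedding into equal-width factor subspaces.

    Each slice is intended to be weakly associated with one augmentation
    family; equal widths are an illustrative assumption.
    """
    return np.split(z, n_factors)

rng = np.random.default_rng(0)
z = rng.normal(size=128)           # full representation
factors = split_factors(z, 4)      # e.g. one subspace per corruption family
print([f.shape for f in factors])  # [(32,), (32,), (32,), (32,)]
```

Because the association is weak rather than a hard partition, per-factor quantities such as conflict and ignorance can be computed on each slice while the full vector still serves downstream tasks.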

The method explicitly represents uncertainty by quantifying both conflict and ignorance during the self-supervised learning process. Conflict is measured as the disagreement between predictions generated from different augmented views of the same input, while ignorance reflects the model’s inability to confidently predict across all augmentation families. These metrics are not simply used as loss terms, but are tracked and utilized to modulate the learning process, allowing the model to assign higher trust to predictions supported by multiple, consistent views and to flag instances where data is inherently ambiguous or outside the training distribution. This explicit uncertainty representation facilitates more reliable predictions by enabling the model to abstain from making confident assertions when faced with conflicting or insufficient evidence, ultimately improving robustness and calibration.

Trust-SSL demonstrates strong robustness to erasure-type corruptions, particularly on the EuroSAT dataset, as indicated by its comparatively high top-1 accuracy across these scenarios.

Trust-SSL: The Illusion of Intelligent Features

Trust-SSL utilizes Additive-Residual Selective Invariance to decompose input representations into Factor Subspaces. This decomposition involves projecting data into multiple subspaces, each designed to capture distinct factors of variation. The “additive” component refers to the summation of representations learned within these subspaces, while the “residual” component captures information not explained by the primary factors. Selective Invariance ensures the model learns to focus on the most relevant subspaces for a given task, improving robustness and generalization. This approach allows the model to represent data in a more modular and interpretable way, facilitating the identification and isolation of key features.

Dempster-Shafer theory is utilized within Trust-SSL to aggregate evidence derived from the Factor Subspaces. This approach diverges from standard probabilistic methods by assigning masses to sets of possible outcomes, rather than single probabilities. Consequently, Dempster-Shafer Fusion allows the model to explicitly represent both conflicting evidence – where multiple subspaces disagree on a classification – and ignorance, indicating a lack of sufficient information within the subspaces to make a definitive judgment. The fusion process calculates the total belief in a given outcome by combining the masses assigned to all sets containing that outcome, while also accounting for conflicts through a normalization factor. This capability is crucial for handling noisy or ambiguous data commonly found in aerial imagery and allows Trust-SSL to avoid overconfident predictions when faced with uncertain inputs.
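Dempster's rule of combination can be shown concretely on a toy two-class frame (the class names are hypothetical). Mass assigned to the full frame plays the role of ignorance, and the normalization constant K is exactly the conflict the text describes:

```python
from itertools import product

def ds_combine(m1, m2):
    """Dempster's rule of combination over a discrete frame.

    m1, m2: dicts mapping frozensets (focal elements) to masses summing to 1.
    Returns (fused masses, conflict K). Mass on the full frame represents
    ignorance. Assumes K < 1 (not total conflict).
    """
    fused, K = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + ma * mb
        else:
            K += ma * mb                  # mass on disjoint sets = conflict
    for x in fused:                       # renormalize by 1 - K
        fused[x] /= (1.0 - K)
    return fused, K

theta = frozenset({"urban", "forest"})              # full frame
m1 = {frozenset({"urban"}): 0.6, theta: 0.4}        # 0.4 = ignorance
m2 = {frozenset({"forest"}): 0.5, theta: 0.5}
fused, K = ds_combine(m1, m2)
print(round(K, 2))  # 0.3 -- the 'urban' vs 'forest' disagreement
```

Unlike a softmax average, the fused result keeps residual mass on the full frame, so the model can report "don't know" instead of manufacturing a confident class probability.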

Evidential Belief States (EBS) within Trust-SSL represent uncertainty using a probability mass function over possible labels, rather than a single point estimate. This allows the model to explicitly quantify both aleatoric (data-dependent) and epistemic (model-dependent) uncertainty. Each EBS consists of a vector of beliefs, one for each class, and a belief mass representing the degree of ignorance or conflict. The summation of beliefs for all classes does not necessarily equal one, accommodating abstention from prediction when evidence is insufficient. This framework enables Dempster-Shafer fusion to combine EBS from Factor Subspaces in a mathematically consistent manner, effectively modeling ambiguity and allowing the model to defer judgment when facing conflicting or incomplete information.
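One common way to realize such a belief state, shown here as an illustrative stand-in for the paper's exact formulation, maps non-negative per-class evidence to beliefs plus an explicit ignorance mass (a subjective-logic-style construction):

```python
import numpy as np

def belief_state(evidence):
    """Map non-negative per-class evidence to beliefs plus an ignorance mass.

    Beliefs sum to less than 1; the remainder is the explicit ignorance u,
    which grows as total evidence shrinks. This is an illustrative
    construction, not necessarily the paper's exact EBS parameterization.
    """
    evidence = np.asarray(evidence, dtype=float)
    K = len(evidence)               # number of classes
    S = evidence.sum() + K
    beliefs = evidence / S
    ignorance = K / S
    return beliefs, ignorance

b, u = belief_state([8.0, 0.0])   # strong, one-sided evidence
print(round(u, 2))                 # 0.2 -> low ignorance
b, u = belief_state([0.0, 0.0])   # no evidence at all
print(round(u, 2))                 # 1.0 -> total ignorance, model can abstain
```

Because beliefs and ignorance always sum to one, these states plug directly into the Dempster-Shafer fusion above: ignorance is simply mass placed on the full frame.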

Evaluation of the Trust-SSL model on standard aerial imagery benchmarks demonstrates a mean linear probe accuracy of 90.20%. This performance metric indicates the quality of the learned feature representations when evaluated using a linear classifier trained on the frozen features. Comparative analysis against existing self-supervised learning methods operating on the same datasets confirms that Trust-SSL achieves state-of-the-art results, representing a statistically significant improvement in feature quality for downstream tasks within the aerial imagery domain.

Analysis of factor-averaged conflict <span class="katex-eq" data-katex-display="false">\bar{K}</span> and ignorance <span class="katex-eq" data-katex-display="false">\bar{I}</span> across varying corruption severities on EuroSAT reveals monotonic increases in conflict for contradiction-family corruptions, information loss for weather corruptions, and a counter-predicted moderate rise in conflict with slight ignorance decrease for erasure-family corruptions.

Beyond the Benchmarks: A Fragile Resilience

To assess the quality of the learned representations, a linear probe was employed to evaluate the pretrained features extracted by Trust-SSL across diverse aerial scene classification tasks. The methodology involved freezing the weights of the Trust-SSL model and training a simple linear classifier on top of the extracted features using datasets such as EuroSAT, AID, and NWPU-RESISC45. This approach effectively isolates the feature learning capability of Trust-SSL, revealing how well the model captures meaningful information from unlabeled aerial imagery. Results from these datasets demonstrate the model’s ability to generalize to different scene types and image resolutions, providing a strong indication of the robustness and transferability of the learned features for downstream applications in remote sensing and geospatial analysis.
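The linear-probe protocol itself is straightforward. A self-contained sketch on synthetic stand-in features, using a least-squares linear classifier in place of a full training loop, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen Trust-SSL features: two well-separated clusters,
# one per class (real probes use features from the frozen backbone).
X = np.vstack([rng.normal(-2, 1, (100, 16)), rng.normal(2, 1, (100, 16))])
y = np.array([0] * 100 + [1] * 100)

# A linear probe is a single linear layer trained on frozen features;
# here a least-squares fit to one-hot targets stands in for that layer.
Xb = np.c_[X, np.ones(len(X))]                     # add a bias column
W, *_ = np.linalg.lstsq(Xb, np.eye(2)[y], rcond=None)
acc = ((Xb @ W).argmax(axis=1) == y).mean()
print(f"linear probe accuracy: {acc:.2f}")         # near 1.0 on separable features
```

Freezing the backbone is the point of the protocol: any accuracy the probe achieves must come from the representation, not from task-specific fine-tuning.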

To rigorously evaluate the model’s practical utility, researchers subjected the pretrained features to a battery of corruption scenarios. This testing process deliberately introduced various forms of noise and degradation, simulating real-world conditions such as blur, fog, snow, and differing levels of compression. The model’s performance under these adverse conditions revealed a significant degree of robustness, consistently outperforming baseline methods like SimCLR – achieving gains of up to 19.9 points on corrupted images. This resilience suggests that Trust-SSL learns features that are less sensitive to superficial distortions, focusing instead on the core, semantic characteristics of aerial scenes and offering a more reliable foundation for downstream tasks even with imperfect input data.
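An erasure-family corruption of the kind used in such stress tests can be sketched as follows. The patch counts and sizes per severity level are illustrative assumptions, not the benchmark's exact recipe:

```python
import numpy as np

def erase(img, severity, rng):
    """Erasure-family corruption: zero out random square patches.

    Higher `severity` erases more patches; the mapping from severity to
    patch count and size is an illustrative assumption.
    """
    out = img.copy()
    h, w = img.shape[:2]
    ph, pw = h // 8, w // 8                     # patch size
    for _ in range(severity * 2):               # patch count scales with severity
        r = rng.integers(0, h - ph)
        c = rng.integers(0, w - pw)
        out[r:r + ph, c:c + pw] = 0
    return out

rng = np.random.default_rng(0)
img = np.ones((64, 64, 3))
corrupted = erase(img, severity=3, rng=rng)
print(f"erased fraction: {1 - corrupted.mean():.2f}")
```

Sweeping `severity` while holding the frozen features fixed is what produces the per-severity robustness curves reported for Trust-SSL and its baselines.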

To assess the adaptability of Trust-SSL, researchers investigated its performance on a markedly different dataset – the BDD100K driving dataset – through cross-domain transfer learning. This experiment moved beyond typical aerial scene classification to evaluate the model’s capacity to extract and utilize features applicable to ground-level imagery captured in a driving context. By training on aerial scenes and testing on driving scenes, the study gauged whether the learned representations were sufficiently generalizable to overcome the substantial domain gap in image characteristics, such as viewpoint, lighting, and object composition. Successful transfer learning to BDD100K indicates that Trust-SSL doesn’t simply memorize aerial features, but instead learns underlying visual principles applicable to a broader range of visual data, highlighting its potential for diverse applications beyond remote sensing.

Evaluations reveal that Trust-SSL exhibits substantial improvements in handling compromised visual data; specifically, the model achieves gains of up to 19.9 percentage points when classifying images subjected to various corruptions, surpassing the performance of the SimCLR framework. This resilience extends to scenarios with significant occlusion, as demonstrated by a 5.4% increase in accuracy on the challenging NWPU-RESISC45 dataset, a benchmark known for its obscured imagery. These findings suggest Trust-SSL learns more robust feature representations, enabling it to maintain classification accuracy even when presented with noisy or incomplete visual information, indicating a heightened ability to generalize beyond ideal conditions.

Across challenging weather conditions in the BDD100K dataset, Trust-SSL and its variants consistently outperformed self-supervised learning methods like SimCLR, BYOL, and VICReg, as measured by Mahalanobis AUROC, demonstrating robustness to distribution shift.

The pursuit of robust self-supervised learning, as demonstrated in this work with additive-residual selective invariance, feels predictably cyclical. It’s another layer of complexity built to address the inherent fragility of even the most elegant systems. The researchers attempt to build resilience against data corruption, but one suspects production environments will inevitably discover new and creative ways to break things. Fei-Fei Li once said, “AI is not about building machines that think like humans; it’s about building machines that help humans think.” This feels apt; this isn’t about achieving perfect representation, it’s about building tools that slightly delay the inevitable moment when the whole thing falls apart, all while accruing technical debt. Everything new is just the old thing with worse docs.

What Lies Ahead?

The pursuit of robust self-supervised learning, particularly for remote sensing data, invariably circles back to the problem of distribution shift. This work demonstrates a nuanced improvement, an additive residual for uncertainty, but it merely refines the inevitable. Every elegantly constructed invariance will eventually fracture against an unforeseen corruption, a novel sensor artifact, or the simple, brutal reality of atmospheric conditions. The current focus on uncertainty estimation feels almost optimistic. It suggests a belief that knowing when a representation is failing somehow mitigates the failure itself.

Future work will undoubtedly explore scaling these techniques – larger datasets, more complex architectures. Yet, the deeper challenge remains: how to build representations that are not simply less surprised by the unexpected, but fundamentally unaffected by it. The additive residual is a clever mechanism, but it’s still a reactive measure. The next iteration will likely involve attempts to preemptively model the space of possible corruptions, a task that borders on the computationally intractable.

It is worth remembering that every abstraction dies in production. Selective invariance, uncertainty quantification, even the very concept of a ‘robust’ representation – all will succumb to edge cases. The art, then, isn’t in eliminating failure, but in designing systems that fail gracefully, and whose failures are, at the very least, instructive.


Original article: https://arxiv.org/pdf/2604.21349.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-04-24 18:49