TL;DR

We present a comprehensive study of advanced techniques for remote sensing representation learning, covering recent self-supervised learning approaches, such as contrastive learning, masked-autoencoder-style methods, and JEPA-style methods, as well as strategies for fine-tuning on downstream tasks.

Multimodal SSL for Remote Sensing Overview
Multimodal Self-Supervised Learning for Remote Sensing Overview. The taxonomy includes data combinations (multimodal images, text, ground-level geospatial data), multimodal SSL objectives (discriminative-based, generative-based), fusion strategies (objective-driven, decision-level, hierarchical, shared encoder, FM-powered), and diverse downstream tasks ranging from image processing to socioeconomic prediction. Image Credit: [Bai et al., 2025]

Introduction

The success of foundation models, especially large language models (LLMs), has demonstrated the effectiveness of the self-supervised pre-training plus domain-supervised fine-tuning strategy. This paradigm has been successfully adopted in the remote sensing domain, achieving remarkable results. Remote sensing data is often considered a distinct modality in machine learning due to its unique spatio-temporal characteristics, varying resolutions, and diverse sensor types [Rolf et al., 2024].

Satellite images variation
Left: SatML multiple observations. Center: SatML deployment considerations. Right: SatML enriching ML research.
Upper. Satellite images of the same location can vary widely depending on factors like spatial resolution and cropping extent, temporal dimension, and satellite mission or instrument. ML methods that leverage these factors can drastically outperform methods for general images. Left. In SatML, multiple observations and multiple (or no) labels may correspond to a given (lat, lon, time) index, whereas in many ML settings, labels are defined directly from images. Center. SatML has distinct considerations for deployment and evaluation. Deployment datasets are often dense, and much larger than training datasets. Spatio-temporal covariate shifts necessitate spatially aware model validation for out-of-sample model deployment. Right. SatML can enrich many research areas in ML, e.g., multi-modal, self-supervised, and distributionally robust learning. Image Credit: [Rolf et al., 2024]

Review: Self-Supervised Learning in Remote Sensing (2020-2025)

We review recent papers that apply self-supervised learning to remote sensing imagery.

SSL Methods for RS Overview
Self-Supervised Learning Methods for Remote Sensing Overview. Top left: Contrastive methods attract representations originating from the same sample and repel representations from other samples. Top center: Reconstruction methods predict pixels of hidden patches. Top right: I-JEPA and AnySat predict representations of hidden patches. Bottom left: LatentMIM (which lacks an RS instantiation) attracts representations originating from the same patch and repels representations from other patches. Bottom right (Galileo): Galileo simultaneously attracts representations of the same patch at varied levels while repelling those of other patches, and attracts pixel predictions of the same patch while repelling others. This strategy encourages learning both global and local features. Image Credit: [Tseng et al., 2025]

1. Contrastive Learning

1.1 CLIP-style Contrastive Learning

General Domain CLIP
General Domain: CLIP [Radford et al., 2021] architecture. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes. Improvements in contrastive scaling were also demonstrated in SimCLRv2 [Chen et al., 2020].
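To make the pairing objective concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss; `image_emb` and `text_emb` stand for the outputs of the two encoders for a batch of paired examples (names are illustrative, not from the CLIP codebase):

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss,
# assuming image_emb and text_emb are (B, d) embeddings of paired examples.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits: row i should match column i.
    logits = image_emb @ text_emb.t() / temperature          # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image->text and text->image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```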

Adoption in Remote Sensing:

RS Domain RemoteCLIP
Remote Sensing: RemoteCLIP [Liu et al., 2024] architecture adapted for remote sensing data. The contribution of RemoteCLIP compared to CLIP is that it converts heterogeneous annotations into a unified image-caption data format based on box-to-caption (B2C) and mask-to-box (M2B) conversion.
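As an illustration of the idea (not RemoteCLIP's exact conversion rules), a box-to-caption style conversion can be as simple as turning per-box class labels into a count-based caption; `boxes_to_caption` is a hypothetical helper:

```python
# Illustrative sketch (not RemoteCLIP's exact B2C rules): detection
# annotations become simple count-based captions for image-text pretraining.
from collections import Counter

def boxes_to_caption(labels: list[str]) -> str:
    """Turn a list of per-box class labels into a short caption."""
    counts = Counter(labels)
    parts = [f"{n} {cls}{'s' if n > 1 else ''}" for cls, n in counts.items()]
    return "An aerial image containing " + ", ".join(parts) + "."

# Example: ['airplane', 'airplane', 'vehicle'] ->
# "An aerial image containing 2 airplanes, 1 vehicle."
```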

1.2 Barlow Twins-style Contrastive Learning

General Domain Barlow Twins
General Domain: Barlow Twins [Zbontar et al., 2021] architecture. It performs self-supervised learning (SSL) on images alone, in contrast to CLIP's image-text alignment. Barlow Twins' objective function measures the cross-correlation matrix between the embeddings of two identical networks fed with distorted versions of a batch of samples, and tries to make this matrix close to the identity. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. Non-contrastive approaches like BYOL [Grill et al., 2020] also achieve similar goals without negative pairs.
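A minimal sketch of the Barlow Twins objective described above, with `z_a` and `z_b` denoting the projector outputs for the two distorted views of a batch (the variable names and the `lambd` default are illustrative):

```python
# Minimal sketch of the Barlow Twins objective: drive the cross-correlation
# matrix of two batch-normalized embeddings toward the identity.
import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                      lambd: float = 5e-3) -> torch.Tensor:
    B, D = z_a.shape
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)

    c = (z_a.t() @ z_b) / B                          # (D, D) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()   # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy term
    return on_diag + lambd * off_diag
```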
RS Domain DECUR
Remote Sensing: DeCUR [Wang et al., 2024a] architecture for multimodal self-supervised learning. DeCUR decouples common and unique representations across modalities. \( M1 \) and \( M2 \) represent two modalities. Two augmented views from each modality are fed to modality-specific encoders \( (E1,E2) \) and projectors \( (P1,P2) \) to get embeddings \( Z \). For cross-modal embeddings, dimensions are separated into common and unique parts: the correlation matrix of the common dimensions is optimized to be close to the identity, while that of the unique ones is pushed toward zero. For intra-modal embeddings, both common and unique dimensions are used to calculate the correlation matrix, which is optimized to be close to the identity. DeCUR optionally adds deformable attention (the green shadowed region on the right side) in the last layers of ConvNet encoders to boost modality-informative learning.
📐 Detailed Loss Definition of DeCUR

Cross-Correlation Matrix

Given two embedding vectors \( Z^A, Z^B \in \mathbb{R}^K \), the cross-correlation matrix \( \mathcal{C} \) between them is formulated as:

$$ \mathcal{C}_{ij} = \frac{\sum_b z_{b,i}^A z_{b,j}^B}{\sqrt{\sum_b (z_{b,i}^A)^2} \sqrt{\sum_b (z_{b,j}^B)^2}} $$

where \( b \) indexes batch samples, and \( i, j \) index the dimension of the embedding vectors. \( \mathcal{C} \in \mathbb{R}^{K \times K} \) is a square matrix with values ranging from -1 to 1.
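A direct translation of this definition into code (a sketch; names are illustrative):

```python
# Sketch of the cross-correlation matrix C used by DeCUR (and Barlow Twins):
# entries are per-dimension correlations of two embeddings over the batch.
import torch

def cross_correlation(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """z_a, z_b: (B, K) embeddings; returns C of shape (K, K) in [-1, 1]."""
    num = z_a.t() @ z_b                                # sum_b z_a[b,i] * z_b[b,j]
    denom = z_a.pow(2).sum(0).sqrt().unsqueeze(1) * \
            z_b.pow(2).sum(0).sqrt().unsqueeze(0)
    return num / (denom + 1e-12)
```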

Cross-Modal Representation Decoupling

The correlation matrix \( \mathcal{C} \) is calculated between two embeddings from different modalities, such as \( Z_{M1}' \) and \( Z_{M2}' \). The total embedding dimension \( K \) is separated into \( K_c \) and \( K_u \) with \( K_c + K_u = K \) to store common and unique representations, respectively.

Cross-modal common representations loss:

$$ \mathcal{L}_{com} = \sum_i (1 - \mathcal{C}_{cii})^2 + \lambda_c \cdot \sum_i \sum_{j \neq i} \mathcal{C}_{cij}^2 $$

where \( \lambda_c \) is a positive constant trading off the importance of the first invariance term (to make the common embeddings invariant to the input modalities) and the second redundancy reduction term (to decorrelate the embedding vector components and avoid model collapse).

Modality-unique representations loss:

$$ \mathcal{L}_{uni} = \sum_i \mathcal{C}_{uii}^2 + \lambda_u \cdot \sum_i \sum_{j \neq i} \mathcal{C}_{uij}^2 $$

where \( \lambda_u \) is a positive constant trading off the importance of the first decorrelation term (to decorrelate different modalities) and the second redundancy reduction term (to decorrelate the embedding vector components).

Intra-Modal Representation Enhancement

To avoid the collapse of the decoupled unique dimensions in the cross-modal training, as well as to boost intra-modal representations, intra-modal training that covers all the embedding dimensions is introduced. For each modality, a cross-correlation matrix \( \mathcal{C}_{M1} \) (or \( \mathcal{C}_{M2} \)) is generated from the full dimensions of the embedding vectors \( Z_{M1}' \) and \( Z_{M1}'' \) (or \( Z_{M2}' \) and \( Z_{M2}'' \)).

Intra-modal losses:

$$ \mathcal{L}_{M1} = \sum_i (1 - \mathcal{C}_{\text{M1}ii})^2 + \lambda_{M1} \cdot \sum_i \sum_{j \neq i} \mathcal{C}_{\text{M1}ij}^2 $$

$$ \mathcal{L}_{M2} = \sum_i (1 - \mathcal{C}_{\text{M2}ii})^2 + \lambda_{M2} \cdot \sum_i \sum_{j \neq i} \mathcal{C}_{\text{M2}ij}^2 $$

where \( \lambda_{M1} \) and \( \lambda_{M2} \) are positive constants trading off the importance of the invariance term and the redundancy reduction term.

Overall Training Objective

Combining the cross-modal common and unique losses, and the intra-modal losses, the overall training objective of DeCUR reads:

$$ \mathcal{L} = \mathcal{L}_{com} + \mathcal{L}_{uni} + \mathcal{L}_{M1} + \mathcal{L}_{M2} $$
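Putting the pieces together, a hedged sketch of the full DeCUR objective using the `cross_correlation` helper sketched above; `Z1p`/`Z1pp` and `Z2p`/`Z2pp` denote the two augmented views of modalities \( M1 \) and \( M2 \), and the trade-off constants are illustrative defaults, not the authors' values:

```python
# Hedged sketch of the DeCUR objective built from cross_correlation() above.
# Z1p, Z1pp: two views of modality M1; Z2p, Z2pp: two views of modality M2.
# Each has shape (B, K); the first K_c dimensions are "common", the rest "unique".
import torch

def on_off_diag(c: torch.Tensor):
    on = torch.diagonal(c)
    off = c - torch.diag(on)
    return on, off

def decur_loss(Z1p, Z1pp, Z2p, Z2pp, K_c: int,
               lam_c=5e-3, lam_u=5e-3, lam_m=5e-3) -> torch.Tensor:
    # Cross-modal correlation matrix, split into common / unique blocks.
    C = cross_correlation(Z1p, Z2p)
    C_com, C_uni = C[:K_c, :K_c], C[K_c:, K_c:]

    on_c, off_c = on_off_diag(C_com)
    L_com = ((1 - on_c) ** 2).sum() + lam_c * (off_c ** 2).sum()   # identity target

    on_u, off_u = on_off_diag(C_uni)
    L_uni = (on_u ** 2).sum() + lam_u * (off_u ** 2).sum()          # zero target

    # Intra-modal terms over the full embedding dimension.
    L_intra = 0.0
    for Za, Zb in ((Z1p, Z1pp), (Z2p, Z2pp)):
        on_m, off_m = on_off_diag(cross_correlation(Za, Zb))
        L_intra = L_intra + ((1 - on_m) ** 2).sum() + lam_m * (off_m ** 2).sum()

    return L_com + L_uni + L_intra
```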

Deformable Attention for Modality-Informative Features

Apart from the DeCUR loss design, deformable attention is adopted to help ConvNet models focus on modality-informative regions. The deformable attention module was proposed in DAT and DAT++ to efficiently model the relations among feature tokens under the guidance of the important regions in the feature maps. Given an input feature map \( x \in \mathbb{R}^{H \times W \times C} \), a downsampled grid of points \( p \in \mathbb{R}^{H_G \times W_G \times 2} \) is generated as references, where \( H_G = H/r \) with \( r \) being the downscaling ratio. The query tokens \( q \) are fed into a lightweight sub-network \( \theta_{\text{offset}} \) to generate the offsets \( \Delta p \in \mathbb{R}^{H_G \times W_G \times 2} \) in order to get final deformed points with \( p + \Delta p \). Then the features are sampled from \( x \) at the locations of deformed points and interpolated to a feature map \( \bar{x} \in \mathbb{R}^{H_G \times W_G \times C} \). This sampled feature map \( \bar{x} \) is projected to keys \( k \) and values \( v \), where \( k = \bar{x}W_k \) and \( v = \bar{x}W_v \). Softmax attention is then calculated on flattened queries \( q \) and keys \( k \) and multiplied with values \( v \). The final output is reshaped back to the same size as the input feature map \( x \).

The deformable attention module is adopted in the last two stages of the encoder to learn regional focus while keeping efficiency. A residual connection from the input feature map to the output of the deformable attention module is added to restrict unexpected influences of the attention module, such as biasing the pretraining towards the pretext task by selecting unexpected deformable points.
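The following is a simplified sketch of a deformable-attention block in the spirit of DAT/DeCUR as described above: a downsampled reference grid is shifted by predicted offsets, features are bilinearly sampled at the deformed points and projected to keys and values, and a residual connection is added. The offset sub-network and other details are assumptions, not the authors' implementation:

```python
# Hedged sketch of a deformable-attention block (DAT-style), simplified.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention(nn.Module):
    def __init__(self, dim: int, downscale: int = 4):
        super().__init__()
        self.r = downscale
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Lightweight offset sub-network theta_offset (assumed form).
        self.offset_net = nn.Sequential(
            nn.Conv2d(dim, dim, 3, stride=downscale, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, 2, 1),
            nn.Tanh(),                       # keep offsets within [-1, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, C, H, W) feature map; returns a tensor of the same shape."""
        B, C, H, W = x.shape
        q = self.to_q(x.flatten(2).transpose(1, 2))            # (B, H*W, C)

        # Reference grid p of size (H/r, W/r) in normalized [-1, 1] coords.
        Hg, Wg = H // self.r, W // self.r
        ys = torch.linspace(-1, 1, Hg, device=x.device)
        xs = torch.linspace(-1, 1, Wg, device=x.device)
        ref = torch.stack(torch.meshgrid(xs, ys, indexing="xy"), dim=-1)  # (Hg, Wg, 2)

        # Offsets Delta p predicted from the features, added to the grid.
        offsets = self.offset_net(x).permute(0, 2, 3, 1)        # (B, Hg, Wg, 2)
        grid = (ref.unsqueeze(0) + offsets).clamp(-1, 1)

        # Bilinearly sample deformed features x_bar, project to keys/values.
        x_bar = F.grid_sample(x, grid, align_corners=True)      # (B, C, Hg, Wg)
        kv = x_bar.flatten(2).transpose(1, 2)                   # (B, Hg*Wg, C)
        k, v = self.to_k(kv), self.to_v(kv)

        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
        out = attn @ v                                          # (B, H*W, C)
        return out.transpose(1, 2).reshape(B, C, H, W) + x      # residual connection
```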

2. Masked Autoencoder (MAE)

General Domain MAE
General Domain: Masked Autoencoder (MAE) [He et al., 2022] architecture. During pre-training, a large random subset of image patches (e.g., 75%) is masked out. The encoder is applied to the small subset of visible patches. Mask tokens are introduced after the encoder, and the full set of encoded patches and mask tokens is processed by a small decoder that reconstructs the original image in pixels. After pre-training, the decoder is discarded and the encoder is applied to uncorrupted images (full sets of patches) for recognition tasks.
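The core of MAE pre-training is the random patch masking; a minimal sketch (illustrative function and variable names):

```python
# Minimal sketch of MAE-style random masking: keep a small subset of patch
# tokens for the encoder; the decoder later fills in mask tokens at the
# removed positions and reconstructs pixels.
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: (B, N, D) patch embeddings. Returns visible tokens and indices."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=tokens.device)      # random score per patch
    ids_shuffle = noise.argsort(dim=1)                  # ascending: lowest scores kept
    ids_keep = ids_shuffle[:, :n_keep]

    visible = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask (1 = masked) in the original patch order, used by the loss.
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask, ids_keep
```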

Adoption in Remote Sensing:

RS Domain SatMAE
Remote Sensing: SatMAE [Cong et al., 2022] architecture for temporal and multi-spectral satellite imagery.
RS Domain SatMAE++
Remote Sensing: SatMAE++ [Noman et al., 2024] performs multi-scale pre-training and uses convolution-based upsampling blocks to reconstruct the image at higher scales, making the approach extensible to additional scales. Compared to existing works, SatMAE++ with multi-scale pre-training is equally effective for both optical and multi-spectral imagery.
SatMAE++ reconstruction results at multi-scale level SatMAE++ vs SatMAE reconstruction comparison
SatMAE++ reconstruction results at multiple scales. Examples from the fMoW-Sentinel dataset are shown; for illustration, only the RGB channels of the multi-spectral data are displayed. The images are reconstructed at resolutions of \((H, W)\), \((2H, 2W)\), and \((4H, 4W)\), respectively. The proposed model provides better reconstruction results than SatMAE at resolution \((H, W)\). Comparing against the SatMAE baseline, SatMAE's reconstructions of visible patches are worse than those of masked patches, whereas SatMAE++ produces much better results on all patches, including the visible ones. These results demonstrate the effectiveness of the multi-scale pre-training framework of SatMAE++.
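A hedged sketch of what convolution-based multi-scale reconstruction heads could look like (an illustration of the idea, not the official SatMAE++ code); reconstruction losses at \((H, W)\), \((2H, 2W)\), and \((4H, 4W)\) would be summed during pre-training:

```python
# Hedged sketch of convolution-based upsampling heads reconstructing at
# three successive scales from a decoded feature map.
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),  # 2x spatial
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)

class MultiScaleHead(nn.Module):
    """Decodes a (B, dim, h, w) feature map into reconstructions at (h, w),
    (2h, 2w), and (4h, 4w)."""
    def __init__(self, dim: int, out_ch: int = 3):
        super().__init__()
        self.up1 = UpsampleBlock(dim, dim // 2)
        self.up2 = UpsampleBlock(dim // 2, dim // 4)
        self.heads = nn.ModuleList(
            [nn.Conv2d(c, out_ch, 1) for c in (dim, dim // 2, dim // 4)])

    def forward(self, feat):
        f1 = feat
        f2 = self.up1(f1)
        f3 = self.up2(f2)
        return [head(f) for head, f in zip(self.heads, (f1, f2, f3))]
```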
RS Domain MMEarth
Remote Sensing: MMEarth [Nedungadi et al., 2024] architecture extends Masked Autoencoders, which reconstruct only the input image, by incorporating multiple pretext tasks using aligned pixel-level as well as image-level modalities.
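Conceptually, the multi-pretext objective amounts to one shared encoder feeding several modality-specific decoders whose reconstruction losses are summed; a hedged sketch (modality names, weights, and the loss choice are illustrative):

```python
# Hedged sketch of a multi-pretext objective over aligned target modalities:
# one shared representation, one lightweight decoder per target, summed losses.
import torch.nn.functional as F

def multimodal_pretext_loss(shared_feat, decoders: dict, targets: dict,
                            weights=None):
    """decoders/targets are keyed by modality name, e.g. 's2', 'dem', 'climate'."""
    total = 0.0
    for name, decoder in decoders.items():
        pred = decoder(shared_feat)
        w = 1.0 if weights is None else weights.get(name, 1.0)
        total = total + w * F.mse_loss(pred, targets[name])
    return total
```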

3. I-JEPA Style

Background (General Domain):

Common architectures for self-supervised learning I-JEPA architecture details
Common architectures for self-supervised learning, in which the system learns to capture the relationships between its inputs. The objective is to assign a high energy (a large scalar value) to incompatible inputs, and a low energy (a small scalar value) to compatible inputs. (a) Joint-Embedding Architectures (e.g., CLIP, SimCLR) learn to output similar embeddings for compatible inputs \( x, y \) and dissimilar embeddings for incompatible inputs. (b) Generative Architectures (e.g., MAE) learn to directly reconstruct a signal \( y \) from a compatible signal \( x \), using a decoder network that is conditioned on additional (possibly latent) variables \( z \) to facilitate reconstruction. (c) Joint-Embedding Predictive Architectures (e.g., I-JEPA) learn to predict the embeddings of a signal \( y \) from a compatible signal \( x \), using a predictor network that is conditioned on additional (possibly latent) variables \( z \) to facilitate prediction.

The Image-based Joint-Embedding Predictive Architecture (I-JEPA) [Assran et al., 2023] uses a single context block to predict the representations of various target blocks originating from the same image. The context encoder is a Vision Transformer (ViT) that processes only the visible context patches. The predictor is a narrow ViT that takes the context-encoder output and, conditioned on positional tokens (shown in color), predicts the representation of a target block at a specific location. The target representations correspond to the outputs of the target encoder, whose weights are updated at each iteration via an exponential moving average of the context-encoder weights.
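A minimal sketch of an I-JEPA-style training step following this description; the `predictor(ctx, target_idx)` signature, the L2-style loss, and the momentum value are illustrative assumptions:

```python
# Minimal sketch of an I-JEPA-style step: a context encoder sees visible
# patches, a narrow predictor predicts target-block representations, and the
# target encoder is an EMA copy of the context encoder (stop-gradient targets).
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_enc, context_enc, momentum: float = 0.996):
    for pt, pc in zip(target_enc.parameters(), context_enc.parameters()):
        pt.mul_(momentum).add_(pc.detach(), alpha=1 - momentum)

def ijepa_step(context_enc, target_enc, predictor,
               context_patches, all_patches, target_idx):
    # Targets: representations of the target blocks from the EMA target encoder.
    with torch.no_grad():
        targets = target_enc(all_patches)[:, target_idx]     # (B, M, D)

    # Predict target-block representations from the visible context only.
    ctx = context_enc(context_patches)                        # (B, Nc, D)
    preds = predictor(ctx, target_idx)                        # (B, M, D), assumed signature

    loss = F.mse_loss(preds, targets)                         # L2-style latent regression
    # (backward pass and optimizer step omitted for brevity)
    ema_update(target_enc, context_enc)
    return loss
```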

Adoption in Remote Sensing:

RS Domain Adoption of I-JEPA style
Remote Sensing: AnySat [Astruc et al., 2025] architecture. Each iteration begins by randomly selecting a dataset among GeoPlex and sampling a tile. Each available modality is divided into spatially aligned patches of size \( P \). The student network's patch encoder \( \phi_S^{\text{patch}} \) embeds each patch, and a contrastive loss encourages spatial consistency across modalities. Dropping and masking are then applied: some patches have all modalities removed (dropping), while others have only random modalities removed (masking). The remaining patches are merged in the modality combiner \( \phi_S^{\text{comb}} \) to form multimodal representations \( f_S^{\star} \) for the non-dropped patches. The predictor \( \phi_S^{\text{pred}} \) then reconstructs the embeddings of the dropped patches. Finally, the student network's output is compared to the teacher's, whose weights are an Exponential Moving Average (EMA) of the student's weights and which processes the complete set of patches without masking or dropping.
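A hedged sketch of the dropping-versus-masking distinction on spatially aligned multimodal patch tokens (ratios and names are illustrative, not AnySat's defaults):

```python
# Hedged sketch of AnySat-style patch dropping vs. masking: "dropped" patches
# lose all modalities, "masked" patches lose a random subset of modalities.
import torch

def drop_and_mask(tokens: dict, drop_ratio=0.3, mask_ratio=0.3):
    """tokens: {modality: (B, N, D)} with patches aligned across modalities."""
    some_mod = next(iter(tokens.values()))
    B, N, _ = some_mod.shape
    dropped = torch.rand(B, N, device=some_mod.device) < drop_ratio   # all modalities removed

    keep = {}
    for name, t in tokens.items():
        masked = torch.rand(B, N, device=t.device) < mask_ratio       # this modality removed
        keep[name] = ~(dropped | masked)                              # boolean keep-mask
    return dropped, keep
```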

4. Latent Masked Image Modeling

General Domain Latent MIM
General Domain: Latent Masked Image Modeling [Wei et al., 2024] Overview. Models are trained to reconstruct the latent representations generated by a target encoder at withheld locations. Four major challenges for effectively deploying latent MIM are identified in this work, as well as potential solutions. These challenges relate to joint encoder optimization, direct reconstruction loss, the semantic correlation between visible and target patches, and the decoder design.
Not yet implemented in RS
Remote Sensing: Latent MIM is currently unexplored in this domain.
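Although no remote sensing instantiation exists yet, the latent reconstruction objective itself is easy to sketch; here both encoders are trained jointly, which is precisely the optimization difficulty the paper analyzes (names and the cosine-style loss are illustrative simplifications):

```python
# Hedged sketch of a latent-MIM objective: reconstruct the target encoder's
# representations at withheld (masked) locations.
import torch
import torch.nn.functional as F

def latent_mim_loss(online_enc, target_enc, decoder, patches, mask):
    """patches: (B, N, D) patch embeddings; mask: (B, N) bool, True = withheld."""
    target_latents = target_enc(patches)            # latents for all patches
    visible = patches * (~mask).unsqueeze(-1)       # withheld patches zeroed (simplified)
    preds = decoder(online_enc(visible))            # predict latents everywhere

    # Loss only at withheld locations, comparing normalized latents.
    p = F.normalize(preds[mask], dim=-1)
    t = F.normalize(target_latents[mask], dim=-1)
    return (2 - 2 * (p * t).sum(-1)).mean()         # equivalent to MSE on unit vectors
```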

5. Galileo (Current SOTA)

  • Source: Galileo [Tseng et al., 2025].
  • Key Features:
    • Highly multimodal transformer to represent many remote sensing modalities.
    • Novel self-supervised learning algorithm that extracts multi-scale features across a flexible set of input modalities through masked modeling.
    • Dual global and local contrastive losses which differ in their targets and masking strategies.
Galileo RS Architecture
Remote Sensing: Galileo's Architecture. We train Galileo with our global (left) and local (right) pretraining losses. Black-outlined tokens are model outputs, black-striped tokens are model inputs. Steps: sample from dataset and mask (structured left, unstructured right), encode "visible" tokens, predict targets given target queries and visible encodings, encode targets (deep left, shallow right) with stop gradient, and calculate within-sample token contrastive loss.
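A hedged sketch of a within-sample token contrastive loss in the spirit of Galileo's global/local objectives: each predicted token is attracted to the target token at the same position and repelled from the other tokens of the same sample; whether the targets come from deep or shallow encodings depends on the global or local branch (names and temperature are illustrative):

```python
# Hedged sketch of a within-sample, token-level InfoNCE loss: predictions are
# matched against same-position targets and contrasted with other tokens of
# the same sample.
import torch
import torch.nn.functional as F

def token_infonce(pred_tokens, target_tokens, temperature: float = 0.1):
    """pred_tokens, target_tokens: (B, N, D); loss computed within each sample."""
    p = F.normalize(pred_tokens, dim=-1)
    t = F.normalize(target_tokens.detach(), dim=-1)              # stop-gradient targets

    logits = torch.einsum("bnd,bmd->bnm", p, t) / temperature    # (B, N, N)
    labels = torch.arange(logits.size(1), device=logits.device)
    labels = labels.unsqueeze(0).expand(logits.size(0), -1)      # (B, N)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())
```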

Other SSL Techniques

  • DINO [Caron et al., 2021] / DINOv2 [Oquab et al., 2024] / DINOv3 [Siméoni et al., 2025]: Self-distillation with no labels using Vision Transformers. DINO learns visual features through a self-supervised teacher-student framework. DINOv2 improves robustness and scalability for larger models and datasets, while DINOv3 introduces further architectural and training optimizations. Adoption in Remote Sensing: DINO was adapted for joint SAR-optical representation learning using Vision Transformers [Wang et al., 2022b].
SAR-Optical DINO
Remote Sensing: SAR-Optical DINO [Wang et al., 2022b] architecture for joint representation learning. The model employs a teacher-student framework to learn from SAR and optical data. Two augmented views \( x_1 \) and \( x_2 \) (which can be SAR, optical, or a combination) are processed by student and teacher encoders \( E_{student} \) and \( E_{teacher} \). The student is trained with a self-distillation (cross-entropy) loss to match the teacher's output (after centering and softmax), while the teacher's weights are updated via an exponential moving average (EMA) of the student's weights; see the loss sketch after this list.
  • iBOT [Zhou et al., 2022]: Image BERT pre-training with online tokenizer. iBOT combines masked image modeling with self-distillation, using an online tokenizer to generate supervision signals for masked patch prediction. Adoption in Remote Sensing: iBOT was adapted for remote sensing by pre-training on the Million-AID dataset, showing significant improvements in scene classification and other downstream tasks [Dimitrovski et al., 2024].
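A minimal sketch of the DINO-style self-distillation loss referenced above (as used by the SAR-optical adaptation): the student matches the teacher's centered and sharpened softmax output, and the teacher's weights are an EMA of the student's (temperatures and momentum are illustrative defaults):

```python
# Minimal sketch of the DINO self-distillation loss with output centering.
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """student_out, teacher_out: (B, K) projection-head outputs."""
    t = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    log_s = F.log_softmax(student_out / student_temp, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def update_center(center, teacher_out, momentum=0.9):
    # Running mean of teacher outputs, used to help avoid collapse.
    return momentum * center + (1 - momentum) * teacher_out.mean(dim=0)
```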

Other Remote Sensing Foundation Models

Several related works introduce remote sensing foundation models more broadly, including CROMA [Fuller et al., 2023], SatlasPretrain [Bastani et al., 2023], SkySense [Guo et al., 2024] and its successors SkySense V2 [Zhang et al., 2025] and SkySense-O [Zhu et al., 2025], Prithvi-EO-2.0 [Szwarcman et al., 2025], RS-vHeat [Hu et al., 2025], and AlphaEarth Foundations [Brown et al., 2025]; see the references for further examples.

Adaptation to Downstream Tasks

General Pipeline of SSL
The general pipeline of SSL. The visual representation is learned through self-supervision that comes from the unlabeled data. The learned parameters serve as a pretrained model and are transferred to supervised downstream tasks for fine-tuning. Image Credit: [Wang et al., 2022]

For a more in-depth exploration of self-supervised learning, readers are encouraged to consult several excellent surveys. A general overview of SSL algorithms and applications can be found in [Gui et al., 2024], while [Wang et al., 2022] and [Bai et al., 2025] provide comprehensive reviews specifically for remote sensing. Additionally, [Zong et al., 2025] explores the broader landscape of self-supervised multimodal learning.
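As a concrete illustration of the adaptation step, here is a minimal linear-probing sketch with a frozen pretrained encoder; `encoder`, `train_loader`, `feat_dim`, and `num_classes` are assumed to be defined, and unfreezing the encoder turns this into full fine-tuning:

```python
# Minimal sketch of the "pretrain then adapt" pipeline: train a linear head
# on top of a frozen SSL encoder (linear probing).
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, num_classes, feat_dim, epochs=10):
    for p in encoder.parameters():        # freeze the pretrained backbone
        p.requires_grad = False
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = encoder(images)   # (B, feat_dim) pooled representations
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```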


References

  1. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. In International Conference on Machine Learning (ICML).

  2. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML).

  3. Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. In Proceedings of the 38th International Conference on Machine Learning (ICML).

  4. Fuller, A., Millard, K., & Green, J. (2023). CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders. In Advances in Neural Information Processing Systems (NeurIPS).

  5. Wang, Y., Albrecht, C. M., Braham, N. A. A., Liu, C., Xiong, Z., & Zhu, X. X. (2024a). Decoupling Common and Unique Representations for Multimodal Self-supervised Learning. In Proceedings of the European Conference on Computer Vision (ECCV).

  6. Astruc, G., Gonthier, N., Mallet, C., & Landrieu, L. (2025). AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  7. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  8. Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., & Ermon, S. (2022). SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery. In Advances in Neural Information Processing Systems (NeurIPS).

  9. Noman, M., Naseer, M., Cholakkal, H., Anwer, R. M., Khan, S., & Khan, F. S. (2024). Rethinking Transformers Pre-Training for Multi-Spectral Satellite Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  10. Xiong, Z., Wang, Y., Zhang, F., Stewart, A. J., Hanna, J., Borth, D., Papoutsis, I., Saux, B. L., Camps-Valls, G., & Zhu, X. X. (2025). Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation. arXiv preprint arXiv:2403.15356.

  11. Nedungadi, V., Kariryaa, A., Oehmcke, S., Belongie, S., Igel, C., & Lang, N. (2024). MMEarth: Exploring Multi-modal Pretext Tasks for Geospatial Representation Learning. In Proceedings of the European Conference on Computer Vision (ECCV).

  12. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  13. Wei, Y., Gupta, A., & Morgado, P. (2024). Towards Latent Masked Image Modeling for Self-supervised Visual Representation Learning. In Proceedings of the European Conference on Computer Vision (ECCV).

  14. Tseng, G., Fuller, A., Reil, M., Herzog, H., Beukema, P., Bastani, F., Green, J. R., Shelhamer, E., Kerner, H., & Rolnick, D. (2025). Galileo: Learning Global & Local Features of Many Remote Sensing Modalities. In International Conference on Machine Learning (ICML).

  15. Wang, Y., Albrecht, C. M., & Zhu, X. X. (2024b). Multilabel-Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining. IEEE Transactions on Geoscience and Remote Sensing, 62, 1–16.

  16. Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., & Kembhavi, A. (2023). SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

  17. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., He, H., Wang, J., Chen, J., Yang, M., Zhang, Y., & Li, Y. (2024). SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  18. Zhang, Y., Ru, L., Wu, K., Yu, L., Liang, L., Li, Y., & Chen, J. (2025). SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing. In International Conference on Computer Vision (ICCV).

  19. Zhu, Q., Lao, J., Ji, D., Luo, J., Wu, K., Zhang, Y., Ru, L., Wang, J., Chen, J., Yang, M., Liu, D., & Zhao, F. (2025). SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  20. Wu, K., Zhang, Y., Ru, L., Dang, B., Lao, J., Yu, L., Luo, J., Zhu, Z., Sun, Y., Zhang, J., Zhu, Q., Wang, J., Yang, M., Chen, J., Zhang, Y., & Li, Y. (2025). A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model for Earth Observation. Nature Machine Intelligence, 7(8), 1235–1249.

  21. Szwarcman, D., Roy, S., Fraccaro, P., Gíslason, Þ. E., Blumenstiel, B., Ghosal, R., de Oliveira, P. H., Almeida, J. L. d. S., Sedona, R., Kang, Y., Chakraborty, S., Wang, S., Gomes, C., Kumar, A., Truong, M., Godwin, D., Lee, H., Hsu, C.-Y., Asanjan, A. A., Mujeci, B., Shidham, D., Keenan, T., Arevalo, P., Li, W., Alemohammad, H., Olofsson, P., Hain, C., Kennedy, R., Zadrozny, B., Bell, D., Cavallaro, G., Watson, C., Maskey, M., Ramachandran, R., & Moreno, J. B. (2025). Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications. arXiv preprint arXiv:2412.02732.

  22. Mendieta, M., Han, B., Shi, X., Zhu, Y., & Chen, C. (2023). Towards Geospatial Foundation Models via Continual Pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

  23. Chen, B., Chen, Y., Wang, Z., Chen, Y., & Li, X. (2024c). Masked Angle-Aware Autoencoder for Remote Sensing Images. In Proceedings of the European Conference on Computer Vision (ECCV).

  24. Mall, U., Hariharan, B., & Bala, K. (2023). Change-Aware Sampling and Contrastive Learning for Satellite Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  25. Hu, H., Wang, P., Bi, H., Tong, B., Wang, Z., Diao, W., Chang, H., Feng, Y., Zhang, Z., Wang, Y., Ye, Q., Fu, K., & Sun, X. (2025). RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model. In International Conference on Computer Vision (ICCV).

  26. Rolf, E., Klemmer, K., Robinson, C., & Kerner, H. (2024). Position: Mission Critical – Satellite Data Is a Distinct Modality in Machine Learning. In Proceedings of the 41st International Conference on Machine Learning (ICML).

  27. Brown, C. F., Kazmierski, M. R., Pasquarella, V. J., Rucklidge, W. J., Samsikova, M., Zhang, C., Shelhamer, E., Lahera, E., Wiles, O., Ilyushchenko, S., Gorelick, N., Zhang, L. L., Alj, S., Schechter, E., Askay, S., Guinan, O., Moore, R., Boukouvalas, A., & Kohli, P. (2025). AlphaEarth Foundations: An Embedding Field Model for Accurate and Efficient Global Mapping from Sparse Label Data. arXiv preprint arXiv:2507.22291.

  28. Liu, F., Chen, D., Guan, Z., Zhou, X., Zhu, J., Ye, Q., Fu, L., & Zhou, J. (2024). RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. IEEE Transactions on Geoscience and Remote Sensing, 62, 1–16.

  29. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

  30. Chen, T., Kornblith, S., Swersky, K., Norouzi, M., & Hinton, G. E. (2020). Big Self-Supervised Models Are Strong Semi-Supervised Learners. In Advances in Neural Information Processing Systems (NeurIPS).

  31. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., Piot, B., Kavukcuoglu, K., Munos, R., & Valko, M. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. In Advances in Neural Information Processing Systems (NeurIPS).

  32. Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., & Kong, T. (2022). iBOT: Image BERT Pre-training with Online Tokenizer. In International Conference on Learning Representations (ICLR).

  33. Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., & El-Nouby, A. (2024). DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research.

  34. Siméoni, O., Vo, H. V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., & Bojanowski, P. (2025). DINOv3. arXiv preprint arXiv:2508.10104.

  35. Wang, Y., Albrecht, C. M., Braham, N. A. A., Mou, L., & Zhu, X. X. (2022). Self-Supervised Learning in Remote Sensing: A Review. IEEE Geoscience and Remote Sensing Magazine, 10(4), 213–247.

  36. Gui, J., Chen, T., Zhang, J., Cao, Q., Sun, Z., Luo, H., & Tao, D. (2024). A Survey on Self-Supervised Learning: Algorithms, Applications, and Future Trends. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12), 9052–9071.

  37. Zong, Y., Aodha, O. M., & Hospedales, T. M. (2025). Self-Supervised Multimodal Learning: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(7), 5299–5318.

  38. Bai, L., Zhang, X., Qin, W., Long, J., Wang, H., Dong, X., & Du, S. (2025). From Orbit to Ground: A Comprehensive Review of Multimodal Self-Supervised Learning for Remote Sensing. IEEE Geoscience and Remote Sensing Magazine, 13(4), 346–381.

  39. Wang, Y., Albrecht, C. M., & Zhu, X. X. (2022b). Self-Supervised Vision Transformers for Joint SAR-Optical Representation Learning. In IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium.

  40. Dimitrovski, I., Kitanovski, I., Simidjievski, N., & Kocev, D. (2024). In-Domain Self-Supervised Learning Improves Remote Sensing Image Scene Classification. IEEE Geoscience and Remote Sensing Letters, 21, 1–5.