TL;DR
We present a comprehensive study of advanced techniques for remote sensing representation learning, covering recent self-supervised pre-training paradigms (contrastive learning, masked autoencoders, JEPA-style methods, and related approaches) as well as fine-tuning strategies for downstream tasks.
Introduction
(Placeholder for Introduction)
Preliminary
The success of foundation models, especially large language models (LLMs), has demonstrated the effectiveness of the self-supervised pre-training plus domain-specific supervised fine-tuning strategy.
This paradigm was later adopted in the remote sensing domain, where it has also achieved great success.
Unique Attributes of Remote Sensing Images
Remote sensing data is often considered a distinct modality in machine learning due to its unique spatio-temporal characteristics, varying resolutions, and diverse sensor types [Rolf et al., 2024].
Review: Self-Supervised Learning in Remote Sensing (2020-2025)
We review recent papers that adopt self-supervised learning for remote sensing imagery.
1. Contrastive Learning
Background (General Domain):
- Representative works: SimCLR [Chen et al., 2020], CLIP [Radford et al., 2021], BarlowTwins [Zbontar et al., 2021].
- Core Idea: Contrastive learning, which pulls together representations of augmented views of the same image while pushing apart representations of different images.
Adoption in Remote Sensing:
- Models: CROMA [Fuller et al., 2023], DECUR [Wang et al., 2024], AnySat [Astruc et al., 2025].
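The shared objective behind these methods is an InfoNCE-style contrastive loss. The following is a minimal NumPy sketch of SimCLR's NT-Xent loss, where each row of `z1` and `z2` is the embedding of one of two augmented views of the same scene; the batch size, embedding dimensionality, and temperature below are illustrative, not values from any cited model:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss
    over two batches of embeddings from paired augmented views."""
    # L2-normalize so dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)            # (2N, d)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                  # exclude self-pairs
    n = len(z1)
    # The positive for anchor i is its other view at index i + n (mod 2N).
    targets = np.concatenate([np.arange(n) + n, np.arange(n)])
    # Softmax cross-entropy: -log p(positive | anchor), averaged.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    pos = sim[np.arange(2 * n), targets]
    return float(np.mean(logsumexp - pos))
```

Multimodal variants such as CROMA apply the same idea across sensors (e.g., radar and optical views of the same location) rather than across image augmentations.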
2. Masked Autoencoder (MAE)
Background (General Domain):
- Representative works: MAE [He et al., 2022].
Adoption in Remote Sensing:
- Models: SatMAE [Cong et al., 2022], SatMAE++ [Noman et al., 2024], CROMA [Fuller et al., 2023], MMEarth [Nedungadi et al., 2024], DOFA [Xiong et al., 2025].
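The defining step of MAE-style pre-training is masking a large fraction of patch tokens and encoding only the visible remainder. A minimal sketch of that masking step follows; the 75% ratio is the default from He et al. (2022), while the array shapes are illustrative:

```python
import numpy as np

def random_mask_patches(patches, mask_ratio=0.75, seed=0):
    """Split a sequence of patch tokens into visible and masked subsets.
    As in MAE, the encoder sees only the visible ~25%; the decoder later
    reconstructs the masked patches from the encoded visible ones."""
    n = patches.shape[0]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(perm[:n_keep])   # indices the encoder will see
    mask_idx = np.sort(perm[n_keep:])   # indices to be reconstructed
    return patches[keep_idx], keep_idx, mask_idx
```

Remote sensing variants such as SatMAE extend this scheme with temporal and spectral-band position encodings, but the masking mechanism itself is unchanged.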
3. I-JEPA Style
Background (General Domain):
- Representative works: I-JEPA [Assran et al., 2023].
Adoption in Remote Sensing:
- Models: AnySat [Astruc et al., 2025].
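Unlike MAE, I-JEPA predicts representations of masked target blocks in latent space, with targets produced by an exponential-moving-average (EMA) copy of the context encoder. A minimal sketch of those two ingredients; the momentum value is illustrative:

```python
import numpy as np

def ema_update(target_w, online_w, momentum=0.996):
    """Move the target encoder's weights toward the online (context)
    encoder's weights via an exponential moving average, so targets
    evolve slowly and do not collapse."""
    return momentum * target_w + (1 - momentum) * online_w

def jepa_loss(predicted, target):
    """Regression loss between predicted and target block
    representations, computed in latent space: there is no
    pixel-level reconstruction."""
    return float(np.mean((predicted - target) ** 2))
```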
4. Latent Masked Image Modeling
- This paradigm currently exists in the general domain (e.g., LatentMIM [Wei et al., 2024]) but has not yet been adopted in the remote sensing domain.
5. Galileo (Current SOTA)
- Source: Galileo [Tseng et al., 2025].
- Key Features:
- Highly multimodal transformer to represent many remote sensing modalities.
- Novel self-supervised learning algorithm extracting multi-scale features across a flexible set of input modalities through masked modeling.
- Dual global and local contrastive losses which differ in their targets and masking strategies.
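As a rough illustration only, and not Galileo's actual implementation, a dual objective of this kind can be expressed as a weighted sum of a global term on pooled (per-image) representations and a local term on per-patch tokens, each of which would receive its own targets and masking strategy upstream. The function names, the use of MSE as a placeholder loss, and the equal weighting are all assumptions:

```python
import numpy as np

def feature_mse(pred, target):
    # Mean squared error between feature tensors of matching shape.
    return float(np.mean((pred - target) ** 2))

def dual_global_local_loss(global_pred, global_tgt,
                           local_pred, local_tgt, w_local=1.0):
    """Hypothetical combination of a global loss (pooled, per-image
    representations) and a local loss (per-patch tokens); in practice
    the two terms differ in their targets and masking strategies."""
    return (feature_mse(global_pred, global_tgt)
            + w_local * feature_mse(local_pred, local_tgt))
```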
Other Remote Sensing Foundation Models
Several other works also introduce remote sensing foundation models:
- SoftCON [Wang et al., 2024b] / Satlas [Bastani et al., 2023] (Allen Institute for AI)
- Prithvi-EO-2.0 [Szwarcman et al., 2025]
- RS-vHeat [Hu et al., 2025]
- SkySense Series: SkySense [Guo et al., 2024], SkySense v2 [Zhang et al., 2025], SkySense-O [Zhu et al., 2025], SkySense++ [Wu et al., 2025]
- GFM [Mendieta et al., 2023]
- MA3E [Chen et al., 2024c]
- CaCo [Mall et al., 2023]
- AlphaEarth [Brown et al., 2025]
Adaptation to Downstream Tasks
(Placeholder for section on adapting self-supervised models to remote sensing downstream tasks)
References
- Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. In International Conference on Machine Learning (ICML).
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML).
- Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. In Proceedings of the 38th International Conference on Machine Learning (ICML).
- Fuller, A., Millard, K., & Green, J. (2023). CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders. In Advances in Neural Information Processing Systems (NeurIPS).
- Wang, Y., Albrecht, C. M., Braham, N. A. A., Liu, C., Xiong, Z., & Zhu, X. X. (2024a). Decoupling Common and Unique Representations for Multimodal Self-supervised Learning. In Proceedings of the European Conference on Computer Vision (ECCV).
- Astruc, G., Gonthier, N., Mallet, C., & Landrieu, L. (2025). AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., & Ermon, S. (2022). SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery. In Advances in Neural Information Processing Systems (NeurIPS).
- Noman, M., Naseer, M., Cholakkal, H., Anwer, R. M., Khan, S., & Khan, F. S. (2024). Rethinking Transformers Pre-Training for Multi-Spectral Satellite Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Xiong, Z., Wang, Y., Zhang, F., Stewart, A. J., Hanna, J., Borth, D., Papoutsis, I., Saux, B. L., Camps-Valls, G., & Zhu, X. X. (2025). Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation. arXiv preprint arXiv:2403.15356.
- Nedungadi, V., Kariryaa, A., Oehmcke, S., Belongie, S., Igel, C., & Lang, N. (2024). MMEarth: Exploring Multi-modal Pretext Tasks for Geospatial Representation Learning. In Proceedings of the European Conference on Computer Vision (ECCV).
- Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Wei, Y., Gupta, A., & Morgado, P. (2024). Towards Latent Masked Image Modeling for Self-supervised Visual Representation Learning. In Proceedings of the European Conference on Computer Vision (ECCV).
- Tseng, G., Fuller, A., Reil, M., Herzog, H., Beukema, P., Bastani, F., Green, J. R., Shelhamer, E., Kerner, H., & Rolnick, D. (2025). Galileo: Learning Global & Local Features of Many Remote Sensing Modalities. In International Conference on Machine Learning (ICML).
- Wang, Y., Albrecht, C. M., & Zhu, X. X. (2024b). Multilabel-Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining. IEEE Transactions on Geoscience and Remote Sensing, 62, 1–16.
- Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., & Kembhavi, A. (2023). SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., He, H., Wang, J., Chen, J., Yang, M., Zhang, Y., & Li, Y. (2024). SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Zhang, Y., Ru, L., Wu, K., Yu, L., Liang, L., Li, Y., & Chen, J. (2025). SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- Zhu, Q., Lao, J., Ji, D., Luo, J., Wu, K., Zhang, Y., Ru, L., Wang, J., Chen, J., Yang, M., Liu, D., & Zhao, F. (2025). SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Wu, K., Zhang, Y., Ru, L., Dang, B., Lao, J., Yu, L., Luo, J., Zhu, Z., Sun, Y., Zhang, J., Zhu, Q., Wang, J., Yang, M., Chen, J., Zhang, Y., & Li, Y. (2025). A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model for Earth Observation. Nature Machine Intelligence, 7(8), 1235–1249.
- Szwarcman, D., Roy, S., Fraccaro, P., Gíslason, Þ. E., Blumenstiel, B., Ghosal, R., de Oliveira, P. H., Almeida, J. L. d. S., Sedona, R., Kang, Y., Chakraborty, S., Wang, S., Gomes, C., Kumar, A., Truong, M., Godwin, D., Lee, H., Hsu, C.-Y., Asanjan, A. A., Mujeci, B., Shidham, D., Keenan, T., Arevalo, P., Li, W., Alemohammad, H., Olofsson, P., Hain, C., Kennedy, R., Zadrozny, B., Bell, D., Cavallaro, G., Watson, C., Maskey, M., Ramachandran, R., & Moreno, J. B. (2025). Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications. arXiv preprint arXiv:2412.02732.
- Mendieta, M., Han, B., Shi, X., Zhu, Y., & Chen, C. (2023). Towards Geospatial Foundation Models via Continual Pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- Chen, B., Chen, Y., Wang, Z., Chen, Y., & Li, X. (2024c). Masked Angle-Aware Autoencoder for Remote Sensing Images. In Proceedings of the European Conference on Computer Vision (ECCV).
- Mall, U., Hariharan, B., & Bala, K. (2023). Change-Aware Sampling and Contrastive Learning for Satellite Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Hu, H., Wang, P., Bi, H., Tong, B., Wang, Z., Diao, W., Chang, H., Feng, Y., Zhang, Z., Wang, Y., Ye, Q., Fu, K., & Sun, X. (2025). RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- Rolf, E., Klemmer, K., Robinson, C., & Kerner, H. (2024). Position: Mission Critical – Satellite Data Is a Distinct Modality in Machine Learning. In Proceedings of the 41st International Conference on Machine Learning (ICML).
- Brown, C. F., Kazmierski, M. R., Pasquarella, V. J., Rucklidge, W. J., Samsikova, M., Zhang, C., Shelhamer, E., Lahera, E., Wiles, O., Ilyushchenko, S., Gorelick, N., Zhang, L. L., Alj, S., Schechter, E., Askay, S., Guinan, O., Moore, R., Boukouvalas, A., & Kohli, P. (2025). AlphaEarth Foundations: An Embedding Field Model for Accurate and Efficient Global Mapping from Sparse Label Data. arXiv preprint arXiv:2507.22291.