TL;DR

We give a comprehensive study of advanced techniques for remote sensing representation learning, covering recent self-supervised pre-training paradigms (contrastive learning, masked-autoencoder style, JEPA style, and related approaches) as well as fine-tuning strategies for downstream tasks.

Introduction

(Placeholder for Introduction)

Preliminary

The success of foundation models, especially large language models (LLMs), has demonstrated the effectiveness of the self-supervised pre-training plus domain-specific supervised fine-tuning strategy.

This paradigm was later adopted in the remote sensing domain, where it has also achieved great success.
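To make the paradigm concrete, the sketch below fine-tunes only a lightweight head on top of a frozen pre-trained encoder. Everything here is a toy stand-in (a random projection as the "pre-trained" backbone, synthetic data, and a closed-form ridge head instead of gradient-based training); it illustrates the pre-train/fine-tune split, not any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(x):
    """Stand-in for a pre-trained backbone: a fixed random projection.
    In practice this would be a ViT or CNN whose weights are frozen."""
    W = np.random.default_rng(42).normal(size=(x.shape[1], 16))
    return np.tanh(x @ W)

# Supervised fine-tuning stage: only a lightweight head is trained
# on top of the frozen representation (here, ridge regression in
# closed form as a stand-in for gradient-based head training).
X = rng.normal(size=(200, 32))      # downstream "images" (flattened)
y = rng.integers(0, 2, size=200)    # downstream labels
Z = frozen_encoder(X)               # frozen features (no backbone update)
head = np.linalg.solve(Z.T @ Z + 1e-2 * np.eye(Z.shape[1]), Z.T @ y)
preds = (Z @ head > 0.5).astype(int)
accuracy = (preds == y).mean()
```

The key design point is that the expensive representation (the encoder) is reused across tasks, while only the small task head is re-trained per downstream problem.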

Unique Attributes of Remote Sensing Images

Remote sensing data is often considered a distinct modality in machine learning due to its unique spatio-temporal characteristics, varying resolutions, and diverse sensor types [Rolf et al., 2024].

Review: Self-Supervised Learning in Remote Sensing (2020-2025)

We review recent papers that apply self-supervised learning to remote sensing imagery.

1. Contrastive Learning

Background (General Domain):

(Figure) General Domain: Contrastive Learning architecture.

Adoption in Remote Sensing:

(Figure) Remote Sensing: Specialized contrastive learning architectures.
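The core of SimCLR-style contrastive learning [Chen et al., 2020] is the NT-Xent (InfoNCE) objective: embeddings of two augmented views of the same scene are pulled together, while all other pairs in the batch act as negatives. A minimal NumPy sketch (batch size, temperature, and embedding dimension are illustrative choices, not values from any cited paper):

```python
import numpy as np

def nt_xent(z1, z2, tau=0.1):
    """NT-Xent / InfoNCE loss for two augmented views (SimCLR-style).
    z1[i] and z2[i] embed two views of the same scene (positives);
    every other embedding in the batch is a negative."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)      # (2N, d)
    sim = z @ z.T / tau                       # temperature-scaled cosine sims
    np.fill_diagonal(sim, -np.inf)            # exclude each sample vs itself
    n = len(z1)
    # row i's positive sits in the other half of the batch
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), targets].mean()
```

When the two views embed identically, the positive dominates the softmax and the loss is near zero; unrelated views push the loss toward log(2N - 1).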

2. Masked Autoencoder (MAE)

Background (General Domain):

(Figure) General Domain: Masked Autoencoder (MAE) architecture.

Adoption in Remote Sensing:

(Figure) Remote Sensing: Evolution of MAE for geospatial data.

3. I-JEPA Style

Background (General Domain):

(Figure) General Domain: Joint-Embedding Predictive Architecture (I-JEPA).

Adoption in Remote Sensing:

(Figure) Remote Sensing: Adoption of I-JEPA for multi-modal Earth observation.

4. Latent Masked Image Modeling

  • This paradigm currently exists in the general domain (e.g., LatentMIM [Wei et al., 2024]) but has not yet been implemented in the remote sensing domain.
(Figure) General Domain: Latent Masked Image Modeling (LatentMIM) architecture.
(Figure) Remote Sensing: Latent MIM is currently unexplored in this domain.

5. Galileo (Current SOTA)

  • Source: Galileo [Tseng et al., 2025].
  • Key Features:
    • Highly multimodal transformer to represent many remote sensing modalities.
    • Novel self-supervised learning algorithm extracting multi-scale features across a flexible set of input modalities through mask modeling.
    • Dual global and local contrastive losses which differ in their targets and masking strategies.
(Figure) General Domain: Standard Multimodal Transformer architecture.
(Figure) Remote Sensing: Galileo's specialized architecture for multi-modal geospatial data.

Other Remote Sensing Foundation Models

Several related works also introduce remote sensing foundation models, including SatlasPretrain [Bastani et al., 2023], SkySense [Guo et al., 2024], Prithvi-EO-2.0 [Szwarcman et al., 2025], and AlphaEarth Foundations [Brown et al., 2025].

Adaptation to Downstream Tasks

(Placeholder for section on adapting self-supervised models to remote sensing downstream tasks)


References

  1. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. In International Conference on Machine Learning (ICML).

  2. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML).

  3. Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. In Proceedings of the 38th International Conference on Machine Learning (ICML).

  4. Fuller, A., Millard, K., & Green, J. (2023). CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders. In Advances in Neural Information Processing Systems (NeurIPS).

  5. Wang, Y., Albrecht, C. M., Braham, N. A. A., Liu, C., Xiong, Z., & Zhu, X. X. (2024a). Decoupling Common and Unique Representations for Multimodal Self-supervised Learning. In Proceedings of the European Conference on Computer Vision (ECCV).

  6. Astruc, G., Gonthier, N., Mallet, C., & Landrieu, L. (2025). AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  7. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  8. Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., & Ermon, S. (2022). SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery. In Advances in Neural Information Processing Systems (NeurIPS).

  9. Noman, M., Naseer, M., Cholakkal, H., Anwer, R. M., Khan, S., & Khan, F. S. (2024). Rethinking Transformers Pre-Training for Multi-Spectral Satellite Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  10. Xiong, Z., Wang, Y., Zhang, F., Stewart, A. J., Hanna, J., Borth, D., Papoutsis, I., Saux, B. L., Camps-Valls, G., & Zhu, X. X. (2025). Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation. arXiv preprint arXiv:2403.15356.

  11. Nedungadi, V., Kariryaa, A., Oehmcke, S., Belongie, S., Igel, C., & Lang, N. (2024). MMEarth: Exploring Multi-modal Pretext Tasks for Geospatial Representation Learning. In Proceedings of the European Conference on Computer Vision (ECCV).

  12. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  13. Wei, Y., Gupta, A., & Morgado, P. (2024). Towards Latent Masked Image Modeling for Self-supervised Visual Representation Learning. In Proceedings of the European Conference on Computer Vision (ECCV).

  14. Tseng, G., Fuller, A., Reil, M., Herzog, H., Beukema, P., Bastani, F., Green, J. R., Shelhamer, E., Kerner, H., & Rolnick, D. (2025). Galileo: Learning Global & Local Features of Many Remote Sensing Modalities. In International Conference on Machine Learning (ICML).

  15. Wang, Y., Albrecht, C. M., & Zhu, X. X. (2024b). Multilabel-Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining. IEEE Transactions on Geoscience and Remote Sensing, 62, 1–16.

  16. Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., & Kembhavi, A. (2023). SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

  17. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., He, H., Wang, J., Chen, J., Yang, M., Zhang, Y., & Li, Y. (2024). SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  18. Zhang, Y., Ru, L., Wu, K., Yu, L., Liang, L., Li, Y., & Chen, J. (2025). SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing. In International Conference on Computer Vision (ICCV).

  19. Zhu, Q., Lao, J., Ji, D., Luo, J., Wu, K., Zhang, Y., Ru, L., Wang, J., Chen, J., Yang, M., Liu, D., & Zhao, F. (2025). SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  20. Wu, K., Zhang, Y., Ru, L., Dang, B., Lao, J., Yu, L., Luo, J., Zhu, Z., Sun, Y., Zhang, J., Zhu, Q., Wang, J., Yang, M., Chen, J., Zhang, Y., & Li, Y. (2025). A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model for Earth Observation. Nature Machine Intelligence, 7(8), 1235–1249.

  21. Szwarcman, D., Roy, S., Fraccaro, P., Gíslason, Þ. E., Blumenstiel, B., Ghosal, R., de Oliveira, P. H., Almeida, J. L. d. S., Sedona, R., Kang, Y., Chakraborty, S., Wang, S., Gomes, C., Kumar, A., Truong, M., Godwin, D., Lee, H., Hsu, C.-Y., Asanjan, A. A., Mujeci, B., Shidham, D., Keenan, T., Arevalo, P., Li, W., Alemohammad, H., Olofsson, P., Hain, C., Kennedy, R., Zadrozny, B., Bell, D., Cavallaro, G., Watson, C., Maskey, M., Ramachandran, R., & Moreno, J. B. (2025). Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications. arXiv preprint arXiv:2412.02732.

  22. Mendieta, M., Han, B., Shi, X., Zhu, Y., & Chen, C. (2023). Towards Geospatial Foundation Models via Continual Pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

  23. Chen, B., Chen, Y., Wang, Z., Chen, Y., & Li, X. (2024c). Masked Angle-Aware Autoencoder for Remote Sensing Images. In Proceedings of the European Conference on Computer Vision (ECCV).

  24. Mall, U., Hariharan, B., & Bala, K. (2023). Change-Aware Sampling and Contrastive Learning for Satellite Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  25. Hu, H., Wang, P., Bi, H., Tong, B., Wang, Z., Diao, W., Chang, H., Feng, Y., Zhang, Z., Wang, Y., Ye, Q., Fu, K., & Sun, X. (2025). RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model. In International Conference on Computer Vision (ICCV).

  26. Rolf, E., Klemmer, K., Robinson, C., & Kerner, H. (2024). Position: Mission Critical – Satellite Data Is a Distinct Modality in Machine Learning. In Proceedings of the 41st International Conference on Machine Learning (ICML).

  27. Brown, C. F., Kazmierski, M. R., Pasquarella, V. J., Rucklidge, W. J., Samsikova, M., Zhang, C., Shelhamer, E., Lahera, E., Wiles, O., Ilyushchenko, S., Gorelick, N., Zhang, L. L., Alj, S., Schechter, E., Askay, S., Guinan, O., Moore, R., Boukouvalas, A., & Kohli, P. (2025). AlphaEarth Foundations: An Embedding Field Model for Accurate and Efficient Global Mapping from Sparse Label Data. arXiv preprint arXiv:2507.22291.