Image Processing
Multi-Modal Self-Supervised Learning for Aligned Satellite Image Time Series Encoder
Foundation models based on self-supervised learning (SSL) have attracted considerable interest in land monitoring. While early approaches relied primarily on masked autoencoder (MAE) strategies, attention is now turning to discriminative SSL methods. However, applying multi-view discriminative learning to Satellite Image Time Series (SITS) remains a largely underexplored challenge, particularly for monomodal SITS. In this paper, we argue that multimodal SITS, composed of Sentinel-1 and Sentinel-2 images, offer an effective way to realize the potential of multi-view discriminative learning. We introduce MALICE, a multimodal extension of ALISE trained with a hybrid SSL strategy that combines MAE and discriminative approaches. Experimental results on downstream tasks, including crop type segmentation and tree cover density estimation, demonstrate that leveraging multimodality during pre-training significantly improves the quality of representations, particularly for Sentinel-1 embeddings. An ablation study corroborates these findings and highlights the importance of multimodal SSL strategies for SITS-based foundation models. The implementation and pre-trained models are available at: https://github.com/ekalinicheva/malice/.