Overview
Overall architecture of LC4-DViT.
Deformable Vision Transformer (DViT) module.
The pipeline first enhances low-resolution remote sensing images using RRDBNet within the Real-ESRGAN framework, then leverages GPT-4o-generated land-cover descriptions together with Stable Diffusion to synthesize diverse augmented samples. Finally, the proposed Deformable Vision Transformer (DViT) performs land-cover classification by explicitly modeling complex landform geometries, improving robustness and accuracy.
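As a hedged sketch of the augmentation stage, the snippet below runs Stable Diffusion in image-to-image mode on a super-resolved exemplar, guided by a GPT-4o-style scene description. The checkpoint, file paths, prompt, and `strength` value are illustrative assumptions, not the paper's exact controlled-diffusion configuration.

```python
# Hedged sketch of the augmentation stage: a super-resolved exemplar is
# diversified with Stable Diffusion in img2img mode, guided by a GPT-4o-style
# scene description. Checkpoint, paths, prompt, and strength are assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # stand-in checkpoint, not the paper's
    torch_dtype=torch.float16,
).to("cuda")

# Exemplar assumed to be pre-enhanced by Real-ESRGAN (RRDBNet) upstream.
exemplar = Image.open("river_superres.png").convert("RGB").resize((512, 512))
prompt = ("aerial remote sensing image of a meandering river with vegetated "
          "banks and sandbars, nadir view, high resolution")  # GPT-4o-style text

# strength < 1 keeps the exemplar's spatial layout while varying appearance.
images = pipe(prompt=prompt, image=exemplar, strength=0.6,
              guidance_scale=7.5, num_images_per_prompt=4).images
for i, im in enumerate(images):
    im.save(f"river_synthetic_{i}.png")
```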
Abstract
Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely and accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification provides a scalable route to such maps, but it is hindered by scarce and imbalanced annotations as well as geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from AID (Beach, Bridge, Desert, Forest, Mountain, Pond, Port, River), DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen’s Kappa. Cross-dataset experiments on SIRI-WHU (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability.
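To make the backbone-plus-encoder coupling concrete, here is an illustrative sketch only, not the authors' implementation: the paper uses DCNv4, while torchvision's DCNv2-style `DeformConv2d` stands in below, and the toy dimensions, missing positional embeddings, and mean-pooling head are simplifications.

```python
# Illustrative only: a deformable-convolution stem feeding a Transformer
# encoder. The paper couples a DCNv4 backbone with a ViT; torchvision's
# DCNv2-style DeformConv2d stands in here, and all sizes are toy values.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformStem(nn.Module):
    def __init__(self, in_ch: int = 3, dim: int = 192, k: int = 3):
        super().__init__()
        # Predicts a 2D offset per kernel tap so sampling follows landform geometry.
        self.offsets = nn.Conv2d(in_ch, 2 * k * k, k, stride=2, padding=1)
        self.deform = DeformConv2d(in_ch, dim, k, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deform(x, self.offsets(x))

class TinyDViT(nn.Module):
    def __init__(self, dim: int = 192, depth: int = 4, heads: int = 6,
                 num_classes: int = 8):
        super().__init__()
        self.stem = DeformStem(dim=dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.stem(x)                       # (B, dim, H/2, W/2)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, N, dim) token sequence
        tokens = self.encoder(tokens)              # global context via attention
        return self.head(tokens.mean(dim=1))       # mean-pool tokens, classify

logits = TinyDViT()(torch.randn(2, 3, 64, 64))     # -> (2, 8)
```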
Results
Image Quality Assessment
We use the Kernel Inception Distance (KID) to evaluate the quality of the generated images; lower is better. A computation sketch follows the table.
| Method | Dataset | KID ↓ |
|---|---|---|
| LC4-DViT (Ours) | AID | 12.10 |
| UniControl | CNSATMAP | 20.42 |
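The following is a minimal sketch of how KID can be computed, assuming the `torchmetrics` implementation; the feature extractor, subset settings, and any scaling of the reported values are not specified by the table, and the random batches stand in for real AID images and diffusion outputs.

```python
# Minimal KID sketch assuming torchmetrics; real/generated batches are random
# stand-ins for AID images and diffusion outputs (uint8 NCHW in [0, 255]).
import torch
from torchmetrics.image.kid import KernelInceptionDistance

kid = KernelInceptionDistance(subset_size=50)  # subset_size must not exceed N

real = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)

kid.update(real, real=True)
kid.update(fake, real=False)
kid_mean, kid_std = kid.compute()              # lower mean is better
print(f"KID: {kid_mean:.4f} +/- {kid_std:.4f}")
```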
Overall Performance on AID Dataset
Overall metrics comparing DViT, ResNet50, FlashInternImage, ViT, and MobileNetV2 on the AID dataset. A reproduction sketch of these metrics follows the table.
| Model | Overall Accuracy | Mean Accuracy | Cohen's Kappa | Precision (macro) | Recall (macro) | F1-score (macro) |
|---|---|---|---|---|---|---|
| DViT | 0.9572 | 0.9592 | 0.9510 | 0.9576 | 0.9592 | 0.9576 |
| ResNet50 | 0.9311 | 0.9342 | 0.9211 | 0.9354 | 0.9342 | 0.9332 |
| FlashInternImage | 0.9404 | 0.9409 | 0.9317 | 0.9450 | 0.9409 | 0.9416 |
| ViT | 0.9274 | 0.9306 | 0.9169 | 0.9337 | 0.9306 | 0.9300 |
| MobileNetV2 | 0.9348 | 0.9389 | 0.9254 | 0.9411 | 0.9389 | 0.9378 |
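These columns follow the standard scikit-learn definitions; note that Mean Accuracy coincides with macro Recall in every row, consistent with mean accuracy being the macro-averaged per-class recall. A minimal reproduction sketch, with toy labels standing in for the AID test split:

```python
# Sketch of the overall metrics under standard scikit-learn definitions;
# the toy labels below stand in for the AID test split.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, precision_recall_fscore_support)

y_true = [0, 0, 1, 1, 2, 2, 3, 3]  # ground-truth class indices (toy)
y_pred = [0, 1, 1, 1, 2, 2, 3, 0]  # model predictions (toy)

overall_acc = accuracy_score(y_true, y_pred)
mean_acc = balanced_accuracy_score(y_true, y_pred)       # = macro recall
kappa = cohen_kappa_score(y_true, y_pred)                # chance-corrected
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                   average="macro")
print(overall_acc, mean_acc, kappa, prec, rec, f1)
```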
Per-class Performance on AID Dataset
| Class | Model | Acc | Prec | F1 | Kappa |
|---|---|---|---|---|---|
| Beach | DViT (Ours) | 0.9306 | 0.9710 | 0.9504 | 0.9429 |
| | ResNet50 | 0.9167 | 0.9706 | 0.9429 | 0.9343 |
| | FlashInternImage | 0.9306 | 0.9853 | 0.9571 | 0.9507 |
| | ViT | 0.9861 | 0.9595 | 0.9726 | 0.9683 |
| | MobileNetV2 | 0.9444 | 0.9714 | 0.9577 | 0.9513 |
| Bridge | DViT (Ours) | 0.9242 | 0.9683 | 0.9457 | 0.9383 |
| | ResNet50 | 0.9091 | 0.7692 | 0.8333 | 0.8077 |
| | FlashInternImage | 0.8182 | 0.9474 | 0.8780 | 0.8624 |
| | ViT | 0.9394 | 0.7848 | 0.8552 | 0.8328 |
| | MobileNetV2 | 0.9545 | 0.8400 | 0.8936 | 0.8766 |
| Desert | DViT (Ours) | 1.0000 | 0.9524 | 0.9756 | 0.9725 |
| | ResNet50 | 1.0000 | 0.9375 | 0.9677 | 0.9635 |
| | FlashInternImage | 0.9833 | 0.9672 | 0.9752 | 0.9721 |
| | ViT | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| | MobileNetV2 | 0.9833 | 0.9833 | 0.9833 | 0.9812 |
| Forest | DViT (Ours) | 1.0000 | 0.9434 | 0.9709 | 0.9678 |
| | ResNet50 | 0.9800 | 0.9608 | 0.9703 | 0.9672 |
| | FlashInternImage | 0.9600 | 0.9796 | 0.9697 | 0.9666 |
| | ViT | 0.9400 | 0.9592 | 0.9495 | 0.9444 |
| | MobileNetV2 | 0.9800 | 0.9800 | 0.9800 | 0.9779 |
| Mountain | DViT (Ours) | 0.9706 | 0.8800 | 0.9231 | 0.9113 |
| | ResNet50 | 0.9706 | 0.9565 | 0.9635 | 0.9582 |
| | FlashInternImage | 0.9853 | 0.8933 | 0.9371 | 0.9274 |
| | ViT | 0.9706 | 0.8684 | 0.9167 | 0.9038 |
| | MobileNetV2 | 0.9853 | 0.8272 | 0.8993 | 0.8833 |
| Pond | DViT (Ours) | 0.9870 | 0.9870 | 0.9870 | 0.9848 |
| | ResNet50 | 0.9221 | 0.9467 | 0.9342 | 0.9234 |
| | FlashInternImage | 0.9610 | 1.0000 | 0.9801 | 0.9769 |
| | ViT | 0.8571 | 0.9429 | 0.8980 | 0.8818 |
| | MobileNetV2 | 0.8571 | 0.9851 | 0.9167 | 0.9038 |
| Port | DViT (Ours) | 0.9275 | 1.0000 | 0.9624 | 0.9571 |
| | ResNet50 | 0.8551 | 0.9833 | 0.9147 | 0.9032 |
| | FlashInternImage | 0.9420 | 0.9420 | 0.9420 | 0.9335 |
| | ViT | 0.8986 | 1.0000 | 0.9466 | 0.9392 |
| | MobileNetV2 | 0.9130 | 0.9844 | 0.9474 | 0.9399 |
| River | DViT (Ours) | 0.9333 | 0.9589 | 0.9459 | 0.9373 |
| | ResNet50 | 0.9200 | 0.9583 | 0.9388 | 0.9291 |
| | FlashInternImage | 0.9467 | 0.8452 | 0.8931 | 0.8746 |
| | ViT | 0.8533 | 0.9552 | 0.9014 | 0.8864 |
| | MobileNetV2 | 0.8933 | 0.9571 | 0.9241 | 0.9123 |
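The table does not spell out how the per-class Kappa is defined; one plausible reading, sketched below with toy labels, scores each class one-vs-rest, taking per-class Acc as the recall on that class.

```python
# One plausible reading of the per-class columns (not confirmed by the paper):
# each class is scored one-vs-rest, with Acc as the recall on that class and
# Kappa computed on the binarized labels. Labels below are toy stand-ins.
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             cohen_kappa_score)

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 0])
names = {0: "Beach", 1: "Bridge", 2: "Desert"}

for c, name in names.items():
    t = (y_true == c).astype(int)  # one-vs-rest ground truth
    p = (y_pred == c).astype(int)  # one-vs-rest predictions
    print(name,
          f"Acc={recall_score(t, p):.4f}",      # fraction of class-c found
          f"Prec={precision_score(t, p):.4f}",
          f"F1={f1_score(t, p):.4f}",
          f"Kappa={cohen_kappa_score(t, p):.4f}")
```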
Normalized Confusion Matrices on AID Dataset
Normalized confusion matrices: (a) ViT, (b) FlashInternImage, (c) ResNet50, (d) MobileNetV2, (e) DViT.
Ablation Study (AID)
Ablation on two data-centric components: SupreR (Real-ESRGAN) and Diffusion (GPT-4o-assisted controlled diffusion).
| Model | SupreR | Diffusion | Overall Accuracy | Mean Accuracy | Cohen's Kappa | Precision (macro) | Recall (macro) | F1-score (macro) |
|---|---|---|---|---|---|---|---|---|
| DViT | | | 0.9162 | 0.9185 | 0.9041 | 0.9213 | 0.9185 | 0.9183 |
| DViT | ✓ | | 0.9311 | 0.9341 | 0.9211 | 0.9326 | 0.9341 | 0.9326 |
| DViT | | ✓ | 0.9367 | 0.9383 | 0.9275 | 0.9386 | 0.9383 | 0.9380 |
| DViT | ✓ | ✓ | 0.9572 | 0.9592 | 0.9510 | 0.9576 | 0.9592 | 0.9576 |
Evaluation of Model Attention via Heatmaps
Each model's attention to key areas of the images: class-activation heatmaps for eight scene categories (Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River). The first column shows the original scene; the remaining columns show the attention heatmaps of the different models.
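The paper does not state which CAM variant produced these maps; as a hedged sketch, the snippet below generates a Grad-CAM heatmap for a ResNet50 baseline using the `pytorch-grad-cam` package, with a random tensor standing in for a normalized AID scene.

```python
# Hedged sketch: Grad-CAM heatmap for a ResNet50 baseline via pytorch-grad-cam.
# The exact CAM method used in the paper is not specified; target class 0 and
# the random input are placeholders for a preprocessed AID scene.
import torch
from torchvision.models import resnet50, ResNet50_Weights
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
target_layers = [model.layer4[-1]]           # last conv block drives the map

input_tensor = torch.randn(1, 3, 224, 224)   # replace with a normalized scene
with GradCAM(model=model, target_layers=target_layers) as cam:
    heatmap = cam(input_tensor=input_tensor,
                  targets=[ClassifierOutputTarget(0)])[0]  # (224, 224) in [0, 1]
print(heatmap.shape, heatmap.min(), heatmap.max())
```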
GPT-4o Judge Scores (0–3)
| Model | DViT | FlashInternImage | ViT | ResNet50 | MobileNetV2 |
|---|---|---|---|---|---|
| Average Score | 2.625 | 2.5 | 2.25 | 2.0 | 2.5 |
Based on the eight representative heatmaps, we further performed an LLM-based (GPT-4o) evaluation of heatmap quality. For each image-model pair, GPT-4o was prompted as an impartial judge of remote-sensing attention maps following this protocol (a call sketch is given after the list):
- Inputs: The ground-truth scene category, the original RGB image, and the corresponding attention heatmap (warmer colors = higher attention).
- Evaluation Focus: Only assess the spatial alignment between high-activation regions and the semantic regions of the ground-truth class (e.g., water body, shoreline, river course, forest canopy, mountain ridge, bridge span), ignoring model architecture, training details, and any numerical scores.
- Scoring Rubric: Assign an integer score s ∈ {0, 1, 2, 3}:
- 0 — Attention almost entirely on irrelevant areas; class-relevant regions largely ignored.
- 1 — Partial overlap with relevant regions, but substantial strong attention on irrelevant areas.
- 2 — Most strong attention on key class regions and structures; limited spillover to background.
- 3 — Near-perfect alignment with class-discriminative regions/boundaries; minimal unnecessary focus.
- Outputs: A single discrete score and a brief natural-language explanation supporting the rating.
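A minimal sketch of one such judging call with the OpenAI Python SDK is shown below; the rubric string paraphrases the protocol above, and the file names and class are illustrative assumptions.

```python
# Hedged sketch of one judging call with the OpenAI Python SDK; the rubric text
# paraphrases the protocol above, and the file names/class are illustrative.
import base64
from openai import OpenAI

def data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

client = OpenAI()  # reads OPENAI_API_KEY from the environment

rubric = ("You are an impartial judge of remote-sensing attention maps. Given "
          "the ground-truth class, the original RGB image, and its attention "
          "heatmap (warmer colors = higher attention), score the spatial "
          "alignment between high-activation regions and the class's semantic "
          "regions as an integer in {0,1,2,3}, then briefly explain.")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": rubric + "\nGround-truth class: River"},
        {"type": "image_url", "image_url": {"url": data_url("river_rgb.png")}},
        {"type": "image_url", "image_url": {"url": data_url("river_cam.png")}},
    ]}],
)
print(response.choices[0].message.content)
```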
Cross-dataset Evaluation (SIRI-WHU)
Overall Performance on SIRI-WHU Dataset.
| Model | Overall Accuracy | Mean Accuracy | Cohen's Kappa | Precision (macro) | Recall (macro) | F1-score (macro) |
|---|---|---|---|---|---|---|
| DViT | 0.9333 | 0.9333 | 0.8989 | 0.9406 | 0.9333 | 0.9316 |
| FlashInternImage | 0.8667 | 0.8667 | 0.8000 | 0.8808 | 0.8667 | 0.8656 |
| ViT | 0.8500 | 0.8500 | 0.7750 | 0.8541 | 0.8500 | 0.8510 |
| ResNet50 | 0.8667 | 0.8667 | 0.8000 | 0.8744 | 0.8667 | 0.8628 |
| MobileNetV2 | 0.8833 | 0.8833 | 0.8250 | 0.8952 | 0.8833 | 0.8828 |
Per-class Performance on SIRI-WHU Dataset.
| Class | Model | Acc | Prec | F1 | Kappa |
|---|---|---|---|---|---|
| Harbor | DViT (Ours) | 1.0000 | 0.8696 | 0.9302 | 0.8916 |
| | ResNet50 | 0.9500 | 0.8636 | 0.9048 | 0.8537 |
| | FlashInternImage | 0.8500 | 0.7727 | 0.8095 | 0.7073 |
| | ViT | 0.8500 | 0.8947 | 0.8718 | 0.8101 |
| | MobileNetV2 | 0.8500 | 0.9444 | 0.8947 | 0.8462 |
| Pond | DViT (Ours) | 1.0000 | 0.9524 | 0.9756 | 0.9630 |
| | ResNet50 | 0.9500 | 0.8261 | 0.8837 | 0.8193 |
| | FlashInternImage | 1.0000 | 0.8696 | 0.9302 | 0.8916 |
| | ViT | 0.8500 | 0.8947 | 0.8718 | 0.8101 |
| | MobileNetV2 | 1.0000 | 0.8000 | 0.8889 | 0.8235 |
| River | DViT (Ours) | 0.8000 | 1.0000 | 0.8889 | 0.8421 |
| | ResNet50 | 0.7000 | 0.9333 | 0.8000 | 0.7200 |
| | FlashInternImage | 0.7500 | 1.0000 | 0.8571 | 0.8000 |
| | ViT | 0.8500 | 0.7727 | 0.8095 | 0.7073 |
| | MobileNetV2 | 0.8000 | 0.9412 | 0.8649 | 0.8052 |
BibTeX
@article{wang2025lc4,
title={LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer},
author={Wang, Kai and Chen, Siyi and Pang, Weicong and Zhang, Chenchen and Gao, Renjun and Chen, Ziru and Li, Cheng and Gu, Dasa and Huang, Rui and Lau, Alexis Kai Hon},
journal={arXiv preprint arXiv:2511.22812},
year={2025}
}