LC4-DViT: Land-cover Creation for Land-cover Classification
with Deformable Vision Transformer

HKUST · CUHK(SZ) · JHU · NUS · MUST
*Equal contribution. Corresponding authors.
Keywords: Remote Sensing Image · Diffusion · Land-cover Classification · Deformable Convolution Network · Vision Transformer

Overview

Overall architecture of LC4-DViT.

Deformable Vision Transformer (DViT) module.

The pipeline first enhances low-resolution remote sensing images using RRDBNet within the Real-ESRGAN framework, then leverages GPT-4o-generated land-cover descriptions together with Stable Diffusion to synthesize diverse augmented samples. Finally, the proposed deformation-aware ViT (DViT) performs land-cover classification by explicitly modeling complex landform geometries, improving robustness and accuracy.
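
For readers who want a concrete picture of the data-creation stage, the following is a minimal sketch of the GPT-4o + Stable Diffusion augmentation step, assuming the openai and diffusers Python packages. The checkpoint name, prompt wording, and file paths are illustrative placeholders rather than the exact configuration used in the paper, and Real-ESRGAN super-resolution is assumed to have been applied upstream.

# Minimal sketch of the text-guided data-creation step (assumed APIs:
# openai>=1.x for GPT-4o scene descriptions, diffusers for Stable Diffusion).
# Checkpoint names, prompts, and paths are illustrative, not the authors' setup.
import base64
import torch
from openai import OpenAI
from diffusers import StableDiffusionPipeline

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_scene(image_path: str, land_cover_class: str) -> str:
    """Ask GPT-4o for a short land-cover description of a super-resolved exemplar."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Describe this {land_cover_class} remote sensing scene "
                         f"in one sentence suitable as an image-generation prompt."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Stable Diffusion synthesizes class-balanced augmented samples from the description.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = describe_scene("aid_river_0001.png", "River")   # hypothetical exemplar path
augmented = pipe(prompt, num_inference_steps=30).images[0]
augmented.save("river_synthetic_0001.png")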

Abstract

Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely and accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification provides a scalable route to such maps, but it is hindered by scarce and imbalanced annotations as well as geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from AID (Beach, Bridge, Desert, Forest, Mountain, Pond, Port, River), DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen’s Kappa. Cross-dataset experiments on SIRI-WHU (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability.
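
As a rough illustration of the DViT design described above, the sketch below couples a deformable convolutional stem with a standard Transformer encoder. It uses torchvision's DeformConv2d as a stand-in for the DCNv4 backbone (the real DCNv4 operator ships as a separate CUDA package), and the widths, depth, and 8-class head are illustrative rather than the paper's actual hyperparameters.

# Minimal sketch of a deformation-aware ViT in the spirit of DViT, assuming a
# torchvision DeformConv2d stand-in for the DCNv4 backbone; all sizes are illustrative.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableStem(nn.Module):
    """Deformable convolutional stem that tokenizes an image into patch features."""
    def __init__(self, in_ch=3, dim=256, patch=16):
        super().__init__()
        # Offsets are predicted from the input, then consumed by the deformable conv.
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(in_ch, dim, kernel_size=3, padding=1)
        self.proj = nn.Conv2d(dim, dim, kernel_size=patch, stride=patch)  # patchify

    def forward(self, x):
        feat = self.deform(x, self.offset(x))                 # geometry-aware local features
        return self.proj(feat).flatten(2).transpose(1, 2)     # (B, N, dim) tokens

class DViTSketch(nn.Module):
    def __init__(self, num_classes=8, dim=256, depth=6, heads=8):
        super().__init__()
        self.stem = DeformableStem(dim=dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, 197, dim))      # 196 patches + [CLS] at 224x224
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)     # global context
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        t = self.stem(x)
        t = torch.cat([self.cls.expand(t.size(0), -1, -1), t], dim=1) + self.pos
        return self.head(self.encoder(t)[:, 0])                # classify from [CLS] token

logits = DViTSketch()(torch.randn(2, 3, 224, 224))             # shape (2, 8)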

Results

Image Quality Assessment

We use KID (Kernel Inception Distance) to evaluate the quality of the generated images.

Method Dataset KID ↓
LC4-DViT (Ours) AID 12.10
UniControl CNSATMAP 20.42
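
A minimal sketch of how such a KID score can be computed with torchmetrics (an assumed tooling choice; the paper does not state its implementation, and the 12.10 in the table may correspond to a scaled value such as KID × 10³):

# Minimal KID evaluation sketch with torchmetrics (requires torch-fidelity for the
# Inception backbone). Images must be uint8 tensors of shape (B, 3, H, W); the
# random tensors below are placeholders for real AID crops and diffusion outputs.
import torch
from torchmetrics.image.kid import KernelInceptionDistance

kid = KernelInceptionDistance(subset_size=50)

real_batch = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)       # real images
synthetic_batch = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)  # generated images

kid.update(real_batch, real=True)
kid.update(synthetic_batch, real=False)
kid_mean, kid_std = kid.compute()
print(f"KID: {kid_mean.item():.4f} ± {kid_std.item():.4f}")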

Overall Performance on AID Dataset

Overall metrics comparing DViT, ResNet50, FlashInternImage, ViT, and MobileNetV2 on the AID dataset.

Model Overall Accuracy Mean Accuracy Cohen-Kappa Precision (macro) Recall (macro) F1-score (macro)
DViT 0.9572 0.9592 0.9510 0.9576 0.9592 0.9576
ResNet50 0.9311 0.9342 0.9211 0.9354 0.9342 0.9332
FlashInternImage 0.9404 0.9409 0.9317 0.9450 0.9409 0.9416
ViT 0.9274 0.9306 0.9169 0.9337 0.9306 0.9300
MobileNetV2 0.9348 0.9389 0.9254 0.9411 0.9389 0.9378
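
For reference, the metrics reported in these tables can be computed with scikit-learn as sketched below (an assumed tooling choice); "Mean Accuracy" is read here as macro-averaged per-class recall, which matches the Recall (macro) column above.

# Minimal sketch of the reported metrics using scikit-learn.
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_recall_fscore_support)

def summarize(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {
        "overall_accuracy": accuracy_score(y_true, y_pred),
        "mean_accuracy": recall,                      # macro recall over the classes
        "cohen_kappa": cohen_kappa_score(y_true, y_pred),
        "precision_macro": precision,
        "recall_macro": recall,
        "f1_macro": f1,
    }

print(summarize([0, 1, 2, 2, 3], [0, 1, 2, 1, 3]))    # toy labels, not AID predictions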

Per-class Performance on AID Dataset

Class Model Acc Prec F1 Kappa
Beach DViT (Ours) 0.9306 0.9710 0.9504 0.9429
Beach ResNet50 0.9167 0.9706 0.9429 0.9343
Beach FlashInternImage 0.9306 0.9853 0.9571 0.9507
Beach ViT 0.9861 0.9595 0.9726 0.9683
Beach MobileNetV2 0.9444 0.9714 0.9577 0.9513
Bridge DViT (Ours) 0.9242 0.9683 0.9457 0.9383
Bridge ResNet50 0.9091 0.7692 0.8333 0.8077
Bridge FlashInternImage 0.8182 0.9474 0.8780 0.8624
Bridge ViT 0.9394 0.7848 0.8552 0.8328
Bridge MobileNetV2 0.9545 0.8400 0.8936 0.8766
Desert DViT (Ours) 1.0000 0.9524 0.9756 0.9725
Desert ResNet50 1.0000 0.9375 0.9677 0.9635
Desert FlashInternImage 0.9833 0.9672 0.9752 0.9721
Desert ViT 1.0000 1.0000 1.0000 1.0000
Desert MobileNetV2 0.9833 0.9833 0.9833 0.9812
Forest DViT (Ours) 1.0000 0.9434 0.9709 0.9678
Forest ResNet50 0.9800 0.9608 0.9703 0.9672
Forest FlashInternImage 0.9600 0.9796 0.9697 0.9666
Forest ViT 0.9400 0.9592 0.9495 0.9444
Forest MobileNetV2 0.9800 0.9800 0.9800 0.9779
Mountain DViT (Ours) 0.9706 0.8800 0.9231 0.9113
Mountain ResNet50 0.9706 0.9565 0.9635 0.9582
Mountain FlashInternImage 0.9853 0.8933 0.9371 0.9274
Mountain ViT 0.9706 0.8684 0.9167 0.9038
Mountain MobileNetV2 0.9853 0.8272 0.8993 0.8833
Pond DViT (Ours) 0.9870 0.9870 0.9870 0.9848
Pond ResNet50 0.9221 0.9467 0.9342 0.9234
Pond FlashInternImage 0.9610 1.0000 0.9801 0.9769
Pond ViT 0.8571 0.9429 0.8980 0.8818
Pond MobileNetV2 0.8571 0.9851 0.9167 0.9038
Port DViT (Ours) 0.9275 1.0000 0.9624 0.9571
Port ResNet50 0.8551 0.9833 0.9147 0.9032
Port FlashInternImage 0.9420 0.9420 0.9420 0.9335
Port ViT 0.8986 1.0000 0.9466 0.9392
Port MobileNetV2 0.9130 0.9844 0.9474 0.9399
River DViT (Ours) 0.9333 0.9589 0.9459 0.9373
River ResNet50 0.9200 0.9583 0.9388 0.9291
River FlashInternImage 0.9467 0.8452 0.8931 0.8746
River ViT 0.8533 0.9552 0.9014 0.8864
River MobileNetV2 0.8933 0.9571 0.9241 0.9123

Normalized Confusion Matrices on AID Dataset

(a) ViT normalized confusion matrix

(b) FlashInternImage normalized confusion matrix

(c) ResNet50 normalized confusion matrix

(d) MobileNetV2 normalized confusion matrix

(e) DViT normalized confusion matrix

Ablation Study (AID)

Ablation of two data-centric components: SupreR (Real-ESRGAN super-resolution) and Diffusion (GPT-4o-assisted controlled diffusion).

Model SupreR Diffusion Overall Accuracy Mean Accuracy Cohen-Kappa Precision (macro) Recall (macro) F1-score (macro)
DViT 0.9162 0.9185 0.9041 0.9213 0.9185 0.9183
0.9311 0.9341 0.9211 0.9326 0.9341 0.9326
0.9367 0.9383 0.9275 0.9386 0.9383 0.9380
0.9572 0.9592 0.9510 0.9576 0.9592 0.9576

Evaluation of Model Attention via Heatmaps

Model’s attention to key areas of the images: class-activation heatmaps of the compared models for eight scene categories (Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River). The first column shows the original scene; the remaining columns show the attention heatmaps of the different models.
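
As an illustration of how such class-activation heatmaps can be produced, below is a minimal Grad-CAM-style sketch in PyTorch; the CAM variant and target layers used in the paper are not specified here, so ResNet50's last convolutional block serves purely as an example.

# Minimal Grad-CAM-style sketch for producing attention heatmaps (an assumed
# method choice; ResNet50's layer4 is used only as an example target layer).
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V2").eval()
feats, grads = {}, {}
layer = model.layer4

layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224, requires_grad=True)      # stand-in for a scene image
score = model(x)[0].max()                                 # top-class logit
score.backward()

weights = grads["a"].mean(dim=(2, 3), keepdim=True)       # global-average-pooled gradients
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalized heatmap in [0, 1]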

GPT-4o Judge Scores (0–3)

Model DViT FlashInternImage ViT ResNet50 MobileNetV2
Average Score 2.625 2.5 2.25 2.0 2.5

Based on the eight representative heatmaps, we further performed an LLM-based (GPT-4o) evaluation of heatmap quality. For each image-model pair, GPT-4o was prompted as an impartial judge of remote-sensing attention maps following this protocol (a sketch of one such judging call is shown after the list):

  • Inputs: The ground-truth scene category, the original RGB image, and the corresponding attention heatmap (warmer colors = higher attention).
  • Evaluation Focus: Only assess the spatial alignment between high-activation regions and the semantic regions of the ground-truth class (e.g., water body, shoreline, river course, forest canopy, mountain ridge, bridge span), ignoring model architecture, training details, and any numerical scores.
  • Scoring Rubric: Assign an integer score s ∈ {0, 1, 2, 3}:
    • 0 — Attention almost entirely on irrelevant areas; class-relevant regions largely ignored.
    • 1 — Partial overlap with relevant regions, but substantial strong attention on irrelevant areas.
    • 2 — Most strong attention on key class regions and structures; limited spillover to background.
    • 3 — Near-perfect alignment with class-discriminative regions/boundaries; minimal unnecessary focus.
  • Outputs: A single discrete score and a brief natural-language explanation supporting the rating.
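
Below is a minimal sketch of one judging call under this protocol, assuming the openai Python SDK (v1.x); the exact prompt wording and the JSON output format are illustrative rather than the authors' verbatim setup.

# Minimal sketch of a single GPT-4o judging call; prompts are illustrative.
import base64, json
from openai import OpenAI

client = OpenAI()

def encode(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def judge_heatmap(scene_class: str, image_path: str, heatmap_path: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are an impartial judge of remote-sensing attention maps. "
                        "Score the spatial alignment between high-activation regions and "
                        "the ground-truth class regions on an integer scale 0-3, ignoring "
                        "model architecture and training details. Reply as JSON "
                        "{\"score\": int, \"explanation\": str}."},
            {"role": "user",
             "content": [
                 {"type": "text", "text": f"Ground-truth class: {scene_class}. "
                                          f"Warmer colors indicate higher attention."},
                 {"type": "image_url",
                  "image_url": {"url": f"data:image/png;base64,{encode(image_path)}"}},
                 {"type": "image_url",
                  "image_url": {"url": f"data:image/png;base64,{encode(heatmap_path)}"}},
             ]},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)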

Cross-dataset Evaluation (SIRI-WHU)

Overall Performance on SIRI-WHU Dataset

Model Overall Accuracy Mean Accuracy Cohen-Kappa Precision (macro) Recall (macro) F1-score (macro)
DViT 0.9333 0.9333 0.8989 0.9406 0.9333 0.9316
FlashInternImage 0.8667 0.8667 0.8000 0.8808 0.8667 0.8656
ViT 0.8500 0.8500 0.7750 0.8541 0.8500 0.8510
ResNet50 0.8667 0.8667 0.8000 0.8744 0.8667 0.8628
MobileNetV2 0.8833 0.8833 0.8250 0.8952 0.8833 0.8828

Per-class Performance on SIRI-WHU Dataset

Class Model Acc Prec F1 Kappa
Harbor DViT (Ours) 1.0000 0.8696 0.9302 0.8916
Harbor ResNet50 0.9500 0.8636 0.9048 0.8537
Harbor FlashInternImage 0.8500 0.7727 0.8095 0.7073
Harbor ViT 0.8500 0.8947 0.8718 0.8101
Harbor MobileNetV2 0.8500 0.9444 0.8947 0.8462
Pond DViT (Ours) 1.0000 0.9524 0.9756 0.9630
Pond ResNet50 0.9500 0.8261 0.8837 0.8193
Pond FlashInternImage 1.0000 0.8696 0.9302 0.8916
Pond ViT 0.8500 0.8947 0.8718 0.8101
Pond MobileNetV2 1.0000 0.8000 0.8889 0.8235
River DViT (Ours) 0.8000 1.0000 0.8889 0.8421
River ResNet50 0.7000 0.9333 0.8000 0.7200
River FlashInternImage 0.7500 1.0000 0.8571 0.8000
River ViT 0.8500 0.7727 0.8095 0.7073
River MobileNetV2 0.8000 0.9412 0.8649 0.8052

BibTeX

@article{wang2025lc4,
  title={LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer},
  author={Wang, Kai and Chen, Siyi and Pang, Weicong and Zhang, Chenchen and Gao, Renjun and Chen, Ziru and Li, Cheng and Gu, Dasa and Huang, Rui and Lau, Alexis Kai Hon},
  journal={arXiv preprint arXiv:2511.22812},
  year={2025}
}