ViT-Up

Faithful Feature Upsampling for Vision Transformers

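import torch

# Load the pretrained ViT-Up model from the torch.hub repository and
# switch it to evaluation mode.
model = torch.hub.load(
    "krispinwandel/vit-up",
    "vit_up_dinov3_splus",
    pretrained=True,
    trust_repo=True,
).eval()

# pixel_values: the input image batch; query_coords: the continuous image
# coordinates at which features are requested (both supplied by the caller).
query_features = model(pixel_values, query_coords)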
[Figure: ViT-Up model overview]
ViT-Up is an implicit feature upsampler for Vision Transformers that predicts backbone-aligned features at arbitrary continuous image coordinates.
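To make "arbitrary continuous image coordinates" concrete, here is a minimal sketch that builds a dense grid of query points at any target resolution. The helper `make_query_grid` is hypothetical, and the convention it uses (pixel centers, `(x, y)` order, normalized to `[0, 1]`) is an assumption; this page does not state the layout that `query_coords` must follow, so check the repository before passing such a grid to the model.

```python
def make_query_grid(height, width):
    """Return (x, y) pixel-center coordinates in row-major order.

    Assumption: coordinates are normalized to [0, 1]; the convention the
    ViT-Up model actually expects is not stated on this page.
    """
    coords = []
    for row in range(height):
        for col in range(width):
            # Pixel centers sit at half-integer offsets from the image corner.
            coords.append(((col + 0.5) / width, (row + 0.5) / height))
    return coords


grid = make_query_grid(2, 4)
print(len(grid))   # 8
print(grid[0])     # (0.125, 0.25)
print(grid[-1])    # (0.875, 0.75)
```

Because the queries are plain coordinates, the same function covers any output resolution, and a sparse set of keypoints could be queried the same way without building a full grid.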
Cityscapes mIoU: 65.41 (+2.07 vs. best baseline)
SPair-71k PCK@0.10: 55.44 (+4.17 vs. best baseline)
COCO depth δ1: 62.72 (+0.55 vs. best baseline)
NAVI PCK@0.10: 80.81 (+0.50 vs. best baseline)

Method

ViT-Up builds query features from intermediate ViT hidden states, reducing feature leakage while preserving alignment with the backbone feature space.

[Figure: ViT-Up architecture diagram]
Training uses frozen multi-resolution ViT teacher features to supervise LoRA-adapted student features; query embeddings are refined through ViT-Up blocks that combine token-level cross-attention with sub-token FeatX modulation.
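As a rough illustration of the token-level cross-attention mentioned above, the sketch below lets one query embedding attend over a set of token vectors with generic single-head scaled dot-product attention. It is not the ViT-Up block: learned projections, multiple heads, and the sub-token FeatX modulation are all omitted, the tokens serve as both keys and values, and the function name `cross_attend` is hypothetical.

```python
import math


def cross_attend(query, tokens):
    """Generic single-head attention of one query over token vectors.

    Assumption: tokens act as both keys and values, with no learned
    projections. This is a textbook sketch, not the ViT-Up block.
    """
    dim = len(query)
    # Scaled dot-product score between the query and every token.
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
              for tok in tokens]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the attention-weighted average of the token vectors.
    out = [sum(w * tok[i] for w, tok in zip(weights, tokens))
           for i in range(dim)]
    return out, weights


out, weights = cross_attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights[0] > weights[1])  # True
```

The query is most similar to the first token, so that token receives the larger weight and dominates the refined query feature.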

Result Highlights

Across dense prediction and correspondence benchmarks, ViT-Up improves over bilinear interpolation and state-of-the-art image-guided upsamplers on every reported metric, with gains of up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k using DINOv3-S+.

Best semantic segmentation gain: +2.07 Cityscapes mIoU
Best depth improvement: 1.33 COCO RMSE reduction
Best semantic correspondence gain: +5.11 SPair-71k PCK@0.05
Best geometric correspondence gain: +0.50 NAVI PCK@0.10

Probing gains vs. best baseline

COCO mIoU: +0.23
VOC mIoU: +1.63
ADE mIoU: +0.49
City mIoU: +2.07
Depth δ1: +0.55
Depth RMSE: +1.33 (reported as a reduction; lower RMSE is better)
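For readers unfamiliar with the two depth metrics, the following sketch uses their common definitions: δ1 is the fraction of positions where the ratio between prediction and ground truth stays below 1.25, and RMSE is the root mean squared error. The exact evaluation protocol behind the numbers on this page (depth units, scaling, masking) is not stated here, so treat this as an assumption-laden illustration with made-up toy values.

```python
import math


def delta1(pred, target, threshold=1.25):
    """Fraction of positions with max(pred/target, target/pred) < threshold.

    Assumption: all depths are strictly positive.
    """
    hits = sum(1 for p, t in zip(pred, target)
               if max(p / t, t / p) < threshold)
    return hits / len(pred)


def rmse(pred, target):
    """Root mean squared error between predicted and true depths."""
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared / len(pred))


# Toy values for illustration only, not data from the benchmarks.
pred = [1.0, 2.0, 4.0, 3.0]
target = [1.0, 2.0, 2.0, 3.0]
print(delta1(pred, target))  # 0.75
print(rmse(pred, target))    # 1.0
```

Higher δ1 and lower RMSE both indicate better depth predictions, which is why the RMSE gain is reported as a reduction.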

Correspondence gains vs. best baseline

SPair-71k PCK@0.10: +4.17
SPair-71k PCK@0.05: +5.11
SPair-71k PCK@0.01: +3.47
NAVI PCK@0.10: +0.50
NAVI PCK@0.05: +0.41
NAVI PCK@0.01: +0.25
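Probing results on DINOv3-S+. Higher is better for mIoU, accuracy, and δ1; lower is better for RMSE.

| Method | COCO Seg. mIoU ↑ | COCO Seg. Acc ↑ | VOC mIoU ↑ | VOC Acc ↑ | ADE20K mIoU ↑ | ADE20K Acc ↑ | Cityscapes mIoU ↑ | Cityscapes Acc ↑ | COCO Depth δ1 ↑ | COCO Depth RMSE ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Bilinear | 63.10 | 81.85 | 84.88 | 96.45 | 43.27 | 76.17 | 61.36 | 93.44 | 61.52 | 62.80 |
| JAFAR | 62.50 | 81.50 | 83.88 | 96.16 | 42.48 | 75.81 | 57.78 | 92.47 | 60.64 | 64.92 |
| AnyUp | 63.03 | 81.83 | 84.54 | 96.34 | 42.77 | 76.02 | 58.96 | 92.93 | 61.66 | 62.62 |
| UpLiFT | 63.79 | 82.28 | 85.69 | 96.72 | 44.24 | 76.71 | 63.08 | 93.94 | 61.84 | 61.79 |
| NAF | 63.86 | 82.33 | 85.84 | 96.72 | 44.17 | 76.69 | 63.34 | 94.13 | 62.17 | 61.15 |
| ViT-Up (Ours) | 64.09 | 82.49 | 87.47 | 97.14 | 44.73 | 77.06 | 65.41 | 94.73 | 62.72 | 59.82 |
| Gain vs. best baseline | +0.23 | +0.16 | +1.63 | +0.42 | +0.49 | +0.35 | +2.07 | +0.60 | +0.55 | +1.33 |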
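The "Gain vs. best baseline" row can be reproduced from the other rows. The sketch below shows the arithmetic for two columns, using the Cityscapes mIoU and COCO depth RMSE values from the probing results; the helper name `gain_vs_best_baseline` is hypothetical. The only subtlety is direction: for a lower-is-better metric such as RMSE, the best baseline is the minimum and the gain is the reduction.

```python
def gain_vs_best_baseline(ours, baselines, higher_is_better=True):
    """Signed improvement of `ours` over the strongest baseline score."""
    scores = baselines.values()
    best = max(scores) if higher_is_better else min(scores)
    diff = ours - best if higher_is_better else best - ours
    return round(diff, 2)


# Baseline values copied from the probing results for DINOv3-S+.
cityscapes_miou = {"Bilinear": 61.36, "JAFAR": 57.78, "AnyUp": 58.96,
                   "UpLiFT": 63.08, "NAF": 63.34}
depth_rmse = {"Bilinear": 62.80, "JAFAR": 64.92, "AnyUp": 62.62,
              "UpLiFT": 61.79, "NAF": 61.15}

print(gain_vs_best_baseline(65.41, cityscapes_miou))                   # 2.07
print(gain_vs_best_baseline(59.82, depth_rmse, higher_is_better=False))  # 1.33
```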
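Correspondence results on DINOv3-S+. PCK is reported at different tolerance thresholds; higher is better for all metrics.

| Method | SPair-71k PCK@0.10 ↑ | SPair-71k PCK@0.05 ↑ | SPair-71k PCK@0.01 ↑ | NAVI PCK@0.10 ↑ | NAVI PCK@0.05 ↑ | NAVI PCK@0.01 ↑ |
|---|---|---|---|---|---|---|
| Bilinear | 51.27 | 33.74 | 3.83 | 80.16 | 51.18 | 33.58 |
| JAFAR | 36.82 | 18.59 | 1.89 | 79.04 | 47.02 | 26.60 |
| AnyUp | 37.63 | 19.31 | 1.97 | 80.31 | 48.78 | 28.37 |
| UpLiFT | 46.87 | 29.15 | 3.43 | 79.35 | 49.05 | 30.49 |
| NAF | 48.68 | 33.96 | 2.89 | 80.03 | 50.29 | 31.62 |
| ViT-Up (Ours) | 55.44 | 39.07 | 7.30 | 80.81 | 51.59 | 33.83 |
| Gain vs. best baseline | +4.17 | +5.11 | +3.47 | +0.50 | +0.41 | +0.25 |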
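PCK (percentage of correct keypoints) counts a predicted keypoint as correct when it lands within a tolerance of the ground truth, where the tolerance is the threshold α times a reference size. A minimal sketch follows; the function name and the `ref_size` parameter are hypothetical, and this page does not say whether the benchmarks normalize by the object bounding box or by the image size, so that choice is left to the caller. The keypoints are toy values.

```python
import math


def pck(pred_points, gt_points, ref_size, alpha):
    """Fraction of keypoints predicted within alpha * ref_size of the truth.

    Assumption: ref_size is a single reference length (for example the
    longer side of a bounding box); the benchmark protocols may differ.
    """
    limit = alpha * ref_size
    correct = sum(
        1 for (px, py), (gx, gy) in zip(pred_points, gt_points)
        if math.hypot(px - gx, py - gy) <= limit
    )
    return correct / len(gt_points)


# Toy keypoints with errors of 0.5, 3, and 12 pixels.
gt = [(10.0, 10.0), (50.0, 50.0), (90.0, 20.0)]
pred = [(10.0, 10.5), (50.0, 53.0), (90.0, 32.0)]

for alpha in (0.10, 0.05, 0.01):
    print(alpha, pck(pred, gt, ref_size=100.0, alpha=alpha))
```

With a reference size of 100, two of the three keypoints are correct at α = 0.10 and α = 0.05, and only one at α = 0.01. Tightening the threshold can only lower the score, which matches the drop across the three columns in the correspondence results.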

People

Krispin Wandel

Krispin Wandel received the M.Sc. degree in computational science and engineering from ETH Zurich, Zurich, Switzerland. He is currently pursuing the Ph.D. degree with the Department of Automation, Shanghai Jiao Tong University, Shanghai, China, under the supervision of Prof. Hesheng Wang. His research interests include visual representation learning, dense prediction, semantic correspondence, and robotics.

Jingchuan Wang

Jingchuan Wang received the B.Eng., M.Phil., and Ph.D. degrees in control theory and control engineering from Shanghai Jiao Tong University (SJTU), Shanghai, China, in 2002, 2005, and 2014, respectively. He is currently a Professor with the School of Automation and Intelligent Sensing and the Institute of Medical Robotics at SJTU. He is also an IEEE Senior Member. His research interests include service robots and mobile robot localization and navigation.

Hesheng Wang

Hesheng Wang (Senior Member, IEEE) received the B.Eng. degree in electrical engineering from the Harbin Institute of Technology, Harbin, China, in 2002, and the M.Phil. and Ph.D. degrees in automation and computer-aided engineering from the Chinese University of Hong Kong, Hong Kong, in 2004 and 2007, respectively. He is currently a Distinguished Professor with the School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai, China. His research interests include visual servoing, intelligent robotics, computer vision, and autonomous driving. Dr. Wang is an Associate Editor of Robotic Intelligence and Automation and the International Journal of Humanoid Robotics, a Senior Editor of the IEEE/ASME Transactions on Mechatronics, and Editor-in-Chief of Robot Learning. He served as an Associate Editor of IEEE Transactions on Robotics from 2015 to 2019 and IEEE Transactions on Automation Science and Engineering from 2021 to 2023. He was the General Chair of IEEE/RSJ IROS 2025, IEEE ROBIO 2022, and IEEE RCAR 2016.