PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

1Zhejiang University 2DAMO Academy, Alibaba Group 3Hupan Lab 4The Hong Kong Polytechnic University

🔥 Highlights

  • Unified Fine-Grained MLLM. We introduce a unified multimodal LLM framework that supports precise, region-specific understanding in both static images and dynamic videos, overcoming the holistic, scene-level bias of prior MLLMs.
  • Dual Paradigms with Scale-Adaptive Tokens. We develop two complementary designs: PixelRefer (Vision-Object) and PixelRefer-Lite (Object-Only). Both leverage a Scale-Adaptive Object Tokenizer (SAOT) for precise, semantically rich region representations. PixelRefer-Lite further integrates an Object-Centric Infusion (OCI) module that efficiently fuses global and object-level cues at early layers while preserving semantic fidelity.
  • Data & Benchmarks. We curate PixelRefer-2.2M, a large-scale, high-quality object-centric instruction dataset.
  • State-of-the-Art with Efficiency. Extensive experiments show PixelRefer achieves SOTA across diverse image and video benchmarks with fewer training samples. PixelRefer-Lite delivers substantial runtime and memory savings, underscoring its practicality for real-world deployment.

Overview

Performance comparison of PixelRefer models

🔥 Highlights:

(a) PixelRefer and PixelRefer-Lite surpass prior MLLMs on diverse image and video benchmarks.

(b) They achieve top results with fewer training samples.

(c) PixelRefer-Lite cuts inference time and memory use.

PixelRefer Model


Frameworks of the two complementary paradigms for region-level representations in our approach:
(a) the Vision-Object framework; (b) the Object-Only framework.
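
For intuition, here is a minimal sketch of how the two paradigms could assemble the LLM input sequence. All names (`encode_frames`, `saot`, `oci`) are illustrative assumptions rather than the released PixelRefer API, and shapes are placeholders.

```python
# Minimal sketch of the two paradigms' token composition (assumed interface, not the released code).
import torch

def vision_object_inputs(frames, masks, prompt_embeds, encode_frames, saot):
    """PixelRefer (Vision-Object): global vision tokens + object tokens + text tokens."""
    vision_tokens = encode_frames(frames)          # (N_vis, D) scene-level context
    object_tokens = saot(frames, masks)            # (N_obj, D) region-specific tokens
    return torch.cat([vision_tokens, object_tokens, prompt_embeds], dim=0)

def object_only_inputs(frames, masks, prompt_embeds, encode_frames, saot, oci):
    """PixelRefer-Lite (Object-Only): global cues are infused into object tokens by an
    Object-Centric Infusion module, so only object + text tokens reach the LLM."""
    vision_tokens = encode_frames(frames)
    object_tokens = saot(frames, masks)
    infused = oci(object_tokens, vision_tokens)    # early fusion of global context
    return torch.cat([infused, prompt_embeds], dim=0)
```

Dropping the global vision tokens from the LLM sequence is what gives PixelRefer-Lite its runtime and memory savings.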


Architecture of our proposed Scale-Adaptive Object Tokenizer.
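
The exact tokenizer design follows the paper; the snippet below is only a rough sketch, assuming SAOT mask-pools frame features and scales its token budget with the object's relative area. The function name, token budget, and area-to-token mapping are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def scale_adaptive_object_tokens(feat_map, mask, min_tokens=1, max_tokens=16):
    """Illustrative scale-adaptive tokenization (assumed behavior, not the exact SAOT).

    feat_map: (D, H, W) visual features for one frame.
    mask:     (H, W) binary mask of the referred object.
    Returns:  (K, D) object tokens, where K grows with the object's relative area.
    """
    area_ratio = mask.float().mean()                             # object scale in [0, 1]
    k = int(torch.clamp(min_tokens + area_ratio * (max_tokens - min_tokens),
                        min_tokens, max_tokens).item())
    side = max(1, int(round(k ** 0.5)))                          # pool onto a ~sqrt(k) x sqrt(k) grid

    masked = feat_map * mask.unsqueeze(0)                        # zero out background features
    pooled = F.adaptive_avg_pool2d(masked.unsqueeze(0), side)    # (1, D, side, side)
    return pooled.squeeze(0).flatten(1).transpose(0, 1)          # (side*side, D) object tokens
```

The intuition behind a scale-adaptive budget is that small objects are summarized by a few tokens while large objects keep more spatial detail, giving precise yet compact region representations.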

Dataset


Overview of datasets used for model training.
Left: Data distribution for Foundational Object Perception training (1.4M samples).
Right: Data for Visual Instruction Tuning (0.8M samples).

Main Results

Performance on image-level region understanding benchmarks:
category-level (LVIS and PACO), detailed captioning (DLC-Bench and Ref-L4 [CLAIR]), phrase-level (Ref-L4 and VG), and reasoning-level (Ferret-Reasoning).



Performance comparison on VideoRefer-Bench.

Inference time and memory usage on DLC-Bench (Image) and HC-STVG (Video).
We report per-item inference time (s/item) and peak GPU memory (GB).
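
Both metrics can be measured with standard PyTorch utilities; the sketch below is a generic illustration (assuming a CUDA device and a `model.generate(**inputs)`-style interface), not the exact benchmarking script used for the table.

```python
import time
import torch

@torch.no_grad()
def measure_item(model, inputs):
    """Measure per-item inference time (s/item) and peak GPU memory (GB) for one sample."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    start = time.perf_counter()
    _ = model.generate(**inputs)       # assumed generation interface
    torch.cuda.synchronize()

    elapsed_s = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return elapsed_s, peak_gb
```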


BibTeX

@article{yuan2025pixelrefer,
  title     = {PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity},
  author    = {Yuqian Yuan and Wenqiao Zhang and Xin Li and Shihao Wang and Kehan Li and Wentong Li and Jun Xiao and Lei Zhang and Beng Chin Ooi},
  year      = {2025},
  journal   = {arXiv},
}

@inproceedings{yuan2025videorefer,
  title     = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
  author    = {Yuqian Yuan and Hang Zhang and Wentong Li and Zesen Cheng and Boqiang Zhang and Long Li and Xin Li and Deli Zhao and Wenqiao Zhang and Yueting Zhuang and others},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages     = {18970--18980},
  year      = {2025},
}