EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

Zhejiang University · DAMO Academy, Alibaba Group · Hupan Lab

Abstract

The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications require persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing objects' appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation framework with four types of questions and design a novel multi-scale temporal accuracy metric for open-ended temporal evaluation. Based on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognition capabilities of MLLMs, establishing a robust foundation for developing reliable core models for embodied systems.

Task Definitions

EOC-Bench organizes questions into three temporally grounded categories: Past, Present, and Future, spanning 11 fine-grained evaluation dimensions in total. The four Past-oriented dimensions are defined below:

Object State Retrospection (OSR): Evaluates the capability to monitor changes in object attributes including color, shape, size, posture, temperature, and motion.

Object Location Retrospection (OLR): Measures historical positioning accuracy at three granularities: macro-level (room-scale), meso-level (platform/container positioning), and micro-level (precise location).

Object Relationship Evolution (ORE): Examines changes in object relationships, encompassing spatial relationships, motion state dynamics, and temporal sequence relationships.

Absolute Time Perception (ATP): Assesses the precision of absolute time cognition in two key aspects: pinpointing specific time points and understanding durations. One way such open-ended temporal answers can be scored is sketched below.
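The multi-scale temporal accuracy metric itself is not spelled out on this page; the following is a minimal sketch of the general idea, assuming an open-ended time answer is compared to the ground truth at several tolerance scales. The function name, scale values, and scoring rule are illustrative assumptions, not the paper's exact formulation.

# Minimal sketch of a multi-scale temporal accuracy metric (hypothetical;
# the exact formulation in the EOC-Bench paper may differ).

def multi_scale_temporal_accuracy(pred_sec: float,
                                  gt_sec: float,
                                  scales=(1.0, 5.0, 10.0)) -> float:
    """Score an open-ended time prediction against the ground truth.

    A prediction earns one point per tolerance scale it falls within;
    the final score is the fraction of scales satisfied, so answers
    that are close at fine granularity score higher than answers that
    are only right at coarse granularity.
    """
    error = abs(pred_sec - gt_sec)
    hits = sum(1 for tol in scales if error <= tol)
    return hits / len(scales)


if __name__ == "__main__":
    print(multi_scale_temporal_accuracy(12.0, 10.0))  # 0.67: within 5s and 10s
    print(multi_scale_temporal_accuracy(10.5, 10.0))  # 1.00: within all scales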


Overall data distribution of EOC-Bench. EOC-Bench encompasses three temporal dimensions: Past, Present, and Future, comprehensively evaluating 11 embodied cognitive abilities.


Comparison of mainstream MLLMs on EOC-Bench. Left: performance on the 11 evaluation tasks within EOC-Bench. Right: performance across different question types spanning the Past, Present, and Future categories.

EOC-Bench Leaderboard

🥇🥈🥉 indicate the top-3 models. The best results are highlighted in bold and underlined.

Orange: Proprietary Multimodal Foundation Models    Purple: Object-level MLLMs    Others: Open-Source Multimodal Foundation Models

# | Method | Input | Mean (overall) | Past: OSR, OLR, ORE, ATP, Mean | Present: ISR, OR, PFI, AP, Mean | Future: TMP, SCP, DRP, Mean
1 GPT-4o 🥇 32f 61.83 66.04 71.93 46.56 34.46 54.91 71.46 52.85 78.18 62.75 67.32 69.61 68.69 68.97 69.11
2 Gemini-2.0-flash 🥈 32f 57.38 63.46 65.10 32.56 28.60 47.87 68.84 57.52 69.68 65.69 65.95 58.54 64.02 57.95 60.75
3 InternVL2.5-78B 🥉 32f 52.33 53.46 63.96 33.15 12.01 41.35 66.67 50.74 67.10 52.94 61.72 67.80 50.47 54.55 58.19
4 InternVL2.5-38B 32f 52.31 55.40 59.62 30.92 10.89 39.89 64.15 54.28 71.29 64.71 63.35 60.98 54.67 57.95 57.79
5 Qwen2.5-VL-72B 1fps 49.87 51.25 51.22 40.11 8.48 38.41 61.31 47.79 67.10 57.84 58.98 56.10 60.65 54.55 57.76
6 LLaVA-Video-72B 32f 49.59 49.03 56.91 26.74 24.02 39.59 63.32 47.20 63.87 50.00 58.38 56.10 55.14 47.73 54.24
7 GPT-4o-mini 32f 49.47 53.26 52.35 29.68 21.10 39.47 58.46 49.26 67.74 58.82 58.31 56.59 50.00 54.55 53.45
8 LLaVA-OV-72B 32f 47.88 46.81 50.95 26.46 12.91 34.81 64.15 51.33 64.52 49.02 59.87 58.05 46.73 54.55 52.66
9 VideoLLaMA3-7B 1fps 46.04 45.15 52.85 24.51 15.54 35.00 57.96 48.67 62.58 49.02 56.01 52.20 49.54 48.86 50.49
10 InternVL2.5-8B 32f 45.15 45.71 54.47 39.00 9.76 37.87 55.44 48.97 54.84 41.18 52.60 49.76 38.79 53.41 45.76
11 Qwen2.5-VL-7B 1fps 43.13 47.37 46.34 21.45 8.18 31.38 57.29 44.54 59.35 49.02 53.93 48.78 46.30 46.59 47.35
12 LLaVA-Video-7B 32f 41.82 44.32 48.51 22.56 9.76 31.82 54.27 43.66 55.81 49.02 51.56 45.85 40.65 47.73 43.98
13 VideoLLaMA2-72B 16f 41.55 43.77 51.22 24.23 6.46 32.03 50.08 37.46 58.06 45.10 48.37 49.27 50.47 51.14 50.10
14 LLaVA-OV-7B 32f 40.46 40.72 45.53 22.84 9.53 30.15 54.10 43.07 52.58 46.08 50.37 47.32 37.38 46.59 43.00
15 VideoRefer-7B 16f 40.44 47.37 55.01 23.40 10.59 34.69 48.91 39.82 53.55 38.24 46.88 41.95 35.51 43.18 39.45
16 VideoLLaMA3-2B 1fps 38.41 37.12 46.88 21.17 11.26 29.57 49.92 43.36 48.39 38.24 47.03 43.41 36.11 43.18 40.28
17 Qwen2.5-VL-3B 1fps 38.17 38.78 48.78 23.96 7.66 30.34 49.92 38.94 45.16 38.24 45.18 42.93 36.57 50.00 41.45
18 VideoLLaMA2.1-7B 16f 37.74 44.88 42.82 19.22 11.64 30.08 47.24 37.17 51.94 39.22 45.18 40.00 36.92 44.32 39.45
19 NVILA-8B 32f 37.69 37.40 46.61 20.89 12.09 29.69 44.39 41.59 49.03 46.08 44.88 42.44 38.32 44.32 41.03
20 LongVA-7B 32f 35.34 36.84 43.36 17.83 15.32 28.69 38.19 36.58 48.06 42.16 40.36 39.02 42.06 40.91 40.63
21 VideoLLaVA-7B 8f 34.11 31.86 37.94 27.58 13.14 27.97 41.04 35.10 40.97 37.25 39.24 40.98 31.78 44.32 37.67


🚨 To submit your results to the leaderboard, please send your result JSON files to circleradon@gmail.com.

🚨 For more evaluation details, please refer to our GitHub repo.
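As a rough illustration of how a submission might be scored, here is a minimal sketch in Python. The JSON schema ("question_id", "prediction", "answer") and the scoring function are hypothetical placeholders; the authoritative submission format is defined in the GitHub repo.

import json

# Hypothetical scorer for a result JSON file; the actual submission schema
# used by EOC-Bench may differ from this sketch.
# Assumed format: a list of {"question_id", "prediction", "answer"} records.

def score_results(path: str) -> float:
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    correct = sum(1 for r in records if r["prediction"] == r["answer"])
    return 100.0 * correct / len(records)

if __name__ == "__main__":
    print(f"accuracy: {score_results('results.json'):.2f}")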


Comparison between EOC-Bench and Other Benchmarks


Comparison of widely adopted embodied/general VideoQA benchmarks with our EOC-Bench. P, B, M, and A denote the visual prompt types used for object referencing: point, box, mask, and arrow, respectively.
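For intuition, the snippet below shows one way the four prompt types could be rendered onto a video frame with OpenCV before it is passed to a model. The colors, sizes, and helper names are illustrative assumptions; EOC-Bench's actual prompt-rendering pipeline may differ.

import cv2
import numpy as np

# Illustrative rendering of the four visual object-referencing prompts
# (point, box, mask, arrow) onto a frame. Drawing parameters (color,
# thickness, overlay alpha) are assumptions for this sketch.

RED = (0, 0, 255)  # BGR

def draw_point(frame, xy):
    cv2.circle(frame, xy, radius=6, color=RED, thickness=-1)

def draw_box(frame, xyxy):
    x1, y1, x2, y2 = xyxy
    cv2.rectangle(frame, (x1, y1), (x2, y2), color=RED, thickness=2)

def draw_mask(frame, mask, alpha=0.5):
    overlay = frame.copy()
    overlay[mask.astype(bool)] = RED
    cv2.addWeighted(overlay, alpha, frame, 1 - alpha, 0, dst=frame)

def draw_arrow(frame, tail_xy, tip_xy):
    cv2.arrowedLine(frame, tail_xy, tip_xy, color=RED, thickness=2)

if __name__ == "__main__":
    frame = np.zeros((240, 320, 3), dtype=np.uint8)  # stand-in frame
    mask = np.zeros((240, 320), dtype=np.uint8)
    mask[100:140, 150:200] = 1                       # stand-in object mask
    draw_point(frame, (60, 60))
    draw_box(frame, (120, 80, 210, 150))
    draw_mask(frame, mask)
    draw_arrow(frame, (20, 220), (60, 180))
    cv2.imwrite("prompted_frame.png", frame)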

Demo Videos

BibTeX

@article{yuan2025eocbench,
      author  = {Yuqian Yuan and Ronghao Dang and Long Li and Wentong Li and Dian Jiao and Xin Li and Deli Zhao and Fan Wang and Wenqiao Zhang and Jun Xiao and Yueting Zhuang},
      title   = {EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?},
      journal = {arXiv preprint arXiv:2506.05287},
      year    = {2025},
      url     = {https://arxiv.org/abs/2506.05287}
}