🥇🥈🥉 indicate the top-3 models. The best results are highlighted in bold and underlined.
Orange: Proprietary Multimodal Foundation Models; Purple: Object-level MLLMs; Others: Open-Source Multimodal Foundation Models.
# | Method | Input | Mean | Past: OSR | Past: OLR | Past: ORE | Past: ATP | Past: Mean | Present: ISR | Present: OR | Present: PFI | Present: AP | Present: Mean | Future: TMP | Future: SCP | Future: DRP | Future: Mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | GPT-4o 🥇 | 32f | 61.83 | 66.04 | 71.93 | 46.56 | 34.46 | 54.91 | 71.46 | 52.85 | 78.18 | 62.75 | 67.32 | 69.61 | 68.69 | 68.97 | 69.11 |
2 | Gemini-2.0-flash 🥈 | 32f | 57.38 | 63.46 | 65.10 | 32.56 | 28.60 | 47.87 | 68.84 | 57.52 | 69.68 | 65.69 | 65.95 | 58.54 | 64.02 | 57.95 | 60.75 |
3 | InternVL2.5-78B 🥉 | 32f | 52.33 | 53.46 | 63.96 | 33.15 | 12.01 | 41.35 | 66.67 | 50.74 | 67.10 | 52.94 | 61.72 | 67.80 | 50.47 | 54.55 | 58.19 |
4 | InternVL2.5-38B | 32f | 52.31 | 55.40 | 59.62 | 30.92 | 10.89 | 39.89 | 64.15 | 54.28 | 71.29 | 64.71 | 63.35 | 60.98 | 54.67 | 57.95 | 57.79 |
5 | Qwen2.5-VL-72B | 1fps | 49.87 | 51.25 | 51.22 | 40.11 | 8.48 | 38.41 | 61.31 | 47.79 | 67.10 | 57.84 | 58.98 | 56.10 | 60.65 | 54.55 | 57.76 |
6 | LLaVA-Video-72B | 32f | 49.59 | 49.03 | 56.91 | 26.74 | 24.02 | 39.59 | 63.32 | 47.20 | 63.87 | 50.00 | 58.38 | 56.10 | 55.14 | 47.73 | 54.24 |
7 | GPT-4o-mini | 32f | 49.47 | 53.26 | 52.35 | 29.68 | 21.10 | 39.47 | 58.46 | 49.26 | 67.74 | 58.82 | 58.31 | 56.59 | 50.00 | 54.55 | 53.45 |
8 | LLaVA-OV-72B | 32f | 47.88 | 46.81 | 50.95 | 26.46 | 12.91 | 34.81 | 64.15 | 51.33 | 64.52 | 49.02 | 59.87 | 58.05 | 46.73 | 54.55 | 52.66 |
9 | VideoLLaMA3-7B | 1fps | 46.04 | 45.15 | 52.85 | 24.51 | 15.54 | 35.00 | 57.96 | 48.67 | 62.58 | 49.02 | 56.01 | 52.20 | 49.54 | 48.86 | 50.49 |
10 | InternVL2.5-8B | 32f | 45.15 | 45.71 | 54.47 | 39.00 | 9.76 | 37.87 | 55.44 | 48.97 | 54.84 | 41.18 | 52.60 | 49.76 | 38.79 | 53.41 | 45.76 |
11 | Qwen2.5-VL-7B | 1fps | 43.13 | 47.37 | 46.34 | 21.45 | 8.18 | 31.38 | 57.29 | 44.54 | 59.35 | 49.02 | 53.93 | 48.78 | 46.30 | 46.59 | 47.35 |
12 | LLaVA-Video-7B | 32f | 41.82 | 44.32 | 48.51 | 22.56 | 9.76 | 31.82 | 54.27 | 43.66 | 55.81 | 49.02 | 51.56 | 45.85 | 40.65 | 47.73 | 43.98 |
13 | VideoLLaMA2-72B | 16f | 41.55 | 43.77 | 51.22 | 24.23 | 6.46 | 32.03 | 50.08 | 37.46 | 58.06 | 45.10 | 48.37 | 49.27 | 50.47 | 51.14 | 50.10 |
14 | LLaVA-OV-7B | 32f | 40.46 | 40.72 | 45.53 | 22.84 | 9.53 | 30.15 | 54.10 | 43.07 | 52.58 | 46.08 | 50.37 | 47.32 | 37.38 | 46.59 | 43.00 |
15 | VideoRefer-7B | 16f | 40.44 | 47.37 | 55.01 | 23.40 | 10.59 | 34.69 | 48.91 | 39.82 | 53.55 | 38.24 | 46.88 | 41.95 | 35.51 | 43.18 | 39.45 |
16 | VideoLLaMA3-2B | 1fps | 38.41 | 37.12 | 46.88 | 21.17 | 11.26 | 29.57 | 49.92 | 43.36 | 48.39 | 38.24 | 47.03 | 43.41 | 36.11 | 43.18 | 40.28 |
17 | Qwen2.5-VL-3B | 1fps | 38.17 | 38.78 | 48.78 | 23.96 | 7.66 | 30.34 | 49.92 | 38.94 | 45.16 | 38.24 | 45.18 | 42.93 | 36.57 | 50.00 | 41.45 |
18 | VideoLLaMA2.1-7B | 16f | 37.74 | 44.88 | 42.82 | 19.22 | 11.64 | 30.08 | 47.24 | 37.17 | 51.94 | 39.22 | 45.18 | 40.00 | 36.92 | 44.32 | 39.45 |
19 | NVILA-8B | 32f | 37.69 | 37.40 | 46.61 | 20.89 | 12.09 | 29.69 | 44.39 | 41.59 | 49.03 | 46.08 | 44.88 | 42.44 | 38.32 | 44.32 | 41.03 |
20 | LongVA-7B | 32f | 35.34 | 36.84 | 43.36 | 17.83 | 15.32 | 28.69 | 38.19 | 36.58 | 48.06 | 42.16 | 40.36 | 39.02 | 42.06 | 40.91 | 40.63 |
21 | VideoLLaVA-7B | 8f | 34.11 | 31.86 | 37.94 | 27.58 | 13.14 | 27.97 | 41.04 | 35.10 | 40.97 | 37.25 | 39.24 | 40.98 | 31.78 | 44.32 | 37.67 |