
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

1 University of California, Los Angeles   * Equal Contribution
arXiv · Code · 🤗 Dataset · 🏆 Leaderboard · 🌐 Twitter

Abstract

Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of performing complex tasks that require joint reasoning over text and visual content in an image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark of instructions designed explicitly to evaluate LMMs' ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time reading, navigation, shopping, and more) that demand a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities under human evaluation, indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluation, we also employed automatic evaluation with GPT-4, which uncovered similar trends in the performance disparities. Finally, we perform a fine-grained evaluation across diverse visual contexts and provide a qualitative analysis, which together offer a robust framework for future advancements in LMM design.
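To give a concrete picture of the GPT-4-based automatic evaluation mentioned above, here is a minimal sketch that scores a single model response with GPT-4 as a judge. The prompt wording, the model name, and the yes/no scoring are illustrative assumptions, not the exact protocol used in the paper.

```python
# Sketch of GPT-4-as-judge scoring for one ConTextual-style example.
# The prompt and accept/reject parsing are assumptions, not the paper's rubric.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def judge_response(instruction: str, reference: str, prediction: str) -> bool:
    """Ask GPT-4 whether a model's prediction satisfies the instruction,
    given a human-written reference response."""
    prompt = (
        "You are evaluating an answer to an instruction about an image.\n"
        f"Instruction: {instruction}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Does the model answer follow the instruction and agree with the "
        "reference? Reply with a single word: yes or no."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")
```

Accuracy is then simply the fraction of examples judged "yes" over the full split.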

Leaderboard on Test (Overall) dataset (GPT-4 eval)

Accuracy scores on the Test dataset (506 examples) of ConTextual.

| # | Model | Method | Source | Date | ALL | Time | Shop. | Nav. | Abs. | App. | Web. | Info. | Misc. NS. |
|---|-------|--------|--------|------|-----|------|-------|------|------|------|------|-------|-----------|
| - | Human Performance | - | Link | 2024-01-24 | 69.6 | 64.0 | 64.0 | 73.5 | 75.5 | 64.0 | 58.0 | 72.0 | 78.0 |
| 1 | GPT-4V(ision) 🥇 | LMM 🖼️ | Link | 2024-01-24 | 47.4 | 18.0 | 54.0 | 48.0 | 100.0 | 48.0 | 42.0 | 28.0 | 48.0 |
| 2 | Gemini-Pro-Vision 🥈 | LMM 🖼️ | Link | 2024-01-24 | 40.2 | 16.0 | 32.7 | 28.6 | 65.3 | 44.9 | 43.8 | 20.0 | 52.8 |
| 3 | ShareGPT-4V-7B 🥉 | LMM 🖼️ | Link | 2024-01-24 | 22.6 | 0.0 | 16.0 | 20.0 | 28.6 | 20.0 | 20.0 | 14.0 | 37.7 |
| 4 | GPT-4 w/ Layout-aware OCR + Caption | LLM 👓 | Link | 2024-01-24 | 22.2 | 6.0 | 16.0 | 24.0 | 57.1 | 14.0 | 18.0 | 8.0 | 27.3 |
| 5 | Qwen-VL | LMM 🖼️ | Link | 2024-01-24 | 21.8 | 4.0 | 20.0 | 24.0 | 53.1 | 6.0 | 18.0 | 14.0 | 27.3 |
| 6 | LLaVA-1.5-13B | LMM 🖼️ | Link | 2024-01-24 | 20.8 | 4.0 | 10.0 | 18.0 | 44.9 | 16.0 | 26.0 | 4.0 | 29.7 |
| 7 | mPLUG-Owl-v2-7B | LMM 🖼️ | Link | 2024-01-24 | 18.6 | 4.0 | 8.0 | 24.0 | 32.7 | 20.0 | 10.0 | 12.0 | 26.0 |
| 8 | GPT-4 w/ Layout-aware OCR | LLM 👓 | Link | 2024-01-24 | 18.2 | 8.0 | 20.0 | 18.0 | 34.7 | 10.0 | 16.0 | 16.0 | 20.7 |
| 9 | GPT-4 w/ OCR* | LLM 👓 | Link | 2024-01-24 | 15.9 | 4.0 | 10.0 | 14.0 | 30.6 | 8.0 | 16.0 | 28.6 | 16.9 |
| 10 | LLaVAR-13B | LMM 🖼️ | Link | 2024-01-24 | 14.9 | 10.0 | 16.0 | 6.0 | 44.9 | 8.0 | 10.0 | 6.0 | 16.7 |
| 11 | BLIVA | LMM 🖼️ | Link | 2024-01-24 | 10.3 | 2.0 | 4.0 | 14.0 | 24.5 | 4.0 | 8.0 | 4.0 | 14.7 |
| 12 | InstructBLIP-Vicuna-7B | LMM 🖼️ | Link | 2024-01-24 | 9.7 | 2.0 | 4.0 | 16.0 | 20.0 | 6.0 | 12.0 | 2.1 | 12.0 |
| 13 | Idefics-9B | LMM 🖼️ | Link | 2024-01-24 | 7.7 | 4.0 | 2.0 | 12.0 | 12.0 | 0.0 | 6.0 | 2.0 | 13.3 |
You can access the Test dataset on HuggingFace using this link.
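As a convenience, here is a minimal sketch of loading the Test split with the HuggingFace datasets library. The dataset identifier and the field names in the comments are assumptions; please refer to the dataset card linked above for the exact values.

```python
# Minimal sketch of loading the ConTextual test split from the HuggingFace Hub.
# The dataset ID below is an assumption; use the ID linked from this page.
from datasets import load_dataset

dataset = load_dataset("ucla-contextual/contextual_test", split="test")  # hypothetical ID
print(len(dataset))       # the Test dataset contains 506 examples
print(dataset[0].keys())  # e.g. image, instruction, and reference-response fields (names assumed)
```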



Leaderboard on Val subset (GPT-4 eval)

Accuracy scores on the Val subset (100 examples) of ConTextual.

| # | Model | Method | Source | Date | ALL | Time | Shop. | Nav. | Abs. | App. | Web. | Info. | Misc. NS. |
|---|-------|--------|--------|------|-----|------|-------|------|------|------|------|-------|-----------|
| - | Human Performance | - | Link | 2024-01-24 | 72.0 | 90.0 | 90.0 | 70.0 | 70.0 | 60.0 | 50.0 | 80.0 | 70.0 |
| 1 | GPT-4V(ision) 🥇 | LMM 🖼️ | Link | 2024-01-24 | 53.0 | 40.0 | 60.0 | 50.0 | 100.0 | 50.0 | 30.0 | 30.0 | 56.7 |
| 2 | Gemini-Pro-Vision 🥈 | LMM 🖼️ | Link | 2024-01-24 | 37.8 | 20.0 | 30.0 | 10.0 | 80.0 | 44.4 | 30.0 | 20.0 | 46.7 |
| 3 | GPT-4 w/ Layout-aware OCR + Caption 🥉 | LLM 👓 | Link | 2024-01-24 | 23.0 | 10.0 | 10.0 | 40.0 | 60.0 | 0.0 | 10.0 | 20.0 | 26.7 |
| 4 | ShareGPT-4V-7B | LMM 🖼️ | Link | 2024-01-24 | 17.0 | 0.0 | 30.0 | 10.0 | 30.0 | 10.0 | 10.0 | 0.0 | 26.7 |
| 5 | LLaVA-1.5-13B | LMM 🖼️ | Link | 2024-01-24 | 16.0 | 0.0 | 10.0 | 10.0 | 50.0 | 10.0 | 20.0 | 10.0 | 16.7 |
You can access the Val subset on HuggingFace using this link.


Human Performance: average performance of Amazon Mechanical Turk (AMT) annotators.
GPT-4V: OpenAI's LMM GPT-4V(ision).
Gemini-Pro-Vision: Google's LMM Gemini-Pro-Vision.
GPT-4*: GPT-4 Turbo.

Method types: LMM 🖼️: Large Multimodal Model, LLM 👓: Augmented Large Language Model.

Visual Scenarios: Time: Time Reading, Shop.: Shopping, Nav.: Navigation, Abs.: Abstract, App.: Application Usage, Web.: Web Usage, Info.: Infographic, Misc. NS.: Miscellaneous Natural Scenes.

🚨🚨 The leaderboard is continuously being updated. To submit your results to the leaderboard, please send your model's predictions for the image URLs in the Test dataset (506 examples) to rwadhawan7@g.ucla.edu and hbansal@g.ucla.edu.
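As a rough illustration only, the snippet below shows one way to collect a model's predictions keyed by image URL into a JSON file before sending them. The exact submission format is not specified on this page, so the field names and file layout here are hypothetical.

```python
# Hypothetical sketch of packaging predictions for a leaderboard submission.
# Field names ("image_url", "instruction", "prediction") and the JSON layout
# are assumptions, not a confirmed submission format.
import json


def save_predictions(examples, predict_fn, out_path="predictions.json"):
    """examples: iterable of dicts holding an image URL and an instruction;
    predict_fn: your model's callable (image_url, instruction) -> str."""
    results = [
        {
            "image_url": ex["image_url"],      # field name assumed
            "instruction": ex["instruction"],  # field name assumed
            "prediction": predict_fn(ex["image_url"], ex["instruction"]),
        }
        for ex in examples
    ]
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```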


BibTeX

@misc{wadhawan2024contextual,
  title={ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models}, 
  author={Rohan Wadhawan and Hritik Bansal and Kai-Wei Chang and Nanyun Peng},
  year={2024},
  eprint={2401.13311},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}