Illustration of an episode history along with questions and answers from our OpenEQA benchmark, which contains 1600+ untemplated questions that test several aspects of open-vocabulary embodied question answering.


We present a modern formulation of Embodied Question Answering (EQA) as the task of understanding an environment well enough to answer questions about it in natural language. An agent can achieve such an understanding either by drawing upon episodic memory, exemplified by agents on smart glasses, or by actively exploring the environment, as in the case of mobile robots. We accompany our formulation with OpenEQA, the first open-vocabulary benchmark dataset for EQA supporting both episodic memory and active exploration use cases. OpenEQA contains over 1600 high-quality human-generated questions drawn from over 180 real-world environments. In addition to the dataset, we also provide an automatic LLM-powered evaluation protocol that has excellent correlation with human judgement. Using this dataset and evaluation protocol, we evaluate several state-of-the-art foundation models like GPT-4V and find that they significantly lag behind human-level performance. Consequently, OpenEQA stands out as a straightforward, measurable, and practically relevant benchmark that poses a considerable challenge to the current generation of AI models. We hope this inspires and stimulates future research at the intersection of Embodied AI, conversational agents, and world models.

Performance of State-of-the-Art Models

LLM vs. Multi-Modal LLM Performance on EM-EQA.
We evaluated several multi-modal LLMs including Claude 3, Gemini Pro, and GPT-4V on OpenEQA. We find that these models consistently outperform text-only (or blind) LLM baselines such as LLaMA-2 or GPT-4. However, their performance is still substantially worse than the human baselines.

OpenEQA Dataset Statistics

Example questions and dataset statistics of OpenEQA.
The episode history H provides a human-like tour of a home. EQA agents must answer diverse, human-generated questions Q from 7 EQA categories, aiming to match the ground-truth answers A*. Tours are collected from diverse environments including home and office locations (not shown above). Dataset statistics (right) break down the question distribution by video source (top), question category (middle), and episodic memory vs. active setting (bottom). Note that, by design, the HM3D questions are shared across the EM-EQA and A-EQA settings.

Automated Evaluation Workflow

Illustration of LLM-Match evaluation and workflow.
While the open-vocabulary nature makes EQA realistic, it poses a challenge for evaluation due to the multiplicity of correct answers. One approach to evaluation is human trials, but they can be prohibitively slow and expensive, especially for benchmarks. As an alternative, we use an LLM to evaluate the correctness of open-vocabulary answers produced by EQA agents.
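
The idea of using an LLM as a judge can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual implementation: the prompt wording, the 1 to 5 scoring scale, the normalization to a percentage, and the names `llm_match`, `parse_score`, and `toy_judge` are all assumptions made for this example. The `judge` argument stands in for any real LLM call that maps a prompt string to a reply string.

```python
import re
from typing import Callable, Iterable, Tuple

# Hypothetical grading prompt; the wording is an assumption for illustration.
PROMPT_TEMPLATE = (
    "You are grading an answer to a question about an environment.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with a single integer from 1 (wrong) to 5 (fully correct)."
)


def parse_score(reply: str, lo: int = 1, hi: int = 5) -> int:
    """Return the first integer in [lo, hi] found in the reply, else lo."""
    for digits in re.findall(r"\d+", reply):
        value = int(digits)
        if lo <= value <= hi:
            return value
    return lo  # an unparseable reply is treated as the lowest score


def llm_match(
    examples: Iterable[Tuple[str, str, str]],
    judge: Callable[[str], str],
    lo: int = 1,
    hi: int = 5,
) -> float:
    """Average judge score over (question, reference, candidate) triples,
    rescaled from the assumed [lo, hi] scale to a 0-100 percentage."""
    scores = []
    for question, reference, candidate in examples:
        prompt = PROMPT_TEMPLATE.format(
            question=question, reference=reference, candidate=candidate
        )
        scores.append(parse_score(judge(prompt), lo, hi))
    if not scores:
        return 0.0
    return 100.0 * sum((s - lo) / (hi - lo) for s in scores) / len(scores)


def toy_judge(prompt: str) -> str:
    """Stand-in for an LLM: full marks only on a case-insensitive exact match."""
    fields = dict(
        line.split(": ", 1) for line in prompt.splitlines() if ": " in line
    )
    same = (
        fields["Reference answer"].strip().lower()
        == fields["Candidate answer"].strip().lower()
    )
    return "5" if same else "1"


if __name__ == "__main__":
    examples = [
        ("What color is the sofa?", "blue", "Blue"),
        ("Where is the kettle?", "on the counter", "in the sink"),
    ]
    print(llm_match(examples, toy_judge))
```

Replacing `toy_judge` with a function that queries an actual LLM gives a judge that can credit paraphrases and partially correct answers, which is the reason for preferring a model over exact string matching in an open-vocabulary setting.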

Performance by Category

Category-level performance on EM-EQA.
We find that agents with access to visual information excel at localizing and recognizing objects and attributes, and make better use of this information to answer questions that require world knowledge. However, on other categories performance is closer to the blind LLM baseline (GPT-4), indicating substantial room for improvement on OpenEQA.


        title         = {OpenEQA: Embodied Question Answering in the Era of Foundation Models}, 
        booktitle     = {Conference on Computer Vision and Pattern Recognition (CVPR)},
        author        = {Majumdar, Arjun and Ajay, Anurag and Zhang, Xiaohan and Putta, Pranav and Yenamandra, Sriram and Henaff, Mikael and Silwal, Sneha and Mcvay, Paul and Maksymets, Oleksandr and Arnaud, Sergio and Yadav, Karmesh and Li, Qiyang and Newman, Ben and Sharma, Mohit and Berges, Vincent and Zhang, Shiqi and Agrawal, Pulkit and Bisk, Yonatan and Batra, Dhruv and Kalakrishnan, Mrinal and Meier, Franziska and Paxton, Chris and Sax, Sasha and Rajeswaran, Aravind},
        year          = {2024},