Introducing MultiNet: a comprehensive benchmark we are building to evaluate and advance the development of the emerging category of Multimodal Action Models. Its key contributions are summarized below.
In our initial release, we present the results of evaluating GPT-4o (a SoTA VLM), OpenVLA (a SoTA VLA), and Hugging Face's JAT (a novel generalist model) on 20 diverse and challenging Open X-Embodiment datasets. We aim to iterate quickly and expand our evaluations to more of the datasets within Multinet.
DeepMind's Gato represented a significant milestone as the first truly multi-modal, multi-task, multi-embodiment agent. Operating with a single neural network and fixed weights, Gato demonstrated capabilities spanning Atari gameplay, image captioning, conversational interaction, and real-world robotic manipulation. However, both the model and the majority of its training datasets remain closed-source, limiting its impact on the broader research community.
The emerging class of Vision-Language-Action (VLA) models, such as OpenVLA and Octo, displays impressive capabilities in taking vision-language-grounded actions in control tasks. However, because these models are trained on a narrow set of robotics tasks, their performance on vision-language understanding and generation tasks is not well understood.
Building the next generation of "generalist" models requires training on diverse datasets that span multiple modalities and task types. Such models must excel not only at individual modalities (vision, language, or control/action) but also at tasks that require seamless integration across modalities - a requirement that better reflects real-world scenarios. Currently, there is no large-scale, open-source dataset specifically designed for training and evaluating such generalist models. This gap motivated the development of Multinet.
Multinet offers a rich collection of diverse datasets for developing comprehensive multimodal capabilities across multiple domains. It can be used to train models that cultivate sophisticated vision-language association knowledge, high-quality language understanding and generation abilities, and competence at producing reward-based action trajectories across a myriad of environments and tasks. This consolidation effort not only provides pre-training-scale data for a generalist objective, but also serves as a valuable benchmark for evaluating the capabilities of current SoTA VLMs and VLAs, illuminating paths toward more versatile and capable generalist AI systems.
Dataset | Task/Data type | Modality |
---|---|---|
OBELICS | Interleaved Image-Text | Vision-Language |
COYO-700M | Image-Text pairs | Vision-Language |
MS COCO | Object detection, segmentation, key-point detection, captioning | Vision-Language |
Conceptual Captions | Image Captioning | Vision-Language |
A-OKVQA | Visual Question Answering | Vision-Language |
VQA-v2 | Visual Question Answering | Vision-Language |
Datacomp-1B | Image-Text pairs | Vision-Language |
Fineweb-edu | High quality text corpus | Language |
Categorization of Vision-Language and Language datasets that can be used for training and fine-tuning, some of which can also be used for evaluation
Dataset | Task/Data type | Modality |
---|---|---|
Flickr30k | Image Captioning | Vision-Language |
TextVQA | Visual Question Answering | Vision-Language |
VizWiz | Visual Question Answering | Vision-Language |
WinoGAViL | Vision-based Commonsense Reasoning | Vision-Language |
ImageNet-R | Image-Text Pairs | Vision-Language |
ObjectNet | Image-Text Pairs | Vision-Language |
Hellaswag | Commonsense Reasoning | Language |
ARC | Complex Reasoning and Knowledge Application | Language |
CommonsenseQA | Commonsense Reasoning | Language |
MMLU | Knowledge-intensive Question Answering | Language |
Categorization of Vision-Language and Language datasets that are purely for evaluation
Dataset | Task/Data type | Control Category |
---|---|---|
DM Lab | Teaching RL agents 3D vision | Navigation-based control
DM Control Suite | Physics-based simulation environments | Locomotion-based control |
ALE Atari | Atari games | Game-based control |
Baby AI | Language-grounded navigation | Navigation-based control |
MuJoCo | Multi-joint dynamics | Locomotion-based control |
Meta-World | Meta-RL and Multi-task learning | Manipulation-based control |
V-D4RL | Pixel-based analogues of DM Control Suite | Locomotion-based control |
Procgen | Procedurally generated Atari-like environments | Game-based control |
Open X-Embodiment | Real-world robotics tasks | Manipulation-based and Locomotion-based control
LocoMuJoCo | Imitation learning for locomotion | Locomotion-based control |
Categorization of control datasets
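The control datasets above differ widely in observation spaces, action spaces, and on-disk formats, so consolidating them requires a common trajectory representation. The sketch below shows one minimal way such a unified episode schema could look in Python; the class and field names are illustrative assumptions, not Multinet's actual format.

```python
# A minimal sketch of a unified episode schema for heterogeneous control data.
# The field names below are illustrative assumptions, not Multinet's actual format.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

import numpy as np


@dataclass
class Timestep:
    """A single environment step, shared across game, locomotion, and manipulation data."""
    observation: Dict[str, np.ndarray]   # e.g. {"image": HxWx3 array, "proprio": joint states}
    action: np.ndarray                   # continuous or discretized action vector
    reward: float = 0.0
    is_terminal: bool = False
    language_instruction: Optional[str] = None  # e.g. BabyAI / Open X-Embodiment task strings


@dataclass
class Episode:
    """An ordered trajectory plus metadata identifying its source dataset."""
    source_dataset: str                  # e.g. "procgen", "open_x_embodiment"
    control_category: str                # e.g. "game", "locomotion", "manipulation", "navigation"
    timesteps: List[Timestep] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)
```

Keeping a per-step language field optional lets language-conditioned tasks (BabyAI, Open X-Embodiment) and purely reward-driven ones (Atari, MuJoCo) share the same structure.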
While existing benchmarks excel at evaluating specific capabilities and modalities, there remains a notable gap in holistic evaluation frameworks that can assess both the action capabilities of Vision-Language Models (VLMs) and the multimodal understanding of Vision-Language-Action Models (VLAs). Multinet addresses this gap by providing a comprehensive benchmark that spans vision-language, language, RL, and robotics tasks. Our work consolidates diverse, high-quality datasets and establishes standardized evaluation metrics to enable systematic comparison of state-of-the-art models. For a detailed list of the datasets and metrics included in Multinet, please refer to our dataset specification paper.
Metric | Evaluation Category |
---|---|
CIDEr | Image Captioning, Image-based Text retrieval |
VQA Accuracy | Visual Question Answering |
Recall@K | Image understanding, Text-based image retrieval |
Accuracy | VQA, Commonsense reasoning, Text understanding |
Mean Squared Error | RL, Robotics |
Metrics in the Multinet benchmark and the categories of tasks they evaluate
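To make the last row of the table concrete, the sketch below shows one straightforward way a mean squared error over predicted versus ground-truth action vectors could be computed for the RL and robotics evaluations. The helper name, expected array shapes, and averaging scheme are assumptions for illustration, not the benchmark's exact implementation.

```python
# A minimal sketch of per-timestep mean squared error between predicted and
# ground-truth action vectors. Shapes and averaging are illustrative assumptions.
import numpy as np


def action_mse(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """Average squared error over all timesteps and action dimensions.

    Both arrays are expected to have shape (num_timesteps, action_dim).
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    if predicted.shape != ground_truth.shape:
        raise ValueError(f"shape mismatch: {predicted.shape} vs {ground_truth.shape}")
    return float(np.mean((predicted - ground_truth) ** 2))


# Example: a 3-step trajectory with 2-dimensional actions.
pred = np.array([[0.1, 0.0], [0.4, 0.2], [0.9, 0.5]])
true = np.array([[0.0, 0.0], [0.5, 0.3], [1.0, 0.5]])
print(action_mse(pred, true))  # ~0.00667
```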
Multinet represents a significant step toward advancing generalist AI systems. It establishes a comprehensive benchmark for evaluating truly generalist models that can operate across multiple modalities, tasks, and environments. Our initial findings demonstrate a significant capability gap in current state-of-the-art models: today's VLMs, VLAs, and generalist models struggle to maintain consistent performance across diverse real-world robotics tasks they have not been exposed to before.
Current Vision-Language-Action models typically excel at vision-language-grounded actions but may underperform in pure vision-language or language tasks. Multinet provides pre-training scale data across all these modalities, enabling the development of models that achieve state-of-the-art performance across all constituent tasks, not just their primary domain.
A significant contribution of this work is our open-source toolkit for standardizing robotics and reinforcement learning data. Many existing datasets suffer from outdated formats, poor maintenance, and accessibility issues. Multinet's toolkit provides stable access methods, conversion to a unified format, and ease of use for training, fine-tuning, and evaluation.
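As a rough illustration of what such standardization involves (not the toolkit's actual API), the sketch below maps a list of source-specific steps into one unified, serializable episode structure; the record layout and function signature are assumptions.

```python
# A minimal sketch of the kind of conversion a standardization toolkit performs:
# mapping source-specific episode records into one unified structure.
# Field names and the function signature are assumptions for illustration.
from typing import Any, Dict, List

import numpy as np


def convert_episode(raw_steps: List[Dict[str, Any]], source_dataset: str) -> Dict[str, Any]:
    """Flatten a list of source-specific steps into a unified episode dict."""
    episode: Dict[str, Any] = {
        "source_dataset": source_dataset,
        "observations": [],
        "actions": [],
        "rewards": [],
    }
    for step in raw_steps:
        episode["observations"].append(np.asarray(step["observation"]))
        episode["actions"].append(np.asarray(step["action"]))
        episode["rewards"].append(float(step.get("reward", 0.0)))
    return episode


# Example with a tiny synthetic two-step trajectory.
steps = [
    {"observation": [0.0, 1.0], "action": [0.5], "reward": 0.0},
    {"observation": [0.1, 0.9], "action": [0.4], "reward": 1.0},
]
unified = convert_episode(steps, source_dataset="dm_control_suite")
print(len(unified["actions"]))  # 2
```

Once every source dataset is converted into one such format, the same training, fine-tuning, and evaluation code can be reused across all of them.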
The release of Multinet marks an important first step toward a new paradigm of foundation models, but significant opportunities remain for expansion and improvement.
@misc{guruprasad2024benchmarking,
  author = {Guruprasad, Pranav and Sikka, Harshvardhan and Song, Jaewoo and Wang, Yangyue and Liang, Paul},
  title = {Benchmarking Vision, Language, \& Action Models on Robotic Learning Tasks},
  doi = {10.20944/preprints202411.0494.v1},
  year = {2024},
}