MultiNet: A Generalist Benchmark for Vision-Language & Action Models

This is a collaborative effort between Manifold Research and Metarch.ai. This work is sponsored by Metarch.ai.
Interested in using Action Models in Production? Reach out here

Introduction

Introducing MultiNet: a comprehensive benchmark we are building to evaluate and advance the development of the emerging category of Multimodal Action Models. Key contributions of MultiNet include:

  • The largest open-source generalist dataset, consolidating diverse modalities and tasks suitable for pre-training, fine-tuning, and evaluation of Multimodal Action Models
  • Detailed analysis of the constituent datasets’ validity and utility for generalist objectives
  • Open-source software infrastructure for downloading, managing, and utilizing the benchmark data
  • Open-source software toolkit for translating control (RL and Robotics) datasets of various formats and sources into a unified TensorFlow Datasets format, making them immediately usable for training, fine-tuning, and evaluation (see the loading sketch after this list)
  • A general framework for mapping VLMs to other modality classes, with particular emphasis on action spaces
  • Open-source evaluation frameworks to profile SoTA VLMs and VLAs on datasets within MultiNet
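
As a concrete illustration of the unified format, here is a minimal sketch of how a converted control dataset might be loaded and iterated. It assumes the data has already been written out as an RLDS-style TensorFlow Datasets directory; the path and feature names below are illustrative placeholders, not MultiNet's actual layout or API.

```python
# Minimal loading sketch; the directory path and feature names are placeholders.
import tensorflow_datasets as tfds

# Point TFDS at a locally converted dataset directory (hypothetical path).
builder = tfds.builder_from_directory("/data/multinet/example_control_dataset/1.0.0")
dataset = builder.as_dataset(split="train")

for episode in dataset.take(1):
    # In RLDS-style datasets, each episode holds a nested dataset of steps.
    for step in episode["steps"]:
        observation = step["observation"]  # e.g., camera images, proprioception
        action = step["action"]            # e.g., end-effector deltas
        reward = step["reward"]
        # ...feed (observation, action, reward) into training or evaluation code
```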

In our initial release, we present results from evaluating GPT-4o (a SoTA VLM), OpenVLA (a SoTA VLA), and Hugging Face's JAT (a novel generalist model) on 20 diverse and challenging OpenX-Embodiment datasets. We aim to iterate quickly and expand our evaluation results to more datasets within MultiNet.

Interested in contributing/collaborating on this effort? Join the conversation on Discord

News

🎉 MultiNet v0.1! How well do SoTA VLMs and VLAs do on real-world robotics tasks? Read more on our Release page.

Motivation

DeepMind's Gato represented a significant milestone as the first truly multi-modal, multi-task, multi-embodiment agent. Operating with a single neural network and fixed weights, Gato demonstrated capabilities spanning Atari gameplay, image captioning, conversational interaction, and real-world robotic manipulation. However, both the model and the majority of its training datasets remain closed-source, limiting its impact on the broader research community.

The emerging class of Vision-Language-Action (VLA) models, such as OpenVLA and Octo, displays impressive capabilities in taking vision-language-grounded action on control tasks. However, because these models are trained on a narrow set of robotics tasks, their performance on vision-language understanding and generation tasks is not well understood.

Building the next generation of "generalist" models requires training on diverse datasets that span multiple modalities and task types. Such models must excel not only at individual modalities (vision, language, or control/action) but also at tasks that require seamless integration across modalities, a requirement that better reflects real-world scenarios. Currently, no large-scale, open-source dataset is specifically designed for training and evaluating such generalist models. This gap motivated the development of MultiNet.

Dataset Coverage and Analysis

MultiNet offers a rich collection of diverse datasets for developing comprehensive multimodal capabilities across multiple domains. It can be used to train models with sophisticated vision-language association knowledge, high-quality language understanding and generation abilities, and competence in reward-based action trajectories across a wide range of environments and tasks. This consolidation effort not only provides pre-training-scale data for a generalist objective, but also serves as a valuable benchmark for evaluating the capabilities of current SoTA VLMs and VLAs, illuminating paths toward more versatile and capable generalist AI systems.

Distribution of datasets across modalities in MultiNet: Control (58%), Vision-Language (29%), and Language (13%). Control represents the largest portion due to the extensive OpenX-Embodiment collection, followed by Vision-Language and Language datasets.
| Dataset | Task/Data type | Modality |
| --- | --- | --- |
| OBELICS | Interleaved Image-Text | Vision-Language |
| COYO-700M | Image-Text pairs | Vision-Language |
| MS COCO | Object detection, segmentation, key-point detection, captioning | Vision-Language |
| Conceptual Captions | Image Captioning | Vision-Language |
| A-OKVQA | Visual Question Answering | Vision-Language |
| VQA-v2 | Visual Question Answering | Vision-Language |
| Datacomp-1B | Image-Text pairs | Vision-Language |
| Fineweb-edu | High-quality text corpus | Language |

Categorization of Vision-Language and Language datasets that can be used for training and fine-tuning, and in some cases for evaluation

| Dataset | Task/Data type | Modality |
| --- | --- | --- |
| Flickr30k | Image Captioning | Vision-Language |
| TextVQA | Visual Question Answering | Vision-Language |
| VizWiz | Visual Question Answering | Vision-Language |
| WinoGAViL | Vision-based Commonsense Reasoning | Vision-Language |
| ImageNet-R | Image-Text Pairs | Vision-Language |
| ObjectNet | Image-Text Pairs | Vision-Language |
| Hellaswag | Commonsense Reasoning | Language |
| ARC | Complex Reasoning and Knowledge Application | Language |
| CommonsenseQA | Commonsense Reasoning | Language |
| MMLU | Knowledge-intensive Question Answering | Language |

Categorization of Vision-Language and Language datasets that are purely for evaluation

| Dataset | Task/Data type | Control Category |
| --- | --- | --- |
| DM Lab | Teaching RL agents 3D vision | Navigation-based control |
| DM Control Suite | Physics-based simulation environments | Locomotion-based control |
| ALE Atari | Atari games | Game-based control |
| Baby AI | Language-grounded navigation | Navigation-based control |
| MuJoCo | Multi-joint dynamics | Locomotion-based control |
| Meta-World | Meta-RL and multi-task learning | Manipulation-based control |
| V-D4RL | Pixel-based analogues of DM Control Suite | Locomotion-based control |
| Procgen | Procedurally generated Atari-like environments | Game-based control |
| Open X-Embodiment | Real-world robotics tasks | Manipulation-based and locomotion-based control |
| LocoMuJoCo | Imitation learning for locomotion | Locomotion-based control |

Categorization of control datasets

Benchmark

While existing benchmarks excel at evaluating specific capabilities and modalities, there remains a notable gap in holistic evaluation frameworks that can assess both the action capabilities of Vision-Language Models (VLMs) and the multimodal understanding of Vision-Language-Action models (VLAs). MultiNet addresses this gap by providing a comprehensive benchmark that spans vision-language, language, RL, and robotics tasks. Our work consolidates diverse, high-quality datasets and establishes standardized evaluation metrics to enable systematic comparison of state-of-the-art models. For a detailed list of the datasets and metrics included in MultiNet, please refer to our dataset specification paper.

| Metric | Evaluation Category |
| --- | --- |
| CIDEr | Image Captioning, Image-based Text retrieval |
| VQA Accuracy | Visual Question Answering |
| Recall@K | Image understanding, Text-based image retrieval |
| Accuracy | VQA, Commonsense reasoning, Text understanding |
| Mean Squared Error | RL, Robotics |

Metrics in the MultiNet benchmark and the categories of tasks each one evaluates
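
To make the metric definitions concrete, here is a minimal sketch of two of the metrics listed above: a simplified exact-match variant of VQA accuracy and the mean squared error used for action prediction. These are illustrative implementations only, not MultiNet's evaluation code (the standard VQA accuracy metric, for instance, averages agreement over multiple human annotators).

```python
# Simplified metric sketches; not the exact implementations used by MultiNet.
import numpy as np

def simple_vqa_accuracy(predictions, ground_truths):
    """Fraction of predicted answers that exactly match after normalization."""
    matches = [
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truths)
    ]
    return float(np.mean(matches))

def action_mse(predicted_actions, true_actions):
    """Mean squared error between predicted and ground-truth action vectors."""
    predicted_actions = np.asarray(predicted_actions, dtype=np.float32)
    true_actions = np.asarray(true_actions, dtype=np.float32)
    return float(np.mean((predicted_actions - true_actions) ** 2))

# Toy usage:
print(simple_vqa_accuracy(["a dog", "red"], ["a dog", "blue"]))  # 0.5
print(action_mse([[0.1, 0.2, 0.0]], [[0.0, 0.2, 0.1]]))          # ~0.0067
```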

Importance of MultiNet

MultiNet represents a significant step toward advancing generalist AI systems, establishing a comprehensive benchmark for evaluating truly generalist models that can operate across multiple modalities, tasks, and environments. Our initial findings reveal a significant capability gap in current state-of-the-art models: today's VLMs, VLAs, and generalist models struggle to maintain consistent performance across a diverse set of real-world robotics tasks they have not been exposed to before.

Current Vision-Language-Action models typically excel at vision-language-grounded actions but may underperform on pure vision-language or language tasks. MultiNet provides pre-training-scale data across all these modalities, enabling the development of models that achieve state-of-the-art performance across all constituent tasks, not just their primary domain.

A significant contribution of this work is our open-source toolkit for standardizing robotics and reinforcement learning data. Many existing datasets suffer from outdated formats, poor maintenance, and accessibility issues. MultiNet's toolkit provides stable access methods, conversion to a unified format, and ease of use for training, fine-tuning, and evaluation.
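
As a rough illustration of what such standardization involves, the sketch below maps a single step from a hypothetical source-specific episode format into one common step schema, which could then be serialized as a TensorFlow Datasets / RLDS-style dataset. The field names and schema here are assumptions for illustration, not the toolkit's actual code or format.

```python
# Illustrative normalization sketch; field names and schema are hypothetical.
import numpy as np

COMMON_STEP_KEYS = ("observation", "action", "reward", "is_terminal")

def to_common_step(raw_step: dict) -> dict:
    """Map one raw step from an arbitrary source format to a common schema."""
    return {
        "observation": {
            # Different sources name their camera streams differently.
            "image": np.asarray(
                raw_step.get("rgb", raw_step.get("image")), dtype=np.uint8
            ),
        },
        "action": np.asarray(raw_step["action"], dtype=np.float32),
        "reward": np.float32(raw_step.get("reward", 0.0)),
        "is_terminal": bool(raw_step.get("done", False)),
    }

# Toy usage: a source-specific step becomes a uniform step.
raw = {"rgb": np.zeros((64, 64, 3)), "action": [0.1, -0.2], "done": False}
step = to_common_step(raw)
assert set(step) == set(COMMON_STEP_KEYS)
```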

What's Next?

The release of MultiNet marks an important first step toward a new paradigm of foundation models, but significant opportunities remain for expansion and improvement.

  • While current VLAs show promising results on control tasks, we plan to systematically evaluate their performance on pure vision-language and language tasks to assess whether fine-tuning or co-fine-tuning on control tasks compromises their capabilities in individual modalities.
  • We also aim to broaden our evaluation scope beyond the OpenX-Embodiment dataset. By incorporating the diverse control tasks described above, we can better understand how VLAs and generalist models perform on completely out-of-distribution data.
  • Our profiling efforts in this first release focus on zero-shot performance; future work will explore few-shot learning and fine-tuning scenarios.
  • We are especially interested in fine-tuning and transferring VLAs to novel domains. We are exploring how these models might be adapted to software environments, potentially enabling more capable digital agents by leveraging insights from embodied learning.
  • Finally, we envision transforming Multinet from its current offline form into an online benchmark. This evolution may include the development of simulation environments for both 2D and 3D control tasks, enabling more dynamic and interactive evaluation of model capabilities.

BibTeX

@misc{guruprasad2024benchmarking,
  author = {Guruprasad, Pranav and Sikka, Harshvardhan and Song, Jaewoo and Wang, Yangyue and Liang, Paul},
  title  = {Benchmarking Vision, Language, \& Action Models on Robotic Learning Tasks},
  doi    = {10.20944/preprints202411.0494.v1},
  year   = {2024},
}