Recent advances in machine learning have demonstrated the potential of large-scale models to generalize broadly across diverse tasks. Emerging Vision-Language-Action (VLA) models such as Pi0 show impressive capabilities in taking vision-language-grounded actions in robotic control tasks, yet they are typically specialized for a narrow set of robotics tasks, and their performance on broader vision-language understanding remains underexplored. Similarly, Vision-Language Models (VLMs) such as GPT-5 show impressive capabilities in vision-language and language understanding and reasoning tasks, yet their performance on far out-of-distribution tasks such as low-level robotic control or discrete-action gameplay has not been quantitatively analyzed. To build generalist multimodal systems that can complete long-horizon tasks and deliver real value to the world, we believe we need to build multimodal systems of a new kind. Central to this mission is the ability to understand and evaluate the capabilities of these multimodal systems.
Multinet addresses these limitations by consolidating diverse, high-quality datasets across vision-language and control domains and establishing standardized metrics for evaluating state-of-the-art models on them.
Through our initial releases and latest work, we have demonstrated significant capability gaps in current models when handling out-of-distribution tasks. These findings highlight the need for more versatile generalist architectures and provide a foundation for developing the next generation of truly generalist multimodal AI systems.
Multinet offers a rich collection of diverse datasets for comprehensively understanding the multimodal capabilities of state-of-the-art models across multiple domains, tasks, and environments. It can also be used to pinpoint the pitfalls and precise failure modes of current models on sophisticated vision-language association tasks, high-quality language understanding and generation tasks, and reward-based action trajectories. This consolidation effort serves as a valuable benchmark for evaluating the capabilities of current SoTA VLMs, VLAs, and generalist models, illuminating paths toward more versatile and capable generalist AI systems.
| Dataset | Description / Task Type | Category |
|---|---|---|
| ODinW | Diverse object detection in-the-wild | Vision-Language |
| SQA3D | Situated Question Answering in 3D Scenes | Vision-Language |
| PIQA | Physical commonsense reasoning (Question Answering) | Language |
| BFCL v3 | Multi-turn conversational function calling | Language |
| Open X Embodiment | Large-scale, diverse robotics data for generalist policies | Control |
| Overcooked AI | Multi-agent coordination in gameplay | Control |
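To make this consolidation concrete, the sketch below shows one way heterogeneous samples (a situated VQA pair and a robot control step, for example) could be normalized into a single record schema before evaluation. This is an illustrative assumption, not Multinet's actual data format; every field name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class EvalSample:
    """Hypothetical unified record for one evaluation sample (illustrative, not Multinet's schema)."""
    dataset: str                          # e.g. "SQA3D", "BFCL v3", "Open X Embodiment"
    category: str                         # "vision-language" | "language" | "control"
    observation: dict                     # images, scene context, conversation history, robot state, ...
    instruction: str                      # natural-language question, prompt, or task description
    reference: Any                        # ground-truth answer, function call, or action vector
    action_space: Optional[dict] = None   # bounds/discretization for control tasks, None otherwise

# A vision-language sample and a control sample can then flow through the same evaluation harness:
vqa_sample = EvalSample(
    dataset="SQA3D",
    category="vision-language",
    observation={"scene_id": "scan_0001", "situation": "I am facing the window."},
    instruction="What piece of furniture is to my left?",
    reference="a wooden table",
)
control_sample = EvalSample(
    dataset="Open X Embodiment",
    category="control",
    observation={"image": "frame_0042.png", "proprio": [0.10, -0.32, 0.71]},
    instruction="Pick up the red block.",
    reference=[0.02, -0.11, 0.05, 0.0, 0.0, 0.0, 1.0],
    action_space={"dim": 7, "low": -1.0, "high": 1.0},
)
```

Normalizing samples along these lines is what allows a single model interface to be scored with the category-appropriate metrics listed below.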
While existing benchmarks excel at evaluating specific capabilities and modalities, there remains a notable gap in holistic evaluation frameworks that can assess both the action capabilities of Vision-Language Models (VLMs) and the multimodal understanding of Vision-Language-Action Models (VLAs). Multinet addresses this gap by providing a comprehensive benchmark that spans vision-language and control tasks. Our work consolidates diverse, high-quality datasets and establishes standardized evaluation metrics to enable systematic comparison of state-of-the-art models.
| Metric | Evaluation Category |
|---|---|
| Exact match rate/Accuracy, Turn-level accuracy, Conversation-level accuracy, Invalids Percentage | Function calling |
| Exact match rate/Accuracy, Similarity score, Invalids Percentage | Spatial reasoning and Visual Question Answering |
| Exact match rate/Accuracy, Precision, Recall, and F1 Score (Micro, Macro, and Class-wise variants), Invalids Percentage | Image understanding and classification |
| Exact match rate/Accuracy | Commonsense reasoning |
| Mean Squared Error, Brier Mean Absolute Error, Precision, Recall, and F1 Score (Micro, Macro, and Class-wise variants), Invalids Percentage | Robotics and Gameplay |
Metrics in the Multinet benchmark and the categories of tasks they evaluate
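As a rough illustration of how some of these metrics can be computed, the snippet below sketches exact match rate, invalids percentage, and the two control-task error metrics. The exact definitions used in Multinet may differ; in particular, Brier MAE is shown here under the assumption that the model outputs a probability distribution over discretized action bins that is compared against a one-hot target.

```python
import numpy as np

def exact_match_rate(preds, refs):
    """Fraction of predictions that exactly match the reference after simple normalization."""
    matches = sum(str(p).strip().lower() == str(r).strip().lower() for p, r in zip(preds, refs))
    return matches / len(refs)

def invalids_percentage(preds, is_valid):
    """Share of model outputs that cannot be parsed into the expected format.

    `is_valid` is any predicate, e.g. a function-call parser or an action-vector
    shape check (an assumed interface, not Multinet's API).
    """
    invalid = sum(0 if is_valid(p) else 1 for p in preds)
    return 100.0 * invalid / len(preds)

def action_mse(pred_actions, ref_actions):
    """Mean squared error between predicted and ground-truth continuous action vectors."""
    pred = np.asarray(pred_actions, dtype=float)
    ref = np.asarray(ref_actions, dtype=float)
    return float(np.mean((pred - ref) ** 2))

def brier_mae(pred_probs, target_bins):
    """Brier-style mean absolute error over discretized action bins.

    Assumes each prediction is a probability distribution over bins and the target
    is the index of the correct bin (one-hot); the exact Multinet definition may differ.
    """
    probs = np.asarray(pred_probs, dtype=float)            # shape: (n_samples, n_bins)
    onehot = np.eye(probs.shape[1])[np.asarray(target_bins)]
    return float(np.mean(np.abs(probs - onehot)))
```

Reporting the invalids percentage alongside accuracy or error matters because a model that frequently emits unparseable outputs could otherwise look artificially strong on the subset of outputs that do parse.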
The release of Multinet is a preliminary but important step toward next-generation AI systems, with ambitious research directions planned for future iterations.
Through these targeted investigations, we aim to establish the foundational principles for building truly unified multimodal action models capable of seamless operation across various domains such as robotics, software environments, and complex reasoning tasks without architectural degradation.
@misc{guruprasad2024benchmarking,
  author = {Guruprasad, Pranav and Sikka, Harshvardhan and Song, Jaewoo and Wang, Yangyue and Liang, Paul},
  title = {Benchmarking Vision, Language, \& Action Models on Robotic Learning Tasks},
  doi = {10.20944/preprints202411.0494.v1},
  year = {2024},
}