
MultiNet: A Generalization Benchmark for Next Generation Multimodal Models


The Importance of a Systematic Benchmark in the Age of Multimodal, Action Grounded AI Systems

Recent advances in machine learning have demonstrated the potential of large-scale models to exhibit broad generalization across diverse tasks. Emerging Vision-Language-Action (VLA) models such as Pi0 show impressive capabilities in taking vision-language-grounded actions in robotics control tasks, yet they are typically specialized for a narrow set of robotics tasks, and their performance on broader vision-language understanding remains underexplored. Similarly, Vision-Language Models (VLMs) such as GPT-5 show impressive capabilities on vision-language and language understanding and reasoning tasks, yet their performance on completely out-of-distribution tasks such as low-level robotics control or discrete-action gameplay has not been quantitatively analyzed. To build generalist multimodal systems that can complete long-horizon tasks and deliver real value to the world, we believe we need multimodal systems of a new kind. Central to this mission is the ability to understand and evaluate the capabilities of these multimodal systems.

MultiNet addresses these limitations by:

  • Advancing Generalist Foundation Models: Providing a comprehensive benchmark to evaluate the generalist models of today across multiple modalities, domains, and environments
  • Enabling Next-Generation Multimodal Models: Offering insights on specific failure modes and capabilities of current models in generalizing to out-of-distribution tasks, thus informing the development of models that can achieve state-of-the-art performance across all constituent tasks
  • Standardizing Evaluation: Creating unified assessment protocols for robotics foundation models beyond narrow control tasks, and vision-language foundation models beyond multimodal understanding tasks
  • Democratizing Access: Releasing open-source tools that address the outdated formats, poor maintenance, and accessibility issues that often plague these rich evaluation datasets (see the sketch after this list)
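
As a loose illustration of what this kind of standardization can look like, the sketch below defines a hypothetical unified sample record and a translation helper. The `UnifiedSample` fields and the `to_unified` function are illustrative assumptions, not the actual MultiNet API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

import numpy as np


@dataclass
class UnifiedSample:
    """Hypothetical unified record for one example/timestep across MultiNet-style datasets."""
    dataset: str                          # e.g. "openx_embodiment/<subset>", "piqa", "odinw"
    category: str                         # "vision-language", "language", or "control"
    image: Optional[np.ndarray] = None    # HxWxC uint8 frame, if the dataset has vision
    text: Optional[str] = None            # instruction, question, or prompt
    action: Optional[np.ndarray] = None   # continuous or discrete action, for control tasks
    reward: Optional[float] = None        # scalar reward, if the dataset provides one
    extra: Dict[str, Any] = field(default_factory=dict)  # anything dataset-specific


def to_unified(raw: Dict[str, Any], dataset: str, category: str) -> UnifiedSample:
    """Hypothetical translation from a raw, dataset-specific dict to the unified schema."""
    return UnifiedSample(
        dataset=dataset,
        category=category,
        image=raw.get("image"),
        text=raw.get("instruction") or raw.get("question"),
        action=raw.get("action"),
        reward=raw.get("reward"),
        extra={k: v for k, v in raw.items()
               if k not in {"image", "instruction", "question", "action", "reward"}},
    )
```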

Through our initial releases and latest work, we have demonstrated significant capability gaps in current models when handling out-of-distribution tasks. These findings highlight the need for more versatile generalist architectures and provide a foundation for developing the next generation of truly generalist multimodal AI systems.

Dataset Coverage and Analysis

MultiNet offers a rich collection of diverse datasets for comprehensively understanding the multimodal capabilities of state-of-the-art models across multiple domains, tasks, and environments. It can also be used to pinpoint the precise failure modes of current models on sophisticated vision-language association tasks, high-quality language understanding and generation tasks, and reward-based action trajectories. This consolidation effort serves as a valuable benchmark for evaluating current SoTA VLMs, VLAs, and generalist models, illuminating paths toward more versatile and capable generalist AI systems.

Figure: Distribution of datasets by category in MultiNet. Robotics is the largest category (54.55%), with 6 datasets drawn from the Open X-Embodiment collection; the remaining five categories (Vision Understanding, Gameplay, Spatial Reasoning, Language Understanding, Function Calling) contribute 1 dataset each (9.09%).
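
The percentages above follow directly from the per-category dataset counts (6 robotics datasets out of 11 total); the short snippet below simply reproduces that arithmetic.

```python
# Category counts as reported in the figure above.
counts = {
    "Robotics": 6,                 # Open X-Embodiment datasets
    "Vision Understanding": 1,
    "Gameplay": 1,
    "Spatial Reasoning": 1,
    "Language Understanding": 1,
    "Function Calling": 1,
}

total = sum(counts.values())  # 11 datasets
for category, n in counts.items():
    print(f"{category}: {100 * n / total:.2f}%")
# Robotics: 54.55%; every other category: 9.09%
```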

Explore MultiNet Datasets

Dataset | Description / Task Type | Category
ODinW | Diverse object detection in-the-wild | Vision-Language
SQA3D | Situated Question Answering in 3D Scenes | Vision-Language
PIQA | Physical commonsense reasoning (question answering) | Language
BFCL v3 | Multi-turn conversational function calling | Language
Open X-Embodiment | Large-scale, diverse robotics data for generalist policies | Control
Overcooked AI | Multi-agent coordination in gameplay | Control

Benchmark

While existing benchmarks excel at evaluating specific capabilities and modalities, there remains a notable gap in holistic evaluation frameworks that can assess both the action capabilities of Vision-Language Models (VLMs) and the multimodal understanding of Vision-Language-Action Models (VLAs). Multinet addresses this gap by providing a comprehensive benchmark that spans vision-language and control tasks. Our work consolidates diverse, high-quality datasets and establishes standardized evaluation metrics to enable systematic comparison of state-of-the-art models.

Explore Benchmark Metrics & Categories

Metrics | Evaluation Category
Exact match rate/Accuracy; Turn-level accuracy; Conversation-level accuracy; Invalids Percentage | Function calling
Exact match rate/Accuracy; Similarity score; Invalids Percentage | Spatial reasoning and Visual Question Answering
Exact match rate/Accuracy; Precision, Recall, and F1 Score (Micro, Macro, and Class-wise variants); Invalids Percentage | Image understanding and classification
Exact match rate/Accuracy | Commonsense reasoning
Mean Squared Error; Brier Mean Absolute Error; Precision, Recall, and F1 Score (Micro, Macro, and Class-wise variants); Invalids Percentage | Robotics and Gameplay

Metrics in the MultiNet benchmark and the categories of tasks they each evaluate
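
To make the simpler metric definitions concrete, here is a minimal sketch of how they might be computed for a batch of model outputs. This is an illustrative implementation under stated assumptions (for example, that an "invalid" is any output that fails to parse into the expected answer space, represented here as None), not the benchmark's actual evaluation code; Brier Mean Absolute Error is omitted because its formulation depends on how predicted action distributions are represented.

```python
from typing import List, Optional, Sequence

import numpy as np
from sklearn.metrics import precision_recall_fscore_support


def exact_match_rate(preds: Sequence[Optional[str]], refs: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answer."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def invalids_percentage(preds: Sequence[Optional[str]]) -> float:
    """Percentage of outputs that could not be parsed into a valid answer (here: None)."""
    return 100.0 * sum(p is None for p in preds) / len(preds)


def prf_report(preds: List[int], refs: List[int]):
    """Micro and macro precision/recall/F1 for discrete (e.g. gameplay action) predictions."""
    micro = precision_recall_fscore_support(refs, preds, average="micro", zero_division=0)[:3]
    macro = precision_recall_fscore_support(refs, preds, average="macro", zero_division=0)[:3]
    return {"micro_p/r/f1": micro, "macro_p/r/f1": macro}


def mse(pred_actions: np.ndarray, ref_actions: np.ndarray) -> float:
    """Mean squared error between predicted and reference continuous actions."""
    return float(np.mean((pred_actions - ref_actions) ** 2))


# Tiny usage example with made-up outputs.
if __name__ == "__main__":
    preds = ["A", None, "C", "B"]      # None marks an unparseable/invalid model output
    refs = ["A", "B", "C", "C"]
    print(exact_match_rate(preds, refs))     # 0.5
    print(invalids_percentage(preds))        # 25.0
    print(prf_report([0, 1, 1], [0, 1, 2]))
```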

What's Next

The release of MultiNet is a preliminary but important step toward next-generation AI systems, with ambitious research directions planned for future iterations:

  • Knowledge Insulation Architecture: Investigate architectural designs that prevent domain-specific fine-tuning from corrupting capabilities acquired during pre-training.
  • Live Dynamic Benchmarks: Develop continuously updating evaluation frameworks that adapt to model capabilities, preventing overfitting to static test sets while enabling real-time assessment of deployed systems across diverse scenarios.
  • Cross-Modal Failure Analysis: Systematically analyze why current models exhibit fundamental misalignment between input domains and output modalities, developing architectural modifications to prevent cross-modal contamination in multi-task scenarios.
  • Progressive Multi-Domain Training: Research learning paradigms that gradually introduce new domains without catastrophic forgetting, moving beyond current zero-shot limitations to enable meta-learning for rapid adaptation while preserving existing knowledge.
  • Causal World Model Evaluation: Build and leverage world models as evaluators to test whether next-generation multimodal systems understand causal relationships rather than statistical correlations, enabling counterfactual analysis through systematic environmental perturbation.

Through these targeted investigations, we aim to establish the foundational principles for building truly unified multimodal action models capable of seamless operation across various domains such as robotics, software environments, and complex reasoning tasks without architectural degradation.

Citation

@misc{guruprasad2024benchmarking,
  author = {Guruprasad, Pranav and Sikka, Harshvardhan and Song, Jaewoo and Wang, Yangyue and Liang, Paul},
  title  = {Benchmarking Vision, Language, \& Action Models on Robotic Learning Tasks},
  doi    = {10.20944/preprints202411.0494.v1},
  year   = {2024},
}