
MultiNet v1.0: A Comprehensive Benchmark for Evaluating Multimodal Reasoning and Action Models Across Diverse Domains

1 Manifold Research, 2 Metarch AI, 3 Georgia Tech, 4 Tufts University, 5 Northeastern University, 6 Birla Institute of Technology and Science, Pilani, 7 Institute for Research and Innovation in Intelligent Systems (IRIIS)
*Indicates Equal Contribution
MultiNet v1 Release Visual

MultiNet v1.0 provides a comprehensive benchmark suite for evaluating state-of-the-art multimodal reasoning and action models across diverse domains including robotics, gameplay, and multimodal understanding tasks.

Abstract

Multimodal reasoning and action models hold immense promise as general-purpose agents, yet the current evaluation landscape remains fragmented with domain-specific benchmarks that fail to capture true generalization capabilities. This critical gap prevents us from understanding where these sophisticated systems excel and where they fail. We introduce MultiNet v1.0, a unified benchmark suite that bridges this evaluation gap by systematically assessing state-of-the-art Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and generalist models across robotics, multi-agent gameplay, and multimodal reasoning tasks. Our evaluation suite spans 11 datasets and includes state-of-the-art models such as GPT-5, Pi0, and Magma in their respective categories.

Key contributions of MultiNet v1.0 include:

  • Comprehensive Domain Coverage: Evaluation across robotics, gameplay, commonsense reasoning, spatial reasoning, visual question answering, and visual understanding tasks. These capabilities are essential for generalist models and systems.
  • Standardized Evaluation Protocols: Unified metrics and evaluation procedures for fair comparison across different model architectures
  • Model Adaptation Framework: Open-source code for adapting diverse models to various out-of-distribution task domains
  • Extensive Analysis: In-depth analysis of model capabilities, failure modes, and architectural trade-offs
  • Open-Source Toolkit: Complete evaluation harness and benchmarking tools for the research community
Our findings reveal significant insights into the current state of multimodal AI, highlighting both promising capabilities and critical limitations that inform future research directions. We release our complete benchmark suite, evaluation framework, and detailed analysis to accelerate progress in this field.

Dataset Coverage

MultiNet v1.0 evaluates models across six major domains using 11 diverse datasets. Each dataset presents unique challenges in vision-language-action understanding, from robotic manipulation to complex reasoning tasks.

  • OpenX (Robotics): Large-scale robotics dataset with diverse manipulation and locomotion tasks across multiple robot embodiments and environments.
  • Overcooked (Gameplay): Cooperative cooking simulation requiring coordination, planning, and multi-agent interaction in kitchen environments.
  • PIQA (Physical Commonsense Reasoning): Physical interaction question answering dataset testing common-sense reasoning about object properties and interactions.
  • ODinW (Object Detection): Object detection dataset with diverse domains testing visual recognition across varied contexts and object categories in the wild.
  • SQA3D (3D Spatial Reasoning): 3D scene understanding dataset combining spatial reasoning with question answering in complex indoor environments.
  • BFCL (Function Calling): Berkeley function calling dataset for evaluating multi-turn conversational function calling capabilities.
MultiNet v1.0 evaluates models across diverse modalities and task types, from low-level robotic control to high-level reasoning.


Dataset preview

Evaluation Methodology and Metrics

MultiNet v1.0 employs standardized evaluation metrics tailored to each task category, ensuring comprehensive and fair assessment across diverse model architectures. Our evaluation framework adapts metrics to the unique characteristics of each domain while maintaining consistency for cross-domain comparisons:

Brier MAE

Mean Absolute Error of Brier scores. The Brier score is a standard measure of the accuracy of probabilistic predictions:

$\text{Brier Score} = \frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{R}\left(f_{ti} - o_{ti}\right)^{2}$

where $f_{ti}$ is the predicted probability of class $i$ at timestep $t$, $o_{ti}$ is the ground-truth indicator for class $i$ at timestep $t$, $R$ is the number of classes, and $N$ is the number of timesteps. Brier MAE is a variation of the original Brier score that averages the per-timestep Brier absolute errors.
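
As an illustration (not the MultiNet harness API), a minimal NumPy sketch of this computation might look like the following; the function name, variable names, and array shapes are assumptions.

import numpy as np

def brier_mae(pred_probs: np.ndarray, true_classes: np.ndarray) -> float:
    """Average per-timestep Brier score (illustrative sketch, not the official harness)."""
    _, n_classes = pred_probs.shape
    # One-hot encode the ground truth: o_ti = 1 for the true class at timestep t, else 0.
    one_hot = np.eye(n_classes)[true_classes]
    # Per-timestep Brier score: sum over classes of the squared probability error.
    per_timestep = np.sum((pred_probs - one_hot) ** 2, axis=1)
    # Average over timesteps.
    return float(per_timestep.mean())

# Example: 3 timesteps, 4 discrete action classes.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
labels = np.array([0, 1, 3])
print(brier_mae(probs, labels))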

Normalized Brier MAE

Normalized version of Brier MAE: the average of Brier absolute errors that have been min-max normalized using the overall minimum and maximum Brier absolute errors across all timesteps.

Norm. Quantile Filtered Brier MAE

Normalized Brier MAE filtered by quantiles: Brier absolute errors are filtered to those within the 5th to 95th percentile, then min-max normalized using the quantile-filtered minimum and maximum errors.
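
A hedged sketch of how both normalization variants could be computed over a vector of per-timestep errors; the function names are illustrative and the 5th/95th percentile cutoffs mirror the description above.

import numpy as np

def normalized_errors(errors: np.ndarray) -> float:
    """Min-max normalize per-timestep errors, then average them."""
    lo, hi = errors.min(), errors.max()
    if hi == lo:
        return 0.0  # all errors identical; avoid division by zero
    return float(((errors - lo) / (hi - lo)).mean())

def quantile_filtered_normalized_errors(errors: np.ndarray,
                                        low_q: float = 0.05,
                                        high_q: float = 0.95) -> float:
    """Keep only errors within the [5th, 95th] percentile band, then normalize."""
    lo_cut, hi_cut = np.quantile(errors, [low_q, high_q])
    kept = errors[(errors >= lo_cut) & (errors <= hi_cut)]
    return normalized_errors(kept)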

Max Relative Brier MAE

Max relative error of Brier scores:

max(MAE) / median(MAE)

Quantifies how the worst-case error deviates from the typical (median) error for a given subdataset.
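
For completeness, this worst-case-to-median ratio can be sketched in a couple of lines (again illustrative, not the released toolkit code):

import numpy as np

def max_relative_error(errors: np.ndarray) -> float:
    """Ratio of the worst-case error to the typical (median) error."""
    return float(errors.max() / np.median(errors))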

% Invalid Actions

(Invalid predictions / Total predictions) × 100

Percentage of predictions falling outside the valid prediction space of a subdataset.
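
A small sketch of this check, assuming discrete action predictions and a known set of valid action indices (both are illustrative assumptions):

def percent_invalid_actions(predictions, valid_actions) -> float:
    """Share of predictions that fall outside the subdataset's valid action space."""
    invalid = sum(1 for p in predictions if p not in valid_actions)
    return 100.0 * invalid / len(predictions)

# Example: a discrete control task with 6 valid action classes (0-5).
print(percent_invalid_actions([0, 3, 7, 5, 11], set(range(6))))  # 40.0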

Micro Precision

True Positives / (True Positives + False Positives)

Precision calculated globally across all predictions.

Micro Recall

True Positives / (True Positives + False Negatives)

Recall calculated globally across all predictions.

Micro F1 Score

2 × (Micro Precision × Micro Recall) / (Micro Precision + Micro Recall)

Harmonic mean of Micro Precision and Micro Recall; a balanced measure of global performance.

Macro Precision

Average of class-wise precision scores across all classes. Treats all classes equally.

Macro Recall

Average of class-wise recall scores across all classes. Important for imbalanced datasets.

Macro F1 Score

Average of class-wise F1 scores across all classes. A balanced macro-average measure of performance.

Class-wise Precision

Precision calculated individually for each class. Helps understand bias towards specific classes.

Class-wise Recall

Recall calculated individually for each class. Shows per-class identification performance.

Class-wise F1 Score

F1 score calculated individually for each class. A balanced measure of per-class performance.
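
The micro, macro, and class-wise variants above can all be reproduced with scikit-learn's precision_recall_fscore_support by switching the average argument; a brief sketch with made-up labels:

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0, 2]  # ground-truth classes (toy example)
y_pred = [0, 2, 2, 2, 1, 0, 1]  # model predictions

# Micro: pool all predictions, then compute precision/recall/F1 globally.
micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0)

# Macro: compute per-class scores, then average them with equal class weight.
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

# Class-wise: average=None returns one score per class.
class_p, class_r, class_f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0)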

Action MSE

Mean Squared Error for continuous action prediction:

$\text{MSE} = \frac{1}{N}\sum_{t=1}^{N}\left\lVert a^{\text{pred}}_{t} - a^{\text{gt}}_{t}\right\rVert^{2}$

Measures the accuracy of continuous action predictions in robotics tasks. Lower values indicate better performance.

Action MAE

Mean Absolute Error for continuous action prediction:

$\text{MAE} = \frac{1}{N}\sum_{t=1}^{N}\left\lvert a^{\text{pred}}_{t} - a^{\text{gt}}_{t}\right\rvert$

Measures the average absolute deviation between predicted and ground-truth actions. More robust to outliers than MSE.
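
Both continuous-action error metrics above can be sketched with NumPy as follows; here the MAE is averaged over both timesteps and action dimensions, which is one reasonable reading of the definition and an assumption on our part.

import numpy as np

def action_mse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean over timesteps of the squared L2 distance between action vectors."""
    return float(np.mean(np.sum((pred - gt) ** 2, axis=-1)))

def action_mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute deviation across timesteps and action dimensions."""
    return float(np.mean(np.abs(pred - gt)))

# Example: N = 2 timesteps of 7-DoF continuous actions.
pred = np.zeros((2, 7))
gt = np.ones((2, 7)) * 0.5
print(action_mse(pred, gt), action_mae(pred, gt))  # 1.75 0.5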

Success Rate

Success Rate = (Successful Predictions / Total Predictions) × 100%

Percentage of predictions that exactly match the ground truth according to task-specific success criteria.
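
A generic sketch, with is_success standing in for whatever task-specific criterion applies (exact match by default; the callback is our illustrative assumption):

def success_rate(predictions, ground_truths, is_success=lambda p, g: p == g) -> float:
    """Percentage of predictions counted as successful under a given criterion."""
    successes = sum(1 for p, g in zip(predictions, ground_truths) if is_success(p, g))
    return 100.0 * successes / len(predictions)

# Exact-match example: 2 of 3 predictions match the ground truth.
print(success_rate(["pick", "place", "stay"], ["pick", "push", "stay"]))  # ~66.67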

Normalized MSE

Normalized version of Mean Squared Error: the average of MSE values that have been min-max normalized using the overall minimum and maximum MSE values across all timesteps.

Normalized MAE

Normalized version of Mean Absolute Error: the average of MAE values that have been min-max normalized using the overall minimum and maximum MAE values across all timesteps.

Norm. Quantile Filtered MAE

Normalized MAE filtered by quantiles: MAE values are filtered to those within the 5th to 95th percentile, then min-max normalized using the quantile-filtered minimum and maximum errors.

Max Relative MAE

max(MAE) / median(MAE)

Quantifies how the worst-case MAE deviates from the typical (median) MAE for a given subdataset.

Cosine Similarity

cos(θ) = (A · B) / (||A|| ||B||)

Measures the cosine of the angle between the predicted and ground-truth text answer embedding vectors. Values range from -1 to 1, with 1 indicating perfect alignment.
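
Assuming the predicted and ground-truth answers have already been embedded with some text encoder (the benchmark's choice of encoder is not restated here), the similarity itself is straightforward:

import numpy as np

def cosine_similarity(pred_embedding: np.ndarray, gt_embedding: np.ndarray) -> float:
    """Cosine of the angle between two text-answer embedding vectors."""
    dot = float(np.dot(pred_embedding, gt_embedding))
    norms = float(np.linalg.norm(pred_embedding) * np.linalg.norm(gt_embedding))
    return dot / norms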

Results

Our evaluation across diverse domains reveals significant insights into model performance and capabilities. Below we present detailed results from our evaluation suite.

Model Performance Comparison

Evaluation across 7 diverse tasks spanning robotics, digital control, and multimodal reasoning.

Task Groups:

Robotics Control
Digital Control
Spatial Reasoning
Image Classification
Tool Use

Metrics:

EM: Exact Match Rate (%)
F1: Macro F1 Score (%)
MAE: Mean Absolute Error

Visual Indicators:

🏆 Best score per task
Wins: Number of tasks where model scored highest

Notes:

* We did not profile GPT-5 on BFCLv3 with this release. See the Gorilla leaderboard for BFCLv4 results.

Model Output Comparison

Compare how different models respond to the same visual input from the ODinW selfdrivingCar dataset

Input

Sample image for model comparison

What object is shown in this image from the selfdrivingCar dataset?
Option 0: biker Option 1: car ...
Output the number (0-10) of the correct option only.

Pi0's prediction space collapse visualized

Pi0 experiences prediction space collapse on the Overcooked dataset, centered around the action 24, which maps to (Player 1: STAY, Player 2: NORTH)

Frequency of predicted action classes for Pi0 model on Overcooked dataset
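
A simple diagnostic for this kind of collapse (not the analysis code used for the figure) is to histogram the predicted discrete actions and flag cases where a single class dominates; the 50% dominance threshold below is an arbitrary illustrative choice.

from collections import Counter

def detect_prediction_collapse(predicted_actions, dominance_threshold=0.5):
    """Flag prediction-space collapse when one action class dominates the outputs."""
    counts = Counter(predicted_actions)
    mode_action, mode_count = counts.most_common(1)[0]
    mode_fraction = mode_count / len(predicted_actions)
    return {
        "mode_action": mode_action,
        "mode_fraction": mode_fraction,
        "collapsed": mode_fraction >= dominance_threshold,
    }

# Example: a model that outputs joint action 24 for most inputs.
print(detect_prediction_collapse([24, 24, 24, 7, 24, 24, 13, 24]))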

Key Findings and Analysis

1. Catastrophic Cross-Domain Failure: MultiNet v1.0 reveals catastrophic failure at domain boundaries for vision-language-action models. No current model achieves true cross-domain generalization: Pi0's performance drops to 0% on basic vision-language tasks, and GPT-5, while performing relatively better, still falls far short of the performance necessary for successful task completion.

2. Domain-Specific Fine-Tuning Corruption: Fine-tuning for robotics appears to systematically corrupt vision-language models. Pi0 exhibits repetitive "increa" token spam, suggesting that action-oriented training degrades linguistic capabilities through catastrophic forgetting of language-generation pathways.

3. Output Modality Misalignment: Magma, designed as a generalist model, produces spatial coordinates instead of text answers when prompted with language tasks. This reveals fundamental misalignment between input processing and output generation across different task domains.

4. Limited Impact of Prompt Engineering: Our prompt engineering experiments yielded only modest gains (~20% improvement), which cannot bridge fundamental architectural incompatibilities. This suggests that current model limitations are structural rather than interface-related.

5. Need for Architectural Innovation: These results demonstrate that current training paradigms create overspecialized models with incompatible domain-specific biases. This necessitates fundamental rethinking of modular architectures and progressive training strategies for truly unified multimodal systems.

Looking Forward

We are exploring several near-term experiments with collaborators, as well as larger-scale research directions that build on MultiNet's findings. If you're interested in contributing to the future of multimodal AI evaluation and development, we encourage you to get involved. Join our Discord community to connect with researchers and explore opportunities to contribute as part of the Manifold Research team.

Near Term Experiments

Building on MultiNet v1.0's findings of catastrophic failure at domain boundaries, our immediate research priorities focus on understanding and mitigating the fundamental limitations of current vision-language & action models. These investigations target the core mechanisms behind knowledge degradation, architectural incompatibilities, and the emergence of failure modes that prevent true cross-domain generalization.

Investigating the Gibberish Outputs of VLAs

Investigating SoM/ToM Outputs of Magma

Pi0.5 - SoTA Knowledge-Insulated VLA Performance on v1.0

Knowledge Insulation Testing and Experiments

Long-term Research Directions

Our long-term vision extends beyond addressing current limitations to fundamentally reimagining how we evaluate, understand, and build multimodal action models. These goals represent paradigm shifts toward more robust, adaptive, and truly general AI systems that can seamlessly operate across diverse domains while maintaining coherent reasoning capabilities.

Live Benchmarks

World Models as Evaluators

Building the Next Generation of Multimodal Action Models

Citation

@article{guruprasad2025multinet,
  author = {Pranav Guruprasad and Sudipta Chowdhury and Harshvardhan Sikka and Mridul Sharma and Helen Lu and Sean Rivera and Aryan Khurana and Yangyue Wang},
  title = {MultiNet v1.0: A Comprehensive Benchmark for Evaluating Multimodal Reasoning and Action Models Across Diverse Domains},
  journal = {Manifold Research Publications},
  year = {2025},
  note = {https://multinet.ai/static/pages/Multinetv1.html},
  doi = {10.5281/zenodo.17404313}
}