
MultiNet: Evaluating Multimodal Generalization in Next Generation Models

Multinet is an open science initiative with contributions from leading research teams across multiple institutions.


Interested in collaborating? Reach out to us via Email or join the working group on Discord.



Research and Software for evaluating Next Generation Model Generalization on Diverse Multimodal Tasks

Multinet is a comprehensive benchmarking initiative for evaluating multimodal models across vision, language, and action tasks. It provides:

  • A well-reasoned, curated set of evaluation datasets for assessing the multimodal understanding and action-taking capabilities of SoTA multimodal models
  • Varied evaluation tasks, including Common Sense Reasoning, Object Detection in the wild, Spatial Reasoning, Visual Question Answering, Robotics, and complex Multi-Agent game-playing
  • An open-source toolkit that standardizes how data from varied sources and formats is obtained and used (see the sketch below)
  • Open-source adaptations of diverse models to various out-of-distribution task domains
  • Comprehensive evaluation protocols for profiling SoTA VLMs, VLAs, and generalist models on benchmark datasets
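To make the intended workflow concrete, here is a minimal, hypothetical Python sketch: load a dataset through a toolkit-style call, run a model over it, and report an exact-match score. Every name below (Episode, load_dataset, EchoModel, evaluate) is illustrative, not the actual toolkit or eval harness API.

# Hypothetical sketch only; the real Multinet toolkit and harness APIs may differ.
from dataclasses import dataclass

@dataclass
class Episode:
    observation: dict   # e.g. {"image": ..., "instruction": ...}
    target: str         # ground-truth label or serialized action

def load_dataset(name: str) -> list[Episode]:
    # Stand-in for a toolkit call that downloads and normalizes the named dataset.
    return [Episode({"instruction": "pick up the red block"}, "PICK red_block"),
            Episode({"instruction": "open the drawer"}, "OPEN drawer")]

class EchoModel:
    # Toy model that always emits the same action, purely for illustration.
    def predict(self, obs: dict) -> str:
        return "PICK red_block"

def evaluate(model, episodes: list[Episode]) -> dict:
    # Exact match: fraction of predictions identical to the reference, as a percent.
    correct = sum(model.predict(ep.observation) == ep.target for ep in episodes)
    return {"exact_match_rate": 100.0 * correct / len(episodes)}

print(evaluate(EchoModel(), load_dataset("example_dataset")))  # {'exact_match_rate': 50.0}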

Multinet v1.0 Leaderboard

Evaluation across 7 diverse tasks spanning robotics, digital control, and multimodal reasoning.


Leaderboard Legend and Notes

To learn more about our preliminary analysis and methodology, check out our V1 analysis page. To have your model evaluated, please reach out to pranav@metarch.ai.

Task Groups:

Robotics Control
Digital Control
Spatial Reasoning
Image Classification
Tool Use

Metrics:

EM: Exact Match Rate (%)
F1: Macro F1 Score (%)
MAE: Mean Absolute Error (lower is better)
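These metrics follow their standard definitions. For reference, here is a minimal Python sketch of each; the harness implementation may differ in details such as normalization and tie handling.

# Standard metric definitions; illustrative, not the harness's exact code.
import numpy as np
from sklearn.metrics import f1_score

y_true = ["open", "pick", "place", "open"]
y_pred = ["open", "place", "place", "open"]

# EM: percent of predictions that exactly match the reference.
em = 100.0 * np.mean([p == t for p, t in zip(y_pred, y_true)])

# Macro F1: F1 computed per class, then averaged with equal weight per class.
f1 = 100.0 * f1_score(y_true, y_pred, average="macro")

# MAE: mean absolute error, for continuous outputs such as action values.
a_true = np.array([0.10, -0.25, 0.40])
a_pred = np.array([0.05, -0.20, 0.55])
mae = np.mean(np.abs(a_pred - a_true))

print(f"EM={em:.1f}%  Macro F1={f1:.1f}%  MAE={mae:.3f}")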

Visual Indicators:

🏆 Best score per task
Wins: Number of tasks where model scored highest

Notes:

* We did not profile GPT-5 on BFCLv3 in this release; see the Gorilla leaderboard for BFCLv4 results.

Explore our Research

Previous Benchmark Releases

Comprehensive benchmarks for evaluating multimodal models across various modalities and tasks

Our benchmarking efforts:

  • Multinet v1.0: Our most comprehensive benchmark yet, evaluating SoTA VLM, VLA, and generalist models on a wide variety of multimodal understanding and action datasets
  • Multinet v0.2: Profiling SoTA VLAs and VLMs in procedurally generated OOD game environments
  • Multinet v0.1: Evaluating SoTA VLMs and VLAs on real-world robotics tasks

Model Releases

Open-source generalist multimodal action models for research and development

Our model implementations:

  • μGato: A simple, open-source implementation of DeepMind's Gato
  • NEKO: A Gato-style model for image, text, RL, and robotics tasks (see the tokenization sketch below)
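Both models follow Gato's recipe of serializing every modality into a single token stream. One concrete piece of that recipe: the Gato paper tokenizes continuous values (e.g., joint torques) by mu-law companding into [-1, 1] and then binning uniformly. Below is a minimal sketch of that scheme with the paper's constants; μGato's and NEKO's actual internals may differ.

# Sketch of Gato-style continuous-value tokenization (mu-law + uniform bins).
# Constants follow the Gato paper (mu=100, M=256, 1024 bins); this is
# illustrative, not code from the uGato or NEKO repositories.
import numpy as np

MU, M, NUM_BINS = 100.0, 256.0, 1024

def mu_law_encode(x: np.ndarray) -> np.ndarray:
    # Compand values toward [-1, 1] so small magnitudes keep resolution.
    return np.sign(x) * np.log(np.abs(x) * MU + 1.0) / np.log(M * MU + 1.0)

def tokenize(x: np.ndarray) -> np.ndarray:
    # Map continuous values to integer token ids in [0, NUM_BINS).
    y = np.clip(mu_law_encode(x), -1.0, 1.0)
    return np.minimum((NUM_BINS * (y + 1.0) / 2.0).astype(int), NUM_BINS - 1)

print(tokenize(np.array([0.0, 0.5, -3.0, 250.0])))  # e.g. [ 512  710  224 1022]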

Software Releases

Open-source tools and frameworks for building and evaluating multimodal models of the future

Our software contributions:

  • Eval Harness: Systematic evaluation framework for multimodal models
  • Toolkit: Data curation SDK for evaluation datasets
  • GenESIS: Framework for mapping language models to actions (see the sketch below)
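To illustrate the kind of language-to-action mapping GenESIS targets, here is a hypothetical sketch that parses a model's free-text output into a structured action. All names are invented for illustration; this is not the GenESIS API.

# Hypothetical sketch of mapping free-text model output to a structured action;
# GenESIS's real interface is not shown here, and these names are illustrative.
import re
from dataclasses import dataclass

@dataclass
class Action:
    verb: str
    target: str

def parse_action(model_output: str, allowed_verbs: set[str]) -> Action | None:
    # Extract the first verb(target)-style command; None if nothing valid is found.
    match = re.search(r"(\w+)\((\w+)\)", model_output)
    if match and match.group(1).lower() in allowed_verbs:
        return Action(match.group(1).lower(), match.group(2))
    return None  # caller can fall back to a no-op or re-prompt the model

out = "I should first pick(block) and then place it on the table."
print(parse_action(out, {"pick", "place", "open"}))  # Action(verb='pick', target='block')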

News

🌟

Multinet v1.0 released! We release our most comprehensive benchmark yet, evaluating SoTA VLM, VLA, and generalist models on a wide variety of multimodal understanding and action datasets. Read more here.

🏆

Multinet v0.2 released! We systematically profile how state-of-the-art VLAs and VLMs perform in procedurally generated OOD game environments. Read more about our recent release here.

🏅

Paper accepted at ICML 2025! Our paper detailing the Open-Source contributions of Multinet that benefit the AI community has been accepted at the CodeML Workshop at ICML 2025! Read our paper here.

🎉

Multinet v0.1 released! How well do state-of-the-art VLMs and VLAs perform on real-world robotics tasks? Read more on our release page.

🚀

Introducing Multinet! A comprehensive benchmark for evaluating multimodal generalization in next generation models. Learn more here.

Research Talks & Demos

Explore our collection of presentations showcasing Multinet's vision, progress, and development journey!

Citation

@article{guruprasad2025multinet,
  author = {Pranav Guruprasad and Sudipta Chowdhury and Harshvardhan Sikka and Mridul Sharma and Helen Lu and Sean Rivera and Aryan Khurana and Yangyue Wang},
  title = {MultiNet v1.0: A Comprehensive Benchmark for Evaluating Multimodal Reasoning and Action Models Across Diverse Domains},
  journal = {Manifold Research Publications},
  year = {2025},
  note = {https://multinet.ai/static/pages/Multinetv1.html},
  doi = {10.5281/zenodo.17404313}
}