
MultiNet: Evaluating Multimodal Generalization in Next Generation Models

Multinet is an open science initiative with contributions from leading research teams across multiple institutions.


Interested in collaborating? Reach out to us via Email or join the working group on Discord.



Research and Software for evaluating Next Generation Model Generalization on Diverse Multimodal Tasks

Multinet is a comprehensive benchmarking initiative for evaluating multimodal models across vision, language, and action tasks. It provides:

  • A well-reasoned, curated set of evaluation datasets for assessing the multimodal understanding and action-taking capabilities of SoTA multimodal models
  • Varied evaluation tasks, including Common Sense Reasoning, Object Detection in the wild, Spatial Reasoning, Visual Question Answering, Robotics, and complex Multi-Agent game-playing
  • An open-source toolkit that standardizes how data from varied sources and formats is obtained and used (see the sketch below)
  • Open-source adaptations of diverse models to various out-of-distribution task domains
  • Comprehensive evaluation protocols for profiling SoTA VLMs, VLAs, and generalist models on benchmark datasets
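To make the intended workflow concrete, here is a minimal, hypothetical Python sketch: load a dataset through a toolkit-style call, run a model over it, and report an exact-match score. Every name below (Episode, load_dataset, EchoModel, evaluate) is illustrative, not the actual toolkit or eval harness API.

# Hypothetical sketch only; the real Multinet toolkit and harness APIs may differ.
from dataclasses import dataclass

@dataclass
class Episode:
    observation: dict   # e.g. {"image": ..., "instruction": ...}
    target: str         # ground-truth label or serialized action

def load_dataset(name: str) -> list[Episode]:
    # Stand-in for a toolkit call that downloads and normalizes the named dataset.
    return [Episode({"instruction": "pick up the red block"}, "PICK red_block"),
            Episode({"instruction": "open the drawer"}, "OPEN drawer")]

class EchoModel:
    # Toy model that always emits the same action, purely for illustration.
    def predict(self, obs: dict) -> str:
        return "PICK red_block"

def evaluate(model, episodes: list[Episode]) -> dict:
    # Exact match: fraction of predictions identical to the reference, as a percent.
    correct = sum(model.predict(ep.observation) == ep.target for ep in episodes)
    return {"exact_match_rate": 100.0 * correct / len(episodes)}

print(evaluate(EchoModel(), load_dataset("example_dataset")))  # {'exact_match_rate': 50.0}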

Multinet v1.0 Leaderboard

Evaluation across 7 diverse tasks spanning robotics, digital control, and multimodal reasoning.


Leaderboard Legend and Notes

To learn more about our preliminary analysis and methodology, check out our V1 analysis page. To have your model evaluated, please reach out to pranav@metarch.ai.

Task Groups:

Robotics Control
Digital Control
Spatial Reasoning
Image Classification
Tool Use

Metrics:

EM: Exact Match Rate (%)
F1: Macro F1 Score (%)
MAE: Mean Absolute Error (lower is better)
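These metrics follow their standard definitions. For reference, here is a minimal Python sketch of each; the harness implementation may differ in details such as normalization and tie handling.

# Standard metric definitions; illustrative, not the harness's exact code.
import numpy as np
from sklearn.metrics import f1_score

y_true = ["open", "pick", "place", "open"]
y_pred = ["open", "place", "place", "open"]

# EM: percent of predictions that exactly match the reference.
em = 100.0 * np.mean([p == t for p, t in zip(y_pred, y_true)])

# Macro F1: F1 computed per class, then averaged with equal weight per class.
f1 = 100.0 * f1_score(y_true, y_pred, average="macro")

# MAE: mean absolute error, for continuous outputs such as action values.
a_true = np.array([0.10, -0.25, 0.40])
a_pred = np.array([0.05, -0.20, 0.55])
mae = np.mean(np.abs(a_pred - a_true))

print(f"EM={em:.1f}%  Macro F1={f1:.1f}%  MAE={mae:.3f}")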

Visual Indicators:

🏆 Best score per task
Wins: Number of tasks where model scored highest

Notes:

* We did not profile GPT-5 on BFCLv3 in this release; see the Gorilla leaderboard for BFCLv4 results.

Explore our Research

Previous Benchmark Releases

Comprehensive benchmarks for evaluating multimodal models across various modalities and tasks

Our benchmarking efforts:

  • Multinet v1.0: Our most comprehensive benchmark yet, evaluating SoTA VLM, VLA, and generalist models on a wide variety of multimodal understanding and action datasets
  • Multinet v0.2: Profiling SoTA VLAs and VLMs in procedurally generated OOD game environments
  • Multinet v0.1: Evaluating SoTA VLMs and VLAs on real-world robotics tasks

Model Releases

Open-source generalist multimodal action models for research and development

Our model implementations:

  • μGato: A simple, open-source implementation of DeepMind's Gato
  • NEKO: A Gato-style model for image, text, RL, and robotics tasks (see the tokenization sketch below)
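Both models follow Gato's recipe of serializing every modality into a single token stream. One concrete piece of that recipe: the Gato paper tokenizes continuous values (e.g., joint torques) by mu-law companding into [-1, 1] and then binning uniformly. Below is a minimal sketch of that scheme with the paper's constants; μGato's and NEKO's actual internals may differ.

# Sketch of Gato-style continuous-value tokenization (mu-law + uniform bins).
# Constants follow the Gato paper (mu=100, M=256, 1024 bins); this is
# illustrative, not code from the uGato or NEKO repositories.
import numpy as np

MU, M, NUM_BINS = 100.0, 256.0, 1024

def mu_law_encode(x: np.ndarray) -> np.ndarray:
    # Compand values toward [-1, 1] so small magnitudes keep resolution.
    return np.sign(x) * np.log(np.abs(x) * MU + 1.0) / np.log(M * MU + 1.0)

def tokenize(x: np.ndarray) -> np.ndarray:
    # Map continuous values to integer token ids in [0, NUM_BINS).
    y = np.clip(mu_law_encode(x), -1.0, 1.0)
    return np.minimum((NUM_BINS * (y + 1.0) / 2.0).astype(int), NUM_BINS - 1)

print(tokenize(np.array([0.0, 0.5, -3.0, 250.0])))  # e.g. [ 512  710  224 1022]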

Software Releases

Open-source tools and frameworks for building and evaluating multimodal models of the future

Our software contributions:

  • Eval Harness: Systematic evaluation framework for multimodal models
  • Toolkit: Data curation SDK for evaluation datasets
  • GenESIS: Framework for mapping language models to actions (see the sketch below)
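To illustrate the kind of language-to-action mapping GenESIS targets, here is a hypothetical sketch that parses a model's free-text output into a structured action. All names are invented for illustration; this is not the GenESIS API.

# Hypothetical sketch of mapping free-text model output to a structured action;
# GenESIS's real interface is not shown here, and these names are illustrative.
import re
from dataclasses import dataclass

@dataclass
class Action:
    verb: str
    target: str

def parse_action(model_output: str, allowed_verbs: set[str]) -> Action | None:
    # Extract the first verb(target)-style command; None if nothing valid is found.
    match = re.search(r"(\w+)\((\w+)\)", model_output)
    if match and match.group(1).lower() in allowed_verbs:
        return Action(match.group(1).lower(), match.group(2))
    return None  # caller can fall back to a no-op or re-prompt the model

out = "I should first pick(block) and then place it on the table."
print(parse_action(out, {"pick", "place", "open"}))  # Action(verb='pick', target='block')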

News

🌟

Multinet v1.0 released! We release our most comprehensive benchmark yet, evaluating SoTA VLM, VLA, and generalist models on a wide variety of multimodal understanding and action datasets. Read more here.

🏆

Multinet v0.2 released! We systematically profile how state-of-the-art VLAs and VLMs perform in procedurally generated OOD game environments. Read more about our recent release here.

🏅

Paper accepted at ICML 2025! Our paper detailing the Open-Source contributions of Multinet that benefit the AI community has been accepted at the CodeML Workshop at ICML 2025! Read our paper here.

🎉

Multinet v0.1 released! How well do state-of-the-art VLMs and VLAs perform on real-world robotics tasks? Read more on our release page.

🚀

Introducing Multinet! A comprehensive benchmark for evaluating multimodal generalization in next generation models. Learn more here.

Research Talks & Demos

Explore our collection of presentations showcasing Multinet's vision, progress, and development journey!

Citation

@article{guruprasad2025multinet,
  author = {Pranav Guruprasad and Sudipta Chowdhury and Harshvardhan Sikka and Mridul Sharma and Helen Lu and Sean Rivera and Aryan Khurana and Yangyue Wang},
  title = {MultiNet v1.0: A Comprehensive Benchmark for Evaluating Multimodal Reasoning and Action Models Across Diverse Domains},
  journal = {Manifold Research Publications},
  year = {2025},
  note = {https://multinet.ai/static/pages/Multinetv1.html},
  doi = {10.5281/zenodo.17404313}
}