MultiNet: A Generalist Benchmark for Vision-Language & Action Models

This is a collaborative effort between Manifold Research and Metarch.ai. This work is sponsored by Metarch.ai.
Interested in using Action Models in Production? Reach out here

Introduction

Introducing MultiNet: a comprehensive benchmark we are building to evaluate and advance the development of the emerging category of Multimodal Action Models. Key contributions of MultiNet include:

  • The largest open-source generalist dataset, consolidating diverse modalities and tasks suitable for pre-training, fine-tuning, and evaluation of Multimodal Action Models
  • Detailed analysis of the constituent datasets’ validity and utility for generalist objectives
  • Open-source software infrastructure for downloading, managing, and utilizing the benchmark data
  • Open-source software toolkit for translating control (RL and Robotics) datasets of various formats and sources into a unified TensorFlow Datasets format, making them immediately usable for training, fine-tuning, and evaluation (see the loading sketch after this list)
  • A general framework for mapping VLMs to other modality classes, with particular emphasis on action spaces
  • Open-source evaluation frameworks to profile SoTA VLMs and VLAs on datasets within MultiNet
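
As a concrete illustration of the unified format, here is a minimal sketch of how a converted control dataset might be loaded and iterated. It assumes the data has already been written out as an RLDS-style TensorFlow Datasets directory; the path and feature names below are illustrative placeholders, not MultiNet's actual layout or API.

```python
# Minimal loading sketch; the directory path and feature names are placeholders.
import tensorflow_datasets as tfds

# Point TFDS at a locally converted dataset directory (hypothetical path).
builder = tfds.builder_from_directory("/data/multinet/example_control_dataset/1.0.0")
dataset = builder.as_dataset(split="train")

for episode in dataset.take(1):
    # In RLDS-style datasets, each episode holds a nested dataset of steps.
    for step in episode["steps"]:
        observation = step["observation"]  # e.g., camera images, proprioception
        action = step["action"]            # e.g., end-effector deltas
        reward = step["reward"]
        # ...feed (observation, action, reward) into training or evaluation code
```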

In our initial release, we present results from evaluating GPT-4o (a SoTA VLM), OpenVLA (a SoTA VLA), and Hugging Face's JAT (a novel generalist model) on 20 diverse and challenging OpenX-Embodiment datasets. We aim to iterate quickly and expand our evaluation results to more datasets within MultiNet.

Interested in contributing/collaborating on this effort? Join the conversation on Discord

News

🎉 MultiNet v0.1! How well do SoTA VLMs and VLAs do on real-world robotics tasks? Read more on our Release page.

Motivation

DeepMind's Gato represented a significant milestone as the first truly multi-modal, multi-task, multi-embodiment agent. Operating with a single neural network and fixed weights, Gato demonstrated capabilities spanning Atari gameplay, image captioning, conversational interaction, and real-world robotic manipulation. However, both the model and the majority of its training datasets remain closed-source, limiting its impact on the broader research community.

The emerging class of Vision-Language-Action (VLA) models, such as OpenVLA and Octo, displays impressive capabilities in taking vision-language-grounded action on control tasks. However, because these models are trained on a narrow set of robotics tasks, their performance on vision-language understanding and generation tasks is not well understood.

Building the next generation of "generalist" models requires training on diverse datasets that span multiple modalities and task types. Such models must excel not only at individual modalities (vision, language, or control/action) but also at tasks that require seamless integration across modalities, a requirement that better reflects real-world scenarios. Currently, no large-scale, open-source dataset is specifically designed for training and evaluating such generalist models. This gap motivated the development of MultiNet.

Dataset Coverage and Analysis

MultiNet offers a rich collection of diverse datasets for developing comprehensive multimodal capabilities across multiple domains. It can be used to train models with sophisticated vision-language association knowledge, high-quality language understanding and generation abilities, and competence in reward-based action trajectories across a wide range of environments and tasks. This consolidation effort not only provides pre-training-scale data for a generalist objective, but also serves as a valuable benchmark for evaluating the capabilities of current SoTA VLMs and VLAs, illuminating paths toward more versatile and capable generalist AI systems.

Distribution of datasets across modalities in MultiNet: Control (58%), Vision-Language (29%), and Language (13%). Control represents the largest portion due to the extensive OpenX-Embodiment collection, followed by Vision-Language and Language datasets.
| Dataset | Task/Data type | Modality |
| --- | --- | --- |
| OBELICS | Interleaved Image-Text | Vision-Language |
| COYO-700M | Image-Text pairs | Vision-Language |
| MS COCO | Object detection, segmentation, key-point detection, captioning | Vision-Language |
| Conceptual Captions | Image Captioning | Vision-Language |
| A-OKVQA | Visual Question Answering | Vision-Language |
| VQA-v2 | Visual Question Answering | Vision-Language |
| Datacomp-1B | Image-Text pairs | Vision-Language |
| Fineweb-edu | High-quality text corpus | Language |

Categorization of Vision-Language and Language datasets that can be used for training and fine-tuning, and in some cases for evaluation

| Dataset | Task/Data type | Modality |
| --- | --- | --- |
| Flickr30k | Image Captioning | Vision-Language |
| TextVQA | Visual Question Answering | Vision-Language |
| VizWiz | Visual Question Answering | Vision-Language |
| WinoGAViL | Vision-based Commonsense Reasoning | Vision-Language |
| ImageNet-R | Image-Text Pairs | Vision-Language |
| ObjectNet | Image-Text Pairs | Vision-Language |
| Hellaswag | Commonsense Reasoning | Language |
| ARC | Complex Reasoning and Knowledge Application | Language |
| CommonsenseQA | Commonsense Reasoning | Language |
| MMLU | Knowledge-intensive Question Answering | Language |

Categorization of Vision-Language and Language datasets that are purely for evaluation

| Dataset | Task/Data type | Control Category |
| --- | --- | --- |
| DM Lab | Teaching RL agents 3D vision | Navigation-based control |
| DM Control Suite | Physics-based simulation environments | Locomotion-based control |
| ALE Atari | Atari games | Game-based control |
| Baby AI | Language-grounded navigation | Navigation-based control |
| MuJoCo | Multi-joint dynamics | Locomotion-based control |
| Meta-World | Meta-RL and multi-task learning | Manipulation-based control |
| V-D4RL | Pixel-based analogues of DM Control Suite | Locomotion-based control |
| Procgen | Procedurally generated Atari-like environments | Game-based control |
| Open X-Embodiment | Real-world robotics tasks | Manipulation-based and locomotion-based control |
| LocoMuJoCo | Imitation learning for locomotion | Locomotion-based control |

Categorization of control datasets

Benchmark

While existing benchmarks excel at evaluating specific capabilities and modalities, there remains a notable gap in holistic evaluation frameworks that can assess both the action capabilities of Vision-Language Models (VLMs) and the multimodal understanding of Vision-Language-Action models (VLAs). MultiNet addresses this gap by providing a comprehensive benchmark that spans vision-language, language, RL, and robotics tasks. Our work consolidates diverse, high-quality datasets and establishes standardized evaluation metrics to enable systematic comparison of state-of-the-art models. For a detailed list of the datasets and metrics included in MultiNet, please refer to our dataset specification paper.

| Metric | Evaluation Category |
| --- | --- |
| CIDEr | Image Captioning, Image-based Text retrieval |
| VQA Accuracy | Visual Question Answering |
| Recall@K | Image understanding, Text-based image retrieval |
| Accuracy | VQA, Commonsense reasoning, Text understanding |
| Mean Squared Error | RL, Robotics |

Metrics in the MultiNet benchmark and the categories of tasks each one evaluates
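
To make the metric definitions concrete, here is a minimal sketch of two of the metrics listed above: a simplified exact-match variant of VQA accuracy and the mean squared error used for action prediction. These are illustrative implementations only, not MultiNet's evaluation code (the standard VQA accuracy metric, for instance, averages agreement over multiple human annotators).

```python
# Simplified metric sketches; not the exact implementations used by MultiNet.
import numpy as np

def simple_vqa_accuracy(predictions, ground_truths):
    """Fraction of predicted answers that exactly match after normalization."""
    matches = [
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truths)
    ]
    return float(np.mean(matches))

def action_mse(predicted_actions, true_actions):
    """Mean squared error between predicted and ground-truth action vectors."""
    predicted_actions = np.asarray(predicted_actions, dtype=np.float32)
    true_actions = np.asarray(true_actions, dtype=np.float32)
    return float(np.mean((predicted_actions - true_actions) ** 2))

# Toy usage:
print(simple_vqa_accuracy(["a dog", "red"], ["a dog", "blue"]))  # 0.5
print(action_mse([[0.1, 0.2, 0.0]], [[0.0, 0.2, 0.1]]))          # ~0.0067
```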

Importance of MultiNet

MultiNet represents a significant step toward advancing generalist AI systems, establishing a comprehensive benchmark for evaluating truly generalist models that can operate across multiple modalities, tasks, and environments. Our initial findings reveal a significant capability gap in current state-of-the-art models: today's VLMs, VLAs, and generalist models struggle to maintain consistent performance across a diverse set of real-world robotics tasks they have not been exposed to before.

Current Vision-Language-Action models typically excel at vision-language-grounded actions but may underperform on pure vision-language or language tasks. MultiNet provides pre-training-scale data across all these modalities, enabling the development of models that achieve state-of-the-art performance across all constituent tasks, not just their primary domain.

A significant contribution of this work is our open-source toolkit for standardizing robotics and reinforcement learning data. Many existing datasets suffer from outdated formats, poor maintenance, and accessibility issues. MultiNet's toolkit provides stable access methods, conversion to a unified format, and ease of use for training, fine-tuning, and evaluation.
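
As a rough illustration of what such standardization involves, the sketch below maps a single step from a hypothetical source-specific episode format into one common step schema, which could then be serialized as a TensorFlow Datasets / RLDS-style dataset. The field names and schema here are assumptions for illustration, not the toolkit's actual code or format.

```python
# Illustrative normalization sketch; field names and schema are hypothetical.
import numpy as np

COMMON_STEP_KEYS = ("observation", "action", "reward", "is_terminal")

def to_common_step(raw_step: dict) -> dict:
    """Map one raw step from an arbitrary source format to a common schema."""
    return {
        "observation": {
            # Different sources name their camera streams differently.
            "image": np.asarray(
                raw_step.get("rgb", raw_step.get("image")), dtype=np.uint8
            ),
        },
        "action": np.asarray(raw_step["action"], dtype=np.float32),
        "reward": np.float32(raw_step.get("reward", 0.0)),
        "is_terminal": bool(raw_step.get("done", False)),
    }

# Toy usage: a source-specific step becomes a uniform step.
raw = {"rgb": np.zeros((64, 64, 3)), "action": [0.1, -0.2], "done": False}
step = to_common_step(raw)
assert set(step) == set(COMMON_STEP_KEYS)
```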

What's Next?

The release of MultiNet marks an important first step toward a new paradigm of foundation models, but significant opportunities remain for expansion and improvement.

  • While current VLAs show promising results on control tasks, we plan to systematically evaluate their performance on pure vision-language and language tasks to assess whether fine-tuning or co-fine-tuning on control tasks compromises their capabilities in individual modalities.
  • We also aim to broaden our evaluation scope beyond the OpenX-Embodiment dataset. By incorporating the diverse control tasks described above, we can better understand how VLAs and generalist models perform on completely out-of-distribution data.
  • Our profiling efforts in this first release focus on zero-shot performance; future work will explore few-shot learning and fine-tuning scenarios.
  • We are especially interested in fine-tuning and transferring VLAs to novel domains. We are exploring how these models might be adapted to software environments, potentially enabling more capable digital agents by leveraging insights from embodied learning.
  • Finally, we envision transforming Multinet from its current offline form into an online benchmark. This evolution may include the development of simulation environments for both 2D and 3D control tasks, enabling more dynamic and interactive evaluation of model capabilities.

BibTeX

@misc{guruprasad2024benchmarking,
  author = {Guruprasad, Pranav and Sikka, Harshvardhan and Song, Jaewoo and Wang, Yangyue and Liang, Paul},
  title  = {Benchmarking Vision, Language, \& Action Models on Robotic Learning Tasks},
  doi    = {10.20944/preprints202411.0494.v1},
  year   = {2024},
}