Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments


*Indicates Equal Contribution

Interested in evaluating your model on generalist multimodal datasets? Reach out here

MultiNet v0.2 extends the MultiNet benchmark to effectively evaluate SoTA VLAs and VLMs on multi-step action trajectories in procedurally generated, open-ended game environments.

Abstract

Vision-language-action (VLA) models represent an important step toward general-purpose robotic systems by integrating visual perception, language understanding, and action execution. However, systematic evaluation of these models, particularly their zero-shot generalization capabilities in procedurally out-of-distribution (OOD) environments, remains limited. In this paper, we introduce MultiNet v0.2, a comprehensive benchmark designed to evaluate and analyze the generalization performance of state-of-the-art VLMs and VLAs—including GPT-4o, GPT-4.1, OpenVLA, Pi0 Base, and Pi0 FAST—on diverse procedural tasks from the Procgen benchmark. Our analysis reveals several critical insights:

  • All evaluated models exhibit significant limitations in zero-shot generalization to OOD tasks, with performance heavily influenced by factors such as action representation and task complexity.
  • VLAs generally outperform other models, owing to their robust architectural design.
  • VLM variants demonstrate substantial improvements when constrained appropriately, highlighting the sensitivity of model performance to precise prompt engineering.
We release our benchmark, evaluation framework, and findings to enable the assessment of future VLA models and to identify critical areas for improvement in their application to out-of-distribution digital tasks.

Dataset coverage

Below we present an overview of the Procgen dataset used in MultiNet v0.2. Each sub-dataset within Procgen exhibits diverse environment layouts, tasks, objectives, reward structures, and discrete action spaces, predominantly involving directional movements and specific game-related interactions.

| Environment | Task Description | Action Space |
|---|---|---|
| Bigfish | Eat smaller fish to grow while avoiding larger fish | 8 directions + no-op |
| Bossfight | Dodge projectiles and damage the boss when shields are down | 8 directions + no-op + fire bullet |
| Caveflyer | Navigate caves, destroy targets, and avoid obstacles to reach exit | 8 directions + no-op + fire bullet |
| Chaser | Collect orbs and stars to make enemies vulnerable for consumption | 8 directions + no-op |
| Climber | Climb platforms and collect stars while avoiding flying monsters | 8 directions + no-op |
| Coinrun | Reach the coin while avoiding saws, enemies, and chasms | 8 directions + no-op |
| Dodgeball | Hit enemies with balls and reach the unlocked platform | 8 directions + no-op + fire ball |
| Fruitbot | Collect fruit, avoid non-fruit, and use keys to unlock gates | 8 directions + no-op + fire key |
| Heist | Collect colored keys to unlock doors and steal the gem | 8 directions + no-op |
| Jumper | Use double jump to navigate platforms and find the carrot | 8 directions + no-op |
| Leaper | Cross lanes by avoiding cars and hopping on logs | 8 directions + no-op |
| Maze | Navigate the maze to find and collect the cheese | 8 directions + no-op |
| Miner | Dig through dirt, collect diamonds, and avoid falling objects | 8 directions + no-op |
| Ninja | Jump across ledges and use throwing stars to clear bombs | 8 directions + no-op + 4 throwing star directions |
| Plunder | Fire cannonballs at enemy ships while avoiding friendly ships | 8 directions + no-op + fire cannonball |
| Starpilot | Dodge projectiles and defeat enemies while navigating obstacles | 8 directions + no-op + 2 fire bullet directions |
Action space dimensions include: movement actions (8 directions + no-op) and special actions (varies by environment). To understand more about these environments and their action spaces, please refer to the Procgen benchmark.
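The shared movement actions plus per-environment specials can be sketched as a discrete action table. This is a minimal illustration in our own notation: the action names and indices below are illustrative, not Procgen's exact integer encoding.

```python
# Illustrative sketch of a Procgen-style discrete action space:
# every environment shares 8 movement directions plus a no-op,
# and some environments add special actions.
# NOTE: names and indices are illustrative, not Procgen's encoding.

MOVEMENT = [
    "left-down", "left", "left-up", "down", "no-op",
    "up", "right-down", "right", "right-up",
]

SPECIALS = {
    "bigfish": [],
    "bossfight": ["fire bullet"],
    "ninja": ["throw star up", "throw star up-right",
              "throw star right", "throw star down-right"],
}

def action_table(env_name: str) -> dict[int, str]:
    """Map discrete action indices to human-readable action names."""
    actions = MOVEMENT + SPECIALS.get(env_name, [])
    return {i: name for i, name in enumerate(actions)}

print(len(action_table("bigfish")))   # 9 actions: 8 directions + no-op
print(len(action_table("ninja")))     # 13 actions: movement + 4 star throws
```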

All five models (OpenVLA, Pi0 Base, Pi0 FAST, GPT-4o, and GPT-4.1) were evaluated on offline trajectories from expert reinforcement learning (RL) agents trained on the Procgen dataset, sourced from Meta AI's publicly available repository. The test split for each of the 16 Procgen datasets consisted of a random selection of 10% of the total episodes in that dataset. Procgen environments are procedurally generated, Atari-like 2D games designed to evaluate the visual and motor skills of RL agents.
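The episode-level 10% test split described above can be sketched as follows. This is a minimal sketch in our own code; the function and variable names are ours, not the benchmark's actual implementation.

```python
import random

def make_test_split(episode_ids: list[int], fraction: float = 0.10,
                    seed: int = 0) -> list[int]:
    """Randomly select a fraction of episodes as the held-out test split.

    A fixed seed keeps the split reproducible across runs.
    """
    rng = random.Random(seed)
    n_test = max(1, int(len(episode_ids) * fraction))
    return sorted(rng.sample(episode_ids, n_test))

episodes = list(range(500))          # e.g. 500 expert-RL episodes
test_ids = make_test_split(episodes)
print(len(test_ids))                 # 50 -> 10% of the episodes
```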

Metrics and Benchmark

To evaluate each model on a per-timestep basis against offline trajectories of expert RL agents in these discrete-action environments, we use a variety of metrics chosen to best capture model performance and to assess the models as fairly as possible.

To visualize the effects of model architecture, training methods, and output processing techniques on how concentrated or diffuse the models' predicted actions are, we use confusion matrices depicting the per-class frequency of predictions and the frequency of correct predictions. Additionally, to observe the correlation between model performance and image complexity, we use correlation matrices between the Shannon entropy and delentropy values of the images and the models' macro recall values. To learn more about the metrics, please refer to the paper.
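For concreteness, two of the quantities above, macro recall over predicted actions and the Shannon entropy of an image, can be computed as follows. This is a minimal pure-Python sketch using the standard formulas, not the paper's exact implementation.

```python
import math
from collections import Counter

def macro_recall(y_true: list[int], y_pred: list[int]) -> float:
    """Average per-class recall: each action class weighs equally,
    regardless of how often it appears in the trajectory."""
    classes = set(y_true)
    recalls = []
    for c in classes:
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        total = sum(1 for t in y_true if t == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

def shannon_entropy(pixels: list[int]) -> float:
    """Shannon entropy (bits) of a flattened image's intensity histogram."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Class 0 recall = 1/2, class 1 recall = 2/2 -> macro recall 0.75
print(macro_recall([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.75
# Two equally likely intensities -> 1 bit of entropy
print(shannon_entropy([0, 0, 255, 255]))          # 1.0
```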

Results and Analysis

We find significant limitations in model performance arising from architectural constraints, training paradigms, and input-output biases inherent to the models. The stark domain discrepancy between training data—primarily continuous-action robotics datasets, and general web-scale vision language data—and discrete-action game environments emerged as a critical barrier to effective zero-shot generalization. We also identified notable differences in model behaviors linked directly to their architectures, training strategies, and input/output processing techniques.

Overall Performance

[Figure: Evaluation metrics across models and Procgen datasets]

[Figure: Frequency of predicted action classes, per model and Procgen dataset]

[Figure: Correlation matrix between image entropies and model performance]
OpenVLA's robust action-space clamping technique consistently provided superior generalization, minimizing invalid outputs and exhibiting relative resilience to out-of-distribution scenarios. Conversely, autoregressive models such as GPT-4o and GPT-4.1 displayed substantial difficulty generalizing, especially under complex image conditions, and frequently defaulted to overly simplistic or biased action choices. The Pi0 models showed intermediate performance heavily influenced by their decoding methods, diffusion-based for Pi0 Base and autoregressive for Pi0 FAST, with Pi0 FAST notably sensitive to image complexity and unable to restrict the majority of its predictions to the desired output range. For a more thorough look at the results and our analysis, please refer to the paper.
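The action-space clamping that benefits OpenVLA can be sketched roughly as follows. This is our illustrative reconstruction of the general technique, not OpenVLA's actual code: a raw model output is snapped to the nearest valid discrete action index so that every prediction is executable.

```python
def clamp_action(raw_prediction: float, num_actions: int) -> int:
    """Snap a raw (possibly out-of-range) model output to the nearest
    valid discrete action index in [0, num_actions - 1]."""
    index = round(raw_prediction)
    return min(max(index, 0), num_actions - 1)

# Out-of-range outputs are mapped back into the valid action set
# instead of being counted as invalid predictions.
print(clamp_action(-2.3, 9))   # 0
print(clamp_action(4.6, 9))    # 5
print(clamp_action(17.0, 9))   # 8
```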

Citation

@misc{guruprasad2025benchmarkingvisionlanguage,
  title={Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments},
  author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Harshvardhan Sikka},
  year={2025},
  eprint={2505.05540},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.05540},
}