Results and Analysis
We find significant limitations in model performance arising from architectural constraints, training paradigms, and input-output biases inherent to the models. The stark domain gap between the training data (primarily continuous-action robotics datasets and general web-scale vision-language data) and discrete-action game environments emerges as a critical barrier to effective zero-shot generalization. We also identify notable differences in model behavior linked directly to architecture, training strategy, and input/output processing.
Overall Performance
[Figure: Frequency of predicted action classes, by model and Procgen dataset]
[Figure: Correlation matrix between image entropies and model performance]
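To make the entropy analysis concrete, the sketch below shows one way such a correlation can be computed: take the Shannon entropy of each observation's pixel-intensity histogram and correlate it with a per-episode performance score. The `image_entropy` helper, the random `frames` and `scores`, and the histogram settings are hypothetical placeholders for illustration, not the exact procedure used in the paper.

```python
import numpy as np
from scipy.stats import pearsonr

def image_entropy(img: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (in bits) of an image's pixel-intensity histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so log2 is defined
    return float(-np.sum(p * np.log2(p)))

# Hypothetical data: one grayscale observation per episode, plus a
# per-episode performance score (e.g., agreement with expert actions).
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, size=(64, 64), dtype=np.uint8) for _ in range(100)]
scores = rng.random(100)

entropies = np.array([image_entropy(f) for f in frames])
r, p_value = pearsonr(entropies, scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```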
OpenVLA's robust action-space clamping consistently provided the best generalization, minimizing invalid outputs and showing relative resilience to out-of-distribution scenarios. Conversely, autoregressive models such as GPT-4x had substantial difficulty generalizing, especially on visually complex inputs, and frequently defaulted to overly simplistic or biased action choices. Pi0 models showed intermediate performance, shaped heavily by their decoding method, diffusion-based for Pi0 Base and autoregressive for Pi0 FAST; Pi0 FAST was notably sensitive to image complexity and failed to restrict the majority of its predictions to the desired output range. For a more thorough look at the results and our analysis, please refer to the paper.
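As a minimal sketch of the kind of clamping described above, the snippet below masks every output token outside a valid discrete action vocabulary before decoding, which guarantees a legal action on every step. The function name, token-ID mapping, and vocabulary size are hypothetical, and this is not OpenVLA's actual implementation.

```python
import numpy as np

def clamp_to_valid_actions(logits: np.ndarray, valid_ids: np.ndarray) -> int:
    """Mask all tokens outside the valid action set, then take the argmax.

    Restricting the decode step this way is what prevents the model from
    ever emitting an out-of-range (invalid) action.
    """
    masked = np.full_like(logits, -np.inf)
    masked[valid_ids] = logits[valid_ids]
    return int(np.argmax(masked))

# Hypothetical example: a 16-action discrete game mapped onto
# token IDs 1000..1015 of a 32k-token model vocabulary.
valid_ids = np.arange(1000, 1016)
logits = np.random.randn(32_000).astype(np.float32)
action_token = clamp_to_valid_actions(logits, valid_ids)
print(action_token)  # always falls within 1000..1015
```

Without such a mask, an autoregressive decoder is free to emit any vocabulary token, which is consistent with the invalid-output behavior we observed in the unclamped models.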