Validation sets for state-of-the-art agents

RL Gym

Structured problem sets built from real-world failures: brittle planning, poor recovery, unreliable tool use, and weak judgment.

Why test cases

Agents often look capable until the task is shaped to reveal the failure.

We are building problem sets that make the weaknesses of state-of-the-art agents visible. Each case is grounded in issues found while testing those agents against practical datasets, then shaped into reproducible tasks, expected behaviors, and scoring notes.

Problem areas

Failure modes we convert into tests

01

Forecasting

Dataset-driven tasks that start simply, then force the agent to preserve intent, state, constraints, and intermediate results across many steps.

02

Imbalanced Classification

Cases where the agent must ask a clarifying question, refuse a bad assumption, or preserve user intent instead of rushing into an apparently plausible answer.

03

Data Quality

Tests that introduce stale facts, corrected requirements, hidden dependencies, or conflicting context to see what the agent keeps, drops, or invents.

04

Probabilistic Classification

Scenarios where tools error, return partial data, or contradict expectations, revealing whether the agent can verify, adapt, and recover.
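As a minimal sketch of how a failure mode becomes a test, here is one illustrative case record with a task, an expected behavior, and a scoring note. The field names and example values are assumptions for illustration, not RL Gym's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    # Hypothetical test-case record; fields mirror "reproducible tasks,
    # expected behaviors, and scoring notes" but names are illustrative.
    problem_area: str       # which of the four areas this case probes
    task: str               # what the agent is asked to do
    injected_fault: str     # the condition shaped to reveal the failure
    expected_behavior: str  # what a capable agent should do instead
    scoring_note: str       # how the outcome is graded

case = TestCase(
    problem_area="Probabilistic Classification",
    task="Estimate class probabilities for a held-out batch via a scoring tool",
    injected_fault="Scoring tool returns partial data on the second call",
    expected_behavior="Detect the partial response, verify, retry, and flag uncertainty",
    scoring_note="Full credit only if the agent recovers without inventing results",
)
print(case.problem_area)
```

The same record shape works for all four areas; only the injected fault and expected behavior change per case.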

Use Cases

Tests designed around real-world problems.

We choose datasets and design problems around real-world use cases that cause real issues. Whether those issues affect your business, your research, or your curiosity, failures in these areas are likely to cause real harm, and improvements are likely to deliver real impact.

01

Find a relevant dataset using real-world data.

Impact matters, which is why we focus on datasets that reflect real-world scenarios.

02

Define and frame the problem.

Identify who would use the dataset and why.

03

Test rigorously.

Challenge the agent on a variety of scenarios related to the problem domain.

04

Track failures.

Wherever the agent fails, document the circumstances and outcomes.
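The four steps above end in documentation. A minimal sketch of what a tracked failure might record, assuming illustrative field names and example values rather than a prescribed format:

```python
from dataclasses import dataclass, asdict

@dataclass
class FailureRecord:
    # Illustrative fields for step 04, "Track failures".
    dataset: str        # the real-world dataset in use
    scenario: str       # the scenario that was run
    circumstances: str  # what the agent was doing when it failed
    outcome: str        # what actually happened

record = FailureRecord(
    dataset="hourly energy demand (hypothetical example)",
    scenario="multi-step forecast with a mid-run constraint change",
    circumstances="agent dropped the updated constraint after step 6",
    outcome="forecast violated the corrected requirement",
)
print(asdict(record))
```

Keeping circumstances and outcome separate makes failures reproducible: the first says how to recreate the situation, the second what to look for.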

Workflow

From dataset scenario to report card

Each problem set moves through the same loop: choose a scenario, configure the agent, watch execution, then review the report card.

Scenario selection screen
01

Pick a training scenario

Start from a real-world dataset scenario with a task pool, environment, and validators.

Run configuration screen
02

Configure the run

Connect your desired agent, set sampling and runtime limits, choose validators, and prepare the launch.

Live run screen
03

Watch the execution

Observe live task pools, event streams, retries, latency, and model state.

Report card screen
04

The report card

Turn the run into scores, validator breakdowns, failure clusters, and improvement signals.
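To make the last step of the loop concrete, here is a minimal sketch of turning per-task validator results into a validator breakdown and failure clusters. The result tuples, validator names, and failure tags are invented for illustration and are not the product's actual data model:

```python
from collections import defaultdict

# Hypothetical per-task results: (task_id, validator, passed, failure_tag)
results = [
    ("t1", "intent_preserved", True,  None),
    ("t2", "intent_preserved", False, "dropped constraint"),
    ("t3", "tool_recovery",    False, "no retry after partial data"),
    ("t4", "tool_recovery",    True,  None),
    ("t5", "tool_recovery",    False, "no retry after partial data"),
]

# Validator breakdown: pass rate per validator.
totals, passes = defaultdict(int), defaultdict(int)
for _, validator, passed, _ in results:
    totals[validator] += 1
    passes[validator] += passed
breakdown = {v: passes[v] / totals[v] for v in totals}

# Failure clusters: failing tasks grouped by a shared failure tag.
clusters = defaultdict(list)
for task_id, _, passed, tag in results:
    if not passed:
        clusters[tag].append(task_id)

print(breakdown)       # pass rate per validator
print(dict(clusters))  # tasks grouped by failure tag
```

Low pass rates point at which validator to improve against; large clusters point at which single fix would recover the most tasks.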

Next step

Found a failure case in the wild?

Describe the problem, the agent you used, what you expected to happen, and what actually happened.

Submit it