A verification is an automated quality check that runs after a test completes. It analyzes the recorded artifacts — video, screenshots, logs, audio — to answer specific questions about what happened during the run.
While checkpoints answer “did the agent reach each goal?”, verifications answer a different question: “did the game behave correctly along the way?”
## Checkpoints vs Verifications
| | Checkpoints | Verifications |
|---|---|---|
| Question | “Did the agent reach this milestone?” | “Did the game behave correctly?” |
| When | Evaluated in real time, during the run | Evaluated after the run completes, against recorded artifacts |
| Who decides | The agent, based on what it sees on screen | An AI evaluator, reviewing the recorded evidence |
| Focus | Agent progress through the test plan | Game quality — visuals, audio, UI behavior, state correctness |
| Analogy | A checklist the QA tester follows | The QA report they write after reviewing the recording |
Think of it this way: checkpoints guide the agent during play. Verifications review the tape after play.
## Assertions
An assertion is a specific question you want answered about the run. Assertions are the building blocks of a verification: each one is a natural-language question that an AI evaluator answers by examining the run’s recorded artifacts.
Examples:
- “Does the shop popup appear and display correctly?”
- “Did the player health drop below 50 at any point?”
- “Is the tutorial prompt visible when the player reaches level 2?”
- “Does the UI feel responsive throughout the tutorial?”
Assertions are written in plain language — you describe what you want to check, and the evaluator figures out how to analyze the evidence.
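As a rough mental model, an assertion can be thought of as a small record pairing the plain-language question with the evidence it should be checked against. The sketch below is purely illustrative: the `Assertion` class and its field names are assumptions, not the platform’s actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch -- the real assertion schema may differ.
@dataclass
class Assertion:
    question: str  # natural-language question for the AI evaluator
    artifacts: list[str] = field(default_factory=lambda: ["screenshots"])

# Visual check, defaults to screenshots:
shop_check = Assertion("Does the shop popup appear and display correctly?")

# Gameplay-state check that needs richer evidence:
health_check = Assertion(
    "Did the player health drop below 50 at any point?",
    artifacts=["video", "logs"],
)

print(shop_check.artifacts)   # ['screenshots']
print(health_check.artifacts) # ['video', 'logs']
```

The point is that the question stays human-readable; everything else on the assertion just tells the evaluator where and how to look.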
## Scopes — When and Where to Check
Each assertion has a scope that defines which part of the run it examines.
| Scope | What it covers | Example use |
|---|---|---|
| Global | The entire run, from start to finish | “No visual glitches at any point” |
| Checkpoint | Around a specific checkpoint (when it’s reached, before, or after) | “Tutorial prompt appears when player starts level 2” |
| Time window | A specific time range within the run | “No crashes during the first 5 minutes” |
| Terminal | Evaluated once at the end of the run | “Final score is above 1000” |
Scopes keep evaluations focused and efficient. Checking “is the shop popup visible?” only makes sense when the player is actually in the shop — not throughout the entire run.
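One way to picture scope resolution is as mapping each scope onto the slice of the run’s timeline it covers. The sketch below assumes a 10-second window around checkpoint scopes and invented field names; the platform’s actual resolution rules may differ.

```python
# Hypothetical sketch: resolve a scope to the (start, end) seconds of the
# run an assertion should examine. Returns None when the scope never
# applied (e.g., the checkpoint was not reached), so the assertion is skipped.

def resolve_scope(scope: dict, run_duration: float,
                  checkpoint_times: dict[str, float]):
    kind = scope["kind"]
    if kind == "global":
        return (0.0, run_duration)
    if kind == "time_window":
        return (scope["start"], min(scope["end"], run_duration))
    if kind == "terminal":
        return (run_duration, run_duration)
    if kind == "checkpoint":
        t = checkpoint_times.get(scope["name"])
        if t is None:
            return None  # checkpoint never reached -> skipped
        pad = scope.get("padding", 10.0)  # assumed default window
        return (max(0.0, t - pad), min(run_duration, t + pad))
    raise ValueError(f"unknown scope kind: {kind}")

times = {"reach_level_2": 95.0}
print(resolve_scope({"kind": "global"}, 300.0, times))                        # (0.0, 300.0)
print(resolve_scope({"kind": "checkpoint", "name": "reach_level_2"}, 300.0, times))  # (85.0, 105.0)
print(resolve_scope({"kind": "checkpoint", "name": "open_shop"}, 300.0, times))      # None
```

Note how the unreached-checkpoint case falls out naturally: it produces no window at all, which corresponds to the Skipped outcome described below.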
## Inference Types — What the Evaluator Examines
Each assertion specifies which artifacts the evaluator should look at:
- Screenshots / Frames — For visual checks (“Is the button visible?”, “Is the UI correctly laid out?”)
- Video — For motion and timing checks (“Does the animation play smoothly?”)
- Audio — For sound checks (“Does the win sound play?”)
- Logs — For technical checks (“Did the API return an error?”, “Is performance within acceptable range?”)
The evaluator only examines the artifacts you specify, which keeps it focused on what matters for each assertion.
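Conceptually, this is just a filter over the run’s recorded artifacts before they reach the evaluator. The sketch below uses invented artifact records; the real artifact format is an assumption here.

```python
# Hypothetical sketch: hand the evaluator only the artifact types an
# assertion declares. The "type"/"path" fields are illustrative.

def select_artifacts(run_artifacts: list[dict], wanted: set[str]) -> list[dict]:
    return [a for a in run_artifacts if a["type"] in wanted]

run_artifacts = [
    {"type": "screenshot", "path": "frame_0001.png"},
    {"type": "video", "path": "run.mp4"},
    {"type": "audio", "path": "run.wav"},
    {"type": "log", "path": "client.log"},
]

# A visual/motion assertion needs only screenshots and video:
evidence = select_artifacts(run_artifacts, {"screenshot", "video"})
print([a["path"] for a in evidence])  # ['frame_0001.png', 'run.mp4']
```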
## Outcomes
After evaluation, each assertion receives one of these outcomes:
| Outcome | Meaning |
|---|---|
| Passed | The assertion was satisfied — the game behaved as expected |
| Failed | The assertion was violated — something went wrong |
| Skipped | The scope didn’t apply (e.g., the checkpoint was never reached) |
| Inconclusive | The evaluator couldn’t determine a clear pass or fail from the available evidence |
## Severity Levels
Not all failures are equal. Each assertion has a severity that indicates how serious a failure is:
- Blocker — Fails the entire run; something critical is broken
- Critical — Serious issue that likely fails the run
- Major — Notable problem worth flagging
- Minor — Cosmetic or edge-case issue
- Info — Just an observation, not a failure
Severity helps you triage results quickly — focus on blockers and criticals first, then work down.
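Outcomes and severities combine naturally into an overall run verdict. The rollup rule sketched below (any failed blocker or critical fails the run; lesser failures are flagged but don’t fail it) is one plausible policy consistent with the descriptions above, not necessarily the platform’s exact behavior.

```python
# Hypothetical sketch: roll per-assertion results up into a run verdict.

SEVERITY_RANK = {"info": 0, "minor": 1, "major": 2, "critical": 3, "blocker": 4}

def run_verdict(results: list[dict]) -> str:
    failures = [r for r in results if r["outcome"] == "failed"]
    # Assumed policy: critical-or-worse failures fail the whole run.
    if any(SEVERITY_RANK[f["severity"]] >= SEVERITY_RANK["critical"]
           for f in failures):
        return "failed"
    if failures:
        return "passed_with_issues"
    return "passed"

results = [
    {"outcome": "passed",  "severity": "blocker"},  # critical check, but it passed
    {"outcome": "failed",  "severity": "minor"},    # cosmetic issue
    {"outcome": "skipped", "severity": "major"},    # scope never applied
]
print(run_verdict(results))  # passed_with_issues
```

Sorting results by severity before review gives exactly the triage order described above: blockers and criticals first.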
## Reference Assets
Some assertions compare the run against a reference — a known-good baseline. For example:
- Compare a screenshot to a reference image to detect visual regressions
- Compare UI layout to a baseline to catch layout shifts
- Compare audio to a reference clip
This is especially useful for regression testing, where you want to confirm that a new build looks and behaves the same as the previous version.
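A minimal illustration of the idea, using plain pixel arrays: compare a captured frame against the baseline with a small tolerance so compression noise doesn’t trip the check. Real visual-regression tooling typically uses perceptual metrics rather than raw pixel differences, so treat this purely as a sketch of the concept.

```python
# Hypothetical sketch: mean absolute pixel difference between a captured
# frame and a known-good baseline, with a tolerance for encoding noise.
# Frames are represented as 2D lists of grayscale values for simplicity.

def frames_match(frame: list[list[int]], baseline: list[list[int]],
                 tolerance: float = 2.0) -> bool:
    diffs = [abs(p - q)
             for row_f, row_b in zip(frame, baseline)
             for p, q in zip(row_f, row_b)]
    return sum(diffs) / len(diffs) <= tolerance

baseline = [[10, 10], [200, 200]]
same     = [[11, 10], [199, 200]]  # slight noise -> still a match
shifted  = [[10, 10], [90, 90]]    # large change -> visual regression

print(frames_match(same, baseline))     # True
print(frames_match(shifted, baseline))  # False
```

The same compare-against-baseline shape applies to the layout and audio cases: capture once from a known-good build, then diff every later run against it.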
## When Verifications Run
- Automatically after a run — By default, verifications execute as soon as the run completes and artifacts are available
- Manual re-trigger — You can re-run verifications on a completed run at any time, for example after adjusting your assertions or adding new ones
## Test Archetypes
Different types of tests use verifications differently:
| Test Type | Typical Verification |
|---|---|
| Smoke test | Simple: no crash, reaches main menu |
| Tutorial test | Presence checks at each step + stuck detection |
| Regression test | Many assertions with reference comparisons, automated verdicts |
| Feature test | Focused assertions specific to the new feature |
| Exploration test | No explicit assertions — anomaly detection only, human reviews findings |