A verification is an automated quality check that runs after a test completes. It analyzes the recorded artifacts — video, screenshots, logs, audio — to answer specific questions about what happened during the run. While checkpoints answer “did the agent reach each goal?”, verifications answer a different question: “did the game behave correctly along the way?”

Checkpoints vs Verifications

| | Checkpoints | Verifications |
|---|---|---|
| Question | "Did the agent reach this milestone?" | "Did the game behave correctly?" |
| When | Evaluated in real time, during the run | Evaluated after the run completes, against recorded artifacts |
| Who decides | The agent, based on what it sees on screen | An AI evaluator, reviewing the recorded evidence |
| Focus | Agent progress through the test plan | Game quality — visuals, audio, UI behavior, state correctness |
| Analogy | A checklist the QA tester follows | The QA report they write after reviewing the recording |
Think of it this way: checkpoints guide the agent during play. Verifications review the tape after play.

Assertions

An assertion is a specific question you want answered about the run.
Assertions are the building blocks of a verification. Each assertion is a natural-language question that an AI evaluator answers by examining the run’s recorded artifacts. Examples:
  • “Does the shop popup appear and display correctly?”
  • “Did the player health drop below 50 at any point?”
  • “Is the tutorial prompt visible when the player reaches level 2?”
  • “Does the UI feel responsive throughout the tutorial?”
Assertions are written in plain language — you describe what you want to check, and the evaluator figures out how to analyze the evidence.
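As a concrete illustration, a set of assertions can be modeled as simple data records: a question plus metadata for the evaluator. The `Assertion` class and its field names below are hypothetical, not part of any real API:

```python
from dataclasses import dataclass

# Illustrative sketch only: "Assertion", "scope", and "severity" are
# hypothetical names. Each assertion is a natural-language question the
# AI evaluator answers from the run's recorded artifacts.
@dataclass
class Assertion:
    question: str          # plain-language question for the evaluator
    scope: str = "global"  # which part of the run to examine
    severity: str = "major"

assertions = [
    Assertion("Does the shop popup appear and display correctly?"),
    Assertion("Did the player health drop below 50 at any point?"),
    Assertion("Is the tutorial prompt visible when the player reaches level 2?",
              scope="checkpoint:level_2"),
]

for a in assertions:
    print(a.question)
```

The point is that the assertion itself stays plain language; only the surrounding metadata is structured.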

Scopes — When and Where to Check

Each assertion has a scope that defines which part of the run it examines.
| Scope | What it covers | Example use |
|---|---|---|
| Global | The entire run, from start to finish | "No visual glitches at any point" |
| Checkpoint | Around a specific checkpoint (when it's reached, before, or after) | "Tutorial prompt appears when player starts level 2" |
| Time window | A specific time range within the run | "No crashes during the first 5 minutes" |
| Terminal | Evaluated once at the end of the run | "Final score is above 1000" |
Scopes keep evaluations focused and efficient. Checking “is the shop popup visible?” only makes sense when the player is actually in the shop — not throughout the entire run.
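Mechanically, a scope is just a filter over the recorded evidence. The sketch below shows how scoped frame selection might work; the scope string format (`"window:0-300"`, `"terminal"`) is an invented illustration, not a real syntax:

```python
# Hypothetical sketch: given per-frame timestamps (in seconds), select only
# the frames a scoped assertion should examine.
def frames_in_scope(timestamps, scope):
    if scope == "global":
        return list(timestamps)
    if scope == "terminal":           # only the last recorded frame
        return timestamps[-1:]
    if scope.startswith("window:"):   # e.g. "window:0-300" = first 5 minutes
        lo, hi = map(float, scope.split(":", 1)[1].split("-"))
        return [t for t in timestamps if lo <= t <= hi]
    raise ValueError(f"unknown scope: {scope}")

ts = [0, 60, 120, 290, 400, 550]
print(frames_in_scope(ts, "window:0-300"))  # [0, 60, 120, 290]
print(frames_in_scope(ts, "terminal"))      # [550]
```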

Inference Types — What the Evaluator Examines

Each assertion specifies which artifacts the evaluator should look at:
  • Screenshots / Frames — For visual checks (“Is the button visible?”, “Is the UI correctly laid out?”)
  • Video — For motion and timing checks (“Does the animation play smoothly?”)
  • Audio — For sound checks (“Does the win sound play?”)
  • Logs — For technical checks (“Did the API return an error?”, “Is performance within acceptable range?”)
The evaluator only examines the artifacts you specify, which keeps it focused on what matters for each assertion.
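One way to picture this routing: each inference type maps to a set of artifact files, and an assertion only pulls in the artifacts its types require. The mapping, file names, and `inference_types` field below are all assumptions for illustration:

```python
# Illustrative mapping from inference type to the artifact files an
# evaluator would load. Nothing here is a real file layout or API.
ARTIFACTS_BY_TYPE = {
    "frames": ["screenshots/"],
    "video":  ["run.mp4"],
    "audio":  ["run.wav"],
    "logs":   ["game.log", "network.log"],
}

def artifacts_for(assertion):
    """Return only the artifacts this assertion's inference types need."""
    paths = []
    for t in assertion["inference_types"]:
        paths.extend(ARTIFACTS_BY_TYPE[t])
    return paths

check = {"question": "Does the win sound play?", "inference_types": ["audio"]}
print(artifacts_for(check))  # ['run.wav']
```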

Outcomes

After evaluation, each assertion receives one of these outcomes:
| Outcome | Meaning |
|---|---|
| Passed | The assertion was satisfied — the game behaved as expected |
| Failed | The assertion was violated — something went wrong |
| Skipped | The scope didn't apply (e.g., the checkpoint was never reached) |
| Inconclusive | The evaluator couldn't determine a clear pass or fail from the available evidence |
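The decision logic behind these outcomes can be sketched in a few lines. The function and its arguments are hypothetical; only the rules (scope never applied means skipped, no clear verdict means inconclusive) come from the table above:

```python
# Sketch of how an outcome is derived. "scope_applied" means the scoped
# part of the run actually occurred; "evaluator_verdict" is True, False,
# or None when the evidence didn't support a clear answer.
def outcome(scope_applied, evaluator_verdict):
    if not scope_applied:
        return "skipped"          # e.g. the checkpoint was never reached
    if evaluator_verdict is None:
        return "inconclusive"
    return "passed" if evaluator_verdict else "failed"

print(outcome(False, True))   # skipped
print(outcome(True, None))    # inconclusive
print(outcome(True, False))   # failed
```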

Severity Levels

Not all failures are equal. Each assertion has a severity that indicates how serious a failure is:
  • Blocker — Fails the entire run. Something critical is broken
  • Critical — Serious issue that likely fails the run
  • Major — Notable problem worth flagging
  • Minor — Cosmetic or edge-case issue
  • Info — Just an observation, not a failure
Severity helps you triage results quickly — focus on blockers and criticals first, then work down.
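A minimal triage sketch, assuming results arrive as plain records: sort failures by severity and derive a run verdict. The "any blocker fails the run" rule comes from the list above; the data structures are illustrative:

```python
# Sketch: triage failed assertions by severity, most serious first.
SEVERITY_ORDER = ["blocker", "critical", "major", "minor", "info"]

failures = [
    {"question": "Shop popup renders?", "severity": "minor"},
    {"question": "Game reaches main menu?", "severity": "blocker"},
    {"question": "Win sound plays?", "severity": "major"},
]

failures.sort(key=lambda f: SEVERITY_ORDER.index(f["severity"]))
run_failed = any(f["severity"] == "blocker" for f in failures)
print([f["severity"] for f in failures])  # ['blocker', 'major', 'minor']
print(run_failed)                         # True
```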

Reference Assets

Some assertions compare the run against a reference — a known-good baseline. For example:
  • Compare a screenshot to a reference image to detect visual regressions
  • Compare UI layout to a baseline to catch layout shifts
  • Compare audio to a reference clip
This is especially useful for regression testing, where you want to confirm that a new build looks and behaves the same as the previous version.
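To make the reference comparison concrete, here is the simplest possible stand-in: mean absolute pixel difference between a captured frame and a baseline, both as flat grayscale arrays. Real pipelines use perceptual metrics, and the threshold here is an arbitrary illustration:

```python
# Sketch of a visual-regression check against a reference asset.
# Returns (regressed?, difference score).
def visual_regression(frame, reference, threshold=5.0):
    assert len(frame) == len(reference)
    mad = sum(abs(a - b) for a, b in zip(frame, reference)) / len(frame)
    return mad > threshold, mad

baseline = [10, 200, 30, 40]
capture  = [12, 198, 33, 41]   # small noise: not a regression
regressed, score = visual_regression(capture, baseline)
print(regressed, score)  # False 2.0
```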

When Verifications Run

  • Automatically after a run — By default, verifications execute as soon as the run completes and artifacts are available
  • Manual re-trigger — You can re-run verifications on a completed run at any time, for example after adjusting your assertions or adding new ones
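The trigger condition above reduces to a small guard: verify only once the run is complete and its artifacts exist, and allow the same call to be repeated later. Everything below (`maybe_verify`, the run fields, the stub) is hypothetical:

```python
# Hypothetical sketch of the trigger logic. run_verifications is a stub
# standing in for the actual verification pipeline.
def maybe_verify(run, run_verifications):
    if run["status"] == "completed" and run["artifacts_ready"]:
        return run_verifications(run)
    return None  # not ready; a later re-trigger calls this again

result = maybe_verify({"status": "completed", "artifacts_ready": True},
                      lambda r: "verified")
print(result)  # verified
```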

Test Archetypes

Different types of tests use verifications differently:
| Test Type | Typical Verification |
|---|---|
| Smoke test | Simple: no crash, reaches main menu |
| Tutorial test | Presence checks at each step + stuck detection |
| Regression test | Many assertions with reference comparisons, automated verdicts |
| Feature test | Focused assertions specific to the new feature |
| Exploration test | No explicit assertions — anomaly detection only, human reviews findings |
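In practice this often means picking a default verification profile per archetype. The mapping below mirrors the table; the structure and names are illustrative only:

```python
# Sketch: default verification profile per test archetype.
DEFAULT_VERIFICATIONS = {
    "smoke":       ["no crash", "reaches main menu"],
    "tutorial":    ["prompt present at each step", "stuck detection"],
    "regression":  ["reference comparisons", "automated verdicts"],
    "feature":     ["feature-specific assertions"],
    "exploration": [],  # anomaly detection only; a human reviews findings
}

print(DEFAULT_VERIFICATIONS["smoke"])
```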