A verification is an automated quality check that runs after a test completes. It analyzes the recorded artifacts — video, screenshots, logs, audio — to answer specific questions about what happened during the run.
While checkpoints answer “did the agent reach each goal?”, verifications answer a different question: “did the game behave correctly along the way?”
## Checkpoints vs Verifications
| | Checkpoints | Verifications |
|---|---|---|
| Question | “Did the agent reach this milestone?” | “Did the game behave correctly?” |
| When | Evaluated in real time, during the run | Evaluated after the run completes, against recorded artifacts |
| Who decides | The agent, based on what it sees on screen | An AI evaluator, reviewing the recorded evidence |
| Focus | Agent progress through the test plan | Game quality — visuals, audio, UI behavior, state correctness |
| Analogy | A checklist the QA tester follows | The QA report they write after reviewing the recording |
Think of it this way: checkpoints guide the agent during play. Verifications review the tape after play.
## Assertions
An assertion is a specific question you want answered about the run. Assertions are the building blocks of a verification: each one is a natural-language question that an AI evaluator answers by examining the run’s recorded artifacts.
Examples:
- “Does the shop popup appear and display correctly?”
- “Did the player health drop below 50 at any point?”
- “Is the tutorial prompt visible when the player reaches level 2?”
- “Does the UI feel responsive throughout the tutorial?”
Assertions are written in plain language — you describe what you want to check, and the evaluator figures out how to analyze the evidence.
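As a rough mental model, an assertion can be thought of as a small record pairing the plain-language question with the evidence it should be checked against. The sketch below is purely illustrative: the `Assertion` class and its field names are assumptions, not the platform’s actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch -- the real assertion schema may differ.
@dataclass
class Assertion:
    question: str  # natural-language question for the AI evaluator
    artifacts: list[str] = field(default_factory=lambda: ["screenshots"])

# Visual check, defaults to screenshots:
shop_check = Assertion("Does the shop popup appear and display correctly?")

# Gameplay-state check that needs richer evidence:
health_check = Assertion(
    "Did the player health drop below 50 at any point?",
    artifacts=["video", "logs"],
)

print(shop_check.artifacts)   # ['screenshots']
print(health_check.artifacts) # ['video', 'logs']
```

The point is that the question stays human-readable; everything else on the assertion just tells the evaluator where and how to look.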
## Scopes — When and Where to Check
Each assertion has a scope that defines which part of the run it examines.
| Scope | What it covers | Example use |
|---|---|---|
| Global | The entire run, from start to finish | “No visual glitches at any point” |
| Checkpoint | Around a specific checkpoint (when it’s reached, before, or after) | “Tutorial prompt appears when player starts level 2” |
| Time window | A specific time range within the run | “No crashes during the first 5 minutes” |
| Terminal | Evaluated once at the end of the run | “Final score is above 1000” |
Scopes keep evaluations focused and efficient. Checking “is the shop popup visible?” only makes sense when the player is actually in the shop — not throughout the entire run.
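One way to picture scope resolution is as mapping each scope onto the slice of the run’s timeline it covers. The sketch below assumes a 10-second window around checkpoint scopes and invented field names; the platform’s actual resolution rules may differ.

```python
# Hypothetical sketch: resolve a scope to the (start, end) seconds of the
# run an assertion should examine. Returns None when the scope never
# applied (e.g., the checkpoint was not reached), so the assertion is skipped.

def resolve_scope(scope: dict, run_duration: float,
                  checkpoint_times: dict[str, float]):
    kind = scope["kind"]
    if kind == "global":
        return (0.0, run_duration)
    if kind == "time_window":
        return (scope["start"], min(scope["end"], run_duration))
    if kind == "terminal":
        return (run_duration, run_duration)
    if kind == "checkpoint":
        t = checkpoint_times.get(scope["name"])
        if t is None:
            return None  # checkpoint never reached -> skipped
        pad = scope.get("padding", 10.0)  # assumed default window
        return (max(0.0, t - pad), min(run_duration, t + pad))
    raise ValueError(f"unknown scope kind: {kind}")

times = {"reach_level_2": 95.0}
print(resolve_scope({"kind": "global"}, 300.0, times))                        # (0.0, 300.0)
print(resolve_scope({"kind": "checkpoint", "name": "reach_level_2"}, 300.0, times))  # (85.0, 105.0)
print(resolve_scope({"kind": "checkpoint", "name": "open_shop"}, 300.0, times))      # None
```

Note how the unreached-checkpoint case falls out naturally: it produces no window at all, which corresponds to the Skipped outcome described below.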
## Inference Types — What the Evaluator Examines
Each assertion specifies which artifacts the evaluator should look at:
- Screenshots / Frames — For visual checks (“Is the button visible?”, “Is the UI correctly laid out?”)
- Video — For motion and timing checks (“Does the animation play smoothly?”)
- Audio — For sound checks (“Does the win sound play?”)
- Logs — For technical checks (“Did the API return an error?”, “Is performance within acceptable range?”)
The evaluator only examines the artifacts you specify, which keeps it focused on what matters for each assertion.
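Conceptually, this is just a filter over the run’s recorded artifacts before they reach the evaluator. The sketch below uses invented artifact records; the real artifact format is an assumption here.

```python
# Hypothetical sketch: hand the evaluator only the artifact types an
# assertion declares. The "type"/"path" fields are illustrative.

def select_artifacts(run_artifacts: list[dict], wanted: set[str]) -> list[dict]:
    return [a for a in run_artifacts if a["type"] in wanted]

run_artifacts = [
    {"type": "screenshot", "path": "frame_0001.png"},
    {"type": "video", "path": "run.mp4"},
    {"type": "audio", "path": "run.wav"},
    {"type": "log", "path": "client.log"},
]

# A visual/motion assertion needs only screenshots and video:
evidence = select_artifacts(run_artifacts, {"screenshot", "video"})
print([a["path"] for a in evidence])  # ['frame_0001.png', 'run.mp4']
```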
## Outcomes
After evaluation, each assertion receives one of these outcomes:
| Outcome | Meaning |
|---|---|
| Passed | The assertion was satisfied — the game behaved as expected |
| Failed | The assertion was violated — something went wrong |
| Skipped | The scope didn’t apply (e.g., the checkpoint was never reached) |
| Inconclusive | The evaluator couldn’t determine a clear pass or fail from the available evidence |
## Severity Levels
Not all failures are equal. Each assertion has a severity that indicates how serious a failure is:
- Blocker — Fails the entire run; something critical is broken
- Critical — Serious issue that likely fails the run
- Major — Notable problem worth flagging
- Minor — Cosmetic or edge-case issue
- Info — Just an observation, not a failure
Severity helps you triage results quickly — focus on blockers and criticals first, then work down.
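Outcomes and severities combine naturally into an overall run verdict. The rollup rule sketched below (any failed blocker or critical fails the run; lesser failures are flagged but don’t fail it) is one plausible policy consistent with the descriptions above, not necessarily the platform’s exact behavior.

```python
# Hypothetical sketch: roll per-assertion results up into a run verdict.

SEVERITY_RANK = {"info": 0, "minor": 1, "major": 2, "critical": 3, "blocker": 4}

def run_verdict(results: list[dict]) -> str:
    failures = [r for r in results if r["outcome"] == "failed"]
    # Assumed policy: critical-or-worse failures fail the whole run.
    if any(SEVERITY_RANK[f["severity"]] >= SEVERITY_RANK["critical"]
           for f in failures):
        return "failed"
    if failures:
        return "passed_with_issues"
    return "passed"

results = [
    {"outcome": "passed",  "severity": "blocker"},  # critical check, but it passed
    {"outcome": "failed",  "severity": "minor"},    # cosmetic issue
    {"outcome": "skipped", "severity": "major"},    # scope never applied
]
print(run_verdict(results))  # passed_with_issues
```

Sorting results by severity before review gives exactly the triage order described above: blockers and criticals first.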
## Reference Assets
Some assertions compare the run against a reference — a known-good baseline. For example:
- Compare a screenshot to a reference image to detect visual regressions
- Compare UI layout to a baseline to catch layout shifts
- Compare audio to a reference clip
This is especially useful for regression testing, where you want to confirm that a new build looks and behaves the same as the previous version.
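A minimal illustration of the idea, using plain pixel arrays: compare a captured frame against the baseline with a small tolerance so compression noise doesn’t trip the check. Real visual-regression tooling typically uses perceptual metrics rather than raw pixel differences, so treat this purely as a sketch of the concept.

```python
# Hypothetical sketch: mean absolute pixel difference between a captured
# frame and a known-good baseline, with a tolerance for encoding noise.
# Frames are represented as 2D lists of grayscale values for simplicity.

def frames_match(frame: list[list[int]], baseline: list[list[int]],
                 tolerance: float = 2.0) -> bool:
    diffs = [abs(p - q)
             for row_f, row_b in zip(frame, baseline)
             for p, q in zip(row_f, row_b)]
    return sum(diffs) / len(diffs) <= tolerance

baseline = [[10, 10], [200, 200]]
same     = [[11, 10], [199, 200]]  # slight noise -> still a match
shifted  = [[10, 10], [90, 90]]    # large change -> visual regression

print(frames_match(same, baseline))     # True
print(frames_match(shifted, baseline))  # False
```

The same compare-against-baseline shape applies to the layout and audio cases: capture once from a known-good build, then diff every later run against it.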
## When Verifications Run
- Automatically after a run — By default, verifications execute as soon as the run completes and artifacts are available
- Manual re-trigger — You can re-run verifications on a completed run at any time, for example after adjusting your assertions or adding new ones
## Test Archetypes
Different types of tests use verifications differently:
| Test Type | Typical Verification |
|---|---|
| Smoke test | Simple: no crash, reaches main menu |
| Tutorial test | Presence checks at each step + stuck detection |
| Regression test | Many assertions with reference comparisons, automated verdicts |
| Feature test | Focused assertions specific to the new feature |
| Exploration test | No explicit assertions — anomaly detection only, human reviews findings |