Validation
How AEGIS uses gradient scoring, judge agents, and consensus to determine whether an iteration's output is acceptable.
AEGIS validates every iteration's output before deciding whether to accept it or start another attempt. It also supports a separate pre-dispatch tool judge for selected tool calls during the inner loop. Rather than a binary pass/fail, every validator produces a ValidationScore (0.0-1.0) and a Confidence (0.0-1.0). The execution loop compares scores against declared thresholds, choosing among three outcomes: accept the output, inject the failure reason and retry, or exhaust the iteration budget and fail permanently.
Why Gradient Scoring?
A binary validator can only tell the orchestrator that output failed. A gradient score tells it how badly — which the refinement prompt can use to modulate how much the agent needs to change.
A score of 0.85 with the reasoning "logic is correct but error handling is absent" produces a precise, targeted refinement prompt. A score of 0.15 with the reasoning "completely wrong approach" produces a different one — prompting a rewrite rather than a small patch. The Cortex learning layer also uses these scores to weight which error-solution patterns are reliably successful.
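As an illustration, the score-to-refinement mapping might look like the following sketch. All names here (GradientResult, build_refinement_prompt) and the score cutoffs are hypothetical, not the actual AEGIS API:

```python
from dataclasses import dataclass

@dataclass
class GradientResult:
    score: float       # 0.0-1.0 validation score
    confidence: float  # 0.0-1.0 judge self-confidence
    reasoning: str

def build_refinement_prompt(result: GradientResult) -> str:
    # High scores call for a targeted patch; low scores call for a rewrite.
    if result.score >= 0.7:
        return f"Minor fix needed: {result.reasoning}. Keep the current approach."
    if result.score >= 0.4:
        return f"Partial rework needed: {result.reasoning}. Revise the weak sections."
    return f"Rewrite required: {result.reasoning}. Start from a different approach."

print(build_refinement_prompt(GradientResult(0.85, 0.9, "error handling is absent")))
# A score of 0.15 would instead take the rewrite branch.
```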
ValidationResults Structure
At the end of every iteration, the orchestrator populates a ValidationResults record with one or more sub-results depending on which validators are configured in the agent manifest:
ValidationResults
├── system — exit code and stderr from the container process
├── output — deterministic structural checks (JSON schema, regex)
├── semantic — single LLM judge's gradient score
├── gradient — GradientResult from the judge execution
└── consensus — MultiJudgeConsensus when multiple judges are used

semantic stores the boolean outcome and score from a single judge. gradient holds the full GradientResult (score, confidence, reasoning, and optional signals). consensus stores the MultiJudgeConsensus record when a multi_judge validator runs — including each judge's individual score and the aggregation strategy used.
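The record described above might be modeled roughly as follows; the field types here are assumptions based on this page, not the real schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GradientResult:
    score: float
    confidence: float
    reasoning: str
    signals: dict = field(default_factory=dict)  # optional judge signals

@dataclass
class ValidationResults:
    system: Optional[dict] = None              # exit code and stderr from the container
    output: Optional[bool] = None              # deterministic structural checks
    semantic: Optional[tuple] = None           # (passed, score) from a single judge
    gradient: Optional[GradientResult] = None  # full judge result
    consensus: Optional[dict] = None           # MultiJudgeConsensus when multi-judge runs
```

Only the sub-results for configured validators are populated; the rest stay unset.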
How the Execution Loop Uses Scores
The ExecutionSupervisor runs validators sequentially after the inner loop completes. The effective score for the iteration is the lowest score across all configured validators. If that minimum score is below any validator's min_score, the iteration transitions to Refining instead of Success.
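One plausible reading of this rule, assuming each validator is checked against its own min_score, can be sketched as:

```python
def iteration_outcome(results: list) -> tuple:
    """results: list of (score, min_score) pairs, one per configured validator."""
    # The iteration's effective score is the minimum across all validators.
    effective = min(score for score, _ in results)
    # The iteration succeeds only if every validator meets its threshold.
    passed = all(score >= min_score for score, min_score in results)
    return ("Success" if passed else "Refining", effective)

outcome, effective = iteration_outcome([(0.9, 0.5), (0.8, 0.6), (0.4, 0.7)])
print(outcome, effective)  # Refining 0.4
```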
All validators pass ┌──────────┐
(min score ≥ threshold) │ Running │
┌──────────────────────────▶│ │
│ └────┬─────┘
│ │ inner loop completes
│ ▼
│ ┌──────────┐
│ score ≥ threshold │Validating│
│ ─────────────────────────┤ ├────────────────────────▶ Success
│ └────┬─────┘
│ │ score < threshold
│ │ AND iterations remaining
│ ▼
│ ┌──────────┐ inject error + reasoning
└──────────────────────────│ Refining │──────────────────────────▶ (next iteration)
└────┬─────┘
│ iterations == max_iterations
▼
Failed

Validators are evaluated in the order they are declared. Expensive LLM judges are only reached if the cheaper deterministic validators (exit code, JSON schema, regex) pass first — making it cost-effective to chain them. See Configuring Agent Validation for ordering strategies.
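The fail-fast ordering can be sketched as follows; the validator names and check callables are hypothetical:

```python
calls = []

def exit_code_check():
    calls.append("exit_code")
    return False  # cheap deterministic check fails

def llm_judge_check():
    calls.append("llm_judge")
    return True

def run_validators(validators):
    """validators: ordered list of (name, check) pairs; check() -> bool."""
    for name, check in validators:
        if not check():
            return f"failed at {name}"  # later, more expensive validators never run
    return "all passed"

print(run_validators([("exit_code", exit_code_check), ("llm_judge", llm_judge_check)]))
print(calls)  # the LLM judge was never invoked
```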
Policy Violations in Validation Context
The ValidationContext passed to each iteration includes a policy_violations field — a list of tool names that were blocked by platform policy during that iteration's inner loop. When a tool call is denied at the policy enforcement layer (not by the semantic judge, but by the security policy itself), the tool name is appended to this list.
The SemanticAgentValidator includes policy_violations in the payload sent to each judge. This gives judges the information they need to distinguish two otherwise identical-looking outcomes:
- The agent never attempted a particular tool.
- The agent attempted the tool but was blocked by platform policy.
Without policy_violations, a judge reasoning about tool usage gaps cannot tell whether the gap reflects a design choice or a denied attempt. With it, a judge can acknowledge that the agent followed the correct intent but was constrained by the platform, and score accordingly.
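A minimal sketch of the judge payload assembly, using hypothetical field names rather than the real contract:

```python
def build_judge_payload(task: str, output: str, policy_violations: list) -> dict:
    return {
        "task": task,
        "output": output,
        # Tool names denied by platform policy during this iteration's inner loop.
        "policy_violations": policy_violations,
    }

payload = build_judge_payload("deploy service", "skipped cmd.run step", ["cmd.run"])
# A judge seeing ["cmd.run"] can score the constrained attempt fairly instead
# of penalising the missing tool call as a design flaw.
```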
Confidence Gating
When a judge's self-reported confidence falls below the validator's min_confidence setting, the score is treated as if the threshold was not met — the same consequence as a low score. The iteration moves to Refining, and the low-confidence reasoning is injected as error context for the next attempt. The judge is not re-run within the same iteration.
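Confidence gating reduces to a small check, sketched here with illustrative names:

```python
def gate(score: float, confidence: float, min_score: float, min_confidence: float) -> str:
    if confidence < min_confidence:
        return "Refining"  # low confidence fails the gate even if the score passes
    return "Success" if score >= min_score else "Refining"

# A high score with low self-reported confidence still triggers refinement.
print(gate(score=0.9, confidence=0.3, min_score=0.7, min_confidence=0.6))  # Refining
```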
Inner-Loop Tool Validation
While the outer-loop validation runs at the end of an iteration, AEGIS also supports inner-loop validation via the tool_validation field. For the full tool-call flow, judge payload contract, and operator bypass semantics, see Tool-Call Judging.
This is a pre-execution semantic judge for tool intent, not a post-execution output review. If an agent proposes a dangerous cmd.run payload, the orchestrator pauses the execution, submits the proposed tool call to the semantic judge, and uses the returned gradient result to either permit the invocation or reject it synchronously.
The judge payload includes the current task context, the proposed tool call, the available tool list, the worker mount context, the judge criteria, and a validation-context marker that tells the judge it is operating in the inner loop. The judge returns a JSON verdict with score, confidence, and reasoning, plus optional signals and metadata. The orchestrator compares that verdict to min_score and min_confidence; if either check fails, the call is blocked and the agent gets a reasoning-rich retry signal without consuming the full iteration.
Operationally, this reduces blast radius. Unsafe or low-confidence tool intent is stopped before side effects occur, the decision is auditable as its own child execution, and operators can keep hard policy enforcement separate from semantic review.
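The pre-dispatch gate can be sketched as follows, assuming a verdict dict with the fields described above (the function name and return shape are illustrative):

```python
def gate_tool_call(verdict: dict, min_score: float, min_confidence: float) -> tuple:
    """verdict: judge JSON with 'score', 'confidence', 'reasoning'."""
    if verdict["score"] >= min_score and verdict["confidence"] >= min_confidence:
        return ("permit", None)
    # The call is blocked and the agent receives a reasoning-rich retry signal.
    return ("block", f"retry with guidance: {verdict['reasoning']}")

decision, signal = gate_tool_call(
    {"score": 0.2, "confidence": 0.9, "reasoning": "rm -rf on a mounted volume"},
    min_score=0.7, min_confidence=0.6,
)
print(decision)  # block
```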
Judge Agents Are Child Executions
When a semantic, multi_judge, or tool_validation judge fires, the orchestrator does not call an internal function. It spawns the judge agent as a child execution — a full, isolated container run tracked in the same execution tree as the parent.
Every execution carries an ExecutionHierarchy:
| Field | Description |
|---|---|
| parent_execution_id | UUID of the execution that spawned this one. null for root executions. |
| depth | Nesting depth. 0 = root, 1 = first-level child (e.g., a judge), 2 = second-level child. |
| path | Ordered list of ancestor execution UUIDs from root to this execution. |
This means judge executions are:
- Visible in execution history — child executions appear in execution APIs and event streams alongside worker executions.
- Isolated — the judge runs in its own container with its own security policy. It cannot read or write to the parent execution's workspace unless the judge execution is explicitly granted access to the same mounted volume set.
- Audited — every judge invocation generates the full set of execution events (ExecutionStarted, IterationCompleted, etc.), making the validation decision fully inspectable.
Use execution APIs and logs to inspect judge executions spawned by a parent.
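The hierarchy fields tabled above might be modeled as follows; the dataclass shape is an illustrative assumption, not the real type:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ExecutionHierarchy:
    parent_execution_id: Optional[str]  # None for root executions
    depth: int                          # 0 = root, 1 = first-level child, ...
    path: List[str]                     # ancestor UUIDs, root first

root = ExecutionHierarchy(parent_execution_id=None, depth=0, path=[])
judge = ExecutionHierarchy(parent_execution_id="root-uuid", depth=1, path=["root-uuid"])
# A judge spawned by that judge would sit at depth 2 with a two-element path.
```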
Recursive Depth Limit
Because judges are full executions, a judge agent could theoretically declare its own multi_judge validator — spawning further child executions. This is intentionally supported for composing specialized judges, but unbounded recursion is prevented by a hard cap:
MAX_RECURSIVE_DEPTH = 3

An execution at depth 3 cannot spawn child executions. Any validator that would do so fails with MaxRecursiveDepthExceeded, and the iteration is marked Failed without consuming another retry. Well-designed judge pipelines never come close to this limit: a root worker at depth 0 spawns a judge at depth 1; if that judge uses a semantic validator, its judge runs at depth 2 — leaving one level of headroom.
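A minimal sketch of the cap, with an illustrative exception class and helper:

```python
MAX_RECURSIVE_DEPTH = 3

class MaxRecursiveDepthExceeded(Exception):
    pass

def spawn_child(parent_depth: int) -> int:
    # An execution already at the cap cannot spawn further children.
    if parent_depth >= MAX_RECURSIVE_DEPTH:
        raise MaxRecursiveDepthExceeded(f"depth {parent_depth} cannot spawn children")
    return parent_depth + 1  # the child's depth

print(spawn_child(2))  # 3: a judge's judge still has one level of headroom
```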
Multi-Judge Consensus
When multiple judges run (multi_judge validator or ParallelAgents workflow state), their individual GradientResult scores are aggregated into a MultiJudgeConsensus record:
| Field | Description |
|---|---|
| final_score | Aggregated score from all judges (0.0–1.0). |
| consensus_confidence | Agreement level among judges. High variance between judges produces a lower confidence. |
| individual_results | Each judge's AgentId paired with their full GradientResult. |
| strategy | The aggregation strategy used (weighted_average, majority, unanimous, best_of_n). |
All configured judges run in parallel. The orchestrator collects all results before computing consensus, so parallel judges do not add wall-clock time beyond the slowest judge.
Strategy Summary
| Strategy | Algorithm | When to use |
|---|---|---|
| weighted_average | Weighted mean of scores; confidence penalised by inter-judge variance. | General-purpose gradient validation. |
| majority | Binary vote (score ≥ threshold = pass). Simple majority wins. | Approve/reject decisions where nuance matters less than agreement. |
| unanimous | All judges must score ≥ threshold. Uses minimum confidence across judges. | Security audits, production deployment gates. |
| best_of_n | Rank by score × confidence; take top N; weighted average of those N. | Reduce impact of outlier or misbehaving judges. |
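As one example, the weighted_average strategy might be implemented roughly like this; the exact weighting and the variance-penalty formula are assumptions, not the real algorithm:

```python
from statistics import pvariance

def weighted_average(results: list) -> tuple:
    """results: list of (score, confidence) pairs, one per judge."""
    total_weight = sum(conf for _, conf in results)
    # Confidence-weighted mean of the judges' scores.
    final_score = sum(score * conf for score, conf in results) / total_weight
    # High disagreement between judges lowers the consensus confidence.
    spread = pvariance([score for score, _ in results])
    consensus_confidence = max(0.0, 1.0 - spread * 4)
    return final_score, consensus_confidence

score, conf = weighted_average([(0.8, 0.9), (0.8, 0.9), (0.2, 0.9)])
print(round(score, 2), round(conf, 2))
# The outlier judge drags consensus_confidence down.
```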
For configuration details — validator YAML syntax, threshold fields, judge mode: one-shot requirements — see Configuring Agent Validation.
For using ParallelAgents states in workflows to run judges as part of a multi-stage pipeline, see the Workflow Manifest Reference.