Workflow Engine
Architecture of the AEGIS workflow FSM, Temporal integration, Blackboard system, and workflow execution lifecycle.
The AEGIS workflow engine executes declarative finite-state machine (FSM) workflows defined in YAML. It uses Temporal as the durable execution backend — Temporal guarantees that workflow state survives orchestrator restarts, container crashes, and network partitions.
Architecture Overview
The workflow engine spans two components: the Rust orchestrator handles registration, state persistence, and event sourcing; the TypeScript worker owns FSM interpretation and delegates all agent execution back to the Rust runtime via gRPC.
             ┌─────────────Rust Orchestrator──────────────┐
             │                                            │
             │  WorkflowParser       WorkflowRepository   │
             │  (YAML → domain)      (PostgreSQL)         │
             │                                            │
             └──────── StartWorkflowUseCase ──────────────┘
                                    │
                 ┌──────────────────┴───────────────────┐
                 │                                      │
   HTTP POST /register-workflow            gRPC StartWorkflowExecution
                 │                                      │
                 ▼                                      ▼
     TypeScript Worker (:3000)               Temporal Server (:7233)
   workflow_definitions (PostgreSQL)                    │
                 │                          task queue: aegis-agents
                 └──────────────────────────────────────┘
                                    │
                            aegis_workflow()
                    (single generic FSM interpreter)
                                    │
          ┌─────────────────────────┼──────────────────────────┐
          │                         │                          │
executeAgentActivity  executeSystemCommandActivity  validateOutputActivity
          │                         │                          │
          └─────────────────────────┴──────────────────────────┘
                                    │
            gRPC AegisRuntime (:50051) ←── Rust Orchestrator
              ExecuteAgent (streaming)
              ExecuteSystemCommand (unary)
              ValidateWithJudges (unary)
                                    │
            HTTP POST /v1/temporal-events ──► Rust EventBus
              (WorkflowExecution event store)
Temporal Integration
Temporal is used for durable workflow execution only. AEGIS does not expose Temporal concepts (workflows, activities, signals) in its public API or manifest format. Temporal is an infrastructure concern behind an Anti-Corruption Layer (ACL).
The AEGIS Workflow manifest maps to Temporal as follows:
| AEGIS Concept | Temporal Concept |
|---|---|
| Workflow | Temporal Workflow Definition |
| WorkflowExecution | Temporal Workflow Run |
| WorkflowState (Agent) | Temporal Activity |
| WorkflowState (System) | Temporal Activity |
| WorkflowState (Human) | Temporal Signal Handler |
| WorkflowState (ParallelAgents) | Parallel Temporal Activities + aggregation |
| WorkflowState (ContainerRun) | Temporal Activity (executeContainerRunActivity) |
| WorkflowState (ParallelContainerRun) | Parallel Temporal Activities (executeContainerRunActivity × N) |
| WorkflowState (Subworkflow) | Temporal Child Workflow Execution |
| Blackboard | Temporal Workflow State (persisted in Temporal's event history) |
| TransitionRule | Conditional logic within the Temporal workflow function |
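The last row of the table, TransitionRule as conditional logic inside the workflow function, can be illustrated with a small sketch. It is written in Rust to match the domain model on this page, although the production interpreter runs in the TypeScript worker. The `StateResult` struct, the subset of conditions shown, and the first-match-wins ordering are assumptions made for illustration, not confirmed engine behaviour.

```rust
/// Hypothetical summary of what a completed state produced.
#[derive(Debug, Clone, Default)]
struct StateResult {
    succeeded: bool,
    score: Option<f64>,
    exit_code: Option<i32>,
}

/// A subset of the TransitionCondition variants from the domain model.
#[derive(Debug, Clone)]
enum TransitionCondition {
    Always,
    OnSuccess,
    OnFailure,
    ScoreAbove { threshold: f64 },
    ScoreBelow { threshold: f64 },
    ExitCodeZero,
    ExitCodeNonZero,
    ExitCode { code: i32 },
}

#[derive(Debug, Clone)]
struct TransitionRule {
    condition: Option<TransitionCondition>, // None = unconditional
    target: String,
}

/// Evaluates one condition against a state result.
fn condition_holds(cond: &TransitionCondition, r: &StateResult) -> bool {
    match cond {
        TransitionCondition::Always => true,
        TransitionCondition::OnSuccess => r.succeeded,
        TransitionCondition::OnFailure => !r.succeeded,
        // A missing score never satisfies a score condition.
        TransitionCondition::ScoreAbove { threshold } => r.score.map_or(false, |s| s > *threshold),
        TransitionCondition::ScoreBelow { threshold } => r.score.map_or(false, |s| s < *threshold),
        TransitionCondition::ExitCodeZero => r.exit_code == Some(0),
        TransitionCondition::ExitCodeNonZero => r.exit_code.map_or(false, |c| c != 0),
        TransitionCondition::ExitCode { code } => r.exit_code == Some(*code),
    }
}

/// Returns the target of the first rule whose condition holds
/// (rule order as the tie-breaker is an assumption).
fn next_state<'a>(rules: &'a [TransitionRule], r: &StateResult) -> Option<&'a str> {
    rules
        .iter()
        .find(|rule| rule.condition.as_ref().map_or(true, |c| condition_holds(c, r)))
        .map(|rule| rule.target.as_str())
}
```

With a `ScoreAbove { threshold: 0.8 }` rule followed by an unconditional fallback, a result scoring 0.92 selects the first target and a result scoring 0.5 falls through to the fallback.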
Durability Benefits
Because Temporal persists workflow event history:
- If the orchestrator crashes during a WorkflowState.Agent state (e.g., while the agent container is running), the workflow resumes from the last committed state when the orchestrator restarts.
- Human states can wait indefinitely for signals without consuming memory or CPU.
- Workflow executions running for hours or days are fully supported.
Generic Interpreter Pattern
The TypeScript worker registers a single Temporal workflow function — aegis_workflow — on the aegis-agents task queue. Every AEGIS workflow, regardless of how many states it defines, executes through this one function. When the workflow starts, aegis_workflow calls the fetchWorkflowDefinition activity to load the FSM definition from PostgreSQL, initialises the Blackboard from the workflow's spec.context and any start-input, then runs the FSM loop until a terminal state (no transitions) is reached or the iteration safety bound of 1000 steps is hit.
This means deploying a new workflow YAML requires no TypeScript code changes and no new Temporal workflow function. The manifest is registered to the workflow_definitions table and becomes immediately executable by any running worker replica.
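The loop described above can be sketched in a few lines. This is a simplified Rust illustration of the idea, not the worker's TypeScript code: the `State`, `Outcome`, and `run_fsm` names are hypothetical, transitions are reduced to "requires success", "requires failure", or unconditional, and the `execute` callback stands in for the activity call.

```rust
use std::collections::HashMap;

/// Safety bound on FSM iterations, as stated in the text.
const MAX_STEPS: usize = 1000;

/// Simplified state: each transition is (required outcome, target).
/// `None` as the required outcome means the transition is unconditional.
struct State {
    transitions: Vec<(Option<bool>, String)>,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    /// Reached a state with no transitions.
    Completed { final_state: String, steps: usize },
    /// Hit the iteration bound without reaching a terminal state.
    StepLimitExceeded,
    /// The current state is unknown, or none of its transitions matched.
    Stuck { state: String },
}

/// Runs the FSM from `initial_state`. `execute` reports whether the
/// named state succeeded; terminal states are executed before the loop ends.
fn run_fsm(
    states: &HashMap<String, State>,
    initial_state: &str,
    mut execute: impl FnMut(&str) -> bool,
) -> Outcome {
    let mut current = initial_state.to_string();
    for step in 1..=MAX_STEPS {
        let Some(state) = states.get(&current) else {
            return Outcome::Stuck { state: current };
        };
        let succeeded = execute(current.as_str());
        if state.transitions.is_empty() {
            return Outcome::Completed { final_state: current, steps: step };
        }
        // Pick the first transition whose required outcome matches.
        let next = state
            .transitions
            .iter()
            .find(|(cond, _)| cond.map_or(true, |want| want == succeeded))
            .map(|(_, target)| target.clone());
        match next {
            Some(target) => current = target,
            None => return Outcome::Stuck { state: current },
        }
    }
    Outcome::StepLimitExceeded
}
```

Because the definition is plain data (`HashMap<String, State>`), adding a workflow means adding data rather than code, which is the property the generic interpreter pattern relies on. A state that transitions to itself unconditionally ends in `StepLimitExceeded` after 1000 steps instead of looping forever.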
Domain Model
struct Workflow {
    id: WorkflowId,
    metadata: WorkflowMetadata,  // name, version, labels
    scope: WorkflowScope,        // Global | Tenant | User
    spec: WorkflowSpec,
    created_at: DateTime<Utc>,
}

struct WorkflowSpec {
    initial_state: String,                   // entry state name
    context: HashMap<String, Value>,         // default Blackboard values (spec.context)
    states: HashMap<String, WorkflowState>,
}

// The Workflow aggregate includes a `scope` field (WorkflowScope) that controls visibility.
// Scope can be Global (platform-wide, tenant_id = "aegis-system"), Tenant (default, visible
// to all users in the tenant), or User (private to the owning user). Name-based lookups
// resolve using narrowest-first precedence: User → Tenant → Global.
enum WorkflowScope {
    Global,  // tenant_id = "aegis-system"; visible to all tenants
    Tenant,  // visible to all users within the owning tenant (default)
    User,    // visible only to the owning user within their tenant
}

struct WorkflowState {
    kind: StateKind,
    timeout: Option<Duration>,
    transitions: Vec<TransitionRule>,
}

// All state-specific config is carried inside the StateKind variant.
enum StateKind {
    Agent {
        agent: String,                     // deployed agent name
        input: String,                     // Handlebars input template
        isolation: Option<IsolationMode>,
    },
    System {
        command: String,                   // shell command (Handlebars-rendered)
        env: HashMap<String, String>,
        workdir: Option<String>,
    },
    Human {
        prompt: String,                    // shown to operator (Handlebars-rendered)
        default_response: Option<String>,  // used when timeout elapses with no signal
    },
    ParallelAgents {
        agents: Vec<ParallelAgentConfig>,  // concurrent agent configs
        consensus: ConsensusConfig,        // aggregation strategy
    },
    ContainerRun {
        name: Option<String>,                   // human-readable label for Synapse UI
        image: String,                          // registry/repo:tag or standard runtime reference
        image_pull_policy: ImagePullPolicy,     // Always | IfNotPresent | Never
        command: Vec<String>,                   // exec-form command; joined under sh -c if shell=true
        shell: bool,                            // wrap in /bin/sh -c
        env: HashMap<String, String>,
        workdir: Option<String>,
        volumes: Vec<ContainerVolumeMount>,
        resources: Option<ContainerResources>,  // cpu millicores, memory, timeout
        registry_credentials: Option<String>,   // OpenBao path resolved by orchestrator
        retry: Option<RetryConfig>,
    },
    ParallelContainerRun {
        steps: Vec<ContainerRunConfig>,          // concurrent container steps
        completion: ParallelCompletionStrategy,  // AllSucceed | AnySucceed | BestEffort
    },
    Subworkflow {
        workflow_id: String,         // deployed child workflow name or UUID
        mode: SubworkflowMode,       // Blocking | FireAndForget
        result_key: Option<String>,  // Blackboard key for child's final output (blocking)
        input: Option<String>,       // Handlebars template passed as child's start input
    },
}

enum SubworkflowMode {
    Blocking,       // parent waits for child completion
    FireAndForget,  // parent continues immediately
}

struct TransitionRule {
    condition: Option<TransitionCondition>,  // None = unconditional (always)
    target: String,                          // target state name
    feedback: Option<String>,                // Handlebars string; available as {{state.feedback}}
}

enum TransitionCondition {
    Always,
    OnSuccess, OnFailure,
    ScoreAbove { threshold: f64 },
    ScoreBelow { threshold: f64 },
    ScoreBetween { min: f64, max: f64 },
    ScoreAndConfidenceAbove { score: f64, confidence: f64 },
    ConfidenceAbove { threshold: f64 },
    Consensus { threshold: f64, agreement: f64 },
    AllApproved, AnyRejected,
    InputEquals { value: String },
    InputEqualsYes, InputEqualsNo,
    ExitCodeZero, ExitCodeNonZero,
    ExitCode { code: i32 },
    Custom { expression: String },  // Handlebars boolean expression
}
gRPC Bridge
The TypeScript worker never executes agent code directly. All compute-heavy work is delegated back to the Rust orchestrator via the AegisRuntime gRPC service:
| Activity | gRPC Method | Call Type | Purpose |
|---|---|---|---|
| executeAgentActivity | AegisRuntime.ExecuteAgent | server-streaming | Runs a full 100monkeys iterative execution; collects all ExecutionEvent stream messages before returning. |
| executeSystemCommandActivity | AegisRuntime.ExecuteSystemCommand | unary | Runs a shell command on the orchestrator host. |
| validateOutputActivity | AegisRuntime.ValidateWithJudges | unary | Runs gradient validation against configured judge agents; returns a score (0.0–1.0) and confidence. |
| executeParallelAgentsActivity | AegisRuntime.ExecuteAgent × N, then ValidateWithJudges | parallel + unary | Fans out to N concurrent ExecuteAgent calls via Promise.all, then computes consensus. |
| executeContainerRunActivity | AegisRuntime.ExecuteContainerRun | unary | Creates a container, mounts volumes via NFS gateway, runs the command, captures stdout/stderr/exit_code, destroys the container. No LLM, no bootstrap, no iteration loop. |
| executeParallelContainerRunActivity | AegisRuntime.ExecuteContainerRun × N | Promise.allSettled + aggregation | Dispatches N concurrent executeContainerRunActivity calls; aggregates results per ParallelCompletionStrategy. |
| executeSubworkflowActivity | Temporal Child Workflow API | child workflow | Starts a child workflow execution via Temporal's native child workflow support. In blocking mode, awaits the child's completion and returns the final Blackboard. In fire_and_forget mode, starts the child and returns immediately with the child execution ID. |
The gRPC server runs on port 50051 (configurable via AEGIS_RUNTIME_GRPC_URL). Security policy enforcement, SEAL attestation, and container lifecycle remain entirely in Rust — the TypeScript worker has no direct access to agent containers or the Docker socket.
After each state transition the worker POSTs a structured event to POST /v1/temporal-events on the Rust orchestrator. The TemporalEventListener maps these to typed WorkflowEvent domain objects, appends them to the WorkflowExecution event store in PostgreSQL, and publishes them to the internal EventBus.
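The mapping step performed by the TemporalEventListener might look roughly like the following. This is a minimal sketch under stated assumptions: the `RawEvent` wire shape, the reduced `WorkflowEvent` enum, and the in-memory `EventStore` are hypothetical stand-ins, and only the event names are taken from this page.

```rust
/// Hypothetical wire shape of an event posted by the worker.
struct RawEvent {
    event_type: String,
    execution_id: String,
    state: Option<String>,
}

/// Typed domain events; variant names follow the event names on this page.
#[derive(Debug, Clone, PartialEq)]
enum WorkflowEvent {
    StateEntered { execution_id: String, state: String },
    StateCompleted { execution_id: String, state: String },
    StateFailed { execution_id: String, state: String },
    Completed { execution_id: String },
}

/// Maps a raw event to a typed one, rejecting unknown or incomplete payloads.
fn to_domain_event(raw: RawEvent) -> Result<WorkflowEvent, String> {
    let RawEvent { event_type, execution_id, state } = raw;
    // State-scoped events are rejected when the state name is missing.
    let require_state =
        |state: Option<String>| state.ok_or_else(|| format!("{event_type} requires a state name"));
    match event_type.as_str() {
        "WorkflowStateEntered" => Ok(WorkflowEvent::StateEntered { execution_id, state: require_state(state)? }),
        "WorkflowStateCompleted" => Ok(WorkflowEvent::StateCompleted { execution_id, state: require_state(state)? }),
        "WorkflowStateFailed" => Ok(WorkflowEvent::StateFailed { execution_id, state: require_state(state)? }),
        "WorkflowCompleted" => Ok(WorkflowEvent::Completed { execution_id }),
        other => Err(format!("unknown event type: {other}")),
    }
}

/// In-memory stand-in for the PostgreSQL-backed event store.
#[derive(Default)]
struct EventStore {
    events: Vec<WorkflowEvent>,
}

impl EventStore {
    /// Appends an event and returns its 1-based sequence number.
    fn append(&mut self, event: WorkflowEvent) -> usize {
        self.events.push(event);
        self.events.len()
    }
}
```

Validating at this boundary keeps malformed worker payloads out of the append-only store; publishing to the EventBus would follow a successful `append`.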
Blackboard System
The Blackboard is a HashMap<String, serde_json::Value> that carries shared context across all states in a workflow execution. It is owned by the Temporal worker during live execution and persisted by Temporal's event-sourcing.
Ownership Lifecycle
The Blackboard moves through three distinct phases:
| Phase | Owner | What happens |
|---|---|---|
| Seed | Rust orchestrator | spec.context keys and any values from StartWorkflowExecutionRequest.blackboard are written into WorkflowExecution.blackboard before the Temporal worker starts. Accessible in manifests as {{workflow.context.KEY}} or {{blackboard.KEY}}. |
| Live execution | TypeScript Temporal worker | The FSM runs. Each state's output is written to the Blackboard under the state's name (e.g. REQUIREMENTS.output, REQUIREMENTS.status, REQUIREMENTS.score). Handlebars templates in input, command, env, prompt, and feedback fields are resolved against the live Blackboard. |
| Capture | Rust orchestrator | On WorkflowExecutionCompleted, the final_blackboard snapshot from Temporal is merged back into WorkflowExecution.blackboard and persisted to PostgreSQL. A WorkflowCompleted domain event carrying final_blackboard is published to the event bus. |
The Rust orchestrator never mutates the Blackboard during live execution — only the Temporal worker does.
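The seed and capture phases amount to two map merges, which the following sketch illustrates. It simplifies Blackboard values to strings (the text describes them as serde_json::Value), and it assumes that values from the start request override spec.context defaults; the page does not state that precedence, so treat it as a guess.

```rust
use std::collections::HashMap;

/// Simplified Blackboard with string values only.
type Blackboard = HashMap<String, String>;

/// Seed phase: start from spec.context defaults, then apply request values.
/// Letting request values win over defaults is an assumption.
fn seed(spec_context: &Blackboard, request: &Blackboard) -> Blackboard {
    let mut bb = spec_context.clone();
    bb.extend(request.iter().map(|(k, v)| (k.clone(), v.clone())));
    bb
}

/// Capture phase: merge the worker's final snapshot into the stored Blackboard.
/// Keys present in both maps take the value from the final snapshot.
fn capture(stored: &mut Blackboard, final_blackboard: Blackboard) {
    stored.extend(final_blackboard);
}
```

Between `seed` and `capture` the orchestrator's copy is never written, which matches the single-writer rule stated above.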
State Output Conventions
When a state completes, the Temporal worker writes the execution result to the Blackboard under the state's name. The shape differs by state kind:
Agent state:
Blackboard["REQUIREMENTS"] = {
  "status": "success",
  "output": "...agent's final output text...",
  "score": 0.92,
  "iterations": 2
}
ContainerRun state:
Blackboard["BUILD"] = {
  "exit_code": 0,
  "stdout": "Compiling my-crate v0.1.0\n Finished release in 12.3s",
  "stderr": "",
  "duration_ms": 12350
}
ParallelContainerRun state (keyed by step name):
Blackboard["TEST"] = {
  "unit-tests": { "exit_code": 0, "stdout": "...", "stderr": "", "duration_ms": 4200 },
  "lint": { "exit_code": 0, "stdout": "", "stderr": "", "duration_ms": 870 },
  "format-check": { "exit_code": 1, "stdout": "", "stderr": "...", "duration_ms": 410 }
}
Downstream states reference these values via Handlebars:
ARCHITECTURE:
  kind: Agent
  agent: architect-agent
  input: |
    Requirements: {{REQUIREMENTS.output}}
    Design the architecture.
  transitions:
    - condition: on_success
      target: ARCH_APPROVAL
Template Variables
Handlebars templates are rendered against the live Blackboard at the moment each state executes. The full variable reference is in the Workflow Manifest Reference.
Key variables:
| Variable | Description |
|---|---|
| {{STATE.output}} | Final output text of an Agent state. |
| {{STATE.output.stdout}} | stdout of a System or ContainerRun state. |
| {{STATE.output.stderr}} | stderr of a System or ContainerRun state. |
| {{STATE.output.exit_code}} | Exit code of a System or ContainerRun state. |
| {{STATE.output.duration_ms}} | Wall-clock duration in ms (ContainerRun states). |
| {{STATE.output.STEPNAME.stdout}} | Per-step stdout (ParallelContainerRun states). |
| {{STATE.output.STEPNAME.stderr}} | Per-step stderr (ParallelContainerRun states). |
| {{STATE.output.STEPNAME.exit_code}} | Per-step exit code (ParallelContainerRun states). |
| {{STATE.status}} | "success" or "failed". |
| {{STATE.score}} | Validation score of an Agent state (0.0–1.0). |
| {{STATE.consensus.score}} | Aggregated score of a ParallelAgents state. |
| {{STATE.agents.N.output}} | Individual judge output (ParallelAgents, 0-indexed). |
| {{workflow.context.KEY}} | Value from spec.context. |
| {{blackboard.KEY}} | Any key set in spec.context or by a prior System state. |
| {{state.feedback}} | feedback string from the incoming transition. |
| {{input.KEY}} | Key from the workflow start --input payload. |
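To make the dotted-path variables concrete, here is a deliberately tiny resolver. It is not Handlebars and supports only plain `{{path}}` substitution; the `Value` enum, `lookup`, and `render` are hypothetical names, and rendering an unresolved path as empty text is an assumption about missing-key behaviour.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a JSON value on the Blackboard.
#[derive(Debug, Clone)]
enum Value {
    Text(String),
    Number(f64),
    Object(HashMap<String, Value>),
}

/// Walks a dotted path such as "REQUIREMENTS.output" through nested objects.
fn lookup<'a>(root: &'a HashMap<String, Value>, path: &str) -> Option<&'a Value> {
    let mut parts = path.split('.');
    let mut current = root.get(parts.next()?)?;
    for part in parts {
        match current {
            Value::Object(map) => current = map.get(part)?,
            _ => return None, // cannot descend into a scalar
        }
    }
    Some(current)
}

/// Replaces each {{path}} with its value; unresolved paths render as empty text.
fn render(template: &str, root: &HashMap<String, Value>) -> String {
    let mut out = String::new();
    let mut rest = template;
    while let Some(start) = rest.find("{{") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let Some(end) = after.find("}}") else {
            // Unterminated tag: emit the remainder verbatim.
            out.push_str(&rest[start..]);
            return out;
        };
        match lookup(root, after[..end].trim()) {
            Some(Value::Text(s)) => out.push_str(s),
            Some(Value::Number(n)) => out.push_str(&n.to_string()),
            _ => {}
        }
        rest = &after[end + 2..];
    }
    out.push_str(rest);
    out
}
```

Given a Blackboard where `REQUIREMENTS` is an object with an `output` text field, `render("Requirements: {{REQUIREMENTS.output}}", &bb)` substitutes that text in place of the tag.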
Human State Lifecycle
A WorkflowState.Human suspends the Temporal workflow run and waits for an external signal:
WorkflowExecution
state = "approve_requirements"
status = WAITING_FOR_SIGNAL
│
│ Signal arrives via:
│ POST /v1/workflows/executions/{id}/signal
│ {"response": "approved"}
│ or via HTTP API:
│ POST /v1/human-approvals/{id}/approve
▼
Temporal signal received → Blackboard["approve_requirements"]["decision"] = "approved"
TransitionRule evaluates → target = "implement"
Workflow continues
Human states respect timeout_secs. If no signal arrives within the timeout, the workflow evaluates its transitions. Typically the last (unconditional) transition leads to a failed state.
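The interaction between a signal, the timeout, and default_response can be sketched as below. The `Decision` enum and both functions are hypothetical, rules are reduced to an expected input string (or `None` for unconditional), and first-match-wins ordering is assumed; the actual worker logic may differ.

```rust
/// What the Human state resolved to once it stopped waiting.
#[derive(Debug, PartialEq)]
enum Decision {
    /// An operator signal arrived before the timeout.
    Signal(String),
    /// The timeout elapsed and default_response was configured.
    Default(String),
    /// The timeout elapsed with no default_response.
    TimedOut,
}

/// `signal` is the response received before the timeout, if any.
fn decide(signal: Option<String>, default_response: Option<&str>) -> Decision {
    match (signal, default_response) {
        (Some(response), _) => Decision::Signal(response),
        (None, Some(fallback)) => Decision::Default(fallback.to_string()),
        (None, None) => Decision::TimedOut,
    }
}

/// Each rule is (expected input, target); `None` means unconditional.
/// Returns the target of the first matching rule.
fn pick_target<'a>(decision: &Decision, rules: &[(Option<&str>, &'a str)]) -> Option<&'a str> {
    let input = match decision {
        Decision::Signal(s) | Decision::Default(s) => Some(s.as_str()),
        Decision::TimedOut => None,
    };
    rules
        .iter()
        .find(|(expected, _)| expected.is_none() || *expected == input)
        .map(|(_, target)| *target)
}
```

With rules `approved → implement`, `rejected → revise`, and an unconditional `failed` fallback, a timeout without a default response matches only the fallback, which is the pattern the paragraph above recommends.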
ContainerRun State
When a WorkflowState.ContainerRun is entered, the Temporal worker calls executeContainerRunActivity. This activity calls the Rust orchestrator via AegisRuntime.ExecuteContainerRun gRPC — the orchestrator creates the container, mounts volumes through the NFS Server Gateway, runs the command, captures output, destroys the container, and returns the result. There is no LLM, no bootstrap.py, and no 100monkeys iteration loop.
Enter BUILD (ContainerRun)
│
├── executeContainerRunActivity(image, command, volumes, resources)
│ │
│ ├── gRPC AegisRuntime.ExecuteContainerRun
│ │ ├── Pull image (respects image_pull_policy)
│ │ ├── docker run --rm -v <NFS-mount>:/workspace --cpus 2 --memory 4g
│ │ │ └── cargo build --release
│ │ ├── Capture stdout + stderr (up to 1 MB each)
│ │ └── Return { exit_code, stdout, stderr, duration_ms }
│ │
│ └── Blackboard["BUILD"] = { exit_code, stdout, stderr, duration_ms }
│
TransitionRules evaluated on exit_code → next state chosen
For ParallelContainerRun, the worker fans out with Promise.allSettled and aggregates per ParallelCompletionStrategy:
Enter TEST (ParallelContainerRun, completion: all_succeed)
│
├── executeContainerRunActivity(unit-tests, ...)
├── executeContainerRunActivity(lint, ...) (all three run simultaneously)
└── executeContainerRunActivity(format-check, ...)
│
└── Promise.allSettled — wait for all three
│
completion: all_succeed
→ all exit_code == 0? → on_success transition
→ any exit_code != 0? → on_failure transition
│
Blackboard["TEST"] = {
"unit-tests": { exit_code, stdout, stderr, duration_ms },
"lint": { exit_code, stdout, stderr, duration_ms },
"format-check": { exit_code, stdout, stderr, duration_ms }
}
ParallelAgents State
When a WorkflowState.ParallelAgents is entered, the Temporal workflow dispatches all listed agent IDs concurrently as Temporal Activities:
Enter PARALLEL_REVIEW
│
├── Activity: security-reviewer
├── Activity: performance-reviewer (all three run simultaneously)
└── Activity: style-reviewer
│
└── All three complete
│
Blackboard["PARALLEL_REVIEW"] = {
"consensus": {
"score": 0.91,
"confidence": 0.87,
"all_succeeded": true
},
"agents": [
{"status": "success", "score": 0.91, "output": "{...}"},
{"status": "success", "score": 0.88, "output": "{...}"},
{"status": "success", "score": 0.95, "output": "{...}"}
]
}
│
TransitionRules evaluated → next state chosen
If any single agent fails, all_succeeded is set to false. The default transition (no condition) should lead to the failure handling state.
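One plausible way to build the consensus object is shown below. The arithmetic mean is an assumption: it happens to agree with the example scores above (0.91, 0.88, 0.95 average to about 0.91), but the page does not specify the aggregation strategy, and how confidence is derived is not stated, so it is omitted. `AgentResult`, `Consensus`, and `aggregate` are hypothetical names.

```rust
/// Per-agent outcome, reduced to the fields needed for aggregation.
#[derive(Debug, Clone)]
struct AgentResult {
    succeeded: bool,
    score: f64,
}

/// Aggregated view written under the state's "consensus" key.
#[derive(Debug, PartialEq)]
struct Consensus {
    score: f64,
    all_succeeded: bool,
}

/// Aggregates by arithmetic mean (an assumed strategy).
/// Returns None when there are no results to aggregate.
fn aggregate(results: &[AgentResult]) -> Option<Consensus> {
    if results.is_empty() {
        return None;
    }
    let total: f64 = results.iter().map(|r| r.score).sum();
    Some(Consensus {
        score: total / results.len() as f64,
        // A single failed agent flips this flag to false.
        all_succeeded: results.iter().all(|r| r.succeeded),
    })
}
```

A transition such as ScoreAbove could then be evaluated against `Consensus::score`, with the unconditional fallback handling the `all_succeeded == false` case.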
Workflow Composition
The Subworkflow state kind enables parent-child workflow composition. When a workflow state with kind: Subworkflow is entered, the Temporal worker starts a child workflow execution using Temporal's native child workflow API rather than a gRPC activity call.
How It Works
Enter TRIGGER_CHILD (Subworkflow, mode: blocking)
│
├── Temporal startChildWorkflowExecution(workflow_id, input)
│ │
│ ├── Child workflow starts as an independent Temporal workflow run
│ ├── Child has its own Blackboard, state machine, and execution history
│ ├── Child runs the same aegis_workflow() generic interpreter
│ │
│ └── Parent waits for child WorkflowExecutionCompleted signal
│ │
│ └── Child's final_blackboard returned to parent
│
├── Blackboard["TRIGGER_CHILD"] = {
│ "status": "success",
│ "result": <child's final output>,
│ "child_execution_id": "wfx-child-..."
│ }
├── Blackboard["pipeline_output"] = <child's final_blackboard> (via result_key)
│
TransitionRules evaluated → next state chosen
In fire_and_forget mode, the parent does not wait:
Enter START_BACKGROUND (Subworkflow, mode: fire_and_forget)
│
├── Temporal startChildWorkflowExecution(workflow_id, input)
│ └── Child starts; parent does NOT await completion
│
├── Blackboard["START_BACKGROUND"] = {
│ "status": "success",
│ "child_execution_id": "wfx-child-..."
│ }
│
TransitionRules evaluated → parent continues immediately
Design Principles
- Isolation: The child workflow has its own Blackboard. It cannot read or write the parent's Blackboard directly. Data flows in via input and out via the child's final_blackboard.
- Durability: Temporal's child workflow mechanism ensures that if the parent crashes, the child either completes or is cancelled according to the configured ParentClosePolicy. By default, children are allowed to run to completion.
- Reusability: Any deployed workflow can be invoked as a child. The child does not need to know it is being called from a parent; it is a standard workflow.
- Observability: Child workflow executions appear as independent runs in the Temporal UI and in the AEGIS execution history API. The parent-child relationship is tracked via WorkflowChildStarted and WorkflowChildCompleted domain events.
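The difference between the two modes shows up in what the parent records. The sketch below illustrates this with a flattened, string-valued Blackboard using dotted keys, which is a simplification of the nested structure shown earlier; `record_child` is a hypothetical helper, and failure handling is left out.

```rust
use std::collections::HashMap;

enum SubworkflowMode {
    Blocking,
    FireAndForget,
}

/// Writes a Subworkflow state's entries into the parent Blackboard.
/// `child_output` is only consulted in blocking mode.
fn record_child(
    parent: &mut HashMap<String, String>,
    state_name: &str,
    mode: &SubworkflowMode,
    child_execution_id: &str,
    child_output: Option<&str>,
    result_key: Option<&str>,
) {
    parent.insert(format!("{state_name}.status"), "success".to_string());
    parent.insert(
        format!("{state_name}.child_execution_id"),
        child_execution_id.to_string(),
    );
    // Only a blocking parent has a child result to record.
    if let (SubworkflowMode::Blocking, Some(output)) = (mode, child_output) {
        parent.insert(format!("{state_name}.result"), output.to_string());
        if let Some(key) = result_key {
            parent.insert(key.to_string(), output.to_string());
        }
    }
}
```

In fire_and_forget mode only the status and child execution ID are written, so a configured result_key stays unset, consistent with the isolation principle: the parent sees child data only through an awaited result.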
Workflow Execution Events
The workflow engine publishes to the event bus:
| Event | Trigger |
|---|---|
| WorkflowStarted | StartWorkflowUseCase completes |
| WorkflowStateEntered | Temporal activity begins |
| WorkflowStateCompleted | Temporal activity succeeds |
| WorkflowStateFailed | Temporal activity fails or times out |
| WorkflowSignalReceived | Human state receives signal |
| WorkflowCompleted | Reaches a terminal state with no transitions |
| WorkflowChildStarted | Subworkflow state starts a child workflow execution |
| WorkflowChildCompleted | Child workflow execution completes (blocking mode) |
| WorkflowFailed | Reaches a terminal error state or exceeds Temporal timeout |
These events are consumable via the gRPC streaming API for real-time monitoring dashboards.