Swarms

Multi-agent swarm coordination — spawning child executions, inter-agent messaging, resource locking, and cascade cancellation.

A Swarm is a group of agent executions that share a common root execution. One agent (the parent) spawns one or more child executions, coordinates with them via messaging, and synchronizes access to shared resources via TTL-backed locks. The orchestrator tracks the entire parent-child hierarchy and enforces security boundaries at spawn time.


Swarm Topology

Root Execution (parent)
├── Child Execution A  (depth 1)
│   ├── Child Execution A1  (depth 2)
│   └── Child Execution A2  (depth 2)
└── Child Execution B  (depth 1)
    └── Child Execution B1  (depth 2)

Maximum recursive depth: 3. An execution at depth 3 cannot spawn further children. Attempts to do so are rejected with SpawnError::MaxDepthExceeded.

Swarm members are tracked by ExecutionId — the internal map is HashMap<ExecutionId, AgentId>. Each child gets a unique execution_id + agent_id pair at spawn time, and the child's ExecutionHierarchy has its swarm_id set so that every execution in the tree is traceable back to the swarm.
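
The depth limit and member tracking described above can be modeled in a few lines. This is a minimal sketch, not the orchestrator's implementation: `Execution`, `spawn_child`, and the Python `MaxDepthExceeded` exception are hypothetical stand-ins, the `members` dict mirrors the `HashMap<ExecutionId, AgentId>`, and the root is assumed to sit at depth 0.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional
import uuid

MAX_DEPTH = 3  # maximum recursive depth stated in this section


class MaxDepthExceeded(Exception):
    """Hypothetical stand-in for SpawnError::MaxDepthExceeded."""


@dataclass
class Execution:
    depth: int = 0  # assumption: the root execution is depth 0
    swarm_id: Optional[str] = None
    execution_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    agent_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def spawn_child(parent: Execution, members: Dict[str, str]) -> Execution:
    """Create a child one level below `parent`, or reject at the depth limit."""
    if parent.depth >= MAX_DEPTH:
        raise MaxDepthExceeded(f"execution at depth {parent.depth} cannot spawn")
    # Reuse the parent's swarm, or create one on the first spawn.
    if parent.swarm_id is None:
        parent.swarm_id = str(uuid.uuid4())
    child = Execution(depth=parent.depth + 1, swarm_id=parent.swarm_id)
    # Members are keyed by execution ID and map to the agent ID.
    members[child.execution_id] = child.agent_id
    return child
```

Every child inherits the `swarm_id` of its parent, so any execution in the tree can be traced back to the swarm.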


Spawning Child Agents

Agents spawn children by calling the aegis.spawn_child MCP tool at runtime. The child executes asynchronously — spawn_child returns immediately with identifiers, and the parent uses aegis.await_child to block until completion.

from aegis import AegisClient

client = AegisClient()

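# Read the worker manifest, closing the file as soon as it has been read
with open("/agent/worker-manifest.yaml") as manifest_file:
    manifest_yaml = manifest_file.read()

# Spawn a child agent
result = client.call_tool("aegis.spawn_child", {
    "manifest_yaml": manifest_yaml,
    # swarm_id is optional; omit to have the orchestrator create a new swarm
})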

child_agent_id = result["agent_id"]       # AgentId of the spawned child
child_execution_id = result["execution_id"]  # ExecutionId of the child's run
swarm_id = result["swarm_id"]                # SwarmId (created automatically if omitted)

# Do other work while the child runs...

# Block until child completes (or timeout)
outcome = client.call_tool("aegis.await_child", {
    "execution_id": child_execution_id,
    "timeout_secs": 300
})

if outcome["status"] == "completed":
    print(f"Child succeeded: {outcome['output']}")
else:
    print(f"Child did not succeed: {outcome['status']}")

Security Context Ceiling

A child agent's security_context must be a subset (≤) of its parent's security_context. The orchestrator enforces this at spawn time and rejects the call with SpawnError::ContextExceedsParentCeiling if the child requests broader permissions than the parent holds.

This prevents privilege escalation via spawned children. A parent holding a restricted SecurityContext cannot grant a child execution broader permissions than it holds itself.

Phase 1 (current): Enforcement uses name-based comparison — the child's security_context name must exactly match the parent's. A child requesting a different security context name is rejected. Phase 2 will introduce capability-lattice comparison, enabling children to request a strict subset of the parent's capabilities under a different context name.
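
The two enforcement phases can be contrasted with a short sketch. Both functions are hypothetical illustrations: the Phase 1 check follows the exact-name rule described above, while the Phase 2 check models the planned capability comparison as a plain set-subset test, which is only one possible reading of "capability lattice" and not a confirmed design.

```python
from typing import FrozenSet


class ContextExceedsParentCeiling(Exception):
    """Hypothetical stand-in for SpawnError::ContextExceedsParentCeiling."""


def check_ceiling_phase1(parent_context_name: str, child_context_name: str) -> None:
    """Phase 1: the child's context name must exactly match the parent's."""
    if child_context_name != parent_context_name:
        raise ContextExceedsParentCeiling(
            f"child context {child_context_name!r} != parent {parent_context_name!r}"
        )


def check_ceiling_phase2(parent_caps: FrozenSet[str], child_caps: FrozenSet[str]) -> None:
    """Phase 2 (assumed model): the child's capabilities must be a subset."""
    if not child_caps <= parent_caps:
        extra = sorted(child_caps - parent_caps)
        raise ContextExceedsParentCeiling(f"child requests extra capabilities: {extra}")
```

Under the Phase 2 model a child could run under a different context name as long as it requests nothing the parent lacks. Under Phase 1 that same request is rejected.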

ZaruSession Invariant

The ZaruSession (the user's active session in the Zaru client) tracks exactly one root execution at a time. A swarm can create many concurrent child executions, but all of them are subordinate to that single active root execution.

The Glass Laboratory renders this hierarchy as a coherent multi-agent execution tree, surfacing child agent progress as delegation bubbles inline in the conversation thread via ExecutionNarrative.

OpenBao Credential Isolation

AEGIS uses the Orchestrator Proxy Lease Model for distributing secrets within a swarm. Each child execution triggers its own independent credential resolution call to OpenBao at spawn time.

This ensures that dynamic secret leases are isolated across child-execution boundaries; a child execution's lease expiry is independent of its siblings' or its parent's lifetime.


Inter-Agent Messaging

Agents within a swarm can send messages to each other using unicast (to a specific agent) or broadcast (to all agents in the swarm).

# Unicast to a specific agent
client.call_tool("aegis.send_message", {
    "to_agent_id": "agent-uuid-here",
    "payload": b"<serialized task data>"
})

# Broadcast to all agents in the swarm
client.call_tool("aegis.broadcast_message", {
    "swarm_id": swarm_id,
    "payload": b"<serialized task data>"
})

Messages are raw bytes. Agents are responsible for serialization (e.g., JSON, msgpack). Message payloads are not logged — only the payload size is recorded in MessageSent domain events for audit purposes.
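
Because payloads are opaque bytes, sender and receiver must agree on an encoding. The following is a minimal sketch using JSON over UTF-8. The helper names are hypothetical, and `audit_record` only illustrates the idea of recording the size rather than the content; it is not the actual shape of the MessageSent event.

```python
import json


def encode_payload(task: dict) -> bytes:
    """Serialize a task dict to compact UTF-8 JSON bytes."""
    return json.dumps(task, separators=(",", ":"), sort_keys=True).encode("utf-8")


def decode_payload(payload: bytes) -> dict:
    """Inverse of encode_payload; raises ValueError on malformed input."""
    return json.loads(payload.decode("utf-8"))


def audit_record(payload: bytes) -> dict:
    """Illustrative audit entry: the size is kept, the content is not."""
    return {"payload_size": len(payload)}
```

The encoded bytes can be passed as the `payload` argument of the messaging tools, and the receiver calls `decode_payload` on what it gets.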

There is no message ordering guarantee between different sender-receiver pairs. Within a single sender-receiver pair, messages are delivered in send order.
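
The ordering guarantee can be pictured as one FIFO queue per sender-receiver pair. This is an in-memory model for illustration only; `PairwiseMailbox` is a hypothetical class and says nothing about how the orchestrator actually routes messages.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple


class PairwiseMailbox:
    """One FIFO queue per (sender, receiver) pair."""

    def __init__(self) -> None:
        self._queues: Dict[Tuple[str, str], Deque[bytes]] = defaultdict(deque)

    def send(self, sender: str, receiver: str, payload: bytes) -> None:
        # Appending preserves send order within this pair only.
        self._queues[(sender, receiver)].append(payload)

    def receive(self, sender: str, receiver: str) -> Optional[bytes]:
        # Return the oldest undelivered message for this pair, if any.
        queue = self._queues[(sender, receiver)]
        return queue.popleft() if queue else None
```

Nothing in this model relates the queue for one pair to the queue for another, which is why agents that need a global order have to carry their own sequence numbers or timestamps in the payload.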


Resource Locking

When multiple child agents need exclusive access to a shared resource (for example, writing to the same file or updating a shared database row), they use the ResourceLock mechanism.

# Acquire a lock
lock = client.call_tool("aegis.acquire_lock", {
    "resource": "workspace/shared-config.json",
    "ttl_secs": 60   # lock auto-expires after 60 seconds even if not released
})

lock_token = lock["lock_token"]

try:
    # ... exclusive work ...
    pass
finally:
    # Release the lock
    client.call_tool("aegis.release_lock", {
        "lock_token": lock_token
    })

Lock Behavior

| Property | Value |
| --- | --- |
| Default TTL | 300 seconds (5 minutes) |
| Execution binding | Each lock stores the holder's execution_id. When that execution completes or is cancelled, the lock is automatically released. |
| Contention behavior | acquire_lock blocks until the lock is available or the call times out. |
| Background GC | A background task runs every 30 seconds, sweeping expired locks. A LockExpired domain event is emitted for each. |
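
The lock properties above can be modeled with a small in-memory table. This is a sketch under stated assumptions: `LockTable` and its methods are hypothetical, time is passed in explicitly so the behavior is deterministic, and `try_acquire` returns `None` on contention instead of blocking as the real `acquire_lock` does.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
import uuid

DEFAULT_TTL_SECS = 300  # default TTL from the table above


@dataclass
class Lock:
    token: str
    execution_id: str
    expires_at: float


class LockTable:
    def __init__(self) -> None:
        self._locks: Dict[str, Lock] = {}

    def try_acquire(self, resource: str, execution_id: str, now: float,
                    ttl_secs: float = DEFAULT_TTL_SECS) -> Optional[str]:
        """Return a lock token, or None if an unexpired lock is held."""
        held = self._locks.get(resource)
        if held is not None and held.expires_at > now:
            return None
        lock = Lock(str(uuid.uuid4()), execution_id, now + ttl_secs)
        self._locks[resource] = lock
        return lock.token

    def release_for_execution(self, execution_id: str) -> List[str]:
        """Execution binding: drop every lock held by a finished execution."""
        released = [r for r, l in self._locks.items() if l.execution_id == execution_id]
        for resource in released:
            del self._locks[resource]
        return released

    def sweep(self, now: float) -> List[str]:
        """Background GC pass: remove expired locks and report them."""
        expired = [r for r, l in self._locks.items() if l.expires_at <= now]
        for resource in expired:
            del self._locks[resource]
        return expired
```

In this model, each resource returned by `sweep` corresponds to one LockExpired event.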

To avoid deadlocks:

  • Always use try/finally to release locks.
  • Set TTLs conservatively — if your critical section takes 10 seconds, use a 30-second TTL.
  • Avoid circular lock acquisition (A waits for B's lock while B waits for A's lock).
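
The three guidelines can be combined in one helper. The sketch below assumes only the `call_tool` interface and the `aegis.acquire_lock` / `aegis.release_lock` tools shown earlier; the `held_locks` helper itself is hypothetical. It acquires locks in sorted resource order, so two agents that need the same set of resources never wait on each other in a cycle, and it releases them in a `finally` block.

```python
from contextlib import contextmanager
from typing import Iterable, Iterator, List


@contextmanager
def held_locks(client, resources: Iterable[str], ttl_secs: int = 30) -> Iterator[List[str]]:
    """Hold locks on all `resources` for the duration of the with-block."""
    tokens: List[str] = []
    try:
        # A global (sorted) acquisition order rules out circular waits.
        for resource in sorted(set(resources)):
            lock = client.call_tool("aegis.acquire_lock", {
                "resource": resource,
                "ttl_secs": ttl_secs,
            })
            tokens.append(lock["lock_token"])
        yield tokens
    finally:
        # Release in reverse order, including after a partial acquisition.
        for token in reversed(tokens):
            client.call_tool("aegis.release_lock", {"lock_token": token})
```

A caller would write `with held_locks(client, ["workspace/a.json", "workspace/b.json"]):` around the critical section. The TTL still applies, so the section must finish well within `ttl_secs`.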

Cascade Cancellation

Cancelling a parent execution automatically cascades to its swarm and all child executions. This is wired through the SwarmCancellationPort trait — when the core execution engine processes a cancellation for a parent execution, it invokes the port, which cancels the associated swarm and every live child within it.
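
As a rough model of the cascade, the sketch below walks the execution tree from the cancelled parent and marks every live descendant as cancelled. It does not reproduce the SwarmCancellationPort trait; the `cancel_cascade` function, the `children` adjacency map, and the `"running"` and `"cancelled"` status strings are all hypothetical.

```python
from typing import Dict, List


def cancel_cascade(root_id: str, children: Dict[str, List[str]],
                   status: Dict[str, str]) -> List[str]:
    """Cancel `root_id` and every still-running descendant; return their IDs."""
    cancelled: List[str] = []
    pending = [root_id]
    while pending:
        current = pending.pop()
        # Only live executions are cancelled; finished ones keep their status.
        if status.get(current) == "running":
            status[current] = "cancelled"
            cancelled.append(current)
        # Descend regardless, so grandchildren of a finished child are reached.
        pending.extend(children.get(current, []))
    return cancelled
```

Executions that had already finished are left untouched, which matches the "every live child" wording above.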

Swarm control is not exposed as dedicated CLI commands in the current release. Use the execution cancellation endpoint:

POST /v1/executions/{execution_id}/cancel
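
A request to this endpoint could be built as follows, using only the Python standard library. The base URL, the empty request body, and the absence of authentication headers are assumptions, since this page does not specify them; `build_cancel_request` is a hypothetical helper.

```python
import urllib.parse
import urllib.request


def build_cancel_request(base_url: str, execution_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the execution cancellation endpoint."""
    # Escape the ID so it cannot alter the URL path.
    safe_id = urllib.parse.quote(execution_id, safe="")
    url = f"{base_url.rstrip('/')}/v1/executions/{safe_id}/cancel"
    # Assumption: the endpoint accepts an empty body.
    return urllib.request.Request(url, data=b"", method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send it. Any required credentials would need to be added as headers first.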

The cancellation reason is recorded in SwarmDissolved and per-child cancellation domain events for the audit trail. Possible reasons:

| Reason | Description |
| --- | --- |
| ParentCancelled | Parent execution was explicitly cancelled. |
| Manual | Operator issued a manual cancellation through orchestration controls. |
| AllChildrenComplete | Swarm dissolved naturally after all children finished. |
| SecurityViolation | A security policy violation triggered swarm termination. |

Swarm Lifecycle

Created ──▶ Active ──▶ Dissolving ──▶ Dissolved
                    │
                    └─ all children complete, or cancel called

A swarm enters Dissolving when:

  • all child executions have completed or failed, or
  • a manual cancellation request is issued.

It transitions to Dissolved once all in-flight child state is cleaned up (locks released, messages drained). The dissolved_at: Option<DateTime<Utc>> timestamp is recorded on the Swarm aggregate when it enters the Dissolved state.
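
The lifecycle is a strictly linear state machine, which can be sketched as below. The `Swarm` class here is a hypothetical Python model of the aggregate, not its real definition; the only details taken from this page are the four state names and the optional `dissolved_at` timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class SwarmState(Enum):
    CREATED = "Created"
    ACTIVE = "Active"
    DISSOLVING = "Dissolving"
    DISSOLVED = "Dissolved"


# Each state has exactly one successor; Dissolved is terminal.
NEXT_STATE = {
    SwarmState.CREATED: SwarmState.ACTIVE,
    SwarmState.ACTIVE: SwarmState.DISSOLVING,
    SwarmState.DISSOLVING: SwarmState.DISSOLVED,
}


@dataclass
class Swarm:
    state: SwarmState = SwarmState.CREATED
    dissolved_at: Optional[datetime] = None

    def advance(self) -> SwarmState:
        """Move to the next lifecycle state, stamping dissolved_at at the end."""
        if self.state not in NEXT_STATE:
            raise ValueError("swarm is already Dissolved")
        self.state = NEXT_STATE[self.state]
        if self.state is SwarmState.DISSOLVED:
            self.dissolved_at = datetime.now(timezone.utc)
        return self.state
```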


Monitoring Swarms

Swarm lifecycle and child execution state are observable through execution events and the streaming APIs.

Six SwarmEvent variants are published to the event bus:

| Event | Key Fields |
| --- | --- |
| SwarmCreated | swarm_id, parent_execution_id, created_at |
| ChildSpawned | swarm_id, agent_id, execution_id, spawned_at |
| SwarmDissolved | swarm_id, reason, dissolved_at |
| LockAcquired | swarm_id, resource_id, holder, execution_id |
| LockReleased | swarm_id, resource_id |
| MessageBroadcast | swarm_id, from, recipient_count |

These events can be consumed by external systems via the gRPC streaming API. See CLI Capability Matrix for current CLI command coverage.
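
A consumer of these events might validate and tally them as sketched below. The field lists come from the table above, but the wire format is an assumption: events are modeled as plain dicts with a hypothetical `"type"` key, which may not match what the gRPC stream actually delivers.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Key fields per SwarmEvent variant, as listed in the table above.
KEY_FIELDS: Dict[str, Tuple[str, ...]] = {
    "SwarmCreated": ("swarm_id", "parent_execution_id", "created_at"),
    "ChildSpawned": ("swarm_id", "agent_id", "execution_id", "spawned_at"),
    "SwarmDissolved": ("swarm_id", "reason", "dissolved_at"),
    "LockAcquired": ("swarm_id", "resource_id", "holder", "execution_id"),
    "LockReleased": ("swarm_id", "resource_id"),
    "MessageBroadcast": ("swarm_id", "from", "recipient_count"),
}


def missing_fields(event: dict) -> List[str]:
    """List the key fields absent from `event` (empty for unknown types)."""
    expected = KEY_FIELDS.get(event.get("type"), ())
    return [name for name in expected if name not in event]


def count_by_type(events: Iterable[dict]) -> Counter:
    """Tally events per variant, e.g. to see how many children were spawned."""
    return Counter(event.get("type") for event in events)
```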
