Building Multi-Agent Swarms
How to spawn child agents, pass messages, use resource locks, and manage swarm lifecycle.
This guide covers how to build agents that coordinate with other agents via the AEGIS swarm system. Parent agents can spawn children, await completion, exchange messages, and use resource locks via SEAL-secured MCP tool calls.
When to Use Swarms
Use swarms when a single agent's task is best decomposed into parallel or sequential sub-tasks:
- Parallel processing: Analyze multiple files simultaneously with one agent per file.
- Specialization: Route sub-tasks to agents with different capabilities (e.g., a python-expert and a security-reviewer in parallel).
- Pipeline decomposition: Break a large task into sequential stages, each with isolated state and its own iteration loop.
If your decomposition is static and predictable, consider a Workflow FSM instead — it is easier to monitor and debug. Use swarms when decomposition needs to be dynamic (e.g., spawn one child per input item).
Basic Spawn and Await Pattern
```python
import json

from aegis import AegisClient

client = AegisClient()
task = client.get_task()
files = task.input.get("files", [])

# Spawn a child agent for each file
# (per-file context is passed separately; see Passing Context to Children)
spawned = []
for file_path in files:
    result = client.call_tool("aegis.spawn_child", {
        "manifest_yaml": f"""
apiVersion: 100monkeys.ai/v1
kind: Agent
metadata:
  name: file-analyzer-child
spec:
  image: myregistry/file-analyzer:latest
  capabilities:
    - fs.read
    - cmd.run
  security:
    security_context: default
  resources:
    timeout_secs: 120
""",
    })
    spawned.append({
        "file": file_path,
        "execution_id": result["execution_id"],
        "swarm_id": result["swarm_id"],
    })

# Await all children
results = []
for item in spawned:
    outcome = client.call_tool("aegis.await_child", {
        "execution_id": item["execution_id"],
        "timeout_secs": 150,
    })
    results.append({
        "file": item["file"],
        "status": outcome["status"],
        "output": outcome.get("output", ""),
    })

# Write summary
with open("/workspace/analysis_results.json", "w") as f:
    json.dump(results, f, indent=2)

print(json.dumps({"processed": len(results), "results_path": "/workspace/analysis_results.json"}))
```

Passing Context to Children
A child execution receives only what is passed via task.input. You can inject per-child context by embedding it in the manifest's environment section, or by having the child read from a shared volume that the parent writes first.
Option A: Volume-Backed Context
Write a context file to the shared workspace before spawning:
```python
# Parent writes per-child task files, then spawns one child per task
for i, file_path in enumerate(files):
    context_path = f"/workspace/tasks/task_{i}.json"
    client.call_tool("fs.write", {
        "path": context_path,
        "content": json.dumps({"target_file": file_path}),
    })
    client.call_tool("aegis.spawn_child", {
        "manifest_yaml": child_manifest(task_index=i),
    })
```

The child reads its task file from /workspace/tasks/task_{i}.json at startup.
Option B: Environment Variables in the Manifest
Embed small parameters directly in the dynamically generated manifest:
```python
def child_manifest(file_path: str) -> str:
    return f"""
apiVersion: 100monkeys.ai/v1
kind: Agent
metadata:
  name: worker
spec:
  image: myregistry/worker:latest
  capabilities:
    - fs.read
    - fs.write
  security:
    security_context: default
  resources:
    timeout_secs: 120
  environment:
    TARGET_FILE: "{file_path}"
"""
```

The child reads TARGET_FILE from its environment at runtime.
Resource Locking
When multiple children might write to the same file or shared state, use resource locks.
Each lock is internally bound to the calling execution's execution_id. This means locks are automatically released when the holding execution completes or is cancelled, in addition to the TTL-based expiry. You do not need to pass execution_id explicitly — the orchestrator resolves it from the caller's context.
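Because acquisition can fail while another execution holds the lock, a retry loop with backoff is often useful before giving up. The helper below is a sketch: it assumes a contended aegis.acquire_lock call raises an exception, which is an assumption about the tool's error behavior, so adjust the failure check to match your runtime. The MCP invoker is injected as `call_tool` (e.g., `client.call_tool`).

```python
import time

def acquire_with_retry(call_tool, resource: str, ttl_secs: int = 30,
                       attempts: int = 5, base_delay: float = 0.5):
    """Retry lock acquisition with exponential backoff.

    Assumes a contended aegis.acquire_lock call raises an exception
    (an assumption; adapt if the tool signals contention differently).
    """
    for attempt in range(attempts):
        try:
            return call_tool("aegis.acquire_lock", {
                "resource": resource,
                "ttl_secs": ttl_secs,
            })
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last failure
            time.sleep(base_delay * (2 ** attempt))

# usage: lock = acquire_with_retry(client.call_tool, "workspace/aggregated_output.json")
```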
```python
import json
import os

# Child agent acquires a lock before writing to shared output
lock = client.call_tool("aegis.acquire_lock", {
    "resource": "workspace/aggregated_output.json",
    "ttl_secs": 30,
})
try:
    # Read current state; start fresh if the file is missing or unparseable
    try:
        current = json.loads(client.call_tool("fs.read", {"path": "/workspace/aggregated_output.json"})["content"])
    except Exception:
        current = []
    # Append this child's result (my_result computed earlier by this child)
    current.append({"file": os.environ["TARGET_FILE"], "result": my_result})
    # Write back
    client.call_tool("fs.write", {
        "path": "/workspace/aggregated_output.json",
        "content": json.dumps(current),
    })
finally:
    client.call_tool("aegis.release_lock", {
        "lock_token": lock["lock_token"],
    })
```

Inter-Agent Messaging
Use messaging for lightweight coordination between long-running agents in the same swarm:
```python
# Parent sends work items to a child
client.call_tool("aegis.send_message", {
    "to_agent_id": child_agent_id,
    "payload": json.dumps({"command": "analyze", "target": "/workspace/file.py"}).encode(),
})

# Broadcast a cancellation signal to all agents in the swarm
client.call_tool("aegis.broadcast_message", {
    "swarm_id": swarm_id,
    "payload": b'{"command": "stop"}',
})
```

Messages are delivered in send order between the same sender-receiver pair. There is no ordering guarantee across different sender-receiver pairs.
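Because ordering holds only within a single sender-receiver pair, a receiver that merges messages from several senders may want to detect gaps or reordering. One application-level convention is to tag each payload with a per-recipient sequence number; the `seq` field below is that convention, not part of any AEGIS payload format.

```python
import json
from collections import defaultdict

class SequencedSender:
    """Attach a per-recipient sequence number to outgoing payloads."""
    def __init__(self):
        self._next_seq = defaultdict(int)

    def wrap(self, to_agent_id: str, body: dict) -> bytes:
        seq = self._next_seq[to_agent_id]
        self._next_seq[to_agent_id] += 1
        return json.dumps({"seq": seq, "body": body}).encode()

sender = SequencedSender()
payload = sender.wrap("child-1", {"command": "analyze"})
# pass `payload` to aegis.send_message as usual; the receiver checks
# that seq values from each sender increase without gaps
```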
Security Context Ceiling
A child agent's security_context must be a subset of the parent's. If your parent runs with security_context: default and you attempt to spawn a child with security_context: privileged, the spawn will be rejected:
```
SpawnError: ContextExceedsParentCeiling (requested=privileged, parent=default)
```

Phase 1 (current): Enforcement uses name-based comparison: the child's security_context name must exactly match the parent's. A child requesting a different security context name is rejected. Phase 2 will introduce capability-lattice comparison, enabling strict-subset requests under different context names.
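Under the Phase 1 rule, the check reduces to name equality, so a parent can pre-validate a child manifest before attempting the spawn. The helper below is hypothetical (not an AEGIS API), shown only to make the rule concrete:

```python
def check_spawn_context(parent_ctx: str, child_ctx: str) -> None:
    """Phase 1 rule: the child's security_context name must equal the parent's.

    Hypothetical local pre-check mirroring the orchestrator's rejection.
    """
    if child_ctx != parent_ctx:
        raise ValueError(
            f"ContextExceedsParentCeiling (requested={child_ctx}, parent={parent_ctx})"
        )

check_spawn_context("default", "default")  # same name: allowed
```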
Always use the same or more restrictive security context for child agents.
Depth Limit
AEGIS enforces a maximum swarm depth of 3. A root agent (depth 0) can spawn depth-1 children. Depth-1 children can spawn depth-2 children. Depth-2 children can spawn depth-3 children. Attempting to spawn from depth-3 returns SpawnError::MaxDepthExceeded.
Design your decomposition to stay within this limit. If you need deeper hierarchies, flatten the decomposition into wider (more parallel) rather than deeper (more sequential) structures.
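A parent can guard against the limit before attempting a spawn. The sketch below assumes the agent can learn its own depth at runtime; the AEGIS_SWARM_DEPTH environment variable used here is a hypothetical mechanism, so check how your runtime actually exposes depth.

```python
import os

MAX_SWARM_DEPTH = 3  # depths 0..2 may spawn; depth 3 is the floor of the tree

def can_spawn(current_depth: int) -> bool:
    """True if an agent at this depth is still allowed to spawn children."""
    return current_depth < MAX_SWARM_DEPTH

# AEGIS_SWARM_DEPTH is a hypothetical variable, named here for illustration
depth = int(os.environ.get("AEGIS_SWARM_DEPTH", "0"))
if not can_spawn(depth):
    raise RuntimeError("at max swarm depth; flatten the decomposition instead")
```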
Monitoring Swarms
Swarm-specific listing and cancellation are not exposed as dedicated CLI commands in the current release. Use execution logs/streams for observability and POST /v1/executions/{execution_id}/cancel for cancellation. Cancelling a parent execution automatically cascades to all children via SwarmCancellationPort.
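For example, the cancellation endpoint can be called with the standard library alone; the base URL below is deployment-specific and the execution ID is illustrative.

```python
import urllib.request

def cancel_request(base_url: str, execution_id: str) -> urllib.request.Request:
    """Build a POST to /v1/executions/{execution_id}/cancel.

    Cancelling a parent execution cascades to all of its children.
    """
    url = f"{base_url.rstrip('/')}/v1/executions/{execution_id}/cancel"
    return urllib.request.Request(url, method="POST")

req = cancel_request("http://localhost:8080", "exec-42")
# urllib.request.urlopen(req)  # uncomment to send; requires a reachable orchestrator
```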
Six SwarmEvent variants are published to the event bus and available via gRPC streaming:
- SwarmCreated: emitted when a new swarm is initialized
- ChildSpawned: emitted for each child execution added to the swarm
- SwarmDissolved: emitted when the swarm enters the Dissolved state (includes reason and dissolved_at)
- LockAcquired: emitted when a resource lock is granted (includes the execution_id of the holder)
- LockReleased: emitted when a resource lock is released or expires
- MessageBroadcast: emitted for broadcast messages (includes recipient_count)
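A monitoring consumer can dispatch on the variant name. The sketch below operates on plain dicts carrying a type field, which is an assumption about the decoded event shape rather than the actual gRPC message schema:

```python
from collections import Counter

SWARM_EVENT_TYPES = {
    "SwarmCreated", "ChildSpawned", "SwarmDissolved",
    "LockAcquired", "LockReleased", "MessageBroadcast",
}

def summarize(events):
    """Count swarm events by variant, ignoring unknown event types."""
    return Counter(e["type"] for e in events if e.get("type") in SWARM_EVENT_TYPES)

counts = summarize([
    {"type": "SwarmCreated"},
    {"type": "ChildSpawned"},
    {"type": "ChildSpawned"},
    {"type": "Unrelated"},
])
```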