Multi-Node Deployment
Distribute AEGIS across multiple machines for production deployments using the gRPC cluster protocol.
AEGIS supports distributed deployments across multiple machines. Each machine runs one aegis daemon process configured with two orthogonal roles:
- `spec.node.type` — the node's deployment role (whether it hosts agents vs. serves as an API entry point)
- `spec.cluster.role` — the node's cluster coordination role (whether it controls the cluster vs. accepts forwarded work)
These two settings are independent and can be mixed freely. A node can be spec.node.type: orchestrator and spec.cluster.role: worker at the same time.
Node Types (spec.node.type)
| Type | Role |
|---|---|
| `orchestrator` | Hosts the management plane: API server, workflow engine, Temporal client, Cortex connection, secrets manager. Does not run agent containers locally. |
| `edge` | Executes agent containers (Docker runtime). Does not expose the public API. Connects to a controller node for task assignment. |
| `hybrid` | Combines both roles on a single machine. The default for development and small deployments. |
Cluster Protocol (spec.cluster.role)
AEGIS uses a dedicated NodeClusterService gRPC protocol on port 50056 for inter-node coordination. This is separate from the agent gRPC API on port 50051.
Cluster Roles
| Role | Description |
|---|---|
| `controller` | Manages the NodeCluster aggregate. Routes executions to workers. Issues NodeSecurityTokens to attested workers. Exposes NodeClusterService on port 50056. |
| `worker` | Attests to a controller on startup. Advertises NodeCapabilityAdvertisement. Accepts forwarded executions via the ForwardExecution RPC. |
| `hybrid` | Controller and worker in one process. Used for standalone single-node and development deployments. No separate controller endpoint needed. |
The Two-Tier Model
```
spec.node.type    → determines what the node DOES for agents
                    (run the API, run containers, or both)

spec.cluster.role → determines how nodes COORDINATE with each other
                    (who routes work, who accepts work)
```

A common production pairing: the API-facing machine is `type: orchestrator`, `cluster.role: controller`; each GPU machine is `type: edge`, `cluster.role: worker`. The controller receives execution requests from clients and routes them to the workers best suited to run them.
NodeSecurityToken
After a worker successfully attests to a controller, the controller issues a NodeSecurityToken: an RS256 JWT signed by the controller's OpenBao Transit key. This token:
- Has a 1-hour TTL and is auto-refreshed before expiry
- Is analogous to the agent `SecurityToken` from the SEAL protocol, but scoped to node identity
- Contains `node_id`, `role`, `capabilities_hash`, `iat`, and `exp`
- Is included in every subsequent cluster RPC, wrapped in a `SealNodeEnvelope`
All inter-node gRPC calls use this envelope:
```
SealNodeEnvelope {
  node_security_token: <NodeSecurityToken JWT>,
  signature:           <Ed25519 signature over payload>,
  payload:             <serialized RPC request bytes>
}
```

Typical Topologies
Development / Single Node
```
┌──────────────────────────────┐
│ Hybrid Node                  │   spec.node.type: hybrid
│                              │   spec.cluster.role: hybrid
│                              │
│ API Server (gRPC + REST)     │
│ Scheduler                    │
│ Docker (agent containers)    │
│ NodeClusterService :50056    │   ← routes to itself
└──────────────────────────────┘
```

Use `type: hybrid` and `cluster.role: hybrid` for local development and small deployments. This is the default in `aegis-config.yaml`.
Production — Separated Control / Data Plane
```
┌────────────────────────────────────────┐
│ Orchestrator Node                      │   spec.node.type: orchestrator
│ (1–3 instances)                        │
│                                        │
│ API → gRPC :50051 → REST :8088         │
│ Workflow engine                        │
│ Temporal client                        │
│ Secrets (OpenBao)                      │
└──────────────┬─────────────────────────┘
               │ (internal network)
        ┌──────┴──────┐
        │             │
┌───────▼───┐   ┌─────▼─────┐
│  Edge #1  │   │  Edge #2  │   spec.node.type: edge
│  Docker   │   │  Docker   │
│  agents   │   │  agents   │
└───────────┘   └───────────┘
```

Edge nodes handle the compute-intensive agent workloads. Adding more edge nodes scales execution throughput without affecting the orchestrator.
Production — Cluster Protocol (Controller + Workers)
This topology adds the NodeClusterService layer for authenticated, capability-aware execution routing.
```
┌──────────────────────────────────────────────────────────────────────┐
│ Controller Node                                                      │
│   spec.node.type: orchestrator                                       │
│   spec.cluster.role: controller                                      │
│                                                                      │
│ gRPC :50051 ← agent API (external clients)                           │
│ gRPC :50056 ← NodeClusterService (workers only)                      │
└──────────────────────────────┬───────────────────────────────────────┘
                               │ SealNodeEnvelope over gRPC :50056
               ┌───────────────┼───────────────┐
               ▼               ▼               ▼
         ┌──────────┐    ┌──────────┐    ┌──────────┐
         │ Worker 1 │    │ Worker 2 │    │ Worker 3 │
         │  :50051  │    │  :50051  │    │  :50051  │
         └──────────┘    └──────────┘    └──────────┘
          spec.node.type: edge
          spec.cluster.role: worker
```

When a client submits an execution to the controller, the NodeRouter selects the best available worker (Phase 1: round-robin among healthy workers with matching tags; Phase 2: load-aware scoring). The controller forwards the execution to the selected worker via the ForwardExecution server-streaming RPC on port 50051, then streams ExecutionEvents back to the original client.
The ClusterAwareExecutionService handles this routing and forwarding transparently — it wraps the local ExecutionService and routes to workers when cluster mode is enabled, falling back to local execution when no workers are available.
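The Phase 1 selection rule described above — round-robin over healthy workers whose tags cover the execution's target tags, with a fallback when no worker matches — can be sketched as follows. The class and field names here are illustrative, not the actual AEGIS types.

```python
from dataclasses import dataclass

@dataclass
class WorkerPeer:
    node_id: str
    tags: set
    healthy: bool = True

class RoundRobinRouter:
    """Hypothetical sketch of Phase 1 routing: round-robin with tag filtering."""
    def __init__(self, peers):
        self.peers = peers
        self._next = 0

    def select(self, target_tags: set):
        """Return the next healthy worker whose tags include all target tags."""
        candidates = [p for p in self.peers
                      if p.healthy and target_tags <= p.tags]
        if not candidates:
            return None  # caller may fall back to local execution
        peer = candidates[self._next % len(candidates)]
        self._next += 1
        return peer

router = RoundRobinRouter([
    WorkerPeer("worker-1", {"gpu", "production"}),
    WorkerPeer("worker-2", {"production"}),
    WorkerPeer("worker-3", {"gpu", "production"}, healthy=False),
])
print(router.select({"gpu"}).node_id)  # worker-1 (worker-3 is unhealthy)
```

Returning `None` when no worker matches mirrors the fallback-to-local behavior: the wrapping service executes locally rather than failing the request.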
Configuring Nodes
Orchestrator / Controller Node
```yaml
apiVersion: 100monkeys.ai/v1
kind: NodeConfig
metadata:
  name: "orchestrator-primary"
spec:
  node:
    id: "orch-node-1"
    type: "orchestrator"
    region: "us-west-2"
    tags: ["primary"]
  # Orchestrator nodes must specify all external dependencies
  llm_providers: [...]
  storage: { backend: "seaweedfs", ... }
  # ...
```

Edge Node
```yaml
apiVersion: 100monkeys.ai/v1
kind: NodeConfig
metadata:
  name: "edge-worker-1"
spec:
  node:
    id: "edge-node-1"
    type: "edge"
    region: "us-west-2"
    tags: ["gpu", "large-memory"]
    resources:
      cpu_cores: 32
      memory_gb: 128
      disk_gb: 500
      gpu: true
  runtime:
    # Point the edge node at the orchestrator for callbacks
    orchestrator_url: "https://orchestrator.internal:8080"
    docker_network_mode: "aegis-net"
    nfs_server_host: "127.0.0.1"
  # Edge nodes do not need llm_providers or storage config —
  # they delegate those duties to the orchestrator
```

spec.node.resources
Declare available hardware so the scheduler can make placement decisions:
| Field | Type | Description |
|---|---|---|
| `cpu_cores` | integer | CPU cores available to agent containers |
| `memory_gb` | integer | RAM in GB available to agent containers |
| `disk_gb` | integer | Disk space in GB |
| `gpu` | boolean | Whether a GPU is available |
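A placement decision against these declared fields reduces to a feasibility check. The following sketch illustrates the idea; the requirement shape and function name are assumptions for illustration, not the actual scheduler API.

```python
def fits(node_resources: dict, requirements: dict) -> bool:
    """True if the node's declared hardware can satisfy the request."""
    # A GPU requirement can only be met by a gpu: true node.
    if requirements.get("gpu") and not node_resources.get("gpu", False):
        return False
    # Numeric fields must each be at least the requested amount.
    for key in ("cpu_cores", "memory_gb", "disk_gb"):
        if requirements.get(key, 0) > node_resources.get(key, 0):
            return False
    return True

edge_node = {"cpu_cores": 32, "memory_gb": 128, "disk_gb": 500, "gpu": True}
print(fits(edge_node, {"cpu_cores": 8, "memory_gb": 64, "gpu": True}))  # True
print(fits(edge_node, {"cpu_cores": 64}))                               # False
```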
spec.node.tags
Tags are used for execution target matching. An agent manifest can specify spec.execution.target_tags to pin executions to nodes with matching tags:
```yaml
# In agent manifest
spec:
  execution:
    target_tags: ["gpu"]   # Only schedule on nodes tagged "gpu"
```

Cluster Configuration
The spec.cluster block configures the gRPC cluster protocol. All nodes require a persistent Ed25519 keypair; if keypair_path does not exist on disk, AEGIS auto-generates one on first run.
Controller Node
```yaml
spec:
  node:
    id: "ctrl-001"
    type: orchestrator
    region: us-east-1
    tags: [controller, production]
  network:
    port: 8080
    grpc_port: 50051
  cluster:
    role: controller
    node_id: "ctrl-001"          # must match spec.node.id
    keypair_path: /etc/aegis/node-keypair.pem
    cluster_grpc_port: 50056
```

The controller exposes NodeClusterService on `cluster_grpc_port` (default 50056). Only attested workers may call RPCs on this port.
Worker Node
```yaml
spec:
  node:
    id: "worker-gpu-001"
    type: edge
    region: us-east-1
    tags: [gpu, production]
    resources:
      cpu_cores: 16
      memory_gb: 64
      gpu: true
  cluster:
    role: worker
    node_id: "worker-gpu-001"    # must match spec.node.id
    controller_endpoint: "https://ctrl-001.internal:50056"
    keypair_path: /etc/aegis/node-keypair.pem
    heartbeat_interval_seconds: 30
```

Workers contact the controller at `controller_endpoint` on startup to perform attestation. The `heartbeat_interval_seconds` setting (default 30) controls how frequently the worker sends a Heartbeat RPC carrying its current load and capabilities.
Hybrid Node (Standalone / Development)
```yaml
spec:
  node:
    id: "dev-local"
    type: hybrid
  cluster:
    role: hybrid                 # controller + worker in one process
    node_id: "dev-local"
    keypair_path: /etc/aegis/node-keypair.pem
```

No `controller_endpoint` is needed in hybrid mode — the process routes executions to itself.
Node Attestation Flow
Before a worker can receive forwarded executions, it must prove its identity to the controller. This happens automatically on startup via the following sequence:
```
Worker                                           Controller
  │                                                 │
  │ 1. Load Ed25519 keypair from                    │
  │    spec.cluster.keypair_path                    │
  │    (auto-generated if absent)                   │
  │                                                 │
  │──── AttestNode(node_id, public_key) ───────────►│
  │                                                 │ 2. Generates cryptographic challenge
  │◄─── ChallengeNode(challenge_bytes) ─────────────│
  │                                                 │
  │ 3. Signs challenge_bytes with                   │
  │    Ed25519 private key                          │
  │                                                 │
  │──── ChallengeNode(signature) ──────────────────►│
  │                                                 │ 4. Verifies signature against
  │                                                 │    registered public key
  │◄─── NodeSecurityToken (RS256 JWT) ──────────────│ 5. Issues token (1-hour TTL)
  │                                                 │    via OpenBao Transit signing
  │ 6. Wraps all future RPCs in                     │
  │    SealNodeEnvelope {                           │
  │      node_security_token,                       │
  │      signature,                                 │
  │      payload                                    │
  │    }                                            │
  │                                                 │
  │──── RegisterNode(capabilities) ────────────────►│ 7. Advertises NodeCapabilityAdvertisement
  │                                                 │    { gpu_count, vram_gb, cpu_cores,
  │                                                 │      available_memory_gb,
  │                                                 │      supported_runtimes, tags }
  │◄─── Registered ─────────────────────────────────│
  │                                                 │
  │──── Heartbeat (every 30s) ─────────────────────►│ 8. Keeps NodePeer status Active
  │◄─── NodeCommand (optional) ─────────────────────│    Response may carry commands:
  │                                                 │    drain, config push, shutdown
```

The NodeSecurityToken is automatically refreshed before its 1-hour expiry via the ongoing heartbeat cycle. No manual token management is required.
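The challenge-response exchange above can be sketched at the protocol level. This is an illustrative model only: the signature scheme here is a keyed-hash stand-in for Ed25519 (which is not in the Python standard library), the "public key" is therefore the same bytes as the signing key, and the token is a plain dict standing in for the RS256 JWT. None of these names are AEGIS APIs.

```python
import hashlib
import hmac
import os
from datetime import datetime, timedelta, timezone

class Controller:
    """Controller side of the handshake (illustrative stand-in)."""
    def __init__(self):
        self.registered_keys = {}   # node_id -> verification key
        self.pending = {}           # node_id -> outstanding challenge

    def attest_node(self, node_id, public_key):
        """Steps 1-2: record the node's key and issue a random challenge."""
        self.registered_keys[node_id] = public_key
        challenge = os.urandom(32)
        self.pending[node_id] = challenge
        return challenge

    def challenge_node(self, node_id, signature):
        """Steps 4-5: verify the signature; on success issue a 1-hour token."""
        challenge = self.pending.pop(node_id)
        expected = hmac.new(self.registered_keys[node_id], challenge,
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, signature):
            raise PermissionError("InvalidSignature")
        expiry = datetime.now(timezone.utc) + timedelta(hours=1)
        return {"node_id": node_id, "exp": expiry.isoformat()}  # stand-in JWT

# Worker side: step 3 — sign the challenge with the persistent key.
key = os.urandom(32)                      # stands in for the Ed25519 keypair
controller = Controller()
challenge = controller.attest_node("worker-gpu-001", key)
signature = hmac.new(key, challenge, hashlib.sha256).digest()
token = controller.challenge_node("worker-gpu-001", signature)
print(token["node_id"])  # worker-gpu-001
```

Note how a key mismatch at step 4 surfaces as an `InvalidSignature`-style failure, matching the troubleshooting table later in this page.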
NodePeer status transitions:
- `Active` — node is registered and sending heartbeats within the expected interval
- `Draining` — controller has issued a drain command; no new executions are routed to this node
- `Unhealthy` — no heartbeat received within 3× the expected interval
NodeClusterService RPC Reference
The NodeClusterService exposes 10 RPCs on port 50056. All RPCs except AttestNode and ChallengeNode require a valid NodeSecurityToken in the `authorization` metadata key and a SealNodeEnvelope wrapping the payload. AttestNode is unauthenticated (it is the first call a new worker makes).
| RPC | Direction | Description |
|---|---|---|
| `AttestNode` | Worker → Controller | Initiate node attestation; send public key |
| `ChallengeNode` | Bidirectional | Controller issues challenge; worker returns Ed25519 signature |
| `RegisterNode` | Worker → Controller | Register with NodeCapabilityAdvertisement after attestation |
| `Heartbeat` | Worker → Controller | Periodic status update (default 30s); response may carry NodeCommands |
| `DeregisterNode` | Worker → Controller | Graceful deregistration before shutdown |
| `RouteExecution` | Controller-internal | Returns ExecutionRoute { target_node_id, worker_grpc_address } |
| `ForwardExecution` | Controller → Worker (server-streaming) | Execute an agent on this worker; streams ExecutionEvents back |
| `SyncConfig` | Worker → Controller | Worker requests current config from controller |
| `PushConfig` | Controller → Worker | Controller pushes updated config to a specific worker |
| `ListPeers` | Any → Controller | List all registered NodePeers with their status and capabilities |

Execution forwarding end-to-end flow:
1. Client sends an execution request to the controller.
2. `ClusterAwareExecutionService` calls `RouteExecutionUseCase` to select a worker based on health, tags, and availability.
3. Controller connects to the selected worker via `NodeClusterClient::connect_to_worker()`.
4. Controller calls `forward_execution()` with the original `execution_id` preserved for end-to-end tracing correlation.
5. Worker runs the execution locally via `start_execution_with_id()`, importing the upstream execution ID rather than generating a new one.
6. Execution events stream back to the controller via the gRPC server-streaming response, which relays them to the original client.
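The relay in step 6 can be modelled with generators: the controller consumes the worker's server-streamed events and yields each one to the original client, with the same execution ID flowing end to end. Event shapes and function names here are assumptions for illustration, not the actual AEGIS gRPC code.

```python
def worker_execution_stream(execution_id):
    """Stands in for the worker side of ForwardExecution (server-streaming)."""
    yield {"execution_id": execution_id, "type": "Started"}
    yield {"execution_id": execution_id, "type": "Stdout", "data": "hello"}
    yield {"execution_id": execution_id, "type": "Completed", "exit_code": 0}

def relay_to_client(execution_id, upstream):
    """Controller side: relay events, preserving the original execution_id."""
    for event in upstream(execution_id):
        # The unchanged execution_id is what makes end-to-end tracing work.
        assert event["execution_id"] == execution_id
        yield event

events = list(relay_to_client("exec-123", worker_execution_stream))
print([e["type"] for e in events])  # ['Started', 'Stdout', 'Completed']
```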
Node Registration
Node registration is performed via gRPC using the NodeClusterService protocol described above, not via HTTP. The original HTTP-based NodeIdentity registration is used in legacy single-node mode only and is not involved in cluster coordination.
The registration sequence after successful attestation:
1. Worker calls `RegisterNode` carrying its `NodeCapabilityAdvertisement`
2. Controller records the `NodePeer` in the `NodeCluster` aggregate
3. Worker enters the heartbeat loop (`Heartbeat` every `heartbeat_interval_seconds`)
4. Controller updates `NodePeer.last_heartbeat_at` and `NodePeer.status` on each heartbeat
5. On graceful shutdown, worker calls `DeregisterNode`; controller marks `NodePeer` as removed
Networking Requirements
| Connection | Protocol | Port | Direction |
|---|---|---|---|
| Client → Controller | HTTP REST | 8088 | Inbound to controller |
| Client → Controller | gRPC (agent API) | 50051 | Inbound to controller |
| Worker → Controller | gRPC (NodeClusterService) | 50056 | Outbound from worker |
| Controller → Worker | gRPC (ForwardExecution) | 50051 | Outbound from controller |
| Orchestrator → Temporal | gRPC (workflow engine) | 7233 | Outbound from orchestrator |
| Orchestrator → SeaweedFS | HTTP (storage filer) | 8888 | Outbound from orchestrator |
| Edge → SeaweedFS | HTTP (volume data access) | 8888 | Outbound from edge |
| Edge agent containers → Edge daemon | NFS (volume mounts via NFS Gateway) | 2049 | Internal |
Firewall rules must allow:
- Controller inbound on 8080, 50051, and 50056
- Workers inbound on 50051 (for `ForwardExecution` streaming from controller)
- Workers outbound to controller on 50056
Port 50056 should not be exposed to external clients — it is for inter-node cluster coordination only. Use network-level ACLs to restrict access to known worker IPs.
High Availability
Phase 1 (Current): Single Controller + Multiple Workers
Deploy one controller node and N worker nodes. The controller is the coordination point; workers are horizontally scalable. Multiple worker nodes provide execution capacity and fault tolerance for agent workloads.
```
          ┌──────────────────────┐
          │    Load Balancer     │  (agent API traffic only)
          └──────────┬───────────┘
                     │ :8080 / :50051
          ┌──────────▼───────────┐
          │   Controller Node    │  spec.cluster.role: controller
          │  (single instance)   │
          └──────────┬───────────┘
                     │ port :50056
   ┌─────────────────┼─────────────────┐
   ▼                 ▼                 ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   Worker 1   │ │   Worker 2   │ │   Worker 3   │
│    :50051    │ │    :50051    │ │    :50051    │
└──────────────┘ └──────────────┘ └──────────────┘
```

For controller resilience in Phase 1, run the controller with PostgreSQL as the persistence backend; on controller restart, workers re-attest automatically via the AttestNode → ChallengeNode → RegisterNode flow. The controller outage window only affects routing; in-flight executions on workers complete normally.
Phase 2 (Planned): Controller HA with Raft
Phase 2 will introduce a Raft-based controller consensus layer, allowing N controller replicas with automatic leader election. Workers will use a discovery endpoint returned during attestation to locate the current leader. This is not yet implemented.
For all orchestrator instances in either phase, share the same PostgreSQL database for consistent execution state. Redis is optional for session-level caching.
```
             ┌─────────────────┐
             │  Load Balancer  │
             └────────┬────────┘
      ┌───────────────┼────────────────┐
┌─────▼───────┐ ┌─────▼──────┐ ┌───────▼────┐
│Controller 1 │ │Controller 2│ │Controller 3│  (Phase 2)
│  (leader)   │ │ (follower) │ │ (follower) │
└──────┬──────┘ └────────────┘ └────────────┘
       │ Raft consensus
┌──────▼──────────────┐
│     PostgreSQL      │
│   (shared state)    │
└─────────────────────┘
```

Day-2 Operations: mTLS Certificate Rotation
In production environments, AEGIS requires mutual TLS (mTLS) for all inter-node communication on port 50056. This ensures that only nodes possessing a certificate signed by the platform CA can even attempt the SEAL attestation flow.
Rotating the Platform CA
If the root platform CA is rotated, you must perform a multi-step rollout to avoid cluster-wide disconnects:
1. Distribute New CA: Update `spec.cluster.tls.ca_cert` on all nodes to include BOTH the old and new CA certificates in a single PEM bundle. Restart all nodes.
2. Issue New Node Certs: Generate new node certificates signed by the new CA.
3. Update Node Certs: Update `spec.cluster.tls.cert_path` and `key_path` on each node one-by-one and restart.
4. Remove Old CA: Once all nodes are using certificates from the new CA, remove the old CA from the `ca_cert` bundle.
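Step 1 amounts to shipping a PEM bundle containing both CA certificates, and PEM files concatenate cleanly. A minimal sketch; the certificate contents and bundle path below are placeholders, not real keys or AEGIS defaults.

```python
import os
import tempfile
from pathlib import Path

def write_ca_bundle(old_ca_pem: str, new_ca_pem: str, bundle_path: str) -> None:
    """Concatenate both CAs so chains signed by either one validate."""
    bundle = old_ca_pem.rstrip() + "\n" + new_ca_pem.rstrip() + "\n"
    Path(bundle_path).write_text(bundle)

old_ca = "-----BEGIN CERTIFICATE-----\n...old...\n-----END CERTIFICATE-----"
new_ca = "-----BEGIN CERTIFICATE-----\n...new...\n-----END CERTIFICATE-----"
bundle_path = os.path.join(tempfile.gettempdir(), "aegis-ca-bundle.pem")
write_ca_bundle(old_ca, new_ca, bundle_path)
print(Path(bundle_path).read_text().count("BEGIN CERTIFICATE"))  # 2
```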
Zero-Downtime Node Certificate Rotation
Individual node certificates (e.g., node.crt) can be rotated without cluster downtime as long as the CA remains valid. The aegis daemon watches the certificate files on disk and reloads them automatically upon change (when using standard spec.cluster.tls configuration).
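The reload-on-change behavior can be approximated with mtime polling. The real daemon's file-watching mechanism is not specified here; `CertReloader` and its fields are illustrative names only, a sketch of the idea rather than the implementation.

```python
import os
import tempfile

class CertReloader:
    """Call check() periodically; reload when the cert file's mtime advances."""
    def __init__(self, cert_path: str):
        self.cert_path = cert_path
        self._mtime = os.stat(cert_path).st_mtime
        self.reload_count = 0

    def check(self) -> bool:
        mtime = os.stat(self.cert_path).st_mtime
        if mtime != self._mtime:
            self._mtime = mtime
            self.reload_count += 1   # a real reload would rebuild the TLS context
            return True
        return False

cert_path = os.path.join(tempfile.gettempdir(), "demo-node.crt")
with open(cert_path, "w") as f:
    f.write("cert v1")
reloader = CertReloader(cert_path)
print(reloader.check())              # False: nothing changed yet

# Simulate replacing the certificate and bumping the mtime.
with open(cert_path, "w") as f:
    f.write("cert v2")
st = os.stat(cert_path)
os.utime(cert_path, (st.st_atime, st.st_mtime + 5))
print(reloader.check())              # True: new certificate picked up
```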
```shell
# Example: Rotating a worker certificate
cp new-node.crt /etc/aegis/certs/node.crt
cp new-node.key /etc/aegis/certs/node.key
# AEGIS will detect the change and use the new cert for the next gRPC connection
```

Troubleshooting Cluster Operations
NodeSecurityToken Attestation Failures
If a worker fails to join the cluster, check the controller logs for the following common errors:
| Error | Root Cause | Resolution |
|---|---|---|
| `TokenExpired` | Clock Skew: The worker's system clock is significantly ahead of the controller's. | Synchronize clocks on all nodes using NTP (e.g., chronyd). |
| `NonceReplay` | Replay Attack / Rapid Restart: The worker sent an attestation request with a nonce that was already used in the last 5 minutes. | Wait 5 minutes before restarting the worker, or ensure the worker generates a fresh UUID for the nonce field on every attempt. |
| `InvalidSignature` | Keypair Mismatch: The worker's Ed25519 signature does not match its registered public key. | Verify that spec.cluster.keypair_path points to the same persistent key that was used during the initial RegisterNode call. |
| `UntrustedCA` | mTLS Failure: The certificate presented by the node is not signed by the CA in spec.cluster.tls.ca_cert. | Verify that all nodes share the same platform CA bundle. |
Diagnostic Commands
Use the aegis CLI on the controller node to inspect cluster health:
```shell
# List all registered peers
aegis node peers

# Check end-to-end cluster health from the local node config
aegis status --cluster
```

Health Endpoints
Each AEGIS daemon exposes HTTP health endpoints on the configured REST port (default 8080). These endpoints are useful for load balancer health checks, Kubernetes probes, and manual diagnostics.
| Endpoint | Method | Description |
|---|---|---|
| `/health/live` | GET | Liveness probe. Returns 200 OK if the process is running. Does not check downstream dependencies. |
| `/health/ready` | GET | Readiness probe. Returns 200 OK only when all critical subsystems (database, Temporal, event bus) are initialised. Returns 503 Service Unavailable during startup or if a dependency becomes unreachable. |
| `/health` | GET | Composite health check. Returns 200 OK with a JSON body containing uptime_seconds and subsystem status. Used by the aegis status CLI command and the daemon client library. |
In a clustered deployment, the controller node's HealthSweeper background task also monitors worker health by evaluating heartbeat freshness. If a worker misses 3 consecutive heartbeat intervals, the HealthSweeper marks its NodePeer status as Unhealthy and emits a ClusterEvent::NodeUnhealthy event.
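The HealthSweeper rule — mark a worker Unhealthy once its last heartbeat is older than three intervals — can be sketched as below. The function and peer-map shapes are illustrative, not the actual AEGIS types.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(seconds=30)
MISSED_INTERVALS_THRESHOLD = 3

def sweep(peers: dict, now: datetime) -> list:
    """Return node_ids whose last heartbeat is older than 3 intervals."""
    unhealthy = []
    for node_id, last_heartbeat_at in peers.items():
        if now - last_heartbeat_at > MISSED_INTERVALS_THRESHOLD * HEARTBEAT_INTERVAL:
            unhealthy.append(node_id)   # would emit ClusterEvent::NodeUnhealthy
    return unhealthy

now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
peers = {
    "worker-1": now - timedelta(seconds=20),   # fresh
    "worker-2": now - timedelta(seconds=95),   # > 90s: 3 intervals missed
}
print(sweep(peers, now))  # ['worker-2']
```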
Remote Volume Access
When an agent executing on Worker Node A needs to access a volume that resides on Worker Node B, AEGIS transparently proxies the file operation via the RemoteStorageService gRPC protocol.
How It Works
1. The `StorageRouter` on Node A inspects the file path. Paths prefixed with `/aegis/seal/{node_id}/{volume_id}/...` are routed to the `SealStorageProvider`.
2. `SealStorageProvider` extracts the target `node_id` from the path, looks up the node's gRPC address in the `NodeClusterRepository`, and establishes (or reuses) a gRPC channel.
3. Each RPC call is wrapped in a `SealNodeEnvelope` carrying the node's Ed25519 signature and `NodeSecurityToken`.
4. The target node's `RemoteStorageServiceHandler` verifies the envelope, performs authoritative `AegisFSAL` access checks, and delegates to its local `StorageProvider`.
5. The response is returned to the requesting node and surfaced through the standard `StorageProvider` trait.
Supported Operations
All POSIX-style file operations are supported over the wire:
- `CreateDirectory`, `DeleteDirectory`
- `SetQuota`, `GetUsage`
- `OpenFile`, `ReadAt`, `WriteAt`, `CloseFile`
- `Stat`, `Readdir`
- `CreateFile`, `DeleteFile`, `Rename`
- `HealthCheck`
Path Format
Remote volume paths follow this convention:
```
/aegis/seal/{target_node_id}/{volume_id}/{path/within/volume}
```

The StorageRouter detects the /aegis/seal/ prefix and dispatches to SealStorageProvider. All other /aegis/volumes/... paths route to the default backend (SeaweedFS/OpenDAL), and bare absolute paths (/opt/data/...) route to the local host filesystem.
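The prefix dispatch just described can be sketched as a small parser. The return shapes are illustrative; AEGIS's actual StorageRouter is a Rust component, not a Python API.

```python
def route_path(path: str):
    """Classify a path the way the StorageRouter's prefix rules would."""
    if path.startswith("/aegis/seal/"):
        # /aegis/seal/{target_node_id}/{volume_id}/{path/within/volume}
        parts = path[len("/aegis/seal/"):].split("/", 2)
        if len(parts) < 3:
            raise ValueError("seal path needs node_id, volume_id, and a sub-path")
        node_id, volume_id, subpath = parts
        return ("seal", node_id, volume_id, subpath)
    if path.startswith("/aegis/volumes/"):
        return ("default_backend", path)   # SeaweedFS/OpenDAL
    return ("local_host", path)            # bare absolute path

print(route_path("/aegis/seal/worker-2/vol-9/data/input.csv"))
# ('seal', 'worker-2', 'vol-9', 'data/input.csv')
print(route_path("/opt/data/cache.bin")[0])  # local_host
```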
Configuration Hierarchy
AEGIS supports hierarchical configuration layering with the following precedence (lowest to highest):
| Scope | Description | Example |
|---|---|---|
| Global | Cluster-wide defaults applied to all nodes | Default LLM provider settings |
| Tenant | Per-tenant overrides (keyed by TenantSlug) | Tenant-specific rate limits |
| Node | Per-node overrides (keyed by NodeId) | Node-specific runtime settings |
| Local | On-disk aegis-config.yaml loaded at startup | Hardware-specific configuration |
When the controller pushes a configuration update via the PushConfig heartbeat command, the worker merges the layers in precedence order. Each layer is stored as a ConfigSnapshot value object containing the scope, a JSON payload, and a version hash. The merged result is a MergedConfig that the worker applies atomically.
Workers can also pull their current effective configuration on demand using the SyncConfig RPC. This is useful after a restart when the worker needs to catch up on configuration changes that occurred while it was offline.
After merging configuration layers, the worker runs EffectiveConfigValidator to verify that the merged result contains all required sections (runtime, storage, llm) before accepting work. If validation fails, the worker logs a fatal error and refuses to enter the heartbeat loop — preventing a misconfigured node from silently accepting executions it cannot fulfil.
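The precedence merge and required-section check above can be sketched as a recursive dict overlay followed by validation in the spirit of EffectiveConfigValidator. The layer shapes and section contents below are illustrative, not AEGIS's actual ConfigSnapshot format.

```python
REQUIRED_SECTIONS = ("runtime", "storage", "llm")

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` onto `base` (override wins)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def merge_layers(global_cfg, tenant_cfg, node_cfg, local_cfg) -> dict:
    merged = {}
    for layer in (global_cfg, tenant_cfg, node_cfg, local_cfg):  # lowest → highest
        merged = deep_merge(merged, layer)
    missing = [s for s in REQUIRED_SECTIONS if s not in merged]
    if missing:
        raise ValueError(f"invalid effective config, missing: {missing}")
    return merged

cfg = merge_layers(
    {"llm": {"provider": "default"}, "storage": {"backend": "seaweedfs"}},
    {"llm": {"rate_limit": 100}},
    {"runtime": {"docker_network_mode": "aegis-net"}},
    {"runtime": {"nfs_server_host": "127.0.0.1"}},
)
print(sorted(cfg["runtime"]))  # ['docker_network_mode', 'nfs_server_host']
```

The `ValueError` path mirrors the fail-fast behavior: a node missing a required section refuses work instead of silently accepting executions it cannot fulfil.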
Worker Lifecycle
When a daemon starts with spec.cluster.role: worker or spec.cluster.role: hybrid and a controller.endpoint is configured, the daemon automatically spawns a WorkerLifecycle background task. This task manages the full lifecycle of the worker's relationship with the cluster controller.
Lifecycle Stages
```
┌─────────┐     ┌──────────┐     ┌──────────┐     ┌───────────┐     ┌────────────┐
│ Connect │────▶│  Attest  │────▶│ Register │────▶│ Heartbeat │────▶│ Deregister │
└─────────┘     └──────────┘     └──────────┘     │   Loop    │     └────────────┘
                                                  └───────────┘
```

1. Connect — Establish a gRPC channel to the controller's `NodeClusterService` on port 50056.
2. Attest — Perform the two-step Ed25519 challenge handshake (`AttestNode` + `ChallengeNode`). On success, the worker receives a `NodeSecurityToken` JWT.
3. Register — Call `RegisterNode` to advertise the worker's `NodeCapabilityAdvertisement` (GPU count, CPU cores, memory, supported runtimes, tags). The controller creates initial `NodeConfigAssignment` and `RuntimeRegistryAssignment` records for the worker via the `NodeRegistryRepository`, separating cluster membership ("is this node alive?") from configuration assignment ("what should this node run?").
4. Heartbeat Loop — Send `Heartbeat` RPCs at the configured `heartbeat_interval_secs` (default 30s). Process any `NodeCommand`s returned in the response:
   - Drain — Stop accepting new executions; complete in-flight work.
   - PushConfig — Apply a configuration update from the controller.
   - Shutdown — Begin graceful process shutdown after draining.
5. Deregister — On daemon shutdown (SIGTERM / Ctrl+C), call `DeregisterNode` to cleanly remove the worker from the cluster.
If a heartbeat fails (e.g., network partition), the worker logs a warning and retries on the next interval. The controller's HealthSweeper will mark the worker as Unhealthy after 3 missed intervals.
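The heartbeat-driven token refresh works by re-attesting a configured margin before the token's 1-hour expiry. A sketch of that timing check, with illustrative names (the margin value matches the `token_refresh_margin_secs` default shown below):

```python
from datetime import datetime, timedelta, timezone

TOKEN_REFRESH_MARGIN = timedelta(seconds=120)

def needs_refresh(token_exp: datetime, now: datetime) -> bool:
    """True once `now` is within the refresh margin of the token expiry,
    so the next heartbeat tick triggers re-attestation."""
    return now >= token_exp - TOKEN_REFRESH_MARGIN

issued_at = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
token_exp = issued_at + timedelta(hours=1)

# 57 minutes in: still outside the 2-minute margin.
print(needs_refresh(token_exp, issued_at + timedelta(minutes=57)))  # False
# 58 minutes in: inside the margin; re-attest on the next heartbeat.
print(needs_refresh(token_exp, issued_at + timedelta(minutes=58)))  # True
```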
Configuration
The worker lifecycle is controlled by the spec.cluster section of aegis-config.yaml:
```yaml
spec:
  cluster:
    enabled: true
    role: worker
    controller:
      endpoint: "http://controller.internal:50056"
    node_keypair_path: /etc/aegis/node-keypair.pem
    heartbeat_interval_secs: 30        # How often to send heartbeats
    token_refresh_margin_secs: 120     # Re-attest this many seconds before token expiry
```

See Also
Container Registry & Image Management
How AEGIS discovers, pulls, caches, and authenticates container images for standard and custom runtimes — including ImagePullPolicy, private registry credentials, failure scenarios, and pre-caching for airgapped environments.
Observability
Observability stack (Jaeger, Prometheus, Grafana, Loki), structured logging, OTLP export, alert rules, and metrics for AEGIS deployments.