Multi-Node Deployment
Distribute AEGIS across multiple machines for production deployments using the gRPC cluster protocol.
AEGIS supports distributed deployments across multiple machines. Each machine runs one aegis daemon process configured with two orthogonal roles:
- `spec.node.type` — the node's deployment role (whether it hosts agents vs. serves as an API entry point)
- `spec.cluster.role` — the node's cluster coordination role (whether it controls the cluster vs. accepts forwarded work)
These two settings are independent and can be mixed freely. A node can be spec.node.type: orchestrator and spec.cluster.role: worker at the same time.
Node Types (spec.node.type)
| Type | Role |
|---|---|
| `orchestrator` | Hosts the management plane: API server, workflow engine, Temporal client, Cortex connection, secrets manager. Does not run agent containers locally. |
| `edge` | Executes agent containers (Docker runtime). Does not expose the public API. Connects to a controller node for task assignment. |
| `hybrid` | Combines both roles on a single machine. The default for development and small deployments. |
Cluster Protocol (spec.cluster.role)
AEGIS uses a dedicated NodeClusterService gRPC protocol on port 50056 for inter-node coordination. This is separate from the agent gRPC API on port 50051.
Cluster Roles
| Role | Description |
|---|---|
| `controller` | Manages the NodeCluster aggregate. Routes executions to workers. Issues NodeSecurityTokens to attested workers. Exposes NodeClusterService on port 50056. |
| `worker` | Attests to a controller on startup. Advertises NodeCapabilityAdvertisement. Accepts forwarded executions via the ForwardExecution RPC. |
| `hybrid` | Controller and worker in one process. Used for standalone single-node and development deployments. No separate controller endpoint needed. |
The Two-Tier Model
```
spec.node.type    → determines what the node DOES for agents
                    (run the API, run containers, or both)

spec.cluster.role → determines how nodes COORDINATE with each other
                    (who routes work, who accepts work)
```

A common production pairing: the API-facing machine is `type: orchestrator`, `cluster.role: controller`; each GPU machine is `type: edge`, `cluster.role: worker`. The controller receives execution requests from clients and routes them to the workers best suited to run them.
NodeSecurityToken
After a worker successfully attests to a controller, the controller issues a NodeSecurityToken: an RS256 JWT signed by the controller's OpenBao Transit key. This token:
- Has a 1-hour TTL and is auto-refreshed before expiry
- Is analogous to the agent `SecurityToken` from the SEAL protocol, but scoped to node identity
- Contains `node_id`, `role`, `capabilities_hash`, `iat`, and `exp`
- Is included in every subsequent cluster RPC, wrapped in a `SealNodeEnvelope`
All inter-node gRPC calls use this envelope:
```
SealNodeEnvelope {
  node_security_token: <NodeSecurityToken JWT>,
  signature:           <Ed25519 signature over payload>,
  payload:             <serialized RPC request bytes>
}
```

Typical Topologies
Development / Single Node
```
┌──────────────────────────────┐
│ Hybrid Node                  │   spec.node.type: hybrid
│                              │   spec.cluster.role: hybrid
│                              │
│ API Server (gRPC + REST)     │
│ Scheduler                    │
│ Docker (agent containers)    │
│ NodeClusterService :50056    │   ← routes to itself
└──────────────────────────────┘
```

Use `type: hybrid` and `cluster.role: hybrid` for local development and small deployments. This is the default in `aegis-config.yaml`.
Production — Separated Control / Data Plane
```
┌────────────────────────────────────────┐
│ Orchestrator Node                      │   spec.node.type: orchestrator
│ (1–3 instances)                        │
│                                        │
│ API → gRPC :50051 → REST :8088         │
│ Workflow engine                        │
│ Temporal client                        │
│ Secrets (OpenBao)                      │
└──────────────┬─────────────────────────┘
               │ (internal network)
        ┌──────┴──────┐
        │             │
┌───────▼───┐   ┌─────▼─────┐
│  Edge #1  │   │  Edge #2  │   spec.node.type: edge
│  Docker   │   │  Docker   │
│  agents   │   │  agents   │
└───────────┘   └───────────┘
```

Edge nodes handle the compute-intensive agent workloads. Adding more edge nodes scales execution throughput without affecting the orchestrator.
Production — Cluster Protocol (Controller + Workers)
This topology adds the NodeClusterService layer for authenticated, capability-aware execution routing.
```
┌──────────────────────────────────────────────────────────────────────┐
│ Controller Node                                                      │
│   spec.node.type: orchestrator                                       │
│   spec.cluster.role: controller                                      │
│                                                                      │
│ gRPC :50051 ← agent API (external clients)                           │
│ gRPC :50056 ← NodeClusterService (workers only)                      │
└──────────────────────────────┬───────────────────────────────────────┘
                               │ SealNodeEnvelope over gRPC :50056
               ┌───────────────┼───────────────┐
               ▼               ▼               ▼
         ┌──────────┐    ┌──────────┐    ┌──────────┐
         │ Worker 1 │    │ Worker 2 │    │ Worker 3 │
         │  :50051  │    │  :50051  │    │  :50051  │
         └──────────┘    └──────────┘    └──────────┘
          spec.node.type: edge
          spec.cluster.role: worker
```

When a client submits an execution to the controller, the NodeRouter selects the best available worker (Phase 1: round-robin among healthy workers with matching tags; Phase 2: load-aware scoring). The controller forwards the execution to the selected worker via the ForwardExecution server-streaming RPC on port 50051, then streams ExecutionEvents back to the original client.
The ClusterAwareExecutionService handles this routing and forwarding transparently — it wraps the local ExecutionService and routes to workers when cluster mode is enabled, falling back to local execution when no workers are available.
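The Phase 1 selection rule described above — round-robin over healthy workers whose tags cover the execution's target tags, with a fallback when no worker matches — can be sketched as follows. The class and field names here are illustrative, not the actual AEGIS types.

```python
from dataclasses import dataclass

@dataclass
class WorkerPeer:
    node_id: str
    tags: set
    healthy: bool = True

class RoundRobinRouter:
    """Hypothetical sketch of Phase 1 routing: round-robin with tag filtering."""
    def __init__(self, peers):
        self.peers = peers
        self._next = 0

    def select(self, target_tags: set):
        """Return the next healthy worker whose tags include all target tags."""
        candidates = [p for p in self.peers
                      if p.healthy and target_tags <= p.tags]
        if not candidates:
            return None  # caller may fall back to local execution
        peer = candidates[self._next % len(candidates)]
        self._next += 1
        return peer

router = RoundRobinRouter([
    WorkerPeer("worker-1", {"gpu", "production"}),
    WorkerPeer("worker-2", {"production"}),
    WorkerPeer("worker-3", {"gpu", "production"}, healthy=False),
])
print(router.select({"gpu"}).node_id)  # worker-1 (worker-3 is unhealthy)
```

Returning `None` when no worker matches mirrors the fallback-to-local behavior: the wrapping service executes locally rather than failing the request.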
Configuring Nodes
Orchestrator / Controller Node
```yaml
apiVersion: 100monkeys.ai/v1
kind: NodeConfig
metadata:
  name: "orchestrator-primary"
spec:
  node:
    id: "orch-node-1"
    type: "orchestrator"
    region: "us-west-2"
    tags: ["primary"]
  # Orchestrator nodes must specify all external dependencies
  llm_providers: [...]
  storage: { backend: "seaweedfs", ... }
  # ...
```

Edge Node
```yaml
apiVersion: 100monkeys.ai/v1
kind: NodeConfig
metadata:
  name: "edge-worker-1"
spec:
  node:
    id: "edge-node-1"
    type: "edge"
    region: "us-west-2"
    tags: ["gpu", "large-memory"]
    resources:
      cpu_cores: 32
      memory_gb: 128
      disk_gb: 500
      gpu: true
  runtime:
    # Point the edge node at the orchestrator for callbacks
    orchestrator_url: "https://orchestrator.internal:8080"
    docker_network_mode: "aegis-net"
    nfs_server_host: "127.0.0.1"
  # Edge nodes do not need llm_providers or storage config —
  # they delegate those duties to the orchestrator
```

spec.node.resources
Declare available hardware so the scheduler can make placement decisions:
| Field | Type | Description |
|---|---|---|
| `cpu_cores` | integer | CPU cores available to agent containers |
| `memory_gb` | integer | RAM in GB available to agent containers |
| `disk_gb` | integer | Disk space in GB |
| `gpu` | boolean | Whether a GPU is available |
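A placement decision against these declared fields reduces to a feasibility check. The following sketch illustrates the idea; the requirement shape and function name are assumptions for illustration, not the actual scheduler API.

```python
def fits(node_resources: dict, requirements: dict) -> bool:
    """True if the node's declared hardware can satisfy the request."""
    # A GPU requirement can only be met by a gpu: true node.
    if requirements.get("gpu") and not node_resources.get("gpu", False):
        return False
    # Numeric fields must each be at least the requested amount.
    for key in ("cpu_cores", "memory_gb", "disk_gb"):
        if requirements.get(key, 0) > node_resources.get(key, 0):
            return False
    return True

edge_node = {"cpu_cores": 32, "memory_gb": 128, "disk_gb": 500, "gpu": True}
print(fits(edge_node, {"cpu_cores": 8, "memory_gb": 64, "gpu": True}))  # True
print(fits(edge_node, {"cpu_cores": 64}))                               # False
```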
spec.node.tags
Tags are used for execution target matching. An agent manifest can specify spec.execution.target_tags to pin executions to nodes with matching tags:
```yaml
# In agent manifest
spec:
  execution:
    target_tags: ["gpu"]   # Only schedule on nodes tagged "gpu"
```

Cluster Configuration
The spec.cluster block configures the gRPC cluster protocol. All nodes require a persistent Ed25519 keypair; if keypair_path does not exist on disk, AEGIS auto-generates one on first run.
Controller Node
```yaml
spec:
  node:
    id: "ctrl-001"
    type: orchestrator
    region: us-east-1
    tags: [controller, production]
  network:
    port: 8080
    grpc_port: 50051
  cluster:
    role: controller
    node_id: "ctrl-001"          # must match spec.node.id
    keypair_path: /etc/aegis/node-keypair.pem
    cluster_grpc_port: 50056
```

The controller exposes NodeClusterService on `cluster_grpc_port` (default 50056). Only attested workers may call RPCs on this port.
Worker Node
```yaml
spec:
  node:
    id: "worker-gpu-001"
    type: edge
    region: us-east-1
    tags: [gpu, production]
    resources:
      cpu_cores: 16
      memory_gb: 64
      gpu: true
  cluster:
    role: worker
    node_id: "worker-gpu-001"    # must match spec.node.id
    controller_endpoint: "https://ctrl-001.internal:50056"
    keypair_path: /etc/aegis/node-keypair.pem
    heartbeat_interval_seconds: 30
```

Workers contact the controller at `controller_endpoint` on startup to perform attestation. The `heartbeat_interval_seconds` setting (default 30) controls how frequently the worker sends a Heartbeat RPC carrying its current load and capabilities.
Hybrid Node (Standalone / Development)
```yaml
spec:
  node:
    id: "dev-local"
    type: hybrid
  cluster:
    role: hybrid                 # controller + worker in one process
    node_id: "dev-local"
    keypair_path: /etc/aegis/node-keypair.pem
```

No `controller_endpoint` is needed in hybrid mode — the process routes executions to itself.
Node Attestation Flow
Before a worker can receive forwarded executions, it must prove its identity to the controller. This happens automatically on startup via the following sequence:
```
Worker                                           Controller
  │                                                 │
  │ 1. Load Ed25519 keypair from                    │
  │    spec.cluster.keypair_path                    │
  │    (auto-generated if absent)                   │
  │                                                 │
  │──── AttestNode(node_id, public_key) ───────────►│
  │                                                 │ 2. Generates cryptographic challenge
  │◄─── ChallengeNode(challenge_bytes) ─────────────│
  │                                                 │
  │ 3. Signs challenge_bytes with                   │
  │    Ed25519 private key                          │
  │                                                 │
  │──── ChallengeNode(signature) ──────────────────►│
  │                                                 │ 4. Verifies signature against
  │                                                 │    registered public key
  │◄─── NodeSecurityToken (RS256 JWT) ──────────────│ 5. Issues token (1-hour TTL)
  │                                                 │    via OpenBao Transit signing
  │ 6. Wraps all future RPCs in                     │
  │    SealNodeEnvelope {                           │
  │      node_security_token,                       │
  │      signature,                                 │
  │      payload                                    │
  │    }                                            │
  │                                                 │
  │──── RegisterNode(capabilities) ────────────────►│ 7. Advertises NodeCapabilityAdvertisement
  │                                                 │    { gpu_count, vram_gb, cpu_cores,
  │                                                 │      available_memory_gb,
  │                                                 │      supported_runtimes, tags }
  │◄─── Registered ─────────────────────────────────│
  │                                                 │
  │──── Heartbeat (every 30s) ─────────────────────►│ 8. Keeps NodePeer status Active
  │◄─── NodeCommand (optional) ─────────────────────│    Response may carry commands:
  │                                                 │    drain, config push, shutdown
```

The NodeSecurityToken is automatically refreshed before its 1-hour expiry via the ongoing heartbeat cycle. No manual token management is required.
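The challenge-response exchange above can be sketched at the protocol level. This is an illustrative model only: the signature scheme here is a keyed-hash stand-in for Ed25519 (which is not in the Python standard library), the "public key" is therefore the same bytes as the signing key, and the token is a plain dict standing in for the RS256 JWT. None of these names are AEGIS APIs.

```python
import hashlib
import hmac
import os
from datetime import datetime, timedelta, timezone

class Controller:
    """Controller side of the handshake (illustrative stand-in)."""
    def __init__(self):
        self.registered_keys = {}   # node_id -> verification key
        self.pending = {}           # node_id -> outstanding challenge

    def attest_node(self, node_id, public_key):
        """Steps 1-2: record the node's key and issue a random challenge."""
        self.registered_keys[node_id] = public_key
        challenge = os.urandom(32)
        self.pending[node_id] = challenge
        return challenge

    def challenge_node(self, node_id, signature):
        """Steps 4-5: verify the signature; on success issue a 1-hour token."""
        challenge = self.pending.pop(node_id)
        expected = hmac.new(self.registered_keys[node_id], challenge,
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, signature):
            raise PermissionError("InvalidSignature")
        expiry = datetime.now(timezone.utc) + timedelta(hours=1)
        return {"node_id": node_id, "exp": expiry.isoformat()}  # stand-in JWT

# Worker side: step 3 — sign the challenge with the persistent key.
key = os.urandom(32)                      # stands in for the Ed25519 keypair
controller = Controller()
challenge = controller.attest_node("worker-gpu-001", key)
signature = hmac.new(key, challenge, hashlib.sha256).digest()
token = controller.challenge_node("worker-gpu-001", signature)
print(token["node_id"])  # worker-gpu-001
```

Note how a key mismatch at step 4 surfaces as an `InvalidSignature`-style failure, matching the troubleshooting table later in this page.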
NodePeer status transitions:
- `Active` — node is registered and sending heartbeats within the expected interval
- `Draining` — controller has issued a drain command; no new executions are routed to this node
- `Unhealthy` — no heartbeat received within 3× the expected interval
NodeClusterService RPC Reference
The NodeClusterService exposes 10 RPCs on port 50056. All RPCs except AttestNode and ChallengeNode require a valid NodeSecurityToken in the `authorization` metadata key and a SealNodeEnvelope wrapping the payload. AttestNode is unauthenticated (it is the first call a new worker makes).
| RPC | Direction | Description |
|---|---|---|
| `AttestNode` | Worker → Controller | Initiate node attestation; send public key |
| `ChallengeNode` | Bidirectional | Controller issues challenge; worker returns Ed25519 signature |
| `RegisterNode` | Worker → Controller | Register with NodeCapabilityAdvertisement after attestation |
| `Heartbeat` | Worker → Controller | Periodic status update (default 30s); response may carry NodeCommands |
| `DeregisterNode` | Worker → Controller | Graceful deregistration before shutdown |
| `RouteExecution` | Controller-internal | Returns ExecutionRoute { target_node_id, worker_grpc_address } |
| `ForwardExecution` | Controller → Worker (server-streaming) | Execute an agent on this worker; streams ExecutionEvents back |
| `SyncConfig` | Worker → Controller | Worker requests current config from controller |
| `PushConfig` | Controller → Worker | Controller pushes updated config to a specific worker |
| `ListPeers` | Any → Controller | List all registered NodePeers with their status and capabilities |

Execution forwarding end-to-end flow:
1. Client sends an execution request to the controller.
2. `ClusterAwareExecutionService` calls `RouteExecutionUseCase` to select a worker based on health, tags, and availability.
3. Controller connects to the selected worker via `NodeClusterClient::connect_to_worker()`.
4. Controller calls `forward_execution()` with the original `execution_id` preserved for end-to-end tracing correlation.
5. Worker runs the execution locally via `start_execution_with_id()`, importing the upstream execution ID rather than generating a new one.
6. Execution events stream back to the controller via the gRPC server-streaming response, which relays them to the original client.
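The relay in step 6 can be modelled with generators: the controller consumes the worker's server-streamed events and yields each one to the original client, with the same execution ID flowing end to end. Event shapes and function names here are assumptions for illustration, not the actual AEGIS gRPC code.

```python
def worker_execution_stream(execution_id):
    """Stands in for the worker side of ForwardExecution (server-streaming)."""
    yield {"execution_id": execution_id, "type": "Started"}
    yield {"execution_id": execution_id, "type": "Stdout", "data": "hello"}
    yield {"execution_id": execution_id, "type": "Completed", "exit_code": 0}

def relay_to_client(execution_id, upstream):
    """Controller side: relay events, preserving the original execution_id."""
    for event in upstream(execution_id):
        # The unchanged execution_id is what makes end-to-end tracing work.
        assert event["execution_id"] == execution_id
        yield event

events = list(relay_to_client("exec-123", worker_execution_stream))
print([e["type"] for e in events])  # ['Started', 'Stdout', 'Completed']
```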
Node Registration
Node registration is performed via gRPC using the NodeClusterService protocol described above, not via HTTP. The original HTTP-based NodeIdentity registration is used in legacy single-node mode only and is not involved in cluster coordination.
The registration sequence after successful attestation:
1. Worker calls `RegisterNode` carrying its `NodeCapabilityAdvertisement`
2. Controller records the `NodePeer` in the `NodeCluster` aggregate
3. Worker enters the heartbeat loop (`Heartbeat` every `heartbeat_interval_seconds`)
4. Controller updates `NodePeer.last_heartbeat_at` and `NodePeer.status` on each heartbeat
5. On graceful shutdown, worker calls `DeregisterNode`; controller marks `NodePeer` as removed
Networking Requirements
| Connection | Protocol | Port | Direction |
|---|---|---|---|
| Client → Controller | HTTP REST | 8088 | Inbound to controller |
| Client → Controller | gRPC (agent API) | 50051 | Inbound to controller |
| Worker → Controller | gRPC (NodeClusterService) | 50056 | Outbound from worker |
| Controller → Worker | gRPC (ForwardExecution) | 50051 | Outbound from controller |
| Orchestrator → Temporal | gRPC (workflow engine) | 7233 | Outbound from orchestrator |
| Orchestrator → SeaweedFS | HTTP (storage filer) | 8888 | Outbound from orchestrator |
| Edge → SeaweedFS | HTTP (volume data access) | 8888 | Outbound from edge |
| Edge agent containers → Edge daemon | NFS (volume mounts via NFS Gateway) | 2049 | Internal |
Firewall rules must allow:
- Controller inbound on 8080, 50051, and 50056
- Workers inbound on 50051 (for `ForwardExecution` streaming from controller)
- Workers outbound to controller on 50056
Port 50056 should not be exposed to external clients — it is for inter-node cluster coordination only. Use network-level ACLs to restrict access to known worker IPs.
High Availability
Phase 1 (Current): Single Controller + Multiple Workers
Deploy one controller node and N worker nodes. The controller is the coordination point; workers are horizontally scalable. Multiple worker nodes provide execution capacity and fault tolerance for agent workloads.
```
          ┌──────────────────────┐
          │    Load Balancer     │  (agent API traffic only)
          └──────────┬───────────┘
                     │ :8080 / :50051
          ┌──────────▼───────────┐
          │   Controller Node    │  spec.cluster.role: controller
          │  (single instance)   │
          └──────────┬───────────┘
                     │ port :50056
   ┌─────────────────┼─────────────────┐
   ▼                 ▼                 ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   Worker 1   │ │   Worker 2   │ │   Worker 3   │
│    :50051    │ │    :50051    │ │    :50051    │
└──────────────┘ └──────────────┘ └──────────────┘
```

For controller resilience in Phase 1, run the controller with PostgreSQL as the persistence backend; on controller restart, workers re-attest automatically via the AttestNode → ChallengeNode → RegisterNode flow. The controller outage window only affects routing; in-flight executions on workers complete normally.
Phase 2 (Planned): Controller HA with Raft
Phase 2 will introduce a Raft-based controller consensus layer, allowing N controller replicas with automatic leader election. Workers will use a discovery endpoint returned during attestation to locate the current leader. This is not yet implemented.
For all orchestrator instances in either phase, share the same PostgreSQL database for consistent execution state. Redis is optional for session-level caching.
```
             ┌─────────────────┐
             │  Load Balancer  │
             └────────┬────────┘
      ┌───────────────┼────────────────┐
┌─────▼───────┐ ┌─────▼──────┐ ┌───────▼────┐
│Controller 1 │ │Controller 2│ │Controller 3│  (Phase 2)
│  (leader)   │ │ (follower) │ │ (follower) │
└──────┬──────┘ └────────────┘ └────────────┘
       │ Raft consensus
┌──────▼──────────────┐
│     PostgreSQL      │
│   (shared state)    │
└─────────────────────┘
```

Day-2 Operations: mTLS Certificate Rotation
In production environments, AEGIS requires mutual TLS (mTLS) for all inter-node communication on port 50056. This ensures that only nodes possessing a certificate signed by the platform CA can even attempt the SEAL attestation flow.
Rotating the Platform CA
If the root platform CA is rotated, you must perform a multi-step rollout to avoid cluster-wide disconnects:
1. Distribute New CA: Update `spec.cluster.tls.ca_cert` on all nodes to include BOTH the old and new CA certificates in a single PEM bundle. Restart all nodes.
2. Issue New Node Certs: Generate new node certificates signed by the new CA.
3. Update Node Certs: Update `spec.cluster.tls.cert_path` and `key_path` on each node one-by-one and restart.
4. Remove Old CA: Once all nodes are using certificates from the new CA, remove the old CA from the `ca_cert` bundle.
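Step 1 amounts to shipping a PEM bundle containing both CA certificates, and PEM files concatenate cleanly. A minimal sketch; the certificate contents and bundle path below are placeholders, not real keys or AEGIS defaults.

```python
import os
import tempfile
from pathlib import Path

def write_ca_bundle(old_ca_pem: str, new_ca_pem: str, bundle_path: str) -> None:
    """Concatenate both CAs so chains signed by either one validate."""
    bundle = old_ca_pem.rstrip() + "\n" + new_ca_pem.rstrip() + "\n"
    Path(bundle_path).write_text(bundle)

old_ca = "-----BEGIN CERTIFICATE-----\n...old...\n-----END CERTIFICATE-----"
new_ca = "-----BEGIN CERTIFICATE-----\n...new...\n-----END CERTIFICATE-----"
bundle_path = os.path.join(tempfile.gettempdir(), "aegis-ca-bundle.pem")
write_ca_bundle(old_ca, new_ca, bundle_path)
print(Path(bundle_path).read_text().count("BEGIN CERTIFICATE"))  # 2
```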
Zero-Downtime Node Certificate Rotation
Individual node certificates (e.g., node.crt) can be rotated without cluster downtime as long as the CA remains valid. The aegis daemon watches the certificate files on disk and reloads them automatically upon change (when using standard spec.cluster.tls configuration).
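The reload-on-change behavior can be approximated with mtime polling. The real daemon's file-watching mechanism is not specified here; `CertReloader` and its fields are illustrative names only, a sketch of the idea rather than the implementation.

```python
import os
import tempfile

class CertReloader:
    """Call check() periodically; reload when the cert file's mtime advances."""
    def __init__(self, cert_path: str):
        self.cert_path = cert_path
        self._mtime = os.stat(cert_path).st_mtime
        self.reload_count = 0

    def check(self) -> bool:
        mtime = os.stat(self.cert_path).st_mtime
        if mtime != self._mtime:
            self._mtime = mtime
            self.reload_count += 1   # a real reload would rebuild the TLS context
            return True
        return False

cert_path = os.path.join(tempfile.gettempdir(), "demo-node.crt")
with open(cert_path, "w") as f:
    f.write("cert v1")
reloader = CertReloader(cert_path)
print(reloader.check())              # False: nothing changed yet

# Simulate replacing the certificate and bumping the mtime.
with open(cert_path, "w") as f:
    f.write("cert v2")
st = os.stat(cert_path)
os.utime(cert_path, (st.st_atime, st.st_mtime + 5))
print(reloader.check())              # True: new certificate picked up
```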
```shell
# Example: Rotating a worker certificate
cp new-node.crt /etc/aegis/certs/node.crt
cp new-node.key /etc/aegis/certs/node.key
# AEGIS will detect the change and use the new cert for the next gRPC connection
```

Troubleshooting Cluster Operations
NodeSecurityToken Attestation Failures
If a worker fails to join the cluster, check the controller logs for the following common errors:
| Error | Root Cause | Resolution |
|---|---|---|
| `TokenExpired` | Clock Skew: The worker's system clock is significantly ahead of the controller's. | Synchronize clocks on all nodes using NTP (e.g., chronyd). |
| `NonceReplay` | Replay Attack / Rapid Restart: The worker sent an attestation request with a nonce that was already used in the last 5 minutes. | Wait 5 minutes before restarting the worker, or ensure the worker generates a fresh UUID for the nonce field on every attempt. |
| `InvalidSignature` | Keypair Mismatch: The worker's Ed25519 signature does not match its registered public key. | Verify that spec.cluster.keypair_path points to the same persistent key that was used during the initial RegisterNode call. |
| `UntrustedCA` | mTLS Failure: The certificate presented by the node is not signed by the CA in spec.cluster.tls.ca_cert. | Verify that all nodes share the same platform CA bundle. |
Diagnostic Commands
Use the aegis CLI on the controller node to inspect cluster health:
```shell
# List all registered peers
aegis node peers

# Check end-to-end cluster health from the local node config
aegis status --cluster
```

Health Endpoints
Each AEGIS daemon exposes HTTP health endpoints on the configured REST port (default 8080). These endpoints are useful for load balancer health checks, Kubernetes probes, and manual diagnostics.
| Endpoint | Method | Description |
|---|---|---|
| `/health/live` | GET | Liveness probe. Returns 200 OK if the process is running. Does not check downstream dependencies. |
| `/health/ready` | GET | Readiness probe. Returns 200 OK only when all critical subsystems (database, Temporal, event bus) are initialised. Returns 503 Service Unavailable during startup or if a dependency becomes unreachable. |
| `/health` | GET | Composite health check. Returns 200 OK with a JSON body containing uptime_seconds and subsystem status. Used by the aegis status CLI command and the daemon client library. |
In a clustered deployment, the controller node's HealthSweeper background task also monitors worker health by evaluating heartbeat freshness. If a worker misses 3 consecutive heartbeat intervals, the HealthSweeper marks its NodePeer status as Unhealthy and emits a ClusterEvent::NodeUnhealthy event.
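The HealthSweeper rule — mark a worker Unhealthy once its last heartbeat is older than three intervals — can be sketched as below. The function and peer-map shapes are illustrative, not the actual AEGIS types.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(seconds=30)
MISSED_INTERVALS_THRESHOLD = 3

def sweep(peers: dict, now: datetime) -> list:
    """Return node_ids whose last heartbeat is older than 3 intervals."""
    unhealthy = []
    for node_id, last_heartbeat_at in peers.items():
        if now - last_heartbeat_at > MISSED_INTERVALS_THRESHOLD * HEARTBEAT_INTERVAL:
            unhealthy.append(node_id)   # would emit ClusterEvent::NodeUnhealthy
    return unhealthy

now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
peers = {
    "worker-1": now - timedelta(seconds=20),   # fresh
    "worker-2": now - timedelta(seconds=95),   # > 90s: 3 intervals missed
}
print(sweep(peers, now))  # ['worker-2']
```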
Remote Volume Access
When an agent executing on Worker Node A needs to access a volume that resides on Worker Node B, AEGIS transparently proxies the file operation via the RemoteStorageService gRPC protocol.
How It Works
1. The `StorageRouter` on Node A inspects the file path. Paths prefixed with `/aegis/seal/{node_id}/{volume_id}/...` are routed to the `SealStorageProvider`.
2. `SealStorageProvider` extracts the target `node_id` from the path, looks up the node's gRPC address in the `NodeClusterRepository`, and establishes (or reuses) a gRPC channel.
3. Each RPC call is wrapped in a `SealNodeEnvelope` carrying the node's Ed25519 signature and `NodeSecurityToken`.
4. The target node's `RemoteStorageServiceHandler` verifies the envelope, performs authoritative `AegisFSAL` access checks, and delegates to its local `StorageProvider`.
5. The response is returned to the requesting node and surfaced through the standard `StorageProvider` trait.
Supported Operations
All POSIX-style file operations are supported over the wire:
- `CreateDirectory`, `DeleteDirectory`
- `SetQuota`, `GetUsage`
- `OpenFile`, `ReadAt`, `WriteAt`, `CloseFile`
- `Stat`, `Readdir`
- `CreateFile`, `DeleteFile`, `Rename`
- `HealthCheck`
Path Format
Remote volume paths follow this convention:
```
/aegis/seal/{target_node_id}/{volume_id}/{path/within/volume}
```

The StorageRouter detects the /aegis/seal/ prefix and dispatches to SealStorageProvider. All other /aegis/volumes/... paths route to the default backend (SeaweedFS/OpenDAL), and bare absolute paths (/opt/data/...) route to the local host filesystem.
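The prefix dispatch just described can be sketched as a small parser. The return shapes are illustrative; AEGIS's actual StorageRouter is a Rust component, not a Python API.

```python
def route_path(path: str):
    """Classify a path the way the StorageRouter's prefix rules would."""
    if path.startswith("/aegis/seal/"):
        # /aegis/seal/{target_node_id}/{volume_id}/{path/within/volume}
        parts = path[len("/aegis/seal/"):].split("/", 2)
        if len(parts) < 3:
            raise ValueError("seal path needs node_id, volume_id, and a sub-path")
        node_id, volume_id, subpath = parts
        return ("seal", node_id, volume_id, subpath)
    if path.startswith("/aegis/volumes/"):
        return ("default_backend", path)   # SeaweedFS/OpenDAL
    return ("local_host", path)            # bare absolute path

print(route_path("/aegis/seal/worker-2/vol-9/data/input.csv"))
# ('seal', 'worker-2', 'vol-9', 'data/input.csv')
print(route_path("/opt/data/cache.bin")[0])  # local_host
```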
Configuration Hierarchy
AEGIS supports hierarchical configuration layering with the following precedence (lowest to highest):
| Scope | Description | Example |
|---|---|---|
| Global | Cluster-wide defaults applied to all nodes | Default LLM provider settings |
| Tenant | Per-tenant overrides (keyed by TenantSlug) | Tenant-specific rate limits |
| Node | Per-node overrides (keyed by NodeId) | Node-specific runtime settings |
| Local | On-disk aegis-config.yaml loaded at startup | Hardware-specific configuration |
When the controller pushes a configuration update via the PushConfig heartbeat command, the worker merges the layers in precedence order. Each layer is stored as a ConfigSnapshot value object containing the scope, a JSON payload, and a version hash. The merged result is a MergedConfig that the worker applies atomically.
Workers can also pull their current effective configuration on demand using the SyncConfig RPC. This is useful after a restart when the worker needs to catch up on configuration changes that occurred while it was offline.
After merging configuration layers, the worker runs EffectiveConfigValidator to verify that the merged result contains all required sections (runtime, storage, llm) before accepting work. If validation fails, the worker logs a fatal error and refuses to enter the heartbeat loop — preventing a misconfigured node from silently accepting executions it cannot fulfil.
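The precedence merge and required-section check above can be sketched as a recursive dict overlay followed by validation in the spirit of EffectiveConfigValidator. The layer shapes and section contents below are illustrative, not AEGIS's actual ConfigSnapshot format.

```python
REQUIRED_SECTIONS = ("runtime", "storage", "llm")

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` onto `base` (override wins)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def merge_layers(global_cfg, tenant_cfg, node_cfg, local_cfg) -> dict:
    merged = {}
    for layer in (global_cfg, tenant_cfg, node_cfg, local_cfg):  # lowest → highest
        merged = deep_merge(merged, layer)
    missing = [s for s in REQUIRED_SECTIONS if s not in merged]
    if missing:
        raise ValueError(f"invalid effective config, missing: {missing}")
    return merged

cfg = merge_layers(
    {"llm": {"provider": "default"}, "storage": {"backend": "seaweedfs"}},
    {"llm": {"rate_limit": 100}},
    {"runtime": {"docker_network_mode": "aegis-net"}},
    {"runtime": {"nfs_server_host": "127.0.0.1"}},
)
print(sorted(cfg["runtime"]))  # ['docker_network_mode', 'nfs_server_host']
```

The `ValueError` path mirrors the fail-fast behavior: a node missing a required section refuses work instead of silently accepting executions it cannot fulfil.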
Worker Lifecycle
When a daemon starts with spec.cluster.role: worker or spec.cluster.role: hybrid and a controller.endpoint is configured, the daemon automatically spawns a WorkerLifecycle background task. This task manages the full lifecycle of the worker's relationship with the cluster controller.
Lifecycle Stages
```
┌─────────┐     ┌──────────┐     ┌──────────┐     ┌───────────┐     ┌────────────┐
│ Connect │────▶│  Attest  │────▶│ Register │────▶│ Heartbeat │────▶│ Deregister │
└─────────┘     └──────────┘     └──────────┘     │   Loop    │     └────────────┘
                                                  └───────────┘
```

1. Connect — Establish a gRPC channel to the controller's `NodeClusterService` on port 50056.
2. Attest — Perform the two-step Ed25519 challenge handshake (`AttestNode` + `ChallengeNode`). On success, the worker receives a `NodeSecurityToken` JWT.
3. Register — Call `RegisterNode` to advertise the worker's `NodeCapabilityAdvertisement` (GPU count, CPU cores, memory, supported runtimes, tags). The controller creates initial `NodeConfigAssignment` and `RuntimeRegistryAssignment` records for the worker via the `NodeRegistryRepository`, separating cluster membership ("is this node alive?") from configuration assignment ("what should this node run?").
4. Heartbeat Loop — Send `Heartbeat` RPCs at the configured `heartbeat_interval_secs` (default 30s). Process any `NodeCommand`s returned in the response:
   - Drain — Stop accepting new executions; complete in-flight work.
   - PushConfig — Apply a configuration update from the controller.
   - Shutdown — Begin graceful process shutdown after draining.
5. Deregister — On daemon shutdown (SIGTERM / Ctrl+C), call `DeregisterNode` to cleanly remove the worker from the cluster.
If a heartbeat fails (e.g., network partition), the worker logs a warning and retries on the next interval. The controller's HealthSweeper will mark the worker as Unhealthy after 3 missed intervals.
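The heartbeat-driven token refresh works by re-attesting a configured margin before the token's 1-hour expiry. A sketch of that timing check, with illustrative names (the margin value matches the `token_refresh_margin_secs` default shown below):

```python
from datetime import datetime, timedelta, timezone

TOKEN_REFRESH_MARGIN = timedelta(seconds=120)

def needs_refresh(token_exp: datetime, now: datetime) -> bool:
    """True once `now` is within the refresh margin of the token expiry,
    so the next heartbeat tick triggers re-attestation."""
    return now >= token_exp - TOKEN_REFRESH_MARGIN

issued_at = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
token_exp = issued_at + timedelta(hours=1)

# 57 minutes in: still outside the 2-minute margin.
print(needs_refresh(token_exp, issued_at + timedelta(minutes=57)))  # False
# 58 minutes in: inside the margin; re-attest on the next heartbeat.
print(needs_refresh(token_exp, issued_at + timedelta(minutes=58)))  # True
```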
Configuration
The worker lifecycle is controlled by the spec.cluster section of aegis-config.yaml:
```yaml
spec:
  cluster:
    enabled: true
    role: worker
    controller:
      endpoint: "http://controller.internal:50056"
    node_keypair_path: /etc/aegis/node-keypair.pem
    heartbeat_interval_secs: 30        # How often to send heartbeats
    token_refresh_margin_secs: 120     # Re-attest this many seconds before token expiry
```

See Also
Container Registry & Image Management
How AEGIS discovers, pulls, caches, and authenticates container images for standard and custom runtimes — including ImagePullPolicy, private registry credentials, failure scenarios, and pre-caching for airgapped environments.
Observability
Observability stack (Jaeger, Prometheus, Grafana, Loki), structured logging, OTLP export, alert rules, and metrics for AEGIS deployments.