Podman Deployment
Production platform deployment with Podman pods, deployment profiles, Makefile automation, and Podman as the rootless agent container runtime.
aegis-deploy is the recommended production deployment path. It includes Keycloak (IAM), OpenBao
(secrets management), TLS via Caddy, and full observability. aegis init does NOT deploy IAM
or secrets management and is intended for local testing and evaluation only.
AEGIS uses Podman in two ways: as the platform deployment orchestrator (Podman Kube YAML pods running all AEGIS services) and as the agent container runtime (spawning isolated containers for agent execution via the bollard API). This page covers both.
Platform Deployment with Podman Pods
For production and staging environments, AEGIS deploys as a set of Podman pods defined in the aegis-deploy repository. Unlike aegis init (which is for local testing and evaluation only), aegis-deploy provides a complete, production-ready deployment with IAM, secrets management, and TLS. Each pod groups related containers with shared networking, health checks, and persistent volumes.
Public Pod Topology
| Pod | Containers | Key Ports | Purpose |
|---|---|---|---|
| pod-core | aegis-runtime | 8088 (HTTP), 50051 (gRPC), 2049 (NFS), 9091 (metrics) | Orchestrator, agent execution, NFS gateway |
| pod-database | PostgreSQL 15, postgres-exporter | 5432, 9187 | Primary data store and metrics |
| pod-temporal | Temporal 1.23, Temporal UI, aegis-temporal-worker | 7233, 8233, 3000 | Durable workflow execution |
| pod-secrets | OpenBao | 8200 | Secrets management (AppRole auth) |
| pod-iam | Keycloak 24 | 8180 | OIDC identity provider |
| pod-storage | SeaweedFS master, volume, filer, WebDAV | 9333, 8080, 8888, 7333 | Distributed volume storage |
| pod-observability | Jaeger, Prometheus, Grafana, Loki, Promtail | 16686, 9090, 3300, 3100 | Tracing, metrics, dashboards, logs |
| pod-seal-gateway | aegis-seal-gateway | 8089, 50055 | Tool orchestration gateway |
Proprietary add-on pods (Cortex, Zaru, Zaru Edge) are available under commercial license and are not included in the public aegis-deploy repository.
Deployment Profiles
Deploy only the pods you need using profiles:
| Profile | Pods Included | Use Case |
|---|---|---|
| minimal | core, secrets | Bare-minimum agent execution |
| development | core, secrets, database, temporal, iam, observability | Local development and testing |
| full | All 8 public pods | Production and staging |
# Deploy the development profile
make deploy PROFILE=development
# Deploy the full stack
make deploy PROFILE=full
Quick Start
# Clone the deployment repo
git clone https://github.com/100monkeys-ai/aegis-deploy.git
cd aegis-deploy
# Copy and edit environment configuration
cp .env.example .env
# Edit .env with your values (GHCR credentials, passwords, LLM API keys)
# Install system dependencies (Ubuntu)
make setup
# Authenticate with GitHub Container Registry
make registry-login
# Generate SEAL signing keys
make generate-keys
# Deploy the full stack
make deploy PROFILE=full
# Validate all services are healthy
make validate
# Bootstrap secrets and IAM
make bootstrap-secrets
make bootstrap-keycloak
Makefile Targets
| Target | Description |
|---|---|
| make setup | Install system dependencies (Podman, utilities) |
| make deploy PROFILE=<name> | Deploy pods for the specified profile |
| make teardown | Stop and remove all pods |
| make status | Show pod and container status |
| make validate | Health-check all running services |
| make registry-login | Authenticate with GitHub Container Registry |
| make bootstrap-secrets | Initialize OpenBao (AppRole auth, KV mount) |
| make bootstrap-keycloak | Create Keycloak realms, clients, and users |
| make generate-keys | Generate Ed25519 SEAL signing keypair |
| make redeploy POD=<name> | Tear down and redeploy a single pod |
| make logs POD=<name> | Stream logs for a specific pod |
| make clean | Full teardown including volumes and networks |
Environment Configuration
All configuration is driven by a .env file. Key variables:
| Variable | Required | Description |
|---|---|---|
| AEGIS_ROOT | Yes | Absolute path to the aegis-deploy directory |
| GHCR_USERNAME | Yes | GitHub username for image pulls |
| GHCR_TOKEN | Yes | GitHub PAT with read:packages scope |
| AEGIS_IMAGE_TAG | No | Image tag (default: latest) |
| POSTGRES_PASSWORD | Yes | PostgreSQL password |
| KEYCLOAK_ADMIN_PASSWORD | Yes | Keycloak admin password |
| RUST_LOG | No | Log level (default: info) |
| CONTAINER_SOCK | No | Podman socket path (default: /run/user/1000/podman/podman.sock) |
See .env.example in the aegis-deploy repository for the full variable reference.
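For orientation, a minimal .env might look like the sketch below. All values are placeholders (the token format and paths are illustrative); consult .env.example for the authoritative list.

```shell
# Minimal .env sketch — every value below is a placeholder.
AEGIS_ROOT=/opt/aegis-deploy
GHCR_USERNAME=your-github-user
GHCR_TOKEN=ghp_xxxxxxxxxxxx
POSTGRES_PASSWORD=change-me
KEYCLOAK_ADMIN_PASSWORD=change-me
# Optional overrides (defaults shown)
AEGIS_IMAGE_TAG=latest
RUST_LOG=info
CONTAINER_SOCK=/run/user/1000/podman/podman.sock
```

Keep this file out of version control and restrict its permissions (chmod 600), since it carries registry and database credentials.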
Health Check Endpoints
Every pod container exposes a health check. Use make validate to check all at once, or query individually:
| Service | Endpoint | Type |
|---|---|---|
| AEGIS Runtime | GET /health on :8088 | HTTP |
| PostgreSQL | pg_isready | exec |
| Temporal | temporal operator cluster health | exec |
| OpenBao | bao status | exec |
| Keycloak | GET /health/ready on :8180 | HTTP |
| SeaweedFS Master | GET /cluster/status on :9333 | HTTP |
| Prometheus | GET /-/ready on :9090 | HTTP |
| Grafana | GET /api/health on :3300 | HTTP |
| Loki | GET /ready on :3100 | HTTP |
| SEAL Gateway | GET / on :8089 | HTTP |
Persistent Volumes
| Volume | Pod | Purpose |
|---|---|---|
| aegis-postgres-data | database | PostgreSQL databases |
| aegis-runtime-data | core | Agent execution outputs |
| aegis-openbao-data | secrets | Encrypted secret storage |
| aegis-prometheus-data | observability | Metrics (15-day retention) |
| aegis-grafana-data | observability | Dashboards and config |
| aegis-loki-data | observability | Log storage (7-day retention) |
| aegis-seaweedfs-master-data | storage | SeaweedFS metadata |
| aegis-seaweedfs-volume-data | storage | SeaweedFS block storage |
| aegis-seaweedfs-filer-data | storage | SeaweedFS filesystem layer |
| aegis-seal-gateway-data | seal-gateway | SQLite tool database |
| aegis-temporal-worker-data | temporal | Workflow worker state |
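As a maintenance aside, any of these named volumes can be archived with podman volume export. The sketch below only prints the command for review (the dated filename is illustrative); for the PostgreSQL volume specifically, stop the pod first or prefer pg_dump so the backup is consistent.

```shell
# Print a backup command for one of the named volumes above (not executed
# here, so it can be reviewed first). Adjust the volume name as needed.
vol="aegis-postgres-data"
backup_cmd="podman volume export ${vol} --output ${vol}-$(date +%F).tar"
echo "$backup_cmd"
```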
Agent Container Runtime
The sections below cover Podman as the agent container runtime — how the AEGIS orchestrator spawns isolated containers for agent execution via the bollard Docker-compatible API.
Prerequisites
- Podman 4.0+ installed and configured for rootless mode
- systemd with user lingering enabled for the service account
- cgroups v2 (recommended) or cgroups v1 with appropriate delegation
- slirp4netns or pasta installed for rootless networking
- Agent container images are accessible from the host (either locally present or pullable)
- NFS traffic (TCP port 2049) is routable between agent containers and the host
Socket Configuration
Podman provides a Docker-compatible API socket via systemd socket activation. In rootless mode, the socket lives under the user's runtime directory.
Enable the Podman Socket
# Enable and start the rootless Podman socket
systemctl --user enable --now podman.socket
# Verify the socket is active
systemctl --user status podman.socket
# Confirm the socket path
ls -la /run/user/$(id -u)/podman/podman.sock
Enable User Lingering
Without lingering, the user's systemd instance (and the Podman socket) is torn down when the user logs out. Enable lingering so the socket persists:
# Enable lingering for the aegis service user
sudo loginctl enable-linger aegis
# Verify
loginctl show-user aegis | grep Linger
AEGIS Configuration
Point the AEGIS daemon at the Podman socket in aegis-config.yaml:
runtime:
container_socket_path: "/run/user/1000/podman/podman.sock"
Replace 1000 with the UID of the service account running the daemon.
Alternatively, set the CONTAINER_HOST environment variable:
export CONTAINER_HOST=unix:///run/user/$(id -u)/podman/podman.sock
Container Lifecycle
The container lifecycle is identical to Docker. Bollard sends the same API calls to the Podman socket, and Podman handles them compatibly:
- Pulls the image (respecting spec.runtime.image_pull_policy from the agent manifest — see Container Registry & Image Management).
- Creates the container with:
  - CPU quota and memory limit from spec.resources
  - NFS volume mounts (described below)
  - Network configuration from spec.security.network_policy
  - Environment variables from spec.environment
  - The container UID/GID stored in the Execution metadata for UID/GID squashing
- Starts the container — bootstrap.py begins executing.
- Monitors the container for the duration of the iteration.
- Stops and removes the container after the iteration completes or times out.
Containers are removed immediately after each iteration. A fresh container is created for each iteration in the 100monkeys loop.
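For intuition, the creation step above maps roughly onto familiar CLI flags. This is a sketch only — AEGIS drives the API through Bollard, not the CLI — and the image name, environment variable, and bootstrap path are all illustrative. The aegis.managed=true label matches the one the reaper uses (see below).

```shell
# Assemble (but do not run) a podman command roughly equivalent to the
# container-creation API call: CPU/memory limits, the management label,
# an environment variable, and the bootstrap entrypoint.
image="ghcr.io/example/agent:latest"   # illustrative image name
cmd="podman run --rm --cpus 1.0 --memory 1g \
  --label aegis.managed=true \
  --env EXAMPLE_VAR=value \
  ${image} python /bootstrap.py"
echo "$cmd"
```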
Container Cleanup (Defense-in-Depth)
The same three-layer cleanup defense applies:
| Layer | Trigger | Mechanism |
|---|---|---|
| Explicit termination | Normal exit paths (success, failure, timeout, cancellation) | runtime.terminate() via Bollard API |
| RAII guard | Panic or unexpected error between spawn() and terminate() | ContainerGuard Drop impl spawns async cleanup task |
| Background reaper | Orphaned containers from process crashes or API failures | Daemon task runs every 5 min, cross-references containers against DB |
The reaper identifies orphans by listing all containers with the aegis.managed=true label and checking their aegis.execution_id against the execution repository. Containers with aegis.keep_container_on_failure=true are skipped by the reaper.
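The reaper's matching logic — cross-referencing labeled containers against known executions — can be sketched in shell. This is illustrative only; the real reaper runs inside the daemon and queries the execution repository, and the IDs here are made up.

```shell
# Simulated container list: "container_id execution_id" pairs, plus the set
# of execution IDs known to the DB. Orphans are containers whose
# execution_id does not appear in the DB.
containers="c1 exec-100
c2 exec-200
c3 exec-999"
db_execs="exec-100 exec-200"

echo "$containers" | while read -r cid eid; do
  case " $db_execs " in
    *" $eid "*) ;;                # execution known to the DB — leave it alone
    *) echo "orphan: $cid" ;;     # unknown execution — candidate for removal
  esac
done
```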
Resource Limits
Manifest resource limits are translated to container constraints via the same Bollard API fields:
spec:
resources:
cpu_quota: 1.0 # -> nano_cpus (1_000_000_000)
memory_bytes: 1073741824 # -> memory limit in bytes
timeout_secs: 300
timeout_secs is enforced by the ExecutionSupervisor. If the inner loop has not produced a final response within timeout_secs, the container is force-killed and the iteration is failed.
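As a sanity check on the mapping above, the cpu_quota → NanoCpus arithmetic (1 CPU = 1_000_000_000 nano-CPUs) can be reproduced by hand — a sketch of the conversion, not AEGIS code:

```shell
# Map a manifest cpu_quota (fractional CPUs) to the Docker-API NanoCpus value.
cpu_quota="1.0"
nano_cpus=$(awk -v q="$cpu_quota" 'BEGIN { printf "%d", q * 1000000000 }')
echo "$nano_cpus"   # 1000000000
```

The same formula holds for fractional quotas, e.g. 0.5 CPUs maps to 500000000.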
cgroups v2 Delegation for Rootless
Rootless Podman requires cgroup v2 delegation to enforce resource limits. Without delegation, CPU and memory limits may be silently ignored.
# Verify cgroups v2 is in use
stat -fc %T /sys/fs/cgroup/
# Expected output: cgroup2fs
If resource limits are not being enforced, enable CPU and memory delegation for the user:
# /etc/systemd/system/user@.service.d/delegate.conf
[Service]
Delegate=cpu cpuset io memory pids
sudo systemctl daemon-reload
NFS Volume Mounting
Agent containers mount their volumes via the kernel NFS client to the orchestrator's NFS server gateway (port 2049). The mount configuration is identical to Docker:
// Example Bollard mount configuration produced by AEGIS for a volume named "workspace":
{
"Target": "/workspace",
"Type": "volume",
"VolumeOptions": {
"DriverConfig": {
"Name": "local",
"Options": {
"type": "nfs",
"o": "addr=<orchestrator-host>,nfsvers=3,proto=tcp,soft,timeo=10,nolock",
"device": ":/<tenant_id>/<volume_id>"
}
}
}
}
The agent container does not require CAP_SYS_ADMIN or any elevated capabilities for NFS mounts.
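The o and device strings follow a fixed shape, so assembling them by hand can help when debugging a failing mount. A sketch — the host, tenant, and volume IDs below are placeholders:

```shell
# Build the NFS mount option strings from their components.
orchestrator_host="10.0.0.5"   # placeholder for <orchestrator-host>
tenant_id="tenant-a"           # placeholder
volume_id="workspace"          # placeholder
o="addr=${orchestrator_host},nfsvers=3,proto=tcp,soft,timeo=10,nolock"
device=":/${tenant_id}/${volume_id}"
echo "$o"
echo "$device"
```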
Network Reachability of NFS
In rootless Podman, use host.containers.internal to reach the host from within a container. Configure the NFS listen address in aegis-config.yaml:
storage:
nfs_listen_addr: "0.0.0.0:2049"
In multi-host deployments, use the orchestrator host's external IP or hostname.
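To verify NFS reachability from a given host, a plain TCP probe of port 2049 is often enough. A sketch using bash's /dev/tcp; substitute the orchestrator's address for the placeholder:

```shell
# Probe TCP connectivity to the NFS gateway. Prints "reachable" or
# "unreachable". 127.0.0.1 is a placeholder for the orchestrator address.
host="127.0.0.1"
port="2049"
if timeout 2 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
  echo "reachable"
else
  echo "unreachable"
fi
```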
Network Configuration
Creating the AEGIS Network
# Create a Podman network for AEGIS containers
podman network create aegis-network
# Verify
podman network ls
Container DNS and Host Access
Podman maps host.containers.internal to the host by default. AEGIS also adds host.docker.internal for compatibility:
{
"extra_hosts": [
"host.docker.internal:host-gateway",
"host.containers.internal:host-gateway"
]
}
Both names resolve to the host IP from within agent containers, so NFS mount addresses and SEAL callbacks work regardless of which runtime is in use.
Network Policy Enforcement
Network egress is controlled by the manifest network_policy, identical to Docker:
spec:
security:
network_policy:
mode: allow
allowlist:
- pypi.org
- api.github.com
The AEGIS daemon enforces network policy at the SEAL layer (per tool call), not via container network rules.
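The per-call allowlist check amounts to a set-membership test. A sketch in shell — illustrative only, since the real enforcement lives in the SEAL layer:

```shell
# allowed HOST ALLOWED... — return 0 if HOST is in the allowlist.
allowed() {
  host="$1"; shift
  for a in "$@"; do
    [ "$host" = "$a" ] && return 0
  done
  return 1
}

allowed "pypi.org" pypi.org api.github.com && echo "allow" || echo "deny"
allowed "evil.example" pypi.org api.github.com && echo "allow" || echo "deny"
```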
Differences from Docker
| Aspect | Docker | Podman |
|---|---|---|
| Daemon model | Persistent root daemon (dockerd) | Daemonless; socket-activated per user |
| Socket path | /var/run/docker.sock | /run/user/<UID>/podman/podman.sock |
| Default registries | docker.io only | Configurable in /etc/containers/registries.conf |
| Auth file | ~/.docker/config.json | ${XDG_RUNTIME_DIR}/containers/auth.json |
| Host DNS name | host.docker.internal | host.containers.internal (both mapped by AEGIS) |
| cgroup management | Delegated by dockerd | Requires explicit cgroup v2 delegation for rootless |
| Security model | Root daemon, user communicates via group | No privileged daemon; socket is user-scoped |
| Process model | Containers are children of dockerd | Containers are children of conmon (per-container) |
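One practical consequence of the socket-path difference: deploy scripts can guess which runtime a socket belongs to from its path. A heuristic sketch, not an AEGIS feature:

```shell
# Guess the runtime from CONTAINER_HOST / the socket path (heuristic only;
# the fallback path assumes UID 1000).
sock="${CONTAINER_HOST:-unix:///run/user/1000/podman/podman.sock}"
case "$sock" in
  *podman*) runtime="podman" ;;
  *docker*) runtime="docker" ;;
  *)        runtime="unknown" ;;
esac
echo "$runtime"
```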
Socket Activation Behavior
Podman's socket is activated on demand by systemd. After a period of inactivity, the podman API process exits. The next API call re-activates it transparently. This is normally invisible to the AEGIS daemon, but be aware:
- The first API call after an idle period may have slightly higher latency.
- If podman.socket is not enabled, the socket file will not exist and Bollard will fail to connect.
Stats API Differences
The container stats endpoint may return different fields depending on the cgroup version. Under cgroups v2, some fields that Docker populates (e.g., per-CPU usage arrays) may be absent or zeroed. The AEGIS daemon normalizes these differences in the ContainerStats adapter layer.
systemd Service Configuration
For rootless Podman, the AEGIS daemon runs as a systemd user service:
# ~/.config/systemd/user/aegis.service
[Unit]
Description=AEGIS Orchestrator Daemon
After=network-online.target podman.socket
Requires=podman.socket
[Service]
WorkingDirectory=/opt/aegis
ExecStart=/usr/local/bin/aegis --daemon --config /etc/aegis/config.yaml
Restart=on-failure
RestartSec=10s
LimitNOFILE=65535
Environment=CONTAINER_HOST=unix:///run/user/%U/podman/podman.sock
# Environment variables for secrets (avoid plaintext in config)
EnvironmentFile=%h/.config/aegis/env
[Install]
WantedBy=default.target
# ~/.config/aegis/env (chmod 600)
DATABASE_URL=postgresql://aegis:password@localhost:5432/aegis
OPENAI_API_KEY=sk-...
OPENBAO_ROLE_ID=...
OPENBAO_SECRET_ID=...
# Enable and start (as the aegis user)
systemctl --user enable aegis
systemctl --user start aegis
# Check status
systemctl --user status aegis
# Follow logs
journalctl --user -u aegis -f
User lingering must be enabled (see Socket Configuration) or the service will stop when the user session ends.
Troubleshooting
Socket Not Found
If the AEGIS daemon fails to connect to the Podman socket:
# Check if the socket unit is active
systemctl --user status podman.socket
# Check if the socket file exists
ls -la /run/user/$(id -u)/podman/podman.sock
# Restart the socket if needed
systemctl --user restart podman.socket
Permission Denied
If the daemon gets permission errors connecting to the socket:
# Verify lingering is enabled
loginctl show-user aegis | grep Linger
# Verify the socket is owned by the correct user
ls -la /run/user/$(id -u)/podman/
# Verify the daemon is running as the correct user
ps aux | grep aegis
Container Stats Returning Zeros
If resource usage metrics are all zeros, cgroup v2 delegation is likely not configured:
# Check cgroup version
stat -fc %T /sys/fs/cgroup/
# Check delegation
cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/cgroup.controllers
# If cpu/memory/io are missing, add the delegation override
sudo mkdir -p /etc/systemd/system/user@.service.d
sudo tee /etc/systemd/system/user@.service.d/delegate.conf <<EOF
[Service]
Delegate=cpu cpuset io memory pids
EOF
sudo systemctl daemon-reload
Log out and back in (or reboot) for delegation changes to take effect.
Network Connectivity Issues
If containers cannot reach the host or external networks:
# Check which network mode is in use
podman info | grep -i network
# Verify slirp4netns or pasta is installed
which slirp4netns
which pasta
# Test host connectivity from a container
podman run --rm alpine ping -c1 host.containers.internal
If you are using pasta (the default in Podman 5.0+) and connectivity fails, try falling back to slirp4netns:
podman run --network=slirp4netns --rm alpine ping -c1 host.containers.internal
Health Checks
The AEGIS daemon exposes the same health endpoints regardless of the container runtime:
# Liveness (daemon process alive)
curl http://localhost:8080/health/live
# Readiness (daemon ready to accept requests; all dependencies connected)
curl http://localhost:8080/health/ready
Use these in load balancer health check configuration or monitoring systems.
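These endpoints can also be wrapped in a small polling gate for deploy scripts. A sketch — the URL and retry budget below are illustrative, not part of AEGIS:

```shell
# wait_ready URL ATTEMPTS DELAY_SECONDS — poll a readiness endpoint until it
# returns HTTP 200 or the attempts run out.
wait_ready() {
  url="$1"; attempts="$2"; delay="$3"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$url" 2>/dev/null || echo 000)
    [ "$code" = "200" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: gate on the daemon's readiness endpoint with a short retry budget.
wait_ready "http://localhost:8080/health/ready" 5 1 && echo "ready" || echo "not ready"
```

In a deploy script, call wait_ready after make deploy and before make bootstrap-secrets so bootstrap steps only run against a healthy stack.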