Podman Deployment
Production platform deployment with Podman pods, deployment profiles, Makefile automation, and Podman as the rootless agent container runtime.
aegis-deploy is the recommended production deployment path. It includes Keycloak (IAM), OpenBao
(secrets management), TLS via Caddy, and full observability. aegis init does NOT deploy IAM
or secrets management and is intended for local testing and evaluation only.
AEGIS uses Podman in two ways: as the platform deployment orchestrator (Podman Kube YAML pods running all AEGIS services) and as the agent container runtime (spawning isolated containers for agent execution via the bollard API). This page covers both.
Platform Deployment with Podman Pods
For production and staging environments, AEGIS deploys as a set of Podman pods defined in the aegis-deploy repository. Unlike aegis init (which is for local testing and evaluation only), aegis-deploy provides a complete, production-ready deployment with IAM, secrets management, and TLS. Each pod groups related containers with shared networking, health checks, and persistent volumes.
Public Pod Topology
| Pod | Containers | Key Ports | Purpose |
|---|---|---|---|
| pod-core | aegis-runtime | 8088 (HTTP), 50051 (gRPC), 2049 (NFS), 9091 (metrics) | Orchestrator, agent execution, NFS gateway |
| pod-database | PostgreSQL 15, postgres-exporter | 5432, 9187 | Primary data store and metrics |
| pod-temporal | Temporal 1.23, Temporal UI, aegis-temporal-worker | 7233, 8233, 3000 | Durable workflow execution |
| pod-secrets | OpenBao | 8200 | Secrets management (AppRole auth) |
| pod-iam | Keycloak 24 | 8180 | OIDC identity provider |
| pod-storage | SeaweedFS master, volume, filer, WebDAV | 9333, 8080, 8888, 7333 | Distributed volume storage |
| pod-observability | Jaeger, Prometheus, Grafana, Loki, Promtail | 16686, 9090, 3300, 3100 | Tracing, metrics, dashboards, logs |
| pod-seal-gateway | aegis-seal-gateway | 8089, 50055 | Tool orchestration gateway |
Proprietary add-on pods (Cortex, Zaru, Zaru Edge) are available under commercial license and are not included in the public aegis-deploy repository.
Deployment Profiles
Deploy only the pods you need using profiles:
| Profile | Pods Included | Use Case |
|---|---|---|
| minimal | core, secrets | Bare-minimum agent execution |
| development | core, secrets, database, temporal, iam, observability | Local development and testing |
| full | All 8 public pods | Production and staging |
# Deploy the development profile
make deploy PROFILE=development
# Deploy the full stack
make deploy PROFILE=full
Quick Start
# Clone the deployment repo
git clone https://github.com/100monkeys-ai/aegis-deploy.git
cd aegis-deploy
# Copy and edit environment configuration
cp .env.example .env
# Edit .env with your values (GHCR credentials, passwords, LLM API keys)
# Install system dependencies (Ubuntu)
make setup
# Authenticate with GitHub Container Registry
make registry-login
# Generate SEAL signing keys
make generate-keys
# Deploy the full stack
make deploy PROFILE=full
# Validate all services are healthy
make validate
# Bootstrap secrets and IAM
make bootstrap-secrets
make bootstrap-keycloak
Makefile Targets
| Target | Description |
|---|---|
| make setup | Install system dependencies (Podman, utilities) |
| make deploy PROFILE=<name> | Deploy pods for the specified profile |
| make teardown | Stop and remove all pods |
| make status | Show pod and container status |
| make validate | Health-check all running services |
| make registry-login | Authenticate with GitHub Container Registry |
| make bootstrap-secrets | Initialize OpenBao (AppRole auth, KV mount) |
| make bootstrap-keycloak | Create Keycloak realms, clients, and users |
| make generate-keys | Generate Ed25519 SEAL signing keypair |
| make redeploy POD=<name> | Tear down and redeploy a single pod |
| make logs POD=<name> | Stream logs for a specific pod |
| make clean | Full teardown including volumes and networks |
Environment Configuration
All configuration is driven by a .env file. Key variables:
| Variable | Required | Description |
|---|---|---|
| AEGIS_ROOT | Yes | Absolute path to the aegis-deploy directory |
| GHCR_USERNAME | Yes | GitHub username for image pulls |
| GHCR_TOKEN | Yes | GitHub PAT with read:packages scope |
| AEGIS_IMAGE_TAG | No | Image tag (default: latest) |
| POSTGRES_PASSWORD | Yes | PostgreSQL password |
| KEYCLOAK_ADMIN_PASSWORD | Yes | Keycloak admin password |
| RUST_LOG | No | Log level (default: info) |
| CONTAINER_SOCK | No | Podman socket path (default: /run/user/1000/podman/podman.sock) |
See .env.example in the aegis-deploy repository for the full variable reference.
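For orientation, a minimal .env might look like the sketch below. All values are placeholders (the token format and paths are illustrative); consult .env.example for the authoritative list.

```shell
# Minimal .env sketch — every value below is a placeholder.
AEGIS_ROOT=/opt/aegis-deploy
GHCR_USERNAME=your-github-user
GHCR_TOKEN=ghp_xxxxxxxxxxxx
POSTGRES_PASSWORD=change-me
KEYCLOAK_ADMIN_PASSWORD=change-me
# Optional overrides (defaults shown)
AEGIS_IMAGE_TAG=latest
RUST_LOG=info
CONTAINER_SOCK=/run/user/1000/podman/podman.sock
```

Keep this file out of version control and restrict its permissions (chmod 600), since it carries registry and database credentials.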
Health Check Endpoints
Every pod container exposes a health check. Use make validate to check all at once, or query individually:
| Service | Endpoint | Type |
|---|---|---|
| AEGIS Runtime | GET /health on :8088 | HTTP |
| PostgreSQL | pg_isready | exec |
| Temporal | temporal operator cluster health | exec |
| OpenBao | bao status | exec |
| Keycloak | GET /health/ready on :8180 | HTTP |
| SeaweedFS Master | GET /cluster/status on :9333 | HTTP |
| Prometheus | GET /-/ready on :9090 | HTTP |
| Grafana | GET /api/health on :3300 | HTTP |
| Loki | GET /ready on :3100 | HTTP |
| SEAL Gateway | GET / on :8089 | HTTP |
Persistent Volumes
| Volume | Pod | Purpose |
|---|---|---|
| aegis-postgres-data | database | PostgreSQL databases |
| aegis-runtime-data | core | Agent execution outputs |
| aegis-openbao-data | secrets | Encrypted secret storage |
| aegis-prometheus-data | observability | Metrics (15-day retention) |
| aegis-grafana-data | observability | Dashboards and config |
| aegis-loki-data | observability | Log storage (7-day retention) |
| aegis-seaweedfs-master-data | storage | SeaweedFS metadata |
| aegis-seaweedfs-volume-data | storage | SeaweedFS block storage |
| aegis-seaweedfs-filer-data | storage | SeaweedFS filesystem layer |
| aegis-seal-gateway-data | seal-gateway | SQLite tool database |
| aegis-temporal-worker-data | temporal | Workflow worker state |
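As a maintenance aside, any of these named volumes can be archived with podman volume export. The sketch below only prints the command for review (the dated filename is illustrative); for the PostgreSQL volume specifically, stop the pod first or prefer pg_dump so the backup is consistent.

```shell
# Print a backup command for one of the named volumes above (not executed
# here, so it can be reviewed first). Adjust the volume name as needed.
vol="aegis-postgres-data"
backup_cmd="podman volume export ${vol} --output ${vol}-$(date +%F).tar"
echo "$backup_cmd"
```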
Agent Container Runtime
The sections below cover Podman as the agent container runtime — how the AEGIS orchestrator spawns isolated containers for agent execution via the bollard Docker-compatible API.
Prerequisites
- Podman 4.0+ installed and configured for rootless mode
- systemd with user lingering enabled for the service account
- cgroups v2 (recommended) or cgroups v1 with appropriate delegation
- slirp4netns or pasta installed for rootless networking
- Agent container images are accessible from the host (either locally present or pullable)
- NFS traffic (TCP port 2049) is routable between agent containers and the host
Socket Configuration
Podman provides a Docker-compatible API socket via systemd socket activation. In rootless mode, the socket lives under the user's runtime directory.
Enable the Podman Socket
# Enable and start the rootless Podman socket
systemctl --user enable --now podman.socket
# Verify the socket is active
systemctl --user status podman.socket
# Confirm the socket path
ls -la /run/user/$(id -u)/podman/podman.sock
Enable User Lingering
Without lingering, the user's systemd instance (and the Podman socket) is torn down when the user logs out. Enable lingering so the socket persists:
# Enable lingering for the aegis service user
sudo loginctl enable-linger aegis
# Verify
loginctl show-user aegis | grep Linger
AEGIS Configuration
Point the AEGIS daemon at the Podman socket in aegis-config.yaml:
runtime:
container_socket_path: "/run/user/1000/podman/podman.sock"
Replace 1000 with the UID of the service account running the daemon.
Alternatively, set the CONTAINER_HOST environment variable:
export CONTAINER_HOST=unix:///run/user/$(id -u)/podman/podman.sock
Container Lifecycle
The container lifecycle is identical to Docker. Bollard sends the same API calls to the Podman socket, and Podman handles them compatibly:
- Pulls the image (respecting spec.runtime.image_pull_policy from the agent manifest — see Container Registry & Image Management).
- Creates the container with:
  - CPU quota and memory limit from spec.resources
  - NFS volume mounts (described below)
  - Network configuration from spec.security.network_policy
  - Environment variables from spec.environment
  - The container UID/GID stored in the Execution metadata for UID/GID squashing
- Starts the container — bootstrap.py begins executing.
- Monitors the container for the duration of the iteration.
- Stops and removes the container after the iteration completes or times out.
Containers are removed immediately after each iteration. A fresh container is created for each iteration in the 100monkeys loop.
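For intuition, the creation step above maps roughly onto familiar CLI flags. This is a sketch only — AEGIS drives the API through Bollard, not the CLI — and the image name, environment variable, and bootstrap path are all illustrative. The aegis.managed=true label matches the one the reaper uses (see below).

```shell
# Assemble (but do not run) a podman command roughly equivalent to the
# container-creation API call: CPU/memory limits, the management label,
# an environment variable, and the bootstrap entrypoint.
image="ghcr.io/example/agent:latest"   # illustrative image name
cmd="podman run --rm --cpus 1.0 --memory 1g \
  --label aegis.managed=true \
  --env EXAMPLE_VAR=value \
  ${image} python /bootstrap.py"
echo "$cmd"
```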
Container Cleanup (Defense-in-Depth)
The same three-layer cleanup defense applies:
| Layer | Trigger | Mechanism |
|---|---|---|
| Explicit termination | Normal exit paths (success, failure, timeout, cancellation) | runtime.terminate() via Bollard API |
| RAII guard | Panic or unexpected error between spawn() and terminate() | ContainerGuard Drop impl spawns async cleanup task |
| Background reaper | Orphaned containers from process crashes or API failures | Daemon task runs every 5 min, cross-references containers against DB |
The reaper identifies orphans by listing all containers with the aegis.managed=true label and checking their aegis.execution_id against the execution repository. Containers with aegis.keep_container_on_failure=true are skipped by the reaper.
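The reaper's matching logic — cross-referencing labeled containers against known executions — can be sketched in shell. This is illustrative only; the real reaper runs inside the daemon and queries the execution repository, and the IDs here are made up.

```shell
# Simulated container list: "container_id execution_id" pairs, plus the set
# of execution IDs known to the DB. Orphans are containers whose
# execution_id does not appear in the DB.
containers="c1 exec-100
c2 exec-200
c3 exec-999"
db_execs="exec-100 exec-200"

echo "$containers" | while read -r cid eid; do
  case " $db_execs " in
    *" $eid "*) ;;                # execution known to the DB — leave it alone
    *) echo "orphan: $cid" ;;     # unknown execution — candidate for removal
  esac
done
```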
Resource Limits
Manifest resource limits are translated to container constraints via the same Bollard API fields:
spec:
resources:
cpu_quota: 1.0 # -> nano_cpus (1_000_000_000)
memory_bytes: 1073741824 # -> memory limit in bytes
timeout_secs: 300
timeout_secs is enforced by the ExecutionSupervisor. If the inner loop has not produced a final response within timeout_secs, the container is force-killed and the iteration is failed.
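As a sanity check on the mapping above, the cpu_quota → NanoCpus arithmetic (1 CPU = 1_000_000_000 nano-CPUs) can be reproduced by hand — a sketch of the conversion, not AEGIS code:

```shell
# Map a manifest cpu_quota (fractional CPUs) to the Docker-API NanoCpus value.
cpu_quota="1.0"
nano_cpus=$(awk -v q="$cpu_quota" 'BEGIN { printf "%d", q * 1000000000 }')
echo "$nano_cpus"   # 1000000000
```

The same formula holds for fractional quotas, e.g. 0.5 CPUs maps to 500000000.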
cgroups v2 Delegation for Rootless
Rootless Podman requires cgroup v2 delegation to enforce resource limits. Without delegation, CPU and memory limits may be silently ignored.
# Verify cgroups v2 is in use
stat -fc %T /sys/fs/cgroup/
# Expected output: cgroup2fs
If resource limits are not being enforced, enable CPU and memory delegation for the user:
# /etc/systemd/system/user@.service.d/delegate.conf
[Service]
Delegate=cpu cpuset io memory pids
sudo systemctl daemon-reload
NFS Volume Mounting
Agent containers mount their volumes via the kernel NFS client to the orchestrator's NFS server gateway (port 2049). The mount configuration is identical to Docker:
// Example Bollard mount configuration produced by AEGIS for a volume named "workspace":
{
"Target": "/workspace",
"Type": "volume",
"VolumeOptions": {
"DriverConfig": {
"Name": "local",
"Options": {
"type": "nfs",
"o": "addr=<orchestrator-host>,nfsvers=3,proto=tcp,soft,timeo=10,nolock",
"device": ":/<tenant_id>/<volume_id>"
}
}
}
}
The agent container does not require CAP_SYS_ADMIN or any elevated capabilities for NFS mounts.
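The o and device strings follow a fixed shape, so assembling them by hand can help when debugging a failing mount. A sketch — the host, tenant, and volume IDs below are placeholders:

```shell
# Build the NFS mount option strings from their components.
orchestrator_host="10.0.0.5"   # placeholder for <orchestrator-host>
tenant_id="tenant-a"           # placeholder
volume_id="workspace"          # placeholder
o="addr=${orchestrator_host},nfsvers=3,proto=tcp,soft,timeo=10,nolock"
device=":/${tenant_id}/${volume_id}"
echo "$o"
echo "$device"
```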
Network Reachability of NFS
In rootless Podman, use host.containers.internal to reach the host from within a container. Configure the NFS listen address in aegis-config.yaml:
storage:
nfs_listen_addr: "0.0.0.0:2049"
In multi-host deployments, use the orchestrator host's external IP or hostname.
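To verify NFS reachability from a given host, a plain TCP probe of port 2049 is often enough. A sketch using bash's /dev/tcp; substitute the orchestrator's address for the placeholder:

```shell
# Probe TCP connectivity to the NFS gateway. Prints "reachable" or
# "unreachable". 127.0.0.1 is a placeholder for the orchestrator address.
host="127.0.0.1"
port="2049"
if timeout 2 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
  echo "reachable"
else
  echo "unreachable"
fi
```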
Network Configuration
Creating the AEGIS Network
# Create a Podman network for AEGIS containers
podman network create aegis-network
# Verify
podman network ls
Container DNS and Host Access
Podman maps host.containers.internal to the host by default. AEGIS also adds host.docker.internal for compatibility:
{
"extra_hosts": [
"host.docker.internal:host-gateway",
"host.containers.internal:host-gateway"
]
}
Both names resolve to the host IP from within agent containers, so NFS mount addresses and SEAL callbacks work regardless of which runtime is in use.
Network Policy Enforcement
Network egress is controlled by the manifest network_policy, identical to Docker:
spec:
security:
network_policy:
mode: allow
allowlist:
- pypi.org
- api.github.com
The AEGIS daemon enforces network policy at the SEAL layer (per tool call), not via container network rules.
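The per-call allowlist check amounts to a set-membership test. A sketch in shell — illustrative only, since the real enforcement lives in the SEAL layer:

```shell
# allowed HOST ALLOWED... — return 0 if HOST is in the allowlist.
allowed() {
  host="$1"; shift
  for a in "$@"; do
    [ "$host" = "$a" ] && return 0
  done
  return 1
}

allowed "pypi.org" pypi.org api.github.com && echo "allow" || echo "deny"
allowed "evil.example" pypi.org api.github.com && echo "allow" || echo "deny"
```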
Differences from Docker
| Aspect | Docker | Podman |
|---|---|---|
| Daemon model | Persistent root daemon (dockerd) | Daemonless; socket-activated per user |
| Socket path | /var/run/docker.sock | /run/user/<UID>/podman/podman.sock |
| Default registries | docker.io only | Configurable in /etc/containers/registries.conf |
| Auth file | ~/.docker/config.json | ${XDG_RUNTIME_DIR}/containers/auth.json |
| Host DNS name | host.docker.internal | host.containers.internal (both mapped by AEGIS) |
| cgroup management | Delegated by dockerd | Requires explicit cgroup v2 delegation for rootless |
| Security model | Root daemon, user communicates via group | No privileged daemon; socket is user-scoped |
| Process model | Containers are children of dockerd | Containers are children of conmon (per-container) |
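One practical consequence of the socket-path difference: deploy scripts can guess which runtime a socket belongs to from its path. A heuristic sketch, not an AEGIS feature:

```shell
# Guess the runtime from CONTAINER_HOST / the socket path (heuristic only;
# the fallback path assumes UID 1000).
sock="${CONTAINER_HOST:-unix:///run/user/1000/podman/podman.sock}"
case "$sock" in
  *podman*) runtime="podman" ;;
  *docker*) runtime="docker" ;;
  *)        runtime="unknown" ;;
esac
echo "$runtime"
```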
Socket Activation Behavior
Podman's socket is activated on demand by systemd. After a period of inactivity, the podman API process exits. The next API call re-activates it transparently. This is normally invisible to the AEGIS daemon, but be aware:
- The first API call after an idle period may have slightly higher latency.
- If podman.socket is not enabled, the socket file will not exist and Bollard will fail to connect.
Stats API Differences
The container stats endpoint may return different fields depending on the cgroup version. Under cgroups v2, some fields that Docker populates (e.g., per-CPU usage arrays) may be absent or zeroed. The AEGIS daemon normalizes these differences in the ContainerStats adapter layer.
systemd Service Configuration
For rootless Podman, the AEGIS daemon runs as a systemd user service:
# ~/.config/systemd/user/aegis.service
[Unit]
Description=AEGIS Orchestrator Daemon
After=network-online.target podman.socket
Requires=podman.socket
[Service]
WorkingDirectory=/opt/aegis
ExecStart=/usr/local/bin/aegis --daemon --config /etc/aegis/config.yaml
Restart=on-failure
RestartSec=10s
LimitNOFILE=65535
Environment=CONTAINER_HOST=unix:///run/user/%U/podman/podman.sock
# Environment variables for secrets (avoid plaintext in config)
EnvironmentFile=%h/.config/aegis/env
[Install]
WantedBy=default.target
# ~/.config/aegis/env (chmod 600)
DATABASE_URL=postgresql://aegis:password@localhost:5432/aegis
OPENAI_API_KEY=sk-...
OPENBAO_ROLE_ID=...
OPENBAO_SECRET_ID=...
# Enable and start (as the aegis user)
systemctl --user enable aegis
systemctl --user start aegis
# Check status
systemctl --user status aegis
# Follow logs
journalctl --user -u aegis -f
User lingering must be enabled (see Socket Configuration) or the service will stop when the user session ends.
Troubleshooting
Socket Not Found
If the AEGIS daemon fails to connect to the Podman socket:
# Check if the socket unit is active
systemctl --user status podman.socket
# Check if the socket file exists
ls -la /run/user/$(id -u)/podman/podman.sock
# Restart the socket if needed
systemctl --user restart podman.socket
Permission Denied
If the daemon gets permission errors connecting to the socket:
# Verify lingering is enabled
loginctl show-user aegis | grep Linger
# Verify the socket is owned by the correct user
ls -la /run/user/$(id -u)/podman/
# Verify the daemon is running as the correct user
ps aux | grep aegis
Container Stats Returning Zeros
If resource usage metrics are all zeros, cgroup v2 delegation is likely not configured:
# Check cgroup version
stat -fc %T /sys/fs/cgroup/
# Check delegation
cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/cgroup.controllers
# If cpu/memory/io are missing, add the delegation override
sudo mkdir -p /etc/systemd/system/user@.service.d
sudo tee /etc/systemd/system/user@.service.d/delegate.conf <<EOF
[Service]
Delegate=cpu cpuset io memory pids
EOF
sudo systemctl daemon-reload
Log out and back in (or reboot) for delegation changes to take effect.
Network Connectivity Issues
If containers cannot reach the host or external networks:
# Check which network mode is in use
podman info | grep -i network
# Verify slirp4netns or pasta is installed
which slirp4netns
which pasta
# Test host connectivity from a container
podman run --rm alpine ping -c1 host.containers.internal
If you are using pasta (the default in Podman 5.0+) and connectivity fails, try falling back to slirp4netns:
podman run --network=slirp4netns --rm alpine ping -c1 host.containers.internal
Health Checks
The AEGIS daemon exposes the same health endpoints regardless of the container runtime:
# Liveness (daemon process alive)
curl http://localhost:8080/health/live
# Readiness (daemon ready to accept requests; all dependencies connected)
curl http://localhost:8080/health/ready
Use these in load balancer health check configuration or monitoring systems.
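These endpoints can also be wrapped in a small polling gate for deploy scripts. A sketch — the URL and retry budget below are illustrative, not part of AEGIS:

```shell
# wait_ready URL ATTEMPTS DELAY_SECONDS — poll a readiness endpoint until it
# returns HTTP 200 or the attempts run out.
wait_ready() {
  url="$1"; attempts="$2"; delay="$3"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$url" 2>/dev/null || echo 000)
    [ "$code" = "200" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: gate on the daemon's readiness endpoint with a short retry budget.
wait_ready "http://localhost:8080/health/ready" 5 1 && echo "ready" || echo "not ready"
```

In a deploy script, call wait_ready after make deploy and before make bootstrap-secrets so bootstrap steps only run against a healthy stack.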