
Observability

Observability stack (Jaeger, Prometheus, Grafana, Loki), structured logging, OTLP export, alert rules, and metrics for AEGIS deployments.


AEGIS provides a comprehensive observability stack deployed as the pod-observability Podman pod, plus native structured logging and optional OTLP export from the orchestrator itself. This page covers both the deployed observability infrastructure and the orchestrator's telemetry output.


Observability Stack

The pod-observability pod in aegis-deploy bundles five components for production monitoring:

| Component  | Version | Port(s)                                          | Purpose                          |
|------------|---------|--------------------------------------------------|----------------------------------|
| Jaeger     | 1.55    | 16686 (UI), 4317 (OTLP gRPC), 4318 (OTLP HTTP)   | Distributed tracing              |
| Prometheus | 2.51    | 9090                                             | Metrics collection and alerting  |
| Grafana    | 10.4    | 3300                                             | Dashboards and visualization     |
| Loki       | 3.0     | 3100                                             | Log aggregation                  |
| Promtail   | 3.0     | 9080                                             | Container log scraping           |

Prometheus Scrape Targets

Prometheus is pre-configured to scrape all AEGIS platform services:

| Target            | Port | Endpoint         |
|-------------------|------|------------------|
| aegis-runtime     | 9091 | /metrics         |
| keycloak          | 8180 | /metrics         |
| seaweedfs-master  | 9324 | /metrics         |
| seaweedfs-volume  | 9325 | /metrics         |
| seaweedfs-filer   | 9326 | /metrics         |
| openbao           | 8200 | /v1/sys/metrics  |
| temporal          | 7233 | /metrics         |
| postgres-exporter | 9187 | /metrics         |

Global scrape interval: 15 seconds. Evaluation interval: 15 seconds. Retention: 15 days.
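Those global settings correspond to a Prometheus configuration along these lines (a sketch; the actual file ships with aegis-deploy and may differ):

```yaml
# prometheus.yml (global section only)
global:
  scrape_interval: 15s      # how often each target is scraped
  evaluation_interval: 15s  # how often alert rules are evaluated

# Retention is set via a Prometheus launch flag, not the config file:
#   --storage.tsdb.retention.time=15d
```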

Alert Rules

Pre-configured Prometheus alert rules shipped with aegis-deploy:

Critical (1–2 minute threshold):

| Alert            | Condition                           | Severity |
|------------------|-------------------------------------|----------|
| AEGISRuntimeDown | Runtime unreachable for 1 min       | critical |
| PostgreSQLDown   | Database unreachable for 1 min      | critical |
| TemporalDown     | Workflow engine unreachable for 2 min | critical |
| KeycloakDown     | IAM provider unreachable for 1 min  | critical |

Warning (5 minute threshold):

| Alert                    | Condition                          | Severity |
|--------------------------|------------------------------------|----------|
| HighExecutionFailureRate | >25% execution failures            | warning  |
| HighHTTPErrorRate        | >5% HTTP 5xx responses             | warning  |
| HighAPILatency           | P95 latency >5 seconds             | warning  |
| EventBusLagHigh          | Lag rate >100 events/s             | warning  |
| SeaweedFSDown            | Storage unreachable for 5 min      | warning  |
| OpenBaoDown              | Secrets manager unreachable for 5 min | warning |
| HighPostgresConnections  | >80 active connections             | warning  |
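As an illustration, the AEGISRuntimeDown rule would look roughly like this in Prometheus alerting-rule syntax; the exact expression shipped in aegis-deploy may differ:

```yaml
# Illustrative sketch — compare against the rule files in aegis-deploy
groups:
  - name: aegis-critical
    rules:
      - alert: AEGISRuntimeDown
        expr: up{job="aegis-runtime"} == 0   # scrape target reported down
        for: 1m                              # matches the 1-minute threshold above
        labels:
          severity: critical
        annotations:
          summary: "AEGIS runtime unreachable for 1 minute"
```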

Grafana Dashboards

Three dashboards are auto-provisioned:

AEGIS Overview — Active executions, execution rate by status, iteration rate, agent lifecycle operations, event bus throughput, HTTP/gRPC request rates and latency percentiles, node info.

Infrastructure — Service up/down status for all components, PostgreSQL connections and cache hit ratio, Keycloak login events and sessions, SeaweedFS volume count and disk usage, Temporal workflow starts/completions/failures.

Logs Explorer — Log volume by container, full log stream, error log volume and filtered error view. Uses Loki as datasource.

Grafana is accessible at port 3300 with anonymous viewer access enabled by default. Datasources (Prometheus, Loki, Jaeger) are auto-configured.

Log Aggregation with Loki

Promtail scrapes all Podman container logs from /var/log/podman-containers/ and ships them to Loki. Logs are parsed using Docker log format, with container name labels extracted automatically.

  • Retention: 7 days (168 hours)
  • Schema: TSDB v13 with daily index periods
  • Query: Use Grafana's Explore view or the Logs Explorer dashboard

```shell
# View logs for a specific container via Grafana
# Navigate to: http://localhost:3300 → Explore → Loki
# Query: {container_name="aegis-runtime"}
```
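Beyond simple container selection, LogQL can filter and aggregate. The queries below are illustrative sketches; they assume the JSON log format and the container_name label described above:

```logql
# Error-level lines from the runtime (JSON log format assumed)
{container_name="aegis-runtime"} | json | level="ERROR"

# Error volume over the last 5 minutes
sum(rate({container_name="aegis-runtime"} |= "ERROR" [5m]))
```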

Log Levels

AEGIS uses Rust's RUST_LOG environment variable, which accepts a comma-separated list of level directives:

| Level | Usage |
|-------|-------|
| error | Unrecoverable failures (operation panics, infrastructure unavailability) |
| warn  | Recoverable issues: missing optional config, NFS deregistration lag, LLM provider degraded |
| info  | Normal lifecycle events: server started, execution completed, volume cleaned up |
| debug | Per-request details: tool routing decisions, SEAL validation, storage path resolution |
| trace | Verbose internal state: usually too noisy for production |

```shell
# Production
RUST_LOG=info

# Debug a specific subsystem
RUST_LOG=info,aegis_orchestrator_core::infrastructure::nfs=debug

# Debug all tool routing
RUST_LOG=info,aegis_orchestrator_core::infrastructure::tool_router=debug

# Debug tool-call judging
RUST_LOG=info,aegis_orchestrator_core::application::tool_invocation_service=debug,aegis_orchestrator_core::application::validation_service=debug

# Verbose SEAL audit
RUST_LOG=info,aegis_orchestrator_core::infrastructure::seal=debug

# Development (everything)
RUST_LOG=debug
```

Directive syntax: `[crate::path=]level[,...]`. Omitting the crate path sets a global minimum level.


Log Formats

AEGIS supports two output formats controlled at startup:

Pretty (default for development)

Human-readable colored text. Suitable for local development and docker logs:

```
2026-01-15T10:23:45.123Z  INFO aegis_orchestrator: Starting gRPC server on 0.0.0.0:50051
2026-01-15T10:23:46.200Z  INFO aegis_orchestrator: Connected to Cortex gRPC service url=http://cortex:50052
2026-01-15T10:23:47.001Z  WARN aegis_orchestrator: Started with NO LLM providers configured. Agent execution will fail!
```

JSON (production)

Newline-delimited JSON; parseable by log aggregators:

{"timestamp":"2026-01-15T10:23:45.123Z","level":"INFO","target":"aegis_orchestrator","message":"Starting gRPC server on 0.0.0.0:50051"}
{"timestamp":"2026-01-15T10:23:46.200Z","level":"INFO","target":"aegis_orchestrator","fields":{"url":"http://cortex:50052"},"message":"Connected to Cortex gRPC service"}

Enable JSON format by setting the AEGIS_LOG_FORMAT environment variable:

```shell
AEGIS_LOG_FORMAT=json
```

If unset or set to any other value, the pretty format is used.
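Because each line is an independent JSON object, downstream tooling can consume the stream line by line. A minimal Python sketch, using sample records that mirror the examples above:

```python
import json

# Two sample NDJSON records mirroring the examples in this section.
sample = '''\
{"timestamp":"2026-01-15T10:23:45.123Z","level":"INFO","target":"aegis_orchestrator","message":"Starting gRPC server on 0.0.0.0:50051"}
{"timestamp":"2026-01-15T10:23:46.200Z","level":"INFO","target":"aegis_orchestrator","fields":{"url":"http://cortex:50052"},"message":"Connected to Cortex gRPC service"}
'''

def parse_ndjson(stream: str) -> list[dict]:
    """Parse newline-delimited JSON log records, skipping blank lines."""
    return [json.loads(line) for line in stream.splitlines() if line.strip()]

for rec in parse_ndjson(sample):
    # Structured fields, when present, live under the "fields" key.
    print(rec["level"], rec["message"], rec.get("fields", {}))
```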


Structured Fields

Many log events include structured key-value fields alongside the message. These are available in both formats:

| Field        | Events                    | Description                     |
|--------------|---------------------------|---------------------------------|
| url          | Service connection events | Target URL being connected to   |
| execution_id | Execution lifecycle       | UUID of the active execution    |
| count        | Volume cleanup            | Number of volumes deleted       |
| err          | Error events              | Error description               |
| agent_id     | Agent lifecycle           | UUID of the agent               |

When using JSON format, structured fields appear as keys in the JSON object under "fields".


Domain Events in Logs

AEGIS publishes structured domain events to its internal event bus. These events also produce log entries. Key observable events:

Execution Events

| Log Message Pattern | Level | Meaning |
|---------------------|-------|---------|
| "Starting execution" | INFO | Execution started |
| "Inner loop generation failed" | ERROR | LLM generation failed for an iteration |
| "Could not find execution {} for LLM event" | WARN | Race condition during execution lookup |

Volume Events

| Log Message Pattern | Level | Meaning |
|---------------------|-------|---------|
| "Volume cleanup: {} expired volumes deleted" | INFO | Periodic TTL cleanup completed |
| "Volume cleanup failed" | ERROR | Cleanup task failed |
| "NFS deregistration listener lagged" | WARN | Event bus buffer full; some deregistrations may have been missed |

Service Lifecycle

| Log Message Pattern | Level | Meaning |
|---------------------|-------|---------|
| "Starting gRPC server on {}" | INFO | gRPC server started |
| "Starting AEGIS gRPC server on {}" | INFO | Internal gRPC server |
| "Connected to Cortex gRPC service" | INFO | Cortex connection established |
| "Cortex gRPC URL not configured" | INFO | Running in memoryless mode (expected when Cortex not deployed) |
| "Failed to connect to Temporal" | ERROR | Temporal workflow engine unreachable |
| "Failed to start some MCP servers" | ERROR | One or more MCP tool servers failed to start |

SEAL / Security Events

SEAL policy violations always produce WARN log entries with structured fields including execution_id, tool_name, and the violation type. These are produced by SealAudit:

{"level":"WARN","target":"aegis_orchestrator_core::infrastructure::seal::audit","fields":{"execution_id":"a1b2...","tool_name":"fs.delete","violation":"ToolExplicitlyDenied"},"message":"SEAL tool call blocked"}

Tool-Call Judging

The runtime documentation and verified logging surface do not currently expose a dedicated judge-specific telemetry stream. Today, you can observe the following:

| Signal | What it tells you |
|--------|-------------------|
| Execution lifecycle logs | Whether the parent execution started, refined, completed, or failed. |
| Inner-loop generation logs | Whether the model returned a final response or hit a generation error. |
| SEAL audit logs | Whether a tool call was blocked by policy before routing. |
| Tool routing debug logs | Which routing path a tool took when debug logging is enabled. |
| Child execution logs | Whether a judge execution was spawned and how it completed, if you correlate by execution ID. |

Use these logs to infer tool-call judging behavior today. Do not assume a dedicated judge_execution_id, score, confidence, or decision field is emitted unless you verify it in the running build.

If you want first-class operational visibility for tool-call judging, the next step is to add explicit judge telemetry in the orchestrator and surface it as structured logs or domain events. Useful fields would include the parent execution ID, the judge child execution ID, the tool name, the verdict score, the verdict confidence, and the final allow/block decision.

That work is recommended future instrumentation, not current runtime behavior.


Container Log Collection

The AEGIS daemon writes all logs to stdout and stderr. In the Podman pod deployment, Promtail automatically collects these logs and ships them to Loki. For standalone setups:

```shell
# Podman
podman logs -f aegis-runtime

# Docker
docker logs -f aegis-daemon
```

For additional log aggregation beyond the built-in Loki stack, configure your collector (Fluentd, Datadog Agent, etc.) to read container stdout and set AEGIS_LOG_FORMAT=json so log lines are parseable.


Health Checks

The REST API exposes health endpoints on port 8088:

```shell
curl http://localhost:8088/health
# → {"status":"ok"}

curl http://localhost:8088/health/live
# → liveness check

curl http://localhost:8088/health/ready
# → readiness check (all dependencies connected)
```

In the Podman pod deployment, health checks are configured automatically for every container. Use `make validate` to check all services at once.


OTLP External Log Export

AEGIS can ship structured log records directly to any OpenTelemetry Protocol (OTLP)-compatible backend — Grafana Cloud, Datadog, Honeycomb, a self-hosted OpenTelemetry Collector, or any OTLP-native destination.

This feature is additive: stdout logging is always active. OTLP export is an optional second pipeline enabled by setting spec.observability.logging.otlp_endpoint in the node configuration (or the AEGIS_OTLP_ENDPOINT environment variable).

Quick Start

The minimum change to enable OTLP is adding otlp_endpoint to your node config:

```yaml
spec:
  observability:
    logging:
      level: info
      format: json
      otlp_endpoint: "http://otel-collector:4317"   # gRPC (default protocol)
```

Or via environment variable (no config change needed):

```shell
export AEGIS_OTLP_ENDPOINT=http://otel-collector:4317
```

Protocol Selection

Two OTLP transports are supported, controlled by otlp_protocol (or AEGIS_OTLP_PROTOCOL):

| Protocol      | Config value   | Default port | Notes                                              |
|---------------|----------------|--------------|----------------------------------------------------|
| gRPC          | grpc (default) | 4317         | Preferred for self-hosted collectors               |
| HTTP/Protobuf | http           | 4318         | Required for some SaaS endpoints (Grafana Cloud, Datadog) |

For example, to select HTTP/Protobuf:

```yaml
logging:
  otlp_endpoint: "https://otlp-gateway.grafana.net/v1/logs"
  otlp_protocol: http
```

Authentication

Use otlp_headers to pass API keys or other authentication metadata. Values support the standard env: and secret: credential prefixes:

```yaml
logging:
  otlp_endpoint: "https://otlp-gateway.grafana.net/v1/logs"
  otlp_protocol: http
  otlp_headers:
    Authorization: "env:GRAFANA_OTLP_TOKEN"   # resolved from env at startup
```

When setting headers via the AEGIS_OTLP_HEADERS environment variable, use a comma-separated key=value list:

```shell
AEGIS_OTLP_HEADERS="Authorization=Bearer my-token,x-scope-orgid=12345"
```

Never commit API keys or bearer tokens directly in your node config YAML. Always use env:VAR_NAME or secret:path credential prefixes, or set headers via the environment variable. See Credential Resolution.
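For reference, the comma-separated header string decomposes as sketched below. This is a hypothetical helper to illustrate the format, not the orchestrator's actual parser:

```python
def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Split a comma-separated key=value list into a header dict.

    Splits each pair on the first '=' only, so values may themselves
    contain '=' (e.g. base64 padding) — but not commas.
    """
    headers = {}
    for pair in raw.split(","):
        key, _, value = pair.strip().partition("=")
        headers[key] = value
    return headers

print(parse_otlp_headers("Authorization=Bearer my-token,x-scope-orgid=12345"))
# → {'Authorization': 'Bearer my-token', 'x-scope-orgid': '12345'}
```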

Backend Integration Examples

Grafana Cloud Logs

```yaml
logging:
  otlp_endpoint: "https://otlp-gateway-prod-us-central-0.grafana.net/v1/logs"
  otlp_protocol: http
  otlp_headers:
    Authorization: "env:GRAFANA_CLOUD_OTLP_TOKEN"   # Basic base64(instanceId:apiKey)
  otlp_service_name: "aegis-prod"
```

Datadog

```yaml
logging:
  otlp_endpoint: "https://otlp.datadoghq.com/v1/logs"
  otlp_protocol: http
  otlp_headers:
    DD-API-KEY: "env:DATADOG_API_KEY"
  otlp_service_name: "aegis-prod"
```

Self-Hosted OpenTelemetry Collector

```yaml
logging:
  otlp_endpoint: "http://otel-collector:4317"   # gRPC default
  otlp_service_name: "aegis-orchestrator"
```

Your otel-collector-config.yaml can then fan out to Loki, Jaeger, Prometheus, S3, or any other exporter.

Example: OTel Collector to Grafana Loki

For SREs running a self-hosted observability stack, the following OTel Collector configuration demonstrates how to receive OTLP logs from AEGIS and fan them out to Grafana Loki while preserving structured attributes:

```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  # Map OTLP attributes to Loki labels for efficient indexing
  resource:
    attributes:
      - key: service.name
        action: upsert
        value: "aegis-orchestrator"

exporters:
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
    labels:
      resource:
        service.name: "service_name"
        deployment.environment: "env"
      attributes:
        level: "level"

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]
```

Minimum Export Level

Set otlp_min_level to filter how verbose the OTLP stream is, independently of stdout:

```yaml
logging:
  level: debug          # verbose on stdout (useful during development)
  otlp_min_level: info  # only ship info+ to the backend (default)
```

Accepted values: error, warn, info, debug, trace. Override with AEGIS_OTLP_LOG_LEVEL.
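Conceptually, the export filter compares each record's level against otlp_min_level using the standard severity ordering. The following is an illustrative sketch of that comparison, not the orchestrator's own code:

```python
# Most severe first; a record is exported if it is at least as severe
# as the configured minimum.
LEVELS = ["error", "warn", "info", "debug", "trace"]

def should_export(record_level: str, min_level: str) -> bool:
    """True if record_level is at or above min_level in severity."""
    return LEVELS.index(record_level) <= LEVELS.index(min_level)

assert should_export("warn", "info")       # warn is more severe than info
assert not should_export("debug", "info")  # debug is filtered out
```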

Resource Attributes

Every exported log record automatically includes the following OpenTelemetry resource attributes:

| Attribute | Source | Example |
|-----------|--------|---------|
| service.name | otlp_service_name config key or AEGIS_OTLP_SERVICE_NAME | aegis-orchestrator |
| service.version | Compiled binary version | 0.1.0-pre-alpha |
| deployment.environment | metadata.labels.environment (when set) | production |

Batch Tuning

Log records are buffered and exported in batches. The defaults are suitable for most workloads; tune only when you observe dropped records or high-memory usage:

```yaml
logging:
  otlp_endpoint: "http://otel-collector:4317"
  batch:
    max_queue_size: 4096        # increase if logs spike during high-iteration runs
    scheduled_delay_ms: 2000    # flush more frequently
    max_export_batch_size: 512  # records per HTTP/gRPC call
    export_timeout_ms: 10000    # per-call timeout
```

| Field | Default | Description |
|-------|---------|-------------|
| max_queue_size | 2048 | Maximum buffered records. Records are dropped if the queue is full. |
| scheduled_delay_ms | 5000 | Flush interval in milliseconds. |
| max_export_batch_size | 512 | Records per export RPC. |
| export_timeout_ms | 10000 | Per-call timeout in milliseconds. |

TLS Configuration

For self-signed certificates or private CA chains:

```yaml
logging:
  otlp_endpoint: "https://otel-collector.internal:4317"
  tls:
    verify: true                              # keep true in production
    ca_cert_path: /etc/aegis/internal-ca.pem  # custom CA bundle
```

Set verify: false only for local development; it disables all certificate validation.

Environment Variable Reference

All OTLP settings can be supplied (or overridden) at runtime via environment variables, without modifying the node config file:

| Variable | Config equivalent | Notes |
|----------|-------------------|-------|
| AEGIS_OTLP_ENDPOINT | logging.otlp_endpoint | Setting this variable enables OTLP export |
| AEGIS_OTLP_PROTOCOL | logging.otlp_protocol | grpc or http |
| AEGIS_OTLP_HEADERS | logging.otlp_headers | Comma-separated key=value pairs |
| AEGIS_OTLP_LOG_LEVEL | logging.otlp_min_level | Min level exported to OTLP |
| AEGIS_OTLP_SERVICE_NAME | logging.otlp_service_name | service.name resource attribute |
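The override behavior can be pictured as environment-first resolution. This is a hypothetical sketch of the precedence (environment variable, then node config, then default), not the orchestrator's actual resolver:

```python
import os

def resolve(env_var: str, config_value, default=None):
    """Environment variable wins over the node-config value,
    which wins over the built-in default."""
    return os.environ.get(env_var) or config_value or default

# Env var set: it overrides the node config's "grpc".
os.environ["AEGIS_OTLP_PROTOCOL"] = "http"
print(resolve("AEGIS_OTLP_PROTOCOL", "grpc"))
# → http

# Nothing set anywhere: the default applies.
print(resolve("AEGIS_OTLP_SERVICE_NAME", None, "aegis-orchestrator"))
# → aegis-orchestrator
```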

Metrics (Prometheus)

AEGIS exposes real-time operational metrics via a Prometheus-compatible endpoint. This allows you to monitor system health, execution performance, and security events using tools like Prometheus and Grafana.

The metrics endpoint is served over a dedicated HTTP listener, separate from the main API and gRPC ports.

Quick Start

By default, Prometheus metrics are enabled on port 9091. You can configure this in your node configuration:

```yaml
spec:
  observability:
    metrics:
      enabled: true
      port: 9091
      path: "/metrics"
```

Or via environment variables:

```shell
export AEGIS_METRICS_ENABLED=true
export AEGIS_METRICS_PORT=9091
```

Scraping with Prometheus

Add the AEGIS node to your prometheus.yml scrape configuration:

```yaml
scrape_configs:
  - job_name: 'aegis-orchestrator'
    static_configs:
      - targets: ['localhost:9091']
```

Key Observable Metrics

AEGIS provides a wide range of metrics across different subsystems:

  • Executions: Active execution count, total completions/failures, and duration histograms.
  • SEAL Security: Policy violations, attestation success/failure rates, and session counts.
  • Storage (NFS): File operation counts, latencies, and total bytes read/written.
  • Workflows: Active workflow executions and state transition counters.
  • System: Node uptime and static version/identity information.

For a complete list of available metrics, labels, and descriptions, see the Metrics Reference.
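Once scraped, these metrics support dashboard and alert queries such as the following. The metric names here are hypothetical placeholders; verify the exact names against the Metrics Reference before use:

```promql
# Hypothetical metric names — confirm against the Metrics Reference
sum(aegis_executions_active)                  # current active executions
rate(aegis_executions_failed_total[5m])       # execution failure rate

# p95 execution duration from a histogram
histogram_quantile(0.95, rate(aegis_execution_duration_seconds_bucket[5m]))
```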

Security Note

The metrics endpoint is unauthenticated by design, following standard Prometheus patterns. Ensure that the metrics port (9091 by default) is protected by your network firewall or Kubernetes NetworkPolicy and is not exposed to the public internet.

