Monitoring¶

By the end of this page, you'll know how to monitor system health, view metrics, and configure alerts.

Health checks¶

HTTP endpoint¶

KruxOS exposes a health endpoint on port 7701:

curl -s http://localhost:7701/health | python3 -m json.tool

Expected output:

{
    "status": "healthy",
    "version": "1.0.0",
    "uptime_seconds": 14400,
    "services": {
        "gateway": "healthy",
        "vault": "healthy",
        "proxy": "healthy",
        "audit": "healthy",
        "state": "healthy"
    },
    "resources": {
        "cpu_percent": 12.5,
        "memory_used_mb": 256,
        "memory_total_mb": 2048,
        "disk_used_percent": 34.2
    }
}

Health status values:

Status	Meaning
`healthy`	All services operating normally
`degraded`	Some services have issues but the system is functional
`unhealthy`	Critical services are down
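As a rough illustration of how a script might act on these values, the sketch below parses a `/health` response body and lists the services that are not reporting `healthy`. The function name `summarize_health` is hypothetical, and the sketch assumes per-service entries use the same vocabulary as the top-level status, which the sample output suggests but does not guarantee.

```python
import json

# Status values documented in the table above; anything else is treated as unknown.
KNOWN_STATUSES = {"healthy", "degraded", "unhealthy"}


def summarize_health(raw_json: str) -> dict:
    """Return the overall status and the services that are not healthy."""
    payload = json.loads(raw_json)
    status = payload.get("status", "unknown")
    if status not in KNOWN_STATUSES:
        status = "unknown"
    services = payload.get("services", {})
    # Assumption: a service counts as failing whenever its value is not "healthy".
    failing = sorted(name for name, state in services.items() if state != "healthy")
    return {"status": status, "failing_services": failing}


sample = json.dumps({
    "status": "degraded",
    "services": {"gateway": "healthy", "proxy": "degraded", "vault": "healthy"},
})
print(summarize_health(sample))
# {'status': 'degraded', 'failing_services': ['proxy']}
```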

CLI health check¶

kruxos status

Expected output:

System Health: HEALTHY
━━━━━━━━━━━━━━━━━━━━━━━
  CPU:     12.5% (ok)
  Memory:  256 MB / 2048 MB (ok)
  Disk:    34.2% (ok)
  Gateway: running
  Vault:   unlocked
  Proxy:   syncing (last: 2m ago)
  Audit:   writing (chain: verified)

Dashboard¶

The Health page at http://localhost:7800/health shows:

Real-time CPU, memory, and disk graphs
Per-service health status with history
Active alerts
Resource trend lines

Metrics¶

System metrics¶

Query system metrics via the CLI or SDK:

kruxos audit stats --last 24h

Agents can query metrics programmatically:

# System-level metrics
result = await os.call_async("system.metrics", category="system")
# Returns: cpu_percent, memory_used_mb, disk_used_percent, uptime_seconds

# Agent-level metrics
result = await os.call_async("system.metrics", category="agents")
# Returns: active_count, total_sessions, invocations_per_minute

# Policy metrics
result = await os.call_async("system.metrics", category="policy")
# Returns: evaluations_total, denied_count, approval_pending_count

# HTTP metrics
result = await os.call_async("system.metrics", category="http")
# Returns: requests_total, latency_p50, latency_p99
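If an agent needs all four categories at once, the calls can probably be issued concurrently. The sketch below is one way to do that under two assumptions: that the SDK object tolerates concurrent `call_async` calls, and that it can be passed around as an ordinary argument. `collect_metrics` and `FakeClient` are hypothetical names; `FakeClient` only stands in for the SDK object so the example runs on its own.

```python
import asyncio

# The four categories shown above.
CATEGORIES = ("system", "agents", "policy", "http")


async def collect_metrics(client, categories=CATEGORIES) -> dict:
    """Query every metrics category concurrently and key results by category."""
    results = await asyncio.gather(
        *(client.call_async("system.metrics", category=c) for c in categories)
    )
    return dict(zip(categories, results))


class FakeClient:
    """Hypothetical stand-in for the SDK object; it just echoes the request."""

    async def call_async(self, name, **kwargs):
        return {"capability": name, "category": kwargs["category"]}


snapshot = asyncio.run(collect_metrics(FakeClient()))
print(sorted(snapshot))
# ['agents', 'http', 'policy', 'system']
```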

Alerts¶

Automatic alerts¶

KruxOS automatically monitors for these conditions:

Condition	Threshold	Alert
High CPU	> 90% for 5 min	Warning
High memory	> 85%	Warning
Disk space	> 90%	Critical
Audit write failure	Any failure	Critical
Service down	Health check fail	Critical
Approval waiting	> 30 min	Info
Rate limit exceeded	Any agent	Warning
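To make the resource thresholds in the table concrete, here is a minimal sketch that applies the CPU, memory, and disk rules to a snapshot shaped like the `resources` object from `/health`. It is not KruxOS's internal logic. It assumes memory percentage means used divided by total, and that `cpu_samples` (a hypothetical input) holds readings covering the last 5 minutes.

```python
def evaluate_thresholds(resources: dict, cpu_samples: list) -> list:
    """Return (condition, severity) pairs for a resource snapshot.

    cpu_samples holds CPU percentages covering the last 5 minutes;
    the sampling interval is an assumption of this sketch.
    """
    alerts = []
    # High CPU: every sample in the 5-minute window must exceed 90%.
    if cpu_samples and all(sample > 90 for sample in cpu_samples):
        alerts.append(("High CPU", "warning"))
    # High memory: used / total as a percentage, above 85% (assumed definition).
    memory_percent = 100 * resources["memory_used_mb"] / resources["memory_total_mb"]
    if memory_percent > 85:
        alerts.append(("High memory", "warning"))
    # Disk space: above 90% used is critical.
    if resources["disk_used_percent"] > 90:
        alerts.append(("Disk space", "critical"))
    return alerts


busy = {"memory_used_mb": 1900, "memory_total_mb": 2048, "disk_used_percent": 34.2}
print(evaluate_thresholds(busy, [95.0, 97.2, 93.1]))
# [('High CPU', 'warning'), ('High memory', 'warning')]
```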

Agent-triggered alerts¶

Agents can send alerts to supervisors:

await os.call_async(
    "alerts.send",
    severity="warning",
    title="Deployment failed",
    message="Tests failed on commit abc1234. Manual review needed.",
)
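A common pattern is to wrap a risky step so that a failure raises an alert before the error propagates. The sketch below shows one possible shape for that. `run_with_alert` and `RecordingClient` are hypothetical; the latter only stands in for the SDK object so the example runs on its own, and the message format is an arbitrary choice.

```python
import asyncio


async def run_with_alert(client, title: str, task):
    """Await task(); on failure, send a warning alert and re-raise."""
    try:
        return await task()
    except Exception as exc:
        await client.call_async(
            "alerts.send",
            severity="warning",
            title=title,
            message=f"{type(exc).__name__}: {exc}",
        )
        raise


class RecordingClient:
    """Hypothetical stand-in for the SDK object; it records calls instead of sending."""

    def __init__(self):
        self.sent = []

    async def call_async(self, name, **kwargs):
        self.sent.append((name, kwargs))


async def failing_deploy():
    raise RuntimeError("tests failed")


client = RecordingClient()
try:
    asyncio.run(run_with_alert(client, "Deployment failed", failing_deploy))
except RuntimeError:
    pass  # the original error still reaches the caller
print(client.sent[0][1]["message"])
# RuntimeError: tests failed
```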

Viewing alerts¶

# Recent alerts
kruxos alerts --last 24h

# Critical only
kruxos alerts --severity critical

On the dashboard, alerts appear as banners on every page and in detail on the Health page.

Alert deduplication¶

KruxOS deduplicates identical alerts. If the same condition triggers repeatedly, you'll see one alert with a count and the time range, not a flood of notifications.
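The idea can be sketched as grouping alerts by an identity key and tracking a count plus the first and last occurrence. This is only an illustration: what KruxOS treats as "identical" is not specified here, so the `(severity, title)` key, the `AlertGroup` type, and the dict-shaped input are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class AlertGroup:
    severity: str
    title: str
    count: int
    first_seen: float
    last_seen: float


def deduplicate(alerts: list) -> list:
    """Collapse alerts sharing (severity, title) into one group each.

    Each input alert is a dict with "severity", "title" and "timestamp"
    (seconds); this key choice is an assumption of the sketch.
    """
    groups = {}
    for alert in alerts:
        key = (alert["severity"], alert["title"])
        ts = alert["timestamp"]
        group = groups.get(key)
        if group is None:
            groups[key] = AlertGroup(alert["severity"], alert["title"], 1, ts, ts)
        else:
            group.count += 1
            group.first_seen = min(group.first_seen, ts)
            group.last_seen = max(group.last_seen, ts)
    return list(groups.values())


raw = [
    {"severity": "warning", "title": "High CPU", "timestamp": 100.0},
    {"severity": "warning", "title": "High CPU", "timestamp": 160.0},
    {"severity": "critical", "title": "Disk space", "timestamp": 170.0},
    {"severity": "warning", "title": "High CPU", "timestamp": 220.0},
]
for group in deduplicate(raw):
    print(group.title, group.count, group.first_seen, group.last_seen)
# High CPU 3 100.0 220.0
# Disk space 1 170.0 170.0
```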

Monitoring the Service Proxy¶

Gmail sync status¶

kruxos status

The status output includes proxy health:

Proxy:      syncing (last: 2m ago)
  Gmail:    connected, 2347 messages synced
  Buffer:   0 pending writes

On the dashboard, navigate to Service Proxy for detailed sync status, write buffer contents, and error history.

External monitoring integration¶

Health endpoint for load balancers¶

The /health endpoint returns HTTP 200 when healthy and HTTP 503 when unhealthy. Use this for:

Load balancer health checks
Kubernetes liveness/readiness probes
Uptime monitoring services
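For an uptime check that relies only on the status code, something like the following standard-library sketch may be enough. The function name `probe` and the 2-second default timeout are illustrative choices; the sketch treats anything other than HTTP 200, including connection errors, as down.

```python
import urllib.error
import urllib.request


def probe(url: str = "http://localhost:7701/health", timeout: float = 2.0) -> bool:
    """Return True only when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # Raised for error responses such as the documented 503.
        return False
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: treat as down.
        return False
```

A wrapper script could call `probe()` and exit with status 0 or 1, which is the convention most exec-style probes and cron-based monitors expect.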

Prometheus-compatible metrics¶

The /health endpoint returns JSON. To feed it to a Prometheus-style collector, convert the fields you need into metric lines with a short script:

curl -s http://localhost:7701/health | python3 -c "
import json, sys
data = json.load(sys.stdin)
print(f'kruxos_cpu_percent {data[\"resources\"][\"cpu_percent\"]}')
print(f'kruxos_memory_used_mb {data[\"resources\"][\"memory_used_mb\"]}')
print(f'kruxos_status{{status=\"{data[\"status\"]}\"}} 1')
"

Next steps¶

Backup & Restore — protect your data
Updating KruxOS — apply updates safely
Troubleshooting — common issues and solutions