Benchmarks

KruxOS includes a reproducible benchmark framework that compares AI agent performance when working through structured capabilities (KruxOS) against raw shell commands (an Ubuntu baseline).
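
To make the comparison concrete, the sketch below shows the two interface styles under test. The commented KruxOS call is an illustrative assumption, not the SDK's documented API; the Ubuntu side is the plain subprocess pattern the baseline agent uses.

# Illustrative comparison of the two interfaces. The kruxos client API
# shown in comments is an assumption, not the SDK's actual surface.
import subprocess

# KruxOS path: a structured capability call through the Gateway.
# from kruxos import Client                         # hypothetical import
# result = Client().invoke("fs.read", {"path": "/etc/hostname"})
# print(result["content"])                          # typed JSON, no parsing

# Ubuntu baseline path: a raw shell command whose output the agent
# must interpret as free-form text.
proc = subprocess.run(["cat", "/etc/hostname"], capture_output=True, text=True)
print(proc.stdout.strip())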

Methodology

Test environment

Property  | KruxOS                              | Ubuntu baseline
Platform  | KruxOS v0.0.1                       | Ubuntu 24.04 (Docker)
Interface | Python SDK → Gateway → capabilities | Bash commands via subprocess
Model     | Same model for both (configurable)
Tasks     | 50 tasks across 5 categories
Runs      | 3 runs per task (configurable)

Task categories

Category   | Tasks | Examples
Filesystem | 15    | Read, write, search, organize files
Process    | 10    | Run commands, monitor processes, handle failures
Network    | 5     | HTTP requests, DNS lookups, port checks
Git        | 10    | Log, diff, commit, branch management
Multi-step | 10    | Complex workflows combining multiple categories

Metrics

Metric               | What it measures
Task completion rate | Percentage of tasks completed successfully
Token usage          | Total tokens consumed (input + output)
Token breakdown      | Discovery, execution, error recovery, wasted
Wall time            | End-to-end execution time
Error recovery       | How the agent handles and recovers from errors
Cost                 | Estimated API cost based on token usage

Expected results

The figures below are estimates derived from the benchmark framework's design and the general behavior of structured versus unstructured interfaces, not measured results; run the suite (see Running benchmarks) to produce real numbers.

Token efficiency

Scenario            | KruxOS        | Ubuntu        | Improvement
Simple file read    | ~200 tokens   | ~500 tokens   | ~60% fewer
Error handling      | ~150 tokens   | ~800 tokens   | ~80% fewer
Multi-step workflow | ~1,200 tokens | ~3,000 tokens | ~60% fewer

Why KruxOS uses fewer tokens (illustrated in the sketch after this list):

  • No output parsing — structured JSON responses vs parsing ls -la text output
  • No error guessing — typed errors with recovery actions vs "command not found" text
  • Efficient discovery — schema-aware tool listing vs man pages and --help
  • No retry waste — clear error types prevent retrying the wrong thing
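
As a sketch of the first two points, compare what the agent must do with each platform's output. Both response shapes here are hypothetical stand-ins, not the actual schemas:

# Hypothetical output shapes for illustration only.

# Ubuntu: the agent gets raw `ls -la` text and parses columns by position.
raw = "-rw-r--r-- 1 root root 4096 Jan  1 00:00 notes.txt"
size = int(raw.split()[4])               # brittle, position-dependent parsing

# KruxOS: the agent gets structured JSON and reads named fields directly.
response = {"ok": True, "entries": [{"name": "notes.txt", "size": 4096}]}
size = response["entries"][0]["size"]    # no parsing, no guesswork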

Task completion

Category   | KruxOS | Ubuntu
Filesystem | ~98%   | ~85%
Process    | ~95%   | ~80%
Network    | ~95%   | ~75%
Git        | ~97%   | ~85%
Multi-step | ~90%   | ~65%

Why KruxOS completes more tasks (see the error sketch after this list):

  • Structured errors tell the agent exactly what went wrong and how to fix it
  • Schema validation catches input errors before execution
  • Policy denials include explanations, not cryptic permission errors
  • Transaction support enables atomic multi-step operations
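
For example, a failed read might come back as a typed error object instead of free text on stderr. The payload below is a hypothetical shape, including the fs.list capability name, shown only to illustrate the pattern:

# Hypothetical error payload; the real error schema may differ.
error = {
    "ok": False,
    "error": {
        "type": "file_not_found",
        "path": "/tmp/missing.txt",
        "recovery": "Create the file, or call fs.list to find the right path.",
    },
}

# The agent branches on a stable error type instead of grepping stderr.
if not error["ok"] and error["error"]["type"] == "file_not_found":
    print(error["error"]["recovery"])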

Cost comparison

For a typical workload of 1,000 capability invocations per day:

Metric                       | KruxOS | Ubuntu  | Savings
Daily tokens                 | ~200K  | ~500K   | 60%
Monthly cost (Claude Sonnet) | ~$6    | ~$15    | $9/month
Monthly cost (GPT-4o)        | ~$5    | ~$12.50 | $7.50/month
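
The monthly figures above follow directly from the daily token counts. Here is a quick sketch of the arithmetic; the blended per-token rate is back-derived from the table (~$6 for ~6M tokens), not a quoted provider price:

# Back-of-envelope projection matching the Claude Sonnet row above.
# blended_rate_per_m is an assumption implied by the table, not an API price.
daily_tokens = {"kruxos": 200_000, "ubuntu": 500_000}
blended_rate_per_m = 1.00  # USD per million tokens (assumed)

for platform, tokens in daily_tokens.items():
    monthly_tokens = tokens * 30
    cost = monthly_tokens / 1_000_000 * blended_rate_per_m
    print(f"{platform}: {monthly_tokens / 1_000_000:.0f}M tokens/month, ~${cost:.2f}")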

Running benchmarks

Prerequisites

# v0.0.1: SDK ships bundled in /opt/kruxos/sdk/python/ on the appliance.
# Host-side pip install kruxos lands in v0.0.3 — copy the SDK off the
# appliance until then.
docker pull altvale/kruxos:latest
docker pull kruxos/benchmark-ubuntu:24.04

Run the full suite

cd benchmarks/
python runner.py run --platform kruxos --model claude-sonnet --runs 3
python runner.py run --platform ubuntu --model claude-sonnet --runs 3

Generate comparison report

python runner.py compare \
  --kruxos results/kruxos-claude-sonnet.json \
  --ubuntu results/ubuntu-claude-sonnet.json \
  --output report.html

This generates an HTML report with:

  • Per-category comparison tables
  • Token usage breakdown charts (SVG)
  • Statistical summaries (mean, median, p95; a computation sketch follows this list)
  • Cost projections
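
As referenced above, the summaries could be computed from a results file along these lines. This assumes the file is a JSON list of run records with a wall_time_s field, which is an invented schema detail:

# Hypothetical post-processing; field names are assumed, not documented.
import json
import statistics

with open("results/kruxos-claude-sonnet.json") as f:
    runs = json.load(f)                      # assumed: a list of run records

times = sorted(r["wall_time_s"] for r in runs)
p95 = times[int(0.95 * (len(times) - 1))]    # nearest-rank 95th percentile
print(f"mean={statistics.mean(times):.2f}s "
      f"median={statistics.median(times):.2f}s p95={p95:.2f}s")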

Filter by category

python runner.py run --platform kruxos --category filesystem --runs 5

Reproducing results

The benchmark framework is fully reproducible:

  1. Task definitions are version-controlled YAML files (a sketch appears after this list)
  2. The Ubuntu baseline runs in a Docker container for a consistent environment
  3. Each run records the model, timestamp, and full token breakdown
  4. Results are saved as JSON for automated comparison
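
As noted in item 1, tasks are defined in YAML. The sketch below shows what one could look like; every field name here is an illustrative assumption, not the framework's actual schema:

# Hypothetical task definition; consult the repository for the actual format.
# Requires PyYAML: pip install pyyaml
import yaml

task_yaml = """
id: fs-read-001
category: filesystem
prompt: Read /etc/hostname and report its contents.
success:
  type: output_contains
  value: hostname
"""

task = yaml.safe_load(task_yaml)
print(task["id"], task["category"])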

See benchmarks/README.md in the repository for detailed setup instructions.