# Benchmarks

KruxOS includes a reproducible benchmark framework that measures AI agent performance when working through structured capabilities (KruxOS) versus raw shell commands (an Ubuntu baseline).
## Methodology

### Test environment
| Property | KruxOS | Ubuntu baseline |
|---|---|---|
| Platform | KruxOS v0.0.1 | Ubuntu 24.04 (Docker) |
| Interface | Python SDK → Gateway → capabilities | Bash commands via subprocess |
| Model | Same model for both (configurable) | |
| Tasks | 50 tasks across 5 categories | |
| Runs | 3 runs per task (configurable) | |
### Task categories
| Category | Tasks | Examples |
|---|---|---|
| Filesystem | 15 | Read, write, search, organize files |
| Process | 10 | Run commands, monitor processes, handle failures |
| Network | 5 | HTTP requests, DNS lookups, port checks |
| Git | 10 | Log, diff, commit, branch management |
| Multi-step | 10 | Complex workflows combining multiple categories |
### Metrics
| Metric | What it measures |
|---|---|
| Task completion rate | Percentage of tasks completed successfully |
| Token usage | Total tokens consumed (input + output) |
| Token breakdown | Discovery, execution, error recovery, wasted |
| Wall time | End-to-end execution time |
| Error recovery | How the agent handles and recovers from errors |
| Cost | Estimated API cost based on token usage |
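These metrics reduce to simple aggregations over per-run records. A sketch of what that might look like, with illustrative field names rather than the framework's actual result schema:

```python
from statistics import mean, median

# Hypothetical per-run records; the field names are illustrative only,
# not the benchmark framework's actual schema.
runs = [
    {"completed": True,  "tokens": {"discovery": 40, "execution": 150, "error_recovery": 0,  "wasted": 0},  "wall_time_s": 1.2},
    {"completed": True,  "tokens": {"discovery": 45, "execution": 160, "error_recovery": 30, "wasted": 10}, "wall_time_s": 1.5},
    {"completed": False, "tokens": {"discovery": 50, "execution": 200, "error_recovery": 90, "wasted": 60}, "wall_time_s": 2.4},
]

# Task completion rate: fraction of runs that finished successfully.
completion_rate = sum(r["completed"] for r in runs) / len(runs)

# Token usage: sum the breakdown for each run.
total_tokens = [sum(r["tokens"].values()) for r in runs]

print(f"completion rate: {completion_rate:.0%}")
print(f"mean tokens/run: {mean(total_tokens):.0f}")
print(f"median wall time: {median(r['wall_time_s'] for r in runs)}s")
```

Cost is then derived from the token totals using the model's per-token pricing.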
## Expected results

Based on the benchmark framework's design and the nature of structured versus unstructured interfaces, the following results are expected:
### Token efficiency
| Scenario | KruxOS | Ubuntu | Improvement |
|---|---|---|---|
| Simple file read | ~200 tokens | ~500 tokens | ~60% fewer |
| Error handling | ~150 tokens | ~800 tokens | ~80% fewer |
| Multi-step workflow | ~1,200 tokens | ~3,000 tokens | ~60% fewer |
Why KruxOS uses fewer tokens:
- No output parsing — structured JSON responses vs parsing `ls -la` text output
- No error guessing — typed errors with recovery actions vs "command not found" text
- Efficient discovery — schema-aware tool listing vs man pages and `--help`
- No retry waste — clear error types prevent retrying the wrong thing
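The output-parsing point is easy to see in code: the shell route extracts positional columns from free-form `ls -la` text, while the structured route reads a typed field. Both payloads below are illustrative, not the actual Gateway schema:

```python
import json

# Shell route: the agent parses positional columns out of free-form text.
ls_line = "-rw-r--r-- 1 root root 4096 Jan  1 12:00 notes.txt"
parts = ls_line.split()
shell_size = int(parts[4])  # brittle: breaks on filenames with spaces

# Structured route: a typed JSON response needs no parsing heuristics.
response = json.loads('{"name": "notes.txt", "size": 4096, "mode": "0644"}')
structured_size = response["size"]

assert shell_size == structured_size == 4096
```

The shell version also costs tokens every time the model re-derives the column layout; the structured version is a constant-cost lookup.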
### Task completion
| Category | KruxOS | Ubuntu |
|---|---|---|
| Filesystem | ~98% | ~85% |
| Process | ~95% | ~80% |
| Network | ~95% | ~75% |
| Git | ~97% | ~85% |
| Multi-step | ~90% | ~65% |
Why KruxOS completes more tasks:
- Structured errors tell the agent exactly what went wrong and how to fix it
- Schema validation catches input errors before execution
- Policy denials include explanations, not cryptic permission errors
- Transaction support enables atomic multi-step operations
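As a sketch of the first point, a typed error lets the agent branch on a code instead of guessing from stderr text. The error shape below is illustrative, not the actual Gateway error schema:

```python
# A typed error carries machine-readable recovery guidance, so the agent
# branches on a code instead of inferring intent from stderr text.
# (Hypothetical error shape, for illustration only.)
error = {
    "code": "PATH_NOT_FOUND",
    "path": "/data/report.csv",
    "recovery": {"action": "create_parent", "target": "/data"},
}

if error["code"] == "PATH_NOT_FOUND":
    next_step = error["recovery"]["action"]  # one deterministic retry
else:
    next_step = "ask_user"

# Compare: a shell agent sees only "No such file or directory" and must
# guess which path failed and what to do about it.
```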
### Cost comparison
For a typical workload of 1,000 capability invocations per day:
| Metric | KruxOS | Ubuntu | Savings |
|---|---|---|---|
| Daily tokens | ~200K | ~500K | 60% |
| Monthly cost (Claude Sonnet) | ~$6 | ~$15 | $9/month |
| Monthly cost (GPT-4o) | ~$5 | ~$12.50 | $7.50/month |
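The projections are straightforward arithmetic. The sketch below assumes a blended rate of $1 per million tokens, which reproduces the Claude Sonnet row; actual pricing depends on the model and the input/output mix:

```python
# Assumed blended rate in $/1M tokens (illustrative; tune to your model's
# actual input/output pricing mix).
BLENDED_RATE_PER_M = 1.00
DAYS_PER_MONTH = 30

def monthly_cost(daily_tokens: int) -> float:
    """Project monthly API cost from daily token usage."""
    return daily_tokens * DAYS_PER_MONTH * BLENDED_RATE_PER_M / 1_000_000

print(monthly_cost(200_000))  # KruxOS:  6.0
print(monthly_cost(500_000))  # Ubuntu: 15.0
```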
## Running benchmarks

### Prerequisites
```bash
# v0.0.1: SDK ships bundled in /opt/kruxos/sdk/python/ on the appliance.
# Host-side pip install kruxos lands in v0.0.3 — copy off the appliance
# until then.
docker pull altvale/kruxos:latest
docker pull kruxos/benchmark-ubuntu:24.04
```
### Run the full suite
```bash
cd benchmarks/
python runner.py run --platform kruxos --model claude-sonnet --runs 3
python runner.py run --platform ubuntu --model claude-sonnet --runs 3
```
### Generate comparison report
```bash
python runner.py compare \
  --kruxos results/kruxos-claude-sonnet.json \
  --ubuntu results/ubuntu-claude-sonnet.json \
  --output report.html
```
This generates an HTML report with:
- Per-category comparison tables
- Token usage breakdown charts (SVG)
- Statistical summaries (mean, median, p95)
- Cost projections
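The p95 line in the statistical summaries can be reproduced with the nearest-rank method; the framework's exact estimator isn't documented here, so this is one common choice:

```python
import math

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile: smallest value with >= 95% of
    samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

wall_times = [1.1, 1.3, 1.2, 4.0, 1.4, 1.2, 1.5, 1.3, 1.1, 1.6]
print(p95(wall_times))  # a single slow outlier dominates the tail
```

Tail percentiles matter here because a single run that spirals into retries can dwarf the mean.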
### Filter by category

The runner can scope a run to a single task category (for example, only the filesystem tasks); see `benchmarks/README.md` for the exact flag.

## Reproducing results
The benchmark framework is fully reproducible:
- Task definitions are version-controlled YAML files
- The Ubuntu baseline runs in a Docker container for consistent environment
- Each run records the model, timestamp, and full token breakdown
- Results are saved as JSON for automated comparison
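Because results land as plain JSON, automated comparison reduces to a short script. The field names below are illustrative, not the framework's actual result schema:

```python
import json

# Hypothetical result payloads; in practice these would be loaded from
# the saved results/*.json files.
kruxos = json.loads('{"completion_rate": 0.96, "total_tokens": 210000}')
ubuntu = json.loads('{"completion_rate": 0.80, "total_tokens": 490000}')

token_savings = 1 - kruxos["total_tokens"] / ubuntu["total_tokens"]
print(f"token savings: {token_savings:.0%}")
```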
See `benchmarks/README.md` in the repository for detailed setup instructions.