# Benchmarks

KruxOS includes a reproducible benchmark framework that measures AI agent performance when working through structured capabilities (KruxOS) versus raw shell commands (an Ubuntu baseline).
## Methodology

### Test environment
| Property | KruxOS | Ubuntu baseline |
|---|---|---|
| Platform | KruxOS v0.0.1 | Ubuntu 24.04 (Docker) |
| Interface | Python SDK → Gateway → capabilities | Bash commands via subprocess |
| Model | Same model for both (configurable) | |
| Tasks | 50 tasks across 5 categories | |
| Runs | 3 runs per task (configurable) | |
### Task categories
| Category | Tasks | Examples |
|---|---|---|
| Filesystem | 15 | Read, write, search, organize files |
| Process | 10 | Run commands, monitor processes, handle failures |
| Network | 5 | HTTP requests, DNS lookups, port checks |
| Git | 10 | Log, diff, commit, branch management |
| Multi-step | 10 | Complex workflows combining multiple categories |
### Metrics
| Metric | What it measures |
|---|---|
| Task completion rate | Percentage of tasks completed successfully |
| Token usage | Total tokens consumed (input + output) |
| Token breakdown | Discovery, execution, error recovery, wasted |
| Wall time | End-to-end execution time |
| Error recovery | How the agent handles and recovers from errors |
| Cost | Estimated API cost based on token usage |
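These metrics reduce to simple aggregations over per-run records. A sketch of what that might look like, with illustrative field names rather than the framework's actual result schema:

```python
from statistics import mean, median

# Hypothetical per-run records; the field names are illustrative only,
# not the benchmark framework's actual schema.
runs = [
    {"completed": True,  "tokens": {"discovery": 40, "execution": 150, "error_recovery": 0,  "wasted": 0},  "wall_time_s": 1.2},
    {"completed": True,  "tokens": {"discovery": 45, "execution": 160, "error_recovery": 30, "wasted": 10}, "wall_time_s": 1.5},
    {"completed": False, "tokens": {"discovery": 50, "execution": 200, "error_recovery": 90, "wasted": 60}, "wall_time_s": 2.4},
]

# Task completion rate: fraction of runs that finished successfully.
completion_rate = sum(r["completed"] for r in runs) / len(runs)

# Token usage: sum the breakdown for each run.
total_tokens = [sum(r["tokens"].values()) for r in runs]

print(f"completion rate: {completion_rate:.0%}")
print(f"mean tokens/run: {mean(total_tokens):.0f}")
print(f"median wall time: {median(r['wall_time_s'] for r in runs)}s")
```

Cost is then derived from the token totals using the model's per-token pricing.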
## Expected results

Based on the benchmark framework's design and the nature of structured versus unstructured interfaces, the following results are expected:
### Token efficiency
| Scenario | KruxOS | Ubuntu | Improvement |
|---|---|---|---|
| Simple file read | ~200 tokens | ~500 tokens | ~60% fewer |
| Error handling | ~150 tokens | ~800 tokens | ~80% fewer |
| Multi-step workflow | ~1,200 tokens | ~3,000 tokens | ~60% fewer |
Why KruxOS uses fewer tokens:
- No output parsing — structured JSON responses vs parsing `ls -la` text output
- No error guessing — typed errors with recovery actions vs "command not found" text
- Efficient discovery — schema-aware tool listing vs man pages and `--help`
- No retry waste — clear error types prevent retrying the wrong thing
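The output-parsing point is easy to see in code: the shell route extracts positional columns from free-form `ls -la` text, while the structured route reads a typed field. Both payloads below are illustrative, not the actual Gateway schema:

```python
import json

# Shell route: the agent parses positional columns out of free-form text.
ls_line = "-rw-r--r-- 1 root root 4096 Jan  1 12:00 notes.txt"
parts = ls_line.split()
shell_size = int(parts[4])  # brittle: breaks on filenames with spaces

# Structured route: a typed JSON response needs no parsing heuristics.
response = json.loads('{"name": "notes.txt", "size": 4096, "mode": "0644"}')
structured_size = response["size"]

assert shell_size == structured_size == 4096
```

The shell version also costs tokens every time the model re-derives the column layout; the structured version is a constant-cost lookup.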
### Task completion
| Category | KruxOS | Ubuntu |
|---|---|---|
| Filesystem | ~98% | ~85% |
| Process | ~95% | ~80% |
| Network | ~95% | ~75% |
| Git | ~97% | ~85% |
| Multi-step | ~90% | ~65% |
Why KruxOS completes more tasks:
- Structured errors tell the agent exactly what went wrong and how to fix it
- Schema validation catches input errors before execution
- Policy denials include explanations, not cryptic permission errors
- Transaction support enables atomic multi-step operations
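As a sketch of the first point, a typed error lets the agent branch on a code instead of guessing from stderr text. The error shape below is illustrative, not the actual Gateway error schema:

```python
# A typed error carries machine-readable recovery guidance, so the agent
# branches on a code instead of inferring intent from stderr text.
# (Hypothetical error shape, for illustration only.)
error = {
    "code": "PATH_NOT_FOUND",
    "path": "/data/report.csv",
    "recovery": {"action": "create_parent", "target": "/data"},
}

if error["code"] == "PATH_NOT_FOUND":
    next_step = error["recovery"]["action"]  # one deterministic retry
else:
    next_step = "ask_user"

# Compare: a shell agent sees only "No such file or directory" and must
# guess which path failed and what to do about it.
```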
### Cost comparison
For a typical workload of 1,000 capability invocations per day:
| Metric | KruxOS | Ubuntu | Savings |
|---|---|---|---|
| Daily tokens | ~200K | ~500K | 60% |
| Monthly cost (Claude Sonnet) | ~$6 | ~$15 | $9/month |
| Monthly cost (GPT-4o) | ~$5 | ~$12.50 | $7.50/month |
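The projections are straightforward arithmetic. The sketch below assumes a blended rate of $1 per million tokens, which reproduces the Claude Sonnet row; actual pricing depends on the model and the input/output mix:

```python
# Assumed blended rate in $/1M tokens (illustrative; tune to your model's
# actual input/output pricing mix).
BLENDED_RATE_PER_M = 1.00
DAYS_PER_MONTH = 30

def monthly_cost(daily_tokens: int) -> float:
    """Project monthly API cost from daily token usage."""
    return daily_tokens * DAYS_PER_MONTH * BLENDED_RATE_PER_M / 1_000_000

print(monthly_cost(200_000))  # KruxOS:  6.0
print(monthly_cost(500_000))  # Ubuntu: 15.0
```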
## Running benchmarks

### Prerequisites
```bash
# v0.0.1: SDK ships bundled in /opt/kruxos/sdk/python/ on the appliance.
# Host-side pip install kruxos lands in v0.0.3 — copy off the appliance
# until then.
docker pull altvale/kruxos:latest
docker pull kruxos/benchmark-ubuntu:24.04
```
### Run the full suite
```bash
cd benchmarks/
python runner.py run --platform kruxos --model claude-sonnet --runs 3
python runner.py run --platform ubuntu --model claude-sonnet --runs 3
```
### Generate comparison report
```bash
python runner.py compare \
  --kruxos results/kruxos-claude-sonnet.json \
  --ubuntu results/ubuntu-claude-sonnet.json \
  --output report.html
```
This generates an HTML report with:
- Per-category comparison tables
- Token usage breakdown charts (SVG)
- Statistical summaries (mean, median, p95)
- Cost projections
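The p95 line in the statistical summaries can be reproduced with the nearest-rank method; the framework's exact estimator isn't documented here, so this is one common choice:

```python
import math

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile: smallest value with >= 95% of
    samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

wall_times = [1.1, 1.3, 1.2, 4.0, 1.4, 1.2, 1.5, 1.3, 1.1, 1.6]
print(p95(wall_times))  # a single slow outlier dominates the tail
```

Tail percentiles matter here because a single run that spirals into retries can dwarf the mean.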
### Filter by category

The runner can scope a run to a single task category (for example, only the filesystem tasks); see `benchmarks/README.md` for the exact flag.

## Reproducing results
The benchmark framework is fully reproducible:
- Task definitions are version-controlled YAML files
- The Ubuntu baseline runs in a Docker container for consistent environment
- Each run records the model, timestamp, and full token breakdown
- Results are saved as JSON for automated comparison
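Because results land as plain JSON, automated comparison reduces to a short script. The field names below are illustrative, not the framework's actual result schema:

```python
import json

# Hypothetical result payloads; in practice these would be loaded from
# the saved results/*.json files.
kruxos = json.loads('{"completion_rate": 0.96, "total_tokens": 210000}')
ubuntu = json.loads('{"completion_rate": 0.80, "total_tokens": 490000}')

token_savings = 1 - kruxos["total_tokens"] / ubuntu["total_tokens"]
print(f"token savings: {token_savings:.0%}")
```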
See `benchmarks/README.md` in the repository for detailed setup instructions.