# CLI Modes
AgenticSwarmBench (ASB) has five modes, organized around two headline features (`record`/`replay`) plus synthetic load (`speed`), end-to-end agent runs (`agent`), and an experimental correctness evaluator (`eval`).
## Overview
`record` and `replay` are the headline features: capture a real coding session once, then replay that exact sequence against any endpoint for apples-to-apples comparisons. `speed` generates synthetic agentic load for controlled tests. `agent` measures what it feels like to use an endpoint end-to-end. `eval` is an experimental correctness mode.
## Record/Replay vs Speed vs Agent
| | `record` / `replay` | `speed` | `agent` |
|---|---|---|---|
| What talks to your endpoint | You during record, ASB during replay | ASB directly (one synthetic request) | A real agent (Claude Code) through a proxy |
| Requests per task | Whatever the real session had | 1 | 5-15+ (real tool-use turns) |
| Context | Your actual session context | Synthetic padding to target size | Grows naturally as the agent works |
| Use case | Benchmark with your real traffic | Raw throughput at controlled sizes | "What does it feel like to use this endpoint?" |
## asb record

**Headline feature.** Capture real coding sessions as replayable JSONL.

Starts a recording proxy between your agent and your LLM endpoint. Every request/response pair is saved as one JSONL line. Supports both OpenAI-compatible upstreams (the proxy translates the Anthropic Messages API) and Anthropic passthrough (auto-detected when the upstream is api.anthropic.com).
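A recorded session is just line-delimited JSON, so it can be sanity-checked with ordinary shell tools before you replay it. A minimal sketch, assuming `jq` is installed and `my-session.jsonl` came from `-o`; the record schema is ASB's own, so treat the field listing as exploratory:

```bash
# Count recorded request/response pairs (one JSONL line each).
wc -l my-session.jsonl

# Peek at the top-level fields of the first recorded pair.
# Assumes jq is installed; the schema is whatever ASB writes, so inspect
# before relying on any particular field name.
head -n 1 my-session.jsonl | jq 'keys'
```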
### Examples
**Record with OpenAI-compatible upstream**

```bash
asb record \
  -e http://your-gpu-server:8000 \
  -m your-model
```

**Record with Anthropic passthrough (full fidelity)**

```bash
asb record \
  -e https://api.anthropic.com \
  -m claude-sonnet-4-20250514 \
  -k $ANTHROPIC_API_KEY \
  --api-key-header x-api-key \
  -o my-session.jsonl
```

**Point your agent at the proxy**
```bash
ANTHROPIC_BASE_URL=http://localhost:19000 claude
```
## asb replay

**Headline feature.** Replay recorded sessions against any endpoint.
Takes a recorded scenario and replays it against any endpoint, hardware, or configuration. Requests are grouped by context size and produce the same metrics as `speed` mode, but using your real traffic instead of synthetic padding. Defaults to `realistic` cache mode (shared prefix preserved, per-user context poisoned).
### Examples
**Replay against a new endpoint**

```bash
asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w my-session.jsonl
```

**Replay a scenario directory with a schedule**

```bash
asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w ./scenarios/my-scenario/ \
  --repetitions 3 --max-concurrent 5 --policy sequential
```

**Optimistic cached numbers**

```bash
asb replay -e URL -m MODEL -w scenario --cache-mode allwarm
```

**Cap the replay at 1M cumulative prompt tokens**

```bash
asb replay -e URL -m MODEL -w session.jsonl --slice-tokens 1000000
```

**Preview without sending requests**
```bash
asb replay -e URL -m MODEL -w session.jsonl --dry-run
```
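To bracket prefix-cache impact on your real traffic, replay the same session under both cache modes described above: the default run gives the realistic number, and `allwarm` gives the optimistic ceiling. A sketch using only flags shown on this page:

```bash
# Realistic baseline (realistic is the default cache mode).
asb replay -e http://new-server:8000 -m my-model -w my-session.jsonl

# Optimistic upper bound: shared prefixes are treated as already warm.
asb replay -e http://new-server:8000 -m my-model -w my-session.jsonl --cache-mode allwarm
```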
## asb speed

*Synthetic inference speed under agentic load.*
When you don't have a recording yet, `speed` generates realistic agentic context synthetically. Each request is padded with system prompts, tool schemas, multi-turn conversation history, file contents, and error traces, so the model sees what it would see in a real coding session. The default cache mode is `allcold` (true cold start).
### Examples
**Default realistic sweep (fresh → full)**

```bash
asb speed -e http://localhost:8000 -m my-model
```

**Specific concurrency and context profile**

```bash
asb speed -e http://localhost:8000 -m my-model -u 32 -p long
```

**Fixed token count stress test**

```bash
asb speed -e http://localhost:8000 -m my-model -c 50000 -u 16
```

**Measure prefix cache impact (runs allcold then allwarm)**

```bash
asb speed -e http://localhost:8000 -m my-model --cache-mode realistic
```

**JSON output for CI/CD**
```bash
asb speed -e http://localhost:8000 -m my-model --format json -o results.json
```
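For a scaling curve rather than a single data point, the JSON output composes with a plain shell loop. A sketch sweeping the `-u` concurrency flag shown above (the output filenames are arbitrary):

```bash
# Sweep concurrency levels and keep one JSON report per run.
for u in 1 8 32; do
  asb speed -e http://localhost:8000 -m my-model -u "$u" \
    --format json -o "speed-u${u}.json"
done
```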
## asb agent

*End-to-end benchmark with a real agent process.*
Runs a real agent process (Claude Code by default) end-to-end and records per-request timing for every LLM call across the entire multi-turn session. Captures how latency compounds over a real session: each turn's context grows as the agent reads files, runs tests, and fixes errors.
### Examples
**Run agent benchmark on the first 10 tasks**

```bash
asb agent \
  -e http://localhost:8000 \
  -m my-model \
  -t p1-p10
```

**Use a different agent CLI**

```bash
asb agent \
  -e http://localhost:8000 \
  -m my-model \
  -t p1-p10 \
  --agent-cmd my-agent
```
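A custom `--agent-cmd` presumably has to resolve to a runnable binary, so it can save a long benchmark run to check that up front. This is a generic shell check, not an ASB feature:

```bash
# Fail fast (in a script) if the custom agent CLI is not on PATH.
command -v my-agent >/dev/null || { echo "my-agent not found on PATH" >&2; exit 1; }
```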
## asb eval

**Experimental.** Code correctness validation.
An optional mode that sends the same tasks with agentic context but validates the generated code instead of measuring speed. Useful for checking whether your model still produces correct code under large-context pressure.
### Examples
**Syntax validation (does it parse?)**

```bash
asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v syntax
```

**Execution validation (does it run?)**
```bash
asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v execution
```
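Execution validation is strictly more expensive than syntax validation, so the cheap check can gate the expensive one. The chaining below assumes `asb` exits nonzero when validation fails, which is conventional for CLIs but not documented here:

```bash
# Run the cheap syntax gate first; only run execution validation if it passes.
# ASSUMPTION: asb returns a nonzero exit code on validation failure.
asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v syntax && \
  asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v execution
```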
## Helper Commands

### asb list-tasks - Browse Available Tasks
```bash
asb list-tasks                          # Show all 110 tasks
asb list-tasks -t trivial               # Filter by tier
asb list-tasks --tags typescript,rust   # Filter by language
```

### asb list-scenarios - Browse Built-in Scenarios
```bash
asb list-scenarios
asb list-scenarios --format json
```

### asb compare - Compare Two Runs
Generates a head-to-head table, ASCII bar chart, and winner summary between two JSON reports.
```bash
asb compare --baseline a.json --candidate b.json -o comparison.md
```
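Putting the pieces together: a minimal A/B workflow produces the two JSON reports with `speed` (its `--format json -o` flags are shown above) and then diffs them. The hostnames here are placeholders:

```bash
# Benchmark two endpoints with identical synthetic load, then compare.
asb speed -e http://old-server:8000 -m my-model --format json -o a.json
asb speed -e http://new-server:8000 -m my-model --format json -o b.json
asb compare --baseline a.json --candidate b.json -o comparison.md
```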