CLI Modes

AgenticSwarmBench has five modes, organized around two headline features (record and replay) plus synthetic load (speed), end-to-end agent runs (agent), and an experimental correctness evaluator (eval).

Overview

record and replay are the headline features - capture a real coding session once, then replay that exact sequence against any endpoint for apples-to-apples comparisons. speed generates synthetic agentic load for controlled tests. agent measures what it feels like to use an endpoint end-to-end. eval is an experimental correctness mode.

Record/Replay vs Speed vs Agent

| | record / replay | speed | agent |
| --- | --- | --- | --- |
| What talks to your endpoint | You during record, ASB during replay | ASB directly (one synthetic request) | A real agent (Claude Code) through a proxy |
| Requests per task | Whatever the real session had | 1 | 5-15+ (real tool-use turns) |
| Context | Your actual session context | Synthetic padding to target size | Grows naturally as the agent works |
| Use case | Benchmark with your real traffic | Raw throughput at controlled sizes | "What does it feel like to use this endpoint?" |
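
As a quick orientation, the two headline modes chain together like this. A sketch assembled from flags documented in the sections below; hostnames and file names are placeholders:

# Capture a real session once (record accepts -o to name the output file)
asb record -e http://your-gpu-server:8000 -m your-model -o my-session.jsonl

# Replay the identical request sequence against two candidate endpoints
asb replay -e http://server-a:8000 -m my-model -w my-session.jsonl
asb replay -e http://server-b:8000 -m my-model -w my-session.jsonl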

asb record

HEADLINE

Capture real coding sessions as replayable JSONL

Starts a recording proxy between your agent and your LLM endpoint. Every request/response pair is saved as one JSONL line. Supports both an OpenAI-compatible upstream (the proxy translates the Anthropic Messages API) and Anthropic passthrough (auto-detected when the upstream is api.anthropic.com).

Key Metrics

Request count · Context sizes · Tool calls · Token usage

Examples

Record with OpenAI-compatible upstream

asb record \
  -e http://your-gpu-server:8000 \
  -m your-model

Record with Anthropic passthrough (full fidelity)

asb record \
  -e https://api.anthropic.com \
  -m claude-sonnet-4-20250514 \
  -k $ANTHROPIC_API_KEY \
  --api-key-header x-api-key \
  -o my-session.jsonl

Point your agent at the proxy

ANTHROPIC_BASE_URL=http://localhost:19000 claude
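
A typical capture uses two terminals: the proxy in one, your agent in the other. A minimal sketch combining the commands above (the proxy address is taken from the example; the output file name is a placeholder):

# Terminal 1: start the recording proxy (flags as in the passthrough example above)
asb record -e https://api.anthropic.com -m claude-sonnet-4-20250514 \
  -k $ANTHROPIC_API_KEY --api-key-header x-api-key -o my-session.jsonl

# Terminal 2: run the agent through the proxy
ANTHROPIC_BASE_URL=http://localhost:19000 claude

# Afterwards: one JSONL line per captured request/response pair
wc -l my-session.jsonl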

asb replay

HEADLINE

Replay recorded sessions against any endpoint

Takes a recorded scenario and replays it against any endpoint, hardware, or configuration. Requests are grouped by context size, and the run produces the same metrics as speed mode - but using your real traffic instead of synthetic padding. Defaults to realistic cache mode (shared prefix preserved, per-user context poisoned).

Key Metrics

Decode tok/s · Prefill tok/s · TTFT · ITL · Per-context breakdown · Aggregate throughput

Examples

Replay against a new endpoint

asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w my-session.jsonl

Replay a scenario directory with a schedule

asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w ./scenarios/my-scenario/ \
  --repetitions 3 --max-concurrent 5 --policy sequential

Optimistic numbers with a fully warm cache

asb replay -e URL -m MODEL -w scenario --cache-mode allwarm

Cap the replay at 1M cumulative prompt tokens

asb replay -e URL -m MODEL -w session.jsonl --slice-tokens 1000000

Preview without sending requests

asb replay -e URL -m MODEL -w session.jsonl --dry-run
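
Before committing to a long replay, a dry run followed by a token-capped run is a cheap sanity check. A sketch combining the flags shown above:

# Preview the request sequence without sending anything
asb replay -e http://new-server:8000 -m my-model -w my-session.jsonl --dry-run

# Then run for real, capped at 1M cumulative prompt tokens
asb replay -e http://new-server:8000 -m my-model -w my-session.jsonl --slice-tokens 1000000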

asb speed

Synthetic inference speed under agentic load

When you don't have a recording yet, speed generates realistic agentic context synthetically. Each request is padded with system prompts, tool schemas, multi-turn conversation history, file contents, and error traces - so the model sees what it would see in a real coding session. The default cache mode is allcold (true cold start).

Key Metrics

Decode tok/s per user · Prefill tok/s · TTFT · ITL (p50/p95/p99) · Aggregate throughput · Reasoning overhead

Examples

Default realistic sweep (fresh → full)

asb speed -e http://localhost:8000 -m my-model

Specific concurrency and context profile

asb speed -e http://localhost:8000 -m my-model -u 32 -p long

Fixed token count stress test

asb speed -e http://localhost:8000 -m my-model -c 50000 -u 16

Measure prefix cache impact (runs allcold then allwarm)

asb speed -e http://localhost:8000 -m my-model --cache-mode realistic

JSON output for CI/CD

asb speed -e http://localhost:8000 -m my-model --format json -o results.json
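
In CI, the JSON report can gate a deploy. The field name below is hypothetical (the report schema isn't documented here), so treat this as a pattern rather than a recipe:

asb speed -e http://localhost:8000 -m my-model --format json -o results.json

# Fail the build if decode throughput drops below a floor
# (.decode_tok_s is a placeholder field name; check the actual report)
jq -e '.decode_tok_s >= 40' results.json || exit 1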

asb agent

End-to-end benchmark with a real agent process

Runs a real agent process (Claude Code by default) end-to-end and records per-request timing for every LLM call across the entire multi-turn session. Captures how latency compounds over a real session - each turn's context grows as the agent reads files, runs tests, and fixes errors.

Key Metrics

Session TTFT · Multi-turn latency growth · Tool call overhead · Context window scaling

Examples

Run agent benchmark on the first 10 tasks

asb agent \
  -e http://localhost:8000 \
  -m my-model \
  -t p1-p10

Use a different agent CLI

asb agent \
  -e http://localhost:8000 \
  -m my-model \
  -t p1-p10 \
  --agent-cmd my-agent

asb eval

EXPERIMENTAL

Code correctness validation

An optional mode that sends the same tasks with agentic context but validates the generated code instead of measuring speed. Useful for checking whether your model still produces correct code under large-context pressure.

Key Metrics

Syntax pass rate · Execution pass rate · Tier breakdown

Examples

Syntax validation (does it parse?)

asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v syntax

Execution validation (does it run?)

asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v execution

Helper Commands

asb list-tasks - Browse Available Tasks

asb list-tasks                        # Show all 110 tasks
asb list-tasks -t trivial             # Filter by tier
asb list-tasks --tags typescript,rust  # Filter by language

asb list-scenarios - Browse Built-in Scenarios

asb list-scenarios
asb list-scenarios --format json

asb compare - Compare Two Runs

Generates a head-to-head table, ASCII bar chart, and winner summary between two JSON reports.

asb compare --baseline a.json --candidate b.json -o comparison.md
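
Combined with the JSON output from speed, this gives a full A/B pipeline (hostnames are placeholders):

# Benchmark two endpoints with identical settings, then diff the reports
asb speed -e http://server-a:8000 -m my-model --format json -o a.json
asb speed -e http://server-b:8000 -m my-model --format json -o b.json
asb compare --baseline a.json --candidate b.json -o comparison.md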