CLI Modes

AgenticSwarmBench has five modes, ranging from quick speed tests to full agentic session recording and replay.

Overview

Each mode targets a different dimension of LLM serving performance under agentic swarm workloads. Use speed for inference benchmarking, eval for correctness, agent for real session measurement, and record / replay for capturing and replaying your own workloads.

asb speed

Inference speed under agentic load

Sends streaming requests with realistic agentic swarm context (system prompts, tool schemas, file contents, conversation history) directly to any OpenAI-compatible endpoint.

Key Metrics

TTFT, Tok/s per user, ITL (p50/p95/p99), Prefill tok/s, Aggregate throughput, Reasoning overhead
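Metrics like TTFT and ITL are derived from when streamed chunks arrive relative to when the request was sent. A minimal sketch of that derivation (names and structure are illustrative, not ASB's internals):

```python
import statistics

def stream_metrics(request_start: float, chunk_times: list[float]) -> dict:
    """Derive streaming metrics from chunk arrival timestamps.

    request_start: wall-clock time the request was sent.
    chunk_times:   arrival time of each streamed token/chunk, in order.
    """
    # Time to first token: gap between send and the first chunk.
    ttft = chunk_times[0] - request_start
    # Inter-token latencies: gaps between consecutive chunks.
    itls = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    return {
        "ttft": ttft,
        "itl_p50": statistics.median(itls) if itls else 0.0,
        "tokens_per_s": len(chunk_times) / (chunk_times[-1] - request_start),
    }
```

Percentiles like p95/p99 would be computed the same way across many requests' ITL samples.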

Examples

Default realistic sweep

asb speed -e http://localhost:8000 -m my-model

Specific concurrency and context

asb speed -e http://localhost:8000 -m my-model -u 32 -p long

Fixed token count stress test

asb speed -e http://localhost:8000 -m my-model -c 50000 -u 16

Measure prefix cache impact

asb speed -e http://localhost:8000 -m my-model --cache-mode both

JSON output for CI/CD

asb speed -e http://localhost:8000 -m my-model --format json -o results.json
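In CI, the JSON output can be gated against a latency budget. A minimal sketch, assuming a hypothetical ttft_p50 key in results.json (the actual schema may differ):

```python
import json

def check_ttft(path: str, budget_s: float) -> bool:
    """Return True if the run's median TTFT is within budget.

    The "ttft_p50" key is an assumption about results.json,
    not a documented schema.
    """
    with open(path) as f:
        results = json.load(f)
    return results["ttft_p50"] <= budget_s
```

A CI step could call this and fail the build when it returns False.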

asb eval

Code correctness validation

Sends agentic swarm tasks and validates the generated code at three levels: syntax (does it parse?), execution (does it run?), and functional (does it produce correct output?).
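The three validation levels nest naturally: code that fails to parse cannot run, and code that fails to run cannot produce correct output. A minimal sketch of the tiering for Python code (illustrative, not ASB's implementation):

```python
import ast
import subprocess
import sys

def validate(code: str, expected_stdout=None) -> dict:
    """Three-tier check: syntax (parses?), execution (runs?),
    functional (produces the expected output?)."""
    result = {"syntax": False, "execution": False, "functional": False}
    try:
        ast.parse(code)          # tier 1: does it parse?
        result["syntax"] = True
    except SyntaxError:
        return result            # no point running unparseable code
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=10,
    )
    result["execution"] = proc.returncode == 0   # tier 2: does it run?
    if expected_stdout is not None:              # tier 3: is the output right?
        result["functional"] = result["execution"] and proc.stdout == expected_stdout
    return result
```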

Key Metrics

Syntax pass rate, Execution pass rate, Functional correctness, Tier breakdown

Examples

Syntax validation

asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v syntax

Execution validation

asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v execution

Functional validation

asb eval -e http://localhost:8000 -m my-model -t p1-p25 -v functional

asb agent

Full agentic session benchmark via recording proxy

Runs a recording proxy between a real agent (like Claude Code) and your endpoint, measuring actual multi-turn agentic sessions. The proxy translates Anthropic Messages API → OpenAI Chat Completions API and records per-request timing.
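The core of that translation is reshaping the request body: Anthropic carries the system prompt as a top-level field and allows structured content blocks, while OpenAI expects a flat list of role/content messages. A simplified sketch (text content only; not ASB's actual proxy code):

```python
def anthropic_to_openai(payload: dict) -> dict:
    """Translate an Anthropic Messages request body into an
    OpenAI Chat Completions request body (text content only)."""
    messages = []
    # Anthropic's top-level "system" becomes an OpenAI system message.
    if payload.get("system"):
        messages.append({"role": "system", "content": payload["system"]})
    for msg in payload.get("messages", []):
        content = msg["content"]
        if isinstance(content, list):  # Anthropic allows content blocks
            content = "".join(
                b.get("text", "") for b in content if b.get("type") == "text"
            )
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": payload["model"],
        "messages": messages,
        "max_tokens": payload.get("max_tokens", 1024),
        "stream": payload.get("stream", False),
    }
```

A full proxy would also map tool schemas and tool-use blocks, and translate the streaming response format back.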

Key Metrics

Session TTFT, Multi-turn latency growth, Tool call overhead, Context window scaling

Examples

Run agent benchmark

asb agent \
  -e http://localhost:8000 \
  -m my-model \
  -t p1-p10

asb record

Capture real coding sessions as JSONL workloads

Starts a recording proxy between your agent and your LLM endpoint. Every request/response pair is saved as a line of JSONL, which you can later replay against any endpoint with asb replay.
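JSONL makes recording append-only and streaming-friendly: each request/response pair is one self-contained JSON object per line. A minimal sketch of the write path (field names here are illustrative, not ASB's actual schema):

```python
import json
import time

def append_record(path: str, request: dict, response: dict, latency_s: float) -> None:
    """Append one request/response pair as a single JSONL line.

    Field names are illustrative, not ASB's actual on-disk schema.
    """
    record = {
        "ts": time.time(),
        "request": request,
        "response": response,
        "latency_s": latency_s,
    }
    # One JSON object per line; appending never rewrites earlier records.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```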

Key Metrics

Request count, Context sizes, Tool calls

Examples

Record with OpenAI-compatible upstream

asb record \
  -e http://your-gpu-server:8000 \
  -m your-model

Record with Anthropic

asb record \
  -e https://api.anthropic.com \
  -m claude-sonnet-4-20250514 \
  -k $ANTHROPIC_API_KEY \
  --api-key-header x-api-key \
  -o my-session.jsonl

asb replay

Replay captured workloads against any endpoint

Takes a recorded workload and replays it against a different endpoint, hardware, or configuration. Requests are grouped by context size, and the run produces the same metrics as speed mode.
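Grouping by context size keeps comparisons apples-to-apples, since latency scales with prompt length. A minimal bucketing sketch (the bucket edges and field name are illustrative, not ASB's actual boundaries):

```python
def bucket_by_context(records: list[dict], edges=(4_000, 16_000, 64_000)) -> dict:
    """Group recorded requests into context-size buckets by prompt token count.

    Bucket edges and the "prompt_tokens" field are illustrative assumptions.
    """
    buckets: dict[str, list[dict]] = {}
    for rec in records:
        tokens = rec["prompt_tokens"]
        for edge in edges:
            if tokens < edge:
                label = f"<{edge}"
                break
        else:
            label = f">={edges[-1]}"
        buckets.setdefault(label, []).append(rec)
    return buckets
```

Per-bucket TTFT/ITL can then be compared between the baseline and the replayed run.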

Key Metrics

TTFT, Tok/s, ITL, Comparison delta

Examples

Replay against a new endpoint

asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w my-session.jsonl

Replay with report

asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w my-session.jsonl \
  -o report.md

Helper Commands

asb list-tasks - Browse Available Tasks

asb list-tasks                        # Show all 110 tasks
asb list-tasks -t trivial             # Filter by tier
asb list-tasks --tags typescript,rust  # Filter by language

asb list-workloads - Browse Built-in Workloads

asb list-workloads --format json

asb compare - Compare Two Runs

asb compare --baseline a.json --candidate b.json -o comparison.md