Record & Replay

Capture real coding sessions and replay them against any endpoint.

Why Record & Replay?

This is the most valuable way to benchmark. Synthetic load tells you what an endpoint can do in theory. Record/replay tells you what it actually does with your traffic. Record a real coding session once, then replay that exact sequence of requests against any endpoint, hardware config, or model - same context, same token counts, same multi-turn patterns.

Why this matters: agentic sessions have a unique shape. Context starts small and grows unpredictably. Some turns are tiny follow-ups; others dump 20K tokens of file contents. Synthetic benchmarks can approximate this, but a recording captures the real thing.

asb record - Capture a Session

Starts a recording proxy between your agent and your LLM endpoint. Every request/response pair is saved as a JSONL line.
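The on-disk schema is asb's own, but since scenarios are saved in OpenAI format (see Upstream Modes below), you can think of each line as a request/response pair roughly like the following - the field names here are an illustrative assumption, not the actual format:

{"request": {"model": "your-model", "messages": [{"role": "user", "content": "..."}]}, "response": {"choices": [{"message": {"role": "assistant", "content": "..."}}]}}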

Record with an OpenAI-compatible upstream

asb record \
  -e http://your-gpu-server:8000 \
  -m your-model

Record with Anthropic (auto-detected from URL)

asb record \
  -e https://api.anthropic.com \
  -m claude-sonnet-4-20250514 \
  -k $ANTHROPIC_API_KEY \
  --api-key-header x-api-key \
  -o my-session.jsonl

Custom output file and port

asb record \
  -e http://your-gpu-server:8000 \
  -m your-model \
  -o my-session.jsonl \
  -P 9000

Point Your Agent at the Proxy

Once the recording proxy is running, point your agent at it:

ANTHROPIC_BASE_URL=http://localhost:19000 claude

Stop recording with Ctrl+C when done.
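End to end, a typical recording session looks like this (server URL and file name are placeholders): start the recorder in one terminal, run the agent against it in a second, and stop the recorder once the task is done.

asb record -e http://your-gpu-server:8000 -m your-model -o my-session.jsonl    # terminal 1
ANTHROPIC_BASE_URL=http://localhost:19000 claude                               # terminal 2
# work through the coding task, then Ctrl+C the recorder; my-session.jsonl is your scenario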

Upstream Modes

The recorder supports two upstream modes:

OpenAI-compatible (default)

Translates Anthropic Messages API → OpenAI format before forwarding.

Anthropic passthrough

Forwards requests natively to Anthropic's API - no translation, full fidelity. Auto-detected when the endpoint is api.anthropic.com, or set explicitly with --upstream-api anthropic.

Both modes save the scenario in OpenAI format for replay.
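If the upstream speaks Anthropic's API but lives on a different hostname - say an internal gateway, which won't trigger auto-detection - force passthrough explicitly. The gateway URL below is a hypothetical example:

asb record \
  -e https://anthropic-gateway.internal:8443 \
  -m claude-sonnet-4-20250514 \
  -k $ANTHROPIC_API_KEY \
  --api-key-header x-api-key \
  --upstream-api anthropic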

asb replay - Replay Against Any Endpoint

Take a recorded scenario and replay it against a different endpoint, hardware, or configuration. Requests are grouped by context size and produce the same metrics as asb speed - decode tok/s (streaming speed after first token), prefill tok/s (input processing rate), TTFT (time to first token), ITL (inter-token latency), and aggregate throughput - but using your real traffic instead of synthetic padding.

Replay a single session against a new endpoint

asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w my-session.jsonl

Replay a scenario directory with a schedule

asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w ./scenarios/my-scenario/ \
  --repetitions 3 --max-concurrent 5 --policy sequential

Generate a full report

asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w my-session.jsonl \
  -o report.md

Preview without sending requests

asb replay -e URL -m MODEL -w session.jsonl --dry-run

Scheduling

Control how tasks execute with --repetitions, --max-concurrent, and --policy. Available policies: round_robin, sequential, random.
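For example, to hammer an endpoint with randomized ordering and higher concurrency (the values here are only illustrative - tune them to your setup):

asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w ./scenarios/my-scenario/ \
  --repetitions 5 --max-concurrent 10 --policy random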

Cache Mode

Replay's default is --cache-mode realistic: it preserves the shared prefix (typically the system prompt) so the server can KV-cache it, but poisons each user's unique context so the server can't serve that part from cache. Use allwarm for the optimistic all-cached upper bound, or allcold to defeat caching entirely.

asb replay -e URL -m MODEL -w scenario                          # realistic (default)
asb replay -e URL -m MODEL -w scenario --cache-mode allwarm     # optimistic upper bound
asb replay -e URL -m MODEL -w scenario --cache-mode allcold     # defeat cache entirely

See Prefix cache poisoning for how the space-doubling mechanism works.

Slicing Scenarios

Real sessions grow from small contexts to large ones. --slice-tokens N replays requests from the start until cumulative prompt tokens reach N - preserving the natural context growth while capping how much you send through the endpoint.

asb replay -e URL -m MODEL -w session.jsonl --slice-tokens 1000000

Useful for targeting specific model context limits or keeping replay costs down.
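To see what a given budget would send before committing to it, combine --slice-tokens with --dry-run (128000 is just an example limit):

asb replay -e URL -m MODEL -w session.jsonl --slice-tokens 128000 --dry-run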

Record CLI Flags

| Flag | Description |
| --- | --- |
| -e, --endpoint | Upstream LLM endpoint URL |
| -m, --model | Model name |
| -k, --api-key | API key for the upstream endpoint |
| --api-key-header | Custom API key header name |
| -o, --output | Output JSONL file path |
| -P, --port | Proxy listen port (default: 19000) |
| --upstream-api | Force upstream API type (openai or anthropic) |

Replay CLI Flags

| Flag | Description |
| --- | --- |
| -e, --endpoint | Target endpoint URL |
| -m, --model | Model name |
| -w, --scenario | JSONL scenario file path or scenario directory |
| -o, --output | Report output path |
| --cache-mode | realistic (default) \| allwarm \| allcold |
| --repetitions | Number of times to replay each task |
| --max-concurrent | Maximum in-flight requests |
| --policy | Execution policy: round_robin \| sequential \| random |
| --slice-tokens | Stop replaying after N cumulative prompt tokens |
| --dry-run | Preview without sending requests |
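The replay flags compose freely. As a closing illustration (all values are placeholders), this replays a scenario directory three times with caching defeated, capped concurrency, a token budget, and a written report:

asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w ./scenarios/my-scenario/ \
  --cache-mode allcold \
  --repetitions 3 --max-concurrent 4 --policy round_robin \
  --slice-tokens 200000 \
  -o report.md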