by SwarmOne

AgenticSwarmBench

The open-source benchmark for LLM inference under agentic swarm scenarios. Record real Claude Code / Cursor sessions and replay them against any endpoint - or run synthetic agentic load from 6K to 400K tokens.

$ uv pip install agentic-swarm-bench
View on GitHub

Why Another Benchmark?

Existing benchmarks don't test what agentic swarm tools actually do: grow multi-turn contexts packed with tool calls, code files, and error traces.

| | SWE-bench | LMSys Arena | Generic Benchmarks | ASB |
|---|---|---|---|---|
| Measures | Model quality | Chatbot speed | Uniform throughput | Agentic inference speed |
| Context size | Varies | ~2K | Uniform | 6K → 400K (growing) |
| Request pattern | Single-turn | Single-turn | Uniform | Multi-turn with tools |
| Content | GitHub issues | Chat messages | Generic text | Tool schemas, code, errors |
| Cache impact | N/A | N/A | N/A | Cold vs warm |

What It Measures

Seven key metrics that determine whether your LLM serving stack is ready for agentic swarm workloads.

| Metric | Full name | Unit | What it measures |
|---|---|---|---|
| TTFT | Time to First Token | ms | How long until the first token arrives. Critical for perceived responsiveness in editors. |
| Tok/s per user | Decode Tokens/Second | tok/s | Streaming speed per concurrent user. Determines how fast code appears. |
| Prefill tok/s | Prefill Tokens/Second | tok/s | Speed of processing input context. Bottleneck for large contexts. |
| ITL | Inter-Token Latency | ms | Time between consecutive tokens at p50/p95/p99. Drives streaming smoothness. |
| Throughput | Aggregate Throughput | tok/s | Total tokens/second across all concurrent users. Measures serving capacity. |
| Reasoning Overhead | Reasoning Token Overhead | ms | Extra latency from chain-of-thought or thinking tokens before visible output. |
| Cache Speedup | Prefix Cache Speedup | × | Cold vs warm TTFT ratio. Shows prefix caching effectiveness for repeated contexts. |
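All of these definitions fall out of per-token arrival timestamps. A minimal sketch of the arithmetic, with hypothetical timestamps and made-up cold/warm TTFT numbers (this is not ASB's internal code):

```python
def streaming_metrics(request_start: float, token_times: list[float]) -> dict:
    """TTFT, per-user decode rate, and ITL percentiles from per-token
    arrival timestamps (in seconds). Illustrative, not ASB's internals."""
    ttft = token_times[0] - request_start
    window = token_times[-1] - token_times[0]  # total decode time
    # Gaps between consecutive tokens, sorted for percentile lookup
    itls = sorted(b - a for a, b in zip(token_times, token_times[1:]))
    pct = lambda p: itls[min(len(itls) - 1, int(p / 100 * len(itls)))]
    return {
        "ttft_ms": ttft * 1000,
        "decode_tok_s": (len(token_times) - 1) / window if window else 0.0,
        "itl_p50_ms": pct(50) * 1000,
        "itl_p95_ms": pct(95) * 1000,
        "itl_p99_ms": pct(99) * 1000,
    }

# Prefix cache speedup is just the cold/warm TTFT ratio
# (values below are invented for illustration):
cold_ttft_ms, warm_ttft_ms = 1100.0, 314.0
cache_speedup = cold_ttft_ms / warm_ttft_ms  # ~3.5x
```

A request whose first token lands at 200ms and then streams one token every 20ms would report a 200ms TTFT, 50 tok/s decode, and 20ms ITL at every percentile.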

7 Context Profiles

Real coding sessions grow from 6K to 400K tokens. ASB tests every stage of that journey.

fresh (6K) → short (20K) → medium (40K) → long (70K) → full (100K) → xl (200K) → xxl (400K)

Each profile includes system prompts, tool schemas, code files, conversation history, and error traces.

The default realistic sweep runs fresh → short → medium → long → full to simulate a full session lifecycle.
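To see why a growing sweep matters, here is a back-of-the-envelope cold-prefill budget per profile, assuming a hypothetical sustained prefill rate of 5,000 tok/s and no prefix cache hits (the rate is an assumption for illustration, not a measured number):

```python
# Profile sizes from the sweep above; PREFILL_RATE is illustrative only.
PROFILES = {"fresh": 6_000, "short": 20_000, "medium": 40_000,
            "long": 70_000, "full": 100_000, "xl": 200_000, "xxl": 400_000}
PREFILL_RATE = 5_000  # tok/s, assumed

def cold_prefill_seconds(profile: str) -> float:
    """Time to process the full context from a cold cache."""
    return PROFILES[profile] / PREFILL_RATE

for name in PROFILES:
    print(f"{name:>6}: {cold_prefill_seconds(name):5.1f}s")
```

At this rate a fresh 6K context prefills in about a second, but xxl (400K) takes over a minute cold, which is exactly why the cold-vs-warm cache comparison is part of the benchmark.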

5 Modes, One Tool

From quick speed tests to full agentic session recording and replay.

asb record

Capture real coding sessions as replayable JSONL (headline feature)

Key Metrics

Request count · Context sizes · Tool calls · Token usage

Usage

$ asb record \
  -e http://your-gpu-server:8000 \
  -m your-model \
  -o my-session.jsonl

Quick Start

The fastest way to get meaningful numbers: record what you actually do, then replay it against any endpoint.

1
Install
uv recommended; pip also works
$ uv pip install "agentic-swarm-bench[proxy]"
2
Record a real session
Run your agent through the ASB proxy
$ asb record \
  -e http://your-gpu-server:8000 \
  -m your-model

$ ANTHROPIC_BASE_URL=http://localhost:19000 claude
3
Replay against any endpoint
Same context, same tokens, any endpoint
$ asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w my-session.jsonl
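The record/replay loop boils down to appending each proxied exchange as one JSON line, then reading the requests back and re-sending them to a new endpoint. A minimal sketch of that idea (the field names are illustrative, not ASB's actual JSONL schema):

```python
import json

def record_exchange(log, request: dict, response: dict) -> None:
    """Append one proxied request/response pair as a single JSONL line."""
    log.write(json.dumps({"request": request, "response": response}) + "\n")

def load_workload(path: str) -> list[dict]:
    """Read back the recorded requests for replay against a new endpoint."""
    with open(path) as f:
        return [json.loads(line)["request"] for line in f if line.strip()]
```

Because each line carries the full request body, a replay sends byte-identical contexts, so timing differences between two endpoints reflect the serving stack rather than the workload.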

Sample Report Output

Every run produces a verdict with key findings and a detailed breakdown.

meta-llama/Meta-Llama-3.1-70B
http://localhost:8000 • suite: standard
GOOD

Key Findings

  • TTFT stays under 3s through 40K context - responsive for active agentic use
  • Decode rate holds above 30 tok/s per user up to 100K context
  • Prefix caching delivers 3.5× TTFT speedup at medium context
  • p95 ITL spikes above 50ms at 100K context - may cause visible streaming stutter
  • 8-user concurrency degrades TTFT by 4-5× versus single-user baseline

Summary Table

| Context | Users | TTFT | Tok/s | Verdict |
|---|---|---|---|---|
| fresh (6K) | 1 | 180ms | 65 tok/s | GOOD |
| short (20K) | 1 | 520ms | 58 tok/s | GOOD |
| medium (40K) | 1 | 1.1s | 48 tok/s | GOOD |
| medium (40K) | 8 | 4.8s | 14 tok/s | MARGINAL |
| long (70K) | 1 | 2.2s | 38 tok/s | GOOD |
| full (100K) | 1 | 4.2s | 30 tok/s | GOOD |
| full (100K) | 8 | 15.0s | 8 tok/s | POOR |

What Good Looks Like

Reference ranges from real hardware. Use these as baselines when evaluating your own results.

| Setup | Context | Users | TTFT | Tok/s per user | Verdict |
|---|---|---|---|---|---|
| vLLM 1×A100 80GB, 7B | 6K | 1 | ~100ms | ~80-120 | GOOD |
| vLLM 1×A100 80GB, 7B | 40K | 8 | ~2-4s | ~20-40 | MARGINAL |
| vLLM 1×A100 80GB, 7B | 100K | 32 | ~8-15s | ~5-15 | POOR |
| SGLang 1×H100, 70B | 6K | 1 | ~200ms | ~40-60 | GOOD |
| SGLang 1×H100, 70B | 40K | 8 | ~3-6s | ~10-25 | MARGINAL |
| API provider (Together/Fireworks) | 40K | 8 | ~2-8s | ~15-40 | MARGINAL |
  • TTFT < 3s at 40K: responsive editing experience
  • Tok/s > 30/user: smooth code streaming
  • TTFT < 10s at 100K: acceptable for deep sessions
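These reference thresholds can be folded into a simple per-measurement classifier. The cutoffs below are the ones stated; the MARGINAL band (2× the TTFT budget, 10+ tok/s) is an assumed relaxation for illustration, not ASB's exact rubric:

```python
def verdict(context_tokens: int, ttft_s: float, tok_s: float) -> str:
    """Classify one (context size, load) measurement against the
    reference thresholds. MARGINAL band is an illustrative assumption."""
    # TTFT budget: 3s up to 40K context, 10s for deeper sessions
    ttft_budget = 3.0 if context_tokens <= 40_000 else 10.0
    if ttft_s <= ttft_budget and tok_s >= 30:
        return "GOOD"
    if ttft_s <= 2 * ttft_budget and tok_s >= 10:
        return "MARGINAL"
    return "POOR"
```

Applied to the sample report rows above, this reproduces their verdicts: medium (40K) at 8 users with 4.8s TTFT and 14 tok/s lands in MARGINAL, while full (100K) at 8 users with 15s TTFT and 8 tok/s falls to POOR.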