by SwarmOne

AgenticSwarmBench

The open-source benchmark for LLM inference under agentic swarm scenarios. Record real Claude Code / Cursor sessions and replay them against any endpoint - or run synthetic agentic load from 6K to 400K tokens.

$ uv pip install agentic-swarm-bench
View on GitHub

Why Another Benchmark?

Existing benchmarks don't test what agentic swarm tools actually do: grow multi-turn contexts packed with tool calls, code files, and error traces.

| | SWE-bench | LMSys Arena | Generic Benchmarks | ASB |
|---|---|---|---|---|
| Measures | Model quality | Chatbot speed | Uniform throughput | Agentic inference speed |
| Context size | Varies | ~2K | Uniform | 6K → 400K (growing) |
| Request pattern | Single-turn | Single-turn | Uniform | Multi-turn with tools |
| Content | GitHub issues | Chat messages | Generic text | Tool schemas, code, errors |
| Cache impact | N/A | N/A | N/A | Cold vs warm |

What It Measures

Seven key metrics that determine whether your LLM serving stack is ready for agentic swarm workloads.

| Metric | Full name | Unit | What it measures |
|---|---|---|---|
| TTFT | Time to First Token | ms | How long until the first token arrives. Critical for perceived responsiveness in editors. |
| Tok/s per user | Decode Tokens/Second | tok/s | Streaming speed per concurrent user. Determines how fast code appears. |
| Prefill tok/s | Prefill Tokens/Second | tok/s | Speed of processing input context. Bottleneck for large contexts. |
| ITL | Inter-Token Latency | ms | Time between consecutive tokens at p50/p95/p99. Drives streaming smoothness. |
| Throughput | Aggregate Throughput | tok/s | Total tokens/second across all concurrent users. Measures serving capacity. |
| Reasoning Overhead | Reasoning Token Overhead | ms | Extra latency from chain-of-thought or thinking tokens before visible output. |
| Cache Speedup | Prefix Cache Speedup | × | Cold vs warm TTFT ratio. Shows prefix caching effectiveness for repeated contexts. |
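All of these definitions fall out of per-token arrival timestamps. A minimal sketch of the arithmetic, with hypothetical timestamps and made-up cold/warm TTFT numbers (this is not ASB's internal code):

```python
def streaming_metrics(request_start: float, token_times: list[float]) -> dict:
    """TTFT, per-user decode rate, and ITL percentiles from per-token
    arrival timestamps (in seconds). Illustrative, not ASB's internals."""
    ttft = token_times[0] - request_start
    window = token_times[-1] - token_times[0]  # total decode time
    # Gaps between consecutive tokens, sorted for percentile lookup
    itls = sorted(b - a for a, b in zip(token_times, token_times[1:]))
    pct = lambda p: itls[min(len(itls) - 1, int(p / 100 * len(itls)))]
    return {
        "ttft_ms": ttft * 1000,
        "decode_tok_s": (len(token_times) - 1) / window if window else 0.0,
        "itl_p50_ms": pct(50) * 1000,
        "itl_p95_ms": pct(95) * 1000,
        "itl_p99_ms": pct(99) * 1000,
    }

# Prefix cache speedup is just the cold/warm TTFT ratio
# (values below are invented for illustration):
cold_ttft_ms, warm_ttft_ms = 1100.0, 314.0
cache_speedup = cold_ttft_ms / warm_ttft_ms  # ~3.5x
```

A request whose first token lands at 200ms and then streams one token every 20ms would report a 200ms TTFT, 50 tok/s decode, and 20ms ITL at every percentile.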

7 Context Profiles

Real coding sessions grow from 6K to 400K tokens. ASB tests every stage of that journey.

fresh (6K) → short (20K) → medium (40K) → long (70K) → full (100K) → xl (200K) → xxl (400K)

Each profile includes system prompts, tool schemas, code files, conversation history, and error traces.

The default realistic sweep runs fresh → short → medium → long → full to simulate a full session lifecycle.
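To see why a growing sweep matters, here is a back-of-the-envelope cold-prefill budget per profile, assuming a hypothetical sustained prefill rate of 5,000 tok/s and no prefix cache hits (the rate is an assumption for illustration, not a measured number):

```python
# Profile sizes from the sweep above; PREFILL_RATE is illustrative only.
PROFILES = {"fresh": 6_000, "short": 20_000, "medium": 40_000,
            "long": 70_000, "full": 100_000, "xl": 200_000, "xxl": 400_000}
PREFILL_RATE = 5_000  # tok/s, assumed

def cold_prefill_seconds(profile: str) -> float:
    """Time to process the full context from a cold cache."""
    return PROFILES[profile] / PREFILL_RATE

for name in PROFILES:
    print(f"{name:>6}: {cold_prefill_seconds(name):5.1f}s")
```

At this rate a fresh 6K context prefills in about a second, but xxl (400K) takes over a minute cold, which is exactly why the cold-vs-warm cache comparison is part of the benchmark.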

5 Modes, One Tool

From quick speed tests to full agentic session recording and replay.

asb record

Capture real coding sessions as replayable JSONL (headline feature)

Key Metrics

Request count · Context sizes · Tool calls · Token usage

Usage

$ asb record \
  -e http://your-gpu-server:8000 \
  -m your-model \
  -o my-session.jsonl

Quick Start

The fastest way to get meaningful numbers: record what you actually do, then replay it against any endpoint.

1
Install
uv recommended; pip also works
$ uv pip install "agentic-swarm-bench[proxy]"
2
Record a real session
Run your agent through the ASB proxy
$ asb record \
  -e http://your-gpu-server:8000 \
  -m your-model

$ ANTHROPIC_BASE_URL=http://localhost:19000 claude
3
Replay against any endpoint
Same context, same tokens, any endpoint
$ asb replay \
  -e http://new-server:8000 \
  -m my-model \
  -w my-session.jsonl
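The record/replay loop boils down to appending each proxied exchange as one JSON line, then reading the requests back and re-sending them to a new endpoint. A minimal sketch of that idea (the field names are illustrative, not ASB's actual JSONL schema):

```python
import json

def record_exchange(log, request: dict, response: dict) -> None:
    """Append one proxied request/response pair as a single JSONL line."""
    log.write(json.dumps({"request": request, "response": response}) + "\n")

def load_workload(path: str) -> list[dict]:
    """Read back the recorded requests for replay against a new endpoint."""
    with open(path) as f:
        return [json.loads(line)["request"] for line in f if line.strip()]
```

Because each line carries the full request body, a replay sends byte-identical contexts, so timing differences between two endpoints reflect the serving stack rather than the workload.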

Sample Report Output

Every run produces a verdict with key findings and a detailed breakdown.

meta-llama/Meta-Llama-3.1-70B
http://localhost:8000 • suite: standard
GOOD

Key Findings

  • TTFT stays under 3s through 40K context - responsive for active agentic use
  • Decode rate holds above 30 tok/s per user up to 100K context
  • Prefix caching delivers 3.5× TTFT speedup at medium context
  • p95 ITL spikes above 50ms at 100K context - may cause visible streaming stutter
  • 8-user concurrency degrades TTFT by 4-5× versus single-user baseline

Summary Table

| Context | Users | TTFT | Tok/s | Verdict |
|---|---|---|---|---|
| fresh (6K) | 1 | 180ms | 65 tok/s | GOOD |
| short (20K) | 1 | 520ms | 58 tok/s | GOOD |
| medium (40K) | 1 | 1.1s | 48 tok/s | GOOD |
| medium (40K) | 8 | 4.8s | 14 tok/s | MARGINAL |
| long (70K) | 1 | 2.2s | 38 tok/s | GOOD |
| full (100K) | 1 | 4.2s | 30 tok/s | GOOD |
| full (100K) | 8 | 15.0s | 8 tok/s | POOR |

What Good Looks Like

Reference ranges from real hardware. Use these as baselines when evaluating your own results.

| Setup | Context | Users | TTFT | Tok/s per user | Verdict |
|---|---|---|---|---|---|
| vLLM 1×A100 80GB, 7B | 6K | 1 | ~100ms | ~80-120 | GOOD |
| vLLM 1×A100 80GB, 7B | 40K | 8 | ~2-4s | ~20-40 | MARGINAL |
| vLLM 1×A100 80GB, 7B | 100K | 32 | ~8-15s | ~5-15 | POOR |
| SGLang 1×H100, 70B | 6K | 1 | ~200ms | ~40-60 | GOOD |
| SGLang 1×H100, 70B | 40K | 8 | ~3-6s | ~10-25 | MARGINAL |
| API provider (Together/Fireworks) | 40K | 8 | ~2-8s | ~15-40 | MARGINAL |
  • TTFT < 3s at 40K: responsive editing experience
  • Tok/s > 30/user: smooth code streaming
  • TTFT < 10s at 100K: acceptable for deep sessions
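These reference thresholds can be folded into a simple per-measurement classifier. The cutoffs below are the ones stated; the MARGINAL band (2× the TTFT budget, 10+ tok/s) is an assumed relaxation for illustration, not ASB's exact rubric:

```python
def verdict(context_tokens: int, ttft_s: float, tok_s: float) -> str:
    """Classify one (context size, load) measurement against the
    reference thresholds. MARGINAL band is an illustrative assumption."""
    # TTFT budget: 3s up to 40K context, 10s for deeper sessions
    ttft_budget = 3.0 if context_tokens <= 40_000 else 10.0
    if ttft_s <= ttft_budget and tok_s >= 30:
        return "GOOD"
    if ttft_s <= 2 * ttft_budget and tok_s >= 10:
        return "MARGINAL"
    return "POOR"
```

Applied to the sample report rows above, this reproduces their verdicts: medium (40K) at 8 users with 4.8s TTFT and 14 tok/s lands in MARGINAL, while full (100K) at 8 users with 15s TTFT and 8 tok/s falls to POOR.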