by SwarmOne

AgenticSwarmBench

The open-source benchmark for LLM inference under agentic swarm workloads. Measure what actually matters: speed from 6K to 400K tokens.

$ pip install agentic-swarm-bench
View on GitHub

Why Another Benchmark?

Existing benchmarks don't test what agentic swarm tools actually do: growing multi-turn contexts with tool calls, code files, and error traces.

| Benchmark | Measures | Context size | Request pattern | Content | Cache impact |
|---|---|---|---|---|---|
| SWE-bench | Model quality | Varies | Single-turn | GitHub issues | N/A |
| LMSys Arena | Chatbot speed | ~2K | Single-turn | Chat messages | N/A |
| Generic benchmarks | Uniform throughput | Uniform | Uniform | Generic text | N/A |
| ASB | Agentic inference speed | 6K → 400K (growing) | Multi-turn with tools | Tool schemas, code, errors | Cold vs warm |

What It Measures

Seven key metrics that determine whether your LLM serving stack is ready for agentic swarm workloads.

| Metric | Full name | Unit | What it measures |
|---|---|---|---|
| TTFT | Time to First Token | ms | How long until the first token arrives. Critical for perceived responsiveness in editors. |
| Tok/s per user | Decode Tokens/Second | tok/s | Streaming speed per concurrent user. Determines how fast code appears. |
| Prefill tok/s | Prefill Tokens/Second | tok/s | Speed of processing input context. The bottleneck for large contexts. |
| ITL | Inter-Token Latency | ms | Time between consecutive tokens at p50/p95/p99. Drives streaming smoothness. |
| Throughput | Aggregate Throughput | tok/s | Total tokens/second across all concurrent users. Measures serving capacity. |
| Reasoning Overhead | Reasoning Token Overhead | ms | Extra latency from chain-of-thought or thinking tokens before visible output. |
| Cache Speedup | Prefix Cache Speedup | × | Cold vs warm TTFT ratio. Shows prefix-caching effectiveness for repeated contexts. |
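Most of these metrics fall out of per-token arrival timestamps on a streaming response. Here is a minimal sketch of the arithmetic, assuming you have already recorded the request start time and each token's arrival time; the function names are illustrative, not ASB's API:

```python
import statistics

def stream_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT, ITL percentiles, and decode rate from token arrival times.

    request_start: wall-clock time the request was sent (seconds)
    token_times:   wall-clock arrival time of each streamed token (seconds)
    """
    ttft = token_times[0] - request_start
    itls = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token gaps
    decode_span = token_times[-1] - token_times[0]                # decode-phase duration
    return {
        "ttft_ms": ttft * 1000,
        "itl_p50_ms": statistics.median(itls) * 1000 if itls else 0.0,
        # quantiles(n=20) cuts the data into 5% buckets; the last cut point is p95
        "itl_p95_ms": statistics.quantiles(itls, n=20)[-1] * 1000 if len(itls) > 1 else 0.0,
        "decode_tok_s": (len(token_times) - 1) / decode_span if decode_span > 0 else 0.0,
    }

def cache_speedup(cold_ttft: float, warm_ttft: float) -> float:
    """Prefix-cache speedup: how much faster TTFT is on a warm (cached) prefix."""
    return cold_ttft / warm_ttft
```

Aggregate throughput is then the sum of every concurrent user's decode rate over the same wall-clock window.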

7 Context Profiles

Real coding sessions grow from 6K to 400K tokens. ASB tests every stage of that journey.

| Profile | Tokens |
|---|---|
| fresh | 6K |
| short | 20K |
| medium | 40K |
| long | 70K |
| full | 100K |
| xl | 200K |
| xxl | 400K |

Each profile includes system prompts, tool schemas, code files, conversation history, and error traces.
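As a rough illustration of how a profile might be filled to its token budget: the budgets below mirror the seven profiles above, while `build_context` and the 4-characters-per-token estimate are illustrative assumptions, not ASB internals.

```python
# Token budget per profile, matching the seven profiles listed above.
PROFILES = {"fresh": 6_000, "short": 20_000, "medium": 40_000,
            "long": 70_000, "full": 100_000, "xl": 200_000, "xxl": 400_000}

def approx_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English and code."""
    return len(text) // 4

def build_context(profile: str, system: str, tools: str, history: list[str]) -> str:
    """Concatenate system prompt, tool schemas, and as much history as fits the budget."""
    budget = PROFILES[profile]
    parts = [system, tools]
    for turn in history:
        if approx_tokens("\n".join(parts + [turn])) > budget:
            break  # adding this turn would exceed the profile's token budget
        parts.append(turn)
    return "\n".join(parts)
```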

5 Modes, One Tool

From quick speed tests to full agentic session recording and replay.

asb speed

Inference speed under agentic swarm load

Key Metrics

TTFT · Tok/s · ITL · Prefill · Throughput

Usage

$ asb speed \
  --endpoint http://localhost:8000 \
  --model my-model \
  --suite quick

Quick Start

Three steps to benchmark your serving stack.

1. Install

   $ pip install agentic-swarm-bench

2. Run

   $ asb speed \
     --endpoint http://localhost:8000 \
     --model my-model \
     --suite quick

3. Docker

   $ docker run --rm \
     -e ENDPOINT=http://host.docker.internal:8000 \
     -e MODEL=my-model \
     ghcr.io/swarmone/asb:latest speed --suite quick

Sample Report Output

Every run produces a verdict with key findings and a detailed breakdown.

meta-llama/Meta-Llama-3.1-70B
http://localhost:8000 • suite: standard
Verdict: GOOD

Key Findings

  • TTFT stays under 3s through 40K context: responsive for active agentic use
  • Decode rate holds above 30 tok/s per user up to 100K context
  • Prefix caching delivers a 3.5× TTFT speedup at medium context
  • p95 ITL spikes above 50ms at 100K context, which may cause visible streaming stutter
  • 8-user concurrency degrades TTFT by 4-5× versus the single-user baseline

Summary Table

| Context | Users | TTFT | Tok/s | Verdict |
|---|---|---|---|---|
| fresh (6K) | 1 | 180ms | 65 tok/s | GOOD |
| short (20K) | 1 | 520ms | 58 tok/s | GOOD |
| medium (40K) | 1 | 1.1s | 48 tok/s | GOOD |
| medium (40K) | 8 | 4.8s | 14 tok/s | MARGINAL |
| long (70K) | 1 | 2.2s | 38 tok/s | GOOD |
| full (100K) | 1 | 4.2s | 30 tok/s | GOOD |
| full (100K) | 8 | 15.0s | 8 tok/s | POOR |

What Good Looks Like

Reference ranges from real hardware. Use these as baselines when evaluating your own results.

| Setup | Context | Users | TTFT | Tok/s per user | Verdict |
|---|---|---|---|---|---|
| vLLM 1×A100, 8B | 6K | 1 | ~100ms | ~80-120 | GOOD |
| vLLM 1×A100, 8B | 40K | 1 | ~600ms | ~60-95 | GOOD |
| vLLM 1×A100, 8B | 40K | 8 | ~2-4s | ~20-40 | MARGINAL |
| vLLM 1×A100, 8B | 100K | 1 | ~2s | ~50-70 | GOOD |
| vLLM 1×A100, 8B | 100K | 8 | ~8-12s | ~10-18 | POOR |
| SGLang 1×H100, 70B | 6K | 1 | ~180ms | ~55-65 | GOOD |
| SGLang 1×H100, 70B | 40K | 1 | ~1.1s | ~40-50 | GOOD |
| SGLang 1×H100, 70B | 100K | 1 | ~4.2s | ~25-35 | GOOD |
  • TTFT < 3s at 40K: responsive editing experience
  • Tok/s > 30/user: smooth code streaming
  • TTFT < 10s at 100K: acceptable for deep sessions
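The three reference thresholds can be encoded directly as checks against a run's numbers. This sketch covers only the cutoffs quoted on this page; ASB's own verdict logic is not shown here and may weigh additional signals such as ITL and concurrency.

```python
def reference_checks(ttft_40k_s: float, tok_s_per_user: float, ttft_100k_s: float) -> dict:
    """Evaluate a serving stack against the three reference thresholds above."""
    return {
        "responsive_editing": ttft_40k_s < 3.0,   # TTFT < 3s at 40K context
        "smooth_streaming": tok_s_per_user > 30,  # > 30 tok/s per user
        "deep_sessions_ok": ttft_100k_s < 10.0,   # TTFT < 10s at 100K context
    }
```

For the single-user vLLM 1×A100 8B rows above (~600ms TTFT at 40K, ~80-120 tok/s, ~2s TTFT at 100K), all three checks pass.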