About ASB

What AgenticSwarmBench Is

AgenticSwarmBench (ASB) is an open-source inference performance benchmark purpose-built for agentic swarm scenarios - the kind of LLM request patterns that Claude Code, Cursor, Windsurf, and Copilot generate in practice.

It measures how fast your serving stack runs as multi-turn contexts grow (6K to 400K tokens), loaded with tool schemas, file contents, and error traces, under concurrent agents. No existing benchmark tests these specific access patterns.

ASB produces a clear verdict - 🟢 GOOD, 🟡 MARGINAL, or 🔴 POOR - answering one question: "Is this endpoint good enough for an agentic swarm?"
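
To make the grading concrete, here is a minimal sketch of how such a verdict can be derived. The metric names and the three verdict levels come from ASB; the thresholds and the worst-metric-wins rule are illustrative assumptions, not ASB's actual cutoffs.

from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    GOOD = "🟢 GOOD"
    MARGINAL = "🟡 MARGINAL"
    POOR = "🔴 POOR"

@dataclass
class Threshold:
    good: float  # at or better than this, the metric grades GOOD
    poor: float  # at or worse than this, the metric grades POOR

# Hypothetical thresholds, for illustration only.
THRESHOLDS = {
    "ttft_s": (Threshold(good=1.0, poor=5.0), True),     # lower is better
    "tok_s":  (Threshold(good=50.0, poor=15.0), False),  # higher is better
}

def grade(metric: str, value: float) -> Verdict:
    t, lower_is_better = THRESHOLDS[metric]
    beats_good = value <= t.good if lower_is_better else value >= t.good
    hits_poor = value >= t.poor if lower_is_better else value <= t.poor
    if beats_good:
        return Verdict.GOOD
    return Verdict.POOR if hits_poor else Verdict.MARGINAL

def overall(grades: list[Verdict]) -> Verdict:
    # Assumption: the endpoint is only as good as its worst metric.
    for level in (Verdict.POOR, Verdict.MARGINAL):
        if level in grades:
            return level
    return Verdict.GOOD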

Built by SwarmOne

ASB was created and is maintained by SwarmOne - the AI-native cloud for agentic scenarios. SwarmOne provides optimized infrastructure for running agentic swarms at scale, and ASB was born from the need to rigorously benchmark that infrastructure.

Project Architecture

agentic-swarm-bench/
├── agentic_swarm_bench/
│   ├── cli.py             # Click CLI: record | replay | speed | agent | eval
│   ├── config.py          # Config: CLI > env > YAML > defaults (see sketch below)
│   ├── scenarios/         # Recording proxy, replay engine, schedule, poisoning
│   ├── tasks/             # 110 agentic tasks (P1-P110) + codebase context
│   ├── runner/            # Speed, eval, and agent run loops
│   ├── proxy/             # FastAPI proxy: Anthropic <-> OpenAI translation
│   ├── metrics/           # TTFT, tok/s, ITL, reasoning tokens, stats
│   └── report/            # Markdown reports: verdicts, grades, charts
├── skill/
│   └── SKILL.md           # Claude Code optimization skill
└── tests/                 # Test suite
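
The precedence chain noted next to config.py amounts to layered dictionary updates, lowest priority applied first. A minimal sketch, assuming hypothetical option names and an ASB_ env-var prefix; only the CLI > env > YAML > defaults ordering comes from the layout above:

import os
import yaml  # third-party: pyyaml

DEFAULTS = {"endpoint": "http://localhost:8000", "concurrency": 4}

def resolve_config(cli_args: dict, yaml_path: str | None = None) -> dict:
    config = dict(DEFAULTS)                            # 4. built-in defaults
    if yaml_path:
        with open(yaml_path) as f:
            config.update(yaml.safe_load(f) or {})     # 3. YAML file
    for key in DEFAULTS:
        env_value = os.environ.get(f"ASB_{key.upper()}")
        if env_value is not None:
            config[key] = env_value                    # 2. environment variables
    config.update({k: v for k, v in cli_args.items() if v is not None})  # 1. CLI wins
    return config

Each later layer overwrites the one beneath it, so an explicit CLI flag always takes effect.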

Key Features

  • Record & replay: capture real coding sessions as JSONL, replay against any endpoint (see the replay sketch after this list)
  • 110 agentic tasks across 6 difficulty tiers (trivial → expert + multi-language)
  • 7 context profiles simulating real session growth (6K → 400K tokens)
  • 5 CLI modes: record, replay, speed, agent, eval (experimental)
  • Prefix cache poisoning via space-doubling - true cold-start measurements (see the poisoning sketch after this list)
  • Three cache modes: allcold, allwarm, realistic (shared prefix preserved)
  • Reasoning token detection (DeepSeek R1, o3, Claude Extended Thinking)
  • Automated verdict system with per-metric grading and ASCII charts
  • Docker support for reproducible benchmarking
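
For the record & replay feature, a rough sketch of the replay side, assuming one captured request per JSONL line; the field names ("body", "turn") and the OpenAI-style endpoint path are assumptions for illustration:

import json
import time

import httpx

def replay(session_path: str, base_url: str, api_key: str = "dummy") -> None:
    with open(session_path) as f, httpx.Client(timeout=120) as client:
        for line in f:
            record = json.loads(line)  # one captured request per line
            start = time.perf_counter()
            resp = client.post(
                f"{base_url}/v1/chat/completions",
                headers={"Authorization": f"Bearer {api_key}"},
                json=record["body"],   # the original request body, verbatim
            )
            resp.raise_for_status()
            print(f"turn {record.get('turn', '?')}: {time.perf_counter() - start:.2f}s")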
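
And a sketch of the space-doubling idea behind cache poisoning: doubling whitespace changes the prompt's token sequence while leaving its meaning essentially intact, so the server's prefix cache cannot reuse earlier KV blocks. How each cache mode applies the poison here is an assumption inferred from the mode names:

def poison(prompt: str) -> str:
    # Doubled spaces tokenize differently -> guaranteed prefix-cache miss.
    return prompt.replace(" ", "  ")

def apply_cache_mode(requests: list[str], mode: str) -> list[str]:
    if mode == "allcold":
        return [poison(r) for r in requests]  # every request starts cold
    if mode == "allwarm":
        return requests                       # shared prefixes stay cache-hot
    if mode == "realistic":
        # Rough approximation: keep the shared prefix (first request) warm
        # and poison only the later, session-specific requests.
        return requests[:1] + [poison(r) for r in requests[1:]]
    raise ValueError(f"unknown cache mode: {mode}")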

Claude Code Optimization Skill

The repo includes a Claude Code skill (skill/SKILL.md) that turns Claude Code into an automated deployment optimizer. Point it at your serving stack and it will:

  1. Run asb speed to establish a baseline
  2. Read the verdict and key findings
  3. Identify the bottleneck (prefill-bound, decode-bound, scheduling, or context scaling)
  4. Tweak one deployment knob (tensor parallelism, batch size, chunked prefill, etc.)
  5. Re-run and compare - repeat until targets are met or 5 iterations show no improvement
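
A rough sketch of that loop in code. asb speed is the real command; the score-parsing and knob-application hooks below are hypothetical stand-ins for what the skill does by reading the report and editing your deployment config:

import subprocess

MAX_STALL = 5  # step 5: stop after 5 iterations without improvement

def run_speed() -> str:
    # Endpoint and profile are assumed to be configured via config.py's
    # precedence chain (CLI > env > YAML > defaults).
    result = subprocess.run(["asb", "speed"], capture_output=True, text=True, check=True)
    return result.stdout

def optimize(apply_next_knob, parse_score, target: float) -> None:
    # apply_next_knob: hypothetical hook that tweaks one deployment knob
    # (tensor parallelism, batch size, chunked prefill, ...) and restarts.
    # parse_score: hypothetical hook that reduces a report to one number.
    best = parse_score(run_speed())         # 1-2. baseline + read findings
    stalled = 0
    while best < target and stalled < MAX_STALL:
        apply_next_knob()                   # 3-4. identify bottleneck, tweak one knob
        score = parse_score(run_speed())    # 5. re-run and compare
        if score > best:
            best, stalled = score, 0
        else:
            stalled += 1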

Add the skill to Claude Code, then ask: "Optimize my vLLM deployment at http://localhost:8000 for agentic scenarios."

License

AgenticSwarmBench is open source under the Apache 2.0 License. Free to use, modify, and distribute.


How to Cite

If you use ASB in research or publications, please cite:

@software{agenticswarmbench2026,
  title  = {AgenticSwarmBench},
  author = {SwarmOne},
  url    = {https://github.com/SwarmOne/agentic-swarm-bench},
  year   = {2026},
  note   = {Open-source benchmark for LLM inference under agentic swarm scenarios}
}

Links