Contributing

How to contribute tasks, scenarios, and code to AgenticSwarmBench.

Development Setup

Clone the repo. uv is recommended, but pip works too.

git clone https://github.com/swarmone/agentic-swarm-bench.git
cd agentic-swarm-bench

With uv (recommended)

uv sync --all-extras
uv run pytest tests/ -v

Or with pip

pip install -e ".[dev,proxy]"
make test

Development Commands

Command                    Description
make lint                  Check code style
make format                Auto-format code
make test                  Run the full test suite
uv run pytest tests/ -v    Run tests via uv

Adding Tasks

Tasks are defined in agentic_swarm_bench/tasks/tasks.json. Each task is a JSON object with the fields described below:

tasks.json (single entry)
{
  "id": "P111",
  "tier": "medium",
  "prompt": "Build a REST API endpoint that...",
  "tags": ["python", "api", "fastapi"],
  "max_output_tokens": 2048
}
Field                Description
id                   Unique ID (built-in tasks are P1-P110; new tasks continue from P111)
tier                 Difficulty: trivial, easy, medium, hard, expert
prompt               The agentic swarm task description
tags                 Categorization tags (language, domain)
max_output_tokens    Token limit for the response
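
To sanity-check a new entry before opening a PR, a standalone script along these lines can help. This is a sketch, not part of the repo's test suite, and it assumes tasks.json is a top-level JSON array of task objects:

validate_task.py (sketch)
import json
import re

VALID_TIERS = {"trivial", "easy", "medium", "hard", "expert"}

def validate_task(task: dict) -> list[str]:
    """Return a list of problems found in one task entry."""
    problems = []
    if not re.fullmatch(r"P\d+", task.get("id", "")):
        problems.append(f"id must look like P<number>, got {task.get('id')!r}")
    if task.get("tier") not in VALID_TIERS:
        problems.append(f"tier must be one of {sorted(VALID_TIERS)}")
    if not task.get("prompt"):
        problems.append("prompt must be a non-empty string")
    if not isinstance(task.get("tags"), list) or not task["tags"]:
        problems.append("tags must be a non-empty list")
    if not isinstance(task.get("max_output_tokens"), int):
        problems.append("max_output_tokens must be an integer")
    return problems

# Assumes the file is a JSON array; adjust if the real layout differs.
with open("agentic_swarm_bench/tasks/tasks.json") as f:
    for task in json.load(f):
        for problem in validate_task(task):
            print(f"{task.get('id', '?')}: {problem}")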

Adding Scenarios

Record a real session and contribute it as a built-in scenario:

  1. Record a session with asb record
  2. Place the JSONL file in agentic_swarm_bench/scenarios/data/ (a quick inspection sketch follows this list)
  3. Register it in scenarios/registry.py
  4. Open a PR with a description of the session and what it tests
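
Before registering a recording, it is worth eyeballing the file. The helper below is hypothetical (not shipped with the repo) and assumes only the JSONL basics, one JSON object per line; the exact record schema is whatever asb record writes:

inspect_scenario.py (sketch)
import json
import sys

path = sys.argv[1]  # e.g. agentic_swarm_bench/scenarios/data/my_session.jsonl
records = []
with open(path) as f:
    for lineno, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            sys.exit(f"line {lineno}: not valid JSON ({exc})")

print(f"{len(records)} records")
# Compare against an existing file in scenarios/data/ to confirm the
# shape matches before registering it in scenarios/registry.py.
if records:
    print("first record keys:", sorted(records[0]))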

Project Architecture

Project structure
agentic-swarm-bench/
  agentic_swarm_bench/
    cli.py              # Click CLI (asb record | replay | speed | agent | eval | ...)
    config.py           # Config: CLI > env > YAML > defaults

    scenarios/
      recorder.py       # Recording proxy: captures real sessions as JSONL
      player.py         # Replay engine: replays scenarios against any endpoint
      registry.py       # Load/list/resolve scenarios (file path or built-in name)
      schedule.py       # Execution schedule: repetitions, concurrency, ordering
      poison.py         # Prefix-cache poisoning: breaks KV cache between reps
      data/             # Built-in scenario directories

    tasks/
      tasks.json        # 110 agentic swarm tasks, P1-P110
      registry.py       # Load/filter tasks by tier, range, tags, language
      context/
        codebase_context.py  # Tool schemas, file contents, conversation turns

    runner/
      direct.py         # Speed mode: direct endpoint benchmark
      eval_runner.py    # Eval mode: code correctness validation
      claude_code.py    # Agent mode: Claude Code orchestration through proxy

    proxy/
      server.py         # Agent-mode proxy (FastAPI) - Anthropic <-> OpenAI
      padding.py        # Context padding for proxy mode
      translators.py    # API format translation

    metrics/
      collector.py      # Per-request metrics: TTFT, tok/s, ITL, thinking tokens
      stats.py          # Statistical analysis (p50, p95, p99, distributions)

    report/
      markdown.py       # Verdict, insights, grades, ASCII charts

  skill/
    SKILL.md            # Claude Code optimization skill
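
The precedence chain in config.py (CLI > env > YAML > defaults) amounts to walking the sources from highest to lowest priority and returning the first hit. A minimal sketch of that rule; the names here (resolve, the ASB_* environment prefix) are illustrative assumptions, not the project's actual API:

Config precedence (sketch)
import os

def resolve(key, cli_args, yaml_cfg, defaults):
    """Return the first value found: CLI flag, then ASB_* environment
    variable, then the YAML config file, then the built-in default."""
    if cli_args.get(key) is not None:
        return cli_args[key]
    env_val = os.environ.get(f"ASB_{key.upper()}")
    if env_val is not None:
        return env_val
    if key in yaml_cfg:
        return yaml_cfg[key]
    return defaults[key]  # defaults are assumed to define every key

# A CLI flag wins over everything else:
print(resolve("endpoint",
              cli_args={"endpoint": "http://localhost:8000"},
              yaml_cfg={"endpoint": "http://prod:8000"},
              defaults={"endpoint": "http://127.0.0.1:8000"}))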

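Likewise, the acronyms in metrics/ are standard streaming measurements: TTFT is time to first token, ITL is the gap between consecutive streamed tokens, and the p50/p95/p99 figures summarize those values across requests. A rough, illustrative sketch (not the project's code):

Streaming metrics (sketch)
import math
import statistics

def request_metrics(start, token_times):
    """token_times are wall-clock arrival times of streamed tokens."""
    ttft = token_times[0] - start  # time to first token
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    duration = token_times[-1] - start
    return {
        "ttft_s": ttft,
        "mean_itl_s": statistics.mean(itls) if itls else 0.0,
        "tok_per_s": len(token_times) / duration if duration else 0.0,
    }

def percentile(values, p):
    """Nearest-rank percentile, e.g. p=95 for p95."""
    ranked = sorted(values)
    return ranked[max(0, math.ceil(p / 100 * len(ranked)) - 1)]

# Summarize TTFT across a handful of requests:
ttfts = [0.21, 0.19, 0.35, 0.22, 0.95]
print("p50:", percentile(ttfts, 50), "p95:", percentile(ttfts, 95))
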
PR Guidelines

  • Run make test and make lint before submitting
  • Add tests for new features
  • Keep commits focused - one feature or fix per PR
  • Update documentation if you change CLI flags or behavior

License

AgenticSwarmBench is released under the Apache 2.0 license. See LICENSE for details.