Contributing

How to contribute tasks, scenarios, and code to AgenticSwarmBench.

Development Setup

Clone the repo. uv is recommended, but pip works too.

git clone https://github.com/swarmone/agentic-swarm-bench.git
cd agentic-swarm-bench

With uv (recommended)

uv sync --all-extras
uv run pytest tests/ -v

Or with pip

pip install -e ".[dev,proxy]"
make test

Development Commands

Command                    Description
make lint                  Check code style
make format                Auto-format code
make test                  Run the full test suite
uv run pytest tests/ -v    Run tests via uv

Adding Tasks

Tasks are defined in agentic_swarm_bench/tasks/tasks.json. Each task is a JSON object with the fields described below:

tasks.json (single entry)
{
  "id": "P111",
  "tier": "medium",
  "prompt": "Build a REST API endpoint that...",
  "tags": ["python", "api", "fastapi"],
  "max_output_tokens": 2048
}
Field                Description
id                   Unique ID (built-in tasks are P1-P110; new tasks continue from P111)
tier                 Difficulty: trivial, easy, medium, hard, expert
prompt               The agentic swarm task description
tags                 Categorization tags (language, domain)
max_output_tokens    Token limit for the response
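
To sanity-check a new entry before opening a PR, a standalone script along these lines can help. This is a sketch, not part of the repo's test suite, and it assumes tasks.json is a top-level JSON array of task objects:

validate_task.py (sketch)
import json
import re

VALID_TIERS = {"trivial", "easy", "medium", "hard", "expert"}

def validate_task(task: dict) -> list[str]:
    """Return a list of problems found in one task entry."""
    problems = []
    if not re.fullmatch(r"P\d+", task.get("id", "")):
        problems.append(f"id must look like P<number>, got {task.get('id')!r}")
    if task.get("tier") not in VALID_TIERS:
        problems.append(f"tier must be one of {sorted(VALID_TIERS)}")
    if not task.get("prompt"):
        problems.append("prompt must be a non-empty string")
    if not isinstance(task.get("tags"), list) or not task["tags"]:
        problems.append("tags must be a non-empty list")
    if not isinstance(task.get("max_output_tokens"), int):
        problems.append("max_output_tokens must be an integer")
    return problems

# Assumes the file is a JSON array; adjust if the real layout differs.
with open("agentic_swarm_bench/tasks/tasks.json") as f:
    for task in json.load(f):
        for problem in validate_task(task):
            print(f"{task.get('id', '?')}: {problem}")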

Adding Scenarios

Record a real session and contribute it as a built-in scenario:

  1. Record a session with asb record
  2. Place the JSONL file in agentic_swarm_bench/scenarios/data/ (a quick inspection sketch follows this list)
  3. Register it in scenarios/registry.py
  4. Open a PR with a description of the session and what it tests
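
Before registering a recording, it is worth eyeballing the file. The helper below is hypothetical (not shipped with the repo) and assumes only the JSONL basics, one JSON object per line; the exact record schema is whatever asb record writes:

inspect_scenario.py (sketch)
import json
import sys

path = sys.argv[1]  # e.g. agentic_swarm_bench/scenarios/data/my_session.jsonl
records = []
with open(path) as f:
    for lineno, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            sys.exit(f"line {lineno}: not valid JSON ({exc})")

print(f"{len(records)} records")
# Compare against an existing file in scenarios/data/ to confirm the
# shape matches before registering it in scenarios/registry.py.
if records:
    print("first record keys:", sorted(records[0]))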

Project Architecture

Project structure
agentic-swarm-bench/
  agentic_swarm_bench/
    cli.py              # Click CLI (asb record | replay | speed | agent | eval | ...)
    config.py           # Config: CLI > env > YAML > defaults

    scenarios/
      recorder.py       # Recording proxy: captures real sessions as JSONL
      player.py         # Replay engine: replays scenarios against any endpoint
      registry.py       # Load/list/resolve scenarios (file path or built-in name)
      schedule.py       # Execution schedule: repetitions, concurrency, ordering
      poison.py         # Prefix-cache poisoning: breaks KV cache between reps
      data/             # Built-in scenario directories

    tasks/
      tasks.json        # 110 agentic swarm tasks, P1-P110
      registry.py       # Load/filter tasks by tier, range, tags, language
      context/
        codebase_context.py  # Tool schemas, file contents, conversation turns

    runner/
      direct.py         # Speed mode: direct endpoint benchmark
      eval_runner.py    # Eval mode: code correctness validation
      claude_code.py    # Agent mode: Claude Code orchestration through proxy

    proxy/
      server.py         # Agent-mode proxy (FastAPI) - Anthropic <-> OpenAI
      padding.py        # Context padding for proxy mode
      translators.py    # API format translation

    metrics/
      collector.py      # Per-request metrics: TTFT, tok/s, ITL, thinking tokens
      stats.py          # Statistical analysis (p50, p95, p99, distributions)

    report/
      markdown.py       # Verdict, insights, grades, ASCII charts

  skill/
    SKILL.md            # Claude Code optimization skill
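
The precedence chain in config.py (CLI > env > YAML > defaults) amounts to walking the sources from highest to lowest priority and returning the first hit. A minimal sketch of that rule; the names here (resolve, the ASB_* environment prefix) are illustrative assumptions, not the project's actual API:

Config precedence (sketch)
import os

def resolve(key, cli_args, yaml_cfg, defaults):
    """Return the first value found: CLI flag, then ASB_* environment
    variable, then the YAML config file, then the built-in default."""
    if cli_args.get(key) is not None:
        return cli_args[key]
    env_val = os.environ.get(f"ASB_{key.upper()}")
    if env_val is not None:
        return env_val
    if key in yaml_cfg:
        return yaml_cfg[key]
    return defaults[key]  # defaults are assumed to define every key

# A CLI flag wins over everything else:
print(resolve("endpoint",
              cli_args={"endpoint": "http://localhost:8000"},
              yaml_cfg={"endpoint": "http://prod:8000"},
              defaults={"endpoint": "http://127.0.0.1:8000"}))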

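Likewise, the acronyms in metrics/ are standard streaming measurements: TTFT is time to first token, ITL is the gap between consecutive streamed tokens, and the p50/p95/p99 figures summarize those values across requests. A rough, illustrative sketch (not the project's code):

Streaming metrics (sketch)
import math
import statistics

def request_metrics(start, token_times):
    """token_times are wall-clock arrival times of streamed tokens."""
    ttft = token_times[0] - start  # time to first token
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    duration = token_times[-1] - start
    return {
        "ttft_s": ttft,
        "mean_itl_s": statistics.mean(itls) if itls else 0.0,
        "tok_per_s": len(token_times) / duration if duration else 0.0,
    }

def percentile(values, p):
    """Nearest-rank percentile, e.g. p=95 for p95."""
    ranked = sorted(values)
    return ranked[max(0, math.ceil(p / 100 * len(ranked)) - 1)]

# Summarize TTFT across a handful of requests:
ttfts = [0.21, 0.19, 0.35, 0.22, 0.95]
print("p50:", percentile(ttfts, 50), "p95:", percentile(ttfts, 95))
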
PR Guidelines

  • Run make test and make lint before submitting
  • Add tests for new features
  • Keep commits focused - one feature or fix per PR
  • Update documentation if you change CLI flags or behavior

License

AgenticSwarmBench is released under the Apache 2.0 license. See LICENSE for details.