
Agent Guidelines & Development Standards

"Discipline is the bridge between goals and accomplishment."

This document defines the guardrails, processes, and standards that all agents (and humans) must follow when developing pytest-gremlins.


Table of Contents

  1. Core Principles
  2. TDD: The Three Laws
  3. BDD: Behavior-Driven Development
  4. Development Workflow
  5. Git & Branching Strategy
  6. Code Quality Standards
  7. Testing Strategy
  8. Documentation Requirements
  9. CI/CD Pipeline
  10. Release Process
  11. Agent Boundaries

Core Principles

  1. Tests before code - No exceptions. Ever.
  2. Documentation is code - Same lifecycle, same rigor, same CI enforcement
  3. Small, reviewable PRs - Stacked via Graphite, easy to understand
  4. Isolated development - All work happens in git worktrees
  5. Automate everything - If a human has to remember it, automate it
  6. Dogfood early - Use pytest-gremlins on pytest-gremlins ASAP

TDD: The Three Laws

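TDD enforcement is ULTRA strict. These are not guidelines; they are laws. The first three are the classic three laws of TDD, and the fourth adds the refactoring step that completes the cycle: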

Law 1: Failing Test First

You may not write any production code unless it is to make a failing test pass.

Before touching src/, there must be a failing test in tests/. No "I'll add tests later." No "this is just a small change." No exceptions.

Law 2: Minimal Test Code

You may only write as much test code as required to make a test fail (and build/compile failures count as test failures).

Don't write a complete test suite upfront. Write ONE assertion that fails. Make it pass. Write the next assertion. Red-green-red-green.

Law 3: Minimal Production Code

You may only write as much production code as required to make a failing test pass.

Don't gold-plate. Don't add "while I'm here" features. Write the minimum code to go from red to green.

Law 4: Refactor

Engage in pragmatic refactoring once tests are passing.

Green means you can refactor. Improve structure, extract methods, rename things. But only when green. And stay green.

The TDD Cycle

Text Only
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│    ┌─────────┐     ┌─────────┐     ┌─────────────┐          │
│    │  RED    │────►│  GREEN  │────►│  REFACTOR   │          │
│    │         │     │         │     │             │          │
│    │ Write a │     │ Write   │     │ Clean up    │          │
│    │ failing │     │ minimal │     │ while       │          │
│    │ test    │     │ code to │     │ staying     │          │
│    │         │     │ pass    │     │ green       │          │
│    └─────────┘     └─────────┘     └──────┬──────┘          │
│         ▲                                 │                 │
│         │                                 │                 │
│         └─────────────────────────────────┘                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘
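
Two turns of the cycle might look like the sketch below. The `negate` helper and both tests are hypothetical, chosen only to show the order of work; they are not part of pytest-gremlins.

```python
# RED: the test is written first. Calling it now would raise NameError
# because `negate` does not exist yet, and that counts as a failing test.
def test_negate_less_than_returns_greater_or_equal():
    assert negate('<') == '>='


# GREEN: the least production code that passes. Hard-coding is fine here.
def negate(symbol: str) -> str:
    return '>='


test_negate_less_than_returns_greater_or_equal()


# RED again: one more assertion that the hard-coded version cannot satisfy.
def test_negate_equal_returns_not_equal():
    assert negate('==') == '!='


# GREEN again: generalise just enough to pass both tests.
NEGATIONS = {'<': '>=', '==': '!='}


def negate(symbol: str) -> str:
    return NEGATIONS[symbol]


# REFACTOR happens only now, with both tests green; rerun them afterwards.
test_negate_less_than_returns_greater_or_equal()
test_negate_equal_returns_not_equal()
```

The lookup table appears only because the second test forced it. Nothing is written ahead of a failing assertion.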

BDD: Behavior-Driven Development

We use Gherkin to define behaviors and pytest-bdd (a Cucumber-style Gherkin runner for pytest) for automated acceptance testing.

The BDD Flow

  1. PM Agent writes Gherkin scenarios based on requirements
  2. QA Agent reviews scenarios, adds edge cases
  3. Dev Agent implements step definitions (tests) FIRST
  4. Dev Agent implements production code to make steps pass
  5. Scenarios become living documentation

Gherkin Standards

Gherkin
Feature: Mutation Switching
  As a developer using pytest-gremlins
  I want mutations to be controlled via environment variable
  So that I can run tests against different mutations without reloading code

  Background:
    Given a Python file with a comparison operator
    And the file has been instrumented with gremlins

  Scenario: Default behavior runs original code
    Given no ACTIVE_GREMLIN environment variable is set
    When I execute the instrumented code
    Then the original comparison logic executes

  Scenario: Setting ACTIVE_GREMLIN activates a mutation
    Given ACTIVE_GREMLIN is set to "gremlin_001"
    When I execute the instrumented code
    Then the mutated comparison logic executes

  Scenario Outline: Different gremlins produce different mutations
    Given ACTIVE_GREMLIN is set to "<gremlin_id>"
    When I execute the comparison "age >= 18"
    Then the result matches "<expected_operator>"

    Examples:
      | gremlin_id  | expected_operator |
      | gremlin_001 | age > 18          |
      | gremlin_002 | age <= 18         |
      | gremlin_003 | age < 18          |
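
The scenarios above describe behavior only. As a rough illustration of what they imply, the hand-written sketch below switches between the original comparison and its mutations by reading ACTIVE_GREMLIN at call time. The function name `is_adult` and the shape of the instrumented code are assumptions; the real instrumentation is generated from the AST and may look quite different.

```python
import os


def is_adult(age: int) -> bool:
    """Hand-written stand-in for an instrumented `age >= 18` (illustrative only)."""
    active = os.environ.get('ACTIVE_GREMLIN')
    if active == 'gremlin_001':
        return age > 18
    if active == 'gremlin_002':
        return age <= 18
    if active == 'gremlin_003':
        return age < 18
    # No gremlin (or an unknown one) is active: run the original logic.
    return age >= 18
```

Because the switch is evaluated on every call, changing the environment variable selects a different mutation without reloading the module, which is the property the feature description asks for.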

Scenario Guidelines

  • Declarative, not imperative - Describe WHAT, not HOW
  • Business language - Avoid technical implementation details
  • One behavior per scenario - Keep them focused
  • Use Background - For common setup across scenarios
  • Scenario Outlines - For parameterized testing

File Organization

Text Only
features/
├── mutation_switching.feature
├── coverage_guidance.feature
├── incremental_analysis.feature
├── parallel_execution.feature
├── operators/
│   ├── comparison.feature
│   ├── arithmetic.feature
│   └── boolean.feature
└── reporting/
    ├── console_output.feature
    └── html_report.feature

Development Workflow

1. Story Breakdown (PM Agent)

  • PM agent breaks epics into stories
  • Stories have clear acceptance criteria (Gherkin scenarios)
  • Stories are sized for single PRs where possible

2. Development (Dev Agent)

Bash
# 1. Create isolated worktree
git worktree add .worktrees/feature-xyz -b feature/xyz

# 2. Navigate to worktree
cd .worktrees/feature-xyz

# 3. Write Gherkin scenarios (if not already done)
# 4. Write step definitions (failing tests)
# 5. Write minimal production code
# 6. Refactor while green
# 7. Commit with conventional commits

# 8. Create stacked PRs with Graphite
gt create -m "feat(operators): add comparison operator protocol"
gt create -m "feat(operators): implement comparison mutations"
gt create -m "test(operators): add comparison operator tests"

# 9. Submit the stack (pushes every branch and opens or updates its PR)
gt submit --stack

3. Code Review (Reviewer Agent)

  • Automated review against standards
  • Check TDD compliance (tests exist, coverage adequate)
  • Check documentation updates
  • Check conventional commit format

4. Merge

  • All status checks pass
  • No conflicts with base branch
  • Reviewer agent approves
  • Squash merge to maintain linear history

5. Cleanup

Bash
# Remove worktree after merge
git worktree remove .worktrees/feature-xyz

Git & Branching Strategy

Branch Naming

Text Only
feature/short-description    # New features
fix/short-description        # Bug fixes
docs/short-description       # Documentation only
refactor/short-description   # Code refactoring
chore/short-description      # Maintenance tasks

Commit Messages (Conventional Commits)

Text Only
type(scope): description

[optional body]

[optional footer]

Types: feat, fix, docs, style, refactor, perf, test, chore, ci

Examples:

Text Only
feat(operators): add comparison operator

Implements the comparison gremlin operator that mutates
<, <=, >, >=, ==, != operators.

Closes #42

Text Only
fix(instrumentation): handle async functions correctly

Async functions were losing their coroutine status after
instrumentation. This preserves the async wrapper.

Fixes #57
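
The header format can be checked mechanically. The sketch below is a minimal parser for the first line of a commit message, using the type list from this section. `parse_header` and `CommitHeader` are hypothetical names, and the scope is treated as optional (as in the Conventional Commits specification) even though the template above always shows one. commitizen performs the real validation in the commit-msg hook.

```python
import re
from typing import NamedTuple, Optional

# Types allowed by this document.
ALLOWED_TYPES = ('feat', 'fix', 'docs', 'style', 'refactor', 'perf', 'test', 'chore', 'ci')

HEADER_PATTERN = re.compile(
    r'^(?P<type>' + '|'.join(ALLOWED_TYPES) + r')'
    r'(?:\((?P<scope>[^()]+)\))?'
    r': (?P<description>\S.*)$'
)


class CommitHeader(NamedTuple):
    type: str
    scope: Optional[str]
    description: str


def parse_header(message: str) -> Optional[CommitHeader]:
    """Return the parsed first line, or None if it does not follow the format."""
    first_line = message.splitlines()[0] if message else ''
    match = HEADER_PATTERN.match(first_line)
    if match is None:
        return None
    return CommitHeader(**match.groupdict())
```

For example, `parse_header('feat(operators): add comparison operator')` yields a header with type `feat` and scope `operators`, while `parse_header('added stuff')` yields `None`.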

Graphite Stacks

  • Max PR size: ~400 lines changed (soft limit)
  • Stack depth: As needed, but prefer shallow (2-3 PRs)
  • Each PR: Single logical change, independently reviewable

Worktree Requirements

All development happens in isolated git worktrees. Never commit directly from the main worktree.

Bash
# List active worktrees
git worktree list

# Create worktree for feature (always inside .worktrees/)
git worktree add .worktrees/{branch-name} -b {branch-name}

# Remove when done
git worktree remove .worktrees/{branch-name}

Code Quality Standards

Python Version Support

  • Python 3.11, 3.12, 3.13, 3.14
  • Test all versions in CI matrix
  • Use from __future__ import annotations so annotations are not evaluated at definition time (forward references work on every supported version)

Project Structure

Text Only
pytest-gremlins/
├── src/
│   └── pytest_gremlins/
│       ├── __init__.py
│       ├── config.py           # Configuration loading
│       ├── plugin.py           # pytest plugin hooks
│       ├── operators/
│       │   ├── __init__.py
│       │   ├── protocol.py     # GremlinOperator protocol
│       │   ├── registry.py     # OperatorRegistry
│       │   ├── comparison.py
│       │   ├── arithmetic.py
│       │   ├── boolean.py
│       │   ├── boundary.py
│       │   └── return_value.py
│       ├── instrumentation/
│       │   ├── __init__.py
│       │   ├── gremlin.py      # Gremlin dataclass
│       │   ├── transformer.py  # AST transformation
│       │   ├── switcher.py     # Mutation switching logic
│       │   ├── finder.py       # Mutation point finder
│       │   ├── import_hooks.py # Import interception
│       │   └── pragma.py       # Pardon pragma parsing
│       ├── coverage/
│       │   ├── __init__.py
│       │   ├── mapper.py       # Coverage-guided selection
│       │   ├── collector.py    # Coverage data collection
│       │   ├── context_plugin.py # Coverage.py dynamic context
│       │   ├── selector.py     # Test selection
│       │   └── prioritized_selector.py
│       ├── cache/
│       │   ├── __init__.py
│       │   ├── hasher.py       # Content hashing
│       │   ├── store.py        # SQLite result cache
│       │   └── incremental.py  # Cache coordinator
│       ├── parallel/
│       │   ├── __init__.py
│       │   ├── pool.py         # Worker pool
│       │   ├── batch_executor.py
│       │   ├── aggregator.py
│       │   └── distribution.py
│       └── reporting/
│           ├── __init__.py
│           ├── results.py      # GremlinResult dataclass
│           ├── score.py        # MutationScore
│           ├── console.py
│           ├── html.py
│           ├── json_reporter.py
│           └── history.py      # Run history tracking
├── tests/
│   ├── conftest.py
│   ├── benchmark/              # Benchmark tests
│   ├── cache/                  # Cache domain tests
│   ├── config/                 # Config domain tests
│   ├── coverage_module/        # Coverage domain tests
│   ├── instrumentation/        # Instrumentation domain tests
│   ├── operators/              # Operator domain tests
│   ├── parallel/               # Parallel domain tests
│   ├── plugin/                 # Plugin integration tests
│   └── reporting/              # Reporting domain tests
├── features/                   # Gherkin scenarios
│   └── ...
├── docs/
│   ├── design/
│   │   ├── NORTH_STAR.md
│   │   ├── OPERATORS.md
│   │   ├── AGENT_GUIDELINES.md
│   │   └── XDIST_INTEGRATION.md
│   ├── api/
│   └── changelog.md
├── pyproject.toml
├── tox.ini
├── .pre-commit-config.yaml
├── CLAUDE.md
├── README.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
└── LICENSE

Type Checking (mypy)

Strict mode from day one. No # type: ignore without justification.

TOML
# pyproject.toml
[tool.mypy]
python_version = "3.11"
strict = true
warn_return_any = true
warn_unused_ignores = true
disallow_untyped_defs = true
disallow_incomplete_defs = true
check_untyped_defs = true
disallow_untyped_decorators = true
no_implicit_optional = true
warn_redundant_casts = true
warn_unused_configs = true

Linting (Ruff)

TOML
# pyproject.toml
[tool.ruff]
line-length = 120
target-version = "py311"

[tool.ruff.lint]
select = [
    "E",      # pycodestyle errors
    "W",      # pycodestyle warnings
    "F",      # Pyflakes
    "I",      # isort
    "B",      # flake8-bugbear
    "C4",     # flake8-comprehensions
    "UP",     # pyupgrade
    "ARG",    # flake8-unused-arguments
    "SIM",    # flake8-simplify
    "TCH",    # flake8-type-checking
    "PTH",    # flake8-use-pathlib
    "ERA",    # eradicate (commented-out code)
    "PL",     # Pylint
    "RUF",    # Ruff-specific rules
]
ignore = [
    "PLR0913",  # Too many arguments (we'll manage this ourselves)
]

[tool.ruff.format]
quote-style = "single"

[tool.ruff.lint.isort]
force-single-line = false
force-sort-within-sections = true
known-first-party = ["pytest_gremlins"]
combine-as-imports = true
split-on-trailing-comma = true

Import Sorting (isort via Ruff)

Vertical hanging indent with force grid wrap:

Python
# Correct
from pytest_gremlins.operators import (
    ArithmeticOperator,
    BooleanOperator,
    ComparisonOperator,
    GremlinOperator,
    OperatorRegistry,
)

# Wrong
from pytest_gremlins.operators import ArithmeticOperator, BooleanOperator, ComparisonOperator

Pre-commit

YAML
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-toml
      - id: check-added-large-files
      - id: check-merge-conflict
      - id: debug-statements

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.6
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.7.1
    hooks:
      - id: mypy
        additional_dependencies: [pytest]
        args: [--strict]

  - repo: https://github.com/commitizen-tools/commitizen
    rev: v3.13.0
    hooks:
      - id: commitizen
        stages: [commit-msg]

Testing Strategy

Test Categories (pytest-test-categories)

We dogfood pytest-test-categories from day one:

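| Category | Characteristics                                          | Timeout  | When to Run           |
|----------|----------------------------------------------------------|----------|-----------------------|
| Small    | Pure functions, no I/O, no network, mocked dependencies  | < 100 ms | Always, every commit  |
| Medium   | Database, filesystem, multiple components                | < 10 s   | PR checks, pre-merge  |
| Large    | End-to-end, real external services, full system          | < 60 s   | Nightly, release      |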
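
The time limits above can be written down as data. The sketch below is a hypothetical helper, not the pytest-test-categories API (the plugin enforces its own limits); it only shows how a measured duration relates to a category's budget.

```python
# Upper bounds in seconds, taken from the category list above.
TIME_LIMITS_SECONDS = {'small': 0.1, 'medium': 10.0, 'large': 60.0}


def within_limit(category: str, elapsed_seconds: float) -> bool:
    """Return True if a test of this category finished inside its budget."""
    return elapsed_seconds < TIME_LIMITS_SECONDS[category]
```

A test that takes 0.25 seconds is outside the small budget but comfortably inside the medium one, which is a hint that it is either mis-categorized or doing I/O it should not.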

Test File Organization

Tests are organized by domain module, not by test size. Test size categories (small, medium, large) are assigned via pytest markers, not directory nesting:

Text Only
tests/
├── conftest.py              # Shared fixtures
├── benchmark/               # Benchmark tests
├── cache/                   # Cache domain tests
├── config/                  # Config domain tests
├── coverage_module/         # Coverage domain tests
├── instrumentation/         # Instrumentation domain tests
├── operators/               # Operator domain tests
├── parallel/                # Parallel domain tests
├── plugin/                  # Plugin integration tests
└── reporting/               # Reporting domain tests

Test Naming

Do NOT use "should" in test names. Use declarative statements:

Python
# WRONG - "should" is always true whether or not it passes
def test_comparison_operator_should_return_mutations():
    ...

# CORRECT - Statement that can be falsified
def test_comparison_operator_returns_mutations_for_less_than():
    ...

def test_comparison_operator_returns_empty_list_for_non_comparison_node():
    ...
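
The naming rule is easy to check automatically. The sketch below is one possible check, assuming snake_case test names; `uses_should` is a hypothetical helper, not an existing lint rule.

```python
import re

# Matches "should" only as a whole snake_case word, so "shoulder" is not flagged.
SHOULD_PATTERN = re.compile(r'(?:^|_)should(?:_|$)')


def uses_should(test_name: str) -> bool:
    """Return True if the test name contains the word "should"."""
    return SHOULD_PATTERN.search(test_name) is not None
```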

Test Structure

No branching, loops, or complexity in tests. Use parametrization:

Python
import ast

import pytest

# `make_comparison` and `operator` are test helpers assumed to be defined elsewhere.


# WRONG - Logic in test
def test_comparison_mutations():
    for op in [ast.Lt, ast.LtE, ast.Eq]:
        node = make_comparison(op)
        result = operator.mutate(node)
        if op is ast.Eq:
            assert len(result) == 1
        else:
            assert len(result) == 2


# CORRECT - Parametrized, no logic
@pytest.mark.parametrize(
    ('input_op', 'expected_mutation_count'),
    [
        (ast.Lt, 2),
        (ast.LtE, 2),
        (ast.Gt, 2),
        (ast.Eq, 1),
        (ast.NotEq, 1),
    ],
)
def test_comparison_operator_mutation_count(input_op, expected_mutation_count):
    node = make_comparison(input_op)
    result = operator.mutate(node)
    assert len(result) == expected_mutation_count

Coverage Requirements

  • Line coverage: ≥ 90%
  • Branch coverage: ≥ 85%
  • Mutation score: ≥ 80% (once we can dogfood)

CI fails if coverage drops below thresholds.
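
As a sketch of how such a gate could work, the code below compares measured percentages with the thresholds listed above. It assumes the usual definition of mutation score (gremlins killed divided by gremlins generated); `mutation_score` and `failed_thresholds` are hypothetical names, and the real enforcement lives in the CI configuration.

```python
# Minimum percentages from the list above.
THRESHOLDS = {'line': 90.0, 'branch': 85.0, 'mutation': 80.0}


def mutation_score(killed: int, total: int) -> float:
    """Percentage of gremlins killed by the test suite (assumed definition)."""
    if total <= 0:
        raise ValueError('total must be positive')
    return 100.0 * killed / total


def failed_thresholds(measured: dict[str, float]) -> list[str]:
    """Return the names of all metrics that fall below their minimum."""
    return [name for name, minimum in THRESHOLDS.items() if measured[name] < minimum]
```

With 40 of 50 gremlins killed the score is exactly 80%, which meets the threshold; a branch coverage of 84.9% would be reported as the only failure.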

Running Tests

Bash
# All small tests (fast, always safe)
uv run pytest tests -m small

# Small + medium (PR checks)
uv run pytest tests -m "small or medium"

# All tests including large (nightly/release)
uv run pytest

# Single test file
uv run pytest tests/operators/test_comparison.py

# Single test
uv run pytest tests/operators/test_comparison.py::test_comparison_operator_returns_mutations

# With coverage
uv run pytest --cov=pytest_gremlins --cov-report=html

# Tox for all Python versions
uv run tox

Documentation Requirements

Documentation Is Code

Documentation has the same lifecycle as code:

  • Lives in version control
  • Reviewed in PRs
  • Tested in CI (doctests, link checking)
  • Released with code

Types of Documentation

  1. README.md - First impression, quick start, badges
  2. User Guide - How to use pytest-gremlins
  3. API Reference - Auto-generated from docstrings
  4. Design Docs - Architecture decisions (this folder)
  5. Changelog - Auto-generated by commitizen
  6. Contributing Guide - How to contribute
  7. Code of Conduct - Community standards

Docstrings

All public APIs must have docstrings with:

  • One-line summary
  • Extended description (if needed)
  • Args with types and descriptions
  • Returns with type and description
  • Raises with exception types
  • Examples (as doctests)

Python
def mutate(self, node: ast.Compare) -> list[ast.Compare]:
    """Generate mutated variants of a comparison node.

    Takes an AST Compare node and returns all possible mutations
    based on the comparison operator.

    Args:
        node: An AST Compare node (e.g., `a < b`).

    Returns:
        List of mutated Compare nodes, each representing one gremlin.
        Returns empty list if the node cannot be mutated.

    Raises:
        TypeError: If node is not an ast.Compare instance.

    Examples:
        >>> import ast
        >>> node = ast.parse('a < b', mode='eval').body
        >>> operator = ComparisonOperator()
        >>> mutations = operator.mutate(node)
        >>> len(mutations)
        2

    """

Doctests

Doctests are mandatory for all public API examples. They serve a dual purpose:

  1. Living documentation that's always accurate
  2. Additional test coverage

CI runs doctests:

Bash
uv run pytest --doctest-modules src/pytest_gremlins
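
To see the mechanism in isolation, the sketch below runs the examples in one docstring with the standard library's doctest module. `flip` and `run_examples` are hypothetical and exist only for this illustration; CI uses the pytest command above rather than this helper.

```python
import doctest


def flip(symbol: str) -> str:
    """Return the mirrored comparison symbol.

    Examples:
        >>> flip('<')
        '>'
        >>> flip('>')
        '<'
    """
    return {'<': '>', '>': '<'}[symbol]


def run_examples(func) -> doctest.TestResults:
    """Execute the doctest examples found in func's docstring."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # The function is injected into globs so its own examples can call it.
    for test in finder.find(func, name=func.__name__, globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)
```

If the return value of `flip` changed, the docstring would no longer match, `run_examples(flip).failed` would be non-zero, and the documentation could not silently drift from the code.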

ReadTheDocs

  • Built from Markdown (MkDocs)
  • Auto-deployed on merge to main
  • API docs generated from docstrings
  • Versioned docs for each release

Documentation Review Checklist

Every PR touching code must address documentation:

  • Docstrings updated for changed functions
  • User guide updated if behavior changes
  • Examples tested (doctests pass)
  • README updated if needed
  • Changelog entry (via conventional commit)

CI/CD Pipeline

GitHub Actions Workflows

On Pull Request

YAML
# .github/workflows/ci.yml
name: CI

on:
  pull_request:
  push:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync
      - run: uv run ruff check .
      - run: uv run ruff format --check .

  typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync
      - run: uv run mypy src

  test:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ['3.11', '3.12', '3.13', '3.14']
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
        with:
          python-version: ${{ matrix.python-version }}
      - run: uv sync
      - run: uv run pytest tests -m "small or medium" --cov=pytest_gremlins
      - uses: codecov/codecov-action@v4

  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync
      - run: uv run pytest --doctest-modules src/pytest_gremlins
      - run: uv run mkdocs build --strict

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync
      - run: uv run pip-audit

On Release Tag

YAML
# .github/workflows/release.yml
name: Release

on:
  push:
    tags: ['v*']

jobs:
  publish-test-pypi:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv build
      - run: uv publish --publish-url https://test.pypi.org/legacy/

  test-install:
    needs: publish-test-pypi
    runs-on: ubuntu-latest
    steps:
      - run: pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ pytest-gremlins
      - run: python -c "import pytest_gremlins; print(pytest_gremlins.__version__)"

  publish-pypi:
    needs: test-install
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv build
      - run: uv publish

  github-release:
    needs: publish-pypi
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: softprops/action-gh-release@v1
        with:
          body_path: CHANGELOG.md
          generate_release_notes: true

Branch Protection (main)

  • Require pull request before merging
  • Require status checks: lint, typecheck, test, docs, security
  • Require linear history
  • Do not allow bypassing settings

Release Process

Philosophy: Release Often

We follow the FastAPI/Ruff model -- release stable versions frequently. Every few merged PRs, cut a release. No alpha/beta/RC ceremony for routine releases.

How to Release (Local)

Prerequisites: Be on main with a clean working tree. The script checks both and refuses to continue if either is wrong.

Bash
# Auto-detect bump type from conventional commits (recommended)
./scripts/release.sh

# Or force a specific bump type
./scripts/release.sh --patch
./scripts/release.sh --minor
./scripts/release.sh --major

What happens: The script uses commitizen to bump the version in pyproject.toml and src/pytest_gremlins/__init__.py, regenerate CHANGELOG.md, create a version commit and tag, and push both to origin. The tag push triggers the release.yml pipeline, which runs tests, publishes to PyPI, and creates a GitHub Release.
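
The auto-detected bump type follows from the commit messages since the last tag. The sketch below shows the conventional mapping (breaking change to major, feat to minor, fix to patch). It is an illustration of the idea, not commitizen's implementation; commitizen's actual mapping is configurable and may treat other types, such as refactor or perf, as patch bumps. `bump_for` and `detect_bump` are hypothetical names.

```python
import re

# type, optional (scope), optional "!" marking a breaking change, then ": "
HEADER = re.compile(r'^(?P<type>[a-z]+)(?:\([^()]*\))?(?P<breaking>!)?: ')
ORDER = ['none', 'patch', 'minor', 'major']


def bump_for(message: str) -> str:
    """Classify a single commit message."""
    match = HEADER.match(message)
    if match is None:
        return 'none'
    if match['breaking'] or 'BREAKING CHANGE:' in message:
        return 'major'
    if match['type'] == 'feat':
        return 'minor'
    if match['type'] == 'fix':
        return 'patch'
    return 'none'


def detect_bump(messages: list[str]) -> str:
    """Return the largest bump required by any commit in the list."""
    return max((bump_for(m) for m in messages), key=ORDER.index, default='none')
```

A batch containing one fix and one feat therefore produces a minor bump, and a single breaking change anywhere in the batch produces a major bump.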

How to Release (GitHub Actions)

The cut-release.yml workflow lets you release without a local checkout:

  1. Go to Actions > Cut Release > Run workflow
  2. Select the bump type (auto, patch, minor, major)
  3. Click Run workflow

The workflow bumps the version, pushes the commit and tag, and triggers release.yml. The job summary shows which commits are included and links to the release pipeline.

Prerequisites: A GitHub fine-grained PAT with Repository permissions > Contents: Read and write must be stored as the RELEASE_PAT repository secret. Set a 90-day expiry and renew before it lapses. Without this secret, the workflow will fail at the push step because the default GITHUB_TOKEN cannot push commits that trigger other workflows.

When to Release

  • After merging 3-5 PRs (routine)
  • After any security fix (immediately)
  • After a significant bug fix users are waiting on

When to Pause and Think First

  • Breaking changes: Write migration notes in CHANGELOG.md before bumping. Users need to know what breaks and how to update their code.
  • Major version bumps: Discuss with the team first. A major bump signals a compatibility contract change and should not be accidental.

MLP Definition

v1.0.0 (Minimum Loveable Product) requires:

  1. Core Features
     • Mutation switching architecture working
     • Coverage-guided test selection
     • Incremental analysis (caching)
     • Parallel execution
     • 5 core operators (comparison, boundary, boolean, arithmetic, return)

  2. Integration
     • Native pytest plugin (pytest --gremlins)
     • pytest-test-categories integration
     • Configuration via pyproject.toml

  3. Reporting
     • Console output (summary + details)
     • HTML report

  4. Documentation
     • Complete user guide
     • API reference
     • Examples that work
     • README with quick start

  5. Quality
     • All tests passing
     • 90%+ line coverage
     • 80%+ mutation score (dogfooded)
     • Tested on real project (not just ourselves)

Agent Boundaries

What Agents CAN Do Autonomously

  • Create branches and worktrees
  • Write tests (TDD - tests first!)
  • Write production code (to pass tests)
  • Run tests and fix failures
  • Create commits (conventional format)
  • Create PRs via Graphite
  • Respond to review feedback
  • Bump patch versions
  • Merge PRs (when all checks pass + reviewer approves + no conflicts)
  • Update documentation alongside code

What Agents CANNOT Do Without Human Approval

  • Deviate from the agreed roadmap or an issue's acceptance criteria (ACs)
  • Make architectural decisions not in design docs
  • Bump minor or major versions
  • Create releases to PyPI
  • Modify CI/CD pipeline
  • Change branch protection rules
  • Modify agent guidelines (this document)
  • Skip tests or reduce coverage
  • Add # type: ignore without justification
  • Bypass pre-commit hooks

Agent Suggestions

Agents may suggest changes to:

  • Architecture
  • Roadmap priorities
  • New features not in backlog
  • Process improvements

But humans decide. Agents implement.

Validation Requirements

Before creating a PR, agents must:

  1. Run all small tests (always)
  2. Run medium tests (always, unless they take longer than 5 minutes)
  3. Run large tests (if touching integration points)
  4. Verify type checking passes
  5. Verify linting passes
  6. Verify doctests pass
  7. Verify documentation builds

If small tests ever become "too slow," something is wrong. Fix it.
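
The checklist above can be scripted so that nothing depends on memory. The sketch below chains the commands already used elsewhere in this document and stops at the first failure. `CHECKS`, `run_command`, and `first_failure` are hypothetical names, `-m medium` is an assumed marker expression, and large tests are left out because they run only when integration points are touched.

```python
import subprocess
from typing import Callable, Optional, Sequence

# (label, command) pairs in the order the checklist gives them.
CHECKS = [
    ('small tests', ['uv', 'run', 'pytest', 'tests', '-m', 'small']),
    ('medium tests', ['uv', 'run', 'pytest', 'tests', '-m', 'medium']),
    ('type checking', ['uv', 'run', 'mypy', 'src']),
    ('linting', ['uv', 'run', 'ruff', 'check', '.']),
    ('doctests', ['uv', 'run', 'pytest', '--doctest-modules', 'src/pytest_gremlins']),
    ('documentation build', ['uv', 'run', 'mkdocs', 'build', '--strict']),
]


def run_command(command: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(command), check=False).returncode


def first_failure(run: Callable[[Sequence[str]], int] = run_command) -> Optional[str]:
    """Run every check in order; return the label of the first that fails."""
    for label, command in CHECKS:
        if run(command) != 0:
            return label
    return None
```

The `run` parameter exists so the ordering logic can be tested without invoking any tools. Calling `first_failure()` with no arguments runs the real commands and returns `None` only when a PR is ready to be created.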


References