Agent Guidelines & Development Standards
"Discipline is the bridge between goals and accomplishment."
This document defines the guardrails, processes, and standards that all agents (and humans) must follow when developing pytest-gremlins.
Table of Contents
- Core Principles
- TDD: The Three Laws
- BDD: Behavior-Driven Development
- Development Workflow
- Git & Branching Strategy
- Code Quality Standards
- Testing Strategy
- Documentation Requirements
- CI/CD Pipeline
- Release Process
- Agent Boundaries
Core Principles
- Tests before code - No exceptions. Ever.
- Documentation is code - Same lifecycle, same rigor, same CI enforcement
- Small, reviewable PRs - Stacked via Graphite, easy to understand
- Isolated development - All work happens in git worktrees
- Automate everything - If a human has to remember it, automate it
- Dogfood early - Use pytest-gremlins on pytest-gremlins ASAP
TDD: The Three Laws
TDD enforcement is ULTRA strict. These are not guidelines—they are laws:
Law 1: Failing Test First
You may not write any production code unless it is to make a failing test pass.
Before touching src/, there must be a failing test in tests/. No "I'll add tests later."
No "this is just a small change." No exceptions.
Law 2: Minimal Test Code
You may only write as much test code as required to make a test fail (and build/compile failures count as test failures).
Don't write a complete test suite upfront. Write ONE assertion that fails. Make it pass. Write the next assertion. Red-green-red-green.
Law 3: Minimal Production Code
You may only write as much production code as required to make a failing test pass.
Don't gold-plate. Don't add "while I'm here" features. Write the minimum code to go from red to green.
After the Three Laws: Refactor
Engage in pragmatic refactoring once tests are passing.
Green means you can refactor. Improve structure, extract methods, rename things. But only when green. And stay green.
The TDD Cycle
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   ┌─────────┐     ┌─────────┐     ┌─────────────┐           │
│   │   RED   │────►│  GREEN  │────►│  REFACTOR   │           │
│   │         │     │         │     │             │           │
│   │ Write a │     │ Write   │     │ Clean up    │           │
│   │ failing │     │ minimal │     │ while       │           │
│   │ test    │     │ code to │     │ staying     │           │
│   │         │     │ pass    │     │ green       │           │
│   └─────────┘     └─────────┘     └──────┬──────┘           │
│        ▲                                 │                  │
│        │                                 │                  │
│        └─────────────────────────────────┘                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
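As a rough illustration of two turns of the cycle, here is a minimal sketch in plain Python. The function is_adult and both tests are hypothetical examples, not code from pytest-gremlins, and in the project the tests would be run by pytest rather than called directly.

```python
def is_adult(age: int) -> bool:
    """GREEN for the first test: the least code that passes is a constant."""
    return True


def test_is_adult_returns_true_at_eighteen() -> None:
    assert is_adult(18) is True


def test_is_adult_returns_false_at_seventeen() -> None:
    assert is_adult(17) is False


test_is_adult_returns_true_at_eighteen()  # green

# RED: the next test fails against the constant implementation.
try:
    test_is_adult_returns_false_at_seventeen()
    second_test_was_red = False
except AssertionError:
    second_test_was_red = True


# GREEN again: generalize only as far as the failing test demands.
def is_adult(age: int) -> bool:  # redefined here only to show the next step
    return age >= 18


test_is_adult_returns_true_at_eighteen()
test_is_adult_returns_false_at_seventeen()
```

Each production change is driven by exactly one failing assertion, which is what Laws 2 and 3 ask for.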
BDD: Behavior-Driven Development
We use Gherkin to define behaviors and pytest-bdd (a Cucumber-style Gherkin runner for pytest) for automated acceptance testing.
The BDD Flow
- PM Agent writes Gherkin scenarios based on requirements
- QA Agent reviews scenarios, adds edge cases
- Dev Agent implements step definitions (tests) FIRST
- Dev Agent implements production code to make steps pass
- Scenarios become living documentation
Gherkin Standards
Feature: Mutation Switching
As a developer using pytest-gremlins
I want mutations to be controlled via environment variable
So that I can run tests against different mutations without reloading code
Background:
Given a Python file with a comparison operator
And the file has been instrumented with gremlins
Scenario: Default behavior runs original code
Given no ACTIVE_GREMLIN environment variable is set
When I execute the instrumented code
Then the original comparison logic executes
Scenario: Setting ACTIVE_GREMLIN activates a mutation
Given ACTIVE_GREMLIN is set to "gremlin_001"
When I execute the instrumented code
Then the mutated comparison logic executes
Scenario Outline: Different gremlins produce different mutations
Given ACTIVE_GREMLIN is set to "<gremlin_id>"
When I execute the comparison "age >= 18"
Then the result matches "<expected_operator>"
Examples:
| gremlin_id | expected_operator |
| gremlin_001 | age > 18 |
| gremlin_002 | age <= 18 |
| gremlin_003 | age < 18 |
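The behavior these scenarios describe can be sketched in plain Python. This is only an illustration under assumptions: the is_adult function, the GREMLINS table, and the fallback to the original operator for unknown IDs are hypothetical, and the real instrumentation may work differently.

```python
import operator
import os

# Hypothetical gremlin table for the expression `age >= 18`; the IDs and
# replacement operators mirror the Examples table in the feature file.
GREMLINS = {
    'gremlin_001': operator.gt,  # age > 18
    'gremlin_002': operator.le,  # age <= 18
    'gremlin_003': operator.lt,  # age < 18
}


def is_adult(age: int) -> bool:
    """Instrumented stand-in for `return age >= 18`."""
    active = os.environ.get('ACTIVE_GREMLIN', '')
    # Assumption: an unset or unknown ID runs the original comparison.
    compare = GREMLINS.get(active, operator.ge)
    return compare(age, 18)


os.environ.pop('ACTIVE_GREMLIN', None)
original_result = is_adult(18)  # original logic: 18 >= 18

os.environ['ACTIVE_GREMLIN'] = 'gremlin_001'
mutated_result = is_adult(18)  # mutated logic: 18 > 18

os.environ.pop('ACTIVE_GREMLIN', None)  # leave the environment clean
```

Because the switch is read at call time, the same loaded code can be exercised under every mutation without reloading it.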
Scenario Guidelines
- Declarative, not imperative - Describe WHAT, not HOW
- Business language - Avoid technical implementation details
- One behavior per scenario - Keep them focused
- Use Background - For common setup across scenarios
- Scenario Outlines - For parameterized testing
File Organization
features/
├── mutation_switching.feature
├── coverage_guidance.feature
├── incremental_analysis.feature
├── parallel_execution.feature
├── operators/
│   ├── comparison.feature
│   ├── arithmetic.feature
│   └── boolean.feature
└── reporting/
    ├── console_output.feature
    └── html_report.feature
Development Workflow
1. Story Breakdown (PM Agent)
- PM agent breaks epics into stories
- Stories have clear acceptance criteria (Gherkin scenarios)
- Stories are sized for single PRs where possible
2. Development (Dev Agent)
# 1. Create isolated worktree
git worktree add .worktrees/feature-xyz -b feature/xyz
# 2. Navigate to worktree
cd .worktrees/feature-xyz
# 3. Write Gherkin scenarios (if not already done)
# 4. Write step definitions (failing tests)
# 5. Write minimal production code
# 6. Refactor while green
# 7. Commit with conventional commits
# 8. Create stacked PRs with Graphite
gt create -m "feat(operators): add comparison operator protocol"
gt create -m "feat(operators): implement comparison mutations"
gt create -m "test(operators): add comparison operator tests"
# 9. Submit the stack (pushes the branches and opens or updates their PRs)
gt submit --stack
3. Code Review (Reviewer Agent)
- Automated review against standards
- Check TDD compliance (tests exist, coverage adequate)
- Check documentation updates
- Check conventional commit format
4. Merge
- All status checks pass
- No conflicts with base branch
- Reviewer agent approves
- Squash merge to maintain linear history
5. Cleanup
Git & Branching Strategy
Branch Naming
feature/short-description # New features
fix/short-description # Bug fixes
docs/short-description # Documentation only
refactor/short-description # Code refactoring
chore/short-description # Maintenance tasks
Commit Messages (Conventional Commits)
Types: feat, fix, docs, style, refactor, perf, test, chore, ci
Examples:
feat(operators): add comparison operator
Implements the comparison gremlin operator that mutates
<, <=, >, >=, ==, != operators.
Closes #42
fix(instrumentation): handle async functions correctly
Async functions were losing their coroutine status after
instrumentation. This preserves the async wrapper.
Fixes #57
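For illustration, the header line of such a message can be checked with a small regular expression. This is a sketch only: the project enforces the format through the commitizen hook, and the pattern below is a simplified assumption rather than commitizen's own rule set. The names COMMIT_TYPES, HEADER_PATTERN, and is_conventional_header are hypothetical.

```python
import re

# Types copied from the list in this section.
COMMIT_TYPES = ('feat', 'fix', 'docs', 'style', 'refactor', 'perf', 'test', 'chore', 'ci')
TYPES = '|'.join(COMMIT_TYPES)

# type, optional (scope), optional "!" for breaking changes, then ": description".
HEADER_PATTERN = re.compile(
    rf'^(?P<type>{TYPES})(\((?P<scope>[^()\s]+)\))?(?P<breaking>!)?: (?P<description>\S.*)$'
)


def is_conventional_header(header: str) -> bool:
    """Return True if the first line of a commit message matches the pattern."""
    return HEADER_PATTERN.match(header) is not None
```

A header such as "feat(operators): add comparison operator" matches, while "added stuff" or "feature: x" does not.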
Graphite Stacks
- Max PR size: ~400 lines changed (soft limit)
- Stack depth: As needed, but prefer shallow (2-3 PRs)
- Each PR: Single logical change, independently reviewable
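One way to check a branch against the soft limit is to total the output of git diff --numstat. The sketch below assumes that "lines changed" means added plus deleted lines and that binary files are ignored; the helper names are hypothetical.

```python
SOFT_LIMIT = 400  # lines changed, from the guideline above


def lines_changed(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        if not line.strip():
            continue
        added, deleted, _path = line.split('\t', 2)
        # Binary files are reported as "-" for both counts; skip them.
        if added == '-' or deleted == '-':
            continue
        total += int(added) + int(deleted)
    return total


def exceeds_soft_limit(numstat: str, limit: int = SOFT_LIMIT) -> bool:
    """Return True when the diff is larger than the soft limit."""
    return lines_changed(numstat) > limit


# Sample output in numstat format: added<TAB>deleted<TAB>path
sample = '120\t30\tsrc/a.py\n-\t-\tdocs/logo.png\n200\t60\ttests/test_a.py\n'
sample_total = lines_changed(sample)
```

Since the limit is soft, a result above it is a prompt to consider splitting the PR, not a hard failure.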
Worktree Requirements
All development happens in isolated git worktrees. Never commit directly from the main worktree.
# List active worktrees
git worktree list
# Create worktree for feature (always inside .worktrees/)
git worktree add .worktrees/{branch-name} -b {branch-name}
# Remove when done
git worktree remove .worktrees/{branch-name}
Code Quality Standards
Python Version Support
- Python 3.11, 3.12, 3.13, 3.14
- Test all versions in CI matrix
- Use from __future__ import annotations for forward compatibility
Project Structure
pytest-gremlins/
├── src/
│   └── pytest_gremlins/
│       ├── __init__.py
│       ├── config.py  # Configuration loading
│       ├── plugin.py  # pytest plugin hooks
│       ├── operators/
│       │   ├── __init__.py
│       │   ├── protocol.py  # GremlinOperator protocol
│       │   ├── registry.py  # OperatorRegistry
│       │   ├── comparison.py
│       │   ├── arithmetic.py
│       │   ├── boolean.py
│       │   ├── boundary.py
│       │   └── return_value.py
│       ├── instrumentation/
│       │   ├── __init__.py
│       │   ├── gremlin.py  # Gremlin dataclass
│       │   ├── transformer.py  # AST transformation
│       │   ├── switcher.py  # Mutation switching logic
│       │   ├── finder.py  # Mutation point finder
│       │   ├── import_hooks.py  # Import interception
│       │   └── pragma.py  # Pardon pragma parsing
│       ├── coverage/
│       │   ├── __init__.py
│       │   ├── mapper.py  # Coverage-guided selection
│       │   ├── collector.py  # Coverage data collection
│       │   ├── context_plugin.py  # Coverage.py dynamic context
│       │   ├── selector.py  # Test selection
│       │   └── prioritized_selector.py
│       ├── cache/
│       │   ├── __init__.py
│       │   ├── hasher.py  # Content hashing
│       │   ├── store.py  # SQLite result cache
│       │   └── incremental.py  # Cache coordinator
│       ├── parallel/
│       │   ├── __init__.py
│       │   ├── pool.py  # Worker pool
│       │   ├── batch_executor.py
│       │   ├── aggregator.py
│       │   └── distribution.py
│       └── reporting/
│           ├── __init__.py
│           ├── results.py  # GremlinResult dataclass
│           ├── score.py  # MutationScore
│           ├── console.py
│           ├── html.py
│           ├── json_reporter.py
│           └── history.py  # Run history tracking
├── tests/
│   ├── conftest.py
│   ├── benchmark/  # Benchmark tests
│   ├── cache/  # Cache domain tests
│   ├── config/  # Config domain tests
│   ├── coverage_module/  # Coverage domain tests
│   ├── instrumentation/  # Instrumentation domain tests
│   ├── operators/  # Operator domain tests
│   ├── parallel/  # Parallel domain tests
│   ├── plugin/  # Plugin integration tests
│   └── reporting/  # Reporting domain tests
├── features/  # Gherkin scenarios
│   └── ...
├── docs/
│   ├── design/
│   │   ├── NORTH_STAR.md
│   │   ├── OPERATORS.md
│   │   ├── AGENT_GUIDELINES.md
│   │   └── XDIST_INTEGRATION.md
│   ├── api/
│   └── changelog.md
├── pyproject.toml
├── tox.ini
├── .pre-commit-config.yaml
├── CLAUDE.md
├── README.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
└── LICENSE
Type Checking (mypy)
Strict mode from day one. No # type: ignore without justification.
# pyproject.toml
[tool.mypy]
python_version = "3.11"
strict = true
warn_return_any = true
warn_unused_ignores = true
disallow_untyped_defs = true
disallow_incomplete_defs = true
check_untyped_defs = true
disallow_untyped_decorators = true
no_implicit_optional = true
warn_redundant_casts = true
warn_unused_configs = true
Linting (Ruff)
# pyproject.toml
[tool.ruff]
line-length = 120
target-version = "py311"
[tool.ruff.lint]
select = [
"E", # pycodestyle errors
"W", # pycodestyle warnings
"F", # Pyflakes
"I", # isort
"B", # flake8-bugbear
"C4", # flake8-comprehensions
"UP", # pyupgrade
"ARG", # flake8-unused-arguments
"SIM", # flake8-simplify
"TCH", # flake8-type-checking
"PTH", # flake8-use-pathlib
"ERA", # eradicate (commented-out code)
"PL", # Pylint
"RUF", # Ruff-specific rules
]
ignore = [
"PLR0913", # Too many arguments (we'll manage this ourselves)
]
[tool.ruff.format]
quote-style = "single"
[tool.ruff.lint.isort]
force-single-line = false
force-sort-within-sections = true
known-first-party = ["pytest_gremlins"]
combine-as-imports = true
split-on-trailing-comma = true
Import Sorting (isort via Ruff)
Vertical hanging indent with force grid wrap:
# Correct
from pytest_gremlins.operators import (
    ArithmeticOperator,
    BooleanOperator,
    ComparisonOperator,
    GremlinOperator,
    OperatorRegistry,
)

# Wrong
from pytest_gremlins.operators import ArithmeticOperator, BooleanOperator, ComparisonOperator
Pre-commit
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-toml
      - id: check-added-large-files
      - id: check-merge-conflict
      - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.6
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.7.1
    hooks:
      - id: mypy
        additional_dependencies: [pytest]
        args: [--strict]
  - repo: https://github.com/commitizen-tools/commitizen
    rev: v3.13.0
    hooks:
      - id: commitizen
        stages: [commit-msg]
Testing Strategy
Test Categories (pytest-test-categories)
We dogfood pytest-test-categories from day one:
| Category | Characteristics | Timeout | When to Run |
|---|---|---|---|
| Small | Pure functions, no I/O, no network, mocked dependencies | < 100ms | Always, every commit |
| Medium | Database, filesystem, multiple components | < 10s | PR checks, pre-merge |
| Large | End-to-end, real external services, full system | < 60s | Nightly, release |
Test File Organization
Tests are organized by domain module, not by test size. Test size categories (small, medium, large) are assigned via pytest markers, not directory nesting:
tests/
├── conftest.py # Shared fixtures
├── benchmark/ # Benchmark tests
├── cache/ # Cache domain tests
├── config/ # Config domain tests
├── coverage_module/ # Coverage domain tests
├── instrumentation/ # Instrumentation domain tests
├── operators/ # Operator domain tests
├── parallel/ # Parallel domain tests
├── plugin/ # Plugin integration tests
└── reporting/ # Reporting domain tests
Test Naming
Do NOT use "should" in test names. Use declarative statements:
# WRONG - "should" is always true whether or not it passes
def test_comparison_operator_should_return_mutations():
    ...

# CORRECT - Statement that can be falsified
def test_comparison_operator_returns_mutations_for_less_than():
    ...

def test_comparison_operator_returns_empty_list_for_non_comparison_node():
    ...
Test Structure
No branching, loops, or complexity in tests. Use parametrization:
import ast

import pytest

# make_comparison and operator are assumed to be defined elsewhere
# (for example as test helpers or fixtures).

# WRONG - Logic in test
def test_comparison_mutations():
    for op in [ast.Lt, ast.LtE, ast.Gt]:
        node = make_comparison(op)
        result = operator.mutate(node)
        if op == ast.Lt:
            assert len(result) == 2
        else:
            assert len(result) == 3

# CORRECT - Parametrized, no logic
@pytest.mark.parametrize(
    ('input_op', 'expected_mutation_count'),
    [
        (ast.Lt, 2),
        (ast.LtE, 2),
        (ast.Gt, 2),
        (ast.Eq, 1),
        (ast.NotEq, 1),
    ],
)
def test_comparison_operator_mutation_count(input_op, expected_mutation_count):
    node = make_comparison(input_op)
    result = operator.mutate(node)
    assert len(result) == expected_mutation_count
Coverage Requirements
- Line coverage: ≥ 90%
- Branch coverage: ≥ 85%
- Mutation score: ≥ 80% (once we can dogfood)
CI fails if coverage drops below thresholds.
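The gate amounts to a comparison against three minimums. A minimal sketch, assuming the measured percentages have already been extracted from the coverage and mutation reports (the names THRESHOLDS and failing_metrics are hypothetical, and a metric that is not yet measured is simply skipped):

```python
# Minimum percentages from the list above.
THRESHOLDS = {'line': 90.0, 'branch': 85.0, 'mutation': 80.0}


def failing_metrics(measured: dict[str, float]) -> list[str]:
    """Return the names of measured metrics that fall below their minimum."""
    return [
        name
        for name, minimum in THRESHOLDS.items()
        if name in measured and measured[name] < minimum
    ]
```

A CI step would fail the build whenever the returned list is non-empty, for example when branch coverage is 84.9%.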
Running Tests
# All small tests (fast, always safe)
uv run pytest tests -m small
# Small + medium (PR checks)
uv run pytest tests -m "small or medium"
# All tests including large (nightly/release)
uv run pytest
# Single test file
uv run pytest tests/operators/test_comparison.py
# Single test
uv run pytest tests/operators/test_comparison.py::test_comparison_operator_returns_mutations
# With coverage
uv run pytest --cov=pytest_gremlins --cov-report=html
# Tox for all Python versions
uv run tox
Documentation Requirements
Documentation Is Code
Documentation has the same lifecycle as code:
- Lives in version control
- Reviewed in PRs
- Tested in CI (doctests, link checking)
- Released with code
Types of Documentation
- README.md - First impression, quick start, badges
- User Guide - How to use pytest-gremlins
- API Reference - Auto-generated from docstrings
- Design Docs - Architecture decisions (this folder)
- Changelog - Auto-generated by commitizen
- Contributing Guide - How to contribute
- Code of Conduct - Community standards
Docstrings
All public APIs must have docstrings with:
- One-line summary
- Extended description (if needed)
- Args with types and descriptions
- Returns with type and description
- Raises with exception types
- Examples (as doctests)
def mutate(self, node: ast.Compare) -> list[ast.Compare]:
    """Generate mutated variants of a comparison node.

    Takes an AST Compare node and returns all possible mutations
    based on the comparison operator.

    Args:
        node: An AST Compare node (e.g., `a < b`).

    Returns:
        List of mutated Compare nodes, each representing one gremlin.
        Returns empty list if the node cannot be mutated.

    Raises:
        TypeError: If node is not an ast.Compare instance.

    Examples:
        >>> import ast
        >>> node = ast.parse('a < b', mode='eval').body
        >>> operator = ComparisonOperator()
        >>> mutations = operator.mutate(node)
        >>> len(mutations)
        2
    """
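To show how a docstring example like this doubles as a test, here is a self-contained sketch that runs one with the standard library doctest module. The ComparisonOperator below is a hypothetical, simplified stand-in: its replacement table (Lt becomes LtE and Gt) is an assumption made for the example, not the project's actual operator. CI uses pytest to collect doctests; the stdlib runner is used here only to keep the sketch dependency-free.

```python
import ast
import doctest


class ComparisonOperator:
    """Hypothetical, simplified operator used only to demonstrate a doctest."""

    # Assumed replacement table; the real operator may differ.
    _SWAPS = {ast.Lt: (ast.LtE, ast.Gt)}

    def mutate(self, node: ast.Compare) -> list[ast.Compare]:
        """Return one mutated copy of the node per replacement operator.

        >>> node = ast.parse('a < b', mode='eval').body
        >>> mutations = ComparisonOperator().mutate(node)
        >>> [ast.unparse(m) for m in mutations]
        ['a <= b', 'a > b']
        """
        if not isinstance(node, ast.Compare):
            raise TypeError('node must be an ast.Compare')
        replacements = self._SWAPS.get(type(node.ops[0]), ())
        return [
            ast.Compare(left=node.left, ops=[op()], comparators=node.comparators)
            for op in replacements
        ]


# Parse the docstring into a DocTest and run it, collecting the results.
doctest_globals = {'ast': ast, 'ComparisonOperator': ComparisonOperator}
parsed = doctest.DocTestParser().get_doctest(
    ComparisonOperator.mutate.__doc__, doctest_globals, 'mutate', None, 0
)
results = doctest.DocTestRunner(verbose=False).run(parsed)
```

If the implementation drifts from the documented output, results.failed becomes non-zero, which is how stale examples get caught.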
Doctests
Doctests are mandatory for all public API examples. They serve dual purpose:
- Living documentation that's always accurate
- Additional test coverage
CI runs doctests:
uv run pytest --doctest-modules src/pytest_gremlins
ReadTheDocs
- Built from Markdown (MkDocs)
- Auto-deployed on merge to main
- API docs generated from docstrings
- Versioned docs for each release
Documentation Review Checklist
Every PR touching code must address documentation:
- Docstrings updated for changed functions
- User guide updated if behavior changes
- Examples tested (doctests pass)
- README updated if needed
- Changelog entry (via conventional commit)
CI/CD Pipeline
GitHub Actions Workflows
On Pull Request
# .github/workflows/ci.yml
name: CI
on:
  pull_request:
  push:
    branches: [main]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync
      - run: uv run ruff check .
      - run: uv run ruff format --check .
  typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync
      - run: uv run mypy src
  test:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ['3.11', '3.12', '3.13', '3.14']
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
        with:
          python-version: ${{ matrix.python-version }}
      - run: uv sync
      - run: uv run pytest tests -m "small or medium" --cov=pytest_gremlins
      - uses: codecov/codecov-action@v4
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync
      - run: uv run pytest --doctest-modules src/pytest_gremlins
      - run: uv run mkdocs build --strict
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync
      - run: uv run pip-audit
On Release Tag
# .github/workflows/release.yml
name: Release
on:
  push:
    tags: ['v*']
jobs:
  publish-test-pypi:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv build
      # Uploads go to the TestPyPI upload endpoint, not the /simple/ index
      - run: uv publish --publish-url https://test.pypi.org/legacy/
  test-install:
    needs: publish-test-pypi
    runs-on: ubuntu-latest
    steps:
      # Dependencies may be missing on TestPyPI, so fall back to PyPI for them
      - run: pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ pytest-gremlins
      - run: python -c "import pytest_gremlins; print(pytest_gremlins.__version__)"
  publish-pypi:
    needs: test-install
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv build
      - run: uv publish
  github-release:
    needs: publish-pypi
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: softprops/action-gh-release@v1
        with:
          body_path: CHANGELOG.md
          generate_release_notes: true
Branch Protection (main)
- Require pull request before merging
- Require status checks: lint, typecheck, test, docs, security
- Require linear history
- Do not allow bypassing settings
Release Process
Philosophy: Release Often
We follow the FastAPI/Ruff model -- release stable versions frequently. Every few merged PRs, cut a release. No alpha/beta/RC ceremony for routine releases.
How to Release (Local)
Prerequisites: Be on main with a clean working tree. The script checks both and
refuses to continue if either is wrong.
# Auto-detect bump type from conventional commits (recommended)
./scripts/release.sh
# Or force a specific bump type
./scripts/release.sh --patch
./scripts/release.sh --minor
./scripts/release.sh --major
What happens: The script uses commitizen to bump the version in pyproject.toml and
src/pytest_gremlins/__init__.py, regenerate CHANGELOG.md, create a version commit and
tag, and push both to origin. The tag push triggers the release.yml pipeline, which runs
tests, publishes to PyPI, and creates a GitHub Release.
How to Release (GitHub Actions)
The cut-release.yml workflow lets you release without a local checkout:
- Go to Actions > Cut Release > Run workflow
- Select the bump type (auto, patch, minor, major)
- Click Run workflow
The workflow bumps the version, pushes the commit and tag, and triggers release.yml.
The job summary shows which commits are included and links to the release pipeline.
Prerequisites: A GitHub fine-grained PAT with Repository permissions > Contents:
Read and write must be stored as the RELEASE_PAT repository secret. Set a 90-day
expiry and renew before it lapses. Without this secret, the workflow will fail at the
push step because the default GITHUB_TOKEN cannot push commits that trigger other
workflows.
When to Release
- After merging 3-5 PRs (routine)
- After any security fix (immediately)
- After a significant bug fix users are waiting on
When to Pause and Think First
- Breaking changes: Write migration notes in CHANGELOG.md before bumping. Users need to know what breaks and how to update their code.
- Major version bumps: Discuss with the team first. A major bump signals a compatibility contract change and should not be accidental.
MLP Definition
v1.0.0 (Minimum Loveable Product) requires:
- Core Features
  - Mutation switching architecture working
  - Coverage-guided test selection
  - Incremental analysis (caching)
  - Parallel execution
  - 5 core operators (comparison, boundary, boolean, arithmetic, return)
- Integration
  - Native pytest plugin (pytest --gremlins)
  - pytest-test-categories integration
  - Configuration via pyproject.toml
- Reporting
  - Console output (summary + details)
  - HTML report
- Documentation
  - Complete user guide
  - API reference
  - Examples that work
  - README with quick start
- Quality
  - All tests passing
  - 90%+ line coverage
  - 80%+ mutation score (dogfooded)
  - Tested on real project (not just ourselves)
Agent Boundaries
What Agents CAN Do Autonomously
- Create branches and worktrees
- Write tests (TDD - tests first!)
- Write production code (to pass tests)
- Run tests and fix failures
- Create commits (conventional format)
- Create PRs via Graphite
- Respond to review feedback
- Bump patch versions
- Merge PRs (when all checks pass + reviewer approves + no conflicts)
- Update documentation alongside code
What Agents CANNOT Do Without Human Approval
- Deviate from agreed roadmap/issue ACs
- Make architectural decisions not in design docs
- Bump minor or major versions
- Create releases to PyPI
- Modify CI/CD pipeline
- Change branch protection rules
- Modify agent guidelines (this document)
- Skip tests or reduce coverage
- Add # type: ignore without justification
- Bypass pre-commit hooks
Agent Suggestions
Agents may suggest changes to:
- Architecture
- Roadmap priorities
- New features not in backlog
- Process improvements
But humans decide. Agents implement.
Validation Requirements
Before creating a PR, agents must:
- Run all small tests (always)
- Run medium tests (always, unless > 5 min)
- Run large tests (if touching integration points)
- Verify type checking passes
- Verify linting passes
- Verify doctests pass
- Verify documentation builds
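The conditional items in this checklist can be expressed as a small decision function. This is a sketch only: pre_pr_checks and its parameters are hypothetical names, and the 5-minute cutoff is read as "skip medium tests only when they take longer than 5 minutes".

```python
def pre_pr_checks(medium_runtime_minutes: float, touches_integration_points: bool) -> list[str]:
    """Return the checks an agent runs before opening a PR."""
    checks = ['small tests']  # always
    if medium_runtime_minutes <= 5:
        checks.append('medium tests')  # always, unless they exceed 5 minutes
    if touches_integration_points:
        checks.append('large tests')
    checks += ['type checking', 'linting', 'doctests', 'documentation build']
    return checks
```

Small tests have no escape hatch in this function, which matches the rule that they run on every change.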
If small tests ever become "too slow," something is wrong. Fix it.