pytest-gremlins: North Star Design Document

"Let the gremlins loose. See which ones survive."

Vision

pytest-gremlins is a fast-first mutation testing plugin for pytest. We aim to make mutation testing practical for everyday development - not an overnight CI job, but part of the normal TDD feedback loop.

Why Another Mutation Testing Tool?

The Python mutation testing landscape is broken:

Tool	Fatal Flaw
mutmut	Unix/WSL only (requires `fork()`); not a pytest plugin; no incremental analysis
Cosmic Ray	Complex setup with session management, multiple distributor options
MutPy	Dead (last update 2019), Python 3.4-3.7 only
mutatest	Dead (last update 2022), Python ≤3.8 only, random behavior

Meanwhile, the JVM (PIT) and JavaScript (Stryker) worlds have solved these problems. We're bringing those lessons to Python.

Core Principles

1. Speed Is Non-Negotiable

Mutation testing is useless if developers don't run it. Every architectural decision optimizes for speed:

Mutation switching over file modification
Coverage-guided test selection over running all tests
Incremental analysis over full re-runs
Parallel execution over sequential

2. Native pytest Integration

Not a wrapper. Not a separate CLI. A proper pytest plugin that respects:

pytest's collection and execution model
Fixtures and markers
pytest-xdist for parallelization
pytest-cov for coverage data
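
As a rough illustration of what "a proper pytest plugin" means in practice, the sketch below registers a `--gremlins` flag through pytest's standard `pytest_addoption` and `pytest_configure` hooks. The group name, help text, and the `_gremlins_enabled` attribute are assumptions made for this example, not the plugin's actual internals.

```python
def pytest_addoption(parser):
    """pytest hook: register the command-line flag in its own option group."""
    group = parser.getgroup("gremlins", "mutation testing")
    group.addoption(
        "--gremlins",
        action="store_true",
        default=False,
        help="Run mutation testing (let the gremlins loose).",
    )


def pytest_configure(config):
    """pytest hook: stay completely inert unless the flag was passed."""
    if config.getoption("--gremlins"):
        # Hypothetical attribute; a real plugin would set up its own state here.
        config._gremlins_enabled = True
```

Placed in a `conftest.py` (or shipped in a package that declares a `pytest11` entry point), these two functions are enough for pytest to accept `pytest --gremlins` with no further configuration.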

3. Actionable Output

No walls of text. Results should tell you:

Which gremlins survived (your tests are weak here)
Which tests would need strengthening
Which code is well-protected
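
One way to picture those three answers is as a small summary step over raw results. The sketch below is only an illustration: `GremlinResult`, `summarize`, and the sample data are hypothetical names and values, and it assumes a mapping from code location to covering tests is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GremlinResult:
    gremlin_id: str
    location: str  # "path/to/file.py:line"
    zapped: bool   # True if at least one test failed with the gremlin active


def summarize(results, coverage_map):
    """Return (survivors with tests to strengthen, files with no survivors)."""
    survivors = [
        (r.gremlin_id, r.location, coverage_map.get(r.location, []))
        for r in results
        if not r.zapped
    ]
    all_files = {r.location.rsplit(":", 1)[0] for r in results}
    weak_files = {location.rsplit(":", 1)[0] for _, location, _ in survivors}
    return survivors, sorted(all_files - weak_files)


# Hypothetical results and coverage data for illustration only.
results = [
    GremlinResult("g001", "src/auth.py:42", zapped=True),
    GremlinResult("g002", "src/auth.py:42", zapped=False),
    GremlinResult("g003", "src/shipping.py:17", zapped=True),
]
coverage_map = {
    "src/auth.py:42": ["test_login_success", "test_login_failure"],
    "src/shipping.py:17": ["test_calculate_shipping"],
}

survivors, well_protected = summarize(results, coverage_map)
for gremlin_id, location, tests in survivors:
    print(f"{gremlin_id} survived at {location}; strengthen: {', '.join(tests)}")
print("Well-protected:", ", ".join(well_protected))
```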

Speed Architecture

The Four Pillars of Speed

┌─────────────────────────────────────────────────────────────┐
│                    SPEED STRATEGY                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────┐    ┌─────────────────┐                │
│  │   MUTATION      │    │   COVERAGE      │                │
│  │   SWITCHING     │    │   GUIDANCE      │                │
│  │                 │    │                 │                │
│  │  No file I/O    │    │  Only run tests │                │
│  │  No reloads     │    │  that matter    │                │
│  │  Single parse   │    │                 │                │
│  └────────┬────────┘    └────────┬────────┘                │
│           │                      │                          │
│           └──────────┬───────────┘                          │
│                      │                                      │
│  ┌─────────────────┐ │ ┌─────────────────┐                 │
│  │  INCREMENTAL   │◄─┴─►│    PARALLEL    │                 │
│  │   ANALYSIS     │     │   EXECUTION    │                 │
│  │                │     │                │                 │
│  │  Skip what     │     │  N workers,    │                 │
│  │  hasn't changed│     │  N gremlins    │                 │
│  └────────────────┘     └────────────────┘                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Pillar 1: Mutation Switching

The Problem: Traditional mutation testing modifies files on disk, reloads modules, runs tests, restores files. Repeat 1000x.

The Solution: Instrument code once with ALL mutations embedded, controlled by an environment variable:

Python

# Original
def is_adult(age):
    return age >= 18

# Instrumented (all gremlins baked in) — conceptual illustration.
# Actual mutations depend on which operators are enabled; the comparison
# operator produces >= → > and >= → <.
def is_adult(age):
    _g = __gremlin_active__
    if _g == "g001": return age > 18    # >= → >
    if _g == "g002": return age < 18    # >= → <
    if _g == "g003": return age >= 17   # boundary shift −1
    if _g == "g004": return age >= 19   # boundary shift +1
    return age >= 18                     # original

Benefits:

Zero file I/O during test runs
Zero module reloads
Single AST parse
Test process stays hot (no reimporting numpy 1000x)
Safe parallelization (workers just set different env vars)

Inspiration: Stryker 4.0's mutation switching delivers a 20-70% speedup. For Python projects with slow imports, the gains should be larger still, because the test process never has to re-import them.
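
To make the idea concrete, here is a minimal, runnable sketch of the instrumentation step using the standard `ast` module. It is an illustration under simplifying assumptions, not the plugin's implementation: it handles only single `>=` comparisons, the names `GtESwitcher` and `instrument` are invented for the example, and a module-level global stands in for the environment variable.

```python
import ast
import copy

SWITCH_NAME = "__gremlin_active__"  # global consulted by the instrumented code


class GtESwitcher(ast.NodeTransformer):
    """Embed `>` and `<` variants of every simple `>=` comparison."""

    def __init__(self):
        self.gremlins = {}  # gremlin id -> human-readable description

    def visit_Compare(self, node):
        self.generic_visit(node)
        if len(node.ops) != 1 or not isinstance(node.ops[0], ast.GtE):
            return node  # this sketch only handles single `>=` comparisons
        expr = node  # innermost fallback: the original comparison
        for op_type, symbol in ((ast.Gt, ">"), (ast.Lt, "<")):
            gremlin_id = f"g{len(self.gremlins) + 1:03d}"
            self.gremlins[gremlin_id] = f"line {node.lineno}: >= -> {symbol}"
            mutated = copy.deepcopy(node)
            mutated.ops = [op_type()]
            is_active = ast.Compare(
                left=ast.Name(id=SWITCH_NAME, ctx=ast.Load()),
                ops=[ast.Eq()],
                comparators=[ast.Constant(value=gremlin_id)],
            )
            # (mutated) if __gremlin_active__ == "gNNN" else (previous expr)
            expr = ast.IfExp(test=is_active, body=mutated, orelse=expr)
        return expr


def instrument(source):
    """Parse once, embed all gremlins, return (module namespace, gremlins)."""
    switcher = GtESwitcher()
    tree = ast.fix_missing_locations(switcher.visit(ast.parse(source)))
    namespace = {SWITCH_NAME: None}  # None means "run the original code"
    exec(compile(tree, "<instrumented>", "exec"), namespace)
    return namespace, switcher.gremlins


namespace, gremlins = instrument("def is_adult(age):\n    return age >= 18\n")
is_adult = namespace["is_adult"]

for active in (None, "g001", "g002"):
    namespace[SWITCH_NAME] = active  # flip the switch: no reload, no file I/O
    print(active, is_adult(18))
# None True
# g001 False
# g002 False
```

Because the switch is looked up when the function is called, the same compiled code serves every gremlin. A test asserting `is_adult(18)` would zap both gremlins here, since each one flips the result at the boundary.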

Pillar 2: Coverage-Guided Test Selection

The Problem: 1,000 gremlins × 500 tests = 500,000 test runs. Most are pointless - a test that never touches the mutated code can't catch the gremlin.

The Solution: Build a coverage map, only run tests that cover each gremlin's location:

Python

coverage_map = {
    "src/auth.py:42": ["test_login_success", "test_login_failure"],
    "src/shipping.py:17": ["test_calculate_shipping"],
}

# Gremlin in auth.py:42 → run 2 tests, not 500

Benefits:

10-1000x reduction in test executions
Scales better as project grows (more modular = more savings)
Identifies "incidentally tested" code (touched by many tests but not directly targeted)

Inspiration: PIT and Stryker both do this. Stryker reports 40-60% additional speedup from coverage guidance.
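
A coverage map like the one above could be derived by inverting per-test coverage data. The following sketch assumes that data is already available as a mapping from test name to covered `file:line` locations. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def build_coverage_map(lines_by_test):
    """Invert {test: {"file:line", ...}} into {"file:line": [tests]}."""
    coverage_map = defaultdict(list)
    for test_name in sorted(lines_by_test):
        for location in sorted(lines_by_test[test_name]):
            coverage_map[location].append(test_name)
    return dict(coverage_map)


def select_tests(coverage_map, gremlin_location):
    """Tests worth running for a gremlin; empty means nothing covers it."""
    return coverage_map.get(gremlin_location, [])


# Hypothetical per-test coverage, e.g. recorded with one coverage context per test.
lines_by_test = {
    "test_login_success": {"src/auth.py:41", "src/auth.py:42"},
    "test_login_failure": {"src/auth.py:42"},
    "test_calculate_shipping": {"src/shipping.py:17"},
}

coverage_map = build_coverage_map(lines_by_test)
print(select_tests(coverage_map, "src/auth.py:42"))
# ['test_login_failure', 'test_login_success']
print(select_tests(coverage_map, "src/billing.py:9"))
# []
```

An empty selection is itself useful information: no test touches that location, so the gremlin could be reported as a survivor without running anything.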

Pillar 3: Incremental Analysis

The Problem: You run mutation testing (10 minutes), fix one test, run again (10 minutes). Terrible feedback loop.

The Solution: Cache results keyed by content hashes. Only re-run when source or tests change:

Python

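import hashlib

history = {}  # cache key -> cached result; persisted between runs in practice

def cached_result(source_text, test_text, gremlin_definition, run_tests):
    # hashlib, not built-in hash(): str hashes are salted per process,
    # so hash()-based keys would never match on the next run.
    content = "\0".join((source_text, test_text, gremlin_definition))
    cache_key = hashlib.sha256(content.encode("utf-8")).hexdigest()

    if cache_key in history:
        return history[cache_key]  # Instant
    result = run_tests()
    history[cache_key] = result
    return result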

Invalidation Rules:

Change	Action
Source file modified	Re-run gremlins in that file
Test file modified	Re-run gremlins covered by those tests
New test added	Re-run gremlins the new test covers
Test deleted	Re-run gremlins that test was zapping
Nothing changed	Return cached results instantly
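
The rules in this table fall out naturally if each gremlin's cache key covers its own definition, its source file, and every test that covers it. The sketch below illustrates that under stated assumptions: `gremlin_cache_key` and `digest` are hypothetical helpers, and the covering tests are assumed to come from an up-to-date coverage map.

```python
import hashlib


def digest(text):
    """Stable content hash (identical across processes and machines)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def gremlin_cache_key(gremlin_id, source_text, covering_tests):
    """Key over the gremlin, its source file, and every test that covers it.

    covering_tests maps test name -> text of the file defining that test.
    """
    parts = [gremlin_id, digest(source_text)]
    for test_name in sorted(covering_tests):
        parts.append(test_name)
        parts.append(digest(covering_tests[test_name]))
    return digest("\0".join(parts))


source = "def is_adult(age):\n    return age >= 18\n"
tests = {
    "test_adult": "assert is_adult(18)",
    "test_minor": "assert not is_adult(17)",
}
baseline = gremlin_cache_key("g001", source, tests)

# Nothing changed: same key, so the cached result can be reused.
print(gremlin_cache_key("g001", source, dict(tests)) == baseline)  # True

# Each kind of change from the table produces a different key.
changed = {
    "source modified": gremlin_cache_key("g001", source + "# edit\n", tests),
    "test modified": gremlin_cache_key(
        "g001", source, {**tests, "test_adult": "assert is_adult(21)"}
    ),
    "test added": gremlin_cache_key(
        "g001", source, {**tests, "test_zero": "assert not is_adult(0)"}
    ),
    "test deleted": gremlin_cache_key(
        "g001", source, {"test_adult": tests["test_adult"]}
    ),
}
print(all(key != baseline for key in changed.values()))  # True
```

With this design there is no separate invalidation step: a changed input simply yields a key that is not in the cache yet.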

Benefits:

Subsequent runs finish in seconds
Enables mutation testing in TDD workflow
CI caching works naturally

Inspiration: PIT reduced a 31-hour analysis to under 3 minutes with incremental mode.

Pillar 4: Parallel Execution

The Problem: Even with all optimizations, 1,000 gremlins still take time when run sequentially.

The Solution: Distribute gremlins across worker processes:

Main Process
    │
    ├── Worker 1: ACTIVE_GREMLIN=1,5,9...
    ├── Worker 2: ACTIVE_GREMLIN=2,6,10...
    ├── Worker 3: ACTIVE_GREMLIN=3,7,11...
    └── Worker 4: ACTIVE_GREMLIN=4,8,12...

Why Mutation Switching Enables This:

Traditional approach: workers fight over file modifications
Mutation switching: all workers read the same instrumented code and just set different env vars
No locks, no file copies, no coordination needed

Benefits:

Near-linear speedup with core count (gremlins are independent of one another)
Safe by design (no shared mutable state)
Simple implementation (ProcessPoolExecutor)
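
The round-robin split in the diagram above can be sketched with `ProcessPoolExecutor` as follows. This is a simplified illustration, not the plugin's scheduler: `partition`, `run_batch`, and `run_all` are hypothetical names, and the worker body is a placeholder where the covering tests would actually run.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def partition(gremlin_ids, workers):
    """Round-robin split: worker i gets ids i, i + workers, i + 2 * workers, ..."""
    return [gremlin_ids[i::workers] for i in range(workers)]


def run_batch(batch):
    """Runs inside one worker. Activating a gremlin is just setting an env var."""
    results = {}
    for gremlin_id in batch:
        os.environ["ACTIVE_GREMLIN"] = str(gremlin_id)
        # Placeholder: a real worker would run the covering tests here and
        # record whether the gremlin was zapped.
        results[gremlin_id] = os.environ["ACTIVE_GREMLIN"]
    return results


def run_all(gremlin_ids, workers=4):
    """Fan batches out to worker processes and merge their results."""
    merged = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for batch_result in pool.map(run_batch, partition(gremlin_ids, workers)):
            merged.update(batch_result)
    return merged


if __name__ == "__main__":
    print(partition(list(range(1, 13)), 4))
    # [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
```

Each worker process has its own environment, so setting `ACTIVE_GREMLIN` in one worker cannot affect another. Call `run_all(...)` from under the `if __name__ == "__main__":` guard so that spawned workers do not re-run the script body.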

Combined Speedup

Optimization	Individual Gain	Cumulative
Baseline (naive)	1x	1x
Mutation switching	2-5x	2-5x
Coverage guidance	10-100x	20-500x
Incremental analysis	10-1000x (repeat runs)	200-500,000x
Parallel (8 cores)	8x	1,600-4,000,000x

A project that took 8 hours with naive mutation testing could complete in seconds on repeat runs with all optimizations.
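
The cumulative column is the running product of the individual gains, which assumes the optimizations compound independently. A few lines of arithmetic reproduce the table and the 8-hour example:

```python
# (optimization, low gain, high gain), taken from the table above
gains = [
    ("Mutation switching", 2, 5),
    ("Coverage guidance", 10, 100),
    ("Incremental analysis", 10, 1000),
    ("Parallel (8 cores)", 8, 8),
]

low = high = 1
cumulative = []
for name, low_gain, high_gain in gains:
    low *= low_gain
    high *= high_gain
    cumulative.append((name, low, high))
    print(f"{name:<22} {low:>7,}x to {high:>9,}x")

naive_seconds = 8 * 60 * 60  # the 8-hour naive run from the text
print(f"Repeat run at the low end: {naive_seconds / low:.0f} s")  # 18 s
```

Even at the conservative end of every range, 28,800 seconds divided by 1,600 is 18 seconds.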

Domain Language (Gremlins Theme)

We use Gremlins movie references as our ubiquitous language:

Traditional Term	Gremlin Term	Description
Original code	Mogwai	Clean, untouched source code
Start mutation testing	Feed after midnight	Begin the mutation process
Mutation engine	Midnight feeding	Transforms mogwai into gremlins
Mutant	Gremlin	A mutation injected into code
Kill mutant	Zap	Test catches the mutation
Surviving mutant	Survivor	Mutation not caught (weak test coverage)
Cleanup/reporting	Microwave	Eliminate gremlins, generate report

The Workflow

1. MOGWAI        Your original, well-behaved source code
       │
       ▼
2. FEED AFTER    Start pytest-gremlins
   MIDNIGHT
       │
       ▼
3. GREMLINS      Mutations spawn throughout your code
   EMERGE
       │
       ▼
4. ZAP           Tests hunt and eliminate gremlins
   GREMLINS
       │
       ▼
5. SURVIVORS     Report which gremlins your tests missed
   REPORT
       │
       ▼
6. MICROWAVE     Clean up, strengthen tests, repeat

Non-Goals (For Now)

Distributed execution across machines - Celery/RabbitMQ complexity not worth it for v1
Every possible mutation operator - Start with high-value operators, expand later
Python < 3.11 support - Modern Python only, leverage match statements and type hints
Framework-specific integrations - Django, Flask, etc. can come later

Success Metrics

Speed: Incremental run on unchanged code < 5 seconds
Speed: Full run on 10K LOC project < 5 minutes (8 cores)
Usability: Zero config for basic usage (pytest --gremlins)
Accuracy: No false positives (gremlins reported as survivors that tests actually catch)

Prior Art & Inspiration

PIT (Java) - Gold standard, incremental analysis, parallel execution
Stryker (JS/TS) - Mutation switching architecture
mutmut (Python) - Popular, fork-based isolation (Unix only), v3+ has parallelization
Cosmic Ray (Python) - Good operators, bad UX

Open Questions

AST vs. Bytecode mutation? - AST is more readable/debuggable, bytecode might be faster
How to handle module-level code? - Code that runs at import time before gremlin switch
What mutation operators for v1? - Need to define the initial gremlin types
Integration with pytest-test-categories? - Run gremlins only on "small" tests for speed?
Report format? - HTML? JSON? GitHub annotations? All of the above?