Files
superpowers/tests/claude-code/README.md
Jesse Vincent 0bf37499b4 Address adversarial review findings
- evals/README.md, evals/CLAUDE.md: fix uv install command from
  'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml
  uses [project.optional-dependencies], so --dev is a no-op for
  pytest/ruff/ty; --extra dev is the correct invocation.
- tests/claude-code/run-skill-tests.sh: drop test-requesting-code-review.sh
  from integration_tests array (file deleted earlier in this branch).
- tests/claude-code/README.md: replace test-requesting-code-review.sh
  section with test-worktree-native-preference.sh (the worktree test
  is kept; the code-review test was lifted into drill).
- docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness
  list. evals/backends/ has claude*, codex, gemini configs but no
  copilot.yaml, so the claim was unsupported.

Adversarial review credit: reviewer #2 found four legitimate issues
(uv-sync, run-skill-tests stale ref, README stale ref via #1, and
Copilot CLI fabrication); reviewer #1 found two distinct issues
(run-skill-tests + tests/claude-code/README.md). Reviewer #2 wins
this round.
2026-05-06 15:47:39 -07:00

4.6 KiB

Claude Code Skills Tests

Automated tests for superpowers skills using Claude Code CLI.

Overview

This test suite verifies that skills are loaded correctly and Claude follows them as expected. Tests invoke Claude Code in headless mode (claude -p) and verify the behavior.

Requirements

  • Claude Code CLI installed and in PATH (claude --version should work)
  • Local superpowers plugin installed (see main README for installation)

Running Tests

./run-skill-tests.sh

Run integration tests (slow, 10-30 minutes):

./run-skill-tests.sh --integration

Run specific test:

./run-skill-tests.sh --test test-subagent-driven-development.sh

Run with verbose output:

./run-skill-tests.sh --verbose

Set custom timeout:

./run-skill-tests.sh --timeout 1800  # 30 minutes for integration tests

Test Structure

test-helpers.sh

Common functions for skills testing:

  • run_claude "prompt" [timeout] - Run Claude with prompt
  • assert_contains output pattern name - Verify pattern exists
  • assert_not_contains output pattern name - Verify pattern absent
  • assert_count output pattern count name - Verify exact count
  • assert_order output pattern_a pattern_b name - Verify order
  • create_test_project - Create temp test directory
  • create_test_plan project_dir - Create sample plan file

Test Files

Each test file:

  1. Sources test-helpers.sh
  2. Runs Claude Code with specific prompts
  3. Verifies expected behavior using assertions
  4. Returns 0 on success, non-zero on failure

Example Test

#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/test-helpers.sh"

echo "=== Test: My Skill ==="

# Ask Claude about the skill
output=$(run_claude "What does the my-skill skill do?" 30)

# Verify response
assert_contains "$output" "expected behavior" "Skill describes behavior"

echo "=== All tests passed ==="

Current Tests

Fast Tests (run by default)

test-subagent-driven-development.sh

Tests skill content and requirements (~2 minutes):

  • Skill loading and accessibility
  • Workflow ordering (spec compliance before code quality)
  • Self-review requirements documented
  • Plan reading efficiency documented
  • Spec compliance reviewer skepticism documented
  • Review loops documented
  • Task context provision documented

Integration Tests (use --integration flag)

test-subagent-driven-development-integration.sh

Full workflow execution test (~10-30 minutes):

  • Creates real test project with Node.js setup
  • Creates implementation plan with 2 tasks
  • Executes plan using subagent-driven-development
  • Verifies actual behaviors:
    • Plan read once at start (not per task)
    • Full task text provided in subagent prompts
    • Subagents perform self-review before reporting
    • Spec compliance review happens before code quality
    • Spec reviewer reads code independently
    • Working implementation is produced
    • Tests pass
    • Proper git commits created

What it tests:

  • The workflow actually works end-to-end
  • Our improvements are actually applied
  • Subagents follow the skill correctly
  • Final code is functional and tested

test-worktree-native-preference.sh

RED-GREEN-REFACTOR validation for the using-git-worktrees skill (~5 minutes):

  • RED: skill without Step 1a — agent should use git worktree add
  • GREEN: skill with Step 1a — agent should use the native EnterWorktree tool
  • PRESSURE: same as GREEN under urgency framing with pre-existing .worktrees/
  • Drill scenario worktree-creation-under-pressure.yaml covers the PRESSURE phase only

Adding New Tests

  1. Create new test file: test-<skill-name>.sh
  2. Source test-helpers.sh
  3. Write tests using run_claude and assertions
  4. Add to test list in run-skill-tests.sh
  5. Make executable: chmod +x test-<skill-name>.sh

Timeout Considerations

  • Default timeout: 5 minutes per test
  • Claude Code may take time to respond
  • Adjust with --timeout if needed
  • Tests should be focused to avoid long runs

Debugging Failed Tests

With --verbose, you'll see full Claude output:

./run-skill-tests.sh --verbose --test test-subagent-driven-development.sh

Without verbose, only failures show output.

CI/CD Integration

To run in CI:

# Run with explicit timeout for CI environments
./run-skill-tests.sh --timeout 900

# Exit code 0 = success, non-zero = failure

Notes

  • Tests verify skill instructions, not full execution
  • Full workflow tests would be very slow
  • Focus on verifying key skill requirements
  • Tests should be deterministic
  • Avoid testing implementation details