# Testing Superpowers
Superpowers has two distinct kinds of tests, each in its own directory:
- `tests/` — does the plugin's non-LLM code work? Bash + node + python integration tests for the brainstorm-server JS, OpenCode plugin loading, codex-plugin sync, and analysis utilities.
- `evals/` — do agents behave correctly in real LLM sessions? Python harness driving real tmux sessions of Claude Code / Codex / Gemini CLI, with an LLM actor and a verifier judging skill compliance.
## Plugin tests
Live in `tests/`. Currently:
- `tests/brainstorm-server/` — node test suite for the brainstorm server JS code.
- `tests/opencode/` — bash tests for OpenCode plugin loading, bootstrap caching, and tool registration.
- `tests/codex-plugin-sync/` — bash sync verification.
- `tests/claude-code/test-helpers.sh`, `analyze-token-usage.py` — utilities used by the remaining bash tests.
- `tests/claude-code/test-subagent-driven-development.sh` — agent-can-describe-SDD test (no drill counterpart; tests description recall, not behavior).
- `tests/claude-code/test-subagent-driven-development-integration.sh` — extended SDD integration with token analysis (drill covers the YAGNI subset; bash adds commit-count, TodoWrite, and token-telemetry assertions).
- `tests/claude-code/test-worktree-native-preference.sh` — RED-GREEN-REFACTOR validation for the worktree skill (drill covers the PRESSURE phase; bash also covers the RED/GREEN baselines).
- `tests/explicit-skill-requests/` — Haiku-specific, multi-turn, and skill-name-prompted tests not covered by drill.
Run plugin tests via the relevant directory's `run-*.sh` or `npm test`.
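A minimal sketch of running everything from the repo root, assuming each bash suite's entry point follows the `run-*.sh` convention named above (the loop is illustrative; it is not a script that ships with the repo):

```bash
#!/usr/bin/env bash
# Illustrative runner, not a script from the repo.
set -euo pipefail
shopt -s nullglob

# Bash suites: each test directory exposes a run-*.sh entry point.
for runner in tests/*/run-*.sh; do
    echo "=== $runner ==="
    (cd "$(dirname "$runner")" && "./$(basename "$runner")")
done

# The brainstorm-server suite is node-based and runs via npm instead.
(cd tests/brainstorm-server && npm test)
```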
## Skill behavior evals
Live in `evals/`. Drill is the harness; scenarios live at `evals/scenarios/*.yaml`. See `evals/README.md` for setup. Quick start:
```bash
cd evals
uv sync --extra dev
export ANTHROPIC_API_KEY=sk-...
uv run drill run triggering-test-driven-development -b claude
```
Drill scenarios are slow (3-30+ minutes each) and run real LLM sessions. They are not part of CI today; the natural follow-up is a tiered model (fast subset on PR, full sweep nightly + on-demand).
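For the full-sweep half of that model, a nightly job could simply iterate the scenario files. This sketch assumes only the `evals/scenarios/*.yaml` layout and the documented `uv run drill run <name> -b claude` invocation; the failure bookkeeping is illustrative:

```bash
#!/usr/bin/env bash
# Illustrative nightly sweep; not part of CI today.
set -uo pipefail   # no -e: keep going when one scenario fails
shopt -s nullglob
cd evals

failed=()
for scenario in scenarios/*.yaml; do
    name="$(basename "$scenario" .yaml)"
    echo "=== drill: $name ==="
    uv run drill run "$name" -b claude || failed+=("$name")
done

# Report any failing scenarios at the end instead of aborting mid-sweep.
if ((${#failed[@]})); then
    printf 'FAILED: %s\n' "${failed[@]}"
    exit 1
fi
```

The fast on-PR subset would be the same loop over a hand-picked list of scenario names rather than the whole glob.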