ironclaw/scripts/coverage.sh
Zaki Manian b4b19738a8 Trajectory benchmarks and e2e trace test rig (#553)
* refactor: extract shared assertion helpers to support/assertions.rs

Move 5 assertion helpers from e2e_spot_checks.rs to a shared module.
Add assert_all_tools_succeeded and assert_tool_succeeded for eliminating
false positives in E2E tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add tool output capture via tool_results() accessor

Extract (name, preview) from ToolResult status events in TestChannel
and TestRig, enabling content assertions on tool outputs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: correct tool parameters in 3 broken trace fixtures

- tool_time.json: add missing "operation": "now" for time tool
- robust_correct_tool.json: same fix
- memory_full_cycle.json: change "path" to "target" for memory_write

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add tool success and output assertions to eliminate false positives

Every E2E test that exercises tools now calls assert_all_tools_succeeded.
Added tool output content assertions where tool results are predictable
(time year, read_file content, memory_read content).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: capture per-tool timing from ToolStarted/ToolCompleted events

Record Instant on ToolStarted and compute elapsed duration on
ToolCompleted, wiring real timing data into collect_metrics() instead
of hardcoded zeros.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: add RAII CleanupGuard for temp file/dir cleanup in tests

Replace manual cleanup_test_dir() calls and inline remove_file() with
Drop-based CleanupGuard that ensures cleanup even if a test panics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add Drop impl and graceful shutdown for TestRig

Wrap agent_handle in Option so Drop can abort leaked tasks. Signal
the channel shutdown before aborting for future cooperative shutdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: replace agent startup sleep with oneshot ready signal

Use a oneshot channel fired in Channel::start() instead of a fixed
100ms sleep, eliminating the race condition on slow systems.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: replace fragile string-matching iteration limit with count-based detection

Use tool completion count vs max_tool_iterations instead of scanning
status messages for "iteration"/"limit" substrings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use assert_all_tools_succeeded for memory_full_cycle test

Remove incorrect comment about memory_tree failing with empty path
(it actually succeeds). Omit empty path from fixture and use the
standard assert_all_tools_succeeded instead of per-tool assertions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: promote benchmark metrics types to library code

Move TraceMetrics, ScenarioResult, RunResult, MetricDelta, and
compare_runs() from tests/support/metrics.rs to src/benchmark/metrics.rs.
Existing tests use re-export for backward compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add Scenario and Criterion types for agent benchmarking

Scenario defines a task with input, success criteria, and resource
limits. Criterion is an enum of programmatic checks (tool_used,
response_contains, etc.) evaluated without LLM judgment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add initial benchmark scenario suite (12 scenarios across 5 categories)

Scenarios cover tool_selection, tool_chaining, error_recovery,
efficiency, and memory_operations. All loaded from JSON with
deserialization validation test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add benchmark runner with BenchChannel and InstrumentedLlm

BenchChannel is a minimal Channel implementation for benchmarks.
InstrumentedLlm wraps any LlmProvider to capture per-call metrics.
Runner creates a fresh agent per scenario, evaluates success criteria,
and produces RunResult with timing, token, and cost metrics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add baseline management, reports, and benchmark entry point

- baseline.rs: load/save/promote benchmark results
- report.rs: format comparison reports with regression detection
- benchmark_runner.rs: integration test with real LLM (feature-gated)
- Add benchmark feature flag to Cargo.toml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: apply cargo fmt to benchmark module

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add multi-turn scenario types with setup, judge, ResponseNotContains

Add BenchScenario, Turn, TurnAssertions, JudgeConfig, ScenarioSetup,
WorkspaceSetup, SeedDocument types for multi-turn benchmark scenarios.
Add ResponseNotContains criterion variant. Add TurnAssertions::to_criteria()
converter for backward compat with existing evaluation engine.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add JSON scenario loader with recursive discovery and tag filter

Add load_bench_scenarios() for the new BenchScenario format with recursive
directory traversal and tag-based filtering. Create 4 initial trajectory
scenarios across tool-selection, multi-turn, and efficiency categories.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): multi-turn runner with workspace seeding and per-turn metrics

Add run_bench_scenario() that loops over BenchScenario turns, seeds workspace
documents, collects per-turn metrics (tokens, tool calls, wall time), and
evaluates per-turn assertions. Add TurnMetrics to metrics.rs and
clear_for_next_turn() to BenchChannel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add LLM-as-judge scoring with prompt formatting and score parsing

Create judge.rs with format_judge_prompt, parse_judge_score, and judge_turn.
Wire into run_bench_scenario for turns with judge config -- scores below
min_score fail the turn.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add CLI subcommand (ironclaw benchmark)

Add BenchmarkCommand with --tags, --scenario, --no-judge, --timeout,
--update-baseline flags. Wire into Command enum and main.rs dispatch.
Feature-gated behind benchmark flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): per-scenario JSON output with full trajectory

Add save_scenario_results() that writes per-scenario JSON files alongside
the run summary. Each scenario gets its own file with turn_metrics trajectory.
Update CLI to use new output format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add ToolRegistry::retain_only and wire tool filtering in scenarios

Add a retain_only() method to ToolRegistry that filters tools down to a
given allowlist. Wire this into run_bench_scenario() so that when a
scenario specifies a tools list in its setup, only those tools are
available during the benchmark run. Includes two tests for the new
method: one verifying filtering works and one verifying empty input
is a no-op.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): wire identity overrides into workspace before agent start

Add seed_identity() helper that writes identity files (IDENTITY.md,
USER.md, etc.) into the workspace before the agent starts, so that
workspace.system_prompt() picks them up. Wire it into
run_bench_scenario() after workspace seeding. Include a test that
verifies identity files are written and readable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add --parallel and --max-cost CLI flags

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(benchmark): use feature-conditional snapshot names for CLI help tests

Prevents snapshot conflicts between default (no benchmark) and
all-features (with benchmark) builds by using separate snapshot names
per feature set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): parallel execution with JoinSet and budget cap enforcement

Replace sequential loop in run_all_bench() with parallel execution using
JoinSet + semaphore when config.parallel > 1. Add budget cap enforcement
that skips remaining scenarios when max_total_cost_usd is exceeded.
Track skipped count in RunResult.skipped_scenarios and display it in
format_report().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add tool restriction and identity override test scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: fix formatting for Phase 3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add SkillRegistry::retain_only and wire skill filtering in scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(benchmark): add --json flag for machine-readable output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: add GitHub Actions benchmark workflow (manual trigger)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(benchmark): remove in-tree benchmark harness, keep retain_only utilities

Move benchmark-specific code out of ironclaw in preparation for the
nearai/benchmarks trajectory adapter. This removes:

- src/benchmark/ (runner, scenarios, metrics, judge, report, etc.)
- src/cli/benchmark.rs and the Benchmark CLI subcommand
- benchmarks/ data directory (scenarios + trajectories)
- .github/workflows/benchmark.yml
- The "benchmark" Cargo feature flag

What remains:
- ToolRegistry::retain_only() and SkillRegistry::retain_only()
- Test support types (TraceMetrics, InstrumentedLlm) inlined into
  tests/support/ instead of re-exporting from the deleted module

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add README for LLM trace fixture format

Documents the trajectory JSON format, response types, request hints,
directory structure, and how to write new traces.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(test): unify trace format around turns, add multi-turn support

Introduce TraceTurn type that groups user_input with LLM response steps,
making traces self-contained conversation trajectories. Add run_trace()
to TestRig for automatic multi-turn replay. Backward-compatible: flat
"steps" JSON is deserialized as a single turn transparently.

Includes all trace fixtures (spot, coverage, advanced), plan docs, and
new e2e tests for steering, error recovery, long chains, memory, and
prompt injection resilience.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): fix CI failures after merging main

- Fix tool_json fixture: use "data" parameter (not "input") to match
  JsonTool schema
- Fix status_events test: remove assertion for "time" tool that isn't
  in the fixture (only "echo" calls are used)
- Allow dead_code in test support metrics/instrumented_llm modules
  (utilities for future benchmark tests)

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Working on recording traces and testing them

* feat(test): add declarative expects to trace fixtures, split infra tests

Add TraceExpects struct with 9 optional assertion fields (response_contains,
tools_used, all_tools_succeeded, etc.) that can be declared in fixture JSON
instead of hand-written Rust. Add verify_expects() and run_recorded_trace()
so recorded trace tests become one-liners.

Split trace infra tests (deserialization, backward compat) into
tests/trace_format.rs which doesn't require the libsql feature gate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(test): add expects to all trace fixtures, simplify e2e tests

Add declarative expects blocks to all 19 trace fixture JSONs across
spot/, coverage/, advanced/, and root directories. Update all 8 e2e
test files to use verify_trace_expects() / run_and_verify_trace(),
replacing ~270 lines of hand-written assertions with fixture-driven
verification.

Tests that check things beyond expects (file content on disk, metrics,
event ordering) keep those extra assertions alongside the declarative
ones.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): adapt tests to AppBuilder refactor, fix formatting

Update test files to work with refactored TestRigBuilder that uses
AppBuilder::build_all() (removing with_tools/with_workspace methods).
Update telegram_check fixture to use tool_list instead of echo.
Fix cargo fmt issues in src/llm/mod.rs and src/llm/recording.rs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(test): deduplicate support unit tests into single binary

Support modules (assertions, cleanup, test_channel, test_rig, trace_llm)
had #[cfg(test)] mod tests blocks that were compiled and run 12 times —
once per e2e test binary that declares `mod support;`. Extracted all 29
support unit tests into a dedicated `tests/support_unit_tests.rs` so they
run exactly once.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix trailing newlines in support files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(test): unify trace types and fix recorded multi-turn replay

Import shared types (TraceStep, TraceResponse, TraceToolCall, RequestHint,
ExpectedToolResult, MemorySnapshotEntry, HttpExchange*) from
ironclaw::llm::recording instead of redefining them in trace_llm.rs.

Fix the flat-steps deserializer to split at UserInput boundaries into
multiple turns, instead of filtering them out and wrapping everything
into a single turn. This enables recorded multi-turn traces to be
replayed as proper multi-turn conversations via run_trace().

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): fix CI failures - unused imports and missing struct fields

- Add #[allow(unused_imports)] on pub use re-exports in trace_llm.rs
  (types are re-exported for downstream test files, not used locally)
- Add `..` to ToolCompleted pattern in test_channel.rs to match new
  `error` and `parameters` fields

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): fix CI failures after merging main

- Add missing `error` and `parameters` fields to ToolCompleted
  constructors in support_unit_tests.rs
- Add `..` to ToolCompleted pattern match in support_unit_tests.rs
- Add #[allow(dead_code)] to CleanupGuard, LlmTrace impl, and
  TraceLlm impl (only used behind #[cfg(feature = "libsql")])

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Adding coverage running script

* fix(test): address review feedback on E2E test infrastructure

- Increase wait_for_responses polling to exponential backoff (50ms-500ms)
  and raise default timeout from 15s to 30s to reduce CI flakiness (#1)
- Strengthen prompt_injection_resilience test with positive safety layer
  assertion via has_safety_warnings(), enable injection_check (#2)
- Add assert_tool_order() helper and tools_order field in TraceExpects
  for verifying tool execution ordering in multi-step traces (#3)
- Document TraceLlm sequential-call assumption for concurrency (#6)
- Clean up CleanupGuard with PathKind enum instead of shotgun
  remove_file + remove_dir_all on every path (#8)
- Fix coverage.sh: default to --lib only, fix multi-filter syntax,
  add COV_ALL_TARGETS option
- Add coverage/ to .gitignore
- Remove planning docs from PR

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review - use HashSet in retain_only, improve skill test

- Use HashSet for O(N+M) lookup in SkillRegistry::retain_only and
  ToolRegistry::retain_only instead of linear scan
- Strengthen test_retain_only_empty_is_noop in SkillRegistry to
  pre-populate with a skill before asserting the no-op behavior

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): revert incorrect safety layer assertion in injection test

The safety layer sanitizes tool output, not user input. The injection
test sends a malicious user message with no tools called, so the safety
layer never fires. Reverted to the original test which correctly
validates the LLM refuses via trace expects. Also fixed case-sensitive
request hint ("ignore" -> "Ignore") to suppress noisy warning.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clean stale profdata before coverage run

Adds `cargo llvm-cov clean` before each run to prevent
"mismatched data" warnings from stale instrumentation profiles.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix formatting in retain_only test

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Polosukhin <ilblackdragon@gmail.com>
2026-03-05 09:13:09 +00:00

#!/usr/bin/env bash
# Generate an HTML coverage report for a given set of tests.
#
# Usage:
# ./scripts/coverage.sh # all tests (lib only)
# ./scripts/coverage.sh safety # tests matching "safety"
# ./scripts/coverage.sh safety::sanitizer # specific module tests
# ./scripts/coverage.sh test_a test_b test_c # multiple test filters
#
# Options (env vars):
# COV_OPEN=1 Auto-open the report in a browser (default: 1)
# COV_FORMAT=html Output format: html, text, json, lcov (default: html)
# COV_OUT=coverage Output directory (default: coverage/)
# COV_FEATURES="" Features to pass via --features (default: unset, which uses --all-features)
# COV_ALL_TARGETS=0 Set to 1 to include integration tests (default: lib only)
#
# Requires: cargo-llvm-cov (install: cargo install cargo-llvm-cov)
set -euo pipefail
COV_OPEN="${COV_OPEN:-1}"
COV_FORMAT="${COV_FORMAT:-html}"
COV_OUT="${COV_OUT:-coverage}"
COV_FEATURES="${COV_FEATURES:-}"
COV_ALL_TARGETS="${COV_ALL_TARGETS:-0}"
cd "$(git rev-parse --show-toplevel)"
if ! command -v cargo-llvm-cov &>/dev/null; then
    echo "ERROR: cargo-llvm-cov not found. Install with: cargo install cargo-llvm-cov" >&2
    exit 1
fi
# Clean stale profiling data to avoid "mismatched data" warnings.
cargo llvm-cov clean --workspace 2>/dev/null || true
# Build the cargo llvm-cov command
cmd=(cargo llvm-cov)
# Features
if [[ -n "$COV_FEATURES" ]]; then
    cmd+=(--features "$COV_FEATURES")
else
    cmd+=(--all-features)
fi
# By default, only run the lib unit tests (fast, no integration test compilation).
# Set COV_ALL_TARGETS=1 to include integration tests.
if [[ "$COV_ALL_TARGETS" != "1" ]]; then
    cmd+=(--lib)
fi
# Output format
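case "$COV_FORMAT" in
    html)
        cmd+=(--html --output-dir "$COV_OUT")
        ;;
    text)
        cmd+=(--text)
        ;;
    json)
        # Ensure the output directory exists before the report is written.
        mkdir -p "$COV_OUT"
        cmd+=(--json --output-path "$COV_OUT/coverage.json")
        ;;
    lcov)
        # Ensure the output directory exists before the report is written.
        mkdir -p "$COV_OUT"
        cmd+=(--lcov --output-path "$COV_OUT/lcov.info")
        ;;
    *)
        echo "ERROR: Unknown format '$COV_FORMAT'. Use: html, text, json, lcov" >&2
        exit 1
        ;;
esac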
# Test name filters (passed after -- to the test harness).
# Each filter is forwarded as its own argument; the harness runs tests whose
# names contain any of them. Filters are plain substrings, not regexes, so
# joining them with | would match nothing.
if [[ $# -gt 0 ]]; then
    cmd+=(-- "$@")
fi
echo "Running: ${cmd[*]}"
echo ""
"${cmd[@]}"
# Open report
if [[ "$COV_FORMAT" == "html" && "$COV_OPEN" == "1" ]]; then
    index="$COV_OUT/html/index.html"
    if [[ -f "$index" ]]; then
        echo ""
        echo "Report: $index"
        if command -v open &>/dev/null; then
            open "$index"
        elif command -v xdg-open &>/dev/null; then
            xdg-open "$index"
        fi
    fi
fi
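
The usage header lists `test_a test_b test_c` as a multi-filter invocation. The following is a minimal sketch of one way such filters can be forwarded. It assumes the Rust test harness's behavior of accepting several positional filters and running tests whose names contain any of them. `build_cmd` is a hypothetical helper, not part of the script, and it only prints the argument vector instead of invoking cargo.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: assemble the argument vector the way the script does,
# then print one argument per line instead of running it.
build_cmd() {
    local -a cmd=(cargo llvm-cov --lib)
    if [[ $# -gt 0 ]]; then
        # "$@" keeps every filter as a separate word, even if it contains spaces.
        cmd+=(-- "$@")
    fi
    printf '%s\n' "${cmd[@]}"
}

build_cmd safety::sanitizer test_a
# prints:
#   cargo
#   llvm-cov
#   --lib
#   --
#   safety::sanitizer
#   test_a
```

Printing one argument per line makes the word boundaries visible: each filter reaches the harness as its own argument, which is what the quoted `"$@"` expansion guarantees.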