mirror of
https://github.com/nearai/ironclaw.git
synced 2026-06-01 03:22:20 +08:00
* refactor: extract shared assertion helpers to support/assertions.rs

  Move 5 assertion helpers from e2e_spot_checks.rs to a shared module. Add assert_all_tools_succeeded and assert_tool_succeeded for eliminating false positives in E2E tests.

* feat: add tool output capture via tool_results() accessor

  Extract (name, preview) from ToolResult status events in TestChannel and TestRig, enabling content assertions on tool outputs.

* fix: correct tool parameters in 3 broken trace fixtures

  - tool_time.json: add missing "operation": "now" for time tool
  - robust_correct_tool.json: same fix
  - memory_full_cycle.json: change "path" to "target" for memory_write

* fix: add tool success and output assertions to eliminate false positives

  Every E2E test that exercises tools now calls assert_all_tools_succeeded. Added tool output content assertions where tool results are predictable (time year, read_file content, memory_read content).

* feat: capture per-tool timing from ToolStarted/ToolCompleted events

  Record Instant on ToolStarted and compute elapsed duration on ToolCompleted, wiring real timing data into collect_metrics() instead of hardcoded zeros.

* refactor: add RAII CleanupGuard for temp file/dir cleanup in tests

  Replace manual cleanup_test_dir() calls and inline remove_file() with Drop-based CleanupGuard that ensures cleanup even if a test panics.

* fix: add Drop impl and graceful shutdown for TestRig

  Wrap agent_handle in Option so Drop can abort leaked tasks. Signal the channel shutdown before aborting for future cooperative shutdown.

* fix: replace agent startup sleep with oneshot ready signal

  Use a oneshot channel fired in Channel::start() instead of a fixed 100ms sleep, eliminating the race condition on slow systems.

* fix: replace fragile string-matching iteration limit with count-based detection

  Use tool completion count vs max_tool_iterations instead of scanning status messages for "iteration"/"limit" substrings.

* fix: use assert_all_tools_succeeded for memory_full_cycle test

  Remove incorrect comment about memory_tree failing with empty path (it actually succeeds). Omit empty path from fixture and use the standard assert_all_tools_succeeded instead of per-tool assertions.

* refactor: promote benchmark metrics types to library code

  Move TraceMetrics, ScenarioResult, RunResult, MetricDelta, and compare_runs() from tests/support/metrics.rs to src/benchmark/metrics.rs. Existing tests use re-export for backward compatibility.

* feat: add Scenario and Criterion types for agent benchmarking

  Scenario defines a task with input, success criteria, and resource limits. Criterion is an enum of programmatic checks (tool_used, response_contains, etc.) evaluated without LLM judgment.

* feat: add initial benchmark scenario suite (12 scenarios across 5 categories)

  Scenarios cover tool_selection, tool_chaining, error_recovery, efficiency, and memory_operations. All loaded from JSON with deserialization validation test.

* feat: add benchmark runner with BenchChannel and InstrumentedLlm

  BenchChannel is a minimal Channel implementation for benchmarks. InstrumentedLlm wraps any LlmProvider to capture per-call metrics. Runner creates a fresh agent per scenario, evaluates success criteria, and produces RunResult with timing, token, and cost metrics.

* feat: add baseline management, reports, and benchmark entry point

  - baseline.rs: load/save/promote benchmark results
  - report.rs: format comparison reports with regression detection
  - benchmark_runner.rs: integration test with real LLM (feature-gated)
  - Add benchmark feature flag to Cargo.toml

* style: apply cargo fmt to benchmark module

* feat(benchmark): add multi-turn scenario types with setup, judge, ResponseNotContains

  Add BenchScenario, Turn, TurnAssertions, JudgeConfig, ScenarioSetup, WorkspaceSetup, SeedDocument types for multi-turn benchmark scenarios. Add ResponseNotContains criterion variant. Add TurnAssertions::to_criteria() converter for backward compat with existing evaluation engine.

* feat(benchmark): add JSON scenario loader with recursive discovery and tag filter

  Add load_bench_scenarios() for the new BenchScenario format with recursive directory traversal and tag-based filtering. Create 4 initial trajectory scenarios across tool-selection, multi-turn, and efficiency categories.

* feat(benchmark): multi-turn runner with workspace seeding and per-turn metrics

  Add run_bench_scenario() that loops over BenchScenario turns, seeds workspace documents, collects per-turn metrics (tokens, tool calls, wall time), and evaluates per-turn assertions. Add TurnMetrics to metrics.rs and clear_for_next_turn() to BenchChannel.

* feat(benchmark): add LLM-as-judge scoring with prompt formatting and score parsing

  Create judge.rs with format_judge_prompt, parse_judge_score, and judge_turn. Wire into run_bench_scenario for turns with judge config -- scores below min_score fail the turn.

* feat(benchmark): add CLI subcommand (ironclaw benchmark)

  Add BenchmarkCommand with --tags, --scenario, --no-judge, --timeout, --update-baseline flags. Wire into Command enum and main.rs dispatch. Feature-gated behind benchmark flag.

* feat(benchmark): per-scenario JSON output with full trajectory

  Add save_scenario_results() that writes per-scenario JSON files alongside the run summary. Each scenario gets its own file with turn_metrics trajectory. Update CLI to use new output format.

* feat(benchmark): add ToolRegistry::retain_only and wire tool filtering in scenarios

  Add a retain_only() method to ToolRegistry that filters tools down to a given allowlist. Wire this into run_bench_scenario() so that when a scenario specifies a tools list in its setup, only those tools are available during the benchmark run. Includes two tests for the new method: one verifying filtering works and one verifying empty input is a no-op.

* feat(benchmark): wire identity overrides into workspace before agent start

  Add seed_identity() helper that writes identity files (IDENTITY.md, USER.md, etc.) into the workspace before the agent starts, so that workspace.system_prompt() picks them up. Wire it into run_bench_scenario() after workspace seeding. Include a test that verifies identity files are written and readable.

* feat(benchmark): add --parallel and --max-cost CLI flags

* fix(benchmark): use feature-conditional snapshot names for CLI help tests

  Prevents snapshot conflicts between default (no benchmark) and all-features (with benchmark) builds by using separate snapshot names per feature set.

* feat(benchmark): parallel execution with JoinSet and budget cap enforcement

  Replace sequential loop in run_all_bench() with parallel execution using JoinSet + semaphore when config.parallel > 1. Add budget cap enforcement that skips remaining scenarios when max_total_cost_usd is exceeded. Track skipped count in RunResult.skipped_scenarios and display it in format_report().

* feat(benchmark): add tool restriction and identity override test scenarios

* chore: fix formatting for Phase 3

* feat(benchmark): add SkillRegistry::retain_only and wire skill filtering in scenarios

* feat(benchmark): add --json flag for machine-readable output

* ci: add GitHub Actions benchmark workflow (manual trigger)

* refactor(benchmark): remove in-tree benchmark harness, keep retain_only utilities

  Move benchmark-specific code out of ironclaw in preparation for the nearai/benchmarks trajectory adapter.

  This removes:
  - src/benchmark/ (runner, scenarios, metrics, judge, report, etc.)
  - src/cli/benchmark.rs and the Benchmark CLI subcommand
  - benchmarks/ data directory (scenarios + trajectories)
  - .github/workflows/benchmark.yml
  - The "benchmark" Cargo feature flag

  What remains:
  - ToolRegistry::retain_only() and SkillRegistry::retain_only()
  - Test support types (TraceMetrics, InstrumentedLlm) inlined into tests/support/ instead of re-exporting from the deleted module

* docs: add README for LLM trace fixture format

  Documents the trajectory JSON format, response types, request hints, directory structure, and how to write new traces.

* feat(test): unify trace format around turns, add multi-turn support

  Introduce TraceTurn type that groups user_input with LLM response steps, making traces self-contained conversation trajectories. Add run_trace() to TestRig for automatic multi-turn replay. Backward-compatible: flat "steps" JSON is deserialized as a single turn transparently. Includes all trace fixtures (spot, coverage, advanced), plan docs, and new e2e tests for steering, error recovery, long chains, memory, and prompt injection resilience.

* fix(test): fix CI failures after merging main

  - Fix tool_json fixture: use "data" parameter (not "input") to match JsonTool schema
  - Fix status_events test: remove assertion for "time" tool that isn't in the fixture (only "echo" calls are used)
  - Allow dead_code in test support metrics/instrumented_llm modules (utilities for future benchmark tests)

  [skip-regression-check]

* Working on recording traces and testing them

* feat(test): add declarative expects to trace fixtures, split infra tests

  Add TraceExpects struct with 9 optional assertion fields (response_contains, tools_used, all_tools_succeeded, etc.) that can be declared in fixture JSON instead of hand-written Rust. Add verify_expects() and run_recorded_trace() so recorded trace tests become one-liners. Split trace infra tests (deserialization, backward compat) into tests/trace_format.rs which doesn't require the libsql feature gate.

* refactor(test): add expects to all trace fixtures, simplify e2e tests

  Add declarative expects blocks to all 19 trace fixture JSONs across spot/, coverage/, advanced/, and root directories. Update all 8 e2e test files to use verify_trace_expects() / run_and_verify_trace(), replacing ~270 lines of hand-written assertions with fixture-driven verification. Tests that check things beyond expects (file content on disk, metrics, event ordering) keep those extra assertions alongside the declarative ones.

* fix(test): adapt tests to AppBuilder refactor, fix formatting

  Update test files to work with refactored TestRigBuilder that uses AppBuilder::build_all() (removing with_tools/with_workspace methods). Update telegram_check fixture to use tool_list instead of echo. Fix cargo fmt issues in src/llm/mod.rs and src/llm/recording.rs.

* refactor(test): deduplicate support unit tests into single binary

  Support modules (assertions, cleanup, test_channel, test_rig, trace_llm) had #[cfg(test)] mod tests blocks that were compiled and run 12 times, once per e2e test binary that declares `mod support;`. Extracted all 29 support unit tests into a dedicated `tests/support_unit_tests.rs` so they run exactly once.

  [skip-regression-check]

* style: fix trailing newlines in support files

* refactor(test): unify trace types and fix recorded multi-turn replay

  Import shared types (TraceStep, TraceResponse, TraceToolCall, RequestHint, ExpectedToolResult, MemorySnapshotEntry, HttpExchange*) from ironclaw::llm::recording instead of redefining them in trace_llm.rs. Fix the flat-steps deserializer to split at UserInput boundaries into multiple turns, instead of filtering them out and wrapping everything into a single turn. This enables recorded multi-turn traces to be replayed as proper multi-turn conversations via run_trace().

  [skip-regression-check]

* fix(test): fix CI failures - unused imports and missing struct fields

  - Add #[allow(unused_imports)] on pub use re-exports in trace_llm.rs (types are re-exported for downstream test files, not used locally)
  - Add `..` to ToolCompleted pattern in test_channel.rs to match new `error` and `parameters` fields

* fix(test): fix CI failures after merging main

  - Add missing `error` and `parameters` fields to ToolCompleted constructors in support_unit_tests.rs
  - Add `..` to ToolCompleted pattern match in support_unit_tests.rs
  - Add #[allow(dead_code)] to CleanupGuard, LlmTrace impl, and TraceLlm impl (only used behind #[cfg(feature = "libsql")])

* Adding coverage running script

* fix(test): address review feedback on E2E test infrastructure

  - Increase wait_for_responses polling to exponential backoff (50ms-500ms) and raise default timeout from 15s to 30s to reduce CI flakiness (#1)
  - Strengthen prompt_injection_resilience test with positive safety layer assertion via has_safety_warnings(), enable injection_check (#2)
  - Add assert_tool_order() helper and tools_order field in TraceExpects for verifying tool execution ordering in multi-step traces (#3)
  - Document TraceLlm sequential-call assumption for concurrency (#6)
  - Clean up CleanupGuard with PathKind enum instead of shotgun remove_file + remove_dir_all on every path (#8)
  - Fix coverage.sh: default to --lib only, fix multi-filter syntax, add COV_ALL_TARGETS option
  - Add coverage/ to .gitignore
  - Remove planning docs from PR

  [skip-regression-check]

* fix: address PR review - use HashSet in retain_only, improve skill test

  - Use HashSet for O(N+M) lookup in SkillRegistry::retain_only and ToolRegistry::retain_only instead of linear scan
  - Strengthen test_retain_only_empty_is_noop in SkillRegistry to pre-populate with a skill before asserting the no-op behavior

  [skip-regression-check]

* fix(test): revert incorrect safety layer assertion in injection test

  The safety layer sanitizes tool output, not user input. The injection test sends a malicious user message with no tools called, so the safety layer never fires. Reverted to the original test which correctly validates the LLM refuses via trace expects. Also fixed case-sensitive request hint ("ignore" -> "Ignore") to suppress noisy warning.

  [skip-regression-check]

* fix: clean stale profdata before coverage run

  Adds `cargo llvm-cov clean` before each run to prevent "mismatched data" warnings from stale instrumentation profiles.

  [skip-regression-check]

* style: fix formatting in retain_only test

  [skip-regression-check]

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Polosukhin <ilblackdragon@gmail.com>
102 lines
2.7 KiB
Bash
Executable File
#!/usr/bin/env bash
# Generate a coverage report (HTML by default) for a given set of tests.
#
# Usage:
#   ./scripts/coverage.sh                        # all tests (lib only)
#   ./scripts/coverage.sh safety                 # tests matching "safety"
#   ./scripts/coverage.sh safety::sanitizer      # specific module tests
#   ./scripts/coverage.sh test_a test_b test_c   # multiple test filters
#
# Options (env vars):
#   COV_OPEN=1          Auto-open the report in a browser (default: 1)
#   COV_FORMAT=html     Output format: html, text, json, lcov (default: html)
#   COV_OUT=coverage    Output directory (default: coverage/)
#   COV_FEATURES=""     Features to pass via --features (default: empty, which selects --all-features)
#   COV_ALL_TARGETS=0   Set to 1 to include integration tests (default: lib only)
#
# Requires: cargo-llvm-cov (install: cargo install cargo-llvm-cov)

set -euo pipefail

COV_OPEN="${COV_OPEN:-1}"
COV_FORMAT="${COV_FORMAT:-html}"
COV_OUT="${COV_OUT:-coverage}"
COV_FEATURES="${COV_FEATURES:-}"
COV_ALL_TARGETS="${COV_ALL_TARGETS:-0}"

cd "$(git rev-parse --show-toplevel)"

if ! command -v cargo-llvm-cov &>/dev/null; then
    # Report the missing dependency on stderr so it is not mistaken for report output.
    echo "ERROR: cargo-llvm-cov not found. Install with: cargo install cargo-llvm-cov" >&2
    exit 1
fi

# Clean stale profiling data to avoid "mismatched data" warnings.
cargo llvm-cov clean --workspace 2>/dev/null || true

# Build the cargo llvm-cov command
cmd=(cargo llvm-cov)

# Features
if [[ -n "$COV_FEATURES" ]]; then
    cmd+=(--features "$COV_FEATURES")
else
    cmd+=(--all-features)
fi

# By default, only run the lib unit tests (fast, no integration test compilation).
# Set COV_ALL_TARGETS=1 to include integration tests.
if [[ "$COV_ALL_TARGETS" != "1" ]]; then
    cmd+=(--lib)
fi

# Output format
case "$COV_FORMAT" in
    html)
        cmd+=(--html --output-dir "$COV_OUT")
        ;;
    text)
        cmd+=(--text)
        ;;
    json)
        cmd+=(--json --output-path "$COV_OUT/coverage.json")
        ;;
    lcov)
        cmd+=(--lcov --output-path "$COV_OUT/lcov.info")
        ;;
    *)
        echo "ERROR: Unknown format '$COV_FORMAT'. Use: html, text, json, lcov" >&2
        exit 1
        ;;
esac

# Test name filters (passed after -- to the test harness).
# The harness matches each filter as a substring, not a regex, and runs a test
# if its name matches any filter, so every filter is passed as its own argument.
if [[ $# -gt 0 ]]; then
    cmd+=(-- "$@")
fi

echo "Running: ${cmd[*]}"
echo ""

"${cmd[@]}"

# Open report
if [[ "$COV_FORMAT" == "html" && "$COV_OPEN" == "1" ]]; then
    index="$COV_OUT/html/index.html"
    if [[ -f "$index" ]]; then
        echo ""
        echo "Report: $index"
        if command -v open &>/dev/null; then
            open "$index"
        elif command -v xdg-open &>/dev/null; then
            xdg-open "$index"
        fi
    fi
fi
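The sketch below illustrates how test-name filters can be forwarded to the test harness. It is not part of coverage.sh: `build_cov_cmd` is a hypothetical helper written only for this example, and it prints the command line instead of running it. It assumes the standard Rust test harness, which treats each positional filter as a substring match and runs a test when any filter matches.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in coverage.sh): print the command line that would
# be run for the given test-name filters, without executing anything.
build_cov_cmd() {
    local cmd=(cargo llvm-cov --lib)
    if [[ $# -gt 0 ]]; then
        # Keep each filter as a separate argument after `--`.
        cmd+=(-- "$@")
    fi
    # ${cmd[*]} joins the array elements with spaces for display.
    printf '%s\n' "${cmd[*]}"
}

build_cov_cmd                  # no filters
build_cov_cmd safety           # one filter
build_cov_cmd test_a test_b    # several filters
```

Building the command as an array and expanding it with `"${cmd[@]}"` when it is finally run keeps arguments that contain spaces intact, which is the main reason to prefer an array over a single concatenated string here.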