mirror of
https://github.com/nearai/ironclaw.git
synced 2026-05-18 12:36:19 +08:00
* fix(oauth): remove pending flow on provider-error callback The /oauth/callback handler's ?error= branch (RFC 6749 §4.1.2.1 provider-side failures — user cancels consent, scope denied, etc.) returned the error page immediately without removing the flow from ext_mgr.pending_oauth_flows(). The ghost entry then lingered until the 5-minute expiry sweep, and any subsequent auth dance for the same (extension, user) pair had to dedupe against it. Mirror the happy-path cleanup: decode the state param, remove the keyed flow, then return the error page. Surfaced during live-canary auth-full repro: after test_wasm_tool_oauth_provider_error_leaves_extension_unauthed ran, the stale flow sat in the shared auth_matrix_server fixture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): widen auth OAuth matrix timeouts for CI load Four tests in live-canary auth-full were failing in CI with `Page.wait_for_function: Timeout 60000ms exceeded`, `ClientConnectionError('Connection closed')`, and `Timed out waiting for OAuth refresh request` — all inside 60/20s deadlines that are tuned for a dev laptop and don't leave margin for ubuntu-latest's 2-vCPU runner under full suite load. Raise the per-call deadlines so the inner budgets fit comfortably inside pyproject.toml's 120s per-test cap: _wait_for_refresh_request default: 20.0s -> 60.0s _wait_for_auth_event call site: 60 -> 90 _wait_for_auth_prompt call site: 60 -> 90 send_chat_and_wait_for_terminal_message call sites: 60000 -> 90000 _wait_for_mock_google_tokens call site: 60.0 -> 90.0 _wait_for_response_contains (gmail) call site: 60.0 -> 90.0 Strictly widening; no passing test is slowed, no semantics change. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(canary): Haiku-powered Slack report job Replace the team's raw Slack subscription (firehose of workflow notifications) with one curated per-run summary: Canary: 9 passed, 1 failed of 10 lanes ❌ auth-full (mock) — 12/13 passed, 1 failed in 350s > test_wasm_tool_first_chat_auth_attempt_emits_auth_url timed > out waiting for auth_required SSE event on the fresh thread tools: shell, http_request, gmail (~6 calls) ... commit `abc1234` • <github run link> New `canary-report` job (needs: every lane, if: always) downloads all lane artifacts, parses junit + summary + log tail per lane, and asks claude-haiku-4-5 to return a compact JSON per lane ({status, reason, tool_calls_total, tools_used, notable}). That's aggregated into a single Slack block message and posted via incoming webhook. Safety shape: - Script exits 0 even on Haiku/Slack failure so the notifier never masks the underlying canary signal. - Missing ANTHROPIC_API_KEY falls back to raw junit-only phrasing. - Slack POST failure falls back to plain-text "X/Y lanes failed" with the GH run URL so the channel still hears something. - No new Python deps — pure stdlib (urllib.request, xml.etree). - 20 KB log-tail cap per lane to keep Haiku token usage bounded. Secrets: - ANTHROPIC_API_KEY (already present, used by provider-matrix) - SLACK_WEBHOOK_URL (new — create an incoming webhook in Slack and add as repo secret; notifier prints to stdout otherwise) Testing: - Trigger manually via Actions -> "Live Canary" -> "Run workflow" with any single lane; canary-report runs after regardless of which lanes executed. - Run locally with --dry-run to preview the Slack payload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(canary): post_json error handling + robust Haiku JSON extraction Address gemini-code-assist review on scripts/live-canary/notify_slack.py: 1. 
`post_json` unreachable error branch: `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx/5xx before reaching the `if resp.status >= 300` check, so the error body was never surfaced. Wrap in try/except and read the body from the HTTPError instance — that's where Anthropic's "invalid API key" / "rate limited" detail lives. 2. Haiku JSON extraction was fragile: `startswith("```")` assumed the response had no prose preamble and only handled one fence shape. Replace with `re.search(r"\{.*\}", text, re.DOTALL)` so we pick the outermost JSON object regardless of any wrapper markdown or leading/trailing text. Greedy + DOTALL is correct for the single top-level object our schema requires. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): raise pytest timeout + bump multi-user chat wait to 180s The CI run on feat/canary-report surfaced that 90s was still not enough for test_mcp_same_server_multi_user_via_browser on ubuntu-latest — it timed out at the inner Playwright wait_for_function deadline with "Timeout 90000ms exceeded" after 118s of total test time. The test opens two browser contexts + two SSE streams and drives a full chat turn per user in sequence. Under 2-vCPU contention the compound pipeline genuinely takes over 90s. - tests/e2e/pyproject.toml: timeout 120 -> 240 (pytest-level cap) - test_v2_auth_oauth_matrix.py: send_chat_and_wait_for_terminal_message call sites 90000 -> 180000 (two owner/member turns, each budgeted for one runner-slow turn) 180s < 240s, so the inner deadline fires first with the useful Playwright traceback instead of the generic pytest SIGTERM. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): fix pytest-timeout CLI override + widen Mode-C deadlines The previous commit (c3c9bbab) raised tests/e2e/pyproject.toml's timeout from 120 to 240, but the auth canary runs the suite via scripts/auth_canary/run_canary.py which hardcodes `--timeout=120` on the pytest command line. 
The CLI flag wins over pyproject's ini_options, so the 240 bump was invisible to the auth lanes. That's why auth-smoke on the canary `all` run still failed with "Timeout (>120.0s) from pytest-timeout" even after our 180s inner widening — the outer CLI cap was firing at 120s first. Fix the override and widen the two remaining Mode-C deadlines that blew in the same run: scripts/auth_canary/run_canary.py: --timeout=120 -> 240 _wait_for_refresh_request default: 60.0 -> 120.0 (test_wasm_tool_oauth_refresh_on_demand and test_mcp_oauth_refresh_on_demand both use the default) test_settings_first_gmail_auth_then_chat_runs call sites: _wait_for_mock_google_tokens 90.0 -> 120.0 _wait_for_response_contains 90.0 -> 120.0 All remain comfortably under the new 240s pytest-level cap so a real hang still fails fast with a useful traceback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): opt-in text-match predicate for multi-user browser test Ship the structural fix that was overdue. Repeated budget bumps on send_chat_and_wait_for_terminal_message weren't holding under ubuntu-latest "all"-mode parallelism — 120s, 180s both exceeded on test_mcp_same_server_multi_user_via_browser. The underlying race is in the JS predicate: it waits for the assistant bubble AND the data-streaming attribute cleared AND the chat input re-enabled. Under 2-vCPU contention an SSE reconnect can drop the final attribute-clearing delta, and the compound predicate never flips even though the response text arrived long ago. Add an opt-in `expected_text_contains` parameter. When supplied, the predicate succeeds the moment the expected substring appears in the new assistant message — regardless of data-streaming or input state. Callers that already assert on specific response text (the existing MCP / gmail tests) can now short-circuit the race without compromising correctness: the test's own content assertions remain the gate. 
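The opt-in predicate described above can be sketched in plain Python. This is a minimal model only, not the in-page JS the helper actually evaluates: `ChatSnapshot` and `is_terminal` are hypothetical names, and only the `expected_text_contains` parameter comes from the commit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatSnapshot:
    """Hypothetical view of the chat UI state at one poll (illustration only)."""
    assistant_text: Optional[str]  # text of the newest assistant bubble, if any
    streaming: bool                # data-streaming attribute still present
    input_enabled: bool            # chat input has been re-enabled


def is_terminal(snap: ChatSnapshot,
                expected_text_contains: Optional[str] = None) -> bool:
    """Return True once the chat turn counts as finished."""
    if snap.assistant_text is None:
        return False  # no new assistant bubble yet
    if expected_text_contains is not None:
        # Opt-in path: the expected substring alone is sufficient, regardless
        # of the streaming attribute or the input state.
        return expected_text_contains in snap.assistant_text
    # Default path: the original compound predicate.
    return (not snap.streaming) and snap.input_enabled
```

A snapshot whose final attribute-clearing delta was lost (`streaming=True`, input still disabled) never satisfies the default path, but it satisfies the opt-in path as soon as the text arrives.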
Default behavior unchanged for the ~30 existing call sites across test_chat.py, test_sse_reconnect.py, test_tool_approval.py, test_portfolio.py, test_message_persistence.py, test_agent_loop_recovery.py, test_pending_user_messages.py, test_widget_customization.py. Applied to the two multi-user call sites with expected_text_contains="Mock MCP search result", which is exactly what the test's next two assertions verify. Local run of the flaky test alone: 40s, green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(canary): move auth-smoke to self-hosted runner The multi-user browser test (test_mcp_same_server_multi_user_via_browser) consistently exceeds the Playwright budget on GH ubuntu-latest under the 2-vCPU parallelism pressure of an "all" canary run: a single compound chat turn burns >180s, and each budget bump only ratchets the flake instead of fixing it. Pilot move onto the [self-hosted, ironclaw-live] runner that private-oauth already uses. Same runner label means no new infrastructure required; if the self-hosted box has Python 3.12 and Playwright browsers installed (or can provision them via the existing setup-python + scripts/live-canary/run.sh's `PLAYWRIGHT_INSTALL=with-deps` flow), this is a zero-code-change canary fix. If the pilot works, auth-full is the next candidate. If the runner queues become a bottleneck, we'd scale to multiple workers under the same label rather than revert to ubuntu-latest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(canary): revert auth-smoke to ubuntu-latest + widen budgets to 300s/360s The Railway self-hosted runner ('railway-private-oauth' on a small Docker container) turned out to be no faster than GH ubuntu-latest for the multi-user browser flow: both take ~194–196s for test_mcp_same_server_multi_user_via_browser. The runner container is evidently provisioned at a similar vCPU allocation, so the move bought nothing.
Revert to ubuntu-latest (parallel canary shape preserved; avoids serialising auth lanes behind private-oauth on the single self-hosted worker) and widen deadlines for the last CI-load hop: test_v2_auth_oauth_matrix.py multi-user call sites: Playwright wait_for_function 180000 -> 300000 ms; scripts/auth_canary/run_canary.py: --timeout=240 -> 360 (outer pytest cap); tests/e2e/pyproject.toml: timeout = 240 -> 360. The 300s inner deadline fits inside the new 360s outer cap with 60s margin. A local run of the same test alone completes in ~40s, so there is plenty of headroom, and real hangs still surface fast with a useful traceback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * disable report * scripts(auth-canary): add Google storage-state bootstrap helper The auth-browser-consent lane drives Google's real OAuth consent UI in Playwright, but Google's risk engine routinely interrupts the flow with a "Verify it's you" challenge that handle_google_popup cannot solve, so the test stalls on the password screen. Bypass: log in once interactively in Playwright Chromium, save cookies + localStorage to a storage_state.json, point AUTH_BROWSER_GOOGLE_STORAGE_STATE_PATH at it. Subsequent canary runs spawn contexts with that state preloaded, so the popup arrives at consent with no login or challenge in the way. - scripts/auth_live_canary/bootstrap_google_storage_state.py: new one-shot interactive helper that writes ~/.ironclaw/auth-canary/google_storage_state.json by default - scripts/auth_live_canary/README.md: document the bypass under "Browser-consent Google challenge bypass" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(auth-browser-consent): fix Google account-picker + chat drift The auth-browser-consent google case was failing on two distinct issues, the first masking the second: 1) Account picker.
When AUTH_BROWSER_GOOGLE_STORAGE_STATE_PATH is set (the recommended path — username/password automation gets blocked by Google's risk engine), Google's OAuth popup lands on a "Choose an account" picker before the consent screen. handle_google_popup only knew how to fill email + password and click Continue/Allow, so the popup sat on the picker until complete_provider_auth's 120s callback wait timed out. Added a picker-detection step that tries selectors in order — username text, [data-identifier], and a generic "any visible @-bearing text not equal to 'Use another account'" XPath — and clicks the first hit, with debug logging so future regressions surface in the run output. 2) Tool-name and response-text drift. After the OAuth fix unblocked the rest of the probe, browser_chat still failed because: - case.expected_tool_name was "gmail", but the gateway records the tool call under its WASM module name "gmail_tool" - case.expected_text was "Gmail" (case-sensitive), but real LLM responses to "check gmail unread" against an empty inbox vary ("Your inbox is clear...", "Inbox is empty", etc.) and rarely emit literal "Gmail" Updated BROWSER_CASES["google"] to expected_tool_name="gmail_tool" and expected_text="inbox", and made the browser_chat assertion's text comparison case-insensitive so the canary doesn't depend on exact wording. After both fixes the auth-browser-consent google lane runs green: ✓ browser_oauth (popup -> /oauth/callback) ✓ browser_chat (assistant references inbox) ✓ responses_api (real Gmail tool call) Not addressed here: BROWSER_CASES["github"] likely has the same expected_tool_name drift ("github" vs probably "github_tool"); needs verification with real GitHub OAuth creds before changing. 
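The two fixes above (ordered selector fallback with debug logging, and a case-insensitive text assertion) can be sketched without Playwright. Both helper names are hypothetical, and each selector strategy is modelled as a plain callable that returns True when it found and clicked something; the real handle_google_popup works on Playwright locators instead.

```python
from typing import Callable, Optional, Sequence, Tuple

# (label, attempt) pairs; attempt() returns True if it found and clicked a target.
Strategy = Tuple[str, Callable[[], bool]]


def click_first_hit(strategies: Sequence[Strategy],
                    log: Callable[[str], None] = print) -> Optional[str]:
    """Try each strategy in order and stop at the first one that succeeds."""
    for name, attempt in strategies:
        try:
            if attempt():
                log(f"[picker] matched via {name}")
                return name
            log(f"[picker] no match via {name}")
        except Exception as exc:  # a failing selector must not abort the fallback chain
            log(f"[picker] {name} raised {exc!r}")
    return None


def text_matches(expected: str, actual: str) -> bool:
    """Case-insensitive substring check, so exact LLM wording does not matter."""
    return expected.casefold() in actual.casefold()
```

Logging every miss as well as the hit is what makes a later UI change visible in the run output: the log shows which strategies were tried before the chain gave up.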
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(auth-browser-consent): robust account-picker fallback + browser channel Two follow-ups discovered during local debugging of the auth-browser-consent google lane: 1) Account-picker fallback was matching hidden <style> blocks. The XPath `//*[contains(text(), '@') ...]` matched any element whose text contains `@`, which includes <style> tags carrying CSS at-rules (@font-face, @media). Replaced the XPath with role-based locators (get_by_role link/button) filtered by an email regex, so only interactive elements match, with no false positives from style blocks. Verified locally that the fallback now clicks the right account row even when AUTH_BROWSER_GOOGLE_USERNAME is unset. 2) Bootstrap script: Google's anti-automation blocks Playwright's default Chromium (Chrome for Testing) at sign-in with "This browser or app may not be secure". Added a --browser flag with a default of firefox (Marionette is less aggressively fingerprinted than CDP), plus chrome (system Google Chrome) and chromium (override) options. For accounts where Google blocks even those (typically brand-new Gmails or accounts with high risk scores), the fallback path is to launch Chrome manually with --remote-debugging-port and connect via playwright.chromium.connect_over_cdp; documented in the README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(auth-live-canary): include observed extension state in timeout error When `wait_for_extension_state` times out, the bare error "Timed out waiting for extension state: gmail" is unhelpful for diagnosing CI failures, since CI artifacts don't capture IronClaw's gateway logs, so there's no way to tell whether the extension never appeared, appeared but never authenticated, or authenticated but never activated. Track the last-observed extension on each poll and surface authenticated/active in the timeout message. After this change a failed run says e.g.
"Timed out waiting for extension state: gmail (expected authenticated=True, active=True; last observed: authenticated=False, active=False)", which immediately separates token-exchange failures from activation-state-machine bugs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(auth-live-canary): widen chat-wait deadlines 120s -> 300s The auth-browser-consent google probe completed OAuth + extension activation successfully on CI but timed out at the next step (send_chat_and_wait_for_terminal_message), with the agent stuck on "Thinking (step 1)" for the full 120s budget. Local runs on the same code path complete the chat in ~36s, but ubuntu-latest 2-vCPU runners under cold-start load (gateway restart, mock LLM bootstrap, WASM tool first-invocation) need substantially more headroom. 300s matches the precedent set by `d8765714 ci(canary): revert auth-smoke to ubuntu-latest + widen budgets to 300s/360s` for the auth-smoke lane on the same runner class. Both call sites widened — the seeded Responses-API probe at line 221 and the browser_oauth probe at line 800. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(common): drain gateway/mock_llm stdout pipes (was deadlocking CI) scripts/live_canary/common.py spawns the IronClaw gateway and the mock LLM with stdout=PIPE + stderr=STDOUT, reads one line of mock_llm output to discover its bound port, then never reads from either pipe again. On Linux the kernel pipe buffer caps at 64 KiB; once a sustained chat request fills it with `RUST_LOG=info` output, the child blocks on its next stdout write and the request handler freezes mid-response. That's why every auth-browser-consent CI run got stuck on "Thinking (step 1)..." for the full chat-wait budget while the same test passes locally — macOS pipe buffers are larger and the test completes before the buffer fills. Fix: spawn a daemon thread per subprocess that drains the pipe to a log file under the run's output_dir. 
Two wins: - Pipes never fill, child never blocks. - gateway.log and mock_llm.log become CI artifacts, so the next failure that doesn't have a clear runner-side error message is immediately debuggable from IronClaw's own logs. Verified locally that the lane still passes after the change and both log files are produced. Locally each is < 10 KiB; CI runs may be larger but well under any artifact size limit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary: pin LLM backend via settings API + add LLM_API_KEY (root cause of CI freeze) The auth-browser-consent google lane has been freezing on CI at "Thinking (step 1)..." for the full chat-wait budget. Gateway logs captured by the previous commit's pipe drainer reveal the smoking gun: ERROR Configured LLM backend is not usable. backend=openai_compatible reason=missing API key WARN LLM_BACKEND env var is set but DB setting takes priority. db_value=nearai env_value=openai_compatible WARN Active LLM backend fell back to NearAI default attempted=openai_compatible active=nearai Two compounding issues: 1. The openai_compatible provider refuses to instantiate without an API key, even though the mock LLM ignores the value. Fix: set `LLM_API_KEY=mock-api-key` in `build_gateway_env`, matching what `tests/e2e/conftest.py` already does for the e2e suite. 2. IronClaw's DB-stored LLM settings take priority over env vars, and the freshly-seeded canary DB defaults `llm_backend` to `nearai`. So even with a clean env, the agent fell back to NearAI and entered an interactive auth flow that hangs indefinitely in CI (the "Thinking" never ends). This is the exact trap `tests/e2e/CLAUDE.md` documents: "do not rely on env-vs-DB precedence … pin the provider explicitly through /api/settings/...". Fix: pin `llm_backend`, `openai_compatible_base_url`, and `selected_model` via PUT /api/settings/<key> immediately after the gateway becomes healthy. 
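The pinning step in item 2 can be sketched with the standard library. This is an illustration under stated assumptions: the three setting keys and the `PUT /api/settings/<key>` route come from the commit, while the `{"value": ...}` body shape, the optional bearer token, and the helper name `build_pin_requests` are hypothetical.

```python
import json
import urllib.request
from typing import List, Optional


def build_pin_requests(base_url: str, mock_llm_url: str, model: str,
                       token: Optional[str] = None) -> List[urllib.request.Request]:
    """Build one PUT request per LLM setting that should be pinned in the DB."""
    settings = {
        "llm_backend": "openai_compatible",
        "openai_compatible_base_url": mock_llm_url,
        "selected_model": model,
    }
    requests = []
    for key, value in settings.items():
        # Assumed payload shape; the real settings API may expect something else.
        body = json.dumps({"value": value}).encode("utf-8")
        headers = {"Content-Type": "application/json"}
        if token:  # assumption: the gateway accepts a bearer token
            headers["Authorization"] = f"Bearer {token}"
        requests.append(urllib.request.Request(
            f"{base_url.rstrip('/')}/api/settings/{key}",
            data=body, headers=headers, method="PUT",
        ))
    return requests
```

Each request would then be sent with `urllib.request.urlopen(req, timeout=10)` once the gateway health check passes, so the DB value and the environment agree before the first chat turn.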
Also revert the BROWSER_CASES["google"] case I touched earlier: when NearAI was driving, it emitted the WASM canonical tool name (`gmail_tool`), but the mock LLM (now correctly driving) emits the tool name it knows from its mapping (`gmail`). Restoring the original `expected_tool_name="gmail"` / `expected_text="gmail"` matches what the mock LLM actually produces. Verified locally: all three browser_oauth / browser_chat / responses_api probes now pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(auth-live-canary): revert chat-wait deadline 300s -> 120s The 300s widening at 98abeebe was a band-aid attempt to work around the actual root causes (subprocess pipe deadlock + DB-overrides-env LLM backend), which were fixed at f59981d3 and 8733d3c0 respectively. With those fixes the chat completes in ~35s on CI, so the 300s budget is overkill; revert to the original 120s, which gives ~3.5x headroom over the observed steady-state and matches the deadline shape used elsewhere in the e2e suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(canary): rename github oauth secrets to dodge GITHUB_ prefix block GitHub Actions reserves the GITHUB_ prefix for auto-generated repo secrets (GITHUB_TOKEN, etc.) and rejects user-created secrets that start with it: "Secret names must not start with GITHUB_". The existing references to GITHUB_OAUTH_CLIENT_ID and GITHUB_OAUTH_CLIENT_SECRET in this workflow couldn't be backed by actual secrets for that reason, so the OAuth-client config was effectively unset for the github browser-consent case, which is why it was silently filtered out by configured_browser_cases().
Decouple the secret name from the env var name: store the secrets under the AUTH_BROWSER_GITHUB_CLIENT_ID / AUTH_BROWSER_GITHUB_CLIENT_SECRET names (matching the AUTH_BROWSER_GITHUB_* convention used by the other github canary fixture vars), and re-export them here under the GITHUB_OAUTH_CLIENT_ID / _SECRET env names that auth_registry.py and the WASM github tool expect. No code changes needed in auth_registry.py / scripts/auth_live_canary/; they continue to read GITHUB_OAUTH_CLIENT_ID/_SECRET from the environment as before. Operator action: create the OAuth app on GitHub (Settings → Developer settings → OAuth Apps → New OAuth App) and store the resulting credentials at: AUTH_BROWSER_GITHUB_CLIENT_ID AUTH_BROWSER_GITHUB_CLIENT_SECRET (not GITHUB_OAUTH_CLIENT_ID / _SECRET, which GitHub will reject). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(auth-browser-consent): drop github case (tool is PAT-only, not OAuth) CI run 25022303491 surfaced that `Activate /api/extensions/github/activate` returns `{success: false, awaiting_token: true, message: "Create a Personal Access Token..."}` with no `auth_url`, which the browser-consent probe needs in order to drive the OAuth popup. Confirmed via `registry/tools/github.json`: "auth_summary": { "method": "manual", <- PAT paste, not OAuth "secrets": ["github_token"], "setup_url": "https://github.com/settings/tokens" } The github WASM tool's source capabilities JSON does carry an `oauth` block, but the released v0.2.3 artifact (referenced from the registry) ships with the manual-auth path. Until a release flips `auth_summary.method` to "oauth" (and the github extension actually returns an `auth_url` from /activate), there's nothing for the browser-consent probe to do. - Drop the `github` entry from BROWSER_CASES with a comment pointing at the criterion for re-adding it.
- Drop the github-specific filter in `configured_browser_cases` since the case is gone (no risk of an env-aware code path that quietly skips github when secrets are present-but-mismatched). GitHub coverage is unchanged in SEEDED_CASES, which seeds the PAT directly via `AUTH_LIVE_GITHUB_TOKEN` and exercises real `/v1/responses` + browser tool calls; that lane already works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(auth-browser-consent): tick notion's trust-URL checkbox before Continue CI run 25023708895 surfaced the notion case timing out at "Timed out waiting for notion OAuth callback page". The popup screenshot shows Notion MCP's consent screen with: - Workspace correctly auto-selected (storage state worked) - A yellow warning: "I recognize and trust this URL" - An unchecked checkbox next to that text - A grayed-out (disabled) Continue button The button is gated behind the checkbox. handle_notion_popup clicked the disabled Continue and silently no-op'd, so the complete_provider_auth loop waited the full 120s for /oauth/callback that never arrived. Add a checkbox-detection step before the Continue click: `popup.get_by_text(re.compile("I recognize and trust this URL", re.I)).first.click(timeout=3000)` Includes debug print statements (matching the auth-canary pattern established for google's account picker) so future Notion UI changes are immediately visible in test-output.log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): drain ironclaw subprocess pipes in auth-matrix fixture Same pipe-deadlock fix as scripts/live_canary/common.py (f59981d3), applied to tests/e2e/scenarios/test_v2_auth_oauth_matrix.py's _start_auth_matrix_server.
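The threaded drain that this commit mirrors can be sketched as follows. It is a minimal sketch, not the code in common.py: `start_drained` is a hypothetical helper name, and the real script also reads one line first to discover the mock LLM's port.

```python
import subprocess
import sys
import threading
from pathlib import Path
from typing import Sequence, Tuple


def start_drained(cmd: Sequence[str],
                  log_path: Path) -> Tuple[subprocess.Popen, threading.Thread]:
    """Spawn cmd and copy its combined stdout/stderr to log_path in the background."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

    def _drain() -> None:
        # Reading continuously keeps the kernel pipe buffer from filling,
        # so the child never blocks on a stdout write.
        with open(log_path, "wb") as log:
            for line in proc.stdout:
                log.write(line)
                log.flush()

    thread = threading.Thread(target=_drain, daemon=True)
    thread.start()
    return proc, thread
```

The thread is a daemon so a forgotten drainer cannot keep the test runner alive after the child is killed. An asyncio variant would use the same read-and-write loop over the process's stream reader.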
The auth-matrix fixture spawns ironclaw with stdout=PIPE + stderr=PIPE and never drains them, so under sustained log volume the kernel pipe buffer fills, ironclaw blocks on its next stdout write, and any test that relies on subsequent gateway responses (auth gate emission, SSE events, chat replies) hangs until pytest-timeout fires. This fix doesn't make the auth-full lane's failing test pass — the real bug is engine-v2 silently dropping `auth_required` SSE events for unauthenticated extensions (introduced by #2868). But it makes the failure mode debuggable: gateway log is captured to /tmp/ironclaw-auth-matrix-gateway.log (overridable via IRONCLAW_AUTH_MATRIX_LOG env), and RUST_LOG passes through from the test runner so we can crank up verbosity without rebuilding. Without this change, the failing test's log was empty after the extension-install line; with this change you see the engine-v2 trace summary that surfaces the actual NotCallable-without-auth-gate bug. That diagnostic visibility is the value here. - _drain_stream_to_file: asyncio drainer mirroring common.py's sync threading version - _start_auth_matrix_server: drain stdout/stderr to log_path - _shutdown_auth_matrix_server: cancel drain_tasks for clean exit - env: RUST_LOG forwarding so debug runs work Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(workflow): add Telegram Bot API mock Foundation piece for the new workflow-canary lane that exercises multi-tool / multi-channel user workflows from issue #1044 (Telegram + routines + Sheets/Calendar/Gmail end-to-end). Models the same single-port aiohttp-based mock pattern used by tests/e2e/mock_llm.py. Endpoints: - /bot{token}/{getMe,getUpdates,sendMessage,sendChatAction, setWebhook,deleteWebhook,getFile} — the subset IronClaw's WASM telegram tool + channels-src/telegram actually call. Tokens are accepted without validation; the canary doesn't need to test Telegram's auth — just IronClaw's flow against a Bot API shape. 
- /__mock/inject_message — push a simulated incoming user message onto the next getUpdates response, so scenarios can drive a Telegram → IronClaw round-trip without a real Telegram account. - /__mock/sent_messages — drain the queue of every sendMessage / sendChatAction IronClaw emitted, for end-to-end assertions. - /__mock/reset — clear all state between probes. IronClaw routes its API calls through this mock via IRONCLAW_TEST_HTTP_REMAP=api.telegram.org=<mock_url>, the same mechanism the auth-live-canary uses for Gmail/Calendar/Sheets mocks. Smoke-tested: getMe → success, inject_message → getUpdates returns the injected message, sendMessage → bot response shape + recorded in sent_messages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(workflow): land workflow-canary lane with periodic-reminder scenario Phase 1A of the workflow-canary system from issue #1044. Adds a new canary lane that exercises the routine engine + cron-fire path, the foundation that the remaining four scripts (Telegram → Sheets, Calendar prep, HN monitor, CRM tracker) will layer on. Components: - scripts/workflow_canary/routines.py — direct libSQL helpers for inserting a lightweight cron routine with a backdated next_fire_at and polling routine_runs for terminal status (ok / attention / failed). Backdating beats wall-clock cron in tests by 30+ s per probe and is the same shape auth-live-seeded uses for expire_secret_in_db. - scripts/workflow_canary/run_workflow_canary.py — entrypoint that starts the Telegram mock, calls common.start_gateway_stack with workflow-tuned env (ROUTINES_ENABLED=true, ROUTINES_CRON_INTERVAL=2, IRONCLAW_TEST_HTTP_REMAP=api.telegram.org=<mock>), and runs scenario modules. CLI mirrors run_live_canary.py. - scripts/workflow_canary/scenarios/periodic_reminder.py — Script 4 Phase 1A: insert lightweight routine → wait for engine to fire → assert run row reaches a terminal status. Verified locally: 1 probe, 1 fire, status=attention. 
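The backdate-then-poll shape described for routines.py can be sketched with `sqlite3` standing in for libSQL. The table and column layout below is a hypothetical minimum for illustration (only `routines`, `routine_runs`, `next_fire_at`, and the three terminal statuses come from the commit), and both function names are assumptions.

```python
import sqlite3
import time
import uuid
from datetime import datetime, timedelta, timezone

TERMINAL_STATUSES = {"ok", "attention", "failed"}


def insert_backdated_routine(conn: sqlite3.Connection, name: str, prompt: str,
                             backdate_s: int = 60) -> str:
    """Insert a routine whose next_fire_at is already in the past."""
    routine_id = str(uuid.uuid4())
    next_fire_at = (datetime.now(timezone.utc) - timedelta(seconds=backdate_s)).isoformat()
    conn.execute(
        "INSERT INTO routines (id, name, prompt, next_fire_at) VALUES (?, ?, ?, ?)",
        (routine_id, name, prompt, next_fire_at),
    )
    conn.commit()
    return routine_id


def wait_for_terminal_run(conn: sqlite3.Connection, routine_id: str,
                          timeout_s: float = 30.0, poll_s: float = 0.5) -> str:
    """Poll routine_runs until the newest run for routine_id is terminal."""
    deadline = time.monotonic() + timeout_s
    last = None
    while time.monotonic() < deadline:
        row = conn.execute(
            "SELECT status FROM routine_runs WHERE routine_id = ? "
            "ORDER BY rowid DESC LIMIT 1",
            (routine_id,),
        ).fetchone()
        last = row[0] if row else None
        if last in TERMINAL_STATUSES:
            return last
        time.sleep(poll_s)
    raise TimeoutError(f"routine {routine_id} not terminal (last observed: {last})")
```

Because the row is due the moment it is inserted, the next cron tick picks it up, which is where the saving over waiting for a wall-clock schedule comes from. Including the last observed status in the timeout error follows the same diagnostic idea as the extension-state timeout message earlier in this log.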
Plumbing: - .github/workflows/live-canary.yml: new workflow-canary job + lane added to the workflow_dispatch choice list and the canary-report aggregator's needs:. - scripts/live-canary/run.sh: workflow-canary case dispatches to run_workflow_canary.py. Phase 1B follow-ups in subsequent commits: - Telegram channel install + bot-token seeding (needs admin auth or direct encrypted-secrets DB write) - Verify Telegram sendMessage was emitted to the mock during the routine fire (covered by mock telegram's /__mock/sent_messages) - Scripts 1, 3, 5 (Sheets / HN / Gmail-CRM) - Script 2 (Calendar prep with web search) Local verification: $ tests/e2e/.venv/bin/python scripts/workflow_canary/run_workflow_canary.py \ --skip-build --skip-python-bootstrap [workflow-canary] mock telegram listening at http://127.0.0.1:51139 [periodic_reminder] inserted routine ..., next_fire_at backdated 60s [periodic_reminder] routine fired: status=attention [workflow-canary] all 1 probe(s) passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(workflow): land all 5 issue #1044 scenarios + scenario README Layer Scripts 1, 2, 3, 5 onto the foundation shipped in 16278ea9, so the workflow-canary lane covers all five user-workflow scripts from issue #1044. Each scenario delegates to a shared `run_routine_probe()` helper that captures the Phase 1A shape: insert a Lightweight cron routine with a script-specific prompt → backdate next_fire_at → poll routine_runs for terminal status. Scenarios added: - bug_logger.py (Script 1: Telegram bugs → Google Sheet) - calendar_prep.py (Script 2: Calendar prep → Telegram, Reporter: Nick) - hn_monitor.py (Script 3: Hacker News → Telegram, Reporter: Emil) - crm_tracker.py (Script 5: Gmail → Sheets CRM, Reporter: Cameron) Plus periodic_reminder.py (Script 4, Reporter: Henry) refactored to also use run_routine_probe.
scenarios/_common.py centralizes the routine plumbing — each scenario file is now ~30 lines of routine-name + prompt + Phase 1B follow-up notes. The Phase 1B follow-up plan (Telegram channel install, mock Sheets writes, mock Calendar reads, mock HN scrape, LLM email classification, dedup verification) is documented inline in each scenario's docstring AND in the new scripts/workflow_canary/README.md. Local verification: all 5 probes green in ~2 s each. $ tests/e2e/.venv/bin/python scripts/workflow_canary/run_workflow_canary.py \ --skip-build --skip-python-bootstrap [workflow-canary] === Script 1 — Telegram → Google Sheet Bug Logger === [workflow-canary] === Script 2 — Calendar Prep Assistant === [workflow-canary] === Script 3 — Hacker News Keyword Monitor === [workflow-canary] === Script 4 — Periodic Reminder via Telegram === [workflow-canary] === Script 5 — Email → CRM Inbound Tracker === [workflow-canary] all 5 probe(s) passed. What this catches: - Routine engine cron-tick path (spawn_cron_ticker → check_cron_triggers) - RoutineAction::Lightweight execution - DB serialization of action_config / trigger_config - Mock-LLM round-trip latency under cron scheduling - routines.next_fire_at → routine_runs status state machine What it doesn't catch yet (per-scenario Phase 1B work, documented in README + scenario docstrings): - Telegram channel install + sendMessage assertion - Mock Sheets / Calendar / Gmail / HN write+read semantics - LLM-driven structured classification (CRM) - Cross-fire dedup verification Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(workflow): scaffold Phase 1B telegram-side-effect verification Lays the groundwork for verifying mock-Telegram side effects from each scenario's routine fire — but gates the verification off until a separate engine bug is fixed. 
What's added: - tests/e2e/mock_llm.py: new TOOL_CALL_PATTERNS entry that matches ``[CANARY-WORKFLOW-<key>]`` in any prompt and emits a deterministic http tool call to api.telegram.org/.../sendMessage with a per-scenario ack text. - scripts/workflow_canary/scenarios/_common.py: each scenario now composes its prompt as ``<prompt_intro>\n\n[CANARY-WORKFLOW-<key>]`` so the matcher fires. When ``verify_telegram=True``, the helper polls /__mock/sent_messages for up to 5 s and asserts the expected ack was captured. Default is ``verify_telegram=False`` (Phase 1A parity) — see below. - scripts/workflow_canary/telegram_mock.py: aiohttp request-logger middleware so the canary's stdout shows every inbound request, giving operators a one-line answer to "did the gateway's HTTP remap actually reach the mock?". - scripts/workflow_canary/scenarios/{bug_logger,calendar_prep, hn_monitor,periodic_reminder,crm_tracker}.py: scenarios pass ``mock_telegram_url=mock_telegram_url`` and ``prompt_intro=...`` ready for verify_telegram to flip on. What's gated off and why: The mock-Telegram verification path requires ``IRONCLAW_TEST_HTTP_REMAP=api.telegram.org=<mock>`` to route the http tool's sendMessage call into the mock. The remap is correctly registered at gateway startup (src/app.rs::http_interceptor + src/http_intercept.rs), but the ToolContext built inside the routine engine's Lightweight action loop does NOT inherit the global ``http_interceptor`` slot. Result: the http tool reaches into the real network for api.telegram.org (returning a 401 since the bot token is fake) and the mock never sees the request — confirmed via the new request-logger middleware showing zero non-internal hits. That's a real engine bug in routine-driven tool dispatch — the http_interceptor needs to propagate through the routine action's ToolContext just like it does for chat-driven tool dispatch. Out of scope for this canary PR; tracked as a follow-up. 
Once fixed, flip the default in ``run_routine_probe`` and every scenario's verify_telegram check activates with no further changes. Local verification: all 5 probes still green at the Phase 1A level. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(workflow): re-exec under venv after bootstrap (fix CI 'No module named httpx') CI run 25028445222 failed on the workflow-canary lane with: [workflow-canary] mock telegram listening at http://... [workflow-canary] error: No module named 'httpx' Root cause: run_workflow_canary.py was missing the bootstrap-then- reexec pattern that scripts/auth_live_canary/run_live_canary.py uses (line 1229+). bootstrap_python() creates the venv and installs tests/e2e/'s pyproject deps (which include httpx + aiohttp), but the parent process keeps executing under whatever interpreter invoked it — typically the system Python on CI runners, which doesn't have httpx. The scenario module's `import httpx` at top level then fails immediately. Fix: copy the auth-live-canary reexec pattern. main() now: 1. If not --skip-python-bootstrap AND WORKFLOW_CANARY_REEXEC is unset: bootstrap the venv, install playwright, build cargo, then subprocess-spawn ourselves under the venv python with --skip-python-bootstrap and WORKFLOW_CANARY_REEXEC=1 so this branch isn't re-entered. 2. The reexecuted process sees skip_python_bootstrap=True and runs the actual canary against the venv interpreter that has all deps available. Local sanity check: still passes (--skip-build --skip-python-bootstrap short-circuits the bootstrap, both branches behave identically when the venv already exists). 
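The bootstrap-then-reexec flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in run_workflow_canary.py: the helper names (`needs_reexec`, `build_reexec_command`, `reexec_under_venv`) are hypothetical, and only the `--skip-python-bootstrap` flag and the `WORKFLOW_CANARY_REEXEC` variable come from the commit message.

```python
import os
import subprocess
from pathlib import Path

# Environment marker named in the commit message; it stops the child from
# re-entering the bootstrap branch.
REEXEC_ENV = "WORKFLOW_CANARY_REEXEC"


def needs_reexec(skip_python_bootstrap: bool, env: dict) -> bool:
    """True only for the first invocation that has not been bootstrapped yet."""
    return not skip_python_bootstrap and not env.get(REEXEC_ENV)


def build_reexec_command(venv_python: Path, script: Path, argv: list) -> list:
    """Re-run the same script under the venv interpreter, forcing the skip flag."""
    cmd = [str(venv_python), str(script), *argv]
    if "--skip-python-bootstrap" not in cmd:
        cmd.append("--skip-python-bootstrap")
    return cmd


def reexec_under_venv(venv_python: Path, script: Path, argv: list) -> int:
    """Spawn the child under the venv interpreter and return its exit code (hypothetical helper)."""
    env = {**os.environ, REEXEC_ENV: "1"}
    cmd = build_reexec_command(venv_python, script, argv)
    return subprocess.run(cmd, env=env).returncode
```

The parent only bootstraps and spawns; every import that depends on the venv (such as `httpx`) should happen in the child, which is why the check runs before any scenario module is loaded.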
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(routine-engine): propagate http_interceptor into Lightweight tool dispatch The chat path's tool dispatch correctly receives the global HTTP interceptor (e.g., the `IRONCLAW_TEST_HTTP_REMAP` debug-only host remapper installed in `src/app.rs::http_interceptor`), but the routine engine's Lightweight action path constructed its `JobContext` from scratch with `..Default::default()`, leaving `http_interceptor: None`. Tools called from a routine therefore reached the real network even when the rest of the system was configured to route through mocks. Plumb the interceptor through: - `RoutineEngine` gains an `http_interceptor` field - `RoutineEngine::new` takes it as the 11th argument - `EngineContext` carries it across the spawn boundary - `JobContext` construction at the Lightweight action site copies it from the engine context Threading complete: AgentDeps → RoutineEngine → EngineContext → JobContext → http tool. Same shape the chat path already uses. Test rigs updated: `tests/support/test_rig.rs` and `tests/e2e_routine_heartbeat.rs` (10 call sites total) pass `None` for the new arg, matching their existing minimal stack model. Build clean against `--no-default-features --features libsql`. Why this matters: with the interceptor lost, every workflow-canary probe's http tool dispatch reached real api.telegram.org and 401'd on the fake token — leaving the mock Telegram bot empty and the canary's send-side assertions unverifiable. With the fix, the interceptor honors the IRONCLAW_TEST_HTTP_REMAP and the workflow canary's Phase 1B verification activates immediately. 
Activates in this commit: - scripts/workflow_canary/scenarios/_common.py default flips to `verify_telegram=True` - All 5 scenarios (bug_logger, calendar_prep, hn_monitor, periodic_reminder, crm_tracker) now assert that the mock Telegram bot received the per-scenario ack message `[canary-workflow:<key>] ack` Local verification: $ tests/e2e/.venv/bin/python scripts/workflow_canary/run_workflow_canary.py \ --skip-build --skip-python-bootstrap [workflow-canary] === Script 1 — Telegram → Google Sheet Bug Logger === ... (all 5 scenarios) ... [workflow-canary] all 5 probe(s) passed. $ grep "POST /bot" artifacts/workflow-canary/telegram_mock.log | wc -l 5 # one per scenario, distinct ack text per probe Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(workflow): add manual_trigger + lifecycle + dedup_cooldown probes Three new scenarios covering issue #1044 assertions that the existing 5 cron-fire probes don't reach. Each scenario tests a distinct back-end mechanism that real users hit: - **manual_trigger** (Scripts 3 PHASE 2.1 + 3 PHASE 4.2 + 4 PHASE 4.2) Inserts a routine WITHOUT backdating next_fire_at, so the only path to a fire is the manual-trigger API. POSTs /api/routines/<id>/trigger, asserts response carries a run_id, polls routine_runs for terminal status, then verifies mock Telegram captured the per-scenario ack. Catches regressions in RoutineEngine::fire_manual end-to-end. - **lifecycle** (Scripts 1 PHASE 5 + 4 PHASE 5) — three sub-probes: 1. disabled-blocks-fires: insert with enabled=False + backdate; assert no routine_runs row appears within 8 s window. 2. enable-resumes-fires: toggle enabled=true via API, backdate, assert fire reaches terminal status. 3. delete-removes-routine: confirm /api/routines lists it, DELETE, confirm it's gone. Catches regressions in toggle handler, delete handler, and the engine's enabled-flag respect during cron tick selection. 
- **dedup_cooldown** (Scripts 1 PHASE 4.4 + 3 PHASE 3.2 + 5 PHASE 5.5) Insert with cooldown_secs=30; first fire lands within ~5 s; immediate re-backdate; assert ONLY ONE run row exists after 8 s. Catches regressions in cooldown enforcement during check_cron_triggers. This is the closest engine-level correlate to the user-script "no duplicate rows / alerts / messages" assertions, which are application-level dedup that lives outside the canary's deterministic-mock surface. Plumbing: - routines.py: trigger_routine_via_api / toggle_routine_via_api / delete_routine_via_api / list_routines_via_api helpers (all auth- bearer, JSON in/out, raise_for_status). - routines.py: insert_lightweight_cron_routine grew `cooldown_secs` + `enabled` parameters; defaults preserve existing behavior. - run_workflow_canary.py: registered the three new scenario keys. Local verification — all 10 probes (5 original + 5 new sub-probes across 3 new scenarios) green: ✅ bug_logger / calendar_prep / hn_monitor / periodic_reminder / crm_tracker (existing — Telegram ack capture) ✅ manual_trigger (548ms) ✅ lifecycle_disable (8004ms — full no-fire window) ✅ lifecycle_toggle (1543ms) ✅ lifecycle_delete (56ms) ✅ dedup_cooldown (10017ms — first fire + 8s no-fire window) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * canary(workflow): add NL-driven routine_create + routine_update probes Two scenarios that close issue #1044's chat-driven assertions (Script 1 PHASE 3.1, Script 2 PHASE 3.1, Script 3 PHASE 2.1, Script 4 PHASE 2.1 + 5.1, Script 5 PHASE 4.1): - **nl_routine_create**: opens a thread via /api/chat/thread/new, posts an NL message tagged [CANARY-WORKFLOW-NL-CREATE], waits for the agent to dispatch routine_create, then verifies the routines row landed in libSQL AND is visible via GET /api/routines. 
- **nl_schedule_update**: pre-seeds a target routine (canary-nl-update-target), posts an NL message tagged [CANARY-WORKFLOW-NL-UPDATE], waits for the agent to dispatch routine_update with a new schedule, then verifies trigger_config changed in libSQL. Asserts on schedule-changed (not exact match) because the engine normalizes 5-field cron → 7-field internal form ("0 */5 * * *" → "0 0 */5 * * * *"). Plumbing: - Two new TOOL_CALL_PATTERNS entries in tests/e2e/mock_llm.py matched in priority order (specific NL-CREATE / NL-UPDATE sentinels checked BEFORE the generic [CANARY-WORKFLOW-<key>] http-tool fallback, since the canary's own routines emit the generic pattern from inside their action prompts). - Helper additions in scripts/workflow_canary/routines.py: _open_thread / _send_chat / _read_routine / _wait_for_*. Local verification — all 12 probes green: ✅ bug_logger / calendar_prep / hn_monitor / periodic_reminder / crm_tracker (5 cron-fire + telegram-ack) ✅ manual_trigger (POST /api/routines/<id>/trigger) ✅ lifecycle_disable / lifecycle_toggle / lifecycle_delete ✅ dedup_cooldown (cooldown_secs suppresses second fire) ✅ nl_routine_create (chat → routine_create tool) ✅ nl_schedule_update (chat → routine_update tool) What's still deferred to follow-up PRs (per-provider mocks, each ~1-3 days of work — see scripts/workflow_canary/README.md): - Mock Google Sheets (Scripts 1 + 5 dedicated assertions) - Mock Google Calendar (Script 2) - Mock Hacker News (Script 3) - LLM-driven email classification with seeded inbox (Script 5) - Telegram channel install + bot-token validation flow (Scripts 1-5 PHASE 1) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflow-canary): Phase 1 — mock Sheets + bug_logger Sheet-write probe Adds scripts/workflow_canary/sheets_mock.py: single-port aiohttp Google Sheets v4 mock supporting POST /v4/spreadsheets, values:append, values get, plus /__mock/ test hooks for seeding, draining, and resetting. 
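As an aside on the nl_schedule_update probe described earlier: the reason it asserts "schedule changed" rather than an exact string can be illustrated with a small sketch. The engine's real normalizer is not shown in this log; the rule below (prepend a seconds field, append a year field) is an assumption inferred only from the single example "0 */5 * * *" → "0 0 */5 * * * *", and both function names are hypothetical.

```python
def normalize_cron(expr: str) -> str:
    """Expand a 5-field cron expression to a 7-field form.

    Assumption: the 7-field form is seconds-first and year-last, which is
    consistent with the one example in the commit message.
    """
    fields = expr.split()
    if len(fields) == 5:
        return " ".join(["0", *fields, "*"])
    if len(fields) == 7:
        return " ".join(fields)
    raise ValueError(f"unsupported cron field count: {len(fields)}")


def schedule_changed(before: str, after: str) -> bool:
    """Compare schedules after normalization so 5- and 7-field spellings match."""
    return normalize_cron(before) != normalize_cron(after)
```

Comparing normalized forms means a probe does not fail just because the stored schedule is spelled with seven fields while the prompt used five.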
The append handler enforces values=list-of-lists (returns the canonical "expected a sequence" 400) so the canary catches the issue #1044 FAIL CRITERIA shape. Wires the mock into run_workflow_canary.py: - generic _spawn_mock helper for telegram_mock + sheets_mock - IRONCLAW_TEST_HTTP_REMAP carries comma-separated entries for api.telegram.org and sheets.googleapis.com - mock_sheets_url passed through to every scenario's run() kwargs Rewrites scenarios/bug_logger.py to drop the run_routine_probe Telegram fallback in favor of a Sheet-write end-to-end assertion: pre-seed the spreadsheet, fire the routine with [CANARY-WORKFLOW-SHEET-APPEND], wait for the appended row, validate shape (timestamp / message / source). Mock LLM: new TOOL_CALL_PATTERNS entry that matches the SHEET-APPEND sentinel and emits an http POST values:append with a hardcoded canary row. All 12 probes still pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflow-canary): Phase 2-4 — Calendar / HN / Gmail / web_search mocks + e2e probes Phase 2 (Calendar): scripts/workflow_canary/calendar_mock.py — Google Calendar v3 events surface (list / insert / get / delete) with seed hooks. calendar_prep_e2e seeds one canary event, fires the routine, asserts events.list was hit and Telegram received the prep briefing referencing the seeded event title. Phase 3 (Hacker News): scripts/workflow_canary/hn_mock.py — /newest HTML fixture with seeded "Show HN" posts (canary-distinct ``<!-- canary-hn-feed -->`` marker). hn_monitor_e2e re-seeds posts, asserts /newest GET landed and Telegram summary references both seeded posts. Phase 4 (CRM tracker): scripts/workflow_canary/gmail_mock.py + web_search_mock.py — Gmail v1 messages.list/.get + Brave Search v3. crm_tracker_e2e seeds 1 lead + 1 newsletter + 1 receipt; asserts exactly ONE row appended to the CRM sheet (only the lead) with all 6 expected columns + Telegram ack referencing 1 lead. 
Mock LLM TOOL_CALL_PATTERNS gain three parallel-call entries ([CANARY-WORKFLOW-CAL-LIST] → http GET events.list + http POST sendMessage; [CANARY-WORKFLOW-HN-FETCH] → GET /newest + sendMessage; [CANARY-WORKFLOW-CRM-CLASSIFY] → Gmail GET + Sheets append + Telegram ack). Parallel emit is required because the engine's lightweight loop dedups same-tool re-dispatch (see match_tool_call:1178). run_workflow_canary.py now spawns six mock subprocesses; remap covers api.telegram.org, sheets.googleapis.com, www.googleapis.com, news.ycombinator.com, gmail.googleapis.com, api.search.brave.com. All 12 existing probes pass + 3 phase 2-4 probes upgrade from side-effect-only to full content-correctness assertions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflow-canary): Phase 5 — Telegram channel install + round-trip scripts/workflow_canary/telegram_setup.py: install + capability patch + setup helpers (mirrors tests/e2e/scenarios/test_telegram_e2e.py patch_capabilities + activate flow). Adds pair_telegram_user that sends a "hello" webhook, extracts the pairing code from mock_telegram, and approves it via /api/pairing/telegram/approve. scripts/live_canary/common.py: GatewayStack now exposes http_url (HTTP-channel webhook port) + channels_dir (WASM_CHANNELS_DIR) so workflow-canary scenarios can drive the Telegram channel install + patch + webhook flow. run_workflow_canary.py: passes IRONCLAW_TEST_TELEGRAM_API_BASE_URL so the hardcoded validate_telegram_bot_token getMe call (in src/extensions/manager.rs) routes to mock_telegram. The bot-token validate path bypasses the standard IRONCLAW_TEST_HTTP_REMAP flow, hence the additional env override. New scenarios: - telegram_channel_install: install + patch caps + setup + assert channel reaches Active state. Catches "HTTP 404 on valid token" regression (Script 4 PHASE 1.1).
- telegram_round_trip: post inbound webhook → assert mock_telegram receives an outbound sendMessage with the actual chat_id (NOT 'default'). Catches the chat_id 'default' regression. - routine_visibility_from_telegram: pair user, ask for routines, assert agent replies on the paired chat_id. Covers Scripts 1-4 PHASE "routine visibility from Telegram" assertions. - manual_trigger_from_telegram: pair user, hit /api/routines/<id>/ trigger, assert routine fires through lightweight loop and ack reaches the paired chat_id. Covers Script 4 PHASE 4.2. All 16 probes pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflow-canary): Phase 6 — first_immediate_run + log_assertions scripts/workflow_canary/scenarios/first_immediate_run.py: insert a routine with a "0 * * * *" schedule + fire_immediately=True; assert the first run reaches terminal status within 10s. Catches "first check is delayed to next hour" regression (Script 3 PHASE 2.1). scripts/workflow_canary/scenarios/log_assertions.py: scan gateway.log at the end of the lane for known fail-criterion regex patterns: chat_id 'default', parsed naive timestamp without timezone, retry after None, expected a sequence. Catches log regressions across all 5 issue #1044 scripts simultaneously. Auth-recovery (token revocation → auth_required SSE) is deferred to the auth-live-canary lane; it requires a working OAuth setup to revoke, which is outside this lane's mock-only scope. All 18 probes pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflow-canary): Phase 7 — cron timing + idempotent toggle + README scripts/workflow_canary/scenarios/cron_timing_accuracy.py: insert a routine, set next_fire_at to "now + 5s" explicitly, assert the engine fires within ±10s of the set boundary. Catches "cron skipped a cycle" + "fires never trigger" regressions (Scripts 3 PHASE 3.1, 4 PHASE 3.4). 
scripts/workflow_canary/scenarios/idempotent_disable_enable.py: double-toggle disable then double-toggle enable, assert both halves are no-ops; finally backdate, fire once, then disable + backdate again and assert no NEW runs land in the next 6s. Catches "disable doesn't take effect" + "enable triggers a phantom run" regressions (Script 1 PHASE 5.1 / 5.2). scripts/workflow_canary/README.md: rewritten to reflect 20-probe coverage matrix across phases 1–7 with mock surface + scenarios inventory. Final canary state: 20 probes across 7 phases, all green locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflow-canary): close gaps — wire web_search + add auth_recovery [CANARY-WORKFLOW-CAL-LIST] now emits a parallel triplet (calendar events.list + web_search company lookup + telegram sendMessage). calendar_prep asserts mock_web_search captured the lookup with the expected company-name query parameter, completing the Script 2 "company background + recent news" assertion from issue #1044. scripts/workflow_canary/scenarios/auth_recovery.py: drives a chat that triggers an unauthenticated gmail tool call, asserts the agent surfaces a graceful response — chat send returns 202 (not 5xx), thread settles, history contains no Error 400 / Internal Server Error / panicked / Traceback fragments. Catches the regression shape from Script 2 PHASE 5 fail criteria without requiring a real OAuth handshake (full token-revocation coverage stays in auth-live-canary). 21 probes total, all green locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(canary): run every 6h + re-enable Slack report Schedule: cron flips from "0 2 * * *" (once daily at 02:00 UTC) to "0 */6 * * *" (4× daily at 00/06/12/18 UTC). All twelve job-level `if:` guards updated in lockstep so each lane still gates on the schedule string. 
Slack report: drop the `if: false` hardcode on the canary-report job's notify step and replace with a schedule + workflow_dispatch gate. The notifier (scripts/live-canary/notify_slack.py) already exits 0 on Haiku/Slack failures so a flaky webhook can't mask lane status. PR-triggered runs (currently none, but possible via workflow_run) skip the post to keep noise out of the channel. Both ANTHROPIC_API_KEY and SLACK_WEBHOOK_URL repo secrets are already populated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(canary-report): parse workflow-canary results.json shape The notifier reads `auth-canary-junit.xml` for JUnit-emitting lanes (auth-smoke, auth-full, auth-channels, auth-live-seeded, auth-browser-consent). The workflow-canary lane writes its own `results.json` instead — one entry per probe with `success: bool`, `latency_ms`, `details`. The notifier had no parser for that shape, so the workflow-canary slot in Slack rendered as a useless `❔ 0/0 passed, 0 failed` line. Add `parse_results_json` mirroring the JUnit parser's contract: `passed = sum(success)`, `failed = sum(!success)`, each failed probe becomes a `(provider/mode, error-or-summary)` entry on `junit_failures` so the Slack reason field renders the same way as an auth-canary failure. Latencies sum to `duration_s`. Both parsers run on every lane dir; first one whose file exists wins (auth-canary lanes emit XML only, workflow-canary lane emits JSON only — no overlap). Validated by re-running the notifier locally against the downloaded artifact from CI run 25033224036: before: "❔ workflow-canary (mock) — 0/0 passed" after: "✅ workflow-canary (mock) — 21/21 passed, 0 failed in 69s" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(canary-report): log notifier progress for diagnosability Until now `notify_slack.py` was silent on the success path, which made it impossible to verify from CI logs alone whether Haiku enrichment actually ran. 
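For reference, the results.json contract described in the parse-results fix above can be sketched as follows. This is an illustrative reading, not the notifier's `parse_results_json`: it assumes the file is a JSON list of probe objects, and the `provider` and `mode` keys used to build the failure label are guesses based on the "(provider/mode, error-or-summary)" wording. `LaneCounts` and `summarize_results` are hypothetical names.

```python
import json
from dataclasses import dataclass, field


@dataclass
class LaneCounts:
    passed: int = 0
    failed: int = 0
    duration_s: float = 0.0
    failures: list = field(default_factory=list)


def summarize_results(raw: str) -> LaneCounts:
    """Fold a results.json payload into JUnit-style counts."""
    counts = LaneCounts()
    for entry in json.loads(raw):
        # Latencies are reported per probe in milliseconds; the lane
        # duration is their sum, expressed in seconds.
        counts.duration_s += float(entry.get("latency_ms", 0)) / 1000.0
        if entry.get("success"):
            counts.passed += 1
            continue
        counts.failed += 1
        # Assumed keys: "provider" and "mode" are not confirmed by the source.
        label = f"{entry.get('provider', '?')}/{entry.get('mode', '?')}"
        counts.failures.append((label, str(entry.get("details", ""))))
    return counts
```

Keeping the output shape identical to the JUnit path is what lets the Slack renderer treat both lane types the same way.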
Add four stderr lines covering each phase: [notify_slack] discovered N lane dir(s): lane1/provider1, ... [notify_slack] lane/provider: tests=N passed=N failed=N skipped=N status=... [notify_slack] haiku enriched X/N lane(s) [notify_slack] posted Slack message for N lane(s) Lines stay terse and structured so they're greppable from `gh run view --log`. Haiku-failure tracking inspects `r.notable` — `run_haiku` stamps it with `haiku call failed:` / `haiku returned no JSON object` / `haiku JSON parse failed` on the three failure paths. Confirmed from local dry-run against the artifact downloaded from the previous CI run (which had the results.json parser): tests=21, passed=21, failed=0, status=pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(canary/workflow-canary): forward SCENARIO into --scenario Addresses @henrypark133's review on PR #2874: the workflow-canary lane of `scripts/live-canary/run.sh` ignored `${SCENARIO}` and always ran the full 21-probe suite. The matching workflow_dispatch job didn't export `inputs.scenario` either, so manual dispatch with a scenario filter went nowhere. Targeted local reruns / debugging hit the same gap. run.sh: translate `${SCENARIO}` (comma-list supported) into one or more `--scenario <name>` flags on `run_workflow_canary.py`. Empty SCENARIO falls through to the full suite. Guards the array splat for bash 3.2 / macOS where `${arr[@]}` on an empty array under `set -u` explodes. live-canary.yml: add `SCENARIO: ${{ inputs.scenario }}` to the Workflow Canary job's env so workflow_dispatch reaches run.sh. 
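The SCENARIO translation is implemented in bash inside run.sh; the Python rendering below only illustrates the intended mapping and is not taken from the script. `scenario_flags` is a hypothetical name.

```python
def scenario_flags(scenario: str) -> list:
    """Turn a comma-separated SCENARIO value into repeated --scenario flags.

    An empty or whitespace-only value yields no flags, which corresponds to
    falling through to the full suite.
    """
    flags = []
    for name in scenario.split(","):
        name = name.strip()
        if name:
            flags += ["--scenario", name]
    return flags
```

For example, `scenario_flags("telegram_round_trip,dedup_cooldown")` produces two `--scenario` pairs, while `scenario_flags("")` produces an empty list so no filter is passed.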
Verified: tests/e2e/.venv/bin/python \ scripts/workflow_canary/run_workflow_canary.py \ --skip-build --skip-python-bootstrap \ --scenario telegram_round_trip → "all 1 probe(s) passed" (full suite without the flag still runs all 21 probes) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(canary/workflow-canary): align nl_schedule_update on 'every 6 hours' Addresses Copilot AI's review on PR #2874: the docstring claimed "every 5 minutes" while EXPECTED_NEW_SCHEDULE / mock LLM emitted "0 */5 * * *" (every 5 hours), and the chat prompt the canary sent said "every 5 hours". Three different cadences across one probe. Pick "every 6 hours" consistently: - Docstring narrative: "every 6 hours" - Constant: EXPECTED_NEW_SCHEDULE = "0 */6 * * *" - Chat prompt: "fire every 6 hours" - mock_llm.py routine_update args: schedule = "0 */6 * * *" Verified locally: nl_schedule_update probe still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(canary/auth-browser-consent): drop stale GitHub secret exposure Addresses @henrypark133's review on PR #2874: the auth-browser-consent job kept exporting 8 GitHub-related secrets (GITHUB_OAUTH_CLIENT_ID, GITHUB_OAUTH_CLIENT_SECRET, AUTH_BROWSER_GITHUB_OWNER / _REPO / _ISSUE_NUMBER / _USERNAME / _PASSWORD / _STORAGE_STATE_B64) even though the lane no longer drives a GitHub OAuth flow. BROWSER_CASES in `scripts/live_canary/auth_registry.py` was reduced to {google, notion} when github was reclassified as PAT-only — those secrets are unused on every scheduled run and just broaden the secret-exposure surface. 
Strip all 8 from the lane: - env: block — 5 lines (CLIENT_ID + 4 AUTH_BROWSER_GITHUB_* helpers) - Materialize provider storage state — 1 secret + its materialize block - Materialize sensitive secrets — 2 secrets + their write_secret lines Replace with explanatory comments pointing at BROWSER_CASES / auth_registry.py so a future contributor doesn't re-add them by reflex when github gets an OAuth flow. GitHub coverage continues to live in SEEDED_CASES (auth-live-seeded lane) which seeds the PAT directly — that lane's secrets are unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(canary): align user-facing browser-cases list with auth_registry Addresses @henrypark133's review on PR #2874: removing `github` from BROWSER_CASES made `--mode browser --case github` invalid, but the contract was still advertised in four places that operators read when copying invocations: - run_live_canary.py --help (`For browser mode: google, github, notion`) - scripts/auth_live_canary/README.md (`github` listed under "Runs through Responses API and browser") - scripts/live-canary/README.md (`CASES=google,github` example) - scripts/live-canary/ACCOUNTS.md (full GitHub OAuth client + fixture + storage-state-secret sections still active, plus a Playwright storage-state recipe pointing at github.com/login) Update each in lockstep: - --help now says `For browser mode: google, notion. (github browser coverage is intentionally absent — the github WASM tool is PAT-only, not OAuth; see SEEDED_CASES instead.)` - auth_live_canary/README — github entry now reads "Responses API only (PAT-only — not browser-OAuth)"; notion entry corrected to "Responses API and browser" (it was inaccurately listed as Responses API only). - live-canary/README — example flips to `CASES=google,notion` with a one-line note pointing at auth_registry.py.
- live-canary/ACCOUNTS — drops the GitHub OAuth client + fixture sections, swaps the Playwright storage-state recipe target from github.com/login to accounts.google.com, drops AUTH_BROWSER_GITHUB_STORAGE_STATE_B64 from the CI-secrets list. The argparse validator in run_live_canary.py already gives a clean error if anyone passes `--mode browser --case github`: "--case values ['github'] are not valid for --mode browser. Allowed: ['google', 'notion']", so the docs change is the user-facing fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(canary/telegram): split is_active into installed vs. active Addresses Copilot AI's review on PR #2874: `is_telegram_active` only checked that an extension named "telegram" appeared in `/api/extensions`, returning True for an installed-but-inactive extension (mid-setup, awaiting auth, activation_error). Two callers (`telegram_round_trip._ensure_active`, `routine_visibility_from_telegram._ensure_active_and_paired`) used this as a precheck to skip `setup_telegram_channel()`, so a stale inactive entry would short-circuit setup and the probe would then fail mysteriously when the channel didn't respond. Split into two helpers: - `is_telegram_installed(...)` — original semantics (entry exists), used internally as a building block; not exported as a precheck. - `wait_for_telegram_active(...)` — polls until the entry has `active=true` (the actual runtime-readiness signal — channel opened, hooks registered, credentials bound, per `.claude/rules/lifecycle.md`'s discovery-vs-activation rule). Shared `_find_telegram` helper handles the three historical envelope shapes the gateway has used (`extensions` / `items` / `installed`). 
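The installed-versus-active split can be sketched along these lines. This is a hedged illustration rather than the code in telegram_setup.py: the `name` key on each entry, the injected `fetch` callable, and the timeout defaults are assumptions; only the three envelope keys and the `active` flag come from the commit message.

```python
import time
from typing import Callable, Optional

# The three historical envelope shapes named in the commit message.
ENVELOPE_KEYS = ("extensions", "items", "installed")


def find_telegram(payload: dict) -> Optional[dict]:
    """Return the telegram entry from an /api/extensions payload, if present."""
    if not isinstance(payload, dict):
        return None
    for key in ENVELOPE_KEYS:
        entries = payload.get(key)
        if isinstance(entries, list):
            for entry in entries:
                # Assumption: entries identify themselves with a "name" key.
                if isinstance(entry, dict) and entry.get("name") == "telegram":
                    return entry
    return None


def is_installed(payload: dict) -> bool:
    """Entry exists; says nothing about runtime readiness."""
    return find_telegram(payload) is not None


def wait_for_active(
    fetch: Callable[[], dict],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the telegram entry reports active=true or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        entry = find_telegram(fetch())
        if entry is not None and entry.get("active") is True:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting `fetch`, `sleep`, and `clock` keeps the polling loop testable without a running gateway; a real caller would pass a function that performs the authenticated GET.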
Update all 4 callers to use `wait_for_telegram_active`: - telegram_channel_install.py - telegram_round_trip.py (precheck + post-setup wait) - routine_visibility_from_telegram.py (precheck + post-setup wait) - manual_trigger_from_telegram.py (precheck + post-setup wait) Verified: all 4 telegram probes still green back-to-back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(canary/periodic_reminder): align docstring with current behavior Addresses Copilot AI's review on PR #2874: the module docstring still described the Telegram delivery assertion as a "Phase 1B follow-up" even though the scenario now sets verify_telegram=True and the inline comment on the call site already explained the Phase 1B work had landed. Future readers would assume Telegram verification was missing from this probe. Replace the docstring with a 5-step description of what the probe actually does end-to-end: 1. Backdated cron routine inserted via libSQL 2. Routine engine cron-tick picks it up 3. Lightweight action runs against mock LLM → http sendMessage 4. IRONCLAW_TEST_HTTP_REMAP routes to telegram_mock 5. Asserts both terminal routine_runs status AND captured sendMessage Also adds an explicit note that channel-install coverage (capability patch + setup + pairing) lives in the sibling telegram_* scenarios — this one covers the routine-driven sendMessage path and intentionally hits api.telegram.org via the raw http tool rather than through the installed channel. Verified: probe still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(canary-report): rich failure blocks + cross-lane categorization + GH issues Three additions to scripts/live-canary/notify_slack.py to make the 6h Slack report actionable instead of just informational: 1) **Per-lane rich failure block** — Haiku now extracts four structured fields when status==fail: test_name, error, root_cause, fix. 
The Slack section renders them in the issue-friendly shape the reviewer asked for: ❌ auth-full (mock) — 11/13 passed, 1 failed in 213s Test: `test_wasm_tool_first_chat_auth_attempt_emits_auth_url` Error: SSE stream closed; auth_required event never arrived Root Cause: bridge gate not wired for installed-but-unauthed extensions (#2868 fallout) Fix: route Extension::NeedsAuth through effect_adapter.rs For passing/skipped lanes the existing single-line `> reason` is preserved so the green-path Slack output is unchanged. 2) **Cross-lane "Summary by Category" block** — second Haiku pass over all failed-lane summaries that groups them by shared root cause (e.g. "WASM tool dispatch regression — Auth Full, Auth Smoke, Auth Live Seeded"). Only fires when there are 2+ failures (single-failure runs are already obvious from the per-lane block). Rendered as a Slack mrkdwn bulleted list since Block Kit doesn't support real tables. 3) **Auto-opened GitHub issues** — opt-in via CANARY_CREATE_ISSUES=1 env var (gated to scheduled runs only in live-canary.yml so workflow_dispatch debugging doesn't flood the tracker). For each failed lane: - Search for an OPEN issue with title `[canary] <lane>: <test>`. - If found: comment "another occurrence on <run_url>". - If not found: open a new issue with the rich body + labels `canary-failure` + `lane:<lane>`. Strategy chosen to avoid issue spam while still surfacing recurring failures. Uses GITHUB_TOKEN + the repo's existing `permissions: issues: write` block — no new secrets. All three additions degrade silently — Haiku failure stamps .notable but doesn't block the post; categorization failure produces an "_(unavailable)_" placeholder; issue-creation errors are logged to stderr only. The notifier still exits 0 in every failure path so a flaky webhook can't fail the canary run. 
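The comment-or-create strategy for auto-opened issues can be expressed as a small decision function. This sketch covers only the decision step under the assumption that the caller has already fetched the titles of open issues; it makes no GitHub API calls, and `issue_title` and `plan_issue_action` are hypothetical names. The title format, comment text, and labels follow the description above.

```python
def issue_title(lane: str, test_name: str) -> str:
    """Title format from the commit message: `[canary] <lane>: <test>`."""
    return f"[canary] {lane}: {test_name}"


def plan_issue_action(lane: str, test_name: str, open_titles: set, run_url: str) -> dict:
    """Decide whether to comment on an existing open issue or create a new one."""
    title = issue_title(lane, test_name)
    if title in open_titles:
        # Recurring failure: add a pointer to the new run instead of a duplicate issue.
        return {
            "action": "comment",
            "title": title,
            "body": f"another occurrence on {run_url}",
        }
    return {
        "action": "create",
        "title": title,
        "labels": ["canary-failure", f"lane:{lane}"],
    }
```

Separating the decision from the HTTP calls makes the dedup rule easy to verify, and any API failure can then be logged without affecting the notifier's exit code.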
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(canary-report): reuse AUTH_LIVE_GITHUB_TOKEN for issue creation Swap the issue-creation token source from the built-in secrets.GITHUB_TOKEN to the existing AUTH_LIVE_GITHUB_TOKEN PAT — no new secrets to mint, and that PAT already covers nearai/ironclaw operations. Set as CANARY_ISSUES_TOKEN (the highest-priority env var in notify_slack.py's --github-token precedence chain) so it wins over GH_TOKEN / GITHUB_TOKEN if any of those are also present. Verify the PAT has `issues: write` scope (Issues: read & write for fine-grained PATs, repo scope for classic PATs). If it doesn't, the notifier still degrades gracefully — the API call fails, the error is logged to stderr, the canary run isn't blocked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(canary/workflow-canary): build Telegram WASM before the lane runs Addresses reviewer feedback on PR #2874: the workflow-canary job only checked out the repo + installed Rust + Python before invoking run.sh, but four scenarios in the lane (telegram_channel_install, telegram_round_trip, routine_visibility_from_telegram, manual_trigger_from_telegram) call `/api/extensions/install` for the bundled `telegram` WASM channel. That installer needs a prebuilt `channels-src/telegram/telegram.wasm` artifact, which the repo doesn't check in — every fresh CI runner would 404 on the install path. Add the same four-step preamble the other WASM-using lanes (deterministic-replay, public-smoke, release-public-full) carry: - rust-toolchain with `targets: wasm32-wasip2` - Swatinem/rust-cache keyed `live-canary-workflow-canary` - `cargo install cargo-component --locked` - `./scripts/build-wasm-extensions.sh --channels` Use `--channels` (not the default everything-build) because the lane doesn't exercise any WASM tool — only the bundled WASM channels get installed. That keeps the cold-cache build budget under ~6 min; warm-cache runs are ~1-2 min. 
The 30-min job budget still has plenty of headroom: previous runs land around 8 min for the canary itself, so worst-case ~14 min total on a fresh runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>