ironclaw

mirror of https://github.com/nearai/ironclaw.git synced 2026-05-18 12:36:19 +08:00
Nick Pismenkov 5a5beec1c2 feat: canary report (#2874)
* fix(oauth): remove pending flow on provider-error callback

The /oauth/callback handler's ?error= branch (RFC 6749 §4.1.2.1
provider-side failures — user cancels consent, scope denied, etc.)
returned the error page immediately without removing the flow from
ext_mgr.pending_oauth_flows(). The ghost entry then lingered until
the 5-minute expiry sweep, and any subsequent auth dance for the
same (extension, user) pair had to dedupe against it.

Mirror the happy-path cleanup: decode the state param, remove the
keyed flow, then return the error page.

Surfaced during live-canary auth-full repro: after
test_wasm_tool_oauth_provider_error_leaves_extension_unauthed ran,
the stale flow sat in the shared auth_matrix_server fixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e): widen auth OAuth matrix timeouts for CI load

Four tests in live-canary auth-full were failing in CI with
`Page.wait_for_function: Timeout 60000ms exceeded`,
`ClientConnectionError('Connection closed')`, and
`Timed out waiting for OAuth refresh request` — all inside 60/20s
deadlines that are tuned for a dev laptop and don't leave margin
for ubuntu-latest's 2-vCPU runner under full suite load.

Raise the per-call deadlines so the inner budgets fit comfortably
inside pyproject.toml's 120s per-test cap:

  _wait_for_refresh_request default: 20.0s -> 60.0s
  _wait_for_auth_event call site:      60   -> 90
  _wait_for_auth_prompt call site:     60   -> 90
  send_chat_and_wait_for_terminal_message call sites: 60000 -> 90000
  _wait_for_mock_google_tokens call site: 60.0 -> 90.0
  _wait_for_response_contains (gmail) call site: 60.0 -> 90.0

Strictly widening; no passing test is slowed, no semantics change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(canary): Haiku-powered Slack report job

Replace the team's raw Slack subscription (firehose of workflow
notifications) with one curated per-run summary:

  Canary: 9 passed, 1 failed of 10 lanes
  ❌ auth-full (mock) — 12/13 passed, 1 failed in 350s
  > test_wasm_tool_first_chat_auth_attempt_emits_auth_url timed
  >   out waiting for auth_required SSE event on the fresh thread
  tools: shell, http_request, gmail (~6 calls)
  ...
  commit `abc1234` • <github run link>

New `canary-report` job (needs: every lane, if: always) downloads
all lane artifacts, parses junit + summary + log tail per lane, and
asks claude-haiku-4-5 to return a compact JSON per lane
({status, reason, tool_calls_total, tools_used, notable}). That's
aggregated into a single Slack block message and posted via
incoming webhook.

Safety shape:
- Script exits 0 even on Haiku/Slack failure so the notifier never
  masks the underlying canary signal.
- Missing ANTHROPIC_API_KEY falls back to raw junit-only phrasing.
- Slack POST failure falls back to plain-text "X/Y lanes failed"
  with the GH run URL so the channel still hears something.
- No new Python deps — pure stdlib (urllib.request, xml.etree).
- 20 KB log-tail cap per lane to keep Haiku token usage bounded.

Secrets:
- ANTHROPIC_API_KEY (already present, used by provider-matrix)
- SLACK_WEBHOOK_URL (new — create an incoming webhook in Slack
  and add as repo secret; notifier prints to stdout otherwise)

Testing:
- Trigger manually via Actions -> "Live Canary" -> "Run workflow"
  with any single lane; canary-report runs after regardless of
  which lanes executed.
- Run locally with --dry-run to preview the Slack payload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(canary): post_json error handling + robust Haiku JSON extraction

Address gemini-code-assist review on scripts/live-canary/notify_slack.py:

1. `post_json` unreachable error branch: `urllib.request.urlopen`
   raises `urllib.error.HTTPError` for 4xx/5xx before reaching the
   `if resp.status >= 300` check, so the error body was never
   surfaced. Wrap in try/except and read the body from the
   HTTPError instance — that's where Anthropic's "invalid API key"
   / "rate limited" detail lives.

2. Haiku JSON extraction was fragile: `startswith("```")` assumed
   the response had no prose preamble and only handled one fence
   shape. Replace with `re.search(r"\{.*\}", text, re.DOTALL)` so
   we pick the outermost JSON object regardless of any wrapper
   markdown or leading/trailing text. Greedy + DOTALL is correct
   for the single top-level object our schema requires.
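
A minimal sketch of both fixes, assuming a stdlib-only helper like the
one described above. `extract_json_object` is a hypothetical name, and
the `(status, body)` return shape of `post_json` is an assumption, not
the script's actual signature:

```python
import json
import re
import urllib.error
import urllib.request
from typing import Optional, Tuple


def post_json(url: str, payload: dict, headers: Optional[dict] = None,
              timeout: float = 30.0) -> Tuple[int, str]:
    """POST JSON and return (status, body), including 4xx/5xx bodies."""
    merged = {"Content-Type": "application/json"}
    merged.update(headers or {})
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=merged,
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx, so the error detail must be read
        # from the exception object rather than from a response.
        return err.code, err.read().decode("utf-8", errors="replace")


def extract_json_object(text: str) -> Optional[dict]:
    """Return the outermost {...} object in text, or None if absent/invalid."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Greedy matching spans from the first `{` to the last `}`, so nested
objects survive. It would over-capture if a reply held two separate
top-level objects, which the single-object schema rules out.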

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e): raise pytest timeout + bump multi-user chat wait to 180s

The CI run on feat/canary-report surfaced that 90s was still not
enough for test_mcp_same_server_multi_user_via_browser on
ubuntu-latest — it timed out at the inner Playwright
wait_for_function deadline with "Timeout 90000ms exceeded" after
118s of total test time.

The test opens two browser contexts + two SSE streams and drives a
full chat turn per user in sequence. Under 2-vCPU contention the
compound pipeline genuinely takes over 90s.

- tests/e2e/pyproject.toml: timeout 120 -> 240 (pytest-level cap)
- test_v2_auth_oauth_matrix.py: send_chat_and_wait_for_terminal_message
  call sites 90000 -> 180000 (two owner/member turns, each budgeted
  for one runner-slow turn)

180s < 240s, so the inner deadline fires first with the useful
Playwright traceback instead of the generic pytest-timeout failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e): fix pytest-timeout CLI override + widen Mode-C deadlines

The previous commit (c3c9bbab) raised tests/e2e/pyproject.toml's
timeout from 120 to 240, but the auth canary runs the suite via
scripts/auth_canary/run_canary.py which hardcodes
`--timeout=120` on the pytest command line. The CLI flag wins
over pyproject's ini_options, so the 240 bump was invisible to
the auth lanes. That's why auth-smoke on the canary `all` run
still failed with "Timeout (>120.0s) from pytest-timeout" even
after our 180s inner widening — the outer CLI cap was firing at
120s first.

Fix the override and widen the two remaining Mode-C deadlines
that blew in the same run:

  scripts/auth_canary/run_canary.py: --timeout=120 -> 240
  _wait_for_refresh_request default: 60.0 -> 120.0
    (test_wasm_tool_oauth_refresh_on_demand and
     test_mcp_oauth_refresh_on_demand both use the default)
  test_settings_first_gmail_auth_then_chat_runs call sites:
    _wait_for_mock_google_tokens 90.0 -> 120.0
    _wait_for_response_contains 90.0 -> 120.0

All remain comfortably under the new 240s pytest-level cap so a
real hang still fails fast with a useful traceback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e): opt-in text-match predicate for multi-user browser test

Ship the structural fix that was overdue. Repeated budget bumps on
send_chat_and_wait_for_terminal_message weren't holding under
ubuntu-latest "all"-mode parallelism — 120s, 180s both exceeded on
test_mcp_same_server_multi_user_via_browser. The underlying race is
in the JS predicate: it waits for the assistant bubble AND the
data-streaming attribute cleared AND the chat input re-enabled.
Under 2-vCPU contention an SSE reconnect can drop the final
attribute-clearing delta, and the compound predicate never flips
even though the response text arrived long ago.

Add an opt-in `expected_text_contains` parameter. When supplied,
the predicate succeeds the moment the expected substring appears in
the new assistant message — regardless of data-streaming or input
state. Callers that already assert on specific response text (the
existing MCP / gmail tests) can now short-circuit the race without
compromising correctness: the test's own content assertions remain
the gate.

Default behavior unchanged for the ~30 existing call sites across
test_chat.py, test_sse_reconnect.py, test_tool_approval.py,
test_portfolio.py, test_message_persistence.py, test_agent_loop_recovery.py,
test_pending_user_messages.py, test_widget_customization.py.

Applied to the two multi-user call sites with
expected_text_contains="Mock MCP search result" — that's exactly
what the test's next two assertions verify.

Local run of the flaky test alone: 40s, green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(canary): move auth-smoke to self-hosted runner

Multi-user browser test (test_mcp_same_server_multi_user_via_browser)
consistently exceeds the Playwright budget on GH ubuntu-latest under
the 2-vCPU parallelism pressure of an "all" canary run: a single
compound chat turn burns >180s, and each budget bump only moves the
flake to a higher threshold without fixing it.

Pilot move onto the [self-hosted, ironclaw-live] runner that
private-oauth already uses. Same runner label means no new
infrastructure required; if the self-hosted box has Python 3.12 and
Playwright browsers installed (or can provision them via the existing
setup-python + scripts/live-canary/run.sh's `PLAYWRIGHT_INSTALL=with-deps`
flow), this is a zero-code-change canary fix.

If the pilot works, auth-full is the next candidate. If the runner
queues become a bottleneck, we'd scale to multiple workers under
the same label rather than revert to ubuntu-latest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(canary): revert auth-smoke to ubuntu-latest + widen budgets to 300s/360s

Railway self-hosted runner ('railway-private-oauth' on a small Docker
container) turned out to be no faster than GH ubuntu-latest for the
multi-user browser flow — both take ~194–196s for
test_mcp_same_server_multi_user_via_browser. The runner container is
evidently provisioned at a similar vCPU allocation, so the move
bought nothing.

Revert to ubuntu-latest (parallel canary shape preserved; avoids
serialising auth lanes behind private-oauth on the single
self-hosted worker) and widen deadlines for the last CI-load hop:

  test_v2_auth_oauth_matrix.py multi-user call sites:
    Playwright wait_for_function 180000 -> 300000 ms
  scripts/auth_canary/run_canary.py:
    --timeout=240 -> 360 (outer pytest cap)
  tests/e2e/pyproject.toml:
    timeout = 240 -> 360

300s inner fits inside the new 360s outer with 60s margin. A local
run of the same test alone completes in ~40s, so there is plenty of
headroom, and a real hang still surfaces with a useful traceback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* disable report

* scripts(auth-canary): add Google storage-state bootstrap helper

The auth-browser-consent lane drives Google's real OAuth consent UI in
Playwright, but Google's risk engine routinely interrupts the flow with
a "Verify it's you" challenge that handle_google_popup cannot solve, so
the test stalls on the password screen.

Bypass: log in once interactively in Playwright Chromium, save cookies
+ localStorage to a storage_state.json, and point
AUTH_BROWSER_GOOGLE_STORAGE_STATE_PATH at it. Subsequent canary runs
spawn contexts with that state preloaded, so the popup arrives at
consent with no login or challenge in the way.

- scripts/auth_live_canary/bootstrap_google_storage_state.py: new
  one-shot interactive helper that writes
  ~/.ironclaw/auth-canary/google_storage_state.json by default
- scripts/auth_live_canary/README.md: document the bypass under
  "Browser-consent Google challenge bypass"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(auth-browser-consent): fix Google account-picker + chat drift

The auth-browser-consent google case was failing on two distinct
issues, the first masking the second:

1) Account picker. When AUTH_BROWSER_GOOGLE_STORAGE_STATE_PATH is set
   (the recommended path — username/password automation gets blocked
   by Google's risk engine), Google's OAuth popup lands on a "Choose
   an account" picker before the consent screen. handle_google_popup
   only knew how to fill email + password and click Continue/Allow,
   so the popup sat on the picker until complete_provider_auth's
   120s callback wait timed out. Added a picker-detection step that
   tries selectors in order — username text, [data-identifier], and
   a generic "any visible @-bearing text not equal to 'Use another
   account'" XPath — and clicks the first hit, with debug logging
   so future regressions surface in the run output.

2) Tool-name and response-text drift. After the OAuth fix unblocked
   the rest of the probe, browser_chat still failed because:
   - case.expected_tool_name was "gmail", but the gateway records
     the tool call under its WASM module name "gmail_tool"
   - case.expected_text was "Gmail" (case-sensitive), but real LLM
     responses to "check gmail unread" against an empty inbox vary
     ("Your inbox is clear...", "Inbox is empty", etc.) and rarely
     emit literal "Gmail"
   Updated BROWSER_CASES["google"] to expected_tool_name="gmail_tool"
   and expected_text="inbox", and made the browser_chat assertion's
   text comparison case-insensitive so the canary doesn't depend on
   exact wording.

After both fixes the auth-browser-consent google lane runs green:
  ✓ browser_oauth   (popup -> /oauth/callback)
  ✓ browser_chat    (assistant references inbox)
  ✓ responses_api   (real Gmail tool call)

Not addressed here: BROWSER_CASES["github"] likely has the same
expected_tool_name drift ("github" vs probably "github_tool"); needs
verification with real GitHub OAuth creds before changing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(auth-browser-consent): robust account-picker fallback + browser channel

Two follow-ups discovered during local debugging of the
auth-browser-consent google lane:

1) Account-picker fallback was matching hidden <style> blocks. The XPath
   `//*[contains(text(), '@') ...]` matched any element whose text
   contains `@`, which includes <style> tags carrying CSS at-rules
   (@font-face, @media). Replaced the XPath with role-based locators
   (get_by_role link/button) filtered by an email regex — only
   interactive elements match, no false positives from style blocks.
   Verified locally that the fallback now clicks the right account row
   even when AUTH_BROWSER_GOOGLE_USERNAME is unset.

2) Bootstrap script: Google's anti-automation blocks Playwright's
   default Chromium (Chrome for Testing) at sign-in with "This browser
   or app may not be secure". Added a --browser flag with a default of
   firefox (Marionette is less aggressively fingerprinted than CDP),
   plus chrome (system Google Chrome) and chromium (override) options.
   For accounts where Google blocks even those — typically brand-new
   Gmails or accounts with high risk scores — the fallback path is to
   launch Chrome manually with --remote-debugging-port and connect via
   playwright.chromium.connect_over_cdp; documented in the README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(auth-live-canary): include observed extension state in timeout error

When `wait_for_extension_state` times out the bare error
"Timed out waiting for extension state: gmail" is unhelpful for
diagnosing CI failures, since CI artifacts don't capture IronClaw's
gateway logs — there's no way to tell whether the extension never
appeared, appeared but never authenticated, or authenticated but
never activated.

Track the last-observed extension on each poll and surface
authenticated/active in the timeout message. After this change a
failed run says e.g.
"Timed out waiting for extension state: gmail (expected
authenticated=True, active=True; last observed: authenticated=False,
active=False)", which immediately separates token-exchange failures
from activation-state-machine bugs.
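
A minimal sketch of the polling shape, assuming the extension is
fetched through a caller-supplied function. The `fetch_extension`
parameter, the defaults, and the "never appeared" wording are
illustrative assumptions; only the message format follows the example
above:

```python
import time
from typing import Callable, Optional


def wait_for_extension_state(
    fetch_extension: Callable[[str], Optional[dict]],
    name: str,
    authenticated: bool = True,
    active: bool = True,
    timeout: float = 30.0,
    interval: float = 0.5,
) -> dict:
    deadline = time.monotonic() + timeout
    last: Optional[dict] = None  # last extension record seen on any poll
    while True:
        ext = fetch_extension(name)
        if ext is not None:
            last = ext
            if (ext.get("authenticated") == authenticated
                    and ext.get("active") == active):
                return ext
        if time.monotonic() >= deadline:
            break
        time.sleep(interval)
    if last is None:
        observed = "extension never appeared"
    else:
        observed = "authenticated={}, active={}".format(
            last.get("authenticated"), last.get("active"))
    raise TimeoutError(
        "Timed out waiting for extension state: {} (expected "
        "authenticated={}, active={}; last observed: {})".format(
            name, authenticated, active, observed))
```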

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(auth-live-canary): widen chat-wait deadlines 120s -> 300s

The auth-browser-consent google probe completed OAuth + extension
activation successfully on CI but timed out at the next step
(send_chat_and_wait_for_terminal_message), with the agent stuck on
"Thinking (step 1)" for the full 120s budget. Local runs on the
same code path complete the chat in ~36s, but ubuntu-latest 2-vCPU
runners under cold-start load (gateway restart, mock LLM bootstrap,
WASM tool first-invocation) need substantially more headroom.

300s matches the precedent set by `d8765714 ci(canary): revert
auth-smoke to ubuntu-latest + widen budgets to 300s/360s` for the
auth-smoke lane on the same runner class.

Both call sites widened — the seeded Responses-API probe at line 221
and the browser_oauth probe at line 800.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(common): drain gateway/mock_llm stdout pipes (was deadlocking CI)

scripts/live_canary/common.py spawns the IronClaw gateway and the
mock LLM with stdout=PIPE + stderr=STDOUT, reads one line of mock_llm
output to discover its bound port, then never reads from either pipe
again. On Linux the pipe buffer defaults to 64 KiB; once a
sustained chat request fills it with `RUST_LOG=info` output, the
child blocks on its next stdout write and the request handler
freezes mid-response.

That's why every auth-browser-consent CI run got stuck on
"Thinking (step 1)..." for the full chat-wait budget while the same
test passes locally — macOS pipe buffers are larger and the test
completes before the buffer fills.

Fix: spawn a daemon thread per subprocess that drains the pipe to a
log file under the run's output_dir. Two wins:

- Pipes never fill, child never blocks.
- gateway.log and mock_llm.log become CI artifacts, so the next
  failure that doesn't have a clear runner-side error message is
  immediately debuggable from IronClaw's own logs.

Verified locally that the lane still passes after the change and
both log files are produced. Locally each is < 10 KiB; CI runs may
be larger but well under any artifact size limit.
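
The drainer can be sketched as follows, assuming the child was started
with stdout=PIPE and stderr=STDOUT as described. `drain_to_file` is a
hypothetical name, not necessarily what common.py calls it:

```python
import subprocess
import sys
import threading
from pathlib import Path


def drain_to_file(proc: subprocess.Popen, log_path: Path) -> threading.Thread:
    """Copy everything the child writes to log_path until it closes the pipe."""
    if proc.stdout is None:
        raise ValueError("process must be started with stdout=subprocess.PIPE")
    stream = proc.stdout

    def _pump() -> None:
        with open(log_path, "wb") as log:
            for line in stream:  # ends at EOF, i.e. when the child exits
                log.write(line)
                log.flush()

    # Daemon thread: it must never keep the canary process alive on exit.
    thread = threading.Thread(target=_pump, name="drain-" + log_path.name,
                              daemon=True)
    thread.start()
    return thread
```

Starting the thread only after the port-discovery line has been read
keeps that handshake unchanged; everything after it goes to the log.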

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary: pin LLM backend via settings API + add LLM_API_KEY (root cause of CI freeze)

The auth-browser-consent google lane has been freezing on CI at
"Thinking (step 1)..." for the full chat-wait budget. Gateway logs
captured by the previous commit's pipe drainer reveal the smoking
gun:

  ERROR Configured LLM backend is not usable.
        backend=openai_compatible reason=missing API key
  WARN  LLM_BACKEND env var is set but DB setting takes priority.
        db_value=nearai env_value=openai_compatible
  WARN  Active LLM backend fell back to NearAI default
        attempted=openai_compatible active=nearai

Two compounding issues:

1. The openai_compatible provider refuses to instantiate without an
   API key, even though the mock LLM ignores the value. Fix: set
   `LLM_API_KEY=mock-api-key` in `build_gateway_env`, matching what
   `tests/e2e/conftest.py` already does for the e2e suite.

2. IronClaw's DB-stored LLM settings take priority over env vars,
   and the freshly-seeded canary DB defaults `llm_backend` to
   `nearai`. So even with a clean env, the agent fell back to NearAI
   and entered an interactive auth flow that hangs indefinitely in
   CI (the "Thinking" never ends). This is the exact trap
   `tests/e2e/CLAUDE.md` documents: "do not rely on env-vs-DB
   precedence … pin the provider explicitly through /api/settings/...".
   Fix: pin `llm_backend`, `openai_compatible_base_url`, and
   `selected_model` via PUT /api/settings/<key> immediately after the
   gateway becomes healthy.
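
   A rough sketch of the pinning step using only urllib. The three
   setting keys come from the fix above; the `{"value": ...}` request
   body, the bearer-token header, and the function names are
   assumptions for illustration, not the gateway's confirmed contract:

```python
import json
import urllib.request
from typing import Optional


def build_setting_request(base_url: str, key: str, value: str,
                          token: Optional[str] = None) -> urllib.request.Request:
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = "Bearer " + token  # assumed auth scheme
    return urllib.request.Request(
        "{}/api/settings/{}".format(base_url.rstrip("/"), key),
        data=json.dumps({"value": value}).encode("utf-8"),  # assumed body shape
        headers=headers,
        method="PUT",
    )


def pin_llm_backend(base_url: str, mock_llm_url: str, model: str,
                    token: Optional[str] = None,
                    opener=urllib.request.urlopen) -> None:
    """Pin the provider in the DB so env-vs-DB precedence cannot matter."""
    settings = {
        "llm_backend": "openai_compatible",
        "openai_compatible_base_url": mock_llm_url,
        "selected_model": model,
    }
    for key, value in settings.items():
        req = build_setting_request(base_url, key, value, token)
        with opener(req, timeout=10) as resp:
            resp.read()
```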

Also revert the BROWSER_CASES["google"] case I touched earlier:
when NearAI was driving, it emitted the WASM canonical tool name
(`gmail_tool`), but the mock LLM (now correctly driving) emits the
tool name it knows from its mapping (`gmail`). Restoring the original
`expected_tool_name="gmail"` / `expected_text="gmail"` matches what
the mock LLM actually produces.

Verified locally: all three browser_oauth / browser_chat /
responses_api probes now pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(auth-live-canary): revert chat-wait deadline 300s -> 120s

The 300s widening at 98abeebe was a band-aid for the two actual root
causes (subprocess pipe deadlock + DB-overrides-env LLM backend),
which were fixed at f59981d3 and 8733d3c0 respectively. With those
fixes the chat completes in ~35s on CI, so the 300s budget is
overkill. Revert to the original 120s, which gives ~3.5x headroom
over the observed steady state and matches the deadline shape used
elsewhere in the e2e suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(canary): rename github oauth secrets to dodge GITHUB_ prefix block

GitHub Actions reserves the GITHUB_ prefix for auto-generated repo
secrets (GITHUB_TOKEN, etc.) and rejects user-created secrets that
start with it: "Secret names must not start with GITHUB_". The
existing references to GITHUB_OAUTH_CLIENT_ID and
GITHUB_OAUTH_CLIENT_SECRET in this workflow couldn't be backed by
actual secrets for that reason, so the OAuth-client config was
effectively unset for the github browser-consent case, which is why
it was silently filtered out by configured_browser_cases().

Decouple the secret name from the env var name: store the secrets
under the AUTH_BROWSER_GITHUB_CLIENT_ID /
AUTH_BROWSER_GITHUB_CLIENT_SECRET names (matching the
AUTH_BROWSER_GITHUB_* convention used by the other github canary
fixture vars), and re-export them here under the
GITHUB_OAUTH_CLIENT_ID / _SECRET env names that auth_registry.py and
the WASM github tool expect.

No code changes needed in auth_registry.py or
scripts/auth_live_canary/: they continue to read
GITHUB_OAUTH_CLIENT_ID/_SECRET from the environment as before.

Operator action: create the OAuth app on GitHub (Settings →
Developer settings → OAuth Apps → New OAuth App) and store the
resulting credentials at:

  AUTH_BROWSER_GITHUB_CLIENT_ID
  AUTH_BROWSER_GITHUB_CLIENT_SECRET

(not GITHUB_OAUTH_CLIENT_ID / _SECRET, which GitHub will reject).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(auth-browser-consent): drop github case (tool is PAT-only, not OAuth)

CI run 25022303491 surfaced that `Activate
/api/extensions/github/activate` returns `{success: false,
awaiting_token: true, message: "Create a Personal Access Token..."}`
with no `auth_url`, which the browser-consent probe needs in order to
drive the OAuth popup.

Confirmed via `registry/tools/github.json`:

    "auth_summary": {
        "method": "manual",       <- PAT paste, not OAuth
        "secrets": ["github_token"],
        "setup_url": "https://github.com/settings/tokens"
    }

The github WASM tool's source capabilities JSON does carry an `oauth`
block, but the released v0.2.3 artifact (referenced from the registry)
ships with the manual-auth path. Until a release flips
`auth_summary.method` to "oauth" — and the github extension actually
returns an `auth_url` from /activate — there's nothing for the
browser-consent probe to do.

- Drop the `github` entry from BROWSER_CASES with a comment pointing
  at the criterion for re-adding it.
- Drop the github-specific filter in `configured_browser_cases` since
  the case is gone (no risk of an env-aware code path that quietly
  skips github when secrets are present-but-mismatched).

GitHub coverage is unchanged in SEEDED_CASES, which seeds the PAT
directly via `AUTH_LIVE_GITHUB_TOKEN` and exercises real
`/v1/responses` + browser tool calls — that lane already works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(auth-browser-consent): tick notion's trust-URL checkbox before Continue

CI run 25023708895 surfaced the notion case timing out at "Timed out
waiting for notion OAuth callback page". The popup screenshot shows
Notion MCP's consent screen with:

- Workspace correctly auto-selected (storage state worked)
- A yellow warning: "I recognize and trust this URL"
- An unchecked checkbox next to that text
- A grayed-out (disabled) Continue button

The button is gated behind the checkbox. handle_notion_popup
clicked the disabled Continue and silently no-op'd, so the
complete_provider_auth loop waited the full 120s for /oauth/callback
that never arrived.

Add a checkbox-detection step before the Continue click:

  popup.get_by_text(
      re.compile("I recognize and trust this URL", re.I)
  ).first.click(timeout=3000)

Includes debug print statements (matching the auth-canary pattern
established for google's account picker) so future Notion UI
changes are immediately visible in test-output.log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e): drain ironclaw subprocess pipes in auth-matrix fixture

Same pipe-deadlock fix as scripts/live_canary/common.py f59981d3,
applied to tests/e2e/scenarios/test_v2_auth_oauth_matrix.py's
_start_auth_matrix_server. The auth-matrix fixture spawns ironclaw
with stdout=PIPE + stderr=PIPE and never drains them, so under
sustained log volume the kernel pipe buffer fills, ironclaw blocks
on its next stdout write, and any test that relies on subsequent
gateway responses (auth gate emission, SSE events, chat replies)
hangs until pytest-timeout fires.

This fix doesn't make the auth-full lane's failing test pass — the
real bug is engine-v2 silently dropping `auth_required` SSE events
for unauthenticated extensions (introduced by #2868). But it makes
the failure mode debuggable: gateway log is captured to
/tmp/ironclaw-auth-matrix-gateway.log (overridable via
IRONCLAW_AUTH_MATRIX_LOG env), and RUST_LOG passes through from the
test runner so we can crank up verbosity without rebuilding.

Without this change, the failing test's log was empty after the
extension-install line; with this change you see the engine-v2
trace summary that surfaces the actual NotCallable-without-auth-gate
bug. That diagnostic visibility is the value here.

- _drain_stream_to_file: asyncio drainer mirroring common.py's sync
  threading version
- _start_auth_matrix_server: drain stdout/stderr to log_path
- _shutdown_auth_matrix_server: cancel drain_tasks for clean exit
- env: RUST_LOG forwarding so debug runs work

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(workflow): add Telegram Bot API mock

Foundation piece for the new workflow-canary lane that exercises
multi-tool / multi-channel user workflows from issue #1044 (Telegram +
routines + Sheets/Calendar/Gmail end-to-end). Models the same
single-port aiohttp-based mock pattern used by tests/e2e/mock_llm.py.

Endpoints:
- /bot{token}/{getMe,getUpdates,sendMessage,sendChatAction,
  setWebhook,deleteWebhook,getFile} — the subset IronClaw's WASM
  telegram tool + channels-src/telegram actually call. Tokens are
  accepted without validation; the canary doesn't need to test
  Telegram's auth — just IronClaw's flow against a Bot API shape.
- /__mock/inject_message — push a simulated incoming user message
  onto the next getUpdates response, so scenarios can drive a
  Telegram → IronClaw round-trip without a real Telegram account.
- /__mock/sent_messages — drain the queue of every sendMessage /
  sendChatAction IronClaw emitted, for end-to-end assertions.
- /__mock/reset — clear all state between probes.

IronClaw routes its API calls through this mock via
IRONCLAW_TEST_HTTP_REMAP=api.telegram.org=<mock_url>, the same
mechanism the auth-live-canary uses for Gmail/Calendar/Sheets mocks.

Smoke-tested: getMe → success, inject_message → getUpdates returns
the injected message, sendMessage → bot response shape + recorded
in sent_messages.
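
The state behind those endpoints can be sketched without the HTTP
layer. This is a hypothetical in-memory model (`TelegramMockState` and
its method names are illustrative); the aiohttp routing is left out,
and only the `{"ok": true, "result": ...}` envelope is taken from the
Bot API shape:

```python
from collections import deque
from typing import Any, Deque, Dict, List


class TelegramMockState:
    def __init__(self) -> None:
        self._pending: Deque[Dict[str, Any]] = deque()  # updates for getUpdates
        self._sent: List[Dict[str, Any]] = []           # recorded outbound calls
        self._next_update_id = 1

    def inject_message(self, chat_id: int, text: str) -> Dict[str, Any]:
        """Backs /__mock/inject_message: queue a simulated user message."""
        update = {
            "update_id": self._next_update_id,
            "message": {
                "message_id": self._next_update_id,
                "chat": {"id": chat_id},
                "text": text,
            },
        }
        self._next_update_id += 1
        self._pending.append(update)
        return update

    def get_updates(self) -> Dict[str, Any]:
        """Backs getUpdates: hand out queued updates once."""
        updates = list(self._pending)
        self._pending.clear()
        return {"ok": True, "result": updates}

    def send_message(self, chat_id: int, text: str) -> Dict[str, Any]:
        """Backs sendMessage: record the call and return a bot-style reply."""
        self._sent.append({"method": "sendMessage", "chat_id": chat_id, "text": text})
        return {"ok": True, "result": {"message_id": len(self._sent),
                                       "chat": {"id": chat_id}, "text": text}}

    def drain_sent_messages(self) -> List[Dict[str, Any]]:
        """Backs /__mock/sent_messages: return and clear the record."""
        sent, self._sent = self._sent, []
        return sent

    def reset(self) -> None:
        """Backs /__mock/reset: clear all state between probes."""
        self.__init__()
```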

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(workflow): land workflow-canary lane with periodic-reminder scenario

Phase 1A of the workflow-canary system from issue #1044. Adds a new
canary lane that exercises the routine engine + cron-fire path, the
foundation that the remaining four scripts (Telegram → Sheets,
Calendar prep, HN monitor, CRM tracker) will layer on.

Components:

- scripts/workflow_canary/routines.py — direct libSQL helpers for
  inserting a lightweight cron routine with a backdated next_fire_at
  and polling routine_runs for terminal status (ok / attention /
  failed). Backdating beats wall-clock cron in tests by 30+ s per
  probe and is the same shape auth-live-seeded uses for
  expire_secret_in_db.
- scripts/workflow_canary/run_workflow_canary.py — entrypoint that
  starts the Telegram mock, calls common.start_gateway_stack with
  workflow-tuned env (ROUTINES_ENABLED=true, ROUTINES_CRON_INTERVAL=2,
  IRONCLAW_TEST_HTTP_REMAP=api.telegram.org=<mock>), and runs
  scenario modules. CLI mirrors run_live_canary.py.
- scripts/workflow_canary/scenarios/periodic_reminder.py — Script 4
  Phase 1A: insert lightweight routine → wait for engine to fire →
  assert run row reaches a terminal status. Verified locally: 1
  probe, 1 fire, status=attention.
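
The backdate-and-poll shape can be sketched with stdlib sqlite3
standing in for libSQL. The two-table schema, column names, and
function names below are simplified assumptions for illustration, not
IronClaw's real schema:

```python
import sqlite3
import time
import uuid
from datetime import datetime, timedelta, timezone

TERMINAL_STATUSES = {"ok", "attention", "failed"}


def insert_backdated_routine(conn: sqlite3.Connection, name: str,
                             backdate_seconds: int = 60) -> str:
    """Insert a routine whose next_fire_at is already in the past."""
    routine_id = str(uuid.uuid4())
    next_fire_at = (datetime.now(timezone.utc)
                    - timedelta(seconds=backdate_seconds)).isoformat()
    conn.execute(
        "INSERT INTO routines (id, name, next_fire_at) VALUES (?, ?, ?)",
        (routine_id, name, next_fire_at),
    )
    conn.commit()
    return routine_id


def wait_for_terminal_run(conn: sqlite3.Connection, routine_id: str,
                          timeout: float = 30.0, interval: float = 0.5) -> str:
    """Poll routine_runs until the newest run reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        row = conn.execute(
            "SELECT status FROM routine_runs WHERE routine_id = ? "
            "ORDER BY rowid DESC LIMIT 1",
            (routine_id,),
        ).fetchone()
        if row is not None and row[0] in TERMINAL_STATUSES:
            return row[0]
        if time.monotonic() >= deadline:
            raise TimeoutError("no terminal run for routine " + routine_id)
        time.sleep(interval)
```

Because the row is due on the first cron tick, the probe waits on the
engine's tick interval rather than on a wall-clock schedule.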

Plumbing:

- .github/workflows/live-canary.yml — new workflow-canary job + lane
  added to the workflow_dispatch choice list and the canary-report
  aggregator's needs:.
- scripts/live-canary/run.sh — workflow-canary case dispatches to
  run_workflow_canary.py.

Phase 1B follow-ups in subsequent commits:
- Telegram channel install + bot-token seeding (needs admin auth or
  direct encrypted-secrets DB write)
- Verify Telegram sendMessage was emitted to the mock during the
  routine fire (covered by mock telegram's /__mock/sent_messages)
- Scripts 1, 3, 5 (Sheets / HN / Gmail-CRM)
- Script 2 (Calendar prep with web search)

Local verification:
  $ tests/e2e/.venv/bin/python scripts/workflow_canary/run_workflow_canary.py \
        --skip-build --skip-python-bootstrap
  [workflow-canary] mock telegram listening at http://127.0.0.1:51139
  [periodic_reminder] inserted routine ..., next_fire_at backdated 60s
  [periodic_reminder] routine fired: status=attention
  [workflow-canary] all 1 probe(s) passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(workflow): land all 5 issue #1044 scenarios + scenario README

Layer Scripts 1, 2, 3, 5 onto the foundation shipped in 16278ea9, so
the workflow-canary lane covers all five user-workflow scripts from
issue #1044. Each scenario delegates to a shared
`run_routine_probe()` helper that captures the Phase 1A shape: insert
a Lightweight cron routine with a script-specific prompt → backdate
next_fire_at → poll routine_runs for terminal status.

Scenarios added:

- bug_logger.py     (Script 1 — Telegram bugs → Google Sheet)
- calendar_prep.py  (Script 2 — Calendar prep → Telegram, Reporter: Nick)
- hn_monitor.py     (Script 3 — Hacker News → Telegram, Reporter: Emil)
- crm_tracker.py    (Script 5 — Gmail → Sheets CRM, Reporter: Cameron)

Plus periodic_reminder.py (Script 4, Reporter: Henry) refactored to
also use run_routine_probe.

scenarios/_common.py centralizes the routine plumbing — each scenario
file is now ~30 lines of routine-name + prompt + Phase 1B follow-up
notes. The Phase 1B follow-up plan (Telegram channel install, mock
Sheets writes, mock Calendar reads, mock HN scrape, LLM email
classification, dedup verification) is documented inline in each
scenario's docstring AND in the new scripts/workflow_canary/README.md.

Local verification: all 5 probes green in ~2 s each.

  $ tests/e2e/.venv/bin/python scripts/workflow_canary/run_workflow_canary.py \
        --skip-build --skip-python-bootstrap
  [workflow-canary] === Script 1 — Telegram → Google Sheet Bug Logger ===
  [workflow-canary] === Script 2 — Calendar Prep Assistant ===
  [workflow-canary] === Script 3 — Hacker News Keyword Monitor ===
  [workflow-canary] === Script 4 — Periodic Reminder via Telegram ===
  [workflow-canary] === Script 5 — Email → CRM Inbound Tracker ===
  [workflow-canary] all 5 probe(s) passed.

What this catches:
- Routine engine cron-tick path (spawn_cron_ticker → check_cron_triggers)
- RoutineAction::Lightweight execution
- DB serialization of action_config / trigger_config
- Mock-LLM round-trip latency under cron scheduling
- routines.next_fire_at → routine_runs status state machine

What it doesn't catch yet (per-scenario Phase 1B work, documented in
README + scenario docstrings):
- Telegram channel install + sendMessage assertion
- Mock Sheets / Calendar / Gmail / HN write+read semantics
- LLM-driven structured classification (CRM)
- Cross-fire dedup verification

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(workflow): scaffold Phase 1B telegram-side-effect verification

Lays the groundwork for verifying mock-Telegram side effects from
each scenario's routine fire — but gates the verification off until
a separate engine bug is fixed.

What's added:

- tests/e2e/mock_llm.py: new TOOL_CALL_PATTERNS entry that matches
  ``[CANARY-WORKFLOW-<key>]`` in any prompt and emits a deterministic
  http tool call to api.telegram.org/.../sendMessage with a
  per-scenario ack text.
- scripts/workflow_canary/scenarios/_common.py: each scenario now
  composes its prompt as
  ``<prompt_intro>\n\n[CANARY-WORKFLOW-<key>]`` so the matcher fires.
  When ``verify_telegram=True``, the helper polls
  /__mock/sent_messages for up to 5 s and asserts the expected ack
  was captured. Default is ``verify_telegram=False`` (Phase 1A
  parity) — see below.
- scripts/workflow_canary/telegram_mock.py: aiohttp request-logger
  middleware so the canary's stdout shows every inbound request,
  giving operators a one-line answer to "did the gateway's HTTP
  remap actually reach the mock?".
- scripts/workflow_canary/scenarios/{bug_logger,calendar_prep,
  hn_monitor,periodic_reminder,crm_tracker}.py: scenarios pass
  ``mock_telegram_url=mock_telegram_url`` and ``prompt_intro=...``
  ready for verify_telegram to flip on.
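
As a rough illustration of the sentinel mechanism described above, the sketch below composes a prompt and maps it to a deterministic tool call. The regex character class, the dict layout of the tool call, the placeholder bot token and chat id, and the ack text format are assumptions, not the actual `TOOL_CALL_PATTERNS` entry.

```python
import re
from typing import Optional

# Sentinel shape taken from the commit text; the key character class is assumed.
SENTINEL_RE = re.compile(r"\[CANARY-WORKFLOW-([A-Za-z0-9_-]+)\]")


def compose_prompt(prompt_intro: str, key: str) -> str:
    """Append the sentinel so the mock LLM's matcher fires on this prompt."""
    return f"{prompt_intro}\n\n[CANARY-WORKFLOW-{key}]"


def match_canary_tool_call(prompt: str) -> Optional[dict]:
    """Return a deterministic http tool call for a sentinel-bearing prompt."""
    m = SENTINEL_RE.search(prompt)
    if m is None:
        return None
    key = m.group(1)
    return {
        "tool": "http",
        "arguments": {
            "method": "POST",
            # Placeholder token and chat id; the real fixture values are not shown here.
            "url": "https://api.telegram.org/bot<token>/sendMessage",
            # Assumed ack format, one distinct text per scenario key.
            "body": {"chat_id": "<chat-id>", "text": f"[canary-workflow:{key}] ack"},
        },
    }
```

Keeping the sentinel at the end of the prompt means each scenario's intro can vary freely without affecting which tool call the mock emits.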

What's gated off and why:

The mock-Telegram verification path requires
``IRONCLAW_TEST_HTTP_REMAP=api.telegram.org=<mock>`` to route
the http tool's sendMessage call into the mock. The remap is
correctly registered at gateway startup
(src/app.rs::http_interceptor + src/http_intercept.rs), but the
ToolContext built inside the routine engine's Lightweight action
loop does NOT inherit the global ``http_interceptor`` slot. Result:
the http tool reaches into the real network for api.telegram.org
(returning a 401 since the bot token is fake) and the mock never
sees the request — confirmed via the new request-logger middleware
showing zero non-internal hits.

That's a real engine bug in routine-driven tool dispatch — the
http_interceptor needs to propagate through the routine action's
ToolContext just like it does for chat-driven tool dispatch. Out of
scope for this canary PR; tracked as a follow-up. Once fixed, flip
the default in ``run_routine_probe`` and every scenario's
verify_telegram check activates with no further changes.

Local verification: all 5 probes still green at the Phase 1A level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(workflow): re-exec under venv after bootstrap (fix CI 'No module named httpx')

CI run 25028445222 failed on the workflow-canary lane with:

  [workflow-canary] mock telegram listening at http://...
  [workflow-canary] error: No module named 'httpx'

Root cause: run_workflow_canary.py was missing the bootstrap-then-
reexec pattern that scripts/auth_live_canary/run_live_canary.py
uses (line 1229+). bootstrap_python() creates the venv and installs
tests/e2e/'s pyproject deps (which include httpx + aiohttp), but
the parent process keeps executing under whatever interpreter
invoked it — typically the system Python on CI runners, which
doesn't have httpx. The scenario module's `import httpx` at top
level then fails immediately.

Fix: copy the auth-live-canary reexec pattern. main() now:

1. If not --skip-python-bootstrap AND WORKFLOW_CANARY_REEXEC is
   unset: bootstrap the venv, install Playwright, run the cargo build,
   then subprocess-spawn ourselves under the venv python with
   --skip-python-bootstrap and WORKFLOW_CANARY_REEXEC=1 so this
   branch isn't re-entered.
2. The reexecuted process sees skip_python_bootstrap=True and runs
   the actual canary against the venv interpreter that has all
   deps available.
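
The two-branch decision above can be sketched as a small pure function. This is a simplified illustration, not the code in run_workflow_canary.py: `plan_reexec` and its signature are hypothetical, and the bootstrap and build steps are omitted.

```python
from typing import Mapping, Optional, Sequence, Tuple

REEXEC_ENV = "WORKFLOW_CANARY_REEXEC"
SKIP_FLAG = "--skip-python-bootstrap"


def plan_reexec(
    argv: Sequence[str],
    environ: Mapping[str, str],
    venv_python: str,
) -> Optional[Tuple[list, dict]]:
    """Return (command, env) for the child process, or None to run in place."""
    if SKIP_FLAG in argv or environ.get(REEXEC_ENV):
        # Already bootstrapped, or we are the re-executed child.
        return None
    command = [venv_python, *argv, SKIP_FLAG]
    child_env = {**environ, REEXEC_ENV: "1"}
    return command, child_env
```

A caller would pass `sys.argv` and `os.environ`, and when a plan is returned, hand it to something like `subprocess.run(command, env=child_env)` and exit with the child's return code.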

Local sanity check: still passes (--skip-build --skip-python-bootstrap
short-circuits the bootstrap, and both branches behave identically when
the venv already exists).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(routine-engine): propagate http_interceptor into Lightweight tool dispatch

The chat path's tool dispatch correctly receives the global
HTTP interceptor (e.g., the `IRONCLAW_TEST_HTTP_REMAP` debug-only
host remapper installed in `src/app.rs::http_interceptor`), but the
routine engine's Lightweight action path constructed its
`JobContext` from scratch with `..Default::default()`, leaving
`http_interceptor: None`. Tools called from a routine therefore
reached the real network even when the rest of the system was
configured to route through mocks.

Plumb the interceptor through:

- `RoutineEngine` gains an `http_interceptor` field
- `RoutineEngine::new` takes it as the 11th argument
- `EngineContext` carries it across the spawn boundary
- `JobContext` construction at the Lightweight action site copies
  it from the engine context

Threading complete: AgentDeps → RoutineEngine → EngineContext →
JobContext → http tool. Same shape the chat path already uses.

Test rigs updated: `tests/support/test_rig.rs` and
`tests/e2e_routine_heartbeat.rs` (10 call sites total) pass `None`
for the new arg, matching their existing minimal stack model.
Build clean against `--no-default-features --features libsql`.

Why this matters: with the interceptor lost, every workflow-canary
probe's http tool dispatch reached real api.telegram.org and 401'd
on the fake token — leaving the mock Telegram bot empty and the
canary's send-side assertions unverifiable. With the fix, the
interceptor honors IRONCLAW_TEST_HTTP_REMAP and the workflow
canary's Phase 1B verification activates immediately.

Activates in this commit:

- scripts/workflow_canary/scenarios/_common.py default flips to
  `verify_telegram=True`
- All 5 scenarios (bug_logger, calendar_prep, hn_monitor,
  periodic_reminder, crm_tracker) now assert that the mock
  Telegram bot received the per-scenario ack message
  `[canary-workflow:<key>] ack`

Local verification:

  $ tests/e2e/.venv/bin/python scripts/workflow_canary/run_workflow_canary.py \
        --skip-build --skip-python-bootstrap
  [workflow-canary] === Script 1 — Telegram → Google Sheet Bug Logger ===
  ... (all 5 scenarios) ...
  [workflow-canary] all 5 probe(s) passed.

  $ grep "POST /bot" artifacts/workflow-canary/telegram_mock.log | wc -l
  5  # one per scenario, distinct ack text per probe

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(workflow): add manual_trigger + lifecycle + dedup_cooldown probes

Three new scenarios covering issue #1044 assertions that the existing
5 cron-fire probes don't reach. Each scenario tests a distinct
back-end mechanism that real users hit:

- **manual_trigger** (Scripts 3 PHASE 2.1 + 3 PHASE 4.2 + 4 PHASE 4.2)
  Inserts a routine WITHOUT backdating next_fire_at, so the only path
  to a fire is the manual-trigger API. POSTs
  /api/routines/<id>/trigger, asserts response carries a run_id, polls
  routine_runs for terminal status, then verifies mock Telegram
  captured the per-scenario ack. Catches regressions in
  RoutineEngine::fire_manual end-to-end.

- **lifecycle** (Scripts 1 PHASE 5 + 4 PHASE 5) — three sub-probes:
  1. disabled-blocks-fires: insert with enabled=False + backdate;
     assert no routine_runs row appears within 8 s window.
  2. enable-resumes-fires: toggle enabled=true via API, backdate,
     assert fire reaches terminal status.
  3. delete-removes-routine: confirm /api/routines lists it, DELETE,
     confirm it's gone.
  Catches regressions in toggle handler, delete handler, and the
  engine's enabled-flag respect during cron tick selection.

- **dedup_cooldown** (Scripts 1 PHASE 4.4 + 3 PHASE 3.2 + 5 PHASE 5.5)
  Insert with cooldown_secs=30; first fire lands within ~5 s; immediate
  re-backdate; assert ONLY ONE run row exists after 8 s. Catches
  regressions in cooldown enforcement during check_cron_triggers.
  This is the closest engine-level correlate to the user-script
  "no duplicate rows / alerts / messages" assertions, which are
  application-level dedup that lives outside the canary's
  deterministic-mock surface.
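
Both the disabled-blocks-fires and dedup_cooldown probes rely on a "nothing new appears within N seconds" assertion. A minimal sketch of such a no-fire window, assuming a hypothetical `count_runs` callback that returns the current number of routine_runs rows for the routine:

```python
import time
from typing import Callable


def assert_no_new_runs(
    count_runs: Callable[[], int],
    window_s: float = 8.0,
    interval_s: float = 0.5,
) -> None:
    """Fail if the run count grows at any point during the observation window."""
    baseline = count_runs()
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        current = count_runs()
        if current > baseline:
            raise AssertionError(f"unexpected fire: {current - baseline} new run(s)")
        time.sleep(interval_s)
    # Final check so a fire landing during the last sleep is not missed.
    if count_runs() > baseline:
        raise AssertionError("unexpected fire at end of window")
```

The cost of this style of check is that a passing probe always takes the full window, which matches the ~8 s latencies reported for these probes.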

Plumbing:

- routines.py: trigger_routine_via_api / toggle_routine_via_api /
  delete_routine_via_api / list_routines_via_api helpers (all auth-
  bearer, JSON in/out, raise_for_status).
- routines.py: insert_lightweight_cron_routine grew `cooldown_secs`
  + `enabled` parameters; defaults preserve existing behavior.
- run_workflow_canary.py: registered the three new scenario keys.

Local verification — all 10 probes (5 original + 5 new sub-probes
across 3 new scenarios) green:

  ✅ bug_logger / calendar_prep / hn_monitor / periodic_reminder /
     crm_tracker          (existing — Telegram ack capture)
  ✅ manual_trigger        (548ms)
  ✅ lifecycle_disable     (8004ms — full no-fire window)
  ✅ lifecycle_toggle      (1543ms)
  ✅ lifecycle_delete      (56ms)
  ✅ dedup_cooldown        (10017ms — first fire + 8s no-fire window)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* canary(workflow): add NL-driven routine_create + routine_update probes

Two scenarios that close issue #1044's chat-driven assertions
(Script 1 PHASE 3.1, Script 2 PHASE 3.1, Script 3 PHASE 2.1,
Script 4 PHASE 2.1 + 5.1, Script 5 PHASE 4.1):

- **nl_routine_create**: opens a thread via /api/chat/thread/new,
  posts an NL message tagged [CANARY-WORKFLOW-NL-CREATE], waits for
  the agent to dispatch routine_create, then verifies the routines
  row landed in libSQL AND is visible via GET /api/routines.

- **nl_schedule_update**: pre-seeds a target routine
  (canary-nl-update-target), posts an NL message tagged
  [CANARY-WORKFLOW-NL-UPDATE], waits for the agent to dispatch
  routine_update with a new schedule, then verifies trigger_config
  changed in libSQL. Asserts on schedule-changed (not exact match)
  because the engine normalizes 5-field cron → 7-field internal
  form ("0 */5 * * *" → "0 0 */5 * * * *").
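
To illustrate why the probe compares "changed" rather than an exact string, here is a sketch of the normalization implied by the example above. The mapping (prepend a seconds field of 0, append a year field of `*`) is inferred from that single example and may not match the engine's full behavior.

```python
def normalize_cron(expr: str) -> str:
    """Expand a 5-field cron expression to the 7-field form shown in the example."""
    fields = expr.split()
    if len(fields) == 5:
        # Assumed mapping: prepend seconds = 0, append year = *.
        return " ".join(["0", *fields, "*"])
    if len(fields) == 7:
        return " ".join(fields)
    raise ValueError(f"unsupported cron expression: {expr!r}")


def schedule_changed(before: str, after: str) -> bool:
    """Compare schedules after normalization, so only real changes count."""
    return normalize_cron(before) != normalize_cron(after)
```

With this helper, a stored 7-field value and the 5-field value sent by the tool compare as equal when they describe the same schedule.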

Plumbing:

- Two new TOOL_CALL_PATTERNS entries in tests/e2e/mock_llm.py
  matched in priority order (specific NL-CREATE / NL-UPDATE
  sentinels checked BEFORE the generic [CANARY-WORKFLOW-<key>]
  http-tool fallback, since the canary's own routines emit the
  generic pattern from inside their action prompts).

- Helper additions in scripts/workflow_canary/routines.py:
  _open_thread / _send_chat / _read_routine / _wait_for_*.
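
The priority-order requirement can be shown with a small sketch. The list layout and `pick_tool` are hypothetical; only the sentinel strings and tool names come from the text above.

```python
import re
from typing import Optional

# Order matters: the specific sentinels must come before the generic fallback,
# because the generic regex also matches "NL-CREATE" and "NL-UPDATE".
PATTERNS = [
    (re.compile(r"\[CANARY-WORKFLOW-NL-CREATE\]"), "routine_create"),
    (re.compile(r"\[CANARY-WORKFLOW-NL-UPDATE\]"), "routine_update"),
    (re.compile(r"\[CANARY-WORKFLOW-[A-Za-z0-9_-]+\]"), "http"),
]


def pick_tool(prompt: str) -> Optional[str]:
    """Return the tool name for the first pattern that matches, if any."""
    for pattern, tool in PATTERNS:
        if pattern.search(prompt):
            return tool
    return None
```

If the generic entry were listed first, an NL-CREATE prompt would be answered with an http tool call and the routine would never be created.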

Local verification — all 12 probes green:

  ✅ bug_logger / calendar_prep / hn_monitor / periodic_reminder /
     crm_tracker        (5 cron-fire + telegram-ack)
  ✅ manual_trigger      (POST /api/routines/<id>/trigger)
  ✅ lifecycle_disable / lifecycle_toggle / lifecycle_delete
  ✅ dedup_cooldown      (cooldown_secs suppresses second fire)
  ✅ nl_routine_create   (chat → routine_create tool)
  ✅ nl_schedule_update  (chat → routine_update tool)

What's still deferred to follow-up PRs (per-provider mocks, each
~1-3 days of work — see scripts/workflow_canary/README.md):

- Mock Google Sheets (Scripts 1 + 5 dedicated assertions)
- Mock Google Calendar (Script 2)
- Mock Hacker News (Script 3)
- LLM-driven email classification with seeded inbox (Script 5)
- Telegram channel install + bot-token validation flow (Scripts 1-5
  PHASE 1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflow-canary): Phase 1 — mock Sheets + bug_logger Sheet-write probe

Adds scripts/workflow_canary/sheets_mock.py: single-port aiohttp Google
Sheets v4 mock supporting POST /v4/spreadsheets, values:append, values
get, plus /__mock/ test hooks for seeding, draining, and resetting.
The append handler enforces values=list-of-lists (returns the canonical
"expected a sequence" 400) so the canary catches the issue #1044 FAIL
CRITERIA shape.
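
The list-of-lists check in the append handler might be sketched as a pure validation function like the one below. The function name and the JSON error envelope are assumptions; only the "expected a sequence" message and the 400 status come from the description above.

```python
from typing import Any, Tuple


def validate_append_body(body: Any) -> Tuple[int, dict]:
    """Return (http_status, payload) for a values:append request body."""
    values = body.get("values") if isinstance(body, dict) else None
    is_rows = isinstance(values, list) and all(isinstance(row, list) for row in values)
    if not is_rows:
        # Assumed error envelope; only the message text is taken from the source.
        return 400, {"error": {"code": 400, "message": "expected a sequence"}}
    return 200, {"updates": {"updatedRows": len(values)}}
```

A flat list such as `["a", "b"]` is rejected here, which is the malformed shape the canary is meant to surface.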

Wires the mock into run_workflow_canary.py:
  - generic _spawn_mock helper for telegram_mock + sheets_mock
  - IRONCLAW_TEST_HTTP_REMAP carries comma-separated entries for
    api.telegram.org and sheets.googleapis.com
  - mock_sheets_url passed through to every scenario's run() kwargs
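
For illustration, the comma-separated remap value could be built and interpreted along these lines. The helper names are hypothetical, and the rewrite semantics (replace scheme and host, keep path and query) are an assumption about what the gateway-side interceptor does.

```python
from typing import Dict
from urllib.parse import urlsplit, urlunsplit


def build_remap(mocks: Dict[str, str]) -> str:
    """Join host=mock-url pairs into one comma-separated value."""
    return ",".join(f"{host}={url}" for host, url in mocks.items())


def parse_remap(value: str) -> Dict[str, str]:
    """Split the comma-separated value back into a host -> mock-url mapping."""
    remap: Dict[str, str] = {}
    for entry in value.split(","):
        if "=" not in entry:
            continue  # skip empty or malformed entries
        host, url = entry.split("=", 1)
        remap[host.strip()] = url.strip()
    return remap


def apply_remap(url: str, remap: Dict[str, str]) -> str:
    """Rewrite scheme and host when the URL's host has a remap entry (assumed semantics)."""
    parts = urlsplit(url)
    target = remap.get(parts.hostname or "")
    if target is None:
        return url
    mock = urlsplit(target)
    return urlunsplit((mock.scheme, mock.netloc, parts.path, parts.query, parts.fragment))
```

Splitting each entry on the first `=` only is what lets the mock URL itself contain further `=` characters.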

Rewrites scenarios/bug_logger.py to drop the run_routine_probe Telegram
fallback in favor of a Sheet-write end-to-end assertion: pre-seed the
spreadsheet, fire the routine with [CANARY-WORKFLOW-SHEET-APPEND], wait
for the appended row, validate shape (timestamp / message / source).

Mock LLM: new TOOL_CALL_PATTERNS entry that matches the SHEET-APPEND
sentinel and emits an http POST values:append with a hardcoded canary
row.

All 12 probes still pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflow-canary): Phase 2-4 — Calendar / HN / Gmail / web_search mocks + e2e probes

Phase 2 (Calendar): scripts/workflow_canary/calendar_mock.py — Google
Calendar v3 events surface (list / insert / get / delete) with seed
hooks. calendar_prep_e2e seeds one canary event, fires the routine,
asserts events.list was hit and Telegram received the prep briefing
referencing the seeded event title.

Phase 3 (Hacker News): scripts/workflow_canary/hn_mock.py — /newest
HTML fixture with seeded "Show HN" posts (canary-distinct
``<!-- canary-hn-feed -->`` marker). hn_monitor_e2e re-seeds posts,
asserts /newest GET landed and Telegram summary references both
seeded posts.

Phase 4 (CRM tracker): scripts/workflow_canary/gmail_mock.py +
web_search_mock.py — Gmail v1 messages.list/.get + Brave Search v3.
crm_tracker_e2e seeds 1 lead + 1 newsletter + 1 receipt; asserts
exactly ONE row appended to the CRM sheet (only the lead) with all
6 expected columns + Telegram ack referencing 1 lead.

Mock LLM TOOL_CALL_PATTERNS gain three parallel-call entries
([CANARY-WORKFLOW-CAL-LIST] → http GET events.list + http POST
sendMessage; [CANARY-WORKFLOW-HN-FETCH] → GET /newest + sendMessage;
[CANARY-WORKFLOW-CRM-CLASSIFY] → Gmail GET + Sheets append + Telegram
ack). Parallel emit is required because the engine's lightweight
loop dedups same-tool re-dispatch (see match_tool_call:1178).

run_workflow_canary.py now spawns six mock subprocesses; remap covers
api.telegram.org, sheets.googleapis.com, www.googleapis.com,
news.ycombinator.com, gmail.googleapis.com, api.search.brave.com.

All 12 existing probes pass + 3 phase 2-4 probes upgrade from
side-effect-only to full content-correctness assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflow-canary): Phase 5 — Telegram channel install + round-trip

scripts/workflow_canary/telegram_setup.py: install + capability patch
+ setup helpers (mirrors tests/e2e/scenarios/test_telegram_e2e.py
patch_capabilities + activate flow). Adds pair_telegram_user that
sends a "hello" webhook, extracts the pairing code from
mock_telegram, and approves it via /api/pairing/telegram/approve.

scripts/live_canary/common.py: GatewayStack now exposes http_url
(HTTP-channel webhook port) + channels_dir (WASM_CHANNELS_DIR)
so workflow-canary scenarios can drive the Telegram channel install
+ patch + webhook flow.

run_workflow_canary.py: passes IRONCLAW_TEST_TELEGRAM_API_BASE_URL
so the hardcoded validate_telegram_bot_token getMe call (in
src/extensions/manager.rs) routes to mock_telegram. The bot-token
validate path bypasses the standard IRONCLAW_TEST_HTTP_REMAP flow,
hence the additional env override.

New scenarios:
- telegram_channel_install: install + patch caps + setup + assert
  channel reaches Active state. Catches "HTTP 404 on valid token"
  regression (Script 4 PHASE 1.1).
- telegram_round_trip: post inbound webhook → assert mock_telegram
  receives an outbound sendMessage with the actual chat_id (NOT
  'default'). Catches the chat_id 'default' regression.
- routine_visibility_from_telegram: pair user, ask for routines,
  assert agent replies on the paired chat_id. Covers Scripts 1-4
  PHASE "routine visibility from Telegram" assertions.
- manual_trigger_from_telegram: pair user, hit /api/routines/<id>/
  trigger, assert routine fires through lightweight loop and ack
  reaches the paired chat_id. Covers Script 4 PHASE 4.2.

All 16 probes pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflow-canary): Phase 6 — first_immediate_run + log_assertions

scripts/workflow_canary/scenarios/first_immediate_run.py: insert a
routine with a "0 * * * *" schedule + fire_immediately=True; assert
the first run reaches terminal status within 10s. Catches "first
check is delayed to next hour" regression (Script 3 PHASE 2.1).

scripts/workflow_canary/scenarios/log_assertions.py: scan
gateway.log at the end of the lane for known fail-criterion regex
patterns: chat_id 'default', parsed naive timestamp without timezone,
retry after None, expected a sequence. Catches log regressions across
all 5 issue #1044 scripts simultaneously.
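
A log scan of this kind might look like the sketch below. The four phrases come from the description above, but the exact regexes used by log_assertions.py are not shown in this message, so the patterns and `scan_log` are assumptions.

```python
import re
from typing import Iterable, List, Tuple

# Phrases come from the commit text; the exact regexes are assumed.
FAIL_PATTERNS = [
    re.compile(r"chat_id\W+default"),
    re.compile(r"parsed naive timestamp without timezone"),
    re.compile(r"retry after None"),
    re.compile(r"expected a sequence"),
]


def scan_log(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) for every line that matches a fail pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in FAIL_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Passing an open file handle for gateway.log as `lines` would stream the file without loading it fully into memory.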

Auth-recovery (token revocation → auth_required SSE) is deferred to
the auth-live-canary lane; it requires a working OAuth setup to
revoke, which is outside this lane's mock-only scope.

All 18 probes pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflow-canary): Phase 7 — cron timing + idempotent toggle + README

scripts/workflow_canary/scenarios/cron_timing_accuracy.py: insert a
routine, set next_fire_at to "now + 5s" explicitly, assert the engine
fires within ±10s of the set boundary. Catches "cron skipped a cycle"
+ "fires never trigger" regressions (Scripts 3 PHASE 3.1, 4 PHASE 3.4).

scripts/workflow_canary/scenarios/idempotent_disable_enable.py:
double-toggle disable then double-toggle enable, assert both halves
are no-ops; finally backdate, fire once, then disable + backdate again
and assert no NEW runs land in the next 6s. Catches "disable doesn't
take effect" + "enable triggers a phantom run" regressions
(Script 1 PHASE 5.1 / 5.2).

scripts/workflow_canary/README.md: rewritten to reflect 20-probe
coverage matrix across phases 1–7 with mock surface + scenarios
inventory.

Final canary state: 20 probes across 7 phases, all green locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflow-canary): close gaps — wire web_search + add auth_recovery

[CANARY-WORKFLOW-CAL-LIST] now emits a parallel triplet (calendar
events.list + web_search company lookup + telegram sendMessage).
calendar_prep asserts mock_web_search captured the lookup with the
expected company-name query parameter, completing the Script 2
"company background + recent news" assertion from issue #1044.

scripts/workflow_canary/scenarios/auth_recovery.py: drives a chat
that triggers an unauthenticated gmail tool call, asserts the agent
surfaces a graceful response — chat send returns 202 (not 5xx),
thread settles, history contains no Error 400 / Internal Server
Error / panicked / Traceback fragments. Catches the regression
shape from Script 2 PHASE 5 fail criteria without requiring a real
OAuth handshake (full token-revocation coverage stays in
auth-live-canary).

21 probes total, all green locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(canary): run every 6h + re-enable Slack report

Schedule: cron flips from "0 2 * * *" (once daily at 02:00 UTC) to
"0 */6 * * *" (4× daily at 00/06/12/18 UTC). All twelve job-level
`if:` guards updated in lockstep so each lane still gates on the
schedule string.

Slack report: drop the `if: false` hardcode on the canary-report
job's notify step and replace with a schedule + workflow_dispatch
gate. The notifier (scripts/live-canary/notify_slack.py) already
exits 0 on Haiku/Slack failures so a flaky webhook can't mask lane
status. PR-triggered runs (currently none, but possible via
workflow_run) skip the post to keep noise out of the channel.

Both ANTHROPIC_API_KEY and SLACK_WEBHOOK_URL repo secrets are
already populated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(canary-report): parse workflow-canary results.json shape

The notifier reads `auth-canary-junit.xml` for JUnit-emitting lanes
(auth-smoke, auth-full, auth-channels, auth-live-seeded,
auth-browser-consent). The workflow-canary lane writes its own
`results.json` instead — one entry per probe with `success: bool`,
`latency_ms`, `details`. The notifier had no parser for that shape, so
the workflow-canary slot in Slack rendered as a useless
`❔ 0/0 passed, 0 failed` line.

Add `parse_results_json` mirroring the JUnit parser's contract:
`passed = sum(success)`, `failed = sum(!success)`, each failed probe
becomes a `(provider/mode, error-or-summary)` entry on
`junit_failures` so the Slack reason field renders the same way as an
auth-canary failure. Latencies sum to `duration_s`. Both parsers run
on every lane dir; first one whose file exists wins (auth-canary lanes
emit XML only, workflow-canary lane emits JSON only — no overlap).
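
The parser contract described above can be sketched as follows. This is not the actual `parse_results_json`: it assumes the file is a top-level JSON list, and the `provider`, `mode`, and `error` key names are guesses. Only `success`, `latency_ms`, and `details` are named in the text.

```python
import json
from typing import Any, Dict, List


def parse_results(raw: str) -> Dict[str, Any]:
    """Summarize a results.json payload that holds one entry per probe."""
    entries: List[dict] = json.loads(raw)
    failed = [e for e in entries if not e.get("success")]
    return {
        "passed": len(entries) - len(failed),
        "failed": len(failed),
        # "provider", "mode", and "error" are assumed key names.
        "failures": [
            (
                f"{e.get('provider', '?')}/{e.get('mode', '?')}",
                e.get("error") or str(e.get("details", "")),
            )
            for e in failed
        ],
        # Latencies are in milliseconds; the report shows seconds.
        "duration_s": sum(e.get("latency_ms", 0) for e in entries) / 1000.0,
    }
```

Summing latencies gives total probe time rather than wall-clock lane time, so the figure excludes build and startup overhead.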

Validated by re-running the notifier locally against the downloaded
artifact from CI run 25033224036:
  before: "❔ workflow-canary (mock) — 0/0 passed"
  after:  "✅ workflow-canary (mock) — 21/21 passed,
           0 failed in 69s"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(canary-report): log notifier progress for diagnosability

Until now `notify_slack.py` was silent on the success path, which made
it impossible to verify from CI logs alone whether Haiku enrichment
actually ran. Add four stderr lines covering each phase:

  [notify_slack] discovered N lane dir(s): lane1/provider1, ...
  [notify_slack]   lane/provider: tests=N passed=N failed=N skipped=N status=...
  [notify_slack] haiku enriched X/N lane(s)
  [notify_slack] posted Slack message for N lane(s)

Lines stay terse and structured so they're greppable from `gh run
view --log`. Haiku-failure tracking inspects `r.notable` — `run_haiku`
stamps it with `haiku call failed:` / `haiku returned no JSON object`
/ `haiku JSON parse failed` on the three failure paths.

Confirmed from local dry-run against the artifact downloaded from
the previous CI run (which had the results.json parser): tests=21,
passed=21, failed=0, status=pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(canary/workflow-canary): forward SCENARIO into --scenario

Addresses @henrypark133's review on PR #2874: the workflow-canary lane
of `scripts/live-canary/run.sh` ignored `${SCENARIO}` and always ran
the full 21-probe suite. The matching workflow_dispatch job didn't
export `inputs.scenario` either, so manual dispatch with a scenario
filter went nowhere. Targeted local reruns / debugging hit the same
gap.

run.sh: translate `${SCENARIO}` (comma-list supported) into one or
more `--scenario <name>` flags on `run_workflow_canary.py`. Empty
SCENARIO falls through to the full suite. Guards the array splat for
bash 3.2 / macOS, where expanding `${arr[@]}` on an empty array under
`set -u` fails with an "unbound variable" error.
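
The translation itself lives in bash inside run.sh. The same logic, rendered in Python purely for illustration (the function name is hypothetical):

```python
from typing import List


def scenario_flags(scenario: str) -> List[str]:
    """Expand 'a,b' into ['--scenario', 'a', '--scenario', 'b']."""
    flags: List[str] = []
    for name in scenario.split(","):
        name = name.strip()
        if name:
            flags += ["--scenario", name]
    # An empty result means no filter, so the full suite runs.
    return flags
```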

live-canary.yml: add `SCENARIO: ${{ inputs.scenario }}` to the
Workflow Canary job's env so workflow_dispatch reaches run.sh.

Verified:
  tests/e2e/.venv/bin/python \
    scripts/workflow_canary/run_workflow_canary.py \
    --skip-build --skip-python-bootstrap \
    --scenario telegram_round_trip
  → "all 1 probe(s) passed"
  (full suite without the flag still runs all 21 probes)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(canary/workflow-canary): align nl_schedule_update on 'every 6 hours'

Addresses Copilot AI's review on PR #2874: the docstring claimed
"every 5 minutes" while EXPECTED_NEW_SCHEDULE / mock LLM emitted
"0 */5 * * *" (every 5 hours), and the chat prompt the canary sent
said "every 5 hours". Two conflicting cadences across three places in one probe.

Pick "every 6 hours" consistently:
- Docstring narrative: "every 6 hours"
- Constant: EXPECTED_NEW_SCHEDULE = "0 */6 * * *"
- Chat prompt: "fire every 6 hours"
- mock_llm.py routine_update args: schedule = "0 */6 * * *"

Verified locally: nl_schedule_update probe still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(canary/auth-browser-consent): drop stale GitHub secret exposure

Addresses @henrypark133's review on PR #2874: the auth-browser-consent
job kept exporting 8 GitHub-related secrets (GITHUB_OAUTH_CLIENT_ID,
GITHUB_OAUTH_CLIENT_SECRET, AUTH_BROWSER_GITHUB_OWNER / _REPO /
_ISSUE_NUMBER / _USERNAME / _PASSWORD / _STORAGE_STATE_B64) even
though the lane no longer drives a GitHub OAuth flow. BROWSER_CASES
in `scripts/live_canary/auth_registry.py` was reduced to {google,
notion} when github was reclassified as PAT-only — those secrets are
unused on every scheduled run and just broaden the secret-exposure
surface.

Strip all 8 from the lane:
- env: block — 5 lines (CLIENT_ID + 4 AUTH_BROWSER_GITHUB_* helpers)
- Materialize provider storage state — 1 secret + its materialize block
- Materialize sensitive secrets — 2 secrets + their write_secret lines

Replace with explanatory comments pointing at BROWSER_CASES /
auth_registry.py so a future contributor doesn't re-add them by reflex
when github gets an OAuth flow.

GitHub coverage continues to live in SEEDED_CASES (auth-live-seeded
lane) which seeds the PAT directly — that lane's secrets are
unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(canary): align user-facing browser-cases list with auth_registry

Addresses @henrypark133's review on PR #2874: removing `github` from
BROWSER_CASES made `--mode browser --case github` invalid, but the
contract was still advertised in four places that operators read
when copying invocations:

- run_live_canary.py --help (`For browser mode: google, github, notion`)
- scripts/auth_live_canary/README.md (`github` listed under "Runs
  through Responses API and browser")
- scripts/live-canary/README.md (`CASES=google,github` example)
- scripts/live-canary/ACCOUNTS.md (full GitHub OAuth client + fixture
  + storage-state-secret sections still active, plus a Playwright
  storage-state recipe pointing at github.com/login)

Update each in lockstep:

- --help now says `For browser mode: google, notion. (github browser
  coverage is intentionally absent — the github WASM tool is PAT-only,
  not OAuth; see SEEDED_CASES instead.)`
- auth_live_canary/README — github entry now reads "Responses API
  only (PAT-only — not browser-OAuth)"; notion entry corrected to
  "Responses API and browser" (it was inaccurately listed as
  Responses API only).
- live-canary/README — example flips to `CASES=google,notion` with a
  one-line note pointing at auth_registry.py.
- live-canary/ACCOUNTS — drops the GitHub OAuth client + fixture
  sections, swaps the Playwright storage-state recipe target from
  github.com/login to accounts.google.com, drops
  AUTH_BROWSER_GITHUB_STORAGE_STATE_B64 from the CI-secrets list.

The argparse validator in run_live_canary.py already gives a clean
error if anyone passes `--mode browser --case github`:
"--case values ['github'] are not valid for --mode browser. Allowed:
['google', 'notion']", so the docs change is the user-facing fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(canary/telegram): split is_active into installed vs. active

Addresses Copilot AI's review on PR #2874: `is_telegram_active` only
checked that an extension named "telegram" appeared in
`/api/extensions`, returning True for an installed-but-inactive
extension (mid-setup, awaiting auth, activation_error). Two callers
(`telegram_round_trip._ensure_active`,
`routine_visibility_from_telegram._ensure_active_and_paired`) used
this as a precheck to skip `setup_telegram_channel()`, so a stale
inactive entry would short-circuit setup and the probe would then
fail mysteriously when the channel didn't respond.

Split into two helpers:

- `is_telegram_installed(...)` — original semantics (entry exists),
  used internally as a building block; not exported as a precheck.
- `wait_for_telegram_active(...)` — polls until the entry has
  `active=true` (the actual runtime-readiness signal — channel
  opened, hooks registered, credentials bound, per
  `.claude/rules/lifecycle.md`'s discovery-vs-activation rule).

Shared `_find_telegram` helper handles the three historical envelope
shapes the gateway has used (`extensions` / `items` / `installed`).
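
A rough sketch of the installed-versus-active split, assuming each extension entry is a dict with a `name` key (an assumption) and an `active` flag (from the text above). The function names are illustrative and the polling wrapper is omitted.

```python
from typing import Any, Optional

ENVELOPE_KEYS = ("extensions", "items", "installed")


def find_telegram(payload: Any) -> Optional[dict]:
    """Locate the telegram entry under any of the three envelope keys."""
    if not isinstance(payload, dict):
        return None
    for key in ENVELOPE_KEYS:
        for entry in payload.get(key) or []:
            if isinstance(entry, dict) and entry.get("name") == "telegram":
                return entry
    return None


def is_installed(payload: Any) -> bool:
    """True when an entry exists, regardless of runtime state."""
    return find_telegram(payload) is not None


def is_active(payload: Any) -> bool:
    """True only when the entry exists and reports active=true."""
    entry = find_telegram(payload)
    return bool(entry and entry.get("active") is True)
```

An installed-but-inactive entry passes `is_installed` and fails `is_active`, which is the case the old precheck got wrong.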

Update all 4 callers to use `wait_for_telegram_active`:
- telegram_channel_install.py
- telegram_round_trip.py (precheck + post-setup wait)
- routine_visibility_from_telegram.py (precheck + post-setup wait)
- manual_trigger_from_telegram.py (precheck + post-setup wait)

Verified: all 4 telegram probes still green back-to-back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(canary/periodic_reminder): align docstring with current behavior

Addresses Copilot AI's review on PR #2874: the module docstring still
described the Telegram delivery assertion as a "Phase 1B follow-up"
even though the scenario now sets verify_telegram=True and the
inline comment on the call site already explained the Phase 1B work
had landed. Future readers would assume Telegram verification was
missing from this probe.

Replace the docstring with a 5-step description of what the probe
actually does end-to-end:
1. Backdated cron routine inserted via libSQL
2. Routine engine cron-tick picks it up
3. Lightweight action runs against mock LLM → http sendMessage
4. IRONCLAW_TEST_HTTP_REMAP routes to telegram_mock
5. Asserts both terminal routine_runs status AND captured sendMessage

Also adds an explicit note that channel-install coverage (capability
patch + setup + pairing) lives in the sibling telegram_* scenarios —
this one covers the routine-driven sendMessage path and intentionally
hits api.telegram.org via the raw http tool rather than through the
installed channel.

Verified: probe still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(canary-report): rich failure blocks + cross-lane categorization + GH issues

Three additions to scripts/live-canary/notify_slack.py to make the
6h Slack report actionable instead of just informational:

1) **Per-lane rich failure block** — Haiku now extracts four
   structured fields when status==fail: test_name, error, root_cause,
   fix. The Slack section renders them in the issue-friendly shape
   the reviewer asked for:

       ❌ auth-full (mock) — 11/13 passed, 1 failed in 213s
         Test: `test_wasm_tool_first_chat_auth_attempt_emits_auth_url`
         Error: SSE stream closed; auth_required event never arrived
         Root Cause: bridge gate not wired for installed-but-unauthed
                     extensions (#2868 fallout)
         Fix: route Extension::NeedsAuth through effect_adapter.rs

   For passing/skipped lanes the existing single-line `> reason` is
   preserved so the green-path Slack output is unchanged.

2) **Cross-lane "Summary by Category" block** — second Haiku pass
   over all failed-lane summaries that groups them by shared root
   cause (e.g. "WASM tool dispatch regression — Auth Full, Auth
   Smoke, Auth Live Seeded"). Only fires when there are 2+
   failures (single-failure runs are already obvious from the
   per-lane block). Rendered as a Slack mrkdwn bulleted list since
   Block Kit doesn't support real tables.

3) **Auto-opened GitHub issues** — opt-in via CANARY_CREATE_ISSUES=1
   env var (gated to scheduled runs only in live-canary.yml so
   workflow_dispatch debugging doesn't flood the tracker). For each
   failed lane:
   - Search for an OPEN issue with title `[canary] <lane>: <test>`.
   - If found: comment "another occurrence on <run_url>".
   - If not found: open a new issue with the rich body + labels
     `canary-failure` + `lane:<lane>`.

   Strategy chosen to avoid issue spam while still surfacing
   recurring failures. Uses GITHUB_TOKEN + the repo's existing
   `permissions: issues: write` block — no new secrets.
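
A minimal sketch of the search-then-comment-or-create decision from
item 3, assuming the list of open issues has already been fetched.
`OpenIssue`, `issue_title`, and `plan_issue_action` are hypothetical
names for illustration, not the helpers in notify_slack.py, and the
GitHub API calls themselves are left out:

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Union


@dataclass(frozen=True)
class OpenIssue:
    """Hypothetical view of an open GitHub issue (number and title only)."""
    number: int
    title: str


def issue_title(lane: str, test: str) -> str:
    # Title shape from the description above: "[canary] <lane>: <test>".
    return f"[canary] {lane}: {test}"


def plan_issue_action(
    lane: str,
    test: str,
    run_url: str,
    open_issues: Iterable[OpenIssue],
) -> Dict[str, Union[str, int, List[str]]]:
    """Decide whether to comment on an existing issue or open a new one."""
    title = issue_title(lane, test)
    for issue in open_issues:
        if issue.title == title:
            # Recurring failure: add a comment instead of a duplicate issue.
            return {
                "action": "comment",
                "number": issue.number,
                "body": f"another occurrence on {run_url}",
            }
    # First occurrence: open a new issue carrying both labels.
    return {
        "action": "create",
        "title": title,
        "labels": ["canary-failure", f"lane:{lane}"],
    }
```

Keeping the decision a pure function means the dedupe rule can be
unit-tested without network access; the caller performs the actual
search, comment, or create request.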

All three additions fail soft: a Haiku failure stamps .notable but
doesn't block the post; a categorization failure produces an
"_(unavailable)_" placeholder; issue-creation errors are logged to
stderr only. The notifier still exits 0 in every failure path so a
flaky webhook can't fail the canary run.
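
The fallback behaviour described above can be sketched as a small
wrapper. This is an illustration under assumptions, not the code in
notify_slack.py: `best_effort` and `categorize` are hypothetical
names, and the log prefix is made up.

```python
import sys
from typing import Callable, TypeVar

T = TypeVar("T")


def best_effort(step: str, fn: Callable[[], T], fallback: T) -> T:
    """Run fn(); on any exception log to stderr and return the fallback."""
    try:
        return fn()
    except Exception as exc:  # broad on purpose: no failure may escape
        print(f"notify_slack: {step} failed: {exc}", file=sys.stderr)
        return fallback


def categorize() -> str:
    # Stand-in for the second Haiku pass; always fails in this sketch.
    raise RuntimeError("model unavailable")


# A failed categorization yields the placeholder instead of an exception.
summary = best_effort("categorization", categorize, "_(unavailable)_")
```

Because every optional step returns a value rather than raising, the
script can reach its normal end and exit 0 even when a step fails.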

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(canary-report): reuse AUTH_LIVE_GITHUB_TOKEN for issue creation

Swap the issue-creation token source from the built-in
secrets.GITHUB_TOKEN to the existing AUTH_LIVE_GITHUB_TOKEN PAT —
no new secrets to mint, and that PAT already covers
nearai/ironclaw operations.

Set as CANARY_ISSUES_TOKEN (the highest-priority env var in
notify_slack.py's --github-token precedence chain) so it wins over
GH_TOKEN / GITHUB_TOKEN if any of those are also present.
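
A minimal sketch of such a precedence chain. Only the fact that
CANARY_ISSUES_TOKEN outranks GH_TOKEN and GITHUB_TOKEN comes from the
description above; the relative order of GH_TOKEN and GITHUB_TOKEN,
the explicit flag value winning over the environment, and the name
`resolve_github_token` are assumptions for illustration:

```python
import os
from typing import Mapping, Optional

# Highest priority first. The order of the last two is an assumption.
TOKEN_ENV_PRECEDENCE = ("CANARY_ISSUES_TOKEN", "GH_TOKEN", "GITHUB_TOKEN")


def resolve_github_token(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the first non-empty token, or None if none is set."""
    # Assumed: a value passed via --github-token beats any env var.
    if cli_value:
        return cli_value
    for name in TOKEN_ENV_PRECEDENCE:
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```

Treating empty or whitespace-only values as unset matters in CI,
where a secret that isn't configured may expand to an empty string.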

Verify the PAT can write issues (Issues: Read and write for
fine-grained PATs, `repo` scope for classic PATs). If it can't, the
notifier still degrades gracefully: the API call fails, the error
is logged to stderr, and the canary run isn't blocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(canary/workflow-canary): build Telegram WASM before the lane runs

Addresses reviewer feedback on PR #2874: the workflow-canary job only
checked out the repo + installed Rust + Python before invoking
run.sh, but four scenarios in the lane (telegram_channel_install,
telegram_round_trip, routine_visibility_from_telegram,
manual_trigger_from_telegram) call `/api/extensions/install` for
the bundled `telegram` WASM channel. That installer needs a
prebuilt `channels-src/telegram/telegram.wasm` artifact, which the
repo doesn't check in — every fresh CI runner would 404 on the
install path.

Add the same four-step preamble the other WASM-using lanes
(deterministic-replay, public-smoke, release-public-full) carry:

  - rust-toolchain with `targets: wasm32-wasip2`
  - Swatinem/rust-cache keyed `live-canary-workflow-canary`
  - `cargo install cargo-component --locked`
  - `./scripts/build-wasm-extensions.sh --channels`

Use `--channels` (not the default everything-build) because the
lane doesn't exercise any WASM tool — only the bundled WASM channels
get installed. That keeps the cold-cache build budget under ~6 min;
warm-cache runs are ~1-2 min.

The 30-min job budget still has plenty of headroom: previous runs
land around 8 min for the canary itself, so with the ~6 min
cold-cache build the worst case is ~14 min total on a fresh runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>