mirror of
https://github.com/supabase/supabase.git
synced 2026-05-20 20:41:03 +08:00
## Motivation

When Assistant runs a potentially destructive tool like `execute_sql`, it stops the LLM request and prompts for client-side approval and execution of the tool. After approval, a second request kicks off under a separate trace. This has made scoring and [Topics](https://www.braintrust.dev/blog/topics) classification challenging, as the generated `output` is split across stateless requests.

The [span-level scoring](https://www.braintrust.dev/docs/evaluate/custom-code#score-spans) approach we've used thus far (after the LLM call, we massage the result into an `output` payload that's stuck onto the root span) has been cumbersome and has led to invalid scores / topics where only part of the assistant response is considered. It's also inefficient, as we're duplicating potentially large info (like the `search_docs` output) that already exists within the trace.

An alternative to scoring spans is to [score traces](https://www.braintrust.dev/docs/evaluate/custom-code#score-traces). Braintrust [best practices](https://www.braintrust.dev/docs/evaluate/score-online#best-practices) advise:

> Use span scope for evaluating individual operations or outputs. Use trace scope for evaluating multi-turn conversations, overall workflow completion, or when your scorer needs access to the full execution context.

We've also received [direct guidance](https://supabase.slack.com/archives/C05QYJBLX89/p1777925770927149?thread_ts=1777905716.911979&cid=C05QYJBLX89) from their team to use this approach.

## Changes

Migrates eval scorers from the custom `AssistantEvalOutput` shape to trace-level scoring via `trace.getThread()` / `trace.getSpans()`, with thread parsing that scores the full latest Assistant turn and passes prior conversation separately where relevant.

Moves `execute_sql` and `deploy_edge_function` from client-side execution after approval to AI SDK `needsApproval` + server-side `execute()`.
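
To illustrate the thread parsing mentioned above (score the full latest Assistant turn, pass prior conversation separately), here is a minimal sketch. The `ThreadMessage` shape, the `parseThread` helper, and the choice to treat everything after the last user message as the latest turn are assumptions made for illustration; they are not the shape returned by `trace.getThread()` nor the code in this PR. The `[called tool_name]` marker mirrors the label format referenced in this description.

```typescript
// Hypothetical message shape for illustration; the real thread shape may differ.
type ThreadMessage =
  | { role: 'user'; content: string }
  | { role: 'assistant'; content: string; toolCalls?: { toolName: string }[] }
  | { role: 'tool'; toolName: string; content: string }

type ParsedThread = {
  priorConversation: ThreadMessage[]
  latestUserMessage: string | undefined
  latestAssistantTurn: string
}

function parseThread(thread: ThreadMessage[]): ParsedThread {
  // Assumption: the latest Assistant turn is everything after the last user message.
  const lastUserIndex = thread.map((message) => message.role).lastIndexOf('user')
  const latestUser = lastUserIndex >= 0 ? thread[lastUserIndex] : undefined
  const turn = thread.slice(lastUserIndex + 1)

  const parts: string[] = []
  for (const message of turn) {
    // Tool results are skipped here; they already live in the trace spans.
    if (message.role !== 'assistant') continue
    const text = message.content.trim()
    if (text !== '') parts.push(text)
    // Tool calls are serialized as short labels rather than full payloads.
    for (const call of message.toolCalls ?? []) {
      parts.push(`[called ${call.toolName}]`)
    }
  }

  return {
    // Messages before the last user message; empty if there is no user message.
    priorConversation: thread.slice(0, Math.max(lastUserIndex, 0)),
    latestUserMessage: latestUser?.content,
    latestAssistantTurn: parts.join('\n'),
  }
}
```

Keeping the prior conversation separate lets a scorer judge only the latest turn while still having context available when it needs it.
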
SQL results returned to the model are gated by AI opt-in level, so row data is only included with `schema_and_log_and_data`; otherwise the tool returns the no-data-permissions sentinel.

Adds `metadata.isFinalStep` to disambiguate multiple LLM requests within an "assistant" turn due to tool call requests/responses. For online evals, this means we should configure automations to only score traces with `metadata.isFinalStep = true` to ensure we're judging the complete generated response.

Other minor kaizen changes:

- Renamed `promptProviderOptions` to `systemProviderOptions` to clarify that this is associated with the "system" message and disambiguate from the root `providerOptions`
- Adds `evals/trace-utils.ts` to handle Zod validation of the `unknown` span shapes from Braintrust, to more easily access typed inputs/output on tool spans.
- Bumps AI SDK floor version `^6.0.116` → `^6.0.174`
- Tweaked the "Conciseness" scorer to not unfairly dock points for the new `[called tool_name]` labels in serialized assistant response

## Verification

In the studio staging build, I asked Assistant to create a todos table with 3 sample todos. I manually approved the `execute_sql` call and saw Assistant generate text before & after the call. In Braintrust I verified two traces were produced (see [filtered logs](https://www.braintrust.dev/app/supabase.io/p/Assistant/logs?v=Staging&tvt=trace&search={%22filter%22:[{%22text%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22label%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22originType%22:%22btql%22},{%22text%22:%22%2560Chat%2520ID%2560%2520%253D%2520%25221cb2ac45-e5e7-458c-9da4-3bf6863b8842%2522%22,%22label%22:%22Chat%2520ID%2520equals%25201cb2ac45-e5e7-458c-9da4-3bf6863b8842%22,%22originType%22:%22form%22}]})), the first with `metadata.isFinalStep = false` and the second with `metadata.isFinalStep = true`.
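
The `metadata.isFinalStep` rule for online evals can be sketched as a simple filter. This is only an illustration: `TraceSummary` and `selectScorableTraces` are hypothetical names, and in practice the condition would be configured in the Braintrust automation rather than in application code.

```typescript
// Hypothetical, simplified trace record; real log rows carry many more fields.
type TraceSummary = {
  id: string
  metadata: { isFinalStep?: boolean; [key: string]: unknown }
}

function selectScorableTraces(traces: TraceSummary[]): TraceSummary[] {
  // Strict comparison: traces that lack the flag entirely are skipped as well,
  // so only traces explicitly marked as the final step get scored.
  return traces.filter((trace) => trace.metadata.isFinalStep === true)
}
```

For the two traces described above, only the second one (with `metadata.isFinalStep = true`) would pass this filter.
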
In the Braintrust staging scorers, I ran the preview Completeness scorer on the second trace and verified it sees the complete Assistant response, including markers for tool calls ([link to trace](https://www.braintrust.dev/app/supabase.io/p/Assistant%20(Staging%20Scorers)/trace?object_type=project_logs&object_id=b5214b62-ad1e-4929-9d5b-40b1daebe948&r=0ed0a4f8-8aff-4a34-bb1d-1df1d88a5070&s=ff9015f8-6bf7-4ab3-83a9-ca4e69e27e82)).

<img width="1193" height="960" alt="CleanShot 2026-05-07 at 11 27 10@2x" src="https://github.com/user-attachments/assets/509d4858-c3a1-4068-986d-3aa4d5617d1a" />

I also tested the `deploy_edge_function` workflow and verified it still prompts for permission and warns on deployment of existing functions.

**References**

- https://www.braintrust.dev/docs/evaluate/custom-code#score-traces
- https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling#tool-execution-approval

Supersedes https://github.com/supabase/supabase/pull/45556 and https://github.com/supabase/supabase/pull/45339

Closes AI-473

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Tool actions (SQL execution, edge-function deploy) now require explicit user Approve/Deny before proceeding.
* **Improvements**
  * Assistant pauses for approval responses before sending follow-ups, giving clearer control over risky actions.
  * Deploy/replace flows show confirmation and clearer replace warnings.
  * Evaluation/scoring updated to use richer trace data for more accurate assistant performance signals.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->