supabase

mirror of https://github.com/supabase/supabase.git synced 2026-05-20 20:41:03 +08:00

Files

Matt Rossman d143571586 feat(assistant): trace-level scorers + server-side tool execution with needsApproval (#45654)

## Motivation

When Assistant runs a potentially destructive tool like `execute_sql`,
it stops the LLM request and prompts for client-side approval and
execution of the tool. After approval, a second request kicks off under
a separate trace. This has made scoring and
[Topics](https://www.braintrust.dev/blog/topics) classification
challenging, as the generated `output` is split across stateless
requests. The [span-level
scoring](https://www.braintrust.dev/docs/evaluate/custom-code#score-spans)
approach we've used thus far (after the LLM call, we massage the result
into an `output` payload that's stuck onto the root span) has been
cumbersome and has led to invalid scores / topics where only part of the
assistant response is considered. It's also inefficient, as we're
duplicating potentially large info (like the `search_docs` output) that
already exists within the trace.

An alternative to scoring spans is to [score
traces](https://www.braintrust.dev/docs/evaluate/custom-code#score-traces).
Braintrust [best
practices](https://www.braintrust.dev/docs/evaluate/score-online#best-practices)
advise:

> Use span scope for evaluating individual operations or outputs. Use
trace scope for evaluating multi-turn conversations, overall workflow
completion, or when your scorer needs access to the full execution
context.

We've also received [direct
guidance](https://supabase.slack.com/archives/C05QYJBLX89/p1777925770927149?thread_ts=1777905716.911979&cid=C05QYJBLX89)
from their team to use this approach.

## Changes

Migrates eval scorers from the custom `AssistantEvalOutput` shape to
trace-level scoring via `trace.getThread()` / `trace.getSpans()`, with
thread parsing that scores the full latest Assistant turn and passes
the prior conversation separately where relevant.
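
As a rough illustration of the thread parsing described above, here is a minimal sketch that splits a thread into the prior conversation and the latest Assistant turn. The `ThreadMessage` shape and the `splitLatestTurn` helper are hypothetical; the actual value returned by `trace.getThread()` and the parsing in this PR may differ.

```typescript
// Hypothetical message shape; the real thread returned by Braintrust may differ.
type Role = "system" | "user" | "assistant" | "tool";

interface ThreadMessage {
  role: Role;
  content: string;
}

interface SplitThread {
  // Everything before the most recent user message.
  priorConversation: ThreadMessage[];
  // The most recent user message, or undefined if the thread has none.
  latestUserMessage: ThreadMessage | undefined;
  // Every message after the most recent user message (assistant text, tool results).
  latestAssistantTurn: ThreadMessage[];
}

function splitLatestTurn(thread: ThreadMessage[]): SplitThread {
  // Walk backwards to find the user message that started the latest turn.
  let lastUserIndex = -1;
  for (let i = thread.length - 1; i >= 0; i--) {
    if (thread[i]?.role === "user") {
      lastUserIndex = i;
      break;
    }
  }

  // Assumption: with no user message, treat the whole thread as the latest turn.
  if (lastUserIndex === -1) {
    return {
      priorConversation: [],
      latestUserMessage: undefined,
      latestAssistantTurn: [...thread],
    };
  }

  return {
    priorConversation: thread.slice(0, lastUserIndex),
    latestUserMessage: thread[lastUserIndex],
    latestAssistantTurn: thread.slice(lastUserIndex + 1),
  };
}
```

A scorer could then judge `latestAssistantTurn` as the output and pass `priorConversation` only to scorers that need earlier context.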

Moves `execute_sql` and `deploy_edge_function` from client-side
execution after approval to AI SDK `needsApproval` + server-side
`execute()`. SQL results returned to the model are gated by AI opt-in
level, so row data is only included with `schema_and_log_and_data`;
otherwise the tool returns the no-data-permissions sentinel.
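
To make the gating concrete, the sketch below shows one way a server-side tool could require approval and withhold row data by opt-in level. Only `schema_and_log_and_data`, `needsApproval`, and `execute()` come from the description above. The other opt-in levels, the sentinel string, the `ApprovalGatedTool` interface (a local stand-in, not the AI SDK's own type), and both helper functions are assumptions for illustration.

```typescript
// Only `schema_and_log_and_data` is named above; the other levels are assumptions.
type AiOptInLevel =
  | "disabled"
  | "schema"
  | "schema_and_log"
  | "schema_and_log_and_data";

// Hypothetical sentinel; the real no-data-permissions value is not shown here.
const NO_DATA_PERMISSIONS = "[no data permissions]";

type Row = Record<string, unknown>;
type GatedSqlResult = Row[] | typeof NO_DATA_PERMISSIONS;

// Return row data only at the highest opt-in level; otherwise return the sentinel.
function gateSqlResult(rows: Row[], optInLevel: AiOptInLevel): GatedSqlResult {
  return optInLevel === "schema_and_log_and_data" ? rows : NO_DATA_PERMISSIONS;
}

// Local stand-in for a tool definition with approval; not the AI SDK's type.
interface ApprovalGatedTool<Input, Output> {
  needsApproval: boolean;
  execute: (input: Input) => Promise<Output>;
}

// `runQuery` is a hypothetical callback that runs SQL against the project database.
function makeExecuteSqlTool(
  runQuery: (sql: string) => Promise<Row[]>,
  optInLevel: AiOptInLevel
): ApprovalGatedTool<{ sql: string }, GatedSqlResult> {
  return {
    // The user must approve before the server runs `execute`.
    needsApproval: true,
    execute: async ({ sql }) => gateSqlResult(await runQuery(sql), optInLevel),
  };
}
```

Keeping the gate in a small pure function like `gateSqlResult` means the permission rule can be unit-tested without a database or an LLM call.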

Adds `metadata.isFinalStep` to disambiguate multiple LLM requests within
an "assistant" turn due to tool call requests/responses. For online
evals, this means we should configure automations to only score traces
with `metadata.isFinalStep = true` to ensure we're judging the complete
generated response.
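
The selection rule above is normally configured as a filter in the online-scoring automation itself, but the same predicate can be sketched in code. The `TraceSummary` shape and helper names below are hypothetical; only `metadata.isFinalStep` comes from this PR.

```typescript
// Hypothetical trace summary; only `metadata.isFinalStep` comes from this PR.
interface TraceMetadata {
  isFinalStep?: boolean;
  [key: string]: unknown;
}

interface TraceSummary {
  id: string;
  metadata: TraceMetadata;
}

// Score a trace only when it holds the final step of the assistant turn.
// A missing flag is treated as "not final", so partial traces are never scored.
function shouldScoreTrace(trace: TraceSummary): boolean {
  return trace.metadata.isFinalStep === true;
}

function selectScorableTraces(traces: TraceSummary[]): TraceSummary[] {
  return traces.filter(shouldScoreTrace);
}
```

For the verification run described below, this would skip the first trace (`isFinalStep = false`) and score only the second.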

Other minor kaizen changes:
- Renames `promptProviderOptions` to `systemProviderOptions` to clarify
that this is associated with the "system" message and to disambiguate it
from the root `providerOptions`.
- Adds `evals/trace-utils.ts` to handle Zod validation of the `unknown`
span shapes from Braintrust, making it easier to access typed
inputs/outputs on tool spans.
- Bumps the AI SDK floor version `^6.0.116` → `^6.0.174`.
- Tweaks the "Conciseness" scorer so it doesn't unfairly dock points for
the new `[called tool_name]` labels in the serialized assistant response.
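
To illustrate the idea behind narrowing `unknown` span shapes, here is a dependency-free sketch. The PR uses Zod for this; the hand-written type guard below is a stand-in that plays the same role a Zod schema with `safeParse` would. The `ExecuteSqlSpan` field names and the `parseExecuteSqlSpan` helper are assumptions, not the contents of `evals/trace-utils.ts`.

```typescript
// Hypothetical shape of an `execute_sql` tool span; field names are assumptions.
interface ExecuteSqlSpan {
  name: "execute_sql";
  input: { sql: string };
  output: unknown;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a typed span when the shape matches, or null when it does not.
function parseExecuteSqlSpan(span: unknown): ExecuteSqlSpan | null {
  if (!isRecord(span) || span["name"] !== "execute_sql") return null;

  const input = span["input"];
  if (!isRecord(input)) return null;

  const sql = input["sql"];
  if (typeof sql !== "string") return null;

  // Rebuild the object so callers only see the validated fields.
  return { name: "execute_sql", input: { sql }, output: span["output"] };
}
```

Returning `null` instead of throwing lets a scorer skip malformed spans without failing the whole trace.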

## Verification

In the studio staging build, I asked Assistant to create a todos table
with 3 sample todos. I manually approved the `execute_sql` call and saw
Assistant generate text before & after the call.

In Braintrust I verified two traces were produced (see [filtered
logs](https://www.braintrust.dev/app/supabase.io/p/Assistant/logs?v=Staging&tvt=trace&search={%22filter%22:[{%22text%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22label%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22originType%22:%22btql%22},{%22text%22:%22%2560Chat%2520ID%2560%2520%253D%2520%25221cb2ac45-e5e7-458c-9da4-3bf6863b8842%2522%22,%22label%22:%22Chat%2520ID%2520equals%25201cb2ac45-e5e7-458c-9da4-3bf6863b8842%22,%22originType%22:%22form%22}]})),
the first with `metadata.isFinalStep = false` and the second with
`metadata.isFinalStep = true`.

In the Braintrust staging scorers, I ran the preview Completeness scorer
on the second trace and verified it sees the complete Assistant response
including markers for tool calls ([link to
trace](https://www.braintrust.dev/app/supabase.io/p/Assistant%20(Staging%20Scorers)/trace?object_type=project_logs&object_id=b5214b62-ad1e-4929-9d5b-40b1daebe948&r=0ed0a4f8-8aff-4a34-bb1d-1df1d88a5070&s=ff9015f8-6bf7-4ab3-83a9-ca4e69e27e82))

<img width="1193" height="960" alt="CleanShot 2026-05-07 at 11 27 10@2x"
src="https://github.com/user-attachments/assets/509d4858-c3a1-4068-986d-3aa4d5617d1a"
/>

I also tested the `deploy_edge_function` workflow and verified it still
prompts for permission and warns on deployment of existing functions.

**References**
- https://www.braintrust.dev/docs/evaluate/custom-code#score-traces
- https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling#tool-execution-approval

Supersedes https://github.com/supabase/supabase/pull/45556 and
https://github.com/supabase/supabase/pull/45339

Closes AI-473

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Tool actions (SQL execution, edge-function deploy) now require
    explicit user Approve/Deny before proceeding.

* **Improvements**
  * Assistant pauses for approval responses before sending follow-ups,
    giving clearer control over risky actions.
  * Deploy/replace flows show confirmation and clearer replace warnings.
  * Evaluation/scoring updated to use richer trace data for more accurate
    assistant performance signals.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

2026-05-12 15:24:21 -04:00

api

feat(assistant): trace-level scorers + server-side tool execution with needsApproval (#45654)

2026-05-12 15:24:21 -04:00

constants

feat(studio): add timezone picker to user dropdown (#45517)

2026-05-06 14:52:36 +02:00

telemetry

…

validation

…

auth.tsx

…

base64url.ts

…

breadcrumbs.test.ts

…

breadcrumbs.ts

…

cloudprovider-utils.test.ts

…

cloudprovider-utils.ts

…

data-api-types.ts

…

datetime.test.ts

feat(studio): add timezone picker to user dropdown (#45517)

2026-05-06 14:52:36 +02:00

datetime.tsx

feat(studio): add timezone picker to user dropdown (#45517)

2026-05-06 14:52:36 +02:00

dayjs.ts

…

error-reporting.test.ts

…

error-reporting.ts

…

formatSql.test.ts

…

formatSql.ts

feat(studio): mark sql provenance for safety (#45336)

2026-05-04 13:08:06 -04:00

get-error-message.test.ts

…

get-error-message.ts

…

github.test.ts

…

github.ts

…

gotrue.test.ts

…

gotrue.ts

…

helpers.test.ts

…

helpers.ts

…

http-status-codes.test.ts

…

http-status-codes.ts

…

integration-utils.test.ts

…

integration-utils.ts

…

isNonNullable.test.ts

…

isNonNullable.ts

…

keyboard.ts

feat(studio): add keyboard shortcuts to Database listing pages (#45467)

2026-05-04 07:08:35 -06:00

local-storage.test.ts

…

migration-utils.test.ts

…

migration-utils.ts

…

mime.test.ts

…

mime.ts

…

navigation.test.ts

…

navigation.ts

…

page-title.test.ts

…

page-title.ts

…

password-strength.test.ts

…

password-strength.ts

…

pathname.utils.ts

…

pg-format.test.ts

…

pg-format.ts

…

pingPostgrest.test.ts

…

pingPostgrest.ts

…

posthog.test.ts

…

posthog.ts

…

profile.tsx

…

project-supabase-client.test.ts

…

project-supabase-client.ts

…

project-transition-state.test.ts

…

project-transition-state.ts

…

project.tsx

…

restore-estimate.test.ts

…

restore-estimate.ts

…

ringBuffer.test.ts

…

ringBuffer.ts

…

role-impersonation.test.ts

feat(studio): mark sql provenance for safety (#45336)

2026-05-04 13:08:06 -04:00

role-impersonation.ts

feat(studio): mark sql provenance for safety (#45336)

2026-05-04 13:08:06 -04:00

sanitize.test.ts

…

sanitize.ts

…

semver.test.ts

…

semver.ts

…

sql-event-parser.test.ts

[FE-3134] fix(studio): handle ALTER TABLE IF EXISTS in RLS detection (#45493)

2026-05-04 16:21:47 +08:00

sql-event-parser.ts

[FE-3134] fix(studio): handle ALTER TABLE IF EXISTS in RLS detection (#45493)

2026-05-04 16:21:47 +08:00

sql-identifier-quoting.test.ts

…

sql-identifier-quoting.ts

…

sql-parameters.test.ts

…

sql-parameters.ts

…

telemetry.tsx

…

toast-errors.tsx

…

toaster.tsx

…

type-helpers.ts

…

upload.test.ts

…

upload.ts

…

void.test.ts

…

void.ts

…