supabase/apps/studio/lib/ai/model.utils.test.ts
Matt Rossman d143571586 feat(assistant): trace-level scorers + server-side tool execution with needsApproval (#45654)
## Motivation

When Assistant runs a potentially destructive tool like `execute_sql`,
it stops the LLM request and prompts for client-side approval and
execution of the tool. After approval, a second request kicks off under
a separate trace. This has made scoring and
[Topics](https://www.braintrust.dev/blog/topics) classification
challenging, as the generated `output` is split across stateless
requests. The [span-level
scoring](https://www.braintrust.dev/docs/evaluate/custom-code#score-spans)
approach we've used thus far (after the LLM call, we massage the result
into an `output` payload that's stuck onto the root span) has been
cumbersome and led to invalid scores / topics where only part of the
assistant response is considered. It's also inefficient, as we're
duplicating potentially large info (like the `search_docs` output) that
already exists within the trace.

An alternative to scoring spans is to [score
traces](https://www.braintrust.dev/docs/evaluate/custom-code#score-traces).
Braintrust [best
practices](https://www.braintrust.dev/docs/evaluate/score-online#best-practices)
advise:

> Use span scope for evaluating individual operations or outputs. Use
trace scope for evaluating multi-turn conversations, overall workflow
completion, or when your scorer needs access to the full execution
context.

We've also received [direct
guidance](https://supabase.slack.com/archives/C05QYJBLX89/p1777925770927149?thread_ts=1777905716.911979&cid=C05QYJBLX89)
from their team to use this approach.

## Changes

Migrates eval scorers from custom `AssistantEvalOutput` shape to
trace-level scoring via `trace.getThread()` / `trace.getSpans()`, with
thread parsing that scores the full latest Assistant turn and passes
prior conversation separately where relevant.
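
As a rough illustration of the thread parsing described above, the sketch below splits a conversation at the last user message. The `ThreadMessage` shape and `splitLatestAssistantTurn` helper are hypothetical names for this example only; the real value returned by `trace.getThread()` may look different.

```typescript
// Hypothetical message shape, assumed for illustration only.
type Role = 'system' | 'user' | 'assistant' | 'tool'

interface ThreadMessage {
  role: Role
  content: string
}

interface SplitThread {
  // Everything up to and including the last user message.
  priorConversation: ThreadMessage[]
  // Everything after the last user message: the full latest Assistant turn,
  // including tool calls and tool results.
  latestTurn: ThreadMessage[]
}

function splitLatestAssistantTurn(thread: ThreadMessage[]): SplitThread {
  // lastIndexOf returns -1 when there is no user message, in which case
  // the whole thread is treated as the latest turn.
  const lastUserIndex = thread.map((m) => m.role).lastIndexOf('user')
  return {
    priorConversation: thread.slice(0, lastUserIndex + 1),
    latestTurn: thread.slice(lastUserIndex + 1),
  }
}
```

A scorer could then judge `latestTurn` as the output and pass `priorConversation` as context only where it is relevant.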

Moves `execute_sql` and `deploy_edge_function` from client-side
execution after approval to AI SDK `needsApproval` + server-side
`execute()`. SQL results returned to the model are gated by AI opt-in
level, so row data is only included with `schema_and_log_and_data`;
otherwise the tool returns the no-data-permissions sentinel.
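
The opt-in gating can be pictured with a small, dependency-free sketch. It does not call the AI SDK; only `schema_and_log_and_data` comes from this description, while the other opt-in levels, the `gateSqlResult` helper, and the sentinel text are assumptions made for illustration.

```typescript
// Only 'schema_and_log_and_data' is named in this PR description;
// the remaining members are assumed for illustration.
type AiOptInLevel = 'disabled' | 'schema' | 'schema_and_log' | 'schema_and_log_and_data'

// Placeholder text: the real sentinel is defined in the codebase.
const NO_DATA_PERMISSIONS = '<no-data-permissions sentinel>'

type Row = Record<string, unknown>

// Decide what the server-side execute() hands back to the model.
function gateSqlResult(rows: Row[], optInLevel: AiOptInLevel): Row[] | string {
  // Row data reaches the model only at the highest opt-in level.
  return optInLevel === 'schema_and_log_and_data' ? rows : NO_DATA_PERMISSIONS
}
```

Under these assumptions the query still runs after approval at any level; only the payload returned to the model changes.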

Adds `metadata.isFinalStep` to disambiguate multiple LLM requests within
an "assistant" turn due to tool call requests/responses. For online
evals, this means we should configure automations to only score traces
with `metadata.isFinalStep = true` to ensure we're judging the complete
generated response.
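
In Braintrust this filter would be configured on the automation itself. The sketch below only illustrates the selection rule locally, using a hypothetical `TraceSummary` shape and `selectScorableTraces` helper.

```typescript
// Hypothetical minimal trace shape, assumed for illustration only.
interface TraceSummary {
  id: string
  metadata: { isFinalStep?: boolean }
}

// Keep only traces that hold the complete generated response.
function selectScorableTraces(traces: TraceSummary[]): TraceSummary[] {
  // Traces with isFinalStep false or missing are skipped.
  return traces.filter((t) => t.metadata.isFinalStep === true)
}
```

For a turn that pauses for approval, the first trace (the tool call request) is skipped and the second trace (the response after tool execution) is scored.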

Other minor kaizen changes:
- Renames `promptProviderOptions` to `systemProviderOptions` to clarify
that this is associated with the "system" message and disambiguate from
the root `providerOptions`
- Adds `evals/trace-utils.ts` to handle Zod validation of the `unknown`
span shapes from Braintrust, to more easily access typed inputs/outputs
on tool spans.
- Bumps AI SDK floor version `^6.0.116` → `^6.0.174`
- Tweaks the "Conciseness" scorer to not unfairly dock points for the
new `[called tool_name]` labels in serialized assistant response
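
To show the idea behind narrowing `unknown` span shapes, here is a dependency-free sketch that uses a hand-written type guard in place of Zod. The `ExecuteSqlSpan` shape, its field names, and `parseExecuteSqlSpan` are hypothetical; the actual schemas live in `evals/trace-utils.ts`.

```typescript
// Hypothetical span shape, assumed for illustration only.
interface ExecuteSqlSpan {
  name: 'execute_sql'
  input: { sql: string }
  output?: unknown
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null
}

// Returns a typed span, or undefined when the shape does not match.
function parseExecuteSqlSpan(span: unknown): ExecuteSqlSpan | undefined {
  if (!isRecord(span) || span.name !== 'execute_sql') return undefined
  const input = span.input
  if (!isRecord(input) || typeof input.sql !== 'string') return undefined
  return { name: 'execute_sql', input: { sql: input.sql }, output: span.output }
}
```

A Zod schema does the same job declaratively; the guard above just makes the narrowing steps explicit.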

## Verification

In the studio staging build, I asked Assistant to create a todos table
with 3 sample todos. I manually approved the `execute_sql` call and saw
Assistant generate text before & after the call.

In Braintrust I verified two traces were produced (see [filtered
logs](https://www.braintrust.dev/app/supabase.io/p/Assistant/logs?v=Staging&tvt=trace&search={%22filter%22:[{%22text%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22label%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22originType%22:%22btql%22},{%22text%22:%22%2560Chat%2520ID%2560%2520%253D%2520%25221cb2ac45-e5e7-458c-9da4-3bf6863b8842%2522%22,%22label%22:%22Chat%2520ID%2520equals%25201cb2ac45-e5e7-458c-9da4-3bf6863b8842%22,%22originType%22:%22form%22}]})),
the first with `metadata.isFinalStep = false` and the second with
`metadata.isFinalStep = true`.

In the Braintrust staging scorers, I ran the preview Completeness scorer
on the second trace and verified it sees the complete Assistant response
including markers for tool calls ([link to
trace](https://www.braintrust.dev/app/supabase.io/p/Assistant%20(Staging%20Scorers)/trace?object_type=project_logs&object_id=b5214b62-ad1e-4929-9d5b-40b1daebe948&r=0ed0a4f8-8aff-4a34-bb1d-1df1d88a5070&s=ff9015f8-6bf7-4ab3-83a9-ca4e69e27e82))

<img width="1193" height="960" alt="CleanShot 2026-05-07 at 11 27 10@2x"
src="https://github.com/user-attachments/assets/509d4858-c3a1-4068-986d-3aa4d5617d1a"
/>

I also tested the `deploy_edge_function` workflow and verified it still
prompts for permission and warns on deployment of existing functions.

**References**
- https://www.braintrust.dev/docs/evaluate/custom-code#score-traces
- https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling#tool-execution-approval

Supersedes https://github.com/supabase/supabase/pull/45556 and
https://github.com/supabase/supabase/pull/45339

Closes AI-473

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Tool actions (SQL execution, edge-function deploy) now require
    explicit user Approve/Deny before proceeding.

* **Improvements**
  * Assistant pauses for approval responses before sending follow-ups,
    giving clearer control over risky actions.
  * Deploy/replace flows show confirmation and clearer replace warnings.
  * Evaluation/scoring updated to use richer trace data for more accurate
    assistant performance signals.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2026-05-12 15:24:21 -04:00


import { describe, expect, it } from 'vitest'

import {
  ASSISTANT_MODELS,
  DEFAULT_ASSISTANT_ADVANCE_MODEL_ID,
  DEFAULT_ASSISTANT_BASE_MODEL_ID,
  DEFAULT_COMPLETION_MODEL,
  defaultAssistantModelId,
  getAssistantModelEntry,
  getDefaultModelForProvider,
  isAdvanceOnlyModelId,
  isAssistantBaseModelId,
  isKnownAssistantModelId,
  openaiModelEntry,
  PROVIDERS,
} from './model.utils'
import type { ProviderName } from './model.utils'

describe('model.utils', () => {
  describe('getDefaultModelForProvider', () => {
    it('should return correct default for bedrock provider', () => {
      const result = getDefaultModelForProvider('bedrock')
      expect(result).toBe('openai.gpt-oss-120b-1:0')
    })

    it('should return correct default for openai provider', () => {
      const result = getDefaultModelForProvider('openai')
      expect(result).toBe('gpt-5.4-nano')
    })

    it('should return undefined for unknown provider', () => {
      const result = getDefaultModelForProvider('unknown' as ProviderName)
      expect(result).toBeUndefined()
    })
  })

  describe('PROVIDERS registry', () => {
    it('should have bedrock provider with models', () => {
      expect(PROVIDERS.bedrock).toBeDefined()
      expect(PROVIDERS.bedrock.models).toBeDefined()
      expect(Object.keys(PROVIDERS.bedrock.models)).toContain(
        'anthropic.claude-3-7-sonnet-20250219-v1:0'
      )
      expect(Object.keys(PROVIDERS.bedrock.models)).toContain('openai.gpt-oss-120b-1:0')
    })

    it('should have openai provider with models', () => {
      expect(PROVIDERS.openai).toBeDefined()
      expect(PROVIDERS.openai.models).toBeDefined()
      expect(Object.keys(PROVIDERS.openai.models)).toContain('gpt-5.3-codex')
      expect(Object.keys(PROVIDERS.openai.models)).toContain('gpt-5.4-nano')
    })

    it('should have exactly one default model per provider', () => {
      const providers: ProviderName[] = ['bedrock', 'openai']
      providers.forEach((provider) => {
        const models = PROVIDERS[provider].models
        const defaultModels = Object.entries(models).filter(([_, config]) => config.default)
        expect(defaultModels.length).toBe(1)
      })
    })

    it('should have valid model configurations', () => {
      const providers: ProviderName[] = ['bedrock', 'openai']
      providers.forEach((provider) => {
        const models = PROVIDERS[provider].models
        Object.entries(models).forEach(([_modelId, config]) => {
          expect(config).toHaveProperty('default')
          expect(typeof config.default).toBe('boolean')
        })
      })
    })

    it('should have bedrock model with systemProviderOptions', () => {
      const sonnetModel = PROVIDERS.bedrock.models['anthropic.claude-3-7-sonnet-20250219-v1:0']
      expect(sonnetModel.systemProviderOptions).toBeDefined()
      expect(sonnetModel.systemProviderOptions?.bedrock).toBeDefined()
      expect(sonnetModel.systemProviderOptions?.bedrock?.cachePoint).toEqual({
        type: 'default',
      })
    })

    it('should have openai provider with providerOptions', () => {
      expect(PROVIDERS.openai.providerOptions).toBeDefined()
      expect(PROVIDERS.openai.providerOptions?.openai).toBeDefined()
      expect(PROVIDERS.openai.providerOptions?.openai?.reasoningEffort).toBeUndefined()
    })
  })

  describe('assistant model registry', () => {
    it('should have non-empty base and advance tiers', () => {
      expect(
        ASSISTANT_MODELS.filter((m) => !m.requiresAdvanceModelEntitlement).length
      ).toBeGreaterThan(0)
      expect(
        ASSISTANT_MODELS.filter((m) => m.requiresAdvanceModelEntitlement).length
      ).toBeGreaterThan(0)
    })

    it('all model IDs should be unique', () => {
      const ids = ASSISTANT_MODELS.map((m) => m.id)
      expect(new Set(ids).size).toBe(ids.length)
    })

    it('should have all models in openai provider registry', () => {
      ASSISTANT_MODELS.forEach((entry) => {
        expect(Object.keys(PROVIDERS.openai.models)).toContain(entry.id)
      })
    })

    it('defaults should satisfy unions', () => {
      expect(DEFAULT_ASSISTANT_BASE_MODEL_ID).toBe('gpt-5.4-nano')
      expect(DEFAULT_ASSISTANT_ADVANCE_MODEL_ID).toBe('gpt-5.3-codex')
      expect(defaultAssistantModelId(false)).toBe(DEFAULT_ASSISTANT_BASE_MODEL_ID)
      expect(defaultAssistantModelId(true)).toBe(DEFAULT_ASSISTANT_ADVANCE_MODEL_ID)
    })

    it('isAssistantBaseModelId / isAdvanceOnlyModelId', () => {
      expect(isAssistantBaseModelId('gpt-5.4-nano')).toBe(true)
      expect(isAssistantBaseModelId('gpt-5.3-codex')).toBe(false)
      expect(isAdvanceOnlyModelId('gpt-5.3-codex')).toBe(true)
      expect(isAdvanceOnlyModelId('gpt-5.4-nano')).toBe(false)
    })

    it('isKnownAssistantModelId', () => {
      expect(isKnownAssistantModelId('gpt-5.4-nano')).toBe(true)
      expect(isKnownAssistantModelId('gpt-5.3-codex')).toBe(true)
      expect(isKnownAssistantModelId('gpt-5')).toBe(false)
      expect(isKnownAssistantModelId('gpt-5-mini')).toBe(false)
      expect(isKnownAssistantModelId('unknown')).toBe(false)
    })

    it('getAssistantModelEntry returns config for known ids', () => {
      expect(getAssistantModelEntry('gpt-5.4-nano').reasoningEffort).toBe('low')
      expect(getAssistantModelEntry('gpt-5.3-codex').reasoningEffort).toBe('low')
      expect(getAssistantModelEntry('gpt-5.4-nano')).toEqual(
        ASSISTANT_MODELS.find((m) => m.id === 'gpt-5.4-nano')
      )
    })

    it('DEFAULT_COMPLETION_MODEL is gpt-5.4-nano with no reasoning effort', () => {
      expect(DEFAULT_COMPLETION_MODEL.id).toBe(DEFAULT_ASSISTANT_BASE_MODEL_ID)
      expect(DEFAULT_COMPLETION_MODEL.reasoningEffort).toBe('none')
    })

    it('openaiModelEntry enforces valid reasoning effort at compile time', () => {
      // Valid: supported effort level
      const withEffort = openaiModelEntry({
        id: 'gpt-5.4-nano',
        reasoningEffort: 'low',
      })
      expect(withEffort.reasoningEffort).toBe('low')

      // Valid: no effort
      const withoutEffort = openaiModelEntry({ id: 'gpt-5.4-nano' })
      expect(withoutEffort.reasoningEffort).toBeUndefined()
    })
  })
})