mirror of
https://github.com/supabase/supabase.git
synced 2026-06-12 17:27:58 +08:00
## Motivation

When Assistant runs a potentially destructive tool like `execute_sql`, it stops the LLM request and prompts for client-side approval and execution of the tool. After approval, a second request kicks off under a separate trace. This has made scoring and [Topics](https://www.braintrust.dev/blog/topics) classification challenging, as the generated `output` is split across stateless requests.

The [span-level scoring](https://www.braintrust.dev/docs/evaluate/custom-code#score-spans) approach we've used thus far (after the LLM call, we massage the result into an `output` payload that's stuck onto the root span) has been cumbersome and has led to invalid scores / topics where only part of the assistant response is considered. It's also inefficient, as we're duplicating potentially large info (like the `search_docs` output) that already exists within the trace.

An alternative to scoring spans is to [score traces](https://www.braintrust.dev/docs/evaluate/custom-code#score-traces). Braintrust [best practices](https://www.braintrust.dev/docs/evaluate/score-online#best-practices) advise:

> Use span scope for evaluating individual operations or outputs. Use trace scope for evaluating multi-turn conversations, overall workflow completion, or when your scorer needs access to the full execution context.

We've also received [direct guidance](https://supabase.slack.com/archives/C05QYJBLX89/p1777925770927149?thread_ts=1777905716.911979&cid=C05QYJBLX89) from their team to use this approach.

## Changes

Migrates eval scorers from the custom `AssistantEvalOutput` shape to trace-level scoring via `trace.getThread()` / `trace.getSpans()`, with thread parsing that scores the full latest Assistant turn and passes prior conversation separately where relevant.

Moves `execute_sql` and `deploy_edge_function` from client-side execution after approval to AI SDK `needsApproval` + server-side `execute()`.

SQL results returned to the model are gated by AI opt-in level, so row data is only included with `schema_and_log_and_data`; otherwise the tool returns the no-data-permissions sentinel.

Adds `metadata.isFinalStep` to disambiguate multiple LLM requests within an "assistant" turn due to tool call requests/responses. For online evals, this means we should configure automations to only score traces with `metadata.isFinalStep = true` to ensure we're judging the complete generated response.

Other minor kaizen changes:

- Renamed `promptProviderOptions` to `systemProviderOptions` to clarify that this is associated with the "system" message and disambiguate from the root `providerOptions`
- Adds `evals/trace-utils.ts` to handle Zod validation of the `unknown` span shapes from Braintrust, to more easily access typed inputs/output on tool spans.
- Bumps AI SDK floor version `^6.0.116` → `^6.0.174`
- Tweaked the "Conciseness" scorer to not unfairly dock points for the new `[called tool_name]` labels in the serialized assistant response

## Verification

In the studio staging build, I asked Assistant to create a todos table with 3 sample todos. I manually approved the `execute_sql` call and saw Assistant generate text before & after the call.

In Braintrust I verified two traces were produced (see [filtered logs](https://www.braintrust.dev/app/supabase.io/p/Assistant/logs?v=Staging&tvt=trace&search={%22filter%22:[{%22text%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22label%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22originType%22:%22btql%22},{%22text%22:%22%2560Chat%2520ID%2560%2520%253D%2520%25221cb2ac45-e5e7-458c-9da4-3bf6863b8842%2522%22,%22label%22:%22Chat%2520ID%2520equals%25201cb2ac45-e5e7-458c-9da4-3bf6863b8842%22,%22originType%22:%22form%22}]})), the first with `metadata.isFinalStep = false` and the second with `metadata.isFinalStep = true`.

In the Braintrust staging scorers, I ran the preview Completeness scorer on the second trace and verified it sees the complete Assistant response including markers for tool calls ([link to trace](https://www.braintrust.dev/app/supabase.io/p/Assistant%20(Staging%20Scorers)/trace?object_type=project_logs&object_id=b5214b62-ad1e-4929-9d5b-40b1daebe948&r=0ed0a4f8-8aff-4a34-bb1d-1df1d88a5070&s=ff9015f8-6bf7-4ab3-83a9-ca4e69e27e82))

<img width="1193" height="960" alt="CleanShot 2026-05-07 at 11 27 10@2x" src="https://github.com/user-attachments/assets/509d4858-c3a1-4068-986d-3aa4d5617d1a" />

I also tested the `deploy_edge_function` workflow and verified it still prompts for permission and warns on deployment of existing functions.

**References**

- https://www.braintrust.dev/docs/evaluate/custom-code#score-traces
- https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling#tool-execution-approval

Supersedes https://github.com/supabase/supabase/pull/45556 and https://github.com/supabase/supabase/pull/45339

Closes AI-473

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Tool actions (SQL execution, edge-function deploy) now require explicit user Approve/Deny before proceeding.
* **Improvements**
  * Assistant pauses for approval responses before sending follow-ups, giving clearer control over risky actions.
  * Deploy/replace flows show confirmation and clearer replace warnings.
  * Evaluation/scoring updated to use richer trace data for more accurate assistant performance signals.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
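The opt-in gating of SQL results and the `metadata.isFinalStep` filter described under Changes could look roughly like the sketch below. This is only an illustration, not the code from this PR: apart from `schema_and_log_and_data` and `isFinalStep`, every name here (`AiOptInLevel` and its other members, `NO_DATA_PERMISSIONS`, `gateSqlResult`, `TraceSummary`, `selectScorableTraces`) is a hypothetical stand-in.

```typescript
// Only `schema_and_log_and_data` is named in the description above;
// the other three level names are assumptions made for illustration.
type AiOptInLevel = 'disabled' | 'schema' | 'schema_and_log' | 'schema_and_log_and_data'

// Hypothetical sentinel; the real constant's name and value are not given in the source.
const NO_DATA_PERMISSIONS = 'no-data-permissions'

type SqlToolResult = { rows: Record<string, unknown>[] } | typeof NO_DATA_PERMISSIONS

// Return row data to the model only at the highest opt-in level;
// at every other level the model receives the sentinel instead.
function gateSqlResult(rows: Record<string, unknown>[], optInLevel: AiOptInLevel): SqlToolResult {
  return optInLevel === 'schema_and_log_and_data' ? { rows } : NO_DATA_PERMISSIONS
}

// Hypothetical minimal view of a logged trace, keeping only the field we filter on.
interface TraceSummary {
  id: string
  metadata: { isFinalStep?: boolean }
}

// Keep only traces that carry the complete assistant turn.
// A missing flag is treated as "not final" so partial responses are never scored.
function selectScorableTraces(traces: TraceSummary[]): TraceSummary[] {
  return traces.filter((trace) => trace.metadata.isFinalStep === true)
}
```

Treating a missing `isFinalStep` as non-final is a deliberately conservative choice in this sketch: it avoids scoring a partial response at the cost of skipping traces logged before the flag existed.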
162 lines
6.0 KiB
TypeScript
import { describe, expect, it } from 'vitest'

import {
  ASSISTANT_MODELS,
  DEFAULT_ASSISTANT_ADVANCE_MODEL_ID,
  DEFAULT_ASSISTANT_BASE_MODEL_ID,
  DEFAULT_COMPLETION_MODEL,
  defaultAssistantModelId,
  getAssistantModelEntry,
  getDefaultModelForProvider,
  isAdvanceOnlyModelId,
  isAssistantBaseModelId,
  isKnownAssistantModelId,
  openaiModelEntry,
  PROVIDERS,
} from './model.utils'
import type { ProviderName } from './model.utils'

describe('model.utils', () => {
  describe('getDefaultModelForProvider', () => {
    it('should return correct default for bedrock provider', () => {
      const result = getDefaultModelForProvider('bedrock')
      expect(result).toBe('openai.gpt-oss-120b-1:0')
    })

    it('should return correct default for openai provider', () => {
      const result = getDefaultModelForProvider('openai')
      expect(result).toBe('gpt-5.4-nano')
    })

    it('should return undefined for unknown provider', () => {
      const result = getDefaultModelForProvider('unknown' as ProviderName)
      expect(result).toBeUndefined()
    })
  })

  describe('PROVIDERS registry', () => {
    it('should have bedrock provider with models', () => {
      expect(PROVIDERS.bedrock).toBeDefined()
      expect(PROVIDERS.bedrock.models).toBeDefined()
      expect(Object.keys(PROVIDERS.bedrock.models)).toContain(
        'anthropic.claude-3-7-sonnet-20250219-v1:0'
      )
      expect(Object.keys(PROVIDERS.bedrock.models)).toContain('openai.gpt-oss-120b-1:0')
    })

    it('should have openai provider with models', () => {
      expect(PROVIDERS.openai).toBeDefined()
      expect(PROVIDERS.openai.models).toBeDefined()
      expect(Object.keys(PROVIDERS.openai.models)).toContain('gpt-5.3-codex')
      expect(Object.keys(PROVIDERS.openai.models)).toContain('gpt-5.4-nano')
    })

    it('should have exactly one default model per provider', () => {
      const providers: ProviderName[] = ['bedrock', 'openai']

      providers.forEach((provider) => {
        const models = PROVIDERS[provider].models
        const defaultModels = Object.entries(models).filter(([_, config]) => config.default)
        expect(defaultModels.length).toBe(1)
      })
    })

    it('should have valid model configurations', () => {
      const providers: ProviderName[] = ['bedrock', 'openai']

      providers.forEach((provider) => {
        const models = PROVIDERS[provider].models
        Object.entries(models).forEach(([_modelId, config]) => {
          expect(config).toHaveProperty('default')
          expect(typeof config.default).toBe('boolean')
        })
      })
    })

    it('should have bedrock model with systemProviderOptions', () => {
      const sonnetModel = PROVIDERS.bedrock.models['anthropic.claude-3-7-sonnet-20250219-v1:0']
      expect(sonnetModel.systemProviderOptions).toBeDefined()
      expect(sonnetModel.systemProviderOptions?.bedrock).toBeDefined()
      expect(sonnetModel.systemProviderOptions?.bedrock?.cachePoint).toEqual({
        type: 'default',
      })
    })

    it('should have openai provider with providerOptions', () => {
      expect(PROVIDERS.openai.providerOptions).toBeDefined()
      expect(PROVIDERS.openai.providerOptions?.openai).toBeDefined()
      expect(PROVIDERS.openai.providerOptions?.openai?.reasoningEffort).toBeUndefined()
    })
  })

  describe('assistant model registry', () => {
    it('should have non-empty base and advance tiers', () => {
      expect(
        ASSISTANT_MODELS.filter((m) => !m.requiresAdvanceModelEntitlement).length
      ).toBeGreaterThan(0)
      expect(
        ASSISTANT_MODELS.filter((m) => m.requiresAdvanceModelEntitlement).length
      ).toBeGreaterThan(0)
    })

    it('all model IDs should be unique', () => {
      const ids = ASSISTANT_MODELS.map((m) => m.id)
      expect(new Set(ids).size).toBe(ids.length)
    })

    it('should have all models in openai provider registry', () => {
      ASSISTANT_MODELS.forEach((entry) => {
        expect(Object.keys(PROVIDERS.openai.models)).toContain(entry.id)
      })
    })

    it('defaults should satisfy unions', () => {
      expect(DEFAULT_ASSISTANT_BASE_MODEL_ID).toBe('gpt-5.4-nano')
      expect(DEFAULT_ASSISTANT_ADVANCE_MODEL_ID).toBe('gpt-5.3-codex')
      expect(defaultAssistantModelId(false)).toBe(DEFAULT_ASSISTANT_BASE_MODEL_ID)
      expect(defaultAssistantModelId(true)).toBe(DEFAULT_ASSISTANT_ADVANCE_MODEL_ID)
    })

    it('isAssistantBaseModelId / isAdvanceOnlyModelId', () => {
      expect(isAssistantBaseModelId('gpt-5.4-nano')).toBe(true)
      expect(isAssistantBaseModelId('gpt-5.3-codex')).toBe(false)
      expect(isAdvanceOnlyModelId('gpt-5.3-codex')).toBe(true)
      expect(isAdvanceOnlyModelId('gpt-5.4-nano')).toBe(false)
    })

    it('isKnownAssistantModelId', () => {
      expect(isKnownAssistantModelId('gpt-5.4-nano')).toBe(true)
      expect(isKnownAssistantModelId('gpt-5.3-codex')).toBe(true)
      expect(isKnownAssistantModelId('gpt-5')).toBe(false)
      expect(isKnownAssistantModelId('gpt-5-mini')).toBe(false)
      expect(isKnownAssistantModelId('unknown')).toBe(false)
    })

    it('getAssistantModelEntry returns config for known ids', () => {
      expect(getAssistantModelEntry('gpt-5.4-nano').reasoningEffort).toBe('low')
      expect(getAssistantModelEntry('gpt-5.3-codex').reasoningEffort).toBe('low')
      expect(getAssistantModelEntry('gpt-5.4-nano')).toEqual(
        ASSISTANT_MODELS.find((m) => m.id === 'gpt-5.4-nano')
      )
    })

    it('DEFAULT_COMPLETION_MODEL is gpt-5.4-nano with no reasoning effort', () => {
      expect(DEFAULT_COMPLETION_MODEL.id).toBe(DEFAULT_ASSISTANT_BASE_MODEL_ID)
      expect(DEFAULT_COMPLETION_MODEL.reasoningEffort).toBe('none')
    })

    it('openaiModelEntry enforces valid reasoning effort at compile time', () => {
      // Valid: supported effort level
      const withEffort = openaiModelEntry({
        id: 'gpt-5.4-nano',
        reasoningEffort: 'low',
      })
      expect(withEffort.reasoningEffort).toBe('low')

      // Valid: no effort
      const withoutEffort = openaiModelEntry({ id: 'gpt-5.4-nano' })
      expect(withoutEffort.reasoningEffort).toBeUndefined()
    })
  })
})