supabase/apps/studio/evals/assistant.eval.ts
Matt Rossman 517171b246 feat(assistant): online evals support and CI workflows (#43194)
Lays groundwork for online evals on Assistant chat logs.

https://www.braintrust.dev/docs/observe/score-online

### Changes

- New workflows:
  - `braintrust-scorers-deploy.yml` keeps prod scorers in sync on push
    to `master`
  - `braintrust-preview-scorers-deploy.yml` deploys preview scorers to
    the staging project for PRs labeled `preview-scorers`, posting a
    comment with scorer links
    ([example](https://github.com/supabase/supabase/pull/43194#issuecomment-4000097222))
  - `braintrust-preview-scorers-cleanup.yml` deletes preview scorers
    when the PR is closed
    ([example](https://github.com/supabase/supabase/pull/43194#issuecomment-4000749847))
- Adds `evals/scorer-online.ts` entry point invoked with `pnpm
scorers:deploy`, registering scorers for online evals in the Braintrust
"Assistant" project
- Refactors scorer code to separate online-compatible scorers
(`scorer-online.ts`) from WASM-dependent ones (`scorer-wasm.ts`)
- "URL Validity" scorer now only checks Supabase domains to prevent
requests to untrusted origins
- Span `input` is now shaped `{ prompt: string }` instead of plain
`string` for compatibility with offline eval scorers
- Env vars `BRAINTRUST_STAGING_PROJECT_ID` and `BRAINTRUST_PROJECT_ID`
configured in GitHub repo settings
- `generateAssistantResponse` now uses `startSpan` + `withCurrent`
instead of `traced()` to manually manage the root span lifecycle. This
ensures `onFinish` logs output to the span _before_ `span.end()` is
called, which is when Braintrust triggers scoring automations (see the
sketch after this list).
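
A minimal sketch of that span lifecycle, assuming the Braintrust SDK's
`initLogger`/`withCurrent` and the AI SDK's `streamText`; the option
shapes, span names, and prompt extraction below are illustrative, not
the actual implementation in `lib/ai/generate-assistant-response`:

```ts
// Sketch under assumptions, not the shipped code.
import { convertToModelMessages, streamText, type LanguageModel, type UIMessage } from 'ai'
import { initLogger, withCurrent } from 'braintrust'

const logger = initLogger({ projectName: 'Assistant' })

export async function generateAssistantResponse({
  model,
  messages,
}: {
  model: LanguageModel
  messages: UIMessage[]
}) {
  // Start the root span by hand instead of wrapping the call in traced(),
  // so we control exactly when it ends.
  const span = logger.startSpan({ name: 'assistant' })

  // Shape the span input as { prompt } so online scorers see the same
  // input shape that the offline eval scorers receive.
  const prompt = messages
    .filter((m) => m.role === 'user')
    .flatMap((m) => m.parts)
    .map((p) => (p.type === 'text' ? p.text : ''))
    .join('\n')
  span.log({ input: { prompt } })

  // Run the model call with `span` as the current span so nested
  // LLM/tool spans attach beneath it.
  return withCurrent(span, () =>
    streamText({
      model,
      messages: convertToModelMessages(messages),
      onFinish: ({ text }) => {
        // Write the output to the span *before* ending it: span.end() is
        // the moment Braintrust triggers online scoring automations.
        span.log({ output: text })
        span.end()
      },
    })
  )
}
```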

### Online Scorers

We share scoring logic across offline and online evals, but some of our
scorers aren't transferable to an "online" setting due to runtime
challenges or ground truth requirements.

**Supported**
- Goal Completion
- Conciseness
- Completeness
- Docs Faithfulness
- URL Validity

**Unsupported**
- Correctness (requires ground truth output)
- Tool Usage (requires ground truth requiredTools)
- SQL Syntax (uses libpg-query WASM)
- SQL Identifier Quoting (uses libpg-query WASM)
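
For reference, a hedged sketch of how `evals/scorer-online.ts` might
register the supported scorers with Braintrust's code-based scorer API
(`projects.create` / `project.scorers.create`); the slugs, handler body,
and Supabase-domain allowlist are assumptions, not the shipped code:

```ts
// Sketch under assumptions; the real entry point is invoked via `pnpm scorers:deploy`.
import * as braintrust from 'braintrust'

// Preview deploys would target the staging project instead (assumption).
const project = braintrust.projects.create({ name: 'Assistant' })

// Only fetch URLs on Supabase-owned domains so scoring never sends
// requests to untrusted origins. The exact allowlist is an assumption.
const ALLOWED_HOSTS = ['supabase.com', 'supabase.io']

function isSupabaseUrl(url: string): boolean {
  try {
    const { hostname } = new URL(url)
    return ALLOWED_HOSTS.some((h) => hostname === h || hostname.endsWith(`.${h}`))
  } catch {
    return false
  }
}

project.scorers.create({
  name: 'URL Validity',
  slug: 'url-validity',
  handler: async ({ output }: { output: string }) => {
    // Extract links from the response, keep only Supabase-domain URLs,
    // and score the fraction that resolve.
    const urls = (output.match(/https?:\/\/[^\s)]+/g) ?? []).filter(isSupabaseUrl)
    if (urls.length === 0) return 1
    const checks = await Promise.all(
      urls.map((u) => fetch(u, { method: 'HEAD' }).then((r) => r.ok).catch(() => false))
    )
    return checks.filter(Boolean).length / urls.length
  },
})

// Goal Completion, Conciseness, Completeness, and Docs Faithfulness are
// registered the same way; their handlers delegate to an LLM judge.
```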
 
### How to use these scorers

Going forward, if you want to add or edit online eval scorers, add the
`preview-scorers` label to a PR. This deploys scorers to the [Assistant
(Staging
Scorers)](https://www.braintrust.dev/app/supabase.io/p/Assistant%20(Staging%20Scorers)?v=Overview)
project in Braintrust with branch-specific slugs, and comments on the PR
([example](https://github.com/supabase/supabase/pull/43194#issuecomment-4000097222)).
From the Braintrust dashboard you can "Test" the scorer with traces from
any project.

<img width="1866" height="528" alt="CleanShot 2026-03-05 at 15 15 00@2x"
src="https://github.com/user-attachments/assets/4f15cebc-3f2d-4e8a-9ee2-fe8ef7bf4199"
/>
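
The slug scheme itself isn't shown here; purely as a hypothetical
illustration, the preview deploy could derive a branch-specific suffix
from the GitHub Actions environment (`GITHUB_HEAD_REF` holds the PR's
head branch):

```ts
// Hypothetical illustration only; the real naming lives in the deploy entry point.
const branch = process.env.GITHUB_HEAD_REF ?? 'local'
const suffix = branch.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '')

export const previewSlug = (base: string) => `${base}-${suffix}`
// e.g. previewSlug('url-validity') === 'url-validity-my-feature-branch'
```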

Once merged, scorers are deployed to the primary
[Assistant](https://www.braintrust.dev/app/supabase.io/p/Assistant)
project, and preview scorers are deleted from the staging project. Down
the road, scorers on the Assistant project will run automatically on a
sample of production traces.

Closes AI-437
2026-03-09 13:05:26 -04:00

51 lines
1.6 KiB
TypeScript

import assert from 'node:assert'
import { openai } from '@ai-sdk/openai'
import { Eval } from 'braintrust'
import { generateAssistantResponse } from 'lib/ai/generate-assistant-response'
import { getMockTools } from 'lib/ai/tools/mock-tools'
import { dataset } from './dataset'
import { buildAssistantEvalOutput } from './output'
import {
  completenessScorer,
  concisenessScorer,
  correctnessScorer,
  docsFaithfulnessScorer,
  goalCompletionScorer,
  toolUsageScorer,
  urlValidityScorer,
} from './scorer'
import { sqlIdentifierQuotingScorer, sqlSyntaxScorer } from './scorer-wasm'

assert(process.env.BRAINTRUST_PROJECT_ID, 'BRAINTRUST_PROJECT_ID is not set')
assert(process.env.OPENAI_API_KEY, 'OPENAI_API_KEY is not set')

Eval('Assistant', {
  projectId: process.env.BRAINTRUST_PROJECT_ID,
  // Three trials per case in CI, one locally
  trialCount: process.env.CI ? 3 : 1,
  data: () => dataset,
  task: async (input) => {
    const result = await generateAssistantResponse({
      model: openai('gpt-5-mini'),
      messages: [{ id: '1', role: 'user', parts: [{ type: 'text', text: input.prompt }] }],
      tools: await getMockTools(input.mockTables ? { list_tables: input.mockTables } : undefined),
    })
    // `result.toolCalls` only reflects the last step, so aggregate tool calls across all steps
    const [finishReason, steps] = await Promise.all([result.finishReason, result.steps])
    return buildAssistantEvalOutput(finishReason, steps)
  },
  scores: [
    toolUsageScorer,
    sqlSyntaxScorer,
    sqlIdentifierQuotingScorer,
    goalCompletionScorer,
    concisenessScorer,
    completenessScorer,
    docsFaithfulnessScorer,
    correctnessScorer,
    urlValidityScorer,
  ],
})