Mirror of https://github.com/supabase/supabase.git (synced 2026-05-06 22:18:00 +08:00)
Lays groundwork for online evals on Assistant chat logs. https://www.braintrust.dev/docs/observe/score-online

### Changes

- New workflows:
  - `braintrust-scorers-deploy.yml` keeps prod scorers in sync on push to `master`
  - `braintrust-preview-scorers-deploy.yml` deploys preview scorers to the staging project for PRs labeled `preview-scorers`, posting a comment with scorer links ([example](https://github.com/supabase/supabase/pull/43194#issuecomment-4000097222))
  - `braintrust-preview-scorers-cleanup.yml` deletes preview scorers when the PR is closed ([example](https://github.com/supabase/supabase/pull/43194#issuecomment-4000749847))
- Adds an `evals/scorer-online.ts` entry point, invoked with `pnpm scorers:deploy`, that registers scorers for online evals in the Braintrust "Assistant" project (a registration sketch follows below)
- Refactors scorer code to separate online-compatible scorers (`scorer-online.ts`) from WASM-dependent ones (`scorer-wasm.ts`)
- The "URL Validity" scorer now only checks Supabase domains, to prevent requests to untrusted origins (sketched below)
- Span `input` is now shaped `{ prompt: string }` instead of a plain `string`, for compatibility with offline eval scorers
- Env vars `BRAINTRUST_STAGING_PROJECT_ID` and `BRAINTRUST_PROJECT_ID` are configured in GitHub repo settings
- `generateAssistantResponse` now uses `startSpan` + `withCurrent` instead of `traced()` to manually manage the root span lifecycle. This ensures `onFinish` logs output to the span _before_ `span.end()` is called, which is when Braintrust triggers scoring automations (see the span lifecycle sketch below)

### Online Scorers

We share scoring logic across offline and online evals, but some of our scorers aren't transferable to an "online" setting due to runtime challenges or ground-truth requirements.

**Supported**

- Goal Completion
- Conciseness
- Completeness
- Docs Faithfulness
- URL Validity

**Unsupported**

- Correctness (requires ground truth output)
- Tool Usage (requires ground truth `requiredTools`)
- SQL Syntax (uses libpg-query WASM)
- SQL Identifier Quoting (uses libpg-query WASM)

### How to use these scorers

Going forward, if you want to add or edit online eval scorers, add the `preview-scorers` label to a PR. This deploys the scorers to the [Assistant (Staging Scorers)](https://www.braintrust.dev/app/supabase.io/p/Assistant%20(Staging%20Scorers)?v=Overview) project in Braintrust with branch-specific slugs, and comments on the PR ([example](https://github.com/supabase/supabase/pull/43194#issuecomment-4000097222)). From the Braintrust dashboard you can "Test" the scorer with traces from any project.

<img width="1866" height="528" alt="CleanShot 2026-03-05 at 15 15 00@2x" src="https://github.com/user-attachments/assets/4f15cebc-3f2d-4e8a-9ee2-fe8ef7bf4199" />

Once merged, scorers are deployed to the primary [Assistant](https://www.braintrust.dev/app/supabase.io/p/Assistant) project, and preview scorers are deleted from the staging project. Down the road, scorers on the Assistant project will run automatically on a sample of production traces.

Closes AI-437
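For reference, here is a minimal sketch of what a code-defined scorer entry point like `evals/scorer-online.ts` can look like, assuming Braintrust's code-bundling SDK (`projects.create` / `project.scorers.create`) pushed via `braintrust push`. The scorer name, slug, and handler below are illustrative, not the actual contents of the file.

```ts
// Illustrative sketch only; assumes the braintrust SDK's code-bundling API
// (`projects.create` / `project.scorers.create`), pushed with `braintrust push`
// (wrapped here by `pnpm scorers:deploy`).
import * as braintrust from 'braintrust'

const project = braintrust.projects.create({ name: 'Assistant' })

project.scorers.create({
  name: 'Conciseness',
  slug: 'conciseness', // preview deploys would use a branch-specific slug
  description: 'Penalizes unnecessarily long assistant responses',
  // Online scorers only see what was logged to the span, so they can't rely on
  // ground-truth fields like `expected` or `requiredTools`.
  handler: async ({ output }: { output: string }) => {
    // Hypothetical heuristic: the score decreases as the response gets longer.
    const words = output.split(/\s+/).filter(Boolean).length
    return Math.max(0, 1 - words / 1000)
  },
})
```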
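The Supabase-domain restriction on "URL Validity" could look roughly like the following. The allowlist and helper name are illustrative, not the scorer's actual implementation.

```ts
// Illustrative only: the scorer should never issue requests to origins we
// don't control, so limit checks to Supabase-owned hosts.
const ALLOWED_HOSTS = ['supabase.com', 'supabase.io'] // assumed allowlist

function isCheckableSupabaseUrl(rawUrl: string): boolean {
  try {
    const { hostname } = new URL(rawUrl)
    return ALLOWED_HOSTS.some((host) => hostname === host || hostname.endsWith(`.${host}`))
  } catch {
    // Unparseable URLs are simply not checked
    return false
  }
}
```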
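And a rough sketch of the manual span lifecycle described above, assuming the braintrust SDK's top-level `startSpan`/`withCurrent` helpers and the AI SDK's `streamText`. The message shape, `extractPromptText` helper, and logged fields are illustrative; the real signature lives in `lib/ai/generate-assistant-response`.

```ts
import { startSpan, withCurrent } from 'braintrust'
import { streamText, type LanguageModel } from 'ai'

type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }

// Hypothetical helper: pull the user prompt text out of the message list.
const extractPromptText = (messages: ChatMessage[]) =>
  messages
    .filter((message) => message.role === 'user')
    .map((message) => message.content)
    .join('\n')

export function generateAssistantResponse({
  model,
  messages,
}: {
  model: LanguageModel
  messages: ChatMessage[]
}) {
  // Create the root span manually instead of wrapping the call in `traced()`,
  // so we control exactly when it ends.
  const span = startSpan({ name: 'generateAssistantResponse' })

  // Log `input` as `{ prompt: string }` so online scorers see the same shape
  // as the offline eval scorers.
  span.log({ input: { prompt: extractPromptText(messages) } })

  return withCurrent(span, () =>
    streamText({
      model,
      messages,
      onFinish: ({ text }) => {
        // Log the output first, then end the span: ending the root span is
        // when Braintrust triggers online scoring automations.
        span.log({ output: text })
        span.end()
      },
    })
  )
}
```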
51 lines · 1.6 KiB · TypeScript
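// Offline eval for the Assistant: runs each dataset prompt through
// `generateAssistantResponse` with mocked tools and scores the output with all
// scorers, including the ground-truth and WASM-dependent ones that can't run online.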
import assert from 'node:assert'
import { openai } from '@ai-sdk/openai'
import { Eval } from 'braintrust'
import { generateAssistantResponse } from 'lib/ai/generate-assistant-response'
import { getMockTools } from 'lib/ai/tools/mock-tools'

import { dataset } from './dataset'
import { buildAssistantEvalOutput } from './output'
import {
  completenessScorer,
  concisenessScorer,
  correctnessScorer,
  docsFaithfulnessScorer,
  goalCompletionScorer,
  toolUsageScorer,
  urlValidityScorer,
} from './scorer'
import { sqlIdentifierQuotingScorer, sqlSyntaxScorer } from './scorer-wasm'

assert(process.env.BRAINTRUST_PROJECT_ID, 'BRAINTRUST_PROJECT_ID is not set')
assert(process.env.OPENAI_API_KEY, 'OPENAI_API_KEY is not set')

Eval('Assistant', {
  projectId: process.env.BRAINTRUST_PROJECT_ID,
  trialCount: process.env.CI ? 3 : 1,
  data: () => dataset,
  task: async (input) => {
    const result = await generateAssistantResponse({
      model: openai('gpt-5-mini'),
      messages: [{ id: '1', role: 'user', parts: [{ type: 'text', text: input.prompt }] }],
      tools: await getMockTools(input.mockTables ? { list_tables: input.mockTables } : undefined),
    })

    // `result.toolCalls` only shows the last step, instead aggregate tools across all steps
    const [finishReason, steps] = await Promise.all([result.finishReason, result.steps])

    return buildAssistantEvalOutput(finishReason, steps)
  },
  scores: [
    toolUsageScorer,
    sqlSyntaxScorer,
    sqlIdentifierQuotingScorer,
    goalCompletionScorer,
    concisenessScorer,
    completenessScorer,
    docsFaithfulnessScorer,
    correctnessScorer,
    urlValidityScorer,
  ],
})