docs(ai): assistant evals development workflow (#45840)

Adds `README.md` to `apps/studio/evals` explaining the development workflow for updating offline and online evals for Studio's AI Assistant. Resolves AI-681  ## Summary by CodeRabbit * **Documentation** * Added comprehensive documentation for Studio Assistant Evals, covering evaluation setup, configuration of scoring methods, and deployment workflows for both offline and online evaluation processes. [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/supabase/supabase/pull/45840)
2026-06-05 04:12:29 +08:00 · 2026-05-12 15:24:29 -04:00
parent d143571586
commit 36152cb7fe
1 changed files with 56 additions and 0 deletions
--- a/apps/studio/evals/README.md
+++ b/apps/studio/evals/README.md
@@ -0,0 +1,56 @@
+# Studio Assistant Evals
+
+We use [Braintrust](https://www.braintrust.dev/) to evaluate Assistant behaviors against a tracked dataset (offline evals) and against live traces (online evals).
+
+## Offline Evals
+
+Add offline eval test cases to `dataset.ts`. If needed, add new scorers (see below) for the specific dimension you wish to test. Expect to update and run offline evals when adding new Assistant behaviors
+
+You may wish to run offline evals when:
+
+- You updated the eval suite with a new test case or scorer
+- You changed Assistant's behavior and want to check for improvements/regressions
+
+### Running Offline Evals in CI
+
+Add the `run-evals` label on a PR to the repo and Braintrust's GitHub Action will run evals and post a summary comment ([example](https://github.com/supabase/supabase/pull/44729)).
+
+You can find detailed results in the "Experiments" tab of the "Assistant" project on Braintrust.
+
+### Running Offline Evals in Local Dev
+
+Within `apps/studio`
+
+```bash
+# To set up WASM files
+pnpm evals:setup
+
+# Run all evals and upload results to Braintrust
+pnpm evals:upload
+
+# Run all evals without uploading results
+pnpm evals:run
+
+# Run an upload single test case
+pnpm braintrust eval evals/assistant.eval.ts --filter "input.prompt=How many projects"
+```
+
+Upload results when you want to inspect Experiments or Logs in the Braintrust dashboard or API. You can use developer tools like [Braintrust MCP](https://www.braintrust.dev/docs/integrations/developer-tools/mcp) or [`bt` CLI](https://www.braintrust.dev/docs/reference/cli/quickstart) to analyze results with an agent.
+
+## Scorers
+
+Scorers look at a `thread` or task `output` and assign a score deterministically or via LLM-as-a-judge. Optionally they can consider `expected` values.
+
+Define scorers in `scorer.ts` and include them in `assistant.eval.ts` to run them in offline evals.
+
+### Updating Online Scorers
+
+Online scorers run as serverless functions on Braintrust infrastructure. They're deployed from the `scorer-online.ts` script. Since these scoring against production traces, they can't rely on ground truth `expected` values. Structure scoring logic and LLM prompts accordingly. Not every scorer needs to be an online scorer.
+
+To opt-in to online scoring, add the scorer to `scorer-online-manifest.json` and add a corresponding handler in `scorer-online.ts`
+
+### Testing & Deploying Online Scorers
+
+Add the `preview-scorers` label to a PR to deploy branch-prefixed scorers to the "Assistant (Staging Scorers)" Braintrust project ([example](https://github.com/supabase/supabase/pull/45654#issuecomment-4398433047)). From that project dashboard, you can manually test the scorer against a trace from any project.
+
+After merge to `master`, preview scorers automatically clean up and deploy to the production in the "Assistant" Braintrust project. Update the "Online Scoring" automation in the Logs page to include the new scorer function.