Files
supabase/.github
Matt Rossman 072883bcec feat: assistant evals (#41311)
* chore: bump `supabase` CLI

* chore: stricter message types in `generate-v4.ts`

* feat: tutorial eval

https://www.braintrust.dev/docs/evaluation

* feat: project ID for eval

* refactor: `generateAssistantResponse` out of `handlePost`

* refactor: generateAssistantResponse to lib/ai

* feat: factuality eval with assistant response

* chore: upgrade braintrust to v1.0.1

* chore: silence tsconfig warning

* feat: assertion scorer

* fix: aggregate tools across all steps

* refactor: strict tool names, remove need for `as const`

* refactor: generic tool name type in assertions

* feat: transfer mocks from `feature/braintrust`

* feat: LLM criteria assertion

* feat: braintrust evals workflow

* fix: BRAINTRUST_PROJECT_ID

* feat: `sql_similar` assertion

* fix: `OPENAI_API_KEY` in workflow env

* feat: split AssertionScorer into separate scorers

* feat: remove tutorial eval

* feat: 20 minute CI timeout

* feat: category in test case metadata

* feat: score with gpt-5

* refactor: dataset to own file, colocate scorers

* feat: "gpt-5.2-2025-12-11" for llm as a judge

* feat: SQL syntax scorer with `libpg-query`

* feat: `evals:setup` and `evals:run` scripts

* feat: `evals:setup` in CI

* feat: human readable scorer names

* chore: rename to "SQL Validity"

* feat: add 2 "sql_generation" test cases

* feat: update requiredTools in test cases

* chore: ignore Cursor MCP config

* feat: "Conciseness" score

* feat: "Completeness" scorer

* fix: generate-v4 test mocks

* feat: serialize "steps" for scorer inputs

* updated node mem options for typecheck

* updated runner

* remove ram update as actions handle this

* feat: read `BRAINTRUST_PROJECT_ID` from secrets

* feat: score helpfulness, remove old scorers

* feat: separate `evals:run` and `evals:upload` scripts

* feat: passthrough entire classifier result

* feat: use live `search_docs` impl, store docs result in metadata

* feat: reduce classifier options

* feat: filter workflow by `run-evals` PR label or `master` branch

* chore: cleanup stubbed mock tools

* fix: checkout actual branch with `ref:`

* fix: capture search_docs results from all content parts

* feat: simplify sql syntax score calculation

* feat: use AI SDK's UI message validator

* docs: justification for relative `extends`

* fix: cleanup leftover validatedMessages

* doc: note mock token isn't secret for snyk

* fix: mock ui message to pass validation

* feat: revert ignoring Cursor MCP config

Using `.git/info/exclude` instead until we have an opinion on this

* feat: add "tsconfig" as shared-data devDependency, revert relative path in tsconfig

* refactor: tool call parsing into function

* Update apps/studio/evals/assistant.eval.ts

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* refactor: organize mock schemas and tool factories

---------

Co-authored-by: Ali Waseem <waseema393@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-12-22 23:45:48 -05:00
..
2025-12-22 23:45:48 -05:00