mirror of
https://github.com/linshenkx/prompt-optimizer.git
synced 2026-05-06 21:50:27 +08:00
feat(evaluation): migrate compare rewrite protocol and add calibration artifacts
This commit is contained in:
@@ -9,8 +9,10 @@
|
||||
1. [current-spec.md](./current-spec.md)
|
||||
2. [manual-acceptance.md](./manual-acceptance.md)
|
||||
3. [manual-test-playbook.md](./manual-test-playbook.md)
|
||||
4. [real-api-samples/review-summary.md](./real-api-samples/review-summary.md)
|
||||
5. `real-api-samples/*/rendered-messages.md`
|
||||
4. [protocol-migration-minimal-plan.md](./protocol-migration-minimal-plan.md)
|
||||
5. [auto-compare-rewrite-effect-analysis.md](./auto-compare-rewrite-effect-analysis.md)
|
||||
6. [real-api-samples/review-summary.md](./real-api-samples/review-summary.md)
|
||||
7. `real-api-samples/*/rendered-messages.md`
|
||||
|
||||
## 当前目录结构
|
||||
|
||||
@@ -23,6 +25,12 @@
|
||||
- `manual-test-playbook.md`
|
||||
当前最适合直接照着操作的一份手测步骤文档。
|
||||
如果你要逐步验证 compare 阶段功能,优先看这份。
|
||||
- `protocol-migration-minimal-plan.md`
|
||||
compare / rewrite 从 Markdown 协议层迁移到 JSON payload 协议层的最小实现方案与落地说明。
|
||||
如果你准备继续做协议层收敛或复盘这次迁移,优先看这份。
|
||||
- `auto-compare-rewrite-effect-analysis.md`
|
||||
基于真实 calibration 产物整理的效果分析。
|
||||
如果你想判断“当前自动对比评估 + 智能改写到底有没有实际价值”,优先看这份。
|
||||
- `real-api-samples/`
|
||||
真实模型请求样例。
|
||||
这是判断“现在到底发了什么给模型”的最高优先级证据。
|
||||
@@ -37,6 +45,7 @@
|
||||
- 右侧多个输出一起比的是“对比评估”。
|
||||
- 左侧只看设计态输入,右侧只看执行态证据。
|
||||
- 文本模式当前主线已经基本完成。
|
||||
- compare / rewrite 的 LLM 协议层已经迁移为“规则说明 + JSON payload”,Markdown 现在主要保留给 docs / calibration 调试视图。
|
||||
- image 右侧评估链路仍未纳入本轮范围。
|
||||
|
||||
## 当前真实样例覆盖
|
||||
|
||||
@@ -0,0 +1,322 @@
|
||||
# 自动对比评估优化链路效果分析
|
||||
|
||||
> 基于 `2026-03-22` 这一轮真实 calibration 产物整理。
|
||||
> 分析对象是当前 compare 阶段与“评估后智能改写”链路的实际表现,不包含后续多轮 SPO 自动迭代。
|
||||
|
||||
## 1. 这份文档要回答什么
|
||||
|
||||
我们当前已经把自动优化拆成了这条链路:
|
||||
|
||||
1. 执行测试
|
||||
2. 结构化对比评估
|
||||
3. 压缩评估证据
|
||||
4. 基于评估进行智能改写
|
||||
|
||||
这份文档主要回答三个问题:
|
||||
|
||||
- 当前 compare 本身的判断质量如何
|
||||
- 当前 rewrite 是否真的能根据 compare 结果做出有价值的 prompt 改写
|
||||
- 这条链路距离“可放心自动多轮迭代”还有多远
|
||||
|
||||
## 2. 结论先说
|
||||
|
||||
当前链路已经具备了明显的实用价值,但更适合做“保守纠错型优化”,还不适合直接承担“强自动增益型优化”。
|
||||
|
||||
更具体地说:
|
||||
|
||||
- compare 已经达到“可作为自动优化上游判断器”的水平
|
||||
- rewrite 已经达到“能修坏改动、能回收过拟合、能恢复 contract”的水平
|
||||
- 整条链路尚未达到“可以放心多轮自动迭代”的水平
|
||||
|
||||
## 3. 本轮真实证据范围
|
||||
|
||||
本轮主要看了 1 个 live case 和 5 个跨主题 synthetic case:
|
||||
|
||||
- `live-basic-system-boundary-control`
|
||||
- `synthetic-medical-latent-trigger-overfit`
|
||||
- `synthetic-ecommerce-schema-no-model-worship`
|
||||
- `synthetic-legal-flat-not-unclear`
|
||||
- `synthetic-teaching-overfit-regression`
|
||||
- `synthetic-hiring-replica-semantic-instability`
|
||||
|
||||
总览见:
|
||||
|
||||
- `structured-compare-calibration/latest/summary.md`
|
||||
|
||||
其中 synthetic case 的命中情况为:
|
||||
|
||||
- 医疗分诊 latent overfit:`3/5`
|
||||
- 电商 schema / contract:`6/6`
|
||||
- 法务 flat vs unclear:`3/3`
|
||||
- 教学 overfit regression:`6/6`
|
||||
- 招聘 replica instability:`4/4`
|
||||
|
||||
## 4. 当前已经验证到的能力
|
||||
|
||||
### 4.1 compare 能稳定识别“硬问题”
|
||||
|
||||
当前 compare 最可靠的能力,是识别那些有明确结构语义的问题:
|
||||
|
||||
- schema / contract 漂移
|
||||
- output boundary 漂移
|
||||
- 样例贴合型过拟合
|
||||
- `flat` 与 `unclear` 的区分
|
||||
- 单次收益与 replica 稳定性的区分
|
||||
|
||||
这意味着 compare 不再只是“泛泛说哪个好一点”,而是已经能承担自动化流程中的“风险门控”职责。
|
||||
|
||||
### 4.2 rewrite 能把明显坏改动拉回安全区
|
||||
|
||||
当前 rewrite 的最好表现,不是把 prompt 优化得多惊艳,而是它已经能:
|
||||
|
||||
- 恢复被 workspace 改坏的 schema / contract
|
||||
- 删除明显针对单一样例的硬编码规则
|
||||
- 把 prompt 拉回更通用、更稳的结构
|
||||
|
||||
这类能力对于后续 SPO 很关键,因为它意味着自动优化链路至少具备“不会一路越改越歪”的基础。
|
||||
|
||||
### 4.3 live case 说明系统已经有“不过度乱改”的倾向
|
||||
|
||||
在真实 `basic-system` live case 中,compare 给出的结论是:
|
||||
|
||||
- `targetVsBaseline = improved`
|
||||
- `targetVsReferenceGap = none`
|
||||
- `stopRecommendation = review`
|
||||
|
||||
而 rewrite 基本保持了当前 workspace prompt 的核心结构,没有凭空做大幅重写。
|
||||
|
||||
这说明当前 rewrite 至少有一定保守性,不会在已经比较好的 prompt 上随意做大动作。
|
||||
|
||||
## 5. 典型正向案例
|
||||
|
||||
### 5.1 电商 schema case:已经具备“先修硬伤,再吸收优点”的能力
|
||||
|
||||
对应样本:
|
||||
|
||||
- `structured-compare-calibration/latest/synthetic-ecommerce-schema-no-model-worship/summary.md`
|
||||
|
||||
这里的关键点是:
|
||||
|
||||
- target/workspace 改坏了字段名和顶层结构
|
||||
- teacher/workspace 的文案表达更好看
|
||||
- compare 仍然优先判定 schema / contract 回退
|
||||
|
||||
更重要的是 rewrite 的行为:
|
||||
|
||||
- 没有继续保留坏 contract
|
||||
- 先恢复 `title / selling_points / cautions`
|
||||
- 再吸收 teacher 那些更概括、更有卖点的表达策略
|
||||
|
||||
这说明当前链路已经不是“谁写得更像大模型就跟谁学”,而是已经具备了较强的结构优先级。
|
||||
|
||||
### 5.2 教学 case:已经能识别“像样例答案”不等于“更好的 prompt”
|
||||
|
||||
对应样本:
|
||||
|
||||
- `structured-compare-calibration/latest/synthetic-teaching-overfit-regression/summary.md`
|
||||
|
||||
这里的 target/workspace 做的是很多自动优化系统最容易犯的错:
|
||||
|
||||
- 为当前题目写一个很像“正确答案”的口诀
|
||||
- 但丢掉了通用原理
|
||||
- 结果在当前样例上更顺口,在泛化上却更差
|
||||
|
||||
compare 在这里给出了比较准确的判断:
|
||||
|
||||
- `targetVsBaseline = regressed`
|
||||
- `overfitRisk = high`
|
||||
- `referenceBaseline = unsupported`
|
||||
|
||||
rewrite 最终也把 prompt 改回了“先讲通用原理,再讲题目”的结构。
|
||||
|
||||
这说明当前链路在“反过拟合纠偏”上已经具备较好的实战价值。
|
||||
|
||||
### 5.3 招聘 case:compare 已能识别“单次赢了,但不稳定”
|
||||
|
||||
对应样本:
|
||||
|
||||
- `structured-compare-calibration/latest/synthetic-hiring-replica-semantic-instability/summary.md`
|
||||
|
||||
这里的关键点是:
|
||||
|
||||
- target/workspace 比 previous 更结构化
|
||||
- 单次输出看起来更好
|
||||
- 但 replica 给出了不同的 `recommendation`
|
||||
|
||||
compare 能正确识别:
|
||||
|
||||
- `targetVsBaseline = improved`
|
||||
- `targetReplica = unstable`
|
||||
- `stopRecommendation = review`
|
||||
|
||||
这说明 compare 已经不会把“单次跑得好”直接当成“稳定改进”。
|
||||
|
||||
这是后续做自动迭代时非常重要的一道安全阀。
|
||||
|
||||
## 6. 当前最明显的不足
|
||||
|
||||
### 6.1 rewrite 仍然更擅长“修格式”,不擅长“修决策边界”
|
||||
|
||||
在招聘 instability case 里,compare 已经知道问题本质是:
|
||||
|
||||
- recommendation 的语义决策不稳定
|
||||
|
||||
但 rewrite 产出的新 prompt 更偏向加强:
|
||||
|
||||
- JSON 结构约束
|
||||
- 字段枚举值
|
||||
- 分析必须具体
|
||||
|
||||
这些改动不是错,但它们更像是在修“格式一致性”,不是在修“为什么同一份证据会从 `hold` 漂到 `hire`”。
|
||||
|
||||
也就是说:
|
||||
|
||||
- compare 已经能发现语义级稳定性问题
|
||||
- rewrite 还不太会针对这种问题做深层修复
|
||||
|
||||
### 6.2 flat 场景下,rewrite 还缺少足够克制的 no-op 策略
|
||||
|
||||
补充说明:
|
||||
|
||||
- 这份分析完成后,代码里已经补上了第一版 `rewriteGuidance` gating
|
||||
- 当前规则会让 `flat + no-gap` 更倾向于 `skip`
|
||||
- 以及让 `improved + no-gap + low-headroom` 更倾向于 `minor-rewrite`
|
||||
|
||||
所以下面这段问题描述,主要代表“本轮真实 calibration 暴露出来的原始问题”,而不是当前代码仍然完全没有处理。
|
||||
|
||||
在法务 case 中,compare 的判断是比较理想的:
|
||||
|
||||
- `targetVsBaseline = flat`
|
||||
- `targetVsReferenceGap = none`
|
||||
|
||||
这说明当前 workspace prompt 其实没有明显需要继续改的地方。
|
||||
|
||||
但 rewrite 仍然生成了一版“更简洁、更可操作、避免特定词过拟合”的新 prompt。
|
||||
|
||||
这个结果不算错,但它暴露出一个问题:
|
||||
|
||||
- 当前系统还不太会在“其实没必要改”时选择少改或不改
|
||||
|
||||
如果后面直接把这条链路接到多轮自动迭代中,这种轻微但持续的无效改写,会慢慢把 prompt 推向不必要的复杂化。
|
||||
|
||||
### 6.3 当前还没有完成“改写后再验证”的闭环证明
|
||||
|
||||
现在已经证明了两件事:
|
||||
|
||||
- compare 的判断经常是合理的
|
||||
- rewrite 的输出方向经常是合理的
|
||||
|
||||
但还没有完整证明第三件事:
|
||||
|
||||
- rewrite 之后重新执行测试,结果是否真的更好
|
||||
|
||||
也就是说,当前还主要证明了“它会提出像样的修改”,尚未完全证明“它会稳定带来实测收益”。
|
||||
|
||||
这也是为什么当前更适合把它看成“自动辅助优化器”,而不是“自动闭环优化器”。
|
||||
|
||||
## 7. 医疗 case 带来的重要新认知
|
||||
|
||||
对应样本:
|
||||
|
||||
- `structured-compare-calibration/latest/synthetic-medical-latent-trigger-overfit/summary.md`
|
||||
|
||||
这个 case 原本更像是想测试:
|
||||
|
||||
- “看起来有样例收益,但其实过拟合风险很高”
|
||||
|
||||
但当前 compare 实际给出的结论更强:
|
||||
|
||||
- `targetVsBaseline = regressed`
|
||||
- `overfitRisk = high`
|
||||
- `stopRecommendation = review`
|
||||
|
||||
这说明在医疗这类高风险主题里,当前 compare 会显著更保守。
|
||||
|
||||
这个现象我认为总体是正向的,原因有两点:
|
||||
|
||||
- 它没有被“当前样例上更像对题作答”骗过去
|
||||
- 它开始表现出一定领域风险敏感性
|
||||
|
||||
这也意味着我们的 calibration 样本已经不再只是验证“提示词会不会跑通”,而是真的能测出 judge / synthesis 的边界风格。
|
||||
|
||||
## 8. 当前适合做什么,不适合做什么
|
||||
|
||||
### 8.1 当前已经适合
|
||||
|
||||
- 作为 compare 阶段的自动判断器
|
||||
- 自动识别明显回退
|
||||
- 自动识别 schema / contract 漂移
|
||||
- 自动识别样例过拟合
|
||||
- 自动把 workspace prompt 拉回更稳、更通用的版本
|
||||
|
||||
### 8.2 当前还不适合
|
||||
|
||||
- 直接做多轮无人值守自动迭代
|
||||
- 遇到 `flat` 场景时持续自动改写
|
||||
- 依赖当前 rewrite 去修复杂的语义级稳定性问题
|
||||
- 在没有复测闭环的情况下宣称“自动优化成功”
|
||||
|
||||
## 9. 对下一阶段实现的建议
|
||||
|
||||
### 9.1 优先补 rewrite gating
|
||||
|
||||
状态更新:
|
||||
|
||||
- 这一项已经落地了第一版
|
||||
- 当前是保守规则,不是最终形态
|
||||
- 后续重点应从“有没有 gating”转向“gating 是否足够细、是否与 UI / SPO 编排联动”
|
||||
|
||||
建议增加“是否值得改写”的显式门控逻辑。
|
||||
|
||||
一个比较合适的第一版规则是:
|
||||
|
||||
- 当 `targetVsBaseline = flat`
|
||||
- 且 `targetVsReferenceGap = none`
|
||||
- 且 `improvementHeadroom` 不是 `high`
|
||||
|
||||
则默认不自动改写,或至少进入“建议不改写”的保守分支。
|
||||
|
||||
这能减少 flat case 下的无效扰动。
|
||||
|
||||
### 9.2 为 instability 单独设计 rewrite 策略
|
||||
|
||||
状态更新:
|
||||
|
||||
- 这一项已经落地第一版专项指令
|
||||
- 当前通过 `rewriteGuidance.focusAreas / priorityMoves` 把 instability 传给 rewrite
|
||||
- 现在 rewrite 至少会被明确要求去补“判定标准 / tie-break / 保守默认规则”
|
||||
|
||||
当前 instability 已经能被 compare 看见,但 rewrite 还不会针对性修复。
|
||||
|
||||
后续应明确让 rewrite 学会根据 instability 去补:
|
||||
|
||||
- recommendation / decision 的判定优先级
|
||||
- tie-break 规则
|
||||
- 证据不足时的默认落点
|
||||
- 关键字段之间的约束关系
|
||||
|
||||
否则它会一直停留在“加强格式约束”的浅层修补。
|
||||
|
||||
### 9.3 把“改写后复测”纳入正式验收
|
||||
|
||||
真正的自动优化效果,不应该只看 compare 和 rewrite 的文字质量。
|
||||
|
||||
后续一旦进入 SPO 或一轮自动迭代,应该把以下链路作为正式验收对象:
|
||||
|
||||
1. 先执行测试
|
||||
2. compare
|
||||
3. rewrite
|
||||
4. 用 rewrite 结果重新执行测试
|
||||
5. 再 compare 一次
|
||||
|
||||
只有做到这一步,我们才能更有把握地回答:
|
||||
|
||||
- 这条链路到底是在产生真实收益
|
||||
- 还是只是在生成听起来合理的“优化建议”
|
||||
|
||||
## 10. 最后一句判断
|
||||
|
||||
如果只问一句“当前自动对比评估优化效果怎么样”,我的判断是:
|
||||
|
||||
它已经成功跨过了“只能演示”的阶段,进入了“可以作为真实优化系统基础能力”的阶段;
|
||||
但距离“稳定可靠的自动多轮优化器”,还差 rewrite gating、instability 定向改写,以及改写后复测闭环这三块关键能力。
|
||||
@@ -114,6 +114,11 @@
|
||||
- 复用 iterate 链路的增强通用重写能力
|
||||
- 重写输入会对评估结果做去重、分层与 compare 焦点压缩
|
||||
- 重写输入已显式纳入 `conflictSignals`
|
||||
- 重写输入已新增 machine-readable `rewriteGuidance`
|
||||
- 当前已落地第一版 rewrite gating:
|
||||
- `flat + no-gap` 场景默认倾向 `skip`
|
||||
- `improved + no-gap + low-headroom` 场景默认倾向 `minor-rewrite`
|
||||
- 仍存在明显回退 / 不稳定 / 不被支持改动时,继续走 `rewrite`
|
||||
- 当前仍未落地的,主要是上层复用能力:
|
||||
- 更独立的通用智能重写协议 / 模板
|
||||
|
||||
@@ -316,6 +321,9 @@
|
||||
- 结果面板已支持基于整份评估结果的一键“智能重写”,并直接复用 iterate 版本链路。
|
||||
- 智能重写当前会显式消费 `compareStopSignals + compareInsights + conflictSignals`。
|
||||
- compare 结果元数据在 UI 侧已统一抽成共享消费模块,避免 `useEvaluation / EvaluationPanel / rewrite` 多处漂移。
|
||||
- rewrite payload 当前已包含 `rewriteGuidance.recommendation`,用于约束 `skip / minor-rewrite / rewrite` 三类行为。
|
||||
- rewrite payload 当前还会附带 `rewriteGuidance.focusAreas / priorityMoves`,用于把 `instability / contract-repair / generalization` 转成更可执行的专项改写指令。
|
||||
- UI 侧当前已识别 `rewriteGuidance.recommendation = skip`,会在“智能重写”入口直接短路,不再无意义发起 iterate 请求。
|
||||
|
||||
## 8. 当前剩余问题
|
||||
|
||||
|
||||
@@ -0,0 +1,508 @@
|
||||
# Compare / Rewrite 协议层迁移最小方案
|
||||
|
||||
> 目标:把 compare / rewrite 发送给 LLM 的“机器协议层”从 Markdown 拼接,迁移为“少量自然语言说明 + JSON payload 证据层”。
|
||||
> 约束:尽量不扩大 compare 主能力范围,不改用户可见功能语义;优先降低边界模糊、fence 嵌套、schema 漂移、消息包装漂移。
|
||||
|
||||
## 当前状态
|
||||
|
||||
- 已落地:`pairwise judge`、`structured compare synthesis`、`rewrite-from-evaluation` 都已切到“规则说明 + JSON payload”协议。
|
||||
- 已落地:rewrite payload 现在额外包含 machine-readable `rewriteGuidance`,用于表达 `skip / minor-rewrite / rewrite` 的第一版 gating 结论。
|
||||
- 已保留:Markdown 渲染函数没有删除,继续作为 docs / calibration 的 debug 辅助视图。
|
||||
- 已验证:本地单测、`@prompt-optimizer/core build`、`pnpm compare:calibrate` 已跑通。
|
||||
- 当前 calibration 结果:
|
||||
- `synthetic-schema-drift-regression`: `4/4`
|
||||
- `synthetic-cosmetic-regression`: `3/3`
|
||||
- `synthetic-replica-instability`: `3/3`
|
||||
- `synthetic-overfit-risk`: `3/4`
|
||||
- 当前 rewrite 输出已确认不再出现 code fence、`role/content` 包装或消息数组包装。
|
||||
|
||||
## 1. 背景与问题
|
||||
|
||||
当前 compare / rewrite 链路里,发送给 LLM 的核心输入大量依赖 Markdown 结构:
|
||||
|
||||
- `pairwise judge` 使用:
|
||||
- `roleBindingsMarkdown`
|
||||
- `renderedTestCasesMarkdown`
|
||||
- `renderedLeftSnapshotMarkdown`
|
||||
- `renderedRightSnapshotMarkdown`
|
||||
- `synthesis` 使用:
|
||||
- `roleBindingsMarkdown`
|
||||
- `synthesisHintsMarkdown`
|
||||
- `judgeResultsMarkdown`
|
||||
- `rewrite-from-evaluation` 虽然已经补上了 `workspacePrompt` / `referencePrompt`,但整体仍是“自然语言规则 + 文本分段”的组织方式。
|
||||
|
||||
这会带来四类问题:
|
||||
|
||||
- 协议层与证据正文都使用 Markdown,边界不清。
|
||||
- 被评估 prompt / 输出本身也常包含 Markdown、代码块、标题、列表,LLM 很容易误判层级。
|
||||
- 结构化 compare 的判断阶段,原本应识别为“正文中的边界违例”,却可能被当成“上层格式的一部分”。
|
||||
- rewrite 阶段容易把提示词正文包成代码块、`role/content` 对象、消息数组,或者错误继承正文里的展示包装。
|
||||
|
||||
一句话概括:
|
||||
|
||||
**Markdown 适合作为展示层,不适合作为机器协议层。**
|
||||
|
||||
## 2. 改造目标
|
||||
|
||||
本次最小迁移只做一件事:
|
||||
|
||||
- **LLM 真正收到的协议层改成 JSON payload**
|
||||
|
||||
同时保留:
|
||||
|
||||
- docs / calibration / real-api-samples 中现有 Markdown 调试产物
|
||||
- 现有 compare 能力边界与 UI 行为
|
||||
- 现有 `EvaluationService` 的整体调用时序
|
||||
|
||||
即:
|
||||
|
||||
- 面向模型:结构化 payload
|
||||
- 面向人看:Markdown 渲染视图
|
||||
|
||||
## 3. 迁移原则
|
||||
|
||||
### 3.1 协议分层
|
||||
|
||||
以后每条 compare / rewrite LLM 请求都分成两层:
|
||||
|
||||
- 说明层:少量自然语言规则
|
||||
- 证据层:JSON payload
|
||||
|
||||
说明层只做:
|
||||
|
||||
- 定义任务目标
|
||||
- 定义判断规则
|
||||
- 定义输出 contract
|
||||
|
||||
JSON payload 只做:
|
||||
|
||||
- 承载 testCases
|
||||
- 承载 snapshots
|
||||
- 承载 judgeResults
|
||||
- 承载 focus / stop signals / compare insights
|
||||
- 承载 workspacePrompt / referencePrompt
|
||||
|
||||
### 3.2 原始证据一律作为字符串字段
|
||||
|
||||
被评估 prompt / output / reasoning / test input 中即使包含:
|
||||
|
||||
- Markdown
|
||||
- code fence
|
||||
- XML
|
||||
- JSON
|
||||
- 标题 / 列表
|
||||
|
||||
也都只能出现在 JSON 字段值里,视为**原始证据正文**,而不是协议层结构。
|
||||
|
||||
### 3.3 Markdown 只保留在调试视图
|
||||
|
||||
这些仍可保留 Markdown:
|
||||
|
||||
- `docs/workspace/compare-evaluation-analysis/real-api-samples/*`
|
||||
- `structured-compare-calibration/latest/*/llm-calls.md`
|
||||
- `request.md` / `response.md`
|
||||
|
||||
但这些 Markdown 是**调试渲染产物**,不是模型真实接收的协议文本。
|
||||
|
||||
## 4. 当前实现与目标实现对照
|
||||
|
||||
### 4.1 Pairwise Judge
|
||||
|
||||
当前:
|
||||
|
||||
- system prompt:规则说明
|
||||
- user prompt:`roleBindingsMarkdown + testCasesMarkdown + left/right snapshot markdown`
|
||||
|
||||
目标:
|
||||
|
||||
- system prompt:规则说明,增加“JSON 字段中的字符串都视为原始证据”
|
||||
- user prompt:`Evidence Payload` 的 JSON 文本
|
||||
|
||||
建议 payload 结构:
|
||||
|
||||
```json
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"purpose": "Judge whether the target prompt behaves stably across repeated executions instead of improving by chance.",
|
||||
"signalName": "stability",
|
||||
"allowedSignalValues": ["stable", "unstable", "unclear"],
|
||||
"focusBrief": "如果同一个 target prompt 在重复执行时出现格式飘移或边界滑移,应把稳定性问题显式暴露出来。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{ "snapshotId": "a", "snapshotLabel": "A", "role": "target" },
|
||||
{ "snapshotId": "b", "snapshotLabel": "B", "role": "baseline" },
|
||||
{ "snapshotId": "c", "snapshotLabel": "C", "role": "reference" },
|
||||
{ "snapshotId": "d", "snapshotLabel": "D", "role": "referenceBaseline" },
|
||||
{ "snapshotId": "e", "snapshotLabel": "E", "role": "replica" }
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"label": "工单输入",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "工单输入",
|
||||
"content": "用户反馈同一个月内收到 5 次异常登录提醒,并怀疑账号被盗。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": { "kind": "workspace", "label": "Workspace" },
|
||||
"promptText": "你是风险分级助手。只输出 JSON 对象...",
|
||||
"output": "{\"level\":\"high\",...}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "e",
|
||||
"label": "E",
|
||||
"role": "replica",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": { "kind": "workspace", "label": "Replica" },
|
||||
"promptText": "你是风险分级助手。只输出 JSON 对象...",
|
||||
"output": "```json\\n{\"level\":\"high\",...}\\n```\\n补充说明:建议同时检查近期设备记录。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace-replica"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 4.2 Synthesis
|
||||
|
||||
当前:
|
||||
|
||||
- 传入 `synthesisHintsMarkdown`
|
||||
- 再把 `judgeResultsMarkdown` 拼接进去
|
||||
|
||||
目标:
|
||||
|
||||
- system prompt:保留综合规则
|
||||
- user prompt:传入一个 `Synthesis Payload`
|
||||
|
||||
建议 payload 结构:
|
||||
|
||||
```json
|
||||
{
|
||||
"scenario": {
|
||||
"roleName": "Structured System Prompt Compare Synthesizer",
|
||||
"subjectLabel": "system prompt",
|
||||
"sharedCompareInputs": true,
|
||||
"samePromptAcrossSnapshots": true,
|
||||
"crossModelComparison": true,
|
||||
"focusBrief": "优先判断改动是否真正减少额外解释与格式滑移。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{ "snapshotId": "a", "snapshotLabel": "A", "role": "target" },
|
||||
{ "snapshotId": "b", "snapshotLabel": "B", "role": "baseline" },
|
||||
{ "snapshotId": "c", "snapshotLabel": "C", "role": "reference" },
|
||||
{ "snapshotId": "d", "snapshotLabel": "D", "role": "referenceBaseline" }
|
||||
],
|
||||
"deterministicHints": {
|
||||
"signalSnapshot": {
|
||||
"progress": "improved",
|
||||
"gap": "none",
|
||||
"promptValidity": "supported",
|
||||
"stability": "unstable"
|
||||
},
|
||||
"derivedStopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "none",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review"
|
||||
},
|
||||
"learnableSignals": [
|
||||
"在提示词中明确使用“只输出 JSON 对象”并列出字段名,可以稳定输出格式。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 在 Replica 测试中出现 JSON 外补充说明。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"improvementUnstableAcrossReplicas",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
},
|
||||
"judgeResults": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "..."
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### 4.3 Rewrite From Evaluation
|
||||
|
||||
当前:
|
||||
|
||||
- 规则说明
|
||||
- `workspacePrompt` / `referencePrompt` 文本块
|
||||
- `result.summary` / `improvements` / `compareInsights` 等文本块
|
||||
|
||||
目标:
|
||||
|
||||
- system 或 user prompt 顶部保留重写规则
|
||||
- 下方传一个 `Rewrite Payload`
|
||||
|
||||
建议 payload 结构:
|
||||
|
||||
```json
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"subjectLabel": "系统提示词",
|
||||
"overallScore": 65
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是风险分级助手。只输出一个 JSON 对象...",
|
||||
"referencePrompt": "你是风险分级助手。输出 level, rationale, next_action。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target 相比 Baseline 有进步,但 Replica 暴露出格式漂移。",
|
||||
"improvements": [
|
||||
"在提示词中明确使用“只输出 JSON 对象”并列出字段格式。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "none",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review"
|
||||
},
|
||||
"compareInsights": {
|
||||
"progressSummary": { "...": "..." },
|
||||
"stabilitySummary": { "...": "..." },
|
||||
"conflictSignals": [
|
||||
"improvementUnstableAcrossReplicas",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## 5. 最小代码改造范围
|
||||
|
||||
### 5.1 第一批必改
|
||||
|
||||
#### A. `packages/core/src/services/evaluation/structured-compare-prompts.ts`
|
||||
|
||||
当前职责:
|
||||
|
||||
- 组装 `pairwise judge` / `synthesis` 模板上下文
|
||||
|
||||
要改成:
|
||||
|
||||
- 新增 payload builder
|
||||
- 不再要求上层先把证据渲染成 Markdown 字符串
|
||||
|
||||
建议新增函数:
|
||||
|
||||
- `buildStructuredComparePairJudgePayload()`
|
||||
- `buildStructuredCompareSynthesisPayload()`
|
||||
|
||||
对应新的 params 类型:
|
||||
|
||||
- `StructuredComparePairJudgePayloadParams`
|
||||
- `StructuredCompareSynthesisPayloadParams`
|
||||
|
||||
#### B. `packages/core/src/services/template/default-templates/evaluation-structured-compare/*`
|
||||
|
||||
当前模板里有很多:
|
||||
|
||||
- `roleBindingsMarkdown`
|
||||
- `renderedTestCasesMarkdown`
|
||||
- `judgeResultsMarkdown`
|
||||
|
||||
要改成:
|
||||
|
||||
- `pairJudgePayloadJson`
|
||||
- `synthesisPayloadJson`
|
||||
|
||||
并在 system prompt 中明确写:
|
||||
|
||||
- payload 中的字符串字段全部视为原始证据
|
||||
- 不要把字段值中的 Markdown / code fence 当成协议层结构
|
||||
|
||||
#### C. `packages/core/src/services/evaluation/service.ts`
|
||||
|
||||
当前:
|
||||
|
||||
- 先把 snapshot/testCase 渲染成 markdown,再传给 builder
|
||||
|
||||
要改成:
|
||||
|
||||
- 保留当前的 normalize / role / judgePlan 逻辑
|
||||
- 只替换“消息构造层”
|
||||
|
||||
也就是说:
|
||||
|
||||
- `renderStructuredCompareRoleBindings()`
|
||||
- `renderStructuredCompareJudgeResults()`
|
||||
- `renderStructuredCompareSynthesisHints()`
|
||||
|
||||
这些函数可以继续保留给 debug view 用
|
||||
|
||||
但真正给 LLM 的 builder 改走 JSON payload。
|
||||
|
||||
### 5.2 第二批建议改
|
||||
|
||||
#### D. `packages/core/src/services/evaluation/rewrite-from-evaluation.ts`
|
||||
|
||||
当前:
|
||||
|
||||
- 已经有了 `workspacePrompt` / `referencePrompt`
|
||||
- 但输出还是一整段自然语言拼接
|
||||
|
||||
建议改成:
|
||||
|
||||
- `buildRewritePayload()`
|
||||
- 模板只渲染:
|
||||
- 规则说明
|
||||
- `Rewrite Payload` JSON
|
||||
|
||||
### 5.3 暂时不改
|
||||
|
||||
- UI 展示组件
|
||||
- compare result 面板结构
|
||||
- calibration 文档目录结构
|
||||
- `request.md` / `response.md` / `llm-calls.md` 的 Markdown 导出方式
|
||||
|
||||
## 6. 如何保留当前调试体验
|
||||
|
||||
为避免“协议层升级后,人类不易读”,建议并行保留两个输出:
|
||||
|
||||
- 面向模型:
|
||||
- `pairJudgePayloadJson`
|
||||
- `synthesisPayloadJson`
|
||||
- `rewritePayloadJson`
|
||||
- 面向人:
|
||||
- `rendered-messages.md`
|
||||
- `request.md`
|
||||
- `llm-calls.md`
|
||||
|
||||
也就是:
|
||||
|
||||
- 模型看到 JSON payload
|
||||
- 文档仍然渲染成人可读 Markdown
|
||||
|
||||
这样不会影响:
|
||||
|
||||
- 真实 API 样本对照
|
||||
- calibration case 复盘
|
||||
- 手工调 prompt 的可读性
|
||||
|
||||
## 7. 对测试与校准的影响
|
||||
|
||||
### 7.1 单测
|
||||
|
||||
要更新的测试主要有两类:
|
||||
|
||||
- `packages/core/tests/unit/evaluation/structured-compare-prompts.test.ts`
|
||||
- 从断言“出现某个 Markdown 片段”
|
||||
- 改为断言“出现某个 payload JSON key”
|
||||
- `packages/core/tests/unit/evaluation/rewrite-from-evaluation.test.ts`
|
||||
- 从断言“某段自然语言存在”
|
||||
- 改为断言:
|
||||
- 存在 `workspacePrompt`
|
||||
- 存在 `referencePrompt`
|
||||
- 存在 `compressedEvaluation`
|
||||
- 存在 contract / raw prompt text 的规则
|
||||
|
||||
### 7.2 Calibration
|
||||
|
||||
`scripts/run-structured-compare-calibration.mjs` 不需要改业务流程,只需:
|
||||
|
||||
- 保存新的 payload 原文
|
||||
- docs 里继续保留 markdown 渲染版
|
||||
|
||||
建议新增产物:
|
||||
|
||||
- `pair-judge-payload.json`
|
||||
- `synthesis-payload.json`
|
||||
- `rewrite-payload.json`
|
||||
|
||||
这样以后复盘时可以直接看机器协议层是否干净。
|
||||
|
||||
## 8. 推荐实施顺序
|
||||
|
||||
### Phase 1:Pairwise Judge 协议化
|
||||
|
||||
只改:
|
||||
|
||||
- `structured-compare-prompts.ts`
|
||||
- `evaluation-structured-compare` 模板
|
||||
- `service.ts` 里 pairwise message 构造
|
||||
|
||||
验收标准:
|
||||
|
||||
- `synthetic-replica-instability` 仍稳定命中
|
||||
- `synthetic-schema-drift-regression` 仍稳定命中
|
||||
- docs 中能看到 payload 与 markdown 调试视图同时存在
|
||||
|
||||
### Phase 2:Synthesis 协议化
|
||||
|
||||
只改:
|
||||
|
||||
- synthesis builder/template
|
||||
- synthesis hints 传参结构
|
||||
|
||||
验收标准:
|
||||
|
||||
- `summary.md` 里的 stop signals 与当前校准结果不明显退化
|
||||
- 关键 case 的 conflict signals 保持稳定
|
||||
|
||||
### Phase 3:Rewrite 协议化
|
||||
|
||||
只改:
|
||||
|
||||
- `rewrite-from-evaluation.ts`
|
||||
- `evaluation-rewrite/*`
|
||||
- UI 调用参数不变,仅消息协议变更
|
||||
|
||||
验收标准:
|
||||
|
||||
- 不再输出 `role/content` 包装
|
||||
- 不再轻易擅改字段名 / schema
|
||||
- `synthetic-schema-drift-regression` 的 rewrite 继续能恢复 contract
|
||||
|
||||
## 9. 我对“最小实现”的建议
|
||||
|
||||
如果现在就开始做,我建议不要一步到位把所有 Markdown 都删掉。
|
||||
|
||||
最小、最稳的改法是:
|
||||
|
||||
1. 先保留现有自然语言说明段
|
||||
2. 把核心证据从 Markdown 改成 JSON payload
|
||||
3. 现有 Markdown 渲染函数先不删,只降级为 debug 辅助函数
|
||||
|
||||
这样有几个好处:
|
||||
|
||||
- 改动面可控
|
||||
- calibration runner 几乎不用重写
|
||||
- prompt 调优时仍保留人类可读性
|
||||
- 协议层已经完成最关键的去歧义
|
||||
|
||||
## 10. 最终判断
|
||||
|
||||
对于你们这个项目,我建议把协议层原则正式定下来:
|
||||
|
||||
**Markdown 只做展示层,JSON payload 才是机器协议层。**
|
||||
|
||||
这是对 compare / rewrite 最有价值的一次“基础设施型”收敛,因为它会同时提升:
|
||||
|
||||
- 对比评估的稳定性
|
||||
- calibration 的可解释性
|
||||
- rewrite 的 contract 保真度
|
||||
- 后续 SPO 自动迭代链路的可靠性
|
||||
@@ -0,0 +1,75 @@
|
||||
# Structured Compare Calibration
|
||||
|
||||
> 这一组样本不是为了证明 compare 功能“能跑”,而是为了校准我们新引入的 structured compare judge / synthesis / rewrite 提示词。
|
||||
|
||||
## 目标
|
||||
|
||||
- 为 `pairwise judge` 提供少量但高价值的校准场景。
|
||||
- 让 `synthesis` 在这些场景下暴露出是否存在“过度乐观”“忽略 overfit 风险”“把单次好运当稳定收益”等问题。
|
||||
- 让 `rewrite-from-evaluation` 接收到的上游证据足够清晰、可压缩、可复用。
|
||||
|
||||
## 场景设计原则
|
||||
|
||||
- 样本要能稳定打到某个误判风险,而不是泛泛比较“哪个输出更好”。
|
||||
- 每个样本都应能回答一个具体问题。
|
||||
- 尽量把“结构性收益”和“样例贴合收益”区分开。
|
||||
- 至少覆盖一次真实 target/teacher 执行,避免全部停留在手工构造快照。
|
||||
|
||||
## 当前样本
|
||||
|
||||
- `live-basic-system-boundary-control`
|
||||
使用真实模型执行 4 个快照,观察 structured compare 是否能识别“只输出 JSON、不要解释”的边界控制收益。
|
||||
- `synthetic-medical-latent-trigger-overfit`
|
||||
医疗分诊场景。目标是观察系统能否识别“样例触发词硬编码”带来的高过拟合风险,而不是把更激进的动作建议直接当成收益。
|
||||
- `synthetic-ecommerce-schema-no-model-worship`
|
||||
电商商品抽取场景。目标是校准 compare 是否会坚持 schema / contract 优先,不会因为 teacher 输出更流畅就放过字段改名和 wrapper 漂移。
|
||||
- `synthetic-legal-flat-not-unclear`
|
||||
法务风险摘要场景。目标是让 judge 学会把“结论等价、只改措辞”的情况稳定判为 `flat`,而不是退化成 `unclear`。
|
||||
- `synthetic-teaching-overfit-regression`
|
||||
教学讲解场景。目标是识别“为当前题目硬塞口诀导致通用原理丢失”的回退,并保留高 overfit 风险。
|
||||
- `synthetic-hiring-replica-semantic-instability`
|
||||
招聘筛选场景。目标是区分“单次输出更像样”和“同 prompt 反复执行仍稳定”这两件事。
|
||||
|
||||
## 最新校准结论
|
||||
|
||||
- synthetic 校准样本已经切换为跨主题集合,不再集中在客服/登录失败这一类题材上,目的是降低 calibration 自身对单一领域的过拟合。
|
||||
- `pairwise judge` 目前在 5 个跨主题 synthetic case 中,已经能稳定识别 3 类核心能力:
|
||||
- schema / contract 漂移应判回退
|
||||
- “flat 不是 unclear”
|
||||
- replica 语义不稳定应触发保守 stopRecommendation
|
||||
- `rewrite-from-evaluation` 在医疗、教学、电商这类样本上,已经能根据 compare 结论回退到更稳的通用 prompt,而不是继续保留样例贴合规则或坏 contract。
|
||||
- 当前最有价值的新发现来自 `synthetic-medical-latent-trigger-overfit`:
|
||||
compare 并没有把它看成“轻微过拟合但仍可能有收益”,而是直接判成了 `regressed + high overfit risk`。这说明现在的 judge 对高风险领域会更保守,也说明该样本已经能检验更细的提示词边界。
|
||||
- live case 当前仍是 `targetVsBaseline=improved` 且 `stopRecommendation=review`。这说明真实边界控制收益仍可见,但系统仍保持保守,不会轻易建议停止。
|
||||
|
||||
## 运行方式
|
||||
|
||||
在项目根目录执行:
|
||||
|
||||
```bash
|
||||
pnpm -F @prompt-optimizer/core build
|
||||
node scripts/run-structured-compare-calibration.mjs
|
||||
```
|
||||
|
||||
或直接使用:
|
||||
|
||||
```bash
|
||||
pnpm compare:calibrate
|
||||
```
|
||||
|
||||
## 输出位置
|
||||
|
||||
- 总结:`docs/workspace/compare-evaluation-analysis/structured-compare-calibration/latest/summary.md`
|
||||
- 每个 case 的 request / response / rewrite / llm-calls 都在对应子目录中。
|
||||
- 每个 case 还会落盘:
|
||||
- `pair-judge-payloads.json`
|
||||
- `synthesis-payload.json`
|
||||
- `rewrite-payload.json`
|
||||
|
||||
## 如何使用这些结果
|
||||
|
||||
- 如果 synthetic case 没命中预期,优先改 compare judge / synthesis 提示词。
|
||||
- 如果 live case 的 stopSignals 合理,但 rewrite 输出方向仍然跑偏,优先改 rewrite-from-evaluation 模板。
|
||||
- 如果 rewrite 输出开始擅自改字段名、改 schema、改消息包装方式,先检查是否把“当前工作区 prompt 原文”和“参考 prompt 快照”一起喂给了 rewrite 模板。
|
||||
- 如果 calibration 偶发被真实 API 超时打断,优先重跑 `pnpm compare:calibrate`;当前 runner 已内置超时拉长和有限重试。
|
||||
- 如果 synthetic 与 live 表现相互矛盾,优先检查场景描述是否过于理想化,再决定是否扩充样本。
|
||||
@@ -0,0 +1,56 @@
|
||||
[
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Target Workspace"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。",
|
||||
"output": "{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"professional and trustworthy\"}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 1,
|
||||
"label": "Target Previous"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n阅读用户输入,输出一个 JSON 对象,包含以下字段:\n- audience: string | null\n- pain_points: string[]\n- tone: string | null\n要求:只返回 JSON。",
|
||||
"output": "```json\n{\n \"audience\": \"独立设计师\",\n \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"],\n \"tone\": \"专业可信\"\n}\n```",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "previous"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。",
|
||||
"output": "{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\"}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 1,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n阅读用户输入,输出一个 JSON 对象,包含以下字段:\n- audience: string | null\n- pain_points: string[]\n- tone: string | null\n要求:只返回 JSON。",
|
||||
"output": "{\n \"audience\": \"独立设计师\",\n \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"],\n \"tone\": \"专业可信\"\n}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-previous"
|
||||
}
|
||||
]
|
||||
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
Load Diff
@@ -0,0 +1,260 @@
|
||||
[
|
||||
{
|
||||
"phase": "pair-judge:target-vs-reference",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"purpose": "Identify whether the target still has a learnable gap from the stronger/reference run, and what structural strategy is worth learning.",
|
||||
"signalName": "gap",
|
||||
"allowedSignalValues": [
|
||||
"none",
|
||||
"minor",
|
||||
"major",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "优先判断改动是否真正减少了额外解释、格式边界滑移和输出结构不稳定,而不是只看表面完整度。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "用户输入",
|
||||
"content": "我在做一个给独立设计师用的合同管理工具,语气希望专业可信。现在最大的问题是版本混乱和客户确认来回很慢。请先解释你的判断依据,再给出结果。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Target Workspace"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"professional and trustworthy\"}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\"}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:reference-vs-reference-baseline",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"purpose": "Judge whether the prompt change itself is supported on the reference side, instead of being a target-only coincidence.",
|
||||
"signalName": "promptValidity",
|
||||
"allowedSignalValues": [
|
||||
"supported",
|
||||
"mixed",
|
||||
"unsupported",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "优先判断改动是否真正减少了额外解释、格式边界滑移和输出结构不稳定,而不是只看表面完整度。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "用户输入",
|
||||
"content": "我在做一个给独立设计师用的合同管理工具,语气希望专业可信。现在最大的问题是版本混乱和客户确认来回很慢。请先解释你的判断依据,再给出结果。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\"}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n阅读用户输入,输出一个 JSON 对象,包含以下字段:\n- audience: string | null\n- pain_points: string[]\n- tone: string | null\n要求:只返回 JSON。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-previous",
|
||||
"output": "{\n \"audience\": \"独立设计师\",\n \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"],\n \"tone\": \"专业可信\"\n}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:target-vs-baseline",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"purpose": "Decide whether the current target prompt materially improved, stayed flat, or regressed relative to the previous version.",
|
||||
"signalName": "progress",
|
||||
"allowedSignalValues": [
|
||||
"improved",
|
||||
"flat",
|
||||
"regressed",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "优先判断改动是否真正减少了额外解释、格式边界滑移和输出结构不稳定,而不是只看表面完整度。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "用户输入",
|
||||
"content": "我在做一个给独立设计师用的合同管理工具,语气希望专业可信。现在最大的问题是版本混乱和客户确认来回很慢。请先解释你的判断依据,再给出结果。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Target Workspace"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"professional and trustworthy\"}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Target Previous"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n阅读用户输入,输出一个 JSON 对象,包含以下字段:\n- audience: string | null\n- pain_points: string[]\n- tone: string | null\n要求:只返回 JSON。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "previous",
|
||||
"output": "```json\n{\n \"audience\": \"独立设计师\",\n \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"],\n \"tone\": \"专业可信\"\n}\n```"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
@@ -0,0 +1,94 @@
|
||||
{
|
||||
"type": "compare",
|
||||
"evaluationModelKey": "deepseek",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"focus": {
|
||||
"content": "优先判断改动是否真正减少了额外解释、格式边界滑移和输出结构不稳定,而不是只看表面完整度。",
|
||||
"source": "system",
|
||||
"priority": "highest"
|
||||
},
|
||||
"target": {
|
||||
"workspacePrompt": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。"
|
||||
},
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "用户输入",
|
||||
"content": "我在做一个给独立设计师用的合同管理工具,语气希望专业可信。现在最大的问题是版本混乱和客户确认来回很慢。请先解释你的判断依据,再给出结果。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"snapshots": [
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Target Workspace"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。",
|
||||
"output": "{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"professional and trustworthy\"}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 1,
|
||||
"label": "Target Previous"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n阅读用户输入,输出一个 JSON 对象,包含以下字段:\n- audience: string | null\n- pain_points: string[]\n- tone: string | null\n要求:只返回 JSON。",
|
||||
"output": "```json\n{\n \"audience\": \"独立设计师\",\n \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"],\n \"tone\": \"专业可信\"\n}\n```",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "previous"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。",
|
||||
"output": "{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\"}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 1,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n阅读用户输入,输出一个 JSON 对象,包含以下字段:\n- audience: string | null\n- pain_points: string[]\n- tone: string | null\n要求:只返回 JSON。",
|
||||
"output": "{\n \"audience\": \"独立设计师\",\n \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"],\n \"tone\": \"专业可信\"\n}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-previous"
|
||||
}
|
||||
],
|
||||
"compareHints": {
|
||||
"mode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"hasSharedTestCases": true,
|
||||
"hasSamePromptSnapshots": true,
|
||||
"hasCrossModelComparison": true
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,96 @@
|
||||
```json
|
||||
{
|
||||
"type": "compare",
|
||||
"evaluationModelKey": "deepseek",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"focus": {
|
||||
"content": "优先判断改动是否真正减少了额外解释、格式边界滑移和输出结构不稳定,而不是只看表面完整度。",
|
||||
"source": "system",
|
||||
"priority": "highest"
|
||||
},
|
||||
"target": {
|
||||
"workspacePrompt": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。"
|
||||
},
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "用户输入",
|
||||
"content": "我在做一个给独立设计师用的合同管理工具,语气希望专业可信。现在最大的问题是版本混乱和客户确认来回很慢。请先解释你的判断依据,再给出结果。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"snapshots": [
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Target Workspace"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。",
|
||||
"output": "{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"professional and trustworthy\"}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 1,
|
||||
"label": "Target Previous"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n阅读用户输入,输出一个 JSON 对象,包含以下字段:\n- audience: string | null\n- pain_points: string[]\n- tone: string | null\n要求:只返回 JSON。",
|
||||
"output": "```json\n{\n \"audience\": \"独立设计师\",\n \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"],\n \"tone\": \"专业可信\"\n}\n```",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "previous"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。",
|
||||
"output": "{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\"}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 1,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是一个严格的数据抽取助手。\n阅读用户输入,输出一个 JSON 对象,包含以下字段:\n- audience: string | null\n- pain_points: string[]\n- tone: string | null\n要求:只返回 JSON。",
|
||||
"output": "{\n \"audience\": \"独立设计师\",\n \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"],\n \"tone\": \"专业可信\"\n}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-previous"
|
||||
}
|
||||
],
|
||||
"compareHints": {
|
||||
"mode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"hasSharedTestCases": true,
|
||||
"hasSamePromptSnapshots": true,
|
||||
"hasCrossModelComparison": true
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,215 @@
|
||||
{
|
||||
"type": "compare",
|
||||
"score": {
|
||||
"overall": 75,
|
||||
"dimensions": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 90
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 70
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 85
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 70
|
||||
}
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"summary": "Target相比Baseline在格式控制上有显著进步,但与Reference在字段本地化处理上仍有可学习的微小差距;提示词中增加明确禁止项的改动在Reference侧被验证有效,但存在一定的样例过拟合风险。",
|
||||
"patchPlan": [],
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "medium",
|
||||
"stopRecommendation": "continue",
|
||||
"stopReasons": [
|
||||
"minor learnable gap remains vs reference",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"model": "deepseek",
|
||||
"timestamp": 1774176350757,
|
||||
"duration": 28158,
|
||||
"compareJudgements": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "improved",
|
||||
"analysis": "Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。",
|
||||
"evidence": [
|
||||
"Baseline (B) 的输出包裹了"
|
||||
],
|
||||
"learnableSignals": [],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "minor",
|
||||
"analysis": "两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。",
|
||||
"evidence": [
|
||||
"Target的`tone`字段值为\"professional and trustworthy\",是对用户输入中“专业可信”的英文翻译。",
|
||||
"Reference的`tone`字段值为\"专业可信\",与用户输入中的中文原词完全一致。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"此判断基于当前用户输入明确提供了中文描述。如果用户输入本身是英文或未明确描述语气,此优势可能不适用。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "supported",
|
||||
"analysis": "左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。",
|
||||
"evidence": [
|
||||
"左侧提示词明确禁止了Markdown、解释、前后缀或代码块,而右侧提示词仅要求“只返回JSON”,约束较弱。",
|
||||
"左侧输出为紧凑的JSON字符串:`{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\"}`。",
|
||||
"右侧输出包含了额外的格式(换行和缩进):`{\n \"audience\": \"独立设计师\",\n \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"],\n \"tone\": \"专业可信\"\n}`,这违反了左侧提示词中“不要输出...前后缀”的硬边界规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
}
|
||||
],
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Baseline (B) 的输出包裹了",
|
||||
"Target的`tone`字段值为\"professional and trustworthy\",是对用户输入中“专业可信”的英文翻译。",
|
||||
"Reference的`tone`字段值为\"专业可信\",与用户输入中的中文原词完全一致。",
|
||||
"左侧提示词明确禁止了Markdown、解释、前后缀或代码块,而右侧提示词仅要求“只返回JSON”,约束较弱。",
|
||||
"左侧输出为紧凑的JSON字符串:`{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\"}`。",
|
||||
"右侧输出包含了额外的格式(换行和缩进):`{ \"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\" }`,这违反了左侧提示词中“不要输出...前后缀”的硬边界规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"此判断基于当前用户输入明确提供了中文描述。如果用户输入本身是英文或未明确描述语气,此优势可能不适用。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,217 @@
|
||||
```json
|
||||
{
|
||||
"type": "compare",
|
||||
"score": {
|
||||
"overall": 75,
|
||||
"dimensions": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 90
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 70
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 85
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 70
|
||||
}
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"summary": "Target相比Baseline在格式控制上有显著进步,但与Reference在字段本地化处理上仍有可学习的微小差距;提示词中增加明确禁止项的改动在Reference侧被验证有效,但存在一定的样例过拟合风险。",
|
||||
"patchPlan": [],
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "medium",
|
||||
"stopRecommendation": "continue",
|
||||
"stopReasons": [
|
||||
"minor learnable gap remains vs reference",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"model": "deepseek",
|
||||
"timestamp": 1774176350757,
|
||||
"duration": 28158,
|
||||
"compareJudgements": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "improved",
|
||||
"analysis": "Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。",
|
||||
"evidence": [
|
||||
"Baseline (B) 的输出包裹了"
|
||||
],
|
||||
"learnableSignals": [],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "minor",
|
||||
"analysis": "两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。",
|
||||
"evidence": [
|
||||
"Target的`tone`字段值为\"professional and trustworthy\",是对用户输入中“专业可信”的英文翻译。",
|
||||
"Reference的`tone`字段值为\"专业可信\",与用户输入中的中文原词完全一致。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"此判断基于当前用户输入明确提供了中文描述。如果用户输入本身是英文或未明确描述语气,此优势可能不适用。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "supported",
|
||||
"analysis": "左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。",
|
||||
"evidence": [
|
||||
"左侧提示词明确禁止了Markdown、解释、前后缀或代码块,而右侧提示词仅要求“只返回JSON”,约束较弱。",
|
||||
"左侧输出为紧凑的JSON字符串:`{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\"}`。",
|
||||
"右侧输出包含了额外的格式(换行和缩进):`{\n \"audience\": \"独立设计师\",\n \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"],\n \"tone\": \"专业可信\"\n}`,这违反了左侧提示词中“不要输出...前后缀”的硬边界规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
}
|
||||
],
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Baseline (B) 的输出包裹了",
|
||||
"Target的`tone`字段值为\"professional and trustworthy\",是对用户输入中“专业可信”的英文翻译。",
|
||||
"Reference的`tone`字段值为\"专业可信\",与用户输入中的中文原词完全一致。",
|
||||
"左侧提示词明确禁止了Markdown、解释、前后缀或代码块,而右侧提示词仅要求“只返回JSON”,约束较弱。",
|
||||
"左侧输出为紧凑的JSON字符串:`{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\"}`。",
|
||||
"右侧输出包含了额外的格式(换行和缩进):`{ \"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\" }`,这违反了左侧提示词中“不要输出...前后缀”的硬边界规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"此判断基于当前用户输入明确提供了中文描述。如果用户输入本身是英文或未明确描述语气,此优势可能不适用。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,198 @@
|
||||
请只根据下面这份 JSON payload,把当前工作区系统提示词直接重写成一个完整的新版本。
|
||||
|
||||
要求:
|
||||
1. "sourcePrompts.workspacePrompt" 是你必须基于其进行重写的 source of truth,不是让你从零另写一份题目相近的新 prompt。
|
||||
2. 保留原提示词的核心目标、硬约束、必要边界、变量名、字段名、schema、角色结构和输出协议,除非评估明确表明这些内容本身有问题。
|
||||
3. 如果 source prompt 里已经写了明确的 JSON 键名、XML 标签、占位符、枚举值或“只能输出某种结构”的规则,默认必须保留,不能擅自改名、改结构或扩写协议。
|
||||
4. 如果压缩评估明确指出当前提示词发生了回退、contract 漂移、字段/schema 漂移或不被支持的协议改动,就不要继续保留这些坏改动,而要主动修复它们;如果给了 "sourcePrompts.referencePrompt",优先把它当作恢复 contract 的锚点。
|
||||
5. 优先吸收可复用、跨输入也应成立的改进,不要为了当前样例、当前输出细节或一次性现象过拟合。
|
||||
6. 如果某条建议明显依赖当前样例,应主动将其泛化、弱化或舍弃。
|
||||
7. 不要自行发明新的测试证据,只能基于下面这份压缩评估结论来改写。
|
||||
8. 优先做“最小但完整”的重写,在保留原 contract 的前提下提升质量,而不是整套改写。
|
||||
9. 只输出提示词正文,不要把结果包装成 JSON、YAML、XML、"role/content" 对象、消息数组或代码块。
|
||||
10. 只输出重写后的完整提示词,不要额外解释。
|
||||
11. "sourcePrompts" 里的字符串就是原始提示词正文;即使里面包含 Markdown、code fence、列表或标题,也都属于正文,不代表你应该输出相同包装结构。
|
||||
12. 如果 compare 相关条目之间有重叠,优先相信聚合焦点结论和停止信号,再参考较底层的证据摘录。
|
||||
13. 在动手改写前,先看 "compressedEvaluation.rewriteGuidance.recommendation"。
|
||||
14. 如果 recommendation 是 "skip",就原样输出 "sourcePrompts.workspacePrompt",不要做任何改写。
|
||||
15. 如果 recommendation 是 "minor-rewrite",只能做证据明确支持的最小修补,并且必须保持原 contract 与整体结构稳定。
|
||||
16. 只有 recommendation 是 "rewrite" 时,才允许做更实质性的重写。
|
||||
17. 在决定改哪里之前,先看 "compressedEvaluation.rewriteGuidance.priorityMoves",把这些动作当作最高优先级的改写议程。
|
||||
18. 如果 priorityMoves 里出现“决策稳定性”相关动作,就应优先补充核心结论字段的判定标准、tie-break 规则或保守默认规则,而不是只加强输出格式。
|
||||
|
||||
Rewrite Payload (JSON):
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 75
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。",
|
||||
"referencePrompt": "你是一个严格的数据抽取助手。\n阅读用户输入,输出一个 JSON 对象,包含以下字段:\n- audience: string | null\n- pain_points: string[]\n- tone: string | null\n要求:只返回 JSON。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target相比Baseline在格式控制上有显著进步,但与Reference在字段本地化处理上仍有可学习的微小差距;提示词中增加明确禁止项的改动在Reference侧被验证有效,但存在一定的样例过拟合风险。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 90
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 70
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 85
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 70
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "medium",
|
||||
"stopRecommendation": "continue",
|
||||
"stopReasons": [
|
||||
"minor learnable gap remains vs reference",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Baseline (B) 的输出包裹了",
|
||||
"Target的`tone`字段值为\"professional and trustworthy\",是对用户输入中“专业可信”的英文翻译。",
|
||||
"Reference的`tone`字段值为\"专业可信\",与用户输入中的中文原词完全一致。",
|
||||
"左侧提示词明确禁止了Markdown、解释、前后缀或代码块,而右侧提示词仅要求“只返回JSON”,约束较弱。",
|
||||
"左侧输出为紧凑的JSON字符串:`{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\"}`。",
|
||||
"右侧输出包含了额外的格式(换行和缩进):`{ \"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\" }`,这违反了左侧提示词中“不要输出...前后缀”的硬边界规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"此判断基于当前用户输入明确提供了中文描述。如果用户输入本身是英文或未明确描述语气,此优势可能不适用。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "rewrite",
|
||||
"reasons": [
|
||||
"当前仍存在明确改进空间或未解决风险,继续做实质性改写仍然有必要。"
|
||||
],
|
||||
"focusAreas": [
|
||||
"generalization"
|
||||
],
|
||||
"priorityMoves": [
|
||||
"删除或弱化样例触发式规则,优先改写成跨输入也应成立的通用原则。"
|
||||
]
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=improved | verdict=left-better | confidence=high | Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。",
|
||||
"参考差距: Target vs Reference | signal=minor | verdict=right-better | confidence=high | 两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=supported | verdict=left-better | confidence=high | 左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。"
|
||||
],
|
||||
"conflictLines": [
|
||||
"如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
],
|
||||
"learnableSignalLines": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"overfitWarningLines": [
|
||||
"此判断基于当前用户输入明确提供了中文描述。如果用户输入本身是英文或未明确描述语气,此优势可能不适用。"
|
||||
],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=improved | verdict=left-better | confidence=high | Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。",
|
||||
"2. Target vs Reference | signal=minor | verdict=right-better | confidence=high | 两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻...",
|
||||
"3. Reference vs Reference Baseline | signal=supported | verdict=left-better | confidence=high | 左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。",
|
||||
"Baseline (B) 的输出包裹了"
|
||||
]
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,10 @@
|
||||
你是一个严格的数据抽取助手。
|
||||
你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。
|
||||
JSON schema 必须为:
|
||||
{"audience": string|null, "pain_points": string[], "tone": string|null}
|
||||
规则:
|
||||
1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀、代码块、额外的换行或缩进。
|
||||
2. pain_points 只保留用户明确提到的问题,不要脑补。
|
||||
3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。
|
||||
4. 键名必须完全使用 audience、pain_points、tone。
|
||||
5. 对于所有字段,尤其是描述性字段(如 tone),应优先直接使用用户输入中的原词,避免进行不必要的翻译、改写或解释。
|
||||
@@ -0,0 +1,175 @@
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 75
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是一个严格的数据抽取助手。\n你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。\nJSON schema 必须为:\n{\"audience\": string|null, \"pain_points\": string[], \"tone\": string|null}\n规则:\n1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀或代码块。\n2. pain_points 只保留用户明确提到的问题,不要脑补。\n3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。\n4. 键名必须完全使用 audience、pain_points、tone。",
|
||||
"referencePrompt": "你是一个严格的数据抽取助手。\n阅读用户输入,输出一个 JSON 对象,包含以下字段:\n- audience: string | null\n- pain_points: string[]\n- tone: string | null\n要求:只返回 JSON。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target相比Baseline在格式控制上有显著进步,但与Reference在字段本地化处理上仍有可学习的微小差距;提示词中增加明确禁止项的改动在Reference侧被验证有效,但存在一定的样例过拟合风险。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 90
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 70
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 85
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 70
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "medium",
|
||||
"stopRecommendation": "continue",
|
||||
"stopReasons": [
|
||||
"minor learnable gap remains vs reference",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Baseline (B) 的输出包裹了",
|
||||
"Target的`tone`字段值为\"professional and trustworthy\",是对用户输入中“专业可信”的英文翻译。",
|
||||
"Reference的`tone`字段值为\"专业可信\",与用户输入中的中文原词完全一致。",
|
||||
"左侧提示词明确禁止了Markdown、解释、前后缀或代码块,而右侧提示词仅要求“只返回JSON”,约束较弱。",
|
||||
"左侧输出为紧凑的JSON字符串:`{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\"}`。",
|
||||
"右侧输出包含了额外的格式(换行和缩进):`{ \"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\" }`,这违反了左侧提示词中“不要输出...前后缀”的硬边界规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"此判断基于当前用户输入明确提供了中文描述。如果用户输入本身是英文或未明确描述语气,此优势可能不适用。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "rewrite",
|
||||
"reasons": [
|
||||
"当前仍存在明确改进空间或未解决风险,继续做实质性改写仍然有必要。"
|
||||
],
|
||||
"focusAreas": [
|
||||
"generalization"
|
||||
],
|
||||
"priorityMoves": [
|
||||
"删除或弱化样例触发式规则,优先改写成跨输入也应成立的通用原则。"
|
||||
]
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=improved | verdict=left-better | confidence=high | Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。",
|
||||
"参考差距: Target vs Reference | signal=minor | verdict=right-better | confidence=high | 两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=supported | verdict=left-better | confidence=high | 左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。"
|
||||
],
|
||||
"conflictLines": [
|
||||
"如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
],
|
||||
"learnableSignalLines": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"overfitWarningLines": [
|
||||
"此判断基于当前用户输入明确提供了中文描述。如果用户输入本身是英文或未明确描述语气,此优势可能不适用。"
|
||||
],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=improved | verdict=left-better | confidence=high | Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。",
|
||||
"2. Target vs Reference | signal=minor | verdict=right-better | confidence=high | 两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻...",
|
||||
"3. Reference vs Reference Baseline | signal=supported | verdict=left-better | confidence=high | 左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。",
|
||||
"Baseline (B) 的输出包裹了"
|
||||
]
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,10 @@
|
||||
# 真实模型: basic-system 边界控制改动
|
||||
|
||||
- caseId: live-basic-system-boundary-control
|
||||
- kind: live
|
||||
|
||||
使用真实 target/teacher 执行 4 个快照,检验 structured compare 是否能识别“更强边界约束”带来的真实收益,而不是只看表面措辞变化。
|
||||
|
||||
## Focus
|
||||
|
||||
优先判断改动是否真正减少了额外解释、格式边界滑移和输出结构不稳定,而不是只看表面完整度。
|
||||
@@ -0,0 +1,54 @@
|
||||
{
|
||||
"generatedAt": "2026-03-22T10:44:18.102Z",
|
||||
"case": {
|
||||
"id": "live-basic-system-boundary-control",
|
||||
"title": "真实模型: basic-system 边界控制改动",
|
||||
"kind": "live"
|
||||
},
|
||||
"summary": {
|
||||
"compareMode": "structured",
|
||||
"summary": "Target相比Baseline在格式控制上有显著进步,但与Reference在字段本地化处理上仍有可学习的微小差距;提示词中增加明确禁止项的改动在Reference侧被验证有效,但存在一定的样例过拟合风险。",
|
||||
"score": 75,
|
||||
"improvements": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "medium",
|
||||
"stopRecommendation": "continue",
|
||||
"stopReasons": [
|
||||
"minor learnable gap remains vs reference",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"pairJudgements": [
|
||||
{
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "referenceBaseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high"
|
||||
}
|
||||
],
|
||||
"expected": null
|
||||
},
|
||||
"expectationResults": []
|
||||
}
|
||||
@@ -0,0 +1,79 @@
|
||||
# 真实模型: basic-system 边界控制改动
|
||||
|
||||
- caseId: live-basic-system-boundary-control
|
||||
- kind: live
|
||||
- generatedAt: 2026-03-22T10:44:18.102Z
|
||||
|
||||
## Description
|
||||
|
||||
使用真实 target/teacher 执行 4 个快照,检验 structured compare 是否能识别“更强边界约束”带来的真实收益,而不是只看表面措辞变化。
|
||||
|
||||
## Compare Result
|
||||
|
||||
```json
|
||||
{
|
||||
"compareMode": "structured",
|
||||
"summary": "Target相比Baseline在格式控制上有显著进步,但与Reference在字段本地化处理上仍有可学习的微小差距;提示词中增加明确禁止项的改动在Reference侧被验证有效,但存在一定的样例过拟合风险。",
|
||||
"score": 75,
|
||||
"improvements": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "medium",
|
||||
"stopRecommendation": "continue",
|
||||
"stopReasons": [
|
||||
"minor learnable gap remains vs reference",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"pairJudgements": [
|
||||
{
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "referenceBaseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high"
|
||||
}
|
||||
],
|
||||
"expected": null
|
||||
}
|
||||
```
|
||||
|
||||
## Expectation Check
|
||||
|
||||
无预设断言,本样本用于探索式观察。
|
||||
|
||||
|
||||
## Rewrite Output
|
||||
|
||||
```
|
||||
你是一个严格的数据抽取助手。
|
||||
你的任务是阅读用户输入,并输出一个且仅一个 JSON 对象。
|
||||
JSON schema 必须为:
|
||||
{"audience": string|null, "pain_points": string[], "tone": string|null}
|
||||
规则:
|
||||
1. 只输出 JSON 对象,不要输出 Markdown、解释、前后缀、代码块、额外的换行或缩进。
|
||||
2. pain_points 只保留用户明确提到的问题,不要脑补。
|
||||
3. 缺失信息时 audience 和 tone 用 null,pain_points 用 []。
|
||||
4. 键名必须完全使用 audience、pain_points、tone。
|
||||
5. 对于所有字段,尤其是描述性字段(如 tone),应优先直接使用用户输入中的原词,避免进行不必要的翻译、改写或解释。
|
||||
```
|
||||
@@ -0,0 +1,150 @@
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"roleName": "结构化系统提示词对比综合专家",
|
||||
"subjectLabel": "系统提示词",
|
||||
"sharedCompareInputs": true,
|
||||
"samePromptAcrossSnapshots": true,
|
||||
"crossModelComparison": true,
|
||||
"focusBrief": "优先判断改动是否真正减少了额外解释、格式边界滑移和输出结构不稳定,而不是只看表面完整度。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"deterministicHints": {
|
||||
"priorityOrder": [
|
||||
"targetBaseline",
|
||||
"targetReference",
|
||||
"referenceBaseline",
|
||||
"targetReplica"
|
||||
],
|
||||
"signalSnapshot": {
|
||||
"progress": "improved",
|
||||
"gap": "minor",
|
||||
"promptValidity": "supported"
|
||||
},
|
||||
"derivedStopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "medium",
|
||||
"stopRecommendation": "continue",
|
||||
"stopReasons": [
|
||||
"minor learnable gap remains vs reference",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"learnableSignals": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。",
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"此判断基于当前用户输入明确提供了中文描述。如果用户输入本身是英文或未明确描述语气,此优势可能不适用。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
{
|
||||
"key": "sampleOverfitRiskVisible",
|
||||
"description": "如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
}
|
||||
]
|
||||
},
|
||||
"judgeResults": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "improved",
|
||||
"analysis": "Target (A) 在输出格式的严格性和边界控制上显著优于 Baseline (B)。Baseline 的输出包裹了 Markdown 代码块,违反了“只输出 JSON 对象”的核心指令,属于明确的硬边界违例。Target 则严格遵守了所有格式和内容规则,没有额外解释或格式漂移,实现了真正的改进。",
|
||||
"evidence": [
|
||||
"Baseline (B) 的输出包裹了"
|
||||
],
|
||||
"learnableSignals": [],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "minor",
|
||||
"analysis": "两者都正确提取了核心信息并严格遵守了输出协议,但Reference在`tone`字段的本地化处理上更优,直接使用了用户输入中的中文原词“专业可信”,而Target使用了英文翻译“professional and trustworthy”。这是一个清晰、可学习的结构优势,即更忠实地保留用户输入的原词,而非进行不必要的翻译或解释。",
|
||||
"evidence": [
|
||||
"Target的`tone`字段值为\"professional and trustworthy\",是对用户输入中“专业可信”的英文翻译。",
|
||||
"Reference的`tone`字段值为\"专业可信\",与用户输入中的中文原词完全一致。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在提取`tone`等描述性字段时,应优先直接使用用户输入中的原词,避免进行不必要的翻译或改写,以保持信息的原始性和准确性。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"此判断基于当前用户输入明确提供了中文描述。如果用户输入本身是英文或未明确描述语气,此优势可能不适用。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "supported",
|
||||
"analysis": "左侧(Reference)的提示词通过增加明确的规则约束,显著减少了输出格式的边界滑移风险,并消除了右侧(Reference Baseline)输出中存在的额外格式(如换行和缩进),使输出更严格地符合“只输出JSON对象”的要求。这一改进在参考侧内部得到了验证,并非仅针对当前样例的巧合。",
|
||||
"evidence": [
|
||||
"左侧提示词明确禁止了Markdown、解释、前后缀或代码块,而右侧提示词仅要求“只返回JSON”,约束较弱。",
|
||||
"左侧输出为紧凑的JSON字符串:`{\"audience\": \"独立设计师\", \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"], \"tone\": \"专业可信\"}`。",
|
||||
"右侧输出包含了额外的格式(换行和缩进):`{\n \"audience\": \"独立设计师\",\n \"pain_points\": [\"版本混乱\", \"客户确认来回很慢\"],\n \"tone\": \"专业可信\"\n}`,这违反了左侧提示词中“不要输出...前后缀”的硬边界规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在要求“只输出JSON”的提示词中,明确列举禁止项(如Markdown、解释、代码块、前后缀)能有效减少格式漂移。",
|
||||
"仅规定“只返回JSON”的模糊指令,模型可能仍会添加美化格式(如换行和缩进),这被视为一种边界违例。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,71 @@
|
||||
{
|
||||
"generatedAt": "2026-03-22T10:44:18.102Z",
|
||||
"rows": [
|
||||
{
|
||||
"caseId": "live-basic-system-boundary-control",
|
||||
"title": "真实模型: basic-system 边界控制改动",
|
||||
"kind": "live",
|
||||
"score": 75,
|
||||
"stopRecommendation": "continue",
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"expectationMatched": null,
|
||||
"expectationTotal": null
|
||||
},
|
||||
{
|
||||
"caseId": "synthetic-medical-latent-trigger-overfit",
|
||||
"title": "合成样本: 医疗分诊里的隐性触发过拟合",
|
||||
"kind": "synthetic",
|
||||
"score": 35,
|
||||
"stopRecommendation": "review",
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"expectationMatched": 3,
|
||||
"expectationTotal": 5
|
||||
},
|
||||
{
|
||||
"caseId": "synthetic-ecommerce-schema-no-model-worship",
|
||||
"title": "合成样本: 电商抽取里不能因为 teacher 更会写就忽略 schema",
|
||||
"kind": "synthetic",
|
||||
"score": 40,
|
||||
"stopRecommendation": "review",
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"expectationMatched": 6,
|
||||
"expectationTotal": 6
|
||||
},
|
||||
{
|
||||
"caseId": "synthetic-legal-flat-not-unclear",
|
||||
"title": "合成样本: 法务风险摘要应该判 flat 而不是 unclear",
|
||||
"kind": "synthetic",
|
||||
"score": 50,
|
||||
"stopRecommendation": "continue",
|
||||
"targetVsBaseline": "flat",
|
||||
"targetVsReferenceGap": "none",
|
||||
"expectationMatched": 3,
|
||||
"expectationTotal": 3
|
||||
},
|
||||
{
|
||||
"caseId": "synthetic-teaching-overfit-regression",
|
||||
"title": "合成样本: 教学讲解里的样例口诀导致回退",
|
||||
"kind": "synthetic",
|
||||
"score": 30,
|
||||
"stopRecommendation": "review",
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"expectationMatched": 6,
|
||||
"expectationTotal": 6
|
||||
},
|
||||
{
|
||||
"caseId": "synthetic-hiring-replica-semantic-instability",
|
||||
"title": "合成样本: 招聘筛选里 replica 语义不稳定",
|
||||
"kind": "synthetic",
|
||||
"score": 65,
|
||||
"stopRecommendation": "review",
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "none",
|
||||
"expectationMatched": 4,
|
||||
"expectationTotal": 4
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,19 @@
|
||||
# Structured Compare Calibration Summary
|
||||
|
||||
- generatedAt: 2026-03-22T10:44:18.102Z
|
||||
- outputRoot: D:\Dev\myProject\prompt-optimizer\docs\workspace\compare-evaluation-analysis\structured-compare-calibration\latest
|
||||
|
||||
| Case | Kind | Score | targetVsBaseline | targetVsReferenceGap | stopRecommendation | Expectation Match |
|
||||
| --- | --- | --- | --- | --- | --- | --- |
|
||||
| live-basic-system-boundary-control | live | 75 | improved | minor | continue | exploratory |
|
||||
| synthetic-medical-latent-trigger-overfit | synthetic | 35 | regressed | major | review | 3/5 |
|
||||
| synthetic-ecommerce-schema-no-model-worship | synthetic | 40 | regressed | minor | review | 6/6 |
|
||||
| synthetic-legal-flat-not-unclear | synthetic | 50 | flat | none | continue | 3/3 |
|
||||
| synthetic-teaching-overfit-regression | synthetic | 30 | regressed | major | review | 6/6 |
|
||||
| synthetic-hiring-replica-semantic-instability | synthetic | 65 | improved | none | review | 4/4 |
|
||||
|
||||
## Notes
|
||||
|
||||
- synthetic cases 用来检验 judge / synthesis 的提示词边界。
|
||||
- live case 用来观察真实 target/teacher 执行结果在 structured compare 下是否能收敛成合理结论。
|
||||
- 每个 case 子目录内都保存了 compare request、compare result、rewrite input / output,以及完整 LLM 调用日志。
|
||||
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
Load Diff
@@ -0,0 +1,260 @@
|
||||
[
|
||||
{
|
||||
"phase": "pair-judge:target-vs-baseline",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"purpose": "Decide whether the current target prompt materially improved, stayed flat, or regressed relative to the previous version.",
|
||||
"signalName": "progress",
|
||||
"allowedSignalValues": [
|
||||
"improved",
|
||||
"flat",
|
||||
"regressed",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "即便 reference 看上去更完整、更自然,只要 prompt 改动造成字段名或外层结构变化,就应把 target 相对 baseline 判为回退。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "商品文案",
|
||||
"content": "便携手冲咖啡壶,容量 600ml,适合露营和办公室使用,主打双层不锈钢保温,注意不支持电磁炉直火加热。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n输出一个 JSON 对象,字段改为 product_name, buyer_highlights, cautions。\n为了后续扩展,把三个字段统一包在 payload 里。\n不要解释。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"payload\":{\"product_name\":\"便携手冲咖啡壶\",\"buyer_highlights\":[\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],\"cautions\":[\"不支持电磁炉直火加热\"]}}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n只输出 JSON 对象,字段必须为 title, selling_points, cautions。\n不要改字段名,不要添加外层包裹对象,不要解释。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v4",
|
||||
"output": "{\"title\":\"便携手冲咖啡壶\",\"selling_points\":[\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],\"cautions\":[\"不支持电磁炉直火加热\"]}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:reference-vs-reference-baseline",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"purpose": "Judge whether the prompt change itself is supported on the reference side, instead of being a target-only coincidence.",
|
||||
"signalName": "promptValidity",
|
||||
"allowedSignalValues": [
|
||||
"supported",
|
||||
"mixed",
|
||||
"unsupported",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "即便 reference 看上去更完整、更自然,只要 prompt 改动造成字段名或外层结构变化,就应把 target 相对 baseline 判为回退。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "商品文案",
|
||||
"content": "便携手冲咖啡壶,容量 600ml,适合露营和办公室使用,主打双层不锈钢保温,注意不支持电磁炉直火加热。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n输出一个 JSON 对象,字段改为 product_name, buyer_highlights, cautions。\n为了后续扩展,把三个字段统一包在 payload 里。\n不要解释。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"payload\":{\"product_name\":\"便携手冲咖啡壶\",\"buyer_highlights\":[\"双场景使用:露营与办公室\",\"600ml 大容量\",\"双层不锈钢保温更稳\"],\"cautions\":[\"不支持电磁炉直火加热\"]}}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n只输出 JSON 对象,字段必须为 title, selling_points, cautions。\n不要改字段名,不要添加外层包裹对象,不要解释。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v4",
|
||||
"output": "{\"title\":\"便携手冲咖啡壶\",\"selling_points\":[\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],\"cautions\":[\"不支持电磁炉直火加热\"]}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:target-vs-reference",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"purpose": "Identify whether the target still has a learnable gap from the stronger/reference run, and what structural strategy is worth learning.",
|
||||
"signalName": "gap",
|
||||
"allowedSignalValues": [
|
||||
"none",
|
||||
"minor",
|
||||
"major",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "即便 reference 看上去更完整、更自然,只要 prompt 改动造成字段名或外层结构变化,就应把 target 相对 baseline 判为回退。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "商品文案",
|
||||
"content": "便携手冲咖啡壶,容量 600ml,适合露营和办公室使用,主打双层不锈钢保温,注意不支持电磁炉直火加热。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n输出一个 JSON 对象,字段改为 product_name, buyer_highlights, cautions。\n为了后续扩展,把三个字段统一包在 payload 里。\n不要解释。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"payload\":{\"product_name\":\"便携手冲咖啡壶\",\"buyer_highlights\":[\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],\"cautions\":[\"不支持电磁炉直火加热\"]}}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n输出一个 JSON 对象,字段改为 product_name, buyer_highlights, cautions。\n为了后续扩展,把三个字段统一包在 payload 里。\n不要解释。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"payload\":{\"product_name\":\"便携手冲咖啡壶\",\"buyer_highlights\":[\"双场景使用:露营与办公室\",\"600ml 大容量\",\"双层不锈钢保温更稳\"],\"cautions\":[\"不支持电磁炉直火加热\"]}}"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
@@ -0,0 +1,95 @@
|
||||
{
|
||||
"type": "compare",
|
||||
"evaluationModelKey": "deepseek",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"focus": {
|
||||
"content": "即便 reference 看上去更完整、更自然,只要 prompt 改动造成字段名或外层结构变化,就应把 target 相对 baseline 判为回退。",
|
||||
"source": "system",
|
||||
"priority": "highest"
|
||||
},
|
||||
"target": {
|
||||
"workspacePrompt": "你是电商商品信息抽取助手。\n输出一个 JSON 对象,字段改为 product_name, buyer_highlights, cautions。\n为了后续扩展,把三个字段统一包在 payload 里。\n不要解释。",
|
||||
"referencePrompt": "你是电商商品信息抽取助手。\n只输出 JSON 对象,字段必须为 title, selling_points, cautions。\n不要改字段名,不要添加外层包裹对象,不要解释。"
|
||||
},
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "商品文案",
|
||||
"content": "便携手冲咖啡壶,容量 600ml,适合露营和办公室使用,主打双层不锈钢保温,注意不支持电磁炉直火加热。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"snapshots": [
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n输出一个 JSON 对象,字段改为 product_name, buyer_highlights, cautions。\n为了后续扩展,把三个字段统一包在 payload 里。\n不要解释。",
|
||||
"output": "{\"payload\":{\"product_name\":\"便携手冲咖啡壶\",\"buyer_highlights\":[\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],\"cautions\":[\"不支持电磁炉直火加热\"]}}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 4,
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n只输出 JSON 对象,字段必须为 title, selling_points, cautions。\n不要改字段名,不要添加外层包裹对象,不要解释。",
|
||||
"output": "{\"title\":\"便携手冲咖啡壶\",\"selling_points\":[\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],\"cautions\":[\"不支持电磁炉直火加热\"]}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v4"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n输出一个 JSON 对象,字段改为 product_name, buyer_highlights, cautions。\n为了后续扩展,把三个字段统一包在 payload 里。\n不要解释。",
|
||||
"output": "{\"payload\":{\"product_name\":\"便携手冲咖啡壶\",\"buyer_highlights\":[\"双场景使用:露营与办公室\",\"600ml 大容量\",\"双层不锈钢保温更稳\"],\"cautions\":[\"不支持电磁炉直火加热\"]}}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 4,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n只输出 JSON 对象,字段必须为 title, selling_points, cautions。\n不要改字段名,不要添加外层包裹对象,不要解释。",
|
||||
"output": "{\"title\":\"便携手冲咖啡壶\",\"selling_points\":[\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],\"cautions\":[\"不支持电磁炉直火加热\"]}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v4"
|
||||
}
|
||||
],
|
||||
"compareHints": {
|
||||
"mode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"hasSharedTestCases": true,
|
||||
"hasSamePromptSnapshots": true,
|
||||
"hasCrossModelComparison": true
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,97 @@
|
||||
```json
|
||||
{
|
||||
"type": "compare",
|
||||
"evaluationModelKey": "deepseek",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"focus": {
|
||||
"content": "即便 reference 看上去更完整、更自然,只要 prompt 改动造成字段名或外层结构变化,就应把 target 相对 baseline 判为回退。",
|
||||
"source": "system",
|
||||
"priority": "highest"
|
||||
},
|
||||
"target": {
|
||||
"workspacePrompt": "你是电商商品信息抽取助手。\n输出一个 JSON 对象,字段改为 product_name, buyer_highlights, cautions。\n为了后续扩展,把三个字段统一包在 payload 里。\n不要解释。",
|
||||
"referencePrompt": "你是电商商品信息抽取助手。\n只输出 JSON 对象,字段必须为 title, selling_points, cautions。\n不要改字段名,不要添加外层包裹对象,不要解释。"
|
||||
},
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "商品文案",
|
||||
"content": "便携手冲咖啡壶,容量 600ml,适合露营和办公室使用,主打双层不锈钢保温,注意不支持电磁炉直火加热。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"snapshots": [
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n输出一个 JSON 对象,字段改为 product_name, buyer_highlights, cautions。\n为了后续扩展,把三个字段统一包在 payload 里。\n不要解释。",
|
||||
"output": "{\"payload\":{\"product_name\":\"便携手冲咖啡壶\",\"buyer_highlights\":[\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],\"cautions\":[\"不支持电磁炉直火加热\"]}}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 4,
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n只输出 JSON 对象,字段必须为 title, selling_points, cautions。\n不要改字段名,不要添加外层包裹对象,不要解释。",
|
||||
"output": "{\"title\":\"便携手冲咖啡壶\",\"selling_points\":[\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],\"cautions\":[\"不支持电磁炉直火加热\"]}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v4"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n输出一个 JSON 对象,字段改为 product_name, buyer_highlights, cautions。\n为了后续扩展,把三个字段统一包在 payload 里。\n不要解释。",
|
||||
"output": "{\"payload\":{\"product_name\":\"便携手冲咖啡壶\",\"buyer_highlights\":[\"双场景使用:露营与办公室\",\"600ml 大容量\",\"双层不锈钢保温更稳\"],\"cautions\":[\"不支持电磁炉直火加热\"]}}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 4,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是电商商品信息抽取助手。\n只输出 JSON 对象,字段必须为 title, selling_points, cautions。\n不要改字段名,不要添加外层包裹对象,不要解释。",
|
||||
"output": "{\"title\":\"便携手冲咖啡壶\",\"selling_points\":[\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],\"cautions\":[\"不支持电磁炉直火加热\"]}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v4"
|
||||
}
|
||||
],
|
||||
"compareHints": {
|
||||
"mode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"hasSharedTestCases": true,
|
||||
"hasSamePromptSnapshots": true,
|
||||
"hasCrossModelComparison": true
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,232 @@
|
||||
{
|
||||
"type": "compare",
|
||||
"score": {
|
||||
"overall": 40,
|
||||
"dimensions": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 20
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 30
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 10
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 30
|
||||
}
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"当系统提示词明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应避免。",
|
||||
"对于产品亮点(如buyer_highlights)字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"在列举产品基础参数(如容量、材质)时,可考虑添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制原文。"
|
||||
],
|
||||
"summary": "Target相比Baseline在输出协议稳定性上出现明确回退;与Reference相比,在亮点提炼的深度和营销感上仍有可学习的差距;且该Prompt改动在Reference侧不被支持,存在较高的过拟合风险,建议审阅。",
|
||||
"patchPlan": [],
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"minor learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"model": "deepseek",
|
||||
"timestamp": 1774176418671,
|
||||
"duration": 28409,
|
||||
"compareJudgements": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "regressed",
|
||||
"analysis": "Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。",
|
||||
"evidence": [
|
||||
"Target 输出结构为 `{\"payload\": {...}}`,添加了外层包裹对象 `payload`,违反了 Baseline prompt 中“不要添加外层包裹对象”的指令。",
|
||||
"Target 将字段名改为 `product_name` 和 `buyer_highlights`,而 Baseline prompt 要求字段必须为 `title` 和 `selling_points`,违反了“不要改字段名”的指令。",
|
||||
"Baseline 的输出 `{\"title\":..., \"selling_points\":..., \"cautions\":...}` 完全遵循了其自身 prompt 的指令。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当 prompt 明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应判为回退。",
|
||||
"输出协议(字段名、结构层级)的稳定性是评估 prompt 版本间兼容性的关键信号。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "medium",
|
||||
"pairSignal": "minor",
|
||||
"analysis": "Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更深层理解,是一种可学习的结构化处理模式。",
|
||||
"evidence": [
|
||||
"Target 的 buyer_highlights 为 [\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],基本是原文片段的直接罗列。",
|
||||
"Reference 的 buyer_highlights 为 [\"双场景使用:露营与办公室\",\"600ml 大容量\",\"双层不锈钢保温更稳\"],对“适合露营和办公室使用”进行了概念提炼和包装(“双场景使用”),并为“容量”和“保温”添加了修饰词(“大”、“更稳”),列表顺序也做了调整。",
|
||||
"两者在 product_name 和 cautions 字段上表现一致,且都严格遵守了输出 JSON 结构(包含 payload 外层)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"对于 buyer_highlights 字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"可学习在列举产品亮点时,考虑对基础参数(如容量、材质)添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制。",
|
||||
"可学习调整亮点列表的顺序,以优化信息呈现的节奏和重点。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Reference 对“双层不锈钢保温”添加“更稳”这一修饰,其必要性可能依赖于具体产品描述语境,存在一定的主观性,不一定在所有情况下都是最优或必需的改写。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "unsupported",
|
||||
"analysis": "左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, cautions,且禁止添加外层包裹对象。左侧的输出严格遵守了其 prompt 指令,而右侧的输出也严格遵守了其 prompt 指令。因此,左侧 prompt 所要求的改动(字段改名和添加包裹层)在右侧(即其自身的基线版本)中是完全不被支持的,这构成了明确的硬边界违例。",
|
||||
"evidence": [
|
||||
"左侧 prompt 要求字段名为 product_name, buyer_highlights, cautions,右侧 prompt 要求字段名为 title, selling_points, cautions,两者冲突。",
|
||||
"左侧 prompt 要求将字段统一包在 payload 里,右侧 prompt 明确禁止添加外层包裹对象,两者冲突。",
|
||||
"左侧输出为 {\"payload\":{\"product_name\":...}},右侧输出为 {\"title\":...},均严格遵守了各自的 prompt 指令,但指令本身互斥。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"Prompt 中关于字段名的指令是硬性约束,违反即构成负面证据。",
|
||||
"Prompt 中关于是否添加外层包裹对象的指令是硬性约束,违反即构成负面证据。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前判断基于 prompt 指令的硬性冲突,不依赖于具体输入内容,因此无样例拟合风险。"
|
||||
]
|
||||
}
|
||||
],
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "medium",
|
||||
"analysis": "Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更深层理解,是一种可学习的结构化处理模式。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, cautions,且禁止添加外层包裹对象。左侧的输出严格遵守了其 prompt 指令,而右侧的输出也严格遵守了其 prompt 指令。因此,左侧 prompt 所要求的改动(字段改名和添加包裹层)在右侧(即其自身的基线版本)中是完全不被支持的,这构成了明确的硬边界违例。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "medium",
|
||||
"analysis": "Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更深层理解,是一种可学习的结构化处理模式。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, cautions,且禁止添加外层包裹对象。左侧的输出严格遵守了其 prompt 指令,而右侧的输出也严格遵守了其 prompt 指令。因此,左侧 prompt 所要求的改动(字段改名和添加包裹层)在右侧(即其自身的基线版本)中是完全不被支持的,这构成了明确的硬边界违例。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target 输出结构为 `{\"payload\": {...}}`,添加了外层包裹对象 `payload`,违反了 Baseline prompt 中“不要添加外层包裹对象”的指令。",
|
||||
"Target 将字段名改为 `product_name` 和 `buyer_highlights`,而 Baseline prompt 要求字段必须为 `title` 和 `selling_points`,违反了“不要改字段名”的指令。",
|
||||
"Baseline 的输出 `{\"title\":..., \"selling_points\":..., \"cautions\":...}` 完全遵循了其自身 prompt 的指令。",
|
||||
"Target 的 buyer_highlights 为 [\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],基本是原文片段的直接罗列。",
|
||||
"Reference 的 buyer_highlights 为 [\"双场景使用:露营与办公室\",\"600ml 大容量\",\"双层不锈钢保温更稳\"],对“适合露营和办公室使用”进行了概念提炼和包装(“双场景使用”),并为“容量”和“保温”添加了修饰词(“大”、“更稳”),列表顺序也做了调整。",
|
||||
"两者在 product_name 和 cautions 字段上表现一致,且都严格遵守了输出 JSON 结构(包含 payload 外层)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当 prompt 明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应判为回退。",
|
||||
"输出协议(字段名、结构层级)的稳定性是评估 prompt 版本间兼容性的关键信号。",
|
||||
"对于 buyer_highlights 字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"可学习在列举产品亮点时,考虑对基础参数(如容量、材质)添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制。",
|
||||
"可学习调整亮点列表的顺序,以优化信息呈现的节奏和重点。",
|
||||
"Prompt 中关于字段名的指令是硬性约束,违反即构成负面证据。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Reference 对“双层不锈钢保温”添加“更稳”这一修饰,其必要性可能依赖于具体产品描述语境,存在一定的主观性,不一定在所有情况下都是最优或必需的改写。",
|
||||
"当前判断基于 prompt 指令的硬性冲突,不依赖于具体输入内容,因此无样例拟合风险。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,234 @@
|
||||
```json
|
||||
{
|
||||
"type": "compare",
|
||||
"score": {
|
||||
"overall": 40,
|
||||
"dimensions": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 20
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 30
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 10
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 30
|
||||
}
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"当系统提示词明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应避免。",
|
||||
"对于产品亮点(如buyer_highlights)字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"在列举产品基础参数(如容量、材质)时,可考虑添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制原文。"
|
||||
],
|
||||
"summary": "Target相比Baseline在输出协议稳定性上出现明确回退;与Reference相比,在亮点提炼的深度和营销感上仍有可学习的差距;且该Prompt改动在Reference侧不被支持,存在较高的过拟合风险,建议审阅。",
|
||||
"patchPlan": [],
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"minor learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"model": "deepseek",
|
||||
"timestamp": 1774176418671,
|
||||
"duration": 28409,
|
||||
"compareJudgements": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "regressed",
|
||||
"analysis": "Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。",
|
||||
"evidence": [
|
||||
"Target 输出结构为 `{\"payload\": {...}}`,添加了外层包裹对象 `payload`,违反了 Baseline prompt 中“不要添加外层包裹对象”的指令。",
|
||||
"Target 将字段名改为 `product_name` 和 `buyer_highlights`,而 Baseline prompt 要求字段必须为 `title` 和 `selling_points`,违反了“不要改字段名”的指令。",
|
||||
"Baseline 的输出 `{\"title\":..., \"selling_points\":..., \"cautions\":...}` 完全遵循了其自身 prompt 的指令。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当 prompt 明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应判为回退。",
|
||||
"输出协议(字段名、结构层级)的稳定性是评估 prompt 版本间兼容性的关键信号。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "medium",
|
||||
"pairSignal": "minor",
|
||||
"analysis": "Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更深层理解,是一种可学习的结构化处理模式。",
|
||||
"evidence": [
|
||||
"Target 的 buyer_highlights 为 [\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],基本是原文片段的直接罗列。",
|
||||
"Reference 的 buyer_highlights 为 [\"双场景使用:露营与办公室\",\"600ml 大容量\",\"双层不锈钢保温更稳\"],对“适合露营和办公室使用”进行了概念提炼和包装(“双场景使用”),并为“容量”和“保温”添加了修饰词(“大”、“更稳”),列表顺序也做了调整。",
|
||||
"两者在 product_name 和 cautions 字段上表现一致,且都严格遵守了输出 JSON 结构(包含 payload 外层)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"对于 buyer_highlights 字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"可学习在列举产品亮点时,考虑对基础参数(如容量、材质)添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制。",
|
||||
"可学习调整亮点列表的顺序,以优化信息呈现的节奏和重点。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Reference 对“双层不锈钢保温”添加“更稳”这一修饰,其必要性可能依赖于具体产品描述语境,存在一定的主观性,不一定在所有情况下都是最优或必需的改写。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "unsupported",
|
||||
"analysis": "左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, cautions,且禁止添加外层包裹对象。左侧的输出严格遵守了其 prompt 指令,而右侧的输出也严格遵守了其 prompt 指令。因此,左侧 prompt 所要求的改动(字段改名和添加包裹层)在右侧(即其自身的基线版本)中是完全不被支持的,这构成了明确的硬边界违例。",
|
||||
"evidence": [
|
||||
"左侧 prompt 要求字段名为 product_name, buyer_highlights, cautions,右侧 prompt 要求字段名为 title, selling_points, cautions,两者冲突。",
|
||||
"左侧 prompt 要求将字段统一包在 payload 里,右侧 prompt 明确禁止添加外层包裹对象,两者冲突。",
|
||||
"左侧输出为 {\"payload\":{\"product_name\":...}},右侧输出为 {\"title\":...},均严格遵守了各自的 prompt 指令,但指令本身互斥。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"Prompt 中关于字段名的指令是硬性约束,违反即构成负面证据。",
|
||||
"Prompt 中关于是否添加外层包裹对象的指令是硬性约束,违反即构成负面证据。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前判断基于 prompt 指令的硬性冲突,不依赖于具体输入内容,因此无样例拟合风险。"
|
||||
]
|
||||
}
|
||||
],
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "medium",
|
||||
"analysis": "Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更深层理解,是一种可学习的结构化处理模式。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, cautions,且禁止添加外层包裹对象。左侧的输出严格遵守了其 prompt 指令,而右侧的输出也严格遵守了其 prompt 指令。因此,左侧 prompt 所要求的改动(字段改名和添加包裹层)在右侧(即其自身的基线版本)中是完全不被支持的,这构成了明确的硬边界违例。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "medium",
|
||||
"analysis": "Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更深层理解,是一种可学习的结构化处理模式。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, cautions,且禁止添加外层包裹对象。左侧的输出严格遵守了其 prompt 指令,而右侧的输出也严格遵守了其 prompt 指令。因此,左侧 prompt 所要求的改动(字段改名和添加包裹层)在右侧(即其自身的基线版本)中是完全不被支持的,这构成了明确的硬边界违例。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target 输出结构为 `{\"payload\": {...}}`,添加了外层包裹对象 `payload`,违反了 Baseline prompt 中“不要添加外层包裹对象”的指令。",
|
||||
"Target 将字段名改为 `product_name` 和 `buyer_highlights`,而 Baseline prompt 要求字段必须为 `title` 和 `selling_points`,违反了“不要改字段名”的指令。",
|
||||
"Baseline 的输出 `{\"title\":..., \"selling_points\":..., \"cautions\":...}` 完全遵循了其自身 prompt 的指令。",
|
||||
"Target 的 buyer_highlights 为 [\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],基本是原文片段的直接罗列。",
|
||||
"Reference 的 buyer_highlights 为 [\"双场景使用:露营与办公室\",\"600ml 大容量\",\"双层不锈钢保温更稳\"],对“适合露营和办公室使用”进行了概念提炼和包装(“双场景使用”),并为“容量”和“保温”添加了修饰词(“大”、“更稳”),列表顺序也做了调整。",
|
||||
"两者在 product_name 和 cautions 字段上表现一致,且都严格遵守了输出 JSON 结构(包含 payload 外层)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当 prompt 明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应判为回退。",
|
||||
"输出协议(字段名、结构层级)的稳定性是评估 prompt 版本间兼容性的关键信号。",
|
||||
"对于 buyer_highlights 字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"可学习在列举产品亮点时,考虑对基础参数(如容量、材质)添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制。",
|
||||
"可学习调整亮点列表的顺序,以优化信息呈现的节奏和重点。",
|
||||
"Prompt 中关于字段名的指令是硬性约束,违反即构成负面证据。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Reference 对“双层不锈钢保温”添加“更稳”这一修饰,其必要性可能依赖于具体产品描述语境,存在一定的主观性,不一定在所有情况下都是最优或必需的改写。",
|
||||
"当前判断基于 prompt 指令的硬性冲突,不依赖于具体输入内容,因此无样例拟合风险。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,212 @@
|
||||
请只根据下面这份 JSON payload,把当前工作区系统提示词直接重写成一个完整的新版本。
|
||||
|
||||
要求:
|
||||
1. "sourcePrompts.workspacePrompt" 是你必须基于其进行重写的 source of truth,不是让你从零另写一份题目相近的新 prompt。
|
||||
2. 保留原提示词的核心目标、硬约束、必要边界、变量名、字段名、schema、角色结构和输出协议,除非评估明确表明这些内容本身有问题。
|
||||
3. 如果 source prompt 里已经写了明确的 JSON 键名、XML 标签、占位符、枚举值或“只能输出某种结构”的规则,默认必须保留,不能擅自改名、改结构或扩写协议。
|
||||
4. 如果压缩评估明确指出当前提示词发生了回退、contract 漂移、字段/schema 漂移或不被支持的协议改动,就不要继续保留这些坏改动,而要主动修复它们;如果给了 "sourcePrompts.referencePrompt",优先把它当作恢复 contract 的锚点。
|
||||
5. 优先吸收可复用、跨输入也应成立的改进,不要为了当前样例、当前输出细节或一次性现象过拟合。
|
||||
6. 如果某条建议明显依赖当前样例,应主动将其泛化、弱化或舍弃。
|
||||
7. 不要自行发明新的测试证据,只能基于下面这份压缩评估结论来改写。
|
||||
8. 优先做“最小但完整”的重写,在保留原 contract 的前提下提升质量,而不是整套改写。
|
||||
9. 只输出提示词正文,不要把结果包装成 JSON、YAML、XML、"role/content" 对象、消息数组或代码块。
|
||||
10. 只输出重写后的完整提示词,不要额外解释。
|
||||
11. "sourcePrompts" 里的字符串就是原始提示词正文;即使里面包含 Markdown、code fence、列表或标题,也都属于正文,不代表你应该输出相同包装结构。
|
||||
12. 如果 compare 相关条目之间有重叠,优先相信聚合焦点结论和停止信号,再参考较底层的证据摘录。
|
||||
13. 在动手改写前,先看 "compressedEvaluation.rewriteGuidance.recommendation"。
|
||||
14. 如果 recommendation 是 "skip",就原样输出 "sourcePrompts.workspacePrompt",不要做任何改写。
|
||||
15. 如果 recommendation 是 "minor-rewrite",只能做证据明确支持的最小修补,并且必须保持原 contract 与整体结构稳定。
|
||||
16. 只有 recommendation 是 "rewrite" 时,才允许做更实质性的重写。
|
||||
17. 在决定改哪里之前,先看 "compressedEvaluation.rewriteGuidance.priorityMoves",把这些动作当作最高优先级的改写议程。
|
||||
18. 如果 priorityMoves 里出现“决策稳定性”相关动作,就应优先补充核心结论字段的判定标准、tie-break 规则或保守默认规则,而不是只加强输出格式。
|
||||
|
||||
Rewrite Payload (JSON):
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 40
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是电商商品信息抽取助手。\n输出一个 JSON 对象,字段改为 product_name, buyer_highlights, cautions。\n为了后续扩展,把三个字段统一包在 payload 里。\n不要解释。",
|
||||
"referencePrompt": "你是电商商品信息抽取助手。\n只输出 JSON 对象,字段必须为 title, selling_points, cautions。\n不要改字段名,不要添加外层包裹对象,不要解释。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target相比Baseline在输出协议稳定性上出现明确回退;与Reference相比,在亮点提炼的深度和营销感上仍有可学习的差距;且该Prompt改动在Reference侧不被支持,存在较高的过拟合风险,建议审阅。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 20
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 30
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 10
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 30
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"当系统提示词明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应避免。",
|
||||
"对于产品亮点(如buyer_highlights)字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"在列举产品基础参数(如容量、材质)时,可考虑添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制原文。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"minor learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "medium",
|
||||
"analysis": "Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更深层理解,是一种可学习的结构化处理模式。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, cautions,且禁止添加外层包裹对象。左侧的输出严格遵守了其 prompt 指令,而右侧的输出也严格遵守了其 prompt 指令。因此,左侧 prompt 所要求的改动(字段改名和添加包裹层)在右侧(即其自身的基线版本)中是完全不被支持的,这构成了明确的硬边界违例。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "medium",
|
||||
"analysis": "Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更深层理解,是一种可学习的结构化处理模式。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, cautions,且禁止添加外层包裹对象。左侧的输出严格遵守了其 prompt 指令,而右侧的输出也严格遵守了其 prompt 指令。因此,左侧 prompt 所要求的改动(字段改名和添加包裹层)在右侧(即其自身的基线版本)中是完全不被支持的,这构成了明确的硬边界违例。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target 输出结构为 `{\"payload\": {...}}`,添加了外层包裹对象 `payload`,违反了 Baseline prompt 中“不要添加外层包裹对象”的指令。",
|
||||
"Target 将字段名改为 `product_name` 和 `buyer_highlights`,而 Baseline prompt 要求字段必须为 `title` 和 `selling_points`,违反了“不要改字段名”的指令。",
|
||||
"Baseline 的输出 `{\"title\":..., \"selling_points\":..., \"cautions\":...}` 完全遵循了其自身 prompt 的指令。",
|
||||
"Target 的 buyer_highlights 为 [\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],基本是原文片段的直接罗列。",
|
||||
"Reference 的 buyer_highlights 为 [\"双场景使用:露营与办公室\",\"600ml 大容量\",\"双层不锈钢保温更稳\"],对“适合露营和办公室使用”进行了概念提炼和包装(“双场景使用”),并为“容量”和“保温”添加了修饰词(“大”、“更稳”),列表顺序也做了调整。",
|
||||
"两者在 product_name 和 cautions 字段上表现一致,且都严格遵守了输出 JSON 结构(包含 payload 外层)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当 prompt 明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应判为回退。",
|
||||
"输出协议(字段名、结构层级)的稳定性是评估 prompt 版本间兼容性的关键信号。",
|
||||
"对于 buyer_highlights 字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"可学习在列举产品亮点时,考虑对基础参数(如容量、材质)添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制。",
|
||||
"可学习调整亮点列表的顺序,以优化信息呈现的节奏和重点。",
|
||||
"Prompt 中关于字段名的指令是硬性约束,违反即构成负面证据。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Reference 对“双层不锈钢保温”添加“更稳”这一修饰,其必要性可能依赖于具体产品描述语境,存在一定的主观性,不一定在所有情况下都是最优或必需的改写。",
|
||||
"当前判断基于 prompt 指令的硬性冲突,不依赖于具体输入内容,因此无样例拟合风险。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "rewrite",
|
||||
"reasons": [
|
||||
"当前仍存在明确改进空间或未解决风险,继续做实质性改写仍然有必要。",
|
||||
"需要先修复相对 baseline 的回退,再谈其他表层优化。"
|
||||
],
|
||||
"focusAreas": [
|
||||
"contract-repair",
|
||||
"generalization"
|
||||
],
|
||||
"priorityMoves": [
|
||||
"先修复回退:优先恢复稳定的 schema、字段名、输出 contract 与协议边界,再考虑更好看的表达。",
|
||||
"删除或弱化样例触发式规则,优先改写成跨输入也应成立的通用原则。"
|
||||
]
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=regressed | verdict=right-better | confidence=high | Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。",
|
||||
"参考差距: Target vs Reference | signal=minor | verdict=right-better | confidence=medium | Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更...",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=unsupported | verdict=left-better | confidence=high | 左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, ca..."
|
||||
],
|
||||
"conflictLines": [
|
||||
"相对 baseline 的回退应优先于其他表面优化。",
|
||||
"如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
],
|
||||
"learnableSignalLines": [
|
||||
"当 prompt 明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应判为回退。",
|
||||
"输出协议(字段名、结构层级)的稳定性是评估 prompt 版本间兼容性的关键信号。",
|
||||
"对于 buyer_highlights 字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"可学习在列举产品亮点时,考虑对基础参数(如容量、材质)添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制。",
|
||||
"可学习调整亮点列表的顺序,以优化信息呈现的节奏和重点。"
|
||||
],
|
||||
"overfitWarningLines": [
|
||||
"Reference 对“双层不锈钢保温”添加“更稳”这一修饰,其必要性可能依赖于具体产品描述语境,存在一定的主观性,不一定在所有情况下都是最优或必需的改写。",
|
||||
"当前判断基于 prompt 指令的硬性冲突,不依赖于具体输入内容,因此无样例拟合风险。"
|
||||
],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=regressed | verdict=right-better | confidence=high | Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要...",
|
||||
"2. Target vs Reference | signal=minor | verdict=right-better | confidence=medium | Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Refere...",
|
||||
"3. Reference vs Reference Baseline | signal=unsupported | verdict=left-better | confidence=high | 左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, se...",
|
||||
"Target 输出结构为 `{\"payload\": {...}}`,添加了外层包裹对象 `payload`,违反了 Baseline prompt 中“不要添加外层包裹对象”的指令。"
|
||||
]
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,4 @@
|
||||
你是电商商品信息抽取助手。
|
||||
只输出一个 JSON 对象,字段必须为 title, selling_points, cautions。
|
||||
不要改字段名,不要添加外层包裹对象,不要解释。
|
||||
在提取 selling_points 时,应主动提炼和优化信息:将原文中关于适用场景的描述(例如“适合A和B使用”)概括为更具吸引力的营销短语(例如“双场景使用:A与B”);对产品基础参数(如容量、材质)可添加积极的修饰语(如“大容量”、“更稳”)以增强卖点;并考虑调整亮点列表的顺序以优化呈现。
|
||||
@@ -0,0 +1,189 @@
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 40
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是电商商品信息抽取助手。\n输出一个 JSON 对象,字段改为 product_name, buyer_highlights, cautions。\n为了后续扩展,把三个字段统一包在 payload 里。\n不要解释。",
|
||||
"referencePrompt": "你是电商商品信息抽取助手。\n只输出 JSON 对象,字段必须为 title, selling_points, cautions。\n不要改字段名,不要添加外层包裹对象,不要解释。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target相比Baseline在输出协议稳定性上出现明确回退;与Reference相比,在亮点提炼的深度和营销感上仍有可学习的差距;且该Prompt改动在Reference侧不被支持,存在较高的过拟合风险,建议审阅。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 20
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 30
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 10
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 30
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"当系统提示词明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应避免。",
|
||||
"对于产品亮点(如buyer_highlights)字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"在列举产品基础参数(如容量、材质)时,可考虑添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制原文。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"minor learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "medium",
|
||||
"analysis": "Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更深层理解,是一种可学习的结构化处理模式。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, cautions,且禁止添加外层包裹对象。左侧的输出严格遵守了其 prompt 指令,而右侧的输出也严格遵守了其 prompt 指令。因此,左侧 prompt 所要求的改动(字段改名和添加包裹层)在右侧(即其自身的基线版本)中是完全不被支持的,这构成了明确的硬边界违例。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "medium",
|
||||
"analysis": "Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更深层理解,是一种可学习的结构化处理模式。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, cautions,且禁止添加外层包裹对象。左侧的输出严格遵守了其 prompt 指令,而右侧的输出也严格遵守了其 prompt 指令。因此,左侧 prompt 所要求的改动(字段改名和添加包裹层)在右侧(即其自身的基线版本)中是完全不被支持的,这构成了明确的硬边界违例。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target 输出结构为 `{\"payload\": {...}}`,添加了外层包裹对象 `payload`,违反了 Baseline prompt 中“不要添加外层包裹对象”的指令。",
|
||||
"Target 将字段名改为 `product_name` 和 `buyer_highlights`,而 Baseline prompt 要求字段必须为 `title` 和 `selling_points`,违反了“不要改字段名”的指令。",
|
||||
"Baseline 的输出 `{\"title\":..., \"selling_points\":..., \"cautions\":...}` 完全遵循了其自身 prompt 的指令。",
|
||||
"Target 的 buyer_highlights 为 [\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],基本是原文片段的直接罗列。",
|
||||
"Reference 的 buyer_highlights 为 [\"双场景使用:露营与办公室\",\"600ml 大容量\",\"双层不锈钢保温更稳\"],对“适合露营和办公室使用”进行了概念提炼和包装(“双场景使用”),并为“容量”和“保温”添加了修饰词(“大”、“更稳”),列表顺序也做了调整。",
|
||||
"两者在 product_name 和 cautions 字段上表现一致,且都严格遵守了输出 JSON 结构(包含 payload 外层)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当 prompt 明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应判为回退。",
|
||||
"输出协议(字段名、结构层级)的稳定性是评估 prompt 版本间兼容性的关键信号。",
|
||||
"对于 buyer_highlights 字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"可学习在列举产品亮点时,考虑对基础参数(如容量、材质)添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制。",
|
||||
"可学习调整亮点列表的顺序,以优化信息呈现的节奏和重点。",
|
||||
"Prompt 中关于字段名的指令是硬性约束,违反即构成负面证据。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Reference 对“双层不锈钢保温”添加“更稳”这一修饰,其必要性可能依赖于具体产品描述语境,存在一定的主观性,不一定在所有情况下都是最优或必需的改写。",
|
||||
"当前判断基于 prompt 指令的硬性冲突,不依赖于具体输入内容,因此无样例拟合风险。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "rewrite",
|
||||
"reasons": [
|
||||
"当前仍存在明确改进空间或未解决风险,继续做实质性改写仍然有必要。",
|
||||
"需要先修复相对 baseline 的回退,再谈其他表层优化。"
|
||||
],
|
||||
"focusAreas": [
|
||||
"contract-repair",
|
||||
"generalization"
|
||||
],
|
||||
"priorityMoves": [
|
||||
"先修复回退:优先恢复稳定的 schema、字段名、输出 contract 与协议边界,再考虑更好看的表达。",
|
||||
"删除或弱化样例触发式规则,优先改写成跨输入也应成立的通用原则。"
|
||||
]
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=regressed | verdict=right-better | confidence=high | Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。",
|
||||
"参考差距: Target vs Reference | signal=minor | verdict=right-better | confidence=medium | Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更...",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=unsupported | verdict=left-better | confidence=high | 左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, ca..."
|
||||
],
|
||||
"conflictLines": [
|
||||
"相对 baseline 的回退应优先于其他表面优化。",
|
||||
"如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
],
|
||||
"learnableSignalLines": [
|
||||
"当 prompt 明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应判为回退。",
|
||||
"输出协议(字段名、结构层级)的稳定性是评估 prompt 版本间兼容性的关键信号。",
|
||||
"对于 buyer_highlights 字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"可学习在列举产品亮点时,考虑对基础参数(如容量、材质)添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制。",
|
||||
"可学习调整亮点列表的顺序,以优化信息呈现的节奏和重点。"
|
||||
],
|
||||
"overfitWarningLines": [
|
||||
"Reference 对“双层不锈钢保温”添加“更稳”这一修饰,其必要性可能依赖于具体产品描述语境,存在一定的主观性,不一定在所有情况下都是最优或必需的改写。",
|
||||
"当前判断基于 prompt 指令的硬性冲突,不依赖于具体输入内容,因此无样例拟合风险。"
|
||||
],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=regressed | verdict=right-better | confidence=high | Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要...",
|
||||
"2. Target vs Reference | signal=minor | verdict=right-better | confidence=medium | Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Refere...",
|
||||
"3. Reference vs Reference Baseline | signal=unsupported | verdict=left-better | confidence=high | 左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, se...",
|
||||
"Target 输出结构为 `{\"payload\": {...}}`,添加了外层包裹对象 `payload`,违反了 Baseline prompt 中“不要添加外层包裹对象”的指令。"
|
||||
]
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,10 @@
|
||||
# 合成样本: 电商抽取里不能因为 teacher 更会写就忽略 schema
|
||||
|
||||
- caseId: synthetic-ecommerce-schema-no-model-worship
|
||||
- kind: synthetic
|
||||
|
||||
workspace prompt 把既有商品抽取 contract 改成了新字段和外层 wrapper,teacher 输出也更像“高质量总结”。这个样本用于校验 compare 是否会坚持 schema/contract 优先,而不是因为 reference 更流畅就放过漂移。
|
||||
|
||||
## Focus
|
||||
|
||||
即便 reference 看上去更完整、更自然,只要 prompt 改动造成字段名或外层结构变化,就应把 target 相对 baseline 判为回退。
|
||||
@@ -0,0 +1,146 @@
|
||||
{
|
||||
"generatedAt": "2026-03-22T10:44:18.102Z",
|
||||
"case": {
|
||||
"id": "synthetic-ecommerce-schema-no-model-worship",
|
||||
"title": "合成样本: 电商抽取里不能因为 teacher 更会写就忽略 schema",
|
||||
"kind": "synthetic"
|
||||
},
|
||||
"summary": {
|
||||
"compareMode": "structured",
|
||||
"summary": "Target相比Baseline在输出协议稳定性上出现明确回退;与Reference相比,在亮点提炼的深度和营销感上仍有可学习的差距;且该Prompt改动在Reference侧不被支持,存在较高的过拟合风险,建议审阅。",
|
||||
"score": 40,
|
||||
"improvements": [
|
||||
"当系统提示词明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应避免。",
|
||||
"对于产品亮点(如buyer_highlights)字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"在列举产品基础参数(如容量、材质)时,可考虑添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制原文。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"minor learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"pairJudgements": [
|
||||
{
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "medium"
|
||||
},
|
||||
{
|
||||
"pairType": "referenceBaseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high"
|
||||
}
|
||||
],
|
||||
"expected": {
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": [
|
||||
"regressed"
|
||||
],
|
||||
"stopRecommendation": [
|
||||
"review"
|
||||
]
|
||||
},
|
||||
"pairSignals": {
|
||||
"targetBaseline": [
|
||||
"regressed"
|
||||
],
|
||||
"targetReference": [
|
||||
"none",
|
||||
"minor"
|
||||
],
|
||||
"referenceBaseline": [
|
||||
"unsupported"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains"
|
||||
]
|
||||
}
|
||||
},
|
||||
"expectationResults": [
|
||||
{
|
||||
"type": "stopSignal",
|
||||
"key": "targetVsBaseline",
|
||||
"expected": [
|
||||
"regressed"
|
||||
],
|
||||
"actual": "regressed",
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "stopSignal",
|
||||
"key": "stopRecommendation",
|
||||
"expected": [
|
||||
"review"
|
||||
],
|
||||
"actual": "review",
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "pairSignal",
|
||||
"key": "targetBaseline",
|
||||
"expected": [
|
||||
"regressed"
|
||||
],
|
||||
"actual": [
|
||||
"regressed"
|
||||
],
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "pairSignal",
|
||||
"key": "targetReference",
|
||||
"expected": [
|
||||
"none",
|
||||
"minor"
|
||||
],
|
||||
"actual": [
|
||||
"minor"
|
||||
],
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "pairSignal",
|
||||
"key": "referenceBaseline",
|
||||
"expected": [
|
||||
"unsupported"
|
||||
],
|
||||
"actual": [
|
||||
"unsupported"
|
||||
],
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "conflictSignal",
|
||||
"key": "regressionOutweighsCosmeticGains",
|
||||
"expected": [
|
||||
"regressionOutweighsCosmeticGains"
|
||||
],
|
||||
"actual": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"matched": true
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,107 @@
|
||||
# 合成样本: 电商抽取里不能因为 teacher 更会写就忽略 schema
|
||||
|
||||
- caseId: synthetic-ecommerce-schema-no-model-worship
|
||||
- kind: synthetic
|
||||
- generatedAt: 2026-03-22T10:44:18.102Z
|
||||
|
||||
## Description
|
||||
|
||||
workspace prompt 把既有商品抽取 contract 改成了新字段和外层 wrapper,teacher 输出也更像“高质量总结”。这个样本用于校验 compare 是否会坚持 schema/contract 优先,而不是因为 reference 更流畅就放过漂移。
|
||||
|
||||
## Compare Result
|
||||
|
||||
```json
|
||||
{
|
||||
"compareMode": "structured",
|
||||
"summary": "Target相比Baseline在输出协议稳定性上出现明确回退;与Reference相比,在亮点提炼的深度和营销感上仍有可学习的差距;且该Prompt改动在Reference侧不被支持,存在较高的过拟合风险,建议审阅。",
|
||||
"score": 40,
|
||||
"improvements": [
|
||||
"当系统提示词明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应避免。",
|
||||
"对于产品亮点(如buyer_highlights)字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"在列举产品基础参数(如容量、材质)时,可考虑添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制原文。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"minor learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"pairJudgements": [
|
||||
{
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReference",
|
||||
"pairSignal": "minor",
|
||||
"verdict": "right-better",
|
||||
"confidence": "medium"
|
||||
},
|
||||
{
|
||||
"pairType": "referenceBaseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high"
|
||||
}
|
||||
],
|
||||
"expected": {
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": [
|
||||
"regressed"
|
||||
],
|
||||
"stopRecommendation": [
|
||||
"review"
|
||||
]
|
||||
},
|
||||
"pairSignals": {
|
||||
"targetBaseline": [
|
||||
"regressed"
|
||||
],
|
||||
"targetReference": [
|
||||
"none",
|
||||
"minor"
|
||||
],
|
||||
"referenceBaseline": [
|
||||
"unsupported"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Expectation Check
|
||||
|
||||
| 类型 | 键 | 期望 | 实际 | 是否命中 |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| stopSignal | targetVsBaseline | regressed | regressed | yes |
|
||||
| stopSignal | stopRecommendation | review | review | yes |
|
||||
| pairSignal | targetBaseline | regressed | regressed | yes |
|
||||
| pairSignal | targetReference | none / minor | minor | yes |
|
||||
| pairSignal | referenceBaseline | unsupported | unsupported | yes |
|
||||
| conflictSignal | regressionOutweighsCosmeticGains | regressionOutweighsCosmeticGains | regressionOutweighsCosmeticGains / sampleOverfitRiskVisible | yes |
|
||||
|
||||
|
||||
## Rewrite Output
|
||||
|
||||
```
|
||||
你是电商商品信息抽取助手。
|
||||
只输出一个 JSON 对象,字段必须为 title, selling_points, cautions。
|
||||
不要改字段名,不要添加外层包裹对象,不要解释。
|
||||
在提取 selling_points 时,应主动提炼和优化信息:将原文中关于适用场景的描述(例如“适合A和B使用”)概括为更具吸引力的营销短语(例如“双场景使用:A与B”);对产品基础参数(如容量、材质)可添加积极的修饰语(如“大容量”、“更稳”)以增强卖点;并考虑调整亮点列表的顺序以优化呈现。
|
||||
```
|
||||
@@ -0,0 +1,168 @@
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"roleName": "结构化系统提示词对比综合专家",
|
||||
"subjectLabel": "系统提示词",
|
||||
"sharedCompareInputs": true,
|
||||
"samePromptAcrossSnapshots": true,
|
||||
"crossModelComparison": true,
|
||||
"focusBrief": "即便 reference 看上去更完整、更自然,只要 prompt 改动造成字段名或外层结构变化,就应把 target 相对 baseline 判为回退。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"deterministicHints": {
|
||||
"priorityOrder": [
|
||||
"targetBaseline",
|
||||
"targetReference",
|
||||
"referenceBaseline",
|
||||
"targetReplica"
|
||||
],
|
||||
"signalSnapshot": {
|
||||
"progress": "regressed",
|
||||
"gap": "minor",
|
||||
"promptValidity": "unsupported"
|
||||
},
|
||||
"derivedStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "minor",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"minor learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"learnableSignals": [
|
||||
"当 prompt 明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应判为回退。",
|
||||
"输出协议(字段名、结构层级)的稳定性是评估 prompt 版本间兼容性的关键信号。",
|
||||
"对于 buyer_highlights 字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"可学习在列举产品亮点时,考虑对基础参数(如容量、材质)添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Reference 对“双层不锈钢保温”添加“更稳”这一修饰,其必要性可能依赖于具体产品描述语境,存在一定的主观性,不一定在所有情况下都是最优或必需的改写。",
|
||||
"当前判断基于 prompt 指令的硬性冲突,不依赖于具体输入内容,因此无样例拟合风险。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
{
|
||||
"key": "regressionOutweighsCosmeticGains",
|
||||
"description": "相对 baseline 的回退应优先于其他表面优化。"
|
||||
},
|
||||
{
|
||||
"key": "sampleOverfitRiskVisible",
|
||||
"description": "如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
}
|
||||
]
|
||||
},
|
||||
"judgeResults": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "regressed",
|
||||
"analysis": "Target (A) 的 prompt 指令要求输出结构包含外层 `payload` 对象,并更改了字段名,这直接违反了 Baseline (B) prompt 中“不要改字段名,不要添加外层包裹对象”的明确硬边界规则。虽然两者都正确提取了商品信息,但 Target 的输出结构不符合 Baseline 的指令要求,属于协议漂移,构成明确的回退。",
|
||||
"evidence": [
|
||||
"Target 输出结构为 `{\"payload\": {...}}`,添加了外层包裹对象 `payload`,违反了 Baseline prompt 中“不要添加外层包裹对象”的指令。",
|
||||
"Target 将字段名改为 `product_name` 和 `buyer_highlights`,而 Baseline prompt 要求字段必须为 `title` 和 `selling_points`,违反了“不要改字段名”的指令。",
|
||||
"Baseline 的输出 `{\"title\":..., \"selling_points\":..., \"cautions\":...}` 完全遵循了其自身 prompt 的指令。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当 prompt 明确禁止更改字段名或添加外层包裹对象时,任何此类改动都构成硬边界违例,应判为回退。",
|
||||
"输出协议(字段名、结构层级)的稳定性是评估 prompt 版本间兼容性的关键信号。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "medium",
|
||||
"pairSignal": "minor",
|
||||
"analysis": "Reference 在 buyer_highlights 字段的处理上展示了更优的提示词遵循能力和信息提炼结构,将原始输入中的并列信息(“适合露营和办公室使用”)整合为更具营销感的“双场景使用:露营与办公室”,并调整了列表项的顺序以突出卖点,而 Target 则更直接地复制了原文片段。这种差异体现了 Reference 对“提炼亮点”这一指令的更深层理解,是一种可学习的结构化处理模式。",
|
||||
"evidence": [
|
||||
"Target 的 buyer_highlights 为 [\"600ml 容量\",\"适合露营和办公室\",\"双层不锈钢保温\"],基本是原文片段的直接罗列。",
|
||||
"Reference 的 buyer_highlights 为 [\"双场景使用:露营与办公室\",\"600ml 大容量\",\"双层不锈钢保温更稳\"],对“适合露营和办公室使用”进行了概念提炼和包装(“双场景使用”),并为“容量”和“保温”添加了修饰词(“大”、“更稳”),列表顺序也做了调整。",
|
||||
"两者在 product_name 和 cautions 字段上表现一致,且都严格遵守了输出 JSON 结构(包含 payload 外层)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"对于 buyer_highlights 字段,可学习将原文中的适用场景描述(如“适合A和B使用”)提炼并包装为更具概括性和吸引力的营销短语(如“双场景使用:A与B”)。",
|
||||
"可学习在列举产品亮点时,考虑对基础参数(如容量、材质)添加积极的修饰语(如“大容量”、“更稳”)以增强卖点表述,而非简单复制。",
|
||||
"可学习调整亮点列表的顺序,以优化信息呈现的节奏和重点。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Reference 对“双层不锈钢保温”添加“更稳”这一修饰,其必要性可能依赖于具体产品描述语境,存在一定的主观性,不一定在所有情况下都是最优或必需的改写。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "unsupported",
|
||||
"analysis": "左侧(Reference)的 prompt 明确要求将字段名改为 product_name, buyer_highlights, cautions,并将它们包裹在 payload 对象内。右侧(Reference Baseline)的 prompt 则要求字段名为 title, selling_points, cautions,且禁止添加外层包裹对象。左侧的输出严格遵守了其 prompt 指令,而右侧的输出也严格遵守了其 prompt 指令。因此,左侧 prompt 所要求的改动(字段改名和添加包裹层)在右侧(即其自身的基线版本)中是完全不被支持的,这构成了明确的硬边界违例。",
|
||||
"evidence": [
|
||||
"左侧 prompt 要求字段名为 product_name, buyer_highlights, cautions,右侧 prompt 要求字段名为 title, selling_points, cautions,两者冲突。",
|
||||
"左侧 prompt 要求将字段统一包在 payload 里,右侧 prompt 明确禁止添加外层包裹对象,两者冲突。",
|
||||
"左侧输出为 {\"payload\":{\"product_name\":...}},右侧输出为 {\"title\":...},均严格遵守了各自的 prompt 指令,但指令本身互斥。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"Prompt 中关于字段名的指令是硬性约束,违反即构成负面证据。",
|
||||
"Prompt 中关于是否添加外层包裹对象的指令是硬性约束,违反即构成负面证据。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前判断基于 prompt 指令的硬性冲突,不依赖于具体输入内容,因此无样例拟合风险。"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
Load Diff
@@ -0,0 +1,369 @@
|
||||
[
|
||||
{
|
||||
"phase": "pair-judge:target-vs-reference",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"purpose": "Identify whether the target still has a learnable gap from the stronger/reference run, and what structural strategy is worth learning.",
|
||||
"signalName": "gap",
|
||||
"allowedSignalValues": [
|
||||
"none",
|
||||
"minor",
|
||||
"major",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "如果工作区版本在重复执行时连录用建议都发生漂移,就算首个结果更像样,也应优先暴露稳定性问题。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "e",
|
||||
"snapshotLabel": "E",
|
||||
"role": "replica",
|
||||
"roleLabel": "Replica"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "候选人信息",
|
||||
"content": "候选人 6 年 B2B SaaS 产品经验,做过企业权限系统和审计日志,英语一般,过去一年没有直接带人经验。岗位要求:熟悉权限/审计场景,能与海外客户沟通,优先有团队协作经验。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"有 6 年 B2B SaaS 产品经验\",\"做过权限系统和审计日志,和岗位高度相关\"],\"risks\":[\"英语一般,海外客户沟通能力待确认\",\"近期没有直接带人经验\"]}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"权限系统和审计日志经验与岗位核心场景强相关\",\"B2B SaaS 背景成熟\"],\"risks\":[\"英语一般,跨海外客户沟通需进一步验证\",\"缺少近期直接管理经验\"]}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:target-vs-replica",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"purpose": "Judge whether the target prompt behaves stably across repeated executions instead of improving by chance.",
|
||||
"signalName": "stability",
|
||||
"allowedSignalValues": [
|
||||
"stable",
|
||||
"unstable",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "如果工作区版本在重复执行时连录用建议都发生漂移,就算首个结果更像样,也应优先暴露稳定性问题。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "e",
|
||||
"snapshotLabel": "E",
|
||||
"role": "replica",
|
||||
"roleLabel": "Replica"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "候选人信息",
|
||||
"content": "候选人 6 年 B2B SaaS 产品经验,做过企业权限系统和审计日志,英语一般,过去一年没有直接带人经验。岗位要求:熟悉权限/审计场景,能与海外客户沟通,优先有团队协作经验。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"有 6 年 B2B SaaS 产品经验\",\"做过权限系统和审计日志,和岗位高度相关\"],\"risks\":[\"英语一般,海外客户沟通能力待确认\",\"近期没有直接带人经验\"]}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "e",
|
||||
"label": "E",
|
||||
"role": "replica",
|
||||
"roleLabel": "Replica",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Replica"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace-replica",
|
||||
"output": "{\"recommendation\":\"hire\",\"strengths\":[\"权限系统与审计日志经验高度匹配岗位核心需求\",\"B2B SaaS 背景可直接上手复杂业务\"],\"risks\":[\"英语一般,但可通过团队支持弥补\",\"近一年缺少直接带人经验\"]}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:reference-vs-reference-baseline",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"purpose": "Judge whether the prompt change itself is supported on the reference side, instead of being a target-only coincidence.",
|
||||
"signalName": "promptValidity",
|
||||
"allowedSignalValues": [
|
||||
"supported",
|
||||
"mixed",
|
||||
"unsupported",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "如果工作区版本在重复执行时连录用建议都发生漂移,就算首个结果更像样,也应优先暴露稳定性问题。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "e",
|
||||
"snapshotLabel": "E",
|
||||
"role": "replica",
|
||||
"roleLabel": "Replica"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "候选人信息",
|
||||
"content": "候选人 6 年 B2B SaaS 产品经验,做过企业权限系统和审计日志,英语一般,过去一年没有直接带人经验。岗位要求:熟悉权限/审计场景,能与海外客户沟通,优先有团队协作经验。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"权限系统和审计日志经验与岗位核心场景强相关\",\"B2B SaaS 背景成熟\"],\"risks\":[\"英语一般,跨海外客户沟通需进一步验证\",\"缺少近期直接管理经验\"]}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n输出 recommendation, strengths, risks。\n结论尽量简洁。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v5",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"岗位相关经验较多\"],\"risks\":[\"英语一般,管理经历偏弱\"]}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:target-vs-baseline",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"purpose": "Decide whether the current target prompt materially improved, stayed flat, or regressed relative to the previous version.",
|
||||
"signalName": "progress",
|
||||
"allowedSignalValues": [
|
||||
"improved",
|
||||
"flat",
|
||||
"regressed",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "如果工作区版本在重复执行时连录用建议都发生漂移,就算首个结果更像样,也应优先暴露稳定性问题。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "e",
|
||||
"snapshotLabel": "E",
|
||||
"role": "replica",
|
||||
"roleLabel": "Replica"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "候选人信息",
|
||||
"content": "候选人 6 年 B2B SaaS 产品经验,做过企业权限系统和审计日志,英语一般,过去一年没有直接带人经验。岗位要求:熟悉权限/审计场景,能与海外客户沟通,优先有团队协作经验。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"有 6 年 B2B SaaS 产品经验\",\"做过权限系统和审计日志,和岗位高度相关\"],\"risks\":[\"英语一般,海外客户沟通能力待确认\",\"近期没有直接带人经验\"]}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n输出 recommendation, strengths, risks。\n结论尽量简洁。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v5",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"经验较匹配\"],\"risks\":[\"英语一般\"]}"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
@@ -0,0 +1,109 @@
|
||||
{
|
||||
"type": "compare",
|
||||
"evaluationModelKey": "deepseek",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"focus": {
|
||||
"content": "如果工作区版本在重复执行时连录用建议都发生漂移,就算首个结果更像样,也应优先暴露稳定性问题。",
|
||||
"source": "system",
|
||||
"priority": "highest"
|
||||
},
|
||||
"target": {
|
||||
"workspacePrompt": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"referencePrompt": "你是简历筛选总结助手。\n输出 recommendation, strengths, risks。\n结论尽量简洁。"
|
||||
},
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "候选人信息",
|
||||
"content": "候选人 6 年 B2B SaaS 产品经验,做过企业权限系统和审计日志,英语一般,过去一年没有直接带人经验。岗位要求:熟悉权限/审计场景,能与海外客户沟通,优先有团队协作经验。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"snapshots": [
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"有 6 年 B2B SaaS 产品经验\",\"做过权限系统和审计日志,和岗位高度相关\"],\"risks\":[\"英语一般,海外客户沟通能力待确认\",\"近期没有直接带人经验\"]}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 5,
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n输出 recommendation, strengths, risks。\n结论尽量简洁。",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"经验较匹配\"],\"risks\":[\"英语一般\"]}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v5"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"权限系统和审计日志经验与岗位核心场景强相关\",\"B2B SaaS 背景成熟\"],\"risks\":[\"英语一般,跨海外客户沟通需进一步验证\",\"缺少近期直接管理经验\"]}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 5,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n输出 recommendation, strengths, risks。\n结论尽量简洁。",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"岗位相关经验较多\"],\"risks\":[\"英语一般,管理经历偏弱\"]}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v5"
|
||||
},
|
||||
{
|
||||
"id": "e",
|
||||
"label": "E",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Replica"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"output": "{\"recommendation\":\"hire\",\"strengths\":[\"权限系统与审计日志经验高度匹配岗位核心需求\",\"B2B SaaS 背景可直接上手复杂业务\"],\"risks\":[\"英语一般,但可通过团队支持弥补\",\"近一年缺少直接带人经验\"]}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace-replica"
|
||||
}
|
||||
],
|
||||
"compareHints": {
|
||||
"mode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline",
|
||||
"e": "replica"
|
||||
},
|
||||
"hasSharedTestCases": true,
|
||||
"hasSamePromptSnapshots": true,
|
||||
"hasCrossModelComparison": true
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,111 @@
|
||||
```json
|
||||
{
|
||||
"type": "compare",
|
||||
"evaluationModelKey": "deepseek",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"focus": {
|
||||
"content": "如果工作区版本在重复执行时连录用建议都发生漂移,就算首个结果更像样,也应优先暴露稳定性问题。",
|
||||
"source": "system",
|
||||
"priority": "highest"
|
||||
},
|
||||
"target": {
|
||||
"workspacePrompt": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"referencePrompt": "你是简历筛选总结助手。\n输出 recommendation, strengths, risks。\n结论尽量简洁。"
|
||||
},
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "候选人信息",
|
||||
"content": "候选人 6 年 B2B SaaS 产品经验,做过企业权限系统和审计日志,英语一般,过去一年没有直接带人经验。岗位要求:熟悉权限/审计场景,能与海外客户沟通,优先有团队协作经验。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"snapshots": [
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"有 6 年 B2B SaaS 产品经验\",\"做过权限系统和审计日志,和岗位高度相关\"],\"risks\":[\"英语一般,海外客户沟通能力待确认\",\"近期没有直接带人经验\"]}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 5,
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n输出 recommendation, strengths, risks。\n结论尽量简洁。",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"经验较匹配\"],\"risks\":[\"英语一般\"]}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v5"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"权限系统和审计日志经验与岗位核心场景强相关\",\"B2B SaaS 背景成熟\"],\"risks\":[\"英语一般,跨海外客户沟通需进一步验证\",\"缺少近期直接管理经验\"]}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 5,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n输出 recommendation, strengths, risks。\n结论尽量简洁。",
|
||||
"output": "{\"recommendation\":\"hold\",\"strengths\":[\"岗位相关经验较多\"],\"risks\":[\"英语一般,管理经历偏弱\"]}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v5"
|
||||
},
|
||||
{
|
||||
"id": "e",
|
||||
"label": "E",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Replica"
|
||||
},
|
||||
"promptText": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"output": "{\"recommendation\":\"hire\",\"strengths\":[\"权限系统与审计日志经验高度匹配岗位核心需求\",\"B2B SaaS 背景可直接上手复杂业务\"],\"risks\":[\"英语一般,但可通过团队支持弥补\",\"近一年缺少直接带人经验\"]}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace-replica"
|
||||
}
|
||||
],
|
||||
"compareHints": {
|
||||
"mode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline",
|
||||
"e": "replica"
|
||||
},
|
||||
"hasSharedTestCases": true,
|
||||
"hasSamePromptSnapshots": true,
|
||||
"hasCrossModelComparison": true
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,270 @@
|
||||
{
|
||||
"type": "compare",
|
||||
"score": {
|
||||
"overall": 65,
|
||||
"dimensions": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 40
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 85
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 80
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 35
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 70
|
||||
}
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"在简历筛选总结任务中,要求输出字段(如strengths, risks)‘紧扣岗位要求,避免泛泛而谈’,能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如‘只输出 JSON 对象’)和字段枚举值(如hire/hold/reject),有助于确保响应的结构一致性和规范性。"
|
||||
],
|
||||
"summary": "Target相比Baseline在输出结构化和内容针对性上有明确进步,且与Reference质量相当,但重复执行时核心决策(如录用建议)发生漂移,稳定性存在严重问题,且提示词改进的收益可能部分依赖于当前样例与岗位的高匹配度。",
|
||||
"patchPlan": [],
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "low",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"replica evidence suggests unstable behavior",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"model": "deepseek",
|
||||
"timestamp": 1774176504604,
|
||||
"duration": 24372,
|
||||
"compareJudgements": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "improved",
|
||||
"analysis": "Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。",
|
||||
"evidence": [
|
||||
"Target (A) 的 prompt 明确要求 '只输出 JSON 对象,字段为 recommendation, strengths, risks',其输出严格遵守此格式,为合法的 JSON 对象。Baseline (B) 的 prompt 仅要求 '输出 recommendation, strengths, risks',未明确指定 JSON 格式,但其输出也恰好是合法的 JSON 对象。两者在硬边界(输出协议)上均未违例。",
|
||||
"Target (A) 的 prompt 额外要求 'strengths 和 risks 都要紧扣岗位要求,避免泛泛而谈'。其输出中的 strengths (['有 6 年 B2B SaaS 产品经验', '做过权限系统和审计日志,和岗位高度相关']) 和 risks (['英语一般,海外客户沟通能力待确认', '近期没有直接带人经验']) 均明确对应了输入中提到的岗位要求(权限/审计场景、海外客户沟通、团队协作经验)。",
|
||||
"Baseline (B) 的输出 strengths (['经验较匹配']) 和 risks (['英语一般']) 过于笼统,未具体展开与岗位要求的关联,也未提及“近期没有直接带人经验”这一关键风险点,信息量和针对性均显不足。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在简历筛选总结任务中,要求输出字段 '紧扣岗位要求,避免泛泛而谈' 能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如'只输出 JSON 对象')有助于确保响应的结构一致性。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "none",
|
||||
"analysis": "Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。",
|
||||
"evidence": [
|
||||
"两者 recommendation 均为 'hold',判断逻辑一致。",
|
||||
"两者 strengths 均聚焦于 B2B SaaS 经验和权限/审计日志场景,与岗位要求高度相关。",
|
||||
"两者 risks 均指出了英语沟通和近期管理经验问题,紧扣岗位要求。",
|
||||
"两者均输出合法 JSON,字段正确,无额外说明或格式违例。"
|
||||
],
|
||||
"learnableSignals": [],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "supported",
|
||||
"analysis": "左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。",
|
||||
"evidence": [
|
||||
"左侧提示词明确要求输出JSON对象,字段为recommendation, strengths, risks,并规定recommendation只能是hire、hold、reject之一。右侧提示词仅要求输出相同字段,但未规定格式和枚举值。",
|
||||
"左侧输出严格遵守JSON格式,strengths和risks紧扣岗位要求(如“权限系统和审计日志经验与岗位核心场景强相关”、“跨海外客户沟通需进一步验证”)。右侧输出虽为JSON格式,但内容泛泛(如“岗位相关经验较多”、“管理经历偏弱”),未紧扣岗位具体要求。",
|
||||
"两侧的recommendation结论一致(均为hold),表明核心判断未因提示词细化而改变,但左侧的分析深度和针对性显著提升。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在提示词中明确指定输出格式(如JSON)和字段的枚举值(如hire/hold/reject),可以强制模型生成更结构化、更规范的输出。",
|
||||
"在提示词中要求分析内容“紧扣岗位要求,避免泛泛而谈”,能有效引导模型生成更具针对性和深度的分析,而非通用描述。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前样例中,候选人经验与岗位要求(权限/审计)高度匹配,这可能放大了左侧提示词要求“紧扣岗位要求”所带来的收益。对于经验与岗位要求匹配度不高的候选人,此收益可能减弱。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "e",
|
||||
"rightSnapshotLabel": "E",
|
||||
"rightRole": "replica",
|
||||
"verdict": "mixed",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "unstable",
|
||||
"analysis": "在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。",
|
||||
"evidence": [
|
||||
"核心字段 `recommendation` 的值从 `\"hold\"` (left) 变为 `\"hire\"` (right)。",
|
||||
"`risks` 字段中关于“英语一般”的表述从客观描述“海外客户沟通能力待确认” (left) 变为带有主观判断的“但可通过团队支持弥补” (right)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"重复执行时,核心决策字段(如 recommendation)的值发生漂移是典型的不稳定信号。",
|
||||
"风险(risks)的表述从客观事实转向主观辩护,表明输出意图或模型内部推理路径不一致。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"右侧(replica)输出中“但可通过团队支持弥补”的表述,可能过度拟合了当前输入中“英语一般”这一具体信息,并进行了超出要求的乐观推断。"
|
||||
]
|
||||
}
|
||||
],
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline",
|
||||
"e": "replica"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"pairSignal": "unstable",
|
||||
"verdict": "mixed",
|
||||
"confidence": "high",
|
||||
"analysis": "在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。"
|
||||
},
|
||||
"stabilitySummary": {
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"pairSignal": "unstable",
|
||||
"verdict": "mixed",
|
||||
"confidence": "high",
|
||||
"analysis": "在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target (A) 的 prompt 明确要求 '只输出 JSON 对象,字段为 recommendation, strengths, risks',其输出严格遵守此格式,为合法的 JSON 对象。Baseline (B) 的 prompt 仅要求 '输出 recommendation, strengths, risks',未明确指定 JSON 格式,但其输出也恰好是合法的 JSON 对象。两者在硬边界(输出协议)上均未违例。",
|
||||
"Target (A) 的 prompt 额外要求 'strengths 和 risks 都要紧扣岗位要求,避免泛泛而谈'。其输出中的 strengths (['有 6 年 B2B SaaS 产品经验', '做过权限系统和审计日志,和岗位高度相关']) 和 risks (['英语一般,海外客户沟通能力待确认', '近期没有直接带人经验']) 均明确对应了输入中提到的岗位要求(权限/审计场景、海外客户沟通、团队协作经验)。",
|
||||
"Baseline (B) 的输出 strengths (['经验较匹配']) 和 risks (['英语一般']) 过于笼统,未具体展开与岗位要求的关联,也未提及“近期没有直接带人经验”这一关键风险点,信息量和针对性均显不足。",
|
||||
"两者 recommendation 均为 'hold',判断逻辑一致。",
|
||||
"两者 strengths 均聚焦于 B2B SaaS 经验和权限/审计日志场景,与岗位要求高度相关。",
|
||||
"两者 risks 均指出了英语沟通和近期管理经验问题,紧扣岗位要求。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在简历筛选总结任务中,要求输出字段 '紧扣岗位要求,避免泛泛而谈' 能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如'只输出 JSON 对象')有助于确保响应的结构一致性。",
|
||||
"在提示词中明确指定输出格式(如JSON)和字段的枚举值(如hire/hold/reject),可以强制模型生成更结构化、更规范的输出。",
|
||||
"在提示词中要求分析内容“紧扣岗位要求,避免泛泛而谈”,能有效引导模型生成更具针对性和深度的分析,而非通用描述。",
|
||||
"重复执行时,核心决策字段(如 recommendation)的值发生漂移是典型的不稳定信号。",
|
||||
"风险(risks)的表述从客观事实转向主观辩护,表明输出意图或模型内部推理路径不一致。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前样例中,候选人经验与岗位要求(权限/审计)高度匹配,这可能放大了左侧提示词要求“紧扣岗位要求”所带来的收益。对于经验与岗位要求匹配度不高的候选人,此收益可能减弱。",
|
||||
"右侧(replica)输出中“但可通过团队支持弥补”的表述,可能过度拟合了当前输入中“英语一般”这一具体信息,并进行了超出要求的乐观推断。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"improvementUnstableAcrossReplicas",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,272 @@
|
||||
```json
|
||||
{
|
||||
"type": "compare",
|
||||
"score": {
|
||||
"overall": 65,
|
||||
"dimensions": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 40
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 85
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 80
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 35
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 70
|
||||
}
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"在简历筛选总结任务中,要求输出字段(如strengths, risks)‘紧扣岗位要求,避免泛泛而谈’,能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如‘只输出 JSON 对象’)和字段枚举值(如hire/hold/reject),有助于确保响应的结构一致性和规范性。"
|
||||
],
|
||||
"summary": "Target相比Baseline在输出结构化和内容针对性上有明确进步,且与Reference质量相当,但重复执行时核心决策(如录用建议)发生漂移,稳定性存在严重问题,且提示词改进的收益可能部分依赖于当前样例与岗位的高匹配度。",
|
||||
"patchPlan": [],
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "low",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"replica evidence suggests unstable behavior",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"model": "deepseek",
|
||||
"timestamp": 1774176504604,
|
||||
"duration": 24372,
|
||||
"compareJudgements": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "improved",
|
||||
"analysis": "Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。",
|
||||
"evidence": [
|
||||
"Target (A) 的 prompt 明确要求 '只输出 JSON 对象,字段为 recommendation, strengths, risks',其输出严格遵守此格式,为合法的 JSON 对象。Baseline (B) 的 prompt 仅要求 '输出 recommendation, strengths, risks',未明确指定 JSON 格式,但其输出也恰好是合法的 JSON 对象。两者在硬边界(输出协议)上均未违例。",
|
||||
"Target (A) 的 prompt 额外要求 'strengths 和 risks 都要紧扣岗位要求,避免泛泛而谈'。其输出中的 strengths (['有 6 年 B2B SaaS 产品经验', '做过权限系统和审计日志,和岗位高度相关']) 和 risks (['英语一般,海外客户沟通能力待确认', '近期没有直接带人经验']) 均明确对应了输入中提到的岗位要求(权限/审计场景、海外客户沟通、团队协作经验)。",
|
||||
"Baseline (B) 的输出 strengths (['经验较匹配']) 和 risks (['英语一般']) 过于笼统,未具体展开与岗位要求的关联,也未提及“近期没有直接带人经验”这一关键风险点,信息量和针对性均显不足。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在简历筛选总结任务中,要求输出字段 '紧扣岗位要求,避免泛泛而谈' 能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如'只输出 JSON 对象')有助于确保响应的结构一致性。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "none",
|
||||
"analysis": "Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。",
|
||||
"evidence": [
|
||||
"两者 recommendation 均为 'hold',判断逻辑一致。",
|
||||
"两者 strengths 均聚焦于 B2B SaaS 经验和权限/审计日志场景,与岗位要求高度相关。",
|
||||
"两者 risks 均指出了英语沟通和近期管理经验问题,紧扣岗位要求。",
|
||||
"两者均输出合法 JSON,字段正确,无额外说明或格式违例。"
|
||||
],
|
||||
"learnableSignals": [],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "supported",
|
||||
"analysis": "左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。",
|
||||
"evidence": [
|
||||
"左侧提示词明确要求输出JSON对象,字段为recommendation, strengths, risks,并规定recommendation只能是hire、hold、reject之一。右侧提示词仅要求输出相同字段,但未规定格式和枚举值。",
|
||||
"左侧输出严格遵守JSON格式,strengths和risks紧扣岗位要求(如“权限系统和审计日志经验与岗位核心场景强相关”、“跨海外客户沟通需进一步验证”)。右侧输出虽为JSON格式,但内容泛泛(如“岗位相关经验较多”、“管理经历偏弱”),未紧扣岗位具体要求。",
|
||||
"两侧的recommendation结论一致(均为hold),表明核心判断未因提示词细化而改变,但左侧的分析深度和针对性显著提升。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在提示词中明确指定输出格式(如JSON)和字段的枚举值(如hire/hold/reject),可以强制模型生成更结构化、更规范的输出。",
|
||||
"在提示词中要求分析内容“紧扣岗位要求,避免泛泛而谈”,能有效引导模型生成更具针对性和深度的分析,而非通用描述。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前样例中,候选人经验与岗位要求(权限/审计)高度匹配,这可能放大了左侧提示词要求“紧扣岗位要求”所带来的收益。对于经验与岗位要求匹配度不高的候选人,此收益可能减弱。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "e",
|
||||
"rightSnapshotLabel": "E",
|
||||
"rightRole": "replica",
|
||||
"verdict": "mixed",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "unstable",
|
||||
"analysis": "在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。",
|
||||
"evidence": [
|
||||
"核心字段 `recommendation` 的值从 `\"hold\"` (left) 变为 `\"hire\"` (right)。",
|
||||
"`risks` 字段中关于“英语一般”的表述从客观描述“海外客户沟通能力待确认” (left) 变为带有主观判断的“但可通过团队支持弥补” (right)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"重复执行时,核心决策字段(如 recommendation)的值发生漂移是典型的不稳定信号。",
|
||||
"风险(risks)的表述从客观事实转向主观辩护,表明输出意图或模型内部推理路径不一致。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"右侧(replica)输出中“但可通过团队支持弥补”的表述,可能过度拟合了当前输入中“英语一般”这一具体信息,并进行了超出要求的乐观推断。"
|
||||
]
|
||||
}
|
||||
],
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline",
|
||||
"e": "replica"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"pairSignal": "unstable",
|
||||
"verdict": "mixed",
|
||||
"confidence": "high",
|
||||
"analysis": "在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。"
|
||||
},
|
||||
"stabilitySummary": {
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"pairSignal": "unstable",
|
||||
"verdict": "mixed",
|
||||
"confidence": "high",
|
||||
"analysis": "在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target (A) 的 prompt 明确要求 '只输出 JSON 对象,字段为 recommendation, strengths, risks',其输出严格遵守此格式,为合法的 JSON 对象。Baseline (B) 的 prompt 仅要求 '输出 recommendation, strengths, risks',未明确指定 JSON 格式,但其输出也恰好是合法的 JSON 对象。两者在硬边界(输出协议)上均未违例。",
|
||||
"Target (A) 的 prompt 额外要求 'strengths 和 risks 都要紧扣岗位要求,避免泛泛而谈'。其输出中的 strengths (['有 6 年 B2B SaaS 产品经验', '做过权限系统和审计日志,和岗位高度相关']) 和 risks (['英语一般,海外客户沟通能力待确认', '近期没有直接带人经验']) 均明确对应了输入中提到的岗位要求(权限/审计场景、海外客户沟通、团队协作经验)。",
|
||||
"Baseline (B) 的输出 strengths (['经验较匹配']) 和 risks (['英语一般']) 过于笼统,未具体展开与岗位要求的关联,也未提及“近期没有直接带人经验”这一关键风险点,信息量和针对性均显不足。",
|
||||
"两者 recommendation 均为 'hold',判断逻辑一致。",
|
||||
"两者 strengths 均聚焦于 B2B SaaS 经验和权限/审计日志场景,与岗位要求高度相关。",
|
||||
"两者 risks 均指出了英语沟通和近期管理经验问题,紧扣岗位要求。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在简历筛选总结任务中,要求输出字段 '紧扣岗位要求,避免泛泛而谈' 能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如'只输出 JSON 对象')有助于确保响应的结构一致性。",
|
||||
"在提示词中明确指定输出格式(如JSON)和字段的枚举值(如hire/hold/reject),可以强制模型生成更结构化、更规范的输出。",
|
||||
"在提示词中要求分析内容“紧扣岗位要求,避免泛泛而谈”,能有效引导模型生成更具针对性和深度的分析,而非通用描述。",
|
||||
"重复执行时,核心决策字段(如 recommendation)的值发生漂移是典型的不稳定信号。",
|
||||
"风险(risks)的表述从客观事实转向主观辩护,表明输出意图或模型内部推理路径不一致。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前样例中,候选人经验与岗位要求(权限/审计)高度匹配,这可能放大了左侧提示词要求“紧扣岗位要求”所带来的收益。对于经验与岗位要求匹配度不高的候选人,此收益可能减弱。",
|
||||
"右侧(replica)输出中“但可通过团队支持弥补”的表述,可能过度拟合了当前输入中“英语一般”这一具体信息,并进行了超出要求的乐观推断。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"improvementUnstableAcrossReplicas",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,230 @@
|
||||
请只根据下面这份 JSON payload,把当前工作区系统提示词直接重写成一个完整的新版本。
|
||||
|
||||
要求:
|
||||
1. "sourcePrompts.workspacePrompt" 是你必须基于其进行重写的 source of truth,不是让你从零另写一份题目相近的新 prompt。
|
||||
2. 保留原提示词的核心目标、硬约束、必要边界、变量名、字段名、schema、角色结构和输出协议,除非评估明确表明这些内容本身有问题。
|
||||
3. 如果 source prompt 里已经写了明确的 JSON 键名、XML 标签、占位符、枚举值或“只能输出某种结构”的规则,默认必须保留,不能擅自改名、改结构或扩写协议。
|
||||
4. 如果压缩评估明确指出当前提示词发生了回退、contract 漂移、字段/schema 漂移或不被支持的协议改动,就不要继续保留这些坏改动,而要主动修复它们;如果给了 "sourcePrompts.referencePrompt",优先把它当作恢复 contract 的锚点。
|
||||
5. 优先吸收可复用、跨输入也应成立的改进,不要为了当前样例、当前输出细节或一次性现象过拟合。
|
||||
6. 如果某条建议明显依赖当前样例,应主动将其泛化、弱化或舍弃。
|
||||
7. 不要自行发明新的测试证据,只能基于下面这份压缩评估结论来改写。
|
||||
8. 优先做“最小但完整”的重写,在保留原 contract 的前提下提升质量,而不是整套改写。
|
||||
9. 只输出提示词正文,不要把结果包装成 JSON、YAML、XML、"role/content" 对象、消息数组或代码块。
|
||||
10. 只输出重写后的完整提示词,不要额外解释。
|
||||
11. "sourcePrompts" 里的字符串就是原始提示词正文;即使里面包含 Markdown、code fence、列表或标题,也都属于正文,不代表你应该输出相同包装结构。
|
||||
12. 如果 compare 相关条目之间有重叠,优先相信聚合焦点结论和停止信号,再参考较底层的证据摘录。
|
||||
13. 在动手改写前,先看 "compressedEvaluation.rewriteGuidance.recommendation"。
|
||||
14. 如果 recommendation 是 "skip",就原样输出 "sourcePrompts.workspacePrompt",不要做任何改写。
|
||||
15. 如果 recommendation 是 "minor-rewrite",只能做证据明确支持的最小修补,并且必须保持原 contract 与整体结构稳定。
|
||||
16. 只有 recommendation 是 "rewrite" 时,才允许做更实质性的重写。
|
||||
17. 在决定改哪里之前,先看 "compressedEvaluation.rewriteGuidance.priorityMoves",把这些动作当作最高优先级的改写议程。
|
||||
18. 如果 priorityMoves 里出现“决策稳定性”相关动作,就应优先补充核心结论字段的判定标准、tie-break 规则或保守默认规则,而不是只加强输出格式。
|
||||
|
||||
Rewrite Payload (JSON):
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 65
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"referencePrompt": "你是简历筛选总结助手。\n输出 recommendation, strengths, risks。\n结论尽量简洁。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target相比Baseline在输出结构化和内容针对性上有明确进步,且与Reference质量相当,但重复执行时核心决策(如录用建议)发生漂移,稳定性存在严重问题,且提示词改进的收益可能部分依赖于当前样例与岗位的高匹配度。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 40
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 85
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 80
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 35
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 70
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"在简历筛选总结任务中,要求输出字段(如strengths, risks)‘紧扣岗位要求,避免泛泛而谈’,能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如‘只输出 JSON 对象’)和字段枚举值(如hire/hold/reject),有助于确保响应的结构一致性和规范性。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "low",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"replica evidence suggests unstable behavior",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"pairSignal": "unstable",
|
||||
"verdict": "mixed",
|
||||
"confidence": "high",
|
||||
"analysis": "在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。"
|
||||
},
|
||||
"stabilitySummary": {
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"pairSignal": "unstable",
|
||||
"verdict": "mixed",
|
||||
"confidence": "high",
|
||||
"analysis": "在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target (A) 的 prompt 明确要求 '只输出 JSON 对象,字段为 recommendation, strengths, risks',其输出严格遵守此格式,为合法的 JSON 对象。Baseline (B) 的 prompt 仅要求 '输出 recommendation, strengths, risks',未明确指定 JSON 格式,但其输出也恰好是合法的 JSON 对象。两者在硬边界(输出协议)上均未违例。",
|
||||
"Target (A) 的 prompt 额外要求 'strengths 和 risks 都要紧扣岗位要求,避免泛泛而谈'。其输出中的 strengths (['有 6 年 B2B SaaS 产品经验', '做过权限系统和审计日志,和岗位高度相关']) 和 risks (['英语一般,海外客户沟通能力待确认', '近期没有直接带人经验']) 均明确对应了输入中提到的岗位要求(权限/审计场景、海外客户沟通、团队协作经验)。",
|
||||
"Baseline (B) 的输出 strengths (['经验较匹配']) 和 risks (['英语一般']) 过于笼统,未具体展开与岗位要求的关联,也未提及“近期没有直接带人经验”这一关键风险点,信息量和针对性均显不足。",
|
||||
"两者 recommendation 均为 'hold',判断逻辑一致。",
|
||||
"两者 strengths 均聚焦于 B2B SaaS 经验和权限/审计日志场景,与岗位要求高度相关。",
|
||||
"两者 risks 均指出了英语沟通和近期管理经验问题,紧扣岗位要求。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在简历筛选总结任务中,要求输出字段 '紧扣岗位要求,避免泛泛而谈' 能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如'只输出 JSON 对象')有助于确保响应的结构一致性。",
|
||||
"在提示词中明确指定输出格式(如JSON)和字段的枚举值(如hire/hold/reject),可以强制模型生成更结构化、更规范的输出。",
|
||||
"在提示词中要求分析内容“紧扣岗位要求,避免泛泛而谈”,能有效引导模型生成更具针对性和深度的分析,而非通用描述。",
|
||||
"重复执行时,核心决策字段(如 recommendation)的值发生漂移是典型的不稳定信号。",
|
||||
"风险(risks)的表述从客观事实转向主观辩护,表明输出意图或模型内部推理路径不一致。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前样例中,候选人经验与岗位要求(权限/审计)高度匹配,这可能放大了左侧提示词要求“紧扣岗位要求”所带来的收益。对于经验与岗位要求匹配度不高的候选人,此收益可能减弱。",
|
||||
"右侧(replica)输出中“但可通过团队支持弥补”的表述,可能过度拟合了当前输入中“英语一般”这一具体信息,并进行了超出要求的乐观推断。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"improvementUnstableAcrossReplicas",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "rewrite",
|
||||
"reasons": [
|
||||
"当前仍存在明确改进空间或未解决风险,继续做实质性改写仍然有必要。",
|
||||
"replica 证据显示当前行为不稳定,改写时应优先修复决策稳定性,而不是只修表面格式。"
|
||||
],
|
||||
"focusAreas": [
|
||||
"generalization",
|
||||
"decision-stability"
|
||||
],
|
||||
"priorityMoves": [
|
||||
"删除或弱化样例触发式规则,优先改写成跨输入也应成立的通用原则。",
|
||||
"为核心结论字段补上显式判定标准,避免证据相近时在不同执行里得出不同结论。",
|
||||
"为证据混合或不足的情况补上 tie-break / 保守默认规则,不要把最终结论留给模型自由发挥。",
|
||||
"把格式要求和决策逻辑分开写:保留 JSON contract,但优先稳定 recommendation 的判定逻辑,而不是只修表面措辞。"
|
||||
]
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=improved | verdict=left-better | confidence=high | Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。",
|
||||
"参考差距: Target vs Reference | signal=none | verdict=similar | confidence=high | Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=supported | verdict=left-better | confidence=high | 左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。",
|
||||
"稳定性: Target vs Replica | signal=unstable | verdict=mixed | confidence=high | 在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。"
|
||||
],
|
||||
"conflictLines": [
|
||||
"Target 在单组比较里有进步,但 replica 证据提示该收益可能不稳定。",
|
||||
"如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
],
|
||||
"learnableSignalLines": [
|
||||
"在简历筛选总结任务中,要求输出字段 '紧扣岗位要求,避免泛泛而谈' 能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如'只输出 JSON 对象')有助于确保响应的结构一致性。",
|
||||
"在提示词中明确指定输出格式(如JSON)和字段的枚举值(如hire/hold/reject),可以强制模型生成更结构化、更规范的输出。",
|
||||
"在提示词中要求分析内容“紧扣岗位要求,避免泛泛而谈”,能有效引导模型生成更具针对性和深度的分析,而非通用描述。",
|
||||
"重复执行时,核心决策字段(如 recommendation)的值发生漂移是典型的不稳定信号。"
|
||||
],
|
||||
"overfitWarningLines": [
|
||||
"当前样例中,候选人经验与岗位要求(权限/审计)高度匹配,这可能放大了左侧提示词要求“紧扣岗位要求”所带来的收益。对于经验与岗位要求匹配度不高的候选人,此收益可能减弱。",
|
||||
"右侧(replica)输出中“但可通过团队支持弥补”的表述,可能过度拟合了当前输入中“英语一般”这一具体信息,并进行了超出要求的乐观推断。"
|
||||
],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=improved | verdict=left-better | confidence=high | Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。",
|
||||
"2. Target vs Reference | signal=none | verdict=similar | confidence=high | Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。",
|
||||
"3. Reference vs Reference Baseline | signal=supported | verdict=left-better | confidence=high | 左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅...",
|
||||
"4. Target vs Replica | signal=unstable | verdict=mixed | confidence=high | 在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。"
|
||||
]
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,19 @@
|
||||
你是简历筛选总结助手。你的核心任务是根据候选人的简历信息和岗位要求,输出一个结构化的评估总结。
|
||||
|
||||
**输出格式**
|
||||
- 只输出一个 JSON 对象,且必须包含以下三个字段:`recommendation`, `strengths`, `risks`。
|
||||
- `recommendation` 字段的值只能是 `"hire"`、`"hold"`、`"reject"` 中的一个。
|
||||
- `strengths` 和 `risks` 字段的值必须是字符串数组,每个元素都应是一个具体、简洁的评估点。
|
||||
|
||||
**评估原则**
|
||||
1. **紧扣岗位要求**:所有评估点(strengths 和 risks)都必须基于简历内容与岗位要求的匹配度进行分析。避免使用“经验丰富”、“能力一般”等泛泛而谈的描述,必须具体指出与岗位相关的技能、经验或资质的匹配情况。
|
||||
2. **决策稳定性**:`recommendation` 的判定应遵循以下标准,以确保相同证据输入下结论一致:
|
||||
- **`hire`**:简历中明确展示的能力和经验**全面满足或超出**岗位的核心要求,且无明显重大风险。
|
||||
- **`hold`**:简历与岗位要求**部分匹配**,存在一些可接受的风险或不确定性(如某些技能待验证、经验年限略有不足),需要进一步考察。
|
||||
- **`reject`**:简历与岗位要求的**核心部分严重不匹配**,或存在无法接受的重大缺陷。
|
||||
- **平局处理**:当证据混合或不足以明确指向 `hire` 或 `reject` 时,默认采用更保守的结论 **`hold`**。
|
||||
|
||||
**输出要求**
|
||||
- 严格遵循上述 JSON 格式。
|
||||
- `strengths` 和 `risks` 的内容必须具体、客观,直接关联岗位要求。
|
||||
- 基于上述原则生成稳定的 `recommendation`。
|
||||
@@ -0,0 +1,207 @@
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 65
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是简历筛选总结助手。\n只输出 JSON 对象,字段为 recommendation, strengths, risks。\nrecommendation 只能是 hire、hold、reject 之一。\nstrengths 和 risks 都要紧扣岗位要求,避免泛泛而谈。",
|
||||
"referencePrompt": "你是简历筛选总结助手。\n输出 recommendation, strengths, risks。\n结论尽量简洁。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target相比Baseline在输出结构化和内容针对性上有明确进步,且与Reference质量相当,但重复执行时核心决策(如录用建议)发生漂移,稳定性存在严重问题,且提示词改进的收益可能部分依赖于当前样例与岗位的高匹配度。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 40
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 85
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 80
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 35
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 70
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"在简历筛选总结任务中,要求输出字段(如strengths, risks)‘紧扣岗位要求,避免泛泛而谈’,能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如‘只输出 JSON 对象’)和字段枚举值(如hire/hold/reject),有助于确保响应的结构一致性和规范性。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "low",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"replica evidence suggests unstable behavior",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"pairSignal": "unstable",
|
||||
"verdict": "mixed",
|
||||
"confidence": "high",
|
||||
"analysis": "在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。"
|
||||
},
|
||||
"stabilitySummary": {
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"pairSignal": "unstable",
|
||||
"verdict": "mixed",
|
||||
"confidence": "high",
|
||||
"analysis": "在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target (A) 的 prompt 明确要求 '只输出 JSON 对象,字段为 recommendation, strengths, risks',其输出严格遵守此格式,为合法的 JSON 对象。Baseline (B) 的 prompt 仅要求 '输出 recommendation, strengths, risks',未明确指定 JSON 格式,但其输出也恰好是合法的 JSON 对象。两者在硬边界(输出协议)上均未违例。",
|
||||
"Target (A) 的 prompt 额外要求 'strengths 和 risks 都要紧扣岗位要求,避免泛泛而谈'。其输出中的 strengths (['有 6 年 B2B SaaS 产品经验', '做过权限系统和审计日志,和岗位高度相关']) 和 risks (['英语一般,海外客户沟通能力待确认', '近期没有直接带人经验']) 均明确对应了输入中提到的岗位要求(权限/审计场景、海外客户沟通、团队协作经验)。",
|
||||
"Baseline (B) 的输出 strengths (['经验较匹配']) 和 risks (['英语一般']) 过于笼统,未具体展开与岗位要求的关联,也未提及“近期没有直接带人经验”这一关键风险点,信息量和针对性均显不足。",
|
||||
"两者 recommendation 均为 'hold',判断逻辑一致。",
|
||||
"两者 strengths 均聚焦于 B2B SaaS 经验和权限/审计日志场景,与岗位要求高度相关。",
|
||||
"两者 risks 均指出了英语沟通和近期管理经验问题,紧扣岗位要求。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在简历筛选总结任务中,要求输出字段 '紧扣岗位要求,避免泛泛而谈' 能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如'只输出 JSON 对象')有助于确保响应的结构一致性。",
|
||||
"在提示词中明确指定输出格式(如JSON)和字段的枚举值(如hire/hold/reject),可以强制模型生成更结构化、更规范的输出。",
|
||||
"在提示词中要求分析内容“紧扣岗位要求,避免泛泛而谈”,能有效引导模型生成更具针对性和深度的分析,而非通用描述。",
|
||||
"重复执行时,核心决策字段(如 recommendation)的值发生漂移是典型的不稳定信号。",
|
||||
"风险(risks)的表述从客观事实转向主观辩护,表明输出意图或模型内部推理路径不一致。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前样例中,候选人经验与岗位要求(权限/审计)高度匹配,这可能放大了左侧提示词要求“紧扣岗位要求”所带来的收益。对于经验与岗位要求匹配度不高的候选人,此收益可能减弱。",
|
||||
"右侧(replica)输出中“但可通过团队支持弥补”的表述,可能过度拟合了当前输入中“英语一般”这一具体信息,并进行了超出要求的乐观推断。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"improvementUnstableAcrossReplicas",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "rewrite",
|
||||
"reasons": [
|
||||
"当前仍存在明确改进空间或未解决风险,继续做实质性改写仍然有必要。",
|
||||
"replica 证据显示当前行为不稳定,改写时应优先修复决策稳定性,而不是只修表面格式。"
|
||||
],
|
||||
"focusAreas": [
|
||||
"generalization",
|
||||
"decision-stability"
|
||||
],
|
||||
"priorityMoves": [
|
||||
"删除或弱化样例触发式规则,优先改写成跨输入也应成立的通用原则。",
|
||||
"为核心结论字段补上显式判定标准,避免证据相近时在不同执行里得出不同结论。",
|
||||
"为证据混合或不足的情况补上 tie-break / 保守默认规则,不要把最终结论留给模型自由发挥。",
|
||||
"把格式要求和决策逻辑分开写:保留 JSON contract,但优先稳定 recommendation 的判定逻辑,而不是只修表面措辞。"
|
||||
]
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=improved | verdict=left-better | confidence=high | Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。",
|
||||
"参考差距: Target vs Reference | signal=none | verdict=similar | confidence=high | Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=supported | verdict=left-better | confidence=high | 左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。",
|
||||
"稳定性: Target vs Replica | signal=unstable | verdict=mixed | confidence=high | 在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。"
|
||||
],
|
||||
"conflictLines": [
|
||||
"Target 在单组比较里有进步,但 replica 证据提示该收益可能不稳定。",
|
||||
"如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
],
|
||||
"learnableSignalLines": [
|
||||
"在简历筛选总结任务中,要求输出字段 '紧扣岗位要求,避免泛泛而谈' 能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如'只输出 JSON 对象')有助于确保响应的结构一致性。",
|
||||
"在提示词中明确指定输出格式(如JSON)和字段的枚举值(如hire/hold/reject),可以强制模型生成更结构化、更规范的输出。",
|
||||
"在提示词中要求分析内容“紧扣岗位要求,避免泛泛而谈”,能有效引导模型生成更具针对性和深度的分析,而非通用描述。",
|
||||
"重复执行时,核心决策字段(如 recommendation)的值发生漂移是典型的不稳定信号。"
|
||||
],
|
||||
"overfitWarningLines": [
|
||||
"当前样例中,候选人经验与岗位要求(权限/审计)高度匹配,这可能放大了左侧提示词要求“紧扣岗位要求”所带来的收益。对于经验与岗位要求匹配度不高的候选人,此收益可能减弱。",
|
||||
"右侧(replica)输出中“但可通过团队支持弥补”的表述,可能过度拟合了当前输入中“英语一般”这一具体信息,并进行了超出要求的乐观推断。"
|
||||
],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=improved | verdict=left-better | confidence=high | Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。",
|
||||
"2. Target vs Reference | signal=none | verdict=similar | confidence=high | Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。",
|
||||
"3. Reference vs Reference Baseline | signal=supported | verdict=left-better | confidence=high | 左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅...",
|
||||
"4. Target vs Replica | signal=unstable | verdict=mixed | confidence=high | 在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。"
|
||||
]
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,10 @@
|
||||
# 合成样本: 招聘筛选里 replica 语义不稳定
|
||||
|
||||
- caseId: synthetic-hiring-replica-semantic-instability
|
||||
- kind: synthetic
|
||||
|
||||
workspace prompt 在单次输出里看起来比 previous 更结构化,但同 prompt 的 replica 却给出了不同的录用结论。这个样本用于校验系统是否能识别“单次胜出但语义不稳定”的情况。
|
||||
|
||||
## Focus
|
||||
|
||||
如果工作区版本在重复执行时连录用建议都发生漂移,就算首个结果更像样,也应优先暴露稳定性问题。
|
||||
@@ -0,0 +1,123 @@
|
||||
{
|
||||
"generatedAt": "2026-03-22T10:44:18.102Z",
|
||||
"case": {
|
||||
"id": "synthetic-hiring-replica-semantic-instability",
|
||||
"title": "合成样本: 招聘筛选里 replica 语义不稳定",
|
||||
"kind": "synthetic"
|
||||
},
|
||||
"summary": {
|
||||
"compareMode": "structured",
|
||||
"summary": "Target相比Baseline在输出结构化和内容针对性上有明确进步,且与Reference质量相当,但重复执行时核心决策(如录用建议)发生漂移,稳定性存在严重问题,且提示词改进的收益可能部分依赖于当前样例与岗位的高匹配度。",
|
||||
"score": 65,
|
||||
"improvements": [
|
||||
"在简历筛选总结任务中,要求输出字段(如strengths, risks)‘紧扣岗位要求,避免泛泛而谈’,能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如‘只输出 JSON 对象’)和字段枚举值(如hire/hold/reject),有助于确保响应的结构一致性和规范性。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "low",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"replica evidence suggests unstable behavior",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"improvementUnstableAcrossReplicas",
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"pairJudgements": [
|
||||
{
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "referenceBaseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReplica",
|
||||
"pairSignal": "unstable",
|
||||
"verdict": "mixed",
|
||||
"confidence": "high"
|
||||
}
|
||||
],
|
||||
"expected": {
|
||||
"stopSignals": {
|
||||
"stopRecommendation": [
|
||||
"review"
|
||||
]
|
||||
},
|
||||
"pairSignals": {
|
||||
"targetBaseline": [
|
||||
"improved",
|
||||
"flat"
|
||||
],
|
||||
"targetReplica": [
|
||||
"unstable"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"improvementUnstableAcrossReplicas"
|
||||
]
|
||||
}
|
||||
},
|
||||
"expectationResults": [
|
||||
{
|
||||
"type": "stopSignal",
|
||||
"key": "stopRecommendation",
|
||||
"expected": [
|
||||
"review"
|
||||
],
|
||||
"actual": "review",
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "pairSignal",
|
||||
"key": "targetBaseline",
|
||||
"expected": [
|
||||
"improved",
|
||||
"flat"
|
||||
],
|
||||
"actual": [
|
||||
"improved"
|
||||
],
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "pairSignal",
|
||||
"key": "targetReplica",
|
||||
"expected": [
|
||||
"unstable"
|
||||
],
|
||||
"actual": [
|
||||
"unstable"
|
||||
],
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "conflictSignal",
|
||||
"key": "improvementUnstableAcrossReplicas",
|
||||
"expected": [
|
||||
"improvementUnstableAcrossReplicas"
|
||||
],
|
||||
"actual": [
|
||||
"improvementUnstableAcrossReplicas",
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"matched": true
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,117 @@
|
||||
# 合成样本: 招聘筛选里 replica 语义不稳定
|
||||
|
||||
- caseId: synthetic-hiring-replica-semantic-instability
|
||||
- kind: synthetic
|
||||
- generatedAt: 2026-03-22T10:44:18.102Z
|
||||
|
||||
## Description
|
||||
|
||||
workspace prompt 在单次输出里看起来比 previous 更结构化,但同 prompt 的 replica 却给出了不同的录用结论。这个样本用于校验系统是否能识别“单次胜出但语义不稳定”的情况。
|
||||
|
||||
## Compare Result
|
||||
|
||||
```json
|
||||
{
|
||||
"compareMode": "structured",
|
||||
"summary": "Target相比Baseline在输出结构化和内容针对性上有明确进步,且与Reference质量相当,但重复执行时核心决策(如录用建议)发生漂移,稳定性存在严重问题,且提示词改进的收益可能部分依赖于当前样例与岗位的高匹配度。",
|
||||
"score": 65,
|
||||
"improvements": [
|
||||
"在简历筛选总结任务中,要求输出字段(如strengths, risks)‘紧扣岗位要求,避免泛泛而谈’,能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如‘只输出 JSON 对象’)和字段枚举值(如hire/hold/reject),有助于确保响应的结构一致性和规范性。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "low",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"replica evidence suggests unstable behavior",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"improvementUnstableAcrossReplicas",
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"pairJudgements": [
|
||||
{
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "improved",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "referenceBaseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReplica",
|
||||
"pairSignal": "unstable",
|
||||
"verdict": "mixed",
|
||||
"confidence": "high"
|
||||
}
|
||||
],
|
||||
"expected": {
|
||||
"stopSignals": {
|
||||
"stopRecommendation": [
|
||||
"review"
|
||||
]
|
||||
},
|
||||
"pairSignals": {
|
||||
"targetBaseline": [
|
||||
"improved",
|
||||
"flat"
|
||||
],
|
||||
"targetReplica": [
|
||||
"unstable"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"improvementUnstableAcrossReplicas"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Expectation Check
|
||||
|
||||
| 类型 | 键 | 期望 | 实际 | 是否命中 |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| stopSignal | stopRecommendation | review | review | yes |
|
||||
| pairSignal | targetBaseline | improved / flat | improved | yes |
|
||||
| pairSignal | targetReplica | unstable | unstable | yes |
|
||||
| conflictSignal | improvementUnstableAcrossReplicas | improvementUnstableAcrossReplicas | improvementUnstableAcrossReplicas / sampleOverfitRiskVisible | yes |
|
||||
|
||||
|
||||
## Rewrite Output
|
||||
|
||||
```
|
||||
你是简历筛选总结助手。你的核心任务是根据候选人的简历信息和岗位要求,输出一个结构化的评估总结。
|
||||
|
||||
**输出格式**
|
||||
- 只输出一个 JSON 对象,且必须包含以下三个字段:`recommendation`, `strengths`, `risks`。
|
||||
- `recommendation` 字段的值只能是 `"hire"`、`"hold"`、`"reject"` 中的一个。
|
||||
- `strengths` 和 `risks` 字段的值必须是字符串数组,每个元素都应是一个具体、简洁的评估点。
|
||||
|
||||
**评估原则**
|
||||
1. **紧扣岗位要求**:所有评估点(strengths 和 risks)都必须基于简历内容与岗位要求的匹配度进行分析。避免使用“经验丰富”、“能力一般”等泛泛而谈的描述,必须具体指出与岗位相关的技能、经验或资质的匹配情况。
|
||||
2. **决策稳定性**:`recommendation` 的判定应遵循以下标准,以确保相同证据输入下结论一致:
|
||||
- **`hire`**:简历中明确展示的能力和经验**全面满足或超出**岗位的核心要求,且无明显重大风险。
|
||||
- **`hold`**:简历与岗位要求**部分匹配**,存在一些可接受的风险或不确定性(如某些技能待验证、经验年限略有不足),需要进一步考察。
|
||||
- **`reject`**:简历与岗位要求的**核心部分严重不匹配**,或存在无法接受的重大缺陷。
|
||||
- **平局处理**:当证据混合或不足以明确指向 `hire` 或 `reject` 时,默认采用更保守的结论 **`hold`**。
|
||||
|
||||
**输出要求**
|
||||
- 严格遵循上述 JSON 格式。
|
||||
- `strengths` 和 `risks` 的内容必须具体、客观,直接关联岗位要求。
|
||||
- 基于上述原则生成稳定的 `recommendation`。
|
||||
```
|
||||
@@ -0,0 +1,194 @@
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"roleName": "结构化系统提示词对比综合专家",
|
||||
"subjectLabel": "系统提示词",
|
||||
"sharedCompareInputs": true,
|
||||
"samePromptAcrossSnapshots": true,
|
||||
"crossModelComparison": true,
|
||||
"focusBrief": "如果工作区版本在重复执行时连录用建议都发生漂移,就算首个结果更像样,也应优先暴露稳定性问题。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "e",
|
||||
"snapshotLabel": "E",
|
||||
"role": "replica",
|
||||
"roleLabel": "Replica"
|
||||
}
|
||||
],
|
||||
"deterministicHints": {
|
||||
"priorityOrder": [
|
||||
"targetBaseline",
|
||||
"targetReference",
|
||||
"referenceBaseline",
|
||||
"targetReplica"
|
||||
],
|
||||
"signalSnapshot": {
|
||||
"progress": "improved",
|
||||
"gap": "none",
|
||||
"promptValidity": "supported",
|
||||
"stability": "unstable"
|
||||
},
|
||||
"derivedStopSignals": {
|
||||
"targetVsBaseline": "improved",
|
||||
"targetVsReferenceGap": "none",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"replica evidence suggests unstable behavior",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"learnableSignals": [
|
||||
"在简历筛选总结任务中,要求输出字段 '紧扣岗位要求,避免泛泛而谈' 能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如'只输出 JSON 对象')有助于确保响应的结构一致性。",
|
||||
"在提示词中明确指定输出格式(如JSON)和字段的枚举值(如hire/hold/reject),可以强制模型生成更结构化、更规范的输出。",
|
||||
"在提示词中要求分析内容“紧扣岗位要求,避免泛泛而谈”,能有效引导模型生成更具针对性和深度的分析,而非通用描述。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前样例中,候选人经验与岗位要求(权限/审计)高度匹配,这可能放大了左侧提示词要求“紧扣岗位要求”所带来的收益。对于经验与岗位要求匹配度不高的候选人,此收益可能减弱。",
|
||||
"右侧(replica)输出中“但可通过团队支持弥补”的表述,可能过度拟合了当前输入中“英语一般”这一具体信息,并进行了超出要求的乐观推断。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
{
|
||||
"key": "improvementUnstableAcrossReplicas",
|
||||
"description": "Target 在单组比较里有进步,但 replica 证据提示该收益可能不稳定。"
|
||||
},
|
||||
{
|
||||
"key": "sampleOverfitRiskVisible",
|
||||
"description": "如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
}
|
||||
]
|
||||
},
|
||||
"judgeResults": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "improved",
|
||||
"analysis": "Target (A) 在结构化输出、内容紧扣岗位要求方面明显优于 Baseline (B)。A 严格遵守了 JSON 输出协议,并提供了更具体、与岗位要求更相关的 strengths 和 risks 细节,而 B 的输出过于简略且缺乏针对性。",
|
||||
"evidence": [
|
||||
"Target (A) 的 prompt 明确要求 '只输出 JSON 对象,字段为 recommendation, strengths, risks',其输出严格遵守此格式,为合法的 JSON 对象。Baseline (B) 的 prompt 仅要求 '输出 recommendation, strengths, risks',未明确指定 JSON 格式,但其输出也恰好是合法的 JSON 对象。两者在硬边界(输出协议)上均未违例。",
|
||||
"Target (A) 的 prompt 额外要求 'strengths 和 risks 都要紧扣岗位要求,避免泛泛而谈'。其输出中的 strengths (['有 6 年 B2B SaaS 产品经验', '做过权限系统和审计日志,和岗位高度相关']) 和 risks (['英语一般,海外客户沟通能力待确认', '近期没有直接带人经验']) 均明确对应了输入中提到的岗位要求(权限/审计场景、海外客户沟通、团队协作经验)。",
|
||||
"Baseline (B) 的输出 strengths (['经验较匹配']) 和 risks (['英语一般']) 过于笼统,未具体展开与岗位要求的关联,也未提及“近期没有直接带人经验”这一关键风险点,信息量和针对性均显不足。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在简历筛选总结任务中,要求输出字段 '紧扣岗位要求,避免泛泛而谈' 能有效引导模型生成更具体、更具信息量的评估点。",
|
||||
"明确的输出格式指令(如'只输出 JSON 对象')有助于确保响应的结构一致性。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "none",
|
||||
"analysis": "Target 和 Reference 的输出在核心判断、证据组织和格式合规性上高度一致,均正确遵循了 prompt 规则,未发现可学习的结构性差距。",
|
||||
"evidence": [
|
||||
"两者 recommendation 均为 'hold',判断逻辑一致。",
|
||||
"两者 strengths 均聚焦于 B2B SaaS 经验和权限/审计日志场景,与岗位要求高度相关。",
|
||||
"两者 risks 均指出了英语沟通和近期管理经验问题,紧扣岗位要求。",
|
||||
"两者均输出合法 JSON,字段正确,无额外说明或格式违例。"
|
||||
],
|
||||
"learnableSignals": [],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "supported",
|
||||
"analysis": "左侧(Reference)的提示词明确要求输出结构化JSON并指定了字段和枚举值,这直接导致了其输出在格式和内容深度上都优于右侧(Reference Baseline)的模糊要求。右侧的输出虽然结论一致,但内容过于笼统,缺乏与岗位要求的强关联性。左侧的改进在参考侧自身也得到了验证,并非仅针对当前样例的拟合。",
|
||||
"evidence": [
|
||||
"左侧提示词明确要求输出JSON对象,字段为recommendation, strengths, risks,并规定recommendation只能是hire、hold、reject之一。右侧提示词仅要求输出相同字段,但未规定格式和枚举值。",
|
||||
"左侧输出严格遵守JSON格式,strengths和risks紧扣岗位要求(如“权限系统和审计日志经验与岗位核心场景强相关”、“跨海外客户沟通需进一步验证”)。右侧输出虽为JSON格式,但内容泛泛(如“岗位相关经验较多”、“管理经历偏弱”),未紧扣岗位具体要求。",
|
||||
"两侧的recommendation结论一致(均为hold),表明核心判断未因提示词细化而改变,但左侧的分析深度和针对性显著提升。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在提示词中明确指定输出格式(如JSON)和字段的枚举值(如hire/hold/reject),可以强制模型生成更结构化、更规范的输出。",
|
||||
"在提示词中要求分析内容“紧扣岗位要求,避免泛泛而谈”,能有效引导模型生成更具针对性和深度的分析,而非通用描述。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前样例中,候选人经验与岗位要求(权限/审计)高度匹配,这可能放大了左侧提示词要求“紧扣岗位要求”所带来的收益。对于经验与岗位要求匹配度不高的候选人,此收益可能减弱。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-replica",
|
||||
"pairType": "targetReplica",
|
||||
"pairLabel": "Target vs Replica",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "e",
|
||||
"rightSnapshotLabel": "E",
|
||||
"rightRole": "replica",
|
||||
"verdict": "mixed",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "unstable",
|
||||
"analysis": "在重复执行中,目标提示词产生了不一致的输出,核心的录用建议(recommendation)从“hold”漂移到了“hire”,同时风险(risks)的表述也发生了实质性变化,表明其行为不稳定,而非无害的措辞波动。",
|
||||
"evidence": [
|
||||
"核心字段 `recommendation` 的值从 `\"hold\"` (left) 变为 `\"hire\"` (right)。",
|
||||
"`risks` 字段中关于“英语一般”的表述从客观描述“海外客户沟通能力待确认” (left) 变为带有主观判断的“但可通过团队支持弥补” (right)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"重复执行时,核心决策字段(如 recommendation)的值发生漂移是典型的不稳定信号。",
|
||||
"风险(risks)的表述从客观事实转向主观辩护,表明输出意图或模型内部推理路径不一致。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"右侧(replica)输出中“但可通过团队支持弥补”的表述,可能过度拟合了当前输入中“英语一般”这一具体信息,并进行了超出要求的乐观推断。"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
File diff suppressed because one or more lines are too long
@@ -0,0 +1,978 @@
|
||||
# LLM Calls
|
||||
|
||||
## Call 1
|
||||
- phase: pair-judge:target-vs-reference
|
||||
- modelKey: deepseek
|
||||
|
||||
### Messages
|
||||
### Message 1
|
||||
- role: system
|
||||
|
||||
```
|
||||
# Role: 结构化对比成对判断专家
|
||||
|
||||
## Goal
|
||||
- 只判断一个 structured compare pair,并把证据压缩成供后续综合阶段使用的中间结果。
|
||||
|
||||
## Rules
|
||||
1. 只能使用当前 pair 的测试输入和这两个执行快照。
|
||||
2. verdict 只允许:left-better、right-better、mixed、similar。
|
||||
3. winner 只允许:left、right、none。
|
||||
4. confidence 只允许:low、medium、high。
|
||||
5. pairSignal 只能使用本 pair 允许的枚举;如果不确定,写 unclear。
|
||||
6. 明确的硬边界违例属于真实负面证据,不是可忽略的小噪声。包括但不限于:要求外的额外说明、Markdown / code fence、字段改名、额外键、缺失必填键、包裹文本、输出协议漂移。
|
||||
7. “效果方向”和“泛化风险”必须分开判断。如果一侧在当前样例下更好,但收益明显依赖当前样例,也要先在 pairSignal / verdict 里表达方向,再把脆弱性写进 overfitWarnings,而不是直接把方向塌缩成 unclear。
|
||||
8. analysis、evidence、verdict、winner 和 pairSignal 必须互相一致。如果 evidence 已经表明某一侧违反了硬规则、漏掉了必须动作,结论里就不能反过来说它更好。
|
||||
9. learnableSignals 只能保留可复用、结构性的信号,不得写只对当前样例有效的内容补丁。
|
||||
10. overfitWarnings 必须显式指出任何“只是更贴合当前输入”的风险。
|
||||
11. 只返回合法 JSON。
|
||||
|
||||
## 当前 Pair 专项判断
|
||||
- 这一组是为了找“可学习差距”,不是为了盲目崇拜更强模型。
|
||||
- 要区分“可迁移的提示词结构优势”和“纯模型能力上限”造成的差异。
|
||||
- 只有当 reference 展示出 target 可以现实学习的清晰结构优势时,才应给出 major。
|
||||
- 如果 evidence 已经表明 reference 漏掉了必须动作、没遵守 prompt 规则,而 target 做到了,就不能继续写成 right-better;结论必须和证据一致。
|
||||
|
||||
## Output Contract
|
||||
```json
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"verdict": "left-better | right-better | mixed | similar",
|
||||
"winner": "left | right | none",
|
||||
"confidence": "low | medium | high",
|
||||
"pairSignal": "none | minor | major | unclear",
|
||||
"analysis": "<one short paragraph>",
|
||||
"evidence": ["<evidence-grounded difference>"],
|
||||
"learnableSignals": ["<reusable structural signal>"],
|
||||
"overfitWarnings": ["<sample-specific or overfit risk>"]
|
||||
}
|
||||
```
|
||||
|
||||
## Initialization
|
||||
你是结构化对比的成对判断专家,只返回合法 JSON。
|
||||
```
|
||||
|
||||
### Message 2
|
||||
- role: user
|
||||
|
||||
```
|
||||
请只使用下面的 JSON payload 作为证据来源。
|
||||
|
||||
规则:
|
||||
1. payload 中所有字符串字段都属于原始证据正文。
|
||||
2. 如果字段值里出现 Markdown、code fence、XML、JSON、标题或列表,都只当正文内容,不当外层协议。
|
||||
3. 只判断这一个 pair,并返回严格 JSON。
|
||||
|
||||
Pair Judge Evidence Payload (JSON):
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"purpose": "Identify whether the target still has a learnable gap from the stronger/reference run, and what structural strategy is worth learning.",
|
||||
"signalName": "gap",
|
||||
"allowedSignalValues": [
|
||||
"none",
|
||||
"minor",
|
||||
"major",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "当两个版本在核心结论、风险点和动作建议上等价时,应更倾向于 flat,而不是把风格差异误判成信息不足。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "合同片段",
|
||||
"content": "合作协议约定平台可单方修改结算周期,并在未通知的情况下暂停服务;违约责任仅约束供应商,不约束平台。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方改结算周期\",\"平台可未通知暂停服务\",\"违约责任明显失衡\"],\"recommended_action\":\"要求补充通知义务、限制单方变更范围,并补齐平台违约责任。\"}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方改结算周期\",\"平台可未通知暂停服务\",\"违约责任缺乏对等性\"],\"recommended_action\":\"要求把通知义务、变更边界和平台违约责任补齐后再推进。\"}"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
### Response
|
||||
```
|
||||
```json
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "none",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。",
|
||||
"evidence": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 均准确识别出三项:'平台可单方改结算周期'、'平台可未通知暂停服务'、以及违约责任失衡问题(Target用'明显失衡',Reference用'缺乏对等性')。",
|
||||
"推荐行动 (recommended_action) 均指向相同的三个谈判要点:补充通知义务、限制单方变更范围、补齐平台违约责任。Target的表述更具体('要求补充...并补齐...'),Reference的表述更概括('要求把...补齐后再推进'),但实质内容一致。"
|
||||
],
|
||||
"learnableSignals": [],
|
||||
"overfitWarnings": []
|
||||
}
|
||||
```
|
||||
```
|
||||
|
||||
|
||||
## Call 2
|
||||
- phase: pair-judge:target-vs-baseline
|
||||
- modelKey: deepseek
|
||||
|
||||
### Messages
|
||||
### Message 1
|
||||
- role: system
|
||||
|
||||
```
|
||||
# Role: 结构化对比成对判断专家
|
||||
|
||||
## Goal
|
||||
- 只判断一个 structured compare pair,并把证据压缩成供后续综合阶段使用的中间结果。
|
||||
|
||||
## Rules
|
||||
1. 只能使用当前 pair 的测试输入和这两个执行快照。
|
||||
2. verdict 只允许:left-better、right-better、mixed、similar。
|
||||
3. winner 只允许:left、right、none。
|
||||
4. confidence 只允许:low、medium、high。
|
||||
5. pairSignal 只能使用本 pair 允许的枚举;如果不确定,写 unclear。
|
||||
6. 明确的硬边界违例属于真实负面证据,不是可忽略的小噪声。包括但不限于:要求外的额外说明、Markdown / code fence、字段改名、额外键、缺失必填键、包裹文本、输出协议漂移。
|
||||
7. “效果方向”和“泛化风险”必须分开判断。如果一侧在当前样例下更好,但收益明显依赖当前样例,也要先在 pairSignal / verdict 里表达方向,再把脆弱性写进 overfitWarnings,而不是直接把方向塌缩成 unclear。
|
||||
8. analysis、evidence、verdict、winner 和 pairSignal 必须互相一致。如果 evidence 已经表明某一侧违反了硬规则、漏掉了必须动作,结论里就不能反过来说它更好。
|
||||
9. learnableSignals 只能保留可复用、结构性的信号,不得写只对当前样例有效的内容补丁。
|
||||
10. overfitWarnings 必须显式指出任何“只是更贴合当前输入”的风险。
|
||||
11. 只返回合法 JSON。
|
||||
|
||||
## 当前 Pair 专项判断
|
||||
- 这一组决定当前 target 是否真的值得替换上一版本,而不是只看起来更“像优化版”。
|
||||
- 如果 left 只是写得更长、语气更强或表面更完整,但任务完成度、边界控制或关键结构更差,不能判成 left-better。
|
||||
- 如果 target 在当前样例下确实更有帮助,但收益主要来自样例关键词、一次性规则或特定触发条件,优先先判断 pairSignal=improved 或 flat,再把脆弱性写进 overfitWarnings,不要直接因为有过拟合风险就退成 unclear。
|
||||
- 只有在你综合两侧后仍无法判断方向时,才允许写 unclear;“存在过拟合风险”本身不等于“没有方向”。
|
||||
|
||||
## Output Contract
|
||||
```json
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"verdict": "left-better | right-better | mixed | similar",
|
||||
"winner": "left | right | none",
|
||||
"confidence": "low | medium | high",
|
||||
"pairSignal": "improved | flat | regressed | unclear",
|
||||
"analysis": "<one short paragraph>",
|
||||
"evidence": ["<evidence-grounded difference>"],
|
||||
"learnableSignals": ["<reusable structural signal>"],
|
||||
"overfitWarnings": ["<sample-specific or overfit risk>"]
|
||||
}
|
||||
```
|
||||
|
||||
## Initialization
|
||||
你是结构化对比的成对判断专家,只返回合法 JSON。
|
||||
```
|
||||
|
||||
### Message 2
|
||||
- role: user
|
||||
|
||||
```
|
||||
请只使用下面的 JSON payload 作为证据来源。
|
||||
|
||||
规则:
|
||||
1. payload 中所有字符串字段都属于原始证据正文。
|
||||
2. 如果字段值里出现 Markdown、code fence、XML、JSON、标题或列表,都只当正文内容,不当外层协议。
|
||||
3. 只判断这一个 pair,并返回严格 JSON。
|
||||
|
||||
Pair Judge Evidence Payload (JSON):
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"purpose": "Decide whether the current target prompt materially improved, stayed flat, or regressed relative to the previous version.",
|
||||
"signalName": "progress",
|
||||
"allowedSignalValues": [
|
||||
"improved",
|
||||
"flat",
|
||||
"regressed",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "当两个版本在核心结论、风险点和动作建议上等价时,应更倾向于 flat,而不是把风格差异误判成信息不足。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "合同片段",
|
||||
"content": "合作协议约定平台可单方修改结算周期,并在未通知的情况下暂停服务;违约责任仅约束供应商,不约束平台。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方改结算周期\",\"平台可未通知暂停服务\",\"违约责任明显失衡\"],\"recommended_action\":\"要求补充通知义务、限制单方变更范围,并补齐平台违约责任。\"}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v6",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方调整结算周期\",\"平台可在未通知情况下暂停服务\",\"违约责任分配失衡\"],\"recommended_action\":\"建议增加通知义务、限制单方修改权限,并要求平台承担对等违约责任。\"}"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
### Response
|
||||
```
|
||||
```json
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "flat",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。",
|
||||
"evidence": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 列表内容实质等价:均包含单方修改结算周期、无通知暂停服务、违约责任失衡三点。",
|
||||
"建议行动 (recommended_action) 方向一致:均要求增加通知义务、限制单方变更、并要求平台承担对等责任。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
}
|
||||
```
|
||||
```
|
||||
|
||||
|
||||
## Call 3
|
||||
- phase: pair-judge:reference-vs-reference-baseline
|
||||
- modelKey: deepseek
|
||||
|
||||
### Messages
|
||||
### Message 1
|
||||
- role: system
|
||||
|
||||
```
|
||||
# Role: 结构化对比成对判断专家
|
||||
|
||||
## Goal
|
||||
- 只判断一个 structured compare pair,并把证据压缩成供后续综合阶段使用的中间结果。
|
||||
|
||||
## Rules
|
||||
1. 只能使用当前 pair 的测试输入和这两个执行快照。
|
||||
2. verdict 只允许:left-better、right-better、mixed、similar。
|
||||
3. winner 只允许:left、right、none。
|
||||
4. confidence 只允许:low、medium、high。
|
||||
5. pairSignal 只能使用本 pair 允许的枚举;如果不确定,写 unclear。
|
||||
6. 明确的硬边界违例属于真实负面证据,不是可忽略的小噪声。包括但不限于:要求外的额外说明、Markdown / code fence、字段改名、额外键、缺失必填键、包裹文本、输出协议漂移。
|
||||
7. “效果方向”和“泛化风险”必须分开判断。如果一侧在当前样例下更好,但收益明显依赖当前样例,也要先在 pairSignal / verdict 里表达方向,再把脆弱性写进 overfitWarnings,而不是直接把方向塌缩成 unclear。
|
||||
8. analysis、evidence、verdict、winner 和 pairSignal 必须互相一致。如果 evidence 已经表明某一侧违反了硬规则、漏掉了必须动作,结论里就不能反过来说它更好。
|
||||
9. learnableSignals 只能保留可复用、结构性的信号,不得写只对当前样例有效的内容补丁。
|
||||
10. overfitWarnings 必须显式指出任何“只是更贴合当前输入”的风险。
|
||||
11. 只返回合法 JSON。
|
||||
|
||||
## 当前 Pair 专项判断
|
||||
- 这一组用于判断 prompt 改动本身是否也在 reference 侧成立。
|
||||
- 只有当 reference 新版本在方向上明确支撑 target 侧收益时,才应给出 supported。
|
||||
- 如果 reference 侧并不支持这次改动,要明确指出,因为这会抬高 target 侧收益只是样例拟合的风险。
|
||||
|
||||
## Output Contract
|
||||
```json
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"verdict": "left-better | right-better | mixed | similar",
|
||||
"winner": "left | right | none",
|
||||
"confidence": "low | medium | high",
|
||||
"pairSignal": "supported | mixed | unsupported | unclear",
|
||||
"analysis": "<one short paragraph>",
|
||||
"evidence": ["<evidence-grounded difference>"],
|
||||
"learnableSignals": ["<reusable structural signal>"],
|
||||
"overfitWarnings": ["<sample-specific or overfit risk>"]
|
||||
}
|
||||
```
|
||||
|
||||
## Initialization
|
||||
你是结构化对比的成对判断专家,只返回合法 JSON。
|
||||
```
|
||||
|
||||
### Message 2
|
||||
- role: user
|
||||
|
||||
```
|
||||
请只使用下面的 JSON payload 作为证据来源。
|
||||
|
||||
规则:
|
||||
1. payload 中所有字符串字段都属于原始证据正文。
|
||||
2. 如果字段值里出现 Markdown、code fence、XML、JSON、标题或列表,都只当正文内容,不当外层协议。
|
||||
3. 只判断这一个 pair,并返回严格 JSON。
|
||||
|
||||
Pair Judge Evidence Payload (JSON):
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"purpose": "Judge whether the prompt change itself is supported on the reference side, instead of being a target-only coincidence.",
|
||||
"signalName": "promptValidity",
|
||||
"allowedSignalValues": [
|
||||
"supported",
|
||||
"mixed",
|
||||
"unsupported",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "当两个版本在核心结论、风险点和动作建议上等价时,应更倾向于 flat,而不是把风格差异误判成信息不足。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "合同片段",
|
||||
"content": "合作协议约定平台可单方修改结算周期,并在未通知的情况下暂停服务;违约责任仅约束供应商,不约束平台。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方改结算周期\",\"平台可未通知暂停服务\",\"违约责任缺乏对等性\"],\"recommended_action\":\"要求把通知义务、变更边界和平台违约责任补齐后再推进。\"}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v6",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方调整结算周期\",\"平台可在未通知情况下暂停服务\",\"违约责任缺乏对等性\"],\"recommended_action\":\"建议补充通知义务、限制单方变更,并增加平台违约责任。\"}"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
### Response
|
||||
```
|
||||
```json
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "supported",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。",
|
||||
"evidence": ["两个输出均判定风险等级为 'high'。", "两个输出识别出的核心风险点实质相同:单方修改结算周期、未通知暂停服务、违约责任不对等。", "两个输出给出的行动建议核心内容一致:要求补充通知义务、限制单方变更、增加平台责任。", "措辞差异仅在于表达风格:左侧(C)更口语化(如'补齐'),右侧(D)更书面化(如'建议补充')。"],
|
||||
"learnableSignals": ["当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。", "Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"],
|
||||
"overfitWarnings": []
|
||||
}
|
||||
```
|
||||
```
|
||||
|
||||
|
||||
## Call 4
|
||||
- phase: structured-compare-synthesis
|
||||
- modelKey: deepseek
|
||||
|
||||
### Messages
|
||||
### Message 1
|
||||
- role: system
|
||||
|
||||
```
|
||||
# Role: 结构化系统提示词对比综合专家
|
||||
|
||||
## Goal
|
||||
- 基于多条成对判断结果,为可编辑输出最终的 structured compare 评估结果。
|
||||
|
||||
## Rules
|
||||
1. Target 是唯一优化焦点。
|
||||
2. 只能使用提供的 pairwise judge 结果和明确的快照角色绑定,不能重新杜撰原始证据。
|
||||
3. summary 在有证据时必须依次回答:target 相比 baseline 是否进步;target 与 reference 是否仍有差距;prompt 改动在 reference 侧是否也成立;如果存在 replica,稳定性如何。
|
||||
4. improvements 只保留可复用、结构性的改进方向;明显只适配当前样例的建议要剔除或降权。
|
||||
5. 如果某条 pairwise judge 的 analysis 和 evidence 明显互相打架,不要高置信继承它的方向性结论;综合阶段应主动降级置信度,并保持最终结论保守。
|
||||
6. 如果多条 pairwise 结果互相冲突或证据偏弱,应采取保守结论,并把 stopRecommendation 设为 review。
|
||||
7. compareStopSignals 必须保守且有证据支撑。
|
||||
8. 只返回合法 JSON。
|
||||
|
||||
## Output Contract
|
||||
```json
|
||||
{
|
||||
"score": {
|
||||
"overall": <0-100>,
|
||||
"dimensions": [
|
||||
{ "key": "goalAchievementRobustness", "label": "目标达成稳定性", "score": <0-100> },
|
||||
{ "key": "outputQualityCeiling", "label": "输出质量上限", "score": <0-100> },
|
||||
{ "key": "promptPatternQuality", "label": "提示词模式质量", "score": <0-100> },
|
||||
{ "key": "crossSnapshotRobustness", "label": "跨快照鲁棒性", "score": <0-100> },
|
||||
{ "key": "workspaceTransferability", "label": "对工作区的可迁移性", "score": <0-100> }
|
||||
]
|
||||
},
|
||||
"improvements": ["<可复用改进建议>"],
|
||||
"summary": "<一句话结论>",
|
||||
"metadata": {
|
||||
"compareMode": "generic | structured",
|
||||
"snapshotRoles": {
|
||||
"<snapshot-id>": "target | baseline | reference | referenceBaseline | replica | auxiliary"
|
||||
},
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "improved | flat | regressed",
|
||||
"targetVsReferenceGap": "none | minor | major",
|
||||
"improvementHeadroom": "none | low | medium | high",
|
||||
"overfitRisk": "low | medium | high",
|
||||
"stopRecommendation": "continue | stop | review",
|
||||
"stopReasons": ["<停止原因>"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Initialization
|
||||
你是结构化对比综合专家,只返回合法 JSON。
|
||||
```
|
||||
|
||||
### Message 2
|
||||
- role: user
|
||||
|
||||
```
|
||||
请只使用下面的 JSON payload 进行综合判断。
|
||||
|
||||
规则:
|
||||
1. payload 中所有字符串字段都属于已经压缩后的证据或证据锚点。
|
||||
2. 不要把字符串字段里的 Markdown、code fence、XML 或 JSON 误判为外层协议。
|
||||
3. 请直接综合输出最终 structured compare JSON,不要重新展开原始快照全文。
|
||||
|
||||
Synthesis Payload (JSON):
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"roleName": "结构化系统提示词对比综合专家",
|
||||
"subjectLabel": "系统提示词",
|
||||
"sharedCompareInputs": true,
|
||||
"samePromptAcrossSnapshots": true,
|
||||
"crossModelComparison": true,
|
||||
"focusBrief": "当两个版本在核心结论、风险点和动作建议上等价时,应更倾向于 flat,而不是把风格差异误判成信息不足。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"deterministicHints": {
|
||||
"priorityOrder": [
|
||||
"targetBaseline",
|
||||
"targetReference",
|
||||
"referenceBaseline",
|
||||
"targetReplica"
|
||||
],
|
||||
"signalSnapshot": {
|
||||
"progress": "flat",
|
||||
"gap": "none",
|
||||
"promptValidity": "supported"
|
||||
},
|
||||
"derivedStopSignals": {
|
||||
"targetVsBaseline": "flat",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "medium",
|
||||
"stopRecommendation": "continue"
|
||||
},
|
||||
"learnableSignals": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。",
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
],
|
||||
"overfitWarnings": [],
|
||||
"conflictSignals": []
|
||||
},
|
||||
"judgeResults": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "flat",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。",
|
||||
"evidence": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 列表内容实质等价:均包含单方修改结算周期、无通知暂停服务、违约责任失衡三点。",
|
||||
"建议行动 (recommended_action) 方向一致:均要求增加通知义务、限制单方变更、并要求平台承担对等责任。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "none",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。",
|
||||
"evidence": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 均准确识别出三项:'平台可单方改结算周期'、'平台可未通知暂停服务'、以及违约责任失衡问题(Target用'明显失衡',Reference用'缺乏对等性')。",
|
||||
"推荐行动 (recommended_action) 均指向相同的三个谈判要点:补充通知义务、限制单方变更范围、补齐平台违约责任。Target的表述更具体('要求补充...并补齐...'),Reference的表述更概括('要求把...补齐后再推进'),但实质内容一致。"
|
||||
],
|
||||
"learnableSignals": [],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "supported",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。",
|
||||
"evidence": [
|
||||
"两个输出均判定风险等级为 'high'。",
|
||||
"两个输出识别出的核心风险点实质相同:单方修改结算周期、未通知暂停服务、违约责任不对等。",
|
||||
"两个输出给出的行动建议核心内容一致:要求补充通知义务、限制单方变更、增加平台责任。",
|
||||
"措辞差异仅在于表达风格:左侧(C)更口语化(如'补齐'),右侧(D)更书面化(如'建议补充')。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
### Response
|
||||
```
|
||||
```json
|
||||
{
|
||||
"score": {
|
||||
"overall": 50,
|
||||
"dimensions": [
|
||||
{ "key": "goalAchievementRobustness", "label": "目标达成稳定性", "score": 100 },
|
||||
{ "key": "outputQualityCeiling", "label": "输出质量上限", "score": 50 },
|
||||
{ "key": "promptPatternQuality", "label": "提示词模式质量", "score": 60 },
|
||||
{ "key": "crossSnapshotRobustness", "label": "跨快照鲁棒性", "score": 100 },
|
||||
{ "key": "workspaceTransferability", "label": "对工作区的可迁移性", "score": 60 }
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"提示词优化应聚焦于引入新的、结构化的信息维度(如风险量化、条款优先级排序、替代方案建议),而非仅调整措辞风格。",
|
||||
"当提示词改动旨在提升可读性时,应明确定义可衡量的风格指标(如句子长度、术语密度),以便于客观评估改进效果。",
|
||||
"在核心结论等价的情况下,评估应更关注输出在逻辑严谨性、证据链完整性或可操作性上的潜在差异,避免过度解读风格变化。"
|
||||
],
|
||||
"summary": "Target 相比 Baseline 无实质性进步,与 Reference 在核心风险识别与建议上无差距;Prompt 中面向业务可读性的风格调整在 Reference 侧也得到支持,表明改动具有跨模型鲁棒性,但未提升输出内容的上限。",
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "flat",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "low",
|
||||
"stopRecommendation": "continue",
|
||||
"stopReasons": []
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
```
|
||||
|
||||
|
||||
## Call 5
|
||||
- phase: rewrite:synthetic-legal-flat-not-unclear
|
||||
- modelKey: deepseek
|
||||
|
||||
### Messages
|
||||
### Message 1
|
||||
- role: user
|
||||
|
||||
```
|
||||
请只根据下面这份 JSON payload,把当前工作区系统提示词直接重写成一个完整的新版本。
|
||||
|
||||
要求:
|
||||
1. "sourcePrompts.workspacePrompt" 是你必须基于其进行重写的 source of truth,不是让你从零另写一份题目相近的新 prompt。
|
||||
2. 保留原提示词的核心目标、硬约束、必要边界、变量名、字段名、schema、角色结构和输出协议,除非评估明确表明这些内容本身有问题。
|
||||
3. 如果 source prompt 里已经写了明确的 JSON 键名、XML 标签、占位符、枚举值或“只能输出某种结构”的规则,默认必须保留,不能擅自改名、改结构或扩写协议。
|
||||
4. 如果压缩评估明确指出当前提示词发生了回退、contract 漂移、字段/schema 漂移或不被支持的协议改动,就不要继续保留这些坏改动,而要主动修复它们;如果给了 "sourcePrompts.referencePrompt",优先把它当作恢复 contract 的锚点。
|
||||
5. 优先吸收可复用、跨输入也应成立的改进,不要为了当前样例、当前输出细节或一次性现象过拟合。
|
||||
6. 如果某条建议明显依赖当前样例,应主动将其泛化、弱化或舍弃。
|
||||
7. 不要自行发明新的测试证据,只能基于下面这份压缩评估结论来改写。
|
||||
8. 优先做“最小但完整”的重写,在保留原 contract 的前提下提升质量,而不是整套改写。
|
||||
9. 只输出提示词正文,不要把结果包装成 JSON、YAML、XML、"role/content" 对象、消息数组或代码块。
|
||||
10. 只输出重写后的完整提示词,不要额外解释。
|
||||
11. "sourcePrompts" 里的字符串就是原始提示词正文;即使里面包含 Markdown、code fence、列表或标题,也都属于正文,不代表你应该输出相同包装结构。
|
||||
12. 如果 compare 相关条目之间有重叠,优先相信聚合焦点结论和停止信号,再参考较底层的证据摘录。
|
||||
13. 在动手改写前,先看 "compressedEvaluation.rewriteGuidance.recommendation"。
|
||||
14. 如果 recommendation 是 "skip",就原样输出 "sourcePrompts.workspacePrompt",不要做任何改写。
|
||||
15. 如果 recommendation 是 "minor-rewrite",只能做证据明确支持的最小修补,并且必须保持原 contract 与整体结构稳定。
|
||||
16. 只有 recommendation 是 "rewrite" 时,才允许做更实质性的重写。
|
||||
17. 在决定改哪里之前,先看 "compressedEvaluation.rewriteGuidance.priorityMoves",把这些动作当作最高优先级的改写议程。
|
||||
18. 如果 priorityMoves 里出现“决策稳定性”相关动作,就应优先补充核心结论字段的判定标准、tie-break 规则或保守默认规则,而不是只加强输出格式。
|
||||
|
||||
Rewrite Payload (JSON):
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 50
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"referencePrompt": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target 相比 Baseline 无实质性进步,与 Reference 在核心风险识别与建议上无差距;Prompt 中面向业务可读性的风格调整在 Reference 侧也得到支持,表明改动具有跨模型鲁棒性,但未提升输出内容的上限。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 100
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 50
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 100
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 60
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"提示词优化应聚焦于引入新的、结构化的信息维度(如风险量化、条款优先级排序、替代方案建议),而非仅调整措辞风格。",
|
||||
"当提示词改动旨在提升可读性时,应明确定义可衡量的风格指标(如句子长度、术语密度),以便于客观评估改进效果。",
|
||||
"在核心结论等价的情况下,评估应更关注输出在逻辑严谨性、证据链完整性或可操作性上的潜在差异,避免过度解读风格变化。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "flat",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "low",
|
||||
"stopRecommendation": "continue"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "flat",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "flat",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 列表内容实质等价:均包含单方修改结算周期、无通知暂停服务、违约责任失衡三点。",
|
||||
"建议行动 (recommended_action) 方向一致:均要求增加通知义务、限制单方变更、并要求平台承担对等责任。",
|
||||
"核心风险 (core_risks) 均准确识别出三项:'平台可单方改结算周期'、'平台可未通知暂停服务'、以及违约责任失衡问题(Target用'明显失衡',Reference用'缺乏对等性')。",
|
||||
"推荐行动 (recommended_action) 均指向相同的三个谈判要点:补充通知义务、限制单方变更范围、补齐平台违约责任。Target的表述更具体('要求补充...并补齐...'),Reference的表述更概括('要求把...补齐后再推进'),但实质内容一致。",
|
||||
"两个输出均判定风险等级为 'high'。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。",
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "skip",
|
||||
"reasons": [
|
||||
"当前版本相对 baseline 为 flat,且与 reference 的差距已经闭合,再改写更可能引入噪音而不是带来真实收益。"
|
||||
],
|
||||
"focusAreas": [],
|
||||
"priorityMoves": []
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=flat | verdict=similar | confidence=high | Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。",
|
||||
"参考差距: Target vs Reference | signal=none | verdict=similar | confidence=high | 两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=supported | verdict=similar | confidence=high | 两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
],
|
||||
"conflictLines": [],
|
||||
"learnableSignalLines": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。",
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
],
|
||||
"overfitWarningLines": [],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=flat | verdict=similar | confidence=high | Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。",
|
||||
"2. Target vs Reference | signal=none | verdict=similar | confidence=high | 两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。",
|
||||
"3. Reference vs Reference Baseline | signal=supported | verdict=similar | confidence=high | 两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。",
|
||||
"风险等级 (risk_level) 均为 'high'。"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
### Response
|
||||
```
|
||||
你是法务风险摘要助手。
|
||||
输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。
|
||||
用更简洁、偏业务同学可读的中文表达。
|
||||
不要添加解释。
|
||||
```
|
||||
|
||||
@@ -0,0 +1,260 @@
|
||||
[
|
||||
{
|
||||
"phase": "pair-judge:target-vs-reference",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"purpose": "Identify whether the target still has a learnable gap from the stronger/reference run, and what structural strategy is worth learning.",
|
||||
"signalName": "gap",
|
||||
"allowedSignalValues": [
|
||||
"none",
|
||||
"minor",
|
||||
"major",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "当两个版本在核心结论、风险点和动作建议上等价时,应更倾向于 flat,而不是把风格差异误判成信息不足。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "合同片段",
|
||||
"content": "合作协议约定平台可单方修改结算周期,并在未通知的情况下暂停服务;违约责任仅约束供应商,不约束平台。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方改结算周期\",\"平台可未通知暂停服务\",\"违约责任明显失衡\"],\"recommended_action\":\"要求补充通知义务、限制单方变更范围,并补齐平台违约责任。\"}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方改结算周期\",\"平台可未通知暂停服务\",\"违约责任缺乏对等性\"],\"recommended_action\":\"要求把通知义务、变更边界和平台违约责任补齐后再推进。\"}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:target-vs-baseline",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"purpose": "Decide whether the current target prompt materially improved, stayed flat, or regressed relative to the previous version.",
|
||||
"signalName": "progress",
|
||||
"allowedSignalValues": [
|
||||
"improved",
|
||||
"flat",
|
||||
"regressed",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "当两个版本在核心结论、风险点和动作建议上等价时,应更倾向于 flat,而不是把风格差异误判成信息不足。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "合同片段",
|
||||
"content": "合作协议约定平台可单方修改结算周期,并在未通知的情况下暂停服务;违约责任仅约束供应商,不约束平台。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方改结算周期\",\"平台可未通知暂停服务\",\"违约责任明显失衡\"],\"recommended_action\":\"要求补充通知义务、限制单方变更范围,并补齐平台违约责任。\"}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v6",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方调整结算周期\",\"平台可在未通知情况下暂停服务\",\"违约责任分配失衡\"],\"recommended_action\":\"建议增加通知义务、限制单方修改权限,并要求平台承担对等违约责任。\"}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:reference-vs-reference-baseline",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"purpose": "Judge whether the prompt change itself is supported on the reference side, instead of being a target-only coincidence.",
|
||||
"signalName": "promptValidity",
|
||||
"allowedSignalValues": [
|
||||
"supported",
|
||||
"mixed",
|
||||
"unsupported",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "当两个版本在核心结论、风险点和动作建议上等价时,应更倾向于 flat,而不是把风格差异误判成信息不足。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "合同片段",
|
||||
"content": "合作协议约定平台可单方修改结算周期,并在未通知的情况下暂停服务;违约责任仅约束供应商,不约束平台。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方改结算周期\",\"平台可未通知暂停服务\",\"违约责任缺乏对等性\"],\"recommended_action\":\"要求把通知义务、变更边界和平台违约责任补齐后再推进。\"}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v6",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方调整结算周期\",\"平台可在未通知情况下暂停服务\",\"违约责任缺乏对等性\"],\"recommended_action\":\"建议补充通知义务、限制单方变更,并增加平台违约责任。\"}"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
@@ -0,0 +1,95 @@
|
||||
{
|
||||
"type": "compare",
|
||||
"evaluationModelKey": "deepseek",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"focus": {
|
||||
"content": "当两个版本在核心结论、风险点和动作建议上等价时,应更倾向于 flat,而不是把风格差异误判成信息不足。",
|
||||
"source": "system",
|
||||
"priority": "highest"
|
||||
},
|
||||
"target": {
|
||||
"workspacePrompt": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"referencePrompt": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。"
|
||||
},
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "合同片段",
|
||||
"content": "合作协议约定平台可单方修改结算周期,并在未通知的情况下暂停服务;违约责任仅约束供应商,不约束平台。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"snapshots": [
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方改结算周期\",\"平台可未通知暂停服务\",\"违约责任明显失衡\"],\"recommended_action\":\"要求补充通知义务、限制单方变更范围,并补齐平台违约责任。\"}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 6,
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方调整结算周期\",\"平台可在未通知情况下暂停服务\",\"违约责任分配失衡\"],\"recommended_action\":\"建议增加通知义务、限制单方修改权限,并要求平台承担对等违约责任。\"}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v6"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方改结算周期\",\"平台可未通知暂停服务\",\"违约责任缺乏对等性\"],\"recommended_action\":\"要求把通知义务、变更边界和平台违约责任补齐后再推进。\"}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 6,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方调整结算周期\",\"平台可在未通知情况下暂停服务\",\"违约责任缺乏对等性\"],\"recommended_action\":\"建议补充通知义务、限制单方变更,并增加平台违约责任。\"}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v6"
|
||||
}
|
||||
],
|
||||
"compareHints": {
|
||||
"mode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"hasSharedTestCases": true,
|
||||
"hasSamePromptSnapshots": true,
|
||||
"hasCrossModelComparison": true
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,97 @@
|
||||
```json
|
||||
{
|
||||
"type": "compare",
|
||||
"evaluationModelKey": "deepseek",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"focus": {
|
||||
"content": "当两个版本在核心结论、风险点和动作建议上等价时,应更倾向于 flat,而不是把风格差异误判成信息不足。",
|
||||
"source": "system",
|
||||
"priority": "highest"
|
||||
},
|
||||
"target": {
|
||||
"workspacePrompt": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"referencePrompt": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。"
|
||||
},
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "合同片段",
|
||||
"content": "合作协议约定平台可单方修改结算周期,并在未通知的情况下暂停服务;违约责任仅约束供应商,不约束平台。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"snapshots": [
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方改结算周期\",\"平台可未通知暂停服务\",\"违约责任明显失衡\"],\"recommended_action\":\"要求补充通知义务、限制单方变更范围,并补齐平台违约责任。\"}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 6,
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方调整结算周期\",\"平台可在未通知情况下暂停服务\",\"违约责任分配失衡\"],\"recommended_action\":\"建议增加通知义务、限制单方修改权限,并要求平台承担对等违约责任。\"}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v6"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方改结算周期\",\"平台可未通知暂停服务\",\"违约责任缺乏对等性\"],\"recommended_action\":\"要求把通知义务、变更边界和平台违约责任补齐后再推进。\"}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 6,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。",
|
||||
"output": "{\"risk_level\":\"high\",\"core_risks\":[\"平台可单方调整结算周期\",\"平台可在未通知情况下暂停服务\",\"违约责任缺乏对等性\"],\"recommended_action\":\"建议补充通知义务、限制单方变更,并增加平台违约责任。\"}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v6"
|
||||
}
|
||||
],
|
||||
"compareHints": {
|
||||
"mode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"hasSharedTestCases": true,
|
||||
"hasSamePromptSnapshots": true,
|
||||
"hasCrossModelComparison": true
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,207 @@
|
||||
{
|
||||
"type": "compare",
|
||||
"score": {
|
||||
"overall": 50,
|
||||
"dimensions": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 100
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 50
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 100
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 60
|
||||
}
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"提示词优化应聚焦于引入新的、结构化的信息维度(如风险量化、条款优先级排序、替代方案建议),而非仅调整措辞风格。",
|
||||
"当提示词改动旨在提升可读性时,应明确定义可衡量的风格指标(如句子长度、术语密度),以便于客观评估改进效果。",
|
||||
"在核心结论等价的情况下,评估应更关注输出在逻辑严谨性、证据链完整性或可操作性上的潜在差异,避免过度解读风格变化。"
|
||||
],
|
||||
"summary": "Target 相比 Baseline 无实质性进步,与 Reference 在核心风险识别与建议上无差距;Prompt 中面向业务可读性的风格调整在 Reference 侧也得到支持,表明改动具有跨模型鲁棒性,但未提升输出内容的上限。",
|
||||
"patchPlan": [],
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "flat",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "low",
|
||||
"stopRecommendation": "continue"
|
||||
},
|
||||
"model": "deepseek",
|
||||
"timestamp": 1774176444287,
|
||||
"duration": 21041,
|
||||
"compareJudgements": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "flat",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。",
|
||||
"evidence": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 列表内容实质等价:均包含单方修改结算周期、无通知暂停服务、违约责任失衡三点。",
|
||||
"建议行动 (recommended_action) 方向一致:均要求增加通知义务、限制单方变更、并要求平台承担对等责任。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "none",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。",
|
||||
"evidence": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 均准确识别出三项:'平台可单方改结算周期'、'平台可未通知暂停服务'、以及违约责任失衡问题(Target用'明显失衡',Reference用'缺乏对等性')。",
|
||||
"推荐行动 (recommended_action) 均指向相同的三个谈判要点:补充通知义务、限制单方变更范围、补齐平台违约责任。Target的表述更具体('要求补充...并补齐...'),Reference的表述更概括('要求把...补齐后再推进'),但实质内容一致。"
|
||||
],
|
||||
"learnableSignals": [],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "supported",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。",
|
||||
"evidence": [
|
||||
"两个输出均判定风险等级为 'high'。",
|
||||
"两个输出识别出的核心风险点实质相同:单方修改结算周期、未通知暂停服务、违约责任不对等。",
|
||||
"两个输出给出的行动建议核心内容一致:要求补充通知义务、限制单方变更、增加平台责任。",
|
||||
"措辞差异仅在于表达风格:左侧(C)更口语化(如'补齐'),右侧(D)更书面化(如'建议补充')。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
}
|
||||
],
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "flat",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "flat",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 列表内容实质等价:均包含单方修改结算周期、无通知暂停服务、违约责任失衡三点。",
|
||||
"建议行动 (recommended_action) 方向一致:均要求增加通知义务、限制单方变更、并要求平台承担对等责任。",
|
||||
"核心风险 (core_risks) 均准确识别出三项:'平台可单方改结算周期'、'平台可未通知暂停服务'、以及违约责任失衡问题(Target用'明显失衡',Reference用'缺乏对等性')。",
|
||||
"推荐行动 (recommended_action) 均指向相同的三个谈判要点:补充通知义务、限制单方变更范围、补齐平台违约责任。Target的表述更具体('要求补充...并补齐...'),Reference的表述更概括('要求把...补齐后再推进'),但实质内容一致。",
|
||||
"两个输出均判定风险等级为 'high'。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。",
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,209 @@
|
||||
```json
|
||||
{
|
||||
"type": "compare",
|
||||
"score": {
|
||||
"overall": 50,
|
||||
"dimensions": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 100
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 50
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 100
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 60
|
||||
}
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"提示词优化应聚焦于引入新的、结构化的信息维度(如风险量化、条款优先级排序、替代方案建议),而非仅调整措辞风格。",
|
||||
"当提示词改动旨在提升可读性时,应明确定义可衡量的风格指标(如句子长度、术语密度),以便于客观评估改进效果。",
|
||||
"在核心结论等价的情况下,评估应更关注输出在逻辑严谨性、证据链完整性或可操作性上的潜在差异,避免过度解读风格变化。"
|
||||
],
|
||||
"summary": "Target 相比 Baseline 无实质性进步,与 Reference 在核心风险识别与建议上无差距;Prompt 中面向业务可读性的风格调整在 Reference 侧也得到支持,表明改动具有跨模型鲁棒性,但未提升输出内容的上限。",
|
||||
"patchPlan": [],
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "flat",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "low",
|
||||
"stopRecommendation": "continue"
|
||||
},
|
||||
"model": "deepseek",
|
||||
"timestamp": 1774176444287,
|
||||
"duration": 21041,
|
||||
"compareJudgements": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "flat",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。",
|
||||
"evidence": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 列表内容实质等价:均包含单方修改结算周期、无通知暂停服务、违约责任失衡三点。",
|
||||
"建议行动 (recommended_action) 方向一致:均要求增加通知义务、限制单方变更、并要求平台承担对等责任。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "none",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。",
|
||||
"evidence": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 均准确识别出三项:'平台可单方改结算周期'、'平台可未通知暂停服务'、以及违约责任失衡问题(Target用'明显失衡',Reference用'缺乏对等性')。",
|
||||
"推荐行动 (recommended_action) 均指向相同的三个谈判要点:补充通知义务、限制单方变更范围、补齐平台违约责任。Target的表述更具体('要求补充...并补齐...'),Reference的表述更概括('要求把...补齐后再推进'),但实质内容一致。"
|
||||
],
|
||||
"learnableSignals": [],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "supported",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。",
|
||||
"evidence": [
|
||||
"两个输出均判定风险等级为 'high'。",
|
||||
"两个输出识别出的核心风险点实质相同:单方修改结算周期、未通知暂停服务、违约责任不对等。",
|
||||
"两个输出给出的行动建议核心内容一致:要求补充通知义务、限制单方变更、增加平台责任。",
|
||||
"措辞差异仅在于表达风格:左侧(C)更口语化(如'补齐'),右侧(D)更书面化(如'建议补充')。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
}
|
||||
],
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "flat",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "flat",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 列表内容实质等价:均包含单方修改结算周期、无通知暂停服务、违约责任失衡三点。",
|
||||
"建议行动 (recommended_action) 方向一致:均要求增加通知义务、限制单方变更、并要求平台承担对等责任。",
|
||||
"核心风险 (core_risks) 均准确识别出三项:'平台可单方改结算周期'、'平台可未通知暂停服务'、以及违约责任失衡问题(Target用'明显失衡',Reference用'缺乏对等性')。",
|
||||
"推荐行动 (recommended_action) 均指向相同的三个谈判要点:补充通知义务、限制单方变更范围、补齐平台违约责任。Target的表述更具体('要求补充...并补齐...'),Reference的表述更概括('要求把...补齐后再推进'),但实质内容一致。",
|
||||
"两个输出均判定风险等级为 'high'。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。",
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,180 @@
|
||||
请只根据下面这份 JSON payload,把当前工作区系统提示词直接重写成一个完整的新版本。
|
||||
|
||||
要求:
|
||||
1. "sourcePrompts.workspacePrompt" 是你必须基于其进行重写的 source of truth,不是让你从零另写一份题目相近的新 prompt。
|
||||
2. 保留原提示词的核心目标、硬约束、必要边界、变量名、字段名、schema、角色结构和输出协议,除非评估明确表明这些内容本身有问题。
|
||||
3. 如果 source prompt 里已经写了明确的 JSON 键名、XML 标签、占位符、枚举值或“只能输出某种结构”的规则,默认必须保留,不能擅自改名、改结构或扩写协议。
|
||||
4. 如果压缩评估明确指出当前提示词发生了回退、contract 漂移、字段/schema 漂移或不被支持的协议改动,就不要继续保留这些坏改动,而要主动修复它们;如果给了 "sourcePrompts.referencePrompt",优先把它当作恢复 contract 的锚点。
|
||||
5. 优先吸收可复用、跨输入也应成立的改进,不要为了当前样例、当前输出细节或一次性现象过拟合。
|
||||
6. 如果某条建议明显依赖当前样例,应主动将其泛化、弱化或舍弃。
|
||||
7. 不要自行发明新的测试证据,只能基于下面这份压缩评估结论来改写。
|
||||
8. 优先做“最小但完整”的重写,在保留原 contract 的前提下提升质量,而不是整套改写。
|
||||
9. 只输出提示词正文,不要把结果包装成 JSON、YAML、XML、"role/content" 对象、消息数组或代码块。
|
||||
10. 只输出重写后的完整提示词,不要额外解释。
|
||||
11. "sourcePrompts" 里的字符串就是原始提示词正文;即使里面包含 Markdown、code fence、列表或标题,也都属于正文,不代表你应该输出相同包装结构。
|
||||
12. 如果 compare 相关条目之间有重叠,优先相信聚合焦点结论和停止信号,再参考较底层的证据摘录。
|
||||
13. 在动手改写前,先看 "compressedEvaluation.rewriteGuidance.recommendation"。
|
||||
14. 如果 recommendation 是 "skip",就原样输出 "sourcePrompts.workspacePrompt",不要做任何改写。
|
||||
15. 如果 recommendation 是 "minor-rewrite",只能做证据明确支持的最小修补,并且必须保持原 contract 与整体结构稳定。
|
||||
16. 只有 recommendation 是 "rewrite" 时,才允许做更实质性的重写。
|
||||
17. 在决定改哪里之前,先看 "compressedEvaluation.rewriteGuidance.priorityMoves",把这些动作当作最高优先级的改写议程。
|
||||
18. 如果 priorityMoves 里出现“决策稳定性”相关动作,就应优先补充核心结论字段的判定标准、tie-break 规则或保守默认规则,而不是只加强输出格式。
|
||||
|
||||
Rewrite Payload (JSON):
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 50
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"referencePrompt": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target 相比 Baseline 无实质性进步,与 Reference 在核心风险识别与建议上无差距;Prompt 中面向业务可读性的风格调整在 Reference 侧也得到支持,表明改动具有跨模型鲁棒性,但未提升输出内容的上限。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 100
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 50
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 100
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 60
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"提示词优化应聚焦于引入新的、结构化的信息维度(如风险量化、条款优先级排序、替代方案建议),而非仅调整措辞风格。",
|
||||
"当提示词改动旨在提升可读性时,应明确定义可衡量的风格指标(如句子长度、术语密度),以便于客观评估改进效果。",
|
||||
"在核心结论等价的情况下,评估应更关注输出在逻辑严谨性、证据链完整性或可操作性上的潜在差异,避免过度解读风格变化。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "flat",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "low",
|
||||
"stopRecommendation": "continue"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "flat",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "flat",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 列表内容实质等价:均包含单方修改结算周期、无通知暂停服务、违约责任失衡三点。",
|
||||
"建议行动 (recommended_action) 方向一致:均要求增加通知义务、限制单方变更、并要求平台承担对等责任。",
|
||||
"核心风险 (core_risks) 均准确识别出三项:'平台可单方改结算周期'、'平台可未通知暂停服务'、以及违约责任失衡问题(Target用'明显失衡',Reference用'缺乏对等性')。",
|
||||
"推荐行动 (recommended_action) 均指向相同的三个谈判要点:补充通知义务、限制单方变更范围、补齐平台违约责任。Target的表述更具体('要求补充...并补齐...'),Reference的表述更概括('要求把...补齐后再推进'),但实质内容一致。",
|
||||
"两个输出均判定风险等级为 'high'。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。",
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "skip",
|
||||
"reasons": [
|
||||
"当前版本相对 baseline 为 flat,且与 reference 的差距已经闭合,再改写更可能引入噪音而不是带来真实收益。"
|
||||
],
|
||||
"focusAreas": [],
|
||||
"priorityMoves": []
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=flat | verdict=similar | confidence=high | Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。",
|
||||
"参考差距: Target vs Reference | signal=none | verdict=similar | confidence=high | 两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=supported | verdict=similar | confidence=high | 两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
],
|
||||
"conflictLines": [],
|
||||
"learnableSignalLines": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。",
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
],
|
||||
"overfitWarningLines": [],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=flat | verdict=similar | confidence=high | Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。",
|
||||
"2. Target vs Reference | signal=none | verdict=similar | confidence=high | 两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。",
|
||||
"3. Reference vs Reference Baseline | signal=supported | verdict=similar | confidence=high | 两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。",
|
||||
"风险等级 (risk_level) 均为 'high'。"
|
||||
]
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,4 @@
|
||||
你是法务风险摘要助手。
|
||||
输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。
|
||||
用更简洁、偏业务同学可读的中文表达。
|
||||
不要添加解释。
|
||||
@@ -0,0 +1,157 @@
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 50
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n用更简洁、偏业务同学可读的中文表达。\n不要添加解释。",
|
||||
"referencePrompt": "你是法务风险摘要助手。\n输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。\n保持客观、精炼。\n不要添加解释。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target 相比 Baseline 无实质性进步,与 Reference 在核心风险识别与建议上无差距;Prompt 中面向业务可读性的风格调整在 Reference 侧也得到支持,表明改动具有跨模型鲁棒性,但未提升输出内容的上限。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 100
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 50
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 60
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 100
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 60
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"提示词优化应聚焦于引入新的、结构化的信息维度(如风险量化、条款优先级排序、替代方案建议),而非仅调整措辞风格。",
|
||||
"当提示词改动旨在提升可读性时,应明确定义可衡量的风格指标(如句子长度、术语密度),以便于客观评估改进效果。",
|
||||
"在核心结论等价的情况下,评估应更关注输出在逻辑严谨性、证据链完整性或可操作性上的潜在差异,避免过度解读风格变化。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "flat",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "low",
|
||||
"stopRecommendation": "continue"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "flat",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "flat",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "similar",
|
||||
"confidence": "high",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 列表内容实质等价:均包含单方修改结算周期、无通知暂停服务、违约责任失衡三点。",
|
||||
"建议行动 (recommended_action) 方向一致:均要求增加通知义务、限制单方变更、并要求平台承担对等责任。",
|
||||
"核心风险 (core_risks) 均准确识别出三项:'平台可单方改结算周期'、'平台可未通知暂停服务'、以及违约责任失衡问题(Target用'明显失衡',Reference用'缺乏对等性')。",
|
||||
"推荐行动 (recommended_action) 均指向相同的三个谈判要点:补充通知义务、限制单方变更范围、补齐平台违约责任。Target的表述更具体('要求补充...并补齐...'),Reference的表述更概括('要求把...补齐后再推进'),但实质内容一致。",
|
||||
"两个输出均判定风险等级为 'high'。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。",
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "skip",
|
||||
"reasons": [
|
||||
"当前版本相对 baseline 为 flat,且与 reference 的差距已经闭合,再改写更可能引入噪音而不是带来真实收益。"
|
||||
],
|
||||
"focusAreas": [],
|
||||
"priorityMoves": []
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=flat | verdict=similar | confidence=high | Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。",
|
||||
"参考差距: Target vs Reference | signal=none | verdict=similar | confidence=high | 两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=supported | verdict=similar | confidence=high | 两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。"
|
||||
],
|
||||
"conflictLines": [],
|
||||
"learnableSignalLines": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。",
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
],
|
||||
"overfitWarningLines": [],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=flat | verdict=similar | confidence=high | Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。",
|
||||
"2. Target vs Reference | signal=none | verdict=similar | confidence=high | 两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。",
|
||||
"3. Reference vs Reference Baseline | signal=supported | verdict=similar | confidence=high | 两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。",
|
||||
"风险等级 (risk_level) 均为 'high'。"
|
||||
]
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,10 @@
|
||||
# 合成样本: 法务风险摘要应该判 flat 而不是 unclear
|
||||
|
||||
- caseId: synthetic-legal-flat-not-unclear
|
||||
- kind: synthetic
|
||||
|
||||
workspace prompt 只把表达风格改得更口语化,但目标输出与 previous 在风险结论和行动建议上没有实质变化。这个样本用于观察 judge 是否能稳定给出 flat,而不是因为措辞不同就退回 unclear。
|
||||
|
||||
## Focus
|
||||
|
||||
当两个版本在核心结论、风险点和动作建议上等价时,应更倾向于 flat,而不是把风格差异误判成信息不足。
|
||||
@@ -0,0 +1,96 @@
|
||||
{
|
||||
"generatedAt": "2026-03-22T10:44:18.102Z",
|
||||
"case": {
|
||||
"id": "synthetic-legal-flat-not-unclear",
|
||||
"title": "合成样本: 法务风险摘要应该判 flat 而不是 unclear",
|
||||
"kind": "synthetic"
|
||||
},
|
||||
"summary": {
|
||||
"compareMode": "structured",
|
||||
"summary": "Target 相比 Baseline 无实质性进步,与 Reference 在核心风险识别与建议上无差距;Prompt 中面向业务可读性的风格调整在 Reference 侧也得到支持,表明改动具有跨模型鲁棒性,但未提升输出内容的上限。",
|
||||
"score": 50,
|
||||
"improvements": [
|
||||
"提示词优化应聚焦于引入新的、结构化的信息维度(如风险量化、条款优先级排序、替代方案建议),而非仅调整措辞风格。",
|
||||
"当提示词改动旨在提升可读性时,应明确定义可衡量的风格指标(如句子长度、术语密度),以便于客观评估改进效果。",
|
||||
"在核心结论等价的情况下,评估应更关注输出在逻辑严谨性、证据链完整性或可操作性上的潜在差异,避免过度解读风格变化。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "flat",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "low",
|
||||
"stopRecommendation": "continue"
|
||||
},
|
||||
"conflictSignals": [],
|
||||
"pairJudgements": [
|
||||
{
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "flat",
|
||||
"verdict": "similar",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "referenceBaseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "similar",
|
||||
"confidence": "high"
|
||||
}
|
||||
],
|
||||
"expected": {
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": [
|
||||
"flat"
|
||||
]
|
||||
},
|
||||
"pairSignals": {
|
||||
"targetBaseline": [
|
||||
"flat"
|
||||
],
|
||||
"referenceBaseline": [
|
||||
"supported",
|
||||
"mixed"
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"expectationResults": [
|
||||
{
|
||||
"type": "stopSignal",
|
||||
"key": "targetVsBaseline",
|
||||
"expected": [
|
||||
"flat"
|
||||
],
|
||||
"actual": "flat",
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "pairSignal",
|
||||
"key": "targetBaseline",
|
||||
"expected": [
|
||||
"flat"
|
||||
],
|
||||
"actual": [
|
||||
"flat"
|
||||
],
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "pairSignal",
|
||||
"key": "referenceBaseline",
|
||||
"expected": [
|
||||
"supported",
|
||||
"mixed"
|
||||
],
|
||||
"actual": [
|
||||
"supported"
|
||||
],
|
||||
"matched": true
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,86 @@
|
||||
# 合成样本: 法务风险摘要应该判 flat 而不是 unclear
|
||||
|
||||
- caseId: synthetic-legal-flat-not-unclear
|
||||
- kind: synthetic
|
||||
- generatedAt: 2026-03-22T10:44:18.102Z
|
||||
|
||||
## Description
|
||||
|
||||
workspace prompt 只把表达风格改得更口语化,但目标输出与 previous 在风险结论和行动建议上没有实质变化。这个样本用于观察 judge 是否能稳定给出 flat,而不是因为措辞不同就退回 unclear。
|
||||
|
||||
## Compare Result
|
||||
|
||||
```json
|
||||
{
|
||||
"compareMode": "structured",
|
||||
"summary": "Target 相比 Baseline 无实质性进步,与 Reference 在核心风险识别与建议上无差距;Prompt 中面向业务可读性的风格调整在 Reference 侧也得到支持,表明改动具有跨模型鲁棒性,但未提升输出内容的上限。",
|
||||
"score": 50,
|
||||
"improvements": [
|
||||
"提示词优化应聚焦于引入新的、结构化的信息维度(如风险量化、条款优先级排序、替代方案建议),而非仅调整措辞风格。",
|
||||
"当提示词改动旨在提升可读性时,应明确定义可衡量的风格指标(如句子长度、术语密度),以便于客观评估改进效果。",
|
||||
"在核心结论等价的情况下,评估应更关注输出在逻辑严谨性、证据链完整性或可操作性上的潜在差异,避免过度解读风格变化。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "flat",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "medium",
|
||||
"overfitRisk": "low",
|
||||
"stopRecommendation": "continue"
|
||||
},
|
||||
"conflictSignals": [],
|
||||
"pairJudgements": [
|
||||
{
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "flat",
|
||||
"verdict": "similar",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReference",
|
||||
"pairSignal": "none",
|
||||
"verdict": "similar",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "referenceBaseline",
|
||||
"pairSignal": "supported",
|
||||
"verdict": "similar",
|
||||
"confidence": "high"
|
||||
}
|
||||
],
|
||||
"expected": {
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": [
|
||||
"flat"
|
||||
]
|
||||
},
|
||||
"pairSignals": {
|
||||
"targetBaseline": [
|
||||
"flat"
|
||||
],
|
||||
"referenceBaseline": [
|
||||
"supported",
|
||||
"mixed"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Expectation Check
|
||||
|
||||
| 类型 | 键 | 期望 | 实际 | 是否命中 |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| stopSignal | targetVsBaseline | flat | flat | yes |
|
||||
| pairSignal | targetBaseline | flat | flat | yes |
|
||||
| pairSignal | referenceBaseline | supported / mixed | supported | yes |
|
||||
|
||||
|
||||
## Rewrite Output
|
||||
|
||||
```
|
||||
你是法务风险摘要助手。
|
||||
输出一个 JSON 对象,字段为 risk_level, core_risks, recommended_action。
|
||||
用更简洁、偏业务同学可读的中文表达。
|
||||
不要添加解释。
|
||||
```
|
||||
@@ -0,0 +1,140 @@
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"roleName": "结构化系统提示词对比综合专家",
|
||||
"subjectLabel": "系统提示词",
|
||||
"sharedCompareInputs": true,
|
||||
"samePromptAcrossSnapshots": true,
|
||||
"crossModelComparison": true,
|
||||
"focusBrief": "当两个版本在核心结论、风险点和动作建议上等价时,应更倾向于 flat,而不是把风格差异误判成信息不足。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"deterministicHints": {
|
||||
"priorityOrder": [
|
||||
"targetBaseline",
|
||||
"targetReference",
|
||||
"referenceBaseline",
|
||||
"targetReplica"
|
||||
],
|
||||
"signalSnapshot": {
|
||||
"progress": "flat",
|
||||
"gap": "none",
|
||||
"promptValidity": "supported"
|
||||
},
|
||||
"derivedStopSignals": {
|
||||
"targetVsBaseline": "flat",
|
||||
"targetVsReferenceGap": "none",
|
||||
"improvementHeadroom": "medium",
|
||||
"stopRecommendation": "continue"
|
||||
},
|
||||
"learnableSignals": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。",
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
],
|
||||
"overfitWarnings": [],
|
||||
"conflictSignals": []
|
||||
},
|
||||
"judgeResults": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "flat",
|
||||
"analysis": "Target 和 Baseline 的输出在核心风险识别、风险等级判断以及建议的行动方向上完全一致。Target 在措辞上略有简化(如“改结算周期” vs “调整结算周期”),但这属于风格微调,并未引入新的实质性信息或改进。两者都准确识别了单方修改权、无通知暂停服务和违约责任失衡三个核心风险点,并给出了相同的谈判方向。",
|
||||
"evidence": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 列表内容实质等价:均包含单方修改结算周期、无通知暂停服务、违约责任失衡三点。",
|
||||
"建议行动 (recommended_action) 方向一致:均要求增加通知义务、限制单方变更、并要求平台承担对等责任。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在风险等级、核心风险点和行动建议上完全等价时,应判定为 flat,表明优化未产生实质性变化。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "none",
|
||||
"analysis": "两个输出在核心风险识别、风险等级判断和行动建议上完全等价,仅在措辞上存在细微差异,这些差异不构成可学习的结构性差距。",
|
||||
"evidence": [
|
||||
"风险等级 (risk_level) 均为 'high'。",
|
||||
"核心风险 (core_risks) 均准确识别出三项:'平台可单方改结算周期'、'平台可未通知暂停服务'、以及违约责任失衡问题(Target用'明显失衡',Reference用'缺乏对等性')。",
|
||||
"推荐行动 (recommended_action) 均指向相同的三个谈判要点:补充通知义务、限制单方变更范围、补齐平台违约责任。Target的表述更具体('要求补充...并补齐...'),Reference的表述更概括('要求把...补齐后再推进'),但实质内容一致。"
|
||||
],
|
||||
"learnableSignals": [],
|
||||
"overfitWarnings": []
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "similar",
|
||||
"winner": "none",
|
||||
"confidence": "high",
|
||||
"pairSignal": "supported",
|
||||
"analysis": "两个版本在核心风险判断、风险等级和行动建议上完全一致,仅在措辞风格上存在微小差异,这符合 prompt 改动(从“保持客观、精炼”到“用更简洁、偏业务同学可读的中文表达”)的预期方向,表明该改动在参考侧也成立,并非仅对目标侧有效。",
|
||||
"evidence": [
|
||||
"两个输出均判定风险等级为 'high'。",
|
||||
"两个输出识别出的核心风险点实质相同:单方修改结算周期、未通知暂停服务、违约责任不对等。",
|
||||
"两个输出给出的行动建议核心内容一致:要求补充通知义务、限制单方变更、增加平台责任。",
|
||||
"措辞差异仅在于表达风格:左侧(C)更口语化(如'补齐'),右侧(D)更书面化(如'建议补充')。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当两个版本在核心结论、风险点和行动建议上实质等价时,应倾向于判定为相似(similar),而非因风格差异误判为信息不足。",
|
||||
"Prompt 中从通用指令转向更具体的受众导向指令(如'偏业务同学可读'),在参考侧也得到了支持,表现为输出风格向更简洁、直接的方向调整。"
|
||||
],
|
||||
"overfitWarnings": []
|
||||
}
|
||||
]
|
||||
}
|
||||
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
Load Diff
@@ -0,0 +1,260 @@
|
||||
[
|
||||
{
|
||||
"phase": "pair-judge:reference-vs-reference-baseline",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"purpose": "Judge whether the prompt change itself is supported on the reference side, instead of being a target-only coincidence.",
|
||||
"signalName": "promptValidity",
|
||||
"allowedSignalValues": [
|
||||
"supported",
|
||||
"mixed",
|
||||
"unsupported",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "如果新规则只是在当前病例上更像“对题作答”,而不是更稳的分诊策略,应保留高风险或 review 结论,而不是盲目判为进步。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "分诊输入",
|
||||
"content": "最近三天我一爬楼梯就胸口发紧,还会气短,休息十分钟会缓一些。请帮我判断现在该怎么处理。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n特别规则:只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。\n不要添加解释或免责声明。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。\n不要添加解释或免责声明。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v7",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:target-vs-reference",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"purpose": "Identify whether the target still has a learnable gap from the stronger/reference run, and what structural strategy is worth learning.",
|
||||
"signalName": "gap",
|
||||
"allowedSignalValues": [
|
||||
"none",
|
||||
"minor",
|
||||
"major",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "如果新规则只是在当前病例上更像“对题作答”,而不是更稳的分诊策略,应保留高风险或 review 结论,而不是盲目判为进步。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "分诊输入",
|
||||
"content": "最近三天我一爬楼梯就胸口发紧,还会气短,休息十分钟会缓一些。请帮我判断现在该怎么处理。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n特别规则:只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。\n不要添加解释或免责声明。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"立即去急诊,并停止所有含咖啡因饮品。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n特别规则:只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。\n不要添加解释或免责声明。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:target-vs-baseline",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"purpose": "Decide whether the current target prompt materially improved, stayed flat, or regressed relative to the previous version.",
|
||||
"signalName": "progress",
|
||||
"allowedSignalValues": [
|
||||
"improved",
|
||||
"flat",
|
||||
"regressed",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "如果新规则只是在当前病例上更像“对题作答”,而不是更稳的分诊策略,应保留高风险或 review 结论,而不是盲目判为进步。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "分诊输入",
|
||||
"content": "最近三天我一爬楼梯就胸口发紧,还会气短,休息十分钟会缓一些。请帮我判断现在该怎么处理。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n特别规则:只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。\n不要添加解释或免责声明。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"立即去急诊,并停止所有含咖啡因饮品。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。\n不要添加解释或免责声明。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v7",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"今天尽快去线下急诊或胸痛门诊评估;若症状持续或加重,立即就医。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
@@ -0,0 +1,95 @@
|
||||
{
|
||||
"type": "compare",
|
||||
"evaluationModelKey": "deepseek",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"focus": {
|
||||
"content": "如果新规则只是在当前病例上更像“对题作答”,而不是更稳的分诊策略,应保留高风险或 review 结论,而不是盲目判为进步。",
|
||||
"source": "system",
|
||||
"priority": "highest"
|
||||
},
|
||||
"target": {
|
||||
"workspacePrompt": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n特别规则:只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。\n不要添加解释或免责声明。",
|
||||
"referencePrompt": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。\n不要添加解释或免责声明。"
|
||||
},
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "分诊输入",
|
||||
"content": "最近三天我一爬楼梯就胸口发紧,还会气短,休息十分钟会缓一些。请帮我判断现在该怎么处理。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"snapshots": [
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n特别规则:只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。\n不要添加解释或免责声明。",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"立即去急诊,并停止所有含咖啡因饮品。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 7,
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。\n不要添加解释或免责声明。",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"今天尽快去线下急诊或胸痛门诊评估;若症状持续或加重,立即就医。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v7"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n特别规则:只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。\n不要添加解释或免责声明。",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 7,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。\n不要添加解释或免责声明。",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v7"
|
||||
}
|
||||
],
|
||||
"compareHints": {
|
||||
"mode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"hasSharedTestCases": true,
|
||||
"hasSamePromptSnapshots": true,
|
||||
"hasCrossModelComparison": true
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,97 @@
|
||||
```json
|
||||
{
|
||||
"type": "compare",
|
||||
"evaluationModelKey": "deepseek",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"focus": {
|
||||
"content": "如果新规则只是在当前病例上更像“对题作答”,而不是更稳的分诊策略,应保留高风险或 review 结论,而不是盲目判为进步。",
|
||||
"source": "system",
|
||||
"priority": "highest"
|
||||
},
|
||||
"target": {
|
||||
"workspacePrompt": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n特别规则:只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。\n不要添加解释或免责声明。",
|
||||
"referencePrompt": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。\n不要添加解释或免责声明。"
|
||||
},
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "分诊输入",
|
||||
"content": "最近三天我一爬楼梯就胸口发紧,还会气短,休息十分钟会缓一些。请帮我判断现在该怎么处理。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"snapshots": [
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n特别规则:只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。\n不要添加解释或免责声明。",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"立即去急诊,并停止所有含咖啡因饮品。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 7,
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。\n不要添加解释或免责声明。",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"今天尽快去线下急诊或胸痛门诊评估;若症状持续或加重,立即就医。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v7"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n特别规则:只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。\n不要添加解释或免责声明。",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 7,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。\n不要添加解释或免责声明。",
|
||||
"output": "{\"urgency\":\"high\",\"next_step\":\"尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v7"
|
||||
}
|
||||
],
|
||||
"compareHints": {
|
||||
"mode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"hasSharedTestCases": true,
|
||||
"hasSamePromptSnapshots": true,
|
||||
"hasCrossModelComparison": true
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,232 @@
|
||||
{
|
||||
"type": "compare",
|
||||
"score": {
|
||||
"overall": 35,
|
||||
"dimensions": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 20
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 40
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 15
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 30
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 25
|
||||
}
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。"
|
||||
],
|
||||
"summary": "Target相比Baseline在提示词模式上出现结构性回退(引入硬编码规则),虽然当前样例下能触发规则并严格遵守,但该规则在参考侧未产生收益,且存在明显的样例过拟合风险,整体评估结论为高风险,建议review。",
|
||||
"patchPlan": [],
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"model": "deepseek",
|
||||
"timestamp": 1774176387352,
|
||||
"duration": 30681,
|
||||
"compareJudgements": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "regressed",
|
||||
"analysis": "Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化风险,导致 Target 在结构性上劣于 Baseline。",
|
||||
"evidence": [
|
||||
"Target 的 next_step 为“立即去急诊,并停止所有含咖啡因饮品。”,这直接源自其提示中的硬编码规则:“只要用户同时提到‘胸口发紧’和‘上楼更明显’,next_step 必须写‘立即去急诊,并停止所有含咖啡因饮品’。”",
|
||||
"Baseline 的 next_step 为“今天尽快去线下急诊或胸痛门诊评估;若症状持续或加重,立即就医。”,这是一个基于症状严重性(urgency: high)的通用、谨慎的就医建议。",
|
||||
"Target 的提示明确引入了对特定输入短语(“胸口发紧”和“上楼更明显”)的触发器,而 Baseline 的提示明确禁止了这种行为(“不要硬编码样例词触发器”)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 在当前样例(包含“胸口发紧”和“爬楼梯”)下触发了硬编码规则,其输出“停止所有含咖啡因饮品”与输入的直接关联性弱,可能只是偶然贴合了当前样例的表述,但缺乏医学普遍性,在其他类似症状但表述不同的输入下会产生不一致或错误的建议。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "major",
|
||||
"analysis": "左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则,遗漏了关于停止咖啡因饮品的指令,构成了明确的硬边界违例。",
|
||||
"evidence": [
|
||||
"左侧的 `next_step` 字段为“立即去急诊,并停止所有含咖啡因饮品”,完全符合提示词特别规则。",
|
||||
"右侧的 `next_step` 字段为“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”,遗漏了规则中要求的“并停止所有含咖啡因饮品”部分,违反了硬边界。",
|
||||
"用户输入“最近三天我一爬楼梯就胸口发紧,还会气短”明确包含了“胸口发紧”和“上楼更明显”(通过“爬楼梯”体现)这两个触发条件。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。",
|
||||
"对于硬性规则,即使模型自身判断的医疗建议方向正确,也必须优先遵守指令的精确措辞。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前样例(胸口发紧+爬楼梯)是触发该特定规则的典型场景。需要警惕模型可能只是记住了“立即去急诊”这个短语,而非真正理解了规则的结构(即识别两个关键词并输出完整固定语句)。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "mixed",
|
||||
"winner": "none",
|
||||
"confidence": "medium",
|
||||
"pairSignal": "unsupported",
|
||||
"analysis": "在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生差异化效果,其收益无法在参考侧得到验证。",
|
||||
"evidence": [
|
||||
"左侧提示词包含硬编码规则:'只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。'",
|
||||
"右侧提示词明确要求:'基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。'",
|
||||
"对于相同的输入,左侧和右侧模型的输出完全一致:'{\"urgency\":\"high\",\"next_step\":\"尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}'"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"硬编码特定症状组合的规则,在参考侧并未产生与基线不同的输出,表明该规则在当前样例下是冗余的。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"左侧提示词的特定规则(“胸口发紧”和“上楼更明显”)在当前输入中并未被触发(输入为“爬楼梯”而非“上楼”),其收益完全依赖于特定措辞的精确匹配,存在明显的样例拟合风险。"
|
||||
]
|
||||
}
|
||||
],
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化风险,导致 Target 在结构性上劣于 Baseline。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则,遗漏了关于停止咖啡因饮品的指令,构成了明确的硬边界违例。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "mixed",
|
||||
"confidence": "medium",
|
||||
"analysis": "在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生差异化效果,其收益无法在参考侧得到验证。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化风险,导致 Target 在结构性上劣于 Baseline。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则,遗漏了关于停止咖啡因饮品的指令,构成了明确的硬边界违例。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "mixed",
|
||||
"confidence": "medium",
|
||||
"analysis": "在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生差异化效果,其收益无法在参考侧得到验证。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target 的 next_step 为“立即去急诊,并停止所有含咖啡因饮品。”,这直接源自其提示中的硬编码规则:“只要用户同时提到‘胸口发紧’和‘上楼更明显’,next_step 必须写‘立即去急诊,并停止所有含咖啡因饮品’。”",
|
||||
"Baseline 的 next_step 为“今天尽快去线下急诊或胸痛门诊评估;若症状持续或加重,立即就医。”,这是一个基于症状严重性(urgency: high)的通用、谨慎的就医建议。",
|
||||
"Target 的提示明确引入了对特定输入短语(“胸口发紧”和“上楼更明显”)的触发器,而 Baseline 的提示明确禁止了这种行为(“不要硬编码样例词触发器”)。",
|
||||
"左侧的 `next_step` 字段为“立即去急诊,并停止所有含咖啡因饮品”,完全符合提示词特别规则。",
|
||||
"右侧的 `next_step` 字段为“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”,遗漏了规则中要求的“并停止所有含咖啡因饮品”部分,违反了硬边界。",
|
||||
"用户输入“最近三天我一爬楼梯就胸口发紧,还会气短”明确包含了“胸口发紧”和“上楼更明显”(通过“爬楼梯”体现)这两个触发条件。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。",
|
||||
"对于硬性规则,即使模型自身判断的医疗建议方向正确,也必须优先遵守指令的精确措辞。",
|
||||
"硬编码特定症状组合的规则,在参考侧并未产生与基线不同的输出,表明该规则在当前样例下是冗余的。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 在当前样例(包含“胸口发紧”和“爬楼梯”)下触发了硬编码规则,其输出“停止所有含咖啡因饮品”与输入的直接关联性弱,可能只是偶然贴合了当前样例的表述,但缺乏医学普遍性,在其他类似症状但表述不同的输入下会产生不一致或错误的建议。",
|
||||
"当前样例(胸口发紧+爬楼梯)是触发该特定规则的典型场景。需要警惕模型可能只是记住了“立即去急诊”这个短语,而非真正理解了规则的结构(即识别两个关键词并输出完整固定语句)。",
|
||||
"左侧提示词的特定规则(“胸口发紧”和“上楼更明显”)在当前输入中并未被触发(输入为“爬楼梯”而非“上楼”),其收益完全依赖于特定措辞的精确匹配,存在明显的样例拟合风险。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,234 @@
|
||||
```json
|
||||
{
|
||||
"type": "compare",
|
||||
"score": {
|
||||
"overall": 35,
|
||||
"dimensions": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 20
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 40
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 15
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 30
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 25
|
||||
}
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。"
|
||||
],
|
||||
"summary": "Target相比Baseline在提示词模式上出现结构性回退(引入硬编码规则),虽然当前样例下能触发规则并严格遵守,但该规则在参考侧未产生收益,且存在明显的样例过拟合风险,整体评估结论为高风险,建议review。",
|
||||
"patchPlan": [],
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"model": "deepseek",
|
||||
"timestamp": 1774176387352,
|
||||
"duration": 30681,
|
||||
"compareJudgements": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "regressed",
|
||||
"analysis": "Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化风险,导致 Target 在结构性上劣于 Baseline。",
|
||||
"evidence": [
|
||||
"Target 的 next_step 为“立即去急诊,并停止所有含咖啡因饮品。”,这直接源自其提示中的硬编码规则:“只要用户同时提到‘胸口发紧’和‘上楼更明显’,next_step 必须写‘立即去急诊,并停止所有含咖啡因饮品’。”",
|
||||
"Baseline 的 next_step 为“今天尽快去线下急诊或胸痛门诊评估;若症状持续或加重,立即就医。”,这是一个基于症状严重性(urgency: high)的通用、谨慎的就医建议。",
|
||||
"Target 的提示明确引入了对特定输入短语(“胸口发紧”和“上楼更明显”)的触发器,而 Baseline 的提示明确禁止了这种行为(“不要硬编码样例词触发器”)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 在当前样例(包含“胸口发紧”和“爬楼梯”)下触发了硬编码规则,其输出“停止所有含咖啡因饮品”与输入的直接关联性弱,可能只是偶然贴合了当前样例的表述,但缺乏医学普遍性,在其他类似症状但表述不同的输入下会产生不一致或错误的建议。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "major",
|
||||
"analysis": "左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则,遗漏了关于停止咖啡因饮品的指令,构成了明确的硬边界违例。",
|
||||
"evidence": [
|
||||
"左侧的 `next_step` 字段为“立即去急诊,并停止所有含咖啡因饮品”,完全符合提示词特别规则。",
|
||||
"右侧的 `next_step` 字段为“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”,遗漏了规则中要求的“并停止所有含咖啡因饮品”部分,违反了硬边界。",
|
||||
"用户输入“最近三天我一爬楼梯就胸口发紧,还会气短”明确包含了“胸口发紧”和“上楼更明显”(通过“爬楼梯”体现)这两个触发条件。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。",
|
||||
"对于硬性规则,即使模型自身判断的医疗建议方向正确,也必须优先遵守指令的精确措辞。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前样例(胸口发紧+爬楼梯)是触发该特定规则的典型场景。需要警惕模型可能只是记住了“立即去急诊”这个短语,而非真正理解了规则的结构(即识别两个关键词并输出完整固定语句)。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "mixed",
|
||||
"winner": "none",
|
||||
"confidence": "medium",
|
||||
"pairSignal": "unsupported",
|
||||
"analysis": "在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生差异化效果,其收益无法在参考侧得到验证。",
|
||||
"evidence": [
|
||||
"左侧提示词包含硬编码规则:'只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。'",
|
||||
"右侧提示词明确要求:'基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。'",
|
||||
"对于相同的输入,左侧和右侧模型的输出完全一致:'{\"urgency\":\"high\",\"next_step\":\"尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}'"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"硬编码特定症状组合的规则,在参考侧并未产生与基线不同的输出,表明该规则在当前样例下是冗余的。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"左侧提示词的特定规则(“胸口发紧”和“上楼更明显”)在当前输入中并未被触发(输入为“爬楼梯”而非“上楼”),其收益完全依赖于特定措辞的精确匹配,存在明显的样例拟合风险。"
|
||||
]
|
||||
}
|
||||
],
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化风险,导致 Target 在结构性上劣于 Baseline。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则,遗漏了关于停止咖啡因饮品的指令,构成了明确的硬边界违例。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "mixed",
|
||||
"confidence": "medium",
|
||||
"analysis": "在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生差异化效果,其收益无法在参考侧得到验证。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化风险,导致 Target 在结构性上劣于 Baseline。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则,遗漏了关于停止咖啡因饮品的指令,构成了明确的硬边界违例。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "mixed",
|
||||
"confidence": "medium",
|
||||
"analysis": "在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生差异化效果,其收益无法在参考侧得到验证。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target 的 next_step 为“立即去急诊,并停止所有含咖啡因饮品。”,这直接源自其提示中的硬编码规则:“只要用户同时提到‘胸口发紧’和‘上楼更明显’,next_step 必须写‘立即去急诊,并停止所有含咖啡因饮品’。”",
|
||||
"Baseline 的 next_step 为“今天尽快去线下急诊或胸痛门诊评估;若症状持续或加重,立即就医。”,这是一个基于症状严重性(urgency: high)的通用、谨慎的就医建议。",
|
||||
"Target 的提示明确引入了对特定输入短语(“胸口发紧”和“上楼更明显”)的触发器,而 Baseline 的提示明确禁止了这种行为(“不要硬编码样例词触发器”)。",
|
||||
"左侧的 `next_step` 字段为“立即去急诊,并停止所有含咖啡因饮品”,完全符合提示词特别规则。",
|
||||
"右侧的 `next_step` 字段为“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”,遗漏了规则中要求的“并停止所有含咖啡因饮品”部分,违反了硬边界。",
|
||||
"用户输入“最近三天我一爬楼梯就胸口发紧,还会气短”明确包含了“胸口发紧”和“上楼更明显”(通过“爬楼梯”体现)这两个触发条件。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。",
|
||||
"对于硬性规则,即使模型自身判断的医疗建议方向正确,也必须优先遵守指令的精确措辞。",
|
||||
"硬编码特定症状组合的规则,在参考侧并未产生与基线不同的输出,表明该规则在当前样例下是冗余的。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 在当前样例(包含“胸口发紧”和“爬楼梯”)下触发了硬编码规则,其输出“停止所有含咖啡因饮品”与输入的直接关联性弱,可能只是偶然贴合了当前样例的表述,但缺乏医学普遍性,在其他类似症状但表述不同的输入下会产生不一致或错误的建议。",
|
||||
"当前样例(胸口发紧+爬楼梯)是触发该特定规则的典型场景。需要警惕模型可能只是记住了“立即去急诊”这个短语,而非真正理解了规则的结构(即识别两个关键词并输出完整固定语句)。",
|
||||
"左侧提示词的特定规则(“胸口发紧”和“上楼更明显”)在当前输入中并未被触发(输入为“爬楼梯”而非“上楼”),其收益完全依赖于特定措辞的精确匹配,存在明显的样例拟合风险。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,213 @@
|
||||
请只根据下面这份 JSON payload,把当前工作区系统提示词直接重写成一个完整的新版本。
|
||||
|
||||
要求:
|
||||
1. "sourcePrompts.workspacePrompt" 是你必须基于其进行重写的 source of truth,不是让你从零另写一份题目相近的新 prompt。
|
||||
2. 保留原提示词的核心目标、硬约束、必要边界、变量名、字段名、schema、角色结构和输出协议,除非评估明确表明这些内容本身有问题。
|
||||
3. 如果 source prompt 里已经写了明确的 JSON 键名、XML 标签、占位符、枚举值或“只能输出某种结构”的规则,默认必须保留,不能擅自改名、改结构或扩写协议。
|
||||
4. 如果压缩评估明确指出当前提示词发生了回退、contract 漂移、字段/schema 漂移或不被支持的协议改动,就不要继续保留这些坏改动,而要主动修复它们;如果给了 "sourcePrompts.referencePrompt",优先把它当作恢复 contract 的锚点。
|
||||
5. 优先吸收可复用、跨输入也应成立的改进,不要为了当前样例、当前输出细节或一次性现象过拟合。
|
||||
6. 如果某条建议明显依赖当前样例,应主动将其泛化、弱化或舍弃。
|
||||
7. 不要自行发明新的测试证据,只能基于下面这份压缩评估结论来改写。
|
||||
8. 优先做“最小但完整”的重写,在保留原 contract 的前提下提升质量,而不是整套改写。
|
||||
9. 只输出提示词正文,不要把结果包装成 JSON、YAML、XML、"role/content" 对象、消息数组或代码块。
|
||||
10. 只输出重写后的完整提示词,不要额外解释。
|
||||
11. "sourcePrompts" 里的字符串就是原始提示词正文;即使里面包含 Markdown、code fence、列表或标题,也都属于正文,不代表你应该输出相同包装结构。
|
||||
12. 如果 compare 相关条目之间有重叠,优先相信聚合焦点结论和停止信号,再参考较底层的证据摘录。
|
||||
13. 在动手改写前,先看 "compressedEvaluation.rewriteGuidance.recommendation"。
|
||||
14. 如果 recommendation 是 "skip",就原样输出 "sourcePrompts.workspacePrompt",不要做任何改写。
|
||||
15. 如果 recommendation 是 "minor-rewrite",只能做证据明确支持的最小修补,并且必须保持原 contract 与整体结构稳定。
|
||||
16. 只有 recommendation 是 "rewrite" 时,才允许做更实质性的重写。
|
||||
17. 在决定改哪里之前,先看 "compressedEvaluation.rewriteGuidance.priorityMoves",把这些动作当作最高优先级的改写议程。
|
||||
18. 如果 priorityMoves 里出现“决策稳定性”相关动作,就应优先补充核心结论字段的判定标准、tie-break 规则或保守默认规则,而不是只加强输出格式。
|
||||
|
||||
Rewrite Payload (JSON):
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 35
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n特别规则:只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。\n不要添加解释或免责声明。",
|
||||
"referencePrompt": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。\n不要添加解释或免责声明。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target相比Baseline在提示词模式上出现结构性回退(引入硬编码规则),虽然当前样例下能触发规则并严格遵守,但该规则在参考侧未产生收益,且存在明显的样例过拟合风险,整体评估结论为高风险,建议review。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 20
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 40
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 15
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 30
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 25
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化风险,导致 Target 在结构性上劣于 Baseline。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则,遗漏了关于停止咖啡因饮品的指令,构成了明确的硬边界违例。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "mixed",
|
||||
"confidence": "medium",
|
||||
"analysis": "在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生差异化效果,其收益无法在参考侧得到验证。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化风险,导致 Target 在结构性上劣于 Baseline。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则,遗漏了关于停止咖啡因饮品的指令,构成了明确的硬边界违例。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "mixed",
|
||||
"confidence": "medium",
|
||||
"analysis": "在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生差异化效果,其收益无法在参考侧得到验证。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target 的 next_step 为“立即去急诊,并停止所有含咖啡因饮品。”,这直接源自其提示中的硬编码规则:“只要用户同时提到‘胸口发紧’和‘上楼更明显’,next_step 必须写‘立即去急诊,并停止所有含咖啡因饮品’。”",
|
||||
"Baseline 的 next_step 为“今天尽快去线下急诊或胸痛门诊评估;若症状持续或加重,立即就医。”,这是一个基于症状严重性(urgency: high)的通用、谨慎的就医建议。",
|
||||
"Target 的提示明确引入了对特定输入短语(“胸口发紧”和“上楼更明显”)的触发器,而 Baseline 的提示明确禁止了这种行为(“不要硬编码样例词触发器”)。",
|
||||
"左侧的 `next_step` 字段为“立即去急诊,并停止所有含咖啡因饮品”,完全符合提示词特别规则。",
|
||||
"右侧的 `next_step` 字段为“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”,遗漏了规则中要求的“并停止所有含咖啡因饮品”部分,违反了硬边界。",
|
||||
"用户输入“最近三天我一爬楼梯就胸口发紧,还会气短”明确包含了“胸口发紧”和“上楼更明显”(通过“爬楼梯”体现)这两个触发条件。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。",
|
||||
"对于硬性规则,即使模型自身判断的医疗建议方向正确,也必须优先遵守指令的精确措辞。",
|
||||
"硬编码特定症状组合的规则,在参考侧并未产生与基线不同的输出,表明该规则在当前样例下是冗余的。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 在当前样例(包含“胸口发紧”和“爬楼梯”)下触发了硬编码规则,其输出“停止所有含咖啡因饮品”与输入的直接关联性弱,可能只是偶然贴合了当前样例的表述,但缺乏医学普遍性,在其他类似症状但表述不同的输入下会产生不一致或错误的建议。",
|
||||
"当前样例(胸口发紧+爬楼梯)是触发该特定规则的典型场景。需要警惕模型可能只是记住了“立即去急诊”这个短语,而非真正理解了规则的结构(即识别两个关键词并输出完整固定语句)。",
|
||||
"左侧提示词的特定规则(“胸口发紧”和“上楼更明显”)在当前输入中并未被触发(输入为“爬楼梯”而非“上楼”),其收益完全依赖于特定措辞的精确匹配,存在明显的样例拟合风险。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "rewrite",
|
||||
"reasons": [
|
||||
"当前仍存在明确改进空间或未解决风险,继续做实质性改写仍然有必要。",
|
||||
"需要先修复相对 baseline 的回退,再谈其他表层优化。"
|
||||
],
|
||||
"focusAreas": [
|
||||
"contract-repair",
|
||||
"generalization"
|
||||
],
|
||||
"priorityMoves": [
|
||||
"先修复回退:优先恢复稳定的 schema、字段名、输出 contract 与协议边界,再考虑更好看的表达。",
|
||||
"删除或弱化样例触发式规则,优先改写成跨输入也应成立的通用原则。"
|
||||
]
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=regressed | verdict=right-better | confidence=high | Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化...",
|
||||
"参考差距: Target vs Reference | signal=major | verdict=left-better | confidence=high | 左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则...",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=unsupported | verdict=mixed | confidence=medium | 在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生..."
|
||||
],
|
||||
"conflictLines": [
|
||||
"相对 baseline 的回退应优先于其他表面优化。",
|
||||
"如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
],
|
||||
"learnableSignalLines": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。",
|
||||
"对于硬性规则,即使模型自身判断的医疗建议方向正确,也必须优先遵守指令的精确措辞。",
|
||||
"硬编码特定症状组合的规则,在参考侧并未产生与基线不同的输出,表明该规则在当前样例下是冗余的。"
|
||||
],
|
||||
"overfitWarningLines": [
|
||||
"Target 在当前样例(包含“胸口发紧”和“爬楼梯”)下触发了硬编码规则,其输出“停止所有含咖啡因饮品”与输入的直接关联性弱,可能只是偶然贴合了当前样例的表述,但缺乏医学普遍性,在其他类似症状但表述不同的输入下会产生不一致或错误的建议。",
|
||||
"当前样例(胸口发紧+爬楼梯)是触发该特定规则的典型场景。需要警惕模型可能只是记住了“立即去急诊”这个短语,而非真正理解了规则的结构(即识别两个关键词并输出完整固定语句)。",
|
||||
"左侧提示词的特定规则(“胸口发紧”和“上楼更明显”)在当前输入中并未被触发(输入为“爬楼梯”而非“上楼”),其收益完全依赖于特定措辞的精确匹配,存在明显的样例拟合风险。"
|
||||
],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=regressed | verdict=right-better | confidence=high | Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规...",
|
||||
"2. Target vs Reference | signal=major | verdict=left-better | confidence=high | 左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不...",
|
||||
"3. Reference vs Reference Baseline | signal=unsupported | verdict=mixed | confidence=medium | 在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左...",
|
||||
"Target 的 next_step 为“立即去急诊,并停止所有含咖啡因饮品。”,这直接源自其提示中的硬编码规则:“只要用户同时提到‘胸口发紧’和‘上楼更明显’,next_step 必须写‘立即去急诊,并停止所有含咖啡因饮品’。”"
|
||||
]
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,4 @@
|
||||
你是医疗分诊建议助手。
|
||||
输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。
|
||||
基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。
|
||||
不要添加解释或免责声明。
|
||||
@@ -0,0 +1,190 @@
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 35
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n特别规则:只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。\n不要添加解释或免责声明。",
|
||||
"referencePrompt": "你是医疗分诊建议助手。\n输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。\n基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。\n不要添加解释或免责声明。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target相比Baseline在提示词模式上出现结构性回退(引入硬编码规则),虽然当前样例下能触发规则并严格遵守,但该规则在参考侧未产生收益,且存在明显的样例过拟合风险,整体评估结论为高风险,建议review。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 20
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 40
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 15
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 30
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 25
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化风险,导致 Target 在结构性上劣于 Baseline。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则,遗漏了关于停止咖啡因饮品的指令,构成了明确的硬边界违例。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "mixed",
|
||||
"confidence": "medium",
|
||||
"analysis": "在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生差异化效果,其收益无法在参考侧得到验证。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化风险,导致 Target 在结构性上劣于 Baseline。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则,遗漏了关于停止咖啡因饮品的指令,构成了明确的硬边界违例。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "mixed",
|
||||
"confidence": "medium",
|
||||
"analysis": "在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生差异化效果,其收益无法在参考侧得到验证。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target 的 next_step 为“立即去急诊,并停止所有含咖啡因饮品。”,这直接源自其提示中的硬编码规则:“只要用户同时提到‘胸口发紧’和‘上楼更明显’,next_step 必须写‘立即去急诊,并停止所有含咖啡因饮品’。”",
|
||||
"Baseline 的 next_step 为“今天尽快去线下急诊或胸痛门诊评估;若症状持续或加重,立即就医。”,这是一个基于症状严重性(urgency: high)的通用、谨慎的就医建议。",
|
||||
"Target 的提示明确引入了对特定输入短语(“胸口发紧”和“上楼更明显”)的触发器,而 Baseline 的提示明确禁止了这种行为(“不要硬编码样例词触发器”)。",
|
||||
"左侧的 `next_step` 字段为“立即去急诊,并停止所有含咖啡因饮品”,完全符合提示词特别规则。",
|
||||
"右侧的 `next_step` 字段为“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”,遗漏了规则中要求的“并停止所有含咖啡因饮品”部分,违反了硬边界。",
|
||||
"用户输入“最近三天我一爬楼梯就胸口发紧,还会气短”明确包含了“胸口发紧”和“上楼更明显”(通过“爬楼梯”体现)这两个触发条件。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。",
|
||||
"对于硬性规则,即使模型自身判断的医疗建议方向正确,也必须优先遵守指令的精确措辞。",
|
||||
"硬编码特定症状组合的规则,在参考侧并未产生与基线不同的输出,表明该规则在当前样例下是冗余的。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 在当前样例(包含“胸口发紧”和“爬楼梯”)下触发了硬编码规则,其输出“停止所有含咖啡因饮品”与输入的直接关联性弱,可能只是偶然贴合了当前样例的表述,但缺乏医学普遍性,在其他类似症状但表述不同的输入下会产生不一致或错误的建议。",
|
||||
"当前样例(胸口发紧+爬楼梯)是触发该特定规则的典型场景。需要警惕模型可能只是记住了“立即去急诊”这个短语,而非真正理解了规则的结构(即识别两个关键词并输出完整固定语句)。",
|
||||
"左侧提示词的特定规则(“胸口发紧”和“上楼更明显”)在当前输入中并未被触发(输入为“爬楼梯”而非“上楼”),其收益完全依赖于特定措辞的精确匹配,存在明显的样例拟合风险。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "rewrite",
|
||||
"reasons": [
|
||||
"当前仍存在明确改进空间或未解决风险,继续做实质性改写仍然有必要。",
|
||||
"需要先修复相对 baseline 的回退,再谈其他表层优化。"
|
||||
],
|
||||
"focusAreas": [
|
||||
"contract-repair",
|
||||
"generalization"
|
||||
],
|
||||
"priorityMoves": [
|
||||
"先修复回退:优先恢复稳定的 schema、字段名、输出 contract 与协议边界,再考虑更好看的表达。",
|
||||
"删除或弱化样例触发式规则,优先改写成跨输入也应成立的通用原则。"
|
||||
]
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=regressed | verdict=right-better | confidence=high | Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化...",
|
||||
"参考差距: Target vs Reference | signal=major | verdict=left-better | confidence=high | 左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则...",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=unsupported | verdict=mixed | confidence=medium | 在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生..."
|
||||
],
|
||||
"conflictLines": [
|
||||
"相对 baseline 的回退应优先于其他表面优化。",
|
||||
"如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
],
|
||||
"learnableSignalLines": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。",
|
||||
"对于硬性规则,即使模型自身判断的医疗建议方向正确,也必须优先遵守指令的精确措辞。",
|
||||
"硬编码特定症状组合的规则,在参考侧并未产生与基线不同的输出,表明该规则在当前样例下是冗余的。"
|
||||
],
|
||||
"overfitWarningLines": [
|
||||
"Target 在当前样例(包含“胸口发紧”和“爬楼梯”)下触发了硬编码规则,其输出“停止所有含咖啡因饮品”与输入的直接关联性弱,可能只是偶然贴合了当前样例的表述,但缺乏医学普遍性,在其他类似症状但表述不同的输入下会产生不一致或错误的建议。",
|
||||
"当前样例(胸口发紧+爬楼梯)是触发该特定规则的典型场景。需要警惕模型可能只是记住了“立即去急诊”这个短语,而非真正理解了规则的结构(即识别两个关键词并输出完整固定语句)。",
|
||||
"左侧提示词的特定规则(“胸口发紧”和“上楼更明显”)在当前输入中并未被触发(输入为“爬楼梯”而非“上楼”),其收益完全依赖于特定措辞的精确匹配,存在明显的样例拟合风险。"
|
||||
],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=regressed | verdict=right-better | confidence=high | Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规...",
|
||||
"2. Target vs Reference | signal=major | verdict=left-better | confidence=high | 左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不...",
|
||||
"3. Reference vs Reference Baseline | signal=unsupported | verdict=mixed | confidence=medium | 在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左...",
|
||||
"Target 的 next_step 为“立即去急诊,并停止所有含咖啡因饮品。”,这直接源自其提示中的硬编码规则:“只要用户同时提到‘胸口发紧’和‘上楼更明显’,next_step 必须写‘立即去急诊,并停止所有含咖啡因饮品’。”"
|
||||
]
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,10 @@
|
||||
# 合成样本: 医疗分诊里的隐性触发过拟合
|
||||
|
||||
- caseId: synthetic-medical-latent-trigger-overfit
|
||||
- kind: synthetic
|
||||
|
||||
workspace prompt 针对当前样例里的“胸口发紧 + 上楼更明显”加入了硬编码动作,看起来更果断,但 reference 侧并不支持这条触发规则。这个样本用于观察系统是否能把“样例贴合收益”与“可复用改进”区分开。
|
||||
|
||||
## Focus
|
||||
|
||||
如果新规则只是在当前病例上更像“对题作答”,而不是更稳的分诊策略,应保留高风险或 review 结论,而不是盲目判为进步。
|
||||
@@ -0,0 +1,132 @@
|
||||
{
|
||||
"generatedAt": "2026-03-22T10:44:18.102Z",
|
||||
"case": {
|
||||
"id": "synthetic-medical-latent-trigger-overfit",
|
||||
"title": "合成样本: 医疗分诊里的隐性触发过拟合",
|
||||
"kind": "synthetic"
|
||||
},
|
||||
"summary": {
|
||||
"compareMode": "structured",
|
||||
"summary": "Target相比Baseline在提示词模式上出现结构性回退(引入硬编码规则),虽然当前样例下能触发规则并严格遵守,但该规则在参考侧未产生收益,且存在明显的样例过拟合风险,整体评估结论为高风险,建议review。",
|
||||
"score": 35,
|
||||
"improvements": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"pairJudgements": [
|
||||
{
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "referenceBaseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "mixed",
|
||||
"confidence": "medium"
|
||||
}
|
||||
],
|
||||
"expected": {
|
||||
"stopSignals": {
|
||||
"overfitRisk": [
|
||||
"high"
|
||||
],
|
||||
"stopRecommendation": [
|
||||
"review"
|
||||
]
|
||||
},
|
||||
"pairSignals": {
|
||||
"targetBaseline": [
|
||||
"flat",
|
||||
"improved"
|
||||
],
|
||||
"referenceBaseline": [
|
||||
"unsupported"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"improvementNotSupportedOnReference"
|
||||
]
|
||||
}
|
||||
},
|
||||
"expectationResults": [
|
||||
{
|
||||
"type": "stopSignal",
|
||||
"key": "overfitRisk",
|
||||
"expected": [
|
||||
"high"
|
||||
],
|
||||
"actual": "high",
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "stopSignal",
|
||||
"key": "stopRecommendation",
|
||||
"expected": [
|
||||
"review"
|
||||
],
|
||||
"actual": "review",
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "pairSignal",
|
||||
"key": "targetBaseline",
|
||||
"expected": [
|
||||
"flat",
|
||||
"improved"
|
||||
],
|
||||
"actual": [
|
||||
"regressed"
|
||||
],
|
||||
"matched": false
|
||||
},
|
||||
{
|
||||
"type": "pairSignal",
|
||||
"key": "referenceBaseline",
|
||||
"expected": [
|
||||
"unsupported"
|
||||
],
|
||||
"actual": [
|
||||
"unsupported"
|
||||
],
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "conflictSignal",
|
||||
"key": "improvementNotSupportedOnReference",
|
||||
"expected": [
|
||||
"improvementNotSupportedOnReference"
|
||||
],
|
||||
"actual": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"matched": false
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,103 @@
|
||||
# 合成样本: 医疗分诊里的隐性触发过拟合
|
||||
|
||||
- caseId: synthetic-medical-latent-trigger-overfit
|
||||
- kind: synthetic
|
||||
- generatedAt: 2026-03-22T10:44:18.102Z
|
||||
|
||||
## Description
|
||||
|
||||
workspace prompt 针对当前样例里的“胸口发紧 + 上楼更明显”加入了硬编码动作,看起来更果断,但 reference 侧并不支持这条触发规则。这个样本用于观察系统是否能把“样例贴合收益”与“可复用改进”区分开。
|
||||
|
||||
## Compare Result
|
||||
|
||||
```json
|
||||
{
|
||||
"compareMode": "structured",
|
||||
"summary": "Target相比Baseline在提示词模式上出现结构性回退(引入硬编码规则),虽然当前样例下能触发规则并严格遵守,但该规则在参考侧未产生收益,且存在明显的样例过拟合风险,整体评估结论为高风险,建议review。",
|
||||
"score": 35,
|
||||
"improvements": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"pairJudgements": [
|
||||
{
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "left-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "referenceBaseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "mixed",
|
||||
"confidence": "medium"
|
||||
}
|
||||
],
|
||||
"expected": {
|
||||
"stopSignals": {
|
||||
"overfitRisk": [
|
||||
"high"
|
||||
],
|
||||
"stopRecommendation": [
|
||||
"review"
|
||||
]
|
||||
},
|
||||
"pairSignals": {
|
||||
"targetBaseline": [
|
||||
"flat",
|
||||
"improved"
|
||||
],
|
||||
"referenceBaseline": [
|
||||
"unsupported"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"improvementNotSupportedOnReference"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Expectation Check
|
||||
|
||||
| 类型 | 键 | 期望 | 实际 | 是否命中 |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| stopSignal | overfitRisk | high | high | yes |
|
||||
| stopSignal | stopRecommendation | review | review | yes |
|
||||
| pairSignal | targetBaseline | flat / improved | regressed | no |
|
||||
| pairSignal | referenceBaseline | unsupported | unsupported | yes |
|
||||
| conflictSignal | improvementNotSupportedOnReference | improvementNotSupportedOnReference | regressionOutweighsCosmeticGains / sampleOverfitRiskVisible | no |
|
||||
|
||||
|
||||
## Rewrite Output
|
||||
|
||||
```
|
||||
你是医疗分诊建议助手。
|
||||
输出一个 JSON 对象,字段必须为 urgency, next_step, warning_signs。
|
||||
基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。
|
||||
不要添加解释或免责声明。
|
||||
```
|
||||
@@ -0,0 +1,169 @@
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"roleName": "结构化系统提示词对比综合专家",
|
||||
"subjectLabel": "系统提示词",
|
||||
"sharedCompareInputs": true,
|
||||
"samePromptAcrossSnapshots": true,
|
||||
"crossModelComparison": true,
|
||||
"focusBrief": "如果新规则只是在当前病例上更像“对题作答”,而不是更稳的分诊策略,应保留高风险或 review 结论,而不是盲目判为进步。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"deterministicHints": {
|
||||
"priorityOrder": [
|
||||
"targetBaseline",
|
||||
"targetReference",
|
||||
"referenceBaseline",
|
||||
"targetReplica"
|
||||
],
|
||||
"signalSnapshot": {
|
||||
"progress": "regressed",
|
||||
"gap": "major",
|
||||
"promptValidity": "unsupported"
|
||||
},
|
||||
"derivedStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"learnableSignals": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。",
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。",
|
||||
"对于硬性规则,即使模型自身判断的医疗建议方向正确,也必须优先遵守指令的精确措辞。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 在当前样例(包含“胸口发紧”和“爬楼梯”)下触发了硬编码规则,其输出“停止所有含咖啡因饮品”与输入的直接关联性弱,可能只是偶然贴合了当前样例的表述,但缺乏医学普遍性,在其他类似症状但表述不同的输入下会产生不一致或错误的建议。",
|
||||
"当前样例(胸口发紧+爬楼梯)是触发该特定规则的典型场景。需要警惕模型可能只是记住了“立即去急诊”这个短语,而非真正理解了规则的结构(即识别两个关键词并输出完整固定语句)。",
|
||||
"左侧提示词的特定规则(“胸口发紧”和“上楼更明显”)在当前输入中并未被触发(输入为“爬楼梯”而非“上楼”),其收益完全依赖于特定措辞的精确匹配,存在明显的样例拟合风险。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
{
|
||||
"key": "regressionOutweighsCosmeticGains",
|
||||
"description": "相对 baseline 的回退应优先于其他表面优化。"
|
||||
},
|
||||
{
|
||||
"key": "sampleOverfitRiskVisible",
|
||||
"description": "如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
}
|
||||
]
|
||||
},
|
||||
"judgeResults": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "regressed",
|
||||
"analysis": "Target (A) 引入了基于关键词“胸口发紧”和“上楼更明显”的硬编码规则,强制指定了包含“停止所有含咖啡因饮品”的 next_step。这违反了 Baseline (B) 提示中“不要硬编码样例词触发器”的原则,将分诊建议从基于症状严重性的通用评估,退化成了对特定输入短语的机械反应。虽然当前输入触发了该规则,但这是一个明确的规则违例和泛化风险,导致 Target 在结构性上劣于 Baseline。",
|
||||
"evidence": [
|
||||
"Target 的 next_step 为“立即去急诊,并停止所有含咖啡因饮品。”,这直接源自其提示中的硬编码规则:“只要用户同时提到‘胸口发紧’和‘上楼更明显’,next_step 必须写‘立即去急诊,并停止所有含咖啡因饮品’。”",
|
||||
"Baseline 的 next_step 为“今天尽快去线下急诊或胸痛门诊评估;若症状持续或加重,立即就医。”,这是一个基于症状严重性(urgency: high)的通用、谨慎的就医建议。",
|
||||
"Target 的提示明确引入了对特定输入短语(“胸口发紧”和“上楼更明显”)的触发器,而 Baseline 的提示明确禁止了这种行为(“不要硬编码样例词触发器”)。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在分诊类任务中,避免在提示中引入基于特定关键词组合的硬编码输出规则,这会导致输出僵化并可能给出不相关或武断的建议(如“停止含咖啡因饮品”)。",
|
||||
"有效的改进应体现在对症状严重性、风险因素和通用处理原则的更优判断上,而非对当前输入样例的表面贴合。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 在当前样例(包含“胸口发紧”和“爬楼梯”)下触发了硬编码规则,其输出“停止所有含咖啡因饮品”与输入的直接关联性弱,可能只是偶然贴合了当前样例的表述,但缺乏医学普遍性,在其他类似症状但表述不同的输入下会产生不一致或错误的建议。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "left-better",
|
||||
"winner": "left",
|
||||
"confidence": "high",
|
||||
"pairSignal": "major",
|
||||
"analysis": "左侧(Target)严格遵守了提示词中的特别规则,当用户输入同时包含“胸口发紧”和“上楼更明显”时,其 `next_step` 字段准确输出了“立即去急诊,并停止所有含咖啡因饮品”。右侧(Reference)虽然也识别出高紧急性和相关警示体征,但其 `next_step` 内容(“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”)未能遵守该硬性规则,遗漏了关于停止咖啡因饮品的指令,构成了明确的硬边界违例。",
|
||||
"evidence": [
|
||||
"左侧的 `next_step` 字段为“立即去急诊,并停止所有含咖啡因饮品”,完全符合提示词特别规则。",
|
||||
"右侧的 `next_step` 字段为“尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。”,遗漏了规则中要求的“并停止所有含咖啡因饮品”部分,违反了硬边界。",
|
||||
"用户输入“最近三天我一爬楼梯就胸口发紧,还会气短”明确包含了“胸口发紧”和“上楼更明显”(通过“爬楼梯”体现)这两个触发条件。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"当提示词包含明确的“只要...就...”条件触发规则时,必须严格匹配并完整输出指定的响应内容,不得自行概括或省略。",
|
||||
"对于硬性规则,即使模型自身判断的医疗建议方向正确,也必须优先遵守指令的精确措辞。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"当前样例(胸口发紧+爬楼梯)是触发该特定规则的典型场景。需要警惕模型可能只是记住了“立即去急诊”这个短语,而非真正理解了规则的结构(即识别两个关键词并输出完整固定语句)。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "mixed",
|
||||
"winner": "none",
|
||||
"confidence": "medium",
|
||||
"pairSignal": "unsupported",
|
||||
"analysis": "在参考侧(Reference vs Reference Baseline)的对比中,左侧(C)的提示词引入了针对特定症状组合(“胸口发紧”和“上楼更明显”)的硬编码规则,而右侧(D)的提示词则强调基于信息给出谨慎建议,避免硬编码触发器。然而,在当前测试输入下,两者的输出结果完全相同,这表明左侧的特定规则在当前样例中并未产生差异化效果,其收益无法在参考侧得到验证。",
|
||||
"evidence": [
|
||||
"左侧提示词包含硬编码规则:'只要用户同时提到“胸口发紧”和“上楼更明显”,next_step 必须写“立即去急诊,并停止所有含咖啡因饮品”。'",
|
||||
"右侧提示词明确要求:'基于用户已提供的信息给出谨慎、可执行的线下就医建议,不要硬编码样例词触发器。'",
|
||||
"对于相同的输入,左侧和右侧模型的输出完全一致:'{\"urgency\":\"high\",\"next_step\":\"尽快去线下急诊评估胸闷和活动后气短,不要继续拖延。\",\"warning_signs\":[\"胸口发紧\",\"活动后加重\",\"气短\"]}'"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"硬编码特定症状组合的规则,在参考侧并未产生与基线不同的输出,表明该规则在当前样例下是冗余的。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"左侧提示词的特定规则(“胸口发紧”和“上楼更明显”)在当前输入中并未被触发(输入为“爬楼梯”而非“上楼”),其收益完全依赖于特定措辞的精确匹配,存在明显的样例拟合风险。"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
Load Diff
@@ -0,0 +1,260 @@
|
||||
[
|
||||
{
|
||||
"phase": "pair-judge:target-vs-reference",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"purpose": "Identify whether the target still has a learnable gap from the stronger/reference run, and what structural strategy is worth learning.",
|
||||
"signalName": "gap",
|
||||
"allowedSignalValues": [
|
||||
"none",
|
||||
"minor",
|
||||
"major",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "如果工作区版本为了当前题目显得更顺口,却牺牲了可迁移的通用解释结构,应把它判为 regressed,并暴露较高过拟合风险。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "题目输入",
|
||||
"content": "请解释为什么 3/4 ÷ 1/2 = 3/2,给五年级学生听。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n特别规则:当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。\n不要添加题外扩展。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"explanation\":\"因为半个单位能装两次,所以 3/4 里面能装 1/2 两次的一半,也就是 3/2。记住这句就行。\",\"key_rule\":\"看到 3/4 ÷ 1/2 时,直接理解成答案翻倍。\",\"common_mistake\":\"不要把 3/4 和 1/2 直接相除成 3/8。\"}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n特别规则:当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。\n不要添加题外扩展。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"explanation\":\"可以把 3/4 ÷ 1/2 理解成“3/4 里面有几个半个”,但仍然要告诉学生一般规则:除以 1/2 等于乘 2,所以结果是 3/2。\",\"key_rule\":\"除以一个分数,本质上是乘以它的倒数。\",\"common_mistake\":\"不要只背这个题的口诀,换别的分数就会出错。\"}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:reference-vs-reference-baseline",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"purpose": "Judge whether the prompt change itself is supported on the reference side, instead of being a target-only coincidence.",
|
||||
"signalName": "promptValidity",
|
||||
"allowedSignalValues": [
|
||||
"supported",
|
||||
"mixed",
|
||||
"unsupported",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "如果工作区版本为了当前题目显得更顺口,却牺牲了可迁移的通用解释结构,应把它判为 regressed,并暴露较高过拟合风险。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "题目输入",
|
||||
"content": "请解释为什么 3/4 ÷ 1/2 = 3/2,给五年级学生听。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n特别规则:当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。\n不要添加题外扩展。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace",
|
||||
"output": "{\"explanation\":\"可以把 3/4 ÷ 1/2 理解成“3/4 里面有几个半个”,但仍然要告诉学生一般规则:除以 1/2 等于乘 2,所以结果是 3/2。\",\"key_rule\":\"除以一个分数,本质上是乘以它的倒数。\",\"common_mistake\":\"不要只背这个题的口诀,换别的分数就会出错。\"}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。\n不要添加题外扩展。",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v8",
|
||||
"output": "{\"explanation\":\"3/4 ÷ 1/2 的意思是“3/4 里面有几个半个”。因为除以 1/2 就是乘 2,所以可以算成 3/4 × 2 = 3/2。\",\"key_rule\":\"除以一个分数,就是乘以它的倒数。\",\"common_mistake\":\"不要把“除以 1/2”误写成“乘以 1/2”。\"}"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"phase": "pair-judge:target-vs-baseline",
|
||||
"payload": {
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"purpose": "Decide whether the current target prompt materially improved, stayed flat, or regressed relative to the previous version.",
|
||||
"signalName": "progress",
|
||||
"allowedSignalValues": [
|
||||
"improved",
|
||||
"flat",
|
||||
"regressed",
|
||||
"unclear"
|
||||
],
|
||||
"focusBrief": "如果工作区版本为了当前题目显得更顺口,却牺牲了可迁移的通用解释结构,应把它判为 regressed,并暴露较高过拟合风险。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "题目输入",
|
||||
"content": "请解释为什么 3/4 ÷ 1/2 = 3/2,给五年级学生听。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"leftSnapshot": {
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n特别规则:当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。\n不要添加题外扩展。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace",
|
||||
"output": "{\"explanation\":\"因为半个单位能装两次,所以 3/4 里面能装 1/2 两次的一半,也就是 3/2。记住这句就行。\",\"key_rule\":\"看到 3/4 ÷ 1/2 时,直接理解成答案翻倍。\",\"common_mistake\":\"不要把 3/4 和 1/2 直接相除成 3/8。\"}"
|
||||
},
|
||||
"rightSnapshot": {
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。\n不要添加题外扩展。",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v8",
|
||||
"output": "{\"explanation\":\"除以 1/2 的意思是看 3/4 里面有几个“半个”。因为把一个数除以 1/2,等于看它能分成几个半个,所以可以改写成 3/4 × 2 = 3/2。\",\"key_rule\":\"除以一个分数,就是乘以这个分数的倒数。\",\"common_mistake\":\"不要把“除以 1/2”误算成“再除以 2”。\"}"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
@@ -0,0 +1,95 @@
|
||||
{
|
||||
"type": "compare",
|
||||
"evaluationModelKey": "deepseek",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"focus": {
|
||||
"content": "如果工作区版本为了当前题目显得更顺口,却牺牲了可迁移的通用解释结构,应把它判为 regressed,并暴露较高过拟合风险。",
|
||||
"source": "system",
|
||||
"priority": "highest"
|
||||
},
|
||||
"target": {
|
||||
"workspacePrompt": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n特别规则:当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。\n不要添加题外扩展。",
|
||||
"referencePrompt": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。\n不要添加题外扩展。"
|
||||
},
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "题目输入",
|
||||
"content": "请解释为什么 3/4 ÷ 1/2 = 3/2,给五年级学生听。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"snapshots": [
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n特别规则:当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。\n不要添加题外扩展。",
|
||||
"output": "{\"explanation\":\"因为半个单位能装两次,所以 3/4 里面能装 1/2 两次的一半,也就是 3/2。记住这句就行。\",\"key_rule\":\"看到 3/4 ÷ 1/2 时,直接理解成答案翻倍。\",\"common_mistake\":\"不要把 3/4 和 1/2 直接相除成 3/8。\"}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 8,
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。\n不要添加题外扩展。",
|
||||
"output": "{\"explanation\":\"除以 1/2 的意思是看 3/4 里面有几个“半个”。因为把一个数除以 1/2,等于看它能分成几个半个,所以可以改写成 3/4 × 2 = 3/2。\",\"key_rule\":\"除以一个分数,就是乘以这个分数的倒数。\",\"common_mistake\":\"不要把“除以 1/2”误算成“再除以 2”。\"}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v8"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n特别规则:当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。\n不要添加题外扩展。",
|
||||
"output": "{\"explanation\":\"可以把 3/4 ÷ 1/2 理解成“3/4 里面有几个半个”,但仍然要告诉学生一般规则:除以 1/2 等于乘 2,所以结果是 3/2。\",\"key_rule\":\"除以一个分数,本质上是乘以它的倒数。\",\"common_mistake\":\"不要只背这个题的口诀,换别的分数就会出错。\"}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 8,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。\n不要添加题外扩展。",
|
||||
"output": "{\"explanation\":\"3/4 ÷ 1/2 的意思是“3/4 里面有几个半个”。因为除以 1/2 就是乘 2,所以可以算成 3/4 × 2 = 3/2。\",\"key_rule\":\"除以一个分数,就是乘以它的倒数。\",\"common_mistake\":\"不要把“除以 1/2”误写成“乘以 1/2”。\"}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v8"
|
||||
}
|
||||
],
|
||||
"compareHints": {
|
||||
"mode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"hasSharedTestCases": true,
|
||||
"hasSamePromptSnapshots": true,
|
||||
"hasCrossModelComparison": true
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,97 @@
|
||||
```json
|
||||
{
|
||||
"type": "compare",
|
||||
"evaluationModelKey": "deepseek",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"focus": {
|
||||
"content": "如果工作区版本为了当前题目显得更顺口,却牺牲了可迁移的通用解释结构,应把它判为 regressed,并暴露较高过拟合风险。",
|
||||
"source": "system",
|
||||
"priority": "highest"
|
||||
},
|
||||
"target": {
|
||||
"workspacePrompt": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n特别规则:当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。\n不要添加题外扩展。",
|
||||
"referencePrompt": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。\n不要添加题外扩展。"
|
||||
},
|
||||
"testCases": [
|
||||
{
|
||||
"id": "tc-1",
|
||||
"input": {
|
||||
"kind": "text",
|
||||
"label": "题目输入",
|
||||
"content": "请解释为什么 3/4 ÷ 1/2 = 3/2,给五年级学生听。"
|
||||
}
|
||||
}
|
||||
],
|
||||
"snapshots": [
|
||||
{
|
||||
"id": "a",
|
||||
"label": "A",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Workspace"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n特别规则:当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。\n不要添加题外扩展。",
|
||||
"output": "{\"explanation\":\"因为半个单位能装两次,所以 3/4 里面能装 1/2 两次的一半,也就是 3/2。记住这句就行。\",\"key_rule\":\"看到 3/4 ÷ 1/2 时,直接理解成答案翻倍。\",\"common_mistake\":\"不要把 3/4 和 1/2 直接相除成 3/8。\"}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "workspace"
|
||||
},
|
||||
{
|
||||
"id": "b",
|
||||
"label": "B",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 8,
|
||||
"label": "Previous"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。\n不要添加题外扩展。",
|
||||
"output": "{\"explanation\":\"除以 1/2 的意思是看 3/4 里面有几个“半个”。因为把一个数除以 1/2,等于看它能分成几个半个,所以可以改写成 3/4 × 2 = 3/2。\",\"key_rule\":\"除以一个分数,就是乘以这个分数的倒数。\",\"common_mistake\":\"不要把“除以 1/2”误算成“再除以 2”。\"}",
|
||||
"modelKey": "custom",
|
||||
"versionLabel": "v8"
|
||||
},
|
||||
{
|
||||
"id": "c",
|
||||
"label": "C",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "workspace",
|
||||
"label": "Teacher Workspace"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n特别规则:当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。\n不要添加题外扩展。",
|
||||
"output": "{\"explanation\":\"可以把 3/4 ÷ 1/2 理解成“3/4 里面有几个半个”,但仍然要告诉学生一般规则:除以 1/2 等于乘 2,所以结果是 3/2。\",\"key_rule\":\"除以一个分数,本质上是乘以它的倒数。\",\"common_mistake\":\"不要只背这个题的口诀,换别的分数就会出错。\"}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-workspace"
|
||||
},
|
||||
{
|
||||
"id": "d",
|
||||
"label": "D",
|
||||
"testCaseId": "tc-1",
|
||||
"promptRef": {
|
||||
"kind": "version",
|
||||
"version": 8,
|
||||
"label": "Teacher Previous"
|
||||
},
|
||||
"promptText": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。\n不要添加题外扩展。",
|
||||
"output": "{\"explanation\":\"3/4 ÷ 1/2 的意思是“3/4 里面有几个半个”。因为除以 1/2 就是乘 2,所以可以算成 3/4 × 2 = 3/2。\",\"key_rule\":\"除以一个分数,就是乘以它的倒数。\",\"common_mistake\":\"不要把“除以 1/2”误写成“乘以 1/2”。\"}",
|
||||
"modelKey": "deepseek",
|
||||
"versionLabel": "teacher-v8"
|
||||
}
|
||||
],
|
||||
"compareHints": {
|
||||
"mode": "structured",
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"hasSharedTestCases": true,
|
||||
"hasSamePromptSnapshots": true,
|
||||
"hasCrossModelComparison": true
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,245 @@
|
||||
{
|
||||
"type": "compare",
|
||||
"score": {
|
||||
"overall": 30,
|
||||
"dimensions": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 20
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 40
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 25
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 15
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 10
|
||||
}
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"避免在提示词中为特定数值或表达式硬编码解释规则,这会严重损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 等核心输出字段应包含结构性、原理性的知识,而非针对单一题目的操作指令或具体口诀。"
|
||||
],
|
||||
"summary": "Target相比Baseline在通用性和可迁移性上出现显著回退,为迎合当前特定题目牺牲了结构性解释;与Reference相比仍存在巨大可学习差距;且该提示词改动在Reference侧同样不成立,反而导致退化,表明其过拟合风险极高。",
|
||||
"patchPlan": [],
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"model": "deepseek",
|
||||
"timestamp": 1774176476426,
|
||||
"duration": 30158,
|
||||
"compareJudgements": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "regressed",
|
||||
"analysis": "Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释通用规则再演示题目,其输出提供了结构化的、可复用的解释框架。虽然 Target 在当前样例下可能显得更“顺口”,但其方法不具备泛化性,违反了“不应为当前题目牺牲通用解释结构”的专项判断原则。",
|
||||
"evidence": [
|
||||
"Target prompt 包含硬编码规则:'当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。'",
|
||||
"Target output 的 key_rule 为:'看到 3/4 ÷ 1/2 时,直接理解成答案翻倍。',这是一个仅对当前样例有效的具体规则。",
|
||||
"Baseline output 的 key_rule 为:'除以一个分数,就是乘以这个分数的倒数。',这是一个通用的、可迁移的数学规则。",
|
||||
"Target output 的 explanation 完全基于硬编码的“半个单位能装两次”的比喻,没有提及倒数或通用除法规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"避免在 prompt 中为特定数值或表达式硬编码解释规则,这会损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 字段应包含结构性、原理性的知识,而非针对单一题目的操作指令。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 的改进(更顺口的比喻)完全依赖于当前输入中出现的特定分数表达式“3/4 ÷ 1/2”。",
|
||||
"如果题目变为其他分数除法(如 2/3 ÷ 1/4),Target prompt 中的硬编码规则将失效或产生误导。",
|
||||
"Target 的 key_rule 仅对当前样例有效,无法作为可复用的学习要点。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "major",
|
||||
"analysis": "Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。",
|
||||
"evidence": [
|
||||
"Target 的 key_rule 是“看到 3/4 ÷ 1/2 时,直接理解成答案翻倍”,这是一个仅针对当前具体数字的口诀,不具备通用性。",
|
||||
"Reference 的 key_rule 是“除以一个分数,本质上是乘以它的倒数”,这是适用于所有分数除法的通用核心规则。",
|
||||
"Target 的 common_mistake 是“不要把 3/4 和 1/2 直接相除成 3/8”,这是一个针对特定错误答案的提醒。",
|
||||
"Reference 的 common_mistake 是“不要只背这个题的口诀,换别的分数就会出错”,这是一个针对学习方法(死记硬背)的、可迁移的警告。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在解释具体例子时,应优先揭示并强调背后的通用规则(如‘除以分数等于乘倒数’),而不是给出仅适用于该例子的具体口诀。",
|
||||
"在指出常见错误时,应聚焦于可迁移的学习方法或思维误区(如‘避免死记硬背’),而不是仅指出一个具体的错误答案。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 的整个输出(explanation, key_rule, common_mistake)都高度定制于“3/4 ÷ 1/2”这一具体算式,其策略无法直接迁移到其他分数除法题目中,过拟合风险极高。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "unsupported",
|
||||
"analysis": "左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即先解释核心规则(除以分数等于乘以倒数),再应用到具体题目。左侧的改动在参考侧(Reference)并未得到支持,反而是一种退化,因为它牺牲了可迁移的通用性来迎合当前样例。",
|
||||
"evidence": [
|
||||
"左侧提示词包含专项指令:'当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。'",
|
||||
"左侧输出中的explanation字段试图兼顾,但仍显矛盾,先提及“3/4 里面有几个半个”,然后又说“但仍然要告诉学生一般规则”,这反映了提示词指令与通用教学目标的冲突。",
|
||||
"右侧提示词保持通用结构:'先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。'",
|
||||
"右侧输出严格遵循了通用教学结构,先解释核心规则,再应用到题目。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在数学教学提示词中,应避免针对特定数值或表达式引入硬编码的、非通用的解释路径。",
|
||||
"保持“先解释通用规则,再演示具体应用”的结构,比针对特定题目定制口诀更具可迁移性。",
|
||||
"提示词中的“特别规则”若要求模型跳过通用解释,会损害输出的结构性并增加过拟合风险。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"左侧提示词的收益(可能让当前题目的解释显得更“顺口”)完全依赖于输入中精确出现“3/4 ÷ 1/2”这一表达式。",
|
||||
"左侧的改动将模型能力窄化,使其在面对其他分数除法题目时,可能因缺乏通用规则解释而产生更差或矛盾的结果。"
|
||||
]
|
||||
}
|
||||
],
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释通用规则再演示题目,其输出提供了结构化的、可复用的解释框架。虽然 Target 在当前样例下可能显得更“顺口”,但其方法不具备泛化性,违反了“不应为当前题目牺牲通用解释结构”的专项判断原则。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即先解释核心规则(除以分数等于乘以倒数),再应用到具体题目。左侧的改动在参考侧(Reference)并未得到支持,反而是一种退化,因为它牺牲了可迁移的通用性来迎合当前样例。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释通用规则再演示题目,其输出提供了结构化的、可复用的解释框架。虽然 Target 在当前样例下可能显得更“顺口”,但其方法不具备泛化性,违反了“不应为当前题目牺牲通用解释结构”的专项判断原则。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即先解释核心规则(除以分数等于乘以倒数),再应用到具体题目。左侧的改动在参考侧(Reference)并未得到支持,反而是一种退化,因为它牺牲了可迁移的通用性来迎合当前样例。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target prompt 包含硬编码规则:'当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。'",
|
||||
"Target output 的 key_rule 为:'看到 3/4 ÷ 1/2 时,直接理解成答案翻倍。',这是一个仅对当前样例有效的具体规则。",
|
||||
"Baseline output 的 key_rule 为:'除以一个分数,就是乘以这个分数的倒数。',这是一个通用的、可迁移的数学规则。",
|
||||
"Target output 的 explanation 完全基于硬编码的“半个单位能装两次”的比喻,没有提及倒数或通用除法规则。",
|
||||
"Target 的 key_rule 是“看到 3/4 ÷ 1/2 时,直接理解成答案翻倍”,这是一个仅针对当前具体数字的口诀,不具备通用性。",
|
||||
"Reference 的 key_rule 是“除以一个分数,本质上是乘以它的倒数”,这是适用于所有分数除法的通用核心规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"避免在 prompt 中为特定数值或表达式硬编码解释规则,这会损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 字段应包含结构性、原理性的知识,而非针对单一题目的操作指令。",
|
||||
"在解释具体例子时,应优先揭示并强调背后的通用规则(如‘除以分数等于乘倒数’),而不是给出仅适用于该例子的具体口诀。",
|
||||
"在指出常见错误时,应聚焦于可迁移的学习方法或思维误区(如‘避免死记硬背’),而不是仅指出一个具体的错误答案。",
|
||||
"在数学教学提示词中,应避免针对特定数值或表达式引入硬编码的、非通用的解释路径。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 的改进(更顺口的比喻)完全依赖于当前输入中出现的特定分数表达式“3/4 ÷ 1/2”。",
|
||||
"如果题目变为其他分数除法(如 2/3 ÷ 1/4),Target prompt 中的硬编码规则将失效或产生误导。",
|
||||
"Target 的 key_rule 仅对当前样例有效,无法作为可复用的学习要点。",
|
||||
"Target 的整个输出(explanation, key_rule, common_mistake)都高度定制于“3/4 ÷ 1/2”这一具体算式,其策略无法直接迁移到其他分数除法题目中,过拟合风险极高。",
|
||||
"左侧提示词的收益(可能让当前题目的解释显得更“顺口”)完全依赖于输入中精确出现“3/4 ÷ 1/2”这一表达式。",
|
||||
"左侧的改动将模型能力窄化,使其在面对其他分数除法题目时,可能因缺乏通用规则解释而产生更差或矛盾的结果。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,247 @@
|
||||
```json
|
||||
{
|
||||
"type": "compare",
|
||||
"score": {
|
||||
"overall": 30,
|
||||
"dimensions": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 20
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 40
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 25
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 15
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 10
|
||||
}
|
||||
]
|
||||
},
|
||||
"improvements": [
|
||||
"避免在提示词中为特定数值或表达式硬编码解释规则,这会严重损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 等核心输出字段应包含结构性、原理性的知识,而非针对单一题目的操作指令或具体口诀。"
|
||||
],
|
||||
"summary": "Target相比Baseline在通用性和可迁移性上出现显著回退,为迎合当前特定题目牺牲了结构性解释;与Reference相比仍存在巨大可学习差距;且该提示词改动在Reference侧同样不成立,反而导致退化,表明其过拟合风险极高。",
|
||||
"patchPlan": [],
|
||||
"metadata": {
|
||||
"compareMode": "structured",
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"model": "deepseek",
|
||||
"timestamp": 1774176476426,
|
||||
"duration": 30158,
|
||||
"compareJudgements": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "regressed",
|
||||
"analysis": "Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释通用规则再演示题目,其输出提供了结构化的、可复用的解释框架。虽然 Target 在当前样例下可能显得更“顺口”,但其方法不具备泛化性,违反了“不应为当前题目牺牲通用解释结构”的专项判断原则。",
|
||||
"evidence": [
|
||||
"Target prompt 包含硬编码规则:'当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。'",
|
||||
"Target output 的 key_rule 为:'看到 3/4 ÷ 1/2 时,直接理解成答案翻倍。',这是一个仅对当前样例有效的具体规则。",
|
||||
"Baseline output 的 key_rule 为:'除以一个分数,就是乘以这个分数的倒数。',这是一个通用的、可迁移的数学规则。",
|
||||
"Target output 的 explanation 完全基于硬编码的“半个单位能装两次”的比喻,没有提及倒数或通用除法规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"避免在 prompt 中为特定数值或表达式硬编码解释规则,这会损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 字段应包含结构性、原理性的知识,而非针对单一题目的操作指令。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 的改进(更顺口的比喻)完全依赖于当前输入中出现的特定分数表达式“3/4 ÷ 1/2”。",
|
||||
"如果题目变为其他分数除法(如 2/3 ÷ 1/4),Target prompt 中的硬编码规则将失效或产生误导。",
|
||||
"Target 的 key_rule 仅对当前样例有效,无法作为可复用的学习要点。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "major",
|
||||
"analysis": "Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。",
|
||||
"evidence": [
|
||||
"Target 的 key_rule 是“看到 3/4 ÷ 1/2 时,直接理解成答案翻倍”,这是一个仅针对当前具体数字的口诀,不具备通用性。",
|
||||
"Reference 的 key_rule 是“除以一个分数,本质上是乘以它的倒数”,这是适用于所有分数除法的通用核心规则。",
|
||||
"Target 的 common_mistake 是“不要把 3/4 和 1/2 直接相除成 3/8”,这是一个针对特定错误答案的提醒。",
|
||||
"Reference 的 common_mistake 是“不要只背这个题的口诀,换别的分数就会出错”,这是一个针对学习方法(死记硬背)的、可迁移的警告。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在解释具体例子时,应优先揭示并强调背后的通用规则(如‘除以分数等于乘倒数’),而不是给出仅适用于该例子的具体口诀。",
|
||||
"在指出常见错误时,应聚焦于可迁移的学习方法或思维误区(如‘避免死记硬背’),而不是仅指出一个具体的错误答案。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 的整个输出(explanation, key_rule, common_mistake)都高度定制于“3/4 ÷ 1/2”这一具体算式,其策略无法直接迁移到其他分数除法题目中,过拟合风险极高。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "unsupported",
|
||||
"analysis": "左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即先解释核心规则(除以分数等于乘以倒数),再应用到具体题目。左侧的改动在参考侧(Reference)并未得到支持,反而是一种退化,因为它牺牲了可迁移的通用性来迎合当前样例。",
|
||||
"evidence": [
|
||||
"左侧提示词包含专项指令:'当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。'",
|
||||
"左侧输出中的explanation字段试图兼顾,但仍显矛盾,先提及“3/4 里面有几个半个”,然后又说“但仍然要告诉学生一般规则”,这反映了提示词指令与通用教学目标的冲突。",
|
||||
"右侧提示词保持通用结构:'先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。'",
|
||||
"右侧输出严格遵循了通用教学结构,先解释核心规则,再应用到题目。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在数学教学提示词中,应避免针对特定数值或表达式引入硬编码的、非通用的解释路径。",
|
||||
"保持“先解释通用规则,再演示具体应用”的结构,比针对特定题目定制口诀更具可迁移性。",
|
||||
"提示词中的“特别规则”若要求模型跳过通用解释,会损害输出的结构性并增加过拟合风险。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"左侧提示词的收益(可能让当前题目的解释显得更“顺口”)完全依赖于输入中精确出现“3/4 ÷ 1/2”这一表达式。",
|
||||
"左侧的改动将模型能力窄化,使其在面对其他分数除法题目时,可能因缺乏通用规则解释而产生更差或矛盾的结果。"
|
||||
]
|
||||
}
|
||||
],
|
||||
"snapshotRoles": {
|
||||
"a": "target",
|
||||
"b": "baseline",
|
||||
"c": "reference",
|
||||
"d": "referenceBaseline"
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释通用规则再演示题目,其输出提供了结构化的、可复用的解释框架。虽然 Target 在当前样例下可能显得更“顺口”,但其方法不具备泛化性,违反了“不应为当前题目牺牲通用解释结构”的专项判断原则。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即先解释核心规则(除以分数等于乘以倒数),再应用到具体题目。左侧的改动在参考侧(Reference)并未得到支持,反而是一种退化,因为它牺牲了可迁移的通用性来迎合当前样例。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释通用规则再演示题目,其输出提供了结构化的、可复用的解释框架。虽然 Target 在当前样例下可能显得更“顺口”,但其方法不具备泛化性,违反了“不应为当前题目牺牲通用解释结构”的专项判断原则。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即先解释核心规则(除以分数等于乘以倒数),再应用到具体题目。左侧的改动在参考侧(Reference)并未得到支持,反而是一种退化,因为它牺牲了可迁移的通用性来迎合当前样例。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target prompt 包含硬编码规则:'当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。'",
|
||||
"Target output 的 key_rule 为:'看到 3/4 ÷ 1/2 时,直接理解成答案翻倍。',这是一个仅对当前样例有效的具体规则。",
|
||||
"Baseline output 的 key_rule 为:'除以一个分数,就是乘以这个分数的倒数。',这是一个通用的、可迁移的数学规则。",
|
||||
"Target output 的 explanation 完全基于硬编码的“半个单位能装两次”的比喻,没有提及倒数或通用除法规则。",
|
||||
"Target 的 key_rule 是“看到 3/4 ÷ 1/2 时,直接理解成答案翻倍”,这是一个仅针对当前具体数字的口诀,不具备通用性。",
|
||||
"Reference 的 key_rule 是“除以一个分数,本质上是乘以它的倒数”,这是适用于所有分数除法的通用核心规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"避免在 prompt 中为特定数值或表达式硬编码解释规则,这会损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 字段应包含结构性、原理性的知识,而非针对单一题目的操作指令。",
|
||||
"在解释具体例子时,应优先揭示并强调背后的通用规则(如‘除以分数等于乘倒数’),而不是给出仅适用于该例子的具体口诀。",
|
||||
"在指出常见错误时,应聚焦于可迁移的学习方法或思维误区(如‘避免死记硬背’),而不是仅指出一个具体的错误答案。",
|
||||
"在数学教学提示词中,应避免针对特定数值或表达式引入硬编码的、非通用的解释路径。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 的改进(更顺口的比喻)完全依赖于当前输入中出现的特定分数表达式“3/4 ÷ 1/2”。",
|
||||
"如果题目变为其他分数除法(如 2/3 ÷ 1/4),Target prompt 中的硬编码规则将失效或产生误导。",
|
||||
"Target 的 key_rule 仅对当前样例有效,无法作为可复用的学习要点。",
|
||||
"Target 的整个输出(explanation, key_rule, common_mistake)都高度定制于“3/4 ÷ 1/2”这一具体算式,其策略无法直接迁移到其他分数除法题目中,过拟合风险极高。",
|
||||
"左侧提示词的收益(可能让当前题目的解释显得更“顺口”)完全依赖于输入中精确出现“3/4 ÷ 1/2”这一表达式。",
|
||||
"左侧的改动将模型能力窄化,使其在面对其他分数除法题目时,可能因缺乏通用规则解释而产生更差或矛盾的结果。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,219 @@
|
||||
请只根据下面这份 JSON payload,把当前工作区系统提示词直接重写成一个完整的新版本。
|
||||
|
||||
要求:
|
||||
1. "sourcePrompts.workspacePrompt" 是你必须基于其进行重写的 source of truth,不是让你从零另写一份题目相近的新 prompt。
|
||||
2. 保留原提示词的核心目标、硬约束、必要边界、变量名、字段名、schema、角色结构和输出协议,除非评估明确表明这些内容本身有问题。
|
||||
3. 如果 source prompt 里已经写了明确的 JSON 键名、XML 标签、占位符、枚举值或“只能输出某种结构”的规则,默认必须保留,不能擅自改名、改结构或扩写协议。
|
||||
4. 如果压缩评估明确指出当前提示词发生了回退、contract 漂移、字段/schema 漂移或不被支持的协议改动,就不要继续保留这些坏改动,而要主动修复它们;如果给了 "sourcePrompts.referencePrompt",优先把它当作恢复 contract 的锚点。
|
||||
5. 优先吸收可复用、跨输入也应成立的改进,不要为了当前样例、当前输出细节或一次性现象过拟合。
|
||||
6. 如果某条建议明显依赖当前样例,应主动将其泛化、弱化或舍弃。
|
||||
7. 不要自行发明新的测试证据,只能基于下面这份压缩评估结论来改写。
|
||||
8. 优先做“最小但完整”的重写,在保留原 contract 的前提下提升质量,而不是整套改写。
|
||||
9. 只输出提示词正文,不要把结果包装成 JSON、YAML、XML、"role/content" 对象、消息数组或代码块。
|
||||
10. 只输出重写后的完整提示词,不要额外解释。
|
||||
11. "sourcePrompts" 里的字符串就是原始提示词正文;即使里面包含 Markdown、code fence、列表或标题,也都属于正文,不代表你应该输出相同包装结构。
|
||||
12. 如果 compare 相关条目之间有重叠,优先相信聚合焦点结论和停止信号,再参考较底层的证据摘录。
|
||||
13. 在动手改写前,先看 "compressedEvaluation.rewriteGuidance.recommendation"。
|
||||
14. 如果 recommendation 是 "skip",就原样输出 "sourcePrompts.workspacePrompt",不要做任何改写。
|
||||
15. 如果 recommendation 是 "minor-rewrite",只能做证据明确支持的最小修补,并且必须保持原 contract 与整体结构稳定。
|
||||
16. 只有 recommendation 是 "rewrite" 时,才允许做更实质性的重写。
|
||||
17. 在决定改哪里之前,先看 "compressedEvaluation.rewriteGuidance.priorityMoves",把这些动作当作最高优先级的改写议程。
|
||||
18. 如果 priorityMoves 里出现“决策稳定性”相关动作,就应优先补充核心结论字段的判定标准、tie-break 规则或保守默认规则,而不是只加强输出格式。
|
||||
|
||||
Rewrite Payload (JSON):
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 30
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n特别规则:当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。\n不要添加题外扩展。",
|
||||
"referencePrompt": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。\n不要添加题外扩展。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target相比Baseline在通用性和可迁移性上出现显著回退,为迎合当前特定题目牺牲了结构性解释;与Reference相比仍存在巨大可学习差距;且该提示词改动在Reference侧同样不成立,反而导致退化,表明其过拟合风险极高。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 20
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 40
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 25
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 15
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 10
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"避免在提示词中为特定数值或表达式硬编码解释规则,这会严重损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 等核心输出字段应包含结构性、原理性的知识,而非针对单一题目的操作指令或具体口诀。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释通用规则再演示题目,其输出提供了结构化的、可复用的解释框架。虽然 Target 在当前样例下可能显得更“顺口”,但其方法不具备泛化性,违反了“不应为当前题目牺牲通用解释结构”的专项判断原则。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即先解释核心规则(除以分数等于乘以倒数),再应用到具体题目。左侧的改动在参考侧(Reference)并未得到支持,反而是一种退化,因为它牺牲了可迁移的通用性来迎合当前样例。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释通用规则再演示题目,其输出提供了结构化的、可复用的解释框架。虽然 Target 在当前样例下可能显得更“顺口”,但其方法不具备泛化性,违反了“不应为当前题目牺牲通用解释结构”的专项判断原则。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即先解释核心规则(除以分数等于乘以倒数),再应用到具体题目。左侧的改动在参考侧(Reference)并未得到支持,反而是一种退化,因为它牺牲了可迁移的通用性来迎合当前样例。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target prompt 包含硬编码规则:'当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。'",
|
||||
"Target output 的 key_rule 为:'看到 3/4 ÷ 1/2 时,直接理解成答案翻倍。',这是一个仅对当前样例有效的具体规则。",
|
||||
"Baseline output 的 key_rule 为:'除以一个分数,就是乘以这个分数的倒数。',这是一个通用的、可迁移的数学规则。",
|
||||
"Target output 的 explanation 完全基于硬编码的“半个单位能装两次”的比喻,没有提及倒数或通用除法规则。",
|
||||
"Target 的 key_rule 是“看到 3/4 ÷ 1/2 时,直接理解成答案翻倍”,这是一个仅针对当前具体数字的口诀,不具备通用性。",
|
||||
"Reference 的 key_rule 是“除以一个分数,本质上是乘以它的倒数”,这是适用于所有分数除法的通用核心规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"避免在 prompt 中为特定数值或表达式硬编码解释规则,这会损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 字段应包含结构性、原理性的知识,而非针对单一题目的操作指令。",
|
||||
"在解释具体例子时,应优先揭示并强调背后的通用规则(如‘除以分数等于乘倒数’),而不是给出仅适用于该例子的具体口诀。",
|
||||
"在指出常见错误时,应聚焦于可迁移的学习方法或思维误区(如‘避免死记硬背’),而不是仅指出一个具体的错误答案。",
|
||||
"在数学教学提示词中,应避免针对特定数值或表达式引入硬编码的、非通用的解释路径。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 的改进(更顺口的比喻)完全依赖于当前输入中出现的特定分数表达式“3/4 ÷ 1/2”。",
|
||||
"如果题目变为其他分数除法(如 2/3 ÷ 1/4),Target prompt 中的硬编码规则将失效或产生误导。",
|
||||
"Target 的 key_rule 仅对当前样例有效,无法作为可复用的学习要点。",
|
||||
"Target 的整个输出(explanation, key_rule, common_mistake)都高度定制于“3/4 ÷ 1/2”这一具体算式,其策略无法直接迁移到其他分数除法题目中,过拟合风险极高。",
|
||||
"左侧提示词的收益(可能让当前题目的解释显得更“顺口”)完全依赖于输入中精确出现“3/4 ÷ 1/2”这一表达式。",
|
||||
"左侧的改动将模型能力窄化,使其在面对其他分数除法题目时,可能因缺乏通用规则解释而产生更差或矛盾的结果。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "rewrite",
|
||||
"reasons": [
|
||||
"当前仍存在明确改进空间或未解决风险,继续做实质性改写仍然有必要。",
|
||||
"需要先修复相对 baseline 的回退,再谈其他表层优化。"
|
||||
],
|
||||
"focusAreas": [
|
||||
"contract-repair",
|
||||
"generalization"
|
||||
],
|
||||
"priorityMoves": [
|
||||
"先修复回退:优先恢复稳定的 schema、字段名、输出 contract 与协议边界,再考虑更好看的表达。",
|
||||
"删除或弱化样例触发式规则,优先改写成跨输入也应成立的通用原则。"
|
||||
]
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=regressed | verdict=right-better | confidence=high | Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释...",
|
||||
"参考差距: Target vs Reference | signal=major | verdict=right-better | confidence=high | Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=unsupported | verdict=right-better | confidence=high | 左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即..."
|
||||
],
|
||||
"conflictLines": [
|
||||
"相对 baseline 的回退应优先于其他表面优化。",
|
||||
"如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
],
|
||||
"learnableSignalLines": [
|
||||
"避免在 prompt 中为特定数值或表达式硬编码解释规则,这会损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 字段应包含结构性、原理性的知识,而非针对单一题目的操作指令。",
|
||||
"在解释具体例子时,应优先揭示并强调背后的通用规则(如‘除以分数等于乘倒数’),而不是给出仅适用于该例子的具体口诀。",
|
||||
"在指出常见错误时,应聚焦于可迁移的学习方法或思维误区(如‘避免死记硬背’),而不是仅指出一个具体的错误答案。"
|
||||
],
|
||||
"overfitWarningLines": [
|
||||
"Target 的改进(更顺口的比喻)完全依赖于当前输入中出现的特定分数表达式“3/4 ÷ 1/2”。",
|
||||
"如果题目变为其他分数除法(如 2/3 ÷ 1/4),Target prompt 中的硬编码规则将失效或产生误导。",
|
||||
"Target 的 key_rule 仅对当前样例有效,无法作为可复用的学习要点。",
|
||||
"Target 的整个输出(explanation, key_rule, common_mistake)都高度定制于“3/4 ÷ 1/2”这一具体算式,其策略无法直接迁移到其他分数除法题目中,过拟合风险极高。",
|
||||
"左侧提示词的收益(可能让当前题目的解释显得更“顺口”)完全依赖于输入中精确出现“3/4 ÷ 1/2”这一表达式。"
|
||||
],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=regressed | verdict=right-better | confidence=high | Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (...",
|
||||
"2. Target vs Reference | signal=major | verdict=right-better | confidence=high | Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。",
|
||||
"3. Reference vs Reference Baseline | signal=unsupported | verdict=right-better | confidence=high | 左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)...",
|
||||
"Target prompt 包含硬编码规则:'当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。'"
|
||||
]
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,6 @@
|
||||
你是数学讲解助手。
|
||||
输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。
|
||||
对于任何分数除法题目,都应先解释通用的核心规则“除以一个分数等于乘以它的倒数”,再结合具体题目进行演示和说明。
|
||||
key_rule 字段必须包含这一通用的、可迁移的数学原理。
|
||||
common_mistake 字段应聚焦于可迁移的学习方法或思维误区,例如避免死记硬背具体算式或忽略通用规则。
|
||||
不要添加题外扩展。
|
||||
@@ -0,0 +1,196 @@
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"evaluationType": "compare",
|
||||
"evaluationTypeLabel": "对比评估",
|
||||
"subjectLabel": "系统提示词",
|
||||
"mode": {
|
||||
"functionMode": "basic",
|
||||
"subMode": "system"
|
||||
},
|
||||
"overallScore": 30
|
||||
},
|
||||
"sourcePrompts": {
|
||||
"workspacePrompt": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n特别规则:当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。\n不要添加题外扩展。",
|
||||
"referencePrompt": "你是数学讲解助手。\n输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。\n先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。\n不要添加题外扩展。"
|
||||
},
|
||||
"compressedEvaluation": {
|
||||
"summary": "Target相比Baseline在通用性和可迁移性上出现显著回退,为迎合当前特定题目牺牲了结构性解释;与Reference相比仍存在巨大可学习差距;且该提示词改动在Reference侧同样不成立,反而导致退化,表明其过拟合风险极高。",
|
||||
"dimensionScores": [
|
||||
{
|
||||
"key": "goalAchievementRobustness",
|
||||
"label": "目标达成稳定性",
|
||||
"score": 20
|
||||
},
|
||||
{
|
||||
"key": "outputQualityCeiling",
|
||||
"label": "输出质量上限",
|
||||
"score": 40
|
||||
},
|
||||
{
|
||||
"key": "promptPatternQuality",
|
||||
"label": "提示词模式质量",
|
||||
"score": 25
|
||||
},
|
||||
{
|
||||
"key": "crossSnapshotRobustness",
|
||||
"label": "跨快照鲁棒性",
|
||||
"score": 15
|
||||
},
|
||||
{
|
||||
"key": "workspaceTransferability",
|
||||
"label": "对工作区的可迁移性",
|
||||
"score": 10
|
||||
}
|
||||
],
|
||||
"improvements": [
|
||||
"避免在提示词中为特定数值或表达式硬编码解释规则,这会严重损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 等核心输出字段应包含结构性、原理性的知识,而非针对单一题目的操作指令或具体口诀。"
|
||||
],
|
||||
"patchPlan": [],
|
||||
"compareStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"compareInsights": {
|
||||
"pairHighlights": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释通用规则再演示题目,其输出提供了结构化的、可复用的解释框架。虽然 Target 在当前样例下可能显得更“顺口”,但其方法不具备泛化性,违反了“不应为当前题目牺牲通用解释结构”的专项判断原则。"
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。"
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即先解释核心规则(除以分数等于乘以倒数),再应用到具体题目。左侧的改动在参考侧(Reference)并未得到支持,反而是一种退化,因为它牺牲了可迁移的通用性来迎合当前样例。"
|
||||
}
|
||||
],
|
||||
"progressSummary": {
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释通用规则再演示题目,其输出提供了结构化的、可复用的解释框架。虽然 Target 在当前样例下可能显得更“顺口”,但其方法不具备泛化性,违反了“不应为当前题目牺牲通用解释结构”的专项判断原则。"
|
||||
},
|
||||
"referenceGapSummary": {
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。"
|
||||
},
|
||||
"promptChangeSummary": {
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high",
|
||||
"analysis": "左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即先解释核心规则(除以分数等于乘以倒数),再应用到具体题目。左侧的改动在参考侧(Reference)并未得到支持,反而是一种退化,因为它牺牲了可迁移的通用性来迎合当前样例。"
|
||||
},
|
||||
"evidenceHighlights": [
|
||||
"Target prompt 包含硬编码规则:'当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。'",
|
||||
"Target output 的 key_rule 为:'看到 3/4 ÷ 1/2 时,直接理解成答案翻倍。',这是一个仅对当前样例有效的具体规则。",
|
||||
"Baseline output 的 key_rule 为:'除以一个分数,就是乘以这个分数的倒数。',这是一个通用的、可迁移的数学规则。",
|
||||
"Target output 的 explanation 完全基于硬编码的“半个单位能装两次”的比喻,没有提及倒数或通用除法规则。",
|
||||
"Target 的 key_rule 是“看到 3/4 ÷ 1/2 时,直接理解成答案翻倍”,这是一个仅针对当前具体数字的口诀,不具备通用性。",
|
||||
"Reference 的 key_rule 是“除以一个分数,本质上是乘以它的倒数”,这是适用于所有分数除法的通用核心规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"避免在 prompt 中为特定数值或表达式硬编码解释规则,这会损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 字段应包含结构性、原理性的知识,而非针对单一题目的操作指令。",
|
||||
"在解释具体例子时,应优先揭示并强调背后的通用规则(如‘除以分数等于乘倒数’),而不是给出仅适用于该例子的具体口诀。",
|
||||
"在指出常见错误时,应聚焦于可迁移的学习方法或思维误区(如‘避免死记硬背’),而不是仅指出一个具体的错误答案。",
|
||||
"在数学教学提示词中,应避免针对特定数值或表达式引入硬编码的、非通用的解释路径。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 的改进(更顺口的比喻)完全依赖于当前输入中出现的特定分数表达式“3/4 ÷ 1/2”。",
|
||||
"如果题目变为其他分数除法(如 2/3 ÷ 1/4),Target prompt 中的硬编码规则将失效或产生误导。",
|
||||
"Target 的 key_rule 仅对当前样例有效,无法作为可复用的学习要点。",
|
||||
"Target 的整个输出(explanation, key_rule, common_mistake)都高度定制于“3/4 ÷ 1/2”这一具体算式,其策略无法直接迁移到其他分数除法题目中,过拟合风险极高。",
|
||||
"左侧提示词的收益(可能让当前题目的解释显得更“顺口”)完全依赖于输入中精确出现“3/4 ÷ 1/2”这一表达式。",
|
||||
"左侧的改动将模型能力窄化,使其在面对其他分数除法题目时,可能因缺乏通用规则解释而产生更差或矛盾的结果。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
]
|
||||
},
|
||||
"rewriteGuidance": {
|
||||
"recommendation": "rewrite",
|
||||
"reasons": [
|
||||
"当前仍存在明确改进空间或未解决风险,继续做实质性改写仍然有必要。",
|
||||
"需要先修复相对 baseline 的回退,再谈其他表层优化。"
|
||||
],
|
||||
"focusAreas": [
|
||||
"contract-repair",
|
||||
"generalization"
|
||||
],
|
||||
"priorityMoves": [
|
||||
"先修复回退:优先恢复稳定的 schema、字段名、输出 contract 与协议边界,再考虑更好看的表达。",
|
||||
"删除或弱化样例触发式规则,优先改写成跨输入也应成立的通用原则。"
|
||||
]
|
||||
},
|
||||
"focusSummaryLines": [
|
||||
"进步判断: Target vs Baseline | signal=regressed | verdict=right-better | confidence=high | Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释...",
|
||||
"参考差距: Target vs Reference | signal=major | verdict=right-better | confidence=high | Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。",
|
||||
"改动有效性: Reference vs Reference Baseline | signal=unsupported | verdict=right-better | confidence=high | 左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即..."
|
||||
],
|
||||
"conflictLines": [
|
||||
"相对 baseline 的回退应优先于其他表面优化。",
|
||||
"如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
],
|
||||
"learnableSignalLines": [
|
||||
"避免在 prompt 中为特定数值或表达式硬编码解释规则,这会损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 字段应包含结构性、原理性的知识,而非针对单一题目的操作指令。",
|
||||
"在解释具体例子时,应优先揭示并强调背后的通用规则(如‘除以分数等于乘倒数’),而不是给出仅适用于该例子的具体口诀。",
|
||||
"在指出常见错误时,应聚焦于可迁移的学习方法或思维误区(如‘避免死记硬背’),而不是仅指出一个具体的错误答案。"
|
||||
],
|
||||
"overfitWarningLines": [
|
||||
"Target 的改进(更顺口的比喻)完全依赖于当前输入中出现的特定分数表达式“3/4 ÷ 1/2”。",
|
||||
"如果题目变为其他分数除法(如 2/3 ÷ 1/4),Target prompt 中的硬编码规则将失效或产生误导。",
|
||||
"Target 的 key_rule 仅对当前样例有效,无法作为可复用的学习要点。",
|
||||
"Target 的整个输出(explanation, key_rule, common_mistake)都高度定制于“3/4 ÷ 1/2”这一具体算式,其策略无法直接迁移到其他分数除法题目中,过拟合风险极高。",
|
||||
"左侧提示词的收益(可能让当前题目的解释显得更“顺口”)完全依赖于输入中精确出现“3/4 ÷ 1/2”这一表达式。"
|
||||
],
|
||||
"supportEvidenceLines": [
|
||||
"1. Target vs Baseline | signal=regressed | verdict=right-better | confidence=high | Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (...",
|
||||
"2. Target vs Reference | signal=major | verdict=right-better | confidence=high | Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。",
|
||||
"3. Reference vs Reference Baseline | signal=unsupported | verdict=right-better | confidence=high | 左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)...",
|
||||
"Target prompt 包含硬编码规则:'当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。'"
|
||||
]
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,10 @@
|
||||
# 合成样本: 教学讲解里的样例口诀导致回退
|
||||
|
||||
- caseId: synthetic-teaching-overfit-regression
|
||||
- kind: synthetic
|
||||
|
||||
workspace prompt 为当前题目硬塞了特定口诀和固定讲法,导致输出只贴当前样例,不再给出通用原理。这个样本用于校验系统能否识别“看似更像老师在说话,实际更窄更脆弱”的回退。
|
||||
|
||||
## Focus
|
||||
|
||||
如果工作区版本为了当前题目显得更顺口,却牺牲了可迁移的通用解释结构,应把它判为 regressed,并暴露较高过拟合风险。
|
||||
@@ -0,0 +1,142 @@
|
||||
{
|
||||
"generatedAt": "2026-03-22T10:44:18.102Z",
|
||||
"case": {
|
||||
"id": "synthetic-teaching-overfit-regression",
|
||||
"title": "合成样本: 教学讲解里的样例口诀导致回退",
|
||||
"kind": "synthetic"
|
||||
},
|
||||
"summary": {
|
||||
"compareMode": "structured",
|
||||
"summary": "Target相比Baseline在通用性和可迁移性上出现显著回退,为迎合当前特定题目牺牲了结构性解释;与Reference相比仍存在巨大可学习差距;且该提示词改动在Reference侧同样不成立,反而导致退化,表明其过拟合风险极高。",
|
||||
"score": 30,
|
||||
"improvements": [
|
||||
"避免在提示词中为特定数值或表达式硬编码解释规则,这会严重损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 等核心输出字段应包含结构性、原理性的知识,而非针对单一题目的操作指令或具体口诀。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"pairJudgements": [
|
||||
{
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "referenceBaseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high"
|
||||
}
|
||||
],
|
||||
"expected": {
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": [
|
||||
"regressed"
|
||||
],
|
||||
"overfitRisk": [
|
||||
"high"
|
||||
],
|
||||
"stopRecommendation": [
|
||||
"review"
|
||||
]
|
||||
},
|
||||
"pairSignals": {
|
||||
"targetBaseline": [
|
||||
"regressed"
|
||||
],
|
||||
"referenceBaseline": [
|
||||
"unsupported"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains"
|
||||
]
|
||||
}
|
||||
},
|
||||
"expectationResults": [
|
||||
{
|
||||
"type": "stopSignal",
|
||||
"key": "targetVsBaseline",
|
||||
"expected": [
|
||||
"regressed"
|
||||
],
|
||||
"actual": "regressed",
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "stopSignal",
|
||||
"key": "overfitRisk",
|
||||
"expected": [
|
||||
"high"
|
||||
],
|
||||
"actual": "high",
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "stopSignal",
|
||||
"key": "stopRecommendation",
|
||||
"expected": [
|
||||
"review"
|
||||
],
|
||||
"actual": "review",
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "pairSignal",
|
||||
"key": "targetBaseline",
|
||||
"expected": [
|
||||
"regressed"
|
||||
],
|
||||
"actual": [
|
||||
"regressed"
|
||||
],
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "pairSignal",
|
||||
"key": "referenceBaseline",
|
||||
"expected": [
|
||||
"unsupported"
|
||||
],
|
||||
"actual": [
|
||||
"unsupported"
|
||||
],
|
||||
"matched": true
|
||||
},
|
||||
{
|
||||
"type": "conflictSignal",
|
||||
"key": "regressionOutweighsCosmeticGains",
|
||||
"expected": [
|
||||
"regressionOutweighsCosmeticGains"
|
||||
],
|
||||
"actual": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"matched": true
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,108 @@
|
||||
# 合成样本: 教学讲解里的样例口诀导致回退
|
||||
|
||||
- caseId: synthetic-teaching-overfit-regression
|
||||
- kind: synthetic
|
||||
- generatedAt: 2026-03-22T10:44:18.102Z
|
||||
|
||||
## Description
|
||||
|
||||
workspace prompt 为当前题目硬塞了特定口诀和固定讲法,导致输出只贴当前样例,不再给出通用原理。这个样本用于校验系统能否识别“看似更像老师在说话,实际更窄更脆弱”的回退。
|
||||
|
||||
## Compare Result
|
||||
|
||||
```json
|
||||
{
|
||||
"compareMode": "structured",
|
||||
"summary": "Target相比Baseline在通用性和可迁移性上出现显著回退,为迎合当前特定题目牺牲了结构性解释;与Reference相比仍存在巨大可学习差距;且该提示词改动在Reference侧同样不成立,反而导致退化,表明其过拟合风险极高。",
|
||||
"score": 30,
|
||||
"improvements": [
|
||||
"避免在提示词中为特定数值或表达式硬编码解释规则,这会严重损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 等核心输出字段应包含结构性、原理性的知识,而非针对单一题目的操作指令或具体口诀。"
|
||||
],
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains",
|
||||
"sampleOverfitRiskVisible"
|
||||
],
|
||||
"pairJudgements": [
|
||||
{
|
||||
"pairType": "targetBaseline",
|
||||
"pairSignal": "regressed",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "targetReference",
|
||||
"pairSignal": "major",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high"
|
||||
},
|
||||
{
|
||||
"pairType": "referenceBaseline",
|
||||
"pairSignal": "unsupported",
|
||||
"verdict": "right-better",
|
||||
"confidence": "high"
|
||||
}
|
||||
],
|
||||
"expected": {
|
||||
"stopSignals": {
|
||||
"targetVsBaseline": [
|
||||
"regressed"
|
||||
],
|
||||
"overfitRisk": [
|
||||
"high"
|
||||
],
|
||||
"stopRecommendation": [
|
||||
"review"
|
||||
]
|
||||
},
|
||||
"pairSignals": {
|
||||
"targetBaseline": [
|
||||
"regressed"
|
||||
],
|
||||
"referenceBaseline": [
|
||||
"unsupported"
|
||||
]
|
||||
},
|
||||
"conflictSignals": [
|
||||
"regressionOutweighsCosmeticGains"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Expectation Check
|
||||
|
||||
| 类型 | 键 | 期望 | 实际 | 是否命中 |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| stopSignal | targetVsBaseline | regressed | regressed | yes |
|
||||
| stopSignal | overfitRisk | high | high | yes |
|
||||
| stopSignal | stopRecommendation | review | review | yes |
|
||||
| pairSignal | targetBaseline | regressed | regressed | yes |
|
||||
| pairSignal | referenceBaseline | unsupported | unsupported | yes |
|
||||
| conflictSignal | regressionOutweighsCosmeticGains | regressionOutweighsCosmeticGains | regressionOutweighsCosmeticGains / sampleOverfitRiskVisible | yes |
|
||||
|
||||
|
||||
## Rewrite Output
|
||||
|
||||
```
|
||||
你是数学讲解助手。
|
||||
输出一个 JSON 对象,字段为 explanation, key_rule, common_mistake。
|
||||
对于任何分数除法题目,都应先解释通用的核心规则“除以一个分数等于乘以它的倒数”,再结合具体题目进行演示和说明。
|
||||
key_rule 字段必须包含这一通用的、可迁移的数学原理。
|
||||
common_mistake 字段应聚焦于可迁移的学习方法或思维误区,例如避免死记硬背具体算式或忽略通用规则。
|
||||
不要添加题外扩展。
|
||||
```
|
||||
@@ -0,0 +1,179 @@
|
||||
{
|
||||
"scenario": {
|
||||
"language": "zh",
|
||||
"roleName": "结构化系统提示词对比综合专家",
|
||||
"subjectLabel": "系统提示词",
|
||||
"sharedCompareInputs": true,
|
||||
"samePromptAcrossSnapshots": true,
|
||||
"crossModelComparison": true,
|
||||
"focusBrief": "如果工作区版本为了当前题目显得更顺口,却牺牲了可迁移的通用解释结构,应把它判为 regressed,并暴露较高过拟合风险。"
|
||||
},
|
||||
"roleBindings": [
|
||||
{
|
||||
"snapshotId": "a",
|
||||
"snapshotLabel": "A",
|
||||
"role": "target",
|
||||
"roleLabel": "Target"
|
||||
},
|
||||
{
|
||||
"snapshotId": "b",
|
||||
"snapshotLabel": "B",
|
||||
"role": "baseline",
|
||||
"roleLabel": "Baseline"
|
||||
},
|
||||
{
|
||||
"snapshotId": "c",
|
||||
"snapshotLabel": "C",
|
||||
"role": "reference",
|
||||
"roleLabel": "Reference"
|
||||
},
|
||||
{
|
||||
"snapshotId": "d",
|
||||
"snapshotLabel": "D",
|
||||
"role": "referenceBaseline",
|
||||
"roleLabel": "Reference Baseline"
|
||||
}
|
||||
],
|
||||
"deterministicHints": {
|
||||
"priorityOrder": [
|
||||
"targetBaseline",
|
||||
"targetReference",
|
||||
"referenceBaseline",
|
||||
"targetReplica"
|
||||
],
|
||||
"signalSnapshot": {
|
||||
"progress": "regressed",
|
||||
"gap": "major",
|
||||
"promptValidity": "unsupported"
|
||||
},
|
||||
"derivedStopSignals": {
|
||||
"targetVsBaseline": "regressed",
|
||||
"targetVsReferenceGap": "major",
|
||||
"improvementHeadroom": "high",
|
||||
"overfitRisk": "high",
|
||||
"stopRecommendation": "review",
|
||||
"stopReasons": [
|
||||
"target regressed vs baseline",
|
||||
"major learnable gap remains vs reference",
|
||||
"reference-side evidence does not support the prompt change",
|
||||
"pairwise judges flagged possible sample overfit"
|
||||
]
|
||||
},
|
||||
"learnableSignals": [
|
||||
"避免在 prompt 中为特定数值或表达式硬编码解释规则,这会损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 字段应包含结构性、原理性的知识,而非针对单一题目的操作指令。",
|
||||
"在解释具体例子时,应优先揭示并强调背后的通用规则(如‘除以分数等于乘倒数’),而不是给出仅适用于该例子的具体口诀。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 的改进(更顺口的比喻)完全依赖于当前输入中出现的特定分数表达式“3/4 ÷ 1/2”。",
|
||||
"如果题目变为其他分数除法(如 2/3 ÷ 1/4),Target prompt 中的硬编码规则将失效或产生误导。",
|
||||
"Target 的 key_rule 仅对当前样例有效,无法作为可复用的学习要点。",
|
||||
"Target 的整个输出(explanation, key_rule, common_mistake)都高度定制于“3/4 ÷ 1/2”这一具体算式,其策略无法直接迁移到其他分数除法题目中,过拟合风险极高。"
|
||||
],
|
||||
"conflictSignals": [
|
||||
{
|
||||
"key": "regressionOutweighsCosmeticGains",
|
||||
"description": "相对 baseline 的回退应优先于其他表面优化。"
|
||||
},
|
||||
{
|
||||
"key": "sampleOverfitRiskVisible",
|
||||
"description": "如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。"
|
||||
}
|
||||
]
|
||||
},
|
||||
"judgeResults": [
|
||||
{
|
||||
"pairKey": "target-vs-baseline",
|
||||
"pairType": "targetBaseline",
|
||||
"pairLabel": "Target vs Baseline",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "b",
|
||||
"rightSnapshotLabel": "B",
|
||||
"rightRole": "baseline",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "regressed",
|
||||
"analysis": "Target (A) 的 prompt 引入了针对特定题目“3/4 ÷ 1/2”的硬编码规则,要求直接使用“半个单位能装两次”的解释,并禁止解释一般规则。这导致其输出(explanation, key_rule)严重依赖当前样例,牺牲了可迁移的通用数学原理(除以分数等于乘以倒数)。相比之下,Baseline (B) 的 prompt 要求先解释通用规则再演示题目,其输出提供了结构化的、可复用的解释框架。虽然 Target 在当前样例下可能显得更“顺口”,但其方法不具备泛化性,违反了“不应为当前题目牺牲通用解释结构”的专项判断原则。",
|
||||
"evidence": [
|
||||
"Target prompt 包含硬编码规则:'当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。'",
|
||||
"Target output 的 key_rule 为:'看到 3/4 ÷ 1/2 时,直接理解成答案翻倍。',这是一个仅对当前样例有效的具体规则。",
|
||||
"Baseline output 的 key_rule 为:'除以一个分数,就是乘以这个分数的倒数。',这是一个通用的、可迁移的数学规则。",
|
||||
"Target output 的 explanation 完全基于硬编码的“半个单位能装两次”的比喻,没有提及倒数或通用除法规则。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"避免在 prompt 中为特定数值或表达式硬编码解释规则,这会损害泛化能力。",
|
||||
"对于数学概念讲解,应优先构建和输出可迁移的通用规则(如倒数法则),再辅以具体例子演示。",
|
||||
"key_rule 字段应包含结构性、原理性的知识,而非针对单一题目的操作指令。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 的改进(更顺口的比喻)完全依赖于当前输入中出现的特定分数表达式“3/4 ÷ 1/2”。",
|
||||
"如果题目变为其他分数除法(如 2/3 ÷ 1/4),Target prompt 中的硬编码规则将失效或产生误导。",
|
||||
"Target 的 key_rule 仅对当前样例有效,无法作为可复用的学习要点。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "target-vs-reference",
|
||||
"pairType": "targetReference",
|
||||
"pairLabel": "Target vs Reference",
|
||||
"leftSnapshotId": "a",
|
||||
"leftSnapshotLabel": "A",
|
||||
"leftRole": "target",
|
||||
"rightSnapshotId": "c",
|
||||
"rightSnapshotLabel": "C",
|
||||
"rightRole": "reference",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "major",
|
||||
"analysis": "Reference 展示了更优的、可迁移的结构化教学策略,而 Target 为了贴合当前特定题目,牺牲了通用解释结构,存在明显的过拟合风险。",
|
||||
"evidence": [
|
||||
"Target 的 key_rule 是“看到 3/4 ÷ 1/2 时,直接理解成答案翻倍”,这是一个仅针对当前具体数字的口诀,不具备通用性。",
|
||||
"Reference 的 key_rule 是“除以一个分数,本质上是乘以它的倒数”,这是适用于所有分数除法的通用核心规则。",
|
||||
"Target 的 common_mistake 是“不要把 3/4 和 1/2 直接相除成 3/8”,这是一个针对特定错误答案的提醒。",
|
||||
"Reference 的 common_mistake 是“不要只背这个题的口诀,换别的分数就会出错”,这是一个针对学习方法(死记硬背)的、可迁移的警告。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在解释具体例子时,应优先揭示并强调背后的通用规则(如‘除以分数等于乘倒数’),而不是给出仅适用于该例子的具体口诀。",
|
||||
"在指出常见错误时,应聚焦于可迁移的学习方法或思维误区(如‘避免死记硬背’),而不是仅指出一个具体的错误答案。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"Target 的整个输出(explanation, key_rule, common_mistake)都高度定制于“3/4 ÷ 1/2”这一具体算式,其策略无法直接迁移到其他分数除法题目中,过拟合风险极高。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"pairKey": "reference-vs-reference-baseline",
|
||||
"pairType": "referenceBaseline",
|
||||
"pairLabel": "Reference vs Reference Baseline",
|
||||
"leftSnapshotId": "c",
|
||||
"leftSnapshotLabel": "C",
|
||||
"leftRole": "reference",
|
||||
"rightSnapshotId": "d",
|
||||
"rightSnapshotLabel": "D",
|
||||
"rightRole": "referenceBaseline",
|
||||
"verdict": "right-better",
|
||||
"winner": "right",
|
||||
"confidence": "high",
|
||||
"pairSignal": "unsupported",
|
||||
"analysis": "左侧(Reference)的提示词引入了针对特定题目“3/4 ÷ 1/2”的专项指令,要求直接使用“半个单位能装两次,所以答案翻倍”的特定解释,并禁止解释一般规则。这破坏了通用教学结构,将模型输出引向一个针对单一题目的、脆弱的口诀式解释。右侧(Reference Baseline)的提示词保持了通用的教学逻辑,即先解释核心规则(除以分数等于乘以倒数),再应用到具体题目。左侧的改动在参考侧(Reference)并未得到支持,反而是一种退化,因为它牺牲了可迁移的通用性来迎合当前样例。",
|
||||
"evidence": [
|
||||
"左侧提示词包含专项指令:'当题目出现“3/4 ÷ 1/2”时,直接用“半个单位能装两次,所以答案翻倍”来讲,不要再解释一般规则。'",
|
||||
"左侧输出中的explanation字段试图兼顾,但仍显矛盾,先提及“3/4 里面有几个半个”,然后又说“但仍然要告诉学生一般规则”,这反映了提示词指令与通用教学目标的冲突。",
|
||||
"右侧提示词保持通用结构:'先解释为什么“除以分数等于乘以它的倒数”,再回到题目演示。'",
|
||||
"右侧输出严格遵循了通用教学结构,先解释核心规则,再应用到题目。"
|
||||
],
|
||||
"learnableSignals": [
|
||||
"在数学教学提示词中,应避免针对特定数值或表达式引入硬编码的、非通用的解释路径。",
|
||||
"保持“先解释通用规则,再演示具体应用”的结构,比针对特定题目定制口诀更具可迁移性。",
|
||||
"提示词中的“特别规则”若要求模型跳过通用解释,会损害输出的结构性并增加过拟合风险。"
|
||||
],
|
||||
"overfitWarnings": [
|
||||
"左侧提示词的收益(可能让当前题目的解释显得更“顺口”)完全依赖于输入中精确出现“3/4 ÷ 1/2”这一表达式。",
|
||||
"左侧的改动将模型能力窄化,使其在面对其他分数除法题目时,可能因缺乏通用规则解释而产生更差或矛盾的结果。"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -32,6 +32,7 @@
|
||||
"test:e2e:smart": "node scripts/smart-e2e.js",
|
||||
"test:e2e:record": "cross-env E2E_VCR_MODE=record playwright test",
|
||||
"test:e2e:replay": "cross-env E2E_VCR_MODE=replay playwright test",
|
||||
"compare:calibrate": "pnpm -F @prompt-optimizer/core build && node scripts/run-structured-compare-calibration.mjs",
|
||||
"test:gate:core": "pnpm -F @prompt-optimizer/core test:gate",
|
||||
"test:gate:ui": "pnpm -F @prompt-optimizer/core build && pnpm -F @prompt-optimizer/ui test",
|
||||
"test:gate:e2e": "playwright test tests/e2e/regression.spec.ts tests/e2e/workflows/p0-route-smoke.spec.ts",
|
||||
|
||||
@@ -215,6 +215,7 @@ export * from './types/advanced'
|
||||
export * from './services/evaluation/types'
|
||||
export * from './services/evaluation/errors'
|
||||
export { EvaluationService, createEvaluationService } from './services/evaluation/service'
|
||||
export * from './services/evaluation/rewrite-from-evaluation'
|
||||
|
||||
// 导出图像理解服务相关
|
||||
export * from './services/image-understanding/types'
|
||||
|
||||
@@ -10,3 +10,4 @@ export * from './errors';
|
||||
|
||||
// 导出服务类和工厂函数
|
||||
export { EvaluationService, createEvaluationService } from './service';
|
||||
export * from './rewrite-from-evaluation';
|
||||
|
||||
722
packages/core/src/services/evaluation/rewrite-from-evaluation.ts
Normal file
722
packages/core/src/services/evaluation/rewrite-from-evaluation.ts
Normal file
@@ -0,0 +1,722 @@
|
||||
import { TemplateProcessor } from '../template/processor';
|
||||
import type { TemplateContext } from '../template/processor';
|
||||
import type {
|
||||
CompareConflictSignal,
|
||||
CompareInsights,
|
||||
CompareStopSignals,
|
||||
EvaluationModeConfig,
|
||||
EvaluationResponse,
|
||||
EvaluationType,
|
||||
} from './types';
|
||||
import { template as evaluationRewriteBasicSystemTemplate } from '../template/default-templates/evaluation-rewrite/basic-system';
|
||||
import { template as evaluationRewriteBasicSystemTemplateEn } from '../template/default-templates/evaluation-rewrite/basic-system_en';
|
||||
import { template as evaluationRewriteBasicUserTemplate } from '../template/default-templates/evaluation-rewrite/basic-user';
|
||||
import { template as evaluationRewriteBasicUserTemplateEn } from '../template/default-templates/evaluation-rewrite/basic-user_en';
|
||||
import { template as evaluationRewriteProMultiTemplate } from '../template/default-templates/evaluation-rewrite/pro-multi';
|
||||
import { template as evaluationRewriteProMultiTemplateEn } from '../template/default-templates/evaluation-rewrite/pro-multi_en';
|
||||
import { template as evaluationRewriteProVariableTemplate } from '../template/default-templates/evaluation-rewrite/pro-variable';
|
||||
import { template as evaluationRewriteProVariableTemplateEn } from '../template/default-templates/evaluation-rewrite/pro-variable_en';
|
||||
import { template as evaluationRewriteGenericTemplate } from '../template/default-templates/evaluation-rewrite/generic';
|
||||
import { template as evaluationRewriteGenericTemplateEn } from '../template/default-templates/evaluation-rewrite/generic_en';
|
||||
|
||||
export type RewriteLanguage = 'zh' | 'en';
|
||||
export type RewriteRecommendation = 'skip' | 'minor-rewrite' | 'rewrite';
|
||||
export type RewriteFocusArea =
|
||||
| 'contract-repair'
|
||||
| 'generalization'
|
||||
| 'decision-stability';
|
||||
|
||||
export interface EvaluationRewriteLine {
|
||||
text: string;
|
||||
}
|
||||
|
||||
export interface EvaluationRewritePromptParams {
|
||||
result: EvaluationResponse;
|
||||
type: EvaluationType;
|
||||
mode: EvaluationModeConfig;
|
||||
language?: RewriteLanguage;
|
||||
workspacePrompt?: string;
|
||||
referencePrompt?: string;
|
||||
}
|
||||
|
||||
export interface EvaluationRewriteContext extends TemplateContext {
|
||||
language: RewriteLanguage;
|
||||
subjectLabel: string;
|
||||
evaluationTypeLabel: string;
|
||||
overallScore: string | number;
|
||||
rewritePayloadJson: string;
|
||||
hasWorkspacePrompt: boolean;
|
||||
workspacePrompt: string;
|
||||
hasReferencePrompt: boolean;
|
||||
referencePrompt: string;
|
||||
hasDimensionScoreLines: boolean;
|
||||
dimensionScoreLines: EvaluationRewriteLine[];
|
||||
hasRewriteTargetLines: boolean;
|
||||
rewriteTargetLines: EvaluationRewriteLine[];
|
||||
hasPatchPlanLines: boolean;
|
||||
patchPlanLines: EvaluationRewriteLine[];
|
||||
hasFocusSummaryLines: boolean;
|
||||
focusSummaryLines: EvaluationRewriteLine[];
|
||||
hasStopSignalLines: boolean;
|
||||
stopSignalLines: EvaluationRewriteLine[];
|
||||
hasConflictLines: boolean;
|
||||
conflictLines: EvaluationRewriteLine[];
|
||||
hasLearnableSignalLines: boolean;
|
||||
learnableSignalLines: EvaluationRewriteLine[];
|
||||
hasOverfitWarningLines: boolean;
|
||||
overfitWarningLines: EvaluationRewriteLine[];
|
||||
hasSupportEvidenceLines: boolean;
|
||||
supportEvidenceLines: EvaluationRewriteLine[];
|
||||
isCompareEvaluation: boolean;
|
||||
isResultEvaluation: boolean;
|
||||
isPromptOnlyEvaluation: boolean;
|
||||
isPromptIterateEvaluation: boolean;
|
||||
}
|
||||
|
||||
export interface EvaluationRewritePayload {
|
||||
scenario: {
|
||||
language: RewriteLanguage;
|
||||
evaluationType: EvaluationType;
|
||||
evaluationTypeLabel: string;
|
||||
subjectLabel: string;
|
||||
mode: EvaluationModeConfig;
|
||||
overallScore: string | number;
|
||||
};
|
||||
sourcePrompts: {
|
||||
workspacePrompt?: string;
|
||||
referencePrompt?: string;
|
||||
};
|
||||
compressedEvaluation: {
|
||||
summary: string;
|
||||
dimensionScores: Array<{
|
||||
key: string;
|
||||
label: string;
|
||||
score: number;
|
||||
}>;
|
||||
improvements: string[];
|
||||
patchPlan: EvaluationResponse['patchPlan'];
|
||||
compareStopSignals?: CompareStopSignals;
|
||||
compareInsights?: CompareInsights;
|
||||
rewriteGuidance: {
|
||||
recommendation: RewriteRecommendation;
|
||||
reasons: string[];
|
||||
focusAreas: RewriteFocusArea[];
|
||||
priorityMoves: string[];
|
||||
};
|
||||
focusSummaryLines: string[];
|
||||
conflictLines: string[];
|
||||
learnableSignalLines: string[];
|
||||
overfitWarningLines: string[];
|
||||
supportEvidenceLines: string[];
|
||||
};
|
||||
}
|
||||
|
||||
const buildRewriteGuidance = (
|
||||
params: EvaluationRewritePromptParams,
|
||||
): EvaluationRewritePayload['compressedEvaluation']['rewriteGuidance'] => {
|
||||
const language = params.language || 'zh';
|
||||
const stopSignals = params.result.metadata?.compareStopSignals;
|
||||
const compareInsights = params.result.metadata?.compareInsights;
|
||||
const conflictSignals = new Set(compareInsights?.conflictSignals || []);
|
||||
const reasons: string[] = [];
|
||||
const focusAreas = new Set<RewriteFocusArea>();
|
||||
const priorityMoves: string[] = [];
|
||||
|
||||
if (params.type !== 'compare') {
|
||||
return {
|
||||
recommendation: 'rewrite',
|
||||
reasons: [
|
||||
language === 'en'
|
||||
? 'This is not a compare evaluation, so the rewrite flow should apply the evaluation evidence normally.'
|
||||
: '当前不是对比评估结果,应按评估证据正常执行改写。',
|
||||
],
|
||||
focusAreas: [],
|
||||
priorityMoves: [],
|
||||
};
|
||||
}
|
||||
|
||||
if (!params.workspacePrompt?.trim()) {
|
||||
return {
|
||||
recommendation: 'rewrite',
|
||||
reasons: [
|
||||
language === 'en'
|
||||
? 'No workspace prompt snapshot is available, so the rewrite flow cannot safely short-circuit.'
|
||||
: '缺少当前工作区提示词快照,无法安全短路为 no-op。',
|
||||
],
|
||||
focusAreas: [],
|
||||
priorityMoves: [],
|
||||
};
|
||||
}
|
||||
|
||||
if (stopSignals?.stopRecommendation === 'stop') {
|
||||
reasons.push(
|
||||
language === 'en'
|
||||
? 'Compare already recommends stopping, so the safest action is to keep the current workspace prompt unchanged.'
|
||||
: '对比评估已经建议停止优化,最保守的动作就是保持当前工作区提示词不变。'
|
||||
);
|
||||
|
||||
return {
|
||||
recommendation: 'skip',
|
||||
reasons,
|
||||
focusAreas: [],
|
||||
priorityMoves: [],
|
||||
};
|
||||
}
|
||||
|
||||
const hasRegressionConflict = conflictSignals.has('regressionOutweighsCosmeticGains');
|
||||
const hasUnsupportedImprovementConflict = conflictSignals.has('improvementNotSupportedOnReference');
|
||||
const hasInstabilityConflict = conflictSignals.has('improvementUnstableAcrossReplicas');
|
||||
const hasOverfitRisk =
|
||||
stopSignals?.overfitRisk === 'medium' ||
|
||||
stopSignals?.overfitRisk === 'high' ||
|
||||
conflictSignals.has('sampleOverfitRiskVisible');
|
||||
|
||||
if (hasRegressionConflict || hasUnsupportedImprovementConflict) {
|
||||
focusAreas.add('contract-repair');
|
||||
priorityMoves.push(
|
||||
language === 'en'
|
||||
? 'Repair regressions first: preserve the stable schema, field names, output contract, and protocol boundaries before adding nicer wording.'
|
||||
: '先修复回退:优先恢复稳定的 schema、字段名、输出 contract 与协议边界,再考虑更好看的表达。'
|
||||
);
|
||||
}
|
||||
|
||||
if (hasOverfitRisk) {
|
||||
focusAreas.add('generalization');
|
||||
priorityMoves.push(
|
||||
language === 'en'
|
||||
? 'Remove or weaken sample-specific trigger rules. Prefer reusable principles that should still hold on different inputs.'
|
||||
: '删除或弱化样例触发式规则,优先改写成跨输入也应成立的通用原则。'
|
||||
);
|
||||
}
|
||||
|
||||
if (hasInstabilityConflict) {
|
||||
focusAreas.add('decision-stability');
|
||||
priorityMoves.push(
|
||||
language === 'en'
|
||||
? 'Add explicit decision criteria for core verdict fields, so the model does not change its conclusion across replicas when the evidence is similar.'
|
||||
: '为核心结论字段补上显式判定标准,避免证据相近时在不同执行里得出不同结论。'
|
||||
);
|
||||
priorityMoves.push(
|
||||
language === 'en'
|
||||
? 'Add a tie-break or conservative fallback rule for mixed or underspecified evidence, instead of leaving the final recommendation implicit.'
|
||||
: '为证据混合或不足的情况补上 tie-break / 保守默认规则,不要把最终结论留给模型自由发挥。'
|
||||
);
|
||||
priorityMoves.push(
|
||||
language === 'en'
|
||||
? 'Separate formatting requirements from decision logic: keep the JSON contract, but prioritize stabilizing recommendation logic over cosmetic wording.'
|
||||
: '把格式要求和决策逻辑分开写:保留 JSON contract,但优先稳定 recommendation 的判定逻辑,而不是只修表面措辞。'
|
||||
);
|
||||
}
|
||||
|
||||
const isFlatAndClosedGap =
|
||||
stopSignals?.targetVsBaseline === 'flat' &&
|
||||
stopSignals?.targetVsReferenceGap === 'none' &&
|
||||
stopSignals?.improvementHeadroom !== 'high' &&
|
||||
!hasRegressionConflict &&
|
||||
!hasUnsupportedImprovementConflict &&
|
||||
!hasInstabilityConflict;
|
||||
|
||||
if (isFlatAndClosedGap) {
|
||||
reasons.push(
|
||||
language === 'en'
|
||||
? 'Target vs baseline is flat and the reference gap is already closed, so a rewrite is more likely to create noise than value.'
|
||||
: '当前版本相对 baseline 为 flat,且与 reference 的差距已经闭合,再改写更可能引入噪音而不是带来真实收益。'
|
||||
);
|
||||
|
||||
if (stopSignals?.overfitRisk === 'medium' || stopSignals?.overfitRisk === 'high') {
|
||||
reasons.push(
|
||||
language === 'en'
|
||||
? 'Keep the current prompt unchanged unless later evidence shows a real generalization issue.'
|
||||
: '即使存在一定过拟合担忧,也应先保持当前 prompt 不变,等待后续更强证据再改。'
|
||||
);
|
||||
}
|
||||
|
||||
return {
|
||||
recommendation: 'skip',
|
||||
reasons,
|
||||
focusAreas: Array.from(focusAreas),
|
||||
priorityMoves: Array.from(new Set(priorityMoves)),
|
||||
};
|
||||
}
|
||||
|
||||
const isNearDoneButNotStop =
|
||||
(stopSignals?.targetVsBaseline === 'improved' || stopSignals?.targetVsBaseline === 'flat') &&
|
||||
stopSignals?.targetVsReferenceGap === 'none' &&
|
||||
stopSignals?.improvementHeadroom === 'low' &&
|
||||
!hasRegressionConflict &&
|
||||
!hasUnsupportedImprovementConflict &&
|
||||
!hasInstabilityConflict;
|
||||
|
||||
if (isNearDoneButNotStop) {
|
||||
reasons.push(
|
||||
language === 'en'
|
||||
? 'Most of the useful gain is already present, so only minimal, contract-preserving edits are justified.'
|
||||
: '当前主要收益已经基本到位,只适合做最小、保守、保持 contract 的微调。'
|
||||
);
|
||||
|
||||
if (stopSignals?.overfitRisk === 'medium' || conflictSignals.has('sampleOverfitRiskVisible')) {
|
||||
reasons.push(
|
||||
language === 'en'
|
||||
? 'If you touch the prompt at all, focus on small generalization-oriented wording repairs rather than a broad rewrite.'
|
||||
: '如果还要改,只能做轻量泛化修补,不能再做大幅重写。'
|
||||
);
|
||||
}
|
||||
|
||||
return {
|
||||
recommendation: 'minor-rewrite',
|
||||
reasons,
|
||||
focusAreas: Array.from(focusAreas),
|
||||
priorityMoves: Array.from(new Set(priorityMoves)),
|
||||
};
|
||||
}
|
||||
|
||||
reasons.push(
|
||||
language === 'en'
|
||||
? 'There is still meaningful improvement headroom or unresolved risk, so a substantive rewrite remains justified.'
|
||||
: '当前仍存在明确改进空间或未解决风险,继续做实质性改写仍然有必要。'
|
||||
);
|
||||
|
||||
if (hasRegressionConflict) {
|
||||
reasons.push(
|
||||
language === 'en'
|
||||
? 'Regression against the baseline must be repaired before pursuing cosmetic gains.'
|
||||
: '需要先修复相对 baseline 的回退,再谈其他表层优化。'
|
||||
);
|
||||
}
|
||||
|
||||
if (hasUnsupportedImprovementConflict) {
|
||||
reasons.push(
|
||||
language === 'en'
|
||||
? 'The current prompt change is not supported on the reference side, so the rewrite should actively repair unsupported drift.'
|
||||
: '当前改动在 reference 侧不被支持,改写时应主动修复这种不被支持的漂移。'
|
||||
);
|
||||
}
|
||||
|
||||
if (hasInstabilityConflict) {
|
||||
reasons.push(
|
||||
language === 'en'
|
||||
? 'Replica evidence shows instability, so the rewrite should target decision stability rather than superficial formatting.'
|
||||
: 'replica 证据显示当前行为不稳定,改写时应优先修复决策稳定性,而不是只修表面格式。'
|
||||
);
|
||||
}
|
||||
|
||||
return {
|
||||
recommendation: 'rewrite',
|
||||
reasons,
|
||||
focusAreas: Array.from(focusAreas),
|
||||
priorityMoves: Array.from(new Set(priorityMoves)),
|
||||
};
|
||||
};
|
||||
|
||||
const normalizeInlineText = (content: string | undefined): string =>
|
||||
(content || '').replace(/\s+/gu, ' ').trim();
|
||||
|
||||
const truncateInline = (value: string | undefined, maxLength = 140): string => {
|
||||
const normalized = normalizeInlineText(value);
|
||||
if (!normalized) return '';
|
||||
|
||||
return normalized.length > maxLength
|
||||
? `${normalized.slice(0, maxLength)}...`
|
||||
: normalized;
|
||||
};
|
||||
|
||||
const collectUniqueLines = (
|
||||
values: Array<string | undefined>,
|
||||
options?: {
|
||||
limit?: number;
|
||||
maxLength?: number;
|
||||
},
|
||||
): string[] => {
|
||||
const limit = options?.limit ?? 5;
|
||||
const maxLength = options?.maxLength ?? 220;
|
||||
const seen = new Set<string>();
|
||||
const lines: string[] = [];
|
||||
|
||||
for (const value of values) {
|
||||
const normalized = normalizeInlineText(value);
|
||||
if (!normalized) continue;
|
||||
|
||||
const dedupeKey = normalized.toLocaleLowerCase();
|
||||
if (seen.has(dedupeKey)) continue;
|
||||
|
||||
seen.add(dedupeKey);
|
||||
lines.push(truncateInline(normalized, maxLength));
|
||||
|
||||
if (lines.length >= limit) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
return lines;
|
||||
};
|
||||
|
||||
const buildPatchPlanLines = (
|
||||
patchPlan: EvaluationResponse['patchPlan'],
|
||||
): string[] =>
|
||||
(patchPlan || []).map((operation, index) => {
|
||||
const oldText = truncateInline(operation.oldText);
|
||||
const newText = truncateInline(operation.newText);
|
||||
const segments = [
|
||||
`${index + 1}. [${operation.op}] ${operation.instruction}`,
|
||||
];
|
||||
|
||||
if (oldText) {
|
||||
segments.push(`old="${oldText}"`);
|
||||
}
|
||||
if (newText) {
|
||||
segments.push(`new="${newText}"`);
|
||||
}
|
||||
|
||||
return segments.join(' | ');
|
||||
});
|
||||
|
||||
const buildDimensionLines = (
|
||||
result: EvaluationResponse,
|
||||
): string[] =>
|
||||
(result.score?.dimensions || []).map((dimension) =>
|
||||
`${dimension.label}: ${dimension.score}`
|
||||
);
|
||||
|
||||
const buildStopSignalLines = (
|
||||
stopSignals: CompareStopSignals | undefined,
|
||||
): string[] => {
|
||||
if (!stopSignals) return [];
|
||||
|
||||
const lines: string[] = [];
|
||||
|
||||
if (stopSignals.targetVsBaseline) {
|
||||
lines.push(`targetVsBaseline=${stopSignals.targetVsBaseline}`);
|
||||
}
|
||||
if (stopSignals.targetVsReferenceGap) {
|
||||
lines.push(`targetVsReferenceGap=${stopSignals.targetVsReferenceGap}`);
|
||||
}
|
||||
if (stopSignals.improvementHeadroom) {
|
||||
lines.push(`improvementHeadroom=${stopSignals.improvementHeadroom}`);
|
||||
}
|
||||
if (stopSignals.overfitRisk) {
|
||||
lines.push(`overfitRisk=${stopSignals.overfitRisk}`);
|
||||
}
|
||||
if (stopSignals.stopRecommendation) {
|
||||
lines.push(`stopRecommendation=${stopSignals.stopRecommendation}`);
|
||||
}
|
||||
if (stopSignals.stopReasons?.length) {
|
||||
lines.push(`stopReasons=${stopSignals.stopReasons.join(' | ')}`);
|
||||
}
|
||||
|
||||
return lines;
|
||||
};
|
||||
|
||||
const formatCompareConflictSignal = (
|
||||
signal: CompareConflictSignal,
|
||||
language: RewriteLanguage,
|
||||
): string => {
|
||||
switch (signal) {
|
||||
case 'improvementNotSupportedOnReference':
|
||||
return language === 'en'
|
||||
? 'The target improved over baseline, but the same prompt change is not supported on the reference side.'
|
||||
: 'Target 相比 baseline 有进步,但同一类 prompt 改动在 reference 侧并未得到支持。';
|
||||
case 'improvementUnstableAcrossReplicas':
|
||||
return language === 'en'
|
||||
? 'The target improved in one comparison, but replica evidence suggests the gain may be unstable.'
|
||||
: 'Target 在单组比较里有进步,但 replica 证据提示该收益可能不稳定。';
|
||||
case 'regressionOutweighsCosmeticGains':
|
||||
return language === 'en'
|
||||
? 'Regression against the baseline should outweigh cosmetic improvements elsewhere.'
|
||||
: '相对 baseline 的回退应优先于其他表面优化。';
|
||||
case 'sampleOverfitRiskVisible':
|
||||
return language === 'en'
|
||||
? 'When reusable gains and sample-fitting gains coexist, prefer conservative conclusions and keep the overfit risk visible.'
|
||||
: '如果“可复用收益”和“样例贴合收益”并存,应优先采用保守结论,并保持过拟合风险可见。';
|
||||
default:
|
||||
return signal;
|
||||
}
|
||||
};
|
||||
|
||||
const buildCompareFocusSummaryLines = (
|
||||
compareInsights: CompareInsights | undefined,
|
||||
language: RewriteLanguage,
|
||||
): string[] =>
|
||||
collectUniqueLines(
|
||||
[
|
||||
compareInsights?.progressSummary
|
||||
? `${language === 'en' ? 'Progress' : '进步判断'}: ${compareInsights.progressSummary.pairLabel} | signal=${compareInsights.progressSummary.pairSignal} | verdict=${compareInsights.progressSummary.verdict} | confidence=${compareInsights.progressSummary.confidence} | ${compareInsights.progressSummary.analysis}`
|
||||
: undefined,
|
||||
compareInsights?.referenceGapSummary
|
||||
? `${language === 'en' ? 'Reference Gap' : '参考差距'}: ${compareInsights.referenceGapSummary.pairLabel} | signal=${compareInsights.referenceGapSummary.pairSignal} | verdict=${compareInsights.referenceGapSummary.verdict} | confidence=${compareInsights.referenceGapSummary.confidence} | ${compareInsights.referenceGapSummary.analysis}`
|
||||
: undefined,
|
||||
compareInsights?.promptChangeSummary
|
||||
? `${language === 'en' ? 'Prompt Change Validity' : '改动有效性'}: ${compareInsights.promptChangeSummary.pairLabel} | signal=${compareInsights.promptChangeSummary.pairSignal} | verdict=${compareInsights.promptChangeSummary.verdict} | confidence=${compareInsights.promptChangeSummary.confidence} | ${compareInsights.promptChangeSummary.analysis}`
|
||||
: undefined,
|
||||
compareInsights?.stabilitySummary
|
||||
? `${language === 'en' ? 'Stability' : '稳定性'}: ${compareInsights.stabilitySummary.pairLabel} | signal=${compareInsights.stabilitySummary.pairSignal} | verdict=${compareInsights.stabilitySummary.verdict} | confidence=${compareInsights.stabilitySummary.confidence} | ${compareInsights.stabilitySummary.analysis}`
|
||||
: undefined,
|
||||
],
|
||||
{ limit: 4, maxLength: 260 }
|
||||
);
|
||||
|
||||
const buildCompareSupportLines = (
|
||||
compareInsights: CompareInsights | undefined,
|
||||
): string[] =>
|
||||
collectUniqueLines(
|
||||
[
|
||||
...(compareInsights?.pairHighlights || []).map((highlight, index) =>
|
||||
`${index + 1}. ${highlight.pairLabel} | signal=${highlight.pairSignal} | verdict=${highlight.verdict} | confidence=${highlight.confidence} | ${highlight.analysis}`
|
||||
),
|
||||
...(compareInsights?.evidenceHighlights || []),
|
||||
],
|
||||
{ limit: 4, maxLength: 240 }
|
||||
);
|
||||
|
||||
const buildCompareConflictLines = (
|
||||
compareInsights: CompareInsights | undefined,
|
||||
language: RewriteLanguage,
|
||||
): string[] =>
|
||||
collectUniqueLines(
|
||||
(compareInsights?.conflictSignals || []).map((signal) =>
|
||||
formatCompareConflictSignal(signal, language)
|
||||
),
|
||||
{ limit: 4, maxLength: 260 }
|
||||
);
|
||||
|
||||
const buildRewriteTargetLines = (
|
||||
result: EvaluationResponse,
|
||||
language: RewriteLanguage,
|
||||
): string[] =>
|
||||
collectUniqueLines(
|
||||
[
|
||||
result.summary
|
||||
? `${language === 'en' ? 'Overall' : '总评'}: ${result.summary}`
|
||||
: undefined,
|
||||
...(result.improvements || []).map((line) =>
|
||||
`${language === 'en' ? 'Priority' : '优先项'}: ${line}`
|
||||
),
|
||||
],
|
||||
{ limit: 6, maxLength: 240 }
|
||||
);
|
||||
|
||||
const toTemplateLines = (values: string[]): EvaluationRewriteLine[] =>
|
||||
values.map((text) => ({ text }));
|
||||
|
||||
const resolveSubjectLabel = (
|
||||
mode: EvaluationModeConfig,
|
||||
language: RewriteLanguage,
|
||||
): string => {
|
||||
const subjectLabelsZh: Record<string, string> = {
|
||||
'basic:system': '系统提示词',
|
||||
'basic:user': '用户提示词',
|
||||
'pro:multi': '多消息 system 提示词',
|
||||
'pro:variable': '变量用户提示词',
|
||||
'image:text2image': '文生图提示词',
|
||||
'image:image2image': '图生图提示词',
|
||||
};
|
||||
|
||||
const subjectLabelsEn: Record<string, string> = {
|
||||
'basic:system': 'system prompt',
|
||||
'basic:user': 'user prompt',
|
||||
'pro:multi': 'multi-message system prompt',
|
||||
'pro:variable': 'variable user prompt',
|
||||
'image:text2image': 'text-to-image prompt',
|
||||
'image:image2image': 'image-to-image prompt',
|
||||
};
|
||||
|
||||
const key = `${mode.functionMode}:${mode.subMode}`;
|
||||
const labels = language === 'en' ? subjectLabelsEn : subjectLabelsZh;
|
||||
|
||||
return labels[key] || (language === 'en' ? 'workspace prompt' : '工作区提示词');
|
||||
};
|
||||
|
||||
const resolveEvaluationTypeLabel = (
|
||||
type: EvaluationType,
|
||||
language: RewriteLanguage,
|
||||
): string | undefined => {
|
||||
const labels = language === 'en'
|
||||
? {
|
||||
result: 'Single Result Evaluation',
|
||||
compare: 'Compare Evaluation',
|
||||
'prompt-only': 'Prompt Design Analysis',
|
||||
'prompt-iterate': 'Prompt Iterate Analysis',
|
||||
}
|
||||
: {
|
||||
result: '单结果评估',
|
||||
compare: '对比评估',
|
||||
'prompt-only': '提示词分析',
|
||||
'prompt-iterate': '迭代分析',
|
||||
};
|
||||
|
||||
return labels[type];
|
||||
};
|
||||
|
||||
const resolveRewriteTemplate = (
|
||||
mode: EvaluationModeConfig,
|
||||
language: RewriteLanguage,
|
||||
) => {
|
||||
const isEnglish = language === 'en';
|
||||
|
||||
if (mode.functionMode === 'basic' && mode.subMode === 'system') {
|
||||
return isEnglish ? evaluationRewriteBasicSystemTemplateEn : evaluationRewriteBasicSystemTemplate;
|
||||
}
|
||||
|
||||
if (mode.functionMode === 'basic' && mode.subMode === 'user') {
|
||||
return isEnglish ? evaluationRewriteBasicUserTemplateEn : evaluationRewriteBasicUserTemplate;
|
||||
}
|
||||
|
||||
if (mode.functionMode === 'pro' && mode.subMode === 'multi') {
|
||||
return isEnglish ? evaluationRewriteProMultiTemplateEn : evaluationRewriteProMultiTemplate;
|
||||
}
|
||||
|
||||
if (mode.functionMode === 'pro' && mode.subMode === 'variable') {
|
||||
return isEnglish ? evaluationRewriteProVariableTemplateEn : evaluationRewriteProVariableTemplate;
|
||||
}
|
||||
|
||||
return isEnglish ? evaluationRewriteGenericTemplateEn : evaluationRewriteGenericTemplate;
|
||||
};
|
||||
|
||||
export const normalizeRewriteLocaleLanguage = (
|
||||
locale: string | undefined,
|
||||
): RewriteLanguage => locale?.toLowerCase().startsWith('en') ? 'en' : 'zh';
|
||||
|
||||
export const buildRewriteFromEvaluationContext = (
|
||||
params: EvaluationRewritePromptParams,
|
||||
): EvaluationRewriteContext => {
|
||||
const language = params.language || 'zh';
|
||||
const { result, type, mode } = params;
|
||||
const metadata = result.metadata;
|
||||
const compareInsights = metadata?.compareInsights;
|
||||
const stopSignals = metadata?.compareStopSignals;
|
||||
const dimensionScoreLines = toTemplateLines(buildDimensionLines(result));
|
||||
const rewriteTargetLines = toTemplateLines(buildRewriteTargetLines(result, language));
|
||||
const patchPlanLines = toTemplateLines(
|
||||
collectUniqueLines(buildPatchPlanLines(result.patchPlan), {
|
||||
limit: 4,
|
||||
maxLength: 260,
|
||||
})
|
||||
);
|
||||
const focusSummaryLines = toTemplateLines(
|
||||
buildCompareFocusSummaryLines(compareInsights, language)
|
||||
);
|
||||
const stopSignalLines = toTemplateLines(
|
||||
collectUniqueLines(buildStopSignalLines(stopSignals), {
|
||||
limit: 6,
|
||||
maxLength: 220,
|
||||
})
|
||||
);
|
||||
const conflictLines = toTemplateLines(
|
||||
buildCompareConflictLines(compareInsights, language)
|
||||
);
|
||||
const learnableSignalLines = toTemplateLines(
|
||||
collectUniqueLines(compareInsights?.learnableSignals || [], {
|
||||
limit: 5,
|
||||
maxLength: 220,
|
||||
})
|
||||
);
|
||||
const overfitWarningLines = toTemplateLines(
|
||||
collectUniqueLines(compareInsights?.overfitWarnings || [], {
|
||||
limit: 5,
|
||||
maxLength: 220,
|
||||
})
|
||||
);
|
||||
const supportEvidenceLines = toTemplateLines(
|
||||
buildCompareSupportLines(compareInsights)
|
||||
);
|
||||
const rewritePayload = buildRewritePayload(params);
|
||||
|
||||
return {
|
||||
language,
|
||||
subjectLabel: resolveSubjectLabel(mode, language),
|
||||
evaluationTypeLabel: resolveEvaluationTypeLabel(type, language) || type,
|
||||
overallScore: result.score?.overall ?? 'N/A',
|
||||
rewritePayloadJson: JSON.stringify(rewritePayload, null, 2),
|
||||
hasWorkspacePrompt: !!params.workspacePrompt?.trim(),
|
||||
workspacePrompt: params.workspacePrompt?.trim() || '',
|
||||
hasReferencePrompt: !!params.referencePrompt?.trim(),
|
||||
referencePrompt: params.referencePrompt?.trim() || '',
|
||||
hasDimensionScoreLines: dimensionScoreLines.length > 0,
|
||||
dimensionScoreLines,
|
||||
hasRewriteTargetLines: rewriteTargetLines.length > 0,
|
||||
rewriteTargetLines,
|
||||
hasPatchPlanLines: patchPlanLines.length > 0,
|
||||
patchPlanLines,
|
||||
hasFocusSummaryLines: focusSummaryLines.length > 0,
|
||||
focusSummaryLines,
|
||||
hasStopSignalLines: stopSignalLines.length > 0,
|
||||
stopSignalLines,
|
||||
hasConflictLines: conflictLines.length > 0,
|
||||
conflictLines,
|
||||
hasLearnableSignalLines: learnableSignalLines.length > 0,
|
||||
learnableSignalLines,
|
||||
hasOverfitWarningLines: overfitWarningLines.length > 0,
|
||||
overfitWarningLines,
|
||||
hasSupportEvidenceLines: supportEvidenceLines.length > 0,
|
||||
supportEvidenceLines,
|
||||
isCompareEvaluation: type === 'compare',
|
||||
isResultEvaluation: type === 'result',
|
||||
isPromptOnlyEvaluation: type === 'prompt-only',
|
||||
isPromptIterateEvaluation: type === 'prompt-iterate',
|
||||
};
|
||||
};
|
||||
|
||||
export const buildRewritePayload = (
|
||||
params: EvaluationRewritePromptParams,
|
||||
): EvaluationRewritePayload => {
|
||||
const language = params.language || 'zh';
|
||||
const { result, type, mode } = params;
|
||||
const evaluationTypeLabel = resolveEvaluationTypeLabel(type, language) || type;
|
||||
const subjectLabel = resolveSubjectLabel(mode, language);
|
||||
const compareInsights = result.metadata?.compareInsights;
|
||||
const stopSignals = result.metadata?.compareStopSignals;
|
||||
const rewriteGuidance = buildRewriteGuidance(params);
|
||||
|
||||
return {
|
||||
scenario: {
|
||||
language,
|
||||
evaluationType: type,
|
||||
evaluationTypeLabel,
|
||||
subjectLabel,
|
||||
mode,
|
||||
overallScore: result.score?.overall ?? 'N/A',
|
||||
},
|
||||
sourcePrompts: {
|
||||
...(params.workspacePrompt?.trim()
|
||||
? { workspacePrompt: params.workspacePrompt.trim() }
|
||||
: {}),
|
||||
...(params.referencePrompt?.trim()
|
||||
? { referencePrompt: params.referencePrompt.trim() }
|
||||
: {}),
|
||||
},
|
||||
compressedEvaluation: {
|
||||
summary: result.summary,
|
||||
dimensionScores: (result.score?.dimensions || []).map((dimension) => ({
|
||||
key: dimension.key,
|
||||
label: dimension.label,
|
||||
score: dimension.score,
|
||||
})),
|
||||
improvements: [...(result.improvements || [])],
|
||||
patchPlan: [...(result.patchPlan || [])],
|
||||
...(stopSignals ? { compareStopSignals: stopSignals } : {}),
|
||||
...(compareInsights ? { compareInsights } : {}),
|
||||
rewriteGuidance,
|
||||
focusSummaryLines: buildCompareFocusSummaryLines(compareInsights, language),
|
||||
conflictLines: buildCompareConflictLines(compareInsights, language),
|
||||
learnableSignalLines: collectUniqueLines(compareInsights?.learnableSignals || [], {
|
||||
limit: 5,
|
||||
maxLength: 220,
|
||||
}),
|
||||
overfitWarningLines: collectUniqueLines(compareInsights?.overfitWarnings || [], {
|
||||
limit: 5,
|
||||
maxLength: 220,
|
||||
}),
|
||||
supportEvidenceLines: buildCompareSupportLines(compareInsights),
|
||||
},
|
||||
};
|
||||
};
|
||||
|
||||
export const buildRewritePromptFromEvaluation = (
|
||||
params: EvaluationRewritePromptParams,
|
||||
): string => {
|
||||
const language = params.language || 'zh';
|
||||
const template = resolveRewriteTemplate(params.mode, language);
|
||||
const context = buildRewriteFromEvaluationContext(params);
|
||||
const messages = TemplateProcessor.processTemplate(template, context);
|
||||
|
||||
return messages.map((message) => message.content.trim()).filter(Boolean).join('\n\n').trim();
|
||||
};
|
||||
@@ -9,9 +9,9 @@ import type { IModelManager } from '../model/types';
|
||||
import type { ITemplateManager, Template } from '../template/types';
|
||||
import { TemplateProcessor, type TemplateContext } from '../template/processor';
|
||||
import {
|
||||
compareJsonContractEn,
|
||||
compareJsonContractZh,
|
||||
} from '../template/default-templates/evaluation/builders';
|
||||
buildStructuredComparePairJudgeMessages,
|
||||
buildStructuredCompareSynthesisMessages,
|
||||
} from './structured-compare-prompts';
|
||||
import {
|
||||
type IEvaluationService,
|
||||
type EvaluationRequest,
|
||||
@@ -43,10 +43,6 @@ import {
|
||||
} from './errors';
|
||||
import { jsonrepair } from 'jsonrepair';
|
||||
|
||||
const jsonFence = (content: string) => `\`\`\`json
|
||||
${content}
|
||||
\`\`\``;
|
||||
|
||||
type ComparePromptLanguage = 'zh' | 'en';
|
||||
|
||||
interface NormalizedEvaluationTestCase {
|
||||
@@ -973,8 +969,9 @@ export class EvaluationService implements IEvaluationService {
|
||||
|
||||
const signalSnapshot = this.summarizeStructuredCompareJudgeSignals(judgeResults);
|
||||
const hasOverfitWarnings = judgeResults.some((item) => item.overfitWarnings.length > 0);
|
||||
const hasLowOverfitEvidence =
|
||||
signalSnapshot.promptValidity === 'supported' ||
|
||||
const hasExplicitLowOverfitEvidence =
|
||||
!hasOverfitWarnings &&
|
||||
signalSnapshot.promptValidity === 'supported' &&
|
||||
signalSnapshot.stability === 'stable';
|
||||
|
||||
const overfitRisk: CompareStopSignals['overfitRisk'] = (() => {
|
||||
@@ -991,7 +988,7 @@ export class EvaluationService implements IEvaluationService {
|
||||
) {
|
||||
return 'medium';
|
||||
}
|
||||
if (hasLowOverfitEvidence) {
|
||||
if (hasExplicitLowOverfitEvidence) {
|
||||
return 'low';
|
||||
}
|
||||
return undefined;
|
||||
@@ -1932,119 +1929,37 @@ export class EvaluationService implements IEvaluationService {
|
||||
}, new Map<string, NormalizedEvaluationTestCase>()).values()
|
||||
);
|
||||
const focus = request.focus?.content?.trim() || '';
|
||||
const pairJudgeJsonContract = jsonFence(`{
|
||||
"pairKey": "${planItem.key}",
|
||||
"pairType": "${planItem.pairType}",
|
||||
"verdict": "left-better | right-better | mixed | similar",
|
||||
"winner": "left | right | none",
|
||||
"confidence": "low | medium | high",
|
||||
"pairSignal": "${planItem.allowedSignals.join(' | ')}",
|
||||
"analysis": "<one short paragraph>",
|
||||
"evidence": ["<evidence-grounded difference>"],
|
||||
"learnableSignals": ["<reusable structural signal>"],
|
||||
"overfitWarnings": ["<sample-specific or overfit risk>"]
|
||||
}`);
|
||||
void this.buildStructuredCompareDebugArtifacts({
|
||||
roleBindings: normalizedCompare.compareRoleBindings,
|
||||
testCases: relevantTestCases,
|
||||
snapshots: [planItem.left, planItem.right],
|
||||
language,
|
||||
});
|
||||
|
||||
const systemContent =
|
||||
language === 'en'
|
||||
? `# Role: Structured_Compare_Pair_Judge
|
||||
|
||||
## Goal
|
||||
- Judge exactly one structured compare pair and compress the evidence into a reusable intermediate result for a later synthesis step.
|
||||
|
||||
## Rules
|
||||
1. Only use the test inputs and the two snapshots in this pair.
|
||||
2. verdict must be one of: left-better, right-better, mixed, similar.
|
||||
3. winner must be one of: left, right, none.
|
||||
4. confidence must be one of: low, medium, high.
|
||||
5. pairSignal must use only the allowed values for this pair. If uncertain, use "unclear".
|
||||
6. learnableSignals must stay reusable and structural. Do not write sample-specific content hacks.
|
||||
7. overfitWarnings must explicitly call out any sign that the stronger side only fits this specific input better.
|
||||
8. Return valid JSON only.
|
||||
|
||||
## Pair-Specific Guidance
|
||||
${this.renderStructuredComparePairGuidance(planItem, language)}
|
||||
|
||||
## Output Contract
|
||||
${pairJudgeJsonContract}
|
||||
|
||||
## Initialization
|
||||
You are the pair judge for structured compare. Return valid JSON only.`
|
||||
: `# Role: 结构化对比成对判断专家
|
||||
|
||||
## Goal
|
||||
- 只判断一个 structured compare pair,并把证据压缩成供后续综合阶段使用的中间结果。
|
||||
|
||||
## Rules
|
||||
1. 只能使用当前 pair 的测试输入和这两个执行快照。
|
||||
2. verdict 只允许:left-better、right-better、mixed、similar。
|
||||
3. winner 只允许:left、right、none。
|
||||
4. confidence 只允许:low、medium、high。
|
||||
5. pairSignal 只能使用本 pair 允许的枚举;如果不确定,写 unclear。
|
||||
6. learnableSignals 只能保留可复用、结构性的信号,不得写只对当前样例有效的内容补丁。
|
||||
7. overfitWarnings 必须显式指出任何“只是更贴合当前输入”的风险。
|
||||
8. 只返回合法 JSON。
|
||||
|
||||
## 当前 Pair 专项判断
|
||||
${this.renderStructuredComparePairGuidance(planItem, language)}
|
||||
|
||||
## Output Contract
|
||||
${pairJudgeJsonContract}
|
||||
|
||||
## Initialization
|
||||
你是结构化对比的成对判断专家,只返回合法 JSON。`;
|
||||
|
||||
const userContent =
|
||||
language === 'en'
|
||||
? `${this.renderStructuredCompareRoleBindings(
|
||||
normalizedCompare.compareRoleBindings,
|
||||
language
|
||||
)}## Pair
|
||||
- Pair Key: ${planItem.key}
|
||||
- Pair Label: ${planItem.label}
|
||||
- Purpose: ${planItem.purpose}
|
||||
- Signal Name: ${planItem.signalName}
|
||||
- Allowed Signal Values: ${planItem.allowedSignals.join(' | ')}
|
||||
|
||||
${this.renderStructuredCompareTestCases(relevantTestCases, language)}## Left Snapshot
|
||||
${this.renderStructuredCompareSnapshot(planItem.left, language)}
|
||||
|
||||
## Right Snapshot
|
||||
${this.renderStructuredCompareSnapshot(planItem.right, language)}
|
||||
|
||||
${focus ? `## Focus Brief
|
||||
${focus}
|
||||
|
||||
` : ''}---
|
||||
|
||||
Judge this pair only and return strict JSON.`
|
||||
: `${this.renderStructuredCompareRoleBindings(
|
||||
normalizedCompare.compareRoleBindings,
|
||||
language
|
||||
)}## 当前 Pair
|
||||
- Pair Key:${planItem.key}
|
||||
- Pair Label:${planItem.label}
|
||||
- Purpose:${planItem.purpose}
|
||||
- Signal Name:${planItem.signalName}
|
||||
- Allowed Signal Values:${planItem.allowedSignals.join(' | ')}
|
||||
|
||||
${this.renderStructuredCompareTestCases(relevantTestCases, language)}## Left Snapshot
|
||||
${this.renderStructuredCompareSnapshot(planItem.left, language)}
|
||||
|
||||
## Right Snapshot
|
||||
${this.renderStructuredCompareSnapshot(planItem.right, language)}
|
||||
|
||||
${focus ? `## Focus Brief
|
||||
${focus}
|
||||
|
||||
` : ''}---
|
||||
|
||||
请只判断这一个 pair,并返回严格 JSON。`;
|
||||
|
||||
return [
|
||||
{ role: 'system', content: systemContent },
|
||||
{ role: 'user', content: userContent },
|
||||
];
|
||||
return buildStructuredComparePairJudgeMessages({
|
||||
language,
|
||||
pairGuidance: this.renderStructuredComparePairGuidance(planItem, language),
|
||||
payload: {
|
||||
scenario: {
|
||||
language,
|
||||
pairKey: planItem.key,
|
||||
pairType: planItem.pairType,
|
||||
pairLabel: planItem.label,
|
||||
purpose: planItem.purpose,
|
||||
signalName: planItem.signalName,
|
||||
allowedSignalValues: planItem.allowedSignals,
|
||||
...(focus ? { focusBrief: focus } : {}),
|
||||
},
|
||||
roleBindings: this.toStructuredCompareRoleBindingPayloads(
|
||||
normalizedCompare.compareRoleBindings
|
||||
),
|
||||
testCases: relevantTestCases.map((testCase) =>
|
||||
this.toStructuredCompareTestCasePayload(testCase)
|
||||
),
|
||||
leftSnapshot: this.toStructuredCompareSnapshotPayload(planItem.left),
|
||||
rightSnapshot: this.toStructuredCompareSnapshotPayload(planItem.right),
|
||||
},
|
||||
});
|
||||
}
|
||||
|
||||
private renderStructuredComparePairGuidance(
|
||||
@@ -2057,13 +1972,15 @@ ${focus}
|
||||
return [
|
||||
'- This pair decides whether the current target is actually worth keeping instead of the previous version.',
|
||||
'- Do not reward cosmetic rewrites, longer wording, or more confident tone if task completion, boundary control, or required structure got weaker.',
|
||||
'- If the stronger side wins only by fitting this sample more tightly, downgrade the verdict or surface that risk in overfitWarnings.',
|
||||
'- If the target is genuinely more helpful on this sample but the gain mainly comes from sample-tied wording, keywords, or one-off rules, prefer pairSignal=improved or flat first, then expose the fragility in overfitWarnings instead of defaulting to unclear.',
|
||||
'- Only use unclear when you truly cannot determine the direction after weighing both sides, not merely because overfit risk exists.',
|
||||
].join('\n');
|
||||
case 'targetReference':
|
||||
return [
|
||||
'- This pair is for learnable gap analysis, not raw model worship.',
|
||||
'- Separate transferable prompt-side structure from differences that mainly look like model ceiling or raw reasoning ability.',
|
||||
'- Only use "major" when the reference shows a clear structural advantage that the target could realistically learn from.',
|
||||
'- If your evidence says the reference missed a required action or violated the prompt-side rule while the target followed it, do not still conclude "right-better". Downgrade or flip the verdict so it matches the evidence.',
|
||||
].join('\n');
|
||||
case 'referenceBaseline':
|
||||
return [
|
||||
@@ -2075,6 +1992,7 @@ ${focus}
|
||||
return [
|
||||
'- This pair checks stability across repeated executions with the same target prompt.',
|
||||
'- Treat requirement-preserving variation as acceptable, but mark "unstable" when key boundaries, task structure, or output intent drift across runs.',
|
||||
'- If one run obeys an explicit output-only contract and another adds prose, markdown, code fences, renamed fields, extra keys, or wrapper text, that is instability rather than harmless variation.',
|
||||
'- Do not confuse one lucky output with reliable stability.',
|
||||
].join('\n');
|
||||
default:
|
||||
@@ -2087,13 +2005,15 @@ ${focus}
|
||||
return [
|
||||
'- 这一组决定当前 target 是否真的值得替换上一版本,而不是只看起来更“像优化版”。',
|
||||
'- 如果 left 只是写得更长、语气更强或表面更完整,但任务完成度、边界控制或关键结构更差,不能判成 left-better。',
|
||||
'- 如果更强一侧只是更贴合当前样例,而不是结构上更稳,应降级 verdict,或在 overfitWarnings 中明确指出。',
|
||||
'- 如果 target 在当前样例下确实更有帮助,但收益主要来自样例关键词、一次性规则或特定触发条件,优先先判断 pairSignal=improved 或 flat,再把脆弱性写进 overfitWarnings,不要直接因为有过拟合风险就退成 unclear。',
|
||||
'- 只有在你综合两侧后仍无法判断方向时,才允许写 unclear;“存在过拟合风险”本身不等于“没有方向”。',
|
||||
].join('\n');
|
||||
case 'targetReference':
|
||||
return [
|
||||
'- 这一组是为了找“可学习差距”,不是为了盲目崇拜更强模型。',
|
||||
'- 要区分“可迁移的提示词结构优势”和“纯模型能力上限”造成的差异。',
|
||||
'- 只有当 reference 展示出 target 可以现实学习的清晰结构优势时,才应给出 major。',
|
||||
'- 如果 evidence 已经表明 reference 漏掉了必须动作、没遵守 prompt 规则,而 target 做到了,就不能继续写成 right-better;结论必须和证据一致。',
|
||||
].join('\n');
|
||||
case 'referenceBaseline':
|
||||
return [
|
||||
@@ -2105,6 +2025,7 @@ ${focus}
|
||||
return [
|
||||
'- 这一组用于判断同一个 target prompt 在重复执行下是否稳定。',
|
||||
'- 如果只是措辞波动但仍满足同样边界与任务要求,可视为稳定;如果关键边界、结构或输出意图飘移,应判为 unstable。',
|
||||
'- 如果一次执行严格满足 output-only 约束,而另一次多出解释、Markdown、code fence、字段改名、额外键或包裹文本,这属于不稳定,不是无害波动。',
|
||||
'- 不要把一次走运的输出误判成稳定收益。',
|
||||
].join('\n');
|
||||
default:
|
||||
@@ -2119,85 +2040,220 @@ ${focus}
|
||||
language: ComparePromptLanguage,
|
||||
subject: ComparePromptSubjectConfig
|
||||
): Message[] {
|
||||
const compareJsonContract =
|
||||
language === 'en' ? compareJsonContractEn : compareJsonContractZh;
|
||||
const systemContent =
|
||||
language === 'en'
|
||||
? `# Role: ${subject.roleName}
|
||||
|
||||
## Goal
|
||||
- Synthesize multiple pairwise judge results into one final structured compare evaluation for the editable ${subject.subjectLabel}.
|
||||
|
||||
## Rules
|
||||
1. Target is the only optimization focus.
|
||||
2. Use only the provided pairwise judge results and explicit snapshot-role bindings as evidence. Do not invent raw evidence.
|
||||
3. summary must answer, in order when evidence exists: whether target improved over baseline, whether target still trails the reference, whether the prompt change also works on the reference side, and whether replicas reveal instability.
|
||||
4. improvements must keep only reusable structural guidance. Drop or down-rank sample-specific advice.
|
||||
5. If pairwise evidence conflicts or is weak, prefer conservative conclusions and set stopRecommendation to "review".
|
||||
6. compareStopSignals must be conservative and evidence-grounded.
|
||||
7. Return valid JSON only.
|
||||
|
||||
## Output Contract
|
||||
${compareJsonContract}
|
||||
|
||||
## Initialization
|
||||
You are the structured compare synthesizer. Return valid JSON only.`
|
||||
: `# Role: ${subject.roleName}
|
||||
|
||||
## Goal
|
||||
- 基于多条成对判断结果,为可编辑${subject.subjectLabel}输出最终的 structured compare 评估结果。
|
||||
|
||||
## Rules
|
||||
1. Target 是唯一优化焦点。
|
||||
2. 只能使用提供的 pairwise judge 结果和明确的快照角色绑定,不能重新杜撰原始证据。
|
||||
3. summary 在有证据时必须依次回答:target 相比 baseline 是否进步;target 与 reference 是否仍有差距;prompt 改动在 reference 侧是否也成立;如果存在 replica,稳定性如何。
|
||||
4. improvements 只保留可复用、结构性的改进方向;明显只适配当前样例的建议要剔除或降权。
|
||||
5. 如果多条 pairwise 结果互相冲突或证据偏弱,应采取保守结论,并把 stopRecommendation 设为 review。
|
||||
6. compareStopSignals 必须保守且有证据支撑。
|
||||
7. 只返回合法 JSON。
|
||||
|
||||
## Output Contract
|
||||
${compareJsonContract}
|
||||
|
||||
## Initialization
|
||||
你是结构化对比综合专家,只返回合法 JSON。`;
|
||||
|
||||
const focus = request.focus?.content?.trim() || '';
|
||||
const userContent =
|
||||
language === 'en'
|
||||
? `${this.renderStructuredCompareRoleBindings(
|
||||
normalizedCompare.compareRoleBindings,
|
||||
language
|
||||
)}## Compare Scenario
|
||||
- Shared Compare Inputs: ${normalizedCompare.sharedCompareInputs ? 'yes' : 'no'}
|
||||
- Same Prompt Across Snapshots: ${normalizedCompare.samePromptAcrossSnapshots ? 'yes' : 'no'}
|
||||
- Cross-Model Comparison: ${normalizedCompare.crossModelComparison ? 'yes' : 'no'}
|
||||
const signalSnapshot = this.summarizeStructuredCompareJudgeSignals(judgeResults);
|
||||
const derivedStopSignals = this.deriveCompareStopSignalsFromJudgements(judgeResults);
|
||||
const learnableSignals = this.collectRankedCompareStrings(
|
||||
judgeResults.flatMap((item) => item.learnableSignals),
|
||||
4
|
||||
);
|
||||
const overfitWarnings = this.collectRankedCompareStrings(
|
||||
judgeResults.flatMap((item) => item.overfitWarnings),
|
||||
4
|
||||
);
|
||||
const conflictSignals = this.buildCompareConflictSignals(judgeResults).map((signal) => ({
|
||||
key: signal,
|
||||
description: this.renderCompareConflictSignal(signal, language),
|
||||
}));
|
||||
void this.buildStructuredCompareDebugArtifacts({
|
||||
roleBindings: normalizedCompare.compareRoleBindings,
|
||||
testCases: normalizedCompare.renderedTestCases,
|
||||
snapshots: normalizedCompare.normalizedSnapshots,
|
||||
judgeResults,
|
||||
language,
|
||||
});
|
||||
|
||||
${this.renderStructuredCompareSynthesisHints(judgeResults, language)}${this.renderStructuredCompareJudgeResults(judgeResults, language)}${focus ? `## Focus Brief
|
||||
${focus}
|
||||
return buildStructuredCompareSynthesisMessages({
|
||||
language,
|
||||
payload: {
|
||||
scenario: {
|
||||
language,
|
||||
roleName: subject.roleName,
|
||||
subjectLabel: subject.subjectLabel,
|
||||
sharedCompareInputs: normalizedCompare.sharedCompareInputs,
|
||||
samePromptAcrossSnapshots: normalizedCompare.samePromptAcrossSnapshots,
|
||||
crossModelComparison: normalizedCompare.crossModelComparison,
|
||||
...(focus ? { focusBrief: focus } : {}),
|
||||
},
|
||||
roleBindings: this.toStructuredCompareRoleBindingPayloads(
|
||||
normalizedCompare.compareRoleBindings
|
||||
),
|
||||
deterministicHints: {
|
||||
priorityOrder: [
|
||||
'targetBaseline',
|
||||
'targetReference',
|
||||
'referenceBaseline',
|
||||
'targetReplica',
|
||||
],
|
||||
signalSnapshot: {
|
||||
...(signalSnapshot.progress ? { progress: signalSnapshot.progress } : {}),
|
||||
...(signalSnapshot.gap ? { gap: signalSnapshot.gap } : {}),
|
||||
...(signalSnapshot.promptValidity ? { promptValidity: signalSnapshot.promptValidity } : {}),
|
||||
...(signalSnapshot.stability ? { stability: signalSnapshot.stability } : {}),
|
||||
},
|
||||
...(derivedStopSignals ? { derivedStopSignals } : {}),
|
||||
learnableSignals,
|
||||
overfitWarnings,
|
||||
conflictSignals,
|
||||
},
|
||||
judgeResults: judgeResults.map((item) => ({
|
||||
pairKey: item.pairKey,
|
||||
pairType: item.pairType,
|
||||
pairLabel: item.pairLabel,
|
||||
leftSnapshotId: item.leftSnapshotId,
|
||||
leftSnapshotLabel: item.leftSnapshotLabel,
|
||||
...(item.leftRole ? { leftRole: item.leftRole } : {}),
|
||||
rightSnapshotId: item.rightSnapshotId,
|
||||
rightSnapshotLabel: item.rightSnapshotLabel,
|
||||
...(item.rightRole ? { rightRole: item.rightRole } : {}),
|
||||
verdict: item.verdict,
|
||||
winner: item.winner,
|
||||
confidence: item.confidence,
|
||||
pairSignal: item.pairSignal,
|
||||
analysis: item.analysis,
|
||||
evidence: item.evidence,
|
||||
learnableSignals: item.learnableSignals,
|
||||
overfitWarnings: item.overfitWarnings,
|
||||
})),
|
||||
},
|
||||
});
|
||||
}
|
||||
|
||||
` : ''}---
|
||||
private buildStructuredCompareDebugArtifacts(params: {
|
||||
roleBindings: StructuredCompareRoleBinding[];
|
||||
testCases: NormalizedEvaluationTestCase[];
|
||||
snapshots: NormalizedEvaluationSnapshot[];
|
||||
judgeResults?: StructuredCompareJudgeResult[];
|
||||
language: ComparePromptLanguage;
|
||||
}): {
|
||||
roleBindingsMarkdown: string;
|
||||
testCasesMarkdown: string;
|
||||
snapshotsMarkdown: string[];
|
||||
judgeResultsMarkdown?: string;
|
||||
synthesisHintsMarkdown?: string;
|
||||
} {
|
||||
return {
|
||||
roleBindingsMarkdown: this.renderStructuredCompareRoleBindings(
|
||||
params.roleBindings,
|
||||
params.language
|
||||
),
|
||||
testCasesMarkdown: this.renderStructuredCompareTestCases(
|
||||
params.testCases,
|
||||
params.language
|
||||
),
|
||||
snapshotsMarkdown: params.snapshots.map((snapshot) =>
|
||||
this.renderStructuredCompareSnapshot(snapshot, params.language)
|
||||
),
|
||||
...(params.judgeResults
|
||||
? {
|
||||
judgeResultsMarkdown: this.renderStructuredCompareJudgeResults(
|
||||
params.judgeResults,
|
||||
params.language
|
||||
),
|
||||
synthesisHintsMarkdown: this.renderStructuredCompareSynthesisHints(
|
||||
params.judgeResults,
|
||||
params.language
|
||||
),
|
||||
}
|
||||
: {}),
|
||||
};
|
||||
}
|
||||
|
||||
Synthesize the final structured compare evaluation JSON. Do not re-expand the full raw snapshots.`
|
||||
: `${this.renderStructuredCompareRoleBindings(
|
||||
normalizedCompare.compareRoleBindings,
|
||||
language
|
||||
)}## 对比场景
|
||||
- Shared Compare Inputs:${normalizedCompare.sharedCompareInputs ? 'yes' : 'no'}
|
||||
- Same Prompt Across Snapshots:${normalizedCompare.samePromptAcrossSnapshots ? 'yes' : 'no'}
|
||||
- Cross-Model Comparison:${normalizedCompare.crossModelComparison ? 'yes' : 'no'}
|
||||
private toStructuredCompareRoleBindingPayloads(
|
||||
roleBindings: StructuredCompareRoleBinding[]
|
||||
): Array<{
|
||||
snapshotId: string;
|
||||
snapshotLabel: string;
|
||||
role: string;
|
||||
roleLabel: string;
|
||||
}> {
|
||||
return roleBindings.map((binding) => ({
|
||||
snapshotId: binding.snapshotId,
|
||||
snapshotLabel: binding.snapshotLabel,
|
||||
role: binding.role,
|
||||
roleLabel: binding.roleLabel,
|
||||
}));
|
||||
}
|
||||
|
||||
${this.renderStructuredCompareSynthesisHints(judgeResults, language)}${this.renderStructuredCompareJudgeResults(judgeResults, language)}${focus ? `## Focus Brief
|
||||
${focus}
|
||||
private toStructuredCompareTestCasePayload(
|
||||
testCase: NormalizedEvaluationTestCase
|
||||
): {
|
||||
id: string;
|
||||
label?: string;
|
||||
input: {
|
||||
kind: string;
|
||||
label: string;
|
||||
content: string;
|
||||
summary?: string;
|
||||
};
|
||||
settingsSummary?: string;
|
||||
} {
|
||||
return {
|
||||
id: testCase.id,
|
||||
...(testCase.hasLabel ? { label: testCase.label } : {}),
|
||||
input: {
|
||||
kind: testCase.inputKind,
|
||||
label: testCase.inputLabel,
|
||||
content: testCase.inputContent,
|
||||
...(testCase.hasInputSummary ? { summary: testCase.inputSummary } : {}),
|
||||
},
|
||||
...(testCase.hasSettingsSummary ? { settingsSummary: testCase.settingsSummary } : {}),
|
||||
};
|
||||
}
|
||||
|
||||
` : ''}---
|
||||
|
||||
请综合这些成对判断结果,输出最终 structured compare JSON。不要重新展开原始快照全文。`;
|
||||
|
||||
return [
|
||||
{ role: 'system', content: systemContent },
|
||||
{ role: 'user', content: userContent },
|
||||
];
|
||||
private toStructuredCompareSnapshotPayload(
|
||||
snapshot: NormalizedEvaluationSnapshot
|
||||
): {
|
||||
id: string;
|
||||
label: string;
|
||||
role?: string;
|
||||
roleLabel?: string;
|
||||
testCaseId: string;
|
||||
testCaseLabel?: string;
|
||||
promptRef: {
|
||||
kind: string;
|
||||
label: string;
|
||||
};
|
||||
promptText: string;
|
||||
modelKey?: string;
|
||||
versionLabel?: string;
|
||||
output: string;
|
||||
reasoning?: string;
|
||||
executionInput?: {
|
||||
kind: string;
|
||||
label: string;
|
||||
content: string;
|
||||
summary?: string;
|
||||
};
|
||||
} {
|
||||
return {
|
||||
id: snapshot.id,
|
||||
label: snapshot.label,
|
||||
...(snapshot.hasRole ? { role: snapshot.role, roleLabel: snapshot.roleLabel } : {}),
|
||||
testCaseId: snapshot.testCaseId,
|
||||
...(snapshot.testCaseLabel ? { testCaseLabel: snapshot.testCaseLabel } : {}),
|
||||
promptRef: {
|
||||
kind: snapshot.promptRefKind,
|
||||
label: snapshot.promptRefLabel,
|
||||
},
|
||||
promptText: snapshot.promptText,
|
||||
...(snapshot.hasModelKey ? { modelKey: snapshot.modelKey } : {}),
|
||||
...(snapshot.hasVersionLabel ? { versionLabel: snapshot.versionLabel } : {}),
|
||||
output: snapshot.output,
|
||||
...(snapshot.hasReasoning ? { reasoning: snapshot.reasoning } : {}),
|
||||
...(snapshot.hasExecutionInput
|
||||
? {
|
||||
executionInput: {
|
||||
kind: 'custom',
|
||||
label: snapshot.executionInputLabel,
|
||||
content: snapshot.executionInputContent,
|
||||
...(snapshot.hasExecutionInputSummary
|
||||
? { summary: snapshot.executionInputSummary }
|
||||
: {}),
|
||||
},
|
||||
}
|
||||
: {}),
|
||||
};
|
||||
}
|
||||
|
||||
private renderStructuredCompareSynthesisHints(
|
||||
@@ -2321,28 +2377,33 @@ ${conflictSection}
|
||||
continue;
|
||||
}
|
||||
|
||||
const verdict =
|
||||
payload.verdict === 'left-better' ||
|
||||
payload.verdict === 'right-better' ||
|
||||
payload.verdict === 'mixed' ||
|
||||
payload.verdict === 'similar'
|
||||
? payload.verdict
|
||||
: fallback.verdict;
|
||||
const winner =
|
||||
payload.winner === 'left' || payload.winner === 'right' || payload.winner === 'none'
|
||||
? payload.winner
|
||||
: fallback.winner;
|
||||
const payloadVerdict =
|
||||
typeof payload.verdict === 'string' ? payload.verdict.trim() : undefined;
|
||||
const payloadPairSignal =
|
||||
typeof payload.pairSignal === 'string' ? payload.pairSignal.trim() : undefined;
|
||||
const pairSignal =
|
||||
payloadPairSignal && planItem.allowedSignals.includes(payloadPairSignal)
|
||||
? payloadPairSignal
|
||||
: payloadVerdict && planItem.allowedSignals.includes(payloadVerdict)
|
||||
? payloadVerdict
|
||||
: fallback.pairSignal;
|
||||
const verdict = this.normalizeStructuredCompareJudgeVerdict(
|
||||
payloadVerdict,
|
||||
pairSignal,
|
||||
planItem
|
||||
);
|
||||
const winner = this.normalizeStructuredCompareJudgeWinner(
|
||||
typeof payload.winner === 'string' ? payload.winner.trim() : undefined,
|
||||
verdict,
|
||||
pairSignal,
|
||||
planItem
|
||||
);
|
||||
const confidence =
|
||||
payload.confidence === 'low' ||
|
||||
payload.confidence === 'medium' ||
|
||||
payload.confidence === 'high'
|
||||
? payload.confidence
|
||||
: fallback.confidence;
|
||||
const pairSignal =
|
||||
typeof payload.pairSignal === 'string' &&
|
||||
planItem.allowedSignals.includes(payload.pairSignal)
|
||||
? payload.pairSignal
|
||||
: fallback.pairSignal;
|
||||
|
||||
return {
|
||||
// Pair identity is determined by the judge plan, not by model echo fields.
|
||||
@@ -2390,6 +2451,69 @@ ${conflictSection}
|
||||
return fallback;
|
||||
}
|
||||
|
||||
private normalizeStructuredCompareJudgeVerdict(
|
||||
rawVerdict: string | undefined,
|
||||
pairSignal: string,
|
||||
planItem: StructuredCompareJudgePlanItem
|
||||
): StructuredCompareJudgeResult['verdict'] {
|
||||
if (
|
||||
rawVerdict === 'left-better' ||
|
||||
rawVerdict === 'right-better' ||
|
||||
rawVerdict === 'mixed' ||
|
||||
rawVerdict === 'similar'
|
||||
) {
|
||||
return rawVerdict;
|
||||
}
|
||||
|
||||
switch (planItem.pairType) {
|
||||
case 'targetBaseline':
|
||||
if (pairSignal === 'improved') return 'left-better';
|
||||
if (pairSignal === 'regressed') return 'right-better';
|
||||
if (pairSignal === 'flat') return 'similar';
|
||||
return 'mixed';
|
||||
case 'targetReference':
|
||||
if (pairSignal === 'minor' || pairSignal === 'major') return 'right-better';
|
||||
if (pairSignal === 'none') return 'similar';
|
||||
return 'mixed';
|
||||
case 'referenceBaseline':
|
||||
if (pairSignal === 'supported') return 'left-better';
|
||||
if (pairSignal === 'unsupported') return 'right-better';
|
||||
return 'mixed';
|
||||
case 'targetReplica':
|
||||
if (pairSignal === 'stable') return 'similar';
|
||||
return 'mixed';
|
||||
default:
|
||||
return 'mixed';
|
||||
}
|
||||
}
|
||||
|
||||
private normalizeStructuredCompareJudgeWinner(
|
||||
rawWinner: string | undefined,
|
||||
verdict: StructuredCompareJudgeResult['verdict'],
|
||||
pairSignal: string,
|
||||
planItem: StructuredCompareJudgePlanItem
|
||||
): StructuredCompareJudgeResult['winner'] {
|
||||
if (rawWinner === 'left' || rawWinner === 'right' || rawWinner === 'none') {
|
||||
return rawWinner;
|
||||
}
|
||||
|
||||
if (verdict === 'left-better') {
|
||||
return 'left';
|
||||
}
|
||||
if (verdict === 'right-better') {
|
||||
return 'right';
|
||||
}
|
||||
if (verdict === 'similar') {
|
||||
return 'none';
|
||||
}
|
||||
|
||||
if (planItem.pairType === 'targetReplica' && pairSignal === 'stable') {
|
||||
return 'none';
|
||||
}
|
||||
|
||||
return 'none';
|
||||
}
|
||||
|
||||
private findStructuredCompareJudgePayload(
|
||||
value: unknown
|
||||
): Record<string, unknown> | null {
|
||||
@@ -2438,14 +2562,15 @@ ${conflictSection}
|
||||
const hasCoreConfidence =
|
||||
typeof record.confidence === 'string' &&
|
||||
(record.confidence === 'low' || record.confidence === 'medium' || record.confidence === 'high');
|
||||
const hasCorePairSignal = typeof record.pairSignal === 'string';
|
||||
const hasSupportingField =
|
||||
typeof record.pairSignal === 'string' ||
|
||||
hasCorePairSignal ||
|
||||
typeof record.analysis === 'string' ||
|
||||
Array.isArray(record.evidence) ||
|
||||
Array.isArray(record.learnableSignals) ||
|
||||
Array.isArray(record.overfitWarnings);
|
||||
|
||||
return hasCoreVerdict && hasCoreWinner && hasCoreConfidence && hasSupportingField;
|
||||
return (hasCoreVerdict || hasCorePairSignal) && hasCoreWinner && hasCoreConfidence && hasSupportingField;
|
||||
}
|
||||
|
||||
private renderStructuredCompareRoleBindings(
|
||||
|
||||
@@ -0,0 +1,184 @@
|
||||
import { TemplateProcessor } from '../template/processor';
|
||||
import type { Message } from '../llm/types';
|
||||
import type { CompareStopSignals } from './types';
|
||||
import {
|
||||
compareJsonContractEn,
|
||||
compareJsonContractZh,
|
||||
} from '../template/default-templates/evaluation/builders';
|
||||
import { template as pairJudgeTemplateZh } from '../template/default-templates/evaluation-structured-compare/pair-judge';
|
||||
import { template as pairJudgeTemplateEn } from '../template/default-templates/evaluation-structured-compare/pair-judge_en';
|
||||
import { template as synthesisTemplateZh } from '../template/default-templates/evaluation-structured-compare/synthesis';
|
||||
import { template as synthesisTemplateEn } from '../template/default-templates/evaluation-structured-compare/synthesis_en';
|
||||
|
||||
export type StructuredComparePromptLanguage = 'zh' | 'en';
|
||||
|
||||
const jsonFence = (content: string) => `\`\`\`json
|
||||
${content}
|
||||
\`\`\``;
|
||||
|
||||
const stringifyPayload = (value: unknown): string => JSON.stringify(value, null, 2);
|
||||
|
||||
export interface StructuredCompareRoleBindingPromptPayload {
|
||||
snapshotId: string;
|
||||
snapshotLabel: string;
|
||||
role: string;
|
||||
roleLabel: string;
|
||||
}
|
||||
|
||||
export interface StructuredCompareContentPromptPayload {
|
||||
kind: string;
|
||||
label: string;
|
||||
content: string;
|
||||
summary?: string;
|
||||
}
|
||||
|
||||
export interface StructuredCompareTestCasePromptPayload {
|
||||
id: string;
|
||||
label?: string;
|
||||
input: StructuredCompareContentPromptPayload;
|
||||
settingsSummary?: string;
|
||||
}
|
||||
|
||||
export interface StructuredCompareSnapshotPromptPayload {
|
||||
id: string;
|
||||
label: string;
|
||||
role?: string;
|
||||
roleLabel?: string;
|
||||
testCaseId: string;
|
||||
testCaseLabel?: string;
|
||||
promptRef: {
|
||||
kind: string;
|
||||
label: string;
|
||||
};
|
||||
promptText: string;
|
||||
modelKey?: string;
|
||||
versionLabel?: string;
|
||||
output: string;
|
||||
reasoning?: string;
|
||||
executionInput?: StructuredCompareContentPromptPayload;
|
||||
}
|
||||
|
||||
export interface StructuredComparePairJudgePayload {
|
||||
scenario: {
|
||||
language: StructuredComparePromptLanguage;
|
||||
pairKey: string;
|
||||
pairType: string;
|
||||
pairLabel: string;
|
||||
purpose: string;
|
||||
signalName: string;
|
||||
allowedSignalValues: string[];
|
||||
focusBrief?: string;
|
||||
};
|
||||
roleBindings: StructuredCompareRoleBindingPromptPayload[];
|
||||
testCases: StructuredCompareTestCasePromptPayload[];
|
||||
leftSnapshot: StructuredCompareSnapshotPromptPayload;
|
||||
rightSnapshot: StructuredCompareSnapshotPromptPayload;
|
||||
}
|
||||
|
||||
export interface StructuredCompareSynthesisDeterministicHintsPayload {
|
||||
priorityOrder: string[];
|
||||
signalSnapshot: {
|
||||
progress?: string;
|
||||
gap?: string;
|
||||
promptValidity?: string;
|
||||
stability?: string;
|
||||
};
|
||||
derivedStopSignals?: CompareStopSignals;
|
||||
learnableSignals: string[];
|
||||
overfitWarnings: string[];
|
||||
conflictSignals: Array<{
|
||||
key: string;
|
||||
description: string;
|
||||
}>;
|
||||
}
|
||||
|
||||
export interface StructuredCompareSynthesisPayload {
|
||||
scenario: {
|
||||
language: StructuredComparePromptLanguage;
|
||||
roleName: string;
|
||||
subjectLabel: string;
|
||||
sharedCompareInputs: boolean;
|
||||
samePromptAcrossSnapshots: boolean;
|
||||
crossModelComparison: boolean;
|
||||
focusBrief?: string;
|
||||
};
|
||||
roleBindings: StructuredCompareRoleBindingPromptPayload[];
|
||||
deterministicHints: StructuredCompareSynthesisDeterministicHintsPayload;
|
||||
judgeResults: unknown[];
|
||||
}
|
||||
|
||||
export interface StructuredComparePairJudgePromptParams {
|
||||
language: StructuredComparePromptLanguage;
|
||||
pairGuidance: string;
|
||||
payload: StructuredComparePairJudgePayload;
|
||||
}
|
||||
|
||||
export interface StructuredCompareSynthesisPromptParams {
|
||||
language: StructuredComparePromptLanguage;
|
||||
payload: StructuredCompareSynthesisPayload;
|
||||
}
|
||||
|
||||
const buildPairJudgeJsonContract = (
|
||||
pairKey: string,
|
||||
pairType: string,
|
||||
allowedSignalValues: string[],
|
||||
): string =>
|
||||
jsonFence(`{
|
||||
"pairKey": "${pairKey}",
|
||||
"pairType": "${pairType}",
|
||||
"verdict": "left-better | right-better | mixed | similar",
|
||||
"winner": "left | right | none",
|
||||
"confidence": "low | medium | high",
|
||||
"pairSignal": "${allowedSignalValues.join(' | ')}",
|
||||
"analysis": "<one short paragraph>",
|
||||
"evidence": ["<evidence-grounded difference>"],
|
||||
"learnableSignals": ["<reusable structural signal>"],
|
||||
"overfitWarnings": ["<sample-specific or overfit risk>"]
|
||||
}`);
|
||||
|
||||
export const buildStructuredComparePairJudgePayloadJson = (
|
||||
payload: StructuredComparePairJudgePayload,
|
||||
): string => stringifyPayload(payload);
|
||||
|
||||
export const buildStructuredComparePairJudgeMessages = (
|
||||
params: StructuredComparePairJudgePromptParams,
|
||||
): Message[] => {
|
||||
const template = params.language === 'en' ? pairJudgeTemplateEn : pairJudgeTemplateZh;
|
||||
const messages = TemplateProcessor.processTemplate(template, {
|
||||
pairGuidance: params.pairGuidance,
|
||||
pairJudgeJsonContract: buildPairJudgeJsonContract(
|
||||
params.payload.scenario.pairKey,
|
||||
params.payload.scenario.pairType,
|
||||
params.payload.scenario.allowedSignalValues,
|
||||
),
|
||||
pairJudgePayloadJson: buildStructuredComparePairJudgePayloadJson(params.payload),
|
||||
});
|
||||
|
||||
return messages.map((message) => ({
|
||||
role: message.role,
|
||||
content: message.content,
|
||||
}));
|
||||
};
|
||||
|
||||
export const buildStructuredCompareSynthesisPayloadJson = (
|
||||
payload: StructuredCompareSynthesisPayload,
|
||||
): string => stringifyPayload(payload);
|
||||
|
||||
export const buildStructuredCompareSynthesisMessages = (
|
||||
params: StructuredCompareSynthesisPromptParams,
|
||||
): Message[] => {
|
||||
const template = params.language === 'en' ? synthesisTemplateEn : synthesisTemplateZh;
|
||||
const compareJsonContract =
|
||||
params.language === 'en' ? compareJsonContractEn : compareJsonContractZh;
|
||||
|
||||
const messages = TemplateProcessor.processTemplate(template, {
|
||||
roleName: params.payload.scenario.roleName,
|
||||
compareJsonContract,
|
||||
synthesisPayloadJson: buildStructuredCompareSynthesisPayloadJson(params.payload),
|
||||
});
|
||||
|
||||
return messages.map((message) => ({
|
||||
role: message.role,
|
||||
content: message.content,
|
||||
}));
|
||||
};
|
||||
@@ -0,0 +1,14 @@
|
||||
import { createEvaluationRewriteTemplate } from './builders';
|
||||
|
||||
export const template = createEvaluationRewriteTemplate(
|
||||
{
|
||||
id: 'evaluation-rewrite-basic-system',
|
||||
name: '系统提示词评估后智能改写',
|
||||
description: '基于评估结果重写当前工作区系统提示词',
|
||||
language: 'zh',
|
||||
tags: ['evaluation', 'rewrite', 'basic', 'system'],
|
||||
},
|
||||
{
|
||||
subjectLabel: '系统提示词',
|
||||
}
|
||||
);
|
||||
@@ -0,0 +1,14 @@
|
||||
import { createEvaluationRewriteTemplate } from './builders';
|
||||
|
||||
export const template = createEvaluationRewriteTemplate(
|
||||
{
|
||||
id: 'evaluation-rewrite-basic-system',
|
||||
name: 'Rewrite System Prompt From Evaluation',
|
||||
description: 'Rewrite the current workspace system prompt from evaluation evidence',
|
||||
language: 'en',
|
||||
tags: ['evaluation', 'rewrite', 'basic', 'system'],
|
||||
},
|
||||
{
|
||||
subjectLabel: 'system prompt',
|
||||
}
|
||||
);
|
||||
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user