English Speaking · Per-Turn Content Feedback (MVP Design)

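Status: design converged, pending implementation
Date: 2026-04-23
Backend owner: cococlass-english-speaking-api
Frontend consumer: PPT → src/views/Editor/EnglishSpeaking/ + the SentenceCard contract from the enspeak prototype
Related design: doc/EnglishSpeakingIntegration.md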

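1. Background and problem

The existing scoring pipeline covers four pronunciation dimensions: accuracy / fluency / completeness / prosody (Pronunciation Assessment in the Azure Speech SDK). The pronunciation side answers "does it sound like a native speaker?"; it does not answer "was what the student said appropriate?".

The SentenceCard in the enspeak prototype (DetailedReport.tsx:55) already defines the UI contract: besides the four pronunciation scores, each student sentence also shows feedback.{highlights, corrections, suggestions}. That part is currently empty; this design fills it in.

MVP goal: generate these three structured feedback sections with a single LLM call, without introducing any new provider and without changing the main dialogue flow.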

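2. Core constraints

Scores are shown only on the results page: content_feedback never appears in a /speak response (neither in the SSE stream nor in the final frame). During the dialogue the frontend receives no scoring data, which avoids distracting the student and keeps scoring from blocking the pace of the conversation.
The only exit for scoring data is /api/speaking/dialogue/report: when a student or teacher opens the results page, it fetches by sessionId and receives each turn's four pronunciation scores plus content_feedback.
Evaluation granularity = per turn: each student turn is evaluated and stored independently. The results page renders one card per turn; OverallReport's overallScore and other aggregates are computed by the frontend from the per-turn results (averages, top-N highlights), with no session-level LLM call.
Evaluation timing = triggered in the background immediately for each /speak turn, never deferred to a batch run at the end of the session:
- asyncio.create_task(_evaluate_pronunciation(...)) is scheduled before the turn's /speak returns and runs concurrently with the AI reply
- The N evaluations of an N-turn dialogue complete concurrently across the whole time window of the dialogue, so the first-screen latency of the results page ≈ the single-turn evaluation latency of the last turn, not N × the single-turn latency
- Failure isolation: an Azure / LLM failure in one turn does not affect other turns; if the student abandons the dialogue midway, scores for the completed turns remain viewable
Evaluation generation is fire-and-forget: /speak returns immediately and does not wait for any scoring to finish. Within a turn the steps run serially, Azure → Content LLM (the Content LLM needs Azure's four scores as input, so the two cannot run in parallel).
The results page must tolerate pending: when it is opened, the evaluations of the most recent turns may not have finished, so the frontend should poll /report or show an "evaluating" placeholder based on the status field.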
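
The scheduling constraints above can be sketched as follows. This is a minimal illustration, not the service's actual code: the in-memory results dict, the _background_tasks registry, schedule_evaluation, and the simulated failure in round 2 are all assumptions made for the example. Only asyncio.create_task and the _evaluate_pronunciation name come from this design.

```python
import asyncio

# Hypothetical registry: holding strong references keeps pending tasks
# from being garbage-collected before they finish.
_background_tasks: set = set()


async def _evaluate_pronunciation(session_id: str, round_no: int, results: dict) -> None:
    """Stand-in for the real per-turn evaluation (Azure, then content LLM)."""
    try:
        await asyncio.sleep(0.01)  # placeholder for the slow scoring calls
        if round_no == 2:
            raise RuntimeError("simulated Azure failure")
        results[round_no] = "completed"
    except Exception:
        # Failure isolation: a bad turn only marks itself as failed.
        results[round_no] = "failed"


def schedule_evaluation(session_id: str, round_no: int, results: dict) -> asyncio.Task:
    """Launch the evaluation without awaiting it (fire-and-forget)."""
    results[round_no] = "pending"
    task = asyncio.create_task(_evaluate_pronunciation(session_id, round_no, results))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task


async def demo() -> tuple:
    results: dict = {}
    # Each /speak turn schedules its own task and returns right away.
    tasks = [schedule_evaluation("s1", round_no, results) for round_no in (1, 2, 3)]
    snapshot = dict(results)  # what /report would see if opened immediately
    await asyncio.gather(*tasks)  # only the demo waits; /speak never does
    return snapshot, results
```

Running asyncio.run(demo()) shows every round as pending in the immediate snapshot, and afterwards rounds 1 and 3 completed with round 2 failed on its own, which is the behaviour the results page has to tolerate.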

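3. Scope

In scope

After each student turn ends, attach one more LLM call to the existing background evaluation task to produce content_feedback
Add a content_feedback JSON column to the PronunciationEvaluation table
GET /api/speaking/dialogue/report merges content_feedback into each evaluation in its return value

Out of scope (later iterations)

Session-level coherence / task-completion feedback (per-turn only in this phase)
A standalone numeric "content score" (this phase produces only textual feedback, not a 0–100 score)
Self-consistency / calibration / gold-standard comparison for the LLM judge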

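4. Data contract

4.1 Output structure (aligned with enspeak `SentenceCard.feedback`)

{
  "highlights": ["..."],                              // 1–2 items, Chinese, ≤30 characters each
  "corrections": [
    {
      "original":    "I go to park yesterday",       // original English sentence
      "corrected":   "I went to the park yesterday", // corrected English
      "explanation": "过去式应用 went，park 前加 the" // explanation in Chinese
    }
  ],
  "suggestions": ["..."]                              // 1–2 items, Chinese, ≤30 characters each
}

Empty arrays are valid (do not force content when there is nothing to say)
The whole object may be null (the degraded value when the LLM fails)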
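
A validator for this contract might look like the sketch below. The function name validate_feedback and the choice to reject (rather than truncate) over-long or over-numerous items are assumptions; the design only fixes the shape and the limits.

```python
from typing import Any, Optional

MAX_ITEMS = 2   # highlights / suggestions: at most 2 entries
MAX_CHARS = 30  # highlights / suggestions: at most 30 characters each


def _is_short_text(value: Any) -> bool:
    return isinstance(value, str) and 0 < len(value) <= MAX_CHARS


def validate_feedback(obj: Any) -> Optional[dict]:
    """Return obj if it matches the contract, else None (the degraded value)."""
    if not isinstance(obj, dict):
        return None
    if set(obj) != {"highlights", "corrections", "suggestions"}:
        return None
    for key in ("highlights", "suggestions"):
        items = obj[key]
        # Empty lists pass: there is no lower bound to "fill".
        if not isinstance(items, list) or len(items) > MAX_ITEMS:
            return None
        if not all(_is_short_text(item) for item in items):
            return None
    if not isinstance(obj["corrections"], list):
        return None
    for fix in obj["corrections"]:
        if not isinstance(fix, dict):
            return None
        if set(fix) != {"original", "corrected", "explanation"}:
            return None
        if not all(isinstance(fix[k], str) and fix[k] for k in fix):
            return None
    return obj
```

Returning None on any mismatch lines up with the "whole object may be null" rule, so a malformed LLM reply and a timed-out call end up in the same degraded state.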

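4.2 Database change

The pronunciation_evaluation table gains one column:

content_feedback: Mapped[Optional[dict]] = mapped_column(JSON, nullable=True)

There is no separate status field for content feedback; its state is inferred:

status == "completed" and content_feedback == null → the content feedback call failed (LLM timeout / parse failure); this does not block the pronunciation results
status == "failed" → pronunciation failed; content feedback is skipped and the LLM is not called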

5. Architecture

/speak (SSE)
  │
  ├─ Synchronous: ASR → LLM dialogue reply
  │
  └─ Background: asyncio.create_task(_evaluate_pronunciation(...))
       │
       ├─ Azure Pronunciation Assessment  →  4 scores + word_analysis
       │
       └─ [New] ContentEvaluator.evaluate()
              · Input: 4 scores + student transcript + previous AI turn
              · Output: {highlights, corrections, suggestions}
              · On failure: log and set content_feedback=null; status is unaffected

One new module: app/service/speaking/content_evaluator.py
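
The inside of the background task could be laid out roughly like this. EvaluationRow is a hypothetical in-memory stand-in for a pronunciation_evaluation row, and the two callables stand in for the Azure call and ContentEvaluator.evaluate(); none of these signatures are confirmed by the codebase.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class EvaluationRow:
    """Hypothetical stand-in for one pronunciation_evaluation row."""
    status: str = "pending"
    scores: Optional[dict] = None
    content_feedback: Optional[dict] = None


async def evaluate_turn(
    row: EvaluationRow,
    assess: Callable[[], Awaitable[dict]],                          # Azure step
    evaluate_content: Callable[[dict], Awaitable[Optional[dict]]],  # LLM step
) -> EvaluationRow:
    # Step 1: Azure. The content step needs its scores, so this runs first.
    try:
        row.scores = await assess()
    except Exception:
        row.status = "failed"  # pronunciation failed: the LLM is never called
        return row
    # Step 2: content feedback. A failure here must not change status.
    try:
        row.content_feedback = await evaluate_content(row.scores)
    except Exception:
        row.content_feedback = None  # real code would also log the error
    row.status = "completed"
    return row
```

Keeping the two try blocks separate is what makes the two failure modes distinguishable later: a failed row means no scores at all, while a completed row with null feedback still has its four pronunciation scores.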

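6. LLM call specification

6.1 Provider

Reuse the existing LLMProvider (onehub_llm). No new provider protocol is needed.

6.2 Model and parameters

Model: gpt-4o-mini or an equivalent lightweight model (selected inside onehub)
temperature = 0
response_format = { "type": "json_schema", "json_schema": FEEDBACK_SCHEMA }
- If onehub does not support json_schema, fall back to { "type": "json_object" } and describe the schema in the prompt
Timeout: 10s; on failure, degrade to content_feedback=null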
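
The fallback and the timeout can be sketched as below. The LLMProvider interface is not specified in this document, so the call is modelled as an opaque coroutine factory returning the raw reply text, and supports_json_schema is an assumed capability flag; FEEDBACK_SCHEMA is passed in because its contents are not defined here.

```python
import asyncio
import json
from typing import Awaitable, Callable, Optional

LLM_TIMEOUT_S = 10.0  # budget from this section


def build_response_format(supports_json_schema: bool, schema: dict) -> dict:
    """Prefer json_schema; otherwise use json_object (schema goes in the prompt)."""
    if supports_json_schema:
        return {"type": "json_schema", "json_schema": schema}
    return {"type": "json_object"}


async def call_with_deadline(
    call: Callable[[], Awaitable[str]],
    timeout_s: float = LLM_TIMEOUT_S,
) -> Optional[dict]:
    """Run one LLM call; any failure degrades to None (content_feedback=null)."""
    try:
        raw = await asyncio.wait_for(call(), timeout=timeout_s)
        parsed = json.loads(raw)
    except Exception:
        # Covers timeout, provider errors, and unparseable JSON alike.
        return None
    return parsed if isinstance(parsed, dict) else None
```

asyncio.wait_for cancels the inner call when the deadline passes, so a slow provider does not keep the background task alive beyond the 10s budget.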

6.3 Prompt (system)

You are an English tutor evaluating a student's single spoken turn in an
open dialogue. You receive:
- Azure pronunciation scores (accuracy/fluency/completeness/prosody, 0–100)
- The immediate prior AI turn (context)
- The student's transcript

Produce JSON with three arrays:
- highlights:  1–2 Chinese sentences praising specific strengths. Reference a
               pronunciation dimension if that score is ≥ 85. ≤ 30 chars each.
- corrections: grammar / word-choice fixes. Each item:
                 { original (EN), corrected (EN), explanation (ZH, ≤ 30 chars) }
- suggestions: 1–2 Chinese actionable improvements. Reference a pronunciation
               dimension if that score is < 70. ≤ 30 chars each.

Rules:
- Empty arrays are valid. Do not invent errors to fill quota.
- If the student only said a filler ("yes", "ok", "hmm"), return empty
  corrections and suggestions plus one encouragement in highlights.
- Never include the scores as raw numbers in output text; describe
  qualitatively ("发音准确度很高" not "accuracy 92").

6.4 User payload

{
  "pronunciation": {
    "accuracy": 72, "fluency": 85, "completeness": 90, "prosody": 60
  },
  "ai_said":      "What did you do last weekend?",
  "student_said": "I go to park with my family and play ball."
}

The only history passed in is the previous AI turn (sufficient for per-turn feedback; deeper coherence is left to a later session-level iteration).
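
Assembling this payload is small enough to show in full. The helper name build_user_payload is hypothetical; the argument names mirror the evaluate(transcript, prior_ai_turn, pron_scores) signature used elsewhere in this design.

```python
import json

DIMENSIONS = ("accuracy", "fluency", "completeness", "prosody")


def build_user_payload(pron_scores: dict, prior_ai_turn: str, transcript: str) -> str:
    """Serialize the user message: four scores, the prior AI turn, the transcript."""
    payload = {
        # Copy only the four known dimensions; extra Azure fields are dropped.
        "pronunciation": {dim: pron_scores[dim] for dim in DIMENSIONS},
        "ai_said": prior_ai_turn,
        "student_said": transcript,
    }
    # ensure_ascii=False keeps any non-ASCII text readable in logs.
    return json.dumps(payload, ensure_ascii=False)
```

Picking the four dimensions explicitly means a missing score raises a KeyError early, which the caller would treat like any other content-evaluation failure.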

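7. Failure strategy

| Scenario | status | content_feedback | Frontend behavior |
| --- | --- | --- | --- |
| Azure succeeds + LLM succeeds | completed | `{...}` | Full card |
| Azure succeeds + LLM times out | completed | `null` | Shows only the four pronunciation scores |
| Azure succeeds + JSON parsing fails | completed | `null` | Shows only the four pronunciation scores |
| Azure fails | failed | `null` (LLM not called) | Card shows an error state |

The frontend logic hasEvaluation = score !== undefined || feedback remains compatible as is.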
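
The mapping from stored fields to card state can be written as one small function. The state names ("full", "scores_only", "error", "pending") are invented for this sketch, and treating every other status value as pending is an assumption.

```python
from typing import Optional


def card_state(status: str, content_feedback: Optional[dict]) -> str:
    """Derive the card's display state from (status, content_feedback)."""
    if status == "failed":
        return "error"        # Azure failed; the LLM was never called
    if status != "completed":
        return "pending"      # evaluation still running in the background
    if content_feedback is None:
        return "scores_only"  # LLM timeout or JSON parse failure
    return "full"
```

Note that an object with three empty arrays is still "full": empty feedback is a valid result, and only null signals a degraded call.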

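8. Cost and performance budget

Per turn: ~500 input tokens, ~200 output tokens
Estimated at gpt-4o-mini ≈ $0.15 / 1M input and $0.60 / 1M output: < $0.0002 per turn
Latency: must complete within 10s or be abandoned. The user never waits for this result during the dialogue (it is already background fire-and-forget); latency only affects the size of the pending window when viewing the results page
No self-consistency / multi-sampling. The MVP accepts run-to-run variation: the nominal tolerance is ±10 points, but this phase has no numeric score, only textual feedback, so the tolerance for variation is high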
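
The per-turn figure follows directly from the token estimates and the quoted prices; the short calculation below just spells out the arithmetic using the numbers from this section.

```python
INPUT_TOKENS = 500        # estimated input tokens per turn
OUTPUT_TOKENS = 200       # estimated output tokens per turn
USD_PER_M_INPUT = 0.15    # quoted price per 1M input tokens
USD_PER_M_OUTPUT = 0.60   # quoted price per 1M output tokens


def cost_per_turn(input_tokens: int = INPUT_TOKENS, output_tokens: int = OUTPUT_TOKENS) -> float:
    """Estimated USD cost of one content-feedback call."""
    input_cost = input_tokens * USD_PER_M_INPUT / 1_000_000     # 500 * 0.15 / 1e6 = 0.000075
    output_cost = output_tokens * USD_PER_M_OUTPUT / 1_000_000  # 200 * 0.60 / 1e6 = 0.000120
    return input_cost + output_cost
```

The sum is about $0.000195 per turn, just under the $0.0002 bound, so a 10-turn session would cost roughly $0.002 in content feedback.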

9. API contract changes

Each evaluations[i] returned by GET /api/speaking/dialogue/report gains one field:

{
  "round": 1,
  "status": "completed",
  "accuracy_score": 72,
  "fluency_score": 85,
  "completeness_score": 90,
  "prosody_score": 60,
  "content_feedback": {           // ← new, may be null
    "highlights":  [...],
    "corrections": [...],
    "suggestions": [...]
  }
}

The evaluation type in the PPT frontend's src/types/englishSpeaking.ts adds contentFeedback accordingly.
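
On the backend side, producing one evaluations[i] entry might look like the sketch below. serialize_evaluation and its arguments are hypothetical; the output keys are the ones listed in the example response above.

```python
from typing import Optional


def serialize_evaluation(
    round_no: int,
    status: str,
    scores: Optional[dict],
    content_feedback: Optional[dict],
) -> dict:
    """Build one evaluations[i] entry for the /report response."""
    scores = scores or {}  # a failed or pending row has no scores yet
    return {
        "round": round_no,
        "status": status,
        "accuracy_score": scores.get("accuracy"),
        "fluency_score": scores.get("fluency"),
        "completeness_score": scores.get("completeness"),
        "prosody_score": scores.get("prosody"),
        # Always present in the payload; None serializes to JSON null.
        "content_feedback": content_feedback,
    }
```

Emitting content_feedback as an explicit null, rather than omitting the key, gives the frontend a stable shape to type against.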

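10. Testing constraints

tests/service/speaking/test_content_evaluator.py
- Mock the LLM provider and verify prompt assembly
- Three branches: valid JSON schema / invalid JSON schema / timeout
- Empty student transcript / all-filler transcript → takes the filler branch (empty corrections / suggestions + one encouragement in highlights)
tests/service/speaking/test_dialogue_service.py (extended)
- Azure succeeds + content succeeds → both fields are written
- Azure succeeds + content fails → status completed, content_feedback null
- Azure fails → content is not called
Real LLM calls are not covered (integration tests come in a later phase)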

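11. Implementation steps (skeleton for the implementation plan)

DB migration: pronunciation_evaluation.content_feedback JSON nullable
Create app/service/speaking/content_evaluator.py, implementing evaluate(transcript, prior_ai_turn, pron_scores) -> dict | None
In dialogue_service._evaluate_pronunciation, append the content evaluation after the success branch of the Azure call
Add content_feedback to the GET /report return value
In the PPT frontend, add feedback rendering to the report under EnglishSpeaking/ (port the three-section SentenceCard UI from enspeak directly)
Unit tests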
