# 英文口语 · 单轮内容评语（MVP 设计）

> 状态：设计收敛，待实现
> 日期：2026-04-23
> 归属后端：`cococlass-english-speaking-api`
> 前端消费：`PPT` → `src/views/Editor/EnglishSpeaking/` + `enspeak` 原型的 `SentenceCard` 契约
> 关联设计：`doc/EnglishSpeakingIntegration.md`

---

## 1. 背景与问题

现有评分链路覆盖**发音侧**四维：`accuracy / fluency / completeness / prosody`（Azure Speech SDK 的 Pronunciation Assessment）。发音侧回答的是"听起来像不像母语者"，不回答"说了什么是否合适"。

enspeak 原型 `DetailedReport.tsx:55` 的 `SentenceCard` 已经给出 UI 契约：每个学生句除发音四分外，还展示 `feedback.{highlights, corrections, suggestions}`。这部分目前为空，本设计补齐。

MVP 目标：用一次 LLM 调用生成这三段结构化评语，不引入任何新 provider，不改动对话主流程。

---

## 2. 核心约束

1. **评分只在结果页展示**：`content_feedback` 永远**不出现在 `/speak` 的响应里**（无论 SSE 流还是最终帧）。对话进行中前端拿不到任何评分数据，避免分心 + 避免评分阻塞对话节奏。
2. **评分数据的唯一出口是 `/api/speaking/dialogue/report`**：学生/教师打开结果页时按 `sessionId` 拉取，拿到每轮的发音四分 + `content_feedback`。
3. **评估粒度 = 每轮（per-turn）**：每一轮学生发言独立评估、独立存库。结果页按轮次渲染卡片，OverallReport 的 `overallScore` / 聚合数据由**前端从 per-turn 结果计算**（平均分、top-N 亮点），不引入 session 级 LLM 调用。
4. **评估时机 = 每轮 `/speak` 完成后立即后台触发，绝不延后到会话结束批量跑**：
   - `asyncio.create_task(_evaluate_pronunciation)` 在本轮 `/speak` 返回前就挂起，与 AI 回复并发
   - N 轮对话的 N 次评估**摊在对话进行的整个时间窗口内并发完成**，结果页首屏延迟 ≈ 最后一轮的单轮评估延迟，而非 N × 单轮延迟
   - 失败隔离：某一轮 Azure / LLM 失败不影响其他轮次；学生中途放弃对话时已完成轮次的评分仍可查看
5. **评估生成是 fire-and-forget**：`/speak` 立即返回，不等待任何评分完成。每轮内部串行 Azure → Content LLM（Content LLM 需要 Azure 的四分作为输入，故不能并行）。
6. **结果页需容忍 pending**：打开结果页时可能最近几轮的评估还没跑完，前端应轮询 `/report` 或按 `status` 字段显示"评估中"占位。

---

## 3. 范围

**In scope**
- 每轮学生发言结束后，在已有的后台评估任务里多挂一次 LLM 调用，产出 `content_feedback`
- 对 `PronunciationEvaluation` 表加一列 `content_feedback JSON`
- `GET /api/speaking/dialogue/report` 返回值把 `content_feedback` 并入每个 `evaluation`

**Out of scope（后续迭代）**
- Session 级的连贯性/任务达成评语（本期 per-turn only）
- 独立的"内容分"数值（本期只出评语文字，不出 0–100 分）
- LLM judge 的 self-consistency / calibration / 金标对比

---

## 4. 数据契约

### 4.1 输出结构（与 enspeak `SentenceCard.feedback` 对齐）

```jsonc
{
  "highlights": ["..."],                              // 1–2 条，中文，≤30 字
  "corrections": [
    {
      "original":    "I go to park yesterday",       // 英文原句
      "corrected":   "I went to the park yesterday", // 英文改正
      "explanation": "过去式应用 went，park 前加 the" // 中文解释
    }
  ],
  "suggestions": ["..."]                              // 1–2 条，中文，≤30 字
}
```

- 空数组合法（没话可说时不强凑）
- 整个对象允许为 `null`（LLM 失败时的降级）

### 4.2 数据库变更

`pronunciation_evaluation` 表新增一列：

```python
content_feedback: Mapped[Optional[dict]] = mapped_column(JSON, nullable=True)
```

无独立状态字段：
- `status == "completed"` 且 `content_feedback == null` → 内容评语调用失败（LLM 超时/解析失败），不阻塞发音结果
- `status == "failed"` → 发音失败，内容评语跳过不调用

---

## 5. 架构

```
/speak (SSE)
  │
  ├─ 同步：ASR → LLM 对话回复
  │
  └─ 后台 asyncio.create_task(_evaluate_pronunciation)
       │
       ├─ Azure Pronunciation Assessment  →  4 个分数 + word_analysis
       │
       └─ [新增] ContentEvaluator.evaluate()
              · 输入：4 分 + 学生转录 + AI 上一句
              · 输出：{highlights, corrections, suggestions}
              · 失败：log 并置 content_feedback=null，不影响 status
```

新增模块一个：`app/service/speaking/content_evaluator.py`

---

## 6. LLM 调用规格

### 6.1 Provider

复用现有 `LLMProvider`（onehub_llm）。无需新 provider 协议。

### 6.2 模型与参数

- 模型：`gpt-4o-mini` 或等价轻量模型（onehub 内部选）
- `temperature = 0`
- `response_format = { "type": "json_schema", "json_schema": FEEDBACK_SCHEMA }`
  - onehub 不支持 json_schema 时降级为 `{ "type": "json_object" }` + prompt 里加 schema 说明
- 超时：10s，失败即降级 `content_feedback=null`

### 6.3 Prompt（system）

```
You are an English tutor evaluating a student's single spoken turn in an
open dialogue. You receive:
- Azure pronunciation scores (accuracy/fluency/completeness/prosody, 0–100)
- The immediate prior AI turn (context)
- The student's transcript

Produce JSON with three arrays:
- highlights:  1–2 Chinese sentences praising specific strengths. Reference a
               pronunciation dimension if that score is ≥ 85. ≤ 30 chars each.
- corrections: grammar / word-choice fixes. Each item:
                 { original (EN), corrected (EN), explanation (ZH, ≤ 30 chars) }
- suggestions: 1–2 Chinese actionable improvements. Reference a pronunciation
               dimension if that score is < 70. ≤ 30 chars each.

Rules:
- Empty arrays are valid. Do not invent errors to fill quota.
- If the student only said a filler ("yes", "ok", "hmm"), return all empty
  arrays plus one encouragement in highlights.
- Never include the scores as raw numbers in output text; describe
  qualitatively ("发音准确度很高" not "accuracy 92").
```

### 6.4 User payload

```json
{
  "pronunciation": {
    "accuracy": 72, "fluency": 85, "completeness": 90, "prosody": 60
  },
  "ai_said":      "What did you do last weekend?",
  "student_said": "I go to park with my family and play ball."
}
```

历史消息**只给 AI 上一句**（per-turn 评语场景足够；深层连贯性留给后续 session-level 迭代）。

---

## 7. 失败策略

| 场景 | status | content_feedback | 前端表现 |
|---|---|---|---|
| Azure 成功 + LLM 成功 | completed | `{...}` | 完整卡片 |
| Azure 成功 + LLM 超时 | completed | `null` | 只显示发音四分 |
| Azure 成功 + JSON 解析失败 | completed | `null` | 只显示发音四分 |
| Azure 失败 | failed | `null`（不调用） | 卡片置错误态 |

前端 `hasEvaluation = score !== undefined || feedback` 的逻辑原样兼容。

---

## 8. 成本与性能预算

- 每轮：输入 ~500 token，输出 ~200 token
- 按 `gpt-4o-mini` ≈ $0.15 / 1M input、$0.60 / 1M output 估算：**单轮 < $0.0002**
- 延迟：10s 内必须完成或放弃。用户在对话中不等此结果（已是后台 fire-and-forget），延迟只影响"查看结果页"的 pending 窗口大小
- 不做 self-consistency / 多次采样，MVP 接受 ±10 分波动（本期无数字分，只有文字评语，波动容忍度高）

---

## 9. API 契约改动

`GET /api/speaking/dialogue/report` 返回的 `evaluations[i]` 加一个字段：

```jsonc
{
  "round": 1,
  "status": "completed",
  "accuracy_score": 72,
  "fluency_score": 85,
  "completeness_score": 90,
  "prosody_score": 60,
  "content_feedback": {           // ← 新增，可为 null
    "highlights":  [...],
    "corrections": [...],
    "suggestions": [...]
  }
}
```

PPT 前端 `src/types/englishSpeaking.ts` 的 evaluation 类型同步加 `contentFeedback`。

---

## 10. 测试约束

- `tests/service/speaking/test_content_evaluator.py`
  - mock LLM provider，验证 prompt 组装
  - JSON schema 合法 / 非法 / 超时三条分支
  - 空学生转录 / 全 filler 转录 → 走 "all empty + encouragement" 分支
- `tests/service/speaking/test_dialogue_service.py`（扩充）
  - Azure 成功 + content 成功 → 两个字段都写入
  - Azure 成功 + content 失败 → status completed，content_feedback null
  - Azure 失败 → content 不被调用
- 不覆盖真实 LLM 调用（集成测试在后续阶段）

---

## 11. 实施步骤（给实现计划用的骨架）

1. DB 迁移：`pronunciation_evaluation.content_feedback JSON nullable`
2. 新建 `app/service/speaking/content_evaluator.py`，实现 `evaluate(transcript, prior_ai_turn, pron_scores) -> dict | None`
3. 在 `dialogue_service._evaluate_pronunciation` 的 Azure 调用成功分支后追加 content 评估
4. `GET /report` 返回值补 `content_feedback`
5. PPT 前端 `EnglishSpeaking/` 下报告渲染补 `feedback` 展示（enspeak `SentenceCard` 三段 UI 直接移植）
6. Unit tests