
docs: design unified audio player with Azure TTS

Design doc for refactoring DialogueChatView audio: extract a
view-level useAudioPlayer composable as the single owner of
playback, replace browser TTS with Azure Speech REST, and
surface synthesis/playback errors per-message with one-click
retry. Engine becomes pure dialogue state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jimmylee, 2 weeks ago
parent
commit c13aebc278

+ 426 - 0
docs/superpowers/specs/2026-04-26-azure-tts-audio-player-design.md

@@ -0,0 +1,426 @@
+# Azure TTS + Unified Audio Player Design
+
+**Date:** 2026-04-26
+**Scope:** `src/views/Editor/EnglishSpeaking/preview/DialogueChatView.vue` and its composables/services
+**Status:** Approved for implementation planning
+
+## Context
+
+`DialogueChatView.vue` currently has three audio playback triggers:
+
+1. **Auto-play AI replies** — `useDialogueEngine.speakTTS()` runs after every AI `done` event (greeting, `sendStudentMessage`, WebSocket stream, `regenerateAiMessage`), using the browser-native `SpeechSynthesisUtterance` API.
+2. **Click-replay AI** — `togglePlay()` in the view creates a separate `SpeechSynthesisUtterance` for the same message.
+3. **Click-replay student** — `togglePlay()` plays the message's stored `audioBlob` via `HTMLAudioElement`.
+
+The three triggers do not coordinate. Two owners (engine + view) both call `speechSynthesis.cancel()`, the click-replay path does not know about ongoing auto-play, and starting a new recording does not interrupt currently-playing audio. As a result:
+
+- Clicking the play button on a message that is currently being auto-played does not toggle to "stop" — the visual play/pause state is wrong because `playingMessageId` (view) is unaware of `ttsUtterance` (engine).
+- Starting a recording while AI audio plays causes the AI voice to leak into the microphone.
+- Student-recording playback can overlap with auto-TTS for a freshly arrived AI message.
+
+We are also switching the AI voice from browser-native TTS to **Azure Speech REST** for higher-quality output. This is the right moment to refactor.
+
+## Goals
+
+1. Introduce a single, view-level audio playback owner (`useAudioPlayer` composable).
+2. Replace browser TTS with Azure Speech REST synthesis for AI messages.
+3. Auto-synthesize and auto-play each AI message after streaming completes.
+4. Support click-replay for both AI messages (cached synthesis) and student messages (existing blob).
+5. Enforce three rules **structurally** (i.e., not by discipline):
+   - Single playback channel (new playback interrupts any prior one).
+   - Recording start interrupts current playback.
+   - Click on the currently-playing message stops it.
+6. Surface synthesis / playback failures in a unified per-message error state with one-click retry.
+7. Decouple `useDialogueEngine` from audio entirely — the engine becomes pure dialogue state.
+
+## Non-goals
+
+- **Streaming synthesis** (synthesizing token-by-token while the model streams). Out of scope; full-text synthesis after `done` is acceptable.
+- **Multiple voices / voice configuration UI**. Hard-code `en-US-AriaNeural` for now.
+- **Cross-view audio coordination** (e.g., report screen also plays audio). The player is view-level; if the report screen later needs playback it can instantiate its own.
+- **Backend Azure token endpoint** (planned but deferred — see Security Debt below).
+- **SDK-based synthesis**. Use REST only; the `microsoft-cognitiveservices-speech-sdk` package is not introduced.
+
+## Non-trivial decisions
+
+### D1. Player ownership: view-level, not module singleton
+
+The player is created inside `DialogueChatView.setup()` via `useAudioPlayer()` and torn down on view unmount. Reasoning: today all three triggers live in this one view; a module singleton would add a global lifecycle hazard (ensure `stop()` on every navigation) without solving any current problem. If a future view needs playback, it will instantiate its own player; cross-view coordination is a separate, future problem.
+
+### D2. Auto-play trigger lives in the view, not the engine
+
+The engine no longer touches `speechSynthesis`. Instead, the view runs a `watch()` on `engine.messages` and triggers `player.play(...)` when an AI message transitions to `status === 'done'`. Reasoning: "auto-play after a message completes" is a presentation concern. Keeping it in the view means engine has zero audio dependency, and any future toggle ("disable auto-play") is a view-only change.
+
+The watcher keeps a `Set<string>` of already-auto-played message IDs so that a message is not replayed when the watcher fires again for unrelated message updates.
+
+### D3. Synthesis is one round-trip per message, cached by message ID
+
+Each AI message text → one Azure REST call → MP3 blob → cached in a `Map<messageId, Blob>` inside the player. Replays hit the cache (no second call). Cache lives for the player's lifetime; on unmount, all cached object URLs are revoked and the map is cleared. Student-recording blobs are **not** added to this cache — the message itself owns the blob.
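+
+A minimal sketch of the bookkeeping this implies inside `useAudioPlayer()` (the `ttsCache` and `objectUrls` names are illustrative, not part of the public API):
+
+```ts
+// Illustrative internals: one entry per message id, released when the view unmounts.
+const ttsCache = new Map<string, Blob>()      // messageId → synthesized MP3 blob
+const objectUrls = new Map<string, string>()  // messageId → object URL handed to <audio>
+
+onUnmounted(() => {
+  stop()                                       // abort synthesis, pause any current audio
+  for (const url of objectUrls.values()) URL.revokeObjectURL(url)
+  objectUrls.clear()
+  ttsCache.clear()
+})
+```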
+
+### D4. Errors are surfaced per message, not globally
+
+The player exposes `errorId: Ref<string | null>`. The play button on the affected message renders an error variant (warning icon + "点击重试" text). Clicking it retries by calling `player.play(id, source)` again. Reasoning:
+
+- The failure scope is one message's playback, not a system state.
+- Locating the error at the play button keeps retry intuitive — the same affordance that "starts" playback also "retries".
+- Avoids introducing a new global toast/banner component.
+
+All failure paths (Azure network/5xx, `audio.play()` rejection from autoplay policy, decoder errors) collapse to the same UI: warning icon + "播放失败,点击重试". We do not differentiate causes; the user action is identical.
+
+### D5. WebSocket stream path: attach `audioBlob` to student message
+
+The `beginStudentStream` path (`useDialogueEngine.ts`) does not currently attach the recorded blob to the student message it pushes. As a result, the student's "replay recording" button is silent in WS mode. We fix this in the same change: `handleFinishRecording` will attach the blob to the in-flight student message via a new `engine.attachStudentBlob(studentMsgId, blob)` helper. Without this fix, student replay is broken in production (WS is the default path).
+
+### D6. Three player ref states drive button rendering
+
+The play button has four visual states, derived from three player refs:
+
+| State    | Condition                                         | Render                       |
+|----------|---------------------------------------------------|------------------------------|
+| idle     | none of the below                                 | ▶ play icon                  |
+| loading  | `player.loadingId.value === message.id`           | spinner                      |
+| playing  | `player.playingId.value === message.id`           | ⏸ pause icon                 |
+| error    | `player.errorId.value === message.id`             | ⚠ warning icon, red border, "点击重试" |
+
+`loadingId` is non-null between `play()` invocation and either the audio's `onplaying` event (success path) or the catch block (failure path). It is needed because the synthesis round-trip is observable (~500ms-2s) and the user must see something happen.
+
+## Architecture
+
+### File layout
+
+| File                                                              | Operation | Responsibility                                                          |
+|-------------------------------------------------------------------|-----------|-------------------------------------------------------------------------|
+| `src/views/Editor/EnglishSpeaking/composables/useAudioPlayer.ts`  | **NEW**   | Sole audio owner. Single-channel rule, TTS cache, error surface.        |
+| `src/views/Editor/EnglishSpeaking/services/speechService.ts`      | **NEW**   | Azure Speech REST synthesis. Stateless `synthesize(text, signal)`.      |
+| `src/views/Editor/EnglishSpeaking/composables/useDialogueEngine.ts` | **EDIT**  | Remove `speakTTS`, `cancelTTS`, `ttsUtterance`. Add `attachStudentBlob`. |
+| `src/views/Editor/EnglishSpeaking/preview/DialogueChatView.vue`   | **EDIT**  | Use `useAudioPlayer`. Add auto-play watcher. Wire stop into recording start. Render new button states. |
+| `.env.example`                                                    | **NEW or APPEND** | Document `VITE_AZURE_SPEECH_KEY` and `VITE_AZURE_SPEECH_REGION`. |
+
+### `useAudioPlayer` API
+
+```ts
+function useAudioPlayer(): {
+  playingId: Readonly<Ref<string | null>>
+  loadingId: Readonly<Ref<string | null>>
+  errorId:   Readonly<Ref<string | null>>
+
+  play(id: string, source: PlaySource): Promise<void>
+  stop(): void
+}
+
+type PlaySource =
+  | { kind: 'tts';  text: string }
+  | { kind: 'blob'; blob: Blob }
+```
+
+**Contract:**
+
+- `play(id, source)` is the single playback entry point.
+  - Clears any prior `errorId` (a fresh attempt — error is stale).
+  - Calls internal `stop()` to interrupt the current playback.
+  - Sets `loadingId = id`.
+  - For `kind: 'tts'`: hits cache or calls `synthesize(text)`.
+  - Constructs `Audio(URL.createObjectURL(blob))`.
+  - On `audio.onplaying`: `loadingId = null; playingId = id`.
+  - On `audio.onended`: `playingId = null` (if still us). No error.
+  - On `audio.onerror` (mid-play decoder failure), `audio.play()` rejection, synthesis throw: `loadingId = null; playingId = null; errorId = id`.
+- `stop()` aborts pending synthesis (`AbortController.abort()`), pauses current `HTMLAudioElement`, clears `playingId` and `loadingId`. Does **not** clear `errorId` (errors are sticky until a new `play()` for that id, or the user navigates away).
+- `onUnmounted`: `stop()`, revoke all cached URLs, clear cache map.
+
+### Single-channel rule (structural enforcement)
+
+The composable holds at most one of each:
+- one `currentAudio: HTMLAudioElement | null`
+- one `synthAbort: AbortController | null`
+- one `playingId` value
+
+`play()` always begins by calling `stop()`, so by construction there can never be two active audio elements or two in-flight syntheses simultaneously. The view does not need to "remember to cancel" anything; the rule is impossible to violate from outside the composable.
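+
+A condensed sketch of how `stop()` and `play()` could realize the contract and this rule (it assumes the `ttsCache` / `objectUrls` maps from the D3 sketch and the three refs from the API above; other internal names are illustrative, not a final implementation):
+
+```ts
+// Sketch only. playingId / loadingId / errorId are the refs exposed above;
+// ttsCache / objectUrls are the maps from the D3 sketch.
+let currentAudio: HTMLAudioElement | null = null
+let synthAbort: AbortController | null = null
+
+function stop() {
+  synthAbort?.abort()                // cancel an in-flight synthesis, if any
+  synthAbort = null
+  currentAudio?.pause()              // pause current playback, if any
+  currentAudio = null
+  playingId.value = null
+  loadingId.value = null
+  // errorId stays sticky until the next play() for that id
+}
+
+async function play(id: string, source: PlaySource): Promise<void> {
+  errorId.value = null               // a fresh attempt invalidates any stale error
+  stop()                             // single-channel rule: at most one playback / synthesis
+  loadingId.value = id
+  try {
+    let blob: Blob
+    if (source.kind === 'blob') {
+      blob = source.blob
+    } else {
+      const cached = ttsCache.get(id)
+      if (cached) {
+        blob = cached
+      } else {
+        const ctrl = new AbortController()
+        synthAbort = ctrl
+        blob = await synthesize(source.text, ctrl.signal)
+        if (synthAbort === ctrl) synthAbort = null
+        ttsCache.set(id, blob)
+      }
+    }
+    const url = objectUrls.get(id) ?? URL.createObjectURL(blob)
+    objectUrls.set(id, url)
+    const audio = new Audio(url)
+    currentAudio = audio
+    audio.onplaying = () => { if (currentAudio === audio) { loadingId.value = null; playingId.value = id } }
+    audio.onended = () => { if (currentAudio === audio) playingId.value = null }
+    audio.onerror = () => { if (currentAudio === audio) { loadingId.value = null; playingId.value = null; errorId.value = id } }
+    await audio.play()
+  } catch (err) {
+    // An abort (recording started, or a newer play() took over) is an interruption, not a failure.
+    if ((err as DOMException)?.name === 'AbortError') return
+    if (loadingId.value === id || playingId.value === id) {
+      loadingId.value = null
+      playingId.value = null
+      errorId.value = id
+    }
+  }
+}
+```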
+
+### `speechService.ts`
+
+```ts
+const KEY = import.meta.env.VITE_AZURE_SPEECH_KEY as string
+const REGION = import.meta.env.VITE_AZURE_SPEECH_REGION as string
+const VOICE = 'en-US-AriaNeural'
+const FORMAT = 'audio-24khz-48kbitrate-mono-mp3'
+
+export async function synthesize(text: string, signal?: AbortSignal): Promise<Blob> {
+  if (!KEY || !REGION) throw new Error('Azure Speech credentials not configured')
+
+  const ssml =
+    `<speak version='1.0' xml:lang='en-US'>` +
+    `<voice name='${VOICE}'>${escapeXml(text)}</voice>` +
+    `</speak>`
+
+  const res = await fetch(
+    `https://${REGION}.tts.speech.microsoft.com/cognitiveservices/v1`,
+    {
+      method: 'POST',
+      signal,
+      headers: {
+        'Ocp-Apim-Subscription-Key': KEY,
+        'Content-Type': 'application/ssml+xml',
+        'X-Microsoft-OutputFormat': FORMAT,
+        'User-Agent': 'PPT-EnglishSpeaking',
+      },
+      body: ssml,
+    },
+  )
+  if (!res.ok) throw new Error(`Azure TTS failed: ${res.status}`)
+  return res.blob()
+}
+
+function escapeXml(s: string): string {
+  return s.replace(/[<>&'"]/g, c => ({
+    '<': '&lt;', '>': '&gt;', '&': '&amp;', "'": '&apos;', '"': '&quot;',
+  }[c]!))
+}
+```
+
+The service is stateless — no token cache, no retry logic. Each call is a single fetch. The player owns the cache.
+
+### View integration
+
+**Auto-play watcher:**
+
+```ts
+const player = useAudioPlayer()
+const autoPlayedIds = new Set<string>()
+
+watch(
+  () => engine.messages.value.map(m => `${m.id}:${m.status}`).join('|'),
+  () => {
+    for (const m of engine.messages.value) {
+      if (m.role === 'ai' && m.status === 'done' && m.content && !autoPlayedIds.has(m.id)) {
+        autoPlayedIds.add(m.id)
+        player.play(m.id, { kind: 'tts', text: m.content })
+      }
+    }
+  },
+)
+```
+
+The `autoPlayedIds` set is scoped to the component instance, so it resets if the view is remounted. Re-generating an AI message creates a new `id`, so it correctly auto-plays again.
+
+**Click toggle:**
+
+```ts
+function togglePlay(id: string) {
+  if (player.playingId.value === id || player.loadingId.value === id) {
+    player.stop()
+    return
+  }
+  const msg = engine.messages.value.find(m => m.id === id)
+  if (!msg) return
+  if (msg.role === 'student' && msg.audioBlob) {
+    player.play(id, { kind: 'blob', blob: msg.audioBlob })
+  } else if (msg.role === 'ai' && msg.content) {
+    player.play(id, { kind: 'tts', text: msg.content })
+  }
+}
+```
+
+`errorId === id` falls through to the `play()` branch, naturally retrying.
+
+**Recording start interrupt:**
+
+```ts
+async function handleStartRecording() {
+  if (!engine.canRecord.value || recorder.isRecording.value) return
+  player.stop()  // ← new
+  // ...rest unchanged
+}
+```
+
+`handleRestart` similarly calls `player.stop()` (replacing `engine.cancelTTS()`).
+
+**Template state derivation:**
+
+Each play button reads `player.playingId.value` / `player.loadingId.value` / `player.errorId.value` directly. No local `playingMessageId` ref.
+
+**Error UI (per voice-bar):**
+
+When `player.errorId.value === message.id`:
+- Play button gets `play-btn-error` modifier class (red border, warning icon).
+- A `<span class="play-error-hint">点击重试</span>` replaces the duration label.
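+
+For illustration, a small helper in the view can collapse the three refs into the four D6 states; the refs are mutually exclusive for a given message id, so check order does not matter (`playBtnState` is a hypothetical name, not prescribed by this design):
+
+```ts
+// Sketch: map the three player refs onto the four button states for one message.
+type PlayBtnState = 'idle' | 'loading' | 'playing' | 'error'
+
+function playBtnState(messageId: string): PlayBtnState {
+  if (player.errorId.value === messageId) return 'error'      // ⚠ play-btn-error + "点击重试" hint
+  if (player.loadingId.value === messageId) return 'loading'  // spinner
+  if (player.playingId.value === messageId) return 'playing'  // ⏸ pause icon
+  return 'idle'                                               // ▶ play icon
+}
+```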
+
+### Engine changes
+
+Removed:
+- `let ttsUtterance: SpeechSynthesisUtterance | null = null` (line 16)
+- `function speakTTS(text)` (lines 244-252)
+- `function cancelTTS()` (lines 254-259)
+- All four `speakTTS(...)` call sites (lines 56, 124, 201, 369)
+- `cancelTTS()` call in `onUnmounted` (line 429)
+- `cancelTTS` from the returned object (line 454)
+
+Added:
+
+```ts
+function attachStudentBlob(messageId: string, blob: Blob) {
+  const msg = messages.value.find(m => m.id === messageId)
+  if (msg && msg.role === 'student') msg.audioBlob = blob
+}
+```
+
+Returned alongside other helpers.
+
+### View changes (audioBlob fix in WS path)
+
+```ts
+async function handleFinishRecording() {
+  if (!recorder.isRecording.value) return
+  const ctl = streamCtl
+  streamCtl = null
+  try {
+    const blob = await recorder.stopRecording()
+    recorder.onChunk.value = null
+    if (ctl) {
+      engine.attachStudentBlob(ctl.studentMsgId, blob)  // ← new
+      ctl.finish()
+    } else {
+      await engine.sendStudentMessage(blob)
+    }
+  } catch (err) {
+    console.error('Recording/send failed:', err)
+  }
+}
+```
+
+### Removed view code
+
+- `let currentAudio: HTMLAudioElement | null = null`
+- `let currentAudioUrl: string | null = null`
+- `function stopCurrentPlayback()`
+- TTS branch inside `togglePlay()`
+- `playingMessageId` ref
+- `engine.cancelTTS()` call in `handleRestart`
+
+## Data flow
+
+### Auto-play happy path
+
+```
+1. WS / HTTP stream finishes → engine sets aiMsg.status = 'done'
+2. View watcher detects new (id, 'done') → autoPlayedIds.add(id) → player.play(id, { kind: 'tts', text })
+3. player.play → stop() → loadingId = id → speechService.synthesize() (≈800ms)
+4. fetch resolves → blob → ttsCache.set(id, blob) → URL.createObjectURL(blob)
+5. new Audio(url).play() → audio.onplaying fires → loadingId = null, playingId = id
+6. audio finishes → onended → playingId = null
+```
+
+### Replay (cache hit) path
+
+```
+1. User clicks play button on AI message → togglePlay(id)
+2. Not playing/loading → player.play(id, { kind: 'tts', text })
+3. player.play → stop() → loadingId = id → ttsCache.get(id) hits → no fetch
+4. URL.createObjectURL(blob) → new Audio.play() → onplaying → playingId = id
+```
+
+### Recording-start interrupt
+
+```
+1. User clicks 开始录音 → handleStartRecording
+2. player.stop() → synthAbort.abort() (if synthesizing) → audio.pause() (if playing)
+   → loadingId = null, playingId = null
+3. recorder.startRecording() proceeds
+```
+
+### Synthesis failure path
+
+```
+1. player.play(id, { kind: 'tts', text }) → loadingId = id
+2. synthesize() → fetch rejects (network) or 5xx
+3. catch: loadingId = null, errorId = id
+4. View re-renders: play button shows ⚠ + "点击重试"
+5. User clicks button → togglePlay → not playing/loading → player.play(id, ...) (errorId cleared, retry)
+```
+
+## Edge cases
+
+| Scenario                                           | Handling                                                                 |
+|----------------------------------------------------|--------------------------------------------------------------------------|
+| Synthesis network failure                          | `errorId = id`. Button shows retry. No global toast.                     |
+| Azure 401 (bad key) / 5xx                          | Same as network failure. Logged to console for ops.                      |
+| `audio.play()` rejected (autoplay policy)          | Same as failure. User clicks → counts as user gesture → succeeds.        |
+| Synthesis in progress, recording starts            | `synthAbort.abort()` + `loadingId = null`. No audio plays. No error.     |
+| Synthesis in progress, new AI message arrives      | Old synth aborted. New `play()` starts. Old message NOT marked errored. |
+| User replaying old msg, new AI message arrives     | Old audio paused. New auto-play takes over (consistent with the single-channel rule). |
+| User clicks same message twice                     | Second click hits `playingId === id` → `stop()`. Toggle.                 |
+| User clicks errored message                        | Error cleared, retry begins.                                             |
+| View unmounts mid-synthesis                        | `onUnmounted` → `stop()` → abort. Cached URLs revoked.                   |
+| `crypto.randomUUID()` collision (theoretical)      | N/A — not handled. Astronomically unlikely.                              |
+| Empty `text` for TTS                               | View guards (`m.content` truthy check in watcher and toggle).            |
+| `VITE_AZURE_SPEECH_KEY` missing                    | `synthesize()` throws immediately. Falls into normal failure path → error UI. Console error explains. |
+
+## Testing strategy
+
+There is no existing unit test infrastructure for this view. Verification is manual, captured as a checklist:
+
+**Single-channel correctness:**
+- [ ] AI message arrives → auto-plays. Click another AI message's play button → first stops, second plays.
+- [ ] AI auto-playing → click student message replay → AI stops, student plays.
+- [ ] Student replaying → click AI play button → student stops, AI plays.
+- [ ] AI auto-playing → start recording → audio stops within 100ms (no leak into mic).
+
+**Toggle correctness:**
+- [ ] Click playing message → stops. Click again → plays from start.
+- [ ] Click loading message (while synthesis pending) → cancels, no audio plays.
+
+**Cache:**
+- [ ] First click on AI message → Network panel shows POST to `*.tts.speech.microsoft.com`.
+- [ ] Second click on same message → no new network call.
+- [ ] Re-generate AI message (new id) → next click triggers new synthesis.
+
+**Error UI:**
+- [ ] Disconnect network → AI message arrives → button shows ⚠ + "点击重试".
+- [ ] Reconnect network → click retry → audio plays.
+- [ ] Set bogus `VITE_AZURE_SPEECH_KEY` → all AI plays fail with retry UI.
+
+**Lifecycle:**
+- [ ] Enter dialogue, exit, re-enter — no zombie audio. No console warnings about revoked URLs.
+- [ ] Refresh page mid-playback — no error in console on unload.
+
+**WS audio attachment fix:**
+- [ ] Speak a message via WS path → click student replay button → audio plays correctly (regression fix).
+
+## Configuration
+
+`.env.local` (developer machine, not committed):
+
+```
+VITE_AZURE_SPEECH_KEY=<your-key>
+VITE_AZURE_SPEECH_REGION=eastus
+```
+
+`.env.example` (committed, no values):
+
+```
+VITE_AZURE_SPEECH_KEY=
+VITE_AZURE_SPEECH_REGION=
+```
+
+## Security debt (must address before production)
+
+`VITE_AZURE_SPEECH_KEY` is bundled into the JavaScript shipped to browsers. Anyone with DevTools can extract it. **This is acceptable only for local development and internal demos.**
+
+Before shipping to external users, replace with the deferred design:
+
+1. Backend endpoint `GET /api/speaking/azure-token` returns `{ token, region }` (10-min expiry).
+2. `speechService.ts` fetches the token before each batch of synthesis calls; caches it for ≤9 minutes; refreshes on 401.
+3. Synthesis request `Authorization: Bearer <token>` instead of `Ocp-Apim-Subscription-Key`.
+4. `VITE_AZURE_SPEECH_KEY` is removed from the frontend.
+
+This change is isolated to `speechService.ts` (player and view code stay identical).
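+
+A rough sketch of that deferred variant of `synthesize()` (the endpoint path and `{ token, region }` shape follow the list above; `buildSsml` stands in for the SSML construction shown earlier, and the token-caching details are illustrative):
+
+```ts
+// Sketch of the deferred token flow; not part of this change.
+let cachedToken: { token: string; region: string; fetchedAt: number } | null = null
+const TOKEN_TTL_MS = 9 * 60 * 1000   // refresh comfortably before the 10-minute expiry
+
+async function getToken(): Promise<{ token: string; region: string }> {
+  if (cachedToken && Date.now() - cachedToken.fetchedAt < TOKEN_TTL_MS) return cachedToken
+  const res = await fetch('/api/speaking/azure-token')
+  if (!res.ok) throw new Error(`Azure token fetch failed: ${res.status}`)
+  const { token, region } = await res.json()
+  cachedToken = { token, region, fetchedAt: Date.now() }
+  return cachedToken
+}
+
+export async function synthesize(text: string, signal?: AbortSignal): Promise<Blob> {
+  const request = async () => {
+    const { token, region } = await getToken()
+    return fetch(`https://${region}.tts.speech.microsoft.com/cognitiveservices/v1`, {
+      method: 'POST',
+      signal,
+      headers: {
+        Authorization: `Bearer ${token}`,
+        'Content-Type': 'application/ssml+xml',
+        'X-Microsoft-OutputFormat': FORMAT,
+      },
+      body: buildSsml(text),          // same SSML as today, factored into a helper for this sketch
+    })
+  }
+  let res = await request()
+  if (res.status === 401) {           // token expired early: drop the cache and retry once
+    cachedToken = null
+    res = await request()
+  }
+  if (!res.ok) throw new Error(`Azure TTS failed: ${res.status}`)
+  return res.blob()
+}
+```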
+
+## Out of this design's scope
+
+- Voice persona switching (multi-voice support).
+- Streaming synthesis with sentence boundaries.
+- Background music / sound effects mixing.
+- Volume / speed controls.
+- Persisting cached audio across sessions (IndexedDB).
+- Accessibility audit of the new error state (WCAG roles, aria-live).
+
+These are deferred until product asks for them.