# Azure TTS + Unified Audio Player Design

**Date:** 2026-04-26
**Scope:** `src/views/Editor/EnglishSpeaking/preview/DialogueChatView.vue` and its composables/services
**Status:** Approved for implementation planning

## Context

`DialogueChatView.vue` currently has three audio playback triggers:

1. **Auto-play AI replies** — `useDialogueEngine.speakTTS()` runs after every AI `done` event (greeting, `sendStudentMessage`, WebSocket stream, `regenerateAiMessage`), using the browser-native `SpeechSynthesisUtterance` API.
2. **Click-replay AI** — `togglePlay()` in the view creates a separate `SpeechSynthesisUtterance` for the same message.
3. **Click-replay student** — `togglePlay()` plays the message's stored `audioBlob` via an `HTMLAudioElement`.

The three triggers do not coordinate: two owners (engine and view) both call `speechSynthesis.cancel()`, the click-replay path is unaware of ongoing auto-play, and starting a new recording does not interrupt currently-playing audio. As a result:

- Clicking the play button on a message that is currently auto-playing does not toggle to "stop" — the visual play/pause state is wrong because `playingMessageId` (view) is unaware of `ttsUtterance` (engine).
- Starting a recording while AI audio plays lets the AI voice leak into the microphone.
- Student-recording playback can overlap with auto-TTS for a freshly arrived AI message.

We are also switching the AI voice from browser-native TTS to **Azure Speech REST** for higher-quality output. This is the right moment to refactor.

## Goals

1. Introduce a single, view-level audio playback owner (`useAudioPlayer` composable).
2. Replace browser TTS with Azure Speech REST synthesis for AI messages.
3. Auto-synthesize and auto-play each AI message after streaming completes.
4. Support click-replay for both AI messages (cached synthesis) and student messages (existing blob).
5. Enforce three rules **structurally** (i.e., not by discipline):
   - Single playback channel (new playback interrupts any prior one).
   - Recording start interrupts current playback.
   - Clicking the currently-playing message stops it.
6. Surface synthesis/playback failures in a unified per-message error state with one-click retry.
7. Decouple `useDialogueEngine` from audio entirely — the engine becomes pure dialogue state.

## Non-goals

- **Streaming synthesis** (synthesizing token-by-token while the model streams). Out of scope; full-text synthesis after `done` is acceptable.
- **Multiple voices / voice configuration UI.** Hard-code `en-US-AriaNeural` for now.
- **Cross-view audio coordination** (e.g., the report screen also playing audio). The player is view-level; if the report screen later needs playback, it can instantiate its own.
- **Backend Azure token endpoint** (planned but deferred — see Security debt below).
- **SDK-based synthesis.** Use REST only; the `microsoft-cognitiveservices-speech-sdk` package is not introduced.

## Non-trivial decisions

### D1. Player ownership: view-level, not module singleton

The player is created inside `DialogueChatView.setup()` via `useAudioPlayer()` and torn down on view unmount. Reasoning: today all three triggers live in this one view; a module singleton would add a global lifecycle hazard (every navigation would have to remember to call `stop()`) without solving any current problem. If a future view needs playback, it will instantiate its own player; cross-view coordination is a separate, future problem.

### D2. Auto-play trigger lives in the view, not the engine

The engine no longer touches `speechSynthesis`. Instead, the view runs a `watch()` on `engine.messages` and triggers `player.play(...)` when an AI message transitions to `status === 'done'`. Reasoning: "auto-play after a message completes" is a presentation concern. Keeping it in the view means the engine has zero audio dependency, and any future toggle ("disable auto-play") is a view-only change.

The watcher uses a `Set<string>` of already-auto-played message IDs to avoid re-playing on unrelated re-renders.

### D3. Synthesis is one round-trip per message, cached by message ID

Each AI message text maps to one Azure REST call, producing an MP3 blob that is cached in a `Map<messageId, Blob>` inside the player. Replays hit the cache (no second call). The cache lives for the player's lifetime; on unmount, all cached object URLs are revoked and the map is cleared. Student-recording blobs are **not** added to this cache — the message itself owns the blob.

### D4. Errors are surfaced per message, not globally

The player exposes `errorId: Ref<string | null>`. The play button on the affected message renders an error variant (warning icon + "点击重试" ("click to retry") text). Clicking it retries by calling `player.play(id, source)` again. Reasoning:

- The failure scope is one message's playback, not a system state.
- Locating the error at the play button keeps retry intuitive — the same affordance that starts playback also retries it.
- It avoids introducing a new global toast/banner component.

All failure paths (Azure network/5xx, `audio.play()` rejection from autoplay policy, decoder errors) collapse to the same UI: warning icon + "播放失败，点击重试" ("playback failed, click to retry"). We do not differentiate causes; the user action is identical.

### D5. WebSocket stream path: attach `audioBlob` to the student message

The `beginStudentStream` path (`useDialogueEngine.ts`) does not currently attach the recorded blob to the student message it pushes, so the student's "replay recording" button is silent in WS mode. We fix this in the same change: `handleFinishRecording` will attach the blob to the in-flight student message via a new `engine.attachStudentBlob(studentMsgId, blob)` helper. Without this fix, student replay is broken in production (WS is the default path).

### D6. Three player refs drive button rendering

The play button has four visual states, derived from three player refs:

| State   | Condition                               | Render                                 |
|---------|-----------------------------------------|----------------------------------------|
| idle    | none of the below                       | ▶ play icon                            |
| loading | `player.loadingId.value === message.id` | spinner                                |
| playing | `player.playingId.value === message.id` | ⏸ pause icon                           |
| error   | `player.errorId.value === message.id`   | ⚠ warning icon, red border, "点击重试" |

`loadingId` is non-null between the `play()` invocation and either the audio's `onplaying` event (success path) or the catch block (failure path). It is needed because the synthesis round-trip is observable (~500 ms to 2 s) and the user must see something happen.

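The table above reduces to a small pure function. This is a sketch with assumed names (`buttonState`, `PlayerRefs`), not the final template code; the real view reads the refs directly:

```typescript
// Sketch: deriving the four visual states from the three player refs.
// The contract keeps the refs mutually exclusive per message, so the
// check order here is only a tie-breaker.
type ButtonState = 'idle' | 'loading' | 'playing' | 'error'

interface PlayerRefs {
  playingId: string | null
  loadingId: string | null
  errorId: string | null
}

function buttonState(p: PlayerRefs, messageId: string): ButtonState {
  if (p.errorId === messageId) return 'error'
  if (p.loadingId === messageId) return 'loading'
  if (p.playingId === messageId) return 'playing'
  return 'idle'
}
```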
## Architecture

### File layout

| File | Operation | Responsibility |
|------|-----------|----------------|
| `src/views/Editor/EnglishSpeaking/composables/useAudioPlayer.ts` | **NEW** | Sole audio owner. Single-channel rule, TTS cache, error surface. |
| `src/views/Editor/EnglishSpeaking/services/speechService.ts` | **NEW** | Azure Speech REST synthesis. Stateless `synthesize(text, signal)`. |
| `src/views/Editor/EnglishSpeaking/composables/useDialogueEngine.ts` | **EDIT** | Remove `speakTTS`, `cancelTTS`, `ttsUtterance`. Add `attachStudentBlob`. |
| `src/views/Editor/EnglishSpeaking/preview/DialogueChatView.vue` | **EDIT** | Use `useAudioPlayer`. Add auto-play watcher. Wire stop into recording start. Render new button states. |
| `.env.example` | **NEW or APPEND** | Document `VITE_AZURE_SPEECH_KEY` and `VITE_AZURE_SPEECH_REGION`. |

### `useAudioPlayer` API

```ts
function useAudioPlayer(): {
  playingId: Readonly<Ref<string | null>>
  loadingId: Readonly<Ref<string | null>>
  errorId: Readonly<Ref<string | null>>

  play(id: string, source: PlaySource): Promise<void>
  stop(): void
}

type PlaySource =
  | { kind: 'tts'; text: string }
  | { kind: 'blob'; blob: Blob }
```

+
|
|
|
|
|
+**Contract:**
|
|
|
|
|
+
|
|
|
|
|
+- `play(id, source)` is the single playback entry point.
|
|
|
|
|
+ - Clears any prior `errorId` (a fresh attempt — error is stale).
|
|
|
|
|
+ - Calls internal `stop()` to interrupt the current playback.
|
|
|
|
|
+ - Sets `loadingId = id`.
|
|
|
|
|
+ - For `kind: 'tts'`: hits cache or calls `synthesize(text)`.
|
|
|
|
|
+ - Constructs `Audio(URL.createObjectURL(blob))`.
|
|
|
|
|
+ - On `audio.onplaying`: `loadingId = null; playingId = id`.
|
|
|
|
|
+ - On `audio.onended`: `playingId = null` (if still us). No error.
|
|
|
|
|
+ - On `audio.onerror` (mid-play decoder failure), `audio.play()` rejection, synthesis throw: `loadingId = null; playingId = null; errorId = id`.
|
|
|
|
|
+- `stop()` aborts pending synthesis (`AbortController.abort()`), pauses current `HTMLAudioElement`, clears `playingId` and `loadingId`. Does **not** clear `errorId` (errors are sticky until a new `play()` for that id, or the user navigates away).
|
|
|
|
|
+- `onUnmounted`: `stop()`, revoke all cached URLs, clear cache map.
|
|
|
|
|
+
|
|
|
|
|
### Single-channel rule (structural enforcement)

The composable holds at most one of each:

- one `currentAudio: HTMLAudioElement | null`
- one `synthAbort: AbortController | null`
- one `playingId` value

`play()` always begins by calling `stop()`, so by construction there can never be two active audio elements or two in-flight syntheses at once. The view does not need to "remember to cancel" anything; the rule cannot be violated from outside the composable.

### `speechService.ts`

```ts
const KEY = import.meta.env.VITE_AZURE_SPEECH_KEY as string
const REGION = import.meta.env.VITE_AZURE_SPEECH_REGION as string
const VOICE = 'en-US-AriaNeural'
const FORMAT = 'audio-24khz-48kbitrate-mono-mp3'

export async function synthesize(text: string, signal?: AbortSignal): Promise<Blob> {
  if (!KEY || !REGION) throw new Error('Azure Speech credentials not configured')

  const ssml =
    `<speak version='1.0' xml:lang='en-US'>` +
    `<voice name='${VOICE}'>${escapeXml(text)}</voice>` +
    `</speak>`

  const res = await fetch(
    `https://${REGION}.tts.speech.microsoft.com/cognitiveservices/v1`,
    {
      method: 'POST',
      signal,
      headers: {
        'Ocp-Apim-Subscription-Key': KEY,
        'Content-Type': 'application/ssml+xml',
        'X-Microsoft-OutputFormat': FORMAT,
        'User-Agent': 'PPT-EnglishSpeaking',
      },
      body: ssml,
    },
  )
  if (!res.ok) throw new Error(`Azure TTS failed: ${res.status}`)
  return res.blob()
}

// Escape the five XML special characters so user text cannot break the SSML.
function escapeXml(s: string): string {
  return s.replace(/[<>&'"]/g, c => ({
    '<': '&lt;', '>': '&gt;', '&': '&amp;', "'": '&apos;', '"': '&quot;',
  }[c]!))
}
```

The service is stateless — no token cache, no retry logic. Each call is a single fetch. The player owns the cache.

### View integration

**Auto-play watcher:**

```ts
const player = useAudioPlayer()
const autoPlayedIds = new Set<string>()

watch(
  () => engine.messages.value.map(m => `${m.id}:${m.status}`).join('|'),
  () => {
    for (const m of engine.messages.value) {
      if (m.role === 'ai' && m.status === 'done' && m.content && !autoPlayedIds.has(m.id)) {
        autoPlayedIds.add(m.id)
        player.play(m.id, { kind: 'tts', text: m.content })
      }
    }
  },
)
```

The `autoPlayedIds` set is scoped to the component instance, so it resets if the view is remounted. Regenerating an AI message creates a new `id`, so it correctly auto-plays again.

**Click toggle:**

```ts
function togglePlay(id: string) {
  if (player.playingId.value === id || player.loadingId.value === id) {
    player.stop()
    return
  }
  const msg = engine.messages.value.find(m => m.id === id)
  if (!msg) return
  if (msg.role === 'student' && msg.audioBlob) {
    player.play(id, { kind: 'blob', blob: msg.audioBlob })
  } else if (msg.role === 'ai' && msg.content) {
    player.play(id, { kind: 'tts', text: msg.content })
  }
}
```

A message with `errorId === id` falls through to the `play()` branch, which naturally retries.

**Recording start interrupt:**

```ts
async function handleStartRecording() {
  if (!engine.canRecord.value || recorder.isRecording.value) return
  player.stop() // ← new
  // ...rest unchanged
}
```

`handleRestart` similarly calls `player.stop()` (replacing `engine.cancelTTS()`).

**Template state derivation:**

Each play button reads `player.playingId.value` / `player.loadingId.value` / `player.errorId.value` directly. There is no local `playingMessageId` ref.

**Error UI (per voice bar):**

When `player.errorId.value === message.id`:

- The play button gets a `play-btn-error` modifier class (red border, warning icon).
- A `<span class="play-error-hint">点击重试</span>` ("click to retry") replaces the duration label.

### Engine changes

Removed:

- `let ttsUtterance: SpeechSynthesisUtterance | null = null` (line 16)
- `function speakTTS(text)` (lines 244-252)
- `function cancelTTS()` (lines 254-259)
- All four `speakTTS(...)` call sites (lines 56, 124, 201, 369)
- The `cancelTTS()` call in `onUnmounted` (line 429)
- `cancelTTS` from the returned object (line 454)

Added:

```ts
function attachStudentBlob(messageId: string, blob: Blob) {
  const msg = messages.value.find(m => m.id === messageId)
  if (msg && msg.role === 'student') msg.audioBlob = blob
}
```

Returned alongside the other helpers.

### View changes (audioBlob fix in WS path)

```ts
async function handleFinishRecording() {
  if (!recorder.isRecording.value) return
  const ctl = streamCtl
  streamCtl = null
  try {
    const blob = await recorder.stopRecording()
    recorder.onChunk.value = null
    if (ctl) {
      engine.attachStudentBlob(ctl.studentMsgId, blob) // ← new
      ctl.finish()
    } else {
      await engine.sendStudentMessage(blob)
    }
  } catch (err) {
    console.error('Recording/send failed:', err)
  }
}
```

### Removed view code

- `let currentAudio: HTMLAudioElement | null = null`
- `let currentAudioUrl: string | null = null`
- `function stopCurrentPlayback()`
- The TTS branch inside `togglePlay()`
- The `playingMessageId` ref
- The `engine.cancelTTS()` call in `handleRestart`

## Data flow

### Auto-play happy path

```
1. WS / HTTP stream finishes → engine sets aiMsg.status = 'done'
2. View watcher detects new (id, 'done') → autoPlayedIds.add(id) → player.play(id, { kind: 'tts', text })
3. player.play → stop() → loadingId = id → speechService.synthesize() (≈800 ms)
4. fetch resolves → blob → ttsCache.set(id, blob) → URL.createObjectURL(blob)
5. new Audio(url).play() → audio.onplaying fires → loadingId = null, playingId = id
6. audio finishes → onended → playingId = null
```

### Replay (cache hit) path

```
1. User clicks play button on AI message → togglePlay(id)
2. Not playing/loading → player.play(id, { kind: 'tts', text })
3. player.play → stop() → loadingId = id → ttsCache.get(id) hits → no fetch
4. URL.createObjectURL(blob) → new Audio(url).play() → onplaying → playingId = id
```

### Recording-start interrupt

```
1. User clicks 开始录音 (Start Recording) → handleStartRecording
2. player.stop() → synthAbort.abort() (if synthesizing) → audio.pause() (if playing)
   → loadingId = null, playingId = null
3. recorder.startRecording() proceeds
```

### Synthesis failure path

```
1. player.play(id, { kind: 'tts', text }) → loadingId = id
2. synthesize() → fetch rejects (network) or returns 5xx
3. catch: loadingId = null, errorId = id
4. View re-renders: play button shows ⚠ + "点击重试" ("click to retry")
5. User clicks button → togglePlay → not playing/loading → player.play(id, ...) (errorId cleared, retry)
```

## Edge cases

| Scenario | Handling |
|----------|----------|
| Synthesis network failure | `errorId = id`. Button shows retry. No global toast. |
| Azure 401 (bad key) / 5xx | Same as network failure. Logged to console for ops. |
| `audio.play()` rejected (autoplay policy) | Same as failure. A retry click counts as a user gesture, so it succeeds. |
| Synthesis in progress, recording starts | `synthAbort.abort()` + `loadingId = null`. No audio plays. No error. |
| Synthesis in progress, new AI message arrives | Old synthesis aborted. New `play()` starts. Old message is NOT marked errored. |
| User replaying old message, new AI message arrives | Old audio paused. New auto-play takes over (matches the single-channel rule). |
| User clicks same message twice | Second click hits `playingId === id` → `stop()`. Toggle. |
| User clicks errored message | Error cleared, retry begins. |
| View unmounts mid-synthesis | `onUnmounted` → `stop()` → abort. Cached URLs revoked. |
| `crypto.randomUUID()` collision (theoretical) | Not handled. Astronomically unlikely. |
| Empty `text` for TTS | View guards (`m.content` truthiness check in watcher and toggle). |
| `VITE_AZURE_SPEECH_KEY` missing | `synthesize()` throws immediately → normal failure path → error UI. A console error explains why. |

## Testing strategy

There is no existing unit-test infrastructure for this view, so verification is manual, captured as a checklist:

**Single-channel correctness:**
- [ ] AI message arrives → auto-plays. Click another AI message's play button → first stops, second plays.
- [ ] AI auto-playing → click student message replay → AI stops, student plays.
- [ ] Student replaying → click AI play button → student stops, AI plays.
- [ ] AI auto-playing → start recording → audio stops within 100 ms (no leak into the mic).

**Toggle correctness:**
- [ ] Click playing message → stops. Click again → plays from the start.
- [ ] Click loading message (while synthesis is pending) → cancels; no audio plays.

**Cache:**
- [ ] First click on an AI message → Network panel shows a POST to `*.tts.speech.microsoft.com`.
- [ ] Second click on the same message → no new network call.
- [ ] Regenerate an AI message (new id) → the next click triggers a new synthesis.

**Error UI:**
- [ ] Disconnect the network → AI message arrives → button shows ⚠ + "点击重试" ("click to retry").
- [ ] Reconnect the network → click retry → audio plays.
- [ ] Set a bogus `VITE_AZURE_SPEECH_KEY` → all AI plays fail with the retry UI.

**Lifecycle:**
- [ ] Enter the dialogue, exit, re-enter — no zombie audio, no console warnings about revoked URLs.
- [ ] Refresh the page mid-playback — no error in the console on unload.

**WS audio attachment fix:**
- [ ] Speak a message via the WS path → click the student replay button → audio plays correctly (regression fix).

## Configuration

`.env.local` (developer machine, not committed):

```
VITE_AZURE_SPEECH_KEY=<your-key>
VITE_AZURE_SPEECH_REGION=eastus
```

`.env.example` (committed, no values):

```
VITE_AZURE_SPEECH_KEY=
VITE_AZURE_SPEECH_REGION=
```

## Security debt (must address before production)

`VITE_AZURE_SPEECH_KEY` is bundled into the JavaScript shipped to browsers; anyone with DevTools can extract it. **This is acceptable only for local development and internal demos.**

Before shipping to external users, replace it with the deferred design:

1. A backend endpoint `GET /api/speaking/azure-token` returns `{ token, region }` (10-minute expiry).
2. `speechService.ts` fetches the token before each batch of synthesis calls, caches it for ≤9 minutes, and refreshes on 401.
3. The synthesis request sends `Authorization: Bearer <token>` instead of `Ocp-Apim-Subscription-Key`.
4. `VITE_AZURE_SPEECH_KEY` is removed from the frontend.

This change is isolated to `speechService.ts` (player and view code stay identical).

## Out of this design's scope

- Voice persona switching (multi-voice support).
- Streaming synthesis with sentence boundaries.
- Background music / sound-effect mixing.
- Volume / speed controls.
- Persisting cached audio across sessions (IndexedDB).
- Accessibility audit of the new error state (WCAG roles, aria-live).

These are deferred until product asks for them.