DreamLake

Key Design

Text Track Splitting

Text tracks (VTT/SRT/JSONL) are split into time-windowed chunks for efficient time-range streaming. The client fetches chunks directly from S3 — no server in the data path.

Problem

A 2-hour recording with dense captions produces a large VTT file. Loading the entire file to display captions at a specific timestamp is wasteful. We need to load only the relevant portion.

Solution

Lambda splits the text track into 30-second chunks stored in S3. An m3u8 playlist indexes the chunks. The client parses the m3u8, then fetches only the chunks covering the visible time range — directly from S3.

Pipeline

Raw text file in staging

 ├─ 1. Download from S3: {owner}/{project}/staging/{hash}
 ├─ 2. Parse into timed entries
 │     VTT/SRT → cue timestamps (HH:MM:SS.mmm --> HH:MM:SS.mmm)
 │     JSONL   → ts/dt fields per line
 ├─ 3. Group entries into 30-second windows
 ├─ 4. For each window, write a chunk:
 │     VTT/SRT → .vtt chunk (WEBVTT header + cues)
 │     JSONL   → .jsonl chunk (matching lines)
 ├─ 5. Hash each chunk (SHA256[:16])
 │     Upload to chunks/{hash}.{vtt|jsonl}
 ├─ 6. Build m3u8 playlist with {hash}.{ext} entries
 ├─ 7. Upload playlist: tracks/text/{id}/stream/{streamHash}.m3u8
 ├─ 8. Update tracks/text/{id}/meta.json with streams[]
 └─ 9. Callback PATCH to BSS
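
Steps 3 and 5 of the pipeline can be sketched as follows. This is illustrative TypeScript, not the actual Lambda code — `Entry`, `windowEntries`, and `chunkHash` are invented names:

```typescript
import { createHash } from "node:crypto";

// Illustrative shape for a parsed cue/line (step 2's output).
interface Entry {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

const TEXT_SEG_DURATION = 30; // seconds per window

// Step 3: bucket parsed entries into 30-second windows by start time.
function windowEntries(entries: Entry[]): Map<number, Entry[]> {
  const windows = new Map<number, Entry[]>();
  for (const e of entries) {
    const idx = Math.floor(e.start / TEXT_SEG_DURATION);
    if (!windows.has(idx)) windows.set(idx, []);
    windows.get(idx)!.push(e);
  }
  return windows;
}

// Step 5: content-address each chunk body (SHA256, first 16 hex chars).
function chunkHash(body: string): string {
  return createHash("sha256").update(body).digest("hex").slice(0, 16);
}
```

Note that this sketch buckets a cue by its start time only; how the real splitter handles cues that cross a window boundary isn't specified above.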

Chunk Format

Input format determines chunk format:

Input    Chunk Extension    Content
VTT      .vtt               WEBVTT\n\n + cues in that time window
SRT      .vtt               Normalized to VTT format
JSONL    .jsonl             Lines whose ts falls in the window
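
For illustration, a .vtt chunk covering the 00:30–01:00 window might look like the following (cue text invented, and assuming cue timestamps are carried over unchanged from the source file):

```
WEBVTT

00:00:31.000 --> 00:00:34.500
First caption in this window

00:00:36.200 --> 00:00:39.000
Second caption in this window
```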

M3U8 Format

Unlike video/audio (bare hashes), text track playlists include the file extension:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:30
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:30.000,
a1b2c3d4e5f67890.vtt
#EXTINF:30.000,
1234567890abcdef.vtt
#EXT-X-ENDLIST

BSS's rewriteM3u8() handles both formats:

  • Bare hashes → {cdnBase}/chunks/{hash}.ts (video/audio)
  • {hash}.{ext} → {cdnBase}/chunks/{hash}.{ext} (text/labels)
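
A minimal sketch of that dual handling — the real rewriteM3u8() in BSS is not reproduced here, so the regexes and signature are assumptions:

```typescript
// Rewrite segment URIs in an m3u8 to CDN URLs.
// Bare hex hashes (video/audio) get a .ts extension appended;
// {hash}.{ext} entries (text/labels) keep their extension.
function rewriteM3u8(playlist: string, cdnBase: string): string {
  return playlist
    .split("\n")
    .map((line) => {
      if (line.startsWith("#") || line.trim() === "") return line; // tags/blank
      if (/^[0-9a-f]+\.[a-z0-9]+$/.test(line)) {
        return `${cdnBase}/chunks/${line}`;   // text/labels: extension present
      }
      if (/^[0-9a-f]+$/.test(line)) {
        return `${cdnBase}/chunks/${line}.ts`; // video/audio: bare hash
      }
      return line;
    })
    .join("\n");
}
```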

Client-Side Playback

The client fetches chunks directly from S3 — the server is only involved once (for the m3u8 index):

Client                         BSS                          S3
 │                              │                           │
 ├─ GET /text-tracks/:id/      │                           │
 │      stream/:hash.m3u8 ────→│                           │
 │  (one request)               ├─ Fetch + rewrite m3u8    │
 │←── m3u8 with CDN URLs ──────│                           │
 │                              │                           │
 ├─ Parse m3u8 locally          │                           │
 ├─ Determine which chunks      │                           │
 │  overlap current time range  │                           │
 │                              │                           │
 ├─ GET chunks directly ───────────────────────────────────→│
 │  (only the ones needed)      │                           │
 │←── .vtt / .jsonl content ───────────────────────────────│
 │                              │                           │
 └─ Parse + render client-side  │                           │

The client caches fetched chunks — repeat queries for the same time range hit the cache.
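
The overlap computation and cache can be sketched like this (illustrative client-side helpers; names and the Promise-based cache are assumptions, not the actual client code):

```typescript
const TEXT_SEG_DURATION = 30; // seconds per chunk

// Given the chunk URIs in playlist order (each covering 30 s) and the
// visible time range, return only the URIs that overlap that range.
function chunksForRange(
  chunkUris: string[],
  startSec: number,
  endSec: number,
): string[] {
  const first = Math.max(0, Math.floor(startSec / TEXT_SEG_DURATION));
  const last = Math.min(
    chunkUris.length - 1,
    Math.floor(endSec / TEXT_SEG_DURATION),
  );
  return chunkUris.slice(first, last + 1);
}

// Cache by URI so repeat queries for the same range never re-fetch.
// Storing the Promise also de-duplicates concurrent requests.
const chunkCache = new Map<string, Promise<string>>();
function fetchChunk(uri: string): Promise<string> {
  if (!chunkCache.has(uri)) {
    chunkCache.set(uri, fetch(uri).then((r) => r.text()));
  }
  return chunkCache.get(uri)!;
}
```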

Why 30-Second Windows

Text data is small — a 30-second VTT chunk is typically a few KB. Using 6-second segments (like video) would create too many tiny files. 30 seconds balances:

  • Few enough chunks to keep the m3u8 small
  • Granular enough that the client doesn't over-fetch
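
A quick back-of-the-envelope check of the playlist-size side of that trade-off, using the 2-hour recording from the Problem section:

```typescript
// Playlist entry count for a 2-hour recording under each window length.
const durationSec = 2 * 60 * 60;        // 7200 s
const entriesAt30s = durationSec / 30;  // 240 chunk entries in the m3u8
const entriesAt6s = durationSec / 6;    // 1200 entries — 5× more lines (and files)
```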

Segment Duration

Media Type     Segment Duration                          Reason
Video          6s (configurable via SEGMENT_DURATION)    Matches HLS spec for smooth playback
Audio          6s (same as video)                        Consistent with video segments
Text/Labels    30s (hardcoded TEXT_SEG_DURATION)         Text is tiny; fewer chunks is better