Key Design
Text Track Splitting
Text tracks (VTT/SRT/JSONL) are split into time-windowed chunks for efficient time-range streaming. The client fetches chunks directly from S3 — no server in the data path.
Problem
A 2-hour recording with dense captions produces a large VTT file. Loading the entire file to display captions at a specific timestamp is wasteful. We need to load only the relevant portion.
Solution
Lambda splits the text track into 30-second chunks stored in S3. An m3u8 playlist indexes the chunks. The client parses the m3u8, then fetches only the chunks covering the visible time range — directly from S3.
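The 30-second grouping and chunk hashing at the core of the Lambda can be sketched as follows. This is a minimal illustration, not the actual Lambda code: the entry tuples, function names, and window constant are assumptions, though the SHA-256-truncated-to-16 hash matches the pipeline below.

```python
import hashlib

WINDOW_SECONDS = 30  # text track window size


def group_into_windows(entries, window_seconds=WINDOW_SECONDS):
    """Group (start_seconds, payload) entries into fixed-size time windows.

    `entries` is assumed to already be parsed from VTT/SRT cues or
    JSONL ts fields (step 2 of the pipeline).
    """
    windows = {}
    for start, payload in entries:
        index = int(start // window_seconds)
        windows.setdefault(index, []).append((start, payload))
    # Return windows in time order; empty windows produce no chunk.
    return [windows[i] for i in sorted(windows)]


def chunk_hash(chunk_bytes):
    """SHA256 truncated to 16 hex chars, used as the chunk filename."""
    return hashlib.sha256(chunk_bytes).hexdigest()[:16]
```

Each resulting window is then serialized as a chunk and uploaded under its hash, as the pipeline below describes.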
Pipeline
Raw text file in staging
│
├─ 1. Download from S3: {owner}/{project}/staging/{hash}
├─ 2. Parse into timed entries
│ VTT/SRT → cue timestamps (HH:MM:SS.mmm --> HH:MM:SS.mmm)
│ JSONL → ts/dt fields per line
├─ 3. Group entries into 30-second windows
├─ 4. For each window, write a chunk:
│ VTT/SRT → .vtt chunk (WEBVTT header + cues)
│ JSONL → .jsonl chunk (matching lines)
├─ 5. Hash each chunk (SHA256[:16])
│ Upload to chunks/{hash}.{vtt|jsonl}
├─ 6. Build m3u8 playlist with {hash}.{ext} entries
├─ 7. Upload playlist: tracks/text/{id}/stream/{streamHash}.m3u8
├─ 8. Update tracks/text/{id}/meta.json with streams[]
└─ 9. Callback PATCH to BSS
Chunk Format
Input format determines chunk format:
| Input | Chunk Extension | Content |
|---|---|---|
| VTT | .vtt | WEBVTT\n\n + cues in that time window |
| SRT | .vtt | Normalized to VTT format |
| JSONL | .jsonl | Lines whose ts falls in the window |
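Writing a VTT chunk for one window can be sketched like this (an illustrative sketch; the cue tuple shape and helper names are assumptions, but the output matches the WEBVTT-header-plus-cues layout in the table above):

```python
def format_timestamp(seconds):
    """Render seconds as HH:MM:SS.mmm, the cue timestamp form used in VTT."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    ms = round((seconds - int(seconds)) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def write_vtt_chunk(cues):
    """Build a .vtt chunk: WEBVTT header followed by the window's cues.

    `cues` is assumed to be a list of (start, end, text) tuples.
    """
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text)
        lines.append("")  # blank line terminates each cue
    return "\n".join(lines)
```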
M3U8 Format
Unlike video/audio (bare hashes), text track playlists include the file extension:
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:30
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:30.000,
a1b2c3d4e5f67890.vtt
#EXTINF:30.000,
1234567890abcdef.vtt
#EXT-X-ENDLIST
BSS's rewriteM3u8() handles both formats:
- Bare hashes → {cdnBase}/chunks/{hash}.ts (video/audio)
- {hash}.{ext} → {cdnBase}/chunks/{hash}.{ext} (text/labels)
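The rewrite rule can be sketched as follows. The function name rewriteM3u8 comes from the doc, but this signature and implementation are a sketch, not BSS's actual code:

```python
import re

# A chunk line is a bare 16-hex-char hash, optionally with an extension.
CHUNK_LINE = re.compile(r"^([0-9a-f]{16})(\.[a-z0-9]+)?$")


def rewrite_m3u8(playlist, cdn_base):
    """Rewrite chunk lines to absolute CDN URLs.

    Bare hashes (video/audio) get a .ts extension; hash.ext lines
    (text/labels) keep their extension. Tag lines (#...) pass through.
    """
    out = []
    for line in playlist.splitlines():
        m = CHUNK_LINE.match(line.strip())
        if m:
            hash_, ext = m.group(1), m.group(2) or ".ts"
            out.append(f"{cdn_base}/chunks/{hash_}{ext}")
        else:
            out.append(line)
    return "\n".join(out)
```

The `cdn_base` value here stands in for whatever CDN origin BSS is configured with.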
Client-Side Playback
The client fetches chunks directly from S3 — the server is only involved once (for the m3u8 index):
Client BSS S3
│ │ │
├─ GET /text-tracks/:id/ │ │
│ stream/:hash.m3u8 ────→│ │
│ (one request) ├─ Fetch + rewrite m3u8 │
│←── m3u8 with CDN URLs ──────│ │
│ │ │
├─ Parse m3u8 locally │ │
├─ Determine which chunks │ │
│ overlap current time range │ │
│ │ │
├─ GET chunks directly ───────────────────────────────────→│
│ (only the ones needed) │ │
│←── .vtt / .jsonl content ───────────────────────────────│
│ │ │
└─ Parse + render client-side │                    │
The client caches fetched chunks — repeat queries for the same time range hit the cache.
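The "determine which chunks overlap the time range" step can be sketched as below. This assumes, as in standard HLS, that each EXTINF duration applies to the chunk line that follows it; the function names are illustrative:

```python
def parse_playlist(m3u8_text):
    """Return [(start, end, chunk_url)] by accumulating EXTINF durations."""
    spans, t, dur = [], 0.0, None
    for line in m3u8_text.splitlines():
        line = line.strip()
        if line.startswith("#EXTINF:"):
            dur = float(line[len("#EXTINF:"):].rstrip(","))
        elif line and not line.startswith("#") and dur is not None:
            spans.append((t, t + dur, line))
            t += dur
            dur = None
    return spans


def chunks_for_range(spans, start, end):
    """Pick only the chunk URLs whose window overlaps [start, end)."""
    return [url for s, e, url in spans if s < end and e > start]
```

Only the URLs returned by chunks_for_range need to be fetched from S3; everything else stays untouched until the user scrubs into its window.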
Why 30-Second Windows
Text data is small — a 30-second VTT chunk is typically a few KB. Using 6-second segments (like video) would create too many tiny files. 30 seconds balances:
- Few enough chunks to keep the m3u8 small
- Granular enough that the client doesn't over-fetch
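The trade-off is easy to quantify for the 2-hour recording from the Problem section:

```python
recording_seconds = 2 * 60 * 60  # the 2-hour recording from the example

chunks_at_30s = recording_seconds // 30  # playlist entries at 30s windows
chunks_at_6s = recording_seconds // 6    # entries if text used video's 6s

assert chunks_at_30s == 240
assert chunks_at_6s == 1200
```

240 playlist entries keeps the m3u8 small; 1200 mostly-tiny files would add object overhead without improving fetch granularity for text.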
Segment Duration
| Media Type | Segment Duration | Reason |
|---|---|---|
| Video | 6s (configurable via SEGMENT_DURATION) | Matches HLS spec for smooth playback |
| Audio | 6s (same as video) | Consistent with video segments |
| Text/Labels | 30s (hardcoded TEXT_SEG_DURATION) | Text is tiny, fewer chunks is better |