Key Design
Text Track Splitting
Text tracks (VTT/SRT/JSONL) are split into time-windowed chunks for efficient time-range streaming. The client fetches chunks directly from S3 — no server in the data path.
Problem
A 2-hour recording with dense captions produces a large VTT file. Loading the entire file to display captions at a specific timestamp is wasteful. We need to load only the relevant portion.
Solution
Lambda splits the text track into 30-second chunks stored in S3. An m3u8 playlist indexes the chunks. The client parses the m3u8, then fetches only the chunks covering the visible time range — directly from S3.
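The 30-second grouping and chunk hashing at the core of the Lambda can be sketched as follows. This is a minimal illustration, not the actual Lambda code: the entry tuples, function names, and window constant are assumptions, though the SHA-256-truncated-to-16 hash matches the pipeline below.

```python
import hashlib

WINDOW_SECONDS = 30  # text track window size


def group_into_windows(entries, window_seconds=WINDOW_SECONDS):
    """Group (start_seconds, payload) entries into fixed-size time windows.

    `entries` is assumed to already be parsed from VTT/SRT cues or
    JSONL ts fields (step 2 of the pipeline).
    """
    windows = {}
    for start, payload in entries:
        index = int(start // window_seconds)
        windows.setdefault(index, []).append((start, payload))
    # Return windows in time order; empty windows produce no chunk.
    return [windows[i] for i in sorted(windows)]


def chunk_hash(chunk_bytes):
    """SHA256 truncated to 16 hex chars, used as the chunk filename."""
    return hashlib.sha256(chunk_bytes).hexdigest()[:16]
```

Each resulting window is then serialized as a chunk and uploaded under its hash, as the pipeline below describes.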
Pipeline
Raw text file in staging
│
├─ 1. Download from S3: {owner}/{project}/staging/{hash}
├─ 2. Parse into timed entries
│ VTT/SRT → cue timestamps (HH:MM:SS.mmm --> HH:MM:SS.mmm)
│ JSONL → ts/dt fields per line
├─ 3. Group entries into 30-second windows
├─ 4. For each window, write a chunk:
│ VTT/SRT → .vtt chunk (WEBVTT header + cues)
│ JSONL → .jsonl chunk (matching lines)
├─ 5. Hash each chunk (SHA256[:16])
│ Upload to chunks/{hash}.{vtt|jsonl}
├─ 6. Build m3u8 playlist with {hash}.{ext} entries
├─ 7. Upload playlist: tracks/text/{id}/stream/{streamHash}.m3u8
├─ 8. Update tracks/text/{id}/meta.json with streams[]
└─ 9. Callback PATCH to BSS
Chunk Format
Input format determines chunk format:
| Input | Chunk Extension | Content |
|---|---|---|
| VTT | .vtt | WEBVTT\n\n + cues in that time window |
| SRT | .vtt | Normalized to VTT format |
| JSONL | .jsonl | Lines whose ts falls in the window |
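Writing a VTT chunk for one window can be sketched like this (an illustrative sketch; the cue tuple shape and helper names are assumptions, but the output matches the WEBVTT-header-plus-cues layout in the table above):

```python
def format_timestamp(seconds):
    """Render seconds as HH:MM:SS.mmm, the cue timestamp form used in VTT."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    ms = round((seconds - int(seconds)) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def write_vtt_chunk(cues):
    """Build a .vtt chunk: WEBVTT header followed by the window's cues.

    `cues` is assumed to be a list of (start, end, text) tuples.
    """
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text)
        lines.append("")  # blank line terminates each cue
    return "\n".join(lines)
```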
M3U8 Format
Unlike video/audio (bare hashes), text track playlists include the file extension:
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:30
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:30.000,
a1b2c3d4e5f67890.vtt
#EXTINF:30.000,
1234567890abcdef.vtt
#EXT-X-ENDLIST
BSS's rewriteM3u8() handles both formats:
- Bare hashes → {cdnBase}/chunks/{hash}.ts (video/audio)
- {hash}.{ext} → {cdnBase}/chunks/{hash}.{ext} (text/labels)
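The rewrite rule can be sketched as follows. The function name rewriteM3u8 comes from the doc, but this signature and implementation are a sketch, not BSS's actual code:

```python
import re

# A chunk line is a bare 16-hex-char hash, optionally with an extension.
CHUNK_LINE = re.compile(r"^([0-9a-f]{16})(\.[a-z0-9]+)?$")


def rewrite_m3u8(playlist, cdn_base):
    """Rewrite chunk lines to absolute CDN URLs.

    Bare hashes (video/audio) get a .ts extension; hash.ext lines
    (text/labels) keep their extension. Tag lines (#...) pass through.
    """
    out = []
    for line in playlist.splitlines():
        m = CHUNK_LINE.match(line.strip())
        if m:
            hash_, ext = m.group(1), m.group(2) or ".ts"
            out.append(f"{cdn_base}/chunks/{hash_}{ext}")
        else:
            out.append(line)
    return "\n".join(out)
```

The `cdn_base` value here stands in for whatever CDN origin BSS is configured with.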
Client-Side Playback
The client fetches chunks directly from S3 — the server is only involved once (for the m3u8 index):
Client BSS S3
│ │ │
├─ GET /text-tracks/:id/ │ │
│ stream/:hash.m3u8 ────→│ │
│ (one request) ├─ Fetch + rewrite m3u8 │
│←── m3u8 with CDN URLs ──────│ │
│ │ │
├─ Parse m3u8 locally │ │
├─ Determine which chunks │ │
│ overlap current time range │ │
│ │ │
├─ GET chunks directly ───────────────────────────────────→│
│ (only the ones needed) │ │
│←── .vtt / .jsonl content ───────────────────────────────│
│ │ │
└─ Parse + render client-side │                    │
The client caches fetched chunks — repeat queries for the same time range hit the cache.
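The "determine which chunks overlap the time range" step can be sketched as below. This assumes, as in standard HLS, that each EXTINF duration applies to the chunk line that follows it; the function names are illustrative:

```python
def parse_playlist(m3u8_text):
    """Return [(start, end, chunk_url)] by accumulating EXTINF durations."""
    spans, t, dur = [], 0.0, None
    for line in m3u8_text.splitlines():
        line = line.strip()
        if line.startswith("#EXTINF:"):
            dur = float(line[len("#EXTINF:"):].rstrip(","))
        elif line and not line.startswith("#") and dur is not None:
            spans.append((t, t + dur, line))
            t += dur
            dur = None
    return spans


def chunks_for_range(spans, start, end):
    """Pick only the chunk URLs whose window overlaps [start, end)."""
    return [url for s, e, url in spans if s < end and e > start]
```

Only the URLs returned by chunks_for_range need to be fetched from S3; everything else stays untouched until the user scrubs into its window.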
Why 30-Second Windows
Text data is small — a 30-second VTT chunk is typically a few KB. Using 6-second segments (like video) would create too many tiny files. 30 seconds balances:
- Few enough chunks to keep the m3u8 small
- Granular enough that the client doesn't over-fetch
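The trade-off is easy to quantify for the 2-hour recording from the Problem section:

```python
recording_seconds = 2 * 60 * 60  # the 2-hour recording from the example

chunks_at_30s = recording_seconds // 30  # playlist entries at 30s windows
chunks_at_6s = recording_seconds // 6    # entries if text used video's 6s

assert chunks_at_30s == 240
assert chunks_at_6s == 1200
```

240 playlist entries keeps the m3u8 small; 1200 mostly-tiny files would add object overhead without improving fetch granularity for text.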
Segment Duration
| Media Type | Segment Duration | Reason |
|---|---|---|
| Video | 6s (configurable via SEGMENT_DURATION) | Matches HLS spec for smooth playback |
| Audio | 6s (same as video) | Consistent with video segments |
| Text/Labels | 30s (hardcoded TEXT_SEG_DURATION) | Text is tiny, fewer chunks is better |