# Semantic Video Search
Search video content using natural language. Upload any video format, and the system automatically transcodes, segments, embeds, and indexes it for semantic search.
## Architecture
```
Upload (.mp4, any codec)
        ↓
BSS → S3 staging
        ↓
Lambda: detect codec → transcode if needed → 2s HLS chunks
        ↓  chunks/{hash}.ts in S3
dreamlake vectorize
        ↓
CLI dispatches chunk jobs → Zaku queue (Redis)
        ↓
GPU Worker(s):
    download chunk → ffmpeg frame → CLIP ViT-L/14 (768d)
    LLaVA 13B → caption → CLIP text embed
        ↓
Worker writes directly to Qdrant
        ↓
GET /semantic-search?q=robot+arm+cup
        ↓
dreamlake-server: CLIP text embed → Qdrant nearest neighbor
        ↓
Return matched 2s clips with playback URLs
```

## Pipeline Steps
### 1. Upload
Any video format (H.264, AV1, VP9, etc.) is accepted. The CLI auto-detects the type and uploads via multipart to S3.
```
dreamlake upload ./video.mp4 --episode robotics@alice:run-042 --to /camera/front
```
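Conceptually, the multipart staging step is a plain S3 transfer. A minimal sketch using boto3 (the bucket and key here are hypothetical; the CLI derives them from `--episode` and `--to`):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# upload_file switches to multipart automatically above the threshold
config = TransferConfig(multipart_threshold=8 * 1024 * 1024,
                        multipart_chunksize=8 * 1024 * 1024)

# Hypothetical bucket/key, shown only to illustrate the S3 staging step
s3.upload_file("video.mp4", "dreamlake-staging",
               "robotics/alice/run-042/camera/front/video.mp4",
               Config=config)
```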
### 2. HLS Splitting (Lambda)
The Lambda function automatically:
- Probes the video codec
- If MPEG-TS compatible (H.264, HEVC): stream-copies into 2s `.ts` chunks
- If not (AV1, VP9, etc.): transcodes to H.264 with keyframes every 2s, then splits
- Uploads chunks to S3 at `chunks/{hash}.ts` (content-addressed, deduplicated)
- Creates an m3u8 playlist
Each 2s chunk = one atomic unit for vectorization and search.
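The core of that logic is two ffmpeg paths. A minimal sketch of what the Lambda runs, assuming `ffmpeg`/`ffprobe` are on the PATH (the actual handler adds S3 I/O and the content-hash renaming):

```python
import json
import subprocess

def split_to_hls(src: str, out_dir: str) -> None:
    # Probe the codec of the first video stream
    probe = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=codec_name", "-of", "json", src],
        capture_output=True, text=True, check=True)
    codec = json.loads(probe.stdout)["streams"][0]["codec_name"]

    if codec in ("h264", "hevc", "mpeg2video"):
        # MPEG-TS compatible: stream copy, no re-encode
        video_args = ["-c", "copy"]
    else:
        # AV1, VP9, etc.: transcode to H.264 with a keyframe every 2s
        video_args = ["-c:v", "libx264", "-c:a", "aac",
                      "-force_key_frames", "expr:gte(t,n_forced*2)"]

    subprocess.run(
        ["ffmpeg", "-i", src, *video_args,
         "-f", "hls", "-hls_time", "2", "-hls_list_size", "0",
         "-hls_segment_filename", f"{out_dir}/chunk_%05d.ts",
         f"{out_dir}/playlist.m3u8"],
        check=True)
```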
### 3. Vectorize
Two modes:
**Direct mode** — sequential, no infrastructure needed:
```
dreamlake vectorize --episode robotics@alice:run-042
```
**Distributed mode** — parallel via Zaku task queue:
```
dreamlake vectorize --episode robotics@alice:run-042 --zaku-url http://localhost:9000
```
Scoping:
| Flag | Scope |
|---|---|
| `--episode` | All videos in one episode |
| `--collection` | All episodes in a collection |
| `--dataset` | All collections in a dataset |
Per chunk, the worker (see the sketch after this list):
- Downloads the 2s `.ts` chunk from S3
- Extracts the middle frame via ffmpeg
- Runs CLIP ViT-L/14 → 768-dimensional image embedding
- Runs LLaVA 13B → natural language caption → CLIP text embedding
- Writes the point directly to Qdrant
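A minimal sketch of that loop, assuming the HuggingFace CLIP checkpoint and an already-created Qdrant collection. `video_chunks`, the Qdrant URL, and `run_llava` are placeholders, not the worker's actual names:

```python
import subprocess
import uuid

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")  # 768-d
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
qdrant = QdrantClient(url="http://localhost:6333")  # assumed endpoint

def process_chunk(ts_path: str, payload: dict) -> None:
    # Extract the middle frame of the 2s chunk (t = 1s)
    subprocess.run(["ffmpeg", "-y", "-ss", "1", "-i", ts_path,
                    "-frames:v", "1", "/tmp/frame.jpg"], check=True)
    frame = Image.open("/tmp/frame.jpg")

    with torch.no_grad():
        img_vec = clip.get_image_features(
            **proc(images=frame, return_tensors="pt"))[0]
        caption = run_llava(frame)  # placeholder for the LLaVA 13B captioner
        txt_vec = clip.get_text_features(
            **proc(text=caption, return_tensors="pt", truncation=True))[0]

    # One point per chunk, with both named vectors and the metadata payload
    qdrant.upsert(collection_name="video_chunks", points=[PointStruct(
        id=str(uuid.uuid4()),
        vector={"image": img_vec.tolist(), "caption": txt_vec.tolist()},
        payload={**payload, "caption": caption},
    )])
```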
### 4. Search
Natural language queries against the Qdrant vector database:
```
GET /namespaces/:ns/projects/:space/semantic-search?q=robot+picking+up+cup
```
Query parameters:
| Param | Description |
|---|---|
| `q` | Natural language search text (required) |
| `episode` | Scope to episode name |
| `collection` | Scope to collection name |
| `dataset` | Scope to dataset name |
| `limit` | Max results (default 10, max 50) |
| `using` | Vector type: `image` (default) or `caption` |
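For example, hitting the endpoint with Python `requests` (the base URL, namespace, and space here are made up):

```python
import requests

resp = requests.get(
    "http://localhost:8080/namespaces/robotics/projects/alice/semantic-search",
    params={"q": "robot picking up cup", "limit": 5, "using": "caption"},
)
for hit in resp.json()["results"]:
    print(f"{hit['score']:.3f}  {hit['timeStart']}-{hit['timeEnd']}s  {hit['caption']}")
```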
Response:
```json
{
"query": "robot picking up cup",
"results": [
{
"score": 0.211,
"videoId": "69e5e727...",
"episodeName": "run-042",
"chunkHash": "a3f8b2c1...",
"chunkIndex": 26,
"timeStart": 52,
"timeEnd": 54,
"caption": "A robotic arm reaches toward a red cup...",
"playUrl": "https://s3.../chunks/a3f8b2c1.ts"
}
],
"total": 10,
"using": "image"
}
```
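Server-side, the endpoint embeds the query text with CLIP and runs a nearest-neighbor search over the chosen named vector in Qdrant. A rough sketch of the lookup, where the collection name and payload key are assumptions:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

qdrant = QdrantClient(url="http://localhost:6333")  # assumed endpoint

def semantic_search(query_vec: list[float], episode: str | None = None,
                    limit: int = 10):
    # Optional payload filter to scope results to one episode
    flt = None
    if episode:
        flt = Filter(must=[FieldCondition(key="episodeName",
                                          match=MatchValue(value=episode))])
    return qdrant.search(
        collection_name="video_chunks",     # assumed name
        query_vector=("image", query_vec),  # named vector: "image" or "caption"
        query_filter=flt,
        limit=limit,
    )
```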
### 5. Playback
Each result includes a `playUrl` pointing to the 2s `.ts` chunk in S3. Playable directly in a browser:
```html
<video src="https://s3.../chunks/a3f8b2c1.ts" controls></video>
```
Or seek to the matched time in the full video using `videoId` + `timeStart`.
## Vector Storage
All vectors and metadata are stored in Qdrant (not MongoDB):
| Field | Description |
|---|---|
| `image` vector (768d) | CLIP ViT-L/14 image embedding |
| `caption` vector (768d) | CLIP text embedding of LLaVA caption |
| `videoId` | BSS video ID |
| `episodeId` | DreamLake episode ID |
| `projectId` | Namespace/space slug |
| `chunkHash` | S3 chunk key |
| `chunkIndex` | Position in m3u8 playlist |
| `timeStart` / `timeEnd` | Time range in seconds |
| `caption` | LLaVA-generated description |
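That layout maps onto a Qdrant collection with two named 768-d vectors per point. Creating it would look roughly like this (collection name assumed):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

qdrant = QdrantClient(url="http://localhost:6333")  # assumed endpoint

# Two named vectors per point; the payload carries the metadata fields above
qdrant.create_collection(
    collection_name="video_chunks",  # assumed name
    vectors_config={
        "image": VectorParams(size=768, distance=Distance.COSINE),
        "caption": VectorParams(size=768, distance=Distance.COSINE),
    },
)
```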
## Distributed Processing (Zaku)
When --zaku-url is provided, the CLI dispatches all chunk jobs to a Zaku task queue instead of processing sequentially.
```
CLI (dispatcher)              GPU Server
  add N jobs ──────────→ Zaku (Redis)
  poll count ←─────────     ↓ pop
                         Worker 1: process + write to Qdrant
                         Worker 2: process + write to Qdrant
  count == 0 → done         ...
```
Benefits:
- Parallel: multiple workers process chunks concurrently
- Resilient: failed jobs auto-retry (Zaku resets on exception)
- Detached: Ctrl+C stops the CLI, workers keep processing
- Scalable: add workers without changing the CLI
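Zaku's own API is documented separately; the reliable-queue pattern it implements over Redis looks roughly like the following sketch (illustrative only, not Zaku's actual interface; `process` stands in for the per-chunk worker from step 3):

```python
import json
import redis

r = redis.Redis()

def dispatch(jobs):                        # CLI side
    for job in jobs:
        r.lpush("chunks:pending", json.dumps(job))

def drained() -> bool:                     # CLI polls until the queue empties
    return r.llen("chunks:pending") == 0 and r.llen("chunks:inflight") == 0

def work():                                # GPU worker side
    while True:
        # Atomically move a job to an in-flight list so a crash can't lose it
        raw = r.brpoplpush("chunks:pending", "chunks:inflight", timeout=5)
        if raw is None:
            continue
        try:
            process(json.loads(raw))              # embed + write to Qdrant
            r.lrem("chunks:inflight", 1, raw)     # ack: drop from in-flight
        except Exception:
            # Reset on failure: push the job back for another worker to retry
            r.lrem("chunks:inflight", 1, raw)
            r.lpush("chunks:pending", raw)
```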
## Codec Support
The Lambda auto-transcodes any input codec to H.264 for MPEG-TS compatibility:
| Input Codec | Action |
|---|---|
| H.264 | Stream copy (fast) |
| HEVC / H.265 | Stream copy |
| MPEG-2 | Stream copy |
| AV1 | Transcode to H.264 |
| VP9 | Transcode to H.264 |
| Other | Transcode to H.264 |
## Performance
| Step | Time | Notes |
|---|---|---|
| Upload (3MB video) | ~2s | Multipart to S3 |
| Lambda split (60s video) | ~3s | AV1→H.264, 30 chunks |
| Vectorize per chunk | ~14s | CLIP + LLaVA 13B |
| Search query | ~50ms | CLIP text embed + Qdrant |
Storage: ~6KB per chunk (768d × 4 bytes × 2 vectors + payload). For 1 hour of video at 2s chunks: 1,800 points, ~11MB in Qdrant.
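The arithmetic behind those figures:

```python
dim, float_bytes, n_vectors = 768, 4, 2       # image + caption embeddings
per_chunk = dim * float_bytes * n_vectors     # 6,144 bytes ≈ 6KB, plus payload
points_per_hour = 3600 // 2                   # one point per 2s chunk = 1,800
print(per_chunk * points_per_hour / 1e6)      # ≈ 11 MB per hour of video
```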