Video Moment Finder

How it works

From an uploaded video to a searchable index of visual moments, in five steps.

Video Moment Finder treats video as a sequence of images, not a sequence of words. Instead of indexing transcripts and searching dialogue, we index the frames themselves with a multimodal vision-language model, then search that visual space directly. If you can describe what the moment would look like — or show an example frame — you can find it.

The pipeline

  1. Upload or import

    You upload a video file up to 30 minutes, or import a YouTube video you own. The upload is direct-to-storage via a presigned URL, so the file never round-trips through our API servers.

  2. Frame extraction

    A background worker samples frames from the video at a fixed cadence. We deliberately sample frames rather than process the raw video stream: it keeps the embedding cost proportional to video length, and it means every search hit maps back to a timestamp you can jump to.

  3. Multimodal embeddings (Qwen3-VL)

    Each frame is passed through Qwen3-VL, an open-weight vision-language model. The output is a dense vector that encodes what is visually and semantically present in the frame — objects, actions, scene composition, on-screen text — in a space that text queries also live in.

  4. Vector store

    Frame vectors are written to a vector database with row-level security scoping them to your account. A single video becomes a few hundred to a few thousand vectors depending on length.

  5. Search

    At query time, your text prompt or example image is embedded with the same model into the same space. We run an approximate nearest-neighbor search, rank the top matches, and return each hit with its timestamp and thumbnail.
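The five steps can be sketched end to end. This is a minimal illustration, not the production code: the embedding call is stubbed with a deterministic toy vector, the vector store is an in-memory list, and the sampling interval is an assumed value.

```python
import hashlib
import math

SAMPLE_INTERVAL_S = 2.0  # fixed sampling cadence (illustrative value, not the real one)

def embed(content: str) -> list[float]:
    """Stand-in for the Qwen3-VL embedding call: a deterministic toy unit vector."""
    digest = hashlib.sha256(content.encode()).digest()
    v = [float(b) + 1.0 for b in digest[:8]]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def index_video(frames: list[str]) -> list[dict]:
    """Steps 2-4: every sampled frame becomes a vector tied to a timestamp."""
    return [{"t": i * SAMPLE_INTERVAL_S, "vec": embed(f)} for i, f in enumerate(frames)]

def search(index: list[dict], query: str, k: int = 3) -> list[float]:
    """Step 5: embed the query into the same space, rank by cosine similarity,
    and return the timestamps of the top matches."""
    q = embed(query)
    score = lambda vec: sum(a * b for a, b in zip(vec, q))
    ranked = sorted(index, key=lambda row: -score(row["vec"]))
    return [row["t"] for row in ranked[:k]]

index = index_video(["frame_whiteboard", "frame_handshake", "frame_whiteboard"])
print(search(index, "frame_whiteboard", k=2))  # → [0.0, 4.0]
```

Because each frame keeps its timestamp alongside its vector, a search result is always a jumpable moment, not just a match score.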

Why Qwen3-VL

Qwen3-VL is an open-weight vision-language model trained jointly on images and text, which means a photo of a whiteboard and the phrase “a whiteboard with equations” end up close together in the same embedding space. That joint space is what makes text-to-frame and image-to-frame search work with the same index.
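The geometry of that joint space is easy to picture with hand-assigned toy vectors (the real space has thousands of dimensions, and these three-component values are made up purely to illustrate the idea): a text query lands closer to the frame it describes than to an unrelated one.

```python
# Toy three-dimensional stand-ins for joint-space embeddings (invented values).
frame_whiteboard = [0.9, 0.1, 0.0]   # a photo of a whiteboard
frame_beach      = [0.0, 0.2, 0.95]  # an unrelated outdoor frame
text_query       = [0.85, 0.15, 0.05]  # "a whiteboard with equations"

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: the ranking signal for nearest-neighbor search."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# The text query ranks the whiteboard frame above the beach frame.
print(cosine(text_query, frame_whiteboard) > cosine(text_query, frame_beach))  # → True
```

An example image embeds into the same space the same way, which is why text-to-frame and image-to-frame search can share one index.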

We picked it over alternatives for three reasons. First, it is open-weight, which matters for a project shipped under AGPLv3 and removes a hard dependency on a proprietary embedding API. Second, its training explicitly includes visual reasoning and on-screen text, both of which matter for real-world video (slides, UI recordings, signage). Third, the model is small enough to run cost-effectively at the scale of hundreds to thousands of frames per video without making the unit economics fall apart.

Accuracy tradeoffs

Best at

Describing visually distinctive moments in plain language ("a whiteboard with a diagram of a neural network", "two people shaking hands outdoors"), or showing an example frame and asking for moments that look like it.

Weaker at

Precise dialogue recall or anything that hinges on spoken words — a transcript search will usually beat us there. Precision also drops on very subtle visual distinctions, where many frames look nearly identical.

Practical tips

Describe what would be on screen, not what is being said. If the first query misses, rephrase rather than going deeper into the results — a better query beats a longer scroll. For image-based search, pick an example frame that is visually clean and unambiguous.

Results are AI-generated and won't always be right — see the support page for more on expected accuracy and refunds.

Building on top

The same pipeline is exposed as an HTTP API and as a remote MCP server, so agents and scripts can upload, check status, and search videos directly. See the developers page for the MCP connector and OpenAPI reference.
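As a sketch of what a script on top of the API might look like, here is a minimal client skeleton for the upload → check status → search flow. Every endpoint path and field name below is hypothetical; the actual contract is defined by the OpenAPI reference on the developers page.

```python
from dataclasses import dataclass

@dataclass
class MomentFinderClient:
    """Illustrative client skeleton only. The paths ("/videos",
    "/videos/{id}/search") are invented for this example."""
    base_url: str
    api_key: str

    def _url(self, path: str) -> str:
        return f"{self.base_url.rstrip('/')}/{path.lstrip('/')}"

    def upload_url(self) -> str:
        # Where a script would request a presigned upload URL (hypothetical path).
        return self._url("/videos")

    def status_url(self, video_id: str) -> str:
        # Where a script would poll indexing progress (hypothetical path).
        return self._url(f"/videos/{video_id}")

    def search_url(self, video_id: str) -> str:
        # Where a script would send a text or image query (hypothetical path).
        return self._url(f"/videos/{video_id}/search")
```

The same three operations are what an agent reaches over the MCP connector, just surfaced as tools instead of HTTP routes.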

Last updated 2026-04-21.