Visual analysis
Sample and analyze frames from your imported media — useful for smart reframe, B-roll selection, and moment detection.
Give the AI agent "eyes" on your footage. Visual analysis samples frames, labels content, and finds best moments — improving highlight extraction and reframe decisions.
What it does
- Samples sparse frames from imported media: main track, overlay, or all footage
- Caches samples for agent use
- Can label samples (manually or via a vision model) with:
  - Scene summaries
  - On-screen text
  - Tags (face, product, screen, etc.)
  - Best moments
- Used by extract best moments, smart reframe, and B-roll selection
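To make the label types above concrete, one cached sample could be modeled roughly as follows. This is a minimal TypeScript sketch: the names `FrameSample`, `SourceScope`, `LabelOrigin`, and `samplesWithTag`, along with every field name, are hypothetical and are not the product's actual schema.

```typescript
// Hypothetical shape of one cached frame sample; all names are illustrative.
type SourceScope = "main" | "overlay" | "all";
type LabelOrigin = "manual" | "vision-model";

interface FrameSample {
  assetId: string;          // imported asset the frame was taken from
  timeSec: number;          // position of the sampled frame within that asset
  scope: SourceScope;       // sampling scope that produced the frame
  sceneSummary?: string;    // short description of the shot
  onScreenText?: string[];  // text visible in the frame
  tags: string[];           // e.g. "face", "product", "screen"
  bestMoment: boolean;      // flagged as a highlight candidate
  labeledBy?: LabelOrigin;  // absent until the sample has been labeled
}

// Returns only the samples that carry the given tag.
function samplesWithTag(samples: FrameSample[], tag: string): FrameSample[] {
  return samples.filter((s) => s.tags.indexOf(tag) !== -1);
}
```

A downstream tool such as smart reframe could then ask for `samplesWithTag(samples, "face")` instead of re-inspecting the footage.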
When to use it
Run visual analysis before tools that benefit from visual context:
- Before extract best moments → finds visually interesting shots
- Before smart reframe → identifies faces, products, screens
- Before B-roll insert → picks visually appealing B-roll
- Before creator template → enables scene-aware pacing decisions
How to use
Basic sampling:
"Sample visual analysis"
"Analyze the footage"
Targeted sampling:
"Sample frames from just the main track"
"Analyze the imported B-roll assets"
Analysis and caching only:
"Sample visual context and cache it for later"
What gets sampled
Source options:
| Source | Coverage |
|--------|----------|
| main track | Only main video clips |
| overlay tracks | Visual overlay elements |
| all imported | Everything in media bin |
Labelling options:
| Label | Helpful for |
|-------|-------------|
| Scene summary | Contextual editing decisions |
| On-screen text | Avoid cropping rendered text |
| Face tags | Face-biased reframe |
| Product tags | Product-focused editing |
| Best moments | Highlight extraction |
Caching and storage
- Samples stored in IndexedDB (browser storage)
- Persists across sessions
- Re-sampling overwrites the previously cached samples
- Per-project, not shared across projects
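The caching rules above (per-project storage, overwrite on re-sample) can be illustrated with a small stand-in. The real cache lives in IndexedDB. The sketch below uses a plain in-memory object so the semantics are visible without browser APIs. The `SampleCache` class, its methods, and the assumption that a re-sample replaces the whole project's samples are all hypothetical.

```typescript
// In-memory stand-in for the IndexedDB-backed cache (illustrative only).
interface CachedSample {
  assetId: string;
  timeSec: number;
  tags: string[];
}

class SampleCache {
  // Keyed by project ID, so projects never see each other's samples.
  private store: Record<string, CachedSample[]> = {};

  // Assumption: a re-sample replaces everything cached for the project.
  save(projectId: string, samples: CachedSample[]): void {
    this.store[projectId] = samples.slice();
  }

  // Returns an empty list when the project has never been sampled.
  load(projectId: string): CachedSample[] {
    return this.store[projectId] ?? [];
  }
}
```

Under these assumptions, saving twice for the same project leaves only the second batch, and loading a different project returns nothing.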
How it helps
Extract best moments
Without visual analysis:
- Uses audio energy only
With visual analysis:
- Audio energy + visual best-moment hints
- Action shots, reactions, dynamic scenes score higher
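One way to picture the difference is a score that falls back to audio energy when no visual hint is cached and adds a boost when a segment is flagged as a visual best moment. This is a sketch under stated assumptions, not the tool's actual scoring: `Candidate`, `scoreCandidate`, `rankCandidates`, the 0..1 energy scale, and the `visualWeight` value are all hypothetical.

```typescript
// Hypothetical highlight candidate; names and scales are illustrative.
interface Candidate {
  startSec: number;
  endSec: number;
  audioEnergy: number;         // assumed to be normalized to 0..1
  visualBestMoment?: boolean;  // undefined when no visual analysis is cached
}

// visualWeight is an assumed tuning value, not a documented setting.
function scoreCandidate(c: Candidate, visualWeight = 0.5): number {
  if (c.visualBestMoment === undefined) {
    return c.audioEnergy; // without visual analysis: audio energy only
  }
  return c.audioEnergy + (c.visualBestMoment ? visualWeight : 0);
}

// Sorts a copy of the candidates, highest score first.
function rankCandidates(candidates: Candidate[]): Candidate[] {
  return candidates.slice().sort((a, b) => scoreCandidate(b) - scoreCandidate(a));
}
```

With these numbers, a quieter segment flagged as a best moment (energy 0.25) outranks a louder unflagged one (energy 0.5), which matches the idea that action shots and reactions score higher.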
Smart reframe
Without visual analysis:
- Heuristic crop centers on detected motion
With visual analysis:
- Face/product/screen labels guide crop bias
- "Face" tag → upward bias (keeps eyes visible)
- "Product" tag → safe crop (preserves detail)
- "Screen" tag → fit mode (no crop)
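The tag-to-crop mapping above can be sketched as a small lookup. The function name `chooseCropMode`, the `CropMode` values, and especially the priority order when a frame carries several tags (screen, then product, then face) are assumptions made for illustration. The source does not say how conflicting tags are resolved.

```typescript
// Hypothetical crop modes mirroring the behaviors listed above.
type CropMode = "motion-center" | "face-up-bias" | "safe-crop" | "fit";

function hasTag(tags: string[], tag: string): boolean {
  return tags.indexOf(tag) !== -1;
}

// Assumed priority: the most conservative treatment wins when tags overlap.
function chooseCropMode(tags?: string[]): CropMode {
  if (!tags || tags.length === 0) {
    return "motion-center"; // no visual analysis: center on detected motion
  }
  if (hasTag(tags, "screen")) return "fit";          // no crop
  if (hasTag(tags, "product")) return "safe-crop";   // preserve detail
  if (hasTag(tags, "face")) return "face-up-bias";   // keep eyes visible
  return "motion-center";                            // unrecognized tags
}
```

For example, `chooseCropMode(["face"])` selects the upward-biased crop, while calling it with no tags falls back to the motion-centered heuristic.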
B-roll insert
- Labels help pick relevant B-roll (e.g., "reaction" footage for reaction inserts)
- Best-moment labels rank B-roll quality
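As a rough illustration of both points, B-roll selection could filter assets by a wanted label and then put best-moment assets first. The names `BrollAsset` and `pickBroll` and the field layout are hypothetical. The actual ranking logic is not described in this page.

```typescript
// Hypothetical B-roll asset summary built from cached visual labels.
interface BrollAsset {
  assetId: string;
  tags: string[];       // labels from visual analysis, e.g. "reaction"
  bestMoment: boolean;  // true if a sample from this asset was flagged
}

// Keeps assets carrying the wanted tag, best-moment assets first.
function pickBroll(assets: BrollAsset[], wantedTag: string): BrollAsset[] {
  return assets
    .filter((a) => a.tags.indexOf(wantedTag) !== -1)
    .sort((a, b) => Number(b.bestMoment) - Number(a.bestMoment));
}
```

Because `filter` returns a new array, the input list is left untouched and only the filtered copy is reordered.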
Limitations
- Sparse sampling: not every frame is analyzed; roughly one frame is sampled every few seconds
- Cached: analysis reflects the most recent sampling run, so re-sample after changing footage
- Vision-dependent: auto-labeling requires a configured vision model
- Heuristic: labels are best-effort and may be wrong
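To see what sparse sampling means in practice, the sketch below computes which timestamps would be sampled for a clip at a fixed interval. The function `sampleTimestamps` and the 3-second interval in the usage example are assumptions. The actual interval is only described as "a few seconds".

```typescript
// Returns the timestamps (in seconds) at which frames would be sampled.
// intervalSec is an assumed parameter, not a documented setting.
function sampleTimestamps(durationSec: number, intervalSec: number): number[] {
  if (durationSec <= 0 || intervalSec <= 0) {
    return [];
  }
  const times: number[] = [];
  for (let t = 0; t < durationSec; t += intervalSec) {
    times.push(t);
  }
  return times;
}

// A 10-second clip sampled every 3 seconds yields four frames.
const example = sampleTimestamps(10, 3); // [0, 3, 6, 9]
```

Anything that happens between two sampled timestamps, such as a brief reaction at second 4, is invisible to the analysis, which is why labels should be treated as hints.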
Configuration
Vision labeling is optional and must be configured before automatic labeling is available.
Vision analysis can be provided by:
- Claude vision capabilities
- Azure OpenAI GPT-4 Vision
- AWS Bedrock vision models
If not configured, sampling still caches frames for manual label entry.
Tips
- Sample before key tools — workflow: import → sample visual → extract highlights/reframe
- Re-sample after major changes — new imports, long sessions
- Targeted sampling — full analysis can take time; sample just what you need
- Not required — all tools work without visual analysis; it's an enhancement
See also
- Extract best moments — uses visual best-moment hints
- Smart reframe — uses face/product/screen labels
- Insert B-roll — ranks B-roll by visual quality