Video Narration
Upload a video and the AI reads its key frames, writes a narration script based on what it sees, then speaks that script in a voice and style you choose. No manual scripting, no recording setup.
Supported formats: MP4, MOV, WEBM, AVI, MKV (max 50MB)
How It Works
- Upload a video (screen recording, tutorial, demo, etc.)
- Choose a voice, narration style, and language
- AI analyzes the video, writes a script, and adds voiceover automatically
Tips
- Screen recordings and tutorials work best
- Keep videos under 5 minutes for fastest results
- Processing takes 2-5 minutes depending on video length
- The AI describes what it sees happening on screen
Why Choose Video Narration
Frame-Level Vision
Florence-2 reads key frames and identifies UI elements, on-screen text, buttons, menus, and visual transitions. The narration script reflects what is actually visible, not a generic summary.
10+ Kokoro Voices
Pick from over ten Kokoro TTS voices, including American male, American female, and British options. Each voice renders at a quality level suited for published tutorials, demos, and training content.
Script to Merged Video
The pipeline ends with a complete MP4 file. Frame extraction, script generation, voice synthesis, and audio merge all happen server-side. You upload one file and download one file.
Five Narration Styles
Tutorial, Professional, Casual, Energetic, and Documentary are distinct prompt modes, not cosmetic labels. Each produces a different sentence structure, pacing, and tone in the final script.
AI-Powered Video Analysis & Narration
Florence-2 Vision + Kokoro TTS
Video Narration runs a four-stage pipeline. Key frames are pulled from your video at regular intervals. Florence-2 analyzes each frame and produces a structured description of what it sees, including text, UI elements, and visual changes. An LLM then chains these descriptions into a single narration script that flows like a real presenter speaking through your content.
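The frame sampling and script-building stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the service's actual implementation: the 2-second default interval, the function names, and the `describe_frame` and `write_script` callables are all hypothetical. In a real system those callables would wrap a Florence-2 inference call and an LLM request; here they are simple stand-ins so the sketch runs without any model installed.

```python
from typing import Callable, List


def sample_timestamps(duration_s: float, interval_s: float = 2.0) -> List[float]:
    """Return timestamps (in seconds) at regular intervals across the video."""
    if duration_s <= 0 or interval_s <= 0:
        raise ValueError("duration_s and interval_s must be positive")
    count = int(duration_s // interval_s) + 1
    # Keep only timestamps that fall strictly inside the video.
    return [i * interval_s for i in range(count) if i * interval_s < duration_s]


def build_script(
    timestamps: List[float],
    describe_frame: Callable[[float], str],   # hypothetical vision-model wrapper
    write_script: Callable[[List[str]], str],  # hypothetical LLM wrapper
) -> str:
    """Describe each sampled frame, then chain the descriptions into one script."""
    descriptions = [f"[{t:.1f}s] {describe_frame(t)}" for t in timestamps]
    return write_script(descriptions)


# Stand-ins for the vision model and the LLM (assumptions for illustration only).
def fake_describe(t: float) -> str:
    return f"frame at {t:.0f}s shows the settings menu"


def fake_write(descriptions: List[str]) -> str:
    return " ".join(descriptions)


stamps = sample_timestamps(duration_s=5, interval_s=2)
script = build_script(stamps, fake_describe, fake_write)
```

Passing the model calls in as functions keeps the sampling and chaining logic testable on its own, independent of whichever vision model or LLM sits behind them.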
Kokoro TTS converts the script to audio using the voice and style you selected. The resulting audio track is timed and merged with your original video file, preserving your visuals exactly while adding the spoken layer. The final output is a standard MP4.
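One common way to do this kind of merge is with the ffmpeg command-line tool. The page does not say which tool the service uses, so treat the following as an assumption. The helper only builds the argument list (the function name and file names are hypothetical): it copies the video stream without re-encoding, which is what keeps the visuals unchanged, and encodes the narration as AAC.

```python
from typing import List


def build_merge_command(video_path: str, narration_path: str, output_path: str) -> List[str]:
    """Build an ffmpeg argument list that adds a narration track to a video."""
    return [
        "ffmpeg",
        "-i", video_path,      # input 0: the original video
        "-i", narration_path,  # input 1: the synthesized narration audio
        "-map", "0:v:0",       # take the first video stream from input 0
        "-map", "1:a:0",       # take the first audio stream from input 1
        "-c:v", "copy",        # copy the picture as-is, no re-encode
        "-c:a", "aac",         # encode the narration as AAC
        output_path,
    ]


cmd = build_merge_command("demo.mp4", "narration.wav", "demo_narrated.mp4")
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. Stream copy only works when the source video codec can be stored in an MP4 container; otherwise the video would need to be re-encoded.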
Frequently Asked Questions
The tool extracts frames from your video, runs each frame through Florence-2 to identify what is visible on screen, feeds those descriptions to an LLM to write a narration script, synthesizes that script with Kokoro TTS using your chosen voice and style, then merges the audio back into your original video.
The tool works best on videos where visual content carries the story: screen recordings, software tutorials, product demos, and app walkthroughs. Florence-2 is particularly strong at reading on-screen text, identifying buttons and menus, and tracking UI state changes between frames.
Processing usually takes 2 to 5 minutes. A one-minute video typically processes in about 2 minutes. A five-minute video may take up to 5 minutes. Frame count and resolution both affect speed. A progress indicator shows where the job is in the pipeline.
The tool accepts MP4, MOV, WEBM, AVI, and MKV files up to 50MB. Higher resolution sources give Florence-2 more detail to work with, which improves script accuracy. Output is always MP4.
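A pre-upload check for these limits might look like the sketch below. The function name is hypothetical, and interpreting "50MB" as 50 × 1024 × 1024 bytes is an assumption, since the page does not say how the limit is measured.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp4", ".mov", ".webm", ".avi", ".mkv"}
MAX_BYTES = 50 * 1024 * 1024  # assumption: "50MB" is measured in binary megabytes


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file type is not accepted or the file is too large."""
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {extension or 'no extension'}")
    if size_bytes > MAX_BYTES:
        raise ValueError("file exceeds the 50MB limit")


validate_upload("Demo.MOV", 10_000_000)  # passes silently
```

Checking the extension is only a first filter; a server would still need to inspect the file contents before processing.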
You can choose the voice. Kokoro TTS provides over ten voices, including American male, American female, and British options. Pair a voice with a narration style to tune both the sound and the script structure for your content.
There are five styles, and each changes how the LLM writes the script. Tutorial produces numbered steps and action cues. Professional keeps language formal and direct. Casual writes in first person with a conversational rhythm. Energetic uses short sentences and emphasis suited for promotional clips. Documentary narrates in third person with an informational tone.
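A style that acts as a prompt mode can be pictured as a lookup from style name to an instruction placed ahead of the frame descriptions. The sketch below is an assumption about how such a prompt could be assembled; the wording of each instruction is paraphrased from the descriptions above, and the names `STYLE_INSTRUCTIONS` and `build_prompt` are hypothetical.

```python
from typing import List

# Instruction text paraphrases the style descriptions; not the service's real prompts.
STYLE_INSTRUCTIONS = {
    "tutorial": "Write numbered steps with clear action cues.",
    "professional": "Keep the language formal and direct.",
    "casual": "Write in first person with a conversational rhythm.",
    "energetic": "Use short sentences and emphasis suited to promotional clips.",
    "documentary": "Narrate in third person with an informational tone.",
}


def build_prompt(style: str, frame_descriptions: List[str]) -> str:
    """Combine the style instruction with the frame descriptions into one prompt."""
    key = style.lower()
    if key not in STYLE_INSTRUCTIONS:
        raise ValueError(f"unknown style: {style}")
    frames = "\n".join(f"- {d}" for d in frame_descriptions)
    return f"{STYLE_INSTRUCTIONS[key]}\n\nFrame descriptions:\n{frames}"


prompt = build_prompt("Tutorial", ["Login page is visible", "User clicks Sign in"])
```

Because the style changes the instruction rather than a label attached afterwards, the same frame descriptions can yield differently structured scripts.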
The narration is grounded in what is on screen. Florence-2 is a vision model trained on detailed image-text pairs. It identifies on-screen text, UI controls, layout regions, and visual transitions. The LLM then converts those structured descriptions into spoken narration, so the script matches what viewers actually see.
Narration output is available in English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, and Chinese. Florence-2 analyzes the visual content independent of any text language displayed on screen, so the source video language does not affect frame analysis.
You can add background music. Pick a built-in track from categories like corporate, ambient, and upbeat, or upload your own audio file. A volume control lets you keep the music low enough that it stays behind the voiceover instead of competing with it.
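Keeping music behind a voiceover is commonly done by lowering the music level before mixing the two tracks. The sketch below builds an ffmpeg filter graph using the `volume` and `amix` filters. Using ffmpeg here is an assumption, as are the function name, the 0.15 default level, and the input order (narration as input 0, music as input 1).

```python
def build_music_mix_filter(music_volume: float = 0.15) -> str:
    """Return an ffmpeg filter graph that mixes music (input 1) under narration (input 0)."""
    if not 0.0 <= music_volume <= 1.0:
        raise ValueError("music_volume must be between 0.0 and 1.0")
    return (
        f"[1:a]volume={music_volume:.2f}[music];"          # attenuate the music track
        "[0:a][music]amix=inputs=2:duration=first[mixed]"  # mix, ending with the narration
    )


filter_graph = build_music_mix_filter(0.15)
```

The resulting string would be passed to ffmpeg through `-filter_complex`, with `[mixed]` selected as the output audio. Note that `amix` may rescale its inputs, so the final levels are worth checking by ear.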
Your custom instruction is passed directly to the LLM when it builds the narration script. Use it to specify a target audience, instruct the AI to emphasize a particular feature, add a closing call-to-action, or skip sections you do not want narrated. The more specific your instruction, the more precisely the script adapts.
Video Narration vs Other Methods
| Feature | Luxoret AI | Manual / Traditional | Other Tools |
|---|---|---|---|
| Cost per Use | $0.15 | $200-$1000+ per project | $0.20-$0.50 per video |
| Speed | Minutes, not hours | Hours of manual editing | Varies by complexity |
| Skill Required | None — AI handles it | Video editing expertise | Moderate learning curve |
| Software | Browser-based, nothing to install | Expensive editing suite | Desktop app required |
| Quality | AI-enhanced, professional | Depends on editor skill | Template-dependent |
| Revisions | Instant re-processing | Re-edit from scratch | Limited by plan |