Video Narration

Upload a video and the AI analyzes key frames, writes a narration script based on what it sees, then speaks that script in a voice and style you choose. No manual scripting, no recording setup.

How It Works

  1. Upload a video (screen recording, tutorial, demo, etc.)
  2. Choose a voice, narration style, and language
  3. AI analyzes the video, writes a script, and adds voiceover automatically

Tips

  • Screen recordings and tutorials work best
  • Keep videos under 5 minutes for fastest results
  • Processing takes 2-5 minutes depending on video length
  • The AI describes what it sees happening on screen

Why Choose Video Narration

Frame-Level Vision

Florence-2 reads key frames and identifies UI elements, on-screen text, buttons, menus, and visual transitions. The narration script reflects what is actually visible, not a generic summary.

10+ Kokoro Voices

Pick from over ten Kokoro TTS voices, including American male, American female, and British options. Each voice renders at a quality level suited to published tutorials, demos, and training content.

Script to Merged Video

The pipeline ends with a complete MP4 file. Frame extraction, script generation, voice synthesis, and audio merge all happen server-side. You upload one file and download one file.

Five Narration Styles

Tutorial, Professional, Casual, Energetic, and Documentary are distinct prompt modes, not cosmetic labels. Each produces a different sentence structure, pacing, and tone in the final script.

Perfect For

  • Screen Recordings
  • Tutorials
  • Product Demos
  • App Walkthroughs
  • Training Videos
  • Social Media Content
  • Documentation

AI-Powered Video Analysis & Narration

Florence-2 Vision + Kokoro TTS

Video Narration runs a four-stage pipeline. Key frames are pulled from your video at regular intervals. Florence-2 analyzes each frame and produces a structured description of what it sees, including text, UI elements, and visual changes. An LLM then chains these descriptions into a single narration script that flows like a real presenter speaking through your content.
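
As a rough illustration of pulling frames "at regular intervals", the sketch below computes the sample timestamps and builds a matching ffmpeg command. The 2-second interval, the function names, and the output file pattern are assumptions made for this example; the tool's actual sampling rate and extraction method are not documented here.

```python
from typing import List


def sample_timestamps(duration_s: float, interval_s: float = 2) -> List[float]:
    """Return the timestamps (seconds) at which a frame would be sampled.

    The 2-second default is an assumed value, not the tool's real setting.
    """
    if duration_s <= 0 or interval_s <= 0:
        raise ValueError("duration_s and interval_s must be positive")
    count = int(duration_s // interval_s) + 1
    stamps = [i * interval_s for i in range(count)]
    # A stamp that lands exactly on the end of the video has no frame to read.
    return [t for t in stamps if t < duration_s]


def build_extract_command(video_path: str, out_pattern: str,
                          interval_s: int = 2) -> List[str]:
    """Build an ffmpeg command that writes one image every interval_s seconds.

    Uses ffmpeg's fps filter; out_pattern is a hypothetical name such as
    "frames/frame_%04d.png".
    """
    return ["ffmpeg", "-i", video_path, "-vf", f"fps=1/{interval_s}", out_pattern]


if __name__ == "__main__":
    print(sample_timestamps(10, 2))
    print(" ".join(build_extract_command("demo.mp4", "frames/frame_%04d.png")))
```

A shorter interval gives the vision model more frames to describe at the cost of longer processing, which is consistent with the note that frame count affects speed.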

Kokoro TTS converts the script to audio using the voice and style you selected. The resulting audio track is timed and merged with your original video file, preserving your visuals exactly while adding the spoken layer. The final output is a standard MP4.
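
One common way to add a narration track while leaving the picture untouched is to copy the video stream and encode only the audio. The sketch below builds such an ffmpeg command; it is an assumption about how a merge like this could be done, not a description of the tool's server-side code, and the function name and file names are hypothetical.

```python
from typing import List


def build_merge_command(video_path: str, narration_path: str,
                        output_path: str) -> List[str]:
    """Build an ffmpeg command that pairs the original video with new audio.

    Assumes the source video codec can be stored in an MP4 container;
    if it cannot, the video would need re-encoding instead of "copy".
    """
    return [
        "ffmpeg", "-y",
        "-i", video_path,        # input 0: original video
        "-i", narration_path,    # input 1: synthesized narration
        "-map", "0:v:0",         # take the video stream from input 0
        "-map", "1:a:0",         # take the audio stream from input 1
        "-c:v", "copy",          # stream copy: frames are not re-encoded
        "-c:a", "aac",           # encode narration as AAC for MP4
        output_path,
    ]


if __name__ == "__main__":
    print(" ".join(build_merge_command("demo.mp4", "narration.wav", "narrated.mp4")))
```

Stream copy is what keeps the visuals bit-for-bit identical; only the audio track is new.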

Frequently Asked Questions

The tool extracts frames from your video, runs each frame through Florence-2 to identify what is visible on screen, feeds those descriptions to an LLM to write a narration script, synthesizes that script with Kokoro TTS using your chosen voice and style, then merges the audio back into your original video.

Videos where visual content carries the story: screen recordings, software tutorials, product demos, and app walkthroughs. Florence-2 is particularly strong at reading on-screen text, identifying buttons and menus, and tracking UI state changes between frames.

Usually 2 to 5 minutes. A one-minute video typically processes in about 2 minutes. A five-minute video may take up to 5 minutes. Frame count and resolution both affect speed. A progress indicator shows where the job is in the pipeline.

MP4, MOV, WEBM, AVI, and MKV files up to 50MB. Higher resolution sources give Florence-2 more detail to work with, which improves script accuracy. Output is always MP4.
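
If you prepare uploads from a script, a quick local check against these limits can save a failed upload. The sketch below is a minimal example; the function name is hypothetical, and it assumes "50MB" means 50 × 1024 × 1024 bytes, which the page does not specify.

```python
from pathlib import Path
from typing import List

ALLOWED_EXTENSIONS = {".mp4", ".mov", ".webm", ".avi", ".mkv"}
MAX_BYTES = 50 * 1024 * 1024  # assumption: the 50MB limit is counted in MiB


def check_upload(filename: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    return problems


if __name__ == "__main__":
    print(check_upload("demo.MP4", 10_000_000))
    print(check_upload("notes.pdf", 60 * 1024 * 1024))
```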

Yes. Kokoro TTS provides over ten voices including American male, American female, and British options. Pair a voice with a narration style to tune both the sound and the script structure for your specific content.

Five styles that change how the LLM writes the script. Tutorial produces numbered steps and action cues. Professional keeps language formal and direct. Casual writes in first person with a conversational rhythm. Energetic uses short sentences and emphasis suited for promotional clips. Documentary narrates in third person with an informational tone.
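
To make the idea of styles as prompt modes concrete, the sketch below maps each style to an instruction and assembles a prompt from ordered frame descriptions. The instruction wording, the prompt layout, and all names are invented for illustration; the tool's real prompts are not published.

```python
from typing import Dict, List

# Hypothetical instructions paraphrasing the style descriptions above.
STYLE_INSTRUCTIONS: Dict[str, str] = {
    "tutorial": "Write numbered steps with clear action cues.",
    "professional": "Keep the language formal and direct.",
    "casual": "Write in first person with a conversational rhythm.",
    "energetic": "Use short sentences and emphasis, as in a promotional clip.",
    "documentary": "Narrate in third person with an informational tone.",
}


def build_script_prompt(frame_descriptions: List[str], style: str = "tutorial",
                        language: str = "English") -> str:
    """Combine a style instruction and frame descriptions into one LLM prompt."""
    key = style.lower()
    if key not in STYLE_INSTRUCTIONS:
        raise ValueError(f"unknown style: {style}")
    frames = [f"Frame {i}: {text}"
              for i, text in enumerate(frame_descriptions, start=1)]
    return "\n".join([
        f"Write a narration script in {language}.",
        STYLE_INSTRUCTIONS[key],
        "Frame descriptions, in order:",
        *frames,
    ])


if __name__ == "__main__":
    print(build_script_prompt(["Login page with email field", "Dashboard loads"],
                              style="Casual"))
```

Because the style changes the instruction given to the LLM rather than the voice, the same frame descriptions can yield differently structured scripts.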

Yes. Florence-2 is a vision model trained on detailed image-text pairs. It identifies on-screen text, UI controls, layout regions, and visual transitions. The LLM then converts those structured descriptions into spoken narration, so the script matches what viewers actually see.

Narration output is available in English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, and Chinese. Florence-2 analyzes the visual content independent of any text language displayed on screen, so the source video language does not affect frame analysis.

Yes. You can pick a built-in track from categories like corporate, ambient, and upbeat, or upload your own audio file. A volume control lets you set the music level low enough to stay behind the voiceover without competing with it.
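
Keeping music behind the voiceover usually comes down to lowering the music gain before mixing. The sketch below builds an ffmpeg command using the volume and amix filters; the 0.1 default level, the function name, and the file names are assumptions, and this is not necessarily how the tool mixes audio.

```python
from typing import List


def build_music_mix_command(narration_path: str, music_path: str,
                            output_path: str,
                            music_volume: float = 0.1) -> List[str]:
    """Build an ffmpeg command that mixes quiet music under a narration track.

    music_volume is a linear gain between 0 and 1 (0.1 is an assumed default).
    """
    if not 0.0 <= music_volume <= 1.0:
        raise ValueError("music_volume must be between 0 and 1")
    filter_graph = (
        f"[1:a]volume={music_volume:.2f}[bg];"          # lower the music level
        "[0:a][bg]amix=inputs=2:duration=first[mix]"    # stop when narration ends
    )
    return [
        "ffmpeg", "-y",
        "-i", narration_path,   # input 0: voiceover
        "-i", music_path,       # input 1: background track
        "-filter_complex", filter_graph,
        "-map", "[mix]",
        output_path,
    ]


if __name__ == "__main__":
    print(" ".join(build_music_mix_command("narration.wav", "ambient.mp3", "mix.wav")))
```

The amix filter may rescale input levels when combining tracks, so the final balance is worth checking by ear.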

It is passed directly to the LLM when building the narration script. Use it to specify a target audience, instruct the AI to emphasize a particular feature, add a closing call-to-action, or skip sections you do not want narrated. The more specific your instruction, the more precisely the script adapts.

Video Narration vs Other Methods

Feature        | Luxoret AI                        | Manual / Traditional    | Other Tools
Cost per Use   | $0.15                             | $200-$1000+ per project | $0.20-$0.50 per video
Speed          | Minutes, not hours                | Hours of manual editing | Varies by complexity
Skill Required | None; AI handles it               | Video editing expertise | Moderate learning curve
Software       | Browser-based, nothing to install | Expensive editing suite | Desktop app required
Quality        | AI-enhanced, professional         | Depends on editor skill | Template-dependent
Revisions      | Instant re-processing             | Re-edit from scratch    | Limited by plan