From script to final render β Higgsfield, avatar design, scene setup, VFX, lip sync, voiceover matching, and every pro technique in one guide.
Bold question, surprising stat, or relatable pain point. 2β3 punchy sentences. Under 30 words. This makes or breaks your video.
Identify the core tension. Create curiosity. Make the viewer feel understood. Never rush past this β it sells the solution.
4β6 sentences. Specific tools, techniques, stats, or mindset shifts. Each sentence earns its place β cut anything vague.
One or two direct instructions. "Follow for Part 2." "Save this." "Try this today." Simple, specific, actionable.
8β12 words. Quotable. Emotionally resonant. The line viewers screenshot and share. Make it memorable.
130β150 words for a 60-second video. At natural speaking pace (~150 wpm), this lands perfectly without rushing or dead air.
Second person ("you", "your"). Active voice. Short sentences. No filler words. Specific beats vague every time.
Mark each sentence with the shot it belongs to. Helps you sync visuals to audio and prevents over-cutting or under-cutting.
Specify exact range (e.g., "South Asian woman, late 20s"). Vague = inconsistent results.
Color, length, style. "Shoulder-length black hair, slight wave" beats just "dark hair".
Eye color, face shape, defining features. More detail = more consistent renders.
Build, height cues. Avoid extremes. "Athletic medium build" works well in Higgsfield.
Match energy to section: expressive for hook, focused for solution, warm for CTA.
Copy-paste the EXACT same avatar description into every shot. Never vary it.
Higgsfield's "Custom Avatar" mode lets you upload a real face photo. Use a clean, well-lit front-facing shot. This massively improves consistency across shots.
Navigate to "Create Avatar" β fill in the character card: name, appearance, personality. Higgsfield stores this as a reusable template β use it for every scene in your video.
Under Avatar Settings, define a "neutral pose" β standing, slight 3/4 turn toward camera, natural arm position. This serves as your baseline for all non-action shots.
Before recording full scenes, generate 5β8 still frames of the avatar in different expressions. Check face consistency. If anything drifts, tighten the description.
| Topic Category | Recommended Outfit | Colors to Use |
|---|---|---|
| AI & Productivity | Smart casual β fitted shirt, clean chinos, minimal jewelry | White, navy, cream, slate |
| Health & Wellness | Activewear β fitted top, leggings/joggers, light layers | Soft pastels, earthy tones, athletic grays |
| Career & Business | Business casual β blazer, neat top, subtle accessories | Charcoal, white, navy, tan |
| Tech & Gadgets | Streetwear-tech β hoodie, joggers, visible wireless earbuds | Dark neutrals, pops of accent color |
| Daily Lifestyle | Comfortable lifestyle β cozy top, relaxed jeans, put-together | Warm neutrals, muted tones |
Avoid shiny fabrics (creates render artifacts under studio lights). Matte textures β linen, cotton, knit β render most cleanly. Add subtle layering (open overshirt) for visual depth.
Lighter fabrics that move slightly look more realistic outdoors. Add sunglasses, cap, or jacket depending on time of day. Wind effects work better with looser garments.
White walls, minimal shelving, soft plant. Large window with diffused daylight. Blurred mid-range depth.
Park, rooftop, or street at sunset. Warm directional light. Slight bokeh on background. Subject backlit with rim light.
Near-black background. Single accent color light source (purple, teal, or amber). High contrast silhouette. Dramatic and modern.
Warm ambient light, coffee cups, wooden surfaces, blurred people in background. Creates relatable, approachable energy.
Busy street, soft focus background pedestrians, directional sunlight or overcast diffusion. Great for establishing shots.
Morning mist, dappled light through trees, natural greens. Creates calm, authentic atmosphere for wellness or mindfulness topics.
Slow push-in, orbit, crane up, rack focus. These are the most cinematic and render cleanly in Higgsfield. Always specify direction and speed.
Light rays, lens flare, bokeh, dust particles, fog, heat haze. Layer 1β2 max. More than that overwhelms the avatar. Specify intensity (subtle/strong).
Cross-dissolve (universal), whip pan (energy), zoom transition (modern). Handled in CapCut or Premiere after AI generation. Not in Higgsfield prompts.
Auto-caption in CapCut. Keyword callouts (highlight key stat/word on screen). Lower-third name plates. Use high-contrast, readable fonts β Helvetica or Inter.
Digital particles, glowing orbs, energy trails. Good for tech/AI topics. Add via CapCut FX pack or RunwayML's VFX layers after generating base footage.
Apply a LUT in post for mood consistency. Teal-and-orange for cinematic warmth. Desaturated gray for serious/corporate. Warm golden for lifestyle topics.
| Duration | Shot Type | Best For |
|---|---|---|
| 6 seconds | Fast cut | Hook opener, closing line, reaction beat, transition moment |
| 8 seconds | Standard | Setup explanation, CTA delivery, mid-video transitions |
| 10 seconds | Extended | Solution demo, complex gesture/action, key insight delivery |
| 12β15 seconds | Slow burn | Emotional moments, closing sequences, dramatic reveals |
Fast cut. Close-up. Avatar directly addresses camera with bold opening question. High energy expression.
Medium shot. Slight camera push-in. Avatar builds the problem with gestures. Start of emotional connection.
Extended. Avatar demonstrates or explains first key point. Camera at mid-distance, static or slow orbit.
Extended. Cut to complementary scene (different environment or angle). Avatar delivers second key point.
Standard. Avatar direct-to-camera. Warm, confident. Specific call to action. Slight zoom-out feeling of openness.
Fast cut. Close-up again. Avatar holds eye contact. Delivers mic-drop closing line. Static camera β let the words land.
Add 2 Γ 6s B-roll cutaways to reach 60s total, or extend Solution shots to 12s for a more relaxed pace.
110β115% of normal speed. Short-form audiences expect a slightly faster pace. Too slow = boring. Too fast = hard to follow.
Use <emphasis> tags in ElevenLabs to stress key words. Matches natural speech rhythm. Never let the AI emphasize randomly.
Add deliberate 0.3β0.5s pauses after hook question and before closing line. Let powerful statements breathe.
Background music at β18dB to β22dB when voice is present. Voice at 0dB reference. Duck the music by β6dB more at hook and closing line.
Tools like Higgsfield, D-ID, and HeyGen analyze your audio file and generate matching mouth movements frame-by-frame using phoneme mapping.
Professional lip sync works at the phoneme level β individual mouth sounds like "m", "f", "oh". The more phoneme data available, the more realistic the result.
Generate avatar video at 24 or 30fps. Lip sync tools are calibrated to standard frame rates. 60fps can cause interpolation artifacts in mouth movement.
Export from ElevenLabs/Murf as WAV (44.1kHz, 16-bit). Avoid heavy compression or EQ at this stage β lip sync engines work better with clean audio.
In Higgsfield, generate your avatar clips with natural expression but NO audio dependency. You'll add speech movement in the next step.
Use Higgsfield's built-in Lip Sync (best for Higgsfield avatars), HeyGen Lip Sync, or SyncLabs. Upload: (a) avatar video clip, (b) voiceover audio. The tool maps speech to mouth.
Most tools have a +/βframe offset control. Set to 0 first, preview, then nudge Β±1β2 frames if mouth is slightly ahead or behind the audio. This is the most critical quality step.
Lip sync can create mouth-shape errors at the first and last frame of each clip. Add a 2-frame crossfade transition between clips in your editor to mask these.
After assembling all clips in CapCut/Premiere, watch the full video with headphones. Listen for any audioβvisual desync, especially after transitions.
Write script with timing markers. Design avatar description card. Create shot-by-shot breakdown with duration targets. Map voiceover sentences to each shot.
Generate in ElevenLabs with correct speed, emphasis, and pauses. Export as clean WAV. Listen back fully β fix any mis-pronunciations before moving forward.
Generate one shot at a time using your prompt templates. Save each clip with a clear filename (shot01_hook_6s.mp4). Generate 2β3 variations per shot and pick the best.
Upload each clip to SyncLabs/Higgsfield with its corresponding voiceover segment. Check sync offset. Export synced clips.
Assemble synced clips in order. Add transitions (crossfade 2β3 frames). Layer background music at β20dB. Add text overlays and captions. Apply color LUT.
Watch at full volume on phone (your primary audience's device). Check: sync, pacing, caption readability, music balance. Export: 1080Γ1920 (9:16), H.264, 30fps, 10β15 Mbps.