Wav2Lip Video Generation Pipeline

Badaruddin Chachar

This project creates lip-synced videos from a single face image and a text prompt. The pipeline first converts the text into speech (TTS), then uses the Wav2Lip model to animate the image so that its lip movements are synchronized with the generated audio.

🔥 Features

Convert any text into audio using TTS (Text-to-Speech)
Use Wav2Lip to generate a lip-synced video from a static face image and audio
Automatic audio extraction and video frame preparation
Resize factor customization to fit different GPU memory requirements
CUDA support for fast inference

🛠️ Setup

1. Clone the Repository

git clone https://github.com/your-bchachar/lip_sync_video_generator.git
cd lip_sync_video_generator

2. Set Up Python Virtual Environment (Recommended)

python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt
Make sure ffmpeg is installed and accessible from the command line.
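
You can verify this from Python before running the pipeline (a minimal check using only the standard library):

import shutil
import subprocess

# Confirm the ffmpeg binary is discoverable on PATH.
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found on PATH; install it before running the pipeline")

# Print the detected version as a sanity check.
subprocess.run(["ffmpeg", "-version"], check=True)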

4. Download Required Models

a. Clone Wav2Lip Repository
Clone the official Wav2Lip repository into the project root directory:
git clone https://github.com/Rudrabha/Wav2Lip.git
b. Wav2Lip Checkpoint (.pth file)
Download the wav2lip.pth file from this Hugging Face link and place it in ./Wav2Lip/checkpoints/
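
After downloading, a quick sanity check that the file landed where the pipeline expects it (a minimal sketch):

from pathlib import Path

# The inference step expects the checkpoint at exactly this path.
ckpt = Path("Wav2Lip/checkpoints/wav2lip.pth")
assert ckpt.is_file(), f"checkpoint missing: {ckpt}"
print(f"found {ckpt} ({ckpt.stat().st_size / 1e6:.0f} MB)")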

⚙️ How It Works

1. Input text is converted into audio using a TTS system.
2. The audio is saved to ./audio/output.wav.
3. The image and audio are passed to the Wav2Lip model.
4. A video is generated in which the lips of the image move in sync with the spoken audio.
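
A minimal sketch of that flow, assuming pyttsx3 for the TTS step (the project's actual TTS backend may differ) and invoking the stock Wav2Lip inference script with its standard arguments:

import subprocess
import pyttsx3  # assumption: any TTS library that writes a WAV file works here

# Steps 1-2: convert text to speech and save it where the pipeline expects it.
engine = pyttsx3.init()
engine.save_to_file("Hello, this is a lip-sync demo.", "audio/output.wav")
engine.runAndWait()

# Steps 3-4: run Wav2Lip inference on the static image plus the audio.
# These are the stock Wav2Lip inference.py arguments.
subprocess.run(
    [
        "python", "Wav2Lip/inference.py",
        "--checkpoint_path", "Wav2Lip/checkpoints/wav2lip.pth",
        "--face", "images/sample.jpg",
        "--audio", "audio/output.wav",
        "--outfile", "video/result_voice.mp4",
    ],
    check=True,
)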

🚀 Usage

Run the full pipeline:

python main.py
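
The finished video should appear at ./video/result_voice.mp4, matching the project structure below.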

🐞 Common Issues & Fixes

1. CUDA error: no kernel image is available for execution on the device

Your GPU (e.g., RTX 4090) may not be supported by the current PyTorch installation.
Fix: Reinstall PyTorch with support for compute capability 8.9:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
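
To confirm what PyTorch actually sees (a minimal check; an RTX 4090 reports compute capability (8, 9)):

import torch

print(torch.cuda.is_available())                # GPU visible to PyTorch?
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_capability(0))  # (8, 9) for an RTX 4090
    print(torch.version.cuda)                   # CUDA version the wheel targets
    # A tiny op on the GPU triggers the "no kernel image" error immediately
    # if the installed wheel lacks kernels for this architecture.
    torch.zeros(1, device="cuda") + 1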

2. invalid load key, '<' or TorchScript Errors

Occurs when the wrong model file is loaded, e.g. a TorchScript .pt archive instead of a PyTorch .pth checkpoint, or an HTML error page saved in place of the model (the '<' is the first character of the HTML).
Fix: Use the .pth file from the correct source (e.g., the Hugging Face link above)
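
A quick way to tell which kind of file you actually have (a sketch; an HTML page saved as .pth typically raises exactly this error):

import torch

path = "Wav2Lip/checkpoints/wav2lip.pth"
try:
    state = torch.load(path, map_location="cpu")
    print("regular PyTorch checkpoint:", type(state))
except Exception as e:
    print("torch.load failed:", e)
    # If this call succeeds instead, the file is a TorchScript archive,
    # not the checkpoint the pipeline expects.
    torch.jit.load(path, map_location="cpu")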

3. Image too big to run face detection on GPU

Fix: Use the --resize_factor argument (e.g., 2 or 4)
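
For example, assuming the stock Wav2Lip inference script (a resize factor of 2 halves the input resolution before face detection):

python Wav2Lip/inference.py --checkpoint_path Wav2Lip/checkpoints/wav2lip.pth --face images/sample.jpg --audio audio/output.wav --resize_factor 2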

4. Output video not found

Fix: Ensure ffmpeg is installed and accessible from the command line.

📁 Project Structure

.
├── Wav2Lip
│   ├── checkpoints
│   │   └── wav2lip.pth
├── audio
│   └── output.wav
├── images
│   └── sample.jpg
├── video
│   └── result_voice.mp4
├── generate_lipsync_video.py
└── requirements.txt

📜 License

MIT License. See LICENSE file for more information.

🙏 Acknowledgements

This project builds on the official Wav2Lip repository (https://github.com/Rudrabha/Wav2Lip).
