Text-to-Image Diffusion Model Training

Nikola Gojakovic

Technical Overview

This implementation provides a complete training pipeline for text-conditional diffusion models using the Denoising Diffusion Probabilistic Models (DDPM) framework. The architecture follows the Stable Diffusion paradigm and includes three key components:
A Variational Autoencoder (VAE) for encoding/decoding images
A CLIP text encoder for text conditioning
A UNet denoising network with cross-attention

Mathematical Framework

Forward Process

The forward noising process gradually corrupts the image latents:

\[
q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t I\right)
\]

which admits the closed form \( q(z_t \mid z_0) = \mathcal{N}\!\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1-\bar{\alpha}_t) I\right) \) with \( \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s) \).
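The closed form is what the training loop actually uses. A minimal sketch with the diffusers DDPMScheduler; the latents tensor here is a random stand-in for a batch of VAE latents:

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

latents = torch.randn(8, 4, 32, 32)       # stand-in for VAE latents z_0
noise = torch.randn_like(latents)         # epsilon ~ N(0, I)
timesteps = torch.randint(0, 1000, (8,))  # one random t per sample

# z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * epsilon
noisy_latents = scheduler.add_noise(latents, noise, timesteps)
```
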
Reverse Process

The model learns the reverse denoising distribution:

\[
p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\!\left(z_{t-1};\ \mu_\theta(z_t, t, c),\ \Sigma_\theta(z_t, t, c)\right)
\]

where:

\( c \): text conditioning (prompt embedding)
\( \theta \): learnable parameters

In practice the network predicts the noise added in the forward process, and training minimizes the simplified DDPM objective:

\[
L_{\text{simple}} = \mathbb{E}_{z_0,\, \epsilon,\, t}\!\left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|^2 \right]
\]
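At inference time the reverse chain is unrolled step by step. A minimal ancestral-sampling sketch with the diffusers DDPMScheduler, assuming the unet and text embedding text_emb from the components described below:

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(1000)

# `unet` and `text_emb` are assumed from the rest of the pipeline
latents = torch.randn(1, 4, 32, 32)  # start from pure Gaussian noise
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    # One ancestral step of p_theta(z_{t-1} | z_t, c)
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```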

Architecture Components

VAE Encoder/Decoder: compresses images, ℝ⁶⁴×⁶⁴×³ → ℝ³²×³²×⁴
CLIP Text Encoder: converts text prompts into embeddings, ℝ⁷⁷×⁵¹²
UNet Denoising Network: predicts noise εθ(zₜ, t, c) with cross-attention conditioning
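A minimal sketch of the first two components with diffusers and transformers. The small custom AutoencoderKL config is an assumption chosen to reproduce the 2× spatial compression stated above (the stock Stable Diffusion VAE compresses 8×); CLIP ViT-B/32 gives the 77×512 text embeddings:

```python
from diffusers import AutoencoderKL
from transformers import CLIPTextModel, CLIPTokenizer

# Small AutoencoderKL with a single 2x downsample: 64x64x3 -> 32x32x4
vae = AutoencoderKL(
    in_channels=3,
    out_channels=3,
    down_block_types=("DownEncoderBlock2D", "DownEncoderBlock2D"),
    up_block_types=("UpDecoderBlock2D", "UpDecoderBlock2D"),
    block_out_channels=(64, 128),
    latent_channels=4,
)

# CLIP ViT-B/32: 77 tokens, 512-dim hidden states
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
```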

Dataset Processing

Dataset: CelebA (200k+ celebrity face images)
Text Labels: Generated from facial attributes (e.g., “a person with black hair and sunglasses”)
Preprocessing: Resize to 64×64, normalize to the range [-1, 1]
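A sketch of the preprocessing pipeline and attribute-based captioning with torchvision. The caption_from_attrs helper is hypothetical; the source only shows an example caption, not the exact template:

```python
from torchvision import transforms
from torchvision.datasets import CelebA

preprocess = transforms.Compose([
    transforms.Resize(64),
    transforms.CenterCrop(64),
    transforms.ToTensor(),                      # pixels in [0, 1]
    transforms.Normalize([0.5] * 3, [0.5] * 3), # -> [-1, 1]
])

dataset = CelebA(root="data", split="train", target_type="attr",
                 transform=preprocess, download=True)

# Hypothetical helper: turn binary attribute flags into a short caption
def caption_from_attrs(attrs, attr_names=dataset.attr_names):
    active = [n.replace("_", " ").lower()
              for n, a in zip(attr_names, attrs) if a]
    return "a person with " + " and ".join(active[:2]) if active else "a person"
```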

Network Configuration

UNet2DConditionModel:
sample_size: 32 × 32 (latent space)
in_channels: 4
out_channels: 4
block_out_channels: [64, 128, 256]
cross_attention_dim: 512
timesteps: 1000
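A sketch of the corresponding instantiation in the diffusers API; each resolution level needs a matching down/up block type, and the exact block layout below is an assumption:

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=32,            # latent spatial size
    in_channels=4,
    out_channels=4,
    block_out_channels=(64, 128, 256),
    down_block_types=(
        "CrossAttnDownBlock2D",
        "CrossAttnDownBlock2D",
        "DownBlock2D",
    ),
    up_block_types=(
        "UpBlock2D",
        "CrossAttnUpBlock2D",
        "CrossAttnUpBlock2D",
    ),
    cross_attention_dim=512,   # matches the CLIP text embedding width
)
```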


Pipeline Data Flow

Text  → CLIP → Text Embeddings (512d)
                                      ↓ cross-attention
Image → VAE → Latents → Add Noise → UNet → Predicted Noise
                                      ↑
Decode ← Denoised Latents ← DDPM Sampling   (inference loop)
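The top two rows of the diagram correspond to one training step. A minimal sketch combining the pieces above; vae, tokenizer, text_encoder, and unet are the objects constructed earlier, and the latent scaling factor is omitted for brevity:

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)

def training_step(images, captions):
    # Frozen components: no gradients through VAE or text encoder
    with torch.no_grad():
        # Image -> VAE -> latents (images already normalized to [-1, 1])
        latents = vae.encode(images).latent_dist.sample()

        # Text -> CLIP -> 77 x 512 embeddings
        tokens = tokenizer(captions, padding="max_length", max_length=77,
                           truncation=True, return_tensors="pt")
        text_emb = text_encoder(tokens.input_ids).last_hidden_state

    # Forward process: sample noise and random timesteps, add noise
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)

    # UNet predicts the noise; regress against the true noise
    noise_pred = unet(noisy_latents, timesteps,
                      encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(noise_pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```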

Training Status

Note: Due to hardware limitations (an AMD RX 570 with limited VRAM), full model training could not be completed on the current setup. However, the codebase is fully functional and has been tested for compatibility with PyTorch and Hugging Face Diffusers.

Future Training Plan:

Reduce batch size to 1–2 samples
Limit training steps to 50–100 per epoch
Train for 2–3 initial epochs
Utilize gradient checkpointing for memory efficiency
Optionally reduce model size or latent resolution (see the sketch below)
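A sketch of the memory-saving knobs this plan relies on; both methods exist on the respective diffusers model classes, and the schedule values come from the list above:

```python
# Trade compute for memory: recompute activations in the backward pass
unet.enable_gradient_checkpointing()

# Encode/decode VAE batches one slice at a time
vae.enable_slicing()

# Conservative schedule for a low-VRAM GPU
batch_size = 1
steps_per_epoch = 50
num_epochs = 2
```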

Posted Sep 13, 2025
