UNet2DConditionModel:
  sample_size: 32 × 32 (latent space)
  in_channels: 4
  out_channels: 4
  block_channels: [64, 128, 256]
  cross_attention_dim: 512
  timesteps: 1000

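
To make the listing concrete, here is a minimal sketch that stores the hyperparameters in a small dataclass and derives the feature-map shape at each encoder level. `UNetConfig` and `encoder_shapes` are hypothetical names used only for illustration, and the sketch assumes the resolution halves between consecutive levels, which is the usual UNet layout but is not stated in the listing. In the diffusers library, the constructor argument corresponding to `block_channels` is `block_out_channels`, and the number of timesteps is normally set on the noise scheduler rather than on the UNet.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class UNetConfig:
    """Hypothetical container mirroring the hyperparameters listed above."""
    sample_size: int = 32                      # latent height and width
    in_channels: int = 4                       # latent channels fed to the UNet
    out_channels: int = 4                      # channels of the predicted noise
    block_channels: Tuple[int, ...] = (64, 128, 256)
    cross_attention_dim: int = 512             # width of the text embeddings
    timesteps: int = 1000                      # diffusion steps (scheduler setting)

    def encoder_shapes(self) -> List[Tuple[int, int, int]]:
        """Return (channels, height, width) per encoder level.

        Assumption: resolution halves between consecutive levels.
        """
        shapes = []
        size = self.sample_size
        for channels in self.block_channels:
            shapes.append((channels, size, size))
            size //= 2  # downsample before the next level
        return shapes


config = UNetConfig()
shapes = config.encoder_shapes()
print(shapes)  # [(64, 32, 32), (128, 16, 16), (256, 8, 8)]
```

Under that assumption the bottleneck works on 8 × 8 feature maps with 256 channels, while the input and output both stay at 4 × 32 × 32.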
              Text → CLIP → Text Embeddings (512d)
                                     ↓
Image → VAE → Latents → Add Noise → UNet → Predicted Noise
         ↑                                  ↑
      Decode ← Denoised Latents ← DDPM Sampling
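
The "Add Noise" and "DDPM Sampling" stages of the diagram can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the author's implementation: the linear beta schedule from 1e-4 to 0.02 and the variance choice sigma_t^2 = beta_t are common DDPM defaults that the source does not specify, the latents are a flat list of random numbers standing in for a VAE output of shape 4 × 32 × 32, and the true noise stands in for the UNet prediction. All function names are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, not from the source)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for s <= t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(latents, noise, t, alpha_bars):
    """Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    return [a * x + b * e for x, e in zip(latents, noise)]


def ddpm_step(x_t, predicted_noise, t, betas, alpha_bars, rng):
    """One reverse step x_t -> x_{t-1} given the predicted noise."""
    beta_t = betas[t]
    coef = beta_t / math.sqrt(1.0 - alpha_bars[t])
    scale = 1.0 / math.sqrt(1.0 - beta_t)
    mean = [scale * (x - coef * e) for x, e in zip(x_t, predicted_noise)]
    if t == 0:
        return mean  # no noise is added at the final step
    sigma = math.sqrt(beta_t)  # assumed variance choice: sigma_t^2 = beta_t
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

latents = [rng.gauss(0.0, 1.0) for _ in range(4 * 32 * 32)]  # stand-in for VAE latents
noise = [rng.gauss(0.0, 1.0) for _ in latents]
t = 500
noisy = add_noise(latents, noise, t, alpha_bars)

# During training, the UNet receives (noisy, t, text embeddings) and is
# trained to output `noise`. Here the true noise stands in for that prediction.
previous = ddpm_step(noisy, noise, t, betas, alpha_bars, rng)

# With a perfect noise prediction, the clean latents can be recovered exactly.
a, b = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
recovered = [(x - b * e) / a for x, e in zip(noisy, noise)]
```

At sampling time the reverse step would be applied from t = 999 down to t = 0, starting from pure Gaussian noise, and the resulting latents would then be passed to the VAE decoder.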