Study session on Latent Diffusion Models (Stable Diffusion-class): conditional diffusion (text, class-label, image conditioning), the VAE-encoder + U-Net-denoiser + VAE-decoder structure, and latent consistency models (LCM) for one-to-four-step sampling. Attempted a Gradio interface for interactive prompt-to-image exploration. Pre-thesis-line preparation for the 3-D-diffusion work that comes later (Topics 27 JiT, 26 MambaFlow3D, 24 polyline diffusion).
By May 2025 the thesis line had identified diffusion as a likely core tool for image-to-3-D (later validated by the survey in Topic 18). Stable Diffusion 1.x / 2.x are the published reference implementations of latent diffusion. The study session covered the LDM paper end-to-end, the VAE / U-Net / text-encoder split, the role of classifier-free guidance, and LCM as the fast-sampling distillation that cuts the step count from about 100 to between one and four.
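The classifier-free guidance step mentioned above combines two noise predictions from the same U-Net, one for the prompt and one for an empty prompt. A minimal NumPy sketch, assuming the standard guidance formula; the function name `cfg_combine` is illustrative, the random arrays stand in for real U-Net outputs, and the scale of 7.5 is a commonly used default rather than a value from these notes:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction towards the text-conditioned one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Stand-ins for two U-Net outputs on the same noisy latent (assumed shapes).
rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 64, 64))  # prediction for the empty prompt
eps_cond = rng.standard_normal((4, 64, 64))    # prediction for the actual prompt

guided = cfg_combine(eps_uncond, eps_cond)
```

A scale of 1.0 returns the conditional prediction unchanged and 0.0 returns the unconditional one; larger values push the sample harder towards the prompt.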
| Component | Role | Stable Diffusion 1.5 size |
|---|---|---|
| VAE encoder | Image (512²) → latent (64² × 4 ch) | ~34 M params |
| Text encoder | Prompt → token-embedding sequence | CLIP ViT-L · ~123 M params |
| U-Net denoiser | Latent + text + timestep → predicted noise | ~860 M params (the bulk) |
| VAE decoder | Denoised latent → image (512²) | ~50 M params |
| Scheduler | Noise schedule + reverse-step formula (DDIM / DDPM / PNDM) | — |
The VAE is the compression. The U-Net is the diffusion. They are independent components.
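The numbers in the table can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch using the approximate sizes listed above; the 3-channel RGB input is an assumption, and the variable names are illustrative:

```python
# Shapes and parameter counts as listed in the table (approximate values).
image_values = 512 * 512 * 3     # RGB input, assuming 3 channels
latent_values = 64 * 64 * 4      # VAE latent

spatial_factor = 512 // 64                   # downsampling factor per side
compression = image_values / latent_values   # image values per latent value

params_m = {"vae_encoder": 34, "text_encoder": 123, "unet": 860, "vae_decoder": 50}
total_m = sum(params_m.values())             # total in millions of parameters
unet_share = params_m["unet"] / total_m      # fraction of parameters in the U-Net

print(spatial_factor, compression, total_m, round(unet_share, 2))
# prints: 8 48.0 1067 0.81
```

The U-Net therefore runs on 48× fewer values than a pixel-space model would, while still holding roughly four fifths of the parameters.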
The LDM trick is to decouple the heavy generative work (U-Net diffusion) from the pixel-space work (VAE compression). The VAE is trained once on a large dataset; the U-Net is trained on top of the frozen VAE's latent space. Different downstream LDMs can then share the same VAE and swap only the U-Net. This makes LDM the foundation for the 3-D-diffusion work in the thesis line: the VAE-equivalent (triplane encoder) is shared across Hierarchical Triplane (Topic 33), Hexplane AE (Topic 29), and MambaFlow3D (Topic 26).
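The decoupling can be illustrated with stand-in components. The sketch below is a structural toy under stated assumptions, not Stable Diffusion itself: `ToyVAE` is a hypothetical pooling-based substitute for the frozen VAE, `make_oracle_denoiser` is a hypothetical substitute for a trained U-Net, and the linear beta schedule is an assumed DDPM-style schedule. Only the deterministic DDIM update is the standard formula.

```python
import numpy as np

class ToyVAE:
    """Stand-in for the frozen VAE: 8x average pooling plus a fixed channel mix.
    Not a real VAE; it only reproduces the (3, 512, 512) <-> (4, 64, 64) shapes."""
    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.mix = rng.standard_normal((4, 3))   # 3 RGB channels -> 4 latent channels
        self.unmix = np.linalg.pinv(self.mix)    # 4 latent channels -> 3 RGB channels

    def encode(self, image):                     # image: (3, H, W)
        c, h, w = image.shape
        pooled = image.reshape(c, h // 8, 8, w // 8, 8).mean(axis=(2, 4))
        return np.einsum("lc,chw->lhw", self.mix, pooled)

    def decode(self, latent):                    # latent: (4, h, w)
        rgb = np.einsum("cl,lhw->chw", self.unmix, latent)
        return rgb.repeat(8, axis=1).repeat(8, axis=2)

def make_oracle_denoiser(target_latent, alphas_cumprod):
    """Hypothetical stand-in for a trained U-Net: returns the exact noise that
    separates x_t from one fixed target latent."""
    def denoiser(x_t, t):
        a = alphas_cumprod[t]
        return (x_t - np.sqrt(a) * target_latent) / np.sqrt(1.0 - a)
    return denoiser

def ddim_sample(denoiser, alphas_cumprod, timesteps, shape, seed=0):
    """Deterministic DDIM reverse loop (eta = 0) over descending timesteps."""
    x = np.random.default_rng(seed).standard_normal(shape)
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = denoiser(x, t)                                   # predicted noise
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

betas = np.linspace(1e-4, 0.02, 1000)            # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)
timesteps = list(range(999, -1, -100))           # 10 steps: 999, 899, ..., 99

vae = ToyVAE()                                   # shared and frozen
image_a = np.full((3, 512, 512), 0.25)
image_b = np.full((3, 512, 512), -0.5)

# Two "U-Nets" defined against the same latent space; only the denoiser changes.
denoiser_a = make_oracle_denoiser(vae.encode(image_a), alphas_cumprod)
denoiser_b = make_oracle_denoiser(vae.encode(image_b), alphas_cumprod)

out_a = vae.decode(ddim_sample(denoiser_a, alphas_cumprod, timesteps, (4, 64, 64)))
out_b = vae.decode(ddim_sample(denoiser_b, alphas_cumprod, timesteps, (4, 64, 64)))
```

Both runs reuse the same `vae` object and the same sampling loop. Swapping the denoiser is the only change, which mirrors how one triplane encoder can serve several downstream 3-D diffusion models.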
Latent Consistency Models distill an LDM into a one-to-four-step sampler via a consistency-distillation loss. The trade-off: a slight quality drop in exchange for ~25× faster inference. This is relevant to the thesis line for the same reason flow matching is: fewer sampling steps means interactive-rate single-image-to-3-D becomes possible on consumer GPUs.
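Few-step consistency sampling can be sketched on toy data where the consistency function is known in closed form. Everything here is an illustrative assumption: the "data" is an isotropic Gaussian N(MU, S²), the beta schedule is an assumed linear one, and `consistency_fn` is the exact probability-flow map for that Gaussian, standing in for the distilled U-Net that a real LCM would use. The timestep lists are arbitrary examples.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)        # assumed DDPM-style linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)
MU, S = 1.0, 0.5                             # toy "data" distribution N(MU, S^2)

def consistency_fn(x_t, t):
    """Exact consistency function for the Gaussian toy data: maps a noisy sample
    at timestep t straight to the start of its probability-flow ODE trajectory.
    A real LCM replaces this closed form with a distilled network."""
    a = alphas_cumprod[t]
    std_t = np.sqrt(a * S**2 + 1.0 - a)      # std of the noisy marginal at t
    return MU + S * (x_t - np.sqrt(a) * MU) / std_t

def multistep_consistency_sample(timesteps, shape, seed=0):
    """Few-step sampling: predict x0 in one jump, re-noise to the next (lower)
    timestep, and predict again."""
    rng = np.random.default_rng(seed)
    x0 = consistency_fn(rng.standard_normal(shape), timesteps[0])
    for t in timesteps[1:]:
        a = alphas_cumprod[t]
        x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(shape)
        x0 = consistency_fn(x_t, t)
    return x0

one_step = multistep_consistency_sample([999], (4, 64, 64))
four_step = multistep_consistency_sample([999, 759, 499, 259], (4, 64, 64))

# Step-count ratio behind the ~25x figure, ignoring per-step overhead.
speedup = 100 / 4
```

Each call to `consistency_fn` costs one network evaluation in a real LCM, so the four-step run above corresponds to four U-Net passes instead of about a hundred.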
A small Gradio interface was attempted for interactive prompt-to-image exploration via the diffusers library. The interface itself worked, but iteration on an Intel iMac (no GPU) was impractically slow (~30 s per image), and the project was deferred until the RTX 3060 rental was set up later in the year.
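An interface of this kind might look roughly like the sketch below. It is a reconstruction under assumptions, not the code from the session: the helper names, the step count of 20, and the model id are all illustrative, and the heavy imports are deferred into the functions that need them so the wrapper can be exercised without Gradio or diffusers installed.

```python
def make_generate_fn(pipe, num_inference_steps=20):
    """Wrap a diffusers-style pipeline as a prompt -> image function.
    The step count is an illustrative default."""
    def generate(prompt):
        return pipe(prompt, num_inference_steps=num_inference_steps).images[0]
    return generate

def build_demo(generate):
    """Build a one-textbox, one-image Gradio interface around `generate`."""
    import gradio as gr  # imported lazily so the wrapper above works without Gradio
    return gr.Interface(
        fn=generate,
        inputs=gr.Textbox(label="Prompt"),
        outputs=gr.Image(label="Result"),
        title="Latent diffusion explorer",
    )

def main():
    import torch
    from diffusers import StableDiffusionPipeline
    # The model id is an assumption; substitute any available SD 1.x checkpoint.
    pipe = StableDiffusionPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float32
    )
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")
    build_demo(make_generate_fn(pipe)).launch()
```

Calling `main()` downloads the checkpoint and launches the local web interface. On a CPU-only machine the pipeline call dominates the wait, which matches the slow iteration noted above.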
Step through the LDM forward pass — VAE encode, U-Net denoise, VAE decode — with a stylised input image.
White paper · LDM theory study · VAE / U-Net / LCM decomposition · same structure as thesis-line 3-D-diffusion work