Topic 12 · May 2025 · LDM · Conditional Diffusion · LCM

Latent Diffusion —
Theory + Gradio Attempt.

Study session on Latent Diffusion Models (Stable Diffusion-class): conditional diffusion (text, class-label, image conditioning), the VAE-encoder + U-Net-denoiser + VAE-decoder structure, and latent consistency models (LCM) for one-to-four-step sampling. Attempted a Gradio interface for interactive prompt-to-image exploration. Pre-thesis-line preparation for the 3-D-diffusion work that comes later (Topics 27 JiT, 26 MambaFlow3D, 24 polyline diffusion).

00 — Motivation

Understand LDM before scaling to 3-D.

By May 2025 the thesis line had identified diffusion as a likely core tool for image-to-3-D (later validated by the survey in Topic 18). Stable Diffusion 1.x / 2.x are the published reference implementations of latent diffusion. The study session covered the LDM paper end-to-end, the VAE / U-Net / text-encoder split, the role of classifier-free guidance, and LCM as the fast-sampling distillation that cuts the step count from about 100 to 4 or fewer.
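Classifier-free guidance is easiest to see as arithmetic on two noise predictions. The sketch below is a minimal illustration, not code from the study session: it assumes the denoiser has already produced one prediction without the prompt and one with it, and the function name `cfg_combine` and the plain-list representation are hypothetical.

```python
from typing import List


def cfg_combine(eps_uncond: List[float], eps_cond: List[float],
                guidance_scale: float) -> List[float]:
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and past) the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions for a two-element latent (values are illustrative only).
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

unguided = cfg_combine(eps_uncond, eps_cond, 0.0)  # ignores the prompt
plain = cfg_combine(eps_uncond, eps_cond, 1.0)     # plain conditional prediction
pushed = cfg_combine(eps_uncond, eps_cond, 2.0)    # over-weights the prompt
```

A scale of 0 reproduces the unconditional prediction, 1 reproduces the conditional one, and larger values push further in the direction the prompt suggests.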

01 — LDM Components

Component	Role	Stable Diffusion 1.5 size
VAE encoder	Image (512²) → latent (64² × 4 ch)	~34 M params
Text encoder	Prompt → token-embedding sequence	CLIP ViT-L · ~123 M params
U-Net denoiser	Latent + text + timestep → predicted noise	~860 M params (the bulk)
VAE decoder	Denoised latent → image (512²)	~50 M params
Scheduler	Noise schedule + reverse-step formula (DDIM / DDPM / PNDM)	—
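The sizes in the table can be checked with a few lines of arithmetic. This is a rough sketch using only the approximate figures listed above; the three-channel RGB input is an assumption, since the table gives only the 512² resolution.

```python
# Approximate figures copied from the component table (Stable Diffusion 1.5).
IMAGE_SIDE = 512
IMAGE_CHANNELS = 3      # assumption: RGB input
LATENT_SIDE = 64
LATENT_CHANNELS = 4

PARAMS_M = {            # millions of parameters, approximate
    "vae_encoder": 34,
    "text_encoder": 123,
    "unet": 860,
    "vae_decoder": 50,
}

# Spatial downsampling factor of the VAE along each side.
downsample = IMAGE_SIDE // LATENT_SIDE

# How many fewer values the U-Net has to process per image.
pixel_values = IMAGE_SIDE ** 2 * IMAGE_CHANNELS
latent_values = LATENT_SIDE ** 2 * LATENT_CHANNELS
compression = pixel_values / latent_values

# Share of the total parameter budget that sits in the U-Net.
total_params_m = sum(PARAMS_M.values())
unet_share = PARAMS_M["unet"] / total_params_m

print(f"downsample per side: {downsample}x")
print(f"values per image: {pixel_values} -> {latent_values} ({compression:.0f}x fewer)")
print(f"total params: ~{total_params_m} M, U-Net share: {unet_share:.0%}")
```

Under these figures the denoiser works on 48 times fewer values than a pixel-space model would, while holding roughly four fifths of the parameters.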

Core Insight

The VAE is the compression. The U-Net is the diffusion.
They're independent components.

The LDM trick is to decouple the heavy generative work (U-Net diffusion) from the pixel-space work (VAE compression). The VAE is trained once on a large dataset; the U-Net is then trained on top of the frozen VAE's latent space. Different downstream LDMs can share the same VAE and swap only the U-Net. This makes LDM the foundation for the 3-D-diffusion work in the thesis line: the VAE-equivalent (triplane encoder) is shared across Hierarchical Triplane (Topic 33), Hexplane AE (Topic 29), and MambaFlow3D (Topic 26).
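The split between a frozen codec and a swappable denoiser can be sketched in pure Python. Everything here is a toy stand-in under stated assumptions: `ToyCodec` averages and repeats values instead of running a VAE, `oracle_denoiser` is a hypothetical "trained" denoiser that already knows its target latent, and the sampler is a deterministic DDIM loop (eta = 0) over a hand-picked schedule.

```python
import math
import random
from typing import Callable, List

Latent = List[float]
# (noisy latent, alpha_bar at the current step) -> predicted noise
Denoiser = Callable[[Latent, float], Latent]


class ToyCodec:
    """Stand-in for the frozen VAE: 2x downsample on encode, 2x upsample on decode."""

    def encode(self, image: List[float]) -> Latent:
        return [(image[i] + image[i + 1]) / 2 for i in range(0, len(image), 2)]

    def decode(self, latent: Latent) -> List[float]:
        return [v for v in latent for _ in range(2)]


def oracle_denoiser(target: Latent) -> Denoiser:
    """Hypothetical denoiser that always steers toward one fixed target latent."""
    def predict_noise(z_t: Latent, alpha_bar: float) -> Latent:
        return [(z - math.sqrt(alpha_bar) * x0) / math.sqrt(1 - alpha_bar)
                for z, x0 in zip(z_t, target)]
    return predict_noise


def ddim_sample(denoiser: Denoiser, size: int, alpha_bars: List[float],
                seed: int = 0) -> Latent:
    """Deterministic DDIM over alpha_bars ordered from most to least noisy."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(size)]
    schedule = list(alpha_bars) + [1.0]          # final step lands on a clean latent
    for ab_t, ab_prev in zip(schedule[:-1], schedule[1:]):
        eps = denoiser(z, ab_t)
        # Clean-latent estimate implied by the noise prediction.
        x0 = [(zi - math.sqrt(1 - ab_t) * e) / math.sqrt(ab_t) for zi, e in zip(z, eps)]
        # Move to the next, less noisy level along the same noise direction.
        z = [math.sqrt(ab_prev) * x + math.sqrt(1 - ab_prev) * e for x, e in zip(x0, eps)]
    return z


codec = ToyCodec()                   # shared and never retrained
schedule = [0.1, 0.4, 0.7, 0.9]      # illustrative values only

denoiser_a = oracle_denoiser([1.0, -1.0])
denoiser_b = oracle_denoiser([0.5, 0.25])

image_a = codec.decode(ddim_sample(denoiser_a, 2, schedule))
image_b = codec.decode(ddim_sample(denoiser_b, 2, schedule))
```

Swapping `denoiser_a` for `denoiser_b` changes what is generated while the codec object stays the same, which is the property the paragraph above relies on.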

02 — LCM Acceleration

Latent Consistency Models distill an LDM into a one-to-four-step sampler via a consistency loss. The trade-off is a slight quality drop in exchange for ~25× faster inference (4 steps instead of 100). This matters to the thesis line for the same reason flow matching does: fewer sampling steps mean interactive-rate single-image-to-3-D becomes possible on consumer GPUs.
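Multistep consistency sampling alternates between jumping to a clean estimate and re-noising to a lower noise level. The sketch below is a toy under stated assumptions: `make_toy_consistency_fn` is a hypothetical stand-in for a distilled model (its `leak` parameter is invented to mimic an imperfect student), and the schedule values are illustrative.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Latent = List[float]
ConsistencyFn = Callable[[Latent, float], Latent]

BASELINE_STEPS = 100   # step count quoted for the undistilled sampler
LCM_STEPS = 4          # upper end of the one-to-four-step range


def make_toy_consistency_fn(target: Latent,
                            leak: float = 0.05) -> Tuple[ConsistencyFn, Dict[str, int]]:
    """Hypothetical distilled model: maps any noisy latent near one target.
    `leak` lets a little input noise through, more at high noise levels."""
    calls = {"n": 0}

    def consistency_fn(z: Latent, alpha_bar: float) -> Latent:
        calls["n"] += 1
        w = leak * (1 - alpha_bar)
        return [(1 - w) * t + w * zi for t, zi in zip(target, z)]

    return consistency_fn, calls


def multistep_consistency_sample(consistency_fn: ConsistencyFn, size: int,
                                 alpha_bars: List[float], seed: int = 0) -> Latent:
    """One model call per entry of alpha_bars (ordered most to least noisy)."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(size)]
    x0 = consistency_fn(z, alpha_bars[0])            # first jump to a clean estimate
    for ab in alpha_bars[1:]:
        noise = [rng.gauss(0, 1) for _ in range(size)]
        # Re-noise the estimate to a lower noise level, then jump again.
        z = [math.sqrt(ab) * x + math.sqrt(1 - ab) * n for x, n in zip(x0, noise)]
        x0 = consistency_fn(z, ab)
    return x0


fn, calls = make_toy_consistency_fn([1.0, -1.0])
sample = multistep_consistency_sample(fn, 2, [0.01, 0.3, 0.6, 0.9])
speedup = BASELINE_STEPS / LCM_STEPS   # ratio of model calls, 100 / 4
```

The cost of sampling is dominated by model calls, so the speed-up is essentially the ratio of step counts; passing a single-entry schedule gives the one-step variant.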

03 — Gradio Interface (attempted)

A small Gradio interface was attempted for interactive prompt-to-image exploration via the diffusers library. The interface itself worked, but iteration speed on an Intel iMac (no GPU) was impractical (~30 s per image), so the project was deferred until the RTX 3060 rental was set up later in the year.
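A prompt-to-image interface of this kind might look roughly like the sketch below. It is not the code from the attempt: the helper names, slider ranges, and defaults are assumptions, and the checkpoint id is left as a required argument because the notes do not say which one was used. It relies on `StableDiffusionPipeline` from diffusers and `gr.Interface` from Gradio, imported lazily so the small helpers can be exercised without either library installed.

```python
def pick_device(cuda_available: bool):
    """Return (device, dtype name); half precision is only used on a GPU."""
    return ("cuda", "float16") if cuda_available else ("cpu", "float32")


def make_generate_fn(pipe):
    """Wrap a diffusers-style pipeline as a (prompt, steps, guidance) -> image callback."""
    def generate(prompt, steps, guidance):
        # Sliders may deliver floats; the pipeline expects an integer step count.
        result = pipe(prompt,
                      num_inference_steps=int(steps),
                      guidance_scale=float(guidance))
        return result.images[0]
    return generate


def build_demo(model_id):
    """Build (but do not launch) the Gradio app for a Stable Diffusion checkpoint."""
    # Heavy imports stay inside so the helpers above work without a GPU stack.
    import torch
    import gradio as gr
    from diffusers import StableDiffusionPipeline

    device, dtype_name = pick_device(torch.cuda.is_available())
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=getattr(torch, dtype_name))
    pipe = pipe.to(device)

    return gr.Interface(
        fn=make_generate_fn(pipe),
        inputs=[
            gr.Textbox(label="Prompt"),
            gr.Slider(minimum=1, maximum=50, value=20, step=1, label="Steps"),
            gr.Slider(minimum=1.0, maximum=15.0, value=7.5, step=0.5,
                      label="Guidance scale"),
        ],
        outputs=gr.Image(type="pil", label="Result"),
    )
```

Calling `build_demo(model_id).launch()` with a chosen checkpoint id would start the local server. On a CPU-only machine the `float32` path is the one taken, which is consistent with the slow per-image times noted above.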

Interactive Demo · Live

Step through the LDM forward pass — VAE encode, U-Net denoise, VAE decode — with a stylised input image.

Demo panels: 01 Input image (cat) · 02 Latent at step t (step 0 of 10) · 03 Decoded output (VAE decode)

Full Technical Paper

White paper · LDM theory study · VAE / U-Net / LCM decomposition · same structure as thesis-line 3-D-diffusion work


