By May 2025 the thesis line had identified diffusion as a likely core tool for image-to-3-D generation. Stable Diffusion 1.x / 2.x are the published reference implementations of latent diffusion. The Topic-12 study session covered the LDM paper end-to-end as a theoretical foundation before the thesis line began building 3-D-diffusion components.
This note records the architectural decomposition (§2), the parameter breakdown (§3), the LCM acceleration mechanism (§4), and the thesis-line implications (§5).
LDM is the composition of four mostly independent components (the VAE's encoder and decoder are listed as separate rows):
| Component | Role | Parameter count |
|---|---|---|
| VAE encoder | Image (512²) → latent (64² × 4 ch) | ~34 M |
| Text encoder | Prompt → token-embedding sequence | CLIP ViT-L · ~123 M |
| U-Net denoiser | Latent + text + timestep → predicted noise | ~860 M (the bulk) |
| VAE decoder | Denoised latent → image (512²) | ~50 M |
| Scheduler | Noise schedule + reverse-step formula (DDIM / DDPM / PNDM) | — |
The decoupling is the load-bearing structural insight. The VAE is trained once on a large dataset (LAION-class) and then frozen. The U-Net is trained on top of the frozen latent space. The text encoder is also frozen (taken from a pre-trained CLIP). Downstream LDMs can therefore reuse the latent space and retrain mainly the U-Net: Stable Diffusion 1.x and 2.x share the same VAE design, although 2.x also swaps the text encoder for OpenCLIP, and Stable Diffusion XL retrains the VAE and uses two text encoders.
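The figures in the component table can be cross-checked with a few lines of arithmetic. The sketch below uses only the approximate values quoted in the table; the helper names (`spatial_downsampling`, `compression_ratio`, `unet_share`) are illustrative and not part of any library.

```python
# Approximate figures copied from the component table (millions of parameters).
PARAMS_MILLIONS = {
    "vae_encoder": 34,
    "text_encoder": 123,
    "unet": 860,
    "vae_decoder": 50,
}

IMAGE_SIDE, IMAGE_CHANNELS = 512, 3    # RGB input image
LATENT_SIDE, LATENT_CHANNELS = 64, 4   # VAE latent


def spatial_downsampling(image_side: int = IMAGE_SIDE,
                         latent_side: int = LATENT_SIDE) -> int:
    """Per-axis downsampling factor of the VAE encoder."""
    return image_side // latent_side


def compression_ratio() -> float:
    """Number of pixel-space values per latent-space value."""
    pixel_values = IMAGE_SIDE * IMAGE_SIDE * IMAGE_CHANNELS
    latent_values = LATENT_SIDE * LATENT_SIDE * LATENT_CHANNELS
    return pixel_values / latent_values


def unet_share() -> float:
    """Fraction of all listed parameters that sit in the U-Net."""
    return PARAMS_MILLIONS["unet"] / sum(PARAMS_MILLIONS.values())


print("downsampling factor:", spatial_downsampling())
print("compression ratio:", compression_ratio())
print("total params (M):", sum(PARAMS_MILLIONS.values()))
print("U-Net share:", round(unet_share(), 3))
```

Under these figures the U-Net denoises a tensor with far fewer values than the pixel image, which is why the expensive iterative part of sampling runs in latent space while the VAE is applied only once at each end.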
Conditional generation enters the U-Net via two mechanisms.
At each U-Net block, the residual stream cross-attends to the text-encoder output sequence. The attention queries Q are projected from the latent residual stream; the keys K and values V are projected from the text tokens:
$$
Q = X W^Q, \qquad K = T W^K, \qquad V = T W^V
$$

$$
\mathrm{Attention}(X, T) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

where X is the latent residual stream (post-self-attention, pre-FFN), T is the text-encoder output, and d_k is the key dimension. The same mechanism handles image conditioning (T is the image-encoder output) and depth / sketch conditioning (T is the conditioning-encoder output). ControlNet adds a trainable copy of the U-Net's encoder that takes the conditioning input while the original U-Net stays frozen; the ControlNet path's outputs are added to the U-Net's residual stream at matching depths.
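The cross-attention formula can be sketched in NumPy as follows. This is a minimal single-head version with random weights and toy dimensions chosen for illustration; multi-head splitting and the output projection used in real U-Net blocks are omitted, and the function names are not taken from any library.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(X, T, W_q, W_k, W_v):
    """Single-head cross-attention: queries from latents X, keys/values from tokens T."""
    Q = X @ W_q                      # (n_latent, d_k)
    K = T @ W_k                      # (n_text, d_k)
    V = T @ W_v                      # (n_text, d_v)
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # (n_latent, n_text)
    return weights @ V, weights


# Toy sizes (assumptions for illustration only).
rng = np.random.default_rng(0)
n_latent, n_text = 16, 77            # flattened 4x4 latent, 77 text tokens
d_latent, d_text, d_k = 32, 48, 8

X = rng.standard_normal((n_latent, d_latent))   # latent residual stream
T = rng.standard_normal((n_text, d_text))       # text-encoder output
W_q = rng.standard_normal((d_latent, d_k))
W_k = rng.standard_normal((d_text, d_k))
W_v = rng.standard_normal((d_text, d_k))

out, weights = cross_attention(X, T, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

Each row of `weights` is a distribution over the conditioning tokens for one latent position. Swapping `T` for image-encoder or depth-encoder features changes nothing in the function, which is the sense in which the mechanism is shared across conditioning types.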
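The timestep t is encoded sinusoidally and projected through a small MLP to per-block scale and shift parameters γ_t, β_t. In the scale-and-shift form of this conditioning, the normalisation inside each block becomes:

$$
\mathrm{AdaLN}(x; \gamma_t, \beta_t) = \gamma_t \odot \mathrm{LayerNorm}(x) + \beta_t
$$

The timestep modulates the activations multiplicatively (scale) and additively (shift) per block. The mechanism is cheap (one MLP shared across blocks) and effective: the timestep information reaches every layer through the normalisation rather than being concatenated to the input once. Class-label conditioning uses the same AdaLN mechanism with the label embedding replacing the timestep. The released Stable Diffusion U-Net normalises with GroupNorm and, by default, adds the projected timestep embedding to each ResBlock's features; the scale-and-shift form above is the variant used in ADM-style U-Nets and in DiT.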
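The scale-and-shift modulation can be sketched in NumPy. The sketch assumes a standard sinusoidal embedding with base 10000, a one-hidden-layer MLP with SiLU activation, and random weights; all function names, layer sizes, and the choice of activation are illustrative assumptions rather than details taken from the LDM code.

```python
import numpy as np


def sinusoidal_embedding(t, dim):
    """Sinusoidal timestep embedding: [sin(t * f_i), cos(t * f_i)] for dim/2 frequencies."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def silu(z):
    return z / (1.0 + np.exp(-z))


def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without learned affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def timestep_scale_shift(t, W_hidden, W_gamma, W_beta):
    """Map a timestep to (gamma_t, beta_t) through a small MLP."""
    emb = sinusoidal_embedding(t, W_hidden.shape[0])
    hidden = silu(emb @ W_hidden)
    return hidden @ W_gamma, hidden @ W_beta


def adaln(x, gamma, beta):
    """AdaLN(x; gamma, beta) = gamma * LayerNorm(x) + beta."""
    return gamma * layer_norm(x) + beta


# Toy sizes (assumptions for illustration only).
rng = np.random.default_rng(0)
d_emb, d_hidden, d_feat = 16, 32, 8
W_hidden = rng.standard_normal((d_emb, d_hidden)) * 0.1
W_gamma = rng.standard_normal((d_hidden, d_feat)) * 0.1
W_beta = rng.standard_normal((d_hidden, d_feat)) * 0.1

x = rng.standard_normal((5, d_feat))             # 5 tokens, 8 features each
gamma, beta = timestep_scale_shift(10, W_hidden, W_gamma, W_beta)
y = adaln(x, gamma, beta)
print(y.shape)
```

With γ_t = 1 and β_t = 0 the function reduces to plain LayerNorm, so the timestep acts purely as a modulation of an otherwise unconditional block.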
Latent Consistency Models [1] distill an LDM into a one-to-four-step sampler. Training uses a consistency loss, which enforces that the denoiser's predictions at adjacent noise levels on the same sampling trajectory collapse to the same clean latent. The result is a roughly 25× sampling speed-up with a marginal quality drop. This is the same trade-off the thesis-line flow-matching switch [3] targets: a small quality reduction for a large speed-up, motivated by interactive-rate single-image-to-3-D inference.
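The few-step sampler can be sketched as a multistep consistency loop: predict the clean latent in one call, re-noise it to a lower noise level, and predict again. The sketch below assumes a variance-preserving schedule with a linear beta range and uses a toy stand-in for the distilled network; `multistep_consistency_sample`, `toy_consistency_fn`, the schedule values, and the chosen timesteps are all hypothetical and are not taken from the LCM code.

```python
import numpy as np

# Assumed variance-preserving noise schedule (illustrative values).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)


def multistep_consistency_sample(f, shape, timesteps, alpha_bar, rng):
    """Few-step sampling with a consistency function f(x_t, t) -> clean latent.

    timesteps must be in decreasing order; one call to f is made per timestep.
    """
    x = rng.standard_normal(shape)          # start from pure noise
    x0 = f(x, timesteps[0])                 # one-step prediction of the clean latent
    for t in timesteps[1:]:
        noise = rng.standard_normal(shape)
        # Re-noise the current estimate to the lower noise level t.
        x = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
        x0 = f(x, t)                        # refine the estimate
    return x0


calls = []


def toy_consistency_fn(x, t):
    """Toy stand-in: the data distribution is a single point at 0.5,
    so the exact consistency function ignores its input."""
    calls.append(t)
    return np.full_like(x, 0.5)


sample = multistep_consistency_sample(
    toy_consistency_fn,
    shape=(4, 8, 8),
    timesteps=[999, 759, 499, 259],
    alpha_bar=alpha_bar,
    rng=np.random.default_rng(0),
)
print(sample.shape, len(calls))
```

The number of network evaluations equals the number of timesteps, so a four-step schedule costs four calls regardless of how many steps the teacher sampler used.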
The LDM components carry forward to thesis-line 3-D-diffusion work through four correspondences.
| LDM component | Thesis-line analogue |
|---|---|
| VAE encoder / decoder (frozen) | Triplane / hexplane encoder / decoder [4,5] — frozen 3-D-feature encoder shared across downstream generators |
| U-Net denoiser (trained) | MambaFlow3D backbone [3] — trained on the triplane latent space |
| Text encoder + cross-attention | Image conditioning via cross-attention from a ViT image-encoder; identical mechanism |
| LCM distillation | Flow-matching velocity prediction — same step-count reduction |
The structural analogy is exact at the architectural level. The thesis-line 3-D-diffusion work is best understood as "LDM, but the VAE is a triplane encoder and the U-Net body can be a Mamba block".
The study session attempted an interactive Gradio interface for prompt-to-image exploration via the `diffusers` library. The interface itself worked. Inference latency on an Intel iMac CPU was approximately 30 seconds per image at default sampler settings, which is impractical for interactive use. The follow-up was deferred until the RTX 3060 rental began in November 2025; the rental was instead spent on the JiT reproduction [6] and the MambaFlow3D Phase-2 work [3], so the Gradio thread was not resumed.
LDM's VAE / U-Net / text-encoder decoupling is the load-bearing structural insight, and the same decoupling structures the thesis-line 3-D-diffusion work. LCM-class step-count reduction sets the magnitude of speed-up that the thesis-line flow-matching switch targets. The Gradio interface attempt confirmed that CPU inference is impractical for interactive use of LDM-class models.