Latent Diffusion Models Study — White Paper

Latent Diffusion Models, Conditional Diffusion, and Latent Consistency Models: A Study Note on the Stable-Diffusion-Class Architecture and Its Implications for 3-D-Diffusion in the Thesis Line

Aaditya Jain

ad_jain@icloud.com · orcid.org/0009-0005-5534-5641

Latent Diffusion Models · Theory · Thesis-Line Theoretical Foundation

Submitted: May 2025
Subject: cs.LG · cs.CV
Keywords: Latent Diffusion Models, LDM, Stable Diffusion, Latent Consistency Models, VAE, U-Net, text encoder, conditional diffusion

Abstract

We document a study note on Latent Diffusion Models (LDMs), the Stable-Diffusion-class architecture that splits generative work between a frozen VAE compressor (pixel space ↔ latent space) and a U-Net denoiser operating in the smaller latent space. We cover the conditional-diffusion mechanisms (text and image conditioning via cross-attention, timestep and class-label conditioning via adaptive normalisation), the VAE / U-Net / text-encoder parameter breakdown (the U-Net holds ~80 % of Stable Diffusion 1.5's parameters by the counts in Table 1; the VAE and text encoder are independent of it), and the Latent Consistency Model (LCM) distillation that brings the sampling-step count from 100+ down to 1–4 with a marginal quality drop. The note's contribution to the thesis line is twofold. First, the explicit decoupling between the VAE and the U-Net is the same decoupling used in the thesis-line 3-D-diffusion work: the triplane / hexplane encoders are the VAE-analogue (frozen, scene-agnostic), and the diffusion or flow-matching backbone is the U-Net-analogue (trained on the latent space). Second, LCM's roughly 25× sampling speed-up over standard DDPM is the magnitude of speed-up the thesis line targets for interactive single-image-to-3-D; the LCM trade-off (a small quality drop for a large speed-up) is the same trade-off the flow-matching switch [3] makes. The study session also attempted a Gradio interface for interactive LDM prompt-to-image exploration; the interface worked, but inference on an Intel iMac CPU was impractical at ~30 s/image, so the work was deferred until the RTX 3060 rental began later in the year.

Keywords: LDM, Stable Diffusion, LCM, VAE / U-Net decoupling, thesis-line theoretical foundation.

1. Introduction

By May 2025 the thesis line had identified diffusion as a likely core tool for image-to-3-D generation. Stable Diffusion 1.x / 2.x are the published reference implementations of latent diffusion [2]. The Topic-12 study session covered the LDM paper [2] end-to-end as a theoretical foundation before the thesis line began building 3-D-diffusion components.

This note records the architectural decomposition and parameter breakdown (§2), the conditional-diffusion mechanisms (§3), the LCM acceleration mechanism (§4), the thesis-line implications (§5), and the deferred Gradio attempt (§6).

2. Architectural Decomposition

LDM is the composition of five mostly-independent components, four of which carry learned parameters:

Table 1 — LDM components (Stable Diffusion 1.5 specifics).
Component	Role	Parameter count
VAE encoder	Image (512²) → latent (64² × 4 ch)	~34 M
Text encoder	Prompt → token-embedding sequence (CLIP ViT-L)	~123 M
U-Net denoiser	Latent + text + timestep → predicted noise	~860 M (the bulk)
VAE decoder	Denoised latent → image (512²)	~50 M
Scheduler	Noise schedule + reverse-step formula (DDIM / DDPM / PNDM)	—

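The decoupling is the load-bearing structural insight. The VAE is trained once on a large image dataset and then frozen. The U-Net is trained on top of the frozen latent space. The text encoder is also frozen (taken from a pre-trained CLIP [7]). Because the components meet only at the latent and token-embedding interfaces, each can be retrained or replaced without touching the others. The Stable Diffusion family shows this: 1.x and 2.x use the same VAE architecture while 2.x swaps in an OpenCLIP text encoder and a retrained U-Net, and Stable Diffusion XL keeps the VAE architecture but retrains its weights and pairs it with a larger U-Net.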
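The parameter shares and the size of the latent space follow directly from Table 1. The sketch below only does that arithmetic; the counts are the approximate values from the table, and the function names are illustrative.

```python
# Parameter counts in millions, copied from Table 1 (approximate values).
PARAMS_M = {
    "vae_encoder": 34,
    "text_encoder": 123,
    "unet": 860,
    "vae_decoder": 50,
}


def parameter_shares(params):
    """Return each component's fraction of the total parameter count."""
    total = sum(params.values())
    return {name: count / total for name, count in params.items()}


def compression_factor(image_hw=512, image_ch=3, latent_hw=64, latent_ch=4):
    """Ratio of pixel-space values to latent-space values for one image."""
    return (image_hw * image_hw * image_ch) / (latent_hw * latent_hw * latent_ch)


shares = parameter_shares(PARAMS_M)
for name, share in shares.items():
    print(f"{name:13s} {share:6.1%}")
print(f"latent space holds {compression_factor():.0f}x fewer values than pixel space")
```

With these counts the U-Net comes out at roughly four fifths of the total, and the 64² × 4 latent holds 48× fewer values than the 512² × 3 image, which is why the denoiser is affordable to run for many steps.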

3. Conditional Diffusion

Conditional generation enters the U-Net via two mechanisms.

3.1 Cross-attention to conditioning tokens

At each U-Net block, the residual stream cross-attends to the text-encoder output sequence. The attention queries Q are projected from the latent residual stream; the keys K and values V are projected from the text tokens:

Q = X · W^Q, K = T · W^K, V = T · W^V
Attention(X, T) = softmax(Q K^⊤ / √d_k) · V

where X is the latent residual stream (post-self-attention, pre-FFN) and T is the text-encoder output. Each latent position therefore receives a weighted mixture of token values, with weights set by query-key similarity. The same mechanism handles image conditioning (T is the image-encoder output) and depth / sketch conditioning (T is the conditioning-encoder output). ControlNet [8] adds a trainable copy of the U-Net's encoder that is trained on the conditioning input while the original U-Net stays frozen; the ControlNet path's outputs are added to the U-Net's residual stream at matching depths.
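The cross-attention formula can be traced through with a small dependency-free sketch. It is single-head, uses random matrices in place of trained projections, and the sizes are arbitrary toy values; none of this reflects the actual Stable Diffusion dimensions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(X, T, W_q, W_k, W_v):
    """softmax(Q K^T / sqrt(d_k)) V with Q projected from X and K, V from T."""
    Q = matmul(X, W_q)                      # (n_latent, d_k)
    K = matmul(T, W_k)                      # (n_tokens, d_k)
    V = matmul(T, W_v)                      # (n_tokens, d_v)
    d_k = len(Q[0])
    K_t = [list(col) for col in zip(*K)]    # (d_k, n_tokens)
    scores = matmul(Q, K_t)                 # (n_latent, n_tokens)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


def random_matrix(rows, cols, rng):
    """Gaussian matrix standing in for trained weights or activations."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_latent, n_tokens, d_latent, d_text, d_k, d_v = 6, 4, 8, 5, 3, 8
X = random_matrix(n_latent, d_latent, rng)   # latent residual stream
T = random_matrix(n_tokens, d_text, rng)     # conditioning tokens
out, weights = cross_attention(
    X, T,
    random_matrix(d_latent, d_k, rng),
    random_matrix(d_text, d_k, rng),
    random_matrix(d_text, d_v, rng),
)
print(len(out), len(out[0]))   # one output row per latent position
```

The output has one row per latent position regardless of how many conditioning tokens there are, which is why the same block accepts text, image, or other encoder outputs as T.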

3.2 Adaptive normalisation (AdaLN)

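The timestep t is encoded sinusoidally and projected through a small MLP to per-block scale and shift parameters γ_t, β_t. In a block that uses adaptive normalisation, the layer-norm is replaced with:

AdaLN(x; γ_t, β_t) = γ_t ⊙ LayerNorm(x) + β_t

The timestep modulates the activations multiplicatively (scale) and additively (shift) per block. The mechanism is cheap (one MLP shared across blocks) and effective: the timestep information reaches every layer through normalisation rather than being concatenated to the input once. Class-label conditioning uses the same AdaLN mechanism with the label embedding taking the place of the timestep embedding. This scale-and-shift form is the one used by adaptive-norm U-Nets and diffusion transformers; the Stable Diffusion 1.5 U-Net itself uses GroupNorm in its residual blocks and, in its default configuration, adds the projected timestep embedding to the block features, which corresponds to the shift term only.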
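A minimal sketch of the AdaLN formula on a single feature vector follows. It assumes one linear layer in place of the small MLP, toy random weights, and a bias chosen so that γ starts near 1 and β near 0; all names and sizes are illustrative.

```python
import math
import random


def sinusoidal_embedding(t, dim):
    """Encode a scalar timestep t as dim sinusoidal features (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def layer_norm(x, eps=1e-5):
    """Normalise one feature vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Apply y = W x + b with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def ada_ln(x, gamma, beta):
    """AdaLN(x; gamma, beta) = gamma * LayerNorm(x) + beta, element-wise."""
    return [g * h + b for g, h, b in zip(gamma, layer_norm(x), beta)]


rng = random.Random(0)
dim, emb_dim = 4, 8
emb = sinusoidal_embedding(t=250, dim=emb_dim)

# Toy projection from the timestep embedding to [gamma, beta].
W = [[rng.gauss(0.0, 0.1) for _ in range(emb_dim)] for _ in range(2 * dim)]
b = [1.0] * dim + [0.0] * dim
gamma_beta = linear(emb, W, b)
gamma, beta = gamma_beta[:dim], gamma_beta[dim:]

x = [0.5, -1.0, 2.0, 0.0]
y = ada_ln(x, gamma, beta)
print(y)
```

Setting γ to all ones and β to all zeros recovers plain layer normalisation, so the conditioning acts as a learned perturbation around the unconditioned block.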

4. LCM Acceleration

Latent Consistency Models [1] distill a pre-trained LDM into a one-to-four-step sampler. Training uses a consistency loss: the student's predictions from two adjacent noise levels on the same sampling trajectory are forced to agree, so every point on the trajectory maps to the same clean latent. The result is a roughly 25× sampling speed-up (for example, 100 steps reduced to 4) with a marginal quality drop. This is the same trade-off the thesis-line flow-matching switch [3] targets: a small quality reduction for a large speed-up, motivated by interactive-rate single-image-to-3-D inference.
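The consistency property can be written out explicitly. The block below uses the general consistency-distillation form; the notation is an assumption of this note rather than a transcription of [1]: f_θ is the student's consistency function, c the conditioning, θ⁻ a slowly updated (EMA) copy of θ, ẑ the latent obtained from z at t_{n+1} by one ODE-solver step with the frozen teacher LDM, and d a distance such as squared L2.

```latex
% Self-consistency: every point on one trajectory maps to the same clean latent,
% with the identity as boundary condition at the smallest time \epsilon.
f_\theta(z_t, c, t) = f_\theta(z_{t'}, c, t') \quad \text{for all } t, t' \in [\epsilon, T],
\qquad f_\theta(z_\epsilon, c, \epsilon) = z_\epsilon .

% Distillation loss between adjacent discretisation points t_n < t_{n+1}.
\mathcal{L}(\theta, \theta^-) =
  \mathbb{E}_{z, c, n}\!\left[
    d\!\left( f_\theta(z_{t_{n+1}}, c, t_{n+1}),\;
              f_{\theta^-}(\hat{z}_{t_n}, c, t_n) \right)
  \right]
```

Once the loss is small, a single evaluation of f_θ at the highest noise level already lands near the clean latent, and a few extra steps trade a little compute for quality.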

5. Thesis-Line Implications

Four correspondences carry forward to thesis-line 3-D-diffusion work.

Table 2 — LDM-to-thesis-line analogy.
LDM component	Thesis-line analogue
VAE encoder / decoder (frozen)	Triplane / hexplane encoder / decoder [4,5] — frozen 3-D-feature encoder shared across downstream generators
U-Net denoiser (trained)	MambaFlow3D backbone [3] — trained on the triplane latent space
Text encoder + cross-attention	Image conditioning via cross-attention from a ViT image-encoder; identical mechanism
LCM distillation	Flow-matching velocity prediction — same step-count reduction

The structural analogy is exact at the architectural level. The thesis-line 3-D-diffusion work is best understood as "LDM, but the VAE is a triplane encoder and the U-Net body can be a Mamba block".
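The decoupling in Table 2 can be sketched as a composition of interchangeable callables. Everything below is hypothetical and for illustration only: the class, the toy backbones, and the toy decoder are stand-ins with no learned weights, not the thesis-line implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class LatentGenerator:
    """A trained latent-space backbone paired with a frozen decoder."""
    denoise: Callable[[Vector, Vector, float], Vector]  # U-Net-analogue (trained)
    decode: Callable[[Vector], Vector]                  # VAE-decoder-analogue (frozen)
    num_steps: int = 4

    def sample(self, noise: Vector, cond: Vector) -> Vector:
        """Refine a noisy latent for num_steps steps, then decode once."""
        z = list(noise)
        for i in range(self.num_steps):
            t = 1.0 - i / self.num_steps  # runs from 1.0 (pure noise) towards 0.0
            z = self.denoise(z, cond, t)
        return self.decode(z)


# Toy stand-ins, purely illustrative.
def toy_decoder(z: Vector) -> Vector:
    """Frozen decoder shared by every generator below."""
    return list(z)


def multi_step_backbone(z: Vector, cond: Vector, t: float) -> Vector:
    """Move the latent halfway towards the conditioning vector each step."""
    return [0.5 * (a + b) for a, b in zip(z, cond)]


def one_step_backbone(z: Vector, cond: Vector, t: float) -> Vector:
    """Jump straight to the conditioning vector in a single step."""
    return list(cond)


slow = LatentGenerator(multi_step_backbone, toy_decoder, num_steps=4)
fast = LatentGenerator(one_step_backbone, toy_decoder, num_steps=1)  # same decoder

noise, cond = [0.0, 0.0], [1.0, 1.0]
print(slow.sample(noise, cond))
print(fast.sample(noise, cond))
```

The two generators differ only in the backbone and step count; the decoder is untouched, which mirrors how a frozen triplane or hexplane autoencoder could be shared across different latent-space backbones.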

6. Failed Gradio Attempt

The study session attempted an interactive Gradio interface for prompt-to-image exploration via the diffusers library. The interface itself worked. Inference latency on Intel iMac CPU was approximately 30 seconds per image at default sampler settings, which is impractical for interactive use. The follow-up was deferred until the RTX 3060 rental began in November 2025; the rental was instead spent on the JiT reproduction [6] and the MambaFlow3D Phase-2 work [3], so the Gradio thread was not resumed.

7. Conclusion

LDM's VAE / U-Net / text-encoder decoupling is the load-bearing structural insight; the same decoupling structures the thesis-line 3-D-diffusion work. LCM-class step-count reduction is the magnitude of speed-up the thesis-line flow-matching switch targets. The Gradio interface attempt confirmed that CPU inference, at about 30 s per image on the Intel iMac, is impractical for interactive use of LDM-class models.

References

[1] Luo, S. et al. "Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference." 2023.

[2] Rombach, R. et al. "High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)." CVPR, 2022.

[3] Jain, A. "MambaFlow3D: Spec, Speed-up Budget, and ModelNet10 Phase-2." Thesis research, Nov 2025. /whitepaper/mambaflow3d

[4] Jain, A. "Triplane Mechanics Deep-Dive." Thesis research, Jan 2026. /whitepaper/triplane-deep-dive

[5] Jain, A. "Hexplane Autoencoder." Thesis research, Dec 2025. /whitepaper/hexplane-ae

[6] Jain, A. "Training JiT Diffusion on Two Consumer GPUs." Thesis research, Nov 2025. /whitepaper/jit-diffusion

[7] Radford, A. et al. "Learning Transferable Visual Models From Natural Language Supervision (CLIP)." ICML, 2021. The text encoder used by Stable Diffusion 1.5.

[8] Zhang, L., Rao, A., Agrawala, M. "Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet)." ICCV, 2023.