Aditya Jain / Apple Maps · 3D Reconstruction
May 2025
Topic 12 May 2025 LDM · Conditional Diffusion · LCM

Latent Diffusion —
Theory + Gradio Attempt.

Study session on Latent Diffusion Models (Stable Diffusion-class): conditional diffusion (text, class-label, and image conditioning), the VAE-encoder + U-Net-denoiser + VAE-decoder structure, and latent consistency models (LCM) for one-to-four-step sampling. A Gradio interface for interactive prompt-to-image exploration was also attempted. The session was groundwork for the 3-D-diffusion work that comes later in the thesis line (Topic 27 JiT, Topic 26 MambaFlow3D, Topic 24 polyline diffusion).

00 — Motivation

Understand LDM before scaling to 3-D.

By May 2025 the thesis line had identified diffusion as a likely core tool for image-to-3-D (later validated by the survey in Topic 18). Stable Diffusion 1.x / 2.x are the most widely used public implementations of latent diffusion, so they served as the reference. The study session covered the LDM paper end-to-end, the VAE / U-Net / text-encoder split, the role of classifier-free guidance, and LCM as the fast-sampling distillation that cuts the step count from about 100 to 4 or fewer.
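
Classifier-free guidance, mentioned above, combines two noise predictions from the same U-Net at each step: one computed with the text condition and one without it. A minimal sketch of that combination rule follows, using plain Python lists as stand-ins for latent tensors. The function name and the toy values are illustrative assumptions, not taken from the study notes.

```python
from typing import List, Sequence


def cfg_combine(eps_uncond: Sequence[float],
                eps_cond: Sequence[float],
                guidance_scale: float) -> List[float]:
    """Classifier-free guidance: start at the unconditional prediction and
    move toward (or past) the conditional one by `guidance_scale`."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (hypothetical values, for illustration only).
eps_u = [0.0, 1.0, -2.0]
eps_c = [1.0, 1.0, 0.0]

same_as_cond = cfg_combine(eps_u, eps_c, 1.0)    # scale 1 recovers the conditional prediction
same_as_uncond = cfg_combine(eps_u, eps_c, 0.0)  # scale 0 recovers the unconditional prediction
guided = cfg_combine(eps_u, eps_c, 2.0)          # scale > 1 extrapolates past the conditional one
```

Because both predictions are needed, a guided sampling step costs two U-Net evaluations rather than one, which is part of why reducing the step count matters.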

01 — LDM Components
| Component | Role | Stable Diffusion 1.5 size |
|---|---|---|
| VAE encoder | Image (512²) → latent (64² × 4 ch) | ~34 M params |
| Text encoder | Prompt → token-embedding sequence | CLIP ViT-L · ~123 M params |
| U-Net denoiser | Latent + text + timestep → predicted noise | ~860 M params (the bulk) |
| VAE decoder | Denoised latent → image (512²) | ~50 M params |
| Scheduler | Noise schedule + reverse-step formula (DDIM / DDPM / PNDM) | |
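
The sizes in the table imply how strongly the VAE compresses and how the parameter budget is split. The following is a quick arithmetic check, assuming an RGB (3-channel) input, which the table does not state explicitly. The variable names are illustrative.

```python
from math import prod

image_shape = (3, 512, 512)   # assumed RGB input at the 512² resolution from the table
latent_shape = (4, 64, 64)    # 4-channel latent at 64², as in the table

spatial_downsample = image_shape[1] // latent_shape[1]   # per-side factor
compression = prod(image_shape) / prod(latent_shape)     # element-count ratio

# Approximate parameter counts from the table, in millions.
params_m = {"vae_encoder": 34, "text_encoder": 123, "unet": 860, "vae_decoder": 50}
total_m = sum(params_m.values())
unet_share = params_m["unet"] / total_m

print(f"downsample per side: {spatial_downsample}x")
print(f"elements: {prod(image_shape)} -> {prod(latent_shape)} ({compression:.0f}x fewer)")
print(f"U-Net share of ~{total_m} M params: {unet_share:.0%}")
```

Under these numbers the U-Net operates on 48× fewer elements than it would in pixel space, while holding roughly four fifths of the learned parameters.
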
Core Insight

The VAE is the compression. The U-Net is the diffusion.
They're independent components.

The LDM trick is to decouple the heavy generative work (U-Net diffusion) from the pixel-space work (VAE compression). The VAE is trained once on a large dataset; the U-Net is then trained on the frozen VAE's latent space. Downstream LDMs can therefore share one VAE and swap only the U-Net. The same pattern underlies the 3-D-diffusion work in the thesis line, where the VAE-equivalent (a triplane encoder) is shared across Hierarchical Triplane (Topic 33), Hexplane AE (Topic 29), and MambaFlow3D (Topic 26).
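
The decoupling can be illustrated with a toy pipeline in which one fixed encode/decode pair is reused with two different denoisers. This is a structural sketch only: `ToyVAE`, both denoisers, and the update rule are hypothetical stand-ins, and the update is a placeholder rather than a real DDIM or DDPM step.

```python
from dataclasses import dataclass
from typing import Callable, List

Latent = List[float]
Denoiser = Callable[[Latent, int], Latent]  # (latent, timestep) -> predicted noise


@dataclass(frozen=True)
class ToyVAE:
    """Stand-in for the frozen VAE: average-pool to encode, repeat to decode."""
    factor: int = 8

    def encode(self, pixels: List[float]) -> Latent:
        f = self.factor
        return [sum(pixels[i:i + f]) / f for i in range(0, len(pixels), f)]

    def decode(self, latent: Latent) -> List[float]:
        return [z for z in latent for _ in range(self.factor)]


def sample(vae: ToyVAE, denoiser: Denoiser, start: Latent, steps: int) -> List[float]:
    """Run the denoiser in latent space only, then decode once at the end."""
    z = list(start)
    for t in reversed(range(steps)):
        eps = denoiser(z, t)
        # Placeholder update, NOT a real scheduler step: subtract the predicted noise.
        z = [zi - ei for zi, ei in zip(z, eps)]
    return vae.decode(z)


# Two hypothetical denoisers that operate on the same latent space.
def denoiser_a(z: Latent, t: int) -> Latent:
    return [0.5 * zi for zi in z]          # pulls the latent toward 0


def denoiser_b(z: Latent, t: int) -> Latent:
    return [0.5 * (zi - 1.0) for zi in z]  # pulls the latent toward 1


shared_vae = ToyVAE(factor=8)
start_latent = shared_vae.encode([4.0] * 16)   # 16 "pixels" -> 2 latent values
out_a = sample(shared_vae, denoiser_a, start_latent, steps=4)
out_b = sample(shared_vae, denoiser_b, start_latent, steps=4)
```

Swapping `denoiser_a` for `denoiser_b` changes the output without touching `shared_vae`, which is the property the paragraph above relies on. In text-to-image sampling the start latent would be Gaussian noise rather than an encoded input.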

02 — LCM Acceleration

Latent Consistency Models distill an LDM into a one-to-four-step sampler via a consistency loss. The trade-off is a slight quality drop in exchange for roughly 25× faster inference (100 steps down to 4). This matters to the thesis line for the same reason flow matching does: fewer sampling steps make interactive-rate single-image-to-3-D feasible on consumer GPUs.
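
The ~25× figure is a step-count ratio. A small sketch of that arithmetic follows, with an optional fixed per-image cost to show why measured speedups tend to land below the ideal ratio. The `speedup` helper and the overhead value are assumptions for illustration, not measurements.

```python
def speedup(baseline_steps: int, fast_steps: int,
            step_cost: float = 1.0, fixed_cost: float = 0.0) -> float:
    """Ratio of per-image cost. `fixed_cost` covers work done once per image
    (for example text encoding and VAE decoding) regardless of step count."""
    if baseline_steps < 1 or fast_steps < 1:
        raise ValueError("step counts must be >= 1")
    baseline = baseline_steps * step_cost + fixed_cost
    fast = fast_steps * step_cost + fixed_cost
    return baseline / fast


# Step counts from the text: 100 baseline steps, 1 to 4 LCM steps.
ideal = {n: speedup(100, n) for n in (1, 2, 4)}

# Hypothetical fixed cost equal to one U-Net step, for illustration only.
with_overhead = speedup(100, 4, step_cost=1.0, fixed_cost=1.0)
```

With zero fixed cost the four-step ratio is exactly 25; any per-image overhead pulls the ratio below that.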

03 — Gradio Interface (attempted)

A small Gradio interface was attempted for interactive prompt-to-image exploration via the diffusers library. The interface itself worked, but iteration on an Intel iMac (no GPU) was impractically slow (~30 s per image), so the project was deferred until the RTX 3060 rental was set up later in the year.
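
The original interface code is not reproduced in these notes. The following is a minimal sketch of what such a Gradio front end over diffusers could look like. The checkpoint id, slider ranges, and helper names are assumptions, and the heavy imports are deferred into `build_demo` so the file can be loaded without gradio, diffusers, or torch installed.

```python
from typing import Any, Dict

# Assumed checkpoint id; substitute whichever Stable Diffusion 1.x weights are available.
MODEL_ID = "stable-diffusion-v1-5/stable-diffusion-v1-5"


def sampling_kwargs(steps: int, guidance_scale: float) -> Dict[str, Any]:
    """Validate UI values and map them to diffusers pipeline arguments."""
    if not 1 <= steps <= 100:
        raise ValueError("steps must be between 1 and 100")
    if guidance_scale < 0:
        raise ValueError("guidance_scale must be non-negative")
    return {"num_inference_steps": int(steps), "guidance_scale": float(guidance_scale)}


def build_demo():
    """Build the Gradio app; loads the pipeline once, on GPU if one is present."""
    import gradio as gr
    import torch
    from diffusers import StableDiffusionPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=dtype).to(device)

    def generate(prompt: str, steps: float, guidance_scale: float):
        result = pipe(prompt, **sampling_kwargs(int(steps), guidance_scale))
        return result.images[0]  # a PIL image, which gr.Image can display

    return gr.Interface(
        fn=generate,
        inputs=[
            gr.Textbox(label="Prompt"),
            gr.Slider(minimum=1, maximum=100, value=30, step=1, label="Steps"),
            gr.Slider(minimum=0.0, maximum=15.0, value=7.5, step=0.5, label="Guidance scale"),
        ],
        outputs=gr.Image(label="Result"),
    )
```

Calling `build_demo().launch()` would start the local web UI. On a CPU-only machine each request runs the full sampling loop on the CPU, which is consistent with the slow iteration described above.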

Interactive Demo

Step through the LDM forward pass (VAE encode, U-Net denoise, VAE decode) with a stylised input image.

Panels: 01 Input image (cat); 02 Latent at step t (step 0 / 10); 03 Decoded output (VAE decode).

Full Technical Paper

White paper · LDM theory study · VAE / U-Net / LCM decomposition · same structure as thesis-line 3-D-diffusion work

Related Thesis Chapters
Red-Square DDPM
Earlier (Feb 2025) hands-on diffusion mechanics work. LDM theory study builds on it.
JiT Diffusion
Production diffusion topic in the thesis line. The LDM theory study informs the architectural choices in the JiT reproduction.
x-Prediction Paper
Theoretical follow-up that revisits LDM's prediction-target choice and motivates the switch to x-prediction for high-dimensional 3-D.