Technical Note · cs.LG · Feb 2025
A Minimal DDPM From First Principles: Single-Image Training on a 16×16 Red Square as Architecture-Literacy Investment for the Thesis-Line Diffusion Work
Aaditya Jain
Diffusion Models · From-Scratch Foundations · Thesis-Line Foundation Study
Submitted: February 2025 · Subject: cs.LG · Keywords: DDPM, diffusion, from-scratch implementation, ε-prediction, timestep embedding, thesis-line foundation
Abstract
We document a minimal Denoising Diffusion Probabilistic Model [1] built from first principles in pure PyTorch tensors — a 3 M-parameter MLP denoiser with sinusoidal timestep embedding, a 100-step linear β schedule, and ε-prediction MSE loss. The training data is a single 16 × 16 RGB image of a red square on a black background; the model trains for 200 epochs in approximately five minutes on an M2 Mac CPU. After training, ancestral sampling from x_T ∼ 𝒩(0, I) over 100 reverse steps reconstructs the red square with MSE < 0.01 relative to the training image. The reconstruction is uninteresting in itself — a one-image generator can only memorise — but two diagnostic signals confirm that the architecture is mechanically correct: the loss curve decreases monotonically from ~1.0 at step 0 to ~0.01 at end of training, and the per-step reverse-pass intermediate states visually transition from Gaussian noise at x_T, through red-tinted noise at intermediate x_t, to the clean red square at x_0. The contribution is the documented from-scratch path — every diffusion-family topic in the subsequent thesis line (the LDM study [2], the MNIST flow-matching backbone validation [3], the MambaFlow3D Phase-2 work [4], the JiT ImageNet-256 reproduction [5], and the polyline-diffusion design study [6]) is informed by having watched the forward-noise / reverse-denoise Markov chain run on a problem this small. The generalisable observation: 100-step ancestral sampling on a 3 M-parameter MLP costs approximately 0.8 s on CPU. Scaling that to the 86 M-parameter JiT ViT at 256² resolution puts the sampling cost on the radar of inference-latency-budget concerns, which directly motivates the flow-matching switch later in the thesis line.
1. Introduction

Five thesis-line topics depend on diffusion or diffusion-adjacent architectures as load-bearing components [2,3,4,5,6]. The DDPM paper [1] communicates the algorithm at a level that lets a reader implement it; what the paper does not communicate is the mechanical feel of the noise schedule, the role of the timestep embedding, the variance scaling in the reverse step, and the trajectory of an intermediate x_t from noise back to data. This paper documents the cheapest path to that mechanical feel: implement DDPM from first principles on a one-image toy.

The exercise is the diffusion-side counterpart to the transformer-from-scratch work in [7]. The contribution is identical in shape: a documented from-scratch path and the architecture-literacy investment that pays back across the subsequent thesis line.

2. Architecture

Table 1 specifies the smallest non-trivial network that still exercises every DDPM component meaningfully.

Table 1 — Red-Square DDPM configuration.
Component | Setting | Comment
Training data | Single 16 × 16 RGB image (red square on black) | 768 floats per image; one training example
Denoiser | 3-layer MLP, widths 768 → 1024 → 1024 → 768 | ~3 M parameters; over-parameterised on purpose for the toy
Timestep embedding | Sinusoidal 128-dim → Linear 128 → 768 | Added (not concatenated) to the first-layer activation
Diffusion timesteps | T = 100 | Less than the DDPM-paper T = 1 000; sufficient for the toy
β schedule | Linear, β₁ = 1×10⁻⁴ → β₁₀₀ = 0.02 | Standard DDPM linear schedule
Loss | MSE on predicted ε | ε-prediction (standard DDPM choice)
Optimiser | AdamW, lr 1×10⁻³ | Aggressive — one training image, no overfit risk
Sampling | DDPM ancestral, 100 steps | From x_T ∼ 𝒩(0, I) back to x_0
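
For concreteness, a minimal PyTorch sketch of the denoiser follows. It matches Table 1 in shape; the helper names, the SiLU activation, and the exact point where the projected timestep embedding is added (here: into the 768-dim features feeding the first layer) are illustrative assumptions, not the verbatim implementation.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Sinusoidal timestep embedding: (B,) integer tensor -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10_000.0) * torch.arange(half) / half)
    ang = t.float()[:, None] * freqs[None, :]
    return torch.cat([ang.sin(), ang.cos()], dim=-1)

class Denoiser(nn.Module):
    """3-layer MLP epsilon-predictor (768 -> 1024 -> 1024 -> 768, ~3 M params)."""
    def __init__(self, data_dim: int = 768, hidden: int = 1024, t_dim: int = 128):
        super().__init__()
        self.t_proj = nn.Linear(t_dim, data_dim)   # the "Linear 128 -> 768" of Table 1
        self.net = nn.Sequential(
            nn.Linear(data_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, data_dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Timestep embedding is added (not concatenated) into the 768-dim
        # features feeding the first layer -- one reading of Table 1.
        return self.net(x_t + self.t_proj(timestep_embedding(t)))
```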
3. The Schedule and the Forward Process

The linear β schedule defines per-step noise variances β_1, β_2, …, β_T for T = 100 timesteps:

β_t = β_1 + (t − 1) · (β_T − β_1) / (T − 1), with β_1 = 1 × 10⁻⁴, β_T = 0.02

From this, the per-step signal retention α_t = 1 − β_t and the cumulative product α̅_t = ∏_{s ≤ t} α_s are precomputed. By convention α̅_0 = 1 (pure signal). At t = T, this truncated 100-step schedule gives α̅_T ≈ 0.37, not the near-zero α̅_T ≈ 10⁻⁴ that the DDPM-paper schedule reaches at T = 1 000: shortening T without rescaling β leaves residual signal in x_T, which is benign at this one-image scale. The closed-form forward process is then:

x_t = √α̅_t · x_0 + √(1 − α̅_t) · ε, ε ∼ 𝒩(0, I)

This closed form is the load-bearing property — it means training can sample any timestep t in a single step (no need to simulate the Markov chain forward through t intermediate steps). The reverse process, in contrast, must be simulated step-by-step at sampling time.
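
A sketch of the precomputed schedule and the one-shot forward sample, assuming the data is flattened to shape (B, 768); the names are illustrative.

```python
import torch

T = 100
betas = torch.linspace(1e-4, 0.02, T)       # beta_1 ... beta_T (index 0 holds t = 1)
alphas = 1.0 - betas                        # per-step signal retention alpha_t
alpha_bar = torch.cumprod(alphas, dim=0)    # cumulative product alpha-bar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Closed-form forward process: jump from x_0 straight to x_t."""
    ab = alpha_bar[t - 1].view(-1, 1)       # t is 1-indexed; the tensor is 0-indexed
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
```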

4. Training and Sampling

Training at each step: sample t ∼ Uniform[1, T], sample ε ∼ 𝒩(0, I), compute x_t = √α̅_t · x_0 + √(1−α̅_t) · ε, predict ε̂ = denoiser(x_t, t), optimise MSE ‖ε − ε̂‖². The denoiser receives the noisy image x_t and the integer timestep t; the timestep is sinusoidally encoded into a 128-dim vector and projected through a linear layer into the 768-dim feature space, then added to the first-layer activation.

Training runs for 200 epochs (one optimisation step per epoch — single training image, so each epoch is one forward+backward pass at a random t). Wallclock: approximately 5 minutes on an M2 Mac CPU.
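
Put together, the training loop is a few lines. The sketch below builds on the Denoiser and q_sample sketches above; load_red_square is a hypothetical loader returning the training image as a (1, 768) tensor scaled to [−1, 1].

```python
import torch
import torch.nn.functional as F

x0 = load_red_square()                    # hypothetical loader: (1, 768) in [-1, 1]
model = Denoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for epoch in range(200):                  # one optimisation step per epoch (one image)
    t = torch.randint(1, T + 1, (1,))     # t ~ Uniform{1, ..., T}
    eps = torch.randn_like(x0)            # eps ~ N(0, I)
    x_t = q_sample(x0, t, eps)
    loss = F.mse_loss(model(x_t, t), eps) # epsilon-prediction MSE
    opt.zero_grad()
    loss.backward()
    opt.step()
```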

Sampling: draw x_T ∼ 𝒩(0, I) as a 768-dim Gaussian; then, for t = T, T−1, …, 1, predict ε̂ = denoiser(x_t, t) and compute the reverse-step posterior:

μ_{t-1} = (1 / √α_t) · (x_t − (β_t / √(1 − α̅_t)) · ε̂)
σ²_{t-1} = β_t (the simpler of the two DDPM-paper variance choices)
x_{t-1} = μ_{t-1} + √σ²_{t-1} · z, where z ∼ 𝒩(0, I) if t > 1, else z = 0

After the final step (t = 1), x_0 is the generated image. Total sampling cost: 100 forward passes through the 3 M-parameter MLP, approximately 0.8 s on an M2 CPU.
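
The reverse-step formula translates directly into a loop; a sketch under the same assumptions as the earlier blocks (schedule tensors and model in scope):

```python
@torch.no_grad()
def ddpm_sample(model, T: int = 100) -> torch.Tensor:
    """Ancestral sampling: x_T ~ N(0, I), then T reverse steps down to x_0."""
    x = torch.randn(1, 768)                              # x_T
    for t in range(T, 0, -1):
        i = t - 1                                        # 0-indexed schedule lookup
        eps_hat = model(x, torch.tensor([t]))
        mean = (x - betas[i] / (1.0 - alpha_bar[i]).sqrt() * eps_hat) / alphas[i].sqrt()
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + betas[i].sqrt() * z                   # sigma_t^2 = beta_t choice
    return x                                             # x_0, the generated image
```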

5. Results
Table 2 — Training metrics and sampling result.
Metric | Value | Comment
Final training loss | ~0.012 | Monotonic decrease from ~1.0; no instability
Reverse-pass reconstruction MSE | < 0.01 | vs. the training image, after 100-step ancestral sampling
Sampling wallclock | ~0.8 s on M2 CPU | 100 forward passes through the 3 M-param MLP
Training wallclock | ~5 min for 200 epochs | CPU only; no GPU needed at this scale
Visual result | Recognisable red square | Confirms the architecture is mechanically correct

The reconstruction is uninteresting as a generative result — a one-image generator can only memorise — but the diagnostic signals confirm the implementation is correct: monotonic loss decrease (the network is learning the noise prediction), recognisable red square in the reverse-pass output (the schedule and the variance scaling are wired correctly), and a forward-noise / reverse-denoise trajectory that is visually inspectable at every step.

6. Trajectory Inspection

The reverse-pass intermediate states x_T, x_{T-1}, …, x_0 are saved at every step and visualised as a 100-frame trajectory. The qualitative observations, by timestep band:

  • t ∈ [100, 80]: pure Gaussian noise, no structure visible.
  • t ∈ [80, 50]: a faint red tint emerges in the centre of the image (the network has identified the "red square" region but the geometry is still noisy).
  • t ∈ [50, 20]: the red square's boundaries sharpen; the black-and-red colour separation becomes crisp; noise still visible in the bulk.
  • t ∈ [20, 0]: residual noise is removed step-by-step; the final x_0 is visually indistinguishable from the training image (MSE < 0.01).

The trajectory is the operational version of "denoising" — it makes concrete what the schedule and the reverse-step formula achieve. Reading the DDPM paper without watching this trajectory leaves a gap that no amount of equation-reading fills.
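
A sketch of how the trajectory frames can be recorded and tiled for inspection; matplotlib and the frame layout are assumptions, and the frames themselves are collected inside the sampling loop as noted in the trailing comment.

```python
import matplotlib.pyplot as plt

def plot_trajectory(frames, every: int = 10):
    """frames: list of (16, 16, 3) arrays, one per reverse step, x_T first."""
    cols = frames[::every] + [frames[-1]]        # subsample; always keep x_0
    fig, axes = plt.subplots(1, len(cols), figsize=(2 * len(cols), 2))
    for ax, img in zip(axes, cols):
        ax.imshow(img)
        ax.axis("off")
    fig.savefig("trajectory.png", bbox_inches="tight")

# Inside ddpm_sample's loop, after each update, map [-1, 1] back to [0, 1]:
#   frames.append((x.view(16, 16, 3) * 0.5 + 0.5).clamp(0, 1).numpy())
```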

7. The Sampling-Cost Implication

The 0.8 s ancestral-sampling cost on a 3 M-parameter MLP is the seed of an observation that becomes load-bearing later in the thesis line. Naively scaling to the 86 M-parameter JiT ViT at 256² resolution (the Topic-27 reproduction [5]) — a ~29 × parameter scale-up plus a ~256 × pixel-count scale-up — puts the DDPM sampling cost in the seconds-to-minutes regime per generation. For an interactive use case the cost is unaffordable.
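
A back-of-envelope version of that scale-up, with the caveat that real ViT cost does not scale linearly in parameters times pixels (self-attention is quadratic in token count), so the numbers below are an order-of-magnitude bound rather than a prediction:

```python
params_factor = 86e6 / 3e6        # ~28.7x parameters (JiT ViT vs. the toy MLP)
pixels_factor = 256**2 / 16**2    # 256x pixel count (256^2 vs. 16^2 images)
print(f"params alone: ~{0.8 * params_factor:.0f} s per 100-step sample")
# ~23 s from the parameter factor alone; folding in any pixel-count growth
# pushes the naive estimate into the minutes regime -- hence seconds-to-minutes.
```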

The two ways out are (i) reduce sampling-step count by switching to a denoiser parameterisation that gives competitive quality at fewer steps (flow matching [3], LCM [2]), and (ii) reduce per-step cost by switching the backbone to a linear-time architecture (Mamba [4]). Both decisions in the thesis line are seeded by the cost observation made on the red-square toy here.

8. The Generalisable Lesson

Same as the transformer-from-scratch counterpart [7]: for any architectural class the thesis line will use load-bearing, build a 100-line from-scratch version on a toy problem first. For diffusion the toy is "one image and 100 schedule steps". For transformers the toy is "one short sentence and an untrained block". In both cases the cost is a few hours and the payback is faster paper reading and faster debugging across the year-scale thesis work.

9. Conclusion

A minimal DDPM was implemented from first principles and trained on a single 16 × 16 red-square image. The reverse-pass ancestral sampling reconstructs the training image with MSE < 0.01. The contribution is the documented from-scratch path and the sampling-cost observation that motivates the flow-matching and Mamba switches in the later thesis-line work.

References
[1] Ho, J., Jain, A., Abbeel, P. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020.
[2] Jain, A. "Latent Diffusion Model Study." Thesis research, May 2025. /whitepaper/ldm-study
[3] Jain, A. "MNIST Flow-Matching Backbone Validation." Thesis research, Nov 2025. /whitepaper/mnist-flow-validation
[4] Jain, A. "MambaFlow3D: Spec, Speed-up Budget, and ModelNet10 Phase-2." Thesis research, Nov 2025. /whitepaper/mambaflow3d
[5] Jain, A. "Training JiT Diffusion on Two Consumer GPUs." Thesis research, Nov 2025. /whitepaper/jit-diffusion
[6] Jain, A. "Diffusion for Houdini Polylines — Design Study." Thesis research, Nov 2025. /whitepaper/polyline-diffusion
[7] Jain, A. "A Character-Level Transformer From First Principles." Thesis research, Sep 2025. /whitepaper/mini-llm
[8] Song, J., Meng, C., Ermon, S. "Denoising Diffusion Implicit Models." ICLR, 2021. (DDIM — the faster-sampling cousin of DDPM; not used here.)