Five thesis-line topics depend on diffusion or diffusion-adjacent architectures as load-bearing components [2,3,4,5,6]. The DDPM paper [1] communicates the algorithm at a level that lets a reader implement it; what the paper does not communicate is the mechanical feel of the noise schedule, the role of the timestep embedding, the variance scaling in the reverse step, and the trajectory of an intermediate x_t from noise back to data. This paper documents the cheapest path to that mechanical feel: implement DDPM from first principles on a one-image toy.
The exercise is the diffusion-side counterpart to the transformer-from-scratch work in [7]. The contribution is identical in shape: a documented from-scratch path and the architecture-literacy investment that pays back across the subsequent thesis line.
The setup below is the smallest non-trivial configuration that exercises every DDPM component meaningfully:
| Component | Setting | Comment |
|---|---|---|
| Training data | Single 16 × 16 RGB image (red square on black) | 768 floats per image; one training example |
| Denoiser | 3-layer MLP, widths 768 → 1024 → 1024 → 768 | ~3 M parameters; over-parameterised on purpose for the toy |
| Timestep embedding | Sinusoidal 128-dim → Linear 128 → 768 | Added (not concatenated) to the first-layer activation |
| Diffusion timesteps | T = 100 | Less than the DDPM-paper T = 1 000; sufficient for the toy |
| β schedule | Linear, β₁ = 1×10⁻⁴ → β₁₀₀ = 0.02 | Standard DDPM linear schedule |
| Loss | MSE on predicted ε | ε-prediction (standard DDPM choice) |
| Optimiser | AdamW, lr 1×10⁻³ | Aggressive — one training image, no overfit risk |
| Sampling | DDPM ancestral, 100 steps | From x_T ∼ 𝒩(0, I) back to x_0 |
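The table translates into a few dozen lines of code. The sketch below is a minimal PyTorch rendering of the denoiser, not the actual thesis implementation: the class and function names are illustrative, the ReLU activation is an assumption (the table does not specify one), and the 768-dim timestep projection is added at the input of the first layer, which is one consistent reading of "added to the first-layer activation" given the 128 → 768 projection.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t, dim=128):
    # Transformer-style sinusoidal embedding of an integer timestep t: (batch,) -> (batch, dim).
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class Denoiser(nn.Module):
    # 3-layer MLP, widths 768 -> 1024 -> 1024 -> 768.
    def __init__(self, data_dim=768, hidden=1024, emb_dim=128):
        super().__init__()
        self.t_proj = nn.Linear(emb_dim, data_dim)   # Linear 128 -> 768 for the timestep embedding
        self.fc_in = nn.Linear(data_dim, hidden)
        self.fc_mid = nn.Linear(hidden, hidden)
        self.fc_out = nn.Linear(hidden, data_dim)
        self.act = nn.ReLU()

    def forward(self, x_t, t):
        h = x_t + self.t_proj(sinusoidal_embedding(t))   # add, not concatenate
        h = self.act(self.fc_in(h))
        h = self.act(self.fc_mid(h))
        return self.fc_out(h)                            # predicted noise eps_hat
```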
The linear β schedule defines per-step noise variances β_1, β_2, …, β_T for T = 100 timesteps:
β_t = β_1 + (t − 1) · (β_T − β_1) / (T − 1), with β_1 = 1 × 10⁻⁴, β_T = 0.02

From this, the per-step signal retention α_t = 1 − β_t and the cumulative product α̅_t = ∏_{s ≤ t} α_s are precomputed. By convention α̅_0 = 1 (pure signal); with this 100-step schedule α̅_T ≈ 0.36, so x_T is mostly, though not purely, noise (the almost-pure-noise regime of the original DDPM schedule requires T = 1 000 with these β endpoints). The closed-form forward process is then:
x_t = √α̅_t · x_0 + √(1 − α̅_t) · ε, ε ∼ 𝒩(0, I)

The closed form is the load-bearing property: it means training can sample any timestep t in a single step (no need to simulate the Markov chain forward through t intermediate steps). The reverse process, in contrast, must be simulated step-by-step at sampling time.
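A short sketch, under the same PyTorch assumption, of the precomputed schedule quantities and the closed-form forward step; the helper name q_sample and the tensor layout are illustrative choices, not the thesis implementation's API.

```python
import torch

T = 100
beta = torch.linspace(1e-4, 0.02, T)        # beta_1 ... beta_T (tensor index 0 holds beta_1)
alpha = 1.0 - beta                          # per-step signal retention alpha_t
alpha_bar = torch.cumprod(alpha, dim=0)     # cumulative product alpha_bar_t

def q_sample(x0, t, eps):
    # Closed-form forward process: jump from x_0 straight to x_t in one step.
    # t is a (batch,) tensor of integer timesteps in [1, T], hence the t - 1 indexing.
    ab = alpha_bar[t - 1].unsqueeze(-1)     # (batch, 1), broadcasts over the 768 features
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
```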
Training at each step: sample t ∼ Uniform[1, T], sample ε ∼ 𝒩(0, I), compute x_t = √α̅_t · x_0 + √(1−α̅_t) · ε, predict ε̂ = denoiser(x_t, t), optimise MSE ‖ε − ε̂‖². The denoiser receives the noisy image x_t and the integer timestep t; the timestep is sinusoidally encoded into a 128-dim vector and projected through a linear layer into the 768-dim feature space, then added to the first-layer activation.
Training runs for 200 epochs (one optimisation step per epoch — single training image, so each epoch is one forward+backward pass at a random t). Wallclock: approximately 5 minutes on an M2 Mac CPU.
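Combining the two sketches above gives a hedged version of the training loop just described; x0 is assumed to be the single training image flattened to a (1, 768) tensor, and the hyperparameters are the ones from the table.

```python
import torch
import torch.nn.functional as F

model = Denoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for epoch in range(200):                    # one optimisation step per epoch
    t = torch.randint(1, T + 1, (1,))       # t ~ Uniform[1, T]
    eps = torch.randn_like(x0)              # eps ~ N(0, I)
    x_t = q_sample(x0, t, eps)              # noise x_0 to level t in a single closed-form step
    eps_hat = model(x_t, t)                 # predict the injected noise
    loss = F.mse_loss(eps_hat, eps)         # ||eps - eps_hat||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
```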
Sampling: draw x_T ∼ 𝒩(0, I) as a 768-dim Gaussian; then, for t = T, T−1, …, 1, predict ε̂ = denoiser(x_t, t) and compute the reverse-step posterior:
μ_{t-1} = (1 / √α_t) · (x_t − (β_t / √(1 − α̅_t)) · ε̂)
σ²_{t-1} = β_t (the simpler of the two DDPM-paper variance choices)
x_{t-1} = μ_{t-1} + √σ²_{t-1} · z, z ∼ 𝒩(0, I) if t > 1, else z = 0

After the final step (t = 1) the resulting x_0 is the generated image. Total sampling cost: 100 forward passes through the 3 M-parameter MLP, approximately 0.8 s on M2 CPU.
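A sketch of that ancestral loop, reusing the schedule tensors and Denoiser from the earlier snippets; it also records every intermediate state, which is what the trajectory inspection below relies on. Function and variable names are illustrative.

```python
import torch

@torch.no_grad()
def sample(model):
    x = torch.randn(1, 768)                          # x_T ~ N(0, I)
    frames = [x.clone()]                             # record x_T, x_{T-1}, ..., x_0
    for t in range(T, 0, -1):
        eps_hat = model(x, torch.tensor([t]))        # predict the noise at this step
        a, ab, b = alpha[t - 1], alpha_bar[t - 1], beta[t - 1]
        mean = (x - (b / (1.0 - ab).sqrt()) * eps_hat) / a.sqrt()
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + b.sqrt() * z                      # sigma^2_{t-1} = beta_t choice
        frames.append(x.clone())
    return x, frames                                 # x is the generated x_0
```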
| Metric | Value | Comment |
|---|---|---|
| Final training loss | ~0.012 | Monotonic decrease from ~1.0; no instability |
| Reverse-pass reconstruction MSE | < 0.01 | vs training image, after 100-step ancestral sampling |
| Sampling wallclock | ~0.8 s on M2 CPU | 100 forward passes through the 3 M-param MLP |
| Training wallclock | ~5 min for 200 epochs | CPU only; no GPU needed at this scale |
| Visual result | Recognisable red square | Confirms the architecture is mechanically correct |
The reconstruction is uninteresting as a generative result — a one-image generator can only memorise — but the diagnostic signals confirm the implementation is correct: monotonic loss decrease (the network is learning the noise prediction), recognisable red square in the reverse-pass output (the schedule and the variance scaling are wired correctly), and a forward-noise / reverse-denoise trajectory that is visually inspectable at every step.
The reverse-pass intermediate states x_T, x_{T-1}, …, x_0 are saved at every step and visualised as a 100-frame trajectory.
The trajectory is the operational version of "denoising" — it makes concrete what the schedule and the reverse-step formula achieve. Reading the DDPM paper without watching this trajectory leaves a gap that no amount of equation-reading fills.
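One way to render that trajectory, assuming matplotlib, the sample sketch above, and pixel values in [0, 1] (the actual normalisation used in the thesis implementation is not stated here):

```python
import matplotlib.pyplot as plt

x0_hat, frames = sample(model)                            # frames[0] = x_T, frames[-1] = x_0
fig, axes = plt.subplots(1, 11, figsize=(22, 2))
for ax, frame in zip(axes, frames[::10]):                 # every 10th state of the reverse pass
    img = frame.view(16, 16, 3).clamp(0.0, 1.0).numpy()   # back to a 16 x 16 RGB image
    ax.imshow(img)
    ax.axis("off")
plt.show()
```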
The 0.8 s ancestral-sampling cost on a 3 M-parameter MLP is the seed of an observation that becomes load-bearing later in the thesis line. Naively scaling to the 86 M-parameter JiT ViT at 256² resolution (the Topic-27 reproduction [5]) — a ~29 × parameter scale-up plus a ~256 × pixel-count scale-up — puts the DDPM sampling cost in the seconds-to-minutes regime per generation. For an interactive use case the cost is unaffordable.
The two ways out are (i) reduce sampling-step count by switching to a denoiser parameterisation that gives competitive quality at fewer steps (flow matching [3], LCM [2]), and (ii) reduce per-step cost by switching the backbone to a linear-time architecture (Mamba [4]). Both decisions in the thesis line are seeded by the cost observation made on the red-square toy here.
Same as the transformer-from-scratch counterpart [7]: for any architectural class the thesis line will use as a load-bearing component, build a 100-line from-scratch version on a toy problem first. For diffusion the toy is "one image and 100 schedule steps". For transformers the toy is "one short sentence and an untrained block". In both cases the cost is a few hours and the payback is faster paper reading and faster debugging across the year-scale thesis work.
A minimal DDPM was implemented from first principles and trained on a single 16 × 16 red-square image. The reverse-pass ancestral sampling reconstructs the training image with MSE < 0.01. The contribution is the documented from-scratch path and the sampling-cost observation that motivates the flow-matching and Mamba switches in the later thesis-line work.