Research Timeline · Aditya Jain / Apple Maps · 3D Reconstruction
Topic 28 · Nov 2025 · x-Prediction · Manifold Hypothesis · Paper Study

"Back to Basics" —
Let Denoising Models Denoise.

Paper-study session on Li & He's "Back to Basics: Let Denoising Generative Models Denoise" (Nov 2025, MIT). The argument: modern diffusion models don't actually denoise — they predict ε (noise) or v (velocity) — but they should predict x (the clean image), because clean data lives on a low-dimensional manifold while noised quantities do not. The manifold hypothesis explains why ε-prediction collapses catastrophically at high dimensions while x-prediction stays stable. Directly informs JiT (Topic 27, x-prediction ViT diffusion) and the MambaFlow3D choice (Topic 26, x-prediction adapted to 3-D sparse-cube tokens).

00 — Motivation

Why does x-prediction matter for 3-D geometry generation?

The MambaFlow3D and Hexplane AE thesis lines both involve high-dimensional 3-D representations (triplane features, sparse-cube tokens, hexplane stacks). The Li & He paper provides the theoretical reason to prefer x-prediction over ε-prediction in that regime: natural geometry lies on a low-dimensional manifold; Gaussian noise does not. A network that learns to predict the clean, manifold-respecting output learns a geometrically valid mapping. A network that learns to predict the noise has to model an artifact with no manifold structure — and in high dimensions, that becomes unstable.

The paper's headline result: ε-prediction's FID blows up as dimension grows on high-dimensional benchmarks, while x-prediction stays stable. The implication for the thesis line is that any 3-D generator working over triplane / hexplane / sparse-cube tokens should default to x-prediction (or velocity prediction, which is close). JiT (Topic 27) is the empirical consequence of this study — the consumer-GPU JiT reproduction uses x-prediction precisely because the analysis here predicted it would be more stable at the ViT-on-ImageNet-256 scale.

01 — The Three Prediction Targets

ε vs v vs x — same architecture, different optima.

Target        | What the network predicts          | Loss         | Lives on a manifold?
ε-prediction  | Gaussian noise that was added      | ‖ε − ε̂‖²     | No — pure Gaussian, full-rank
v-prediction  | Rotated combination v = α ε − σ x₀ | ‖v − v̂‖²     | Mixed — partly manifold, partly noise
x-prediction  | Clean image / clean geometry       | ‖x₀ − x̂₀‖²   | Yes — natural data manifold

The three targets are mathematically equivalent in the sense that each can be derived from the others given the noise schedule. They are not equivalent in training behaviour, because each has a different loss surface and different gradient properties. The paper's empirical claim is that in high-dimensional regimes (large images, sparse-cube tokens, triplane features), x-prediction's loss surface is the only one that stays smooth as dimension grows.
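The equivalence between the targets can be made concrete. A minimal NumPy sketch, assuming a variance-preserving schedule with α² + σ² = 1 (the cosine parameterisation below is an illustrative choice, not the paper's): given x_t = α x₀ + σ ε and v = α ε − σ x₀, a perfect prediction of any one target recovers the same x₀.

```python
import numpy as np

def schedule(t):
    """Variance-preserving schedule, alpha^2 + sigma^2 = 1 (cosine choice is illustrative)."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def to_targets(x0, eps, t):
    alpha, sigma = schedule(t)
    x_t = alpha * x0 + sigma * eps   # noised sample the network sees
    v = alpha * eps - sigma * x0     # v-prediction target
    return x_t, v

def x0_from_eps(x_t, eps_hat, t):
    alpha, sigma = schedule(t)
    return (x_t - sigma * eps_hat) / alpha

def x0_from_v(x_t, v_hat, t):
    alpha, sigma = schedule(t)
    return alpha * x_t - sigma * v_hat

# Sanity check: with perfect predictions, all three targets give the same x0.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
t = 0.5
x_t, v = to_targets(x0, eps, t)
assert np.allclose(x0_from_eps(x_t, eps, t), x0)
assert np.allclose(x0_from_v(x_t, v, t), x0)
```

The equivalence holds only for exact predictions; the point of the paper is that a network's *errors* in each parameterisation behave very differently.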

Core Argument

Clean data lives on a manifold.
Noised data does not. Predict the manifold.

The manifold hypothesis says that natural images, natural geometry, and natural physical systems concentrate on low-dimensional manifolds embedded in their high-dimensional ambient space. ε-prediction asks the network to model a full-rank Gaussian, which has no manifold structure. The network's capacity is wasted modelling noise. x-prediction asks the network to model the manifold directly — every parameter contributes to representing valid outputs. At high dimensions, the difference is the difference between FID 5 and FID 500.
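A complementary, schedule-level way to see the instability: recovering x̂₀ = (x_t − σ ε̂)/α from a predicted ε̂ divides any prediction error by α, so in x₀-space an ε-error is amplified by σ/α, which diverges as t → 1. A small sketch (cosine schedule assumed for illustration):

```python
import numpy as np

# x_t = alpha*x0 + sigma*eps; recovering x0 from a predicted eps_hat:
#   x0_hat = (x_t - sigma*eps_hat) / alpha
# so an error delta in eps_hat becomes -(sigma/alpha)*delta in x0_hat.
for t in (0.1, 0.5, 0.9, 0.99):
    alpha = np.cos(0.5 * np.pi * t)
    sigma = np.sin(0.5 * np.pi * t)
    print(f"t = {t:4}: eps-error amplification sigma/alpha = {sigma/alpha:8.2f}")
```

The amplification factor is ≈ 0.16 at t = 0.1 but grows without bound near t = 1, which is exactly the high-noise regime where ε-prediction is asked to model a full-rank Gaussian.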

02 — Applies to 3-D Representations

Same principle, same expected speed-up, for VDB / SDF / triplane.

The paper's empirical evidence is for images. The thesis-line question raised in the session was whether the same principle applies to 3-D geometric representations — sparse voxels (VDB / FVDB), SDF, triplanes / hexplanes. The answer is yes, and in some ways more strongly than for images:

Representation               | Manifold structure                                           | Expected x-prediction benefit
Sparse voxels (VDB / FVDB)   | Strong — coherent sparsity patterns over mostly-empty space  | x-prediction directly predicts active voxels; ε-prediction wastes capacity on empty voxels
Signed distance fields (SDF) | Very strong — valid SDFs satisfy ‖∇SDF‖ ≈ 1 everywhere       | x-prediction preserves the SDF property; ε-prediction destroys it
Triplane / hexplane features | Strong — projection from 3-D geometry onto 2-D planes is non-random | Same stability gains as images; the planes are still high-dim images
Point clouds (XYZ + attr)    | Strong — points sit on the surface manifold                  | x-prediction in the latent space is what MambaFlow3D (Topic 26) does
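The eikonal property in the SDF row can be checked numerically. A minimal sketch, using a hypothetical exact sphere SDF and finite differences (both are illustrative stand-ins, not part of the paper):

```python
import numpy as np

def sphere_sdf(p, radius=1.0):
    """Exact signed distance to a sphere at the origin; p has shape (N, 3)."""
    return np.linalg.norm(p, axis=-1) - radius

def grad_norm(sdf, p, h=1e-4):
    """Central-difference gradient magnitude of sdf at points p."""
    g = np.stack([(sdf(p + h * e) - sdf(p - h * e)) / (2 * h)
                  for e in np.eye(3)], axis=-1)
    return np.linalg.norm(g, axis=-1)

rng = np.random.default_rng(0)
pts = rng.uniform(-2, 2, size=(1000, 3))
pts = pts[np.linalg.norm(pts, axis=-1) > 0.2]  # avoid the centre, where the gradient is undefined
# Valid SDFs satisfy the eikonal property ||grad SDF|| = 1 away from the medial axis.
assert np.allclose(grad_norm(sphere_sdf, pts), 1.0, atol=1e-3)
```

A Gaussian perturbation of a valid SDF breaks this unit-gradient constraint almost everywhere, which is the precise sense in which ε-prediction "destroys" the SDF property while x-prediction can learn to respect it.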

Interactive Demo

The page's interactive demo (not reproducible here) compares ε vs v vs x prediction at different noise levels. The left pane shows the clean target signal x₀ (a 1-D sine for visualisation), the middle pane the noisy observed signal at step t (default t = 0.5), and the right pane the three reconstructions each prediction target would produce at that step, with a noise simulation showing how each fares — only the x reconstruction stays on the manifold.
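The demo's comparison can be approximated offline. A toy sketch, assuming a cosine variance-preserving schedule and giving each predictor the *same* small Gaussian error (the noise level t = 0.9 and error scale 0.05 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
grid = np.linspace(0, 2 * np.pi, n)
x0 = np.sin(grid)                 # clean 1-D signal (the demo's SINE target)
eps = rng.standard_normal(n)

t = 0.9                           # high-noise step, where the targets diverge most
alpha, sigma = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
x_t = alpha * x0 + sigma * eps
v = alpha * eps - sigma * x0

# Same prediction error delta for every target; compare the damage in x0-space.
delta = 0.05 * rng.standard_normal(n)
rec_eps = (x_t - sigma * (eps + delta)) / alpha   # x0 error: (sigma/alpha) * delta
rec_v = alpha * x_t - sigma * (v + delta)         # x0 error: sigma * delta
rec_x = x0 + delta                                # x0 error: delta

for name, rec in [("eps", rec_eps), ("v", rec_v), ("x", rec_x)]:
    rmse = np.sqrt(np.mean((rec - x0) ** 2))
    print(f"{name}-prediction: x0 RMSE = {rmse:.4f}")
```

At high t the ε reconstruction's error is inflated by σ/α (≈ 6× at t = 0.9) while the v and x reconstructions keep the error at roughly its original scale — the 1-D analogue of the demo's "x is on the manifold" panel.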

Full Technical Paper

White paper · manifold-hypothesis analysis · x-prediction extension to 3-D representations · thesis-line architecture decision

Related Thesis Chapters
JiT Diffusion — ImageNet-256
Direct empirical consequence of this paper analysis — the JiT reproduction uses x-prediction because the analysis here predicted it would be more stable at ViT scale.
MambaFlow3D
Applies the x-prediction principle to 3-D sparse-cube tokens via flow matching's velocity prediction (close cousin of x-prediction).
MNIST Flow Validation
Companion empirical study on the same prediction-target choice, restricted to MNIST scale where ε vs v vs x differences are smaller.
Appendix — Raw Materials
Transcripts & Source References
Restricted Access