Paper-study session on Li & He's "Back to Basics: Let Denoising Generative Models Denoise" (Nov 2025, MIT). The argument: modern diffusion models don't actually denoise; they predict ε (noise) or v (velocity). They should instead predict x (the clean image), because clean data lives on a low-dimensional manifold while the noised quantities do not. The manifold hypothesis explains why ε-prediction collapses catastrophically at high dimensions while x-prediction stays stable. Directly informs JiT (Topic 27, x-prediction ViT diffusion) and the MambaFlow3D choice (Topic 26, x-prediction adapted to 3-D sparse-cube tokens).
The MambaFlow3D and Hexplane AE thesis-line work both involve high-dimensional 3-D representations (triplane features, sparse-cube tokens, hexplane stacks). The Li & He paper provides the theoretical reason to prefer x-prediction over ε-prediction in that regime: natural geometry lies on a low-dimensional manifold; Gaussian noise does not. A network trained to predict the clean, manifold-respecting output learns a geometrically valid mapping. A network trained to predict the noise has to model an artifact with no manifold structure, and in high dimensions that objective becomes unstable.
The paper's headline result: ε-prediction's FID grows exponentially with dimension on high-dimensional benchmarks, while x-prediction stays stable. The implication for the thesis line is that any 3-D generator working over triplane / hexplane / sparse-cube tokens should default to x-prediction (or v-prediction, which mixes a clean-data term into the target and behaves similarly); a sampling-step sketch follows below. JiT (Topic 27) is the empirical consequence of this study: the consumer-GPU JiT reproduction uses x-prediction precisely because the analysis here predicts it to be more stable at ViT-on-ImageNet-256 scale.
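To make that default concrete, here is a minimal sketch of one deterministic DDIM-style sampling step driven by an x-prediction network. `model` and the `alpha` / `sigma` schedule arrays are hypothetical placeholders, not JiT's or the paper's actual interfaces.

```python
import torch

# Sketch: one deterministic DDIM-style step with an x-prediction network.
# Assumes x_t = alpha[t] * x0 + sigma[t] * eps; `model` returns x0_hat.
@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alpha, sigma):
    x0_hat = model(x_t, t)                            # network predicts clean x0
    eps_hat = (x_t - alpha[t] * x0_hat) / sigma[t]    # noise implied by x0_hat
    return alpha[t_prev] * x0_hat + sigma[t_prev] * eps_hat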
| Target | What network predicts | Loss | Lives on a manifold? |
|---|---|---|---|
| ε-prediction | Gaussian noise that was added | ‖ε − ε̂‖² | No — pure Gaussian, full-rank |
| v-prediction | Rotated combination v = α ε − σ x₀ | ‖v − v̂‖² | Mixed — partly manifold, partly noise |
| x-prediction | Clean image / clean geometry | ‖x₀ − x̂₀‖² | Yes — natural data manifold |
The three targets are mathematically equivalent in the sense that, given the noise schedule, each can be derived from the others. They are not equivalent in training behaviour, because each induces a different loss surface with different gradient properties. The paper's empirical claim is that in high-dimensional regimes (large images, sparse-cube tokens, triplane features), x-prediction's loss surface is the only one that stays smooth as dimension grows.
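As a sanity check on that equivalence, the conversions are one-liners given the schedule. A minimal sketch, assuming the usual parameterisation xₜ = a·x₀ + s·ε with a² + s² = 1 and v = a·ε − s·x₀; all names are illustrative, and the functions work on torch tensors or numpy arrays alike.

```python
# Conversions between prediction targets under x_t = a*x0 + s*eps,
# with a^2 + s^2 = 1 and v = a*eps - s*x0 (names are placeholders).
def x0_from_eps(x_t, eps_hat, a, s):
    return (x_t - s * eps_hat) / a     # invert the forward noising

def x0_from_v(x_t, v_hat, a, s):
    return a * x_t - s * v_hat         # uses a^2 + s^2 = 1

def eps_from_x0(x_t, x0_hat, a, s):
    return (x_t - a * x0_hat) / s      # noise implied by a clean estimate
```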
Clean data lives on a manifold. Noised data does not. Predict the manifold.
The manifold hypothesis says that natural images, natural geometry, and natural physical systems concentrate on low-dimensional manifolds embedded in their high-dimensional ambient space. ε-prediction asks the network to model a full-rank Gaussian, which has no manifold structure, so the network's capacity is wasted modelling noise. x-prediction asks the network to model the manifold directly: every parameter contributes to representing valid outputs. At high dimensions, this is the difference between FID 5 and FID 500.
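A toy illustration of the rank argument (my construction, not the paper's experiment): sample data from a k-dimensional linear "manifold" inside a D-dimensional ambient space, add unit Gaussian noise, and count the principal components needed to explain 99% of the variance.

```python
import numpy as np

# Toy illustration: clean data on a k-dim subspace vs the same data noised.
rng = np.random.default_rng(0)
D, k, n = 1024, 8, 4096                       # ambient dim, manifold dim, samples
basis = rng.standard_normal((D, k))
clean = rng.standard_normal((n, k)) @ basis.T # rank-k "natural data"
noised = clean + rng.standard_normal((n, D))  # full-rank after noising

def effective_rank(x, thresh=0.99):
    # Number of principal components needed to explain `thresh` of the variance.
    s = np.linalg.svd(x - x.mean(0), compute_uv=False) ** 2
    return int(np.searchsorted(np.cumsum(s) / s.sum(), thresh) + 1)

print(effective_rank(clean))   # ~8: the manifold dimension
print(effective_rank(noised))  # hundreds of components, approaching D
```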
The paper's empirical evidence is for images. The thesis-line question raised in the session was whether the same principle applies to 3-D geometric representations: sparse voxels (VDB / FVDB), SDFs, and triplanes / hexplanes. The answer is yes, and in some ways more strongly than for images (a training-step sketch follows the table):
| Representation | Manifold structure | Expected x-prediction benefit |
|---|---|---|
| Sparse voxels (VDB / FVDB) | Strong — coherent sparsity patterns over mostly-empty space | x-prediction directly predicts active voxels; ε-prediction wastes capacity on empty voxels |
| Signed distance fields (SDF) | Very strong — valid SDFs satisfy ‖∇SDF‖ ≈ 1 everywhere | x-prediction preserves the SDF property; ε-prediction destroys it |
| Triplane / hexplane features | Strong — projection from 3-D geometry onto 2-D planes is non-random | Same stability gains as images; the planes are still high-dim images |
| Point clouds (XYZ + attr) | Strong — points sit on the surface manifold | x-prediction in the latent space is what MambaFlow3D (Topic 26) does |
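In all four cases the training objective is the same. Below is a minimal x-prediction training step over generic latent tokens, assuming a `model(x_t, t)` interface; this is a sketch of the pattern, not MambaFlow3D's actual code.

```python
import torch

def x_prediction_loss(model, tokens, t, alpha_t, sigma_t):
    """One x-prediction training step over latent 3-D tokens of shape (B, N, C).

    alpha_t / sigma_t: per-sample schedule values broadcastable to (B, 1, 1).
    """
    eps = torch.randn_like(tokens)
    x_t = alpha_t * tokens + sigma_t * eps  # forward noising
    x0_hat = model(x_t, t)                  # network predicts clean tokens
    return ((tokens - x0_hat) ** 2).mean()  # the ||x0 - x0_hat||^2 objective
```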
Demo: compare ε vs x vs v prediction at different noise levels. The left pane is the clean target (a 1-D signal for visualisation), the middle pane is the noisy observed signal at step t, and the right pane shows the three reconstructions each prediction target would produce at that step, with a simulation of small prediction errors showing how each target fares.
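A runnable stand-in for that demo (assumptions: a sine wave as the clean 1-D signal, a late high-noise step, and the same small Gaussian error injected into each oracle target before converting back to x̂₀):

```python
import numpy as np

# Toy 1-D comparison (my construction, not the paper's demo): inject the same
# small error into each oracle target, convert back to x0, measure the damage.
rng = np.random.default_rng(1)
n = 256
x0 = np.sin(np.linspace(0, 4 * np.pi, n))  # clean 1-D "image"
a, s = 0.1, np.sqrt(1 - 0.1 ** 2)          # late step: mostly noise
eps = rng.standard_normal(n)
x_t = a * x0 + s * eps
v = a * eps - s * x0

err = 0.05 * rng.standard_normal(n)        # identical error for all three targets
rec_eps = (x_t - s * (eps + err)) / a      # error amplified by s / a
rec_v = a * x_t - s * (v + err)            # error scaled by s
rec_x = x0 + err                           # error passes through unchanged

for name, rec in [("eps", rec_eps), ("v", rec_v), ("x", rec_x)]:
    print(f"{name}: RMSE {np.sqrt(np.mean((rec - x0) ** 2)):.3f}")
```

At this high-noise step σ/α ≈ 10, so the identical prediction error costs ε-prediction roughly ten times the reconstruction error of x- or v-prediction; at low-noise steps the amplification reverses, which is why the comparison has to be run across noise levels as the demo describes.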
White paper · manifold-hypothesis analysis · x-prediction extension to 3-D representations · thesis-line architecture decision