Paper-study session on Li & He's "Back to Basics: Let Denoising Generative Models Denoise" (Nov 2025, MIT). The argument: modern diffusion models don't actually denoise; they predict ε (noise) or v (velocity). They should instead predict x (the clean image), because clean data lives on a low-dimensional manifold while the noised quantities do not. The manifold hypothesis explains why ε-prediction collapses catastrophically at high dimensions while x-prediction stays stable. Directly informs JiT (Topic 27, x-prediction ViT diffusion) and the MambaFlow3D choice (Topic 26, x-prediction adapted to 3-D sparse-cube tokens).
The MambaFlow3D and Hexplane AE thesis-line work both involve high-dimensional 3-D representations (triplane features, sparse-cube tokens, hexplane stacks). The Li & He paper provides the theoretical reason to prefer x-prediction over ε-prediction in that regime: natural geometry lies on a low-dimensional manifold; Gaussian noise does not. A network trained to predict the clean, manifold-respecting output learns a geometrically valid mapping. A network trained to predict the noise has to model an artifact with no manifold structure, and in high dimensions that objective becomes unstable.
The paper's headline result: ε-prediction's FID grows exponentially with dimension on high-dimensional benchmarks, while x-prediction stays stable. The implication for the thesis line is that any 3-D generator working over triplane / hexplane / sparse-cube tokens should default to x-prediction (or v-prediction, which mixes a clean-data term into the target and behaves similarly); a sampling-step sketch follows below. JiT (Topic 27) is the empirical consequence of this study: the consumer-GPU JiT reproduction uses x-prediction precisely because the analysis here predicts it to be more stable at ViT-on-ImageNet-256 scale.
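To make that default concrete, here is a minimal sketch of one deterministic DDIM-style sampling step driven by an x-prediction network. `model` and the `alpha` / `sigma` schedule arrays are hypothetical placeholders, not JiT's or the paper's actual interfaces.

```python
import torch

# Sketch: one deterministic DDIM-style step with an x-prediction network.
# Assumes x_t = alpha[t] * x0 + sigma[t] * eps; `model` returns x0_hat.
@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alpha, sigma):
    x0_hat = model(x_t, t)                            # network predicts clean x0
    eps_hat = (x_t - alpha[t] * x0_hat) / sigma[t]    # noise implied by x0_hat
    return alpha[t_prev] * x0_hat + sigma[t_prev] * eps_hat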
| Target | What network predicts | Loss | Lives on a manifold? |
|---|---|---|---|
| ε-prediction | Gaussian noise that was added | ‖ε − ε̂‖² | No — pure Gaussian, full-rank |
| v-prediction | Rotated combination v = α ε − σ x₀ | ‖v − v̂‖² | Mixed — partly manifold, partly noise |
| x-prediction | Clean image / clean geometry | ‖x₀ − x̂₀‖² | Yes — natural data manifold |
The three targets are mathematically equivalent in the sense that, given the noise schedule, each can be derived from the others. They are not equivalent in training behaviour, because each induces a different loss surface with different gradient properties. The paper's empirical claim is that in high-dimensional regimes (large images, sparse-cube tokens, triplane features), x-prediction's loss surface is the only one that stays smooth as dimension grows.
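As a sanity check on that equivalence, the conversions are one-liners given the schedule. A minimal sketch, assuming the usual parameterisation xₜ = a·x₀ + s·ε with a² + s² = 1 and v = a·ε − s·x₀; all names are illustrative, and the functions work on torch tensors or numpy arrays alike.

```python
# Conversions between prediction targets under x_t = a*x0 + s*eps,
# with a^2 + s^2 = 1 and v = a*eps - s*x0 (names are placeholders).
def x0_from_eps(x_t, eps_hat, a, s):
    return (x_t - s * eps_hat) / a     # invert the forward noising

def x0_from_v(x_t, v_hat, a, s):
    return a * x_t - s * v_hat         # uses a^2 + s^2 = 1

def eps_from_x0(x_t, x0_hat, a, s):
    return (x_t - a * x0_hat) / s      # noise implied by a clean estimate
```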
Clean data lives on a manifold. Noised data does not. Predict the manifold.
The manifold hypothesis says that natural images, natural geometry, and natural physical systems concentrate on low-dimensional manifolds embedded in their high-dimensional ambient space. ε-prediction asks the network to model a full-rank Gaussian, which has no manifold structure, so the network's capacity is wasted modelling noise. x-prediction asks the network to model the manifold directly: every parameter contributes to representing valid outputs. At high dimensions, this is the difference between FID 5 and FID 500.
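A toy illustration of the rank argument (my construction, not the paper's experiment): sample data from a k-dimensional linear "manifold" inside a D-dimensional ambient space, add unit Gaussian noise, and count the principal components needed to explain 99% of the variance.

```python
import numpy as np

# Toy illustration: clean data on a k-dim subspace vs the same data noised.
rng = np.random.default_rng(0)
D, k, n = 1024, 8, 4096                       # ambient dim, manifold dim, samples
basis = rng.standard_normal((D, k))
clean = rng.standard_normal((n, k)) @ basis.T # rank-k "natural data"
noised = clean + rng.standard_normal((n, D))  # full-rank after noising

def effective_rank(x, thresh=0.99):
    # Number of principal components needed to explain `thresh` of the variance.
    s = np.linalg.svd(x - x.mean(0), compute_uv=False) ** 2
    return int(np.searchsorted(np.cumsum(s) / s.sum(), thresh) + 1)

print(effective_rank(clean))   # ~8: the manifold dimension
print(effective_rank(noised))  # hundreds of components, approaching D
```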
The paper's empirical evidence is for images. The thesis-line question raised in the session was whether the same principle applies to 3-D geometric representations: sparse voxels (VDB / FVDB), SDFs, and triplanes / hexplanes. The answer is yes, and in some ways more strongly than for images (a training-step sketch follows the table):
| Representation | Manifold structure | Expected x-prediction benefit |
|---|---|---|
| Sparse voxels (VDB / FVDB) | Strong — coherent sparsity patterns over mostly-empty space | x-prediction directly predicts active voxels; ε-prediction wastes capacity on empty voxels |
| Signed distance fields (SDF) | Very strong — valid SDFs satisfy ‖∇SDF‖ ≈ 1 everywhere | x-prediction preserves the SDF property; ε-prediction destroys it |
| Triplane / hexplane features | Strong — projection from 3-D geometry onto 2-D planes is non-random | Same stability gains as images; the planes are still high-dim images |
| Point clouds (XYZ + attr) | Strong — points sit on the surface manifold | x-prediction in the latent space is what MambaFlow3D (Topic 26) does |
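In all four cases the training objective is the same. Below is a minimal x-prediction training step over generic latent tokens, assuming a `model(x_t, t)` interface; this is a sketch of the pattern, not MambaFlow3D's actual code.

```python
import torch

def x_prediction_loss(model, tokens, t, alpha_t, sigma_t):
    """One x-prediction training step over latent 3-D tokens of shape (B, N, C).

    alpha_t / sigma_t: per-sample schedule values broadcastable to (B, 1, 1).
    """
    eps = torch.randn_like(tokens)
    x_t = alpha_t * tokens + sigma_t * eps  # forward noising
    x0_hat = model(x_t, t)                  # network predicts clean tokens
    return ((tokens - x0_hat) ** 2).mean()  # the ||x0 - x0_hat||^2 objective
```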
Demo: compare ε vs x vs v prediction at different noise levels. The left pane is the clean target (a 1-D signal for visualisation), the middle pane is the noisy observed signal at step t, and the right pane shows the three reconstructions each prediction target would produce at that step, with a simulation of small prediction errors showing how each target fares.
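A runnable stand-in for that demo (assumptions: a sine wave as the clean 1-D signal, a late high-noise step, and the same small Gaussian error injected into each oracle target before converting back to x̂₀):

```python
import numpy as np

# Toy 1-D comparison (my construction, not the paper's demo): inject the same
# small error into each oracle target, convert back to x0, measure the damage.
rng = np.random.default_rng(1)
n = 256
x0 = np.sin(np.linspace(0, 4 * np.pi, n))  # clean 1-D "image"
a, s = 0.1, np.sqrt(1 - 0.1 ** 2)          # late step: mostly noise
eps = rng.standard_normal(n)
x_t = a * x0 + s * eps
v = a * eps - s * x0

err = 0.05 * rng.standard_normal(n)        # identical error for all three targets
rec_eps = (x_t - s * (eps + err)) / a      # error amplified by s / a
rec_v = a * x_t - s * (v + err)            # error scaled by s
rec_x = x0 + err                           # error passes through unchanged

for name, rec in [("eps", rec_eps), ("v", rec_v), ("x", rec_x)]:
    print(f"{name}: RMSE {np.sqrt(np.mean((rec - x0) ** 2)):.3f}")
```

At this high-noise step σ/α ≈ 10, so the identical prediction error costs ε-prediction roughly ten times the reconstruction error of x- or v-prediction; at low-noise steps the amplification reverses, which is why the comparison has to be run across noise levels as the demo describes.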
White paper · manifold-hypothesis analysis · x-prediction extension to 3-D representations · thesis-line architecture decision