Technical Analysis · cs.LG · Nov 2025
Manifold-Aware Diffusion Targets: An Analysis of Li & He's "Back to Basics" x-Prediction Result and Its Extension to 3-D Geometric Representations
Aaditya Jain
Diffusion Models · Prediction-Target Analysis · Thesis-Line Architecture Decision
Submitted: November 2025
Subject: cs.LG · cs.CV · cs.GR
Keywords: x-prediction, ε-prediction, manifold hypothesis, diffusion target choice, 3-D representations, SDF, sparse voxels, triplane
Abstract
We analyse the Li & He "Back to Basics: Let Denoising Generative Models Denoise" preprint [1] (Nov 2025) and extend its central manifold-hypothesis argument from natural images to the 3-D geometric representations the thesis line relies on as load-bearing components. The paper's claim: modern diffusion models predict ε (noise) or v (velocity), but they should predict x (clean image) because clean data lies on a low-dimensional manifold while Gaussian noise does not. The paper's empirical evidence: at high image resolution, ε-prediction FID grows exponentially with dimension while x-prediction (via the JiT architecture) stays stable. We accept the paper's premise and argue that the manifold-hypothesis argument applies more strongly to 3-D geometric representations than to natural images, because valid 3-D geometry has stronger manifold structure than valid natural images: valid SDFs satisfy ‖∇SDF‖ ≈ 1 everywhere, valid sparse voxel grids have coherent surface-occupancy patterns, valid triplane features have non-random projection structure, and valid point clouds sit on the surface manifold. We derive the architecture-decision implications for the thesis-line topics that use diffusion or flow-matching heads — JiT consumer-GPU reproduction [2], MambaFlow3D 3-D scaling [3], MNIST flow-matching backbone validation [4], and polyline-diffusion design study [5] — and report the architectural choices made on the basis of this analysis: x-prediction or flow-matching velocity prediction throughout, never ε-prediction. The contribution is the analysis itself and the documented architecture-decision trail that makes the choices traceable.
1. Introduction

The diffusion-target choice — whether the denoiser network predicts ε (the noise added to the clean image), v (a rotated velocity-like combination of x and ε), or x (the clean image directly) — has historically been treated as a hyperparameter. The three targets are mathematically equivalent in the sense that each can be derived from the others given the noise schedule. They are not equivalent in training behaviour, because each has a different loss surface and different gradient properties.

Li & He [1] argue the choice is not a hyperparameter: ε-prediction is fundamentally worse than x-prediction at high dimensionality, and the reason is the manifold hypothesis. Natural data lies on a low-dimensional manifold; Gaussian noise does not; the network's capacity is better spent predicting the manifold-respecting clean signal than the manifold-free noise. The paper's empirical signature: at high resolution, ε-prediction FID grows exponentially with image dimension while x-prediction's stays bounded.

This paper does two things. First, it summarises the Li & He argument compactly (§2). Second, it argues that the manifold-hypothesis case for x-prediction is stronger for 3-D geometric representations than for natural images, because 3-D geometry has structural constraints (SDF gradient norm, voxel sparsity coherence, triplane projection structure) that natural images do not have (§3). The thesis-line architecture decisions consequent on this analysis are recorded in §4.

2. The "Back to Basics" Argument (Summary)
2.1 Forward process and the interconvertibility of targets

A diffusion forward process produces noised samples x_t = √α̅_t x_0 + √(1−α̅_t) ε, where α̅_t = ∏_{s≤t}(1 − β_s) and β_t is the noise schedule. The reverse process (the trained network) can be parameterised to predict any of three quantities — ε, v, or x — and given the noise schedule and the observed x_t, each of the three can be derived from the others:

x̂_0 = (x_t − √(1−α̅_t) ε̂) / √α̅_t
ε̂ = (x_t − √α̅_t x̂_0) / √(1−α̅_t)
v̂ = √α̅_t ε̂ − √(1−α̅_t) x̂_0

Mathematically the three are equivalent: a network that predicts one can be wrapped to predict the others. The reason the choice matters in practice is that the three parameterisations have different loss surfaces and different gradient properties during training. The optimisation problem the network actually solves depends on which target you optimise against, not on the target's mathematical equivalence to the others.
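These identities are easy to check numerically. A minimal NumPy sketch (variable names ours; the ᾱ_t value is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean sample" and noise in a flattened ambient space.
x0 = rng.normal(size=1024)
eps = rng.normal(size=1024)
a_bar = 0.37  # some ᾱ_t from the noise schedule, in (0, 1)

# Forward process: x_t = √ᾱ_t x_0 + √(1−ᾱ_t) ε
xt = np.sqrt(a_bar) * x0 + np.sqrt(1 - a_bar) * eps

# Each target recovered from another, given x_t and ᾱ_t.
x0_from_eps = (xt - np.sqrt(1 - a_bar) * eps) / np.sqrt(a_bar)
eps_from_x0 = (xt - np.sqrt(a_bar) * x0) / np.sqrt(1 - a_bar)
v = np.sqrt(a_bar) * eps - np.sqrt(1 - a_bar) * x0

# Round trips agree to numerical precision.
assert np.allclose(x0_from_eps, x0)
assert np.allclose(eps_from_x0, eps)
# v plus x_t also determine x_0: x_0 = √ᾱ_t x_t − √(1−ᾱ_t) v
assert np.allclose(np.sqrt(a_bar) * xt - np.sqrt(1 - a_bar) * v, x0)
```

The asserts pass for any ᾱ_t ∈ (0, 1), which is the interconvertibility claim: the choice of target changes the optimisation problem, not the information available.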

Table 1 — Three target parameterisations.
| Target | What the network predicts | Loss | Lives on a manifold? | Gradient at t → 0 | Gradient at t → T |
|---|---|---|---|---|---|
| ε-prediction | Gaussian noise ε | ‖ε − ε̂‖² | No — full-rank Gaussian | Pathological (x_t ≈ x_0, so ε carries no signal) | Easy (x_t ≈ ε) |
| v-prediction | v = √α̅_t ε − √(1−α̅_t) x_0 | ‖v − v̂‖² | Mixed — partial manifold | Stable | Stable |
| x-prediction | Clean image x_0 | ‖x_0 − x̂_0‖² | Yes — data manifold | Easy (x_t ≈ x_0) | Pathological at very high t |
2.2 The empirical signature

The paper's empirical claim is that ε-prediction's loss surface becomes pathological at high dimensions: the network must approximate a full-rank Gaussian, the gradient signal is small relative to the parameter-space size, and FID grows exponentially with dimension. x-prediction's loss surface stays smooth because the network is approximating a low-dimensional manifold, and the gradient signal is concentrated on the manifold's tangent space rather than spread across the ambient full-rank space.

Specifically, the paper reports that at ImageNet-64 the three targets behave comparably (FID within 10 % of each other across all three). At ImageNet-256 the gap widens — ε-prediction lags x-prediction by a factor of ~1.5–2× on FID. At ImageNet-512 the gap becomes catastrophic — ε-prediction's FID is 10× worse than x-prediction's. The trend is exponential in the image-dimension scale, not linear.

2.3 The manifold-hypothesis intuition

The manifold hypothesis says natural images sit on a low-dimensional manifold embedded in the high-dimensional pixel space. For ImageNet-class images the manifold dimension is estimated at ~100–1000 (the intrinsic dimensionality of the data distribution), embedded in a ~196 K-dim pixel space (256 × 256 × 3). The ratio of ambient to intrinsic dimensionality is ~200×.

A network that predicts the clean image is using its capacity to model a 100–1000-dim manifold. A network that predicts the noise is using its capacity to model the 196 K-dim full-rank Gaussian — the noise lives in the ambient space, not on the manifold. The capacity-allocation difference is ~200×, and the difference grows with the resolution scale because the ambient dimension grows quadratically while the manifold dimension grows only mildly. This is the structural source of the exponential FID gap.
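The arithmetic behind the ~200× figure, taking the intrinsic dimensionality at the upper end (~1000) of the estimate above:

```python
# Ambient pixel dimension vs. an assumed intrinsic manifold dimension.
# intrinsic = 1000 is the upper end of the ~100-1000 estimate in the text.
intrinsic = 1000
for res in (64, 256, 512):
    ambient = res * res * 3  # H x W x RGB pixel-space dimension
    print(res, ambient, round(ambient / intrinsic))
# Ambient dimension grows quadratically with resolution, so the
# ambient/intrinsic ratio grows with it: ~12x at 64, ~197x at 256, ~786x at 512.
```

The ratio itself scales quadratically with resolution, which is why the gap between the two capacity allocations widens rather than staying constant as resolution increases.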

3. Extension to 3-D Geometric Representations

The Li & He paper's evidence is for natural images. The thesis-line 3-D representations have a stronger manifold structure than natural images, which makes the x-prediction case stronger for them.

Table 2 — Manifold structure of 3-D representations.
| Representation | Manifold constraint | x-prediction benefit |
|---|---|---|
| Signed distance field (SDF) | ‖∇SDF‖ ≈ 1 everywhere; valid SDFs form a measure-zero subset of ℝ^(N³) | Strongest case: ε-prediction destroys the gradient-norm property; x-prediction preserves it |
| Sparse voxel grid (VDB / FVDB) | Active voxels form a thin surface band; ~99 % of the grid is empty | ε-prediction wastes capacity predicting noise for the empty bulk; x-prediction directly predicts the active voxels |
| Triplane / hexplane features | Projection from 3-D geometry onto 2-D feature planes is highly structured | Same argument as natural images, applied to the feature planes |
| Point cloud | Points sit on the surface manifold (2-D embedded in 3-D) | x-prediction in latent space (after PointNet++ encoding) preserves the manifold structure |

The SDF case is the strongest. A valid SDF must satisfy the eikonal equation ‖∇SDF‖ = 1 almost everywhere. This is a hard geometric constraint, not a soft prior. A network that predicts noise into an SDF produces an output that is not a valid SDF — the gradient-norm property is destroyed voxel-by-voxel. A network that predicts the clean SDF directly preserves the constraint as long as the network has learned to produce valid SDFs in the first place.
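The constraint, and its fragility under additive noise, can be seen numerically on a discretised analytic SDF. A sketch (grid size, radius, and noise level are illustrative):

```python
import numpy as np

# Eikonal check on a discretised sphere SDF: f(p) = ||p|| - r.
n, r = 64, 0.5
axis = np.linspace(-1.0, 1.0, n)
h = axis[1] - axis[0]                        # grid spacing
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
dist = np.sqrt(X**2 + Y**2 + Z**2)
sdf = dist - r                               # exact signed distance to the sphere

# Finite-difference gradient and its norm.
gx, gy, gz = np.gradient(sdf, h)
grad_norm = np.sqrt(gx**2 + gy**2 + gz**2)

# Away from the centre singularity, ||grad f|| ~= 1 (the eikonal constraint).
interior = dist > 0.1
print(grad_norm[interior].mean())            # close to 1.0

# Adding Gaussian noise destroys the constraint voxel-by-voxel.
noisy = sdf + 0.1 * np.random.default_rng(0).normal(size=sdf.shape)
ngx, ngy, ngz = np.gradient(noisy, h)
noisy_norm = np.sqrt(ngx**2 + ngy**2 + ngz**2)
print(noisy_norm[interior].mean())           # far above 1
```

The noisy field's gradient norm blows up because differentiating i.i.d. noise amplifies it by ~1/h per axis — which is exactly the sense in which a noise-corrupted field is no longer an SDF.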

The argument generalises: any 3-D representation with a hard geometric constraint (SDF gradient norm; sparse-voxel surface band; triplane consistency between the three planes) is structurally biased toward x-prediction over ε-prediction. The argument is stronger than for natural images precisely because natural images have only statistical regularities, while 3-D geometry has algebraic constraints.

4. Thesis-Line Architecture Decisions

The analysis above directly informs four downstream thesis-line topics.

Table 3 — Architecture decisions consequent on the manifold analysis.
| Topic | Decision | Why |
|---|---|---|
| JiT consumer-GPU reproduction [2] | x-prediction (matches reference paper) | Reference JiT already uses x-prediction; the analysis confirms this is the right choice at the ViT-B / ImageNet-256 scale |
| MambaFlow3D 3-D scaling [3] | Flow-matching velocity prediction | Flow-matching velocity prediction is the closest cousin of x-prediction; ε-prediction is ruled out |
| MNIST backbone validation [4] | Flow-matching across all three backbones | Validating on the same prediction target ensures the comparison isolates the backbone variable |
| Polyline diffusion design [5] | Flow-matching velocity prediction | Polyline coordinates have weaker manifold structure than SDFs but stronger than natural images; flow matching is the safest target |
5. Per-Representation Analysis
5.1 SDF — the strongest case

A valid signed-distance field f: ℝ³ → ℝ must satisfy the eikonal equation ‖∇f(x)‖ = 1 almost everywhere, with the surface defined as the zero level set {x : f(x) = 0}. The set of valid SDFs is a measure-zero subset of the function space of all scalar fields — it is a manifold of dimension equal to the surface complexity (effectively the number of independent surface parameters), embedded in the infinite-dimensional function space.

ε-prediction on SDFs trains the network to predict noise that, when subtracted from x_t, recovers the clean SDF. The training signal is in the noise space, which has no eikonal constraint. The network's output (the predicted noise) is unconstrained. Subtracting the predicted noise from x_t recovers an estimate of the clean SDF, but preservation of the eikonal constraint depends on the noise prediction being exact — any imperfection in the predicted noise destroys the eikonal constraint voxel-by-voxel.

x-prediction on SDFs trains the network to predict the clean SDF directly. The training signal is in the SDF space; the network's output is an SDF. Imperfections in the prediction shift the predicted SDF off the eikonal-constraint manifold, but the network is at least trying to land on the manifold. A regulariser (eikonal-loss term) can then enforce the constraint as a soft prior; this composes with x-prediction but not with ε-prediction.
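The composition claim can be made concrete. A minimal sketch of an x-prediction loss with a soft eikonal penalty (function name and the λ weight are ours; finite differences stand in for autograd):

```python
import numpy as np

def xpred_loss_with_eikonal(x0, x0_hat, h, lam=0.1):
    """x-prediction MSE plus a soft eikonal penalty on the *predicted* SDF.

    The penalty is only well-posed because the network output x0_hat is
    itself an SDF estimate; under e-prediction the output is noise, which
    has no gradient-norm constraint to regularise. (Sketch; lam illustrative.)
    """
    mse = np.mean((x0 - x0_hat) ** 2)
    gx, gy, gz = np.gradient(x0_hat, h)          # finite-difference gradient
    grad_norm = np.sqrt(gx**2 + gy**2 + gz**2)
    eikonal = np.mean((grad_norm - 1.0) ** 2)    # penalise ||grad|| != 1
    return mse + lam * eikonal
```

For a perfect prediction of a valid SDF both terms are ≈ 0; a prediction that drifts off the eikonal manifold is penalised even where its MSE is small.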

5.2 Sparse voxels — the bulk-vs-band case

A typical sparse voxel grid for a 3-D scene has ~99 % empty voxels and ~1 % active voxels concentrated in a thin band around the surface. ε-prediction trains the network to predict Gaussian noise across the entire grid, including the 99 % empty bulk where there is no useful signal to predict. The network's capacity is spent predicting noise for empty space. x-prediction trains the network to predict the clean sparse-occupancy pattern, where the 99 % empty bulk has a near-zero target value and the 1 % surface band has informative targets. The capacity allocation is implicitly biased toward the active band.
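The bulk-vs-band imbalance is easy to reproduce on a toy shell; a sketch (grid size, radius, and band half-width are illustrative):

```python
import numpy as np

# Active-voxel fraction for a thin spherical surface band in a dense grid.
n, r, band = 128, 0.5, 0.02               # grid size, radius, band half-width
axis = np.linspace(-1.0, 1.0, n)
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
dist = np.sqrt(X**2 + Y**2 + Z**2)
active = np.abs(dist - r) < band          # thin surface band of occupied voxels

frac = active.mean()
print(f"active fraction: {frac:.4f}")     # roughly 1-2% of the grid
```

Under ε-prediction the loss treats all n³ voxels identically; under x-prediction the target itself is near-zero on the ~98–99 % empty bulk, so the informative part of the loss is concentrated on the band.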

5.3 Triplane / hexplane features — the projection-structure case

Triplane features F_xy, F_xz, F_yz for a 3-D scene are produced by projection from the 3-D geometry onto three 2-D planes. The projection has structural regularities — adjacent pixels on a plane correspond to adjacent 3-D rays; the three planes are consistent with each other in the sense that a single 3-D feature is sampled by all three. ε-prediction does not exploit this structure (the predicted noise has no projection-consistency constraint); x-prediction implicitly preserves it (the predicted clean features are projections of the same underlying 3-D geometry).
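The inter-plane consistency constraint can be illustrated on a toy occupancy volume. A sketch, with max-projection standing in for the learned feature projection (this is not the actual triplane pipeline):

```python
import numpy as np

# Three planes as projections of ONE underlying occupancy volume:
# an occupied 3-D point must appear in all three planes.
rng = np.random.default_rng(0)
occ = np.zeros((32, 32, 32), dtype=bool)
pts = rng.integers(0, 32, size=(50, 3))    # 50 random occupied voxels
occ[pts[:, 0], pts[:, 1], pts[:, 2]] = True

F_xy = occ.max(axis=2)   # project along z -> indexed [i, j]
F_xz = occ.max(axis=1)   # project along y -> indexed [i, k]
F_yz = occ.max(axis=0)   # project along x -> indexed [j, k]

# Consistency: every occupied voxel is visible in all three planes.
assert all(F_xy[i, j] and F_xz[i, k] and F_yz[j, k] for i, j, k in pts)
```

Three independently noised planes satisfy no such joint constraint, which is the structural sense in which predicted noise ignores the projection structure while predicted clean features inherit it.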

5.4 Point clouds — the latent-space case

Point clouds sit on a 2-D surface manifold embedded in 3-D space. Raw point-cloud diffusion runs into the variable-cardinality problem (different scenes have different point counts); the standard fix is to encode the point cloud to a fixed-shape latent (via PointNet++ or a learned encoder) and diffuse in that latent space. The manifold-hypothesis argument applies to the latent space rather than to the raw points — x-prediction or v-prediction in the latent space respects the latent-manifold structure that the encoder has learned, while ε-prediction trains against latent-space noise that has no such structure.
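The variable-cardinality fix, in miniature: a PointNet-style symmetric max-pool maps any number of points to a fixed-shape latent. A sketch in which a random linear projection stands in for the learned per-point MLP (names and sizes ours):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 64))               # stand-in for a per-point MLP

def encode(points):
    """Permutation-invariant encoder: (N, 3) points -> (64,) latent, any N."""
    feats = np.maximum(points @ W, 0.0)    # per-point features, shape (N, 64)
    return feats.max(axis=0)               # symmetric max-pool over points

z_small = encode(rng.normal(size=(128, 3)))
z_large = encode(rng.normal(size=(4096, 3)))
print(z_small.shape, z_large.shape)        # both (64,): fixed latent shape
```

Diffusion or flow matching then runs in this fixed latent space, and the manifold argument applies to the latent the encoder has learned rather than to the raw, variable-cardinality point set.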

6. Caveat — What the Analysis Does Not Claim

Three honest caveats. (i) The Li & He paper's empirical evidence is for ImageNet-class natural images. The 3-D-representation extension above is an analytical argument, not yet validated empirically at scale. The Hexplane AE topic [6] provides one empirical data point (the binary-occupancy VAE failure has a similar manifold-mismatch cause) but a controlled ε-vs-x experiment at SparC3D-class scale is not yet run. (ii) At small scale (the MNIST validation in [4]) the differences between the three target choices are smaller — the manifold-hypothesis argument is asymptotic in dimensionality. (iii) v-prediction is not analytically distinguished from x-prediction here; both are "partially manifold-aware" and the empirical literature gives mixed results between them.

7. Conclusion

Li & He's manifold-hypothesis argument for x-prediction over ε-prediction generalises to 3-D geometric representations. The case is structurally stronger for 3-D because 3-D geometry has algebraic constraints (SDF gradient norm, sparse-voxel coherence, triplane projection structure) that natural images lack. The per-representation analysis (§5) gives the structural rationale; the thesis-line architecture decisions (§4) operationalise it. The architecture rule going forward is unambiguous: x-prediction or flow-matching velocity prediction throughout; ε-prediction never used as a load-bearing target in 3-D-diffusion work.

References
[1] Li, T., He, K. "Back to Basics: Let Denoising Generative Models Denoise." MIT preprint, Nov 2025.
[2] Jain, A. "Training JiT Diffusion on Two Consumer GPUs." Thesis research, Nov 2025. /whitepaper/jit-diffusion
[3] Jain, A. "MambaFlow3D: Spec, Speed-up Budget, and ModelNet10 Phase-2." Thesis research, Nov 2025. /whitepaper/mambaflow3d
[4] Jain, A. "MNIST Flow-Matching Backbone Validation." Thesis research, Nov 2025. /whitepaper/mnist-flow-validation
[5] Jain, A. "Diffusion for Houdini Polylines — Design Study." Thesis research, Nov 2025. /whitepaper/polyline-diffusion
[6] Jain, A. "When VAEs Meet Binary Geometry." Thesis research, Dec 2025. /whitepaper/hexplane-ae
[7] Salimans, T., Ho, J. "Progressive Distillation for Fast Sampling of Diffusion Models." ICLR, 2022. v-prediction.
[8] Karras, T. et al. "Elucidating the Design Space of Diffusion-Based Generative Models." NeurIPS, 2022. Survey of target choices.
[9] Lipman, Y. et al. "Flow Matching for Generative Modelling." ICLR, 2023. The velocity-prediction alternative used in the thesis line.