The diffusion-target choice — whether the denoiser network predicts ε (the noise added to the clean image), v (a rotated velocity-like combination of x and ε), or x (the clean image directly) — has historically been treated as a hyperparameter. The three targets are mathematically equivalent in the sense that each can be derived from the others given the noise schedule. They are not equivalent in training behaviour, because each has a different loss surface and different gradient properties.
Li & He [1] argue the choice is not a hyperparameter: ε-prediction is fundamentally worse than x-prediction at high dimensionality, and the reason is the manifold hypothesis. Natural data lies on a low-dimensional manifold; Gaussian noise does not; the network's capacity is better spent predicting the manifold-respecting clean signal than the manifold-free noise. The paper's empirical signature: at high resolution, ε-prediction FID grows exponentially with image dimension while x-prediction's stays bounded.
This paper does two things. First, it summarises the Li & He argument compactly (§2). Second, it argues that the manifold-hypothesis case for x-prediction is stronger for 3-D geometric representations than for natural images, because 3-D geometry has structural constraints (SDF gradient norm, voxel sparsity coherence, triplane projection structure) that natural images do not have (§3). The thesis-line architecture decisions consequent on this analysis are recorded in §4.
A diffusion forward process produces noised samples x_t = √α̅_t x_0 + √(1−α̅_t) ε, where α̅_t = ∏_{s≤t}(1 − β_s) and β_t is the noise schedule. The reverse process (the trained network) can be parameterised to predict any of three quantities — ε, v, or x — and given the noise schedule and the observed x_t, each of the three can be derived from the others:
x̂_0 = (x_t − √(1−α̅_t) ε̂) / √α̅_t

ε̂ = (x_t − √α̅_t x̂_0) / √(1−α̅_t)

v̂ = √α̅_t ε̂ − √(1−α̅_t) x̂_0

Mathematically the three are equivalent: a network that predicts one can be wrapped to predict the others. The reason the choice matters in practice is that the three parameterisations have different loss surfaces and different gradient properties during training. The optimisation problem the network actually solves depends on which target you optimise against, not on the target's mathematical equivalence to the others.
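The wrapping is mechanical. A minimal sketch of the three conversions above, assuming `alpha_bar_t` is a tensor broadcastable against the data (e.g. shape (B, 1, 1, 1)); the function names are illustrative, not taken from [1]:

```python
import torch

def to_x0(x_t, eps_hat, alpha_bar_t):
    """Clean-sample estimate from a predicted noise."""
    return (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_bar_t)

def to_eps(x_t, x0_hat, alpha_bar_t):
    """Noise estimate from a predicted clean sample."""
    return (x_t - torch.sqrt(alpha_bar_t) * x0_hat) / torch.sqrt(1.0 - alpha_bar_t)

def to_v(eps_hat, x0_hat, alpha_bar_t):
    """Velocity combination: v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    return torch.sqrt(alpha_bar_t) * eps_hat - torch.sqrt(1.0 - alpha_bar_t) * x0_hat
```

Because the conversions are exact, a sampler written against one target can consume a network trained on another; what the choice changes is the training problem, not the sampling algebra.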
| Target | What network predicts | Loss | Lives on a manifold? | Gradient at t → 0 | Gradient at t → T |
|---|---|---|---|---|---|
| ε-prediction | Gaussian noise ε | ‖ε − ε̂‖² | No — full-rank Gaussian | Pathological (x_t ≈ x_0, so ε has no signal) | Easy (x_t ≈ ε) |
| v-prediction | v = √α̅_t ε − √(1−α̅_t) x_0 | ‖v − v̂‖² | Mixed — partial manifold | Stable | Stable |
| x-prediction | Clean image x_0 | ‖x_0 − x̂_0‖² | Yes — data manifold | Easy (x_t ≈ x_0) | Pathological at very high t |
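To make the table concrete, here is a sketch of one training step for each target, assuming a PyTorch-style denoiser that takes (x_t, t) and a precomputed 1-D α̅ tensor; the helper name and signature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, t, alpha_bar, target="x"):
    """One training step for a chosen prediction target.

    `alpha_bar` is the cumulative product of (1 - beta_s), indexed by timestep t;
    `model(x_t, t)` returns the network's prediction for the chosen target.
    """
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))    # broadcast over the batch
    eps = torch.randn_like(x0)                            # the added Gaussian noise
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps  # forward noising step

    pred = model(x_t, t)
    if target == "eps":
        return F.mse_loss(pred, eps)
    if target == "v":
        v = torch.sqrt(a) * eps - torch.sqrt(1.0 - a) * x0
        return F.mse_loss(pred, v)
    return F.mse_loss(pred, x0)                           # x-prediction
```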
The paper's empirical claim is that ε-prediction's loss surface becomes pathological at high dimensions: the network must approximate a full-rank Gaussian, the gradient signal is small relative to the parameter-space size, and FID grows exponentially with dimension. x-prediction's loss surface stays smooth because the network is approximating a low-dimensional manifold, and the gradient signal is concentrated on the manifold's tangent space rather than spread across the ambient full-rank space.
Specifically, the paper reports that at ImageNet-64 the three targets behave comparably (FID within 10 % of each other across all three). At ImageNet-256 the gap widens — ε-prediction lags x-prediction by a factor of ~1.5–2× on FID. At ImageNet-512 the gap becomes catastrophic — ε-prediction's FID is 10× worse than x-prediction's. The trend is exponential in the image-dimension scale, not linear.
The manifold hypothesis says natural images sit on a low-dimensional manifold embedded in the high-dimensional pixel space. For ImageNet-class images the manifold dimension is estimated at ~100–1000 (the intrinsic dimensionality of the data distribution), embedded in a ~196 K-dim pixel space (256 × 256 × 3). The ratio of ambient to intrinsic dimensionality is ~200×.
A network that predicts the clean image is using its capacity to model a 100–1000-dim manifold. A network that predicts the noise is using its capacity to model the 196 K-dim full-rank Gaussian — the noise lives in the ambient space, not on the manifold. The capacity-allocation difference is ~200×, and the difference grows with the resolution scale because the ambient dimension grows quadratically while the manifold dimension grows only mildly. This is the structural source of the exponential FID gap.
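As a worked check of the numbers above (the intrinsic-dimension figure is the upper end of the estimate quoted earlier, assumed roughly constant across resolutions):

```python
# Illustrative arithmetic for the ambient-vs-intrinsic gap; the intrinsic-dimension
# value (~1,000) is an assumed upper estimate, not a measured quantity.
ambient_256 = 256 * 256 * 3        # 196,608 pixel dimensions at ImageNet-256
intrinsic   = 1_000                # assumed manifold dimension
print(ambient_256 / intrinsic)     # ~197x, the "~200x" ratio quoted above

ambient_512 = 512 * 512 * 3        # 786,432: ambient dimension grows quadratically
print(ambient_512 / intrinsic)     # ~786x, if the intrinsic dimension barely moves
```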
The Li & He paper's evidence is for natural images. The thesis-line 3-D representations have a stronger manifold structure than natural images, which makes the x-prediction case stronger for them.
| Representation | Manifold constraint | x-prediction benefit |
|---|---|---|
| Signed Distance Field (SDF) | ‖∇SDF‖ ≈ 1 everywhere; valid SDFs form a measure-zero subset of ℝ^(N³) | Strongest case. ε-prediction destroys the gradient-norm property; x-prediction preserves it |
| Sparse voxel grid (VDB / FVDB) | Active voxels form a thin surface band; ~99 % of the grid is empty | ε-prediction wastes capacity predicting noise for the empty bulk; x-prediction directly predicts active voxels |
| Triplane / hexplane features | Projection from 3-D geometry onto 2-D feature planes is highly structured | Same argument as natural images, applied to feature planes |
| Point cloud | Points sit on the surface manifold (2-D embedded in 3-D) | x-prediction in latent-space (after PointNet++ encoding) preserves the manifold structure |
The SDF case is the strongest. A valid SDF must satisfy the eikonal equation ‖∇SDF‖ = 1 almost everywhere. This is a hard geometric constraint, not a soft prior. A network that predicts the noise component of a noised SDF produces an output that is not itself a valid SDF; the gradient-norm property is destroyed voxel-by-voxel. A network that predicts the clean SDF directly preserves the constraint as long as the network has learned to produce valid SDFs in the first place.
The argument generalises: any 3-D representation with a hard geometric constraint (SDF gradient norm; sparse-voxel surface band; triplane consistency between the three planes) is structurally biased toward x-prediction over ε-prediction. The argument is stronger than for natural images precisely because natural images have only statistical regularities, while 3-D geometry has algebraic constraints.
The analysis above directly informs four downstream thesis-line topics.
| Topic | Decision | Why |
|---|---|---|
| JiT consumer-GPU reproduction [2] | x-prediction (matches reference paper) | Reference JiT already uses x-prediction; analysis confirms this is the right choice for the ViT-B / ImageNet-256 scale |
| MambaFlow3D 3-D scaling [3] | Flow-matching velocity prediction | Flow-matching velocity prediction is the closest cousin of x-prediction; ε-prediction is ruled out |
| MNIST backbone validation [4] | Flow-matching across all three backbones | Empirical validation on the same prediction target ensures the comparison isolates the backbone variable |
| Polyline diffusion design [5] | Flow-matching velocity prediction | Polyline coordinates have weaker manifold structure than SDF but stronger than natural images; flow-matching is the safest target |
A valid signed-distance field f: ℝ³ → ℝ must satisfy the eikonal equation ‖∇f(x)‖ = 1 almost everywhere, with the surface defined as the zero level set {x : f(x) = 0}. The set of valid SDFs is a measure-zero subset of the function space of all scalar fields — it is a manifold of dimension equal to the surface complexity (effectively the number of independent surface parameters), embedded in the infinite-dimensional function space.
ε-prediction on SDFs trains the network to predict the noise that, when removed from x_t, recovers the clean SDF. The training signal is in the noise space, which has no eikonal constraint, so the network's output (the predicted noise) is unconstrained. Subtracting the predicted noise from x_t recovers an estimate of the clean SDF, but that estimate satisfies the eikonal constraint only if the noise prediction is exact: any imperfection in the predicted noise propagates into the recovered SDF and destroys the constraint voxel-by-voxel.
x-prediction on SDFs trains the network to predict the clean SDF directly. The training signal is in the SDF space; the network's output is an SDF. Imperfections in the prediction shift the predicted SDF off the eikonal-constraint manifold, but the network is at least trying to land on the manifold. A regulariser (eikonal-loss term) can then enforce the constraint as a soft prior; this composes with x-prediction but not with ε-prediction.
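One way to make that composition concrete: a minimal sketch of an x-prediction loss with a finite-difference eikonal regulariser on a dense voxel grid. The shapes, the finite-difference stencil, and the weighting are illustrative assumptions, not a prescribed implementation:

```python
import torch
import torch.nn.functional as F

def eikonal_loss(sdf_hat, voxel_size=1.0):
    """Penalise deviation of the finite-difference gradient norm from 1.

    `sdf_hat` is a predicted SDF volume of shape (B, 1, D, H, W).
    """
    gx = (sdf_hat[:, :, 1:, :, :] - sdf_hat[:, :, :-1, :, :]) / voxel_size
    gy = (sdf_hat[:, :, :, 1:, :] - sdf_hat[:, :, :, :-1, :]) / voxel_size
    gz = (sdf_hat[:, :, :, :, 1:] - sdf_hat[:, :, :, :, :-1]) / voxel_size
    # Crop the three components to a common interior shape before combining.
    gx, gy, gz = gx[:, :, :, :-1, :-1], gy[:, :, :-1, :, :-1], gz[:, :, :-1, :-1, :]
    grad_norm = torch.sqrt(gx**2 + gy**2 + gz**2 + 1e-8)
    return ((grad_norm - 1.0) ** 2).mean()

def sdf_x_prediction_loss(x0_hat, x0, lambda_eik=0.1):
    """x-prediction reconstruction loss plus the eikonal term as a soft prior."""
    return F.mse_loss(x0_hat, x0) + lambda_eik * eikonal_loss(x0_hat)
```

The same regulariser applied to a predicted noise field would be meaningless, which is the sense in which the eikonal prior composes with x-prediction but not with ε-prediction.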
A typical sparse voxel grid for a 3-D scene has ~99 % empty voxels and ~1 % active voxels concentrated in a thin band around the surface. ε-prediction trains the network to predict Gaussian noise across the entire grid, including the 99 % empty bulk where there is no useful signal to predict. The network's capacity is spent predicting noise for empty space. x-prediction trains the network to predict the clean sparse-occupancy pattern, where the 99 % empty bulk has a near-zero target value and the 1 % surface band has informative targets. The capacity-allocation is implicitly biased toward the active band.
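The implicit bias toward the active band can also be made explicit with a loss weighting. A minimal sketch, assuming a dense boolean mask marking the surface band; the weight values are illustrative placeholders, not tuned numbers:

```python
import torch

def banded_x_prediction_loss(x0_hat, x0, occupancy, band_weight=1.0, bulk_weight=0.01):
    """Reweight the x-prediction loss toward the thin active voxel band.

    `occupancy` is a boolean mask of active voxels (the ~1% surface band),
    broadcastable against the predicted grid.
    """
    per_voxel = (x0_hat - x0) ** 2
    weights = bulk_weight + (band_weight - bulk_weight) * occupancy.float()
    return (weights * per_voxel).mean()
```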
Triplane features F_xy, F_xz, F_yz for a 3-D scene are produced by projection from the 3-D geometry onto three 2-D planes. The projection has structural regularities — adjacent pixels on a plane correspond to adjacent 3-D rays; the three planes are consistent with each other in the sense that a single 3-D feature is sampled by all three. ε-prediction does not exploit this structure (the predicted noise has no projection-consistency constraint); x-prediction implicitly preserves it (the predicted clean features are projections of the same underlying 3-D geometry).
Point clouds sit on a 2-D surface manifold embedded in 3-D space. Raw point-cloud diffusion runs into the variable-cardinality problem (different scenes have different point counts); the standard fix is to encode the point cloud to a fixed-shape latent (via PointNet++ or a learned encoder) and diffuse in that latent space. The manifold-hypothesis argument applies to the latent space rather than to the raw points — x-prediction or v-prediction in the latent space respects the latent-manifold structure that the encoder has learned, while ε-prediction trains against latent-space noise that has no such structure.
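A minimal sketch of that latent-space setup, assuming a hypothetical encoder that maps a variable-cardinality cloud to a fixed-shape latent; `encoder` and `denoiser` are placeholders, not a specific PointNet++ implementation:

```python
import torch
import torch.nn.functional as F

def latent_x_prediction_step(encoder, denoiser, points, t, alpha_bar):
    """One x-prediction training step in a fixed-shape point-cloud latent space.

    `encoder` maps a point cloud of shape (B, N, 3), with N varying per scene,
    to a fixed latent z0 of shape (B, D); `denoiser(z_t, t)` predicts the clean latent.
    """
    with torch.no_grad():
        z0 = encoder(points)                               # fixed-shape latent code
    a = alpha_bar[t].view(-1, 1)
    eps = torch.randn_like(z0)
    z_t = torch.sqrt(a) * z0 + torch.sqrt(1.0 - a) * eps   # noising in latent space
    z0_hat = denoiser(z_t, t)                              # predict the clean latent
    return F.mse_loss(z0_hat, z0)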
Three honest caveats. (i) The Li & He paper's empirical evidence is for ImageNet-class natural images. The 3-D-representation extension above is an analytical argument, not yet validated empirically at scale. The Hexplane AE topic [6] provides one empirical data point (the binary-occupancy VAE failure has a similar manifold-mismatch cause) but a controlled ε-vs-x experiment at SparC3D-class scale is not yet run. (ii) At small scale (the MNIST validation in [4]) the differences between the three target choices are smaller — the manifold-hypothesis argument is asymptotic in dimensionality. (iii) v-prediction is not analytically distinguished from x-prediction here; both are "partially manifold-aware" and the empirical literature gives mixed results between them.
Li & He's manifold-hypothesis argument for x-prediction over ε-prediction generalises to 3-D geometric representations. The case is structurally stronger for 3-D because 3-D geometry has algebraic constraints (SDF gradient norm, sparse-voxel coherence, triplane projection structure) that natural images lack. The per-representation analysis (§5) gives the structural rationale; the thesis-line architecture decisions (§4) operationalise it. The architecture rule going forward is unambiguous: x-prediction or flow-matching velocity prediction throughout; ε-prediction never used as a load-bearing target in 3-D-diffusion work.