A planned 6-plane variational autoencoder for head-mesh reconstruction that hit KLD explosion despite aggressive regularisation. Diagnosed the failure as a structural mismatch between VAE Gaussian priors and binary occupancy inputs, and pivoted to a deterministic autoencoder with continuous depth + surface-normal features. Eliminated the explosion; training converged; mesh extraction works.
The broader thesis goal is a compressed latent representation for 3-D geometry that supports interactive editing of real-world shapes from street-view input. Variational autoencoders are the natural starting point — they produce smooth, traversable latent spaces by construction. Combined with the six-plane orthographic decomposition (top, bottom, front, back, left, right) that the rest of this thesis arc uses, a 6-plane VAE was the obvious experiment: encode a head mesh as six 512² hexplane images, train the VAE on a single mesh first to validate the architecture, scale to a full dataset once the loop converges.
It didn't converge. Despite aggressive regularisation — tiny KL weights, hard caps on the KL term, posterior-collapse safeguards — the KLD consistently exploded. Reconstruction loss kept improving, but the extracted meshes were empty. The latent space was useless: the encoder was effectively outputting the prior, the decoder was effectively decoding noise. This is the classic posterior-collapse signature, but the cause turned out to be more fundamental than the usual KL-annealing fix.
The diagnostic insight that resolved the experiment: the hexplane input is binary — each pixel is either 1 (occupied) or 0 (empty), with ~90 K occupied pixels per 512² plane (~34 % occupancy). A VAE assumes the input is a sample from a continuous Gaussian distribution; minimising the KL term means the encoder's posterior should match a continuous Gaussian. The binary input has zero variance inside each cluster of pixels — there's no continuous structure for the Gaussian assumption to match. The decoder fights to reproduce binary output from a Gaussian latent, the KL fights to make the posterior Gaussian, and the loss landscape has no fixed point that satisfies both. The pivot — to a deterministic autoencoder with continuous depth + surface-normal inputs — eliminates the assumption violation entirely.
Input. Six orthographic views (top, bottom, front, back, left, right) of a head mesh, rendered at 512² resolution as binary occupancy maps. Each plane records which pixels project onto a surface point and which don't. The stack is a [6, 512, 512] tensor with ~90 K occupied pixels per plane.
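As an illustration of this representation, here is a minimal numpy sketch that splats a point cloud into six binary occupancy planes. `hexplane_occupancy` is a hypothetical helper, not the thesis rasteriser; since occupancy discards depth, opposite views share a silhouette here.

```python
import numpy as np

def hexplane_occupancy(points, res=512):
    """Splat a point cloud (N, 3), assumed normalised to [-1, 1]^3, into
    six binary orthographic occupancy planes. Illustrative only: a real
    rasteriser projects mesh surface samples, and opposite views share a
    silhouette because occupancy alone discards depth."""
    px = ((points + 1.0) * 0.5 * (res - 1)).astype(int).clip(0, res - 1)
    planes = np.zeros((6, res, res), dtype=np.float32)
    # Axis dropped per view: top/bottom look along y, front/back along z,
    # left/right along x.
    for i, view_axis in enumerate([1, 1, 2, 2, 0, 0]):
        keep = [a for a in range(3) if a != view_axis]
        planes[i, px[:, keep[0]], px[:, keep[1]]] = 1.0
    return planes
```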
Model. A standard 6-plane VAE — encoder maps the hexplane stack to a latent z ∈ ℝ^d, decoder maps z back to a hexplane reconstruction, KL term regularises the posterior toward 𝒩(0, I). Total parameters ~38 M. Training on a single head mesh first, with 2 048 SDF samples per epoch for mesh-extraction supervision.
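A shape-level sketch of the forward pass, with linear layers and toy dimensions standing in for the actual ~38 M-parameter network (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyHexplaneVAE:
    """Linear stand-in for the 6-plane VAE forward pass. The real model
    is a ~38 M-parameter conv net; dimensions here are shrunk so the
    sketch runs anywhere."""

    def __init__(self, in_dim, latent_dim):
        scale = 0.01
        self.W_mu = rng.normal(scale=scale, size=(latent_dim, in_dim))
        self.W_logvar = rng.normal(scale=scale, size=(latent_dim, in_dim))
        self.W_dec = rng.normal(scale=scale, size=(in_dim, latent_dim))

    def forward(self, x):
        mu = self.W_mu @ x                   # posterior mean
        logvar = self.W_logvar @ x           # posterior log-variance
        eps = rng.normal(size=mu.shape)
        z = mu + np.exp(0.5 * logvar) * eps  # reparameterisation trick
        x_hat = self.W_dec @ z               # hexplane reconstruction
        # Analytic KL[q(z|x) || N(0, I)] for a diagonal Gaussian posterior.
        kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
        return x_hat, kl

# One hexplane stack, flattened: 6 planes at 8x8 toy resolution.
x = (rng.random(6 * 8 * 8) < 0.34).astype(np.float32)
x_hat, kl = TinyHexplaneVAE(in_dim=6 * 8 * 8, latent_dim=16).forward(x)
```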
Failure. Across multiple training runs, the KLD term consistently exploded over the first ~50 epochs. Aggressive controls failed: tiny initial KL weight (1×10⁻⁶), linear KL warm-up over 3 000 steps, hard cap on KL contribution to total loss, increased model capacity, gradient clipping at norm 1.0. Reconstruction loss continued to fall but extracted meshes were empty — marching cubes returned no iso-surface above any reasonable threshold. The latent space carried no geometry information.
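The warm-up and cap controls can be sketched as follows; only the 1×10⁻⁶ initial weight and the 3 000-step linear ramp come from the runs above, while the target weight and cap value are placeholder assumptions:

```python
def kl_weight(step, beta_init=1e-6, beta_max=1e-4, warmup_steps=3000):
    """Linear KL warm-up. Only the 1e-6 start and the 3000-step ramp are
    from the experiment log; beta_max is a placeholder."""
    t = min(step / warmup_steps, 1.0)
    return beta_init + t * (beta_max - beta_init)

def capped_kl(kl_value, cap=10.0):
    """Hard cap on the KL contribution to the total loss (cap value assumed)."""
    return min(kl_value, cap)

# The per-step total loss would then combine as:
#   loss = recon_mse + kl_weight(step) * capped_kl(kl_value)
```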
Binary in, Gaussian out.
The assumption is wrong.
A VAE assumes inputs and outputs are samples from continuous distributions whose conditional independence is parameterised by the latent. Binary occupancy maps violate this at the most basic level — every pixel is exactly 0 or exactly 1, with no in-between values for the Gaussian assumption to model. The KL term and the reconstruction term pull the optimisation in mutually incompatible directions. No schedule tweak resolves it; only changing the input representation does.
A VAE optimises the ELBO:

ℒ(x) = 𝔼_{q(z|x)}[log p(x|z)] − β · KL[q(z|x) ‖ 𝒩(0, I)]

where the reconstruction term assumes x and x̂ are samples from a continuous Gaussian conditioned on z, and the KL term assumes the posterior q(z|x) is also continuous Gaussian. Both assumptions are violated for binary occupancy input:
Violation 1 — the reconstruction term. Squared error between a binary target (0 or 1) and a continuous prediction has only two valid prediction values per pixel: 0 or 1. The decoder learns to output values close to these, but the gradient signal is concentrated at the occupancy boundary — pixels far from the boundary get near-zero gradient, pixels at the boundary get high gradient. The optimiser converges to a degenerate solution where the decoder outputs the mean occupancy rate (~0.34) everywhere, satisfying the reconstruction loss in expectation but producing an empty mesh under any threshold.
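The degenerate solution is easy to verify numerically: for a binary target, the MSE-optimal constant prediction is the mean occupancy, which never crosses a 0.5 extraction threshold. A self-contained sketch with a synthetic ~34 %-occupancy target:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary occupancy plane at ~34 % occupancy (the head-mesh rate).
target = (rng.random((512, 512)) < 0.34).astype(np.float32)

# For a constant prediction c, MSE(c) = var(target) + (mean(target) - c)^2,
# so the best constant a latent-free decoder can emit is the mean occupancy.
candidates = np.linspace(0.0, 1.0, 101)
best_c = min(candidates, key=lambda c: np.mean((target - c) ** 2))

# best_c sits at the mean occupancy, well below the 0.5 extraction
# threshold: marching cubes sees a near-constant field and finds no surface.
empty_mesh = best_c < 0.5
```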
Violation 2 — the KL term. The KL term pushes q(z|x) toward 𝒩(0, I). The reconstruction term in a VAE with continuous targets fights this — informative posteriors deviate from the prior to encode the input. With binary targets, the encoder has limited gradient signal to resist the KL pull (because the reconstruction loss is satisfied by the degenerate mean solution). The posterior collapses to the prior and the latent carries no information. The "exploding KL" symptom appears paradoxically because, during the collapse trajectory, the model briefly fits noise in the binary input as if it were Gaussian structure — producing transient high-KL posteriors before collapsing.
[Figure: VAE vs AE latent paths — z ~ 𝒩(μ, σ²) sampled per encoding, regularised by β · KL[q(z|x) ‖ 𝒩(0, I)], versus deterministic z = E(x)]

The first change replaces binary occupancy with continuous geometric features. Per pixel, instead of "1 if surface else 0", the input is a 4-vector: depth-to-surface-along-view-ray, plus surface-normal (nx, ny, nz) at the hit point. Background pixels (no surface hit) get a sentinel depth + zero normal. The result is a continuous-valued representation that the squared-error reconstruction loss can optimise gradient-wise without converging to the mean-occupancy degenerate solution.
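To make the continuous representation concrete, here is a sketch that builds one depth + normal plane analytically for a sphere under orthographic projection; the sphere, resolution, and sentinel value are illustrative stand-ins for the actual ray-cast rasteriser:

```python
import numpy as np

def sphere_depth_normal_plane(res=128, radius=0.8, sentinel=2.0):
    """Front-view depth + surface-normal plane for a sphere centred at the
    origin, with orthographic rays along -z from the z = 1 plane. Per
    pixel: [depth, nx, ny, nz]; background pixels get [sentinel, 0, 0, 0]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, res), np.linspace(-1, 1, res),
                         indexing="ij")
    r2 = xs**2 + ys**2
    hit = r2 <= radius**2
    z = np.sqrt(np.clip(radius**2 - r2, 0.0, None))  # front-surface height
    depth = np.where(hit, 1.0 - z, sentinel)         # ray distance to hit
    nx = np.where(hit, xs / radius, 0.0)
    ny = np.where(hit, ys / radius, 0.0)
    nz = np.where(hit, z / radius, 0.0)
    return np.stack([depth, nx, ny, nz])             # [4, res, res]

plane = sphere_depth_normal_plane()
```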
The second change drops the VAE in favour of a deterministic autoencoder. No KL term, no Gaussian-prior assumption, no posterior to collapse. The latent is whatever the encoder outputs deterministically per input — a learned compression of the 50 MB hexplane to a 1.6 MB latent vector. Training is then a pure reconstruction problem: minimise ‖x − D(E(x))‖², where E is the encoder and D the decoder. The trade-off: with no prior over the latent, there is no longer a principled way to sample a fresh z from 𝒩(0, I) and decode it.
That's a real loss for downstream generative use. The decision was to validate the encoder/decoder architecture first via the AE, then revisit the variational extension once binary input was off the critical path. The right structural choice for a variational model over binary input is either a discrete-latent VAE (categorical posterior) or a learned-prior VAE, where the prior is fitted to the input distribution rather than fixed at 𝒩(0, I) — both are open future work.
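The deterministic training loop reduces to gradient descent on the reconstruction error alone. A toy linear sketch (dimensions and learning rate are illustrative, not the thesis configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic linear autoencoder trained by plain gradient descent on
# ||x - D(E(x))||^2: no KL term, no sampling. Toy dimensions stand in for
# the [6, 4, 512, 512] hexplane and the ~400 K-float latent.
in_dim, latent_dim, lr = 16, 4, 0.01
E = rng.normal(scale=0.1, size=(latent_dim, in_dim))   # encoder weights
D = rng.normal(scale=0.1, size=(in_dim, latent_dim))   # decoder weights
x = rng.normal(size=in_dim)                            # one "hexplane"

losses = []
for _ in range(500):
    z = E @ x                          # deterministic latent, z = E(x)
    x_hat = D @ z                      # reconstruction
    err = x_hat - x                    # d(loss)/d(x_hat), up to a factor 2
    D -= lr * np.outer(err, z)         # decoder gradient step
    E -= lr * np.outer(D.T @ err, x)   # encoder gradient step
    losses.append(float(err @ err))
```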
| Item | Value |
|---|---|
| Input shape (hexplane) | [6, 4, 512, 512] — 6 views × (depth + 3 normal channels) |
| Latent dimension | ~400 K floats (~1.6 MB at fp32) |
| Model parameters | ~38 M |
| Training data (Phase 1) | 1 head mesh, 2 048 SDF samples per epoch for mesh-extraction supervision |
| GPU | RTX 3060 12 GB |
| Optimiser | Adam, lr = 1×10⁻⁴ |
| Loss components | Hexplane reconstruction MSE + SDF supervision loss |
| Hexplane caching | Pre-computed once, cached on GPU; saves ~60 s per epoch on the head mesh |
| Mesh extraction | Marching cubes at threshold 0.5 on decoded hexplane → 5 719 verts, 10 584 faces |
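The threshold-0.5 extraction also explains the earlier empty-mesh symptom: marching cubes can only emit triangles if the decoded field actually crosses the iso-level. A minimal numpy check, as a stand-in for the real marching-cubes call:

```python
import numpy as np

def has_isosurface(field, level=0.5):
    """True iff the scalar field crosses `level` anywhere: the minimal
    precondition for marching cubes to emit any triangles. A stand-in
    for the real marching-cubes call in the pipeline."""
    return bool(field.min() < level < field.max())

# Degenerate VAE decode: constant mean-occupancy field -> empty mesh.
mean_field = np.full((64, 64, 64), 0.34)

# Converged AE decode: an actual occupied region -> extractable surface.
xs = np.linspace(-1.0, 1.0, 64)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
ball = (X**2 + Y**2 + Z**2 < 0.64).astype(np.float32)

print(has_isosurface(mean_field))  # False
print(has_isosurface(ball))        # True
```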
A secondary thread of the work was getting GPU-accelerated rendering working for the hexplane generation step. Initial CPU-based rasterisation with caching worked but was slower than the original VAE version. Subsequent attempts to use GPU rendering hit two specific blockers worth noting for future implementation work: PyTorch3D installation issues on the RTX 3060 setup (CUDA version incompatibilities with the prebuilt wheels), and nvdiffrast OpenGL/EGL dependency problems (the EGL display context couldn't be created in the cloud-GPU environment without an X server). The CPU-cached pipeline is functional and used for current validation; a third option (nvdiffrast in CUDA-only mode, no GL backend) is open for the next iteration.
[Interactive demo] Toggle the input representation between BINARY (the VAE-attempt failure mode) and CONTINUOUS (depth + normals, the committed pivot). The middle pane shows the six hexplane views under the current representation (binary: solid silhouettes; continuous: shaded depth + normals); the right pane shows the reconstructed mesh, which converges only when the input is continuous. Drag the mesh to rotate.
arXiv-format write-up · VAE failure-mode diagnostic + continuous-feature fix · binary-input × Gaussian-prior incompatibility analysis