Research Timeline · Aditya Jain / Apple Maps · 3D Reconstruction
29 Dec 2025 · Neural 3D · Debugging · Architecture Pivot

6-Plane VAE
→ Hexplane Autoencoder.

A 6-plane variational autoencoder for head-mesh reconstruction that hit KLD explosion despite aggressive regularisation. Diagnosed the failure as a structural mismatch between VAE Gaussian priors and binary occupancy inputs, and pivoted to a deterministic autoencoder with continuous depth + surface-normal features. Eliminated the explosion; training converged; mesh extraction works.

00 — Motivation

A failed VAE is a useful experiment if you publish the failure mode.

The broader thesis goal is a compressed latent representation for 3-D geometry that supports interactive editing of real-world shapes from street-view input. Variational autoencoders are the natural starting point — they produce smooth, traversable latent spaces by construction. Combined with the six-plane orthographic decomposition (top, bottom, front, back, left, right) that the rest of this thesis arc uses, a 6-plane VAE was the obvious experiment: encode a head mesh as six 512² hexplane images, train the VAE on a single mesh first to validate the architecture, scale to a full dataset once the loop converges.

It didn't converge. Despite aggressive regularisation — tiny KL weights, hard caps on the KL term, posterior-collapse safeguards — the KLD consistently exploded. Reconstruction loss kept improving, but the extracted meshes were empty. The latent space was useless: the encoder was effectively outputting the prior, the decoder was effectively decoding noise. This is the classic posterior-collapse signature, but the cause turned out to be more fundamental than the usual KL-annealing fix.

The diagnostic insight that resolved the experiment: the hexplane input is binary — each pixel is either 1 (occupied) or 0 (empty), with ~90 K occupied pixels per 512² plane (~34 % occupancy). A VAE with a squared-error reconstruction loss implicitly models the input as a sample from a continuous Gaussian conditioned on the latent, and minimising the KL term pushes the encoder's posterior toward a continuous Gaussian prior. Binary occupancy offers no continuous structure for either Gaussian assumption to match — every pixel value sits at one of two point masses. The decoder fights to reproduce binary output from a Gaussian latent, the KL fights to make the posterior Gaussian, and the loss landscape has no fixed point that satisfies both. The pivot — to a deterministic autoencoder with continuous depth + surface-normal inputs — eliminates the assumption violation entirely.

What it leaves behind
The VAE attempt is documented here as a negative result with a generalisable cause: any neural representation that fits a Gaussian-prior latent to binary inputs will hit posterior collapse, and the fix lies in the input representation rather than in the KL schedule. The same lesson carried into subsequent attempts on the thesis line — every later representation uses continuous features (SDF, depth, surface normals) rather than binary occupancy.
01 — Setup & Failure Mode

The VAE that wouldn't train.

Input. Six orthographic views (top, bottom, front, back, left, right) of a head mesh, rendered at 512² resolution as binary occupancy maps. Each plane records which pixels project onto a surface point and which don't. The stack is a [6, 512, 512] tensor with ~90 K occupied pixels per plane.

Model. A standard 6-plane VAE — encoder maps the hexplane stack to a latent z ∈ ℝ^d, decoder maps z back to a hexplane reconstruction, KL term regularises the posterior toward 𝒩(0, I). Total parameters ~38 M. Training on a single head mesh first, with 2 048 SDF samples per epoch for mesh-extraction supervision.
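The setup can be sketched in PyTorch — a minimal stand-in shrunk far below the real ~38 M parameters, where every layer size and the latent width are illustrative assumptions, not the thesis architecture:

```python
import torch
import torch.nn as nn


class HexplaneVAE(nn.Module):
    """Toy sketch of the rejected 6-plane VAE (hypothetical layer sizes)."""

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        # Encoder: [B, 6, 512, 512] binary occupancy -> flat feature vector.
        self.enc = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=4), nn.ReLU(),    # 512 -> 128
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),   # 128 -> 32
            nn.Conv2d(64, 8, 4, stride=4), nn.Flatten(), # 32 -> 8, flatten to 512
        )
        self.to_mu = nn.Linear(8 * 8 * 8, latent_dim)
        self.to_logvar = nn.Linear(8 * 8 * 8, latent_dim)
        # Decoder: latent -> coarse planes, upsampled back to 512^2.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 6 * 64 * 64),
            nn.Unflatten(1, (6, 64, 64)),
            nn.Upsample(scale_factor=8, mode="bilinear"),
            nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterisation: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar
```

The sigmoid output head is exactly the pressure point the diagnosis below identifies: the decoder is forced to approximate a two-valued target with a continuous map.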

Failure. Across multiple training runs, the KLD term consistently exploded over the first ~50 epochs. Aggressive controls failed: tiny initial KL weight (1×10⁻⁶), linear KL warm-up over 3 000 steps, hard cap on KL contribution to total loss, increased model capacity, gradient clipping at norm 1.0. Reconstruction loss continued to fall but extracted meshes were empty — marching cubes returned no iso-surface above any reasonable threshold. The latent space carried no geometry information.
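The failed mitigation schedule is simple to state in code. A minimal sketch — the warm-up end value and the cap are hypothetical, since the write-up gives only the 1×10⁻⁶ starting weight and the 3 000-step ramp:

```python
def kl_weight(step: int, beta_init: float = 1e-6,
              beta_max: float = 1e-3, warmup_steps: int = 3000) -> float:
    """Linear KL warm-up from beta_init to beta_max over warmup_steps.

    beta_max is a hypothetical end value; only the initial weight and
    the 3 000-step ramp are stated in the write-up.
    """
    t = min(step / warmup_steps, 1.0)
    return beta_init + t * (beta_max - beta_init)


def capped_kl(kl: float, beta: float, cap: float = 1.0) -> float:
    """Hard cap on the KL contribution to the total loss (cap value
    hypothetical). None of these controls prevented the explosion."""
    return min(beta * kl, cap)
```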

Core Insight

Binary in, Gaussian out.
The assumption is wrong.

A VAE assumes inputs and outputs are samples from continuous distributions conditioned on the latent, with pixels conditionally independent given z. Binary occupancy maps violate this at the most basic level — every pixel is exactly 0 or exactly 1, with no in-between values for the Gaussian assumption to model. The KL term and the reconstruction term pull the optimisation in mutually incompatible directions. No schedule tweak resolves it; only changing the input representation does.

02 — Diagnosis

Posterior collapse caused by input distribution mismatch.

A VAE optimises the ELBO:

L = ‖x − x̂‖² + β · KL[q(z|x) ‖ p(z)]

where the reconstruction term assumes x and x̂ are samples from a continuous Gaussian conditioned on z, and the KL term assumes the posterior q(z|x) is also continuous Gaussian. Both assumptions are violated for binary occupancy input:

Violation 1 — the reconstruction term. Squared error against a binary target gives each pixel exactly one loss-minimising prediction: its own value, 0 or 1. When the latent carries no information, the best per-pixel prediction collapses to the expectation over the data, and for an occupancy map that expectation is the mean occupancy rate (~34 %). The optimiser converges to this degenerate solution — the decoder outputs the mean everywhere, satisfying the reconstruction loss in expectation but producing an empty mesh under any threshold above it.
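The degenerate solution is easy to verify numerically. A toy NumPy check — the 34 % occupancy rate here is illustrative, standing in for the real plane statistics:

```python
import numpy as np

# For a binary target, a constant prediction equal to the mean occupancy
# minimises pixel-wise MSE, yet thresholding it at 0.5 yields an empty mask.
rng = np.random.default_rng(0)
target = (rng.random(262_144) < 0.34).astype(np.float64)  # ~34 % occupied

mean_pred = np.full_like(target, target.mean())
mse_mean = np.mean((target - mean_pred) ** 2)

# Any other constant c has strictly higher MSE: E[(x - c)^2] is minimised
# at c = E[x].
mse_other = np.mean((target - 0.6) ** 2)
assert mse_mean < mse_other

# Thresholding the degenerate prediction: zero occupied pixels -> empty mesh.
assert (mean_pred > 0.5).sum() == 0
```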

Violation 2 — the KL term. The KL term pushes q(z|x) toward 𝒩(0, I). The reconstruction term in a VAE with continuous targets fights this — informative posteriors deviate from the prior to encode the input. With binary targets, the encoder has limited gradient signal to resist the KL pull (because reconstruction loss is satisfied by the degenerate mean solution). The posterior collapses to the prior and the latent carries no information. The "exploding KL" symptom appears paradoxically because, during the collapse trajectory, the model briefly fits noise in the binary input as if it were Gaussian structure — producing transient high-KL posteriors before collapsing.
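The two terms of the objective above can be written out directly. A minimal sketch using the closed-form KL for a diagonal Gaussian posterior against 𝒩(0, I):

```python
import torch


def beta_vae_loss(x, x_hat, mu, logvar, beta):
    """ELBO from the section above: reconstruction MSE plus a
    beta-weighted KL[q(z|x) || N(0, I)] in closed form."""
    recon = torch.mean((x - x_hat) ** 2)
    # KL(N(mu, sigma^2) || N(0, 1)) = 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
    kl = 0.5 * torch.sum(mu ** 2 + logvar.exp() - logvar - 1.0)
    return recon + beta * kl, recon, kl
```

At the collapsed fixed point mu = 0, logvar = 0 the KL term vanishes exactly — the posterior is the prior, and the latent carries nothing.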

03 — The Pivot · Deterministic AE + Continuous Features

Two changes, both at the input representation.

REJECTED — 6-plane VAE
Binary occupancy + Gaussian latent
  • Input: binary [6, 512, 512] occupancy maps
  • Latent: 𝒩(μ, σ²) sampled per encoding
  • KL term: β · KL[q(z|x) ‖ 𝒩(0, I)]
  • Failure: KL explodes; extracted mesh empty
  • Root cause: binary input ⊥ Gaussian assumption
COMMITTED — hexplane AE
Continuous depth + normals + deterministic latent
  • Input: [6, 4, 512, 512] — depth + 3-axis surface normals per pixel
  • Latent: deterministic encoder output z = E(x)
  • Loss: just reconstruction — no KL term
  • Result: stable training, mesh extraction works
  • Compression: 1.6 MB latent vs ~50 MB hexplane input

The first change replaces binary occupancy with continuous geometric features. Per pixel, instead of "1 if surface else 0", the input is a 4-vector: depth-to-surface-along-view-ray, plus surface-normal (nx, ny, nz) at the hit point. Background pixels (no surface hit) get a sentinel depth + zero normal. The result is a continuous-valued representation that the squared-error reconstruction loss can optimise gradient-wise without converging to the mean-occupancy degenerate solution.
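Assembling the 4-channel per-pixel feature might look like the following sketch — the function name and the sentinel depth value are hypothetical, since the write-up doesn't specify them:

```python
import numpy as np


def continuous_features(depth, normals, hit_mask, sentinel_depth=-1.0):
    """Build the per-view 4-channel input [4, H, W]: depth along the view
    ray plus the surface normal at the hit point.

    Background pixels get a sentinel depth and a zero normal (the sentinel
    value here is an assumed choice, not stated in the write-up).

    depth:    [H, W] float, distance to surface where hit_mask is True
    normals:  [3, H, W] float, unit surface normals at the hit points
    hit_mask: [H, W] bool, True where the view ray hits the surface
    """
    d = np.where(hit_mask, depth, sentinel_depth)     # [H, W]
    n = np.where(hit_mask[None], normals, 0.0)        # [3, H, W]
    return np.concatenate([d[None], n], axis=0)       # [4, H, W]
```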

The second change drops the VAE in favour of a deterministic autoencoder. No KL term, no Gaussian-prior assumption, no posterior to collapse. The latent is whatever the encoder outputs deterministically per input — a learned compression of the 50 MB hexplane to a 1.6 MB latent vector. Training is then a pure reconstruction problem: minimise ‖x − D(E(x))‖² for encoder E and decoder D.
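One training step of the deterministic AE then reduces to a few lines. A sketch — the SDF supervision term the real loss also carries is omitted here:

```python
import torch


def train_step(encoder, decoder, x, optimiser):
    """One pure-reconstruction step: minimise ||x - D(E(x))||^2.

    No KL term, no sampling; the latent is the encoder output itself.
    """
    optimiser.zero_grad()
    z = encoder(x)           # deterministic latent, no reparameterisation
    x_hat = decoder(z)
    loss = torch.mean((x - x_hat) ** 2)
    loss.backward()
    optimiser.step()
    return loss.item()
```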

What's lost by dropping the VAE
The deterministic AE doesn't give a sampleable latent distribution — one can't draw z from 𝒩(0, I) and decode a new head. That's a real loss for downstream generative use. The decision was to validate the encoder/decoder architecture first via the AE, then revisit the variational extension once binary input was off the critical path. The right structural choice for a variational model over binary inputs is either a discrete-latent VAE (categorical posterior) or a learned-prior VAE (a prior fitted to the input distribution rather than fixed at 𝒩(0, I)) — both are open future work.
04 — Implementation Details

Numbers and infrastructure.

Item                        Value
Input shape (hexplane)      [6, 4, 512, 512] (6 views × depth + 3 normal channels)
Latent dimension            ~400 K floats (~1.6 MB at fp32)
Model parameters            ~38 M
Training data (Phase 1)     1 head mesh, 2 048 SDF samples per epoch for mesh-extraction supervision
GPU                         RTX 3060 12 GB
Optimiser                   Adam, lr = 1×10⁻⁴
Loss components             Hexplane reconstruction MSE + SDF supervision loss
Hexplane caching            Pre-computed once, cached on GPU; saves ~60 s per epoch on the head mesh
Mesh extraction             Marching cubes at threshold 0.5 on decoded hexplane → 5 719 verts, 10 584 faces
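The extraction step can be illustrated end-to-end. A hedged sketch using scikit-image's marching cubes, with a synthetic 64³ sphere volume standing in for the decoded head grid (the real pipeline decodes its grid from the 1.6 MB latent):

```python
import numpy as np
from skimage import measure

# Build a synthetic binary occupancy grid: a sphere of radius 0.5 in [-1, 1]^3.
N = 64
ax = np.linspace(-1.0, 1.0, N)
X, Y, Z = np.meshgrid(ax, ax, ax, indexing="ij")
occupancy = (np.sqrt(X**2 + Y**2 + Z**2) < 0.5).astype(np.float32)

# level=0.5 selects the iso-surface between empty (0) and occupied (1) cells,
# mirroring the threshold-0.5 extraction in the table above.
verts, faces, normals, _ = measure.marching_cubes(occupancy, level=0.5)
print(f"{len(verts)} verts, {len(faces)} faces")
```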
GPU rendering backend exploration

A secondary thread of the work was getting GPU-accelerated rendering working for the hexplane generation step. Initial CPU-based rasterisation with caching worked but was slower than the original VAE version. Subsequent attempts to use GPU rendering hit two specific blockers worth noting for future implementation work: PyTorch3D installation issues on the RTX 3060 setup (CUDA version incompatibilities with the prebuilt wheels), and nvdiffrast OpenGL/EGL dependency problems (the EGL display context couldn't be created in the cloud-GPU environment without an X server). The CPU-cached pipeline is functional and used for current validation; a third option (nvdiffrast in CUDA-only mode, no GL backend) is open for the next iteration.

Interactive Demo · Live

Toggle input representation between BINARY (the VAE-attempt failure mode) and CONTINUOUS (depth + normals, the committed pivot). The middle pane shows the 6 hexplane views at the current representation (binary = solid silhouettes, continuous = shaded depth+normal); the right pane shows the reconstructed mesh, which only converges when the input is continuous. Drag the mesh to rotate.


Full Technical Paper

arXiv-format write-up · VAE failure-mode diagnostic + continuous-feature fix · binary-input × Gaussian-prior incompatibility analysis

Related Thesis Chapters
Hierarchical Part-Based Triplane
Sister architecture that took the binary-input lesson into account. Per-part triplanes use continuous SDF features rather than binary occupancy; no KLD-explosion failure mode.
6-Plane Mesh Reconstruction
Same six-orthographic-view input substrate, but classical computational geometry instead of neural decoding. The two threads converge on the same input representation; this topic is the neural branch.
Sphere Depth Maps from Cube Faces
The depth-map generation infrastructure that produces the continuous-feature input the deterministic AE consumes after the pivot.
Appendix — Raw Materials
Transcripts & Source References
Restricted Access