A planned 6-plane variational autoencoder for head-mesh reconstruction that hit KLD explosion despite aggressive regularisation. Diagnosed the failure as a structural mismatch between VAE Gaussian priors and binary occupancy inputs, and pivoted to a deterministic autoencoder with continuous depth + surface-normal features. Eliminated the explosion; training converged; mesh extraction works.
The broader thesis goal is a compressed latent representation for 3-D geometry that supports interactive editing of real-world shapes from street-view input. Variational autoencoders are the natural starting point — they produce smooth, traversable latent spaces by construction. Combined with the six-plane orthographic decomposition (top, bottom, front, back, left, right) that the rest of this thesis arc uses, a 6-plane VAE was the obvious experiment: encode a head mesh as six 512² hexplane images, train the VAE on a single mesh first to validate the architecture, scale to a full dataset once the loop converges.
It didn't converge. Despite aggressive regularisation — tiny KL weights, hard caps on the KL term, posterior-collapse safeguards — the KLD consistently exploded. Reconstruction loss kept improving, but the extracted meshes were empty. The latent space was useless: the encoder was effectively outputting the prior, the decoder was effectively decoding noise. This is the classic posterior-collapse signature, but the cause turned out to be more fundamental than the usual KL-annealing fix.
The diagnostic insight that resolved the experiment: the hexplane input is binary — each pixel is either 1 (occupied) or 0 (empty), with ~90 K occupied pixels per 512² plane (~34 % occupancy). A VAE assumes the input is a sample from a continuous Gaussian distribution; minimising the KL term means the encoder's posterior should match a continuous Gaussian. The binary input has zero variance inside each cluster of pixels — there's no continuous structure for the Gaussian assumption to match. The decoder fights to reproduce binary output from a Gaussian latent, the KL fights to make the posterior Gaussian, and the loss landscape has no fixed point that satisfies both. The pivot — to a deterministic autoencoder with continuous depth + surface-normal inputs — eliminates the assumption violation entirely.
Input. Six orthographic views (top, bottom, front, back, left, right) of a head mesh, rendered at 512² resolution as binary occupancy maps. Each plane records which pixels project onto a surface point and which don't. The stack is a [6, 512, 512] tensor with ~90 K occupied pixels per plane.
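As an illustration of this representation, here is a minimal numpy sketch that splats a point cloud into six binary occupancy planes. `hexplane_occupancy` is a hypothetical helper, not the thesis rasteriser; since occupancy discards depth, opposite views share a silhouette here.

```python
import numpy as np

def hexplane_occupancy(points, res=512):
    """Splat a point cloud (N, 3), assumed normalised to [-1, 1]^3, into
    six binary orthographic occupancy planes. Illustrative only: a real
    rasteriser projects mesh surface samples, and opposite views share a
    silhouette because occupancy alone discards depth."""
    px = ((points + 1.0) * 0.5 * (res - 1)).astype(int).clip(0, res - 1)
    planes = np.zeros((6, res, res), dtype=np.float32)
    # Axis dropped per view: top/bottom look along y, front/back along z,
    # left/right along x.
    for i, view_axis in enumerate([1, 1, 2, 2, 0, 0]):
        keep = [a for a in range(3) if a != view_axis]
        planes[i, px[:, keep[0]], px[:, keep[1]]] = 1.0
    return planes
```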
Model. A standard 6-plane VAE — encoder maps the hexplane stack to a latent z ∈ ℝ^d, decoder maps z back to a hexplane reconstruction, KL term regularises the posterior toward 𝒩(0, I). Total parameters ~38 M. Training on a single head mesh first, with 2 048 SDF samples per epoch for mesh-extraction supervision.
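A shape-level sketch of the forward pass, with linear layers and toy dimensions standing in for the actual ~38 M-parameter network (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyHexplaneVAE:
    """Linear stand-in for the 6-plane VAE forward pass. The real model
    is a ~38 M-parameter conv net; dimensions here are shrunk so the
    sketch runs anywhere."""

    def __init__(self, in_dim, latent_dim):
        scale = 0.01
        self.W_mu = rng.normal(scale=scale, size=(latent_dim, in_dim))
        self.W_logvar = rng.normal(scale=scale, size=(latent_dim, in_dim))
        self.W_dec = rng.normal(scale=scale, size=(in_dim, latent_dim))

    def forward(self, x):
        mu = self.W_mu @ x                   # posterior mean
        logvar = self.W_logvar @ x           # posterior log-variance
        eps = rng.normal(size=mu.shape)
        z = mu + np.exp(0.5 * logvar) * eps  # reparameterisation trick
        x_hat = self.W_dec @ z               # hexplane reconstruction
        # Analytic KL[q(z|x) || N(0, I)] for a diagonal Gaussian posterior.
        kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
        return x_hat, kl

# One hexplane stack, flattened: 6 planes at 8x8 toy resolution.
x = (rng.random(6 * 8 * 8) < 0.34).astype(np.float32)
x_hat, kl = TinyHexplaneVAE(in_dim=6 * 8 * 8, latent_dim=16).forward(x)
```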
Failure. Across multiple training runs, the KLD term consistently exploded over the first ~50 epochs. Aggressive controls failed: tiny initial KL weight (1×10⁻⁶), linear KL warm-up over 3 000 steps, hard cap on KL contribution to total loss, increased model capacity, gradient clipping at norm 1.0. Reconstruction loss continued to fall but extracted meshes were empty — marching cubes returned no iso-surface above any reasonable threshold. The latent space carried no geometry information.
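The warm-up and cap controls can be sketched as follows; only the 1×10⁻⁶ initial weight and the 3 000-step linear ramp come from the runs above, while the target weight and cap value are placeholder assumptions:

```python
def kl_weight(step, beta_init=1e-6, beta_max=1e-4, warmup_steps=3000):
    """Linear KL warm-up. Only the 1e-6 start and the 3000-step ramp are
    from the experiment log; beta_max is a placeholder."""
    t = min(step / warmup_steps, 1.0)
    return beta_init + t * (beta_max - beta_init)

def capped_kl(kl_value, cap=10.0):
    """Hard cap on the KL contribution to the total loss (cap value assumed)."""
    return min(kl_value, cap)

# The per-step total loss would then combine as:
#   loss = recon_mse + kl_weight(step) * capped_kl(kl_value)
```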
Binary in, Gaussian out.
The assumption is wrong.
A VAE assumes inputs and outputs are samples from continuous distributions whose conditional independence is parameterised by the latent. Binary occupancy maps violate this at the most basic level — every pixel is exactly 0 or exactly 1, with no in-between values for the Gaussian assumption to model. The KL term and the reconstruction term pull the optimisation in mutually incompatible directions. No schedule tweak resolves it; only changing the input representation does.
A VAE optimises the ELBO:

ℒ(x) = 𝔼_{q(z|x)}[log p(x|z)] − β · KL[q(z|x) ‖ 𝒩(0, I)]

where the reconstruction term assumes x and x̂ are samples from a continuous Gaussian conditioned on z, and the KL term assumes the posterior q(z|x) is also continuous Gaussian. Both assumptions are violated for binary occupancy input:
Violation 1 — the reconstruction term. Squared error between a binary target (0 or 1) and a continuous prediction has only two valid prediction values per pixel: 0 or 1. The decoder learns to output values close to these, but the gradient signal is concentrated at the occupancy boundary — pixels far from the boundary get near-zero gradient, pixels at the boundary get high gradient. The optimiser converges to a degenerate solution where the decoder outputs the mean occupancy rate (~0.34) everywhere, satisfying the reconstruction loss in expectation but producing an empty mesh under any threshold.
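The degenerate solution is easy to verify numerically: for a binary target, the MSE-optimal constant prediction is the mean occupancy, which never crosses a 0.5 extraction threshold. A self-contained sketch with a synthetic ~34 %-occupancy target:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary occupancy plane at ~34 % occupancy (the head-mesh rate).
target = (rng.random((512, 512)) < 0.34).astype(np.float32)

# For a constant prediction c, MSE(c) = var(target) + (mean(target) - c)^2,
# so the best constant a latent-free decoder can emit is the mean occupancy.
candidates = np.linspace(0.0, 1.0, 101)
best_c = min(candidates, key=lambda c: np.mean((target - c) ** 2))

# best_c sits at the mean occupancy, well below the 0.5 extraction
# threshold: marching cubes sees a near-constant field and finds no surface.
empty_mesh = best_c < 0.5
```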
Violation 2 — the KL term. The KL term pushes q(z|x) toward 𝒩(0, I). The reconstruction term in a VAE with continuous targets fights this — informative posteriors deviate from the prior to encode the input. With binary targets, the encoder has limited gradient signal to resist the KL pull (because the reconstruction loss is satisfied by the degenerate mean solution). The posterior collapses to the prior and the latent carries no information. The "exploding KL" symptom appears paradoxically because, during the collapse trajectory, the model briefly fits noise in the binary input as if it were Gaussian structure — producing transient high-KL posteriors before collapsing.
[Figure: VAE vs AE latent paths — z ~ 𝒩(μ, σ²) sampled per encoding, regularised by β · KL[q(z|x) ‖ 𝒩(0, I)], versus deterministic z = E(x)]

The first change replaces binary occupancy with continuous geometric features. Per pixel, instead of "1 if surface else 0", the input is a 4-vector: depth-to-surface-along-view-ray, plus surface-normal (nx, ny, nz) at the hit point. Background pixels (no surface hit) get a sentinel depth + zero normal. The result is a continuous-valued representation that the squared-error reconstruction loss can optimise gradient-wise without converging to the mean-occupancy degenerate solution.
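To make the continuous representation concrete, here is a sketch that builds one depth + normal plane analytically for a sphere under orthographic projection; the sphere, resolution, and sentinel value are illustrative stand-ins for the actual ray-cast rasteriser:

```python
import numpy as np

def sphere_depth_normal_plane(res=128, radius=0.8, sentinel=2.0):
    """Front-view depth + surface-normal plane for a sphere centred at the
    origin, with orthographic rays along -z from the z = 1 plane. Per
    pixel: [depth, nx, ny, nz]; background pixels get [sentinel, 0, 0, 0]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, res), np.linspace(-1, 1, res),
                         indexing="ij")
    r2 = xs**2 + ys**2
    hit = r2 <= radius**2
    z = np.sqrt(np.clip(radius**2 - r2, 0.0, None))  # front-surface height
    depth = np.where(hit, 1.0 - z, sentinel)         # ray distance to hit
    nx = np.where(hit, xs / radius, 0.0)
    ny = np.where(hit, ys / radius, 0.0)
    nz = np.where(hit, z / radius, 0.0)
    return np.stack([depth, nx, ny, nz])             # [4, res, res]

plane = sphere_depth_normal_plane()
```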
The second change drops the VAE in favour of a deterministic autoencoder. No KL term, no Gaussian-prior assumption, no posterior to collapse. The latent is whatever the encoder outputs deterministically per input — a learned compression of the 50 MB hexplane to a 1.6 MB latent vector. Training is then a pure reconstruction problem: minimise ‖x − D(E(x))‖², where E is the encoder and D the decoder. The trade-off: with no prior over the latent, there is no longer a principled way to sample a fresh z from 𝒩(0, I) and decode it.
That's a real loss for downstream generative use. The decision was to validate the encoder/decoder architecture first via the AE, then revisit the variational extension once binary input was off the critical path. The right structural choice for a variational model over binary input is either a discrete-latent VAE (categorical posterior) or a learned-prior VAE, where the prior is fitted to the input distribution rather than fixed at 𝒩(0, I) — both are open future work.
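The deterministic training loop reduces to gradient descent on the reconstruction error alone. A toy linear sketch (dimensions and learning rate are illustrative, not the thesis configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic linear autoencoder trained by plain gradient descent on
# ||x - D(E(x))||^2: no KL term, no sampling. Toy dimensions stand in for
# the [6, 4, 512, 512] hexplane and the ~400 K-float latent.
in_dim, latent_dim, lr = 16, 4, 0.01
E = rng.normal(scale=0.1, size=(latent_dim, in_dim))   # encoder weights
D = rng.normal(scale=0.1, size=(in_dim, latent_dim))   # decoder weights
x = rng.normal(size=in_dim)                            # one "hexplane"

losses = []
for _ in range(500):
    z = E @ x                          # deterministic latent, z = E(x)
    x_hat = D @ z                      # reconstruction
    err = x_hat - x                    # d(loss)/d(x_hat), up to a factor 2
    D -= lr * np.outer(err, z)         # decoder gradient step
    E -= lr * np.outer(D.T @ err, x)   # encoder gradient step
    losses.append(float(err @ err))
```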
| Item | Value |
|---|---|
| Input shape (hexplane) | [6, 4, 512, 512] — 6 views × (depth + 3 normal channels) |
| Latent dimension | ~400 K floats (~1.6 MB at fp32) |
| Model parameters | ~38 M |
| Training data (Phase 1) | 1 head mesh, 2 048 SDF samples per epoch for mesh-extraction supervision |
| GPU | RTX 3060 12 GB |
| Optimiser | Adam, lr = 1×10⁻⁴ |
| Loss components | Hexplane reconstruction MSE + SDF supervision loss |
| Hexplane caching | Pre-computed once, cached on GPU; saves ~60 s per epoch on the head mesh |
| Mesh extraction | Marching cubes at threshold 0.5 on decoded hexplane → 5 719 verts, 10 584 faces |
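The threshold-0.5 extraction also explains the earlier empty-mesh symptom: marching cubes can only emit triangles if the decoded field actually crosses the iso-level. A minimal numpy check, as a stand-in for the real marching-cubes call:

```python
import numpy as np

def has_isosurface(field, level=0.5):
    """True iff the scalar field crosses `level` anywhere: the minimal
    precondition for marching cubes to emit any triangles. A stand-in
    for the real marching-cubes call in the pipeline."""
    return bool(field.min() < level < field.max())

# Degenerate VAE decode: constant mean-occupancy field -> empty mesh.
mean_field = np.full((64, 64, 64), 0.34)

# Converged AE decode: an actual occupied region -> extractable surface.
xs = np.linspace(-1.0, 1.0, 64)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
ball = (X**2 + Y**2 + Z**2 < 0.64).astype(np.float32)

print(has_isosurface(mean_field))  # False
print(has_isosurface(ball))        # True
```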
A secondary thread of the work was getting GPU-accelerated rendering working for the hexplane generation step. Initial CPU-based rasterisation with caching worked but was slower than the original VAE version. Subsequent attempts to use GPU rendering hit two specific blockers worth noting for future implementation work: PyTorch3D installation issues on the RTX 3060 setup (CUDA version incompatibilities with the prebuilt wheels), and nvdiffrast OpenGL/EGL dependency problems (the EGL display context couldn't be created in the cloud-GPU environment without an X server). The CPU-cached pipeline is functional and used for current validation; a third option (nvdiffrast in CUDA-only mode, no GL backend) is open for the next iteration.
[Interactive demo] Toggle the input representation between BINARY (the VAE-attempt failure mode) and CONTINUOUS (depth + normals, the committed pivot). The middle pane shows the six hexplane views under the current representation (binary: solid silhouettes; continuous: shaded depth + normals); the right pane shows the reconstructed mesh, which converges only when the input is continuous. Drag the mesh to rotate.
arXiv-format write-up · VAE failure-mode diagnostic + continuous-feature fix · binary-input × Gaussian-prior incompatibility analysis