[6, 512, 512] tensor of binary occupancy maps; a 38-million-parameter VAE encoder–decoder; standard KL-regularised ELBO training. Despite aggressive KL-term controls — initial weight 1×10⁻⁶, linear warm-up over 3 000 steps, hard caps on the KL contribution to the total loss — training consistently collapsed across multiple runs. Reconstruction loss continued to decrease but extracted meshes were empty; the latent space carried no geometric information. We diagnose the failure as a fundamental incompatibility between the VAE's continuous Gaussian-distribution assumptions and binary occupancy inputs: the reconstruction term converges to the degenerate mean-occupancy solution because its gradient signal is concentrated at boundary pixels, and the posterior collapses to the prior because the encoder has too little residual gradient to resist the KL pull. The fix is at the input representation: replace binary occupancy with continuous depth + surface-normal features (4 channels per pixel), and drop the variational machinery in favour of a deterministic autoencoder. The pivot eliminates the failure mode entirely; training converges with stable loss curves, and marching-cubes mesh extraction recovers 5 719 vertices / 10 584 faces matching the input head mesh. The generalisable contribution is the diagnostic: any neural representation that combines Gaussian-prior latents with binary geometric input will hit posterior collapse for the same structural reason, and no amount of KL-schedule tuning will prevent it; the fix belongs at the input representation, not at the loss schedule. Keywords: VAE failure mode, posterior collapse, hexplane representation, binary occupancy, deterministic autoencoder, 3-D mesh.
Variational autoencoders [1] are the standard tool for learning compressed, smooth, sampleable latent representations. For 3-D shape representation, the natural application is encoding the geometry as a fixed-shape tensor (voxel grid, triplane, hexplane), training a VAE to reconstruct it, and using the latent space for downstream generation. The six-plane orthographic decomposition into top/bottom/front/back/left/right views [2,3] is a particularly memory-efficient encoding because it stores 6 × N² rather than N³ values.
For binary occupancy hexplanes — where each pixel is exactly 0 (empty) or exactly 1 (surface-hit), the simplest geometric encoding — combining the representation with a standard VAE turns out not to work. The work presented here documents this failure mode after multiple training attempts and the structural fix that resolved it. The point of the paper is the diagnostic, not the specific architecture.
The contributions of this paper are: (1) characterisation of the KLD-explosion symptom on binary-occupancy hexplanes despite aggressive regularisation; (2) formal analysis of the underlying failure mechanism — both VAE distributional assumptions are violated by binary input, producing a loss landscape with no Pareto-optimal fixed point; (3) the representation-level fix — replace binary occupancy with continuous depth + surface-normal features, drop the variational machinery — and the demonstration that this restores stable training and successful mesh extraction; (4) the generalisable lesson that the fix must be at the input representation, not at the KL schedule.
The input is a 3-D mesh rendered orthographically into six views: top (+Y), bottom (−Y), front (−Z), back (+Z), left (−X), right (+X). Each view is a 512 × 512 binary occupancy map — pixel value 1 if a surface point projects to that pixel, 0 otherwise. The full input tensor is x ∈ {0, 1}^(6 × 512 × 512), approximately 1.6 MB at 1 byte per pixel. For the head mesh used in this experiment, the occupied fraction per plane averages ~0.034 % (roughly 90 of the ~262 K pixels per plane).
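For concreteness, the sketch below approximates such a binary hexplane by splatting sampled surface points onto the six planes. The actual pipeline rasterises the mesh orthographically; the function name and the axis convention here are illustrative assumptions.

```python
import numpy as np

def binary_hexplane(points: np.ndarray, res: int = 512) -> np.ndarray:
    """Approximate the six binary occupancy maps by splatting sampled surface
    points (shape (N, 3)) onto axis-aligned planes.  The real pipeline
    rasterises the mesh; point splatting is a cheap stand-in."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    uv = (points - lo) / np.maximum(hi - lo, 1e-8)           # normalise to [0, 1]
    px = np.clip((uv * (res - 1)).astype(int), 0, res - 1)   # per-axis pixel indices

    # (u-axis, v-axis) index pairs for top/bottom/front/back/left/right.
    # The exact axis convention is an assumption; only consistency matters here.
    view_axes = [(0, 2), (0, 2), (0, 1), (0, 1), (2, 1), (2, 1)]
    planes = np.zeros((6, res, res), dtype=np.uint8)
    for k, (u, v) in enumerate(view_axes):
        planes[k, px[:, v], px[:, u]] = 1                    # mark surface-hit pixels
    return planes                                            # (6, 512, 512), values in {0, 1}
```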
A standard encoder–decoder VAE. Encoder: convolutional stack over the 6-view input, output mean μ and log-variance log σ² for the latent z ∈ ℝ^d. Reparameterisation: z = μ + σ · ε, ε ∼ 𝒩(0, I). Decoder: deconvolutional stack from z back to a 6-view reconstruction x̂. Total parameter count ~38 M. Latent dimension chosen so the latent representation is approximately 1.6 MB at fp32.
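A minimal PyTorch sketch of this architecture follows. The channel widths, depth, and latent size are illustrative rather than the exact 38 M-parameter configuration; only the overall encoder/reparameterisation/decoder structure matches the description above.

```python
import torch
import torch.nn as nn

class HexplaneVAE(nn.Module):
    """Sketch of the encoder-decoder described above (illustrative sizes)."""
    def __init__(self, in_ch: int = 6, latent_ch: int = 64):
        super().__init__()
        def down(i, o): return nn.Sequential(nn.Conv2d(i, o, 4, 2, 1), nn.ReLU())
        def up(i, o):   return nn.Sequential(nn.ConvTranspose2d(i, o, 4, 2, 1), nn.ReLU())
        self.encoder = nn.Sequential(                 # 512 -> 16 spatially
            down(in_ch, 32), down(32, 64), down(64, 128),
            down(128, 256), down(256, 256),
        )
        self.to_mu     = nn.Conv2d(256, latent_ch, 1)  # μ
        self.to_logvar = nn.Conv2d(256, latent_ch, 1)  # log σ²
        self.decoder = nn.Sequential(                  # 16 -> 512 spatially
            up(latent_ch, 256), up(256, 128), up(128, 64), up(64, 32),
            nn.ConvTranspose2d(32, in_ch, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterisation
        return self.decoder(z), mu, logvar
```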
Loss: standard ELBO with KL weight β:
L = ‖x − x̂‖² + β · KL[q(z|x) ‖ 𝒩(0, I)]

Training on a single head mesh first to validate the architecture before scaling to a full dataset. 2 048 SDF samples per epoch for mesh-extraction supervision (separate marching-cubes-reconstruction loss). Adam optimiser, learning rate 1×10⁻⁴, RTX 3060 12 GB.
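A sketch of one training step under this objective, using the closed-form KL for a diagonal Gaussian posterior (reductions and function names are illustrative):

```python
import torch
import torch.nn.functional as F

def vae_step(model, x, optimizer, beta: float):
    """One ELBO step: squared-error reconstruction plus a β-weighted KL
    against the unit-Gaussian prior."""
    x_hat, mu, logvar = model(x)
    recon = F.mse_loss(x_hat, x, reduction="sum") / x.shape[0]
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.shape[0]
    loss = recon + beta * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return recon.item(), kl.item()

# e.g. optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```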
Across multiple training runs (five attempts with varying hyperparameters), the KLD term consistently exploded over the first ~50 epochs. Reconstruction loss continued to decrease, but extracted meshes were empty — marching cubes at the standard 0.5 threshold returned zero vertices. The latent space carried no usable geometric information.
Standard KL-collapse controls were attempted without success:
| Control | Setting | Result |
|---|---|---|
| Initial KL weight | 1×10⁻⁶ (effectively zero at start) | KL exploded once warm-up reached 1×10⁻⁴ |
| Linear KL warm-up | 3 000 steps | Delayed but did not eliminate the explosion |
| Hard cap on KL contribution | KL term capped at 0.1 of total loss | Encoder learned to push the posterior far from prior to evade the cap, then collapsed |
| Gradient clipping | Norm 1.0 | Slowed training but did not prevent collapse |
| Increased model capacity | +30 % parameters | No effect |
| Tighter prior | 𝒩(0, 0.1·I) | Encoder still collapsed to the tighter prior |
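For reference, two of the controls in the table (the linear warm-up and the hard cap on the KL contribution) can be written roughly as follows; the endpoint values come from the table, while the clamp-based implementation of the cap is an assumption:

```python
import torch

def kl_weight(step: int, warmup_steps: int = 3000,
              beta_start: float = 1e-6, beta_end: float = 1e-4) -> float:
    """Linear KL warm-up: β ramps from beta_start to beta_end over warmup_steps."""
    t = min(step / warmup_steps, 1.0)
    return beta_start + t * (beta_end - beta_start)

def capped_kl_term(recon: torch.Tensor, kl: torch.Tensor, beta: float,
                   max_frac: float = 0.1) -> torch.Tensor:
    """Hard cap: the weighted KL may contribute at most max_frac of the total
    loss.  One way to enforce this is to clamp the weighted term against the
    detached reconstruction term; gradients through KL vanish above the cap."""
    cap = recon.detach() * max_frac / (1.0 - max_frac)
    return torch.clamp(beta * kl, max=cap.item())
```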
The consistency of the failure across hyperparameter settings suggested a structural problem rather than a tuning problem. The diagnosis below identifies the structural cause.
The VAE optimises the ELBO under two distributional assumptions: (i) the input x is a sample from a continuous distribution p(x|z) conditioned on the latent, parameterised by the decoder; (ii) the approximate posterior q(z|x) is a continuous Gaussian whose parameters are produced by the encoder. Binary occupancy input violates both assumptions in ways that interact to produce the observed failure.
For binary x ∈ {0, 1}, squared error ‖x − x̂‖² has a degenerate optimum at x̂ = E[x] = p (the per-pixel occupancy rate). For the head-mesh hexplane, p ≈ 0.034 % per pixel — a near-zero constant. The gradient signal that would drive the decoder to predict 1 at occupied pixels is concentrated at the boundary between occupied and empty regions (where the loss surface has high local gradient), and is near-zero in the bulk of each region (where the loss surface is locally flat against the degenerate constant solution).
Specifically, for a pixel deep in an occupied region, the decoder's output is already near 1, the residual gradient is small, and the optimiser receives no strong signal to move away from the degenerate mean solution. For a pixel deep in an empty region, the same holds with output near 0. Only boundary pixels — a tiny fraction of the total — contribute meaningfully. The aggregate gradient is small in magnitude relative to the size of the parameter space, and the optimiser drifts toward the degenerate constant solution which satisfies the reconstruction loss in expectation but produces an empty mesh under any threshold.
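The scale of this effect is easy to verify numerically; a small sketch on one synthetic plane at the reported ~0.034 % occupancy:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.00034                                     # ~0.034 % occupied fraction
x = (rng.random(512 * 512) < p).astype(float)   # one flattened occupancy plane

x_hat = np.full_like(x, x.mean())               # degenerate constant prediction
mse = np.mean((x - x_hat) ** 2)                 # ≈ p(1 − p) ≈ 3.4e-4
grad = 2 * (x_hat - x)                          # per-pixel gradient of the squared error
print(mse, np.abs(grad).mean(), np.abs(grad).max())
# mean |grad| ≈ 4p(1 − p): tiny overall, large only at the rare occupied pixels
```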
In a healthy VAE, the encoder's KL term and the reconstruction term are in tension — the reconstruction term pushes q(z|x) to deviate from the prior to encode information about x, the KL term pushes q(z|x) back toward the prior. The balance is what gives VAEs their useful latent structure.
With degenerate-mean reconstruction (§4.1), the reconstruction term is satisfied even without informative posteriors. The encoder has no residual gradient to resist the KL pull. The posterior collapses to the prior — q(z|x) ≈ p(z) = 𝒩(0, I) — and the latent carries no information about the input. The "exploding KL" symptom appears paradoxically during the collapse trajectory because the model briefly fits noise in the binary input as if it were Gaussian structure, producing transient high-KL posteriors before collapsing.
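Concretely, the KL term being minimised has a closed form, and it is exactly zero once the posterior matches the prior, so nothing opposes the collapse once the reconstruction gradient has vanished:

```python
import torch

def gaussian_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Closed-form KL[ N(mu, diag(exp(logvar))) || N(0, I) ]."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

# A fully collapsed posterior (μ = 0, log σ² = 0) incurs zero KL penalty.
print(gaussian_kl(torch.zeros(8), torch.zeros(8)))   # tensor(0.)
```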
KL annealing, free-bits regularisation, posterior-collapse safeguards, and capacity scaling all address the case where the reconstruction term can resist the KL pull but the KL schedule overwhelms it temporarily during training. In the binary-input case the reconstruction term cannot resist the KL pull at any KL schedule — there is no residual gradient signal because the degenerate solution is already in the basin of attraction. The KL schedule is not the lever; the input representation is.
Two changes resolve the failure: one at the input representation, one at the training objective.
Replace the binary occupancy hexplane with a continuous-feature hexplane. Per pixel, instead of "1 if surface else 0", the input is a 4-vector: depth-to-surface-along-view-ray (sentinel value for background pixels), plus surface-normal (nx, ny, nz) at the hit point. The full input becomes x ∈ ℝ^(6 × 4 × 512 × 512).
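A sketch of assembling the 4-channel planes, assuming per-view depth, normal, and hit-mask renders are already available from the renderer (the sentinel value is an assumed convention):

```python
import numpy as np

SENTINEL = -1.0   # assumed background marker for the depth channel

def continuous_hexplane(depth: np.ndarray, normals: np.ndarray,
                        hit_mask: np.ndarray) -> np.ndarray:
    """Assemble the 4-channel continuous hexplane from per-view renders.

    depth:    (6, 512, 512) distance to the surface along each view ray
    normals:  (6, 3, 512, 512) unit surface normals at the hit points
    hit_mask: (6, 512, 512) boolean, True where the view ray hits the surface
    """
    d = np.where(hit_mask, depth, SENTINEL)            # sentinel on background pixels
    n = np.where(hit_mask[:, None], normals, 0.0)      # zero normals on background
    return np.concatenate([d[:, None], n], axis=1)     # (6, 4, 512, 512)
```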
The reconstruction term ‖x − x̂‖² on continuous-valued features has no degenerate constant optimum — the depth varies smoothly across the surface, the normals vary smoothly except at sharp edges, and the gradient signal is distributed across all pixels rather than concentrated at boundaries. The decoder receives strong gradient everywhere, and the optimiser converges to non-degenerate solutions.
With the input representation fixed, the second question is whether the variational regularisation is still useful. For the validation stage of this work — single-mesh round-trip — the answer is no. The latent does not need to be sampleable from a learned prior; it only needs to be a compressed deterministic representation of the input. Dropping the KL term and the reparameterisation step removes the second source of training pathology and simplifies the loss to pure reconstruction:
L = ‖x − D(E(x))‖²

The architecture becomes a deterministic autoencoder: the encoder E maps the continuous-feature hexplane to a deterministic latent vector, and the decoder D maps it back. Training is stable, the loss decreases monotonically, and mesh extraction works.
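The corresponding training step shrinks to a pure reconstruction update (a sketch; names are illustrative):

```python
import torch.nn.functional as F

def ae_step(encoder, decoder, x, optimizer):
    """One deterministic-autoencoder step: no sampling, no KL term."""
    x_hat = decoder(encoder(x))               # latent is a plain deterministic code
    loss = F.mse_loss(x_hat, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```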
After the pivot, training on the head mesh converges within ~100 epochs to a stable reconstruction. Marching cubes on the decoded hexplane recovers 5 719 vertices and 10 584 faces matching the input mesh's topology. The latent vector compresses the ~50 MB continuous-feature hexplane to ~1.6 MB at fp32 — a 30× compression with negligible reconstruction error in the round trip.
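Extraction itself is standard marching cubes; a sketch using scikit-image, assuming a dense scalar volume has already been reconstructed from the decoded hexplane (that reconstruction step is not shown):

```python
import numpy as np
from skimage import measure

def extract_mesh(volume: np.ndarray, level: float = 0.5):
    """Marching cubes on a dense scalar volume (0.5 for occupancy, 0.0 for an SDF)."""
    verts, faces, normals, values = measure.marching_cubes(volume, level=level)
    return verts, faces
```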
| Property | VAE on binary hexplane | AE on continuous hexplane |
|---|---|---|
| KL term behaviour | Exploded over first ~50 epochs | No KL term (deterministic AE) |
| Reconstruction loss | Decreased monotonically (toward degenerate solution) | Decreased monotonically (toward correct solution) |
| Marching-cubes extraction | 0 vertices (empty mesh) | 5 719 vertices / 10 584 faces |
| Latent compression | 1.6 MB → 0 bits of usable signal | ~50 MB → 1.6 MB ≈ 30× compression |
| Sampleable from prior | Yes (but uninformative) | No (deterministic AE) |
| Training stability | Failed all 5 attempted runs | Stable across all attempted runs |
The specific failure documented here — binary occupancy + Gaussian-prior VAE — generalises. Any neural representation that combines a Gaussian-prior continuous-latent assumption with binary geometric input will hit posterior collapse for the same structural reason. The fix is at the input representation, not at the KL schedule. The implication for thesis-line work in this lab: subsequent representations (the hierarchical part-based triplane in [4], the six-plane mesh reconstruction in [5], the SculptNet primitive-assembly in [6]) all use continuous SDF or depth features rather than binary occupancy and do not exhibit the failure mode.
For other groups working with binary geometric encodings — voxel grids, sparse occupancy, binary triplanes — the same lesson applies. Either switch to a continuous geometric feature (SDF, depth, occupancy density), or switch to a discrete-latent VAE (categorical or Bernoulli posterior, no Gaussian assumption), or drop the variational machinery in favour of a deterministic autoencoder. KL-schedule engineering will not solve the problem because the problem is not at the schedule.
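Of these alternatives, the discrete-latent route can be sketched as follows, assuming a Bernoulli posterior with a relaxed (Gumbel) sample for gradient flow and a Bernoulli(0.5) prior in place of 𝒩(0, I); this is an untested sketch, not part of the work reported here:

```python
import torch
from torch.distributions import Bernoulli, RelaxedBernoulli, kl_divergence

def discrete_latent(logits: torch.Tensor, temperature: float = 0.5):
    """Bernoulli-latent bottleneck sketch: sample with the relaxed (Gumbel)
    Bernoulli so gradients flow, and take the KL against a Bernoulli(0.5)
    prior instead of N(0, I)."""
    posterior = RelaxedBernoulli(torch.tensor(temperature), logits=logits)
    z = posterior.rsample()                                  # differentiable near-binary sample
    prior = Bernoulli(probs=torch.full_like(logits, 0.5))
    kl = kl_divergence(Bernoulli(logits=logits), prior).sum()
    return z, kl
```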
Three concrete limitations of the current work. (i) Single-mesh validation. Phase 1 validates on a single head mesh. Multi-mesh training is needed to confirm the AE compresses across instances rather than overfitting to one. (ii) Lost sampling. Dropping the VAE eliminated the failure mode but also eliminated the sampleable latent — a deterministic AE cannot generate novel meshes from 𝒩(0, I). (iii) GPU rendering. CPU-cached hexplane generation is fast enough for single-mesh validation but won't scale to dataset training; PyTorch3D and nvdiffrast GPU-rendering paths both hit dependency issues during this work and remain unresolved.
Three future-work directions follow. First, train the deterministic AE on multi-mesh datasets (PartNeXt categories) to test cross-instance compression. Second, revisit the variational extension via a discrete-latent VAE (Bernoulli or categorical posterior, matching the binary input distribution) or a learned-prior VAE (the prior matches the marginal aggregate posterior rather than 𝒩(0, I)) — both are open architectural alternatives that preserve sampleable latents without the binary-Gaussian mismatch. Third, resolve the GPU-rendering thread (CUDA-only nvdiffrast or a different rasteriser) to support dataset-scale training.
VAEs and binary occupancy hexplanes are incompatible at the distributional-assumption level. The KLD-explosion symptom that drove the diagnostic work is a consistent signature of this incompatibility and is not resolvable at the KL-schedule level. The fix is at the input representation: replace binary occupancy with continuous depth + surface-normal features, drop the variational machinery, and the failure mode disappears entirely. The work documented here is small in scope but the diagnostic is broadly applicable to any neural representation combining Gaussian-prior latents with binary geometric input.