By January 2026 the thesis line had three competing 3-D representations on the table: SDF + marching cubes (the Hexplane AE setup, the Six-Plane Mesh extraction), Gaussian splats (the late-2025 explosion of single-image-to-3-D papers, including GS-LRM, Splatter Image, and Gamba), and triplanes (EG3D [5], InstantMesh, TRELLIS [6]). The pick-a-default exercise was load-bearing: every downstream generator would emit the chosen representation as its native output, so the decision needed a written rationale.
This paper records the rationale. The decision: triplane as the universal intermediate. The reasoning is in §3; the prerequisite is a clean description of triplane mechanics (§2), because the decision turns on properties that the NeRF / G-Splat / VDB alternatives lack.
A triplane representation stores a 3-D scene as three axis-aligned 2-D feature planes F_xy ∈ ℝ^{H × W × C}, F_xz ∈ ℝ^{H × W × C}, F_yz ∈ ℝ^{H × W × C}. Typical values: H = W = 256 (the plane resolution), C = 32 (the per-pixel feature dimension). Total storage: 3 × 256 × 256 × 32 × 4 bytes (fp32) ≈ 25 MB; with fp16 ≈ 12 MB; with fp16 and C = 16 ≈ 6 MB. The 6–12 MB range cited in the abstract reflects the practical trade-off between resolution and feature richness.
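The storage arithmetic above can be checked directly. A minimal sketch (the helper name `triplane_bytes` is illustrative, not from any library):

```python
# Storage arithmetic for a triplane: three H x W x C feature planes.
def triplane_bytes(h=256, w=256, c=32, bytes_per_float=4):
    """Raw storage in bytes for three H x W x C planes."""
    return 3 * h * w * c * bytes_per_float

MB = 1e6
print(triplane_bytes() / MB)                         # fp32, C=32 -> 25.165824
print(triplane_bytes(bytes_per_float=2) / MB)        # fp16, C=32 -> 12.582912
print(triplane_bytes(c=16, bytes_per_float=2) / MB)  # fp16, C=16 -> 6.291456
```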
To query the scene at a 3-D point p = (x, y, z) with each coordinate normalised to [0, 1]:
F = F_xy(x, y) ⊕ F_xz(x, z) ⊕ F_yz(y, z)

where each F_*(·, ·) is a bilinear sample on the respective plane and ⊕ denotes the aggregation operation. Three aggregation choices are common: sum (cheapest, default for EG3D), concat (3× larger output dim, used by TRELLIS), product (Hadamard, occasionally used for sharpness). The aggregated feature vector F ∈ ℝ^C (or ℝ^{3C} for concat) is passed through a small MLP — typically 2–3 layers, ReLU activations, hidden dim 64 — to produce a per-point density σ ∈ ℝ_+ and colour c ∈ ℝ³.
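As a concrete sketch of the aggregation options and decoder head (NumPy, with random placeholder weights; no trained model is implied, and the shapes simply match the C = 32, hidden-dim-64 description above):

```python
import numpy as np

C = 32
rng = np.random.default_rng(0)
# Stand-ins for the three bilinearly sampled per-plane feature vectors.
f_xy, f_xz, f_yz = rng.random((3, C))

f_sum    = f_xy + f_xz + f_yz                   # EG3D default: output dim C
f_concat = np.concatenate([f_xy, f_xz, f_yz])   # TRELLIS-style: output dim 3C
f_prod   = f_xy * f_xz * f_yz                   # Hadamard: output dim C

# Minimal 2-layer decoder head (hidden dim 64) mapping the aggregated
# feature to density sigma and colour rgb; weights are random placeholders.
W1, b1 = rng.normal(size=(64, C)), np.zeros(64)
W2, b2 = rng.normal(size=(4, 64)), np.zeros(4)
h = np.maximum(W1 @ f_sum + b1, 0.0)            # ReLU hidden layer
out = W2 @ h + b2
sigma = np.maximum(out[0], 0.0)                 # density constrained to R_+
rgb = 1.0 / (1.0 + np.exp(-out[1:]))            # colour squashed into (0, 1)
```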
Bilinear sampling at continuous coordinate (u, v) on a discrete plane uses the four-corner-pixel weighted average:
F(u, v) = (1−a)(1−b) · F[⌊u⌋, ⌊v⌋] + a(1−b) · F[⌈u⌉, ⌊v⌋] + (1−a)b · F[⌊u⌋, ⌈v⌉] + ab · F[⌈u⌉, ⌈v⌉]

where a = u − ⌊u⌋ and b = v − ⌊v⌋. The bilinear sample is differentiable in both (u, v) and F, which is what makes triplanes trainable end-to-end via gradient descent.
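The four-corner lookup translates directly to code. A self-contained NumPy sketch (`bilinear_sample` is an illustrative helper; coordinates are in pixel units and border clamping is one reasonable convention, not prescribed by the text above):

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample an (H, W, C) feature plane at continuous (u, v).

    u indexes rows (axis 0), v indexes columns (axis 1), in pixel units.
    """
    h, w, _ = plane.shape
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, h - 1), min(v0 + 1, w - 1)  # clamp at the border
    a, b = u - u0, v - v0
    return ((1 - a) * (1 - b) * plane[u0, v0]
            + a * (1 - b) * plane[u1, v0]
            + (1 - a) * b * plane[u0, v1]
            + a * b * plane[u1, v1])

plane = np.arange(4.0).reshape(2, 2, 1)   # features 0..3 on a 2x2 plane
print(bilinear_sample(plane, 0.5, 0.5))   # [1.5]: mean of the four corners
```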
For each pixel in the camera view, cast a ray r(t) = o + t · d from the camera origin o through the pixel along direction d. Sample N points along the ray (typically N = 64–128 at strided t values t_1, …, t_N). For each sample point query the triplane for (σ_i, c_i). Volume-render via front-to-back alpha compositing:
α_i = 1 − exp(−σ_i · Δt_i)
T_i = ∏_{j < i} (1 − α_j)  (transmittance up to sample i)
C = ∑_i T_i · α_i · c_i  (the final pixel colour)

where Δt_i = t_{i+1} − t_i is the inter-sample interval. The same operation, computed per pixel, gives the final rendered image. Total cost: image_pixels × N × (3 bilinear samples + small MLP + alpha-composite step). At 256² output and N = 64 the cost is approximately 30–80 ms on a consumer GPU.
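The three compositing equations fit in a few NumPy lines. A sketch for a single ray (`composite` is an illustrative name; the densities below are hand-picked, not from a real scene):

```python
import numpy as np

def composite(sigmas, colours, ts):
    """Front-to-back alpha compositing along one ray.

    sigmas: (N,) densities; colours: (N, 3) RGB; ts: (N+1,) sample depths,
    so the last entry supplies the final interval Delta t_N.
    """
    dts = np.diff(ts)                            # Delta t_i = t_{i+1} - t_i
    alphas = 1.0 - np.exp(-sigmas * dts)         # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))  # T_i
    return (trans * alphas) @ colours            # C = sum_i T_i alpha_i c_i

# A dense red sample in front of a green one: the front sample occludes.
sig = np.array([50.0, 50.0])
col = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
ts = np.array([0.0, 0.5, 1.0])
print(composite(sig, col, ts))   # ~[1, 0, 0]
```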
When the consumer wants an explicit triangle mesh — for Houdini integration, for export to CAD tools — query the triplane at a dense 3-D grid of points (typically 256³ ≈ 16.8 M queries) and produce a density field. Run marching cubes on the field at the isosurface threshold (a small σ value chosen to mark the mesh surface). The marching-cubes step is not part of the rendering loop — it is a one-shot per-scene operation costing roughly 50–200 ms extra. A common misconception is that triplane rendering rasterises an extracted mesh; in fact rendering volume-renders the triplane directly, and mesh extraction is a separate export path.
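The extraction path can be sketched end to end. Here `query_density` is a hard-coded toy sphere standing in for the real triplane-lookup-plus-MLP decode, and the grid is 64³ rather than 256³ to keep the sketch cheap; the marching-cubes call itself (e.g. `skimage.measure.marching_cubes`) is left as a comment:

```python
import numpy as np

def query_density(pts):
    """Toy density: positive inside a sphere of radius 0.3 centred at 0.5.

    Stand-in for the real per-point triplane lookup + MLP decode.
    """
    return np.maximum(0.3 - np.linalg.norm(pts - 0.5, axis=-1), 0.0)

res = 64                               # 256 in practice: ~16.8 M queries
axis = (np.arange(res) + 0.5) / res    # cell-centred samples in [0, 1]
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
field = query_density(grid.reshape(-1, 3)).reshape(res, res, res)

# One-shot extraction, outside the rendering loop; with scikit-image:
#   verts, faces, _, _ = skimage.measure.marching_cubes(field, level=1e-3)
print(field.shape, float(field.max()))  # (64, 64, 64) and a positive peak
```

The occupied fraction of the field matches the analytic sphere volume (4/3)π · 0.3³ ≈ 0.113, which is a cheap sanity check on the grid construction.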
| Property | Triplane (chosen) | G-Splats | VDB / FVDB |
|---|---|---|---|
| Storage (typical scene) | 6–12 MB | 50–500 MB | 10–80 MB |
| Render time (256² image) | 30–80 ms | 5–15 ms | 100–300 ms |
| NeRF speed-up | 10–100 × | ~10 × | ~5 × |
| Editability | Edit 2-D feature planes directly | Move Gaussians (awkward) | Houdini-native (best for procedural) |
| Procedural-pipeline composability | Triplane → marching-cubes → mesh → Houdini (1 step) | Splat-to-mesh required (lossy) | Native |
| Differentiability | Yes (bilinear lookup + MLP) | Yes (standard) | Yes (FVDB) |
| Photo-realism | Mid-high | Highest | Mid (depends on shader) |
Triplane wins the storage column, the editability column, and is competitive on render time. G-Splats win raw render speed and photo-realism; VDB wins procedural composability. The decision turns on a thesis-line-specific consideration: the universal intermediate is converted to G-Splat or to VDB / mesh as a one-time step per scene, not per frame. So the per-frame render-time advantage of G-Splats and the Houdini-native advantage of VDB both get recovered through conversion. The storage and editability advantages of triplane are not recoverable through conversion — those are properties of the storage format itself.
Four downstream thesis-line topics are direct consumers of the triplane decision.
| Topic | Use of triplane |
|---|---|
| Hierarchical Triplane [1] | Native target — per-part triplanes in local frames + global triplane for spatial context |
| Hexplane Autoencoder [2] | Six-plane generalisation (triplane × 2 signs); same lookup mechanics |
| MambaFlow3D [3] | Generates triplane-family tokens (SparseCubes in Phase 3); same retrieval semantics |
| SculptNet [4] | Primitive outputs are rasterised into triplanes for differentiable refinement |
Triplane representations are the universal intermediate for the thesis line. The decision turns on storage size (6–12 MB), per-plane editability, and one-time convertibility to G-Splat or VDB / mesh for downstream consumption. Mesh extraction is a separate downstream step, not part of the rendering loop — a misconception worth pinning. The architectural consequences propagate to four downstream thesis-line topics.