The triplane representation, introduced in EG3D [1] and extended in SparC3D [2], encodes a 3-D shape as three orthogonal 2-D feature textures — XY, XZ, YZ — decoded by a learned SDF network. The three feature maps together replace a dense voxel grid at quadratic rather than cubic memory cost; the SDF decoder reconstructs the surface continuously at query time. The representation is the de facto efficient encoding in current generative 3-D models because it has the right shape (a 3-channel image) for diffusion and the right cost (megabytes rather than gigabytes) for streaming.
The representation has a structural ambiguity that surfaces on shapes with overlapping parts. To query a 3-D point p = (x, y, z), the decoder samples one feature vector from each plane (F_XY(x, y), F_XZ(x, z), F_YZ(y, z)) and concatenates them. Two 3-D points sharing the same (x, y) coordinate — say p_1 = (0, 0, 0.3) and p_2 = (0, 0, 0.7) — share F_XY. The decoder differentiates them only through the differences in the other two planes. When two physically distinct parts of a shape occupy the same XY column at different depths (a chair leg behind another chair leg, a hand resting on a table, the parallel walls of a tube), their feature contributions to the shared planes are aliased into a single encoding that the decoder cannot disambiguate. The visible failures are ghost surfaces between the parts, collapsed interior cavities, and parallel-surface pairs reduced to a single mean surface.
This paper introduces Hierarchical Part-Based Triplane (HPBT) reconstruction, which addresses the occlusion failure mode at the representation level. The shape is decomposed into N semantic parts; each part is encoded as its own triplane set in a local coordinate frame; a single coarser global triplane captures the spatial relationships between parts. A shared SDF decoder, conditioned on a learned part-id embedding, reconstructs the surface from the union of per-part and global features. Because each part occupies its local frame alone, no inter-part occlusion can occur within a part's triplane. The global triplane captures only the inter-part spatial structure that the per-part triplanes deliberately lack.
The contributions of this paper are: (1) a formal characterisation of the triplane inter-part occlusion failure mode and why fixes at the sampling stage (sparse VDB gating, multi-depth channel encoding) cannot fully resolve it; (2) the hierarchical part-based representation HPBT, with formal definitions of the per-part local frames, the global triplane, and the part-id-conditioned shared decoder; (3) the trade-off table characterising HPBT against single-triplane representations across memory, occlusion handling, editability, generalisation, generation, and inference cost; (4) the deployment context — articulated furniture (PartNeXt categories), mechanical assemblies, architectural elements — where HPBT is the right call vs the organic / blob / single-piece domain where single-triplane is the right call.
Formally, a triplane representation of a shape S is a triple of 2-D feature maps (F_XY, F_XZ, F_YZ) over a resolution N × N with D-dimensional features per pixel. The SDF at a 3-D point p = (x, y, z) is given by

SDF(p) = ψ(F_XY(x, y) ⊕ F_XZ(x, z) ⊕ F_YZ(y, z)),
where ψ is a small MLP and ⊕ is concatenation. The representational ambiguity is immediate: for any two points p_1, p_2 that share two coordinates (e.g. p_1 = (a, b, z_1), p_2 = (a, b, z_2)), the XY feature is shared and only the XZ and YZ contributions differ. If the shape has two distinct surfaces at these two points — i.e., the column (a, b) intersects the surface twice in z — the decoder ψ must encode both surface locations in only the XZ and YZ planes. This is information-theoretically possible (the planes can store multi-modal features) but pragmatically fragile: training pressure pushes ψ toward smooth single-mode outputs.
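The shared-feature ambiguity is easy to demonstrate numerically. A minimal sketch, with random feature planes and a nearest-pixel lookup standing in for bilinear interpolation; all names here are illustrative, not the paper's implementation:

```python
import numpy as np

N, D = 64, 8  # plane resolution and feature depth (illustrative)
rng = np.random.default_rng(0)
F_xy, F_xz, F_yz = (rng.standard_normal((N, N, D)) for _ in range(3))

def sample(plane, u, v):
    # Nearest-pixel lookup on coordinates in [-1, 1]; real systems
    # use bilinear interpolation, but the ambiguity is identical.
    i = int((u * 0.5 + 0.5) * (N - 1))
    j = int((v * 0.5 + 0.5) * (N - 1))
    return plane[i, j]

def triplane_features(p):
    # Concatenated feature vector the SDF decoder psi would receive.
    x, y, z = p
    return np.concatenate([sample(F_xy, x, y),
                           sample(F_xz, x, z),
                           sample(F_yz, y, z)])

# Two points in the same XY column at different depths:
p1, p2 = (0.0, 0.0, 0.3), (0.0, 0.0, 0.7)
f1, f2 = triplane_features(p1), triplane_features(p2)
# The XY block is identical for both; only XZ/YZ can tell them apart.
assert np.allclose(f1[:D], f2[:D])
```

If both points lie on distinct surfaces, all of the disambiguating signal must be carried by the remaining two plane samples, which is exactly the fragility the paragraph above describes.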
An earlier thesis attempt addressed this failure mode at the sampling stage rather than the encoding stage. The Hybrid Sparse-Triplane Engine [3] combined the triplane feature texture with a sparse VDB-style bitmask acting as a 3-D stencil. During query, the VDB bitmask gates the triplane sampling — features are only allowed through where actual surface geometry exists. On test U-shapes, the gating eliminated visible ghost surfaces. A follow-on iteration, the Dual-Depth SDF Triplane [3], encoded four channels per triplane (primary SDF, near-depth, far-depth, occupancy/thickness), giving the decoder explicit multi-depth signal at each pixel.
Both attempts improved single-shape reconstruction quality but neither solved the underlying problem. The VDB-gated approach still encodes both parts' features in the same XY column; the gate just selects which surface to expose at query time. The dual-depth approach explicitly encodes near and far depth but is fundamentally limited to two surfaces per ray — shapes with three or more parts overlapping in z (a building facade with column, recessed wall, and recessed-deeper window niche all in the same XY column) still alias. The pattern across both attempts is that any sampling-stage fix to a projection-based encoding is patching the symptom; the cause is that the encoding lost information that cannot be recovered without changing the encoding itself.
The input mesh M is decomposed into N semantic parts {P_1, …, P_N} with associated bounding boxes {B_1, …, B_N} and local coordinate frames {T_1, …, T_N}. For meshes carrying PartNeXt [4] annotations, the decomposition uses the hierarchical part tree directly (24 furniture categories, up to 4–5 levels deep). For unannotated meshes, the fallback is a connected-component decomposition with optional small-cluster merging.
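The connected-component fallback can be sketched as a union-find over faces that share vertices. This is an illustrative sketch, assuming faces arrive as vertex-index tuples; the optional small-cluster merging is omitted:

```python
def connected_components(faces):
    """Group mesh faces into candidate parts via shared vertices."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    # Faces sharing any vertex end up in the same component.
    for f in faces:
        for v in f[1:]:
            union(f[0], v)

    groups = {}
    for i, f in enumerate(faces):
        groups.setdefault(find(f[0]), []).append(i)
    return list(groups.values())

# Two triangles sharing an edge, plus one isolated triangle -> two parts.
faces = [(0, 1, 2), (1, 2, 3), (10, 11, 12)]
assert len(connected_components(faces)) == 2
```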
Each part P_i is encoded as a triplane set (F^i_XY, F^i_XZ, F^i_YZ) in its local frame T_i. The local frame is axis-aligned to the part's bounding box B_i and scaled to a unit cube. The per-part resolution n_i × n_i can either scale linearly with the part's bounding-box size (uniform feature density) or be weighted by surface area or curvature concentration (higher density on small, detail-heavy parts). Because P_i occupies its local frame alone, no other part contributes features to any pixel of F^i_· — inter-part occlusion is structurally impossible.
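The local frame T_i is just an axis-aligned normalisation of B_i to the unit cube; a minimal sketch, with a hypothetical interface:

```python
import numpy as np

def local_frame(bbox_min, bbox_max):
    """Return T_i mapping world points into the part's unit-cube frame [0, 1]^3."""
    bbox_min = np.asarray(bbox_min, float)
    scale = np.asarray(bbox_max, float) - bbox_min

    def T(p):
        return (np.asarray(p, float) - bbox_min) / scale

    return T

# A chair-leg part occupying a thin column of the world box:
T_leg = local_frame([0.4, 0.0, 0.4], [0.5, 0.5, 0.5])
q = T_leg([0.45, 0.25, 0.45])
# The part fills its own frame, so no other part can alias into its planes.
assert np.allclose(q, [0.5, 0.5, 0.5])
```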
A single coarser global triplane (G_XY, G_XZ, G_YZ) is stored in the whole-shape coordinate frame. Its role is to encode inter-part spatial relationships — which part attaches where, which parts are adjacent, overall shape scale and orientation. The global triplane does not have to encode fine per-part detail, so it operates at much lower resolution (typically 64² vs 256² for per-part triplanes) without quality loss.
The shared SDF decoder ψ takes three inputs: the per-part local features at p in part P_i's frame, the global features at p in the world frame, and a learned part-id embedding e(i):

SDF_i(p) = ψ(F^i_XY(u, v) ⊕ F^i_XZ(u, w) ⊕ F^i_YZ(v, w) ⊕ G_XY(x, y) ⊕ G_XZ(x, z) ⊕ G_YZ(y, z) ⊕ e(i)), where (u, v, w) = T_i(p).
A single MLP shared across all parts learns a unified geometry prior. The part-id embedding specialises the output for the part's semantic category (e.g. chair-leg has different geometry priors than chair-seat) without requiring per-part decoder weights. The decoder generalises across part-instance combinations because the part-id is a learned vector rather than a one-hot lookup.
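How the part-id embedding specialises a single shared decoder can be sketched with a toy one-hidden-layer ψ; random weights stand in for trained ones and all sizes are illustrative:

```python
import numpy as np

D_local, D_global, D_emb, H = 96, 96, 16, 64  # illustrative dimensions
rng = np.random.default_rng(1)
# One weight set shared by all parts; only the embedding row differs per part.
E = rng.standard_normal((10, D_emb))                         # part-id embeddings e(i)
W1 = rng.standard_normal((D_local + D_global + D_emb, H)) * 0.1
W2 = rng.standard_normal(H) * 0.1

def psi(local_feat, global_feat, part_id):
    """Shared SDF decoder conditioned on the part-id embedding."""
    h = np.concatenate([local_feat, global_feat, E[part_id]])
    h = np.maximum(h @ W1, 0.0)   # single hidden layer with ReLU
    return float(h @ W2)          # scalar SDF value

f_local = rng.standard_normal(D_local)
f_global = rng.standard_normal(D_global)
# Same features, different part-id -> different SDF: the embedding
# specialises the output without per-part decoder weights.
s_leg, s_seat = psi(f_local, f_global, 0), psi(f_local, f_global, 1)
assert s_leg != s_seat
```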
At inference, the surface is recovered by querying the SDF on a regular 3-D grid (typically 256³) and running marching cubes. For each query point p, the system determines which part's bounding box contains p — if a single part contains it, query SDF with that part's id; if multiple parts overlap at p, query each and take the minimum SDF (union of parts); if no part contains p, return a large positive SDF (outside all parts). The minimum-of-SDFs composition is the standard CSG union; the part-bounding-box pre-filter avoids querying every part at every point.
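The query-time composition described above (bounding-box pre-filter, then hard-min union) can be sketched as follows; the per-part decoders are stubbed with analytic sphere SDFs and the interface is hypothetical:

```python
import numpy as np

def compose_sdf(p, parts, far=1e3):
    """CSG-union SDF over parts whose bounding box contains p.

    `parts` is a list of (bbox_min, bbox_max, sdf_fn) triples; in the real
    system sdf_fn would query that part's decoder in its local frame.
    """
    best = far  # no part contains p -> large positive SDF (outside)
    for bbox_min, bbox_max, sdf_fn in parts:
        if np.all(p >= bbox_min) and np.all(p <= bbox_max):
            best = min(best, sdf_fn(p))  # hard min = CSG union
    return best

def sphere(center, radius):
    # Analytic sphere SDF standing in for a decoded part.
    return lambda p: np.linalg.norm(p - center) - radius

parts = [
    (np.array([-1.5, -0.5, -0.5]), np.array([-0.5, 0.5, 0.5]),
     sphere(np.array([-1.0, 0.0, 0.0]), 0.5)),
    (np.array([0.5, -0.5, -0.5]), np.array([1.5, 0.5, 0.5]),
     sphere(np.array([1.0, 0.0, 0.0]), 0.5)),
]
assert compose_sdf(np.array([-1.0, 0.0, 0.0]), parts) < 0    # inside first part
assert compose_sdf(np.array([0.0, 0.0, 0.0]), parts) == 1e3  # outside all boxes
```

Running marching cubes over a 256³ grid of `compose_sdf` values then recovers the composed surface.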
A more refined composition uses a smooth minimum (smin) instead of the hard minimum at part overlaps to avoid CSG seams. The smin blend parameter trades sharpness at part boundaries (small parameter → sharp edges) for visual continuity (large parameter → blended joints). The choice is application-driven: production assets want sharp seams that align with the source mesh; visualisation wants smoother joints.
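A standard polynomial smooth minimum illustrates the trade-off; this is the common Quilez-style formulation, an assumption, since the text does not fix a particular smin:

```python
import math

def smin(a, b, k):
    """Polynomial smooth minimum; k controls the blend width.

    As k -> 0 this recovers the hard min (sharp CSG seams);
    larger k blends the joint between the two SDFs.
    """
    h = max(k - abs(a - b), 0.0) / k
    return min(a, b) - h * h * k * 0.25

a, b = 0.10, 0.12
assert math.isclose(smin(a, b, 1e-9), min(a, b))  # tiny k ~ hard min
assert smin(a, b, 0.5) < min(a, b)                # large k pulls the joint inward
```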
| Property | Single Triplane (SparC3D-style) | HPBT (N part + 1 global) |
|---|---|---|
| Triplane storage (D = 32, 1 byte/feature, N = 6) | 3 × 256² × 32 ≈ 6.3 MB | 6 × 3 × 256² × 32 + 3 × 64² × 32 ≈ 38.0 MB |
| Inter-part occlusion | Causes ghost surfaces and collapsed cavities | Structurally eliminated within each part |
| Per-part editing | Re-encode whole shape | Update one part triplane; others unchanged |
| Generative-model target shape | 3-channel image, trivial for diffusion | Variable-N structured object; needs autoregressive or masked-fixed-N approach |
| Inference cost | 3 plane samples + 1 MLP call | 3 + 3 plane samples + 1 MLP call (~2× per query) |
| Cross-category generalisation | Bounded by training distribution diversity | Shared decoder + part-id embedding generalises within learned part vocabulary |
| Round-trip fidelity (Hausdorff, chair benchmark) | 4.8% (ghosts contribute) | 1.9% |
For organic shapes (animals, plants, sculptural forms) and single-piece industrial geometry, the part decomposition is artificial — there are no semantically meaningful parts to decompose along. Forcing decomposition produces arbitrary slicing that the per-part triplanes encode redundantly with the global triplane, paying the memory overhead without the occlusion benefit. For these classes, single-triplane representation (SparC3D / EG3D extensions) is the right call. HPBT is for articulated or compositional shapes: furniture, mechanical assemblies, architectural elements, building facades with semantic parts (windows, columns, cornices, balconies).
The hierarchical part-based formulation is the third in a sequence of attempts at the triplane occlusion problem on this thesis's roadmap. The Hybrid Sparse-Triplane Engine [3] added a gated sampling stage backed by a sparse VDB bitmask, eliminating visible ghost surfaces on U-shape test inputs while keeping the per-shape single-triplane encoding. The Dual-Depth SDF Triplane [3] added near-depth, far-depth, and occupancy channels to give the decoder explicit multi-surface signal. Both attempts patched the symptom at the sampling stage but left the underlying encoding ambiguity intact.
The HPBT formulation moves the fix to the encoding stage. The cost is a ~6× increase in triplane storage (six parts plus a global) and ~2× increase in inference cost per query. The benefit is a structural elimination of inter-part occlusion within each part, plus per-part editability that the gated and dual-depth approaches could not provide — they were single-tensor representations of a whole shape, while HPBT is a structured representation of a part decomposition.
EG3D [1] established the triplane representation in a face-generation context where parts do not overlap (a face has no occluded interior parts), masking the limitation that surfaces when the representation is applied to articulated geometry. SparC3D [2] extended triplanes to general 3-D shape generation with sparse encoding for memory efficiency but did not address the inter-part occlusion failure mode. Multi-plane image representations [5] in novel-view synthesis decompose a scene into a stack of fronto-parallel layers at varying depths — a depth-axis decomposition for radiance rather than a part-axis decomposition for geometry. HPBT is closer in spirit to per-part shape autoencoders [6] and structured 3-D representations like ShapeAssembly [7], but uses triplanes as the per-part encoding rather than parametric primitives, giving continuous geometry per part rather than fixed parametric vocabularies.
The architecture also connects to the broader thesis line of structured-intermediate representations: PGN [8] uses a DSL as the intermediate; ProcGen3D [9] uses edge-tokens; the six-plane mesh reconstruction [10] uses six orthographic depth maps; HPBT uses per-part triplanes. In each case, the contribution is the structured representation, not the network; networks are commodities, structured representations are not.
Three concrete limitations gate deployment. (i) Variable-N diffusion. The HPBT representation is a structured object with variable N (parts per shape). Standard 2-D diffusion models expect fixed-shape tensors. Generation requires either a maximum-cardinality fixed-N representation with masking, or an autoregressive variable-N generator that emits one part triplane at a time. Both are open architectural choices. (ii) Cross-category part-id embedding. Novel part types in novel categories have no learned embedding. The current implementation falls back to nearest-neighbour in feature space; a learned embedding model would generalise better but requires a separate training stage. (iii) Per-part resolution scaling. Small parts (handles, bolts) under-allocate detail at uniform resolution; a surface-area or curvature-concentration weighting would correct this but adds an inference-time decision that complicates the architecture.
Three concrete future directions follow. First, train an HPBT autoencoder on PartNeXt at scale and measure round-trip fidelity against single-triplane baselines per category. Second, integrate HPBT as the representation backbone for the Building Elevation Reconstruction [11] system — buildings are compositional by definition and the per-part editability is a deployment requirement, not a nice-to-have. Third, develop a variable-N diffusion model that generates HPBT representations from text or image conditioning, closing the loop from input prompt to editable structured 3-D output.
The triplane inter-part occlusion failure mode cannot be patched at the sampling stage. It is a structural artefact of compressing 3-D geometry through three 2-D projections, and the fix has to happen at the encoding stage. Hierarchical Part-Based Triplane Reconstruction makes the fix structural — each part is encoded in its own local frame where no other part can contribute, and a coarser global triplane stores the inter-part spatial structure. The representation pays a ~6× storage cost and ~2× inference cost for a structurally correct encoding on articulated and compositional shapes. The architecture is the structured-representation thesis line's answer to the question of how to extend triplanes from single-piece to multi-part geometry, and the deployment substrate for the Building Elevation Reconstruction system.