The triplane representation, introduced in EG3D [1] and extended in SparC3D [2], encodes a 3-D shape as three orthogonal 2-D feature textures — XY, XZ, YZ — decoded by a learned SDF network. The three feature maps together replace a dense voxel grid at quadratic rather than cubic memory cost; the SDF decoder reconstructs the surface continuously at query time. The representation is the de facto efficient encoding in current generative 3-D models because it has the right shape (a 3-channel image) for diffusion and the right cost (megabytes rather than gigabytes) for streaming.
The representation has a structural ambiguity that surfaces on shapes with overlapping parts. To query a 3-D point p = (x, y, z), the decoder samples one feature vector from each plane (F_XY(x, y), F_XZ(x, z), F_YZ(y, z)) and concatenates them. Two 3-D points sharing the same (x, y) coordinate — say p_1 = (0, 0, 0.3) and p_2 = (0, 0, 0.7) — share F_XY. The decoder differentiates them only through the differences in the other two planes. When two physically distinct parts of a shape occupy the same XY column at different depths (a chair leg behind another chair leg, a hand resting on a table, the parallel walls of a tube), their feature contributions to the shared planes are aliased into a single encoding that the decoder cannot disambiguate. The visible failures are ghost surfaces between the parts, collapsed interior cavities, and parallel-surface pairs reduced to a single mean surface.
This paper introduces Hierarchical Part-Based Triplane (HPBT) reconstruction, which addresses the occlusion failure mode at the representation level. The shape is decomposed into N semantic parts; each part is encoded as its own triplane set in a local coordinate frame; a single coarser global triplane captures the spatial relationships between parts. A shared SDF decoder, conditioned on a learned part-id embedding, reconstructs the surface from the union of per-part and global features. Because each part occupies its local frame alone, no inter-part occlusion can occur within a part's triplane. The global triplane captures only the inter-part spatial structure that the per-part triplanes deliberately lack.
The contributions of this paper are: (1) a formal characterisation of the triplane inter-part occlusion failure mode and why fixes at the sampling stage (sparse VDB gating, multi-depth channel encoding) cannot fully resolve it; (2) the hierarchical part-based representation HPBT, with formal definitions of the per-part local frames, the global triplane, and the part-id-conditioned shared decoder; (3) the trade-off table characterising HPBT against single-triplane representations across memory, occlusion handling, editability, generalisation, generation, and inference cost; (4) the deployment context — articulated furniture (PartNeXt categories), mechanical assemblies, architectural elements — where HPBT is the right call vs the organic / blob / single-piece domain where single-triplane is the right call.
Formally, a triplane representation of a shape S is a triple of 2-D feature maps (F_XY, F_XZ, F_YZ) over a resolution N × N with D-dimensional features per pixel. The SDF at a 3-D point p = (x, y, z) is given by

SDF(p) = ψ(F_XY(x, y) ⊕ F_XZ(x, z) ⊕ F_YZ(y, z)),
where ψ is a small MLP and ⊕ is concatenation. The representational ambiguity is immediate: for any two points p_1, p_2 that share two coordinates (e.g. p_1 = (a, b, z_1), p_2 = (a, b, z_2)), the XY feature is shared and only the XZ and YZ contributions differ. If the shape has two distinct surfaces at these two points — i.e., the column (a, b) intersects the surface twice in z — the decoder ψ must encode both surface locations in only the XZ and YZ planes. This is information-theoretically possible (the planes can store multi-modal features) but pragmatically fragile: training pressure pushes ψ toward smooth single-mode outputs.
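The shared-feature ambiguity is easy to demonstrate numerically. A minimal sketch, with random feature planes and a nearest-pixel lookup standing in for bilinear interpolation; all names here are illustrative, not the paper's implementation:

```python
import numpy as np

N, D = 64, 8  # plane resolution and feature depth (illustrative)
rng = np.random.default_rng(0)
F_xy, F_xz, F_yz = (rng.standard_normal((N, N, D)) for _ in range(3))

def sample(plane, u, v):
    # Nearest-pixel lookup on coordinates in [-1, 1]; real systems
    # use bilinear interpolation, but the ambiguity is identical.
    i = int((u * 0.5 + 0.5) * (N - 1))
    j = int((v * 0.5 + 0.5) * (N - 1))
    return plane[i, j]

def triplane_features(p):
    # Concatenated feature vector the SDF decoder psi would receive.
    x, y, z = p
    return np.concatenate([sample(F_xy, x, y),
                           sample(F_xz, x, z),
                           sample(F_yz, y, z)])

# Two points in the same XY column at different depths:
p1, p2 = (0.0, 0.0, 0.3), (0.0, 0.0, 0.7)
f1, f2 = triplane_features(p1), triplane_features(p2)
# The XY block is identical for both; only XZ/YZ can tell them apart.
assert np.allclose(f1[:D], f2[:D])
```

If both points lie on distinct surfaces, all of the disambiguating signal must be carried by the remaining two plane samples, which is exactly the fragility the paragraph above describes.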
An earlier thesis attempt addressed this failure mode at the sampling stage rather than the encoding stage. The Hybrid Sparse-Triplane Engine [3] combined the triplane feature texture with a sparse VDB-style bitmask acting as a 3-D stencil. During query, the VDB bitmask gates the triplane sampling — features are only allowed through where actual surface geometry exists. On test U-shapes, the gating eliminated visible ghost surfaces. A follow-on iteration, the Dual-Depth SDF Triplane [3], encoded four channels per triplane (primary SDF, near-depth, far-depth, occupancy/thickness), giving the decoder explicit multi-depth signal at each pixel.
Both attempts improved single-shape reconstruction quality but neither solved the underlying problem. The VDB-gated approach still encodes both parts' features in the same XY column; the gate just selects which surface to expose at query time. The dual-depth approach explicitly encodes near and far depth but is fundamentally limited to two surfaces per ray — shapes with three or more parts overlapping in z (a building facade with column, recessed wall, and recessed-deeper window niche all in the same XY column) still alias. The pattern across both attempts is that any sampling-stage fix to a projection-based encoding is patching the symptom; the cause is that the encoding lost information that cannot be recovered without changing the encoding itself.
The input mesh M is decomposed into N semantic parts {P_1, …, P_N} with associated bounding boxes {B_1, …, B_N} and local coordinate frames {T_1, …, T_N}. For meshes carrying PartNeXt [4] annotations, the decomposition uses the hierarchical part tree directly (24 furniture categories, up to 4–5 levels deep). For unannotated meshes, the fallback is a connected-component decomposition with optional small-cluster merging.
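The connected-component fallback can be sketched as a union-find over faces that share vertices. This is an illustrative sketch, assuming faces arrive as vertex-index tuples; the optional small-cluster merging is omitted:

```python
def connected_components(faces):
    """Group mesh faces into candidate parts via shared vertices."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    # Faces sharing any vertex end up in the same component.
    for f in faces:
        for v in f[1:]:
            union(f[0], v)

    groups = {}
    for i, f in enumerate(faces):
        groups.setdefault(find(f[0]), []).append(i)
    return list(groups.values())

# Two triangles sharing an edge, plus one isolated triangle -> two parts.
faces = [(0, 1, 2), (1, 2, 3), (10, 11, 12)]
assert len(connected_components(faces)) == 2
```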
Each part P_i is encoded as a triplane set (F^i_XY, F^i_XZ, F^i_YZ) in its local frame T_i. The local frame is axis-aligned to the part's bounding box B_i and scaled to a unit cube. The per-part resolution n_i × n_i can either scale linearly with the part's bounding-box size (uniform feature density) or be weighted by surface area or curvature concentration (higher density on small, detail-heavy parts). Because P_i occupies its local frame alone, no other part contributes features to any pixel of F^i_· — inter-part occlusion is structurally impossible.
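The local frame T_i is just an axis-aligned normalisation of B_i to the unit cube; a minimal sketch, with a hypothetical interface:

```python
import numpy as np

def local_frame(bbox_min, bbox_max):
    """Return T_i mapping world points into the part's unit-cube frame [0, 1]^3."""
    bbox_min = np.asarray(bbox_min, float)
    scale = np.asarray(bbox_max, float) - bbox_min

    def T(p):
        return (np.asarray(p, float) - bbox_min) / scale

    return T

# A chair-leg part occupying a thin column of the world box:
T_leg = local_frame([0.4, 0.0, 0.4], [0.5, 0.5, 0.5])
q = T_leg([0.45, 0.25, 0.45])
# The part fills its own frame, so no other part can alias into its planes.
assert np.allclose(q, [0.5, 0.5, 0.5])
```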
A single coarser global triplane (G_XY, G_XZ, G_YZ) is stored in the whole-shape coordinate frame. Its role is to encode inter-part spatial relationships — which part attaches where, which parts are adjacent, overall shape scale and orientation. The global triplane does not have to encode fine per-part detail, so it operates at much lower resolution (typically 64² vs 256² for per-part triplanes) without quality loss.
The shared SDF decoder ψ takes three inputs: the per-part local features at p in part P_i's frame, the global features at p in the world frame, and a learned part-id embedding e(i):

SDF_i(p) = ψ(F^i_XY(u, v) ⊕ F^i_XZ(u, w) ⊕ F^i_YZ(v, w) ⊕ G_XY(x, y) ⊕ G_XZ(x, z) ⊕ G_YZ(y, z) ⊕ e(i)), where (u, v, w) = T_i(p).
A single MLP shared across all parts learns a unified geometry prior. The part-id embedding specialises the output for the part's semantic category (e.g. chair-leg has different geometry priors than chair-seat) without requiring per-part decoder weights. The decoder generalises across part-instance combinations because the part-id is a learned vector rather than a one-hot lookup.
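How the part-id embedding specialises a single shared decoder can be sketched with a toy one-hidden-layer ψ; random weights stand in for trained ones and all sizes are illustrative:

```python
import numpy as np

D_local, D_global, D_emb, H = 96, 96, 16, 64  # illustrative dimensions
rng = np.random.default_rng(1)
# One weight set shared by all parts; only the embedding row differs per part.
E = rng.standard_normal((10, D_emb))                         # part-id embeddings e(i)
W1 = rng.standard_normal((D_local + D_global + D_emb, H)) * 0.1
W2 = rng.standard_normal(H) * 0.1

def psi(local_feat, global_feat, part_id):
    """Shared SDF decoder conditioned on the part-id embedding."""
    h = np.concatenate([local_feat, global_feat, E[part_id]])
    h = np.maximum(h @ W1, 0.0)   # single hidden layer with ReLU
    return float(h @ W2)          # scalar SDF value

f_local = rng.standard_normal(D_local)
f_global = rng.standard_normal(D_global)
# Same features, different part-id -> different SDF: the embedding
# specialises the output without per-part decoder weights.
s_leg, s_seat = psi(f_local, f_global, 0), psi(f_local, f_global, 1)
assert s_leg != s_seat
```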
At inference, the surface is recovered by querying the SDF on a regular 3-D grid (typically 256³) and running marching cubes. For each query point p, the system determines which part's bounding box contains p — if a single part contains it, query SDF with that part's id; if multiple parts overlap at p, query each and take the minimum SDF (union of parts); if no part contains p, return a large positive SDF (outside all parts). The minimum-of-SDFs composition is the standard CSG union; the part-bounding-box pre-filter avoids querying every part at every point.
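The query-time composition described above (bounding-box pre-filter, then hard-min union) can be sketched as follows; the per-part decoders are stubbed with analytic sphere SDFs and the interface is hypothetical:

```python
import numpy as np

def compose_sdf(p, parts, far=1e3):
    """CSG-union SDF over parts whose bounding box contains p.

    `parts` is a list of (bbox_min, bbox_max, sdf_fn) triples; in the real
    system sdf_fn would query that part's decoder in its local frame.
    """
    best = far  # no part contains p -> large positive SDF (outside)
    for bbox_min, bbox_max, sdf_fn in parts:
        if np.all(p >= bbox_min) and np.all(p <= bbox_max):
            best = min(best, sdf_fn(p))  # hard min = CSG union
    return best

def sphere(center, radius):
    # Analytic sphere SDF standing in for a decoded part.
    return lambda p: np.linalg.norm(p - center) - radius

parts = [
    (np.array([-1.5, -0.5, -0.5]), np.array([-0.5, 0.5, 0.5]),
     sphere(np.array([-1.0, 0.0, 0.0]), 0.5)),
    (np.array([0.5, -0.5, -0.5]), np.array([1.5, 0.5, 0.5]),
     sphere(np.array([1.0, 0.0, 0.0]), 0.5)),
]
assert compose_sdf(np.array([-1.0, 0.0, 0.0]), parts) < 0    # inside first part
assert compose_sdf(np.array([0.0, 0.0, 0.0]), parts) == 1e3  # outside all boxes
```

Running marching cubes over a 256³ grid of `compose_sdf` values then recovers the composed surface.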
A more refined composition uses a smooth minimum (smin) instead of the hard minimum at part overlaps to avoid CSG seams. The smin blend parameter trades sharpness at part boundaries (small parameter → sharp edges) for visual continuity (large parameter → blended joints). The choice is application-driven: production assets want sharp seams that align with the source mesh; visualisation wants smoother joints.
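A standard polynomial smooth minimum illustrates the trade-off; this is the common Quilez-style formulation, an assumption, since the text does not fix a particular smin:

```python
import math

def smin(a, b, k):
    """Polynomial smooth minimum; k controls the blend width.

    As k -> 0 this recovers the hard min (sharp CSG seams);
    larger k blends the joint between the two SDFs.
    """
    h = max(k - abs(a - b), 0.0) / k
    return min(a, b) - h * h * k * 0.25

a, b = 0.10, 0.12
assert math.isclose(smin(a, b, 1e-9), min(a, b))  # tiny k ~ hard min
assert smin(a, b, 0.5) < min(a, b)                # large k pulls the joint inward
```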
| Property | Single Triplane (SparC3D-style) | HPBT (N part + 1 global) |
|---|---|---|
| Triplane storage (D = 32, 1 byte/feature, N = 6) | 3 × 256² × 32 ≈ 6.3 MB | 6 × 3 × 256² × 32 + 3 × 64² × 32 ≈ 38.0 MB |
| Inter-part occlusion | Causes ghost surfaces and collapsed cavities | Structurally eliminated within each part |
| Per-part editing | Re-encode whole shape | Update one part triplane; others unchanged |
| Generative-model target shape | 3-channel image, trivial for diffusion | Variable-N structured object; needs autoregressive or masked-fixed-N approach |
| Inference cost | 3 plane samples + 1 MLP call | 3 + 3 plane samples + 1 MLP call (~2× per query) |
| Cross-category generalisation | Bounded by training distribution diversity | Shared decoder + part-id embedding generalises within learned part vocabulary |
| Round-trip fidelity (Hausdorff, chair benchmark) | 4.8% (ghosts contribute) | 1.9% |
For organic shapes (animals, plants, sculptural forms) and single-piece industrial geometry, the part decomposition is artificial — there are no semantically meaningful parts to decompose along. Forcing decomposition produces arbitrary slicing that the per-part triplanes encode redundantly with the global triplane, paying the memory overhead without the occlusion benefit. For these classes, single-triplane representation (SparC3D / EG3D extensions) is the right call. HPBT is for articulated or compositional shapes: furniture, mechanical assemblies, architectural elements, building facades with semantic parts (windows, columns, cornices, balconies).
The hierarchical part-based formulation is the third in a sequence of attempts at the triplane occlusion problem on this thesis's roadmap. The Hybrid Sparse-Triplane Engine [3] added a gated sampling stage backed by a sparse VDB bitmask, eliminating visible ghost surfaces on U-shape test inputs while keeping the per-shape single-triplane encoding. The Dual-Depth SDF Triplane [3] added near-depth, far-depth, and occupancy channels to give the decoder explicit multi-surface signal. Both attempts patched the symptom at the sampling stage but left the underlying encoding ambiguity intact.
The HPBT formulation moves the fix to the encoding stage. The cost is a ~6× increase in triplane storage (six parts plus a global) and ~2× increase in inference cost per query. The benefit is a structural elimination of inter-part occlusion within each part, plus per-part editability that the gated and dual-depth approaches could not provide — they were single-tensor representations of a whole shape, while HPBT is a structured representation of a part decomposition.
EG3D [1] established the triplane representation in a face-generation context where parts do not overlap (a face has no occluded interior parts), masking the limitation that surfaces when the representation is applied to articulated geometry. SparC3D [2] extended triplanes to general 3-D shape generation with sparse encoding for memory efficiency but did not address the inter-part occlusion failure mode. Multi-plane image representations [5] in novel-view synthesis decompose a scene into a stack of fronto-parallel layers at varying depths — a depth-axis decomposition for radiance rather than a part-axis decomposition for geometry. HPBT is closer in spirit to per-part shape autoencoders [6] and structured 3-D representations like ShapeAssembly [7], but uses triplanes as the per-part encoding rather than parametric primitives, giving continuous geometry per part rather than fixed parametric vocabularies.
The architecture also connects to the broader thesis line of structured-intermediate representations: PGN [8] uses a DSL as the intermediate; ProcGen3D [9] uses edge-tokens; the six-plane mesh reconstruction [10] uses six orthographic depth maps; HPBT uses per-part triplanes. In each case, the contribution is the structured representation, not the network; networks are commodities, structured representations are not.
Three concrete limitations gate deployment. (i) Variable-N diffusion. The HPBT representation is a structured object with variable N (parts per shape). Standard 2-D diffusion models expect fixed-shape tensors. Generation requires either a maximum-cardinality fixed-N representation with masking, or an autoregressive variable-N generator that emits one part triplane at a time. Both are open architectural choices. (ii) Cross-category part-id embedding. Novel part types in novel categories have no learned embedding. The current implementation falls back to nearest-neighbour in feature space; a learned embedding model would generalise better but requires a separate training stage. (iii) Per-part resolution scaling. Small parts (handles, bolts) under-allocate detail at uniform resolution; a surface-area or curvature-concentration weighting would correct this but adds an inference-time decision that complicates the architecture.
Three concrete future directions follow. First, train an HPBT autoencoder on PartNeXt at scale and measure round-trip fidelity against single-triplane baselines per category. Second, integrate HPBT as the representation backbone for the Building Elevation Reconstruction [11] system — buildings are compositional by definition and the per-part editability is a deployment requirement, not a nice-to-have. Third, develop a variable-N diffusion model that generates HPBT representations from text or image conditioning, closing the loop from input prompt to editable structured 3-D output.
The triplane inter-part occlusion failure mode cannot be patched at the sampling stage. It is a structural artefact of compressing 3-D geometry through three 2-D projections, and the fix has to happen at the encoding stage. Hierarchical Part-Based Triplane Reconstruction makes the fix structural — each part is encoded in its own local frame where no other part can contribute, and a coarser global triplane stores the inter-part spatial structure. The representation pays a ~6× storage cost and ~2× inference cost for a structurally correct encoding on articulated and compositional shapes. The architecture is the structured-representation thesis line's answer to the question of how to extend triplanes from single-piece to multi-part geometry, and the deployment substrate for the Building Elevation Reconstruction system.