Per-part isolated triplane sets that eliminate inter-part occlusion structurally, plus a global triplane for spatial context. Parts are composable, independently updatable, and decoded by a shared SDF network. The architectural answer to triplane ghosting on multi-part shapes.
The thesis-line goal is a real-time, editable 3-D map of the world — every building, every facade element, every interior, encoded in a representation light enough to stream and structured enough to edit. Triplanes are the obvious efficient encoding: three orthogonal 2-D feature textures (XY, XZ, YZ) replace a dense voxel grid at a fraction of the memory cost, and a learned SDF decoder reconstructs the surface continuously. SparC3D and related work show the encoding is competitive on single-object benchmarks.
The trouble starts when objects have overlapping parts. Two structural elements occupying the same column of XY pixels — a chair leg behind another chair leg, a hand resting on a table, a U-shape's inner walls projecting onto the same plane as its outer walls — collapse into a single feature stack. The decoder cannot tell where one part ends and the next begins because the projection erased the layering. The result is "ghosting": phantom surfaces where parts blur together, missing geometry where the rear part is shadowed by the front, and a representation that's only as good as the most occlusion-free shape it was trained on.
The earlier Hybrid Sparse-Triplane Engine work attempted to fix this with a gated approach — a sparse VDB bitmask gating the triplane feature stack to suppress ghost surfaces. The gate helped but the representation was still per-shape, not per-part. The hierarchical part-based formulation here is the next step: give each semantic part its own isolated triplane set, plus a global triplane for spatial relationships between parts. Occlusion is then a per-part problem (where it's much smaller) rather than a per-shape problem. Editability comes for free — parts can be updated independently because they are stored independently.
A triplane stores 2-D feature vectors at each (x, y),
(x, z), and (y, z) pixel. To query a 3-D point
p = (x, y, z), the decoder samples one feature from each
plane and concatenates them: f(p) = [F_XY(x,y), F_XZ(x,z), F_YZ(y,z)].
The MLP decoder turns that triple into an SDF value.
The representational ambiguity is immediate. Consider two points p1 = (0, 0, 0.3)
and p2 = (0, 0, 0.7). Both have (x, y) = (0, 0),
so F_XY(x, y) is identical for both. F_XZ and
F_YZ differ because z differs, but the MLP receives only three
interpolated features — there's no mechanism to recover which 3-D point
along the XY ray the feature actually describes. When two parts of a shape
occupy the same XY column at different depths, their feature contributions
blur into the same encoding and the decoder cannot disambiguate.
The visible failure: U-shaped silhouettes reconstruct as filled rectangles, interior cavities collapse, and pairs of parallel surfaces (e.g., the two walls of a tube) reduce to a single mean surface.
Decompose first.
Project second.
A triplane projection erases the depth axis. Two parts at different depths in the same XY column become indistinguishable. The fix is not a clever decoder — the decoder still sees only three 2-D features per query — but a decomposition that ensures only one part contributes feature mass to any given pixel. Parts are stored separately, queried separately, composed at the end. The decoder never has to disambiguate what was already structurally separated.
Input meshes carry hierarchical part labels from the PartNeXt dataset (~26K models, 24 categories, 4–5 levels deep). Each part gets an axis-aligned bounding box and a local coordinate frame. For shapes without annotations, the fallback is a connected-component + small-cluster decomposition that approximates semantic parts.
Each part's geometry is encoded as a triplane set in its local frame.
Because the part is solo in its frame, no inter-part occlusion can
occur — every (x, y) column in the part's XY plane is
populated by at most one part. Resolution per-part is matched to the
part's bounding-box scale, so small parts (handles, bolts) get
comparable feature density to large parts (chair seat, wall).
A coarser global triplane stores feature vectors in the full-shape coordinate frame. Its job is to encode spatial relationships between parts — which part attaches where, which parts are adjacent, the overall scale. The global triplane does NOT have to encode fine per-part detail (that's the per-part triplanes' job), so it runs at much lower resolution without quality loss.
A single MLP receives the part's local triplane features, the global triplane features at the same world point, and a part_id embedding. Outputs SDF. The shared decoder learns a unified geometry prior across part categories rather than per-part specialised networks — important for generalising to unseen part-instance combinations.
| Property | Single Triplane | Hierarchical Part-Based |
|---|---|---|
| Memory | 3 × N² features | (N_parts + 1) × 3 × n² features (n < N) |
| Inter-part occlusion | Causes ghost surfaces | Structurally eliminated |
| Per-part editing | Re-encode whole shape | Update one part triplane |
| Cross-category generalisation | Limited by training distribution | Shared decoder + part_id embedding generalises |
| Generation by diffusion | Single 3-channel tensor — easy | Variable-cardinality structured object — harder |
| Inference cost | 3 triplane samples + 1 MLP call | 3 + 3 triplane samples + 1 MLP call · ~2× slower |
Pick a shape or click the input canvas to cycle. The middle pane shows the per-part triplane decomposition (parts in different colours, each with its own XY/XZ/YZ feature textures); the right pane shows the reconstructed mesh with parts independently colour-coded. Drag the mesh to rotate.
arXiv-format write-up · Hierarchical Part-Based Triplane Reconstruction · occlusion analysis, architecture, trade-off table, predecessor work