A coarse-to-fine 3-D reconstruction system that mimics how a human artist builds geometry — blocking → shaping → detailing — using five fixed primitives (box, cylinder, cone, sphere, wedge) with independent face/cap deformation. Single-image input, primitive-assembly output, ~1.3 cm geometric accuracy on the PartNeXt chair benchmark.
The thesis's broader question — how do you train a network to reconstruct any shape the way a human artist can — surfaced repeatedly across the earlier projects on this roadmap. The answer that current image-to-3D models give is a one-shot forward pass: input image, output mesh / NeRF / SDF, all at the final resolution. NeRF, 3D Gaussian Splatting, mesh diffusion — every one emits the final geometry immediately. The result is plausible from a distance and structurally unstable up close: parts blend together, scales are approximate, fine detail is texture-painted rather than carved.
A human 3-D artist working in Houdini, Blender, or Maya does not work this way. They block with primitive boxes first — get the overall masses right. Then shape — replace boxes with the right primitive type (cylinder for a leg, sphere for a head, cone for a lampshade). Then detail — independently move vertices to add taper, asymmetry, recesses. Each stage commits before the next. The network that learns to reconstruct any shape — chair, microwave, bridge — needs to learn this progressive commitment, not just the final geometry.
SculptNet is the architectural answer. It combines three pieces: five fixed primitives with named types and independent face/cap parameters; a Houdini Python SOP that automatically classifies any PartNeXt part mesh into the right primitive type (PCA + circularity + taper); and a four-stage coarse-to-fine pipeline (blocking → shaping → detailing → output) that emits a primitive assembly editable in production tools. This is the structural-representation thesis line's answer to the open cross-category generalisation question: a vocabulary small enough to learn from data, yet expressive enough to cover every hard-surface shape an artist would model.
The vocabulary is deliberately small. An earlier design used a variable-N polyhedral system where the network would emit "any number of faces with any orientations" — too unconstrained, hard to train, hard for downstream consumers to parse. The committed design is five named primitive types. Each primitive has a fixed face / cap structure, but every face is independently transformable.
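One way to picture "fixed structure, independently transformable faces" is a small typed record per part. The sketch below is illustrative only: `FaceTransform`, `FACE_COUNT`, and the per-type face/cap counts for cylinder, cone, and sphere are assumptions made for the example, not the parameterisation SculptNet actually uses.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class PrimitiveType(Enum):
    BOX = "box"
    CYLINDER = "cylinder"
    CONE = "cone"
    SPHERE = "sphere"
    WEDGE = "wedge"


@dataclass
class FaceTransform:
    """Transform of a single face or cap (hypothetical parameterisation)."""
    offset: Tuple[float, float, float] = (0.0, 0.0, 0.0)  # shift of the face centre
    scale: Tuple[float, float] = (1.0, 1.0)               # in-plane scale


# Number of independently editable faces/caps per type.
# BOX (6 faces) and WEDGE (5 faces) follow from the solids themselves;
# the other counts are assumptions for this sketch.
FACE_COUNT = {
    PrimitiveType.BOX: 6,
    PrimitiveType.WEDGE: 5,
    PrimitiveType.CYLINDER: 2,  # two end caps (assumed)
    PrimitiveType.CONE: 2,      # base cap and tip cap (assumed)
    PrimitiveType.SPHERE: 2,    # two polar caps (assumed)
}


@dataclass
class Primitive:
    kind: PrimitiveType
    faces: List[FaceTransform] = field(default_factory=list)

    def __post_init__(self):
        expected = FACE_COUNT[self.kind]
        if not self.faces:
            # Start from the undeformed primitive: one identity transform per face.
            self.faces = [FaceTransform() for _ in range(expected)]
        elif len(self.faces) != expected:
            raise ValueError(
                f"{self.kind.name} needs {expected} faces, got {len(self.faces)}"
            )


box = Primitive(PrimitiveType.BOX)
box.faces[0].offset = (0.0, 0.0, 0.1)  # move one face; the other five are untouched
```

Because the face count is fixed by the type, a downstream consumer can parse any part without inspecting its geometry, which is the property the variable-N design lacked.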
A box with all 8 vertices independent covers most polyhedral shapes: pull the top four vertices inward relative to the bottom four and you get a frustum; shift them sideways and you get a sheared block. A cylinder with independent caps covers tapered shafts, oblique posts, and most rotationally symmetric parts. In combination, the five-primitive vocabulary covers every category in PartNeXt (chairs, tables, lamps, cabinets, beds) and extends naturally to mechanical assemblies and architectural elements.
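The box case can be checked with a few lines of plain Python. This is a minimal sketch with hypothetical helper names (`unit_box`, `scale_top`, `shift_top`); it only demonstrates that moving the top four vertices independently of the bottom four produces a frustum or a sheared block.

```python
def unit_box():
    """Eight vertices of a unit box: bottom four (z = 0), then top four (z = 1)."""
    corners = [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)]
    bottom = [(x, y, 0.0) for x, y in corners]
    top = [(x, y, 1.0) for x, y in corners]
    return bottom + top


def scale_top(vertices, factor):
    """Scale the top four vertices about the vertical axis (factor < 1 gives a frustum)."""
    bottom, top = vertices[:4], vertices[4:]
    return bottom + [(x * factor, y * factor, z) for x, y, z in top]


def shift_top(vertices, dx, dy):
    """Translate the top four vertices sideways, giving a sheared block."""
    bottom, top = vertices[:4], vertices[4:]
    return bottom + [(x + dx, y + dy, z) for x, y, z in top]


frustum = scale_top(unit_box(), 0.5)        # top face is half the size of the base
sheared = shift_top(unit_box(), 0.25, 0.0)  # top face slides 0.25 along x
```

In both cases the bottom face is untouched, so the same eight-vertex record describes the original box and its deformed variants.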
Five primitives.
Four stages of commitment.
The artist analogy is more than a metaphor. Stage-by-stage commitment is the inductive bias that lets a network generalise across categories: the network learns each stage's transition (block → shape, shape → detail) on one category and applies the same transition to a category it has never seen. A chair-trained network should block a bridge correctly because blocking is the same operation regardless of category; what changes is the input, not the operation.
The five-primitive vocabulary is useful only if you can produce high-quality training data at scale: pairs of (PartNeXt mesh, correct primitive + parameters). Hand-labelling tens of thousands of parts is infeasible. The solution is a Houdini Python SOP geometric classifier that takes any part mesh and automatically emits the correct primitive type plus fitted parameters. It was built from scratch over several iterations as classification bugs surfaced and were fixed.
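The geometric features such a classifier relies on can be sketched outside Houdini with NumPy. The function below is an assumption-laden illustration, not the SOP's code: `pca_features`, `end_fraction`, and the `lateral_surface` test helper are hypothetical names, taper is measured along the major PCA axis only, and circularity is left out.

```python
import numpy as np


def pca_features(points, end_fraction=0.2):
    """Elongation and taper of a part from its vertex positions.

    points: (N, 3) array-like. end_fraction (hypothetical) is the share of
    points at each end of the major axis used to estimate the end radii.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    centred = pts - centroid

    # Principal axes are the eigenvectors of the covariance matrix.
    # eigh returns ascending eigenvalues, so reverse to put the major axis first.
    _, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
    axes = eigvecs[:, ::-1].T            # rows: major, middle, minor axis

    coords = centred @ axes.T            # point coordinates in the PCA frame
    extents = coords.max(axis=0) - coords.min(axis=0)
    elongation = extents[0] / max(extents[1], 1e-9)

    # Radial distance of every point from the major axis.
    along = coords[:, 0]
    radius = np.hypot(coords[:, 1], coords[:, 2])

    # 90th-percentile radius at each end, so a few stray vertices do not dominate.
    low_end = along <= np.percentile(along, 100 * end_fraction)
    high_end = along >= np.percentile(along, 100 * (1 - end_fraction))
    r_low = np.percentile(radius[low_end], 90)
    r_high = np.percentile(radius[high_end], 90)
    taper = min(r_low, r_high) / max(r_low, r_high, 1e-9)

    return {"centroid": centroid, "axes": axes,
            "elongation": float(elongation), "taper": float(taper)}


def lateral_surface(radius_at, height=1.0, rings=50, segments=64):
    """Sample points on a surface of revolution about z (test helper)."""
    angles = np.linspace(0.0, 2.0 * np.pi, segments, endpoint=False)
    pts = []
    for t in np.linspace(0.0, 1.0, rings):
        r = radius_at(t)
        pts.extend((r * np.cos(a), r * np.sin(a), t * height) for a in angles)
    return np.array(pts)


cylinder = pca_features(lateral_surface(lambda t: 0.1))
cone = pca_features(lateral_surface(lambda t: 0.3 * (1.0 - t)))
```

With these definitions a straight cylinder has a taper near 1 and a cone has a taper well below 1, regardless of which end the PCA axis happens to point towards.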
Bug-fix history (each entry is a specific PartNeXt input that broke the classifier and led to a permanent fix):
| Failure case | Symptom | Fix |
|---|---|---|
| Ceiling-mount disc | Classified as BOX (flat, rectangular bbox) | Detect flat disc by low elongation + high circularity along the thin axis specifically |
| Brass cone base | Classified as SPHERE (low elongation) | Check taper ratio with 90th-percentile radii to catch wide-base cones |
| Lamp glass bulb (elongated) | Classified as SPHERE despite elongation > 2 | Tighten the sphere elongation threshold from 2.0 to 1.6; let elongated rounded shapes fall through to CYLINDER |
| Misplaced fitted primitives | Output cap centres far from original cap centres | Reconstruct cap centres in world space, not centroid-relative offsets |
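The fixes in the table can be read as an ordered decision cascade. The sketch below is one possible arrangement under stated assumptions: only the 1.6 sphere threshold comes from the table; the other thresholds, the feature definitions, the function names, and the choice to label a flat disc as CYLINDER are hypothetical. WEDGE is omitted because the text does not describe how it is detected.

```python
SPHERE_MAX_ELONGATION = 1.6  # from the bug-fix table (tightened from 2.0)
MIN_CIRCULARITY = 0.85       # hypothetical
MAX_CONE_TAPER = 0.5         # hypothetical
MAX_DISC_THICKNESS = 0.25    # hypothetical


def classify(elongation, circularity, taper, thickness):
    """Pick a primitive type from precomputed features.

    elongation:  longest extent / second-longest extent
    circularity: roundness of the cross-section in [0, 1], measured about the
                 thin axis for flat parts and the long axis otherwise
    taper:       smaller / larger 90th-percentile end radius
    thickness:   shortest extent / second-longest extent
    """
    if circularity < MIN_CIRCULARITY:
        return "BOX"
    # Flat disc: low elongation, thin, and round about the thin axis.
    if thickness <= MAX_DISC_THICKNESS and elongation <= SPHERE_MAX_ELONGATION:
        return "CYLINDER"
    # Taper is checked before the sphere test so wide-base cones are caught.
    if taper <= MAX_CONE_TAPER:
        return "CONE"
    if elongation <= SPHERE_MAX_ELONGATION:
        return "SPHERE"
    # Elongated rounded shapes fall through to CYLINDER.
    return "CYLINDER"


def cap_centres_world(centroid, axis, half_length):
    """Both cap centres in world space: centroid -/+ half_length * axis."""
    bottom = tuple(c - half_length * a for c, a in zip(centroid, axis))
    top = tuple(c + half_length * a for c, a in zip(centroid, axis))
    return bottom, top


disc = classify(elongation=1.0, circularity=0.95, taper=1.0, thickness=0.1)
cone_base = classify(elongation=1.1, circularity=0.95, taper=0.2, thickness=0.9)
bulb = classify(elongation=1.8, circularity=0.95, taper=0.9, thickness=1.0)
caps = cap_centres_world((1.0, 2.0, 3.0), (0.0, 0.0, 1.0), 0.5)
```

The order of the tests matters more than the exact thresholds: the disc and cone checks run before the sphere check, which is what keeps low-elongation discs and wide-base cones from being absorbed by it. `cap_centres_world` adds the centroid back in, so the fitted caps land at the part's actual position rather than near the origin.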
Phase 1 validation tested the full pipeline on the PartNeXt chair category. Geometric accuracy was measured as the mean Hausdorff distance between the reconstructed primitive-assembly mesh and the original PartNeXt ground-truth mesh, normalised to the bounding-box diagonal. Result: ~1.3 cm mean Hausdorff distance for an average chair with a 50 cm bounding-box diagonal, about 2.6 % of that diagonal. Reconstruction misses are concentrated on fine detail (carved backrest splats, ornamented legs) that the five-primitive vocabulary cannot represent without subdividing into more parts.
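For reference, the metric can be sketched on point samples in plain Python. This is a brute-force illustration rather than the evaluation code: the function names are hypothetical, both meshes are assumed to be sampled to point lists first, and "mean" is assumed to be the average of the per-shape value over the category.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def bbox_diagonal(points):
    """Length of the axis-aligned bounding-box diagonal."""
    lo = [min(p[i] for p in points) for i in range(3)]
    hi = [max(p[i] for p in points) for i in range(3)]
    return math.dist(lo, hi)


def normalised_hausdorff(reconstruction, ground_truth):
    """Hausdorff distance as a fraction of the ground-truth bbox diagonal."""
    return hausdorff(reconstruction, ground_truth) / bbox_diagonal(ground_truth)


truth = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
recon = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 1.0)]
score = normalised_hausdorff(recon, truth)  # 1.0 / 5.0

# The headline figure: 1.3 cm over a 50 cm diagonal.
print(f"{1.3 / 50:.1%}")  # prints 2.6%
```

The brute-force nearest-point search is quadratic in the number of samples, which is fine for a sanity check but would need a spatial index for dense meshes.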
| Stage | Output type | Vertex count | Editability |
|---|---|---|---|
| Stage 1 — Blocking | N axis-aligned boxes | 8N | Translate, scale per box |
| Stage 2 — Shaping | N typed primitives | ~24N (mixed) | Primitive type, scale, orientation per part |
| Stage 3 — Detailing | Per-face deformed primitives | ~40N | Every face/cap independently editable |
| Stage 4 — Compose | Watertight USD mesh | ~50N (after CSG union) | Per-primitive USD subscope, editable in Houdini |
arXiv-format write-up · SculptNet: Coarse-to-Fine 3D Reconstruction · five-primitive vocabulary, four-stage progressive commitment, Houdini classifier, chair-benchmark results