Aditya Jain / Apple Maps · 3D Reconstruction
Feb 2026
Topic 32 Feb 2026 Coarse-to-Fine · Single-Image 3D · Primitive Assembly

SculptNet —
Sculpt How Artists Sculpt.

A coarse-to-fine 3-D reconstruction system that mimics how a human artist builds geometry — blocking → shaping → detailing — using five fixed primitives (box, cylinder, cone, sphere, wedge) with independent face/cap deformation. Single-image input, primitive-assembly output, ~1.3 cm geometric accuracy on the PartNeXt chair benchmark.

00 — Motivation

Neural reconstruction emits geometry in one forward pass. Artists don't.

The thesis's broader question — how do you train a network to reconstruct any shape the way a human artist can — surfaced repeatedly across the earlier projects on this roadmap. The answer that current image-to-3D models give is a one-shot forward pass: input image, output mesh / NeRF / SDF, all at the final resolution. NeRF, 3D Gaussian Splatting, mesh diffusion — every one emits the final geometry immediately. The result is plausible from a distance and structurally unstable up close: parts blend together, scales are approximate, fine detail is texture-painted rather than carved.

A human 3-D artist working in Houdini, Blender, or Maya does not work this way. They block with primitive boxes first — get the overall masses right. Then shape — replace boxes with the right primitive type (cylinder for a leg, sphere for a head, cone for a lampshade). Then detail — independently move vertices to add taper, asymmetry, recesses. Each stage commits before the next. The network that learns to reconstruct any shape — chair, microwave, bridge — needs to learn this progressive commitment, not just the final geometry.

SculptNet is the architectural answer. Five fixed primitives with named types and independent face/cap parameters; a Houdini Python SOP that classifies any PartNeXt part mesh into the right primitive type automatically (PCA + circularity + taper); a four-stage coarse-to-fine pipeline (blocking → shaping → detailing → compose) that emits a primitive assembly editable in production tools. It is the structural-representation thesis line's answer to the open cross-category generalisation question: a vocabulary small enough to learn from data, expressive enough to cover every hard-surface shape an artist would model.

What it replaces
SculptNet sits next to PGN (DSL-program target) and SketchProc3D (CGA-grammar parameters target) on the structured-intermediate-representation thesis line. The difference: PGN and SketchProc3D both depend on a non-differentiable symbolic executor (Houdini DSL, CGA grammar), and that executor gap is the open problem in both. SculptNet replaces the executor with differentiable primitive assembly: no symbolic program, no executor gap, just geometric primitives with continuous parameters that backprop natively.
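To make "continuous parameters that backprop natively" concrete, here is a minimal sketch that fits a single sphere radius by gradient descent on a squared signed-distance loss. The loss, the learning rate, and the name `fit_sphere_radius` are illustrative assumptions and are not taken from SculptNet; the gradient is written out by hand in NumPy, where a real training loop would rely on autodiff.

```python
import numpy as np

def fit_sphere_radius(points, centre, r0=1.0, lr=0.1, steps=200):
    """Fit a sphere radius by gradient descent on the mean squared SDF.

    Hypothetical helper for illustration only; not part of SculptNet.
    """
    # Distance of every sample point to the (fixed) sphere centre.
    d = np.linalg.norm(points - centre, axis=1)
    r = r0
    for _ in range(steps):
        sdf = d - r                 # signed distance to the sphere surface
        grad = -2.0 * sdf.mean()    # d/dr of mean(sdf ** 2)
        r -= lr * grad              # plain gradient step on the radius
    return r
```

Because the radius enters the loss through a smooth expression, the gradient exists everywhere and no symbolic executor sits between the parameter and the geometry. The same argument is what the primitive-assembly route relies on for its other continuous parameters.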
Phase 0 → Phase 1 progression
Phase 0 — a proof-of-concept that the coarse-to-fine sequence works on a single category — was completed in earlier sessions on the chair dataset. Phase 1 (described in this topic) is the architectural commitment: the five-primitive vocabulary, the four-stage pipeline, the geometric classifier. The initial target was street buildings; the pivot to small statues and objects (drawing inspiration from cgcookie's "6 Principles of Great 3D Modeling") was the decision to validate on a tighter category before scaling. Phase 2 — multi-category PartNeXt training across all 24 categories — is the next step, targeting SIGGRAPH 2026.
01 — The Five Primitives

A small vocabulary, every face independently controllable.

The vocabulary is deliberately small. An earlier design used a variable-N polyhedral system where the network would emit "any number of faces with any orientations" — too unconstrained, hard to train, hard for downstream consumers to parse. The committed design is five named primitive types. Each primitive has a fixed face / cap structure, but every face is independently transformable.

BOX
8 vertices fully independent. Forms boxes, trapezoids, frustums, twisted prisms.
CYLINDER
Top + bottom cap circles. Each with independent centre, radius, tilt.
CONE
Bottom circle + apex. Apex offset from cap centre = oblique cone.
SPHERE
3-axis radii (rx, ry, rz). Ellipsoid generalisation for lamp bulbs, joints.
WEDGE
Triangular prism. Roof shapes, ramp geometry, structural braces.

A box with all 8 vertices independent covers most polyhedral shapes: scale the top four vertices inward relative to the bottom four and you get a frustum; shift or twist them and you get a sheared or twisted prism. A cylinder with independent caps covers tapered shafts, oblique posts, and most rotationally symmetric parts. The five-primitive vocabulary covers, in combination, every category in PartNeXt (chairs, tables, lamps, cabinets, and beds among them) and extends naturally to mechanical assemblies and architectural elements.
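The box deformations described above can be sketched with a plain 8×3 vertex array. The vertex ordering (rows 0 to 3 bottom, rows 4 to 7 top) and the function names are assumptions made for this example; the page does not specify SculptNet's actual parameter layout.

```python
import numpy as np

def unit_box():
    """8 vertices of a unit cube centred at the origin.

    Assumed layout: rows 0-3 are the bottom face (z = -0.5),
    rows 4-7 are the top face (z = +0.5).
    """
    xy = np.array([[-0.5, -0.5], [0.5, -0.5], [0.5, 0.5], [-0.5, 0.5]])
    bottom = np.column_stack([xy, np.full(4, -0.5)])
    top = np.column_stack([xy, np.full(4, 0.5)])
    return np.vstack([bottom, top])

def taper_top(verts, scale):
    """Scale the top four vertices in x/y about the z axis: box -> frustum."""
    out = verts.copy()
    out[4:, :2] *= scale
    return out

def twist_top(verts, angle):
    """Rotate the top four vertices about the z axis: box -> twisted prism."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    out = verts.copy()
    out[4:, :2] = out[4:, :2] @ rot.T
    return out

frustum = taper_top(unit_box(), 0.5)          # top face half the bottom's width
twisted = twist_top(unit_box(), np.pi / 8.0)  # top face rotated by 22.5 degrees
```

Both deformations touch only the top four rows, which is the sense in which each face stays independently controllable while the primitive type stays BOX.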

Vocabulary evolution — what was considered and dropped
The initial design considered six primitive types including an Arch — a parametric curve with extrusion depth, i.e. a tube/cylinder with a bend. Arches show up in PartNeXt as shopping-bag handles, certain chair backrests, and decorative elements. The Arch was dropped from the committed vocabulary because it adds two extra parameters (curve control points) and a non-trivial training-data labelling problem (the geometric classifier would need curve fitting, not just principal-axis analysis). Arched parts are currently approximated as multiple short cylinders chained together — works for shopping-bag handles, less well for tight curves. Single-superquadric representations (autonomousvision.github.io/superquadrics-revisited) were considered as an alternative single-primitive vocabulary; rejected because the five-named-primitive system is easier for downstream consumers to parse as named USD geometry and easier to train as a 5-class type classification rather than a continuous-parameter regression over a single deformable primitive.
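As a rough illustration of the chained-cylinder workaround for arches, the sketch below splits a planar circular arc into straight cylinder segments and reports the largest gap between the arc and its straight segments. The circular-arc assumption, the XZ-plane placement, the dictionary keys, and both function names are hypothetical; the page does not say how arched parts are actually segmented.

```python
import numpy as np

def arc_to_cylinders(centre, radius, start_angle, end_angle, n_segments, tube_radius):
    """Approximate a circular arc in the XZ plane by straight cylinder segments."""
    centre = np.asarray(centre, dtype=float)
    angles = np.linspace(start_angle, end_angle, n_segments + 1)
    ring = np.column_stack([np.cos(angles), np.zeros_like(angles), np.sin(angles)])
    pts = centre + radius * ring
    # Each segment is a cylinder whose caps sit on consecutive arc samples.
    return [
        {"bottom_centre": pts[i], "top_centre": pts[i + 1], "radius": tube_radius}
        for i in range(n_segments)
    ]

def max_chord_error(radius, start_angle, end_angle, n_segments):
    """Largest distance between the arc and its chord approximation (the sagitta)."""
    half_angle = abs(end_angle - start_angle) / (2.0 * n_segments)
    return radius * (1.0 - np.cos(half_angle))

# A half-circle handle of radius 10 cm, approximated with 4 and then 8 segments.
handle = arc_to_cylinders([0.0, 0.0, 0.0], 0.10, 0.0, np.pi, 4, 0.005)
coarse_error = max_chord_error(0.10, 0.0, np.pi, 4)
fine_error = max_chord_error(0.10, 0.0, np.pi, 8)
```

The error depends on the angle each segment has to span, so a part that bends sharply over a short length needs many more segments to reach the same accuracy. That is consistent with the observation that the workaround is less effective on tight curves.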
02 — Pipeline

Four stages, each commits before the next.

Input: single RGB image, 256², plus optional mask
Stage 1 · Blocking: N axis-aligned boxes (part counts + bbox)
Stage 2 · Shaping: classify primitive type (box · cyl · cone · sph · wedge)
Stage 3 · Detailing: independent face deform (tapers · oblique caps · ellipsoid)
Stage 4 · Compose: CSG union + smin joints → primitive assembly
Output: USD mesh, editable in Houdini, named-primitive hierarchy
Progressive commitment: each stage's output is the next stage's input; intermediates are inspectable.
Figure 1 — Four-stage coarse-to-fine pipeline. Each stage commits before the next. The intermediate at every stage is a valid primitive assembly that can be inspected, edited, or trained against — not a black-box latent. This is what makes the architecture editable in production tools and trainable with stage-by-stage supervision.
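Stage 4 in the figure joins primitives with a CSG union and "smin joints". A minimal sketch of what a smooth-minimum joint does on signed distance fields follows. The page does not say which smooth-min variant SculptNet uses; the quadratic polynomial form below is an assumed stand-in, and `sphere_sdf` and `smooth_min` are names chosen for this example.

```python
import numpy as np

def sphere_sdf(p, centre, radius):
    """Signed distance from points p (..., 3) to a sphere surface."""
    return np.linalg.norm(p - centre, axis=-1) - radius

def smooth_min(a, b, k):
    """Quadratic polynomial smooth minimum of two distance values.

    k is the blend width: where |a - b| >= k the result equals min(a, b),
    so the union is only softened near the joint between the two shapes.
    """
    h = np.maximum(k - np.abs(a - b), 0.0) / k
    return np.minimum(a, b) - 0.25 * k * h * h

# Two overlapping spheres; the blended field rounds off the crease between them.
query = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [1.0, 0.0, 0.0]])
d_a = sphere_sdf(query, np.array([0.0, 0.0, 0.0]), 0.6)
d_b = sphere_sdf(query, np.array([1.0, 0.0, 0.0]), 0.6)
hard_union = np.minimum(d_a, d_b)
soft_union = smooth_min(d_a, d_b, k=0.3)
```

A hard `min` leaves a sharp crease where two primitives meet; the smooth version trades a small amount of added material near the joint for a continuous blend, and it stays differentiable in both inputs.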
Core Insight

Five primitives.
Four stages of commitment.

The artist analogy is more than a metaphor. Stage-by-stage commitment is the inductive bias that lets a network generalise across categories: the network learns each stage's transition (block → shape, shape → detail) on one category and applies the same transition to a category it has never seen. A chair-trained network blocks a bridge correctly because blocking is the same operation regardless of category; what changes is the input, not the operation.

03 — Geometric Classifier · The Training-Data Generator

PCA + circularity + taper → primitive type per PartNeXt part.

The five-primitive vocabulary is useful only if you can produce high-quality training data, that is, pairs of (PartNeXt mesh, correct primitive + parameters), at scale. Hand-labelling tens of thousands of parts is infeasible. The solution is a Houdini Python SOP geometric classifier that takes any part mesh and emits the correct primitive type plus fitted parameters automatically. It was built from scratch over several iterations as classification bugs surfaced and were fixed.

    # Houdini Python SOP: geometric classifier sketch.
    # Run inside a For-Each Connected Piece loop, one part per iteration.
    # The helpers (geometry_points_as_array, principal_component_analysis,
    # sample_cross_sections, circularity_of, radius_90th_percentile, fit_params)
    # and the angles_at_vertices sequence are not defined in this sketch.
    import numpy as np

    def classify_part():
        pts = geometry_points_as_array()          # part vertex positions
        pca = principal_component_analysis(pts)   # 3 axes + 3 extents (eigenvalues)

        # Extent ratios → elongation factor
        e0, e1, e2 = pca.extents_sorted_desc()
        elongation = e0 / e1                      # > 3.0 → cylindrical / cone-like

        # Cross-section sampling along principal axis
        slices = sample_cross_sections(pts, pca.axis_0, n=12)
        circularity = np.mean([circularity_of(s) for s in slices])  # 0–1 scale
        top_R = radius_90th_percentile(slices[-1])
        bot_R = radius_90th_percentile(slices[0])
        taper = top_R / bot_R

        # Decision tree → primitive type
        if elongation > 3 and circularity > 0.85:
            prim_type = 'CYLINDER' if 0.6 < taper < 1.7 else 'CONE'
        elif circularity > 0.92 and elongation < 1.6:
            prim_type = 'SPHERE'                  # near-isotropic, near-circular
        elif circularity < 0.55:
            prim_type = 'BOX'                     # rectangular cross-section
        elif min(angles_at_vertices) < 0.45 * np.pi:
            prim_type = 'WEDGE'                   # has a sharp angle
        else:
            prim_type = 'BOX'                     # default fallback
        return prim_type, fit_params(pts, prim_type, pca)
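The classifier sketch leaves its PCA and circularity helpers undefined. One possible NumPy version of those two measurements is shown below as an illustration. Everything here is an assumption: the function names, the choice to measure extents by projecting points onto the principal axes (the sketch's comment equates extents with eigenvalues), and the min/max-radius definition of circularity. The thresholds quoted in the classifier were tuned against the author's own measures and would not transfer to these definitions unchanged.

```python
import numpy as np

def pca_extents(pts):
    """Principal axes (rows, largest variance first) and extents along them."""
    centred = pts - pts.mean(axis=0)
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    _, vecs = np.linalg.eigh(np.cov(centred.T))
    axes = vecs[:, ::-1].T                 # row 0 = direction of largest variance
    proj = centred @ axes.T                # point coordinates in the PCA frame
    extents = proj.max(axis=0) - proj.min(axis=0)
    return axes, extents

def circularity(slice_2d):
    """Assumed measure: min radius / max radius about the slice centroid.

    Returns 1.0 for a perfect circle and smaller values for angular outlines.
    """
    r = np.linalg.norm(slice_2d - slice_2d.mean(axis=0), axis=1)
    return r.min() / r.max()

# Synthetic test part: a cylinder of radius 0.5 and length 6 along x.
theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
xs = np.linspace(-3.0, 3.0, 25)
gx, gt = np.meshgrid(xs, theta)
cyl = np.column_stack([gx.ravel(), 0.5 * np.cos(gt.ravel()), 0.5 * np.sin(gt.ravel())])

axes, extents = pca_extents(cyl)
elongation = extents[0] / extents[1]
ring = np.column_stack([np.cos(theta), np.sin(theta)])
ring_circularity = circularity(ring)
```

On this synthetic cylinder the elongation comes out well above the 3.0 threshold used in the decision tree and the circular slice scores close to 1, so the part would take the CYLINDER / CONE branch.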

Bug-fix history (each one a specific PartNeXt input that broke the classifier, leading to a permanent threshold adjustment):

Failure case | Symptom | Fix
Ceiling-mount disc | Classified as BOX (flat, rectangular bbox) | Detect flat disc by low elongation + high circularity along the thin axis specifically
Brass cone base | Classified as SPHERE (low elongation) | Check taper ratio with 90th-percentile radii to catch wide-base cones
Lamp glass bulb (elongated) | Classified as SPHERE despite elongation > 2 | Tighten the sphere elongation threshold from 2.0 to 1.6; let elongated rounded shapes fall through to CYLINDER
Misplaced fitted primitives | Output cap centres far from original cap centres | Reconstruct cap centres in world space, not centroid-relative offsets
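Two of the fixes above, the 90th-percentile taper check and the world-space cap centres, can be sketched together. This is an illustrative reconstruction under stated assumptions, not the author's SOP code: `fit_caps`, `taper_ratio`, the `end_fraction` parameter, and the choice to read each cap from the outermost slab of points are all hypothetical.

```python
import numpy as np

def fit_caps(pts, axis, end_fraction=0.1, pct=90):
    """Fit bottom/top cap centres (world space) and radii along an axis."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    centroid = pts.mean(axis=0)
    rel = pts - centroid
    t = rel @ axis                        # signed position along the axis
    radial = rel - np.outer(t, axis)      # component perpendicular to the axis
    span = t.max() - t.min()
    ends = {
        "bottom": (t.min(), t <= t.min() + end_fraction * span),
        "top": (t.max(), t >= t.max() - end_fraction * span),
    }
    caps = {}
    for name, (t_cap, mask) in ends.items():
        offset = radial[mask].mean(axis=0)            # sideways shift of this cap
        radii = np.linalg.norm(radial[mask] - offset, axis=1)
        caps[name] = {
            # Add the centroid back so the centre is in world space,
            # not a centroid-relative offset.
            "centre": centroid + t_cap * axis + offset,
            # A high percentile, not the mean, so a wide base is not averaged away.
            "radius": np.percentile(radii, pct),
        }
    return caps

def taper_ratio(caps):
    """Top radius over bottom radius, as used in the CYLINDER / CONE split."""
    return caps["top"]["radius"] / caps["bottom"]["radius"]
```

With the classifier's stated band, a part whose `taper_ratio` falls outside 0.6 to 1.7 would be labelled CONE instead of CYLINDER. Dropping the `centroid +` term in `centre` reproduces the misplaced-primitive symptom in the table for any part that does not sit at the origin.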
04 — Results · Phase 1 Validation

~1.3 cm geometric accuracy on the PartNeXt chair benchmark.

Phase 1 validation tested the full pipeline on the PartNeXt chair category. Geometric accuracy is measured as the mean Hausdorff distance between the reconstructed primitive-assembly mesh and the original PartNeXt ground-truth mesh, normalised to the bounding-box diagonal. Result: ~1.3 cm mean Hausdorff for an average chair with a 50 cm bounding-box diagonal, about 2.6 % of that diagonal (1.3 / 50). Reconstruction misses are concentrated on fine detail (carved backrest splats, ornamented legs) that the five-primitive vocabulary cannot represent without subdividing into more parts.
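For reference, a brute-force version of the metric on point samples might look like the sketch below. It assumes that "mean Hausdorff" means the symmetric Hausdorff distance per shape, averaged over the test chairs, and that both meshes have already been sampled to point sets; the function names and the sampling step are not from the page.

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets (brute force)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def normalised_hausdorff(recon, truth):
    """Hausdorff distance as a fraction of the ground-truth bbox diagonal."""
    diag = np.linalg.norm(truth.max(axis=0) - truth.min(axis=0))
    return hausdorff(recon, truth) / diag

# Toy check in centimetres: a 30 x 40 footprint has a 50 cm bbox diagonal,
# and the "reconstruction" is the same shape displaced by 1.3 cm.
truth = np.array([[0.0, 0.0, 0.0], [30.0, 0.0, 0.0], [0.0, 40.0, 0.0], [30.0, 40.0, 0.0]])
recon = truth + np.array([0.0, 0.0, 1.3])
ratio = normalised_hausdorff(recon, truth)
```

The toy case reproduces the arithmetic behind the headline figure: a 1.3 cm worst-case deviation on a 50 cm diagonal is 0.026, or 2.6 %. The pairwise distance matrix is quadratic in the number of samples, so a spatial index would be the usual choice for dense meshes.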

Stage | Output type | Vertex count | Editability
Stage 1 · Blocking | N axis-aligned boxes | 8N | Translate, scale per box
Stage 2 · Shaping | N typed primitives | ~24N (mixed) | Primitive type, scale, orientation per part
Stage 3 · Detailing | Per-face deformed primitives | ~40N | Every face/cap independently editable
Stage 4 · Compose | Watertight USD mesh | ~50N (after CSG union) | Per-primitive USD subscope, editable in Houdini


Full Technical Paper

arXiv-format write-up · SculptNet: Coarse-to-Fine 3D Reconstruction · five-primitive vocabulary, four-stage progressive commitment, Houdini classifier, chair-benchmark results

Related Thesis Chapters
PGN — Procedural Generator Network
Sister architecture on the structured-representation thesis line. PGN emits a DSL program; SculptNet emits a primitive assembly. SculptNet's advantage: no symbolic executor, so no executor gap.
Hierarchical Part-Based Triplane
Sister architecture targeting the same compositional shapes (furniture, mechanical assemblies). Triplane is the neural-decoder route; SculptNet is the parametric-primitive route. Different trade-offs on editability, generation, and inference cost.
ProcGen3D — Edge-Based Tokenization
Conceptual neighbour. ProcGen3D's autoregressive procedural-graph emission is the most direct external reference; SculptNet's coarse-to-fine commitment is the artist-workflow alternative.