A human 3-D artist working in Houdini, Blender, or Maya does not produce final geometry in a single operation. They block with primitive boxes first (overall masses), then shape by replacing each box with the correct primitive type (cylinder for a leg, sphere for a head, cone for a lampshade), then detail by moving individual vertices to add taper, asymmetry, recesses. Each stage commits before the next. The output of any intermediate stage is a valid mesh that the artist can save, share, or hand off.
Current image-to-3-D models compress this multi-stage process into a single forward pass that emits final-resolution geometry directly. NeRF emits a radiance field at the target resolution; mesh-diffusion emits a watertight mesh at the target topology; SDF-generation emits a continuous signed-distance function. None of them exposes intermediate stages. None of them learns the progressive-commitment inductive bias that lets the artist generalise from chairs to bridges. The result is plausible-from-a-distance reconstructions that are structurally unstable up close — parts blend together, scales are approximate, fine detail is texture-painted rather than carved into geometry.
SculptNet is the architectural alternative. The vocabulary is five named primitives (box, cylinder, cone, sphere, wedge) each with independent face/cap parameters; the pipeline is four stages (blocking, shaping, detailing, compose) each emitting a valid primitive assembly as its intermediate output; the training data comes from PartNeXt via a Houdini Python SOP geometric classifier that automatically labels every part mesh with its correct primitive type plus fitted parameters. The contribution of this paper is the architectural commitment to this decomposition and the empirical demonstration that ~1.3 cm Hausdorff is achievable on PartNeXt chairs at Phase 1 maturity.
The contributions of this paper are: (1) the five-primitive vocabulary with formal parameterisation of each primitive's independent face/cap controls; (2) the four-stage coarse-to-fine pipeline with stage-by-stage supervisory signal; (3) the Houdini Python SOP geometric classifier that produces training labels at PartNeXt scale automatically; (4) Phase 1 quantitative results on PartNeXt chairs (1.3 cm mean Hausdorff, 2.6 % of bounding diagonal); (5) the architectural argument that SculptNet eliminates the non-differentiable executor gap that limits PGN and SketchProc3D on the same thesis line, replacing symbolic programs with continuous parametric primitives that backprop natively.
The vocabulary is deliberately small. An earlier design used a variable-N polyhedral system where the network would emit "any number of faces with any orientations" — flexible in principle, but too unconstrained to train reliably and too unstructured for downstream consumers to parse as named geometry. The committed vocabulary is five named primitive types, each with a fixed face/cap structure but every face independently transformable.
A box's 8 vertices are fully independent. The default configuration is an axis-aligned cuboid (rectangular prism). Pulling the top four vertices inward relative to the bottom four produces a frustum; translating them laterally produces a sheared block; rotating them about the vertical axis produces a twisted block; offsetting individual vertices produces irregular polyhedra. The 24-parameter representation (3 floats × 8 vertices) covers most polyhedral building blocks used in furniture and architecture.
A cylinder has two cap circles, each with independent centre (cx, cy, cz), radius r, and tilt axis (tx, ty, tz). Independent top and bottom caps cover tapered shafts (different radii), oblique posts (different tilts), and most rotationally symmetric parts found in furniture (chair legs, table legs, lamp stems). 14-parameter representation.
A cone has a bottom cap circle (centre, radius, tilt as above) and an apex point. Offsetting the apex from the cap centre produces an oblique cone, covering ramp shapes and lampshade profiles (a truncated cone with non-zero top radius is not a separate type in this vocabulary; it is represented as a cylinder with sharply different cap radii). 10-parameter representation.
A sphere is parameterised by centre (cx, cy, cz) and three axis-aligned radii (rx, ry, rz) — the ellipsoid generalisation. Covers lamp bulbs, joints, decorative orbs. 6-parameter representation.
A wedge is a triangular prism: 6 vertices, 5 faces (two triangular ends + three rectangular sides). Covers roof shapes, ramp geometry, structural braces. 18-parameter representation.
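The five parameterisations above can be summarised as plain data structures. The sketch below (Python with NumPy; the field names are illustrative, not the implementation's) records each primitive's layout and recovers the stated parameter counts:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Box:
    vertices: np.ndarray  # (8, 3) fully independent vertex positions

    @property
    def n_params(self) -> int:
        return self.vertices.size  # 24


@dataclass
class Cylinder:
    cap_center: np.ndarray  # (2, 3) top/bottom cap centres
    cap_radius: np.ndarray  # (2,)   top/bottom radii
    cap_tilt: np.ndarray    # (2, 3) top/bottom tilt axes

    @property
    def n_params(self) -> int:
        return self.cap_center.size + self.cap_radius.size + self.cap_tilt.size  # 14


@dataclass
class Cone:
    cap_center: np.ndarray  # (3,) bottom cap centre
    cap_radius: float
    cap_tilt: np.ndarray    # (3,) bottom cap tilt axis
    apex: np.ndarray        # (3,) may be offset from the cap centre (oblique cone)

    @property
    def n_params(self) -> int:
        return 3 + 1 + 3 + 3  # 10


@dataclass
class Sphere:
    center: np.ndarray  # (3,)
    radii: np.ndarray   # (3,) axis-aligned ellipsoid radii

    @property
    def n_params(self) -> int:
        return self.center.size + self.radii.size  # 6


@dataclass
class Wedge:
    vertices: np.ndarray  # (6, 3) triangular-prism vertices

    @property
    def n_params(self) -> int:
        return self.vertices.size  # 18
```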
The network predicts N axis-aligned bounding boxes from the input image. N is variable per category (chair typically 5–8 parts including back, seat, 4 legs, optional arms; lamp 3–4; table 5–7). Each box is parameterised by centre + scale + orientation. No primitive types are committed yet — every part is a box at this stage. Loss: mean L2 between predicted and ground-truth box parameters, per-part assignment via Hungarian matching.
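A minimal sketch of the blocking-stage loss, assuming each part's box is flattened to a fixed-length parameter vector (centre + scale + orientation). Brute-force enumeration of assignments stands in for the Hungarian algorithm here, which is adequate at the 5–8-part counts typical of chairs:

```python
import itertools

import numpy as np


def blocking_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean L2 between predicted and ground-truth box parameters
    under the best one-to-one part assignment.

    pred, gt: (N, D) arrays of per-part box parameter vectors.
    Exhaustive search over permutations stands in for Hungarian
    matching; fine for small N, O(N!) in general.
    """
    n = pred.shape[0]
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        # Squared L2 per matched pair, averaged over parts.
        cost = float(np.mean(np.sum((pred[list(perm)] - gt) ** 2, axis=1)))
        best = min(best, cost)
    return best
```

The loss is invariant to the order in which parts are emitted, which is the point of the matching step: the network is free to predict the legs in any order.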
For each part the network predicts a primitive type from {BOX, CYLINDER, CONE, SPHERE, WEDGE}. The type prediction is supervised by the geometric classifier's ground-truth label per part. Loss: cross-entropy over the 5-class type plus L2 on type-specific parameters (cylinder radii, sphere axis ratios, etc.). The intermediate is now a "typed primitive assembly" — every part has a primitive type and approximate fit, but no per-face deformation yet.
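The Stage 2 supervisory signal can be sketched per part as follows; the equal weighting of the two terms is an assumption, not something the text specifies:

```python
import numpy as np

TYPES = ["BOX", "CYLINDER", "CONE", "SPHERE", "WEDGE"]


def shaping_loss(type_logits: np.ndarray, gt_type: int,
                 pred_params: np.ndarray, gt_params: np.ndarray) -> float:
    """Per-part Stage 2 loss: 5-class cross-entropy on the primitive
    type plus L2 on the type-specific parameters.

    type_logits: (5,) raw scores; gt_type: index into TYPES;
    pred_params / gt_params: matching 1-D type-specific vectors
    (cylinder radii, sphere axis ratios, etc.).
    """
    z = type_logits - np.max(type_logits)          # numerically stable softmax
    log_probs = z - np.log(np.sum(np.exp(z)))
    ce = -float(log_probs[gt_type])                # cross-entropy term
    l2 = float(np.sum((pred_params - gt_params) ** 2))
    return ce + l2
```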
For each typed primitive the network predicts the per-face / per-cap deformation parameters. Boxes get full 8-vertex offsets; cylinders get cap-centre + radius + tilt offsets per cap; cones get apex offsets; spheres get axis radii; wedges get vertex offsets. Loss: per-vertex L2 against ground-truth fitted parameters. This stage produces the final primitive parameters; the only remaining work is composition.
The deformed primitives are composed via CSG union into a single watertight mesh. Per-primitive USD subscopes are preserved so the output remains editable in Houdini as a named-primitive hierarchy — opening the chair in Houdini reveals SEAT, BACK, LEG_FL, LEG_FR, LEG_BL, LEG_BR as inspectable USD prims with their primitive parameters exposed.
The five-primitive vocabulary is useful only with high-quality (mesh, type, parameters) training data at PartNeXt scale (~26 K models × tens of parts each). Hand-labelling is infeasible. The classifier is a Houdini Python SOP that processes one connected component per For-Each-Connected-Piece loop iteration and emits the correct primitive type plus fitted parameters automatically.
PCA on the part's vertex distribution yields three principal axes with extents (eigenvalues). Cross-section sampling at 12 evenly-spaced positions along the principal axis measures circularity (ratio of cross-section convex-hull area to cross-section bounding-square area). Taper ratio is computed from the 90th-percentile radius of the top and bottom cross-sections (90th-percentile rather than max, to ignore outlier vertices).
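A hedged sketch of the feature computation, assuming vertex positions arrive as an (N, 3) array. To keep the sketch dependency-free, circularity is computed here as a radial-uniformity proxy (10th- over 90th-percentile radius in a middle cross-section, near 1 for a circular section) rather than the hull-area ratio the classifier uses:

```python
import numpy as np


def part_features(verts: np.ndarray, n_sections: int = 12):
    """Return (elongation, circularity, taper) for one connected part.

    verts: (N, 3) vertex positions. Circularity here is a
    radial-uniformity proxy standing in for the hull-area metric.
    """
    c = verts - verts.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(c.T))   # eigenvalues ascending
    extents = np.sqrt(np.maximum(evals, 1e-12))  # axis extents
    elongation = extents[2] / extents[1]         # longest / second axis
    axis = evecs[:, 2]                           # principal axis
    t = c @ axis                                 # coordinate along the axis
    radial = np.linalg.norm(c - np.outer(t, axis), axis=1)

    # Evenly spaced slabs along the axis; 90th-percentile radius per
    # slab (percentile rather than max, to ignore outlier vertices).
    bins = np.linspace(t.min(), t.max(), n_sections + 1)
    idx = np.clip(np.digitize(t, bins) - 1, 0, n_sections - 1)
    r90 = np.array([np.percentile(radial[idx == i], 90) if np.any(idx == i) else 0.0
                    for i in range(n_sections)])
    taper = float(r90[-1] / max(r90[0], 1e-9))

    mid = radial[idx == n_sections // 2]
    circularity = (float(np.percentile(mid, 10) / max(np.percentile(mid, 90), 1e-9))
                   if mid.size else 0.0)
    return float(elongation), circularity, taper
```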
Decision tree: elongation > 3 and circularity > 0.85 → CYLINDER if taper ∈ [0.6, 1.7], else CONE; circularity > 0.92 and elongation < 1.6 → SPHERE; circularity < 0.55 → BOX; minimum vertex-angle < 0.45π → WEDGE; default fallback → BOX.
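The dispatch transcribes directly into code. This is a sketch of the stated order only; the threshold corrections recorded in the failure cases below (thin-axis circularity, taper-first override) are omitted:

```python
import math


def classify(elongation: float, circularity: float, taper: float,
             min_vertex_angle: float) -> str:
    """Dispatch one part to a primitive type using the thresholds
    stated in the text, in the stated order."""
    if elongation > 3 and circularity > 0.85:
        return "CYLINDER" if 0.6 <= taper <= 1.7 else "CONE"
    if circularity > 0.92 and elongation < 1.6:
        return "SPHERE"
    if circularity < 0.55:
        return "BOX"
    if min_vertex_angle < 0.45 * math.pi:
        return "WEDGE"
    return "BOX"  # default fallback
```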
Four named failure cases surfaced during development, each leading to a permanent threshold or computation correction:
(i) Ceiling-mount disc. A flat circular disc was being classified as BOX because its low elongation made the circularity check (along the principal axis) miss. Fix: check circularity along the thin axis specifically when overall elongation is low.
(ii) Brass cone base. A wide-base shallow cone was classified as SPHERE because its low elongation matched the SPHERE branch. Fix: check taper ratio first; high taper (top-radius < 30 % of bottom-radius) overrides the elongation-based dispatch.
(iii) Elongated lamp bulb. A glass bulb with aspect ratio > 2 was classified as SPHERE. Fix: tighten the SPHERE elongation threshold from 2.0 to 1.6 so elongated rounded shapes fall through to CYLINDER (a stretched sphere is approximated as a thick cylinder in this vocabulary).
(iv) Fitted primitives mis-positioned. Output cap centres were appearing far from the original part's cap centres because the cap reconstruction code used centroid-relative offsets but applied them in the wrong coordinate frame. Fix: reconstruct cap centres in world space directly.
Validation tested the full pipeline on the PartNeXt chair category — 4 800 chair models, hierarchical part annotations, average bounding-box diagonal 50 cm. The model was trained on 80 % of the chairs and evaluated on the held-out 20 %.
| Metric | Value |
|---|---|
| Mean Hausdorff distance (raw) | ~1.3 cm |
| Mean Hausdorff (normalised by bbox diagonal) | 2.6 % |
| Mean primitive-type classification accuracy | 92.1 % |
| Mean vertices per chair (output) | ~240 |
| Mean triangles per chair (output) | ~480 |
| Inference time (single-image to mesh) | ~180 ms (RTX 3060) |
Failure-mode analysis: reconstruction errors are concentrated on fine-detail parts that the five-primitive vocabulary cannot represent without further subdivision (carved backrest splats, ornamented legs, complex armrest profiles). The 7.9 % of misclassified primitive types are concentrated on ambiguous parts: a chair leg that is cylindrical overall but square in cross-section near the floor is classified inconsistently as BOX or CYLINDER.
The structural advantage of SculptNet over the symbolic-program approaches on the same thesis line (PGN, SketchProc3D, Merrell graph grammar) is the absence of a non-differentiable symbolic executor. PGN emits a DSL program executed by Houdini's deterministic interpreter — the gradient from 3-D output back to DSL tokens does not exist because the interpreter is non-differentiable. SketchProc3D emits CGA grammar parameters executed by CityEngine — same problem. The Merrell graph grammar emits a rule-application sequence executed by a graph-rewriting engine — same problem.
In every case, the symbolic intermediate is the source of the gap and the limit on end-to-end training. SculptNet's primitive parameters are continuous geometric quantities: cylinder cap centre is a 3-vector, sphere radius is a scalar, box vertex offset is a 3-vector. There is no symbolic step between predicted parameters and final geometry; the network's parameter outputs are the geometry. Gradient flows from the reconstructed mesh back to network weights directly, with no interruption.
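The claim can be made concrete on the smallest possible case, a single cap radius. In the sketch below (NumPy, with the gradient written out analytically rather than through an autograd framework, and point correspondence fixed by construction), the loss differentiates through the sampled geometry back to the parameter with no intermediate symbolic step:

```python
import numpy as np


def cap_points(radius: float, n: int = 64) -> np.ndarray:
    """Sample n points on a cap circle of the given radius in the z=0 plane."""
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.stack([radius * np.cos(theta),
                     radius * np.sin(theta),
                     np.zeros(n)], axis=1)


def loss_and_grad(radius: float, target: np.ndarray):
    """Mean squared distance from sampled cap points to index-matched
    target points, plus its analytic derivative w.r.t. the radius."""
    n = len(target)
    pts = cap_points(radius, n)
    diff = pts - target
    loss = float(np.mean(np.sum(diff ** 2, axis=1)))
    # d(pts)/d(radius) is just the unit circle directions (cos t, sin t, 0),
    # so the chain rule closes in one line — no executor in between.
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    dpts = np.stack([np.cos(theta), np.sin(theta), np.zeros(n)], axis=1)
    grad = float(np.mean(2.0 * np.sum(diff * dpts, axis=1)))
    return loss, grad
```

Fitting a radius of 1.0 against a target cap of radius 1.3 yields a loss of 0.09 and an exact analytic gradient of −0.6, matching the finite-difference estimate; the same chain-rule closure is what fails at the interpreter boundary for a DSL token.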
This is the structural argument that motivates the SculptNet architecture over the program-based alternatives. Whether the structural advantage translates into an empirical one depends on training scale; the Phase 1 results suggest the approach is viable, but the cross-category generalisation question (the broader thesis-level question) needs Phase 2 multi-category training to answer.
The closest external precedent is ShapeAssembly [1] which emits cuboid-assembly programs for 3-D shape structure. SculptNet differs in two respects: a richer primitive vocabulary (5 types including curved primitives, not just cuboids), and a coarse-to-fine staging that ShapeAssembly's single-pass program emission lacks. CSGNet [2] recovers constructive-solid-geometry boolean trees from 3-D shapes via imitation learning; CSGNet is about CSG operations on primitives while SculptNet is about the primitive parameters themselves.
A close validating reference is Learning Fine-to-Coarse Cuboid Shape Abstraction [3] which trains the opposite direction — input cuboid abstraction, output dense shape — and demonstrates that the cuboid-abstraction representation is rich enough to round-trip through a neural decoder. SculptNet inverts the direction (image-in, abstraction-out) and extends the cuboid vocabulary to five typed primitives with independent face deformation. Wu et al. [4] recover CSG programs from point clouds using cuboid primitives in a graph-grammar formulation; their cuboids are constrained to axis-aligned-edges, while SculptNet's box primitive has 8 independent vertices.
Superquadrics Revisited [5] proposed using a single superquadric primitive whose continuous parameters (ε₁, ε₂, scale, taper, bend) cover the geometric range that SculptNet's five named primitives cover together. We rejected this alternative for two reasons. First, parsing the output — the downstream consumer wants a chair-leg labelled as CYLINDER rather than as "superquadric with ε₁=0.1, ε₂=1.0", which the single-primitive representation does not provide. Second, the type-classification training signal (5-class cross-entropy) is empirically more stable than continuous-parameter regression over a high-dimensional superquadric parameter space.
PartNeXt-trained models include OcCo and PartAE [6, 7] which use the dataset's part annotations for auto-encoding and self-supervised pretraining. SculptNet uses PartNeXt as the training-data source via the geometric classifier described in §4, but the output target is a primitive-assembly rather than a part-graph or latent code.
On the thesis line: SculptNet is the executor-free counterpart to PGN [8] (DSL program target, executor gap), SketchProc3D [9] (CGA parameters target, executor gap), and the Merrell graph grammar [10] (rule sequence target, executor gap). The hierarchical part-based triplane [11] is the neural-decoder counterpart that targets the same compositional shape class with continuous decoded geometry rather than parametric primitives.
Three concrete limitations. (i) Detail ceiling. The five-primitive vocabulary cannot represent fine carved detail (backrest splats, ornamented legs, complex armrest profiles) without subdivision. The reconstruction misses are concentrated here. The vocabulary either needs to grow (add patch-based detail primitives) or the network needs to learn part-subdivision (when to split one part into multiple sub-parts) — both are open architectural choices for Phase 2.
(ii) Cross-category generalisation. Phase 1 is single-category (chairs). Phase 2 training across all 24 PartNeXt categories is the test of the cross-category transfer hypothesis — can the same network reconstruct chairs and tables and lamps using the same primitive vocabulary? Architecturally yes; empirically yet to be demonstrated.
(iii) Part-count prediction. The current pipeline assumes N (number of parts) is predicted in Stage 1 and held fixed through subsequent stages. Real chairs vary from 4-leg to 5-leg to wheeled-base configurations, so N needs to be predicted correctly per input. Hungarian-matching at training time handles instance ambiguity but N itself is currently bounded above and below by the training distribution — out-of-distribution part-counts fail silently.
Three future-work directions follow. First, train Phase 2 across all PartNeXt categories and measure cross-category transfer. Second, integrate SculptNet as the parametric-primitive backbone for the Building Elevation reconstruction system [9] — buildings are compositional by definition and the primitive-assembly output is exactly the format Maps production pipelines consume. Third, target SIGGRAPH 2026 with the Phase 2 cross-category results as the headline contribution.
SculptNet replaces single-pass image-to-3-D reconstruction with a four-stage coarse-to-fine pipeline that explicitly mimics the artist workflow. The five-primitive vocabulary (box, cylinder, cone, sphere, wedge) with independent face/cap parameters is small enough to learn from PartNeXt scale and expressive enough to cover the compositional shape class targeted by the broader thesis. The Houdini Python SOP geometric classifier produces training labels automatically via PCA + circularity + taper analysis. Phase 1 validation on the chair benchmark achieves ~1.3 cm mean Hausdorff (2.6 % of bounding diagonal) with editable USD output. The architecture eliminates the executor gap that limits the symbolic-program alternatives on the same thesis line.