SculptNet — Coarse-to-Fine 3D Reconstruction

00 — Motivation

Neural reconstruction emits geometry in one forward pass. Artists don't.

The thesis's broader question — how do you train a network to reconstruct any shape the way a human artist can — surfaced repeatedly across the earlier projects on this roadmap. The answer that current image-to-3D models give is a one-shot forward pass: input image, output mesh / NeRF / SDF, all at the final resolution. NeRF, 3D Gaussian Splatting, mesh diffusion — every one emits the final geometry immediately. The result is plausible from a distance and structurally unstable up close: parts blend together, scales are approximate, fine detail is texture-painted rather than carved.

A human 3D artist working in Houdini, Blender, or Maya does not work this way. They block with primitive boxes first to get the overall masses right. Then they shape: each box is replaced with the right primitive type (cylinder for a leg, sphere for a head, cone for a lampshade). Then they detail: vertices are moved independently to add taper, asymmetry, and recesses. Each stage commits before the next. A network that learns to reconstruct any shape (chair, microwave, bridge) needs to learn this progressive commitment, not just the final geometry.

SculptNet is the architectural answer. It has three components: five fixed primitives with named types and independent face/cap parameters; a Houdini Python SOP that automatically classifies any PartNeXt part mesh into the right primitive type (PCA + circularity + taper); and a four-stage coarse-to-fine pipeline (blocking → shaping → detailing → output) that emits a primitive assembly editable in production tools. It is the structural-representation thesis line's answer to the open cross-category generalisation question: a vocabulary small enough to learn from data, yet expressive enough to cover every hard-surface shape an artist would model.

What it replaces

SculptNet sits next to PGN (DSL-program target) and SketchProc3D (CGA-grammar parameter target) on the structured-intermediate-representation thesis line. The difference: PGN and SketchProc3D both depend on a non-differentiable symbolic executor (Houdini DSL, CGA grammar), and that executor is the open problem in both. SculptNet replaces the executor with differentiable primitive assembly: no symbolic program, no executor gap, just geometric primitives with continuous parameters that backprop natively.

Phase 0 → Phase 1 progression

Phase 0, a proof of concept that the coarse-to-fine sequence works on a single category, was completed in earlier sessions on the chair dataset. Phase 1 (described in this topic) is the architectural commitment: the five-primitive vocabulary, the four-stage pipeline, and the geometric classifier. The initial target was street buildings; the pivot to small statues and objects (drawing inspiration from cgcookie's "6 Principles of Great 3D Modeling") was made to validate on a tighter category before scaling. Phase 2, multi-category PartNeXt training across all 24 categories, is the next step, targeting SIGGRAPH 2026.

01 — The Five Primitives

A small vocabulary, every face independently controllable.

The vocabulary is deliberately small. An earlier design used a variable-N polyhedral system where the network would emit "any number of faces with any orientations" — too unconstrained, hard to train, hard for downstream consumers to parse. The committed design is five named primitive types. Each primitive has a fixed face / cap structure, but every face is independently transformable.

BOX

8 vertices fully independent. Forms boxes, trapezoids, frustums, twisted prisms.

CYLINDER

Top + bottom cap circles. Each with independent centre, radius, tilt.

CONE

Bottom circle + apex. Apex offset from cap centre = oblique cone.

SPHERE

3-axis radii (rx, ry, rz). Ellipsoid generalisation for lamp bulbs, joints.

WEDGE

Triangular prism. Roof shapes, ramp geometry, structural braces.

A box with all 8 vertices independent covers most polyhedral shapes: scale the top four vertices inward relative to the bottom four and you get a frustum; slide them sideways and you get a sheared block; rotate them about the vertical axis and you get a twisted prism. A cylinder with independent caps covers tapered shafts, oblique posts, and most rotationally symmetric parts. In combination, the five-primitive vocabulary covers every category in PartNeXt (chairs, tables, lamps, cabinets, beds, and the rest) and extends naturally to mechanical assemblies and architectural elements.
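As a rough illustration of how one BOX primitive yields a frustum or a twisted prism, the sketch below stores the box as an (8, 3) vertex array and moves only the top four vertices. The function names, the vertex ordering (rows 0-3 bottom, rows 4-7 top), and the choice of y as the vertical axis are assumptions made for this example, not SculptNet's actual parameterisation.

```python
import numpy as np

def unit_box() -> np.ndarray:
    """Return the 8 corners of a unit cube centred at the origin.

    Rows 0-3 are the bottom face (y = -0.5), rows 4-7 the top face (y = +0.5).
    """
    bottom = np.array([[-0.5, -0.5, -0.5], [0.5, -0.5, -0.5],
                       [0.5, -0.5, 0.5], [-0.5, -0.5, 0.5]])
    top = bottom + np.array([0.0, 1.0, 0.0])
    return np.vstack([bottom, top])

def taper_top(verts: np.ndarray, scale: float) -> np.ndarray:
    """Scale the top four vertices about the top-face centre (frustum)."""
    out = verts.copy()
    centre = out[4:].mean(axis=0)
    out[4:] = centre + scale * (out[4:] - centre)
    return out

def twist_top(verts: np.ndarray, angle_rad: float) -> np.ndarray:
    """Rotate the top four vertices about the vertical (y) axis (twisted prism)."""
    out = verts.copy()
    centre = out[4:].mean(axis=0)
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    out[4:] = centre + (out[4:] - centre) @ rot_y.T
    return out

frustum = taper_top(unit_box(), 0.5)        # top face half as wide as the bottom
twisted = twist_top(unit_box(), np.pi / 4)  # top face rotated 45 degrees
```

Because the bottom four vertices are never touched, both results stay valid BOX primitives; only the per-vertex parameters change.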

Vocabulary evolution — what was considered and dropped

The initial design considered six primitive types including an Arch — a parametric curve with extrusion depth, i.e. a tube/cylinder with a bend. Arches show up in PartNeXt as shopping-bag handles, certain chair backrests, and decorative elements. The Arch was dropped from the committed vocabulary because it adds two extra parameters (curve control points) and a non-trivial training-data labelling problem (the geometric classifier would need curve fitting, not just principal-axis analysis). Arched parts are currently approximated as multiple short cylinders chained together — works for shopping-bag handles, less well for tight curves. Single-superquadric representations (autonomousvision.github.io/superquadrics-revisited) were considered as an alternative single-primitive vocabulary; rejected because the five-named-primitive system is easier for downstream consumers to parse as named USD geometry and easier to train as a 5-class type classification rather than a continuous-parameter regression over a single deformable primitive.
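The chained-cylinder approximation of an arched part can be sketched as follows. This is a minimal example assuming the arch is a planar circular arc; the function names and the dictionary layout of a cylinder segment are hypothetical. The second function gives the standard sagitta formula for how far a straight chord departs from the arc it replaces.

```python
import numpy as np

def arc_to_cylinders(arc_radius, sweep_rad, tube_radius, n_segments):
    """Approximate a circular arc in the xy-plane by straight cylinder segments.

    Each segment is described by its two cap centres and a shared tube radius.
    """
    angles = np.linspace(0.0, sweep_rad, n_segments + 1)
    pts = np.stack([arc_radius * np.cos(angles),
                    arc_radius * np.sin(angles),
                    np.zeros_like(angles)], axis=1)
    return [{"bottom_centre": pts[i], "top_centre": pts[i + 1], "radius": tube_radius}
            for i in range(n_segments)]

def max_chord_error(arc_radius, sweep_rad, n_segments):
    """Largest gap between the true arc and a chord (sagitta of one segment)."""
    half_angle = sweep_rad / (2.0 * n_segments)
    return arc_radius * (1.0 - np.cos(half_angle))

handle = arc_to_cylinders(arc_radius=1.0, sweep_rad=np.pi, tube_radius=0.05, n_segments=6)
gentle = max_chord_error(1.0, np.pi / 4, 6)  # shallow bend, small deviation
tight = max_chord_error(1.0, np.pi, 6)       # half circle, larger deviation
```

With the segment count held fixed, the deviation grows with the angle each segment has to span, which is consistent with the observation that tight curves are approximated less well.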

02 — Pipeline

Four stages, each commits before the next.

Figure 1 — Four-stage coarse-to-fine pipeline. Each stage commits before the next. The intermediate at every stage is a valid primitive assembly that can be inspected, edited, or trained against — not a black-box latent. This is what makes the architecture editable in production tools and trainable with stage-by-stage supervision.

03 — Geometric Classifier · The Training-Data Generator

PCA + circularity + taper → primitive type per PartNeXt part.

The five-primitive vocabulary is useful only if you can produce high-quality training data at scale: pairs of (PartNeXt mesh, correct primitive + parameters). Hand-labelling tens of thousands of parts is infeasible. The solution is a Houdini Python SOP geometric classifier that takes any part mesh and automatically emits the primitive type plus fitted parameters. It was built from scratch over several iterations as classification bugs surfaced and were fixed.

# Houdini Python SOP: geometric classifier sketch
# Run inside a For-Each Connected Piece loop, one part per iteration.
# The helpers called below (geometry_points_as_array, principal_component_analysis,
# sample_cross_sections, circularity_of, radius_90th_percentile, fit_params) and the
# list angles_at_vertices (corner angles in radians) are assumed to be defined
# elsewhere in the SOP.

import numpy as np

def classify_part():
    pts = geometry_points_as_array()         # part vertex positions
    pca = principal_component_analysis(pts)  # 3 axes + 3 extents (eigenvalues)

    # Extent ratios → elongation factor
    e0, e1, e2 = pca.extents_sorted_desc()
    elongation = e0 / e1                     # > 3.0 → cylindrical / cone-like

    # Cross-section sampling along principal axis
    slices = sample_cross_sections(pts, pca.axis_0, n=12)
    circularity = np.mean([circularity_of(s) for s in slices])   # 0–1 scale
    top_R = radius_90th_percentile(slices[-1])
    bot_R = radius_90th_percentile(slices[0])
    taper = top_R / bot_R

    # Decision tree → primitive type
    if elongation > 3 and circularity > 0.85:
        prim_type = 'CYLINDER' if 0.6 < taper < 1.7 else 'CONE'
    elif circularity > 0.92 and elongation < 1.6:
        prim_type = 'SPHERE'                 # near-isotropic, near-circular
    elif circularity < 0.55:
        prim_type = 'BOX'                    # rectangular cross-section
    elif min(angles_at_vertices) < 0.45 * np.pi:
        prim_type = 'WEDGE'                  # has a sharp angle
    else:
        prim_type = 'BOX'                    # default fallback

    return prim_type, fit_params(pts, prim_type, pca)
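The sketch above leaves its helpers undefined. One possible NumPy version of the PCA, elongation, and taper measurements is shown below, as an illustration only: the function names are hypothetical, extents are taken as the point spread along each principal axis, and slices are equal-width bins along the first axis. The source does not specify how any of these are actually computed.

```python
import numpy as np

def pca_axes_and_extents(pts):
    """Return principal axes (rows, largest variance first) and extent along each."""
    centred = pts - pts.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
    order = np.argsort(eigvals)[::-1]          # eigh returns ascending order
    axes = eigvecs[:, order].T
    proj = centred @ axes.T
    return axes, proj.max(axis=0) - proj.min(axis=0)

def elongation_and_taper(pts, n_slices=12):
    """Elongation = longest / second-longest extent; taper = top_R / bot_R."""
    axes, extents = pca_axes_and_extents(pts)
    e = np.sort(extents)[::-1]
    centred = pts - pts.mean(axis=0)
    t = centred @ axes[0]                                       # position along main axis
    r = np.linalg.norm(centred - np.outer(t, axes[0]), axis=1)  # distance from the axis
    edges = np.linspace(t.min(), t.max(), n_slices + 1)
    idx = np.clip(np.digitize(t, edges) - 1, 0, n_slices - 1)
    bot_r = np.percentile(r[idx == 0], 90)              # 90th-percentile radius, first slice
    top_r = np.percentile(r[idx == n_slices - 1], 90)   # same for the last slice
    return e[0] / e[1], top_r / bot_r
```

The sign of a principal axis is arbitrary, so "top" and "bottom" may swap between runs. A taper of 0.5 and a taper of 2.0 describe the same shape, which is one reason a symmetric window such as 0.6 < taper < 1.7 is a natural test.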

Bug-fix history (each one a specific PartNeXt input that broke the classifier, leading to a permanent threshold adjustment):

Failure case	Symptom	Fix
Ceiling-mount disc	Classified as BOX (flat, rectangular bbox)	Detect flat disc by low elongation + high circularity along the thin axis specifically
Brass cone base	Classified as SPHERE (low elongation)	Check taper ratio with 90th-percentile radii to catch wide-base cones
Lamp glass bulb (elongated)	Classified as SPHERE despite elongation > 2	Tighten the sphere elongation threshold from 2.0 to 1.6; let elongated rounded shapes fall through to CYLINDER
Misplaced fitted primitives	Output cap centres far from original cap centres	Reconstruct cap centres in world space, not centroid-relative offsets
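The last row of the table (cap centres stored as centroid-relative offsets) can be illustrated with a small sketch. The function name, the 10 % end-band used to locate each cap, and the input layout are assumptions for this example; the actual fitting code is not shown in the source.

```python
import numpy as np

def cap_centres_world(pts, axis, band_fraction=0.1):
    """Estimate bottom and top cap centres of a part and return them in world space."""
    axis = axis / np.linalg.norm(axis)
    centroid = pts.mean(axis=0)
    centred = pts - centroid
    t = centred @ axis                      # signed position along the axis
    lo, hi = t.min(), t.max()
    band = band_fraction * (hi - lo)

    def local_centre(mask, t_end):
        c = centred[mask].mean(axis=0)           # mean of points near that end
        return c + (t_end - c @ axis) * axis     # snap onto the end plane

    bottom_local = local_centre(t <= lo + band, lo)
    top_local = local_centre(t >= hi - band, hi)
    # Adding the centroid back is the step the fix refers to: without it the
    # centres stay centroid-relative and the fitted primitive lands near the origin.
    return centroid + bottom_local, centroid + top_local
```

For a part sitting far from the world origin, the centroid-relative values and the world-space values differ by the full centroid offset, which matches the "misplaced fitted primitives" symptom.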

04 — Results · Phase 1 Validation

~1.3 cm geometric accuracy on the PartNeXt chair benchmark.

Phase 1 validation tested the full pipeline on the PartNeXt chair category. Geometric accuracy is measured as the mean Hausdorff distance between the reconstructed primitive-assembly mesh and the original PartNeXt ground-truth mesh, normalised to the bounding-box diagonal. Result: ~1.3 cm mean Hausdorff distance for an average chair with a 50 cm bounding box, about 2.6 % of the bounding-box diagonal. Reconstruction misses are concentrated on fine detail (carved backrest splats, ornamented legs) that the five-primitive vocabulary cannot represent without subdividing into more parts.
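For reference, a normalised Hausdorff distance between two sampled point sets can be computed as in the sketch below. It is a brute-force NumPy version with hypothetical function names, and it assumes both meshes have already been sampled to point clouds; the evaluation code used for the reported number is not given in the source. The last line reproduces the stated arithmetic, reading the 50 cm figure as the bounding-box diagonal: 1.3 / 50 = 0.026.

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between point sets a (n, 3) and b (m, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # all pairwise distances
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

def bbox_diagonal(pts):
    """Length of the axis-aligned bounding-box diagonal."""
    return float(np.linalg.norm(pts.max(axis=0) - pts.min(axis=0)))

def normalised_hausdorff(recon, gt):
    """Hausdorff distance as a fraction of the ground-truth bounding-box diagonal."""
    return hausdorff(recon, gt) / bbox_diagonal(gt)

# Toy example: unit-cube corners versus the same corners shifted 0.1 along x.
gt = np.array([[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)])
recon = gt + np.array([0.1, 0.0, 0.0])
ratio = normalised_hausdorff(recon, gt)

reported_fraction = 1.3 / 50.0   # 1.3 cm over a 50 cm diagonal
```

The pairwise distance matrix uses O(n·m) memory, so a dense mesh would need subsampling or a spatial index such as a KD-tree.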

Stage	Output type	Vertex count	Editability
Stage 1 — Blocking	N axis-aligned boxes	8N	Translate, scale per box
Stage 2 — Shaping	N typed primitives	~24N (mixed)	Primitive type, scale, orientation per part
Stage 3 — Detailing	Per-face deformed primitives	~40N	Every face/cap independently editable
Stage 4 — Compose	Watertight USD mesh	~50N (after CSG union)	Per-primitive USD subscope, editable in Houdini
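The Stage 1 row (N axis-aligned boxes, 8N vertices) can be made concrete with a short sketch. It assumes each part arrives as its own point array and that blocking is simply the axis-aligned bounding box of each part; the function names are hypothetical, and in SculptNet the blocking boxes may well be predicted by the network rather than computed from geometry.

```python
import itertools
import numpy as np

def blocking_box(part_pts):
    """Return the 8 corners of the axis-aligned bounding box of one part."""
    lo, hi = part_pts.min(axis=0), part_pts.max(axis=0)
    # One (lo, hi) pair per axis; the product enumerates all 8 corner combinations.
    return np.array(list(itertools.product(*zip(lo, hi))))

def blocking_stage(parts):
    """Stack one box per part into an (N, 8, 3) array, i.e. 8N vertices in total."""
    return np.stack([blocking_box(p) for p in parts])

parts = [np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]),
         np.array([[-1.0, 0.0, 2.0], [0.0, 1.0, 4.0], [1.0, -1.0, 3.0]])]
boxes = blocking_stage(parts)   # shape (2, 8, 3)
```

Each (8, 3) slice has the same shape as a general BOX primitive's vertex array, so the shaping and detailing stages can refine these vertices in place.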

