A study of autoregressive procedural-graph generation from a single image, framed against the thesis arc: how this approach compares to graph grammars and DSL synthesis, and how it bears on the broader question of making a neural network reconstruct any shape the way a human artist can.
The thesis-level question driving this study sits one level above any particular architecture: how do you train a network to reconstruct any 3-D shape after seeing only a handful of categories, the way a human artist trained on chairs and microwaves and mechanical devices can model a bridge or a building they have never seen before? Current 3-D reconstruction models — NeRF, 3D Gaussian Splatting, mesh diffusion — fail at this because they learn at the wrong level of abstraction. They learn pixel-to-geometry mappings, not the primitive-and-operation vocabulary that lets a trained artist generalise.
ProcGen3D is the most recent published attempt at the right level. It predicts an executable procedural graph from a single image, not a mesh — so the network's representational target is "this is an extrude followed by a comp(f) split" rather than "this voxel is filled". The structured-intermediate-representation pattern matches the broader thesis line traced through PGN (polyline → DSL), SketchProc3D (sketch → CGA grammar), and the Merrell graph grammar work. ProcGen3D is the most direct external comparison point for that line and the natural reference for an edge-tokenization-based generation component in the thesis's next phase.
This topic is a structured study of the ProcGen3D method, framed against the rest of the thesis: how edge tokenization differs from graph grammar rewriting, what role the silhouette plays in the pipeline, how the two-level grammar-plus-procedural-graph hybrid relates to it, and what a deployable RC-frame skeleton extractor built on the same idea would actually need. The work is primarily theoretical — establishing the design space before committing to an implementation — and seeds the architecture choices that the subsequent SculptNet and MambaFlow3D topics build on.
ProcGen3D [xzhang-t.github.io/project/ProcGen3D] takes a single RGB image of an object — cactus, tree, bridge — and predicts the procedural graph a Houdini-style system would use to recreate it. The graph is then executed by a procedural generator (Blender Geometry Nodes-style) to produce a clean, parametric, editable mesh — not a raw mesh soup.
This sits directly in the same problem space as the PGN work, with one important difference in direction:
PGN consumes structured geometric input and emits a domain-specific construction language. ProcGen3D consumes a visual observation and emits a general procedural graph. The architectural template is the same — structured intermediate representation rather than raw mesh — applied to a different input modality.
The procedural graph is flattened into a sequence by encoding each edge as a token. Each token carries the 3D positions of the two endpoint vertices, the semantic attributes of those vertices (types, parameters), and attributes of the edge itself. BFS and DFS orderings were both evaluated; BFS produces slightly better results.
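A minimal sketch of that flattening step, assuming a plain node-link dictionary for the graph; the paper's exact token layout (field order, quantisation) is not reproduced here, only the structure of an edge token:

```python
from collections import deque

# Sketch of BFS edge tokenization over a node-link graph. The exact
# token layout used by ProcGen3D is not reproduced here; this only
# illustrates what information one edge token carries.

def tokenize_edges_bfs(graph, root):
    """graph: dict node_id -> {"pos": (x, y, z), "type": int,
    "params": list, "adj": list of (neighbour_id, edge_attrs)}."""
    tokens, visited, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v, edge_attrs in graph[u]["adj"]:
            if v in visited:
                continue
            visited.add(v)
            queue.append(v)
            tokens.append({                                    # one token per edge
                "pos": (graph[u]["pos"], graph[v]["pos"]),     # endpoint geometry
                "types": (graph[u]["type"], graph[v]["type"]), # vertex semantics
                "params": (graph[u]["params"], graph[v]["params"]),
                "edge": edge_attrs,                            # edge's own attributes
            })
    return tokens
```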
A standard autoregressive transformer is trained to predict the next edge token conditioned on the input image (encoded via a vision backbone). The model is architecturally identical to a language-model transformer; the novelty sits in what the tokens represent, not in the network itself.
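A minimal sketch of such a decoder, with placeholder dimensions and an unspecified vision backbone supplying patch features; none of these hyperparameters are taken from the paper:

```python
import torch.nn as nn

# Sketch of the decoder: a vanilla transformer that cross-attends to
# image patch features and predicts the next edge token. Dimensions,
# depth, and the continuous-token head are placeholder choices.

class EdgeTokenTransformer(nn.Module):
    def __init__(self, token_dim=128, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)   # embed edge tokens
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, token_dim)    # next-token prediction

    def forward(self, tokens, image_feats):
        # tokens: (B, T, token_dim) edge tokens generated so far
        # image_feats: (B, N, d_model) patch features from a vision backbone
        x = self.embed(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(
            x.size(1)).to(x.device)
        h = self.decoder(x, memory=image_feats, tgt_mask=causal)
        return self.head(h)
```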
MCTS-guided sampling is the paper's key technical contribution. Plain autoregressive sampling can produce procedural graphs that don't faithfully match the input image. So at inference time the transformer is used as a learned prior inside Monte Carlo Tree Search: multiple candidate continuations are expanded, each candidate graph is decoded and rendered, and the rendered silhouette is compared against the input mask. Search is steered toward graphs that minimise silhouette discrepancy. This is the classical test-time-search-with-neural-prior pattern (AlphaGo, etc.) applied to procedural graph generation.
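A simplified sketch of the test-time search, reduced from full MCTS to best-first search over partial token sequences to keep it short; `model_propose`, `decode_and_render`, and `input_mask` are assumed interfaces, not the paper's API:

```python
import heapq

# Simplified test-time search: best-first expansion scored by silhouette
# agreement. ProcGen3D runs full MCTS; the scoring idea is the same.

def silhouette_iou(mask_a, mask_b):
    """IoU between two boolean numpy masks."""
    inter = (mask_a & mask_b).sum()
    union = (mask_a | mask_b).sum()
    return inter / max(union, 1)

def search(model_propose, decode_and_render, input_mask, max_steps=64, beam=8):
    frontier = [(0.0, 0, [])]          # (negated score, tie-breaker, tokens)
    best, best_score, tie = [], -1.0, 0
    for _ in range(max_steps):
        if not frontier:
            break
        _, _, seq = heapq.heappop(frontier)            # most promising prefix
        for next_token in model_propose(seq, k=beam):  # top-k continuations
            cand = seq + [next_token]
            rendered = decode_and_render(cand)         # graph -> mesh -> mask
            score = silhouette_iou(rendered, input_mask)
            tie += 1
            heapq.heappush(frontier, (-score, tie, cand))
            if score > best_score:
                best, best_score = cand, score
    return best
```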
The final token sequence is reassembled into a procedural graph and executed by a downstream generator. The output is a clean parametric mesh — editable, composable, and orders of magnitude smaller than a raw mesh of equivalent fidelity.
A silhouette is the 2D binary mask of a 3D object rendered from a viewpoint — essentially the shadow the object would cast if lit from the camera direction. The first time this came up in studying ProcGen3D, the natural question was whether the network predicts silhouettes. It doesn't.
Two roles: (1) input conditioning — the mask is fed alongside the RGB image into the transformer so the model knows the rough shape boundary; (2) MCTS consistency check — candidate graphs are decoded into meshes, rendered from the same camera angle, and the rendered silhouette is compared against the input mask to score how well the graph matches.
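A small sketch of both roles, assuming the mask is simply stacked as a fourth input channel (the paper may condition differently, e.g. with a separate mask encoder):

```python
import torch

# Role (1): stack the binary mask next to RGB before the vision backbone.
rgb = torch.rand(1, 3, 256, 256)                   # input photograph
mask = (torch.rand(1, 1, 256, 256) > 0.5).float()  # binary silhouette
conditioned = torch.cat([rgb, mask], dim=1)        # (1, 4, 256, 256)

# Role (2): the same mask is reused at inference time. Candidate graphs
# are rendered from the input camera and scored against it, e.g. with
# the silhouette_iou function sketched earlier.
```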
The ablation in the paper (mask vs. RGB as input modality) asks whether to condition on silhouette alone or full RGB. RGB wins because it carries information about internal structure and occlusions that the silhouette alone loses. So silhouette is a tool for alignment, not a prediction target.
Procedural graphs (ProcGen3D, ShapeAssembly, PGN) and graph grammars (Merrell) both involve graphs but represent fundamentally different things. The distinction is central to choosing the right approach for a given task.
In a graph grammar, drawing the graph reveals the shape's wireframe. In a procedural graph, drawing the graph reveals a flowchart. This single observation governs which representation is appropriate for which task.
The natural follow-up question: if both representations exist, which one should a neural reconstruction network output? The answer is procedural graphs, decisively. The reasoning has four parts.
| Property | Graph Grammar | Procedural Graph |
|---|---|---|
| Canonical ground truth | Ambiguous — same shape can be cut into primitives many ways | Unique — the program that generated the shape is known |
| Cycle tokenisation | Hard — cycles have no natural linearisation; every starting point is valid | Natural — topological sort gives a deterministic execution order |
| Generative vs reconstructive design | Generative — produces shapes locally similar to the example | Reconstructive — produces the specific shape from its program |
| Differentiable decoder | Non-differentiable — graph drawing involves rejection sampling | Differentiable end-to-end (demonstrated by PyTorchGeoNodes) |
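The cycle-tokenisation row is worth a concrete illustration: a procedural graph is a DAG of operations, so the standard-library topological sort yields one deterministic linearisation, whereas a wireframe-style grammar graph with a closed loop of struts has no canonical starting point. The node names below are hypothetical Geometry-Nodes-style operations:

```python
from graphlib import TopologicalSorter

# A procedural graph is a DAG of operations, so a topological sort gives
# one deterministic execution order to tokenize against.
proc_graph = {                 # node -> set of upstream dependencies
    "profile": set(),
    "cutter": set(),
    "extrude": {"profile"},
    "array": {"extrude"},
    "boolean": {"array", "cutter"},
}
order = list(TopologicalSorter(proc_graph).static_order())
print(order)  # e.g. ['profile', 'cutter', 'extrude', 'array', 'boolean']
# A wireframe-style grammar graph may contain cycles (a closed loop of
# struts), so no such canonical linearisation exists for it.
```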
If grammars handle connectivity better than procedural graphs, and procedural graphs handle geometric instantiation better than grammars, the natural design is a two-layer system that uses each for what it's good at:
The grammar solves the "what connects to what" problem; the procedural graph solves the "what does each connection look like geometrically" problem. Each layer only handles what it's good at.
Concretely for a suspension bridge: the grammar layer outputs "two towers, main cables connecting towers, vertical hangers, deck spanning between anchorages". The procedural layer then fills in tower cross-section, cable diameter, deck thickness, surface details. The grammar's job collapses from "produce something bridge-like" to "produce this specific connectivity pattern" — a much sharper, less ambiguous training signal.
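A toy sketch of what the two-layer split could look like as data, using the bridge example; all names and schemas here are illustrative, not from a published system:

```python
from dataclasses import dataclass, field

# Toy data model for the two-layer split. The grammar layer holds pure
# connectivity; the procedural layer holds geometric instantiation.

@dataclass
class TopologyEdge:                  # grammar layer: what connects to what
    kind: str                        # "tower", "main_cable", "hanger", "deck"
    endpoints: tuple                 # symbolic node ids, no geometry yet

@dataclass
class ProceduralParams:              # procedural layer: how each part looks
    cross_section: str = "box"
    size: tuple = (1.0,)

@dataclass
class HybridShape:
    topology: list = field(default_factory=list)
    params: dict = field(default_factory=dict)

bridge = HybridShape(
    topology=[
        TopologyEdge("tower", ("anchor_L", "tower_top_L")),
        TopologyEdge("main_cable", ("tower_top_L", "tower_top_R")),
        TopologyEdge("hanger", ("main_cable", "deck")),
        TopologyEdge("deck", ("anchor_L", "anchor_R")),
    ],
    params={"tower": ProceduralParams("box", (4.0, 4.0, 80.0)),
            "main_cable": ProceduralParams("circle", (0.5,))},
)
```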
PartNeXt is a hierarchical part-level annotation dataset built on top of ShapeNet — roughly 26K models across 24 categories with semantic part labels, part hierarchies, and connectivity relationships. It's a reasonable training substrate for the two-layer hybrid because the hierarchical annotations already serve as the topology graph. Whether a model trained on it generalises breaks down into several distinct questions.
PartNeXt is entirely synthetic — clean lighting, no occlusion, no texture variation. Real photographs introduce all three. Part segmentation specifically is sensitive to lighting (shadows hide part boundaries), occlusion (a leg hidden behind another leg → wrong part count), and strong textures (which override geometric cues). ProcGen3D's MCTS test-time search partially mitigates this by aligning against the real silhouette; without an analogous mechanism, the model degrades sharply on real input.
PartNeXt covers chairs, tables, lamps, cabinets, cars, airplanes — primarily furniture and man-made objects. Organic shapes (animals, plants, humans), industrial objects not in ShapeNet, and architectural elements (bridges) have no coverage. For the thesis's bridge work specifically, PartNeXt is essentially useless as a transfer target.
Within chairs, well-covered types (4-legged dining, office, armchair) generalise cleanly. Underrepresented types (folding, bean bag, Bauhaus tubular, Wassily-style) with non-standard part counts will likely misparse. Generalisation within a category is bounded by the diversity of topologies seen in training.
ShapeNet renders use canonical viewpoints (slightly above, front or 3/4 angle). Real photographs come from arbitrary angles, often partial views of objects. Multi-view training (when available) substantially mitigates this — observing from all sides resolves the foreshortening/depth ambiguities that single-view reconstruction inherently has.
"If an artist has learned how to model chairs, microwaves, and mechanical devices, it's understood that they can model bridges and other hard-surface objects. How can we train a network to do the same?" — The core thesis question
Human artists generalise because they don't memorise shapes — they learn primitives and operations. A chair teaches extrude, bevel, loop-cut, boolean. A microwave teaches panel lines, handle topology, button arrays. A mechanical device teaches gear profiles, chamfers, fastener geometry. Once a rich vocabulary of operations and primitives is in hand, any new hard-surface object becomes a novel composition of known operations.
Current 3D reconstruction models (NeRF, 3DGS, mesh diffusion) generalise poorly because they learn at the wrong level of abstraction — pixel-to-geometry mappings or latent shape distributions, with no notion of "this is an extrusion operation" or "this is a repeated structural element". When such a network is trained on chairs and tested on bridges, it fails not because it lacks bridge data — it fails because it never learned what structural repetition or tension geometry is as an abstract operation.
The two-layer grammar + procedural graph system from §06 is closer to the right level, but still domain-constrained: grammar vocabulary is per-category, procedural operations are predefined. The next-level leap requires the network to learn the four things a human artist actually learns (a toy vocabulary sketch follows the four items below):
Primitives. Sphere, cylinder, box, plane, curve — plus organic equivalents. Every shape is built from these. The vocabulary is small, finite, and shared across all hard-surface domains.
Operations. Extrude, boolean, subdivide, mirror, array, loft, sweep, deform. These are category-agnostic — the same boolean operation applies to a chair leg and a bridge pier. The operation vocabulary, like the primitive vocabulary, is shared across all hard-surface modelling.
Composition rules. How primitives and operations combine: symmetry, repetition, hierarchy, attachment. A bridge truss and a chair stretcher rail both use the same "repeated structural element along a path" composition rule. Identifying the composition rule is far more powerful than memorising its instances.
Decomposition. Given any shape, decompose it into primitives + operations. Looking at a bridge and seeing "a swept profile with repeated cross-bracing" rather than just "a bridge". This is the skill that separates trained-on-the-task models from models that genuinely transfer across categories.
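The toy sketch below renders the first three layers as code: a shared primitive and operation vocabulary, plus one composition rule instantiated for two unrelated categories. Every name in it is illustrative:

```python
from enum import Enum, auto

# Shared primitive and operation vocabularies, plus one composition rule.
# All members are illustrative, not a complete or published operation set.

class Primitive(Enum):
    SPHERE = auto(); CYLINDER = auto(); BOX = auto()
    PLANE = auto(); CURVE = auto()

class Op(Enum):
    EXTRUDE = auto(); BOOLEAN = auto(); SUBDIVIDE = auto(); MIRROR = auto()
    ARRAY = auto(); LOFT = auto(); SWEEP = auto(); DEFORM = auto()

def repeat_along_path(element, path, count):
    """Hypothetical composition rule: ARRAY an element along a CURVE."""
    return [(Op.ARRAY, element, path, count)]

# The same rule yields a chair stretcher rail and a bridge truss:
chair_rail = repeat_along_path((Primitive.CYLINDER, 0.02), Primitive.CURVE, 4)
bridge_truss = repeat_along_path((Primitive.BOX, 0.5), Primitive.CURVE, 40)
```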
A generalisable 3D reconstruction model needs to learn all four. The architectures that look most promising for this are universal program spaces — instead of grammars specifically, use a richer program representation where the same vocabulary spans all hard-surface domains. ProcGen3D is one step in that direction; the thesis extension is making the procedural program itself the unit of generalisation.
The pure theoretical study of ProcGen3D-style tokenization leaves open the question of what concrete problem to attack first. A natural starting point is reinforced-concrete frame skeleton extraction: given a photograph of a building, recover the underlying column/beam/slab skeleton as an edge graph. RC frames are a tractable subdomain — small vocabulary (column, beam, slab, brace), sparse topology, planar-by-storey structure — and the result is directly usable downstream for both the procedural and grammar-based generators discussed above.
The natural pipeline mirrors ProcGen3D at a smaller scale: detection and segmentation backbones identify column and beam regions in the image, those regions are grouped into a candidate edge graph, and the autoregressive transformer refines the graph structure, with the silhouette consistency check as the test-time scoring signal. The output is the "edge soup" — a set of candidate edges with connectivity inferred from spatial proximity rather than learned end-to-end — that becomes the seed for grammar extraction or procedural execution downstream; a sketch of the grouping step follows below.
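A sketch of the proximity-grouping step, with the detection output format and the radius threshold as assumptions for illustration:

```python
import numpy as np

# Sketch of the proximity-grouping step that produces the edge soup:
# detected member endpoints are linked into candidate edges whenever
# they fall within a proximity radius.

def build_candidate_graph(endpoints, labels, radius=0.2):
    """endpoints: (N, 3) member endpoints in normalised scene coordinates.
    labels: length-N member types. Returns (i, j, type_i, type_j) tuples."""
    edges = []
    for i in range(len(endpoints)):
        for j in range(i + 1, len(endpoints)):
            if np.linalg.norm(endpoints[i] - endpoints[j]) < radius:
                edges.append((i, j, labels[i], labels[j]))
    return edges

pts = np.array([[0, 0, 0], [0, 0, 3.0], [0.05, 0, 3.0], [4.0, 0, 3.0]])
lbl = ["column_base", "column_top", "beam_end", "beam_end"]
print(build_candidate_graph(pts, lbl))
# -> [(1, 2, 'column_top', 'beam_end')]: the column top joins the nearby
#    beam end; the transformer then refines this soup into a skeleton.
```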
| Reference | Citation |
|---|---|
| ProcGen3D | Zhang, X. et al. "ProcGen3D: Neural Procedural Graph Generation from Images." arXiv:2511.07142, 2025. xzhang-t.github.io/project/ProcGen3D |
| Graph Grammar | Merrell, P. "Example-Based Procedural Modeling Using Graph Grammars." ACM Trans. Graph., 2023. paulmerrell.org/grammar |
| ShapeAssembly | Jones, R. K. et al. "ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis." SIGGRAPH Asia, 2020. |
| L-systems (architecture) | Hansmeyer, M. "L-Systems and Architectural Form." michael-hansmeyer.com/l-systems |
| CGA Shape | Müller, P. et al. "Procedural Modeling of Buildings." ACM SIGGRAPH, 2006. |
| L-systems (classic) | Prusinkiewicz, P., Lindenmayer, A. "The Algorithmic Beauty of Plants." Springer, 1990. |
| PartNeXt | Hierarchical part-level annotations on top of ShapeNet. ~26K models, 24 categories, fine-grained part hierarchies with connectivity. |
| ShapeNet | Chang, A. X. et al. "ShapeNet: An Information-Rich 3D Model Repository." arXiv:1512.03012, 2015. |