A study of autoregressive procedural-graph generation from a single image, framed against the thesis arc: how this approach compares to graph grammars and DSL synthesis, and how it bears on the broader question of making a neural network reconstruct any shape the way a human artist can.
The thesis-level question driving this study sits one level above any particular architecture: how do you train a network to reconstruct any 3-D shape after seeing only a handful of categories, the way a human artist trained on chairs and microwaves and mechanical devices can model a bridge or a building they have never seen before? Current 3-D reconstruction models — NeRF, 3D Gaussian Splatting, mesh diffusion — fail at this because they learn at the wrong level of abstraction. They learn pixel-to-geometry mappings, not the primitive-and-operation vocabulary that lets a trained artist generalise.
ProcGen3D is the most recent published attempt at the right level. It predicts an executable procedural graph from a single image, not a mesh — so the network's representational target is "this is an extrude followed by a comp(f) split" rather than "this voxel is filled". The structured-intermediate-representation pattern matches the broader thesis line traced through PGN (polyline → DSL), SketchProc3D (sketch → CGA grammar), and the Merrell graph grammar work. ProcGen3D is the most direct external comparison point for that line and the natural reference for an edge-tokenization-based generation component in the thesis's next phase.
This topic is a structured study of the ProcGen3D method, framed against the rest of the thesis: how edge tokenization differs from graph grammar rewriting, what role the silhouette plays in the pipeline, how the two-level grammar-plus-procedural-graph hybrid relates to it, and what a deployable RC-frame skeleton extractor built on the same idea would actually need. The work is primarily theoretical — establishing the design space before committing to an implementation — and seeds the architecture choices that the subsequent SculptNet and MambaFlow3D topics build on.
ProcGen3D [xzhang-t.github.io/project/ProcGen3D] takes a single RGB image of an object — cactus, tree, bridge — and predicts the procedural graph a Houdini-style system would use to recreate it. The graph is then executed by a procedural generator (Blender Geometry Nodes-style) to produce a clean, parametric, editable mesh — not a raw mesh soup.
This sits directly in the same problem space as the PGN work, with one important difference in direction:
PGN consumes structured geometric input and emits a domain-specific construction language. ProcGen3D consumes a visual observation and emits a general procedural graph. The architectural template is the same — structured intermediate representation rather than raw mesh — applied to a different input modality.
The procedural graph is flattened into a sequence by encoding each edge as a token. Each token carries the 3D positions of the two endpoint vertices, the semantic attributes of those vertices (types, parameters), and attributes of the edge itself. BFS and DFS orderings were both evaluated; BFS produces slightly better results.
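A minimal sketch of that flattening step, assuming a plain node-link dictionary for the graph; the paper's exact token layout (field order, quantisation) is not reproduced here, only the structure of an edge token:

```python
from collections import deque

# Sketch of BFS edge tokenization over a node-link graph. The exact
# token layout used by ProcGen3D is not reproduced here; this only
# illustrates what information one edge token carries.

def tokenize_edges_bfs(graph, root):
    """graph: dict node_id -> {"pos": (x, y, z), "type": int,
    "params": list, "adj": list of (neighbour_id, edge_attrs)}."""
    tokens, visited, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v, edge_attrs in graph[u]["adj"]:
            if v in visited:
                continue
            visited.add(v)
            queue.append(v)
            tokens.append({                                    # one token per edge
                "pos": (graph[u]["pos"], graph[v]["pos"]),     # endpoint geometry
                "types": (graph[u]["type"], graph[v]["type"]), # vertex semantics
                "params": (graph[u]["params"], graph[v]["params"]),
                "edge": edge_attrs,                            # edge's own attributes
            })
    return tokens
```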
A standard autoregressive transformer is trained to predict the next edge token conditioned on the input image (encoded via a vision backbone). The model is architecturally identical to a language-model transformer; the novelty sits in what the tokens represent, not in the network itself.
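A minimal sketch of such a decoder, with placeholder dimensions and an unspecified vision backbone supplying patch features; none of these hyperparameters are taken from the paper:

```python
import torch.nn as nn

# Sketch of the decoder: a vanilla transformer that cross-attends to
# image patch features and predicts the next edge token. Dimensions,
# depth, and the continuous-token head are placeholder choices.

class EdgeTokenTransformer(nn.Module):
    def __init__(self, token_dim=128, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)   # embed edge tokens
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, token_dim)    # next-token prediction

    def forward(self, tokens, image_feats):
        # tokens: (B, T, token_dim) edge tokens generated so far
        # image_feats: (B, N, d_model) patch features from a vision backbone
        x = self.embed(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(
            x.size(1)).to(x.device)
        h = self.decoder(x, memory=image_feats, tgt_mask=causal)
        return self.head(h)
```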
MCTS-guided sampling is the paper's key technical contribution. Plain autoregressive sampling can produce procedural graphs that don't faithfully match the input image. So at inference time the transformer is used as a learned prior inside Monte Carlo Tree Search: multiple candidate continuations are expanded, each candidate graph is decoded and rendered, and the rendered silhouette is compared against the input mask. Search is steered toward graphs that minimise silhouette discrepancy. This is the classical test-time-search-with-neural-prior pattern (AlphaGo, etc.) applied to procedural graph generation.
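A simplified sketch of the test-time search, reduced from full MCTS to best-first search over partial token sequences to keep it short; `model_propose`, `decode_and_render`, and `input_mask` are assumed interfaces, not the paper's API:

```python
import heapq

# Simplified test-time search: best-first expansion scored by silhouette
# agreement. ProcGen3D runs full MCTS; the scoring idea is the same.

def silhouette_iou(mask_a, mask_b):
    """IoU between two boolean numpy masks."""
    inter = (mask_a & mask_b).sum()
    union = (mask_a | mask_b).sum()
    return inter / max(union, 1)

def search(model_propose, decode_and_render, input_mask, max_steps=64, beam=8):
    frontier = [(0.0, 0, [])]          # (negated score, tie-breaker, tokens)
    best, best_score, tie = [], -1.0, 0
    for _ in range(max_steps):
        if not frontier:
            break
        _, _, seq = heapq.heappop(frontier)            # most promising prefix
        for next_token in model_propose(seq, k=beam):  # top-k continuations
            cand = seq + [next_token]
            rendered = decode_and_render(cand)         # graph -> mesh -> mask
            score = silhouette_iou(rendered, input_mask)
            tie += 1
            heapq.heappush(frontier, (-score, tie, cand))
            if score > best_score:
                best, best_score = cand, score
    return best
```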
The final token sequence is reassembled into a procedural graph and executed by a downstream generator. The output is a clean parametric mesh — editable, composable, and orders of magnitude smaller than a raw mesh of equivalent fidelity.
A silhouette is the 2D binary mask of a 3D object rendered from a viewpoint — essentially the shadow the object would cast if lit from the camera direction. The first time this came up in studying ProcGen3D, the natural question was whether the network predicts silhouettes. It doesn't.
Two roles: (1) input conditioning — the mask is fed alongside the RGB image into the transformer so the model knows the rough shape boundary; (2) MCTS consistency check — candidate graphs are decoded into meshes, rendered from the same camera angle, and the rendered silhouette is compared against the input mask to score how well the graph matches.
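A small sketch of both roles, assuming the mask is simply stacked as a fourth input channel (the paper may condition differently, e.g. with a separate mask encoder):

```python
import torch

# Role (1): stack the binary mask next to RGB before the vision backbone.
rgb = torch.rand(1, 3, 256, 256)                   # input photograph
mask = (torch.rand(1, 1, 256, 256) > 0.5).float()  # binary silhouette
conditioned = torch.cat([rgb, mask], dim=1)        # (1, 4, 256, 256)

# Role (2): the same mask is reused at inference time. Candidate graphs
# are rendered from the input camera and scored against it, e.g. with
# the silhouette_iou function sketched earlier.
```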
The ablation in the paper (mask vs. RGB as input modality) asks whether to condition on silhouette alone or full RGB. RGB wins because it carries information about internal structure and occlusions that the silhouette alone loses. So silhouette is a tool for alignment, not a prediction target.
Procedural graphs (ProcGen3D, ShapeAssembly, PGN) and graph grammars (Merrell) both involve graphs but represent fundamentally different things. The distinction is central to choosing the right approach for a given task.
In a graph grammar, drawing the graph reveals the shape's wireframe. In a procedural graph, drawing the graph reveals a flowchart. This single observation governs which representation is appropriate for which task.
The natural follow-up question: if both representations exist, which one should a neural reconstruction network output? The answer is procedural graphs, decisively. The reasoning has four parts.
| Property | Graph Grammar | Procedural Graph |
|---|---|---|
| Canonical ground truth | Ambiguous — same shape can be cut into primitives many ways | Unique — the program that generated the shape is known |
| Cycle tokenisation | Hard — cycles have no natural linearisation; every starting point is valid | Natural — topological sort gives a deterministic execution order |
| Generative vs reconstructive design | Generative — produces shapes locally similar to the example | Reconstructive — produces the specific shape from its program |
| Differentiable decoder | Non-differentiable — graph drawing involves rejection sampling | Differentiable end-to-end (demonstrated by PyTorchGeoNodes) |
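The cycle-tokenisation row is worth a concrete illustration: a procedural graph is a DAG of operations, so the standard-library topological sort yields one deterministic linearisation, whereas a wireframe-style grammar graph with a closed loop of struts has no canonical starting point. The node names below are hypothetical Geometry-Nodes-style operations:

```python
from graphlib import TopologicalSorter

# A procedural graph is a DAG of operations, so a topological sort gives
# one deterministic execution order to tokenize against.
proc_graph = {                 # node -> set of upstream dependencies
    "profile": set(),
    "cutter": set(),
    "extrude": {"profile"},
    "array": {"extrude"},
    "boolean": {"array", "cutter"},
}
order = list(TopologicalSorter(proc_graph).static_order())
print(order)  # e.g. ['profile', 'cutter', 'extrude', 'array', 'boolean']
# A wireframe-style grammar graph may contain cycles (a closed loop of
# struts), so no such canonical linearisation exists for it.
```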
If grammars handle connectivity better than procedural graphs, and procedural graphs handle geometric instantiation better than grammars, the natural design is a two-layer system that uses each for what it's good at:
The grammar solves the "what connects to what" problem; the procedural graph solves the "what does each connection look like geometrically" problem. Each layer only handles what it's good at.
Concretely for a suspension bridge: the grammar layer outputs "two towers, main cables connecting towers, vertical hangers, deck spanning between anchorages". The procedural layer then fills in tower cross-section, cable diameter, deck thickness, surface details. The grammar's job collapses from "produce something bridge-like" to "produce this specific connectivity pattern" — a much sharper, less ambiguous training signal.
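A toy sketch of what the two-layer split could look like as data, using the bridge example; all names and schemas here are illustrative, not from a published system:

```python
from dataclasses import dataclass, field

# Toy data model for the two-layer split. The grammar layer holds pure
# connectivity; the procedural layer holds geometric instantiation.

@dataclass
class TopologyEdge:                  # grammar layer: what connects to what
    kind: str                        # "tower", "main_cable", "hanger", "deck"
    endpoints: tuple                 # symbolic node ids, no geometry yet

@dataclass
class ProceduralParams:              # procedural layer: how each part looks
    cross_section: str = "box"
    size: tuple = (1.0,)

@dataclass
class HybridShape:
    topology: list = field(default_factory=list)
    params: dict = field(default_factory=dict)

bridge = HybridShape(
    topology=[
        TopologyEdge("tower", ("anchor_L", "tower_top_L")),
        TopologyEdge("main_cable", ("tower_top_L", "tower_top_R")),
        TopologyEdge("hanger", ("main_cable", "deck")),
        TopologyEdge("deck", ("anchor_L", "anchor_R")),
    ],
    params={"tower": ProceduralParams("box", (4.0, 4.0, 80.0)),
            "main_cable": ProceduralParams("circle", (0.5,))},
)
```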
PartNeXt is a hierarchical part-level annotation dataset built on top of ShapeNet — roughly 26K models across 24 categories with semantic part labels, part hierarchies, and connectivity relationships. It's a reasonable training substrate for the two-layer hybrid because the hierarchical annotations already serve as the topology graph. Whether a model trained on it generalises breaks down into several distinct questions.
PartNeXt is entirely synthetic — clean lighting, no occlusion, no texture variation. Real photographs introduce all three. Part segmentation specifically is sensitive to lighting (shadows hide part boundaries), occlusion (a leg hidden behind another leg → wrong part count), and strong textures (which override geometric cues). ProcGen3D's MCTS test-time search partially mitigates this by aligning against the real silhouette; without an analogous mechanism, the model degrades sharply on real input.
PartNeXt covers chairs, tables, lamps, cabinets, cars, airplanes — primarily furniture and man-made objects. Organic shapes (animals, plants, humans), industrial objects not in ShapeNet, and architectural elements (bridges) have no coverage. For the thesis's bridge work specifically, PartNeXt is essentially useless as a transfer target.
Within chairs, well-covered types (4-legged dining, office, armchair) generalise cleanly. Underrepresented types (folding, bean bag, Bauhaus tubular, Wassily-style) with non-standard part counts will likely misparse. Generalisation within a category is bounded by the diversity of topologies seen in training.
ShapeNet renders use canonical viewpoints (slightly above, front or 3/4 angle). Real photographs come from arbitrary angles, often partial views of objects. Multi-view training (when available) substantially mitigates this — observing from all sides resolves the foreshortening/depth ambiguities that single-view reconstruction inherently has.
"If an artist has learned how to model chairs, microwaves, and mechanical devices, it's understood that they can model bridges and other hard-surface objects. How can we train a network to do the same?" — The core thesis question
Human artists generalise because they don't memorise shapes — they learn primitives and operations. A chair teaches extrude, bevel, loop-cut, boolean. A microwave teaches panel lines, handle topology, button arrays. A mechanical device teaches gear profiles, chamfers, fastener geometry. Once a rich vocabulary of operations and primitives is in hand, any new hard-surface object becomes a novel composition of known operations.
Current 3D reconstruction models (NeRF, 3DGS, mesh diffusion) generalise poorly because they learn at the wrong level of abstraction — pixel-to-geometry mappings or latent shape distributions, with no notion of "this is an extrusion operation" or "this is a repeated structural element". When such a network is trained on chairs and tested on bridges, it fails not because it lacks bridge data — it fails because it never learned what structural repetition or tension geometry is as an abstract operation.
The two-layer grammar + procedural graph system from §06 is closer to the right level, but still domain-constrained: grammar vocabulary is per-category, procedural operations are predefined. The next-level leap requires the network to learn the four things a human artist actually learns (a toy vocabulary sketch follows the four items below):
Primitives. Sphere, cylinder, box, plane, curve — plus organic equivalents. Every shape is built from these. The vocabulary is small, finite, and shared across all hard-surface domains.
Operations. Extrude, boolean, subdivide, mirror, array, loft, sweep, deform. These are category-agnostic — the same boolean operation applies to a chair leg and a bridge pier. The operation vocabulary, like the primitive vocabulary, is shared across all hard-surface modelling.
Composition rules. How primitives and operations combine: symmetry, repetition, hierarchy, attachment. A bridge truss and a chair stretcher rail both use the same "repeated structural element along a path" composition rule. Identifying the composition rule is far more powerful than memorising its instances.
Decomposition. Given any shape, decompose it into primitives + operations. Looking at a bridge and seeing "a swept profile with repeated cross-bracing" rather than just "a bridge". This is the skill that separates trained-on-the-task models from models that genuinely transfer across categories.
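The toy sketch below renders the first three layers as code: a shared primitive and operation vocabulary, plus one composition rule instantiated for two unrelated categories. Every name in it is illustrative:

```python
from enum import Enum, auto

# Shared primitive and operation vocabularies, plus one composition rule.
# All members are illustrative, not a complete or published operation set.

class Primitive(Enum):
    SPHERE = auto(); CYLINDER = auto(); BOX = auto()
    PLANE = auto(); CURVE = auto()

class Op(Enum):
    EXTRUDE = auto(); BOOLEAN = auto(); SUBDIVIDE = auto(); MIRROR = auto()
    ARRAY = auto(); LOFT = auto(); SWEEP = auto(); DEFORM = auto()

def repeat_along_path(element, path, count):
    """Hypothetical composition rule: ARRAY an element along a CURVE."""
    return [(Op.ARRAY, element, path, count)]

# The same rule yields a chair stretcher rail and a bridge truss:
chair_rail = repeat_along_path((Primitive.CYLINDER, 0.02), Primitive.CURVE, 4)
bridge_truss = repeat_along_path((Primitive.BOX, 0.5), Primitive.CURVE, 40)
```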
A generalisable 3D reconstruction model needs to learn all four. The architectures that look most promising for this are universal program spaces — instead of grammars specifically, use a richer program representation where the same vocabulary spans all hard-surface domains. ProcGen3D is one step in that direction; the thesis extension is making the procedural program itself the unit of generalisation.
The pure theoretical study of ProcGen3D-style tokenization leaves open the question of what concrete problem to attack first. A natural starting point is reinforced-concrete frame skeleton extraction: given a photograph of a building, recover the underlying column/beam/slab skeleton as an edge graph. RC frames are a tractable subdomain — small vocabulary (column, beam, slab, brace), sparse topology, planar-by-storey structure — and the result is directly usable downstream for both the procedural and grammar-based generators discussed above.
The natural pipeline mirrors ProcGen3D at a smaller scale: detection and segmentation backbones identify column and beam regions in the image, those regions are grouped into a candidate edge graph, and the autoregressive transformer refines the graph structure, with the silhouette consistency check as the test-time scoring signal. The output is the "edge soup" — a set of candidate edges with connectivity inferred from spatial proximity rather than learned end-to-end — that becomes the seed for grammar extraction or procedural execution downstream; a sketch of the grouping step follows below.
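A sketch of the proximity-grouping step, with the detection output format and the radius threshold as assumptions for illustration:

```python
import numpy as np

# Sketch of the proximity-grouping step that produces the edge soup:
# detected member endpoints are linked into candidate edges whenever
# they fall within a proximity radius.

def build_candidate_graph(endpoints, labels, radius=0.2):
    """endpoints: (N, 3) member endpoints in normalised scene coordinates.
    labels: length-N member types. Returns (i, j, type_i, type_j) tuples."""
    edges = []
    for i in range(len(endpoints)):
        for j in range(i + 1, len(endpoints)):
            if np.linalg.norm(endpoints[i] - endpoints[j]) < radius:
                edges.append((i, j, labels[i], labels[j]))
    return edges

pts = np.array([[0, 0, 0], [0, 0, 3.0], [0.05, 0, 3.0], [4.0, 0, 3.0]])
lbl = ["column_base", "column_top", "beam_end", "beam_end"]
print(build_candidate_graph(pts, lbl))
# -> [(1, 2, 'column_top', 'beam_end')]: the column top joins the nearby
#    beam end; the transformer then refines this soup into a skeleton.
```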
| Reference | Citation |
|---|---|
| ProcGen3D | Zhang, X. et al. "ProcGen3D: Neural Procedural Graph Generation from Images." arXiv:2511.07142, 2025. xzhang-t.github.io/project/ProcGen3D |
| Graph Grammar | Merrell, P. "Example-Based Procedural Modeling Using Graph Grammars." ACM Trans. Graph., 2023. paulmerrell.org/grammar |
| ShapeAssembly | Jones, R. K. et al. "ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis." SIGGRAPH Asia, 2020. |
| L-systems (architecture) | Hansmeyer, M. "L-Systems and Architectural Form." michael-hansmeyer.com/l-systems |
| CGA Shape | Müller, P. et al. "Procedural Modeling of Buildings." ACM SIGGRAPH, 2006. |
| L-systems (classic) | Prusinkiewicz, P., Lindenmayer, A. "The Algorithmic Beauty of Plants." Springer, 1990. |
| PartNeXt | Hierarchical part-level annotations on top of ShapeNet. ~26K models, 24 categories, fine-grained part hierarchies with connectivity. |
| ShapeNet | Chang, A. X. et al. "ShapeNet: An Information-Rich 3D Model Repository." arXiv:1512.03012, 2015. |