Procedural 3D generation in Maps requires producing geometrically valid assets — bridges, buildings, terrain features — from structured inputs like polylines and semantic boundary attributes. The dominant approach is direct neural generation: diffusion over meshes, SDF fields, or triplane representations. These methods produce plausible geometry but encode no structural knowledge of what makes a valid assembly. A bridge generated by mesh diffusion may look correct from a distance but contain degenerate topology, non-manifold faces, or physically implausible structural relationships.
Graph grammars offer a fundamentally different inductive bias. Rather than learning a smooth distribution over geometry, they learn the combinatorial rules that govern valid assemblies. A grammar extracted from a real bridge knows that piers support decks, that railings appear at closed boundaries, that span count scales with polyline length. These constraints are implicit in the grammar rather than injected as post-hoc regularisers.
Merrell's 2023 method [1] is, to our knowledge, the only published system that automatically extracts such grammars from example shapes without manual rule authoring. It handles the hardest structural case — shapes with closed cycles — that previous grammar methods explicitly avoided. This paper describes our implementation, our 3D extension, and the architectural path toward neural rule prediction.
Two shapes are r-similar if every radius-r neighbourhood of one shape matches a neighbourhood appearing somewhere in the other, and vice versa. As r→0 this captures exact local topology. A grammar extracted from shape S can generate any shape that is r-similar to S: same local rules, arbitrarily different global form.
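One way to spell this out (our reading of the definition, with B(x, r) the closed ball of radius r and ≅ denoting a match between neighbourhoods):

```latex
% r-similarity, written out (our reading of the definition):
S_1 \sim_r S_2 \;\iff\;
  \forall x \in S_1 \;\exists y \in S_2 :\;
    B(x, r) \cap S_1 \;\cong\; B(y, r) \cap S_2,
  \quad\text{and symmetrically for every } y \in S_2 .
```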
Merrell decomposes a shape graph G into primitives by cutting every edge into two directed half-edges and grouping them by origin vertex, giving one primitive per vertex. The boundary ∂G of a graph encodes its open half-edges as a string of symbols, with turn angles tracking curvature. A complete graph has ∂G = "∧". Production rules are string transformations on ∂G.
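A minimal sketch of the decomposition, assuming a vertex-coordinate dict and an undirected edge list (the names and representation here are ours, not Merrell's):

```python
import math
from collections import defaultdict

def primitives(coords, edges):
    """Cut each undirected edge (u, v) into two directed half-edges and
    group them by origin vertex: one primitive per vertex. Each half-edge
    carries its outgoing direction angle, later used as its symbol."""
    prim = defaultdict(list)
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            (ax, ay), (bx, by) = coords[a], coords[b]
            angle = math.degrees(math.atan2(by - ay, bx - ax))
            prim[a].append((b, angle))  # directed half-edge a -> b
    return dict(prim)

# A 3-vertex path: the middle vertex's primitive has two half-edges.
print(primitives({0: (0, 0), 1: (1, 0), 2: (1, 1)}, [(0, 1), (1, 2)]))
```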
Branch gluing: ∂G₁ + ∂G₂ → ∂G₃, merging at a complementary half-edge pair (a, ā).
Loop gluing: "…a…ā…" → "…", which requires a ±360° turn between a and ā.

We begin with 2D shapes to validate the boundary string machinery. The extractor represents shapes as planar graphs with directed half-edges. Each vertex becomes a primitive with outgoing half-edges labelled by edge direction. Boundary computation follows Merrell Section 3.3, tracking open half-edges and cumulative turn angles.
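A sketch of the two gluing operations above, assuming a boundary is a cyclic list of (symbol, turn°) pairs and a tilde-suffix complement convention (both are our encoding choices, not the paper's):

```python
def complement(sym):
    """'a' <-> 'a~' under an assumed tilde-suffix bar convention."""
    return sym[:-1] if sym.endswith("~") else sym + "~"

def branch_glue(b1, b2, sym):
    """Branch gluing: merge two boundaries at the complementary pair
    (sym, sym~), consuming both matched half-edges and splicing the
    remaining symbols in cyclic order."""
    i = next(k for k, (s, _) in enumerate(b1) if s == sym)
    j = next(k for k, (s, _) in enumerate(b2) if s == complement(sym))
    return b1[i + 1:] + b1[:i] + b2[j + 1:] + b2[:j]

def loop_glue(b, sym, tol=1e-6):
    """Loop gluing: close a cycle by consuming (sym ... sym~), allowed
    only when the cumulative turn between the pair is ±360°."""
    i = next(k for k, (s, _) in enumerate(b) if s == sym)
    j = next(k for k, (s, _) in enumerate(b) if s == complement(sym))
    lo, hi = sorted((i, j))
    turn = sum(t for _, t in b[lo + 1:hi])
    if abs(abs(turn) - 360.0) > tol:
        raise ValueError(f"turn between pair is {turn}, not ±360")
    return b[:lo] + b[lo + 1:hi] + b[hi + 1:]
```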
In 3D, ∂G is a 2D surface graph: half-edges become half-faces. We implement a simplified version that uses edge-direction vectors, discretised into 26 directions, as the matching alphabet. The full 3D boundary (graph isomorphism on ∂G) is deferred: gluing combinations grow as O(d!) in vertex degree d, making high-valence vertices intractable.
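The 26-direction alphabet is just the nonzero offsets of a 3×3×3 neighbourhood; a small sketch of the snapping step (function and variable names are ours):

```python
from itertools import product
import numpy as np

# All 26 nonzero offsets of a 3x3x3 neighbourhood, normalised to unit length.
DIRS = np.array([d for d in product((-1, 0, 1), repeat=3) if any(d)], float)
DIRS /= np.linalg.norm(DIRS, axis=1, keepdims=True)

def direction_symbol(v):
    """Snap an edge-direction vector to the nearest of the 26 directions;
    the returned index is the half-edge's symbol in the matching alphabet."""
    v = np.asarray(v, float) / np.linalg.norm(v)
    return int(np.argmax(DIRS @ v))  # max dot product = smallest angle
```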
Frequency-based rule mining detects symmetry without any explicit symmetry analysis. For a chair mesh, the four leg-attachment subgraphs produce identical boundary strings, giving that rule a frequency of 4; the extractor thus identifies R_leg with multiplicity 4, encoding 4-fold leg symmetry automatically. n-fold rotational symmetry always produces rules with frequency exactly n. The table below summarises extraction results on four test shapes; a frequency-counting sketch follows it.
| Shape | Primitives | Rules | Max rule freq | Symmetry captured |
|---|---|---|---|---|
| Chair | 12 | 7 | 4 | 4-fold legs, 2-fold back |
| Arch (cycle) | 5 | 3 | 4 | Loop-closing rule |
| Grid lattice | 9 | 4 | 6 | Translational (horizontal + vertical) |
| Tree (no cycle) | 7 | 3 | 2 | Binary branching |
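The frequency counting itself is a one-liner once boundary strings are canonicalised for cyclic rotation; a hedged sketch (the symbols below are made up for illustration):

```python
from collections import Counter

def canonical(boundary):
    """Lexicographically smallest rotation, so rotated copies of the same
    local structure hash to the same rule."""
    return min(boundary[i:] + boundary[:i] for i in range(len(boundary)))

def mine_rules(boundaries):
    """Frequency of each canonical boundary string across extracted
    subgraphs; frequency n signals an n-fold symmetric substructure."""
    return Counter(canonical(b) for b in boundaries)

# Hypothetical chair: four leg subgraphs, two back-post subgraphs.
print(mine_rules(["aL", "La", "aL", "aL", "bB", "Bb"]))
# Counter({'La': 4, 'Bb': 2})  -> R_leg multiplicity 4, 2-fold back
```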
Grammar extraction gives rule set R. Generation applies rules stochastically — unguided, no conditioning on image or semantic intent. The natural extension replaces stochastic selection with learned prediction: given an image, predict which rule to apply at each step.
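The unguided baseline is a simple sampling loop; a sketch under an assumed rule interface (the `matches`/`apply` methods are illustrative, not Merrell's API):

```python
import random

def generate(rules, state, max_steps=100, seed=0):
    """Unguided generation: apply a uniformly random applicable rule until
    the boundary closes (no open half-edges) or the step budget runs out."""
    rng = random.Random(seed)
    for _ in range(max_steps):
        applicable = [r for r in rules if r.matches(state)]
        if not applicable:  # ∂G empty, or no rule fits: stop
            break
        state = rng.choice(applicable).apply(state)
    return state
```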
We propose a DiT-based sequence model [4] trained on (image, rule-sequence) pairs. DINOv2 [2] encodes the image into 768-dimensional features, and the sequence model is trained with flow matching [3] over a continuous relaxation of the discrete rule-selection space. The grammar extractor thus serves as a training-data generator, not just an end system.
p(rₜ | r₁, …, rₜ₋₁, x_img) ≈ DiT(DINOv2(x_img), [r₁, …, rₜ₋₁])

Grammar extraction works reliably for shapes with repetitive structure. Shapes without repeated substructure produce grammars with near-zero rule reuse; the method is a poor fit for freeform organic geometry. The non-differentiable executor gap (no gradient flows through rule application) prevents end-to-end training; a neural predictor driving a black-box executor is the viable path.
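A minimal stand-in for that predictor, using an autoregressive transformer decoder with a softmax head in place of flow matching over the continuous relaxation (all sizes and names are assumptions, as is the use of precomputed 768-d DINOv2 patch tokens):

```python
import torch
import torch.nn as nn

class RulePredictor(nn.Module):
    """p(r_t | r_<t, x_img): decode rule tokens, cross-attending to image
    features. A sketch, not the proposed DiT/flow-matching model."""
    def __init__(self, n_rules, d_model=768, n_heads=8, n_layers=6):
        super().__init__()
        self.emb = nn.Embedding(n_rules + 1, d_model)  # +1: BOS token
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_rules)

    def forward(self, rule_ids, img_tokens):
        # rule_ids: (B, T) rules applied so far; img_tokens: (B, N, 768)
        x = self.emb(rule_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, img_tokens, tgt_mask=mask.to(x.device))
        return self.head(h)  # next-rule logits at each step
```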
This work feeds directly into SculptNet, which applies a structurally similar inductive bias to coarse-to-fine primitive assembly, and into building elevation reconstruction, where grammar-like constraints govern the 6-plane mesh executor. The architectural template — structured input → rule extraction → program → geometry — recurs across the entire thesis.