A transformer encoder–decoder architecture trained on pairs of polyline USD files and bridge USD scenes. The network learns to generate executable construction programs in a domain-specific language — producing outputs that are human-readable, parametrically editable, and version-controllable without mesh manipulation.
Training USD pairs: 15
DSL loss (ep. 25): 0.097
Parameters: ~60M
Thesis · Apple Maps · Houdini / USD
Layer 01
Polyline Input
Variable-length sequence of n vertices pᵢ ∈ ℝ³ with semantic boundary attributes. Grid surface encodes spatial topology. CLOSED/OPEN boundary labels directly condition downstream generation.
Layer 02
Transformer Encoder
8-head multi-head attention, d=512, 6 layers. Sinusoidal positional encoding. Produces context vector c_P ∈ ℝ⁵¹². Token blocks on the layer surface represent attention heads processing each polyline vertex.
Layer 03
Latent Space Z ∈ ℝ⁵¹²
Cross-attention fusion of polyline and boundary encodings. Neural graph scatter represents the learned latent distribution over bridge construction programs. Dual-loss: cross-entropy + MSE performance, λ(e): 0.1→0.5.
Layer 04
Transformer Decoder
Causal autoregressive DSL generation. Cross-attention over encoder output at each step. Causal mask enforces left-to-right token ordering. Stacked token cards represent the growing output sequence.
Layer 05
DSL Token Stream → USD
Dense grid surface represents the USD scene graph built by the deterministic executor. BRIDGE_START, GENERATE_DECK, IF_BOUNDARY, BRIDGE_END — human-readable, parametrically editable, version-controllable as text.
Core Thesis
Generate the program that builds geometry. Not the geometry.
A triangle mesh is a terminal artifact — it cannot be re-parameterised, only destructively edited. A construction program is a reusable intent specification. Modifying one parameter token regenerates the entire USD scene in milliseconds. This is what Apple Maps production pipelines require — and the gap that existing 3D generative models do not fill.
§ 1
Network Architecture
The PGN architecture decomposes into four submodules. The PolylineEncoder applies a transformer encoder to the sequence of 3D polyline vertices, producing a context vector c_P ∈ ℝ⁵¹² via a learned [CLS] token aggregation. A parallel AttributeEncoder embeds the boundary attribute sequence into the same dimensionality, c_B ∈ ℝ⁵¹². A cross-attention Fusion module combines both paths into the shared latent z ∈ ℝ⁵¹².
The DSL Decoder — a causal transformer — generates the construction token sequence autoregressively from z, attending over the full encoder output at each step. A lightweight PerformanceEvaluator MLP branches from z and predicts primitive count and rendering cost, contributing to the curriculum-weighted performance loss during training.
Polyline Enc.
Transformer 6L · 8H · d=512
~25M params · n×3 → 512
Attribute Enc.
Transformer 3L · 4H · d=512
~8M params · m×3 → 512
Fusion
Cross-attention · 1L
~2M params · 512+512 → 512
DSL Decoder
Causal Trans. 6L · 8H
~24M params · 512 → |V|
Perf. Evaluator
MLP · 3L · ReLU
~0.5M params · 512 → 2
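The per-module parameter figures above can be sanity-checked with a weights-only estimate for standard transformer layers (d=512, FF dim 2048, per Table 1). The estimate ignores embeddings, biases, output heads, and layer norms, so it is only expected to agree to order of magnitude:

```python
def transformer_layer_params(d: int, ff: int, cross_attention: bool = False) -> int:
    """Weight count for one transformer layer: Q/K/V/O projections (4*d^2,
    doubled when a cross-attention block is present) plus the two
    feed-forward matrices (2*d*ff). Biases and layer norms are ignored."""
    attn = 4 * d * d * (2 if cross_attention else 1)
    return attn + 2 * d * ff

d_model, ff_dim = 512, 2048
enc_params = 6 * transformer_layer_params(d_model, ff_dim)        # PolylineEncoder
attr_params = 3 * transformer_layer_params(d_model, ff_dim)       # AttributeEncoder
dec_params = 6 * transformer_layer_params(d_model, ff_dim, cross_attention=True)
```

This gives roughly 18.9M for the 6-layer encoder, 9.4M for the 3-layer attribute encoder, and 25.2M for the decoder; the gaps relative to the quoted figures are consistent with embeddings and output heads being counted there, and the attribute encoder plausibly using a narrower feed-forward width.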
§ 2
Training & Loss Formulation
Training follows a dual-loss curriculum. The total loss at epoch e is ℒ = ℒ_DSL + λ(e) · ℒ_perf, where ℒ_DSL is token-level cross-entropy and ℒ_perf is mean squared error on primitive count prediction. The curriculum weight λ(e) = 0.1 + 0.4·(e/E) increases linearly, ensuring structural correctness is learned before geometric efficiency is penalised. AdamW optimiser: encoder LR 1×10⁻⁴, decoder LR 1×10⁻⁵. Cosine annealing with warm restarts, T₀=100. Gradient clipping at norm 1.0.
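The schedule components above can be sketched directly. The curriculum weight is exactly the formula in the text; for the learning-rate schedule, eta_min = 0 and a constant restart period are assumptions, since the setup only states T₀=100:

```python
import math

def curriculum_weight(e: int, total_epochs: int = 25) -> float:
    """Linear curriculum from the training objective: lambda(e) = 0.1 + 0.4*(e/E)."""
    return 0.1 + 0.4 * (e / total_epochs)

def cosine_annealed_lr(step: int, base_lr: float, t0: int = 100) -> float:
    """Cosine annealing with warm restarts (SGDR-style): within each period of
    t0 steps the LR decays from base_lr towards 0, then restarts."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * (step % t0) / t0))
```

For example, `curriculum_weight(0)` returns 0.1 and `curriculum_weight(25)` returns 0.5, matching the stated 0.1→0.5 ramp.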
Fig. 2 — Training loss over 25 epochs. ℒ_total (solid) converges 0.950→0.473. ℒ_DSL (dashed) 0.500→0.097. ℒ_perf (dotted, scaled ÷10 for readability) 0.812→0.375. Monotonic convergence across all three components with no instabilities.
Epoch   ℒ_total   ℒ_DSL   ℒ_perf   λ(e)    LR (dec.)
1       0.950     0.500   8.12     0.101   1.0×10⁻⁵
5       0.801     0.420   7.36     0.119   9.8×10⁻⁶
10      0.694     0.310   6.45     0.138   9.1×10⁻⁶
15      0.601     0.221   5.20     0.157   7.8×10⁻⁶
20      0.543     0.156   4.62     0.176   6.2×10⁻⁶
25      0.473     0.097   3.75     0.100   4.8×10⁻⁶
§ 3
DSL Grammar Specification
The DSL is a context-free grammar G = (V, Σ, R, S) over 24 terminal tokens in six classes. The vocabulary is deliberately minimal — sufficient to represent the full structural variation in the 15-pair corpus while remaining learnable at this dataset scale. BOUNDARY_TYPE tokens (CLOSED, OPEN, SEAM) are first-class grammar elements derived from Apple Maps USD semantic attributes, enabling learned attribute-conditioned branching without explicit rule injection.
Fig. 3 — DSL grammar parse tree. CONDITIONAL node (highlighted) is the key structural feature — the network learns attribute-conditioned branching from CLOSED/OPEN/SEAM tokens without explicit rule injection.
C-01
Component spatial misalignment in USD scene assembly
Deck, railing, and pillar geometries generated as disconnected USD prims. Root cause: feature_id attribute propagation through Houdini's piece attribute in the For-Each loop was not threading correctly to the deck generation node. Resolved via explicit polypath + ends node anchoring with per-feature-ID primitive binding.
Resolved
C-02
Bridge geometry non-conformant to polyline curvature
Straight-line interpolation in the executor produced geometrically incorrect planar bridges on curved polylines. Resolved by computing the discrete Frenet frame along the polyline tangent vector at each deck segment, using normal and binormal to orient geometry placement in world space.
Resolved
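The C-02 fix above can be sketched as follows. This is a minimal NumPy version of the per-segment frame computation; it uses a fixed reference up vector to keep the normal stable on straight runs (where the true Frenet normal is undefined), and assumes no segment is exactly vertical:

```python
import numpy as np

def frames_along_polyline(points: np.ndarray, up=(0.0, 0.0, 1.0)):
    """Per-segment (tangent, normal, binormal) frames for an (n, 3) polyline,
    used to orient each deck segment in world space instead of straight-line
    interpolation."""
    up = np.asarray(up, dtype=float)
    frames = []
    for a, b in zip(points[:-1], points[1:]):
        t = b - a
        t /= np.linalg.norm(t)            # unit tangent (deck direction)
        n = np.cross(up, t)
        n /= np.linalg.norm(n)            # horizontal normal (deck width axis)
        bn = np.cross(t, n)               # binormal (deck up axis)
        frames.append((t, n, bn))
    return frames

# Right-angle bend: the tangent rotates from +x to +y across the corner.
poly = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0]], dtype=float)
(t0, n0, b0), (t1, n1, b1) = frames_along_polyline(poly)
```

Each frame is orthonormal by construction, so deck cross-sections placed along (n, bn) stay perpendicular to the polyline tangent even on curved spans.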
C-03
Apple Maps USD boundary attribute semantics undocumented
Empirically validated across 15 training pairs: CLOSED → railing generation active; OPEN → open edge, no railing; SEAM → continuity joint between sections. No public documentation available for these internal attribute conventions.
Resolved
C-04
Non-differentiable supervision path through DSL executor
The executor is a deterministic interpreter — no gradient flows from 3D geometry back through exec(T) to the decoder. Direct end-to-end 3D geometry loss is therefore impossible. Mitigations under investigation: (1) nvdiffrast rendering-based supervision on executor output; (2) Vector DB token-level similarity signal using pre-indexed program–geometry pairs. Neither approach fully resolves the supervision gap.
Open
Interactive Demo
A default bridge is pre-loaded and rotating. Click to add vertices to the polyline input, select boundary type, then generate a new DSL program and 3D preview. Use the Clear button to start fresh.
01 — Polyline Input
Default example loaded · Click to add vertices
Boundary
02 — DSL Token Stream
03 — Bridge Wireframe (3D) · Drag to rotate
Full Technical Paper
arXiv-format preprint · PGN: Transformer-Based Procedural Generator Network for 3D Bridge Synthesis from Polyline Semantic Attributes
PGN: A Transformer-Based Procedural Generator Network for 3D Bridge Synthesis from Polyline Semantic Attributes
Aditya Jain
Apple Maps · 3D Reconstruction Group, Hyderabad · Thesis Research, Unpublished Preprint
Submitted: September 2025
Subject: cs.GR · cs.LG
MSC: 68T07 · 65D18
Keywords: procedural generation, program synthesis, seq2seq, USD, transformer
Abstract
We present the Procedural Generator Network (PGN), a sequence-to-sequence transformer architecture that maps polyline geometric input annotated with semantic boundary attributes to executable domain-specific language (DSL) programs that construct 3D bridge geometry in Universal Scene Description (USD) format. Unlike direct geometry generation methods — SDF diffusion, NeRF, mesh generation — PGN generates construction programs: ordered sequences of human-readable, parametrically editable procedural commands that a deterministic executor converts to watertight USD scenes at runtime. The network is trained on 15 polyline–bridge USD pairs sourced from Apple Maps production data, using a dual-loss objective combining token-level cross-entropy reconstruction with a curriculum-weighted performance loss penalising geometric redundancy, with λ(e) increasing linearly from 0.1 to 0.5. We demonstrate a DSL reconstruction loss of 0.097 at epoch 25, achieving successful program execution with correct deck curvature, boundary-conditioned railing generation, and pillar instantiation on test cases. The primary identified open problem is executor non-differentiability: no gradient flows from 3D geometry back through the deterministic interpreter to the decoder, precluding direct end-to-end training with a geometric loss. This architectural pattern — structured geometric input → latent representation → construction program → 3D output — serves as the foundation for all subsequent thesis work.
Keywords: procedural generation, program synthesis, 3D geometry, transformer, Universal Scene Description, seq2seq learning.
1. Introduction
Contemporary 3D generative models — diffusion-based mesh synthesis [1], neural radiance fields [2], and SDF-based shape generation [3] — produce geometrically high-quality outputs but share a fundamental limitation for production use: their outputs are static triangle meshes or volumetric fields that cannot be re-parameterised without destructive editing. In the context of Apple Maps bridge geometry, this limitation is operationally critical: bridge assets must be editable, versioned, and regeneratable from updated polyline surveys without repeating the full generative inference cycle.
We propose a different generative target. Rather than mapping input geometry to output geometry, we train a network to map input geometry to an executable construction program. This program, expressed in a formal DSL, encodes the structural intent of the bridge as an ordered sequence of human-readable commands. A deterministic executor converts the program to a USD scene at runtime. The resulting system produces outputs that: (i) can be inspected and edited without specialised 3D tools; (ii) can be re-executed with modified parameters to produce geometry variants; (iii) can be version-controlled as text alongside other production assets.
This paper makes the following contributions: (1) a transformer encoder–decoder architecture adapted for the polyline-to-DSL mapping problem; (2) a context-free DSL grammar over 24 terminal tokens sufficient to describe the full structural variation in Apple Maps bridge data; (3) a dual-loss training curriculum combining reconstruction fidelity and geometric efficiency objectives; (4) an analysis of the non-differentiable executor gap — the primary open problem for all executor-based program synthesis approaches to 3D geometry generation; (5) the architectural template that all subsequent thesis projects extend.
Figure 1: PGN architecture. Polyline vertices P and boundary attributes B are encoded separately and fused via cross-attention into latent z∈ℝ⁵¹². The causal decoder generates DSL token sequence T autoregressively. A PerformanceEvaluator MLP branches from z to contribute ℒ_perf to the training objective.
2. Related Work
2.1 Procedural 3D Generation
Procedural modelling systems — CityEngine [4], Houdini procedural networks, shape grammars [5] — define geometry through parameterised rule grammars, enabling scalable urban content production but requiring manual rule authoring per asset category. ProcGen3D [6] explores learning grammar rules from image data using autoregressive edge-token prediction with MCTS-guided sampling, but addresses 2D graph structures. PGN extends to 3D USD scene graphs with continuous parameter prediction and semantic attribute conditioning.
2.2 Program Synthesis for Geometry
ShapeAssembly [7] generates part-based shape assembly programs via a language model trained on a curated DSL. CSGNet [8] recovers constructive solid geometry programs from 3D shapes using imitation learning. Our approach differs in three respects: (1) the input is structured geometric data with semantic attributes rather than a shape category label; (2) the output DSL targets direct execution in Houdini production pipelines; (3) the grammar explicitly encodes Apple Maps USD boundary attributes as first-class tokens enabling conditional generation.
2.3 3D Representations for ML Supervision
Sparse voxel representations (FVDB [9]) provide Houdini-native integration and city-scale memory efficiency but do not support differentiable supervision. SDF-based representations [10] provide smooth gradient fields for geometry learning but are prohibitively expensive at urban scene scale. We use USD mesh with graph neural network supervision for per-component geometric loss, reserving VDB for production storage; a parallel evaluation in the concurrent USD pipeline research confirmed VDB as the optimal production storage format.
3. Method
3.1 Problem Formulation
Let P = {p₁,...,pₙ}, pᵢ∈ℝ³ be a polyline with n vertices, and B = {b₁,...,bₘ}∈{CLOSED, OPEN, SEAM}ᵐ its corresponding boundary attribute sequence. The goal is to learn a mapping f:(P,B)→T={t₁,...,tₗ} where T is a sequence of DSL tokens drawn from vocabulary V, such that exec(T) produces a geometrically valid USD bridge scene consistent with the structural intent encoded in (P,B).
3.2 Network Architecture
The PolylineEncoder E_P applies a standard Transformer encoder [11] with sinusoidal positional encoding to P, producing contextualised vertex representations h₁,...,hₙ∈ℝ⁵¹². A [CLS] token aggregates these into context vector c_P∈ℝ⁵¹². The AttributeEncoder E_B applies a smaller 3-layer transformer to B, producing c_B∈ℝ⁵¹². Cross-attention fusion F combines both: z = F(c_P, c_B)∈ℝ⁵¹².
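The sinusoidal positional encoding used by E_P is the standard formulation from the Transformer literature; a minimal NumPy sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions: int, d_model: int = 512) -> np.ndarray:
    """Standard sinusoidal encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(n_positions)[:, None]                              # (n, 1)
    div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))   # (d/2,)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(positions * div)
    pe[:, 1::2] = np.cos(positions * div)
    return pe

pe = sinusoidal_positional_encoding(128, 512)  # one row per polyline vertex slot
```

Each polyline vertex embedding receives the row for its sequence position before entering the encoder stack.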
The DSL Decoder D_T is a causal transformer that generates T autoregressively. At step t it attends over the encoder output and previously generated tokens t₁,...,tₜ₋₁ via cross-attention and causal self-attention respectively, predicting the next token from vocabulary V. The PerformanceEvaluator G:z→(ĉ,r̂) is a 3-layer MLP predicting primitive count ĉ and rendering cost r̂.
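The fusion step z = F(c_P, c_B) can be sketched as single-head cross-attention in which the polyline context queries the attribute-token representations. The random weight initialisation and the residual form are illustrative assumptions, not the trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(c_p, h_b, Wq, Wk, Wv):
    """Single-head cross-attention sketch of the Fusion module: the polyline
    context c_p (d,) queries attribute representations h_b (m, d); the
    attended value is added residually, giving z in R^d."""
    d = c_p.shape[0]
    q = c_p @ Wq                            # (d,)
    k = h_b @ Wk                            # (m, d)
    v = h_b @ Wv                            # (m, d)
    scores = softmax(k @ q / np.sqrt(d))    # (m,) attention over boundary tokens
    return c_p + scores @ v                 # residual fusion -> z in R^512

rng = np.random.default_rng(0)
d, m = 512, 6
z = cross_attention_fusion(rng.normal(size=d), rng.normal(size=(m, d)),
                           rng.normal(size=(d, d)) / np.sqrt(d),
                           rng.normal(size=(d, d)) / np.sqrt(d),
                           rng.normal(size=(d, d)) / np.sqrt(d))
```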
3.3 Loss Formulation
Reconstruction loss is token cross-entropy:
ℒ_DSL = −Σₜ log p(tₜ | t₁,...,tₜ₋₁, z)
Performance loss penalises geometric redundancy:
ℒ_perf = MSE(ĉ, c*) + α·r̂, α=0.01
Total loss with curriculum weight:
ℒ = ℒ_DSL + λ(e)·ℒ_perf, λ(e) = 0.1 + 0.4·(e/E)
where c* is the ground-truth primitive count for the training bridge, E=25 epochs. The linear curriculum ensures structural correctness is learned (low λ early) before efficiency is optimised (higher λ late).
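Putting the three equations together, the full objective is a few lines of NumPy. Averaging ℒ_DSL over tokens (rather than summing) is an assumption for numerical convenience:

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    m = x.max(axis=-1, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=-1, keepdims=True))

def total_loss(logits, targets, c_hat, c_star, r_hat, epoch, E=25, alpha=0.01):
    """L = L_DSL + lambda(e) * L_perf, as defined above.
    logits: (L, |V|) decoder outputs; targets: (L,) ground-truth token ids;
    c_hat / r_hat: PerformanceEvaluator predictions; c_star: true primitive count."""
    l_dsl = -log_softmax(logits)[np.arange(len(targets)), targets].mean()
    l_perf = (c_hat - c_star) ** 2 + alpha * r_hat
    lam = 0.1 + 0.4 * (epoch / E)
    return l_dsl + lam * l_perf

# Uniform logits over the 24-token vocabulary give L_DSL = ln(24).
loss0 = total_loss(np.zeros((5, 24)), np.zeros(5, dtype=int),
                   c_hat=462.0, c_star=462.0, r_hat=0.0, epoch=0)
```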
Table 1: PGN network hyperparameters

Parameter         Value
d_model           512
Encoder layers    6
Decoder layers    6
Attention heads   8
FF dim            2048
Dropout           0.1
Batch size        4
LR (encoders)     1×10⁻⁴
LR (decoder)      1×10⁻⁵
LR schedule       Cosine anneal, T₀=100
Gradient clip     norm 1.0
Training epochs   25
4. DSL Grammar
The DSL is a context-free grammar G=(V,Σ,R,S) where V={BRIDGE_PROGRAM, STMT_LIST, STMT, BOUNDARY_CLAUSE, PARAM} are non-terminals, Σ is the 24-token terminal alphabet, R are production rules, and S=BRIDGE_START is the start symbol. The grammar is deliberately minimal — spanning pedestrian bridges (~250 primitives) to major infrastructure (~7,000 primitives) with 24 terminal types. BOUNDARY_TYPE tokens are first-class grammar elements, enabling the decoder to learn attribute-conditioned branching (IF_BOUNDARY CLOSED → generate railings) without explicit rule injection.
A key design decision was making CONTINUITY_SEAM a first-class boundary type rather than treating it as a modifier. This allows the network to learn that seam boundaries produce continuity joints with adjacent bridge segments — a structural relationship that would be opaque if encoded as a parameter value.
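A minimal structural validity check over the grammar's delimiters and boundary clauses can be sketched as follows. Token names come from the grammar description above; the sample program and the GENERATE_RAILING token name are illustrative assumptions, not ground-truth model output:

```python
BOUNDARY_TYPES = {"CLOSED", "OPEN", "SEAM"}

def is_valid_program(tokens) -> bool:
    """Minimal structural check: the program must be delimited by
    BRIDGE_START / BRIDGE_END, and every IF_BOUNDARY must be immediately
    followed by a boundary-type token. A full parser would check all
    production rules R of G = (V, Sigma, R, S)."""
    if not tokens or tokens[0] != "BRIDGE_START" or tokens[-1] != "BRIDGE_END":
        return False
    for i, tok in enumerate(tokens):
        if tok == "IF_BOUNDARY":
            if i + 1 >= len(tokens) or tokens[i + 1] not in BOUNDARY_TYPES:
                return False
    return True

program = ["BRIDGE_START", "GENERATE_DECK",
           "IF_BOUNDARY", "CLOSED", "GENERATE_RAILING",
           "BRIDGE_END"]
```

Checks of this kind run before execution, so a malformed decoder output is rejected rather than handed to the USD executor.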
Figure 2: Training loss curves. ℒ_total (solid) converges from 0.950 to 0.473. ℒ_DSL (dashed) from 0.500 to 0.097. Both components converge monotonically without instabilities. Training on M4 iMac, 32GB unified memory, MPS backend.
5. Experiments
5.1 Dataset
The training corpus consists of 15 polyline–bridge USD pairs from Apple Maps production data, spanning four structural categories: pedestrian bridges (n=4), highway overpasses (n=4), curved interchanges (n=4), and major infrastructure (n=3). Polyline lengths range from 39 to 263 vertices; bridge primitive counts from 250 to 10,420. No public benchmark exists for this task; evaluation is loss-based and execution-validated given the dataset scale constraint.
5.2 Quantitative Results
Training converges stably over 25 epochs. DSL reconstruction loss falls from 0.500 (epoch 1) to 0.097 (epoch 25). Total loss falls from 0.950 to 0.473. Performance loss falls from 8.12 to 3.75, indicating progressively more compact program generation. A representative test case — medium-complexity curved bridge (ground truth: 462 primitives) — executes with correct deck curvature, CLOSED-boundary railings, and 6 evenly-spaced pillars. The execution status from the training log confirms: Execution: SUCCESS – 462 polygons.
5.3 Failure Mode Analysis
Two systematic failure modes are identified. First, BRIDGE_END omission on long sequences (>8 tokens): the model occasionally fails to predict the end delimiter, causing parser failure. Addressed by appending a forced end token at inference. Second, SET_DETAIL_LEVEL parameter values occasionally fall outside the valid continuous range [0,1], producing geometrically degenerate deck widths. Corrected with a clamping post-processor at inference time.
Both failure modes are consistent with known pathologies in autoregressive sequence generation: end-of-sequence prediction difficulty and continuous parameter out-of-distribution extrapolation. Both are mechanical problems rather than fundamental architectural failures.
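Both inference-time repairs described above can be sketched in one post-processor. The `params` layout (a map from token index to continuous value, with the parameter slot directly following its opcode) is an assumption for illustration:

```python
def postprocess(tokens, params):
    """Inference-time repairs for the two identified failure modes:
    (1) force the BRIDGE_END delimiter if the model omitted it;
    (2) clamp SET_DETAIL_LEVEL parameter values into the valid range [0, 1]."""
    tokens = list(tokens)
    if not tokens or tokens[-1] != "BRIDGE_END":
        tokens.append("BRIDGE_END")           # forced end token
    fixed = dict(params)
    for i, tok in enumerate(tokens):
        if tok == "SET_DETAIL_LEVEL" and i + 1 in fixed:
            fixed[i + 1] = min(1.0, max(0.0, fixed[i + 1]))  # clamp to [0, 1]
    return tokens, fixed

toks, prm = postprocess(["BRIDGE_START", "SET_DETAIL_LEVEL", "PARAM"], {2: 1.7})
```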
6. Discussion
6.1 The Non-Differentiable Executor Gap
The central open problem is executor non-differentiability. Because exec(T) is a deterministic interpreter, no gradient signal flows from the 3D geometry back through the executor to the decoder. End-to-end training with a direct 3D geometry loss — the natural objective for this task — is therefore impossible with the current architecture. Two mitigation directions are under investigation: (1) approximating exec as differentiable via nvdiffrast [12] rendering supervision; (2) using a Vector DB of program–geometry pairs as retrieval-augmented similarity signal at the token level. Neither approach fully closes the gap.
6.2 Dataset Scale Limitation
15 training pairs is insufficient for robust generalisation to unseen bridge topologies. The model memorises training examples rather than learning truly generalisable structural principles. Data augmentation via procedural variation — randomised parameter perturbation with re-execution — would increase effective dataset size while maintaining geometric validity. This extension is planned but not yet implemented.
7. Conclusion
PGN demonstrates the viability of training a transformer seq2seq architecture to generate executable procedural programs for 3D geometry construction from structured geometric input with semantic attributes. The dual-loss curriculum effectively balances reconstruction fidelity and geometric efficiency, converging to DSL loss 0.097 over 25 epochs. The DSL approach produces outputs with properties unavailable in direct mesh generation paradigms: human readability, parametric editability, and version-controllability as text.
The non-differentiable executor and small dataset scale are the primary barriers to scaling. Both are tractable engineering problems. The architectural pattern established here — geometric input → latent → construction program → 3D output — is extended in all subsequent thesis work: SketchProc3D replaces polylines with sketch input; graph grammar research explores automatic grammar extraction; SculptNet extends to coarse-to-fine primitive assembly; building elevation reconstruction applies the pattern at city scale with street-view image input and a 6-plane mesh reconstruction executor.
References
[1] Zheng et al. "Locally Attentional SDF Diffusion for Controllable 3D Shape Generation." ACM Trans. Graph., 42(4), 2023. doi:10.1145/3592103
[2] Mildenhall et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." ECCV, 2020.
[3] Park et al. "DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation." CVPR, 2019.
[4] Müller et al. "Procedural Modeling of Buildings." ACM SIGGRAPH, 2006.
[5] Prusinkiewicz & Lindenmayer. "The Algorithmic Beauty of Plants." Springer, 1990.
[6] Li et al. "ProcGen3D: Edge-Based Tokenization for Procedural 3D Graph Generation." arXiv, 2025.
[7] Jones et al. "ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis." ACM Trans. Graph., 39(6), 2020.
[8] Sharma et al. "CSGNet: Neural Shape Parser for Constructive Solid Geometry." CVPR, 2018.
[9] Williams et al. "fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence." ACM Trans. Graph., 43(4), 2024.
[11] Vaswani et al. "Attention Is All You Need." NeurIPS, 2017.
[12] Laine et al. "Modular Primitives for High-Performance Differentiable Rendering." ACM Trans. Graph., 39(6), 2020.