A transformer encoder–decoder architecture trained on pairs of polyline USD files and bridge USD scenes. The network learns to generate executable construction programs in a domain-specific language — producing outputs that are human-readable, parametrically editable, and version-controllable without mesh manipulation.
Training USD pairs: 15
DSL loss (ep. 25): 0.097
Parameters: ~60M
Thesis · Apple Maps · Houdini / USD
Layer 01
Polyline Input
Variable-length sequence of n vertices pᵢ ∈ ℝ³ with semantic boundary attributes. Grid surface encodes spatial topology. CLOSED/OPEN boundary labels directly condition downstream generation.
Layer 02
Transformer Encoder
8-head multi-head attention, d=512, 6 layers. Sinusoidal positional encoding. Produces context vector c_P ∈ ℝ⁵¹². Token blocks on the layer surface represent attention heads processing each polyline vertex.
Layer 03
Latent Space Z ∈ ℝ⁵¹²
Cross-attention fusion of polyline and boundary encodings. Neural graph scatter represents the learned latent distribution over bridge construction programs. Dual-loss: cross-entropy + MSE performance, λ(e): 0.1→0.5.
Layer 04
Transformer Decoder
Causal autoregressive DSL generation. Cross-attention over encoder output at each step. Causal mask enforces left-to-right token ordering. Stacked token cards represent the growing output sequence.
Layer 05
DSL Token Stream → USD
Dense grid surface represents the USD scene graph built by the deterministic executor. BRIDGE_START, GENERATE_DECK, IF_BOUNDARY, BRIDGE_END — human-readable, parametrically editable, version-controllable as text.
Core Thesis
Generate the program that builds geometry. Not the geometry.
A triangle mesh is a terminal artifact — it cannot be re-parameterised, only destructively edited. A construction program is a reusable intent specification. Modifying one parameter token regenerates the entire USD scene in milliseconds. This is what Apple Maps production pipelines require — and the gap that existing 3D generative models do not fill.
§ 1
Network Architecture
The PGN architecture decomposes into four submodules. The PolylineEncoder applies a transformer encoder to the sequence of 3D polyline vertices, producing a context vector c_P ∈ ℝ⁵¹² via a learned [CLS] token aggregation. A parallel AttributeEncoder embeds the boundary attribute sequence into the same dimensionality, c_B ∈ ℝ⁵¹². A cross-attention Fusion module combines both paths into the shared latent z ∈ ℝ⁵¹².
The DSL Decoder — a causal transformer — generates the construction token sequence autoregressively from z, attending over the full encoder output at each step. A lightweight PerformanceEvaluator MLP branches from z and predicts primitive count and rendering cost, contributing to the curriculum-weighted performance loss during training.
Polyline Enc.
Transformer 6L · 8H · d=512
~25M params · n×3 → 512
Attribute Enc.
Transformer 3L · 4H · d=512
~8M params · m×3 → 512
Fusion
Cross-attention · 1L
~2M params · 512+512 → 512
DSL Decoder
Causal Trans. 6L · 8H
~24M params · 512 → |V|
Perf. Evaluator
MLP · 3L · ReLU
~0.5M params · 512 → 2
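The per-module parameter figures above can be sanity-checked with a weights-only estimate for standard transformer layers (d=512, FF dim 2048, per Table 1). The estimate ignores embeddings, biases, output heads, and layer norms, so it is only expected to agree to order of magnitude:

```python
def transformer_layer_params(d: int, ff: int, cross_attention: bool = False) -> int:
    """Weight count for one transformer layer: Q/K/V/O projections (4*d^2,
    doubled when a cross-attention block is present) plus the two
    feed-forward matrices (2*d*ff). Biases and layer norms are ignored."""
    attn = 4 * d * d * (2 if cross_attention else 1)
    return attn + 2 * d * ff

d_model, ff_dim = 512, 2048
enc_params = 6 * transformer_layer_params(d_model, ff_dim)        # PolylineEncoder
attr_params = 3 * transformer_layer_params(d_model, ff_dim)       # AttributeEncoder
dec_params = 6 * transformer_layer_params(d_model, ff_dim, cross_attention=True)
```

This gives roughly 18.9M for the 6-layer encoder, 9.4M for the 3-layer attribute encoder, and 25.2M for the decoder; the gaps relative to the quoted figures are consistent with embeddings and output heads being counted there, and the attribute encoder plausibly using a narrower feed-forward width.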
§ 2
Training & Loss Formulation
Training follows a dual-loss curriculum. The total loss at epoch e is ℒ = ℒ_DSL + λ(e) · ℒ_perf, where ℒ_DSL is token-level cross-entropy and ℒ_perf is mean squared error on primitive count prediction. The curriculum weight λ(e) = 0.1 + 0.4·(e/E) increases linearly, ensuring structural correctness is learned before geometric efficiency is penalised. AdamW optimiser: encoder LR 1×10⁻⁴, decoder LR 1×10⁻⁵. Cosine annealing with warm restarts, T₀=100. Gradient clipping at norm 1.0.
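The schedule components above can be sketched directly. The curriculum weight is exactly the formula in the text; for the learning-rate schedule, eta_min = 0 and a constant restart period are assumptions, since the setup only states T₀=100:

```python
import math

def curriculum_weight(e: int, total_epochs: int = 25) -> float:
    """Linear curriculum from the training objective: lambda(e) = 0.1 + 0.4*(e/E)."""
    return 0.1 + 0.4 * (e / total_epochs)

def cosine_annealed_lr(step: int, base_lr: float, t0: int = 100) -> float:
    """Cosine annealing with warm restarts (SGDR-style): within each period of
    t0 steps the LR decays from base_lr towards 0, then restarts."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * (step % t0) / t0))
```

For example, `curriculum_weight(0)` returns 0.1 and `curriculum_weight(25)` returns 0.5, matching the stated 0.1→0.5 ramp.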
Fig. 2 — Training loss over 25 epochs. ℒ_total (solid) converges 0.950→0.473. ℒ_DSL (dashed) 0.500→0.097. ℒ_perf (dotted, scaled ÷10 for readability) 0.812→0.375. Monotonic convergence across all three components with no instabilities.
Epoch   ℒ_total   ℒ_DSL   ℒ_perf   λ(e)    LR (dec.)
1       0.950     0.500   8.12     0.101   1.0×10⁻⁵
5       0.801     0.420   7.36     0.119   9.8×10⁻⁶
10      0.694     0.310   6.45     0.138   9.1×10⁻⁶
15      0.601     0.221   5.20     0.157   7.8×10⁻⁶
20      0.543     0.156   4.62     0.176   6.2×10⁻⁶
25      0.473     0.097   3.75     0.100   4.8×10⁻⁶
§ 3
DSL Grammar Specification
The DSL is a context-free grammar G = (V, Σ, R, S) over 24 terminal tokens in six classes. The vocabulary is deliberately minimal — sufficient to represent the full structural variation in the 15-pair corpus while remaining learnable at this dataset scale. BOUNDARY_TYPE tokens (CLOSED, OPEN, SEAM) are first-class grammar elements derived from Apple Maps USD semantic attributes, enabling learned attribute-conditioned branching without explicit rule injection.
Fig. 3 — DSL grammar parse tree. CONDITIONAL node (highlighted) is the key structural feature — the network learns attribute-conditioned branching from CLOSED/OPEN/SEAM tokens without explicit rule injection.
C-01
Component spatial misalignment in USD scene assembly
Deck, railing, and pillar geometries generated as disconnected USD prims. Root cause: feature_id attribute propagation through Houdini's piece attribute in the For-Each loop was not threading correctly to the deck generation node. Resolved via explicit polypath + ends node anchoring with per-feature-ID primitive binding.
Resolved
C-02
Bridge geometry non-conformant to polyline curvature
Straight-line interpolation in the executor produced geometrically incorrect planar bridges on curved polylines. Resolved by computing the discrete Frenet frame along the polyline tangent vector at each deck segment, using normal and binormal to orient geometry placement in world space.
Resolved
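The C-02 fix above can be sketched as follows. This is a minimal NumPy version of the per-segment frame computation; it uses a fixed reference up vector to keep the normal stable on straight runs (where the true Frenet normal is undefined), and assumes no segment is exactly vertical:

```python
import numpy as np

def frames_along_polyline(points: np.ndarray, up=(0.0, 0.0, 1.0)):
    """Per-segment (tangent, normal, binormal) frames for an (n, 3) polyline,
    used to orient each deck segment in world space instead of straight-line
    interpolation."""
    up = np.asarray(up, dtype=float)
    frames = []
    for a, b in zip(points[:-1], points[1:]):
        t = b - a
        t /= np.linalg.norm(t)            # unit tangent (deck direction)
        n = np.cross(up, t)
        n /= np.linalg.norm(n)            # horizontal normal (deck width axis)
        bn = np.cross(t, n)               # binormal (deck up axis)
        frames.append((t, n, bn))
    return frames

# Right-angle bend: the tangent rotates from +x to +y across the corner.
poly = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0]], dtype=float)
(t0, n0, b0), (t1, n1, b1) = frames_along_polyline(poly)
```

Each frame is orthonormal by construction, so deck cross-sections placed along (n, bn) stay perpendicular to the polyline tangent even on curved spans.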
C-03
Apple Maps USD boundary attribute semantics undocumented
Empirically validated across 15 training pairs: CLOSED → railing generation active; OPEN → open edge, no railing; SEAM → continuity joint between sections. No public documentation available for these internal attribute conventions.
Resolved
C-04
Non-differentiable supervision path through DSL executor
The executor is a deterministic interpreter — no gradient flows from 3D geometry back through exec(T) to the decoder. Direct end-to-end 3D geometry loss is therefore impossible. Mitigations under investigation: (1) nvdiffrast rendering-based supervision on executor output; (2) Vector DB token-level similarity signal using pre-indexed program–geometry pairs. Neither approach fully resolves the supervision gap.
Open
Interactive Demo
A default bridge is pre-loaded and rotating. Click to add vertices to the polyline input, select boundary type, then generate a new DSL program and 3D preview. Use the Clear button to start fresh.
01 — Polyline Input
Default example loaded · Click to add vertices
Boundary
02 — DSL Token Stream
03 — Bridge Wireframe (3D) · Drag to rotate
Full Technical Paper
arXiv-format preprint · PGN: Transformer-Based Procedural Generator Network for 3D Bridge Synthesis from Polyline Semantic Attributes
PGN: A Transformer-Based Procedural Generator Network for 3D Bridge Synthesis from Polyline Semantic Attributes
Aditya Jain
Apple Maps · 3D Reconstruction Group, Hyderabad · Thesis Research, Unpublished Preprint
Submitted: September 2025
Subject: cs.GR · cs.LG
MSC: 68T07 · 65D18
Keywords: procedural generation, program synthesis, seq2seq, USD, transformer
Abstract
We present the Procedural Generator Network (PGN), a sequence-to-sequence transformer architecture that maps polyline geometric input annotated with semantic boundary attributes to executable domain-specific language (DSL) programs that construct 3D bridge geometry in Universal Scene Description (USD) format. Unlike direct geometry generation methods — SDF diffusion, NeRF, mesh generation — PGN generates construction programs: ordered sequences of human-readable, parametrically editable procedural commands that a deterministic executor converts to watertight USD scenes at runtime. The network is trained on 15 polyline–bridge USD pairs sourced from Apple Maps production data, using a dual-loss objective combining token-level cross-entropy reconstruction with a curriculum-weighted performance loss penalising geometric redundancy, with λ(e) increasing linearly from 0.1 to 0.5. We demonstrate a DSL reconstruction loss of 0.097 at epoch 25, achieving successful program execution with correct deck curvature, boundary-conditioned railing generation, and pillar instantiation on test cases. The primary identified open problem is executor non-differentiability: no gradient flows from 3D geometry back through the deterministic interpreter to the decoder, precluding direct end-to-end training with a geometric loss. This architectural pattern — structured geometric input → latent representation → construction program → 3D output — serves as the foundation for all subsequent thesis work.
Keywords: procedural generation, program synthesis, 3D geometry, transformer, Universal Scene Description, seq2seq learning.
1. Introduction
Contemporary 3D generative models — diffusion-based mesh synthesis [1], neural radiance fields [2], and SDF-based shape generation [3] — produce geometrically high-quality outputs but share a fundamental limitation for production use: their outputs are static triangle meshes or volumetric fields that cannot be re-parameterised without destructive editing. In the context of Apple Maps bridge geometry, this limitation is operationally critical: bridge assets must be editable, versioned, and regeneratable from updated polyline surveys without repeating the full generative inference cycle.
We propose a different generative target. Rather than mapping input geometry to output geometry, we train a network to map input geometry to an executable construction program. This program, expressed in a formal DSL, encodes the structural intent of the bridge as an ordered sequence of human-readable commands. A deterministic executor converts the program to a USD scene at runtime. The resulting system produces outputs that: (i) can be inspected and edited without specialised 3D tools; (ii) can be re-executed with modified parameters to produce geometry variants; (iii) can be version-controlled as text alongside other production assets.
This paper makes the following contributions: (1) a transformer encoder–decoder architecture adapted for the polyline-to-DSL mapping problem; (2) a context-free DSL grammar over 24 terminal tokens sufficient to describe the full structural variation in Apple Maps bridge data; (3) a dual-loss training curriculum combining reconstruction fidelity and geometric efficiency objectives; (4) an analysis of the non-differentiable executor gap — the primary open problem for all executor-based program synthesis approaches to 3D geometry generation; (5) the architectural template that all subsequent thesis projects extend.
Figure 1: PGN architecture. Polyline vertices P and boundary attributes B are encoded separately and fused via cross-attention into latent z∈ℝ⁵¹². The causal decoder generates DSL token sequence T autoregressively. A PerformanceEvaluator MLP branches from z to contribute ℒ_perf to the training objective.
2. Related Work
2.1 Procedural 3D Generation
Procedural modelling systems — CityEngine [4], Houdini procedural networks, shape grammars [5] — define geometry through parameterised rule grammars, enabling scalable urban content production but requiring manual rule authoring per asset category. ProcGen3D [6] explores learning grammar rules from image data using autoregressive edge-token prediction with MCTS-guided sampling, but addresses 2D graph structures. PGN extends to 3D USD scene graphs with continuous parameter prediction and semantic attribute conditioning.
2.2 Program Synthesis for Geometry
ShapeAssembly [7] generates part-based shape assembly programs via a language model trained on a curated DSL. CSGNet [8] recovers constructive solid geometry programs from 3D shapes using imitation learning. Our approach differs in three respects: (1) the input is structured geometric data with semantic attributes rather than a shape category label; (2) the output DSL targets direct execution in Houdini production pipelines; (3) the grammar explicitly encodes Apple Maps USD boundary attributes as first-class tokens enabling conditional generation.
2.3 3D Representations for ML Supervision
Sparse voxel representations (FVDB [9]) provide Houdini-native integration and city-scale memory efficiency but do not support differentiable supervision. SDF-based representations [10] provide smooth gradient fields for geometry learning but are prohibitively expensive at urban scene scale. We use USD mesh with graph neural network supervision for per-component geometric loss, reserving VDB for production storage; a parallel evaluation in the concurrent USD pipeline research confirmed VDB as the optimal production storage format.
3. Method
3.1 Problem Formulation
Let P = {p₁,...,pₙ}, pᵢ∈ℝ³ be a polyline with n vertices, and B = {b₁,...,bₘ}∈{CLOSED, OPEN, SEAM}ᵐ its corresponding boundary attribute sequence. The goal is to learn a mapping f:(P,B)→T={t₁,...,tₗ} where T is a sequence of DSL tokens drawn from vocabulary V, such that exec(T) produces a geometrically valid USD bridge scene consistent with the structural intent encoded in (P,B).
3.2 Network Architecture
The PolylineEncoder E_P applies a standard Transformer encoder [11] with sinusoidal positional encoding to P, producing contextualised vertex representations h₁,...,hₙ∈ℝ⁵¹². A [CLS] token aggregates these into context vector c_P∈ℝ⁵¹². The AttributeEncoder E_B applies a smaller 3-layer transformer to B, producing c_B∈ℝ⁵¹². Cross-attention fusion F combines both: z = F(c_P, c_B)∈ℝ⁵¹².
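The sinusoidal positional encoding used by E_P is the standard formulation from the Transformer literature; a minimal NumPy sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions: int, d_model: int = 512) -> np.ndarray:
    """Standard sinusoidal encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(n_positions)[:, None]                              # (n, 1)
    div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))   # (d/2,)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(positions * div)
    pe[:, 1::2] = np.cos(positions * div)
    return pe

pe = sinusoidal_positional_encoding(128, 512)  # one row per polyline vertex slot
```

Each polyline vertex embedding receives the row for its sequence position before entering the encoder stack.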
The DSL Decoder D_T is a causal transformer that generates T autoregressively. At step t it attends over the encoder output and previously generated tokens t₁,...,tₜ₋₁ via cross-attention and causal self-attention respectively, predicting the next token from vocabulary V. The PerformanceEvaluator G:z→(ĉ,r̂) is a 3-layer MLP predicting primitive count ĉ and rendering cost r̂.
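The fusion step z = F(c_P, c_B) can be sketched as single-head cross-attention in which the polyline context queries the attribute-token representations. The random weight initialisation and the residual form are illustrative assumptions, not the trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(c_p, h_b, Wq, Wk, Wv):
    """Single-head cross-attention sketch of the Fusion module: the polyline
    context c_p (d,) queries attribute representations h_b (m, d); the
    attended value is added residually, giving z in R^d."""
    d = c_p.shape[0]
    q = c_p @ Wq                            # (d,)
    k = h_b @ Wk                            # (m, d)
    v = h_b @ Wv                            # (m, d)
    scores = softmax(k @ q / np.sqrt(d))    # (m,) attention over boundary tokens
    return c_p + scores @ v                 # residual fusion -> z in R^512

rng = np.random.default_rng(0)
d, m = 512, 6
z = cross_attention_fusion(rng.normal(size=d), rng.normal(size=(m, d)),
                           rng.normal(size=(d, d)) / np.sqrt(d),
                           rng.normal(size=(d, d)) / np.sqrt(d),
                           rng.normal(size=(d, d)) / np.sqrt(d))
```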
3.3 Loss Formulation
Reconstruction loss is token cross-entropy:
ℒ_DSL = −Σₜ log p(tₜ | t₁,...,tₜ₋₁, z)
Performance loss penalises geometric redundancy:
ℒ_perf = MSE(ĉ, c*) + α·r̂, α=0.01
Total loss with curriculum weight:
ℒ = ℒ_DSL + λ(e)·ℒ_perf, λ(e) = 0.1 + 0.4·(e/E)
where c* is the ground-truth primitive count for the training bridge, E=25 epochs. The linear curriculum ensures structural correctness is learned (low λ early) before efficiency is optimised (higher λ late).
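Putting the three equations together, the full objective is a few lines of NumPy. Averaging ℒ_DSL over tokens (rather than summing) is an assumption for numerical convenience:

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    m = x.max(axis=-1, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=-1, keepdims=True))

def total_loss(logits, targets, c_hat, c_star, r_hat, epoch, E=25, alpha=0.01):
    """L = L_DSL + lambda(e) * L_perf, as defined above.
    logits: (L, |V|) decoder outputs; targets: (L,) ground-truth token ids;
    c_hat / r_hat: PerformanceEvaluator predictions; c_star: true primitive count."""
    l_dsl = -log_softmax(logits)[np.arange(len(targets)), targets].mean()
    l_perf = (c_hat - c_star) ** 2 + alpha * r_hat
    lam = 0.1 + 0.4 * (epoch / E)
    return l_dsl + lam * l_perf

# Uniform logits over the 24-token vocabulary give L_DSL = ln(24).
loss0 = total_loss(np.zeros((5, 24)), np.zeros(5, dtype=int),
                   c_hat=462.0, c_star=462.0, r_hat=0.0, epoch=0)
```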
Table 1: PGN network hyperparameters

Parameter         Value
d_model           512
Encoder layers    6
Decoder layers    6
Attention heads   8
FF dim            2048
Dropout           0.1
Batch size        4
LR (encoders)     1×10⁻⁴
LR (decoder)      1×10⁻⁵
LR schedule       Cosine anneal, T₀=100
Gradient clip     norm 1.0
Training epochs   25
4. DSL Grammar
The DSL is a context-free grammar G=(V,Σ,R,S) where V={BRIDGE_PROGRAM, STMT_LIST, STMT, BOUNDARY_CLAUSE, PARAM} are non-terminals, Σ is the 24-token terminal alphabet, R are production rules, and S=BRIDGE_START is the start symbol. The grammar is deliberately minimal — spanning pedestrian bridges (~250 primitives) to major infrastructure (~7,000 primitives) with 24 terminal types. BOUNDARY_TYPE tokens are first-class grammar elements, enabling the decoder to learn attribute-conditioned branching (IF_BOUNDARY CLOSED → generate railings) without explicit rule injection.
A key design decision was making CONTINUITY_SEAM a first-class boundary type rather than treating it as a modifier. This allows the network to learn that seam boundaries produce continuity joints with adjacent bridge segments — a structural relationship that would be opaque if encoded as a parameter value.
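A minimal structural validity check over the grammar's delimiters and boundary clauses can be sketched as follows. Token names come from the grammar description above; the sample program and the GENERATE_RAILING token name are illustrative assumptions, not ground-truth model output:

```python
BOUNDARY_TYPES = {"CLOSED", "OPEN", "SEAM"}

def is_valid_program(tokens) -> bool:
    """Minimal structural check: the program must be delimited by
    BRIDGE_START / BRIDGE_END, and every IF_BOUNDARY must be immediately
    followed by a boundary-type token. A full parser would check all
    production rules R of G = (V, Sigma, R, S)."""
    if not tokens or tokens[0] != "BRIDGE_START" or tokens[-1] != "BRIDGE_END":
        return False
    for i, tok in enumerate(tokens):
        if tok == "IF_BOUNDARY":
            if i + 1 >= len(tokens) or tokens[i + 1] not in BOUNDARY_TYPES:
                return False
    return True

program = ["BRIDGE_START", "GENERATE_DECK",
           "IF_BOUNDARY", "CLOSED", "GENERATE_RAILING",
           "BRIDGE_END"]
```

Checks of this kind run before execution, so a malformed decoder output is rejected rather than handed to the USD executor.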
Figure 2: Training loss curves. ℒ_total (solid) converges from 0.950 to 0.473. ℒ_DSL (dashed) from 0.500 to 0.097. Both components converge monotonically without instabilities. Training on M4 iMac, 32GB unified memory, MPS backend.
5. Experiments
5.1 Dataset
The training corpus consists of 15 polyline–bridge USD pairs from Apple Maps production data, spanning four structural categories: pedestrian bridges (n=4), highway overpasses (n=4), curved interchanges (n=4), and major infrastructure (n=3). Polyline lengths range from 39 to 263 vertices; bridge primitive counts from 250 to 10,420. No public benchmark exists for this task; evaluation is loss-based and execution-validated given the dataset scale constraint.
5.2 Quantitative Results
Training converges stably over 25 epochs. DSL reconstruction loss falls from 0.500 (epoch 1) to 0.097 (epoch 25). Total loss falls from 0.950 to 0.473. Performance loss falls from 8.12 to 3.75, indicating progressively more compact program generation. A representative test case — medium-complexity curved bridge (ground truth: 462 primitives) — executes with correct deck curvature, CLOSED-boundary railings, and 6 evenly-spaced pillars. The execution status from the training log confirms: Execution: SUCCESS – 462 polygons.
5.3 Failure Mode Analysis
Two systematic failure modes are identified. First, BRIDGE_END omission on long sequences (>8 tokens): the model occasionally fails to predict the end delimiter, causing parser failure. Addressed by appending a forced end token at inference. Second, SET_DETAIL_LEVEL parameter values occasionally fall outside the valid continuous range [0,1], producing geometrically degenerate deck widths. Corrected with a clamping post-processor at inference time.
Both failure modes are consistent with known pathologies in autoregressive sequence generation: end-of-sequence prediction difficulty and continuous parameter out-of-distribution extrapolation. Both are mechanical problems rather than fundamental architectural failures.
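Both inference-time repairs described above can be sketched in one post-processor. The `params` layout (a map from token index to continuous value, with the parameter slot directly following its opcode) is an assumption for illustration:

```python
def postprocess(tokens, params):
    """Inference-time repairs for the two identified failure modes:
    (1) force the BRIDGE_END delimiter if the model omitted it;
    (2) clamp SET_DETAIL_LEVEL parameter values into the valid range [0, 1]."""
    tokens = list(tokens)
    if not tokens or tokens[-1] != "BRIDGE_END":
        tokens.append("BRIDGE_END")           # forced end token
    fixed = dict(params)
    for i, tok in enumerate(tokens):
        if tok == "SET_DETAIL_LEVEL" and i + 1 in fixed:
            fixed[i + 1] = min(1.0, max(0.0, fixed[i + 1]))  # clamp to [0, 1]
    return tokens, fixed

toks, prm = postprocess(["BRIDGE_START", "SET_DETAIL_LEVEL", "PARAM"], {2: 1.7})
```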
6. Discussion
6.1 The Non-Differentiable Executor Gap
The central open problem is executor non-differentiability. Because exec(T) is a deterministic interpreter, no gradient signal flows from the 3D geometry back through the executor to the decoder. End-to-end training with a direct 3D geometry loss — the natural objective for this task — is therefore impossible with the current architecture. Two mitigation directions are under investigation: (1) approximating exec as differentiable via nvdiffrast [12] rendering supervision; (2) using a Vector DB of program–geometry pairs as retrieval-augmented similarity signal at the token level. Neither approach fully closes the gap.
6.2 Dataset Scale Limitation
15 training pairs is insufficient for robust generalisation to unseen bridge topologies. The model memorises training examples rather than learning truly generalisable structural principles. Data augmentation via procedural variation — randomised parameter perturbation with re-execution — would increase effective dataset size while maintaining geometric validity. This extension is planned but not yet implemented.
7. Conclusion
PGN demonstrates the viability of training a transformer seq2seq architecture to generate executable procedural programs for 3D geometry construction from structured geometric input with semantic attributes. The dual-loss curriculum effectively balances reconstruction fidelity and geometric efficiency, converging to DSL loss 0.097 over 25 epochs. The DSL approach produces outputs with properties unavailable in direct mesh generation paradigms: human readability, parametric editability, and version-controllability as text.
The non-differentiable executor and small dataset scale are the primary barriers to scaling. Both are tractable engineering problems. The architectural pattern established here — geometric input → latent → construction program → 3D output — is extended in all subsequent thesis work: SketchProc3D replaces polylines with sketch input; graph grammar research explores automatic grammar extraction; SculptNet extends to coarse-to-fine primitive assembly; building elevation reconstruction applies the pattern at city scale with street-view image input and a 6-plane mesh reconstruction executor.
References
[1] Zheng et al. "Locally Attentional SDF Diffusion for Controllable 3D Shape Generation." ACM Trans. Graph., 42(4), 2023. doi:10.1145/3592103
[2] Mildenhall et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." ECCV, 2020.
[3] Park et al. "DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation." CVPR, 2019.
[4] Müller et al. "Procedural Modeling of Buildings." ACM SIGGRAPH, 2006.
[5] Prusinkiewicz & Lindenmayer. "The Algorithmic Beauty of Plants." Springer, 1990.
[6] Li et al. "ProcGen3D: Edge-Based Tokenization for Procedural 3D Graph Generation." arXiv, 2025.
[7] Jones et al. "ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis." ACM Trans. Graph., 39(6), 2020.
[8] Sharma et al. "CSGNet: Neural Shape Parser for Constructive Solid Geometry." CVPR, 2018.
[9] Williams et al. "fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence." ACM Trans. Graph., 43(4), 2024.
[11] Vaswani et al. "Attention Is All You Need." NeurIPS, 2017.
[12] Laine et al. "Modular Primitives for High-Performance Differentiable Rendering." ACM Trans. Graph., 39(6), 2020.