arXiv Preprint · cs.GR · cs.LG · Sep 2025
PGN: A Transformer-Based Procedural Generator Network for 3D Bridge Synthesis from Polyline Semantic Attributes
Aaditya Jain
Procedural Generation · Program Synthesis · Thesis Research, Unpublished Preprint
Submitted: September 2025 · Subject: cs.GR · cs.LG · MSC: 68T07 · 65D18
Keywords: procedural generation, program synthesis, seq2seq, USD, transformer
Abstract
We present the Procedural Generator Network (PGN), a sequence-to-sequence transformer architecture that maps polyline geometric input annotated with semantic boundary attributes to executable Domain Specific Language (DSL) programs that construct 3D bridge geometry in Universal Scene Description (USD) format. Unlike direct geometry generation methods — SDF diffusion, NeRF, mesh generation — PGN generates construction programs: ordered sequences of human-readable, parametrically editable procedural commands that a deterministic executor converts to watertight USD scenes at runtime. The network is trained on 15 polyline–bridge USD pairs sourced from Maps production data, using a dual-loss objective combining token-level cross-entropy reconstruction with a curriculum-weighted performance loss penalising geometric redundancy, with weight λ(e) increasing linearly from 0.1 to 0.5. We demonstrate a DSL reconstruction loss of 0.097 at epoch 25, achieving successful program execution with correct deck curvature, boundary-conditioned railing generation, and pillar instantiation on test cases. The primary identified open problem is executor non-differentiability — no gradient flows from the 3D geometry back through the deterministic interpreter to the decoder, precluding direct end-to-end training with a geometric loss. This architectural pattern — structured geometric input → latent representation → construction program → 3D output — serves as the foundation for all subsequent thesis work.
1. Introduction

Contemporary 3D generative models — diffusion-based mesh synthesis [1], neural radiance fields [2], and SDF-based shape generation [3] — produce geometrically high-quality outputs but share a fundamental limitation for production use: their outputs are static triangle meshes or volumetric fields that cannot be re-parameterised without destructive editing. In the context of Maps bridge geometry, this limitation is operationally critical: bridge assets must be editable, versioned, and regeneratable from updated polyline surveys without repeating the full generative inference cycle.

We propose a different generative target. Rather than mapping input geometry to output geometry, we train a network to map input geometry to an executable construction program. This program, expressed in a formal DSL, encodes the structural intent of the bridge as an ordered sequence of human-readable commands. A deterministic executor converts the program to a USD scene at runtime. The resulting system produces outputs that: (i) can be inspected and edited without specialised 3D tools; (ii) can be re-executed with modified parameters to produce geometry variants; (iii) can be version-controlled as text alongside other production assets.

This paper makes the following contributions: (1) a transformer encoder–decoder architecture adapted for the polyline-to-DSL mapping problem; (2) a context-free DSL grammar over 24 terminal tokens sufficient to describe the full structural variation in Maps bridge data; (3) a dual-loss training curriculum combining reconstruction fidelity and geometric efficiency objectives; (4) an analysis of the non-differentiable executor gap — the primary open problem for all executor-based program synthesis approaches to 3D geometry generation; (5) the architectural template that all subsequent thesis projects extend.

Figure 1: PGN architecture. Polyline vertices and boundary attributes B are encoded separately and fused via cross-attention into latent z∈ℝ⁵¹². The causal decoder generates DSL token sequence T autoregressively. A PerformanceEvaluator MLP branches from z to contribute ℒ_perf to the training objective.
2. Related Work
2.1 Procedural 3D Generation

Procedural modelling systems — CityEngine [4], Houdini procedural networks, shape grammars [5] — define geometry through parameterised rule grammars, enabling scalable urban content production but requiring manual rule authoring per asset category. ProcGen3D [6] explores learning grammar rules from image data using autoregressive edge-token prediction with MCTS-guided sampling, but addresses 2D graph structures. PGN extends to 3D USD scene graphs with continuous parameter prediction and semantic attribute conditioning.

2.2 Program Synthesis for Geometry

ShapeAssembly [7] generates part-based shape assembly programs via a language model trained on a curated DSL. CSGNet [8] recovers constructive solid geometry programs from 3D shapes using imitation learning. Our approach differs in three respects: (1) the input is structured geometric data with semantic attributes rather than a shape category label; (2) the output DSL targets direct execution in Houdini production pipelines; (3) the grammar explicitly encodes Maps USD boundary attributes as first-class tokens enabling conditional generation.

2.3 3D Representations for ML Supervision

Sparse voxel representations (FVDB [9]) provide Houdini-native integration and city-scale memory efficiency but do not support differentiable supervision. SDF-based representations [10] provide smooth gradient fields for geometry learning but are prohibitively expensive at urban scene scale. We use USD mesh with graph neural network supervision for per-component geometric loss, reserving VDB for production storage; a parallel USD-pipeline evaluation conducted the same month confirmed VDB as the preferred production format.

3. Method
3.1 Problem Formulation

Let P = {p₁,...,pₙ}, pᵢ∈ℝ³ be a polyline with n vertices, and B = {b₁,...,bₘ}∈{CLOSED, OPEN, SEAM}ᵐ its corresponding boundary attribute sequence. The goal is to learn a mapping f:(P,B)→T={t₁,...,tₗ} where T is a sequence of DSL tokens drawn from vocabulary V, such that exec(T) produces a geometrically valid USD bridge scene consistent with the structural intent encoded in (P,B).
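The formulation above can be stated as a typed interface. A minimal Python sketch, in which the well-formedness check is illustrative rather than the paper's code:

```python
from enum import Enum
from typing import List, Tuple

# Boundary attribute alphabet from the problem formulation.
class Boundary(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    SEAM = "SEAM"

Point = Tuple[float, float, float]   # p_i in R^3
Polyline = List[Point]               # P = {p_1, ..., p_n}
Program = List[str]                  # T = {t_1, ..., t_l}, tokens drawn from V

def is_valid_input(P: Polyline, B: List[Boundary]) -> bool:
    # A (P, B) pair is well-formed when both sequences are non-empty;
    # the attribute length m need not equal the vertex count n.
    return len(P) > 0 and len(B) > 0
```

The learned mapping f then has signature `f(P, B) -> Program`, with `exec` applied downstream to produce the USD scene.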

3.2 Network Architecture

The PolylineEncoder E_P applies a standard Transformer encoder [11] with sinusoidal positional encoding to P, producing contextualised vertex representations h₁,...,hₙ∈ℝ⁵¹². A [CLS] token aggregates these into context vector c_P∈ℝ⁵¹². The AttributeEncoder E_B applies a smaller 3-layer transformer to B, producing c_B∈ℝ⁵¹². Cross-attention fusion F combines both: z = F(c_P, c_B)∈ℝ⁵¹².

The DSL Decoder D_T is a causal transformer that generates T autoregressively. At step t it attends over the encoder output and previously generated tokens t₁,...,tₜ₋₁ via cross-attention and causal self-attention respectively, predicting the next token from vocabulary V. The PerformanceEvaluator G:z→(ĉ,r̂) is a 3-layer MLP predicting primitive count ĉ and rendering cost r̂.
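At inference the decoder is unrolled greedily until the end delimiter appears. A minimal sketch of that loop, where `step_fn` is a hypothetical stand-in for one forward pass of the causal decoder D_T attending over z and the prefix (the DECK token in the usage example is illustrative):

```python
def greedy_decode(step_fn, z, max_len=64):
    # step_fn(prefix, z) -> next token; stands in for a forward pass of D_T,
    # which attends over the latent z and the previously generated tokens.
    tokens = ["BRIDGE_START"]
    for _ in range(max_len):
        nxt = step_fn(tokens, z)
        tokens.append(nxt)
        if nxt == "BRIDGE_END":
            break  # end delimiter reached; program is complete
    return tokens
```

For example, a scripted `step_fn` that emits "DECK" then "BRIDGE_END" yields the three-token program ["BRIDGE_START", "DECK", "BRIDGE_END"].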

3.3 Loss Formulation

Reconstruction loss is token cross-entropy:

ℒ_DSL = −Σₜ log p(tₜ | t₁,...,tₜ₋₁, z)

Performance loss penalises geometric redundancy:

ℒ_perf = MSE(ĉ, c*) + α·r̂, α=0.01

Total loss with curriculum weight:

ℒ = ℒ_DSL + λ(e)·ℒ_perf, λ(e) = 0.1 + 0.4·(e/E)

where c* is the ground-truth primitive count for the training bridge, E=25 epochs. The linear curriculum ensures structural correctness is learned (low λ early) before efficiency is optimised (higher λ late).
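The three loss terms and the linear curriculum compose directly. A framework-free sketch (scalar inputs stand in for batched tensors):

```python
import math

def curriculum_weight(e: int, E: int = 25) -> float:
    # lambda(e) = 0.1 + 0.4 * (e / E): low early (structure first),
    # approaching 0.5 at e = E (efficiency later).
    return 0.1 + 0.4 * (e / E)

def dsl_loss(token_probs) -> float:
    # L_DSL = -sum_t log p(t_t | t_1..t_{t-1}, z), given the probabilities
    # the decoder assigned to each ground-truth token.
    return -sum(math.log(p) for p in token_probs)

def perf_loss(c_hat: float, c_star: float, r_hat: float, alpha: float = 0.01) -> float:
    # L_perf = MSE(c_hat, c*) + alpha * r_hat
    return (c_hat - c_star) ** 2 + alpha * r_hat

def total_loss(l_dsl: float, l_perf: float, e: int, E: int = 25) -> float:
    # L = L_DSL + lambda(e) * L_perf
    return l_dsl + curriculum_weight(e, E) * l_perf
```

At epoch 0 the performance term contributes at weight 0.1; by the final epoch E it contributes at weight 0.5, matching the schedule stated above.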

Table 1: PGN network hyperparameters

Parameter         Value
d_model           512
Encoder layers    6
Decoder layers    6
Attention heads   8
FF dim            2048
Dropout           0.1
Batch size        4
LR (encoders)     1×10⁻⁴
LR (decoder)      1×10⁻⁵
LR schedule       Cosine anneal, T₀=100
Gradient clip     norm 1.0
Training epochs   25
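The split learning rates in Table 1 imply per-module parameter groups (faster encoders, slower decoder). A sketch of how they might be wired up; the module names and group structure are assumptions, not the paper's code:

```python
# Hyperparameters transcribed from Table 1.
HPARAMS = {
    "d_model": 512, "encoder_layers": 6, "decoder_layers": 6,
    "attention_heads": 8, "ff_dim": 2048, "dropout": 0.1,
    "batch_size": 4, "epochs": 25, "grad_clip_norm": 1.0,
}

def param_groups():
    # Two optimiser groups: the polyline/attribute encoders train at
    # 1e-4 while the DSL decoder trains at the slower 1e-5.
    return [
        {"modules": ["polyline_encoder", "attribute_encoder"], "lr": 1e-4},
        {"modules": ["dsl_decoder"], "lr": 1e-5},
    ]
```

In a PyTorch-style trainer each group dict would be passed to the optimiser as a parameter group, with cosine annealing and gradient clipping applied on top.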
4. DSL Grammar

The DSL is a context-free grammar G=(V,Σ,R,S) where V={BRIDGE_PROGRAM, STMT_LIST, STMT, BOUNDARY_CLAUSE, PARAM} are non-terminals, Σ is the 24-token terminal alphabet, R are production rules, and S=BRIDGE_START is the start symbol. The grammar is deliberately minimal — spanning pedestrian bridges (~250 primitives) to major infrastructure (~7,000 primitives) with 24 terminal types. BOUNDARY_TYPE tokens are first-class grammar elements, enabling the decoder to learn attribute-conditioned branching (IF_BOUNDARY CLOSED → generate railings) without explicit rule injection.

A key design decision was making CONTINUITY_SEAM a first-class boundary type rather than treating it as a modifier. This allows the network to learn that seam boundaries produce continuity joints with adjacent bridge segments — a structural relationship that would be opaque if encoded as a parameter value.
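To make the grammar concrete, consider a hypothetical program in the style it describes. Only BRIDGE_START, BRIDGE_END, IF_BOUNDARY, CLOSED, CONTINUITY_SEAM, and SET_DETAIL_LEVEL are named in the text; DECK, RAILING, and PILLAR below are illustrative stand-ins for the remaining terminals:

```python
# Hypothetical DSL program. Token names beyond those stated in the paper
# (DECK, RAILING, PILLAR) are illustrative placeholders.
example_program = [
    "BRIDGE_START",
    "SET_DETAIL_LEVEL", "0.8",
    "DECK",
    "IF_BOUNDARY", "CLOSED", "RAILING",
    "PILLAR", "6",
    "BRIDGE_END",
]

def is_delimited(tokens) -> bool:
    # Every well-formed program derives from S = BRIDGE_START and must
    # terminate with the BRIDGE_END delimiter.
    return (len(tokens) >= 2
            and tokens[0] == "BRIDGE_START"
            and tokens[-1] == "BRIDGE_END")
```

The IF_BOUNDARY CLOSED clause is the attribute-conditioned branch the decoder learns: railings are generated only where the boundary attribute is CLOSED.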

Figure 2: Training loss curves. ℒ_total (solid) converges from 0.950 to 0.473. ℒ_DSL (dashed) from 0.500 to 0.097. Both components converge monotonically without instabilities. Training on M4 iMac, 32GB unified memory, MPS backend.
5. Experiments
5.1 Dataset

The training corpus consists of 15 polyline–bridge USD pairs from Maps production data, spanning four structural categories: pedestrian bridges (n=4), highway overpasses (n=4), curved interchanges (n=4), and major infrastructure (n=3). Polyline lengths range from 39 to 263 vertices; bridge primitive counts from 250 to 10,420. No public benchmark exists for this task; evaluation is loss-based and execution-validated given the dataset scale constraint.

5.2 Quantitative Results

Training converges stably over 25 epochs. DSL reconstruction loss falls from 0.500 (epoch 1) to 0.097 (epoch 25). Total loss falls from 0.950 to 0.473. Performance loss falls from 8.12 to 3.75, indicating progressively more compact program generation. A representative test case — a medium-complexity curved bridge (ground truth: 462 primitives) — executes with correct deck curvature, CLOSED-boundary railings, and six evenly spaced pillars. The training log confirms: "Execution: SUCCESS – 462 polygons".

5.3 Failure Mode Analysis

Two systematic failure modes are identified. First, BRIDGE_END omission on long sequences (>8 tokens): the model occasionally fails to predict the end delimiter, causing parser failure. This is addressed by appending a forced end token at inference. Second, SET_DETAIL_LEVEL parameter values occasionally fall outside the valid continuous range [0,1], producing geometrically degenerate deck widths. These are corrected with a clamping post-processor at inference time.
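Both inference-time fixes can be combined into a single post-processing pass. A sketch, assuming the convention (not stated in the paper) that the token following SET_DETAIL_LEVEL is its numeric literal:

```python
def postprocess(tokens):
    # Fix 1: clamp SET_DETAIL_LEVEL parameters into the valid range [0, 1].
    # Fix 2: force a trailing BRIDGE_END if the model omitted it.
    out = []
    i = 0
    while i < len(tokens):
        t = tokens[i]
        if t == "SET_DETAIL_LEVEL" and i + 1 < len(tokens):
            out.append(t)
            value = float(tokens[i + 1])
            out.append(str(min(max(value, 0.0), 1.0)))  # clamp to [0, 1]
            i += 2
        else:
            out.append(t)
            i += 1
    if not out or out[-1] != "BRIDGE_END":
        out.append("BRIDGE_END")  # append the forced end delimiter
    return out
```

An out-of-range program such as ["BRIDGE_START", "SET_DETAIL_LEVEL", "1.7"] is repaired to ["BRIDGE_START", "SET_DETAIL_LEVEL", "1.0", "BRIDGE_END"] before execution.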

Both failure modes are consistent with known pathologies in autoregressive sequence generation: end-of-sequence prediction difficulty and continuous parameter out-of-distribution extrapolation. Both are mechanical problems rather than fundamental architectural failures.

6. Discussion
6.1 The Non-Differentiable Executor Gap

The central open problem is executor non-differentiability. Because exec(T) is a deterministic interpreter, no gradient signal flows from the 3D geometry back through the executor to the decoder. End-to-end training with a direct 3D geometry loss — the natural objective for this task — is therefore impossible with the current architecture. Two mitigation directions are under investigation: (1) approximating exec as differentiable via nvdiffrast [12] rendering supervision; (2) using a Vector DB of program–geometry pairs as retrieval-augmented similarity signal at the token level. Neither approach fully closes the gap.

6.2 Dataset Scale Limitation

15 training pairs is insufficient for robust generalisation to unseen bridge topologies. The model memorises training examples rather than learning truly generalisable structural principles. Data augmentation via procedural variation — randomised parameter perturbation with re-execution — would increase effective dataset size while maintaining geometric validity. This extension is planned but not yet implemented.
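The planned augmentation, randomised parameter perturbation followed by re-execution, amounts to jittering the numeric literals of a program while leaving command tokens fixed. A minimal sketch; in the full pipeline each variant would be re-executed through exec(T) to confirm geometric validity, which is omitted here:

```python
import random

def perturb_numeric_tokens(tokens, scale=0.1, seed=0):
    # Jitter every numeric literal by up to +/-scale (relative), leaving
    # command tokens untouched. One call yields one augmented variant;
    # varying the seed yields a family of variants per training pair.
    rng = random.Random(seed)
    out = []
    for t in tokens:
        try:
            value = float(t)
        except ValueError:
            out.append(t)  # command token, not a parameter
            continue
        out.append(str(round(value * (1.0 + rng.uniform(-scale, scale)), 4)))
    return out
```

Applied to the 15-pair corpus with, say, 50 seeds per program, this would expand the effective dataset by roughly two orders of magnitude while preserving program structure.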

7. Conclusion

PGN demonstrates the viability of training a transformer seq2seq architecture to generate executable procedural programs for 3D geometry construction from structured geometric input with semantic attributes. The dual-loss curriculum effectively balances reconstruction fidelity and geometric efficiency, converging to DSL loss 0.097 over 25 epochs. The DSL approach produces outputs with properties unavailable in direct mesh generation paradigms: human readability, parametric editability, and version-controllability as text.

The non-differentiable executor and small dataset scale are the primary barriers to scaling. Both are tractable engineering problems. The architectural pattern established here — geometric input → latent → construction program → 3D output — is extended in all subsequent thesis work: SketchProc3D replaces polylines with sketch input; graph grammar research explores automatic grammar extraction; SculptNet extends to coarse-to-fine primitive assembly; building elevation reconstruction applies the pattern at city scale with street-view image input and a 6-plane mesh reconstruction executor.

References
[1] Zheng et al. "Locally Attentional SDF Diffusion for Controllable 3D Shape Generation." ACM Trans. Graph., 42(4), 2023. doi:10.1145/3592103
[2] Mildenhall et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." ECCV, 2020.
[3] Park et al. "DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation." CVPR, 2019.
[4] Müller et al. "Procedural Modeling of Buildings." ACM SIGGRAPH, 2006.
[5] Prusinkiewicz & Lindenmayer. "The Algorithmic Beauty of Plants." Springer, 1990.
[6] Li et al. "ProcGen3D: Edge-Based Tokenization for Procedural 3D Graph Generation." arXiv, 2025.
[7] Jones et al. "ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis." ACM Trans. Graph., 39(6), 2020.
[8] Sharma et al. "CSGNet: Neural Shape Parser for Constructive Solid Geometry." CVPR, 2018.
[9] NVIDIA. "FVDB: Feature VDB for Sparse Neural Fields." 2024.
[10] Wang et al. "HotSpot: Personalized Synthesis for Text-to-3D Generation." CVPR, 2025.
[11] Vaswani et al. "Attention Is All You Need." NeurIPS, 2017.
[12] Laine et al. "Modular Primitives for High-Performance Differentiable Rendering." ACM Trans. Graph., 39(6), 2020.