PGN [1] trains a seq2seq transformer that maps a bridge polyline plus a per-segment semantic attribute string to an executable DSL program. That architecture was appropriate for PGN's 15-pair training corpus, but it carries two structural limitations into the larger thesis line: (i) autoregressive decoding is slow at inference; (ii) the discrete DSL-token output forces the model to learn the executor's grammar perfectly rather than producing geometry it can evaluate directly.
A diffusion-style or flow-matching-style head sidesteps both. The output is a continuous geometric representation (polyline coordinates + attributes), the sampler runs in parallel over all positions, and the trained model is a denoiser over polyline configurations. The design question this paper specifies is which representation and which generator family work for variable-length polyline outputs.
The honest scope: this is a design specification, not an implementation. No training runs are executed. The contribution is the documented design decisions and the open architectural questions enumerated for the first training run.
A bridge polyline is a sequence of 3-D control points (x, y, z) plus per-segment semantic attributes (OPEN, CLOSED, RAILING, …). The PGN corpus has polylines of 8–40 control points. Three encoding options are analysed:
| Option | Encoding | Pros | Cons |
|---|---|---|---|
| Padded fixed length | Pad to N_max = 64 with sentinel + length mask | Standard transformer / Mamba consumption | Wasted capacity on short polylines |
| Variable-length set | Treat polyline as set of (xyz, attr, position-index) | No padding; scales to long polylines | Needs PointNet++ / GNN encoder; loses order without positional encoding |
| Length-conditioned sampling | Predict length L first, then sample (L × 3) polyline conditioned on L | Sharp generation; no wasted capacity | Two-stage sampler; L-prediction itself is a small generative problem |
Working hypothesis: padded fixed length with length mask. Reasons: composes cleanly with the Pure-Mamba backbone [2]; PGN corpus length distribution is tight (8–40 points) so padding overhead is manageable; the length-conditioned alternative is the backup if padded fails on long polylines.
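To make the working hypothesis concrete, a minimal padding sketch in PyTorch (the `N_MAX` constant matches the table; the sentinel value, the reserved PAD attribute id, and the function name are illustrative assumptions, not part of the spec):

```python
import torch

N_MAX = 64          # padded length from the table above
PAD_SENTINEL = 0.0  # assumed sentinel written into padded coordinate slots

def pad_polyline(points: torch.Tensor, attrs: torch.Tensor):
    """points: (L, 3) control-point coords, attrs: (L,) integer attribute ids, L <= N_MAX.
    Returns (N_MAX, 3) coords, (N_MAX,) attribute ids, and a (N_MAX,) boolean length mask."""
    L = points.shape[0]
    coords = torch.full((N_MAX, 3), PAD_SENTINEL)
    coords[:L] = points
    attr_ids = torch.zeros(N_MAX, dtype=torch.long)  # id 0 reserved for padding (assumption)
    attr_ids[:L] = attrs
    mask = torch.zeros(N_MAX, dtype=torch.bool)
    mask[:L] = True                                   # True on real control points
    return coords, attr_ids, mask
```

The length mask is what would keep padded positions out of the training loss and out of any metric computed on samples, which is exactly the concern behind open question (i) in §6.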
Pure-Mamba state-space backbone, inheriting the Topic-25 MNIST backbone-validation decision [2]. The validation found Pure-Mamba beats Pure-Transformer and Hybrid-Mamba+Attention on the speed-quality trade-off at the 196-token regime. Polyline-diffusion's N_max = 64 sits inside that regime; the Mamba block scales linearly in token count where the transformer's attention is quadratic, so the advantage holds.
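A minimal sketch of the denoiser the backbone implies, assuming the reference `mamba_ssm` package's `Mamba` block; the layer count, model width, pre-norm residual wiring, and time-embedding MLP are placeholder choices, not validated decisions:

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumption: the reference Mamba implementation is installed

class PolylineDenoiser(nn.Module):
    """Maps a noisy (B, N_MAX, 4) polyline tensor and a timestep t to a same-shape velocity field."""
    def __init__(self, d_model: int = 256, n_layers: int = 6):
        super().__init__()
        self.in_proj = nn.Linear(4, d_model)   # 3 coordinates + 1 attribute channel per position
        self.t_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        self.blocks = nn.ModuleList([Mamba(d_model=d_model) for _ in range(n_layers)])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.out_proj = nn.Linear(d_model, 4)

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x_t: (B, N_MAX, 4), t: (B,) in [0, 1]
        h = self.in_proj(x_t) + self.t_embed(t[:, None]).unsqueeze(1)  # broadcast time over positions
        for norm, block in zip(self.norms, self.blocks):
            h = h + block(norm(h))  # pre-norm residual state-space block
        return self.out_proj(h)
```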
| Family | Sampling steps (quality) | Manifold-aware target? | Verdict |
|---|---|---|---|
| DDPM (pixel/polyline-space) | ~100–250 | ε-prediction (no) | Workable but slow |
| LDM (latent-space) | ~50–100 | ε-prediction in latent (no) | VAE compression buys speed but adds reconstruction artefacts at polyline scale |
| Flow Matching | ~20–50 | v-prediction (yes, closer to x-pred) | Best fit |
Pre-decision: flow matching, matching the upstream MambaFlow3D choice [4] and the x-prediction manifold-hypothesis analysis [3].
Flow matching defines a continuous-time flow from a base distribution p_0 (Gaussian noise over the polyline tensor) to the data distribution p_1 (the training-set bridge polylines). Linear interpolation gives the simplest flow:
x_t = (1 − t) · z + t · x_1,   z ∼ 𝒩(0, I),   x_1 ∼ p_data,   t ∈ [0, 1]

The optimal velocity field along this path is the constant v(x_t, t) = x_1 − z; the network v̂(x_t, t) is trained to predict this velocity from x_t and t:

L = E_{t, z, x_1} ‖v̂(x_t, t) − (x_1 − z)‖²

At sampling time the network's predicted velocity is integrated by Euler (or Heun, RK4) from t = 0 to t = 1:

x_{t+Δt} = x_t + Δt · v̂(x_t, t)

with 20–50 Euler steps producing high-quality samples. For the polyline-tensor shape (N_max, 4) (3 coordinates + 1 attribute embedding per position), the flow operates per position, with the Pure-Mamba backbone providing the cross-position interaction.
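The velocity-matching objective and the Euler sampler translate directly into code. A minimal sketch, assuming the `PolylineDenoiser` and length mask from the sketches above; masking padded positions out of the loss is an assumption that open question (i) would test:

```python
import torch

def flow_matching_loss(model, x_1: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """x_1: (B, N_MAX, 4) clean padded polyline tensors, mask: (B, N_MAX) True on real points."""
    B = x_1.shape[0]
    t = torch.rand(B, device=x_1.device)                        # t ~ U[0, 1]
    z = torch.randn_like(x_1)                                    # z ~ N(0, I)
    x_t = (1 - t)[:, None, None] * z + t[:, None, None] * x_1    # linear interpolation path
    v_target = x_1 - z                                           # optimal constant velocity
    v_pred = model(x_t, t)
    per_pos = ((v_pred - v_target) ** 2).mean(dim=-1)            # (B, N_MAX)
    return (per_pos * mask).sum() / mask.sum()                   # assumption: exclude padded slots

@torch.no_grad()
def sample(model, batch_size: int, n_steps: int = 50, device: str = "cpu") -> torch.Tensor:
    """Euler integration of the learned velocity field from t = 0 (noise) to t = 1 (data)."""
    x = torch.randn(batch_size, 64, 4, device=device)            # (B, N_MAX, 4) Gaussian start
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((batch_size,), i * dt, device=device)
        x = x + dt * model(x, t)                                  # x_{t+dt} = x_t + dt * v̂(x_t, t)
    return x
```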
Per-segment semantic attributes (OPEN, CLOSED, RAILING) are the conditioning input. Three entry points are considered: (i) added embeddings, where each attribute id is embedded and summed into the per-position token before the backbone; (ii) AdaLN-style modulation, where the attribute embedding scales and shifts the backbone's normalisation layers; (iii) cross-attention from the polyline tokens to an attribute token sequence.
Working hypothesis: added embeddings, with AdaLN-modulation as backup if added embeddings are insufficient.
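A sketch of the added-embedding entry point, assuming it is wired in right after the denoiser's input projection so attribute information reaches every block; the attribute vocabulary size and the reserved PAD id are assumptions:

```python
import torch
import torch.nn as nn

class AddedAttributeEmbedding(nn.Module):
    """Added-embedding conditioning: per-position attribute ids become vectors summed into the token stream."""
    def __init__(self, n_attrs: int = 4, d_model: int = 256):  # e.g. PAD, OPEN, CLOSED, RAILING
        super().__init__()
        self.embed = nn.Embedding(n_attrs, d_model, padding_idx=0)

    def forward(self, h: torch.Tensor, attr_ids: torch.Tensor) -> torch.Tensor:
        # h: (B, N_MAX, d_model) token stream after the input projection, attr_ids: (B, N_MAX)
        return h + self.embed(attr_ids)
```

If this proves insufficient for the OPEN / CLOSED / RAILING distinction, the AdaLN backup would replace the plain LayerNorm calls in the denoiser with attribute-modulated scale and shift.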
Three concrete questions the first training run is intended to answer. (i) Does padded fixed-length with length-mask train cleanly, or does the sentinel-padding distort the loss landscape? (ii) Is added-embedding attribute conditioning expressive enough for the OPEN / CLOSED / RAILING distinction, or is cross-attention needed? (iii) Does the 20–50-step flow-matching sampler preserve polyline coherence (no kinks, no self-intersections), or are explicit geometric losses needed?
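Question (iii) needs a measurable notion of coherence before the first run. One candidate diagnostic, sketched here, is the maximum turn angle between consecutive segments of a sampled polyline; the function name and any pass/fail threshold are illustrative, not part of the spec:

```python
import torch

def max_turn_angle_deg(coords: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Largest angle (degrees) between consecutive segment directions of one sampled polyline.
    coords: (N_MAX, 3) sampled coordinates, mask: (N_MAX,) True on real control points."""
    pts = coords[mask]                                  # drop padded positions
    if pts.shape[0] < 3:                                # fewer than two segments: no kink possible
        return torch.tensor(0.0)
    seg = pts[1:] - pts[:-1]                            # segment vectors
    seg = seg / (seg.norm(dim=-1, keepdim=True) + 1e-8)
    cos = (seg[1:] * seg[:-1]).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).max()
```

A self-intersection count over the projected (x, y) segments would be the companion check, and together they would decide whether explicit geometric losses are needed.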
Polyline-diffusion is specified: padded fixed-length with length mask; Pure-Mamba backbone; flow-matching velocity prediction; added-embedding attribute conditioning. No training runs executed. The first training run's job is to answer the three open questions in §6.