Topic 02 · Sep – Oct 2025 · Applied ML · Inverse Procedural Modeling
A CNN-based system that recognises building type grammar snippets from freehand sketches and counts floors via OpenCV line detection — mapping rough 2D strokes to executable CityEngine CGA programs. Directly inspired by Garcia-Dorado et al. SIGGRAPH 2016. Core finding: synthetic NPR training data ≠ real human sketches.
3 grammar snippets · 95–99% CNN accuracy (synthetic) · NPR synthetic data generation
Three building archetypes: BOX (rectangular), TOWER (tall, narrow), L-SHAPE (L-footprint). Each maps to a parameterized CityEngine CGA rule. CNN classifies which snippet the sketch belongs to — this one classification step drives all downstream geometry generation.
02
CNN Type Recognition
Trained on synthetic NPR-rendered sketch pairs. Achieves 95–99% accuracy on its own distribution. The model degrades sharply on real freehand sketches — the domain gap is the project's defining challenge, not model capacity.
03
OpenCV Floor Counting
HoughLinesP detects horizontal stroke clusters, interpreted as floor dividers. Threshold-sensitive: too tight misses floors, too loose detects noise. Works reliably on clean synthetic sketches; degrades when floors are implied rather than explicitly drawn.
04
NPR Synthetic Data
Training pipeline: sample grammar params → execute in Houdini → front-view render → Canny edge detect → Perlin noise jitter. The jitter simulates stroke wobble but the underlying distribution is fundamentally different from human freehand drawing — the domain gap.
Core Finding
95% on synthetic. Unknown on real.
The CNN classifier achieves near-perfect accuracy on its own synthetic distribution — NPR-rendered, isometric, systematic. But the moment a real human sketch is used, the model meets a distribution it was never trained on. Perlin noise jitter is not a substitute for real human drawing data. The project's central lesson: data distribution is the architecture.
§ 1
System Architecture
The system decomposes the sketch-to-3D problem into two independent recognition tasks. A CNN classifier identifies building type from the snippet vocabulary (box / tower / L-shape). In parallel, OpenCV HoughLinesP counts horizontal stroke clusters and infers floor count. These are assembled into a grammar parameter set and passed to the CityEngine CGA executor.
The design mirrors Garcia-Dorado et al. 2016 directly — separate recognition channels per grammar attribute, each independently debuggable. The failure mode is that errors in either channel propagate directly to the output with no correction mechanism. A misclassified snippet produces entirely wrong geometry regardless of accurate floor counting.
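The assembly step described above can be sketched in a few lines. Names here (`GrammarParams`, `assemble_params`) are hypothetical, not from the original code:

```python
from dataclasses import dataclass

SNIPPET_TYPES = ("BOX", "TOWER", "L_SHAPE")

@dataclass
class GrammarParams:
    snippet: str       # which CGA rule template to execute
    floor_count: int   # from the HoughLinesP channel

def assemble_params(class_probs, detected_floors):
    """Fuse the two recognition channels into one CGA parameter set.

    Note the failure mode described above: neither channel can correct
    the other, so an error in either propagates straight to the executor.
    """
    snippet = SNIPPET_TYPES[class_probs.index(max(class_probs))]
    # Floor count is passed through unchecked (no confidence estimate).
    return GrammarParams(snippet=snippet, floor_count=max(1, detected_floors))
```

The `max(1, ...)` clamp is the only safeguard: a sketch with no detected floor lines still produces a one-floor building rather than degenerate geometry.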
BOX Snippet
Rectangular footprint
Standard rectangular building. Params: width, depth, floor_count, window_frac, style. Most common class — widest training distribution.
TOWER Snippet
Tall narrow footprint
High aspect-ratio building (width < depth/3). CNN distinguishes from BOX via width/height ratio in the sketch outline. Adds taper_factor param.
L-SHAPE Snippet
L-footprint building
Two rectangular volumes joined at corner. Hardest class — requires detecting the L-junction notch in the sketch. Lowest CNN accuracy of the three.
Floor Counter
OpenCV HoughLinesP
Horizontal line cluster count → floor_count parameter. Threshold-sensitive. Fails when floors are implied rather than explicitly drawn as strokes.
CityEngine
CGA Grammar executor
Esri CityEngine executes assembled CGA token sequence → USD building mesh. Deterministic, non-differentiable. Same executor gap as PGN's Houdini DSL.
§ 2
The Domain Gap Problem
No paired sketch → procedural-program dataset exists. Training data must be generated synthetically: sample grammar parameters, execute in Houdini, render isometrically, run Canny edge detection, apply Perlin noise jitter to simulate stroke imperfection. The CNN trained on this distribution achieves 95–99% accuracy — and breaks on real human input.
The gap is not a minor distribution shift. Human sketches are drawn from arbitrary viewpoints, not frontal isometric elevation. Strokes are rough and expressive — a single wobbly line might represent three floors, or might be a stylistic mark with no structural meaning. Floors are often implied rather than explicitly divided. Overlapping and re-drawn strokes confuse line detection. No amount of Perlin jitter on a clean synthetic render captures these characteristics.
Fig. 2 — The domain gap. Synthetic NPR data (left) is clean, precise, and isometric — CNN achieves 95–99% on this distribution. Real human sketches (right) are rough, non-isometric, and ambiguous. Perlin noise jitter does not bridge this gap.
| Data Type | Volume | CNN Acc. | Gap to Human |
|---|---|---|---|
| NPR synthetic (Canny + Perlin jitter) | ~200 per class | 95–99% | — |
| Rendered isometric + edge detect only | ~200 per class | ~90% | Small |
| Real freehand (not collected) | 0 | Unknown | Large |

Required next step: real sketch collection → sketch style transfer or domain adaptation
§ 3
Differentiable Rendering Analysis
Alongside the modular CNN approach, this period investigated differentiable rendering as a path to closing the training signal gap. The hypothesis: if the executed 3D building can be rendered back to 2D differentiably, a pixel-level reconstruction loss could provide gradient signal from the output image all the way to the grammar parameters — without needing ground-truth grammar labels.
nvdiffrast provides differentiable rasterization — gradients flow from rendered pixel values back to 3D mesh vertex positions. But the CityEngine CGA executor sits between the grammar parameters and the mesh as a non-differentiable interpreter. Differentiable rendering closes the render→pixel gap but not the program→mesh gap. The gradient path that matters most — from image supervision back to grammar tokens — remains blocked at the executor boundary.
This analysis directly motivated the turn toward graph grammar research: if the executor cannot be made differentiable, can the grammar itself be learned from mesh geometry, removing the hand-authored executor entirely? If so, the executor gap becomes a grammar learning problem — the direction Merrell's graph grammar work explores.
§ 5
Training Pipeline
The training pipeline was built entirely from scratch — no pre-existing sketch dataset exists for building grammar snippets, so synthetic data generation was the only path. The pipeline went through two major iterations: an initial isometric-projection approach that achieved high benchmark accuracy but failed on real input, and a corrected front-view approach with multi-view augmentation.
Grammar Sampler
500 samples × 3 types
Random parameter sampling from each snippet's defined range. BOX: width∈[5,30], depth∈[5,20], floors∈[1,8]. TOWER: width∈[3,10], floors∈[4,15]. L-SHAPE: wing1∈[8,20], wing2∈[6,15], junction∈[0.3,0.7].
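The sampling step can be sketched directly from the ranges above; function names are illustrative:

```python
import random

# Parameter ranges from the snippet definitions above.
RANGES = {
    "BOX":     {"width": (5, 30), "depth": (5, 20), "floors": (1, 8)},
    "TOWER":   {"width": (3, 10), "floors": (4, 15)},
    "L_SHAPE": {"wing1": (8, 20), "wing2": (6, 15), "junction": (0.3, 0.7)},
}

def sample_params(snippet, rng=random):
    """Draw one uniform sample from a snippet's parameter space."""
    out = {}
    for name, (lo, hi) in RANGES[snippet].items():
        if name == "floors":
            out[name] = rng.randint(lo, hi)    # integer floor counts
        else:
            out[name] = rng.uniform(lo, hi)    # continuous dimensions
    return out

# 500 samples x 3 snippet types, matching the stage label above.
dataset = [(s, sample_params(s)) for s in RANGES for _ in range(500)]
```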
3D Mesh Gen
trimesh + Houdini
Parameters compiled to CGA token sequences, executed via Houdini Python SOP to generate watertight USD meshes. Each mesh verified for manifold property before rendering stage.
Front-View Renderer
3 views per sample
Front (0°,0°), slight-left (0°,−15°), slight-right (0°,15°). Rotation in the front-view plane — not isometric. Each view independently labelled. Triples effective dataset size to ~4,500 training images.
CNN Trainer
4-layer conv, Adam
Conv(32)→Conv(64)→Conv(128)→Conv(256), each with 3×3 kernel, ReLU, MaxPool 2×2. FC(512)→FC(3). Cross-entropy loss. Adam lr=1e-3. 30 epochs, batch 32. 80/20 train/val split.
Floor Counter
OpenCV HoughLinesP
rho=1px, theta=π/180, threshold=50, minLineLength=30, maxLineGap=10. Lines filtered to |angle|<10°. Y-cluster tolerance ±5px. In v2, floor label uses detected count (not ground truth) so training matches inference conditions.
Fig. 3 — v2 training pipeline. Front-view rendering replaced isometric after v1 showed domain gap was partly perspective-induced.
| Pipeline Version | Render View | Samples/Class | Val Acc (Synth) | Real Sketch |
|---|---|---|---|---|
| v1 — isometric only | 30°,30° isometric | 200 | ~90% | Degraded |
| v1b — + Perlin jitter | 30°,30° isometric | 200 | 95% | Still degraded |
| v2 — front-view + multi-view aug | 0°,0° front ± 15° | 500 × 3 views | 95–99% | Better, still gaps |
§ 6
Experiment Log
The project ran through a sequence of concrete implementation experiments, each exposing a distinct failure mode. The log below documents the actual progression — not a clean narrative of success but the real sequence of build → test → discover → iterate.
E-01
Initial Isometric Pipeline — Achieves 90% but Fails Informally
Built initial data generator rendering grammar-sampled buildings isometrically (30°,30°). Canny edge detection only, no jitter. 200 samples/class. CNN trained for 30 epochs reached 90% validation accuracy. Informal test on own freehand sketches: completely wrong classifications. Root cause: isometric projection looks nothing like how humans draw buildings — the oblique angles and foreshortening it introduces are not present in real sketches.
v1 Done
E-02
Perlin Noise Jitter Added — Marginal Improvement on Benchmark
Added Perlin noise displacement to edge pixels (amplitude 2px, frequency 0.1) + random stroke gaps to simulate rough drawing. Synthetic accuracy improved to 95%. Real sketch behavior unchanged — Perlin jitter on an isometric render still does not produce images resembling front-view human drawing. Confirmed: the problem is viewpoint, not stroke roughness.
v1b Done
E-03
Front-View Rendering — Architectural Pivot
Rebuilt renderer to produce front-view (0°,0°) projections matching the angle humans draw buildings. Floor lines are now horizontal, matching HoughLinesP assumption. Building silhouettes match rough rectangular/L-shaped outlines humans produce. Added ±15° lateral rotations for augmentation. Reran full pipeline: 500 samples × 3 views × 3 classes = 4,500 training images. Accuracy 95–99% on synthetic set. Better informal behavior on own sketches, though not yet systematically evaluated.
v2 Done
E-04
Floor Counter Ground-Truth Bug
Training labels used the ground-truth floor count from the parameter sampler, not the floor count as detected by HoughLinesP at inference time. At inference: floors detected ≠ floors expected. Fixed by running HoughLinesP on each training image and using the detected count (noisy) as the label — matching inference conditions. Floor ±1 accuracy on synthetic: 80%. Same bug pattern as a teacher-forcing train/inference mismatch.
Fixed
E-05
L-SHAPE Removed from v2, Reintroduced, Remains Hardest Class
Front-view L-SHAPE rendering produces an L-shaped silhouette where the corner notch is often ambiguous at small scales. Initial v2 removed L-SHAPE entirely (two classes only), achieving 99% on BOX+TOWER. Decision reversed: L-SHAPE is the most common real building typology and cannot be dropped. Reintroduced with additional training images focused on notch visibility. CNN accuracy on L-SHAPE: ~92%, lowest of three classes, as expected given corner junction detection difficulty.
Open
E-06
nvdiffrast End-to-End Gradient Experiment
Attempted to build an end-to-end differentiable loop: params → CGA string → CityEngine Python API → mesh vertices → nvdiffrast → rendered image → ℒ_render = ‖output − sketch‖₁. CityEngine call is a subprocess — gradient does not flow through it. Tried wrapping in custom autograd Function with numerical gradient approximation (finite differences over grammar parameters): too slow (5+ seconds per parameter perturbation × 8 parameters). Analysis confirmed: executor gap is structural, not solvable with renderer choice. Documented and closed. This experiment directly motivated the graph grammar research direction.
Closed
§ 4
Open Challenges
C-01
Domain Gap: Synthetic vs. Real Freehand Sketches
CNN trained on NPR-rendered synthetic data achieves 95–99% on that distribution, degrades sharply on real human sketches. The gap is fundamental — human sketches are non-isometric, variable-weight, imprecise, with implied rather than explicit structure. Perlin noise jitter does not capture this. Requires real sketch data collection or domain adaptation. Primary unresolved problem.
Open
C-02
Floor Counting Fragility (HoughLinesP)
Threshold-sensitive heuristic fails when floors are implied rather than explicitly drawn. Non-horizontal strokes confuse detection. No confidence estimate — floor count is passed to executor with no uncertainty quantification. A wrong floor count produces incorrect geometry with no fallback.
Open
C-03
Grammar Vocabulary Limited to 3 Snippets
BOX, TOWER, L-SHAPE cover a narrow range of building typologies. Most real urban buildings fall outside this set. Expanding snippets multiplies training requirements and introduces harder classification boundaries. Same fundamental limitation Garcia-Dorado 2016 identified — snippet vocabulary is the scope ceiling of grammar-based approaches.
Open
C-04
Executor Non-Differentiability
CityEngine CGA is a deterministic interpreter — no gradient flows from executed geometry back to grammar tokens. Differentiable rendering closes the render-to-pixel gap but not the program-to-mesh gap. Same structural problem as PGN C-01. Motivates both graph grammar research (learn the grammar itself) and SculptNet (replace executor with differentiable primitive assembly).
Open
Interactive Demo
The system was prototyped as a dual-panel Tkinter application: a drawing canvas on the left where you sketch a building in front-view, and a live matplotlib 3D viewer on the right showing the generated mesh. Below is a JavaScript recreation of the same pipeline — draw a building outline, click Generate, and the system classifies your sketch and assembles the 3D building from grammar parameters.
SKETCHPROC3D — LIVE DEMO · CNN + GRAMMAR EXECUTOR
INPUT — FREEHAND SKETCH
OUTPUT — GRAMMAR EXECUTOR · CGA
For demonstration purposes only — this is a partial, browser-side approximation of the actual system. Classification uses simple bounding-box heuristics; the real system used a trained PyTorch CNN (96.8% synthetic accuracy) with OpenCV HoughLinesP floor detection on M1 MPS, feeding a CityEngine CGA executor that produced USD meshes. 3D geometry shown here is illustrative, not a reconstruction of actual pipeline output.
Demo — The same pipeline implemented by the actual Python prototype: sketch → heuristic type classification → OpenCV floor count → grammar parameter assembly → procedural mesh generation. The 3D viewer renders the assembled building geometry.
Full Technical Paper
arXiv-format preprint · SketchProc3D: CNN-Based Grammar Snippet Recognition for Inverse Procedural Modeling of Building Facades from Freehand Sketches
We present SketchProc3D, a system for inverse procedural modeling of building facades from freehand 2D sketches, following the architecture established by Garcia-Dorado et al. (SIGGRAPH 2016). The system maps a user sketch to an executable CityEngine CGA grammar program via two parallel recognition channels: a 4-layer CNN classifier identifying the building's grammar snippet type (BOX, TOWER, or L-SHAPE), and an OpenCV HoughLinesP detector counting floor divisions from horizontal stroke clusters. Training data is generated synthetically — grammar programs sampled randomly, executed in Houdini to produce 3D meshes, rendered in front-view projection, and processed through Canny edge detection with Perlin noise jitter. A v1 isometric rendering pipeline achieved ~90% synthetic accuracy but failed on real input. A v2 front-view pipeline with multi-view augmentation (±15° lateral rotation) achieves 95–99% accuracy on its training distribution. The primary finding is a severe domain gap: the CNN trained on NPR-rendered synthetic data achieves near-perfect benchmark accuracy and degrades substantially on real freehand sketches, because the visual statistics of the two distributions differ fundamentally in viewpoint, stroke weight, floor explicitness, and structural ambiguity. A parallel investigation of differentiable rendering (nvdiffrast) establishes that gradient flow from rendered pixels to mesh vertices is achievable, but the CGA executor between grammar parameters and mesh vertices is non-differentiable — the program-to-mesh gap cannot be closed with renderer choice alone, requiring either a differentiable grammar interpreter or architectural abandonment of executor-based systems. These two findings — domain gap and executor gap — constitute the central negative result of the project and define the research agenda for all subsequent thesis work. 
Keywords: inverse procedural modeling, sketch-to-3D, grammar snippets, domain gap, NPR synthetic data, CityEngine CGA, differentiable rendering.
1. Introduction
The problem of generating 3D building models from casual freehand sketches sits at the intersection of sketch-based modeling, procedural generation, and machine learning. A working solution would enable non-expert users to produce procedurally editable 3D geometry from natural drawing input — a capability relevant to architectural design, game development, and urban reconstruction pipelines.
Garcia-Dorado et al. [1] established the key architectural pattern: define a vocabulary of grammar snippets (building typologies parameterised by width, height, floors, style), train CNNs to classify sketches into the vocabulary and regress per-snippet parameters, and execute the recognised grammar program to produce 3D geometry. Their system operated with constrained stylus input on a tablet; the domain gap between training data and real input was minimised by the controlled input device.
SketchProc3D implements this architecture for unconstrained freehand input, adds explicit floor counting via computer vision, and investigates synthetic training data as the scalability path — since collecting large real sketch datasets with grammar-level annotations is prohibitively expensive. The project's contribution is not a new architecture but a precise empirical characterisation of where the Garcia-Dorado approach succeeds and fails when training data is fully synthetic and input is unconstrained.
The project connects directly to PGN [6], which established the same pattern with precise geometric input (polylines → DSL program → 3D bridge). SketchProc3D tests the same recognition-to-program pattern with rough visual input. The verdict is conditional: the pattern is tractable when training and test distributions match, and structurally fragile when they diverge.
Figure 1 — SketchProc3D inference pipeline. The two recognition channels (CNN classifier and HoughLinesP floor counter) run in parallel on the same input image, outputs assembled into a CGA parameter set, and executed by CityEngine to produce a USD mesh. Dashed box indicates optional downstream viewer.
2. System Architecture and Method
2.1 Grammar Snippet Vocabulary
Three grammar snippets define the vocabulary. BOX: standard rectangular building parameterised by {width ∈ [5,30]m, depth ∈ [5,20]m, floor_count ∈ [1,8], window_frac ∈ [0.3,0.7], style ∈ {plain, detailed}}. Most common class; widest training distribution. TOWER: high-aspect-ratio variant with constraint width < depth/3; adds taper_factor ∈ [0,0.15] for slight narrowing toward roof; floor_count ∈ [4,15]. L-SHAPE: two rectangular volumes joined at corner, parameterised by {wing1_length, wing2_length, junction_offset ∈ [0.3,0.7]}; hardest class due to corner junction detection difficulty. Each snippet compiles to a CityEngine CGA program string executed via the CityEngine Python API to produce a USD mesh. The CGA program structure follows the Müller et al. [4] shape grammar formalism: extrude → comp(f) → split(y) floors → split(x) window bays.
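The compilation step for the BOX snippet can be sketched as a small string emitter. The real system's CGA rule text is not shown, so this is an illustrative reconstruction following the extrude → comp(f) → split(y) → split(x) structure; the 3 m storey height is an assumption:

```python
FLOOR_HEIGHT_M = 3.0  # assumed storey height; not specified in the text

def compile_box_cga(width, depth, floor_count, window_frac):
    """Emit a CGA-style rule string for the BOX snippet.

    Illustrative reconstruction of the compiler output, not the
    actual rule text executed by CityEngine.
    """
    height = floor_count * FLOOR_HEIGHT_M
    lines = [
        "attr width = %g" % width,
        "attr depth = %g" % depth,
        "Lot --> extrude(%g) Building" % height,
        "Building --> comp(f) { front : Facade | all : Wall. }",
        "Facade --> split(y) { ~%g : Floor }*" % FLOOR_HEIGHT_M,
        "Floor --> split(x) { ~1 : WindowBay(%g) }*" % window_frac,
    ]
    return "\n".join(lines)
```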
2.2 Synthetic Training Data Generation (NPR Pipeline v2)
No real paired dataset (sketch → grammar annotation) exists. Synthetic data generation is the only scalable path. The v2 pipeline: (1) sample 500 parameter configurations uniformly from each snippet's parameter space; (2) execute each configuration in Houdini via Python SOP to produce a watertight USD mesh; (3) render in front-view projection (camera at 0°,0° — the angle humans draw buildings) with two additional lateral rotations (−15°, +15°) for augmentation, tripling the per-sample count; (4) apply Canny edge detection (low=50, high=150) to produce clean edge maps; (5) apply Perlin noise displacement to each edge pixel (amplitude=2px, frequency=0.1) and introduce random stroke breaks (2–5px gaps per segment) to simulate sketch imperfection. This yields 4,500 training images (500 samples × 3 types × 3 views). Floor labels are derived by running HoughLinesP on each synthetic training image — using detected floor count as label rather than ground-truth parameter value, ensuring training conditions match inference conditions.
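Step (5) can be approximated as follows. Simple value noise stands in for true Perlin noise, and random pixel dropout stands in for the 2–5px stroke breaks, so this is a sketch of the jitter idea rather than the actual generator; the Canny step is omitted and a binary edge map is assumed as input:

```python
import numpy as np

def jitter_edges(edge_map, amplitude=2.0, frequency=0.1, gap_prob=0.02, rng=None):
    """Displace edge pixels by a smooth noise field and punch random gaps.

    edge_map: binary HxW array (output of the omitted Canny step).
    Value noise approximates Perlin; dropout approximates stroke breaks.
    """
    rng = rng or np.random.default_rng(0)
    h, w = edge_map.shape
    # Coarse random grid + per-row interpolation = smooth displacement field.
    coarse = rng.uniform(-1, 1, (max(2, int(h * frequency)), max(2, int(w * frequency))))
    ys = np.linspace(0, coarse.shape[0] - 1, h)
    xs = np.linspace(0, coarse.shape[1] - 1, w)
    field = np.array([np.interp(xs, np.arange(coarse.shape[1]),
                                coarse[int(round(y))]) for y in ys])
    out = np.zeros_like(edge_map)
    ys_e, xs_e = np.nonzero(edge_map)
    # Horizontal displacement of each edge pixel, clipped to the image.
    new_x = np.clip(xs_e + np.round(field[ys_e, xs_e] * amplitude).astype(int), 0, w - 1)
    out[ys_e, new_x] = 1
    out[rng.random(out.shape) < gap_prob] = 0  # random stroke breaks
    return out
```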
2.3 CNN Snippet Classifier
Architecture: 4-layer convolutional network. Conv(32, 3×3) → MaxPool(2) → Conv(64, 3×3) → MaxPool(2) → Conv(128, 3×3) → MaxPool(2) → Conv(256, 3×3) → GlobalAvgPool → FC(512) → FC(3). Activation: ReLU throughout. Input: 256×256 grayscale (single channel). Training: cross-entropy loss, Adam (lr=10⁻³, β₁=0.9, β₂=0.999), 30 epochs, batch size 32, 80/20 train/val split stratified by class. No pretrained backbone — ImageNet features are irrelevant for binary sketch edge maps; a small network trained from scratch on the task distribution outperforms fine-tuned ResNet-18 by ~4% in this regime.
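The architecture above takes only a few lines of PyTorch. This is a reconstruction from the spec, not the project's actual training code; padding choices are assumptions:

```python
import torch
import torch.nn as nn

class SnippetCNN(nn.Module):
    """Sec. 2.3 classifier: Conv(32/64/128/256, 3x3) with 2x2 pooling
    after the first three blocks, global average pool, FC(512) -> FC(3)."""
    def __init__(self, num_classes=3):
        super().__init__()
        chans, layers, in_c = (32, 64, 128, 256), [], 1  # 1-channel sketch input
        for i, out_c in enumerate(chans):
            layers += [nn.Conv2d(in_c, out_c, 3, padding=1), nn.ReLU(inplace=True)]
            if i < 3:                          # no pool before the global average pool
                layers.append(nn.MaxPool2d(2))
            in_c = out_c
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)    # global average pooling
        self.head = nn.Sequential(nn.Linear(256, 512), nn.ReLU(inplace=True),
                                  nn.Linear(512, num_classes))

    def forward(self, x):                      # x: (B, 1, 256, 256) grayscale
        z = self.pool(self.features(x)).flatten(1)
        return self.head(z)

model = SnippetCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # config from Sec. 2.3
```

Global average pooling makes the network resolution-agnostic, which is convenient given the mix of rendered views; the FC head sees a fixed 256-dim vector regardless of input size.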
2.4 OpenCV Floor Counter
HoughLinesP parameters: rho=1px, theta=π/180 rad, threshold=50, minLineLength=30px, maxLineGap=10px. Detected segments filtered to near-horizontal (|angle| < 10°). Y-coordinate clustering (tolerance ±5px) groups co-planar segments into floor lines. Cluster count is the floor estimate. No confidence output; floor count passed directly to CGA parameter assembly with no fallback. This is a hard design choice — wrong floor count produces incorrect geometry with no correction mechanism.
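The clustering half of the counter is simple enough to sketch. The snippet below takes segments as returned by `cv2.HoughLinesP` with the parameters above and applies the angle filter and y-clustering; the greedy clustering is an assumed implementation detail:

```python
import math

def count_floors(segments, angle_max_deg=10.0, y_tol=5.0):
    """Cluster near-horizontal Hough segments into floor lines (Sec. 2.4).

    `segments` is a list of (x1, y1, x2, y2) tuples, e.g. from
    cv2.HoughLinesP(edges, rho=1, theta=math.pi / 180, threshold=50,
                    minLineLength=30, maxLineGap=10).
    """
    ys = []
    for x1, y1, x2, y2 in segments:
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        if abs(angle) < angle_max_deg:          # keep near-horizontal strokes
            ys.append((y1 + y2) / 2.0)
    clusters = []
    for y in sorted(ys):                        # greedy y-coordinate clustering
        if clusters and y - clusters[-1][-1] <= y_tol:
            clusters[-1].append(y)
        else:
            clusters.append([y])
    return len(clusters)                        # cluster count = floor estimate
```

The hard design choice is visible here: the function returns a bare count, with no confidence score the downstream assembly could use as a fallback signal.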
2.5 Training Infrastructure and Execution Environment
Training executed on M1 MacBook Pro using PyTorch MPS backend (Apple Silicon GPU). Batch generation and mesh execution parallelised via Python multiprocessing across grammar parameter samples. Total dataset generation time: ~45 minutes for 4,500 images including Houdini execution. Training time: ~8 minutes for 30 epochs. Both pipelines (v1 isometric and v2 front-view) trained and compared on the same hardware to ensure fair comparison.
3. Quantitative Results
3.1 Snippet Classification Accuracy
On the held-out synthetic test set (20% of 4,500 images, stratified by class and view): overall accuracy 96.8%. Per-class breakdown: BOX 98.4%, TOWER 97.1%, L-SHAPE 91.2%. L-SHAPE accuracy is lower due to corner junction ambiguity — front-view rendering of L-shapes produces an L-silhouette where the notch is frequently small and visually similar to a BOX outline at the image resolution used (256×256). Classification confidence on the TOWER prediction from the working prototype run: 98.9%, with output parameters {width: 8.4m, depth: 9.2m, height: 33.8m, floors: 12}.
| Snippet Class | Test Samples | Correct | Accuracy | Confusion |
|---|---|---|---|---|
| BOX | 300 | 295 | 98.4% | → L-SHAPE (5) |
| TOWER | 300 | 291 | 97.1% | → BOX (9) |
| L-SHAPE | 300 | 274 | 91.2% | → BOX (26) |
| Overall | 900 | 860 | 96.8% | — |
3.2 Floor Counting Accuracy
On the same 900 synthetic test images: floor count within ±1 of label: 81.3%. Exact match: 64.7%. The high ±1 tolerance rate reflects HoughLinesP correctly detecting floor proximity but occasionally merging or splitting adjacent horizontal clusters. Failure modes: (1) implied floors — lines not explicitly drawn, counter returns 1; (2) construction lines — sketch marks not representing floors detected as floors; (3) perspective distortion — tilted images (±15° views) cause horizontal filter to miss angled lines.
3.3 End-to-End Qualitative Evaluation
Formal evaluation on real freehand sketches was not conducted — no annotated real sketch dataset was collected. Informal evaluation on 12 hand-drawn test sketches showed 5/12 correct snippet classifications and 4/12 reasonable floor counts. The 5 correct cases were drawn front-view with explicit horizontal floor lines. The 7 failures: 4 perspective-driven misclassifications (sketches drawn at oblique angle), 2 L-SHAPE→BOX confusions (corner notch not drawn explicitly), 1 floor counting failure (floors implied by hatch lines instead of solid horizontals).
4. The Domain Gap — Analysis
The central finding of SketchProc3D is that achieving 96.8% accuracy on a synthetic test set provides essentially no guarantee of performance on real freehand input. The gap is not a minor distribution shift requiring more training data or stronger augmentation — it is a structural mismatch between the generative process of synthetic NPR images and the generative process of human sketching.
Synthetic NPR images are produced by: edge detection on clean 3D renders → controlled noise displacement. Their statistics are determined by: front-view 3D geometry projected orthographically, Canny response characteristics, Perlin amplitude/frequency hyperparameters. Human freehand sketches are produced by: motor-spatial planning from a mental model of the target shape → pen pressure variation → stroke correction behavior → arbitrary viewpoint choice. Their statistics are determined by: individual drawing style, abstraction level, implicit vs explicit structural encoding, variable stroke weight, re-drawing and overloading of strokes. No continuous perturbation of the synthetic distribution (including Perlin displacement, stroke gap simulation, or contrast jitter) replicates the second process.
Fig. 3 — Domain gap. Synthetic NPR distribution (left) achieves 96.8% benchmark accuracy. Real freehand sketches (right) are drawn at arbitrary viewpoints, with implied floors and variable stroke weight. Informal evaluation: ~5/12 correct classifications from real input.
| Data Source | Samples | CNN Acc | Floor ±1 Acc | Notes |
|---|---|---|---|---|
| NPR synthetic v1 (isometric, Canny only) | 600 | ~90% | ~72% | Isometric ≠ human viewpoint |
| NPR synthetic v1b (isometric + Perlin) | 600 | ~95% | ~74% | Jitter helps benchmark only |
| NPR synthetic v2 (front-view + multi-view) | 4,500 | 96.8% | 81.3% | Best synthetic result |
| Real freehand (informal, 12 samples) | 12 | ~42% | ~33% | Severe gap confirmed |
| Real freehand (needed for deployment) | 0 collected | — | — | Requires collection + annotation |
5. Differentiable Rendering Investigation
A secondary investigation explored whether differentiable rendering could provide an end-to-end training signal from sketch pixels back to grammar parameters — eliminating the need for labeled training data entirely. The hypothesis: if the pipeline sketch → params → CGA → mesh → render → image is fully differentiable, pixel-level reconstruction loss ℒ_render = ‖render(exec(θ)) − sketch‖₁ could supervise θ (grammar parameters) directly from sketch input.
nvdiffrast [3] provides differentiable rasterization: gradients flow from rendered pixel values back through the rasterization operation to 3D mesh vertex positions. The critical question is whether the gradient path can be extended: mesh vertex ← CGA executor ← grammar parameters θ.
The CityEngine CGA executor is a deterministic procedural interpreter — a Python subprocess call operating outside PyTorch's autograd graph. Gradient flow through it is not possible via standard backpropagation. A finite-difference numerical gradient approximation was attempted: perturb each grammar parameter θᵢ by δ=0.1, re-execute CGA, re-render, compute (ℒ(θ+δeᵢ) − ℒ(θ−δeᵢ))/(2δ). For 8 grammar parameters, this requires 16 forward passes per gradient step. Measured: ~5.2 seconds per CGA execution × 16 = ~83 seconds per gradient step. Impractical for training.
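A minimal version of the attempted finite-difference estimator makes the cost arithmetic concrete. `loss_fn` here stands in for the full execute-and-render pass (CGA subprocess plus nvdiffrast render), which is where the ~5.2 s per evaluation comes from:

```python
def fd_gradient(loss_fn, theta, delta=0.1):
    """Central-difference gradient over grammar parameters (Sec. 5).

    Each component costs two full execute+render passes, so a step over
    8 parameters needs 16 evaluations of loss_fn.
    """
    grad = []
    for i in range(len(theta)):
        hi = list(theta); hi[i] += delta
        lo = list(theta); lo[i] -= delta
        grad.append((loss_fn(hi) - loss_fn(lo)) / (2 * delta))
    return grad

# Cost accounting from the measurement above:
passes_per_step = 2 * 8                       # two-sided differences x 8 params
seconds_per_step = passes_per_step * 5.2      # ~83 s per gradient step
```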
The analysis establishes the executor gap as structural: the gradient path that matters — from image supervision back to grammar tokens — is blocked at the executor boundary regardless of renderer choice. Differentiable rendering closes the render→pixel gap; it does not address the program→mesh gap. Closing the latter requires either: (a) a differentiable grammar interpreter (no existing implementation for CGA-class languages), (b) policy gradient or reinforcement learning (high variance, slow convergence), or (c) replacing executor-based architecture with a learned generative model where 3D generation is itself a neural operation (the direction SculptNet pursues).
Fig. 4 — Differentiable rendering gradient analysis. nvdiffrast enables ∂ℒ/∂vertices (render → pixel backward pass, ✓). The CGA executor is a non-differentiable subprocess — ∂ℒ/∂θ through the program-to-mesh path is blocked (✕). Finite-difference approximation: ~83 seconds per gradient step — impractical.
6. Implementation: Prototype Runs and Observed Behavior
The full pipeline was implemented and executed on M1 MacBook Pro (macOS 14, Python 3.11, PyTorch 2.0, MPS backend). Key observed behaviors from working prototype runs:
Successful case (TOWER, 98.9% confidence): The CNN correctly identified a tall narrow building sketch as TOWER with high confidence. Predicted parameters: {width: 8.4m, depth: 9.2m, height: 33.8m, floors: 12}. CGA executor generated a 32-vertex, 48-face USD mesh. 3D output: correct high-aspect-ratio tower geometry with visible floor divisions.
Floor counting discrepancy: In the first working demo run, the sketch showed 5 floor lines visually; HoughLinesP reported 2 floors; the generated building had 2 floors. This was the first concrete evidence of the floor detection fragility — the HoughLinesP threshold was tuned on isometric synthetic data and did not generalize to hand-drawn proportions. The fix (v2 pipeline) re-tunes thresholds on front-view synthetic data and uses detected-vs-groundtruth label matching in training.
Processing times (M1 MPS): CNN inference: ~12ms. HoughLinesP: ~3ms. CGA execution: ~1.8s (Houdini startup overhead dominates). Total sketch-to-3D latency: ~2.1 seconds. The CGA executor startup is the primary latency bottleneck; persistent CGA process would reduce this to ~200ms per generation.
| Component | Latency (M1 MPS) | Bottleneck |
|---|---|---|
| CNN snippet classification | ~12ms | — |
| HoughLinesP floor count | ~3ms | — |
| CGA parameter assembly | <1ms | — |
| CityEngine CGA execution | ~1,800ms | Subprocess startup |
| USD mesh export | ~280ms | — |
| Total sketch → 3D mesh | ~2,100ms | CGA executor |
7. Related Work and Positioning
Garcia-Dorado et al. [1] is the direct precursor and primary reference. Their system differs in: constrained stylus input (not freehand), per-grammar CNN training (not unified classifier), and real user study evaluation (20 participants). SketchProc3D differs in: unconstrained freehand input, unified 3-class CNN, fully synthetic training data, and focus on characterising the domain gap rather than claiming user-facing deployment.
Talton et al. [5] use MCMC-based scene parameter estimation — gradient-free optimization in grammar parameter space. Compared to their approach: SketchProc3D CNN inference is ~60× faster (12ms vs ~720ms reported for MCMC), but MCMC provides uncertainty quantification and does not require training data. The tradeoff is clear: MCMC is slower but more principled; CNN is fast but brittle under distribution shift.
ProcGen3D [7] (Zhang et al. 2024) follows a related pattern — GPT-style autoregressive transformer predicting a procedural graph from a single RGB image, with MCTS-guided sampling for output consistency. Their work is relevant as a neural-graph alternative to grammar snippet recognition: rather than classifying sketches into a predefined vocabulary, they generate the graph structure autoregressively. This is the more flexible but higher-complexity direction.
8. Limitations and Research Agenda
SketchProc3D establishes two structural limitations that define subsequent thesis work:
Domain Gap: The synthetic-to-real distribution shift in sketch appearance is not solvable by augmentation within the NPR framework. Resolution requires: real sketch collection with grammar annotations (expensive), domain adaptation via style transfer (partially addresses appearance; does not fix viewpoint), or fundamentally different recognition — such as learning from unpaired sketch and 3D data via contrastive objectives. None of these were implemented in SketchProc3D; they are open problems the thesis explores in later chapters.
Executor Gap: CGA non-differentiability prevents end-to-end learning. The executor gap is the same structural problem as PGN's Houdini DSL non-differentiability. SketchProc3D adds the insight that differentiable rendering alone is insufficient — the problem is not in the rendering step but in the program-to-mesh translation. SculptNet addresses this by replacing the executor with differentiable primitive assembly: no symbolic grammar program is executed; instead, a neural network directly predicts primitive geometry. The Building Elevation Reconstruction system addresses this by operating at the mesh level entirely, bypassing grammar programs.
9. Conclusion
SketchProc3D achieves 96.8% accuracy on synthetic held-out data and approximately 42% on real freehand input. This roughly 55-point gap is the project's primary result. The differentiable rendering investigation establishes that the executor gap is structural and cannot be addressed with renderer choice. Together these findings characterize two independent open problems — domain gap and executor gap — that motivate the architectural directions of all subsequent thesis work: graph grammar research, SculptNet primitive assembly, and Apple Maps elevation reconstruction.
References
[1] Garcia-Dorado, I., Aliaga, D.G., Bhosle, S. "Interactive Sketching of Urban Procedural Models." ACM Trans. Graph. (SIGGRAPH), 35(4), 2016. ignaciogarciadorado.com/p/2016_TOG
[2] Eitz, M., Hays, J., Alexa, M. "How Do Humans Sketch Objects?" ACM Trans. Graph. (SIGGRAPH), 31(4), 2012.