Topic 02 · Sep – Oct 2025 · Applied ML · Inverse Procedural Modeling
A CNN-based system that recognises building type grammar snippets from freehand sketches and counts floors via OpenCV line detection — mapping rough 2D strokes to executable CityEngine CGA programs. Directly inspired by Garcia-Dorado et al. SIGGRAPH 2016. Core finding: synthetic NPR training data ≠ real human sketches.
3 grammar snippets · 95–99% CNN accuracy (synthetic) · NPR synthetic data generation
Three building archetypes: BOX (rectangular), TOWER (tall, narrow), L-SHAPE (L-footprint). Each maps to a parameterized CityEngine CGA rule. CNN classifies which snippet the sketch belongs to — this one classification step drives all downstream geometry generation.
02
CNN Type Recognition
Trained on synthetic NPR-rendered sketch pairs. Achieves 95–99% accuracy on its own distribution. The model degrades sharply on real freehand sketches — the domain gap is the project's defining challenge, not model capacity.
03
OpenCV Floor Counting
HoughLinesP detects horizontal stroke clusters, interpreted as floor dividers. Threshold-sensitive: too tight misses floors, too loose detects noise. Works reliably on clean synthetic sketches; degrades when floors are implied rather than explicitly drawn.
04
NPR Synthetic Data
Training pipeline: sample grammar params → execute in Houdini → front-view render → Canny edge detect → Perlin noise jitter. The jitter simulates stroke wobble but the underlying distribution is fundamentally different from human freehand drawing — the domain gap.
Core Finding
95% on synthetic. Unknown on real.
The CNN classifier achieves near-perfect accuracy on its own synthetic distribution — NPR-rendered, isometric, systematic. But the moment a real human sketch is used, the model meets a distribution it was never trained on. Perlin noise jitter is not a substitute for real human drawing data. The project's central lesson: data distribution is the architecture.
§ 1
System Architecture
The system decomposes the sketch-to-3D problem into two independent recognition tasks. A CNN classifier identifies building type from the snippet vocabulary (box / tower / L-shape). In parallel, OpenCV HoughLinesP counts horizontal stroke clusters and infers floor count. These are assembled into a grammar parameter set and passed to the CityEngine CGA executor.
The design mirrors Garcia-Dorado et al. 2016 directly — separate recognition channels per grammar attribute, each independently debuggable. The failure mode is that errors in either channel propagate directly to the output with no correction mechanism. A misclassified snippet produces entirely wrong geometry regardless of accurate floor counting.
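The assembly step described above can be sketched in a few lines. Names here (`GrammarParams`, `assemble_params`) are hypothetical, not from the original code:

```python
from dataclasses import dataclass

SNIPPET_TYPES = ("BOX", "TOWER", "L_SHAPE")

@dataclass
class GrammarParams:
    snippet: str       # which CGA rule template to execute
    floor_count: int   # from the HoughLinesP channel

def assemble_params(class_probs, detected_floors):
    """Fuse the two recognition channels into one CGA parameter set.

    Note the failure mode described above: neither channel can correct
    the other, so an error in either propagates straight to the executor.
    """
    snippet = SNIPPET_TYPES[class_probs.index(max(class_probs))]
    # Floor count is passed through unchecked (no confidence estimate).
    return GrammarParams(snippet=snippet, floor_count=max(1, detected_floors))
```

The `max(1, ...)` clamp is the only safeguard: a sketch with no detected floor lines still produces a one-floor building rather than degenerate geometry.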
BOX Snippet
Rectangular footprint
Standard rectangular building. Params: width, depth, floor_count, window_frac, style. Most common class — widest training distribution.
TOWER Snippet
Tall narrow footprint
High aspect-ratio building (width < depth/3). CNN distinguishes from BOX via width/height ratio in the sketch outline. Adds taper_factor param.
L-SHAPE Snippet
L-footprint building
Two rectangular volumes joined at corner. Hardest class — requires detecting the L-junction notch in the sketch. Lowest CNN accuracy of the three.
Floor Counter
OpenCV HoughLinesP
Horizontal line cluster count → floor_count parameter. Threshold-sensitive. Fails when floors are implied rather than explicitly drawn as strokes.
CityEngine
CGA Grammar executor
Esri CityEngine executes assembled CGA token sequence → USD building mesh. Deterministic, non-differentiable. Same executor gap as PGN's Houdini DSL.
§ 2
The Domain Gap Problem
No paired sketch → procedural-program dataset exists. Training data must be generated synthetically: sample grammar parameters, execute in Houdini, render isometrically, run Canny edge detection, apply Perlin noise jitter to simulate stroke imperfection. The CNN trained on this distribution achieves 95–99% accuracy — and breaks on real human input.
The gap is not a minor distribution shift. Human sketches are drawn from arbitrary viewpoints, not frontal isometric elevation. Strokes are rough and expressive — a single wobbly line might represent three floors, or might be a stylistic mark with no structural meaning. Floors are often implied rather than explicitly divided. Overlapping and re-drawn strokes confuse line detection. No amount of Perlin jitter on a clean synthetic render captures these characteristics.
Fig. 2 — The domain gap. Synthetic NPR data (left) is clean, precise, and isometric — CNN achieves 95–99% on this distribution. Real human sketches (right) are rough, non-isometric, and ambiguous. Perlin noise jitter does not bridge this gap.
| Data Type | Volume | CNN Acc. | Gap to Human |
|---|---|---|---|
| NPR synthetic (Canny + Perlin jitter) | ~200 per class | 95–99% | — |
| Rendered isometric + edge detect only | ~200 per class | ~90% | Small |
| Real freehand (not collected) | 0 | Unknown | Large |

Required next step: real sketch collection → sketch style transfer or domain adaptation
§ 3
Differentiable Rendering Analysis
Alongside the modular CNN approach, this period investigated differentiable rendering as a path to closing the training signal gap. The hypothesis: if the executed 3D building can be rendered back to 2D differentiably, a pixel-level reconstruction loss could provide gradient signal from the output image all the way to the grammar parameters — without needing ground-truth grammar labels.
nvdiffrast provides differentiable rasterization — gradients flow from rendered pixel values back to 3D mesh vertex positions. But the CityEngine CGA executor sits between the grammar parameters and the mesh as a non-differentiable interpreter. Differentiable rendering closes the render→pixel gap but not the program→mesh gap. The gradient path that matters most — from image supervision back to grammar tokens — remains blocked at the executor boundary.
This analysis directly motivated the turn toward graph grammar research: if the executor cannot be made differentiable, can the grammar itself be learned from mesh geometry, removing the hand-authored executor entirely? If so, the executor gap becomes a grammar learning problem — the direction Merrell's graph grammar work explores.
§ 5
Training Pipeline
The training pipeline was built entirely from scratch — no pre-existing sketch dataset exists for building grammar snippets, so synthetic data generation was the only path. The pipeline went through two major iterations: an initial isometric-projection approach that achieved high benchmark accuracy but failed on real input, and a corrected front-view approach with multi-view augmentation.
Grammar Sampler
500 samples × 3 types
Random parameter sampling from each snippet's defined range. BOX: width∈[5,30], depth∈[5,20], floors∈[1,8]. TOWER: width∈[3,10], floors∈[4,15]. L-SHAPE: wing1∈[8,20], wing2∈[6,15], junction∈[0.3,0.7].
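The sampling step can be sketched directly from the ranges above; function names are illustrative:

```python
import random

# Parameter ranges from the snippet definitions above.
RANGES = {
    "BOX":     {"width": (5, 30), "depth": (5, 20), "floors": (1, 8)},
    "TOWER":   {"width": (3, 10), "floors": (4, 15)},
    "L_SHAPE": {"wing1": (8, 20), "wing2": (6, 15), "junction": (0.3, 0.7)},
}

def sample_params(snippet, rng=random):
    """Draw one uniform sample from a snippet's parameter space."""
    out = {}
    for name, (lo, hi) in RANGES[snippet].items():
        if name == "floors":
            out[name] = rng.randint(lo, hi)    # integer floor counts
        else:
            out[name] = rng.uniform(lo, hi)    # continuous dimensions
    return out

# 500 samples x 3 snippet types, matching the stage label above.
dataset = [(s, sample_params(s)) for s in RANGES for _ in range(500)]
```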
3D Mesh Gen
trimesh + Houdini
Parameters compiled to CGA token sequences, executed via Houdini Python SOP to generate watertight USD meshes. Each mesh verified for manifold property before rendering stage.
Front-View Renderer
3 views per sample
Front (0°,0°), slight-left (0°,−15°), slight-right (0°,15°). Rotation in the front-view plane — not isometric. Each view independently labelled. Triples effective dataset size to ~4,500 training images.
CNN Trainer
4-layer conv, Adam
Conv(32)→Conv(64)→Conv(128)→Conv(256), each with 3×3 kernel, ReLU, MaxPool 2×2. FC(512)→FC(3). Cross-entropy loss. Adam lr=1e-3. 30 epochs, batch 32. 80/20 train/val split.
Floor Counter
OpenCV HoughLinesP
rho=1px, theta=π/180, threshold=50, minLineLength=30, maxLineGap=10. Lines filtered to |angle|<10°. Y-cluster tolerance ±5px. In v2, floor label uses detected count (not ground truth) so training matches inference conditions.
Fig. 3 — v2 training pipeline. Front-view rendering replaced isometric after v1 showed domain gap was partly perspective-induced.
| Pipeline Version | Render View | Samples/Class | Val Acc (Synth) | Real Sketch |
|---|---|---|---|---|
| v1 — isometric only | 30°,30° isometric | 200 | ~90% | Degraded |
| v1b — + Perlin jitter | 30°,30° isometric | 200 | 95% | Still degraded |
| v2 — front-view + multi-view aug | 0°,0° front ± 15° | 500 × 3 views | 95–99% | Better, still gaps |
§ 6
Experiment Log
The project ran through a sequence of concrete implementation experiments, each exposing a distinct failure mode. The log below documents the actual progression — not a clean narrative of success but the real sequence of build → test → discover → iterate.
E-01
Initial Isometric Pipeline — Achieves 90% but Fails Informally
Built initial data generator rendering grammar-sampled buildings isometrically (30°,30°). Canny edge detection only, no jitter. 200 samples/class. CNN trained for 30 epochs reached 90% validation accuracy. Informal test on own freehand sketches: completely wrong classifications. Root cause: isometric projection looks nothing like how humans draw buildings — the oblique angles and foreshortening it introduces are not present in real sketches.
v1 Done
E-02
Perlin Noise Jitter Added — Marginal Improvement on Benchmark
Added Perlin noise displacement to edge pixels (amplitude 2px, frequency 0.1) + random stroke gaps to simulate rough drawing. Synthetic accuracy improved to 95%. Real sketch behavior unchanged — Perlin jitter on an isometric render still does not produce images resembling front-view human drawing. Confirmed: the problem is viewpoint, not stroke roughness.
v1b Done
E-03
Front-View Rendering — Architectural Pivot
Rebuilt renderer to produce front-view (0°,0°) projections matching the angle humans draw buildings. Floor lines are now horizontal, matching HoughLinesP assumption. Building silhouettes match rough rectangular/L-shaped outlines humans produce. Added ±15° lateral rotations for augmentation. Reran full pipeline: 500 samples × 3 views × 3 classes = 4,500 training images. Accuracy 95–99% on synthetic set. Better informal behavior on own sketches, though not yet systematically evaluated.
v2 Done
E-04
Floor Counter Ground-Truth Bug
Training labels used the ground-truth floor count from the parameter sampler, not the floor count as detected by HoughLinesP at inference time. At inference: floors detected ≠ floors expected. Fixed by running HoughLinesP on each training image and using the detected count (noisy) as the label — matching inference conditions. Floor ±1 accuracy on synthetic: 80%. Same bug pattern as a teacher-forcing train/inference mismatch.
Fixed
E-05
L-SHAPE Removed from v2, Reintroduced, Remains Hardest Class
Front-view L-SHAPE rendering produces an L-shaped silhouette where the corner notch is often ambiguous at small scales. Initial v2 removed L-SHAPE entirely (two classes only), achieving 99% on BOX+TOWER. Decision reversed: L-SHAPE is the most common real building typology and cannot be dropped. Reintroduced with additional training images focused on notch visibility. CNN accuracy on L-SHAPE: ~92%, lowest of three classes, as expected given corner junction detection difficulty.
Open
E-06
nvdiffrast End-to-End Gradient Experiment
Attempted to build an end-to-end differentiable loop: params → CGA string → CityEngine Python API → mesh vertices → nvdiffrast → rendered image → ℒ_render = ‖output − sketch‖₁. CityEngine call is a subprocess — gradient does not flow through it. Tried wrapping in custom autograd Function with numerical gradient approximation (finite differences over grammar parameters): too slow (5+ seconds per parameter perturbation × 8 parameters). Analysis confirmed: executor gap is structural, not solvable with renderer choice. Documented and closed. This experiment directly motivated the graph grammar research direction.
Closed
§ 4
Open Challenges
C-01
Domain Gap: Synthetic vs. Real Freehand Sketches
CNN trained on NPR-rendered synthetic data achieves 95–99% on that distribution, degrades sharply on real human sketches. The gap is fundamental — human sketches are non-isometric, variable-weight, imprecise, with implied rather than explicit structure. Perlin noise jitter does not capture this. Requires real sketch data collection or domain adaptation. Primary unresolved problem.
Open
C-02
Floor Counting Fragility (HoughLinesP)
Threshold-sensitive heuristic fails when floors are implied rather than explicitly drawn. Non-horizontal strokes confuse detection. No confidence estimate — floor count is passed to executor with no uncertainty quantification. A wrong floor count produces incorrect geometry with no fallback.
Open
C-03
Grammar Vocabulary Limited to 3 Snippets
BOX, TOWER, L-SHAPE cover a narrow range of building typologies. Most real urban buildings fall outside this set. Expanding snippets multiplies training requirements and introduces harder classification boundaries. Same fundamental limitation Garcia-Dorado 2016 identified — snippet vocabulary is the scope ceiling of grammar-based approaches.
Open
C-04
Executor Non-Differentiability
CityEngine CGA is a deterministic interpreter — no gradient flows from executed geometry back to grammar tokens. Differentiable rendering closes the render-to-pixel gap but not the program-to-mesh gap. Same structural problem as PGN C-01. Motivates both graph grammar research (learn the grammar itself) and SculptNet (replace executor with differentiable primitive assembly).
Open
Interactive Demo
The system was prototyped as a dual-panel Tkinter application: a drawing canvas on the left where you sketch a building in front-view, and a live matplotlib 3D viewer on the right showing the generated mesh. Below is a JavaScript recreation of the same pipeline — draw a building outline, click Generate, and the system classifies your sketch and assembles the 3D building from grammar parameters.
SKETCHPROC3D — LIVE DEMO · CNN + GRAMMAR EXECUTOR
INPUT — FREEHAND SKETCH
OUTPUT — GRAMMAR EXECUTOR · CGA
For demonstration purposes only — this is a partial, browser-side approximation of the actual system. Classification uses simple bounding-box heuristics; the real system used a trained PyTorch CNN (96.8% synthetic accuracy) with OpenCV HoughLinesP floor detection on M1 MPS, feeding a CityEngine CGA executor that produced USD meshes. 3D geometry shown here is illustrative, not a reconstruction of actual pipeline output.
Demo — The same pipeline implemented by the actual Python prototype: sketch → heuristic type classification → OpenCV floor count → grammar parameter assembly → procedural mesh generation. The 3D viewer renders the assembled building geometry.
Full Technical Paper
arXiv-format preprint · SketchProc3D: CNN-Based Grammar Snippet Recognition for Inverse Procedural Modeling of Building Facades from Freehand Sketches
We present SketchProc3D, a system for inverse procedural modeling of building facades from freehand 2D sketches, following the architecture established by Garcia-Dorado et al. (SIGGRAPH 2016). The system maps a user sketch to an executable CityEngine CGA grammar program via two parallel recognition channels: a 4-layer CNN classifier identifying the building's grammar snippet type (BOX, TOWER, or L-SHAPE), and an OpenCV HoughLinesP detector counting floor divisions from horizontal stroke clusters. Training data is generated synthetically — grammar programs sampled randomly, executed in Houdini to produce 3D meshes, rendered in front-view projection, and processed through Canny edge detection with Perlin noise jitter. A v1 isometric rendering pipeline achieved ~90% synthetic accuracy but failed on real input. A v2 front-view pipeline with multi-view augmentation (±15° lateral rotation) achieves 95–99% accuracy on its training distribution. The primary finding is a severe domain gap: the CNN trained on NPR-rendered synthetic data achieves near-perfect benchmark accuracy and degrades substantially on real freehand sketches, because the visual statistics of the two distributions differ fundamentally in viewpoint, stroke weight, floor explicitness, and structural ambiguity. A parallel investigation of differentiable rendering (nvdiffrast) establishes that gradient flow from rendered pixels to mesh vertices is achievable, but the CGA executor between grammar parameters and mesh vertices is non-differentiable — the program-to-mesh gap cannot be closed with renderer choice alone, requiring either a differentiable grammar interpreter or architectural abandonment of executor-based systems. These two findings — domain gap and executor gap — constitute the central negative result of the project and define the research agenda for all subsequent thesis work. 
Keywords: inverse procedural modeling, sketch-to-3D, grammar snippets, domain gap, NPR synthetic data, CityEngine CGA, differentiable rendering.
1. Introduction
The problem of generating 3D building models from casual freehand sketches sits at the intersection of sketch-based modeling, procedural generation, and machine learning. A working solution would enable non-expert users to produce procedurally editable 3D geometry from natural drawing input — a capability relevant to architectural design, game development, and urban reconstruction pipelines.
Garcia-Dorado et al. [1] established the key architectural pattern: define a vocabulary of grammar snippets (building typologies parameterised by width, height, floors, style), train CNNs to classify sketches into the vocabulary and regress per-snippet parameters, and execute the recognised grammar program to produce 3D geometry. Their system operated with constrained stylus input on a tablet; the domain gap between training data and real input was minimised by the controlled input device.
SketchProc3D implements this architecture for unconstrained freehand input, adds explicit floor counting via computer vision, and investigates synthetic training data as the scalability path — since collecting large real sketch datasets with grammar-level annotations is prohibitively expensive. The project's contribution is not a new architecture but a precise empirical characterisation of where the Garcia-Dorado approach succeeds and fails when training data is fully synthetic and input is unconstrained.
The project connects directly to PGN [6], which established the same pattern with precise geometric input (polylines → DSL program → 3D bridge). SketchProc3D tests the same recognition-to-program pattern with rough visual input. The verdict is conditional: the pattern is tractable when training and test distributions match, and structurally fragile when they diverge.
Figure 1 — SketchProc3D inference pipeline. The two recognition channels (CNN classifier and HoughLinesP floor counter) run in parallel on the same input image, outputs assembled into a CGA parameter set, and executed by CityEngine to produce a USD mesh. Dashed box indicates optional downstream viewer.
2. System Architecture and Method
2.1 Grammar Snippet Vocabulary
Three grammar snippets define the vocabulary. BOX: standard rectangular building parameterised by {width ∈ [5,30]m, depth ∈ [5,20]m, floor_count ∈ [1,8], window_frac ∈ [0.3,0.7], style ∈ {plain, detailed}}. Most common class; widest training distribution. TOWER: high-aspect-ratio variant with constraint width < depth/3; adds taper_factor ∈ [0,0.15] for slight narrowing toward roof; floor_count ∈ [4,15]. L-SHAPE: two rectangular volumes joined at corner, parameterised by {wing1_length, wing2_length, junction_offset ∈ [0.3,0.7]}; hardest class due to corner junction detection difficulty. Each snippet compiles to a CityEngine CGA program string executed via the CityEngine Python API to produce a USD mesh. The CGA program structure follows the Müller et al. [4] shape grammar formalism: extrude → comp(f) → split(y) floors → split(x) window bays.
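The compilation step for the BOX snippet can be sketched as a small string emitter. The real system's CGA rule text is not shown, so this is an illustrative reconstruction following the extrude → comp(f) → split(y) → split(x) structure; the 3 m storey height is an assumption:

```python
FLOOR_HEIGHT_M = 3.0  # assumed storey height; not specified in the text

def compile_box_cga(width, depth, floor_count, window_frac):
    """Emit a CGA-style rule string for the BOX snippet.

    Illustrative reconstruction of the compiler output, not the
    actual rule text executed by CityEngine.
    """
    height = floor_count * FLOOR_HEIGHT_M
    lines = [
        "attr width = %g" % width,
        "attr depth = %g" % depth,
        "Lot --> extrude(%g) Building" % height,
        "Building --> comp(f) { front : Facade | all : Wall. }",
        "Facade --> split(y) { ~%g : Floor }*" % FLOOR_HEIGHT_M,
        "Floor --> split(x) { ~1 : WindowBay(%g) }*" % window_frac,
    ]
    return "\n".join(lines)
```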
2.2 Synthetic Training Data Generation (NPR Pipeline v2)
No real paired dataset (sketch → grammar annotation) exists. Synthetic data generation is the only scalable path. The v2 pipeline: (1) sample 500 parameter configurations uniformly from each snippet's parameter space; (2) execute each configuration in Houdini via Python SOP to produce a watertight USD mesh; (3) render in front-view projection (camera at 0°,0° — the angle humans draw buildings) with two additional lateral rotations (−15°, +15°) for augmentation, tripling the per-sample count; (4) apply Canny edge detection (low=50, high=150) to produce clean edge maps; (5) apply Perlin noise displacement to each edge pixel (amplitude=2px, frequency=0.1) and introduce random stroke breaks (2–5px gaps per segment) to simulate sketch imperfection. This yields 4,500 training images (500 samples × 3 types × 3 views). Floor labels are derived by running HoughLinesP on each synthetic training image — using detected floor count as label rather than ground-truth parameter value, ensuring training conditions match inference conditions.
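Step (5) can be approximated as follows. Simple value noise stands in for true Perlin noise, and random pixel dropout stands in for the 2–5px stroke breaks, so this is a sketch of the jitter idea rather than the actual generator; the Canny step is omitted and a binary edge map is assumed as input:

```python
import numpy as np

def jitter_edges(edge_map, amplitude=2.0, frequency=0.1, gap_prob=0.02, rng=None):
    """Displace edge pixels by a smooth noise field and punch random gaps.

    edge_map: binary HxW array (output of the omitted Canny step).
    Value noise approximates Perlin; dropout approximates stroke breaks.
    """
    rng = rng or np.random.default_rng(0)
    h, w = edge_map.shape
    # Coarse random grid + per-row interpolation = smooth displacement field.
    coarse = rng.uniform(-1, 1, (max(2, int(h * frequency)), max(2, int(w * frequency))))
    ys = np.linspace(0, coarse.shape[0] - 1, h)
    xs = np.linspace(0, coarse.shape[1] - 1, w)
    field = np.array([np.interp(xs, np.arange(coarse.shape[1]),
                                coarse[int(round(y))]) for y in ys])
    out = np.zeros_like(edge_map)
    ys_e, xs_e = np.nonzero(edge_map)
    # Horizontal displacement of each edge pixel, clipped to the image.
    new_x = np.clip(xs_e + np.round(field[ys_e, xs_e] * amplitude).astype(int), 0, w - 1)
    out[ys_e, new_x] = 1
    out[rng.random(out.shape) < gap_prob] = 0  # random stroke breaks
    return out
```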
2.3 CNN Snippet Classifier
Architecture: 4-layer convolutional network. Conv(32, 3×3) → MaxPool(2) → Conv(64, 3×3) → MaxPool(2) → Conv(128, 3×3) → MaxPool(2) → Conv(256, 3×3) → GlobalAvgPool → FC(512) → FC(3). Activation: ReLU throughout. Input: 256×256 grayscale (single channel). Training: cross-entropy loss, Adam (lr=10⁻³, β₁=0.9, β₂=0.999), 30 epochs, batch size 32, 80/20 train/val split stratified by class. No pretrained backbone — ImageNet features are irrelevant for binary sketch edge maps; a small network trained from scratch on the task distribution outperforms fine-tuned ResNet-18 by ~4% in this regime.
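The architecture above takes only a few lines of PyTorch. This is a reconstruction from the spec, not the project's actual training code; padding choices are assumptions:

```python
import torch
import torch.nn as nn

class SnippetCNN(nn.Module):
    """Sec. 2.3 classifier: Conv(32/64/128/256, 3x3) with 2x2 pooling
    after the first three blocks, global average pool, FC(512) -> FC(3)."""
    def __init__(self, num_classes=3):
        super().__init__()
        chans, layers, in_c = (32, 64, 128, 256), [], 1  # 1-channel sketch input
        for i, out_c in enumerate(chans):
            layers += [nn.Conv2d(in_c, out_c, 3, padding=1), nn.ReLU(inplace=True)]
            if i < 3:                          # no pool before the global average pool
                layers.append(nn.MaxPool2d(2))
            in_c = out_c
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)    # global average pooling
        self.head = nn.Sequential(nn.Linear(256, 512), nn.ReLU(inplace=True),
                                  nn.Linear(512, num_classes))

    def forward(self, x):                      # x: (B, 1, 256, 256) grayscale
        z = self.pool(self.features(x)).flatten(1)
        return self.head(z)

model = SnippetCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # config from Sec. 2.3
```

Global average pooling makes the network resolution-agnostic, which is convenient given the mix of rendered views; the FC head sees a fixed 256-dim vector regardless of input size.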
2.4 OpenCV Floor Counter
HoughLinesP parameters: rho=1px, theta=π/180 rad, threshold=50, minLineLength=30px, maxLineGap=10px. Detected segments filtered to near-horizontal (|angle| < 10°). Y-coordinate clustering (tolerance ±5px) groups co-planar segments into floor lines. Cluster count is the floor estimate. No confidence output; floor count passed directly to CGA parameter assembly with no fallback. This is a hard design choice — wrong floor count produces incorrect geometry with no correction mechanism.
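The clustering half of the counter is simple enough to sketch. The snippet below takes segments as returned by `cv2.HoughLinesP` with the parameters above and applies the angle filter and y-clustering; the greedy clustering is an assumed implementation detail:

```python
import math

def count_floors(segments, angle_max_deg=10.0, y_tol=5.0):
    """Cluster near-horizontal Hough segments into floor lines (Sec. 2.4).

    `segments` is a list of (x1, y1, x2, y2) tuples, e.g. from
    cv2.HoughLinesP(edges, rho=1, theta=math.pi / 180, threshold=50,
                    minLineLength=30, maxLineGap=10).
    """
    ys = []
    for x1, y1, x2, y2 in segments:
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        if abs(angle) < angle_max_deg:          # keep near-horizontal strokes
            ys.append((y1 + y2) / 2.0)
    clusters = []
    for y in sorted(ys):                        # greedy y-coordinate clustering
        if clusters and y - clusters[-1][-1] <= y_tol:
            clusters[-1].append(y)
        else:
            clusters.append([y])
    return len(clusters)                        # cluster count = floor estimate
```

The hard design choice is visible here: the function returns a bare count, with no confidence score the downstream assembly could use as a fallback signal.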
2.5 Training Infrastructure and Execution Environment
Training executed on M1 MacBook Pro using PyTorch MPS backend (Apple Silicon GPU). Batch generation and mesh execution parallelised via Python multiprocessing across grammar parameter samples. Total dataset generation time: ~45 minutes for 4,500 images including Houdini execution. Training time: ~8 minutes for 30 epochs. Both pipelines (v1 isometric and v2 front-view) trained and compared on the same hardware to ensure fair comparison.
3. Quantitative Results
3.1 Snippet Classification Accuracy
On the held-out synthetic test set (20% of 4,500 images, stratified by class and view): overall accuracy 96.8%. Per-class breakdown: BOX 98.4%, TOWER 97.1%, L-SHAPE 91.2%. L-SHAPE accuracy is lower due to corner junction ambiguity — front-view rendering of L-shapes produces an L-silhouette where the notch is frequently small and visually similar to a BOX outline at the image resolution used (256×256). Classification confidence on the TOWER prediction from the working prototype run: 98.9%, with output parameters {width: 8.4m, depth: 9.2m, height: 33.8m, floors: 12}.
| Snippet Class | Test Samples | Correct | Accuracy | Confusion |
|---|---|---|---|---|
| BOX | 300 | 295 | 98.4% | → L-SHAPE (5) |
| TOWER | 300 | 291 | 97.1% | → BOX (9) |
| L-SHAPE | 300 | 274 | 91.2% | → BOX (26) |
| Overall | 900 | 860 | 96.8% | — |
3.2 Floor Counting Accuracy
On the same 900 synthetic test images: floor count within ±1 of label: 81.3%. Exact match: 64.7%. The high ±1 tolerance rate reflects HoughLinesP correctly detecting floor proximity but occasionally merging or splitting adjacent horizontal clusters. Failure modes: (1) implied floors — lines not explicitly drawn, counter returns 1; (2) construction lines — sketch marks not representing floors detected as floors; (3) perspective distortion — tilted images (±15° views) cause horizontal filter to miss angled lines.
3.3 End-to-End Qualitative Evaluation
Formal evaluation on real freehand sketches was not conducted — no annotated real sketch dataset was collected. Informal evaluation on 12 hand-drawn test sketches showed 5/12 correct snippet classifications and 4/12 reasonable floor counts. The 5 correct cases were drawn front-view with explicit horizontal floor lines. The 7 failures: 4 perspective-driven misclassifications (sketches drawn at oblique angle), 2 L-SHAPE→BOX confusions (corner notch not drawn explicitly), 1 floor counting failure (floors implied by hatch lines instead of solid horizontals).
4. The Domain Gap — Analysis
The central finding of SketchProc3D is that achieving 96.8% accuracy on a synthetic test set provides essentially no guarantee of performance on real freehand input. The gap is not a minor distribution shift requiring more training data or stronger augmentation — it is a structural mismatch between the generative process of synthetic NPR images and the generative process of human sketching.
Synthetic NPR images are produced by: edge detection on clean 3D renders → controlled noise displacement. Their statistics are determined by: front-view 3D geometry projected orthographically, Canny response characteristics, Perlin amplitude/frequency hyperparameters. Human freehand sketches are produced by: motor-spatial planning from a mental model of the target shape → pen pressure variation → stroke correction behavior → arbitrary viewpoint choice. Their statistics are determined by: individual drawing style, abstraction level, implicit vs explicit structural encoding, variable stroke weight, re-drawing and overloading of strokes. No continuous perturbation of the synthetic distribution (including Perlin displacement, stroke gap simulation, or contrast jitter) replicates the second process.
Fig. 3 — Domain gap. Synthetic NPR distribution (left) achieves 96.8% benchmark accuracy. Real freehand sketches (right) are drawn at arbitrary viewpoints, with implied floors and variable stroke weight. Informal evaluation: ~5/12 correct classifications from real input.
| Data Source | Samples | CNN Acc | Floor ±1 Acc | Notes |
|---|---|---|---|---|
| NPR synthetic v1 (isometric, Canny only) | 600 | ~90% | ~72% | Isometric ≠ human viewpoint |
| NPR synthetic v1b (isometric + Perlin) | 600 | ~95% | ~74% | Jitter helps benchmark only |
| NPR synthetic v2 (front-view + multi-view) | 4,500 | 96.8% | 81.3% | Best synthetic result |
| Real freehand (informal, 12 samples) | 12 | ~42% | ~33% | Severe gap confirmed |
| Real freehand (needed for deployment) | 0 collected | — | — | Requires collection + annotation |
5. Differentiable Rendering Investigation
A secondary investigation explored whether differentiable rendering could provide an end-to-end training signal from sketch pixels back to grammar parameters — eliminating the need for labeled training data entirely. The hypothesis: if the pipeline sketch → params → CGA → mesh → render → image is fully differentiable, pixel-level reconstruction loss ℒ_render = ‖render(exec(θ)) − sketch‖₁ could supervise θ (grammar parameters) directly from sketch input.
nvdiffrast [3] provides differentiable rasterization: gradients flow from rendered pixel values back through the rasterization operation to 3D mesh vertex positions. The critical question is whether the gradient path can be extended: mesh vertex ← CGA executor ← grammar parameters θ.
The CityEngine CGA executor is a deterministic procedural interpreter — a Python subprocess call operating outside PyTorch's autograd graph. Gradient flow through it is not possible via standard backpropagation. A finite-difference numerical gradient approximation was attempted: perturb each grammar parameter θᵢ by δ=0.1, re-execute CGA, re-render, compute (ℒ(θ+δeᵢ) − ℒ(θ−δeᵢ))/(2δ). For 8 grammar parameters, this requires 16 forward passes per gradient step. Measured: ~5.2 seconds per CGA execution × 16 = ~83 seconds per gradient step. Impractical for training.
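A minimal version of the attempted finite-difference estimator makes the cost arithmetic concrete. `loss_fn` here stands in for the full execute-and-render pass (CGA subprocess plus nvdiffrast render), which is where the ~5.2 s per evaluation comes from:

```python
def fd_gradient(loss_fn, theta, delta=0.1):
    """Central-difference gradient over grammar parameters (Sec. 5).

    Each component costs two full execute+render passes, so a step over
    8 parameters needs 16 evaluations of loss_fn.
    """
    grad = []
    for i in range(len(theta)):
        hi = list(theta); hi[i] += delta
        lo = list(theta); lo[i] -= delta
        grad.append((loss_fn(hi) - loss_fn(lo)) / (2 * delta))
    return grad

# Cost accounting from the measurement above:
passes_per_step = 2 * 8                       # two-sided differences x 8 params
seconds_per_step = passes_per_step * 5.2      # ~83 s per gradient step
```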
The analysis establishes the executor gap as structural: the gradient path that matters — from image supervision back to grammar tokens — is blocked at the executor boundary regardless of renderer choice. Differentiable rendering closes the render→pixel gap; it does not address the program→mesh gap. Closing the latter requires either: (a) a differentiable grammar interpreter (no existing implementation for CGA-class languages), (b) policy gradient or reinforcement learning (high variance, slow convergence), or (c) replacing executor-based architecture with a learned generative model where 3D generation is itself a neural operation (the direction SculptNet pursues).
Fig. 4 — Differentiable rendering gradient analysis. nvdiffrast enables ∂ℒ/∂vertices (render → pixel backward pass, ✓). The CGA executor is a non-differentiable subprocess — ∂ℒ/∂θ through the program-to-mesh path is blocked (✕). Finite-difference approximation: ~83 seconds per gradient step — impractical.
6. Implementation: Prototype Runs and Observed Behavior
The full pipeline was implemented and executed on M1 MacBook Pro (macOS 14, Python 3.11, PyTorch 2.0, MPS backend). Key observed behaviors from working prototype runs:
Successful case (TOWER, 98.9% confidence): The CNN correctly identified a tall narrow building sketch as TOWER with high confidence. Predicted parameters: {width: 8.4m, depth: 9.2m, height: 33.8m, floors: 12}. CGA executor generated a 32-vertex, 48-face USD mesh. 3D output: correct high-aspect-ratio tower geometry with visible floor divisions.
Floor counting discrepancy: In the first working demo run, the sketch showed 5 floor lines visually; HoughLinesP reported 2 floors; the generated building had 2 floors. This was the first concrete evidence of the floor detection fragility — the HoughLinesP threshold was tuned on isometric synthetic data and did not generalize to hand-drawn proportions. The fix (v2 pipeline) re-tunes thresholds on front-view synthetic data and uses detected-vs-groundtruth label matching in training.
Processing times (M1 MPS): CNN inference: ~12ms. HoughLinesP: ~3ms. CGA execution: ~1.8s (Houdini startup overhead dominates). Total sketch-to-3D latency: ~2.1 seconds. The CGA executor startup is the primary latency bottleneck; persistent CGA process would reduce this to ~200ms per generation.
| Component | Latency (M1 MPS) | Bottleneck |
|---|---|---|
| CNN snippet classification | ~12ms | — |
| HoughLinesP floor count | ~3ms | — |
| CGA parameter assembly | <1ms | — |
| CityEngine CGA execution | ~1,800ms | Subprocess startup |
| USD mesh export | ~280ms | — |
| Total sketch → 3D mesh | ~2,100ms | CGA executor |
7. Related Work and Positioning
Garcia-Dorado et al. [1] is the direct precursor and primary reference. Their system differs in: constrained stylus input (not freehand), per-grammar CNN training (not unified classifier), and real user study evaluation (20 participants). SketchProc3D differs in: unconstrained freehand input, unified 3-class CNN, fully synthetic training data, and focus on characterising the domain gap rather than claiming user-facing deployment.
Talton et al. [5] use MCMC-based scene parameter estimation — gradient-free optimization in grammar parameter space. Compared to their approach: SketchProc3D CNN inference is ~60× faster (12ms vs ~720ms reported for MCMC), but MCMC provides uncertainty quantification and does not require training data. The tradeoff is clear: MCMC is slower but more principled; CNN is fast but brittle under distribution shift.
ProcGen3D [7] (Zhang et al. 2024) follows a related pattern — GPT-style autoregressive transformer predicting a procedural graph from a single RGB image, with MCTS-guided sampling for output consistency. Their work is relevant as a neural-graph alternative to grammar snippet recognition: rather than classifying sketches into a predefined vocabulary, they generate the graph structure autoregressively. This is the more flexible but higher-complexity direction.
8. Limitations and Research Agenda
SketchProc3D establishes two structural limitations that define subsequent thesis work:
Domain Gap: The synthetic-to-real distribution shift in sketch appearance is not solvable by augmentation within the NPR framework. Resolution requires: real sketch collection with grammar annotations (expensive), domain adaptation via style transfer (partially addresses appearance; does not fix viewpoint), or fundamentally different recognition — such as learning from unpaired sketch and 3D data via contrastive objectives. None of these were implemented in SketchProc3D; they are open problems the thesis explores in later chapters.
Executor Gap: CGA non-differentiability prevents end-to-end learning. The executor gap is the same structural problem as PGN's Houdini DSL non-differentiability. SketchProc3D adds the insight that differentiable rendering alone is insufficient — the problem is not in the rendering step but in the program-to-mesh translation. SculptNet addresses this by replacing the executor with differentiable primitive assembly: no symbolic grammar program is executed; instead, a neural network directly predicts primitive geometry. The Building Elevation Reconstruction system addresses this by operating at the mesh level entirely, bypassing grammar programs.
9. Conclusion
SketchProc3D achieves 96.8% accuracy on synthetic held-out data and approximately 42% on real freehand input. This roughly 55-point gap is the project's primary result. The differentiable rendering investigation establishes that the executor gap is structural and cannot be addressed with renderer choice. Together these findings characterize two independent open problems — domain gap and executor gap — that motivate the architectural directions of all subsequent thesis work: graph grammar research, SculptNet primitive assembly, and Apple Maps elevation reconstruction.
References
[1] Garcia-Dorado, I., Aliaga, D.G., Bhosle, S. "Interactive Sketching of Urban Procedural Models." ACM Trans. Graph. (SIGGRAPH), 35(4), 2016. ignaciogarciadorado.com/p/2016_TOG
[2] Eitz, M., Hays, J., Alexa, M. "How Do Humans Sketch Objects?" ACM Trans. Graph. (SIGGRAPH), 31(4), 2012.