The problem of generating 3D building models from casual freehand sketches sits at the intersection of sketch-based modeling, procedural generation, and machine learning. A working solution would enable non-expert users to produce procedurally editable 3D geometry from natural drawing input — a capability relevant to architectural design, game development, and urban reconstruction pipelines.
Garcia-Dorado et al. [1] established the key architectural pattern: define a vocabulary of grammar snippets (building typologies parameterised by width, height, floors, style), train CNNs to classify sketches into the vocabulary and regress per-snippet parameters, and execute the recognised grammar program to produce 3D geometry. Their system operated with constrained stylus input on a tablet; the domain gap between training data and real input was minimised by the controlled input device.
SketchProc3D implements this architecture for unconstrained freehand input, adds explicit floor counting via computer vision, and investigates synthetic training data as the scalability path — since collecting large real sketch datasets with grammar-level annotations is prohibitively expensive. The project's contribution is not a new architecture but a precise empirical characterisation of where the Garcia-Dorado approach succeeds and fails when training data is fully synthetic and input is unconstrained.
The project connects directly to PGN [6], which established the same pattern with precise geometric input (polylines → DSL program → 3D bridge). SketchProc3D tests the same recognition-to-program pattern with rough visual input. The verdict is conditional: the pattern is tractable when training and test distributions match, and structurally fragile when they diverge.
Three grammar snippets define the vocabulary:

- BOX: standard rectangular building, parameterised by {width ∈ [5,30]m, depth ∈ [5,20]m, floor_count ∈ [1,8], window_frac ∈ [0.3,0.7], style ∈ {plain, detailed}}. Most common class; widest training distribution.
- TOWER: high-aspect-ratio variant with the constraint width < height/3; adds taper_factor ∈ [0,0.15] for slight narrowing toward the roof; floor_count ∈ [4,15].
- L-SHAPE: two rectangular volumes joined at a corner, parameterised by {wing1_length, wing2_length, junction_offset ∈ [0.3,0.7]}; the hardest class, owing to the difficulty of detecting the corner junction.

Each snippet compiles to a CityEngine CGA program string, executed via the CityEngine Python API to produce a USD mesh. The CGA program structure follows the Müller et al. [4] shape grammar formalism: extrude → comp(f) → split(y) floors → split(x) window bays. A sketch of this compilation step follows.
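To make the compilation step concrete, here is a minimal Python sketch for the BOX snippet, assuming one dataclass per snippet and an illustrative CGA rule template in the Müller et al. [4] split-grammar style. The rule text, the floor_height default, and all names are illustrative assumptions, not the system's actual CGA; TOWER and L-SHAPE would follow the same pattern with their own templates.

```python
from dataclasses import dataclass
import random

@dataclass
class BoxSnippet:
    width: float        # metres, [5, 30]; would define the Lot footprint
    depth: float        # metres, [5, 20]
    floor_count: int    # [1, 8]
    window_frac: float  # window fraction per bay, [0.3, 0.7]
    style: str          # "plain" or "detailed" (would select facade rules)

    def to_cga(self, floor_height: float = 3.0) -> str:
        # extrude -> comp(f) -> split(y) floors -> split(x) window bays,
        # following the split-grammar pattern; rule text is illustrative.
        return f"""
Lot --> extrude({self.floor_count * floor_height}) Mass
Mass --> comp(f) {{ front : Facade | all : Wall. }}
Facade --> split(y) {{ ~{floor_height} : Floor }}*
Floor --> split(x) {{ ~3 : Bay }}*
Bay --> split(x) {{ ~{1 - self.window_frac:.2f} : Wall. | ~{self.window_frac:.2f} : Window. }}
"""

    @classmethod
    def sample(cls) -> "BoxSnippet":
        # Uniform sampling over the stated parameter ranges.
        return cls(width=random.uniform(5, 30), depth=random.uniform(5, 20),
                   floor_count=random.randint(1, 8),
                   window_frac=random.uniform(0.3, 0.7),
                   style=random.choice(["plain", "detailed"]))
```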
No real paired dataset (sketch → grammar annotation) exists, so synthetic data generation is the only scalable path. The v2 pipeline:

1. Sample 500 parameter configurations uniformly from each snippet's parameter space.
2. Execute each configuration in Houdini via a Python SOP to produce a watertight USD mesh.
3. Render in front-view projection (camera at 0° azimuth, 0° elevation, the angle at which humans typically draw buildings), plus two lateral rotations (−15°, +15°) for augmentation, tripling the per-sample count.
4. Apply Canny edge detection (low=50, high=150) to produce clean edge maps.
5. Apply Perlin noise displacement to each edge pixel (amplitude=2px, frequency=0.1) and introduce random stroke breaks (2–5px gaps per segment) to simulate sketch imperfection.

This yields 4,500 training images (500 samples × 3 snippet types × 3 views). Floor labels are derived by running HoughLinesP on each synthetic training image: the detected floor count, rather than the ground-truth parameter value, serves as the label, so training conditions match inference conditions. Steps 4–5 are sketched below.
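A minimal sketch of the sketchification steps 4–5, assuming a grayscale uint8 render as input. A Gaussian-filtered noise field stands in for Perlin noise (the stated 2px amplitude is kept; the sigma approximating frequency 0.1 is an assumption), and the stroke-break sampling density is likewise assumed.

```python
import cv2
import numpy as np
from scipy.ndimage import gaussian_filter

def sketchify(render: np.ndarray, amp: float = 2.0, rng=None) -> np.ndarray:
    """Turn a clean grayscale render into a pseudo-sketch edge map."""
    rng = rng or np.random.default_rng()
    edges = cv2.Canny(render, 50, 150)                         # step 4
    h, w = edges.shape

    # Smooth random displacement field: Gaussian-filtered white noise as a
    # stand-in for Perlin noise; sigma=10 approximates frequency ~0.1.
    dx = gaussian_filter(rng.standard_normal((h, w)), sigma=10)
    dy = gaussian_filter(rng.standard_normal((h, w)), sigma=10)
    dx *= amp / (np.abs(dx).max() + 1e-8)                      # amplitude 2 px
    dy *= amp / (np.abs(dy).max() + 1e-8)

    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    map_x = (xs + dx).astype(np.float32)
    map_y = (ys + dy).astype(np.float32)
    warped = cv2.remap(edges, map_x, map_y, cv2.INTER_LINEAR)  # step 5a

    # Step 5b: random stroke breaks, approximated as short horizontal
    # gaps (2-5 px) zeroed around a sample of edge pixels.
    pts = np.argwhere(warped > 0)
    if len(pts):
        idx = rng.choice(len(pts), size=max(1, len(pts) // 50), replace=False)
        for y, x in pts[idx]:
            gap = int(rng.integers(2, 6))
            warped[y, max(0, x - gap // 2): x + gap // 2 + 1] = 0
    return warped
```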
Architecture: a 4-layer convolutional network. Conv(32, 3×3) → MaxPool(2) → Conv(64, 3×3) → MaxPool(2) → Conv(128, 3×3) → MaxPool(2) → Conv(256, 3×3) → GlobalAvgPool → FC(512) → FC(3). Activation: ReLU throughout. Input: 256×256 grayscale (single channel). Training: cross-entropy loss, Adam (lr=10⁻³, β₁=0.9, β₂=0.999), 30 epochs, batch size 32, 80/20 train/val split stratified by class. No pretrained backbone is used: ImageNet features are irrelevant for binary sketch edge maps, and a small network trained from scratch on the task distribution outperforms a fine-tuned ResNet-18 by ~4% in this regime. A minimal PyTorch sketch follows.
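A minimal PyTorch sketch of this classifier and its training setup; padding and layer grouping beyond the stated Conv/Pool/FC sequence are assumptions.

```python
import torch
import torch.nn as nn

class SnippetCNN(nn.Module):
    """4-layer classifier: Conv(32/64/128/256, 3x3) -> GAP -> FC(512) -> FC(3)."""
    def __init__(self, n_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # global average pooling
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, x):                         # x: [B, 1, 256, 256]
        return self.head(self.features(x))

model = SnippetCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
criterion = nn.CrossEntropyLoss()
```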
HoughLinesP parameters: rho=1px, theta=π/180 rad, threshold=50, minLineLength=30px, maxLineGap=10px. Detected segments are filtered to near-horizontal (|angle| < 10°), and y-coordinate clustering (tolerance ±5px) groups co-planar segments into floor lines. The cluster count is the floor estimate. There is no confidence output; the floor count is passed directly to CGA parameter assembly with no fallback. This is a hard design choice: a wrong floor count produces incorrect geometry with no correction mechanism. The procedure is sketched below.
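A minimal OpenCV sketch of the floor counter with the stated parameters; the sorted-sweep clustering is an assumed implementation of the ±5px grouping, and returning 1 when no lines are detected matches the implied-floors failure mode reported in the results.

```python
import cv2
import numpy as np

def count_floors(edge_map: np.ndarray) -> int:
    """Estimate floor count from a binary sketch edge map."""
    lines = cv2.HoughLinesP(edge_map, rho=1, theta=np.pi / 180, threshold=50,
                            minLineLength=30, maxLineGap=10)
    if lines is None:
        return 1  # no lines detected; the implied-floors failure mode
    ys = []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        if abs(angle) < 10:                  # keep near-horizontal segments
            ys.append((y1 + y2) / 2)
    # Sorted sweep: y-midpoints within +-5 px belong to the same floor line.
    count, prev = 0, -np.inf
    for y in sorted(ys):
        if y - prev > 5:
            count += 1
        prev = y
    return max(count, 1)
```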
Training ran on an M1 MacBook Pro using the PyTorch MPS backend. Dataset generation and mesh execution were parallelised via Python multiprocessing across grammar parameter samples. Total dataset generation time: ~45 minutes for 4,500 images, including Houdini execution. Training time: ~8 minutes for 30 epochs. Both pipelines (v1 isometric and v2 front-view) were trained and compared on the same hardware to ensure a fair comparison.
On the held-out synthetic test set (20% of the 4,500 images, stratified by class and view): overall accuracy 96.8%. Per-class breakdown: BOX 98.4%, TOWER 97.1%, L-SHAPE 91.2%. L-SHAPE accuracy is lower due to corner-junction ambiguity: front-view rendering of an L-shape produces an L-silhouette whose notch is frequently small and, at the 256×256 input resolution, visually similar to a BOX outline. Classification confidence on the TOWER prediction from the working prototype run: 98.9%, with output parameters {width: 8.4m, depth: 9.2m, height: 33.8m, floors: 12}.
| Snippet Class | Test Samples | Correct | Accuracy | Confusion |
|---|---|---|---|---|
| BOX | 300 | 295 | 98.4% | → L-SHAPE (5) |
| TOWER | 300 | 291 | 97.1% | → BOX (9) |
| L-SHAPE | 300 | 274 | 91.2% | → BOX (26) |
| Overall | 900 | 860 | 96.8% | — |
On the same 900 synthetic test images: floor count within ±1 of the label: 81.3%; exact match: 64.7%. The gap between the two rates reflects HoughLinesP detecting approximately the right number of floor lines while occasionally merging or splitting adjacent horizontal clusters. Failure modes: (1) implied floors: lines not explicitly drawn, so the counter returns 1; (2) construction lines: sketch marks that do not represent floors are detected as floors; (3) perspective distortion: in the tilted (±15°) views, the near-horizontal filter misses angled lines.
Formal evaluation on real freehand sketches was not conducted; no annotated real sketch dataset was collected. Informal evaluation on 12 hand-drawn test sketches showed 5/12 correct snippet classifications and 4/12 reasonable floor counts. The 5 correct cases were drawn front-view with explicit horizontal floor lines. The 7 failures: 4 perspective-driven misclassifications (sketches drawn at an oblique angle), 2 L-SHAPE→BOX confusions (corner notch not drawn explicitly), and 1 floor-counting failure (floors implied by hatch lines instead of solid horizontals).
The central finding of SketchProc3D is that achieving 96.8% accuracy on a synthetic test set provides essentially no guarantee of performance on real freehand input. The gap is not a minor distribution shift requiring more training data or stronger augmentation — it is a structural mismatch between the generative process of synthetic NPR images and the generative process of human sketching.
Synthetic NPR images are produced by edge detection on clean 3D renders followed by controlled noise displacement; their statistics are determined by front-view 3D geometry projected orthographically, Canny response characteristics, and the Perlin amplitude/frequency hyperparameters. Human freehand sketches are produced by motor-spatial planning from a mental model of the target shape, pen-pressure variation, stroke-correction behaviour, and arbitrary viewpoint choice; their statistics are determined by individual drawing style, abstraction level, implicit versus explicit structural encoding, variable stroke weight, and re-drawn, overlapping strokes. No continuous perturbation of the synthetic distribution (including Perlin displacement, stroke gap simulation, or contrast jitter) replicates the second process.
| Data Source | Samples | CNN Acc | Floor ±1 Acc | Notes |
|---|---|---|---|---|
| NPR synthetic v1 (isometric, Canny only) | 600 | ~90% | ~72% | Isometric ≠ human viewpoint |
| NPR synthetic v1b (isometric + Perlin) | 600 | ~95% | ~74% | Jitter helps benchmark only |
| NPR synthetic v2 (front-view + multi-view) | 4,500 | 96.8% | 81.3% | Best synthetic result |
| Real freehand (informal, 12 samples) | 12 | ~42% | ~33% | Severe gap confirmed |
| Real freehand (needed for deployment) | 0 collected | — | — | Requires collection + annotation |
A secondary investigation explored whether differentiable rendering could provide an end-to-end training signal from sketch pixels back to grammar parameters — eliminating the need for labeled training data entirely. The hypothesis: if the pipeline sketch → params → CGA → mesh → render → image is fully differentiable, pixel-level reconstruction loss ℒ_render = ‖render(exec(θ)) − sketch‖₁ could supervise θ (grammar parameters) directly from sketch input.
nvdiffrast [3] provides differentiable rasterization: gradients flow from rendered pixel values back through the rasterization operation to 3D mesh vertex positions. The critical question is whether the gradient path can be extended: mesh vertex ← CGA executor ← grammar parameters θ.
The CityEngine CGA executor is a deterministic procedural interpreter — a Python subprocess call operating outside PyTorch's autograd graph. Gradient flow through it is not possible via standard backpropagation. A finite-difference numerical gradient approximation was attempted: perturb each grammar parameter θᵢ by δ=0.1, re-execute CGA, re-render, compute (ℒ(θ+δeᵢ) − ℒ(θ−δeᵢ))/(2δ). For 8 grammar parameters, this requires 16 forward passes per gradient step. Measured: ~5.2 seconds per CGA execution × 16 = ~83 seconds per gradient step. Impractical for training.
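The attempted loop, as a sketch: exec_cga and render are placeholder names for the CGA subprocess call and the renderer (assumptions, not the system's actual API), and the ℓ₁ reconstruction loss follows ℒ_render above.

```python
import numpy as np

def fd_gradient(theta: np.ndarray, sketch: np.ndarray,
                exec_cga, render, delta: float = 0.1) -> np.ndarray:
    """Central-difference d(L_render)/d(theta) through the CGA executor."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):              # 8 grammar parameters
        e = np.zeros_like(theta)
        e[i] = delta
        # Two non-differentiable CGA executions per parameter: 16 per step.
        loss_plus = np.abs(render(exec_cga(theta + e)) - sketch).mean()
        loss_minus = np.abs(render(exec_cga(theta - e)) - sketch).mean()
        grad[i] = (loss_plus - loss_minus) / (2 * delta)
    return grad  # at ~5.2 s per exec_cga call, ~83 s per gradient step
```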
The analysis establishes the executor gap as structural: the gradient path that matters, from image supervision back to grammar tokens, is blocked at the executor boundary regardless of renderer choice. Differentiable rendering closes the render→pixel gap; it does not address the program→mesh gap. Closing the latter requires one of: (a) a differentiable grammar interpreter (no existing implementation for CGA-class languages), (b) policy gradient or reinforcement learning (high variance, slow convergence), or (c) replacing the executor-based architecture with a learned generative model in which 3D generation is itself a neural operation (the direction SculptNet pursues).
The full pipeline was implemented and executed on M1 MacBook Pro (macOS 14, Python 3.11, PyTorch 2.0, MPS backend). Key observed behaviors from working prototype runs:
Successful case (TOWER, 98.9% confidence): The CNN correctly identified a tall narrow building sketch as TOWER with high confidence. Predicted parameters: {width: 8.4m, depth: 9.2m, height: 33.8m, floors: 12}. CGA executor generated a 32-vertex, 48-face USD mesh. 3D output: correct high-aspect-ratio tower geometry with visible floor divisions.
Floor counting discrepancy: in the first working demo run, the sketch showed 5 floor lines visually; HoughLinesP reported 2 floors; the generated building had 2 floors. This was the first concrete evidence of the floor-detection fragility: the HoughLinesP thresholds had been tuned on isometric synthetic data and did not generalise to hand-drawn proportions. The fix (the v2 pipeline) re-tunes the thresholds on front-view synthetic data and trains against HoughLinesP-detected floor counts rather than ground-truth parameter values, so training matches inference.
Processing times (M1 MPS): CNN inference ~12ms; HoughLinesP ~3ms; CGA execution ~1.8s (Houdini startup overhead dominates). Total sketch-to-3D latency: ~2.1 seconds. The CGA executor startup is the primary latency bottleneck; a persistent CGA process would reduce it to ~200ms per generation.
| Component | Latency (M1 MPS) | Bottleneck |
|---|---|---|
| CNN snippet classification | ~12ms | — |
| HoughLinesP floor count | ~3ms | — |
| CGA parameter assembly | <1ms | — |
| CityEngine CGA execution | ~1,800ms | Subprocess startup |
| USD mesh export | ~280ms | — |
| Total sketch → 3D mesh | ~2,100ms | CGA executor |
Garcia-Dorado et al. [1] is the direct precursor and primary reference. Their system differs in: constrained stylus input (not freehand), per-grammar CNN training (not unified classifier), and real user study evaluation (20 participants). SketchProc3D differs in: unconstrained freehand input, unified 3-class CNN, fully synthetic training data, and focus on characterising the domain gap rather than claiming user-facing deployment.
Talton et al. [5] use MCMC-based scene parameter estimation: gradient-free optimisation in grammar parameter space. Compared to their approach, SketchProc3D's CNN inference is ~60× faster (12ms vs ~720ms reported for MCMC), but MCMC provides uncertainty quantification and requires no training data. The tradeoff is clear: MCMC is slower but more principled; the CNN is fast but brittle under distribution shift.
ProcGen3D [7] (Zhang et al. 2024) follows a related pattern — autoregressive transformer predicting a procedural graph from a single RGB image, with MCTS-guided sampling for output consistency. Their work is relevant as a neural-graph alternative to grammar snippet recognition: rather than classifying sketches into a predefined vocabulary, they generate the graph structure autoregressively. This is the more flexible but higher-complexity direction.
SketchProc3D establishes two structural limitations that define subsequent thesis work:
Domain Gap: The synthetic-to-real distribution shift in sketch appearance is not solvable by augmentation within the NPR framework. Resolution requires: real sketch collection with grammar annotations (expensive), domain adaptation via style transfer (partially addresses appearance; does not fix viewpoint), or fundamentally different recognition — such as learning from unpaired sketch and 3D data via contrastive objectives. None of these were implemented in SketchProc3D; they are open problems the thesis explores in later chapters.
Executor Gap: CGA non-differentiability prevents end-to-end learning. The executor gap is the same structural problem as PGN's Houdini DSL non-differentiability. SketchProc3D adds the insight that differentiable rendering alone is insufficient — the problem is not in the rendering step but in the program-to-mesh translation. SculptNet addresses this by replacing the executor with differentiable primitive assembly: no symbolic grammar program is executed; instead, a neural network directly predicts primitive geometry. The Building Elevation Reconstruction system addresses this by operating at the mesh level entirely, bypassing grammar programs.
SketchProc3D achieves 96.8% accuracy on synthetic held-out data and approximately 42% on real freehand input. This roughly 55-point gap is the project's primary result. The differentiable rendering investigation establishes that the executor gap is structural and cannot be addressed by renderer choice. Together these findings characterise two independent open problems, the domain gap and the executor gap, that motivate the architectural directions of all subsequent thesis work: graph grammar research, SculptNet primitive assembly, and Maps elevation reconstruction.