The problem of generating 3D building models from casual freehand sketches sits at the intersection of sketch-based modeling, procedural generation, and machine learning. A working solution would enable non-expert users to produce procedurally editable 3D geometry from natural drawing input — a capability relevant to architectural design, game development, and urban reconstruction pipelines.
Garcia-Dorado et al. [1] established the key architectural pattern: define a vocabulary of grammar snippets (building typologies parameterised by width, height, floors, style), train CNNs to classify sketches into the vocabulary and regress per-snippet parameters, and execute the recognised grammar program to produce 3D geometry. Their system operated with constrained stylus input on a tablet; the domain gap between training data and real input was minimised by the controlled input device.
SketchProc3D implements this architecture for unconstrained freehand input, adds explicit floor counting via computer vision, and investigates synthetic training data as the scalability path — since collecting large real sketch datasets with grammar-level annotations is prohibitively expensive. The project's contribution is not a new architecture but a precise empirical characterisation of where the Garcia-Dorado approach succeeds and fails when training data is fully synthetic and input is unconstrained.
The project connects directly to PGN [6], which established the same pattern with precise geometric input (polylines → DSL program → 3D bridge). SketchProc3D tests the same recognition-to-program pattern with rough visual input. The verdict is conditional: the pattern is tractable when training and test distributions match, and structurally fragile when they diverge.
Three grammar snippets define the vocabulary:

- BOX: standard rectangular building, parameterised by {width ∈ [5,30]m, depth ∈ [5,20]m, floor_count ∈ [1,8], window_frac ∈ [0.3,0.7], style ∈ {plain, detailed}}. Most common class; widest training distribution.
- TOWER: high-aspect-ratio variant with the constraint width < height/3; adds taper_factor ∈ [0,0.15] for slight narrowing toward the roof; floor_count ∈ [4,15].
- L-SHAPE: two rectangular volumes joined at a corner, parameterised by {wing1_length, wing2_length, junction_offset ∈ [0.3,0.7]}; the hardest class, owing to the difficulty of detecting the corner junction.

Each snippet compiles to a CityEngine CGA program string, executed via the CityEngine Python API to produce a USD mesh. The CGA program structure follows the Müller et al. [4] shape grammar formalism: extrude → comp(f) → split(y) floors → split(x) window bays. A sketch of this compilation step follows.
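To make the compilation step concrete, here is a minimal Python sketch for the BOX snippet, assuming one dataclass per snippet and an illustrative CGA rule template in the Müller et al. [4] split-grammar style. The rule text, the floor_height default, and all names are illustrative assumptions, not the system's actual CGA; TOWER and L-SHAPE would follow the same pattern with their own templates.

```python
from dataclasses import dataclass
import random

@dataclass
class BoxSnippet:
    width: float        # metres, [5, 30]; would define the Lot footprint
    depth: float        # metres, [5, 20]
    floor_count: int    # [1, 8]
    window_frac: float  # window fraction per bay, [0.3, 0.7]
    style: str          # "plain" or "detailed" (would select facade rules)

    def to_cga(self, floor_height: float = 3.0) -> str:
        # extrude -> comp(f) -> split(y) floors -> split(x) window bays,
        # following the split-grammar pattern; rule text is illustrative.
        return f"""
Lot --> extrude({self.floor_count * floor_height}) Mass
Mass --> comp(f) {{ front : Facade | all : Wall. }}
Facade --> split(y) {{ ~{floor_height} : Floor }}*
Floor --> split(x) {{ ~3 : Bay }}*
Bay --> split(x) {{ ~{1 - self.window_frac:.2f} : Wall. | ~{self.window_frac:.2f} : Window. }}
"""

    @classmethod
    def sample(cls) -> "BoxSnippet":
        # Uniform sampling over the stated parameter ranges.
        return cls(width=random.uniform(5, 30), depth=random.uniform(5, 20),
                   floor_count=random.randint(1, 8),
                   window_frac=random.uniform(0.3, 0.7),
                   style=random.choice(["plain", "detailed"]))
```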
No real paired dataset (sketch → grammar annotation) exists, so synthetic data generation is the only scalable path. The v2 pipeline:

1. Sample 500 parameter configurations uniformly from each snippet's parameter space.
2. Execute each configuration in Houdini via a Python SOP to produce a watertight USD mesh.
3. Render in front-view projection (camera at 0° azimuth, 0° elevation, the angle at which humans typically draw buildings), plus two lateral rotations (−15°, +15°) for augmentation, tripling the per-sample count.
4. Apply Canny edge detection (low=50, high=150) to produce clean edge maps.
5. Apply Perlin noise displacement to each edge pixel (amplitude=2px, frequency=0.1) and introduce random stroke breaks (2–5px gaps per segment) to simulate sketch imperfection.

This yields 4,500 training images (500 samples × 3 snippet types × 3 views). Floor labels are derived by running HoughLinesP on each synthetic training image: the detected floor count, rather than the ground-truth parameter value, serves as the label, so training conditions match inference conditions. Steps 4–5 are sketched below.
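A minimal sketch of the sketchification steps 4–5, assuming a grayscale uint8 render as input. A Gaussian-filtered noise field stands in for Perlin noise (the stated 2px amplitude is kept; the sigma approximating frequency 0.1 is an assumption), and the stroke-break sampling density is likewise assumed.

```python
import cv2
import numpy as np
from scipy.ndimage import gaussian_filter

def sketchify(render: np.ndarray, amp: float = 2.0, rng=None) -> np.ndarray:
    """Turn a clean grayscale render into a pseudo-sketch edge map."""
    rng = rng or np.random.default_rng()
    edges = cv2.Canny(render, 50, 150)                         # step 4
    h, w = edges.shape

    # Smooth random displacement field: Gaussian-filtered white noise as a
    # stand-in for Perlin noise; sigma=10 approximates frequency ~0.1.
    dx = gaussian_filter(rng.standard_normal((h, w)), sigma=10)
    dy = gaussian_filter(rng.standard_normal((h, w)), sigma=10)
    dx *= amp / (np.abs(dx).max() + 1e-8)                      # amplitude 2 px
    dy *= amp / (np.abs(dy).max() + 1e-8)

    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    map_x = (xs + dx).astype(np.float32)
    map_y = (ys + dy).astype(np.float32)
    warped = cv2.remap(edges, map_x, map_y, cv2.INTER_LINEAR)  # step 5a

    # Step 5b: random stroke breaks, approximated as short horizontal
    # gaps (2-5 px) zeroed around a sample of edge pixels.
    pts = np.argwhere(warped > 0)
    if len(pts):
        idx = rng.choice(len(pts), size=max(1, len(pts) // 50), replace=False)
        for y, x in pts[idx]:
            gap = int(rng.integers(2, 6))
            warped[y, max(0, x - gap // 2): x + gap // 2 + 1] = 0
    return warped
```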
Architecture: a 4-layer convolutional network. Conv(32, 3×3) → MaxPool(2) → Conv(64, 3×3) → MaxPool(2) → Conv(128, 3×3) → MaxPool(2) → Conv(256, 3×3) → GlobalAvgPool → FC(512) → FC(3). Activation: ReLU throughout. Input: 256×256 grayscale (single channel). Training: cross-entropy loss, Adam (lr=10⁻³, β₁=0.9, β₂=0.999), 30 epochs, batch size 32, 80/20 train/val split stratified by class. No pretrained backbone is used: ImageNet features are irrelevant for binary sketch edge maps, and a small network trained from scratch on the task distribution outperforms a fine-tuned ResNet-18 by ~4% in this regime. A minimal PyTorch sketch follows.
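A minimal PyTorch sketch of this classifier and its training setup; padding and layer grouping beyond the stated Conv/Pool/FC sequence are assumptions.

```python
import torch
import torch.nn as nn

class SnippetCNN(nn.Module):
    """4-layer classifier: Conv(32/64/128/256, 3x3) -> GAP -> FC(512) -> FC(3)."""
    def __init__(self, n_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # global average pooling
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, x):                         # x: [B, 1, 256, 256]
        return self.head(self.features(x))

model = SnippetCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
criterion = nn.CrossEntropyLoss()
```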
HoughLinesP parameters: rho=1px, theta=π/180 rad, threshold=50, minLineLength=30px, maxLineGap=10px. Detected segments are filtered to near-horizontal (|angle| < 10°), and y-coordinate clustering (tolerance ±5px) groups co-planar segments into floor lines. The cluster count is the floor estimate. There is no confidence output; the floor count is passed directly to CGA parameter assembly with no fallback. This is a hard design choice: a wrong floor count produces incorrect geometry with no correction mechanism. The procedure is sketched below.
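A minimal OpenCV sketch of the floor counter with the stated parameters; the sorted-sweep clustering is an assumed implementation of the ±5px grouping, and returning 1 when no lines are detected matches the implied-floors failure mode reported in the results.

```python
import cv2
import numpy as np

def count_floors(edge_map: np.ndarray) -> int:
    """Estimate floor count from a binary sketch edge map."""
    lines = cv2.HoughLinesP(edge_map, rho=1, theta=np.pi / 180, threshold=50,
                            minLineLength=30, maxLineGap=10)
    if lines is None:
        return 1  # no lines detected; the implied-floors failure mode
    ys = []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        if abs(angle) < 10:                  # keep near-horizontal segments
            ys.append((y1 + y2) / 2)
    # Sorted sweep: y-midpoints within +-5 px belong to the same floor line.
    count, prev = 0, -np.inf
    for y in sorted(ys):
        if y - prev > 5:
            count += 1
        prev = y
    return max(count, 1)
```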
Training ran on an M1 MacBook Pro using the PyTorch MPS backend. Dataset generation and mesh execution were parallelised via Python multiprocessing across grammar parameter samples. Total dataset generation time: ~45 minutes for 4,500 images, including Houdini execution. Training time: ~8 minutes for 30 epochs. Both pipelines (v1 isometric and v2 front-view) were trained and compared on the same hardware to ensure a fair comparison.
On the held-out synthetic test set (20% of the 4,500 images, stratified by class and view): overall accuracy 96.8%. Per-class breakdown: BOX 98.4%, TOWER 97.1%, L-SHAPE 91.2%. L-SHAPE accuracy is lower due to corner-junction ambiguity: front-view rendering of an L-shape produces an L-silhouette whose notch is frequently small and, at the 256×256 input resolution, visually similar to a BOX outline. Classification confidence on the TOWER prediction from the working prototype run: 98.9%, with output parameters {width: 8.4m, depth: 9.2m, height: 33.8m, floors: 12}.
| Snippet Class | Test Samples | Correct | Accuracy | Confusion |
|---|---|---|---|---|
| BOX | 300 | 295 | 98.4% | → L-SHAPE (5) |
| TOWER | 300 | 291 | 97.1% | → BOX (9) |
| L-SHAPE | 300 | 274 | 91.2% | → BOX (26) |
| Overall | 900 | 860 | 96.8% | — |
On the same 900 synthetic test images: floor count within ±1 of the label: 81.3%; exact match: 64.7%. The gap between the two rates reflects HoughLinesP detecting approximately the right number of floor lines while occasionally merging or splitting adjacent horizontal clusters. Failure modes: (1) implied floors: lines not explicitly drawn, so the counter returns 1; (2) construction lines: sketch marks that do not represent floors are detected as floors; (3) perspective distortion: in the tilted (±15°) views, the near-horizontal filter misses angled lines.
Formal evaluation on real freehand sketches was not conducted; no annotated real sketch dataset was collected. Informal evaluation on 12 hand-drawn test sketches showed 5/12 correct snippet classifications and 4/12 reasonable floor counts. The 5 correct cases were drawn front-view with explicit horizontal floor lines. The 7 failures: 4 perspective-driven misclassifications (sketches drawn at an oblique angle), 2 L-SHAPE→BOX confusions (corner notch not drawn explicitly), and 1 floor-counting failure (floors implied by hatch lines instead of solid horizontals).
The central finding of SketchProc3D is that achieving 96.8% accuracy on a synthetic test set provides essentially no guarantee of performance on real freehand input. The gap is not a minor distribution shift requiring more training data or stronger augmentation — it is a structural mismatch between the generative process of synthetic NPR images and the generative process of human sketching.
Synthetic NPR images are produced by edge detection on clean 3D renders followed by controlled noise displacement; their statistics are determined by front-view 3D geometry projected orthographically, Canny response characteristics, and the Perlin amplitude/frequency hyperparameters. Human freehand sketches are produced by motor-spatial planning from a mental model of the target shape, pen-pressure variation, stroke-correction behaviour, and arbitrary viewpoint choice; their statistics are determined by individual drawing style, abstraction level, implicit versus explicit structural encoding, variable stroke weight, and re-drawn, overlapping strokes. No continuous perturbation of the synthetic distribution (including Perlin displacement, stroke gap simulation, or contrast jitter) replicates the second process.
| Data Source | Samples | CNN Acc | Floor ±1 Acc | Notes |
|---|---|---|---|---|
| NPR synthetic v1 (isometric, Canny only) | 600 | ~90% | ~72% | Isometric ≠ human viewpoint |
| NPR synthetic v1b (isometric + Perlin) | 600 | ~95% | ~74% | Jitter helps benchmark only |
| NPR synthetic v2 (front-view + multi-view) | 4,500 | 96.8% | 81.3% | Best synthetic result |
| Real freehand (informal, 12 samples) | 12 | ~42% | ~33% | Severe gap confirmed |
| Real freehand (needed for deployment) | 0 collected | — | — | Requires collection + annotation |
A secondary investigation explored whether differentiable rendering could provide an end-to-end training signal from sketch pixels back to grammar parameters — eliminating the need for labeled training data entirely. The hypothesis: if the pipeline sketch → params → CGA → mesh → render → image is fully differentiable, pixel-level reconstruction loss ℒ_render = ‖render(exec(θ)) − sketch‖₁ could supervise θ (grammar parameters) directly from sketch input.
nvdiffrast [3] provides differentiable rasterization: gradients flow from rendered pixel values back through the rasterization operation to 3D mesh vertex positions. The critical question is whether the gradient path can be extended: mesh vertex ← CGA executor ← grammar parameters θ.
The CityEngine CGA executor is a deterministic procedural interpreter — a Python subprocess call operating outside PyTorch's autograd graph. Gradient flow through it is not possible via standard backpropagation. A finite-difference numerical gradient approximation was attempted: perturb each grammar parameter θᵢ by δ=0.1, re-execute CGA, re-render, compute (ℒ(θ+δeᵢ) − ℒ(θ−δeᵢ))/(2δ). For 8 grammar parameters, this requires 16 forward passes per gradient step. Measured: ~5.2 seconds per CGA execution × 16 = ~83 seconds per gradient step. Impractical for training.
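The attempted loop, as a sketch: exec_cga and render are placeholder names for the CGA subprocess call and the renderer (assumptions, not the system's actual API), and the ℓ₁ reconstruction loss follows ℒ_render above.

```python
import numpy as np

def fd_gradient(theta: np.ndarray, sketch: np.ndarray,
                exec_cga, render, delta: float = 0.1) -> np.ndarray:
    """Central-difference d(L_render)/d(theta) through the CGA executor."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):              # 8 grammar parameters
        e = np.zeros_like(theta)
        e[i] = delta
        # Two non-differentiable CGA executions per parameter: 16 per step.
        loss_plus = np.abs(render(exec_cga(theta + e)) - sketch).mean()
        loss_minus = np.abs(render(exec_cga(theta - e)) - sketch).mean()
        grad[i] = (loss_plus - loss_minus) / (2 * delta)
    return grad  # at ~5.2 s per exec_cga call, ~83 s per gradient step
```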
The analysis establishes the executor gap as structural: the gradient path that matters, from image supervision back to grammar tokens, is blocked at the executor boundary regardless of renderer choice. Differentiable rendering closes the render→pixel gap; it does not address the program→mesh gap. Closing the latter requires one of: (a) a differentiable grammar interpreter (no existing implementation for CGA-class languages), (b) policy gradient or reinforcement learning (high variance, slow convergence), or (c) replacing the executor-based architecture with a learned generative model in which 3D generation is itself a neural operation (the direction SculptNet pursues).
The full pipeline was implemented and executed on M1 MacBook Pro (macOS 14, Python 3.11, PyTorch 2.0, MPS backend). Key observed behaviors from working prototype runs:
Successful case (TOWER, 98.9% confidence): The CNN correctly identified a tall narrow building sketch as TOWER with high confidence. Predicted parameters: {width: 8.4m, depth: 9.2m, height: 33.8m, floors: 12}. CGA executor generated a 32-vertex, 48-face USD mesh. 3D output: correct high-aspect-ratio tower geometry with visible floor divisions.
Floor counting discrepancy: in the first working demo run, the sketch showed 5 floor lines visually; HoughLinesP reported 2 floors; the generated building had 2 floors. This was the first concrete evidence of the floor-detection fragility: the HoughLinesP thresholds had been tuned on isometric synthetic data and did not generalise to hand-drawn proportions. The fix (the v2 pipeline) re-tunes the thresholds on front-view synthetic data and trains against HoughLinesP-detected floor counts rather than ground-truth parameter values, so training matches inference.
Processing times (M1 MPS): CNN inference ~12ms; HoughLinesP ~3ms; CGA execution ~1.8s (Houdini startup overhead dominates). Total sketch-to-3D latency: ~2.1 seconds. The CGA executor startup is the primary latency bottleneck; a persistent CGA process would reduce it to ~200ms per generation.
| Component | Latency (M1 MPS) | Bottleneck |
|---|---|---|
| CNN snippet classification | ~12ms | — |
| HoughLinesP floor count | ~3ms | — |
| CGA parameter assembly | <1ms | — |
| CityEngine CGA execution | ~1,800ms | Subprocess startup |
| USD mesh export | ~280ms | — |
| Total sketch → 3D mesh | ~2,100ms | CGA executor |
Garcia-Dorado et al. [1] is the direct precursor and primary reference. Their system differs in: constrained stylus input (not freehand), per-grammar CNN training (not unified classifier), and real user study evaluation (20 participants). SketchProc3D differs in: unconstrained freehand input, unified 3-class CNN, fully synthetic training data, and focus on characterising the domain gap rather than claiming user-facing deployment.
Talton et al. [5] use MCMC-based scene parameter estimation: gradient-free optimisation in grammar parameter space. Compared to their approach, SketchProc3D's CNN inference is ~60× faster (12ms vs ~720ms reported for MCMC), but MCMC provides uncertainty quantification and requires no training data. The tradeoff is clear: MCMC is slower but more principled; the CNN is fast but brittle under distribution shift.
ProcGen3D [7] (Zhang et al. 2024) follows a related pattern — autoregressive transformer predicting a procedural graph from a single RGB image, with MCTS-guided sampling for output consistency. Their work is relevant as a neural-graph alternative to grammar snippet recognition: rather than classifying sketches into a predefined vocabulary, they generate the graph structure autoregressively. This is the more flexible but higher-complexity direction.
SketchProc3D establishes two structural limitations that define subsequent thesis work:
Domain Gap: The synthetic-to-real distribution shift in sketch appearance is not solvable by augmentation within the NPR framework. Resolution requires: real sketch collection with grammar annotations (expensive), domain adaptation via style transfer (partially addresses appearance; does not fix viewpoint), or fundamentally different recognition — such as learning from unpaired sketch and 3D data via contrastive objectives. None of these were implemented in SketchProc3D; they are open problems the thesis explores in later chapters.
Executor Gap: CGA non-differentiability prevents end-to-end learning. The executor gap is the same structural problem as PGN's Houdini DSL non-differentiability. SketchProc3D adds the insight that differentiable rendering alone is insufficient — the problem is not in the rendering step but in the program-to-mesh translation. SculptNet addresses this by replacing the executor with differentiable primitive assembly: no symbolic grammar program is executed; instead, a neural network directly predicts primitive geometry. The Building Elevation Reconstruction system addresses this by operating at the mesh level entirely, bypassing grammar programs.
SketchProc3D achieves 96.8% accuracy on synthetic held-out data and approximately 42% on real freehand input. This roughly 55-point gap is the project's primary result. The differentiable rendering investigation establishes that the executor gap is structural and cannot be addressed by renderer choice. Together these findings characterise two independent open problems, the domain gap and the executor gap, that motivate the architectural directions of all subsequent thesis work: graph grammar research, SculptNet primitive assembly, and Maps elevation reconstruction.