Aditya Jain / Apple Maps
Topic 40 · Mar 2026 · 3D Reconstruction · Generative Modeling

Building Elevation Reconstruction.

From a single street-view photograph to a watertight 3D mesh of a building's facade — by routing the image through a six-plane orthographic decomposition rather than direct image-to-3D inference. A flat photograph in, architecturally accurate geometry out.

00 — Motivation

A real-estate-grade metaverse from a single street-view photo.

The driving goal sits one level above this topic: an interactive 3D map of the real world where every building you walk past in Google Street View is a real, editable 3D asset — windows that recess, columns that protrude, cornices and balconies with actual depth profiles. Not a textured cardboard cutout, not a Genie-style frame-prediction model that hallucinates a video each time you turn the camera. Explicit, production-clean geometry that loads into Houdini, edits like any USD asset, and renders deterministically.

What the thesis explored before this topic: image-to-3D networks (TripoSR, Hunyuan3D, TRELLIS, Sparc3D) and multi-view generators (Zero-1-to-3, MVDream) all fail on architectural subjects. The first category collapses facade detail into texture because the loss is dominated by overall silhouette. The second category produces perspective multi-view renders, not orthographic, which breaks the constant-depth-derivative assumption every classical reconstruction stage relies on. The honest path forward is the detour: route the single photo through six orthographic elevations first, then reconstruct via the six-plane mesh pipeline (Topic 35) — which has been independently validated on synthetic primitives (Topic 36) — to produce a watertight USD mesh.

The reconstruction backend has been built. The open question — the intentional gap this topic flags — is the orthographic multi-view frontend. Once that's trained (estimated $3K–$100K compute and 2 weeks to 6 months depending on dataset scale), the user experience becomes "drop a street-view photo, see the editable 3D building mesh in 200 ms". Until then, this topic establishes the architectural decomposition and validates every stage downstream of view synthesis.

What this enables next
Combined with the upstream view-synthesis component, this is the MVP for the broader Apple Maps real-time 3D reconstruction goal: an editor for the world where buildings are first-class procedural assets rather than baked photogrammetric meshes. Downstream consumers include SculptNet (coarse-to-fine refinement) and, ultimately, the broader thesis target of SIGGRAPH 2026.
01 — Problem

Why direct image-to-3D fails on buildings.

Contemporary image-to-3D systems — TripoSR, Hunyuan3D, TRELLIS, Sparc3D — produce plausible 3D shapes for general object categories but consistently fail on architectural subjects. Windows are not actually recessed, columns do not protrude, moldings are painted onto a flat surface as albedo rather than carved into geometry. The output is a cardboard cutout with a building-shaped texture, not a building.

The failure mode is structural. Image-to-3D models trained on shape datasets (Objaverse, ShapeNet) optimise for category-level resemblance, not facade-level fidelity. The network's inductive bias has no reason to allocate representational capacity to sub-centimetre features when the loss is dominated by overall silhouette.

The reconstruction pipeline presented here takes a different route: it treats a building elevation as a composable stack of six orthographic depth maps, each independently extracted and triangulated, then stitched into a closed mesh. This decomposition forces the system to commit to per-face geometric detail before composition — there is no global silhouette to hide behind.

02 — Pipeline

From photo to mesh in five stages.

[Pipeline diagram] INPUT: single street-view photo (RGB, arbitrary view) → VIEW SYNTHESIS: six orthographic elevations (±X · ±Y · ±Z) → CONTOUR EXTRACTION: marching squares with corner-preserving smoothing → PER-VIEW DEPTH CLUSTERING: k-means planes, window/door/wall layers → TRIANGULATION: earcut with RDP simplification, per-cluster polygon fill → OUTPUT: watertight mesh, six faces stitched, USD export. Banner: FACADE-NATIVE PIPELINE — per-elevation triangulation forces detail before composition.
Figure 1 — The five processing stages. The decomposition into six orthographic elevations is the architectural commitment: by forcing the reconstruction to commit to depth-cluster geometry on each face before stitching, sub-centimetre architectural detail (window recesses, column protrusions, molding profiles) is preserved through to the final mesh.
03 — Method

Stage-by-stage breakdown.

The pipeline is intentionally modular: each stage operates on a well-defined intermediate representation, can be evaluated independently, and admits substitution if a better component appears later. The non-trivial engineering sits in the contour-to-mesh stages where pixel-quantised boundaries must be lifted to clean polygon meshes without introducing zigzag artefacts.

A — VIEW SYNTHESIS
Six orthographic elevations from one photo

Multi-view generation produces six orthographic projections — front, back, left, right, top, bottom — from a single street-view input. Orthographic (not perspective) is the key constraint: each pixel maps to a single world-space ray with constant depth derivative, which lets downstream contour extraction work in image space without unprojection. Resolution: 512×512 per view.
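The orthographic constraint can be made concrete in a few lines: each pixel owns a fixed world-space ray, so a depth value converts to a world point by a linear map, and neighbouring pixels sit a constant lateral step apart regardless of depth. A minimal sketch for the front (+Z) face; the function name and pixel pitch are illustrative, not pipeline values:

```python
# Under orthographic projection, pixel (u, v) of the front elevation maps
# to a fixed ray; depth slides the point along that ray linearly.
def ortho_unproject(u, v, depth, pixel_size=0.01):
    """World point for pixel (u, v) of the front (+Z) elevation (sketch)."""
    return (u * pixel_size, v * pixel_size, -depth)

# Horizontally adjacent pixels are exactly one pixel pitch apart in world
# space, independent of their depths -- the constant-depth-derivative
# property that lets contour extraction stay in image space.
p0 = ortho_unproject(10, 20, depth=3.0)
p1 = ortho_unproject(11, 20, depth=5.0)
assert abs((p1[0] - p0[0]) - 0.01) < 1e-12  # lateral step is constant
assert abs((p1[2] - p0[2]) + 2.0) < 1e-12   # depth maps linearly
```

Under a perspective camera the lateral step would scale with depth, which is exactly what breaks the image-space assumption downstream.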

B — CONTOUR EXTRACTION
Marching squares with corner preservation

Each elevation produces a depth map; the per-view silhouette is extracted by marching squares at the foreground/background threshold. Vanilla marching squares produces axis-aligned zigzag artefacts on diagonal building edges. We apply selective corner-preserving smoothing: a low-pass filter on the contour polyline that detects sharp turning points (angle change > 30°) and leaves those vertices untouched, smoothing only between corners.
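The selective smoothing amounts to one pass over the closed contour: measure the turning angle at each vertex, freeze vertices that turn more than the threshold, and blend the rest toward the midpoint of their neighbours. A minimal sketch of that idea; function and parameter names are illustrative, not the pipeline's API:

```python
import math

def smooth_preserving_corners(poly, angle_deg=30.0, alpha=0.5):
    """One pass of corner-preserving smoothing over a closed polyline.

    Vertices whose turning angle exceeds `angle_deg` are left untouched;
    the rest are blended toward the midpoint of their neighbours.
    """
    n = len(poly)
    out = list(poly)
    for i in range(n):
        px, py = poly[i - 1]          # previous vertex (wraps at i == 0)
        cx, cy = poly[i]
        nx, ny = poly[(i + 1) % n]    # next vertex (wraps at the end)
        a1 = math.atan2(cy - py, cx - px)
        a2 = math.atan2(ny - cy, nx - cx)
        # signed angle change, normalised into (-pi, pi]
        turn = abs(math.degrees((a2 - a1 + math.pi) % (2 * math.pi) - math.pi))
        if turn > angle_deg:
            continue                  # sharp corner: preserve exactly
        mx, my = (px + nx) / 2, (py + ny) / 2
        out[i] = (cx + alpha * (mx - cx), cy + alpha * (my - cy))
    return out
```

On a near-straight edge with pixel-quantised jitter the interior vertices relax toward the chord, while a 90° building corner passes through untouched.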

C — DEPTH CLUSTERING
k-means over per-pixel depth values

Within each elevation, the depth channel is clustered into a small number of roughly-parallel planes — typically 3–5 (background wall, window-recess plane, protruding column, balcony slab). Each cluster becomes its own polygon-with-holes in the contour plane, lifted to its mean depth. This is the step that gives the method its name: depth is treated as a categorical layer index rather than a continuous field.
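Because the depths are scalar, the clustering reduces to one-dimensional k-means. A self-contained sketch; the initialisation and iteration count are illustrative, not the pipeline's settings:

```python
def cluster_depths(depths, k=3, iters=20):
    """1-D k-means over per-pixel depth values (sketch of stage C).

    Returns (centers, labels); each label is the categorical layer index
    whose polygon gets lifted to the cluster's mean depth.
    """
    lo, hi = min(depths), max(depths)
    # spread initial centers evenly across the observed depth range
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(depths)
    for _ in range(iters):
        # assignment step: nearest center
        for i, d in enumerate(depths):
            labels[i] = min(range(k), key=lambda c: abs(d - centers[c]))
        # update step: mean of each cluster's members
        for c in range(k):
            members = [d for d, l in zip(depths, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels
```

With depths drawn from a wall plane, a window-recess plane, and a protruding column, the centers converge to the three plane offsets and the labels become the layer indices the triangulation stage consumes.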

D — TRIANGULATION + STITCH
earcut per cluster, then six-face stitching

Each depth-clustered polygon is simplified with Ramer–Douglas–Peucker (ε=0.5px), then triangulated with the earcut algorithm — robust for polygons with holes. The six resulting face meshes share boundary vertices by construction (orthographic projection guarantees consistent boundaries across views); they stitch into a single watertight surface without explicit boundary matching. Typical output: 350K source pixels collapse to ~450 vertices and ~330 triangles — three orders of magnitude smaller than naive depth-map-as-mesh.
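The RDP step is small enough to sketch directly (the earcut step itself is best delegated to an existing binding such as the `mapbox_earcut` Python package, which may or may not be what this pipeline uses). A minimal recursive RDP over an open polyline, with illustrative names:

```python
def rdp(points, eps=0.5):
    """Ramer-Douglas-Peucker simplification of an open polyline (sketch).

    Keeps the endpoints; recursively keeps the interior point farthest
    from the chord whenever that distance exceeds `eps`.
    """
    if len(points) < 3:
        return list(points)
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = (dx * dx + dy * dy) ** 0.5 or 1.0
    # perpendicular distance of each interior point to the chord
    dists = [abs(dy * (x - x0) - dx * (y - y0)) / norm
             for x, y in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[i - 1] <= eps:
        return [points[0], points[-1]]   # whole span is within tolerance
    return rdp(points[: i + 1], eps)[:-1] + rdp(points[i:], eps)
```

A pixel-quantised staircase within the 0.5 px tolerance collapses to its chord, while a genuine building corner survives as a vertex — which is why the simplification can run before earcut without eroding the facade outline.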

04 — Status

Where the pipeline stands.

The downstream pipeline (stages B–D) is validated end-to-end on synthetic inputs: given six clean orthographic depth maps of a primitive shape (cube, sphere, L-shape), the reconstruction produces a watertight mesh that round-trips through marching-cubes verification with topology preserved. Test outputs on standard primitives are reported in the white paper.
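One cheap necessary condition such a verification can check: in a closed 2-manifold triangle mesh, every undirected edge must be shared by exactly two triangles. A sketch of that check — illustrative, not the white paper's exact verification code:

```python
from collections import Counter

def is_watertight(triangles):
    """True iff every undirected edge is shared by exactly two triangles.

    Necessary condition for a closed 2-manifold mesh; `triangles` is a
    list of (i, j, k) vertex-index triples.
    """
    edges = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            edges[frozenset((u, v))] += 1  # undirected edge
    return all(count == 2 for count in edges.values())

# A tetrahedron is the smallest watertight triangle mesh; deleting any
# face leaves three boundary edges and the check fails.
tet = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
assert is_watertight(tet)
assert not is_watertight(tet[:-1])
```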

Stage A — single photo → six orthographic elevations — is the open engineering question. The constraint of orthographic output (not perspective) rules out most off-the-shelf novel-view generators, which produce perspective views. Current direction: a custom multi-view diffusion model trained on rendered architectural USD assets, with orthographic projection enforced in the rendering setup.

512² per-view resolution · ~450 output vertices · ~330 output triangles · 3–5 depth clusters per view
Core Insight

Decompose the photo.
Don't infer the mesh.

Direct image-to-3D collapses architectural detail into texture because the loss is dominated by overall silhouette. Routing the photograph through six orthographic elevations forces the system to commit to per-face geometric detail before composition — there is no global silhouette to hide behind. The detour through a structured intermediate is what produces recessed windows and protruding columns instead of cardboard cutouts.

Interactive Demo · Live

Three synced panes update in real time as you switch building typology (Tower / Box / L-Shape): 01 — the input photograph (click to cycle), 02 — the six orthographic elevations, and 03 — the reconstructed watertight mesh (drag to rotate).

Full Technical Paper

arXiv-format write-up · Building Elevation Reconstruction · methodology, contour stage analysis, open problem statement

Read Paper →
Related Thesis Chapters
PGN — Procedural Generator Network
Polyline → DSL → USD. The transformer-based generation backbone whose architectural template is reused here at the elevation-reconstruction stage.
SketchProc3D — Sketch to Building
CNN classifier + CGA grammar for sketch-driven building generation. The grammar-based predecessor; this work replaces grammar with explicit mesh reconstruction.
Graph Grammar (Merrell)
Boundary-string grammar extraction. Provides the symmetry-detection foundation; complementary to facade reconstruction at the rule-discovery level.
Appendix — Raw Materials
Transcripts & Source References
Restricted Access