From a single street-view photograph to a watertight 3D mesh of a building's facade — by routing the image through a six-plane orthographic decomposition rather than direct image-to-3D inference. Where direct inference produces cardboard cutouts, the detour produces architecturally accurate geometry.
The driving goal sits one level above this topic: an interactive 3D map of the real world where every building you walk past in Google Street View is a real, editable 3D asset — windows that recess, columns that protrude, cornices and balconies with actual depth profiles. Not a textured cardboard cutout, not a Genie-style frame-prediction model that hallucinates a video each time you turn the camera. Explicit, production-clean geometry that loads into Houdini, edits like any USD asset, and renders deterministically.
What the thesis explored before this topic: image-to-3D networks (TripoSR, Hunyuan3D, TRELLIS, Sparc3D) and multi-view generators (Zero-1-to-3, MVDream) all fail on architectural subjects. The first category collapses facade detail into texture because the loss is dominated by overall silhouette. The second category produces perspective multi-view renders, not orthographic, which breaks the constant-depth-derivative assumption every classical reconstruction stage relies on. The honest path forward is the detour: route the single photo through six orthographic elevations first, then reconstruct via the six-plane mesh pipeline (Topic 35) — which has been independently validated on synthetic primitives (Topic 36) — to produce a watertight USD mesh.
The reconstruction backend has been built. The open question — the intentional gap this topic flags — is the orthographic multi-view frontend. Once that's trained (estimated $3K–$100K of compute and 2 weeks to 6 months depending on dataset scale), the user experience becomes "drop a street-view photo, see the editable 3D building mesh in 200 ms". Until then, this topic establishes the architectural decomposition and validates every stage downstream of view synthesis.
Contemporary image-to-3D systems — TripoSR, Hunyuan3D, TRELLIS, Sparc3D — produce plausible 3D shapes for general object categories but consistently fail on architectural subjects. Windows are not actually recessed, columns do not protrude, moldings are painted onto a flat surface as albedo rather than carved into geometry. The output is a cardboard cutout with a building-shaped texture, not a building.
The failure mode is structural. Image-to-3D models trained on shape datasets (Objaverse, ShapeNet) optimise for category-level resemblance, not facade-level fidelity. The network's inductive bias has no reason to allocate representational capacity to sub-centimetre features when the loss is dominated by overall silhouette.
The reconstruction pipeline presented here takes a different route: it treats a building elevation as a composable stack of six orthographic depth maps, each independently extracted and triangulated, then stitched into a closed mesh. This decomposition forces the system to commit to per-face geometric detail before composition — there is no global silhouette to hide behind.
The pipeline is intentionally modular: each stage operates on a well-defined intermediate representation, can be evaluated independently, and admits substitution if a better component appears later. The non-trivial engineering sits in the contour-to-mesh stages where pixel-quantised boundaries must be lifted to clean polygon meshes without introducing zigzag artefacts.
Multi-view generation produces six orthographic projections — front, back, left, right, top, bottom — from a single street-view input. Orthographic (not perspective) is the key constraint: each pixel maps to a single world-space ray with constant depth derivative, which lets downstream contour extraction work in image space without unprojection. Resolution: 512×512 per view.
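To make the six-view representation concrete, here is a minimal sketch that renders six orthographic depth maps from a boolean voxel grid. This is not the paper's code, and the axis-to-view-name mapping is an illustrative convention; the point is that under orthographic projection every pixel's depth is just a count along a fixed axis.

```python
import numpy as np

def ortho_depth_maps(vox: np.ndarray) -> dict:
    """Render six orthographic depth maps from a boolean voxel grid.

    For each axis, a pixel's depth is the distance (in voxel units) to the
    first occupied cell, measured from either end of that axis. Empty
    columns are marked np.inf (background).
    NOTE: the axis -> view-name mapping is an illustrative assumption.
    """
    views = {}
    for axis, (near_name, far_name) in ((0, ("left", "right")),
                                        (1, ("top", "bottom")),
                                        (2, ("front", "back"))):
        n = vox.shape[axis]
        hit = vox.any(axis=axis)                      # foreground mask
        first = vox.argmax(axis=axis).astype(float)   # first occupied index
        last = n - 1 - np.flip(vox, axis=axis).argmax(axis=axis)
        views[near_name] = np.where(hit, first, np.inf)
        views[far_name] = np.where(hit, n - 1 - last, np.inf)
    return views
```

Because the ray direction is constant per view, the depth map of a flat face is itself constant, which is exactly the property the depth-clustering stage exploits.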
Each elevation produces a depth map; the per-view silhouette is extracted by marching squares at the foreground/background threshold. Vanilla marching squares produces axis-aligned zigzag artefacts on diagonal building edges. We apply selective corner-preserving smoothing: a low-pass filter on the contour polyline that detects sharp turning points (angle change > 30°) and leaves those vertices untouched, smoothing only between corners.
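A sketch of such a corner-preserving smoothing pass, using the 30° threshold stated above; the 3-tap averaging kernel and the pass count are illustrative choices, not necessarily the pipeline's:

```python
import math

def turning_angle(p_prev, p, p_next):
    """Absolute change of direction (degrees) at vertex p."""
    a1 = math.atan2(p[1] - p_prev[1], p[0] - p_prev[0])
    a2 = math.atan2(p_next[1] - p[1], p_next[0] - p[0])
    d = abs(a2 - a1)
    return math.degrees(min(d, 2 * math.pi - d))

def smooth_contour(points, corner_deg=30.0, passes=3):
    """Low-pass a closed polyline with a 3-tap kernel, freezing vertices
    whose turning angle exceeds corner_deg (the sharp building corners)."""
    pts = [tuple(p) for p in points]
    n = len(pts)
    for _ in range(passes):
        corners = {i for i in range(n)
                   if turning_angle(pts[i - 1], pts[i], pts[(i + 1) % n]) > corner_deg}
        out = []
        for i in range(n):
            if i in corners:
                out.append(pts[i])  # corner: left untouched
            else:
                px, py = pts[i - 1]
                cx, cy = pts[i]
                nx, ny = pts[(i + 1) % n]
                out.append(((px + 2 * cx + nx) / 4.0, (py + 2 * cy + ny) / 4.0))
        pts = out
    return pts
```

On a square contour with one zigzag vertex on an edge, the four 90° corners survive the passes exactly while the jag is pulled flat, which is the behaviour the stage needs on diagonal building edges.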
Within each elevation, the depth channel is clustered into a small number of roughly-parallel planes — typically 3–5 (background wall, window-recess plane, protruding column, balcony slab). Each cluster becomes its own polygon-with-holes in the contour plane, lifted to its mean depth. This is the step that gives the method its name: depth is treated as a categorical layer index rather than a continuous field.
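One way to realise this categorical-layer idea is 1-D gap-based clustering of the depth values. The sketch below is an assumption about the mechanism (the gap threshold and the nearest-mean assignment are illustrative), not the pipeline's actual clustering code:

```python
import numpy as np

def cluster_depth_layers(depth, fg_mask, gap=0.05):
    """Quantise foreground depth values into a few parallel layers.

    Splits the sorted depth samples wherever consecutive values are more
    than `gap` apart, then snaps every foreground pixel to the nearest
    layer mean. Returns (layer_index_map, layer_means); background is -1.
    Layer means come out in ascending depth order.
    """
    vals = np.sort(depth[fg_mask])
    splits = np.where(np.diff(vals) > gap)[0] + 1
    groups = np.split(vals, splits)
    means = np.array([g.mean() for g in groups])
    idx = np.full(depth.shape, -1, dtype=int)
    fg_vals = depth[fg_mask]
    idx[fg_mask] = np.abs(fg_vals[:, None] - means[None, :]).argmin(axis=1)
    return idx, means
```

On a toy elevation with a wall at depth 1.0, a window recess at 1.2 and a column at 0.8 (plus small sensor noise), this recovers exactly three layers, each of which would then become its own polygon-with-holes.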
Each depth-clustered polygon is simplified with Ramer–Douglas–Peucker (ε=0.5px), then triangulated with the earcut algorithm — robust for polygons with holes. The six resulting face meshes share boundary vertices by construction (orthographic projection guarantees consistent boundaries across views); they stitch into a single watertight surface without explicit boundary matching. Typical output: 350K source pixels collapse to ~450 vertices and ~330 triangles — three orders of magnitude smaller than naive depth-map-as-mesh.
The downstream pipeline (stages B–D) is validated end-to-end on synthetic inputs: given six clean orthographic depth maps of a primitive shape (cube, sphere, L-shape), the reconstruction produces a watertight mesh that round-trips through marching-cubes verification with topology preserved. Test outputs on standard primitives are reported in the white paper.
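A cheap necessary condition such a round-trip can assert is edge-manifoldness: in a closed triangle mesh, every undirected edge borders exactly two faces. The white paper's verification is via marching cubes; the sketch below is a complementary check, not the paper's test harness:

```python
from collections import Counter

def is_watertight(triangles):
    """Necessary condition for a closed 2-manifold triangle mesh:
    every undirected edge is shared by exactly two faces.
    `triangles` is an iterable of (i, j, k) vertex-index triples."""
    edge_count = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_count[frozenset((u, v))] += 1
    return all(n == 2 for n in edge_count.values())
```

A tetrahedron passes; drop any one face and the three exposed boundary edges each have count 1, so the check fails.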
Stage A — single photo → six orthographic elevations — is the open engineering question. The constraint of orthographic output (not perspective) rules out most off-the-shelf novel-view generators, which produce perspective views. Current direction: a custom multi-view diffusion model trained on rendered architectural USD assets, with orthographic projection enforced in the rendering setup.
Decompose the photo.
Don't infer the mesh.
Direct image-to-3D collapses architectural detail into texture because the loss is dominated by overall silhouette. Routing the photograph through six orthographic elevations forces the system to commit to per-face geometric detail before composition — there is no global silhouette to hide behind. The detour through a structured intermediate is what produces recessed windows and protruding columns instead of cardboard cutouts.
Pick a building type below (Tower / Box / L-Shape) or click the input photo to cycle. All three panes — photograph, six elevations, reconstructed mesh — update in real time as you switch typology. Drag the mesh to rotate.
arXiv-format write-up · Building Elevation Reconstruction · methodology, contour stage analysis, open problem statement