From a single street-view photograph to a watertight 3D mesh of a building's facade — by routing the image through a six-plane orthographic decomposition rather than direct image-to-3D inference. Where direct inference produces cardboard cutouts, the detour produces architecturally accurate geometry.
The driving goal sits one level above this topic: an interactive 3D map of the real world where every building you walk past in Google Street View is a real, editable 3D asset — windows that recess, columns that protrude, cornices and balconies with actual depth profiles. Not a textured cardboard cutout, not a Genie-style frame-prediction model that hallucinates a video each time you turn the camera. Explicit, production-clean geometry that loads into Houdini, edits like any USD asset, and renders deterministically.
What the thesis explored before this topic: image-to-3D networks (TripoSR, Hunyuan3D, TRELLIS, Sparc3D) and multi-view generators (Zero-1-to-3, MVDream) all fail on architectural subjects. The first category collapses facade detail into texture because the loss is dominated by overall silhouette. The second category produces perspective multi-view renders, not orthographic, which breaks the constant-depth-derivative assumption every classical reconstruction stage relies on. The honest path forward is the detour: route the single photo through six orthographic elevations first, then reconstruct via the six-plane mesh pipeline (Topic 35) — which has been independently validated on synthetic primitives (Topic 36) — to produce a watertight USD mesh.
The reconstruction backend has been built. The open question — the intentional gap this topic flags — is the orthographic multi-view frontend. Once that's trained (estimated $3K–$100K of compute and 2 weeks to 6 months depending on dataset scale), the user experience becomes "drop a street-view photo, see the editable 3D building mesh in 200 ms". Until then, this topic establishes the architectural decomposition and validates every stage downstream of view synthesis.
Contemporary image-to-3D systems — TripoSR, Hunyuan3D, TRELLIS, Sparc3D — produce plausible 3D shapes for general object categories but consistently fail on architectural subjects. Windows are not actually recessed, columns do not protrude, moldings are painted onto a flat surface as albedo rather than carved into geometry. The output is a cardboard cutout with a building-shaped texture, not a building.
The failure mode is structural. Image-to-3D models trained on shape datasets (Objaverse, ShapeNet) optimise for category-level resemblance, not facade-level fidelity. The network's inductive bias has no reason to allocate representational capacity to sub-centimetre features when the loss is dominated by overall silhouette.
The reconstruction pipeline presented here takes a different route: it treats a building elevation as a composable stack of six orthographic depth maps, each independently extracted and triangulated, then stitched into a closed mesh. This decomposition forces the system to commit to per-face geometric detail before composition — there is no global silhouette to hide behind.
The pipeline is intentionally modular: each stage operates on a well-defined intermediate representation, can be evaluated independently, and admits substitution if a better component appears later. The non-trivial engineering sits in the contour-to-mesh stages where pixel-quantised boundaries must be lifted to clean polygon meshes without introducing zigzag artefacts.
Multi-view generation produces six orthographic projections — front, back, left, right, top, bottom — from a single street-view input. Orthographic (not perspective) is the key constraint: each pixel maps to a single world-space ray with constant depth derivative, which lets downstream contour extraction work in image space without unprojection. Resolution: 512×512 per view.
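To make the six-view representation concrete, here is a minimal sketch that renders six orthographic depth maps from a boolean voxel grid. This is not the paper's code, and the axis-to-view-name mapping is an illustrative convention; the point is that under orthographic projection every pixel's depth is just a count along a fixed axis.

```python
import numpy as np

def ortho_depth_maps(vox: np.ndarray) -> dict:
    """Render six orthographic depth maps from a boolean voxel grid.

    For each axis, a pixel's depth is the distance (in voxel units) to the
    first occupied cell, measured from either end of that axis. Empty
    columns are marked np.inf (background).
    NOTE: the axis -> view-name mapping is an illustrative assumption.
    """
    views = {}
    for axis, (near_name, far_name) in ((0, ("left", "right")),
                                        (1, ("top", "bottom")),
                                        (2, ("front", "back"))):
        n = vox.shape[axis]
        hit = vox.any(axis=axis)                      # foreground mask
        first = vox.argmax(axis=axis).astype(float)   # first occupied index
        last = n - 1 - np.flip(vox, axis=axis).argmax(axis=axis)
        views[near_name] = np.where(hit, first, np.inf)
        views[far_name] = np.where(hit, n - 1 - last, np.inf)
    return views
```

Because the ray direction is constant per view, the depth map of a flat face is itself constant, which is exactly the property the depth-clustering stage exploits.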
Each elevation produces a depth map; the per-view silhouette is extracted by marching squares at the foreground/background threshold. Vanilla marching squares produces axis-aligned zigzag artefacts on diagonal building edges. We apply selective corner-preserving smoothing: a low-pass filter on the contour polyline that detects sharp turning points (angle change > 30°) and leaves those vertices untouched, smoothing only between corners.
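A sketch of such a corner-preserving smoothing pass, using the 30° threshold stated above; the 3-tap averaging kernel and the pass count are illustrative choices, not necessarily the pipeline's:

```python
import math

def turning_angle(p_prev, p, p_next):
    """Absolute change of direction (degrees) at vertex p."""
    a1 = math.atan2(p[1] - p_prev[1], p[0] - p_prev[0])
    a2 = math.atan2(p_next[1] - p[1], p_next[0] - p[0])
    d = abs(a2 - a1)
    return math.degrees(min(d, 2 * math.pi - d))

def smooth_contour(points, corner_deg=30.0, passes=3):
    """Low-pass a closed polyline with a 3-tap kernel, freezing vertices
    whose turning angle exceeds corner_deg (the sharp building corners)."""
    pts = [tuple(p) for p in points]
    n = len(pts)
    for _ in range(passes):
        corners = {i for i in range(n)
                   if turning_angle(pts[i - 1], pts[i], pts[(i + 1) % n]) > corner_deg}
        out = []
        for i in range(n):
            if i in corners:
                out.append(pts[i])  # corner: left untouched
            else:
                px, py = pts[i - 1]
                cx, cy = pts[i]
                nx, ny = pts[(i + 1) % n]
                out.append(((px + 2 * cx + nx) / 4.0, (py + 2 * cy + ny) / 4.0))
        pts = out
    return pts
```

On a square contour with one zigzag vertex on an edge, the four 90° corners survive the passes exactly while the jag is pulled flat, which is the behaviour the stage needs on diagonal building edges.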
Within each elevation, the depth channel is clustered into a small number of roughly-parallel planes — typically 3–5 (background wall, window-recess plane, protruding column, balcony slab). Each cluster becomes its own polygon-with-holes in the contour plane, lifted to its mean depth. This is the step that gives the method its name: depth is treated as a categorical layer index rather than a continuous field.
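One way to realise this categorical-layer idea is 1-D gap-based clustering of the depth values. The sketch below is an assumption about the mechanism (the gap threshold and the nearest-mean assignment are illustrative), not the pipeline's actual clustering code:

```python
import numpy as np

def cluster_depth_layers(depth, fg_mask, gap=0.05):
    """Quantise foreground depth values into a few parallel layers.

    Splits the sorted depth samples wherever consecutive values are more
    than `gap` apart, then snaps every foreground pixel to the nearest
    layer mean. Returns (layer_index_map, layer_means); background is -1.
    Layer means come out in ascending depth order.
    """
    vals = np.sort(depth[fg_mask])
    splits = np.where(np.diff(vals) > gap)[0] + 1
    groups = np.split(vals, splits)
    means = np.array([g.mean() for g in groups])
    idx = np.full(depth.shape, -1, dtype=int)
    fg_vals = depth[fg_mask]
    idx[fg_mask] = np.abs(fg_vals[:, None] - means[None, :]).argmin(axis=1)
    return idx, means
```

On a toy elevation with a wall at depth 1.0, a window recess at 1.2 and a column at 0.8 (plus small sensor noise), this recovers exactly three layers, each of which would then become its own polygon-with-holes.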
Each depth-clustered polygon is simplified with Ramer–Douglas–Peucker (ε=0.5px), then triangulated with the earcut algorithm — robust for polygons with holes. The six resulting face meshes share boundary vertices by construction (orthographic projection guarantees consistent boundaries across views); they stitch into a single watertight surface without explicit boundary matching. Typical output: 350K source pixels collapse to ~450 vertices and ~330 triangles — three orders of magnitude smaller than naive depth-map-as-mesh.
The downstream pipeline (stages B–D) is validated end-to-end on synthetic inputs: given six clean orthographic depth maps of a primitive shape (cube, sphere, L-shape), the reconstruction produces a watertight mesh that round-trips through marching-cubes verification with topology preserved. Test outputs on standard primitives are reported in the white paper.
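A cheap necessary condition such a round-trip can assert is edge-manifoldness: in a closed triangle mesh, every undirected edge borders exactly two faces. The white paper's verification is via marching cubes; the sketch below is a complementary check, not the paper's test harness:

```python
from collections import Counter

def is_watertight(triangles):
    """Necessary condition for a closed 2-manifold triangle mesh:
    every undirected edge is shared by exactly two faces.
    `triangles` is an iterable of (i, j, k) vertex-index triples."""
    edge_count = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_count[frozenset((u, v))] += 1
    return all(n == 2 for n in edge_count.values())
```

A tetrahedron passes; drop any one face and the three exposed boundary edges each have count 1, so the check fails.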
Stage A — single photo → six orthographic elevations — is the open engineering question. The constraint of orthographic output (not perspective) rules out most off-the-shelf novel-view generators, which produce perspective views. Current direction: a custom multi-view diffusion model trained on rendered architectural USD assets, with orthographic projection enforced in the rendering setup.
Decompose the photo.
Don't infer the mesh.
Direct image-to-3D collapses architectural detail into texture because the loss is dominated by overall silhouette. Routing the photograph through six orthographic elevations forces the system to commit to per-face geometric detail before composition — there is no global silhouette to hide behind. The detour through a structured intermediate is what produces recessed windows and protruding columns instead of cardboard cutouts.
Pick a building type below (Tower / Box / L-Shape) or click the input photo to cycle. All three panes — photograph, six elevations, reconstructed mesh — update in real time as you switch typology. Drag the mesh to rotate.
arXiv-format write-up · Building Elevation Reconstruction · methodology, contour stage analysis, open problem statement