Single-image 3D reconstruction of architectural subjects sits at the intersection of two well-developed but mutually inadequate fields. Image-to-3D generative models [1,2,3] produce visually plausible meshes for general object categories but consistently fail to recover the geometric detail that defines an architectural facade — windows that recess, columns that protrude, moldings with real depth profile. Multi-view stereo and photogrammetry [4] recover accurate geometry but require many calibrated views and produce dense point clouds rather than the clean polygonal meshes a downstream production pipeline can consume.
We pursue a third route: rather than infer 3D directly from one photograph, we infer six axis-aligned orthographic elevations from the photograph, then reconstruct the mesh from the elevations using classical computational geometry. The six-plane decomposition is the method's central design commitment — it forces the system to fix per-face geometric detail before composition, eliminating the global-silhouette shortcut that dominates the loss in category-trained image-to-3D networks.
This paper makes the following contributions: (1) the six-plane elevation decomposition as an inductive bias for architectural reconstruction; (2) a contour extraction stage with corner-preserving smoothing that handles the pixel-quantisation artefacts of marching squares on diagonal facade edges; (3) a depth-clustering reformulation that treats facade depth as a small categorical index over planes rather than a continuous field, recovering window/column/wall layering; (4) an end-to-end pipeline from six orthographic depth maps to a watertight USD mesh with three-order-of-magnitude vertex compression; (5) a precise statement of the open engineering problem — orthographic multi-view synthesis from a single photo — that gates end-to-end deployment.
Contemporary single-image 3D generators are trained on shape datasets (Objaverse [5], ShapeNet [6]) where category-level resemblance is the primary supervision signal. For a chair, "looks like a chair" captures most of the perceptually relevant variation. For a building, "looks like a building" captures essentially none of the architecturally relevant variation — every Georgian rowhouse, every Mughal facade, every modernist curtain wall passes a category test. The network has no loss-driven reason to allocate representational capacity to the sub-centimetre features that distinguish them.
This is visible in failure modes. TripoSR on a building photograph produces a roughly building-shaped solid whose facade detail is entirely texture; the underlying geometry is a smooth slab. Hunyuan3D and TRELLIS produce richer textured outputs, but the recessed and protruding geometry that a human observer would identify as architectural is absent. The cardboard-cutout failure is consistent across systems and across architectural styles, supporting the structural-mismatch hypothesis rather than a per-system bug.
Consider a category-conditioned reconstruction loss ℒ = ‖M̂ − M*‖ where M̂ is the predicted mesh and M* is the ground-truth mesh. Decompose the geometry into a coarse scale (overall silhouette, primary masses) and a fine scale (window depth, column protrusion, molding profile). Under a uniform per-vertex loss, coarse-scale errors dominate the aggregate by orders of magnitude because they touch vastly more vertices: a wrong overall size displaces every vertex, while a missed window recess displaces only the handful of vertices inside the recess. Gradient descent therefore allocates network capacity where loss reduction is steepest, which is overwhelmingly the coarse scale. Fine architectural features are systematically under-represented in the learned distribution.
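A back-of-the-envelope calculation makes the imbalance concrete. The numbers below are hypothetical but representative, not measurements from this paper:

```python
# Hypothetical magnitudes for a 10 m facade meshed with 10,000 vertices,
# 100 of which lie inside a single window recess.
n_total, n_recess = 10_000, 100
scale_err = 0.10    # a 1% global scale error displaces every vertex ~10 cm
recess_err = 0.05   # a missed recess displaces its vertices by 5 cm

coarse_loss = n_total * scale_err ** 2    # 100.0
fine_loss = n_recess * recess_err ** 2    # 0.25

# The coarse-scale term outweighs the fine-scale term ~400:1, so the
# gradient signal for learning the recess is proportionally weaker.
print(coarse_loss / fine_loss)            # 400.0
```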
The natural response — "train on more buildings" — does not solve the loss-allocation problem. Adding 10× more building examples gives the network 10× more coarse-scale supervision and 10× more fine-scale supervision, but the gradient ratio between them stays the same. The network still under-represents fine detail because the gradient signal pushing it to learn that detail is intrinsically weaker. Fixing this requires either (a) reweighting the loss to upweight fine-scale errors, (b) changing the representation so fine-scale and coarse-scale errors are decoupled, or (c) routing through an intermediate that forces explicit fine-scale commitment. The pipeline described here pursues (c).
The reconstruction operates on six orthographic depth maps, one per axis-aligned face of the bounding box: front (-Z), back (+Z), left (-X), right (+X), top (-Y), bottom (+Y). Each is a depth image D_v ∈ ℝ^(H×W) where pixel (i,j) records the depth of the nearest surface point along the orthographic ray for view v. Orthographic — not perspective — projection is required: under orthography, depth over a planar surface is an affine function of pixel position, so contour extraction in image space corresponds directly to surface contours in world space.
From each depth map we extract the per-view silhouette via marching squares at the foreground/background threshold. On a binary silhouette, vanilla marching squares produces axis-aligned zigzag artefacts along diagonal facade edges, since the interpolation degenerates to cell-edge midpoints. We apply selective corner-preserving smoothing: traverse the contour polyline, compute the turning angle at each vertex, mark as corners those vertices whose absolute turning angle exceeds 30°, and apply a moving-average smoother only to the non-corner runs. The result preserves architectural corners (window jambs, column edges) while removing the zigzag along diagonal surfaces.
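A minimal sketch of the extraction-and-smoothing step, assuming scikit-image's `find_contours` as the marching-squares front end and a closed contour; the 30° threshold is the value quoted above, and the window size `win` is illustrative:

```python
import numpy as np
from skimage import measure

def extract_smoothed_contour(mask, corner_deg=30.0, win=5):
    """Silhouette contour with corner-preserving smoothing (sketch)."""
    c = measure.find_contours(mask.astype(float), 0.5)[0]  # (N, 2) rows/cols
    if np.allclose(c[0], c[-1]):
        c = c[:-1]                      # drop the duplicated closing vertex
    # Turning angle at each vertex from the two incident edge directions.
    prev_d = c - np.roll(c, 1, axis=0)
    next_d = np.roll(c, -1, axis=0) - c
    a = np.arctan2(prev_d[:, 0], prev_d[:, 1])
    b = np.arctan2(next_d[:, 0], next_d[:, 1])
    turn = np.abs(np.degrees(np.angle(np.exp(1j * (b - a)))))
    is_corner = turn > corner_deg
    # Circular moving average; corner vertices are pinned afterwards,
    # which approximates smoothing only the non-corner runs.
    k = np.ones(win) / win
    sm = np.stack(
        [np.convolve(np.r_[c[-win:, d], c[:, d], c[:win, d]], k,
                     mode="same")[win:-win] for d in (0, 1)],
        axis=1,
    )
    return np.where(is_corner[:, None], c, sm)
```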
Within each silhouette, the depth channel D_v is clustered into k roughly parallel planes (typically k=3–5) via 1-D k-means in depth space, initialised at quantiles. Each cluster c_i is assigned a single representative depth z_i = mean(D_v[mask_i]); the cluster mask becomes a polygon-with-holes in the contour plane, lifted to depth z_i. This step encodes a strong architectural prior: facades consist of a small number of locally planar layers (background wall, window recess plane, protruding column, balcony slab), not a smoothly varying depth field. The clustering recovers this layering automatically from the depth histogram.
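A sketch of the depth-layering step with the quantile initialisation described above; variable names are illustrative:

```python
import numpy as np

def cluster_depths(depth, fg_mask, k=4, iters=10):
    """1-D k-means over foreground depths (sketch).

    Returns a per-pixel label image (-1 = background) and one
    representative depth z_i per cluster.
    """
    vals = depth[fg_mask]
    # Initialise the k centres at interior quantiles of the depth values.
    centres = np.quantile(vals, np.linspace(0.0, 1.0, k + 2)[1:-1])
    for _ in range(iters):
        labels = np.abs(vals[:, None] - centres[None, :]).argmin(axis=1)
        for c in range(k):
            sel = labels == c
            if sel.any():
                centres[c] = vals[sel].mean()   # z_i = mean(D_v[mask_i])
    out = np.full(depth.shape, -1, dtype=int)
    out[fg_mask] = labels
    return out, centres
```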
Each per-cluster polygon-with-holes is simplified with Ramer–Douglas–Peucker (ε=0.5 px in image space) to remove redundant collinear vertices, then triangulated with the earcut algorithm [7], which is robust for arbitrary simple polygons with holes. The six resulting face meshes are stitched into a single mesh by exploiting the orthographic guarantee: any vertex on the boundary of one elevation is, by construction, also on the boundary of at least one adjacent elevation at the same 3D position. Boundary identification is therefore a coordinate equality check; no ICP or optimisation-based alignment is required.
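The stitch itself reduces to a keyed lookup on quantised coordinates. A sketch, assuming each face mesh arrives as a (vertices, triangles) pair already lifted to world space:

```python
import numpy as np

def stitch_faces(face_meshes, quantum=1e-6):
    """Merge per-face meshes by coordinate identity (sketch).

    Shared boundary vertices land on identical quantised keys and
    collapse to a single index; no ICP, no optimisation.
    """
    verts, tris, index = [], [], {}
    for V, F in face_meshes:            # V: (n, 3) float, F: (m, 3) int
        remap = np.empty(len(V), dtype=int)
        for i, p in enumerate(V):
            key = tuple(np.round(np.asarray(p) / quantum).astype(np.int64))
            if key not in index:
                index[key] = len(verts)
                verts.append(p)
            remap[i] = index[key]
        tris.append(remap[np.asarray(F)])
    return np.asarray(verts), np.vstack(tris)
```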
Per elevation: marching squares is O(HW), corner-preserving smoothing is O(V) over contour vertices, depth clustering is O(HW·k·iter) for k-means with k≈5 and iter≈10, RDP simplification is O(V log V) on average (O(V²) worst-case), and earcut triangulation is O(V²) worst-case but near-linear on the near-convex inputs typical of buildings. The contour-to-mesh stages run end-to-end in ~120 ms per facade on a single CPU core for 512² input, dominated by the depth clustering. There is no neural inference cost in these stages — stage 1 (orthographic view synthesis) is the only neural component, and runs in roughly 800 ms on a single GPU.
Three systematic failure modes have been identified during synthetic validation. (i) Open boundaries: when a building has overhangs or floating elements (balconies disconnected from the main mass), the boundary-identity stitch fails because a +X-elevation boundary vertex has no counterpart on the +Y or +Z elevations. Mitigated by a gap-filling step that triangulates between nearby but non-coincident boundary vertices when they lie within 2× the median edge length. (ii) Cluster fragmentation: depth clustering can split a single architectural plane into two clusters if depth noise is high, producing duplicate parallel surfaces in the final mesh. Mitigated by merging clusters whose mean depths differ by less than 1% of the depth range (sketched below). (iii) Empty silhouettes: views in which the building is fully occluded or off-frame produce empty contours and break the six-face stitch. Mitigated by falling back to a bounding-box-derived placeholder face for the missing elevation, which over-estimates that face's geometry but preserves watertightness.
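The cluster-merge mitigation for failure mode (ii) is small enough to sketch in full; `centres` and `labels` follow the clustering sketch above:

```python
import numpy as np

def merge_close_clusters(centres, labels, rel_tol=0.01):
    """Fold together depth clusters closer than rel_tol of the range (sketch)."""
    order = np.argsort(centres)
    span = centres[order[-1]] - centres[order[0]] + 1e-12
    remap = {-1: -1}                    # background label passes through
    keep = order[0]
    for c in order:                     # walk clusters in depth order
        if centres[c] - centres[keep] < rel_tol * span:
            remap[c] = keep             # merge into the current layer
        else:
            keep = c
            remap[c] = c
    return np.vectorize(remap.__getitem__)(labels)
```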
Element counts at each stage of the pipeline quantify the vertex compression claimed above:

| Stage | Representation | Element count |
|---|---|---|
| Input depth maps | Pixels (six views, 512²) | ~352K |
| Per-view silhouettes | Contour vertices (smoothed) | ~2.4K |
| Depth-clustered polygons | Vertices after RDP | ~890 |
| Triangulated face meshes | Vertices (pre-stitch) | ~620 |
| Stitched output mesh | Vertices · Triangles | ~454 · ~332 |
Stages 2–5 are end-to-end validated on synthetic depth maps. The remaining open question is stage 1: producing six orthographic elevations from a single street-view photograph. Off-the-shelf novel-view synthesis models (Zero-1-to-3 [8], MVDream [9], multi-view fine-tunes of large image models) produce perspective views from arbitrary viewpoints, not orthographic views from axis-aligned viewpoints. Perspective output breaks the constant-depth-derivative assumption that lets downstream contour extraction work in image space.
Under orthographic projection, each pixel (i,j) corresponds to a fixed world-space ray with no perspective foreshortening, so a pixel-space contour traversal corresponds bijectively to a surface-contour traversal in world space. Under perspective projection, a contour in image space corresponds to a back-projected ray that hits the surface at an unknown depth; the contour does not uniquely determine the surface boundary. Triangulating a perspective contour produces a 3D surface that is correct only up to a depth-dependent radial distortion. For architectural reconstruction, where window-jamb depth offsets are precisely the quantities of interest, this distortion is unacceptable.
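To make the bijection concrete, here is orthographic unprojection under an assumed camera parameterisation (the names `origin`, `du`, `dv`, `view_dir` are ours, not from a specific library):

```python
import numpy as np

def unproject_ortho(i, j, d, origin, du, dv, view_dir):
    """World point for pixel (i, j) at depth d under orthography (sketch).

    `origin` is the world position of pixel (0, 0) on the image plane,
    `du`/`dv` are the world-space steps per pixel column/row, and
    `view_dir` is the single unit ray direction shared by every pixel.
    Because (i, j) enters affinely and independently of d, an image-space
    contour maps bijectively to a surface contour in world space; under
    perspective, the ray direction itself depends on (i, j) and lateral
    position scales with d, breaking this correspondence.
    """
    return origin + j * du + i * dv + d * view_dir
```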
Three resolution paths are under evaluation. (a) Train a custom multi-view diffusion model on rendered architectural USD assets with orthographic projection enforced at the rendering stage — the most direct but most data-intensive route; estimated $3K–$100K training cost and 2 weeks to 6 months timeline depending on dataset scale. (b) Post-process a perspective multi-view model output by un-projecting and re-projecting through a virtual orthographic camera — accurate in principle, but introduces resampling error and requires per-view depth, which the perspective model does not natively output. (c) Bypass the photo-to-elevation step entirely by accepting six elevations as input from a separate tool (architectural drawings, scan output) and treating this pipeline as a downstream reconstruction module rather than an end-to-end image-to-3D system.
Path (a) is gated by data, not compute. The training set must consist of (street-view photograph, six orthographic depth maps) tuples with pixel-perfect alignment. No public dataset of this form exists. Rendering tuples synthetically from CityGML, OpenStreetMap 3D, or rendered architectural USD assets is technically straightforward, but the synthetic-to-real distribution gap then becomes the dominant failure mode — the same fundamental problem identified in SketchProc3D [3]. A semi-supervised hybrid in which the model trains on synthetic pairs and is fine-tuned on a small set of human-aligned real pairs is the most plausible production route. The estimated effort to assemble even 500 such real pairs (manual orthographic depth ground truth from architectural surveys) is roughly two to three months of dedicated annotation work.
The contour-to-mesh stages (2–5) are validated independently by feeding synthetic primitive geometry — boxes, L-shapes, T-shapes, U-shapes, cylinders — through the pipeline and checking three properties of the output. First, the output is watertight as verified by per-edge two-face incidence (every mesh edge is shared by exactly two faces). Second, the output preserves topology — the Euler characteristic χ = V − E + F matches the expected value for the primitive's genus (typically 2 for genus-0 building masses). Third, the output round-trips through a marching-cubes voxelisation: convert the output mesh to a 256³ occupancy grid, run marching cubes to recover a re-meshed surface, and check that the re-meshed surface differs from the original output by less than 1.5% Hausdorff distance. All three checks pass on the primitive test suite.
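The first two checks are mechanical; a minimal sketch over an indexed triangle mesh `(V, F)`:

```python
from collections import Counter
import numpy as np

def validate_mesh(V, F):
    """Watertightness and Euler-characteristic checks (sketch)."""
    edge_count = Counter()
    for a, b, c in np.asarray(F):
        for e in ((a, b), (b, c), (c, a)):
            edge_count[tuple(sorted(map(int, e)))] += 1
    # Watertight: every undirected edge bounds exactly two triangles.
    watertight = all(n == 2 for n in edge_count.values())
    # χ = V − E + F; a genus-0 solid must give χ = 2.
    chi = len(V) - len(edge_count) + len(F)
    return watertight, chi
```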
Stage 1 (view synthesis) is intentionally not validated end-to-end here because the deployable model does not exist yet. A bespoke evaluation harness has been drafted: render a held-out architectural USD asset through six orthographic cameras, feed one of the renders into a candidate view-synthesis model, recover six predicted elevations, run the contour-to-mesh stages, and compare the reconstruction to the original USD asset. Reporting metrics will be per-axis intersection-over-union of silhouettes, per-cluster depth RMS error, and end-to-end mesh Hausdorff distance.
The pipeline composes ideas from several adjacent areas. Marching squares originates in scientific visualisation [10] and is used here as a contour extractor rather than an iso-surface generator. Earcut triangulation [7] is the standard polygon-with-holes triangulator in 2-D GIS pipelines. Depth-cluster reformulations of facade geometry appear in the procedural-architecture literature (CGA shape grammars [11]) but as authored rules rather than recovered structure. The six-plane decomposition itself echoes the cube-map representation in graphics but uses the views as a reconstruction substrate rather than a rendering target.
The closest prior work is six-plane mesh reconstruction from controlled inputs — that pipeline established the contour-to-mesh stages used here verbatim, validated on geometric primitives (sphere, box, L-shape). The contribution of the present paper is the architectural-subject application and the orthographic-view-synthesis problem statement, not the per-cluster triangulation itself.
The methodology is part of a larger thesis pattern that prefers structured, inspectable intermediate representations over end-to-end neural inference. PGN [12] converts polylines into a DSL program before executing to USD; SketchProc3D [3] converts a sketch into CGA grammar parameters before rendering; the Merrell graph-grammar implementation [13] converts a single example into a generative grammar before producing variations. In every case, the intermediate is the deliverable, not the final geometry — final geometry inherits its quality from the intermediate rather than the noise floor of an end-to-end network.
Multi-plane image (MPI) representations [14] in novel-view synthesis decompose a scene into a stack of fronto-parallel RGB-α layers at varying depths. The six-plane elevation decomposition shares the underlying intuition — decompose geometry along a small number of carefully chosen planes — but operates on geometry directly rather than radiance, and produces a single mesh as output rather than a parametric novel-view renderer. The six-plane representation has more aggressive depth quantisation (3–5 discrete layers vs MPI's 32–128) traded for a closed-form mesh decoder rather than a learned renderer.
The six-plane reconstruction pipeline is a deliberate detour from the direct image-to-3D path that has dominated recent literature. It accepts a harder intermediate task (orthographic multi-view synthesis) in exchange for a much easier final task (mesh reconstruction from clean orthographic inputs). The trade is worthwhile only if the intermediate is in fact solvable; the current open-problem statement is that this remains to be demonstrated at deployable quality.
The pipeline as a whole instantiates the thesis pattern described above: the photo is converted to six elevations rather than directly to a mesh, and, as with PGN's DSL programs and SketchProc3D's grammar parameters, the final geometry inherits the quality of that structured, inspectable, editable intermediate rather than the noise floor of an end-to-end network.
A successful deployment of this pipeline produces, from a single street-view photo of a building, a watertight USD mesh that a Maps production pipeline can consume as a first-class asset: editable in Houdini, instantiable across versions, mergeable with other assets via standard USD composition arcs. The fidelity target is window-recess depth accurate to ±2 cm on typical 3-storey residential subjects, with all major facade features (cornices, balconies, columns) preserved as discrete geometric elements rather than texture. Below that bar the pipeline is research; above it, it is a viable replacement for the current photogrammetric workflow at a fraction of the input-data cost.
The most likely failure mode is not the contour-to-mesh stages — these are deterministic and have been validated end-to-end on synthetic input. The most likely failure mode is that orthographic view synthesis turns out to be intractable for typical urban architecture given realistic training data budgets. If three independent attempts at training such a model (different architectures, different data scales, different objective functions) all fail to produce orthographic outputs with depth error under ±5 cm on a held-out test set, the conclusion will be that this decomposition is not the right route for end-to-end image-to-3D building reconstruction. The downstream pipeline (stages 2–5) remains useful regardless — it can be deployed as a module that consumes externally produced orthographic depth maps from photogrammetry, LiDAR, or architectural surveys.
Three concrete next steps. (a) Render a 100K-tuple synthetic training set from rendered architectural USD assets and train a baseline orthographic multi-view model; report per-axis IoU and depth RMSE. (b) Construct a held-out real-photograph evaluation set with manually aligned orthographic ground truth — even 50 such examples would suffice to characterise the synthetic-to-real gap quantitatively. (c) Combine the output of this pipeline with the PGN [12] bridge-construction DSL: use the recovered mesh as a geometric prior for a downstream DSL generator that produces a parametric Houdini scene, closing the loop from photo to editable production asset.