xy_pos, xy_neg, xz_pos, xz_neg, yz_pos, yz_neg) over a unit cube; the output is a closed surface that can be consumed by production 3-D tooling. The pipeline operates in six stages: (1) per-face foreground extraction with explicit holes via alpha; (2) per-pixel depth clustering with 1-D k-means to identify constant-depth surfaces robust to anti-aliasing; (3) subpixel marching-squares contour tracing per cluster, producing polygons-with-holes; (4) Ramer–Douglas–Peucker simplification at ε = 0.5 px to compress contours to structural corners; (5) earcut triangulation of each simplified polygon; (6) watertight stitching via boundary-coordinate matching across adjacent faces with side-wall fill where depth-cluster boundaries diverge from the silhouette. On synthetic primitives the pipeline compresses ~352 K input pixels to under 500 vertices with topology preserved under round-trip marching-cubes voxelisation (Hausdorff < 1.5 %). We compare this minimal-polygon approach against a cloth-grid alternative that produces denser meshes preferred for organic shapes, and characterise the trade-off between editability and surface smoothness. The pipeline is the foundation of two downstream applications discussed in companion papers: Six-Plane Building Elevation Reconstruction (which adds an orthographic view-synthesis frontend) and the Sphere Depth Maps validation harness (which provides the synthetic input for verification). Keywords: orthographic mesh reconstruction, marching squares, depth clustering, RDP simplification, earcut triangulation, watertight surface, polygon-with-holes.
The inverse of orthographic rendering is geometry recovery: given depth maps from n viewpoints, reconstruct the underlying 3-D surface. With n = 6 axis-aligned views (one per cube face) the recovery is nearly unique up to interior void ambiguity — every external surface patch is visible to at least one of the six cameras, and concave surface regions are visible to at least two opposite cameras whose depths can be combined to estimate thickness.
The naive recovery — back-project every non-empty pixel into 3-D and triangulate the resulting point cloud — produces meshes with mid-five-figure vertex counts that contain massive redundancy. A 1 m² flat wall sampled at 512 × 512 resolution produces 262 K points where one quad would suffice. The work here is replacing the pixel-as-vertex assumption with a region-as-polygon assumption: trace each constant-depth region's boundary as a polygon, simplify, triangulate per polygon. The result is a mesh with vertices located only at structural corners.
The contributions of this paper are: (1) the six-stage minimal-polygon pipeline from depth maps to a watertight triangulated mesh; (2) the depth-clustering reformulation that treats facade-like geometry as a small number of constant-depth planes rather than a continuous height field; (3) the boundary-coordinate stitching protocol that produces watertight output without ICP or optimisation; (4) a comparison with an alternative cloth-grid reconstruction that captures smooth surfaces at the cost of vertex count; (5) empirical compression results on synthetic primitives (352 K → 454 vertices on the sphere benchmark) with topology preservation verified under round-trip marching-cubes voxelisation.
The pipeline accepts six PNG depth maps over a normalised unit cube. Each is a single-channel grayscale image at any resolution (typical: 512 × 512). Pixel intensity encodes near-surface depth: 255 = nearest to the camera, 0 = farthest. Each image is accompanied by a metadata pair (t_min, t_range) — two floats per face, 12 floats total per shape — that allows denormalising the pixel value into a world t coordinate via t = t_min + (1 − I/255) · t_range.
The alpha channel (or, equivalently, a zero-intensity convention) distinguishes no-geometry pixels from shallow-geometry pixels. Black pixels with α = 0 indicate that the orthographic ray missed the shape entirely; they are holes in the silhouette. Grey pixels at any intensity above zero indicate geometry at the corresponding depth. This distinction matters: without it, a thin shallow geometric feature (a wire, an antenna) could be indistinguishable from a region the ray missed.
Each face has a local 2-D coordinate frame (right, up) indexing pixels and a world-space normal pointing into the cube interior. The frames are: +X right=+Y, up=+Z, normal=−X; −X right=−Y, up=+Z, normal=+X; and analogously for ±Y and ±Z faces. Boundary vertices on a face are mapped to world space via p_world = origin + u · right + v · up + (1 − depth_pixel) · normal.
A binary foreground mask is constructed per face: mask = (depth_map > 0) when alpha is implicit via intensity, or mask = (alpha > 0) when alpha is explicit. The mask is the input to all subsequent stages and demarcates which pixels carry geometric information.
Within the foreground region, depth values are clustered into k piecewise-constant layers via 1-D k-means. The number of layers is typically small (k = 3 to 5) and reflects the architectural intuition that a facade is locally piecewise-planar — there is a background wall plane, recessed window planes, protruding column planes. The cluster count is chosen per face based on a depth-histogram peak count, falling back to k = 3 when peaks are ambiguous. Each cluster receives a single representative depth z_i = mean(D[mask_i]).
Anti-aliasing along cluster boundaries — pixels whose depth value sits midway between two clusters because of marching-pixel discretisation — is absorbed into whichever cluster has the closer mean. This is the right disambiguation: in the underlying continuous shape, the pixel is on the boundary between two depth layers, and assigning it to the closer layer minimises overall depth error while keeping each cluster's polygon simple.
Each cluster mask is reduced to a list of polygons-with-holes via skimage.measure.find_contours applied at level 0.5 to the binary cluster mask (or 0.01 to the continuous depth field within the cluster — both work). The contour points are subpixel-accurate via linear interpolation along grid edges. Outer contours bound the cluster region; inner contours bound the holes inside it (where a smaller cluster sits inside a larger one, or where the foreground genuinely has a hole).
Each polygon's outer ring and inner rings are independently simplified via the Ramer–Douglas–Peucker algorithm with ε = 0.5 pixels in image space. RDP recursively removes vertices that lie within ε of the chord between two retained vertices; the result is a polygon with vertices only at structural corners (points where the polyline turns sharply enough that no chord can approximate the local geometry within ε). For a 512² source the typical reduction is one to two orders of magnitude: from ~1000 marching-squares vertices per ring to ~10–100 corners.
Each simplified polygon-with-holes is triangulated via the earcut algorithm. earcut is robust to arbitrary simple polygons (no self-intersections), handles multiple inner holes via the bridge-edge construction, and runs in near-linear time on the near-convex inputs typical of architectural geometry. Output: a triangle index buffer over the polygon's vertex list. Each per-face mesh consists of the union of per-cluster triangulated polygons; faces are still 2-D at this stage, awaiting lift into 3-D coordinates.
Per-face meshes are lifted into world coordinates via the face's (origin, right, up, normal) basis. Vertices at the per-face silhouette boundary, by construction of the orthographic projection, are identical in world coordinates to the corresponding vertices on the adjacent faces' silhouette boundaries. The stitch is therefore a coordinate-equality merge with ε = 1×10⁻⁴ tolerance — no ICP, no optimisation. Two vertices from different face meshes whose 3-D positions are within ε are unified into a single shared vertex and the face indices are updated.
Residual gaps appear when a depth cluster's boundary stops short of the silhouette — e.g. a recessed window cluster on the +Z facade has an inner contour that doesn't reach the +Z face's outer silhouette. The fix is side-wall fill: triangulate a strip of side walls connecting the cluster's boundary to the silhouette, perpendicular to the face plane. Physically, this is the wall of the recess (the side of a window niche). Algorithmically, it's a one-strip earcut on the connecting rectangle.
Each output mesh is validated for three properties. Two-face incidence: every edge in the mesh is shared by exactly two faces. Topology preservation: the Euler characteristic χ = V − E + F matches the expected value for the input shape's genus (χ = 2 for genus-0 shapes, χ = 0 for torus). Round-trip Hausdorff: the output mesh is voxelised at 256³ resolution and re-meshed via marching cubes; the resulting re-mesh is compared to the original output via two-way Hausdorff distance, which must be under 1.5 % of the shape's bounding diagonal.
| Shape | Input px | Vertices | Triangles | χ | Hausdorff |
|---|---|---|---|---|---|
| Sphere | 352 K | 454 | 332 | 2 | 1.12 % |
| Cube | 1.5 M | 8 | 12 | 2 | 0.00 % |
| Cylinder | 280 K | 120 | 98 | 2 | 1.31 % |
| Torus | 190 K | 340 | 280 | 0 | 1.48 % |
| L-shape | 260 K | 24 | 16 | 2 | 0.04 % |
An alternative pipeline shares the foreground extraction and depth clustering stages but replaces stages 3–6 with a cloth-grid formulation. For each of the six views, a subdivided grid mesh is cast facing that direction. Each grid vertex is displaced along the view direction by the depth-map value at that pixel: p_displaced = p_grid + depth(grid_uv) · normal. Boundary vertices of the grid are snapped onto the per-face silhouette contour extracted in stage 3. The six displaced grids are then merged into a single mesh by coordinate-equality on shared boundary vertices.
The trade-off is dimensional. A 16 × 16 grid per face produces 6 × 256 = 1536 vertices before shared-boundary merging, compared to the ~454 vertices of the polygon pipeline on the same input. The cloth result captures smooth curvature (the sphere becomes a smooth approximation, not a 332-triangle polyhedron), but the resulting vertices are at arbitrary grid positions rather than at structural corners — less editable, less compact. The choice between the two pipelines is application-driven: minimal polygon for downstream production tooling, cloth grid for ML training data where smoothness matters more than vertex count.
Three systematic failure modes have been observed during synthetic validation. (i) Open boundaries: shapes with overhangs or floating disconnected elements produce silhouette boundaries that don't have a counterpart on the adjacent face, breaking the coordinate-equality merge. Mitigation: a one-pass gap fill that triangulates between near-but-non-equal boundary vertices when the gap is under 2× the median edge length.
(ii) Cluster fragmentation: high depth-noise can split a single architectural plane into two clusters, producing duplicate parallel surfaces in the output. Mitigation: a post-clustering merge step that combines clusters whose mean depth differs by less than 1 % of the depth range.
(iii) Empty silhouettes: views in which the shape is fully off-frame produce empty contours and break the six-face stitch (one face contributes nothing). Mitigation: a bounding-box-derived placeholder triangulation for empty faces, which over-estimates that face's geometry but preserves watertightness.
The pipeline composes well-established computational geometry primitives — marching squares (Lorensen & Cline 1987 in 3-D; the 2-D analogue used here), Ramer–Douglas–Peucker simplification (1972, 1973), earcut triangulation (Mapbox 2016) — in a configuration specific to the six-view orthographic input. The configuration itself is the contribution; the individual primitives are textbook material.
Closest related is multi-view stereo (Schönberger & Frahm 2016, COLMAP) which produces dense point clouds from unstructured photographs. The six-plane formulation differs in two structural ways: input is six orthographic views from axis-aligned directions (not perspective views from arbitrary positions), and output is a clean low-vertex-count mesh (not a dense point cloud). The trade-off is generality vs editability: MVS handles arbitrary capture; the six-plane pipeline requires axis-aligned orthographic input but produces production-ready geometry.
The pipeline is the algorithmic core of the Six-Plane Building Elevation Reconstruction system, where it accepts orthographic elevations synthesised from a single street-view photograph. The pipeline is also used in the Sphere Depth Maps validation harness as the inverse-rendering check for the upstream view-rendering implementation.
The six-plane orthographic mesh reconstruction pipeline produces watertight triangulated meshes from six axis-aligned depth maps with vertex counts two to three orders of magnitude smaller than naive dense back-projection. Each stage is a classical computational-geometry primitive; the contribution is the configuration that respects shape structure (regions, clusters, corners) rather than the sampling grid (pixels). The pipeline serves as the algorithmic foundation for the architectural reconstruction work described in the Building Elevation companion paper and validates against the synthetic ground truth produced by the Sphere Depth Maps harness.
skimage.measure.find_contours — Marching Squares iso-contour extraction with subpixel linear interpolation.