From six axis-aligned orthographic depth maps to a single watertight 3D mesh — the inverse of the six-plane rendering. Two algorithmic approaches are explored: a contour-based minimal-polygon path that compresses 352 K dense pixel samples down to 454 vertices and 332 triangles, and a cloth-grid path that produces a denser, smoother mesh at the cost of vertex count.
This work began with a different goal: build an interactive 3D map of the real world from street-view imagery — a real-estate-grade metaverse where you can walk into a building, edit its facade, change its windows, and see the geometry update live. The three competing constraints that defined the direction were speed (real-time interaction, not minute-long rebuilds), accuracy (geometry sharp enough to read as architecture), and editability (semantic control over individual building parts, not opaque diffusion output).
The trade-off question was whether to compete with frame-prediction systems like Google's Genie — generate the world view-by-view from a video model — or to commit to explicit 3-D geometry that can be loaded into Houdini and USD. The latter wins on editability and integration with the existing production pipeline, but it requires a fast reconstruction frontend that can keep up with a user's iteration speed. Without that, the editability is irrelevant — nobody iterates on a system that takes 90 seconds to rebuild after every parameter tweak.
The six-plane formulation here is the chosen frontend. It accepts a minimal representation (six orthographic depth maps) that is cheap to generate, fast to invert, and naturally connects to a future neural component: once a depth-prediction model is trained, the six input maps come from the model rather than from the user. The pipeline doesn't change. The user moves from "drag a slider, see the mesh update in 50 ms" to "drop a street-view photo, see the mesh update in 200 ms once the model has run". The reconstruction backend was deliberately built first so that the neural frontend has something to plug into without architectural rework.
Prior work (viewer_minimal.py) established that 9 image planes could
drive an interactive 3-D reconstruction on iMac-class hardware. The
current pipeline reduces the input from 9 planes to 6 axis-aligned
orthographic depth maps and shifts the reconstruction from triplane
volumetric to per-cluster polygon triangulation. The shift trades smooth
volumetric output for production-clean polygonal output and gives back
~2× speed at the reconstruction step.
Six axis-aligned orthographic depth maps over a unit cube — labelled
xy_pos, xy_neg, xz_pos,
xz_neg, yz_pos, yz_neg — encode the
near-surface depth of the underlying shape from each cube face. White =
near, black = far. The reconstruction question: how do you invert this
representation back into a single watertight triangle mesh that an editor
can consume?
The naive approach — back-project every non-empty pixel into 3D and build a
point cloud — produces around 352,000 dense points for a
512² input per face. Triangulating that point cloud directly
gives a mesh with mid-five-figure vertex counts that contains massive
redundancy: every flat region of the shape produces thousands of
co-planar samples that a single quad could represent. The work here is
finding a representation that respects the underlying surface structure
rather than the pixel grid.
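To make the naive baseline concrete, the back-projection can be sketched in a few lines. The `backproject_face` helper and its `(origin, right, up, normal)` basis convention are illustrative assumptions, not the pipeline's actual code:

```python
import numpy as np

def backproject_face(depth, alpha, origin, right, up, normal, size=1.0):
    """Back-project every foreground pixel of one orthographic depth map
    into 3-D points. The (origin, right, up, normal) basis convention is
    an assumption for illustration, not the pipeline's exact frame."""
    h, w = depth.shape
    ys, xs = np.nonzero(alpha > 0)              # foreground pixels only
    u = (xs + 0.5) / w * size                   # pixel centre -> face u
    v = (ys + 0.5) / h * size                   # pixel centre -> face v
    d = depth[ys, xs] * size                    # offset along the normal
    return (origin
            + u[:, None] * right
            + v[:, None] * up
            + d[:, None] * normal)              # (N, 3) point cloud
```

A fully covered 512² face alone yields 512 × 512 = 262 144 points, which is how six faces approach the ~352 K figure once empty pixels are dropped.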
A secondary constraint: the result must be watertight. Six independently triangulated face meshes that don't share boundary vertices leave gaps at the seams where the cube faces meet. Watertightness is what makes the output usable in a downstream pipeline — Houdini procedural operations, USD composition, simulation — that expects a closed surface.
Trace regions, don't sample pixels.
The dense-point-cloud approach treats the depth map as a height field with one mesh vertex per pixel. That representation respects the pixel grid, not the surface. The region-based approach traces the boundary of each constant-depth cluster as a polygon and triangulates per polygon — what would be a 352 K-point dense reconstruction collapses to under 500 vertices for the same shape, with no loss of geometric fidelity on flat regions and visible quality gain on diagonal edges where the dense approach produces zigzag staircases.
For each of the six depth maps, the first pass identifies all
foreground pixels — anything with an alpha value above zero. Black
pixels with α=0 represent holes — places where
the orthographic ray missed the geometry entirely. Grey pixels at any
intensity represent geometry at that depth level.
Subpixel marching squares (skimage.measure.find_contours at
level 0.5 for binary masks, 0.01 for the float
depth field) produces ordered contour polylines. Outer contours bound the
foreground region; inner contours bound the holes inside it. The polygon-
with-holes representation is the input to subsequent stages.
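A minimal sketch of this stage, using `skimage.measure.find_contours` as the text describes. The `trace_region` helper and its largest-loop-is-outer classification are simplifying assumptions that hold for a single connected region:

```python
import numpy as np
from skimage import measure

def trace_region(alpha, level=0.5):
    """Marching-squares contour extraction for one foreground mask.
    Classifying the largest loop as the outer boundary is a
    simplification that assumes a single connected region."""
    contours = measure.find_contours(alpha.astype(float), level)

    def area(c):                      # shoelace formula, absolute value
        x, y = c[:, 1], c[:, 0]
        return abs(0.5 * np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y))

    contours = sorted(contours, key=area, reverse=True)
    return contours[0], contours[1:]  # outer polyline, hole polylines
```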
Within a foreground region, depth is not necessarily uniform. A building facade has a wall plane at the background depth, recessed window planes a few centimetres deeper, protruding column planes a few centimetres shallower. Treating the entire region as a single planar polygon would collapse those distinct architectural layers into one flat surface.
The clustering step runs 1-D k-means on the depth values inside the region,
typically with k=3 to 5. Clusters are
initialised at depth quantiles and assigned a single representative depth
z_i = mean(D[mask_i]). Each cluster mask becomes its own
polygon-with-holes at depth z_i. Anti-aliasing pixels
along cluster boundaries — the pixels where two depth levels blend
into a gradient — are absorbed into whichever cluster has the closer mean,
avoiding the staircase artefact a naive threshold would produce.
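The clustering step can be sketched as plain 1-D k-means with quantile initialisation; `cluster_depths` is a hypothetical helper, not the pipeline's code. Note that the nearest-mean assignment in the loop is exactly the mechanism that absorbs anti-aliased boundary pixels into the closer cluster:

```python
import numpy as np

def cluster_depths(depths, k=3, iters=20):
    """1-D k-means on per-pixel depth values, initialised at quantiles.
    Returns labels and one representative depth per cluster (its mean),
    matching the z_i = mean(D[mask_i]) convention in the text."""
    centres = np.quantile(depths, np.linspace(0, 1, k))
    for _ in range(iters):
        # nearest-mean assignment; blended boundary pixels fall into
        # whichever cluster has the closer centre
        labels = np.argmin(np.abs(depths[:, None] - centres[None, :]), axis=1)
        for i in range(k):
            if np.any(labels == i):
                centres[i] = depths[labels == i].mean()
    return labels, centres
```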
The marching squares output for a 512² depth map produces roughly
1,000 contour points per cluster boundary — overkill for what
is geometrically just a polygon with a small number of structural corners.
Ramer–Douglas–Peucker with ε = 0.5 pixels in image space removes
redundant collinear vertices: it keeps the first and last point of any
nearly straight run and drops the intermediate ones.
The ε value is a deliberate trade-off. Too small (e.g.
0.1 px) preserves zigzag artefacts from the underlying pixel
grid. Too large (e.g. 2 px) rounds off real architectural
corners. ε = 0.5 px hits the sweet spot for building facades:
aggressive enough to flatten zigzag, conservative enough to preserve corners.
| Source contour | Point count | After RDP ε=0.5 | Reduction |
|---|---|---|---|
| Sphere silhouette (great circle approx) | ~1025 | ~64 | 16× |
| Building facade (with 12 windows) | ~3200 | ~120 | 27× |
| L-shape silhouette | ~840 | ~12 | 70× |
| Cube face (square) | ~2050 | 4 | 500× |
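The reductions in the table can be reproduced with skimage's Douglas–Peucker implementation, `skimage.measure.approximate_polygon`. The dense square below is a stand-in for the cube-face row:

```python
import numpy as np
from skimage import measure

# Dense square contour, one point per pixel step, tracing a 100-px
# cube-face silhouette corner to corner.
n = 100
dense = np.array([(0, i) for i in range(n)] +
                 [(i, n - 1) for i in range(n)] +
                 [(n - 1, n - 1 - i) for i in range(n)] +
                 [(n - 1 - i, 0) for i in range(n)] +
                 [(0, 0)], dtype=float)

# Douglas-Peucker at eps = 0.5 px keeps only the structural corners.
simplified = measure.approximate_polygon(dense, tolerance=0.5)
```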
Each simplified per-cluster polygon (potentially with multiple inner holes) is triangulated using the earcut algorithm — the standard 2-D polygon-with-holes triangulator used in GIS pipelines. earcut is robust to arbitrary simple polygons, handles holes correctly, and runs in near-linear time for the near-convex polygons typical of architectural facades. Output: a flat list of triangle indices over the polygon's vertices.
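To make the triangulation step concrete, here is a minimal ear-clipping loop for a hole-free CCW polygon. The pipeline itself uses earcut, which additionally handles inner holes, so treat this as a sketch of the core idea only:

```python
def triangulate_simple(poly):
    """Minimal ear clipping for a simple CCW polygon with no holes.
    The pipeline uses earcut, which also handles inner holes; this
    sketch shows only the core ear-removal loop."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def inside(p, a, b, c):           # p within CCW triangle abc (inclusive)
        return (cross(a, b, p) >= 0 and cross(b, c, p) >= 0
                and cross(c, a, p) >= 0)

    idx, tris = list(range(len(poly))), []
    while len(idx) > 3:
        for k in range(len(idx)):
            i, j, l = idx[k - 1], idx[k], idx[(k + 1) % len(idx)]
            a, b, c = poly[i], poly[j], poly[l]
            if cross(a, b, c) <= 0:   # reflex corner: not an ear
                continue
            if any(inside(poly[m], a, b, c)
                   for m in idx if m not in (i, j, l)):
                continue              # another vertex blocks this ear
            tris.append((i, j, l))
            idx.pop(k)                # clip the ear and keep going
            break
        else:
            break                     # degenerate input: give up
    if len(idx) == 3:
        tris.append(tuple(idx))
    return tris
```

A simple polygon with n vertices always yields n − 2 triangles, which the assertions below exercise on a square and an L-shape.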
The resulting per-face mesh has a small number of vertices (typically
~70–150) and a corresponding small number of triangles
(~50–100). All vertices lie on the face plane at their
cluster's depth — they are 2-D triangles waiting to be lifted into 3-D
by the next stage.
The six per-face triangulated meshes are lifted into the same global
coordinate frame using each face's (origin, right, up, normal)
basis. Vertex positions on a face's silhouette boundary correspond, by
construction of the orthographic projection, to the same 3-D positions on
the silhouette boundary of the adjacent face. Watertight stitching
is therefore a coordinate-equality merge — no ICP, no
energy minimisation, no optimisation. Two vertices from different face
meshes whose 3-D positions are within ε = 1e-4 are merged
into a single shared vertex; the face indices are updated to reflect the
merge.
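The coordinate-equality merge can be sketched as an eps-grid snap; `weld` is a hypothetical helper. Grid snapping can in principle miss a pair straddling a cell boundary, which a tolerance search would catch, but it conveys the idea in a few lines:

```python
import numpy as np

def weld(verts, faces, eps=1e-4):
    """Coordinate-equality merge: snap positions to an eps grid and
    collapse identical keys. Sketch only -- grid snapping can miss a
    pair straddling a cell boundary, unlike a true tolerance search."""
    keys = np.round(verts / eps).astype(np.int64)
    _, first, inverse = np.unique(keys, axis=0,
                                  return_index=True, return_inverse=True)
    inverse = inverse.reshape(-1)     # guard against NumPy shape quirks
    # keep one representative per key; remap faces onto the survivors
    return verts[first], inverse[np.asarray(faces)]
```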
A residual gap can still appear when one face has a depth cluster that doesn't fully reach the silhouette — e.g. the recessed window-plane on the +Z facade has a smaller silhouette than the +Z face itself. The fix is side-wall fill: triangulate a strip of side walls connecting the recessed cluster's boundary to the corresponding boundary on the face's outer silhouette, perpendicular to the face plane. This is what physically creates the recess: the side walls of the window niche.
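A sketch of the side-wall fill, assuming the two boundary loops have equal length and matching vertex order (true by construction here, since the far loop is the near loop with its depth coordinate replaced); `fill_side_walls` is an illustrative name:

```python
import numpy as np

def fill_side_walls(loop_near, loop_far):
    """Triangulate the side-wall strip between two matched boundary
    loops: the recessed cluster's rim and its projection on the face
    plane. Assumes equal length and matching vertex order."""
    n = len(loop_near)
    verts = np.vstack([loop_near, loop_far])
    faces = []
    for i in range(n):
        j = (i + 1) % n
        faces.append([i, j, n + i])       # one wall quad ...
        faces.append([j, n + j, n + i])   # ... split into two triangles
    return verts, np.array(faces)
```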
The minimal-polygon pipeline above produces the sparsest possible mesh for a given depth representation — ideal when downstream consumers want editable, low-poly output. A complementary approach explored in the same work is the cloth-grid reconstruction: for each of the six views, cast a subdivided grid (think of a flat mesh patch facing that direction), displace each grid vertex along the view direction using the depth map value at that pixel, snap boundary vertices to the extracted silhouettes, and merge the six displaced grid patches into a watertight mesh.
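The cloth-grid path can be sketched as a displaced grid patch. `cloth_grid` below is a minimal illustration that omits silhouette snapping and the six-patch merge:

```python
import numpy as np

def cloth_grid(depth, origin, right, up, normal, res=32):
    """Displace a res x res grid patch along the view normal by the
    sampled depth value -- the cloth-grid path in sketch form
    (silhouette snapping and six-patch merging omitted)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.linspace(0, 1, res), np.linspace(0, 1, res))
    # nearest-neighbour sample of the depth map at each grid vertex
    d = depth[np.minimum((v * (h - 1)).astype(int), h - 1),
              np.minimum((u * (w - 1)).astype(int), w - 1)]
    verts = (origin + u[..., None] * right + v[..., None] * up
             + d[..., None] * normal).reshape(-1, 3)
    # grid connectivity: two triangles per cell
    idx = np.arange(res * res).reshape(res, res)
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    c, e = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([a, b, c], 1),
                            np.stack([b, e, c], 1)])
    return verts, faces
```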
Both approaches share stages 1–2 (contour tracing, depth clustering) and differ only in the meshing strategy. The polygon path is the default for downstream consumers that expect production-clean geometry; the cloth path is the default for ML training data where smoothness matters more than editability.
| Shape | Input pixels | Vertices (polygon path) | Triangles | Watertight |
|---|---|---|---|---|
| Sphere (r=0.5) | ~352 K | ~454 | ~332 | ✓ — round-trip Hausdorff < 1.5% |
| Cube (unit) | ~1.5 M (all faces white-flat) | 8 | 12 | ✓ — exact reconstruction |
| Cylinder (h=1, r=0.5) | ~280 K | ~120 | ~98 | ✓ — cap rims correctly arched via depth lift |
| Torus (R=0.5, r=0.2) | ~190 K | ~340 | ~280 | ✓ — inner hole correctly preserved |
| L-shape (composite) | ~260 K | ~24 | ~16 | ✓ — concave corner recovered cleanly |
The watertight verification runs per-edge two-face incidence checks: every mesh edge must be shared by exactly two faces. All test shapes pass. Round-trip verification converts the output back to a 256³ occupancy grid, runs marching cubes, and checks Hausdorff distance against the original output mesh — under 1.5 % on all primitives.
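The per-edge incidence check is a few lines; `is_watertight` below is a minimal version of the described verification:

```python
from collections import Counter

def is_watertight(faces):
    """Every undirected edge must be shared by exactly two faces --
    the per-edge two-face incidence check described above."""
    edges = Counter()
    for a, b, c in faces:
        for e in ((a, b), (b, c), (c, a)):
            edges[tuple(sorted(e))] += 1   # undirected edge key
    return all(n == 2 for n in edges.values())
```

A closed tetrahedron passes; a lone triangle, whose every edge has incidence one, fails.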
Click any input tile to toggle a recess on that face. The contour pane (middle) and the reconstructed mesh (right) update in real time as you edit. Each toggle changes the depth-map input, which the pipeline reconstructs end-to-end at interactive speed — exactly the editing loop the project was built for. Use the preset buttons below to jump between common configurations, or drag the mesh to rotate.
arXiv-format write-up · 6-Plane Orthographic Mesh Reconstruction · per-stage analysis, two-pipeline comparison, watertightness proof