From six axis-aligned orthographic depth maps to a single watertight 3D mesh — the inverse of the six-plane rendering. Two algorithmic approaches are explored: a contour-based minimal-polygon path that compresses 352 K dense pixel samples down to 454 vertices and 332 triangles, and a cloth-grid path that produces a denser, smoother mesh at the cost of vertex count.
This work began with a different goal: build an interactive 3D map of the real world from street-view imagery — a real-estate-grade metaverse where you can walk into a building, edit its facade, change its windows, and see the geometry update live. The three competing constraints that defined the direction were speed (real-time interaction, not minute-long rebuilds), accuracy (geometry sharp enough to read as architecture), and editability (semantic control over individual building parts, not opaque diffusion output).
The trade-off question was whether to compete with frame-prediction systems like Google's Genie — generate the world view-by-view from a video model — or to commit to explicit 3-D geometry that can be loaded into Houdini and USD. The latter wins on editability and integration with the existing production pipeline, but it requires a fast reconstruction frontend that can keep up with a user's iteration speed. Without that, the editability is irrelevant — nobody iterates on a system that takes 90 seconds to rebuild after every parameter tweak.
The six-plane formulation here is the chosen frontend. It accepts a minimal representation (six orthographic depth maps) that is cheap to generate, fast to invert, and naturally connects to a future neural component: once a depth-prediction model is trained, the six input maps come from the model rather than from the user. The pipeline doesn't change. The user moves from "drag a slider, see the mesh update in 50 ms" to "drop a street-view photo, see the mesh update in 200 ms once the model has run". The reconstruction backend was deliberately built first so that the neural frontend has something to plug into without architectural rework.
Prior work (viewer_minimal.py) established that 9 image planes could
drive an interactive 3-D reconstruction on iMac-class hardware. The
current pipeline reduces the input from 9 planes to 6 axis-aligned
orthographic depth maps and shifts the reconstruction from triplane
volumetric to per-cluster polygon triangulation. The shift trades smooth
volumetric output for production-clean polygonal output and gives back
~2× speed at the reconstruction step.
Six axis-aligned orthographic depth maps over a unit cube — labelled
xy_pos, xy_neg, xz_pos,
xz_neg, yz_pos, yz_neg — encode the
near-surface depth of the underlying shape from each cube face. White =
near, black = far. The reconstruction question: how do you invert this
representation back into a single watertight triangle mesh that an editor
can consume?
The naive approach — back-project every non-empty pixel into 3D and build a
point cloud — produces around 352,000 dense points for a
512² input per face. Triangulating that point cloud directly
gives a mesh with mid-five-figure vertex counts that contains massive
redundancy: every flat region of the shape produces thousands of
co-planar samples that a single quad could represent. The work here is
finding a representation that respects the underlying surface structure
rather than the pixel grid.
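To make the naive baseline concrete, the back-projection can be sketched in a few lines. The `backproject_face` helper and its `(origin, right, up, normal)` basis convention are illustrative assumptions, not the pipeline's actual code:

```python
import numpy as np

def backproject_face(depth, alpha, origin, right, up, normal, size=1.0):
    """Back-project every foreground pixel of one orthographic depth map
    into 3-D points. The (origin, right, up, normal) basis convention is
    an assumption for illustration, not the pipeline's exact frame."""
    h, w = depth.shape
    ys, xs = np.nonzero(alpha > 0)              # foreground pixels only
    u = (xs + 0.5) / w * size                   # pixel centre -> face u
    v = (ys + 0.5) / h * size                   # pixel centre -> face v
    d = depth[ys, xs] * size                    # offset along the normal
    return (origin
            + u[:, None] * right
            + v[:, None] * up
            + d[:, None] * normal)              # (N, 3) point cloud
```

A fully covered 512² face alone yields 512 × 512 = 262 144 points, which is how six faces approach the ~352 K figure once empty pixels are dropped.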
A secondary constraint: the result must be watertight. Six independently triangulated face meshes that don't share boundary vertices leave gaps at the seams where the cube faces meet. Watertightness is what makes the output usable in a downstream pipeline — Houdini procedural operations, USD composition, simulation — that expects a closed surface.
Trace regions, don't sample pixels.
The dense-point-cloud approach treats the depth map as a height field with one mesh vertex per pixel. That representation respects the pixel grid, not the surface. The region-based approach traces the boundary of each constant-depth cluster as a polygon and triangulates per polygon — what would be a 352 K-point dense reconstruction collapses to under 500 vertices for the same shape, with no loss of geometric fidelity on flat regions and visible quality gain on diagonal edges where the dense approach produces zigzag staircases.
For each of the six depth maps, the first pass identifies all
foreground pixels — anything with an alpha value above zero. Black
pixels with α=0 represent holes — places where
the orthographic ray missed the geometry entirely. Grey pixels at any
intensity represent geometry at that depth level.
Subpixel marching squares (skimage.measure.find_contours at
level 0.5 for binary masks, 0.01 for the float
depth field) produces ordered contour polylines. Outer contours bound the
foreground region; inner contours bound the holes inside it. The polygon-
with-holes representation is the input to subsequent stages.
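A minimal sketch of this stage, using `skimage.measure.find_contours` as the text describes. The `trace_region` helper and its largest-loop-is-outer classification are simplifying assumptions that hold for a single connected region:

```python
import numpy as np
from skimage import measure

def trace_region(alpha, level=0.5):
    """Marching-squares contour extraction for one foreground mask.
    Classifying the largest loop as the outer boundary is a
    simplification that assumes a single connected region."""
    contours = measure.find_contours(alpha.astype(float), level)

    def area(c):                      # shoelace formula, absolute value
        x, y = c[:, 1], c[:, 0]
        return abs(0.5 * np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y))

    contours = sorted(contours, key=area, reverse=True)
    return contours[0], contours[1:]  # outer polyline, hole polylines
```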
Within a foreground region, depth is not necessarily uniform. A building facade has a wall plane at the background depth, recessed window planes a few centimetres deeper, protruding column planes a few centimetres shallower. Treating the entire region as a single planar polygon would collapse those distinct architectural layers into one flat surface.
The clustering step runs 1-D k-means on the depth values inside the region,
typically with k=3 to 5. Clusters are
initialised at depth quantiles and assigned a single representative depth
z_i = mean(D[mask_i]). Each cluster mask becomes its own
polygon-with-holes at depth z_i. Anti-aliasing pixels
along cluster boundaries — the pixels where two depth levels blend
into a gradient — are absorbed into whichever cluster has the closer mean,
avoiding the staircase artefact a naive threshold would produce.
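The clustering step can be sketched as plain 1-D k-means with quantile initialisation; `cluster_depths` is a hypothetical helper, not the pipeline's code. Note that the nearest-mean assignment in the loop is exactly the mechanism that absorbs anti-aliased boundary pixels into the closer cluster:

```python
import numpy as np

def cluster_depths(depths, k=3, iters=20):
    """1-D k-means on per-pixel depth values, initialised at quantiles.
    Returns labels and one representative depth per cluster (its mean),
    matching the z_i = mean(D[mask_i]) convention in the text."""
    centres = np.quantile(depths, np.linspace(0, 1, k))
    for _ in range(iters):
        # nearest-mean assignment; blended boundary pixels fall into
        # whichever cluster has the closer centre
        labels = np.argmin(np.abs(depths[:, None] - centres[None, :]), axis=1)
        for i in range(k):
            if np.any(labels == i):
                centres[i] = depths[labels == i].mean()
    return labels, centres
```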
The marching squares output for a 512² depth map produces roughly
1,000 contour points per cluster boundary — overkill for what
is geometrically just a polygon with a small number of structural corners.
Ramer–Douglas–Peucker with ε = 0.5 pixels in image space removes
redundant collinear vertices: it keeps the first and last point of any
nearly straight run and drops the intermediate ones.
The ε value is a deliberate trade-off. Too small (e.g.
0.1 px) preserves zigzag artefacts from the underlying pixel
grid. Too large (e.g. 2 px) rounds off real architectural
corners. ε = 0.5 px hits the sweet spot for building facades:
aggressive enough to flatten zigzag, conservative enough to preserve corners.
| Source contour | Point count | After RDP ε=0.5 | Reduction |
|---|---|---|---|
| Sphere silhouette (great circle approx) | ~1025 | ~64 | 16× |
| Building facade (with 12 windows) | ~3200 | ~120 | 27× |
| L-shape silhouette | ~840 | ~12 | 70× |
| Cube face (square) | ~2050 | 4 | 500× |
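The reductions in the table can be reproduced with skimage's Douglas–Peucker implementation, `skimage.measure.approximate_polygon`. The dense square below is a stand-in for the cube-face row:

```python
import numpy as np
from skimage import measure

# Dense square contour, one point per pixel step, tracing a 100-px
# cube-face silhouette corner to corner.
n = 100
dense = np.array([(0, i) for i in range(n)] +
                 [(i, n - 1) for i in range(n)] +
                 [(n - 1, n - 1 - i) for i in range(n)] +
                 [(n - 1 - i, 0) for i in range(n)] +
                 [(0, 0)], dtype=float)

# Douglas-Peucker at eps = 0.5 px keeps only the structural corners.
simplified = measure.approximate_polygon(dense, tolerance=0.5)
```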
Each simplified per-cluster polygon (potentially with multiple inner holes) is triangulated using the earcut algorithm — the standard 2-D polygon-with-holes triangulator used in GIS pipelines. earcut is robust to arbitrary simple polygons, handles holes correctly, and runs in near-linear time for the near-convex polygons typical of architectural facades. Output: a flat list of triangle indices over the polygon's vertices.
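To make the triangulation step concrete, here is a minimal ear-clipping loop for a hole-free CCW polygon. The pipeline itself uses earcut, which additionally handles inner holes, so treat this as a sketch of the core idea only:

```python
def triangulate_simple(poly):
    """Minimal ear clipping for a simple CCW polygon with no holes.
    The pipeline uses earcut, which also handles inner holes; this
    sketch shows only the core ear-removal loop."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def inside(p, a, b, c):           # p within CCW triangle abc (inclusive)
        return (cross(a, b, p) >= 0 and cross(b, c, p) >= 0
                and cross(c, a, p) >= 0)

    idx, tris = list(range(len(poly))), []
    while len(idx) > 3:
        for k in range(len(idx)):
            i, j, l = idx[k - 1], idx[k], idx[(k + 1) % len(idx)]
            a, b, c = poly[i], poly[j], poly[l]
            if cross(a, b, c) <= 0:   # reflex corner: not an ear
                continue
            if any(inside(poly[m], a, b, c)
                   for m in idx if m not in (i, j, l)):
                continue              # another vertex blocks this ear
            tris.append((i, j, l))
            idx.pop(k)                # clip the ear and keep going
            break
        else:
            break                     # degenerate input: give up
    if len(idx) == 3:
        tris.append(tuple(idx))
    return tris
```

A simple polygon with n vertices always yields n − 2 triangles, which the assertions below exercise on a square and an L-shape.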
The resulting per-face mesh has a small number of vertices (typically
~70–150) and a corresponding small number of triangles
(~50–100). All vertices lie on the face plane at their
cluster's depth — they are 2-D triangles waiting to be lifted into 3-D
by the next stage.
The six per-face triangulated meshes are lifted into the same global
coordinate frame using each face's (origin, right, up, normal)
basis. Vertex positions on a face's silhouette boundary correspond, by
construction of the orthographic projection, to the same 3-D positions on
the silhouette boundary of the adjacent face. Watertight stitching
is therefore a coordinate-equality merge — no ICP, no
energy minimisation, no optimisation. Two vertices from different face
meshes whose 3-D positions are within ε = 1e-4 are merged
into a single shared vertex; the face indices are updated to reflect the
merge.
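The coordinate-equality merge can be sketched as an eps-grid snap; `weld` is a hypothetical helper. Grid snapping can in principle miss a pair straddling a cell boundary, which a tolerance search would catch, but it conveys the idea in a few lines:

```python
import numpy as np

def weld(verts, faces, eps=1e-4):
    """Coordinate-equality merge: snap positions to an eps grid and
    collapse identical keys. Sketch only -- grid snapping can miss a
    pair straddling a cell boundary, unlike a true tolerance search."""
    keys = np.round(verts / eps).astype(np.int64)
    _, first, inverse = np.unique(keys, axis=0,
                                  return_index=True, return_inverse=True)
    inverse = inverse.reshape(-1)     # guard against NumPy shape quirks
    # keep one representative per key; remap faces onto the survivors
    return verts[first], inverse[np.asarray(faces)]
```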
A residual gap can still appear when one face has a depth cluster that doesn't fully reach the silhouette — e.g. the recessed window-plane on the +Z facade has a smaller silhouette than the +Z face itself. The fix is side-wall fill: triangulate a strip of side walls connecting the recessed cluster's boundary to the corresponding boundary on the face's outer silhouette, perpendicular to the face plane. This is what physically creates the recess: the side walls of the window niche.
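A sketch of the side-wall fill, assuming the two boundary loops have equal length and matching vertex order (true by construction here, since the far loop is the near loop with its depth coordinate replaced); `fill_side_walls` is an illustrative name:

```python
import numpy as np

def fill_side_walls(loop_near, loop_far):
    """Triangulate the side-wall strip between two matched boundary
    loops: the recessed cluster's rim and its projection on the face
    plane. Assumes equal length and matching vertex order."""
    n = len(loop_near)
    verts = np.vstack([loop_near, loop_far])
    faces = []
    for i in range(n):
        j = (i + 1) % n
        faces.append([i, j, n + i])       # one wall quad ...
        faces.append([j, n + j, n + i])   # ... split into two triangles
    return verts, np.array(faces)
```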
The minimal-polygon pipeline above produces the sparsest possible mesh for a given depth representation — ideal when downstream consumers want editable, low-poly output. A complementary approach explored in the same work is the cloth-grid reconstruction: for each of the six views, cast a subdivided grid (think of a flat mesh patch facing that direction), displace each grid vertex along the view direction using the depth map value at that pixel, snap boundary vertices to the extracted silhouettes, and merge the six displaced grid patches into a watertight mesh.
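The cloth-grid path can be sketched as a displaced grid patch. `cloth_grid` below is a minimal illustration that omits silhouette snapping and the six-patch merge:

```python
import numpy as np

def cloth_grid(depth, origin, right, up, normal, res=32):
    """Displace a res x res grid patch along the view normal by the
    sampled depth value -- the cloth-grid path in sketch form
    (silhouette snapping and six-patch merging omitted)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.linspace(0, 1, res), np.linspace(0, 1, res))
    # nearest-neighbour sample of the depth map at each grid vertex
    d = depth[np.minimum((v * (h - 1)).astype(int), h - 1),
              np.minimum((u * (w - 1)).astype(int), w - 1)]
    verts = (origin + u[..., None] * right + v[..., None] * up
             + d[..., None] * normal).reshape(-1, 3)
    # grid connectivity: two triangles per cell
    idx = np.arange(res * res).reshape(res, res)
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    c, e = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([a, b, c], 1),
                            np.stack([b, e, c], 1)])
    return verts, faces
```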
Both approaches share stages 1–2 (contour tracing, depth clustering) and differ only in the meshing strategy. The polygon path is the default for downstream consumers that expect production-clean geometry; the cloth path is the default for ML training data where smoothness matters more than editability.
| Shape | Input pixels | Vertices (polygon path) | Triangles | Watertight |
|---|---|---|---|---|
| Sphere (r=0.5) | ~352 K | ~454 | ~332 | ✓ — round-trip Hausdorff < 1.5% |
| Cube (unit) | ~1.5 M (all faces white-flat) | 8 | 12 | ✓ — exact reconstruction |
| Cylinder (h=1, r=0.5) | ~280 K | ~120 | ~98 | ✓ — cap rims correctly arched via depth lift |
| Torus (R=0.5, r=0.2) | ~190 K | ~340 | ~280 | ✓ — inner hole correctly preserved |
| L-shape (composite) | ~260 K | ~24 | ~16 | ✓ — concave corner recovered cleanly |
The watertight verification runs per-edge two-face incidence checks: every mesh edge must be shared by exactly two faces. All test shapes pass. Round-trip verification converts the output back to a 256³ occupancy grid, runs marching cubes, and checks Hausdorff distance against the original output mesh — under 1.5 % on all primitives.
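The per-edge incidence check is a few lines; `is_watertight` below is a minimal version of the described verification:

```python
from collections import Counter

def is_watertight(faces):
    """Every undirected edge must be shared by exactly two faces --
    the per-edge two-face incidence check described above."""
    edges = Counter()
    for a, b, c in faces:
        for e in ((a, b), (b, c), (c, a)):
            edges[tuple(sorted(e))] += 1   # undirected edge key
    return all(n == 2 for n in edges.values())
```

A closed tetrahedron passes; a lone triangle, whose every edge has incidence one, fails.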
Click any input tile to toggle a recess on that face. The contour pane (middle) and the reconstructed mesh (right) update in real time as you edit. Each toggle changes the depth-map input, which the pipeline reconstructs end-to-end at interactive speed — exactly the editing loop the project was built for. Use the preset buttons below to jump between common configurations, or drag the mesh to rotate.
arXiv-format write-up · 6-Plane Orthographic Mesh Reconstruction · per-stage analysis, two-pipeline comparison, watertightness proof