Aditya Jain / Apple Maps · 3D Reconstruction
Topic 35 · Feb 2026 · 3D Reconstruction · Mesh Processing

6-Plane Orthographic
Mesh Reconstruction.

From six axis-aligned orthographic depth maps to a single watertight 3D mesh — the inverse of the six-plane rendering. Two algorithmic approaches explored: a contour-based minimal-polygon path that compresses 352 K dense pixel samples down to 454 vertices and 332 triangles, and a cloth-grid path that produces a denser, smoother mesh at the cost of vertex count.

00 — Motivation

A real-time mesh viewer that responds the moment the input changes.

This work began with a different goal: build an interactive 3D map of the real world from street-view imagery — a real-estate-grade metaverse where you can walk into a building, edit its facade, change its windows, and see the geometry update live. The three competing constraints that defined the direction were speed (real-time interaction, not minute-long rebuilds), accuracy (geometry sharp enough to read as architecture), and editability (semantic control over individual building parts, not opaque diffusion output).

The trade-off question was whether to compete with frame-prediction systems like Google's Genie — generate the world view-by-view from a video model — or to commit to explicit 3-D geometry that can be loaded into Houdini and USD. The latter wins on editability and integration with the existing production pipeline, but it requires a fast reconstruction frontend that can keep up with a user's iteration speed. Without that, the editability is irrelevant — nobody iterates on a system that takes 90 seconds to rebuild after every parameter tweak.

The six-plane formulation here is the chosen frontend. It accepts a minimal representation (six orthographic depth maps) that is cheap to generate, fast to invert, and naturally connects to a future neural component: once a depth-prediction model is trained, the six input maps come from the model rather than from the user. The pipeline doesn't change. The user moves from "drag a slider, see the mesh update in 50 ms" to "drop a street-view photo, see the mesh update in 200 ms once the model has run". The reconstruction backend was deliberately built first so that the neural frontend has something to plug into without architectural rework.

Predecessor work it extends
The real-time triplane viewer (viewer_minimal.py, prior work) established that 9 image planes could drive an interactive 3-D reconstruction on iMac-class hardware. The current pipeline reduces the input from 9 planes to 6 axis-aligned orthographic depth maps and shifts the reconstruction from triplane volumetric extraction to per-cluster polygon triangulation. The shift trades smooth volumetric output for production-clean polygonal output and gives back ~2× speed at the reconstruction step.
01 — Problem Statement

From six depth images back to one 3D mesh.

Six axis-aligned orthographic depth maps over a unit cube — labelled xy_pos, xy_neg, xz_pos, xz_neg, yz_pos, yz_neg — encode the near-surface depth of the underlying shape from each cube face. White = near, black = far. The reconstruction question: how do you invert this representation back into a single watertight triangle mesh that an editor can consume?

The naive approach — back-project every non-empty pixel into 3D and build a point cloud — produces around 352,000 dense points for a 512² input per face. Triangulating that point cloud directly gives a mesh with mid-five-figure vertex counts that contains massive redundancy: every flat region of the shape produces thousands of co-planar samples that a single quad could represent. The work here is finding a representation that respects the underlying surface structure rather than the pixel grid.
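For reference, the dense baseline can be sketched in a few lines. The face basis (`origin`, `right`, `up`, `normal`) and the white-near depth decoding used below are illustrative assumptions, not the pipeline's exact conventions:

```python
import numpy as np

def backproject_face(depth, origin, right, up, normal):
    """Back-project one orthographic depth map into a 3-D point set.

    depth  : (H, W) float array in [0, 1]; 0 = empty pixel
    origin : 3-vector at one corner of the cube face
    right, up : unit vectors spanning the face plane
    normal : unit vector pointing into the cube
    """
    h, w = depth.shape
    vs, us = np.nonzero(depth)              # foreground pixels only
    d = depth[vs, us]
    u = (us + 0.5) / w                      # pixel centre -> [0, 1] face coords
    v = (vs + 0.5) / h
    disp = 1.0 - d                          # white = near: small displacement
                                            # off the face plane (assumed encoding)
    return (origin
            + np.outer(u, right)
            + np.outer(v, up)
            + np.outer(disp, normal))

# Dense baseline: roughly one point per foreground pixel of a 512² map
depth = np.random.rand(512, 512)
pts = backproject_face(depth,
                       origin=np.array([0.0, 0.0, 1.0]),
                       right=np.array([1.0, 0.0, 0.0]),
                       up=np.array([0.0, 1.0, 0.0]),
                       normal=np.array([0.0, 0.0, -1.0]))
```

One point per non-empty pixel is exactly the redundancy the region-based path removes.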

A secondary constraint: the result must be watertight. Six independently triangulated face meshes that don't share boundary vertices leave gaps at the seams where the cube faces meet. Watertightness is what makes the output usable in a downstream pipeline — Houdini procedural operations, USD composition, simulation — that expects a closed surface.

02 — Pipeline Overview

Six stages from depth to watertight mesh.

[Pipeline diagram] INPUT — 6 depth maps (xy_pos · xy_neg · xz± · yz±) → CONTOUR TRACING (find_contours, sub-pixel; α=0 → hole) → DEPTH CLUSTERING (k-means, k=3–5, anti-alias safe) → RDP SIMPLIFY (ε = 0.5 px, corner-preserving) → EARCUT TRIANGULATION (polygon-with-holes, per cluster, per face) → WATERTIGHT STITCH (boundary-coord match + side-wall fill if gap) → OUTPUT MESH — 352 K sample pixels → ~454 vertices · ~332 triangles (sphere benchmark)
Figure 1 — Six processing stages from raw depth pixels to a watertight triangulated mesh. Stages 1–2 isolate per-cluster polygons-with-holes per face; stages 3–4 simplify each polygon down to its essential corners and triangulate; stage 5 enforces watertightness by matching boundary coordinates across adjacent faces and filling any residual gaps with side-wall triangles.
Core Insight

Trace regions.
Don't sample pixels.

The dense-point-cloud approach treats the depth map as a height field with one mesh vertex per pixel. That representation respects the pixel grid, not the surface. The region-based approach traces the boundary of each constant-depth cluster as a polygon and triangulates per polygon — what would be a 352 K-point dense reconstruction collapses to under 500 vertices for the same shape, with no loss of geometric fidelity on flat regions and visible quality gain on diagonal edges where the dense approach produces zigzag staircases.

03 — Stage 1 · Contour Tracing

Each cluster becomes a polygon-with-holes.

For each of the six depth maps, the first pass identifies all foreground pixels — anything with an alpha value above zero. Black pixels with α=0 represent holes — places where the orthographic ray missed the geometry entirely. Grey pixels at any intensity represent geometry at that depth level.

Subpixel marching squares (skimage.measure.find_contours at level 0.5 for binary masks, 0.01 for the float depth field) produces ordered contour polylines. Outer contours bound the foreground region; inner contours bound the holes inside it. The polygon-with-holes representation is the input to subsequent stages.
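A minimal sketch of such a tracer, under the mask convention above. The outer/hole classification here uses an even-odd ray cast rather than whatever orientation bookkeeping the production code relies on, and `extract_polygons_with_holes` is the helper name the clustering snippet later refers to:

```python
import numpy as np
from skimage import measure

def extract_polygons_with_holes(mask):
    """Trace a binary mask into (outer, holes) polygon pairs.

    Sub-pixel marching squares at level 0.5; the mask is zero-padded so
    contours touching the image border still close.
    """
    padded = np.pad(mask.astype(float), 1)
    contours = [c - 1.0 for c in measure.find_contours(padded, 0.5)]

    def contains(poly, pt):
        # even-odd ray cast in (row, col) space
        y, x = pt
        ys, xs = poly[:, 0], poly[:, 1]
        ys2, xs2 = np.roll(ys, -1), np.roll(xs, -1)
        crossed = (ys > y) != (ys2 > y)
        xint = xs + (y - ys) / (ys2 - ys + 1e-12) * (xs2 - xs)
        return np.count_nonzero(crossed & (xint > x)) % 2 == 1

    # outer contours are those not nested inside any other contour
    outers = [c for c in contours
              if not any(contains(o, c[0]) for o in contours if o is not c)]
    return [(o, [c for c in contours if c is not o and contains(o, c[0])])
            for o in outers]

# 12×12 foreground square with a 4×4 hole: one outer ring, one inner ring
mask = np.zeros((16, 16), dtype=bool)
mask[2:14, 2:14] = True
mask[6:10, 6:10] = False
polys = extract_polygons_with_holes(mask)
```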

04 — Stage 2 · Depth Clustering

k-means in depth space, robust to anti-aliasing.

Within a foreground region, depth is not necessarily uniform. A building facade has a wall plane at the background depth, recessed window planes a few centimetres deeper, protruding column planes a few centimetres shallower. Treating the entire region as a single planar polygon would collapse those distinct architectural layers into one flat surface.

The clustering step runs 1-D k-means on the depth values inside the region, typically with k=3 to 5. Clusters are initialised at depth quantiles and assigned a single representative depth z_i = mean(D[mask_i]). Each cluster mask becomes its own polygon-with-holes at depth z_i. Anti-aliasing pixels along cluster boundaries — the pixels where two depth levels blend into a gradient — are absorbed into whichever cluster has the closer mean, avoiding the staircase artefact a naive threshold would produce.

# 1-D k-means in depth space per foreground region
k = 4                                      # typically 3–5
foreground = depth_map > 0
depths_inside = depth_map[foreground]
labels = kmeans_1d(depths_inside, k=k, init='quantiles')

# Each cluster gets a single representative depth
cluster_depths = [depths_inside[labels == i].mean() for i in range(k)]

# Scatter the flat labels back onto the image grid, then trace each cluster
label_img = np.full(depth_map.shape, -1, dtype=int)
label_img[foreground] = labels
for i in range(k):
    mask_i = (label_img == i)
    polygons_i = extract_polygons_with_holes(mask_i)   # find_contours per mask
05 — Stage 3 · RDP Simplification

From thousands of contour points to dozens.

The marching squares output for a 512² depth map produces approximately 1000 contour points per cluster boundary — overkill for what is geometrically just a polygon with a small number of structural corners. Ramer–Douglas–Peucker with ε = 0.5 pixels in image space removes redundant collinear vertices: keeping the first and last point of any nearly-straight run, dropping the intermediate ones.

The ε value is a deliberate trade-off. Too small (e.g. 0.1 px) preserves zigzag artefacts from the underlying pixel grid. Too large (e.g. 2 px) rounds off real architectural corners. ε = 0.5 px hits the sweet spot for building facades: aggressive enough to flatten zigzag, conservative enough to preserve corners.
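For illustration, a self-contained RDP implementation with the pipeline's ε = 0.5 px default; the recursion splits at the farthest-from-chord point exactly as described above:

```python
import numpy as np

def rdp(points, eps=0.5):
    """Ramer–Douglas–Peucker polyline simplification.

    Keeps the endpoints of any near-straight run and drops interior
    points whose perpendicular distance to the chord is below eps
    (here in pixels, matching the pipeline's image-space ε = 0.5 px).
    """
    pts = np.asarray(points, dtype=float)
    if len(pts) < 3:
        return pts
    a, b = pts[0], pts[-1]
    chord = b - a
    norm = np.linalg.norm(chord)
    if norm < 1e-12:                       # closed run: fall back to point distance
        d = np.linalg.norm(pts - a, axis=1)
    else:                                  # perpendicular distance to the chord
        d = np.abs(chord[0] * (pts[:, 1] - a[1])
                   - chord[1] * (pts[:, 0] - a[0])) / norm
    i = int(np.argmax(d))
    if d[i] <= eps:
        return np.array([a, b])            # whole run is straight enough
    left = rdp(pts[:i + 1], eps)           # recurse on both halves,
    right = rdp(pts[i:], eps)              # splitting at the farthest point
    return np.vstack([left[:-1], right])

# Noisy straight edge plus one true corner at (4, 0)
line = np.array([[0, 0], [1, 0.2], [2, -0.1], [3, 0.1], [4, 0], [4, 4]])
simplified = rdp(line, eps=0.5)  # → [[0, 0], [4, 0], [4, 4]]
```

Sub-pixel zigzag along the straight run is flattened while the genuine corner survives.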

Source contour                              Point count   After RDP ε=0.5   Reduction
Sphere silhouette (great circle approx)        ~1025            ~64           16×
Building facade (with 12 windows)              ~3200           ~120           27×
L-shape silhouette                              ~840            ~12           70×
Cube face (square)                             ~2050              4          500×
06 — Stage 4 · earcut Triangulation

Polygon-with-holes → triangles, in linear time.

Each simplified per-cluster polygon (potentially with multiple inner holes) is triangulated using the earcut algorithm — the standard 2-D polygon-with-holes triangulator used in GIS pipelines. earcut is robust to arbitrary simple polygons, handles holes correctly, and runs in near-linear time for the near-convex polygons typical of architectural facades. Output: a flat list of triangle indices over the polygon's vertices.

The resulting per-face mesh has a small number of vertices (typically ~70–150) and a corresponding small number of triangles (~50–100). All vertices lie on the face plane at their cluster's depth — they are 2-D triangles waiting to be lifted into 3-D by the next stage.
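The production path uses a full polygon-with-holes earcut; the hole-free ear-clipping sketch below shows only the core mechanism — clip any convex vertex whose triangle contains no other polygon vertex — and is not the pipeline's actual triangulator:

```python
def triangulate_earclip(poly):
    """Minimal ear clipping for a simple CCW polygon without holes."""
    def cross(o, a, b):
        return (a[0]-o[0]) * (b[1]-o[1]) - (a[1]-o[1]) * (b[0]-o[0])

    def in_tri(p, a, b, c):
        # p inside or on a CCW triangle (a, b, c)
        return (cross(a, b, p) >= 0 and cross(b, c, p) >= 0
                and cross(c, a, p) >= 0)

    idx = list(range(len(poly)))
    tris = []
    while len(idx) > 3:
        for k in range(len(idx)):
            i, j, l = idx[k-1], idx[k], idx[(k+1) % len(idx)]
            a, b, c = poly[i], poly[j], poly[l]
            if cross(a, b, c) <= 0:
                continue                   # reflex vertex: not an ear
            if any(in_tri(poly[m], a, b, c)
                   for m in idx if m not in (i, j, l)):
                continue                   # another vertex inside: not an ear
            tris.append((i, j, l))
            idx.pop(k)                     # clip the ear and restart the scan
            break
        else:
            break                          # degenerate input: give up
    tris.append(tuple(idx))
    return tris

# Unit square, CCW winding → two triangles
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
tris = triangulate_earclip(square)
```

This naive scan is O(n²); real earcut adds the hole-bridging and z-order optimisations that make it near-linear on facade-like polygons.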

07 — Stage 5 · Watertight Stitching

Match boundary coords across faces, fill any gap with side walls.

The six per-face triangulated meshes are lifted into the same global coordinate frame using each face's (origin, right, up, normal) basis. Vertex positions on a face's silhouette boundary correspond, by construction of the orthographic projection, to the same 3-D positions on the silhouette boundary of the adjacent face. Watertight stitching is therefore a coordinate-equality merge — no ICP, no energy minimisation, no optimisation. Two vertices from different face meshes whose 3-D positions are within ε = 1e-4 are merged into a single shared vertex; the face indices are updated to reflect the merge.
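The coordinate-equality merge can be sketched by quantising positions to an ε grid and deduplicating through a dictionary. This is illustrative, not the production stitcher — grid quantisation can, in pathological cases, split two points that straddle a cell boundary within ε:

```python
import numpy as np

def merge_coincident(vertices, faces, eps=1e-4):
    """Merge vertices whose 3-D positions agree within eps.

    Quantising to an eps grid turns the coordinate-equality test into a
    dictionary lookup — no ICP, no optimisation.
    """
    key_of = {}
    remap = np.empty(len(vertices), dtype=int)
    merged = []
    for i, v in enumerate(vertices):
        key = tuple(np.round(np.asarray(v, dtype=float) / eps).astype(int))
        if key not in key_of:              # first vertex seen at this position
            key_of[key] = len(merged)
            merged.append(v)
        remap[i] = key_of[key]             # later duplicates map to it
    return np.array(merged), remap[np.asarray(faces)]

# Two face meshes sharing an edge: 6 input vertices collapse to 4
verts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0],    # triangle from face A
                  [0, 0, 0], [1, 1, 0], [0, 1, 0]])   # triangle from face B
faces = np.array([[0, 1, 2], [3, 4, 5]])
merged_verts, merged_faces = merge_coincident(verts, faces)
```

After the merge the two triangles share the edge (0, 2) through common vertex indices, which is what the per-edge watertightness check requires.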

A residual gap can still appear when one face has a depth cluster that doesn't fully reach the silhouette — e.g. the recessed window-plane on the +Z facade has a smaller silhouette than the +Z face itself. The fix is side-wall fill: triangulate a strip of side walls connecting the recessed cluster's boundary to the corresponding boundary on the face's outer silhouette, perpendicular to the face plane. This is what physically creates the recess: the side walls of the window niche.
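A hypothetical sketch of the side-wall fill for one recessed boundary loop: each boundary edge becomes a quad (two triangles) spanning the recess depth and the face depth. The loop representation and depth arguments are illustrative assumptions:

```python
import numpy as np

def side_wall_fill(loop_2d, z_outer, z_recessed):
    """Wall strip between a recessed cluster boundary and the face plane.

    loop_2d : closed (N, 2) polyline in face-plane coordinates
    Returns wall vertices and triangles: two triangles per boundary edge.
    """
    loop = np.asarray(loop_2d, dtype=float)
    n = len(loop)
    top = np.column_stack([loop, np.full(n, z_outer)])     # on the face plane
    bot = np.column_stack([loop, np.full(n, z_recessed)])  # on the recess plane
    verts = np.vstack([top, bot])
    tris = []
    for i in range(n):
        j = (i + 1) % n                    # wrap the closed loop
        tris.append((i, j, n + i))         # upper triangle of the wall quad
        tris.append((j, n + j, n + i))     # lower triangle of the wall quad
    return verts, np.array(tris)

# Square window niche recessed 0.05 into a face plane at z = 1.0
window = [(0.3, 0.3), (0.7, 0.3), (0.7, 0.7), (0.3, 0.7)]
verts, tris = side_wall_fill(window, z_outer=1.0, z_recessed=0.95)
```

These eight triangles are the physical sides of the window niche described above.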

Failure mode fixed during development
An early version produced six face meshes that looked right individually but did not connect at the seams — the +X face and the +Y face shared an edge in 3-D space but each had independently triangulated the corner, producing two co-located but non-shared vertices. The coordinate-merge step explicitly identifies these and unifies them; the side-wall fill catches the remaining gaps from depth-cluster boundary mismatches.
08 — Alternative · Cloth-Grid Reconstruction

Subdivided grids displaced by depth, snapped at boundaries.

The minimal-polygon pipeline above produces the sparsest possible mesh for a given depth representation — ideal when downstream consumers want editable, low-poly output. A complementary approach explored in the same work is the cloth-grid reconstruction: for each of the six views, cast a subdivided grid (think of a flat mesh patch facing that direction), displace each grid vertex along the view direction using the depth map value at that pixel, snap boundary vertices to the extracted silhouettes, and merge the six displaced grid patches into a watertight mesh.

A — Minimal polygon (RDP + earcut)
Compress to corners only.
  • 352 K source pixels → ~454 vertices, ~332 triangles on the sphere benchmark
  • Vertices live only at structural corners and along simplified silhouettes
  • Editable output — every vertex is a meaningful corner the user can drag
  • Trade-off: flat where the original was flat; curved-surface fidelity limited by the cluster count k
B — Cloth-grid (displaced subdivision)
Dense grid, depth-displaced, boundary-snapped.
  • ~1500 vertices for a 16×16 per-face grid, 6 faces minus shared boundaries
  • Smooth surface reconstruction — captures curvature continuously
  • Less editable — vertices are arbitrary grid positions, not structural features
  • Better fit for organic shapes (sphere, torus); the polygon pipeline is better for architectural geometry with clear corners

Both approaches share stages 1–2 (contour tracing, depth clustering) and differ only in the meshing strategy. The polygon path is the default for downstream consumers that expect production-clean geometry; the cloth path is the default for ML training data where smoothness matters more than editability.
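A sketch of the cloth-grid path for a single face, under assumed conventions — a +Z-facing patch over [0, 1]², nearest-pixel depth sampling, white-near encoding. Boundary snapping and the six-face merge are omitted:

```python
import numpy as np

def cloth_grid_face(depth_map, res=16):
    """Displace a res×res grid patch by the sampled depth (one cloth face)."""
    h, w = depth_map.shape
    u, v = np.meshgrid(np.linspace(0, 1, res + 1),
                       np.linspace(0, 1, res + 1))
    # nearest-pixel depth sample at each grid vertex
    px = np.clip(np.round(u * (w - 1)).astype(int), 0, w - 1)
    py = np.clip(np.round(v * (h - 1)).astype(int), 0, h - 1)
    z = depth_map[py, px]                  # white = near: larger value sits
                                           # closer to the face (assumed encoding)
    verts = np.stack([u, v, z], axis=-1).reshape(-1, 3)

    # grid connectivity: two triangles per cell
    tris = []
    for r in range(res):
        for c in range(res):
            a = r * (res + 1) + c
            tris.append((a, a + 1, a + res + 2))
            tris.append((a, a + res + 2, a + res + 1))
    return verts, np.array(tris)

depth = np.full((64, 64), 0.75)            # uniform mid-depth plane
verts, tris = cloth_grid_face(depth, res=16)
```

A 16×16 grid yields 17² = 289 vertices per face before boundary sharing, consistent with the ~1500-vertex figure for six merged faces.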

09 — Results · Per-Shape Verification

Validated against synthetic primitives.

Shape                   Input pixels                     Vertices (polygon path)   Triangles   Watertight
Sphere (r=0.5)          ~352 K                           ~454                      ~332        ✓ — round-trip Hausdorff < 1.5%
Cube (unit)             ~1.5 M (all faces white-flat)    8                         12          ✓ — exact reconstruction
Cylinder (h=1, r=0.5)   ~280 K                           ~120                      ~98         ✓ — cap rims correctly arched via depth lift
Torus (R=0.5, r=0.2)    ~190 K                           ~340                      ~280        ✓ — inner hole correctly preserved
L-shape (composite)     ~260 K                           ~24                       ~16         ✓ — concave corner recovered cleanly

The watertight verification runs per-edge two-face incidence checks: every mesh edge must be shared by exactly two faces. All test shapes pass. Round-trip verification converts the output back to a 256³ occupancy grid, runs marching cubes, and checks Hausdorff distance against the pipeline's output mesh — under 1.5 % on all primitives.
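The per-edge incidence check is small enough to show in full; a minimal version, assuming triangle faces:

```python
from collections import Counter

def is_watertight(faces):
    """Per-edge two-face incidence: every undirected edge of the mesh
    must be shared by exactly two triangles."""
    edges = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges[frozenset((u, v))] += 1  # undirected edge key
    return all(count == 2 for count in edges.values())

# Tetrahedron: closed surface, every edge has exactly two incident faces
tet = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
# A lone triangle has three boundary edges, each seen only once
open_mesh = [(0, 1, 2)]
```

An edge counted once is a boundary (a seam gap); counted three or more times, a non-manifold junction — both fail the check.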

Interactive Demo · Live

Click any input tile to toggle a recess on that face. The contour pane (middle) and the reconstructed mesh (right) update in real time as you edit. Each toggle changes the depth-map input, which the pipeline reconstructs end-to-end at interactive speed — exactly the editing loop the project was built for. Use the preset buttons below to jump between common configurations, or drag the mesh to rotate.

01 — Six Input Depth Maps (click a tile to toggle a recess)
02 — Per-Face Contours After RDP
03 — Reconstructed Watertight Mesh (drag to rotate)

Full Technical Paper

arXiv-format write-up · 6-Plane Orthographic Mesh Reconstruction · per-stage analysis, two-pipeline comparison, watertightness proof

Related Thesis Chapters
Sphere Depth Maps from Cube Faces
Direct upstream producer. Generates the six orthographic depth maps that feed this reconstruction pipeline as input.
Building Elevation Reconstruction
Direct downstream application. Same pipeline applied to architectural subjects with view-synthesis providing the depth maps from a single street-view photograph.
PGN — Procedural Generator Network
Sister project on the structured-intermediate-representation thesis line. Here the intermediate is six depth maps; in PGN it is a DSL program.
Appendix — Raw Materials
Transcripts & Source References
Restricted Access