Aditya Jain / Apple Maps · 3D Reconstruction
Topic 9 · Mar 2025 · SLAM · Depth Estimation · Grant Pitch

Neural SLAM: Real-Time 3-D from RGB Video

Pre-thesis exploration: a pitch for a real-time 3-D reconstruction system from RGB video that combines per-frame monocular depth estimation with a Neural-SLAM back-end for camera-pose tracking and global geometry fusion. It was packaged as a WTFund grant application (₹20 L seed grant for Indian early-stage research). The planned deployment substrate was browser-based real-time depth inference, with Rerun.io for live 3-D visualisation. The application was not funded, but the system architecture studied here informed later thesis-line decisions on consumer-GPU 3-D generation.

00 — Motivation

"Capture a building with my phone, get a 3-D model in the browser."

The thesis line's long-term goal is single-image-to-3-D for the Apple Maps 3-D-reconstruction pipeline. The March 2025 Neural-SLAM work was the earliest exploration of that line, framed as "capture a building with a phone camera, get a 3-D model live in the browser". The architectural premise: per-frame depth estimation (a monocular depth network) gives a per-frame point cloud; a Neural-SLAM back-end tracks camera pose and fuses the per-frame point clouds into a globally-consistent 3-D reconstruction; the result is visualised live in Rerun.io in-browser.
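The first step of that premise, turning a per-frame depth map into a point cloud and placing it in a shared world frame, can be sketched as follows. This is a minimal illustration under a pinhole-camera assumption, not the pitched implementation: the intrinsics `fx`, `fy`, `cx`, `cy`, the function names, and the pure-Python data layout are all hypothetical, and a real pipeline would take the pose from the SLAM back-end rather than pass it in by hand.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def backproject(depth: List[List[float]], fx: float, fy: float,
                cx: float, cy: float) -> List[Point]:
    """Turn a row-major depth map (metres) into camera-frame 3-D points.

    Assumes a pinhole camera; fx, fy, cx, cy are illustrative intrinsics.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:  # skip invalid or missing depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def to_world(points: List[Point], rotation: List[List[float]],
             translation: Point) -> List[Point]:
    """Apply a camera-to-world pose: p_world = R * p_camera + t."""
    world: List[Point] = []
    for p in points:
        world.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return world
```

Calling `backproject` on each frame and `to_world` with that frame's estimated pose yields point clouds in one coordinate frame, which is the input the fusion stage needs.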

The exploration produced a pitch document for the WTFund seed grant, a Bangalore / Bombay angel-stage research grant for Indian early-stage projects, which offered ₹20 L (~$24 K USD) at the relevant tier. The pitch covered the system architecture, the monocular-depth model choice (DPT / Depth-Anything class), the Neural-SLAM choice (NICER-SLAM / MASt3R-SLAM class), and the browser-deployment plan (WebGPU for inference, Rerun.io for visualisation). The application was not funded. The exploration nevertheless informed later thesis-line decisions about which architecture choices the consumer-hardware constraint rules out.

The honest scope: this is an exploration and grant-pitch page, not a built system. No SLAM pipeline was implemented end-to-end. The contribution to the thesis line is the architectural-feasibility reasoning, which carried forward to the Hexplane AE, Hierarchical Triplane, and MambaFlow3D work, where the same deployment constraint (consumer GPU, browser-rendered output) reappears.

What it informs
The architectural-feasibility reasoning from this exploration directly informs the consumer-hardware constraint adopted by the MambaFlow3D Phase-2 work (Topic 26) and the JiT-on-2×3060 reproduction (Topic 27). The "browser-rendered output" leg of the pitch reappears in the Bevy / WebGL viewer work in later thesis-line topics.
01 — System Architecture

Per-frame depth → SLAM pose tracking → fused geometry → Rerun.io.

The pitched system has four stages, deployed as a streaming pipeline so that each frame of video produces an incremental update to the 3-D reconstruction without batch processing.

| Stage | Input | Output | Target latency | Tool |
| --- | --- | --- | --- | --- |
| 1. Monocular depth | RGB frame (480×640) | Per-pixel depth map (480×640) | < 100 ms / frame | DPT or Depth-Anything (web) |
| 2. Camera pose | Depth frame + feature matches | SE(3) camera pose | < 30 ms / frame | Neural-SLAM back-end |
| 3. Geometry fusion | Depth + pose | Updated voxel / TSDF / sparse grid | < 50 ms / frame | TSDF integration (in-browser) |
| 4. Visualisation | Fused geometry | Live 3-D view | 30 fps target | Rerun.io / WebGPU |
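Stage 3 is the least self-explanatory row, so here is a toy sketch of TSDF integration. To keep it short it fuses depth along a single camera ray rather than over a 3-D grid, using the standard weighted running-average update; the class name `RayTSDF`, the voxel size, and the truncation distance are illustrative assumptions, since the pitch did not specify a fusion implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RayTSDF:
    """Toy 1-D TSDF: voxels along one ray, centres at (i + 0.5) * voxel_size."""
    n_voxels: int
    voxel_size: float
    trunc: float
    values: List[float] = field(init=False)
    weights: List[float] = field(init=False)

    def __post_init__(self) -> None:
        self.values = [1.0] * self.n_voxels   # 1.0 means free or unobserved
        self.weights = [0.0] * self.n_voxels

    def integrate(self, depth: float, weight: float = 1.0) -> None:
        """Fuse one depth observation with a weighted running average."""
        for i in range(self.n_voxels):
            centre = (i + 0.5) * self.voxel_size
            sdf = depth - centre              # positive in front of the surface
            if sdf < -self.trunc:
                continue                      # far behind the surface: occluded
            tsdf = min(1.0, sdf / self.trunc)
            w_old = self.weights[i]
            self.values[i] = (w_old * self.values[i] + weight * tsdf) / (w_old + weight)
            self.weights[i] = w_old + weight

    def surface_depth(self) -> Optional[float]:
        """Locate the first positive-to-negative zero crossing by interpolation."""
        for i in range(self.n_voxels - 1):
            a, b = self.values[i], self.values[i + 1]
            observed = self.weights[i] > 0.0 and self.weights[i + 1] > 0.0
            if observed and a > 0.0 >= b:
                t = a / (a - b)
                return (i + 0.5 + t) * self.voxel_size
        return None
```

Integrating two noisy observations of the same surface (for example 0.52 m and 0.48 m) moves the zero crossing towards their average, which is the property that makes TSDF fusion attractive for incremental, per-frame updates.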

Aggregate target: 10 fps end-to-end on a mid-range laptop with an integrated GPU. 10 fps was chosen as the threshold below which capture stops feeling real-time and the user perceives lag in the capture loop. The pitch did not commit to a proof-of-concept benchmark: the feasibility argument rested on constituent-model benchmarks (DPT inference at ~50 ms / frame on an M2 Mac, NICER-SLAM at ~10 fps on an RTX 3060) rather than on a measured end-to-end run.

Pipeline

Streaming pipeline — one update per video frame.

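RGB frame (phone camera) → Mono depth (DPT / Depth-Anything) → Neural SLAM (pose + tracking) → TSDF fusion (global geometry) → Rerun.io viewer (live 3-D in browser)

Streaming: one cycle per video frame; geometry accumulates incrementally. Aggregate target latency budget: 100 + 30 + 50 = 180 ms / frame, which is ~5.6 fps if the stages run back to back and approaches ~10 fps only if the stages overlap so that the slowest stage (100 ms) sets the throughput; hence ~5–10 fps end-to-end on integrated-GPU laptops.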
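The arithmetic behind that budget can be checked in a few lines. The sketch below uses the three per-stage latency targets from the stage table; the function names are illustrative, and the pipelined figure is an optimistic upper bound that assumes the stages can run fully concurrently, which an integrated GPU may not allow.

```python
# Per-stage latency targets in milliseconds, taken from the stage table.
STAGE_BUDGET_MS = {
    "mono_depth": 100.0,
    "camera_pose": 30.0,
    "geometry_fusion": 50.0,
}


def sequential_fps(budget_ms: dict) -> float:
    """Throughput when each frame passes through all stages back to back."""
    return 1000.0 / sum(budget_ms.values())


def pipelined_fps(budget_ms: dict) -> float:
    """Upper bound when stages overlap: the slowest stage sets the rate."""
    return 1000.0 / max(budget_ms.values())


if __name__ == "__main__":
    print(f"sequential: {sequential_fps(STAGE_BUDGET_MS):.2f} fps")
    print(f"pipelined:  {pipelined_fps(STAGE_BUDGET_MS):.2f} fps")
```

Under these assumptions the sequential rate is about 5.6 fps and the pipelined bound is 10 fps. Pipelining raises throughput only: each frame still takes 180 ms to travel through all three stages.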
Core Insight

"Build the consumer-GPU constraint into the pitch from day one."
Servers were never the deployment substrate.

The single carry-forward lesson from the WTFund pitch work: the thesis line should bake the consumer-hardware deployment constraint into every architecture decision, not bolt it on at the end. The pitch's "browser-rendered, on a laptop, in real time" framing is the same constraint that ultimately reshaped the 2026 architecture work: JiT on 2×RTX 3060, MambaFlow3D designed for rentable consumer-GPU rigs, and Hexplane AE single-GPU training. The Neural-SLAM exploration is where that constraint was first articulated.

02 — The WTFund Pitch

₹20 L grant application. Not funded.

The pitch was submitted to WTFund, a Bombay / Bangalore early-stage research grant network, at the ₹20 L tier (~$24 K USD). The pitch document covered the system architecture (§01), the constituent-model benchmark sources, a proposed six-month milestone plan, a deployment narrative (browser-rendered, no server needed), and market positioning against then-current commercial offerings (Polycam, Scaniverse).

| Section | Argument | Outcome |
| --- | --- | --- |
| Technical feasibility | Constituent-model benchmarks sum to ~10-fps target | Reviewers were sceptical of the end-to-end claim without a PoC |
| Market positioning | Browser-deployable; no server cost | Reviewers noted competitive offerings already had iOS apps |
| Six-month milestone plan | PoC at month 2, mobile test at month 4, beta at month 6 | Plausible but missing a measured-latency baseline |
| Hardware budget | ₹6 L compute, ₹8 L stipend, ₹6 L contingency | Standard tier |
| Decision | Not funded | "Promising direction, request a PoC before re-applying" |

The decline was reasonable. The pitch leaned on constituent-model benchmarks rather than a measured end-to-end run; a re-application would have needed a working prototype whose measured frame time met the 10-fps bar (at most 100 ms per frame). The thesis line pivoted at this point towards the PGN / sketch-to-3-D / triplane work that became the core of the 2025-Q3 → 2026-Q1 research line, rather than continuing with the video-to-3-D pitch.
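A measured-latency baseline of the kind the reviewers asked for could have been gathered with a harness along these lines. It is a sketch only: `process_frame` stands in for a hypothetical end-to-end pipeline call, and judging the 10-fps bar on the 95th-percentile frame time is an assumption made here, not a criterion stated in the pitch or by the reviewers.

```python
import time
from typing import Any, Callable, Iterable, List


def measure_frame_times(process_frame: Callable[[Any], Any],
                        frames: Iterable[Any],
                        clock: Callable[[], float] = time.perf_counter) -> List[float]:
    """Run process_frame on every frame and return per-frame times in ms."""
    times_ms: List[float] = []
    for frame in frames:
        start = clock()
        process_frame(frame)
        times_ms.append((clock() - start) * 1000.0)
    return times_ms


def meets_fps_bar(frame_times_ms: List[float], target_fps: float = 10.0,
                  percentile: float = 0.95) -> bool:
    """Check that the chosen percentile of frame times fits the per-frame budget."""
    if not frame_times_ms:
        raise ValueError("no frame times recorded")
    budget_ms = 1000.0 / target_fps
    ordered = sorted(frame_times_ms)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] <= budget_ms
```

The `clock` parameter is injectable so the harness can be tested deterministically. Using a high percentile rather than the mean keeps occasional slow frames, which are what a user perceives as lag, from being averaged away.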

Interactive Demo · Live

Trace the streaming pipeline. Pick a scene and step the camera through five sample positions. The left pane is the per-frame RGB; the centre pane is the depth map at that frame; the right pane shows the cumulative point-cloud reconstruction as more frames are added. Drag the right pane to rotate.


Full Technical Paper

White paper · Neural-SLAM system design pitch · WTFund decline · consumer-hardware-constraint throughline origin

Related Thesis Chapters
Sphere-Depth Maps
Direct downstream of the depth-estimation thread started here. Topic 9's per-frame depth turned into the orthographic depth maps used as input by the later Six-Plane Mesh work.
Six-Plane Mesh Reconstruction
The geometry-fusion step in Topic 9's pipeline is the closest predecessor to the watertight-mesh extraction in Six-Plane Mesh. Same problem (depth → mesh), different framing (streaming vs batch).
MambaFlow3D — Consumer-GPU 3-D Gen
The consumer-hardware constraint baked into the WTFund pitch is the same constraint that shapes MambaFlow3D's speed-up budget. Topic 9 articulated the constraint; MambaFlow3D acts on it.
Appendix — Raw Materials
Transcripts & Source References
Restricted Access