Topic 9 · Mar 2025 · SLAM · Depth Estimation · Grant Pitch

Neural SLAM —
Real-Time 3-D from RGB Video.

Pre-thesis exploration — pitch for a real-time 3-D reconstruction system from RGB video, combining per-frame monocular depth estimation with a Neural-SLAM back-end for camera-pose tracking and global geometry fusion. Packaged as a WTFund grant application (₹20 L seed grant, Indian early-stage research). Browser-based real-time depth inference and Rerun.io for live 3-D visualisation were the deployment substrate. The application was not funded; the system architecture studied here informed the thesis-line decisions taken later on consumer-GPU 3-D generation.

00 — Motivation

"Capture a building with my phone, get a 3-D model in the browser."

The thesis line's long-term goal is single-image-to-3-D for the Apple Maps 3-D-reconstruction pipeline. The March 2025 Neural-SLAM work was the earliest exploration of that line, framed as "capture a building with a phone camera, get a 3-D model live in the browser". The architectural premise: per-frame depth estimation (a monocular depth network) gives a per-frame point cloud; a Neural-SLAM back-end tracks camera pose and fuses the per-frame point clouds into a globally-consistent 3-D reconstruction; the result is visualised live in Rerun.io in-browser.
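The premise above (depth map to per-frame point cloud, then pose-aligned fusion) can be sketched in a few lines. This is a minimal illustration, not the pitched system: it assumes a pinhole camera, depth already in metric units (relative-depth network outputs would need scale alignment first), and a camera-to-world pose supplied by the SLAM back-end. The names `backproject`, `to_world`, and the toy intrinsics are hypothetical, and a real implementation would vectorise the loops.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(depth: Sequence[Sequence[float]],
                fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Lift a depth map to camera-frame 3-D points with the pinhole model."""
    points: List[Point] = []
    for v, row in enumerate(depth):      # v: pixel row
        for u, z in enumerate(row):      # u: pixel column
            if z <= 0.0:                 # non-positive depth marks an invalid pixel
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def to_world(points: Sequence[Point],
             rotation: Sequence[Sequence[float]],
             translation: Sequence[float]) -> List[Point]:
    """Apply a camera-to-world pose: p_world = R * p_cam + t."""
    world: List[Point] = []
    for x, y, z in points:
        world.append((
            rotation[0][0] * x + rotation[0][1] * y + rotation[0][2] * z + translation[0],
            rotation[1][0] * x + rotation[1][1] * y + rotation[1][2] * z + translation[1],
            rotation[2][0] * x + rotation[2][1] * y + rotation[2][2] * z + translation[2],
        ))
    return world


# Toy 2x2 depth map (one invalid pixel) with hypothetical intrinsics.
depth = [[2.0, 2.0],
         [0.0, 4.0]]
cam_points = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)

# Hypothetical pose: identity rotation, camera shifted 1 m along world x.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 0.0, 0.0]
world_points = to_world(cam_points, R, t)

# The simplest possible "fusion": append pose-aligned points to a global cloud.
global_cloud: List[Point] = []
global_cloud.extend(world_points)
```

Appending points is only the naive form of fusion; it grows without bound and does not average out depth noise, which is why the pitch lists a voxel / TSDF representation for the fusion stage.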

The exploration produced a pitch document for the WTFund seed grant, a Bangalore / Bombay angel-stage research grant for Indian early-stage projects, with ₹20 L (₹20 lakh, roughly US$24 K) at the relevant tier. The pitch covered the system architecture, the monocular-depth model choice (DPT / Depth-Anything class), the Neural-SLAM choice (NICER-SLAM / MAST3R-SLAM class), and the browser-deployment plan (WebGPU for inference, Rerun.io for visualisation). The application was not funded. The exploration nevertheless informed later thesis-line decisions on which architecture choices the consumer-hardware constraint rules out.

The honest scope: this is an exploration and grant-pitch page, not a built system. No SLAM pipeline was implemented end-to-end. The contribution to the thesis line is the architectural-feasibility reasoning, which carried forward to the Hexplane AE / Hierarchical Triplane / MambaFlow3D work, where the same deployment constraint (consumer GPU, browser-rendered output) reappears.

What it informs

The architectural-feasibility reasoning from this exploration directly informs the consumer-hardware constraint adopted by the MambaFlow3D Phase-2 work (Topic 26) and the JiT-on-2×3060 reproduction (Topic 27). The "browser-rendered output" leg of the pitch reappears in the Bevy / WebGL viewer work in later thesis-line topics.

01 — System Architecture

Per-frame depth → SLAM pose tracking → fused geometry → Rerun.io.

The pitched system has four stages, deployed as a streaming pipeline so that each frame of video produces an incremental update to the 3-D reconstruction without batch processing.

Stage	Input	Output	Target latency	Tool
1. Monocular depth	RGB frame (480×640)	Per-pixel depth map (480×640)	< 100 ms / frame	DPT or Depth-Anything (web)
2. Camera pose	Depth frame + feature matches	SE(3) camera pose	< 30 ms / frame	Neural-SLAM back-end
3. Geometry fusion	Depth + pose	Updated voxel / TSDF / sparse grid	< 50 ms / frame	TSDF integration (in-browser)
4. Visualisation	Fused geometry	Live 3-D view	30 fps target	Rerun.io / WebGPU
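Stage 3 names TSDF integration without specifying a formulation. As a hedged sketch, the standard approach keeps a truncated signed distance and a weight per voxel and folds each new depth observation in as a weighted running average. The function name, the truncation distance, and the weight cap below are illustrative assumptions, not values from the pitch.

```python
from typing import Tuple


def integrate_observation(tsdf: float, weight: float,
                          voxel_depth: float, measured_depth: float,
                          trunc: float = 0.25,
                          max_weight: float = 64.0) -> Tuple[float, float]:
    """Fuse one depth observation into one voxel; returns (tsdf, weight).

    voxel_depth    : depth of the voxel centre in the current camera frame
    measured_depth : depth-map value at the pixel the voxel projects to
    """
    sdf = measured_depth - voxel_depth       # > 0: voxel lies in front of the surface
    if sdf < -trunc:
        return tsdf, weight                  # far behind the surface: occluded, skip
    observed = min(1.0, sdf / trunc)         # truncate and normalise to [-1, 1]
    fused = (tsdf * weight + observed) / (weight + 1.0)
    return fused, min(weight + 1.0, max_weight)


# One voxel 1.0 m along the ray, seen in two frames with noisy depth.
value, w = 0.0, 0.0
value, w = integrate_observation(value, w, voxel_depth=1.0, measured_depth=1.125)
after_first = (value, w)
value, w = integrate_observation(value, w, voxel_depth=1.0, measured_depth=0.875)
after_second = (value, w)

# A surface far in front of the voxel occludes it, so the voxel is untouched.
after_occluded = integrate_observation(value, w, voxel_depth=1.0, measured_depth=0.5)
```

In the toy run the two noisy readings straddle the voxel, so the averaged value lands on zero, i.e. the voxel sits on the estimated surface. The surface is later extracted as the zero level set of the grid.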

Aggregate target: 10 fps end-to-end on a mid-range laptop with integrated GPU. The 10-fps number was the threshold below which "real-time" stops feeling real-time and the user perceives lag in the capture loop. The pitch did not commit to a proof-of-concept benchmark — the feasibility argument was based on constituent-model benchmarks (DPT inference at ~50 ms / frame on M2 Mac, NICER-SLAM at ~10 fps on RTX 3060) rather than a measured end-to-end run.
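The per-stage budgets in the table can be checked against the 10-fps target with simple arithmetic. The sketch below treats the table's upper bounds as actual stage times and assumes visualisation runs on its own 30-fps render loop; the pitch does not say whether the three compute stages overlap, so both cases are shown.

```python
# Per-frame latency budgets from the stage table, in milliseconds.
budgets_ms = {
    "monocular depth": 100.0,
    "camera pose": 30.0,
    "geometry fusion": 50.0,
}


def fps(frame_time_ms: float) -> float:
    """Convert a per-frame time in milliseconds to frames per second."""
    return 1000.0 / frame_time_ms


# Serial execution: each frame waits for all three stages in turn.
serial_ms = sum(budgets_ms.values())
serial_fps = fps(serial_ms)

# Fully pipelined execution: throughput is bounded by the slowest stage.
pipelined_ms = max(budgets_ms.values())
pipelined_fps = fps(pipelined_ms)

print(f"serial:    {serial_ms:.0f} ms/frame -> {serial_fps:.2f} fps")
print(f"pipelined: {pipelined_ms:.0f} ms/frame -> {pipelined_fps:.2f} fps")
```

Under these assumptions the budgets reach 10 fps only when the stages overlap; run back-to-back they give roughly 5.6 fps. Pipelining raises throughput but not per-frame latency, which stays at the serial sum.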

Pipeline

Streaming pipeline — one update per video frame.

Core Insight

"Build the consumer-GPU constraint into the pitch from day one."
Servers were never the deployment substrate.

The single carry-forward lesson from the WTFund pitch work: the thesis line should bake the consumer-hardware deployment constraint into every architecture decision, not bolt it on at the end. The pitch's "browser-rendered, on a laptop, in real time" framing is the same constraint that ultimately reshaped the 2026 architecture work: JiT on 2×RTX 3060, MambaFlow3D designed for rentable consumer-grade rigs, and Hexplane AE trained on a single GPU. The Neural-SLAM exploration is where that constraint was first articulated.

02 — The WTFund Pitch

₹20 L grant application. Not funded.

The pitch was submitted to WTFund — a Bombay / Bangalore early-stage research grant network — at the ₹20 L tier (~$24 K USD). The pitch document covered the system architecture (§01), the constituent-model benchmark sources, a proposed six-month milestone plan, a deployment narrative (browser-rendered, no server needed), and a market positioning against then-current commercial offerings (Polycam, Scaniverse).

Section	Argument	Outcome
Technical feasibility	Constituent-model benchmarks sum to ~10-fps target	Reviewers were sceptical of the end-to-end claim without a PoC
Market positioning	Browser-deployable; no server cost	Reviewers noted competitive offerings already had iOS apps
Six-month milestone plan	PoC at month 2, mobile test at month 4, beta at month 6	Plausible but missing a measured-latency baseline
Budget	₹6 L compute, ₹8 L stipend, ₹6 L contingency	Standard tier
Decision	Not funded	"Promising direction, request a PoC before re-applying"

The decline was reasonable. The pitch leaned on constituent-model benchmarks rather than a measured end-to-end run; a re-application would have needed an actual prototype whose measured end-to-end frame time met the 10-fps bar (at most 100 ms per frame). The thesis line pivoted at this point towards the PGN / sketch-to-3-D / triplane work that became the core of the 2025-Q3 → 2026-Q1 research line, rather than continuing on the video-to-3-D pitch.

Interactive Demo

The demo traces the streaming pipeline over five sample camera positions in three panes: (01) the per-frame RGB frame, (02) the depth map at frame t, and (03) the fused point cloud, which accumulates as more frames are added.

