Pre-thesis exploration — pitch for a real-time 3-D reconstruction system from RGB video, combining per-frame monocular depth estimation with a Neural-SLAM back-end for camera-pose tracking and global geometry fusion. Packaged as a WTFund grant application (₹20 L seed grant, Indian early-stage research). Browser-based real-time depth inference and Rerun.io for live 3-D visualisation were the deployment substrate. The application was not funded; the system architecture studied here informed the thesis-line decisions taken later on consumer-GPU 3-D generation.
The thesis line's long-term goal is single-image-to-3-D for the Apple Maps 3-D-reconstruction pipeline. The March 2025 Neural-SLAM work was the earliest exploration of that line, framed as "capture a building with a phone camera, get a 3-D model live in the browser". The architectural premise: per-frame depth estimation (a monocular depth network) gives a per-frame point cloud; a Neural-SLAM back-end tracks camera pose and fuses the per-frame point clouds into a globally-consistent 3-D reconstruction; the result is visualised live in Rerun.io in-browser.
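The premise above can be made concrete with a minimal sketch of its first two steps: back-projecting a depth map into a camera-space point cloud with a pinhole model, then moving those points into the world frame with a camera pose. The intrinsics (`fx`, `fy`, `cx`, `cy`), the function names, and the toy 2×2 depth map are illustrative assumptions, not values from the pitch. In a real pipeline the pose would come from the SLAM back-end rather than being hard-coded.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(depth: Sequence[Sequence[float]],
                fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Pinhole back-projection: pixel (u, v) with depth z -> camera-space (x, y, z)."""
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:  # skip invalid / missing depth
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def to_world(points: Sequence[Point],
             rotation: Sequence[Sequence[float]],
             translation: Sequence[float]) -> List[Point]:
    """Apply a camera-to-world SE(3) pose: p_world = R @ p_cam + t."""
    out = []
    for p in points:
        out.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return out


# Toy example (hypothetical values): 2x2 depth map with one invalid pixel.
depth_map = [[2.0, 2.0],
             [0.0, 4.0]]
cam_points = backproject(depth_map, fx=1.0, fy=1.0, cx=0.5, cy=0.5)

# Identity rotation, camera shifted 1 unit along world x.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
world_points = to_world(cam_points, identity, [1.0, 0.0, 0.0])
print(world_points)
```

Fusing successive `world_points` sets into one consistent model is the job the pitch assigned to the Neural-SLAM back-end; this sketch stops before that step.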
The exploration produced a pitch document for the WTFund seed grant, a Bangalore / Bombay angel-stage research grant for Indian early-stage projects, with ₹20 L (~US$24K) at the relevant tier. The pitch covered the system architecture, the monocular-depth model choice (DPT / Depth-Anything class), the Neural-SLAM choice (NICER-SLAM / MASt3R-SLAM class), and the browser-deployment plan (WebGPU for inference, Rerun.io for visualisation). The application was not funded. The exploration nevertheless informed later thesis-line decisions about which architecture choices the consumer-hardware constraint rules out.
Scope: this page documents an exploration and a grant pitch, not a built system. No SLAM pipeline was implemented end-to-end. The contribution to the thesis line is the architectural-feasibility reasoning, which carried forward to the Hexplane AE / Hierarchical Triplane / MambaFlow3D work where the deployment constraint (consumer GPU, browser-rendered output) reappears.
The pitched system has four stages, deployed as a streaming pipeline so that each frame of video produces an incremental update to the 3-D reconstruction without batch processing.
| Stage | Input | Output | Target latency | Tool |
|---|---|---|---|---|
| 1. Monocular depth | RGB frame (480×640) | Per-pixel depth map (480×640) | < 100 ms / frame | DPT or Depth-Anything (web) |
| 2. Camera pose | Depth frame + feature matches | SE(3) camera pose | < 30 ms / frame | Neural-SLAM back-end |
| 3. Geometry fusion | Depth + pose | Updated voxel / TSDF / sparse grid | < 50 ms / frame | TSDF integration (in-browser) |
| 4. Visualisation | Fused geometry | Live 3-D view | 30 fps target | Rerun.io / WebGPU |
Aggregate target: 10 fps end-to-end on a mid-range laptop with an integrated GPU. The 10 fps figure was chosen as the threshold below which capture stops feeling real-time and the user perceives lag in the capture loop. The pitch did not commit to a proof-of-concept benchmark: the feasibility argument rested on constituent-model benchmarks (DPT inference at ~50 ms / frame on an M2 Mac, NICER-SLAM at ~10 fps on an RTX 3060), both taken on hardware other than the integrated-GPU target, rather than on a measured end-to-end run.
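The per-stage budgets in the table and the 10 fps aggregate only agree if the stages overlap across frames. The sketch below compares back-to-back execution with an idealised pipeline whose throughput is bound by the slowest stage. The budgets are the table's upper bounds; treating visualisation as a decoupled renderer and assuming perfect overlap with no contention on a shared GPU are simplifying assumptions, and the function names are illustrative.

```python
# Per-stage latency budgets (ms), taken from the stage table's upper bounds.
STAGE_BUDGET_MS = {
    "monocular_depth": 100.0,
    "camera_pose": 30.0,
    "geometry_fusion": 50.0,
}


def sequential_fps(budgets_ms):
    """Frames per second if each frame runs all stages back to back."""
    return 1000.0 / sum(budgets_ms.values())


def pipelined_fps(budgets_ms):
    """Idealised throughput when stages overlap across frames.

    Assumes each stage has its own execution resource, so the slowest
    stage sets the frame rate. On a single shared GPU this is optimistic.
    """
    return 1000.0 / max(budgets_ms.values())


def pipeline_latency_ms(budgets_ms):
    """Capture-to-fused-geometry delay for one frame (overlap does not reduce it)."""
    return sum(budgets_ms.values())


seq = sequential_fps(STAGE_BUDGET_MS)
pipe = pipelined_fps(STAGE_BUDGET_MS)
lat = pipeline_latency_ms(STAGE_BUDGET_MS)
print(f"sequential: {seq:.1f} fps, pipelined: {pipe:.1f} fps, latency: {lat:.0f} ms")
# prints: sequential: 5.6 fps, pipelined: 10.0 fps, latency: 180 ms
```

Under these budgets, back-to-back execution reaches only about 5.6 fps. The 10 fps target depends on overlapping the stages, and even then each frame reaches the fused geometry roughly 180 ms after capture.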
"Build the consumer-GPU constraint into the pitch from day one."
Servers were never the deployment substrate.
The single carry-forward lesson from the WTFund pitch work: the thesis line should bake the consumer-hardware deployment constraint into every architecture decision, not bolt it on at the end. The pitch's "browser-rendered, on a laptop, in real time" framing is the same constraint that ultimately reshaped the 2026 architecture work — JiT on 2×RTX 3060, MambaFlow3D designed for consumer rentable rigs, Hexplane AE single-GPU training. The Neural-SLAM exploration is where that constraint was first articulated.
The pitch was submitted to WTFund at the ₹20 L tier (~US$24K). The pitch document covered the system architecture (§01), the constituent-model benchmark sources, a proposed six-month milestone plan, a deployment narrative (browser-rendered, no server needed), and market positioning against then-current commercial offerings (Polycam, Scaniverse).
| Section | Argument | Outcome |
|---|---|---|
| Technical feasibility | Constituent-model benchmarks are consistent with the ~10 fps target | Reviewers were sceptical of the end-to-end claim without a PoC |
| Market positioning | Browser-deployable; no server cost | Reviewers noted competitive offerings already had iOS apps |
| Six-month milestone plan | PoC at month 2, mobile test at month 4, beta at month 6 | Plausible but missing a measured-latency baseline |
| Hardware budget | ₹6 L compute, ₹8 L stipend, ₹6 L contingency | Standard tier |
| Decision | Not funded | "Promising direction, request a PoC before re-applying" |
The decline was reasonable. The pitch leaned on constituent-model benchmarks rather than a measured end-to-end run; a re-application would have needed an actual prototype with a measured frame-time budget that hits the 10 fps bar. The thesis line pivoted at this point towards the PGN / sketch-to-3-D / triplane work that became the core of the 2025-Q3 to 2026-Q1 research line, rather than continuing with the video-to-3-D pitch.
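A measured baseline of the kind the reviewers asked for could start from a harness like the following minimal sketch. It times each stage per frame and checks the summed mean latency against the 10 fps bar (100 ms per frame). The stage callables are hypothetical no-op stand-ins, not real models, and the function names are assumptions for illustration. The check is the conservative, non-overlapped one: it asks whether all stages fit in one frame budget when run back to back.

```python
import time
from statistics import mean
from typing import Callable, Dict, List

TARGET_FRAME_MS = 100.0  # 10 fps bar from the pitch


def measure_stages(stages: Dict[str, Callable[[object], object]],
                   frames: List[object]) -> Dict[str, float]:
    """Run frames through the stages in order; return mean latency per stage in ms."""
    samples: Dict[str, List[float]] = {name: [] for name in stages}
    for frame in frames:
        data = frame
        for name, fn in stages.items():
            start = time.perf_counter()
            data = fn(data)  # output of one stage feeds the next
            samples[name].append((time.perf_counter() - start) * 1000.0)
    return {name: mean(vals) for name, vals in samples.items()}


def meets_target(mean_ms: Dict[str, float],
                 target_ms: float = TARGET_FRAME_MS) -> bool:
    """Sequential check: summed mean stage latencies must fit the frame budget."""
    return sum(mean_ms.values()) <= target_ms


# Hypothetical stand-in stages; a prototype would call the real models here.
placeholder_stages = {
    "monocular_depth": lambda frame: frame,
    "camera_pose": lambda depth: depth,
    "geometry_fusion": lambda fused: fused,
}

report = measure_stages(placeholder_stages, frames=list(range(20)))
for name, ms in report.items():
    print(f"{name}: {ms:.3f} ms")
print("meets 10 fps bar:", meets_target(report))
```

With real models in place of the stand-ins, the same report would show which stage dominates the frame time on the target laptop.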
Interactive demo: trace of the streaming pipeline over five sample camera positions in a chosen scene. Left pane: per-frame RGB. Centre pane: depth map for that frame. Right pane: cumulative point-cloud reconstruction as frames are added.
White paper · Neural-SLAM system design pitch · WTFund decline · consumer-hardware-constraint throughline origin