System Design Pitch · cs.CV · cs.RO · Mar 2025
Real-Time Single-Phone 3-D Reconstruction via Monocular Depth + Neural SLAM + Browser-Rendered TSDF: A System Design Pitch (WTFund Grant Application, Not Funded) and Its Influence on the Thesis-Line Consumer-Hardware Constraint
Aaditya Jain
Neural SLAM · Live 3-D Capture · Thesis-Line Architectural Premise Origin
Submitted: March 2025
Subject: cs.CV · cs.RO · cs.GR
Keywords: Neural SLAM, monocular depth, TSDF fusion, browser deployment, consumer hardware, WTFund grant
Abstract
We document the system design and the grant-pitch outcome of the March 2025 Neural-SLAM work — a proposed real-time 3-D reconstruction system that combines per-frame monocular depth estimation (DPT / Depth Anything class), Neural SLAM (NICER-SLAM class) for camera pose tracking and global geometry fusion, and TSDF integration with Rerun.io browser-rendered visualisation. The aggregate latency target was 10 fps end-to-end on a mid-range laptop with integrated GPU; the constituent-model benchmarks (DPT ~50 ms/frame, NICER-SLAM ~10 fps on RTX 3060) supported the feasibility argument but no end-to-end proof-of-concept was built. The pitch was submitted to WTFund (a Bombay / Bangalore early-stage research grant network) at the ₹20 L tier. The application was declined with feedback "promising direction, request a PoC before re-applying". The work documented here is a system-design pitch rather than an implemented system, and the contribution claimed is correspondingly modest: (i) the documented architecture for the proposed pipeline, including the constituent-model latency budget; (ii) the WTFund decline narrative and the lesson it teaches (pitch decks need measured end-to-end numbers, not summed constituent benchmarks); (iii) the thesis-line throughline this pitch originated — the consumer-hardware deployment constraint that reappears in JiT consumer-GPU reproduction [1], MambaFlow3D [2], and the Hexplane AE work [3] is first articulated here in March 2025. Keywords: Neural SLAM, monocular depth, real-time 3-D, browser rendering, consumer-hardware constraint, grant pitch declined.
1. Introduction

The thesis line's long-term ambition is single-image (or single-phone-video) to 3-D for the Maps procedural-modelling pipeline. The March 2025 Neural-SLAM work was the earliest concrete proposal in that direction, framed as "capture a building with a phone camera, see a 3-D model live in the browser". The proposal was packaged as a WTFund grant application; the application was not funded. The work survives in the thesis line as the consumer-hardware-constraint throughline that shapes every subsequent generator-design decision.

2. The Neural-SLAM Background

Classical SLAM (Simultaneous Localisation and Mapping) jointly estimates camera trajectory and scene geometry from a sequence of sensor observations. Classical approaches (ORB-SLAM, DSO) combine hand-crafted feature matching or direct photometric alignment with non-linear optimisation. Neural SLAM (NICE-SLAM, NICER-SLAM, iMAP) replaces parts of the pipeline with neural networks, typically the geometry representation (a neural implicit surface or sparse-voxel grid in place of an explicit point cloud) and sometimes the feature extraction. The trade is better behaviour on featureless / texture-less scenes, where classical feature-based SLAM struggles, in exchange for the usual neural-network costs (slower per-frame processing, GPU requirement).

For real-time single-phone reconstruction the Neural-SLAM choice is load-bearing because it provides a continuous, fillable geometry representation. Classical SLAM produces sparse landmarks (~hundreds per frame), which cannot be meshed directly; Neural SLAM produces a dense neural implicit surface from which a mesh can be extracted with marching cubes per frame.

3. Proposed System Architecture

The proposal is a four-stage streaming pipeline in which each frame of input video produces an incremental update to the reconstructed geometry:

Table 1 — Pipeline stages.
| Stage | Input | Output | Target latency | Tool |
|---|---|---|---|---|
| 1. Monocular depth | RGB frame 480 × 640 | Per-pixel depth map | < 100 ms / frame | DPT or Depth Anything (web) |
| 2. Camera pose | Depth + feature matches | SE(3) pose | < 30 ms / frame | Neural-SLAM back-end |
| 3. Geometry fusion | Depth + pose | Updated TSDF / sparse grid | < 50 ms / frame | TSDF integration (in-browser) |
| 4. Visualisation | Fused geometry | Live 3-D view | 30 fps target | Rerun.io / WebGPU |

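Aggregate latency budget: 100 + 30 + 50 = 180 ms / frame, which is about 5.6 fps if the three compute stages run back-to-back on each frame. The upper end of the ~5–10 fps range is reachable only if stages overlap across frames, so that throughput is bounded by the slowest stage (100 ms) rather than by the sum. The 10 fps target is the threshold below which the user perceives lag in the capture loop.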
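The budget arithmetic can be checked with a few lines of Python. This is only a sketch under stated assumptions: the stage names and the two helper functions are illustrative, the visualisation stage is left out because it renders at its own rate, and the "pipelined" figure assumes perfect overlap between stages with no transfer or contention overhead, which the pitch never measured.

```python
# Per-stage latency targets from Table 1, in milliseconds per frame.
# Stage 4 (visualisation) is excluded: it has a frame-rate target, not a latency budget.
STAGE_BUDGET_MS = {
    "monocular_depth": 100.0,
    "camera_pose": 30.0,
    "geometry_fusion": 50.0,
}


def sequential_fps(stage_ms):
    """Throughput when every stage finishes before the next frame starts."""
    return 1000.0 / sum(stage_ms.values())


def pipelined_fps(stage_ms):
    """Idealised throughput when stages overlap across frames.

    The slowest stage sets the rate; overheads are assumed to be zero.
    """
    return 1000.0 / max(stage_ms.values())


print(f"sequential: {sequential_fps(STAGE_BUDGET_MS):.1f} fps")  # sequential: 5.6 fps
print(f"pipelined:  {pipelined_fps(STAGE_BUDGET_MS):.1f} fps")   # pipelined:  10.0 fps
```

Under these assumptions the 10 fps target is met only in the idealised pipelined case, and with no margin on the depth stage.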

4. The Feasibility Argument (Sums of Constituent Benchmarks)

The pitch's feasibility argument relied on published per-stage benchmark numbers:

  • DPT inference at ~50 ms / frame on M2 Mac (published benchmarks).
  • NICER-SLAM at ~10 fps on RTX 3060 (paper-reported).
  • TSDF integration ~5 ms / frame for the resolution targeted (sparse 256³ at the relevant active band).
  • Rerun.io browser rendering at 30 fps for ~50 K-point clouds (visible from the Rerun.io demo gallery).

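The pitch read this sum as supporting the 10 fps aggregate target with margin, although the figures come from different hardware (an M2 Mac and an RTX 3060) than the integrated-GPU laptop the system targeted. The pitch did not include a measured end-to-end proof-of-concept; this proved to be the load-bearing weakness during review.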
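For orientation, the TSDF integration step costed in the list above follows the weighted running-average update of Curless and Levoy [7]. The sketch below applies that update along a single camera ray in plain Python. It is not the pitch's implementation (none was built); the function name `integrate_ray`, the parameters `trunc` and `max_weight`, and the unit weight per observation are illustrative assumptions.

```python
def integrate_ray(tsdf, weight, voxel_depths, observed_depth,
                  trunc=0.10, max_weight=64.0):
    """Fuse one depth observation into the voxels along a single camera ray.

    tsdf, weight   -- per-voxel lists, updated in place
    voxel_depths   -- distance of each voxel from the camera (metres)
    observed_depth -- depth measured along this ray (metres)
    trunc          -- truncation distance (metres); hypothetical default
    """
    for i, z in enumerate(voxel_depths):
        sdf = observed_depth - z           # positive in front of the surface
        if sdf < -trunc:
            continue                       # well behind the surface: unobserved
        d = min(1.0, sdf / trunc)          # truncate and normalise to [-1, 1]
        w_new = weight[i] + 1.0            # each observation has weight 1 (assumption)
        tsdf[i] = (tsdf[i] * weight[i] + d) / w_new
        weight[i] = min(w_new, max_weight)  # cap so old data can be overwritten


# Three voxels at 0.5 m, 1.0 m and 1.5 m; two slightly different depth readings.
depths = [0.5, 1.0, 1.5]
tsdf = [1.0, 1.0, 1.0]
weight = [0.0, 0.0, 0.0]
integrate_ray(tsdf, weight, depths, observed_depth=1.0, trunc=0.25)
integrate_ray(tsdf, weight, depths, observed_depth=1.125, trunc=0.25)
print(tsdf, weight)  # [1.0, 0.25, 1.0] [2.0, 2.0, 0.0]
```

The voxel at 1.0 m averages the two readings (0.0 and 0.5), the near voxel stays at the free-space value, and the far voxel is never touched because it lies behind the surface by more than the truncation distance. A real integrator would run this per pixel over a sparse 3-D grid, which is where the per-frame cost comes from.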

5. WTFund Outcome

The application was submitted at the ₹20 L tier (~$24 K USD), the early-stage research grant level appropriate to a one-developer prototype. The pitch deck covered the system architecture (§3), the constituent-benchmark feasibility argument (§4), a six-month milestone plan (PoC at month 2, mobile test at month 4, beta at month 6), and a budget breakdown (₹6 L compute, ₹8 L stipend, ₹6 L contingency).

Table 2 — Reviewer feedback.
| Section reviewed | Reviewer note |
|---|---|
| Technical feasibility | "Constituent benchmarks plausible; PoC needed before claim can be evaluated." |
| Market positioning | "Polycam and Scaniverse already ship iOS versions; differentiation insufficient." |
| Milestone plan | "Plausible but missing a measured-latency baseline." |
| Decision | Declined. "Promising direction, request a PoC before re-applying." |

The decline was reasonable. The pitch should have included a measured end-to-end number, or at minimum a measured latency for the bottleneck stage. A re-application with that number was not pursued: the thesis line pivoted at this point toward the PGN / sketch-to-3-D / triplane work that became the core research line from Q3 2025 through Q1 2026.

6. Why The Application Was Not Resubmitted

The reviewer's "PoC before re-applying" feedback was actionable: a two-week sprint to build the end-to-end PoC, measure the actual frame time, and resubmit. That sprint was not undertaken. The reason recorded at the time: the thesis line was beginning to crystallise around the PGN / procedural-modelling direction (Topics 14, 16, 38, 40), which had concrete deliverables and an internal stakeholder (Maps) that the WTFund pitch did not have. Spending two weeks on a PoC for a pitch that had been declined once was a worse use of time than two weeks on the procedural-modelling line.

This was the correct call in retrospect: the procedural-modelling thesis line produced PGN, SketchProc3D, SculptNet, MambaFlow3D, Hexplane AE, and the rest of the substantive thesis-line output. The Neural-SLAM direction would have produced one PoC of an idea that already had commercial competitors (Polycam, Scaniverse) with iOS apps and team funding. The opportunity-cost analysis favoured the pivot.

7. The Thesis-Line Throughline

The single surviving contribution of this work to the rest of the thesis line is the consumer-hardware deployment constraint. The pitch's framing ("real-time, on a laptop, no server") is the same constraint that ultimately shapes the 2026 architectural decisions: JiT trained on 2 × RTX 3060 [1] rather than reproduced on 8 × H200; MambaFlow3D designed for the same 2 × RTX 3060 substrate [2]; Hexplane AE trained on a single RTX 3060 [3]. Topic 9 is where this constraint was first written down.

8. The Generalisable Lesson

Pitch decks need measured end-to-end numbers, not summed constituent benchmarks; the WTFund decline turned on the absence of one. The lesson was carried forward as a rule for the subsequent thesis-line work: every paper that touches deployment hardware ships with measured wall-clock and measured memory numbers rather than back-of-envelope sums (JiT [1] Tables 2 and 4; MambaFlow3D [2] Table 1).
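A measured end-to-end number of the kind the reviewers asked for needs very little machinery. The sketch below is one possible harness, not something taken from the pitch or the later papers: `measure_end_to_end`, `summarise`, and `fake_stage` are hypothetical names, and the sleeping stand-in stages only mimic real depth, pose, and fusion calls.

```python
import statistics
import time


def measure_end_to_end(stages, frames, warmup=2):
    """Push each frame through all stages in order; return wall-clock ms per frame.

    The first `warmup` frames are run but not recorded, so one-off costs
    (model loading, cache warm-up) do not skew the result.
    """
    frame_ms = []
    for index, frame in enumerate(frames):
        start = time.perf_counter()
        data = frame
        for stage in stages:
            data = stage(data)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if index >= warmup:
            frame_ms.append(elapsed_ms)
    return frame_ms


def summarise(frame_ms):
    """Report the median and worst-case frame time and the implied frame rate."""
    median = statistics.median(frame_ms)
    return {
        "median_ms": median,
        "max_ms": max(frame_ms),
        "fps_at_median": 1000.0 / median,
    }


def fake_stage(delay_s):
    """Stand-in for a real pipeline stage: sleeps, then passes the data through."""
    def run(data):
        time.sleep(delay_s)
        return data
    return run


# Two stand-in stages of 2 ms and 1 ms, timed over ten dummy frames.
stages = [fake_stage(0.002), fake_stage(0.001)]
report = summarise(measure_end_to_end(stages, frames=range(10)))
print(report)
```

Timing the whole loop per frame, rather than adding up per-stage figures, captures hand-off and scheduling costs that a summed budget leaves out.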

9. Conclusion

The Neural-SLAM pitch was declined; the proposed system was not built. The work survives as the origin of the consumer-hardware constraint that shapes the rest of the thesis line.

References
[1] Jain, A. "Training JiT Diffusion on Two Consumer GPUs." Thesis research, Nov 2025. /whitepaper/jit-diffusion
[2] Jain, A. "MambaFlow3D." Thesis research, Nov 2025. /whitepaper/mambaflow3d
[3] Jain, A. "Hexplane Autoencoder." Thesis research, Dec 2025. /whitepaper/hexplane-ae
[4] Ranftl, R., Bochkovskiy, A., Koltun, V. "Vision Transformers for Dense Prediction (DPT)." ICCV, 2021.
[5] Yang, L. et al. "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data." 2024.
[6] Zhu, Z. et al. "NICER-SLAM: Neural Implicit Scene Encoding for RGB SLAM." 3DV, 2023.
[7] Curless, B., Levoy, M. "A Volumetric Method for Building Complex Models from Range Images." SIGGRAPH, 1996. TSDF.
[8] Rerun.io. "Multi-Modal Logging and Visualisation for ML and Robotics." 2024.