The thesis line's long-term ambition is single-image (or single-phone-video) to 3-D for the Maps procedural-modelling pipeline. The March 2025 Neural-SLAM work was the earliest concrete proposal in that direction, framed as "capture a building with a phone camera, see a 3-D model live in the browser". The proposal was packaged as a WTFund grant application; the application was not funded. The work survives in the thesis line as the consumer-hardware-constraint throughline that shapes every subsequent generator-design decision.
SLAM (Simultaneous Localisation and Mapping) jointly estimates camera trajectory and scene geometry from a sequence of sensor observations. Classical approaches (ORB-SLAM, DSO) use hand-crafted feature extraction and non-linear optimisation. Neural SLAM (NICE-SLAM, NICER-SLAM, iMAP) replaces parts of the pipeline with neural networks: typically the geometry representation (a neural implicit surface or sparse-voxel grid in place of an explicit point cloud) and sometimes the feature extraction. The trade-off is that Neural SLAM gains robustness on featureless / texture-less scenes, where classical feature-based tracking tends to fail, and pays for it with the usual neural-network costs (slower per-frame processing, GPU required).
For real-time single-phone reconstruction the Neural-SLAM choice is load-bearing because it provides a continuous, fillable geometry representation. Classical SLAM produces sparse landmarks (~hundreds per frame), which cannot be meshed directly; Neural SLAM produces a dense neural-implicit-surface representation from which a mesh can be extracted with marching cubes on every frame.
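To make the "mesh-extractable" point concrete, here is a minimal sketch of the vertex-placement step that marching cubes relies on: sample an implicit signed-distance field on a regular grid, find grid edges where the sign flips, and linearly interpolate the zero crossing. It is not full marching cubes (no triangle table), and the analytic `sphere_sdf` is only a stand-in assumption for a learned implicit surface; all function names are hypothetical.

```python
import math


def sphere_sdf(x, y, z, radius=1.0):
    """Signed distance to a sphere at the origin (negative inside, positive outside).

    Stand-in for a neural implicit surface; an assumption for illustration only.
    """
    return math.sqrt(x * x + y * y + z * z) - radius


def edge_crossing(p0, p1, f0, f1):
    """Linearly interpolate the point on edge p0 -> p1 where the field is zero."""
    t = f0 / (f0 - f1)
    return tuple(a + t * (b - a) for a, b in zip(p0, p1))


def surface_points_along_x(sdf, lo=-1.5, hi=1.5, n=7):
    """Sample sdf on an n^3 grid and return zero crossings on x-aligned edges."""
    step = (hi - lo) / (n - 1)
    coords = [lo + i * step for i in range(n)]
    points = []
    for y in coords:
        for z in coords:
            for i in range(n - 1):
                p0, p1 = (coords[i], y, z), (coords[i + 1], y, z)
                f0, f1 = sdf(*p0), sdf(*p1)
                if (f0 < 0) != (f1 < 0):  # sign change: the surface cuts this edge
                    points.append(edge_crossing(p0, p1, f0, f1))
    return points


if __name__ == "__main__":
    pts = surface_points_along_x(sphere_sdf)
    worst = max(abs(sphere_sdf(*p)) for p in pts)
    print(f"{len(pts)} surface points, worst distance from true surface {worst:.4f}")
```

A sparse landmark set offers no such field to sample, which is why it cannot be meshed the same way.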
The proposed system was a four-stage streaming pipeline in which each frame of input video produces an incremental update to the reconstructed geometry:
| Stage | Input | Output | Target latency | Tool |
|---|---|---|---|---|
| 1. Monocular depth | RGB frame 480 × 640 | Per-pixel depth map | < 100 ms / frame | DPT or Depth Anything (web) |
| 2. Camera pose | Depth + feature matches | SE(3) pose | < 30 ms / frame | Neural-SLAM back-end |
| 3. Geometry fusion | Depth + pose | Updated TSDF / sparse grid | < 50 ms / frame | TSDF integration (in-browser) |
| 4. Visualisation | Fused geometry | Live 3-D view | 30 fps target | Rerun.io / WebGPU |
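Stage 3 in the table can be illustrated with the standard weighted running-average TSDF update. The sketch below fuses depth observations into voxels along a single camera ray. It is a simplification under stated assumptions (one ray instead of a projected 3-D grid, unit weight per observation, a hypothetical `integrate_tsdf` helper); the pitch's system was never built, so this is not its implementation.

```python
def integrate_tsdf(tsdf, weights, voxel_depths, observed_depth, trunc=0.1):
    """Fuse one depth observation into voxels lying along a single camera ray.

    tsdf, weights: per-voxel running state, updated in place.
    voxel_depths:  distance of each voxel centre from the camera along the ray.
    trunc:         truncation distance, in the same units as the depths.
    """
    for i, z in enumerate(voxel_depths):
        sdf = observed_depth - z           # positive in front of the surface
        if sdf < -trunc:
            continue                       # far behind the surface: occluded, skip
        d = min(1.0, sdf / trunc)          # truncate and normalise to [-1, 1]
        w = weights[i]
        tsdf[i] = (w * tsdf[i] + d) / (w + 1.0)  # weighted running average
        weights[i] = w + 1.0
    return tsdf, weights


if __name__ == "__main__":
    depths = [0.5, 0.95, 1.0, 1.05, 1.5]
    tsdf = [0.0] * len(depths)
    weights = [0.0] * len(depths)
    for observation in (1.0, 1.0):         # two frames that agree on the depth
        integrate_tsdf(tsdf, weights, depths, observation)
    print(tsdf, weights)
```

The fused surface sits where the TSDF changes sign, and voxels far behind the observed depth keep zero weight because the camera never saw them.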
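Aggregate latency budget for stages 1–3: 100 + 30 + 50 = 180 ms / frame. Run back to back at these ceilings, that is 1000 / 180 ≈ 5.6 fps; the upper end of the ~5–10 fps range requires either overlapping the stages across frames, so that throughput is bounded by the slowest stage (100 ms, i.e. 10 fps), or stages that finish well under budget. The 10 fps target is the threshold below which the user perceives lag in the capture loop.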
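The budget arithmetic can be checked in a few lines. The sketch below uses the per-stage ceilings from the table and compares two idealised cases: stages run strictly one after another, and stages overlapped across frames so that throughput is set by the slowest stage. The function names are hypothetical, and the pipelined figure ignores hand-off overhead and contention for a shared GPU.

```python
# Per-stage latency ceilings from the pipeline table, in milliseconds per frame.
STAGE_BUDGET_MS = {
    "monocular_depth": 100.0,
    "camera_pose": 30.0,
    "geometry_fusion": 50.0,
}


def sequential_fps(stage_ms):
    """Frames per second when each frame passes through every stage in turn."""
    return 1000.0 / sum(stage_ms.values())


def pipelined_fps(stage_ms):
    """Idealised frames per second when stages overlap across frames.

    Throughput is then limited by the slowest stage alone.
    """
    return 1000.0 / max(stage_ms.values())


if __name__ == "__main__":
    print(f"sequential: {sequential_fps(STAGE_BUDGET_MS):.1f} fps")
    print(f"pipelined:  {pipelined_fps(STAGE_BUDGET_MS):.1f} fps")
```

At the stated ceilings only the pipelined case reaches the 10 fps threshold, and it does so with no margin.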
The pitch's feasibility argument relied on published per-stage benchmark numbers:
The pitch argued that the sum of these constituent numbers supports the 10 fps aggregate target with margin. It did not commit to a measured end-to-end proof-of-concept; this proved to be the load-bearing weakness during review.
Submitted at the ₹20 L tier (~$24 K USD) — the early-stage research grant level appropriate to a one-developer prototype. The pitch deck covered system architecture (§2), the constituent-benchmark feasibility argument (§3), a six-month milestone plan (PoC at month 2, mobile test at month 4, beta at month 6), and a budget breakdown (₹6 L compute, ₹8 L stipend, ₹6 L contingency).
| Section reviewed | Reviewer note |
|---|---|
| Technical feasibility | "Constituent benchmarks plausible; PoC needed before claim can be evaluated." |
| Market positioning | "Polycam and Scaniverse already ship iOS versions; differentiation insufficient." |
| Milestone plan | "Plausible but missing a measured-latency baseline." |
| Decision | Declined. "Promising direction, request a PoC before re-applying." |
The decline was reasonable. The pitch should have included a measured end-to-end number on at least the bottleneck stage. A re-application with that number was not pursued — the thesis line pivoted at this point toward the PGN / sketch-to-3-D / triplane work that became the core research line through Q3 2025 – Q1 2026.
The reviewer's "PoC before reapplying" feedback was actionable — a two-week sprint to build the end-to-end PoC, measure the actual frame-time, and resubmit. That sprint was not undertaken. The reason recorded at the time: the thesis line was beginning to crystallise around the PGN / procedural-modelling direction (Topics 14, 16, 38, 40), which had concrete deliverables and an internal stakeholder (Maps) that the WTFund pitch did not have. Spending two weeks on a PoC for a pitch that had been declined once was a worse use of time than two weeks on the procedural-modelling line.
This was the correct call in retrospect: the procedural-modelling thesis line produced PGN, SketchProc3D, SculptNet, MambaFlow3D, Hexplane AE, and the rest of the substantive thesis-line output. The Neural-SLAM direction would have produced one PoC of an idea that already had commercial competitors (Polycam, Scaniverse) with iOS apps and team funding. The opportunity-cost analysis favoured the pivot.
The single surviving contribution of this work to the rest of the thesis line: the consumer-hardware deployment constraint. The pitch's framing — "real-time, on a laptop, no server" — is the same constraint that ultimately shapes the 2026 architectural decisions: JiT trained on 2 × RTX 3060 [1] rather than reproduced on 8 × H200; MambaFlow3D designed for the same 2 × RTX 3060 substrate [2]; Hexplane AE trained on a single RTX 3060 [3]. Topic 9 is where this constraint was first written down.
Pitch decks need measured end-to-end numbers, not summed constituent benchmarks. The WTFund decline turned on the absence of one. Carried forward as a feedback rule for the subsequent thesis-line work — every paper that touches deployment hardware ships with measured-wallclock and measured-memory numbers rather than back-of-envelope sums (JiT [1] Tables 2 and 4; MambaFlow3D [2] Table 1).
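As an illustration of what "measured, not summed" means in practice, here is a minimal timing harness built on the standard library's `time.perf_counter` and `tracemalloc`. Everything pipeline-specific is an assumption: `measure_pipeline` is a hypothetical helper, and the stage callables are trivial placeholders standing in for real depth, pose, and fusion code. Note that `tracemalloc` only sees Python-heap allocations, not GPU memory, and that tracing itself adds overhead to the timings.

```python
import time
import tracemalloc


def measure_pipeline(stages, frames):
    """Time a list of (name, callable) stages over an iterable of frames.

    Returns mean per-stage and end-to-end milliseconds per frame, plus the
    peak Python-heap allocation observed during the run.
    """
    per_stage_ms = {name: 0.0 for name, _ in stages}
    n_frames = 0
    tracemalloc.start()
    run_start = time.perf_counter()
    for frame in frames:
        data = frame
        for name, fn in stages:
            t0 = time.perf_counter()
            data = fn(data)  # each stage consumes the previous stage's output
            per_stage_ms[name] += (time.perf_counter() - t0) * 1000.0
        n_frames += 1
    total_ms = (time.perf_counter() - run_start) * 1000.0
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    if n_frames == 0:
        raise ValueError("need at least one frame to measure")
    return {
        "frames": n_frames,
        "stage_ms_per_frame": {k: v / n_frames for k, v in per_stage_ms.items()},
        "end_to_end_ms_per_frame": total_ms / n_frames,
        "peak_python_bytes": peak_bytes,
    }


if __name__ == "__main__":
    # Placeholder stages (assumptions for illustration, not real models).
    placeholder_stages = [
        ("monocular_depth", lambda frame: [px * 0.5 for px in frame]),
        ("camera_pose", lambda depth: (depth, sum(depth))),
        ("geometry_fusion", lambda depth_pose: len(depth_pose[0])),
    ]
    fake_frames = [[1.0] * 1000 for _ in range(5)]
    print(measure_pipeline(placeholder_stages, fake_frames))
```

The end-to-end figure includes hand-off overhead between stages, which is exactly the part a sum of published per-stage benchmarks leaves out.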
The Neural-SLAM pitch was declined; the proposed system was not built. The work survives as the origin of the consumer-hardware constraint that shapes the rest of the thesis line.