Research Timeline · Aditya Jain / Apple Maps · 3D Reconstruction
Topic 26 · Nov 2025 · Mamba · Flow Matching · Sparse Voxels

MambaFlow3D —
Sparse-Voxel 3-D Generation.

The follow-on to the MNIST validation (Topic 25) — scaling Pure Mamba from 2-D digits to 3-D point clouds and onward to sparse voxels, paired with flow matching in latent space. Reference targets are SparC3D and TRELLIS; the design goal is a 2–3× training speed-up and a 5–12× inference speed-up by replacing transformer attention with linear-time state-space blocks. Documents the architecture spec, the point-cloud Phase-2 validation on ModelNet10, and the channel-mismatch bug that held up first training.

00 — Motivation

Replicate SparC3D / TRELLIS, but faster on consumer GPUs.

The thesis targets fast single-image-to-3-D — train on a consumer rig, infer at interactive rates. The published references at the time of this work — SparC3D and Microsoft's TRELLIS — both use transformer attention over sparse voxel token sequences. The token counts are short enough (~100–200 sparse cubes per scene) that transformer attention is not catastrophic, but the cost is still quadratic, and the per-step inference latency on consumer GPUs is the bottleneck for the interactive use-case the thesis targets.

The Mamba state-space block [1,2] is the obvious substitute. It is linear-time in sequence length, has constant memory under autoregressive rollouts, and on the MNIST validation (Topic 25) it matched or beat the transformer baseline on visual quality despite a slightly worse loss number. The question Topic 26 answers: does the Mamba-substitution survive the jump from 2-D MNIST to 3-D point clouds, and is the architecture spec coherent enough to commit to a full SparC3D-class implementation?

The honest scope: this is a design + Phase-2 validation page. The full SparC3D-class implementation is not yet built; what is built is the point-cloud generator end-to-end, the architecture spec for the sparse-voxel extension, and the documented bring-up on ModelNet10 including the PointNet++ channel-mismatch fix that blocked the first training run.

What it informs
The design and the partial point-cloud result feed two downstream decisions. (1) Whether the Mamba substitution survives at higher token counts. The MNIST result was on 196 tokens (14×14 patches); the ModelNet10 result is on 32 latent tokens; the SparC3D scale is 100–200 cubes. The trend across these three is the open question. (2) Whether flow matching in latent space (after PointNet++ encoding) is stable enough to extend to sparse-cube tokens, or whether the latent-space flow-matching prior from the MNIST topic needs to be re-tuned at 3-D scale.
01 — Architecture

PointNet++ → 10 Pure-Mamba blocks → flow matching → FoldingNet.

The Phase-2 architecture is end-to-end deterministic in the encoder and decoder, with the generative work happening in latent space. The forward path is:

Input : point cloud (B, 2048, 3) — ModelNet10 sample
PointNet++ : (B, 2048, 3) → (B, 32, 256) — set abstraction, hierarchical
Pure-Mamba ×10 : (B, 32, 256) → (B, 32, 256) — d_model=256, d_state=128
Flow matching : noise → latent, 50 steps — in (B, 32, 256) space
FoldingNet : (B, 32, 256) → (B, 2048, 3) — 2-D grid folding twice
Output : point cloud (B, 2048, 3) — reconstructed/generated

PointNet++ encoder. Three set-abstraction layers bring the 2 048 input points down to 32 hierarchical features at 256 dimensions each. The three SA layers are (2048→512, mlp=[32,32,64]), (512→128, mlp=[64,64,128]), and (128→32, mlp=[128,128,256]). The first layer takes no input features (only xyz coordinates); the next two take the previous layer's output. This is where the channel-mismatch bug appeared (§04).

Pure-Mamba body. Ten blocks at d_model = 256, d_state = 128, run over the 32 latent tokens. Pure-Mamba (not the Mamba+Attention hybrid) was the winner of the MNIST validation in Topic 25 on the speed–quality trade-off, so it is the substrate carried forward. The 10-block depth is taken from the MNIST setup; whether this is the right depth at 3-D scale is open.
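For concreteness, a minimal sketch of the Mamba body under the configuration above — assuming the mamba_ssm package's Mamba block; the pre-norm residual wiring shown is illustrative, not the exact Topic-25 block definition.

import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency


class MambaBody(nn.Module):
    """Ten state-space blocks with pre-norm residuals over (B, 32, 256) tokens."""

    def __init__(self, d_model: int = 256, d_state: int = 128, n_blocks: int = 10):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_blocks)])
        self.blocks = nn.ModuleList(
            [Mamba(d_model=d_model, d_state=d_state) for _ in range(n_blocks)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 32, 256)
        for norm, block in zip(self.norms, self.blocks):
            x = x + block(norm(x))  # linear-time scan over the 32 latent tokens
        return x


# z = MambaBody()(torch.randn(16, 32, 256))  # -> (16, 32, 256)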

Flow matching in latent space. The trajectory is a linear interpolation between noise and the latent code; the network predicts the velocity field v; the loss is ‖v − v̂‖². Sampling is a 50-step Euler integration from z₀ ∼ 𝒩(0, I) to the generated latent. Fifty steps is the nominal setting; the target band is 20–50 steps once the model is trained.
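A minimal sketch of the objective and sampler as described — linear-interpolation trajectory, velocity regression, Euler integration. velocity_net is a placeholder name for the latent-conditioned velocity MLP; its signature here is an assumption, not the flow_model.py interface.

import torch


def flow_matching_loss(velocity_net, z1: torch.Tensor) -> torch.Tensor:
    """z1: clean latent (B, 32, 256) from the PointNet++ encoder."""
    z0 = torch.randn_like(z1)                        # noise endpoint
    t = torch.rand(z1.shape[0], 1, 1, device=z1.device)
    zt = (1 - t) * z0 + t * z1                       # linear interpolation path
    v_target = z1 - z0                               # constant velocity of that path
    v_pred = velocity_net(zt, t.squeeze())           # predicted velocity field
    return ((v_pred - v_target) ** 2).mean()         # ‖v − v̂‖²


@torch.no_grad()
def sample_latent(velocity_net, shape=(1, 32, 256), steps: int = 50, device="cuda"):
    """Euler integration from z0 ~ N(0, I) to a generated latent."""
    z = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        z = z + dt * velocity_net(z, t)
    return z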

FoldingNet decoder. Two-step folding: the latent is broadcast across a fixed 2-D 45 × 45 ≈ 2 048 grid, the grid is concatenated with the latent and passed through a small MLP to produce the first folded point cloud, then concatenated again and folded a second time to produce the output point cloud. This is the standard FoldingNet pattern.
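A minimal sketch of the two-fold pattern, under one assumption not stated above: the 32 latent tokens are mean-pooled into a single code before being broadcast across the grid. Layer widths and class names are illustrative, not the Phase-2 configuration.

import torch
import torch.nn as nn


class FoldingDecoder(nn.Module):
    """Two-step FoldingNet-style decoder over a fixed 45x45 2-D grid."""

    def __init__(self, latent_dim: int = 256, grid_size: int = 45):
        super().__init__()
        u = torch.linspace(-1, 1, grid_size)
        grid = torch.stack(torch.meshgrid(u, u, indexing="ij"), dim=-1).reshape(-1, 2)
        self.register_buffer("grid", grid)            # (2025, 2) fixed folding grid
        self.fold1 = nn.Sequential(nn.Linear(latent_dim + 2, 256), nn.ReLU(),
                                   nn.Linear(256, 3))
        self.fold2 = nn.Sequential(nn.Linear(latent_dim + 3, 256), nn.ReLU(),
                                   nn.Linear(256, 3))

    def forward(self, latent: torch.Tensor) -> torch.Tensor:  # latent: (B, 32, 256)
        code = latent.mean(dim=1)                              # pool tokens -> (B, 256)
        B, N = code.shape[0], self.grid.shape[0]
        code = code.unsqueeze(1).expand(B, N, -1)              # broadcast across grid
        grid = self.grid.unsqueeze(0).expand(B, -1, -1)
        pts = self.fold1(torch.cat([code, grid], dim=-1))      # first fold: grid -> 3-D
        pts = self.fold2(torch.cat([code, pts], dim=-1))       # second fold: refine
        return pts                                             # (B, 2025, 3) ≈ 2 048 points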

Component | Spec | Output shape | Param count
PointNet++ encoder | 3 SA layers, MLP widths 32/64/128/128/256 | (B, 32, 256) | ~1.2 M
Pure-Mamba ×10 | d_model=256, d_state=128 | (B, 32, 256) | ~5.1 M
Flow-matching head | Velocity MLP, latent-conditioned | (B, 32, 256) | ~0.4 M
FoldingNet decoder | Two folds over 45×45 grid | (B, 2048, 3) | ~0.55 M

Total: ~7.25 M parameters — confirmed by parameter count at first training launch.

Pipeline

Single-image-to-3-D, MambaFlow3D arrangement.

The Phase-2 work documented here is the point-cloud generator — PointNet++ encoder + Pure-Mamba + flow matching + FoldingNet. The full Phase-3 / Phase-4 system swaps in a sparse-voxel tokeniser (SparseCubes, ~100–200 tokens per scene) and an image-conditioning branch (ViT image encoder, cross-attention into the Mamba blocks). The diagram below shows both phases.

[Pipeline diagram] Phase 2 (this page): point cloud (B, 2048, 3) → PointNet++ → (B, 32, 256) → Pure-Mamba ×10 (d_model=256) → flow matching (50 steps, latent space) → FoldingNet → (B, 2048, 3). Phase 3+: image (B, 3, 256²) → ViT encoder → SparseCubes (~100–200 tokens), i.e. sparse voxels plus image conditioning feeding the same backbone.
02 — Target Metrics

2–3× train speed-up, 5–12× inference speed-up vs SparC3D.

The target speed-up numbers are derived from two sources: the measured Mamba-vs-transformer ratios reported in [2], and an estimate of the inference advantage flow matching at 20–50 sampling steps gives over DDPM at 250–1 000 steps.

Metric | SparC3D / TRELLIS reference | MambaFlow3D target | Source of ratio
Training step time (sparse-cube scale) | 1.0× | 0.33–0.5× | Mamba-2 fast-path on Ampere; [2]
Sampling steps | ~250 DDPM / ~50 if v-pred | 20–50 (flow matching) | FM scheduler choice
Per-step inference time | 1.0× | 0.4–0.6× (linear-time block) | Mamba sequential rollout
End-to-end inference latency | 1.0× | 0.08–0.18× (5–12× faster) | Compound of the two
Single-image-to-3-D quality target | SparC3D-class | SparC3D-class @ 5× faster | Speed, not quality, is the contribution

The 5–12× inference-latency range is wide because two of the constituent factors (Mamba per-step speed-up on Ampere, FM step count that retains quality) are not yet measured at the SparC3D-scale token count. The 5× lower bound assumes the modest end of each; the 12× upper bound assumes the optimistic end of each. Neither has been verified at 3-D scale.

Core Insight

Pure Mamba won on MNIST.
Then a channel mismatch killed the first launch.

The architectural decision (Pure Mamba over Hybrid Mamba+Attention) was already settled by the Topic-25 MNIST validation. The Phase-2 bring-up on ModelNet10 was supposed to be a re-tune of known-good components — and largely was, except that the feature concatenation in one of the encoder's PointNet++ set-abstraction layers produced a channel count the conv after it did not expect. The fix was four characters in one MLP head; the diagnosis took an afternoon because the symptom (a channel error three layers deep) did not point at the SA-layer feature concatenation as the cause.

03 — Phase-2 Build

ModelNet10 dataset, 5 files, 7.25 M parameters.

The Phase-2 implementation is five files: flow_model.py (PointNet++ + Mamba + FoldingNet, the network), train.py (flow-matching loss, AdamW, gradient clipping), modelnet_loader.py (ModelNet10 dataloader, unit-sphere normalisation, 2 048-point sampling), evaluate.py (Chamfer Distance + 3-D plot), benchmark.py (timing harness for the 30 s / epoch target). All five live under 3d_point_cloud_Mamba/.
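A minimal sketch of the modelnet_loader.py preprocessing named above — unit-sphere normalisation plus farthest-point sampling down to 2 048 points. Function names and the greedy FPS formulation are illustrative assumptions, not the repository code.

import numpy as np


def normalize_to_unit_sphere(points: np.ndarray) -> np.ndarray:
    """points: (N, 3) raw mesh samples; centre at the origin, scale to radius <= 1."""
    points = points - points.mean(axis=0, keepdims=True)
    scale = np.linalg.norm(points, axis=1).max()
    return points / scale


def farthest_point_sample(points: np.ndarray, k: int = 2048) -> np.ndarray:
    """Greedy FPS: repeatedly pick the point farthest from everything chosen so far.
    Assumes points.shape[0] >= k."""
    n = points.shape[0]
    chosen = np.zeros(k, dtype=np.int64)
    min_dist = np.full(n, np.inf)
    chosen[0] = np.random.randint(n)
    for i in range(1, k):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        min_dist = np.minimum(min_dist, d)        # distance to the nearest chosen point
        chosen[i] = min_dist.argmax()             # next sample = farthest remaining point
    return points[chosen]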

Setting | Value | Notes
Dataset | ModelNet10 | 10 categories, ~5 K shapes total (3 991 train / 908 test)
Points per shape | 2 048 | Fixed; sampled with farthest-point sampling at load
Batch size | 16–32 | 16 confirmed stable on 12 GB; 24 if memory headroom holds
Optimiser | AdamW, lr 1×10⁻⁴ | Gradient clipping at 1.0
Epochs (planned) | 50 | ~20 min wall-clock target
Memory target | < 8 GB / GPU | Confirmed at ~6.5 GB peak
Epoch-time target | < 30 s | Confirmed at ~25 s once channel bug resolved
Quality target | Chamfer < 0.01 | Phase-2 success criterion; not yet measured at full convergence
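The quality criterion in the table is the symmetric Chamfer Distance computed by evaluate.py. A minimal sketch of one common formulation — squared nearest-neighbour distances averaged in both directions; the exact convention (squared vs unsquared, sum vs mean) shifts where the 0.01 threshold sits, so treat this as an assumption rather than the repository's definition.

import torch


def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """a: (B, N, 3) predicted cloud, b: (B, M, 3) reference cloud."""
    d = torch.cdist(a, b)                                   # (B, N, M) pairwise L2 distances
    a_to_b = d.min(dim=2).values.square().mean(dim=1)       # nearest neighbour in b, per point of a
    b_to_a = d.min(dim=1).values.square().mean(dim=1)       # nearest neighbour in a, per point of b
    return (a_to_b + b_to_a).mean()                         # symmetric, averaged over the batch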
04 — Channel-Mismatch Diagnosis

First training launch crashed at step 0. Cause was three layers up.

The launch crash was at the first forward pass through the encoder. The traceback bottomed out at a conv layer in the second set-abstraction (SA) layer of PointNet++ expecting 64 input channels and receiving 67. The surplus of exactly three channels is the giveaway: the SA layer concatenates the grouped xyz coordinates (3 channels) onto the input features before the MLP, so the actual feature count entering each SA layer's MLP is in_channel + 3, not in_channel. The reference PointNet++ implementations handle this with an in_channel + 3 offset in the layer constructor; the initial code here did not.

SA layer | Declared in_channel | Concatenated to MLP input | Fix
SA1 (2048 → 512) | 3 (xyz only) | 0 input features + 3 xyz = 3 → MLP expects 3 | Set in_channel = 0 (no input features); MLP first layer takes 3
SA2 (512 → 128) | 64 | 64 features + 3 xyz = 67 → MLP expects 64 | MLP first layer takes in_channel + 3 = 67
SA3 (128 → 32) | 128 | 128 features + 3 xyz = 131 → MLP expects 128 | MLP first layer takes in_channel + 3 = 131

Once the offset was applied uniformly to the three SA layers, the forward pass ran clean and training proceeded. The lesson, also saved as a feedback memory: PointNet++ derivative implementations are an old-favourite source of channel-count off-by-three bugs; the SA layer's in_channel argument means "input features", not "input features + xyz", and the +3 has to be added at the MLP head.
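A minimal sketch of the corrected construction — the SA MLP's first conv takes in_channel + 3 inputs because the grouped xyz offsets are concatenated onto the features. Function and argument names are illustrative, not the exact flow_model.py code.

import torch.nn as nn


def sa_mlp(in_channel: int, widths) -> nn.Sequential:
    """Build one SA layer's MLP; in_channel means input *features* only."""
    layers, last = [], in_channel + 3          # +3 for the concatenated grouped xyz
    for w in widths:
        layers += [nn.Conv2d(last, w, kernel_size=1), nn.BatchNorm2d(w), nn.ReLU()]
        last = w
    return nn.Sequential(*layers)


# SA1: no input features, xyz only  -> first conv takes 0 + 3 = 3
# SA2: 64-channel features from SA1 -> first conv takes 64 + 3 = 67
# SA3: 128-channel features from SA2 -> first conv takes 128 + 3 = 131
sa1 = sa_mlp(0, [32, 32, 64])
sa2 = sa_mlp(64, [64, 64, 128])
sa3 = sa_mlp(128, [128, 128, 256])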

Interactive Demo · Live

Step through the flow-matching trajectory in the latent space. Pick a target ModelNet10 category — chair, table, or airplane — then advance through the 10-step trajectory to see the generated point cloud emerge from noise. The middle pane shows the 32-token latent state at the current step; the right pane shows the decoded point cloud.

Panes: 01 — latent noise (click to re-seed; category: chair) · 02 — 32 latent tokens × 256 dims (step 0 / 10) · 03 — decoded point cloud, 2 048 points (drag to rotate)

Full Technical Paper

arXiv-format write-up · MambaFlow3D architecture spec · Mamba ↔ flow-matching coupling · target speed-ups · Phase-2 validation

Read Paper →
Related Thesis Chapters
JiT Diffusion — Consumer-GPU Training
Establishes the ViT + x-prediction baseline that the MambaFlow3D speed-up claim is measured against. Same hardware substrate (2 × RTX 3060), different backbone (transformer vs Mamba).
Hexplane Autoencoder
Companion architecture experiment — the deterministic-AE result there is the reconstruction substrate that a future MambaFlow3D-on-hexplane variant would generate into.
Hierarchical Part-Based Triplane
Downstream consumer of the MambaFlow3D generator — generating triplanes per part rather than full scenes is the canonical follow-on application once the sparse-cube backbone is validated.
Appendix — Raw Materials
Transcripts & Source References
Restricted Access