Single-image-to-3-D generation has converged on a small set of architectural patterns. SparC3D [3] tokenises a sparse voxel grid into ~100–200 cubes per scene and applies transformer attention; TRELLIS [4] uses a similar sparse-cube tokenisation with a transformer body for the autoregressive structure-then-appearance pass. Both deliver high-quality reconstructions; both are slow at inference on consumer GPUs because the transformer body is quadratic in token count and the diffusion sampler runs 100–250 steps for quality outputs.
This paper specifies MambaFlow3D, a variant that targets the speed bottleneck rather than the quality ceiling. The architectural premises: (i) the Mamba state-space block [1,2] can replace transformer attention at the token counts SparC3D uses, with linear-time scaling and a constant-memory sequential rollout; (ii) flow matching [5] can replace the long-step DDPM sampler with a 20–50-step ODE integrator without quality degradation, as the MNIST validation in the parallel thesis line confirmed. The paper does not propose either component in isolation; it specifies how they couple, what the parameter and speed-up budgets look like, and what the Phase-2 ModelNet10 implementation revealed.
The contributions are: (1) the architecture spec — PointNet++ → 10 Pure-Mamba blocks → FM-head → FoldingNet, 7.25 M parameters, Phase-2 on point clouds with planned Phase-3 swap to sparse voxels; (2) the speed-up budget — 2–3× training and 5–12× inference end-to-end vs the SparC3D reference, with constituent ratios sourced; (3) the ModelNet10 Phase-2 bring-up log, particularly the PointNet++ +3 xyz-offset channel-mismatch trap.
For a point-cloud input x ∈ ℝ^(B × 2048 × 3), the deterministic forward pass is PointNet++ encoder → ten Pure-Mamba blocks → latent z → FoldingNet decoder, mapping the input back to a 2 048-point cloud.
For generation, a flow-matching head replaces the deterministic decoder forward by predicting a velocity field v(z_t, t) over the latent and integrating from z₀ ∼ 𝒩(0, I) to z₁ over 20–50 steps. The FoldingNet decoder then maps the generated z₁ back to the point-cloud output.
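A shape-level PyTorch sketch of that pipeline. The module names (MambaFlow3D, encoder, mamba_stack, flow_head, decoder) and the mean-pooling of the Mamba tokens into a single decoder latent are assumptions for illustration, not the flow_model.py layout:

```python
import torch
import torch.nn as nn

class MambaFlow3D(nn.Module):
    """Deterministic path: points -> PointNet++ tokens -> Mamba body -> latent -> FoldingNet.
    For generation, the flow-matching head's ODE integration supplies the latent instead."""
    def __init__(self, encoder, mamba_stack, flow_head, decoder):
        super().__init__()
        self.encoder = encoder          # PointNet++: (B, 2048, 3) -> (B, L, 256) tokens
        self.mamba_stack = mamba_stack  # 10 Pure-Mamba blocks, d_model=256, d_state=128
        self.flow_head = flow_head      # velocity predictor v(z_t, t) over the latent
        self.decoder = decoder          # FoldingNet: (B, 256) latent -> (B, ~2048, 3) points

    def forward(self, points):                             # points: (B, 2048, 3)
        tokens = self.mamba_stack(self.encoder(points))    # (B, L, 256)
        z = tokens.mean(dim=1)                             # pool to one latent (pooling scheme is an assumption)
        return self.decoder(z)                             # reconstructed point cloud
```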
Three set-abstraction (SA) layers [6]. SA1 reduces 2 048 → 512 points with MLP widths [32, 32, 64]. SA2 reduces 512 → 128 with [64, 64, 128]. SA3 reduces 128 → 32 with [128, 128, 256]. Each SA layer's MLP first layer takes in_channel + 3 input channels — the +3 is the xyz-coordinate offset concatenated to the feature tensor before the MLP. This is the channel trap documented in §5.
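In construction terms, a sketch assuming the common reference signature PointNetSetAbstraction(npoint, radius, nsample, in_channel, mlp); the radius and nsample values are placeholders, not part of the spec:

```python
# npoint and mlp widths follow the spec; radius / nsample are illustrative placeholders.
sa1 = PointNetSetAbstraction(npoint=512, radius=0.2, nsample=32,
                             in_channel=0,   mlp=[32, 32, 64])    # MLP first layer sized for 0 + 3
sa2 = PointNetSetAbstraction(npoint=128, radius=0.4, nsample=64,
                             in_channel=64,  mlp=[64, 64, 128])   # sized for 64 + 3
sa3 = PointNetSetAbstraction(npoint=32,  radius=0.8, nsample=64,
                             in_channel=128, mlp=[128, 128, 256]) # sized for 128 + 3
```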
Ten stacked Pure-Mamba blocks at d_model = 256, d_state = 128. The decision between Pure-Mamba and a Mamba+Attention hybrid was settled by the Topic-25 MNIST validation, which found Pure-Mamba won on the speed-quality trade-off despite a marginally worse loss number (Pure-Mamba loss ~0.086 vs Hybrid loss ~0.080). The 10-block depth is taken from that MNIST setup; whether it is the right depth at 3-D scale is open.
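A sketch of the block stack, assuming the mamba_ssm package's Mamba module; the pre-norm residual layout is an assumption, and d_conv / expand are left at the package defaults:

```python
import torch.nn as nn
from mamba_ssm import Mamba   # assumes the mamba-ssm package; its Mamba block is used as-is

class MambaStack(nn.Module):
    """10 Pure-Mamba blocks over the token sequence, with residuals around each block."""
    def __init__(self, depth=10, d_model=256, d_state=128):
        super().__init__()
        self.blocks = nn.ModuleList([Mamba(d_model=d_model, d_state=d_state) for _ in range(depth)])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(depth)])

    def forward(self, x):                  # x: (B, L, d_model)
        for norm, block in zip(self.norms, self.blocks):
            x = x + block(norm(x))         # pre-norm residual around each Mamba block
        return x
```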
Conditional flow matching [5] over the latent. The flow trajectory is linear interpolation z_t = (1 − t) · z₀ + t · z₁, the optimal velocity field is v* = z₁ − z₀, and the network predicts v̂(z_t, t) with loss ‖v* − v̂‖². Sampling integrates the predicted ODE from t = 0 to t = 1 via Euler with 20–50 steps. The head itself is a small MLP conditioned on a sinusoidal time embedding.
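A minimal sketch of that objective and sampler; the velocity network is passed in as a callable, and the sinusoidal time embedding is assumed to live inside it:

```python
import torch

def fm_loss(vel_net, z1):
    """Conditional FM loss: linear path z_t = (1 - t) z0 + t z1, target velocity v* = z1 - z0."""
    z0 = torch.randn_like(z1)                            # z0 ~ N(0, I)
    t = torch.rand(z1.shape[0], 1, device=z1.device)     # t ~ U[0, 1], one per sample
    z_t = (1 - t) * z0 + t * z1
    v_target = z1 - z0
    return ((vel_net(z_t, t) - v_target) ** 2).mean()    # MSE on the predicted velocity

@torch.no_grad()
def sample_latent(vel_net, batch, dim, steps=50, device="cpu"):
    """Euler integration of dz/dt = v(z_t, t) from t = 0 to t = 1 (20-50 steps per the spec)."""
    z = torch.randn(batch, dim, device=device)           # z0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((batch, 1), i * dt, device=device)
        z = z + dt * vel_net(z, t)
    return z                                             # z1: generated latent for the FoldingNet decoder
```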
Two-stage folding [7]. The latent is broadcast across a fixed 45 × 45 2-D grid (2 025 points, ≈ the 2 048-point target); concatenated with the 2-D grid coordinates; passed through a small MLP to produce the first folded point cloud; concatenated with the latent again; passed through a second MLP to produce the output. The implementation is standard FoldingNet.
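A sketch under those conventions; the [-1, 1] grid range and the split of the 512 → 512 → 3 widths across the two folds are assumptions:

```python
import torch
import torch.nn as nn

class FoldingDecoder(nn.Module):
    """Two-stage fold over a fixed 45 x 45 (= 2 025-point) 2-D grid."""
    def __init__(self, latent_dim=256, grid_size=45):
        super().__init__()
        ticks = torch.linspace(-1.0, 1.0, grid_size)
        xx, yy = torch.meshgrid(ticks, ticks, indexing="ij")
        self.register_buffer("grid", torch.stack([xx.reshape(-1), yy.reshape(-1)], dim=-1))  # (2025, 2)
        self.fold1 = nn.Sequential(nn.Linear(latent_dim + 2, 512), nn.ReLU(),
                                   nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 3))
        self.fold2 = nn.Sequential(nn.Linear(latent_dim + 3, 512), nn.ReLU(),
                                   nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 3))

    def forward(self, z):                                    # z: (B, latent_dim)
        B, N = z.shape[0], self.grid.shape[0]
        z_rep = z.unsqueeze(1).expand(B, N, -1)              # broadcast latent across the grid
        grid = self.grid.unsqueeze(0).expand(B, N, -1)
        pts = self.fold1(torch.cat([z_rep, grid], dim=-1))   # first fold: grid -> coarse surface
        pts = self.fold2(torch.cat([z_rep, pts], dim=-1))    # second fold: refine, conditioned on latent again
        return pts                                           # (B, 2025, 3)
```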
| Component | Parameters | Notes |
|---|---|---|
| PointNet++ encoder (3 SA layers) | ~1.20 M | Dominant cost: SA3 with widths 128 → 128 → 256 |
| Pure-Mamba ×10 | ~5.10 M | d_model=256, d_state=128, expansion ratio default |
| Flow-matching head | ~0.40 M | Velocity MLP + time embedding |
| FoldingNet decoder | ~0.55 M | Two folds, MLP widths 512 → 512 → 3 |
| Total | ~7.25 M | Matches the parameter count printed at first launch |
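The total corresponds to the usual first-launch check, assuming the assembled model object from flow_model.py:

```python
n_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {n_params / 1e6:.2f} M")   # expected ~7.25 M per the table above
```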
The argument for MambaFlow3D over SparC3D is not quality — it is the compound inference speed-up from substituting Mamba for transformer attention and flow matching for DDPM sampling. This section sources each constituent ratio and reports the compound range honestly.
Mamba-2 [2] on Ampere-class GPUs reports a 2–8× per-step speed-up over a flash-attention transformer at long sequence length, but the speed-up narrows at shorter sequences. At SparC3D's 100–200-token regime — short by language-modelling standards — the realised ratio is closer to the bottom of that range. The Topic-25 MNIST validation, run at 196 tokens on a single RTX 3060, observed ~2.3 × per-step speed-up vs the same transformer body. The MambaFlow3D budget assumes a 2–3× training-step speed-up at SparC3D scale.
DDPM [8] at quality typically runs 100–250 steps. Flow matching with a well-trained velocity predictor and a reasonable ODE integrator (Euler or Heun) at 20–50 steps matches DDPM-100 quality on the diffusion-sampler benchmarks reported in [5]. The MambaFlow3D budget assumes 20–50 sampling steps — a 5–10× step-count reduction vs the DDPM reference.
Mamba's per-step inference cost in autoregressive rollout is constant per token, where cached transformer attention pays O(L) per token (O(L²) over the full sequence). At 100–200 tokens the constant-overhead win for Mamba is roughly 1.7–2.5× per step. Combined with bf16 / fp16 mixed precision on Ampere, the per-step inference ratio is approximately the same as the training-step ratio — 2–3×.
| Factor | SparC3D reference | MambaFlow3D target | Ratio range |
|---|---|---|---|
| Per-step inference (block) | 1.0 × | 0.4–0.6 × | 1.7–2.5 × |
| Sampling step count | ~250 (DDPM) | 20–50 (FM) | 5–10 × |
| Mixed-precision (Ampere fp16/bf16) | same | same | 1.0 × (cancels) |
| End-to-end inference | 1.0 × | 0.083–0.20 × | 5–12 × |
The range is wide because the constituent ratios are wide. At the bottom end, 1.7× per step × 5× step count ≈ 8.5× compound, dropping to ~5× with overhead; at the top end, 2.5× × 10× = 25× pre-overhead, dropping to ~12× with realistic overhead. Hence the 5–12× compound range. None of this is measured at SparC3D scale yet; those measurements are the Phase-3 deliverable.
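The same arithmetic in one place; the overhead fractions are back-computed from the ~8.5× → ~5× and 25× → ~12× figures above, not independently sourced:

```python
per_step   = (1.7, 2.5)    # Mamba vs attention, per inference step (table above)
step_count = (5.0, 10.0)   # DDPM ~100-250 steps vs FM 20-50 steps
overhead   = (0.6, 0.5)    # fraction surviving fixed overhead, back-computed from the prose figures
low  = per_step[0] * step_count[0] * overhead[0]   # ~5x
high = per_step[1] * step_count[1] * overhead[1]   # ~12x
print(f"end-to-end inference speed-up: {low:.1f}x - {high:.1f}x")
```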
Dataset: ModelNet10 — 10 categories, 3 991 training shapes, 908 test shapes. Each shape sampled to 2 048 points via farthest-point sampling at load time, normalised to a unit sphere. Loss: flow-matching MSE on the predicted velocity. Optimiser: AdamW at learning rate 1×10⁻⁴, gradient clipping at norm 1.0. Batch: 16–32 (16 confirmed stable). Epochs planned: 50. Hardware: 1 × RTX 3060 12 GB on Vast.ai. Targets: Chamfer distance < 0.01, < 30 s/epoch, < 8 GB/GPU.
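A sketch of a single training step under that configuration; model is the assembled module, fm_loss is the flow-matching loss sketched in the architecture section, and the mean-pooled latent target is an assumption:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(points):                                       # points: (B, 2048, 3) from the ModelNet10 loader
    z1 = model.mamba_stack(model.encoder(points)).mean(dim=1) # target latent z1 (pooling is an assumption)
    loss = fm_loss(model.flow_head, z1)                       # flow-matching MSE on the predicted velocity
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at norm 1.0
    optimizer.step()
    return loss.item()
```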
Pre-bug, the implementation was five files: flow_model.py (network), train.py (loop), modelnet_loader.py (data), evaluate.py (Chamfer + 3-D plot), benchmark.py (timing). All five were generated as a single bring-up package and committed; the bug surfaced on the first python train.py run.
First training launch crashed at the very first forward pass through the encoder. The traceback bottomed out at a conv layer inside PointNetSetAbstraction.forward in SA2 (the second set-abstraction layer), declaring that it expected 64 input channels and received 67. The +3 offset is the diagnostic giveaway.
A PointNet++ set-abstraction layer's forward does (in sketch): (i) group the input points by farthest-point-sampled centroids, (ii) gather the per-point input features in each group, (iii) concatenate the grouped xyz coordinates onto the grouped features along the channel axis, (iv) pass the concatenated tensor through an MLP. The grouped xyz contributes 3 channels. The reference convention is that the SA layer's in_channel argument refers to the feature channels only, and the MLP head must be sized for in_channel + 3.
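In code terms, step (iii) is where the extra three channels appear (a shape-level sketch, not the reference implementation):

```python
# Inside PointNetSetAbstraction.forward:
#   grouped_xyz:      (B, npoint, nsample, 3)   -- xyz offsets relative to each centroid
#   grouped_features: (B, npoint, nsample, C)   -- C == in_channel
new_features = torch.cat([grouped_xyz, grouped_features], dim=-1)   # (B, npoint, nsample, C + 3)
new_features = mlp_head(new_features)   # so the MLP head's first layer must expect C + 3 channels, not C
```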
The bug: the initial implementation declared the MLP head with in_channel rather than in_channel + 3. SA2 declared in_channel = 64, the MLP head was built for 64, the runtime tensor after concatenation was 67, and the conv layer raised RuntimeError: expected 64 input channels, got 67.
A three-line change in the SA-layer constructor:

```python
last_channel = in_channel + 3   # +3 for grouped xyz
```

Applied uniformly to SA1, SA2, SA3. The SA1 case is the corner: it has no input features (only xyz coordinates) and is constructed with in_channel = 0, so the MLP head correctly receives the +3 offset alone. The fix is symmetric across the three layers.
PointNet++ derivative implementations re-implement this concatenation in dozens of independent codebases; the +3 offset is a known trap and is the most common reason for "off-by-three" channel errors in PointNet++ code. The right defensive habit when porting a PointNet++ SA layer into a new architecture is to assert the MLP head's first layer's input channel matches in_channel + 3 at construction time. The MambaFlow3D codebase carries that assert going forward.
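A sketch of the guarded constructor; the conv/batch-norm layout follows the common PointNet++ reference style and the assert wording is illustrative:

```python
import torch.nn as nn

class PointNetSetAbstraction(nn.Module):
    """SA-layer constructor with the Phase-2 fix and the construction-time guard."""
    def __init__(self, npoint, radius, nsample, in_channel, mlp):
        super().__init__()
        self.npoint, self.radius, self.nsample = npoint, radius, nsample
        self.convs, self.bns = nn.ModuleList(), nn.ModuleList()
        last_channel = in_channel + 3                      # +3 for the grouped xyz offsets
        for out_channel in mlp:
            self.convs.append(nn.Conv2d(last_channel, out_channel, kernel_size=1))
            self.bns.append(nn.BatchNorm2d(out_channel))
            last_channel = out_channel
        # Defensive check: the MLP head's first layer must be sized for feature channels + xyz.
        assert self.convs[0].in_channels == in_channel + 3, \
            f"SA MLP head expects {self.convs[0].in_channels} channels, needs {in_channel + 3}"
```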
Three concrete entry conditions for Phase-3. (i) ModelNet10 Chamfer < 0.01. The Phase-2 model after the channel fix runs end-to-end and trains; the Chamfer distance at full 50-epoch convergence has not been measured yet because epoch budget on the rented instance was prioritised for Topic 27 (JiT). Phase-3 entry requires Chamfer below the 0.01 success criterion. (ii) SparseCubes tokeniser swap. Replace the PointNet++ encoder with a SparC3D-style sparse-cube tokeniser producing 100–200 cube tokens. The Mamba body and the flow-matching head are unchanged. (iii) Image conditioning. Add a ViT image encoder and cross-attention from the image tokens into the Mamba blocks (or, alternatively, a cross-Mamba block). This is the single-image-to-3-D capability the thesis ultimately needs.
The speed-up budget in §3 is the Phase-3 deliverable's hypothesis. The measured end-to-end inference speed-up at SparC3D scale on a 2 × RTX 3060 rig is the number that decides whether MambaFlow3D is a useful architectural variant or whether the transformer baseline is good enough.
MambaFlow3D is specified end-to-end at 7.25 M parameters: PointNet++ encoder, ten Pure-Mamba blocks at d_model = 256, d_state = 128, a flow-matching velocity-prediction head, FoldingNet decoder. The Phase-2 ModelNet10 bring-up is functional after the +3 xyz-channel offset was applied uniformly to the SA-layer MLP heads. The Phase-3 sparse-cube + image-conditioning extension is specified but not implemented. The speed-up budget — 2–3× training, 5–12× end-to-end inference vs SparC3D — is sourced and reported as a range; verification is the Phase-3 deliverable.