Single-image-to-3-D generation has converged on a small set of architectural patterns. SparC3D [3] tokenises a sparse voxel grid into ~100–200 cubes per scene and applies transformer attention; TRELLIS [4] uses a similar sparse-cube tokenisation with a transformer body for the autoregressive structure-then-appearance pass. Both deliver high-quality reconstructions; both are slow at inference on consumer GPUs because the transformer body is quadratic in token count and the diffusion sampler runs 100–250 steps for quality outputs.
This paper specifies MambaFlow3D, a variant that targets the speed bottleneck rather than the quality ceiling. The architectural premises: (i) the Mamba state-space block [1,2] can replace transformer attention at the token counts SparC3D uses, with linear-time scaling and a constant-memory sequential rollout; (ii) flow matching [5] can replace the long-step DDPM sampler with a 20–50-step ODE integrator without quality degradation, as the MNIST validation in the parallel thesis line confirmed. The paper does not propose either component in isolation; it specifies how they couple, what the parameter and speed-up budgets look like, and what the Phase-2 ModelNet10 implementation revealed.
The contributions are: (1) the architecture spec — PointNet++ → 10 Pure-Mamba blocks → FM-head → FoldingNet, 7.25 M parameters, Phase-2 on point clouds with planned Phase-3 swap to sparse voxels; (2) the speed-up budget — 2–3× training and 5–12× inference end-to-end vs the SparC3D reference, with constituent ratios sourced; (3) the ModelNet10 Phase-2 bring-up log, particularly the PointNet++ +3 xyz-offset channel-mismatch trap.
For a point-cloud input x ∈ ℝ^(B × 2048 × 3), the deterministic forward pass is PointNet++ encoder → ten Pure-Mamba blocks → latent z → FoldingNet decoder, mapping the input back to a 2 048-point cloud.
For generation, a flow-matching head replaces the deterministic decoder forward by predicting a velocity field v(z_t, t) over the latent and integrating from z₀ ∼ 𝒩(0, I) to z₁ over 20–50 steps. The FoldingNet decoder then maps the generated z₁ back to the point-cloud output.
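A shape-level PyTorch sketch of that pipeline. The module names (MambaFlow3D, encoder, mamba_stack, flow_head, decoder) and the mean-pooling of the Mamba tokens into a single decoder latent are assumptions for illustration, not the flow_model.py layout:

```python
import torch
import torch.nn as nn

class MambaFlow3D(nn.Module):
    """Deterministic path: points -> PointNet++ tokens -> Mamba body -> latent -> FoldingNet.
    For generation, the flow-matching head's ODE integration supplies the latent instead."""
    def __init__(self, encoder, mamba_stack, flow_head, decoder):
        super().__init__()
        self.encoder = encoder          # PointNet++: (B, 2048, 3) -> (B, L, 256) tokens
        self.mamba_stack = mamba_stack  # 10 Pure-Mamba blocks, d_model=256, d_state=128
        self.flow_head = flow_head      # velocity predictor v(z_t, t) over the latent
        self.decoder = decoder          # FoldingNet: (B, 256) latent -> (B, ~2048, 3) points

    def forward(self, points):                             # points: (B, 2048, 3)
        tokens = self.mamba_stack(self.encoder(points))    # (B, L, 256)
        z = tokens.mean(dim=1)                             # pool to one latent (pooling scheme is an assumption)
        return self.decoder(z)                             # reconstructed point cloud
```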
Three set-abstraction (SA) layers [6]. SA1 reduces 2 048 → 512 points with MLP widths [32, 32, 64]. SA2 reduces 512 → 128 with [64, 64, 128]. SA3 reduces 128 → 32 with [128, 128, 256]. Each SA layer's MLP first layer takes in_channel + 3 input channels — the +3 is the xyz-coordinate offset concatenated to the feature tensor before the MLP. This is the channel trap documented in §5.
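In construction terms, a sketch assuming the common reference signature PointNetSetAbstraction(npoint, radius, nsample, in_channel, mlp); the radius and nsample values are placeholders, not part of the spec:

```python
# npoint and mlp widths follow the spec; radius / nsample are illustrative placeholders.
sa1 = PointNetSetAbstraction(npoint=512, radius=0.2, nsample=32,
                             in_channel=0,   mlp=[32, 32, 64])    # MLP first layer sized for 0 + 3
sa2 = PointNetSetAbstraction(npoint=128, radius=0.4, nsample=64,
                             in_channel=64,  mlp=[64, 64, 128])   # sized for 64 + 3
sa3 = PointNetSetAbstraction(npoint=32,  radius=0.8, nsample=64,
                             in_channel=128, mlp=[128, 128, 256]) # sized for 128 + 3
```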
Ten stacked Pure-Mamba blocks at d_model = 256, d_state = 128. The decision between Pure-Mamba and a Mamba+Attention hybrid was settled by the Topic-25 MNIST validation, which found Pure-Mamba won on the speed-quality trade-off despite a marginally worse loss number (Pure-Mamba loss ~0.086 vs Hybrid loss ~0.080). The 10-block depth is taken from that MNIST setup; whether it is the right depth at 3-D scale is open.
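A sketch of the block stack, assuming the mamba_ssm package's Mamba module; the pre-norm residual layout is an assumption, and d_conv / expand are left at the package defaults:

```python
import torch.nn as nn
from mamba_ssm import Mamba   # assumes the mamba-ssm package; its Mamba block is used as-is

class MambaStack(nn.Module):
    """10 Pure-Mamba blocks over the token sequence, with residuals around each block."""
    def __init__(self, depth=10, d_model=256, d_state=128):
        super().__init__()
        self.blocks = nn.ModuleList([Mamba(d_model=d_model, d_state=d_state) for _ in range(depth)])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(depth)])

    def forward(self, x):                  # x: (B, L, d_model)
        for norm, block in zip(self.norms, self.blocks):
            x = x + block(norm(x))         # pre-norm residual around each Mamba block
        return x
```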
Conditional flow matching [5] over the latent. The flow trajectory is linear interpolation z_t = (1 − t) · z₀ + t · z₁, the optimal velocity field is v* = z₁ − z₀, and the network predicts v̂(z_t, t) with loss ‖v* − v̂‖². Sampling integrates the predicted ODE from t = 0 to t = 1 via Euler with 20–50 steps. The head itself is a small MLP conditioned on a sinusoidal time embedding.
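A minimal sketch of that objective and sampler; the velocity network is passed in as a callable, and the sinusoidal time embedding is assumed to live inside it:

```python
import torch

def fm_loss(vel_net, z1):
    """Conditional FM loss: linear path z_t = (1 - t) z0 + t z1, target velocity v* = z1 - z0."""
    z0 = torch.randn_like(z1)                            # z0 ~ N(0, I)
    t = torch.rand(z1.shape[0], 1, device=z1.device)     # t ~ U[0, 1], one per sample
    z_t = (1 - t) * z0 + t * z1
    v_target = z1 - z0
    return ((vel_net(z_t, t) - v_target) ** 2).mean()    # MSE on the predicted velocity

@torch.no_grad()
def sample_latent(vel_net, batch, dim, steps=50, device="cpu"):
    """Euler integration of dz/dt = v(z_t, t) from t = 0 to t = 1 (20-50 steps per the spec)."""
    z = torch.randn(batch, dim, device=device)           # z0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((batch, 1), i * dt, device=device)
        z = z + dt * vel_net(z, t)
    return z                                             # z1: generated latent for the FoldingNet decoder
```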
Two-stage folding [7]. The latent is broadcast across a fixed 45 × 45 2-D grid (2 025 points, ≈ the 2 048-point target); concatenated with the 2-D grid coordinates; passed through a small MLP to produce the first folded point cloud; concatenated with the latent again; passed through a second MLP to produce the output. The implementation is standard FoldingNet.
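A sketch under those conventions; the [-1, 1] grid range and the split of the 512 → 512 → 3 widths across the two folds are assumptions:

```python
import torch
import torch.nn as nn

class FoldingDecoder(nn.Module):
    """Two-stage fold over a fixed 45 x 45 (= 2 025-point) 2-D grid."""
    def __init__(self, latent_dim=256, grid_size=45):
        super().__init__()
        ticks = torch.linspace(-1.0, 1.0, grid_size)
        xx, yy = torch.meshgrid(ticks, ticks, indexing="ij")
        self.register_buffer("grid", torch.stack([xx.reshape(-1), yy.reshape(-1)], dim=-1))  # (2025, 2)
        self.fold1 = nn.Sequential(nn.Linear(latent_dim + 2, 512), nn.ReLU(),
                                   nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 3))
        self.fold2 = nn.Sequential(nn.Linear(latent_dim + 3, 512), nn.ReLU(),
                                   nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 3))

    def forward(self, z):                                    # z: (B, latent_dim)
        B, N = z.shape[0], self.grid.shape[0]
        z_rep = z.unsqueeze(1).expand(B, N, -1)              # broadcast latent across the grid
        grid = self.grid.unsqueeze(0).expand(B, N, -1)
        pts = self.fold1(torch.cat([z_rep, grid], dim=-1))   # first fold: grid -> coarse surface
        pts = self.fold2(torch.cat([z_rep, pts], dim=-1))    # second fold: refine, conditioned on latent again
        return pts                                           # (B, 2025, 3)
```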
| Component | Parameters | Notes |
|---|---|---|
| PointNet++ encoder (3 SA layers) | ~1.20 M | Dominant cost: SA3 with widths 128 → 128 → 256 |
| Pure-Mamba ×10 | ~5.10 M | d_model=256, d_state=128, expansion ratio default |
| Flow-matching head | ~0.40 M | Velocity MLP + time embedding |
| FoldingNet decoder | ~0.55 M | Two folds, MLP widths 512 → 512 → 3 |
| Total | ~7.25 M | Matches the parameter count printed at first launch |
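The total corresponds to the usual first-launch check, assuming the assembled model object from flow_model.py:

```python
n_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {n_params / 1e6:.2f} M")   # expected ~7.25 M per the table above
```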
The argument for MambaFlow3D over SparC3D is not quality — it is the compound inference speed-up from substituting Mamba for transformer attention and flow matching for DDPM sampling. This section sources each constituent ratio and reports the compound range honestly.
Mamba-2 [2] on Ampere-class GPUs reports a 2–8× per-step speed-up over a flash-attention transformer at long sequence length, but the speed-up narrows at shorter sequences. At SparC3D's 100–200-token regime — short by language-modelling standards — the realised ratio is closer to the bottom of that range. The Topic-25 MNIST validation, run at 196 tokens on a single RTX 3060, observed ~2.3 × per-step speed-up vs the same transformer body. The MambaFlow3D budget assumes a 2–3× training-step speed-up at SparC3D scale.
DDPM [8] at quality typically runs 100–250 steps. Flow matching with a well-trained velocity predictor and a reasonable ODE integrator (Euler or Heun) at 20–50 steps matches DDPM-100 quality on the diffusion-sampler benchmarks reported in [5]. The MambaFlow3D budget assumes 20–50 sampling steps — a 5–10× step-count reduction vs the DDPM reference.
Mamba's per-step inference cost in autoregressive rollout is constant per token, where cached transformer attention pays O(L) per token (O(L²) over the full sequence). At 100–200 tokens the constant-overhead win for Mamba is roughly 1.7–2.5× per step. Combined with bf16 / fp16 mixed precision on Ampere, the per-step inference ratio is approximately the same as the training-step ratio — 2–3×.
| Factor | SparC3D reference | MambaFlow3D target | Ratio range |
|---|---|---|---|
| Per-step inference (block) | 1.0 × | 0.4–0.6 × | 1.7–2.5 × |
| Sampling step count | ~250 (DDPM) | 20–50 (FM) | 5–10 × |
| Mixed-precision (Ampere fp16/bf16) | same | same | 1.0 × (cancels) |
| End-to-end inference | 1.0 × | 0.083–0.20 × | 5–12 × |
The range is wide because the constituent ratios are wide. At the bottom end, 1.7× per step × 5× step count ≈ 8.5× compound, dropping to ~5× with overhead; at the top end, 2.5× × 10× = 25× pre-overhead, dropping to ~12× with realistic overhead. Hence the 5–12× compound range. None of this is measured at SparC3D scale yet; those measurements are the Phase-3 deliverable.
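The same arithmetic in one place; the overhead fractions are back-computed from the ~8.5× → ~5× and 25× → ~12× figures above, not independently sourced:

```python
per_step   = (1.7, 2.5)    # Mamba vs attention, per inference step (table above)
step_count = (5.0, 10.0)   # DDPM ~100-250 steps vs FM 20-50 steps
overhead   = (0.6, 0.5)    # fraction surviving fixed overhead, back-computed from the prose figures
low  = per_step[0] * step_count[0] * overhead[0]   # ~5x
high = per_step[1] * step_count[1] * overhead[1]   # ~12x
print(f"end-to-end inference speed-up: {low:.1f}x - {high:.1f}x")
```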
Dataset: ModelNet10 — 10 categories, 3 991 training shapes, 908 test shapes. Each shape sampled to 2 048 points via farthest-point sampling at load time, normalised to a unit sphere. Loss: flow-matching MSE on the predicted velocity. Optimiser: AdamW at learning rate 1×10⁻⁴, gradient clipping at norm 1.0. Batch: 16–32 (16 confirmed stable). Epochs planned: 50. Hardware: 1 × RTX 3060 12 GB on Vast.ai. Targets: Chamfer distance < 0.01, < 30 s/epoch, < 8 GB/GPU.
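A sketch of a single training step under that configuration; model is the assembled module, fm_loss is the flow-matching loss sketched in the architecture section, and the mean-pooled latent target is an assumption:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(points):                                       # points: (B, 2048, 3) from the ModelNet10 loader
    z1 = model.mamba_stack(model.encoder(points)).mean(dim=1) # target latent z1 (pooling is an assumption)
    loss = fm_loss(model.flow_head, z1)                       # flow-matching MSE on the predicted velocity
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at norm 1.0
    optimizer.step()
    return loss.item()
```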
Pre-bug, the implementation was five files: flow_model.py (network), train.py (loop), modelnet_loader.py (data), evaluate.py (Chamfer + 3-D plot), benchmark.py (timing). All five were generated as a single bring-up package and committed; the bug surfaced on the first python train.py run.
First training launch crashed at the very first forward pass through the encoder. The traceback bottomed out at a conv layer inside PointNetSetAbstraction.forward in SA2 (the second set-abstraction layer), declaring that it expected 64 input channels and received 67. The +3 offset is the diagnostic giveaway.
A PointNet++ set-abstraction layer's forward does (in sketch): (i) group the input points by farthest-point-sampled centroids, (ii) gather the per-point input features in each group, (iii) concatenate the grouped xyz coordinates onto the grouped features along the channel axis, (iv) pass the concatenated tensor through an MLP. The grouped xyz contributes 3 channels. The reference convention is that the SA layer's in_channel argument refers to the feature channels only, and the MLP head must be sized for in_channel + 3.
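In code terms, step (iii) is where the extra three channels appear (a shape-level sketch, not the reference implementation):

```python
# Inside PointNetSetAbstraction.forward:
#   grouped_xyz:      (B, npoint, nsample, 3)   -- xyz offsets relative to each centroid
#   grouped_features: (B, npoint, nsample, C)   -- C == in_channel
new_features = torch.cat([grouped_xyz, grouped_features], dim=-1)   # (B, npoint, nsample, C + 3)
new_features = mlp_head(new_features)   # so the MLP head's first layer must expect C + 3 channels, not C
```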
The bug: the initial implementation declared the MLP head with in_channel rather than in_channel + 3. SA2 declared in_channel = 64, the MLP head was built for 64, the runtime tensor after concatenation was 67, and the conv layer raised RuntimeError: expected 64 input channels, got 67.
A three-line change in the SA-layer constructor:

```python
last_channel = in_channel + 3   # +3 for grouped xyz
```

Applied uniformly to SA1, SA2, SA3. The SA1 case is the corner: it has no input features (only xyz coordinates) and is constructed with in_channel = 0, so the MLP head correctly receives the +3 offset alone. The fix is symmetric across the three layers.
PointNet++ derivative implementations re-implement this concatenation in dozens of independent codebases; the +3 offset is a known trap and is the most common reason for "off-by-three" channel errors in PointNet++ code. The right defensive habit when porting a PointNet++ SA layer into a new architecture is to assert the MLP head's first layer's input channel matches in_channel + 3 at construction time. The MambaFlow3D codebase carries that assert going forward.
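A sketch of the guarded constructor; the conv/batch-norm layout follows the common PointNet++ reference style and the assert wording is illustrative:

```python
import torch.nn as nn

class PointNetSetAbstraction(nn.Module):
    """SA-layer constructor with the Phase-2 fix and the construction-time guard."""
    def __init__(self, npoint, radius, nsample, in_channel, mlp):
        super().__init__()
        self.npoint, self.radius, self.nsample = npoint, radius, nsample
        self.convs, self.bns = nn.ModuleList(), nn.ModuleList()
        last_channel = in_channel + 3                      # +3 for the grouped xyz offsets
        for out_channel in mlp:
            self.convs.append(nn.Conv2d(last_channel, out_channel, kernel_size=1))
            self.bns.append(nn.BatchNorm2d(out_channel))
            last_channel = out_channel
        # Defensive check: the MLP head's first layer must be sized for feature channels + xyz.
        assert self.convs[0].in_channels == in_channel + 3, \
            f"SA MLP head expects {self.convs[0].in_channels} channels, needs {in_channel + 3}"
```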
Three concrete entry conditions for Phase-3. (i) ModelNet10 Chamfer < 0.01. The Phase-2 model after the channel fix runs end-to-end and trains; the Chamfer distance at full 50-epoch convergence has not been measured yet because epoch budget on the rented instance was prioritised for Topic 27 (JiT). Phase-3 entry requires Chamfer below the 0.01 success criterion. (ii) SparseCubes tokeniser swap. Replace the PointNet++ encoder with a SparC3D-style sparse-cube tokeniser producing 100–200 cube tokens. The Mamba body and the flow-matching head are unchanged. (iii) Image conditioning. Add a ViT image encoder and cross-attention from the image tokens into the Mamba blocks (or, alternatively, a cross-Mamba block). This is the single-image-to-3-D capability the thesis ultimately needs.
The speed-up budget in §3 is the Phase-3 deliverable's hypothesis. The measured end-to-end inference speed-up at SparC3D scale on a 2 × RTX 3060 rig is the number that decides whether MambaFlow3D is a useful architectural variant or whether the transformer baseline is good enough.
MambaFlow3D is specified end-to-end at 7.25 M parameters: PointNet++ encoder, ten Pure-Mamba blocks at d_model = 256, d_state = 128, a flow-matching velocity-prediction head, FoldingNet decoder. The Phase-2 ModelNet10 bring-up is functional after the +3 xyz-channel offset was applied uniformly to the SA-layer MLP heads. The Phase-3 sparse-cube + image-conditioning extension is specified but not implemented. The speed-up budget — 2–3× training, 5–12× end-to-end inference vs SparC3D — is sourced and reported as a range; verification is the Phase-3 deliverable.