The follow-on to the MNIST validation (Topic 25) — scaling Pure Mamba from 2-D digits to 3-D point clouds and onward to sparse voxels, paired with flow matching in latent space. Reference targets are SparC3D and TRELLIS; the design goal is a 2–3× training speed-up and a 5–12× inference speed-up by replacing transformer attention with linear-time state-space blocks. Documents the architecture spec, the point-cloud Phase-2 validation on ModelNet10, and the channel-mismatch bug that held up first training.
The thesis line wants fast single-image-to-3-D — train on a consumer rig, infer at interactive rates. The published references at the time of this work — SparC3D and Microsoft's TRELLIS — both use transformer attention over sparse voxel token sequences. The token counts are short enough (~100–200 sparse cubes per scene) that transformer attention is not catastrophic, but the cost is still quadratic, and the per-step inference latency on consumer GPUs is the bottleneck for the interactive use-case the thesis targets.
The Mamba state-space block [1,2] is the obvious substitute. It is linear-time in sequence length, has constant memory under autoregressive rollouts, and on the MNIST validation (Topic 25) it matched or beat the transformer baseline on visual quality despite a slightly worse loss number. The question Topic 26 answers: does the Mamba-substitution survive the jump from 2-D MNIST to 3-D point clouds, and is the architecture spec coherent enough to commit to a full SparC3D-class implementation?
The honest scope: this is a design + Phase-2 validation page. The full SparC3D-class implementation is not yet built; what is built is the point-cloud generator end-to-end, the architecture spec for the sparse-voxel extension, and the documented bring-up on ModelNet10 including the PointNet++ channel-mismatch fix that blocked the first training run.
The Phase-2 architecture is end-to-end deterministic in the encoder and decoder, with the generative work happening in latent space. The forward path is:
- **PointNet++ encoder.** Three set-abstraction layers bringing 2 048 points down to 32 hierarchical features at 256 dimensions each. The three SA layers are (2048→512, mlp=[32,32,64]) → (512→128, mlp=[64,64,128]) → (128→32, mlp=[128,128,256]). The first layer takes no input features (only xyz coordinates); the next two take the previous layer's output. This is where the channel-mismatch bug appeared (§04).
- **Pure-Mamba body.** Ten blocks at d_model = 256, d_state = 128, run over the 32 latent tokens. Pure-Mamba (not the Mamba+Attention hybrid) was the winner of the MNIST validation in Topic 25 on the speed–quality trade-off, so it is the substrate carried forward. The 10-block depth is taken from the MNIST setup; whether this is the right depth at 3-D scale is open. (A sketch of the Mamba velocity field follows this list.)
- **Flow matching in latent space.** Linear interpolation between noise and the latent code as the trajectory; the network predicts the velocity field v; the loss is ‖v − v̂‖². Sampling is a 50-step Euler integrator from z₀ ∼ 𝒩(0, I) to the generated latent. The 50 steps is the nominal — the target band is 20–50 steps once the model is trained.
- **FoldingNet decoder.** Two-step folding: the latent is broadcast across a fixed 2-D 45 × 45 ≈ 2 048 grid, the grid is concatenated with the latent and passed through a small MLP to produce the first folded point cloud, then concatenated again and folded a second time to produce the output point cloud. This is the standard FoldingNet pattern. (A sketch follows the component table below.)
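The velocity network and the flow-matching loop can be sketched directly from this spec. A minimal sketch, assuming the `mamba_ssm` package's `Mamba` block (CUDA kernels required); the names `LatentVelocityField`, `flow_matching_loss`, and `sample_latents`, the residual/LayerNorm arrangement, and the time embedding are illustrative choices, not the contents of `flow_model.py`:

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (CUDA-only kernels)


class LatentVelocityField(nn.Module):
    """Pure-Mamba velocity network over the (B, 32, 256) latent tokens.

    Illustrative sketch: 10 residual Mamba blocks at d_model=256, d_state=128,
    with a simple per-sample time embedding added to the input tokens.
    """

    def __init__(self, d_model=256, d_state=128, depth=10):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        self.blocks = nn.ModuleList(
            Mamba(d_model=d_model, d_state=d_state) for _ in range(depth)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(depth))
        self.head = nn.Linear(d_model, d_model)

    def forward(self, z_t, t):
        # z_t: (B, 32, 256) point on the noise->latent trajectory; t: (B,) in [0, 1]
        t_emb = self.time_mlp(t[:, None, None].expand(-1, z_t.size(1), 1))
        h = z_t + t_emb
        for norm, block in zip(self.norms, self.blocks):
            h = h + block(norm(h))  # linear-time state-space block, residual
        return self.head(h)         # predicted velocity, (B, 32, 256)


def flow_matching_loss(model, z1):
    """Linear-interpolation flow matching: z_t = (1-t)*z0 + t*z1, target v = z1 - z0."""
    z0 = torch.randn_like(z1)                     # noise endpoint
    t = torch.rand(z1.size(0), device=z1.device)  # per-sample time
    z_t = (1 - t)[:, None, None] * z0 + t[:, None, None] * z1
    v_pred = model(z_t, t)
    return ((v_pred - (z1 - z0)) ** 2).mean()     # squared velocity error


@torch.no_grad()
def sample_latents(model, batch_size=8, steps=50, device="cuda"):
    """Euler integration from z0 ~ N(0, I) to a generated latent (50 steps nominal)."""
    z = torch.randn(batch_size, 32, 256, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((batch_size,), i * dt, device=device)
        z = z + dt * model(z, t)
    return z
```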
| Component | Spec | Output shape | Param count |
|---|---|---|---|
| PointNet++ encoder | 3 SA layers, MLPs [32,32,64] / [64,64,128] / [128,128,256] | (B, 32, 256) | ~1.2 M |
| Pure-Mamba ×10 | d_model=256, d_state=128 | (B, 32, 256) | ~5.1 M |
| Flow-matching head | Velocity MLP, latent-conditioned | (B, 32, 256) | ~0.4 M |
| FoldingNet decoder | Two folds over 45×45 grid | (B, 2048, 3) | ~0.55 M |
Total: ~7.25 M parameters — confirmed by parameter count at first training launch.
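The two-fold decoder follows the standard FoldingNet pattern. A minimal sketch; `FoldingDecoder` is an illustrative name, the mean-pooling of the 32 latent tokens into one 256-dim codeword before broadcasting is an assumption made here, and the 45 × 45 grid yields 2 025 points, which the spec rounds to ≈ 2 048 (Chamfer Distance does not require equal point counts):

```python
import torch
import torch.nn as nn


class FoldingDecoder(nn.Module):
    """Two-fold FoldingNet decoder over a fixed 45 x 45 2-D grid (sketch)."""

    def __init__(self, code_dim=256, grid_size=45):
        super().__init__()
        # Fixed 2-D grid in [-1, 1]^2: 45 * 45 = 2025 points (~2048).
        lin = torch.linspace(-1.0, 1.0, grid_size)
        grid = torch.stack(torch.meshgrid(lin, lin, indexing="ij"), dim=-1)
        self.register_buffer("grid", grid.reshape(-1, 2))  # (2025, 2)
        self.fold1 = nn.Sequential(                         # grid + codeword -> first fold
            nn.Linear(code_dim + 2, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 3),
        )
        self.fold2 = nn.Sequential(                         # first fold + codeword -> output
            nn.Linear(code_dim + 3, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 3),
        )

    def forward(self, latent_tokens):
        # latent_tokens: (B, 32, 256); pooled to one codeword per shape (assumption)
        code = latent_tokens.mean(dim=1)
        B, N = code.size(0), self.grid.size(0)
        code_rep = code[:, None, :].expand(B, N, code.size(1))
        grid_rep = self.grid[None].expand(B, N, 2)
        pts1 = self.fold1(torch.cat([code_rep, grid_rep], dim=-1))  # first folded cloud
        pts2 = self.fold2(torch.cat([code_rep, pts1], dim=-1))      # second fold
        return pts2                                                  # (B, 2025, 3)
```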
The Phase-2 work documented here is the point-cloud generator — PointNet++ encoder + Pure-Mamba + flow matching + FoldingNet. The full Phase-3 / Phase-4 system swaps in a sparse-voxel tokeniser (SparseCubes, ~100–200 tokens per scene) and an image-conditioning branch (ViT image encoder, cross-attention into the Mamba blocks). The diagram below shows both — Phase-2 is the dark-bordered path, Phase-3+ is the light-bordered extension.
The target speed-up numbers are derived from two sources: the measured Mamba-vs-transformer ratios reported in [2] and an estimate of the inference advantage flow-matching gives over DDPM at 20–50 steps vs 250–1000 steps respectively.
| Metric | SparC3D / TRELLIS reference | MambaFlow3D target | Source of ratio |
|---|---|---|---|
| Training step time (sparse-cube scale) | 1.0 × | 0.33–0.5 × | Mamba-2 fast-path on Ampere; [2] |
| Sampling steps | ~250 DDPM / ~50 if v-pred | 20–50 (flow matching) | FM scheduler choice |
| Per-step inference time | 1.0 × | 0.4–0.6 × (linear-time block) | Mamba sequential rollout |
| End-to-end inference latency | 1.0 × | 0.08–0.18 × (5–12 × faster) | Compound of the two |
| Single-image-to-3-D quality target | SparC3D-class | SparC3D-class @ 5× faster | Speed, not quality, is the contribution |
The 5–12× inference-latency range is wide because two of the constituent factors (Mamba per-step speed-up on Ampere, FM step count that retains quality) are not yet measured at the SparC3D-scale token count. The 5× lower bound assumes the modest end of each; the 12× upper bound assumes the optimistic end of each. Neither has been verified at 3-D scale.
Pure Mamba won on MNIST. Then a channel mismatch killed the first launch.
The architectural decision (Pure Mamba over Hybrid Mamba+Attention) was already settled by the Topic-25 MNIST validation. The Phase-2 bring-up on ModelNet10 was supposed to be a re-tune of known-good components — and was, except that the PointNet++ feature concatenation in the encoder's set-abstraction layer produced one channel count and the conv after it expected a different one. The fix was four characters in one MLP head; the diagnosis took an afternoon because the symptom (a channel error three layers deep) did not point at the SA-layer feature concatenation as the cause.
The Phase-2 implementation is five files: `flow_model.py` (PointNet++ + Mamba + FoldingNet, the network), `train.py` (flow-matching loss, AdamW, gradient clipping), `modelnet_loader.py` (ModelNet10 dataloader, unit-sphere normalisation, 2 048-point sampling), `evaluate.py` (Chamfer Distance + 3-D plot), and `benchmark.py` (timing harness for the 30 s/epoch target). All five live under `3d_point_cloud_Mamba/`.
| Setting | Value | Notes |
|---|---|---|
| Dataset | ModelNet10 | 10 categories, ~5 K shapes total (3 991 train / 908 test) |
| Points per shape | 2 048 | Fixed; sampled with farthest-point sampling at load |
| Batch size | 16–32 | 16 confirmed stable on 12 GB; 24 if memory headroom holds |
| Optimiser | AdamW, lr 1×10⁻⁴ | Gradient clipping at 1.0 |
| Epochs (planned) | 50 | ~20 min wall-clock target |
| Memory target | < 8 GB / GPU | Confirmed at ~6.5 GB peak |
| Epoch-time target | < 30 s | Confirmed at ~25 s once channel bug resolved |
| Quality target | Chamfer < 0.01 | Phase-2 success criterion; not yet measured at full convergence |
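A minimal sketch of the training-step wiring implied by this table (AdamW at lr 1×10⁻⁴, gradient clipping at 1.0, batch size 16, 50 epochs). It reuses `LatentVelocityField` and `flow_matching_loss` from the sketch above and feeds synthetic stand-in latents; the real `train.py` consumes PointNet++ encodings of ModelNet10 shapes, and whether the encoder and decoder are optimised jointly with the flow loss is not pinned down here:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in latents; the real pipeline feeds PointNet++ encodings of
# unit-sphere-normalised, 2048-point ModelNet10 shapes (modelnet_loader.py).
fake_latents = torch.randn(64, 32, 256)
loader = DataLoader(TensorDataset(fake_latents), batch_size=16, shuffle=True)

model = LatentVelocityField().cuda()                  # from the sketch above
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr per the config table

for epoch in range(50):                               # 50 epochs planned
    for (z1,) in loader:
        loss = flow_matching_loss(model, z1.cuda())
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at 1.0
        opt.step()
```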
The launch crash was at the first forward pass through the encoder. The traceback bottomed out at a conv layer in the second set-abstraction (SA) layer of PointNet++ expecting 64 input channels and receiving 67. The surplus of exactly three channels is the giveaway: the SA layer concatenates the grouped xyz coordinates (3 channels) onto the input features before the MLP, so the actual feature count entering each SA layer's MLP is `in_channel + 3`, not `in_channel`. The reference PointNet++ implementations handle this with an `in_channel + 3` offset in the layer constructor; the initial code here did not.
| SA layer | Declared in_channel | Concatenated to MLP input | Fix |
|---|---|---|---|
| SA1 (2048 → 512) | 3 (xyz only) | 0 input features + 3 xyz = 3 → MLP expects 3 | Set in_channel = 0 (no input features); MLP first layer takes 3 |
| SA2 (512 → 128) | 64 | 64 features + 3 xyz = 67 → MLP expects 64 | MLP first layer takes in_channel + 3 = 67 |
| SA3 (128 → 32) | 128 | 128 + 3 = 131 → MLP expects 128 | MLP first layer takes in_channel + 3 = 131 |
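A minimal sketch of the MLP head with the offset applied, following the common reference PointNet++ pattern; `SetAbstractionMLP` is an illustrative name, not the class in `flow_model.py`:

```python
import torch.nn as nn


class SetAbstractionMLP(nn.Module):
    """Point-wise MLP inside a PointNet++ set-abstraction layer (sketch of the fix).

    The grouping step concatenates the grouped xyz offsets (3 channels) onto the
    input features, so the first conv must take in_channel + 3, not in_channel.
    """

    def __init__(self, in_channel, mlp):
        super().__init__()
        layers, last = [], in_channel + 3        # the "+ 3" offset is the fix
        for out in mlp:                          # e.g. mlp=[64, 64, 128] for SA2
            layers += [nn.Conv2d(last, out, 1), nn.BatchNorm2d(out), nn.ReLU()]
            last = out
        self.mlp = nn.Sequential(*layers)

    def forward(self, grouped):
        # grouped: (B, in_channel + 3, nsample, npoint), xyz already concatenated
        return self.mlp(grouped)


# SA1: in_channel = 0 (xyz only)  -> first conv takes 0 + 3 = 3
# SA2: in_channel = 64            -> first conv takes 64 + 3 = 67
# SA3: in_channel = 128           -> first conv takes 128 + 3 = 131
```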
Once the offset was applied uniformly to the three SA layers, the forward pass ran clean and training proceeded. The lesson, also saved as a feedback memory: PointNet++-derivative implementations are a perennial source of channel-count off-by-three bugs; the SA layer's `in_channel` argument means "input features", not "input features + xyz", and the +3 has to be added at the MLP head.
Interactive demo: step through the flow-matching trajectory in latent space. Pick a target ModelNet10 category — chair, table, or airplane — then advance through the 10-step trajectory to see the generated point cloud emerge from noise. The middle pane shows the 32-token latent state at the current step; the right pane shows the decoded point cloud.