The follow-on to the MNIST validation (Topic 25) — scaling Pure Mamba from 2-D digits to 3-D point clouds and onward to sparse voxels, paired with flow matching in latent space. Reference targets are SparC3D and TRELLIS; the design goal is a 2–3× training speed-up and a 5–12× inference speed-up by replacing transformer attention with linear-time state-space blocks. Documents the architecture spec, the point-cloud Phase-2 validation on ModelNet10, and the channel-mismatch bug that held up first training.
The thesis line wants fast single-image-to-3-D — train on a consumer rig, infer at interactive rates. The published references at the time of this work — SparC3D and Microsoft's TRELLIS — both use transformer attention over sparse voxel token sequences. The token counts are short enough (~100–200 sparse cubes per scene) that transformer attention is not catastrophic, but the cost is still quadratic, and the per-step inference latency on consumer GPUs is the bottleneck for the interactive use-case the thesis targets.
The Mamba state-space block [1,2] is the obvious substitute. It is linear-time in sequence length, has constant memory under autoregressive rollouts, and on the MNIST validation (Topic 25) it matched or beat the transformer baseline on visual quality despite a slightly worse loss number. The question Topic 26 answers: does the Mamba-substitution survive the jump from 2-D MNIST to 3-D point clouds, and is the architecture spec coherent enough to commit to a full SparC3D-class implementation?
The honest scope: this is a design + Phase-2 validation page. The full SparC3D-class implementation is not yet built; what is built is the point-cloud generator end-to-end, the architecture spec for the sparse-voxel extension, and the documented bring-up on ModelNet10 including the PointNet++ channel-mismatch fix that blocked the first training run.
The Phase-2 architecture is end-to-end deterministic in the encoder and decoder, with the generative work happening in latent space. The forward path is:
- **PointNet++ encoder.** Three set-abstraction layers bringing 2 048 points down to 32 hierarchical features at 256 dimensions each. The three SA layers are (2048→512, mlp=[32,32,64]) → (512→128, mlp=[64,64,128]) → (128→32, mlp=[128,128,256]). The first layer takes no input features (only xyz coordinates); the next two take the previous layer's output. This is where the channel-mismatch bug appeared (§04).
- **Pure-Mamba body.** Ten blocks at d_model = 256, d_state = 128, run over the 32 latent tokens. Pure-Mamba (not the Mamba+Attention hybrid) was the winner of the MNIST validation in Topic 25 on the speed–quality trade-off, so it is the substrate carried forward. The 10-block depth is taken from the MNIST setup; whether this is the right depth at 3-D scale is open. (A sketch of the Mamba velocity field follows this list.)
- **Flow matching in latent space.** Linear interpolation between noise and the latent code as the trajectory; the network predicts the velocity field v; the loss is ‖v − v̂‖². Sampling is a 50-step Euler integrator from z₀ ∼ 𝒩(0, I) to the generated latent. The 50 steps is the nominal — the target band is 20–50 steps once the model is trained.
- **FoldingNet decoder.** Two-step folding: the latent is broadcast across a fixed 2-D 45 × 45 ≈ 2 048 grid, the grid is concatenated with the latent and passed through a small MLP to produce the first folded point cloud, then concatenated again and folded a second time to produce the output point cloud. This is the standard FoldingNet pattern. (A sketch follows the component table below.)
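The velocity network and the flow-matching loop can be sketched directly from this spec. A minimal sketch, assuming the `mamba_ssm` package's `Mamba` block (CUDA kernels required); the names `LatentVelocityField`, `flow_matching_loss`, and `sample_latents`, the residual/LayerNorm arrangement, and the time embedding are illustrative choices, not the contents of `flow_model.py`:

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (CUDA-only kernels)


class LatentVelocityField(nn.Module):
    """Pure-Mamba velocity network over the (B, 32, 256) latent tokens.

    Illustrative sketch: 10 residual Mamba blocks at d_model=256, d_state=128,
    with a simple per-sample time embedding added to the input tokens.
    """

    def __init__(self, d_model=256, d_state=128, depth=10):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        self.blocks = nn.ModuleList(
            Mamba(d_model=d_model, d_state=d_state) for _ in range(depth)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(depth))
        self.head = nn.Linear(d_model, d_model)

    def forward(self, z_t, t):
        # z_t: (B, 32, 256) point on the noise->latent trajectory; t: (B,) in [0, 1]
        t_emb = self.time_mlp(t[:, None, None].expand(-1, z_t.size(1), 1))
        h = z_t + t_emb
        for norm, block in zip(self.norms, self.blocks):
            h = h + block(norm(h))  # linear-time state-space block, residual
        return self.head(h)         # predicted velocity, (B, 32, 256)


def flow_matching_loss(model, z1):
    """Linear-interpolation flow matching: z_t = (1-t)*z0 + t*z1, target v = z1 - z0."""
    z0 = torch.randn_like(z1)                     # noise endpoint
    t = torch.rand(z1.size(0), device=z1.device)  # per-sample time
    z_t = (1 - t)[:, None, None] * z0 + t[:, None, None] * z1
    v_pred = model(z_t, t)
    return ((v_pred - (z1 - z0)) ** 2).mean()     # squared velocity error


@torch.no_grad()
def sample_latents(model, batch_size=8, steps=50, device="cuda"):
    """Euler integration from z0 ~ N(0, I) to a generated latent (50 steps nominal)."""
    z = torch.randn(batch_size, 32, 256, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((batch_size,), i * dt, device=device)
        z = z + dt * model(z, t)
    return z
```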
| Component | Spec | Output shape | Param count |
|---|---|---|---|
| PointNet++ encoder | 3 SA layers, MLPs [32,32,64] / [64,64,128] / [128,128,256] | (B, 32, 256) | ~1.2 M |
| Pure-Mamba ×10 | d_model=256, d_state=128 | (B, 32, 256) | ~5.1 M |
| Flow-matching head | Velocity MLP, latent-conditioned | (B, 32, 256) | ~0.4 M |
| FoldingNet decoder | Two folds over 45×45 grid | (B, 2048, 3) | ~0.55 M |
Total: ~7.25 M parameters — confirmed by parameter count at first training launch.
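The two-fold decoder follows the standard FoldingNet pattern. A minimal sketch; `FoldingDecoder` is an illustrative name, the mean-pooling of the 32 latent tokens into one 256-dim codeword before broadcasting is an assumption made here, and the 45 × 45 grid yields 2 025 points, which the spec rounds to ≈ 2 048 (Chamfer Distance does not require equal point counts):

```python
import torch
import torch.nn as nn


class FoldingDecoder(nn.Module):
    """Two-fold FoldingNet decoder over a fixed 45 x 45 2-D grid (sketch)."""

    def __init__(self, code_dim=256, grid_size=45):
        super().__init__()
        # Fixed 2-D grid in [-1, 1]^2: 45 * 45 = 2025 points (~2048).
        lin = torch.linspace(-1.0, 1.0, grid_size)
        grid = torch.stack(torch.meshgrid(lin, lin, indexing="ij"), dim=-1)
        self.register_buffer("grid", grid.reshape(-1, 2))  # (2025, 2)
        self.fold1 = nn.Sequential(                         # grid + codeword -> first fold
            nn.Linear(code_dim + 2, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 3),
        )
        self.fold2 = nn.Sequential(                         # first fold + codeword -> output
            nn.Linear(code_dim + 3, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 3),
        )

    def forward(self, latent_tokens):
        # latent_tokens: (B, 32, 256); pooled to one codeword per shape (assumption)
        code = latent_tokens.mean(dim=1)
        B, N = code.size(0), self.grid.size(0)
        code_rep = code[:, None, :].expand(B, N, code.size(1))
        grid_rep = self.grid[None].expand(B, N, 2)
        pts1 = self.fold1(torch.cat([code_rep, grid_rep], dim=-1))  # first folded cloud
        pts2 = self.fold2(torch.cat([code_rep, pts1], dim=-1))      # second fold
        return pts2                                                  # (B, 2025, 3)
```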
The Phase-2 work documented here is the point-cloud generator — PointNet++ encoder + Pure-Mamba + flow matching + FoldingNet. The full Phase-3 / Phase-4 system swaps in a sparse-voxel tokeniser (SparseCubes, ~100–200 tokens per scene) and an image-conditioning branch (ViT image encoder, cross-attention into the Mamba blocks). The diagram below shows both — Phase-2 is the dark-bordered path, Phase-3+ is the light-bordered extension.
The target speed-up numbers are derived from two sources: the measured Mamba-vs-transformer ratios reported in [2] and an estimate of the inference advantage flow-matching gives over DDPM at 20–50 steps vs 250–1000 steps respectively.
| Metric | SparC3D / TRELLIS reference | MambaFlow3D target | Source of ratio |
|---|---|---|---|
| Training step time (sparse-cube scale) | 1.0 × | 0.33–0.5 × | Mamba-2 fast-path on Ampere; [2] |
| Sampling steps | ~250 DDPM / ~50 if v-pred | 20–50 (flow matching) | FM scheduler choice |
| Per-step inference time | 1.0 × | 0.4–0.6 × (linear-time block) | Mamba sequential rollout |
| End-to-end inference latency | 1.0 × | 0.08–0.18 × (5–12 × faster) | Compound of the two |
| Single-image-to-3-D quality target | SparC3D-class | SparC3D-class @ 5× faster | Speed, not quality, is the contribution |
The 5–12× inference-latency range is wide because two of the constituent factors (Mamba per-step speed-up on Ampere, FM step count that retains quality) are not yet measured at the SparC3D-scale token count. The 5× lower bound assumes the modest end of each; the 12× upper bound assumes the optimistic end of each. Neither has been verified at 3-D scale.
Pure Mamba won on MNIST. Then a channel mismatch killed the first launch.
The architectural decision (Pure Mamba over Hybrid Mamba+Attention) was already settled by the Topic-25 MNIST validation. The Phase-2 bring-up on ModelNet10 was supposed to be a re-tune of known-good components — and was, except that the PointNet++ feature concatenation in the encoder's set-abstraction layer produced one channel count and the conv after it expected a different one. The fix was four characters in one MLP head; the diagnosis took an afternoon because the symptom (a channel error three layers deep) did not point at the SA-layer feature concatenation as the cause.
The Phase-2 implementation is five files: `flow_model.py` (PointNet++ + Mamba + FoldingNet, the network), `train.py` (flow-matching loss, AdamW, gradient clipping), `modelnet_loader.py` (ModelNet10 dataloader, unit-sphere normalisation, 2 048-point sampling), `evaluate.py` (Chamfer Distance + 3-D plot), and `benchmark.py` (timing harness for the 30 s/epoch target). All five live under `3d_point_cloud_Mamba/`.
| Setting | Value | Notes |
|---|---|---|
| Dataset | ModelNet10 | 10 categories, ~5 K shapes total (3 991 train / 908 test) |
| Points per shape | 2 048 | Fixed; sampled with farthest-point sampling at load |
| Batch size | 16–32 | 16 confirmed stable on 12 GB; 24 if memory headroom holds |
| Optimiser | AdamW, lr 1×10⁻⁴ | Gradient clipping at 1.0 |
| Epochs (planned) | 50 | ~20 min wall-clock target |
| Memory target | < 8 GB / GPU | Confirmed at ~6.5 GB peak |
| Epoch-time target | < 30 s | Confirmed at ~25 s once channel bug resolved |
| Quality target | Chamfer < 0.01 | Phase-2 success criterion; not yet measured at full convergence |
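A minimal sketch of the training-step wiring implied by this table (AdamW at lr 1×10⁻⁴, gradient clipping at 1.0, batch size 16, 50 epochs). It reuses `LatentVelocityField` and `flow_matching_loss` from the sketch above and feeds synthetic stand-in latents; the real `train.py` consumes PointNet++ encodings of ModelNet10 shapes, and whether the encoder and decoder are optimised jointly with the flow loss is not pinned down here:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in latents; the real pipeline feeds PointNet++ encodings of
# unit-sphere-normalised, 2048-point ModelNet10 shapes (modelnet_loader.py).
fake_latents = torch.randn(64, 32, 256)
loader = DataLoader(TensorDataset(fake_latents), batch_size=16, shuffle=True)

model = LatentVelocityField().cuda()                  # from the sketch above
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr per the config table

for epoch in range(50):                               # 50 epochs planned
    for (z1,) in loader:
        loss = flow_matching_loss(model, z1.cuda())
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at 1.0
        opt.step()
```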
The launch crash was at the first forward pass through the encoder. The traceback bottomed out at a conv layer in the second set-abstraction (SA) layer of PointNet++ expecting 64 input channels and receiving 67. The surplus of exactly three channels is the giveaway: the SA layer concatenates the grouped xyz coordinates (3 channels) onto the input features before the MLP, so the actual feature count entering each SA layer's MLP is `in_channel + 3`, not `in_channel`. The reference PointNet++ implementations handle this with an `in_channel + 3` offset in the layer constructor; the initial code here did not.
| SA layer | Declared in_channel | Concatenated to MLP input | Fix |
|---|---|---|---|
| SA1 (2048 → 512) | 3 (xyz only) | 0 input features + 3 xyz = 3 → MLP expects 3 | Set in_channel = 0 (no input features); MLP first layer takes 3 |
| SA2 (512 → 128) | 64 | 64 features + 3 xyz = 67 → MLP expects 64 | MLP first layer takes in_channel + 3 = 67 |
| SA3 (128 → 32) | 128 | 128 + 3 = 131 → MLP expects 128 | MLP first layer takes in_channel + 3 = 131 |
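A minimal sketch of the MLP head with the offset applied, following the common reference PointNet++ pattern; `SetAbstractionMLP` is an illustrative name, not the class in `flow_model.py`:

```python
import torch.nn as nn


class SetAbstractionMLP(nn.Module):
    """Point-wise MLP inside a PointNet++ set-abstraction layer (sketch of the fix).

    The grouping step concatenates the grouped xyz offsets (3 channels) onto the
    input features, so the first conv must take in_channel + 3, not in_channel.
    """

    def __init__(self, in_channel, mlp):
        super().__init__()
        layers, last = [], in_channel + 3        # the "+ 3" offset is the fix
        for out in mlp:                          # e.g. mlp=[64, 64, 128] for SA2
            layers += [nn.Conv2d(last, out, 1), nn.BatchNorm2d(out), nn.ReLU()]
            last = out
        self.mlp = nn.Sequential(*layers)

    def forward(self, grouped):
        # grouped: (B, in_channel + 3, nsample, npoint), xyz already concatenated
        return self.mlp(grouped)


# SA1: in_channel = 0 (xyz only)  -> first conv takes 0 + 3 = 3
# SA2: in_channel = 64            -> first conv takes 64 + 3 = 67
# SA3: in_channel = 128           -> first conv takes 128 + 3 = 131
```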
Once the offset was applied uniformly to the three SA layers, the forward pass ran clean and training proceeded. The lesson, also saved as a feedback memory: PointNet++-derivative implementations are a perennial source of channel-count off-by-three bugs; the SA layer's `in_channel` argument means "input features", not "input features + xyz", and the +3 has to be added at the MLP head.
Interactive demo: step through the flow-matching trajectory in latent space. Pick a target ModelNet10 category — chair, table, or airplane — then advance through the 10-step trajectory to see the generated point cloud emerge from noise. The middle pane shows the 32-token latent state at the current step; the right pane shows the decoded point cloud.