The thesis line of work targets a fast single-image-to-3-D generator. The architectural premise carried into the scaling work is that the canonical transformer + DDPM stack used by SparC3D [5] and TRELLIS [6] can be replaced by Mamba + flow matching for a substantial inference-latency win without giving up too much sample quality. The premise is plausible but not free: the Mamba state-space block [1] is a less-established sequence primitive than transformer attention, and flow matching [3] has different stability characteristics than DDPM [7] at the same parameter budget. Before committing rented-GPU hours at 3-D scale, the question is whether the substitution survives at all.
This paper is the cheap-but-honest validation step. Three backbones (Pure Mamba, Pure Transformer, Hybrid Mamba+Attention) are trained under identical conditions on MNIST with a shared latent-flow-matching head: parameter counts matched to within ±10 %, identical training schedule, identical evaluation. A winner is picked under a decision rule defined ex ante. The contribution is the validation protocol and the resulting backbone choice.
The contributions are: (1) the three-backbone protocol with matched parameters, identical head, identical training; (2) the ex-ante decision rule that mixes loss and visual sample quality, with quality as the primary signal at low loss values; (3) the result that Pure Mamba and Hybrid are tied on quality and Pure Mamba is clearly faster, so Pure Mamba is the right backbone for the 3-D scaling work.
MNIST [8]. 60 K training images, 10 K test images, 28 × 28 greyscale, 10 digit classes. Normalised to [−1, 1]. No augmentation. The choice is deliberate — MNIST is small enough that the three backbones can be trained to convergence on a single RTX 3060 in under an hour each, and the failure modes (the Pure Transformer "produces noise" outcome documented in §4.3) are recognisable visually without needing FID-class metrics.
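A minimal loading sketch, assuming a standard torchvision pipeline; everything beyond the batch size and the [−1, 1] normalisation (e.g. the worker count) is an illustrative choice, not taken from the protocol.

```python
# Sketch of the data pipeline described above (assumes torchvision).
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                 # uint8 -> float in [0, 1]
    transforms.Normalize((0.5,), (0.5,)),  # [0, 1] -> [-1, 1], no augmentation
])
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=2)
```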
All three backbones share: a patch tokeniser that maps 28 × 28 → 196 tokens at embedding dim 128 (2 × 2 patches, so 14 × 14 = 196 tokens; learned linear embedding plus sinusoidal positional embedding); a flow-matching head that takes the backbone output and predicts the velocity field v̂(z_t, t) with sinusoidal time embedding; and an un-patch step that maps the predicted tokens back to 28 × 28. A sketch of this shared scaffold follows the backbone table below.
| Backbone | Body | Param count |
|---|---|---|
| Pure Mamba (PM) | 8 × Mamba block · d_model 128 · d_state 64 | ~3.1 M |
| Pure Transformer (PT) | 8 × Transformer block · d_model 128 · 8 heads · MLP 512 | ~3.2 M |
| Hybrid (HMA) | 4 × (Mamba + Attention) · interleaved, d_model 128 | ~3.4 M |
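A hedged sketch of the shared scaffold around the three backbone bodies. Shapes follow the text; the class name, the Conv2d patch embed, and the transpose-conv un-patch step are assumptions about glue the text does not specify.

```python
# Shared scaffold: patch tokeniser -> backbone body -> velocity head.
# 2x2 patches on 28x28 give 14x14 = 196 tokens at d_model 128.
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x, dim):
    # Standard sinusoidal embedding, used for both positions and timesteps.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=x.device) / half)
    ang = x.float().unsqueeze(-1) * freqs
    return torch.cat([ang.sin(), ang.cos()], dim=-1)

class LatentFlowModel(nn.Module):
    def __init__(self, backbone, d_model=128, patch=2, img=28):
        super().__init__()
        self.side = img // patch                                    # 14
        self.patchify = nn.Conv2d(1, d_model, patch, stride=patch)  # learned linear embed
        self.register_buffer(
            "pos", sinusoidal_embedding(torch.arange(self.side ** 2), d_model))
        self.backbone = backbone                                    # PM / PT / HMA body
        self.time_mlp = nn.Linear(d_model, d_model)
        self.unpatch = nn.ConvTranspose2d(d_model, 1, patch, stride=patch)

    def forward(self, z_t, t):
        h = self.patchify(z_t).flatten(2).transpose(1, 2) + self.pos   # (B, 196, 128)
        h = h + self.time_mlp(sinusoidal_embedding(t, h.shape[-1])).unsqueeze(1)
        B, N, D = h.shape
        h = self.backbone(h)                                        # token-to-token body
        h = h.transpose(1, 2).reshape(B, D, self.side, self.side)
        return self.unpatch(h)                                      # v̂(z_t, t), (B, 1, 28, 28)
```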
AdamW [9], learning rate 1×10⁻⁴, gradient clipping at norm 1.0, batch 256, 50 epochs. Single RTX 3060 12 GB on Vast.ai. fp16 mixed precision. No EMA, no learning-rate warmup or cosine schedule — kept simple to avoid confounding the comparison.
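A matching training-step sketch, reusing the scaffold above. The straight-path flow-matching target (v = x − noise along the linear interpolant) is the standard rectified-flow form and is an assumption here, since the text does not pin down the interpolant.

```python
# Training step for the recipe above (AdamW, lr 1e-4, clip 1.0, fp16 autocast).
import torch

model = LatentFlowModel(backbone).cuda()         # backbone: one of the PM / PT / HMA bodies
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

def train_step(x):                               # x in [-1, 1], shape (B, 1, 28, 28)
    noise = torch.randn_like(x)
    t = torch.rand(x.shape[0], device=x.device)
    tb = t.view(-1, 1, 1, 1)
    z_t = (1 - tb) * noise + tb * x              # linear interpolant, noise -> data
    target_v = x - noise                         # its constant velocity
    with torch.cuda.amp.autocast():
        loss = ((model(z_t, t) - target_v) ** 2).mean()
    opt.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()
    scaler.unscale_(opt)                         # so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(opt)
    scaler.update()
    return loss.item()
```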
Two evaluation channels. Loss: the training loss averaged over the final epoch of the 50-epoch run. Visual sample quality: 64 generated samples per backbone under the same 10-step Euler-integrator sampling schedule (sketched below), judged by inspection on whether the generated digits are recognisable as digits. The judging is binary in this paper: the cases under study (clean digit / noise / marginal-sharpness difference) are visually obvious at MNIST scale and do not need a graded FID-class metric to distinguish them.
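The sampler, under the assumption of a uniform schedule integrating t from 0 to 1; the text specifies only "10-step Euler".

```python
# 10-step Euler integration of the learned velocity field.
import torch

@torch.no_grad()
def sample(model, n=64, steps=10, device="cuda"):
    z = torch.randn(n, 1, 28, 28, device=device)   # start from pure noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((n,), i * dt, device=device)
        z = z + model(z, t) * dt                   # Euler step along v̂(z_t, t)
    return z.clamp(-1, 1)                          # samples at t = 1
```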
Set before the experiment ran: the winning backbone is the one with the best speed-quality trade-off, where:
Quality is judged first by sample quality (binary: clean digits or noise) and second by loss (lower better). Sample quality is the primary signal at low loss values because the loss landscape at the optimum is flat enough that 5–10 % loss differences map onto visually imperceptible quality differences.
Speed is per-step time on the training rig, measured over the last 100 training steps. Memory peak is a secondary speed proxy.
Tie-breakers (the full rule is written out as code below): if the surviving backbones are quality-equivalent, pick the fastest; if they are speed-equivalent, pick the highest-quality one. A backbone that fails the quality bar entirely (produces noise) is eliminated regardless of loss or speed.
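The rule as a function, for precision; the field names are hypothetical, the logic is exactly the rule stated above.

```python
# Ex-ante decision rule over the three backbone results.
def pick_backbone(results):
    # results: list of dicts with keys "name", "clean_digits" (bool, the
    # binary quality bar), "ms_per_step", "loss".
    survivors = [r for r in results if r["clean_digits"]]  # noise -> eliminated
    if not survivors:
        return None
    # Among quality-equivalent survivors the fastest wins; a speed tie
    # falls back to the loss number, the secondary quality signal.
    return min(survivors, key=lambda r: (r["ms_per_step"], r["loss"]))["name"]
```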
| Backbone | Final loss | Per-step time | Peak memory | 50-epoch wallclock |
|---|---|---|---|---|
| Pure Mamba | 0.086 | 18 ms | 3.8 GB | ~42 min |
| Hybrid | 0.080 | 26 ms | 4.5 GB | ~60 min |
| Pure Transformer | 0.092 | 21 ms | 4.1 GB | ~49 min |
The Hybrid wins the loss number by 7 % over Pure Mamba; Pure Mamba wins the per-step-time number by 31 % over Hybrid. Pure Transformer is third on loss and second on per-step time.
Pure Mamba and Hybrid both produce clean recognisable digits across all ten classes after the 10-step Euler sampling. The Hybrid samples are marginally sharper on close inspection — slightly thinner edge transitions, slightly less blurry interior strokes — but the gap is small enough that it would not be visible without side-by-side inspection.
Pure Transformer produces noise. The 64 samples at end-of-training are visually indistinguishable from the initial noise input. The loss (0.092) is still in the "actively training" range: the model is improving but has not yet entered the regime where it generates meaningful images. The 50-epoch budget is not enough for Pure Transformer at this parameter count.
The Pure-Transformer failure to converge in 50 epochs at 3.2 M parameters is not new — transformers on flow matching at small parameter budget are known to be slow to converge and benefit from much longer training runs than Mamba or Hybrid backbones at the same budget. The point of including it in the comparison is not to claim Pure Transformer cannot do this task (it can, with more compute), but to confirm that at the budget this work has to spend, the Mamba-class backbones are clearly preferred.
| Criterion | Pure Mamba | Hybrid | Pure Transformer |
|---|---|---|---|
| Visual quality bar | Pass | Pass (marginal edge) | Fail (noise) |
| Speed | Fastest | Slowest | Middle |
| Memory | Lowest | Highest | Middle |
| Decision outcome | Win | Skip — quality edge does not justify cost | Eliminated — fails quality bar |
Two observations carry forward to MambaFlow3D [4]. First, the per-step speed gap between Pure Mamba and Hybrid (18 ms vs 26 ms, a 1.44 × ratio) is measured at 196 tokens. At SparC3D-scale token counts (100–200 sparse-cube tokens) the absolute time per step is similar to MNIST scale, but the Hybrid's per-block attention remains O(N²) in token count while the Mamba blocks stay O(N), so the gap widens as soon as token counts grow; a toy cost model follows below. The MNIST result is a lower bound on the Pure-Mamba win at SparC3D scale.
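The toy cost model behind the O(N²) argument, with illustrative coefficients; it does not reproduce the measured 1.44 × gap (which includes kernel and overhead effects the model ignores), only the trend in token count N.

```python
# Per-block FLOP estimates, constants dropped; illustrative only.
def mamba_flops(n, d=128, d_state=64):
    return n * d * d_state                 # linear in sequence length

def attn_flops(n, d=128):
    return n * n * d + n * d * d           # QK^T / AV terms + projections

for n in (196, 1_000, 5_000):
    grow = attn_flops(n) / attn_flops(196)
    print(f"N={n}: attention block cost ~{grow:.0f}x the 196-token cost")
    # Mamba cost over the same range grows only linearly (5000/196 ~ 25x).
```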
Second, the loss-vs-quality observation generalises. At low loss values the loss landscape is flat enough that small differences in the loss number do not predict visual quality differences reliably. The downstream MambaFlow3D evaluation will therefore use Chamfer distance + visual inspection rather than loss as the primary quality signal, mirroring the protocol used here.
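A Chamfer-distance sketch for reference; the symmetric squared form over raw point clouds is one common convention, and the exact variant MambaFlow3D will use is not pinned down here.

```python
# Symmetric squared Chamfer distance between two point clouds.
import torch

def chamfer(a, b):
    # a: (N, 3) predicted points, b: (M, 3) reference points
    d = torch.cdist(a, b) ** 2                     # (N, M) pairwise squared distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```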
One caveat: the comparison here used a single random seed per backbone. A multi-seed re-run would tighten the error bars on the loss numbers (0.080 vs 0.086 may be within run-to-run variance) but is unlikely to flip the decision: it turns on visual quality and speed, both of which are robust to seed variation in the regime this experiment occupied.
Three concrete open questions left by this work. (i) Pure Transformer at a longer budget: whether Pure Transformer at 3.2 M parameters reaches Pure-Mamba sample quality at 200 epochs instead of 50. The hypothesis is yes, at 4 × the compute. Not pursued here because the downstream pick is between the Mamba-class backbones. (ii) Multi-seed loss tightening: re-run all three backbones with 3 seeds each so the published loss numbers carry error bars. The decision is robust to seed variance, but the numbers should be reported with that variance quantified. (iii) Higher-resolution test: repeat the comparison on CIFAR-class data (32 × 32, 3 channels) before committing to the SparC3D-scale scaling experiment. Not done in this work due to compute budget; the MambaFlow3D Phase-2 run on ModelNet10 is the next validation step instead.
Three backbones, matched parameters, identical training, identical evaluation, ex-ante decision rule. Pure Mamba wins the speed-quality trade-off: effectively indistinguishable from the Hybrid on visual sample quality and 1.44 × faster per step. Pure Transformer fails the quality bar within the 50-epoch budget. The decision is informed by the downstream SparC3D-scale constraint, where the Hybrid's per-block attention becomes more expensive in absolute terms. The Pure-Mamba pick is the backbone choice for the MambaFlow3D scaling work documented in [4].