Technical Report · cs.LG · Nov 2025
Flow Matching Backbone Validation on MNIST: A Three-Way Comparison of Pure Mamba, Pure Transformer, and Hybrid Mamba+Attention Under Matched Parameter Count
Aaditya Jain
Flow Matching · Backbone Validation · Thesis Research, Unpublished Preprint
Submitted: November 2025 Subject: cs.LG Keywords: flow matching, Mamba, transformer, hybrid, MNIST, backbone selection, downstream 3-D generation
Abstract
We report a matched-parameter MNIST flow-matching validation across three sequence-processing backbones — Pure Mamba [1], Pure Transformer [2], and Hybrid Mamba+Attention — all sharing the same patch tokeniser, the same conditional-flow-matching head [3], the same training schedule, and the same evaluation protocol. The motivation is upstream of the 3-D generation thesis line: the Pure-Mamba premise carried forward into MambaFlow3D [4] rests on this validation being a deliberate visual-quality + per-step-speed pick rather than a loss-only pick. The Hybrid backbone won the loss number (final 0.080 vs Pure Mamba's 0.086 vs Pure Transformer's 0.092). Visually, Pure Mamba and Hybrid both produced clean recognisable digits; the Hybrid samples were marginally sharper on close inspection, but the gap was small. Pure Transformer's samples remained noise within the 50-epoch budget despite a slowly decreasing loss. Per-step times: Pure Mamba 18 ms, Pure Transformer 21 ms, Hybrid 26 ms. Memory peaks: 3.8, 4.1, 4.5 GB. Decision rule (defined ex ante): pick the backbone with the best speed-quality trade-off, where quality is sample quality first and loss second. Pure Mamba wins under this rule — it is indistinguishable from the Hybrid on quality and clearly faster. The decision is informed by a downstream constraint: at SparC3D-scale token counts (100–200 sparse-cube tokens) the Hybrid's per-block attention is still O(N²), so the per-step speed gap widens further at 3-D scale than the gap measured here. The contribution is the validation protocol — matched parameter count, ex-ante decision rule, both loss and visual-quality criteria — and the result that Pure Mamba is the right backbone for the MambaFlow3D scaling work.
1. Introduction

The thesis line targets a fast single-image-to-3-D generator. The architectural premise carried into the scaling work is that the canonical transformer + DDPM stack used by SparC3D [5] and TRELLIS [6] can be replaced by Mamba + flow matching for a substantial inference-latency win without giving up too much sample quality. The premise is plausible but not free: the Mamba state-space block [1] is a less-established sequence primitive than transformer attention, and flow matching [3] has different stability characteristics than DDPM [7] at the same parameter budget. Before committing the 3-D-scale rented-GPU hours, the question is whether the substitution survives at all.

This paper is the cheap-but-honest validation step. Three backbones — Pure Mamba, Pure Transformer, Hybrid Mamba+Attention — train under identical conditions on MNIST under a shared latent-flow-matching head. Matched parameter count to within ±10 %, identical training schedule, identical evaluation. Pick a winner under a decision rule defined ex ante. The contribution is the validation protocol and the resulting backbone choice.

The contributions are: (1) the three-backbone protocol with matched parameters, identical head, identical training; (2) the ex-ante decision rule that mixes loss and visual sample quality, with quality as the primary signal at low loss values; (3) the result that Pure Mamba and Hybrid are tied on quality and Pure Mamba is clearly faster, so Pure Mamba is the right backbone for the 3-D scaling work.

2. Setup
2.1 Dataset

MNIST [8]. 60 K training images, 10 K test images, 28 × 28 greyscale, 10 digit classes. Normalised to [−1, 1]. No augmentation. The choice is deliberate — MNIST is small enough that the three backbones can be trained to convergence on a single RTX 3060 in under an hour each, and the failure modes (the Pure Transformer "produces noise" outcome documented in §4.3) are recognisable visually without needing FID-class metrics.

2.2 Tokeniser, head, decoder

All three backbones share: a patch tokeniser that maps 28 × 28 → 196 tokens at embedding dim 128 (2 × 2 patches, learned linear embedding plus sinusoidal positional embedding); a flow-matching head that takes the backbone output and predicts the velocity field v̂(z_t, t) with sinusoidal time embedding; and an un-patch step that maps the predicted tokens back to 28 × 28.
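A minimal sketch of the shared tokeniser path, assuming 2 × 2 patches so that a 28 × 28 image yields the 196 tokens used throughout; the function names (`patchify`, `sinusoidal_pos_emb`) and the random stand-in for the learned linear embedding are illustrative, not the actual implementation.

```python
import numpy as np

def patchify(img, p=2):
    """Split an HxW image into non-overlapping p x p patches, one token each."""
    h, w = img.shape
    n = (h // p) * (w // p)
    return img.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3).reshape(n, p * p)

def sinusoidal_pos_emb(n_tokens, dim):
    """Standard sinusoidal positional embedding, shape (n_tokens, dim)."""
    pos = np.arange(n_tokens)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)[None, :]
    emb = np.zeros((n_tokens, dim))
    emb[:, 0::2] = np.sin(pos * freqs)
    emb[:, 1::2] = np.cos(pos * freqs)
    return emb

rng = np.random.default_rng(0)
img = rng.standard_normal((28, 28))
tokens = patchify(img)                               # (196, 4)
W = rng.standard_normal((4, 128)) * 0.02             # stand-in for the learned embedding
embedded = tokens @ W + sinusoidal_pos_emb(196, 128) # (196, 128) backbone input
```

The un-patch step at the output is the exact inverse of `patchify`: project tokens back to patch dimension 4 and invert the reshape/transpose.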

2.3 Backbones (matched parameter count)
Table 1 — Three backbones, matched parameters within ±10 %.
Backbone                Body                                                      Param count
Pure Mamba (PM)         8 × Mamba block · d_model 128 · d_state 64                ~3.1 M
Pure Transformer (PT)   8 × Transformer block · d_model 128 · 8 heads · MLP 512   ~3.2 M
Hybrid (HMA)            4 × (Mamba + Attention) · interleaved, d_model 128        ~3.4 M
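The block layouts in Table 1 can be made explicit; this is a layout sketch only (the `body_layout` name and block labels are illustrative), meant to show how the Hybrid interleaves a Mamba block and an attention block per pair at the same depth-8 budget.

```python
def body_layout(kind, depth=8):
    """Return the block sequence for each backbone body (labels illustrative)."""
    if kind == "PM":    # Pure Mamba: 8 Mamba blocks
        return ["mamba"] * depth
    if kind == "PT":    # Pure Transformer: 8 attention blocks
        return ["attn"] * depth
    if kind == "HMA":   # Hybrid: 4 interleaved (Mamba, attention) pairs
        return ["mamba", "attn"] * (depth // 2)
    raise ValueError(kind)
```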
2.4 Training

AdamW [9], learning rate 1×10⁻⁴, gradient clipping at norm 1.0, batch 256, 50 epochs. Single RTX 3060 12 GB on Vast.ai. fp16 mixed precision. No EMA, no learning-rate warmup or cosine schedule — kept simple to avoid confounding the comparison.
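One conditional-flow-matching training step can be sketched as follows, assuming the standard linear-interpolation path z_t = (1 − t)·x0 + t·x1 with conditional velocity target x1 − x0; the `backbone_head` stand-in marks where the actual backbone + velocity head sits, and the shapes mirror the batch-256, 196-token, dim-128 setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, D = 256, 196, 128                  # batch, tokens, embedding dim

x1 = rng.standard_normal((B, N, D))      # data tokens (after the tokeniser)
x0 = rng.standard_normal((B, N, D))      # Gaussian-noise endpoint
t = rng.uniform(size=(B, 1, 1))          # one time per sample, broadcast over tokens

z_t = (1.0 - t) * x0 + t * x1            # linear interpolation path
v_target = x1 - x0                       # conditional velocity target

def backbone_head(z, t):
    """Stand-in for backbone + velocity head; the real model goes here."""
    return np.zeros_like(z)

v_pred = backbone_head(z_t, t)
loss = np.mean((v_pred - v_target) ** 2) # MSE flow-matching objective
```

In the actual runs this loss is backpropagated through the backbone under fp16 autocast with the AdamW settings above.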

2.5 Evaluation

Two evaluation channels. Loss: the final 50-epoch training loss, averaged over the last epoch. Visual sample quality: 64 generated samples per backbone at the same 10-step Euler-integrator sampling schedule, judged by inspection on whether the generated digits are recognisable as digits. The judging is binary in this paper — the cases under study (clean digit / noise / marginal-sharpness difference) are visually obvious at MNIST scale and do not need a graded FID-class metric to distinguish.
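The 10-step Euler sampling schedule is a fixed-step ODE integration of the learned velocity field from t = 0 (noise) to t = 1 (data); a minimal sketch, with `v_fn` standing in for the trained model:

```python
import numpy as np

def euler_sample(v_fn, z0, n_steps=10):
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with fixed-step Euler."""
    z = z0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        z = z + dt * v_fn(z, t)
    return z

# With the trivial field v(z, t) = -z, ten Euler steps contract z0 by 0.9 each step
z0 = np.ones((4, 4))
z1 = euler_sample(lambda z, t: -z, z0, n_steps=10)
```

Per-backbone sampling runs this loop once per sample batch, so per-step time differences multiply directly by the step count.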

3. Decision Rule (Ex Ante)

Set before the experiment ran: the winning backbone is the one with the best speed-quality trade-off, where:

Quality is judged first by sample quality (binary: clean digits or noise) and second by loss (lower is better). Sample quality is the primary signal at low loss values because the loss landscape near the optimum is flat enough that 5–10 % loss differences map onto visually imperceptible quality differences.

Speed is per-step time on the training rig, measured over the last 100 training steps. Memory peak is a secondary speed proxy.

Tie-breaker if all three are quality-equivalent: pick the faster one. Tie-breaker if all three are speed-equivalent: pick the higher-quality one. If a backbone fails the quality bar entirely (produces noise), it is eliminated regardless of loss or speed.
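The rule above is simple enough to encode directly; the field names and result dictionaries below are illustrative stand-ins populated from the numbers reported later in Table 2:

```python
def pick_backbone(candidates):
    """Ex-ante rule: quality bar first, then speed, then loss as tie-breaker."""
    # 1. Eliminate anything that fails the binary quality bar (produces noise).
    alive = [c for c in candidates if c["clean_digits"]]
    if not alive:
        raise ValueError("no backbone passed the quality bar")
    # 2. Survivors are quality-equivalent (quality is binary here), so the
    #    tie-breaker picks the fastest per-step time; loss breaks exact ties.
    return min(alive, key=lambda c: (c["step_ms"], c["loss"]))["name"]

results = [
    {"name": "Pure Mamba",       "clean_digits": True,  "step_ms": 18, "loss": 0.086},
    {"name": "Pure Transformer", "clean_digits": False, "step_ms": 21, "loss": 0.092},
    {"name": "Hybrid",           "clean_digits": True,  "step_ms": 26, "loss": 0.080},
]
```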

4. Results
4.1 Loss curves and per-step times
Table 2 — Final-epoch metrics across the three backbones.
Backbone           Final loss   Per-step time   Peak memory   50-epoch wallclock
Pure Mamba         0.086        18 ms           3.8 GB        ~42 min
Hybrid             0.080        26 ms           4.5 GB        ~60 min
Pure Transformer   0.092        21 ms           4.1 GB        ~49 min

The Hybrid wins the loss number by 7 % over Pure Mamba; Pure Mamba wins the per-step-time number by 31 % over Hybrid. Pure Transformer is third on loss and second on per-step time.
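The two headline percentages follow directly from Table 2:

```python
loss_gap = (0.086 - 0.080) / 0.086   # Hybrid's relative loss advantage: ~7 %
speed_gap = (26 - 18) / 26           # Pure Mamba's relative per-step advantage: ~31 %
```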

4.2 Visual sample quality

Pure Mamba and Hybrid both produce clean recognisable digits across all ten classes after the 10-step Euler sampling. The Hybrid samples are marginally sharper on close inspection — slightly thinner edge transitions, slightly less blurry interior strokes — but the gap is small enough that it would not be visible without side-by-side inspection.

Pure Transformer produces noise. The 64 end-of-training samples are visually indistinguishable from the initial-noise input. The loss number (0.092) is consistent with a model that is still actively training: the loss is decreasing, but the model has not yet entered the regime where it generates meaningful images. The 50-epoch budget is not enough for Pure Transformer at this parameter count.

4.3 The Pure-Transformer "noise" outcome — diagnostic

The Pure-Transformer failure to converge in 50 epochs at 3.2 M parameters is not surprising: transformers trained with flow matching at small parameter budgets are known to converge slowly and to benefit from much longer training runs than Mamba or Hybrid backbones at the same budget. The point of including it in the comparison is not to claim that Pure Transformer cannot do this task (it can, with more compute), but to confirm that at the budget this work has to spend, the Mamba-class backbones are clearly preferred.

4.4 Application of the decision rule
Table 3 — Decision-rule application across the three backbones.
Criterion            Pure Mamba   Hybrid                                      Pure Transformer
Visual quality bar   Pass         Pass (marginal edge)                        Fail (noise)
Speed                Fastest      Slowest                                     Middle
Memory               Lowest       Highest                                     Middle
Decision outcome     Win          Skip — quality edge does not justify cost   Eliminated — fails quality bar
5. Implications for the 3-D Scaling Work

Two observations and one caveat carry forward to MambaFlow3D [4]. First, the per-step speed gap between Pure Mamba and Hybrid (18 ms vs 26 ms — a 1.44 × ratio) is measured at 196 tokens. At SparC3D-scale token counts (100–200 sparse-cube tokens) the absolute time-per-step is similar to MNIST-scale; what changes is that the Hybrid's per-block attention is still O(N²) in token count, so the speed gap widens. The MNIST result is a lower bound on the Pure-Mamba win at SparC3D scale.
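The asymptotic argument can be made concrete with a back-of-envelope per-block FLOP model — a deliberately crude sketch that ignores projection layers and constant factors, and assumes attention cost is dominated by the two N × N matmuls while the selective-SSM scan is linear in N with a d_state-sized recurrence:

```python
def attn_cost(n, d=128):
    """Self-attention FLOPs per block, dominated by the two N x N matmuls."""
    return 2 * n * n * d

def mamba_cost(n, d=128, d_state=64):
    """Selective-SSM FLOPs per block scale linearly in sequence length."""
    return n * d * d_state

# Under this model the attention/Mamba cost ratio grows linearly with N:
r_196 = attn_cost(196) / mamba_cost(196)     # MNIST-scale token count
r_1960 = attn_cost(1960) / mamba_cost(1960)  # 10x the tokens -> 10x the ratio
```

Whatever the constants, the ratio is linear in N under these assumptions, which is why the measured 1.44 × per-step gap at 196 tokens should be read as a floor for larger token counts.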

Second, the loss-vs-quality observation generalises. At low loss values the loss landscape is flat enough that small differences in the loss number do not predict visual quality differences reliably. The downstream MambaFlow3D evaluation will therefore use Chamfer distance + visual inspection rather than loss as the primary quality signal, mirroring the protocol used here.

Third, a caveat: the comparison here used a single random seed per backbone. A multi-seed re-run would tighten the error bars on the loss numbers (0.080 vs 0.086 may be within run-to-run variance) but is unlikely to flip the decision — the decision turns on visual quality and speed, both of which are robust to seed variation in the regime this experiment was in.

6. Open Questions

Three concrete open questions left by this work. (i) Pure Transformer at longer budget. Whether Pure Transformer at 3.2 M parameters reaches Pure-Mamba sample quality at 200 epochs (instead of 50). The hypothesis is yes, with 4 × the compute. Not pursued here because the downstream pick is between Mamba-class backbones. (ii) Multi-seed loss tightening. Re-run all three backbones with 3 seeds each to tighten the loss numbers. The decision is robust to seed variance, but the published loss numbers should be reported with error bars. (iii) Higher-resolution test. Repeat the comparison at CIFAR-class data (32 × 32, 3 channels) before committing to the SparC3D-scale scaling experiment. Not done in this work due to compute budget; the MambaFlow3D Phase-2 on ModelNet10 is the next-up validation step instead.

7. Conclusion

Three backbones, matched parameters, identical training, identical evaluation, ex-ante decision rule. Pure Mamba wins the speed-quality trade-off — indistinguishable from the Hybrid on visual sample quality and 1.44 × faster per step. Pure Transformer fails the quality bar within the 50-epoch budget. The decision is informed by the downstream SparC3D-scale constraint, where the Hybrid's per-block attention becomes more expensive in absolute terms. The Pure-Mamba pick is the backbone choice for the MambaFlow3D scaling work documented in [4].

References
[1] Gu, A., Dao, T. "Mamba: Linear-Time Sequence Modelling with Selective State Spaces." 2023.
[2] Vaswani, A. et al. "Attention Is All You Need." NeurIPS, 2017.
[3] Lipman, Y. et al. "Flow Matching for Generative Modelling." ICLR, 2023.
[4] Jain, A. "MambaFlow3D: A Pure-Mamba + Latent-Flow-Matching Architecture for Single-Image 3-D Generation." Thesis research, Nov 2025. /whitepaper/mambaflow3d
[5] SparC3D authors. "SparC3D: Sparse-Cube 3-D Generation from a Single Image." 2024.
[6] Microsoft Research. "TRELLIS: A Structured Latent Representation for Versatile and High-Quality 3-D Generation." 2024.
[7] Ho, J., Jain, A., Abbeel, P. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020.
[8] LeCun, Y., Cortes, C. "The MNIST Database of Handwritten Digits." 1998.
[9] Loshchilov, I., Hutter, F. "Decoupled Weight Decay Regularization." ICLR, 2019.