Phase-1 validation before scaling to 3-D — train three backbones (Pure Mamba, Pure Transformer, Hybrid Mamba+Attention) on MNIST under the same latent-flow-matching head and pick the one that survives the speed-quality trade-off. Pure Mamba won visually despite a marginally worse loss number (0.086 vs 0.080 for the Hybrid); Pure Transformer produced only noise. Verdict: proceed with Pure Mamba as the backbone for the MambaFlow3D scaling work (Topic 26).
The thesis line wants a fast single-image-to-3-D generator. The architectural premise is that Mamba + flow matching beats the standard transformer + DDPM stack on inference speed without giving up too much quality at the 3-D scale. The premise is plausible — Mamba is linear-time, flow matching has fewer sampling steps than DDPM — but plausible is not enough to spend the rented-GPU hours required to validate at SparC3D scale.
Topic 25 is the cheap-but-honest validation step. Train the three architectural candidates — Pure Mamba, Pure Transformer, Hybrid Mamba+Attention — all with the same latent-flow-matching head, all on MNIST, all with matched parameter count. Compare on loss, sampling quality, and per-step speed. Pick the winner. Then spend the 3-D compute on the winner.
The trap to avoid is picking on loss alone. Flow-matching loss measures the average velocity error over the trajectory — small differences in this number can hide large differences in sample quality, particularly when the loss landscape has flat regions near the optimum. The MNIST validation here was set up to look at both the loss curve and the generated samples, and to make the call on the combination of the two.
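For concreteness, a minimal sketch of the conditional flow-matching objective in that sense: the straight-line path from noise to data has constant velocity, and the loss is the mean squared error of the predicted velocity averaged over random times. The model call signature here is an assumption, not the repo's exact API.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1):
    """Conditional flow-matching loss on a clean batch x1 (illustrative sketch)."""
    x0 = torch.randn_like(x1)                       # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)   # one random time per sample
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))        # broadcast t over the data dims
    xt = (1.0 - t_) * x0 + t_ * x1                  # point on the straight-line path
    v_target = x1 - x0                              # constant velocity of that path
    v_pred = model(xt, t)                           # assumed signature: model(x_t, t)
    return F.mse_loss(v_pred, v_target)             # average velocity error
```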
The three backbones swap only the sequence-processing body. The pre-encoder (image-patch → token sequence), the latent-flow-matching head, and the decoder are identical across all three. MNIST images are 28 × 28, flattened into 196 tokens (2 × 2 patches, 4 pixels each) at embedding dim 128. Each backbone consumes the token sequence and returns a same-shape sequence; the FM head takes the sequence and predicts the velocity.
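A minimal sketch of that shared pre-encoder with the numbers above; the class name and internals are illustrative, not the repo's.

```python
import torch.nn as nn

class PatchPreEncoder(nn.Module):
    """28 x 28 MNIST image -> 196 tokens of dim 128 via 2 x 2 patches (sketch)."""
    def __init__(self, patch=2, d_model=128):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size=patch, stride=patch)  # cut image into patches
        self.proj = nn.Linear(patch * patch, d_model)             # 4 pixels -> 128 dims

    def forward(self, img):                  # img: (B, 1, 28, 28)
        patches = self.unfold(img)           # (B, 4, 196)
        tokens = patches.transpose(1, 2)     # (B, 196, 4)
        return self.proj(tokens)             # (B, 196, 128)
```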
| Backbone | Body | Param count | Loss (final) | Sample quality (visual) |
|---|---|---|---|---|
| Pure Mamba | 8 × Mamba block · d_model 128 · d_state 64 | ~3.1 M | 0.086 | Clean digits, recognisable |
| Hybrid Mamba+Attn | 4 × (Mamba + Attention) · interleaved | ~3.4 M | 0.080 | Clean digits, marginally cleaner |
| Pure Transformer | 8 × Transformer block · d_model 128 · 8 heads | ~3.2 M | 0.092 | Noise — failed to converge in budget |
Parameter counts are matched to within ±10 % so that the comparison is not biased by capacity. Training schedule is identical across the three (AdamW, lr 1×10⁻⁴, gradient clip at 1.0, 50 epochs, batch 256). The training rig is the same single RTX 3060 12 GB on Vast.ai used for the MambaFlow3D Phase-2 work.
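A minimal loop matching that schedule, reusing the flow_matching_loss sketch above and leaving out the pre-encoder/decoder plumbing; build_model and everything else here is illustrative rather than the actual training script.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda"
train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=256, shuffle=True)                        # 60 K images, batch 256

model = build_model().to(device)                         # hypothetical: encoder + backbone + FM head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(50):                                  # identical schedule for PM / PT / HMA
    for x1, _ in train_loader:
        loss = flow_matching_loss(model, x1.to(device))  # sketch above
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip at 1.0
        optimizer.step()
```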
The pipeline is the same across all three runs — only the box labelled "BACKBONE" changes between Pure Mamba (PM), Pure Transformer (PT), or Hybrid Mamba+Attention (HMA).
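In code, the swap amounts to a small factory; a sketch assuming the mamba-ssm package's Mamba block and a stock nn.TransformerEncoderLayer standing in for the attention block (the actual blocks may wrap these with extra norms and residuals):

```python
import torch.nn as nn
from mamba_ssm import Mamba   # selective-SSM block from the mamba-ssm package

def build_backbone(kind, d_model=128, depth=8, d_state=64, n_heads=8):
    """Return the sequence body for PM, PT, or HMA; everything else is shared."""
    mamba = lambda: Mamba(d_model=d_model, d_state=d_state)
    attn = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    if kind == "PM":        # Pure Mamba: 8 x Mamba block
        blocks = [mamba() for _ in range(depth)]
    elif kind == "PT":      # Pure Transformer: 8 x attention block
        blocks = [attn() for _ in range(depth)]
    elif kind == "HMA":     # Hybrid: 4 x (Mamba + Attention), interleaved
        blocks = [b for _ in range(depth // 2) for b in (mamba(), attn())]
    else:
        raise ValueError(f"unknown backbone: {kind}")
    return nn.Sequential(*blocks)   # consumes (B, 196, 128), returns the same shape
```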
All three runs trained for 50 epochs at batch 256 on the same MNIST train split (60 K images). Pure Mamba and Hybrid both converged smoothly. Pure Transformer's loss declined slowly but the generated samples remained noise — the model was not learning the velocity field meaningfully at this parameter count and step budget.
| Run | Final loss | Per-step time | Memory peak | 50-epoch wallclock | Sample quality |
|---|---|---|---|---|---|
| Pure Mamba | 0.086 | ~18 ms | ~3.8 GB | ~42 min | Clean digits |
| Hybrid Mamba+Attn | 0.080 | ~26 ms | ~4.5 GB | ~60 min | Clean digits, marginally cleaner |
| Pure Transformer | 0.092 | ~21 ms | ~4.1 GB | ~49 min | Noise — failed to converge |
Per-step time is the per-batch forward+backward time, measured over the last 100 steps of training. Pure Mamba's 18 ms vs Pure Transformer's 21 ms is a 1.17 × speed-up — modest at MNIST scale (only 196 tokens). The 2–3 × speed-up at 3-D scale (Topic 26 budget) relies on this ratio widening as token count grows, which is the theoretical prediction from [1].
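For reference, a sketch of how a per-step number like this can be measured; the synchronisation calls matter because CUDA kernels run asynchronously. The function name and signature are illustrative.

```python
import time
import torch

def step_time_ms(model, batch):
    """Wallclock of one forward+backward pass in ms (averaged over many steps in practice)."""
    torch.cuda.synchronize()                     # flush pending GPU work before timing
    t0 = time.perf_counter()
    loss = flow_matching_loss(model, batch)      # sketch above
    model.zero_grad(set_to_none=True)
    loss.backward()
    torch.cuda.synchronize()                     # wait for the backward kernels to finish
    return (time.perf_counter() - t0) * 1e3
```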
Loss = 0.080 vs 0.086 vs 0.092, all in a narrow band.
Samples = night and day.
The Hybrid backbone's marginally better loss number (0.080 vs Pure Mamba's 0.086) is a 7 % difference on a metric that maps onto visual quality only weakly at low loss values. Visual inspection of the generated samples showed both Pure Mamba and Hybrid producing clean, recognisable digits with no obvious quality gap; the Hybrid samples were marginally sharper on close inspection, but the difference was far too small to justify the Hybrid's 1.44 × per-step cost. The lesson: at low loss values, sample quality is the primary signal, not the loss number.
The decision rule for the validation, set in advance: pick the backbone with the best speed-quality trade-off, where "quality" is defined as visual sample quality and "speed" is per-step time. The Hybrid is marginally ahead on quality and clearly behind on speed. Pure Transformer is behind on quality entirely. Pure Mamba is in front on speed and indistinguishable from the Hybrid on quality.
| Criterion | Pure Mamba | Hybrid | Pure Transformer |
|---|---|---|---|
| Visual sample quality | ✓ | ✓ (marginal edge) | ✗ |
| Per-step speed | ✓ (fastest) | ✗ (1.44× slower) | ✗ (1.17× slower) |
| Memory peak | ✓ (lowest) | ✗ (highest) | ~mid |
| Compatible with 3-D scaling premise | ✓ | partial (attention quadratic at 3-D) | ✗ |
| Verdict | Proceed | Skip | Skip |
The decision is also informed by the downstream-application constraint: at SparC3D-scale token counts, well beyond the 196 tokens here, the Hybrid's per-block attention is still O(N²), so the speed gap widens further at 3-D scale than it is at MNIST scale. The Pure Mamba win at 196 tokens is therefore a lower bound on the win at SparC3D scale.
The single bring-up bug in this topic: the trained Pure Mamba flow model produced a checkpoint that the initial inference script could not load — first a FileNotFoundError on checkpoints/flow_model_final.pt, then, after fixing the path, a checkpoint-key-structure mismatch. The root cause was that the training loop saved the checkpoint under a different key layout than the inference loader expected (the loader expected a bare state_dict at the top level; the trainer saved a wrapper dict with the weights under a 'model' key). The fix was a robust loader that inspects the checkpoint's top-level keys before calling load_state_dict and handles both the bare and wrapped layouts. Carried forward as generate_samples_robust.py, and reused in MambaFlow3D and the JiT reproduction work.
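A sketch of the loader shape described above; the checkpoint path and the 'model' key come from the bug itself, while the function name and any other details of generate_samples_robust.py are illustrative.

```python
import torch

def load_flow_checkpoint(model, path="checkpoints/flow_model_final.pt"):
    """Load weights whether the file is a bare state_dict or a wrapper dict."""
    ckpt = torch.load(path, map_location="cpu")
    if isinstance(ckpt, dict) and "model" in ckpt:   # wrapped layout from the trainer
        state_dict = ckpt["model"]
    else:                                            # bare state_dict layout
        state_dict = ckpt
    model.load_state_dict(state_dict)
    return model
```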
Step the flow trajectory for a chosen MNIST digit under the three backbones side-by-side. The left pane is the shared noise seed; the centre pane is the trajectory step indicator; the right pane shows the rendered samples from each backbone at the current step. Pure Mamba and Hybrid converge to clean digits; Pure Transformer stays noisy throughout.
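Behind that viewer sits nothing more exotic than Euler integration of the learned velocity field; a sketch, with step count, shapes, and the model call signature all illustrative:

```python
import torch

@torch.no_grad()
def sample_trajectory(model, n_steps=50, n_samples=16, seq_len=196, d_model=128,
                      device="cuda", seed=0):
    """Euler-integrate the velocity field from a shared noise seed, keeping every step."""
    torch.manual_seed(seed)                                  # shared noise seed across backbones
    x = torch.randn(n_samples, seq_len, d_model, device=device)
    states, dt = [x], 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((n_samples,), i * dt, device=device)
        x = x + dt * model(x, t)        # x_{t+dt} = x_t + dt * v_theta(x_t, t)
        states.append(x)
    return states                       # decode each state to pixels downstream
```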
arXiv-format write-up · 3-architecture MNIST validation · matched parameter count · decision rule · downstream implications