Aditya Jain · Research Timeline
Topic 25 · 25 Nov 2025 · Mamba · Flow Matching · MNIST · Validation

Flow Matching ×
{Mamba, Transformer, Hybrid}.

Phase-1 validation before scaling to 3-D — train three backbones (Pure Mamba, Pure Transformer, Hybrid Mamba+Attention) on MNIST under the same latent-flow-matching head and pick the one that survives the speed-quality trade-off. Pure Mamba won visually despite a marginally worse loss number (0.086 vs 0.080 for the Hybrid). The Transformer produced noise. Verdict: proceed with Pure Mamba as the backbone for the MambaFlow3D scaling work (Topic 26).

00 — Motivation

Validate the architecture choice before spending 3-D compute.

The thesis line wants a fast single-image-to-3-D generator. The architectural premise is that Mamba + flow matching beats the standard transformer + DDPM stack on inference speed without giving up too much quality at the 3-D scale. The premise is plausible — Mamba is linear-time, flow matching has fewer sampling steps than DDPM — but plausible is not enough to spend the rented-GPU hours required to validate at SparC3D scale.

Topic 25 is the cheap-but-honest validation step. Train the three architectural candidates — Pure Mamba, Pure Transformer, Hybrid Mamba+Attention — all with the same latent-flow-matching head, all on MNIST, all with matched parameter count. Compare on loss, sampling quality, and per-step speed. Pick the winner. Then spend the 3-D compute on the winner.

The trap to avoid is picking on loss alone. Flow-matching loss measures the average velocity error over the trajectory — small differences in this number can hide large differences in sample quality, particularly when the loss landscape has flat regions near the optimum. The MNIST validation here was set up to look at both the loss curve and the generated samples, and to decide on the two together rather than on the loss number alone.

What it informs
The result of this experiment directly determines the backbone for Topic 26 (MambaFlow3D scaling) and indirectly for the Topic-27 JiT reproduction (which is the transformer baseline). If Pure Transformer had won here, MambaFlow3D would have been abandoned. If Hybrid had won on both metrics, the 3-D scaling work would have carried an attention block per Mamba layer at significantly higher memory cost. Pure Mamba winning is the cheapest, fastest 3-D scaling path.
01 — Architectures

Three backbones, same head, same data.

The three backbones swap only the sequence-processing body. The pre-encoder (image-patch → token sequence), the latent-flow-matching head, and the decoder are identical across all three. MNIST images are 28 × 28, patchified into 196 tokens (2 × 2-pixel patches, 4 pixels each) at embedding dim 128. Each backbone consumes the token sequence and returns a same-shape sequence; the FM head takes the sequence and predicts the velocity.

Backbone          | Body                                           | Param count | Loss (final) | Sample quality (visual)
Pure Mamba        | 8 × Mamba block · d_model 128 · d_state 64     | ~3.1 M      | 0.086        | Clean digits, recognisable
Hybrid Mamba+Attn | 4 × (Mamba + Attention) · interleaved          | ~3.4 M      | 0.080        | Clean digits, marginally cleaner
Pure Transformer  | 8 × Transformer block · d_model 128 · 8 heads  | ~3.2 M      | 0.092        | Noise — failed to converge in budget

Parameter counts are matched to within ±10 % so that the comparison is not biased by capacity. Training schedule is identical across the three (AdamW, lr 1×10⁻⁴, gradient clip at 1.0, 50 epochs, batch 256). The training rig is the same single RTX 3060 12 GB on Vast.ai used for the MambaFlow3D Phase-2 work.
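As a concrete reference, here is a minimal sketch of the shared training step under that schedule. It assumes the common linear (rectified-flow style) interpolation path for the flow-matching target; the exact FM variant, function names, and the capacity check at the end are illustrative, not the original training code.

```python
import torch
import torch.nn.functional as F

def fm_training_step(model, optimizer, x1):
    """One flow-matching training step. Assumes a linear noise->data
    interpolation path; the exact FM variant used in the runs is an assumption."""
    x0 = torch.randn_like(x1)                       # noise endpoint
    t = torch.rand(x1.size(0), device=x1.device)    # per-sample time in [0, 1]
    tt = t.view(-1, 1, 1, 1)
    z_t = (1 - tt) * x0 + tt * x1                   # point on the noise -> data path
    v_target = x1 - x0                              # constant velocity along that path
    loss = F.mse_loss(model(z_t, t), v_target)      # backbone + FM head predicts v̂(z_t, t)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clip at 1.0
    optimizer.step()
    return loss.item()

# Schedule from above, identical for all three runs:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # 50 epochs, batch 256
# Capacity check behind the ±10 % parameter matching:
# n_params = sum(p.numel() for p in model.parameters())
```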

Pipeline

Latent flow-matching, three swappable backbones.

The pipeline is the same across all three runs — only the box labelled "BACKBONE" changes among Pure Mamba (PM), Pure Transformer (PT), and Hybrid Mamba+Attention (HMA).

MNIST 28 × 28 (B, 1, 28, 28)
  → Patch embed → (B, 196, 128)
  → BACKBONE {PM / PT / HMA} → (B, 196, 128)
  → FM head v̂(z_t, t)
  → Un-patch → (B, 1, 28, 28)

PM  · 8 × Mamba (d_model=128, d_state=64)
PT  · 8 × Transformer (d_model=128, h=8)
HMA · 4 × (Mamba + Attention), interleaved
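For illustration, a minimal sketch of how the shared pipeline wraps a swappable body, assuming 2 × 2 patches, additive time conditioning, and a linear FM head. Class names, the time-conditioning scheme, and layer choices are assumptions, not the original code; the only contract is that the body maps (B, 196, 128) → (B, 196, 128).

```python
import torch
import torch.nn as nn

class FlowPipeline(nn.Module):
    """Shared patch-embed + FM-head pipeline around a swappable body (PM / PT / HMA).
    Illustrative sketch only; the body maps (B, 196, 128) -> (B, 196, 128)."""

    def __init__(self, body, d_model=128, patch=2):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(patch * patch, d_model)      # 2x2 patch -> 128-d token
        self.time_mlp = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                      nn.Linear(d_model, d_model))
        self.body = body                                    # BACKBONE {PM / PT / HMA}
        self.head = nn.Linear(d_model, patch * patch)       # FM head: token -> patch velocity

    def forward(self, z_t, t):
        B, _, H, W = z_t.shape                              # (B, 1, 28, 28)
        p, g = self.patch, H // self.patch                  # patch size 2, grid size 14
        tok = z_t.unfold(2, p, p).unfold(3, p, p)           # (B, 1, 14, 14, 2, 2)
        tok = tok.reshape(B, g * g, p * p)                  # (B, 196, 4)
        h = self.embed(tok) + self.time_mlp(t.view(B, 1, 1))   # additive time conditioning
        h = self.body(h)                                    # backbone: (B, 196, 128)
        v = self.head(h).reshape(B, 1, g, g, p, p)          # (B, 1, 14, 14, 2, 2)
        v = v.permute(0, 1, 2, 4, 3, 5)                     # re-interleave patch pixels
        return v.reshape(B, 1, H, W)                        # un-patch -> (B, 1, 28, 28)

# e.g. an 8-layer Pure-Transformer body (the PT candidate), for illustration:
# pt_body = nn.TransformerEncoder(
#     nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True), num_layers=8)
# model = FlowPipeline(pt_body)
```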
02 — Training Log

Loss curves and per-step times across the three runs.

All three runs trained for 50 epochs at batch 256 on the same MNIST train split (60 K images). Pure Mamba and Hybrid both converged smoothly. Pure Transformer's loss declined slowly but the generated samples remained noise — the model was not learning the velocity field meaningfully at this parameter count and step budget.

Run               | Final loss | Per-step time | Memory peak | 50-epoch wallclock | Sample quality
Pure Mamba        | 0.086      | ~18 ms        | ~3.8 GB     | ~42 min            | Clean digits
Hybrid Mamba+Attn | 0.080      | ~26 ms        | ~4.5 GB     | ~60 min            | Clean digits, marginally cleaner
Pure Transformer  | 0.092      | ~21 ms        | ~4.1 GB     | ~49 min            | Noise — failed to converge

Per-step time is the per-batch forward+backward time, measured over the last 100 steps of training. Pure Mamba's 18 ms vs Pure Transformer's 21 ms is a 1.17 × speed-up — modest at MNIST scale (only 196 tokens). The 2–3 × speed-up at 3-D scale (Topic 26 budget) relies on this ratio widening as token count grows, which is the theoretical prediction from [1].
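For reference, a sketch of how per-step numbers of this kind are typically measured on CUDA (synchronise before reading the clock, since kernels launch asynchronously). The helper below is illustrative, not the original benchmarking code.

```python
import time
import torch

def time_step(step_fn):
    """Wall-clock one forward+backward step in milliseconds.
    Illustrative helper; synchronises around the step so async CUDA work is counted."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    loss = step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return loss, (time.perf_counter() - t0) * 1e3

# Table numbers: mean of time_step(...) over the last 100 training steps, and
# torch.cuda.max_memory_allocated() / 2**30 for the peak-memory column (GB).
```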

Core Finding

Loss = 0.080 vs 0.086 vs 0.092.
Samples = night and day.

The Hybrid backbone's loss edge over Pure Mamba (0.080 vs 0.086) is a ~7 % relative difference on a metric that maps only weakly onto visual quality at low loss values. Visual inspection showed both Pure Mamba and Hybrid producing clean, recognisable digits with no obvious quality gap; the Hybrid samples were marginally sharper on close inspection, but not by enough to justify the Hybrid's 1.44 × per-step cost. The Pure Transformer makes the point even more starkly: its final loss of 0.092 is within ~15 % of the Hybrid's, yet its samples were noise. The lesson: at low loss values, sample quality is the primary signal, not the loss number.
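For context on how the inspected samples are produced from a trained backbone, here is a minimal sampler sketch. It assumes a fixed-step Euler integrator with the 10 steps shown in the demo further down; the actual solver and step count used for the figures may differ.

```python
import torch

@torch.no_grad()
def generate(model, n=16, steps=10, device="cuda"):
    """Integrate the learned velocity field from noise to data.
    Assumes a plain Euler scheme; step count matches the 0-10 demo slider."""
    z = torch.randn(n, 1, 28, 28, device=device)      # shared noise seed
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((n,), i * dt, device=device)   # current time on the path
        z = z + dt * model(z, t)                      # Euler step along v̂(z_t, t)
    return z                                          # generated digit batch
```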

03 — Decision

Proceed with Pure Mamba. The Hybrid's edge does not justify the cost.

The decision rule for the validation, set in advance: pick the backbone with the best speed-quality trade-off, where "quality" is defined as visual sample quality and "speed" is per-step time. The Hybrid is marginally ahead on quality and clearly behind on speed. Pure Transformer is behind on quality entirely. Pure Mamba is in front on speed and indistinguishable from the Hybrid on quality.

Criterion                           | Pure Mamba  | Hybrid                                | Pure Transformer
Visual sample quality               | ✓           | ✓ (marginal edge)                     | ✗ (noise)
Per-step speed                      | ✓ (fastest) | ✗ (1.44× slower)                      | ✗ (1.17× slower)
Memory peak                         | ✓ (lowest)  | ✗ (highest)                           | ~ mid
Compatible with 3-D scaling premise | ✓           | partial (attention quadratic at 3-D)  | ✗
Verdict                             | Proceed     | Skip                                  | Skip

The decision is also informed by the downstream-application constraint: at SparC3D-scale token counts (well beyond the 196 tokens used here) the Hybrid's per-block attention is still O(N²), so the speed gap widens further at 3-D scale than it is at MNIST scale. The Pure Mamba win at 196 tokens is therefore a lower bound on the win at SparC3D scale.

04 — Inference Trip-up

Checkpoint key-shape mismatch on the inference script.

The single bring-up bug in this topic: the trained Pure Mamba flow model produced a checkpoint that the initial inference script could not load — FileNotFoundError: checkpoints/flow_model_final.pt followed by, after fixing the path, a checkpoint-key-structure mismatch. The root cause was that the training loop saved the checkpoint under a different key layout than the inference loader expected (the loader expected a bare state_dict at the top level; the trainer saved a wrapper dict with a 'model' key inside). The fix was a robust loader that inspects the checkpoint's top-level keys and routes through either the bare or wrapped layout. Carried forward as generate_samples_robust.py.

Generalisable lesson
Training-time and inference-time scripts that are written separately tend to drift in their checkpoint-format expectations. The defensive habit: in the inference loader, always inspect the top-level keys before load_state_dict, and handle both bare and wrapped formats. Carried forward to MambaFlow3D and the JiT reproduction work.
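A sketch of that defensive loader, along the lines of the fix described above. The wrapped layout's 'model' key and the default path come from the bug description; the function name is illustrative (the carried-forward script is generate_samples_robust.py).

```python
import torch

def load_flow_checkpoint(model, path="checkpoints/flow_model_final.pt"):
    """Accept either a bare state_dict or a trainer-style wrapper dict with the
    state_dict stored under a 'model' key. Sketch of the fix; names are illustrative."""
    ckpt = torch.load(path, map_location="cpu")
    if isinstance(ckpt, dict) and "model" in ckpt:
        state_dict = ckpt["model"]       # wrapped layout saved by the training loop
    else:
        state_dict = ckpt                # bare state_dict the old loader expected
    model.load_state_dict(state_dict)
    return model
```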

Interactive Demo · Live

Step the flow trajectory for a chosen MNIST digit under the three backbones side-by-side. The left pane is the shared noise seed; the centre pane is the trajectory step indicator; the right pane shows the rendered samples from each backbone at the current step. Pure Mamba and Hybrid converge to clean digits; Pure Transformer stays noisy throughout.

Panels: 01 — Noise seed (click to re-seed; digit 3) · 02 — Trajectory step (0–10) · 03 — Samples (PM / HMA / PT)

Full Technical Paper

arXiv-format write-up · 3-architecture MNIST validation · matched parameter count · decision rule · downstream implications

Related Thesis Chapters
MambaFlow3D — Sparse-Voxel 3-D Generation
Direct downstream consumer of this validation. The Pure-Mamba win here is the architectural premise on which the Phase-2 ModelNet10 bring-up scales.
JiT Diffusion — Consumer-GPU Training
The transformer baseline the Mamba speed-up is measured against. Same hardware substrate; different backbone choice.
Hexplane Autoencoder
Parallel architecture experiment from the same thesis line — different objective (reconstruction vs generation), same consumer-hardware constraint.
Appendix — Raw Materials
Transcripts & Source References
Restricted Access