Late 2025 saw a wave of single-image-to-3-D Gaussian-splat methods land at top venues, each demonstrating sub-second feed-forward inference and high photo-realism on a single A100. The thesis-line question: should the MambaFlow3D-class generator [2] target Gaussian splats as its native output, or should it stick with the triplane / VDB representations already used by the rest of the thesis line?
This paper surveys the four leading G-Splat methods and compares Gaussian splats against VDB / FVDB and triplane representations to inform the answer.
| Method | Architecture | Input | Inference time | Thesis-line relevance |
|---|---|---|---|---|
| GS-LRM | Transformer (LRM-style) | 2–4 sparse views | 0.23 s on A100 | Highest-quality reference |
| Triplane-Meets-Gaussian | Dual decoder (point-cloud + triplane) | Single view | ~0.5 s | Bridges triplane and G-Splat ecosystems |
| Splatter Image | U-Net pixel → Gaussian | Single view | ~0.3 s | Simplest architecture; one Gaussian per pixel |
| Gamba | Mamba over Gaussian sequence | Single view | ~0.4 s | Direct architectural cousin of MambaFlow3D |
GS-LRM follows the Large Reconstruction Model (LRM) architecture. Input: 2–4 sparse views of a scene with known camera poses. Architecture: a deep transformer (24+ layers, multi-head self-attention) operates over per-view image-token sequences with cross-view attention. Output: a fixed-cardinality set of 3-D Gaussian primitives (position, scale, rotation, colour, and opacity per Gaussian), typically 4 096 Gaussians per scene. Reported inference: 0.23 s on an A100 GPU. Quality is the highest of the four surveyed, but the method requires multiple input views, a stricter setup than the single-image use case the thesis line targets.
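The fixed-cardinality output described above can be pictured as a flat list of identical records. The sketch below is illustrative only: the class name `Gaussian3D`, the helper `make_scene`, the quaternion rotation, and the RGB colour are assumptions made for the example, not details taken from GS-LRM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gaussian3D:
    """One Gaussian primitive with the five attributes listed in the text."""
    position: Tuple[float, float, float]
    scale: Tuple[float, float, float]
    rotation: Tuple[float, float, float, float]  # assumed unit quaternion (w, x, y, z)
    colour: Tuple[float, float, float]           # assumed RGB in [0, 1]
    opacity: float


def make_scene(n_gaussians: int = 4096) -> List[Gaussian3D]:
    """Build a placeholder scene of n identical, identity-rotation Gaussians.

    A real model would predict each attribute; here every value is a
    constant so that only the container shape is demonstrated.
    """
    return [
        Gaussian3D(
            position=(0.0, 0.0, 0.0),
            scale=(1.0, 1.0, 1.0),
            rotation=(1.0, 0.0, 0.0, 0.0),
            colour=(0.5, 0.5, 0.5),
            opacity=1.0,
        )
        for _ in range(n_gaussians)
    ]


scene = make_scene()
print(len(scene))  # 4096
```

Because the cardinality is fixed, downstream code can treat the scene as a dense array of shape (number of Gaussians, attributes) rather than a variable-length structure.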
Triplane-Meets-Gaussian is a dual-decoder bridge architecture. Input: single image. Architecture: a shared encoder produces a structured latent; two parallel decoders generate (a) a triplane feature representation and (b) Gaussian-splat parameters. The dual output lets downstream consumers pick the format that fits their use case. Inference takes ~0.5 s on a consumer GPU. Of the four methods, it is the most architecturally compatible with the thesis line because it produces a triplane output natively, which makes it easy to integrate with the thesis-line universal-intermediate decision [1].
Splatter Image is the simplest of the four architecturally. Input: single image. Architecture: a U-Net predicts one Gaussian per input pixel, consisting of a position offset from the pixel's ray, scale, rotation, colour, and opacity. The result is H × W = 256 × 256 = 65 536 (≈ 65 K) Gaussians per scene, parameterised pixel by pixel. Inference takes ~0.3 s. The "one Gaussian per pixel" parameterisation is elegant, but it produces redundant Gaussians for smooth regions and under-represents fine surface detail.
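The Gaussian count follows directly from the input resolution. A minimal sketch of the arithmetic is given below; the per-attribute widths (quaternion rotation, RGB colour) and the names `ATTRIBUTE_WIDTHS`, `gaussians_per_image`, and `floats_per_scene` are assumptions for illustration, since the text lists the attributes but not their dimensionality.

```python
# Per-Gaussian attribute widths. These are illustrative assumptions:
# the text names the attributes but does not give their sizes.
ATTRIBUTE_WIDTHS = {
    "position_offset": 3,  # xyz offset relative to the pixel's ray
    "scale": 3,            # per-axis scale
    "rotation": 4,         # assumed unit quaternion
    "colour": 3,           # assumed RGB
    "opacity": 1,
}


def gaussians_per_image(height: int, width: int) -> int:
    """One Gaussian is predicted per input pixel."""
    return height * width


def floats_per_scene(height: int, width: int) -> int:
    """Total predicted scalars for one image under the assumed widths."""
    return gaussians_per_image(height, width) * sum(ATTRIBUTE_WIDTHS.values())


n_gaussians = gaussians_per_image(256, 256)
n_floats = floats_per_scene(256, 256)
print(n_gaussians)  # 65536
print(n_floats)     # 917504
```

Under these assumptions the count scales with pixel area regardless of scene content, which is why smooth regions receive as many Gaussians as detailed ones.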
Gamba is the most thesis-line-relevant of the four. Architecture: Mamba state-space blocks replace transformer attention over the Gaussian-token sequence. Specifically, the image is encoded to a sequence of tokens, and the tokens are processed by stacked Mamba blocks (linear time in sequence length, constant memory per token update) rather than by transformer attention (quadratic in sequence length). The output is a Gaussian-primitive sequence in the same shape as GS-LRM's output. Inference takes ~0.4 s.
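The linear-versus-quadratic contrast can be made concrete with a toy recurrence. The sketch below is a deliberately simplified scalar state-space scan with fixed coefficients; it is not Gamba's implementation, and real Mamba blocks use input-dependent (selective) parameters and a parallel scan. The function names are hypothetical.

```python
from typing import List


def linear_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Run h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t.

    A single pass over the sequence; only the scalar state h is carried
    between steps, so per-step memory does not grow with sequence length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def scan_step_count(seq_len: int) -> int:
    """State updates performed by the recurrence: one per token."""
    return seq_len


def attention_pair_count(seq_len: int) -> int:
    """Query-key scores computed by full self-attention: one per token pair."""
    return seq_len * seq_len


ys = linear_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
print(ys)                          # [2.0, 1.0, 0.5]
print(scan_step_count(1024))       # 1024
print(attention_pair_count(1024))  # 1048576
```

The impulse response decays geometrically through the carried state, while the cost comparison shows a 1024-fold gap in elementary operations at 1 024 tokens.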
Gamba's architectural choice — Mamba over transformer for 3-D-generation token-sequence processing — is the same architectural choice MambaFlow3D [2] makes for SparC3D-class sparse-cube tokenisation. The two differ in the output target (Gamba outputs Gaussians; MambaFlow3D outputs sparse cubes) but share the backbone. Gamba's existence as a published paper is the strongest external validation of the Mamba-substitution premise that MambaFlow3D builds on; the MNIST validation in [3] provides internal empirical support.
| Property | Gaussian Splats | VDB / FVDB | Triplane (chosen) |
|---|---|---|---|
| Render time (256² image) | 5–15 ms | ~100–300 ms | 30–80 ms |
| Storage (typical scene) | 50–500 MB | 10–80 MB | 6–12 MB |
| Editability | Move Gaussians directly | Houdini-native (best for procedural) | Edit 2-D feature planes |
| Procedural-pipeline integration | Splat-to-mesh required (lossy) | Native | Triplane → marching cubes → mesh → Houdini |
| Photo-realism | Highest (matches NeRF) | Mid | Mid–high |
| Single-image-to-3-D leaders | GS-LRM, Splatter Image, Gamba | Uncommon (heavy compute) | EG3D, InstantMesh, TRELLIS |
The decision rule, set ex ante: pick the representation with the smallest storage, the best editability, and the cleanest convertibility to both alternatives. Storage and editability favour triplane. Convertibility also favours triplane: triplane → mesh extraction → Houdini (the procedural-pipeline path) is one path, and triplane → density-field volume rendering → G-Splat-class preview is a second path that does not require Gaussian-from-image inference. Conversely, G-Splat → mesh is lossy (Gaussians do not have an explicit surface), and VDB → triplane is approximately a downsample-with-decoder operation but adds a stage.
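One way to make the rule mechanical is a rank-sum over the three criteria. The sketch below is hedged heavily: the storage ranges are taken from the comparison table, but the editability and convertibility ranks are hypothetical ordinal encodings of the qualitative argument above, and the rank-sum aggregation itself is an assumption rather than a procedure stated in the text.

```python
# Storage ranges (MB) come from the comparison table. The editability and
# convertibility ranks (1 = best) are hypothetical encodings of the
# qualitative argument in the text, not measured values.
CANDIDATES = {
    "gaussian_splats": {"storage_mb": (50, 500), "editability_rank": 3, "convertibility_rank": 3},
    "vdb": {"storage_mb": (10, 80), "editability_rank": 2, "convertibility_rank": 2},
    "triplane": {"storage_mb": (6, 12), "editability_rank": 1, "convertibility_rank": 1},
}


def storage_ranks(candidates):
    """Rank candidates by storage-range midpoint, smallest first (rank 1)."""
    def midpoint(name):
        low, high = candidates[name]["storage_mb"]
        return (low + high) / 2

    ordered = sorted(candidates, key=midpoint)
    return {name: position + 1 for position, name in enumerate(ordered)}


def pick_representation(candidates):
    """Return the candidate with the lowest rank sum (assumed aggregation)."""
    s_rank = storage_ranks(candidates)
    totals = {
        name: s_rank[name] + props["editability_rank"] + props["convertibility_rank"]
        for name, props in candidates.items()
    }
    return min(totals, key=totals.get), totals


winner, totals = pick_representation(CANDIDATES)
print(winner)  # triplane
print(totals)  # {'gaussian_splats': 9, 'vdb': 6, 'triplane': 3}
```

Only the storage column is grounded in reported numbers; changing the two assumed rank columns would change the totals, so the output illustrates the rule's shape rather than confirming its conclusion.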
Triplane wins as the universal intermediate. G-Splat is retained as an optional preview-rendering target for use cases where photo-realism matters. VDB is retained as the procedural-pipeline interchange format (Houdini-native).
The thesis-line MambaFlow3D [2] proposes substituting a Pure-Mamba state-space backbone for transformer attention over SparC3D-class sparse-cube tokens. The MNIST validation in [3] provides empirical support at 196 tokens. Gamba — published before this thesis-line work — provides published support for the same architectural substitution applied to Gaussian-sequence tokens. The substitution is therefore not a thesis-line novelty in isolation; what MambaFlow3D contributes is the application to sparse-cube tokens (rather than Gaussian-sequence tokens) and the consumer-GPU speed-up budget targeted at the Maps procedural use case.
Triplane is the universal intermediate for the thesis-line single-image-to-3-D generators. G-Splat is optional preview rendering; VDB is procedural-pipeline integration. Gamba validates the MambaFlow3D Mamba-substitution premise for the architectural class but not for the specific sparse-cube tokenisation MambaFlow3D targets.