Image-to-3-D without 3-D supervision. Flow-SDF replaces the deterministic CNN encoder of Mini SDF-SRN with a rectified-flow model (a 6-layer MLP velocity network with AdaLN conditioning): the SDF latent code is now generated through 8 steps of iterative denoising, conditioned on the input image. The entire system (image conditioner, rectified flow, SDF decoder) trains end-to-end through a differentiable renderer using only 2-D silhouette supervision, with no 3-D ground truth at any point. It is architecturally equivalent to Hunyuan3D and TRELLIS, differing only in scale, and it nearly matches the CNN baseline's reconstruction quality (silhouette loss 0.0053 vs 0.0050) while opening the door to true generative 3-D.
Mini SDF-SRN (Topic 43) proved you can learn 3-D from single-view silhouettes alone — but its CNN encoder is deterministic, mapping one image to one code in a single forward pass. That is a dead end for generation: no stochasticity, no conditioning on text or partial views, no iterative refinement. Modern image-generation systems (Stable Diffusion 3) and state-of-the-art 3-D systems (Hunyuan3D 2.0, TRELLIS 2) all replaced deterministic encoders with iterative denoising — flow-based diffusion transformers conditioned on image features.
Flow-SDF asks the precise question: can we swap the CNN encoder for a rectified-flow model while keeping the one property that matters — requiring no 3-D supervision? The answer is yes. The conventional systems build their 3-D latent space in a separate first stage, trained on millions of 3-D meshes; Flow-SDF builds it jointly, from 2-D images only, with the SDF decoder and the flow co-evolving through the differentiable renderer.
| Component | Spec | Role |
|---|---|---|
| Image conditioner | CNN — 4 conv layers + FC, 2.9 M params | Processes the input image into a conditioning embedding that guides the flow's denoising |
| Rectified flow velocity net | 6-layer MLP with AdaLN timestep conditioning, 3.9 M params | Predicts a velocity field v(z_t, t, cond) defining straight-line paths from noise to clean latent codes |
| SDF decoder | 5-layer MLP with skip at layer 3, 330 K params (DeepSDF architecture) | Latent code + 3-D coordinate → scalar SDF value. Never trained on 3-D data |
| Differentiable renderer | Fixed ray-marching, not learned — soft silhouette via sigmoid(−min_sdf / temp) | The bridge — provides continuous gradients from 2-D pixels to 3-D SDF values |
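The soft-silhouette formula in the renderer row can be illustrated with a minimal sketch. Only `sigmoid(−min_sdf / temp)` comes from the table; the orthographic camera, the analytic sphere standing in for the learned decoder, the 16 × 16 resolution, and the temperature value are assumptions made for illustration.

```python
import math

def sphere_sdf(x, y, z, radius=0.5):
    """Analytic sphere SDF; a stand-in for the learned SDF decoder (assumption)."""
    return math.sqrt(x * x + y * y + z * z) - radius

def soft_silhouette(sdf, res=16, depth_samples=48, temp=0.05):
    """Return a res x res soft mask with value sigmoid(-min_sdf / temp) per pixel."""
    image = []
    for row in range(res):
        y = -1.0 + 2.0 * (row + 0.5) / res
        line = []
        for col in range(res):
            x = -1.0 + 2.0 * (col + 0.5) / res
            # Fixed-step ray marching along z (orthographic rays, an assumption).
            min_sdf = min(
                sdf(x, y, -1.0 + 2.0 * (k + 0.5) / depth_samples)
                for k in range(depth_samples)
            )
            # sigmoid(-min_sdf / temp), written as 1 / (1 + exp(min_sdf / temp)).
            line.append(1.0 / (1.0 + math.exp(min_sdf / temp)))
        image.append(line)
    return image

mask = soft_silhouette(sphere_sdf)
```

Pixels whose ray passes inside the surface (negative minimum SDF) approach 1 and pixels far outside approach 0. The sigmoid keeps the transition smooth, which is what lets gradients flow from pixels back to SDF values.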
The rectified flow uses AdaLN — the same conditioning technique as in the Diffusion Transformer — to modulate layer-norm parameters by the diffusion timestep, so the network behaves differently at each stage of denoising. At inference, starting from scaled Gaussian noise, the velocity field is integrated with 8 Euler steps to produce a clean SDF latent code.
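The 8-step Euler integration can be sketched as follows. This is a toy illustration rather than the repository's code: `toy_velocity` is a hypothetical stand-in for the trained velocity network, the conditioning is simply the target code itself, and the noise scale of 0.1 is an assumed value (the text says the noise is scaled but gives no factor).

```python
import random

def euler_sample(velocity, cond, dim, steps=8, noise_scale=0.1, seed=0):
    """Integrate dz/dt = velocity(z, t, cond) from t=0 (noise) to t=1 (clean code)."""
    rng = random.Random(seed)
    # Start from scaled Gaussian noise; noise_scale is a hypothetical value.
    z = [noise_scale * rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(z, t, cond)
        z = [zi + dt * vi for zi, vi in zip(z, v)]  # one Euler step
    return z

def toy_velocity(z, t, cond):
    """Ideal straight-line field pointing at the code in cond (stand-in for the net)."""
    return [(ci - zi) / (1.0 - t) for ci, zi in zip(cond, z)]

target = [0.3, -0.7, 1.2, 0.0]
code = euler_sample(toy_velocity, target, dim=4)
```

With an exactly straight velocity field, Euler integration lands on the target for any step count. The learned field is only approximately straight, which is why the number of steps still matters in practice.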
Phase 1: flow distillation. A standard CNN-based SDF-SRN model (Mini SDF-SRN, Topic 43) is first trained to convergence. Its trained encoder generates target latent codes for all training images, and the rectified flow is then trained to reproduce those codes with two complementary losses: a velocity loss (MSE between predicted and target velocity at random timesteps, which trains local accuracy) and a sampling loss (MSE between the full 8-step Euler-integrated output and the target, which trains end-to-end integration quality). Phase 1 is fast, about 30 seconds in total, because no rendering is involved.
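The two distillation losses can be written out in a small sketch. It assumes the common rectified-flow convention z_t = (1 − t)·noise + t·target with target velocity (target − noise); the source does not state its exact parameterisation. The helper names and the toy velocity fields are hypothetical, and the sketch computes loss values only, whereas real training would backpropagate through them.

```python
import random

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def velocity_loss(velocity, z_noise, z_target, cond, t):
    """Local loss: predicted vs straight-line velocity at one timestep t."""
    z_t = [(1.0 - t) * n + t * c for n, c in zip(z_noise, z_target)]
    v_target = [c - n for n, c in zip(z_noise, z_target)]
    return mse(velocity(z_t, t, cond), v_target)

def sampling_loss(velocity, z_noise, z_target, cond, steps=8):
    """End-to-end loss: 8-step Euler output vs the target latent code."""
    z, dt = list(z_noise), 1.0 / steps
    for i in range(steps):
        v = velocity(z, i * dt, cond)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return mse(z, z_target)

def ideal_velocity(z, t, cond):
    # Hypothetical perfect field: cond is taken to be the target code itself.
    return [(c - zi) / (1.0 - t) for c, zi in zip(cond, z)]

def zero_velocity(z, t, cond):
    # Untrained-like field that never moves the latent.
    return [0.0 for _ in z]

rng = random.Random(0)
z_target = [0.3, -0.7, 1.2, 0.0]
z_noise = [0.1 * rng.gauss(0.0, 1.0) for _ in z_target]
t = rng.uniform(0.0, 0.9)  # random timestep, kept away from t = 1

good = (velocity_loss(ideal_velocity, z_noise, z_target, z_target, t),
        sampling_loss(ideal_velocity, z_noise, z_target, z_target))
bad = (velocity_loss(zero_velocity, z_noise, z_target, z_target, t),
       sampling_loss(zero_velocity, z_noise, z_target, z_target))
```

The ideal field drives both losses to roughly zero, while the zero field is penalised by both. The velocity loss checks a single point on the path; the sampling loss checks where the whole 8-step trajectory ends up.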
Phase 2: end-to-end fine-tuning. All parameters are unfrozen. The full pipeline runs image → conditioner → flow (8 steps) → latent → SDF decoder → differentiable renderer → silhouette, with a binary cross-entropy loss against the ground-truth silhouette. Phase 2 is slow (~14 s/epoch) because each training step requires 8 flow denoising passes followed by SDF evaluation at 196,608 spatial points (64 × 64 pixels × 48 depth samples). Over 200 epochs, the silhouette loss falls from ~1.5 to 0.0053.
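The per-image evaluation count and the silhouette loss can be checked with a short sketch. The mean reduction over pixels and the clamping constant `eps` are assumptions; the source names only binary cross-entropy against the ground-truth silhouette.

```python
import math

# SDF evaluations per rendered image: pixels times depth samples along each ray.
points_per_image = 64 * 64 * 48  # 196,608, as quoted in the text

def silhouette_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a soft mask and a binary mask."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count

target = [[0.0, 1.0], [1.0, 0.0]]
uninformed = [[0.5, 0.5], [0.5, 0.5]]    # every pixel undecided
confident = [[0.01, 0.99], [0.99, 0.01]]  # close to the target mask

loss_uninformed = silhouette_bce(uninformed, target)  # equals ln 2
loss_confident = silhouette_bce(confident, target)
```

An all-0.5 mask gives a loss of ln 2 ≈ 0.693, so a starting loss near 1.5 indicates predictions that are confidently wrong on many pixels.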
The project runs in three stages. Stage 1 is the CNN-encoder silhouette baseline — the Mini SDF-SRN reimplementation (Topic 43). Stage 2 replaces the CNN encoder with the 8-step rectified flow. Stage 3 adds differentiable RGB rendering with Lambertian shading on top. The figures below are the actual epoch outputs from the repository.




Stage 3 adds a differentiable RGB renderer with Lambertian shading. The intuition: a sphere and a cube can have similar silhouettes but very different shading patterns, so RGB supervision provides dense surface-normal information that silhouette-only supervision cannot.
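The shading intuition can be made concrete with a minimal sketch. It assumes normals are taken from central finite differences of the SDF, with a single unit-length directional light; the source specifies only Lambertian shading. The analytic sphere and box stand in for decoded shapes.

```python
import math

def sphere_sdf(p, radius=0.5):
    return math.sqrt(sum(c * c for c in p)) - radius

def box_sdf(p, half=0.4):
    """Axis-aligned cube SDF with half-extent `half`."""
    q = [abs(c) - half for c in p]
    outside = math.sqrt(sum(max(c, 0.0) ** 2 for c in q))
    inside = min(max(q), 0.0)
    return outside + inside

def sdf_normal(sdf, p, h=1e-4):
    """Unit surface normal: normalised central-difference gradient of the SDF."""
    grad = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += h
        lo[axis] -= h
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * h))
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / norm for g in grad]

def lambert(sdf, p, light=(0.0, 0.0, 1.0)):
    """Lambertian intensity max(0, n . l) for a unit-length light direction."""
    n = sdf_normal(sdf, p)
    return max(0.0, sum(ni * li for ni, li in zip(n, light)))

# Two surface points on each shape, both facing the light.
sphere_shades = [lambert(sphere_sdf, p) for p in [(0.0, 0.0, 0.5), (0.3, 0.0, 0.4)]]
box_shades = [lambert(box_sdf, p) for p in [(0.0, 0.0, 0.4), (0.3, 0.0, 0.4)]]
```

The sphere's intensity drops from about 1.0 to about 0.8 between the two points, while the flat cube face stays at 1.0. That difference is invisible in a silhouette but is penalised directly by an RGB loss.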




| Model | Silhouette loss | Training time | Inference |
|---|---|---|---|
| CNN encoder (baseline — Mini SDF-SRN) | 0.0050 | ~70 min | ~2 ms (single forward pass) |
| Rectified flow (Flow-SDF) | 0.0053 | ~90 min total | ~20 ms (8 denoising steps) |
Gradient-flow validation: all three trainable components receive non-zero gradients when loss is backpropagated through the full chain — conditioner (grad norm 3.91), velocity network (161.93), SDF decoder (324.63). The forward pass completes in 258 ms, the backward pass in 282 ms on an Apple M4.
Architecturally equivalent to Hunyuan3D. Differs only in scale.
| Aspect | Industry | Flow-SDF |
|---|---|---|
| Image encoder | DINOv2-Giant (1.1 B params) | CNN conditioner (2.9 M params) |
| Flow backbone | 21-layer DiT | 6-layer MLP |
| Latent space / decoder | ShapeVAE trained on millions of meshes | SDF decoder trained via a 2-D renderer |
| Training data | Millions of 3-D assets | 50 synthetic shapes, 2-D only |
The concept is identical. Each substitution (CNN → DINOv2, MLP → DiT, silhouette → RGB+silhouette, 50 shapes → millions) is modular and independently testable. The small-scale validation is the scaling recipe.
Speed vs capacity. The CNN encoder produces a code in one ~2 ms forward pass; the rectified flow needs 8 denoising steps (~20 ms), and training is ~3× slower per epoch. But the flow architecture supports stochastic sampling, conditioning on arbitrary modalities, and iterative refinement, capabilities the CNN encoder fundamentally lacks.
3-D supervision vs 2-D-only. The 2-D silhouette signal is geometrically weaker than direct 3-D supervision: it constrains the visual hull but leaves depth ambiguity within it. The shared decoder resolves this statistically across the dataset, but the geometry is less precise than 3-D-supervised methods. The trade is deliberate: zero 3-D data is a significant practical advantage where 3-D assets are scarce.
The scaling path the report lays out is concrete: replace the CNN conditioner with a frozen DINOv2-Small (background-invariant features, no training); replace the MLP velocity network with a small transformer DiT using cross-attention between latent tokens and image tokens; combine silhouette + RGB supervision; add a CLIP text encoder for text-to-3-D; and extend to compositional shapes with spatially structured latents (3-D grids or triplanes). Each is a modular, independently testable substitution.
Step through the 8-step rectified flow. The left pane shows the velocity-field integration, from scaled Gaussian noise at step 0 to the clean SDF latent at step 8. The middle pane shows the decoded shape's silhouette at the current step; the right pane is the ground truth. Watch the noise-scaling effect: with scaling off, the flow collapses toward an empty code.