Aditya Jain / Apple Maps · 3D Reconstruction
Topic 44 · Apr 2026 · Rectified Flow · SDF · Differentiable Rendering · Image-to-3-D

Flow-SDF —
Rectified Flow for Neural SDF Reconstruction.

Image-to-3-D without 3-D supervision. Flow-SDF replaces the deterministic CNN encoder of Mini SDF-SRN with a rectified-flow transformer: the SDF latent code is now generated through 8 steps of iterative denoising, conditioned on the input image. The entire system (image conditioner, rectified flow, SDF decoder) trains end-to-end through a differentiable renderer using only 2-D silhouette supervision, with no 3-D ground truth at any point. It is architecturally equivalent to Hunyuan3D and TRELLIS, differing only in scale, and it nearly matches the CNN baseline's reconstruction quality (silhouette loss 0.0053 vs 0.0050) while opening the door to true generative 3-D.

00 — Motivation

Can a flow model replace the CNN encoder — and keep "no 3-D supervision"?

Mini SDF-SRN (Topic 43) proved you can learn 3-D from single-view silhouettes alone — but its CNN encoder is deterministic, mapping one image to one code in a single forward pass. That is a dead end for generation: no stochasticity, no conditioning on text or partial views, no iterative refinement. Modern image-generation systems (Stable Diffusion 3) and state-of-the-art 3-D systems (Hunyuan3D 2.0, TRELLIS 2) all replaced deterministic encoders with iterative denoising — flow-based diffusion transformers conditioned on image features.

Flow-SDF asks the precise question: can we swap the CNN encoder for a rectified-flow model while keeping the one property that matters — requiring no 3-D supervision? The answer is yes. The conventional systems build their 3-D latent space in a separate first stage, trained on millions of 3-D meshes; Flow-SDF builds it jointly, from 2-D images only, with the SDF decoder and the flow co-evolving through the differentiable renderer.

What it informs
Flow-SDF is the architectural bridge between the from-scratch primitive (Mini SDF-SRN, Topic 43) and the full-scale image-to-3-D archive (Hypernet → DeepSDF, Topic 41). It validates — at small scale, on consumer hardware — the exact component stack that production systems use: image conditioner → flow backbone → SDF decoder → differentiable renderer. Each component is modular and independently upgradeable, so the validated small-scale pipeline is a clear scaling recipe.
Pipeline

Image → conditioner → rectified flow (8 steps) → SDF decoder → renderer.

image (64², single view) → conditioner (CNN, 2.9M) → rectified flow (MLP+AdaLN, 3.9M, 8 Euler steps) → latent z (128-dim) → SDF decoder (5-layer MLP, 330K) → diff. renderer (ray march to silhouette + RGB)

Random noise → 8-step Euler integration of the rectified flow's velocity field → clean SDF latent code. All three modules train end-to-end through the renderer. No 3-D data at any point. Loss = rendered silhouette (and RGB) vs ground truth. Runs on Apple Silicon (MPS), CUDA, or CPU.
01 — Architecture

Four components, ~7 M parameters total; the three learned modules train end-to-end through a fixed renderer.

| Component | Spec | Role |
| --- | --- | --- |
| Image conditioner | CNN: 4 conv layers + FC, 2.9 M params | Processes the input image into a conditioning embedding that guides the flow's denoising |
| Rectified flow velocity net | 6-layer MLP with AdaLN timestep conditioning, 3.9 M params | Predicts a velocity field v(z_t, t, cond) defining straight-line paths from noise to clean latent codes |
| SDF decoder | 5-layer MLP with skip at layer 3, 330 K params (DeepSDF architecture) | Latent code + 3-D coordinate → scalar SDF value. Never trained on 3-D data |
| Differentiable renderer | Fixed ray-marching, not learned; soft silhouette via sigmoid(−min_sdf / temp) | The bridge: provides continuous gradients from 2-D pixels to 3-D SDF values |
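
The renderer's soft-silhouette formula, sigmoid(−min_sdf / temp), can be illustrated for a single ray. This is a minimal sketch, not the project's renderer: `sphere_sdf` is an analytic stand-in for the learned SDF decoder, and the ray bounds, the temperature value and the function names are all assumptions chosen for illustration. Only the 48 depth samples and the formula itself come from the text.

```python
import math

def sphere_sdf(p, radius=0.5):
    """Analytic stand-in for the learned SDF decoder (assumption)."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius

def soft_silhouette(origin, direction, sdf=sphere_sdf, n_samples=48,
                    near=0.0, far=2.0, temp=0.05):
    """Soft occupancy in (0, 1) for one ray: sigmoid(-min_sdf / temp)."""
    min_sdf = float("inf")
    for i in range(n_samples):
        # Uniform depth samples along the ray (fixed ray marching).
        t = near + (far - near) * i / (n_samples - 1)
        p = [o + t * d for o, d in zip(origin, direction)]
        min_sdf = min(min_sdf, sdf(p))
    # sigmoid(-x) == 1 / (1 + exp(x)); clamp x so exp() cannot overflow.
    x = max(-60.0, min(60.0, min_sdf / temp))
    return 1.0 / (1.0 + math.exp(x))

hit = soft_silhouette([0.0, 0.0, -1.0], [0.0, 0.0, 1.0])    # through the centre
graze = soft_silhouette([0.5, 0.0, -1.0], [0.0, 0.0, 1.0])  # touches the surface
miss = soft_silhouette([0.9, 0.0, -1.0], [0.0, 0.0, 1.0])   # passes outside
```

Because the output varies smoothly with the minimum SDF value along the ray, a pixel near the boundary still carries a usable gradient. A lower `temp` sharpens the edge at the cost of a narrower gradient band.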

The rectified flow uses AdaLN — the same conditioning technique as in the Diffusion Transformer — to modulate layer-norm parameters by the diffusion timestep, so the network behaves differently at each stage of denoising. At inference, starting from scaled Gaussian noise, the velocity field is integrated with 8 Euler steps to produce a clean SDF latent code.
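
The 8-step inference loop can be sketched as a plain Euler integrator. In this minimal illustration, `oracle_velocity` is a toy stand-in for the learned velocity network: it simply points straight at a known target code, which is passed in through the `cond` argument purely for convenience. The function names, the noise scale and the t = 0 (noise) to t = 1 (clean code) direction are assumptions.

```python
import random

def euler_sample(velocity, z0, cond, n_steps=8):
    """Integrate dz/dt = velocity(z, t, cond) from t = 0 to t = 1 with Euler steps."""
    z = list(z0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity(z, t, cond)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z

def oracle_velocity(z, t, cond):
    """Toy velocity field (assumption): the straight-line direction to the target."""
    return [(ci - zi) / (1.0 - t) for ci, zi in zip(cond, z)]

rng = random.Random(0)
target = [rng.uniform(-0.05, 0.05) for _ in range(128)]  # pretend clean latent code
noise = [rng.gauss(0.0, 0.04) for _ in range(128)]       # scaled Gaussian source
z = euler_sample(oracle_velocity, noise, target)
max_err = max(abs(a - b) for a, b in zip(z, target))
```

With perfectly straight paths the velocity is constant along each trajectory, so Euler integration lands on the target up to floating-point error even with few steps. A learned field is only approximately straight, which is presumably why several steps are used rather than one.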

02 — Two-Phase Training

Flow distillation, then end-to-end fine-tuning.

Phase 1 — flow distillation. A standard CNN-based SDF-SRN model is trained to convergence first (this is Mini SDF-SRN, Topic 43). Its trained encoder generates target latent codes for all training images; the rectified flow is then trained to reproduce those codes with two complementary losses — a velocity loss (MSE between predicted and target velocity at random timesteps, training local accuracy) and a sampling loss (MSE between the full 8-step Euler-integrated output and the target, training end-to-end integration quality). Phase 1 is fast — ~30 seconds total, no rendering involved.
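
The two Phase 1 losses can be written out explicitly. The block below assumes the common rectified-flow convention in which $z_0$ is the (scaled) noise sample at $t = 0$ and $z_1$ is the encoder's target code at $t = 1$, with $c$ the conditioning embedding. The source does not state the time convention or how the two losses are weighted, so the symbols here are illustrative rather than the project's exact formulation.

```latex
\begin{aligned}
z_t &= (1 - t)\, z_0 + t\, z_1, \qquad t \sim \mathcal{U}(0, 1) \\
\mathcal{L}_{\text{vel}} &= \mathbb{E}_{z_0,\, t}\, \bigl\| v_\theta(z_t, t, c) - (z_1 - z_0) \bigr\|^2 \\
\hat{z}_{k+1} &= \hat{z}_k + \tfrac{1}{8}\, v_\theta\bigl(\hat{z}_k, \tfrac{k}{8}, c\bigr), \qquad \hat{z}_0 = z_0,\quad k = 0, \dots, 7 \\
\mathcal{L}_{\text{samp}} &= \bigl\| \hat{z}_8 - z_1 \bigr\|^2
\end{aligned}
```

The velocity loss supervises the field pointwise along the straight path, while the sampling loss backpropagates through all 8 Euler updates and so penalises error that accumulates during integration.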

Phase 2 — end-to-end fine-tuning. All parameters unfrozen. The full pipeline runs image → conditioner → flow (8 steps) → latent → SDF decoder → differentiable renderer → silhouette, with binary-cross-entropy loss against the ground-truth silhouette. Phase 2 is slow (~14 s/epoch) because each step requires 8 flow denoising passes followed by SDF evaluation at 196,608 spatial points (64 × 64 pixels × 48 depth samples). 200 epochs brings the silhouette loss from ~1.5 down to 0.0053.
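
The per-view cost and the Phase 2 loss can be sketched in a few lines. The sample counts come from the text; the `bce` helper and its `eps` clamp are illustrative assumptions, not the project's code.

```python
import math

H, W, DEPTH_SAMPLES = 64, 64, 48
points_per_view = H * W * DEPTH_SAMPLES  # SDF evaluations per rendered view

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between soft silhouette values and a 0/1 mask."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(pred)

uninformed = bce([0.5, 0.5], [1.0, 0.0])    # equals ln 2 for a 50/50 guess
confident = bce([0.99, 0.01], [1.0, 0.0])   # small when prediction matches mask
```

For reference, a constant 0.5 prediction gives a loss of ln 2 ≈ 0.693.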

The critical fix — noise scaling
Standard Gaussian noise in 128 dimensions has norm ~11.3, while the target SDF latent codes have norm ~0.42. Without scaling, the flow collapses to producing near-zero codes — minimising MSE by magnitude reduction rather than direction-finding. Scaling the source noise to match the target distribution's standard deviation makes the flow's task direction-finding rather than magnitude-shrinking. This single fix raised cosine similarity from 0.08 to 0.95.
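
The magnitude mismatch is easy to reproduce. The sketch below draws unit-variance noise in 128 dimensions, then rescales it to the target magnitude. It assumes the target codes are roughly zero-mean and isotropic, so that a per-dimension standard deviation of 0.42 / √128 corresponds to the reported norm of ~0.42; in practice the standard deviation would be measured from the encoder's codes. The variable names are illustrative.

```python
import math
import random

DIM = 128
rng = random.Random(0)

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Unit-variance Gaussian noise: the norm concentrates near sqrt(128) ~ 11.3.
raw_noise = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

# Assumed per-dimension std that yields codes with norm ~0.42.
target_std = 0.42 / math.sqrt(DIM)

# Scaled source noise sits at the same magnitude as the target codes.
scaled_noise = [target_std * x for x in raw_noise]

raw_norm = norm(raw_noise)
scaled_norm = norm(scaled_noise)
same_direction = cosine(raw_noise, scaled_noise)
```

Scaling changes only the length of the noise vector, not its direction, so the flow no longer has to spend capacity shrinking its input by a factor of roughly 27 before it can start rotating it toward the target.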
03 — Results

Three stages — silhouette baseline, flow + silhouette, flow + RGB.

The project runs in three stages. Stage 1 is the CNN-encoder silhouette baseline — the Mini SDF-SRN reimplementation (Topic 43). Stage 2 replaces the CNN encoder with the 8-step rectified flow. Stage 3 adds differentiable RGB rendering with Lambertian shading on top. The figures below are the actual epoch outputs from the repository.

Stage 2 — rectified flow + silhouette

Stage 2 flow + silhouette — sphere novel views
Sphere · Phase 2, epoch 200. Training view + 4 novel angles. Pred row vs GT row — the 8-step flow matches the CNN baseline's silhouette quality.
Stage 2 flow + silhouette — box novel views
Box · Phase 2, epoch 200. The hardest case — sharp corners. The flow recovers the gross box silhouette across all 4 novel angles.
Stage 2 flow + silhouette — ellipsoid novel views
Ellipsoid · Phase 2, epoch 200. The anisotropic silhouette is recovered at every novel angle.
Stage 2 training convergence · Phase 2, epoch 200. Ground-truth / predicted / difference. The difference shrinks to a thin boundary band — the rectified flow reaches the same reconstruction quality as the CNN encoder.

Stage 3 — rectified flow + RGB

Stage 3 adds a differentiable RGB renderer with Lambertian shading. The intuition: a sphere and a cube have similar silhouettes but very different shading patterns, and RGB supervision provides dense surface-normal information that silhouette-only supervision cannot.
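
Why shading carries surface-normal information can be seen from the Lambertian model itself: intensity is proportional to the cosine between the surface normal and the light direction, and the normal is the normalised gradient of the SDF. The sketch below uses an analytic sphere and central finite differences; a trained pipeline would more likely obtain the gradient by automatic differentiation of the decoder. The function names, light direction and albedo are assumptions.

```python
import math

def sphere_sdf(p, radius=0.5):
    """Analytic stand-in for the learned SDF decoder (assumption)."""
    return math.sqrt(sum(x * x for x in p)) - radius

def sdf_normal(sdf, p, h=1e-4):
    """Unit surface normal from central finite differences of the SDF."""
    grad = []
    for axis in range(3):
        hi = list(p)
        lo = list(p)
        hi[axis] += h
        lo[axis] -= h
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * h))
    length = math.sqrt(sum(g * g for g in grad))
    return [g / length for g in grad]

def lambert(sdf, p, light_dir, albedo=1.0):
    """Lambertian intensity: albedo * max(0, n . l) with unit vectors n and l."""
    n = sdf_normal(sdf, p)
    l_len = math.sqrt(sum(x * x for x in light_dir))
    l = [x / l_len for x in light_dir]
    return albedo * max(0.0, sum(a * b for a, b in zip(n, l)))

light = [0.0, 0.0, 1.0]
s = math.sqrt(0.5)
facing = lambert(sphere_sdf, [0.0, 0.0, 0.5], light)        # normal along the light
oblique = lambert(sphere_sdf, [0.5 * s, 0.0, 0.5 * s], light)  # 45 degrees off
side = lambert(sphere_sdf, [0.5, 0.0, 0.0], light)          # perpendicular
```

Every lit pixel therefore constrains the SDF gradient at its surface point, whereas a silhouette constrains only the points on the occluding boundary.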

Stage 3 flow + RGB — sphere
Sphere · RGB Flow-SDF, epoch 100. Three rows — predicted RGB, predicted silhouette, ground-truth RGB. Lambertian shading gradients add surface-normal supervision on top of the boundary signal.
Stage 3 flow + RGB — box
Box · RGB Flow-SDF, epoch 100. The shading distinguishes a box from a sphere even where the silhouettes are similar.
Stage 3 flow + RGB — ellipsoid
Ellipsoid · RGB Flow-SDF, epoch 100. RGB + silhouette combined supervision.
Stage 3 RGB training convergence · epoch 100. The RGB rendering loss converges alongside the silhouette loss; shading gradients provide the dense surface-normal signal.

| Model | Silhouette loss | Training time | Inference |
| --- | --- | --- | --- |
| CNN encoder (baseline, Mini SDF-SRN) | 0.0050 | ~70 min | ~2 ms (single forward pass) |
| Rectified flow (Flow-SDF) | 0.0053 | ~90 min total | ~20 ms (8 denoising steps) |

Gradient-flow validation: all three trainable components receive non-zero gradients when loss is backpropagated through the full chain — conditioner (grad norm 3.91), velocity network (161.93), SDF decoder (324.63). The forward pass completes in 258 ms, the backward pass in 282 ms on an Apple M4.

Core Result

Architecturally equivalent to Hunyuan3D. Differs only in scale.

Industry: DINOv2-Giant (1.1 B params) image encoder, a 21-layer DiT flow backbone, a ShapeVAE trained on millions of meshes, millions of 3-D assets. Flow-SDF: a 2.9 M-param CNN conditioner, a 6-layer MLP flow, an SDF decoder trained via a 2-D renderer, 50 synthetic shapes, 2-D only. The concept is identical. Each substitution — CNN → DINOv2, MLP → DiT, silhouette → RGB+silhouette, 50 shapes → millions — is modular and independently testable. The small-scale validation is the scaling recipe.

04 — Tradeoffs & Next Steps

Speed for capacity; 2-D-only for geometric precision.

Speed vs capacity. The CNN encoder produces a code in one ~2 ms forward pass; the rectified flow needs 8 denoising steps (~20 ms), and training is ~3× slower per epoch. But the flow architecture supports stochastic sampling, conditioning on arbitrary modalities, and iterative refinement, capabilities the CNN encoder fundamentally lacks.

3-D supervision vs 2-D-only. The 2-D silhouette signal is geometrically weaker than direct 3-D supervision: it constrains the visual hull but leaves depth ambiguity within it. The shared decoder resolves this statistically across the dataset, but the geometry is less precise than 3-D-supervised methods. The trade is deliberate: zero 3-D data is a significant practical advantage where 3-D assets are scarce.

The scaling path the report lays out is concrete: replace the CNN conditioner with a frozen DINOv2-Small (background-invariant features, no training); replace the MLP velocity network with a small transformer DiT using cross-attention between latent tokens and image tokens; combine silhouette + RGB supervision; add a CLIP text encoder for text-to-3-D; and extend to compositional shapes with spatially-structured latents (3-D grids or triplanes). Each is a modular, independently-testable substitution.

Interactive Demo · Live

Step the 8-step rectified flow. The left pane is the velocity-field integration — scaled Gaussian noise at step 0, resolving toward the clean SDF latent by step 8. The middle pane shows the decoded shape's silhouette at the current step; the right pane is the ground truth. Watch the noise-scaling effect: with scaling off, the flow collapses toward an empty code.


Full Technical Paper

White paper · rectified flow for SDF reconstruction · the noise-scaling fix · two-phase training · three-stage results · architectural equivalence to Hunyuan3D / TRELLIS

Related Thesis Chapters
Mini SDF-SRN
The Stage-1 baseline. Flow-SDF keeps Mini SDF-SRN's SDF decoder and differentiable renderer unchanged and swaps only the CNN encoder for the rectified flow.
Hypernet → DeepSDF
The full-scale image-to-3-D archive. Flow-SDF validates the flow-conditioner-decoder-renderer stack at small scale; Hypernet → DeepSDF runs the DeepSDF + image-DiT version at 976 shapes.
x-Prediction Analysis
The manifold-hypothesis theory behind rectified flow's velocity target — the straight-line-path formulation is the flow-matching cousin of x-prediction.