Flow-SDF — White Paper

Flow-SDF: Rectified Flow for Neural SDF Reconstruction — Image-to-3-D Without 3-D Supervision

Aaditya Jain

ad_jain@icloud.com · orcid.org/0009-0005-5534-5641

Rectified Flow · Neural SDF Reconstruction · Thesis Research, Unpublished Preprint

Submitted: April 2026
Subject: cs.CV · cs.GR · cs.LG
Keywords: rectified flow, neural SDF, differentiable rendering, image-to-3-D, 2-D supervision, joint training, noise scaling

Abstract

We present Flow-SDF, a pipeline that replaces the conventional CNN encoder in neural SDF reconstruction with a rectified-flow model, enabling iterative latent-code generation conditioned on input images. Unlike standard approaches that use a single-forward-pass encoder, our method generates SDF latent codes through multi-step denoising, providing the architectural foundation for future generative 3-D tasks. The entire system trains end-to-end using only 2-D silhouette supervision through a differentiable renderer, without requiring 3-D ground-truth data at any point. We validate that gradients flow through the complete chain (from rendered silhouettes back through the differentiable renderer, the SDF decoder, the 8 rectified-flow denoising steps, and the image conditioner) and demonstrate reconstruction quality matching the CNN baseline (silhouette loss 0.0053 vs 0.0050). We identify and resolve the noise-scaling problem in rectified-flow training with small-norm target distributions: standard Gaussian noise in 128 dimensions has norm ~11.3 while the target latent codes have norm ~0.42, so without scaling the flow collapses to magnitude-shrinking rather than direction-finding. Scaling the source noise to the target's standard deviation raised cosine similarity from 0.08 to 0.95. We describe a two-phase training strategy (fast flow distillation, then end-to-end fine-tuning) that converges reliably on consumer hardware, and a three-stage progression: CNN+silhouette baseline, flow+silhouette, flow+RGB. The pipeline is architecturally equivalent to production systems like Hunyuan3D 2.0 and TRELLIS 2, differing only in scale, establishing a clear modular scaling path.

1. Introduction

Reconstructing 3-D shapes from single images is a fundamental challenge in computer vision. Traditional approaches require paired image–3-D datasets, which are expensive and difficult to collect at scale. SDF-SRN (Lin et al., NeurIPS 2020) demonstrated that 3-D shapes can be learned from single-view images using only 2-D silhouette supervision, eliminating the need for 3-D ground truth entirely — but SDF-SRN uses a deterministic CNN encoder that maps images to latent codes in a single forward pass, limiting its capacity for generative applications.

Modern image-generation systems like Stable Diffusion 3 have shown that replacing deterministic encoders with iterative denoising processes (diffusion, rectified flow) dramatically improves generation quality and enables compositional understanding. State-of-the-art 3-D generation systems — Hunyuan3D 2.0, TRELLIS 2 — use flow-based diffusion transformers conditioned on image features to generate 3-D latent codes. We ask: can we replace the CNN encoder in a neural SDF pipeline with a rectified-flow model while keeping the key advantage of requiring no 3-D supervision? We answer affirmatively.

2. Approach Comparison

2.1 The conventional two-stage approach

Modern 3-D generation systems (Hunyuan3D 2.0, TRELLIS 2, Shap-E) follow a two-stage paradigm. Stage 1 — 3-D autoencoder training: a ShapeVAE is trained on millions of 3-D meshes with direct 3-D supervision; the encoder compresses meshes into a latent space, the decoder reconstructs them. This requires a massive 3-D dataset. Stage 2 — conditional generation: a flow-based diffusion transformer is trained to generate codes in the frozen ShapeVAE's latent space, conditioned on image features from a pretrained vision model (typically DINOv2). This requires paired image–3-D data. Hunyuan3D 2.0, for example, uses DINOv2-Giant (1.1 B parameters) at 518×518 resolution and trains on millions of 3-D assets.

2.2 Our joint training approach

Flow-SDF fundamentally differs: everything trains together, and no 3-D data is used at any point. The SDF decoder, rectified-flow model, and image conditioner are all trained simultaneously through a single unified loss — the comparison between a differentiably-rendered silhouette and the ground-truth silhouette. There is no separate 3-D latent space built beforehand. The decoder discovers what latent codes should mean at the same time the flow learns what codes to produce; they co-evolve through the differentiable renderer, which provides a continuous gradient path from 2-D pixel comparisons to 3-D geometry decisions.

Table 1 — Approach comparison.
Aspect	Conventional (Hunyuan3D, TRELLIS)	Ours (Flow-SDF)
3-D data required	Millions of 3-D meshes for the ShapeVAE	None — trained entirely from 2-D images
Training stages	Two separate (3-D autoencoder, then flow)	Joint end-to-end through the renderer
Latent space	Pre-built from 3-D supervision (high quality)	Co-evolved with the flow via 2-D loss (less precise)
Image encoder	Frozen DINOv2-Giant (1.1 B params)	Learned CNN (2.9 M params)
Flow backbone	Transformer DiT with cross-attention	MLP with AdaLN (3.9 M params)
Supervision	Direct 3-D + paired images	2-D silhouettes only
Generative capacity	Full text/image-to-3-D generation	Validated architecture, reconstruction demonstrated

The key insight: the conventional approach builds a better 3-D latent space because it has direct 3-D supervision; our approach builds the latent space from weaker 2-D signals, resulting in less precise geometry. But our approach requires zero 3-D data — a significant practical advantage for domains where 3-D assets are scarce but images are abundant.

3. Architecture

The full pipeline has four components and is trained end-to-end.

Image conditioner: a CNN feature extractor (4 conv layers + FC, 2.9 M params) producing a conditioning embedding that guides the flow's denoising. In a production system this would be a frozen pretrained DINOv2.

Rectified-flow velocity network: a 6-layer MLP (3.9 M params) with Adaptive Layer Normalization (AdaLN) conditioning on timestep embeddings. It predicts a velocity field v(z_t, t, cond) defining straight-line paths from noise to clean latent codes, integrated with 8 Euler steps at inference. AdaLN, the same conditioning technique as in DiT, modulates layer-norm parameters by timestep, letting the network behave differently at each denoising stage.

SDF decoder: a 5-layer MLP (330 K params) with a skip connection at layer 3, DeepSDF-style. It takes a latent code concatenated with a 3-D coordinate and outputs a scalar SDF value, and it was never trained on 3-D data directly.

Differentiable renderer: a fixed (non-learned) ray-marching renderer that samples points along camera rays, queries the SDF decoder, and computes a soft silhouette via sigmoid(−min_sdf / temperature). It is the critical bridge between 3-D and 2-D, a mathematical operation providing continuous gradients from 2-D pixel comparisons back through the SDF values to the latent codes.
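The soft-silhouette rule used by the renderer can be illustrated for a single ray. This is a minimal sketch, not the project's renderer: the analytic sphere_sdf stands in for the learned SDF decoder, and the near/far bounds, the temperature value, and all function names are assumptions chosen for illustration. Only the formula sigmoid(−min_sdf / temperature) and the 48 depth samples come from the text.

```python
import math

def sphere_sdf(p, radius=0.5):
    """Signed distance to a sphere at the origin (hypothetical stand-in for the decoder)."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius

def soft_silhouette(origin, direction, sdf, n_samples=48,
                    near=0.5, far=2.5, temperature=0.05):
    """Soft occupancy of one ray: sigmoid(-min_sdf / temperature)."""
    min_sdf = float("inf")
    for i in range(n_samples):
        # Evenly spaced depths between the (assumed) near and far bounds.
        t = near + (far - near) * i / (n_samples - 1)
        point = tuple(o + t * d for o, d in zip(origin, direction))
        min_sdf = min(min_sdf, sdf(point))
    # sigmoid(-x) written as 1 / (1 + exp(x)) with x = min_sdf / temperature.
    return 1.0 / (1.0 + math.exp(min_sdf / temperature))

# One ray through the sphere centre, one ray that passes well outside it.
hit = soft_silhouette((0.0, 0.0, -1.5), (0.0, 0.0, 1.0), sphere_sdf)
miss = soft_silhouette((1.0, 0.0, -1.5), (0.0, 0.0, 1.0), sphere_sdf)
print(f"hit={hit:.4f} miss={miss:.6f}")
```

A ray that enters the surface sees a negative minimum SDF and maps close to 1, while a ray that stays outside maps close to 0. The temperature controls how sharp the transition is, which in turn controls how far from the boundary a pixel still receives a useful gradient.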

4. Training Strategy

4.1 Phase 1 — flow distillation

A standard CNN-based SDF-SRN model is first trained to convergence (silhouette loss 0.005). Its trained encoder generates target latent codes for all training images, and the rectified flow is trained to reproduce these codes with two complementary losses. Velocity loss: at a random timestep t, interpolate between noise z_0 and target z_1 as z_t = (1−t)·z_0 + t·z_1; the network predicts the velocity at z_t, whose target is v = z_1 − z_0, and loss = MSE(v_pred, v_target). This trains local accuracy at each timestep. Sampling loss: run the full 8-step Euler integration from noise to clean code, with loss = MSE(z_sampled, z_target). This trains end-to-end integration quality, preventing error accumulation across steps. Phase 1 is fast (~0.2 s per epoch, ~30 seconds total) because it involves only the conditioner and velocity network, with no rendering. It is considered converged when mean cosine similarity across all shapes exceeds 0.9.
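The two Phase 1 losses can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the training code: the "oracle" velocity function returns the exact target velocity in place of the trained network, the latent values are random placeholders, and all function names are hypothetical.

```python
import random

def interpolate(z0, z1, t):
    """Straight-line path between noise z0 and target z1: z_t = (1-t)*z0 + t*z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def velocity_target(z0, z1):
    """Rectified-flow regression target: v = z1 - z0, constant along the path."""
    return [b - a for a, b in zip(z0, z1)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def euler_sample(velocity_fn, z0, n_steps=8):
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with fixed-size Euler steps."""
    z = list(z0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = velocity_fn(z, k * dt)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z

rng = random.Random(0)
dim = 128
z0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]    # source noise (placeholder)
z1 = [rng.gauss(0.0, 0.05) for _ in range(dim)]   # target latent code (placeholder)

def oracle(z, t):
    """Assumption: a perfect network that always returns the true velocity."""
    return velocity_target(z0, z1)

# Velocity loss at one random timestep, and sampling loss after 8 Euler steps.
t = rng.random()
velocity_loss = mse(oracle(interpolate(z0, z1, t), t), velocity_target(z0, z1))
sampling_loss = mse(euler_sample(oracle, z0), z1)
```

With the exact velocity, eight Euler steps land on the target up to floating-point error, which is why straight paths tolerate so few steps. A real network's per-step errors do accumulate, and that accumulation is what the sampling loss penalises.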

4.2 The noise-scaling fix

Noise scaling is the single most important implementation detail. Standard Gaussian noise in 128 dimensions has norm ~11.3 (about √128), while the target latent codes have norm ~0.42. Without scaling, the flow collapses to producing near-zero codes, minimising MSE by magnitude reduction rather than direction-finding. We scale the source noise to match the target distribution's standard deviation, so the flow's task becomes direction-finding rather than magnitude-shrinking. This single fix raised cosine similarity from 0.08 to 0.95. It is a general lesson for rectified-flow training against small-norm target distributions: the source noise must be scaled to the target, or the easiest minimum to reach is the near-zero code.
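The norm mismatch is easy to reproduce numerically. The sketch below samples Gaussian noise with and without scaling and compares average norms. It assumes the target codes are roughly zero-mean and isotropic, so that a per-dimension standard deviation of 0.42 / √128 corresponds to the quoted norm; that assumption, the sample count, and the variable names are illustrative, not taken from the codebase.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

dim = 128
target_norm = 0.42                           # typical latent-code norm quoted in the text
target_std = target_norm / math.sqrt(dim)    # assumed per-dimension std (isotropic, zero-mean)

rng = random.Random(0)
n_samples = 2000

# Unscaled source noise: z0 ~ N(0, I).
unit_norms = [norm([rng.gauss(0.0, 1.0) for _ in range(dim)])
              for _ in range(n_samples)]
# Scaled source noise: z0 ~ N(0, target_std^2 * I).
scaled_norms = [norm([rng.gauss(0.0, target_std) for _ in range(dim)])
                for _ in range(n_samples)]

mean_unit = sum(unit_norms) / n_samples
mean_scaled = sum(scaled_norms) / n_samples
print(f"unscaled noise norm ~ {mean_unit:.2f}, scaled noise norm ~ {mean_scaled:.3f}")
```

Unscaled noise sits at roughly √128 in norm, more than an order of magnitude above the target codes, so most of the velocity the network must predict is a pure shrink toward the origin. After scaling, source and target have comparable norms and the remaining work is a change of direction.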

4.3 Phase 2 — end-to-end fine-tuning

All parameters are unfrozen (conditioner, velocity network, SDF decoder). The full pipeline runs image → conditioner → flow (8 steps) → latent code → SDF decoder → differentiable renderer → silhouette, with binary-cross-entropy loss against the ground-truth silhouette plus an auxiliary flow loss that maintains alignment with the pretrained codes. Phase 2 is slower (~14 s per epoch) because every step requires 8 flow denoising passes followed by SDF evaluation at 196,608 spatial points (64 × 64 pixels × 48 depth samples). Training for 200 epochs brings the silhouette loss from ~1.5 to 0.0053, matching the CNN baseline's 0.0050.
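A rough sketch of the Phase 2 objective and the per-image evaluation count follows. The text specifies binary cross-entropy on the silhouette plus an auxiliary flow term, but not the form or weight of that term. The MSE form and the aux_weight value below are therefore assumptions, as are the function names.

```python
import math

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between soft silhouette values and 0/1 targets."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)   # clamp so log() stays finite
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(pred)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def phase2_loss(sil_pred, sil_gt, z_flow, z_pretrained, aux_weight=0.1):
    """Silhouette BCE plus an auxiliary alignment term.

    Assumption: the auxiliary term is an MSE to the pretrained codes and
    aux_weight=0.1 is a hypothetical value; neither is given in the text.
    """
    return bce(sil_pred, sil_gt) + aux_weight * mse(z_flow, z_pretrained)

# SDF decoder queries needed to render one 64 x 64 silhouette.
HEIGHT, WIDTH, DEPTH_SAMPLES = 64, 64, 48
points_per_image = HEIGHT * WIDTH * DEPTH_SAMPLES

# Two-pixel toy example: a nearly correct and a nearly inverted silhouette.
good = phase2_loss([0.99, 0.01], [1.0, 0.0], [0.1, 0.2], [0.1, 0.2])
bad = phase2_loss([0.01, 0.99], [1.0, 0.0], [0.1, 0.2], [0.1, 0.2])
```

The point count makes the cost asymmetry concrete: the eight flow passes operate on a single 128-dimensional code, while the decoder is queried once per depth sample per pixel.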

5. Results

5.1 Gradient-flow validation

The gradient-flow test confirms all three trainable components receive non-zero gradients when loss is backpropagated through the full chain: conditioner (grad norm 3.91), velocity network (161.93), SDF decoder (324.63). Forward pass completes in 258 ms, backward pass in 282 ms on an M4 — the full eight-step-flow-plus-render-plus-decode chain is differentiable end to end on consumer hardware.

5.2 Three-stage progression

Stage 1 is the CNN-encoder silhouette baseline (Mini SDF-SRN). Stage 2 replaces the CNN encoder with the 8-step rectified flow — cosine similarity between flow-generated and CNN-generated latent codes reaches 0.947 within 140 epochs (~28 seconds). Stage 3 adds differentiable RGB rendering with Lambertian shading: a sphere and a cube have similar silhouettes but very different shading patterns, so RGB supervision provides dense surface-normal information that silhouette-only supervision cannot.
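The shading argument can be made concrete with a small sketch. It estimates surface normals from an SDF by central differences and applies Lambertian shading, max(0, n·l), at two points on the front of a sphere and on the front face of a cube. The analytic SDFs, the light direction, the sample points, and the function names are all illustrative assumptions; the Stage 3 renderer itself is not shown in the text.

```python
import math

def sphere_sdf(p, radius=0.5):
    return math.sqrt(sum(c * c for c in p)) - radius

def box_sdf(p, half=0.5):
    """Signed distance to an axis-aligned cube with the given half-extent."""
    q = [abs(c) - half for c in p]
    outside = math.sqrt(sum(max(c, 0.0) ** 2 for c in q))
    inside = min(max(q), 0.0)
    return outside + inside

def sdf_normal(sdf, p, h=1e-4):
    """Unit surface normal from central differences of the SDF."""
    grad = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += h
        lo[axis] -= h
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * h))
    length = math.sqrt(sum(g * g for g in grad))
    return [g / length for g in grad]

def lambert(normal, light_dir):
    """Lambertian intensity: max(0, n . l) for unit vectors n and l."""
    return max(0.0, sum(n * l for n, l in zip(normal, light_dir)))

light = (0.6, 0.0, -0.8)   # unit vector pointing toward the light (assumed)

# Two surface points seen by a camera looking along +z.
sphere_center = lambert(sdf_normal(sphere_sdf, (0.0, 0.0, -0.5)), light)
sphere_side = lambert(sdf_normal(sphere_sdf, (0.3, 0.0, -0.4)), light)
cube_center = lambert(sdf_normal(box_sdf, (0.0, 0.0, -0.5)), light)
cube_side = lambert(sdf_normal(box_sdf, (0.3, 0.0, -0.5)), light)
```

On the sphere the intensity changes between the two points because the normal rotates across the surface, whereas the flat cube face shades uniformly. That per-pixel difference is the surface-normal signal a silhouette loss cannot provide.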

Table 2 — Reconstruction quality: CNN encoder vs rectified flow.
Model	Silhouette loss	Training time	Inference
CNN encoder (baseline)	0.0050	~70 min	~2 ms (single forward pass)
Flow model (ours)	0.0053	~90 min total	~20 ms (8 denoising steps)

The flow model achieves equivalent reconstruction quality to the CNN encoder while providing architectural capacity for future generative tasks. Novel-view predictions from both models are visually comparable, with minor softness at shape boundaries in the flow model's output.

6. Why Joint Training Works

The differentiable renderer creates a continuous gradient path from 2-D pixels to 3-D geometry. Because the renderer is a fixed mathematical operation rather than a learned component, the gradients it passes back are geometrically meaningful: "this pixel should be white but is black" → "the SDF value along this ray should be negative somewhere" → "the latent code should change so the decoder produces a surface here" → "the flow should produce a different output for this image." Every component receives a meaningful learning signal derived from a simple 2-D image comparison. The SDF decoder never sees a 3-D mesh — it learns what 3-D geometry looks like from the accumulated constraints of many 2-D silhouette comparisons across the dataset; different training examples from different viewpoints collectively teach the decoder about 3-D structure even though no single example provides multi-view supervision.
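The first link of this chain can be written out explicitly. The derivation below is a sketch under assumptions: it uses the soft-silhouette formula from Section 3 and the binary cross-entropy loss from Section 4.3, with s the rendered silhouette value, y the target (taken here as 1 for an object pixel, an assumed convention), d_min the minimum SDF value along the ray, and τ the temperature.

```latex
\begin{aligned}
s &= \sigma\!\left(-\frac{d_{\min}}{\tau}\right), \qquad
\mathcal{L} = -\bigl[\, y \log s + (1-y)\log(1-s) \,\bigr] \\[4pt]
\frac{\partial \mathcal{L}}{\partial s} &= \frac{s-y}{s(1-s)}, \qquad
\frac{\partial s}{\partial d_{\min}} = -\frac{s(1-s)}{\tau} \\[4pt]
\frac{\partial \mathcal{L}}{\partial d_{\min}} &= \frac{y-s}{\tau}
\end{aligned}
```

For a pixel that should be covered (y = 1) but currently renders as empty (s ≈ 0), the derivative is approximately 1/τ, which is positive, so a gradient-descent step lowers d_min and pushes the SDF toward negative values somewhere along that ray. With a hard minimum, only the depth sample that attains the minimum receives this gradient, which then continues through the decoder into the latent code and the flow.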

7. Tradeoffs

Speed vs capacity. The CNN encoder produces a latent code in a single ~2 ms forward pass; the rectified flow requires 8 denoising steps (~20 ms), and training is ~3× slower per epoch. But the flow architecture supports stochastic sampling, conditioning on arbitrary modalities, and iterative refinement, capabilities the CNN encoder fundamentally lacks.

3-D supervision vs 2-D-only. Our approach requires no 3-D data, which is advantageous where 3-D assets are scarce. But the 2-D silhouette signal is geometrically weaker than direct 3-D supervision: it constrains the visual hull but leaves depth ambiguity within the silhouette, which the shared decoder resolves only statistically across the dataset.

Reconstruction vs generation. The current system is a reconstruction model; the CNN-encoder version is a dead end for generation because it deterministically maps one image to one code. The flow version opens the door: the same architecture can be conditioned on text embeddings, partial views, or semantic descriptions to generate novel 3-D shapes, and the flow's stochastic nature enables multi-sample averaging for more robust reconstructions.

8. Relationship to State-of-the-Art and Next Steps

Flow-SDF is architecturally equivalent to production systems, differing only in scale: image encoder CNN (2.9 M) vs DINOv2-Giant (1.1 B); flow backbone MLP-6-AdaLN (3.9 M) vs DiT-21-layers; 3-D decoder SDF-MLP-trained-via-renderer vs ShapeVAE-trained-on-millions; training data 50 synthetic shapes (2-D only) vs millions of 3-D assets. The concept is identical. The architectural equivalence is intentional — by validating each component at small scale with the same conceptual design, the scaling path is clear and each substitution is modular and independently testable.

The planned next steps are: (i) RGB rendering loss: replace silhouette-only supervision with combined silhouette + RGB, where shading gradients provide dense surface-normal information (a sphere and a cube have similar silhouettes but different shading). (ii) DINOv2 conditioning: replace the learned CNN conditioner with a frozen DINOv2-Small (21 M params, pretrained on 142 M images) for background-invariant, semantically rich features without training. (iii) Transformer velocity network: replace the MLP velocity network with a small DiT using cross-attention between latent tokens and image-conditioning tokens, enabling spatial conditioning. (iv) Compositional shapes: extend to composite shapes with spatially-structured latents (3-D grids or triplanes) instead of a single global vector. (v) Text conditioning: add a CLIP text encoder alongside DINOv2 for text-to-3-D.

9. Conclusion

We have demonstrated that a rectified-flow model can replace a CNN encoder in a neural SDF reconstruction pipeline while maintaining equivalent reconstruction quality and requiring no 3-D supervision. The key contributions: (1) end-to-end gradient flow through a complete pipeline — image → conditioner → rectified flow (8 steps) → SDF decoder → differentiable renderer → 2-D loss — validated on M4 hardware; (2) identification and resolution of the noise-scaling problem in rectified-flow training with small-norm target distributions (cosine similarity 0.08 → 0.95); (3) a two-phase training strategy — fast flow distillation, then end-to-end fine-tuning — that converges reliably on consumer hardware; (4) a complete, self-contained codebase (6 files, ~2,000 lines) running on MPS, CUDA, or CPU with synthetic data requiring no external downloads. This establishes a validated foundation for 3-D generation from 2-D supervision; the architectural pathway to production-quality image-to-3-D is clear, and each component is modular and independently upgradeable.

References

[1] Lin, C. H., Wang, C., Lucey, S. "SDF-SRN: Learning Signed Distance 3D Object Reconstruction from Static Images." NeurIPS, 2020.

[2] Park, J. J. et al. "DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation." CVPR, 2019.

[3] Esser, P., Kulal, S., Blattmann, A. et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Stable Diffusion 3)." 2024.

[4] Tencent Hunyuan3D Team. "Hunyuan3D 2.0: Scaling Diffusion Models for High-Resolution Textured 3D Asset Generation." arXiv:2501.12202, 2025.

[5] Xiang, J. et al. "TRELLIS: Structured 3D Latents for Scalable and Versatile 3D Generation." Microsoft, 2024.

[6] Oquab, M. et al. "DINOv2: Learning Robust Visual Features without Supervision." Meta AI, 2023.

[7] Peebles, W., Xie, S. "Scalable Diffusion Models with Transformers (DiT)." ICCV, 2023.

[8] Liu, X., Gong, C., Liu, Q. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." ICLR, 2023.

[9] Jain, A. "Mini SDF-SRN: Learning 3-D from Single Images." Thesis research, Apr 2026. /whitepaper/mini-sdf-srn

[10] Code & Technical Report: github.com/BOB-THE-BUILDER-in/rectified-flow-sdf