Mini SDF-SRN — White Paper

Mini SDF-SRN: A Minimal Self-Contained Reimplementation of Single-View 3-D Reconstruction from Silhouette Supervision, and the Differentiable-Renderer-as-Bridge Primitive

Aaditya Jain

ad_jain@icloud.com · orcid.org/0009-0005-5534-5641

Neural SDF · Single-View Reconstruction · Thesis-Line Foundation Build

Submitted: April 2026
Subject: cs.CV · cs.GR
Keywords: SDF-SRN, differentiable rendering, single-view reconstruction, silhouette supervision, signed distance fields, 3-D from 2-D

Abstract

We present Mini SDF-SRN, a minimal self-contained reimplementation of the SDF-SRN concept (Lin et al., NeurIPS 2020): learning single-view 3-D shape reconstruction using only 2-D silhouette supervision, with no 3-D ground truth at any point. The pipeline has three components: a CNN encoder compressing a 64 × 64 image into a 128-dimensional latent, a shared 5-layer SDF decoder with a skip connection, and a fixed (non-learned) differentiable ray-marching renderer that turns the decoded SDF into a soft silhouette. It is trained end-to-end by comparing the rendered silhouette against the ground-truth silhouette and backpropagating through the entire chain. Trained on 50 synthetic shapes, each seen from a single viewpoint, for 300 epochs (~70 minutes on an M4 iMac), the network predicts silhouettes from 6 novel angles it never saw that closely match the ground truth, converging to a silhouette loss of ~0.005. The contribution is not a competitive reconstruction model; it is the from-scratch build of the 3-D-from-2-D primitive that the rest of the thesis-line image-to-3-D work depends on. The differentiable renderer is the bridge that makes 2-D-pixel gradients geometrically meaningful in 3-D, and owning it end-to-end before scaling is the same architecture-literacy investment made for transformers and DDPMs earlier in the thesis line. Mini SDF-SRN is the direct foundation of Flow-SDF (the rectified-flow successor), which keeps the SDF decoder and renderer unchanged and swaps only the CNN encoder.

Keywords: SDF-SRN, differentiable rendering, single-view reconstruction, silhouette supervision, foundation build.

1. Introduction

Reconstructing 3-D shape from a single image traditionally requires paired image–3-D datasets, which are expensive to collect at scale. SDF-SRN (Lin et al., NeurIPS 2020) demonstrated that 3-D shapes can be learned from single-view images using only 2-D silhouette supervision, eliminating the need for 3-D ground truth entirely. The thesis line's image-to-3-D work — the Hypernet → DeepSDF archive, the Flow-SDF pipeline — all rests on this one primitive: a differentiable renderer that turns a 3-D implicit surface into a 2-D image, so gradients from a 2-D loss can teach a network 3-D geometry.

Mini SDF-SRN builds that primitive from scratch on the smallest problem that still exercises every component: 50 synthetic shapes, single-view silhouette supervision, a CNN encoder, a shared SDF decoder, a ray-marching renderer. The goal is not a competitive reconstruction model — it is to own the 3-D-from-2-D mechanism end to end before using it as a black box downstream, the same architecture-literacy philosophy applied to the from-scratch transformer and the from-scratch DDPM earlier in the thesis line.

2. Architecture

2.1 CNN encoder

The encoder consists of four stride-2 convolutional layers (3 → 32 → 64 → 128 → 256 channels) followed by two fully-connected layers, and compresses a 64 × 64 × 3 image into a 128-dimensional latent code. The encoder is deterministic: one image maps to one code in a single forward pass. (This determinism is precisely the property Flow-SDF later replaces, to gain generative capacity.)
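The layer description above can be checked with a little shape arithmetic. The sketch below is only an illustration: the channel widths, stride, and input size come from the text, but the kernel size (3) and padding (1) are assumptions, since the text does not state them, and `conv_out` is a hypothetical helper name.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial size after one conv layer, using the standard floor formula."""
    return (size + 2 * padding - kernel) // stride + 1

channels = [3, 32, 64, 128, 256]   # channel widths, from the text
size = 64                          # 64 x 64 input, from the text

shapes = [(channels[0], size, size)]
for c_out in channels[1:]:
    size = conv_out(size)          # kernel=3 and padding=1 are assumptions
    shapes.append((c_out, size, size))

# Number of features the first fully-connected layer would receive.
flat_features = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
latent_dim = 128                   # from the text

for shape in shapes:
    print(shape)
print("flattened:", flat_features, "-> latent:", latent_dim)
```

Under these assumptions the feature map shrinks 64 → 32 → 16 → 8 → 4, so the fully-connected layers would map 256 × 4 × 4 = 4096 features down to the 128-dimensional code. A different kernel or padding choice would change the intermediate sizes.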

2.2 SDF decoder

The decoder is a 5-layer MLP with a skip connection, following the DeepSDF pattern (Park et al., CVPR 2019). It takes the latent code concatenated with a 3-D query coordinate and outputs a scalar SDF value: the latent determines which shape is decoded, and the coordinate determines where it is evaluated. The decoder is shared across all 50 shapes; the latent code is the only per-shape state. Crucially, this decoder is never trained on 3-D data. It learns to produce 3-D geometry purely through 2-D silhouette supervision via the differentiable renderer.
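To make the data flow concrete, here is a dependency-free sketch of a 5-layer skip-MLP forward pass. It is not the repository's implementation: the hidden width (64), the position of the skip connection (before the third layer), the ReLU activations, and all function names are assumptions chosen for illustration. Only the 128-dimensional latent, the 3-D query coordinate, the five layers, and the scalar output come from the text.

```python
import math
import random

LATENT_DIM, COORD_DIM, HIDDEN = 128, 3, 64   # HIDDEN is an assumed width
IN_DIM = LATENT_DIM + COORD_DIM              # latent code + (x, y, z)
SKIP_AT = 2                                  # assumed: re-inject the input before layer index 2

def init_linear(rng, n_in, n_out):
    """Random weights and zero biases for one fully-connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(layer, x):
    """Apply y = W x + b using plain Python lists."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def build_decoder(seed=0):
    """Five linear layers; the third one also receives the original input."""
    rng = random.Random(seed)
    in_dims = [IN_DIM, HIDDEN, HIDDEN + IN_DIM, HIDDEN, HIDDEN]
    out_dims = [HIDDEN, HIDDEN, HIDDEN, HIDDEN, 1]
    return [init_linear(rng, i, o) for i, o in zip(in_dims, out_dims)]

def decode_sdf(layers, latent, point):
    """Return one scalar SDF value for one (latent, point) pair."""
    x_in = list(latent) + list(point)
    h = x_in
    for idx, layer in enumerate(layers):
        if idx == SKIP_AT:
            h = h + x_in                       # skip connection via concatenation
        h = linear(layer, h)
        if idx < len(layers) - 1:
            h = [max(0.0, v) for v in h]       # ReLU on hidden layers only
    return h[0]

rng = random.Random(1)
layers = build_decoder()
latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]   # stand-in for an encoder output
sdf_value = decode_sdf(layers, latent, (0.1, -0.2, 0.3))
print(sdf_value)
```

The weights here are random, so the printed value carries no geometric meaning. The point is the interface: the same `layers` serve every shape, and only `latent` changes from shape to shape.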

2.3 Differentiable renderer

The renderer is a fixed, non-learned ray marcher. It samples points along camera rays at fixed intervals, queries the SDF decoder at each point, and computes a soft silhouette as sigmoid(−min_sdf / temperature), where min_sdf is the smallest SDF value among the samples on that ray. The result is a soft indicator of whether any sampled point along the ray was inside the surface. The renderer is not a neural network; it is a fixed mathematical operation, and its role is to carry continuous gradients from 2-D pixel comparisons back through the SDF values to the decoder weights, the latent codes, and ultimately the encoder. Without this component there would be no way to train the 3-D representation from 2-D images at all.
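The per-ray computation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repository's renderer: an analytic sphere SDF stands in for the learned decoder, and the camera (orthographic rays starting on the plane z = 1), the near/far range, the sample count, and the temperature of 0.05 are all hypothetical values. Only the fixed-interval sampling and the sigmoid(−min_sdf / temperature) formula come from the text.

```python
import math

def sphere_sdf(p, radius=0.5):
    """Analytic SDF of a centred sphere; a stand-in for the learned decoder."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius

def soft_silhouette(sdf, origin, direction, near=0.0, far=2.0,
                    n_samples=64, temperature=0.05):
    """March one ray at fixed intervals and return sigmoid(-min_sdf / temperature)."""
    step = (far - near) / (n_samples - 1)
    min_sdf = float("inf")
    for i in range(n_samples):
        t = near + i * step
        point = tuple(o + t * d for o, d in zip(origin, direction))
        min_sdf = min(min_sdf, sdf(point))
    # sigmoid(-m / T) written as 1 / (1 + exp(m / T))
    return 1.0 / (1.0 + math.exp(min_sdf / temperature))

# One ray through the sphere centre and one that passes well outside it.
hit = soft_silhouette(sphere_sdf, origin=(0.0, 0.0, 1.0), direction=(0.0, 0.0, -1.0))
miss = soft_silhouette(sphere_sdf, origin=(0.9, 0.0, 1.0), direction=(0.0, 0.0, -1.0))
print(hit, miss)
```

The ray through the sphere yields a value close to 1 and the ray that misses yields a value close to 0. Rays that graze the surface land near 0.5, which is where the temperature controls how sharp the silhouette edge is.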

3. Why It Works — The Renderer Is The Bridge

The differentiable renderer creates a continuous gradient path from 2-D pixels to 3-D geometry. Because the renderer is a fixed mathematical operation rather than a learned component, the gradients it passes back are geometrically meaningful, and the chain of reasoning is concrete:

"This pixel should be white but is black" → "the SDF value along this ray should be negative somewhere, indicating a surface" → "the latent code should change so the decoder produces a surface here" → "the encoder should map this image to a different code." Every component receives a meaningful learning signal derived from a simple 2-D image comparison.

The SDF decoder never sees a 3-D mesh. It learns what 3-D geometry looks like from the accumulated constraints of many single-view silhouette comparisons across the dataset. No single training example provides multi-view supervision, but examples seen from different viewpoints collectively teach the decoder about 3-D structure. A single silhouette constrains a shape only to the cone of rays that pass through it, leaving depth along each ray undetermined; the shared decoder resolves that ambiguity statistically across the dataset.
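The gradient chain described in this section can be written out as a chain rule for one pixel. The notation is introduced here for illustration and is not taken from the implementation: f_θ is the SDF decoder, z the latent code, x_i the samples on the pixel's ray, τ the temperature, and s* the ground-truth silhouette value. A squared-error pixel loss is assumed, since the text does not state the loss function.

```latex
\begin{aligned}
m &= \min_i f_\theta(\mathbf{z}, \mathbf{x}_i), \qquad
\hat{s} = \sigma\!\left(-\frac{m}{\tau}\right), \qquad
\mathcal{L} = (\hat{s} - s^{*})^2, \\
\frac{\partial \mathcal{L}}{\partial \mathbf{z}}
  &= 2(\hat{s} - s^{*}) \cdot \hat{s}(1 - \hat{s}) \cdot \left(-\frac{1}{\tau}\right)
     \cdot \frac{\partial f_\theta(\mathbf{z}, \mathbf{x}_{i^{*}})}{\partial \mathbf{z}},
  \qquad i^{*} = \arg\min_i f_\theta(\mathbf{z}, \mathbf{x}_i).
\end{aligned}
```

When a pixel should be white (s* = 1) but is rendered darker, the first factor is negative and the third is negative, so the loss gradient with respect to the SDF value at sample i* is positive and a gradient-descent step lowers that value, pushing it toward the negative (inside) side. Because the minimum selects a single sample, only that sample on the ray receives gradient. The same product continues one step further, through ∂z/∂φ, to reach the encoder parameters φ.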

4. Results

The model was trained on 50 synthetic shapes for 300 epochs, which took ~70 minutes on an M4 iMac (the implementation also runs on CUDA or CPU). Each shape was seen from a single viewpoint during training; the network was then tested on 6 novel angles per shape that it never saw.

Table 1 — Mini SDF-SRN configuration and results.
Property	Value
Training data	50 synthetic shapes (sphere, box, ellipsoid families), one view per shape
Supervision	2-D silhouette only — no 3-D meshes, point clouds, or multi-view data
Encoder / decoder	CNN (4 conv + 2 FC) → 128-dim latent → 5-layer skip-MLP SDF decoder
Training	300 epochs, ~70 min on M4 iMac
Novel-view test	6 angles per shape, never seen in training
Converged silhouette loss	~0.005 — the CNN-encoder baseline Flow-SDF later matches with a rectified-flow encoder

The qualitative result, visible in the repository's epoch-300 figures: the sphere (smooth — easiest) tracks the ground-truth silhouette closely across all 6 novel angles; the box (sharp corners — hardest) recovers the gross silhouette with corners rounding off slightly; the ellipsoid (intermediate) recovers the anisotropic silhouette at every angle. The training-convergence triptych — ground truth / predicted / difference — shows the difference shrinking to a thin boundary band by epoch 300. The network has learned correct 3-D structure from single-view silhouette supervision alone.
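To put a silhouette loss of ~0.005 in perspective, the sketch below computes a mean squared error between two silhouettes. This is an interpretation aid under assumptions: the text does not say which loss function is used or at what resolution the silhouette is rendered, so the mean-squared-error form, the 64 × 64 size, and the name `silhouette_mse` are all hypothetical.

```python
def silhouette_mse(pred, target):
    """Mean squared error between two flattened silhouettes with values in [0, 1]."""
    if len(pred) != len(target):
        raise ValueError("silhouettes must have the same number of pixels")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

n_pixels = 64 * 64                                   # assumed render resolution
target = [1.0 if i < 1000 else 0.0 for i in range(n_pixels)]

# Copy the target and flip 20 pixels to the wrong value.
pred = list(target)
for i in range(20):
    pred[i] = 1.0 - pred[i]

loss = silhouette_mse(pred, target)
print(loss)   # 20 / 4096 = 0.0048828125
```

Under these assumptions, a loss near 0.005 is what about 20 completely wrong pixels out of 4096 would produce. With a soft renderer the same error is more plausibly spread as small residuals along the boundary, which is consistent with the thin boundary band in the difference images.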

5. Limitations and the Path to Flow-SDF

Two honest limitations. (i) The CNN encoder is deterministic and is a dead end for generation. It maps one image to one code in a single forward pass — there is no stochasticity, no capacity for conditioning on text or partial views, no iterative refinement. (ii) Synthetic shapes only. The 50-shape synthetic dataset (sphere/box/ellipsoid families) validates the mechanism but says nothing about real-image generalisation; the box's rounded corners hint that sharp-feature recovery from silhouette-only supervision is intrinsically limited — a silhouette constrains the visual hull but not within-hull concavities.
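The ambiguity behind limitation (ii) shows up even in a toy case: two shapes that differ only along the viewing direction produce the same silhouette from that view. The sketch below uses analytic ellipsoids and hard orthographic silhouettes. The shapes, resolution, and function names are illustrative assumptions, not the training setup.

```python
def inside_ellipsoid(p, semi_axes):
    """Implicit occupancy test: True when p lies strictly inside the ellipsoid."""
    return sum((c / a) ** 2 for c, a in zip(p, semi_axes)) - 1.0 < 0.0

def silhouette(semi_axes, view_axis, res=17, n_samples=33):
    """Hard orthographic silhouette: a pixel is on if any sample on its ray is inside."""
    coords = [-1.0 + 2.0 * i / (res - 1) for i in range(res)]
    depths = [-1.0 + 2.0 * k / (n_samples - 1) for k in range(n_samples)]
    image_axes = [a for a in range(3) if a != view_axis]
    mask = []
    for u in coords:
        row = []
        for v in coords:
            hit = False
            for d in depths:
                p = [0.0, 0.0, 0.0]
                p[image_axes[0]], p[image_axes[1]], p[view_axis] = u, v, d
                hit = hit or inside_ellipsoid(p, semi_axes)
            row.append(hit)
        mask.append(row)
    return mask

sphere = (0.5, 0.5, 0.5)
stretched = (0.5, 0.5, 0.9)   # elongated along z only

# Looking along z, the elongation is invisible; looking along x, it is not.
same_from_front = silhouette(sphere, view_axis=2) == silhouette(stretched, view_axis=2)
same_from_side = silhouette(sphere, view_axis=0) == silhouette(stretched, view_axis=0)
print(same_from_front, same_from_side)   # prints: True False
```

A single front view cannot tell these two shapes apart, so whatever depth the network predicts for an individual shape has to come from regularities the shared decoder picks up across the dataset.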

Both limitations point at the same successor. Flow-SDF (the rectified-flow topic) keeps the SDF decoder and the differentiable renderer unchanged — they are validated here — and replaces only the CNN encoder with a rectified-flow transformer that generates the latent code through iterative denoising. That single substitution converts a reconstruction-only model into one with generative capacity, and Flow-SDF demonstrates it matches this baseline's silhouette loss (0.0053 vs 0.0050). Mini SDF-SRN is, precisely, Stage 1 of Flow-SDF.
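The "swap only the encoder" idea can be sketched as an interface. Everything below is a toy under loud assumptions: `cnn_encode`, `flow_encode`, and `reconstruct` are hypothetical names, the image is reduced to a single number, and the velocity field is a closed-form ideal rather than a learned transformer. It is not Flow-SDF's actual architecture; it only illustrates that a single-pass encoder and an iterative one can feed the same downstream decoder and renderer.

```python
import random

LATENT_DIM = 128

def cnn_encode(image_feature):
    """Stand-in for the deterministic encoder: one input, one code, one pass."""
    return [image_feature] * LATENT_DIM

def flow_encode(image_feature, n_steps=8, seed=0):
    """Stand-in for a rectified-flow encoder: start from noise, Euler-integrate dz/dt = v."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    target = cnn_encode(image_feature)   # toy only: lets the ideal velocity be written in closed form
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        # Ideal straight-line velocity toward the target code (a learned network in practice).
        velocity = [(g - zi) / (1.0 - t) for g, zi in zip(target, z)]
        z = [zi + dt * vi for zi, vi in zip(z, velocity)]
    return z

def reconstruct(encode, image_feature, decode_and_render):
    """The decoder and renderer are passed through untouched; only `encode` varies."""
    return decode_and_render(encode(image_feature))

def toy_render(latent):
    """Placeholder for decoder + renderer: any function of the latent code."""
    return sum(latent) / len(latent)

baseline = reconstruct(cnn_encode, 0.3, toy_render)
generative = reconstruct(flow_encode, 0.3, toy_render)
print(baseline, generative)
```

Because the toy velocity field is exact, both paths end at essentially the same code here. In a real system the velocity would be learned and conditioned on the image, and different noise seeds could yield different plausible codes.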

6. Conclusion

Mini SDF-SRN is a minimal, self-contained, from-scratch build of the 3-D-from-2-D primitive: CNN encoder, shared SDF decoder, fixed differentiable ray-marching renderer, trained end-to-end on single-view silhouette supervision with no 3-D ground truth. It learns correct 3-D structure from 50 synthetic shapes and predicts novel views it never saw. The contribution is the validated primitive — owning the differentiable-renderer-as-bridge mechanism end to end is the prerequisite for the thesis-line image-to-3-D work that builds directly on it.

References

[1] Lin, C. H., Wang, C., Lucey, S. "SDF-SRN: Learning Signed Distance 3D Object Reconstruction from Static Images." NeurIPS, 2020.

[2] Park, J. J. et al. "DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation." CVPR, 2019.

[3] Mescheder, L. et al. "Occupancy Networks: Learning 3D Reconstruction in Function Space." CVPR, 2019.

[4] Jain, A. "Flow-SDF: Rectified Flow for Neural SDF Reconstruction." Thesis research, Apr 2026. /whitepaper/rectified-flow-sdf

[5] Jain, A. "Hypernet → DeepSDF: Image-to-3-D Research Archive." Thesis research, May 2026. /whitepaper/hypernet-deepsdf

[6] Jain, A. "SDF Research and Experiments." Thesis research, Feb 2025. /whitepaper/sdf-research

[7] Code: github.com/BOB-THE-BUILDER-in/mini-sdf-srn