Topic 43 Apr 2026 SDF-SRN · Differentiable Rendering · 3-D from 2-D

Mini SDF-SRN —
Learning 3-D From Single Images.

A minimal, self-contained reimplementation of the SDF-SRN concept: learn 3-D shape reconstruction from single-view images using only silhouette supervision — no 3-D ground truth. A CNN encoder compresses an image into a 128-dim latent; a shared SDF decoder reconstructs the full 3-D shape; a differentiable ray-marching renderer produces a silhouette that is compared against the ground-truth silhouette, and gradients flow all the way back — renderer → decoder → latent → encoder. Trained on 50 synthetic shapes, each seen from only one viewpoint, the network predicts correct silhouettes from 6 novel angles it never saw. ~70 minutes on an M4 iMac.

GitHub: mini-sdf-srn · Paper: SDF-SRN (Lin et al., NeurIPS 2020)

00 — Motivation

Build the 3-D-from-2-D primitive from scratch, before scaling it.

The thesis line's image-to-3-D work — the Hypernet → DeepSDF archive (Topic 41), the Flow-SDF pipeline (Topic 44) — all rests on one primitive: a differentiable renderer that turns a 3-D implicit surface into a 2-D image, so gradients from a 2-D loss can teach a network 3-D geometry. Mini SDF-SRN is that primitive, built from scratch on the smallest problem that still exercises every component: 50 synthetic shapes, single-view silhouette supervision, a CNN encoder, a shared SDF decoder, a ray-marching renderer.

The point is not a competitive reconstruction model — it is to own the 3-D-from-2-D mechanism end to end before using it as a black box downstream. Same philosophy as the from-scratch transformer (Topic 13) and the from-scratch DDPM (Topic 6): build the 100-line version on a toy, watch every gradient, then scale.

What it informs

Mini SDF-SRN is the direct foundation for Flow-SDF (Topic 44), which keeps the SDF decoder and the differentiable renderer unchanged and swaps only the CNN encoder for a rectified-flow transformer. The "no 3-D supervision" property — the whole point — is what makes the thesis line's image-to-3-D work viable in the Apple Maps data regime, where images are abundant but 3-D meshes are scarce.

Pipeline

Image → CNN encoder → latent → SDF decoder → renderer → silhouette loss.

01 — Architecture

CNN encoder, shared SDF decoder, fixed differentiable renderer.

CNN encoder. Four convolutional layers (3 → 32 → 64 → 128 → 256 channels, stride 2) followed by two fully-connected layers, compressing a 64 × 64 × 3 image into a 128-dimensional latent code.
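As a quick check on the encoder's shapes, the sketch below traces the feature-map size through the four stride-2 convolutions. The text gives only the channel counts and the stride, so the 3×3 kernel and padding of 1 are assumptions, and `conv_out` and `encoder_shapes` are hypothetical helper names rather than anything from the repository.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output spatial size of a 2-D convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(size: int = 64, channels=(3, 32, 64, 128, 256)):
    """Return (channels, height, width) for the input and after each conv layer."""
    shapes = [(channels[0], size, size)]
    for c_out in channels[1:]:
        size = conv_out(size)  # assumed kernel 3, stride 2, padding 1
        shapes.append((c_out, size, size))
    return shapes


shapes = encoder_shapes()
flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
for s in shapes:
    print(s)
print("flattened features into the FC layers:", flat)
```

Under these assumptions each convolution halves the resolution (64 → 32 → 16 → 8 → 4), so the two fully-connected layers would map 256 × 4 × 4 = 4096 features down to the 128-dimensional latent.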

SDF decoder. A 5-layer MLP with a skip connection, following the DeepSDF pattern. It takes the latent code concatenated with a 3-D query coordinate and outputs a scalar SDF value — the latent determines which shape, the coordinate determines where it is evaluated. It is shared across all shapes.
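A minimal NumPy sketch of a decoder with this shape is given below. Only the 128-dim latent, the 3-D query coordinate, the five layers, and the presence of a skip connection come from the text. The hidden width of 256, the skip position (layer 2), the ReLU activations, and the random initialisation are all assumptions for illustration, and `init_decoder` and `decode_sdf` are hypothetical names.

```python
import numpy as np

LATENT_DIM, HIDDEN = 128, 256  # latent size from the text; hidden width is assumed


def init_decoder(rng: np.random.Generator, skip_at: int = 2):
    """Random weights for a 5-layer MLP; layer `skip_at` also receives the raw input."""
    in_dim = LATENT_DIM + 3
    dims_in = [in_dim, HIDDEN, HIDDEN, HIDDEN, HIDDEN]
    dims_in[skip_at] += in_dim  # the skip connection widens this layer's input
    dims_out = [HIDDEN, HIDDEN, HIDDEN, HIDDEN, 1]
    return [(rng.normal(0, np.sqrt(2 / i), (i, o)), np.zeros(o))
            for i, o in zip(dims_in, dims_out)]


def decode_sdf(params, latent, points, skip_at: int = 2):
    """Evaluate the SDF of one shape (latent: [128]) at N query points ([N, 3])."""
    z = np.broadcast_to(latent, (points.shape[0], latent.shape[0]))
    x0 = np.concatenate([z, points], axis=1)        # [N, 131]
    h = x0
    for i, (w, b) in enumerate(params):
        if i == skip_at:
            h = np.concatenate([h, x0], axis=1)     # re-inject latent + coordinate
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)                  # ReLU on hidden layers only
    return h[:, 0]                                  # one signed distance per point


rng = np.random.default_rng(0)
params = init_decoder(rng)
latent = rng.normal(size=LATENT_DIM)
points = rng.uniform(-1, 1, size=(5, 3))
sdf = decode_sdf(params, latent, points)
print(sdf.shape)
```

The same `params` serve every shape; only `latent` changes between shapes, which is what "shared" means here.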

Differentiable renderer. A fixed (non-learned) ray-marching renderer: it samples points along camera rays at fixed intervals, queries the SDF decoder at each, takes the minimum SDF value along each ray, and computes a soft silhouette via sigmoid(−min_sdf / temperature). It is the bridge between 3-D and 2-D: not a neural network, but a mathematical operation that carries continuous gradients from 2-D pixel comparisons back through the SDF values to the latent codes.
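The rendering step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the repository's renderer: an analytic sphere SDF stands in for the learned decoder, the camera is orthographic and looks down the −z axis, and the image size, step count, near/far bounds, and temperature are arbitrary choices. The names `sphere_sdf` and `soft_silhouette` are hypothetical.

```python
import math


def sphere_sdf(p, radius=0.5):
    """Analytic stand-in for the learned decoder: negative inside, positive outside."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius


def soft_silhouette(sdf, size=16, n_steps=32, near=1.0, far=3.0, temperature=0.05):
    """Render a size x size soft mask; camera sits at z = 2 and looks down -z (assumed)."""
    image = []
    for row in range(size):
        y = 1.0 - 2.0 * (row + 0.5) / size  # pixel centre in [-1, 1]
        line = []
        for col in range(size):
            x = 2.0 * (col + 0.5) / size - 1.0
            # March along the ray at fixed intervals and keep the smallest SDF value.
            min_sdf = min(
                sdf((x, y, 2.0 - (near + (far - near) * k / (n_steps - 1))))
                for k in range(n_steps)
            )
            # sigmoid(-min_sdf / temperature): near 1 if the ray enters the shape,
            # near 0 if it misses.
            line.append(1.0 / (1.0 + math.exp(min_sdf / temperature)))
        image.append(line)
    return image


mask = soft_silhouette(sphere_sdf)
print(round(mask[8][8], 3), round(mask[0][0], 3))  # centre pixel vs corner pixel
```

Because the mask is a smooth function of the SDF values, a small change in the surface changes pixel values continuously near the boundary, which is what lets a 2-D loss produce usable gradients. A lower temperature sharpens the edge but shrinks the band of pixels that carry gradient.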

Image (64×64×3)
    → CNN encoder (4 conv layers + FC)
    → latent code (128-dim)
    → SDF decoder (5-layer MLP, skip connection)
        queried at ray-marched 3D coordinates
    → SDF values → surface detection → soft silhouette
    → compare with GT silhouette → backpropagate through everything

02 — Results

Trained on 50 shapes, one view each. Predicts 6 novel angles.

Trained on 50 synthetic shapes for 300 epochs, ~70 minutes on an M4 iMac. Each shape was seen from only one viewpoint during training. The figures below are the actual epoch-300 outputs from the repository. In each, the top row is the network's prediction and the bottom row is ground truth, across 6 novel angles the network never saw.

Mini SDF-SRN — sphere novel views at epoch 300

Sphere · epoch 300. Training view (red shaded) + 6 novel angles. Pred row vs GT row — the smooth shape is the easiest case and tracks closely.

Mini SDF-SRN — box novel views at epoch 300

Box · epoch 300. The hardest case — sharp corners. The network recovers the gross silhouette across angles; corners round off slightly.

Mini SDF-SRN — ellipsoid novel views at epoch 300

Ellipsoid · epoch 300. Intermediate difficulty — the anisotropic silhouette is recovered at every novel angle.

Mini SDF-SRN — training convergence at epoch 300

Training convergence · epoch 300. Ground-truth silhouette / predicted silhouette / difference map. By epoch 300 the difference is a thin boundary band — the network has learned correct 3-D structure from single-view silhouette supervision alone.

Property	Value
Training data	50 synthetic shapes (sphere, box, ellipsoid families), one view per shape
Supervision	2-D silhouette only — no 3-D meshes, point clouds, or multi-view data
Epochs / hardware	300 epochs · ~70 min on an M4 iMac (also runs on CUDA / CPU)
Novel-view test	6 angles per shape, never seen during training
Silhouette loss (converged)	~0.005 — the CNN-encoder baseline that Flow-SDF (Topic 44) later matches

Core Insight

The renderer is not a network. It's the bridge.
A fixed ray-marching operation that makes 2-D gradients geometrically meaningful in 3-D.

"This pixel should be white but is black" → "the SDF along this ray should be negative somewhere" → "the latent code should change so the decoder produces a surface here" → "the encoder should map this image differently". Every component receives a meaningful learning signal derived from a simple 2-D image comparison — and the SDF decoder learns what 3-D geometry looks like from the accumulated constraints of many single-view silhouette comparisons, never seeing a 3-D mesh.
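The first link of that chain can be checked numerically on a toy case. In the sketch below a single radius parameter stands in for the latent code, the shape is a centred sphere, and the per-pixel loss is a squared error; all three are assumptions for illustration (the text does not name the loss function), and the function names are hypothetical.

```python
import math


def min_sdf_sphere(ray_x: float, radius: float) -> float:
    """Smallest SDF seen by a ray parallel to the z-axis at offset ray_x (centred sphere)."""
    return abs(ray_x) - radius


def pixel_loss(radius: float, ray_x: float, target: float, temperature: float = 0.05) -> float:
    """Squared error (assumed loss) between the soft silhouette value and the GT pixel."""
    pred = 1.0 / (1.0 + math.exp(min_sdf_sphere(ray_x, radius) / temperature))
    return (pred - target) ** 2


def grad_radius(radius: float, ray_x: float, target: float, eps: float = 1e-5) -> float:
    """Central finite difference of the pixel loss with respect to the radius."""
    up = pixel_loss(radius + eps, ray_x, target)
    down = pixel_loss(radius - eps, ray_x, target)
    return (up - down) / (2 * eps)


# Ray at offset 0.6 misses the radius-0.5 sphere, but the GT pixel is inside the shape.
g_should_hit = grad_radius(0.5, ray_x=0.6, target=1.0)
# Ray at offset 0.4 hits the sphere, but the GT pixel is outside the shape.
g_should_miss = grad_radius(0.5, ray_x=0.4, target=0.0)
print(g_should_hit, g_should_miss)
```

A negative gradient means a descent step grows the sphere toward the ray, and a positive one shrinks it away from the ray. In the full model the same signal passes through the decoder's weights and the latent code instead of a single radius.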

Interactive Demo · Live

Step through training and rotate to a novel view. Pick a shape and an epoch. The left pane is the single training view the network sees; the middle pane is the network's predicted silhouette at a novel angle; the right pane is the ground-truth silhouette at that angle. In early epochs the prediction is an unstructured blob; by epoch 300 it tracks the ground truth.

