A minimal, self-contained reimplementation of the SDF-SRN concept: learn 3-D shape reconstruction from single-view images using only silhouette supervision — no 3-D ground truth. A CNN encoder compresses an image into a 128-dim latent; a shared SDF decoder reconstructs the full 3-D shape; a differentiable ray-marching renderer produces a silhouette that is compared against the ground-truth silhouette, and gradients flow all the way back — renderer → decoder → latent → encoder. Trained on 50 synthetic shapes, each seen from only one viewpoint, the network predicts correct silhouettes from 6 novel angles it never saw. ~70 minutes on an M4 iMac.
The thesis line's image-to-3-D work — the Hypernet → DeepSDF archive (Topic 41), the Flow-SDF pipeline (Topic 44) — all rests on one primitive: a differentiable renderer that turns a 3-D implicit surface into a 2-D image, so gradients from a 2-D loss can teach a network 3-D geometry. Mini SDF-SRN is that primitive, built from scratch on the smallest problem that still exercises every component: 50 synthetic shapes, single-view silhouette supervision, a CNN encoder, a shared SDF decoder, a ray-marching renderer.
The point is not a competitive reconstruction model — it is to own the 3-D-from-2-D mechanism end to end before using it as a black box downstream. Same philosophy as the from-scratch transformer (Topic 13) and the from-scratch DDPM (Topic 6): build the 100-line version on a toy, watch every gradient, then scale.
CNN encoder. Four convolutional layers (3 → 32 → 64 → 128 → 256 channels, stride 2) followed by two fully-connected layers, compressing a 64 × 64 × 3 image into a 128-dimensional latent code.
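The encoder's spatial bookkeeping can be checked with a few lines of arithmetic. This is a sketch only: the channel counts, stride 2, and the 64 × 64 × 3 input come from the description above, while the 3 × 3 kernel and padding of 1 are assumptions chosen so that each layer exactly halves the resolution.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output width of one strided convolution (standard formula).

    kernel=3 and padding=1 are assumptions; only stride 2 is stated in the text.
    """
    return (size + 2 * padding - kernel) // stride + 1


channels = [3, 32, 64, 128, 256]   # from the text
size = 64                          # 64 x 64 input image, from the text

shapes = [(channels[0], size, size)]
for c_out in channels[1:]:
    size = conv_out(size)
    shapes.append((c_out, size, size))

# Number of features handed to the first fully-connected layer.
flat_features = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

for shape in shapes:
    print(shape)
print("flattened features:", flat_features)
```

Under these assumptions the feature map shrinks 64 → 32 → 16 → 8 → 4, so the fully-connected layers would map 4096 features down to the 128-dimensional latent. A different kernel or padding changes the intermediate sizes but not the overall pattern.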
SDF decoder. A 5-layer MLP with a skip connection, following the DeepSDF pattern. It takes the latent code concatenated with a 3-D query coordinate and outputs a scalar SDF value — the latent determines which shape, the coordinate determines where it is evaluated. It is shared across all shapes.
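A forward pass through a decoder of this kind might look like the NumPy sketch below. It is illustrative rather than the repository's code: the hidden width (`HIDDEN`), the layer at which the skip connection re-injects the input (`SKIP_AT`), the ReLU activations, and the random initialisation are all assumptions. Only the 128-dim latent, the 3-D query coordinate, the five layers, and the presence of a skip connection come from the text.

```python
import numpy as np

LATENT_DIM = 128   # from the text
HIDDEN = 64        # assumption: hidden width is not stated
SKIP_AT = 2        # assumption: layer index where the input is re-concatenated


def init_decoder(rng):
    """Random weights for a 5-layer MLP whose middle layer also sees the input."""
    in_dim = LATENT_DIM + 3
    dims_in = [in_dim, HIDDEN, HIDDEN + in_dim, HIDDEN, HIDDEN]
    dims_out = [HIDDEN, HIDDEN, HIDDEN, HIDDEN, 1]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out)), np.zeros(d_out))
        for d_in, d_out in zip(dims_in, dims_out)
    ]


def decode_sdf(params, latent, points):
    """Evaluate the SDF of the shape encoded by `latent` at `points` (N, 3)."""
    n = points.shape[0]
    # The same latent is paired with every query coordinate.
    x = np.concatenate([np.broadcast_to(latent, (n, LATENT_DIM)), points], axis=1)
    h = x
    for i, (weight, bias) in enumerate(params):
        if i == SKIP_AT:
            h = np.concatenate([h, x], axis=1)   # skip connection
        h = h @ weight + bias
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)               # ReLU on hidden layers only
    return h[:, 0]                               # one scalar SDF value per point


rng = np.random.default_rng(0)
params = init_decoder(rng)
latent = rng.normal(size=LATENT_DIM)
points = rng.uniform(-1.0, 1.0, size=(10, 3))
print(decode_sdf(params, latent, points).shape)
```

Because the weights are shared, swapping `latent` is the only way to get a different shape out of `decode_sdf`, which is what lets one decoder serve all 50 shapes.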
Differentiable renderer. A fixed (non-learned) ray-marching renderer: it samples points along camera rays at fixed intervals, queries the SDF decoder at each, and computes a soft silhouette via sigmoid(−min_sdf / temperature), where min_sdf is the smallest SDF value sampled along the ray. It is the bridge between 3-D and 2-D: not a neural network, but a mathematical operation that provides continuous gradients from 2-D pixel comparisons back through the SDF values to the latent codes.
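The forward pass of such a renderer can be sketched in NumPy. To keep the example self-contained, an analytic sphere SDF stands in for the learned decoder, and the camera is orthographic with rays running along the z-axis. Both are assumptions for illustration, as are the resolution, step count, and temperature value. Only fixed-interval sampling and the sigmoid(−min_sdf / temperature) silhouette come from the text.

```python
import numpy as np


def sphere_sdf(points, radius=0.5):
    """Analytic SDF of a sphere at the origin (stand-in for the learned decoder)."""
    return np.linalg.norm(points, axis=-1) - radius


def render_silhouette(sdf_fn, res=32, n_steps=48, temperature=0.02):
    """Soft silhouette via fixed-interval ray marching.

    Assumes an orthographic camera: one ray per pixel, travelling along z
    through the cube [-1, 1]^3. All default values are illustrative.
    """
    xs = np.linspace(-1.0, 1.0, res)
    px, py = np.meshgrid(xs, xs, indexing="xy")
    zs = np.linspace(-1.0, 1.0, n_steps)        # fixed sample depths
    shape = (res, res, n_steps)
    pts = np.stack(
        [
            np.broadcast_to(px[..., None], shape),
            np.broadcast_to(py[..., None], shape),
            np.broadcast_to(zs, shape),
        ],
        axis=-1,
    )
    sdf = sdf_fn(pts.reshape(-1, 3)).reshape(shape)
    min_sdf = sdf.min(axis=-1)                  # closest approach per ray
    # sigmoid(-min_sdf / temperature): near 1 where the ray enters the
    # surface, near 0 where it misses, smooth in between.
    return 1.0 / (1.0 + np.exp(min_sdf / temperature))


sil = render_silhouette(sphere_sdf)
print(sil.shape, float(sil.mean()))
```

The temperature sets how wide the soft edge is: a smaller value gives a crisper silhouette but concentrates the gradient in a thinner band of pixels around the boundary.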
Trained on 50 synthetic shapes for 300 epochs, ~70 minutes on an M4 iMac. Each shape was seen from only one viewpoint during training. The figures below are the actual epoch-300 outputs from the repository. In each, the top row is predicted by the network and the bottom row is ground truth, across 6 novel angles the network never saw.




| Property | Value |
|---|---|
| Training data | 50 synthetic shapes (sphere, box, ellipsoid families), one view per shape |
| Supervision | 2-D silhouette only — no 3-D meshes, point clouds, or multi-view data |
| Epochs / hardware | 300 epochs · ~70 min on an M4 iMac (also runs on CUDA / CPU) |
| Novel-view test | 6 angles per shape, never seen during training |
| Silhouette loss (converged) | ~0.005 — the CNN-encoder baseline that Flow-SDF (Topic 44) later matches |
The renderer is not a network. It's the bridge.
A fixed ray-marching operation that makes 2-D gradients geometrically meaningful in 3-D.
"This pixel should be white but is black" → "the SDF along this ray should be negative somewhere" → "the latent code should change so the decoder produces a surface here" → "the encoder should map this image differently". Every component receives a meaningful learning signal derived from a simple 2-D image comparison — and the SDF decoder learns what 3-D geometry looks like from the accumulated constraints of many single-view silhouette comparisons, never seeing a 3-D mesh.
Step through training and rotate to a novel view. Pick a shape and an epoch. The left pane is the single training view the network sees; the middle pane is the network's predicted silhouette at a novel angle; the right pane is the ground-truth silhouette at that angle. In early epochs the prediction is mush; by epoch 300 it tracks the ground truth.
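The first link of the gradient chain, where a pixel that should be white but renders black pushes the SDF along its ray negative, can be checked numerically for a single ray. The sketch below assumes a binary cross-entropy silhouette loss, which the text does not specify, along with illustrative values for the temperature and learning rate. It compares the chain-rule gradient with a finite-difference estimate.

```python
import math


def soft_silhouette(min_sdf, temperature):
    """sigmoid(-min_sdf / temperature) for one ray."""
    return 1.0 / (1.0 + math.exp(min_sdf / temperature))


def bce(pred, target, eps=1e-12):
    """Binary cross-entropy for one pixel (assumed loss, not from the text)."""
    return -(target * math.log(pred + eps) + (1 - target) * math.log(1 - pred + eps))


def dloss_dmin_sdf(min_sdf, target, temperature):
    """Chain rule: dL/ds = (s - t) / (s (1 - s)) and ds/dm = -s (1 - s) / T,
    so dL/dm = (t - s) / T."""
    s = soft_silhouette(min_sdf, temperature)
    return (target - s) / temperature


T = 0.05        # illustrative temperature
m = 0.1         # ray misses the surface: min SDF is positive, pixel renders dark
target = 1.0    # ground truth says this pixel is inside the silhouette

grad = dloss_dmin_sdf(m, target, T)

# Central finite difference as an independent check of the analytic gradient.
h = 1e-6
numeric = (bce(soft_silhouette(m + h, T), target)
           - bce(soft_silhouette(m - h, T), target)) / (2 * h)

lr = 0.001      # illustrative step size
m_after = m - lr * grad

print("analytic:", grad, "numeric:", numeric)
print("min SDF before:", m, "after one step:", m_after)
```

The gradient is positive, so a descent step lowers the minimum SDF on that ray, moving it toward a negative value, that is, toward a surface crossing. In the full model the same signal continues through the decoder weights and the latent code into the encoder.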
White paper · minimal SDF-SRN reimplementation · the differentiable-renderer-as-bridge principle · single-view silhouette supervision · novel-view results