Aditya Jain / Apple Maps · 3D Reconstruction
Topic 43 · Apr 2026 · SDF-SRN · Differentiable Rendering · 3-D from 2-D

Mini SDF-SRN: Learning 3-D From Single Images.

A minimal, self-contained reimplementation of the SDF-SRN concept: learn 3-D shape reconstruction from single-view images using only silhouette supervision, with no 3-D ground truth. A CNN encoder compresses an image into a 128-dim latent; a shared SDF decoder reconstructs the full 3-D shape from that latent; a differentiable ray-marching renderer produces a silhouette that is compared against the ground-truth silhouette, and gradients flow all the way back (renderer → decoder → latent → encoder). Trained on 50 synthetic shapes, each seen from only one viewpoint, the network predicts silhouettes that closely match ground truth at 6 novel angles it never saw. Training takes ~70 minutes on an M4 iMac.

00 — Motivation

Build the 3-D-from-2-D primitive from scratch, before scaling it.

The thesis line's image-to-3-D work — the Hypernet → DeepSDF archive (Topic 41), the Flow-SDF pipeline (Topic 44) — all rests on one primitive: a differentiable renderer that turns a 3-D implicit surface into a 2-D image, so gradients from a 2-D loss can teach a network 3-D geometry. Mini SDF-SRN is that primitive, built from scratch on the smallest problem that still exercises every component: 50 synthetic shapes, single-view silhouette supervision, a CNN encoder, a shared SDF decoder, a ray-marching renderer.

The point is not a competitive reconstruction model — it is to own the 3-D-from-2-D mechanism end to end before using it as a black box downstream. Same philosophy as the from-scratch transformer (Topic 13) and the from-scratch DDPM (Topic 6): build the 100-line version on a toy, watch every gradient, then scale.

What it informs
Mini SDF-SRN is the direct foundation for Flow-SDF (Topic 44), which keeps the SDF decoder and the differentiable renderer unchanged and swaps only the CNN encoder for a rectified-flow transformer. The "no 3-D supervision" property — the whole point — is what makes the thesis line's image-to-3-D work viable in the Apple Maps data regime, where images are abundant but 3-D meshes are scarce.
Pipeline

Image → CNN encoder → latent → SDF decoder → renderer → silhouette loss.

Pipeline diagram: image (64² × 3, single view) → CNN encoder (4 conv + FC) → latent z (128-dim) → SDF decoder (5-layer MLP, skip) → differentiable renderer (ray march → silhouette) → silhouette loss (vs GT silhouette).
Gradients flow backward through the entire chain: renderer → decoder → latent → encoder.
No 3-D ground truth at any point. The renderer is a fixed mathematical operation, not a learned network.
01 — Architecture

CNN encoder, shared SDF decoder, fixed differentiable renderer.

CNN encoder. Four convolutional layers (3 → 32 → 64 → 128 → 256 channels, stride 2) followed by two fully-connected layers, compressing a 64 × 64 × 3 image into a 128-dimensional latent code.
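
The layer list above fixes the channel counts and the stride, but not the kernel size or padding. As a rough check of the shapes, the sketch below assumes 3 × 3 kernels with padding 1 (both assumptions, not taken from the repository) and walks the 64 × 64 input through the four stride-2 convolutions.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after one conv layer (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

channels = [3, 32, 64, 128, 256]   # channel progression from the text
size = 64                          # input image is 64 x 64 x 3
shapes = [(channels[0], size, size)]
for c_out in channels[1:]:
    size = conv_out(size)          # kernel=3 and padding=1 are assumed values
    shapes.append((c_out, size, size))

# Number of features handed to the fully-connected layers after flattening.
flat_features = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
for shape in shapes:
    print(shape)
print("flattened features into the FC layers:", flat_features)
```

Under these assumptions each convolution halves the resolution (64 → 32 → 16 → 8 → 4), so the fully-connected layers would map a 256 × 4 × 4 = 4096-feature vector down to the 128-dim latent. A different kernel or padding choice would change the intermediate sizes.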

SDF decoder. A 5-layer MLP with a skip connection, following the DeepSDF pattern. It takes the latent code concatenated with a 3-D query coordinate and outputs a scalar SDF value — the latent determines which shape, the coordinate determines where it is evaluated. It is shared across all shapes.
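
To make the "latent picks the shape, coordinate picks the location" split concrete, here is a dependency-free sketch of a 5-layer decoder with a DeepSDF-style skip connection. The hidden width (64), the ReLU activation, the position of the skip, and the random untrained weights are all illustrative assumptions; only the 128-dim latent, the 3-D query point, the five layers, and the scalar output come from the description above.

```python
import math
import random

LATENT_DIM, HIDDEN = 128, 64      # HIDDEN is an assumed width, not from the text
IN_DIM = LATENT_DIM + 3           # latent code concatenated with (x, y, z)
SKIP_AT = 2                       # assumed: re-inject the input before the 3rd layer

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully-connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def linear(layer, x):
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def make_decoder(seed=0):
    rng = random.Random(seed)
    in_dims = [IN_DIM, HIDDEN, HIDDEN + IN_DIM, HIDDEN, HIDDEN]
    out_dims = [HIDDEN, HIDDEN, HIDDEN, HIDDEN, 1]
    return [make_layer(i, o, rng) for i, o in zip(in_dims, out_dims)]

def decode_sdf(layers, latent, xyz):
    """Return one scalar SDF value for a (latent, query point) pair."""
    inp = list(latent) + list(xyz)
    h = inp
    for i, layer in enumerate(layers):
        if i == SKIP_AT:
            h = h + inp                      # skip connection: concatenate the input again
        h = linear(layer, h)
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]     # ReLU (assumed) on hidden layers only
    return h[0]

decoder = make_decoder()                     # one set of weights, shared by every shape
point = (0.2, -0.1, 0.4)
sdf_a = decode_sdf(decoder, [0.1] * LATENT_DIM, point)    # "shape A" latent
sdf_b = decode_sdf(decoder, [-0.1] * LATENT_DIM, point)   # "shape B" latent
print(sdf_a, sdf_b)
```

The same weights evaluated at the same point give different values for the two latents, which is all "shared across all shapes" means here: the shape identity lives entirely in the latent code.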

Differentiable renderer. A fixed (non-learned) ray-marching renderer — samples points along camera rays at fixed intervals, queries the SDF decoder at each, and computes a soft silhouette via sigmoid(−min_sdf / temperature). It is the bridge between 3-D and 2-D: not a neural network, but a mathematical operation that provides continuous gradients from 2-D pixel comparisons back through the SDF values to the latent codes.
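
A minimal single-ray sketch of that operation, with an analytic sphere SDF standing in for the learned decoder. The step count, near/far bounds, temperature value, and camera placement are assumed for illustration; the fixed-interval sampling and the sigmoid(−min_sdf / temperature) formula are the parts taken from the text.

```python
import math

def sphere_sdf(p, radius=0.5):
    """Analytic stand-in for the learned decoder: signed distance to a sphere at the origin."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius

def soft_silhouette(origin, direction, sdf, n_steps=64, t_near=0.0, t_far=4.0,
                    temperature=0.05):
    """March along one ray at fixed intervals and return sigmoid(-min_sdf / temperature)."""
    norm = math.sqrt(sum(d * d for d in direction))
    direction = [d / norm for d in direction]
    min_sdf = float("inf")
    for i in range(n_steps):
        t = t_near + (t_far - t_near) * i / (n_steps - 1)
        point = [o + t * d for o, d in zip(origin, direction)]
        min_sdf = min(min_sdf, sdf(point))
    # sigmoid(-m / T) written as 1 / (1 + exp(m / T))
    return 1.0 / (1.0 + math.exp(min_sdf / temperature))

camera = (0.0, 0.0, -2.0)
hit = soft_silhouette(camera, (0.0, 0.0, 1.0), sphere_sdf)    # ray through the sphere centre
miss = soft_silhouette(camera, (1.0, 0.0, 1.0), sphere_sdf)   # ray passing well outside
print(f"hit={hit:.3f} miss={miss:.3f}")
```

A ray that enters the surface sees a negative minimum SDF and renders close to 1; a ray that stays outside sees a positive minimum and renders close to 0. The temperature controls how wide the soft transition is at the silhouette edge, and therefore how far from the boundary a pixel still receives a useful gradient.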

Image (64×64×3) → CNN encoder (4 conv layers + FC) → latent code (128-dim) → SDF decoder (5-layer MLP, skip connection) queried at ray-marched 3-D coordinates → SDF values → surface detection → soft silhouette → compare with GT silhouette → backpropagate through everything
02 — Results

Trained on 50 shapes, one view each. Predicts 6 novel angles.

Trained on 50 synthetic shapes for 300 epochs, ~70 minutes on an M4 iMac. Each shape was seen from only one viewpoint during training. The figures below are the actual epoch-300 outputs from the repository: in each novel-view figure, the top row is the network's prediction and the bottom row is ground truth, across 6 novel angles the network never saw.

Figure: Mini SDF-SRN, sphere novel views at epoch 300.
Sphere · epoch 300. Training view (red shaded) + 6 novel angles. Pred row vs GT row: the smooth shape is the easiest case and tracks closely.
Figure: Mini SDF-SRN, box novel views at epoch 300.
Box · epoch 300. The hardest case, because of the sharp corners. The network recovers the gross silhouette across angles; corners round off slightly.
Figure: Mini SDF-SRN, ellipsoid novel views at epoch 300.
Ellipsoid · epoch 300. Intermediate difficulty: the anisotropic silhouette is recovered at every novel angle.
Figure: Mini SDF-SRN, training convergence at epoch 300.
Training convergence · epoch 300. Ground-truth silhouette / predicted silhouette / difference map. By epoch 300 the difference is a thin boundary band: the network has learned correct 3-D structure from single-view silhouette supervision alone.
Property | Value
Training data | 50 synthetic shapes (sphere, box, ellipsoid families), one view per shape
Supervision | 2-D silhouette only; no 3-D meshes, point clouds, or multi-view data
Epochs / hardware | 300 epochs · ~70 min on an M4 iMac (also runs on CUDA / CPU)
Novel-view test | 6 angles per shape, never seen during training
Silhouette loss (converged) | ~0.005, the CNN-encoder baseline that Flow-SDF (Topic 44) later matches
Core Insight

The renderer is not a network. It's the bridge.
A fixed ray-marching operation that makes 2-D gradients geometrically meaningful in 3-D.

"This pixel should be white but is black" → "the SDF along this ray should be negative somewhere" → "the latent code should change so the decoder produces a surface here" → "the encoder should map this image differently". Every component receives a meaningful learning signal derived from a simple 2-D image comparison — and the SDF decoder learns what 3-D geometry looks like from the accumulated constraints of many single-view silhouette comparisons, never seeing a 3-D mesh.

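The chain above can be traced by hand on a toy case. The sketch below replaces the encoder and decoder with a single shape parameter, the radius of a sphere, and fits it to a target silhouette using only the 2-D loss. Binary cross-entropy as the silhouette loss, the 16 × 16 orthographic pixel grid, the temperature, and the learning rate are all assumptions made for illustration. For a sphere at the origin the minimum SDF along a ray through pixel (x, y) is sqrt(x² + y²) − r in closed form, so no marching loop is needed here.

```python
import math

TAU = 0.05                                      # silhouette temperature (assumed value)
GRID = [-1.0 + 2.0 * i / 15 for i in range(16)]
PIXELS = [(x, y) for y in GRID for x in GRID]   # orthographic rays along +z

def silhouette(radius):
    """Soft silhouette of a sphere: min SDF on the ray through (x, y) is hypot(x, y) - radius."""
    out = []
    for x, y in PIXELS:
        min_sdf = math.hypot(x, y) - radius
        out.append(1.0 / (1.0 + math.exp(min_sdf / TAU)))   # sigmoid(-min_sdf / TAU)
    return out

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two soft silhouettes."""
    total = 0.0
    for s, t in zip(pred, target):
        s = min(max(s, eps), 1.0 - eps)
        total -= t * math.log(s) + (1.0 - t) * math.log(1.0 - s)
    return total / len(pred)

def grad_radius(radius, target):
    """dL/dr by the chain rule: dL/d(min_sdf) = (t - s) / TAU and d(min_sdf)/dr = -1."""
    pred = silhouette(radius)
    return sum((s - t) / TAU for s, t in zip(pred, target)) / len(pred)

target = silhouette(0.6)      # "ground-truth" silhouette of the true shape
radius = 0.3                  # initial guess is too small: many pixels are wrongly dark
initial_loss = bce(silhouette(radius), target)
for _ in range(500):
    radius -= 0.01 * grad_radius(radius, target)   # plain gradient descent
final_loss = bce(silhouette(radius), target)
print(f"radius={radius:.3f} loss {initial_loss:.4f} -> {final_loss:.4f}")
```

Pixels that should be white but render dark have t > s, so dL/d(min_sdf) is positive and descent pushes the SDF along those rays negative, which here means growing the radius. In the full model the same signal continues through the decoder weights and the latent into the encoder instead of stopping at one scalar. The final loss does not reach zero because the target itself is a soft silhouette.
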
Interactive Demo · Live

Step through training and rotate to a novel view. Pick a shape and an epoch. The left pane is the single training view the network sees; the middle pane is the network's predicted silhouette at a novel angle; the right pane is the ground-truth silhouette at that angle. In early epochs the prediction is mush; by epoch 300 it tracks the GT.


Full Technical Paper

White paper · minimal SDF-SRN reimplementation · the differentiable-renderer-as-bridge principle · single-view silhouette supervision · novel-view results

Related Thesis Chapters
Flow-SDF
The direct successor — keeps this SDF decoder and differentiable renderer, swaps the CNN encoder for a rectified-flow transformer. Stage 1 of Flow-SDF is this reimplementation.
SDF Research
The foundational SDF study — the implicit-surface representation this whole pipeline reconstructs into.
Hypernet → DeepSDF
The large-scale image-to-3-D archive. Mini SDF-SRN is the from-scratch primitive; Hypernet → DeepSDF is the 976-shape system built on the same DeepSDF + renderer ideas.