Reconstructing 3-D shape from a single image traditionally requires paired image–3-D datasets, which are expensive to collect at scale. SDF-SRN (Lin et al., NeurIPS 2020) demonstrated that 3-D shapes can be learned from single-view images using only 2-D silhouette supervision, eliminating the need for 3-D ground truth entirely. The thesis line's image-to-3-D work — the Hypernet → DeepSDF archive, the Flow-SDF pipeline — all rests on this one primitive: a differentiable renderer that turns a 3-D implicit surface into a 2-D image, so gradients from a 2-D loss can teach a network 3-D geometry.
Mini SDF-SRN builds that primitive from scratch on the smallest problem that still exercises every component: 50 synthetic shapes, single-view silhouette supervision, a CNN encoder, a shared SDF decoder, a ray-marching renderer. The goal is not a competitive reconstruction model — it is to own the 3-D-from-2-D mechanism end to end before using it as a black box downstream, the same architecture-literacy philosophy applied to the from-scratch transformer and the from-scratch DDPM earlier in the thesis line.
The encoder is four convolutional layers (3 → 32 → 64 → 128 → 256 channels, stride 2) followed by two fully-connected layers; it compresses a 64 × 64 × 3 image into a 128-dimensional latent code. The encoder is deterministic: one image maps to one code in a single forward pass. (This determinism is precisely the property Flow-SDF later replaces, to gain generative capacity.)
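To make the compression concrete, the short sketch below traces the feature-map shape through the four stride-2 convolutions. The channel counts, stride, and input size come from the text; the kernel size of 3 and padding of 1 are assumptions made here for illustration, and the real implementation may differ.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of one conv layer (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

channels = [3, 32, 64, 128, 256]   # channel progression stated in the text
size = 64                          # 64 x 64 input image

shapes = [(channels[0], size, size)]
for c_out in channels[1:]:
    size = conv_out(size)          # kernel=3, padding=1 are assumed values
    shapes.append((c_out, size, size))

# Number of features handed to the first fully-connected layer.
flat_features = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

for shape in shapes:
    print(shape)
print("flattened features:", flat_features)
```

Under these assumptions the spatial size halves at every layer (64, 32, 16, 8, 4), so the two fully-connected layers reduce a 4096-dimensional feature vector to the 128-dimensional latent code.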
The SDF decoder is a 5-layer MLP with a skip connection, following the DeepSDF pattern. It takes the latent code concatenated with a 3-D query coordinate and outputs a scalar SDF value: the latent determines which shape, the coordinate determines where it is evaluated. The decoder is shared across all 50 shapes; the latent code is the only per-shape state. Crucially, this decoder is never trained on 3-D data. It learns to produce 3-D geometry purely through 2-D silhouette supervision via the differentiable renderer.
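The decoder's input and output contract can be sketched as a plain NumPy forward pass. This is an illustration under stated assumptions rather than the repository's code: the hidden width of 256, the ReLU activations, the position of the skip connection (re-concatenating the input before the third layer), and the names `init_decoder` and `decode_sdf` are all hypothetical. Only the 128-dimensional latent, the 3-D query coordinate, the five layers, and the scalar output come from the text.

```python
import numpy as np

LATENT_DIM, HIDDEN = 128, 256      # HIDDEN is an assumed width
IN_DIM = LATENT_DIM + 3            # latent code + (x, y, z) query

def init_decoder(rng):
    """Five linear layers; layer index 2 also receives the raw input (skip)."""
    in_dims = [IN_DIM, HIDDEN, HIDDEN + IN_DIM, HIDDEN, HIDDEN]
    out_dims = [HIDDEN, HIDDEN, HIDDEN, HIDDEN, 1]
    return [(rng.normal(0.0, np.sqrt(2.0 / i), size=(i, o)), np.zeros(o))
            for i, o in zip(in_dims, out_dims)]

def decode_sdf(params, latent, points):
    """latent: (128,), points: (N, 3) -> signed distances of shape (N,)."""
    z = np.broadcast_to(latent, (points.shape[0], LATENT_DIM))
    x = np.concatenate([z, points], axis=1)        # same latent for every point
    h = x
    for idx, (W, b) in enumerate(params):
        if idx == 2:
            h = np.concatenate([h, x], axis=1)     # skip connection (assumed position)
        h = h @ W + b
        if idx < len(params) - 1:
            h = np.maximum(h, 0.0)                 # ReLU on hidden layers only
    return h[:, 0]

rng = np.random.default_rng(0)
params = init_decoder(rng)                         # shared across all shapes
latent = rng.normal(size=LATENT_DIM)               # the only per-shape state
points = rng.uniform(-1.0, 1.0, size=(5, 3))
sdf = decode_sdf(params, latent, points)
print(sdf.shape)
```

The weights here are random, so the values are meaningless; the point is that one set of parameters serves every shape, and swapping `latent` is the only way to select a different one.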
The renderer is a fixed, non-learned ray marcher. It samples points along camera rays at fixed intervals, queries the SDF decoder at each point, and computes a soft silhouette as sigmoid(−min_sdf / temperature), where min_sdf is the smallest SDF value sampled along the ray. The result is a soft indicator of whether any sampled point along the ray was inside the surface. The renderer is not a neural network; it is a mathematical operation, and its role is to provide continuous, differentiable gradients from 2-D pixel comparisons back through the SDF values to the latent codes. Without this component there would be no way to train the 3-D representation from 2-D images at all.
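The rendering rule described above can be sketched in a few lines of NumPy. Several things here are assumptions for illustration: the camera is orthographic and looks along the z axis, an analytic sphere SDF stands in for the learned decoder, and the resolution, step count, and temperature of 0.05 are arbitrary choices. Only the fixed-interval sampling and the sigmoid(−min_sdf / temperature) formula come from the text.

```python
import numpy as np

def sphere_sdf(points, radius=0.5):
    """Analytic stand-in for the learned decoder: signed distance to a sphere."""
    return np.linalg.norm(points, axis=-1) - radius

def render_silhouette(sdf_fn, res=32, n_steps=48, temperature=0.05):
    """Soft silhouette from fixed-step ray marching (orthographic, assumed)."""
    xs = np.linspace(-1.0, 1.0, res)
    px, py = np.meshgrid(xs, xs, indexing="xy")       # one ray per pixel
    zs = np.linspace(-1.0, 1.0, n_steps)              # fixed sampling intervals
    # Sample points along every ray: shape (res, res, n_steps, 3).
    pts = np.stack([
        np.broadcast_to(px[..., None], (res, res, n_steps)),
        np.broadcast_to(py[..., None], (res, res, n_steps)),
        np.broadcast_to(zs, (res, res, n_steps)),
    ], axis=-1)
    sdf = sdf_fn(pts.reshape(-1, 3)).reshape(res, res, n_steps)
    min_sdf = sdf.min(axis=-1)                        # closest approach per ray
    # sigmoid(-min_sdf / temperature), written as 1 / (1 + exp(min_sdf / T)).
    return 1.0 / (1.0 + np.exp(min_sdf / temperature))

silhouette = render_silhouette(sphere_sdf)
print(silhouette.shape, silhouette[16, 16], silhouette[0, 0])
```

Rays through the middle of the image pass inside the sphere, so their minimum SDF is negative and the pixel value approaches 1; corner rays stay well outside and approach 0. A smaller temperature sharpens the boundary, at the cost of gradients that vanish faster away from it.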
The differentiable renderer creates a continuous gradient path from 2-D pixels to 3-D geometry. Because the renderer is a fixed mathematical operation rather than a learned component, the gradients it passes back are geometrically meaningful, and the chain of reasoning is concrete:
"This pixel should be white but is black" → "the SDF value along this ray should be negative somewhere, indicating a surface" → "the latent code should change so the decoder produces a surface here" → "the encoder should map this image to a different code." Every component receives a meaningful learning signal derived from a simple 2-D image comparison.
The SDF decoder never sees a 3-D mesh. It learns what 3-D geometry looks like from the accumulated constraints of many single-view silhouette comparisons across the dataset. No single training example provides multi-view supervision — but different examples from different viewpoints collectively teach the decoder about 3-D structure. Single-view silhouette supervision constrains the visual hull of each shape; the shared decoder resolves the within-hull depth ambiguity statistically across the dataset.
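The first link of the gradient chain traced above, from a pixel error to the SDF value along its ray, can be checked numerically. The sketch assumes a squared-error pixel loss and a temperature of 0.05, neither of which is specified in the text, and the function names are hypothetical. It compares the analytic chain-rule gradient with a finite-difference estimate for a pixel that should be inside the silhouette (target 1) while the ray currently passes just outside the surface.

```python
import math

def soft_silhouette(min_sdf, temperature=0.05):
    """sigmoid(-min_sdf / temperature)."""
    return 1.0 / (1.0 + math.exp(min_sdf / temperature))

def pixel_loss(min_sdf, target, temperature=0.05):
    """Squared error between rendered and target pixel (assumed loss form)."""
    return (soft_silhouette(min_sdf, temperature) - target) ** 2

def dloss_dmin_sdf(min_sdf, target, temperature=0.05):
    """Chain rule: dL/ds = 2 (s - target), ds/dm = -s (1 - s) / temperature."""
    s = soft_silhouette(min_sdf, temperature)
    return 2.0 * (s - target) * (-s * (1.0 - s) / temperature)

min_sdf, target = 0.05, 1.0        # ray just outside; pixel should be inside
grad = dloss_dmin_sdf(min_sdf, target)

# Central finite difference as an independent check of the analytic gradient.
eps = 1e-6
numeric = (pixel_loss(min_sdf + eps, target)
           - pixel_loss(min_sdf - eps, target)) / (2 * eps)

print(grad, numeric)
```

The gradient is positive, so a gradient-descent step lowers the minimum SDF along the ray, pushing it toward the negative values that mark a surface. In the full model the same signal continues backward through the decoder into the latent code and the encoder weights.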
The model was trained on 50 synthetic shapes for 300 epochs, taking ~70 minutes on an M4 iMac (the implementation also runs on CUDA or CPU). Each shape was seen from a single viewpoint during training; the network was then tested on 6 novel viewing angles per shape, none of which it saw during training.
| Property | Value |
|---|---|
| Training data | 50 synthetic shapes (sphere, box, ellipsoid families), one view per shape |
| Supervision | 2-D silhouette only — no 3-D meshes, point clouds, or multi-view data |
| Encoder / decoder | CNN (4 conv + 2 FC) → 128-dim latent → 5-layer skip-MLP SDF decoder |
| Training | 300 epochs, ~70 min on M4 iMac |
| Novel-view test | 6 angles per shape, never seen in training |
| Converged silhouette loss | ~0.005 — the CNN-encoder baseline Flow-SDF later matches with a rectified-flow encoder |
The qualitative result, visible in the repository's epoch-300 figures: the sphere (smooth — easiest) tracks the ground-truth silhouette closely across all 6 novel angles; the box (sharp corners — hardest) recovers the gross silhouette with corners rounding off slightly; the ellipsoid (intermediate) recovers the anisotropic silhouette at every angle. The training-convergence triptych — ground truth / predicted / difference — shows the difference shrinking to a thin boundary band by epoch 300. The network has learned correct 3-D structure from single-view silhouette supervision alone.
Two honest limitations. (i) The CNN encoder is deterministic and is a dead end for generation. It maps one image to one code in a single forward pass — there is no stochasticity, no capacity for conditioning on text or partial views, no iterative refinement. (ii) Synthetic shapes only. The 50-shape synthetic dataset (sphere/box/ellipsoid families) validates the mechanism but says nothing about real-image generalisation; the box's rounded corners hint that sharp-feature recovery from silhouette-only supervision is intrinsically limited — a silhouette constrains the visual hull but not within-hull concavities.
Both limitations point at the same successor. Flow-SDF (the rectified-flow topic) keeps the SDF decoder and the differentiable renderer unchanged, since both are validated here, and replaces only the CNN encoder with a rectified-flow transformer that generates the latent code through iterative denoising. That single substitution converts a reconstruction-only model into one with generative capacity, and Flow-SDF demonstrates that it closely matches this baseline's silhouette loss (0.0053 vs 0.0050). Mini SDF-SRN is, precisely, Stage 1 of Flow-SDF.
Mini SDF-SRN is a minimal, self-contained, from-scratch build of the 3-D-from-2-D primitive: CNN encoder, shared SDF decoder, fixed differentiable ray-marching renderer, trained end-to-end on single-view silhouette supervision with no 3-D ground truth. It learns correct 3-D structure from 50 synthetic shapes and predicts novel views it never saw. The contribution is the validated primitive — owning the differentiable-renderer-as-bridge mechanism end to end is the prerequisite for the thesis-line image-to-3-D work that builds directly on it.