The most recent thesis-line project is also the largest: twelve experimental phases, several thousand GPU-hours, and a 30-page write-up titled "From Weight-Space Diffusion to Latent-Space DeepSDF". The starting hypothesis: a 3-D shape is the weights of a small neural network, so image-to-3-D is the problem of predicting weights from an image. It fails, and fails informatively. The trained per-shape decoder weights occupy a thin warm-started shell of the 54,785-dimensional weight space (mean pairwise cosine 0.96), and the image-conditioned diffusion model collapses to a 4-shape attractor cluster regardless of input. The pivot, DeepSDF (a single shared decoder jointly optimised with one 64-dim latent per shape), works: clean reconstructions across all 976 training shapes, and genuine category-level out-of-distribution generation that was entirely absent at the 20-shape pilot scale.
The project starts from a genuinely exciting hypothesis, inherited
from prior lab work on hypernetworks and weight-space learning. If a
3-D shape is encoded as the parameters of a small MLP that defines
its signed distance function — a network f_θ : ℝ³ → ℝ
whose weights θ are the shape — then 3-D
generation reduces to predicting MLP parameters, and image-to-3-D
reduces to high-dimensional vector regression from an image. The
framing is representation-free: no triplanes, no voxels, no mesh
topology. The shape is the network, and it composes naturally with
diffusion models.
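The core operation behind "the shape is the network" is flattening an MLP's parameters into a single vector that a diffusion model can generate. A minimal sketch, assuming PyTorch; the toy decoder sizes here are illustrative, not the project's 54,785-parameter architecture:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# Toy per-shape SDF decoder: point in R^3 -> signed distance.
# (Sizes are illustrative, not the project's actual decoder.)
def make_decoder():
    return nn.Sequential(
        nn.Linear(3, 32), nn.ReLU(),
        nn.Linear(32, 32), nn.ReLU(),
        nn.Linear(32, 1),
    )

net = make_decoder()

# "The shape is the network": flatten all weights into one vector --
# this vector is what a weight-space diffusion model would predict.
theta = parameters_to_vector(net.parameters())
assert theta.numel() == 1217  # 3*32+32 + 32*32+32 + 32+1

# Round-trip: load a (possibly generated) weight vector into a fresh MLP.
net2 = make_decoder()
vector_to_parameters(theta, net2.parameters())
x = torch.randn(5, 3)
assert torch.allclose(net(x), net2(x))  # identical SDF -> identical shape
```

Image-to-3-D then reduces to regressing `theta` from image features, which is exactly what phases 6–10 attempt.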
The real research question Topic 41 set out to answer: what architectural choices matter when you train image-to-3-D systems with ~10³ shapes rather than the ~10⁵–10⁷ of large published systems? Data scarcity — a few hundred to a few thousand domain-specific shapes — is the realistic operating regime for most research and applied work, and it is the regime the Apple Maps procedural-modelling thesis line lives in.
The honest answer the project arrived at, after eight architectural iterations of the weight-space line, is that the weight-space hypothesis does not survive at this data scale, and the value is in the precision of that finding. The failures localise to one structural cause (§03), and that cause pointed straight at the pivot (§05) that works. The write-up's own framing: "the failure modes are themselves the contribution."
The project is structured as twelve numbered phases, each a directory in the public repository. Phases 0–5 build the foundation. Phases 6–10 are the image-conditioned weight-space line — and all five fail. Phases 11–12 are the pivot — and both work. The three architectural families under test: (1) per-shape neural-network weight prediction via diffusion in raw 54,785-dimensional weight space; (2) weight-space autoencoders that compress per-shape MLP weights into low-dimensional codes for downstream diffusion; (3) DeepSDF-style joint optimisation of a single shared decoder with per-shape latent codes.
| Phase | What it does | Status |
|---|---|---|
| phase0 | Anchor SIREN / base SDF architecture | Foundation |
| phase1 / 1.5 | Per-shape SIREN decoders + warm-start permutation alignment | Foundation |
| phase1_relu_pershape | Per-shape ReLU+PE decoders — chosen for higher weight-perturbation tolerance | Foundation |
| phase2 | Early weight-space DiT | Partial |
| phase3_hypernet | Per-layer hypernetwork — the prior topic-03 work | Success at its scope (unconditional) |
| phase4 | Latent diffusion / VAE+diffusion over hypernet outputs | Partial |
| phase5_image_cond | First image-conditioned attempt (single-view, SIREN family) | Partial |
| phase6_image_cond | Image-conditioned weight-space DiT (~132 M params), single-view | FAILED — mode collapse to 4-shape attractor cluster |
| phase7_multiview_cond | Multi-view extension, K ∈ [1,8] views + camera poses | FAILED — same collapse |
| phase8_no_standardization | Phase 7 ablation removing per-dim weight standardization | FAILED — rules out standardization |
| phase9_no_dropout | Phase 7 ablation removing CFG / pose dropout | FAILED — rules out dropout |
| phase10_weight_autoencoder | Compress weights via autoencoder, then diffuse | FAILED visually despite cos(rec, true) = 0.997 |
| phase11_deepsdf | DeepSDF shared decoder (~1.95 M params) + per-shape 64-dim latents, jointly optimised | WORKS — clean reconstructions across all 976 shapes |
| phase12_image_to_latent | Image-conditioned latent DiT (~10 M params) — predicts 64-dim latent from DINOv2 features + poses | WORKS — perfect recall + category-appropriate OOD at 976 shapes |
Every phase shares the same data foundation. The dataset is a
1,000-shape curated subset of Objaverse, filtered to the LVIS
category vocabulary — common objects (table, chair, lamp, bottle),
wildlife (dog, lion, beetle), tools (toothbrush, sharpie, pacifier),
vehicles (cabin_car, surfboard), and long-tail oddities (banjo,
escargot, signboard, Tabasco_sauce). After watertight conversion and
SDF sampling, 976 shapes came through clean across
both stages and form the working set. Each shape carries an integer
obj_idx (0–975), its original Objaverse hash UID, and an
LVIS category label, bidirectionally mapped in manifest.json.
| Stage | Method | Output |
|---|---|---|
| Watertight conversion | Houdini VDB pipeline — scatter ~1 M surface points, voxelise via VDB-from-particles (~1 M voxels), VDB → polygons | Closed manifold meshes |
| SDF sampling | mesh-to-sdf, 200,000 query points per shape (50 % near-surface, 50 % uniform in the unit cube), normalised to a unit bounding sphere | 976 × obj_NNNN.npz, ~3 GB |
| Multi-view renders | 64 viewpoints on a Fibonacci sphere at distance 2.5, pyrender + EGL, gray-blue PBR material, headlight + rim light, 224 × 224 RGB | 976 × 64 PNG renders, ~1 GB · 199 pre-existed, 777 rendered fresh (~13 h) |
| Image features | DINOv2-base/14 — 768-dim CLS token per (shape, view), ImageNet-normalised | Feature cache, 192 MB |
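The render stage places 64 cameras on a Fibonacci sphere at distance 2.5. A minimal sketch of that viewpoint layout; the exact constants and axis conventions of the repo's renderer are assumptions:

```python
import numpy as np

def fibonacci_sphere(n=64, radius=2.5):
    """Camera positions on a Fibonacci sphere, as used for the 64 renders.
    (Sketch: the repo's exact conventions may differ.)"""
    i = np.arange(n)
    golden = (1 + 5 ** 0.5) / 2
    z = 1 - (2 * i + 1) / n          # evenly spaced heights in (-1, 1)
    r = np.sqrt(1 - z ** 2)          # ring radius at each height
    phi = 2 * np.pi * i / golden     # golden-angle azimuth steps
    pts = np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)
    return radius * pts

cams = fibonacci_sphere()
assert cams.shape == (64, 3)
assert np.allclose(np.linalg.norm(cams, axis=1), 2.5)
```

Golden-angle spacing gives near-uniform coverage of the sphere with no clustering at the poles, which is why it is the standard choice for multi-view capture rigs.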
Phase 1 trains one ReLU+PE decoder per shape —
point → positional encoding (6 bands → 39-dim) → four
Linear(128)+ReLU layers → Linear(1), 54,785 parameters each.
ReLU+PE was chosen over SIREN deliberately: in-house perturbation
benchmarks showed ReLU+PE decoders tolerate much larger weight
perturbations before reconstruction collapses (relative σ ≥ 0.34
versus SIREN's ≤ 0.17) — and a downstream diffusion model
will produce noisy weight vectors. All 976 decoders are
warm-started from a single anchor (trained on obj_0000,
a coffee table), because warm-starting keeps every per-shape decoder
in the same permutation neighbourhood — necessary for the
weight-space interpolation the prior topic-03 work depended on.
Final per-shape losses: median 0.00185, 95th-percentile 0.00409.
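The described architecture can be sketched directly, and the parameter count checks out: PE maps 3 dims to 3 + 3·2·6 = 39, and four Linear(128)+ReLU layers plus the output head land exactly on 54,785 parameters. The precise sin/cos frequency convention is an assumption; the layer sizes are the write-up's:

```python
import torch
import torch.nn as nn

class PosEnc(nn.Module):
    """6-band sin/cos positional encoding: 3 -> 3 + 3*2*6 = 39 dims.
    (Frequency base 2^k * pi is an assumed convention.)"""
    def __init__(self, bands=6):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(bands) * torch.pi)

    def forward(self, x):                          # x: (N, 3)
        f = x[..., None] * self.freqs              # (N, 3, 6)
        enc = torch.cat([torch.sin(f), torch.cos(f)], dim=-1)
        return torch.cat([x, enc.flatten(-2)], dim=-1)  # (N, 39)

# PE(39) -> four Linear(128)+ReLU -> Linear(1), as described above.
decoder = nn.Sequential(
    PosEnc(),
    nn.Linear(39, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

n_params = sum(p.numel() for p in decoder.parameters())
assert n_params == 54_785   # the weight-space dimensionality of phases 6-10
```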
Phases 6–10 are the image-conditioned weight-space line. Phase 6 trains a ~132 M-parameter Diffusion Transformer to predict the 54,785-dim weight vector from a single image; it chunks the weight vector into 8 tokens, cross-attends to view tokens (DINOv2 CLS ⊕ sinusoidal pose embedding), uses a cosine T = 500 schedule with x₀-prediction and DDIM 50-step sampling. Phase 7 extends to multi-view (K ∈ [1, 8]). Both collapse. Phases 8 and 9 are ablations; phase 10 is the autoencoder rescue attempt. All five fail.
| Phase | Hypothesis under test | Result |
|---|---|---|
| 6 — image-cond DiT | A ~132 M-param image-conditioned weight-space DiT will track the conditioning image | Mode collapse — predictions land on a 4-shape attractor cluster (obj_0054, 0055, 0172, 0000) regardless of input |
| 7 — multi-view | More views (K up to 8) give a stronger conditioning signal that breaks the collapse | Same collapse. Training loss EMA plateaus at 0.198 standardised MSE around step 15 K |
| 8 — no standardization | Per-dimension weight standardization (124 of 54,785 dims had std < 10⁻⁴) distorts the per-shape signal | Same collapse — rules out standardization |
| 9 — no dropout | Classifier-free-guidance / pose dropout destroys the conditioning during training | Same collapse — rules out dropout |
| 10 — weight autoencoder | Compress the weights with an AE first (54,785 → 128/256 dim), diffuse in the smaller space | Numerically excellent (cos 0.997), visually destroyed meshes |
A representative diagnostic: obj_0119 reaches cos(pred, true) =
0.9698, but the top-5 nearest training latents to the
prediction are obj_0054, 0055, 0172, 0000 at
cos ≈ 0.987, closer to the attractors than to
the true target. And across all four test shapes, the same
four attractor shapes top the list regardless of input image. The
DiT has not collapsed to noise; it has collapsed to the centroid
of the training distribution, with image conditioning supplying
only a ~3 % directional nudge that is too weak to reach the
target.
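The diagnostic itself is a few lines: rank all training vectors by cosine similarity to the prediction and inspect the top-k. A self-contained sketch on toy data (names hypothetical):

```python
import numpy as np

def topk_cosine_neighbours(pred, bank, k=5):
    """Rank training vectors by cosine similarity to a prediction --
    the kind of probe that exposed the 4-shape attractor cluster."""
    pred = pred / np.linalg.norm(pred)
    bank_n = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = bank_n @ pred
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]

# Toy check: a query near bank[2] should rank it first.
rng = np.random.default_rng(1)
bank = rng.standard_normal((10, 64))
query = bank[2] + 0.05 * rng.standard_normal(64)
idx, sims = topk_cosine_neighbours(query, bank, k=3)
assert idx[0] == 2 and sims[0] > 0.99
```

The collapse signature is the inverse of this sanity check: the same few indices top the list for every query.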
Even at t = 0 (the input
is the near-clean target plus a trace of noise) the loss is still
0.189: the model cannot even reproduce a near-clean input. A
healthy diffusion model has a characteristic non-flat loss
profile. A flat one is the signature of a model that has learned
the marginal mean and nothing conditional. This rules out
"x₀-prediction is the wrong target" and rules out "just train
longer" — the floor is structural, not optimisation-bound.
The phase-10 autoencoder compresses weight residuals (54,785 → 128
or 256 dims) and is numerically excellent — final MSE 2.28 × 10⁻⁵,
cos(rec, true) mean 0.9965, and the latent-space
pairwise cosine drops from 0.96 all the way to 0.07 (different
shapes mapped to near-orthogonal directions). By every metric an
autoencoder optimises, this is excellent compression. The meshes
say otherwise. The residual 0.3–0.5 % error is not
uniformly distributed — it lands on different dimensions for
different shapes, and ReLU+PE decoders are non-uniformly sensitive
to which dimensions absorb it. The figures below are the
actual phase-10 results from the released checkpoints.
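The non-uniform sensitivity is easy to demonstrate: perturb each weight dimension by the same amount and measure the output shift. A toy probe, assuming PyTorch and a small stand-in MLP rather than the released checkpoints:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# Why 0.997 cosine can still wreck a mesh: the decoder's output is not
# uniformly sensitive to which weight dimensions absorb the residual.
torch.manual_seed(0)
def make_net():
    return nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))

net = make_net()
theta = parameters_to_vector(net.parameters()).detach()
x = torch.randn(512, 3)

def output_shift(dim, eps=1e-2):
    """Mean |SDF change| from nudging a single weight dimension by eps."""
    t = theta.clone()
    t[dim] += eps
    net2 = make_net()
    vector_to_parameters(t, net2.parameters())
    with torch.no_grad():
        return (net2(x) - net(x)).abs().mean().item()

shifts = torch.tensor([output_shift(d) for d in range(theta.numel())])
# Identical-size perturbations, wildly different effect per dimension:
assert shifts.max() > 5 * shifts.min()
```

The autoencoder's reconstruction metric treats all 54,785 dimensions equally; the decoder does not, so the same scalar error can be invisible on one shape and catastrophic on another.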
(Labelled turkey_rec in the repo but actually renders obj_0119_rec.obj; labelled here by content.)

Stop extracting the 64-dim signal from a 54,785-dim weight space.
Constrain it to 64 dims by construction instead.
The weight-space line spent eight phases trying to extract a thin per-shape signal out of a high-dimensional weight vector that was 96 % shared anchor. The DeepSDF pivot inverts the problem: instead of discovering a low-dimensional per-shape code inside the weight space, it defines a 64-dim latent up front and jointly optimises it with a single shared decoder. The "compression" from 54,785 → 64 is not learned post-hoc by an autoencoder — it is enforced by the training procedure. The shared decoder must use the latent to differentiate shapes, because it has no per-shape weights to fall back on.
Phase 11 is DeepSDF. A single shared decoder — 8
hidden ReLU layers of width 512, with a DeepSDF-style skip
connection that re-injects the input at the middle layer, ~1.95 M
parameters — takes concat(latent₆₄, PE(point)₃₉) = 103
input dimensions and predicts an SDF scalar. Each shape's 64-dim
latent is a learnable parameter, initialised from
𝒩(0, 0.01²) and optimised jointly with the decoder
(separate Adam learning rates — 5 × 10⁻⁴ decoder, 10⁻³ latents — plus
a 10⁻⁵ L2 regulariser on the latent norms). Training objective is
clamped-L1 on the SDF, 4 shapes and 8,192 points per shape per step.
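The described setup translates almost line-for-line into code. A sketch under stated assumptions: the exact skip wiring and the clamp threshold are not specified in the text, so the variant below (middle layer re-ingests the full 103-dim input; clamp δ = 0.1) is illustrative, but it lands on the ~1.95 M parameter count the write-up reports:

```python
import torch
import torch.nn as nn

class DeepSDFDecoder(nn.Module):
    """Shared decoder: 8 hidden ReLU layers of width 512, input
    concat(latent_64, PE(point)_39) = 103 dims, DeepSDF-style skip
    re-injecting the input at the middle layer (assumed wiring)."""
    def __init__(self, d_in=103, width=512, depth=8):
        super().__init__()
        self.skip_at = depth // 2
        layers = []
        for i in range(depth):
            in_dim = d_in if i == 0 else width
            if i == self.skip_at:
                in_dim = width + d_in   # middle layer re-ingests the input
            layers.append(nn.Linear(in_dim, width))
        self.layers = nn.ModuleList(layers)
        self.out = nn.Linear(width, 1)

    def forward(self, z_pe):            # z_pe: (N, 103)
        h = z_pe
        for i, lin in enumerate(self.layers):
            if i == self.skip_at:
                h = torch.cat([h, z_pe], dim=-1)
            h = torch.relu(lin(h))
        return self.out(h)

dec = DeepSDFDecoder()
n = sum(p.numel() for p in dec.parameters())
assert 1_900_000 < n < 2_000_000        # ~1.95 M, as described

# Jointly optimised per-shape latents, N(0, 0.01^2) init, two LRs,
# with the 1e-5 L2 pull on latent norms expressed as weight decay.
latents = nn.Parameter(0.01 * torch.randn(976, 64))
opt = torch.optim.Adam([
    {"params": dec.parameters(), "lr": 5e-4},
    {"params": [latents], "lr": 1e-3, "weight_decay": 1e-5},
])

def clamped_l1(pred, target, delta=0.1):   # delta is an assumed value
    return (pred.clamp(-delta, delta) - target.clamp(-delta, delta)).abs().mean()
```

The clamp focuses capacity near the surface: far-field SDF values are truncated, so the decoder spends its error budget where marching cubes will read it.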
Phase 12 trains the image-conditioning head. The target is now the clean 64-dim latent, not a thin weight residual. A Diffusion Transformer (d_model 384, 4 layers, 6 heads, ~10 M parameters vs Phase 7's 132 M; the prediction target itself is ~800× smaller, 64 dims vs 54,785) treats the latent as a single token, cross-attends to multi-view tokens (DINOv2 CLS-768 ⊕ 64-dim sinusoidal pose embedding), with AdaLN modulation on the diffusion timestep. Cosine T = 500 schedule, x₀-prediction, K ∈ [1, 8] views sampled per batch, 15 K training steps. CFG/pose dropout is disabled: Phase 9 confirmed it was never the cause of the collapse.
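The conditioning plumbing can be sketched in a few tensors. Assumptions flagged in the comments: the pose featurisation, frequency constants, and AdaLN shapes below are illustrative, not the repo's code; only the dimensions (768 ⊕ 64 view tokens, d_model 384) come from the text:

```python
import torch
import torch.nn as nn

def sinusoidal(x, dim=64):
    """Sinusoidal embedding of a scalar (assumed encoding)."""
    half = dim // 2
    freqs = torch.exp(torch.linspace(0.0, -8.0, half))  # log-spaced
    ang = 1000.0 * x[..., None] * freqs
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

K = 3                                    # views sampled from K in [1, 8]
cls = torch.randn(K, 768)                # DINOv2-base CLS token per view
pose = torch.rand(K)                     # stand-in scalar pose parameter
view_tokens = torch.cat([cls, sinusoidal(pose)], dim=-1)   # (K, 832)

# AdaLN: the diffusion timestep produces per-channel scale and shift
# that modulate the single latent token inside each DiT block.
d_model = 384
t_embed = sinusoidal(torch.tensor([0.5]), dim=d_model)      # (1, 384)
ada = nn.Linear(d_model, 2 * d_model)                       # -> scale, shift
scale, shift = ada(t_embed).chunk(2, dim=-1)
latent_token = torch.randn(1, d_model)
modulated = nn.LayerNorm(d_model)(latent_token) * (1 + scale) + shift
assert view_tokens.shape == (3, 832) and modulated.shape == (1, 384)
```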
The first phase-11 pilot used a small decoder — hidden 256, 4 layers, 800 epochs. The loss numbers looked healthy (SDF L1 mean 0.00613, latent pairwise cos −0.04 — essentially orthogonal) but the meshes were blob-quality: the decoder lacked the capacity to represent 20 distinct shapes through a 64-dim latent.



Increasing decoder capacity to hidden 512 × 8 layers (~1.95 M params) and training for 4,000 epochs produces clean reconstructions across all 20 pilot shapes — SDF L1 mean drops from 0.00613 to 0.00051, max to 0.00090, in ~13 minutes. Same data, same 64-dim latent — capacity was the difference.



Scaling to all 976 shapes (same 512×8 architecture, 1,500 epochs, ~140 min) keeps every shape clean: SDF L1 mean 0.00212, max 0.00593 — about 4× the 20-shape mean, but no shape catastrophically fails. The latent pairwise cosine rises only to 0.12 — the 64-dim space has room for 976 distinguishable shapes.
On training shapes the phase-12 pipeline has essentially perfect
recall. The 20-shape pilot reaches a training-loss EMA of 8 × 10⁻⁶
and cos(pred, true) = 1.0000 for every tested shape,
with a > 0.4 cosine margin to the second-nearest training latent.
At 976 shapes the training EMA is 5.1 × 10⁻³ — deliberately
higher, because the model can no longer memorise — and
recall is still strong: obj_0009, the wagon wheel,
round-trips cleanly through the full
image → DINOv2 → DiT → decoder → marching-cubes pipeline.
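The last stage of that pipeline, sketched with an analytic sphere SDF standing in for the DeepSDF decoder: evaluate the SDF on a dense grid, then run marching cubes (e.g. skimage.measure.marching_cubes) on the grid to extract the zero level set. Only the grid evaluation is shown; grid resolution and batching are assumptions:

```python
import numpy as np

R = 64
axis = np.linspace(-1, 1, R, dtype=np.float32)
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)   # (R^3, 3) queries

def sdf(p):
    """Stand-in for decoder(latent, PE(p)); a real run evaluates the
    network in batches to bound memory."""
    return np.linalg.norm(p, axis=-1) - 0.5

grid = sdf(pts).reshape(R, R, R)
assert grid.shape == (64, 64, 64)
assert grid.min() < 0 < grid.max()   # surface crosses zero -> a mesh exists
# skimage.measure.marching_cubes(grid, level=0.0) would now yield the mesh.
```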


The most important result of the project is the qualitative shift in OOD behaviour between scales. At 20 training shapes the model is a pure nearest-neighbour retriever in DINOv2 feature space: feed it a tunnel and it returns a clean — but completely wrong — humanoid, because there is no "tunnel-like" region in a 20-shape latent space for it to land in.


At 976 shapes the latent space has acquired enough semantic structure that DINOv2 features can navigate it. The same three never-seen inputs — a posed humanoid, a thin tunnel, a head-and-shoulders bust — now produce category-appropriate output. The surfaces are rough; the topology is right.









| Metric | 20-shape pilot | 976-shape run |
|---|---|---|
| Phase 11 SDF L1 (mean / max) | 0.00051 / 0.00090 | 0.00212 / 0.00593 |
| Phase 11 latent pairwise cosine | −0.04 (orthogonal) | 0.12 (room to spare) |
| Phase 12 training-loss EMA | 8 × 10⁻⁶ (memorises) | 5.1 × 10⁻³ (cannot memorise) |
| Recall | cos(pred, true) = 1.0000 | Clean full-pipeline recall |
| OOD behaviour | Pure nearest-neighbour retrieval | Category-appropriate generation |
The actual phase 11 + 12 pipeline, running. This is the project's own HuggingFace Space embedded below — upload 1–8 images of an object and it returns a 3-D mesh, running the real DINOv2 → DiT → DeepSDF decoder → marching-cubes pipeline on the released 976-shape checkpoints. If the embed is slow to wake (HF Spaces sleep when idle), open it directly on HuggingFace ↗.
White paper · twelve-phase image-to-3-D research archive · the warm-start dominance diagnostic · the autoencoder trap · the DeepSDF pivot · 20 → 976-shape scaling · grounded in the 30-page thesis