Research Timeline · Aditya Jain / Apple Maps · 3D Reconstruction
Apr–May 2026
Topic 41 · Hypernetworks · DeepSDF · Latent Diffusion · Image-to-3-D

Hypernet → DeepSDF —
Twelve Phases to a Working Image-to-3-D Pipeline.

The most recent thesis-line project, and the largest — twelve experimental phases, several thousand GPU-hours, and a 30-page write-up titled "From Weight-Space Diffusion to Latent-Space DeepSDF". The starting hypothesis: a 3-D shape is the weights of a small neural network, so image-to-3-D is the problem of predicting weights from an image. It fails — and fails informatively. The trained per-shape decoder weights occupy a thin warm-started shell of the 54,785-dimensional weight space (mean pairwise cosine 0.96), and the image-conditioned diffusion model collapses to a 4-shape attractor cluster regardless of input. The pivot — DeepSDF: a single shared decoder jointly optimised with one 64-dim latent per shape — works: clean reconstructions across all 976 training shapes, and genuine category-level out-of-distribution generation that was entirely absent at the 20-shape pilot scale.

00 — Motivation

"What if 3-D geometry just is a neural network's weights?"

The project starts from a genuinely exciting hypothesis, inherited from prior lab work on hypernetworks and weight-space learning. If a 3-D shape is encoded as the parameters of a small MLP that defines its signed distance function — a network f_θ : ℝ³ → ℝ whose weights θ are the shape — then 3-D generation reduces to predicting MLP parameters, and image-to-3-D reduces to high-dimensional vector regression from an image. The framing is representation-free: no triplanes, no voxels, no mesh topology. The shape is the network, and it composes naturally with diffusion models.
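The hypothesis can be made concrete in a few lines: a shape is any function f : ℝ³ → ℝ whose zero level set is the surface, and a small MLP with weights θ is one such function. A minimal NumPy sketch (layer sizes and names here are illustrative, not the project's architecture):

```python
import numpy as np

def sphere_sdf(p, r=0.5):
    """Analytic signed distance of a sphere: negative inside, positive outside."""
    return np.linalg.norm(p, axis=-1) - r

rng = np.random.default_rng(0)
# A tiny random-weight MLP with the same signature f_theta : R^3 -> R.
# Under the weight-space hypothesis, *these arrays are the shape*.
theta = {
    "W1": rng.normal(size=(3, 32)), "b1": np.zeros(32),
    "W2": rng.normal(size=(32, 1)), "b2": np.zeros(1),
}

def mlp_sdf(p, theta):
    h = np.maximum(p @ theta["W1"] + theta["b1"], 0.0)  # single ReLU layer
    return (h @ theta["W2"] + theta["b2"]).squeeze(-1)

pts = rng.uniform(-1, 1, size=(4, 3))
print(sphere_sdf(np.zeros((1, 3))))  # [-0.5]: the origin is inside the sphere
print(mlp_sdf(pts, theta).shape)     # (4,): one SDF value per query point
```

Generating a shape then means producing θ; conditioning on an image means regressing θ from pixels, which is the framing the first ten phases test.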

The real research question Topic 41 set out to answer: what architectural choices matter when you train image-to-3-D systems with ~10³ shapes rather than the ~10⁵–10⁷ of large published systems? Data scarcity — a few hundred to a few thousand domain-specific shapes — is the realistic operating regime for most research and applied work, and it is the regime the Apple Maps procedural-modelling thesis line lives in.

The honest answer the project arrived at, after eight architectural iterations of the weight-space line, is that the weight-space hypothesis does not survive at this data scale — and the value is in the precision of that finding. The failures localise to one structural cause (§03), and that cause pointed straight at the pivot (§05) that works. The write-up's own framing: "the failure modes are themselves the contribution."

What it informs
This is the thesis-line's first end-to-end working image-to-3-D system, and a non-technical user can run it — the HuggingFace Space takes 1–8 photos and returns a mesh. It consumes the architectural decisions of the entire preceding thesis line — the manifold-hypothesis argument for x-prediction (Topic 28), the DeepSDF / SDF foundation study (Topic 8), the consumer-hardware constraint first articulated in the Neural-SLAM pitch (Topic 9) — and produces the result those topics were building toward. It also closes the prior topic-03 hypernetwork thread with a documented answer: the per-layer hypernetwork succeeded unconditionally, but the warm-start prior it relies on is precisely what dooms image-conditioned generation at small scale.
01 — The Twelve Phases

A research archive, not a single experiment.

The project is structured as twelve numbered phases, each a directory in the public repository. Phases 0–5 build the foundation. Phases 6–10 are the image-conditioned weight-space line — and all five fail. Phases 11–12 are the pivot — and both work. The three architectural families under test: (1) per-shape neural-network weight prediction via diffusion in raw 54,785-dimensional weight space; (2) weight-space autoencoders that compress per-shape MLP weights into low-dimensional codes for downstream diffusion; (3) DeepSDF-style joint optimisation of a single shared decoder with per-shape latent codes.

Phase | What it does | Status
phase0 | Anchor SIREN / base SDF architecture | Foundation
phase1 / 1.5 | Per-shape SIREN decoders + warm-start permutation alignment | Foundation
phase1_relu_pershape | Per-shape ReLU+PE decoders — chosen for higher weight-perturbation tolerance | Foundation
phase2 | Early weight-space DiT | Partial
phase3_hypernet | Per-layer hypernetwork — the prior topic-03 work | Success at its scope (unconditional)
phase4 | Latent diffusion / VAE+diffusion over hypernet outputs | Partial
phase5_image_cond | First image-conditioned attempt (single-view, SIREN family) | Partial
phase6_image_cond | Image-conditioned weight-space DiT (~132 M params), single-view | FAILED — mode collapse to 4-shape attractor cluster
phase7_multiview_cond | Multi-view extension, K ∈ [1, 8] views + camera poses | FAILED — same collapse
phase8_no_standardization | Phase 7 ablation removing per-dim weight standardization | FAILED — rules out standardization
phase9_no_dropout | Phase 7 ablation removing CFG / pose dropout | FAILED — rules out dropout
phase10_weight_autoencoder | Compress weights via autoencoder, then diffuse | FAILED visually despite cos(rec, true) = 0.997
phase11_deepsdf | DeepSDF shared decoder (~1.95 M params) + per-shape 64-dim latents, jointly optimised | WORKS — clean reconstructions across all 976 shapes
phase12_image_to_latent | Image-conditioned latent DiT (~10 M params) — predicts 64-dim latent from DINOv2 features + poses | WORKS — perfect recall + category-appropriate OOD at 976 shapes
02 — Data & Common Pipeline

976 watertight Objaverse-LVIS shapes, rendered and SDF-sampled.

Every phase shares the same data foundation. The dataset is a 1,000-shape curated subset of Objaverse, filtered to the LVIS category vocabulary — common objects (table, chair, lamp, bottle), wildlife (dog, lion, beetle), tools (toothbrush, sharpie, pacifier), vehicles (cabin_car, surfboard), and long-tail oddities (banjo, escargot, signboard, Tabasco_sauce). After watertight conversion and SDF sampling, 976 shapes came through clean across both stages and form the working set. Each shape carries an integer obj_idx (0–975), its original Objaverse hash UID, and an LVIS category label, bidirectionally mapped in manifest.json.

Stage | Method | Output
Watertight conversion | Houdini VDB pipeline — scatter ~1 M surface points, voxelise via VDB-from-particles (~1 M voxels), VDB → polygons | Closed manifold meshes
SDF sampling | mesh-to-sdf, 200,000 query points per shape (50 % near-surface, 50 % uniform in the unit cube), normalised to a unit bounding sphere | 976 × obj_NNNN.npz, ~3 GB
Multi-view renders | 64 viewpoints on a Fibonacci sphere at distance 2.5, pyrender + EGL, gray-blue PBR material, headlight + rim light, 224 × 224 RGB | 976 × 64 PNG renders, ~1 GB · 199 pre-existed, 777 rendered fresh (~13 h)
Image features | DINOv2-base/14 — 768-dim CLS token per (shape, view), ImageNet-normalised | Feature cache, 192 MB
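The 64-viewpoint layout is a golden-angle (Fibonacci) spiral on a sphere of radius 2.5. The repo's exact lattice convention isn't reproduced here; this is the standard construction, sketched in NumPy:

```python
import numpy as np

def fibonacci_sphere(n=64, radius=2.5):
    """Near-uniform viewpoints via the golden-angle spiral.
    The render pipeline's exact offset convention may differ slightly;
    this is the common textbook form."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0))     # golden angle, ~2.39996 rad
    z = 1.0 - 2.0 * (i + 0.5) / n          # z strictly inside (-1, 1)
    r = np.sqrt(1.0 - z * z)               # radius of each horizontal circle
    x, y = r * np.cos(phi * i), r * np.sin(phi * i)
    return radius * np.stack([x, y, z], axis=-1)

cams = fibonacci_sphere(64, radius=2.5)
print(cams.shape)                                       # (64, 3)
print(np.allclose(np.linalg.norm(cams, axis=1), 2.5))   # True: all at distance 2.5
```

Each viewpoint then looks at the origin, where the unit-sphere-normalised shape sits.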
03 — The Failed Line (Phases 1, 6–10)

Eight iterations of careful ablation. The diagnostic is the contribution.

Phase 1 trains one ReLU+PE decoder per shape — point → positional encoding (6 bands → 39-dim) → four Linear(128)+ReLU layers → Linear(1), 54,785 parameters each. ReLU+PE was chosen over SIREN deliberately: in-house perturbation benchmarks showed ReLU+PE decoders tolerate much larger weight perturbations before reconstruction collapses (relative σ ≥ 0.34 versus SIREN's ≤ 0.17) — and a downstream diffusion model will produce noisy weight vectors. All 976 decoders are warm-started from a single anchor (trained on obj_0000, a coffee table), because warm-starting keeps every per-shape decoder in the same permutation neighbourhood — necessary for the weight-space interpolation the prior topic-03 work depended on. Final per-shape losses: median 0.00185, 95th-percentile 0.00409.
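The architecture's dimensions can be verified by arithmetic: 6 frequency bands give 3 + 3·2·6 = 39 input dims, and the four 128-wide hidden layers plus the scalar head total exactly 54,785 parameters. A quick check:

```python
def pe_dim(bands=6, in_dim=3):
    # raw coordinates + (sin, cos) per band per coordinate
    return in_dim + in_dim * 2 * bands

def linear_params(n_in, n_out):
    return n_in * n_out + n_out            # weight matrix + bias

d = pe_dim(6)                              # 39
total = (linear_params(d, 128)             # PE -> first hidden layer
         + 3 * linear_params(128, 128)     # three more 128-wide hidden layers
         + linear_params(128, 1))          # SDF head
print(d, total)                            # 39 54785
```

This is the 54,785-dimensional space every weight-space phase diffuses in.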

Phases 6–10 are the image-conditioned weight-space line. Phase 6 trains a ~132 M-parameter Diffusion Transformer to predict the 54,785-dim weight vector from a single image; it chunks the weight vector into 8 tokens, cross-attends to view tokens (DINOv2 CLS ⊕ sinusoidal pose embedding), uses a cosine T = 500 schedule with x₀-prediction and DDIM 50-step sampling. Phase 7 extends to multi-view (K ∈ [1, 8]). Both collapse. Phases 8 and 9 are ablations; phase 10 is the autoencoder rescue attempt. All five fail.

Phase | Hypothesis under test | Result
6 — image-cond DiT | A ~132 M-param image-conditioned weight-space DiT will track the conditioning image | Mode collapse — predictions land on a 4-shape attractor cluster (obj_0054, 0055, 0172, 0000) regardless of input
7 — multi-view | More views (K up to 8) give a stronger conditioning signal that breaks the collapse | Same collapse. Training loss EMA plateaus at 0.198 standardised MSE around step 15 K
8 — no standardization | Per-dimension weight standardization (124 of 54,785 dims had std < 10⁻⁴) distorts the per-shape signal | Same collapse — rules out standardization
9 — no dropout | Classifier-free-guidance / pose dropout destroys the conditioning during training | Same collapse — rules out dropout
10 — weight autoencoder | Compress the weights with an AE first (54,785 → 128/256 dim), diffuse in the smaller space | Numerically excellent (cos 0.997), visually destroyed meshes
Diagnostic 1 — Where does the prediction land?
Inference is run on four trained shapes and each predicted weight vector is ranked against all 199 training latents by cosine similarity. For obj_0119: cos(pred, true) = 0.9698, but the top-5 nearest training latents to the prediction are obj_0054, 0055, 0172, 0000 at cos ≈ 0.987 — closer to the attractors than to the true target. And across all four test shapes, the same four attractor shapes top the list regardless of input image. The DiT has not collapsed to noise — it has collapsed to the centroid of the training distribution, with image conditioning supplying only a ~3 % directional nudge that is too weak to reach the target.
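The ranking diagnostic itself is only a few lines: cosine-rank a predicted vector against the bank of training vectors and inspect the top-k. A sketch of the procedure, with random data standing in for the released checkpoints:

```python
import numpy as np

def top_k_by_cosine(pred, bank, k=5):
    """Rank a predicted vector against a bank of training vectors by cosine."""
    pn = pred / np.linalg.norm(pred)
    bn = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    cos = bn @ pn
    idx = np.argsort(-cos)[:k]             # indices of the k nearest vectors
    return idx, cos[idx]

rng = np.random.default_rng(0)
bank = rng.normal(size=(199, 54785))       # stand-in for 199 training vectors
pred = bank[119] + 0.1 * rng.normal(size=54785)  # a healthy prediction near obj 119
idx, cos = top_k_by_cosine(pred, bank)
print(idx[0])                              # 119: target ranks first when conditioning works
```

In the failed phases the top-5 list is the same attractor quartet for every input, which is what makes this one-screen diagnostic decisive.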
Diagnostic 4 — Is the loss uniform across timesteps?
The x₀ loss is measured at 10 diffusion timesteps spanning [0, 450]. It is essentially flat at ~0.19 across every timestep — and critically, at t = 0 (input is the near-clean target plus a trace of noise) the loss is still 0.189: the model cannot even reproduce a near-clean input. A healthy diffusion model has a characteristic non-flat loss profile. A flat one is the signature of a model that has learned the marginal mean and nothing conditional. This rules out "x₀-prediction is the wrong target" and rules out "just train longer" — the floor is structural, not optimisation-bound.
Diagnostic 5 — The warm-start dominance problem
The geometry of the 976 trained weight vectors, measured directly: mean L2 norm 13.39 (std 0.10 — just 0.7 % of the mean); mean pairwise cosine similarity 0.9606 (min/max 0.927 / 1.000); the three attractor shapes have cos-to-population-mean = 0.995 — they are literally the most central shapes in the dataset. Variance is spread impossibly thin: the top 10 dimensions capture 0.2 % of the variance, the top 1,000 capture 11.2 %, and it takes 25,000 dimensions to reach 84.7 %. This is the warm-start dominance problem. The shared anchor initialisation — necessary for coherent weight-space interpolation — concentrates the entire training distribution into a thin shell where per-shape signal is buried under shared structure. The DiT learns the easy "predict the anchor" minimum and never gets the gradient signal to escape it.
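The warm-start geometry is easy to reproduce synthetically: vectors of the form anchor + small per-shape perturbation all sit in a thin shell with near-unit pairwise cosine. A sketch under the stated assumption that fine-tuning moves each decoder only slightly from the shared anchor (the 0.2 noise scale is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 54785
anchor = rng.normal(size=d)
anchor /= np.linalg.norm(anchor)
# 200 "per-shape" vectors: one shared anchor plus small independent
# perturbations, mimicking warm-started fine-tuning.
W = anchor + 0.2 * rng.normal(size=(200, d)) / np.sqrt(d)
Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
C = Wn @ Wn.T                               # pairwise cosine matrix
mean_cos = (C.sum() - np.trace(C)) / (200 * 199)
print(round(mean_cos, 3))                   # ~0.96: the thin warm-started shell
```

With perturbations at ~20 % of the anchor norm, the expected pairwise cosine is 1/(1 + 0.04) ≈ 0.96, matching the 0.9606 measured on the real decoders: the shared anchor, not the per-shape signal, dominates every dot product the DiT's loss can see.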

Phase 10 — the autoencoder trap: strong numbers, broken meshes

The phase-10 autoencoder compresses weight residuals (54,785 → 128 or 256 dims) and is numerically excellent — final MSE 2.28 × 10⁻⁵, cos(rec, true) mean 0.9965, and the latent-space pairwise cosine drops from 0.96 all the way to 0.07 (different shapes mapped to near-orthogonal directions). By every metric an autoencoder optimises, this is excellent compression. The meshes say otherwise. The residual 0.3–0.5 % error is not uniformly distributed — it lands on different dimensions for different shapes, and ReLU+PE decoders are non-uniformly sensitive to which dimensions absorb it. The figures below are the actual phase-10 results from the released checkpoints.

obj_0100 turkey — watertight ground-truth mesh
obj_0100 turkey · ground truth. A watertight Objaverse-LVIS mesh with thin legs and a distinct neck and head. Its phase-10 AE reconstruction loses the neck, head and legs at cos = 0.994 — the topology is gone even though the number is excellent.
obj_0000 table — watertight ground-truth mesh
obj_0000 table · ground truth. The coffee table the warm-start anchor was trained on. Its phase-10 AE reconstruction failed marching cubes entirely — the decoded SDF had range [−1.07, −0.06], no zero crossing, no surface at all.
obj_0050 vase — phase-10 autoencoder reconstruction
obj_0050 vase · AE reconstruction. One of the few shapes that survives. The vase's simple convex geometry means the reconstruction error lands on dimensions that produce small surface displacements rather than topology changes — it still reads as a vase, just rougher.
obj_0119 — phase-10 autoencoder reconstruction, a quadruped
obj_0119 · AE reconstruction. A four-legged animal — lumpy and surface-rough, but the gross topology (four legs, body, tail) holds. Note: this file is named turkey_rec in the repo but actually renders obj_0119_rec.obj — labelled here by content.
The lesson, stated bluntly in the write-up
Numerical metrics ≠ mesh quality. A weight-space reconstruction cosine of 0.997 can correspond to a mesh that failed marching cubes entirely. An autoencoder trained on weight MSE has no signal about which perturbations are catastrophic in SDF space. The project made it a hard rule: never declare success on numerical metrics alone, and specifically stress-test shapes with thin or topologically complex geometry — wagon wheels, multi-part figures — as the most informative cases.
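The trap reduces to one line of arithmetic: in a 54,785-dim vector, a perturbation that is negligible by cosine can still move a single decisive parameter, say an output bias, far enough to push the whole SDF past its zero crossing. A toy demonstration (an analytic sphere SDF standing in for a decoder):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=54785)
w_rec = w.copy()
w_rec[-1] += 1.5          # one parameter (imagine the SDF head's bias) shifts

cos = w @ w_rec / (np.linalg.norm(w) * np.linalg.norm(w_rec))
print(round(cos, 4))       # ~1.0: "numerically excellent" reconstruction

# If that parameter acts as an additive bias on the SDF, the surface dies:
pts = rng.uniform(-1, 1, size=(100_000, 3))
sdf_true = np.linalg.norm(pts, axis=1) - 0.5   # sphere: has a zero crossing
sdf_rec = sdf_true + 1.5                       # biased: strictly positive
print(sdf_true.min() < 0 < sdf_true.max())     # True  -> marching cubes finds a surface
print(sdf_rec.min() < 0 < sdf_rec.max())       # False -> no zero crossing, no mesh
```

This is exactly the obj_0000 failure above in miniature: a high-cosine weight reconstruction whose decoded SDF never crosses zero.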
The Pivot

Stop extracting the 64-dim signal from a 54,785-dim weight space.
Constrain it to 64 dims by construction instead.

The weight-space line spent eight phases trying to extract a thin per-shape signal out of a high-dimensional weight vector that was 96 % shared anchor. The DeepSDF pivot inverts the problem: instead of discovering a low-dimensional per-shape code inside the weight space, it defines a 64-dim latent up front and jointly optimises it with a single shared decoder. The "compression" from 54,785 → 64 is not learned post-hoc by an autoencoder — it is enforced by the training procedure. The shared decoder must use the latent to differentiate shapes, because it has no per-shape weights to fall back on.

04 — The Working Pipeline (Phases 11–12)

DeepSDF shared decoder + image-conditioned latent diffusion.

Phase 11 is DeepSDF. A single shared decoder — 8 hidden ReLU layers of width 512, with a DeepSDF-style skip connection that re-injects the input at the middle layer, ~1.95 M parameters — takes concat(latent₆₄, PE(point)₃₉) = 103 input dimensions and predicts an SDF scalar. Each shape's 64-dim latent is a learnable parameter, initialised from 𝒩(0, 0.01²) and optimised jointly with the decoder (separate Adam learning rates — 5 × 10⁻⁴ decoder, 10⁻³ latents — plus a 10⁻⁵ L2 regulariser on the latent norms). Training objective is clamped-L1 on the SDF, 4 shapes and 8,192 points per shape per step.
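The objective and the auto-decoder setup can be sketched directly. The clamp threshold δ is not stated in the write-up, so δ = 0.1 (the DeepSDF paper's value) is an assumption here:

```python
import numpy as np

def clamped_l1(pred, target, delta=0.1):
    """DeepSDF loss: L1 between SDF values clamped to [-delta, delta].
    delta=0.1 is an ASSUMPTION (the DeepSDF paper's value); the write-up
    does not state the threshold actually used."""
    c = lambda x: np.clip(x, -delta, delta)
    return np.abs(c(pred) - c(target)).mean()

rng = np.random.default_rng(0)
# Auto-decoder setup: the per-shape latents are learnable *parameters*,
# initialised from N(0, 0.01^2) and optimised jointly with the decoder.
latents = 0.01 * rng.normal(size=(976, 64))

pred = 0.05 * rng.normal(size=8192)        # 8,192 SDF queries per shape per step
target = pred + 0.01
print(clamped_l1(pred, target) <= 0.01)    # True: clamping can only shrink the error
```

The clamp focuses capacity near the surface, which is all marching cubes ever sees, and the tiny latent init keeps the 64-dim codes inside the L2-regularised ball the image-conditioned DiT will later have to hit.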

Phase 12 trains the image-conditioning head. The target is now the clean 64-dim latent, not a thin weight residual. A Diffusion Transformer — d_model 384, 4 layers, 6 heads, ~10 M parameters versus Phase 7's 132 M, with a prediction target ~800× smaller (64 dims versus 54,785) — treats the latent as a single token, cross-attends to multi-view tokens (DINOv2 CLS-768 ⊕ 64-dim sinusoidal pose embedding), with AdaLN modulation on the diffusion timestep. Cosine T = 500 schedule, x₀-prediction, K ∈ [1, 8] views sampled per batch, 15 K training steps. CFG/pose dropout is disabled — Phase 9 confirmed it was never the cause of the collapse.
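The schedule and the prediction target can be sketched in a few lines. The cosine schedule's offset constant is not stated in the write-up, so s = 0.008 (the common Nichol & Dhariwal choice) is an assumption:

```python
import numpy as np

def cosine_alpha_bar(T=500, s=0.008):
    """Cumulative signal level for a cosine noise schedule.
    s=0.008 is an ASSUMPTION (the usual offset); the write-up does not
    state the exact constant used."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]                        # normalised so alpha_bar(0) = 1

ab = cosine_alpha_bar()
assert np.all(np.diff(ab) < 0)             # signal decays monotonically to ~0

# x0-prediction: the DiT outputs the clean latent z directly. The noised
# input at timestep t is x_t = sqrt(ab_t) * z + sqrt(1 - ab_t) * eps.
rng = np.random.default_rng(0)
z, eps, t = rng.normal(size=64), rng.normal(size=64), 250
x_t = np.sqrt(ab[t]) * z + np.sqrt(1.0 - ab[t]) * eps
# Given a prediction z_hat, the implied noise (what DDIM steps on) is:
z_hat = z                                  # pretend-perfect prediction
eps_hat = (x_t - np.sqrt(ab[t]) * z_hat) / np.sqrt(1.0 - ab[t])
print(np.allclose(eps_hat, eps))           # True: x0-pred determines the noise
```

With a 64-dim target and x₀-prediction, the Diagnostic-4 failure mode from the weight-space line (a flat loss even at t = 0) has nowhere to hide: a near-clean 64-dim input must be reproducible or training is visibly broken.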

Pipeline: image(s), K ∈ [1, 8] → DINOv2 ViT-B/14 CLS (768-dim per view) → DiT (phase 12, ~10 M, x₀-pred) → 64-dim latent z (shape code) → DeepSDF decoder (8×512 MLP, 1.95 M) → SDF → marching cubes → mesh (.obj). Phase 11 trains the shared decoder + per-shape latents jointly; phase 12 trains the DiT to predict the latent from image features. Training: phase 11 ~140 min (976 shapes, 1,500 epochs), phase 12 ~7 min. All on a single Vast.ai NVIDIA RTX 5060 Ti (16 GB).

Phase 11 — the two pilots: capacity is the difference

The first phase-11 pilot used a small decoder — hidden 256, 4 layers, 800 epochs. The loss numbers looked healthy (SDF L1 mean 0.00613, latent pairwise cos −0.04 — essentially orthogonal) but the meshes were blob-quality: the decoder lacked the capacity to represent 20 distinct shapes through a 64-dim latent.

Phase 11 initial pilot — obj_0007, broken thin blob
Initial pilot · obj_0007. Broken. A thin column with a bulbous mass — the decoder (256×4, 800 ep) cannot resolve the shape.
Phase 11 initial pilot — obj_0009, broken pitted sphere
Initial pilot · obj_0009. Broken. A pitted sphere full of spurious holes — nothing like the wagon wheel it should be.
Phase 11 initial pilot — obj_0010, broken amorphous column
Initial pilot · obj_0010. Broken. An amorphous column. Loss values "looked reasonable" — the meshes did not.

Increasing decoder capacity to hidden 512 × 8 layers (~1.95 M params) and training for 4,000 epochs produces clean reconstructions across all 20 pilot shapes — SDF L1 mean drops from 0.00613 to 0.00051, max to 0.00090, in ~13 minutes. Same data, same 64-dim latent — capacity was the difference.

Phase 11 scaled pilot — obj_0003, clean sword/paddle reconstruction
Scaled pilot · obj_0003 sword. Clean. Decoder 512×8, 4000 epochs. The flat blade and the handle are both crisp.
Phase 11 scaled pilot — obj_0006, clean pants reconstruction
Scaled pilot · obj_0006 pants. Clean. The two legs, waist opening and the hollow interior are all preserved.
Phase 11 scaled pilot — obj_0009, clean wagon wheel reconstruction
Scaled pilot · obj_0009 wagon wheel. Clean. Twelve-plus spokes, the central hub and the outer rim — the razor-sharp thin geometry that failed in every weight-space phase.

Scaling to all 976 shapes (same 512×8 architecture, 1,500 epochs, ~140 min) keeps every shape clean: SDF L1 mean 0.00212, max 0.00593 — about 4× the 20-shape mean, but no shape catastrophically fails. The latent pairwise cosine rises only to 0.12 — the 64-dim space has room for 976 distinguishable shapes.

05 — Results (Phase 12)

Perfect recall. And — at 976 shapes — genuine OOD generalization.

On training shapes the phase-12 pipeline has essentially perfect recall. The 20-shape pilot reaches a training-loss EMA of 8 × 10⁻⁶ and cos(pred, true) = 1.0000 for every tested shape, with a > 0.4 cosine margin to the second-nearest training latent. At 976 shapes the training EMA is 5.1 × 10⁻³ — deliberately higher, because the model can no longer memorise — and recall is still strong: obj_0009 the wagon wheel round-trips cleanly through the full image → DINOv2 → DiT → decoder → marching-cubes pipeline.

Phase 12 recall — obj_0009 wagon wheel, K=8 views
Recall · obj_0009 wagon wheel · K = 8. Full pipeline. Image → DINOv2 → DiT → 64-dim latent → DeepSDF decoder → marching cubes. Spokes, hub and rim all survive the round trip.
Phase 12 recall — obj_0009 wagon wheel, alternate render
Recall · obj_0009 wagon wheel · alternate render. Comparable to the direct latent decode. The thin-geometry recall is the discriminating test — every weight-space phase destroyed exactly this.

Out-of-distribution — the central result

The most important result of the project is the qualitative shift in OOD behaviour between scales. At 20 training shapes the model is a pure nearest-neighbour retriever in DINOv2 feature space: feed it a tunnel and it returns a clean — but completely wrong — humanoid, because there is no "tunnel-like" region in a 20-shape latent space for it to land in.

20-shape OOD failure — tunnel input retrieves a humanoid training shape
20-shape OOD · tunnel input → humanoid output. Pure retrieval failure. The 20-shape model snaps the tunnel input to its nearest training shape (a humanoid, top-1 cos 0.966) — the output is clean because it is a memorised training shape, and bears no resemblance to a tunnel.
Tunnel input mesh — a long thin rod
The tunnel input. A long thin rod (~3,900 verts), never seen in training. Fed to the 976-shape model below.

At 976 shapes the latent space has acquired enough semantic structure that DINOv2 features can navigate it. The same three never-seen inputs — a posed humanoid, a thin tunnel, a head-and-shoulders bust — now produce category-appropriate output. The surfaces are rough; the topology is right.

input · never seen
OOD input — posed humanoid figure
phase-12 output · K=8
OOD output — rough humanoid mesh
OOD · humanoid. Head, shoulders, outstretched arms, torso, legs, feet. Surface is rough — no fingers, mushy face — but the topology is unmistakably humanoid.
input · never seen
OOD input — thin tunnel rod
phase-12 output · K=8
OOD output — elongated rod mesh
OOD · tunnel. An elongated rod. The 976-shape model recognises the long-thin geometry — compare the 20-shape result above, which collapsed to a humanoid entirely.
input · never seen
OOD input — head-and-shoulders bust
phase-12 output · K=8
OOD output — rough head mesh
OOD · head bust. Head-on-shoulders topology with eye-socket-like depressions. Below the head the neck/shoulder region degrades into noise — the model learned "head-like" but not "cleanly attached to a torso".
obj_0050 vase reconstruction at K=8
obj_0050 vase · K = 8. A jar/vase with a lidded top — convex body reconstructs cleanly, the lip is rougher.
obj_0119 reconstruction at K=8 — a clothed figure
obj_0119 · K = 8. A clothed standing figure — head, torso, a flared skirt and feet. The skirt's lower edge frays into noise; the upper body holds.
obj_0150 reconstruction at K=8 — a garment-like shape
obj_0150 · K = 8. A sleeved garment-like form. Gross silhouette is recovered; the hem and sleeves dissolve into stringy artefacts — the expected failure mode at this data scale.
Metric | 20-shape pilot | 976-shape run
Phase 11 SDF L1 (mean / max) | 0.00051 / 0.00090 | 0.00212 / 0.00593
Phase 11 latent pairwise cosine | −0.04 (orthogonal) | 0.12 (room to spare)
Phase 12 training-loss EMA | 8 × 10⁻⁶ (memorises) | 5.1 × 10⁻³ (cannot memorise)
Recall | cos(pred, true) = 1.0000 | Clean full-pipeline recall
OOD behaviour | Pure nearest-neighbour retrieval | Category-appropriate generation
Honest framing — what this is and is not
This is not yet "real" image-to-3-D in the sense of Shap-E or Get3D. The reconstructions are rough, fine detail is lost, and category-appropriate does not mean shape-faithful. But it is qualitatively beyond pure retrieval, achieved at ~10⁻³ the data scale of those large-scale systems. The write-up's one-sentence conclusion: "at small data scale, structural inductive biases that constrain the prediction space beat learned compression of an unconstrained representation, every time."

Interactive Demo · Live

The actual phase 11 + 12 pipeline, running. This is the project's own HuggingFace Space embedded below — upload 1–8 images of an object and it returns a 3-D mesh, running the real DINOv2 → DiT → DeepSDF decoder → marching-cubes pipeline on the released 976-shape checkpoints. If the embed is slow to wake (HF Spaces sleep when idle), open it directly on HuggingFace ↗.

Full Technical Paper

White paper · twelve-phase image-to-3-D research archive · the warm-start dominance diagnostic · the autoencoder trap · the DeepSDF pivot · 20 → 976-shape scaling · grounded in the 30-page thesis

Read Paper →
Related Thesis Chapters
x-Prediction Analysis
The manifold-hypothesis argument behind phase 12's x₀-prediction choice. The DeepSDF latent is the manifold; the DiT predicts it directly.
SDF Research
The foundational SDF study — the eikonal constraint and the GAN-SDF dead-end recorded there is why phase 11 uses a DeepSDF decoder, not a GAN.
Triplane Deep Dive
The competing universal-intermediate representation. Hypernet → DeepSDF chose a 64-dim global latent over a triplane; future work proposes a triplane-hybrid latent for finer detail.
Appendix — Raw Materials
Transcripts & Source References
Restricted Access