The most recent thesis-line project is also the largest: twelve experimental phases, several thousand GPU-hours, and a 30-page write-up titled "From Weight-Space Diffusion to Latent-Space DeepSDF". The starting hypothesis: a 3-D shape is the weights of a small neural network, so image-to-3-D is the problem of predicting weights from an image. It fails, and fails informatively. The trained per-shape decoder weights occupy a thin warm-started shell of the 54,785-dimensional weight space (mean pairwise cosine 0.96), and the image-conditioned diffusion model collapses to a 4-shape attractor cluster regardless of input. The pivot, DeepSDF (a single shared decoder jointly optimised with one 64-dim latent per shape), works: clean reconstructions across all 976 training shapes, and genuine category-level out-of-distribution generation that was entirely absent at the 20-shape pilot scale.
The project starts from a genuinely exciting hypothesis, inherited
from prior lab work on hypernetworks and weight-space learning. If a
3-D shape is encoded as the parameters of a small MLP that defines
its signed distance function — a network f_θ : ℝ³ → ℝ
whose weights θ are the shape — then 3-D
generation reduces to predicting MLP parameters, and image-to-3-D
reduces to high-dimensional vector regression from an image. The
framing is representation-free: no triplanes, no voxels, no mesh
topology. The shape is the network, and it composes naturally with
diffusion models.
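The core operation behind "the shape is the network" is flattening an MLP's parameters into a single vector that a diffusion model can generate. A minimal sketch, assuming PyTorch; the toy decoder sizes here are illustrative, not the project's 54,785-parameter architecture:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# Toy per-shape SDF decoder: point in R^3 -> signed distance.
# (Sizes are illustrative, not the project's actual decoder.)
def make_decoder():
    return nn.Sequential(
        nn.Linear(3, 32), nn.ReLU(),
        nn.Linear(32, 32), nn.ReLU(),
        nn.Linear(32, 1),
    )

net = make_decoder()

# "The shape is the network": flatten all weights into one vector --
# this vector is what a weight-space diffusion model would predict.
theta = parameters_to_vector(net.parameters())
assert theta.numel() == 1217  # 3*32+32 + 32*32+32 + 32+1

# Round-trip: load a (possibly generated) weight vector into a fresh MLP.
net2 = make_decoder()
vector_to_parameters(theta, net2.parameters())
x = torch.randn(5, 3)
assert torch.allclose(net(x), net2(x))  # identical SDF -> identical shape
```

Image-to-3-D then reduces to regressing `theta` from image features, which is exactly what phases 6–10 attempt.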
The real research question Topic 41 set out to answer: what architectural choices matter when you train image-to-3-D systems with ~10³ shapes rather than the ~10⁵–10⁷ of large published systems? Data scarcity — a few hundred to a few thousand domain-specific shapes — is the realistic operating regime for most research and applied work, and it is the regime the Apple Maps procedural-modelling thesis line lives in.
The honest answer the project arrived at, after eight architectural iterations of the weight-space line, is that the weight-space hypothesis does not survive at this data scale, and the value is in the precision of that finding. The failures localise to one structural cause (§03), and that cause pointed straight at the pivot (§05) that works. The write-up's own framing: "the failure modes are themselves the contribution."
The project is structured as twelve numbered phases, each a directory in the public repository. Phases 0–5 build the foundation. Phases 6–10 are the image-conditioned weight-space line — and all five fail. Phases 11–12 are the pivot — and both work. The three architectural families under test: (1) per-shape neural-network weight prediction via diffusion in raw 54,785-dimensional weight space; (2) weight-space autoencoders that compress per-shape MLP weights into low-dimensional codes for downstream diffusion; (3) DeepSDF-style joint optimisation of a single shared decoder with per-shape latent codes.
| Phase | What it does | Status |
|---|---|---|
| phase0 | Anchor SIREN / base SDF architecture | Foundation |
| phase1 / 1.5 | Per-shape SIREN decoders + warm-start permutation alignment | Foundation |
| phase1_relu_pershape | Per-shape ReLU+PE decoders — chosen for higher weight-perturbation tolerance | Foundation |
| phase2 | Early weight-space DiT | Partial |
| phase3_hypernet | Per-layer hypernetwork — the prior topic-03 work | Success at its scope (unconditional) |
| phase4 | Latent diffusion / VAE+diffusion over hypernet outputs | Partial |
| phase5_image_cond | First image-conditioned attempt (single-view, SIREN family) | Partial |
| phase6_image_cond | Image-conditioned weight-space DiT (~132 M params), single-view | FAILED — mode collapse to 4-shape attractor cluster |
| phase7_multiview_cond | Multi-view extension, K ∈ [1,8] views + camera poses | FAILED — same collapse |
| phase8_no_standardization | Phase 7 ablation removing per-dim weight standardization | FAILED — rules out standardization |
| phase9_no_dropout | Phase 7 ablation removing CFG / pose dropout | FAILED — rules out dropout |
| phase10_weight_autoencoder | Compress weights via autoencoder, then diffuse | FAILED visually despite cos(rec, true) = 0.997 |
| phase11_deepsdf | DeepSDF shared decoder (~1.95 M params) + per-shape 64-dim latents, jointly optimised | WORKS — clean reconstructions across all 976 shapes |
| phase12_image_to_latent | Image-conditioned latent DiT (~10 M params) — predicts 64-dim latent from DINOv2 features + poses | WORKS — perfect recall + category-appropriate OOD at 976 shapes |
Every phase shares the same data foundation. The dataset is a
1,000-shape curated subset of Objaverse, filtered to the LVIS
category vocabulary — common objects (table, chair, lamp, bottle),
wildlife (dog, lion, beetle), tools (toothbrush, sharpie, pacifier),
vehicles (cabin_car, surfboard), and long-tail oddities (banjo,
escargot, signboard, Tabasco_sauce). After watertight conversion and
SDF sampling, 976 shapes came through clean across
both stages and form the working set. Each shape carries an integer
obj_idx (0–975), its original Objaverse hash UID, and an
LVIS category label, bidirectionally mapped in manifest.json.
| Stage | Method | Output |
|---|---|---|
| Watertight conversion | Houdini VDB pipeline — scatter ~1 M surface points, voxelise via VDB-from-particles (~1 M voxels), VDB → polygons | Closed manifold meshes |
| SDF sampling | mesh-to-sdf, 200,000 query points per shape (50 % near-surface, 50 % uniform in the unit cube), normalised to a unit bounding sphere | 976 × obj_NNNN.npz, ~3 GB |
| Multi-view renders | 64 viewpoints on a Fibonacci sphere at distance 2.5, pyrender + EGL, gray-blue PBR material, headlight + rim light, 224 × 224 RGB | 976 × 64 PNG renders, ~1 GB · 199 pre-existed, 777 rendered fresh (~13 h) |
| Image features | DINOv2-base/14 — 768-dim CLS token per (shape, view), ImageNet-normalised | Feature cache, 192 MB |
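The render stage places 64 cameras on a Fibonacci sphere at distance 2.5. A minimal sketch of that viewpoint layout; the exact constants and axis conventions of the repo's renderer are assumptions:

```python
import numpy as np

def fibonacci_sphere(n=64, radius=2.5):
    """Camera positions on a Fibonacci sphere, as used for the 64 renders.
    (Sketch: the repo's exact conventions may differ.)"""
    i = np.arange(n)
    golden = (1 + 5 ** 0.5) / 2
    z = 1 - (2 * i + 1) / n          # evenly spaced heights in (-1, 1)
    r = np.sqrt(1 - z ** 2)          # ring radius at each height
    phi = 2 * np.pi * i / golden     # golden-angle azimuth steps
    pts = np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)
    return radius * pts

cams = fibonacci_sphere()
assert cams.shape == (64, 3)
assert np.allclose(np.linalg.norm(cams, axis=1), 2.5)
```

Golden-angle spacing gives near-uniform coverage of the sphere with no clustering at the poles, which is why it is the standard choice for multi-view capture rigs.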
Phase 1 trains one ReLU+PE decoder per shape —
point → positional encoding (6 bands → 39-dim) → four
Linear(128)+ReLU layers → Linear(1), 54,785 parameters each.
ReLU+PE was chosen over SIREN deliberately: in-house perturbation
benchmarks showed ReLU+PE decoders tolerate much larger weight
perturbations before reconstruction collapses (relative σ ≥ 0.34
versus SIREN's ≤ 0.17) — and a downstream diffusion model
will produce noisy weight vectors. All 976 decoders are
warm-started from a single anchor (trained on obj_0000,
a coffee table), because warm-starting keeps every per-shape decoder
in the same permutation neighbourhood — necessary for the
weight-space interpolation the prior topic-03 work depended on.
Final per-shape losses: median 0.00185, 95th-percentile 0.00409.
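The described architecture can be sketched directly, and the parameter count checks out: PE maps 3 dims to 3 + 3·2·6 = 39, and four Linear(128)+ReLU layers plus the output head land exactly on 54,785 parameters. The precise sin/cos frequency convention is an assumption; the layer sizes are the write-up's:

```python
import torch
import torch.nn as nn

class PosEnc(nn.Module):
    """6-band sin/cos positional encoding: 3 -> 3 + 3*2*6 = 39 dims.
    (Frequency base 2^k * pi is an assumed convention.)"""
    def __init__(self, bands=6):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(bands) * torch.pi)

    def forward(self, x):                          # x: (N, 3)
        f = x[..., None] * self.freqs              # (N, 3, 6)
        enc = torch.cat([torch.sin(f), torch.cos(f)], dim=-1)
        return torch.cat([x, enc.flatten(-2)], dim=-1)  # (N, 39)

# PE(39) -> four Linear(128)+ReLU -> Linear(1), as described above.
decoder = nn.Sequential(
    PosEnc(),
    nn.Linear(39, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

n_params = sum(p.numel() for p in decoder.parameters())
assert n_params == 54_785   # the weight-space dimensionality of phases 6-10
```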
Phases 6–10 are the image-conditioned weight-space line. Phase 6 trains a ~132 M-parameter Diffusion Transformer to predict the 54,785-dim weight vector from a single image; it chunks the weight vector into 8 tokens, cross-attends to view tokens (DINOv2 CLS ⊕ sinusoidal pose embedding), uses a cosine T = 500 schedule with x₀-prediction and DDIM 50-step sampling. Phase 7 extends to multi-view (K ∈ [1, 8]). Both collapse. Phases 8 and 9 are ablations; phase 10 is the autoencoder rescue attempt. All five fail.
| Phase | Hypothesis under test | Result |
|---|---|---|
| 6 — image-cond DiT | A ~132 M-param image-conditioned weight-space DiT will track the conditioning image | Mode collapse — predictions land on a 4-shape attractor cluster (obj_0054, 0055, 0172, 0000) regardless of input |
| 7 — multi-view | More views (K up to 8) give a stronger conditioning signal that breaks the collapse | Same collapse. Training loss EMA plateaus at 0.198 standardised MSE around step 15 K |
| 8 — no standardization | Per-dimension weight standardization (124 of 54,785 dims had std < 10⁻⁴) distorts the per-shape signal | Same collapse — rules out standardization |
| 9 — no dropout | Classifier-free-guidance / pose dropout destroys the conditioning during training | Same collapse — rules out dropout |
| 10 — weight autoencoder | Compress the weights with an AE first (54,785 → 128/256 dim), diffuse in the smaller space | Numerically excellent (cos 0.997), visually destroyed meshes |
A representative diagnostic: obj_0119 reaches cos(pred, true) =
0.9698, but the top-5 nearest training latents to the
prediction are obj_0054, 0055, 0172, 0000 at
cos ≈ 0.987, closer to the attractors than to
the true target. And across all four test shapes, the same
four attractor shapes top the list regardless of input image. The
DiT has not collapsed to noise; it has collapsed to the centroid
of the training distribution, with image conditioning supplying
only a ~3 % directional nudge that is too weak to reach the
target.
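The diagnostic itself is a few lines: rank all training vectors by cosine similarity to the prediction and inspect the top-k. A self-contained sketch on toy data (names hypothetical):

```python
import numpy as np

def topk_cosine_neighbours(pred, bank, k=5):
    """Rank training vectors by cosine similarity to a prediction --
    the kind of probe that exposed the 4-shape attractor cluster."""
    pred = pred / np.linalg.norm(pred)
    bank_n = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = bank_n @ pred
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]

# Toy check: a query near bank[2] should rank it first.
rng = np.random.default_rng(1)
bank = rng.standard_normal((10, 64))
query = bank[2] + 0.05 * rng.standard_normal(64)
idx, sims = topk_cosine_neighbours(query, bank, k=3)
assert idx[0] == 2 and sims[0] > 0.99
```

The collapse signature is the inverse of this sanity check: the same few indices top the list for every query.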
Even at t = 0 (the input
is the near-clean target plus a trace of noise) the loss is still
0.189: the model cannot even reproduce a near-clean input. A
healthy diffusion model has a characteristic non-flat loss
profile. A flat one is the signature of a model that has learned
the marginal mean and nothing conditional. This rules out
"x₀-prediction is the wrong target" and rules out "just train
longer" — the floor is structural, not optimisation-bound.
The phase-10 autoencoder compresses weight residuals (54,785 → 128
or 256 dims) and is numerically excellent — final MSE 2.28 × 10⁻⁵,
cos(rec, true) mean 0.9965, and the latent-space
pairwise cosine drops from 0.96 all the way to 0.07 (different
shapes mapped to near-orthogonal directions). By every metric an
autoencoder optimises, this is excellent compression. The meshes
say otherwise. The residual 0.3–0.5 % error is not
uniformly distributed — it lands on different dimensions for
different shapes, and ReLU+PE decoders are non-uniformly sensitive
to which dimensions absorb it. The figures below are the
actual phase-10 results from the released checkpoints.
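The non-uniform sensitivity is easy to demonstrate: perturb each weight dimension by the same amount and measure the output shift. A toy probe, assuming PyTorch and a small stand-in MLP rather than the released checkpoints:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# Why 0.997 cosine can still wreck a mesh: the decoder's output is not
# uniformly sensitive to which weight dimensions absorb the residual.
torch.manual_seed(0)
def make_net():
    return nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))

net = make_net()
theta = parameters_to_vector(net.parameters()).detach()
x = torch.randn(512, 3)

def output_shift(dim, eps=1e-2):
    """Mean |SDF change| from nudging a single weight dimension by eps."""
    t = theta.clone()
    t[dim] += eps
    net2 = make_net()
    vector_to_parameters(t, net2.parameters())
    with torch.no_grad():
        return (net2(x) - net(x)).abs().mean().item()

shifts = torch.tensor([output_shift(d) for d in range(theta.numel())])
# Identical-size perturbations, wildly different effect per dimension:
assert shifts.max() > 5 * shifts.min()
```

The autoencoder's reconstruction metric treats all 54,785 dimensions equally; the decoder does not, so the same scalar error can be invisible on one shape and catastrophic on another.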
(Labelled turkey_rec in the repo but actually renders obj_0119_rec.obj; labelled here by content.)

Stop extracting the 64-dim signal from a 54,785-dim weight space.
Constrain it to 64 dims by construction instead.
The weight-space line spent eight phases trying to extract a thin per-shape signal out of a high-dimensional weight vector that was 96 % shared anchor. The DeepSDF pivot inverts the problem: instead of discovering a low-dimensional per-shape code inside the weight space, it defines a 64-dim latent up front and jointly optimises it with a single shared decoder. The "compression" from 54,785 → 64 is not learned post-hoc by an autoencoder — it is enforced by the training procedure. The shared decoder must use the latent to differentiate shapes, because it has no per-shape weights to fall back on.
Phase 11 is DeepSDF. A single shared decoder — 8
hidden ReLU layers of width 512, with a DeepSDF-style skip
connection that re-injects the input at the middle layer, ~1.95 M
parameters — takes concat(latent₆₄, PE(point)₃₉) = 103
input dimensions and predicts an SDF scalar. Each shape's 64-dim
latent is a learnable parameter, initialised from
𝒩(0, 0.01²) and optimised jointly with the decoder
(separate Adam learning rates — 5 × 10⁻⁴ decoder, 10⁻³ latents — plus
a 10⁻⁵ L2 regulariser on the latent norms). Training objective is
clamped-L1 on the SDF, 4 shapes and 8,192 points per shape per step.
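The described setup translates almost line-for-line into code. A sketch under stated assumptions: the exact skip wiring and the clamp threshold are not specified in the text, so the variant below (middle layer re-ingests the full 103-dim input; clamp δ = 0.1) is illustrative, but it lands on the ~1.95 M parameter count the write-up reports:

```python
import torch
import torch.nn as nn

class DeepSDFDecoder(nn.Module):
    """Shared decoder: 8 hidden ReLU layers of width 512, input
    concat(latent_64, PE(point)_39) = 103 dims, DeepSDF-style skip
    re-injecting the input at the middle layer (assumed wiring)."""
    def __init__(self, d_in=103, width=512, depth=8):
        super().__init__()
        self.skip_at = depth // 2
        layers = []
        for i in range(depth):
            in_dim = d_in if i == 0 else width
            if i == self.skip_at:
                in_dim = width + d_in   # middle layer re-ingests the input
            layers.append(nn.Linear(in_dim, width))
        self.layers = nn.ModuleList(layers)
        self.out = nn.Linear(width, 1)

    def forward(self, z_pe):            # z_pe: (N, 103)
        h = z_pe
        for i, lin in enumerate(self.layers):
            if i == self.skip_at:
                h = torch.cat([h, z_pe], dim=-1)
            h = torch.relu(lin(h))
        return self.out(h)

dec = DeepSDFDecoder()
n = sum(p.numel() for p in dec.parameters())
assert 1_900_000 < n < 2_000_000        # ~1.95 M, as described

# Jointly optimised per-shape latents, N(0, 0.01^2) init, two LRs,
# with the 1e-5 L2 pull on latent norms expressed as weight decay.
latents = nn.Parameter(0.01 * torch.randn(976, 64))
opt = torch.optim.Adam([
    {"params": dec.parameters(), "lr": 5e-4},
    {"params": [latents], "lr": 1e-3, "weight_decay": 1e-5},
])

def clamped_l1(pred, target, delta=0.1):   # delta is an assumed value
    return (pred.clamp(-delta, delta) - target.clamp(-delta, delta)).abs().mean()
```

The clamp focuses capacity near the surface: far-field SDF values are truncated, so the decoder spends its error budget where marching cubes will read it.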
Phase 12 trains the image-conditioning head. The target is now the clean 64-dim latent, not a thin weight residual. A Diffusion Transformer (d_model 384, 4 layers, 6 heads, ~10 M parameters vs Phase 7's 132 M; the prediction target itself is ~800× smaller, 64 dims vs 54,785) treats the latent as a single token, cross-attends to multi-view tokens (DINOv2 CLS-768 ⊕ 64-dim sinusoidal pose embedding), with AdaLN modulation on the diffusion timestep. Cosine T = 500 schedule, x₀-prediction, K ∈ [1, 8] views sampled per batch, 15 K training steps. CFG/pose dropout is disabled: Phase 9 confirmed it was never the cause of the collapse.
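The conditioning plumbing can be sketched in a few tensors. Assumptions flagged in the comments: the pose featurisation, frequency constants, and AdaLN shapes below are illustrative, not the repo's code; only the dimensions (768 ⊕ 64 view tokens, d_model 384) come from the text:

```python
import torch
import torch.nn as nn

def sinusoidal(x, dim=64):
    """Sinusoidal embedding of a scalar (assumed encoding)."""
    half = dim // 2
    freqs = torch.exp(torch.linspace(0.0, -8.0, half))  # log-spaced
    ang = 1000.0 * x[..., None] * freqs
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

K = 3                                    # views sampled from K in [1, 8]
cls = torch.randn(K, 768)                # DINOv2-base CLS token per view
pose = torch.rand(K)                     # stand-in scalar pose parameter
view_tokens = torch.cat([cls, sinusoidal(pose)], dim=-1)   # (K, 832)

# AdaLN: the diffusion timestep produces per-channel scale and shift
# that modulate the single latent token inside each DiT block.
d_model = 384
t_embed = sinusoidal(torch.tensor([0.5]), dim=d_model)      # (1, 384)
ada = nn.Linear(d_model, 2 * d_model)                       # -> scale, shift
scale, shift = ada(t_embed).chunk(2, dim=-1)
latent_token = torch.randn(1, d_model)
modulated = nn.LayerNorm(d_model)(latent_token) * (1 + scale) + shift
assert view_tokens.shape == (3, 832) and modulated.shape == (1, 384)
```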
The first phase-11 pilot used a small decoder — hidden 256, 4 layers, 800 epochs. The loss numbers looked healthy (SDF L1 mean 0.00613, latent pairwise cos −0.04 — essentially orthogonal) but the meshes were blob-quality: the decoder lacked the capacity to represent 20 distinct shapes through a 64-dim latent.



Increasing decoder capacity to hidden 512 × 8 layers (~1.95 M params) and training for 4,000 epochs produces clean reconstructions across all 20 pilot shapes — SDF L1 mean drops from 0.00613 to 0.00051, max to 0.00090, in ~13 minutes. Same data, same 64-dim latent — capacity was the difference.



Scaling to all 976 shapes (same 512×8 architecture, 1,500 epochs, ~140 min) keeps every shape clean: SDF L1 mean 0.00212, max 0.00593 — about 4× the 20-shape mean, but no shape catastrophically fails. The latent pairwise cosine rises only to 0.12 — the 64-dim space has room for 976 distinguishable shapes.
On training shapes the phase-12 pipeline has essentially perfect
recall. The 20-shape pilot reaches a training-loss EMA of 8 × 10⁻⁶
and cos(pred, true) = 1.0000 for every tested shape,
with a > 0.4 cosine margin to the second-nearest training latent.
At 976 shapes the training EMA is 5.1 × 10⁻³ — deliberately
higher, because the model can no longer memorise — and
recall is still strong: obj_0009, the wagon wheel,
round-trips cleanly through the full
image → DINOv2 → DiT → decoder → marching-cubes pipeline.
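The last stage of that pipeline, sketched with an analytic sphere SDF standing in for the DeepSDF decoder: evaluate the SDF on a dense grid, then run marching cubes (e.g. skimage.measure.marching_cubes) on the grid to extract the zero level set. Only the grid evaluation is shown; grid resolution and batching are assumptions:

```python
import numpy as np

R = 64
axis = np.linspace(-1, 1, R, dtype=np.float32)
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)   # (R^3, 3) queries

def sdf(p):
    """Stand-in for decoder(latent, PE(p)); a real run evaluates the
    network in batches to bound memory."""
    return np.linalg.norm(p, axis=-1) - 0.5

grid = sdf(pts).reshape(R, R, R)
assert grid.shape == (64, 64, 64)
assert grid.min() < 0 < grid.max()   # surface crosses zero -> a mesh exists
# skimage.measure.marching_cubes(grid, level=0.0) would now yield the mesh.
```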


The most important result of the project is the qualitative shift in OOD behaviour between scales. At 20 training shapes the model is a pure nearest-neighbour retriever in DINOv2 feature space: feed it a tunnel and it returns a clean — but completely wrong — humanoid, because there is no "tunnel-like" region in a 20-shape latent space for it to land in.


At 976 shapes the latent space has acquired enough semantic structure that DINOv2 features can navigate it. The same three never-seen inputs — a posed humanoid, a thin tunnel, a head-and-shoulders bust — now produce category-appropriate output. The surfaces are rough; the topology is right.









| Metric | 20-shape pilot | 976-shape run |
|---|---|---|
| Phase 11 SDF L1 (mean / max) | 0.00051 / 0.00090 | 0.00212 / 0.00593 |
| Phase 11 latent pairwise cosine | −0.04 (orthogonal) | 0.12 (room to spare) |
| Phase 12 training-loss EMA | 8 × 10⁻⁶ (memorises) | 5.1 × 10⁻³ (cannot memorise) |
| Recall | cos(pred, true) = 1.0000 | Clean full-pipeline recall |
| OOD behaviour | Pure nearest-neighbour retrieval | Category-appropriate generation |
The actual phase 11 + 12 pipeline, running. This is the project's own HuggingFace Space embedded below — upload 1–8 images of an object and it returns a 3-D mesh, running the real DINOv2 → DiT → DeepSDF decoder → marching-cubes pipeline on the released 976-shape checkpoints. If the embed is slow to wake (HF Spaces sleep when idle), open it directly on HuggingFace ↗.
White paper · twelve-phase image-to-3-D research archive · the warm-start dominance diagnostic · the autoencoder trap · the DeepSDF pivot · 20 → 976-shape scaling · grounded in the 30-page thesis