Recent large-scale image-to-3-D systems (Get3D, Shap-E, CRM, Zero-1-to-3) demonstrate impressive results when trained on hundreds of thousands to millions of shapes. But most research and engineering settings operate under data scarcity: the realistic regime is a curated, often domain-specific training set of a few hundred to a few thousand shapes. This work asks: what architectural choices matter when training image-to-3-D systems with ~10³ shapes rather than ~10⁵?
The starting hypothesis was inherited from prior lab work on hypernetworks and weight-space learning: if each shape is encoded as the parameters of a small MLP that defines its SDF, then 3-D generation can be framed as predicting MLP parameters from images. The framing has theoretical appeal — it makes shape generation isomorphic to high-dimensional vector regression — and it composes naturally with diffusion models. What follows is the empirical story of pursuing that hypothesis through eight architectural iterations, watching it fail in informative ways, and ultimately abandoning it for a DeepSDF-style shared decoder.
This work makes five contributions:

- (i) A systematic empirical comparison of three image-to-3-D architectural families (raw weight prediction, weight autoencoder + diffusion, DeepSDF latent + diffusion) at the 199–976-shape scale.
- (ii) A diagnostic methodology — ranking predicted outputs against the full set of training latents by cosine similarity — that cleanly distinguishes mode collapse from prediction noise from genuine generalisation.
- (iii) Identification of the warm-start dominance problem: per-shape MLPs sharing a common initialisation produce a weight distribution with mean pairwise cosine similarity ≥ 0.96, too concentrated for diffusion to extract per-shape signal at this data scale.
- (iv) A demonstration that ReLU+PE decoder weights are fragile to per-dimension reconstruction error in ways not captured by aggregate MSE or cosine metrics.
- (v) A working final pipeline (DeepSDF shared decoder + image-conditioned latent DiT) that produces clean recall at 976 shapes and exhibits category-appropriate OOD generation despite the small training set.
A 1,000-shape curated subset of Objaverse, filtered to the LVIS category vocabulary — common objects (table, chair, lamp, bottle), wildlife (dog, lion, beetle), tools (toothbrush, sharpie, pacifier), vehicles (cabin_car, surfboard), and long-tail oddities (banjo, escargot, signboard, Tabasco_sauce). After watertight conversion and SDF sampling, 976 shapes had clean output across both stages and form the working set. Each shape's identity is recorded as both an integer obj_idx (0–975) and the original Objaverse hash UID, with a manifest.json providing the bidirectional mapping plus the LVIS category label.
Raw Objaverse meshes are typically non-manifold — holes, internal geometry, duplicate vertices — so all meshes are converted to watertight form via a Houdini VDB pipeline: scatter ~1 M points uniformly on the surface, voxelise via VDB-from-particles (~1 M target voxels), and convert the VDB back to polygons. SDF samples are then generated with mesh-to-sdf at 200,000 query points per shape (50 % near-surface, 50 % uniform in the unit cube), all shapes normalised to a unit bounding sphere and stored as obj_NNNN.npz.
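For concreteness, a minimal sketch of the sampling stage, assuming the watertight meshes already exist on disk and using trimesh plus the mesh-to-sdf package; the normalisation details and near-surface noise scale are illustrative assumptions, not the project's exact settings:

```python
# Sketch of per-shape SDF sampling: 50% near-surface, 50% uniform queries.
import numpy as np
import trimesh
from mesh_to_sdf import mesh_to_sdf

def sample_shape(path, n_points=200_000, near_frac=0.5, noise_std=0.01):
    mesh = trimesh.load(path, force='mesh')

    # Normalise to the unit bounding sphere.
    mesh.apply_translation(-mesh.bounding_box.centroid)
    mesh.apply_scale(1.0 / np.linalg.norm(mesh.vertices, axis=1).max())

    # Near-surface queries: surface samples plus Gaussian jitter ...
    n_near = int(n_points * near_frac)
    near = mesh.sample(n_near) + np.random.normal(0, noise_std, (n_near, 3))
    # ... and uniform queries in the unit cube.
    uniform = np.random.uniform(-1, 1, (n_points - n_near, 3))

    points = np.concatenate([near, uniform]).astype(np.float32)
    sdf = mesh_to_sdf(mesh, points).astype(np.float32)
    return points, sdf

pts, sdf = sample_shape('watertight/obj_0000.obj')
np.savez('sdf/obj_0000.npz', points=pts, sdf=sdf)
```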
Each watertight mesh is rendered from 64 viewpoints on a Fibonacci sphere at distance 2.5, using pyrender with EGL — black background, gray-blue PBR material, one headlight and one rim light, 224 × 224 RGB. Of the 976 shapes, 199 had renders from earlier project iterations; the remaining 777 were rendered for this work (~13 hours, CPU-bound). All renders pass through DINOv2-base/14 to produce a 768-dimensional CLS token per (shape, view); the full feature cache is 192 MB.
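A sketch of the viewpoint placement and feature caching, assuming the torch-hub DINOv2 ViT-B/14 checkpoint; the pyrender/EGL rendering itself is omitted:

```python
# Fibonacci-sphere viewpoints at distance 2.5, and DINOv2 CLS extraction.
import numpy as np
import torch

def fibonacci_sphere(n=64, radius=2.5):
    """n roughly uniform viewpoints on a sphere of the given radius."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i          # golden-angle increment
    y = 1.0 - 2.0 * (i + 0.5) / n                   # y in (-1, 1)
    r = np.sqrt(1.0 - y * y)
    return radius * np.stack([r * np.cos(phi), y, r * np.sin(phi)], axis=1)

dino = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14').eval()

@torch.no_grad()
def cls_features(images):          # images: (B, 3, 224, 224), ImageNet-normalised
    return dino(images)            # (B, 768) CLS embeddings
```

As a size check, 976 shapes × 64 views × 768 float32 dims ≈ 192 MB, matching the cache size quoted above.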
Each per-shape decoder f_θ : ℝ³ → ℝ has the form: point → positional encoding (6 frequency bands → 39-dim) → Linear(39,128)+ReLU → three Linear(128,128)+ReLU → Linear(128,1) — a total of 54,785 parameters. Formally, for a query point p ∈ ℝ³ the decoder maps

f_θ(p) = W₅ σ(W₄ σ(W₃ σ(W₂ σ(W₁ γ(p) + b₁) + b₂) + b₃) + b₄) + b₅, σ = ReLU, (1)

with positional encoding

γ(p) = [p, sin(2⁰πp), cos(2⁰πp), …, sin(2⁵πp), cos(2⁵πp)] ∈ ℝ³⁹. (2)
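As a concrete reference, a minimal PyTorch sketch of this decoder; names are illustrative, but the layer sizes follow the text and reproduce the 54,785-parameter count:

```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """p in R^3 -> 39-dim encoding: p plus sin/cos at 6 frequency bands."""
    def __init__(self, num_bands: int = 6):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_bands) * torch.pi)

    def forward(self, p):                      # p: (N, 3)
        proj = p[..., None] * self.freqs       # (N, 3, 6)
        enc = torch.cat([proj.sin(), proj.cos()], dim=-1).flatten(-2)  # (N, 36)
        return torch.cat([p, enc], dim=-1)     # (N, 39)

class PerShapeSDF(nn.Module):
    def __init__(self, width: int = 128):
        super().__init__()
        self.pe = PositionalEncoding()
        self.net = nn.Sequential(
            nn.Linear(39, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1),
        )

    def forward(self, p):                      # (N, 3) -> (N, 1) SDF
        return self.net(self.pe(p))

model = PerShapeSDF()
assert sum(t.numel() for t in model.parameters()) == 54785
```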
ReLU+PE was chosen over SIREN deliberately: in-house perturbation benchmarks showed ReLU+PE decoders tolerate much larger weight perturbations before reconstruction collapses (relative σ ≥ 0.34 versus SIREN's ≤ 0.17) — and a downstream diffusion model will produce noisy weight vectors.
All 976 per-shape decoders are initialised from a single anchor decoder, itself trained on obj_0000 (a coffee table). Each per-shape decoder is then fine-tuned from this anchor for 200 epochs at learning rate 10⁻⁴ with cosine decay. The motivation: warm-starting from a shared anchor keeps all per-shape decoders in the same permutation neighbourhood, which prior work showed is necessary for coherent downstream weight-space interpolation. Final per-shape losses: median 0.00185, 95th percentile 0.00409, with only 11/976 outliers above 0.01. Reconstruction quality is good across categories.
Phase 6 trains a diffusion transformer (~132 M parameters) to predict per-shape decoder weights from a single image; Phase 7 extends to multi-view conditioning with K ∈ [1, 8] views and explicit camera-pose embeddings. The DiT chunks the 54,785-dim weight vector into 8 tokens of 6,849 dim each (after padding), projects each to d_model = 768, and has 8 layers with self-attention (within the 8 weight tokens) and cross-attention (to a variable number of view tokens — each view token is the concatenation of DINOv2 CLS and a 64-dim sinusoidal pose embedding). AdaLN modulation conditions on the diffusion timestep; the process uses a cosine T = 500 schedule, x₀-prediction, DDIM 50-step sampling, with cfg_dropout 0.10 and pose_dropout 0.20. The DiT trains on the 199 shapes that had multi-view renders at the time. Training loss EMA plateaus at 0.198 standardised MSE around step 15 K and does not improve through 30 K steps.
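The chunking arithmetic is worth making explicit — 8 × 6,849 = 54,792, so the weight vector needs 7 zeros of padding. A sketch with illustrative names:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_TOKENS, TOKEN_DIM, D_MODEL = 8, 6849, 768    # 8 * 6849 = 54792 = 54785 + 7

def chunk_weights(theta):                      # theta: (B, 54785)
    pad = N_TOKENS * TOKEN_DIM - theta.shape[1]          # 7 zeros of padding
    return F.pad(theta, (0, pad)).view(-1, N_TOKENS, TOKEN_DIM)

token_proj = nn.Linear(TOKEN_DIM, D_MODEL)     # shared per-token projection
tokens = token_proj(chunk_weights(torch.randn(4, 54785)))  # (4, 8, 768)
```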
The diffusion process operates on the standardised weight vector x₀ = θ. The forward (noising) process and the x₀-prediction training objective are
q(xₜ | x₀) = 𝒩(xₜ ; √ᾱₜ · x₀ , (1 − ᾱₜ) I) (3)

ℒ_DiT = 𝔼₍ₜ, x₀, c₎ ‖ x₀ − f_DiT(xₜ, t, c) ‖², c = { (DINOv2(Iₖ), pose(Iₖ)) }ₖ₌₁ᴷ (4)

where c is the multi-view conditioning set (K ∈ [1, 8] views, each a DINOv2 CLS token concatenated with a sinusoidal pose embedding) and ᾱₜ follows the cosine schedule. It is the floor on ℒ_DiT — and its uniformity across t, examined next — that the diagnostics interrogate.
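A minimal sketch of one training step implementing (3)–(4), assuming the Nichol–Dhariwal cosine parametrisation for ᾱₜ (the text specifies a cosine schedule but not its exact form); f_dit stands for any network with the signature used above:

```python
import torch

T = 500

def alpha_bar(t, s=0.008):                        # t: (B,) integer timesteps
    f = lambda u: torch.cos((u / T + s) / (1 + s) * torch.pi / 2) ** 2
    return f(t.float()) / f(torch.zeros(1))

def dit_loss(f_dit, x0, cond):                    # x0: (B, D) standardised weights
    t = torch.randint(0, T, (x0.shape[0],))
    ab = alpha_bar(t).unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps      # eq. (3): q(x_t | x_0)
    return ((x0 - f_dit(x_t, t, cond)) ** 2).mean()   # eq. (4): x0-prediction MSE
```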
To determine whether the 0.198 loss floor is prediction noise or something pathological, we run diagnostic inference on four trained shapes (obj_0000 table, obj_0050 snowman, obj_0100 turkey, obj_0119 wolf) and rank each predicted weight vector against all 199 training latents by cosine similarity. For obj_0119: cos(pred, true) = 0.9698, but the top-5 nearest training latents to the prediction are obj_0054, 0172, 0055, 0000, 0010 at cos ≈ 0.985–0.987. Across all four test shapes, the top-5 are the same four shapes — obj_0054, 0055, 0172, 0000 — regardless of input image, and the prediction is closer to these attractors than to the true target. This is mode collapse, but in an interesting form: the DiT has not collapsed to zero or noise, it has collapsed to the centroid of the training distribution, with image conditioning supplying only a ~3 % directional perturbation insufficient to reach the target.
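The diagnostic itself is a few lines; a sketch with illustrative tensor names:

```python
import torch
import torch.nn.functional as F

def rank_against_training(pred, train_latents, k=5):
    """pred: (D,); train_latents: (N, D). Returns top-k (similarity, index)."""
    sims = F.cosine_similarity(pred.unsqueeze(0), train_latents, dim=1)  # (N,)
    top = sims.topk(k)
    return list(zip(top.values.tolist(), top.indices.tolist()))

# Healthy prediction: the true shape's index at rank 1 with a clear margin.
# Mode collapse: the same few central shapes top the list regardless of the
# conditioning image — the pattern observed here.
```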
A natural follow-up: perhaps x₀-prediction is hard at high noise levels and ε-prediction would help. We measure the loss at representative timesteps spanning [0, 450].
| timestep t | ᾱ(t) | MSE(x₀) | MSE(ε implied) |
|---|---|---|---|
| 0 | 0.9999 | 0.18912 | 2164.16 |
| 50 | 0.9711 | 0.18286 | 6.13 |
| 100 | 0.8968 | 0.18700 | 1.63 |
| 200 | 0.6445 | 0.18563 | 0.34 |
| 300 | 0.3379 | 0.19423 | 0.10 |
| 450 | 0.0231 | 0.19252 | 0.005 |
The x₀ loss is essentially flat at ~0.19 across all timesteps. Critically, at t = 0 — where the input is the clean target plus a trace of noise — the loss is still 0.189: the model cannot even reproduce a near-clean input. This rules out the prediction-target hypothesis and rules out "just train longer" — the floor is structural.
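The "implied ε" column follows directly from the x₀ column: with x₀-prediction, ε̂ = (xₜ − √ᾱₜ x̂₀)/√(1 − ᾱₜ), so an x₀ error maps to an ε error scaled by √(ᾱₜ/(1 − ᾱₜ)), giving MSE(ε) = ᾱₜ/(1 − ᾱₜ) · MSE(x₀). A quick check against the table:

```python
# Reproducing the "MSE(eps implied)" column from MSE(x0) and alpha-bar.
rows = [(0.9711, 0.18286), (0.8968, 0.18700), (0.6445, 0.18563),
        (0.3379, 0.19423), (0.0231, 0.19252)]
for ab, mse_x0 in rows:
    print(f"alpha_bar={ab:.4f}  implied MSE(eps)={ab / (1 - ab) * mse_x0:.3f}")
# 6.144, 1.625, 0.337, 0.099, 0.005 — matching the table up to alpha-bar rounding
```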
Two hypotheses remain. Phase 8 retrains on raw (unstandardized) weight vectors — 124 of the 54,785 dimensions had std < 10⁻⁴ across the 199 shapes and were clamped before standardization, artificially amplifying near-constant dimensions. After 5 K steps the same diagnostic shows the same attractor cluster: standardization was not the cause. Phase 9 retrains with cfg_dropout and pose_dropout both set to 0.0, removing all unconditional training. The loss curve is essentially identical to Phase 8; the diagnostic shows the same top-1 collapse to obj_0055. CFG dropout was not the cause.
We measure the geometry of the 976 trained per-shape decoder weight vectors directly.
| Statistic | Value |
|---|---|
| Mean L2 norm of weight vectors | 13.39 |
| Std of L2 norm | 0.10 (0.7 % of mean) |
| Mean pairwise cosine similarity | 0.9606 |
| Min / max pairwise cosine | 0.927 / 1.000 |
| Cos-to-population-mean: obj_0054 / 0055 / 0172 | 0.9951 / 0.9951 / 0.9950 |
| Variance captured by top 10 dims | 0.2 % |
| Variance captured by top 1,000 dims | 11.2 % |
| Variance captured by top 10,000 dims | 55.4 % |
| Variance captured by top 25,000 dims | 84.7 % |
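The statistics above can be reproduced with a short sketch over the stacked weight matrix (illustrative names; W holds the 976 × 54,785 trained weight vectors):

```python
import torch

def weight_geometry(W):
    norms = W.norm(dim=1)
    Wn = W / norms.unsqueeze(1)
    cos = Wn @ Wn.T                                   # pairwise cosine matrix
    off = cos[~torch.eye(len(W), dtype=torch.bool)]   # off-diagonal entries
    var = W.var(dim=0)                                # per-dimension variance
    top = var.sort(descending=True).values
    frac = lambda k: (top[:k].sum() / var.sum()).item()
    return {
        "norm_mean": norms.mean().item(), "norm_std": norms.std().item(),
        "cos_mean": off.mean().item(),
        "cos_min": off.min().item(), "cos_max": off.max().item(),
        "var_top10": frac(10), "var_top1000": frac(1000),
        "var_top10000": frac(10000), "var_top25000": frac(25000),
    }
```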
The diagnosis is now clear. The 976 weight vectors are highly concentrated — pairwise cosine 0.96 means all shapes are small perturbations of a shared mean — and the three attractor shapes (0054, 0055, 0172) have cos-to-mean = 0.995: they are literally the most central shapes in the dataset. The DiT's predictions land at the centroid because that is the easy minimum, and per-shape variation is spread thinly across ~50,000 dimensions, each with std ≈ 0.0008. This is the warm-start dominance problem. The shared anchor initialisation — desirable for the original goal of weight-space interpolation — concentrates the entire training distribution in a thin shell; there is per-shape signal in the shell, but it is buried under the shared structure, and the DiT never gets the gradient signal to escape the "predict the anchor" minimum.
Given the warm-start dominance problem, a natural intervention is to compress weight vectors before the diffusion model. We train an MLP autoencoder on residuals from the population mean: encoder maps 54,785-dim residual → 1024 → 1024 → 1024 → latent_dim, decoder mirrors it, latent_dim ∈ {128, 256}, 2,000 epochs at lr 3 × 10⁻⁴. By every metric an autoencoder optimises, the result is excellent: final raw-weight-space MSE 2.28 × 10⁻⁵ (latent 128) / 2.24 × 10⁻⁵ (256); cos(rec, true) mean 0.9965 / 0.9966, min 0.9921 / 0.9920; and the latent-space pairwise cosine drops from 0.96 (raw weight space) to 0.07 — different shapes mapped to essentially orthogonal directions.
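A sketch of this autoencoder with illustrative names; the layer widths and residual parametrisation follow the text:

```python
import torch.nn as nn

def mlp(dims):
    layers = []
    for a, b in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers[:-1])               # no activation on output

class WeightAE(nn.Module):
    """Operates on residuals theta - theta_mean, not raw weight vectors."""
    def __init__(self, d=54785, latent=128):
        super().__init__()
        self.enc = mlp([d, 1024, 1024, 1024, latent])
        self.dec = mlp([latent, 1024, 1024, 1024, d])

    def forward(self, residual):
        z = self.enc(residual)
        return self.dec(z), z
```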
When the trained latents are decoded back through the autoencoder and the resulting weights passed to the original ReLU+PE network for marching cubes, the meshes tell a different story. obj_0000 (the coffee table the anchor was trained on) failed marching cubes entirely — the decoded SDF had range [−1.07, −0.06], no zero crossing, no surface. obj_0100 (a turkey) reconstructs at cos = 0.994 with its neck, head and legs gone — the topology is lost. Only shapes with simple, largely convex geometry, such as obj_0050, reconstruct acceptably.
The 0.3–0.5 % reconstruction error is not uniformly distributed. It lands on different dimensions for different shapes, and ReLU+PE decoders are highly sensitive to which specific dimensions absorb it. For a simple convex shape, errors land on dimensions producing small surface displacements — the shape still reads as itself. For the table and the turkey, errors land on dimensions controlling the final-layer bias or the placement of the SDF zero-crossing — the topology is destroyed. Increasing latent_dim from 128 to 256 does not help; the issue is not compression ratio but decoder fragility. The lesson, made a hard project rule: aggregate weight MSE is a poor proxy for shape reconstruction quality, and visual inspection of marching-cubes outputs — specifically of thin or topologically complex shapes — is the only reliable evaluation, as sketched below.
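A minimal version of that marching-cubes check, using scikit-image; the grid resolution and bounds are illustrative assumptions:

```python
import numpy as np
import torch
from skimage.measure import marching_cubes

@torch.no_grad()
def extract_mesh(sdf_net, res=128, bound=1.0):
    xs = torch.linspace(-bound, bound, res)
    grid = torch.stack(torch.meshgrid(xs, xs, xs, indexing='ij'), dim=-1)
    sdf = sdf_net(grid.reshape(-1, 3)).reshape(res, res, res).numpy()
    if sdf.min() > 0 or sdf.max() < 0:       # no zero crossing: the obj_0000
        return None                           # failure mode described above
    verts, faces, _, _ = marching_cubes(sdf, level=0.0)
    return verts * (2 * bound / (res - 1)) - bound, faces
```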
The weight-space approaches treat each shape's MLP weights as independent — each shape has its own ~55 K-dim representation, and any compression must be learned post-hoc. DeepSDF inverts this: a single shared decoder is trained jointly with one learnable latent code per shape, so per-shape variation lives in a low-dimensional space by construction.
- Phase 6/7 (failed): image → DiT → 54,785-dim weight vector → per-shape ReLU+PE net → SDF
- Phase 11 (works): shape_id → 64-dim latent z → f_shared(concat(z, PE(p))) → SDF

The "compression" from 54,785 → 64 dimensions is not learned by an autoencoder — it is enforced by the training procedure. The shared decoder must use the 64-dim latent to differentiate shapes, because it has no per-shape weights to fall back on.
The shared decoder has 8 hidden ReLU layers of width 512, with a DeepSDF-style skip connection re-injecting the input vector at the middle layer; input is concat(latent₆₄, PE₃₉) = 103 dimensions, output a single SDF scalar, ~1.95 M parameters total. Each shape's latent is a learnable parameter initialised i.i.d. from 𝒩(0, 0.01²) and optimised jointly with the decoder. Separate Adam learning rates — 5 × 10⁻⁴ for the decoder, 10⁻³ for the latents (latents need a higher LR to escape their initialisation) — and an L2 regulariser (weight 10⁻⁵) on the latent norms. The training objective is clamped-L1 on SDF predictions plus the latent regulariser, with decoder parameters φ and the per-shape latent set {zᵢ} optimised jointly:

ℒ(φ, {zᵢ}) = Σᵢ Σₚ | clamp(f_φ(zᵢ, γ(p)), δ) − clamp(sᵢ,ₚ, δ) | + (1/σ²) Σᵢ ‖zᵢ‖₂² (5)

where sᵢ,ₚ is the ground-truth SDF value of point p for shape i, clamp(·, δ) truncates to [−δ, δ], and the (1/σ²) factor (set to the 10⁻⁵ weight above) regularises the latent norms. The "compression" to 64 dimensions is enforced by (5) directly: the shared f_φ has no per-shape weights, so it must route per-shape variation through zᵢ. Training uses 4 shapes per step, 8,192 random points per shape per step.
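A condensed PyTorch sketch of this joint optimisation — the widths, skip placement, learning rates, and regulariser weight follow the text, and the layer layout reproduces the ~1.95 M parameter count; the clamp threshold δ = 0.1 and other details are assumptions:

```python
import torch
import torch.nn as nn

class SharedSDF(nn.Module):
    def __init__(self, in_dim=103, width=512, depth=8):
        super().__init__()
        self.pre = nn.ModuleList(
            [nn.Linear(in_dim if i == 0 else width, width)
             for i in range(depth // 2)])
        # Skip connection: re-inject the input at the middle layer.
        self.post = nn.ModuleList(
            [nn.Linear(width + in_dim if i == 0 else width, width)
             for i in range(depth // 2)])
        self.out = nn.Linear(width, 1)         # total params ~1.95 M

    def forward(self, x):                      # x = concat(z_64, PE_39(p))
        h = x
        for layer in self.pre:
            h = torch.relu(layer(h))
        h = torch.cat([h, x], dim=-1)
        for layer in self.post:
            h = torch.relu(layer(h))
        return self.out(h)

decoder = SharedSDF()
latents = nn.Parameter(torch.randn(976, 64) * 0.01)   # N(0, 0.01^2) init
opt = torch.optim.Adam([
    {"params": decoder.parameters(), "lr": 5e-4},
    {"params": [latents], "lr": 1e-3},
])

def step(pe_points, sdf_gt, idx, delta=0.1, reg=1e-5):
    """pe_points: (B, P, 39) pre-encoded; sdf_gt: (B, P); idx: (B,) shape ids."""
    z = latents[idx].unsqueeze(1).expand(-1, pe_points.shape[1], -1)
    pred = decoder(torch.cat([z, pe_points], dim=-1)).squeeze(-1)
    l1 = (pred.clamp(-delta, delta) - sdf_gt.clamp(-delta, delta)).abs().mean()
    loss = l1 + reg * latents[idx].pow(2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```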
The initial pilot — 20 shapes, decoder hidden 256 × 4 layers, 800 epochs — produced healthy-looking numbers (SDF L1 mean 0.00613, latent pairwise cosine −0.04) but blob-quality meshes: the decoder lacked capacity to represent 20 distinct shapes through a 64-dim latent. Scaling the decoder to hidden 512 × 8 layers (~1.95 M params) and training 4,000 epochs produces clean reconstructions across all 20 shapes — the wagon wheel, with 13 distinct spokes, hub and outer rim, is preserved with razor-sharp thin geometry.
| Configuration | SDF L1 (mean) | SDF L1 (max) | Latent pairwise cos | Train time |
|---|---|---|---|---|
| 20 shapes · 256×4 · 800 ep | 0.00613 | 0.01069 | −0.04 | ~1 min |
| 20 shapes · 512×8 · 4000 ep | 0.00051 | 0.00090 | −0.04 | ~13 min |
| 976 shapes · 512×8 · 1500 ep | 0.00212 | 0.00593 | 0.12 | ~140 min |
Scaling to 976 shapes grows the mean SDF L1 ~4× — but every individual shape stays well below 0.006, none catastrophically fail, and the worst 10 shapes by loss are spread across categories with no common failure mode. The latent pairwise cosine rises only to 0.12: the 64-dim space has ample room for 976 distinguishable shapes.
With Phase 11's 64-dim latents in hand, image-to-3-D reduces to: given image features and camera poses, predict a 64-dim latent that decodes to the correct shape — a low-dimensional supervised problem with multimodal output, making diffusion the natural choice. The DiT treats the 64-dim latent as a single token projected to d_model = 384, with 4 layers and 6 attention heads (vs Phase 7's 8 layers). Each layer has self-attention (over the 1 latent token — effectively a no-op), cross-attention to multi-view tokens, and a feed-forward block; each view token is concat(DINOv2 CLS-768, 64-dim sinusoidal pose embedding) projected to d_model; AdaLN modulates on the diffusion timestep. Total: ~10 M parameters — vs Phase 7's 132 M, appropriate because the prediction target is 800× smaller. The Phase 11 latents are standardised to zero-mean unit-variance per dimension. Training: 15 K steps, batch 32, lr 3 × 10⁻⁴ cosine, K ∈ [1, 8] views uniformly sampled per batch, cosine T = 500, x₀-prediction. CFG and pose dropout are disabled — Phase 9 confirmed they were never the cause of collapse.
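A condensed sketch of one such block, with illustrative names — only the shapes and the conditioning path follow the text (the single-token self-attention, effectively a no-op, is omitted here):

```python
import torch
import torch.nn as nn

D, HEADS = 384, 6

class LatentDiTBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.cross = nn.MultiheadAttention(D, HEADS, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(),
                                nn.Linear(4 * D, D))
        self.norm1, self.norm2 = nn.LayerNorm(D), nn.LayerNorm(D)
        self.adaln = nn.Linear(D, 2 * D)       # timestep emb -> scale, shift

    def forward(self, h, views, t_emb):        # h: (B, 1, D); views: (B, K, D)
        scale, shift = self.adaln(t_emb).unsqueeze(1).chunk(2, dim=-1)
        q = self.norm1(h) * (1 + scale) + shift
        h = h + self.cross(q, views, views)[0]  # cross-attend to view tokens
        h = h + self.ff(self.norm2(h))
        return h
```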
Run on the 20-shape pilot, the training loss EMA reaches 8 × 10⁻⁶ — essentially zero — and recall is perfect: every tested training shape has cos(pred, true) = 1.0000 with a > 0.4 cosine margin to the second-nearest training latent. But OOD generalisation fails almost completely. Feeding the model an arbitrary held-out shape (a tunnel mesh) produces a prediction whose top-1 nearest training latent is obj_0000 at cos = 0.9661 — the model snaps to its closest training-set match and outputs a humanoid figure bearing no resemblance to a tunnel. With 20 training shapes there is no "tunnel-like" region in the latent space; the DiT correctly memorises its 20 (image, latent) pairs and otherwise behaves as a nearest-neighbour retriever in DINOv2 feature space.
The same pipeline at 976 training shapes produces qualitatively different OOD behaviour. The training EMA is 5.1 × 10⁻³ — much higher than the 20-shape memorisation, which is exactly what we want: the model can no longer memorise. Recall on training shapes is still strong (the wagon wheel reconstructs cleanly through the full image-to-3-D pipeline). The OOD result is the central finding: three never-seen inputs — a posed humanoid, a thin tunnel, a head-and-shoulders bust — produce category-appropriate output. The humanoid input yields an unmistakably humanoid mesh (head, shoulders, outstretched arms, torso, legs, feet — rough surface, no fingers, mushy face, but the topology is right). The tunnel yields an elongated rod (compare the 20-shape result, which collapsed to a humanoid entirely). The head yields a head-on-shoulders topology with eye-socket-like depressions, though the neck region degrades into noise.
| Property | 20-shape pilot | 976-shape run |
|---|---|---|
| Training-loss EMA | 8 × 10⁻⁶ (memorises) | 5.1 × 10⁻³ (cannot memorise) |
| Recall | cos(pred, true) = 1.0000 | Clean full-pipeline recall |
| OOD behaviour | Pure nearest-neighbour retrieval | Category-appropriate generation |
The qualitative shift from 20-shape pure-retrieval to 976-shape category-appropriate generation is the most important result of this work. At 976 shapes the latent space has acquired enough semantic structure that DINOv2 features can navigate it: a humanoid input lands in a humanoid region, a long-thin input in a long-thin region, a head input in a head region. The decoder produces the appropriate gross topology even for never-seen inputs.
The most counterintuitive finding is that the warm-start prior — established as essential for downstream weight-space tasks in prior work on per-layer hypernetworks — is precisely what dooms image-conditioned weight-space diffusion at this data scale. Warm-starting all per-shape MLPs from a single anchor produces a weight distribution living in a thin shell of weight space (mean pairwise cosine 0.96). For unconditional weight-space interpolation this is desirable — every weight vector you land on is in the same permutation neighbourhood. For image-conditioned diffusion the same property is fatal: per-shape signal is buried under shared anchor structure, the diffusion model sees mostly noise relative to its target, and it collapses to predicting the mean. DeepSDF sidesteps the trap by never having per-shape weight vectors at all.
The Phase 10 weight autoencoder is a clean example of metrics that mislead. cos(rec, true) = 0.997 is excellent for almost any application except this one — the residual error lands on different dimensions for different shapes, and ReLU+PE decoders are non-uniformly sensitive: a small error on a critical bias term destroys topology, while a large error on an inactive ReLU does nothing. Visual inspection of marching-cubes outputs is the only reliable evaluation, and thin / topologically complex shapes (wagon wheels, multi-part figures) are the most informative stress tests.
All negative results from §3–4 carry the caveat "at 976 shapes". We have not shown weight-space diffusion is impossible in principle — only that the warm-start prior plus 976 shapes places it below the threshold of feasibility. With 10⁵–10⁶ shapes there may be enough per-shape variation in the warm-started distribution for diffusion to extract it; the scale at which weight-space diffusion "turns on" is an open empirical question. Conversely, the Phase 11+12 success at 976 shapes does not establish DeepSDF is uniformly superior — at very large scale the weight-space approach may have advantages (no need to choose latent dimensionality up front; per-shape MLP capacity adapts to per-shape complexity). What we establish is a lower bound: at the small-data scale typical of academic and applied research, the DeepSDF approach is robustly superior to the weight-space diffusion variants we tested.
The Phase 12 / 976-shape pipeline has clear limits. OOD surface quality is rough — no fine detail, mushy faces, no fingers, broken neck attachments — the expected manifestation of data scale: the latent space is sparsely populated and DiT predictions for novel inputs land in under-trained regions that decode noisily. Fine geometric features present in training-shape recall are not consistent in OOD outputs; single-view OOD reconstructions are particularly noisy. The system does not handle truly novel categories — a tunnel produces a long-thin object but not a hollow-interior tunnel, because nothing in the 976 shapes has tunnel-like topology. And generalisation has not been measured quantitatively across many test shapes — only spot-checked on obvious examples. A rigorous evaluation would need a held-out test set with ground-truth meshes and a 3-D similarity metric (Chamfer, IoU).
- Larger scale: 5 K, 50 K, 500 K shapes from full Objaverse — the expected curve, based on the 20 → 976 result, is that surface quality and OOD fidelity improve smoothly with shape count.
- More expressive latent spaces: triplane or hybrid representations combining a small global latent with localised features, to improve fine detail while preserving the structured manifold.
- Latent-space regularisation: training the DeepSDF stage with explicit smoothness regularisers (KL to a Gaussian prior, interpolation losses) to make the manifold smoother and OOD decoding cleaner.
- Richer image conditioning: combining DINOv2 with depth or surface-normal estimators for more 3-D-structural information per view.
- Quantitative OOD evaluation: a held-out test set with ground-truth meshes and Chamfer / IoU scoring.
Image-to-3-D generation lowers the cost of producing 3-D assets, with the usual dual-use profile of generative media: it can accelerate design, simulation, accessibility, and education work, and it can equally be used to generate assets that infringe the rights of the artists whose shapes appear in the training corpus. This work trains only on Objaverse / Objaverse-LVIS, whose objects carry Creative-Commons licences, but downstream users should respect per-object licensing rather than treating the trained model as licence-laundering. The small-data focus of this paper is, if anything, a mitigation: it points toward systems trainable on modest, curated, properly-licensed corpora rather than on web-scale scrapes of uncertain provenance. The model has no person-identifying capability and is not intended for any safety-critical use; its OOD outputs are explicitly low-fidelity (§7.4) and should not be relied on as measurements of real-world geometry.
Across twelve experimental phases we attempted three architectural families: per-shape weight-space diffusion, weight-space autoencoder + diffusion, and DeepSDF + latent diffusion. The first two failed in informative ways; the third succeeded. The weight-space failures localise to a single underlying cause — the warm-start prior necessary for downstream weight-space tasks creates a training distribution too concentrated for diffusion to extract per-shape signal — and diagnostic ablations rule out standardization, CFG dropout, and prediction-target choice. The weight autoencoder bypasses the dimensionality problem but creates a new one: ReLU+PE decoders are non-uniformly fragile, and aggregate weight MSE is uncorrelated with mesh quality. The DeepSDF approach succeeds because it constrains per-shape variation to a 64-dim latent space by construction. With 976 training shapes the latent space acquires enough semantic structure that an image-conditioned DiT can navigate it for OOD inputs. The journey is the message: at small data scale, structural inductive biases that constrain the prediction space beat learned compression of an unconstrained representation in every configuration we tested. We expect this to generalise beyond 3-D shapes to other domains where the prediction target is high-dimensional, structured, and learned through warm-starting. All experiments ran on a single Vast.ai instance with an NVIDIA RTX 5060 Ti (16 GB).
The complete write-up is available as docs/thesis.docx in the repository.