White Paper — Aaditya Jain

WP — 023

From Weight-Space Generation to DeepSDF: A Twelve-Phase Image-to-3-D Research Archive, the Weight-Space Dominance Diagnostic, and an Image-Conditioned Latent-Diffusion Pipeline that Works at the 976-Shape Scale

Twelve-phase image-to-3-D project. The weight-space hypothesis (shapes are decoder weights) fails for image-conditioned generation: warm-started per-shape weights are ≥96% shared anchor, so the ≤4% per-shape signal is too thin for diffusion to extract, and mode collapse survives every ablation. The autoencoder rescue fails too (cosine similarity 0.997, yet broken meshes). The DeepSDF pivot constructs a 64-dim latent rather than extracting one, and works: perfect recall plus category-appropriate OOD results at 976 shapes. Code, a 26 GB dataset, a live HF Space, and a 30-page thesis are public.

Technical Report cs.CV · cs.GR · cs.LG Aaditya Jain

May 2026

WP — 024

Activation-Space Part Discovery in DeepSDF: Self-Supervised Segmentation by Probing a Trained Implicit Decoder, and an Honest Account of the Per-Part Reconstruction Attempts that Failed

Self-supervised part segmentation with no part labels — by clustering the activations of a trained DeepSDF decoder as it is queried across a shape. On 9 CSG shapes the activation clustering recovers the constructive components (ARI reported per shape). The paper is equally an account of what did not work: four separate attempts to reconstruct individual parts as standalone SDFs, and why each failed.

Technical Report cs.CV · cs.GR · cs.LG Aaditya Jain

Apr 2026
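
The WP 024 abstract reports an adjusted Rand index (ARI) per shape. As a minimal sketch of how such a score could be computed from two flat labelings, the function below uses the standard pair-counting form of ARI. The label lists in the usage note are hypothetical; the report's actual clustering method and data layout are not shown in the abstract.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat labelings of the same points."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0

    # Contingency counts: how many points share each (true, predicted) label pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    index = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (index - expected) / (max_index - expected)
```

For example, `adjusted_rand_index(component_ids, cluster_ids)` would compare hypothetical ground-truth CSG component ids against cluster ids assigned to the same query points. The score is invariant to how the cluster ids are numbered.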

WP — 025

Mini SDF-SRN: A Minimal From-Scratch Reimplementation of Single-View Neural SDF Recovery via a Differentiable Ray-Marching Renderer

A compact reimplementation of SDF-SRN — recovering a neural signed-distance field from single-view 2-D silhouette supervision through a differentiable ray-marching renderer, with no 3-D ground truth. Trained on synthetic primitives (sphere, box, ellipsoid); the report covers the renderer, the training schedule, novel-view quality, and the configuration that made it converge.

Technical Report cs.CV · cs.GR Aaditya Jain

Apr 2026
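
To make the ray-marching idea in WP 025 concrete, here is a minimal forward-only sketch that marches a ray against an analytic sphere SDF and reports a silhouette hit or miss. Sphere tracing (stepping by the SDF value) is an assumed marching scheme, and the function names and parameters are hypothetical. The report's renderer is differentiable and works on a learned SDF, which this sketch does not attempt.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Signed distance from point p to a sphere centred at the origin."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius


def ray_march(origin, direction, sdf, max_steps=64, eps=1e-4, far=10.0):
    """Return the distance along the ray to the surface, or None on a miss.

    `direction` is assumed to be unit length.
    """
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:
            return t          # close enough to the surface: silhouette hit
        t += dist             # safe step: the SDF bounds the distance to the surface
        if t > far:
            return None       # ray left the scene: silhouette miss
    return None


hit = ray_march((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), sphere_sdf)
miss = ray_march((0.0, 2.0, -3.0), (0.0, 0.0, 1.0), sphere_sdf)
```

Marching one ray per pixel and recording hit or miss gives a binary silhouette of the kind the abstract describes as the 2-D supervision signal.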

WP — 026

Flow-SDF: Rectified Flow for Neural SDF Reconstruction from 2-D Supervision, with the Noise-Scaling Diagnostic that Took Cosine Similarity from 0.08 to 0.95

A self-contained image-to-3-D pipeline — image conditioner, rectified-flow velocity network, SDF decoder, differentiable renderer — trained end-to-end on 2-D silhouette and RGB supervision with no 3-D data. The central diagnostic: 128-dim Gaussian noise has norm ~11.3 against SDF latent codes of norm ~0.42, so the flow minimises MSE by shrinking magnitude, not finding direction. Scaling source noise to the target's std fixed cosine similarity 0.08 → 0.95. Deliberately mirrors the Hunyuan3D component stack at small scale as a scaling recipe.

Technical Report cs.CV · cs.GR · cs.LG Aaditya Jain

Apr 2026
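
The norm mismatch behind the WP 026 diagnostic can be checked numerically. The sketch below samples 128-dim Gaussian noise at unit std, whose norm concentrates near sqrt(128) ≈ 11.31, and again at a reduced std. The reduced std is derived from the quoted target norm of ~0.42 under the assumption that latent codes are roughly zero-mean and isotropic; the report may estimate the target std differently.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def sample_noise(dim, std, rng):
    """Draw one isotropic Gaussian vector with the given per-dimension std."""
    return [rng.gauss(0.0, std) for _ in range(dim)]


rng = random.Random(0)
DIM = 128
TARGET_NORM = 0.42   # typical latent-code norm quoted in the abstract
N_SAMPLES = 2000

# Unit-variance source noise: norm concentrates near sqrt(DIM).
unit_norms = [l2_norm(sample_noise(DIM, 1.0, rng)) for _ in range(N_SAMPLES)]
mean_unit_norm = sum(unit_norms) / N_SAMPLES

# Assumption: per-dimension std implied by the target norm for isotropic codes.
target_std = TARGET_NORM / math.sqrt(DIM)
scaled_norms = [l2_norm(sample_noise(DIM, target_std, rng)) for _ in range(N_SAMPLES)]
mean_scaled_norm = sum(scaled_norms) / N_SAMPLES

print(f"unit-std noise norm   ~ {mean_unit_norm:.2f}")
print(f"scaled-std noise norm ~ {mean_scaled_norm:.3f}")
```

With unit-std noise the source is roughly 27 times larger in norm than the target, so most of the regression error is about magnitude. After rescaling, source and target live at the same scale and the remaining error is mostly about direction.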

WP — 027

The Hypernet → Shape Pipeline: Per-Shape SIRENs, Per-Layer Hypernetworks, and a Weight-Space Autoencoder for Image-to-3-D — the Predecessor the Twelve-Phase Archive Diagnosed

The standalone weight-space pipeline that the Topic-41 archive grew out of. 24 images per object become 24 image-SIRENs, then one per-object hypernetwork; its weights route — via a tiny mapper through a 128-dim weight-space autoencoder — into shape-SIREN weights and a mesh. Two findings carried forward: per-layer hypernetworks preserve genus-1 topology where monolithic ones destroy it even at MSE ~10⁻⁷; and warm-starting all shape-SIRENs from a shared anchor is necessary for coherent weight-space interpolation — the same property the archive later finds fatal at scale.

Technical Report cs.CV · cs.GR · cs.LG Aaditya Jain

Apr 2026

WP — 012

Triplane Mechanics: Rendering, Storage, and the Mesh-Extraction Decoupling — A Decision Note Promoting Triplane to Universal Intermediate

Mechanics of triplane representations (3 axis-aligned planes, bilinear sample, MLP, volume render) and the three-way comparison against Gaussian splats and VDB / FVDB. Triplane wins as the universal intermediate on storage (6–12 MB), editability, and lossless convertibility to either alternative. Mesh extraction is a separate downstream step, not part of rendering.

Technical Note cs.GR · cs.LG Aaditya Jain

Jan 2026
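
The triplane lookup described in WP 012 (three axis-aligned planes, bilinear sample) can be sketched in a few lines. This is an illustration under simplifying assumptions: each plane stores one scalar per texel instead of a feature vector, the three samples are summed (a common aggregation, assumed here), and the MLP and volume rendering stages are omitted. The plane keys and function names are hypothetical.

```python
def bilinear(plane, u, v):
    """Bilinearly sample a 2-D grid (list of rows) at coordinates u, v in [0, 1]."""
    height, width = len(plane), len(plane[0])
    x = u * (width - 1)
    y = v * (height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = plane[y0][x0] * (1 - fx) + plane[y0][x1] * fx
    bottom = plane[y1][x0] * (1 - fx) + plane[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def sample_triplane(planes, point):
    """Project a 3-D point in [0, 1]^3 onto the XY, XZ, and YZ planes and sum the samples."""
    x, y, z = point
    return (
        bilinear(planes["xy"], x, y)
        + bilinear(planes["xz"], x, z)
        + bilinear(planes["yz"], y, z)
    )


planes = {
    "xy": [[0.0, 1.0], [2.0, 3.0]],
    "xz": [[0.0, 0.0], [0.0, 0.0]],
    "yz": [[1.0, 1.0], [1.0, 1.0]],
}
feature = sample_triplane(planes, (0.5, 0.5, 0.5))
```

In a full pipeline the aggregated feature would be passed to a small MLP that predicts density and colour (or an SDF value) at the query point.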

WP — 013

Axis-Aligned Distance Fields: An Analysis of UODF (Lu et al., CVPR 2024) and Its Relationship to the Thesis-Line Six-Plane Mesh and Hexplane Representations

UODF stores three axis-aligned unsigned distance fields instead of a single SDF, giving interpolation-free surface extraction. 20–100× quality gain on open surfaces; triplane variant at 30–80 M points/s. The same axis-aligned principle the thesis-line Six-Plane Mesh and Hexplane AE use implicitly — UODF provides the theoretical justification.

Technical Analysis cs.GR · cs.LG Aaditya Jain

Jan 2026

WP — 014

Gaussian Splats vs VDB for Single-Image-to-3-D: An Architecture Survey Across Splatter Image, GS-LRM, Triplane-Meets-Gaussian, and Gamba

Four-method survey (Splatter Image, GS-LRM, Triplane-Meets-Gaussian, Gamba) + three-way comparison (G-Splat vs VDB vs triplane). Triplane wins as universal intermediate. Gamba (Mamba over Gaussian-sequence tokens) is published validation of the MambaFlow3D substitution premise for sparse-cube tokens.

Architecture Survey cs.GR · cs.LG Aaditya Jain

Dec 2025

WP — 015

Manifold-Aware Diffusion Targets: An Analysis of Li & He's "Back to Basics" x-Prediction Result and Its Extension to 3-D Geometric Representations

Summary of Li & He's x-prediction manifold-hypothesis argument and an extension showing the case is structurally stronger for 3-D geometry than for natural images. SDFs (eikonal constraint), sparse voxels (surface-band coherence), and triplanes all have algebraic constraints that ε-prediction destroys. Decides x / v-prediction for all thesis-line diffusion work.

Technical Analysis cs.LG · cs.CV · cs.GR Aaditya Jain

Nov 2025
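
The x, ε, and v prediction targets named in WP 015 are algebraically interchangeable given the noisy sample, so the choice is about which quantity the network regresses. The scalar sketch below assumes a variance-preserving schedule (alpha² + sigma² = 1) and the commonly used definition v = alpha·ε − sigma·x0; whether the report uses exactly this parametrisation is an assumption.

```python
import math


def noisy_sample(x0, eps, alpha, sigma):
    """Forward process: mix clean data x0 with noise eps."""
    return alpha * x0 + sigma * eps


def v_target(x0, eps, alpha, sigma):
    """Velocity target under the assumed definition v = alpha*eps - sigma*x0."""
    return alpha * eps - sigma * x0


def x0_from_v(x_t, v, alpha, sigma):
    """Recover the clean sample from the noisy sample and v."""
    return alpha * x_t - sigma * v


def eps_from_v(x_t, v, alpha, sigma):
    """Recover the noise from the noisy sample and v."""
    return sigma * x_t + alpha * v


theta = 0.3                                   # arbitrary point on the schedule
alpha, sigma = math.cos(theta), math.sin(theta)
x0, eps = 0.8, -1.2
x_t = noisy_sample(x0, eps, alpha, sigma)
v = v_target(x0, eps, alpha, sigma)
x0_recovered = x0_from_v(x_t, v, alpha, sigma)
eps_recovered = eps_from_v(x_t, v, alpha, sigma)
```

Because the conversions are exact, any difference between the targets comes from how regression error is distributed, which is where the manifold argument in the abstract applies.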

WP — 016

Diffusion Generator over Houdini Bridge Polylines: An Architecture Design Specification Lifting PGN's Seq2seq Head into Flow-Matching over a Padded-with-Mask Polyline Tensor

Design specification for a flow-matching polyline generator that replaces PGN's autoregressive seq2seq DSL head. Padded fixed-length tensor with explicit length mask; Pure-Mamba backbone; flow-matching velocity prediction; added-embedding attribute conditioning. Three open architectural questions enumerated for the first training run.

Design Specification cs.LG · cs.GR Aaditya Jain

Nov 2025
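
The padded-with-mask tensor in WP 016 can be illustrated with plain lists. The sketch below pads variable-length polylines to a fixed length, builds an explicit length mask, and computes a masked mean-squared error so that padded slots contribute nothing to the loss. The 2-D points, the function names, and the use of plain MSE as the loss are illustrative assumptions; the specification's tensors carry velocity targets and attribute conditioning that are not modelled here.

```python
def pad_polylines(polylines, max_len, pad_value=(0.0, 0.0)):
    """Pad each polyline to max_len points and return (padded, mask)."""
    padded, masks = [], []
    for points in polylines:
        if len(points) > max_len:
            raise ValueError("polyline longer than max_len")
        n_pad = max_len - len(points)
        padded.append(list(points) + [pad_value] * n_pad)
        masks.append([1.0] * len(points) + [0.0] * n_pad)
    return padded, masks


def masked_mse(pred, target, mask):
    """Mean squared error over real (unmasked) coordinates only."""
    total, count = 0.0, 0.0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            total += m * sum((pi - ti) ** 2 for pi, ti in zip(p, t))
            count += m * len(t)
    return total / max(count, 1.0)


polylines = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)],
]
target, mask = pad_polylines(polylines, max_len=4)

# A prediction that is wrong only in padded slots should incur zero loss.
pred = [
    [pt if m == 1.0 else (99.0, 99.0) for pt, m in zip(row, mask_row)]
    for row, mask_row in zip(target, mask)
]
loss = masked_mse(pred, target, mask)
```

Normalising by the number of real coordinates rather than the padded length keeps short and long polylines on a comparable loss scale.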

WP — 017

Image-to-3-D With Parametric Output: A Field-Gap Survey of Five 3-D-Generation Methods (CAPRI-Net, BrepGen, HoLa, SparC3D, TRELLIS) and the Thesis-Line Opportunity

Five-method survey identifies a structural field gap: no published method combines image input with parametric (procedural-CAD) output. The field bifurcates into image-input + raw-geometry-output (SparC3D, TRELLIS) and latent-input + parametric-output (CAPRI-Net, BrepGen, HoLa). The intersection is the thesis-line opportunity, operationalised by PGN and SculptNet.

Survey cs.GR · cs.CV · cs.LG Aaditya Jain

Oct 2025

WP — 018

A Character-Level Transformer From First Principles: Implementation, Attention-Pattern Inspection, and the Context-Dependent-Representation Pedagogical Result

From-first-principles pure-NumPy implementation of a character-level transformer block (multi-head causal self-attention, positional encoding, FFN, LayerNorm, residuals). The "hello world" run shows three occurrences of the letter l producing three distinct output vectors — the operational signature of context-dependent representation. Foundation study for the thesis-line transformer use.

Technical Note cs.LG Aaditya Jain

Sep 2025
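
The context-dependence result in WP 018 can be reproduced in miniature. The sketch below is a deliberately reduced stand-in for the report's NumPy block: a single attention head with identity query, key, and value projections, random character embeddings, and sinusoidal positional encoding, with no FFN, LayerNorm, or residuals. All names and the embedding size are assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def positional_encoding(pos, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    enc = []
    for i in range(dim):
        freq = 10000 ** ((i - i % 2) / dim)
        enc.append(math.sin(pos / freq) if i % 2 == 0 else math.cos(pos / freq))
    return enc


def causal_self_attention(xs):
    """Single head with identity Q/K/V projections; position i sees positions <= i."""
    dim = len(xs[0])
    outputs = []
    for i, query in enumerate(xs):
        scores = [
            sum(q * k for q, k in zip(query, xs[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(weights[j] * xs[j][d] for j in range(i + 1)) for d in range(dim)]
        )
    return outputs


text = "hello world"
DIM = 8
rng = random.Random(0)
embed = {ch: [rng.gauss(0.0, 1.0) for _ in range(DIM)] for ch in sorted(set(text))}
xs = [
    [e + p for e, p in zip(embed[ch], positional_encoding(i, DIM))]
    for i, ch in enumerate(text)
]
outputs = causal_self_attention(xs)
l_outputs = [outputs[i] for i, ch in enumerate(text) if ch == "l"]
```

All three `l` tokens share one embedding, yet their output vectors differ because each sits at a different position and attends over a different prefix.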

WP — 019

Latent Diffusion Models, Conditional Diffusion, and Latent Consistency Models: A Study Note on the Stable-Diffusion-Class Architecture and Its Implications for 3-D-Diffusion

Study of LDM (Stable Diffusion-class) decomposition: frozen VAE compressor + trained U-Net denoiser + frozen text encoder + scheduler. LCM distillation gives 25× sampling speed-up. Same VAE-U-Net decoupling structures the thesis-line 3-D-diffusion work — triplane encoder is the VAE-analogue, Mamba block is the U-Net-analogue.

Technical Note cs.LG · cs.CV Aaditya Jain

May 2025

WP — 020

Real-Time Single-Phone 3-D Reconstruction via Monocular Depth + Neural SLAM + Browser-Rendered TSDF: A System Design Pitch (WTFund, Not Funded) and Its Influence on the Thesis-Line Consumer-Hardware Constraint

System design pitch for real-time 3-D reconstruction (DPT depth → NICER-SLAM pose → TSDF fusion → Rerun.io browser viewer; 10 fps end-to-end target). WTFund ₹20 L grant application declined ("PoC needed"). The pitch's "real-time, on a laptop, no server" framing is the consumer-hardware throughline that shapes every later thesis-line architecture decision.

System Design Pitch cs.CV · cs.RO Aaditya Jain

Mar 2025

WP — 021

A Minimal DDPM From First Principles: Single-Image Training on a 16×16 Red Square as Architecture-Literacy Investment for the Thesis-Line Diffusion Work

From-scratch DDPM: 3 M-param MLP denoiser with sinusoidal timestep embedding, 100-step linear β schedule, ε-prediction MSE loss, single-image training on a 16×16 red square. Trains in ~5 min on M2 CPU; reverse-pass reconstruction MSE < 0.01. The 0.8 s sampling cost on the toy seeds the flow-matching switch in the later thesis-line work.

Technical Note cs.LG Aaditya Jain

Feb 2025
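
The ingredients listed in the WP 021 abstract (100-step linear β schedule, ε-prediction MSE loss) fit in a short sketch. The β endpoints below are assumed, since the abstract does not state them, and the [-1, 1] pixel scaling of the red square is likewise an assumption. The denoiser itself is omitted; only the forward noising and the loss are shown.

```python
import math
import random

T = 100                             # number of diffusion steps (from the abstract)
BETA_START, BETA_END = 1e-4, 0.02   # assumed endpoints, not stated in the abstract

# Linear beta schedule and its cumulative signal-retention product.
betas = [BETA_START + (BETA_END - BETA_START) * t / (T - 1) for t in range(T)]
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)


def q_sample(x0, t, rng):
    """Forward process: noise x0 to step t in closed form, returning (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    return x_t, eps


def eps_mse(eps_pred, eps_true):
    """Epsilon-prediction loss: MSE between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps_true)) / len(eps_true)


rng = random.Random(0)
x0 = [1.0, -1.0, -1.0] * (16 * 16)  # flattened 16x16 red square, assumed [-1, 1] RGB
x_t, eps = q_sample(x0, T - 1, rng)
```

A training step would draw a random `t`, call `q_sample`, feed `x_t` and the timestep embedding to the denoiser, and minimise `eps_mse` against the returned `eps`.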

WP — 022

Signed Distance Fields as a Foundational 3-D Representation: Analytic SDFs, Comparison Against Point Clouds and Meshes, and a Brief Exploration of GAN-Based SDF Generation

Foundational SDF study. Structured comparison (point cloud vs mesh vs SDF), analytic sphere / cuboid SDFs with closed-form CSG operators (union = min, intersection = max, difference = max(a, −b)), and a brief abandoned GAN-SDF exploration. Decisions made here propagate forward: Hexplane AE's continuous-feature pivot, the diffusion-over-GAN preference, the UODF cross-reference.

Foundation Study cs.GR · cs.LG Aaditya Jain

Feb 2025
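
The closed-form CSG operators quoted in WP 022 (union = min, intersection = max, difference = max(a, −b)) translate directly into code. The sketch below pairs them with analytic sphere and axis-aligned box SDFs; the function names and the example geometry are illustrative choices, not taken from the study.

```python
import math


def sphere_sdf(p, center, radius):
    """Signed distance from p to a sphere."""
    return math.dist(p, center) - radius


def box_sdf(p, center, half_extents):
    """Signed distance from p to an axis-aligned box (cuboid)."""
    q = [abs(pi - ci) - hi for pi, ci, hi in zip(p, center, half_extents)]
    outside = math.sqrt(sum(max(qi, 0.0) ** 2 for qi in q))
    inside = min(max(q), 0.0)
    return outside + inside


def union(a, b):
    return min(a, b)


def intersection(a, b):
    return max(a, b)


def difference(a, b):
    return max(a, -b)


p = (0.0, 0.0, 0.0)
d_sphere = sphere_sdf(p, (0.0, 0.0, 0.0), 1.0)          # p is the sphere centre
d_box = box_sdf(p, (1.0, 0.0, 0.0), (0.5, 0.5, 0.5))    # p is outside the box

d_union = union(d_sphere, d_box)
d_intersection = intersection(d_sphere, d_box)
d_difference = difference(d_sphere, d_box)
```

A negative value means the point is inside the composite shape, so here the origin is inside the union and the difference (sphere minus box) but outside the intersection.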

WP — 011

Flow Matching Backbone Validation on MNIST: A Three-Way Comparison of Pure Mamba, Pure Transformer, and Hybrid Mamba+Attention Under Matched Parameter Count

Matched-parameter MNIST validation across three sequence backbones with identical FM head + training schedule. Hybrid wins loss (0.080) but Pure Mamba ties on visual sample quality and is 1.44× faster per step. Pure Transformer produces noise at 50-epoch budget. Decision rule (set ex ante) picks Pure Mamba — the backbone choice that carries into MambaFlow3D.

Technical Report cs.LG Aaditya Jain

Nov 2025

WP — 010

MambaFlow3D: A Pure-Mamba + Latent-Flow-Matching Architecture for Single-Image 3-D Generation — Spec, Speed-up Budget, and ModelNet10 Phase-2 Validation

Architecture spec for single-image-to-3-D substituting Pure-Mamba state-space blocks for transformer attention (vs SparC3D / TRELLIS) and flow matching for DDPM sampling. PointNet++ → 10 Mamba blocks (d_model=256, d_state=128) → FM head → FoldingNet, 7.25 M parameters. Speed-up budget: 2–3× training, 5–12× end-to-end inference. ModelNet10 Phase-2 bring-up with the PointNet++ +3-channel trap diagnostic.

Technical Report cs.LG · cs.GR Aaditya Jain

Nov 2025

WP — 009

Training JiT Diffusion on Two Consumer GPUs: Hardware Adaptation, Debugging Cascade, and Phase-1 Reproduction of ViT-Backbone x-Prediction Diffusion at ImageNet-256

Reproduction of LTH14/JiT (ViT-B/16 + x-prediction, ImageNet-256) on 2 × RTX 3060 12 GB. Documents the hardware-adaptation table (1024 → 32 effective batch, 80 % → 55 % memory, ~1 h → ~6 h per epoch) and a five-failure debugging cascade (MKL ITT, ImageNet flat dirs, "hang" diagnostic, DataLoader OOM, pos_embed shape mismatch). Epoch-0 FID 281.24; crash at start of epoch 1 in sampling pass.

Technical Report cs.LG · cs.CV Aaditya Jain

Nov 2025

WP — 008

When Variational Autoencoders Meet Binary Geometry: Posterior Collapse on 6-Plane Hexplane Representations and the Continuous-Feature Fix

Diagnoses a structural failure mode of VAEs trained on binary hexplane occupancy: continuous-Gaussian distributional assumption is violated, reconstruction term converges to degenerate mean solution, posterior collapses regardless of KL schedule. Fix is at input representation, not loss schedule. Pivot to deterministic AE with continuous depth+normals features eliminates the failure.

arXiv Preprint cs.LG · cs.GR Aaditya Jain

Dec 2025

WP — 007

SculptNet: Learning Coarse-to-Fine 3D Reconstruction from Single Images via Five-Primitive Vocabularies and Progressive Stage-Wise Commitment

Four-stage coarse-to-fine pipeline (block → shape → detail → compose) over five named primitives (box, cylinder, cone, sphere, wedge) with independent face/cap deformation. PartNeXt-trained via a Houdini Python SOP geometric classifier. ~1.3 cm Hausdorff on chair benchmark (2.6% of bounding diagonal). No executor gap — primitives are continuous parametric geometry.

arXiv Preprint cs.GR · cs.CV Aaditya Jain

Feb 2026

WP — 006

Hierarchical Part-Based Triplane Reconstruction: Eliminating Inter-Part Occlusion via Per-Part Local Frames and a Shared Decoder

N + 1 triplane sets — one per semantic part in a local frame, one coarser global for spatial context. Shared SDF decoder conditioned on part-id embedding. Structural fix to the EG3D / SparC3D inter-part occlusion failure mode. ~6× storage, ~2× inference, fidelity 4.8% → 1.9% Hausdorff on chair benchmark.

arXiv Preprint cs.GR · cs.LG Aaditya Jain

Feb 2026

WP — 005

Six-Plane Orthographic Mesh Reconstruction: From Dense Depth Pixels to Watertight Triangulated Geometry via Per-Cluster Polygon-with-Holes Triangulation

Inverts six axis-aligned orthographic depth maps into a single watertight 3D triangle mesh. Six-stage pipeline (foreground → cluster → contour → simplify → triangulate → stitch) compresses 352K input pixels to ~454 vertices on the sphere benchmark; minimal-polygon vs cloth-grid trade-off characterised.

Technical Report cs.GR · cs.CG Aaditya Jain

Feb 2026

WP — 004

Six-Plane Elevation Reconstruction: Watertight 3D Building Geometry from a Single Street-View Photograph

Routes a single street-view photo through six orthographic elevations → marching-squares contours → depth clustering → earcut triangulation → watertight stitched mesh. Compresses 352K source pixels to ~454 vertices / ~332 triangles.

arXiv Preprint cs.GR · cs.CV Aaditya Jain

Mar 2026

WP — 001

PGN: A Transformer-Based Procedural Generator Network for 3D Bridge Synthesis from Polyline Semantic Attributes

Seq2seq transformer mapping polylines with semantic boundary attributes to executable DSL programs that construct USD bridge scenes. 15-pair training corpus, dual-loss curriculum, analysis of the non-differentiable executor gap.

arXiv Preprint cs.GR · cs.LG Aaditya Jain

Sep 2025

WP — 002

SketchProc3D: CNN-Based Grammar Snippet Recognition for Inverse Procedural Modeling of Building Facades from Freehand Sketches

CNN system mapping freehand building sketches to CityEngine CGA grammar programs. 95–99% accuracy on synthetic data; domain-gap analysis between synthetic edges and human sketches; differentiable rendering gradient analysis.

arXiv Preprint cs.GR · cs.CV Aaditya Jain

Oct 2025

WP — 003

Graph Grammars for Automatic 3D Procedural Modeling: Implementing Merrell's Boundary String Method with Neural Rule Prediction

Re-implementation of Merrell's graph grammar from scratch — half-edge boundary strings, 3D extension, Python prototype generating chairs / tables with detected symmetry rules. Path toward neural rule prediction outlined.

arXiv Preprint cs.GR · cs.LG Aaditya Jain

Oct 2025