Aditya Jain / Apple Maps · 3D Reconstruction
Topic 13 · Sep 2025 · Transformer · Char-LM · Learning Exercise

Mini LLM —
Character Transformer from Scratch.

Pre-thesis learning exercise — build a character-level transformer from first principles, with no high-level library, to understand the mechanics of attention before using transformers as a black box in PGN (Topic 40) and the later 3-D-generation work. Implemented Phase 1 in pure NumPy on an Intel iMac. The "aha" was watching the same letter 'l' produce three different output vectors in "hello world" — confirming context-dependent representations as the core of what attention does.

00 — Motivation

Understand transformers before using them in 3-D work.

September 2025 was the start of the thesis line — PGN was being designed and the seq2seq transformer head was the obvious choice for the polyline → DSL mapping. The honest gap: I had read about transformers and used them through Hugging Face, but had never built one from scratch. The risk of using a black-box implementation for a thesis-line architectural choice was that I would not understand the failure modes when the model misbehaved.

The Mini LLM project was the cheap fix. Build a character-level language model from first principles. Pure NumPy, no PyTorch high-level modules. One transformer block first — attention only, then attention + FFN + layer norm + residuals + positional encoding. Test on a tiny string ("hello world"). Then upgrade to PyTorch in Phase 2, then compare against a Hugging Face tinyGPT in Phase 3. The goal was not a competitive language model; the goal was to never be confused by an attention pattern again.

The motivation is also forward-looking. The Apple Maps 3-D reconstruction line ultimately wants text-to-3-D: generate a 3-D bridge from a sketch plus a text description, or extract 3-D from a captioned street-view photo. Both depend on a text encoder, which is a transformer. The Mini LLM exercise is the building block.

What it informs
The exercise feeds the seq2seq head choice in PGN (Topic 40), the transformer-vs-Mamba comparison in MNIST validation (Topic 25), the JiT reproduction (Topic 27 — ViT backbone is the same transformer family), and the polyline-diffusion design study (Topic 24). Every subsequent transformer use in the thesis line is informed by having understood what attention actually computes from this exercise.
01 — Architecture

One transformer block, four heads, 64-dim embeddings.

The block is the simplest version of the architecture used in GPT-class models. Built component-by-component to make each piece inspectable. The full forward path:

Input string         : "hello world"                    → 11 chars
Character tokeniser  : char → int in [0, 8)             → (11,)
Embedding            : int → 64-dim vector              → (11, 64)
Positional encoding  : sinusoidal added to embedding    → (11, 64)
LayerNorm            : pre-norm                         → (11, 64)
Multi-head attention : 4 heads, causal mask             → (11, 64) + (4, 11, 11)
Residual add         : x + attn(x)                      → (11, 64)
LayerNorm            : pre-norm                         → (11, 64)
Feed-forward         : 64 → 256 (ReLU) → 64             → (11, 64)
Residual add         : x + ffn(x)                       → (11, 64)
Output head          : linear to vocab size             → (11, 8)

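The first three stages of the forward path can be sketched in a few lines of NumPy. This is a minimal illustration rather than the project's actual code: the vocabulary ordering (first appearance in the string), the random seed, and the helper name `sinusoidal_pe` are assumptions made for the example.

```python
import numpy as np

text = "hello world"

# Character tokeniser: one id per unique character, in order of first
# appearance (an assumed ordering; any fixed ordering would do).
vocab = list(dict.fromkeys(text))            # 8 unique characters
stoi = {ch: i for i, ch in enumerate(vocab)}
tokens = np.array([stoi[ch] for ch in text])  # shape (11,)

def sinusoidal_pe(seq_len, d_model):
    """Closed-form sinusoidal positional encoding, shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]         # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]     # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dims: sine
    pe[:, 1::2] = np.cos(angles)              # odd dims: cosine
    return pe

rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 64))         # random, untrained embedding table
x = E[tokens] + sinusoidal_pe(len(tokens), 64)

print(tokens.shape, x.shape)                  # (11,) (11, 64)
```

Because the encoding is closed-form, the same character at two positions gets the same row of `E` but a different row of `x`.
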
Component | Setting | Why
Vocabulary | 8 characters (h, e, l, o, ·, w, r, d) | Tiny on purpose: easy to read the attention heatmap by character name
Embedding dim | 64 | Small but non-trivial; matches typical educational toy GPTs
Attention heads | 4 | Enough to see different heads learn different patterns
Per-head dim | 16 | 64 / 4 = 16
FFN expansion | 4× (64 → 256) | Standard transformer convention
Activation | ReLU | Simpler than GELU; sufficient for the exercise
Positional encoding | Sinusoidal | Closed-form; no learned parameters
Causal mask | Lower triangular | Autoregressive: token i sees tokens 1…i

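To make the shapes in the forward path concrete, here is a hedged NumPy sketch of one pre-norm block with the settings from the table. It is not the original implementation: weights are random, biases and LayerNorm gain/shift are omitted for brevity, and all names (`block`, `multi_head_attention`, `init`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H, V = 11, 64, 4, 8        # sequence length, model dim, heads, vocab size
Dh, Dff = D // H, 4 * D          # per-head dim 16, FFN hidden dim 256

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector; learned gain/bias omitted (simplification).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def init(*shape):
    return rng.normal(scale=0.02, size=shape)

Wq, Wk, Wv, Wo = init(D, D), init(D, D), init(D, D), init(D, D)
W1, W2, Wout = init(D, Dff), init(Dff, D), init(D, V)

def multi_head_attention(x):
    def split(m):                       # (T, D) -> (H, T, Dh)
        return m.reshape(T, H, Dh).transpose(1, 0, 2)
    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(Dh)        # (H, T, T)
    mask = np.tril(np.ones((T, T), dtype=bool))            # causal: key j <= query i
    weights = softmax(np.where(mask, scores, -np.inf))     # (H, T, T)
    merged = (weights @ v).transpose(1, 0, 2).reshape(T, D)
    return merged @ Wo, weights

def block(x):
    attn_out, weights = multi_head_attention(layer_norm(x))
    x = x + attn_out                                        # residual add
    hidden = np.maximum(0, layer_norm(x) @ W1)              # 64 -> 256, ReLU
    x = x + hidden @ W2                                     # 256 -> 64, residual add
    return x @ Wout, weights                                # logits over the vocab

x = rng.normal(size=(T, D))      # stand-in for embedding + positional encoding
logits, weights = block(x)
print(logits.shape, weights.shape)   # (11, 8) (4, 11, 11)
```

Returning the attention weights alongside the logits is what makes the per-head heatmaps inspectable.
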
Pipeline

Three phases — NumPy from scratch → PyTorch → Hugging Face compare.

Phase 1 · NumPy       attention from scratch + FFN + LayerNorm + PE
Phase 2 · PyTorch     nn.Linear, nn.LayerNorm + training loop (planned)
Phase 3 · HF compare  side-by-side vs tinyGPT on the same toy text (planned)

Status: Phase 1 complete. Phases 2/3 queued, then superseded by PGN seq2seq work (Topic 40). "Phase 1 was enough": the architectural understanding gained from the NumPy build supplied the intuition needed; Phases 2/3 were optional once that was in place.
02 — The "hello world" Run

Untrained model on 11 characters. Output: confirmed attention works.

The first end-to-end run was on the literal string "hello world" — 11 characters tokenised to 8 unique vocab items. The model was untrained (random-initialised weights, random embeddings), so the output values are not meaningful as predictions. The diagnostic signal is in (i) the attention-weight matrix (per head, per query position, distribution over key positions) being lower-triangular as expected from the causal mask, and (ii) the same letter 'l' producing three different output vectors depending on its position.
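Diagnostic (i) is easy to automate. The sketch below checks that every head's weight matrix is lower-triangular and that each row is a probability distribution; the helper name `is_causal` and the random scores are assumptions for illustration, not part of the original run.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def is_causal(weights, tol=1e-12):
    """True if each (T, T) matrix in `weights` is lower-triangular with rows summing to 1."""
    upper = np.triu(weights, k=1)                 # strictly-upper part, per head
    rows_ok = np.allclose(weights.sum(-1), 1.0)
    return bool(np.all(np.abs(upper) <= tol) and rows_ok)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 11, 11))             # 4 heads, 11 queries, 11 keys
mask = np.tril(np.ones((11, 11), dtype=bool))

masked = softmax(np.where(mask, scores, -np.inf)) # causal mask applied before softmax
unmasked = softmax(scores)                        # no mask: future keys leak in

print(is_causal(masked), is_causal(unmasked))     # True False
```
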

Position | Char | Input embed (first 3 dims) | Output (first 3 dims)
0 | h | [0.224, 1.107, 0.053] | [0.649, 2.759, 1.445]
1 | e | [0.946, 0.745, 0.553] | [1.239, 2.416, 1.729]
2 | l (in "hello") | [0.731, −0.452, 1.103] | [0.587, 0.828, 2.060]
3 | l (in "hello") | [−0.037, −1.026, 0.884] | [−0.662, −0.150, 1.563]
4 | o | [−0.758, −0.833, 0.022] | [−1.603, −0.288, 0.509]
5 | · | [−0.933, 0.382, −0.547] | [−1.593, 0.973, −0.013]
6 | w | [−0.258, 0.986, −0.892] | [−0.997, 1.805, −0.519]
7 | o | [0.656, 0.574, −0.979] | [−0.140, 0.828, −0.928]
8 | r | [0.885, −0.275, −0.436] | [0.485, 0.020, −0.644]
9 | l (in "world") | [0.234, −0.947, 0.555] | [−0.200, −1.299, 0.636]
10 | d | [−0.698, −0.860, 1.068] | [−1.308, −1.325, 1.482]

The three 'l' rows are the punch-line. The character 'l' has one fixed token embedding (its input rows in the table differ only because the sinusoidal positional encoding has already been added), but three different outputs depending on what came before it, which is exactly what context-dependent representation means. This is the entire point of attention, made concrete on a string short enough to read by hand.
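The effect can be isolated in a small sketch. As a deliberate simplification (an assumption of this example, not the setup of the run above), positional encoding is left out, so the three 'l' positions receive byte-identical input vectors; a single causal attention head with random weights still gives each of them a different output, because each sees a different prefix.

```python
import numpy as np

rng = np.random.default_rng(0)
text = "hello world"
vocab = list(dict.fromkeys(text))
tokens = np.array([vocab.index(c) for c in text])

D = 16
E = rng.normal(size=(len(vocab), D))      # random, untrained embeddings
x = E[tokens]                             # no positional encoding (simplification)
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

# Single-head causal self-attention.
scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(D)
mask = np.tril(np.ones((len(text), len(text)), dtype=bool))
scores = np.where(mask, scores, -np.inf)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ (x @ Wv)

l_pos = [i for i, c in enumerate(text) if c == "l"]
print(l_pos)                              # [2, 3, 9]
for i in l_pos:
    # Same input row for every 'l'; the output row depends on the prefix.
    print(i, np.round(x[i, :3], 3), np.round(out[i, :3], 3))
```
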

Core Insight

Same 'l'. Three different output vectors.
That's the whole thing.

Attention is, mechanically, weighted averaging of values, and the weights depend on the query (which depends on the current position) and the keys (one per visible position: the current one and every previous one). Once you have watched a single character produce different outputs at different positions in a single sentence, the rest of the architecture (multi-head, FFN, residuals, layer norm) is engineering around that core idea.
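Written out for one head, the weighted average takes the following form. The symbols are notation introduced here for illustration (they do not appear in the original write-up): x_i is the input at position i, W_Q, W_K, W_V are the projection matrices, and d_k is the per-head dimension (16 in this setup).

```latex
\[
  q_i = x_i W_Q, \qquad k_j = x_j W_K, \qquad v_j = x_j W_V
\]
\[
  \alpha_{ij} =
  \begin{cases}
    \dfrac{\exp\!\left(q_i \cdot k_j / \sqrt{d_k}\right)}
          {\sum_{j' \le i} \exp\!\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)} & j \le i, \\[2ex]
    0 & j > i,
  \end{cases}
  \qquad
  \mathrm{out}_i = \sum_{j \le i} \alpha_{ij}\, v_j .
\]
```

The case j > i is the causal mask; each row of the weights sums to 1, so every output is a convex combination of the visible value vectors.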

Interactive Demo · Live

Type a short string and watch the attention heatmap update. Each row is a query position; each column is a key position. The lower-triangular pattern is the causal mask. The colour intensity shows how much that query position attends to that key position. Pick a head to see how different heads attend to different parts of the sequence.
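For readers without the live widget, the same reading convention (rows are queries, columns are keys, darker means more attention) can be reproduced as text. This is only an illustrative sketch: the function `ascii_heatmap`, the shade ramp, and the uniform causal weights used as input are all assumptions, not the demo's code.

```python
import numpy as np

def ascii_heatmap(weights, labels, shades=" .:-=+*#%@"):
    """Render one head's (T, T) attention weights as text.

    Rows are query positions, columns are key positions; a weight of 0
    maps to a blank and a weight of 1 to the densest glyph.
    """
    lines = ["  " + " ".join(labels)]
    for label, row in zip(labels, weights):
        idx = np.minimum((row * len(shades)).astype(int), len(shades) - 1)
        lines.append(label + " " + " ".join(shades[i] for i in idx))
    return "\n".join(lines)

labels = list("hello·world")
T = len(labels)
# Stand-in weights: each query attends uniformly to itself and all earlier keys.
uniform_causal = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]

art = ascii_heatmap(uniform_causal, labels)
print(art)
```

The blank upper triangle in the printed grid is the causal mask; swapping in a real head's (11, 11) weight matrix shows where that head concentrates.
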


Full Technical Paper

White paper · transformer-from-scratch implementation · hello-world context-dependence result · thesis-line foundation

Related Thesis Chapters
PGN — Polyline → DSL seq2seq
First production use of transformer attention in the thesis line. The Mini-LLM exercise informed the architecture choices made for PGN's encoder and decoder stacks.
MNIST Flow-Matching Validation
The Mamba-vs-Transformer comparison where the transformer baseline is the same architecture family this exercise built. The understanding from here is what made the Mamba win interpretable.
JiT Diffusion — ViT Backbone
ViT is a transformer over image patches. The patch-attention pattern in JiT is the same mechanism as the character-attention pattern in this exercise, scaled up to 256 patches and 86 M parameters.
Appendix — Raw Materials
Transcripts & Source References
Restricted Access