Topic 13 Sep 2025 Transformer · Char-LM · Learning Exercise

Mini LLM: Character Transformer from Scratch.

Pre-thesis learning exercise — build a character-level transformer from first principles, with no high-level library, to understand the mechanics of attention before using transformers as a black box in PGN (Topic 40) and the later 3-D-generation work. Implemented Phase 1 in pure NumPy on an Intel iMac. The "aha" was watching the same letter 'l' produce three different output vectors in "hello world" — confirming context-dependent representations as the core of what attention does.

00 — Motivation

Understand transformers before using them in 3-D work.

September 2025 was the start of the thesis line — PGN was being designed and the seq2seq transformer head was the obvious choice for the polyline → DSL mapping. The honest gap: I had read about transformers and used them through Hugging Face, but had never built one from scratch. The risk of using a black-box implementation for a thesis-line architectural choice was that I would not understand the failure modes when the model misbehaved.

The Mini LLM project was the cheap fix. Build a character-level language model from first principles. Pure NumPy, no PyTorch high-level modules. One transformer block first — attention only, then attention + FFN + layer norm + residuals + positional encoding. Test on a tiny string ("hello world"). Then upgrade to PyTorch in Phase 2, then compare against a Hugging Face tinyGPT in Phase 3. The goal was not a competitive language model; the goal was to never be confused by an attention pattern again.

The motivation is also forward-looking. The Apple Maps 3-D-reconstruction line ultimately wants text-to-3-D: generate a 3-D bridge from a sketch plus a text description, or extract 3-D from a captioned street-view photo. Both depend on a text encoder, which is a transformer. The Mini LLM exercise is the building block.

What it informs

The exercise feeds the seq2seq head choice in PGN (Topic 40), the transformer-vs-Mamba comparison in MNIST validation (Topic 25), the JiT reproduction (Topic 27 — ViT backbone is the same transformer family), and the polyline-diffusion design study (Topic 24). Every subsequent transformer use in the thesis line is informed by having understood what attention actually computes from this exercise.

01 — Architecture

One transformer block, four heads, 64-dim embeddings.

The block is the simplest version of the architecture used in GPT-class models. Built component-by-component to make each piece inspectable. The full forward path:

Input string         : "hello world"                  → 11 chars
Character tokeniser  : char → int in [0, 8)             → (11,)
Embedding            : int → 64-dim vector              → (11, 64)
Positional encoding  : sinusoidal added to embedding    → (11, 64)
LayerNorm            : pre-norm                         → (11, 64)
Multi-head attention : 4 heads, causal mask             → (11, 64) + (4, 11, 11)
Residual add         : x + attn(x)                      → (11, 64)
LayerNorm            : pre-norm                         → (11, 64)
Feed-forward         : 64 → 256 (ReLU) → 64             → (11, 64)
Residual add         : x + ffn(x)                       → (11, 64)
Output head          : linear to vocab size             → (11, 8)
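
The forward path above can be sketched in NumPy roughly as follows. This is a minimal illustration and not the original Phase 1 code: the seed, the weight scaling, the omission of biases and LayerNorm gain/bias, and all function and variable names are assumptions made for the example. Only the shapes and the ordering of the steps come from the listing.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed (assumption), only for reproducibility

text = "hello world"
vocab = list(dict.fromkeys(text))                  # 8 unique chars, first-seen order
char_to_id = {c: i for i, c in enumerate(vocab)}
tokens = np.array([char_to_id[c] for c in text])   # (11,)

T, V, D, H = len(tokens), len(vocab), 64, 4
Dh = D // H                                        # per-head dim: 64 / 4 = 16

def init(*shape):
    # Random weights scaled by 1/sqrt(fan_in); the scaling is an assumption.
    return rng.normal(size=shape) / np.sqrt(shape[0])

def sinusoidal(seq_len, dim):
    pos = np.arange(seq_len)[:, None]
    two_i = np.arange(0, dim, 2)[None, :]
    angles = pos / np.power(10000.0, two_i / dim)
    pe = np.empty((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, eps=1e-5):
    # Normalise each position over its 64 features (no learned gain/bias here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    n = x.shape[0]
    # Project, then split 64 dims into 4 heads of 16: (T, D) -> (H, T, Dh)
    q = (x @ Wq).reshape(n, H, Dh).transpose(1, 0, 2)
    k = (x @ Wk).reshape(n, H, Dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(n, H, Dh).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(Dh)        # (H, T, T)
    mask = np.tril(np.ones((n, n), dtype=bool))            # causal: key <= query
    weights = softmax(np.where(mask, scores, -np.inf))     # (H, T, T)
    merged = (weights @ v).transpose(1, 0, 2).reshape(n, D)
    return merged @ Wo, weights

E = rng.normal(size=(V, D))                        # random character embeddings
Wq, Wk, Wv, Wo = (init(D, D) for _ in range(4))
W1, W2 = init(D, 4 * D), init(4 * D, D)            # FFN: 64 -> 256 -> 64
W_out = init(D, V)

x = E[tokens] + sinusoidal(T, D)                   # embedding + positional encoding
attn_out, attn_w = attention(layer_norm(x), Wq, Wk, Wv, Wo)
x = x + attn_out                                   # residual add
x = x + np.maximum(0.0, layer_norm(x) @ W1) @ W2   # pre-norm FFN with ReLU + residual
logits = x @ W_out                                 # output head

print(logits.shape, attn_w.shape)                  # (11, 8) (4, 11, 11)
```

Because the weights are random, the logits carry no predictive meaning; the sketch is only useful for checking shapes and the masking behaviour.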

Component	Setting	Why
Vocabulary	8 characters (h, e, l, o, ·, w, r, d)	Tiny on purpose — easy to read the attention heatmap by character name
Embedding dim	64	Small but non-trivial; matches a typical educational toy GPT
Attention heads	4	Enough to see different heads learn different patterns
Per-head dim	16	64 / 4 = 16
FFN expansion	4×	Standard transformer convention
Activation	ReLU	Simpler than GELU; sufficient for the exercise
Positional encoding	Sinusoidal	Closed-form; no learned parameters
Causal mask	Lower triangular	Autoregressive: token i sees tokens 0…i (itself and everything before it)
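
Two rows of the table, the sinusoidal positional encoding and the lower-triangular causal mask, are closed-form and small enough to sketch directly. The base of 10000 is the convention from the original Transformer formulation and is assumed here, since the write-up only says "sinusoidal". The function names are hypothetical.

```python
import numpy as np

def sinusoidal_encoding(seq_len, dim):
    """PE[pos, 2i] = sin(pos / 10000^(2i/dim)); PE[pos, 2i+1] = cos of the same angle."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    two_i = np.arange(0, dim, 2)[None, :]      # (1, dim / 2)
    angles = pos / np.power(10000.0, two_i / dim)
    pe = np.empty((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)               # even feature indices
    pe[:, 1::2] = np.cos(angles)               # odd feature indices
    return pe

def causal_mask(seq_len):
    """True where query position i may attend to key position j, i.e. j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

pe = sinusoidal_encoding(11, 64)
mask = causal_mask(11)
print(pe.shape, mask.shape)
print(mask[:3, :3].astype(int))
```

At position 0 every angle is zero, so the even features are 0 and the odd features are 1. That makes a quick sanity check when wiring the encoding into a model.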

Pipeline

Three phases — NumPy from scratch → PyTorch → Hugging Face compare.

02 — The "hello world" Run

Untrained model on 11 characters. Output: confirmed attention works.

The first end-to-end run was on the literal string "hello world" — 11 characters tokenised to 8 unique vocab items. The model was untrained (random-initialised weights, random embeddings), so the output values are not meaningful as predictions. The diagnostic signal is in (i) the attention-weight matrix (per head, per query position, distribution over key positions) being lower-triangular as expected from the causal mask, and (ii) the same letter 'l' producing three different output vectors depending on its position.
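
Diagnostic (i) can be turned into a small automated check. The sketch below uses random scores as a stand-in for the untrained model's scaled query-key products; `causal_softmax` and `check_attention` are hypothetical helper names, not part of the original code.

```python
import numpy as np

def causal_softmax(scores):
    """Softmax over key positions, with future positions masked out."""
    n = scores.shape[-1]
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(masked)                                      # exp(-inf) == 0
    return e / e.sum(axis=-1, keepdims=True)

def check_attention(weights, atol=1e-8):
    """Return (no weight on future keys, every query row sums to 1)."""
    no_future = np.allclose(np.triu(weights, k=1), 0.0, atol=atol)
    rows_sum_to_one = np.allclose(weights.sum(axis=-1), 1.0, atol=atol)
    return no_future, rows_sum_to_one

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 11, 11))   # stand-in scores: (heads, queries, keys)
weights = causal_softmax(scores)

print(check_attention(weights))
print(weights[0, 0])                    # the first query can only attend to itself
```

The first row of every head is a useful spot check: with only one visible key, all of its weight has to sit on position 0.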

Position	Char	Input embed (first 3 dims)	Output (first 3 dims)
0	h	[0.224, 1.107, 0.053]	[0.649, 2.759, 1.445]
1	e	[0.946, 0.745, 0.553]	[1.239, 2.416, 1.729]
2	l (in "hello")	[0.731, −0.452, 1.103]	[0.587, 0.828, 2.060]
3	l (in "hello")	[−0.037, −1.026, 0.884]	[−0.662, −0.150, 1.563]
4	o	[−0.758, −0.833, 0.022]	[−1.603, −0.288, 0.509]
5	·	[−0.933, 0.382, −0.547]	[−1.593, 0.973, −0.013]
6	w	[−0.258, 0.986, −0.892]	[−0.997, 1.805, −0.519]
7	o	[0.656, 0.574, −0.979]	[−0.140, 0.828, −0.928]
8	r	[0.885, −0.275, −0.436]	[0.485, 0.020, −0.644]
9	l (in "world")	[0.234, −0.947, 0.555]	[−0.200, −1.299, 0.636]
10	d	[−0.698, −0.860, 1.068]	[−1.308, −1.325, 1.482]

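The three 'l' rows are the punch-line. The character 'l' maps to a single token embedding, yet the model emits three different outputs for it depending on where it sits and what came before it, which is exactly what context-dependent representation means. (The tabulated input values already differ across the three rows, presumably because they were recorded after the sinusoidal positional encoding was added.) This is the entire point of attention, made concrete on a string short enough to read by hand.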
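
To separate the effect of attention from the effect of positional encoding, the same experiment can be repeated with the positional encoding deliberately left out. The sketch below is an assumption-laden simplification (one head, 16 dimensions, random weights, hypothetical variable names), not the original run. With no positional signal, the three 'l' inputs are bit-for-bit identical, so any difference in the outputs comes from the causal context alone.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed (assumption)
text = "hello world"
vocab = sorted(set(text))
ids = np.array([vocab.index(c) for c in text])

D = 16                                  # single head, 16 dims (simplification)
E = rng.normal(size=(len(vocab), D))    # random character embeddings
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

x = E[ids]                              # no positional encoding: all 'l' rows identical
q, k, v = x @ Wq, x @ Wk, x @ Wv

T = len(ids)
scores = q @ k.T / np.sqrt(D)
scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)      # each row: distribution over visible keys
out = w @ v                             # weighted average of the visible values

l_pos = [i for i, c in enumerate(text) if c == "l"]   # positions of 'l'
for p in l_pos:
    print(p, x[p, :3].round(3), out[p, :3].round(3))
```

The printed input columns match across the three rows while the output columns do not, because each 'l' averages over a different set of preceding characters.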

Core Insight

Same 'l'. Three different output vectors.
That's the whole thing.

Attention is, mechanically, weighted averaging of values. The weights depend on the query (computed from the current position) and the keys (computed from the current position and every position before it). Once you have watched a single character produce different outputs at different positions in a single sentence, the rest of the architecture (multi-head, FFN, residuals, layer norm) is engineering around that core idea.
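
Written out for one head, the weighted average takes the standard scaled dot-product form. The write-up does not state the scaling factor, so the 1/sqrt(d_k) term (with d_k = 16 here) is assumed from the usual formulation; M is the causal mask.

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
  0       & j \le i \\
  -\infty & j > i
\end{cases}

% Per query position i: a convex combination of the visible values
\mathrm{out}_i = \sum_{j \le i} \alpha_{ij}\, v_j,
\qquad
\alpha_{ij} = \frac{\exp\!\left(q_i \cdot k_j / \sqrt{d_k}\right)}
                   {\sum_{j' \le i} \exp\!\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)}
```

Each row of the heatmap is one set of weights \alpha_{i,\cdot}; they are non-negative and sum to 1 over the positions j ≤ i.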

Interactive Demo · Live

Type a short string and watch the attention heatmap update. Each row is a query position; each column is a key position. The lower-triangular pattern is the causal mask. The colour intensity shows how much that query position attends to that key position. Pick a head to see how different heads attend to different parts of the sequence.

