Pre-thesis learning exercise — build a character-level transformer from first principles, with no high-level library, to understand the mechanics of attention before using transformers as a black box in PGN (Topic 40) and the later 3-D-generation work. Implemented Phase 1 in pure NumPy on an Intel iMac. The "aha" was watching the same letter 'l' produce three different output vectors in "hello world" — confirming context-dependent representations as the core of what attention does.
September 2025 was the start of the thesis line — PGN was being designed and the seq2seq transformer head was the obvious choice for the polyline → DSL mapping. The honest gap: I had read about transformers and used them through Hugging Face, but had never built one from scratch. The risk of using a black-box implementation for a thesis-line architectural choice was that I would not understand the failure modes when the model misbehaved.
The Mini LLM project was the cheap fix. Build a character-level language model from first principles. Pure NumPy, no PyTorch high-level modules. One transformer block first — attention only, then attention + FFN + layer norm + residuals + positional encoding. Test on a tiny string ("hello world"). Then upgrade to PyTorch in Phase 2, then compare against a Hugging Face tinyGPT in Phase 3. The goal was not a competitive language model; the goal was to never be confused by an attention pattern again.
The motivation is also forward-looking. The Apple Maps 3-D-reconstruction line ultimately wants text-to-3-D: generate a 3-D bridge from a sketch plus a text description, or extract 3-D from a captioned street-view photo. Both depend on a text encoder, which is a transformer. The Mini LLM exercise is the building block.
The block is the simplest version of the architecture used in GPT-class models, built component by component so that each piece is inspectable. The components on the forward path and their settings:
| Component | Setting | Why |
|---|---|---|
| Vocabulary | 8 characters (h, e, l, o, ·, w, r, d) | Tiny on purpose — easy to read the attention heatmap by character name |
| Embedding dim | 64 | Small but non-trivial; matches a typical educational toy GPT |
| Attention heads | 4 | Enough to see different heads learn different patterns |
| Per-head dim | 16 | 64 / 4 = 16 |
| FFN expansion | 4× | Standard transformer convention |
| Activation | ReLU | Simpler than GELU; sufficient for the exercise |
| Positional encoding | Sinusoidal | Closed-form; no learned parameters |
| Causal mask | Lower triangular | Autoregressive: token i sees tokens 0…i |
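The settings in the table can be wired together roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the project's actual code: the pre-norm ordering, the layer norm without learned gain and bias, the weight scaling, and every function and parameter name (`transformer_block`, `init_params`, `Wq`, and so on) are illustrative choices that the write-up does not specify.

```python
import numpy as np

# Settings taken from the table above.
D_MODEL, N_HEADS, D_HEAD, FFN_MULT = 64, 4, 16, 4

def sinusoidal_positions(seq_len, d_model):
    """Closed-form positional encoding: sine on even dims, cosine on odd dims."""
    pos = np.arange(seq_len)[:, None]
    even_dims = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000.0 ** (even_dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, eps=1e-5):
    """Normalise each position's vector (no learned gain/bias: a simplification)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_params(vocab_size, rng):
    """Random, untrained weights; the scaling is an illustrative assumption."""
    def w(rows, cols):
        return rng.normal(0.0, 1.0 / np.sqrt(rows), size=(rows, cols))
    d_ff = FFN_MULT * D_MODEL
    return {
        "embed": rng.normal(size=(vocab_size, D_MODEL)),
        "Wq": w(D_MODEL, D_MODEL), "Wk": w(D_MODEL, D_MODEL),
        "Wv": w(D_MODEL, D_MODEL), "Wo": w(D_MODEL, D_MODEL),
        "W1": w(D_MODEL, d_ff), "b1": np.zeros(d_ff),
        "W2": w(d_ff, D_MODEL), "b2": np.zeros(D_MODEL),
    }

def split_heads(x):
    """(T, D_MODEL) -> (N_HEADS, T, D_HEAD)."""
    return x.reshape(x.shape[0], N_HEADS, D_HEAD).transpose(1, 0, 2)

def multi_head_attention(x, p):
    T = x.shape[0]
    q, k, v = (split_heads(x @ p[name]) for name in ("Wq", "Wk", "Wv"))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D_HEAD)  # (heads, T, T)
    causal = np.tril(np.ones((T, T), dtype=bool))        # lower-triangular mask
    weights = softmax(np.where(causal, scores, -np.inf))
    merged = (weights @ v).transpose(1, 0, 2).reshape(T, D_MODEL)
    return merged @ p["Wo"], weights

def transformer_block(x, p):
    # Pre-norm residual layout (an assumption; post-norm is equally plausible).
    attn_out, weights = multi_head_attention(layer_norm(x), p)
    x = x + attn_out
    hidden = np.maximum(0.0, layer_norm(x) @ p["W1"] + p["b1"])  # ReLU FFN
    return x + hidden @ p["W2"] + p["b2"], weights

rng = np.random.default_rng(0)
text = "hello world"
vocab = sorted(set(text))
ids = np.array([vocab.index(ch) for ch in text])
params = init_params(len(vocab), rng)
x = params["embed"][ids] + sinusoidal_positions(len(text), D_MODEL)
out, weights = transformer_block(x, params)
print(out.shape, weights.shape)  # (11, 64) (4, 11, 11)
```

Returning `weights` alongside the output keeps the per-head attention matrices available for inspection, which is the point of building the block by hand.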
The first end-to-end run was on the literal string "hello world": 11 characters tokenised to 8 unique vocab items. The model was untrained (randomly initialised weights and embeddings), so the output values are not meaningful as predictions. The diagnostic signal is in (i) the attention-weight matrix (per head, per query position, a distribution over key positions) being lower-triangular, as expected from the causal mask, and (ii) the same letter 'l' producing three different output vectors depending on its position.
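The tokenisation step is small enough to check by hand. The sketch below assumes a sorted character vocabulary; the ordering and the names `stoi`, `ids`, and `l_positions` are illustrative and may not match the original implementation.

```python
text = "hello world"

# Character-level vocabulary: one id per unique character (sorted order is an assumption).
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = [stoi[ch] for ch in text]

# Positions where the same token 'l' occurs; these are the rows to compare in the output.
l_positions = [pos for pos, ch in enumerate(text) if ch == "l"]

print(len(text), len(vocab))  # 11 8
print(l_positions)            # [2, 3, 9]
```

All three 'l' positions map to the same token id, so any difference between their output vectors has to come from position and context rather than from the token itself.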
| Position | Char | Input embed (first 3 dims) | Output (first 3 dims) |
|---|---|---|---|
| 0 | h | [0.224, 1.107, 0.053] | [0.649, 2.759, 1.445] |
| 1 | e | [0.946, 0.745, 0.553] | [1.239, 2.416, 1.729] |
| 2 | l (in "hello") | [0.731, −0.452, 1.103] | [0.587, 0.828, 2.060] |
| 3 | l (in "hello") | [−0.037, −1.026, 0.884] | [−0.662, −0.150, 1.563] |
| 4 | o | [−0.758, −0.833, 0.022] | [−1.603, −0.288, 0.509] |
| 5 | · | [−0.933, 0.382, −0.547] | [−1.593, 0.973, −0.013] |
| 6 | w | [−0.258, 0.986, −0.892] | [−0.997, 1.805, −0.519] |
| 7 | o | [0.656, 0.574, −0.979] | [−0.140, 0.828, −0.928] |
| 8 | r | [0.885, −0.275, −0.436] | [0.485, 0.020, −0.644] |
| 9 | l (in "world") | [0.234, −0.947, 0.555] | [−0.200, −1.299, 0.636] |
| 10 | d | [−0.698, −0.860, 1.068] | [−1.308, −1.325, 1.482] |
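The three 'l' rows are the punch-line. The character 'l' has one fixed token embedding (the input column already differs from row to row, presumably because it includes the sinusoidal positional encoding), yet the block produces three different outputs depending on where the 'l' sits and what came before it. That is exactly what context-dependent representation means. This is the entire point of attention, made concrete on a string short enough to read by hand.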
Same 'l'. Three different output vectors.
That's the whole thing.
Attention is, mechanically, weighted averaging of values: the weights depend on the query (which depends on the current position) and the keys (which come from every position up to and including the current one). Once you have watched a single character produce different outputs at different positions in a single sentence, the rest of the architecture (multi-head, FFN, residuals, layer norm) is engineering around that core idea.
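The weighted-averaging view can be isolated in a single head. The sketch below is an illustration under assumptions, not the project's code: it uses one head of width 16, random untrained weights, and deliberately leaves out positional encoding so that any difference between the 'l' outputs comes from attention alone. The name `causal_attention` is hypothetical.

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention with a lower-triangular mask."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    T = x.shape[0]
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Position i may only look at positions 0..i.
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value rows it can see.
    return weights @ v, weights

rng = np.random.default_rng(0)
text = "hello world"
vocab = sorted(set(text))
d = 16  # illustrative width for one head
embed = rng.normal(size=(len(vocab), d))
x = embed[[vocab.index(ch) for ch in text]]  # no positional encoding on purpose
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out, weights = causal_attention(x, Wq, Wk, Wv)
same_input = np.array_equal(x[2], x[3]) and np.array_equal(x[3], x[9])
same_output = np.allclose(out[2], out[3]) or np.allclose(out[3], out[9])
print(same_input, same_output)
```

With these random weights the inputs at positions 2, 3, and 9 are identical while the outputs should differ, because each position averages over a different set of visible values. Position 0 can only see itself, so its output is simply its own value vector.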
Type a short string and watch the attention heatmap update. Each row is a query position; each column is a key position. The lower-triangular pattern is the causal mask. The colour intensity shows how much that query position attends to that key position. Pick a head to see how different heads attend to different parts of the sequence.
White paper · transformer-from-scratch implementation · hello-world context-dependence result · thesis-line foundation