The thesis line documented across [1–4] is shaped by transformer-family architectural choices at every load-bearing decision. The seq2seq DSL head in PGN [1], the ViT backbone in the JiT reproduction [2], the transformer baseline against which the Mamba state-space substitute is measured [3], and the SD-class latent diffusion study [5] all assume the reader understands attention at the mechanical level. The question this paper answers is: what is the cheapest path to that understanding?
The answer recorded here is: implement a single character-level transformer block from first principles, in pure NumPy, on a tiny vocabulary and a single short string, and inspect the per-position output. The implementation takes a few hours; the per-position inspection takes minutes; the resulting architectural literacy makes every later transformer paper read at twice the speed.
The contribution is therefore not a research finding in the conventional sense. It is the documented from-scratch path, the single empirical result (context-dependent representation visible on a string short enough to read by hand), and the generalisable lesson the path validates: for any architecture the thesis line relies on as load-bearing, build the 100-line from-scratch version first.
The transformer block follows the standard pre-norm GPT-class layout, scaled down to the smallest configuration that still exercises every component meaningfully. The exact parameter values are chosen so that every tensor's shape fits in a single line and can be inspected by hand.
| Component | Setting | Comment |
|---|---|---|
| Vocabulary | 8 characters (h, e, l, o, ·, w, r, d) | Tiny enough to label every column of the attention heatmap by character |
| Sequence length | 11 tokens | "hello world" — short enough to print every per-position vector |
| Embedding dim d | 64 | Small but non-trivial; standard educational toy-GPT scale |
| Heads h | 4 at 16 dim each | Per-head dim d_k = d/h = 16; enough heads to see different attention patterns |
| QKV projections | 3 × (64 → 64) linear, no bias | 3 × 64 × 64 = 12,288 params in total, i.e. 3,072 per head; ~16 K once the 64 × 64 output projection W^O is included |
| FFN | (64 → 256 → 64), ReLU | 4 × expansion ratio; ~33 K params |
| Activation | ReLU | Simpler than GELU; adequate for the exercise |
| Positional encoding | Sinusoidal (closed-form) | Period varying across dimensions; no learnable params |
| Layer norm | Pre-norm (before each sub-layer) | The GPT-2-onward standard; numerically more stable than post-norm |
| Causal mask | Lower triangular, added to attention logits as −∞ | Autoregressive: token i sees only tokens 0…i (0-indexed, matching the positions used below) |
The forward path: character → integer token → embedding (64-dim) + sinusoidal positional encoding → pre-norm LayerNorm → 4-head causal self-attention → residual → pre-norm LayerNorm → FFN (64 → 256 → 64) → residual → output (sequence of 64-dim vectors). The output would be fed to a vocab-projection linear layer for next-character prediction, but no training is run in this paper — all results are on randomly-initialised weights, which is the right setting to inspect the mechanism rather than any learned behaviour.
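The forward path above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the paper's own code: the seed, the 0.02 initialisation scale, the omission of LayerNorm gain and bias, the presence of FFN biases, and every function name are choices made for illustration, and the "·" vocabulary entry is taken to be the space character.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed (assumption)
D, H, D_FF = 64, 4, 256                 # embedding dim, heads, FFN hidden dim
D_K = D // H                            # per-head dim = 16
VOCAB = ["h", "e", "l", "o", " ", "w", "r", "d"]

def init(shape):
    return rng.normal(0.0, 0.02, size=shape)   # init scale is an assumption

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector; learnable gain/bias omitted here.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)       # stabilise before exp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def positional_encoding(L, d):
    pos = np.arange(L)[:, None]
    k = np.arange(d // 2)[None, :]
    angle = pos / 10000.0 ** (2 * k / d)
    pe = np.zeros((L, d))
    pe[:, 0::2] = np.sin(angle)                 # even dims
    pe[:, 1::2] = np.cos(angle)                 # odd dims
    return pe

def split_heads(t):
    # (L, D) -> (H, L, D_K)
    return t.reshape(t.shape[0], H, D_K).transpose(1, 0, 2)

def attention(x, p):
    L = x.shape[0]
    q, k, v = (split_heads(x @ p[n]) for n in ("Wq", "Wk", "Wv"))
    mask = np.triu(np.full((L, L), -np.inf), k=1)        # -inf above diagonal
    a = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(D_K) + mask)
    heads = (a @ v).transpose(1, 0, 2).reshape(L, D)     # concatenate heads
    return heads @ p["Wo"]

def block(x, p):
    x = x + attention(layer_norm(x), p)                  # pre-norm + residual
    h = np.maximum(0.0, layer_norm(x) @ p["W1"] + p["b1"])  # ReLU FFN
    return x + h @ p["W2"] + p["b2"]                     # second residual

params = {
    "emb": init((len(VOCAB), D)),
    "Wq": init((D, D)), "Wk": init((D, D)), "Wv": init((D, D)), "Wo": init((D, D)),
    "W1": init((D, D_FF)), "b1": np.zeros(D_FF),
    "W2": init((D_FF, D)), "b2": np.zeros(D),
}

text = "hello world"
tokens = np.array([VOCAB.index(c) for c in text])
x0 = params["emb"][tokens] + positional_encoding(len(tokens), D)
out = block(x0, params)
print(out.shape)
```

Because no training is involved, a single call to `block` is the whole experiment; the rows of `out` are the per-position vectors that can be printed and compared by hand.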
For each head h independently, the attention operation on a sequence of L input vectors X ∈ ℝ^{L × d}:
$$
Q = X W^Q_h, \qquad K = X W^K_h, \qquad V = X W^V_h \qquad (\text{each} \in \mathbb{R}^{L \times d_k})
$$

$$
A = \operatorname{softmax}\!\left( \frac{Q K^\top}{\sqrt{d_k}} + M \right) \qquad (\text{the attention matrix}, \in \mathbb{R}^{L \times L})
$$

$$
H = A V \qquad (\text{the head output}, \in \mathbb{R}^{L \times d_k})
$$

where M is the causal mask matrix with M_{ij} = 0 if j ≤ i and M_{ij} = −∞ otherwise. The four head outputs are concatenated along the feature axis and projected through a final W^O ∈ ℝ^{d × d} to produce the multi-head-attention block output. The full block sums the attention output with the residual stream, then applies layer-norm + FFN + residual.
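The three equations map almost line for line onto NumPy. The sketch below is illustrative only: the function names, the random input `X`, the seed, and the 0.1 weight scale are assumptions, and each head is given its own (d × d_k) projection matrices to mirror the per-head notation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_mask(L):
    # M[i, j] = 0 where j <= i, -inf where j > i
    return np.triu(np.full((L, L), -np.inf), k=1)

def head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                    # each (L, d_k)
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k) + causal_mask(X.shape[0]))  # (L, L)
    return A @ V, A                                     # head output (L, d_k)

def multi_head(X, heads_w, Wo):
    outs = [head(X, *w)[0] for w in heads_w]
    return np.concatenate(outs, axis=-1) @ Wo           # (L, d)

rng = np.random.default_rng(0)                          # arbitrary seed
L, d, n_heads = 11, 64, 4
d_k = d // n_heads
X = rng.normal(size=(L, d))                             # stand-in for the normalised input
heads_w = [tuple(rng.normal(0, 0.1, size=(d, d_k)) for _ in range(3))
           for _ in range(n_heads)]
Wo = rng.normal(0, 0.1, size=(d, d))

Y = multi_head(X, heads_w, Wo)
_, A0 = head(X, *heads_w[0])
print(Y.shape, A0.shape)
```

Adding −∞ before the softmax, rather than zeroing entries afterwards, keeps every row of A a proper probability distribution over the visible prefix.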
Sinusoidal positional encoding for position i ∈ [0, L) and dimension pair (2k, 2k+1) with k ∈ [0, d/2):
$$
PE_{i,\,2k} = \sin\!\left( \frac{i}{10000^{2k/d}} \right), \qquad PE_{i,\,2k+1} = \cos\!\left( \frac{i}{10000^{2k/d}} \right)
$$

The encoding is added (not concatenated) to the input embedding before the first sub-layer. The cosine / sine pairs at different dimension-frequencies produce a unique signature per position; the encoding is closed-form and adds no learnable parameters.
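The closed form above takes only a few lines of NumPy. The function name `sinusoidal_pe` is a hypothetical choice for this sketch; the shapes (11 positions, 64 dimensions) follow the configuration table.

```python
import numpy as np

def sinusoidal_pe(L, d):
    i = np.arange(L)[:, None]              # positions, shape (L, 1)
    k = np.arange(d // 2)[None, :]         # pair index, shape (1, d/2)
    angle = i / 10000.0 ** (2 * k / d)     # broadcast to (L, d/2)
    pe = np.empty((L, d))
    pe[:, 0::2] = np.sin(angle)            # dims 2k
    pe[:, 1::2] = np.cos(angle)            # dims 2k+1
    return pe

pe = sinusoidal_pe(11, 64)

# Smallest distance between any two distinct position rows:
# a value well above zero means every position has its own signature.
dists = np.linalg.norm(pe[:, None, :] - pe[None, :, :], axis=-1)
min_dist = dists[~np.eye(11, dtype=bool)].min()
print(pe.shape, min_dist)
```

Position 0 is a useful sanity check: every sine entry is 0 and every cosine entry is 1, so the first row alternates 0, 1, 0, 1, and so on.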
The full block is implemented in approximately 100 lines of pure NumPy, no PyTorch / TensorFlow primitives. The implementation choices that matter:
The input is the literal string "hello world", tokenised to 11 integers over the 8-character vocabulary. The 11 input embeddings are 64-dimensional vectors; the 11 output vectors after a single forward pass through the transformer block are also 64-dimensional. The diagnostic claim: the three occurrences of the letter l at positions 2, 3, and 9 in the input string produce three distinct output vectors, despite sharing a single input embedding.
| Position | Char | Input embedding + positional encoding (first 3 dims) | Output (first 3 dims) |
|---|---|---|---|
| 2 | l (in "hello") | [0.731, −0.452, 1.103] | [0.587, 0.828, 2.060] |
| 3 | l (in "hello") | [−0.037, −1.026, 0.884] | [−0.662, −0.150, 1.563] |
| 9 | l (in "world") | [0.234, −0.947, 0.555] | [−0.200, −1.299, 0.636] |
The three token embeddings are identical (one row of the embedding matrix, accessed by the same integer token). The input column in the table already includes the sinusoidal positional encoding, which is position-dependent; subtracting that encoding from each row leaves the same vector, approximately [−0.178, −0.036, 0.106]. The attention output then mixes in contributions from every token at or before the current position. The result: position 2's l sees only "h" and "e" upstream; position 3's l sees "h", "e", and the first l; position 9's l sees the entire prefix "hello wor". Each gets a different weighted average of values as its output. This is context-dependent representation.
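A short consistency check on the table's input column can be sketched as follows, assuming the column was recorded after the positional encoding was added and that dimensions 0, 1, 2 correspond to sin (k = 0), cos (k = 0), sin (k = 1) with d = 64. The helper name `pe_first_dims` is hypothetical. If the three l's share one embedding row, subtracting the encoding from each table row should leave the same vector up to rounding.

```python
import numpy as np

def pe_first_dims(pos, d=64, n=3):
    # First n entries of the sinusoidal encoding at a given position.
    vals = []
    for j in range(n):
        k = j // 2
        angle = pos / 10000.0 ** (2 * k / d)
        vals.append(np.sin(angle) if j % 2 == 0 else np.cos(angle))
    return np.array(vals)

# Input column of the table (first 3 dims), keyed by 0-indexed position.
table_input = {
    2: np.array([0.731, -0.452, 1.103]),
    3: np.array([-0.037, -1.026, 0.884]),
    9: np.array([0.234, -0.947, 0.555]),
}

# Remove the position-dependent part; what remains should be shared.
residuals = {pos: vec - pe_first_dims(pos) for pos, vec in table_input.items()}
for pos, r in residuals.items():
    print(pos, np.round(r, 3))

spread = max(
    np.abs(residuals[a] - residuals[b]).max()
    for a in residuals for b in residuals
)
print("largest disagreement between positions:", spread)
```

The disagreement stays within the three-decimal rounding of the table, which is consistent with a single shared embedding row for the letter l.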
The attention heatmap, visualised per head, exhibits the causal-mask pattern: it is lower triangular, with row i having non-zero entries only at columns 0…i. This is the second-tier diagnostic; it confirms that the causal mask is wired correctly. With both diagnostics passing (the causal lower-triangular pattern and three different output vectors for the same letter), the transformer block has been verified at the architectural level.
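The causal-mask diagnostic can also be expressed as assertions rather than read off a heatmap. The sketch below is an assumption-laden illustration (single head, random input, hypothetical function names, weight scale 1/√d): it checks the lower-triangular structure directly and adds a perturbation test, in which changing the last token must leave every earlier output untouched.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(X, Wq, Wk, Wv):
    L = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    mask = np.triu(np.full((L, L), -np.inf), k=1)   # -inf where j > i
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]) + mask)
    return A @ V, A

rng = np.random.default_rng(0)                      # arbitrary seed
L, d, d_k = 11, 64, 16
X = rng.normal(size=(L, d))
Wq, Wk, Wv = (rng.normal(0, d ** -0.5, size=(d, d_k)) for _ in range(3))

out, A = causal_attention(X, Wq, Wk, Wv)

# Structural checks on the attention matrix.
upper_is_zero = bool(np.all(np.triu(A, k=1) == 0.0))
rows_sum_to_one = bool(np.allclose(A.sum(axis=1), 1.0))

# Behavioural check: perturb only the final token.
X_perturbed = X.copy()
X_perturbed[-1] += 1.0
out_p, _ = causal_attention(X_perturbed, Wq, Wk, Wv)
past_unchanged = bool(np.allclose(out[:-1], out_p[:-1]))
last_changed = not np.allclose(out[-1], out_p[-1])

print(upper_is_zero, rows_sum_to_one, past_unchanged, last_changed)
```

The perturbation test catches a class of wiring errors the heatmap can miss, such as a mask applied to the wrong axis in a batched implementation.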
The four attention heads, all on randomly-initialised weights, produce four different lower-triangular attention patterns. The lower-triangularity is the causal-mask signature — entries above the diagonal are exactly zero after the −∞ masking. The structure below the diagonal varies by head:
| Head | Pattern observed | What it would learn if trained |
|---|---|---|
| Head 0 | Strong self-attention (heavy diagonal), light off-diagonal | Identity-like — preserves the current token's representation |
| Head 1 | Bias toward position 0 (early-token attention) | "Beginning-of-sequence" head — useful for tasks that need to anchor on the prefix |
| Head 2 | Distributed attention across the available prefix | Context-averaging head — useful for tasks that need a smoothed prefix summary |
| Head 3 | Bias toward the most-recent token (i-1) | "Previous-token" head — useful for tasks that need local context |
The patterns at random initialisation carry no learned meaning: they reflect the random QKV-matrix structure, not any learned behaviour. The diagnostic value is structural: with random weights, the four heads produce four different attention patterns rather than four identical ones, confirming that the multi-head architecture is wired correctly (each head has independent W^Q, W^K, W^V matrices, not shared ones).
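The head-independence check can be sketched as a comparison between correct wiring and a deliberately broken variant in which every head reuses head 0's projections. All names, the seed, and the 1/√d weight scale are assumptions for illustration; only the attention matrices are computed, since W^V does not affect the pattern.

```python
import numpy as np
from itertools import combinations

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_matrix(X, Wq, Wk):
    L = X.shape[0]
    mask = np.triu(np.full((L, L), -np.inf), k=1)
    Q, K = X @ Wq, X @ Wk
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]) + mask)

def min_pairwise_gap(maps):
    # Smallest "largest entry-wise difference" over all pairs of heads.
    return min(np.abs(maps[a] - maps[b]).max()
               for a, b in combinations(range(len(maps)), 2))

rng = np.random.default_rng(0)                      # arbitrary seed
L, d, n_heads = 11, 64, 4
d_k = d // n_heads
X = rng.normal(size=(L, d))
Wq = rng.normal(0, d ** -0.5, size=(n_heads, d, d_k))   # independent per head
Wk = rng.normal(0, d ** -0.5, size=(n_heads, d, d_k))

independent = [attention_matrix(X, Wq[h], Wk[h]) for h in range(n_heads)]
shared = [attention_matrix(X, Wq[0], Wk[0]) for _ in range(n_heads)]  # buggy wiring

gap_independent = min_pairwise_gap(independent)
gap_shared = min_pairwise_gap(shared)
print(gap_independent, gap_shared)
```

A gap of exactly zero between any two heads is the signature of shared projection matrices; independent random projections give a clearly non-zero gap for every pair.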
The initial three-phase plan was: Phase 1 NumPy from scratch (done); Phase 2 PyTorch reimplementation using nn.Linear and nn.LayerNorm primitives, with a training loop; Phase 3 side-by-side comparison against Hugging Face's tinyGPT on the same toy text. Phase 2 and Phase 3 were not executed. The reason recorded at the time, carried forward as a feedback lesson: the production transformer use case (PGN seq2seq) kicked in two weeks later and the architecture-literacy load shifted to the production code. The toy-implementation parallel-track would have duplicated effort.
The generalisable lesson: educational follow-ups have a half-life. Once the production use kicks in, do the further learning on the production code, not on a parallel toy. The Topic-25 MNIST validation [3] applied this rule — the three-backbone comparison was done on actual MNIST training rather than further toy work.
For any architectural component the thesis line relies on as load-bearing (transformer in this paper, diffusion [4], latent-diffusion-with-VAE [5], flow-matching [3], Mamba state-space [6]), build a 100-line from-scratch implementation on a tiny problem before using the library version. The hours invested in the from-scratch path pay back at the year scale through faster paper reading, faster debugging, and cleaner architecture choices in downstream work. Every subsequent thesis-line topic that touches an architectural class for the first time follows this rule explicitly.
A character-level transformer block was implemented from first principles in pure NumPy. The "hello world" experiment exhibited context-dependent representation: three occurrences of the letter l produced three distinct output vectors at three positions. The implementation is pedagogical; the contribution is the documented from-scratch path and the architecture-literacy investment that pays back across the thesis line.