Character-Level Transformer From Scratch — White Paper

A Character-Level Transformer From First Principles: Implementation, Attention-Pattern Inspection, and the Context-Dependent-Representation Pedagogical Result

Aaditya Jain

ad_jain@icloud.com · orcid.org/0009-0005-5534-5641

Transformers · From-Scratch Foundations · Thesis-Line Foundation Study

Submitted: September 2025
Subject: cs.LG
Keywords: transformer, attention, char-level LM, from-scratch implementation, pedagogical, thesis-line foundation

Abstract

We document a from-first-principles implementation of a character-level transformer block (multi-head self-attention with causal mask, position embedding, feed-forward network, layer normalisation, and residual connections), written in pure NumPy on commodity Mac hardware. The motivation is pedagogical rather than competitive: every load-bearing transformer use in the subsequent thesis line (the PGN seq2seq head [1], the JiT ViT-backbone diffusion reproduction [2], the MNIST flow-matching backbone comparison [3]) is informed by having built one block by hand at this scale. The substantive empirical contribution is a single 11-character experiment on the input string "hello world": three occurrences of the letter l produce three distinct 64-dimensional output vectors at three distinct positions despite sharing a single input embedding. This is the operational signature of context-dependent representation: attention has worked. We document the architecture (8-character vocabulary, 64-dim embedding, 4 heads at 16-dim each, FFN 4× expansion, sinusoidal position embedding, lower-triangular causal mask), the per-position input-output table that exhibits the context-dependence result, the planned-but-skipped Phase-2 (PyTorch reimplementation) and Phase-3 (Hugging Face tinyGPT comparison) follow-ups that were superseded by the PGN seq2seq work two weeks later, and the generalisable lesson that, for any neural-network component the thesis line will rely on in a load-bearing role, a 100-line from-scratch implementation pays back its cost in architecture-literacy gains. The contribution is the documented from-scratch path and the context-dependence result on a string short enough to read by hand.

Keywords: char-level transformer, attention from scratch, pedagogical implementation, thesis-line foundation.

1. Introduction

The thesis line documented across [1–4] is shaped by transformer-family architectural choices at every load-bearing decision. The seq2seq DSL head in PGN [1], the ViT backbone in the JiT reproduction [2], the transformer baseline against which the Mamba state-space substitute is measured [3], and the SD-class latent diffusion study [5] all assume the reader understands attention at the mechanical level. The question this paper answers is: what is the cheapest path to that understanding?

The answer recorded here is: implement a single character-level transformer block from first principles, in pure NumPy, on a tiny vocabulary and a single short string, and inspect the per-position output. The implementation takes a few hours; the per-position inspection takes minutes; in the author's experience, the resulting architectural literacy makes later transformer papers markedly faster to read.

The contribution is therefore not a research finding in the conventional sense. It is the documented from-scratch path, the single empirical result (context-dependent representation visible on a string short enough to read by hand), and the generalisable lesson the path validates — for any architecture the thesis line uses load-bearing, build the 100-line from-scratch version first.

2. Architecture

The transformer block follows the standard pre-norm GPT-class layout, scaled down to the smallest configuration that still exercises every component meaningfully. The exact parameter values are chosen so that every tensor's shape fits in a single line and can be inspected by hand.

Table 1 — Mini-LLM block configuration.
Component	Setting	Comment
Vocabulary	8 characters (h, e, l, o, ·, w, r, d)	Tiny enough to label every column of the attention heatmap by character
Sequence length	11 tokens	"hello world" — short enough to print every per-position vector
Embedding dim d	64	Small but non-trivial; standard educational toy-GPT scale
Heads h	4 at 16 dim each	Per-head dim d_k = d/h = 16; enough heads to see different attention patterns
QKV projections	3 × (64 → 64) linear, no bias	3 × 64 × 64 = 12,288 params across all four heads (3,072 per head); the 64 × 64 output projection W^O adds 4,096, for ~16 K in the attention sub-layer
FFN	(64 → 256 → 64), ReLU	4 × expansion ratio; ~33 K params
Activation	ReLU	Simpler than GELU; adequate for the exercise
Positional encoding	Sinusoidal (closed-form)	Period varying across dimensions; no learnable params
Layer norm	Pre-norm (before each sub-layer)	The GPT-2-onward standard; numerically more stable than post-norm
Causal mask	Lower triangular; −∞ added to the attention logits above the diagonal	Autoregressive: token i sees only tokens 0…i
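The parameter counts implied by Table 1 can be checked with a few lines of arithmetic. The sketch below assumes bias-free linear layers throughout; the table states this only for the QKV projections, so the bias-free FFN and the variable names are assumptions made for illustration.

```python
# Parameter counts for the Table 1 configuration.
# Assumption: every linear layer is bias-free (the table says so only for QKV).
d_model = 64                 # embedding dim d
n_heads = 4                  # number of heads
d_k = d_model // n_heads     # per-head dim, 16
d_ff = 4 * d_model           # FFN hidden width, 256

qkv_per_head = 3 * d_model * d_k        # W^Q, W^K, W^V for a single head
qkv_total = n_heads * qkv_per_head      # identical to 3 * d_model * d_model
w_o = d_model * d_model                 # output projection W^O
ffn = d_model * d_ff + d_ff * d_model   # the two FFN weight matrices

print(qkv_per_head, qkv_total, w_o, ffn)   # 3072 12288 4096 32768
```

With biases on both FFN layers the FFN count would rise by 256 + 64 to 33,088, which still rounds to about 33 K.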

The forward path: character → integer token → embedding (64-dim) + sinusoidal positional encoding → pre-norm LayerNorm → 4-head causal self-attention → residual → pre-norm LayerNorm → FFN (64 → 256 → 64) → residual → output (sequence of 64-dim vectors). The output would be fed to a vocab-projection linear layer for next-character prediction, but no training is run in this paper — all results are on randomly-initialised weights, which is the right setting to inspect the mechanism rather than any learned behaviour.
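The first two steps of the forward path (character to integer token, token to embedding row) can be sketched as follows. The vocabulary order follows Table 1 (order of first appearance); the seed, the Gaussian embedding table, and the names `stoi`, `E`, and `X_embed` are illustrative assumptions rather than the original implementation.

```python
import numpy as np

text = "hello world"

# Vocabulary in order of first appearance: h, e, l, o, space, w, r, d
vocab = list(dict.fromkeys(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
tokens = np.array([stoi[ch] for ch in text])

rng = np.random.default_rng(0)            # arbitrary seed (assumption)
E = rng.normal(size=(len(vocab), 64))     # random embedding table, one row per character
X_embed = E[tokens]                       # (11, 64): one embedding row per position

print(tokens.tolist())   # [0, 1, 2, 2, 3, 4, 5, 3, 6, 2, 7]
```

Because the lookup is a plain row index, the three l tokens receive byte-identical rows of `E` at this stage; any difference between them has to come from later steps.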

2.1 Attention math, explicitly

For each head h independently, the attention operation on a sequence of L input vectors X ∈ ℝ^{L × d}:

Q = X W^Q_h, K = X W^K_h, V = X W^V_h   (each ∈ ℝ^{L × d_k})
A = softmax( (Q K^T) / √d_k + M )   (the attention matrix, ∈ ℝ^{L × L})
H = A V   (the head output, ∈ ℝ^{L × d_k})

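where W^Q_h, W^K_h, W^V_h ∈ ℝ^{d × d_k} are the per-head projection matrices and M is the causal mask matrix with M_{ij} = 0 if j ≤ i and M_{ij} = −∞ otherwise. The softmax is applied row-wise, so each row of A is a probability distribution over the positions that row is allowed to see. The four head outputs are concatenated along the feature axis and projected through a final W^O ∈ ℝ^{d × d} to produce the multi-head-attention block output. In the pre-norm layout, X is the layer-normalised residual stream; the full block adds the attention output back onto the un-normalised residual stream, then applies layer-norm + FFN + residual.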
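The equations above translate almost line for line into NumPy. The following is a minimal sketch, not the original code: the function names, the stacked `(n_heads, d, d_k)` weight layout, the seed, and the 1/√d weight scale are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the per-row max subtracted for numerical stability
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (L, d). W_q, W_k, W_v: (n_heads, d, d_k). W_o: (d, d)."""
    L = X.shape[0]
    n_heads, _, d_k = W_q.shape
    # M[i, j] = 0 where j <= i, -inf where j > i
    M = np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)
    head_outputs, attention = [], []
    for h in range(n_heads):
        Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]    # each (L, d_k)
        A = softmax(Q @ K.T / np.sqrt(d_k) + M)         # (L, L)
        head_outputs.append(A @ V)                      # (L, d_k)
        attention.append(A)
    out = np.concatenate(head_outputs, axis=-1) @ W_o   # (L, d)
    return out, np.stack(attention)                     # attention: (n_heads, L, L)

# Usage on random inputs with the Table 1 shapes
rng = np.random.default_rng(0)
L, d, n_heads = 11, 64, 4
d_k = d // n_heads
X = rng.normal(size=(L, d))
W_q, W_k, W_v = (rng.normal(size=(n_heads, d, d_k)) / np.sqrt(d) for _ in range(3))
W_o = rng.normal(size=(d, d)) / np.sqrt(d)
out, A = causal_multi_head_attention(X, W_q, W_k, W_v, W_o)
```

Every row of the mask keeps its diagonal entry finite, so the per-row max is always finite and the masked entries become exact zeros after the exponential.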

2.2 Positional encoding

Sinusoidal positional encoding for position i ∈ [0, L) and dimension-pair index k ∈ [0, d/2):

PE_{i, 2k} = sin( i / 10000^{2k/d} )
PE_{i, 2k+1} = cos( i / 10000^{2k/d} )

The encoding is added (not concatenated) to the input embedding before the first sub-layer. The cosine / sine pairs at different dimension-frequencies produce a unique signature per position; the encoding is closed-form and adds no learnable parameters.
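A minimal NumPy sketch of this encoding follows, with even feature indices taking the sine and odd indices the cosine of the same angle. The function name `sinusoidal_pe` is hypothetical.

```python
import numpy as np

def sinusoidal_pe(L, d):
    """Return the (L, d) sinusoidal positional-encoding matrix (d must be even)."""
    pos = np.arange(L)[:, None]            # positions i, shape (L, 1)
    k = np.arange(d // 2)[None, :]         # pair index k, shape (1, d/2)
    angle = pos / 10000 ** (2 * k / d)     # i / 10000^(2k/d), shape (L, d/2)
    pe = np.zeros((L, d))
    pe[:, 0::2] = np.sin(angle)            # even dims: sine
    pe[:, 1::2] = np.cos(angle)            # odd dims: cosine
    return pe

pe = sinusoidal_pe(11, 64)
```

Position 0 comes out as the alternating pattern 0, 1, 0, 1, … because sin(0) = 0 and cos(0) = 1; every later position gets a different row, which is what makes the three l tokens distinguishable once the encoding is added.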

3. Implementation Notes (Pure NumPy)

The full block is implemented in approximately 100 lines of pure NumPy, no PyTorch / TensorFlow primitives. The implementation choices that matter:

QKV projections as three separate matrices: not as one fused 3d × d matrix. The fused form is faster at production scale; the separate form is clearer when reading by hand, and the cost difference at L = 11 is irrelevant.
Numerical stability of softmax: subtract the per-row max before exponentiating, otherwise large logits overflow. The subtraction leaves the softmax output unchanged because the shift cancels between numerator and denominator.
Causal mask as additive −∞: adding −∞ to the future-position logits before softmax sends those entries to exactly zero after exponentiation while the remaining entries still sum to 1. Multiplying by a 0/1 mask after softmax would instead leave each row summing to less than 1 unless it is renormalised.
√d_k scaling: divide QK^T by √d_k = 4 before softmax. Skipping this scaling produces softmax saturation as d_k grows; at d_k = 16 the effect is small, but the scaling is standard practice and should be in the implementation from day one.
No dropout, no tuned weight-initialisation scheme, no training loop: plain random Gaussian weights, a single forward pass, then inspection.
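The softmax and masking notes above can be demonstrated on toy inputs. This is an illustrative sketch with made-up logits and an arbitrary seed; it is not taken from the original implementation.

```python
import numpy as np

def stable_softmax(z):
    # Subtracting the row max leaves the result unchanged but avoids overflow in exp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Large logits: np.exp(1000.0) overflows to inf, the shifted version does not
big = np.array([[1000.0, 1001.0, 1002.0]])
p = stable_softmax(big)

# Additive -inf mask before softmax versus 0/1 mask after softmax
L = 4
rng = np.random.default_rng(0)                     # arbitrary seed
scores = rng.normal(size=(L, L))
allowed = np.tril(np.ones((L, L), dtype=bool))     # True where j <= i

A_add = stable_softmax(np.where(allowed, scores, -np.inf))   # rows sum to 1
A_mul = stable_softmax(scores) * allowed                     # rows sum to less than 1

print(A_add.sum(axis=-1))
print(A_mul.sum(axis=-1))
```

Only the last row of `A_mul` still sums to 1, because it is the only row with nothing masked out; every other row has lost the probability mass that softmax assigned to future positions.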

4. The "hello world" Experiment

The input is the literal string "hello world", tokenised to 11 integers over the 8-character vocabulary. The 11 input embeddings are 64-dimensional vectors; the 11 output vectors after a single forward pass through the transformer block are also 64-dimensional. The diagnostic claim: the three occurrences of the letter l at (0-indexed) positions 2, 3, and 9 in the input string produce three distinct output vectors, despite sharing a single input embedding.

Table 2 — Per-position block input and output (first three dimensions of each 64-dim vector).
Position	Char	Embedding + positional encoding (first 3 dims)	Output (first 3 dims)
2	l (in "hello")	[0.731, −0.452, 1.103]	[0.587, 0.828, 2.060]
3	l (in "hello")	[−0.037, −1.026, 0.884]	[−0.662, −0.150, 1.563]
9	l (in "world")	[0.234, −0.947, 0.555]	[−0.200, −1.299, 0.636]

The three input embeddings are identical (one row of the embedding matrix, accessed by the same integer token), but the sinusoidal positional encoding added at the next step is position-dependent, which is why the input values listed in Table 2, taken after the encoding is added, already differ across the three rows. The attention output then mixes in contributions from every token at or before the current position (the causal mask allows j ≤ i, so each token also attends to itself). The result: position 2's l sees only "h" and "e" upstream; position 3's l sees "h", "e", and the first l; position 9's l sees the entire prefix "hello wor". Each gets a different weighted average of value vectors as its output. This is context-dependent representation.
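The shared-embedding claim can be checked against the numbers in Table 2, assuming its input column holds the embedding with the sinusoidal encoding already added (an inference from the values themselves). Subtracting the closed-form encoding from each row should then leave the same vector three times, up to the table's three-decimal rounding.

```python
import numpy as np

d = 64
positions = [2, 3, 9]

# Input-column entries of Table 2 (first three dims), one row per position
table_in = np.array([
    [0.731, -0.452, 1.103],
    [-0.037, -1.026, 0.884],
    [0.234, -0.947, 0.555],
])

def pe_first3(i):
    # dims 0 and 1 use pair index k = 0; dim 2 uses k = 1
    return np.array([np.sin(i), np.cos(i), np.sin(i / 10000 ** (2 / d))])

# Remove the positional encoding from each row
residual = np.array([row - pe_first3(i) for row, i in zip(table_in, positions)])
spread = residual.max(axis=0) - residual.min(axis=0)

print(residual.round(3))
print(spread)
```

The three residual rows agree to within rounding error, at roughly [−0.178, −0.036, 0.106], which is consistent with a single embedding row for l plus a position-dependent encoding.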

The lower-triangular attention heatmap, visualised per-head, exhibits the causal-mask pattern: row i has non-zero entries only at columns 0…i. This is the second-tier diagnostic: it confirms that the causal mask is wired correctly. With both diagnostics passing (causal lower-triangular pattern + three-different-output-vectors for the same letter), the transformer block has been verified at the architectural level.
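Both diagnostics can be reproduced end to end with a short script. The sketch below is a reconstruction under stated assumptions, not the original code: the seed, the 1/√fan-in weight scale, bias-free layers, and a layer norm without learnable gain or offset are all illustrative choices, so the numbers will not match Table 2.

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed (assumption)
text = "hello world"
vocab = list(dict.fromkeys(text))
tokens = np.array([vocab.index(c) for c in text])
L, d, n_heads, d_ff = len(text), 64, 4, 256
d_k = d // n_heads

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector; no learnable gain/offset (assumption)
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Random Gaussian parameters, scaled by 1/sqrt(fan_in) (assumption)
E = rng.normal(size=(len(vocab), d))
W_q, W_k, W_v = (rng.normal(size=(n_heads, d, d_k)) / np.sqrt(d) for _ in range(3))
W_o = rng.normal(size=(d, d)) / np.sqrt(d)
W_1 = rng.normal(size=(d, d_ff)) / np.sqrt(d)
W_2 = rng.normal(size=(d_ff, d)) / np.sqrt(d_ff)

# Sinusoidal positional encoding
pos, k = np.arange(L)[:, None], np.arange(d // 2)[None, :]
angle = pos / 10000 ** (2 * k / d)
pe = np.zeros((L, d))
pe[:, 0::2] = np.sin(angle)
pe[:, 1::2] = np.cos(angle)

embed = E[tokens]
x = embed + pe
mask = np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)

# Sub-layer 1: pre-norm, 4-head causal self-attention, residual
xn = layer_norm(x)
heads, attn = [], []
for h in range(n_heads):
    Q, K, V = xn @ W_q[h], xn @ W_k[h], xn @ W_v[h]
    A = softmax(Q @ K.T / np.sqrt(d_k) + mask)
    attn.append(A)
    heads.append(A @ V)
x = x + np.concatenate(heads, axis=-1) @ W_o

# Sub-layer 2: pre-norm, FFN with ReLU, residual
out = x + np.maximum(layer_norm(x) @ W_1, 0.0) @ W_2
attn = np.stack(attn)                             # (4, 11, 11)

# Diagnostic 1: same letter, same embedding row, different outputs
l_pos = [i for i, c in enumerate(text) if c == "l"]
for i in l_pos:
    print(i, out[i, :3].round(3))

# Diagnostic 2: nothing above the diagonal in any head
print(np.abs(np.triu(attn, k=1)).max())
```

A useful follow-up experiment is to zero out `pe` and rerun: the outputs for the three l tokens should still differ, since each attends over a different prefix, which separates the contribution of attention from that of the positional encoding.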

5. Per-Head Attention-Pattern Inspection

The four attention heads, all on randomly-initialised weights, produce four different lower-triangular attention patterns. The lower-triangularity is the causal-mask signature — entries above the diagonal are exactly zero after the −∞ masking. The structure below the diagonal varies by head:

Table 3 — Per-head attention-pattern observations on the "hello world" input (randomly-initialised weights).
Head	Pattern observed	Role such a pattern would play in a trained model
Head 0	Strong self-attention (heavy diagonal), light off-diagonal	Identity-like — preserves the current token's representation
Head 1	Bias toward position 0 (early-token attention)	"Beginning-of-sequence" head — useful for tasks that need to anchor on the prefix
Head 2	Distributed attention across the available prefix	Context-averaging head — useful for tasks that need a smoothed prefix summary
Head 3	Bias toward the most-recent token (i-1)	"Previous-token" head — useful for tasks that need local context

The patterns at random initialisation carry no semantic interpretation: they reflect the random structure of the projection matrices, not any learned behaviour. The diagnostic value is structural: with random weights, the four heads produce four different attention patterns rather than four identical ones, confirming that the multi-head architecture is wired correctly (each head has independent W^Q, W^K, W^V matrices, not shared ones).
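One way to make the per-head comparison less dependent on eyeballing heatmaps is to summarise each head with a few scalar statistics. The sketch below uses a random stand-in for the normalised block input; the seed, the weight scale, and the three summary statistics are illustrative assumptions and will not reproduce the specific patterns in Table 3.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed (assumption)
L, d, n_heads = 11, 64, 4
d_k = d // n_heads
X = rng.normal(size=(L, d))                     # stand-in for the normalised block input
W_q = rng.normal(size=(n_heads, d, d_k)) / np.sqrt(d)
W_k = rng.normal(size=(n_heads, d, d_k)) / np.sqrt(d)
allowed = np.tril(np.ones((L, L), dtype=bool))  # True where j <= i

def head_attention(h):
    # Attention matrix of head h; only W^Q and W^K matter for the pattern
    scores = (X @ W_q[h]) @ (X @ W_k[h]).T / np.sqrt(d_k)
    scores = np.where(allowed, scores, -np.inf)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

A = np.stack([head_attention(h) for h in range(n_heads)])   # (4, 11, 11)

for h in range(n_heads):
    diag = np.mean(np.diagonal(A[h]))               # mean weight on the current token
    first = np.mean(A[h][1:, 0])                    # mean weight on position 0 (rows 1..L-1)
    prev = np.mean(np.diagonal(A[h], offset=-1))    # mean weight on token i-1
    print(f"head {h}: diag={diag:.3f} first={first:.3f} prev={prev:.3f}")

# Wiring check: no two heads share an attention pattern
distinct = all(not np.allclose(A[a], A[b])
               for a in range(n_heads) for b in range(a + 1, n_heads))
print(distinct)
```

If two heads were accidentally sharing projection matrices, their attention matrices would be identical and the final check would fail.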

6. Phase-2 and Phase-3 — Skipped, and Why

The initial three-phase plan was: Phase 1 NumPy from scratch (done); Phase 2 PyTorch reimplementation using nn.Linear and nn.LayerNorm primitives, with a training loop; Phase 3 side-by-side comparison against Hugging Face's tinyGPT on the same toy text. Phase 2 and Phase 3 were not executed. The reason recorded at the time, carried forward as a feedback lesson: the production transformer use case (PGN seq2seq) kicked in two weeks later and the architecture-literacy load shifted to the production code. The toy-implementation parallel-track would have duplicated effort.

The generalisable lesson: educational follow-ups have a half-life. Once the production use kicks in, do the further learning on the production code, not on a parallel toy. The Topic-25 MNIST validation [3] applied this rule — the three-backbone comparison was done on actual MNIST training rather than further toy work.

7. The Generalisable Lesson

For any architectural component the thesis line relies on in a load-bearing role, whether transformer (this paper), diffusion [4], latent-diffusion-with-VAE [5], flow-matching [3], or Mamba state-space [6], build a 100-line from-scratch implementation on a tiny problem before using the library version. The hours invested in the from-scratch path pay back over the following year through faster paper reading, faster debugging, and cleaner architecture choices in downstream work. Every subsequent thesis-line topic that touches an architectural class for the first time follows this rule explicitly.

8. Conclusion

A character-level transformer block was implemented from first principles in pure NumPy. The "hello world" experiment exhibited context-dependent representation: three occurrences of the letter l produced three distinct output vectors at three positions. The implementation is pedagogical; the contribution is the documented from-scratch path and the architecture-literacy investment that pays back across the thesis line.

References

[1] Jain, A. "PGN: A Transformer-Based Procedural Generator Network." Thesis research, Sep 2025. /whitepaper/pgn

[2] Jain, A. "JiT Diffusion on Consumer GPUs." Thesis research, Nov 2025. /whitepaper/jit-diffusion

[3] Jain, A. "MNIST Flow-Matching Backbone Validation." Thesis research, Nov 2025. /whitepaper/mnist-flow-validation

[4] Jain, A. "Red-Square DDPM From Scratch." Thesis research, Feb 2025. /whitepaper/diffusion-red-square

[5] Jain, A. "Latent Diffusion Model Study." Thesis research, May 2025. /whitepaper/ldm-study

[6] Gu, A., Dao, T. "Mamba: Linear-Time Sequence Modelling with Selective State Spaces." 2023.

[7] Vaswani, A. et al. "Attention Is All You Need." NeurIPS, 2017.

[8] Radford, A. et al. "Improving Language Understanding by Generative Pre-Training." OpenAI, 2018. GPT-1 architecture reference.