Training a German GPT from scratch
I wanted to know what it actually takes to build a language model end to end: not fine-tuning someone else's, but starting from raw text and ending with something you can talk to. So I trained a small GPT from scratch on the German classics and turned it into a chat model that answers as Goethe, Kant, or Schiller.
No pretrained weights. The corpus, the tokenizer, and the model are all built from public-domain text.
A tokenizer that speaks German
The first lever most people skip is the tokenizer. GPT-2's tokenizer is tuned for English, so German words get shredded into many tiny pieces — which wastes a small model's capacity. I trained an 8k-vocabulary byte-level BPE tokenizer on the corpus itself:
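```python
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

SPECIAL_TOKENS = ["<|endoftext|>"]  # placeholder: the actual list is not shown in this post

tokenizer = Tokenizer(BPE())
# Byte-level pre-tokenization and decoding, so umlauts and ß round-trip cleanly
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = BpeTrainer(vocab_size=8192, special_tokens=SPECIAL_TOKENS,
                     initial_alphabet=pre_tokenizers.ByteLevel.alphabet())
tokenizer.train(["input.txt"], trainer)
```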
A German-native vocabulary keeps words whole, averaging about 4.2 characters per token, and shrinks the embedding table from GPT-2's 50,257 rows to 8,192, freeing capacity for the transformer itself.
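Both claims are easy to check with a few lines. This is a minimal sketch, not the project's code: `chars_per_token` takes any encode function, `toy_encode` is a stand-in used only so the snippet runs on its own, and `D_MODEL = 512` is an assumed embedding width, since the post does not state the real one.

```python
from typing import Callable, Sequence


def chars_per_token(text: str, encode: Callable[[str], Sequence[int]]) -> float:
    """Average number of characters covered by one token."""
    n_tokens = len(encode(text))
    if n_tokens == 0:
        raise ValueError("encode() returned no tokens")
    return len(text) / n_tokens


def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of weights in a vocab_size x d_model embedding table."""
    return vocab_size * d_model


def toy_encode(text: str) -> list[int]:
    """Stand-in encoder for demonstration only: one token per whitespace-separated word."""
    return list(range(len(text.split())))


# Assumed width, NOT taken from the post; substitute the model's real value.
D_MODEL = 512

# Weights saved by moving from GPT-2's 50,257-row table to an 8,192-row one.
saved = embedding_params(50_257, D_MODEL) - embedding_params(8_192, D_MODEL)

# Characters per token on a sample sentence, using the stand-in encoder.
ratio = chars_per_token("Was ist die Pflicht des Menschen?", toy_encode)
```

With a trained `tokenizers.Tokenizer`, passing `lambda s: tokenizer.encode(s).ids` as `encode` measures the real ratio on held-out text.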
Two-stage training
I trained in two stages: first pretrain on the full corpus to learn German, then fine-tune on dialogue mined from the plays so it learns to take turns. The fine-tune mixes in 20% prose batches so the model doesn't forget how to write.
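The 20% prose mix can be sketched as a sampler that picks the source of each fine-tuning batch at random. This is an illustration under assumptions, not the training loop used here: the name `mixed_batches`, the string batches, and the per-batch coin flip are all hypothetical, and a fixed schedule (every fifth batch is prose) would serve the same purpose.

```python
import random
from itertools import islice
from typing import Iterator, Sequence, TypeVar

Batch = TypeVar("Batch")


def mixed_batches(
    chat_batches: Sequence[Batch],
    prose_batches: Sequence[Batch],
    prose_fraction: float = 0.2,
    seed: int = 0,
) -> Iterator[tuple[str, Batch]]:
    """Yield (source, batch) pairs forever; prose is drawn with probability prose_fraction."""
    rng = random.Random(seed)
    while True:
        if rng.random() < prose_fraction:
            yield "prose", rng.choice(prose_batches)
        else:
            yield "chat", rng.choice(chat_batches)


# Demonstration with placeholder batches; real batches would be token tensors.
stream = mixed_batches(["chat-0", "chat-1"], ["prose-0"])
sources = [source for source, _ in islice(stream, 10_000)]
prose_share = sources.count("prose") / len(sources)
```

Sampling per batch keeps the prose signal spread evenly across the fine-tune instead of concentrating it in one phase, which is the point of mixing it in at all.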
The numbers, on an Apple Silicon laptop (MPS):
- Pretrain: 40k steps, 42M parameters, validation loss 4.21.
- Fine-tune: 3k steps, chat-validation loss 3.32.
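To make those losses easier to read, they can be converted to perplexity and to bits per character. The sketch below assumes the reported values are mean cross-entropy in nats per token, which is what PyTorch's cross-entropy returns by default. The bits-per-character figure further assumes that the 4.2 characters per token quoted earlier also holds on the validation text.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean cross-entropy loss given in nats per token."""
    return math.exp(loss_nats)


def bits_per_char(loss_nats: float, chars_per_token: float) -> float:
    """Convert nats per token to bits per character, a tokenizer-independent measure."""
    bits_per_token = loss_nats / math.log(2)
    return bits_per_token / chars_per_token


pretrain_ppl = perplexity(4.21)           # roughly 67
chat_ppl = perplexity(3.32)               # roughly 28
pretrain_bpc = bits_per_char(4.21, 4.2)   # assumes 4.2 chars/token on validation text
```

The two perplexities come from different validation sets (prose versus chat), so the drop between them is not a like-for-like improvement. Bits per character is the safer number to compare against models that use a different tokenizer.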
What 42M parameters can and can't do
Here's the honest part. The model writes fluent, period-flavoured German and picks up each author's register — but it does not understand your questions. It's style transfer, not a knowledgeable assistant. Ask Kant about duty and you get something that sounds like Kant mid-thought, not an answer:
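Du: Was ist die Pflicht des Menschen? (You: What is the duty of man?)
Kant: Wenn wir uns nun selbst in der Welt verachtend verhalten müssen: was können wir tun? (Kant: If we ourselves must now behave contemptuously in the world: what can we do?)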
At this scale the model stores grammar and style, not facts. That's the ceiling, and it's worth saying out loud rather than cherry-picking a good sample.
The code and the model are on GitHub and Hugging Face.