Mining dialogue from 18th-century plays

A chat model needs conversations to learn from. My corpus was German classics — mostly prose and philosophy, which is monologue, not dialogue. The exception is the plays. A drama is already a conversation: characters take turns, one line answering the last. If I could parse speaker turns reliably, I'd have thousands of natural question-and-answer pairs for free.

The catch: Projekt Gutenberg's plays come in at least four different layout conventions, and naive parsing pulls in chapter headings, stage directions, and cast lists as if they were dialogue.

Detecting a real speaker line

The core heuristic: a speaker line is a short line, at column zero, that's a name — sometimes followed by a period or colon, sometimes bare with the speech indented on the next line (the Nathan der Weise style).

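import re

# NAME_CORE (not shown here) is the regex fragment that matches a character name.
NAME_PUNCT_RE = re.compile(rf"^({NAME_CORE})\s*[.:]\s*$")  # "NATHAN." or "NATHAN:"
NAME_BARE_RE = re.compile(rf"^({NAME_CORE})\s*$")          # "NATHAN" alone on the line

def speaker_name(lines, i):
    """Return the speaker name if lines[i] is a speaker cue, else None."""
    ln = lines[i]
    if len(ln) > 32:  # speaker cues are short
        return None
    m = NAME_PUNCT_RE.match(ln)
    if m:
        return m.group(1)
    # bare name: the following non-empty line must start with . or : or be indented
    ...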
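
For readers who want something runnable, here is a minimal sketch of how the whole check, including the elided bare-name branch, could look. The NAME_CORE pattern below is a stand-in I made up for illustration (one to three capitalized words), not the one from the actual parser, and the capital-letter requirement on the indented line anticipates a fix described further down.

```python
import re

# Hypothetical stand-in for NAME_CORE: one to three capitalized words.
NAME_CORE = r"[A-ZÄÖÜ]\w+(?: [A-ZÄÖÜ]\w+){0,2}"
NAME_PUNCT_RE = re.compile(rf"^({NAME_CORE})\s*[.:]\s*$")
NAME_BARE_RE = re.compile(rf"^({NAME_CORE})\s*$")


def speaker_name(lines, i, max_len=32):
    """Return the speaker name if lines[i] looks like a cue, else None."""
    ln = lines[i]
    if len(ln) > max_len:
        return None
    m = NAME_PUNCT_RE.match(ln)
    if m:
        return m.group(1)
    m = NAME_BARE_RE.match(ln)
    if not m:
        return None
    # Bare name: inspect the next non-empty line only.
    for nxt in lines[i + 1:]:
        if not nxt.strip():
            continue
        first = nxt.lstrip()[0]
        if first in ".:":
            return m.group(1)
        indented = nxt[0] in " \t"
        # Assumed rule: an indented speech must open with a capital letter.
        if indented and first.isupper():
            return m.group(1)
        return None
    return None


sample = [
    "Nathan.",
    "Nathan",
    "    Ja, Daja; Gott sei Dank!",
    "Gustav Adolph",
    "    hatte indessen",
]
print([speaker_name(sample, i) for i in range(len(sample))])
```

The lookahead stops at the first non-empty line on purpose: a name followed by a lowercase continuation is rejected rather than given a second chance further down.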

Then I validate per play: only keep names that appear at least five times, and only treat a work as a drama if it yields 50+ turns. That alone removes most false positives.
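
The per-play validation can be sketched in a few lines. The thresholds are the ones from the text; the function name, the (speaker, speech) tuple shape, and the choice to count turns after dropping rare names are assumptions for illustration.

```python
from collections import Counter

MIN_NAME_COUNT = 5  # a name must appear at least this often in one play
MIN_TURNS = 50      # a work needs this many turns to count as a drama


def validate_play(turns):
    """Filter the (speaker, speech) tuples parsed from one work.

    Returns the surviving turns, or an empty list if the work does not
    yield enough turns to be treated as a drama.
    """
    counts = Counter(speaker for speaker, _ in turns)
    kept = [(s, t) for s, t in turns if counts[s] >= MIN_NAME_COUNT]
    # Assumption: the 50-turn check is applied after rare names are dropped.
    return kept if len(kept) >= MIN_TURNS else []


parsed = [("Nathan", "a")] * 30 + [("Daja", "b")] * 25 + [("Kapitel", "c")] * 2
print(len(validate_play(parsed)))
```

A stray heading such as "Kapitel" that matches the name pattern twice is dropped by the frequency check, while a work with only a handful of matches is rejected outright.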

Where it broke

Two failure modes were stubborn:

Histories that aren't plays. Schiller's Geschichte des 30jährigen Kriegs is full of capitalized names at the start of lines ("Gustav Adolph…") that look exactly like speaker cues. A turn-density filter — speaker lines divided by non-blank lines — separates real dialogue from prose with names in it.
Mid-sentence line breaks. In histories, an emphasized name on its own line followed by lowercase continuation reads as a speaker. Requiring the following indented line to start with a capital killed those.
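
A turn-density filter along these lines might look like the sketch below. The cutoff value is hypothetical (the actual number isn't given here), and `is_speaker_line` stands for whatever cue detector is in use.

```python
def turn_density(lines, is_speaker_line):
    """Speaker lines divided by non-blank lines, or 0.0 for empty input."""
    non_blank = [i for i, ln in enumerate(lines) if ln.strip()]
    if not non_blank:
        return 0.0
    cues = sum(1 for i in non_blank if is_speaker_line(lines, i))
    return cues / len(non_blank)


def looks_like_drama(lines, is_speaker_line, min_density=0.2):
    # min_density is a hypothetical cutoff chosen for this example only.
    return turn_density(lines, is_speaker_line) >= min_density


# Toy detector for the demo: any line ending in a colon counts as a cue.
def toy_cue(lines, i):
    return lines[i].rstrip().endswith(":")


play = ["Nathan:", "  Ja.", "", "Daja:", "  Nein."]
history = ["Der Krieg begann."] * 9 + ["Gustav Adolph:"]
print(turn_density(play, toy_cue), turn_density(history, toy_cue))
```

In a play, a large share of non-blank lines are cues; in a history, names at line starts are sparse relative to the surrounding prose, so the ratio stays low even when the absolute count is high.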

After tuning, the parser kept 57 plays and ~33,000 dialogue turns, with zero obvious false positives left. Consecutive turns became (user, bot) pairs attributed to the play's author.
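
The pairing step can be sketched as a sliding window over the turns of one play. The record layout and the decision to let windows overlap (each turn is a reply once and a prompt once) are assumptions, not necessarily what the real pipeline does.

```python
def turns_to_pairs(turns, author):
    """Turn a list of (speaker, speech) tuples into (user, bot) records."""
    pairs = []
    for (_, prev_text), (_, next_text) in zip(turns, turns[1:]):
        pairs.append({"user": prev_text, "bot": next_text, "author": author})
    return pairs


scene = [
    ("Daja", "Er ist es! Nathan!"),
    ("Nathan", "Ja, Daja; Gott sei Dank!"),
    ("Daja", "Wie lange schon erwartet!"),
]
pairs = turns_to_pairs(scene, "Lessing")
print(len(pairs))
```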

The honest result

The dialogue data made the model take turns, but most of it is still drama — which responds to the previous line of a scene, not to a question. So the model free-associates in-register more than it answers. The fix isn't more scale; it's more genuinely conversational data. But as a way to bootstrap a chat format out of text that was never meant to be one, drama-mining worked better than I expected.

Appendix

Links to the GitHub repo and Hugging Face