Prompt System
The prompt system is one of the clearest demo surfaces in the repository. It shows how the solution turned heterogeneous scholarly sources into structured training data for extraction-and-preparation-pipeline.
The prompts are stored in prompts/ and are selected by extraction and splitting scripts according to document layout, source language, and data type.
Prompt Families
- AKT extraction prompts handle side-by-side, top-bottom, OCR, Turkish, German, and Kouwenberg/Larsen-style alignment cases.
- Journal prompts handle Turkish Dergipark papers and English Michel-style academic chapters.
- Hecker prompts extract transliteration-only tablet entries from born-digital PDFs.
- CAD prompts extract Old Assyrian attestations from dictionary-style OCR scans into dual raw/MT-normalized tracks.
- Repair and sentence-splitting prompts turn document-level expert or synthetic outputs into sentence-aligned examples.
Design Pattern
Most extraction prompts combine:
- a domain role, usually Assyriologist plus data engineer;
- layout-specific instructions;
- cleaning rules;
- atomic chunking rules for witnesses, seals, goods, and itemized lists;
- quality or confidence fields;
- strict JSON output delimiters.
This is worth showing in a demo because it makes the data pipeline inspectable. The model-training code in training-stack is conventional enough; the prompt system explains much of how the non-standard data was converted into usable supervision.
Public-Wiki Handling
Do not copy full prompts into the wiki unless there is a specific reason. Summaries are preferred because the prompt files are already in the repo and can be read directly.
Related pages: synthetic-data-generation, normalization, missing-283-transliterations.