Synthetic Data Generation
The sdg/ directory contains synthetic-data generation workflows used to teach Old Assyrian fundamentals and broaden training coverage. It complements source extraction from prompt-system and feeds the final data assembly in extraction-and-preparation-pipeline.
Workflows
grammar_transform.py: generates grammar transformations from seed examples and grammar/context resources.generate_cad_drills.py: generates deliberate-practice examples from CAD/eSAD senses, examples, and a generation plan.fill_engine.py: produces deterministic slot-filled examples from JSON templates without requiring an API call.
Template System
Template files under sdg/templates/ cover:
- debts and loans;
- legal and seal formulas;
- letter openings and correspondence patterns;
- accounting, memoranda, and trade examples.
The template engine draws from slot pools in template_pools.py, including names, commodities, amounts, places, months, eponyms, deadlines, penalties, occupations, kinship terms, and containers. Constraint helpers live in template_constraints.py.
Demo Value
This section is useful in the public brain because it shows the solution was not only model training. It combined:
- mined scholarly data;
- LLM-assisted extraction;
- LLM-assisted transformations;
- deterministic template generation;
- source-specific normalization in normalization.
Provider-specific endpoint details in SDG config files should stay out of public prose. The reusable point is the data-generation architecture, not the API vendor.