Wiki Index
Content catalog. Every wiki page is listed under its type with a one-line summary. Last updated: 2026-04-26 | Total pages: 15
Entities
- missing-283-transliterations — Historical source collection from the original working repo for missing OARE transliterations and extra Old Assyrian data.
Concepts
- brain-publishing — GitHub Pages and Quartz workflow used to publish the brain/wiki.
- codebase-overview — Map of the current reproduction repository and its main modules.
- dataset-and-config-map — Public map of data families, Kaggle datasets, and Hydra training configs.
- evaluation-and-decoding — Competition metric implementation and decoding utilities.
- extraction-and-preparation-pipeline — End-to-end data extraction, repair, splitting, deduplication, and final dataset preparation flow.
- normalization — Shared transliteration and translation normalization rules used across extraction and preparation.
- prompt-system — Prompt families used for PDF extraction, repair, sentence splitting, and translation-oriented data conversion.
- reproducibility-caveats — Known gaps and checks needed when reproducing from this checkout.
- synthetic-data-generation — Grammar transforms, CAD drills, and deterministic template-based synthetic examples.
- training-stack — ByT5 baseline, reward model, configs, datasets, and training utilities.
Comparisons
Queries
- Query Pages — Folder landing page for filed query answers.
- how-does-this-repo-reproduce-the-solution — Filed answer explaining the end-to-end reproduction path.
- what-makes-the-data-pipeline-distinctive — Filed synthesis of the repo’s data-pipeline differentiators.
- what-should-i-show-in-a-demo — Suggested walkthrough for presenting the published brain.