Wiki Index

Content catalog. Every wiki page is listed under its type with a one-line summary. Last updated: 2026-04-26 | Total pages: 10

Entities

  • missing-283-transliterations — Historical source collection from the original working repo, covering missing OARE transliterations and extra Old Assyrian data.

Concepts

  • brain-publishing — GitHub Pages and Quartz workflow used to publish the brain/ wiki.
  • codebase-overview — Map of the current reproduction repository and its main modules.
  • dataset-and-config-map — Public map of data families, Kaggle datasets, and Hydra training configs.
  • evaluation-and-decoding — Competition metric implementation and decoding utilities.
  • extraction-and-preparation-pipeline — End-to-end data extraction, repair, splitting, deduplication, and final dataset preparation flow.
  • normalization — Shared transliteration and translation normalization rules used across extraction and preparation.
  • prompt-system — Prompt families used for PDF extraction, repair, sentence splitting, and translation-oriented data conversion.
  • reproducibility-caveats — Known gaps and checks needed when reproducing from this checkout.
  • synthetic-data-generation — Grammar transforms, CAD drills, and deterministic template-based synthetic examples.
  • training-stack — ByT5 baseline, reward model, configs, datasets, and training utilities.

Comparisons

  (no pages yet)

Queries

  (no pages yet)