Wiki Index

Content catalog. Every wiki page is listed under its type with a one-line summary. Last updated: 2026-04-26 | Total pages: 10

Entities

  • missing-283-transliterations — Historical source collection from the original working repo, covering missing OARE transliterations and extra Old Assyrian data.

Concepts

  • brain-publishing — GitHub Pages and Quartz workflow used to publish the brain/ wiki.
  • codebase-overview — Map of the current reproduction repository and its main modules.
  • dataset-and-config-map — Public map of data families, Kaggle datasets, and Hydra training configs.
  • evaluation-and-decoding — Competition metric implementation and decoding utilities.
  • extraction-and-preparation-pipeline — End-to-end data extraction, repair, splitting, deduplication, and final dataset preparation flow.
  • normalization — Shared transliteration and translation normalization rules used across extraction and preparation.
  • prompt-system — Prompt families used for PDF extraction, repair, sentence splitting, and translation-oriented data conversion.
  • reproducibility-caveats — Known gaps and checks needed when reproducing from this checkout.
  • synthetic-data-generation — Grammar transforms, CAD drills, and deterministic template-based synthetic examples.
  • training-stack — ByT5 baseline, reward model, configs, datasets, and training utilities.

Comparisons

  (no pages yet)

Queries

  (no pages yet)