Normalization
Normalization is a central reproducibility surface: a change to any transform changes the data that every later step sees. Shared logic lives in scripts/normalization.py; pipeline-specific wrappers are used in prepare_sentence_data_23.py and normalize_cad_v20.py.
This page connects directly to extraction-and-preparation-pipeline and training-stack because normalization affects deduplication, training targets, gap markers, and final model behavior.
Covered Transform Families
scripts/normalization.py includes functions for:
- fractions and slash fractions
- subscripts
- h-dot handling
- gap markers
- determinatives
- brackets and unmatched brackets
- whitespace and punctuation spacing
- scribal insertions
- special characters
- line dividers
- ceiling brackets
- figure dash
- circumflex-to-macron conversion
- CDLI-to-target conversion
- Hecker transliteration normalization
- final translation postprocessing
- character cleaning for transliteration and translation fields
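To give a sense of what one of these transform families can look like, here is a minimal sketch of a circumflex-to-macron conversion built on Unicode normalization. It does not reproduce the code in scripts/normalization.py; the function name circumflex_to_macron and the decompose-swap-recompose approach are assumptions made for illustration.

```python
import unicodedata

# Combining marks used after NFD decomposition.
COMBINING_CIRCUMFLEX = "\u0302"
COMBINING_MACRON = "\u0304"


def circumflex_to_macron(text: str) -> str:
    """Hypothetical helper: turn circumflexed vowels into macron vowels.

    Assumption: the shared module may implement this differently,
    for example with an explicit character table.
    """
    # Split each accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Swap the combining circumflex for a combining macron.
    swapped = decomposed.replace(COMBINING_CIRCUMFLEX, COMBINING_MACRON)
    # Recompose so the result uses precomposed characters where they exist.
    return unicodedata.normalize("NFC", swapped)
```

With this sketch, circumflex_to_macron("bêlum") returns "bēlum". Note that the final NFC step also recomposes any other decomposed characters in the input, which may or may not match the behavior of the shared function.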
Important Practice
Do not reimplement normalization ad hoc in a new script. Reuse the shared functions or add a named pipeline-specific wrapper when behavior needs to differ.
prepare_sentence_data_23.py explicitly overrides subscript handling with V15-era behavior, mapping subscript x to plain x rather than removing it. This is a deliberate variant and should be preserved unless a reproduction run intentionally changes it.
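The pattern of a named, pipeline-specific variant can be sketched as follows. The function names are hypothetical and are not taken from scripts/normalization.py or prepare_sentence_data_23.py. The shared behavior shown here (dropping subscript x) is an assumption inferred from the phrase "rather than removing it".

```python
SUBSCRIPT_X = "\u2093"  # the character "ₓ"


def normalize_subscript_x_shared(text: str) -> str:
    """Stand-in for the shared behavior (assumed): remove subscript x."""
    return text.replace(SUBSCRIPT_X, "")


def normalize_subscript_x_v15(text: str) -> str:
    """Named variant (hypothetical name): map subscript x to plain x."""
    return text.replace(SUBSCRIPT_X, "x")


def prepare_line(text: str, subscript_fn=normalize_subscript_x_shared) -> str:
    """Hypothetical pipeline step that takes the subscript handler as a parameter.

    Passing the handler explicitly makes the variant visible at the call site
    instead of hiding it in a copied-and-edited function body.
    """
    return subscript_fn(text.strip())
```

A pipeline that needs the V15-era behavior would then call prepare_line(line, subscript_fn=normalize_subscript_x_v15), so the deviation from the shared default has a name and can be searched for.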
Gap markers such as <gap> and <big_gap> are semantic. Treat them as model-facing data, not generic markup: a cleaning step that strips or rewrites them changes the training targets.
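One way to keep gap markers intact while still cleaning markup characters is to split the text on the markers and clean only the segments between them. This is a sketch under assumptions: the helper names are hypothetical, the marker list contains only the two tokens named above, and strip_markup_chars stands in for whatever character cleaning a pipeline actually applies.

```python
import re

# Only the two markers named on this page; extend if the project defines more.
GAP_TOKENS = ("<gap>", "<big_gap>")
_GAP_RE = re.compile("(" + "|".join(re.escape(t) for t in GAP_TOKENS) + ")")


def strip_markup_chars(segment: str) -> str:
    """Hypothetical cleaner: removes angle and square brackets.

    Applied blindly to a whole line, it would turn "<gap>" into "gap".
    """
    return re.sub(r"[<>\[\]]", "", segment)


def clean_preserving_gaps(text: str) -> str:
    """Clean everything except the gap markers themselves."""
    # The capturing group makes re.split keep the markers in the result.
    parts = _GAP_RE.split(text)
    cleaned = [p if p in GAP_TOKENS else strip_markup_chars(p) for p in parts]
    return "".join(cleaned)
```

For example, clean_preserving_gaps("a-na <gap> [x] <big_gap>") returns "a-na <gap> x <big_gap>", while strip_markup_chars on the same input would erase the marker delimiters.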