Normalization

Normalization is a central reproducibility surface: the same raw text has to normalize the same way on every run. Shared logic lives in scripts/normalization.py; pipeline-specific wrappers are used in prepare_sentence_data_23.py and normalize_cad_v20.py.

This page connects directly to extraction-and-preparation-pipeline and training-stack because normalization affects deduplication, training targets, gap markers, and final model behavior.

Covered Transform Families

scripts/normalization.py includes functions for:

fractions and slash fractions
subscripts
h-dot handling
gap markers
determinatives
brackets and unmatched brackets
whitespace and punctuation spacing
scribal insertions
special characters
line dividers
ceiling brackets
figure dash
circumflex-to-macron conversion
CDLI-to-target conversion
Hecker transliteration normalization
final translation postprocessing
character cleaning for transliteration and translation fields
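
As an illustration of one transform family in this list, the following is a minimal sketch of circumflex-to-macron conversion. The function name circumflex_to_macron is hypothetical and is not taken from scripts/normalization.py; the actual implementation may handle only specific vowels or use a lookup table instead. The sketch assumes every circumflex should become a macron and uses Unicode decomposition to do so.

```python
import unicodedata

COMBINING_CIRCUMFLEX = "\u0302"
COMBINING_MACRON = "\u0304"


def circumflex_to_macron(text: str) -> str:
    """Hypothetical helper: replace every circumflex accent with a macron.

    Assumption: all circumflexes are converted, regardless of base letter.
    """
    # Decompose so that accented letters become base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Swap the combining circumflex for the combining macron.
    swapped = decomposed.replace(COMBINING_CIRCUMFLEX, COMBINING_MACRON)
    # Recompose so the output uses precomposed characters where they exist.
    return unicodedata.normalize("NFC", swapped)


if __name__ == "__main__":
    print(circumflex_to_macron("\u0161arr\u00e2ni"))
```

Working through NFD and NFC leaves other diacritics (for example the caron in š) untouched, because only the combining circumflex is replaced.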

Important Practice

Do not reimplement normalization ad hoc in a new script. Reuse the shared functions or add a named pipeline-specific wrapper when behavior needs to differ.

prepare_sentence_data_23.py explicitly overrides subscript handling with V15-era behavior, mapping subscript x to plain x rather than removing it. This is a deliberate variant and should be preserved unless a reproduction run intentionally changes it.
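
The difference between the two subscript behaviors, and the "named wrapper" pattern recommended above, can be sketched as follows. All function names here are hypothetical stand-ins and do not reflect the real signatures in scripts/normalization.py or prepare_sentence_data_23.py. The whitespace step is included only to give the wrapper something shared to reuse.

```python
from typing import Callable

SUBSCRIPT_X = "\u2093"  # the character "ₓ"


def remove_subscript_x(text: str) -> str:
    """Stand-in for a shared default that drops subscript x (assumed)."""
    return text.replace(SUBSCRIPT_X, "")


def subscript_x_to_plain(text: str) -> str:
    """Stand-in for the V15-era variant: subscript x becomes plain x."""
    return text.replace(SUBSCRIPT_X, "x")


def normalize(text: str, subscript_step: Callable[[str], str] = remove_subscript_x) -> str:
    """Hypothetical shared entry point with a swappable subscript step."""
    text = subscript_step(text)
    # Collapse runs of whitespace; stands in for the other shared steps.
    return " ".join(text.split())


def normalize_sentence_data_v15_subscripts(text: str) -> str:
    """Named pipeline-specific wrapper: same shared logic, V15 subscripts."""
    return normalize(text, subscript_step=subscript_x_to_plain)


if __name__ == "__main__":
    sample = "KA" + SUBSCRIPT_X + "  ba"
    print(normalize(sample))
    print(normalize_sentence_data_v15_subscripts(sample))
```

Giving the variant its own name keeps the deviation visible at the call site, so a reproduction run can tell which behavior produced a given dataset.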

Gap markers such as <gap> and <big_gap> are semantic. Treat them as model-facing data, not generic markup.
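
One practical consequence is that any cleanup touching angle brackets has to leave these tokens intact. The sketch below shows one way to do that. The function strip_angle_markup and the cleanup rule itself are hypothetical and are not taken from the repository; only the two token strings come from this page.

```python
import re

GAP_TOKENS = ("<gap>", "<big_gap>")

# Capturing group so that re.split keeps the matched tokens in its output.
_TOKEN_RE = re.compile("(" + "|".join(re.escape(t) for t in GAP_TOKENS) + ")")


def strip_angle_markup(text: str) -> str:
    """Hypothetical cleanup: drop stray '<' and '>' but keep gap markers."""
    parts = _TOKEN_RE.split(text)
    cleaned = []
    for part in parts:
        if part in GAP_TOKENS:
            # Model-facing token: pass through unchanged.
            cleaned.append(part)
        else:
            # Ordinary text: remove angle brackets only.
            cleaned.append(re.sub(r"[<>]", "", part))
    return "".join(cleaned)


if __name__ == "__main__":
    print(strip_angle_markup("a-na <gap> <ki> <big_gap> x"))
```

Splitting on the tokens first means the bracket-stripping rule never sees them, which is safer than stripping everything and trying to restore the markers afterwards.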
