Extraction and Preparation Pipeline

The pipeline converts competition CSVs, source PDFs, HTML corpora, dictionary attestations, and synthetic translations into augmented sentence-level training data. It depends heavily on normalization and feeds training-stack.

Orchestration

run_pipeline.sh is the top-level sequential pipeline. It sets DATA_DIR, loads .env if present, verifies that the required API key environment variable is available, and runs seven stages:

Expert data: combine competition CSVs, repair translations, split into sentence pairs, deduplicate.
AKT extraction: side-by-side, top-bottom, and OCR extraction modes.
CAD extraction and normalization.
Journal article extraction for Dergipark, Michel, and Round 2/4 sources.
Hecker pipeline: PDF extraction, optional HTML scraping, cross-reference, synthetic translation generation.
Synthetic V22 generation from published transliterations.
Final preparation with prepare_sentence_data_23.py --hecker --round4.

Most extraction scripts expose checkpointing, sharding, or flattening behavior. Use existing flags rather than adding ad hoc parallel wrappers.

Extraction Scripts

extract_akt_pairs_v24.py: multimodal PDF extraction with modes such as side_by_side, top_bottom, ocr, dergipark, michel, hecker, and gelb.
extract_cad_pairs_v20.py: CAD PDF extraction into structured entries and flattened pairs.
normalize_cad_v20.py: filters and normalizes CAD/eSAD-style output.
repair_expert_translations_v16.py: LLM-backed repair of expert translations.
split_expert_sentences_v16.py: LLM-backed expert sentence splitting.
split_published_texts_v22.py: synthetic translation and sentence splitting for published transliterations or Hecker inputs.
crossref_hecker.py: matches Hecker transliterations to OARE, competition, expert, or synthetic records.
scrape_hpm_html.py: scrapes HPM HTML transliteration pages.

Preparation Scripts

dedup_expert_v19.py: deduplicates expert data by OARE ID and near-exact similarity.
dedup_synthetic_v19.py: removes synthetic records overlapping expert or AKT transliterations.
prepare_sentence_data_23.py: assembles the final augmented training set with source-specific copy counts and sliding-window merging.

The final preparation script references the missing_283_transliterations directory under DATA_DIR for Round 2 and Round 4 source files. The historical source collection behind that path is summarized in missing-283-transliterations.

Output

The pipeline message says final training data is written to DATA_DIR/synth_claude_v23_aug1/. The public README also documents Kaggle-hosted pre-extracted training data as the shortcut path.

Deep Past Solution Brain

Explorer

Extraction and Preparation Pipeline

Extraction and Preparation Pipeline

Orchestration

Extraction Scripts

Preparation Scripts

Output

Graph View

Table of Contents

Backlinks