How Does This Repo Reproduce the Solution?
This repo is the public reproduction artifact for the 3rd place Deep Past Initiative Machine Translation solution. It is not the original exploratory working tree; that role belongs to the adjacent historical ../akk/ repo. The clean reproduction path is summarized by codebase-overview and dataset-and-config-map.
Short Answer
The repo reproduces the solution by packaging:
- extraction and repair scripts for turning competition CSVs and source documents into aligned sentence pairs;
- prompt templates for source-specific document extraction;
- normalization and deduplication scripts;
- final dataset assembly logic;
- ByT5 training code and Hydra configs;
- reward-model training code;
- synthetic-data generation code for grammar, CAD, and template drills.
The main executable path is extraction-and-preparation-pipeline followed by training-stack.
Practical Walkthrough
- Start with `README.md` for data downloads and environment setup.
- Use `run_pipeline.sh` for the end-to-end extraction path, or take the pre-extracted Kaggle dataset shortcut.
- Use `scripts/preparation/prepare_sentence_data_23.py` to assemble the final augmented data.
- Train baseline models with `code/train_baseline.py` and the configs under `conf/baseline/`.
- Train the reward model with `code/train_reward.py` and `conf/reward_model/conf_reward.yaml`.
- Check evaluation-and-decoding for how predictions are scored and decoded.
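The steps above can be sketched as an ordered command list. The dry-run driver below prints each step without executing it; the exact command-line flags each script takes are not documented here, so the bare invocations are assumptions to be checked against each script's `--help`:

```python
# Dry-run driver for the reproduction steps above. Invocations are
# illustrative assumptions; check each script's options before running.
import shlex
import subprocess

STEPS = [
    "bash run_pipeline.sh",
    "python scripts/preparation/prepare_sentence_data_23.py",
    "python code/train_baseline.py",
    "python code/train_reward.py",
]

def run_steps(steps: list[str], dry_run: bool = True) -> list[str]:
    """Print each step; execute them sequentially when dry_run is False."""
    executed = []
    for cmd in steps:
        print(("DRY-RUN: " if dry_run else "RUN: ") + cmd)
        if not dry_run:
            subprocess.run(shlex.split(cmd), check=True)
        executed.append(cmd)
    return executed
```

Running with `dry_run=True` first is a cheap way to confirm the intended order before committing to the full extraction and training passes.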
Caveat
Before running from scratch, read reproducibility-caveats. In particular, the current checkout has a known missing-helper-module issue in `prepare_sentence_data_23.py`; resolve it, or bypass it by using the pre-extracted data.
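One way to implement the bypass is an import probe: check whether the helper is importable before invoking the script, and fall back to the pre-extracted dataset when it is not. The function below is a generic sketch; the actual helper module's name is deliberately not filled in here, since this page does not state it:

```python
# Generic import probe; pass the name of the helper module that
# prepare_sentence_data_23.py expects (not hard-coded here on purpose).
import importlib.util

def helper_available(module_name: str) -> bool:
    """Return True if module_name can be found on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None
```

If the probe returns False, skip the extraction step and point the preparation stage at the pre-extracted Kaggle dataset instead.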