Dataset and Config Map

This page maps the repo’s public training configuration surface. It supports training-stack and clarifies how data moves from extraction to model runs.

Data Families

The README describes three broad data setup paths:

competition data from Kaggle;
pre-extracted training data from a Kaggle dataset;
source PDFs/reference material from a separate Kaggle dataset for full extraction reruns.

The source collection summarized in missing-283-transliterations is historical working-repo provenance for some Round 2/4 files, while the reproducibility path should use the staged DATA_DIR layout or the Kaggle datasets documented in README.md.

Training Configs

Current config files:

conf/baseline/conf_baseline_pretrain_large.yaml
conf/baseline/conf_baseline_pretrain_xl.yaml
conf/baseline/conf_baseline_continue_large.yaml
conf/baseline/conf_baseline_continue_xl.yaml
conf/reward_model/conf_reward.yaml

The baseline configs cover continued pretraining and fine-tuning for ByT5-Large and ByT5-XL. The reward config covers the preference/reward model path.

Practical Rule

Before citing a dataset ID, checkpoint path, batch size, learning rate, or epoch count, re-read the current YAML. These are perishable facts. This is especially important because the wiki is public and intended for a demo, while configs may change during cleanup.

Deep Past Solution Brain

Explorer

Dataset and Config Map

Dataset and Config Map

Data Families

Training Configs

Practical Rule

Graph View

Table of Contents

Backlinks