Evaluation and Decoding

Evaluation logic lives under code/utils/ and is used by the training stack (see training-stack). The public README states that the competition metric is the geometric mean of corpus-level BLEU and chrF++, both computed with SacreBLEU.

Metric

code/utils/metric_utils.py computes:

  • corpus BLEU with SacreBLEU;
  • corpus chrF++ with word_order=2;
  • the final score as the geometric mean of the two, i.e. the square root of BLEU × chrF++.

The function also returns the component BLEU and chrF values, which is useful for debugging whether a model is improving through exact lexical overlap, character-level similarity, or both.
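
A minimal sketch of that computation, assuming SacreBLEU's Python API; the function name and return keys are illustrative, not necessarily those used in metric_utils.py:

```python
import math

from sacrebleu.metrics import BLEU, CHRF

def compute_metric(hypotheses, references):
    """Corpus BLEU, corpus chrF++ (word_order=2), and their geometric mean.

    `hypotheses` and `references` are parallel lists of strings. The name and
    return keys are assumptions, not the exact interface of metric_utils.py.
    """
    bleu = BLEU().corpus_score(hypotheses, [references]).score
    chrf = CHRF(word_order=2).corpus_score(hypotheses, [references]).score
    return {
        "bleu": bleu,
        "chrf++": chrf,
        # geometric mean of the two corpus-level scores
        "score": math.sqrt(bleu * chrf),
    }
```

Returning the components alongside the combined score makes it easy to see whether a change moves BLEU, chrF++, or both.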

Decoding

code/utils/generation_utils.py includes:

  • standard batched generation;
  • generation config construction from the Hydra config (sketched below);
  • minimum Bayes risk (MBR) style candidate selection using pairwise sentence-level utility.
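
For the first two items, a rough sketch of how a Hydra/OmegaConf decoding block can be turned into a Hugging Face GenerationConfig and fed through a batched generate() loop. The config keys, helper names, and the assumption that the models are Hugging Face transformers are illustrative, not the exact generation_utils.py API:

```python
import torch
from omegaconf import OmegaConf
from transformers import GenerationConfig

# Illustrative Hydra-style node; the real config keys may differ.
cfg = OmegaConf.create({
    "generation": {"num_beams": 5, "max_new_tokens": 128, "do_sample": False}
})

def build_generation_config(cfg):
    # Convert the OmegaConf node into plain keyword arguments.
    return GenerationConfig(**OmegaConf.to_container(cfg.generation))

@torch.no_grad()
def generate_batched(model, tokenizer, sources, cfg, batch_size=16, device="cuda"):
    gen_config = build_generation_config(cfg)
    model = model.to(device).eval()
    outputs = []
    for start in range(0, len(sources), batch_size):
        batch = tokenizer(
            sources[start:start + batch_size],
            return_tensors="pt", padding=True, truncation=True,
        ).to(device)
        ids = model.generate(**batch, generation_config=gen_config)
        outputs.extend(tokenizer.batch_decode(ids, skip_special_tokens=True))
    return outputs
```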

The MBR helper samples multiple candidates per input, scores candidates against each other using the same BLEU/chrF++-style utility, and selects the candidate with the highest average pairwise utility.
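
A minimal sketch of that selection step, using sentence-level chrF++ from SacreBLEU as the pairwise utility; the helper name, and the choice of chrF++ alone rather than a combined BLEU/chrF++ utility, are assumptions rather than the exact implementation:

```python
from sacrebleu.metrics import CHRF

_chrf = CHRF(word_order=2)  # sentence-level chrF++ as the pairwise utility

def mbr_select(candidates):
    """Return the candidate with the highest average pairwise utility."""
    best, best_util = None, float("-inf")
    for i, hyp in enumerate(candidates):
        # Score this candidate against every other candidate (pseudo-references).
        refs = [c for j, c in enumerate(candidates) if j != i]
        if not refs:
            return hyp
        util = sum(_chrf.sentence_score(hyp, [r]).score for r in refs) / len(refs)
        if util > best_util:
            best, best_util = hyp, util
    return best
```

Here `candidates` would be the sampled outputs for a single source sentence, with mbr_select applied once per input.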

This page is a useful companion to dataset-and-config-map because it connects the training configs to the score the competition reports.