Evaluation and Decoding

Evaluation logic lives under code/utils/ and is used by the training stack (see training-stack). The public README states that the competition metric is the geometric mean of corpus-level BLEU and chrF++, both computed with SacreBLEU.

Metric

code/utils/metric_utils.py computes:

  • corpus BLEU with SacreBLEU;
  • corpus chrF++ with word_order=2;
  • the final score as the geometric mean of the two, i.e. the square root of BLEU × chrF++.

The function also returns the component BLEU and chrF values, which is useful for debugging whether a model is improving through exact lexical overlap, character-level similarity, or both.
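
A minimal sketch of that computation, assuming SacreBLEU's Python API; the function name and return keys are illustrative, not necessarily those used in metric_utils.py:

```python
import math

from sacrebleu.metrics import BLEU, CHRF

def compute_metric(hypotheses, references):
    """Corpus BLEU, corpus chrF++ (word_order=2), and their geometric mean.

    `hypotheses` and `references` are parallel lists of strings. The name and
    return keys are assumptions, not the exact interface of metric_utils.py.
    """
    bleu = BLEU().corpus_score(hypotheses, [references]).score
    chrf = CHRF(word_order=2).corpus_score(hypotheses, [references]).score
    return {
        "bleu": bleu,
        "chrf++": chrf,
        # geometric mean of the two corpus-level scores
        "score": math.sqrt(bleu * chrf),
    }
```

Returning the components alongside the combined score makes it easy to see whether a change moves BLEU, chrF++, or both.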

Decoding

code/utils/generation_utils.py includes:

  • standard batched generation;
  • generation config construction from the Hydra config (sketched below);
  • minimum Bayes risk (MBR) style candidate selection using pairwise sentence-level utility.
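
For the first two items, a rough sketch of how a Hydra/OmegaConf decoding block can be turned into a Hugging Face GenerationConfig and fed through a batched generate() loop. The config keys, helper names, and the assumption that the models are Hugging Face transformers are illustrative, not the exact generation_utils.py API:

```python
import torch
from omegaconf import OmegaConf
from transformers import GenerationConfig

# Illustrative Hydra-style node; the real config keys may differ.
cfg = OmegaConf.create({
    "generation": {"num_beams": 5, "max_new_tokens": 128, "do_sample": False}
})

def build_generation_config(cfg):
    # Convert the OmegaConf node into plain keyword arguments.
    return GenerationConfig(**OmegaConf.to_container(cfg.generation))

@torch.no_grad()
def generate_batched(model, tokenizer, sources, cfg, batch_size=16, device="cuda"):
    gen_config = build_generation_config(cfg)
    model = model.to(device).eval()
    outputs = []
    for start in range(0, len(sources), batch_size):
        batch = tokenizer(
            sources[start:start + batch_size],
            return_tensors="pt", padding=True, truncation=True,
        ).to(device)
        ids = model.generate(**batch, generation_config=gen_config)
        outputs.extend(tokenizer.batch_decode(ids, skip_special_tokens=True))
    return outputs
```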

The MBR helper samples multiple candidates per input, scores candidates against each other using the same BLEU/chrF++-style utility, and selects the candidate with the highest average pairwise utility.
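
A minimal sketch of that selection step, using sentence-level chrF++ from SacreBLEU as the pairwise utility; the helper name, and the choice of chrF++ alone rather than a combined BLEU/chrF++ utility, are assumptions rather than the exact implementation:

```python
from sacrebleu.metrics import CHRF

_chrf = CHRF(word_order=2)  # sentence-level chrF++ as the pairwise utility

def mbr_select(candidates):
    """Return the candidate with the highest average pairwise utility."""
    best, best_util = None, float("-inf")
    for i, hyp in enumerate(candidates):
        # Score this candidate against every other candidate (pseudo-references).
        refs = [c for j, c in enumerate(candidates) if j != i]
        if not refs:
            return hyp
        util = sum(_chrf.sentence_score(hyp, [r]).score for r in refs) / len(refs)
        if util > best_util:
            best, best_util = hyp, util
    return best
```

Here `candidates` would be the sampled outputs for a single source sentence, with mbr_select applied once per input.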

This page is a useful companion to dataset-and-config-map because it connects the training configs to the score the competition reports.