Data Filtering
Static Embeddings Filter
filter_static_embeddings
batch(iterable, n)
Batch data into lists of length n. The last batch may be shorter.
Source code in src/quickmt_train/filter_static_embeddings.py
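The batching behaviour described above can be sketched as follows. This is a minimal illustration based only on the docstring, not the module's actual source; the name batch_sketch and the choice to write it as a generator are assumptions.

```python
from itertools import islice


def batch_sketch(iterable, n):
    """Yield successive lists of up to n items; the last list may be shorter."""
    if n < 1:
        raise ValueError("n must be at least 1")
    it = iter(iterable)
    while True:
        # islice pulls at most n items from the shared iterator
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk
```

For example, batching seven items with n=3 gives two full batches followed by a final batch holding the single remaining item.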
Basic Filter
filter_basic
batch(iterable, n)
Batch data into lists of length n. The last batch may be shorter.
char_length_match(s_clean, t_clean, min_char_length, max_char_length, length_ratio)
Ensure the source and target are within the character-length bounds, and reject the pair if source and target are identical.
Source code in src/quickmt_train/filter_basic.py
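One plausible reading of this check is sketched below. It is an assumption-laden illustration rather than the module's code: the name char_length_match_sketch, the boolean return value, and the definition of the ratio as longer length divided by shorter length are all hypothetical.

```python
def char_length_match_sketch(s_clean, t_clean, min_char_length, max_char_length, length_ratio):
    """Return True if the pair passes the length checks, False if it should be dropped."""
    # Identical source and target usually means untranslated text
    if s_clean == t_clean:
        return False
    s_len, t_len = len(s_clean), len(t_clean)
    # Both sides must fall inside the allowed character-length window
    if not (min_char_length <= s_len <= max_char_length):
        return False
    if not (min_char_length <= t_len <= max_char_length):
        return False
    shorter, longer = min(s_len, t_len), max(s_len, t_len)
    if shorter == 0:
        return False
    # Assumption: the ratio is longer / shorter, so it is symmetric in src and tgt
    return longer / shorter <= length_ratio
```

With the defaults listed for clean (min_char_length=3, max_char_length=2000, length_ratio=4), a three-character source paired with a twenty-character target would be rejected under this reading.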
clean(src_lang, tgt_lang, src_min_langid_score=0.5, tgt_min_langid_score=0.5, length_ratio=4, min_char_length=3, max_char_length=2000, ft_model_path='../lid.176.bin')
Remove non-printable characters and filter out pairs whose character-length ratio exceeds length_ratio.
Source code in src/quickmt_train/filter_basic.py
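The "remove non-printable characters" step could look something like the sketch below. The helper name strip_non_printable_sketch is hypothetical, and relying on str.isprintable is an assumption; the module may define "non-printable" differently.

```python
def strip_non_printable_sketch(text):
    """Drop every character that Python's str.isprintable() reports as non-printable."""
    # isprintable() is False for control and format characters (e.g. NUL, zero-width space)
    # and for all separators other than the ASCII space.
    return "".join(ch for ch in text if ch.isprintable())
```

Note that under this definition tabs, newlines and the no-break space are removed as well, which can join adjacent words; a real pipeline might prefer to map such characters to a plain space first.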
clean_input(s, t, src_lang, tgt_lang, ft, src_min_langid_score=0.5, tgt_min_langid_score=0.5, length_ratio=4, min_char_length=3, max_char_length=2000)
Filter and clean a parallel (source, target) sentence pair.
Source code in src/quickmt_train/filter_basic.py
english_text_match(s_clean, t_clean, src_lang, tgt_lang)
Ensure the English side has enough words and alphabetic characters.
Somewhat similar to https://github.com/mozilla/translations/blob/main/pipeline/clean/tools/clean_parallel.py#L73
Source code in src/quickmt_train/filter_basic.py
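A rough sketch of such a check follows. The thresholds min_words and min_alpha_ratio are hypothetical parameters that do not appear in the documented signature, and the rule for picking the English side by comparing the language code to "en" is likewise an assumption.

```python
def english_text_match_sketch(s_clean, t_clean, src_lang, tgt_lang,
                              min_words=1, min_alpha_ratio=0.5):
    """Return True if the English side (if any) looks like real text."""
    # Pick whichever side is English; pairs without an English side pass through
    if src_lang == "en":
        text = s_clean
    elif tgt_lang == "en":
        text = t_clean
    else:
        return True
    # Hypothetical threshold: require a minimum number of whitespace-separated words
    if len(text.split()) < min_words:
        return False
    non_space = [ch for ch in text if not ch.isspace()]
    if not non_space:
        return False
    # Hypothetical threshold: share of alphabetic characters among non-space characters
    alpha = sum(ch.isalpha() for ch in non_space)
    return alpha / len(non_space) >= min_alpha_ratio
```

Under these illustrative thresholds, an English side consisting only of digits and punctuation is rejected, while ordinary sentences pass.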
fasttext_lang_match(s, t, slang, tlang, ft, s_min_score=0.5, t_min_score=0.5)
Ensure the source and target are in the expected languages, using a fastText language-identification model.
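A minimal sketch of how such a check might work is shown below. It assumes ft behaves like a fastText model whose predict(text, k=1) returns a tuple of labels such as "__label__en" and a sequence of scores; the names lang_match_sketch and StubModel are hypothetical, and the stub exists only so the example runs without the lid.176.bin model file.

```python
def lang_match_sketch(s, t, slang, tlang, ft, s_min_score=0.5, t_min_score=0.5):
    """Return True if both sides are identified as the expected language with enough confidence."""
    def top_prediction(text):
        # fastText predicts one line at a time, so newlines are replaced first
        labels, scores = ft.predict(text.replace("\n", " "), k=1)
        return labels[0].replace("__label__", ""), float(scores[0])

    s_label, s_score = top_prediction(s)
    t_label, t_score = top_prediction(t)
    return (s_label == slang and s_score >= s_min_score
            and t_label == tlang and t_score >= t_min_score)


class StubModel:
    """Stand-in that mimics the shape of fastText's predict() output (illustration only)."""

    def __init__(self, table):
        self.table = table  # maps text -> (language code, score)

    def predict(self, text, k=1):
        label, score = self.table[text]
        return (f"__label__{label}",), [score]
```

In real use, ft would presumably be the model loaded from the path given by ft_model_path; the stub simply lets the thresholding logic be exercised in isolation.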