Day 9 - More Reading LRL MT Related Papers
Abduselam, low resource languages
September 1st, 2025
- Thoughts reading "Data Augmentation With Back Translation for Low-Resource Languages: A Case of English and Luganda":
- Initial thoughts: the novel back-translation technique seems interesting, although in general I've never been a fan of back translation. It's not clear from the paper what the technique actually is, though.
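For my own reference, vanilla back-translation can be sketched like this (my sketch, not the paper's method; `reverse_translate` is a hypothetical stand-in for a trained target-to-source model, e.g. Luganda-to-English):

```python
# Vanilla back-translation sketch: a target->source model translates
# monolingual target-side text to produce synthetic (source, target)
# pairs, which are then added to the real parallel training data.

def back_translate(monolingual_target, reverse_translate):
    """reverse_translate: hypothetical fn mapping a target sentence
    to a synthetic source sentence."""
    return [(reverse_translate(t), t) for t in monolingual_target]

# Toy stand-in for a trained reverse model: just reverses word order.
toy_reverse = lambda sent: " ".join(reversed(sent.split()))

synthetic = back_translate(["omusajja agenda"], toy_reverse)
# synthetic pairs get concatenated with the real parallel corpus
```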
- What's TER?
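Looked it up: TER is Translation Edit Rate, the number of edits needed to turn the hypothesis into the reference, normalized by reference length (lower is better). A simplified sketch that ignores the block-shift edits real TER also counts:

```python
def simple_ter(hypothesis: str, reference: str) -> float:
    """Word-level edit distance / reference length.
    Simplified: real TER also counts block shifts as single edits."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard Levenshtein DP over words.
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(hyp)][len(ref)] / len(ref)
```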
- The paper makes a good point about other papers not being reproducible. Honestly, I feel like most papers I've read up until now are just totally not reproducible.
- I think the data collection should have been more thorough about its cleaning process. Also, it would’ve been nice to see some analysis or evaluations done on the dataset to determine its quality.
- Typo in "SelecBasedOnBLEU". Also, "OurBT" isn't clear to me; some things are hand-wavy and need a lot more explanation. The way I understand the novel method: train a new model for every back-translated monolingual dataset, figure out which monolingual dataset had the highest-scoring BLEU, figure out which model had the highest-scoring BLEU, then train the best model on the best monolingual dataset. It's not clear or logical to me how this approach produces a better-performing model. I should re-read, as I may be missing something.
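Writing my reading of it down as code to check it makes sense (lots of assumptions here; `train` and `bleu` are hypothetical stand-ins for real training and dev-set evaluation, and the paper's SelecBasedOnBLEU is unspecified):

```python
# Sketch of how I currently read the OurBT selection loop.
# This is my interpretation, NOT the paper's actual algorithm.

def our_bt_as_i_read_it(parallel, bt_datasets, train, bleu):
    scored = []
    for bt in bt_datasets:
        model = train(parallel + bt)        # one model per BT dataset
        scored.append((bleu(model), model, bt))
    # pick the best-scoring (model, dataset) pair by dev BLEU
    best_score, best_model, best_bt = max(scored, key=lambda t: t[0])
    # then (as I understand it) train again on the best dataset --
    # this final step is the part that seems redundant/unclear to me
    return train(parallel + best_bt)
```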
- Another typo: BLUE instead of BLEU (I'm almost sure).
- Looking at the example sentences, the translations are amazing! They are literally perfect (besides one); I hope there wasn't any tainting of the test dataset during training.
- I think looking into incremental BT and iterative BT may help me understand how this paper's OurBT differs. Update: I understand the first two now, but still not OurBT, since it's not clear whether L_best and E_best are single or multiple datasets. Also, the SelecBasedOnBLEU/SelectBestBLEU implementations aren't included, and the explanations are way too high-level, which isn't helpful because later in the paper the authors mention "OurBT was very instrumental in helping us select the best dataset combinations" (like what, and how?). I could understand how the single best dataset is chosen according to the pseudocode, but that requires a lot of assumptions. Anyway, the novel approach seems hand-wavy and unclear, which makes it hard to reproduce.
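Iterative BT, as I now understand it, retrains both translation directions for several rounds, with each round regenerating the synthetic data using the latest models. A minimal sketch (my understanding only; `train` and `translate` are hypothetical stand-ins):

```python
# Iterative back-translation sketch: alternate between the forward
# (src->tgt) and reverse (tgt->src) models, regenerating synthetic
# pairs from monolingual data each round.

def iterative_bt(parallel, mono_src, mono_tgt, train, translate, rounds=2):
    fwd = train(parallel)                       # src -> tgt
    rev = train([(t, s) for s, t in parallel])  # tgt -> src
    for _ in range(rounds):
        # reverse model back-translates target monolingual text
        synth_from_tgt = [(translate(rev, t), t) for t in mono_tgt]
        # forward model forward-translates source monolingual text
        synth_from_src = [(s, translate(fwd, s)) for s in mono_src]
        fwd = train(parallel + synth_from_tgt)
        rev = train([(t, s) for s, t in parallel + synth_from_src])
    return fwd, rev
```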
- Another typo: no closing parenthesis on page 146.
- Would've been nice if the paper included actual metrics for Google Translate.
- No reproducibility section; I wonder if there's a GitHub repo. The paper mentioned reproducibility issues in other papers but didn't include one itself (it better have a GitHub). Update: found a GitHub repo, but I don't think it's for this paper, although the dataset seems similar: https://github.com/kimrichies/English-Luganda-Parallel-corpus?tab=readme-ov-file