Day 3 - More Reading of LRL MT Related Papers
Abduselam, low resource languages
August 25th, 2025
Yesterday I got distracted doing side quests and learning or remembering stuff, so I couldn't finish the paper.
- Continuing reading Prompt Engineering Enhances Faroese MT, but Only Humans Can Tell (while doing side research quests, namely getting a basic understanding of the common evaluation metrics):
- Learned more about METEOR, which basically answers previous thoughts I had about an eval metric that handles synonyms, conjugations, and word ordering better. It still has its issues, namely being dependent on a lexical database (WordNet) and a stemmer, which won’t really work for low-resource languages since there are so many and most don’t have lexical databases or stemmers. TODO: look into how many of the top 500 languages have a grammar book published for English speakers, and similarly how many have a dictionary.
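To make the METEOR mechanics concrete for myself, here is a toy sketch of its scoring formula (recall-weighted harmonic mean plus a fragmentation penalty for word order). It only does exact unigram matching; real METEOR also matches stems and WordNet synonyms, which is exactly the part that breaks for low-resource languages. `meteor_lite` is my own hypothetical name, not an actual library function.

```python
def meteor_lite(reference, hypothesis):
    """Toy METEOR-style score: exact unigram matches only.
    Real METEOR also has stem and WordNet-synonym matching stages."""
    ref, hyp = reference.split(), hypothesis.split()
    # Greedily align each hypothesis token to the first unused
    # reference position with the same surface form.
    used = set()
    alignment = []  # (hyp_index, ref_index) pairs
    for i, tok in enumerate(hyp):
        for j, rtok in enumerate(ref):
            if j not in used and rtok == tok:
                used.add(j)
                alignment.append((i, j))
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    precision = m / len(hyp)
    recall = m / len(ref)
    # Harmonic mean weighting recall 9x more than precision, as in METEOR.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fragmentation penalty: count "chunks" of contiguous aligned tokens;
    # fewer, longer chunks means the word order matched better.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if not (i2 == i1 + 1 and j2 == j1 + 1):
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```

A scrambled hypothesis like "sat cat the" against "the cat sat" keeps perfect precision/recall but gets hit by the penalty, which is the word-ordering sensitivity I was looking for.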
- TODO: an idea came up that I should research: whether there is a model architecture that outputs multiple candidate translations for a sentence.
- Remembered that “regression” for models just means outputting a continuous value.
- The term “quality drift” was vague, so I checked: it basically means translation quality getting worse gradually over time, across longer text content lengths (more attention requirements), and across domains (e.g. fine-tuning on one domain may cause worse quality in another domain).
- Learned more about BLEURT, which evaluates by basically using an embedding model to compare similarity between gold and predicted sentences. The way it works is that they further pre-trained BERT (on lots of synthetic sentence pairs) and then fine-tuned it on example human evaluations, basically hoping that the model would learn to evaluate like a human would. It's technically a regression model. Downsides are that it might be biased by the human-rating data it was trained on, it's limited to 13 languages, and it's quite expensive to use and extend to other languages. It's not clear if many other papers use BLEURT as a metric for low-resource languages; so far I don’t remember seeing it often. I guess it could be used for evaluating om->en since the gold and predicted sentences would be English, although I'm not sure how standard it is for research papers to only consider one direction.
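The "fine-tuned on human evaluations, technically a regression model" part clicked for me as just least-squares fitting. A minimal sketch of that idea, with plain floats standing in for what would actually be BERT's pooled embedding of the (reference, candidate) pair; all names and numbers here are made up for illustration.

```python
def fit_regression(features, ratings, lr=0.1, epochs=500):
    """Fit w.x + b to human ratings by gradient descent on mean squared error.
    In BLEURT the features would come from BERT; here they are stand-ins."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        gw = [0.0] * dim
        gb = 0.0
        for x, y in zip(features, ratings):
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y
            for k in range(dim):
                gw[k] += 2 * err * x[k] / n
            gb += 2 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Toy training set: a crude overlap feature vs. a human rating per example.
w, b = fit_regression([[0.9], [0.5], [0.1]], [0.95, 0.5, 0.1])
```

After fitting, a high-overlap pair should get predicted a higher score than a low-overlap one, which is all the regression head is doing; BLEURT's trick is that the features come from a pre-trained BERT rather than hand-crafted overlaps.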
- Learned about BERTScore too, similar to BLEURT but different: it takes the contextual embedding for each token of both the gold and predicted sentences, and matches tokens across the two by cosine similarity (each token greedily pairs with its most similar counterpart), giving precision, recall, and F1. Again, issues with this are similar to BLEURT since it's for a limited set of languages, although with much more coverage (104 langs). A few more notes: since the scores come from cosine similarities, the output is bounded (roughly 0-1 in practice, and the authors rescale against a baseline). Also, its use cases are not limited/refined to only machine translation. Another thing: the paper mentions “contextual embeddings”, which are basically the per-token representations a transformer produces after attention, as opposed to static word embeddings. TODO: could an approach like BERTScore be done for other languages, specifically by training on monolingual data for a low-resource language? Although, this doesn’t seem like a standard approach in the papers I am reading.
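To pin down the matching step for myself: a toy sketch of BERTScore's greedy token matching, with hand-made 2-d vectors standing in for the contextual embeddings a BERT model would produce. `bertscore_lite` is my own hypothetical name; the real implementation also does IDF weighting and baseline rescaling.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bertscore_lite(ref_embs, hyp_embs):
    """Toy BERTScore: each token greedily matches the most similar token
    on the other side; recall averages over reference tokens, precision
    over hypothesis tokens."""
    recall = sum(max(cosine(r, h) for h in hyp_embs) for r in ref_embs) / len(ref_embs)
    precision = sum(max(cosine(h, r) for r in ref_embs) for h in hyp_embs) / len(hyp_embs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Dropping a reference token's match from the hypothesis hurts recall but not precision, which is the asymmetry the precision/recall split is capturing.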
- TODO: research YiSi1-SRL and ESIM since they're mentioned in the BLEURT paper and seem like potentially relevant evaluation metrics.
- TODO: research the idea of fine-tuning a reasoning model for MT, either as two-way MT or single-way MT.
- Joined the languagetechnology subreddit and saw some people asking about reproducibility challenges. Found a conference talk and summarized it here; basically it was about how it's better not to use Jupyter Notebooks for reproducibility, and that using VS Code and a standard repo structure will ultimately be better for a number of reasons.
- Got in touch with a redditor who connected me with a PhD researcher at CMU, who answered my questions and advised me to apply to Cohere's scholar program (which closes in 4 days) and to join online research communities. Afterwards, I found Professor Noah Smith's old Google Doc notes about doing a PhD, and it's some decent advice.