Day 4 - More Reading of LRL MT Related Papers
Abduselam, low resource languages
August 26th, 2025
Yesterday I was still doing side quests, namely learning about the different evaluation metrics, reaching out to graduate community members, and getting connected to resources/programs like Cohere. So I still haven't finished that paper; I'm almost done and only need to look into COMET:
- Thoughts about COMET: A Neural Framework for MT Evaluation:
- Initial thoughts: the paper has a good introduction and makes good points, especially about how there isn't an updated standard metric for machine translation quality (e.g. to replace BLEU and chrF). They support this claim by noting that the number of submissions to the WMT translation shared task is far higher than the number of submissions to the metrics task. I agree with this: so far in my research, BLEU and chrF have strengths but many weaknesses. Similarly, the other metrics (BLEURT, BERTScore, METEOR) have strengths and weaknesses, the main one being that they mostly support only high-resource languages. I'm curious how COMET fares in this paper and whether it addresses this.
- TODO: research the thought of creating multi-styled source sentences in a synthetic parallel corpus. Determine all the different styles of writing (same meaning, though). For each source sentence and each writing style, pass it to an LLM to output a new sentence in that writing style and add it to the parallel corpus. If capable, do the same for target sentences in the inverse direction. Example styles: informal, formal, slang. For each style, the LLM can be given a system prompt with many examples in just that style. Rough sketch of this loop below.
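A rough sketch of that augmentation loop, just to pin the idea down. The `call_llm` function is a placeholder for whatever LLM API I end up using, and the style list/instructions are only examples from the note above, so this is an untested idea, not a working pipeline:

```python
# Sketch of style-based augmentation for a synthetic parallel corpus.
# `call_llm` is a placeholder for a real LLM API call; nothing here is tested.

STYLES = {
    "informal": "Rewrite the sentence in a casual, informal register. Keep the meaning identical.",
    "formal":   "Rewrite the sentence in a formal register. Keep the meaning identical.",
    "slang":    "Rewrite the sentence using common slang. Keep the meaning identical.",
}

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder: swap in an actual chat-completion call here."""
    raise NotImplementedError

def augment_pair(src: str, tgt: str) -> list[tuple[str, str]]:
    """For one (source, target) pair, produce one new pair per writing style.

    The restyled source keeps the original target, since the meaning is unchanged.
    """
    new_pairs = []
    for style, instruction in STYLES.items():
        # Each style would get its own system prompt, ideally with few-shot
        # examples written purely in that style (as described in the note).
        restyled_src = call_llm(system_prompt=instruction, user_prompt=src)
        new_pairs.append((restyled_src, tgt))
    return new_pairs
```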
- Correction: both BLEURT and BERTScore are designed for evaluating natural language generation tasks where a predicted sentence is scored against a reference; unlike COMET, they don't use the source sentence. Quick usage sketch below.
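For my own reference, this is roughly how BERTScore is called with the `bert-score` package; note it only takes candidates and references, no source sentences (I haven't run this snippet yet, so treat it as a sketch):

```python
# Minimal BERTScore sketch: candidate translations vs. references only.
from bert_score import score

candidates = ["The cat sat on the mat."]        # system outputs
references = ["A cat was sitting on the mat."]  # human references

P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())  # average F1 over the (single) sentence pair
```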
- Honestly, I'm not really understanding how COMET works, but I do understand it uses a trained model rather than being based on heuristics. TODO: take a second look at COMET to understand it; rough scoring sketch below for when I do.
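To make that second look easier, here is roughly how COMET scoring looks with the `unbabel-comet` package, assuming the `Unbabel/wmt22-comet-da` checkpoint (untested by me, so just a sketch):

```python
# Sketch of scoring translations with COMET. Unlike BLEU/chrF, it runs a
# trained model over (source, hypothesis, reference) triplets.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Hvussu gongst?",      # source sentence (Faroese, made-up example)
        "mt":  "How is it going?",    # system translation
        "ref": "How are you doing?",  # human reference
    },
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level score
```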
- Continuing reading Prompt Engineering Enhances Faroese MT, but Only Humans Can Tell:
- Honestly, things were clear until Section 3 about evaluation; it was dense reading. TODO: go back and reread Section 3.
- TODO: research what Kendall's Tau and Kendall's Coefficient of Concordance are. I keep seeing these when it comes to comparing two sets of values, typically translation quality ratings. Starter sketch below.
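As a starting point for that TODO, a quick sketch of computing Kendall's tau with scipy. The numbers are made up; the point is that tau measures how well two rankings, e.g. a metric's scores and human ratings over the same translations, agree:

```python
# Kendall's tau: rank correlation between two sets of scores for the
# same translations, e.g. an automatic metric vs. human ratings.
from scipy.stats import kendalltau

metric_scores = [0.41, 0.55, 0.30, 0.72, 0.66]  # made-up metric scores
human_ratings = [3, 4, 2, 5, 4]                 # made-up human ratings

tau, p_value = kendalltau(metric_scores, human_ratings)
print(tau, p_value)  # tau near 1 = rankings agree, near -1 = reversed
```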
- Interesting finding: T_rand and T_sel were very close in the human evaluation rankings, and similarly close in BLEU, chrF, and BERTScore, with T_sel coming out just slightly ahead. The zero-shot setup was actually closer to T_rand, but ultimately the human evaluators noticed much better translations in T_sel even though the automatic metrics (i.e. BLEU and chrF) rated them very close. As a whole, I see that results on paper, especially automatic eval results, are not enough to determine whether an approach is good or bad; many other factors need to be observed (e.g. the test dataset, human eval, statistical significance, I think the p-value thing). In the future, scrolling down and reading results isn't enough to determine if a paper is worth continuing to read. I'm writing this especially because when I scrolled down to the results earlier, before reading the sections explaining how the automatic metrics are blind, I made the following assumptions (which ultimately turned out false!): that these results are evidence that when prompting LLMs for translations, it's sufficient to provide randomly selected sentences as examples instead of doing the extra work to find a set of sentences similar to the source (although it's probably a good idea in future work to mention how it could be an improvement and use this paper as proof along with the other one).
- TODO: look into what the Mann–Whitney U test is. I think it has to do with determining the probability that something happened due to luck, so the lower the p-value the better, since it shows there's a low chance the results happened by chance. In this paper's case, it's applied to two sets of values, so I think it's basically trying to estimate whether the slight difference between the T_rand and T_sel rankings is luck (e.g. noisy data) or due to some underlying rule/truth. Sketch below.
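A small sketch of how that test looks in scipy, using made-up per-sentence scores for T_rand and T_sel, just to pin down the mechanics for myself:

```python
# Mann-Whitney U test: are two sets of scores plausibly from the same
# distribution, or is the difference unlikely to be due to chance?
from scipy.stats import mannwhitneyu

t_rand_scores = [0.31, 0.42, 0.38, 0.29, 0.35]  # made-up per-sentence scores
t_sel_scores  = [0.36, 0.47, 0.40, 0.33, 0.41]

stat, p_value = mannwhitneyu(t_rand_scores, t_sel_scores, alternative="two-sided")
print(stat, p_value)  # small p-value => difference unlikely to be chance
```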
- TODO: research the idea of doing translation masked, meaning that as the source sentence grows, the translation starts to develop.
- I feel like it was obvious, even before seeing Section 4.5, that GPT-4 wouldn't be able to determine the best translation; but after thinking about it, part of being a rigorous researcher means claims (in this case my assumptions) should be backed up with some sort of proof.
- The end of Section 5.1 is a call for more evaluation tools suitable for low-resource languages, as well as adding low-resource language support to neural metrics like COMET, BLEURT, and BERTScore.
- TODO: research the idea of training reasoning models for low-resource language machine translation where grammar rules are passed in. Also consider passing in only a subset of rules, for the potential benefit of greater focus for the model and higher-quality results. Rough prompt-building sketch below.
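A rough sketch of the "subset of rules" part of that idea: pick only the grammar rules that look relevant to the source sentence and put those in the prompt. The rule format and the keyword-overlap heuristic are placeholders I made up, not from any paper:

```python
# Sketch: select a small subset of grammar rules relevant to the source
# sentence and build a translation prompt around them. The rule list and
# the overlap heuristic are invented placeholders.

RULES = [
    {"keywords": {"books", "houses"}, "text": "Plural nouns take the suffix -ar."},
    {"keywords": {"went", "walked"},  "text": "Past tense verbs are marked with -di."},
    # ... more rules, each with trigger words that suggest it applies
]

def select_rules(source: str, max_rules: int = 5) -> list[str]:
    """Keep the rules whose trigger keywords overlap with the source tokens."""
    tokens = set(source.lower().split())
    relevant = [r["text"] for r in RULES if r["keywords"] & tokens]
    return relevant[:max_rules]

def build_prompt(source: str) -> str:
    rules = "\n".join(f"- {r}" for r in select_rules(source))
    return (
        "You are translating into a low-resource language.\n"
        f"Relevant grammar rules:\n{rules}\n\n"
        f"Translate this sentence, reasoning step by step first:\n{source}"
    )
```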
- I'm seeing a pattern in papers of worries that LLMs have already seen the test data. TODO: research papers about detecting whether LLMs have already seen some data.
- Very nice Section 5.3 about limitations and future work. TODO: consider checking back here for future ideas.
- The rest of my time/days goes to applying to the Cohere Scholars Program until the application is submitted: https://jobs.ashbyhq.com/cohere/a77c6864-5a43-44c1-81dc-a66e23bdd9a6