Day 1 - Reading Low Resource Language (LRL) Machine Translation (MT) Related Papers
Abduselam, low resource languages
August 23rd, 2025
Low-resource language machine translation (NLP). Original idea: NMT models trained on an LLM-generated synthetic parallel corpus.
- Searched for papers based on “llm based synthetic data generation for low resource languages machine translation”.
- Thoughts about [2505.14423] Scaling Low-Resource MT via Synthetic Data Generation with LLMs:
- Found out about SynOPUS through its abstract.
- Remembered backtranslation, here is an example (code sketch after the steps):
- 50k English-Oromo parallel sentences and 1M monolingual Oromo sentences
- Train an NMT model on the 50k parallel sentences
- Translate the 1M monolingual sentences with that NMT model = 1M-sentence synthetic parallel corpus
- Train a new NMT model on the 1M synthetic + 50k real corpus
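A minimal sketch of that back-translation loop; `train_nmt` and `translate` are hypothetical stand-ins for whatever NMT toolkit would actually be used:

```python
def back_translate(real_pairs, mono_oromo, train_nmt, translate):
    """Back-translation sketch.

    real_pairs: list of (oromo, english) tuples (~50k real sentences)
    mono_oromo: list of Oromo sentences (~1M monolingual)
    train_nmt(src_sents, tgt_sents) -> model   (hypothetical trainer)
    translate(model, sentence) -> str          (hypothetical decoder)
    """
    # 1. Train a seed Oromo->English model on the small real corpus.
    seed_model = train_nmt([o for o, _ in real_pairs], [e for _, e in real_pairs])

    # 2. Translate the monolingual Oromo side to create ~1M synthetic pairs.
    synthetic_pairs = [(o, translate(seed_model, o)) for o in mono_oromo]

    # 3. Train the final model on synthetic + real data.
    #    (In standard back-translation the machine-translated side is used as
    #    the *source*, so this final model is typically English->Oromo.)
    all_pairs = synthetic_pairs + real_pairs
    return train_nmt([e for _, e in all_pairs], [o for o, _ in all_pairs])
```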
- Remembered pivot-based strategies: basically they involve a middleman language. For example, Oromo-Chinese parallel data is unlikely to exist, but English-Oromo corpora/models and English-Chinese corpora/models do exist, so English can serve as the pivot. Keep in mind there are slight differences among pivot-based strategies.
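A minimal sketch of the simplest pivot variant (pivoting at inference time); both translation functions are hypothetical stand-ins for existing systems:

```python
def pivot_translate(sentence_om, oromo_to_english, english_to_chinese):
    """Oromo -> Chinese with English as the pivot.

    oromo_to_english / english_to_chinese are hypothetical stand-ins for
    whatever existing Oromo-English and English-Chinese systems are available.
    Another common variant pivots at training time instead, by synthesizing
    an Oromo-Chinese corpus through English.
    """
    english = oromo_to_english(sentence_om)   # hop 1: source -> pivot
    return english_to_chinese(english)        # hop 2: pivot -> target
```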
- Initial thoughts: it seems like a bad idea to generate synthetic data at the "document level" since LLMs are attention-based, and document-level inputs seem to strain the NMT model's attention mechanism. It makes sense, though, as an attempt to capture more context.
- Remembered what the chrF score is: basically an evaluation metric for MT that is more forgiving than BLEU. For example, with BLEU the hypothesis "the cat is running" against the reference "the cats are running" gets penalized heavily (because "cats" != "cat" and "are" != "is"), whereas chrF looks at character-level overlap (it still rewards the shared "cat" even though the reference has "cats"). Initial thoughts: BLEU is more rigid, while chrF is more flexible and more considerate of valid variations in translations.
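The cats/cat example above, checked with the sacrebleu package (assuming its sentence_bleu / sentence_chrf helpers; exact numbers depend on the sacrebleu version and smoothing settings):

```python
import sacrebleu

hyp = "the cat is running"
ref = "the cats are running"

bleu = sacrebleu.sentence_bleu(hyp, [ref])
chrf = sacrebleu.sentence_chrf(hyp, [ref])

# BLEU only rewards exact word n-gram matches ("cat" != "cats", "is" != "are"),
# while chrF works on character n-grams, so the shared characters in
# "cat"/"cats" still earn partial credit.
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```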
- Roughly a ~1 to ~3 point chrF increase using this paper's methods; it's nice but not exactly groundbreaking, although the fine-tuned models are smaller, which is nice.
- Learned about NLLB (Facebook/Meta), which supports 200 languages; as of 2022 it was considered the best option for low-resource languages.
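A quick way to try NLLB locally through the Hugging Face transformers pipeline (a sketch; `gaz_Latn` is, I believe, the FLORES-200 code for West Central Oromo, worth double-checking):

```python
from transformers import pipeline

# NLLB-200 distilled checkpoint; language codes follow FLORES-200
# ("eng_Latn" for English, "gaz_Latn" assumed here for West Central Oromo).
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="gaz_Latn",
)

print(translator("The cats are running.", max_length=64))
```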
- The Limitations section contains super good points to watch out for in research with LLMs, for example hallucinated outputs, the risk that evaluation data was already seen by the LLMs used for creating synthetic parallel corpora, and difficulty in reproducibility. The paper acknowledged a lot of these issues; for example, they admit their work isn't reproducible, but they could have planned for this earlier, since OpenAI offers ways to make outputs more reproducible (there is also a discussion of the reproducibility issue).
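For reference, a sketch of the OpenAI-side knobs that help with reproducibility when generating synthetic data (pinning `seed` and `temperature`, and logging `system_fingerprint`); determinism is still only best-effort, and the model name is just a placeholder:

```python
from openai import OpenAI

client = OpenAI()

# Pin temperature and seed, and record system_fingerprint so a generation run
# can at least be re-attempted / audited later. Reproducibility on OpenAI's
# side is best-effort, not guaranteed.
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Translate to Oromo: The cats are running."}],
    temperature=0,
    seed=1234,
)

print(resp.choices[0].message.content)
print("system_fingerprint:", resp.system_fingerprint)
```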
- TODO: remembered Research University and I should check it out. Watch the inauguration video and see if it's a good fit for getting mentorship on this research.
- TODO: look into synthetic parallel corpus generation by augmenting existing sentences through swapping verbs/nouns/adjectives (toy sketch below).
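A toy sketch of that augmentation idea: swap a word that appears in a small bilingual word list on both sides of the pair so alignment is preserved. The word lists and Oromo glosses are made up for illustration and may not be correct Oromo:

```python
import random

# Toy swap group: (english, oromo) word pairs of the same part of speech that
# can stand in for each other. Entries are illustrative only.
NOUN_GROUP = [("cat", "adurree"), ("dog", "saree"), ("goat", "re'ee")]

def swap_augment(en_sent, om_sent, group, rng):
    """If an (en, om) pair from `group` occurs in the sentence pair, replace it
    on *both* sides with another pair from the group so alignment is kept."""
    en_toks, om_toks = en_sent.split(), om_sent.split()
    for en_w, om_w in group:
        if en_w in en_toks and om_w in om_toks:
            new_en, new_om = rng.choice([p for p in group if p != (en_w, om_w)])
            en_toks = [new_en if t == en_w else t for t in en_toks]
            om_toks = [new_om if t == om_w else t for t in om_toks]
            break  # swap one word pair per call to keep the example simple
    return " ".join(en_toks), " ".join(om_toks)

rng = random.Random(0)
print(swap_augment("the cat runs", "adurree ni figdi", NOUN_GROUP, rng))
```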
- TODO ideas to consider (came to mind while reading, could be researched):
- Fine-tune a reasoning model given a grammar book and dictionary, where the step-by-step part is passing in single sentences to be translated followed by the correct translations. This may actually involve generating a synthetic dataset of models explaining why a translation is bad or how it could be improved. It could also naturally perform better on badly spelled sentences in corpora, which is highly realistic (a rough sketch of what one such training example might look like follows this list).
- Research, in general, techniques that handle bad/noisy sentences when translating, e.g. a preprocessor model for cleaning up source sentences, or a style-embedding model that captures the styling of a sentence (informal, text-like, formal, slang) and ultimately adds it back into the final translation.
- A multimodal approach using a reasoning model + an LLM jointly trained.
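A very rough guess at what one training example for the "explain why the translation is bad, then fix it" idea could look like; every field name and the Oromo gloss here are my own invention, just to make the idea concrete:

```python
import json

# Hypothetical schema for one synthetic training example: a noisy/misspelled
# source sentence, a weak draft translation, a model-written critique, and the
# corrected reference. All fields and content are illustrative only.
example = {
    "source_om": "akam jirtu",        # informal / misspelled Oromo ("akkam jirtu?")
    "draft_en": "what you are",       # weak draft translation
    "critique": "Drops the greeting sense; 'akkam jirtu' is 'how are you (pl.)'.",
    "reference_en": "How are you all?",
    "grammar_refs": ["greetings section of the grammar book"],
    "dict_refs": ["akkam: how"],
}

print(json.dumps(example, ensure_ascii=False, indent=2))
```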
- Thoughts about From Text to Multi-Modal: Advancing Low-Resource-Language Translation through Synthetic Data Generation and Cross-Modal Alignments:
- Initial thoughts: In general, I didn't know about multimodal datasets being used for machine translation. It makes sense though, as it seems like it would add more context, but I think it'd be super slow during inference because of needing to find or generate a relevant image.
- Skimmed through the rest of the paper, to be honest, as the final results were good but underwhelming and not what I would consider leaps. Also, I was misled by the mentioned average increase of 9% in BLEU score: I read it as a 9-point increase, and the two are very different (quick arithmetic below). Overall the paper seems like a lot of extra work for mediocre results. In terms of trying to combine all the best techniques to see better performance, this seems like a viable approach that can be used. I did find more interesting/relevant papers in the references, though, which helped me understand the space more.
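Quick arithmetic on why the two readings differ so much (the baseline BLEU of 20 is just an illustrative number, not from the paper):

```python
baseline_bleu = 20.0                    # illustrative baseline

relative_gain = baseline_bleu * 1.09    # "9% increase"      -> 21.8
absolute_gain = baseline_bleu + 9.0     # "9-point increase" -> 29.0

print(relative_gain, absolute_gain)
```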
- Thoughts about Low-Resource Machine Translation through Retrieval-Augmented LLM Prompting: A Study on the Mambai Language:
- Initial thoughts: it seems unfortunate from the abstract that the paper's techniques don't perform well when translating natural spoken sentences and only perform well on textbook-style sentences. Makes it seem much less useful. I wonder if there are problems with the paper, like depending on a single speaker for translations; I feel like it would be more reasonable to check with a lot more translators/translations.
- The paper did all the extraction work on the entire language manual, but it seems they only ended up using the extracted parallel sentences for training.
- My assumption was right: only one Mambai speaker, who provided 50 sentences, which seems risky. Also, so far the paper seems like it has potential issues because it relies on scanning the document, and the quality of that scan isn't clear; the paper doesn't provide any quality analysis of the scanning, which would help us know how much error might be propagated down. It's also not completely clear whether the authors' custom alphabet used for the scanned text is the same one the single Mambai speaker used, i.e. are they aligned alphabet-wise?
- While reading this, it occurred to me how important it is to have multi-domain parallel corpora with a wide variety of text styles for low-resource language models to perform well. In one previous paper I read, I noticed how low-resource language models performed well with a huge parallel corpus even though there was a lot of noise. I think it's important to consider that many low-resource languages are at varying stages of development: some have no orthography at all, some have one but it's not standardized, some have a standardized one whose usage is in practice platform-dependent (e.g. Facebook comments vs news articles), and others are highly standardized with a wide variety of styling, like English (standard English, slang English, casual shortcut English).
- An issue I see is that the extracted parallel corpus is quite limited, since it consists of grammar-book-style sentences. The portioned-out test set does OK, but the 50 sentences provided by the native speaker cover a much wider variety.
- An issue I'm noticing with machine translation systems in general is that they seem very expensive to update, and since language changes constantly with new terminology and lingo, this is a problem; a non-native speaker of a language can generally pick up new terminology or lingo easily after hearing it just a few times.
- I actually like the idea the authors used: providing similar sentence translations, where the source is closest to the input sentence, as examples for the LLM, along with the UseDict technique, which helps boost final translations where words otherwise wouldn't get translated. I'm not so confident about UseDict, because source-language words can have a variety of meanings, which can decrease final translation quality, and target-language words with a variety of meanings would worsen the final translation even more. It's not surprising that the paper performed better with UseDict on the test sentences from the manual book (higher BLEU) and considerably worse with UseDict on the native-speaker-provided sentences (all BLEU scores were zero). I hypothesize it's because the grammar book has an outdated dictionary, the alphabet used by the native speaker differed from the scanning system's alphabet, and/or words on either side of the dictionary have multiple meanings so it's not a 1-to-1 dictionary translation. One idea that might help correct UseDict is a dictionary containing all possible meanings of a word, either through a more detailed dictionary or through some kind of embeddings.
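A rough sketch of how I picture the retrieval + dictionary prompting working: TF-IDF similarity over the source side to pick the closest example pairs, then dictionary glosses appended as hints. The prompt wording, function names, and use of character n-gram TF-IDF are my own assumptions, not the paper's exact setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_prompt(input_sent, example_pairs, dictionary, k=3):
    """example_pairs: list of (mambai, english) tuples; dictionary: word -> gloss."""
    sources = [src for src, _ in example_pairs]

    # Character n-gram TF-IDF is robust to small spelling differences.
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit(sources + [input_sent])
    sims = cosine_similarity(vec.transform([input_sent]), vec.transform(sources))[0]
    top = sims.argsort()[::-1][:k]

    # Few-shot examples whose source side is closest to the input sentence.
    shots = "\n".join(
        f"Mambai: {example_pairs[i][0]}\nEnglish: {example_pairs[i][1]}" for i in top
    )
    # "UseDict"-style hints: glosses for input words found in the dictionary.
    hints = "\n".join(f"{w}: {dictionary[w]}" for w in input_sent.split() if w in dictionary)

    return (f"Translate Mambai to English.\n\nExamples:\n{shots}\n\n"
            f"Dictionary hints:\n{hints}\n\nMambai: {input_sent}\nEnglish:")
```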
- Thoughts about Prompt Engineering Enhances Faroese MT, but Only Humans Can Tell:
- Initial thoughts: this seems very similar to the previous paper on the Mambai language, technique-wise. This paper calls it Semantic Textual Similarity (STS), whereas the previous paper didn't exactly give it a name. It seems both papers combined STS-style retrieval with some similarity search; I'm not sure whether TF-IDF or embeddings are used in this paper, but both are used in the Mambai paper.
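For the embedding flavor of that similarity search, a sketch using sentence-transformers; the checkpoint name is just a common multilingual model, not necessarily what either paper used:

```python
from sentence_transformers import SentenceTransformer, util

# A common multilingual sentence-embedding checkpoint; the papers may use a
# different model (or plain TF-IDF) for their similarity search.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

corpus_sources = ["example source sentence 1", "example source sentence 2"]
corpus_emb = model.encode(corpus_sources, convert_to_tensor=True)

query_emb = model.encode("new sentence to translate", convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]

best = scores.argmax().item()
print(corpus_sources[best], float(scores[best]))
```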