Day 6 - Finishing Multilingual Paper Review (Cohere's Research Scholar Program)
Abduselam, Cohere application
August 28th, 2025
- Gonna try reading When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs again from section 2:
- Trying to understand BoN: [NAACL 2025] Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for LLM Alignment
- Trying to understand MBR: [ICML 2024] Model-Based Minimum Bayes-Risk Decoding for Text Generation
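To keep the two selection rules straight in my head, a toy contrast (my own simplification with made-up scoring, not the papers' exact formulations): BoN keeps whichever single sample a reward/judge scores highest, while MBR keeps the sample with the highest expected utility against the other samples, i.e. the consensus pick.

```python
# Toy contrast between Best-of-N (BoN) and Minimum Bayes Risk (MBR) selection.
# The reward and utility functions here are stand-ins I made up, not the ones
# used in the papers.

def toy_reward(candidate: str) -> float:
    # Stand-in for a reward model / LLM judge score.
    return len(set(candidate.split()))

def toy_utility(hyp: str, ref: str) -> float:
    # Stand-in for a real utility like BLEU/COMET: plain token overlap.
    a, b = set(hyp.split()), set(ref.split())
    return len(a & b) / max(len(a | b), 1)

def best_of_n(candidates):
    # BoN: keep whichever single sample the scorer ranks highest.
    return max(candidates, key=toy_reward)

def mbr(candidates):
    # MBR: keep the sample with the highest average utility against
    # all other samples (a consensus / "least risky" choice).
    def expected_utility(hyp):
        others = [c for c in candidates if c is not hyp]
        return sum(toy_utility(hyp, ref) for ref in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

samples = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "the feline reclined upon the rug quietly today",
]
print("BoN pick:", best_of_n(samples))  # rewards the outlier here
print("MBR pick:", mbr(samples))        # prefers the sample the others agree with
```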
- Why do the authors consider GPT-4o the standard judge for m-ArenaHard? Seems risky, as I've seen issues in other papers caused by relying on closed-source models: versioning isn't controlled, and the model can be deprecated. What happens to reproducing this paper once GPT-4o becomes totally unavailable? Or what if GPT-4o had already seen the data, tainting it as a judge? I looked at the reference this paper cites to claim GPT-4o as the standard judge; it's from Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier, arXiv:2412.04261v1 [cs.CL], 5 Dec 2024. There's a typo of "LLML-judge" in that paper, but besides that, here are some potential issues I see from the Aya Expanse paper and the limitations mentioned in its referenced papers regarding using an LLM as a judge:
- The paper mentions a clear bias that GPT-4o-mini has in comparison to GPT-4o in the "Difference between GPT-4o and GPT-4o-mini" section. The authors opt for the more conservative results by using GPT-4o as the judge, although the bar for GPT-4o yielding more conservative results isn't clear (I skimmed the paper, so I may have missed a point that addresses this). Either way, preferring 4o over 4o-mini felt like a band-aid for mitigating biases, since there may be further underlying biases, as highlighted in the referenced papers below.
- In the referenced paper [2305.18290] Direct Preference Optimization: Your Language Model is Secretly a Reward Model, it isn't mentioned whether the generalizations apply to a multilingual setting; that paper's data is English-centric.
- In the referenced paper AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback, all models use LLaMA 7B, which is open-sourced, but its training data isn't, leaving room to speculate about tainted training data. That paper also mentions that the model tends to prefer the first output, demonstrating order bias (quick check sketched after this list), and it says there may be other biases that warrant further research.
- In the referenced paper Prometheus: Inducing Fine-grained Evaluation Capability in Language Models, there is potential for length bias, where models acting as judges prefer longer responses.
- In the main referenced paper Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model, some languages were left out during evaluation because their performance isn't reported in the GPT-4o paper (e.g. Yoruba and Portuguese). This paper also depends on the papers mentioned above, and those papers make no claims about whether their findings generalize to multilingual tasks. Also, this paper has some trust-me-bro points, specifically in the "Human Label Variation" section in the appendix, where it's said that each example was handled by one human annotator who is trusted by Cohere. They mention mitigations, but after checking the "Quality Considerations" section to learn about these annotators and the collection process as a whole, I was a tad surprised: the efforts to ensure quality data can be summarized as emphasizing global scale and impact to annotators, which in my opinion is not enough. In my experience at DataAnnotation, many issues can arise when collecting human annotations on their own, for example AI-generated responses or annotators who provide poor-quality annotations that have to be reviewed and cleaned up, to name just a few. I'm not saying we know these annotators to be untrustworthy or the data to be bad quality; it's just that being "trusted by Cohere" is sort of a trust-me-bro claim since it isn't fully substantiated, and the quality considerations aren't sufficient. I also ended up skimming The "Problem" of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation since it was mentioned; super relevant to data collection as a whole.
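Since order bias keeps coming up, this is the kind of quick check I'd want to run on any pairwise LLM judge before trusting it: ask twice with the answer positions swapped and count how often the verdict flips. The judge below is a deliberately biased fake so the snippet runs on its own; in practice it would be an API call.

```python
# Quick sanity check for order (position) bias in a pairwise LLM judge:
# query twice with the answers swapped and see how often the verdict flips.
# `fake_judge` is a deliberately biased stand-in for an actual LLM call.

import random

def fake_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    # Stand-in judge that prefers whichever answer is shown first 70% of the
    # time, simulating the order bias described in the AlpacaFarm paper.
    return "A" if random.random() < 0.7 else "B"

def position_swapped_verdict(prompt, ans1, ans2):
    first = fake_judge(prompt, ans1, ans2)   # ans1 shown first
    second = fake_judge(prompt, ans2, ans1)  # ans2 shown first
    # Map both verdicts back to ans1/ans2 and check for agreement.
    pick_first = ans1 if first == "A" else ans2
    pick_second = ans2 if second == "A" else ans1
    if pick_first == pick_second:
        return pick_first        # consistent verdict
    return None                  # flips under swapping -> order-bias symptom

random.seed(0)
pairs = [("prompt", "answer one", "answer two")] * 200
flips = sum(position_swapped_verdict(*p) is None for p in pairs)
print(f"verdict flipped on {flips}/{len(pairs)} pairs")
```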
- TODO: an idea came up that should be researched: during selection, instead of purely selecting one single response, select snippets from within each sample using a combination of hedging + MBR. Basically a separate "manager" model that can recognize extremities and use the most agreed-upon thoughts (via MBR), while also borrowing parts of the best performers' thoughts (sketched below). Honestly, that might just be what this paper is proposing and I'm just being slow, oof.
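Very rough sketch of the snippet-selection idea so I don't forget it; everything here is my own toy construction (naive sentence splitting, token overlap as the agreement measure), not anything from the paper:

```python
# Toy "manager" selection: instead of keeping one whole sample, score every
# sentence by MBR-style consensus across samples and stitch the winners.
# A real version would use a proper utility (chrF/COMET/embedding similarity)
# and a smarter segmenter.

def agreement(snippet: str, sample: str) -> float:
    # Stand-in agreement measure: fraction of snippet tokens found in the sample.
    a, b = set(snippet.lower().split()), set(sample.lower().split())
    return len(a & b) / max(len(a), 1)

def consensus_snippets(samples, top_k=2):
    scored = []
    for i, sample in enumerate(samples):
        for sentence in sample.split(". "):
            # Consensus = average agreement with all *other* samples.
            others = [s for j, s in enumerate(samples) if j != i]
            score = sum(agreement(sentence, o) for o in others) / max(len(others), 1)
            scored.append((score, sentence.strip(". ")))
    scored.sort(reverse=True)
    # Stitch the most agreed-upon sentences into one output.
    return ". ".join(s for _, s in scored[:top_k]) + "."

samples = [
    "The capital is Addis Ababa. It is in the highlands.",
    "Addis Ababa is the capital. It sits at high elevation.",
    "The capital city is Addis Ababa. Many people live there.",
]
print(consensus_snippets(samples))
```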
- It’d be interesting to explore causes for “Win rate improvements over greedy are not developing smoothly across languages”. I hypothesize it may have to do with languages being lower-resourced in comparison to English, cultural nuances, human labeling variation, or poor human data collection leading to lower-quality data.
- Single-temperature sampling generally seems to worsen performance, but I think I'm not reading/understanding Figure 4 correctly. TODO: go over it and try to understand it better (toy sketch of what I think it's comparing below).
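To pin down what I think the single-temperature knob means mechanically before re-reading Figure 4: all N candidates drawn at one fixed temperature, versus the same N spread across a range. The generate() call below is a placeholder, not a real API.

```python
# Placeholder sketch of the two sampling setups I think Figure 4 contrasts:
# N candidates drawn at one fixed temperature vs. spread across several.
# `generate` is a stub standing in for a real model call.

def generate(prompt: str, temperature: float) -> str:
    return f"<sample for {prompt!r} at T={temperature:.1f}>"  # stub output

def single_temperature(prompt, n=8, t=0.7):
    # Every candidate comes from the same temperature.
    return [generate(prompt, t) for _ in range(n)]

def mixed_temperatures(prompt, n=8, t_min=0.3, t_max=1.1):
    # Candidates are spread evenly across a temperature range.
    step = (t_max - t_min) / (n - 1)
    return [generate(prompt, t_min + i * step) for i in range(n)]

print(single_temperature("translate to Amharic: hello"))
print(mixed_temperatures("translate to Amharic: hello"))
```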
- Not surprised by the use of an LLM as a judge for BoN absolute scoring; LLMs don't seem great at judging out of the box, imo. That's just based on personal experience and knowing they hallucinate with extreme confidence.
- TODO: research MBR more for low-resource languages.
- Thoughts about the CHOPS technique: do the authors explore bias toward earlier outputs in the prompt being preferred?
- Interested to explore whether reasoning models would be good judges.
- Interested to know how well X-MBR fares for open generation tasks that involve multiple languages in a single prompt.
- Having a hard time understanding Table 3 and Table 4, since RM BoN seems to have a higher delta a number of times.
- Interested to explore current capabilities of applying the recipe approaches in this paper to models like NLLB, MADLAD-400, or mBert-XL on the FLORES dataset. Update: nvm, it seems this paper is geared more towards open tasks rather than verifiable tasks. I just misunderstood “open tasks” to mean anything that requires generation, but it actually means tasks without a structured or verifiable answer, unlike the math and translation tasks.
- Interested to apply CHOPS + X-MBR to create a synthetic parallel corpus, combined with the technique from A Benchmark for Learning to Translate a New Language from One Grammar Book of including a grammar/dictionary book in context as well, then ultimately training an NMT model (rough pipeline sketch below).
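Rough outline of the pipeline I'm imagining, with entirely hypothetical stubs (the real CHOPS/X-MBR details would come from the papers): for each source sentence, prompt with a grammar-book excerpt in context, sample several candidate translations, pick one by MBR-style consensus, and collect the pairs for NMT training.

```python
# Hypothetical outline of the synthetic-parallel-corpus idea: grammar book in
# context -> sample candidate translations -> consensus (MBR-style) pick ->
# accumulate (source, translation) pairs. All functions are stubs/placeholders.

def translate_with_book(source: str, book_excerpt: str, n: int = 5) -> list[str]:
    # Stub: in practice, prompt an LLM with the grammar/dictionary excerpt
    # plus the source sentence and sample n candidate translations.
    return [f"<candidate {i} for {source!r}>" for i in range(n)]

def overlap(a: str, b: str) -> float:
    # Stand-in utility: token overlap between two candidates.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / max(len(sa | sb), 1)

def consensus_pick(candidates: list[str]) -> str:
    # MBR-style pick: highest average overlap with the other candidates.
    return max(
        candidates,
        key=lambda h: sum(overlap(h, r) for r in candidates if r is not h),
    )

def build_corpus(sources: list[str], book_excerpt: str) -> list[tuple[str, str]]:
    corpus = []
    for src in sources:
        candidates = translate_with_book(src, book_excerpt)
        corpus.append((src, consensus_pick(candidates)))
    return corpus

pairs = build_corpus(["hello", "thank you"], book_excerpt="<grammar book text>")
print(pairs)  # synthetic (source, translation) pairs for NMT training
```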
- Interested in multiple judges that can sorta debate, with some judges more specialized than others. This is inspired by a paper which showed specialized debating judges reduced overconfidence bias: AlphaAgents: Large Language Model based Multi-Agents for Equity Portfolio Constructions.
- Found this paper which has substantial extrapolations: [2409.00626] Correcting FLORES Evaluation Dataset for Four African Languages. TODO: read this paper and see if there is a connection between it and using in-house translation models to generate the datasets mentioned in the paper above. My experience at DataAnnotation is relevant to this paper.