Day 5 - Starting Review of Paper (Cohere's Research Scholar Program)
Abduselam, Cohere application
August 27th, 2025
Today I'm gonna mostly focus on the Cohere application.
- Reading When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs (pulled from the Cohere application):
- Initial thoughts: checked out the test dataset m-ArenaHard-v2.0 and it has me on edge because it's machine translated (although the claim is that the translations come from SoTA in-house models). The abstract establishes their techniques by citing improvements on that dataset (I guess it counts as synthetic, right?), but I'm curious and hopeful that they mention limitations (like compounding effects from relying on machine-translated data) or evidence that it's acceptable (i.e., established, known effects of doing this, so we understand its impact on measured performance). It'd be nice to see at least stats on the in-house MT models' performance, or evaluation results on a fully external test dataset.
- Didn't know about the Command A model, but I'm surprised it's on par with GPT-4o and DeepSeek-V3 while being faster. Costs seem kinda expensive, but I don't know the competitor prices off the top of my head tbh.
- This abstract gives me the idea of scaling up inference for other kinds of tasks, specifically MT. So, increasing the number of MT samples (probably need to vary the seed or temperature parameter so different outputs get generated). TODO: research inference-time compute scaling by sampling as a means of increasing performance for NMT systems, and also look for papers doing something similar but for LLMs instead of NMT.
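The idea above can be sketched as a loop: a minimal sketch assuming a hypothetical `translate` stub (a real system would pass the seed and temperature through to the model's decoder).

```python
import random

def translate(src: str, seed: int, temperature: float) -> str:
    """Hypothetical stub standing in for one sampled NMT decode.
    A real system would pass seed/temperature to the model; here
    we just draw from a fixed toy pool to mimic sampling variance."""
    rng = random.Random(seed)
    pool = ["hola mundo", "hola, mundo", "hola al mundo"]
    return rng.choice(pool)

def sample_translations(src: str, n: int, base_temp: float = 0.7) -> list[str]:
    # Vary both the seed and the temperature per call so repeated
    # decodes can land on different candidate translations.
    return [translate(src, seed=i, temperature=base_temp + 0.05 * i)
            for i in range(n)]

samples = sample_translations("hello world", n=8)
```

The point is just that parallel sampling is cheap to orchestrate; the hard part (selecting among the samples) comes later.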
- When reading about this greedy vs. sampling decoding stuff, I wondered if another model could be trained purely for this stage of decoding (I'm remembering the NEAT model). Something like a much, much wider amount of sampling; anyway, this is a TODO now to research. Update: Ctrl+F for "Budget Size for Parallel Scaling" to find relevant references about using up to thousands of samples.
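For my own reference, the greedy vs. sampling distinction at the token level is small enough to sketch directly (function names here are my own, not from the paper):

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Scale logits by temperature, then normalize to probabilities.
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    # Greedy decoding: deterministically take the argmax token.
    return max(range(len(logits)), key=lambda i: logits[i])

def sampled_pick(logits, temperature=1.0, rng=None):
    # Sampling: draw a token index from the softmax distribution.
    # Higher temperature flattens it, producing more diverse picks.
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Greedy always returns the same token, so parallel scaling only makes sense with the sampled version.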
- TODO: research Best-of-N scoring, Minimum Bayes Risk, "specialized reward model", min-p.
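Jotting down my current understanding of Minimum Bayes Risk before I read more: pick the candidate with the highest expected utility against the other samples, used as pseudo-references. A minimal sketch, assuming a toy unigram-F1 utility (a crude stand-in for whatever real utility, e.g. chrF or BLEU, an MT setup would use):

```python
def unigram_f1(hyp: str, ref: str) -> float:
    # Toy utility: unigram F1 overlap between hypothesis and reference.
    h, r = hyp.split(), ref.split()
    common = len(set(h) & set(r))
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)

def mbr_select(samples: list[str]) -> str:
    # Minimum Bayes Risk: choose the candidate with the highest
    # expected utility, scoring each candidate against all the
    # OTHER samples as pseudo-references (no gold reference needed).
    def expected_utility(i: int) -> float:
        others = [s for j, s in enumerate(samples) if j != i]
        return sum(unigram_f1(samples[i], ref) for ref in others) / len(others)
    return samples[max(range(len(samples)), key=expected_utility)]
```

Intuition: the candidate that agrees most with the rest of the pool wins, which is why MBR rewards "consensus" translations.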
- Table 1 was sorta hard to read/understand.
- Using any LLM as a judge (I understand it's the standard for Arena) just doesn't sit right with me. LLMs themselves have their own issues, so how can one be considered a judge and be taken seriously in a paper? Maybe it's already been proven. Update: lol, it's been proven.
- Didn't really understand why they didn't do other languages -> English. The paper said it only evaluates English -> other languages due to "complex target language generations", but that didn't really make sense to me.
- Interesting: in the referenced paper Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, it seems that sampling large amounts follows scaling laws, so basically steep performance improvements over the first few samples and then diminishing returns after. The paper has some nice graphs.
- TODO: find out what "utility metric" and "evaluation metric" mean.
- OK, it seems now that sampling goes hand in hand with selection, since you eventually need a reward model that "selects" the best response. The paper mentions an in-house multilingual reward model; I hope it specifies details about it later, same as I hope for the in-house translation model.
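As I understand it, this is just Best-of-N: score every sample with the reward model and keep the argmax. A minimal sketch with a stub reward (response length as a toy proxy; the paper's actual reward model is a learned multilingual one, not this):

```python
def reward(prompt: str, response: str) -> float:
    # Stub reward model: in the paper this would be a learned
    # multilingual reward model; here, token count is a toy proxy.
    return float(len(response.split()))

def best_of_n(prompt: str, samples: list[str]) -> str:
    # Best-of-N selection: score every sampled response with the
    # reward model and return the highest-scoring one.
    return max(samples, key=lambda s: reward(prompt, s))
```

Unlike MBR, this needs no pairwise comparisons between samples, only N reward-model calls, but its quality is capped by how good the reward model is.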
- Small typo where "and" is written twice: at the end of page 10 and in the third paragraph of page 2. Gonna end the day by reading from the conclusion to keep interest, then tomorrow start up again from section 2.
- I think I misunderstood "multilingual" to be inclusive of low-resource languages; based on the limitations section, the paper's languages are all high-resource. This sorta disappointed me, as I was looking to see whether this paper's generalizations might apply to low-resource languages.