Day 14 - Take Home Challenge Stuff
Abduselam, application
September 11th, 2025:
- Wow, I forgot to take notes yesterday; I was basically trying to understand MoE at a deep level, along with rotary embeddings. Here's kinda the gist:
- RoPE, or rotary positional embeddings, is basically a better way to track the positions of tokens. Some history: the most basic way to track position is called "absolute", which (most of the time) means adding a deterministic "position" embedding built out of sines and cosines. Each pair of embedding dimensions gets its own frequency, and by frequency I mean the inverse of the period, so low frequency means a large period; the later dimensions get the low frequencies. This is nice and fast, but it still misses important information like the relative position between one token and another. That's a whole issue with multiple other attempts at solving it, but what I care about for now is the rotary version: instead of adding a position vector, you treat each pair of query/key dimensions as a point in polar space and rotate it by an angle that grows with the token's position, using the same sine/cosine frequency scheme as the absolute embeddings (later dimension pairs rotate more slowly). Since queries and keys get rotated the same way, their dot product ends up depending only on how far apart the two tokens are, which is exactly the relative-position information. (There's a small RoPE code sketch at the bottom of these notes.)
- Also looked into the MoE architecture and learned how it basically replaces the feed-forward part of a transformer block with a new kind of layer, the MoE layer. In this layer we have a specified number of "experts", each of which is good at certain kinds of work; you wouldn't waste a painter's time on math work or give a math person painting work. So you need someone who knows which expert to pick each time, in a balanced sort of way, kind of like a manager splitting work among employees based on a bunch of factors. The technical jargon for the manager is "router" (sometimes "gate"), and router makes more sense to me since you are routing the work to a specific worker.
- The standard and most common implementation actually picks the top-k experts, the idea being that those are the k experts whose expertise would most benefit the assignment (the assignment being the token passing through). There's some more math for deciding how much of each chosen expert's output to actually use, and once you've decided the weights, you just use them: scale each expert's output and sum everything up. (See the MoE code sketch at the bottom of these notes.)
- There's also something super related called expert importance, or in ML jargon, expert utilization, which just means how much each expert actually gets used. Preferably experts are trained evenly, which means they end up being used evenly; this is a form of load balancing, and the extra loss term for it is literally called the load-balancing loss. Those two things (expert utilization and the load-balancing loss) help us track whether the "load" (the tokens) is being split evenly across all the experts.
- Jargon aside, MoE is NOT the same as multimodal. I have to make that clear because the "experts" part confused me at first (it feels like an expert should mean a different modality, but it doesn't; there actually are multimodal mixture-of-experts models too, so you have to keep track of the multis, someone's gotta make a multilingual multimodal mixture of experts soon).
- Some relevant links/videos about RoPE and MoE:
- [2101.03961] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- RoFormer: Enhanced Transformer with Rotary Position Embedding
- Practice: Implementing Custom MoE Gating
- https://huggingface.co/docs/transformers/en/model_doc/jetmoe
- Mixtral of experts | Mistral AI
- Understanding Mixture of Experts (MoE) | by Shashank Sane | Medium
- Mixture of experts - Wikipedia
- https://github.com/junfanz1/MoE-Mixture-of-Experts-in-PyTorch/blob/main/MoE.py
- https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L73
- Sparse vs Dense MoE : r/LocalLLaMA
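- To make the rotation idea concrete, here's a minimal RoPE sketch assuming PyTorch. The function names (rope_angles, apply_rope) and the shapes are my own illustrative choices; it pairs up consecutive dimensions like the RoFormer paper does, while real implementations (e.g. the llama code linked above) organize the pairs a bit differently.

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0):
    # One frequency per 2-D pair of dimensions: early pairs rotate fast (short
    # period), later pairs rotate slowly (long period), like the sinusoidal scheme.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)        # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: (batch, seq_len, head_dim); treat consecutive dims as (x1, x2) pairs
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Plain 2-D rotation of each pair by its position-dependent angle
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

# Rotate queries and keys the same way before the attention dot product, so
# q·k ends up depending only on the relative offset between the two positions.
q, k = torch.randn(1, 8, 64), torch.randn(1, 8, 64)  # (batch, seq_len, head_dim)
cos, sin = rope_angles(seq_len=8, head_dim=64)
scores = apply_rope(q, cos, sin) @ apply_rope(k, cos, sin).transpose(-2, -1)
```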
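- And here's a minimal sparse-MoE layer sketch, also assuming PyTorch, just to pin down the router → top-k → weighted-sum flow plus the load-balancing loss. The names (SimpleMoE, num_experts, top_k) are illustrative, the routing loop is written for clarity rather than speed, and the auxiliary loss follows the Switch Transformers style (fraction of tokens per expert times mean router probability per expert); the real implementations linked above differ in the details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.router = nn.Linear(d_model, num_experts)      # the "manager" / gate
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: (num_tokens, d_model)
        logits = self.router(x)                            # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)  # renormalize weights

        # Route each token to its chosen experts and sum the weighted outputs
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(self.num_experts):
                mask = topk_idx[:, slot] == e              # tokens sent to expert e
                if mask.any():
                    out[mask] += topk_probs[mask, slot:slot + 1] * self.experts[e](x[mask])

        # Load-balancing auxiliary loss (Switch-style, using the top-1 assignment):
        # fraction of tokens per expert * mean router probability per expert.
        frac_tokens = F.one_hot(topk_idx[:, 0], self.num_experts).float().mean(0)
        mean_probs = probs.mean(0)
        aux_loss = self.num_experts * (frac_tokens * mean_probs).sum()
        return out, aux_loss

tokens = torch.randn(10, 64)
y, aux = SimpleMoE()(tokens)   # add aux (times a small coefficient) to the main loss
```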