Day 15 - Take Home Challenge Stuff
September 12th, 2025:
- Thoughts reading [2212.05055] Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
- Initial thoughts from the abstract: what does "sunk cost" mean here? I guess something like the money already spent pretraining the dense model. Update: OK, yeah, I was right.
- What the heck is a ZFLOP? It's a zettaFLOP, i.e. 10^21 floating-point operations. I get the point that it's a huge amount of computation and it's been increasing rapidly, but I want to understand what that means in terms of cost, or its magnitude in comparison to something. Update: it's a lot of electricity and money, on the order of millions to billions of dollars. See the back-of-the-envelope sketch after this list.
- TODO: what is the SuperGLUE metric mentioned? Update: it's a benchmark that covers a suite of hard language-understanding tasks.
- Random thought remembering the KV cache: it saves time by caching the key and value vectors of already-processed tokens during decoding, so we don't have to recompute them at every step. See the sketch after this list.
- TODO: need to look into what Expert Choice routing is.
- Figure 1 makes it seem like the initialization process is actually pretty easy: just copy all the weights over, except for the MLP, which we duplicate E times (# of experts) and place as the experts within our MoE layer.
- OK, pretty simple actually: just copy over all the weights and duplicate the FFN part as each expert in every MoE layer. Sketch below.
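
On the ZFLOP question above, a quick back-of-the-envelope to make the magnitude concrete. The GPU throughput number is my own assumption (roughly an A100's peak BF16 rate), not something from the paper:

```python
# Back-of-the-envelope: how long does 1 ZFLOP (10^21 FLOPs) take on one GPU?
# Assumes ~312 TFLOP/s peak BF16 throughput (roughly an NVIDIA A100);
# real utilization is lower, so treat this as an optimistic lower bound.
ZFLOP = 1e21                  # 1 zettaFLOP = 10^21 floating-point operations
peak_flops_per_sec = 312e12   # assumed peak throughput of a single GPU

seconds = ZFLOP / peak_flops_per_sec
days = seconds / 86_400
print(f"{seconds:.2e} s  ~=  {days:.0f} days on one GPU at peak")
# -> roughly 3.2e+06 s, about 37 days of nonstop compute for a single ZFLOP
```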
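On the KV cache note, a minimal toy sketch of the idea (my own code, single attention head, NumPy): at each decoding step we compute K/V only for the new token and append them to the cache, instead of recomputing them over the whole prefix.

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, cache):
    """One autoregressive decoding step with a KV cache.

    x_new: (d,) embedding of the newest token only.
    cache: dict holding K and V for all previous tokens, each shape (t, d).
    Without the cache we would recompute K = X @ W_k and V = X @ W_v
    over the entire prefix X at every single step.
    """
    q = x_new @ W_q                            # query for the new token
    k = x_new @ W_k                            # key/value for the new token only
    v = x_new @ W_v
    cache["K"] = np.vstack([cache["K"], k])    # append instead of recompute
    cache["V"] = np.vstack([cache["V"], v])
    scores = cache["K"] @ q / np.sqrt(len(q))  # attend over all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["V"]                # attention output for new token

# toy usage: decode 5 tokens with random weights
d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}
for _ in range(5):
    out = decode_step(rng.normal(size=d), W_q, W_k, W_v, cache)
print(cache["K"].shape)                        # (5, 8): one cached K row per token
```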
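And for the upcycling initialization, a minimal PyTorch sketch of how I understand it (my own toy code; the function name and shapes are mine, not the paper's implementation). All the other weights are copied one-to-one; only the FFN gets duplicated into the experts, plus a freshly initialized router:

```python
import copy
import torch
import torch.nn as nn

def upcycle_ffn(dense_ffn: nn.Module, d_model: int, num_experts: int):
    """Turn one dense FFN into an MoE layer: every expert starts as an
    exact copy of the dense FFN's weights; only the router is new."""
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    router = nn.Linear(d_model, num_experts)  # no dense counterpart, fresh init
    return experts, router

# toy usage: upcycle one transformer block's FFN into E=4 experts
d_model, d_ff, E = 16, 64, 4
dense_ffn = nn.Sequential(
    nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
)
experts, router = upcycle_ffn(dense_ffn, d_model, E)

x = torch.randn(d_model)
# at init, every expert computes exactly what the dense FFN did,
# since the weights were copied; training then differentiates them
assert all(torch.allclose(e(x), dense_ffn(x)) for e in experts)
```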