In Clippers next week, I will present some early-stage planning for a mixture-of-experts (MoE) language model project I hope to pursue. The presentation will cover:
1. A literature review of neural MoE models in NLP
2. How MoE models changed my thinking about model parallelism, FLOPs, and compute efficiency (see the code sketch just after this list)
3. What this implies about GPT-4 (which is rumored to be an MoE model)
4. Soft MoE: a recent paper that aims to solve many of the problems with MoE models, but applies the approach only to vision (see the second sketch at the end of this post)
5. Ideas I have on how to apply soft MoE to language modeling
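To make the point in #2 concrete, here is a minimal sketch of a top-k sparse MoE feed-forward layer. It is an illustrative toy of my own, not any specific published architecture, and all names and sizes are placeholder assumptions. The thing to notice is that total parameters grow with `num_experts`, but each token is processed by only `k` experts, so per-token FLOPs stay roughly constant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFFN(nn.Module):
    """Toy top-k sparse MoE feed-forward layer (illustrative only)."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # token -> expert scores
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)    # (num_tokens, num_experts)
        weights, idx = gate.topk(self.k, dim=-1)    # keep only the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # no tokens routed to this expert
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

With these placeholder sizes, the layer holds roughly 8x the FFN parameters of a dense block while each token does only about 2x the FFN compute. That decoupling of parameter count from per-token FLOPs is what I mean in #2.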
I hope that #1 and #2 will be valuable to everyone, because I think MoE models are very under-utilized in research, despite supposedly powering the best language model in the world (GPT-4).
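For #4, here is my rough reading of the Soft MoE dispatch/combine step as a code sketch. This is my own interpretation and simplification, not the paper's implementation, and the shapes and names are assumptions: each expert "slot" is a weighted average of all input tokens, and each token's output is a weighted average of all slot outputs, so routing is fully differentiable and no tokens are dropped.

```python
import torch
import torch.nn as nn


class SoftMoELayer(nn.Module):
    """Sketch of Soft MoE dispatch/combine (my reading, not the paper's code)."""

    def __init__(self, d_model=512, num_experts=8, slots_per_expert=1):
        super().__init__()
        self.num_experts = num_experts
        self.slots_per_expert = slots_per_expert
        n_slots = num_experts * slots_per_expert
        # One learned embedding per slot; logits = token embeddings . slot embeddings
        self.phi = nn.Parameter(torch.randn(d_model, n_slots) * d_model ** -0.5)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):                 # x: (batch, tokens, d_model)
        logits = x @ self.phi             # (batch, tokens, n_slots)
        dispatch = logits.softmax(dim=1)  # each slot: weighted average over all tokens
        combine = logits.softmax(dim=-1)  # each token: weighted average over all slots
        slots = dispatch.transpose(1, 2) @ x              # (batch, n_slots, d_model)
        slots = slots.reshape(x.size(0), self.num_experts, self.slots_per_expert, -1)
        expert_out = torch.stack(
            [self.experts[e](slots[:, e]) for e in range(self.num_experts)], dim=1
        ).flatten(1, 2)                                    # (batch, n_slots, d_model)
        return combine @ expert_out                        # (batch, tokens, d_model)
```

Because there is no hard top-k routing, there is no token dropping and no need for an auxiliary load-balancing loss, which is a large part of what the paper means by solving the usual MoE problems. The sketch also makes visible why the transfer to language modeling in #5 is not automatic: the dispatch weights mix information across all tokens in the sequence, which has to be reconciled with causal, autoregressive decoding.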