
# OLMoE

## Overview

The OLMoE model was proposed in *OLMoE: Open Mixture-of-Experts Language Models* by Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi.

OLMoE is a series of Open Language Models built on a sparse Mixture-of-Experts architecture, designed to enable the science of language models. We release all code, checkpoints, logs, and details involved in training these models.

The abstract from the paper is the following:

*We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.*
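The sparse activation described in the abstract (7B total parameters, about 1B used per input token) comes from routing each token to only a small number of experts in every MoE layer. The snippet below is a generic top-k routing sketch in PyTorch, not OLMoE's actual implementation; the expert count, hidden size, and `k` are toy placeholders.

```python
import torch
import torch.nn as nn

def topk_moe_forward(hidden, experts, router, k=8):
    """Route each token to its top-k experts and mix their weighted outputs."""
    logits = router(hidden)                             # (num_tokens, num_experts)
    weights = torch.softmax(logits, dim=-1)
    topk_weights, topk_ids = weights.topk(k, dim=-1)    # keep only k experts per token
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(hidden)
    for token, (ids, w) in enumerate(zip(topk_ids, topk_weights)):
        for expert_id, weight in zip(ids.tolist(), w):
            # Only the selected experts run for this token; the rest stay idle.
            out[token] += weight * experts[expert_id](hidden[token])
    return out

# Toy sizes: 64 tiny experts, 8 active per token.
experts = nn.ModuleList(nn.Linear(16, 16) for _ in range(64))
router = nn.Linear(16, 64)
tokens = torch.randn(4, 16)
print(topk_moe_forward(tokens, experts, router).shape)  # torch.Size([4, 16])
```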

This model was contributed by Muennighoff. The original code can be found in the [allenai/OLMoE](https://github.com/allenai/OLMoE) repository.
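A minimal generation sketch using the classes documented below. The checkpoint name `allenai/OLMoE-1B-7B-0924` is an assumption here; substitute whichever OLMoE checkpoint you actually intend to use.

```python
import torch
from transformers import AutoTokenizer, OlmoeForCausalLM

# Assumed checkpoint name; replace if it differs.
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")
model = OlmoeForCausalLM.from_pretrained(
    "allenai/OLMoE-1B-7B-0924",
    torch_dtype=torch.bfloat16,  # half precision keeps memory usage manageable
    device_map="auto",
)

inputs = tokenizer("Bitcoin is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```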

## OlmoeConfig

[[autodoc]] OlmoeConfig
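A short sketch of building a randomly initialized model from a configuration, following the usual Transformers config pattern; the MoE-related fields hinted at in the comment are assumptions.

```python
from transformers import OlmoeConfig, OlmoeModel

# Default configuration; individual fields (e.g. the number of experts or the
# number of experts activated per token) can be overridden via keyword arguments.
config = OlmoeConfig()

# Randomly initialized model built from that configuration.
model = OlmoeModel(config)
print(model.config)
```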

## OlmoeModel

[[autodoc]] OlmoeModel
    - forward
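A forward-pass sketch for the base model, which returns hidden states rather than language-modeling logits. The checkpoint name is the same assumption as above.

```python
import torch
from transformers import AutoTokenizer, OlmoeModel

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")
model = OlmoeModel.from_pretrained("allenai/OLMoE-1B-7B-0924", torch_dtype=torch.bfloat16)

inputs = tokenizer("Mixture-of-Experts models route each token to a few experts.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```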

## OlmoeForCausalLM

[[autodoc]] OlmoeForCausalLM
    - forward
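A sketch of calling `forward` directly to inspect next-token logits instead of using `generate`; the checkpoint name is again an assumption.

```python
import torch
from transformers import AutoTokenizer, OlmoeForCausalLM

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")
model = OlmoeForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924", torch_dtype=torch.bfloat16)

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch_size, sequence_length, vocab_size)

# Greedy pick of the most likely next token.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode([next_token_id]))
```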