mirror of https://github.com/huggingface/transformers.git synced 2025-07-03 12:50:06 +06:00

Quentin Gallouédec de24fb63ed

2025-06-13 11:07:09 +00:00

JetMoe

Overview

JetMoe-8B is an 8B Mixture-of-Experts (MoE) language model developed by Yikang Shen and MyShell. The JetMoe project aims to provide an efficient language model with LLaMA2-level performance on a limited budget. To achieve this goal, JetMoe uses a sparsely activated architecture inspired by ModuleFormer. Each JetMoe block consists of two MoE layers: Mixture of Attention Heads and Mixture of MLP Experts. For each input token, a layer activates only a subset of its experts to process it. This sparse activation scheme enables JetMoe to achieve much better training throughput than dense models of similar size. The training throughput of JetMoe-8B is around 100B tokens per day on a cluster of 96 H100 GPUs with a straightforward 3-way pipeline parallelism strategy.
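
The sparse activation described above can be illustrated with a small, framework-free sketch of top-k expert routing. This is not the JetMoe implementation: the function names (`top_k_route`, `moe_forward`), the toy experts, the hand-written router scores, and the choice to apply the softmax after selecting the top-k experts are all assumptions made for illustration. In a real model the router scores come from a learned layer, and the same idea is applied to attention heads as well as MLP experts.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and normalise their weights.

    Returns a list of (expert_index, weight) pairs. Applying the softmax
    after the top-k selection is an assumption of this sketch.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(token, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    routes = top_k_route(router_logits, k)
    output = [0.0] * len(token)
    for index, weight in routes:
        expert_out = experts[index](token)  # experts not selected are never called
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return output, routes


# Hypothetical toy experts: each one just scales the input vector.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]

token = [1.0, -1.0]
router_logits = [0.1, 2.0, 0.3, 1.0]  # hand-picked scores, one per expert

output, routes = moe_forward(token, experts, router_logits, k=2)
print("selected experts:", [index for index, _ in routes])
print("output:", output)
```

With `k=2` and four experts, only half of the expert computation runs for this token, which is where the throughput advantage over a dense layer of the same total size comes from.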

This model was contributed by Yikang Shen.

JetMoeConfig

autodoc JetMoeConfig

JetMoeModel

autodoc JetMoeModel - forward

JetMoeForCausalLM

autodoc JetMoeForCausalLM - forward

JetMoeForSequenceClassification

autodoc JetMoeForSequenceClassification - forward
