# JetMoe
## Overview
**JetMoe-8B** is an 8B Mixture-of-Experts (MoE) language model developed by [Yikang Shen](https://scholar.google.com.hk/citations?user=qff5rRYAAAAJ) and [MyShell](https://myshell.ai/).
JetMoe project aims to provide a LLaMA2-level performance and efficient language model with a limited budget.
To achieve this goal, JetMoe uses a sparsely activated architecture inspired by the [ModuleFormer](https://arxiv.org/abs/2306.04640).
Each JetMoe block consists of two MoE layers: Mixture of Attention Heads and Mixture of MLP Experts.
Given the input tokens, it activates a subset of its experts to process them.
This sparse activation schema enables JetMoe to achieve much better training throughput than similar size dense models.
The training throughput of JetMoe-8B is around 100B tokens per day on a cluster of 96 H100 GPUs with a straightforward 3-way pipeline parallelism strategy.
This model was contributed by [Yikang Shen](https://huggingface.co/YikangS).
## JetMoeConfig
[[autodoc]] JetMoeConfig
## JetMoeModel
[[autodoc]] JetMoeModel
- forward
## JetMoeForCausalLM
[[autodoc]] JetMoeForCausalLM
- forward
## JetMoeForSequenceClassification
[[autodoc]] JetMoeForSequenceClassification
- forward