# DistilBERT

## Overview

The DistilBERT model was proposed in the blog post *Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT*, and the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108). DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.

The abstract from the paper is the following:

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

This model was contributed by victorsanh. The JAX version of this model was contributed by kamalkraj. The original code can be found here.
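
A quick way to try the model is through the `fill-mask` pipeline. The snippet below is a minimal sketch, assuming the `distilbert-base-uncased` checkpoint is available from the Hub:

```python
from transformers import pipeline

# DistilBERT is distilled from bert-base-uncased, so it keeps the same [MASK] objective
unmasker = pipeline("fill-mask", model="distilbert-base-uncased")
unmasker("Knowledge distillation makes large models [MASK] to deploy.")
```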

## Usage tips

- DistilBERT doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`); a short tokenizer sketch follows this list.

- DistilBERT doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though; just let us know if you need this option.

- DistilBERT is the same as BERT but smaller. It is trained by distillation of the pretrained BERT model, meaning it's been trained to predict the same probabilities as the larger model. The actual objective is a combination of the following (a toy sketch of how these losses could be combined also follows this list):

  - finding the same probabilities as the teacher model
  - predicting the masked tokens correctly (but no next-sentence objective)
  - a cosine similarity between the hidden states of the student and the teacher model
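
As a minimal sketch of the first tip (assuming the `distilbert-base-uncased` checkpoint), the tokenizer inserts `[SEP]` between paired segments and returns no `token_type_ids`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Pass the two segments as a pair; the tokenizer joins them with [SEP] itself
encoding = tokenizer("First segment.", "Second segment.")

print(list(encoding.keys()))                    # ['input_ids', 'attention_mask'] -- no token_type_ids
print(tokenizer.sep_token)                      # [SEP]
print(tokenizer.decode(encoding["input_ids"]))  # [CLS] first segment. [SEP] second segment. [SEP]
```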
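
For illustration only, here is a toy PyTorch sketch of how such a triple loss could be combined. The equal weighting, the temperature value, and the function name are made-up placeholders, not the actual values or code used to train DistilBERT:

```python
import torch
import torch.nn.functional as F


def triple_loss(student_logits, teacher_logits, student_hidden, teacher_hidden, mlm_labels, temperature=2.0):
    # 1) match the teacher's output distribution (soft targets)
    distill_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    # 2) standard masked language modeling loss on the hard labels (-100 marks unmasked positions)
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), mlm_labels.view(-1), ignore_index=-100
    )

    # 3) pull the student's hidden states toward the teacher's with a cosine objective
    flat_student = student_hidden.view(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0), device=flat_student.device)
    cos_loss = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    # placeholder equal weighting of the three terms
    return distill_loss + mlm_loss + cos_loss
```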

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DistilBERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

The available resources are grouped by topic: multiple choice, ⚗️ optimization, inference, and 🚀 deployment.

## DistilBertConfig

[[autodoc]] DistilBertConfig

## DistilBertTokenizer

[[autodoc]] DistilBertTokenizer

## DistilBertTokenizerFast

[[autodoc]] DistilBertTokenizerFast

## DistilBertModel

[[autodoc]] DistilBertModel
    - forward
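
As an orientation, a minimal forward pass with the base model (checkpoint name assumed), whose output follows the usual `(batch_size, sequence_length, hidden_size)` layout:

```python
import torch
from transformers import AutoTokenizer, DistilBertModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("DistilBERT is small and fast.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per input token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```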

## DistilBertForMaskedLM

[[autodoc]] DistilBertForMaskedLM
    - forward

## DistilBertForSequenceClassification

[[autodoc]] DistilBertForSequenceClassification
    - forward

## DistilBertForMultipleChoice

[[autodoc]] DistilBertForMultipleChoice
    - forward

## DistilBertForTokenClassification

[[autodoc]] DistilBertForTokenClassification
    - forward

## DistilBertForQuestionAnswering

[[autodoc]] DistilBertForQuestionAnswering
    - forward
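
A short extractive question answering sketch, assuming the `distilbert-base-uncased-distilled-squad` checkpoint fine-tuned on SQuAD:

```python
import torch
from transformers import AutoTokenizer, DistilBertForQuestionAnswering

checkpoint = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = DistilBertForQuestionAnswering.from_pretrained(checkpoint)

question = "How much faster is DistilBERT than BERT?"
context = "DistilBERT has 40% fewer parameters than bert-base-uncased and runs 60% faster."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start/end token positions and decode the span in between
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0, start : end + 1]))
```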

## TFDistilBertModel

[[autodoc]] TFDistilBertModel
    - call

## TFDistilBertForMaskedLM

[[autodoc]] TFDistilBertForMaskedLM
    - call

## TFDistilBertForSequenceClassification

[[autodoc]] TFDistilBertForSequenceClassification
    - call

## TFDistilBertForMultipleChoice

[[autodoc]] TFDistilBertForMultipleChoice
    - call

## TFDistilBertForTokenClassification

[[autodoc]] TFDistilBertForTokenClassification
    - call

## TFDistilBertForQuestionAnswering

[[autodoc]] TFDistilBertForQuestionAnswering
    - call

## FlaxDistilBertModel

[[autodoc]] FlaxDistilBertModel
    - __call__

## FlaxDistilBertForMaskedLM

[[autodoc]] FlaxDistilBertForMaskedLM
    - __call__

## FlaxDistilBertForSequenceClassification

[[autodoc]] FlaxDistilBertForSequenceClassification
    - __call__

## FlaxDistilBertForMultipleChoice

[[autodoc]] FlaxDistilBertForMultipleChoice
    - __call__

## FlaxDistilBertForTokenClassification

[[autodoc]] FlaxDistilBertForTokenClassification
    - __call__

## FlaxDistilBertForQuestionAnswering

[[autodoc]] FlaxDistilBertForQuestionAnswering
    - __call__