dilbert -> distilbert

This commit is contained in:
thomwolf 2019-08-28 13:59:42 +02:00
parent c9bce1811c
commit 912a377e90
15 changed files with 144 additions and 144 deletions

View File

@ -13,7 +13,7 @@ The library currently contains PyTorch implementations, pre-trained model weight
5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
8. **[DilBERT](https://github.com/huggingface/pytorch-transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blogpost [Smaller, faster, cheaper, lighter: Introducing DilBERT, a distilled version of BERT](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5
8. **[DistilBERT](https://github.com/huggingface/pytorch-transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blogpost [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
) by Victor Sanh, Lysandre Debut and Thomas Wolf.
These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on the performance in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html).

View File

@ -1,33 +1,33 @@
# DilBERT
# DistilBERT
This folder contains the original code used to train DilBERT as well as examples showcasing how to use DilBERT.
This folder contains the original code used to train DistilBERT as well as examples showcasing how to use DistilBERT.
## What is DilBERT
## What is DistilBERT
DilBERT stands for Distillated-BERT. DilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving over 95% of Bert's performances as measured on the GLUE language understanding benchmark. DilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
DistilBERT stands for Distilled-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the Bert architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving over 95% of Bert's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model, called the teacher, into a smaller model, called the student. By distilling Bert, we obtain a smaller Transformer model that bears many similarities to the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option for putting large-scale trained Transformer models into production. A generic sketch of the distillation loss is given below.
For more information on DilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5
For more information on DistilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
).
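To make the teacher-student setup concrete, here is a minimal, generic sketch of a distillation loss: a temperature-softened cross-entropy (KL divergence) between the teacher's and the student's output distributions. The function name and temperature value are illustrative only; this is not the exact objective implemented in `distiller.py`.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then penalize the KL
    # divergence between them. The T**2 factor keeps gradient magnitudes
    # comparable to a hard-label loss (standard knowledge-distillation practice).
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(soft_student, soft_teacher, reduction='batchmean') * temperature ** 2
```

In practice such a soft-target term is combined with the usual masked language modeling loss on the hard labels.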
## How to use DilBERT
## How to use DistilBERT
PyTorch-Transformers includes two pre-trained DilBERT models, currently only provided for English (we are investigating the possibility to train and release a multilingual version of DilBERT):
PyTorch-Transformers includes two pre-trained DistilBERT models, currently only provided for English (we are investigating the possibility of training and releasing a multilingual version of DistilBERT):
- `dilbert-base-uncased`: DilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
- `dilbert-base-uncased-distilled-squad`: A finetuned version of `dilbert-base-uncased` finetuned using (a second step of) knwoledge distillation on SQuAD 1.0. This model reaches a F1 score of 86.2 on the dev set (for comparison, Bert `bert-base-uncased` version reaches a 88.5 F1 score).
- `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, a hidden size of 768 and 12 heads, totaling 66M parameters.
- `distilbert-base-uncased-distilled-squad`: A version of `distilbert-base-uncased` finetuned using (a second step of) knowledge distillation on SQuAD 1.0. This model reaches an F1 score of 86.2 on the dev set (for comparison, the Bert `bert-base-uncased` version reaches an F1 score of 88.5). A question-answering usage sketch is given after the code example below.
Using DilBERT is very similar to using BERT. DilBERT share the same tokenizer as BERT's `bert-base-uncased` even though we provide a link to this tokenizer under the `DilBertTokenizer` name to have a consistent naming between the library models.
Using DistilBERT is very similar to using BERT. DistilBERT shares the same tokenizer as BERT's `bert-base-uncased`, but we also expose it under the `DistilBertTokenizer` name for consistent naming across the library's models.
```python
tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
model = DilBertModel.from_pretrained('dilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
```
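The SQuAD-distilled checkpoint can be loaded in the same way for extractive question answering. The following is a minimal sketch; the example question/context strings are placeholders and the span post-processing is deliberately simplified.

```python
import torch
from pytorch_transformers import DistilBertTokenizer, DistilBertForQuestionAnswering

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-distilled-squad')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')

# DistilBERT has no token_type_ids: question and context are packed into a single
# sequence, separated by the `[SEP]` token.
text = "[CLS] Who wrote the blog post ? [SEP] The blog post was written by Victor Sanh . [SEP]"
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # batch size 1

start_logits, end_logits = model(input_ids)[:2]
start = start_logits.argmax(dim=-1).item()
end = end_logits.argmax(dim=-1).item()
answer = tokenizer.decode(input_ids[0, start:end + 1].tolist())
```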
## How to train DilBERT
## How to train DistilBERT
In the following, we will explain how you can train your own compressed model.
@ -68,7 +68,7 @@ python train.py \
By default, this will launch training on a single GPU (even if more are available on the cluster). Other parameters are available on the command line; look in `train.py` or run `python train.py --help` to list them.
We highly encourage you to distributed training for training DilBert as the training corpus is quite large. Here's an example that runs a distributed training on a single node having 4 GPUs:
We highly encourage you to use distributed training for training DistilBert, as the training corpus is quite large. Here's an example that runs distributed training on a single node with 4 GPUs:
```bash
export NODE_RANK=0

View File

@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Dataloaders to train DilBERT.
Dataloaders to train DistilBERT.
"""
from typing import List
import math

View File

@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The distiller to distil DilBERT.
The distiller to distil DistilBERT.
"""
import os
import math

View File

@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocessing script before training DilBERT.
Preprocessing script before training DistilBERT.
"""
import argparse
import pickle

View File

@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocessing script before training DilBERT.
Preprocessing script before training DistilBERT.
"""
from pytorch_transformers import BertForPreTraining
import torch
@ -33,32 +33,32 @@ if __name__ == '__main__':
compressed_sd = {}
for w in ['word_embeddings', 'position_embeddings']:
compressed_sd[f'dilbert.embeddings.{w}.weight'] = \
compressed_sd[f'distilbert.embeddings.{w}.weight'] = \
state_dict[f'bert.embeddings.{w}.weight']
for w in ['weight', 'bias']:
compressed_sd[f'dilbert.embeddings.LayerNorm.{w}'] = \
compressed_sd[f'distilbert.embeddings.LayerNorm.{w}'] = \
state_dict[f'bert.embeddings.LayerNorm.{w}']
std_idx = 0
for teacher_idx in [0, 2, 4, 7, 9, 11]:
for w in ['weight', 'bias']:
compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.q_lin.{w}'] = \
compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.q_lin.{w}'] = \
state_dict[f'bert.encoder.layer.{teacher_idx}.attention.self.query.{w}']
compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.k_lin.{w}'] = \
compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.k_lin.{w}'] = \
state_dict[f'bert.encoder.layer.{teacher_idx}.attention.self.key.{w}']
compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.v_lin.{w}'] = \
compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.v_lin.{w}'] = \
state_dict[f'bert.encoder.layer.{teacher_idx}.attention.self.value.{w}']
compressed_sd[f'dilbert.transformer.layer.{std_idx}.attention.out_lin.{w}'] = \
compressed_sd[f'distilbert.transformer.layer.{std_idx}.attention.out_lin.{w}'] = \
state_dict[f'bert.encoder.layer.{teacher_idx}.attention.output.dense.{w}']
compressed_sd[f'dilbert.transformer.layer.{std_idx}.sa_layer_norm.{w}'] = \
compressed_sd[f'distilbert.transformer.layer.{std_idx}.sa_layer_norm.{w}'] = \
state_dict[f'bert.encoder.layer.{teacher_idx}.attention.output.LayerNorm.{w}']
compressed_sd[f'dilbert.transformer.layer.{std_idx}.ffn.lin1.{w}'] = \
compressed_sd[f'distilbert.transformer.layer.{std_idx}.ffn.lin1.{w}'] = \
state_dict[f'bert.encoder.layer.{teacher_idx}.intermediate.dense.{w}']
compressed_sd[f'dilbert.transformer.layer.{std_idx}.ffn.lin2.{w}'] = \
compressed_sd[f'distilbert.transformer.layer.{std_idx}.ffn.lin2.{w}'] = \
state_dict[f'bert.encoder.layer.{teacher_idx}.output.dense.{w}']
compressed_sd[f'dilbert.transformer.layer.{std_idx}.output_layer_norm.{w}'] = \
compressed_sd[f'distilbert.transformer.layer.{std_idx}.output_layer_norm.{w}'] = \
state_dict[f'bert.encoder.layer.{teacher_idx}.output.LayerNorm.{w}']
std_idx += 1
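# The loop above builds the student state dict from six selected layers
# (indices 0, 2, 4, 7, 9, 11) of the 12-layer BERT teacher; `std_idx` tracks
# the corresponding DistilBERT (student) layer index, 0 through 5.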

View File

@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocessing script before training DilBERT.
Preprocessing script before training DistilBERT.
"""
from collections import Counter
import argparse

View File

@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Training DilBERT.
Training DistilBERT.
"""
import os
import argparse
@ -24,7 +24,7 @@ import numpy as np
import torch
from pytorch_transformers import BertTokenizer, BertForMaskedLM
from pytorch_transformers import DilBertForMaskedLM, DilBertConfig
from pytorch_transformers import DistilBertForMaskedLM, DistilBertConfig
from distiller import Distiller
from utils import git_log, logger, init_gpu_params, set_seed
@ -201,13 +201,13 @@ def main():
assert os.path.isfile(os.path.join(args.from_pretrained_config))
logger.info(f'Loading pretrained weights from {args.from_pretrained_weights}')
logger.info(f'Loading pretrained config from {args.from_pretrained_config}')
stu_architecture_config = DilBertConfig.from_json_file(args.from_pretrained_config)
student = DilBertForMaskedLM.from_pretrained(args.from_pretrained_weights,
stu_architecture_config = DistilBertConfig.from_json_file(args.from_pretrained_config)
student = DistilBertForMaskedLM.from_pretrained(args.from_pretrained_weights,
config=stu_architecture_config)
else:
args.vocab_size_or_config_json_file = args.vocab_size
stu_architecture_config = DilBertConfig(**vars(args))
student = DilBertForMaskedLM(stu_architecture_config)
stu_architecture_config = DistilBertConfig(**vars(args))
student = DistilBertForMaskedLM(stu_architecture_config)
if args.n_gpu > 0:

View File

@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Utils to train DilBERT.
Utils to train DistilBERT.
"""
import git
import json

View File

@ -7,7 +7,7 @@ from .tokenization_gpt2 import GPT2Tokenizer
from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
from .tokenization_xlm import XLMTokenizer
from .tokenization_roberta import RobertaTokenizer
from .tokenization_dilbert import DilBertTokenizer
from .tokenization_distilbert import DistilBertTokenizer
from .tokenization_utils import (PreTrainedTokenizer)
@ -41,9 +41,9 @@ from .modeling_xlm import (XLMConfig, XLMPreTrainedModel , XLMModel,
XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
from .modeling_roberta import (RobertaConfig, RobertaForMaskedLM, RobertaModel, RobertaForSequenceClassification,
ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
from .modeling_dilbert import (DilBertConfig, DilBertForMaskedLM, DilBertModel,
DilBertForSequenceClassification, DilBertForQuestionAnswering,
DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
from .modeling_distilbert import (DistilBertConfig, DistilBertForMaskedLM, DistilBertModel,
DistilBertForSequenceClassification, DistilBertForQuestionAnswering,
DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
from .modeling_utils import (WEIGHTS_NAME, CONFIG_NAME, TF_WEIGHTS_NAME,
PretrainedConfig, PreTrainedModel, prune_layer, Conv1D)

View File

@ -30,7 +30,7 @@ from .modeling_transfo_xl import TransfoXLConfig, TransfoXLModel
from .modeling_xlnet import XLNetConfig, XLNetModel
from .modeling_xlm import XLMConfig, XLMModel
from .modeling_roberta import RobertaConfig, RobertaModel
from .modeling_dilbert import DilBertConfig, DilBertModel
from .modeling_distilbert import DistilBertConfig, DistilBertModel
from .modeling_utils import PreTrainedModel, SequenceSummary
@ -111,8 +111,8 @@ class AutoConfig(object):
assert unused_kwargs == {'foo': False}
"""
if 'dilbert' in pretrained_model_name_or_path:
return DilBertConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
if 'distilbert' in pretrained_model_name_or_path:
return DistilBertConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
elif 'roberta' in pretrained_model_name_or_path:
return RobertaConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
elif 'bert' in pretrained_model_name_or_path:
@ -228,8 +228,8 @@ class AutoModel(object):
model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
"""
if 'dilbert' in pretrained_model_name_or_path:
return DilBertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
if 'distilbert' in pretrained_model_name_or_path:
return DistilBertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
elif 'roberta' in pretrained_model_name_or_path:
return RobertaModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
elif 'bert' in pretrained_model_name_or_path:
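# Note that the 'distilbert' check runs before the generic 'bert' check, so a name
# such as 'distilbert-base-uncased' is dispatched to the DistilBert classes rather
# than the Bert ones. A minimal, illustrative use of the auto classes:
#
#   config = AutoConfig.from_pretrained('distilbert-base-uncased')
#   model = AutoModel.from_pretrained('distilbert-base-uncased')
#   assert isinstance(model, DistilBertModel)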

View File

@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
PyTorch DilBERT model.
PyTorch DistilBERT model.
"""
from __future__ import absolute_import, division, print_function, unicode_literals
@ -36,19 +36,19 @@ import logging
logger = logging.getLogger(__name__)
DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-pytorch_model.bin",
'dilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-distilled-squad-pytorch_model.bin"
DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-pytorch_model.bin",
'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-pytorch_model.bin"
}
DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-config.json",
'dilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/dilbert-base-uncased-distilled-squad-config.json"
DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json",
'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-config.json"
}
class DilBertConfig(PretrainedConfig):
pretrained_config_archive_map = DILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
class DistilBertConfig(PretrainedConfig):
pretrained_config_archive_map = DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
def __init__(self,
vocab_size_or_config_json_file=30522,
@ -66,7 +66,7 @@ class DilBertConfig(PretrainedConfig):
qa_dropout=0.1,
seq_classif_dropout=0.2,
**kwargs):
super(DilBertConfig, self).__init__(**kwargs)
super(DistilBertConfig, self).__init__(**kwargs)
if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
and isinstance(vocab_size_or_config_json_file, unicode)):
@ -398,17 +398,17 @@ class Transformer(nn.Module):
### INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL ###
class DilBertPreTrainedModel(PreTrainedModel):
class DistilBertPreTrainedModel(PreTrainedModel):
""" An abstract class to handle weights initialization and
a simple interface for downloading and loading pretrained models.
"""
config_class = DilBertConfig
pretrained_model_archive_map = DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
config_class = DistilBertConfig
pretrained_model_archive_map = DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
load_tf_weights = None
base_model_prefix = "dilbert"
base_model_prefix = "distilbert"
def __init__(self, *inputs, **kwargs):
super(DilBertPreTrainedModel, self).__init__(*inputs, **kwargs)
super(DistilBertPreTrainedModel, self).__init__(*inputs, **kwargs)
def init_weights(self, module):
""" Initialize the weights.
@ -425,36 +425,36 @@ class DilBertPreTrainedModel(PreTrainedModel):
module.bias.data.zero_()
DILBERT_START_DOCSTRING = r"""
DilBERT is a small, fast, cheap and light Transformer model
DISTILBERT_START_DOCSTRING = r"""
DistilBERT is a small, fast, cheap and light Transformer model
trained by distilling Bert base. It has 40% fewer parameters than
`bert-base-uncased` and runs 60% faster while preserving over 95% of
Bert's performance as measured on the GLUE language understanding benchmark.
Here are the differences between the interface of Bert and DilBert:
Here are the differences between the interface of Bert and DistilBert:
- DilBert doesn't have `token_type_ids`, you don't need to indicate which token belong to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`)
- DilBert doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though, just let's us know if you need this option.
- DistilBert doesn't have `token_type_ids`; you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`).
- DistilBert doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though; just let us know if you need this option.
For more information on DilBERT, please refer to our
For more information on DistilBERT, please refer to our
`detailed blog post`_
.. _`detailed blog post`:
https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5
https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
Parameters:
config (:class:`~pytorch_transformers.DilBertConfig`): Model configuration class with all the parameters of the model.
config (:class:`~pytorch_transformers.DistilBertConfig`): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the configuration.
Check out the :meth:`~pytorch_transformers.PreTrainedModel.from_pretrained` method to load the model weights.
"""
DILBERT_INPUTS_DOCSTRING = r"""
DISTILBERT_INPUTS_DOCSTRING = r"""
Inputs:
**input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Indices of input sequence tokens in the vocabulary.
The input sequences should start with `[CLS]` and `[SEP]` tokens.
For now, ONLY BertTokenizer(`bert-base-uncased`) is supported and you should use this tokenizer when using DilBERT.
For now, ONLY BertTokenizer (`bert-base-uncased`) is supported and you should use this tokenizer when using DistilBERT.
**attention_mask**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
@ -465,9 +465,9 @@ DILBERT_INPUTS_DOCSTRING = r"""
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
"""
@add_start_docstrings("The bare DilBERT encoder/transformer outputing raw hidden-states without any specific head on top.",
DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
class DilBertModel(DilBertPreTrainedModel):
@add_start_docstrings("The bare DistilBERT encoder/transformer outputing raw hidden-states without any specific head on top.",
DISTILBERT_START_DOCSTRING, DISTILBERT_INPUTS_DOCSTRING)
class DistilBertModel(DistilBertPreTrainedModel):
r"""
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
@ -482,15 +482,15 @@ class DilBertModel(DilBertPreTrainedModel):
Examples::
tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
model = DilBertModel.from_pretrained('dilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
"""
def __init__(self, config):
super(DilBertModel, self).__init__(config)
super(DistilBertModel, self).__init__(config)
self.embeddings = Embeddings(config) # Embeddings
self.transformer = Transformer(config) # Encoder
@ -543,9 +543,9 @@ class DilBertModel(DilBertPreTrainedModel):
return output # last-layer hidden-state, (all hidden_states), (all attentions)
@add_start_docstrings("""DilBert Model with a `masked language modeling` head on top. """,
DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
class DilBertForMaskedLM(DilBertPreTrainedModel):
@add_start_docstrings("""DistilBert Model with a `masked language modeling` head on top. """,
DISTILBERT_START_DOCSTRING, DISTILBERT_INPUTS_DOCSTRING)
class DistilBertForMaskedLM(DistilBertPreTrainedModel):
r"""
**masked_lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Labels for computing the masked language modeling loss.
@ -568,19 +568,19 @@ class DilBertForMaskedLM(DilBertPreTrainedModel):
Examples::
tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
model = DilBertForMaskedLM.from_pretrained('dilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
outputs = model(input_ids, masked_lm_labels=input_ids)
loss, prediction_scores = outputs[:2]
"""
def __init__(self, config):
super(DilBertForMaskedLM, self).__init__(config)
super(DistilBertForMaskedLM, self).__init__(config)
self.output_attentions = config.output_attentions
self.output_hidden_states = config.output_hidden_states
self.dilbert = DilBertModel(config)
self.distilbert = DistilBertModel(config)
self.vocab_transform = nn.Linear(config.dim, config.dim)
self.vocab_layer_norm = nn.LayerNorm(config.dim, eps=1e-12)
self.vocab_projector = nn.Linear(config.dim, config.vocab_size)
@ -595,14 +595,14 @@ class DilBertForMaskedLM(DilBertPreTrainedModel):
Export to TorchScript can't handle parameter sharing so we are cloning them instead.
"""
self._tie_or_clone_weights(self.vocab_projector,
self.dilbert.embeddings.word_embeddings)
self.distilbert.embeddings.word_embeddings)
def forward(self,
input_ids: torch.tensor,
attention_mask: torch.tensor = None,
masked_lm_labels: torch.tensor = None,
head_mask: torch.tensor = None):
dlbrt_output = self.dilbert(input_ids=input_ids,
dlbrt_output = self.distilbert(input_ids=input_ids,
attention_mask=attention_mask,
head_mask=head_mask)
hidden_states = dlbrt_output[0] # (bs, seq_length, dim)
@ -620,10 +620,10 @@ class DilBertForMaskedLM(DilBertPreTrainedModel):
return outputs # (mlm_loss), prediction_logits, (all hidden_states), (all attentions)
@add_start_docstrings("""DilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of
@add_start_docstrings("""DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of
the pooled output) e.g. for GLUE tasks. """,
DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
class DilBertForSequenceClassification(DilBertPreTrainedModel):
DISTILBERT_START_DOCSTRING, DISTILBERT_INPUTS_DOCSTRING)
class DistilBertForSequenceClassification(DistilBertPreTrainedModel):
r"""
**labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels for computing the sequence classification/regression loss.
@ -646,8 +646,8 @@ class DilBertForSequenceClassification(DilBertPreTrainedModel):
Examples::
tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
model = DilBertForSequenceClassification.from_pretrained('dilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
outputs = model(input_ids, labels=labels)
@ -655,10 +655,10 @@ class DilBertForSequenceClassification(DilBertPreTrainedModel):
"""
def __init__(self, config):
super(DilBertForSequenceClassification, self).__init__(config)
super(DistilBertForSequenceClassification, self).__init__(config)
self.num_labels = config.num_labels
self.dilbert = DilBertModel(config)
self.distilbert = DistilBertModel(config)
self.pre_classifier = nn.Linear(config.dim, config.dim)
self.classifier = nn.Linear(config.dim, config.num_labels)
self.dropout = nn.Dropout(config.seq_classif_dropout)
@ -670,17 +670,17 @@ class DilBertForSequenceClassification(DilBertPreTrainedModel):
attention_mask: torch.tensor = None,
labels: torch.tensor = None,
head_mask: torch.tensor = None):
dilbert_output = self.dilbert(input_ids=input_ids,
distilbert_output = self.distilbert(input_ids=input_ids,
attention_mask=attention_mask,
head_mask=head_mask)
hidden_state = dilbert_output[0] # (bs, seq_len, dim)
hidden_state = distilbert_output[0] # (bs, seq_len, dim)
pooled_output = hidden_state[:, 0] # (bs, dim)
pooled_output = self.pre_classifier(pooled_output) # (bs, dim)
pooled_output = nn.ReLU()(pooled_output) # (bs, dim)
pooled_output = self.dropout(pooled_output) # (bs, dim)
logits = self.classifier(pooled_output) # (bs, dim)
outputs = (logits,) + dilbert_output[1:]
outputs = (logits,) + distilbert_output[1:]
if labels is not None:
if self.num_labels == 1:
loss_fct = nn.MSELoss()
@ -693,10 +693,10 @@ class DilBertForSequenceClassification(DilBertPreTrainedModel):
return outputs # (loss), logits, (hidden_states), (attentions)
@add_start_docstrings("""DilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
@add_start_docstrings("""DistilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
the hidden-states output to compute `span start logits` and `span end logits`). """,
DILBERT_START_DOCSTRING, DILBERT_INPUTS_DOCSTRING)
class DilBertForQuestionAnswering(DilBertPreTrainedModel):
DISTILBERT_START_DOCSTRING, DISTILBERT_INPUTS_DOCSTRING)
class DistilBertForQuestionAnswering(DistilBertPreTrainedModel):
r"""
**start_positions**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
Labels for position (index) of the start of the labelled span for computing the token classification loss.
@ -724,8 +724,8 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
Examples::
tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
model = DilBertForQuestionAnswering.from_pretrained('dilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
start_positions = torch.tensor([1])
end_positions = torch.tensor([3])
@ -734,9 +734,9 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
"""
def __init__(self, config):
super(DilBertForQuestionAnswering, self).__init__(config)
super(DistilBertForQuestionAnswering, self).__init__(config)
self.dilbert = DilBertModel(config)
self.distilbert = DistilBertModel(config)
self.qa_outputs = nn.Linear(config.dim, config.num_labels)
assert config.num_labels == 2
self.dropout = nn.Dropout(config.qa_dropout)
@ -749,10 +749,10 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
start_positions: torch.tensor = None,
end_positions: torch.tensor = None,
head_mask: torch.tensor = None):
dilbert_output = self.dilbert(input_ids=input_ids,
distilbert_output = self.distilbert(input_ids=input_ids,
attention_mask=attention_mask,
head_mask=head_mask)
hidden_states = dilbert_output[0] # (bs, max_query_len, dim)
hidden_states = distilbert_output[0] # (bs, max_query_len, dim)
hidden_states = self.dropout(hidden_states) # (bs, max_query_len, dim)
logits = self.qa_outputs(hidden_states) # (bs, max_query_len, 2)
@ -760,7 +760,7 @@ class DilBertForQuestionAnswering(DilBertPreTrainedModel):
start_logits = start_logits.squeeze(-1) # (bs, max_query_len)
end_logits = end_logits.squeeze(-1) # (bs, max_query_len)
outputs = (start_logits, end_logits,) + dilbert_output[1:]
outputs = (start_logits, end_logits,) + distilbert_output[1:]
if start_positions is not None and end_positions is not None:
# If we are on multi-GPU, split add a dimension
if len(start_positions.size()) > 1:

View File

@ -20,23 +20,23 @@ import unittest
import shutil
import pytest
from pytorch_transformers import (DilBertConfig, DilBertModel, DilBertForMaskedLM,
DilBertForQuestionAnswering, DilBertForSequenceClassification)
from pytorch_transformers.modeling_dilbert import DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
from pytorch_transformers import (DistilBertConfig, DistilBertModel, DistilBertForMaskedLM,
DistilBertForQuestionAnswering, DistilBertForSequenceClassification)
from pytorch_transformers.modeling_distilbert import DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
from .modeling_common_test import (CommonTestCases, ConfigTester, ids_tensor)
class DilBertModelTest(CommonTestCases.CommonModelTester):
class DistilBertModelTest(CommonTestCases.CommonModelTester):
all_model_classes = (DilBertModel, DilBertForMaskedLM, DilBertForQuestionAnswering,
DilBertForSequenceClassification)
all_model_classes = (DistilBertModel, DistilBertForMaskedLM, DistilBertForQuestionAnswering,
DistilBertForSequenceClassification)
test_pruning = True
test_torchscript = True
test_resize_embeddings = True
test_head_masking = True
class DilBertModelTester(object):
class DistilBertModelTester(object):
def __init__(self,
parent,
@ -100,7 +100,7 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
choice_labels = ids_tensor([self.batch_size], self.num_choices)
config = DilBertConfig(
config = DistilBertConfig(
vocab_size_or_config_json_file=self.vocab_size,
dim=self.hidden_size,
n_layers=self.num_hidden_layers,
@ -119,8 +119,8 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
list(result["loss"].size()),
[])
def create_and_check_dilbert_model(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
model = DilBertModel(config=config)
def create_and_check_distilbert_model(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
model = DistilBertModel(config=config)
model.eval()
(sequence_output,) = model(input_ids, input_mask)
(sequence_output,) = model(input_ids)
@ -132,8 +132,8 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
list(result["sequence_output"].size()),
[self.batch_size, self.seq_length, self.hidden_size])
def create_and_check_dilbert_for_masked_lm(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
model = DilBertForMaskedLM(config=config)
def create_and_check_distilbert_for_masked_lm(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
model = DistilBertForMaskedLM(config=config)
model.eval()
loss, prediction_scores = model(input_ids, attention_mask=input_mask, masked_lm_labels=token_labels)
result = {
@ -145,8 +145,8 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
[self.batch_size, self.seq_length, self.vocab_size])
self.check_loss_output(result)
def create_and_check_dilbert_for_question_answering(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
model = DilBertForQuestionAnswering(config=config)
def create_and_check_distilbert_for_question_answering(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
model = DistilBertForQuestionAnswering(config=config)
model.eval()
loss, start_logits, end_logits = model(input_ids, input_mask, sequence_labels, sequence_labels)
result = {
@ -162,9 +162,9 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
[self.batch_size, self.seq_length])
self.check_loss_output(result)
def create_and_check_dilbert_for_sequence_classification(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
def create_and_check_distilbert_for_sequence_classification(self, config, input_ids, input_mask, sequence_labels, token_labels, choice_labels):
config.num_labels = self.num_labels
model = DilBertForSequenceClassification(config)
model = DistilBertForSequenceClassification(config)
model.eval()
loss, logits = model(input_ids, input_mask, sequence_labels)
result = {
@ -183,33 +183,33 @@ class DilBertModelTest(CommonTestCases.CommonModelTester):
return config, inputs_dict
def setUp(self):
self.model_tester = DilBertModelTest.DilBertModelTester(self)
self.config_tester = ConfigTester(self, config_class=DilBertConfig, dim=37)
self.model_tester = DistilBertModelTest.DistilBertModelTester(self)
self.config_tester = ConfigTester(self, config_class=DistilBertConfig, dim=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_dilbert_model(self):
def test_distilbert_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_dilbert_model(*config_and_inputs)
self.model_tester.create_and_check_distilbert_model(*config_and_inputs)
def test_for_masked_lm(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_dilbert_for_masked_lm(*config_and_inputs)
self.model_tester.create_and_check_distilbert_for_masked_lm(*config_and_inputs)
def test_for_question_answering(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_dilbert_for_question_answering(*config_and_inputs)
self.model_tester.create_and_check_distilbert_for_question_answering(*config_and_inputs)
def test_for_sequence_classification(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_dilbert_for_sequence_classification(*config_and_inputs)
self.model_tester.create_and_check_distilbert_for_sequence_classification(*config_and_inputs)
# @pytest.mark.slow
# def test_model_from_pretrained(self):
# cache_dir = "/tmp/pytorch_transformers_test/"
# for model_name in list(DILBERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
# model = DilBertModel.from_pretrained(model_name, cache_dir=cache_dir)
# for model_name in list(DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
# model = DistilBertModel.from_pretrained(model_name, cache_dir=cache_dir)
# shutil.rmtree(cache_dir)
# self.assertIsNotNone(model)

View File

@ -18,20 +18,20 @@ import os
import unittest
from io import open
from pytorch_transformers.tokenization_dilbert import (DilBertTokenizer)
from pytorch_transformers.tokenization_distilbert import (DistilBertTokenizer)
from .tokenization_tests_commons import CommonTestCases
from .tokenization_bert_test import BertTokenizationTest
class DilBertTokenizationTest(BertTokenizationTest):
class DistilBertTokenizationTest(BertTokenizationTest):
tokenizer_class = DilBertTokenizer
tokenizer_class = DistilBertTokenizer
def get_tokenizer(self):
return DilBertTokenizer.from_pretrained(self.tmpdirname)
return DistilBertTokenizer.from_pretrained(self.tmpdirname)
def test_sequence_builders(self):
tokenizer = DilBertTokenizer.from_pretrained("dilbert-base-uncased")
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
text = tokenizer.encode("sequence builders")
text_2 = tokenizer.encode("multi-sequence build")

View File

@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes for DilBERT."""
"""Tokenization classes for DistilBERT."""
from __future__ import absolute_import, division, print_function, unicode_literals
@ -31,21 +31,21 @@ VOCAB_FILES_NAMES = {'vocab_file': 'vocab.txt'}
PRETRAINED_VOCAB_FILES_MAP = {
'vocab_file':
{
'dilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
'dilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
'distilbert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
'distilbert-base-uncased-distilled-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
}
}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
'dilbert-base-uncased': 512,
'dilbert-base-uncased-distilled-squad': 512,
'distilbert-base-uncased': 512,
'distilbert-base-uncased-distilled-squad': 512,
}
class DilBertTokenizer(BertTokenizer):
class DistilBertTokenizer(BertTokenizer):
r"""
Constructs a DilBertTokenizer.
:class:`~pytorch_transformers.DilBertTokenizer` is identical to BertTokenizer and runs end-to-end tokenization: punctuation splitting + wordpiece
Constructs a DistilBertTokenizer.
:class:`~pytorch_transformers.DistilBertTokenizer` is identical to BertTokenizer and runs end-to-end tokenization: punctuation splitting + wordpiece
Args:
vocab_file: Path to a one-wordpiece-per-line vocabulary file