Mirror of https://github.com/huggingface/transformers.git (synced 2025-07-31 02:02:21 +06:00)

Commit a52d56c8d9: Merge branch 'master' into cleanup-configs
@ -58,7 +58,7 @@ Choose the right framework for every part of a model's lifetime
|
||||
| [Quick tour: Fine-tuning/usage scripts](#quick-tour-of-the-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
|
||||
| [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-transformers to transformers |
|
||||
| [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
|
||||
| [Documentation][(v2.2.0/v2.2.1)](https://huggingface.co/transformers/v2.2.0) [(v2.1.1)](https://huggingface.co/transformers/v2.1.1) [(v2.0.0)](https://huggingface.co/transformers/v2.0.0) [(v1.2.0)](https://huggingface.co/transformers/v1.2.0) [(v1.1.0)](https://huggingface.co/transformers/v1.1.0) [(v1.0.0)](https://huggingface.co/transformers/v1.0.0) [(master)](https://huggingface.co/transformers) | Full API documentation and more |
|
||||
| [Documentation][(v2.2.0/v2.2.1/v2.2.2)](https://huggingface.co/transformers/v2.2.0) [(v2.1.1)](https://huggingface.co/transformers/v2.1.1) [(v2.0.0)](https://huggingface.co/transformers/v2.0.0) [(v1.2.0)](https://huggingface.co/transformers/v1.2.0) [(v1.1.0)](https://huggingface.co/transformers/v1.1.0) [(v1.0.0)](https://huggingface.co/transformers/v1.0.0) [(master)](https://huggingface.co/transformers) | Full API documentation and more |
|
||||
|
||||
## Installation
|
||||
|
||||
@ -144,7 +144,8 @@ At some point in the future, you'll be able to seamlessly move from pre-training
|
||||
9. **[CTRL](https://github.com/salesforce/ctrl/)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
|
||||
10. **[CamemBERT](https://camembert-model.fr)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
|
||||
11. **[ALBERT](https://github.com/google-research/ALBERT)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
|
||||
11. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedback before starting your PR.
|
||||
12. **[T5](https://github.com/google-research/text-to-text-transfer-transformer)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
|
||||
13. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedback before starting your PR.
|
||||
|
||||
These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on performance in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
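As a quick illustration of the shared API behind the models listed above, here is a minimal sketch (not part of this diff; it assumes the standard `transformers` v2.x loading pattern and the `albert-base-v1` checkpoint):

```python
import torch
from transformers import AlbertTokenizer, AlbertModel

# Assumed checkpoint name; the other listed architectures follow the same pattern.
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")
model = AlbertModel.from_pretrained("albert-base-v1")

input_ids = torch.tensor([tokenizer.encode("Hello, world!", add_special_tokens=True)])
with torch.no_grad():
    outputs = model(input_ids)      # a tuple; outputs[0] is the last hidden state
print(outputs[0].shape)             # (1, sequence_length, hidden_size)
```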
@ -26,7 +26,7 @@ author = u'huggingface'
|
||||
# The short X.Y version
|
||||
version = u''
|
||||
# The full version, including alpha/beta/rc tags
|
||||
release = u'2.2.1'
|
||||
release = u'2.2.2'
|
||||
|
||||
|
||||
# -- General configuration ---------------------------------------------------
|
||||
|
@ -217,6 +217,21 @@ Here is the full list of the currently provided pretrained models together with
|
||||
| | | | ALBERT xxlarge model with no dropout, additional training data and longer training |
|
||||
| | | (see `details <https://github.com/google-research/ALBERT>`__) |
|
||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| T5 | ``t5-small`` | | ~60M parameters with 6-layers, 512-hidden-state, 2048 feed-forward hidden-state, 8-heads, |
|
||||
| | | | Trained on English text: the Colossal Clean Crawled Corpus (C4) |
|
||||
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| | ``t5-base`` | | ~220M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads, |
|
||||
| | | | Trained on English text: the Colossal Clean Crawled Corpus (C4) |
|
||||
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| | ``t5-large`` | | ~770M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads, |
|
||||
| | | | Trained on English text: the Colossal Clean Crawled Corpus (C4) |
|
||||
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| | ``t5-3B`` | | ~2.8B parameters with 24-layers, 1024-hidden-state, 16384 feed-forward hidden-state, 32-heads, |
|
||||
| | | | Trained on English text: the Colossal Clean Crawled Corpus (C4) |
|
||||
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| | ``t5-11B`` | | ~11B parameters with 24-layers, 1024-hidden-state, 65536 feed-forward hidden-state, 128-heads, |
|
||||
| | | | Trained on English text: the Colossal Clean Crawled Corpus (C4) |
|
||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
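As a rough cross-check of the table above, a sketch assuming the ``t5-small`` shortcut name resolves to the configuration added in this commit:

```python
from transformers import T5Config

# Sketch only: the attribute names come from the new configuration_t5.py in this diff.
config = T5Config.from_pretrained("t5-small")
print(config.num_layers, config.d_model, config.d_ff, config.num_heads)
# Expected per the t5-small row above: 6 512 2048 8
```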
|
||||
|
||||
|
||||
.. <https://huggingface.co/transformers/examples.html>`__
|
||||
|
@ -580,10 +580,16 @@ def main():
|
||||
# Evaluation - we can ask to evaluate all the checkpoints (sub-directories) in a directory
|
||||
results = {}
|
||||
if args.do_eval and args.local_rank in [-1, 0]:
|
||||
checkpoints = [args.output_dir]
|
||||
if args.eval_all_checkpoints:
|
||||
checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
|
||||
logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce model loading logs
|
||||
|
||||
if args.do_train:
|
||||
logger.info("Loading checkpoints saved during training for evaluation")
|
||||
checkpoints = [args.output_dir]
|
||||
if args.eval_all_checkpoints:
|
||||
checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
|
||||
logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce model loading logs
|
||||
else:
|
||||
logger.info("Loading checkpoint %s for evaluation", args.model_name_or_path)
|
||||
checkpoints = [args.model_name_or_path]
|
||||
|
||||
logger.info("Evaluate the following checkpoints: %s", checkpoints)
|
||||
|
||||
|
@ -29,7 +29,7 @@ And move all the stories to the same folder. We will refer as `$DATA_PATH` the p
|
||||
python run_summarization.py \
|
||||
--documents_dir $DATA_PATH \
|
||||
--summaries_output_dir $SUMMARIES_PATH \ # optional
|
||||
--to_cpu false \
|
||||
--no_cuda false \
|
||||
--batch_size 4 \
|
||||
--min_length 50 \
|
||||
--max_length 200 \
|
||||
@ -39,7 +39,7 @@ python run_summarization.py \
|
||||
--compute_rouge true
|
||||
```
|
||||
|
||||
The script executes on GPU if one is available and if `to_cpu` is not set to `true`. Inference on multiple GPUs is not supported yet. The ROUGE scores will be displayed in the console at the end of evaluation and written in a `rouge_scores.txt` file. The script takes 30 hours to compute with a single Tesla V100 GPU and a batch size of 10 (300,000 texts to summarize).
The script executes on GPU if one is available and if `no_cuda` is not set to `true`. Inference on multiple GPUs is not supported yet. The ROUGE scores will be displayed in the console at the end of evaluation and written in a `rouge_scores.txt` file. The script takes 30 hours to compute with a single Tesla V100 GPU and a batch size of 10 (300,000 texts to summarize).
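A quick way to check whether a GPU will actually be used (a sketch, assuming PyTorch is installed):

```python
import torch

# The script runs on GPU only when this prints True and `--no_cuda` is not set to "true".
print(torch.cuda.is_available())
```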
|
||||
|
||||
## Summarize any text
|
||||
|
||||
@ -49,7 +49,7 @@ Put the documents that you would like to summarize in a folder (the path to whic
|
||||
python run_summarization.py \
|
||||
--documents_dir $DATA_PATH \
|
||||
--summaries_output_dir $SUMMARIES_PATH \ # optional
|
||||
--to_cpu false \
|
||||
--no_cuda false \
|
||||
--batch_size 4 \
|
||||
--min_length 50 \
|
||||
--max_length 200 \
|
||||
|
setup.py
@ -44,7 +44,7 @@ extras['all'] = [package for package in extras.values()]
|
||||
|
||||
setup(
|
||||
name="transformers",
|
||||
version="2.2.1",
|
||||
version="2.2.2",
|
||||
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
|
||||
author_email="thomas@huggingface.co",
|
||||
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
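Once this bump ships, an installed copy should report the new version (a sketch, assuming a clean install of the 2.2.2 package):

```python
import transformers

# The same string is bumped in transformers/__init__.py further down in this diff.
print(transformers.__version__)  # expected: "2.2.2"
```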
|
||||
|
@ -75,8 +75,6 @@ class XxxConfig(PretrainedConfig):
|
||||
attn_pdrop=0.1,
|
||||
layer_norm_epsilon=1e-5,
|
||||
initializer_range=0.02,
|
||||
|
||||
num_labels=1,
|
||||
summary_type='cls_index',
|
||||
summary_use_proj=True,
|
||||
summary_activation=None,
|
||||
@ -84,7 +82,7 @@ class XxxConfig(PretrainedConfig):
|
||||
summary_first_dropout=0.1,
|
||||
**kwargs):
|
||||
super(XxxConfig, self).__init__(**kwargs)
|
||||
self.vocab_size = vocab_size if isinstance(vocab_size, six.string_types) else -1
|
||||
self.vocab_size = vocab_size
|
||||
self.n_ctx = n_ctx
|
||||
self.n_positions = n_positions
|
||||
self.n_embd = n_embd
|
||||
@ -95,23 +93,11 @@ class XxxConfig(PretrainedConfig):
|
||||
self.attn_pdrop = attn_pdrop
|
||||
self.layer_norm_epsilon = layer_norm_epsilon
|
||||
self.initializer_range = initializer_range
|
||||
|
||||
self.num_labels = num_labels
|
||||
self.summary_type = summary_type
|
||||
self.summary_use_proj = summary_use_proj
|
||||
self.summary_activation = summary_activation
|
||||
self.summary_first_dropout = summary_first_dropout
|
||||
self.summary_proj_to_labels = summary_proj_to_labels
|
||||
if isinstance(vocab_size, six.string_types):
|
||||
with open(vocab_size, "r", encoding="utf-8") as reader:
|
||||
json_config = json.loads(reader.read())
|
||||
for key, value in json_config.items():
|
||||
self.__dict__[key] = value
|
||||
elif not isinstance(vocab_size, int):
|
||||
raise ValueError(
|
||||
"First argument must be either a vocabulary size (int)"
|
||||
"or the path to a pretrained model config file (str)"
|
||||
)
|
||||
|
||||
@property
|
||||
def max_position_embeddings(self):
|
||||
|
@ -26,9 +26,9 @@ from transformers import XxxConfig, XxxForPreTraining, load_tf_weights_in_xxx
|
||||
import logging
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, xxx_config_file, pytorch_dump_path):
|
||||
def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path):
|
||||
# Initialise PyTorch model
|
||||
config = XxxConfig.from_json_file(xxx_config_file)
|
||||
config = XxxConfig.from_json_file(config_file)
|
||||
print("Building PyTorch model from configuration: {}".format(str(config)))
|
||||
model = XxxForPreTraining(config)
|
||||
|
||||
@ -48,11 +48,11 @@ if __name__ == "__main__":
|
||||
type = str,
|
||||
required = True,
|
||||
help = "Path to the TensorFlow checkpoint path.")
|
||||
parser.add_argument("--xxx_config_file",
|
||||
parser.add_argument("--config_file",
|
||||
default = None,
|
||||
type = str,
|
||||
required = True,
|
||||
help = "The config json file corresponding to the pre-trained XXX model. \n"
|
||||
help = "The config json file corresponding to the pre-trained model. \n"
|
||||
"This specifies the model architecture.")
|
||||
parser.add_argument("--pytorch_dump_path",
|
||||
default = None,
|
||||
@ -61,5 +61,5 @@ if __name__ == "__main__":
|
||||
help = "Path to the output PyTorch model.")
|
||||
args = parser.parse_args()
|
||||
convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path,
|
||||
args.xxx_config_file,
|
||||
args.config_file,
|
||||
args.pytorch_dump_path)
|
||||
|
@ -26,6 +26,8 @@ import logging
|
||||
import math
|
||||
import os
|
||||
import sys
|
||||
import copy
|
||||
import itertools
|
||||
from io import open
|
||||
|
||||
import numpy as np
|
||||
|
@ -25,6 +25,8 @@ import logging
|
||||
import math
|
||||
import os
|
||||
import sys
|
||||
import copy
|
||||
import itertools
|
||||
from io import open
|
||||
|
||||
import torch
|
||||
|
@ -1,4 +1,4 @@
|
||||
__version__ = "2.2.1"
|
||||
__version__ = "2.2.2"
|
||||
|
||||
# Work around to update TensorFlow's absl.logging threshold which alters the
|
||||
# default Python logging output behavior when present.
|
||||
@ -48,6 +48,7 @@ from .tokenization_roberta import RobertaTokenizer
|
||||
from .tokenization_distilbert import DistilBertTokenizer
|
||||
from .tokenization_albert import AlbertTokenizer
|
||||
from .tokenization_camembert import CamembertTokenizer
|
||||
from .tokenization_t5 import T5Tokenizer
|
||||
|
||||
# Configurations
|
||||
from .configuration_utils import PretrainedConfig
|
||||
@ -58,12 +59,12 @@ from .configuration_transfo_xl import TransfoXLConfig, TRANSFO_XL_PRETRAINED_CON
|
||||
from .configuration_gpt2 import GPT2Config, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_ctrl import CTRLConfig, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_xlnet import XLNetConfig, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_ctrl import CTRLConfig, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_xlm import XLMConfig, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_roberta import RobertaConfig, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_distilbert import DistilBertConfig, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_albert import AlbertConfig, ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_camembert import CamembertConfig, CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
from .configuration_t5 import T5Config, T5_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
|
||||
# Modeling
|
||||
if is_torch_available():
|
||||
@ -77,8 +78,8 @@ if is_torch_available():
|
||||
BertForTokenClassification, BertForQuestionAnswering,
|
||||
load_tf_weights_in_bert, BERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
from .modeling_openai import (OpenAIGPTPreTrainedModel, OpenAIGPTModel,
|
||||
OpenAIGPTLMHeadModel, OpenAIGPTDoubleHeadsModel,
|
||||
load_tf_weights_in_openai_gpt, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
OpenAIGPTLMHeadModel, OpenAIGPTDoubleHeadsModel,
|
||||
load_tf_weights_in_openai_gpt, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
from .modeling_transfo_xl import (TransfoXLPreTrainedModel, TransfoXLModel, TransfoXLLMHeadModel,
|
||||
AdaptiveEmbedding,
|
||||
load_tf_weights_in_transfo_xl, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
@ -110,6 +111,9 @@ if is_torch_available():
|
||||
CamembertForTokenClassification,
|
||||
CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
from .modeling_encoder_decoder import PreTrainedEncoderDecoder, Model2Model
|
||||
from .modeling_t5 import (T5PreTrainedModel, T5Model, T5WithLMHeadModel,
|
||||
load_tf_weights_in_t5,
|
||||
T5_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
|
||||
from .modeling_albert import (AlbertPreTrainedModel, AlbertModel, AlbertForMaskedLM, AlbertForSequenceClassification,
|
||||
AlbertForQuestionAnswering,
|
||||
@ -178,6 +182,10 @@ if is_tf_available():
|
||||
from .modeling_tf_albert import (TFAlbertPreTrainedModel, TFAlbertModel, TFAlbertForMaskedLM,
|
||||
TFAlbertForSequenceClassification,
|
||||
TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
|
||||
from .modeling_tf_t5 import (TFT5PreTrainedModel, TFT5Model, TFT5WithLMHeadModel,
|
||||
TF_T5_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
|
||||
# Optimization
|
||||
from .optimization_tf import (WarmUp, create_optimizer, AdamWeightDecay, GradientAccumulator)
|
||||
|
||||
|
@ -6,6 +6,7 @@ def main():
|
||||
"This command line utility let you convert original (author released) model checkpoint to pytorch.\n"
|
||||
"It should be used as one of: \n"
|
||||
">> transformers bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT, \n"
|
||||
">> transformers t5 TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT, \n"
|
||||
">> transformers gpt OPENAI_GPT_CHECKPOINT_FOLDER_PATH PYTORCH_DUMP_OUTPUT [OPENAI_GPT_CONFIG], \n"
|
||||
">> transformers transfo_xl TF_CHECKPOINT_OR_DATASET PYTORCH_DUMP_OUTPUT [TF_CONFIG] or \n"
|
||||
">> transformers gpt2 TF_CHECKPOINT PYTORCH_DUMP_OUTPUT [GPT2_CONFIG] or \n"
|
||||
@ -21,6 +22,23 @@ def main():
|
||||
"https://www.tensorflow.org/install/ for installation instructions.")
|
||||
raise
|
||||
|
||||
if len(sys.argv) != 5:
|
||||
# pylint: disable=line-too-long
|
||||
print("Should be used as `transformers bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT`")
|
||||
else:
|
||||
PYTORCH_DUMP_OUTPUT = sys.argv.pop()
|
||||
TF_CONFIG = sys.argv.pop()
|
||||
TF_CHECKPOINT = sys.argv.pop()
|
||||
convert_tf_checkpoint_to_pytorch(TF_CHECKPOINT, TF_CONFIG, PYTORCH_DUMP_OUTPUT)
|
||||
elif sys.argv[1] == "t5":
|
||||
try:
|
||||
from .convert_t5_original_tf_checkpoint_to_pytorch import convert_tf_checkpoint_to_pytorch
|
||||
except ImportError:
|
||||
print("transformers can only be used from the commandline to convert TensorFlow models to PyTorch, "
|
||||
"In that case, it requires TensorFlow to be installed. Please see "
|
||||
"https://www.tensorflow.org/install/ for installation instructions.")
|
||||
raise
|
||||
|
||||
if len(sys.argv) != 5:
|
||||
# pylint: disable=line-too-long
|
||||
print("Should be used as `transformers t5 TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT`")
|
||||
|
@ -19,8 +19,8 @@ class UserCommands(BaseTransformersCLICommand):
|
||||
list_parser.set_defaults(func=lambda args: ListObjsCommand(args))
|
||||
# upload
|
||||
upload_parser = parser.add_parser('upload')
|
||||
upload_parser.add_argument('file', type=str, help='Local filepath of the file to upload.')
|
||||
upload_parser.add_argument('--filename', type=str, default=None, help='Optional: override object filename on S3.')
|
||||
upload_parser.add_argument('path', type=str, help='Local path of the folder or individual file to upload.')
|
||||
upload_parser.add_argument('--filename', type=str, default=None, help='Optional: override individual object filename on S3.')
|
||||
upload_parser.set_defaults(func=lambda args: UploadCommand(args))
|
||||
|
||||
|
||||
@ -138,28 +138,57 @@ class ListObjsCommand(BaseUserCommand):
|
||||
|
||||
|
||||
class UploadCommand(BaseUserCommand):
|
||||
def walk_dir(self, rel_path):
|
||||
"""
|
||||
Recursively list all files in a folder.
|
||||
"""
|
||||
entries: List[os.DirEntry] = list(os.scandir(rel_path))
|
||||
files = [
|
||||
(
|
||||
os.path.join(os.getcwd(), f.path), # filepath
|
||||
f.path # filename
|
||||
)
|
||||
for f in entries if f.is_file()
|
||||
]
|
||||
for f in entries:
|
||||
if f.is_dir():
|
||||
files += self.walk_dir(f.path)
|
||||
return files
|
||||
|
||||
def run(self):
|
||||
token = HfFolder.get_token()
|
||||
if token is None:
|
||||
print("Not logged in")
|
||||
exit(1)
|
||||
filepath = os.path.join(os.getcwd(), self.args.file)
|
||||
filename = self.args.filename if self.args.filename is not None else os.path.basename(filepath)
|
||||
print(
|
||||
"About to upload file {} to S3 under filename {}".format(
|
||||
ANSI.bold(filepath), ANSI.bold(filename)
|
||||
local_path = os.path.abspath(self.args.path)
|
||||
if os.path.isdir(local_path):
|
||||
if self.args.filename is not None:
|
||||
raise ValueError("Cannot specify a filename override when uploading a folder.")
|
||||
rel_path = os.path.basename(local_path)
|
||||
files = self.walk_dir(rel_path)
|
||||
elif os.path.isfile(local_path):
|
||||
filename = self.args.filename if self.args.filename is not None else os.path.basename(local_path)
|
||||
files = [(local_path, filename)]
|
||||
else:
|
||||
raise ValueError("Not a valid file or directory: {}".format(local_path))
|
||||
|
||||
for filepath, filename in files:
|
||||
print(
|
||||
"About to upload file {} to S3 under filename {}".format(
|
||||
ANSI.bold(filepath), ANSI.bold(filename)
|
||||
)
|
||||
)
|
||||
)
|
||||
|
||||
choice = input("Proceed? [Y/n] ").lower()
|
||||
if not(choice == "" or choice == "y" or choice == "yes"):
|
||||
print("Abort")
|
||||
exit()
|
||||
print(
|
||||
ANSI.bold("Uploading... This might take a while if file is large")
|
||||
ANSI.bold("Uploading... This might take a while if files are large")
|
||||
)
|
||||
access_url = self._api.presign_and_upload(
|
||||
token=token, filename=filename, filepath=filepath
|
||||
)
|
||||
print("Your file now lives at:")
|
||||
print(access_url)
|
||||
for filepath, filename in files:
|
||||
access_url = self._api.presign_and_upload(
|
||||
token=token, filename=filename, filepath=filepath
|
||||
)
|
||||
print("Your file now lives at:")
|
||||
print(access_url)
|
||||
|
@ -29,6 +29,7 @@ from .configuration_distilbert import DistilBertConfig
|
||||
from .configuration_ctrl import CTRLConfig
|
||||
from .configuration_camembert import CamembertConfig
|
||||
from .configuration_albert import AlbertConfig
|
||||
from .configuration_t5 import T5Config
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@ -68,6 +69,7 @@ class AutoConfig(object):
|
||||
|
||||
The configuration class to instantiate is selected as the first pattern matching
|
||||
in the `pretrained_model_name_or_path` string (in the following order):
|
||||
- contains `t5`: T5Config (T5 model)
|
||||
- contains `distilbert`: DistilBertConfig (DistilBERT model)
|
||||
- contains `albert`: AlbertConfig (ALBERT model)
|
||||
- contains `camembert`: CamembertConfig (CamemBERT model)
|
||||
@ -124,7 +126,9 @@ class AutoConfig(object):
|
||||
assert unused_kwargs == {'foo': False}
|
||||
|
||||
"""
|
||||
if 'distilbert' in pretrained_model_name_or_path:
|
||||
if 't5' in pretrained_model_name_or_path:
|
||||
return T5Config.from_pretrained(pretrained_model_name_or_path, **kwargs)
|
||||
elif 'distilbert' in pretrained_model_name_or_path:
|
||||
return DistilBertConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
|
||||
elif 'albert' in pretrained_model_name_or_path:
|
||||
return AlbertConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
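A minimal sketch of the new dispatch behaviour (assuming the ``t5-small`` shortcut is reachable):

```python
from transformers import AutoConfig

# The name contains "t5", so the new branch above returns a T5Config.
config = AutoConfig.from_pretrained("t5-small")
print(type(config).__name__)  # T5Config
```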
|
||||
|
transformers/configuration_t5.py (new file, 119 lines)
@ -0,0 +1,119 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2010, The T5 Authors and HuggingFace Inc.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" T5 model configuration """
|
||||
|
||||
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||
|
||||
import json
|
||||
import logging
|
||||
import sys
|
||||
import six
|
||||
from io import open
|
||||
|
||||
from .configuration_utils import PretrainedConfig
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
T5_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||
't5-small': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-small-config.json",
|
||||
't5-base': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-base-config.json",
|
||||
't5-large': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-large-config.json",
|
||||
't5-3b': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-3b-config.json",
|
||||
't5-11b': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-11b-config.json",
|
||||
}
|
||||
|
||||
|
||||
class T5Config(PretrainedConfig):
|
||||
r"""
|
||||
:class:`~transformers.T5Config` is the configuration class to store the configuration of a
|
||||
`T5Model`.
|
||||
|
||||
|
||||
Arguments:
|
||||
vocab_size_or_config_json_file: Vocabulary size of `input_ids` in `T5Model`.
|
||||
hidden_size: Size of the encoder layers and the pooler layer.
|
||||
num_hidden_layers: Number of hidden layers in the Transformer encoder.
|
||||
num_attention_heads: Number of attention heads for each attention layer in
|
||||
the Transformer encoder.
|
||||
intermediate_size: The size of the "intermediate" (i.e., feed-forward)
|
||||
layer in the Transformer encoder.
|
||||
hidden_act: The non-linear activation function (function or string) in the
|
||||
encoder and pooler. If string, "gelu", "relu", "swish" and "gelu_new" are supported.
|
||||
hidden_dropout_prob: The dropout probability for all fully connected
|
||||
layers in the embeddings, encoder, and pooler.
|
||||
attention_probs_dropout_prob: The dropout ratio for the attention
|
||||
probabilities.
|
||||
max_position_embeddings: The maximum sequence length that this model might
|
||||
ever be used with. Typically set this to something large just in case
|
||||
(e.g., 512 or 1024 or 2048).
|
||||
type_vocab_size: The vocabulary size of the `token_type_ids` passed into
|
||||
`T5Model`.
|
||||
initializer_factor: A factor for initializing all weight matrices (should be kept to 1.0, used for initialization testing).
|
||||
layer_norm_eps: The epsilon used by LayerNorm.
|
||||
"""
|
||||
pretrained_config_archive_map = T5_PRETRAINED_CONFIG_ARCHIVE_MAP
|
||||
|
||||
def __init__(self,
|
||||
vocab_size_or_config_json_file=32128,
|
||||
n_positions=512,
|
||||
d_model=512,
|
||||
d_kv=64,
|
||||
d_ff=2048,
|
||||
num_layers=6,
|
||||
num_heads=8,
|
||||
relative_attention_num_buckets=32,
|
||||
dropout_rate=0.1,
|
||||
layer_norm_epsilon=1e-6,
|
||||
initializer_factor=1.0,
|
||||
**kwargs):
|
||||
super(T5Config, self).__init__(**kwargs)
|
||||
self.vocab_size = vocab_size_or_config_json_file if isinstance(vocab_size_or_config_json_file, int) else -1
|
||||
self.n_positions = n_positions
|
||||
self.d_model = d_model
|
||||
self.d_kv = d_kv
|
||||
self.d_ff = d_ff
|
||||
self.num_layers = num_layers
|
||||
self.num_heads = num_heads
|
||||
self.relative_attention_num_buckets = relative_attention_num_buckets
|
||||
self.dropout_rate = dropout_rate
|
||||
self.layer_norm_epsilon = layer_norm_epsilon
|
||||
self.initializer_factor = initializer_factor
|
||||
|
||||
if isinstance(vocab_size_or_config_json_file, six.string_types):
|
||||
with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader:
|
||||
json_config = json.loads(reader.read())
|
||||
for key, value in json_config.items():
|
||||
self.__dict__[key] = value
|
||||
elif not isinstance(vocab_size_or_config_json_file, int):
|
||||
raise ValueError(
|
||||
"First argument must be either a vocabulary size (int)"
|
||||
"or the path to a pretrained model config file (str)"
|
||||
)
|
||||
|
||||
@property
|
||||
def max_position_embeddings(self):
|
||||
return self.n_positions
|
||||
|
||||
@property
|
||||
def hidden_size(self):
|
||||
return self.d_model
|
||||
|
||||
@property
|
||||
def num_attention_heads(self):
|
||||
return self.num_heads
|
||||
|
||||
@property
|
||||
def num_hidden_layers(self):
|
||||
return self.num_layers
|
@ -34,7 +34,8 @@ from transformers import (load_pytorch_checkpoint_in_tf2_model,
|
||||
RobertaConfig, TFRobertaForMaskedLM, TFRobertaForSequenceClassification, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
DistilBertConfig, TFDistilBertForMaskedLM, TFDistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
CTRLConfig, TFCTRLLMHeadModel, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
AlbertConfig, TFAlbertForMaskedLM, ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
||||
AlbertConfig, TFAlbertForMaskedLM, ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
T5Config, TFT5WithLMHeadModel, T5_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
@ -48,7 +49,8 @@ if is_torch_available():
|
||||
RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
AlbertForMaskedLM, ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
AlbertForMaskedLM, ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
T5WithLMHeadModel, T5_PRETRAINED_MODEL_ARCHIVE_MAP)
|
||||
else:
|
||||
(BertForPreTraining, BertForQuestionAnswering, BertForSequenceClassification, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
GPT2LMHeadModel, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
@ -59,7 +61,8 @@ else:
|
||||
RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
AlbertForMaskedLM, ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP) = (
|
||||
AlbertForMaskedLM, ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
|
||||
T5WithLMHeadModel, T5_PRETRAINED_MODEL_ARCHIVE_MAP) = (
|
||||
None, None, None, None,
|
||||
None, None,
|
||||
None, None,
|
||||
@ -69,6 +72,7 @@ else:
|
||||
None, None, None,
|
||||
None, None, None,
|
||||
None, None,
|
||||
None, None,
|
||||
None, None)
|
||||
|
||||
|
||||
@ -90,7 +94,8 @@ MODEL_CLASSES = {
|
||||
'distilbert': (DistilBertConfig, TFDistilBertForMaskedLM, DistilBertForMaskedLM, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||
'distilbert-base-uncased-distilled-squad': (DistilBertConfig, TFDistilBertForQuestionAnswering, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||
'ctrl': (CTRLConfig, TFCTRLLMHeadModel, CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||
'albert': (AlbertConfig, TFAlbertForMaskedLM, AlbertForMaskedLM, ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP, ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
|
||||
'albert': (AlbertConfig, TFAlbertForMaskedLM, AlbertForMaskedLM, ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP, ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||
't5': (T5Config, TFT5WithLMHeadModel, T5WithLMHeadModel, T5_PRETRAINED_MODEL_ARCHIVE_MAP, T5_PRETRAINED_CONFIG_ARCHIVE_MAP),
|
||||
}
|
||||
|
||||
def convert_pt_checkpoint_to_tf(model_type, pytorch_checkpoint_path, config_file, tf_dump_path, compare_with_pt_model=False, use_cached_models=True):
|
||||
@ -115,24 +120,21 @@ def convert_pt_checkpoint_to_tf(model_type, pytorch_checkpoint_path, config_file
|
||||
tf_model = load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path)
|
||||
|
||||
if compare_with_pt_model:
|
||||
inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
||||
tf_inputs = tf.constant(inputs_list)
|
||||
tfo = tf_model(tf_inputs, training=False) # build the network
|
||||
tfo = tf_model(tf_model.dummy_inputs, training=False) # build the network
|
||||
|
||||
state_dict = torch.load(pytorch_checkpoint_path, map_location='cpu')
|
||||
pt_model = pt_model_class.from_pretrained(pretrained_model_name_or_path=None,
|
||||
config=config,
|
||||
state_dict=state_dict)
|
||||
|
||||
pt_inputs = torch.tensor(inputs_list)
|
||||
with torch.no_grad():
|
||||
pto = pt_model(pt_inputs)
|
||||
pto = pt_model(**pt_model.dummy_inputs)
|
||||
|
||||
np_pt = pto[0].detach().numpy()
|
||||
np_pt = pto[0].numpy()
|
||||
np_tf = tfo[0].numpy()
|
||||
diff = np.amax(np.abs(np_pt - np_tf))
|
||||
print("Max absolute difference between models outputs {}".format(diff))
|
||||
assert diff <= 2e-2, "Error, model absolute difference is >2e-2"
|
||||
assert diff <= 2e-2, "Error, model absolute difference is >2e-2: {}".format(diff)
|
||||
|
||||
# Save pytorch-model
|
||||
print("Save TensorFlow model to {}".format(tf_dump_path))
|
||||
|
transformers/convert_t5_original_tf_checkpoint_to_pytorch.py (new executable file, 65 lines)
@ -0,0 +1,65 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 The T5 authors and HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Convert T5 checkpoint."""
|
||||
|
||||
from __future__ import absolute_import
|
||||
from __future__ import division
|
||||
from __future__ import print_function
|
||||
|
||||
import argparse
|
||||
import torch
|
||||
|
||||
from transformers import T5Config, T5Model, load_tf_weights_in_t5
|
||||
|
||||
import logging
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path):
|
||||
# Initialise PyTorch model
|
||||
config = T5Config.from_json_file(config_file)
|
||||
print("Building PyTorch model from configuration: {}".format(str(config)))
|
||||
model = T5Model(config)
|
||||
|
||||
# Load weights from tf checkpoint
|
||||
load_tf_weights_in_t5(model, config, tf_checkpoint_path)
|
||||
|
||||
# Save pytorch-model
|
||||
print("Save PyTorch model to {}".format(pytorch_dump_path))
|
||||
torch.save(model.state_dict(), pytorch_dump_path)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
## Required parameters
|
||||
parser.add_argument("--tf_checkpoint_path",
|
||||
default = None,
|
||||
type = str,
|
||||
required = True,
|
||||
help = "Path to the TensorFlow checkpoint path.")
|
||||
parser.add_argument("--config_file",
|
||||
default = None,
|
||||
type = str,
|
||||
required = True,
|
||||
help = "The config json file corresponding to the pre-trained T5 model. \n"
|
||||
"This specifies the model architecture.")
|
||||
parser.add_argument("--pytorch_dump_path",
|
||||
default = None,
|
||||
type = str,
|
||||
required = True,
|
||||
help = "Path to the output PyTorch model.")
|
||||
args = parser.parse_args()
|
||||
convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path,
|
||||
args.config_file,
|
||||
args.pytorch_dump_path)
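The script is normally driven from the command line with the three ``--tf_checkpoint_path``/``--config_file``/``--pytorch_dump_path`` arguments; a programmatic sketch, with placeholder paths, would be:

```python
from transformers.convert_t5_original_tf_checkpoint_to_pytorch import convert_tf_checkpoint_to_pytorch

# All three paths below are placeholders.
convert_tf_checkpoint_to_pytorch(
    tf_checkpoint_path="/path/to/t5/model.ckpt",
    config_file="/path/to/t5-config.json",
    pytorch_dump_path="/path/to/pytorch_model.bin",
)
```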
|
@ -133,7 +133,7 @@ def glue_convert_examples_to_features(examples, tokenizer,
|
||||
if is_tf_available() and is_tf_dataset:
|
||||
def gen():
|
||||
for ex in features:
|
||||
yield ({'input_ids': ex.input_ids,
|
||||
yield ({'input_ids': ex.input_ids,
|
||||
'attention_mask': ex.attention_mask,
|
||||
'token_type_ids': ex.token_type_ids},
|
||||
ex.label)
|
||||
|
@ -18,19 +18,20 @@ if is_tf_available():
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer,
|
||||
orig_answer_text):
|
||||
|
||||
def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, orig_answer_text):
|
||||
"""Returns tokenized answer spans that better match the annotated answer."""
|
||||
tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text))
|
||||
|
||||
for new_start in range(input_start, input_end + 1):
|
||||
for new_end in range(input_end, new_start - 1, -1):
|
||||
text_span = " ".join(doc_tokens[new_start:(new_end + 1)])
|
||||
text_span = " ".join(doc_tokens[new_start : (new_end + 1)])
|
||||
if text_span == tok_answer_text:
|
||||
return (new_start, new_end)
|
||||
|
||||
return (input_start, input_end)
|
||||
|
||||
|
||||
def _check_is_max_context(doc_spans, cur_span_index, position):
|
||||
"""Check if this is the 'max context' doc span for the token."""
|
||||
best_score = None
|
||||
@ -50,10 +51,11 @@ def _check_is_max_context(doc_spans, cur_span_index, position):
|
||||
|
||||
return cur_span_index == best_span_index
|
||||
|
||||
|
||||
def _new_check_is_max_context(doc_spans, cur_span_index, position):
|
||||
"""Check if this is the 'max context' doc span for the token."""
|
||||
# if len(doc_spans) == 1:
|
||||
# return True
|
||||
# return True
|
||||
best_score = None
|
||||
best_span_index = None
|
||||
for (span_index, doc_span) in enumerate(doc_spans):
|
||||
@ -71,14 +73,16 @@ def _new_check_is_max_context(doc_spans, cur_span_index, position):
|
||||
|
||||
return cur_span_index == best_span_index
|
||||
|
||||
|
||||
def _is_whitespace(c):
|
||||
if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
|
||||
return True
|
||||
return False
|
||||
|
||||
def squad_convert_examples_to_features(examples, tokenizer, max_seq_length,
|
||||
doc_stride, max_query_length, is_training,
|
||||
return_dataset=False):
|
||||
|
||||
def squad_convert_examples_to_features(
|
||||
examples, tokenizer, max_seq_length, doc_stride, max_query_length, is_training, return_dataset=False
|
||||
):
|
||||
"""
|
||||
Converts a list of examples into a list of features that can be directly given as input to a model.
|
||||
It is model-dependant and takes advantage of many of the tokenizer's features to create the model's inputs.
|
||||
@ -112,24 +116,23 @@ def squad_convert_examples_to_features(examples, tokenizer, max_seq_length,
|
||||
)
|
||||
"""
|
||||
|
||||
# Defining helper methods
|
||||
# Defining helper methods
|
||||
unique_id = 1000000000
|
||||
|
||||
features = []
|
||||
for (example_index, example) in enumerate(tqdm(examples)):
|
||||
for (example_index, example) in enumerate(tqdm(examples, desc="Converting examples to features")):
|
||||
if is_training and not example.is_impossible:
|
||||
# Get start and end position
|
||||
start_position = example.start_position
|
||||
end_position = example.end_position
|
||||
|
||||
# If the answer cannot be found in the text, then skip this example.
|
||||
actual_text = " ".join(example.doc_tokens[start_position:(end_position + 1)])
|
||||
actual_text = " ".join(example.doc_tokens[start_position : (end_position + 1)])
|
||||
cleaned_answer_text = " ".join(whitespace_tokenize(example.answer_text))
|
||||
if actual_text.find(cleaned_answer_text) == -1:
|
||||
logger.warning("Could not find answer: '%s' vs. '%s'", actual_text, cleaned_answer_text)
|
||||
continue
|
||||
|
||||
|
||||
tok_to_orig_index = []
|
||||
orig_to_tok_index = []
|
||||
all_doc_tokens = []
|
||||
@ -140,7 +143,6 @@ def squad_convert_examples_to_features(examples, tokenizer, max_seq_length,
|
||||
tok_to_orig_index.append(i)
|
||||
all_doc_tokens.append(sub_token)
|
||||
|
||||
|
||||
if is_training and not example.is_impossible:
|
||||
tok_start_position = orig_to_tok_index[example.start_position]
|
||||
if example.end_position < len(example.doc_tokens) - 1:
|
||||
@ -153,36 +155,41 @@ def squad_convert_examples_to_features(examples, tokenizer, max_seq_length,
|
||||
)
|
||||
|
||||
spans = []
|
||||
|
||||
truncated_query = tokenizer.encode(example.question_text, add_special_tokens=False, max_length=max_query_length)
|
||||
sequence_added_tokens = tokenizer.max_len - tokenizer.max_len_single_sentence
|
||||
sequence_pair_added_tokens = tokenizer.max_len - tokenizer.max_len_sentences_pair
|
||||
|
||||
truncated_query = tokenizer.encode(
|
||||
example.question_text, add_special_tokens=False, max_length=max_query_length
|
||||
)
|
||||
sequence_added_tokens = tokenizer.max_len - tokenizer.max_len_single_sentence
|
||||
sequence_pair_added_tokens = tokenizer.max_len - tokenizer.max_len_sentences_pair
|
||||
|
||||
span_doc_tokens = all_doc_tokens
|
||||
while len(spans) * doc_stride < len(all_doc_tokens):
|
||||
|
||||
|
||||
encoded_dict = tokenizer.encode_plus(
|
||||
truncated_query if tokenizer.padding_side == "right" else span_doc_tokens,
|
||||
span_doc_tokens if tokenizer.padding_side == "right" else truncated_query,
|
||||
max_length=max_seq_length,
|
||||
return_overflowing_tokens=True,
|
||||
truncated_query if tokenizer.padding_side == "right" else span_doc_tokens,
|
||||
span_doc_tokens if tokenizer.padding_side == "right" else truncated_query,
|
||||
max_length=max_seq_length,
|
||||
return_overflowing_tokens=True,
|
||||
pad_to_max_length=True,
|
||||
stride=max_seq_length - doc_stride - len(truncated_query) - sequence_pair_added_tokens,
|
||||
truncation_strategy='only_second' if tokenizer.padding_side == "right" else 'only_first'
|
||||
truncation_strategy="only_second" if tokenizer.padding_side == "right" else "only_first",
|
||||
)
|
||||
|
||||
paragraph_len = min(len(all_doc_tokens) - len(spans) * doc_stride, max_seq_length - len(truncated_query) - sequence_pair_added_tokens)
|
||||
paragraph_len = min(
|
||||
len(all_doc_tokens) - len(spans) * doc_stride,
|
||||
max_seq_length - len(truncated_query) - sequence_pair_added_tokens,
|
||||
)
|
||||
|
||||
if tokenizer.pad_token_id in encoded_dict['input_ids']:
|
||||
non_padded_ids = encoded_dict['input_ids'][:encoded_dict['input_ids'].index(tokenizer.pad_token_id)]
|
||||
if tokenizer.pad_token_id in encoded_dict["input_ids"]:
|
||||
non_padded_ids = encoded_dict["input_ids"][: encoded_dict["input_ids"].index(tokenizer.pad_token_id)]
|
||||
else:
|
||||
non_padded_ids = encoded_dict['input_ids']
|
||||
non_padded_ids = encoded_dict["input_ids"]
|
||||
|
||||
tokens = tokenizer.convert_ids_to_tokens(non_padded_ids)
|
||||
|
||||
token_to_orig_map = {}
|
||||
for i in range(paragraph_len):
|
||||
index = len(truncated_query) + sequence_added_tokens + i if tokenizer.padding_side == "right" else i
|
||||
index = len(truncated_query) + sequence_added_tokens + i if tokenizer.padding_side == "right" else i
|
||||
token_to_orig_map[index] = tok_to_orig_index[len(spans) * doc_stride + i]
|
||||
|
||||
encoded_dict["paragraph_len"] = paragraph_len
|
||||
@ -202,16 +209,20 @@ def squad_convert_examples_to_features(examples, tokenizer, max_seq_length,
|
||||
for doc_span_index in range(len(spans)):
|
||||
for j in range(spans[doc_span_index]["paragraph_len"]):
|
||||
is_max_context = _new_check_is_max_context(spans, doc_span_index, doc_span_index * doc_stride + j)
|
||||
index = j if tokenizer.padding_side == "left" else spans[doc_span_index]["truncated_query_with_special_tokens_length"] + j
|
||||
index = (
|
||||
j
|
||||
if tokenizer.padding_side == "left"
|
||||
else spans[doc_span_index]["truncated_query_with_special_tokens_length"] + j
|
||||
)
|
||||
spans[doc_span_index]["token_is_max_context"][index] = is_max_context
|
||||
|
||||
for span in spans:
|
||||
# Identify the position of the CLS token
|
||||
cls_index = span['input_ids'].index(tokenizer.cls_token_id)
|
||||
cls_index = span["input_ids"].index(tokenizer.cls_token_id)
|
||||
|
||||
# p_mask: mask with 1 for tokens that cannot be in the answer (0 for tokens which can be in an answer)
# Original TF implem also keeps the classification token (set to 0) (not sure why...)
|
||||
p_mask = np.array(span['token_type_ids'])
|
||||
p_mask = np.array(span["token_type_ids"])
|
||||
|
||||
p_mask = np.minimum(p_mask, 1)
|
||||
|
||||
@ -224,7 +235,6 @@ def squad_convert_examples_to_features(examples, tokenizer, max_seq_length,
|
||||
# Set the CLS index to '0'
|
||||
p_mask[cls_index] = 0
|
||||
|
||||
|
||||
span_is_impossible = example.is_impossible
|
||||
start_position = 0
|
||||
end_position = 0
|
||||
@ -247,55 +257,99 @@ def squad_convert_examples_to_features(examples, tokenizer, max_seq_length,
|
||||
doc_offset = 0
|
||||
else:
|
||||
doc_offset = len(truncated_query) + sequence_added_tokens
|
||||
|
||||
|
||||
start_position = tok_start_position - doc_start + doc_offset
|
||||
end_position = tok_end_position - doc_start + doc_offset
|
||||
|
||||
|
||||
features.append(SquadFeatures(
|
||||
span['input_ids'],
|
||||
span['attention_mask'],
|
||||
span['token_type_ids'],
|
||||
cls_index,
|
||||
p_mask.tolist(),
|
||||
|
||||
example_index=example_index,
|
||||
unique_id=unique_id,
|
||||
paragraph_len=span['paragraph_len'],
|
||||
token_is_max_context=span["token_is_max_context"],
|
||||
tokens=span["tokens"],
|
||||
token_to_orig_map=span["token_to_orig_map"],
|
||||
|
||||
start_position=start_position,
|
||||
end_position=end_position
|
||||
))
|
||||
features.append(
|
||||
SquadFeatures(
|
||||
span["input_ids"],
|
||||
span["attention_mask"],
|
||||
span["token_type_ids"],
|
||||
cls_index,
|
||||
p_mask.tolist(),
|
||||
example_index=example_index,
|
||||
unique_id=unique_id,
|
||||
paragraph_len=span["paragraph_len"],
|
||||
token_is_max_context=span["token_is_max_context"],
|
||||
tokens=span["tokens"],
|
||||
token_to_orig_map=span["token_to_orig_map"],
|
||||
start_position=start_position,
|
||||
end_position=end_position,
|
||||
)
|
||||
)
|
||||
|
||||
unique_id += 1
|
||||
|
||||
if return_dataset == 'pt':
|
||||
if return_dataset == "pt":
|
||||
if not is_torch_available():
|
||||
raise ImportError("Pytorch must be installed to return a pytorch dataset.")
|
||||
|
||||
# Convert to Tensors and build dataset
|
||||
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
|
||||
all_input_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
|
||||
all_segment_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
|
||||
all_attention_masks = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
|
||||
all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
|
||||
all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long)
|
||||
all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float)
|
||||
|
||||
if not is_training:
|
||||
all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
|
||||
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
|
||||
all_example_index, all_cls_index, all_p_mask)
|
||||
dataset = TensorDataset(
|
||||
all_input_ids, all_attention_masks, all_token_type_ids, all_example_index, all_cls_index, all_p_mask
|
||||
)
|
||||
else:
|
||||
all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long)
|
||||
all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long)
|
||||
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
|
||||
all_start_positions, all_end_positions,
|
||||
all_cls_index, all_p_mask)
|
||||
dataset = TensorDataset(
|
||||
all_input_ids,
|
||||
all_attention_masks,
|
||||
all_token_type_ids,
|
||||
all_start_positions,
|
||||
all_end_positions,
|
||||
all_cls_index,
|
||||
all_p_mask,
|
||||
)
|
||||
|
||||
return features, dataset
|
||||
|
||||
elif return_dataset == "tf":
|
||||
if not is_tf_available():
|
||||
raise ImportError("TensorFlow must be installed to return a TensorFlow dataset.")
|
||||
|
||||
def gen():
|
||||
for ex in features:
|
||||
yield (
|
||||
{
|
||||
"input_ids": ex.input_ids,
|
||||
"attention_mask": ex.attention_mask,
|
||||
"token_type_ids": ex.token_type_ids,
|
||||
}, {
|
||||
"start_position": ex.start_position,
|
||||
"end_position": ex.end_position,
|
||||
"cls_index": ex.cls_index,
|
||||
"p_mask": ex.p_mask,
|
||||
}
|
||||
)
|
||||
|
||||
return tf.data.Dataset.from_generator(
|
||||
gen,
|
||||
(
|
||||
{"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32},
|
||||
{"start_position": tf.int64, "end_position": tf.int64, "cls_index": tf.int64, "p_mask": tf.int32},
|
||||
),
|
||||
(
|
||||
{
|
||||
"input_ids": tf.TensorShape([None]),
|
||||
"attention_mask": tf.TensorShape([None]),
|
||||
"token_type_ids": tf.TensorShape([None]),
|
||||
},
|
||||
{
|
||||
"start_position": tf.TensorShape([]),
|
||||
"end_position": tf.TensorShape([]),
|
||||
"cls_index": tf.TensorShape([]),
|
||||
"p_mask": tf.TensorShape([None]),
|
||||
},
|
||||
),
|
||||
)
|
||||
|
||||
return features
|
||||
|
||||
@ -305,31 +359,32 @@ class SquadProcessor(DataProcessor):
|
||||
Processor for the SQuAD data set.
|
||||
Overriden by SquadV1Processor and SquadV2Processor, used by the version 1.1 and version 2.0 of SQuAD, respectively.
|
||||
"""
|
||||
|
||||
train_file = None
|
||||
dev_file = None
|
||||
|
||||
def _get_example_from_tensor_dict(self, tensor_dict, evaluate=False):
|
||||
if not evaluate:
|
||||
answer = tensor_dict['answers']['text'][0].numpy().decode('utf-8')
|
||||
answer_start = tensor_dict['answers']['answer_start'][0].numpy()
|
||||
answer = tensor_dict["answers"]["text"][0].numpy().decode("utf-8")
|
||||
answer_start = tensor_dict["answers"]["answer_start"][0].numpy()
|
||||
answers = []
|
||||
else:
|
||||
answers = [{
|
||||
"answer_start": start.numpy(),
|
||||
"text": text.numpy().decode('utf-8')
|
||||
} for start, text in zip(tensor_dict['answers']["answer_start"], tensor_dict['answers']["text"])]
|
||||
answers = [
|
||||
{"answer_start": start.numpy(), "text": text.numpy().decode("utf-8")}
|
||||
for start, text in zip(tensor_dict["answers"]["answer_start"], tensor_dict["answers"]["text"])
|
||||
]
|
||||
|
||||
answer = None
|
||||
answer_start = None
|
||||
|
||||
return SquadExample(
|
||||
qas_id=tensor_dict['id'].numpy().decode("utf-8"),
|
||||
question_text=tensor_dict['question'].numpy().decode('utf-8'),
|
||||
context_text=tensor_dict['context'].numpy().decode('utf-8'),
|
||||
qas_id=tensor_dict["id"].numpy().decode("utf-8"),
|
||||
question_text=tensor_dict["question"].numpy().decode("utf-8"),
|
||||
context_text=tensor_dict["context"].numpy().decode("utf-8"),
|
||||
answer_text=answer,
|
||||
start_position_character=answer_start,
|
||||
title=tensor_dict['title'].numpy().decode('utf-8'),
|
||||
answers=answers
|
||||
title=tensor_dict["title"].numpy().decode("utf-8"),
|
||||
answers=answers,
|
||||
)
|
||||
|
||||
def get_examples_from_dataset(self, dataset, evaluate=False):
|
||||
@ -359,7 +414,7 @@ class SquadProcessor(DataProcessor):
|
||||
|
||||
examples = []
|
||||
for tensor_dict in tqdm(dataset):
|
||||
examples.append(self._get_example_from_tensor_dict(tensor_dict, evaluate=evaluate))
|
||||
examples.append(self._get_example_from_tensor_dict(tensor_dict, evaluate=evaluate))
|
||||
|
||||
return examples
|
||||
|
||||
@ -379,7 +434,9 @@ class SquadProcessor(DataProcessor):
|
||||
if self.train_file is None:
|
||||
raise ValueError("SquadProcessor should be instantiated via SquadV1Processor or SquadV2Processor")
|
||||
|
||||
with open(os.path.join(data_dir, self.train_file if filename is None else filename), "r", encoding='utf-8') as reader:
|
||||
with open(
|
||||
os.path.join(data_dir, self.train_file if filename is None else filename), "r", encoding="utf-8"
|
||||
) as reader:
|
||||
input_data = json.load(reader)["data"]
|
||||
return self._create_examples(input_data, "train")
|
||||
|
||||
@ -397,8 +454,10 @@ class SquadProcessor(DataProcessor):
|
||||
|
||||
if self.dev_file is None:
|
||||
raise ValueError("SquadProcessor should be instantiated via SquadV1Processor or SquadV2Processor")
|
||||
|
||||
with open(os.path.join(data_dir, self.dev_file if filename is None else filename), "r", encoding='utf-8') as reader:
|
||||
|
||||
with open(
|
||||
os.path.join(data_dir, self.dev_file if filename is None else filename), "r", encoding="utf-8"
|
||||
) as reader:
|
||||
input_data = json.load(reader)["data"]
|
||||
return self._create_examples(input_data, "dev")
|
||||
|
||||
@ -406,7 +465,7 @@ class SquadProcessor(DataProcessor):
|
||||
is_training = set_type == "train"
|
||||
examples = []
|
||||
for entry in tqdm(input_data):
|
||||
title = entry['title']
|
||||
title = entry["title"]
|
||||
for paragraph in entry["paragraphs"]:
|
||||
context_text = paragraph["context"]
|
||||
for qa in paragraph["qas"]:
|
||||
@ -415,7 +474,7 @@ class SquadProcessor(DataProcessor):
|
||||
start_position_character = None
|
||||
answer_text = None
|
||||
answers = []
|
||||
|
||||
|
||||
if "is_impossible" in qa:
|
||||
is_impossible = qa["is_impossible"]
|
||||
else:
|
||||
@ -424,8 +483,8 @@ class SquadProcessor(DataProcessor):
|
||||
if not is_impossible:
|
||||
if is_training:
|
||||
answer = qa["answers"][0]
|
||||
answer_text = answer['text']
|
||||
start_position_character = answer['answer_start']
|
||||
answer_text = answer["text"]
|
||||
start_position_character = answer["answer_start"]
|
||||
else:
|
||||
answers = qa["answers"]
|
||||
|
||||
@ -437,12 +496,13 @@ class SquadProcessor(DataProcessor):
|
||||
start_position_character=start_position_character,
|
||||
title=title,
|
||||
is_impossible=is_impossible,
|
||||
answers=answers
|
||||
answers=answers,
|
||||
)
|
||||
|
||||
examples.append(example)
|
||||
return examples
|
||||
|
||||
|
||||
class SquadV1Processor(SquadProcessor):
|
||||
train_file = "train-v1.1.json"
|
||||
dev_file = "dev-v1.1.json"
|
||||
@ -451,7 +511,7 @@ class SquadV1Processor(SquadProcessor):
|
||||
class SquadV2Processor(SquadProcessor):
|
||||
train_file = "train-v2.0.json"
|
||||
dev_file = "dev-v2.0.json"
|
||||
|
||||
|
||||
|
||||
class SquadExample(object):
|
||||
"""
|
||||
@ -468,21 +528,23 @@ class SquadExample(object):
|
||||
is_impossible: False by default, set to True if the example has no possible answer.
|
||||
"""
|
||||
|
||||
def __init__(self,
|
||||
qas_id,
|
||||
question_text,
|
||||
context_text,
|
||||
answer_text,
|
||||
start_position_character,
|
||||
title,
|
||||
answers=[],
|
||||
is_impossible=False):
|
||||
def __init__(
|
||||
self,
|
||||
qas_id,
|
||||
question_text,
|
||||
context_text,
|
||||
answer_text,
|
||||
start_position_character,
|
||||
title,
|
||||
answers=[],
|
||||
is_impossible=False,
|
||||
):
|
||||
self.qas_id = qas_id
|
||||
self.question_text = question_text
|
||||
self.context_text = context_text
|
||||
self.answer_text = answer_text
|
||||
self.title = title
|
||||
self.is_impossible = is_impossible
|
||||
self.is_impossible = is_impossible
|
||||
self.answers = answers
|
||||
|
||||
self.start_position, self.end_position = 0, 0
|
||||
@ -537,24 +599,23 @@ class SquadFeatures(object):
|
||||
end_position: end of the answer token index
|
||||
"""
|
||||
|
||||
def __init__(self,
|
||||
input_ids,
|
||||
attention_mask,
|
||||
token_type_ids,
|
||||
cls_index,
|
||||
p_mask,
|
||||
|
||||
example_index,
|
||||
unique_id,
|
||||
paragraph_len,
|
||||
token_is_max_context,
|
||||
tokens,
|
||||
token_to_orig_map,
|
||||
|
||||
start_position,
|
||||
end_position
|
||||
):
|
||||
self.input_ids = input_ids
|
||||
def __init__(
|
||||
self,
|
||||
input_ids,
|
||||
attention_mask,
|
||||
token_type_ids,
|
||||
cls_index,
|
||||
p_mask,
|
||||
example_index,
|
||||
unique_id,
|
||||
paragraph_len,
|
||||
token_is_max_context,
|
||||
tokens,
|
||||
token_to_orig_map,
|
||||
start_position,
|
||||
end_position,
|
||||
):
|
||||
self.input_ids = input_ids
|
||||
self.attention_mask = attention_mask
|
||||
self.token_type_ids = token_type_ids
|
||||
self.cls_index = cls_index
|
||||
@ -580,12 +641,13 @@ class SquadResult(object):
|
||||
start_logits: The logits corresponding to the start of the answer
|
||||
end_logits: The logits corresponding to the end of the answer
|
||||
"""
|
||||
|
||||
def __init__(self, unique_id, start_logits, end_logits, start_top_index=None, end_top_index=None, cls_logits=None):
|
||||
self.start_logits = start_logits
|
||||
self.end_logits = end_logits
|
||||
self.unique_id = unique_id
|
||||
|
||||
|
||||
if start_top_index:
|
||||
self.start_top_index = start_top_index
|
||||
self.end_top_index = end_top_index
|
||||
self.cls_logits = cls_logits
|
||||
self.cls_logits = cls_logits
|
||||
|
@ -73,8 +73,13 @@ TF2_WEIGHTS_NAME = 'tf_model.h5'
|
||||
TF_WEIGHTS_NAME = 'model.ckpt'
|
||||
CONFIG_NAME = "config.json"
|
||||
|
||||
|
||||
DUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
||||
DUMMY_MASK = [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]
|
||||
|
||||
S3_BUCKET_PREFIX = "https://s3.amazonaws.com/models.huggingface.co/bert"
|
||||
|
||||
|
||||
def is_torch_available():
|
||||
return _torch_available
|
||||
|
||||
|
@ -131,8 +131,9 @@ class HfApi:
|
||||
# the client still has to specify it when uploading the file.
|
||||
with open(filepath, "rb") as f:
|
||||
pf = TqdmProgressFileReader(f)
|
||||
data = f if pf.total_size > 0 else ""
|
||||
|
||||
r = requests.put(urls.write, data=f, headers={
|
||||
r = requests.put(urls.write, data=data, headers={
|
||||
"content-type": urls.type,
|
||||
})
|
||||
r.raise_for_status()
|
||||
|
@ -29,6 +29,7 @@ from .modeling_roberta import RobertaModel, RobertaForMaskedLM, RobertaForSequen
|
||||
from .modeling_distilbert import DistilBertModel, DistilBertForQuestionAnswering, DistilBertForMaskedLM, DistilBertForSequenceClassification
|
||||
from .modeling_camembert import CamembertModel, CamembertForMaskedLM, CamembertForSequenceClassification, CamembertForMultipleChoice
|
||||
from .modeling_albert import AlbertModel, AlbertForMaskedLM, AlbertForSequenceClassification, AlbertForQuestionAnswering
|
||||
from .modeling_t5 import T5Model, T5WithLMHeadModel
|
||||
|
||||
from .modeling_utils import PreTrainedModel, SequenceSummary
|
||||
|
||||
@ -49,6 +50,7 @@ class AutoModel(object):
|
||||
|
||||
The base model class to instantiate is selected as the first pattern matching
|
||||
in the `pretrained_model_name_or_path` string (in the following order):
|
||||
- contains `t5`: T5Model (T5 model)
|
||||
- contains `distilbert`: DistilBertModel (DistilBERT model)
|
||||
- contains `albert`: AlbertModel (ALBERT model)
|
||||
- contains `camembert`: CamembertModel (CamemBERT model)
|
||||
@ -74,6 +76,7 @@ class AutoModel(object):
|
||||
|
||||
The model class to instantiate is selected as the first pattern matching
|
||||
in the `pretrained_model_name_or_path` string (in the following order):
|
||||
- contains `t5`: T5Model (T5 model)
|
||||
- contains `distilbert`: DistilBertModel (DistilBERT model)
|
||||
- contains `albert`: AlbertModel (ALBERT model)
|
||||
- contains `camembert`: CamembertModel (CamemBERT model)
|
||||
@ -146,7 +149,9 @@ class AutoModel(object):
|
||||
model = AutoModel.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
|
||||
|
||||
"""
|
||||
if 'distilbert' in pretrained_model_name_or_path:
|
||||
if 't5' in pretrained_model_name_or_path:
|
||||
return T5Model.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
elif 'distilbert' in pretrained_model_name_or_path:
|
||||
return DistilBertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
elif 'albert' in pretrained_model_name_or_path:
|
||||
return AlbertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
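# Illustrative note (not part of the diff above): the dispatch is purely string based,
# so a shortcut name containing "t5" resolves to T5Model and one containing
# "distilbert" to DistilBertModel; a name matching several keywords resolves to the
# first hit in the if/elif order shown here.
#   AutoModel.from_pretrained('t5-small')                 # -> T5Model
#   AutoModel.from_pretrained('distilbert-base-uncased')  # -> DistilBertModel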
|
||||
@ -185,6 +190,7 @@ class AutoModelWithLMHead(object):
|
||||
|
||||
The model class to instantiate is selected as the first pattern matching
|
||||
in the `pretrained_model_name_or_path` string (in the following order):
|
||||
- contains `t5`: T5WithLMHeadModel (T5 model)
|
||||
- contains `distilbert`: DistilBertForMaskedLM (DistilBERT model)
|
||||
- contains `albert`: AlbertForMaskedLM (ALBERT model)
|
||||
- contains `camembert`: CamembertForMaskedLM (CamemBERT model)
|
||||
@ -213,6 +219,7 @@ class AutoModelWithLMHead(object):
|
||||
|
||||
The model class to instantiate is selected as the first pattern matching
|
||||
in the `pretrained_model_name_or_path` string (in the following order):
|
||||
- contains `t5`: T5WithLMHeadModel (T5 model)
|
||||
- contains `distilbert`: DistilBertForMaskedLM (DistilBERT model)
|
||||
- contains `albert`: AlbertForMaskedLM (ALBERT model)
|
||||
- contains `camembert`: CamembertForMaskedLM (CamemBERT model)
|
||||
@ -284,7 +291,9 @@ class AutoModelWithLMHead(object):
|
||||
model = AutoModelWithLMHead.from_pretrained('./tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
|
||||
|
||||
"""
|
||||
if 'distilbert' in pretrained_model_name_or_path:
|
||||
if 't5' in pretrained_model_name_or_path:
|
||||
return T5WithLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
elif 'distilbert' in pretrained_model_name_or_path:
|
||||
return DistilBertForMaskedLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
elif 'albert' in pretrained_model_name_or_path:
|
||||
return AlbertForMaskedLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
|
@ -219,9 +219,7 @@ class PreTrainedEncoderDecoder(nn.Module):
|
||||
encoder_hidden_states = kwargs_encoder.pop("hidden_states", None)
|
||||
if encoder_hidden_states is None:
|
||||
encoder_outputs = self.encoder(encoder_input_ids, **kwargs_encoder)
|
||||
encoder_hidden_states = encoder_outputs[
|
||||
0
|
||||
] # output the last layer hidden state
|
||||
encoder_hidden_states = encoder_outputs[0]
|
||||
else:
|
||||
encoder_outputs = ()
|
||||
|
||||
|
transformers/modeling_t5.py (new file, 886 lines)
@ -0,0 +1,886 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 Mesh TensorFlow authors, T5 Authors and HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" PyTorch T5 model. """
|
||||
|
||||
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||
|
||||
import json
|
||||
import logging
|
||||
import math
|
||||
import os
|
||||
import sys
|
||||
import copy
|
||||
import itertools
|
||||
from io import open
|
||||
|
||||
import torch
|
||||
from torch import nn
|
||||
import torch.nn.functional as F
|
||||
from torch.nn import CrossEntropyLoss, MSELoss
|
||||
|
||||
from .modeling_utils import PreTrainedModel, prune_linear_layer
|
||||
from .configuration_t5 import T5Config
|
||||
from .file_utils import add_start_docstrings, DUMMY_INPUTS, DUMMY_MASK
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
####################################################
|
||||
# This dict contains shortcut names and associated URLs
|
||||
# for the pretrained weights provided with the models
|
||||
####################################################
|
||||
T5_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
||||
't5-small': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-small-pytorch_model.bin",
|
||||
't5-base': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-base-pytorch_model.bin",
|
||||
't5-large': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-large-pytorch_model.bin",
|
||||
't5-3b': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-3b-pytorch_model.bin",
|
||||
't5-11b': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-11b-pytorch_model.bin",
|
||||
}
|
||||
|
||||
####################################################
|
||||
# This is a conversion method from TF 1.0 to PyTorch
|
||||
# More details: https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28
|
||||
####################################################
|
||||
def load_tf_weights_in_t5(model, config, tf_checkpoint_path):
|
||||
""" Load tf checkpoints in a pytorch model.
|
||||
"""
|
||||
try:
|
||||
import re
|
||||
import numpy as np
|
||||
import tensorflow as tf
|
||||
except ImportError:
|
||||
logger.error("Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see "
|
||||
"https://www.tensorflow.org/install/ for installation instructions.")
|
||||
raise
|
||||
tf_path = os.path.abspath(tf_checkpoint_path)
|
||||
logger.info("Converting TensorFlow checkpoint from {}".format(tf_path))
|
||||
# Load weights from TF model
|
||||
init_vars = tf.train.list_variables(tf_path)
|
||||
names = []
|
||||
tf_weights = {}
|
||||
for name, shape in init_vars:
|
||||
logger.info("Loading TF weight {} with shape {}".format(name, shape))
|
||||
array = tf.train.load_variable(tf_path, name)
|
||||
names.append(name)
|
||||
tf_weights[name] = array
|
||||
|
||||
for txt_name in names:
|
||||
name = txt_name.split('/')
|
||||
# adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculate m and v
|
||||
# which are not required for using pretrained model
|
||||
if any(n in ["adam_v", "adam_m", "global_step"] for n in name):
|
||||
logger.info("Skipping {}".format("/".join(name)))
|
||||
tf_weights.pop(txt_name, None)
|
||||
continue
|
||||
if '_slot_' in name[-1]:
|
||||
logger.info("Skipping {}".format("/".join(name)))
|
||||
tf_weights.pop(txt_name, None)
|
||||
continue
|
||||
pointer = model
|
||||
array = tf_weights[txt_name]
|
||||
for m_name in name:
|
||||
if re.fullmatch(r'[A-Za-z]+_\d+', m_name):
|
||||
l = re.split(r'_(\d+)', m_name)
|
||||
else:
|
||||
l = [m_name]
|
||||
if l[0] in ['kernel', 'scale', 'embedding']:
|
||||
pointer = getattr(pointer, 'weight')
|
||||
# elif l[0] == 'scale':
|
||||
# pointer = getattr(pointer, 'weight')
|
||||
# elif l[0] == 'output_bias' or l[0] == 'beta':
|
||||
# pointer = getattr(pointer, 'bias')
|
||||
# elif l[0] == 'squad':
|
||||
# pointer = getattr(pointer, 'classifier')
|
||||
else:
|
||||
try:
|
||||
pointer = getattr(pointer, l[0])
|
||||
except AttributeError:
|
||||
logger.info("Skipping {}".format("/".join(name)))
|
||||
continue
|
||||
if len(l) >= 2:
|
||||
num = int(l[1])
|
||||
pointer = pointer[num]
|
||||
if l[0] not in ['kernel', 'scale', 'embedding']:
|
||||
pointer = getattr(pointer, 'weight')
|
||||
if l[0] != 'embedding':
|
||||
logger.info("Transposing numpy weight of shape {} for {}".format(array.shape, name))
|
||||
array = np.transpose(array)
|
||||
try:
|
||||
assert pointer.shape == array.shape
|
||||
except AssertionError as e:
|
||||
e.args += (pointer.shape, array.shape)
|
||||
raise
|
||||
logger.info("Initialize PyTorch weight {}".format(name))
|
||||
pointer.data = torch.from_numpy(array.astype(np.float32))
|
||||
tf_weights.pop(txt_name, None)
|
||||
|
||||
logger.info("Weights not copied to PyTorch model: {}".format(', '.join(tf_weights.keys())))
|
||||
# logger.info("Weights not copied to PyTorch model: {}".format(', '.join(tf_weights.keys())))
|
||||
return model
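# Illustrative usage (hedged sketch; the checkpoint path is hypothetical):
#   config = T5Config.from_pretrained('t5-small')
#   model = T5Model(config)
#   load_tf_weights_in_t5(model, config, '/path/to/tf_checkpoint/model.ckpt')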
|
||||
|
||||
|
||||
####################################################
|
||||
# PyTorch Models are constructed by sub-classing
|
||||
# - torch.nn.Module for the layers and
|
||||
# - PreTrainedModel for the models (itself a sub-class of torch.nn.Module)
|
||||
####################################################
|
||||
|
||||
class T5LayerNorm(nn.Module):
|
||||
def __init__(self, hidden_size, eps=1e-6):
|
||||
""" Construct a layernorm module in the T5 style
|
||||
No bias and no subtraction of mean.
|
||||
"""
|
||||
super(T5LayerNorm, self).__init__()
|
||||
self.weight = nn.Parameter(torch.ones(hidden_size))
|
||||
self.variance_epsilon = eps
|
||||
|
||||
def forward(self, x):
|
||||
variance = x.pow(2).mean(-1, keepdim=True)
|
||||
x = x / torch.sqrt(variance + self.variance_epsilon)
|
||||
return self.weight * x
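# Minimal check of the RMS-style normalization above (illustrative, not part of the
# module): there is no mean subtraction, only division by the root mean square.
#   x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
#   ln = T5LayerNorm(hidden_size=4)
#   out = ln(x)           # x / sqrt(mean(x**2) + eps), scaled by a learned weight
#   out.pow(2).mean(-1)   # ~1.0, since the weight is initialized to ones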
|
||||
|
||||
|
||||
class T5DenseReluDense(nn.Module):
|
||||
def __init__(self, config):
|
||||
super(T5DenseReluDense, self).__init__()
|
||||
self.wi = nn.Linear(config.d_model, config.d_ff, bias=False)
|
||||
self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
|
||||
self.dropout = nn.Dropout(config.dropout_rate)
|
||||
|
||||
def forward(self, hidden_states):
|
||||
h = self.wi(hidden_states)
|
||||
h = F.relu(h)
|
||||
h = self.dropout(h)
|
||||
h = self.wo(h)
|
||||
return h
|
||||
|
||||
|
||||
class T5LayerFF(nn.Module):
|
||||
def __init__(self, config):
|
||||
super(T5LayerFF, self).__init__()
|
||||
self.DenseReluDense = T5DenseReluDense(config)
|
||||
self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
|
||||
self.dropout = nn.Dropout(config.dropout_rate)
|
||||
|
||||
def forward(self, hidden_states):
|
||||
norm_x = self.layer_norm(hidden_states)
|
||||
y = self.DenseReluDense(norm_x)
|
||||
layer_output = hidden_states + self.dropout(y)
|
||||
return layer_output
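# Note: this is a "pre-norm" residual block - the layer norm is applied to the input of
# the feed-forward sublayer and the un-normalized hidden states carry the residual.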
|
||||
|
||||
|
||||
class T5Attention(nn.Module):
|
||||
NEW_ID = itertools.count()
|
||||
|
||||
def __init__(self, config, has_relative_attention_bias=False):
|
||||
super(T5Attention, self).__init__()
|
||||
self.layer_id = next(T5Attention.NEW_ID)
|
||||
self.is_decoder = config.is_decoder
|
||||
self.has_relative_attention_bias = has_relative_attention_bias
|
||||
|
||||
self.output_attentions = config.output_attentions
|
||||
self.relative_attention_num_buckets = config.relative_attention_num_buckets
|
||||
self.d_model = config.d_model
|
||||
self.d_kv = config.d_kv
|
||||
self.n_heads = config.num_heads
|
||||
self.dropout = config.dropout_rate
|
||||
self.inner_dim = self.n_heads * self.d_kv
|
||||
|
||||
# Mesh TensorFlow initialization to avoid scaling before softmax
|
||||
self.q = nn.Linear(self.d_model, self.inner_dim, bias=False)
|
||||
self.k = nn.Linear(self.d_model, self.inner_dim, bias=False)
|
||||
self.v = nn.Linear(self.d_model, self.inner_dim, bias=False)
|
||||
self.o = nn.Linear(self.inner_dim, self.d_model, bias=False)
|
||||
|
||||
if self.has_relative_attention_bias:
|
||||
self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets, self.n_heads)
|
||||
self.pruned_heads = set()
|
||||
|
||||
def prune_heads(self, heads):
|
||||
if len(heads) == 0:
|
||||
return
|
||||
mask = torch.ones(self.n_heads, self.d_kv)
|
||||
heads = set(heads) - self.pruned_heads
|
||||
for head in heads:
|
||||
head -= sum(1 if h < head else 0 for h in self.pruned_heads)
|
||||
mask[head] = 0
|
||||
mask = mask.view(-1).contiguous().eq(1)
|
||||
index = torch.arange(len(mask))[mask].long()
|
||||
# Prune linear layers
|
||||
self.q = prune_linear_layer(self.q, index)
|
||||
self.k = prune_linear_layer(self.k, index)
|
||||
self.v = prune_linear_layer(self.v, index)
|
||||
self.o = prune_linear_layer(self.o, index, dim=1)
|
||||
# Update hyper params
|
||||
self.n_heads = self.n_heads - len(heads)
|
||||
self.inner_dim = self.d_kv * self.n_heads
|
||||
self.pruned_heads = self.pruned_heads.union(heads)
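# Illustrative usage (hedged sketch of the module path, following the T5Stack/T5Block
# structure defined below): removing heads 0 and 2 of the first self-attention layer.
#   model.encoder.block[0].layer[0].SelfAttention.prune_heads([0, 2])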
|
||||
|
||||
@staticmethod
|
||||
def _relative_position_bucket(relative_position,
|
||||
bidirectional=True,
|
||||
num_buckets=32,
|
||||
max_distance=128):
|
||||
"""
|
||||
Adapted from Mesh Tensorflow:
|
||||
https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593
|
||||
|
||||
Translate relative position to a bucket number for relative attention.
|
||||
The relative position is defined as memory_position - query_position, i.e.
|
||||
the distance in tokens from the attending position to the attended-to
|
||||
position. If bidirectional=False, then positive relative positions are
|
||||
invalid.
|
||||
We use smaller buckets for small absolute relative_position and larger buckets
|
||||
for larger absolute relative_positions. All relative positions >=max_distance
|
||||
map to the same bucket. All relative positions <=-max_distance map to the
|
||||
same bucket. This should allow for more graceful generalization to longer
|
||||
sequences than the model has been trained on.
|
||||
Args:
|
||||
relative_position: an int32 Tensor
|
||||
bidirectional: a boolean - whether the attention is bidirectional
|
||||
num_buckets: an integer
|
||||
max_distance: an integer
|
||||
Returns:
|
||||
a Tensor with the same shape as relative_position, containing int32
|
||||
values in the range [0, num_buckets)
|
||||
"""
|
||||
ret = 0
|
||||
n = -relative_position
|
||||
if bidirectional:
|
||||
num_buckets //= 2
|
||||
ret += (n < 0).to(torch.long) * num_buckets # mtf.to_int32(mtf.less(n, 0)) * num_buckets
|
||||
n = torch.abs(n)
|
||||
else:
|
||||
n = torch.max(n, torch.zeros_like(n))
|
||||
# now n is in the range [0, inf)
|
||||
|
||||
# half of the buckets are for exact increments in positions
|
||||
max_exact = num_buckets // 2
|
||||
is_small = (n < max_exact)
|
||||
|
||||
# The other half of the buckets are for logarithmically bigger bins in positions up to max_distance
|
||||
val_if_large = max_exact + (
|
||||
torch.log(n.float() / max_exact)
|
||||
/ math.log(max_distance / max_exact) * (num_buckets - max_exact)).to(torch.long)
|
||||
val_if_large = torch.min(val_if_large, torch.full_like(val_if_large, num_buckets - 1))
|
||||
|
||||
ret += torch.where(is_small, n, val_if_large)
|
||||
return ret
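# Illustrative behaviour (not asserting exact bucket ids): with the defaults above,
# small offsets such as -1 or -2 get their own buckets (|n| < max_exact maps to n),
# while distant offsets such as -50 or -500 fall into shared logarithmic buckets, and
# everything beyond max_distance collapses into the last bucket.
#   rel = torch.tensor([[-1, -2, -50, -500]])
#   buckets = T5Attention._relative_position_bucket(rel)  # staticmethod, usable directly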
|
||||
|
||||
def compute_bias(self, qlen, klen):
|
||||
""" Compute binned relative position bias """
|
||||
context_position = torch.arange(qlen, dtype=torch.long)[:, None]
|
||||
memory_position = torch.arange(klen, dtype=torch.long)[None, :]
|
||||
relative_position = memory_position - context_position # shape (qlen, klen)
|
||||
rp_bucket = self._relative_position_bucket(relative_position, # shape (qlen, klen)
|
||||
bidirectional=not self.is_decoder,
|
||||
num_buckets=self.relative_attention_num_buckets)
|
||||
values = self.relative_attention_bias(rp_bucket) # shape (qlen, klen, num_heads)
|
||||
values = values.permute([2, 0, 1]).unsqueeze(0) # shape (1, num_heads, qlen, klen)
|
||||
return values
|
||||
|
||||
def forward(self, input, mask=None, kv=None, position_bias=None, cache=None, head_mask=None):
|
||||
"""
|
||||
Self-attention (if kv is None) or attention over source sentence (provided by kv).
|
||||
"""
|
||||
# Input is (bs, qlen, dim)
|
||||
# Mask is (bs, klen) (non-causal) or (bs, klen, klen)
|
||||
bs, qlen, dim = input.size()
|
||||
if kv is None:
|
||||
klen = qlen if cache is None else cache['slen'] + qlen
|
||||
else:
|
||||
klen = kv.size(1)
|
||||
|
||||
def shape(x):
|
||||
""" projection """
|
||||
return x.view(bs, -1, self.n_heads, self.d_kv).transpose(1, 2)
|
||||
|
||||
def unshape(x):
|
||||
""" compute context """
|
||||
return x.transpose(1, 2).contiguous().view(bs, -1, self.inner_dim)
|
||||
|
||||
q = shape(self.q(input)) # (bs, n_heads, qlen, dim_per_head)
|
||||
if kv is None:
|
||||
k = shape(self.k(input)) # (bs, n_heads, qlen, dim_per_head)
|
||||
v = shape(self.v(input)) # (bs, n_heads, qlen, dim_per_head)
|
||||
elif cache is None or self.layer_id not in cache:
|
||||
k = v = kv
|
||||
k = shape(self.k(k)) # (bs, n_heads, qlen, dim_per_head)
|
||||
v = shape(self.v(v)) # (bs, n_heads, qlen, dim_per_head)
|
||||
|
||||
if cache is not None:
|
||||
if self.layer_id in cache:
|
||||
if kv is None:
|
||||
k_, v_ = cache[self.layer_id]
|
||||
k = torch.cat([k_, k], dim=2) # (bs, n_heads, klen, dim_per_head)
|
||||
v = torch.cat([v_, v], dim=2) # (bs, n_heads, klen, dim_per_head)
|
||||
else:
|
||||
k, v = cache[self.layer_id]
|
||||
cache[self.layer_id] = (k, v)
|
||||
|
||||
# q = q / math.sqrt(dim_per_head) # No scaling in T5
|
||||
scores = torch.einsum('bnqd,bnkd->bnqk', q, k) # (bs, n_heads, qlen, klen)
|
||||
|
||||
if position_bias is None:
|
||||
if not self.has_relative_attention_bias:
|
||||
raise ValueError("No position_bias provided and no weights to compute position_bias")
|
||||
position_bias = self.compute_bias(qlen, klen)
|
||||
if mask is not None:
|
||||
position_bias = position_bias + mask # (bs, n_heads, qlen, klen)
|
||||
|
||||
scores += position_bias
|
||||
weights = F.softmax(scores.float(), dim=-1).type_as(scores) # (bs, n_heads, qlen, klen)
|
||||
weights = F.dropout(weights, p=self.dropout, training=self.training) # (bs, n_heads, qlen, klen)
|
||||
|
||||
# Mask heads if we want to
|
||||
if head_mask is not None:
|
||||
weights = weights * head_mask
|
||||
|
||||
context = torch.matmul(weights, v) # (bs, n_heads, qlen, dim_per_head)
|
||||
context = unshape(context) # (bs, qlen, dim)
|
||||
|
||||
context = self.o(context)
|
||||
|
||||
outputs = (context,)
|
||||
if self.output_attentions:
|
||||
outputs = outputs + (weights,)
|
||||
if self.has_relative_attention_bias:
|
||||
outputs = outputs + (position_bias,)
|
||||
return outputs
|
||||
|
||||
|
||||
class T5LayerSelfAttention(nn.Module):
|
||||
def __init__(self, config, has_relative_attention_bias=False):
|
||||
super(T5LayerSelfAttention, self).__init__()
|
||||
self.SelfAttention = T5Attention(config, has_relative_attention_bias=has_relative_attention_bias)
|
||||
self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
|
||||
self.dropout = nn.Dropout(config.dropout_rate)
|
||||
|
||||
def forward(self, hidden_states, attention_mask=None, position_bias=None, head_mask=None):
|
||||
norm_x = self.layer_norm(hidden_states)
|
||||
attention_output = self.SelfAttention(norm_x,
|
||||
mask=attention_mask,
|
||||
position_bias=position_bias,
|
||||
head_mask=head_mask)
|
||||
y = attention_output[0]
|
||||
layer_output = hidden_states + self.dropout(y)
|
||||
outputs = (layer_output,) + attention_output[1:] # add attentions if we output them
|
||||
return outputs
|
||||
|
||||
|
||||
class T5LayerCrossAttention(nn.Module):
|
||||
def __init__(self, config, has_relative_attention_bias=False):
|
||||
super(T5LayerCrossAttention, self).__init__()
|
||||
self.EncDecAttention = T5Attention(config, has_relative_attention_bias=has_relative_attention_bias)
|
||||
self.layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
|
||||
self.dropout = nn.Dropout(config.dropout_rate)
|
||||
|
||||
def forward(self, hidden_states, kv, attention_mask=None, position_bias=None, head_mask=None):
|
||||
norm_x = self.layer_norm(hidden_states)
|
||||
attention_output = self.EncDecAttention(norm_x,
|
||||
mask=attention_mask,
|
||||
kv=kv,
|
||||
position_bias=position_bias,
|
||||
head_mask=head_mask)
|
||||
y = attention_output[0]
|
||||
layer_output = hidden_states + self.dropout(y)
|
||||
outputs = (layer_output,) + attention_output[1:] # add attentions if we output them
|
||||
return outputs
|
||||
|
||||
|
||||
class T5Block(nn.Module):
|
||||
def __init__(self, config, has_relative_attention_bias=False):
|
||||
super(T5Block, self).__init__()
|
||||
self.is_decoder = config.is_decoder
|
||||
self.layer = nn.ModuleList()
|
||||
self.layer.append(T5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias))
|
||||
if self.is_decoder:
|
||||
self.layer.append(T5LayerCrossAttention(config, has_relative_attention_bias=has_relative_attention_bias))
|
||||
self.layer.append(T5LayerFF(config))
|
||||
else:
|
||||
self.layer.append(T5LayerFF(config))
|
||||
|
||||
def forward(self, hidden_states, attention_mask=None, position_bias=None,
|
||||
encoder_hidden_states=None, encoder_attention_mask=None, encoder_decoder_position_bias=None,
|
||||
head_mask=None):
|
||||
self_attention_outputs = self.layer[0](hidden_states,
|
||||
attention_mask=attention_mask,
|
||||
position_bias=position_bias,
|
||||
head_mask=head_mask)
|
||||
hidden_states = self_attention_outputs[0]
|
||||
outputs = self_attention_outputs[1:] # Keep self-attention outputs and relative position weights
|
||||
|
||||
if not self.is_decoder:
|
||||
hidden_states = self.layer[1](hidden_states)
|
||||
else:
|
||||
cross_attention_outputs = self.layer[1](hidden_states,
|
||||
kv=encoder_hidden_states,
|
||||
attention_mask=encoder_attention_mask,
|
||||
position_bias=encoder_decoder_position_bias,
|
||||
head_mask=head_mask)
|
||||
hidden_states = cross_attention_outputs[0]
|
||||
outputs = outputs + cross_attention_outputs[1:] # Keep cross-attention outputs and relative position weights
|
||||
hidden_states = self.layer[2](hidden_states)
|
||||
|
||||
outputs = (hidden_states,) + outputs # add attentions if we output them
|
||||
return outputs # hidden-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
|
||||
|
||||
|
||||
class T5PreTrainedModel(PreTrainedModel):
|
||||
""" An abstract class to handle weights initialization and
|
||||
a simple interface for downloading and loading pretrained models.
|
||||
"""
|
||||
config_class = T5Config
|
||||
pretrained_model_archive_map = T5_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||
load_tf_weights = load_tf_weights_in_t5
|
||||
base_model_prefix = "transformer"
|
||||
|
||||
@property
|
||||
def dummy_inputs(self):
|
||||
input_ids = torch.tensor(DUMMY_INPUTS)
|
||||
input_mask = torch.tensor(DUMMY_MASK)
|
||||
dummy_inputs = {'decoder_input_ids': input_ids,
|
||||
'encoder_input_ids': input_ids,
|
||||
'decoder_attention_mask': input_mask}
|
||||
return dummy_inputs
|
||||
|
||||
def _init_weights(self, module):
|
||||
""" Initialize the weights """
|
||||
factor = self.config.initializer_factor # Used for testing weights initialization
|
||||
if isinstance(module, T5LayerNorm):
|
||||
module.weight.data.fill_(factor*1.0)
|
||||
elif isinstance(module, (T5Model, T5WithLMHeadModel)):
|
||||
# Mesh TensorFlow embeddings initialization
|
||||
# See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L1624
|
||||
module.shared.weight.data.normal_(mean=0.0, std=factor*1.0)
|
||||
elif isinstance(module, T5DenseReluDense):
|
||||
# Mesh TensorFlow FF initialization
|
||||
# See https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/transformer_layers.py#L56
|
||||
# and https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L89
|
||||
module.wi.weight.data.normal_(mean=0.0, std=factor*((self.config.d_model) ** -0.5))
|
||||
if hasattr(module.wi, 'bias') and module.wi.bias is not None:
|
||||
module.wi.bias.data.zero_()
|
||||
module.wo.weight.data.normal_(mean=0.0, std=factor*((self.config.d_ff) ** -0.5))
|
||||
if hasattr(module.wo, 'bias') and module.wo.bias is not None:
|
||||
module.wo.bias.data.zero_()
|
||||
elif isinstance(module, T5Attention):
|
||||
# Mesh TensorFlow attention initialization to avoid scaling before softmax
|
||||
# See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/attention.py#L136
|
||||
d_model = self.config.d_model
|
||||
d_kv = self.config.d_kv
|
||||
n_heads = self.config.num_heads
|
||||
module.q.weight.data.normal_(mean=0.0, std=factor*((d_model * d_kv) ** -0.5))
|
||||
module.k.weight.data.normal_(mean=0.0, std=factor*(d_model ** -0.5))
|
||||
module.v.weight.data.normal_(mean=0.0, std=factor*(d_model ** -0.5))
|
||||
module.o.weight.data.normal_(mean=0.0, std=factor*((n_heads * d_kv) ** -0.5))
|
||||
if module.has_relative_attention_bias:
|
||||
module.relative_attention_bias.weight.data.normal_(mean=0.0, std=factor*((d_model) ** -0.5))
|
||||
|
||||
|
||||
class T5Stack(T5PreTrainedModel):
|
||||
def __init__(self, config):
|
||||
super(T5Stack, self).__init__(config)
|
||||
self.output_attentions = config.output_attentions
|
||||
self.output_hidden_states = config.output_hidden_states
|
||||
self.is_decoder = config.is_decoder
|
||||
|
||||
self.block = nn.ModuleList([T5Block(config, has_relative_attention_bias=bool(i == 0))
|
||||
for i in range(config.num_layers)])
|
||||
self.final_layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
|
||||
self.dropout = nn.Dropout(config.dropout_rate)
|
||||
|
||||
self.init_weights()
|
||||
|
||||
def forward(self,
|
||||
hidden_states,
|
||||
attention_mask=None,
|
||||
encoder_hidden_states=None,
|
||||
encoder_attention_mask=None,
|
||||
head_mask=None):
|
||||
|
||||
batch_size, seq_length = hidden_states.shape[0], hidden_states.shape[1]
|
||||
if attention_mask is None:
|
||||
attention_mask = torch.ones(batch_size, seq_length).to(hidden_states.device)
|
||||
if self.is_decoder and encoder_attention_mask is None:
|
||||
encoder_seq_length = encoder_hidden_states.shape[1]
|
||||
encoder_attention_mask = torch.ones(batch_size, encoder_seq_length).to(hidden_states.device)
|
||||
|
||||
# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
|
||||
# ourselves in which case we just need to make it broadcastable to all heads.
|
||||
if attention_mask.dim() == 3:
|
||||
extended_attention_mask = attention_mask[:, None, :, :]
|
||||
elif attention_mask.dim() == 2:
|
||||
# Provided a padding mask of dimensions [batch_size, seq_length]
|
||||
# - if the model is a decoder, apply a causal mask in addition to the padding mask
|
||||
# - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
|
||||
if self.config.is_decoder:
|
||||
seq_ids = torch.arange(seq_length, device=hidden_states.device)
|
||||
causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
|
||||
causal_mask = causal_mask.to(attention_mask)
|
||||
extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
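# e.g. for seq_length = 3 the causal component is the lower-triangular matrix
#   [[1, 0, 0],
#    [1, 1, 0],
#    [1, 1, 1]]
# which is then broadcast against the padding mask.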
|
||||
else:
|
||||
extended_attention_mask = attention_mask[:, None, None, :]
|
||||
|
||||
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
|
||||
# masked positions, this operation will create a tensor which is 0.0 for
|
||||
# positions we want to attend and -1e9 for masked positions.
|
||||
# Since we are adding it to the raw scores before the softmax, this is
|
||||
# effectively the same as removing these entirely.
|
||||
|
||||
# T5 has a mask that can compare sequence ids, we can simulate this here with this transposition
|
||||
# Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270
|
||||
# extended_attention_mask = (extended_attention_mask == extended_attention_mask.transpose(-1, -2))
|
||||
|
||||
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
|
||||
extended_attention_mask = (1.0 - extended_attention_mask) * -1e9
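# e.g. a padding mask [1, 1, 0] becomes the additive bias [0.0, 0.0, -1e9], so masked
# positions receive ~0 probability after the softmax.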
|
||||
|
||||
if self.is_decoder:
|
||||
# If a 2D or 3D attention mask is provided for the cross-attention
|
||||
# we need to make it broadcastable to [batch_size, num_heads, seq_length, seq_length]
|
||||
if encoder_attention_mask.dim() == 3:
|
||||
encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]
|
||||
if encoder_attention_mask.dim() == 2:
|
||||
encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]
|
||||
|
||||
# T5 has a mask that can compare sequence ids, we can simulate this here with this transposition
|
||||
# Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270
|
||||
# encoder_extended_attention_mask = (encoder_extended_attention_mask == encoder_extended_attention_mask.transpose(-1, -2))
|
||||
|
||||
encoder_extended_attention_mask = encoder_extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
|
||||
encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9
|
||||
else:
|
||||
encoder_extended_attention_mask = None
|
||||
|
||||
# Prepare head mask if needed
|
||||
# 1.0 in head_mask indicate we keep the head
|
||||
# attention_probs has shape bsz x n_heads x N x N
|
||||
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
|
||||
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
|
||||
if head_mask is not None:
|
||||
if head_mask.dim() == 1:
|
||||
head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
|
||||
head_mask = head_mask.expand(self.config.num_layers, -1, -1, -1, -1)
|
||||
elif head_mask.dim() == 2:
|
||||
head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1) # We can specify head_mask for each layer
|
||||
head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to float if needed + fp16 compatibility
|
||||
else:
|
||||
head_mask = [None] * self.config.num_layers
|
||||
|
||||
all_hidden_states = ()
|
||||
all_attentions = ()
|
||||
position_bias = None
|
||||
encoder_decoder_position_bias = None
|
||||
|
||||
hidden_states = self.dropout(hidden_states)
|
||||
for i, layer_module in enumerate(self.block):
|
||||
if self.output_hidden_states:
|
||||
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||
|
||||
layer_outputs = layer_module(hidden_states,
|
||||
attention_mask=extended_attention_mask,
|
||||
position_bias=position_bias,
|
||||
encoder_hidden_states=encoder_hidden_states,
|
||||
encoder_attention_mask=encoder_extended_attention_mask,
|
||||
encoder_decoder_position_bias=encoder_decoder_position_bias,
|
||||
head_mask=head_mask[i])
|
||||
# layer_outputs is a tuple with:
|
||||
# hidden-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
|
||||
hidden_states = layer_outputs[0]
|
||||
if i == 0:
|
||||
# We share the position biases between the layers - the first layer stores them
|
||||
# layer_outputs = hidden-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
|
||||
position_bias = layer_outputs[2 if self.output_attentions else 1]
|
||||
if self.is_decoder:
|
||||
encoder_decoder_position_bias = layer_outputs[4 if self.output_attentions else 2]
|
||||
|
||||
if self.output_attentions:
|
||||
all_attentions = all_attentions + (layer_outputs[1],) # We keep only self-attention weights for now
|
||||
|
||||
hidden_states = self.final_layer_norm(hidden_states)
|
||||
layer_output = self.dropout(hidden_states)
|
||||
|
||||
# Add last layer
|
||||
if self.output_hidden_states:
|
||||
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||
|
||||
outputs = (hidden_states,)
|
||||
if self.output_hidden_states:
|
||||
outputs = outputs + (all_hidden_states,)
|
||||
if self.output_attentions:
|
||||
outputs = outputs + (all_attentions,)
|
||||
return outputs # last-layer hidden state, (all hidden states), (all attentions)
|
||||
|
||||
|
||||
T5_START_DOCSTRING = r""" The T5 model was proposed in
|
||||
`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_
|
||||
by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
|
||||
It's an encoder-decoder transformer pre-trained in a text-to-text denoising generative setting.
|
||||
|
||||
This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
|
||||
refer to the PyTorch documentation for all matter related to general usage and behavior.
|
||||
|
||||
.. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`:
|
||||
https://arxiv.org/abs/1910.10683
|
||||
|
||||
.. _`torch.nn.Module`:
|
||||
https://pytorch.org/docs/stable/nn.html#module
|
||||
|
||||
Parameters:
|
||||
config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model.
|
||||
Initializing with a config file does not load the weights associated with the model, only the configuration.
|
||||
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
|
||||
"""
|
||||
|
||||
T5_INPUTS_DOCSTRING = r"""
|
||||
Inputs:
|
||||
**input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||
Indices of input sequence tokens in the vocabulary.
|
||||
To match pre-training, T5 input sequence should be formatted with [CLS] and [SEP] tokens as follows:
|
||||
|
||||
(a) For sequence pairs:
|
||||
|
||||
``tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]``
|
||||
|
||||
(b) For single sequences:
|
||||
|
||||
``tokens: [CLS] the dog is hairy . [SEP]``
|
||||
|
||||
T5 is a model with relative position embeddings so you should be able to pad the inputs on
|
||||
the right or the left.
|
||||
|
||||
Indices can be obtained using :class:`transformers.T5Tokenizer`.
|
||||
See :func:`transformers.PreTrainedTokenizer.encode` and
|
||||
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
|
||||
**attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
|
||||
Mask to avoid performing attention on padding token indices.
|
||||
Mask values selected in ``[0, 1]``:
|
||||
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
|
||||
**head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
|
||||
Mask to nullify selected heads of the self-attention modules.
|
||||
Mask values selected in ``[0, 1]``:
|
||||
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
|
||||
"""
|
||||
|
||||
@add_start_docstrings("The bare T5 Model transformer outputting raw hidden-states"
|
||||
"without any specific head on top.",
|
||||
T5_START_DOCSTRING, T5_INPUTS_DOCSTRING)
|
||||
class T5Model(T5PreTrainedModel):
|
||||
r"""
|
||||
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||
**last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
|
||||
Sequence of hidden-states at the output of the last layer of the model.
|
||||
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
||||
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||
|
||||
Examples::
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained('t5-small')
|
||||
model = T5Model.from_pretrained('t5-small')
|
||||
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
||||
outputs = model(input_ids)
|
||||
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||
|
||||
"""
|
||||
def __init__(self, config):
|
||||
super(T5Model, self).__init__(config)
|
||||
self.shared = nn.Embedding(config.vocab_size, config.d_model)
|
||||
|
||||
encoder_config = copy.deepcopy(config)
|
||||
self.encoder = T5Stack(encoder_config)
|
||||
|
||||
decoder_config = copy.deepcopy(config)
|
||||
decoder_config.is_decoder = True
|
||||
self.decoder = T5Stack(decoder_config)
|
||||
|
||||
self.init_weights()
|
||||
|
||||
def get_input_embeddings(self):
|
||||
return self.shared
|
||||
|
||||
def set_input_embeddings(self, new_embeddings):
|
||||
self.shared = new_embeddings
|
||||
|
||||
def _prune_heads(self, heads_to_prune):
|
||||
""" Prunes heads of the model.
|
||||
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
|
||||
See base class PreTrainedModel
|
||||
"""
|
||||
for layer, heads in heads_to_prune.items():
|
||||
self.encoder.layer[layer].attention.prune_heads(heads)
|
||||
|
||||
def forward(self, **kwargs):
|
||||
# keyword arguments come in 3 flavors: encoder-specific (prefixed by
|
||||
# `encoder_`), decoder-specific (prefixed by `decoder_`) and those
|
||||
# that apply to the model as a whole.
|
||||
# We let the specific kwargs override the common ones in case of conflict.
|
||||
kwargs_common = dict((k, v) for k, v in kwargs.items()
|
||||
if not k.startswith("encoder_") and not k.startswith("decoder_"))
|
||||
kwargs_encoder = kwargs_common.copy()
|
||||
kwargs_decoder = kwargs_common.copy()
|
||||
kwargs_encoder.update(dict((k[len("encoder_"):], v) for k, v in kwargs.items() if k.startswith("encoder_")))
|
||||
kwargs_decoder.update(dict((k[len("decoder_"):], v) for k, v in kwargs.items() if k.startswith("decoder_")))
|
||||
|
||||
# Encode if needed (training, first prediction pass)
|
||||
encoder_hidden_states = kwargs_encoder.pop("hidden_states", None)
|
||||
encoder_attention_mask = kwargs_encoder.get("attention_mask", None)
|
||||
if encoder_hidden_states is None:
|
||||
# Convert encoder inputs in embeddings if needed
|
||||
hidden_states = kwargs_encoder.pop("inputs_embeds", None)
|
||||
if hidden_states is None:
|
||||
encoder_inputs_ids = kwargs_encoder.pop("input_ids")
|
||||
hidden_states = self.shared(encoder_inputs_ids) # Convert inputs in embeddings
|
||||
|
||||
if encoder_attention_mask is not None:
|
||||
# Apply masking
|
||||
encoder_attention_mask = (encoder_attention_mask != 0).to(hidden_states)
|
||||
hidden_states = hidden_states * encoder_attention_mask.unsqueeze(-1)
|
||||
|
||||
encoder_outputs = self.encoder(hidden_states, **kwargs_encoder)
|
||||
encoder_hidden_states = encoder_outputs[0]
|
||||
else:
|
||||
encoder_outputs = ()
|
||||
|
||||
# Decode
|
||||
# Convert decoder inputs in embeddings if needed
|
||||
hidden_states = kwargs_decoder.pop("inputs_embeds", None)
|
||||
if hidden_states is None:
|
||||
decoder_inputs_ids = kwargs_decoder.pop("input_ids")
|
||||
hidden_states = self.shared(decoder_inputs_ids)
|
||||
|
||||
kwargs_decoder["encoder_hidden_states"] = encoder_hidden_states
|
||||
kwargs_decoder["encoder_attention_mask"] = encoder_attention_mask
|
||||
decoder_outputs = self.decoder(hidden_states, **kwargs_decoder)
|
||||
|
||||
return decoder_outputs + encoder_outputs
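# Illustrative usage of the kwargs routing above (hedged sketch; `input_ids` is built
# as in the docstring example): `encoder_`/`decoder_` prefixed arguments go to the
# corresponding stack, un-prefixed ones are shared by both.
#   model = T5Model.from_pretrained('t5-small')
#   outputs = model(encoder_input_ids=input_ids, decoder_input_ids=input_ids)
#   decoder_last_hidden_state = outputs[0]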
|
||||
|
||||
|
||||
@add_start_docstrings("""T5 Model with a `language modeling` head on top. """,
|
||||
T5_START_DOCSTRING, T5_INPUTS_DOCSTRING)
|
||||
class T5WithLMHeadModel(T5PreTrainedModel):
|
||||
r"""
|
||||
**lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
|
||||
Labels for computing the masked language modeling loss.
|
||||
Indices should be in ``[-1, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
|
||||
Tokens with indices set to ``-1`` are ignored (masked), the loss is only computed for the tokens with labels
|
||||
in ``[0, ..., config.vocab_size]``
|
||||
|
||||
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||
**loss**: (`optional`, returned when ``lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
|
||||
Masked language modeling loss.
|
||||
**prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
|
||||
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
||||
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
|
||||
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||
|
||||
Examples::
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained('t5-small')
|
||||
model = T5WithLMHeadModel.from_pretrained('t5-small')
|
||||
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
|
||||
outputs = model(input_ids, lm_labels=input_ids)
|
||||
loss, prediction_scores = outputs[:2]
|
||||
|
||||
"""
|
||||
def __init__(self, config):
|
||||
super(T5WithLMHeadModel, self).__init__(config)
|
||||
self.model_dim = config.d_model
|
||||
|
||||
self.shared = nn.Embedding(config.vocab_size, config.d_model)
|
||||
|
||||
encoder_config = copy.deepcopy(config)
|
||||
self.encoder = T5Stack(encoder_config)
|
||||
|
||||
decoder_config = copy.deepcopy(config)
|
||||
decoder_config.is_decoder = True
|
||||
self.decoder = T5Stack(decoder_config)
|
||||
|
||||
self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
|
||||
|
||||
self.init_weights()
|
||||
|
||||
def get_input_embeddings(self):
|
||||
return self.shared
|
||||
|
||||
def set_input_embeddings(self, new_embeddings):
|
||||
self.shared = new_embeddings
|
||||
|
||||
def get_output_embeddings(self):
|
||||
return self.lm_head
|
||||
|
||||
def forward(self, **kwargs):
|
||||
# keyword arguments come in 3 flavors: encoder-specific (prefixed by
|
||||
# `encoder_`), decoder-specific (prefixed by `decoder_`) and those
|
||||
# that apply to the model as whole.
|
||||
# We let the specific kwargs override the common ones in case of conflict.
|
||||
|
||||
lm_labels = kwargs.pop('decoder_lm_labels', None)
|
||||
|
||||
kwargs_common = dict((k, v) for k, v in kwargs.items()
|
||||
if not k.startswith("encoder_") and not k.startswith("decoder_"))
|
||||
kwargs_encoder = kwargs_common.copy()
|
||||
kwargs_decoder = kwargs_common.copy()
|
||||
kwargs_encoder.update(dict((k[len("encoder_"):], v) for k, v in kwargs.items() if k.startswith("encoder_")))
|
||||
kwargs_decoder.update(dict((k[len("decoder_"):], v) for k, v in kwargs.items() if k.startswith("decoder_")))
|
||||
|
||||
# Encode if needed (training, first prediction pass)
|
||||
encoder_hidden_states = kwargs_encoder.pop("hidden_states", None)
|
||||
if encoder_hidden_states is None:
|
||||
# Convert encoder inputs in embeddings if needed
|
||||
hidden_states = kwargs_encoder.pop("inputs_embeds", None)
|
||||
if hidden_states is None:
|
||||
encoder_inputs_ids = kwargs_encoder.pop("input_ids")
|
||||
hidden_states = self.shared(encoder_inputs_ids) # Convert inputs in embeddings
|
||||
|
||||
encoder_outputs = self.encoder(hidden_states, **kwargs_encoder)
|
||||
encoder_hidden_states = encoder_outputs[0]
|
||||
else:
|
||||
encoder_outputs = ()
|
||||
|
||||
# Decode
|
||||
# Convert decoder inputs in embeddings if needed
|
||||
hidden_states = kwargs_decoder.pop("inputs_embeds", None)
|
||||
if hidden_states is None:
|
||||
decoder_inputs_ids = kwargs_decoder.pop("input_ids")
|
||||
hidden_states = self.shared(decoder_inputs_ids)
|
||||
|
||||
kwargs_decoder["encoder_hidden_states"] = encoder_hidden_states
|
||||
kwargs_decoder["encoder_attention_mask"] = kwargs_encoder.get("attention_mask", None)
|
||||
decoder_outputs = self.decoder(hidden_states, **kwargs_decoder)
|
||||
|
||||
sequence_output = decoder_outputs[0]
|
||||
# Rescale output before projecting on vocab
|
||||
# See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586
|
||||
sequence_output = sequence_output * (self.model_dim ** -0.5)
|
||||
lm_logits = self.lm_head(sequence_output)
|
||||
|
||||
decoder_outputs = (lm_logits,) + decoder_outputs[1:] # Add hidden states and attention if they are here
|
||||
if lm_labels is not None:
|
||||
shift_logits = lm_logits[..., :-1, :].contiguous()
|
||||
shift_labels = lm_labels[..., 1:].contiguous()
|
||||
loss_fct = CrossEntropyLoss(ignore_index=-1)
|
||||
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
|
||||
shift_labels.view(-1))
|
||||
decoder_outputs = (loss,) + decoder_outputs # TODO(thom): Add z_loss https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L666
|
||||
|
||||
return decoder_outputs + encoder_outputs
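# Illustrative usage (hedged sketch; `input_ids` is built as in the docstring example):
# labels are passed through the `decoder_lm_labels` keyword popped at the top of
# forward, and the loss is prepended to the decoder outputs.
#   model = T5WithLMHeadModel.from_pretrained('t5-small')
#   outputs = model(encoder_input_ids=input_ids,
#                   decoder_input_ids=input_ids,
#                   decoder_lm_labels=input_ids)
#   loss, prediction_scores = outputs[:2]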
|
@ -27,6 +27,7 @@ from .modeling_tf_xlm import TFXLMModel, TFXLMWithLMHeadModel, TFXLMForSequenceC
|
||||
from .modeling_tf_roberta import TFRobertaModel, TFRobertaForMaskedLM, TFRobertaForSequenceClassification
|
||||
from .modeling_tf_distilbert import TFDistilBertModel, TFDistilBertForQuestionAnswering, TFDistilBertForMaskedLM, TFDistilBertForSequenceClassification
|
||||
from .modeling_tf_ctrl import TFCTRLModel, TFCTRLLMHeadModel
|
||||
from .modeling_tf_t5 import TFT5Model, TFT5WithLMHeadModel
|
||||
|
||||
from .file_utils import add_start_docstrings
|
||||
|
||||
@ -45,6 +46,7 @@ class TFAutoModel(object):
|
||||
|
||||
The base model class to instantiate is selected as the first pattern matching
|
||||
in the `pretrained_model_name_or_path` string (in the following order):
|
||||
- contains `t5`: TFT5Model (T5 model)
|
||||
- contains `distilbert`: TFDistilBertModel (DistilBERT model)
|
||||
- contains `roberta`: TFRobertaModel (RoBERTa model)
|
||||
- contains `bert`: TFBertModel (Bert model)
|
||||
@ -68,6 +70,7 @@ class TFAutoModel(object):
|
||||
|
||||
The model class to instantiate is selected as the first pattern matching
|
||||
in the `pretrained_model_name_or_path` string (in the following order):
|
||||
- contains `t5`: TFT5Model (T5 model)
|
||||
- contains `distilbert`: TFDistilBertModel (DistilBERT model)
|
||||
- contains `roberta`: TFRobertaModel (RoBERTa model)
|
||||
- contains `bert`: TFBertModel (Bert model)
|
||||
@ -137,7 +140,9 @@ class TFAutoModel(object):
|
||||
model = TFAutoModel.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)
|
||||
|
||||
"""
|
||||
if 'distilbert' in pretrained_model_name_or_path:
|
||||
if 't5' in pretrained_model_name_or_path:
|
||||
return TFT5Model.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
elif 'distilbert' in pretrained_model_name_or_path:
|
||||
return TFDistilBertModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
elif 'roberta' in pretrained_model_name_or_path:
|
||||
return TFRobertaModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
@ -173,6 +178,7 @@ class TFAutoModelWithLMHead(object):
|
||||
|
||||
The model class to instantiate is selected as the first pattern matching
|
||||
in the `pretrained_model_name_or_path` string (in the following order):
|
||||
- contains `t5`: TFT5WithLMHeadModel (T5 model)
|
||||
- contains `distilbert`: TFDistilBertForMaskedLM (DistilBERT model)
|
||||
- contains `roberta`: TFRobertaForMaskedLM (RoBERTa model)
|
||||
- contains `bert`: TFBertForMaskedLM (Bert model)
|
||||
@ -199,6 +205,7 @@ class TFAutoModelWithLMHead(object):
|
||||
|
||||
The model class to instantiate is selected as the first pattern matching
|
||||
in the `pretrained_model_name_or_path` string (in the following order):
|
||||
- contains `t5`: TFT5WithLMHeadModel (T5 model)
|
||||
- contains `distilbert`: TFDistilBertForMaskedLM (DistilBERT model)
|
||||
- contains `roberta`: TFRobertaForMaskedLM (RoBERTa model)
|
||||
- contains `bert`: TFBertForMaskedLM (Bert model)
|
||||
@ -269,7 +276,9 @@ class TFAutoModelWithLMHead(object):
|
||||
model = TFAutoModelWithLMHead.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)
|
||||
|
||||
"""
|
||||
if 'distilbert' in pretrained_model_name_or_path:
|
||||
if 't5' in pretrained_model_name_or_path:
|
||||
return TFT5WithLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
elif 'distilbert' in pretrained_model_name_or_path:
|
||||
return TFDistilBertForMaskedLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
elif 'roberta' in pretrained_model_name_or_path:
|
||||
return TFRobertaForMaskedLM.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
|
||||
|
@ -78,6 +78,7 @@ def load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_i
|
||||
logger.info("Loading PyTorch weights from {}".format(pt_path))
|
||||
|
||||
pt_state_dict = torch.load(pt_path, map_location='cpu')
|
||||
logger.info("PyTorch checkpoint contains {:,} parameters".format(sum(t.numel() for t in pt_state_dict.values())))
|
||||
|
||||
return load_pytorch_weights_in_tf2_model(tf_model, pt_state_dict, tf_inputs=tf_inputs, allow_missing_keys=allow_missing_keys)
|
||||
|
||||
@ -134,7 +135,7 @@ def load_pytorch_weights_in_tf2_model(tf_model, pt_state_dict, tf_inputs=None, a
|
||||
start_prefix_to_remove = tf_model.base_model_prefix + '.'
|
||||
|
||||
symbolic_weights = tf_model.trainable_weights + tf_model.non_trainable_weights
|
||||
|
||||
tf_loaded_numel = 0
|
||||
weight_value_tuples = []
|
||||
all_pytorch_weights = set(list(pt_state_dict.keys()))
|
||||
for symbolic_weight in symbolic_weights:
|
||||
@ -159,7 +160,8 @@ def load_pytorch_weights_in_tf2_model(tf_model, pt_state_dict, tf_inputs=None, a
|
||||
e.args += (symbolic_weight.shape, array.shape)
|
||||
raise e
|
||||
|
||||
logger.info("Initialize TF weight {}".format(symbolic_weight.name))
|
||||
tf_loaded_numel += array.size
|
||||
# logger.warning("Initialize TF weight {}".format(symbolic_weight.name))
|
||||
|
||||
weight_value_tuples.append((symbolic_weight, array))
|
||||
all_pytorch_weights.discard(name)
|
||||
@ -169,6 +171,8 @@ def load_pytorch_weights_in_tf2_model(tf_model, pt_state_dict, tf_inputs=None, a
|
||||
if tf_inputs is not None:
|
||||
tfo = tf_model(tf_inputs, training=False) # Make sure restore ops are run
|
||||
|
||||
logger.info("Loaded {:,} parameters in the TF 2.0 model.".format(tf_loaded_numel))
|
||||
|
||||
logger.info("Weights or buffers not loaded from PyTorch model: {}".format(all_pytorch_weights))
|
||||
|
||||
return tf_model
|
||||
@ -272,7 +276,7 @@ def load_tf2_weights_in_pytorch_model(pt_model, tf_weights, allow_missing_keys=F
|
||||
e.args += (pt_weight.shape, array.shape)
|
||||
raise e
|
||||
|
||||
logger.info("Initialize PyTorch weight {}".format(pt_weight_name))
|
||||
# logger.warning("Initialize PyTorch weight {}".format(pt_weight_name))
|
||||
|
||||
new_pt_params_dict[pt_weight_name] = torch.from_numpy(array)
|
||||
loaded_pt_weights_data_ptr[pt_weight.data_ptr()] = torch.from_numpy(array)
|
||||
|
transformers/modeling_tf_t5.py (new file, 775 lines)
@ -0,0 +1,775 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 T5 Authors and The HuggingFace Inc. team.
|
||||
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" TF 2.0 T5 model. """
|
||||
|
||||
from __future__ import absolute_import, division, print_function, unicode_literals
|
||||
|
||||
import logging
|
||||
import math
|
||||
import copy
|
||||
import itertools
|
||||
|
||||
import tensorflow as tf
|
||||
|
||||
from .configuration_t5 import T5Config
|
||||
from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, shape_list
|
||||
from .file_utils import add_start_docstrings, DUMMY_INPUTS, DUMMY_MASK
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
TF_T5_PRETRAINED_MODEL_ARCHIVE_MAP = {
|
||||
't5-small': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-small-tf_model.h5",
|
||||
't5-base': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-base-tf_model.h5",
|
||||
't5-large': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-large-tf_model.h5",
|
||||
't5-3b': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-3b-tf_model.h5",
|
||||
't5-11b': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-11b-tf_model.h5",
|
||||
}
|
||||
|
||||
####################################################
|
||||
# TF 2.0 Models are constructed using Keras imperative API by sub-classing
|
||||
# - tf.keras.layers.Layer for the layers and
|
||||
# - TFPreTrainedModel for the models (itself a sub-class of tf.keras.Model)
|
||||
####################################################
|
||||
|
||||
class TFT5LayerNorm(tf.keras.layers.Layer):
|
||||
def __init__(self, epsilon=1e-6, **kwargs):
|
||||
""" Construct a layernorm module in the T5 style
|
||||
No bias and no subtraction of the mean.
|
||||
"""
|
||||
super(TFT5LayerNorm, self).__init__(**kwargs)
|
||||
self.variance_epsilon = epsilon
|
||||
|
||||
def build(self, input_shape):
|
||||
"""Build shared word embedding layer """
|
||||
self.weight = self.add_weight(
|
||||
"weight",
|
||||
shape=(input_shape[-1],),
|
||||
initializer='ones')
|
||||
super(TFT5LayerNorm, self).build(input_shape)
|
||||
|
||||
def call(self, x):
|
||||
variance = tf.math.reduce_mean(tf.math.square(x), axis=-1, keepdims=True)
|
||||
x = x * tf.math.rsqrt(variance + self.variance_epsilon)
|
||||
return self.weight * x
|
||||
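A quick, hypothetical check of the behaviour described in the docstring above (no bias, no mean subtraction): the layer only rescales each position by its root-mean-square, with a learned scale initialized to ones.

import tensorflow as tf

layer = TFT5LayerNorm(epsilon=1e-6)
x = tf.random.normal((2, 4, 8))                      # (batch, seq_len, d_model)
y = layer(x)
mean_square = tf.math.reduce_mean(tf.math.square(y), axis=-1)
# mean_square is ~1.0 everywhere: unlike standard LayerNorm, nothing is centered
# and no bias is added.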
|
||||
|
||||
class TFT5DenseReluDense(tf.keras.layers.Layer):
|
||||
def __init__(self, config, **kwargs):
|
||||
super(TFT5DenseReluDense, self).__init__(**kwargs)
|
||||
self.wi = tf.keras.layers.Dense(config.d_ff, use_bias=False, name='wi')
|
||||
self.wo = tf.keras.layers.Dense(config.d_model, use_bias=False, name='wo')
|
||||
self.dropout = tf.keras.layers.Dropout(config.dropout_rate)
|
||||
self.act = tf.keras.activations.relu
|
||||
|
||||
def call(self, hidden_states, training=False):
|
||||
h = self.wi(hidden_states)
|
||||
h = self.act(h)
|
||||
h = self.dropout(h, training=training)
|
||||
h = self.wo(h)
|
||||
return h
|
||||
|
||||
|
||||
class TFT5LayerFF(tf.keras.layers.Layer):
|
||||
def __init__(self, config, **kwargs):
|
||||
super(TFT5LayerFF, self).__init__(**kwargs)
|
||||
self.DenseReluDense = TFT5DenseReluDense(config, name='DenseReluDense')
|
||||
self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon,
|
||||
name='layer_norm')
|
||||
self.dropout = tf.keras.layers.Dropout(config.dropout_rate)
|
||||
|
||||
def call(self, hidden_states, training=False):
|
||||
norm_x = self.layer_norm(hidden_states)
|
||||
y = self.DenseReluDense(norm_x, training=training)
|
||||
layer_output = hidden_states + self.dropout(y, training=training)
|
||||
return layer_output
|
||||
|
||||
|
||||
class TFT5Attention(tf.keras.layers.Layer):
|
||||
NEW_ID = itertools.count()
|
||||
|
||||
def __init__(self, config, has_relative_attention_bias=False, **kwargs):
|
||||
super(TFT5Attention, self).__init__(**kwargs)
|
||||
self.layer_id = next(TFT5Attention.NEW_ID)
|
||||
self.is_decoder = config.is_decoder
|
||||
self.has_relative_attention_bias = has_relative_attention_bias
|
||||
|
||||
self.output_attentions = config.output_attentions
|
||||
self.relative_attention_num_buckets = config.relative_attention_num_buckets
|
||||
self.d_model = config.d_model
|
||||
self.d_kv = config.d_kv
|
||||
self.n_heads = config.num_heads
|
||||
self.inner_dim = self.n_heads * self.d_kv
|
||||
|
||||
# Mesh TensorFlow initialization to avoid scaling before softmax
|
||||
self.q = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name='q')
|
||||
self.k = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name='k')
|
||||
self.v = tf.keras.layers.Dense(self.inner_dim, use_bias=False, name='v')
|
||||
self.o = tf.keras.layers.Dense(self.d_model, use_bias=False, name='o')
|
||||
self.dropout = tf.keras.layers.Dropout(config.dropout_rate)
|
||||
|
||||
if self.has_relative_attention_bias:
|
||||
self.relative_attention_bias = tf.keras.layers.Embedding(self.relative_attention_num_buckets,
|
||||
self.n_heads,
|
||||
name='relative_attention_bias')
|
||||
self.pruned_heads = set()
|
||||
|
||||
def prune_heads(self, heads):
|
||||
raise NotImplementedError
|
||||
|
||||
@staticmethod
|
||||
def _relative_position_bucket(relative_position,
|
||||
bidirectional=True,
|
||||
num_buckets=32,
|
||||
max_distance=128):
|
||||
"""
|
||||
Adapted from Mesh Tensorflow:
|
||||
https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593
|
||||
|
||||
Translate relative position to a bucket number for relative attention.
|
||||
The relative position is defined as memory_position - query_position, i.e.
|
||||
the distance in tokens from the attending position to the attended-to
|
||||
position. If bidirectional=False, then positive relative positions are
|
||||
invalid.
|
||||
We use smaller buckets for small absolute relative_position and larger buckets
|
||||
for larger absolute relative_positions. All relative positions >=max_distance
|
||||
map to the same bucket. All relative positions <=-max_distance map to the
|
||||
same bucket. This should allow for more graceful generalization to longer
|
||||
sequences than the model has been trained on.
|
||||
Args:
|
||||
relative_position: an int32 Tensor
|
||||
bidirectional: a boolean - whether the attention is bidirectional
|
||||
num_buckets: an integer
|
||||
max_distance: an integer
|
||||
Returns:
|
||||
a Tensor with the same shape as relative_position, containing int32
|
||||
values in the range [0, num_buckets)
|
||||
"""
|
||||
ret = 0
|
||||
n = -relative_position
|
||||
if bidirectional:
|
||||
num_buckets //= 2
|
||||
ret += tf.dtypes.cast(tf.math.less(n, 0), tf.int32) * num_buckets
|
||||
n = tf.math.abs(n)
|
||||
else:
|
||||
n = tf.math.maximum(n, 0)
|
||||
# now n is in the range [0, inf)
|
||||
max_exact = num_buckets // 2
|
||||
is_small = tf.math.less(n, max_exact)
|
||||
val_if_large = max_exact + tf.dtypes.cast(
|
||||
tf.math.log(tf.dtypes.cast(n, tf.float32) / max_exact)
|
||||
/ math.log(max_distance / max_exact) * (num_buckets - max_exact), tf.int32)
|
||||
val_if_large = tf.math.minimum(val_if_large, num_buckets - 1)
|
||||
ret += tf.where(is_small, n, val_if_large)
|
||||
return ret
|
||||
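To make the bucketing scheme documented above concrete, here is a small, hypothetical call (values chosen only for illustration). Nearby offsets get their own buckets, distant offsets share logarithmically spaced buckets, and with bidirectional=True the sign of the offset selects one half of the bucket range.

import tensorflow as tf

relative_position = tf.constant([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]], dtype=tf.int32)
buckets = TFT5Attention._relative_position_bucket(relative_position,
                                                  bidirectional=True,
                                                  num_buckets=32,
                                                  max_distance=128)
# buckets has the same shape as relative_position, with int32 values in [0, 32).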
|
||||
def compute_bias(self, qlen, klen):
|
||||
""" Compute binned relative position bias """
|
||||
context_position = tf.range(qlen)[:, None]
|
||||
memory_position = tf.range(klen)[None, :]
|
||||
relative_position = memory_position - context_position # shape (qlen, klen)
|
||||
rp_bucket = self._relative_position_bucket(relative_position,
|
||||
bidirectional=not self.is_decoder,
|
||||
num_buckets=self.relative_attention_num_buckets)
|
||||
values = self.relative_attention_bias(rp_bucket) # shape (qlen, klen, num_heads)
|
||||
values = tf.expand_dims(tf.transpose(values, [2, 0, 1]), axis=0) # shape (1, num_heads, qlen, klen)
|
||||
return values
|
||||
|
||||
def call(self, input, mask=None, kv=None, position_bias=None, cache=None, head_mask=None, training=False):
|
||||
"""
|
||||
Self-attention (if kv is None) or attention over source sentence (provided by kv).
|
||||
"""
|
||||
# Input is (bs, qlen, dim)
|
||||
# Mask is (bs, klen) (non-causal) or (bs, klen, klen)
|
||||
bs, qlen, dim = shape_list(input)
|
||||
if kv is None:
|
||||
klen = qlen if cache is None else cache['slen'] + qlen
|
||||
else:
|
||||
klen = shape_list(kv)[1]
|
||||
|
||||
def shape(x):
|
||||
""" projection """
|
||||
return tf.transpose(tf.reshape(x, (bs, -1, self.n_heads, self.d_kv)), perm=(0, 2, 1, 3))
|
||||
|
||||
def unshape(x):
|
||||
""" compute context """
|
||||
return tf.reshape(tf.transpose(x, perm=(0, 2, 1, 3)), (bs, -1, self.inner_dim))
|
||||
|
||||
q = shape(self.q(input)) # (bs, n_heads, qlen, dim_per_head)
|
||||
if kv is None:
|
||||
k = shape(self.k(input)) # (bs, n_heads, qlen, dim_per_head)
|
||||
v = shape(self.v(input)) # (bs, n_heads, qlen, dim_per_head)
|
||||
elif cache is None or self.layer_id not in cache:
|
||||
k = v = kv
|
||||
k = shape(self.k(k)) # (bs, n_heads, qlen, dim_per_head)
|
||||
v = shape(self.v(v)) # (bs, n_heads, qlen, dim_per_head)
|
||||
|
||||
if cache is not None:
|
||||
if self.layer_id in cache:
|
||||
if kv is None:
|
||||
k_, v_ = cache[self.layer_id]
|
||||
k = tf.concat([k_, k], axis=2) # (bs, n_heads, klen, dim_per_head)
|
||||
v = tf.concat([v_, v], axis=2) # (bs, n_heads, klen, dim_per_head)
|
||||
else:
|
||||
k, v = cache[self.layer_id]
|
||||
cache[self.layer_id] = (k, v)
|
||||
|
||||
# q = q / math.sqrt(dim_per_head) # No scaling in T5
|
||||
# scores = tf.matmul(q, k, transpose_b=True) # (bs, n_heads, qlen, klen)
|
||||
scores = tf.einsum('bnqd,bnkd->bnqk', q, k) # (bs, n_heads, qlen, klen)
|
||||
|
||||
if position_bias is None:
|
||||
if not self.has_relative_attention_bias:
|
||||
raise ValueError("No position_bias provided and no weights to compute position_bias")
|
||||
position_bias = self.compute_bias(qlen, klen)
|
||||
if mask is not None:
|
||||
position_bias = position_bias + mask
|
||||
# mask = (mask == 0).expand_as(scores) # (bs, n_heads, qlen, klen)
|
||||
# scores.masked_fill_(mask, -float('inf')) # (bs, n_heads, qlen, klen)
|
||||
|
||||
scores += position_bias
|
||||
weights = tf.nn.softmax(scores, axis=-1) # (bs, n_heads, qlen, klen)
|
||||
weights = self.dropout(weights, training=training) # (bs, n_heads, qlen, klen)
|
||||
|
||||
# Mask heads if we want to
|
||||
if head_mask is not None:
|
||||
weights = weights * head_mask
|
||||
|
||||
context = tf.matmul(weights, v) # (bs, n_heads, qlen, dim_per_head)
|
||||
context = unshape(context) # (bs, qlen, dim)
|
||||
|
||||
context = self.o(context)
|
||||
|
||||
outputs = (context,)
|
||||
if self.output_attentions:
|
||||
outputs = outputs + (weights,)
|
||||
if self.has_relative_attention_bias:
|
||||
outputs = outputs + (position_bias,)
|
||||
return outputs
|
||||
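A hypothetical sketch of the two call forms mentioned in the docstring of call(): self-attention (kv is None, so the relative position bias is computed internally) and attention over a source sequence (kv given; a zero position_bias is passed explicitly here for illustration). It assumes T5Config's defaults fill in the remaining hyper-parameters.

import tensorflow as tf
from transformers import T5Config

config = T5Config(d_model=16, d_kv=4, num_heads=4, relative_attention_num_buckets=8)
attn = TFT5Attention(config, has_relative_attention_bias=True)

x = tf.random.normal((1, 5, 16))                     # (bs, qlen, dim)
self_outputs = attn(x)                               # self-attention over x
memory = tf.random.normal((1, 7, 16))                # e.g. encoder hidden states
cross_outputs = attn(x, kv=memory,
                     position_bias=tf.zeros((1, 4, 5, 7)))  # (1, n_heads, qlen, klen)
# Both output tuples start with the context tensor of shape (1, 5, 16).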
|
||||
|
||||
class TFT5LayerSelfAttention(tf.keras.layers.Layer):
|
||||
def __init__(self, config, has_relative_attention_bias=False, **kwargs):
|
||||
super(TFT5LayerSelfAttention, self).__init__(**kwargs)
|
||||
self.SelfAttention = TFT5Attention(config,
|
||||
has_relative_attention_bias=has_relative_attention_bias,
|
||||
name='SelfAttention')
|
||||
self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon,
|
||||
name='layer_norm')
|
||||
self.dropout = tf.keras.layers.Dropout(config.dropout_rate)
|
||||
|
||||
def call(self, hidden_states, attention_mask=None, position_bias=None,
|
||||
head_mask=None, training=False):
|
||||
norm_x = self.layer_norm(hidden_states)
|
||||
attention_output = self.SelfAttention(norm_x,
|
||||
mask=attention_mask,
|
||||
position_bias=position_bias,
|
||||
head_mask=head_mask,
|
||||
training=training)
|
||||
y = attention_output[0]
|
||||
layer_output = hidden_states + self.dropout(y, training=training)
|
||||
outputs = (layer_output,) + attention_output[1:] # add attentions if we output them
|
||||
return outputs
|
||||
|
||||
|
||||
class TFT5LayerCrossAttention(tf.keras.layers.Layer):
|
||||
def __init__(self, config, has_relative_attention_bias=False, **kwargs):
|
||||
super(TFT5LayerCrossAttention, self).__init__(**kwargs)
|
||||
self.EncDecAttention = TFT5Attention(config,
|
||||
has_relative_attention_bias=has_relative_attention_bias,
|
||||
name='EncDecAttention')
|
||||
self.layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon,
|
||||
name='layer_norm')
|
||||
self.dropout = tf.keras.layers.Dropout(config.dropout_rate)
|
||||
|
||||
def call(self, hidden_states, kv, attention_mask=None, position_bias=None,
|
||||
head_mask=None, training=False):
|
||||
norm_x = self.layer_norm(hidden_states)
|
||||
attention_output = self.EncDecAttention(norm_x,
|
||||
mask=attention_mask,
|
||||
kv=kv,
|
||||
position_bias=position_bias,
|
||||
head_mask=head_mask,
|
||||
training=training)
|
||||
y = attention_output[0]
|
||||
layer_output = hidden_states + self.dropout(y, training=training)
|
||||
outputs = (layer_output,) + attention_output[1:] # add attentions if we output them
|
||||
return outputs
|
||||
|
||||
|
||||
class TFT5Block(tf.keras.layers.Layer):
|
||||
def __init__(self, config, has_relative_attention_bias=False, **kwargs):
|
||||
super(TFT5Block, self).__init__(**kwargs)
|
||||
self.is_decoder = config.is_decoder
|
||||
self.layer = []
|
||||
self.layer.append(TFT5LayerSelfAttention(config,
|
||||
has_relative_attention_bias=has_relative_attention_bias,
|
||||
name='layer_._0'))
|
||||
if self.is_decoder:
|
||||
self.layer.append(TFT5LayerCrossAttention(config,
|
||||
has_relative_attention_bias=has_relative_attention_bias,
|
||||
name='layer_._1'))
|
||||
self.layer.append(TFT5LayerFF(config, name='layer_._2'))
|
||||
else:
|
||||
self.layer.append(TFT5LayerFF(config, name='layer_._1'))
|
||||
|
||||
def call(self, hidden_states, attention_mask=None, position_bias=None,
|
||||
encoder_hidden_states=None, encoder_attention_mask=None, encoder_decoder_position_bias=None,
|
||||
head_mask=None, training=False):
|
||||
self_attention_outputs = self.layer[0](hidden_states,
|
||||
attention_mask=attention_mask,
|
||||
position_bias=position_bias,
|
||||
head_mask=head_mask,
|
||||
training=training)
|
||||
hidden_states = self_attention_outputs[0]
|
||||
outputs = self_attention_outputs[1:]
|
||||
|
||||
if not self.is_decoder:
|
||||
hidden_states = self.layer[1](hidden_states, training=training)
|
||||
else:
|
||||
cross_attention_outputs = self.layer[1](hidden_states,
|
||||
kv=encoder_hidden_states,
|
||||
attention_mask=encoder_attention_mask,
|
||||
position_bias=encoder_decoder_position_bias,
|
||||
head_mask=head_mask,
|
||||
training=training)
|
||||
hidden_states = cross_attention_outputs[0]
|
||||
outputs = outputs + cross_attention_outputs[1:]
|
||||
hidden_states = self.layer[2](hidden_states, training=training)
|
||||
|
||||
outputs = (hidden_states,) + outputs # add attentions if we output them
|
||||
return outputs # hidden-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
|
||||
|
||||
|
||||
####################################################
|
||||
# The full model without a specific pretrained or finetuning head is
|
||||
# provided as a tf.keras.layers.Layer usually called "TFT5MainLayer"
|
||||
####################################################
|
||||
class TFT5MainLayer(tf.keras.layers.Layer):
|
||||
def __init__(self, config, **kwargs):
|
||||
super(TFT5MainLayer, self).__init__(**kwargs)
|
||||
self.output_attentions = config.output_attentions
|
||||
self.output_hidden_states = config.output_hidden_states
|
||||
self.is_decoder = config.is_decoder
|
||||
self.config = config
|
||||
self.num_hidden_layers = config.num_layers
|
||||
|
||||
self.block = [TFT5Block(config,
|
||||
has_relative_attention_bias=bool(i == 0),
|
||||
name='block_._{}'.format(i))
|
||||
for i in range(config.num_layers)]
|
||||
self.final_layer_norm = TFT5LayerNorm(epsilon=config.layer_norm_epsilon,
|
||||
name='final_layer_norm')
|
||||
self.dropout = tf.keras.layers.Dropout(config.dropout_rate)
|
||||
|
||||
def _resize_token_embeddings(self, new_num_tokens):
|
||||
raise NotImplementedError  # Not implemented yet in the library for TF 2.0 models
|
||||
|
||||
def _prune_heads(self, heads_to_prune):
|
||||
raise NotImplementedError  # Not implemented yet in the library for TF 2.0 models
|
||||
|
||||
def call(self, hidden_states, attention_mask=None, encoder_hidden_states=None,
|
||||
encoder_attention_mask=None, head_mask=None, training=False):
|
||||
|
||||
batch_size, seq_length = shape_list(hidden_states)[:2]
|
||||
if attention_mask is None:
|
||||
attention_mask = tf.fill((batch_size, seq_length), 1)
|
||||
if self.is_decoder and encoder_attention_mask is None:
|
||||
encoder_seq_length = encoder_hidden_states.shape[1]
|
||||
encoder_attention_mask = tf.fill((batch_size, encoder_seq_length), 1)
|
||||
|
||||
# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
|
||||
# ourselves in which case we just need to make it broadcastable to all heads.
|
||||
attention_mask = tf.cast(attention_mask, dtype=tf.float32)
|
||||
num_dims_attention_mask = len(shape_list(attention_mask))
|
||||
if num_dims_attention_mask == 3:
|
||||
extended_attention_mask = attention_mask[:, None, :, :]
|
||||
elif num_dims_attention_mask == 2:
|
||||
# Provided a padding mask of dimensions [batch_size, seq_length]
|
||||
# - if the model is a decoder, apply a causal mask in addition to the padding mask
|
||||
# - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
|
||||
if self.config.is_decoder:
|
||||
seq_ids = tf.range(seq_length)
|
||||
causal_mask = tf.less_equal(tf.tile(seq_ids[None, None, :], (batch_size, seq_length, 1)),
|
||||
seq_ids[None, :, None])
|
||||
causal_mask = tf.cast(causal_mask, dtype=tf.float32)
|
||||
extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
|
||||
else:
|
||||
extended_attention_mask = attention_mask[:, None, None, :]
|
||||
|
||||
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
|
||||
# masked positions, this operation will create a tensor which is 0.0 for
|
||||
# positions we want to attend and -10000.0 for masked positions.
|
||||
# Since we are adding it to the raw scores before the softmax, this is
|
||||
# effectively the same as removing these entirely.
|
||||
|
||||
# T5 has a mask that can compare sequence ids; we can simulate this here with this transposition
|
||||
# Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270
|
||||
# extended_attention_mask = tf.math.equal(extended_attention_mask,
|
||||
# tf.transpose(extended_attention_mask, perm=(-1, -2)))
|
||||
|
||||
extended_attention_mask = (1.0 - extended_attention_mask) * -1e9
|
||||
|
||||
if self.is_decoder:
|
||||
# If a 2D or 3D attention mask is provided for the cross-attention
|
||||
# we need to make it broadcastable to [batch_size, num_heads, seq_length, seq_length]
|
||||
encoder_attention_mask = tf.cast(encoder_attention_mask, dtype=tf.float32)
|
||||
num_dims_encoder_attention_mask = len(shape_list(encoder_attention_mask))
|
||||
if num_dims_encoder_attention_mask == 3:
|
||||
encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]
|
||||
if num_dims_encoder_attention_mask == 2:
|
||||
encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]
|
||||
|
||||
# T5 has a mask that can compare sequence ids; we can simulate this here with this transposition
|
||||
# Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270
|
||||
# encoder_extended_attention_mask = tf.math.equal(encoder_extended_attention_mask,
|
||||
# tf.transpose(encoder_extended_attention_mask, perm=(-1, -2)))
|
||||
|
||||
encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9
|
||||
else:
|
||||
encoder_extended_attention_mask = None
|
||||
|
||||
# Prepare head mask if needed
|
||||
# 1.0 in head_mask indicate we keep the head
|
||||
# attention_probs has shape bsz x n_heads x N x N
|
||||
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
|
||||
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
|
||||
if head_mask is not None:
|
||||
raise NotImplementedError
|
||||
else:
|
||||
head_mask = [None] * self.num_hidden_layers
|
||||
# head_mask = tf.constant([0] * self.num_hidden_layers)
|
||||
|
||||
all_hidden_states = ()
|
||||
all_attentions = ()
|
||||
position_bias = None
|
||||
encoder_decoder_position_bias = None
|
||||
for i, layer_module in enumerate(self.block):
|
||||
if self.output_hidden_states:
|
||||
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||
|
||||
layer_outputs = layer_module(hidden_states,
|
||||
attention_mask=extended_attention_mask,
|
||||
position_bias=position_bias,
|
||||
encoder_hidden_states=encoder_hidden_states,
|
||||
encoder_attention_mask=encoder_extended_attention_mask,
|
||||
encoder_decoder_position_bias=encoder_decoder_position_bias,
|
||||
head_mask=head_mask[i],
|
||||
training=training)
|
||||
hidden_states = layer_outputs[0]
|
||||
if i == 0:
|
||||
# We share the position biases between the layers - the first layer stores them
|
||||
# layer_outputs = hidden-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
|
||||
position_bias = layer_outputs[2 if self.output_attentions else 1]
|
||||
if self.is_decoder:
|
||||
encoder_decoder_position_bias = layer_outputs[4 if self.output_attentions else 2]
|
||||
|
||||
if self.output_attentions:
|
||||
all_attentions = all_attentions + (layer_outputs[1],)
|
||||
|
||||
hidden_states = self.final_layer_norm(hidden_states)
|
||||
layer_output = self.dropout(hidden_states, training=training)
|
||||
|
||||
# Add last layer
|
||||
if self.output_hidden_states:
|
||||
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||
|
||||
outputs = (hidden_states,)
|
||||
if self.output_hidden_states:
|
||||
outputs = outputs + (all_hidden_states,)
|
||||
if self.output_attentions:
|
||||
outputs = outputs + (all_attentions,)
|
||||
return outputs # last-layer hidden state, (all hidden states), (all attentions)
|
||||
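The additive mask built in call() above can be hard to parse at a glance; this tiny, hypothetical illustration shows the padding case (the causal and cross-attention masks use the same -1e9 trick).

import tensorflow as tf

attention_mask = tf.constant([[1., 1., 0.]])          # 1 = attend, 0 = padding
extended = (1.0 - attention_mask[:, None, None, :]) * -1e9
# extended == [[[[0., 0., -1e9]]]]: added to the raw attention scores before the
# softmax, it leaves allowed positions unchanged and effectively removes padded ones.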
|
||||
|
||||
####################################################
|
||||
# TFT5PreTrainedModel is a sub-class of tf.keras.Model
|
||||
# which take care of loading and saving pretrained weights
|
||||
# and various common utilities.
|
||||
# Here you just need to specify a few (self-explanatory)
|
||||
# pointers for your model.
|
||||
####################################################
|
||||
class TFT5PreTrainedModel(TFPreTrainedModel):
|
||||
""" An abstract class to handle weights initialization and
|
||||
a simple interface for downloading and loading pretrained models.
|
||||
"""
|
||||
config_class = T5Config
|
||||
pretrained_model_archive_map = TF_T5_PRETRAINED_MODEL_ARCHIVE_MAP
|
||||
base_model_prefix = "transformer"
|
||||
|
||||
@property
|
||||
def dummy_inputs(self):
|
||||
input_ids = tf.constant(DUMMY_INPUTS)
|
||||
input_mask = tf.constant(DUMMY_MASK)
|
||||
dummy_inputs = {'decoder_input_ids': input_ids,
|
||||
'encoder_input_ids': input_ids,
|
||||
'decoder_attention_mask': input_mask}
|
||||
return dummy_inputs
|
||||
|
||||
|
||||
T5_START_DOCSTRING = r""" The T5 model was proposed in
|
||||
`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_
|
||||
by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
|
||||
It's an encoder-decoder transformer pre-trained in a text-to-text denoising generative setting.
|
||||
|
||||
This model is a `tf.keras.Model`_ sub-class. Use it as a regular TF 2.0 Keras Model and
|
||||
refer to the TF 2.0 documentation for all matter related to general usage and behavior.
|
||||
|
||||
.. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`:
|
||||
https://arxiv.org/abs/1910.10683
|
||||
|
||||
.. _`tf.keras.Model`:
|
||||
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model
|
||||
|
||||
Note on the model inputs:
|
||||
TF 2.0 models accept two formats as inputs:
|
||||
|
||||
- having all inputs as keyword arguments (like PyTorch models), or
|
||||
- having all inputs as a list, tuple or dict in the first positional argument.
|
||||
|
||||
This second option is useful when using the `tf.keras.Model.fit()` method, which currently requires having all the tensors in the first argument of the model call function: `model(inputs)`.
|
||||
|
||||
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument (see the sketch after this docstring):
|
||||
|
||||
- a single Tensor with input_ids only and nothing else: `model(input_ids)`
|
||||
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
|
||||
`model([input_ids, attention_mask])` or `model([input_ids, attention_mask, token_type_ids])`
|
||||
- a dictionary with one or several input Tensors associated with the input names given in the docstring:
|
||||
`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
|
||||
|
||||
Parameters:
|
||||
config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model.
|
||||
Initializing with a config file does not load the weights associated with the model, only the configuration.
|
||||
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
|
||||
"""
|
||||
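To connect the input-format note above with the classes below: TFT5Model routes inputs to the encoder and decoder via the encoder_/decoder_ prefixes (see its call() method), so the dict form looks roughly like this sketch. It reuses the 't5-small' checkpoint from the docstring examples and is illustrative only.

import tensorflow as tf
from transformers import T5Tokenizer, TFT5Model

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = TFT5Model.from_pretrained('t5-small')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
# All tensors gathered in a single dict, usable as the first positional argument
# (e.g. with tf.keras.Model.fit()):
outputs = model({'encoder_input_ids': input_ids,
                 'decoder_input_ids': input_ids,
                 'decoder_attention_mask': tf.ones_like(input_ids)})
last_hidden_states = outputs[0]  # decoder's last hidden states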
|
||||
T5_INPUTS_DOCSTRING = r"""
|
||||
Inputs:
|
||||
**input_ids**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
|
||||
Indices of input sequence tokens in the vocabulary.
|
||||
To match pre-training, T5 input sequence should be formatted with [CLS] and [SEP] tokens as follows:
|
||||
|
||||
(a) For sequence pairs:
|
||||
|
||||
``tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]``
|
||||
|
||||
(b) For single sequences:
|
||||
|
||||
``tokens: [CLS] the dog is hairy . [SEP]``
|
||||
|
||||
|
||||
T5 is a model with relative position embeddings so you should be able to pad the inputs on
|
||||
the right or the left.
|
||||
|
||||
Indices can be obtained using :class:`transformers.T5Tokenizer`.
|
||||
See :func:`transformers.PreTrainedTokenizer.encode` and
|
||||
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
|
||||
**attention_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
|
||||
Mask to avoid performing attention on padding token indices.
|
||||
Mask values selected in ``[0, 1]``:
|
||||
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
|
||||
**head_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
|
||||
Mask to nullify selected heads of the self-attention modules.
|
||||
Mask values selected in ``[0, 1]``:
|
||||
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
|
||||
"""
|
||||
|
||||
@add_start_docstrings("The bare T5 Model transformer outputting raw hidden-states"
|
||||
"without any specific head on top.",
|
||||
T5_START_DOCSTRING, T5_INPUTS_DOCSTRING)
|
||||
class TFT5Model(TFT5PreTrainedModel):
|
||||
r"""
|
||||
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||
**last_hidden_state**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, hidden_size)``
|
||||
Sequence of hidden-states at the output of the last layer of the model.
|
||||
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||
list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
|
||||
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||
list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||
|
||||
Examples::
|
||||
|
||||
import tensorflow as tf
|
||||
from transformers import T5Tokenizer, TFT5Model
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained('t5-small')
|
||||
model = TFT5Model.from_pretrained('t5-small')
|
||||
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
|
||||
outputs = model(input_ids, encoder_input_ids=input_ids)  # both encoder and decoder need input ids
|
||||
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
|
||||
|
||||
"""
|
||||
def __init__(self, config, *inputs, **kwargs):
|
||||
super(TFT5Model, self).__init__(config, *inputs, **kwargs)
|
||||
self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model,
|
||||
name='shared')
|
||||
|
||||
encoder_config = copy.deepcopy(config)
|
||||
self.encoder = TFT5MainLayer(encoder_config, name='encoder')
|
||||
|
||||
decoder_config = copy.deepcopy(config)
|
||||
decoder_config.is_decoder = True
|
||||
self.decoder = TFT5MainLayer(decoder_config, name='decoder')
|
||||
|
||||
def get_input_embeddings(self):
|
||||
return self.shared
|
||||
|
||||
def get_output_embeddings(self):
|
||||
return self.shared
|
||||
|
||||
def call(self, decoder_input_ids, **kwargs):
|
||||
# We allow two types of multi-inputs:
|
||||
# - traditional keyword arguments in the call method
|
||||
# - all the arguments provided as a dict in the first positional argument of call
|
||||
# The last option is useful to use the tf.keras fit() method.
|
||||
|
||||
if isinstance(decoder_input_ids, dict):
|
||||
kwargs.update(decoder_input_ids)
|
||||
else:
|
||||
kwargs['decoder_input_ids'] = decoder_input_ids
|
||||
|
||||
kwargs_common = dict((k, v) for k, v in kwargs.items()
|
||||
if not k.startswith("encoder_") and not k.startswith("decoder_"))
|
||||
kwargs_encoder = kwargs_common.copy()
|
||||
kwargs_decoder = kwargs_common.copy()
|
||||
kwargs_encoder.update(dict((k[len("encoder_"):], v) for k, v in kwargs.items() if k.startswith("encoder_")))
|
||||
kwargs_decoder.update(dict((k[len("decoder_"):], v) for k, v in kwargs.items() if k.startswith("decoder_")))
|
||||
|
||||
# Encode if needed (training, first prediction pass)
|
||||
encoder_hidden_states = kwargs_encoder.pop("hidden_states", None)
|
||||
if encoder_hidden_states is None:
|
||||
# Convert encoder inputs in embeddings if needed
|
||||
hidden_states = kwargs_encoder.pop("inputs_embeds", None)
|
||||
if hidden_states is None:
|
||||
encoder_inputs_ids = kwargs_encoder.pop("input_ids")
|
||||
hidden_states = self.shared(encoder_inputs_ids) # Convert inputs in embeddings
|
||||
|
||||
encoder_outputs = self.encoder(hidden_states, **kwargs_encoder)
|
||||
encoder_hidden_states = encoder_outputs[0]
|
||||
else:
|
||||
encoder_outputs = ()
|
||||
|
||||
# Decode
|
||||
# Convert decoder inputs in embeddings if needed
|
||||
hidden_states = kwargs_decoder.pop("inputs_embeds", None)
|
||||
if hidden_states is None:
|
||||
decoder_inputs_ids = kwargs_decoder.pop("input_ids")
|
||||
hidden_states = self.shared(decoder_inputs_ids)
|
||||
|
||||
kwargs_decoder["encoder_hidden_states"] = encoder_hidden_states
|
||||
kwargs_decoder["encoder_attention_mask"] = kwargs_encoder.get("attention_mask", None)
|
||||
decoder_outputs = self.decoder(hidden_states, **kwargs_decoder)
|
||||
|
||||
return decoder_outputs + encoder_outputs
|
||||
|
||||
|
||||
@add_start_docstrings("""T5 Model with a `language modeling` head on top. """,
|
||||
T5_START_DOCSTRING, T5_INPUTS_DOCSTRING)
|
||||
class TFT5WithLMHeadModel(TFT5PreTrainedModel):
|
||||
r"""
|
||||
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
|
||||
**prediction_scores**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
|
||||
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
|
||||
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
|
||||
list of ``Numpy array`` or ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
|
||||
of shape ``(batch_size, sequence_length, hidden_size)``:
|
||||
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
|
||||
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
|
||||
list of ``Numpy array`` or ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
|
||||
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
|
||||
|
||||
Examples::
|
||||
|
||||
import tensorflow as tf
|
||||
from transformers import T5Tokenizer, TFT5WithLMHeadModel
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained('t5-small')
|
||||
model = TFT5WithLMHeadModel.from_pretrained('t5-small')
|
||||
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
|
||||
outputs = model(input_ids, encoder_input_ids=input_ids)  # both encoder and decoder need input ids
|
||||
prediction_scores = outputs[0]
|
||||
|
||||
"""
|
||||
def __init__(self, config, *inputs, **kwargs):
|
||||
super(TFT5WithLMHeadModel, self).__init__(config, *inputs, **kwargs)
|
||||
self.model_dim = config.d_model
|
||||
|
||||
self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model,
|
||||
name='shared')
|
||||
|
||||
encoder_config = copy.deepcopy(config)
|
||||
self.encoder = TFT5MainLayer(encoder_config, name='encoder')
|
||||
|
||||
decoder_config = copy.deepcopy(config)
|
||||
decoder_config.is_decoder = True
|
||||
self.decoder = TFT5MainLayer(decoder_config, name='decoder')
|
||||
|
||||
def get_input_embeddings(self):
|
||||
return self.shared
|
||||
|
||||
def get_output_embeddings(self):
|
||||
return self.shared
|
||||
|
||||
def call(self, decoder_input_ids, **kwargs):
|
||||
# We allow two types of multi-inputs:
|
||||
# - traditional keyword arguments in the call method
|
||||
# - all the arguments provided as a dict in the first positional argument of call
|
||||
# The last option is useful to use the tf.keras fit() method.
|
||||
|
||||
if isinstance(decoder_input_ids, dict):
|
||||
kwargs.update(decoder_input_ids)
|
||||
else:
|
||||
kwargs['decoder_input_ids'] = decoder_input_ids
|
||||
|
||||
kwargs_common = dict((k, v) for k, v in kwargs.items()
|
||||
if not k.startswith("encoder_") and not k.startswith("decoder_"))
|
||||
kwargs_encoder = kwargs_common.copy()
|
||||
kwargs_decoder = kwargs_common.copy()
|
||||
kwargs_encoder.update(dict((k[len("encoder_"):], v) for k, v in kwargs.items() if k.startswith("encoder_")))
|
||||
kwargs_decoder.update(dict((k[len("decoder_"):], v) for k, v in kwargs.items() if k.startswith("decoder_")))
|
||||
|
||||
# Encode if needed (training, first prediction pass)
|
||||
encoder_hidden_states = kwargs_encoder.pop("hidden_states", None)
|
||||
if encoder_hidden_states is None:
|
||||
# Convert encoder inputs in embeddings if needed
|
||||
hidden_states = kwargs_encoder.pop("inputs_embeds", None)
|
||||
if hidden_states is None:
|
||||
encoder_inputs_ids = kwargs_encoder.pop("input_ids")
|
||||
hidden_states = self.shared(encoder_inputs_ids) # Convert inputs in embeddings
|
||||
|
||||
encoder_outputs = self.encoder(hidden_states, **kwargs_encoder)
|
||||
encoder_hidden_states = encoder_outputs[0]
|
||||
else:
|
||||
encoder_outputs = ()
|
||||
|
||||
# Decode
|
||||
# Convert decoder inputs in embeddings if needed
|
||||
hidden_states = kwargs_decoder.pop("inputs_embeds", None)
|
||||
if hidden_states is None:
|
||||
decoder_inputs_ids = kwargs_decoder.pop("input_ids")
|
||||
hidden_states = self.shared(decoder_inputs_ids)
|
||||
|
||||
kwargs_decoder["encoder_hidden_states"] = encoder_hidden_states
|
||||
kwargs_decoder["encoder_attention_mask"] = kwargs_encoder.get("attention_mask", None)
|
||||
decoder_outputs = self.decoder(hidden_states, **kwargs_decoder)
|
||||
|
||||
sequence_output = decoder_outputs[0] * (self.model_dim ** -0.5)
|
||||
lm_logits = self.shared(sequence_output, mode="linear")
|
||||
decoder_outputs = (lm_logits,) + decoder_outputs[1:]
|
||||
|
||||
return decoder_outputs + encoder_outputs
|
@ -24,14 +24,12 @@ import os
|
||||
import tensorflow as tf
|
||||
|
||||
from .configuration_utils import PretrainedConfig
|
||||
from .file_utils import (TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME, WEIGHTS_NAME,
|
||||
from .file_utils import (TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME, WEIGHTS_NAME, DUMMY_INPUTS,
|
||||
cached_path, hf_bucket_url, is_remote_url)
|
||||
from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
DUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
|
||||
|
||||
class TFPreTrainedModel(tf.keras.Model):
|
||||
r""" Base class for all TF models.
|
||||
|
||||
@ -60,7 +58,7 @@ class TFPreTrainedModel(tf.keras.Model):
|
||||
Returns:
|
||||
tf.Tensor with dummy inputs
|
||||
"""
|
||||
return tf.constant(DUMMY_INPUTS)
|
||||
return {'input_ids': tf.constant(DUMMY_INPUTS)}
|
||||
|
||||
def __init__(self, config, *inputs, **kwargs):
|
||||
super(TFPreTrainedModel, self).__init__(*inputs, **kwargs)
|
||||
|
@ -460,7 +460,7 @@ class TFXLMPreTrainedModel(TFPreTrainedModel):
|
||||
langs_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
|
||||
else:
|
||||
langs_list = None
|
||||
return [inputs_list, attns_list, langs_list]
|
||||
return {'input_ids': inputs_list, 'attention_mask': attns_list, 'langs': langs_list}
|
||||
|
||||
|
||||
XLM_START_DOCSTRING = r""" The XLM model was proposed in
|
||||
|
@ -31,12 +31,11 @@ from torch.nn import CrossEntropyLoss
|
||||
from torch.nn import functional as F
|
||||
|
||||
from .configuration_utils import PretrainedConfig
|
||||
from .file_utils import (TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME, WEIGHTS_NAME,
|
||||
from .file_utils import (TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME, WEIGHTS_NAME, DUMMY_INPUTS,
|
||||
cached_path, hf_bucket_url, is_remote_url)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
try:
|
||||
from torch.nn import Identity
|
||||
except ImportError:
|
||||
@ -72,6 +71,15 @@ class PreTrainedModel(nn.Module):
|
||||
load_tf_weights = lambda model, config, path: None
|
||||
base_model_prefix = ""
|
||||
|
||||
@property
|
||||
def dummy_inputs(self):
|
||||
""" Dummy inputs to do a forward pass in the network.
|
||||
|
||||
Returns:
|
||||
torch.Tensor with dummy inputs
|
||||
"""
|
||||
return {'input_ids': torch.tensor(DUMMY_INPUTS)}
|
||||
|
||||
def __init__(self, config, *inputs, **kwargs):
|
||||
super(PreTrainedModel, self).__init__()
|
||||
if not isinstance(config, PretrainedConfig):
|
||||
@ -161,8 +169,7 @@ class PreTrainedModel(nn.Module):
|
||||
base_model.vocab_size = new_num_tokens
|
||||
|
||||
# Tie weights again if needed
|
||||
if hasattr(self, 'tie_weights'):
|
||||
self.tie_weights()
|
||||
self.tie_weights()
|
||||
|
||||
return model_embeds
|
||||
|
||||
@ -478,8 +485,7 @@ class PreTrainedModel(nn.Module):
|
||||
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
|
||||
model.__class__.__name__, "\n\t".join(error_msgs)))
|
||||
|
||||
if hasattr(model, 'tie_weights'):
|
||||
model.tie_weights() # make sure word embedding weights are still tied
|
||||
model.tie_weights() # make sure word embedding weights are still tied if needed
|
||||
|
||||
# Set model in evaluation mode to deactivate DropOut modules by default
|
||||
model.eval()
|
||||
|
@ -227,6 +227,16 @@ class XLMPreTrainedModel(PreTrainedModel):
|
||||
def __init__(self, *inputs, **kwargs):
|
||||
super(XLMPreTrainedModel, self).__init__(*inputs, **kwargs)
|
||||
|
||||
@property
|
||||
def dummy_inputs(self):
|
||||
inputs_list = torch.tensor([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
|
||||
attns_list = torch.tensor([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
|
||||
if self.config.use_lang_emb and self.config.n_langs > 1:
|
||||
langs_list = torch.tensor([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
|
||||
else:
|
||||
langs_list = None
|
||||
return {'input_ids': inputs_list, 'attention_mask': attns_list, 'langs': langs_list}
|
||||
|
||||
def _init_weights(self, module):
|
||||
""" Initialize the weights. """
|
||||
if isinstance(module, nn.Embedding):
|
||||
@ -646,7 +656,7 @@ class XLMWithLMHeadModel(XLMPreTrainedModel):
|
||||
langs=langs,
|
||||
token_type_ids=token_type_ids,
|
||||
position_ids=position_ids,
|
||||
lengths=lengths,
|
||||
lengths=lengths,
|
||||
cache=cache,
|
||||
head_mask=head_mask,
|
||||
inputs_embeds=inputs_embeds)
|
||||
|
0 transformers/tests/fixtures/empty.txt vendored Normal file
@ -15,18 +15,30 @@
|
||||
from __future__ import absolute_import, division, print_function
|
||||
|
||||
import os
|
||||
import six
|
||||
import time
|
||||
import unittest
|
||||
|
||||
from transformers.hf_api import HfApi, S3Obj, PresignedUrl, HfFolder, HTTPError
|
||||
import requests
|
||||
import six
|
||||
|
||||
from transformers.hf_api import HfApi, HfFolder, HTTPError, PresignedUrl, S3Obj
|
||||
|
||||
USER = "__DUMMY_TRANSFORMERS_USER__"
|
||||
PASS = "__DUMMY_TRANSFORMERS_PASS__"
|
||||
FILE_KEY = "Test-{}.txt".format(int(time.time()))
|
||||
FILE_PATH = os.path.join(
|
||||
os.path.dirname(os.path.abspath(__file__)), "fixtures/input.txt"
|
||||
)
|
||||
FILES = [
|
||||
(
|
||||
"Test-{}.txt".format(int(time.time())),
|
||||
os.path.join(
|
||||
os.path.dirname(os.path.abspath(__file__)), "fixtures/input.txt"
|
||||
)
|
||||
),
|
||||
(
|
||||
"yoyo {}.txt".format(int(time.time())), # space is intentional
|
||||
os.path.join(
|
||||
os.path.dirname(os.path.abspath(__file__)), "fixtures/empty.txt"
|
||||
)
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
|
||||
@ -57,15 +69,21 @@ class HfApiEndpointsTest(HfApiCommonTest):
|
||||
self.assertEqual(user, USER)
|
||||
|
||||
def test_presign(self):
|
||||
urls = self._api.presign(token=self._token, filename=FILE_KEY)
|
||||
self.assertIsInstance(urls, PresignedUrl)
|
||||
self.assertEqual(urls.type, "text/plain")
|
||||
for FILE_KEY, FILE_PATH in FILES:
|
||||
urls = self._api.presign(token=self._token, filename=FILE_KEY)
|
||||
self.assertIsInstance(urls, PresignedUrl)
|
||||
self.assertEqual(urls.type, "text/plain")
|
||||
|
||||
def test_presign_and_upload(self):
|
||||
access_url = self._api.presign_and_upload(
|
||||
token=self._token, filename=FILE_KEY, filepath=FILE_PATH
|
||||
)
|
||||
self.assertIsInstance(access_url, six.string_types)
|
||||
for FILE_KEY, FILE_PATH in FILES:
|
||||
access_url = self._api.presign_and_upload(
|
||||
token=self._token, filename=FILE_KEY, filepath=FILE_PATH
|
||||
)
|
||||
self.assertIsInstance(access_url, six.string_types)
|
||||
with open(FILE_PATH, 'r') as f:
|
||||
body = f.read()
|
||||
r = requests.get(access_url)
|
||||
self.assertEqual(r.text, body)
|
||||
|
||||
def test_list_objs(self):
|
||||
objs = self._api.list_objs(token=self._token)
|
||||
|
@ -58,7 +58,7 @@ else:
|
||||
def _config_zero_init(config):
|
||||
configs_no_init = copy.deepcopy(config)
|
||||
for key in configs_no_init.__dict__.keys():
|
||||
if '_range' in key or '_std' in key:
|
||||
if '_range' in key or '_std' in key or 'initializer_factor' in key:
|
||||
setattr(configs_no_init, key, 0.0)
|
||||
return configs_no_init
|
||||
|
||||
@ -73,6 +73,7 @@ class CommonTestCases:
|
||||
test_pruning = True
|
||||
test_resize_embeddings = True
|
||||
test_head_masking = True
|
||||
is_encoder_decoder = False
|
||||
|
||||
def test_save_load(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
@ -83,6 +84,8 @@ class CommonTestCases:
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs_dict)
|
||||
out_2 = outputs[0].numpy()
|
||||
out_2[np.isnan(out_2)] = 0
|
||||
|
||||
with TemporaryDirectory() as tmpdirname:
|
||||
model.save_pretrained(tmpdirname)
|
||||
@ -93,9 +96,7 @@ class CommonTestCases:
|
||||
|
||||
# Make sure we don't have nans
|
||||
out_1 = after_outputs[0].cpu().numpy()
|
||||
out_2 = outputs[0].cpu().numpy()
|
||||
out_1 = out_1[~np.isnan(out_1)]
|
||||
out_2 = out_2[~np.isnan(out_2)]
|
||||
out_1[np.isnan(out_1)] = 0
|
||||
max_diff = np.amax(np.abs(out_1 - out_2))
|
||||
self.assertLessEqual(max_diff, 1e-5)
|
||||
|
||||
@ -117,20 +118,32 @@ class CommonTestCases:
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
first, second = model(inputs_dict["input_ids"])[0], model(inputs_dict["input_ids"])[0]
|
||||
self.assertEqual(first.ne(second).sum().item(), 0)
|
||||
|
||||
with torch.no_grad():
|
||||
first = model(**inputs_dict)[0]
|
||||
second = model(**inputs_dict)[0]
|
||||
out_1 = first.cpu().numpy()
|
||||
out_2 = second.cpu().numpy()
|
||||
out_1 = out_1[~np.isnan(out_1)]
|
||||
out_2 = out_2[~np.isnan(out_2)]
|
||||
max_diff = np.amax(np.abs(out_1 - out_2))
|
||||
self.assertLessEqual(max_diff, 1e-5)
|
||||
|
||||
def test_attention_outputs(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
decoder_seq_length = self.model_tester.decoder_seq_length if hasattr(self.model_tester, 'decoder_seq_length') else self.model_tester.seq_length
|
||||
encoder_seq_length = self.model_tester.encoder_seq_length if hasattr(self.model_tester, 'encoder_seq_length') else self.model_tester.seq_length
|
||||
decoder_key_length = self.model_tester.key_length if hasattr(self.model_tester, 'key_length') else decoder_seq_length
|
||||
encoder_key_length = self.model_tester.key_length if hasattr(self.model_tester, 'key_length') else encoder_seq_length
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
config.output_attentions = True
|
||||
config.output_hidden_states = False
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
outputs = model(**inputs_dict)
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs_dict)
|
||||
attentions = outputs[-1]
|
||||
self.assertEqual(model.config.output_attentions, True)
|
||||
self.assertEqual(model.config.output_hidden_states, False)
|
||||
@ -138,28 +151,42 @@ class CommonTestCases:
|
||||
self.assertListEqual(
|
||||
list(attentions[0].shape[-3:]),
|
||||
[self.model_tester.num_attention_heads,
|
||||
self.model_tester.seq_length,
|
||||
self.model_tester.key_len if hasattr(self.model_tester, 'key_len') else self.model_tester.seq_length])
|
||||
encoder_seq_length,
|
||||
encoder_key_length])
|
||||
out_len = len(outputs)
|
||||
|
||||
if self.is_encoder_decoder:
|
||||
self.assertEqual(out_len % 2, 0)
|
||||
decoder_attentions = outputs[(out_len // 2)-1]
|
||||
self.assertEqual(model.config.output_attentions, True)
|
||||
self.assertEqual(model.config.output_hidden_states, False)
|
||||
self.assertEqual(len(decoder_attentions), self.model_tester.num_hidden_layers)
|
||||
self.assertListEqual(
|
||||
list(decoder_attentions[0].shape[-3:]),
|
||||
[self.model_tester.num_attention_heads,
|
||||
decoder_seq_length,
|
||||
decoder_key_length
|
||||
])
|
||||
|
||||
# Check attention is always last and order is fine
|
||||
config.output_attentions = True
|
||||
config.output_hidden_states = True
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
outputs = model(**inputs_dict)
|
||||
self.assertEqual(out_len+1, len(outputs))
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs_dict)
|
||||
self.assertEqual(out_len + (2 if self.is_encoder_decoder else 1), len(outputs))
|
||||
self.assertEqual(model.config.output_attentions, True)
|
||||
self.assertEqual(model.config.output_hidden_states, True)
|
||||
|
||||
attentions = outputs[-1]
|
||||
self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
|
||||
self_attentions = outputs[-1]
|
||||
self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
|
||||
self.assertListEqual(
|
||||
list(attentions[0].shape[-3:]),
|
||||
list(self_attentions[0].shape[-3:]),
|
||||
[self.model_tester.num_attention_heads,
|
||||
self.model_tester.seq_length,
|
||||
self.model_tester.key_len if hasattr(self.model_tester, 'key_len') else self.model_tester.seq_length])
|
||||
encoder_seq_length,
|
||||
encoder_key_length])
|
||||
|
||||
def test_torchscript(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
@ -223,7 +250,6 @@ class CommonTestCases:
|
||||
|
||||
self.assertTrue(models_equal)
|
||||
|
||||
|
||||
def test_headmasking(self):
|
||||
if not self.test_head_masking:
|
||||
return
|
||||
@ -278,7 +304,6 @@ class CommonTestCases:
|
||||
self.assertNotEqual(
|
||||
attentions[-1][..., -1, :, :].flatten().sum().item(), 0.0)
|
||||
|
||||
|
||||
def test_head_pruning(self):
|
||||
if not self.test_pruning:
|
||||
return
|
||||
@ -297,7 +322,8 @@ class CommonTestCases:
|
||||
heads_to_prune = {0: list(range(1, self.model_tester.num_attention_heads)),
|
||||
-1: [0]}
|
||||
model.prune_heads(heads_to_prune)
|
||||
outputs = model(**inputs_dict)
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs_dict)
|
||||
|
||||
attentions = outputs[-1]
|
||||
|
||||
@ -333,7 +359,8 @@ class CommonTestCases:
|
||||
model = model_class.from_pretrained(directory)
|
||||
model.to(torch_device)
|
||||
|
||||
outputs = model(**inputs_dict)
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs_dict)
|
||||
attentions = outputs[-1]
|
||||
self.assertEqual(attentions[0].shape[-3], 1)
|
||||
self.assertEqual(attentions[1].shape[-3], self.model_tester.num_attention_heads)
|
||||
@ -362,7 +389,8 @@ class CommonTestCases:
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
outputs = model(**inputs_dict)
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs_dict)
|
||||
attentions = outputs[-1]
|
||||
|
||||
self.assertEqual(attentions[0].shape[-3], 1)
|
||||
@ -389,7 +417,8 @@ class CommonTestCases:
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
outputs = model(**inputs_dict)
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs_dict)
|
||||
attentions = outputs[-1]
|
||||
|
||||
self.assertEqual(attentions[0].shape[-3], self.model_tester.num_attention_heads - 1)
|
||||
@ -406,7 +435,8 @@ class CommonTestCases:
|
||||
model.to(torch_device)
|
||||
shutil.rmtree(directory)
|
||||
|
||||
outputs = model(**inputs_dict)
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs_dict)
|
||||
attentions = outputs[-1]
|
||||
|
||||
self.assertEqual(attentions[0].shape[-3], self.model_tester.num_attention_heads - 1)
|
||||
@ -417,7 +447,8 @@ class CommonTestCases:
|
||||
heads_to_prune = {0: [0], 2: [1, 2]}
|
||||
model.prune_heads(heads_to_prune)
|
||||
|
||||
outputs = model(**inputs_dict)
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs_dict)
|
||||
attentions = outputs[-1]
|
||||
|
||||
self.assertEqual(attentions[0].shape[-3], self.model_tester.num_attention_heads -1)
|
||||
@ -427,7 +458,6 @@ class CommonTestCases:
|
||||
|
||||
self.assertDictEqual(model.config.pruned_heads, {0: [0], 1: [1, 2], 2: [1, 2]})
|
||||
|
||||
|
||||
def test_hidden_states_output(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
@ -437,14 +467,16 @@ class CommonTestCases:
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
outputs = model(**inputs_dict)
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs_dict)
|
||||
hidden_states = outputs[-1]
|
||||
self.assertEqual(model.config.output_attentions, False)
|
||||
self.assertEqual(model.config.output_hidden_states, True)
|
||||
self.assertEqual(len(hidden_states), self.model_tester.num_hidden_layers + 1)
|
||||
self.assertListEqual(
|
||||
list(hidden_states[0].shape[-2:]),
|
||||
[self.model_tester.seq_length, self.model_tester.hidden_size])
|
||||
[self.model_tester.encoder_seq_length if hasattr(self.model_tester, 'encoder_seq_length') else self.model_tester.seq_length,
|
||||
self.model_tester.hidden_size])
|
||||
|
||||
def test_resize_tokens_embeddings(self):
|
||||
original_config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
@ -550,8 +582,14 @@ class CommonTestCases:
|
||||
|
||||
def test_inputs_embeds(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
input_ids = inputs_dict["input_ids"]
|
||||
del inputs_dict["input_ids"]
|
||||
if not self.is_encoder_decoder:
|
||||
input_ids = inputs_dict["input_ids"]
|
||||
del inputs_dict["input_ids"]
|
||||
else:
|
||||
encoder_input_ids = inputs_dict["encoder_input_ids"]
|
||||
decoder_input_ids = inputs_dict["decoder_input_ids"]
|
||||
del inputs_dict["encoder_input_ids"]
|
||||
del inputs_dict["decoder_input_ids"]
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config)
|
||||
@ -559,9 +597,14 @@ class CommonTestCases:
|
||||
model.eval()
|
||||
|
||||
wte = model.get_input_embeddings()
|
||||
inputs_dict["inputs_embeds"] = wte(input_ids)
|
||||
outputs = model(**inputs_dict)
|
||||
if not self.is_encoder_decoder:
|
||||
inputs_dict["inputs_embeds"] = wte(input_ids)
|
||||
else:
|
||||
inputs_dict["encoder_inputs_embeds"] = wte(encoder_input_ids)
|
||||
inputs_dict["decoder_inputs_embeds"] = wte(decoder_input_ids)
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs_dict)
|
||||
|
||||
class GPTModelTester(CommonModelTester):
|
||||
|
||||
@ -649,9 +692,10 @@ class CommonTestCases:
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
outputs = model(input_ids, position_ids, token_type_ids)
|
||||
outputs = model(input_ids, position_ids)
|
||||
outputs = model(input_ids)
|
||||
with torch.no_grad():
|
||||
outputs = model(input_ids, position_ids, token_type_ids)
|
||||
outputs = model(input_ids, position_ids)
|
||||
outputs = model(input_ids)
|
||||
|
||||
hidden_state = outputs[0]
|
||||
self.parent.assertListEqual(
|
||||
@ -664,7 +708,8 @@ class CommonTestCases:
|
||||
model = self.lm_head_model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
outputs = model(input_ids, position_ids, token_type_ids, lm_labels)
|
||||
with torch.no_grad():
|
||||
outputs = model(input_ids, position_ids, token_type_ids, lm_labels)
|
||||
loss, lm_logits = outputs[:2]
|
||||
|
||||
total_voc = self.vocab_size
|
||||
@ -681,7 +726,8 @@ class CommonTestCases:
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
outputs = model(input_ids)
|
||||
with torch.no_grad():
|
||||
outputs = model(input_ids)
|
||||
presents = outputs[-1]
|
||||
self.parent.assertEqual(self.num_hidden_layers, len(presents))
|
||||
self.parent.assertListEqual(
|
||||
@ -694,7 +740,8 @@ class CommonTestCases:
|
||||
model = self.double_head_model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
outputs = model(input_ids, mc_token_ids, lm_labels=lm_labels, mc_labels=mc_labels,
|
||||
with torch.no_grad():
|
||||
outputs = model(input_ids, mc_token_ids, lm_labels=lm_labels, mc_labels=mc_labels,
|
||||
token_type_ids=token_type_ids, position_ids=position_ids)
|
||||
lm_loss, mc_loss, lm_logits, mc_logits = outputs[:4]
|
||||
loss = [lm_loss, mc_loss]
|
||||
|
185 transformers/tests/modeling_t5_test.py Normal file
@ -0,0 +1,185 @@
# coding=utf-8
# Copyright 2018 Google T5 Authors and HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import unittest
import shutil

from transformers import is_torch_available

from .modeling_common_test import (CommonTestCases, ids_tensor, floats_tensor)
from .configuration_common_test import ConfigTester
from .utils import require_torch, slow, torch_device

if is_torch_available():
    from transformers import (T5Config, T5Model, T5WithLMHeadModel)
    from transformers.modeling_t5 import T5_PRETRAINED_MODEL_ARCHIVE_MAP


@require_torch
class T5ModelTest(CommonTestCases.CommonModelTester):

    all_model_classes = (T5Model, T5WithLMHeadModel) if is_torch_available() else ()
    test_pruning = False
    test_torchscript = False
    test_resize_embeddings = False
    is_encoder_decoder = True

    class T5ModelTester(object):

        def __init__(self,
                     parent,
                     batch_size=13,
                     encoder_seq_length=7,
                     decoder_seq_length=9,
                     is_training=True,
                     use_attention_mask=True,
                     use_labels=True,
                     vocab_size=99,
                     n_positions=14,
                     hidden_size=32,
                     num_hidden_layers=5,
                     num_attention_heads=4,
                     d_ff=37,
                     relative_attention_num_buckets=8,
                     dropout_rate=0.1,
                     initializer_factor=0.002,
                     scope=None,
                     ):
            self.parent = parent
            self.batch_size = batch_size
            self.encoder_seq_length = encoder_seq_length
            self.decoder_seq_length = decoder_seq_length
            self.is_training = is_training
            self.use_attention_mask = use_attention_mask
            self.use_labels = use_labels
            self.vocab_size = vocab_size
            self.n_positions = n_positions
            self.hidden_size = hidden_size
            self.num_hidden_layers = num_hidden_layers
            self.num_attention_heads = num_attention_heads
            self.d_ff = d_ff
            self.relative_attention_num_buckets = relative_attention_num_buckets
            self.dropout_rate = dropout_rate
            self.initializer_factor = initializer_factor
            self.scope = scope

        def prepare_config_and_inputs(self):
            encoder_input_ids = ids_tensor([self.batch_size, self.encoder_seq_length], self.vocab_size)
            decoder_input_ids = ids_tensor([self.batch_size, self.decoder_seq_length], self.vocab_size)

            encoder_attention_mask = None
            decoder_attention_mask = None
            if self.use_attention_mask:
                encoder_attention_mask = ids_tensor([self.batch_size, self.encoder_seq_length], vocab_size=2)
                decoder_attention_mask = ids_tensor([self.batch_size, self.decoder_seq_length], vocab_size=2)

            decoder_lm_labels = None
            if self.use_labels:
                decoder_lm_labels = ids_tensor([self.batch_size, self.decoder_seq_length], self.vocab_size)

            config = T5Config(
                vocab_size_or_config_json_file=self.vocab_size,
                n_positions=self.n_positions,
                d_model=self.hidden_size,
                d_ff=self.d_ff,
                d_kv=self.hidden_size // self.num_attention_heads,
                num_layers=self.num_hidden_layers,
                num_heads=self.num_attention_heads,
                relative_attention_num_buckets=self.relative_attention_num_buckets,
                dropout_rate=self.dropout_rate,
                initializer_factor=self.initializer_factor)

            return (config, encoder_input_ids, decoder_input_ids, encoder_attention_mask, decoder_attention_mask, decoder_lm_labels)

        def check_loss_output(self, result):
            self.parent.assertListEqual(
                list(result["loss"].size()),
                [])

        def create_and_check_t5_model(self, config, encoder_input_ids, decoder_input_ids, encoder_attention_mask, decoder_attention_mask, decoder_lm_labels):
            model = T5Model(config=config)
            model.eval()
            decoder_output, encoder_output = model(encoder_input_ids=encoder_input_ids,
                                                   decoder_input_ids=decoder_input_ids,
                                                   encoder_attention_mask=encoder_attention_mask,
                                                   decoder_attention_mask=decoder_attention_mask)
            decoder_output, encoder_output = model(encoder_input_ids=encoder_input_ids,
                                                   decoder_input_ids=decoder_input_ids)

            result = {
                "encoder_output": encoder_output,
                "decoder_output": decoder_output,
            }
            self.parent.assertListEqual(
                list(result["encoder_output"].size()),
                [self.batch_size, self.encoder_seq_length, self.hidden_size])
            self.parent.assertListEqual(
                list(result["decoder_output"].size()),
                [self.batch_size, self.decoder_seq_length, self.hidden_size])


        def create_and_check_t5_with_lm_head(self, config, encoder_input_ids, decoder_input_ids, encoder_attention_mask, decoder_attention_mask, decoder_lm_labels):
            model = T5WithLMHeadModel(config=config)
            model.eval()
            outputs = model(encoder_input_ids=encoder_input_ids, decoder_input_ids=decoder_input_ids,
                            decoder_attention_mask=decoder_attention_mask, decoder_lm_labels=decoder_lm_labels)
            loss, prediction_scores = outputs[0], outputs[1]
            result = {
                "loss": loss,
                "prediction_scores": prediction_scores,
            }
            self.parent.assertListEqual(
                list(result["prediction_scores"].size()),
                [self.batch_size, self.decoder_seq_length, self.vocab_size])
            self.check_loss_output(result)

        def prepare_config_and_inputs_for_common(self):
            config_and_inputs = self.prepare_config_and_inputs()
            (config, encoder_input_ids, decoder_input_ids, encoder_attention_mask,
             decoder_attention_mask, decoder_lm_labels) = config_and_inputs
            inputs_dict = {'encoder_input_ids': encoder_input_ids,
                           'decoder_input_ids': decoder_input_ids,
                           'decoder_attention_mask': decoder_attention_mask,
                           'encoder_attention_mask': encoder_attention_mask}
            return config, inputs_dict

    def setUp(self):
        self.model_tester = T5ModelTest.T5ModelTester(self)
        self.config_tester = ConfigTester(self, config_class=T5Config, d_model=37)

    def test_config(self):
        self.config_tester.run_common_tests()

    def test_t5_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_t5_model(*config_and_inputs)

    def test_with_lm_head(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_t5_with_lm_head(*config_and_inputs)

    @slow
    def test_model_from_pretrained(self):
        cache_dir = "/tmp/transformers_test/"
        for model_name in list(T5_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
            model = T5Model.from_pretrained(model_name, cache_dir=cache_dir)
            shutil.rmtree(cache_dir)
            self.assertIsNotNone(model)


if __name__ == "__main__":
    unittest.main()
@ -69,6 +69,7 @@ class TFCommonTestCases:
        test_torchscript = True
        test_pruning = True
        test_resize_embeddings = True
        is_encoder_decoder = False

        def test_initialization(self):
            pass
@ -129,8 +130,12 @@ class TFCommonTestCases:
                                  for name, key in inputs_dict.items())
            with torch.no_grad():
                pto = pt_model(**pt_inputs_dict)
            tfo = tf_model(inputs_dict)
            max_diff = np.amax(np.abs(tfo[0].numpy() - pto[0].numpy()))
            tfo = tf_model(inputs_dict, training=False)
            tf_hidden_states = tfo[0].numpy()
            pt_hidden_states = pto[0].numpy()
            tf_hidden_states[np.isnan(tf_hidden_states)] = 0
            pt_hidden_states[np.isnan(pt_hidden_states)] = 0
            max_diff = np.amax(np.abs(tf_hidden_states - pt_hidden_states))
            self.assertLessEqual(max_diff, 2e-2)

            # Check we can load pt model in tf and vice-versa with checkpoint => model functions
@ -150,13 +155,21 @@ class TFCommonTestCases:
            with torch.no_grad():
                pto = pt_model(**pt_inputs_dict)
            tfo = tf_model(inputs_dict)
            max_diff = np.amax(np.abs(tfo[0].numpy() - pto[0].numpy()))
            tfo = tfo[0].numpy()
            pto = pto[0].numpy()
            tfo[np.isnan(tfo)] = 0
            pto[np.isnan(pto)] = 0
            max_diff = np.amax(np.abs(tfo - pto))
            self.assertLessEqual(max_diff, 2e-2)

        def test_compile_tf_model(self):
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

            input_ids = tf.keras.Input(batch_shape=(2, 2000), name='input_ids', dtype='int32')
            if self.is_encoder_decoder:
                input_ids = {'decoder_input_ids': tf.keras.Input(batch_shape=(2, 2000), name='decoder_input_ids', dtype='int32'),
                             'encoder_input_ids': tf.keras.Input(batch_shape=(2, 2000), name='encoder_input_ids', dtype='int32')}
            else:
                input_ids = tf.keras.Input(batch_shape=(2, 2000), name='input_ids', dtype='int32')
            optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
            loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
            metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
@ -189,7 +202,7 @@ class TFCommonTestCases:
                outputs_dict = model(inputs_dict)

                inputs_keywords = copy.deepcopy(inputs_dict)
                input_ids = inputs_keywords.pop('input_ids')
                input_ids = inputs_keywords.pop('input_ids' if not self.is_encoder_decoder else 'decoder_input_ids', None)
                outputs_keywords = model(input_ids, **inputs_keywords)

                output_dict = outputs_dict[0].numpy()
@ -200,6 +213,11 @@ class TFCommonTestCases:
        def test_attention_outputs(self):
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

            decoder_seq_length = self.model_tester.decoder_seq_length if hasattr(self.model_tester, 'decoder_seq_length') else self.model_tester.seq_length
            encoder_seq_length = self.model_tester.encoder_seq_length if hasattr(self.model_tester, 'encoder_seq_length') else self.model_tester.seq_length
            decoder_key_length = self.model_tester.key_length if hasattr(self.model_tester, 'key_length') else decoder_seq_length
            encoder_key_length = self.model_tester.key_length if hasattr(self.model_tester, 'key_length') else encoder_seq_length

            for model_class in self.all_model_classes:
                config.output_attentions = True
                config.output_hidden_states = False
@ -212,16 +230,28 @@ class TFCommonTestCases:
                self.assertListEqual(
                    list(attentions[0].shape[-3:]),
                    [self.model_tester.num_attention_heads,
                     self.model_tester.seq_length,
                     self.model_tester.key_len if hasattr(self.model_tester, 'key_len') else self.model_tester.seq_length])
                     encoder_seq_length,
                     encoder_key_length])
                out_len = len(outputs)

                if self.is_encoder_decoder:
                    self.assertEqual(out_len % 2, 0)
                    decoder_attentions = outputs[(out_len // 2)-1]
                    self.assertEqual(model.config.output_attentions, True)
                    self.assertEqual(model.config.output_hidden_states, False)
                    self.assertEqual(len(decoder_attentions), self.model_tester.num_hidden_layers)
                    self.assertListEqual(
                        list(decoder_attentions[0].shape[-3:]),
                        [self.model_tester.num_attention_heads,
                         decoder_seq_length,
                         decoder_key_length])

                # Check attention is always last and order is fine
                config.output_attentions = True
                config.output_hidden_states = True
                model = model_class(config)
                outputs = model(inputs_dict)
                self.assertEqual(out_len+1, len(outputs))
                self.assertEqual(out_len + (2 if self.is_encoder_decoder else 1), len(outputs))
                self.assertEqual(model.config.output_attentions, True)
                self.assertEqual(model.config.output_hidden_states, True)

@ -230,8 +260,8 @@ class TFCommonTestCases:
                self.assertListEqual(
                    list(attentions[0].shape[-3:]),
                    [self.model_tester.num_attention_heads,
                     self.model_tester.seq_length,
                     self.model_tester.key_len if hasattr(self.model_tester, 'key_len') else self.model_tester.seq_length])
                     encoder_seq_length,
                     encoder_key_length])

        def test_hidden_states_output(self):
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
@ -264,35 +294,53 @@ class TFCommonTestCases:
            for model_class in self.all_model_classes:
                model = model_class(config)
                first, second = model(inputs_dict, training=False)[0], model(inputs_dict, training=False)[0]
                self.assertTrue(tf.math.equal(first, second).numpy().all())
                out_1 = first.numpy()
                out_2 = second.numpy()
                out_1 = out_1[~np.isnan(out_1)]
                out_2 = out_2[~np.isnan(out_2)]
                max_diff = np.amax(np.abs(out_1 - out_2))
                self.assertLessEqual(max_diff, 1e-5)

        def _get_embeds(self, wte, input_ids):
            # ^^ In our TF models, the input_embeddings can take slightly different forms,
            # so we try a few of them.
            # We used to fall back to just synthetically creating a dummy tensor of ones:
            try:
                x = wte(input_ids, mode="embedding")
            except:
                try:
                    x = wte([input_ids], mode="embedding")
                except:
                    try:
                        x = wte([input_ids, None, None, None], mode="embedding")
                    except:
                        if hasattr(self.model_tester, "embedding_size"):
                            x = tf.ones(input_ids.shape + [self.model_tester.embedding_size], dtype=tf.dtypes.float32)
                        else:
                            x = tf.ones(input_ids.shape + [self.model_tester.hidden_size], dtype=tf.dtypes.float32)
            return x

        def test_inputs_embeds(self):
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
            input_ids = inputs_dict["input_ids"]
            del inputs_dict["input_ids"]
            if not self.is_encoder_decoder:
                input_ids = inputs_dict["input_ids"]
                del inputs_dict["input_ids"]
            else:
                encoder_input_ids = inputs_dict["encoder_input_ids"]
                decoder_input_ids = inputs_dict["decoder_input_ids"]
                del inputs_dict["encoder_input_ids"]
                del inputs_dict["decoder_input_ids"]

            for model_class in self.all_model_classes:
                model = model_class(config)

                wte = model.get_input_embeddings()
                try:
                    x = wte(input_ids, mode="embedding")
                except:
                    try:
                        x = wte([input_ids], mode="embedding")
                    except:
                        try:
                            x = wte([input_ids, None, None, None], mode="embedding")
                        except:
                            if hasattr(self.model_tester, "embedding_size"):
                                x = tf.ones(input_ids.shape + [self.model_tester.embedding_size], dtype=tf.dtypes.float32)
                            else:
                                x = tf.ones(input_ids.shape + [self.model_tester.hidden_size], dtype=tf.dtypes.float32)
                # ^^ In our TF models, the input_embeddings can take slightly different forms,
                # so we try a few of them.
                # We used to fall back to just synthetically creating a dummy tensor of ones:
                #
                inputs_dict["inputs_embeds"] = x
                if not self.is_encoder_decoder:
                    inputs_dict["inputs_embeds"] = self._get_embeds(wte, input_ids)
                else:
                    inputs_dict["encoder_inputs_embeds"] = self._get_embeds(wte, encoder_input_ids)
                    inputs_dict["decoder_inputs_embeds"] = self._get_embeds(wte, decoder_input_ids)

                outputs = model(inputs_dict)

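The new encoder-decoder branch of test_attention_outputs relies on the flat output tuple holding the decoder outputs first and the encoder outputs second, so with only attentions enabled the decoder attentions sit at index (out_len // 2) - 1. A toy sketch of that indexing, using placeholder strings instead of real tensors:

# Toy illustration only: placeholder tuple standing in for encoder-decoder outputs
# with output_attentions=True and output_hidden_states=False.
outputs = ("decoder_hidden_state", "decoder_attentions",
           "encoder_hidden_state", "encoder_attentions")
out_len = len(outputs)
assert out_len % 2 == 0
decoder_attentions = outputs[(out_len // 2) - 1]
assert decoder_attentions == "decoder_attentions"
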
172 transformers/tests/modeling_tf_t5_test.py Normal file
@ -0,0 +1,172 @@
# coding=utf-8
# Copyright 2018 Google T5 Authors and HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import unittest
import shutil
import sys

from .modeling_tf_common_test import (TFCommonTestCases, ids_tensor)
from .configuration_common_test import ConfigTester
from .utils import require_tf, slow

from transformers import T5Config, is_tf_available

if is_tf_available():
    import tensorflow as tf
    from transformers.modeling_tf_t5 import (TFT5Model, TFT5WithLMHeadModel,
                                             TF_T5_PRETRAINED_MODEL_ARCHIVE_MAP)


@require_tf
class TFT5ModelTest(TFCommonTestCases.TFCommonModelTester):

    is_encoder_decoder = True
    all_model_classes = (TFT5Model, TFT5WithLMHeadModel) if is_tf_available() else ()

    class TFT5ModelTester(object):

        def __init__(self,
                     parent,
                     batch_size=13,
                     seq_length=7,
                     is_training=True,
                     use_input_mask=True,
                     use_labels=True,
                     vocab_size=99,
                     n_positions=14,
                     hidden_size=32,
                     num_hidden_layers=5,
                     num_attention_heads=4,
                     d_ff=37,
                     relative_attention_num_buckets=8,
                     dropout_rate=0.1,
                     initializer_factor=0.002,
                     scope=None,
                     ):
            self.parent = parent
            self.batch_size = batch_size
            self.seq_length = seq_length
            self.is_training = is_training
            self.use_input_mask = use_input_mask
            self.use_labels = use_labels
            self.vocab_size = vocab_size
            self.n_positions = n_positions
            self.hidden_size = hidden_size
            self.num_hidden_layers = num_hidden_layers
            self.num_attention_heads = num_attention_heads
            self.d_ff = d_ff
            self.relative_attention_num_buckets = relative_attention_num_buckets
            self.dropout_rate = dropout_rate
            self.initializer_factor = initializer_factor
            self.scope = scope

        def prepare_config_and_inputs(self):
            input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)

            input_mask = None
            if self.use_input_mask:
                input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)

            token_labels = None
            if self.use_labels:
                token_labels = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)

            config = T5Config(
                vocab_size_or_config_json_file=self.vocab_size,
                n_positions=self.n_positions,
                d_model=self.hidden_size,
                d_ff=self.d_ff,
                d_kv=self.hidden_size // self.num_attention_heads,
                num_layers=self.num_hidden_layers,
                num_heads=self.num_attention_heads,
                relative_attention_num_buckets=self.relative_attention_num_buckets,
                dropout_rate=self.dropout_rate,
                initializer_factor=self.initializer_factor)

            return (config, input_ids, input_mask, token_labels)

        def create_and_check_t5_model(self, config, input_ids, input_mask, token_labels):
            model = TFT5Model(config=config)
            inputs = {'encoder_input_ids': input_ids,
                      'decoder_input_ids': input_ids,
                      'decoder_attention_mask': input_mask}
            encoder_output, decoder_output = model(inputs)

            encoder_output, decoder_output = model(input_ids,
                                                   decoder_attention_mask=input_mask,
                                                   encoder_input_ids=input_ids)

            result = {
                "encoder_output": encoder_output.numpy(),
                "decoder_output": decoder_output.numpy(),
            }
            self.parent.assertListEqual(
                list(result["encoder_output"].shape),
                [self.batch_size, self.seq_length, self.hidden_size])
            self.parent.assertListEqual(
                list(result["decoder_output"].shape),
                [self.batch_size, self.seq_length, self.hidden_size])


        def create_and_check_t5_with_lm_head(self, config, input_ids, input_mask, token_labels):
            model = TFT5WithLMHeadModel(config=config)
            inputs = {'encoder_input_ids': input_ids,
                      'decoder_input_ids': input_ids,
                      'decoder_attention_mask': input_mask}
            prediction_scores, decoder_output = model(inputs)
            result = {
                "prediction_scores": prediction_scores.numpy(),
            }
            self.parent.assertListEqual(
                list(result["prediction_scores"].shape),
                [self.batch_size, self.seq_length, self.vocab_size])


        def prepare_config_and_inputs_for_common(self):
            config_and_inputs = self.prepare_config_and_inputs()
            (config, input_ids, input_mask, token_labels) = config_and_inputs
            inputs_dict = {'encoder_input_ids': input_ids,
                           'decoder_input_ids': input_ids,
                           'decoder_attention_mask': input_mask}
            return config, inputs_dict

    def setUp(self):
        self.model_tester = TFT5ModelTest.TFT5ModelTester(self)
        self.config_tester = ConfigTester(self, config_class=T5Config, d_model=37)

    def test_config(self):
        self.config_tester.run_common_tests()

    def test_t5_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_t5_model(*config_and_inputs)

    def test_with_lm_head(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_t5_with_lm_head(*config_and_inputs)

    @slow
    def test_model_from_pretrained(self):
        cache_dir = "/tmp/transformers_test/"
        for model_name in ['t5-small']:
            model = TFT5Model.from_pretrained(model_name, cache_dir=cache_dir)
            shutil.rmtree(cache_dir)
            self.assertIsNotNone(model)


if __name__ == "__main__":
    unittest.main()
@ -67,7 +67,7 @@ class TFTransfoXLModelTest(TFCommonTestCases.TFCommonModelTester):
            self.batch_size = batch_size
            self.seq_length = seq_length
            self.mem_len = mem_len
            self.key_len = seq_length + mem_len
            self.key_length = seq_length + mem_len
            self.clamp_len = clamp_len
            self.is_training = is_training
            self.use_labels = use_labels
@ -66,7 +66,7 @@ class TransfoXLModelTest(CommonTestCases.CommonModelTester):
            self.batch_size = batch_size
            self.seq_length = seq_length
            self.mem_len = mem_len
            self.key_len = seq_length + mem_len
            self.key_length = seq_length + mem_len
            self.clamp_len = clamp_len
            self.is_training = is_training
            self.use_labels = use_labels
@ -139,5 +139,6 @@ class BertTokenizationTest(CommonTestCases.CommonTokenizerTester):
        assert encoded_sentence == [101] + text + [102]
        assert encoded_pair == [101] + text + [102] + text_2 + [102]


if __name__ == '__main__':
    unittest.main()
@ -67,6 +67,5 @@ class GPT2TokenizationTest(CommonTestCases.CommonTokenizerTester):
        self.assertListEqual(
            tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)


if __name__ == '__main__':
    unittest.main()
77 transformers/tests/tokenization_t5_test.py Normal file
@ -0,0 +1,77 @@
# coding=utf-8
# Copyright 2018 Google T5 Authors and HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import, division, print_function, unicode_literals

import os
import unittest

from transformers.tokenization_t5 import (T5Tokenizer)
from transformers.tokenization_xlnet import SPIECE_UNDERLINE

from .tokenization_tests_commons import CommonTestCases

SAMPLE_VOCAB = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                            'fixtures/test_sentencepiece.model')

class T5TokenizationTest(CommonTestCases.CommonTokenizerTester):

    tokenizer_class = T5Tokenizer

    def setUp(self):
        super(T5TokenizationTest, self).setUp()

        # We have a SentencePiece fixture for testing
        tokenizer = T5Tokenizer(SAMPLE_VOCAB)
        tokenizer.save_pretrained(self.tmpdirname)

    def get_tokenizer(self, **kwargs):
        return T5Tokenizer.from_pretrained(self.tmpdirname, **kwargs)

    def get_input_output_texts(self):
        input_text = u"This is a test"
        output_text = u"This is a test"
        return input_text, output_text

    def test_full_tokenizer(self):
        tokenizer = T5Tokenizer(SAMPLE_VOCAB)

        tokens = tokenizer.tokenize(u'This is a test')
        self.assertListEqual(tokens, [u'▁This', u'▁is', u'▁a', u'▁t', u'est'])

        self.assertListEqual(
            tokenizer.convert_tokens_to_ids(tokens), [285, 46, 10, 170, 382])

        tokens = tokenizer.tokenize(u"I was born in 92000, and this is falsé.")
        self.assertListEqual(tokens, [SPIECE_UNDERLINE + u'I', SPIECE_UNDERLINE + u'was', SPIECE_UNDERLINE + u'b',
                                      u'or', u'n', SPIECE_UNDERLINE + u'in', SPIECE_UNDERLINE + u'',
                                      u'9', u'2', u'0', u'0', u'0', u',', SPIECE_UNDERLINE + u'and', SPIECE_UNDERLINE + u'this',
                                      SPIECE_UNDERLINE + u'is', SPIECE_UNDERLINE + u'f', u'al', u's', u'é', u'.'])
        ids = tokenizer.convert_tokens_to_ids(tokens)
        self.assertListEqual(
            ids, [8, 21, 84, 55, 24, 19, 7, 0,
                  602, 347, 347, 347, 3, 12, 66,
                  46, 72, 80, 6, 0, 4])

        back_tokens = tokenizer.convert_ids_to_tokens(ids)
        self.assertListEqual(back_tokens, [SPIECE_UNDERLINE + u'I', SPIECE_UNDERLINE + u'was', SPIECE_UNDERLINE + u'b',
                                           u'or', u'n', SPIECE_UNDERLINE + u'in',
                                           SPIECE_UNDERLINE + u'', u'<unk>', u'2', u'0', u'0', u'0', u',',
                                           SPIECE_UNDERLINE + u'and', SPIECE_UNDERLINE + u'this',
                                           SPIECE_UNDERLINE + u'is', SPIECE_UNDERLINE + u'f', u'al', u's',
                                           u'<unk>', u'.'])


if __name__ == '__main__':
    unittest.main()
@ -232,6 +232,15 @@ class CommonTestCases:
            self.assertNotEqual(len(tokens_2), 0)
            self.assertIsInstance(text_2, (str, unicode))

        def test_encode_decode_with_spaces(self):
            tokenizer = self.get_tokenizer()

            new_toks = ['[ABC]', '[DEF]', 'GHI IHG']
            tokenizer.add_tokens(new_toks)
            input = "[ABC] [DEF] [ABC] GHI IHG [DEF]"
            encoded = tokenizer.encode(input, add_special_tokens=False)
            decoded = tokenizer.decode(encoded)
            self.assertEqual(decoded, input)

        def test_pretrained_model_lists(self):
            weights_list = list(self.tokenizer_class.max_model_input_sizes.keys())
@ -30,6 +30,7 @@ from .tokenization_roberta import RobertaTokenizer
from .tokenization_distilbert import DistilBertTokenizer
from .tokenization_camembert import CamembertTokenizer
from .tokenization_albert import AlbertTokenizer
from .tokenization_t5 import T5Tokenizer

logger = logging.getLogger(__name__)

@ -44,6 +45,7 @@ class AutoTokenizer(object):

        The tokenizer class to instantiate is selected as the first pattern matching
        in the `pretrained_model_name_or_path` string (in the following order):
            - contains `t5`: T5Tokenizer (T5 model)
            - contains `distilbert`: DistilBertTokenizer (DistilBert model)
            - contains `albert`: AlbertTokenizer (ALBERT model)
            - contains `camembert`: CamembertTokenizer (CamemBERT model)
@ -69,6 +71,7 @@ class AutoTokenizer(object):

        The tokenizer class to instantiate is selected as the first pattern matching
        in the `pretrained_model_name_or_path` string (in the following order):
            - contains `t5`: T5Tokenizer (T5 model)
            - contains `distilbert`: DistilBertTokenizer (DistilBert model)
            - contains `albert`: AlbertTokenizer (ALBERT model)
            - contains `camembert`: CamembertTokenizer (CamemBERT model)
@ -119,7 +122,9 @@ class AutoTokenizer(object):
            tokenizer = AutoTokenizer.from_pretrained('./test/bert_saved_model/')

        """
        if 'distilbert' in pretrained_model_name_or_path:
        if 't5' in pretrained_model_name_or_path:
            return T5Tokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
        elif 'distilbert' in pretrained_model_name_or_path:
            return DistilBertTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
        elif 'albert' in pretrained_model_name_or_path:
            return AlbertTokenizer.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
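With this dispatch in place, any model name or path containing `t5` resolves to the new T5Tokenizer through AutoTokenizer. A minimal usage sketch (assuming the `t5-small` shortcut vocabulary is reachable and SentencePiece is installed):

from transformers import AutoTokenizer

# A name containing 't5' now hits the first branch and returns a T5Tokenizer.
tokenizer = AutoTokenizer.from_pretrained('t5-small')
ids = tokenizer.encode("Translate English to German: Hello")

# Names without 't5' keep their previous resolution, e.g. 'distilbert-base-uncased'
# still yields a DistilBertTokenizer.
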
176 transformers/tokenization_t5.py Normal file
@ -0,0 +1,176 @@
# coding=utf-8
# Copyright 2018 T5 Authors and HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Tokenization class for model T5."""

from __future__ import absolute_import, division, print_function, unicode_literals

import logging
import os
import re
import six
from shutil import copyfile

from .tokenization_utils import PreTrainedTokenizer

logger = logging.getLogger(__name__)

SPIECE_UNDERLINE = u'▁'

####################################################
# Mapping from the keyword arguments names of Tokenizer `__init__`
# to file names for serializing Tokenizer instances
####################################################
VOCAB_FILES_NAMES = {'vocab_file': 'spiece.model'}

####################################################
# Mapping from the keyword arguments names of Tokenizer `__init__`
# to pretrained vocabulary URL for all the model shortcut names.
####################################################
PRETRAINED_VOCAB_FILES_MAP = {
    'vocab_file':
    {
        't5-small': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model",
        't5-base': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model",
        't5-large': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model",
        't5-3b': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model",
        't5-11b': "https://s3.amazonaws.com/models.huggingface.co/bert/t5-spiece.model",
    }
}

####################################################
# Mapping from model shortcut names to max length of inputs
####################################################
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    't5-small': 512,
    't5-base': 512,
    't5-large': 512,
    't5-3b': 512,
    't5-11b': 512,
}

class T5Tokenizer(PreTrainedTokenizer):
    """
        SentencePiece based tokenizer. Peculiarities:

            - requires `SentencePiece <https://github.com/google/sentencepiece>`_
            - `extra_ids` add a number of extra ids added to the end of the vocabulary for use as sentinels.
              These tokens are accessible as `<extra_id_{%d}>` where `{%d}` is a number between 0 and extra_ids-1.
              Extra tokens are indexed from the end of the vocabulary up to the beginning (<extra_id_0> is the last token in the vocabulary)
              (like in T5 preprocessing
              see: https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117)
    """
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

    def __init__(self, vocab_file, eos_token="</s>", unk_token="<unk>",
                 pad_token="<pad>", extra_ids=100, additional_special_tokens=None, **kwargs):
        # Add extra_ids to the special token list
        if extra_ids > 0:
            if additional_special_tokens is None:
                additional_special_tokens = []
            additional_special_tokens.extend([u"<extra_id_{}>".format(i) for i in range(extra_ids)])

        super(T5Tokenizer, self).__init__(eos_token=eos_token, unk_token=unk_token,
                                          pad_token=pad_token, additional_special_tokens=additional_special_tokens,
                                          **kwargs)

        try:
            import sentencepiece as spm
        except ImportError:
            logger.warning("You need to install SentencePiece to use T5Tokenizer:"
                           "https://github.com/google/sentencepiece"
                           "pip install sentencepiece")

        self.vocab_file = vocab_file
        self._extra_ids = extra_ids

        self.sp_model = spm.SentencePieceProcessor()
        self.sp_model.Load(vocab_file)

    @property
    def vocab_size(self):
        return self.sp_model.get_piece_size() + self._extra_ids

    def __getstate__(self):
        state = self.__dict__.copy()
        state["sp_model"] = None
        return state

    def __setstate__(self, d):
        self.__dict__ = d
        try:
            import sentencepiece as spm
        except ImportError:
            logger.warning("You need to install SentencePiece to use XLNetTokenizer: https://github.com/google/sentencepiece"
                           "pip install sentencepiece")
        self.sp_model = spm.SentencePieceProcessor()
        self.sp_model.Load(self.vocab_file)

    def _tokenize(self, text, return_unicode=True, sample=False):
        """ Take as input a string and return a list of strings (tokens) for words/sub-words
        """
        if not sample:
            pieces = self.sp_model.EncodeAsPieces(text)
        else:
            pieces = self.sp_model.SampleEncodeAsPieces(text, 64, 0.1)

        # convert back to unicode for py2
        if six.PY2 and return_unicode:
            ret_pieces = []
            for piece in pieces:
                if isinstance(piece, str):
                    piece = piece.decode('utf-8')
                ret_pieces.append(piece)
            pieces = ret_pieces

        return pieces

    def _convert_token_to_id(self, token):
        """ Converts a token (str/unicode) in an id using the vocab. """
        if token.startswith(u"<extra_id_"):
            l = re.match(r'<extra_id_(\d+)>', token)
            num = int(l.group(1))
            return self.vocab_size - num - 1
        return self.sp_model.piece_to_id(token)

    def _convert_id_to_token(self, index, return_unicode=True):
        """Converts an index (integer) in a token (string/unicode) using the vocab."""
        if index < self.sp_model.get_piece_size():
            token = self.sp_model.IdToPiece(index)
        else:
            token = u"<extra_id_{}>".format(self.vocab_size - 1 - index)
        if six.PY2 and return_unicode and isinstance(token, str):
            token = token.decode('utf-8')
        return token

    def convert_tokens_to_string(self, tokens):
        """ Converts a sequence of tokens (string) in a single string. """
        out_string = self.sp_model.decode_pieces(tokens)
        return out_string

    def save_vocabulary(self, save_directory):
        """ Save the sentencepiece vocabulary (copy original file) and special tokens file
            to a directory.
        """
        if not os.path.isdir(save_directory):
            logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
            return
        out_vocab_file = os.path.join(save_directory, VOCAB_FILES_NAMES['vocab_file'])

        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
            copyfile(self.vocab_file, out_vocab_file)

        return (out_vocab_file,)
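To illustrate the sentinel indexing described in the docstring above, a small sketch of the arithmetic used by _convert_token_to_id and _convert_id_to_token (the piece count of 32,000 is an assumption for illustration; the default extra_ids is 100):

# Assumed numbers: a SentencePiece model with 32,000 pieces and extra_ids=100.
sp_pieces = 32000
extra_ids = 100
vocab_size = sp_pieces + extra_ids          # matches T5Tokenizer.vocab_size

def extra_id_to_index(n):
    # mirrors _convert_token_to_id for "<extra_id_{n}>"
    return vocab_size - n - 1

def index_to_extra_id(index):
    # mirrors _convert_id_to_token for indices past the SentencePiece pieces
    return u"<extra_id_{}>".format(vocab_size - 1 - index)

assert extra_id_to_index(0) == 32099        # <extra_id_0> is the last id in the vocabulary
assert extra_id_to_index(99) == 32000       # <extra_id_99> comes right after the pieces
assert index_to_extra_id(32099) == "<extra_id_0>"
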
@ -637,9 +637,11 @@ class PreTrainedTokenizer(object):
                text: The sequence to be encoded.
                **kwargs: passed to the child `self.tokenize()` method
        """
        all_special_tokens = self.all_special_tokens

        def lowercase_text(t):
            # convert non-special tokens to lowercase
            escaped_special_toks = [re.escape(s_tok) for s_tok in self.all_special_tokens]
            escaped_special_toks = [re.escape(s_tok) for s_tok in all_special_tokens]
            pattern = r'(^' + r'|'.join(escaped_special_toks) + r')|' + \
                r'(.+?)'
            return re.sub(
@ -680,17 +682,17 @@ class PreTrainedTokenizer(object):
                tokenized_text = []
                for sub_text in text_list:
                    if sub_text not in self.added_tokens_encoder \
                            and sub_text not in self.all_special_tokens:
                            and sub_text not in all_special_tokens:
                        tokenized_text += split_on_token(tok, sub_text)
                    else:
                        tokenized_text += [sub_text]
                text_list = tokenized_text

            return list(itertools.chain.from_iterable((self._tokenize(token, **kwargs) if token not \
                in self.added_tokens_encoder and token not in self.all_special_tokens \
                in self.added_tokens_encoder and token not in all_special_tokens \
                else [token] for token in tokenized_text)))

        added_tokens = list(self.added_tokens_encoder.keys()) + self.all_special_tokens
        added_tokens = list(self.added_tokens_encoder.keys()) + all_special_tokens
        tokenized_text = split_on_tokens(added_tokens, text)
        return tokenized_text

@ -1178,12 +1180,12 @@ class PreTrainedTokenizer(object):
                if current_sub_text:
                    sub_texts.append(self.convert_tokens_to_string(current_sub_text))
                    current_sub_text = []
                sub_texts.append(" " + token)
                sub_texts.append(token)
            else:
                current_sub_text.append(token)
        if current_sub_text:
            sub_texts.append(self.convert_tokens_to_string(current_sub_text))
        text = ''.join(sub_texts)
        text = ' '.join(sub_texts)

        if clean_up_tokenization_spaces:
            clean_text = self.clean_up_tokenization(text)
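The last hunk changes how decoded fragments around added/special tokens are glued back together: each fragment is now appended as-is and the fragments are joined with a single space, rather than prefixing each special token with a space and concatenating. A minimal sketch of the new joining step in isolation (the token values below are made-up placeholders, not real vocabulary entries):

# Hypothetical decoded fragments: ordinary text around one added token.
sub_texts = ["I was born in", "<special>", "Paris"]

# Old behaviour (roughly): sub_texts.append(" " + token) followed by ''.join(sub_texts).
# New behaviour: append the token unchanged and join everything with a space.
text = ' '.join(sub_texts)
assert text == "I was born in <special> Paris"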