Update notebook table and transformers intro notebook (#9136)

This commit is contained in:
Sylvain Gugger 2020-12-16 10:24:31 -05:00 committed by GitHub
parent fb650df859
commit 4d48973523
3 changed files with 172 additions and 194 deletions

View File

@@ -55,11 +55,11 @@ Coming soon!
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | -
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | CNN/Daily Mail | ✅ | - | - | -
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | - | -

View File

@@ -73,12 +73,14 @@
"\n",
"The transformers library allows you to benefits from large, pretrained language models without requiring a huge and costly computational\n",
"infrastructure. Most of the State-of-the-Art models are provided directly by their author and made available in the library \n",
"in PyTorch and TensorFlow in a transparent and interchangeable way. "
"in PyTorch and TensorFlow in a transparent and interchangeable way. \n",
"\n",
"If you're executing this notebook in Colab, you will need to install the transformers library. You can do so with this command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {
"id": "KnT3Jn6fSXai",
"pycharm": {
@@ -89,13 +91,12 @@
},
"outputs": [],
"source": [
"!pip install transformers\n",
"!pip install --upgrade tensorflow"
"# !pip install transformers"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -111,13 +112,11 @@
{
"data": {
"text/plain": [
"<torch.autograd.grad_mode.set_grad_enabled at 0x7f9c03e5b3c8>"
"<torch.autograd.grad_mode.set_grad_enabled at 0x7ff0cc2a2c50>"
]
},
"execution_count": 2,
"metadata": {
"tags": []
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
@@ -130,7 +129,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {
"id": "1xMDTHQXSXai",
"pycharm": {
@@ -159,12 +158,50 @@
"source": [
"With only the above two lines of code, you're ready to use a BERT pre-trained model. \n",
"The tokenizers will allow us to map a raw textual input to a sequence of integers representing our textual input\n",
"in a way the model can manipulate."
"in a way the model can manipulate. Since we will be using a PyTorch model, we ask the tokenizer to return to us PyTorch tensors."
]
},
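The "two lines of code" the cell above refers to live in a setup cell this diff does not show. As a hedged sketch of what that elided cell plausibly contains (assuming the `bert-base-cased` checkpoint the notebook later reloads for its PyTorch/TensorFlow comparison):

```python
from transformers import BertModel, BertTokenizer

# Hedged reconstruction of the elided setup cell; "bert-base-cased" is the
# checkpoint this notebook later reloads for its PyTorch/TensorFlow comparison.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")
```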
{
"cell_type": "code",
"execution_count": null,
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input_ids:\n",
"\ttensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"token_type_ids:\n",
"\ttensor([[0, 0, 0, 0, 0, 0, 0]])\n",
"attention_mask:\n",
"\ttensor([[1, 1, 1, 1, 1, 1, 1]])\n"
]
}
],
"source": [
"tokens_pt = tokenizer(\"This is an input example\", return_tensors=\"pt\")\n",
"for key, value in tokens_pt.items():\n",
" print(\"{}:\\n\\t{}\".format(key, value))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tokenizer automatically converted our input to all the inputs expected by the model. It generated some additional tensors on top of the IDs: \n",
"\n",
"- token_type_ids: This tensor will map every tokens to their corresponding segment (see below).\n",
"- attention_mask: This tensor is used to \"mask\" padded values in a batch of sequence with different lengths (see below).\n",
"\n",
"You can check our [glossary](https://huggingface.co/transformers/glossary.html) for more information about each of those keys. \n",
"\n",
"We can just feed this directly into our model:"
]
},
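Before the notebook's own demonstration below, here is a small hypothetical illustration of the `attention_mask` (reusing the `tokenizer` from the sketch above): when two sentences of different lengths are tokenized as one padded batch, the mask carries 0 at every padded position.

```python
# Hypothetical illustration: tokenize two sentences of different lengths in
# one batch, padding the shorter one up to the length of the longer one.
batch = tokenizer(
    ["This is an input example", "Short"],
    padding=True,
    return_tensors="pt",
)
# Padded positions of "Short" get 0 in the attention mask, telling the model
# to ignore them.
print(batch["attention_mask"])
```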
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -181,32 +218,12 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Tokens: ['This', 'is', 'an', 'input', 'example']\n",
"Tokens id: [1188, 1110, 1126, 7758, 1859]\n",
"Tokens PyTorch: tensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"Token wise output: torch.Size([1, 7, 768]), Pooled output: torch.Size([1, 768])\n"
]
}
],
"source": [
"# Tokens comes from a process that splits the input into sub-entities with interesting linguistic properties. \n",
"tokens = tokenizer.tokenize(\"This is an input example\")\n",
"print(\"Tokens: {}\".format(tokens))\n",
"\n",
"# This is not sufficient for the model, as it requires integers as input, \n",
"# not a problem, let's convert tokens to ids.\n",
"tokens_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"print(\"Tokens id: {}\".format(tokens_ids))\n",
"\n",
"# Add the required special tokens\n",
"tokens_ids = tokenizer.build_inputs_with_special_tokens(tokens_ids)\n",
"\n",
"# We need to convert to a Deep Learning framework specific format, let's use PyTorch for now.\n",
"tokens_pt = torch.tensor([tokens_ids])\n",
"print(\"Tokens PyTorch: {}\".format(tokens_pt))\n",
"\n",
"# Now we're ready to go through BERT with out input\n",
"outputs = model(tokens_pt)\n",
"outputs = model(**tokens_pt)\n",
"last_hidden_state = outputs.last_hidden_state\n",
"pooler_output = outputs.pooler_output\n",
"\n",
@@ -233,85 +250,9 @@
"require a fine-grained token-level. This is the case for Sentiment-Analysis of the sequence or Information Retrieval."
]
},
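As a toy sketch of a sequence-level use of the pooled output (not part of the notebook; it reuses the `tokenizer` and `model` above), two sentences can be compared through the cosine similarity of their pooled vectors:

```python
import torch

# Toy sketch: one vector per sequence makes sequence-level comparison cheap.
with torch.no_grad():
    a = model(**tokenizer("This is an input example", return_tensors="pt")).pooler_output
    b = model(**tokenizer("This is another input example", return_tensors="pt")).pooler_output

print(torch.nn.functional.cosine_similarity(a, b))
```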
{
"cell_type": "markdown",
"metadata": {
"id": "DCxuDWH2SXaj",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The code you saw in the previous section introduced all the steps required to do simple model invocation.\n",
"For more day-to-day usage, transformers provides you higher-level methods which will makes your NLP journey easier.\n",
"Let's improve our previous example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "sgcNCdXUSXaj",
"outputId": "af2fb928-7c17-475b-cf81-89cfc4b1d9e5",
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input_ids:\n",
"\ttensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"token_type_ids:\n",
"\ttensor([[0, 0, 0, 0, 0, 0, 0]])\n",
"attention_mask:\n",
"\ttensor([[1, 1, 1, 1, 1, 1, 1]])\n",
"Difference with previous code: (0.0, 0.0)\n"
]
}
],
"source": [
"# tokens = tokenizer.tokenize(\"This is an input example\")\n",
"# tokens_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"# tokens_pt = torch.tensor([tokens_ids])\n",
"\n",
"# This code can be factored into one-line as follow\n",
"tokens_pt2 = tokenizer(\"This is an input example\", return_tensors=\"pt\")\n",
"\n",
"for key, value in tokens_pt2.items():\n",
" print(\"{}:\\n\\t{}\".format(key, value))\n",
"\n",
"outputs2 = model(**tokens_pt2)\n",
"last_hidden_state2 = outputs2.last_hidden_state\n",
"pooler_output2 = outputs2.pooler_output\n",
"\n",
"print(\"Difference with previous code: ({}, {})\".format((last_hidden_state2 - last_hidden_state).sum(), (pooler_output2 - pooler_output).sum()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gC-7xGYPSXal"
},
"source": [
"As you can see above, calling the tokenizer provides a convenient way to generate all the required parameters\n",
"that will go through the model. \n",
"\n",
"Moreover, you might have noticed it generated some additional tensors: \n",
"\n",
"- token_type_ids: This tensor will map every tokens to their corresponding segment (see below).\n",
"- attention_mask: This tensor is used to \"mask\" padded values in a batch of sequence with different lengths (see below)."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -357,7 +298,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -414,14 +355,47 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 10,
"metadata": {
"id": "Kubwm-wJSXan",
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3b971be3639d4fedb02778fb5c6898a0",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=526681800.0, style=ProgressStyle(descri…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']\n",
"- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n"
]
}
],
"source": [
"from transformers import TFBertModel, BertModel\n",
"\n",
@@ -432,7 +406,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -448,8 +422,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"last_hidden_state differences: 1.0094e-05\n",
"pooler_output differences: 7.2969e-07\n"
"last_hidden_state differences: 1.2933e-05\n",
"pooler_output differences: 2.9691e-06\n"
]
}
],
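The differences printed above come from a comparison cell this diff truncates. A hedged reconstruction of what such a cell plausibly looks like (the `model_pt`/`model_tf` names and the exact print format are assumptions):

```python
import numpy as np
from transformers import BertModel, TFBertModel

# Hedged sketch: load the same checkpoint in both frameworks.
model_pt = BertModel.from_pretrained("bert-base-cased")
model_tf = TFBertModel.from_pretrained("bert-base-cased")

input_pt = tokenizer("This is an input example", return_tensors="pt")
input_tf = tokenizer("This is an input example", return_tensors="tf")
output_pt, output_tf = model_pt(**input_pt), model_tf(**input_tf)

# Differences around 1e-5 are framework-level numerical noise, not a real
# divergence between the two implementations.
diff = np.abs(output_pt.last_hidden_state.detach().numpy()
              - output_tf.last_hidden_state.numpy()).max()
print("last_hidden_state differences: {:.4e}".format(diff))
```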
@@ -482,7 +456,7 @@
"\n",
"For example, Google released a few months ago **T5** an Encoder/Decoder architecture based on Transformer and available in `transformers` with no more than 11 billions parameters. Microsoft also recently entered the game with **Turing-NLG** using 17 billions parameters. This kind of model requires tens of gigabytes to store the weights and a tremendous compute infrastructure to run such models which makes it impracticable for the common man !\n",
"\n",
"![transformers-parameters](https://lh5.googleusercontent.com/NRdXzEcgZV3ooykjIaTm9uvbr9QnSjDQHHAHb2kk_Lm9lIF0AhS-PJdXGzpcBDztax922XAp386hyNmWZYsZC1lUN2r4Ip5p9v-PHO19-jevRGg4iQFxgv5Olq4DWaqSA_8ptep7)\n",
"![transformers-parameters](https://github.com/huggingface/notebooks/blob/master/examples/images/model_parameters.png?raw=true)\n",
"\n",
"With the goal of making Transformer-based NLP accessible to everyone we @huggingface developed models that take advantage of a training process called **Distillation** which allows us to drastically reduce the resources needed to run such models with almost zero drop in performances.\n",
"\n",
@@ -673,7 +647,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.7.9"
},
"pycharm": {
"stem_cell": {

View File

@@ -31,7 +31,11 @@ Pull Request so it can be included under the Community notebooks.
| [Getting Started Tokenizers](https://github.com/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
| [Getting Started Transformers](https://github.com/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) | How to easily start using transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) |
| [How to use Pipelines](https://github.com/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) |
| [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
| [How to fine-tune a model on text classification](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb) | Show how to preprocess the data and fine-tune a pretrained model on any GLUE task. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb)|
| [How to fine-tune a model on language modeling](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) | Show how to preprocess the data and fine-tune a pretrained model on a causal or masked LM task. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb)|
| [How to fine-tune a model on token classification](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb) | Show how to preprocess the data and fine-tune a pretrained model on a token classification task (NER, PoS). | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb)|
| [How to fine-tune a model on question answering](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb) | Show how to preprocess the data and fine-tune a pretrained model on SQuAD. | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb)|
| [How to train a language model from scratch](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train a Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
| [How to generate text](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)| How to use different decoding methods for language generation with transformers | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)|
| [How to export model to ONNX](https://github.com/huggingface/transformers/blob/master/notebooks/04-onnx-export.ipynb) | Highlight how to export and run inference workloads through ONNX |
| [How to use Benchmarks](https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb) | How to benchmark models with transformers | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb)|