Mirror of https://github.com/huggingface/transformers.git (synced 2025-07-30 17:52:35 +06:00)
Update notebook table and transformers intro notebook (#9136)
parent fb650df859
commit 4d48973523
@@ -55,11 +55,11 @@ Coming soon!
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | ✅ | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | -
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | [](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | CNN/Daily Mail | ✅ | - | - | -
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | [](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | - | -
@@ -73,12 +73,14 @@
"\n",
"The transformers library allows you to benefits from large, pretrained language models without requiring a huge and costly computational\n",
"infrastructure. Most of the State-of-the-Art models are provided directly by their author and made available in the library \n",
"in PyTorch and TensorFlow in a transparent and interchangeable way. "
"in PyTorch and TensorFlow in a transparent and interchangeable way. \n",
"\n",
"If you're executing this notebook in Colab, you will need to install the transformers library. You can do so with this command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {
"id": "KnT3Jn6fSXai",
"pycharm": {
@@ -89,13 +91,12 @@
},
"outputs": [],
"source": [
"!pip install transformers\n",
"!pip install --upgrade tensorflow"
"# !pip install transformers"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -111,13 +112,11 @@
{
"data": {
"text/plain": [
"<torch.autograd.grad_mode.set_grad_enabled at 0x7f9c03e5b3c8>"
"<torch.autograd.grad_mode.set_grad_enabled at 0x7ff0cc2a2c50>"
]
},
"execution_count": 2,
"metadata": {
"tags": []
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
@@ -130,7 +129,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {
"id": "1xMDTHQXSXai",
"pycharm": {
@@ -159,12 +158,50 @@
"source": [
"With only the above two lines of code, you're ready to use a BERT pre-trained model. \n",
"The tokenizers will allow us to map a raw textual input to a sequence of integers representing our textual input\n",
"in a way the model can manipulate."
"in a way the model can manipulate. Since we will be using a PyTorch model, we ask the tokenizer to return to us PyTorch tensors."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input_ids:\n",
"\ttensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"token_type_ids:\n",
"\ttensor([[0, 0, 0, 0, 0, 0, 0]])\n",
"attention_mask:\n",
"\ttensor([[1, 1, 1, 1, 1, 1, 1]])\n"
]
}
],
"source": [
"tokens_pt = tokenizer(\"This is an input example\", return_tensors=\"pt\")\n",
"for key, value in tokens_pt.items():\n",
" print(\"{}:\\n\\t{}\".format(key, value))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tokenizer automatically converted our input to all the inputs expected by the model. It generated some additional tensors on top of the IDs: \n",
"\n",
"- token_type_ids: This tensor will map every tokens to their corresponding segment (see below).\n",
"- attention_mask: This tensor is used to \"mask\" padded values in a batch of sequence with different lengths (see below).\n",
"\n",
"You can check our [glossary](https://huggingface.co/transformers/glossary.html) for more information about each of those keys. \n",
"\n",
"We can just feed this directly into our model:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -181,32 +218,12 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Tokens: ['This', 'is', 'an', 'input', 'example']\n",
"Tokens id: [1188, 1110, 1126, 7758, 1859]\n",
"Tokens PyTorch: tensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"Token wise output: torch.Size([1, 7, 768]), Pooled output: torch.Size([1, 768])\n"
]
}
],
"source": [
"# Tokens comes from a process that splits the input into sub-entities with interesting linguistic properties. \n",
"tokens = tokenizer.tokenize(\"This is an input example\")\n",
"print(\"Tokens: {}\".format(tokens))\n",
"\n",
"# This is not sufficient for the model, as it requires integers as input, \n",
"# not a problem, let's convert tokens to ids.\n",
"tokens_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"print(\"Tokens id: {}\".format(tokens_ids))\n",
"\n",
"# Add the required special tokens\n",
"tokens_ids = tokenizer.build_inputs_with_special_tokens(tokens_ids)\n",
"\n",
"# We need to convert to a Deep Learning framework specific format, let's use PyTorch for now.\n",
"tokens_pt = torch.tensor([tokens_ids])\n",
"print(\"Tokens PyTorch: {}\".format(tokens_pt))\n",
"\n",
"# Now we're ready to go through BERT with out input\n",
"outputs = model(tokens_pt)\n",
"outputs = model(**tokens_pt)\n",
"last_hidden_state = outputs.last_hidden_state\n",
"pooler_output = outputs.pooler_output\n",
"\n",
@@ -233,85 +250,9 @@
"require a fine-grained token-level. This is the case for Sentiment-Analysis of the sequence or Information Retrieval."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DCxuDWH2SXaj",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The code you saw in the previous section introduced all the steps required to do simple model invocation.\n",
"For more day-to-day usage, transformers provides you higher-level methods which will makes your NLP journey easier.\n",
"Let's improve our previous example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "sgcNCdXUSXaj",
"outputId": "af2fb928-7c17-475b-cf81-89cfc4b1d9e5",
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input_ids:\n",
"\ttensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]])\n",
"token_type_ids:\n",
"\ttensor([[0, 0, 0, 0, 0, 0, 0]])\n",
"attention_mask:\n",
"\ttensor([[1, 1, 1, 1, 1, 1, 1]])\n",
"Difference with previous code: (0.0, 0.0)\n"
]
}
],
"source": [
"# tokens = tokenizer.tokenize(\"This is an input example\")\n",
"# tokens_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"# tokens_pt = torch.tensor([tokens_ids])\n",
"\n",
"# This code can be factored into one-line as follow\n",
"tokens_pt2 = tokenizer(\"This is an input example\", return_tensors=\"pt\")\n",
"\n",
"for key, value in tokens_pt2.items():\n",
" print(\"{}:\\n\\t{}\".format(key, value))\n",
"\n",
"outputs2 = model(**tokens_pt2)\n",
"last_hidden_state2 = outputs2.last_hidden_state\n",
"pooler_output2 = outputs2.pooler_output\n",
"\n",
"print(\"Difference with previous code: ({}, {})\".format((last_hidden_state2 - last_hidden_state).sum(), (pooler_output2 - pooler_output).sum()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gC-7xGYPSXal"
},
"source": [
"As you can see above, calling the tokenizer provides a convenient way to generate all the required parameters\n",
"that will go through the model. \n",
"\n",
"Moreover, you might have noticed it generated some additional tensors: \n",
"\n",
"- token_type_ids: This tensor will map every tokens to their corresponding segment (see below).\n",
"- attention_mask: This tensor is used to \"mask\" padded values in a batch of sequence with different lengths (see below)."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -357,7 +298,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -414,14 +355,47 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 10,
"metadata": {
"id": "Kubwm-wJSXan",
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3b971be3639d4fedb02778fb5c6898a0",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=526681800.0, style=ProgressStyle(descri…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']\n",
"- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n"
]
}
],
"source": [
"from transformers import TFBertModel, BertModel\n",
"\n",
@@ -432,7 +406,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -448,8 +422,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"last_hidden_state differences: 1.0094e-05\n",
"pooler_output differences: 7.2969e-07\n"
"last_hidden_state differences: 1.2933e-05\n",
"pooler_output differences: 2.9691e-06\n"
]
}
],
@@ -482,7 +456,7 @@
"\n",
"For example, Google released a few months ago **T5** an Encoder/Decoder architecture based on Transformer and available in `transformers` with no more than 11 billions parameters. Microsoft also recently entered the game with **Turing-NLG** using 17 billions parameters. This kind of model requires tens of gigabytes to store the weights and a tremendous compute infrastructure to run such models which makes it impracticable for the common man !\n",
"\n",
"\n",
"\n",
"\n",
"With the goal of making Transformer-based NLP accessible to everyone we @huggingface developed models that take advantage of a training process called **Distillation** which allows us to drastically reduce the resources needed to run such models with almost zero drop in performances.\n",
"\n",
@@ -673,7 +647,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.7.9"
},
"pycharm": {
"stem_cell": {
@@ -31,7 +31,11 @@ Pull Request so it can be included under the Community notebooks.
| [Getting Started Tokenizers](https://github.com/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
| [Getting Started Transformers](https://github.com/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) | How to easily start using transformers | [](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) |
| [How to use Pipelines](https://github.com/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) |
| [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
| [How to fine-tune a model on text classification](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb) | Show how to preprocess the data and fine-tune a pretrained model on any GLUE task. | [](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb)|
| [How to fine-tune a model on language modeling](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) | Show how to preprocess the data and fine-tune a pretrained model on a causal or masked LM task. | [](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb)|
| [How to fine-tune a model on token classification](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb) | Show how to preprocess the data and fine-tune a pretrained model on a token classification task (NER, PoS). | [](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb)|
| [How to fine-tune a model on question answering](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb) | Show how to preprocess the data and fine-tune a pretrained model on SQUAD. | [](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb)|
| [How to train a language model from scratch](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
| [How to generate text](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)| How to use different decoding methods for language generation with transformers | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)|
| [How to export model to ONNX](https://github.com/huggingface/transformers/blob/master/notebooks/04-onnx-export.ipynb) | Highlight how to export and run inference workloads through ONNX |
| [How to use Benchmarks](https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb) | How to benchmark models with transformers | [](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb)|
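
Note on the notebook change in this commit: the new markdown cell describes `token_type_ids` and `attention_mask`, and the code cell switches from `model(tokens_pt)` to `model(**tokens_pt)`. The sketch below illustrates both ideas in plain Python, without `transformers` or `torch`. It is an illustration only: `pad_batch` and `describe` are hypothetical helpers, not library functions, and the padding id of 0 is an assumption about the `bert-base-cased` vocabulary. The token ids are copied from the notebook output shown in the diff.

```python
from typing import Dict, List

# Token ids for "This is an input example", copied from the notebook output.
# 101 and 102 are the special tokens the tokenizer adds around the sentence.
EXAMPLE_IDS = [101, 1188, 1110, 1126, 7758, 1859, 102]


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: pad every sequence to the longest one and build the masks.

    pad_id=0 is an assumption, not something stated in the commit.
    """
    max_len = max(len(seq) for seq in sequences)
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "token_type_ids": [],
        "attention_mask": [],
    }
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_id] * n_pad)
        # Single-segment inputs: every position belongs to segment 0.
        batch["token_type_ids"].append([0] * max_len)
        # 1 marks a real token, 0 marks padding that should be ignored.
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
    return batch


def describe(input_ids, token_type_ids, attention_mask):
    """Stand-in for a model call: it takes the three inputs as keyword arguments."""
    return {
        "batch_size": len(input_ids),
        "real_tokens": [sum(row) for row in attention_mask],
    }


# Second sequence is a shorter input ("This is" plus the special tokens).
batch = pad_batch([EXAMPLE_IDS, [101, 1188, 1110, 102]])

# The ** operator unpacks the dict into keyword arguments, which is why the
# notebook calls model(**tokens_pt) once the tokenizer returns a dict-like object.
summary = describe(**batch)
print(batch["attention_mask"])
print(summary)
```

Passing the dictionary positionally, as in `describe(batch)`, would bind the whole dictionary to `input_ids` and leave the other two arguments missing, which is the reason unpacking is needed once the tokenizer output carries several named tensors.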