transformers/notebooks/03-pipelines.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## How can I leverage State-of-the-Art Natural Language Models with only one line of code ?"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Newly introduced in transformers v2.3.0, **pipelines** provides a high-level, easy to use,\n",
"API for doing inference over a variety of downstream-tasks, including: \n",
"\n",
"- Sentence Classification (Sentiment Analysis): Indicate if the overall sentence is either positive or negative. _(Binary Classification task or Logitic Regression task)_\n",
"- Token Classification (Named Entity Recognition, Part-of-Speech tagging): For each sub-entities _(**tokens**)_ in the input, assign them a label _(Classification task)_.\n",
"- Question-Answering: Provided a tuple (question, context) the model should find the span of text in **content** answering the **question**.\n",
"- Mask-Filling: Suggests possible word(s) to fill the masked input with respect to the provided **context**.\n",
"- Feature Extraction: Maps the input to a higher, multi-dimensional space learned from the data.\n",
"\n",
"Pipelines encapsulate the overall process of every NLP process:\n",
" \n",
" 1. Tokenization: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).\n",
" 2. Inference: Maps every tokens into a more meaningful representation. \n",
" 3. Decoding: Use the above representation to generate and/or extract the final output for the underlying task.\n",
"\n",
"The overall API is exposed to the end-user through the `pipeline()` method with the following \n",
"structure:\n",
"\n",
"```python\n",
"from transformers import pipeline\n",
"\n",
"# Using default model and tokenizer for the task\n",
"pipeline(\"<task-name>\")\n",
"\n",
"# Using a user-specified model\n",
"pipeline(\"<task-name>\", model=\"<model_name>\")\n",
"\n",
"# Using custom model/tokenizer as str\n",
"pipeline('<task-name>', model='<model name>', tokenizer='<tokenizer_name>')\n",
"```"
]
},
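{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make those three stages concrete, the cell below is a minimal sketch of what a sentiment-analysis pipeline does under the hood, assuming a PyTorch backend; the checkpoint name is only illustrative and is not necessarily the one `pipeline()` downloads by default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch of the tokenize -> infer -> decode stages (PyTorch assumed).\n",
"import torch\n",
"from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
"\n",
"checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'  # illustrative checkpoint\n",
"tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
"model = AutoModelForSequenceClassification.from_pretrained(checkpoint)\n",
"\n",
"# 1. Tokenization: split the raw string into token ids\n",
"input_ids = tokenizer.encode('Such a nice weather outside !', return_tensors='pt')\n",
"\n",
"# 2. Inference: map the token ids to task-specific scores (logits)\n",
"with torch.no_grad():\n",
"    logits = model(input_ids)[0]\n",
"\n",
"# 3. Decoding: turn the scores into a human-readable prediction\n",
"probs = torch.softmax(logits, dim=-1)\n",
"label_id = int(probs.argmax(dim=-1))\n",
"print(model.config.id2label[label_id], float(probs[0, label_id]))"
]
},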
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code \n"
}
},
"outputs": [
{
"ename": "SyntaxError",
"evalue": "from __future__ imports must occur at the beginning of the file (<ipython-input-29-c3a037bd4c55>, line 5)",
"output_type": "error",
"traceback": [
"\u001b[0;36m File \u001b[0;32m\"<ipython-input-29-c3a037bd4c55>\"\u001b[0;36m, line \u001b[0;32m5\u001b[0m\n\u001b[0;31m from transformers import pipeline\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m from __future__ imports must occur at the beginning of the file\n"
]
}
],
"source": [
"import numpy as np\n",
"from __future__ import print_function\n",
"from ipywidgets import interact, interactive, fixed, interact_manual\n",
"import ipywidgets as widgets\n",
"from transformers import pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 1. Sentence Classification - Sentiment Analysis"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "6aeccfdf51994149bdd1f3d3533e380f",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/plain": [
"[{'label': 'POSITIVE', 'score': 0.800251},\n",
" {'label': 'NEGATIVE', 'score': 1.2489903}]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nlp_sentence_classif = pipeline('sentiment-analysis')\n",
"nlp_sentence_classif(['Such a nice weather outside !', 'This movie was kind of boring.'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## 2. Token Classification - Named Entity Recognition"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b5549c53c27346a899af553c977f00bc",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/plain": [
"[{'word': 'Hu', 'score': 0.9970937967300415, 'entity': 'I-ORG'},\n",
" {'word': '##gging', 'score': 0.9345750212669373, 'entity': 'I-ORG'},\n",
" {'word': 'Face', 'score': 0.9787060022354126, 'entity': 'I-ORG'},\n",
" {'word': 'French', 'score': 0.9981995820999146, 'entity': 'I-MISC'},\n",
" {'word': 'New', 'score': 0.9983047246932983, 'entity': 'I-LOC'},\n",
" {'word': '-', 'score': 0.8913455009460449, 'entity': 'I-LOC'},\n",
" {'word': 'York', 'score': 0.9979523420333862, 'entity': 'I-LOC'}]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nlp_token_class = pipeline('ner')\n",
"nlp_token_class('Hugging Face is a French company based in New-York.')"
]
},
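{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output above is at the word-piece level ('Hu', '##gging', ...). The helper below is a hedged sketch, not part of the pipelines API, showing one way to stitch consecutive pieces of the same entity back together."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def group_word_pieces(tokens):\n",
"    # Merge consecutive '##' word pieces that carry the same entity label (illustrative helper).\n",
"    grouped = []\n",
"    for tok in tokens:\n",
"        if tok['word'].startswith('##') and grouped and grouped[-1]['entity'] == tok['entity']:\n",
"            grouped[-1]['word'] += tok['word'][2:]\n",
"        else:\n",
"            grouped.append({'word': tok['word'], 'entity': tok['entity']})\n",
"    return grouped\n",
"\n",
"group_word_pieces(nlp_token_class('Hugging Face is a French company based in New-York.'))"
]
},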
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Question Answering"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "6e56a8edcef44ec2ae838711ecd22d3a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 53.05it/s]\n",
"add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 2673.23it/s]\n"
]
},
{
"data": {
"text/plain": [
"{'score': 0.9632966867654424, 'start': 42, 'end': 50, 'answer': 'New-York.'}"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nlp_qa = pipeline('question-answering')\n",
"nlp_qa(context='Hugging Face is a French company based in New-York.', question='Where is based Hugging Face ?')"
]
},
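{
"cell_type": "markdown",
"metadata": {},
"source": [
"The returned dictionary also carries a `score`, so you can discard answers the model is not confident about; the cell below is a small sketch using an arbitrary threshold."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"qa_output = nlp_qa(context='Hugging Face is a French company based in New-York.', question='Where is based Hugging Face ?')\n",
"if qa_output['score'] > 0.5:  # illustrative confidence threshold\n",
"    print(qa_output['answer'])\n",
"else:\n",
"    print('The model is not confident enough about this answer.')"
]
},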
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Text Generation - Mask Filling"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "1930695ea2d24ca98c6d7c13842d377f",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/plain": [
"[{'sequence': '<s> Hugging Face is a French company based in Paris</s>',\n",
" 'score': 0.25288480520248413,\n",
" 'token': 2201},\n",
" {'sequence': '<s> Hugging Face is a French company based in Lyon</s>',\n",
" 'score': 0.07639515399932861,\n",
" 'token': 12790},\n",
" {'sequence': '<s> Hugging Face is a French company based in Brussels</s>',\n",
" 'score': 0.055500105023384094,\n",
" 'token': 6497},\n",
" {'sequence': '<s> Hugging Face is a French company based in Geneva</s>',\n",
" 'score': 0.04264815151691437,\n",
" 'token': 11559},\n",
" {'sequence': '<s> Hugging Face is a French company based in France</s>',\n",
" 'score': 0.03868963569402695,\n",
" 'token': 1470}]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nlp_fill = pipeline('fill-mask')\n",
"nlp_fill('Hugging Face is a French company based in <mask>')"
]
},
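{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `<mask>` literal above matches the default (RoBERTa-style) mask token. Other checkpoints use a different one (e.g. `[MASK]` for BERT), so reading it from the tokenizer, as sketched below, avoids hard-coding it; that the pipeline exposes its tokenizer as `.tokenizer` is an assumption of this sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch: build the masked sentence from the tokenizer's own mask token\n",
"# (assumes the pipeline exposes its tokenizer as `nlp_fill.tokenizer`).\n",
"mask_token = nlp_fill.tokenizer.mask_token\n",
"nlp_fill('Hugging Face is a French company based in ' + mask_token)"
]
},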
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Projection - Features Extraction "
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "92fa4d67290f49a3943dc0abd7529892",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/plain": [
"(1, 12, 768)"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"nlp_features = pipeline('feature-extraction')\n",
"output = nlp_features('Hugging Face is a French company based in Paris')\n",
"np.array(output).shape # (Samples, Tokens, Vector Size)\n"
]
},
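{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common way to turn these per-token vectors into a single sentence embedding is to average over the token axis; the cell below is a minimal sketch of that mean pooling, not the only (or best) strategy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mean-pool the (Samples, Tokens, Vector Size) features over the token axis\n",
"features = np.array(nlp_features('Hugging Face is a French company based in Paris'))\n",
"sentence_embedding = features.mean(axis=1)\n",
"sentence_embedding.shape  # (1, 768): one 768-dimensional vector per input sentence\n"
]
},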
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Alright ! Now you have a nice picture of what is possible through transformers' pipelines, and there is more\n",
"to come in future releases. \n",
"\n",
"In the meantime, you can try the different pipelines with your own inputs"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% code\n"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "261ae9fa30e84d1d84a3b0d9682ac477",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Dropdown(description='Task:', index=1, options=('sentiment-analysis', 'ner', 'fill_mask'), value='ner')"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ddc51b71c6eb40e5ab60998664e6a857",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Text(value='', description='Your input:', placeholder='Enter something')"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'word': 'Paris', 'score': 0.9991844296455383, 'entity': 'I-LOC'}]\n",
"[{'sequence': '<s> I\\'m from Paris.\"</s>', 'score': 0.224044069647789, 'token': 72}, {'sequence': \"<s> I'm from Paris.)</s>\", 'score': 0.16959427297115326, 'token': 1592}, {'sequence': \"<s> I'm from Paris.]</s>\", 'score': 0.10994981974363327, 'token': 21838}, {'sequence': '<s> I\\'m from Paris!\"</s>', 'score': 0.0706234946846962, 'token': 2901}, {'sequence': \"<s> I'm from Paris.</s>\", 'score': 0.0698278620839119, 'token': 4}]\n",
"[{'sequence': \"<s> I'm from Paris and London</s>\", 'score': 0.12238534539937973, 'token': 928}, {'sequence': \"<s> I'm from Paris and Brussels</s>\", 'score': 0.07107886672019958, 'token': 6497}, {'sequence': \"<s> I'm from Paris and Belgium</s>\", 'score': 0.040912602096796036, 'token': 7320}, {'sequence': \"<s> I'm from Paris and Berlin</s>\", 'score': 0.039884064346551895, 'token': 5459}, {'sequence': \"<s> I'm from Paris and Melbourne</s>\", 'score': 0.038133684545755386, 'token': 5703}]\n",
"[{'sequence': '<s> I like go to sleep</s>', 'score': 0.08942786604166031, 'token': 3581}, {'sequence': '<s> I like go to bed</s>', 'score': 0.07789064943790436, 'token': 3267}, {'sequence': '<s> I like go to concerts</s>', 'score': 0.06356740742921829, 'token': 12858}, {'sequence': '<s> I like go to school</s>', 'score': 0.03660670667886734, 'token': 334}, {'sequence': '<s> I like go to dinner</s>', 'score': 0.032155368477106094, 'token': 3630}]\n"
]
}
],
"source": [
"task = widgets.Dropdown(\n",
" options=['sentiment-analysis', 'ner', 'fill_mask'],\n",
" value='ner',\n",
" description='Task:',\n",
" disabled=False\n",
")\n",
"\n",
"input = widgets.Text(\n",
" value='',\n",
" placeholder='Enter something',\n",
" description='Your input:',\n",
" disabled=False\n",
")\n",
"\n",
"def forward(_):\n",
" if len(input.value) > 0: \n",
" if task.value == 'ner':\n",
" output = nlp_token_class(input.value)\n",
" elif task.value == 'sentiment-analysis':\n",
" output = nlp_sentence_classif(input.value)\n",
" else:\n",
" if input.value.find('<mask>') == -1:\n",
" output = nlp_fill(input.value + ' <mask>')\n",
" else:\n",
" output = nlp_fill(input.value) \n",
" print(output)\n",
"\n",
"input.on_submit(forward)\n",
"display(task, input)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%% Question Answering\n"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5ae68677bd8a41f990355aa43840d3f8",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Textarea(value='Einstein is famous for the general theory of relativity', description='Context:', placeholder=…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "14bcfd9a2c5a47e6b1383989ab7632c8",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Text(value='Why is Einstein famous for ?', description='Question:', placeholder='Enter something')"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 168.83it/s]\n",
"add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 1919.59it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'score': 0.40340670623875496, 'start': 27, 'end': 54, 'answer': 'general theory of relativity'}\n"
]
}
],
"source": [
"context = widgets.Textarea(\n",
" value='Einstein is famous for the general theory of relativity',\n",
" placeholder='Enter something',\n",
" description='Context:',\n",
" disabled=False\n",
")\n",
"\n",
"query = widgets.Text(\n",
" value='Why is Einstein famous for ?',\n",
" placeholder='Enter something',\n",
" description='Question:',\n",
" disabled=False\n",
")\n",
"\n",
"def forward(_):\n",
" if len(context.value) > 0 and len(query.value) > 0: \n",
" output = nlp_qa(question=query.value, context=context.value) \n",
" print(output)\n",
"\n",
"query.on_submit(forward)\n",
"display(context, query)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}