Commit Graph

7 Commits

Author SHA1 Message Date
Nicolas Patry
900daec24e
Fixing NER pipeline for list inputs. (#10184)
Fixes #10168
2021-02-15 06:22:45 -05:00
Sylvain Gugger
00aa9dbca2
Copyright (#8970)
* Add copyright everywhere missing

* Style
2020-12-07 18:36:34 -05:00
Nicolas Patry
d8fc26e919
NerPipeline (TokenClassification) now outputs offsets of words (#8781)
* NerPipeline (TokenClassification) now outputs offsets of words

- It happens that the offsets are missing, it forces the user to pattern
match the "word" from his input, which is not always feasible.
For instance if a sentence contains the same word twice, then there
is no way to know which is which.
- This PR proposes to fix that by outputting 2 new keys for this
pipelines outputs, "start" and "end", which correspond to the string
offsets of the word. That means that we should always have the
invariant:

```python
input[entity["start"]: entity["end"]] == entity["entity_group"]
                                    # or entity["entity"] if not grouped
```

* Fixing doc style
2020-11-30 14:05:08 -05:00
Thomas Wolf
f4e04cd2c6
[breaking|pipelines|tokenizers] Adding slow-fast tokenizers equivalence tests pipelines - Removing sentencepiece as a required dependency (#8073)
* Fixing roberta for slow-fast tests

* WIP getting equivalence on pipelines

* slow-to-fast equivalence - working on question-answering pipeline

* optional FAISS tests

* Pipeline Q&A

* Move pipeline tests to their own test job again

* update tokenizer to add sequence id methods

* update to tokenizers 0.9.4

* set sentencepiecce as optional

* clean up squad

* clean up pipelines to use sequence_ids

* style/quality

* wording

* Switch to use_fast = True by default

* update tests for use_fast at True by default

* fix rag tokenizer test

* removing protobuf from required dependencies

* fix NER test for use_fast = True by default

* fixing example tests (Q&A examples use slow tokenizers for now)

* protobuf in main deps extras["sentencepiece"] and example deps

* fix protobug install test

* try to fix seq2seq by switching to slow tokenizers for now

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-15 22:50:59 +01:00
Lysandre Debut
850afb422d
Patch token classification pipeline (#8364)
* Patch token classification pipeline

* Some added tests for TokenClassificationArgumentHandler (#8366)

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2020-11-10 07:29:34 -05:00
Ceyda Cinarel
29b536a73a
[WIP] Ner pipeline grouped_entities fixes (#5970)
* Bug fix: NER pipeline shouldn't group separate entities of same type

* style fix

* [Bug Fix] Shouldn't group entities that are both 'B' even if they are same type
	(B-type1 B-type1) != (B-type1 I-type1)
[Bug Fix] add an option `ignore_subwords` to ignore subsequent ##wordpieces in predictions. Because some models train on only the first token of a word and not on the subsequent wordpieces (BERT NER default). So it makes sense doing the same thing at inference time.
	The simplest fix is to just group the subwords with the first wordpiece.
	[TODO] how to handle ignored scores? just set them to 0 and calculate zero invariant mean ?
	[TODO] handle different wordpiece_prefix ## ? possible approaches:
		get it from tokenizer? but currently most tokenizers dont have a wordpiece_prefix property?
		have an _is_subword(token)
[Feature add] added option to `skip_special_tokens`. Cause It was harder to remove them after grouping.
[Additional Changes] remove B/I prefix on returned grouped_entities
[Feature Request/TODO] Return indexes?
[Bug TODO]  can't use fast tokenizer with grouped_entities ('BertTokenizerFast' object has no attribute 'convert_tokens_to_string')

* use offset_mapping to fix [UNK] token problem

* ignore score for subwords

* modify ner_pipeline test

* modify ner_pipeline test

* modify ner_pipeline test

* ner_pipeline change ignore_subwords default to true

* add ner_pipeline ignore_subword=False test case

* fix offset_mapping index

* fix style again duh

* change is_subword and convert_tokens_to_string logic

* merge tests with new test structure

* change test names

* remove old tests

* ner tests for fast tokenizer

* fast tokenizers have convert_tokens_to_string

* Fix the incorrect merge

Co-authored-by: Ceyda Cinarel <snu-ceyda@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-11-03 17:21:04 -05:00
Thomas Wolf
3a40cdf58d
[tests|tokenizers] Refactoring pipelines test backbone - Small tokenizers improvements - General tests speedups (#7970)
* WIP refactoring pipeline tests - switching to fast tokenizers

* fix dialog pipeline and fill-mask

* refactoring pipeline tests backbone

* make large tests slow

* fix tests (tf Bart inactive for now)

* fix doc...

* clean up for merge

* fixing tests - remove bart from summarization until there is TF

* fix quality and RAG

* Add new translation pipeline tests - fix JAX tests

* only slow for dialog

* Fixing the missing TF-BART imports in modeling_tf_auto

* spin out pipeline tests in separate CI job

* adding pipeline test to CI YAML

* add slow pipeline tests

* speed up tf and pt join test to avoid redoing all the standalone pt and tf tests

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* Update src/transformers/pipelines.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/pipelines.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add require_torch and require_tf in is_pt_tf_cross_test

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-23 15:58:19 +02:00