Commit Graph

7 Commits

Author SHA1 Message Date
Nicolas Patry
db514a75d0
Fixing backward compatiblity for non prefixed tokens (B-, I-). (#13493) 2021-09-09 13:36:09 -04:00
Nicolas Patry
45a8eb66bb
Moving token-classification pipeline to new testing. (#13286)
* Moving `token-classification` pipeline to new testing.

* Fix tests.
2021-08-27 11:24:56 +02:00
Nicolas Patry
a3bd763732
Better heuristic for token-classification pipeline. (#12611)
* Better heuristic for token-classification pipeline.

Relooking at the problem makes thing actually much simpler,
when we look at ids from a tokenizer, we have no way in **general**
to recover if some substring is part of a word or not.

However, within the pipeline, with offsets we still have access to the
original string, so we can simply look if previous character (if it
exists) of a token, is actually a space. This will obviously be wrong
for tokenizers that contain spaces within tokens, tokenizers where
offsets include spaces too (Don't think there are a lot).

This heuristic hopefully is fully bc and still can handle non-word based
tokenizers.

* Updating test with real values.

* We still need the older "correct" heuristic to prevent fusing
punctuation.

* Adding a real warning when important.
2021-07-26 16:21:26 +02:00
Alex Hedges
e7f33e8cb3
Pass model_kwargs when loading a model in pipeline() (#12449)
* Pass model_kwargs when loading a model in pipeline

* Add test for model_kwargs parameter of pipeline()

* Rewrite test to not download model

* Fix failing style checks
2021-07-09 09:24:55 -04:00
Vyom Pathak
fd3b12e8c3
Fixed: Better names for nlp variables in pipelines' tests and docs. (#11752)
* Fixed: Better names for nlp variables in pipelines' tests and docs.

* Fixed: Better variable names
2021-05-18 09:47:28 -04:00
Nicolas Patry
b88e0e016d
[TokenClassification] Label realignment for subword aggregation (#11680)
* [TokenClassification] Label realignment for subword aggregation

Tentative to replace https://github.com/huggingface/transformers/pull/11622/files

- Added `AggregationStrategy`
- `ignore_subwords` and `grouped_entities` arguments are now fused
  into `aggregation_strategy`. It makes more sense anyway because
  `ignore_subwords=True` with `grouped_entities=False` did not have a
  meaning anyway.
- Added 2 new ways to aggregate which are MAX, and AVERAGE
- AVERAGE requires a bit more information than the others, for now this
case is slightly specific, we should keep that in mind for future
changes.
- Testing has been modified to reflect new argument, and to check the
correct deprecation and the new aggregation_strategy.
- Put the testing argument and testing results for aggregation_strategy,
close together, so that readers can understand what is supposed to
happen.
- `aggregate` is now only tested on a small model as it does not mean
anything to test it globally for all models.
- Previous tests are unchanged in desired output.
- Added a new test case that showcases better the difference between the
  FIRST, MAX and AVERAGE strategies.

* Wrong framework.

* Addressing three issues.

1- Tags might not follow B-, I- convention, so any tag should work now
(assumed as B-TAG)
2- Fixed an issue with average that leads to a substantial code change.
3- The testing suite was not checking for the "index" key for "none"
strategy. This is now fixed.

The issue is that "O" could not be chosen by AVERAGE strategy because
those tokens were filtered out beforehand, so their relative scores were
not counted in the average. Now filtering on
ignore_labels will happen at the very end of the pipeline fixing
that issue.
It's a bit hard to make sure this stays like that because we do
not have a end-to-end test for that behavior

* Formatting.

* Adding formatting to code + cleaner handling of B-, I- tags.

Co-authored-by: Francesco Rubbo <rubbo.francesco@gmail.com>
Co-authored-by: elk-cloner <rezakakhki.rk@gmail.com>

* Typo.

Co-authored-by: Francesco Rubbo <rubbo.francesco@gmail.com>
Co-authored-by: elk-cloner <rezakakhki.rk@gmail.com>
2021-05-18 09:53:20 +02:00
Nicolas Patry
c3fcba3219
Adding pipeline task aliases. (#11247)
* Adding task aliases and adding `token-classification` and
`text-classification` tasks.

* Cleaning docstring.
2021-04-15 09:51:24 +02:00