Commit Graph

7 Commits

Author SHA1 Message Date
Sam Shleifer
be1520d3a3
rename prepare_translation_batch -> prepare_seq2seq_batch (#6103) 2020-08-11 15:57:07 -04:00
Sam Shleifer
e0d58ddb65
[fix] Marian tests import (#5442) 2020-07-01 11:42:22 -04:00
Sam Shleifer
43cb03a93d
MarianTokenizer.prepare_translation_batch uses new tokenizer API (#5182) 2020-07-01 10:32:50 -04:00
Sam Shleifer
3d495c61ef
Fix marian tokenizer save pretrained (#5043) 2020-06-16 09:48:19 -04:00
Anthony MOI
36434220fc
[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)
* Use tokenizers pre-tokenized pipeline

* failing pretrokenized test

* Fix is_pretokenized in python

* add pretokenized tests

* style and quality

* better tests for batched pretokenized inputs

* tokenizers clean up - new padding_strategy - split the files

* [HUGE] refactoring tokenizers - padding - truncation - tests

* style and quality

* bump up requied tokenizers version to 0.8.0-rc1

* switched padding/truncation API - simpler better backward compat

* updating tests for custom tokenizers

* style and quality - tests on pad

* fix QA pipeline

* fix backward compatibility for max_length only

* style and quality

* Various cleans up - add verbose

* fix tests

* update docstrings

* Fix tests

* Docs reformatted

* __call__ method documented

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-06-15 17:12:51 -04:00
Sam Shleifer
4ab7424597
[cleanup/marian] pipelines test and new kwarg (#4812) 2020-06-05 18:45:19 -04:00
Sam Shleifer
efbc1c5a9d
[MarianTokenizer] implement save_vocabulary and other common methods (#4389) 2020-05-19 19:45:49 -04:00