transformers/tests/utils
Nicolas Patry 1670be4bde
Adding Llama FastTokenizer support. (#22264)
* Adding Llama FastTokenizer support.

- Requires a `tokenizers` version that includes https://github.com/huggingface/tokenizers/pull/1183
- Only supports `byte_fallback` for Llama; raises otherwise (safety net).
- Many open questions remain around special tokens.
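
To illustrate the `byte_fallback` restriction, here is a minimal sketch. It assumes the SentencePiece convention that a piece missing from the vocabulary is emitted as one `<0xXX>` token per UTF-8 byte. The helper names `byte_fallback_pieces` and `check_byte_fallback_supported`, the `model_type` argument and the toy vocabulary are hypothetical, for illustration only; this is not the converter's actual code.

```python
from __future__ import annotations


def byte_fallback_pieces(piece: str, vocab: set[str]) -> list[str]:
    """Return the piece itself if it is in the vocabulary,
    otherwise one <0xXX> token per UTF-8 byte of the piece."""
    if piece in vocab:
        return [piece]
    return [f"<0x{byte:02X}>" for byte in piece.encode("utf-8")]


def check_byte_fallback_supported(model_type: str, byte_fallback: bool) -> None:
    """Hypothetical safety net: reject byte_fallback for anything but llama."""
    if byte_fallback and model_type != "llama":
        raise ValueError(
            f"byte_fallback is only supported for llama, got {model_type!r}"
        )


# Toy vocabulary, assumed for the example only.
toy_vocab = {"▁This", "▁is"}

print(byte_fallback_pieces("▁is", toy_vocab))  # known piece, returned unchanged
print(byte_fallback_pieces("是", toy_vocab))  # unknown piece, split into UTF-8 bytes
check_byte_fallback_supported("llama", byte_fallback=True)  # passes silently
```

Raising instead of silently ignoring the option keeps a model that relies on byte fallback from being converted into a tokenizer that would produce different ids.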

How to test:

```python
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from transformers import AutoTokenizer
from tokenizers import Tokenizer

# Slow (SentencePiece-based) reference tokenizer.
tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b")

# Set to True to reuse a previously saved conversion instead of converting again.
LOAD_FROM_FILE = False

if LOAD_FROM_FILE:
    new_tokenizer = Tokenizer.from_file("tok.json")
else:
    new_tokenizer = convert_slow_tokenizer(tokenizer)
    new_tokenizer.save("tok.json")

strings = [
    "This is a test",
    "生活的真谛是",
    "生活的真谛是[MASK]。",
    # XXX: This one is problematic because of special tokens
    # "<s> Something something",
]

for string in strings:
    # Both tokenizers must produce identical ids.
    encoded = tokenizer(string)["input_ids"]
    encoded2 = new_tokenizer.encode(string).ids

    assert encoded == encoded2, f"{encoded} != {encoded2}"

    # Decoded text must match, ignoring surrounding whitespace on the slow side.
    decoded = tokenizer.decode(encoded)
    decoded2 = new_tokenizer.decode(encoded2)

    assert decoded.strip() == decoded2, f"{decoded!r} != {decoded2!r}"
```

The converter + some test script.

The test script.

Tmp save.

Adding Fast tokenizer + tests.

Adding the tokenization tests.

Correct combination.

Small fix.

Fixing tests.

Fixing with latest update.

Rebased.

Fix copies + normalized added tokens + copies.

Adding doc.

TMP.

Doc + split files.

Doc.

Versions + try import.

Fix Camembert + warnings -> Error.

Fix by ArthurZucker.

Not a decorator.

* Fixing comments.

* Adding more to docstring.

* Doc rewriting.
2023-04-06 09:53:03 +02:00
__init__.py [Test refactor 1/5] Per-folder tests reorganization (#15725) 2022-02-23 15:46:28 -05:00
test_activations_tf.py TF: Add sigmoid activation function (#16819) 2022-04-19 16:13:08 +01:00
test_activations.py Add the GeLU activation from pytorch with the tanh approximation (#21345) 2023-02-02 09:33:04 -05:00
test_add_new_model_like.py Rename test_feature_extraction files (#21140) 2023-01-17 14:04:07 +00:00
test_cli.py CLI: reenable pt_to_tf test (#18108) 2022-07-12 13:38:05 +01:00
test_convert_slow_tokenizer.py Adding Llama FastTokenizer support. (#22264) 2023-04-06 09:53:03 +02:00
test_doc_samples.py [Test refactor 1/5] Per-folder tests reorganization (#15725) 2022-02-23 15:46:28 -05:00
test_file_utils.py Inheritance-based framework detection (#21784) 2023-02-27 15:31:55 +00:00
test_generic.py Refactor conversion function (#19799) 2022-10-24 13:48:40 -04:00
test_hf_argparser.py Update quality tooling for formatting (#21480) 2023-02-06 18:10:56 -05:00
test_hub_utils.py Update quality tooling for formatting (#21480) 2023-02-06 18:10:56 -05:00
test_image_processing_utils.py Add Image Processors (#19796) 2022-11-02 11:57:36 +00:00
test_image_utils.py Accept batched tensor of images as input to image processor (#21144) 2023-01-26 10:15:26 +00:00
test_logging.py Fix flaky test for log level (#21776) 2023-02-28 16:24:14 -05:00
test_model_card.py Black preview (#17217) 2022-05-12 16:25:55 -04:00
test_model_output.py Fix ModelOutput instantiation when there is only one tuple (#20416) 2022-11-23 15:09:21 -05:00
test_modeling_tf_core.py Apply ruff flake8-comprehensions (#21694) 2023-02-22 09:14:54 +01:00
test_offline.py Update quality tooling for formatting (#21480) 2023-02-06 18:10:56 -05:00
test_skip_decorators.py Update quality tooling for formatting (#21480) 2023-02-06 18:10:56 -05:00
test_versions_utils.py Update quality tooling for formatting (#21480) 2023-02-06 18:10:56 -05:00
tiny_model_summary.json Use real tokenizers if tiny version(s) creation has issue(s) (#22428) 2023-03-29 16:16:23 +02:00