Joao Gante
34d9409427
Llama/GPTNeoX: add RoPE scaling ( #24653 )
...
* add rope_scaling
* tmp commit
* add gptneox
* add tests
* GPTNeoX can now handle long inputs, so the pipeline test was wrong
* Update src/transformers/models/open_llama/configuration_open_llama.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* remove ntk
* remove redundant validation
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
2023-07-13 16:47:30 +01:00
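A minimal sketch of the `rope_scaling` knob this PR adds to the Llama/GPTNeoX configs. The dict keys follow the PR ("type" and "factor", with the NTK variant removed), but the tiny model dimensions below are illustrative only:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Tiny, randomly initialised config just to show the option; "linear" and "dynamic"
# are the scaling strategies kept by this PR, and factor stretches RoPE so the model
# can attend beyond its original pretraining context length.
config = LlamaConfig(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    max_position_embeddings=2048,
    rope_scaling={"type": "linear", "factor": 2.0},  # ~2x the original context
)
model = LlamaForCausalLM(config)
```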
Arthur
b15343de6f
[Patch-t5-tokenizer] Patches the changes on T5 to make sure previous behaviour is still valid for the beginning of words ( #24622 )
...
* patch `_tokenize` function
* more tests
* properly fix
* fixup
* Update src/transformers/models/t5/tokenization_t5.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* fix without ifs
* update
* protect import
* add python processing
* is_first needed
* add doc and update with legacy
* update
* fix T5 SPM converter
* styling
* fix T5 warning
* add is_seqio_available
* remove is_first
* revert some changes
* more tests and update
* update llama test battery
* fixup
* refactor T5 spm common tests
* draft the llama tests
* update
* update test
* nits
* refine
* name nit
* fix t5 tests
* fix T5
* update
* revert convert slow to fast changes that fail lots of tests
* legacy support
* fixup
* nits: is_first not defined
* don't use legacy behaviour for switch transformers
* style
* My attempt to check.
* nits
* fixes
* update
* fixup
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* updates
* fixup
* add legacy warning
* fixup
* warning_once nit
* update t5 documentation test
* update llama tok documentation
* add space to warning
* nits
* nit
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* last nits
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-07-11 15:02:18 +02:00
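Roughly how the `legacy` switch introduced by this patch is exercised on the slow T5 tokenizer; the exact pieces printed depend on the checkpoint, so treat the output as indicative, not authoritative:

```python
from transformers import T5Tokenizer

# legacy=True keeps the old SentencePiece behaviour (and triggers the warning added here);
# legacy=False opts into the fix for pieces that begin a word right after a special token.
tok_legacy = T5Tokenizer.from_pretrained("t5-small", legacy=True)
tok_fixed = T5Tokenizer.from_pretrained("t5-small", legacy=False)

text = "Hello <extra_id_0> world"
print(tok_legacy.tokenize(text))
print(tok_fixed.tokenize(text))  # the two differ on the piece following <extra_id_0>
```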
Yuchao Dai
fb3b22c3b9
LlamaTokenizer should be picklable ( #24681 )
...
* LlamaTokenizer should be picklable
* make fixup
2023-07-06 10:21:27 +01:00
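A short check of what "picklable" buys here, e.g. for multiprocessing data loaders; the checkpoint name is a placeholder taken from the test script further down:

```python
import pickle

from transformers import LlamaTokenizer

# Before this fix, the SentencePiece processor held by the tokenizer broke pickling;
# after it, a round trip through pickle should preserve behaviour.
tokenizer = LlamaTokenizer.from_pretrained("huggingface/llama-7b")  # placeholder checkpoint
restored = pickle.loads(pickle.dumps(tokenizer))
assert tokenizer.encode("hello world") == restored.encode("hello world")
```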
Arthur
6fc0454b2f
[LlamaTokenizerFast] nit update post_processor on the fly ( #23855 )
...
* Update the processor when changing add_eos and add_bos
* fixup
* update
* add a test
* fix failing tests
* fixup
2023-05-30 16:50:41 +02:00
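With this change, toggling `add_bos_token` / `add_eos_token` on `LlamaTokenizerFast` rebuilds the underlying post-processor instead of requiring a reload; a sketch (placeholder checkpoint name):

```python
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("huggingface/llama-7b")  # placeholder checkpoint

ids = tokenizer("hi").input_ids            # by default starts with the BOS token
tokenizer.add_eos_token = True             # updates the post_processor on the fly after this PR
ids_with_eos = tokenizer("hi").input_ids   # now also ends with the EOS token
assert ids_with_eos[-1] == tokenizer.eos_token_id
```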
Nicolas Patry
1670be4bde
Adding Llama FastTokenizer support. ( #22264 )
...
* Adding Llama FastTokenizer support.
- Requires the `tokenizers` version from https://github.com/huggingface/tokenizers/pull/1183
- Only support byte_fallback for llama, raise otherwise (safety net).
- Lots of open questions are about special tokens
How to test:
```python
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from transformers import AutoTokenizer
from tokenizers import Tokenizer

tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b")

if False:
    new_tokenizer = Tokenizer.from_file("tok.json")
else:
    new_tokenizer = convert_slow_tokenizer(tokenizer)
    new_tokenizer.save("tok.json")

strings = [
    "This is a test",
    "生活的真谛是",
    "生活的真谛是[MASK]。",
    # XXX: This one is problematic because of special tokens
    # "<s> Something something",
]

for string in strings:
    encoded = tokenizer(string)["input_ids"]
    encoded2 = new_tokenizer.encode(string).ids
    assert encoded == encoded2, f"{encoded} != {encoded2}"
    decoded = tokenizer.decode(encoded)
    decoded2 = new_tokenizer.decode(encoded2)
    assert decoded.strip() == decoded2, f"{repr(decoded)} != {repr(decoded2)}"
```
The converter + some test script.
The test script.
Tmp save.
Adding Fast tokenizer + tests.
Adding the tokenization tests.
Correct combination.
Small fix.
Fixing tests.
Fixing with latest update.
Rebased.
fix copies + normalized added tokens + copies.
Adding doc.
TMP.
Doc + split files.
Doc.
Versions + try import.
Fix Camembert + warnings -> Error.
Fix by ArthurZucker.
Not a decorator.
* Fixing comments.
* Adding more to docstring.
* Doc rewriting.
2023-04-06 09:53:03 +02:00
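Once the fast tokenizer exists, loading it goes through the usual `use_fast=True` path; the checkpoint name below is a placeholder:

```python
from transformers import AutoTokenizer

# With this PR, AutoTokenizer can return a LlamaTokenizerFast backed by `tokenizers`
# (byte_fallback support is required on the tokenizers side, per the PR notes).
fast_tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b", use_fast=True)
print(type(fast_tokenizer).__name__)  # expected: LlamaTokenizerFast
```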
Arthur
c0f99b4d2e
Fix llama tokenizer ( #22402 )
...
* draft
* update tokenization llama and conversion script
* more updates
* initial commit
* style
* default pad to None
* draft tokenization tests
* update test
* update tokenization tests
* nits
* update
* versioning test
* major fix
* fix more tests
* finish fixing special masks
* last nit
* more nits
* add encode decode tests
* add more
* fix token type ids
* style
2023-04-03 09:07:32 -04:00
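A sketch of the kind of encode/decode and special-token-mask checks the tests above cover; the checkpoint name is a placeholder and the assertions are illustrative:

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("huggingface/llama-7b")  # placeholder checkpoint

out = tokenizer("Hello world")
assert out.input_ids[0] == tokenizer.bos_token_id  # BOS is prepended by default

mask = tokenizer.get_special_tokens_mask(out.input_ids, already_has_special_tokens=True)
assert mask[0] == 1 and not any(mask[1:])  # only the BOS position is flagged as special

# Round-trips the original text once special tokens are skipped.
assert tokenizer.decode(out.input_ids, skip_special_tokens=True) == "Hello world"
```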
Joao Gante
fd3eb3e3cd
Beef up Llama tests ( #22314 )
...
* tmp commit
* beef up llama tests
2023-03-22 15:20:48 +00:00
lewtun
f251441387
Add LlamaForSequenceClassification ( #22209 )
...
* Add LlamaForSequenceClassification
* Update src/transformers/models/llama/modeling_llama.py
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Update src/transformers/models/llama/modeling_llama.py
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Add docstring
* Add test
* Add input embedding getter and setter
* Remove dead code
---------
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-03-17 14:39:26 +01:00
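A minimal sketch of the new classification head with a tiny, randomly initialised config (all dimensions below are illustrative):

```python
import torch

from transformers import LlamaConfig, LlamaForSequenceClassification

# Tiny random config just to exercise the head added in this PR.
config = LlamaConfig(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_labels=3,
    pad_token_id=0,
)
model = LlamaForSequenceClassification(config)

input_ids = torch.tensor([[1, 5, 9, 2]])
logits = model(input_ids).logits
print(logits.shape)  # (batch_size, num_labels) -> torch.Size([1, 3])
```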
Jason Phang
0041be5b3d
LLaMA Implementation ( #21955 )
...
* LLaMA
* sharding and docs
* tweak
* black
* inits
* ruff
* LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP
* init
* no checkpoint
* docs
* ruff
* type_vocab_size
* tokenizer fixes
* tokenizer fixes
* Update tokenization_llama.py
* Update tokenization_llama.py
* Update configuration_llama.py
* Update modeling_llama.py
* tokenizer add_bos by default
* licenses
* remove decoder
* norms and mlp
* rope overhaul
* tweaks
* black
* mention OPT implementation
* off-by-one naming
* typo
* fix
* tokenization fix and slicing bug
* padding config
* cleanup
* black
* update tests
* undo typo
* fix vocab caching logic
* ruff
* docbuilder
* attn fix from BlackSamorez
* initial feedback
* typo
* docs
* llama case
* llama case
* load checkpoint docs
* comment about tokenizer
* tokenizer defaults
* clear past_key_values if use_cache=False
* last tweaks
* last tweaks
* last tweaks
* last tweaks
---------
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
2023-03-16 09:00:53 -04:00
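For reference, the user-facing entry points introduced by this PR look roughly like this; "huggingface/llama-7b" is a placeholder, since the original release required converting Meta's weights with the conversion script and loading from the resulting local directory:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Placeholder checkpoint: real weights must be converted from Meta's release with
# convert_llama_weights_to_hf.py and loaded from the output directory.
tokenizer = LlamaTokenizer.from_pretrained("huggingface/llama-7b")
model = LlamaForCausalLM.from_pretrained("huggingface/llama-7b")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```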