Joao Gante
34d9409427
Llama/GPTNeoX: add RoPE scaling ( #24653 )
...
* add rope_scaling
* tmp commit
* add gptneox
* add tests
* GPTNeoX can now handle long inputs, so the pipeline test was wrong
* Update src/transformers/models/open_llama/configuration_open_llama.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* remove ntk
* remove redundant validation
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
2023-07-13 16:47:30 +01:00
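A minimal sketch of the `rope_scaling` knob this PR adds to the Llama/GPTNeoX configs. The dict keys follow the PR ("type" and "factor", with the NTK variant removed), but the tiny model dimensions below are illustrative only:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Tiny, randomly initialised config just to show the option; "linear" and "dynamic"
# are the scaling strategies kept by this PR, and factor stretches RoPE so the model
# can attend beyond its original pretraining context length.
config = LlamaConfig(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    max_position_embeddings=2048,
    rope_scaling={"type": "linear", "factor": 2.0},  # ~2x the original context
)
model = LlamaForCausalLM(config)
```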
Arthur
b15343de6f
[Patch-t5-tokenizer] Patches the changes on T5 to make sure previous behaviour is still valid for the beginning of words ( #24622 )
...
* patch `_tokenize` function
* more tests
* properly fix
* fixup
* Update src/transformers/models/t5/tokenization_t5.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* fix without ifs
* update
* protect import
* add python processing
* is_first needed
* add doc and update with legacy
* update
* fix T5 SPM converter
* styling
* fix T5 warning
* add is_seqio_available
* remove is_first
* revert some changes
* more tests and update
* update llama test battery
* fixup
* refactor T5 spm common tests
* draft the llama tests
* update
* update test
* nits
* refine
* name nit
* fix t5 tests
* fix T5
* update
* revert convert slow to fast changes that fail lots of tests
* legacy support
* fixup
* nits: is_first not defined
* don't use legacy behaviour for switch transformers
* style
* My attempt to check.
* nits
* fixes
* update
* fixup
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* updates
* fixup
* add legacy warning
* fixup
* warning_once nit
* update t5 documentation test
* update llama tok documentation
* add space to warning
* nits
* nit
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* last nits
---------
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-07-11 15:02:18 +02:00
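Roughly how the `legacy` switch introduced by this patch is exercised on the slow T5 tokenizer; the exact pieces printed depend on the checkpoint, so treat the output as indicative, not authoritative:

```python
from transformers import T5Tokenizer

# legacy=True keeps the old SentencePiece behaviour (and triggers the warning added here);
# legacy=False opts into the fix for pieces that begin a word right after a special token.
tok_legacy = T5Tokenizer.from_pretrained("t5-small", legacy=True)
tok_fixed = T5Tokenizer.from_pretrained("t5-small", legacy=False)

text = "Hello <extra_id_0> world"
print(tok_legacy.tokenize(text))
print(tok_fixed.tokenize(text))  # the two differ on the piece following <extra_id_0>
```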
Yuchao Dai
fb3b22c3b9
LlamaTokenizer should be picklable ( #24681 )
...
* LlamaTokenizer should be picklable
* make fixup
2023-07-06 10:21:27 +01:00
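A short check of what "picklable" buys here, e.g. for multiprocessing data loaders; the checkpoint name is a placeholder taken from the test script further down:

```python
import pickle

from transformers import LlamaTokenizer

# Before this fix, the SentencePiece processor held by the tokenizer broke pickling;
# after it, a round trip through pickle should preserve behaviour.
tokenizer = LlamaTokenizer.from_pretrained("huggingface/llama-7b")  # placeholder checkpoint
restored = pickle.loads(pickle.dumps(tokenizer))
assert tokenizer.encode("hello world") == restored.encode("hello world")
```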
Arthur
6fc0454b2f
[LlamaTokenizerFast] nit update post_processor on the fly ( #23855 )
...
* Update the processor when changing add_eos and add_bos
* fixup
* update
* add a test
* fix failing tests
* fixup
2023-05-30 16:50:41 +02:00
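With this change, toggling `add_bos_token` / `add_eos_token` on `LlamaTokenizerFast` rebuilds the underlying post-processor instead of requiring a reload; a sketch (placeholder checkpoint name):

```python
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("huggingface/llama-7b")  # placeholder checkpoint

ids = tokenizer("hi").input_ids            # by default starts with the BOS token
tokenizer.add_eos_token = True             # updates the post_processor on the fly after this PR
ids_with_eos = tokenizer("hi").input_ids   # now also ends with the EOS token
assert ids_with_eos[-1] == tokenizer.eos_token_id
```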
Nicolas Patry
1670be4bde
Adding Llama FastTokenizer support. ( #22264 )
...
* Adding Llama FastTokenizer support.
- Requires the `tokenizers` version from https://github.com/huggingface/tokenizers/pull/1183
- Only support byte_fallback for llama, raise otherwise (safety net).
- Lots of open questions are about special tokens
How to test:
```python
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from transformers import AutoTokenizer
from tokenizers import Tokenizer

tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b")

if False:
    new_tokenizer = Tokenizer.from_file("tok.json")
else:
    new_tokenizer = convert_slow_tokenizer(tokenizer)
    new_tokenizer.save("tok.json")

strings = [
    "This is a test",
    "生活的真谛是",
    "生活的真谛是[MASK]。",
    # XXX: This one is problematic because of special tokens
    # "<s> Something something",
]

for string in strings:
    encoded = tokenizer(string)["input_ids"]
    encoded2 = new_tokenizer.encode(string).ids
    assert encoded == encoded2, f"{encoded} != {encoded2}"
    decoded = tokenizer.decode(encoded)
    decoded2 = new_tokenizer.decode(encoded2)
    assert decoded.strip() == decoded2, f"{repr(decoded)} != {repr(decoded2)}"
```
The converter + some test script.
The test script.
Tmp save.
Adding Fast tokenizer + tests.
Adding the tokenization tests.
Correct combination.
Small fix.
Fixing tests.
Fixing with latest update.
Rebased.
fix copies + normalized added tokens + copies.
Adding doc.
TMP.
Doc + split files.
Doc.
Versions + try import.
Fix Camembert + warnings -> Error.
Fix by ArthurZucker.
Not a decorator.
* Fixing comments.
* Adding more to docstring.
* Doc rewriting.
2023-04-06 09:53:03 +02:00
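Once the fast tokenizer exists, loading it goes through the usual `use_fast=True` path; the checkpoint name below is a placeholder:

```python
from transformers import AutoTokenizer

# With this PR, AutoTokenizer can return a LlamaTokenizerFast backed by `tokenizers`
# (byte_fallback support is required on the tokenizers side, per the PR notes).
fast_tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b", use_fast=True)
print(type(fast_tokenizer).__name__)  # expected: LlamaTokenizerFast
```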
Arthur
c0f99b4d2e
Fix llama tokenizer ( #22402 )
...
* draft
* update tokenization llama and conversion script
* more updates
* initial commit
* style
* default pad to None
* draft tokenization tests
* update test
* update tokenization tests
* nits
* update
* versioning test
* major fix
* fix more tests
* finish fixing special masks
* last nit
* more nits
* add encode decode tests
* add more
* fix token type ids
* style
2023-04-03 09:07:32 -04:00
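A sketch of the kind of encode/decode and special-token-mask checks the tests above cover; the checkpoint name is a placeholder and the assertions are illustrative:

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("huggingface/llama-7b")  # placeholder checkpoint

out = tokenizer("Hello world")
assert out.input_ids[0] == tokenizer.bos_token_id  # BOS is prepended by default

mask = tokenizer.get_special_tokens_mask(out.input_ids, already_has_special_tokens=True)
assert mask[0] == 1 and not any(mask[1:])  # only the BOS position is flagged as special

# Round-trips the original text once special tokens are skipped.
assert tokenizer.decode(out.input_ids, skip_special_tokens=True) == "Hello world"
```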
Joao Gante
fd3eb3e3cd
Beef up Llama tests ( #22314 )
...
* tmp commit
* beef up llama tests
2023-03-22 15:20:48 +00:00
lewtun
f251441387
Add LlamaForSequenceClassification ( #22209 )
...
* Add LlamaForSequenceClassification
* Update src/transformers/models/llama/modeling_llama.py
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Update src/transformers/models/llama/modeling_llama.py
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Add docstring
* Add test
* Add input embedding getter and setter
* Remove dead code
---------
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2023-03-17 14:39:26 +01:00
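A minimal sketch of the new classification head with a tiny, randomly initialised config (all dimensions below are illustrative):

```python
import torch

from transformers import LlamaConfig, LlamaForSequenceClassification

# Tiny random config just to exercise the head added in this PR.
config = LlamaConfig(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_labels=3,
    pad_token_id=0,
)
model = LlamaForSequenceClassification(config)

input_ids = torch.tensor([[1, 5, 9, 2]])
logits = model(input_ids).logits
print(logits.shape)  # (batch_size, num_labels) -> torch.Size([1, 3])
```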
Jason Phang
0041be5b3d
LLaMA Implementation ( #21955 )
...
* LLaMA
* sharding and docs
* tweak
* black
* inits
* ruff
* LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP
* init
* no checkpoint
* docs
* ruff
* type_vocab_size
* tokenizer fixes
* tokenizer fixes
* Update tokenization_llama.py
* Update tokenization_llama.py
* Update configuration_llama.py
* Update modeling_llama.py
* tokenizer add_bos by default
* licenses
* remove decoder
* norms and mlp
* rope overhaul
* tweaks
* black
* mention OPT implementation
* off-by-one naming
* typo
* fix
* tokenization fix and slicing bug
* padding config
* cleanup
* black
* update tests
* undo typo
* fix vocab caching logic
* ruff
* docbuilder
* attn fix from BlackSamorez
* initial feedback
* typo
* docs
* llama case
* llama case
* load checkpoint docs
* comment about tokenizer
* tokenizer defaults
* clear past_key_values if use_cache=False
* last tweaks
* last tweaks
* last tweaks
* last tweaks
---------
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
2023-03-16 09:00:53 -04:00
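For reference, the user-facing entry points introduced by this PR look roughly like this; "huggingface/llama-7b" is a placeholder, since the original release required converting Meta's weights with the conversion script and loading from the resulting local directory:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Placeholder checkpoint: real weights must be converted from Meta's release with
# convert_llama_weights_to_hf.py and loaded from the output directory.
tokenizer = LlamaTokenizer.from_pretrained("huggingface/llama-7b")
model = LlamaForCausalLM.from_pretrained("huggingface/llama-7b")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```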