* fix: Apostrophe splitting in the BasicTokenizer for CLIPTokenizer
* account for apostrophe at start of new word
* remove _run_split_on_punc, use re.findall instead
* remove debugging, make style and quality
* use pattern and punc splitting, repo-consistency will fail
* remove commented out debugging
* adds bool args to BasicTokenizer, remove pattern
* do_split_on_punc default True
* clean stray comments and line breaks
* rebase, repo-consistency
* update to just do punctuation split
* add unicode normalizing back
* remove redundant line
* Initial commit
* Update src/transformers/models/falcon/configuration_falcon.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/models/falcon/configuration_falcon.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Cleanup config docstring
* Update src/transformers/models/falcon/configuration_falcon.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Convert to relative imports
* Remove torch < 1.8 warning
* Restructure cos_sin header
* qkv -> query, key, value
* Refactor attention calculation
* Add a couple of config variables to account for the different checkpoints
* Successful merging of the code paths!
* Fix misplaced line in the non-parallel attention path
* Update config and tests
* Add a pad_token_id when testing
* Support output_attentions when alibi is None
* make fixup
* Skip KV cache shape test
* No more _keys_to_ignore_on_load_missing
* Simplify self attention a bit
* Simplify self attention a bit
* make fixup
* stash commit
* Some more attention mask updates
* Should pass all tests except assisted generation!
* Add big model generation test
* make fixup
* Add temporary workaround for test
* Test overrides for assisted generation
* Update src/transformers/models/falcon/modeling_falcon.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/falcon/modeling_falcon.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update src/transformers/models/falcon/modeling_falcon.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update tests/models/falcon/test_modeling_falcon.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Test overrides for assisted generation
* Add generation demo
* Update copyright
* Make the docstring model actually small
* Add module-level docstring
* Remove all assertions
* Add copied from bloom
* Reformat the QKV layer
* Add copied from bloom
* Update src/transformers/models/falcon/modeling_falcon.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Remove unused line and reformat
* No single letter variables
* Cleanup return names
* Add copied from line
* Remove the deprecated arguments blocks
* Change the embeddings test to an alibi on/off test
* Remove position_ids from FalconForQA
* Remove old check for token type IDs
* Fix the alibi path when multi_query is False
* Update src/transformers/models/falcon/modeling_falcon.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/falcon/modeling_falcon.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update tests/models/falcon/test_modeling_falcon.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update config naming
* Fix typo for new_decoder_architecture
* Add some comments
* Fix docstring
* Fix docstring
* Create range in the right dtype from the start
* Review comment cleanup
* n_head_kv -> num_kv_heads
* self.alibi -> self.use_alibi
* self.num_kv -> self.num_kv_heads
* Reorder config args
* Made alibi arguments Optional
* Add all model docstrings
* Add extra checkpoints
* Add author info for Falcon
* Stop removing token_type_ids because our checkpoints shouldn't return it anymore
* Add one hopeful comment for the future
* Fix typo
* Update tests, fix cache issue for generation
* Use -1e9 instead of -inf to avoid float overflow
* Recompute the rotary embeddings much less often
* Re-enable disabled tests
* One final fix to attention mask calculation, and update tests
* Cleanup targeting falcon-40b equivalency
* Post-rebase docs update
* Update docstrings, especially in the config
* More descriptive variable names, and comments where we can't rename them
---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* hidden layers, huh, what are they good for (absolutely nothing)
* Some tests break with 1 hidden layer, use 2
* Use 1 hidden layer in a few slow models
* Use num_hidden_layers=2 everywhere
* Slightly higher tol for groupvit
* Slightly higher tol for groupvit
* Adding warning messages to BERT for missing attention masks
These warning messages are shown when there are pad tokens within the input ids and
no attention masks are given. The warning message should only show up once.
* Adding warning messages to BERT for missing attention masks
These warning messages are shown when the pad_token_id is not None
and no attention masks are given. The warning message should only
show up once.
* Ran fix copies to copy over the changes to some of the other models
* Add logger.warning_once.cache_clear() to the test
* Shows warning when there are no attention masks and input_ids start/end with pad tokens
* Using warning_once() instead and fix indexing in input_ids check
---------
Co-authored-by: JB Lau <hckyn@voyager2.local>
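The check described in the commits above could look roughly like the following. This is a minimal sketch in plain Python, not the code from the PR: the names `warn_once` and `warn_if_padding_and_no_attention_mask`, the logger name, and the message text are all hypothetical, and `input_ids` is assumed to be a list of lists of ints rather than a tensor.

```python
import functools
import logging

logger = logging.getLogger("attention_mask_sketch")  # hypothetical logger name


@functools.lru_cache(maxsize=None)
def warn_once(message):
    """Emit a given warning message only once; later calls hit the cache."""
    logger.warning(message)


def warn_if_padding_and_no_attention_mask(input_ids, attention_mask, pad_token_id):
    """Return True (and warn once) when padding is likely present but unmasked.

    Only the first and last token of each sequence are inspected, mirroring
    the "input_ids start/end with pad tokens" condition in the commit message.
    """
    # Nothing to check if a mask was supplied or the model has no pad token.
    if attention_mask is not None or pad_token_id is None:
        return False

    has_edge_padding = any(
        len(row) > 0 and (row[0] == pad_token_id or row[-1] == pad_token_id)
        for row in input_ids
    )
    if has_edge_padding:
        warn_once(
            "input_ids start or end with pad tokens but no attention_mask was "
            "given; pass attention_mask to avoid attending to padding."
        )
    return has_edge_padding
```

Because `warn_once` is wrapped in `functools.lru_cache`, a test that expects the warning to fire again can first call `warn_once.cache_clear()`, which is the same idea as the `cache_clear()` call added to the test above.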
* don't add space before single letter chars that don't have a merge
* fix the fix
* fixup
* add a test
* more testing
* fixup
* hack to make sure fast is also fixed
* update switch transformers test
* revert convert slow
* Update src/transformers/models/t5/tokenization_t5.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* add typechecking
* quality
---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Preliminary work on some models
* Fix test load missing and make sure nonpersistent buffers are tested
* Always ignore nonpersistent buffers if in state_dict
* Treat models
* More models
* Treat remaining models
* Fix quality
* Fix tests
* Remove draft
* This test is not needed anymore
* Fix copies
* Fix last test
* Newly added models
* Fix last tests
* Address review comments
* Fix TypeError: Object of type int64 is not JSON serializable
* Convert numpy.float64 and numpy.int64 to float and int for json serialization
* Black reformatted examples/pytorch/token-classification/run_ner_no_trainer.py
* make style
* Squash 88 commits
* Use markdown
* Remove mdx files due to bad rebase
* Fix modeling files due to bad rebase
* Fix style
* Update comment
* fix
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Allow dict input for audio classification pipeline
* make style
* Empty commit to trigger CI
* Empty commit to trigger CI
* check for torchaudio
* add pip instructions
Co-authored-by: Sylvain <sylvain.gugger@gmail.com>
* Update src/transformers/pipelines/audio_classification.py
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* asr -> audio class
* asr -> audio class
---------
Co-authored-by: Sylvain <sylvain.gugger@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Replace python random with torch.rand to enable dynamo.export
* revert changes to flax model code
* Remove unused random import
* Fix torch template
* Move torch.manual_seed(0) to right location
* Refactor hyperparameter search backends
* Simpler refactoring without abstract base class
* black
* review comments:
specify name in class
use methods instead of callable class attributes
name constant better
* review comments: safer bool checking, log multiple available backends
* test ALL_HYPERPARAMETER_SEARCH_BACKENDS vs HPSearchBackend in unit test, not module. format with black.
* copyright
* let's go!
* initial implementation of token-level timestamps
* only return a single timestamp per token
* remove token probabilities
* fix return type
* fix doc comment
* strip special tokens
* rename
* revert to not stripping special tokens
* only support models that have alignment_heads
* add integration test
* consistently name it token-level timestamps
* small DTW tweak
* initial support for ASR pipeline
* fix pipeline doc comments
* resolve token timestamps in pipeline with chunking
* change warning when no final timestamp is found
* return word-level timestamps
* fixup
* fix bug that skipped final word in each chunk
* fix failing unit tests
* merge punctuations into the words
* also return word tokens
* also return token indices
* add (failing) unit test for combine_tokens_into_words
* make combine_tokens_into_words private
* restore OpenAI's punctuation rules
* add pipeline tests
* make requested changes
* PR review changes
* fix failing pipeline test
* small stuff from PR
* only return words and their timestamps, not segments
* move alignment_heads into generation config
* forgot to set alignment_heads in pipeline tests
* tiny comment fix
* grr