* don't run custom when not needed?
* update test fetcher filtering
* fixup and updates
* update
* update
* reduce burden
* nit
* nit
* mising comma
* this?
* this?
* more parallelism
* more
* nit for real parallelism on tf and torch examples
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update to make it more custom
* update to make it more custom
* update to make it more custom
* update to make it more custom
* update
* update
* update
* update
* update
* update
* use correct path
* fix path to test files and examples
* filter-tests
* filter?
* filter?
* filter?
* nits
* fix naming of the artifacts to be pushed
* list vs files
* list vs files
* fixup
* fix list of all tests
* fix the install steps
* fix the install steps
* fix the config
* fix the config
* only split if needed
* only split if needed
* extend should fix it
* extend should fix it
* arg
* arg
* update
* update
* run tests
* run tests
* run tests
* more nits
* update
* update
* update
* update
* update
* update
* update
* simpler way to show the test, reduces the complexity of the generated config
* simpler way to show the test, reduces the complexity of the generated config
* style
* oups
* oups
* fix import errors
* skip some tests for now
* update doctestjob
* more parallelism
* fixup
* test only the test in examples
* test only the test in examples
* nits
* from Arthur
* fix generated congi
* update
* update
* show tests
* oups
* oups
* fix torch job for now
* use single upload setp
* oups
* fu**k
* fix
* nit
* update
* nit
* fix
* fixes
* [test-all]
* add generate marker and generate job
* oups
* torch job runs not generate tests
* let repo utils test all utils
* UPdate
* styling
* fix repo utils test
* more parallel please
* don't test
* update
* bit more verbose sir
* more
* hub were skipped
* split by classname
* revert
* maybe?
* Amazing catch
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
* fix
* update
* update
* maybe non capturing
* manual convert?
* pass artifacts as parameters as otherwise the config is too long
* artifact.json
* store output
* might not be safe?
* my token
* mmm?
* use CI job IS
* can't get a proper id?
* ups
* build num
* update
* echo url
* this?
* this!
* fix
* wget
* ish
* dang
* udpdate
* there we go
* update
* update
* pass all
* not .txt
* update
* fetcg
* fix naming
* fix
* up
* update
* update
* ??
* update
* more updates
* update
* more
* skip
* oups
* pr documentation tests are currently created differently
* update
* hmmmm
* oups
* curl -L
* update
* ????
* nit
* mmmm
* ish
* ouf
* update
* ish
* update
* update
* updatea
* nit
* nit
* up
* oups
* documentation_test fix
* test hub tests everything, just marker
* update
* fix
* test_hub is the only annoying one now
* tf threads?
* oups
* not sure what is happening?
* fix?
* just use folder for stating hub
* I am getting fucking annoyed
* fix the test?
* update
* uupdate
* ?
* fixes
* add comment!
* nit
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
* fix redundant checkpointing in example scripts
* Update examples/pytorch/image-classification/run_image_classification_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/translation/run_translation_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/token-classification/run_ner_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/text-classification/run_glue_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/summarization/run_summarization_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/semantic-segmentation/run_semantic_segmentation_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/language-modeling/run_mlm_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/language-modeling/run_fim_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/language-modeling/run_clm_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/image-pretraining/run_mim_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/instance-segmentation/run_instance_segmentation_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/multiple-choice/run_swag_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/question-answering/run_qa_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/object-detection/run_object_detection_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/question-answering/run_qa_beam_search_no_trainer.py
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Pass datasets trust_remote_code
* Pass trust_remote_code in more tests
* Add trust_remote_dataset_code arg to some tests
* Revert "Temporarily pin datasets upper version to fix CI"
This reverts commit b7672826ca.
* Pass trust_remote_code in librispeech_asr_dummy docstrings
* Revert "Pin datasets<2.20.0 for examples"
This reverts commit 833fc17a3e.
* Pass trust_remote_code to all examples
* Revert "Add trust_remote_dataset_code arg to some tests" to research_projects
* Pass trust_remote_code to tests
* Pass trust_remote_code to docstrings
* Fix flax examples tests requirements
* Pass trust_remote_dataset_code arg to tests
* Replace trust_remote_dataset_code with trust_remote_code in one example
* Fix duplicate trust_remote_code
* Replace args.trust_remote_dataset_code with args.trust_remote_code
* Replace trust_remote_dataset_code with trust_remote_code in parser
* Replace trust_remote_dataset_code with trust_remote_code in dataclasses
* Replace trust_remote_dataset_code with trust_remote_code arg
* Remove deprecated logic and warnings
* Add back some code that seems to be important...
* Let's just add all he nllb stuff back; removing it is a bit more involved
* Remove kwargs
* Remove more kwargs
* Update legacy Repository usage in `examples/pytorch/text-classification/run_glue_no_trainer.py`
Marked for deprecation here https://huggingface.co/docs/huggingface_hub/guides/upload#legacy-upload-files-with-git-lfs
* Fix import order
* Replace all example usage of deprecated Repository
* Fix remaining repo call and rename args variable
* Revert removing creation of gitignore files and don't change research examples
* add: initial script to train clm fim
* fix: if training model from scratch, new tokens will be added and embeddings resized
* fix: fixed attention_mask errors when generating FIM data
* fix: file formatted using black
* add: run_fim_no_trainer.py and fixed some comments in run_fim.py
* add: added fim examples to the README.md and ran code fixup
* fix: little bug in both fim training scripts
* fix: remove comment from notebook and added a note on fim related params
* fix: minor typo in README
* add: suggested minor changes to README and run_fim.py
* add: gradient_accumulation_steps and gradient_checkpointing args
* add: improved model embedding resizing
* add: pad_to_multiple_of and attn_implementation params
* add: requested minor changes
* add: deepspeed zero compatibility
* add: resize embeddings layer with zero3 support for fim model initialization
While using `run_clm.py`,[^1] I noticed that some files were being added
to my global cache, not the local cache. I set the `cache_dir` parameter
for the one call to `evaluate.load()`, which partially solved the
problem. I figured that while I was fixing the one script upstream, I
might as well fix the problem in all other example scripts that I could.
There are still some files being added to my global cache, but this
appears to be a bug in `evaluate` itself. This commit at least moves
some of the files into the local cache, which is better than before.
To create this PR, I made the following regex-based transformation:
`evaluate\.load\((.*?)\)` -> `evaluate\.load\($1,
cache_dir=model_args.cache_dir\)`. After using that, I manually fixed
all modified files with `ruff` serving as useful guidance. During the
process, I removed one existing usage of the `cache_dir` parameter in a
script that did not have a corresponding `--cache-dir` argument
declared.
[^1]: I specifically used `pytorch/language-modeling/run_clm.py` from
v4.34.1 of the library. For the original code, see the following URL:
acc394c4f5/examples/pytorch/language-modeling/run_clm.py.
* refactor: change default block_size
* fix: return tf to origin
* fix: change files to origin
* rebase
* rebase
* rebase
* rebase
* rebase
* rebase
* rebase
* rebase
* refactor: add min block_size to files
* reformat: add min block_size for run_clm tf