* Delete valohai.yaml
* NLP => ML
* typo
* website supports https
* datasets
* 60k + modalities
* unrelated link fixing for accelerate
* Ok those links were actually broken
* Fix link
* Make `AutoTokenizer` auto-link
* wording tweak
* add at least one non-nlp task
Comparisons like
version.parse(torch.__version__) > version.parse("1.6")
are True for torch==1.6.0+cu101 or torch==1.6.0+cpu
version.parse(version.parse(torch.__version__).base_version) are preferred (and available in pytorch_utils.py
Currently, tensorflow examples use the `load_metric` function from
Datasets library, commit migrates function call to `load` function
from Evaluate library.
* Migrate metric to Evaluate library in tf examples
Currently tensorflow examples use `load_metric` function from Datasets
library , commit migrates function call to `load` function to
Evaluate library.
Fix for #18306
* Migrate metric to Evaluate library in tf examples
Currently tensorflow examples use `load_metric` function from Datasets
library , commit migrates function call to `load` function to
Evaluate library.
Fix for #18306
* Migrate `metric` to Evaluate for all tf examples
Currently tensorflow examples use `load_metric` function from Datasets
library , commit migrates function call to `load` function to
Evaluate library.
* add info about megatron training
* upload models and datasets from CodeParrot organization
* upload models and datasets from CodeParrot organization
* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* fix typo and add comment about codeparrot vs megatron
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* Fix RESOURCE_EXHAUSTED error for large datasets on Flax example scripts
* using np.permutation for creating batch_idx
* train_samples_idx -> training_samples_idx
* fix type hints
* Add logits_processor parameter, used by `generate`, to `Seq2SeqTrainer` methods `evaluate` and `predict`
* Add all generate parameters to `Seq2SeqTrainer`, and also to `QuestionAnsweringSeq2SeqTrainer` which overrides it
* Remove `self._num_beams` from trainer classes
* - Run fixup
- Fix "Constraint" not exposed
- Fix synced_gpus to actually read from param
* Use kwargs
* Copy kwargs before making changes to it
* Fix style issues unused imports
* deduplication draft
* update style
* update style test
* dummy test main
* rename modules
* rename functions
* return extremes in deduplicate_clusters
* update style
* cast str for gzip
* update doc string
* time processing
* use dataset map to compute minhash
* fill value for short token
* remove da map method
* update style
* use share object to multiprocess
* update style
* use f-string and minor fix
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>
* update style
* use module parameters
* change ds_dedup to ds_filter
* save ds_dedup
* mv test to script tests
* make jaccard threshold a parameter of deduplicate_dataset
* update style
* add doc strings
* update style
* add doc string for DuplicationIndex
* save files into data dir
* update readme
* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>
* make near deduplication optional
* move near deduplication in README
* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
* use f string
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>
* Raise RepoNotFoundError in case of 401
* Include changes from revert-17646-skip_repo_not_found
* Add a comment
* 💄 Code quality
* 💚 Update `get_from_cache` test
* 💚 Code quality & skip failing test
* Add examples telemetry
* Alternative approach
* Add to all other examples
* Add to templates as well
* Put framework separately
* Same for TensorFlow