transformers/examples/research_projects/codeparrot/scripts
Jia LI da2bd2ae96
[CodeParrot] Near-deduplication with jaccard similarity (#17054)
* deduplication draft

* update style

* update style test

* dummy test main

* rename modules

* rename functions

* return extremes in deduplicate_clusters

* update style

* cast str for gzip

* update doc string

* time processing

* use dataset map to compute minhash

* fill value for short token

* remove da map method

* update style

* use share object to multiprocess

* update style

* use f-string and minor fix

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>

* update style

* use module parameters

* change ds_dedup to ds_filter

* save ds_dedup

* mv test to script tests

* make jaccard threshold a parameter of deduplicate_dataset

* update style

* add doc strings

* update style

* add doc string for DuplicationIndex

* save files into data dir

* update readme

* Update examples/research_projects/codeparrot/README.md

* make near deduplication optional

* move near deduplication in README

* Update examples/research_projects/codeparrot/README.md

* use f string

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>
2022-06-21 14:23:36 +02:00
Name                      Last commit                                                        Date
tests                     [CodeParrot] Near-deduplication with jaccard similarity (#17054)   2022-06-21 14:23:36 +02:00
arguments.py              [CodeParrot] Near-deduplication with jaccard similarity (#17054)   2022-06-21 14:23:36 +02:00
bpe_training.py           fix: switch from slow to generic tokenizer class (#15122)          2022-01-12 09:12:43 -05:00
codeparrot_training.py    Fix CodeParrot training script (#17291)                            2022-05-23 12:55:35 +02:00
human_eval.py             Black preview (#17217)                                             2022-05-12 16:25:55 -04:00
initialize_model.py       Fix CodeParrot training script (#17291)                            2022-05-23 12:55:35 +02:00
minhash_deduplication.py  [CodeParrot] Near-deduplication with jaccard similarity (#17054)   2022-06-21 14:23:36 +02:00
preprocessing.py          [CodeParrot] Near-deduplication with jaccard similarity (#17054)   2022-06-21 14:23:36 +02:00
pretokenizing.py          CodeParrot data pretokenization (#16932)                           2022-05-16 15:32:16 +02:00
validation_loss.py        Add CodeParrot 🦜 codebase (#14536)                                 2021-12-02 10:41:35 +01:00