* Pass datasets trust_remote_code
* Pass trust_remote_code in more tests
* Add trust_remote_dataset_code arg to some tests
* Revert "Temporarily pin datasets upper version to fix CI"
This reverts commit b7672826ca.
* Pass trust_remote_code in librispeech_asr_dummy docstrings
* Revert "Pin datasets<2.20.0 for examples"
This reverts commit 833fc17a3e.
* Pass trust_remote_code to all examples
* Revert "Add trust_remote_dataset_code arg to some tests" in research_projects
* Pass trust_remote_code to tests
* Pass trust_remote_code to docstrings
* Fix flax examples tests requirements
* Pass trust_remote_dataset_code arg to tests
* Replace trust_remote_dataset_code with trust_remote_code in one example
* Fix duplicate trust_remote_code
* Replace args.trust_remote_dataset_code with args.trust_remote_code
* Replace trust_remote_dataset_code with trust_remote_code in parser
* Replace trust_remote_dataset_code with trust_remote_code in dataclasses
* Replace trust_remote_dataset_code with trust_remote_code arg
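The rename above boils down to exposing a single `--trust_remote_code` flag and forwarding it to `datasets.load_dataset`. A minimal, hypothetical sketch (the flag name matches the renamed argument; the actual dataset call is shown as a comment since it requires network access):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--trust_remote_code",
    action="store_true",
    help="Allow datasets defined by custom code on the Hub to run that code locally.",
)
args = parser.parse_args(["--trust_remote_code"])

# Forwarded to the datasets library, e.g.:
# raw_datasets = load_dataset(dataset_name, trust_remote_code=args.trust_remote_code)
print(args.trust_remote_code)
```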
* Fix typos and grammar mistakes in docs and examples
* Fix typos in docstrings and comments
* Fix spelling of `tokenizer` in model tests
* Remove erroneous spaces in decorators
* Remove extra spaces in Markdown link texts
* add: tokenizer training script for TF TPU LM training.
* add: script for preparing the TFRecord shards.
* add: sequence of execution to readme.
* remove limit from the tfrecord shard name.
* Add initial train_model.py
* Add basic training arguments and model init
* Get up to the point of writing the data collator
* Pushing progress so far!
* Complete first draft of model training code
* feat: grouping of texts efficiently.
Co-authored-by: Matt <rocketknight1@gmail.com>
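"Grouping of texts" here refers to the standard LM preprocessing trick: concatenate tokenized samples and re-split them into fixed-length chunks, typically via `datasets.Dataset.map(batched=True)`. A hedged sketch with illustrative names (not the script's exact code):

```python
def group_texts(examples, max_seq_length=8):
    # Concatenate every column (e.g. input_ids) across the batch.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Drop the small remainder so every chunk has exactly max_seq_length tokens.
    total_length = (total_length // max_seq_length) * max_seq_length
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

chunks = group_texts({"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]]})
print(chunks["input_ids"])  # [[1, 2, 3, 4, 5, 6, 7, 8]]
```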
* Add proper masking collator and get training loop working
* fix: things.
* Read sample counts from filenames
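Reading sample counts from filenames lets the training script know the dataset size without scanning the TFRecords themselves. A hypothetical sketch; the shard-naming pattern below is an assumption for illustration, not the repo's actual scheme:

```python
import re

def samples_in_shard(filename):
    # Assumed pattern: "<prefix>-<index>-<count>-samples.tfrecord".
    match = re.search(r"-(\d+)-samples\.tfrecord$", filename)
    if match is None:
        raise ValueError(f"no sample count in {filename!r}")
    return int(match.group(1))

shards = ["wiki-00000-512-samples.tfrecord", "wiki-00001-480-samples.tfrecord"]
total = sum(samples_in_shard(f) for f in shards)
print(total)  # 992
```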
* Draft README
* Improve TPU warning
* Use distribute instead of distribute.experimental
* Apply suggestions from code review
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
* Modularize loading and add MLM probability as arg
* minor refactoring to better use the cli args.
* fill out the readme.
* include tpu and inference sections in the readme.
* table of contents.
* parallelize maps.
* polish readme.
* change script name to run_mlm.py
* address PR feedback (round I).
---------
Co-authored-by: Matt <rocketknight1@gmail.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>