* Added logic to return attention weights from the Flax BERT model, plus test cases to check it (usage sketch below)
* Added a newline at the end of test_modeling_flax_common.py
* fixing code style
* Fixing the Roberta and Electra models too, since they are copied from BERT
* Added temporary hack to not run test_attention_outputs for FlaxGPT2
* Returned attention weights from GPT2 and changed the tests accordingly.
* last fixes
* bump flax dependency
Co-authored-by: jayendra <jayendra@infocusp.in>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
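A minimal sketch of how the attention outputs added in the commits above can be inspected; it is not taken from the PR itself, and the checkpoint name, `return_tensors="jax"`, and the `outputs.attentions` attribute are assumptions based on the current library API.

```python
# Hedged sketch: requesting the newly returned attention weights from Flax BERT.
# The checkpoint and exact output attributes are assumptions, not part of these commits.
from transformers import BertTokenizerFast, FlaxBertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = FlaxBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Flax BERT can now return attentions.", return_tensors="jax")
outputs = model(**inputs, output_attentions=True)

# One attention tensor per layer, each of shape
# (batch_size, num_heads, seq_len, seq_len).
print(len(outputs.attentions), outputs.attentions[0].shape)
```

The same `output_attentions=True` pattern would apply to FlaxGPT2 once its attention weights are returned, which is what the test changes above cover.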
* add flax roberta
* make style
* correct initialization
* modify model to save weights (see the save/load sketch below)
* fix copied from
* fix copied from
* correct some more code
* add more roberta models
* Apply suggestions from code review
* merge from master
* finish
* finish docs
Co-authored-by: Patrick von Platen <patrick@huggingface.co>
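A hedged sketch of the save/load round-trip for the Flax RoBERTa weights mentioned above; the checkpoint name and local directory are illustrative assumptions, not taken from the commits.

```python
# Hedged sketch: saving and reloading Flax RoBERTa weights.
# "roberta-base" and the local path are illustrative assumptions.
from transformers import FlaxRobertaModel

model = FlaxRobertaModel.from_pretrained("roberta-base")
model.save_pretrained("./flax-roberta-base")   # writes the config and the Flax weight file
restored = FlaxRobertaModel.from_pretrained("./flax-roberta-base")
```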
* Create modeling_flax_electra with code copied from modeling_flax_bert
* Add ElectraForMaskedLM and ElectraForPretraining (usage sketch at the end of this list)
* Add a modeling test for Flax Electra and fix naming and an argument in the Flax Electra model
* Add documentation
* Fix code style
* Fix code quality
* Adjust the tolerance in assert_almost_equal due to very small differences between the model outputs, ranging from 0.0010 to 0.0016
* Remove redundant ElectraPooler
* save intermediate
* adapt
* correct bert flax design
* adapt roberta as well
* finish roberta flax
* finish
* apply suggestions
* apply suggestions
Co-authored-by: Chris Nguyen <anhtu2687@gmail.com>
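A hedged sketch of exercising the new Electra masked-LM head together with a loosened numerical check in the spirit of the tolerance adjustment above; the Flax class name, checkpoint, and `decimal=3` value are assumptions rather than verbatim details from these commits.

```python
# Hedged sketch: running the Flax Electra masked-LM head and comparing outputs
# with a loosened tolerance (decimal=3 tolerates differences around 0.001).
import numpy as np
from transformers import ElectraTokenizerFast, FlaxElectraForMaskedLM

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-generator")
model = FlaxElectraForMaskedLM.from_pretrained("google/electra-small-generator")

inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="jax")
logits = model(**inputs).logits  # (batch_size, seq_len, vocab_size)

# reference_logits stands in for a hypothetical reference implementation's output.
reference_logits = np.asarray(logits)
np.testing.assert_almost_equal(np.asarray(logits), reference_logits, decimal=3)
```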