transformers/docs/source/en/main_classes/data_collator.md
RhuiDih 9cf4f2aa9a
Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs (#31629)
2024-07-23 15:56:41 +02:00

Data Collator

Data collators are objects that form a batch from a list of dataset elements. These elements have the same type as the elements of train_dataset or eval_dataset.

To build batches, data collators may apply some processing (such as padding). Some of them (such as [DataCollatorForLanguageModeling]) also apply random data augmentation (such as random masking) to the formed batch.
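As a rough illustration of this contract, a collator can be thought of as a callable that takes a list of dataset elements and returns a single batch. The sketch below uses plain Python lists in place of tensors. `simple_collate` is a hypothetical name, not part of the library, and it assumes every element already has the same keys and lengths.

```python
from typing import Any, Dict, List


def simple_collate(features: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of per-example dicts into one dict of per-key lists."""
    if not features:
        raise ValueError("cannot build a batch from an empty list")
    batch: Dict[str, List[Any]] = {}
    # Assumption: all examples share the keys of the first one.
    for key in features[0]:
        batch[key] = [example[key] for example in features]
    return batch


# Two dataset elements of the same type, as a dataset would yield them.
dataset_elements = [
    {"input_ids": [101, 7592, 102], "label": 0},
    {"input_ids": [101, 2088, 102], "label": 1},
]
batch = simple_collate(dataset_elements)
```

A real collator would additionally convert the lists to framework tensors and apply any padding or augmentation it is responsible for.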

Examples of use can be found in the example scripts or example notebooks.

Default data collator

autodoc data.data_collator.default_data_collator

DefaultDataCollator

autodoc data.data_collator.DefaultDataCollator

DataCollatorWithPadding

autodoc data.data_collator.DataCollatorWithPadding
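The idea behind padding at collation time can be sketched in a few lines: each batch is padded only to the length of its own longest sequence. This is an illustrative sketch, not the library's implementation. `pad_batch` is a hypothetical helper, and right-side padding with a `pad_token_id` of 0 is an assumption made for the example.

```python
from typing import Dict, List


def pad_batch(
    features: List[Dict[str, List[int]]], pad_token_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest length in this batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids, attention_mask = [], []
    for f in features:
        ids = f["input_ids"]
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding positions.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}])
# padded["input_ids"]      -> [[5, 6, 7], [8, 0, 0]]
# padded["attention_mask"] -> [[1, 1, 1], [1, 0, 0]]
```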

DataCollatorForTokenClassification

autodoc data.data_collator.DataCollatorForTokenClassification
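For token classification, the labels have to be padded along with the inputs so that every token position still lines up with its label. A minimal sketch of that idea follows. `pad_tokens_and_labels` is a hypothetical helper, and the padding label -100 is chosen here because it is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions do not contribute to the loss.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_tokens_and_labels(
    features: List[Dict[str, List[int]]],
    pad_token_id: int = 0,  # assumption for this example
    label_pad_token_id: int = IGNORE_INDEX,
) -> Dict[str, List[List[int]]]:
    """Right-pad input_ids and labels to the same per-batch length."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_pad)
    return batch


token_batch = pad_tokens_and_labels(
    [
        {"input_ids": [7, 8, 9], "labels": [1, 0, 2]},
        {"input_ids": [4], "labels": [3]},
    ]
)
# token_batch["labels"] -> [[1, 0, 2], [3, -100, -100]]
```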

DataCollatorForSeq2Seq

autodoc data.data_collator.DataCollatorForSeq2Seq

DataCollatorForLanguageModeling

autodoc data.data_collator.DataCollatorForLanguageModeling
    - numpy_mask_tokens
    - tf_mask_tokens
    - torch_mask_tokens
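Random masking for masked language modeling can be sketched as follows. This is a simplified, framework-free illustration, not the library's code. It assumes the common BERT-style recipe: each selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. `mask_tokens` and its parameters are hypothetical names, and unlike a production collator this sketch does not protect special tokens from being masked.

```python
import random
from typing import List, Optional, Tuple

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_tokens(
    input_ids: List[int],
    mask_token_id: int,
    vocab_size: int,
    mlm_probability: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int]]:
    """Return (masked_input_ids, labels) for one sequence."""
    if rng is None:
        rng = random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected; label stays IGNORE_INDEX
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id  # replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token unchanged
    return masked, labels


original = [11, 12, 13, 14, 15, 16, 17, 18]
masked_ids, mlm_labels = mask_tokens(
    original, mask_token_id=103, vocab_size=1000,
    mlm_probability=0.5, rng=random.Random(0),
)
```

Passing a seeded `random.Random` keeps the example reproducible. In training, the masking is resampled each time a batch is formed.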

DataCollatorForWholeWordMask

autodoc data.data_collator.DataCollatorForWholeWordMask
    - numpy_mask_tokens
    - tf_mask_tokens
    - torch_mask_tokens

DataCollatorForPermutationLanguageModeling

autodoc data.data_collator.DataCollatorForPermutationLanguageModeling
    - numpy_mask_tokens
    - tf_mask_tokens
    - torch_mask_tokens

DataCollatorWithFlattening

autodoc data.data_collator.DataCollatorWithFlattening
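The flattening approach avoids padding by concatenating all examples of a batch into one long sequence and restarting the position IDs at each example boundary, so that an attention implementation that reads position IDs can tell where one example ends and the next begins. The sketch below illustrates the idea in plain Python. `flatten_batch` is a hypothetical helper rather than the library's implementation, and using -100 as the label at the start of each example (so the loss ignores predictions across boundaries) is an assumption of this example.

```python
from typing import Dict, List


def flatten_batch(
    features: List[Dict[str, List[int]]], separator_id: int = -100
) -> Dict[str, List[int]]:
    """Concatenate examples into one sequence with per-example position IDs."""
    batch: Dict[str, List[int]] = {"input_ids": [], "labels": [], "position_ids": []}
    for f in features:
        ids = f["input_ids"]
        # Fall back to the inputs as labels when none are provided.
        labels = f.get("labels", ids)
        batch["input_ids"] += ids
        # The first label of each example is replaced so that no token is
        # predicted from the tail of the previous example.
        batch["labels"] += [separator_id] + labels[1:]
        # Positions restart at 0 for every example.
        batch["position_ids"] += list(range(len(ids)))
    return batch


packed = flatten_batch([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}])
# packed["input_ids"]    -> [1, 2, 3, 4, 5]
# packed["labels"]       -> [-100, 2, 3, -100, 5]
# packed["position_ids"] -> [0, 1, 2, 0, 1]
```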