# Processors

This library includes processors for several traditional tasks. These processors can be used to process a dataset into
examples that can be fed to a model.

## Processors

All processors follow the same architecture, which is that of the
[`~data.processors.utils.DataProcessor`]. The processor returns a list of
[`~data.processors.utils.InputExample`]. These
[`~data.processors.utils.InputExample`] can be converted to
[`~data.processors.utils.InputFeatures`] in order to be fed to the model.

[[autodoc]] data.processors.utils.DataProcessor

[[autodoc]] data.processors.utils.InputExample

[[autodoc]] data.processors.utils.InputFeatures

## GLUE

[General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) is a benchmark that evaluates the
performance of models across a diverse set of existing NLU tasks. It was released together with the paper [GLUE: A
multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7).

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB,
QQP, QNLI, RTE and WNLI.

Those processors are:

- [`~data.processors.utils.MrpcProcessor`]
- [`~data.processors.utils.MnliProcessor`]
- [`~data.processors.utils.MnliMismatchedProcessor`]
- [`~data.processors.utils.ColaProcessor`]
- [`~data.processors.utils.Sst2Processor`]
- [`~data.processors.utils.StsbProcessor`]
- [`~data.processors.utils.QqpProcessor`]
- [`~data.processors.utils.QnliProcessor`]
- [`~data.processors.utils.RteProcessor`]
- [`~data.processors.utils.WnliProcessor`]

Additionally, the following method can be used to convert a list of
[`~data.processors.utils.InputExample`] into a list of
[`~data.processors.utils.InputFeatures`].

[[autodoc]] data.processors.glue.glue_convert_examples_to_features

### Example usage

An example using these processors is given in the [run_glue.py](https://github.com/huggingface/transformers/tree/master/examples/legacy/text-classification/run_glue.py) script.

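To make the example-to-features flow described above more concrete, here is a minimal, standard-library-only sketch of the pattern for a sentence-pair task. It does not use the library's classes: `ToyExample`, `ToyFeatures`, and `toy_convert_examples_to_features` are hypothetical stand-ins, and the whitespace "tokenizer" with an on-the-fly vocabulary is an assumption made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyExample:
    """Hypothetical stand-in for an input example (one sentence pair)."""
    guid: str
    text_a: str
    text_b: Optional[str]
    label: Optional[str]


@dataclass
class ToyFeatures:
    """Hypothetical stand-in for model-ready features."""
    input_ids: List[int]
    attention_mask: List[int]
    label: Optional[int]


def toy_convert_examples_to_features(
    examples: List[ToyExample], label_list: List[str], max_length: int
) -> List[ToyFeatures]:
    # Map string labels to integer class ids.
    label_map = {label: i for i, label in enumerate(label_list)}
    # Toy vocabulary built on the fly; a real tokenizer would be used instead.
    vocab: Dict[str, int] = {"[PAD]": 0, "[SEP]": 1}
    features = []
    for ex in examples:
        tokens = ex.text_a.lower().split()
        if ex.text_b is not None:
            tokens += ["[SEP]"] + ex.text_b.lower().split()
        tokens = tokens[:max_length]  # truncate to the maximum length
        ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
        mask = [1] * len(ids)
        padding = max_length - len(ids)  # pad up to the maximum length
        label = label_map[ex.label] if ex.label is not None else None
        features.append(ToyFeatures(ids + [0] * padding, mask + [0] * padding, label))
    return features


examples = [
    ToyExample("dev-0", "He is here.", "He is present.", "1"),
    ToyExample("dev-1", "A cat sat.", "Dogs bark.", "0"),
]
features = toy_convert_examples_to_features(examples, label_list=["0", "1"], max_length=8)
print(features[0].input_ids)  # [2, 3, 4, 1, 2, 3, 5, 0]
```

The library's own conversion function relies on a real tokenizer and also produces token type ids; the sketch only shows the overall shape of the conversion (label mapping, truncation, padding, attention mask).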
## XNLI

[The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the
quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on [*MultiNLI*](http://www.nyu.edu/projects/bowman/multinli/): pairs of text are labeled with textual entailment annotations for 15
different languages (including both high-resource languages such as English and low-resource languages such as Swahili).

It was released together with the paper [XNLI: Evaluating Cross-lingual Sentence Representations](https://arxiv.org/abs/1809.05053).

This library hosts the processor to load the XNLI data:

- [`~data.processors.utils.XnliProcessor`]

Please note that since the gold labels are available on the test set, evaluation is performed on the test set.

An example using this processor is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/master/examples/legacy/text-classification/run_xnli.py) script.

## SQuAD

[The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) is a benchmark that
evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version
(v1.1) was released together with the paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250). The second version (v2.0) was released alongside the paper [Know What You Don't
Know: Unanswerable Questions for SQuAD](https://arxiv.org/abs/1806.03822).

This library hosts a processor for each of the two versions:

### Processors

Those processors are:

- [`~data.processors.utils.SquadV1Processor`]
- [`~data.processors.utils.SquadV2Processor`]

They both inherit from the abstract class [`~data.processors.utils.SquadProcessor`].

[[autodoc]] data.processors.squad.SquadProcessor
    - all

Additionally, the following method can be used to convert SQuAD examples into
[`~data.processors.utils.SquadFeatures`] that can be used as model inputs.


[[autodoc]] data.processors.squad.squad_convert_examples_to_features

These processors, as well as the aforementioned method, can be used with files containing the data as well as with the
*tensorflow_datasets* package. Examples are given below.

### Example usage

Here is an example using the processors as well as the conversion method using data files:

```python
from transformers.data.processors.squad import (
    SquadV1Processor,
    SquadV2Processor,
    squad_convert_examples_to_features,
)

# `tokenizer`, `args`, `evaluate`, the data directories, and the length settings
# are assumed to be defined by the surrounding script.

# Loading a V2 processor
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)

# Loading a V1 processor
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)
```

Using *tensorflow_datasets* is as easy as using a data file:

```python
import tensorflow_datasets as tfds

from transformers.data.processors.squad import SquadV1Processor, squad_convert_examples_to_features

# tensorflow_datasets only handles SQuAD V1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)
```

Another example using these processors is given in the [run_squad.py](https://github.com/huggingface/transformers/tree/master/examples/legacy/question-answering/run_squad.py) script.
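
The `max_seq_length` and `doc_stride` arguments in the calls above control how a context that is too long for the model is split into overlapping windows. The following is a simplified, standard-library-only sketch of that sliding-window idea, assuming the window start advances by `doc_stride` tokens each time. The function name `sliding_windows` and the parameter `max_tokens_per_window` are hypothetical; the library's conversion function additionally accounts for the question tokens, special tokens, and answer positions.

```python
from typing import List, Tuple


def sliding_windows(
    num_tokens: int, max_tokens_per_window: int, doc_stride: int
) -> List[Tuple[int, int]]:
    """Return (start, length) spans that together cover `num_tokens` tokens."""
    if max_tokens_per_window <= 0 or doc_stride <= 0:
        raise ValueError("max_tokens_per_window and doc_stride must be positive")
    spans = []
    start = 0
    while start < num_tokens:
        # Each window holds at most `max_tokens_per_window` tokens.
        length = min(num_tokens - start, max_tokens_per_window)
        spans.append((start, length))
        if start + length == num_tokens:
            break  # the last window reaches the end of the document
        # Advance by the stride, so consecutive windows overlap
        # whenever doc_stride < max_tokens_per_window.
        start += min(length, doc_stride)
    return spans


print(sliding_windows(num_tokens=10, max_tokens_per_window=4, doc_stride=2))
# [(0, 4), (2, 4), (4, 4), (6, 4)]
```

A smaller stride yields more windows with more overlap, which makes it more likely that an answer appears with enough surrounding context in at least one window, at the cost of more features per example.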