Mirror of https://github.com/huggingface/transformers.git (synced 2025-07-31 02:02:21 +06:00)

Doc to dataset (#18037)

* Link to the Datasets doc
* Remove unwanted file

parent: be79cd7d8e
commit: 2e90c3df8f

@@ -33,7 +33,7 @@ pip install transformers datasets accelerate nvidia-ml-py3

The `nvidia-ml-py3` library allows us to monitor the memory usage of the models from within Python. You might be familiar with the `nvidia-smi` command in the terminal - this library allows to access the same information in Python directly.

-Then we create some dummy data. We create random token IDs between 100 and 30000 and binary labels for a classifier. In total we get 512 sequences each with length 512 and store them in a [`Dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=dataset#datasets.Dataset) with PyTorch format.
+Then we create some dummy data. We create random token IDs between 100 and 30000 and binary labels for a classifier. In total we get 512 sequences each with length 512 and store them in a [`~datasets.Dataset`] with PyTorch format.

```py

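For readers who want to reproduce the setup this hunk describes, a minimal sketch follows: random token IDs and binary labels stored in a `datasets.Dataset` with PyTorch formatting, plus a GPU-memory readout through `nvidia-ml-py3`. Sizes and variable names are illustrative rather than taken verbatim from the guide.

```py
import numpy as np
from datasets import Dataset
from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit

# 512 sequences of length 512, token IDs in [100, 30000), binary labels
seq_len, dataset_size = 512, 512
dummy_data = {
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
    "labels": np.random.randint(0, 2, dataset_size),
}
ds = Dataset.from_dict(dummy_data)
ds.set_format("pt")  # indexing now returns PyTorch tensors

# Same information as `nvidia-smi`, queried from Python
nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)
print(f"GPU memory occupied: {nvmlDeviceGetMemoryInfo(handle).used // 1024**2} MB.")
```
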
@@ -244,7 +244,7 @@ For example, the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) data

'sampling_rate': 8000}
```

-1. Use 🤗 Datasets' [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.cast_column) method to upsample the sampling rate to 16kHz:
+1. Use 🤗 Datasets' [`~datasets.Dataset.cast_column`] method to upsample the sampling rate to 16kHz:

```py
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

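The resampling step in this hunk is self-contained enough to sketch in full; the split and column names follow the MInDS-14 example the surrounding guide uses.

```py
from datasets import Audio, load_dataset

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
# Re-declare the "audio" column at 16kHz; decoding and resampling happen lazily on access
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
print(dataset[0]["audio"]["sampling_rate"])  # 16000
```
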
@@ -117,7 +117,7 @@ The preprocessing function needs to:

... return batch
```

-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the map function by increasing the number of processes with `num_proc`. Remove the columns you don't need:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the map function by increasing the number of processes with `num_proc`. Remove the columns you don't need:

```py
>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)

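As a standalone illustration of the `map` options mentioned here (`num_proc`, `remove_columns`, `batched`), using an arbitrary tokenizer and dataset rather than the guide's own `prepare_dataset`:

```py
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
imdb = load_dataset("imdb", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

# batched=True feeds many rows per call; num_proc spreads the work across 4 processes
tokenized = imdb.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
```
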
@@ -129,7 +129,7 @@ The preprocessing function needs to:

... return inputs
```

-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need, and rename `intent_class` to `label` because that is what the model expects:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need, and rename `intent_class` to `label` because that is what the model expects:

```py
>>> encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)

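The rename mentioned in the new text is a separate one-liner; assuming `minds` and `preprocess_function` are defined earlier in that guide, the full step looks roughly like:

```py
# batched=True processes several examples per call; the raw "audio" column is no longer needed
encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
# the model expects the label column to be called "label"
encoded_minds = encoded_minds.rename_column("intent_class", "label")
```
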
@@ -95,7 +95,7 @@ Create a preprocessing function that will apply the transforms and return the `p

... return examples
```

-Use 🤗 Dataset's [`with_transform`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?#datasets.Dataset.with_transform) method to apply the transforms over the entire dataset. The transforms are applied on-the-fly when you load an element of the dataset:
+Use 🤗 Dataset's [`~datasets.Dataset.with_transform`] method to apply the transforms over the entire dataset. The transforms are applied on-the-fly when you load an element of the dataset:

```py
>>> food = food.with_transform(transforms)

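A sketch of what a `transforms` function passed to `with_transform` might look like; the torchvision pipeline and column names are illustrative, assuming `food` is the image dataset loaded earlier in that guide.

```py
from torchvision.transforms import Compose, RandomResizedCrop, ToTensor

_transforms = Compose([RandomResizedCrop(224), ToTensor()])

def transforms(examples):
    # applied on-the-fly, only to the examples actually being accessed
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
    del examples["image"]
    return examples

food = food.with_transform(transforms)
```
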
@@ -118,7 +118,7 @@ Here is how you can create a preprocessing function to convert the list to a str

... return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)
```

-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once and increasing the number of processes with `num_proc`. Remove the columns you don't need:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once and increasing the number of processes with `num_proc`. Remove the columns you don't need:

```py
>>> tokenized_eli5 = eli5.map(

@@ -245,7 +245,7 @@ At this point, only three steps remain:

```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(

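Filled out, the `to_tf_dataset` call truncated in this hunk typically takes the shape below; the column names and batch size are illustrative, and `lm_dataset` comes from the guide's earlier preprocessing.

```py
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

tf_train_set = lm_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
```
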
@@ -352,7 +352,7 @@ At this point, only three steps remain:

```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(

@@ -79,7 +79,7 @@ The preprocessing function needs to do:

... return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
```

-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
tokenized_swag = swag.map(preprocess_function, batched=True)

@@ -224,7 +224,7 @@ At this point, only three steps remain:

```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs in `columns`, targets in `label_cols`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs in `columns`, targets in `label_cols`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)

@@ -126,7 +126,7 @@ Here is how you can create a function to truncate and map the start and end toke

... return inputs
```

-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need:

```py
>>> tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

@@ -199,7 +199,7 @@ At this point, only three steps remain:

```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and the start and end positions of an answer in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and the start and end positions of an answer in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = tokenized_squad["train"].to_tf_dataset(

@@ -66,7 +66,7 @@ Create a preprocessing function to tokenize `text` and truncate sequences to be

... return tokenizer(examples["text"], truncation=True)
```

-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
tokenized_imdb = imdb.map(preprocess_function, batched=True)

@@ -144,7 +144,7 @@ At this point, only three steps remain:

</Tip>
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = tokenized_imdb["train"].to_tf_dataset(

@@ -85,7 +85,7 @@ The preprocessing function needs to:

... return model_inputs
```

-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
>>> tokenized_billsum = billsum.map(preprocess_function, batched=True)

@@ -160,7 +160,7 @@ At this point, only three steps remain:

```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = tokenized_billsum["train"].to_tf_dataset(

@@ -126,7 +126,7 @@ Here is how you can create a function to realign the tokens and labels, and trun

... return tokenized_inputs
```

-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to tokenize and align the labels over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to tokenize and align the labels over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
>>> tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)

@@ -199,7 +199,7 @@ At this point, only three steps remain:

```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = tokenized_wnut["train"].to_tf_dataset(

@@ -87,7 +87,7 @@ The preprocessing function needs to:

... return model_inputs
```

-Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
>>> tokenized_books = books.map(preprocess_function, batched=True)

@@ -162,7 +162,7 @@ At this point, only three steps remain:

```
</pt>
<tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = tokenized_books["train"].to_tf_dataset(

@@ -169,7 +169,7 @@ The [`DefaultDataCollator`] assembles tensors into a batch for the model to trai

</Tip>

-Next, convert the tokenized datasets to TensorFlow datasets with the [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset) method. Specify your inputs in `columns`, and your label in `label_cols`:
+Next, convert the tokenized datasets to TensorFlow datasets with the [`~datasets.Dataset.to_tf_dataset`] method. Specify your inputs in `columns`, and your label in `label_cols`:

```py
>>> tf_train_dataset = small_train_dataset.to_tf_dataset(

@@ -1189,7 +1189,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin, Pu
        prefetch: bool = True,
    ):
        """
-        Wraps a HuggingFace `datasets.Dataset` as a `tf.data.Dataset` with collation and batching. This method is
+        Wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset` with collation and batching. This method is
        designed to create a "ready-to-use" dataset that can be passed directly to Keras methods like `fit()` without
        further modification. The method will drop columns from the dataset if they don't match input names for the
        model. If you want to specify the column names to return rather than using the names that match this model, we

@@ -1197,7 +1197,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin, Pu

        Args:
            dataset (`Any`):
-                A `datasets.Dataset` to be wrapped as a `tf.data.Dataset`.
+                A [~`datasets.Dataset`] to be wrapped as a `tf.data.Dataset`.
            batch_size (`int`, defaults to 8):
                The size of batches to return.
            shuffle (`bool`, defaults to `True`):

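The method documented in these two hunks wraps a `datasets.Dataset` for Keras; its name is not visible here, but assuming it is `TFPreTrainedModel.prepare_tf_dataset`, usage would look roughly like this, with `model`, `tokenized_dataset`, and `tokenizer` coming from earlier steps:

```py
tf_dataset = model.prepare_tf_dataset(
    tokenized_dataset,    # a datasets.Dataset; columns that don't match model inputs are dropped
    batch_size=8,
    shuffle=True,
    tokenizer=tokenizer,  # used to build a default padding collator
)
model.compile(optimizer="adam")  # transformers TF models can compute their loss internally
model.fit(tf_dataset)
```
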
@@ -232,7 +232,7 @@ class Trainer:
            default to [`default_data_collator`] if no `tokenizer` is provided, an instance of
            [`DataCollatorWithPadding`] otherwise.
        train_dataset (`torch.utils.data.Dataset` or `torch.utils.data.IterableDataset`, *optional*):
-            The dataset to use for training. If it is an `datasets.Dataset`, columns not accepted by the
+            The dataset to use for training. If it is a [`~datasets.Dataset`], columns not accepted by the
            `model.forward()` method are automatically removed.

            Note that if it's a `torch.utils.data.IterableDataset` with some randomization and you are training in a

@@ -241,7 +241,7 @@ class Trainer:
            manually set the seed of this `generator` at each epoch) or have a `set_epoch()` method that internally
            sets the seed of the RNGs used.
        eval_dataset (`torch.utils.data.Dataset`, *optional*):
-            The dataset to use for evaluation. If it is an `datasets.Dataset`, columns not accepted by the
+            The dataset to use for evaluation. If it is a [`~datasets.Dataset`], columns not accepted by the
            `model.forward()` method are automatically removed.
        tokenizer ([`PreTrainedTokenizerBase`], *optional*):
            The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs the

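In practice, the automatic column removal described in these docstrings means a tokenized `datasets.Dataset` can be handed to [`Trainer`] directly; a minimal sketch, with `tokenized_ds` and `tokenizer` assumed from a prior preprocessing step:

```py
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=tokenized_ds["train"],  # columns not accepted by model.forward() are dropped
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
)
trainer.train()
```
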
@@ -854,8 +854,8 @@ class Trainer:

        Args:
            eval_dataset (`torch.utils.data.Dataset`, *optional*):
-                If provided, will override `self.eval_dataset`. If it is an `datasets.Dataset`, columns not accepted by
-                the `model.forward()` method are automatically removed. It must implement `__len__`.
+                If provided, will override `self.eval_dataset`. If it is a [`~datasets.Dataset`], columns not accepted
+                by the `model.forward()` method are automatically removed. It must implement `__len__`.
        """
        if eval_dataset is None and self.eval_dataset is None:
            raise ValueError("Trainer: evaluation requires an eval_dataset.")

@@ -904,8 +904,8 @@ class Trainer:

        Args:
            test_dataset (`torch.utils.data.Dataset`, *optional*):
-                The test dataset to use. If it is an `datasets.Dataset`, columns not accepted by the `model.forward()`
-                method are automatically removed. It must implement `__len__`.
+                The test dataset to use. If it is a [`~datasets.Dataset`], columns not accepted by the
+                `model.forward()` method are automatically removed. It must implement `__len__`.
        """
        data_collator = self.data_collator

@@ -2605,8 +2605,8 @@ class Trainer:

        Args:
            eval_dataset (`Dataset`, *optional*):
-                Pass a dataset if you wish to override `self.eval_dataset`. If it is an `datasets.Dataset`, columns not
-                accepted by the `model.forward()` method are automatically removed. It must implement the `__len__`
+                Pass a dataset if you wish to override `self.eval_dataset`. If it is a [`~datasets.Dataset`], columns
+                not accepted by the `model.forward()` method are automatically removed. It must implement the `__len__`
                method.
            ignore_keys (`Lst[str]`, *optional*):
                A list of keys in the output of your model (if it is a dictionary) that should be ignored when

@@ -45,8 +45,8 @@ class Seq2SeqTrainer(Trainer):

        Args:
            eval_dataset (`Dataset`, *optional*):
-                Pass a dataset if you wish to override `self.eval_dataset`. If it is an `datasets.Dataset`, columns not
-                accepted by the `model.forward()` method are automatically removed. It must implement the `__len__`
+                Pass a dataset if you wish to override `self.eval_dataset`. If it is an [`~datasets.Dataset`], columns
+                not accepted by the `model.forward()` method are automatically removed. It must implement the `__len__`
                method.
            ignore_keys (`List[str]`, *optional*):
                A list of keys in the output of your model (if it is a dictionary) that should be ignored when

@@ -93,7 +93,7 @@ class Seq2SeqTrainer(Trainer):

        Args:
            test_dataset (`Dataset`):
-                Dataset to run the predictions on. If it is an `datasets.Dataset`, columns not accepted by the
+                Dataset to run the predictions on. If it is a [`~datasets.Dataset`], columns not accepted by the
                `model.forward()` method are automatically removed. Has to implement the method `__len__`
            ignore_keys (`List[str]`, *optional*):
                A list of keys in the output of your model (if it is a dictionary) that should be ignored when