7.3 KiB
XTREME-S benchmark examples
Maintainers: Anton Lozhkov and Patrick von Platen
The Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark is a benchmark designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers XX typologically diverse languages and seven downstream tasks grouped in four families: speech recognition, translation, classification and retrieval.
XTREME-S covers speech recognition with BABEL, Multilingual LibriSpeech (MLS) and VoxPopuli, speech translation with CoVoST-2, speech classification with LangID (FLoRes) and intent classification (MInds-14) and finally speech retrieval with speech-speech translation data mining (bi-speech retrieval). Each of the tasks covers a subset of the 40 languages included in XTREME-S (shown here with their ISO 639-1 codes): ar, as, ca, cs, cy, da, de, en, en, en, en, es, et, fa, fi, fr, hr, hu, id, it, ja, ka, ko, lo, lt, lv, mn, nl, pl, pt, ro, ru, sk, sl, sv, sw, ta, tl, tr and zh.
Paper: <TODO>
Dataset: https://huggingface.co/datasets/google/xtreme_s
Fine-tuning for the XTREME-S tasks
Based on the run_xtreme_s.py
script.
This script can fine-tune any of the pretrained speech models on the hub on the XTREME-S dataset tasks.
XTREME-S is made up of 7 different task-specific subsets. Here is how to run the script on each of them:
export TASK_NAME=mls.all
python run_xtreme_s.py \
--model_name_or_path="facebook/wav2vec2-xls-r-300m" \
--dataset_name="google/xtreme_s" \
--dataset_config_name="${TASK_NAME}" \
--eval_split_name="validation" \
--output_dir="xtreme_s_xlsr_${TASK_NAME}" \
--num_train_epochs=100 \
--per_device_train_batch_size=32 \
--learning_rate="3e-4" \
--target_column_name="transcription" \
--save_steps=500 \
--eval_steps=500 \
--freeze_feature_encoder \
--gradient_checkpointing \
--fp16 \
--group_by_length \
--do_train \
--do_eval \
--push_to_hub
where TASK_NAME
can be one of: mls.all, voxpopuli, covost2.all, fleurs.all, minds14.all
.
We get the following results on the test set of the benchmark's datasets. The corresponding training commands for each dataset are given in the sections below:
Task | Dataset | Result | Fine-tuned model & logs | Training time | GPUs |
---|---|---|---|---|---|
Speech Recognition | MLS | 30.33 WER | here | 18:47:25 | 8xV100 |
Speech Recognition | VoxPopuli | - | - | - | - |
Speech Recognition | FLEURS | - | - | - | - |
Speech Translation | CoVoST-2 | - | - | - | - |
Speech Classification | Minds-14 | 94.74 F1 / 94.70 Acc. | here | 04:46:40 | 2xA100 |
Speech Classification | FLEURS | - | - | - | - |
Speech Retrieval | FLEURS | - | - | - | - |
Speech Recognition with MLS
The following command shows how to fine-tune the XLS-R model on XTREME-S MLS using 8 GPUs in half-precision.
python -m torch.distributed.launch \
--nproc_per_node=8 \
run_xtreme_s.py \
--task="mls" \
--language="all" \
--model_name_or_path="facebook/wav2vec2-xls-r-300m" \
--eval_split_name="test" \
--output_dir="xtreme_s_xlsr_300m_mls" \
--overwrite_output_dir \
--num_train_epochs=100 \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=2 \
--learning_rate="3e-4" \
--warmup_steps=3000 \
--evaluation_strategy="steps" \
--max_duration_in_seconds=20 \
--save_steps=500 \
--eval_steps=500 \
--logging_steps=1 \
--layerdrop=0.0 \
--mask_time_prob=0.3 \
--mask_time_length=10 \
--mask_feature_prob=0.1 \
--mask_feature_length=64 \
--freeze_feature_encoder \
--gradient_checkpointing \
--fp16 \
--group_by_length \
--do_train \
--do_eval \
--metric_for_best_model="wer" \
--greater_is_better=False \
--load_best_model_at_end \
--push_to_hub
On 8 V100 GPUs, this script should run in ~19 hours and yield a cross-entropy loss of 0.6215 and word error rate of 30.33
Speech Classification with Minds-14
The following command shows how to fine-tune the XLS-R model on XTREME-S MLS using 2 GPUs in half-precision.
python -m torch.distributed.launch \
--nproc_per_node=2 \
run_xtreme_s.py \
--task="minds14" \
--language="all" \
--model_name_or_path="facebook/wav2vec2-xls-r-300m" \
--output_dir="xtreme_s_xlsr_300m_minds14" \
--overwrite_output_dir \
--num_train_epochs=50 \
--per_device_train_batch_size=32 \
--per_device_eval_batch_size=8 \
--gradient_accumulation_steps=1 \
--learning_rate="3e-4" \
--warmup_steps=1500 \
--evaluation_strategy="steps" \
--max_duration_in_seconds=30 \
--save_steps=200 \
--eval_steps=200 \
--logging_steps=1 \
--layerdrop=0.0 \
--mask_time_prob=0.3 \
--mask_time_length=10 \
--mask_feature_prob=0.1 \
--mask_feature_length=64 \
--freeze_feature_encoder \
--gradient_checkpointing \
--fp16 \
--group_by_length \
--do_train \
--do_eval \
--metric_for_best_model="f1" \
--greater_is_better=True \
--load_best_model_at_end \
--push_to_hub
On 2 A100 GPUs, this script should run in ~5 hours and yield a cross-entropy loss of 0.2890 and F1 score of 94.74