Cannot Train Wav2vec XLSR Model With Common Voice Data

I am trying to train a transformer ASR model with wav2vec XLSR for the Danish language, but whenever I try to load the Danish dataset with the datasets library, it raises an error. Notebook link

Error log:

ValueError: BuilderConfig da not found. Available: ['ab', 'ar', 'as', 'br', 'ca', 'cnh', 'cs', 'cv', 'cy', 'de', 'dv', 'el', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'fy-NL', 'ga-IE', 'hi', 'hsb', 'hu', 'ia', 'id', 'it', 'ja', 'ka', 'kab', 'ky', 'lg', 'lt', 'lv', 'mn', 'mt', 'nl', 'or', 'pa-IN', 'pl', 'pt', 'rm-sursilv', 'rm-vallader', 'ro', 'ru', 'rw', 'sah', 'sl', 'sv-SE', 'ta', 'th', 'tr', 'tt', 'uk', 'vi', 'vot', 'zh-CN', 'zh-HK', 'zh-TW']



Solution 1:[1]

I checked it for you.

The Danish subset of the Corpus is available in the following releases:

  • Common Voice Corpus 8.0
  • Common Voice Corpus 9.0

However, Hugging Face's datasets library (version 2.2.1) uses version 6.1.0 of the Corpus, which does not include Danish. You can check this yourself by loading any subset of the corpus and printing the dataset info as follows:

Code

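from datasets import load_dataset

# Load any language that does exist (German here); this downloads the full subset.
dataset_de = load_dataset("common_voice", "de")
# The printed info, like the cache path in the download log, shows the corpus version.
print(dataset_de.info)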

Output

Downloading and preparing dataset common_voice/de (download: 21.68 GiB, 
generated: 137.78 MiB, post-processed: Unknown size, total: 21.82 GiB) to 
/root/.cache/huggingface/datasets/common_voice/de/6.1.0/

See the Corpus Details

See the Library

You should wait for a new release of the library that includes a newer Corpus version, or open a request on their repository.
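In the meantime, you can check a language code against the available configs before calling load_dataset, so the script fails early with a clearer message. The following is only a minimal offline sketch: the list is copied from the ValueError in the question, and the helpers `is_config_available` and `similar_configs` are hypothetical names for illustration, not part of the datasets library. With network access, `datasets.get_dataset_config_names("common_voice")` should return the same kind of list for whichever version you have installed.

```python
# Config names copied verbatim from the ValueError in the question
# (the Common Voice 6.1.0 configs shipped with datasets 2.2.1).
AVAILABLE_CONFIGS = [
    'ab', 'ar', 'as', 'br', 'ca', 'cnh', 'cs', 'cv', 'cy', 'de', 'dv', 'el',
    'en', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'fy-NL', 'ga-IE', 'hi',
    'hsb', 'hu', 'ia', 'id', 'it', 'ja', 'ka', 'kab', 'ky', 'lg', 'lt', 'lv',
    'mn', 'mt', 'nl', 'or', 'pa-IN', 'pl', 'pt', 'rm-sursilv', 'rm-vallader',
    'ro', 'ru', 'rw', 'sah', 'sl', 'sv-SE', 'ta', 'th', 'tr', 'tt', 'uk',
    'vi', 'vot', 'zh-CN', 'zh-HK', 'zh-TW',
]


def is_config_available(code, available=AVAILABLE_CONFIGS):
    """Hypothetical helper: True if `code` is an exact config name."""
    return code in available


def similar_configs(code, available=AVAILABLE_CONFIGS):
    """Hypothetical helper: configs sharing the base language code.

    Useful because some configs carry a region suffix (e.g. 'sv-SE'),
    so asking for plain 'sv' would also raise the BuilderConfig error.
    """
    base = code.split("-")[0].lower()
    return [c for c in available if c.split("-")[0].lower() == base]


if __name__ == "__main__":
    for lang in ("da", "de", "sv"):
        if is_config_available(lang):
            print(f"{lang}: available")
        else:
            print(f"{lang}: not available, similar configs: {similar_configs(lang)}")
```

Running this for "da" confirms that there is no Danish config at all in this Corpus version, not even under a region-suffixed name, so the fix has to come from a newer Corpus release rather than a different config string.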

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Bekir