How can I combine these datasets into a DatasetDict?
I'm trying to build a DatasetDict object to train a QA model in PyTorch. I have these two datasets:
test_dataset
Dataset({
    features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
    num_rows: 21489
})
and
train_dataset
Dataset({
    features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
    num_rows: 54159
})
I didn't find anything in the datasets documentation. I'm quite new to this, so the solution may be really easy. What I want to obtain is something like this:
dataset
DatasetDict({
    train: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 54159
    })
    test: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 21489
    })
})
I can't figure out how to combine two datasets into a DatasetDict or how to set the keys. I would also like to split the train set in two, a train set and a validation set, but I'm struggling with that step too. The final result should look something like this:
dataset
DatasetDict({
    train: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 54159 - x
    })
    validation: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: x
    })
    test: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 21489
    })
})
Thank you in advance and pardon me for being a noob :)
Solution 1:[1]
To get the validation dataset, you can do this:
train_dataset, validation_dataset = train_dataset.train_test_split(test_size=0.1).values()
This call shuffles train_dataset and moves 10% of its rows into validation_dataset, keeping the remaining 90% as the new train_dataset.
To obtain the DatasetDict with all three splits, you can do this:
import datasets
# The keys of the dict become the split names.
dd = datasets.DatasetDict({"train": train_dataset, "validation": validation_dataset, "test": test_dataset})
Solution 2:[2]
For future generations ;) Adding a bit more information about the answer.
from datasets import Dataset, DatasetDict
# x_train, y_train, etc. are lists of texts and labels that you already have.
d = {
    'train': Dataset.from_dict({'label': y_train, 'text': x_train}),
    'val': Dataset.from_dict({'label': y_val, 'text': x_val}),
    'test': Dataset.from_dict({'label': y_test, 'text': x_test}),
}
DatasetDict(d)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dharman |
| Solution 2 | Sahar Millis |
