How can I handle these datasets to create a DatasetDict?

I'm trying to build a DatasetDict object to train a QA model in PyTorch. I have these two different datasets:

test_dataset

Dataset({
    features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
    num_rows: 21489
})

and

train_dataset

Dataset({
    features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
    num_rows: 54159
})

I didn't find anything in the datasets documentation. I'm quite a noob, so the solution may be really easy. What I wish to obtain is something like this:

dataset

DatasetDict({
    train: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 54159
    })
    test: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 21489
    })
})

I can't find how to combine two datasets into a DatasetDict or how to set the keys. Moreover, I wish to "cut" the train set in two, a train set and a validation set, but this step is also hard for me to handle. The final result should be something like this:

dataset

DatasetDict({
    train: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 54159 - x
    })
    validation: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: x
    })
    test: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 21489
    })
})

Thank you in advance and pardon me for being a noob :)



Solution 1:[1]

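To get the validation dataset, you can do this:

train_dataset, validation_dataset = train_dataset.train_test_split(test_size=0.1).values()

train_test_split returns a DatasetDict with the keys "train" and "test". With test_size=0.1, 10% of the rows of the original train dataset go to the second split, which is used here as the validation dataset.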

To obtain the DatasetDict, you can do this:

import datasets
dd = datasets.DatasetDict({"train": train_dataset, "validation": validation_dataset, "test": test_dataset})

Solution 2:[2]

For future generations ;) Adding a bit more information to the answer above: you can also build each split directly from Python lists.

from datasets import Dataset, DatasetDict

# x_* and y_* are assumed to be equal-length lists of texts and labels
d = {'train': Dataset.from_dict({'label': y_train, 'text': x_train}),
     'val': Dataset.from_dict({'label': y_val, 'text': x_val}),
     'test': Dataset.from_dict({'label': y_test, 'text': x_test}),
     }

dataset = DatasetDict(d)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    Dharman
Solution 2    Sahar Millis