How am I incorrectly using SubsetRandomSampler?
I have a custom defined dataset: rcvdataset = rcvLSTMDataSet('foo.csv', 'foolabels.csv')
I also define the following:
```
batch_size = 50
validation_split = .2
shuffle_rcvdataset = True
random_seed = 42
```
```
rcvdataset_size = len(rcvdataset)
indices = list(range(rcvdataset_size))
split = int(np.floor(validation_split * rcvdataset_size))
if shuffle_rcvdataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

train_sampler = SubsetRandomSampler(train_indices)
test_sampler = SubsetRandomSampler(val_indices)

train_loader = torch.utils.data.DataLoader(rcvdataset, batch_size=batch_size,
                                           sampler=train_sampler)
test_loader = torch.utils.data.DataLoader(rcvdataset, batch_size=batch_size,
                                          sampler=test_sampler)
```
using this training call:
```
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
```
But when I try to run it, I get:
```
Epoch 1
-------------------------------
Traceback (most recent call last):
  File "lstmTrainer.py", line 94, in <module>
    train(train_sampler, model, loss_fn, optimizer)
  File "lstmTrainer.py", line 58, in train
    size = len(dataloader.dataset)
AttributeError: 'SubsetRandomSampler' object has no attribute 'dataset'
```
If I instead load the dataset indirectly with:
```
train(train_loader, model, loss_fn, optimizer)
```
it tells me:
```
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'pandas.core.series.Series'>
```
I'm not clear on what the first error means at all. Is the second error telling me that somewhere in my dataset, something isn't a tensor?
Thank you.
As requested, here is rcvDataSet.py:
```
from __future__ import print_function, division
import os
import torch
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, DataLoader


class rcvLSTMDataSet(Dataset):
    """rcv dataset."""

    TIMESTEPS = 10

    def __init__(self, csv_data_file, annotations_file):
        """
        Args:
            csv_data_file (string): Path to the csv file with the training data
            annotations_file (string): Path to the file with the annotations
        """
        self.csv_data_file = csv_data_file
        self.annotations_file = annotations_file
        self.labels = pd.read_csv(annotations_file)
        self.data = pd.read_csv(csv_data_file)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        """
        pytorch expects whatever data is returned is in the form of a tensor.
        Included, it expects the label for the data. Together, they make a tuple.
        """
        # convert every ten indexes and label into one observation
        Observation = []
        counter = 0
        start_pos = self.TIMESTEPS * idx
        avg_1 = 0
        avg_2 = 0
        avg_3 = 0
        while counter < self.TIMESTEPS:
            Observation.append(self.data.iloc[idx + counter])
            avg_1 += self.labels.iloc[idx + counter][2]
            avg_2 += self.labels.iloc[idx + counter][1]
            avg_3 += self.labels.iloc[idx + counter][0]
            counter += 1
        avg_1 = avg_1 / self.TIMESTEPS
        avg_2 = avg_2 / self.TIMESTEPS
        avg_3 = avg_3 / self.TIMESTEPS
        current_labels = [avg_1, avg_2, avg_3]
        print(current_labels)
        return Observation, current_labels


def main():
    loader = rcvLSTMDataSet('foo1.csv', 'foo2.csv')
    j = 0
    while j < len(loader.data % loader.TIMESTEPS):
        print(loader.__getitem__(j))
        j += 1


if "__main__" == __name__:
    main()
```
Solution 1

Cause: if you look at the error message, you will find that you call the train function like this:
```
train(train_sampler, model, loss_fn, optimizer)
```
This is not correct: you should call train() with train_loader, not with train_sampler.
Solution: correct the call to:
```
train(train_loader, model, loss_fn, optimizer)
```
Error message:
```
Epoch 1
-------------------------------
Traceback (most recent call last):
  File "lstmTrainer.py", line 94, in <module>
    train(train_sampler, model, loss_fn, optimizer)   <------ look here
  File "lstmTrainer.py", line 58, in train
    size = len(dataloader.dataset)
AttributeError: 'SubsetRandomSampler' object has no attribute 'dataset'
```
Second Error message:
If you look at your Dataset class rcvLSTMDataSet, you will find that the Observation list appends items of type pandas.core.series.Series, not plain Python scalars, because you read whole rows from your csv file. You should use .iloc[...].values instead of .iloc[...]. This ensures your list contains int or float values that can be converted to a tensor smoothly, without errors.
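The difference is easy to see in a minimal sketch. The DataFrame below is a throwaway stand-in for the asker's CSV data, not their actual file:

```python
import numpy as np
import pandas as pd

# Throwaway frame standing in for the CSV data
df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["a", "b"])

row_series = df.iloc[0]         # pandas Series: default_collate rejects this
row_values = df.iloc[0].values  # numpy ndarray: collates into a tensor fine

print(type(row_series).__name__)  # Series
print(type(row_values).__name__)  # ndarray
```

Only the `.values` form satisfies default_collate's "tensors, numpy arrays, numbers, dicts or lists" requirement.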
Final remark:
You can read about DataLoader and Samplers in the PyTorch documentation; I summarized some points here:
Samplers are used to specify the sequence of indices/keys used in data loading.
Data loader combines a dataset and a sampler, and provides an iterable over the given dataset.
PyTorch provides two data primitives, torch.utils.data.DataLoader and torch.utils.data.Dataset, that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
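The sampler/loader relationship can be sketched end to end with a tiny synthetic TensorDataset standing in for rcvdataset (the shapes and index split here are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler

# Tiny synthetic dataset standing in for rcvdataset: 10 samples, 4 features each
ds = TensorDataset(torch.arange(40.0).reshape(10, 4), torch.zeros(10))

# The sampler only yields indices; the DataLoader uses them to fetch samples
train_sampler = SubsetRandomSampler(list(range(8)))
train_loader = DataLoader(ds, batch_size=4, sampler=train_sampler)

for X, y in train_loader:
    print(X.shape)  # torch.Size([4, 4]) for each of the two batches
```

This is why train() must receive the loader: the sampler alone has no dataset attached to it.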
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
