How am I incorrectly using SubsetRandomSampler?
I have a custom defined dataset: rcvdataset = rcvLSTMDataSet('foo.csv', 'foolabels.csv')
I also define the following:
```
batch_size = 50
validation_split = .2
shuffle_rcvdataset = True
random_seed = 42
```
```
rcvdataset_size = len(rcvdataset)
indices = list(range(rcvdataset_size))
split = int(np.floor(validation_split * rcvdataset_size))
if shuffle_rcvdataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

train_sampler = SubsetRandomSampler(train_indices)
test_sampler = SubsetRandomSampler(val_indices)

train_loader = torch.utils.data.DataLoader(rcvdataset, batch_size=batch_size,
                                           sampler=train_sampler)
test_loader = torch.utils.data.DataLoader(rcvdataset, batch_size=batch_size,
                                          sampler=test_sampler)
```
using this training call:
```
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
```
But when I try to run it, I get:
```
Epoch 1
-------------------------------
Traceback (most recent call last):
  File "lstmTrainer.py", line 94, in <module>
    train(train_sampler, model, loss_fn, optimizer)
  File "lstmTrainer.py", line 58, in train
    size = len(dataloader.dataset)
AttributeError: 'SubsetRandomSampler' object has no attribute 'dataset'
```
If I instead load the dataset indirectly with:
```
train(train_loader, model, loss_fn, optimizer)
```
it tells me:
```
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'pandas.core.series.Series'>
```
I'm not clear on what the first error means at all. Is the second error telling me that somewhere in my dataset, something isn't a tensor?
Thank you.
As requested, here is rcvDataSet.py:
```
from __future__ import print_function, division
import os
import torch
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, DataLoader


class rcvLSTMDataSet(Dataset):
    """rcv dataset."""

    TIMESTEPS = 10

    def __init__(self, csv_data_file, annotations_file):
        """
        Args:
            csv_data_file (string): Path to the csv file with the training data
            annotations_file (string): Path to the file with the annotations
        """
        self.csv_data_file = csv_data_file
        self.annotations_file = annotations_file
        self.labels = pd.read_csv(annotations_file)
        self.data = pd.read_csv(csv_data_file)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        """
        pytorch expects whatever data is returned is in the form of a tensor.
        Included, it expects the label for the data. Together, they make a tuple.
        """
        # convert every ten indexes and label into one observation
        Observation = []
        counter = 0
        start_pos = self.TIMESTEPS * idx
        avg_1 = 0
        avg_2 = 0
        avg_3 = 0
        while counter < self.TIMESTEPS:
            Observation.append(self.data.iloc[idx + counter])
            avg_1 += self.labels.iloc[idx + counter][2]
            avg_2 += self.labels.iloc[idx + counter][1]
            avg_3 += self.labels.iloc[idx + counter][0]
            counter += 1
        avg_1 = avg_1 / self.TIMESTEPS
        avg_2 = avg_2 / self.TIMESTEPS
        avg_3 = avg_3 / self.TIMESTEPS
        current_labels = [avg_1, avg_2, avg_3]
        print(current_labels)
        return Observation, current_labels


def main():
    loader = rcvLSTMDataSet('foo1.csv', 'foo2.csv')
    j = 0
    while j < len(loader.data % loader.TIMESTEPS):
        print(loader.__getitem__(j))
        j += 1


if "__main__" == __name__:
    main()
```
Solution 1

Cause: if you look at the error message, you will find that you call the train function like this:
```
train(train_sampler, model, loss_fn, optimizer)
```
This is not correct: you should call train() with train_loader, not with train_sampler.
Solution: correct the call to:
```
train(train_loader, model, loss_fn, optimizer)
```
Error message:
```
Epoch 1
-------------------------------
Traceback (most recent call last):
  File "lstmTrainer.py", line 94, in <module>
    train(train_sampler, model, loss_fn, optimizer)   <------ look here
  File "lstmTrainer.py", line 58, in train
    size = len(dataloader.dataset)
AttributeError: 'SubsetRandomSampler' object has no attribute 'dataset'
```
Second Error message:
If you look at your Dataset class rcvLSTMDataSet, you will find that the Observation list appends items of type pandas.core.series.Series, not plain Python scalars, because you read whole rows from your csv file. You should use .iloc[...].values instead of .iloc[...]. This ensures your list contains int or float values that can be converted to a tensor smoothly, without errors.
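The difference is easy to see in a minimal sketch. The DataFrame below is a throwaway stand-in for the asker's CSV data, not their actual file:

```python
import numpy as np
import pandas as pd

# Throwaway frame standing in for the CSV data
df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["a", "b"])

row_series = df.iloc[0]         # pandas Series: default_collate rejects this
row_values = df.iloc[0].values  # numpy ndarray: collates into a tensor fine

print(type(row_series).__name__)  # Series
print(type(row_values).__name__)  # ndarray
```

Only the `.values` form satisfies default_collate's "tensors, numpy arrays, numbers, dicts or lists" requirement.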
Final remark:
You can read about DataLoader and Samplers in the PyTorch documentation; I summarized some points here:
Samplers are used to specify the sequence of indices/keys used in data loading.
Data loader combines a dataset and a sampler, and provides an iterable over the given dataset.
PyTorch provides two data primitives, torch.utils.data.DataLoader and torch.utils.data.Dataset, that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
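The sampler/loader relationship can be sketched end to end with a tiny synthetic TensorDataset standing in for rcvdataset (the shapes and index split here are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler

# Tiny synthetic dataset standing in for rcvdataset: 10 samples, 4 features each
ds = TensorDataset(torch.arange(40.0).reshape(10, 4), torch.zeros(10))

# The sampler only yields indices; the DataLoader uses them to fetch samples
train_sampler = SubsetRandomSampler(list(range(8)))
train_loader = DataLoader(ds, batch_size=4, sampler=train_sampler)

for X, y in train_loader:
    print(X.shape)  # torch.Size([4, 4]) for each of the two batches
```

This is why train() must receive the loader: the sampler alone has no dataset attached to it.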
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
