Efficient way to prepare time series with a sliding window (save & read, or make the data at every iteration)
The time-series data has the dimensions below:
C: channel
L: length of time (sequence)
In my case, L is over 200,000 and C is over 200.
In the DataLoader, the data are returned as

def __getitems__(self, idx):
    return data  # data = (B, C, W); B: batch size, W: sliding-window size
The original data size is approximately 2*10^5 * 200 * 4 bytes (float32) ≈ 160 MB. But I need a large sliding window, such as 512 or more, so the data size with the rolling window becomes roughly 2*10^5 * 200 * 512 * 4 bytes (float32) ≈ 80 GB.
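For concreteness, a quick back-of-the-envelope check of those sizes (the concrete values of L, C, and W below are assumptions based on the numbers above):

import numpy as np

L, C, W = 200_000, 200, 512                     # sequence length, channels, window size (assumed)
itemsize = np.dtype(np.float32).itemsize        # 4 bytes per element

raw_bytes = L * C * itemsize                    # original (L, C) array
rolled_bytes = L * C * W * itemsize             # fully materialized rolling windows (~L windows)

print(f"raw:    {raw_bytes / 1e6:.0f} MB")      # ~160 MB
print(f"rolled: {rolled_bytes / 1e9:.0f} GB")   # ~80 GB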
There are several ways to prepare the data:
- Save and Load
Split the data into small chunks, build the rolling-window data for each chunk, and save it to storage:
div = d1 * d2 * self.ls * 3 // (1024 * 1024 * 1024)  # 1GB chunks
len_chunck = d1 // div // self.save_batch * self.save_batch
setL = 0
for k in tqdm(range(div), desc=f"data gen... idx={i+1}/{len(self.set_info)}"):
    data_split = data[k * len_chunck:(k + 1) * len_chunck + self.ls]
    div_data = torch.Tensor(to_roll_window(data_split, self.ls).astype(float))
    div_data = div_data[:div_data.shape[0] // self.save_batch * self.save_batch]
    L, self.C, self.T = div_data.shape
    for j, idx in enumerate(range(0, L, self.save_batch)):
        # save it into the storage
        fname = f'{i}_{j+setL:04d}.npy'
        np.save(os.path.join(self.data_basepath, fname), div_data[idx:idx + self.save_batch])
However, this needs a lot of reading (disk I/O) time.
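For reference, the read-back half of this approach might look roughly like the sketch below. The chunked-file layout and save_batch follow the saving loop above, while the class name, the index scheme, and the mmap_mode='r' flag are my own assumptions (the memory map lets np.load avoid reading the whole file eagerly):

import os
import numpy as np
import torch
from torch.utils.data import Dataset

class PrerolledChunkDataset(Dataset):
    """Reads back the pre-rolled .npy chunks written by the saving loop above (illustrative)."""

    def __init__(self, data_basepath, set_idx, n_files):
        self.data_basepath = data_basepath
        self.set_idx = set_idx        # the `i` used when the files were written
        self.n_files = n_files        # how many chunk files exist for this set

    def __len__(self):
        return self.n_files

    def __getitem__(self, idx):
        # each file holds `save_batch` windows, i.e. an array of shape (save_batch, C, W)
        fname = f'{self.set_idx}_{idx:04d}.npy'
        arr = np.load(os.path.join(self.data_basepath, fname), mmap_mode='r')
        return torch.from_numpy(np.array(arr))    # copy out of the read-only memory map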
- Make the data at every __getitems__() call in the DataLoader
def __getitems__(self, idx):
    # self.data has shape (L, C); self.data_transposed is the transposed copy with shape (C, L)
    d = torch.Tensor(np.empty([B, C, W]))
    for j, i in enumerate(idx):
        d[j] = self.data_transposed[:, i:i+W]
    return d
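Put together, the on-the-fly version could look like the minimal sketch below. The class and variable names are illustrative; the raw array stays in memory as (C, L) and only W columns are sliced out per index, so nothing larger than one batch is ever materialized. Passing a BatchSampler with batch_size=None is one common way to make the DataLoader hand a whole list of start indices to the dataset at once, giving the (B, C, W) batches described above:

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, BatchSampler, SequentialSampler

class SlidingWindowDataset(Dataset):
    """Builds sliding windows on the fly from an in-memory (L, C) array (illustrative)."""

    def __init__(self, data, window):
        # keep a contiguous (C, L) copy so every window is a cheap column slice
        self.data_transposed = torch.from_numpy(np.ascontiguousarray(data.T))
        self.window = window

    def __len__(self):
        return self.data_transposed.shape[1] - self.window + 1

    def __getitem__(self, idx):
        # idx is a list of window start positions because a BatchSampler is used below
        return torch.stack([self.data_transposed[:, i:i + self.window] for i in idx])

# usage sketch: batch_size=None disables automatic batching, so the sampler's
# index lists are passed straight to __getitem__
data = np.random.randn(200_000, 200).astype(np.float32)   # stand-in for the real (L, C) data
ds = SlidingWindowDataset(data, window=512)
loader = DataLoader(ds, batch_size=None,
                    sampler=BatchSampler(SequentialSampler(ds), batch_size=32, drop_last=False))
batch = next(iter(loader))    # shape: (32, 200, 512)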
I think the second way is more efficient, but it still consumes a lot of time...
Is there any fancy way to solve this problem?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow