Efficient way to prepare time-series data with a sliding window (save & read, or build the data every iteration)

Time-series data has the following dimensions:

  • C: channel
  • L: length of the sequence (time steps)

In my case, L is over 200,000 and C is over 200.

In the dataloader, batches are produced as

    def __getitems__(self, idx):
        return data  # data shape = (B, C, W); B: batch size, C: channels, W: sliding-window size

The original data size is approximately

2*10^5 * 2*10^2 * 4 bytes (float32) ≈ 1.6*10^8 bytes ≈ 160 MB.

But I need a large sliding window (over 512), and the fully materialized rolling-window data would be about

2*10^5 * 2*10^2 * 2^9 * 4 bytes (float32) ≈ 8*10^10 bytes ≈ 80 GB.
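As a sanity check, here is a small back-of-the-envelope script for these sizes (the concrete values L = 200,000, C = 200, W = 512 are just the orders of magnitude mentioned above, not my exact data):

    # Rough memory footprint: raw series vs. fully materialized rolling windows.
    L, C, W = 200_000, 200, 512      # example values, see text above
    BYTES_PER_FLOAT32 = 4

    raw_bytes = L * C * BYTES_PER_FLOAT32                      # original (L, C) array
    windowed_bytes = (L - W + 1) * C * W * BYTES_PER_FLOAT32   # one copy per window start

    print(f"raw:      {raw_bytes / 1e6:.0f} MB")    # ~160 MB
    print(f"windowed: {windowed_bytes / 1e9:.0f} GB")  # ~82 GB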

There are several ways to prepare the data:

  1. Save and Load
  • Split the data into smaller chunks, build the rolling-window data for each chunk, and save the result to disk:

      # data: raw series of shape (d1, d2) = (L, C); self.ls: window length; self.save_batch: windows per file
      div = d1 * d2 * self.ls * 4 // (1024 * 1024 * 1024)         # number of ~1 GB chunks
      len_chunk = d1 // div // self.save_batch * self.save_batch  # chunk length, multiple of save_batch
      setL = 0
      for k in tqdm(range(div), desc=f"data gen... idx={i+1}/{len(self.set_info)}"):
          data_split = data[k*len_chunk:(k+1)*len_chunk + self.ls]   # overlap chunks by one window length
          div_data = torch.Tensor(to_roll_window(data_split, self.ls).astype(float))
          div_data = div_data[:div_data.shape[0] // self.save_batch * self.save_batch]
          L, self.C, self.T = div_data.shape
          for j, idx in enumerate(range(0, L, self.save_batch)):
              # save each save_batch-sized block of windows as its own file
              fname = f'{i}_{j + setL:04d}.npy'
              np.save(os.path.join(self.data_basepath, fname), div_data[idx:idx + self.save_batch])
          setL += L // self.save_batch   # keep file numbering unique across chunks
    
  • However, reading these files back during training requires a lot of disk I/O and is slow (see the memory-mapped loading sketch after this list).

  2. Build the window on the fly in every __getitems__() call of the dataset

    def __getitems__(self, idx):
        # self.data has shape (L, C); self.data_transposed is the transposed copy, shape (C, L)
        # self.W is the sliding-window size
        d = torch.empty(len(idx), self.data_transposed.shape[0], self.W)   # (B, C, W)
        for j, i in enumerate(idx):
            d[j] = self.data_transposed[:, i:i + self.W]
        return d
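For approach 1, one option to reduce the reading time is to memory-map the saved .npy files with np.load(..., mmap_mode='r'), so that only the windows actually requested are read from disk. This is just a minimal sketch, assuming each saved file holds exactly save_batch windows of shape (save_batch, C, W) as produced by the saving loop above; the names SavedWindowDataset and chunk_dir are made up for illustration:

    import os
    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class SavedWindowDataset(Dataset):
        """Lazily read pre-built rolling windows from the chunk files saved above."""

        def __init__(self, chunk_dir, save_batch):
            self.save_batch = save_batch
            files = sorted(os.listdir(chunk_dir))
            # Memory-map every file once; no data is read until it is indexed.
            self.chunks = [np.load(os.path.join(chunk_dir, f), mmap_mode='r') for f in files]
            self.length = sum(c.shape[0] for c in self.chunks)

        def __len__(self):
            return self.length

        def __getitem__(self, idx):
            # Every file holds exactly `save_batch` windows, so locating one is a divmod.
            chunk_idx, offset = divmod(idx, self.save_batch)
            window = np.array(self.chunks[chunk_idx][offset])  # copy one (C, W) window from disk
            return torch.from_numpy(window)

This keeps the tens of gigabytes of windowed data on disk and only pays for the windows each batch actually touches.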

I think the second approach is more efficient, but copying every window out of the big tensor in __getitems__ still takes a noticeable amount of time per batch...
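For reference, here is a minimal self-contained version of approach 2 under stated assumptions: the names (WindowDataset, window) are illustrative and not my original code, the whole (C, L) series is assumed to fit in RAM, and a plain __getitem__ returning a single (C, W) view is used so that the standard DataLoader collation builds the (B, C, W) batch:

    import torch
    from torch.utils.data import Dataset, DataLoader

    class WindowDataset(Dataset):
        """Keep only the raw (C, L) series in memory and slice windows out on demand."""

        def __init__(self, data, window):
            # data: raw series of shape (L, C); store it transposed once as (C, L)
            self.data_t = torch.as_tensor(data, dtype=torch.float32).T.contiguous()
            self.window = window

        def __len__(self):
            # number of valid window start positions
            return self.data_t.shape[1] - self.window + 1

        def __getitem__(self, i):
            # a cheap view into the big tensor; the copy happens only when the
            # default collate function stacks the batch into (B, C, W)
            return self.data_t[:, i:i + self.window]

    # usage sketch with toy data of the sizes discussed above
    # raw = torch.randn(200_000, 200)              # (L, C)
    # loader = DataLoader(WindowDataset(raw, window=512), batch_size=64, shuffle=True)
    # batch = next(iter(loader))                   # batch.shape == (64, 200, 512)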

Is there any fancy way to solve this problem?


