How do I properly make a tf Dataset for a recurrent neural network?

I have just preprocessed my categorical data into one-hot encodings and used tf.argmax(). This way I got it into a range of numbers from 1 to 34, so there are sequences like [7, 4, 28, 14, 5, 15, 22, 22].

My question is about how to prepare the dataset further. I intend to use 3 numbers in a sequence to predict the next 2. So, do I need to map the dataset so that the features are the first 3 and the labels the last 2? Should I batch it with a buffer_size of 5 to specify the sequence length? Lastly, do I keep the one-hot representation over its transformation into a simpler number?



Solution 1:[1]

No one answered in time, so I found the solution myself a few days ago.

Just a note that I'm now using 8 numbers to predict the next 2.

First, to make the 2-step prediction, I decided that I can predict the label for the 8 initial steps and then make another prediction using the last 7 time steps of the initial window plus the predicted value as the 8th input. Second, my teacher told me it was better for the RNN to use one-hot encoding, so I mapped the dataset as features of 8 one-hots and a label of 1 one-hot. Third, what impressed me is that I was in fact able to use batching as a way of grouping the split data sequence, thereby indicating that the last one-hot is my label.
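To make the first point concrete, here is a minimal sketch of that two-step rollout (not part of the original answer; predict_next_two is a hypothetical helper, assuming a trained Keras model that maps a (1, 8, 53) one-hot window to a (1, 53) softmax output):

import tensorflow as tf

def predict_next_two(model, window):
    # window: float32 tensor of shape (8, 53), the 8 one-hot input steps.
    # First prediction: the step right after the 8 known ones.
    probs_1 = model(window[tf.newaxis, ...])                    # (1, 53)
    step_1 = tf.one_hot(tf.argmax(probs_1, axis=-1), depth=53)  # (1, 53)

    # Second prediction: drop the oldest step and append the predicted one,
    # so the input is the last 7 real steps plus 1 predicted step.
    next_window = tf.concat([window[1:], step_1], axis=0)       # (8, 53)
    probs_2 = model(next_window[tf.newaxis, ...])
    step_2 = tf.one_hot(tf.argmax(probs_2, axis=-1), depth=53)
    return step_1, step_2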

So here is the code:

INPUT_WIDTH = 8   # time steps used as features
LABEL_WIDTH = 1   # time steps to predict
shift = 1         # offset between the end of the inputs and the end of the labels

INPUT_SLICE = slice(0, INPUT_WIDTH)

total_window_size = INPUT_WIDTH + shift
label_start = total_window_size - LABEL_WIDTH
LABELS_SLICE = slice(label_start, None)

# One "batch" here is a full window: 8 inputs + 1 label = 9 time steps.
BATCH_SIZE = INPUT_WIDTH + LABEL_WIDTH

Above are some constants that I got from [1]. The only one I couldn't fully understand is the shift variable, but set it to 1 and you're fine: with shift = 1 and LABEL_WIDTH = 1, the label is the time step immediately after the 8 inputs.
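A toy example (not in the original answer) of what those slices select, using a plain Python list in place of a tensor:

window = list(range(total_window_size))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(window[INPUT_SLICE])   # [0, 1, 2, 3, 4, 5, 6, 7] -> the 8 inputs
print(window[LABELS_SLICE])  # [8]                      -> the 1 label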

Below, the split function:

def split_window(features):
  inputs = features[INPUT_SLICE]
  # LABELS_SLICE keeps a length-1 time axis, giving shape (1, 53);
  # squeeze it away so each label matches the model's (53,) output.
  labels = tf.squeeze(features[LABELS_SLICE], axis=0)

  return inputs, labels

Simple and neat, isn't it? But to be clear, it was a pain to adapt this function from [1] and make it compatible with my input shape.
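As a quick sanity check (not part of the original answer), you can run the function on a dummy window:

dummy = tf.zeros([total_window_size, 53])  # one window of 9 one-hot steps
x, y = split_window(dummy)
print(x.shape, y.shape)  # (8, 53) (53,)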

Now the dataset:

def create_seq_dataset(data):
    ds = tf.data.Dataset.from_tensor_slices(data)

    # At this point, my input was a single array of 3644 string categories to be
    # turned into one-hot. The get_seq_label function just one-hots the data and
    # casts it to float32, as required by the RNN, so I won't cover it in detail.
    ds = ds.map(get_seq_label)
    # As I'm using the from_tensor_slices() function, the 3644 number disappears
    # from the shape. So, from shape (1), I've got shape (53) after one-hotting,
    # 53 being the number of possible categories that I'm working with.

    # Here I batch the one-hot data into full windows.
    ds = ds.batch(BATCH_SIZE, drop_remainder=True)
    # Shape (53) => (9, 53)
    # Without drop_remainder, the RNN will complain that it can't predict "empty" data.

    ds = ds.map(split_window)
    # Features shape (8, 53) and labels shape (53,)

    ds = ds.batch(16, drop_remainder=True)
    # Features shape (16, 8, 53) and labels shape (16, 53)
    return ds
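The answer doesn't show get_seq_label, so here is a hypothetical sketch of what such a mapping could look like; the StringLookup layer and the vocab list are assumptions, not the original code:

# Hypothetical sketch; the original get_seq_label is not shown in the answer.
# Assumes `vocab` is a Python list of the 53 category strings.
lookup = tf.keras.layers.StringLookup(vocabulary=vocab, num_oov_indices=0)

def get_seq_label(category):
    idx = lookup(category)  # string -> integer id in [0, 53)
    return tf.one_hot(idx, depth=53, dtype=tf.float32)  # -> (53,) float32 one-hot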

train_ds = create_seq_dataset(train_df)

for features_batch, labels_batch in train_ds:
  print(features_batch.shape)
  print(labels_batch.shape)
  break

What I got from the print was: (16, 8, 53) and (16, 53)

Lastly, the LSTM:

import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 53  # the number of possible categories

def create_rnn_model():
    # Reminder that the batch size is implicit when the input is a dataset,
    # unless the LSTM is stateful.
    inputs = layers.Input(name="sequence", shape=(INPUT_WIDTH, NUM_CLASSES), dtype='float32')

    # From [1]: Shape [batch, time, features] => [batch, time, lstm_units].
    # In my case [16, 8, 53] => [16, 8, 100], but as the 16 is the implicit batch
    # size, it would appear as None if you use rnn_model.summary() after
    # declaring the model.
    x = layers.LSTM(100, activation='tanh', stateful=False, return_sequences=False, name='LSTM_1')(inputs)
    # With return_sequences=False only the last time step is kept, so the shape
    # after this layer is [16, 100], or [None, 100].

    x = layers.Dense(32, activation='relu')(x)

    output = layers.Dense(NUM_CLASSES, activation='softmax')(x)

    model = Model(inputs=inputs, outputs=output)

    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

rnn_model = create_rnn_model()
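From there, training is just a matter of passing the dataset to fit(); a minimal sketch (the epoch count is an arbitrary example, not from the original answer):

rnn_model.summary()  # the batch dimension shows up as None

# Keras reads the (features, labels) pairs straight from the batched dataset.
history = rnn_model.fit(train_ds, epochs=20)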

[1]. https://www.tensorflow.org/tutorials/structured_data/time_series#4_create_tfdatadatasets

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Snykral fa Ashama