TensorFlow Text Generator

I am trying to train a TensorFlow model on a large body of text, and unfortunately if I call model.fit() directly I run out of RAM very quickly. Could someone help me make it use less RAM (e.g. by using a generator instead)? My code is below.

from numpy import array
from pickle import dump
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM

# load doc into memory
def load_doc(filename):
    # open the file read-only and return all of its text
    with open(filename, 'r') as file:
        return file.read()

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

# separate into input and output
sequences = array(sequences)
print(1)
X, y = sequences[:,:-1], sequences[:,-1]
print(2)
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
print(3)
X = array(sequences)
print(4)
y = to_categorical(y, num_classes=vocab_size)
print(5)
# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
print(6)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model

epochs = int(input('Num of epochs:'))

model.fit(X, y, epochs=epochs, verbose=2)

# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))

char_sequences.txt is a text file that looks like this:

that could 
hat could b
at could be
t could be 
 could be a
could be an
ould be any
uld be anyo
ld be anyon
d be anyone
 be anyone 
be anyone W
e anyone Wh
 anyone Wha
anyone What
nyone What 
yone What r
one What r 
ne What r u
e What r u 
 What r u d
What r u do
hat r u doi
at r u doin
t r u doing
 r u doing 
r u doing N
 u doing Na
u doing Nah
 doing Nahh

It is at debug point 3 that it eats RAM (>16 GB): my computer runs out of memory and the kernel kills the process, so I cannot get past that point. char_sequences.txt is generated from a text file containing one text message per line:

NONONO
He's gonna wake up with this
Cuz that could be anyone
What r u doing
Nahh we need to see the name of the person
That's it

and is approximately 20K lines long. I am happy to provide more information as needed!
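Here is the kind of generator I was imagining (an untested sketch on my part; `batch_generator` and the batch size of 128 are my own names/choices). The idea is to one-hot encode only one batch at a time instead of the whole dataset, which is what debug point 3 currently does:

```python
import numpy as np

def batch_generator(sequences, vocab_size, batch_size=128):
    """Yield (X, y) batches forever, one-hot encoding only the
    current batch instead of the entire dataset at once.
    `sequences` is the list of integer-encoded lines built above."""
    seqs = np.asarray(sequences)
    eye = np.eye(vocab_size, dtype=np.float32)  # one-hot lookup table
    while True:  # Keras expects a generator to loop indefinitely
        for start in range(0, len(seqs), batch_size):
            batch = seqs[start:start + batch_size]
            X_int, y_int = batch[:, :-1], batch[:, -1]
            # fancy indexing into `eye` one-hot encodes just this batch
            yield eye[X_int], eye[y_int]
```

As far as I understand, in TF 2.4 model.fit() accepts a Python generator directly, so the call would look something like model.fit(batch_generator(sequences, vocab_size), steps_per_epoch=math.ceil(len(sequences) / 128), epochs=epochs, verbose=2) — but I have not managed to run this end to end.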

OS: Raspberry Pi OS Buster
PC: Raspberry Pi 4B 4G
RAM/Swap: 4 GB / 13 GB of zram
TensorFlow version: 2.4.0
Python version: 3.7
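I have also wondered whether skipping the one-hot step entirely would help — keeping X and y as small integer arrays, feeding X through an Embedding layer, and compiling with sparse_categorical_crossentropy. This is an assumption on my part, not something I have tried:

```python
import numpy as np

def prepare_sparse(sequences):
    """Split integer-encoded sequences into (X, y) with no one-hot
    step: X has shape (n, seq_len - 1), y has shape (n,), both as
    compact integer arrays."""
    seqs = np.asarray(sequences, dtype=np.int32)
    return seqs[:, :-1], seqs[:, -1]

# Corresponding model-side changes (sketch; the embedding size 16
# is an arbitrary choice of mine):
#   model.add(Embedding(vocab_size, 16))  # replaces the one-hot input
#   model.add(LSTM(75))
#   model.add(Dense(vocab_size, activation='softmax'))
#   model.compile(loss='sparse_categorical_crossentropy',
#                 optimizer='adam', metrics=['accuracy'])
```

If that is a viable route, it would avoid materialising any float one-hot arrays at all.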



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
