TensorFlow Text Generator
I am trying to train a TF model on a large amount of text, and unfortunately if I use model.fit() directly I run out of RAM very quickly. Could someone help me make it use less RAM (e.g. by using a generator instead)? Below is my code.
from numpy import array
from pickle import dump
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
# load doc into memory
def load_doc(filename):
    # open the file read-only; the with block closes it automatically
    with open(filename, 'r') as file:
        text = file.read()
    return text
# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')
# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)
# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)
# separate into input and output
sequences = array(sequences)
print(1)
X, y = sequences[:,:-1], sequences[:,-1]
print(2)
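# the RAM problem described below happens here: every sequence is one-hot encoded at once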
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
print(3)
X = array(sequences)
print(4)
y = to_categorical(y, num_classes=vocab_size)
print(5)
# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
print(6)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
epochs = int(input('Num of epochs:'))
model.fit(X, y, epochs=epochs, verbose=2)
# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))
char_sequences.txt is a text file that looks like this:
that could
hat could b
at could be
t could be
could be a
could be an
ould be any
uld be anyo
ld be anyon
d be anyone
be anyone
be anyone W
e anyone Wh
anyone Wha
anyone What
nyone What
yone What r
one What r
ne What r u
e What r u
What r u d
What r u do
hat r u doi
at r u doin
t r u doing
r u doing
r u doing N
u doing Na
u doing Nah
doing Nahh
It is at debug point 3 that it eats RAM (>16GB): as far as I can tell, the list comprehension there one-hot encodes every sequence at once, materializing the full (N, 10, vocab_size) float32 array. I cannot get past this point because my computer runs out of RAM and the kernel kills the process. (A sketch of the batch-wise generator I have in mind is below, after the system details.) char_sequences.txt is generated from a text file containing one text message per line:
NONONO
He's gonna wake up with this
Cuz that could be anyone
What r u doing
Nahh we need to see the name of the person
That's it
and is approximately 20K lines long. I am happy to provide more information as needed!
OS: Raspberry Pi OS Buster
PC: Raspberry Pi 4B 4G
RAM/Swap: 4GB/13GB of zram
TensorFlow version: 2.4.0
Python version: 3.7
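For reference, this is the kind of thing I had in mind: a keras Sequence that one-hot encodes one batch at a time, so the full (N, 10, vocab_size) array never exists in memory. This is only a sketch I put together from the docs (the class name OneHotBatches and batch_size=128 are my own placeholders), so I am not sure it is the right approach:

import numpy as np
import tensorflow as tf

class OneHotBatches(tf.keras.utils.Sequence):
    """Serves one-hot-encoded batches on demand instead of
    materializing the whole (N, 10, vocab_size) tensor up front."""

    def __init__(self, X, y, vocab_size, batch_size=128):
        self.X = X                      # integer-encoded inputs, shape (N, 10)
        self.y = y                      # integer-encoded targets, shape (N,)
        self.vocab_size = vocab_size
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.X) / self.batch_size))

    def __getitem__(self, idx):
        # one-hot encode only this batch
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        Xb = tf.keras.utils.to_categorical(self.X[lo:hi], num_classes=self.vocab_size)
        yb = tf.keras.utils.to_categorical(self.y[lo:hi], num_classes=self.vocab_size)
        return Xb, yb

# usage: drop the two full-array to_categorical calls above and call
# model.fit(OneHotBatches(X, y, vocab_size), epochs=epochs, verbose=2)

If this is on the right track, X would stay integer-encoded, so the LSTM's input_shape would become (X.shape[1], vocab_size), and peak memory should be bounded by one batch of one-hot vectors rather than all N sequences. Corrections welcome.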
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
