Manipulating PySpark dataframes for an LSTM model

I am trying to train an LSTM neural network for text prediction. I have a dataframe with 3.5 million chess games stored as strings.

I have parsed and tokenized the games and padded them to the same length. To do this I used a UDF, so that each game is now a list of integers identifying the moves made during it.
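For illustration, a minimal sketch of what such a UDF can look like; the dataframe, column names, vocabulary, and maximum length below are placeholder assumptions, not details from the post:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()

MAX_LEN = 8                                 # fixed game length after padding/truncation
PAD_ID = 0                                  # id reserved for padding
move_to_id = {"e4": 1, "e5": 2, "Nf3": 3}   # toy vocabulary; built from the data in practice

@F.udf(returnType=ArrayType(IntegerType()))
def encode_game(game):
    """Map a space-separated move string to a fixed-length list of ints."""
    ids = [move_to_id.get(m, PAD_ID) for m in game.split()][:MAX_LEN]
    return ids + [PAD_ID] * (MAX_LEN - len(ids))   # right-pad short games

games_df = spark.createDataFrame([("e4 e5 Nf3",)], ["moves"])
games_df = games_df.withColumn("move_ids", encode_game("moves"))
games_df.show(truncate=False)
```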

Once this is done, my model accepts as input a 3D tensor with shape [batch, timesteps, feature] (I use the Keras LSTM: https://keras.io/api/layers/recurrent_layers/lstm/).
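One common way to get from integer move ids to that 3D tensor is an Embedding layer that supplies the feature dimension; the sketch below assumes that approach, and the vocabulary size and layer widths are made-up values:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000   # distinct moves + padding id (assumed)
MAX_LEN = 200       # timesteps: the fixed game length (assumed)

model = models.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),                  # batch of move-id sequences
    layers.Embedding(VOCAB_SIZE, 64, mask_zero=True),  # -> [batch, timesteps, feature]
    layers.LSTM(128),
    layers.Dense(VOCAB_SIZE, activation="softmax"),    # probability of the next move
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```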

To build this tensor I thought of converting my PySpark dataframe to pandas and using NumPy, but I can't do that: I'm developing on Databricks Community Edition and the conversion always runs out of memory.

Can someone tell me how to solve this problem? Since I can't convert the dataframe to pandas because of the OOM errors, can anyone suggest another way?



Solution 1:[1]

Don't store the games as strings - store them as integers or, better, binary-encoded values. This would reduce your memory footprint considerably, with the added benefit of being more amenable to computation. Using a relational schema, you would have games -> moves, where games stores demographics (player names, results) and moves is just a list of numerically encoded exchanges.
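A rough sketch of that relational layout in PySpark; the table and column names are illustrative, not prescribed by the answer:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# games: one row per game, holding only the demographics
games = spark.createDataFrame(
    [(1, "white_player", "black_player", "1-0")],
    ["game_id", "white", "black", "result"],
)

# moves: one row per (game, ply), with the move as a small integer
moves = spark.createDataFrame(
    [(1, 0, 132), (1, 1, 845)],
    ["game_id", "ply", "move_id"],
)

# reassemble one game's move sequence when it is needed
seq = (moves.filter(moves.game_id == 1)
            .orderBy("ply")
            .select("move_id"))
seq.show()
```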

You can still process the values as strings/words/whatever, because the computer doesn't need human-readable text to produce a model. Of course, if it's a generative model you'll have to perform a reverse translation on the output side, but that would be trivial.
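That reverse translation amounts to inverting the vocabulary; a toy sketch, with an assumed mapping:

```python
# invert the (assumed) move vocabulary to decode model output
move_to_id = {"e4": 1, "e5": 2, "Nf3": 3}
id_to_move = {i: m for m, i in move_to_id.items()}

predicted_ids = [1, 2, 3]                              # e.g. sampled from the model
print(" ".join(id_to_move[i] for i in predicted_ids))  # -> "e4 e5 Nf3"
```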

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: StephenDonaldHuffPhD