Dividing a large file into smaller ones for training
I have a very large file that I want to divide into smaller ones for training. I've read about pickle files, so I split the large file into a training part and a validation part. Then I divided the training file (about 1,000,000 datapoints) into ~60 pickle files, and the testing file into 5 pickle files.
Now I am confused about how to train on these 60 files and then test with all 5 testing files.
My strategy was to run the network on the first training pickle file together with the first testing pickle file, then save the weights after that first training part finishes. Then load the weights, work through all the training pickle files with the first testing file, move on to the next testing pickle file, and repeat the same training process, saving the weights after each run, etc.
I don't think this strategy is correct, because I would be training and testing on the same data again, which will produce wrong, unstable results.
Some recommend a different approach: train on the first training file for one epoch without validation, then move on to the second training file in the second epoch, using the weights that were saved from the first one, and so on until all the training files have been used. Then do the validation later on the trained network. I don't think I know how to do this.
Note: I use a Siamese network here, trained on images to find similar pairs.
Here is my code so far.
First, the pickle-file part. I change the file number manually when training on the current file is done.
import pickle
import numpy as np

# train data
pickle_in = open(path + "TrainPairs1.pickle", "rb")
trainPixel = pickle.load(pickle_in)
trainPixel = np.array(trainPixel)
print(trainPixel.shape)
tr_pairs = trainPixel.reshape(40000, 2, 71, 71, 1)

# train labels
pickle_in = open(path + "TrainLabels1.pickle", "rb")
tr_y = pickle.load(pickle_in)
tr_y = np.array(tr_y)

# test data
pickle_in = open(path + "TestPairs1.pickle", "rb")
testPixel = pickle.load(pickle_in)
testPixel = np.array(testPixel)
print(testPixel.shape)
te_pairs = testPixel.reshape(40000, 2, 71, 71, 1)
print(te_pairs.shape)

# test labels
pickle_in = open(path + "TestLabels1.pickle", "rb")
te_y = pickle.load(pickle_in)
te_y = np.array(te_y)
print(te_y.shape)
Build the network:
# assuming tf.keras; adjust the imports if you use standalone Keras
from tensorflow.keras.layers import Input, Lambda
from tensorflow.keras.models import Model

# network definition (create_base_network, euclidean_distance and
# eucl_dist_output_shape are defined elsewhere in my code)
base_network = create_base_network(input_shape)

input_a = Input(shape=input_shape)
input_b = Input(shape=input_shape)

# the two inputs go through the same base network, so the weights are shared
processed_a = base_network(input_a)
processed_b = base_network(input_b)

distance = Lambda(euclidean_distance,
                  output_shape=eucl_dist_output_shape)([processed_a, processed_b])
model = Model([input_a, input_b], distance)
Here I load the saved weights (kept commented out for the first run):
#model.load_weights("/SavedWeights/weights.ckpt")
Train the network:
from tensorflow.keras.optimizers import RMSprop  # assuming tf.keras

rms = RMSprop()
# note: rms is defined here but not actually used; the model is compiled with 'adam'
model.compile(loss=contrastive_loss, optimizer='adam', metrics=[accuracy])

history = model.fit([tr_pairs[:, 0], tr_pairs[:, 1]], tr_y,
                    batch_size=128,
                    epochs=epochs,
                    validation_data=([te_pairs[:, 0], te_pairs[:, 1]], te_y))
# predictions on the training pairs
y_pred_tr = model.predict([tr_pairs[:, 0], tr_pairs[:, 1]])

# predictions on the validation/test pairs
y_pred_te = model.predict([te_pairs[:, 0], te_pairs[:, 1]])

# save the base network and the model weights
# (note: this checkpoint path differs from the one used in load_weights above)
base_network.save("/my_model")
model.save_weights("/Saved_Weights/weights.ckpt")
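To continue from these saved weights in a later run, my rough plan is to rebuild the same model first and then load the checkpoint, something like the sketch below (it assumes create_base_network, euclidean_distance, eucl_dist_output_shape, contrastive_loss and accuracy are defined as before):

# rebuild the exact same architecture before loading the weights
base_network = create_base_network(input_shape)
input_a = Input(shape=input_shape)
input_b = Input(shape=input_shape)
distance = Lambda(euclidean_distance,
                  output_shape=eucl_dist_output_shape)(
    [base_network(input_a), base_network(input_b)])
model = Model([input_a, input_b], distance)

# load the checkpoint saved above, compile again, then continue training
model.load_weights("/Saved_Weights/weights.ckpt")
model.compile(loss=contrastive_loss, optimizer='adam', metrics=[accuracy])
# then call model.fit with the next pickle file, e.g. TrainPairs2.pickle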
Please help me run these files correctly with my network.
Solution 1:[1]
You should look into TensorFlow's tf.data API, specifically the TFRecord format. That might be easier to use than separate pickle files, since TFRecords are directly integrated into TensorFlow's data loaders.
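As a rough illustration of that route (not part of the original answer), one way to write each chunk of pairs to a TFRecord file and stream them back with tf.data could look like the sketch below; the file names, the (2, 71, 71, 1) pair shape and the helper names are assumptions chosen to match the question.

import numpy as np
import tensorflow as tf

def write_pairs_to_tfrecord(pairs, labels, filename):
    # pairs: (N, 2, 71, 71, 1) float array, labels: (N,)
    with tf.io.TFRecordWriter(filename) as writer:
        for pair, label in zip(pairs, labels):
            feature = {
                "pair": tf.train.Feature(bytes_list=tf.train.BytesList(
                    value=[pair.astype(np.float32).tobytes()])),
                "label": tf.train.Feature(float_list=tf.train.FloatList(
                    value=[float(label)])),
            }
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())

def parse_example(serialized):
    spec = {
        "pair": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.float32),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    pair = tf.reshape(tf.io.decode_raw(parsed["pair"], tf.float32), (2, 71, 71, 1))
    # the Siamese model takes two inputs, so split the pair
    return (pair[0], pair[1]), parsed["label"]

# one dataset over all chunk files; use tf.data.experimental.AUTOTUNE on older TF versions
dataset = (tf.data.TFRecordDataset(["TrainPairs1.tfrecord", "TrainPairs2.tfrecord"])
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(10000)
           .batch(128)
           .prefetch(tf.data.AUTOTUNE))

# model.fit(dataset, epochs=epochs) then streams all files without manual chunking

The main advantage is that the whole training set never has to fit in memory at once.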
If you really want to use pickle files, then I would recommend writing a custom training loop that iterates over the training files. This way you can load one file, train on it, then load the testing data and make your predictions, and then move on and train on the next set of data, all without having to save and reload the network weights.
Something like:
for e in range(epochs):
    # train on each training file in turn; the model keeps its weights in memory
    for file in ["train_file1.pickle", "train_file2.pickle"]:
        data = load(file)
        model_train(data)
    # evaluate on each test file at the end of the epoch
    for file in ["test_file1.pickle", "test_file2.pickle"]:
        data = load(file)
        model_test(data)
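Applied to the Keras model built in the question, that loop could look roughly like the sketch below. It is only a sketch: load_chunk is a hypothetical helper, the file names follow the TrainPairs1.pickle / TrainLabels1.pickle pattern from the question, each chunk is assumed to hold 40000 pairs as in the question's reshape calls, and model, path and epochs are the ones defined in the question.

import pickle
import numpy as np

def load_chunk(pairs_path, labels_path, n_pairs=40000):
    # hypothetical helper: load one pickle chunk and reshape it as in the question
    with open(pairs_path, "rb") as f:
        pairs = np.array(pickle.load(f)).reshape(n_pairs, 2, 71, 71, 1)
    with open(labels_path, "rb") as f:
        labels = np.array(pickle.load(f))
    return pairs, labels

for epoch in range(epochs):
    # one pass over all 60 training chunks; the compiled model keeps its weights
    # in memory, so nothing needs to be saved and reloaded between files
    for i in range(1, 61):
        tr_pairs, tr_y = load_chunk(path + f"TrainPairs{i}.pickle",
                                    path + f"TrainLabels{i}.pickle")
        model.fit([tr_pairs[:, 0], tr_pairs[:, 1]], tr_y,
                  batch_size=128, epochs=1, verbose=0)

    # evaluate on all 5 test chunks at the end of each epoch
    for i in range(1, 6):
        te_pairs, te_y = load_chunk(path + f"TestPairs{i}.pickle",
                                    path + f"TestLabels{i}.pickle")
        results = model.evaluate([te_pairs[:, 0], te_pairs[:, 1]], te_y, verbose=0)
        print("epoch", epoch, "test file", i, results)

# save the weights once at the end, as in the question
model.save_weights("/Saved_Weights/weights.ckpt")

One caveat: with this scheme the data is only shuffled within each chunk, so it may be worth shuffling the order of the chunk files at the start of every epoch.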
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Sean |
