TensorFlow: Fit gives great val_acc, but evaluate gives low acc
I have some data (Reuters). I combine the training and test data into one large data set, shuffle the entire data set, and then split it 50/50: 50% for training and 50% for testing.
I then set up a Sequential model (Embedding, GRU, BatchNormalization, Dense) and compile it with the Adam optimizer and loss = sparse_categorical_crossentropy.
I run fit on it for 5 epochs with validation_split = 0.2.
I get a val_acc of 95%. I am very happy with this.
I then call evaluate with the test data, and get acc = 71%
When I repeat evaluate with the train data (instead of test data), just to see what happens, I get 95% acc (as I should).
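For reference, here is a minimal sketch of what the pipeline looks like; the vocabulary size, sequence length, and layer widths below are placeholders rather than my exact values:

```python
import numpy as np
import tensorflow as tf

# Load Reuters, combine the built-in train/test splits, shuffle, and re-split 50/50
num_words, maxlen = 10000, 200  # placeholder vocabulary size and sequence length
(x1, y1), (x2, y2) = tf.keras.datasets.reuters.load_data(num_words=num_words)
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
idx = np.random.permutation(len(x))
x, y = x[idx], y[idx]
x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=maxlen)
split = len(x) // 2
x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]

# Sequential model: Embedding -> GRU -> BatchNormalization -> Dense (46 Reuters topics)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(num_words, 64),
    tf.keras.layers.GRU(64),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(46, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train for 5 epochs with 20% of the training half held out for validation
model.fit(x_train, y_train, epochs=5, validation_split=0.2)

# val_accuracy ends around 0.95, but evaluating on the test half gives ~0.71;
# evaluating on the training half again gives ~0.95
model.evaluate(x_test, y_test)
model.evaluate(x_train, y_train)
```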
I am trying to understand what is wrong with my acc score with the test data.
The data is shuffled between train and test each time, so it cannot be related to the data itself.
I tried the checkpoint save/restore trick, but that did not seem to help.
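(For the record, by the checkpoint save/restore trick I mean roughly the following, continuing from the sketch above; the file name and monitored metric are just example choices.)

```python
# Save the weights with the best validation accuracy during fit,
# then restore them before calling evaluate
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'checkpoint.weights.h5',
    monitor='val_accuracy',
    save_best_only=True,
    save_weights_only=True)

model.fit(x_train, y_train, epochs=5, validation_split=0.2,
          callbacks=[checkpoint])
model.load_weights('checkpoint.weights.h5')
model.evaluate(x_test, y_test)
```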
I have reduced epochs in case of overfitting, but this has not helped.
I am curious what is going on. The only thing I can think of is that there is some sort of overfitting going on that carries over to test but does NOT impact val_acc.
(Side note: the 20% of the data held out for validation during fit should not cause an overfitting problem, correct?)
Any ideas what could be wrong with my approach?
Thank you.
Edit 1: Okay, I might be onto something. I noticed something odd about the test data that is included with the Reuters dataset, versus the training data.
In previous experiments, results were always poor when evaluating against the test set, versus an unused portion of the training data.
This time, I combined the training and test data sets into one set and shuffled it. Then I shuffled some more, generated pseudo data around it, and shuffled some more.
Then I peeled off 20% to use as test data.
This time it worked. I got 95% for training, validation and evaluation accuracy.
I tried searching for information on the test set to see if anyone had come up with anything, but found no results. So for now I'm going to chalk it up to test data that is significantly different from the training set.
Edit 2: Never mind Edit 1. I think I was corrupting my test data with a pseudo-generated version of the training data that was close enough to work.
The only conclusion I can draw is that there is a lot of overfitting going on during my training, AND that using validation data during training is misleading, as it is also being overfit.
(Why the validation data ends up being overfit as well, I do not know.)
Solution 1:[1]
Overfitting can occur for many reasons; to overcome the problem you can try different methods (see the sketch after this list), such as:
- L1/L2 regularization.
- Dropout layers: By using dropout layers in the network, we ignore a subset of units of our network with a set probability. Using dropout, we can reduce interdependent learning among units, which may have led to overfitting.
- Early stopping: Monitors the performance of the model on a held-out validation set after every epoch and terminates training when the validation performance stops improving.
- Feature selection: Improves the machine learning process and increases the predictive power of machine learning algorithms by selecting the most important variables and eliminating unnecessary and irrelevant features.
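Here is a rough sketch of how dropout, L2 regularization, and early stopping could be wired into a model like the one in the question; the dropout rate, L2 strength, patience, and layer sizes are arbitrary example values:

```python
import numpy as np
import tensorflow as tf

# Same data prep as in the question (placeholder sizes)
num_words, maxlen = 10000, 200
(x1, y1), (x2, y2) = tf.keras.datasets.reuters.load_data(num_words=num_words)
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
idx = np.random.permutation(len(x))
x, y = x[idx], y[idx]
x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=maxlen)
split = len(x) // 2
x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(num_words, 64),
    tf.keras.layers.GRU(64),
    tf.keras.layers.BatchNormalization(),
    # Dropout: randomly drop half of the units during training
    tf.keras.layers.Dropout(0.5),
    # L2 regularization on the classifier weights
    tf.keras.layers.Dense(46, activation='softmax',
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Early stopping: watch validation loss each epoch and keep the best weights
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=2, restore_best_weights=True)

model.fit(x_train, y_train, epochs=20, validation_split=0.2,
          callbacks=[early_stop])
model.evaluate(x_test, y_test)
```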
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
