TensorFlow 2.7 shows different results on different machines

The problem

I had TensorFlow set up and running perfectly fine on a Windows 10 machine to create and train my models. I managed to get good models with it, and inference worked perfectly fine as well. I saved them and loaded them back to check that everything survived the round trip, and it did. However, migrating this project onto another machine broke everything in a way I don't understand. After loading the exact same model on the new machine, inference gives completely wrong results, making the model useless, almost as if it had never been trained. The model itself is really simple: a single hidden layer of 500 neurons with the ELU activation function:

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.layers.experimental import preprocessing

# Normalization layer adapted to the training data statistics
norm = preprocessing.Normalization()
norm.adapt(x_train)

model = tf.keras.models.Sequential()
model.add(Input(shape=(48,)))  # 48 input features
model.add(norm)
model.add(Dense(500, activation='elu'))  # the single hidden layer
model.add(Dropout(0.2))
model.add(Dense(686, activation='relu'))  # output layer

optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['accuracy'])
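
For reference, a minimal save/load round-trip check looks something like this (a sketch; the 'my_model' path and x_test are placeholders, not from the original setup):

import numpy as np

model.save('my_model')  # SavedModel format
reloaded = tf.keras.models.load_model('my_model')

# On the same machine, the reloaded model should reproduce the
# original predictions up to floating-point tolerance.
print(np.allclose(model.predict(x_test), reloaded.predict(x_test), atol=1e-6))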

What I have tried and verified

The Anaconda environment on the new machine (which I will now call machine 2) is exactly the same as on machine 1, and therefore so are the packages (I verified the versions of NumPy, TensorFlow and Keras). The Windows version is the same as well (I had to rule that out at one point). If I retrain the same model on machine 2, it produces the same wrong results, but copying this newly trained model back onto machine 1 still works, so somehow training is fine. I also tried the same thing in a virtual machine and hit the same issue.
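
One quick way to confirm the stacks really match across machines is to print the versions and the build metadata of the installed TensorFlow binary (a minimal sketch; extend the list with any other packages you care about):

import numpy as np
import tensorflow as tf
import keras

# Compare these across machines; any mismatch is a suspect.
print('numpy     :', np.__version__)
print('tensorflow:', tf.__version__)
print('keras     :', keras.__version__)

# Build metadata of the installed TensorFlow binary (compiler, CUDA, etc.)
print(tf.sysconfig.get_build_info())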

Basically, the only difference is in the hardware:

Machine 1:

  • Intel(R) Core(TM) i7-3770S @ 3.10 GHz
  • No GPU
  • 16 GB of RAM

Machine 2:

  • AMD Ryzen 3 3100 4-Core @ 3.80 GHz
  • NVIDIA GeForce RTX 2060 (not used to ensure it does not come from the GPU)
  • 32 GB of RAM

If anything needs to be added, let me know!

Update #1

For those of you who might need it, here is the result of the inference on machine 1:

[screenshot: inference output on machine 1]

And here is the inference on machine 2:

[screenshot: inference output on machine 2]

Note: the raw results are only an extract, but they should be enough to show how wrong the output is.
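
To compare the two machines numerically rather than by eye, one option is to dump the predictions to disk on machine 1 and diff them on machine 2 (a sketch; 'preds_m1.npy' and x_test are placeholders):

import numpy as np

# On machine 1:
np.save('preds_m1.npy', model.predict(x_test))

# On machine 2, after copying preds_m1.npy over:
preds_m1 = np.load('preds_m1.npy')
preds_m2 = model.predict(x_test)
print('max abs difference:', np.max(np.abs(preds_m1 - preds_m2)))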

Update #2

I tested on 2 other machines, and they seem to have the exact same issue. The only specification I identified that all 3 of them have in common, and that machine 1 doesn't share, is AVX2: they support both AVX and AVX2, while machine 1 only supports AVX.
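
To check which SIMD extensions a CPU reports, one option is the third-party py-cpuinfo package (a sketch; install it with pip install py-cpuinfo):

import cpuinfo

flags = cpuinfo.get_cpu_info()['flags']
# Expected: 'avx' but no 'avx2' on machine 1; both flags on the others.
print('AVX :', 'avx' in flags)
print('AVX2:', 'avx2' in flags)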



Solution 1:[1]

It turned out that the framework generating the reference results had recently changed just enough to alter its calculation method. It is now available via pip as well, which was not the case at the time of the computation on the first machine (meaning its version could not be included in the package-by-package check described above).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Gildur7161