Do big workloads need to be split into separate commands in MPSMatrixMultiplication?

I've been looking around and there doesn't seem to be much information on this topic. I'm comparing the performance of Accelerate's sgemm against the performance of the GPU using MetalPerformanceShaders. For that purpose, I'm using MPSMatrixMultiplication. I've compared the results of both methods and they're correct. However, when the size of the matrices gets past 30,000, the GPU seems to finish too fast: it reports 11 TFLOPS, when the maximum theoretical performance is 2.6 TFLOPS (M1 Mac mini). It also finishes the operation in under 6 seconds, which is definitely wrong. How could I approach this? Thanks
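For reference, here is a quick sanity check of the numbers above. It is only a sketch: it assumes square n x n matrices and the usual count of 2·n³ floating-point operations for a dense multiply (one multiply and one add per term). The function names are illustrative and not part of Accelerate or MetalPerformanceShaders.

```python
def sgemm_flops(m, n, k):
    """Operation count for C = A * B, with A of size m x k and B of size k x n.

    Assumes one multiply and one add per inner-product term.
    """
    return 2 * m * n * k


def implied_tflops(flops, seconds):
    """Throughput (in TFLOPS) implied by finishing `flops` operations in `seconds`."""
    return flops / seconds / 1e12


def min_seconds(flops, peak_tflops):
    """Lower bound on run time if the device sustained its peak throughput."""
    return flops / (peak_tflops * 1e12)


n = 30_000                      # matrix dimension from the question
flops = sgemm_flops(n, n, n)    # 5.4e13 operations

print(f"operations:           {flops:.3e}")
print(f"implied at 6 s:       {implied_tflops(flops, 6.0):.2f} TFLOPS")
print(f"minimum at 2.6 TFLOPS: {min_seconds(flops, 2.6):.2f} s")
```

Under these assumptions, a 30,000 x 30,000 multiply cannot complete in less than roughly 20 seconds at 2.6 TFLOPS, so a measured time under 6 seconds suggests that either the timer stopped before the GPU work actually finished or not all of the work was executed.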



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
