How to optimize multiplying a matrix by its own transpose using CUDA?
I have a matrix (M) of floats, roughly 17000 by 10000 values. I need the scalar (dot) product of each row with each row (so 17000 by 17000 values), which can alternatively be formalized as multiplying M by the transposed M.
I am new to CUDA, so I could write a "naive" solution using one thread per output element, but it is probably suboptimal in terms of computation speed.
Alternatively, I could use something like cublasSgemm(...) with M and the transposed M as arguments, but explicitly transposing M is an additional operation that should not be necessary, and the additional memory use is considerable (I have only a 4 GB video card freely available).
Please help me with the optimal (or at least a better) solution.
If it's important, I do know the number of columns beforehand (literally using #define numCol 10001), but the number of rows can vary as the rows are parsed from multiple .csv files.
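One approach worth sketching (this is an assumption about a reasonable solution, not part of the original post): cuBLAS accepts transpose flags, so the transpose never has to be materialized. Moreover, because M·Mᵀ is symmetric, cublasSsyrk computes only one triangle, roughly halving the FLOPs versus a full cublasSgemm. cuBLAS is column-major, so a row-major rows×numCol matrix M is seen by cuBLAS as a numCol×rows matrix A = Mᵀ, and AᵀA = M·Mᵀ falls out directly. The function name `gram_matrix` and the host-side layout below are illustrative; error checking is omitted for brevity.

```c
#include <stddef.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define numCol 10001

/* Computes G = M * M^T for a row-major rows-by-numCol matrix M,
 * without ever materializing M^T.  cuBLAS is column-major, so the
 * row-major buffer dM is seen by cuBLAS as A = M^T (numCol x rows);
 * cublasSsyrk with CUBLAS_OP_T then yields A^T * A = M * M^T. */
void gram_matrix(const float *hM, float *hG, int rows)
{
    float *dM, *dG;
    cudaMalloc(&dM, (size_t)rows * numCol * sizeof(float));
    cudaMalloc(&dG, (size_t)rows * rows  * sizeof(float));
    cudaMemcpy(dM, hM, (size_t)rows * numCol * sizeof(float),
               cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;
    /* C (rows x rows) = alpha * A^T * A + beta * C, where A is the
     * numCol x rows column-major view of dM.  CUBLAS_FILL_MODE_LOWER
     * (column-major) corresponds to the upper triangle of the
     * row-major view copied back below. */
    cublasSsyrk(h, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T,
                rows, numCol, &alpha, dM, numCol, &beta, dG, rows);
    cublasDestroy(h);

    cudaMemcpy(hG, dG, (size_t)rows * rows * sizeof(float),
               cudaMemcpyDeviceToHost);
    /* Only one triangle was written; mirror it to get the full
     * symmetric matrix in row-major order. */
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < i; ++j)
            hG[(size_t)i * rows + j] = hG[(size_t)j * rows + i];

    cudaFree(dM);
    cudaFree(dG);
}
```

For the sizes in the question, this fits the 4 GB card: the input is 17000 × 10001 × 4 bytes ≈ 0.68 GB and the output 17000 × 17000 × 4 bytes ≈ 1.16 GB, about 1.8 GB total on the device. If for some reason the full gemm is preferred, cublasSgemm with CUBLAS_OP_T on one operand likewise avoids any explicit transpose or extra copy.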
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
