Armadillo vs. BLAS: multiplying a matrix by its transpose. Is BLAS slower?

Good Day.

Does anyone know another trick or solution for multiplying a matrix by its transpose? My current code takes too much time over 1000 iterations. I tried calling OpenBLAS directly, but it seems to be a bit slower than Armadillo.

Did I do something wrong?

#include <chrono>
#include <complex>
#include <iostream>
#include <armadillo>

// Minimal stopwatch: starts on construction, reports elapsed seconds.
class watch : std::chrono::steady_clock {
    time_point start_ = now();
public:
    auto elapsed_sec() const { return std::chrono::duration<double>(now() - start_).count(); }
};

// Computes output = input * input^H with a direct call to BLAS gemm.
// output must already be sized (n_rows x n_rows).
template <typename T>
void matrix_multiplication(arma::Mat<T> const& input, arma::Mat<T>& output)
{
    const char N = 'N'; // use the first operand as is
    const char C = 'C'; // conjugate-transpose the second operand
    T alpha {1.0};
    T beta  {0.0};
    // arma::blas_int matches the integer width of the linked BLAS
    arma::blas_int m_ = static_cast<arma::blas_int>(input.n_rows);
    arma::blas_int n_ = static_cast<arma::blas_int>(input.n_rows);
    arma::blas_int k_ = static_cast<arma::blas_int>(input.n_cols);
    arma::blas::gemm(&N, &C, &m_, &n_, &k_, &alpha, input.memptr(), &m_, input.memptr(), &n_, &beta, output.memptr(), &n_);
}

int main()
{
    arma::cx_mat mat1; // size (300, 20'000)
    mat1.load("rec.txt"); // arma::fill::randu can be used instead
    arma::cx_mat resu(mat1.n_rows, mat1.n_rows, arma::fill::none);

    int N = 10;

    // Direct BLAS gemm call
    [&, timer = watch{}]() {
        for (int i = 0; i < N; ++i)
        {
            matrix_multiplication(mat1, resu);
        }
        std::cout << timer.elapsed_sec() / N << std::endl;
    }();
    resu.submat(arma::span(0, 1), arma::span(0, 5)).print("resu");

    // Armadillo expression
    [&, timer = watch{}]() {
        for (int i = 0; i < N; ++i)
        {
            resu = mat1 * mat1.t();
        }
        std::cout << timer.elapsed_sec() / N << std::endl;
    }();
    resu.submat(arma::span(0, 1), arma::span(0, 5)).print("resu");

    return 0;
}

I am using:

gcc 11.2
armadillo 10.7.3
openblas

My results (seconds per multiplication):

0.0394106 << blas 
resu

0.0253328 << armadillo
resu
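
One possible explanation for the gap, offered as an assumption and not confirmed by the measurements above, is that the product of a matrix with its own conjugate transpose is Hermitian, so only one triangle has to be computed. A general gemm call cannot know this and computes every entry, while a library that recognises the pattern can use a Hermitian rank-k update (BLAS herk) instead. The sketch below shows the idea in plain C++ without Armadillo or BLAS. The names CxMatrix, gram_full and gram_hermitian are hypothetical helpers made up for this illustration, and the loops are meant to show the saved work, not to be fast.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cx = std::complex<double>;

// Hypothetical helper: column-major complex matrix,
// element (r, c) is stored at data[r + c * rows].
struct CxMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<cx> data;

    CxMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

    cx& at(std::size_t r, std::size_t c) { return data[r + c * rows]; }
    const cx& at(std::size_t r, std::size_t c) const { return data[r + c * rows]; }
};

// Reference version: computes every entry of C = A * A^H,
// which is what a general matrix-matrix product does.
CxMatrix gram_full(const CxMatrix& a) {
    CxMatrix c(a.rows, a.rows);
    for (std::size_t i = 0; i < a.rows; ++i) {
        for (std::size_t j = 0; j < a.rows; ++j) {
            cx sum{0.0, 0.0};
            for (std::size_t k = 0; k < a.cols; ++k) {
                sum += a.at(i, k) * std::conj(a.at(j, k));
            }
            c.at(i, j) = sum;
        }
    }
    return c;
}

// Hermitian-aware version: computes only entries with j >= i and
// mirrors them, using C(j, i) = conj(C(i, j)). This roughly halves
// the number of inner products.
CxMatrix gram_hermitian(const CxMatrix& a) {
    CxMatrix c(a.rows, a.rows);
    for (std::size_t i = 0; i < a.rows; ++i) {
        for (std::size_t j = i; j < a.rows; ++j) {
            cx sum{0.0, 0.0};
            for (std::size_t k = 0; k < a.cols; ++k) {
                sum += a.at(i, k) * std::conj(a.at(j, k));
            }
            c.at(i, j) = sum;
            if (j != i) {
                c.at(j, i) = std::conj(sum);
            }
        }
    }
    return c;
}
```

Both functions return the same matrix up to rounding. With OpenBLAS, the corresponding routine for complex double data would be zherk, which fills only one triangle of the result, so the other triangle still has to be mirrored afterwards if the full matrix is needed.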


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source