'Full diagonalization NumPy vs SciPy

I have to make use of diagonalization routines to obtain all eigenpairs of a Hermitian complex matrix. I am a bit limited by performance since I need to repeat the operation thousands of times and my matrices are roughly of 8000x8000. I have created a little comparison between the NumPy and SciPy routines for diagonalization of Hermitian matrices and I got these times on a 6 physical cores machine:

enter image description here

I am observing that for 8000x8000 matrices this scales to~0.8 minutes and I need to repeat the process 50000 times. Is there something I am missing here or are these actually the performance times? Overall, this all looks very slow specially if this needs to be repeated several times. In fact, on a 30 core machine, I observe little performance gain. I am using the Python3.8 under Anaconda distribution so this is linked against the MKL.

Here is the example code

import numpy as np
import scipy.linalg
import matplotlib.pyplot as pyt
from time import time


t_ls = []
d_ls = np.array([100, 500, 1000,2000,4000])

for N in d_ls:

    A =np.random.rand( N,N ) + 1j*np.random.rand( N,N )    
    A = 0.5*( A + np.conj(A.T) ) 
       
    ts = time()    
    evals, evecs = np.linalg.eigh( A )
    t_np  = time()-ts
   

    ts = time()    
    evals2, evecs2 = scipy.linalg.eigh( A )
    t_sp  = time()-ts
    
    t_ls.append(np.array([t_np, t_sp]))
    
t_ls = np.array(t_ls)

pyt.plot( d_ls, t_ls[:,0], marker='s' )
pyt.plot( d_ls, t_ls[:,1], marker='^')
pyt.xlabel("N")
pyt.ylabel("time(secs)")
pyt.legend(["NumPy", "SciPy"])
pyt.show()

USING SVD AND MP PARALELLIZATION

Going through some of the comments in the post, I have tried SVD of the matrix and multiprocessing. I all cases I still see the serialized approach with NumPy eigh is the most efficient one; here is the code:

import numpy as np
import scipy.linalg
import matplotlib.pyplot as pyt
from time import time

import psutil


def f_mp_pool(*args):
     
    N = args[0]
        
    A =np.random.rand( N,N ) + 1j*np.random.rand( N,N )    
    A = 0.5*( A + np.conj(A.T) ) 
    
    evals, evecs = np.linalg.eigh(A)
    
    return evals, evecs


nreps = 100
N     = 700


ts = time()

for n in range(nreps):
    
    A =np.random.rand( N,N ) + 1j*np.random.rand( N,N )    
    A = 0.5*( A + np.conj(A.T) ) 
    
    res = np.linalg.eigh(A)
    

print("serialized:", time()-ts)

#use svd

import scipy.linalg

ts = time()
for n in range(nreps):
    res = scipy.linalg.svd( A, full_matrices=True, check_finite=False  ) 
    
print("SVD:", time()-ts)    

import multiprocessing as mp

nproc   = psutil.cpu_count(logical=False)-1
mp_pool = mp.Pool(processes=nproc)

args_ls = [ (N,) for n in range(nreps) ]

ts = time()
res = mp_pool.starmap( f_mp_pool, args_ls )
print("parallel:", time()-ts)


Solution 1:[1]

Pytorch will b faster, and if you have GPU it can also take advantage of that, however not so much because the QR iteration is not good for parallel computation. I have a potential solution to accelerate that part on GPUs but I never actually implemented it.

import numpy as np
import scipy.linalg
import torch
import matplotlib.pyplot as plt
from time import time

t_ls = []
d_ls = np.array([100, 500, 1000,2000,4000])

for N in d_ls:

    A =np.random.rand( N,N ) + 1j*np.random.rand( N,N )    
    A = 0.5*( A + np.conj(A.T) ) 

#  skipping numpy, it is slow here, you may put it back if you want
#     ts = time()    
#     evals, evecs = np.linalg.eigh( A )
#     t_np  = time()-ts
   

    ts = time()    
    evals2, evecs2 = scipy.linalg.eigh( A )
    t_sp  = time()-ts
    
    # When using CPU torch will use intra operation
    # parallelism, so if you care about latency
    # this is better than using multiprocessing
    A_cpu = torch.as_tensor(A)
    ts = time()
    evals3, evecs3 = torch.linalg.eigh(A_cpu)
    t_cpu = time() - ts;
    if torch.cuda.is_available():
        # Using GPU will give a significant speedup for some
        # operations, I guess the 
        A_gpu = A_cpu.to('cuda')
        torch.cuda.synchronize()
        ts = time()
        evals4, evecs4 = torch.linalg.eigh(A_gpu)
        torch.cuda.synchronize()
        t_gpu = time() - ts;
    else:
        t_gpu = np.nan #if you don't have GPU let's skip this part
    t_ls.append(np.array([np.nan, t_sp, t_cpu, t_gpu]))
    print(t_ls[-1])

t_ls = np.array(t_ls)

plt.plot( d_ls, t_ls[:,0], marker='s' )
plt.plot( d_ls, t_ls[:,1], marker='^')
plt.plot( d_ls, t_ls[:,2], marker='+')
plt.plot( d_ls, t_ls[:,3], marker='d')
plt.xlabel("N")
plt.ylabel("time(secs)")
plt.legend(["NumPy", "SciPy", "PyTorch CPU", "PyTorch GPU"])

My plot enter image description here

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Bob