Speeding Up Kronecker Products in NumPy
So I am trying to compute the Kronecker product of two matrices, each of arbitrary dimension (I use square matrices of the same dimension just for the examples).
Initially I tried using np.kron:
import numpy as np
import time

a = np.random.random((60,60))
b = np.random.random((60,60))
start = time.time()
a = np.kron(a,b)
end = time.time()
print(end - start)
Output: 0.160096406936645
To try to get a speed-up, I used np.tensordot:
a = np.random.random((60,60))
b = np.random.random((60,60))
start = time.time()
a = np.tensordot(a,b,axes=0)
a = np.transpose(a,(0,2,1,3))
a = np.reshape(a,(3600,3600))
end = time.time()
print(end - start)
Output: 0.11808371543884277
After searching the web a bit, I found that (at least to my understanding) NumPy makes an extra copy when it has to reshape an array that has been transposed, because the transposed view is no longer contiguous in memory.
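That copy can be made visible with np.shares_memory (a standard NumPy function, used here just to illustrate the point):

```python
import numpy as np

a = np.random.random((60, 60))
b = np.random.random((60, 60))

t = np.tensordot(a, b, axes=0)      # shape (60, 60, 60, 60), C-contiguous
tt = np.transpose(t, (0, 2, 1, 3))  # a view: no data moved, but no longer contiguous

# Reshaping the contiguous array returns a view that shares memory...
print(np.shares_memory(t.reshape(3600, 3600), t))   # True
# ...while reshaping the transposed view forces NumPy to copy the data.
print(np.shares_memory(tt.reshape(3600, 3600), t))  # False
```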
So I then tried the following (this code obviously does not give the Kronecker product of a and b, but I was just doing it as a test):
a = np.random.random((60,60))
b = np.random.random((60,60))
start = time.time()
a = np.tensordot(a,b,axes=0)
a = np.reshape(a,(3600,3600))
end = time.time()
print(end - start)
Output: 0.052041053771972656
My question is: how can I compute the Kronecker product without incurring the extra copy caused by the transpose?
I am just looking for a speed-up, so the solution does not have to use tensordot.
EDIT
I just found in this Stack Overflow post, speeding up numpy kronecker products, that there is another way to do it:
a = np.random.random((60,60))
b = np.random.random((60,60))
c = a
start = time.time()
a = a[:,np.newaxis,:,np.newaxis]*b[np.newaxis,:,np.newaxis,:]
a.shape = (3600,3600)
end = time.time()
test = np.kron(c,b)
print(np.array_equal(a,test))
print(end-start)
Output: True
0.05503702163696289
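As a side note, the broadcasting trick is not limited to square matrices of the same size. A small helper (kron_broadcast is my name for it, not from the post) handles arbitrary 2-D shapes:

```python
import numpy as np

def kron_broadcast(a, b):
    # Kronecker product of two 2-D arrays via broadcasting, for any shapes.
    m, n = a.shape
    p, q = b.shape
    return (a[:, None, :, None] * b[None, :, None, :]).reshape(m * p, n * q)

a = np.random.random((3, 5))
b = np.random.random((4, 2))
print(np.array_equal(kron_broadcast(a, b), np.kron(a, b)))  # True
```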
I am still interested in whether this computation can be sped up further.
Solution 1:[1]
einsum seems to work:
>>> from timeit import timeit
>>> import numpy as np
>>> a = np.random.random((60,60))
>>> b = np.random.random((60,60))
>>> ab = np.kron(a,b)
>>> abe = np.einsum('ik,jl', a, b).reshape(3600,3600)
>>> (abe==ab).all()
True
>>> timeit(lambda: np.kron(a, b), number=10)
1.0697475590277463
>>> timeit(lambda: np.einsum('ik,jl', a, b).reshape(3600,3600), number=10)
0.42500176999601535
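Why this works: with no explicit output subscript, einsum orders the output axes alphabetically, which here happens to be exactly the transposed layout the Kronecker product needs. A small sketch on non-square inputs:

```python
import numpy as np

a = np.random.random((2, 3))
b = np.random.random((4, 5))

# Without '->', einsum orders output axes alphabetically: (i, j, k, l).
# That gives out[i, j, k, l] = a[i, k] * b[j, l], the axis order kron needs.
e = np.einsum('ik,jl', a, b)
print(e.shape)                                       # (2, 4, 3, 5)
print(np.allclose(e.reshape(8, 15), np.kron(a, b)))  # True
```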
Simple broadcasting is even faster:
>>> abb = (a[:, None, :, None]*b[None, :, None, :]).reshape(3600,3600)
>>> (abb==ab).all()
True
>>> timeit(lambda: (a[:, None, :, None]*b[None, :, None, :]).reshape(3600,3600), number=10)
0.28011218502069823
UPDATE: Using BLAS and Cython we can get another modest (30%) speed-up. Decide for yourself whether it is worth the trouble.
[setup.py]
from distutils.core import setup
from Cython.Build import cythonize

setup(name='kronecker',
      ext_modules=cythonize("cythkrn.pyx"))
[cythkrn.pyx]
import cython
cimport scipy.linalg.cython_blas as blas
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def kron(double[:, ::1] a, double[:, ::1] b):
    cdef int i = a.shape[0]
    cdef int j = a.shape[1]
    cdef int k = b.shape[0]
    cdef int l = b.shape[1]
    cdef int onei = 1
    cdef double oned = 1
    cdef int m, n
    result = np.zeros((i*k, j*l), float)
    cdef double[:, ::1] result_v = result
    for n in range(i):
        for m in range(k):
            # one BLAS rank-1 update (dger) fills row m+k*n of the result
            blas.dger(&l, &j, &oned, &b[m, 0], &onei, &a[n, 0], &onei, &result_v[m+k*n, 0], &l)
    return result
To build, run `cython cythkrn.pyx` and then `python3 setup.py build`.
>>> from timeit import timeit
>>> import cythkrn
>>> import numpy as np
>>>
>>> a = np.random.random((60,60))
>>> b = np.random.random((60,60))
>>>
>>> np.all(cythkrn.kron(a, b)==np.kron(a, b))
True
>>>
>>> timeit(lambda: cythkrn.kron(a, b), number=10)
0.18925874299020506
Solution 2:[2]
Speeding up memory-bound calculations
- Avoid the full product completely, where possible (e.g. the kron_and_sum example)
- Use blocked execution, when combined with other calculations
- float32 instead of float64 may also be enough
- If this calculation is in a loop, allocate the memory only once
I got about the same timings with this code as with @Paul Panzer's implementation, but with both implementations I see the same odd behaviour. With pre-allocated memory, there is absolutely no speed-up if the calculation is parallelized (this is expected), but without pre-allocated memory there is quite a significant speed-up.
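For the last point, one way to reuse a buffer without Numba (a sketch, assuming the broadcasting formulation from the question) is the out= argument of np.multiply:

```python
import numpy as np

a = np.random.random((60, 60))
b = np.random.random((60, 60))

# Allocate the 4-D work buffer once, outside the loop...
out = np.empty((60, 60, 60, 60))

for _ in range(5):
    # ...and let np.multiply write into it instead of allocating a new array.
    np.multiply(a[:, None, :, None], b[None, :, None, :], out=out)
    result = out.reshape(3600, 3600)  # a view: out is C-contiguous, no copy
```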
Code
import numba as nb
import numpy as np

@nb.njit(fastmath=True,parallel=True)
def kron(A,B):
    out=np.empty((A.shape[0],B.shape[0],A.shape[1],B.shape[1]),dtype=A.dtype)
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            for k in range(A.shape[1]):
                for l in range(B.shape[1]):
                    out[i,j,k,l]=A[i,k]*B[j,l]
    return out

@nb.njit(fastmath=True,parallel=False)
def kron_preallocated(A,B,out):
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            for k in range(A.shape[1]):
                for l in range(B.shape[1]):
                    out[i,j,k,l]=A[i,k]*B[j,l]
    return out

@nb.njit(fastmath=True,parallel=True)
def kron_and_sum(A,B):
    out=0.
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            for k in range(A.shape[1]):
                for l in range(B.shape[1]):
                    out+=A[i,k]*B[j,l]
    return out
Timings
#Create some data
a_float64 = np.random.random((60,60))
b_float64 = np.random.random((60,60))
a_float32 = a_float64.astype(np.float32)
b_float32 = b_float64.astype(np.float32)
out_float64=np.empty((a_float64.shape[0],b_float64.shape[0],a_float64.shape[1],b_float64.shape[1]),dtype=np.float64)
out_float32=np.empty((a_float32.shape[0],b_float32.shape[0],a_float32.shape[1],b_float32.shape[1]),dtype=np.float32)
#Reference
%timeit np.kron(a_float64,b_float64)
147 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#If you have to allocate memory for every calculation (float64)
%timeit B=kron(a_float64,b_float64).reshape(3600,3600)
17.6 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#If you don't have to allocate memory for every calculation (float64)
%timeit B=kron_preallocated(a_float64,b_float64,out_float64).reshape(3600,3600)
8.08 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#If you have to allocate memory for every calculation (float32)
%timeit B=kron(a_float32,b_float32).reshape(3600,3600)
9.27 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#If you don't have to allocate memory for every calculation (float32)
%timeit B=kron_preallocated(a_float32,b_float32,out_float32).reshape(3600,3600)
3.95 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Example for a joined operation (sum of Kronecker product)
#which isn't memory-bottlenecked
%timeit B=kron_and_sum(a_float64,b_float64)
881 µs ± 104 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 3:[3]
I've improved np.kron a bit (57%) here: https://github.com/numpy/numpy/pull/21232
The idea was to get rid of concatenate, which was coincidentally causing a ValueError.
For what it's worth, since I stumbled upon this, @Paul Panzer seems to have a better solution using broadcasting in his answer, which I'm now trying to add to NumPy. You can keep an eye on https://github.com/numpy/numpy/issues/21257 for progress. Thanks @Paul for the idea.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Yacola |
| Solution 3 | Ganesh Kathiresan |
