Speeding Up Kronecker Products in NumPy
So I am trying to compute the Kronecker product of two matrices, each of arbitrary dimension (I use square matrices of the same dimension just for the examples).
Initially I tried using np.kron:
import numpy as np
import time

a = np.random.random((60,60))
b = np.random.random((60,60))
start = time.time()
a = np.kron(a,b)
end = time.time()
print(end - start)
Output: 0.160096406936645
To try to get a speed-up, I used np.tensordot:
a = np.random.random((60,60))
b = np.random.random((60,60))
start = time.time()
a = np.tensordot(a,b,axes=0)
a = np.transpose(a,(0,2,1,3))
a = np.reshape(a,(3600,3600))
end = time.time()
print(end - start)
Output: 0.11808371543884277
After searching the web a bit, I found that (at least to my understanding) NumPy makes an extra copy when it has to reshape an array that has been transposed, because the transposed view is no longer contiguous in memory.
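That copy can be made visible with np.shares_memory (a standard NumPy function, used here just to illustrate the point):

```python
import numpy as np

a = np.random.random((60, 60))
b = np.random.random((60, 60))

t = np.tensordot(a, b, axes=0)      # shape (60, 60, 60, 60), C-contiguous
tt = np.transpose(t, (0, 2, 1, 3))  # a view: no data moved, but no longer contiguous

# Reshaping the contiguous array returns a view that shares memory...
print(np.shares_memory(t.reshape(3600, 3600), t))   # True
# ...while reshaping the transposed view forces NumPy to copy the data.
print(np.shares_memory(tt.reshape(3600, 3600), t))  # False
```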
So I then tried the following (this code obviously does not give the Kronecker product of a and b, but I was just doing it as a test):
a = np.random.random((60,60))
b = np.random.random((60,60))
start = time.time()
a = np.tensordot(a,b,axes=0)
a = np.reshape(a,(3600,3600))
end = time.time()
print(end - start)
Output: 0.052041053771972656
My question is: how can I compute the Kronecker product without incurring the extra copy caused by the transpose?
I am just looking for a speed-up, so the solution does not have to use tensordot.
EDIT
I just found in this Stack Overflow post, speeding up numpy kronecker products, that there is another way to do it:
a = np.random.random((60,60))
b = np.random.random((60,60))
c = a
start = time.time()
a = a[:,np.newaxis,:,np.newaxis]*b[np.newaxis,:,np.newaxis,:]
a.shape = (3600,3600)
end = time.time()
test = np.kron(c,b)
print(np.array_equal(a,test))
print(end-start)
Output: True
0.05503702163696289
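As a side note, the broadcasting trick is not limited to square matrices of the same size. A small helper (kron_broadcast is my name for it, not from the post) handles arbitrary 2-D shapes:

```python
import numpy as np

def kron_broadcast(a, b):
    # Kronecker product of two 2-D arrays via broadcasting, for any shapes.
    m, n = a.shape
    p, q = b.shape
    return (a[:, None, :, None] * b[None, :, None, :]).reshape(m * p, n * q)

a = np.random.random((3, 5))
b = np.random.random((4, 2))
print(np.array_equal(kron_broadcast(a, b), np.kron(a, b)))  # True
```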
I am still interested in whether this computation can be sped up further.
Solution 1:[1]
einsum seems to work:
>>> from timeit import timeit
>>> import numpy as np
>>> a = np.random.random((60,60))
>>> b = np.random.random((60,60))
>>> ab = np.kron(a,b)
>>> abe = np.einsum('ik,jl', a, b).reshape(3600,3600)
>>> (abe==ab).all()
True
>>> timeit(lambda: np.kron(a, b), number=10)
1.0697475590277463
>>> timeit(lambda: np.einsum('ik,jl', a, b).reshape(3600,3600), number=10)
0.42500176999601535
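Why this works: with no explicit output subscript, einsum orders the output axes alphabetically, which here happens to be exactly the transposed layout the Kronecker product needs. A small sketch on non-square inputs:

```python
import numpy as np

a = np.random.random((2, 3))
b = np.random.random((4, 5))

# Without '->', einsum orders output axes alphabetically: (i, j, k, l).
# That gives out[i, j, k, l] = a[i, k] * b[j, l], the axis order kron needs.
e = np.einsum('ik,jl', a, b)
print(e.shape)                                       # (2, 4, 3, 5)
print(np.allclose(e.reshape(8, 15), np.kron(a, b)))  # True
```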
Simple broadcasting is even faster:
>>> abb = (a[:, None, :, None]*b[None, :, None, :]).reshape(3600,3600)
>>> (abb==ab).all()
True
>>> timeit(lambda: (a[:, None, :, None]*b[None, :, None, :]).reshape(3600,3600), number=10)
0.28011218502069823
UPDATE: Using BLAS and Cython we can get another modest (30%) speed-up. Decide for yourself whether it is worth the trouble.
[setup.py]
from distutils.core import setup
from Cython.Build import cythonize

setup(name='kronecker',
      ext_modules=cythonize("cythkrn.pyx"))
[cythkrn.pyx]
import cython
cimport scipy.linalg.cython_blas as blas
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def kron(double[:, ::1] a, double[:, ::1] b):
    cdef int i = a.shape[0]
    cdef int j = a.shape[1]
    cdef int k = b.shape[0]
    cdef int l = b.shape[1]
    cdef int onei = 1
    cdef double oned = 1
    cdef int m, n
    result = np.zeros((i*k, j*l), float)
    cdef double[:, ::1] result_v = result
    for n in range(i):
        for m in range(k):
            # one BLAS rank-1 update (dger) fills row m+k*n of the result
            blas.dger(&l, &j, &oned, &b[m, 0], &onei, &a[n, 0], &onei, &result_v[m+k*n, 0], &l)
    return result
To build, run `cython cythkrn.pyx` and then `python3 setup.py build`.
>>> from timeit import timeit
>>> import cythkrn
>>> import numpy as np
>>>
>>> a = np.random.random((60,60))
>>> b = np.random.random((60,60))
>>>
>>> np.all(cythkrn.kron(a, b)==np.kron(a, b))
True
>>>
>>> timeit(lambda: cythkrn.kron(a, b), number=10)
0.18925874299020506
Solution 2:[2]
Speeding up memory-bound calculations
- Avoid the full product completely, where possible (e.g. the kron_and_sum example)
- Use blocked execution, when combined with other calculations
- float32 instead of float64 may also be enough
- If this calculation is in a loop, allocate the memory only once
I got about the same timings with this code as with @Paul Panzer's implementation, but with both implementations I see the same odd behaviour. With pre-allocated memory, there is absolutely no speed-up if the calculation is parallelized (this is expected), but without pre-allocated memory there is quite a significant speed-up.
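For the last point, one way to reuse a buffer without Numba (a sketch, assuming the broadcasting formulation from the question) is the out= argument of np.multiply:

```python
import numpy as np

a = np.random.random((60, 60))
b = np.random.random((60, 60))

# Allocate the 4-D work buffer once, outside the loop...
out = np.empty((60, 60, 60, 60))

for _ in range(5):
    # ...and let np.multiply write into it instead of allocating a new array.
    np.multiply(a[:, None, :, None], b[None, :, None, :], out=out)
    result = out.reshape(3600, 3600)  # a view: out is C-contiguous, no copy
```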
Code
import numba as nb
import numpy as np

@nb.njit(fastmath=True,parallel=True)
def kron(A,B):
    out=np.empty((A.shape[0],B.shape[0],A.shape[1],B.shape[1]),dtype=A.dtype)
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            for k in range(A.shape[1]):
                for l in range(B.shape[1]):
                    out[i,j,k,l]=A[i,k]*B[j,l]
    return out

@nb.njit(fastmath=True,parallel=False)
def kron_preallocated(A,B,out):
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            for k in range(A.shape[1]):
                for l in range(B.shape[1]):
                    out[i,j,k,l]=A[i,k]*B[j,l]
    return out

@nb.njit(fastmath=True,parallel=True)
def kron_and_sum(A,B):
    out=0.
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            for k in range(A.shape[1]):
                for l in range(B.shape[1]):
                    out+=A[i,k]*B[j,l]
    return out
Timings
#Create some data
a_float64 = np.random.random((60,60))
b_float64 = np.random.random((60,60))
a_float32 = a_float64.astype(np.float32)
b_float32 = b_float64.astype(np.float32)
out_float64=np.empty((a_float64.shape[0],b_float64.shape[0],a_float64.shape[1],b_float64.shape[1]),dtype=np.float64)
out_float32=np.empty((a_float32.shape[0],b_float32.shape[0],a_float32.shape[1],b_float32.shape[1]),dtype=np.float32)
#Reference
%timeit np.kron(a_float64,b_float64)
147 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#If you have to allocate memory for every calculation (float64)
%timeit B=kron(a_float64,b_float64).reshape(3600,3600)
17.6 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#If you don't have to allocate memory for every calculation (float64)
%timeit B=kron_preallocated(a_float64,b_float64,out_float64).reshape(3600,3600)
8.08 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#If you have to allocate memory for every calculation (float32)
%timeit B=kron(a_float32,b_float32).reshape(3600,3600)
9.27 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#If you don't have to allocate memory for every calculation (float32)
%timeit B=kron_preallocated(a_float32,b_float32,out_float32).reshape(3600,3600)
3.95 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Example for a joined operation (sum of Kronecker product)
#which isn't memory-bottlenecked
%timeit B=kron_and_sum(a_float64,b_float64)
881 µs ± 104 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 3:[3]
I've improved np.kron a bit (57%) here: https://github.com/numpy/numpy/pull/21232
The idea was to get rid of concatenate, which was coincidentally causing a ValueError.
For what it's worth, since I stumbled upon this, @Paul Panzer seems to have a better solution using broadcasting in his answer, which I'm now trying to add to NumPy. You can keep an eye on https://github.com/numpy/numpy/issues/21257 for progress. Thanks @Paul for the idea.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Yacola |
| Solution 3 | Ganesh Kathiresan |
