Distributed Computing in Julia Slower than Serial

I have a Julia function that seems very amenable to parallelization: each iteration only manipulates the data at its own index. Yet this function, when implemented with @distributed and shared arrays as below, is slower than its serial equivalent. I have tried an equivalent implementation with distributed arrays instead of shared arrays, and it is even slower. There must be something simple I am missing here, but I cannot figure it out.

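using Distributed, SharedArrays
# logsumexp is not in Base; it is assumed here to come from StatsFuns or LogExpFunctions.

function f(A1, A2, I1, I2, n1, n2, n3)
    # Shared output buffers that every worker process can write to
    B1 = convert(SharedArray, zeros(n1, n2))
    B2 = convert(SharedArray, zeros(n2, n3))
    @sync @distributed for d in 1:n2
        for i in 1:n3
            B1[d, i] = A1[I1[d], I2[d][i]] / (A1[I1[d], I2[d][i]] + A2[I1[d], I2[d][i]])
            B2[:, d] .+= log.(A2[:, I2[d]])
        end
        # Normalize column d in log space
        B2[:, d] .-= logsumexp(B2[:, d])
    end
    # Copy the results back into ordinary arrays
    B1 = convert(Array, B1)
    B2 = convert(Array, B2)

    B2 = exp.(B2)
    return B1, B2
end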


Solution 1:[1]

The amount of compute you're trying to distribute is likely much too small. Remember, all distributed computing carries the overhead of sending data back and forth between processes, and that overhead is significant: the work done per task has to be large enough to outweigh it before you see any actual speedup.
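
To get a feel for where that break-even point sits, something like the following rough sketch may help. It is not the asker's function: `work`, `serial_total`, `distributed_total`, the parameter `k` and the choice of two local workers are all made up for illustration. It times the same reduction serially and with `@distributed` for a cheap and an expensive per-element task. The actual numbers depend entirely on the machine, so none are shown here.

```julia
using Distributed

# Assumption: two local worker processes; adjust to your machine.
if nprocs() == 1
    addprocs(2)
end

# Hypothetical per-element task; `k` controls how much arithmetic each element costs.
@everywhere function work(x, k)
    s = 0.0
    for j in 1:k
        s += x / (x + j)
    end
    return s
end

# Serial reference: everything runs in the current process.
serial_total(xs, k) = sum(x -> work(x, k), xs)

# Distributed version: the index range is split across the workers, `xs` is
# shipped to each of them, and the partial results are combined with `+`.
function distributed_total(xs, k)
    @distributed (+) for i in eachindex(xs)
        work(xs[i], k)
    end
end

xs = rand(1_000)
for k in (1, 100_000)
    # Warm-up calls so that compilation time is not included in the timings.
    serial_total(xs, k)
    distributed_total(xs, k)
    t_serial = @elapsed serial_total(xs, k)
    t_dist = @elapsed distributed_total(xs, k)
    println("k = $k: serial $(t_serial) s, distributed $(t_dist) s")
end
```

With `k = 1` each element costs a single division, so the timing is mostly communication and scheduling. With the larger `k` there is enough arithmetic per element for the workers to have a chance of paying for themselves.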

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Chris Rackauckas