'Julia vs c++ performance almost factor 30

The Julia program below takes about 6 seconds on my laptop (second test(n)). An equivalent C++ program (using Eigen) only 0.19 s. According to the results I have seen on https://programming-language-benchmarks.vercel.app/cpp, I expected a much smaller difference. What is wrong with my Julia program? I would be very appreciative for hints on how to improve my Julia program.

using StaticArrays
using Printf

struct CoordinateTransformation
    b1::SVector{3,Float64}
    b2::SVector{3,Float64}
    b3::SVector{3,Float64}
    r0::SVector{3,Float64}
    mf::SMatrix{3,3,Float64}
    mb::SMatrix{3,3,Float64}
end

function dot(a::SVector{3,Float64}, b::SVector{3,Float64}) 
    a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
end

function CoordinateTransformation(b1::SVector{3,Float64}, b2::SVector{3,Float64}, b3::SVector{3,Float64}, r0::SVector{3,Float64})
    mf = MMatrix{3,3,Float64}(undef)

    e1::SVector{3, Float64} = [1.0, 0.0, 0.0]
    e2::SVector{3, Float64} = [0.0, 1.0, 0.0]
    e3::SVector{3, Float64} = [0.0, 0.0, 1.0]

    mf[1, 1] = dot(b1, e1);
    mf[1, 2] = dot(b1, e2);
    mf[1, 3] = dot(b1, e3);
    mf[2, 1] = dot(b2, e1);
    mf[2, 2] = dot(b2, e2);
    mf[2, 3] = dot(b2, e3);
    mf[3, 1] = dot(b3, e1);
    mf[3, 2] = dot(b3, e2);
    mf[3, 3] = dot(b3, e3);
    mb = inv(mf)
    CoordinateTransformation(b1, b2, b3, r0, mf, mb)
end

@inline function transform_point_f(at::CoordinateTransformation, v::MVector{3,Float64})
    at.mf * v + at.r0
end

@inline function transform_point_b(at::CoordinateTransformation, v::MVector{3,Float64})
    at.mb * (v - at.r0)
end

@inline function transform_vector_f(at::CoordinateTransformation, v::MVector{3,Float64})
    at.mf * v
end

@inline function transform_vector_b(at::CoordinateTransformation, v::MVector{3,Float64})
    at.mb * v
end

function test(n)
    theta = 1.0;
    c = cos(1.0);
    s = sin(1.0);
    b1::SVector{3, Float64} = [c, 0.0, s]
    b2::SVector{3, Float64} = [0.0, 1.0, 0.0]
    b3::SVector{3, Float64} = [-s, 0.0, c]
    r0::SVector{3, Float64} = [0.0, 0.0, 1.0]
    at::CoordinateTransformation = CoordinateTransformation(b1, b2, b3, r0)

    @printf("%e\n", n)

    points = Array{MVector{3, Float64}, 1}(undef, n)
    @inbounds for i in 1:n
        points[i] = [1.0, 0.0, 0.0]
    end

    @inbounds for i in 1:n
        points[i] = transform_point_f(at, points[i])
    end
    println(points[n])

    @inbounds for i in 1:n
        points[i] = transform_point_b(at, points[i])
    end
    println(points[n])
end


n = 10000000
@timev test(n)
@timev test(n)


Solution 1:[1]

A major issue with your test function is that a massive number of MVectors are allocated in the 3 loops. In addition, since MVectors are mutable structs, which are reference types, the points vector is a vector of references, which is not great for performance.

Instead, I recommend changing points to a vector of SVectors and modifying the code to accommodate this (e.g., replace every MVector with SVector). In the first loop, points[i] = [1.0, 0.0, 0.0] should be changed to points[i] = SA[1.0, 0.0, 0.0] to avoid allocations from creating temporary vectors. (See also Eric's comment on this.)

Implementing these simple changes, I see an improvement from

2.523284 seconds (40.00 M allocations: 1.714 GiB, 43.11% gc time)

to

0.171544 seconds (267 allocations: 228.891 MiB)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Vincent Yu