Optimizing mutable vs immutable vector math

Which coding style lends itself better to compiler optimizations? In particular, I'm interested in 1) minimizing the number of temporary values that get thrown away immediately and 2) automatic vectorization, i.e. generating SIMD instructions for arithmetic.

Suppose I have this struct:

#define FOR_EACH for (int i = 0; i < N; ++i)

template<typename T, unsigned N>
struct Vector {
    void scale(T scalar) {
        FOR_EACH v[i] *= scalar;
    }

    void add(const Vector<T, N>& other) {
        FOR_EACH v[i] += other.v[i];
    }

    void mul(const Vector<T, N>& other) {
        FOR_EACH v[i] *= other.v[i];
    }

    T v[N];
};

Example usage of this struct:

Vector<int, 3> v1 = ...;
Vector<int, 3> v2 = ...;
v1.scale(10);
v1.add(v2);
v1.mul(v2);

This is the mutable approach.

An alternative immutable approach could look like this:

#include <cstring> // for memcpy

template<typename T, unsigned N>
struct Vector {
    Vector() = default; // the user-declared copy constructor suppresses the implicit default constructor

    // Copies the element storage; assumes T is trivially copyable.
    Vector(const Vector<T, N>& other) {
        memcpy(v, other.v, sizeof(v));
    }

    Vector<T, N> operator+(const Vector<T, N>& other) const {
        Vector<T, N> result(*this);
        FOR_EACH result.v[i] += other.v[i];
        return result;
    }

    Vector<T, N> operator*(T scalar) const {
        Vector<T, N> result(*this);
        FOR_EACH result.v[i] *= scalar;
        return result;
    }

    Vector<T, N> operator*(const Vector<T, N>& other) const {
        Vector<T, N> result(*this);
        FOR_EACH result.v[i] *= other.v[i];
        return result;
    }

    T v[N];
};

Example usage:

Vector<int, 3> v1 = ...;
Vector<int, 3> v2 = ...;
auto result = (v1 * 10 + v2) * v2;

Now, I'm not concerned with API design in this question. Assume that both solutions are viable in this regard.

Also, instead of int, the sample code could just as well use float or double.

What interests me is this: which design can be more easily analyzed by a modern C++ compiler? I'm not targeting any single compiler in particular. If you have experience with any compiler and know how it deals with optimizations I'm asking about, please share your experience.

  • The second version produces a lot of temporary values. Can the compiler get rid of those if it ultimately inlines all operator calls and sees all the arithmetic expressions held within? (I'm assuming that without inlining no compiler can eliminate the temporaries, because of possible side effects.) See the fused-loop sketch after this list for what full elimination would look like.

  • The first version minimises the number of temporaries but imposes a strictly sequential calculation. Can the compiler still deduce the intent and reorder the operations in a way that minimises the number of operations and allows for their parallelisation (at the CPU instruction level)?

  • How difficult is it for a modern compiler to vectorize the loops above?
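
For reference, this is the fused loop I'd hope either version ultimately compiles down to (a hand-written sketch of the ideal result, not output from either design):

// Hand-fused equivalent of result = (v1 * 10 + v2) * v2:
// one pass over the elements, no intermediate Vector objects.
int result[3];
for (int i = 0; i < 3; ++i)
    result[i] = (v1.v[i] * 10 + v2.v[i]) * v2.v[i];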



Solution 1:[1]

As far as I understand, the first example is easy to vectorize as long as the target architecture supports SIMD. This is because there is no data dependency between elements in successive iterations.
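
If you want to check what your compiler actually does, most compilers can print a vectorization report. For example, with a free-standing copy of the scale() loop (the file name is hypothetical; the flags are GCC's and Clang's vectorization-report options):

// Ask the compiler to report its vectorization decisions, e.g.:
//   g++ -O3 -fopt-info-vec-optimized vec.cpp
//   clang++ -O3 -Rpass=loop-vectorize vec.cpp
void scale(float* v, int n, float s) {
    for (int i = 0; i < n; ++i)
        v[i] *= s; // independent iterations: no loop-carried dependency
}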

If you have loops with data dependencies between elements in successive iterations, in some instances those dependencies can be removed by software pipelining, which in turn helps vectorization.
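
For contrast, a prefix-sum style loop carries a value between iterations and resists straightforward vectorization (an illustrative example, not from the question):

// Loop-carried dependency: iteration i reads the result of iteration i - 1,
// so the SIMD lanes cannot be computed independently.
void prefix_sum(int* v, int n) {
    for (int i = 1; i < n; ++i)
        v[i] += v[i - 1];
}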

On some architectures, floating-point calculations are not easily vectorizable because the floating-point execution units are limited.

In the second example, the temporaries may be eliminated once the compiler inlines the operator calls; return-value optimization already removes some of the copies even before inlining.
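
Independently of inlining, return-value optimization removes the copy out of each operator's local result. A minimal probe to observe this (a hypothetical Probe type, not part of the question's code):

#include <cstdio>

struct Probe {
    Probe() = default;
    Probe(const Probe&) { std::puts("copy"); } // fires only for copies that survive
};

Probe make() {
    Probe p;  // named local returned by value
    return p; // NRVO: typically constructed directly in the caller's storage
}

int main() {
    Probe p = make(); // typically prints nothing: the copy is elided
}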


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Glorfindel