Compound expressions
When vectorizing code, it is beneficial to perform more than one vectorizable operation in the same loop. This applies especially to operations that do not require additional memory bandwidth. The CPU's vector units can process data much faster than memory can typically deliver it, even when the data is cached. While the CPU waits for the next value from memory, that time can be used for additional math that does not slow down the primary operation. Welcome to the MtxVec "compound expressions". Over 160 overloads have been added to the TMtxVec type, from which both TVec and TMtx derive, to address this possibility. Below you can see an example from the MtxVec Demo executed on the AMD Zen 2 (Rome) architecture at 3.2 GHz. Orange represents sequenced expressions using Intel IPP and blue represents the compound expressions.
For the Skylake AVX-512 architecture the advantage is smaller, but still present:
The performance advantage varies considerably with the CPU architecture. It makes particular sense to mix vector operations with scalar operations, because such simple math expressions are limited mostly by memory bandwidth, which the scalar operations do not use. The following patterns are available, where X, Y and Z are vectors and the rest are scalars:
- X*xScale + Y*yScale + Z*zScale
- X*xScale - Y*yScale - Z*zScale
- X*Y*Z*xyzScale
- X*Y/Z*xyzScale
- X / (Y*Z)*xyzScale
- X*Y*xyScale + Z*zScale
- X*Y*xyScale - Z*zScale
- (X*xScale + Y*yScale)*Z*zScale
- (X*xScale - Y*yScale)*Z*zScale
- X / Y*xyScale + Z*zScale
- X / Y*xyScale - Z*zScale
All combinations with reductions, where the scalars are equal to either 1 or 0, are available as well. It is interesting to observe that the following three examples:
- X + Y
- X*xScale + Y*yScale
- X*xScale + Y*yScale, and finding min and max of the resulting vector
take the same amount of time. It therefore makes sense to look for patterns in your code that can take advantage of this, or simply to build algorithms that, counterintuitively, look slower because they compute more, but are in fact faster.
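The difference between a sequenced and a compound evaluation can be sketched in plain C++. This is not the MtxVec API: the names `MinMax`, `scaled_sum_sequenced` and `scaled_sum_compound` are hypothetical and exist only for this illustration. The sequenced version makes three passes over memory and writes two temporaries, while the compound version reads each element of X and Y once, applies both scalings, and tracks the minimum and maximum in the same loop.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper type for this sketch (not part of MtxVec).
struct MinMax {
    double min;
    double max;
};

// Sequenced evaluation: one loop per operation, with temporaries.
// Each pass streams whole vectors through memory again.
std::vector<double> scaled_sum_sequenced(const std::vector<double>& x, double xScale,
                                         const std::vector<double>& y, double yScale) {
    const std::size_t n = std::min(x.size(), y.size());
    std::vector<double> tx(n), ty(n), out(n);
    for (std::size_t i = 0; i < n; ++i) tx[i] = x[i] * xScale;   // pass 1
    for (std::size_t i = 0; i < n; ++i) ty[i] = y[i] * yScale;   // pass 2
    for (std::size_t i = 0; i < n; ++i) out[i] = tx[i] + ty[i];  // pass 3
    return out;
}

// Compound evaluation: X*xScale + Y*yScale plus min/max in a single pass.
// The scalar multiplies and the reduction need no extra memory traffic.
MinMax scaled_sum_compound(const std::vector<double>& x, double xScale,
                           const std::vector<double>& y, double yScale,
                           std::vector<double>& out) {
    const std::size_t n = std::min(x.size(), y.size());
    out.resize(n);
    MinMax r{std::numeric_limits<double>::infinity(),
             -std::numeric_limits<double>::infinity()};
    for (std::size_t i = 0; i < n; ++i) {
        const double v = x[i] * xScale + y[i] * yScale;
        out[i] = v;
        r.min = std::min(r.min, v);
        r.max = std::max(r.max, v);
    }
    return r;
}
```

Both functions produce the same result vector. How much faster the single-pass form runs depends on the compiler, the vector length and the CPU architecture, so it is worth measuring on the target machine.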