Getting up to speed

Code vectorization is where high performance computing starts

MtxVec allows the programmer to write high-level object code that delivers the performance of highly optimized assembler using the latest CPU instructions, from within your current development environment. This is best examined with an example. Simply using a faster Power function in the following loop will bring no major gains.

[delphi]
function YourFunc(a, b, c1, c2, ea, eb: double): double;
var
    a1, b1: double;
begin
    a1 := a * c1 + c2;
    b1 := Power(b, ea) + 1;  // Power is declared in the Math unit
    Result := a1 * Power(b1, -eb);
end;
[/delphi]
[delphi]
for i := 0 to 999999 do
begin
    c[i] := YourFunc(a[i], b[i], c1, c2, ea, eb);
end;
[/delphi]

But if this is rewritten to process the data in blocks of 2000 elements:

[delphi]
a.Length := 2000;
b.Length := 2000;
for i := 0 to 499 do
begin
    YourFunc(a, b, c, c1, c2, ea, eb);
end;
[/delphi]
with the vectorized version of YourFunc defined as:

[delphi]
procedure YourFunc(a, b, Result: TVec; c1, c2, ea, eb: TSample);
var
    a1, b1: TVec;
begin
    if (a.Length <> b.Length) then Eraise('a.Length <> b.Length');
    CreateIt(a1, b1); // obtain work vectors from the object cache
    try
        a1.Mul(a, c1);         // a1 := a * c1
        a1.Offset(c2);         // a1 := a1 + c2
        b1.Power(b, ea);       // b1 := b^ea
        b1.Offset(1);          // b1 := b1 + 1
        Result.Power(b1, -eb); // Result := b1^(-eb)
        Result.Mul(a1);        // Result := Result * a1
    finally
        FreeIt(a1, b1);        // return work vectors to the object cache
    end;
end;
[/delphi]
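For completeness, here is one way the block loop above might be driven end to end, assuming the three-vector overloads of CreateIt and FreeIt. This is a sketch only: ProcessAllBlocks and the constant values are ours for illustration, and the code that fills each block depends on where your data comes from.

[delphi]
procedure ProcessAllBlocks;
var
    a, b, c: TVec;
    c1, c2, ea, eb: TSample;
    i: Integer;
begin
    c1 := 2; c2 := 0.5; ea := 1.5; eb := 0.75; // example constants
    CreateIt(a, b, c); // the data vectors come from the object cache too
    try
        a.Length := 2000;
        b.Length := 2000;
        for i := 0 to 499 do
        begin
            // ... fill a and b with the next block of 2000 samples ...
            YourFunc(a, b, c, c1, c2, ea, eb);
            // ... consume the 2000 results now held in c ...
        end;
    finally
        FreeIt(a, b, c);
    end;
end;
[/delphi]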
We can note that we wrote more lines and that we create and destroy objects within a loop. However, the objects created and destroyed within the function are not really created and destroyed: CreateIt and FreeIt draw on a pool of pre-created objects called the object cache, and the objects in the object cache already have memory pre-allocated.

But how could so many loops, instead of only one, be faster? We have six loops (Mul, Offset, Power, Offset, Power, Mul) in the second case and only one in the first. Splitting the work this way would seem to make it impossible for the compiler to perform its usual loop optimizations: keeping local variables in CPU/FPU registers, precomputing constants and so on. The secret is called SIMD, or Single Instruction Multiple Data. Both Intel and AMD CPUs support this instruction set. It has proved difficult for compiler vendors to make efficient use of these instructions, and even today many compilers ship without proper SIMD support, with two major exceptions: the Intel C++ and Intel Fortran compilers. A SIMD-capable compiler effectively converts the first version of our example into the second. The transformation is not always as clean, however, and the gains are rarely as large as when the same principle is applied by hand, because it can be difficult for the compiler to break a single loop down into a list of more efficient ones.
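Returning to the object cache for a moment, a rough mental model of what CreateIt and FreeIt do is a stack of pre-allocated vectors. The sketch below is ours, purely for illustration; it ignores thread safety and bounds checking and is not MtxVec's actual implementation.

[delphi]
type
    TVecPool = class // hypothetical illustration, not MtxVec's real cache
    private
        FPool: array of TVec;
        FTop: Integer;
    public
        constructor Create(PoolSize, Capacity: Integer);
        destructor Destroy; override;
        function Acquire: TVec;     // conceptually what CreateIt does
        procedure Release(v: TVec); // conceptually what FreeIt does
    end;

constructor TVecPool.Create(PoolSize, Capacity: Integer);
var
    i: Integer;
begin
    inherited Create;
    SetLength(FPool, PoolSize);
    for i := 0 to PoolSize - 1 do
    begin
        FPool[i] := TVec.Create;     // objects are created only once...
        FPool[i].Length := Capacity; // ...with memory pre-allocated
    end;
    FTop := PoolSize;
end;

destructor TVecPool.Destroy;
var
    i: Integer;
begin
    // assumes balanced Acquire/Release, so the pool holds all objects
    for i := 0 to High(FPool) do FPool[i].Free;
    inherited;
end;

function TVecPool.Acquire: TVec;
begin
    Dec(FTop);             // no heap allocation on the hot path
    Result := FPool[FTop];
end;

procedure TVecPool.Release(v: TVec);
begin
    FPool[FTop] := v;      // simply push the object back
    Inc(FTop);
end;
[/delphi]

This is why obtaining and "destroying" the work vectors inside a tight loop costs only a few pointer operations.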

What is so special about SIMD, and why are more loops required? SIMD instructions work roughly like this:

  • load up to 4 array elements from memory (ideally takes 1 CPU cycle)
  • execute the mathematical operation (ideally takes 1 CPU cycle)
  • save the result back to memory (ideally takes 1 CPU cycle)

Total CPU cycle count is 3 for four elements. A plain scalar loop would, in the best case, need one cycle per element for each of the load, the operation and the store, for a total of 4 x 3 = 12 CPU cycles over the same four elements. Of course, the compiler performs some optimization on the scalar loop too and keeps some variables in registers, so it does not need the full 12 cycles; typical speed-ups from SIMD are therefore not 4x but about 2-3x. However, there are implicit optimizations in our second version as well. Because we know that the exponent is fixed, the vectorized Power function can take advantage of that (for example, by evaluating the power via the Exp(ea * Ln(x)) identity with the exponent loaded only once for the whole vector), so the gap widens again. Of course, the first loop could also be optimized for that, but you would have to think of it.
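The pattern is easiest to see written out by hand. Below is a minimal sketch in plain Delphi (the name OffsetBy4 is ours, purely for illustration): each iteration of the main loop handles four array elements, just as a single SIMD instruction operates on four packed values at once. A real SIMD implementation performs each group of four loads, additions and stores as one instruction instead of four.

[delphi]
// Illustration of the 4-wide SIMD pattern in plain Delphi: one iteration
// of the main loop mirrors one SIMD instruction on 4 packed values.
procedure OffsetBy4(var a: array of double; c: double);
var
    i: Integer;
begin
    i := 0;
    // main loop: four elements per step, like a 4-wide SIMD register
    while i + 3 <= High(a) do
    begin
        a[i]     := a[i]     + c;
        a[i + 1] := a[i + 1] + c;
        a[i + 2] := a[i + 2] + c;
        a[i + 3] := a[i + 3] + c;
        Inc(i, 4);
    end;
    // scalar tail for lengths that are not a multiple of four
    while i <= High(a) do
    begin
        a[i] := a[i] + c;
        Inc(i);
    end;
end;
[/delphi]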
