Example

Normal vectorized procedure:

procedure ParetoPDF(const X: Vector; a, b: double; var Res: Vector); overload;
begin
  Res.Size(X);            // size Res to match X
  Res.Power(X, -(a+1));   // Res := X^(-(a+1))
  Res.Mul(Power(b,a)*a);  // scale Res in place by a*b^a
end;
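
For reference, the two statements in the procedure body appear to evaluate the standard Pareto density, with a read as the shape parameter and b as the scale parameter. That reading is inferred from the code, not stated on this page:

```latex
f(x;\,a,b) \;=\; \frac{a\,b^{a}}{x^{a+1}} \;=\; a\,b^{a}\,x^{-(a+1)}, \qquad x \ge b,\; a > 0,\; b > 0 .
```

The call Res.Power(X,-(a+1)) corresponds to the factor x^{-(a+1)}, and Res.Mul(Power(b,a)*a) to the constant a b^a. The procedure as shown does not check the condition x >= b.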

Vectorized and blocked version of the Pareto probability distribution procedure:

procedure ParetoPDF(const X: Vector; a, b: double; var Res: Vector); overload;
begin
  Res.Size(X);
  Res.BlockInit;
  X.BlockInit;
  while not X.BlockEnd do
  begin
    Res.Power(X, -(a+1));
    Res.Mul(Power(b,a)*a);
    Res.BlockNext;
    X.BlockNext;
  end;
end;

The blocked version of ParetoPDF will execute faster than the non-blocked version when X contains roughly 5000-10000 elements or more (double precision). Below that size the two versions perform about the same, except for very short vectors (below 50 elements), where the non-blocked version has a slight advantage because it avoids the overhead of the block processing methods.

The time is saved between the calls to Res.Power(X,-(a+1)) and Res.Mul(Power(b,a)*a), where the same memory (stored in the Res vector) is accessed in two consecutive calls. That memory is loaded into the CPU cache by the first call and is still there for the second call, provided the length of the Res vector being processed is short enough to fit in the cache. Block processing keeps that length short even when X itself is long.

As an exercise, you can also compare the performance of the vectorized and blocked version of the function with the single value version (ParetoPDF(X: double; a, b: double; Res: double)) and measure the execution time of both versions for long vectors (100 000 elements) and short vectors (10 elements).
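
The caching argument can be illustrated without the library. The following is a minimal sketch on plain dynamic arrays, assuming a recent Delphi compiler with unit scope names. The name ParetoPDFBlocked and the block length of 1024 elements are hypothetical choices for illustration; this is not how the library implements BlockInit, BlockNext, and BlockEnd.

```pascal
program BlockedParetoSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Math;

const
  BlockLen = 1024; // assumed block length; chosen so one block fits in cache

// Hypothetical helper: evaluates a*b^a*x^(-(a+1)) block by block.
procedure ParetoPDFBlocked(const X: TArray<Double>; a, b: Double;
  var Res: TArray<Double>);
var
  N, First, Last, i: Integer;
  Scale: Double;
begin
  N := Length(X);
  SetLength(Res, N);
  Scale := Power(b, a) * a;
  First := 0;
  while First < N do
  begin
    // Clamp the last index of the current block to the array bounds
    Last := First + BlockLen - 1;
    if Last > N - 1 then
      Last := N - 1;
    // Pass 1 over the current block: x^(-(a+1))
    for i := First to Last do
      Res[i] := Power(X[i], -(a + 1));
    // Pass 2 over the same block, which is likely still in the CPU cache
    for i := First to Last do
      Res[i] := Res[i] * Scale;
    First := Last + 1;
  end;
end;

var
  X, Res: TArray<Double>;
  i: Integer;
begin
  // Long test vector, as suggested for the timing exercise
  SetLength(X, 100000);
  for i := 0 to High(X) do
    X[i] := 1.0 + i * 0.001;
  ParetoPDFBlocked(X, 3.0, 1.0, Res);
  Writeln(Format('Res[0] = %g', [Res[0]]));
end.
```

Running both passes over one block before moving to the next block mirrors the two consecutive calls inside the while loop. For a timing comparison, the two inner loops can be replaced by two passes over the whole array.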

The differences with block processing will be more noticeable on old CPUs without support for SSE2/SSE3.
