Dew Research
Search entire site:
  Home| Products | Order | Downloads | FAQ | Support | About us
Introducing MtxVec 2.0 the number crunching library is back with dot Net support.More

Develop within .NET and deliver the code speed of assembler.
• Comprehensive and fast numerical math library
• support for VS.NET, Borland Delphi and C++ Builder
• statistical and DSP add-ons

 
MtxVec
Screenshots
MtxVec
Visit our comprehensive Screenshot Gallery.
Applications
MtxVec for mission critical applications where complex real time data processing is needed. Ten times faster than conventional programming.
MtxVec applications
Testimonials
"Using MtxVec 2, with its SSE2 support, I see about a x4 speed improvement over traditional x87 assembler when running on my Pentium 4 notebook!"

Matthew Wormington, Bede Corporation
More Testimonials
FFT properties 5.0
Borland Delphi Microsoft Dot .Net
About

MtxVec introduction

Getting up to speed

MtxVec allows the programmer to write high level object code that gives the performance of the most optimized assembler code supporting latest CPU instructions from within your current development environment. This is best examined on an example. Simply trying to use a faster Power function in the following loop will bring no major gains:

for i := 0 to 1000000-1 do
begin

Y[i] := (c1*Ax[i]+c2)/Power(1.0 + Power(Bx[i],eA), eB);

end;

But if the above loop is rewritten like below, things change a lot.

a.Length := 2000;
b.Length := 2000;
for i := 0 to 499 do
begin

YourFunc(a,b,c,c1,c2,ea,eb);

end;

procedure YourFunc(a,b,Result: TVec; c1,c2,ea,eb: TSample);
var a1,b1: TVec;
begin

if (a.Length <> b.Length) then Eraise('a.Length <> b.Length');
CreateIt(a1,b1); //work vectors
try

a1.Mul(a,c1);
a1.Offset(c2);
b1.Power(b,ea);
b1.Offset(1);
Result.Power(b1,-eb);
Result.Mul(a1);

finally

FreeIt(a1,b1);

end;

end;

We can note that we wrote more lines and that we create and destroy objects within a loop. The objects created and destroyed within the function are not really created and not really destroyed. The CreateIt and FreeIt functions access a pool of precreated objects called object cache. The objects from the object cache have some memory pre-allocated. But how could so many loops, instead of only one, be faster? We have 7 loops (Copy, Scale, Offset, Power, Offset, Power, Mul) in the second case and only one in the first. This makes it impossible for any compiler to perform loop optimization, store local variables in the CPU/FPU, precompute constants. The secret is called SIMD or Single Instruction Multiple Data. Intel's and AMD CPU's support a special instruction set. It has been very difficult for any compiler vendor to try to make efficient use of those instructions and even today most compilers run without support for SIMD with two major exceptions: Intel C++ and Intel Fortran compilers. SIMD supporting compilers convert the first loop of our case in to the second loop of our case. The transformation is not always as clean and the gains are not as nearly as large, as if the same principle is employed by hand. Sometimes it is difficult for the compiler to effectively brake down one single loop in to a list of more effective ones.

What is so special about SIMD and why are more loops required? The SIMD instructions work similar to this:

  • load up to 4 array elements from memory (ideally takes 1 CPU cycle)
  • execute the mathematical operation (ideally takes 1 CPU cycle)
  • save the result back to memory(ideally takes 1 CPU cycle)

Total CPU cycle count is 3. The normal loop would require 1 cycle for each element to load, store and apply function (in best case). In total that would be 12 CPU cycles. Of course the compiler does some optimization in the loop, stores some variables in to FPU registers and the loop does not need full 12 cycles. Therefore typical speed ups for SIMD are not 4x but about 2-3x. However there are some implicit optimizations in our second loop too. Because we know that the exponent is fixed, the vectorized Power function can take advantage of that, so the gap is increased again. Of course, the first loop could also be optimized for that, but you would have to think of it.

Block processing

When working with vectors it is absolutely critical to also consider the size of the CPU cache. If the arrays will not fit in the available CPU cache, a large (sometimes up to 3x) performance penalty will be imposed upon the algorithm. This means that vector arithmetic's should not be applied to vectors whose size exceed certain maximum length. Typically the maximum number of double precision elements ranges from 800 to 2000 per array. Longer vectors have to be split in pieces and processed in parts. MtxVec provides tools that allow you to achieve that easily. The following listing shows three versions of the same function.

Plain function:

function MaxwellPDF(x, a: TSample): TSample;
var xx: TSample;
begin

if (x >= 0) and (a > 0) then
begin

xx := Sqr(x);
Result := Sqrt(4*a*INVTWOPI)*a*xx*Exp(-0.5*a*xx);

end else Result := NAN;

end;

Vectorized function:

procedure MaxwellPDF(X: TVec; a: TSample; Res: TVec);
var Res1: TVec;
begin

CreateIt(Res1);
try

Res1.Sqr(X);
Res.Copy(Res1);
Res.Scale(-0.5*a);
Res.Exp;
Res.Mul(Res1);
Res.Scale(Sqrt(4*a*INVTWOPI)*a);

finally

FreeIt(Res1);

end;

end;

Block vectorized function:

procedure MaxwellPDF(X: TVec; a: TSample; Res: TVec);
var Res1: TVec;
begin

CreateIt(Res1);
try

Res.Size(X);
Res.BlockInit;
X.BlockInit;
while not X.BlockEnd do
begin

Res1.Sqr(X);
Res.Copy(Res1);
Res.Scale(-0.5*a);
Res.Exp;
Res.Mul(Res1);
Res.Scale(Sqrt(4*a*INVTWOPI)*a);
Res.BlockNext;
X.BlockNext;

end;

finally

FreeIt(Res1);

end;

end;

Screen shotOn P4 2.4 GHz CPU the vectorized function is about 9.5x faster than the plain function when using Delphi 6. The block vectorized version of the function is a little slower for short vectors but maintains its high performance even for vectors exceeding 10 000 double precision elements. (For a CPU with 512kB CPU cache the limit is about 10 000 elements and if CPU with 128kB cache is used the limit is about 2000 elements.)

The block vectorized function is only marginally faster than vectorized version due to the use of SSE2 instructions. If the CPU does not support SSE, then the gain of the block vectorized version will be much more significant (typical gains are about 6 times). For example, when using older CPU's the speed of the plain function for vectors with length larger than the size of the CPU cache will be higher than that of its vectorized version. The vectorized version has to access memory multiple times, while the plain function version can cache some intermediate results in to FPU registers or CPU cache. The block vectorized version will ensure that the chunk of the vector being processed can fit in to the CPU cache and will thus give optimal performance for long vectors even in that case.

Objects and numeric's

Object oriented numeric's has several advantages:

  • Improved code readability,
  • less lines of code for same effect,
  • better error protection,
  • allows vectorization of operations, thus taking advantage of the latest CPU architectures.

It also has some drawbacks:

  • Overhead for handling objects, especially create and destroy
  • Overhead for handling small vectors and matrices.

MtxVec addresses the advantages to the full extent possible and introduces two techniques which decrease the performance drawback for handling objects and short vectors:

  • Pre-creation of objects (object cache) and
  • memory pre allocation.

All this makes MtxVec a very fast object oriented library.

MtxVec encapsulates LAPACK

A quote from www.netlib.org : "LAPACK provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision."

Users guide is available at: http://www.netlib.org/lapack/lug/lapack_lug.html

MtxVec wraps LAPACK and offers complete functionality of LAPACK except for support of packed matrices and overcomes. LAPACK differs between 9 matrix types: general, symmetric, hermitian, positive definite symmetric, positive definite hermitian, general banded, positive definite symmetric banded, positive definite hermitian banded and triangular. With MtxVec fairly long argument lists and function sequences are reduced to (real or complex, single or double):

In case of multiplication:

a.Mul(b,c); {general matrix}
a.Mul(b,c, mtSymmetric); {b is symmetric}
a.Mul(b,c, mtGeneral,mtSymmetric); {c is symmetric}
a.Mul(b,c, mtGeneral,mtHermitian, opTransp); {b is transposed}

Solving a system of linear equations:

a.LUSolve(b,x); {general matrix}
a.LUSolve(b,x, mtTriangular) {a is triangular}

Eigenvalues:

a.Eig(b);

LeastSquare fit:

a.LQRSolve(b,x,R); {A * X = B}
a.LQRSolve(b,x); {does not compute and return R}

Delphi, C++ And C# (and MtxVec) use row major array ordering. Appropriate adjustments were made to interface FORTRAN column major array ordering, with minimum overhead.

Features

  • Encapsulates LAPACK
  • Interfaces Intel Perofrmance libraries.
  • Supports SMP for matrix operations.
  • Single and/or double precision supported.
  • Bottom up complete support for complex numbers.
  • Design structured in three layers: procedural, object and component; ensures good scalability.
  • Fast object creation and destruction via object cache..
  • CPU specific optimizations but runs on all CPU's.
  • Built in dynamic memory management with extensive support for debugging.
  • Completely matrix-vector oriented programming reduces development time and allows fast algorithm prototyping.
  • Support for future CPU instruction sets, without application recompile. (DLL replacement)
  • Introduces TSparseMtx class capable of handling real and complex sparse matrices
  • Encapsulates UMF Pack, Pardiso and Taucs sparse matrix solvers.

Back to MtxVec...

Navigation
Home Page
Special Offers
News

Products
Information

Order

Downloads

Information
Product Support
About us
Site Map
Resources
Testimonials
Customers
Link Request

  *

MtxVec© Janez Makovsek and Marjan Slatinek. All Rights Reserved. E-MAIL info@dewresearch.com. Delphi & C++ Builder are registered trademarks of Inprise Borland Corporation. All other brands and product names are trademarks or registered trademarks of their respective owners.
dogma