Support for VS.NET, Borland Delphi and C++ Builder statistical and DSP add-ons
MtxVec
MtxVec for mission critical applications where complex real time data processing is needed. Ten times faster than conventional programming.
Testimonials
"Using MtxVec 2, with its SSE2 support, I see about a x4 speed improvement over traditional x87 assembler when running on my Pentium 4 notebook!"
Matthew Wormington, Bede Corporation
About
MtxVec introduction

MtxVec lets the programmer write high-level object code that performs like the most optimized assembler code using the latest CPU instructions, from within your current development environment. This is best shown with an example. Simply switching to a faster Power function in the following loop will bring no major gains:

    for i := 0 to 1000000-1 do
    end;

But if the loop is rewritten as below, things change a lot.

    a.Length := 2000;
    end;

    procedure YourFunc(a,b,Result: TVec; c1,c2,ea,eb: TSample);
    end;

We wrote more lines, and we appear to create and destroy objects within a loop. In fact, the objects created and destroyed within the function are not really created and not really destroyed. The CreateIt and FreeIt functions access a pool of precreated objects called the object cache, and the objects in the object cache have some memory pre-allocated.

But how could so many loops, instead of only one, be faster? We have 7 loops (Copy, Scale, Offset, Power, Offset, Power, Mul) in the second case and only one in the first. Splitting the work this way makes it impossible for any compiler to perform loop optimization, keep local variables in CPU/FPU registers, or precompute constants.

The secret is called SIMD, or Single Instruction Multiple Data. Intel and AMD CPUs support a special instruction set for it. It has been very difficult for compiler vendors to make efficient use of those instructions, and even today most compilers run without support for SIMD, with two major exceptions: the Intel C++ and Intel Fortran compilers. SIMD-supporting compilers convert the first loop of our example into the second. The transformation is not always as clean, and the gains are not nearly as large, as when the same principle is applied by hand. Sometimes it is difficult for the compiler to break a single loop down into a list of more effective ones.

What is so special about SIMD, and why are more loops required? The SIMD instructions work similar to this:
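As a rough illustration of that idea, here is a toy model in plain C++. It does not use real SIMD instructions: it only counts one "instruction" per load, per applied function and per store, once for a scalar loop and once for a packed loop that handles four values at a time. The names `OpCount`, `square_scalar` and `square_packed4` are hypothetical, squaring stands in for an arbitrary function, and instruction counts are not real CPU cycles.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Counts modelled instructions; this is bookkeeping, not a measurement.
struct OpCount {
    std::size_t loads = 0, applies = 0, stores = 0;
    std::size_t total() const { return loads + applies + stores; }
};

// Scalar model: every instruction handles exactly one element.
OpCount square_scalar(const std::vector<float>& in, std::vector<float>& out) {
    out.resize(in.size());
    OpCount c;
    for (std::size_t i = 0; i < in.size(); ++i) {
        float v = in[i];  ++c.loads;    // load one element
        v = v * v;        ++c.applies;  // apply the function to one element
        out[i] = v;       ++c.stores;   // store one element
    }
    return c;
}

// Packed model: every instruction handles four elements, as a 128-bit
// register holding four single precision values would.
OpCount square_packed4(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() % 4 == 0);  // the toy model skips remainder handling
    out.resize(in.size());
    OpCount c;
    for (std::size_t i = 0; i < in.size(); i += 4) {
        std::array<float, 4> reg{};
        for (std::size_t l = 0; l < 4; ++l) reg[l] = in[i + l];
        ++c.loads;    // one packed load
        for (std::size_t l = 0; l < 4; ++l) reg[l] *= reg[l];
        ++c.applies;  // one packed operation
        for (std::size_t l = 0; l < 4; ++l) out[i + l] = reg[l];
        ++c.stores;   // one packed store
    }
    return c;
}
```

For four input values the packed model issues three instructions where the scalar model issues twelve, and both produce the same results.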
Total CPU cycle count is 3. The normal loop would require 1 cycle per element for each of load, apply function and store (in the best case). For four elements that is 12 CPU cycles in total. Of course the compiler does some optimization in the loop and keeps some variables in FPU registers, so the loop does not need the full 12 cycles. Therefore typical speedups for SIMD are not 4x but about 2-3x. However, there are some implicit optimizations in our second loop too. Because we know that the exponent is fixed, the vectorized Power function can take advantage of that, so the gap widens again. Of course, the first loop could also be optimized for that, but you would have to think of it.

When working with vectors it is absolutely critical to also consider the size of the CPU cache. If the arrays do not fit in the available CPU cache, a large (sometimes up to 3x) performance penalty is imposed on the algorithm. This means that vector arithmetic should not be applied to vectors whose length exceeds a certain maximum. Typically the maximum number of double precision elements ranges from 800 to 2000 per array. Longer vectors have to be split into pieces and processed in parts. MtxVec provides tools that make this easy. The following listing shows three versions of the same function.

Plain function:

    function MaxwellPDF(x, a: TSample): TSample;
    end;

Vectorized function:

    procedure MaxwellPDF(X: TVec; a: TSample; Res: TVec);
    end;

Block vectorized function:

    procedure MaxwellPDF(X: TVec; a: TSample; Res: TVec);
    end;
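The three variants can be sketched in plain C++ to show the structure of each. This is not the library's code and uses no MtxVec calls. The formula is the textbook Maxwell density with scale parameter a, which is an assumption and may differ from the library's parameterization. The names `maxwell_pdf`, `maxwell_pdf_passes` and `maxwell_pdf_blocked` are hypothetical, and the default block of 1000 elements is simply a value inside the 800 to 2000 range quoted above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Plain form: one call per element (textbook density, assumed parameterization).
double maxwell_pdf(double x, double a) {
    assert(x >= 0.0 && a > 0.0);
    const double sqrt_2_over_pi = std::sqrt(2.0 / std::acos(-1.0));
    return sqrt_2_over_pi * x * x * std::exp(-x * x / (2.0 * a * a)) / (a * a * a);
}

// Multi-pass ("vectorized") form: three simple loops over n elements.
// tmp must have room for at least n elements.
void maxwell_pdf_passes(const double* x, double* res, double* tmp,
                        std::size_t n, double a) {
    assert(a > 0.0);
    const double scale = std::sqrt(2.0 / std::acos(-1.0)) / (a * a * a);
    const double k = -1.0 / (2.0 * a * a);
    for (std::size_t i = 0; i < n; ++i) res[i] = x[i] * x[i];            // x^2
    for (std::size_t i = 0; i < n; ++i) tmp[i] = std::exp(k * res[i]);   // exp(-x^2 / (2 a^2))
    for (std::size_t i = 0; i < n; ++i) res[i] = scale * res[i] * tmp[i];
}

// Block form: run the multi-pass version on chunks small enough to stay
// in the CPU cache, reusing one temporary buffer of block size.
void maxwell_pdf_blocked(const std::vector<double>& x, std::vector<double>& res,
                         double a, std::size_t block = 1000) {
    assert(a > 0.0 && block > 0);
    res.resize(x.size());
    std::vector<double> tmp(std::min(block, x.size()));
    for (std::size_t start = 0; start < x.size(); start += block) {
        const std::size_t n = std::min(block, x.size() - start);
        maxwell_pdf_passes(x.data() + start, res.data() + start, tmp.data(), n, a);
    }
}
```

The block form touches the same data as the multi-pass form, but each chunk and the temporary buffer are reused across the three passes while they are still likely to be in cache.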
The block vectorized function is only marginally faster than the vectorized version due to the use of SSE2 instructions. If the CPU does not support SSE, the gain of the block vectorized version will be much more significant (typical gains are about 6 times). For example, on older CPUs the plain function will be faster than its vectorized version for vectors longer than the CPU cache can hold. The vectorized version has to access memory multiple times, while the plain function can keep some intermediate results in FPU registers or the CPU cache. The block vectorized version ensures that the chunk of the vector being processed fits into the CPU cache and thus gives optimal performance for long vectors even in that case.

Object oriented numerics has several advantages:
It also has some drawbacks:
MtxVec delivers these advantages to the fullest extent possible and introduces two techniques that reduce the performance penalty of handling objects and short vectors:
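The object cache described earlier, where CreateIt and FreeIt draw on a pool of precreated objects with memory already allocated, is one way to cut per-object overhead. A minimal C++ sketch of such a pool follows. It only illustrates the general technique and is not MtxVec's implementation; the class name `VecPool` and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// A pool of preallocated vectors: acquiring and releasing one avoids a
// fresh heap allocation as long as the pool is not exhausted.
class VecPool {
public:
    using Vec = std::vector<double>;

    VecPool(std::size_t count, std::size_t reserve_elems) {
        for (std::size_t i = 0; i < count; ++i) {
            auto v = std::make_unique<Vec>();
            v->reserve(reserve_elems);  // memory is allocated up front
            free_.push_back(std::move(v));
        }
    }

    // Hand out a pooled vector; fall back to a new one if the pool is empty.
    std::unique_ptr<Vec> acquire() {
        if (free_.empty()) return std::make_unique<Vec>();
        auto v = std::move(free_.back());
        free_.pop_back();
        return v;
    }

    // Return a vector to the pool; clear() drops the contents but keeps
    // the allocated capacity for the next user.
    void release(std::unique_ptr<Vec> v) {
        assert(v != nullptr);
        v->clear();
        free_.push_back(std::move(v));
    }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<std::unique_ptr<Vec>> free_;
};
```

A function that needs a temporary vector would call `acquire()` on entry and `release()` on exit, which mirrors the create-and-free pattern in the text without paying for allocation on every call.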
All this makes MtxVec a very fast object oriented library.

A quote from www.netlib.org: "LAPACK provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision."

The users' guide is available at: http://www.netlib.org/lapack/lug/lapack_lug.html

MtxVec wraps LAPACK and offers complete functionality of LAPACK except for support of packed matrices and overcomes. LAPACK distinguishes between 9 matrix types: general, symmetric, Hermitian, positive definite symmetric, positive definite Hermitian, general banded, positive definite symmetric banded, positive definite Hermitian banded and triangular. With MtxVec, fairly long argument lists and function sequences are reduced to the following (real or complex, single or double):

In case of multiplication:
Solving a system of linear equations:
Eigenvalues:
Least squares fit:
Delphi, C++ and C# (and MtxVec) use row-major array ordering. Appropriate adjustments were made to interface with FORTRAN's column-major array ordering with minimum overhead.
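One common way to bridge the two orderings without copying data is to note that a row-major buffer, read as column-major, is the transposed matrix. The product C = A·B can then be obtained from a column-major routine by swapping the operands, because Cᵀ = Bᵀ·Aᵀ. The sketch below shows this in plain C++. It is an illustration of the general trick, not a description of how MtxVec does it. `gemm_colmajor` is a hand-written stand-in for a FORTRAN-style routine, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major multiply: C (m x n) = A (m x k) * B (k x n),
// with element (i, j) of a matrix with r rows stored at [i + j * r].
void gemm_colmajor(std::size_t m, std::size_t n, std::size_t k,
                   const double* A, const double* B, double* C) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            double sum = 0.0;
            for (std::size_t p = 0; p < k; ++p) sum += A[i + p * m] * B[p + j * k];
            C[i + j * m] = sum;
        }
    }
}

// Row-major multiply: C (m x n) = A (m x k) * B (k x n), all stored row-major.
// No element is moved: the operands are swapped so that the column-major
// routine computes C^T = B^T * A^T, which is C when read back row-major.
void gemm_rowmajor(std::size_t m, std::size_t n, std::size_t k,
                   const std::vector<double>& A, const std::vector<double>& B,
                   std::vector<double>& C) {
    assert(A.size() == m * k && B.size() == k * n);
    C.assign(m * n, 0.0);
    gemm_colmajor(n, m, k, B.data(), A.data(), C.data());
}
```

The only overhead in this sketch is the argument swap, which is why an adjustment of this kind can be close to free.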