
KNN, or K nearest neighbors, is one of the oldest AI algorithms. It has many excellent properties, but one major flaw: speed. Our implementation accelerates KNN to performance levels not imaginable before. For very large problems (1M vectors compared with 1M vectors), the speed-up factor reaches 1000-2000x in comparison to a naïve CPU implementation.

Once the performance problem of KNN is solved, a whole array of applications opens up for which KNN was previously not considered interesting. KNN was recently popularized by the t-SNE algorithm, which relies on KNN internally.

To map out the playing field where KNN could be useful, let's look at some numbers:

- 950k x 950k, 30 elements in vector, K = 50, 8x3.5GHz Skylake with AVX512, single precision takes 78s. With K = 1, 73s.
- 950k x 950k, 30 elements in vector, K = 50, 8x3.5GHz Skylake with AVX512, double precision takes 168s.
- Achieves linear scaling with core count and is thus faster than KD-Tree implementations for large problems.
- The price/performance advantage of CPU over GPU is more than 4x for double-precision math, and the CPU version is simple and easy to deploy.
- Works equally well for symmetric and non-symmetric cases. The compare-to matrix can be different from the data matrix.
- It can be accelerated by the same AI hardware that is used to accelerate deep neural networks.
- In the special case of K = 1, and assuming that all vectors contain quality data, KNN beats a DNN in accuracy, with guaranteed and easily debuggable results.
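To make the baseline concrete, here is a minimal sketch of the naïve K = 1 nearest-neighbor search that the speed-up factors above are measured against. This is plain illustrative Pascal, not the library's ClusteredKNN:

```pascal
type
  TSample = array of Double;

// Squared Euclidean distance between two equally sized samples
function SquaredDistance(const A, B: TSample): Double;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to High(A) do
    Result := Result + Sqr(A[i] - B[i]);
end;

// Index of the nearest neighbor of Query in Data (K = 1, brute force)
function NearestIndex(const Data: array of TSample; const Query: TSample): Integer;
var
  i: Integer;
  Best, D: Double;
begin
  Result := 0;
  Best := SquaredDistance(Data[0], Query);
  for i := 1 to High(Data) do
  begin
    D := SquaredDistance(Data[i], Query);
    if D < Best then
    begin
      Best := D;
      Result := i;
    end;
  end;
end;
```

The naïve approach costs one distance evaluation per data/query pair, which is what makes a 1000-2000x acceleration significant at the 1M x 1M scale.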

Current constraints:

- Only the Euclidean norm is available as the distance measure.
- Multi-threading support exists only along the "compare-to-data" dimension. If you are comparing only one vector against the dataset, only one CPU core will be used. However, solving 950k x 1, 30 elements per vector, K = 50, on 1x4GHz Skylake takes 30ms.

The algorithm is called ClusteredKNN and is located in the MtxForLoop unit.

The main advantage of single precision compared to double is halved memory usage and doubled execution speed. The major new feature of MtxVec v6 is the capability to select algorithm precision at runtime. When a vector or matrix variable is declared and then sized for the first time:

a.Size(10, TMtxFloatPrecision.mvDouble);

You can optionally specify the storage precision. This storage precision can also be set/read with a property:

a.FloatPrecision

When working with the vector/matrix in expressions, the precision is preserved:

b.Sin(a);

This capability allows:

- User specified computational precision from the UI of your application.
- Write initial algorithm in any precision and decide on what is really needed after profiling.

Internally, MtxVec will not strictly use only single precision when the source data is in single precision. Floating-point precision conversion is an expensive operation, however, and the library will avoid anything that might prove costly. At the same time, it will not unnecessarily drop the computational precision from double to single when no performance or storage-size benefit can be expected. This gives the user the best of both worlds.
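Putting the pieces above together, a minimal sketch might look as follows (the unit name is an assumption and should be checked against your MtxVec installation; the rest uses only the API shown above):

```pascal
uses MtxExpr; // unit name assumed; check your MtxVec installation

var
  a, b: Vector;
begin
  // Choose single-precision storage at runtime when sizing for the first time
  a.Size(10, TMtxFloatPrecision.mvSingle);

  // ... fill "a" with data ...

  b.Sin(a); // the result "b" keeps the single-precision storage of "a"

  Assert(b.FloatPrecision = TMtxFloatPrecision.mvSingle); // precision preserved
end;
```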

For source-code users, it is possible to specify which individual external libraries are to be linked in. These libraries are available separately for single- and for double-precision floating-point computation. If either precision is not used or needed, it can be left out of the distribution of the final application, reducing the final package size.

When writing code, which can work with either single or double precision, consider that:

- The default array property (the indexer a[i]) works whether the Vector "a" stores double or single precision, but there is some conversion cost when the storage precision is single. This coding pattern is meaningful where code simplicity helps and performance is not affected much, because it is not used within tight loops.
- In tight loops where performance matters, it is recommended to split the code into double- and single-precision (float) branches by checking the value of:

a.IsDouble

and access:

a.Values[i]

for double precision and

a.SValues[i]

for single precision.
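For example, a sum over all elements could be branched as follows (a minimal sketch; the Length property name is an assumption, the rest is the API shown above):

```pascal
var
  i: Integer;
  Sum: Double;
begin
  Sum := 0;
  if a.IsDouble then
  begin
    for i := 0 to a.Length - 1 do
      Sum := Sum + a.Values[i];   // double-precision storage, no conversion
  end
  else
  begin
    for i := 0 to a.Length - 1 do
      Sum := Sum + a.SValues[i];  // single-precision storage, no conversion
  end;
end;
```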

Of course, the corresponding properties/values are available also for the complex number flavors:

a.CValues[i]

for double precision and

a.SCValues[i]

for single precision.


The way to massively parallel computation

The vector and matrix objects provided by MtxVec are "super conductive" for multiple threads: when multiple objects are allocated and freed from multiple threads, there is never contention or a synchronization lock. This allows multi-threaded applications to use MtxVec expressions like a + b without worrying about memory-allocation bottlenecks. Without MtxVec, such expressions can result in nearly 100% thread contention. We benchmarked the evaluation speed of the following function while varying the size of the input vectors:

function TestFunction(x, m, b: TVec): Vector;
var
  xm: Vector;
begin
  xm.Adopt(x); // adopted by Vector to allow use of expressions
  Result := 0.5*(1.0 + sgn(xm - m)*(1.0 - Exp(-Abs(xm - m)/b)));
end;

The memory management features:

- Dedicated memory allocated per thread typically does not exceed the CPU cache size (2MB). This makes the operation very memory- and CPU-cache efficient.
- MtxVec memory management does not interfere with other parts of the application which continue to use the default memory manager. Only those parts of code using MtxVec based vector/matrix objects are affected.
- Intelligent memory reuse keeps the working set of memory tightly packed and small, taking optimal advantage of the CPU cache design.
- Truly linear scaling of code on multiple cores becomes a reality.
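As a sketch of the intended usage pattern, each worker thread can allocate and free its own vector objects without contending with the others. This is a minimal illustration under the assumption that the Vector type and expression operators shown above are in scope:

```pascal
uses Classes; // TThread

procedure RunWorkers(WorkerCount: Integer);
var
  i: Integer;
begin
  for i := 1 to WorkerCount do
    TThread.CreateAnonymousThread(
      procedure
      var
        a, b, c: Vector;
      begin
        a.Size(1000, TMtxFloatPrecision.mvDouble);
        b.Size(1000, TMtxFloatPrecision.mvDouble);
        // the temporaries created by "a + b" come from this thread's own
        // memory pool, so no cross-thread synchronization lock is taken
        c := a + b;
      end).Start;
end;
```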

The first picture shows an example using two cores, comparing the processing gains of the super conductive memory manager with the standard memory manager as a function of vector length:

As the number of cores increases, the minimum vector length for which the performance penalty is minuscule also increases. As the vector length increases, the pressure on the CPU cache increases, possibly nullifying any gains from multi-threading. With MtxVec these problems can be completely avoided and fully linear scaling achieved independently of vector length. The second picture was made on a quad-core CPU:

On a quad-core CPU the differences grow, showing a larger advantage (gain factors above 4). This indicates that the default memory manager does not scale beyond two cores for this type of processing, and that for short vectors the gains of using more than two cores can even be negative.

The source code of the benchmark is included with Dew Lab Studio trial version and Dew Lab Studio demos.

It was Delphi 2006 that first introduced the ability to overload operators on records. This feature enabled MtxVec to add high performance support for using vectors, matrices and complex numbers in expressions.

MtxVec expression evaluator can be used to evaluate individual expressions or it can evaluate a list of expressions (a script).

© DewResearch 1997 - 2022 All Rights Reserved.


Delphi & C++ Builder are registered trademarks
of Embarcadero Corporation. All other brands and product names are trademarks or registered trademarks of their respective owners.