Brute-force exact KNN
K-NN, or K nearest neighbors, is the grandfather of AI algorithms. It has many excellent properties, but one major flaw: speed. Our implementation accelerates KNN to performance levels that were previously unimaginable. For very large problems (1M vectors compared against 1M vectors), the speed-up factor reaches 1000-2000x compared to a naïve implementation on CPUs.
Once the performance problem is fixed, this opens up a whole array of applications for which KNN was not previously considered interesting. KNN was recently popularized by the t-SNE algorithm, which relies on KNN internally.
To open up the playing field where KNN could be useful, let's have a look at some numbers:
- 950k x 950k, 30 elements per vector, K = 50, on 8x 3.5 GHz Skylake cores with AVX-512, single precision: 78 s. With K = 1: 73 s.
- 950k x 950k, 30 elements per vector, K = 50, on 8x 3.5 GHz Skylake cores with AVX-512, double precision: 168 s.
- Achieves linear scaling with core count and is thus faster than KD-Tree implementations for large problems.
- The price/performance advantage of CPU over GPU is more than 4x for double-precision math, and CPU deployment is simple and easy.
- Works equally well for symmetric and non-symmetric cases: the compare-to matrix can be different from the data matrix (see the sketch after this list).
- It can be accelerated by the same AI hardware that is used to accelerate deep neural networks.
- In the special case of K = 1, and assuming that all vectors contain quality data, KNN beats a DNN in accuracy, with guaranteed and easily debuggable results.
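For reference, the following is a minimal, single-threaded NumPy sketch of what brute-force exact KNN computes; it is illustrative only and is not the ClusteredKNN implementation. Every row of a compare-to (query) matrix is matched against every row of the data matrix under the Euclidean norm, and the K closest data rows are reported. All names in the sketch are hypothetical.

```python
import numpy as np

def brute_force_knn(data, queries, k):
    """Exact KNN sketch: for each query row, return the indices and
    Euclidean distances of its k nearest rows in `data`.

    data    : (n, d) array -- the dataset
    queries : (m, d) array -- the compare-to matrix (may differ from data)
    k       : number of neighbours to return
    """
    # Squared Euclidean distances via ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
    d2 = (np.sum(queries**2, axis=1)[:, None]
          - 2.0 * queries @ data.T
          + np.sum(data**2, axis=1)[None, :])
    np.maximum(d2, 0.0, out=d2)  # guard against tiny negative values from round-off

    # Partial selection of the k smallest distances per query row, then sort those k
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    rows = np.arange(queries.shape[0])[:, None]
    order = np.argsort(d2[rows, idx], axis=1)
    idx = idx[rows, order]
    return idx, np.sqrt(d2[rows, idx])

# Non-symmetric example: 1000 data vectors, 5 query vectors, 30 elements each, K = 50
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 30))
queries = rng.standard_normal((5, 30))
neighbours, distances = brute_force_knn(data, queries, k=50)
print(neighbours.shape, distances.shape)  # (5, 50) (5, 50)
```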
Current constraints:
- Only the Euclidean norm is supported as the distance metric.
- Multi-threading is supported only along the "compare-to-data" dimension. If you are comparing only one vector to the dataset, only one CPU core will be used (see the threading sketch at the end of this section). However, solving 950k x 1, 30 elements per vector, K = 50, on 1x 4 GHz Skylake core takes 30 ms.
The algorithm is called ClusteredKNN and is located in the MtxForLoop unit.
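To illustrate the threading constraint listed above, this hypothetical sketch parallelizes only along the compare-to (query) dimension, reusing the brute_force_knn helper from the earlier sketch: each worker searches the full dataset for its own slice of query rows, so a single query row leaves all but one worker idle. It does not reflect the internals of ClusteredKNN.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def knn_threaded_over_queries(data, queries, k, n_threads=8):
    """Parallelize only along the compare-to (query) dimension:
    each thread searches the full dataset for its own slice of query rows."""
    chunks = np.array_split(np.arange(queries.shape[0]), n_threads)
    chunks = [c for c in chunks if c.size > 0]  # a single query yields a single chunk
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        results = list(pool.map(lambda c: brute_force_knn(data, queries[c], k), chunks))
    # Stitch the per-chunk results back together in query order
    idx = np.vstack([r[0] for r in results])
    dist = np.vstack([r[1] for r in results])
    return idx, dist
```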