We took an Intel Core i7-7820X for a spin and compared its speed-up for scientific computations against an Intel Core i5-4670. The table below shows some results, which are typical across a large range of scientific algorithms. The test run is from our "Efficient multithreading" example in the MtxVec demo. The code computes a DFT using vectorized sin, cos, add, multiply and vector sum operations.
|Test||i5-4670, 32bit, 4cores||i7-7820X, 32bit, 4cores||i7-7820X, 64bit, 4cores||i7-7820X, 64bit, 8cores|
|Pascal, one core (not vectorized)||40.24s||34.59s||35.62s||35.19s|
|One CPU core (vectorized)||7.12s||5.86s||3.72s||3.77s|
|With blocks, one CPU core||6.80s||4.67s||2.44s||2.40s|
|With hand-written blocks||5.75s||4.25s||1.75s||1.76s|
|Threaded, with blocks||1.77s||1.22s||0.55s||0.34s|
|Threaded, blocks, Anonymous||1.78s||1.18s||0.57s||0.33s|
|Threaded, hand-written, DoForLoop||1.54s||1.11s||0.43s||0.27s|
|Threaded, blocks, TParallel.For||2.93s||2.27s||1.20s||0.97s|
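To make the "with blocks" rows more concrete, here is a minimal sketch of a direct DFT built from sine and cosine sums, once over the whole signal and once accumulated block by block. It is written in plain Python and is not the MtxVec demo code: the function names (`dft_bin`, `dft_bin_blocked`, `dft`) and the default `block_size` are hypothetical, and list comprehensions merely stand in for the vectorized sin, cos, multiply and sum calls a vector library would provide.

```python
import math

def dft_bin(x, k):
    """Direct DFT of frequency bin k in one pass over the whole signal."""
    n = len(x)
    w = 2.0 * math.pi * k / n
    re = sum(v * math.cos(w * i) for i, v in enumerate(x))
    im = -sum(v * math.sin(w * i) for i, v in enumerate(x))
    return complex(re, im)

def dft_bin_blocked(x, k, block_size=256):
    """Same bin, accumulated block by block.

    Each block is small enough that its temporaries (phase, cos, sin)
    could stay in CPU cache in a real vectorized implementation.
    block_size=256 is an arbitrary illustrative value.
    """
    n = len(x)
    w = 2.0 * math.pi * k / n
    re = 0.0
    im = 0.0
    for start in range(0, n, block_size):
        block = x[start:start + block_size]
        phase = [w * (start + i) for i in range(len(block))]
        cos_b = [math.cos(p) for p in phase]   # stand-in for a vectorized cos
        sin_b = [math.sin(p) for p in phase]   # stand-in for a vectorized sin
        re += sum(a * c for a, c in zip(block, cos_b))
        im -= sum(a * s for a, s in zip(block, sin_b))
    return complex(re, im)

def dft(x, block_size=256):
    """Full DFT: every bin is independent, so bins can be split across threads."""
    return [dft_bin_blocked(x, k, block_size) for k in range(len(x))]
```

Because every frequency bin is computed independently, the outer loop in `dft` is the natural place to distribute work across cores, which is what the threaded rows of the table measure.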
The code executed with MtxVec takes full advantage of all instruction set features, including the AVX-512 support of the i7-7820X. Note that the "turbo" frequencies of the two CPUs differ. When using AVX, the CPU will also not "turbo boost" up to its highest frequency. The i7-7820X was mostly boosting up to 4.0GHz, while the i5-4670 remained at 3.4GHz. The test was run with the "default" optimized motherboard configuration and without overclocking.
The best results, separately for single core (1.76s) and multi-core (0.27s), are in the rightmost column. It appears that the Intel software tools (compiler + libs) only optimize for AVX-512 in 64bit apps. In this (64bit) case the performance improvement per core is about 1.11/0.43 ≈ 2.6x, taking the 32bit and 64bit 4-core DoForLoop times on the i7-7820X. In the case of 32bit apps, the gain between both CPUs is only about 1.3x. The ratio of the fastest code path on the 7820X against non-optimized code reaches a factor of 35/0.27 ≈ 130x when all 8 cores are used with AVX-512. The fastest code path running on one core gives a gain of 35/1.76 ≈ 19.9x.
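The ratios quoted above can be rechecked directly from the table. The short sketch below does this in Python. The timings are copied from the table, while the dictionary layout and the `speedup` helper are only illustrative names. Note that the text rounds the non-vectorized baseline to 35s, whereas the sketch uses the measured 35.19s, so the last two ratios come out slightly higher.

```python
# Timings in seconds, copied from the benchmark table.
# Column order: (i5-4670 32bit 4 cores, i7-7820X 32bit 4 cores,
#                i7-7820X 64bit 4 cores, i7-7820X 64bit 8 cores)
times = {
    "pascal_one_core": (40.24, 34.59, 35.62, 35.19),
    "hand_written_blocks": (5.75, 4.25, 1.75, 1.76),
    "threaded_doforloop": (1.54, 1.11, 0.43, 0.27),
}

def speedup(slow, fast):
    """How many times faster the 'fast' run is compared to the 'slow' run."""
    return slow / fast

doforloop = times["threaded_doforloop"]
baseline = times["pascal_one_core"][3]

# 32bit vs 64bit on the i7-7820X, 4 cores
gain_64bit = speedup(doforloop[1], doforloop[2])
# i5-4670 vs i7-7820X, both 32bit, 4 cores
gain_32bit = speedup(doforloop[0], doforloop[1])
# non-vectorized Pascal vs fastest threaded path, 8 cores
gain_all_cores = speedup(baseline, doforloop[3])
# non-vectorized Pascal vs fastest single-core path
gain_one_core = speedup(baseline, times["hand_written_blocks"][3])

for name, value in [
    ("64bit vs 32bit (7820X)", gain_64bit),
    ("7820X vs 4670 (32bit)", gain_32bit),
    ("fastest path, 8 cores", gain_all_cores),
    ("fastest path, one core", gain_one_core),
]:
    print(f"{name}: {value:.1f}x")
```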
Interestingly enough, dgemm, on which linear algebra (LAPACK) mostly depends, shows only a 30% gain even in 64bit mode. This is possibly related to missing AVX-512 instructions that are available only on 7900X-series CPUs and some Xeon CPUs. More AVX-512 capable CPUs are scheduled to be released in 2018 and 2019.
AVX-512 largely delivers on the promise of increasing the performance per clock by about 2x, even in heavily multithreaded scenarios. This fact, however, is largely absent from the various benchmarks that can be found on the internet. Either the tested applications are not 64bit, or they are not yet properly optimized for AVX-512 (instructions + memory bandwidth). When compared to the i7-8700K, multimedia and scientific benchmarks should show an advantage of about 1.8x per core for the i7-7820X.