Cougar Open CL
Open CL is a standard designed to make it easier to write high-performance code for GPU and CPU devices and to make that code portable across different GPU devices. Open CL drivers for CPU and GPU offer many notable features:
- Cross-platform support. The same code runs on embedded devices (such as mobile phones), desktop PCs and supercomputers across a wide range of operating systems.
- Support for both ATI and Nvidia GPUs.
- Support for CPU devices. There is a good chance that an extended Open CL will become the main target for accelerated code running on CPUs. Both Intel and AMD currently offer their own drivers that run Open CL code on CPUs.
- Dynamic code compilation. The compiler is included with the drivers and the code is compiled only for the target device. End users running applications can specify expressions which (through Open CL) run on the GPU or are compiled into native CPU code (see the kernel sketch after this list).
- Open CL drivers are free for supported platforms.
- End-user applications can be distributed without any DLLs.
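As an illustration of dynamic compilation, the sketch below embeds Open CL C kernel source in a Delphi string constant. The constant name, kernel name and parameter list are hypothetical; the point is only to show the kind of source the driver compiles for the target device at run time.
const
  // Hypothetical kernel source (illustration only); the Open CL driver compiles
  // it at run time for whichever CPU or GPU device is selected.
  cMulAddKernel =
    '#pragma OPENCL EXTENSION cl_khr_fp64 : enable' + sLineBreak +
    '__kernel void mul_add(__global const double* a,' + sLineBreak +
    '                      __global const double* b,' + sLineBreak +
    '                      __global double* c)' + sLineBreak +
    '{' + sLineBreak +
    '  int i = get_global_id(0);' + sLineBreak +
    '  c[i] = sin(a[i]) + cos(b[i])*a[i];' + sLineBreak +
    '}';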
Features of Cougar Open CL
- Open CL based numerical library for Delphi and C++Builder users
- Uses dynamic Open CL dll loading, so it can be included in end-user applications that may run on machines without Open CL drivers.
- Automatically detects all platforms (Intel, AMD, NVidia) and devices and loads their parameters.
- Provides routines to store encrypted source code into .res resource files that are embedded into the final application.
- Caches binaries compiled on the first run for subsequent faster load times.
- Automatically detects changes to the hardware or Open CL driver versions and rebuilds the cached binaries.
- Loads all the kernels (functions) present in the Open CL source code, including their properties.
- Implements a shared context between CPU and GPU devices for more efficient heterogeneous computing (one context per platform).
- Allows build options and source headers to be specified at program load time, optionally requesting an unconditional recompile.
- Can automatically detect the fastest device in the system and run the code on it (a hedged sketch follows this list).
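A minimal sketch of how device detection and selection could look in application code is shown below. Only the clVector type and its CopyFromArray method appear elsewhere in this document; clPlatform, MarkFastestDevice, FastestDevice and Device are assumed, illustrative names rather than the library's confirmed API.
var
  a, c: clVector;
begin
  { Hypothetical calls: clPlatform, MarkFastestDevice, FastestDevice and Device
    are illustrative names only. }
  clPlatform.MarkFastestDevice;          // benchmark the detected devices once
  a.Device := clPlatform.FastestDevice;  // place the vector on the fastest device
  a.CopyFromArray(cDoubleArray);         // copy data from CPU memory to device memory
  c := sin(a);                           // the kernel now runs on the selected device
end;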
MtxVec for Open CL
- Implements all standard math functions
- Support for real and complex numbers across all functions (where applicable)
- Makes use of the object cache concept known from MtxVec for faster memory handling and higher performance.
- Implements separate kernels for CPU and GPU devices to achieve best performance on both architectures.
- Can run in single and double precision concurrently.
- Integrated support for debugger visualizers allows GPU code to be debugged as if it were running on the CPU.
- Delivers over 500 unique kernels. Counting the single/double precision and CPU/GPU variants, the total is well over 2000.
- Full support for operator overloading.
- Supports multiple automatic code fallback scenarios. Even when no Open CL driver is detected, the code will still run: without Open CL it can run with Delphi code only (no external DLLs) or with native MtxVec using the Intel IPP and MKL performance libraries. When native MtxVec is found to be faster than Open CL, it automatically defaults to it.
- Supports execution of “micro” kernels. Micro kernels are short functions which normally could not be accelerated with Open CL.
- The performance penalty for micro kernels is estimated at 50% of peak performance on GPU devices. In return, the code is very simple to write and debug, with programmer productivity matching work on the CPU.
Code example:
var
  a, b, c: clVector;
begin
  a.CopyFromArray(cDoubleArray); // copy data from CPU memory to GPU memory
  b.CopyFromArray(cDoubleArray);
  c := sin(a) + cos(b)*a;        // expression is evaluated on the device
  c.CopyToArray(cDoubleArray);   // copy the result from GPU memory back to CPU memory
end;
The debugger allows you to stop on every line and examine the contents of all variables residing on the GPU as though they were simple arrays.