Setting up OpenCL
HOWTO: Setup for OpenCL development in Visual Studio 2008
- Download and install the latest NVIDIA drivers from https://www.nvidia.com/Download/index.aspx?lang=en-us (NOTE: "developer" drivers are available, but I'm not sure they add anything if you just use OpenCL - will investigate later)
- Download and install the NVIDIA GPU Computing Toolkit aka CUDA Toolkit (I'm using version 3.2, 32bit) from here: https://developer.nvidia.com/object/cuda_3_2_downloads.html
- Create your OpenCL project in Visual Studio (I'm using 2008, but I imagine it applies to other versions).
- In Project Properties, go to the "C/C++" properties and add this path to your include directories:
- Win7: C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include
- WinXP: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include
- Still in Project Properties, go to the "Linker -> General" settings and add this path to Additional Library Directories:
- Windows 7: C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v3.2\lib\Win32
- WinXP: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\lib\Win32
- Still in the Linker settings, under "Linker -> Input", add "OpenCL.lib" to "Additional Dependencies".
- Alternative: put this in one of your files: #pragma comment(lib, "OpenCL.lib") (the sanity check below uses this)
- You are good to go!
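As a quick sanity check that the include path and OpenCL.lib are wired up correctly, something like the following minimal program should build and run (it only asks the driver how many OpenCL platforms it can see; this is my own sketch, not code from any of the links):

    // Minimal sanity check: does CL/cl.h resolve and does OpenCL.lib link?
    #include <CL/cl.h>
    #include <stdio.h>

    #pragma comment(lib, "OpenCL.lib")  // or add it under Linker -> Input

    int main(void)
    {
        cl_uint num_platforms = 0;
        cl_int err = clGetPlatformIDs(0, NULL, &num_platforms);
        printf("clGetPlatformIDs returned %d, platforms found: %u\n", err, (unsigned)num_platforms);
        return 0;
    }

If it reports one or more platforms, the driver and toolkit are both installed correctly.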
HOWTO: Basic OpenCL programming
Use this link to just get some simple code that will compile and run quickly:
https://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/
Summary:
- #include "opencl.h"
- Setup code that I imagine wouldn't change much (basically initialise the device and create the command queue where commands will be stored) - see the sketch after this list
- Allocate memory on the GPU with clCreateBuffer(...)
- Allocate any memory you may need locally too.
- Write your "kernel"/"kernels"
- Kernels are written in C-like language.
- A kernel is a single program, and many copies of it a run simultaneously, each in a different thread.
- Each kernel has can call a special function, get_global_id(...), that tells it what index it is. For example, if you were adding two vectors A+B=C, you would run a kernel for each element of C, and get_global_id would tell you what element you should be calculating.
- So: int i = get_global_id(0); C[i] = A[i] + B[i];
- The trick to writing kernels is to:
- figure out how to break your operation into its parallel parts (not too bad - it's usually just the inside of your FOR loop)
- be as smart as you can with memory access.
- To expand on that second point: to get the best results you have to be really clever with memory. By that I don't just mean moving data to the GPU (although moving data between the GPU and CPU all the time will kill you fast); I also mean the ratio of memory accesses to arithmetic operations, and the way memory is laid out in GPU RAM. This is why vector addition will never beat the CPU - two memory reads for one addition is not a good deal.
- Compile the kernel at runtime (!!! - an OpenCL limitation; CUDA can compile ahead of time) and create a kernel object (clCreateProgramWithSource, clBuildProgram, then clCreateKernel).
- Set the inputs of your kernel using clSetKernelArg(...) (e.g. the first argument points to the A vector in GPU RAM, the second to B, the third to C). You can also pass plain values from CPU RAM (e.g. an integer holding the size of a matrix), but I'm not sure whether it would be better to put those on the GPU too.
- Initialise everything (GPU and CPU). Memory is copied to the GPU using clEnqueueWriteBuffer(...).
- Perform your operation using clEnqueueNDRangeKernel(...). There are three key arguments. The first is the kernel to run. The second is the global problem size (e.g. the length of the vector). The third is the local work-group size, i.e. how many work-items get grouped together. You might think setting it to 1 makes sense (one element each, which is what the example link above does), but it turns out you can pass NULL and let the driver figure it out. When I did this I got better performance, so this may be the way to go.
- Pull your result back into CPU memory with clEnqueueReadBuffer(...).
- Standard de-allocation code to finish up.
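To make the steps above concrete, here is a minimal sketch of the whole flow for the A+B=C vector addition example. It assumes the Visual Studio setup from the previous section and that the file is built as C++ (or C99); all names are my own and error checking is omitted, so treat it as a sketch rather than the linked tutorial's actual code:

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    // The kernel source, compiled at runtime. One copy runs per element of C.
    static const char *kSource =
        "__kernel void vec_add(__global const float *A,\n"
        "                      __global const float *B,\n"
        "                      __global float *C) {\n"
        "    int i = get_global_id(0);\n"
        "    C[i] = A[i] + B[i];\n"
        "}\n";

    int main(void)
    {
        const size_t n = 1024;
        float *A = (float *)malloc(n * sizeof(float));
        float *B = (float *)malloc(n * sizeof(float));
        float *C = (float *)malloc(n * sizeof(float));
        for (size_t i = 0; i < n; i++) { A[i] = (float)i; B[i] = 2.0f * (float)i; }

        // Setup code: pick a device, create a context and a command queue.
        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        // Allocate memory on the GPU and copy the inputs over.
        cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
        cl_mem dB = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
        cl_mem dC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
        clEnqueueWriteBuffer(queue, dA, CL_TRUE, 0, n * sizeof(float), A, 0, NULL, NULL);
        clEnqueueWriteBuffer(queue, dB, CL_TRUE, 0, n * sizeof(float), B, 0, NULL, NULL);

        // Compile the kernel at runtime, create the kernel object, set its arguments.
        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "vec_add", NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &dA);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &dB);
        clSetKernelArg(kernel, 2, sizeof(cl_mem), &dC);

        // Run one work-item per element; a NULL local size lets the driver choose.
        size_t global = n;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

        // Pull the result back into CPU memory (blocking read).
        clEnqueueReadBuffer(queue, dC, CL_TRUE, 0, n * sizeof(float), C, 0, NULL, NULL);
        printf("C[10] = %f\n", C[10]);

        // Standard de-allocation code to finish up.
        clReleaseKernel(kernel); clReleaseProgram(prog);
        clReleaseMemObject(dA); clReleaseMemObject(dB); clReleaseMemObject(dC);
        clReleaseCommandQueue(queue); clReleaseContext(ctx);
        free(A); free(B); free(C);
        return 0;
    }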
Benchmarking CPU and GPU performance
Parameters: number of times to run test, vector size n.
TEST: Matrix * Vector = Vector
Method:
- Allocate memory for matrix A (size n by n) and vector B (size n) on CPU. Allocate memory for destination vector C (size n)
- Initialise each element to a float related to the memory index (just to put something in there)
- Setup OpenCL
- Create the context and command queue
- Load and compile the vector-matrix multiplication kernel:
    __kernel void vector_matrix(__global const float *A, __global const float *B, __global float *C, int m) {
        // Index of the output element this work-item computes
        int i = get_global_id(0);
        // Dot product of row i of the (column-major, m x m) matrix A with vector B
        float result = 0.0f;
        for (int j = 0; j < m; j++) {
            result += A[i + j*m] * B[j];
        }
        C[i] = result;
    }
- Allocate memory on the GPU for A, B, C
- Copy A,B to GPU, and set A, B, C as the arguments for the kernel.
- GPU timing (sketched after this list):
- Start timer
- Run clEnqueueNDRangeKernel however many times you want (I used 1000 times).
- Make sure all the queued kernels actually execute before stopping the timer, by calling clFinish(command_queue) (the GTX480 seemed to buffer them a lot more)
- End timer
- CPU timing
- Start timer
- Run the code however many times you want (e.g. 1000 times)
- End timer
- Done! Run this for a variety of sizes n and plot the results
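To show the structure of the two timing loops, here is a rough sketch. It assumes the matrix A (n by n, column-major), vector B, result vector C, and the OpenCL handles queue and kernel from the steps above already exist, and it uses clock() purely for illustration - the actual timer behind the numbers below may have been different:

    #include <stdio.h>
    #include <time.h>

    // GPU timing: enqueue the kernel many times, then clFinish() so the whole
    // batch actually executes before the timer is stopped.
    size_t global = n;                     // one work-item per output element
    clock_t t0 = clock();
    for (int run = 0; run < 1000; run++) {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    }
    clFinish(queue);                       // the GTX480 buffers the work aggressively
    double gpu_seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;

    // CPU timing: the same matrix-vector product (column-major A), run 1000 times.
    t0 = clock();
    for (int run = 0; run < 1000; run++) {
        for (size_t i = 0; i < n; i++) {
            float result = 0.0f;
            for (size_t j = 0; j < n; j++) {
                result += A[i + j * n] * B[j];
            }
            C[i] = result;
        }
    }
    double cpu_seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("GPU: %.3f s, CPU: %.3f s\n", gpu_seconds, cpu_seconds);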
Results
Size n:                     |    10 |    50 |   100 |   500 |  1000 |   2000
NVIDIA Quadro FX580         | 0.031 | 0.062 | 0.093 | 0.516 | 1.486 |  5.906
Intel Xeon W3520 @ 2.67 GHz | 0.008 | 0.016 | 0.031 | 0.969 | 8.907 | 38.296
NVIDIA GTX 480              | 0.008 | 0.024 | 0.032 | 0.205 | 0.397 |  0.791
Intel Core i7 @ 3.07 GHz    | 0.008 | 0.008 | 0.027 | 0.805 | 7.661 | 32.483
n^2 / 100,000               | 0.001 | 0.025 | 0.1   | 2.5   | 10    | 40
Comments
- All done with floats.
- Note that no memory is exchanged between CPU and GPU during the timing period. This is really crucial. If I did put it in, it would cripple the GPU because it would spend most of the time thrashing the memory. A good algorithm that utilised the GPU would not be moving much memory between CPU and GPU, especially whole matrices. Certainly RSM never needs to move the basis between GPU and CPU, only vectors.
- The performance of the CPUs is in line with the problem complexity.
- The Xeon is a quad core, and the i7 is too (8 threads with hyperthreading), but only one core is being used. The improvement from the Xeon to the i7 is down to clock speed and any caching improvements. Even if all 8 hyperthreads could be used as efficiently as the single thread, it still would not be faster than the GTX480.
- The GTX 480 has 480 "CUDA cores", the Quadro has 32. The GTX 480 also has a higher clock speed and GDDR5 RAM versus GDDR3. It performs about 7 times better than the Quadro at n = 2000.
- Fitting a linear trendline to the GTX480 numbers using Excel gives an R^2 value of 0.9997, which is awesome considering it's an n^2 problem!