I am using a headless Linux system (no Xorg) on a 32x Opteron 6300 machine with 2 R7 200 cards.
I managed to develop my application so it works on all 3 devices with the fglrx driver (15.12).
I ported my OpenCL kernel from working ANSI C code. Since the code does not do any memory management of its own, the whole program uses variables declared inside the functions (which end up in private memory in OpenCL, I believe). The subfunctions called from the kernel mainly take arguments that are pointers to private memory.
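To illustrate the pattern, here is a simplified sketch in plain C. It is not my actual code: the names `mix_state` and `process_item` and the size `STATE_LEN` are made up for this example. The point is only that all working data is a local variable and helpers receive a pointer to it.

```c
#include <stddef.h>

#define STATE_LEN 16  /* hypothetical size, for illustration only */

/* Helper: receives a pointer to the caller's local array.
   In the OpenCL port this becomes a pointer to private memory. */
static void mix_state(unsigned char *state, size_t len, unsigned char seed)
{
    for (size_t i = 0; i < len; ++i)
        state[i] = (unsigned char)(state[i] + seed + i);
}

/* Stand-in for the kernel body: no malloc, no globals,
   only variables declared inside the function. */
static unsigned int process_item(unsigned char seed)
{
    unsigned char state[STATE_LEN] = {0};  /* private memory in OpenCL */
    unsigned int sum = 0;

    mix_state(state, STATE_LEN, seed);     /* pointer to the local array */
    for (size_t i = 0; i < STATE_LEN; ++i)
        sum += state[i];
    return sum;
}
```

In the OpenCL version, `process_item` would roughly correspond to the body of the `__kernel` function and `state` to a per-work-item array.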
Now, here’s my problem:
The kernel runs 64x as fast on the Opterons as it does on the Radeons.
I suspect the GPU is pushing variables back and forth to global memory, though I am not sure about this.
According to valgrind, the C code used about 1000 bytes of memory during execution.
It does not use any big global arrays; the biggest and most-used array is an array of 128 uchars in constant memory.
How can I profile the kernel's memory management on a headless system? I know I should use the AMD APP SDK, but I can’t find anything about profiling on a headless system.