My knowledge of GPU computing is a little outdated (from SM3 times), so I decided to have a look at OpenCL. As a first test, I wrote this kernel, which should emulate a simple 32-bit quad-core RISC CPU per work-item.
I’m slightly surprised by the performance numbers I get, though not so much by the overall throughput. The inner loop seems to reach only about 250k cycles/s. The GPU is said to be clocked at 500 MHz, and most lines of the code look like they should compile to something that, according to the GPU manufacturer, can be done multiple times per clock. I definitely don’t yet see where about 2K GPU cycles per loop cycle are going…
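As a quick sanity check of that 2K figure, here is the arithmetic as a small Python sketch. The function name is made up for illustration; the two inputs are just the 500 MHz clock and the roughly 250k loop cycles per second quoted above.

```python
def gpu_cycles_per_iteration(clock_hz: float, iterations_per_second: float) -> float:
    """GPU clock cycles that elapse during one iteration of the emulator loop."""
    return clock_hz / iterations_per_second


clock_hz = 500e6   # advertised GPU clock (500 MHz)
loop_rate = 250e3  # measured inner-loop rate (about 250k cycles/s)

print(gpu_cycles_per_iteration(clock_hz, loop_rate))  # 2000.0
```

This only restates the measurement; it says nothing about where those cycles are spent.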
Furthermore: is it safe to assume that OpenCL work-items can always see their own memory accesses (self-sync)?
You have a lot of memory reads. GPUs don’t have the out-of-order capabilities CPUs enjoy. They normally rely on a massive number of threads in flight, so that whenever the current thread group (a wavefront on AMD, a warp on NVIDIA) runs into a cache miss, the hardware can switch over to another one. You only have 240 threads, so your whole GPU often stalls completely due to memory latency.
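A very rough way to picture this latency hiding is the toy model below. It is only a sketch: the function name and both example numbers (400 cycles of memory latency, 20 ALU cycles between memory accesses) are invented for illustration and were not measured on any real GPU.

```python
import math


def wavefronts_to_hide_latency(memory_latency_cycles: int,
                               alu_cycles_between_accesses: int) -> int:
    """Toy estimate of how many wavefronts a compute unit needs in flight.

    While one wavefront waits for memory, each of the others can keep the
    ALUs busy for alu_cycles_between_accesses cycles before it stalls too.
    """
    others = math.ceil(memory_latency_cycles / alu_cycles_between_accesses)
    return 1 + others  # the waiting wavefront plus the ones covering for it


# Hypothetical numbers, chosen only to show the shape of the estimate.
print(wavefronts_to_hide_latency(400, 20))  # 21
```

With fewer wavefronts in flight than such an estimate suggests, the ALUs would sit idle for part of every memory access.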
Each main loop cycle has 3*4 memory accesses, so an overall throughput of 80 MHz sums up to about 1G independent memory accesses per second (= 4 GB/s of data). This sounds very heavy at first, especially given that this GPU shares memory with the CPU. But since the accesses are limited to small memory areas, they could probably all be served from cache. That seems to be the case: commenting out one of the 3 memory access blocks doesn’t really change much and feels more like “ok, the code is a little shorter now”.
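Spelled out as a small Python sketch, the traffic estimate looks like this. The names are made up for illustration, and the 4 bytes per access is an assumption based on the emulated CPU being 32-bit.

```python
ACCESSES_PER_ITERATION = 3 * 4  # three access blocks, each 4 wide
BYTES_PER_ACCESS = 4            # assumption: one 32-bit word per access


def memory_traffic(iterations_per_second: float) -> tuple[float, float]:
    """Return (accesses per second, bytes per second) for the whole GPU."""
    accesses = iterations_per_second * ACCESSES_PER_ITERATION
    return accesses, accesses * BYTES_PER_ACCESS


accesses, bytes_per_second = memory_traffic(80e6)  # about 80M loop cycles/s overall
print(f"{accesses / 1e9:.2f} G accesses/s, {bytes_per_second / 1e9:.2f} GB/s")
# prints: 0.96 G accesses/s, 3.84 GB/s
```

So the "1G accesses, 4 GB/s" figures are the rounded-up versions of 0.96G and 3.84 GB/s.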
It turned out that the compiler didn’t seem to have any problem mapping my instructions (see the code linked above; it’s almost entirely swizzle-free 4-vec) to the ISA’s ALU packing (95%). I don’t remember the exact number, but the ISA instruction listing showed roughly 70 “co-issue groups” (or whatever they are called) for the main loop.
It also said the ALU was busy 80% of the time, whereas FETCH was busy only 7% of the time. It didn’t count any stall events.
So why does the computation take so long? Scaling the clock by the ALU-busy percentage gives about 400 MHz. So shouldn’t each of these little work-items cycle the loop about 400 MHz / about 70 “instructions” ≈ 5.7M times per second? Why do I see only about 80M cycles/s for the whole GPU?
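To make the gap concrete, here is that estimate as a small Python sketch, compared against the roughly 250k loop cycles per second per work-item measured earlier. It bakes in the same assumption as the reasoning above, namely that a work-item could retire one instruction group per busy clock, which may well be wrong. The function name is made up for illustration.

```python
def expected_iterations_per_second(clock_hz: float,
                                   alu_busy_fraction: float,
                                   groups_per_iteration: int) -> float:
    """Loop rate if one instruction group retired per busy clock (an assumption)."""
    return clock_hz * alu_busy_fraction / groups_per_iteration


expected = expected_iterations_per_second(500e6, 0.80, 70)
observed = 250e3  # measured per-work-item loop rate

print(f"expected: {expected / 1e6:.1f}M iterations/s")
print(f"observed is lower by a factor of about {expected / observed:.0f}")
```

The estimate lands near 5.7M iterations per second, a little over twenty times the measured rate, so the one-group-per-clock assumption is the first thing to question.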
Hmm, but if not all instructions are ALU instructions, the others might consume time too. There is one level of flow control (common to all work-items), for example. The instructions that move data to or from memory show up in the command stream, so they probably aren’t automagically free of cost.
I can’t really verify any of this (where can I read about it?). I still haven’t even found out how many stream processors my device contains.