cl beginner test

Grabungsleiter · August 13, 2017, 12:58pm

Hello,

My knowledge of GPU computing is a little outdated (from SM3 times), so I decided to have a look at OpenCL. As a first test, I wrote this kernel [1], which should emulate a simple 32-bit quad-core RISC CPU per work-item.

I’m slightly surprised by the performance numbers I get, though not so much by the overall throughput. The inner loop seems to reach only about 250k cyc/s. The GPU is said to be clocked at 500 MHz, and most lines of the code look as if they should compile to something that, according to the GPU manufacturer, can be done multiple times per clock. I definitely don’t yet see where 2K GPU cycles per loop cycle (500 MHz / 250k cyc/s) are going…

Furthermore: is it safe to assume that OpenCL work-items can always see their own memory accesses (self-sync)?

[1] abnuto.de/jan/code/quadromat.cl

Salabar · August 13, 2017, 2:25pm

You have a lot of memory reads. GPUs don’t have the out-of-order capabilities that CPUs enjoy; they normally rely on a massive number of threads in flight, so that the hardware can switch to another one whenever the current thread (aka wavefront aka warp) runs into a cache miss. You only have 240, so your whole GPU often stalls completely due to memory latency.

Grabungsleiter · August 13, 2017, 4:27pm

Each main loop cycle has 3*4 memory accesses, so with an overall throughput of 80M loop cycles per second this adds up to about 1G independent memory accesses per second (= 4 GB/s of data). This sounds very heavy at first, especially given that this GPU shares memory with the CPU. But since the accesses are limited to small memory areas, they can probably all be served from cache. That seems to be the case: commenting out one of the 3 memory access blocks doesn’t change much and feels more like “ok, the code is a little shorter now”.
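The estimate above can be checked with a few lines of arithmetic. This is only a sketch in Python: it assumes 32-bit (4-byte) accesses and reads "3*4" as three access blocks of four accesses each, and the constant and function names are made up for illustration.

```python
# Back-of-the-envelope memory traffic estimate from the figures in the post.
ACCESS_BLOCKS_PER_ITERATION = 3    # memory access blocks in the main loop
ACCESSES_PER_BLOCK = 4             # assumed: one access per lane of the 4-vector
BYTES_PER_ACCESS = 4               # assumed: 32-bit words
LOOP_ITERATIONS_PER_SECOND = 80e6  # overall throughput across the whole GPU


def memory_traffic(iterations_per_second):
    """Return (accesses per second, bytes per second) for a given loop rate."""
    accesses = (ACCESS_BLOCKS_PER_ITERATION * ACCESSES_PER_BLOCK
                * iterations_per_second)
    return accesses, accesses * BYTES_PER_ACCESS


accesses_per_second, bytes_per_second = memory_traffic(LOOP_ITERATIONS_PER_SECOND)
print(f"{accesses_per_second / 1e9:.2f} G accesses/s")
print(f"{bytes_per_second / 1e9:.2f} GB/s")
```

With these inputs the result is a little under 1G accesses per second and a little under 4 GB/s, which matches the rounded figures in the post.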

Grabungsleiter · August 14, 2017, 4:51am

I asked AMD CodeXL about it.

It turned out that the compiler had no trouble mapping my instructions (see the code linked above; it’s almost entirely swizzle-free 4-vec) onto the ISA’s ALU packing (95%). I don’t remember the exact number, but the ISA instruction list showed roughly 70 “co-issue groups” (or whatever they are called) for the main loop.

And it said 80% of the time the ALU was busy, whereas FETCH was busy/working only 7% of the time. It didn’t count any stall events.

So why does the computation take so long? Scaling the clock by the ALU-busy percentage gives about 400 MHz. So shouldn’t all these little work-item workers cycle the loop about 400 MHz / about 70 “instructions” = 5.5M times per second? Why do I see only about 80M cyc/s for the whole GPU?

Salabar · August 14, 2017, 9:45am

“And it said 80% of the time the ALU was busy, whereas FETCH was busy/working only 7% of the time.”

This is a sign of being latency bound. Your ALUs are underutilized, so you do not create enough memory requests, so you can’t feed your ALUs.

“Scaling the clock by the ALU-busy percentage gives about 400 MHz. So shouldn’t all these little work-item workers cycle the loop about 400 MHz / about 70 “instructions” = 5.5M times per second?”

Check out how many cycles each instruction takes. Every instruction on GCN or Maxwell+ takes at least 4. It’s probably more on older GPUs.
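To see how much of the gap a per-instruction cycle count could explain, the estimate can be redone with a cost per instruction group. This is only a rough sketch in Python using the rounded figures from this thread; the 4-cycle cost is the GCN figure mentioned above and may not apply to the GPU in question, and all names are made up for illustration.

```python
# Rough loop-rate estimate from the figures quoted in this thread.
CLOCK_HZ = 500e6           # nominal GPU clock from the first post
ALU_BUSY = 0.80            # ALU-busy fraction reported by CodeXL
GROUPS_PER_ITERATION = 70  # approximate instruction groups in the main loop
OBSERVED_RATE = 250e3      # inner-loop rate reported in the first post (cyc/s)


def loop_rate(cycles_per_group):
    """Loop iterations per second if every group costs a fixed number of cycles."""
    return CLOCK_HZ * ALU_BUSY / (GROUPS_PER_ITERATION * cycles_per_group)


# 1 cycle per group is the naive estimate; 4 is the assumed GCN-style cost.
for cycles in (1, 4):
    rate = loop_rate(cycles)
    print(f"{cycles} cycle(s)/group: {rate / 1e6:.2f}M iterations/s, "
          f"{rate / OBSERVED_RATE:.1f}x the observed rate")
```

Under these assumptions, 4 cycles per group brings the estimate down to roughly 1.4M iterations per second. That is still several times the observed rate, so the cycle count alone would not account for the whole difference.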

Grabungsleiter · August 21, 2017, 8:39pm

Hmm, but not all instructions are ALU instructions, and the others might consume time, too. There is one level of flow control (common to all items), for example. The instructions that move data to or from memory show up in the command stream, so they probably won’t automagically be free of cost.

I can’t really check any of that (where can I read about it?). I still haven’t even found out how many stream processors my device contains.
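One way to get at least part of that information is to ask the OpenCL runtime itself. The sketch below assumes the third-party pyopencl package is installed; the function name list_devices is made up for illustration. OpenCL reports compute units (CL_DEVICE_MAX_COMPUTE_UNITS), not stream processors. The number of ALU lanes per compute unit depends on the architecture and has to come from the vendor's documentation.

```python
# Sketch: list OpenCL devices with their compute-unit count and clock.
# Assumes the third-party pyopencl package; without it (or without an
# OpenCL runtime) list_devices() simply returns an empty list.
try:
    import pyopencl as cl
except ImportError:
    cl = None


def list_devices():
    """Return (name, compute_units, clock_mhz) for every visible OpenCL device."""
    if cl is None:
        return []
    devices = []
    try:
        for platform in cl.get_platforms():
            for device in platform.get_devices():
                devices.append((device.name,
                                device.max_compute_units,
                                device.max_clock_frequency))
    except cl.Error:
        # No usable platform or device; report whatever was found so far.
        pass
    return devices


if __name__ == "__main__":
    for name, compute_units, clock_mhz in list_devices():
        print(f"{name}: {compute_units} compute units at {clock_mhz} MHz")
```

The same values are available from C host code through clGetDeviceInfo with CL_DEVICE_NAME, CL_DEVICE_MAX_COMPUTE_UNITS and CL_DEVICE_MAX_CLOCK_FREQUENCY.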