Using OpenCL on a GTX gives slower computation than the CPU. Why?

Hello!

I have some questions regarding simulation times in Realflow and other software packages that use OpenCL acceleration.

First of all, my GTX card reports that it supports OpenCL 1.2.
Here on this site the newest version is 2.2. Is that the reason OpenCL is not computing on the GPU?
If I understand correctly, I can't install 2.2 on the GPU myself; it's something built into the card and driver, right?

When using the (GPU/CUDA) Redshift renderer, for example, I can see that it loads the GPU, and render times are amazing compared to a CPU render.

  1. So, which unit inside the GPU is used when it computes with OpenCL?

Some GPUs list 16-bit/32-bit/64-bit performance in their spec sheets.

For example, the brand-new Quadro GP100 has:

FP64 Performance 5.2 TFLOPS
FP32 Performance 10.3 TFLOPS
FP16 Performance 20.7 TFLOPS

whereas the Quadro P4000 lists only a 32-bit figure:

FP32 Performance 5.3 TFLOPS


  2. Does it mean that the P4000 will never compute 16- and 64-bit operations?

  3. Which one of those three, 16/32/64, is used when computing with OpenCL on the GPU?


Short question: is there any reason my Kepler GTX Titan never gives me good results with OpenCL on the GPU?

Are GTX cards capable of computing any of these (16/32/64), or do only Quadros have that ability?

NVIDIA hasn’t released an OpenCL 2.* driver as of now, though they’ve got beta support recently. But I doubt it matters for Realflow. You’re referring to half, single and double precision floating point numbers. Halves are used for machine learning and computer games, doubles are for scientific computation, and singles are for games and stuff like raytracing rendering. I know Kepler got its double precision capabilities cut severely for better performance in games, so that’s probably the answer. To your last question: no, every modern GPU supports each of those datatypes, just at different rates.
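If you want to check for yourself what your card and driver actually report, here is a minimal sketch in plain C against the OpenCL host API (error handling is omitted, and it simply assumes the first platform belongs to your GPU):

#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void)
{
	cl_platform_id platform;
	cl_device_id device;
	char buf[4096];	/* big enough for the usual info strings */

	clGetPlatformIDs(1, &platform, NULL);
	clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

	clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(buf), buf, NULL);
	printf("Device:         %s\n", buf);

	/* e.g. "OpenCL 1.2 CUDA" - this is what the driver exposes,
	   you cannot install a newer version yourself */
	clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(buf), buf, NULL);
	printf("OpenCL version: %s\n", buf);

	/* cl_khr_fp64 / cl_khr_fp16 say whether double and half precision
	   kernels are supported at all, not how fast they run */
	clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(buf), buf, NULL);
	printf("FP64 support:   %s\n", strstr(buf, "cl_khr_fp64") ? "yes" : "no");
	printf("FP16 support:   %s\n", strstr(buf, "cl_khr_fp16") ? "yes" : "no");
	return 0;
}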

Kepler also had worse compute performance (but better power efficiency) than the older Fermi.

AMD 280X was about 1.9 times faster than expensive Kepler Titan for me in OpenCL.
AMD FuryX is still 1.6 times faster than GTX 1070.
GTX 1070 is 6 (!) times faster than Kepler GTX 670.

So I guess upgrading to a recent Pascal GPU could give you a speed-up of maybe 6× too.
(I get those numbers from developing FP32 algorithms; other algorithms give different numbers.)

But you should ask the developers of Realflow what they recommend. For me, Kepler was at least 10 times faster than a 4-core i7 CPU, so maybe their OpenCL implementation is just not that practical.
AMD is a good option if you don’t need Cuda.

AFAIK GP100 is the only current chip that has a speed advantage for FP16 (the upcoming AMD Vega will have it for consumer hardware), but it will take time until this matters in practice because there is no software for it yet. FP16 can represent colors or normals, but is unlikely to be enough for vertex positions.
I’m not sure if any rendering software uses FP64, but it’s very unlikely. (It would be necessary e.g. for very large scenes with tiny details; usually FP32 is enough because distant details do not matter.)
But TFLOPS might not help you much when picking a new GPU anyway. It’s better to look for benchmarks of your most important tools.

[QUOTE=JoeJ___;42009]Kepler also had worse compute performance (but better power efficiency) than the older Fermi. […][/QUOTE]

That’s what I’m talking about. It’s a real pain in the a… when you have no way to compare;
you have nothing in fact, only rumors that some hardware is better.
Benchmarks too, yes, but for example there is no benchmark like
Realflow Hybrido domain: FirePro vs. Quadro vs. GTX.

Now I can’t understand anything, because I found that my Titan is very good at 32-bit floating point performance.
That means it should be doing something, but again the times are slower than on the CPU.

Maybe it’s the API version. On Quadros OpenCL 2.0 comes with the drivers, and FirePros also have
2.0 by default. Maybe if NVIDIA releases 2.0 in their drivers I will check that too.

Thank you for the answer.

- - - Updated - - -

Thanks for the answer. I did download the latest NVIDIA drivers, but it seems 2.0 is not implemented yet, because GPU-Z still shows me OpenCL 1.2
after installation. Any ideas?

It’s unlikely the API version matters a lot if 1.2 is still supported by Realflow, but you can only wait until NV releases CL 2.0 for consumer HW. (Requesting it in public forums is the best thing you can do for now :wink:)
AMD has 2.0 support on consumer hardware, but watch out with older cards. Fury and RX 480 have it, but Fury’s 4 GB RAM may be a limit.

I agree it’s hard to compare compute performance. Technical specs tell you almost nothing, gaming benchmarks are totally pointless. Compute benchmarks are too rare to get an unbiased impression.

In offline rendering there is the additional problem of transferring a lot of data between CPU and GPU each frame.
If this becomes the bottleneck (which is quite likely), it may not matter whether you spend 200 or 2000 bucks on a GPU. And if that’s the reason Realflow slows down, a stronger GPU would not help at all.
(Be sure you use the proper PCIe slot for the GPU; they can have different transfer rates.)
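If you want to see whether transfer is the bottleneck on your machine, here is a rough sketch of timing an upload with OpenCL profiling events (context, device, host_data and bytes are assumed to exist already; this is just a diagnostic, nothing from Realflow):

	/* queue with profiling enabled, so events carry timestamps */
	cl_int err;
	cl_command_queue queue =
		clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
	cl_mem buf = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &err);

	cl_event ev;
	clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, host_data, 0, NULL, &ev);
	clWaitForEvents(1, &ev);

	cl_ulong t0, t1;	/* nanoseconds */
	clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
	clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);

	double seconds = (t1 - t0) * 1e-9;
	/* compare against roughly 16 GB/s theoretical for a PCIe 3.0 x16 slot */
	printf("upload: %.3f ms, %.2f GB/s\n", seconds * 1e3, bytes / seconds / 1e9);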

My own experience with GPUs for offline rendering and film editing is dated, but spending more than 300 turned out to be wasted money. It took me many years to learn that while ignoring the marketing.
Maybe you can borrow a newer GPU just to see if it makes a difference.

Oh, look there: http://www.realflowforum.com/viewtopic.php?t=11109
This guy has more than 20 times your GPU power, but his CPU is still faster.

[QUOTE=JoeJ___;42011]It’s unlikely the API version matters a lot if 1.2 is still supported by Realflow, but you can only wait until NV releases CL 2.0 for consumer HW. […][/QUOTE]

Thank you for the link.
Realflow is only one side of the conundrum.

In first place, of course, is viewport performance in Maya and the upcoming Arnold renderer (which will most likely be OpenCL GPU based).
Can you share any info regarding Maya? AFAIK Maya is OpenGL based.

A real issue is, I don’t want to buy a $2500 GPU and then find out that a 1080 Ti for $800 does the same job or better.
But GTXs have no ECC, and ECC is important for me; only Quadros and FirePros have it.
I guess I will choose between FirePro and Quadro. Quadro is DAMN expensive; FirePros too, but less so.
Or maybe it’s worth going with AMD FirePros, or the upcoming WX 8100/9100 series.

I know some people who can’t risk buying AMD because of its reputation, and AMD does nothing to refresh that information.
It’s terrible; if I were them I would make tons of benchmarks with Maya, Max and Realflow to make sure customers at least know what they are buying.

AMD is always cheaper, but if you look under AMD’s water there are a lot more underwater stones/obstacles.

I have very little Maya experience (used to Max), but for the viewport any GPU should do.
NV has noticeably better OpenGL performance than AMD.
NV in general has better rasterization performance and AMD is better with compute.
Most likely you won’t notice a difference in the viewport, only when doing a render with Arnold, and there AMD should be faster.
But it really depends on how and if the developers optimize properly for both vendors.

Personally I was happy with NV until I bought that cheap 280X to test if my OpenCL code runs on it.
I could not believe it was four times faster than NV hardware with similar price and FLOPS.
AMD’s marketing totally failed to take advantage of this.
Also, rumors about bad AMD drivers are wrong, they are just stricter about the specs. (E.g. AMD had similar OpenGL compute shader and OpenCL performance, but NV was twice as fast with CL as with GL - that’s bad drivers.)
So I changed my mind about reputation and started to order AMD also for offline rendering and film editing, without issues (any software with Cuda already had an option to use OpenCL instead).

Today I don’t work with offline rendering or film anymore, so I can’t tell if there are any obstacles underwater. It reminds me a bit of the Apple vs. Windows discussion. Apple users still argue their platform is better, but in fact they don’t pay for their workstation themselves and just don’t wanna get used to a different system, hehe :slight_smile:
Personally I’d wait for the upcoming AMD Vega cards with built-in SSD for a professional card. (This would solve the clustering problem discussed in the Realflow forum, probably without the need to update the software.)
But NV’s compute performance with Pascal is OK now, and if you have more trust in their stronger position in the professional market you probably can’t go wrong.

Quote: But GTXs have no ECC, and ECC is important for me

Why? I’m very curious about that. I see a need for scientific applications but not for rendering. I’ve had no issues due to crashed GPUs with overnight jobs (but lots of issues due to wrong settings or buggy software).

[QUOTE=JoeJ___;42014]Why? I’m very curious about that. I see a need for scientific applications but not for rendering. I’ve had no issues due to crashed GPUs with overnight jobs (but lots of issues due to wrong settings or buggy software).[/QUOTE]


I’ll tell you why. ECC is a very good thing. Leaving a PC to render on the GPU for 2 days, for example, is kind of risky when you have 1000 frames to do;
you might end up with 200 frames ready and the other 800 crashed and stopped.
When you have ECC in all the chips, including the CPU, DDR RAM and GPU, you can freely leave your rig somewhere else to do all the heavy lifting and not worry about crashes
(ECC comes with a speed cost, of course).

I’m a Maya user, and sometimes when I’m working with nodes there is an occasional crash.
So, restart Maya, do the SAME thing, and here you go: no crash.
A random crash is a horrible thing when a lot of information is held in VRAM.

Ah, I see - that makes sense, thanks. (I’ve never used network rendering.)

Quote: Also, rumors about bad AMD drivers are wrong, they are just stricter about the specs.

Nah, they are legitimately crap. :smiley: Or rather, they work badly with crappy code. I’m tinkering with something atm. I have an enormous kernel with a lot of spilled registers, etc., and it does random whatever on the GPU, while the CPU works fine. So I’m kinda forced to rewrite all the stuff “the right way” I was supposed to take from the get-go. I’m not sure if that is even good or bad.

Haha, spilled registers sounds like you have lots of work to do :slight_smile:

I have noticed only one AMD driver bug up to now, but it was really terrible, giving random numbers on a simple prefix sum. I wonder why nobody else noticed. AMD has since fixed it.

Another example of their crap is this prefix sum code:

#define HACK1 if (lID<128)

	HACK1 _ptr[(((lID >> 0) << 1) | (lID &   0) |   1)]	+= _ptr[(((lID >> 0) << 1) |   0)];	BARRIER_LOCAL
	HACK1 _ptr[(((lID >> 1) << 2) | (lID &   1) |   2)]	+= _ptr[(((lID >> 1) << 2) |   1)];	BARRIER_LOCAL
	HACK1 _ptr[(((lID >> 2) << 3) | (lID &   3) |   4)]	+= _ptr[(((lID >> 2) << 3) |   3)];	BARRIER_LOCAL
	HACK1 _ptr[(((lID >> 3) << 4) | (lID &   7) |   8)]	+= _ptr[(((lID >> 3) << 4) |   7)];	BARRIER_LOCAL
	HACK1 _ptr[(((lID >> 4) << 5) | (lID &  15) |  16)]	+= _ptr[(((lID >> 4) << 5) |  15)];	BARRIER_LOCAL
	HACK1 _ptr[(((lID >> 5) << 6) | (lID &  31) |  32)]	+= _ptr[(((lID >> 5) << 6) |  31)];	BARRIER_LOCAL
	HACK1 _ptr[(((lID >> 6) << 7) | (lID &  63) |  64)]	+= _ptr[(((lID >> 6) << 7) |  63)];	BARRIER_LOCAL
	HACK1 _ptr[(((lID >> 7) << 8) | (lID & 127) | 128)]	+= _ptr[(((lID >> 7) << 8) | 127)];	BARRIER_LOCAL

This is a workgroup of 128 threads, and the condition checking the thread id is always true, so it makes no sense.
But it gives a great improvement in speed. For some reason the compiler decided to load all values from LDS into VGPRs up front, wasting a dozen of them for nothing. Occupancy goes down, the kernel runs slow.
With the condition this does not happen, VGPR usage stays low, and the kernel is fast. Funny thing: the compiler is also too stupid to remove the pointless branches :slight_smile:

But wait - I just checked it again, and the hack is not necessary anymore :slight_smile: They have silently fixed it.
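For anyone who wants to play with it, here is a minimal self-contained sketch of such a 128-thread local prefix sum, assuming lID is the local id and BARRIER_LOCAL wraps barrier(CLK_LOCAL_MEM_FENCE) as in the snippet above (same indexing scheme, not the exact production kernel, and without the hack since it is no longer needed):

// scan 256 values with one workgroup of 128 work items
#define BARRIER_LOCAL barrier(CLK_LOCAL_MEM_FENCE);

__kernel void PrefixSum128(__global float* data)
{
	__local float _ptr[256];		// two values per work item
	uint lID = get_local_id(0);

	// load 256 values into local memory (LDS)
	_ptr[lID]       = data[lID];
	_ptr[lID + 128] = data[lID + 128];
	BARRIER_LOCAL

	// each pass doubles the block size; the upper half of every block
	// adds the last element of the (already summed) lower half
	_ptr[(((lID >> 0) << 1) | (lID &   0) |   1)] += _ptr[(((lID >> 0) << 1) |   0)]; BARRIER_LOCAL
	_ptr[(((lID >> 1) << 2) | (lID &   1) |   2)] += _ptr[(((lID >> 1) << 2) |   1)]; BARRIER_LOCAL
	_ptr[(((lID >> 2) << 3) | (lID &   3) |   4)] += _ptr[(((lID >> 2) << 3) |   3)]; BARRIER_LOCAL
	_ptr[(((lID >> 3) << 4) | (lID &   7) |   8)] += _ptr[(((lID >> 3) << 4) |   7)]; BARRIER_LOCAL
	_ptr[(((lID >> 4) << 5) | (lID &  15) |  16)] += _ptr[(((lID >> 4) << 5) |  15)]; BARRIER_LOCAL
	_ptr[(((lID >> 5) << 6) | (lID &  31) |  32)] += _ptr[(((lID >> 5) << 6) |  31)]; BARRIER_LOCAL
	_ptr[(((lID >> 6) << 7) | (lID &  63) |  64)] += _ptr[(((lID >> 6) << 7) |  63)]; BARRIER_LOCAL
	_ptr[(((lID >> 7) << 8) | (lID & 127) | 128)] += _ptr[(((lID >> 7) << 8) | 127)]; BARRIER_LOCAL

	// write the inclusive prefix sums back
	data[lID]       = _ptr[lID];
	data[lID + 128] = _ptr[lID + 128];
}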