Why does OpenCL freeze my entire OS?

mangofruit · October 14, 2021, 3:04pm

When I try to process a “large” image (1920x1080) using a simple box blur filter, or even print a large array, my whole OS freezes, and I cannot do anything (not even move the mouse) until the processing finishes. Why is OpenCL doing this? I have set the blocking mode to CL_TRUE when outputting the buffers, so I understand that it will block my application thread. But I don’t understand why it blocks my whole PC. FYI I have a GTX 760 card (set as my default device), and I am using OpenCL v1.2 with the cpp bindings. Any ideas why this might be happening?

Dark_Photon · October 14, 2021, 5:15pm

You’re running your kernels on the GTX 760, yes?
You’re rendering your display with the GTX 760, yes?

Not sure of the specifics of your situation, but in my experience this is pretty standard behavior when a GPU warp/wavefront takes too long to execute. At some point (sometimes configurable), the OS will assume a GPU freeze, reset the GPU, kill your app, and return shared use of the GPU to the entire system.

For related info, websearch “TDR”. For instance:

Timeout detection and recovery (TDR)

mangofru1t · November 20, 2021, 5:01pm

Sorry for the delay, I lost my account when I reformatted my PC.
Yes, that is correct: I am using the same card to render my display and run the OpenCL kernel.

At some point (sometimes configurable), the OS will assume a GPU freeze, reset the GPU, kill your app, and return shared use of the GPU to the entire system.

Ok, I see. The kernel I am running is a 2D image-processing kernel that blurs largish images (1920x1080). My OS freezes for the duration of the blur, but the program does not get killed. It just “pauses” while the blur is being calculated and returns to normal once it has finished. How do rendering/game frameworks that use OpenCL avoid these lags? Are there any workarounds (apart from reducing the image resolution) that would improve this?

Dark_Photon · November 21, 2021, 1:55am

Ok. I guess the freeze didn’t last long enough that the OS/display manager assumed the GPU was hung.

Out of curiosity, how long is the hang, and which OS is this?

AFAIK, you optimize your implementation so that each distinct kernel execution consumes less time. That can be done a few ways:

1. Use a more efficient algorithm.
2. Optimize the algorithm so that it takes better advantage of your GPU’s hardware.
3. Split up the total operation into separate kernel executions.

Your image size plus the fact that you’re seeing a GPU freeze in human timescales suggests you can get a lot of gain from #1. A simple, naive convolution (e.g. blur) can be considerably less efficient than an optimized one.

Read up on “image convolution optimization strategies”. Here’s a random websearch hit:

Convolution optimisation resources
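
As a rough illustration of the third option above (splitting the work into separate kernel executions), here is a minimal host-side sketch in plain C++. It only computes the slices; `Tile` and `split_into_tiles` are hypothetical names made up for this example, not part of OpenCL, and the tile size is something you would have to tune.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One rectangular slice of the full 2D global range (hypothetical helper type).
struct Tile {
    std::size_t offset_x, offset_y;  // global work offset of this slice
    std::size_t size_x, size_y;      // global work size of this slice
};

// Split a width x height range into tiles of at most tile_w x tile_h.
// Tiles on the right and bottom edges are clipped to the image bounds.
std::vector<Tile> split_into_tiles(std::size_t width, std::size_t height,
                                   std::size_t tile_w, std::size_t tile_h) {
    assert(tile_w > 0 && tile_h > 0);
    std::vector<Tile> tiles;
    for (std::size_t y = 0; y < height; y += tile_h) {
        for (std::size_t x = 0; x < width; x += tile_w) {
            tiles.push_back({x, y,
                             std::min(tile_w, width - x),
                             std::min(tile_h, height - y)});
        }
    }
    return tiles;
}
```

With the C++ bindings, each tile could then be submitted as its own launch, e.g. `queue.enqueueNDRangeKernel(kernel, cl::NDRange(t.offset_x, t.offset_y), cl::NDRange(t.size_x, t.size_y), cl::NullRange)`. Since OpenCL 1.1, `get_global_id()` includes the global offset, so the kernel itself would not need to change. Whether the display actually gets GPU time between tiles depends on the driver, so treat that part as an assumption to test rather than a guarantee.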

Peppermint · November 21, 2021, 10:22pm

Thanks for the reply.

Out of curiosity, how long is the hang

It depends on the blur size parameter; if I increase the size (e.g. >100 pixels per single pixel) it can take a few seconds (maybe ~5 secs or so). But smaller blurs are quicker. Here is an example of my kernel script FYI:

__kernel void blur(image2d_t img, global float* red_pixel, global float* green_pixel, global float* blue_pixel){

  int blur_size=100; /* the larger this value, the longer it takes */

  /* pixel handled by this work-item; `sampler` is assumed to be declared elsewhere */
  const int2 pos=(int2)(get_global_id(0), get_global_id(1));

  int3 color=(int3)(read_imageui(img, sampler, pos).r, read_imageui(img, sampler, pos).g, read_imageui(img, sampler, pos).b);

  /* (2*blur_size)^2 window positions per output pixel, with three image reads at each */
  for(int i=-blur_size; i<blur_size; i++){
    for(int ii=-blur_size; ii<blur_size; ii++){
      int red=read_imageui(img, sampler, pos + (int2)(i, ii) ).r;
      int green=read_imageui(img, sampler, pos + (int2)(i, ii) ).g;
      int blue=read_imageui(img, sampler, pos + (int2)(i, ii) ).b;

      color+=(int3)(red, green, blue);
    }
  }

  /* divide by the sample count: (2*blur_size)^2 from the loops plus the initial read */
  color.r*=1.0/(blur_size*blur_size*4.0+1.0);
  color.g*=1.0/(blur_size*blur_size*4.0+1.0);
  color.b*=1.0/(blur_size*blur_size*4.0+1.0);

  /* ... rest of the kernel not shown ... */
}

Increasing blur_size makes my PC hang for longer while the graphics card does its calcs.

and which OS is this?

I am on Manjaro (Linux). I’m also using the cpp bindings for OpenCL, and targeting version 1.2.

Use a more efficient algorithm.

Yeah, that is a fair point. My script is a very crude box blur, so it can be optimized for sure.

Split up the total operation into separate kernel executions.

Interesting point! How would I break up the kernel executions in my case, if you happen to know? Is there a function that OpenCL provides that can assist with this? I suppose I can break the blur into two separate passes (since it’s a box blur): the first pass blurs the pixels horizontally, then the second pass takes those blurred pixels and blurs them vertically to get the final “box” blurred effect. Would I need two separate kernels for this (one for the horizontal blur and another for the vertical blur)? How would I then tell OpenCL which order to execute those kernels in?

Thanks in advance
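
For reference, the two-pass idea described above can be sketched on the CPU in plain C++. This is only an illustration of the technique under some assumptions: a single float channel, clamp-to-edge addressing, and a symmetric window of 2*radius+1 taps (slightly different from the loop bounds in the kernel above). `Image`, `blur_pass` and `box_blur_two_pass` are hypothetical names for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Single-channel image stored row by row (hypothetical helper type).
struct Image {
    int width, height;
    std::vector<float> data;
    Image(int w, int h)
        : width(w), height(h), data(static_cast<std::size_t>(w) * h, 0.0f) {}
    float& at(int x, int y) { return data[static_cast<std::size_t>(y) * width + x]; }
    float at(int x, int y) const { return data[static_cast<std::size_t>(y) * width + x]; }
};

// One 1D box-blur pass with clamp-to-edge addressing.
// (step_x, step_y) = (1, 0) blurs along rows, (0, 1) blurs along columns.
Image blur_pass(const Image& src, int radius, int step_x, int step_y) {
    assert(radius >= 0);
    Image dst(src.width, src.height);
    const float norm = 1.0f / static_cast<float>(2 * radius + 1);
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int sx = std::min(std::max(x + k * step_x, 0), src.width - 1);
                const int sy = std::min(std::max(y + k * step_y, 0), src.height - 1);
                sum += src.at(sx, sy);
            }
            dst.at(x, y) = sum * norm;
        }
    }
    return dst;
}

// Horizontal pass first, then a vertical pass over the intermediate result.
Image box_blur_two_pass(const Image& src, int radius) {
    const Image horizontal = blur_pass(src, radius, 1, 0);
    return blur_pass(horizontal, radius, 0, 1);
}
```

Each pass reads 2*radius+1 values per pixel instead of (2*radius+1)^2. In OpenCL this could be two kernels, or one kernel that takes the direction as an argument, with an intermediate buffer between the passes. On a default (in-order) command queue, commands run in the order they are enqueued, so enqueuing the horizontal pass before the vertical pass is enough to fix the order.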

Dark_Photon · November 22, 2021, 1:17am

Ok, thanks. Yes, this confirms what I thought you might be doing.

Here’s the relevant portion from that Convolution optimisation resources page I linked to that pertains to your algorithm:

A naive convolution of a matrix MxN with a filter k₁xk₂ has a computational complexity of O(MNk₁k₂) with data requirements of O(MN).

In other words, pretty ugly unless these are all small, tiny constants.
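
To put rough numbers on that, here is a small C++ sketch that counts window reads for a 1920x1080 image with a 200x200 window (blur_size=100 in the kernel above). `BlurCost` and `estimate_reads` are hypothetical names for this example. The separable and summed-area-table figures are the usual textbook counts, not measurements, and the table figure ignores the cost of building the table.

```cpp
#include <cassert>
#include <cstdint>

// Approximate number of reads needed to blur an M x N image
// with a k1 x k2 box window (hypothetical helper type).
struct BlurCost {
    std::uint64_t naive;       // M*N*k1*k2: full 2D window per pixel
    std::uint64_t separable;   // M*N*(k1+k2): one horizontal plus one vertical pass
    std::uint64_t sat_lookup;  // 4*M*N: four table lookups per pixel
};

BlurCost estimate_reads(std::uint64_t m, std::uint64_t n,
                        std::uint64_t k1, std::uint64_t k2) {
    assert(m > 0 && n > 0 && k1 > 0 && k2 > 0);
    const std::uint64_t pixels = m * n;
    return {pixels * k1 * k2, pixels * (k1 + k2), pixels * 4};
}
```

For `estimate_reads(1920, 1080, 200, 200)` this gives 82,944,000,000 window positions for the naive version, 829,440,000 for a two-pass blur, and 8,294,400 lookups with a summed-area table. The kernel posted above performs three image reads per window position, so its actual read count is about three times the naive figure.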

So yes, given this, you can almost certainly get a huge benefit here from using a more efficient convolution algorithm.

At a brief glance, what your code appears to be doing makes me think that using a Summed Area Table will get you a huge speed-up over your current method.
Here’s the “what the heck is that?” page from the Convolution optimisation resources link:

Here’s an old 2012 AMD presentation that talks about doing fast O(1) blurs using a Summed Area Table:

Summed-Area Tables And Their Application to Dynamic Glossy Environment Reflections (AMD)

And here’s a GPU Gems 3 chapter that talks about how to compute Summed-Area Tables efficiently with a compute kernel:

Chapter 39. Parallel Prefix Sum (Scan) with CUDA (NVIDIA)

Some random OpenCL Summed-Area Table Compute code, and the paper describing the technique they used:

Honestly, I would switch to a more efficient algorithm first. I expect the speed-up you’ll get will be so extreme that you’ll have no need to subdivide your work into multiple kernel executions. But I could be wrong.
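
To make the Summed Area Table idea concrete, here is a minimal single-threaded C++ sketch for one 8-bit channel. It is not the GPU implementation from the links above, just the underlying arithmetic. `build_sat` and `box_blur_sat` are hypothetical names, and the window is clipped at the image borders, which is an assumption of this sketch (a different edge policy from a clamping sampler).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Summed-area table with one extra row and column of zeros, so that
// sat[(y+1)*(w+1) + (x+1)] holds the sum of all pixels in [0..x] x [0..y].
std::vector<std::uint64_t> build_sat(const std::vector<std::uint8_t>& img, int w, int h) {
    assert(w > 0 && h > 0 && img.size() == static_cast<std::size_t>(w) * h);
    const std::size_t stride = static_cast<std::size_t>(w) + 1;
    std::vector<std::uint64_t> sat(stride * (h + 1), 0);
    for (int y = 0; y < h; ++y) {
        std::uint64_t row_sum = 0;  // running sum of the current row
        for (int x = 0; x < w; ++x) {
            row_sum += img[static_cast<std::size_t>(y) * w + x];
            sat[(y + 1) * stride + (x + 1)] = sat[y * stride + (x + 1)] + row_sum;
        }
    }
    return sat;
}

// Box blur whose per-pixel cost does not depend on the radius.
std::vector<std::uint8_t> box_blur_sat(const std::vector<std::uint8_t>& img,
                                       int w, int h, int radius) {
    assert(radius >= 0);
    const std::vector<std::uint64_t> sat = build_sat(img, w, h);
    const std::size_t stride = static_cast<std::size_t>(w) + 1;
    std::vector<std::uint8_t> out(img.size());
    for (int y = 0; y < h; ++y) {
        const int y0 = std::max(y - radius, 0);
        const int y1 = std::min(y + radius, h - 1);
        for (int x = 0; x < w; ++x) {
            const int x0 = std::max(x - radius, 0);
            const int x1 = std::min(x + radius, w - 1);
            // Four lookups give the sum over the window [x0..x1] x [y0..y1].
            const std::uint64_t sum = sat[(y1 + 1) * stride + (x1 + 1)]
                                    + sat[y0 * stride + x0]
                                    - sat[y0 * stride + (x1 + 1)]
                                    - sat[(y1 + 1) * stride + x0];
            const std::uint64_t count =
                static_cast<std::uint64_t>(x1 - x0 + 1) * (y1 - y0 + 1);
            out[static_cast<std::size_t>(y) * w + x] =
                static_cast<std::uint8_t>(sum / count);  // integer mean, truncated
        }
    }
    return out;
}
```

The table is built once per image, and after that every output pixel costs four lookups whether the radius is 1 or 100. On the GPU, building the table is the part that needs a parallel prefix sum (scan), which is what the GPU Gems chapter covers.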
