Stream compaction

I’m facing a stream compaction problem, exactly as described in sect. 39.3.1 of

My vector is in global memory and I have to compact it and place the result back in global memory. In the above cited article it is mentioned that

The addition of a native scatter in recent GPUs makes stream compaction considerably more efficient

Still, I can't understand the exact meaning of that sentence. Are there native OpenCL C instructions that compact streams in global memory? More generally, what is the best way to compact a vector?

“Recent GPU” probably means less than 10 years old here…

Gather means that the GPU can do random-access loads, while scatter means that the GPU can do random-access stores.
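To make the distinction concrete, here is a minimal CPU-side sketch in C++. The function names gather and scatter and the index vectors are made up for illustration; on a GPU each loop iteration would be one work-item, and the indexed read or write would hit global memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gather: output element i READS from a computed location in src.
std::vector<int> gather(const std::vector<int>& src,
                        const std::vector<std::size_t>& idx) {
    std::vector<int> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] < src.size());
        out[i] = src[idx[i]];      // random-access load
    }
    return out;
}

// Scatter: input element i WRITES to a computed location in out.
// Slots that nobody writes to keep the initial value 0.
std::vector<int> scatter(const std::vector<int>& src,
                         const std::vector<std::size_t>& idx,
                         std::size_t out_size) {
    assert(idx.size() == src.size());
    std::vector<int> out(out_size, 0);
    for (std::size_t i = 0; i < src.size(); ++i) {
        assert(idx[i] < out_size);
        out[idx[i]] = src[i];      // random-access store
    }
    return out;
}
```

In OpenCL C both operations are just ordinary indexed accesses to a __global pointer, so there is no special instruction to look for.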

That sentence dates from a time when vertex shaders could not read any data other than that of the vertex being processed (i.e. no texture fetch capability) and fragment shaders could not write any data other than that of the fragment being processed.

Such a GPU would not be OpenCL-compatible anyway.

Thanks, I suspected this was the answer, but now I’m sure :wink:
In the meantime I went on with my stream compaction implementation. I think it is impossible to compact a stream “in place” using multiple work-groups: there is no guarantee on their execution order, so a data race can occur in which one work-group overwrites a portion of the input array before the work-group in charge of that portion has read it. For this reason I will write the compacted elements to an auxiliary buffer. Since the compacted stream is needed only as input to another kernel, I will copy it back to the original buffer with cl::CommandQueue::enqueueCopyBuffer (I need the auxiliary buffer to compact many streams). So I won’t need host memory for this buffer: is there a way to allocate a buffer only on the GPU, without allocating host memory?
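For reference, the out-of-place approach described above is often built as three passes: mark the elements to keep, run an exclusive prefix sum over the marks to get each survivor's output index, then scatter into the auxiliary buffer. The following is a sequential C++ sketch of that pattern, not a GPU implementation. The names exclusive_scan_flags and compact and the predicate parameter are hypothetical; on a device each loop would typically be its own kernel, with the scan being the only step that needs cooperation between work-groups.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: result[i] = flags[0] + ... + flags[i-1].
std::vector<std::size_t> exclusive_scan_flags(
        const std::vector<unsigned char>& flags) {
    std::vector<std::size_t> offsets(flags.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        offsets[i] = running;
        running += flags[i];
    }
    return offsets;
}

// Compacts `in` into the separate buffer `out` and returns the number of
// elements kept. `in` is never written, so no element can be clobbered
// before it is read, whatever order the iterations run in.
template <typename Pred>
std::size_t compact(const std::vector<int>& in, std::vector<int>& out,
                    Pred keep) {
    assert(&in != &out);  // out-of-place only

    // Pass 1: flag the elements that survive.
    std::vector<unsigned char> flags(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        flags[i] = keep(in[i]) ? 1 : 0;

    // Pass 2: each survivor's output index is the count of survivors before it.
    const std::vector<std::size_t> offsets = exclusive_scan_flags(flags);
    const std::size_t count =
        in.empty() ? 0 : offsets.back() + flags.back();

    // Pass 3: scatter survivors to their final positions.
    out.assign(count, 0);
    for (std::size_t i = 0; i < in.size(); ++i)
        if (flags[i]) out[offsets[i]] = in[i];

    return count;
}
```

Every survivor gets a distinct output index from the scan, so the scatter pass has no write conflicts and preserves the input order.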

Yes, clCreateBuffer will create a GPU buffer without allocating host memory (as far as you can tell, that is; an implementation is free to allocate host memory behind the scenes). I’d suggest starting with some of the OpenCL examples to get the hang of the easy stuff before attempting something more difficult.

Thanks, I have already gone through some examples, read a book and experimented a bit. But in my opinion buffer creation is the hardest thing for beginners to understand, especially the meaning of CL_MEM_USE_HOST_PTR et al. The web is full of threads asking for clarification about memory allocation, and some of them even contradict each other… I have yet to find the definitive answer on this topic.

Just use clCreateBuffer() with the CL_MEM_READ_WRITE flag. You can also add the hint flag CL_MEM_HOST_NO_ACCESS if your device supports OpenCL 1.2.

Nice tip, thanks. Unfortunately I’m targeting NVIDIA GPUs at the moment, so I cannot rely on that flag…

The CL_MEM_READ_WRITE flag alone will create a buffer in device memory. CL_MEM_HOST_NO_ACCESS is just an optional hint.