How to read back a PBO to main memory quickly?

Hi everyone,
I’ve created a pixel buffer object (PBO) and modified it with CUDA. Now I want to read the PBO result back to main memory so I can transfer it to other nodes. Is there an efficient way to read back a PBO?
I mapped the PBO and copied the contents to an array like this:
int count = 0;
for (int i = 0; i < width; i++) {
    for (int j = 0; j < height; j++) {
        /* copy one 4-byte pixel, one byte at a time */
        cpumem[count]     = ptr[count];
        cpumem[count + 1] = ptr[count + 1];
        cpumem[count + 2] = ptr[count + 2];
        cpumem[count + 3] = ptr[count + 3];
        count += 4;
    }
}

I found the map operation was too expensive. Is there a more efficient way to read back a PBO? Thanks :)

Is the mapping the problem or the actual copying? Why are you copying the individual bytes “by hand” instead of just calling memcpy or something like it?
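
For illustration, the byte-by-byte loop above could be collapsed into a single memcpy. This is only a minimal sketch: copy_pixels is a hypothetical helper name, and it assumes 4 bytes per pixel with tightly packed rows, which matches the loop in the question but is not confirmed beyond that.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies width * height pixels of 4 bytes each
 * from the mapped pointer (src) into CPU memory (dst).
 * Assumes the rows are tightly packed and the buffers do not overlap. */
static void copy_pixels(unsigned char *dst, const unsigned char *src,
                        size_t width, size_t height)
{
    size_t nbytes = width * height * 4;  /* 4 bytes per pixel, as in the loop */
    memcpy(dst, src, nbytes);
}
```

With the names from the question, the call would be copy_pixels(cpumem, ptr, width, height). Timing the map call and the copy separately should show which of the two is actually the slow part.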

Using aligned memory and special SSE instructions helps a lot. See this:
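
As a rough idea of what an aligned SSE copy can look like (a sketch under assumptions, not necessarily the code the reply refers to): copy_sse2 is a hypothetical name, it uses SSE2 aligned loads with non-temporal (streaming) stores, and it falls back to memcpy when either pointer is not 16-byte aligned. Whether it beats a plain memcpy depends on the platform and on how the mapped memory is cached, so it should be measured.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86 only) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies nbytes from src to dst.
 * Uses 16-byte aligned loads and streaming stores when both pointers
 * are 16-byte aligned; otherwise falls back to memcpy. */
static void copy_sse2(void *dst, const void *src, size_t nbytes)
{
    if ((((uintptr_t)dst) | ((uintptr_t)src)) & 15u) {
        memcpy(dst, src, nbytes);          /* unaligned: plain copy */
        return;
    }

    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    size_t blocks = nbytes / 16;           /* number of full 16-byte blocks */

    for (size_t i = 0; i < blocks; i++) {
        __m128i v = _mm_load_si128(s + i); /* aligned load */
        _mm_stream_si128(d + i, v);        /* store that bypasses the cache */
    }
    _mm_sfence();                          /* make streaming stores visible */

    /* copy the remaining tail (fewer than 16 bytes) */
    size_t done = blocks * 16;
    memcpy((char *)dst + done, (const char *)src + done, nbytes - done);
}
```

The destination array would need to come from an aligned allocator (for example _mm_malloc(size, 16), released with _mm_free) for the fast path to be taken.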

Thanks a lot, I will try it.