Slow transfer speed on fermi cards, part II

jnygaard · February 25, 2013, 1:07pm

The old thread of the same name was closed (maybe because the last activity was almost two years ago?!) but I hope it’s ok if I “revive” it.

I am trying to get mfort’s CUDA workaround described in post #47 in the thread (“slow-transfer-speed-on-fermi-cards”, the bbs-system won’t let me post with URLs… sigh) to work, but I am banging my head into a wall at the moment.

When I try to execute


cErr = cudaMemcpyFromArray( cuda_mem, cArray, 0, 0, 32, cudaMemcpyDeviceToHost );

I get a SIGFPE (arithmetic exception) even before it returns an error. (Note that the count of 32 is just for testing; I have successfully malloc’ed much more memory.) If I use DeviceToDevice etc., I get the expected cudaErrorInvalidMemcpyDirection, and if I try to copy 0 bytes, it does not fail. All commands up to this point (both CUDA and OpenGL) have completed successfully.

I have done what mfort outlined, but the instructions do not say how the renderBufferId object is set up. I tried this:


glGenRenderbuffers(1, &renderBufferId);
glBindRenderbuffer(GL_RENDERBUFFER, renderBufferId);
// glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER,
//                            GL_RENDERBUFFER, renderBufferId );

and also combinations with the commented out function call enabled, none of which works. I thought maybe the glFramebufferRenderbuffer would be required to connect the framebuffer (or whatever it is correct to call the thingy that OpenGL renders into) to the renderbuffer that the CUDA-copy will do the actual copying from.

Anybody got any ideas that I could try out?

J.

Dark_Photon · February 25, 2013, 5:19pm

the bbs-system won’t let me post with URLs… sigh

Yeah, new accounts are restricted from posting URLs for a while. These boards were just getting tons of spam postings from new users with links to various junk sites, and restricting links for new users cuts virtually all of it out. Just paste the URL as text in the post and maybe mutilate http or something so the boards will take it. Here’s the link I think you were talking about: slow transfer speed on fermi cards

…but to your problem, you want to read this carefully. Gives you what you want w/o CUDA:

Re: Nvidia Dual Copy Engines

jnygaard · February 26, 2013, 1:12am

…but to your problem, you want to read this carefully.

Thanks for the link! But I have already done what l_hrabcak describes in the post.

Stage 1: I generate two buffers, buf[0] and buf[1], bind buf[0] to GL_PIXEL_PACK_BUFFER, reserve storage with glBufferData and GL_STATIC_COPY (once only), call glReadPixels, then unbind buf[0].

Then the copy stage: I bind buf[0] to GL_COPY_READ_BUFFER and buf[1] to GL_COPY_WRITE_BUFFER (and call glBufferData with GL_STREAM_READ the first time), and finally call glCopyBufferSubData.

Lastly, I use glMapBuffer(GL_COPY_WRITE_BUFFER, GL_READ_ONLY) to get the data to the host; most of the time is spent in this call, as expected.

Doing this, I get approximately 1.2 GB/s, which is comparable to what l_hrabcak gets: 3.6 MiB / 2.82 ms ≈ 1.28 GB/s. He uses PCIe 1.1, but shouldn’t the speed for that be about twice as much, i.e., around 3 GB/s? I use PCIe 2.0 (nvidia-settings says “x16 Gen2”), so I expected one more doubling, to about 6 GB/s. This is also what ‘bandwidthTest --memory=pinned’ indicates: it reports 6.35 GB/s for device-to-host transfer.
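
For anyone who wants to redo this arithmetic with their own timings, here is a minimal sketch in C. It only assumes the 3.6 MiB transfer size and 2.82 ms timing quoted above; the function names are hypothetical and not from any library. Dividing MiB by ms directly, as done above, lands between the decimal and binary figures, so it is worth being explicit about the unit when comparing against bandwidthTest.

```c
/* Transfer rate in bytes per second for `bytes` moved in `ms` milliseconds. */
double bytes_per_second(double bytes, double ms)
{
    return bytes / (ms / 1000.0);
}

/* Decimal gigabytes per second (1 GB = 1e9 bytes). */
double to_gb_per_second(double bytes_per_sec)
{
    return bytes_per_sec / 1e9;
}

/* Binary gibibytes per second (1 GiB = 2^30 bytes). */
double to_gib_per_second(double bytes_per_sec)
{
    return bytes_per_sec / (1024.0 * 1024.0 * 1024.0);
}
```

Calling bytes_per_second(3.6 * 1024.0 * 1024.0, 2.82) gives roughly 1.34 GB/s in decimal units, or roughly 1.25 GiB/s in binary units.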

Note 1: At this point I am not concerned with async vs. sync’ed transfers etc. I just want to achieve the best transfer speed first.

Note 2: If I use just one buffer, glReadPixels into GL_PIXEL_PACK_BUFFER, and download with glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY), I actually get slightly better results, 1.5 GB/s. I don’t understand this, either. In the dual-buffer setup, the final glMapBuffer command takes 2.9 ms (1.24 GB/s), and in the single-buffer setup, the mapping takes approx. 2.2 ms (1.64 GB/s). All other operations (binding, unbinding, etc.) take negligible time relative to this.

Having tried this, and a lot of tweaks, I thus wanted to investigate the CUDA path…

J.

Dark_Photon · February 26, 2013, 5:37am

When I did this, my readback rate jumped from 0.8GB/sec to 3.0GB/sec (3.7X speed-up).

Theoretical max on PCIe2 x16 in one direction is ~8GB/sec, but in practice you shouldn’t expect more than ~6.4GB/sec. And that’s when timing only a saturated transfer – no overhead. You probably can’t expect to get that with other driver overhead and processing in the timing loop.
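
As a rough sketch of where the ~8GB/sec figure comes from: PCIe 1.x runs at 2.5 GT/s per lane and PCIe 2.0 at 5 GT/s per lane, and both use 8b/10b encoding, so only 8 of every 10 transferred bits carry data. The function name below is hypothetical, and the calculation ignores packet and protocol overhead, so it is an upper bound rather than an achievable rate.

```c
/* Peak one-direction data rate in bytes per second for a PCIe 1.x/2.0 link.
 * gigatransfers_per_lane: 2.5 for PCIe 1.x, 5.0 for PCIe 2.0.
 * Assumes 8b/10b encoding and ignores packet/protocol overhead. */
double pcie_peak_bytes_per_second(double gigatransfers_per_lane, int lanes)
{
    double raw_bits_per_sec  = gigatransfers_per_lane * 1e9 * lanes;
    double data_bits_per_sec = raw_bits_per_sec * 8.0 / 10.0; /* 8b/10b */
    return data_bits_per_sec / 8.0;                           /* bits -> bytes */
}
```

With these inputs, pcie_peak_bytes_per_second(5.0, 16) comes out at 8 GB/sec for PCIe 2.0 x16, and pcie_peak_bytes_per_second(2.5, 16) at 4 GB/sec for PCIe 1.1 x16. The 6.35 GB/s that bandwidthTest reported earlier in the thread is about 79% of the PCIe 2.0 x16 peak.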
