What is the maximum texture upload (CPU->GPU) transfer rate (bandwidth) that people have achieved? And on what system configuration?
I’ve tried glTexSubImage2D, FBO + glDrawPixels, Single PBO + Single TEX, Single PBO + Dual TEX, and Dual PBO + Dual TEX. The maximum bandwidth (600 MB/sec) was with Dual PBO + Dual TEX, using GL_RGBA as the internal format and GL_BGRA as the external format.
The same code gives about 1.2 GB/sec on Windows. Is there something specific to the Linux drivers?
My system configuration:
CPU -> AMD Athlon64 3800+
RAM -> 2GB
GPU -> GeForce 8800GTX
OS -> Ubuntu Linux
dimensionX: What is the maximum texture upload (CPU->GPU) transfer rate (bandwidth) that people have achieved? … GL_RGBA + GL_BGRA for internal and external format. … The same code gives about 1.2 GB/sec bandwidth on Windows. Is there something specific to the Linux drivers?
I haven’t run tests as extensive as yours with the other techniques, but with glTexSubImage2D specifically, 0.9GB/sec appears to be the best saturated throughput I’ve been able to obtain on a 7900GTX on PCIe x16, or maybe 0.7GB/sec if my 64x64 subload results are showing memory caching artifacts (note: maximum practical x16 PCIe = 3.2GB/sec).
Given BGRA8 subloads:
1.8GB/sec - 512x512 tex with 64x64 subloads
1.3GB/sec - 512x512 tex with 512x512 subloads
0.9GB/sec - 1024x1024 tex with 64x64 subloads
0.7GB/sec - 1024x1024 tex with 1024x1024 subloads
0.9GB/sec - 2048x2048 tex with 64x64 subloads
0.7GB/sec - 2048x2048 tex with 2048x2048 subloads
Here I render a full-screen poly with a simple, pre-defined GPU state using the texture, enough times to compute the pipelined rendering time per poly. I then sub-load the texture and re-render with it (to force the GPU upload), repeating this a number of times to compute the time per upload-render iteration. Subtracting the first time from the second gives the (pipelined) upload time, which I use to compute effective bandwidth.
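For anyone wanting to reproduce the arithmetic, here is a minimal sketch of the subtraction step. The function name, the 1 GB = 1e9 bytes convention, and the timings in the usage lines are all my assumptions for illustration, not values measured in this thread.

```python
def upload_bandwidth_gb_per_s(width, height, bytes_per_texel,
                              render_time_s, upload_render_time_s):
    """Effective upload bandwidth in GB/sec (assuming 1 GB = 1e9 bytes).

    render_time_s:        average time of one render-only iteration
    upload_render_time_s: average time of one subload + render iteration
    """
    # The upload cost is whatever the subload adds on top of rendering.
    upload_time_s = upload_render_time_s - render_time_s
    if upload_time_s <= 0:
        raise ValueError("upload+render time must exceed render-only time")
    bytes_uploaded = width * height * bytes_per_texel
    return bytes_uploaded / upload_time_s / 1e9


# Hypothetical timings: a full 1024x1024 BGRA8 subload (4 bytes per texel),
# 1 ms per render-only iteration, 7 ms per upload+render iteration.
rate = upload_bandwidth_gb_per_s(1024, 1024, 4, 0.001, 0.007)
print(f"{rate:.2f} GB/sec")
```

With those made-up timings the result comes out at roughly 0.70 GB/sec. Note that the result is only as good as the two averages: if the render-only loop is not fully pipelined, the subtraction will over- or under-state the upload time.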
Unless there’s some inefficiency in the NV4x GPU/driver in dealing with textures > 512x512 (e.g. worse cache coherence, etc.), it’s possible that my 512 results are due to pipelining/measurement errors and that 0.9GB/sec is the saturated peak I’m really seeing.
The increased throughput with many small subloads is interesting. However, I need to change my test procedure to make sure this isn’t an effect of memory caching (currently, the same CPU memory block is being hit for all subloads). If there’s a fault there, the max I’m seeing is really 0.7 GB/sec, which is basically what you’re seeing on Linux.
Incidentally, I’ve done this testing with 16 other texture formats, and the results suggest that BGRA8 is one of the highest-throughput subload formats (in MB/sec) available on the NVIDIA driver + NV4x GPU. Looking at the 2048x2048 results, it’s up there with DXT1,3,5, LUMINANCE_ALPHA16F, and LUMINANCE32F.
In addition to the above changes, I also want to test with other driver versions which may yield different results.
You have only half of the bandwidth under Linux. This looks like an extra copy in main memory, which would cut your bandwidth by a factor of 2.
Main memory is usually the bandwidth limiter.
The maximum bus transfer rate (e.g. PCIe) is only a theoretical peak value. In practice you should be able to get a bit more than half of it, but the bus is not your limit.
My guess is that the driver copies the textures from main memory to a DMA section (also in main memory), from which the GPU can pull them down to graphics memory.
This extra copy may not be done under Windows, or there may be one more copy under Linux.
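To make the factor-of-2 reasoning concrete, here is a rough sketch of how copies that run one after another combine. It assumes the copies do not overlap in time and that each runs at the same rate, which is a simplification; the function name is made up for illustration.

```python
def serial_throughput(stage_rates):
    """Combined throughput of copy stages that run back to back.

    Each byte has to pass through every stage in turn, so the time per
    byte is the sum of the per-stage times (no overlap assumed).
    """
    if not stage_rates or any(rate <= 0 for rate in stage_rates):
        raise ValueError("need at least one positive stage rate")
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


# One copy at 1.2 GB/sec versus two back-to-back copies at that rate.
one_copy = serial_throughput([1.2])
two_copies = serial_throughput([1.2, 1.2])
print(f"{one_copy:.2f} GB/sec vs {two_copies:.2f} GB/sec")
```

Under those assumptions a second copy at the same rate halves the throughput, which would be consistent with the 1.2 GB/sec (Windows) versus 600 MB/sec (Linux) figures reported above. It does not prove that an extra copy is the cause.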
I’m using PBOs. I get a pointer to the driver memory with glMapBuffer(), then memcpy() into it, and finally call glUnmapBuffer(). So I assume that should avoid the extra memcpy() that is done for a plain glTexSubImage2D(). With just glTexSubImage2D() I get about 300 MB/sec.