Regarding performance comparison (R2VB vs TF)

Hi all,
I am working on a cloth simulation and have two versions:

  1. render to vertex buffer (R2VB), which does the Verlet integration in a fragment shader, and

  2. transform feedback (TF), which does the Verlet integration in a vertex shader.

I compared the performance of the two on a 2D mesh grid ranging from 64x64 to 2048x2048. For small mesh sizes TF is around 1.25-1.5x faster; for larger meshes, however, R2VB is 1.5-2x faster than TF. Here are my stats on an NVIDIA Quadro FX 5800. All times are in milliseconds per frame, measured with a timer query as detailed below.


+------------+-------------+---------------+
|  Grid size |  R2VB (ms)  |    TF (ms)    |
+------------+-------------+---------------+
|  64 x 64   | 0.370-0.376 |   0.088-0.090 |
+------------+-------------+---------------+
| 128 x 128  | 0.403-0.431 |   0.238-0.240 |
+------------+-------------+---------------+
| 256 x 256  | 0.713-0.758 |   0.804-0.806 |
+------------+-------------+---------------+
| 512 x 512  | 2.100-2.308 |   3.090-3.096 |
+------------+-------------+---------------+
|1024 x 1024 | 7.670-9.250 | 12.205-12.209 |
+------------+-------------+---------------+
|2048 x 2048 |31.800-32.39 | 48.240-48.560 |
+------------+-------------+---------------+

Is this the expected result, or am I doing something wrong in my timing calculation?
This is how I calculate my times:

  1. For R2VB:

glBeginQuery(GL_TIME_ELAPSED, t_query);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fboID[writeID]);
   // bind Verlet integration fragment shader
   // draw full-screen quad
glFlush();
// read back the results into the VBO
glBindFramebuffer(GL_READ_FRAMEBUFFER, fboID[readID]);
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, vboID);
glReadPixels(0, 0, texture_size_x, texture_size_y, GL_RGBA, GL_FLOAT, 0);
glFlush();
glFinish();
glEndQuery(GL_TIME_ELAPSED);

//get the elapsed time
glGetQueryObjectui64v(t_query, GL_QUERY_RESULT, &elapsed_time);

  2. For TF:

glBeginQuery(GL_TIME_ELAPSED, t_query);
glBeginTransformFeedback(GL_POINTS);
   glDrawArrays(GL_POINTS, 0, total_points);
glEndTransformFeedback();
glFlush();
glEndQuery(GL_TIME_ELAPSED);

//get the elapsed time
glGetQueryObjectui64v(t_query, GL_QUERY_RESULT, &elapsed_time);
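A note on units: GL_TIME_ELAPSED results are reported in nanoseconds, while the table above is in milliseconds. A minimal sketch of that conversion follows; the helper name `elapsed_ns_to_ms` is made up for illustration, and `uint64_t` stands in for the `GLuint64` that `elapsed_time` would be in the snippets above.

```c
#include <stdint.h>

/* Convert a GL_TIME_ELAPSED query result (nanoseconds) to milliseconds.
   Hypothetical helper; uint64_t stands in for GLuint64 here. */
static double elapsed_ns_to_ms(uint64_t elapsed_ns)
{
    return (double)elapsed_ns / 1.0e6;
}
```

For example, a query result of 48240000 ns corresponds to roughly 48.24 ms.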

I was expecting TF to be the better option since it does not involve any readback, but the results show the opposite. Any ideas?

I don’t think either the flushes or glReadPixels should be involved. Why do you have those?

I guess that the R2VB shaders make better use of the cache, thus they perform better on large grids.

NVIDIA GPUs process vertex and fragment shaders in groups of 32 parallel threads per core, but the vertex shaders are probably scheduled in groups of 32x1 and the fragment shaders in groups of 8x4.

If you access neighbouring points in your shaders, you’ll get much more overlap (and thus cache hits) in the fragment shaders.
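A rough way to see the size of this effect is to count how many distinct grid points a group of 32 threads touches for each group shape. The sketch below assumes a square stencil of a given radius (the actual cloth stencil is not shown in this thread) and takes the 32x1 and 8x4 shapes from the guess above; the function names are made up for illustration.

```c
/* Distinct grid points touched by a w x h block of threads when every
   thread reads all points within `radius` of its own position.
   Assumes a square stencil and ignores grid borders. */
static int distinct_points(int w, int h, int radius)
{
    return (w + 2 * radius) * (h + 2 * radius);
}

/* Total reads issued by the same block, counting duplicates. */
static int total_reads(int w, int h, int radius)
{
    int side = 2 * radius + 1;
    return w * h * side * side;
}
```

With radius 1, both shapes issue 288 reads, but the 32x1 group touches 102 distinct points while the 8x4 group touches only 60, so under these assumptions the 8x4 layout reuses each fetched point more often.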

Hi tksuoran,
I have omitted the VBO part here. glReadPixels is what copies the content to the VBO; without it, how would the data be copied to the VBO? The glFlush is needed because glReadPixels is asynchronous, and it should be finished before I can read the time.

Thanks for the insights, mbentrup. So I think if I reorder the way I access the neighbors in the vertex shader I might get better performance. I think my current neighbor access stencil favours the fragment shaders more.
Two more questions:

  1. Is there a way to measure the number of cache hits/misses for GLSL shaders?
  2. Where did you get the information about the shape of the parallel groups being executed in the VS (32x1) and FS (8x4)? Is there a document or manual that lists those?
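One way the reordering idea above could be tried is to change which grid point each vertex index maps to, so that 32 consecutive vertices cover an 8x4 block instead of a 32x1 strip. This is only a sketch under the assumption that vertices are scheduled in runs of 32 consecutive indices; the tile shape comes from the guess earlier in the thread, the function name is hypothetical, and whether it actually helps would have to be measured.

```c
/* Hypothetical tile shape, taken from the 8x4 guess earlier in the thread. */
enum { TILE_W = 8, TILE_H = 4, TILE_SIZE = TILE_W * TILE_H };

/* Map a linear vertex index to grid coordinates so that each run of
   TILE_SIZE consecutive indices covers one TILE_W x TILE_H block.
   Assumes grid_w is a multiple of TILE_W and the grid height is a
   multiple of TILE_H. */
static void tiled_index_to_xy(unsigned index, unsigned grid_w,
                              unsigned *x, unsigned *y)
{
    unsigned tiles_per_row = grid_w / TILE_W;
    unsigned tile   = index / TILE_SIZE;   /* which tile this vertex is in */
    unsigned within = index % TILE_SIZE;   /* position inside that tile */

    *x = (tile % tiles_per_row) * TILE_W + within % TILE_W;
    *y = (tile / tiles_per_row) * TILE_H + within / TILE_W;
}
```

On a 64-wide grid, indices 0 to 31 then cover the block from (0,0) to (7,3), and index 32 starts the next tile at (8,0). The same arithmetic could be done in the vertex shader from the vertex index.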

Timer queries measure GPU time, so since you use a pixel pack buffer, the GPU time is the same whether or not you flush. In general, no glFlush or glFinish is needed when you use timer queries.

OK, I removed the glFlush/glFinish calls and now the times are reduced to around 10-20 msecs for TF, whereas for R2VB they are reduced to around half of the previous value for small grid sizes (<=512) and by around 0.5 msecs for large grid sizes (>512).