glTexSubImage2D problem

Hello,
I’m currently working on a project that displays ultrasound video.
My program captures images and displays them using a texture that is replaced each frame using the glTexImage2D function.
This worked fine until I tried to blend the last 10 frames so I can see a kind of history of images. (Later I want to display them in 3D.)
My problem is that uploading a frame into the texture using glTexImage2D is pretty slow (about 30 ms/frame) on my system, which makes it impossible to display more than 3 or 4 layers without problems.

What I have read is that such speed problems can arise from using mismatched texture formats (e.g. RGBA and BGRA) together.
This does not seem to be the problem here; trying different combinations makes little difference.

I’m using 512 * 512 8-bit textures, which should be displayed in grayscale.

I’m creating my texture like this:


  glGenTextures(1, &m_nTexName);
  glBindTexture( GL_TEXTURE_2D, m_nTexName );

  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT);

  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);

  glTexEnvf(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);


  m_nTextureSize = 512;

  std::vector<BYTE> blankTexture(m_nTextureSize * m_nTextureSize );

  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, m_nTextureSize, m_nTextureSize, 0, GL_LUMINANCE, GL_UNSIGNED_BYTE, &blankTexture[0]);



and my display loop looks like this:



  glBindTexture(GL_TEXTURE_2D,m_nTexName);

  glTexEnvf(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);

  glEnable(GL_BLEND);
  glBlendFunc(GL_ONE,GL_ONE_MINUS_SRC_COLOR);
  glDisable( GL_DEPTH_TEST );
  glDisable(GL_COLOR_MATERIAL);
  

for (int i = 0; i < 100; i++)
{

    glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, ImageWidth , ImageHeight , GL_LUMINANCE , GL_UNSIGNED_BYTE, m_memThImageBuffer);


    float texCoordH = (GLfloat)ImageHeight/(GLfloat)m_nTextureSize;
    float texCoordW = (GLfloat)ImageWidth/(GLfloat)m_nTextureSize;
    float l = 0;
    float r =	 nFrameWidth;
    float t = 0;
    float b =  nFrameHeight;

    float auxTex0 = 0;
    float auxTex1 = 0;

    glBegin(GL_QUADS);
    glTexCoord2f(auxTex0            , auxTex1            );glVertex3f(l, t, -10); 
    glTexCoord2f(auxTex0 + texCoordW, auxTex1            );glVertex3f(r, t, -10); 
    glTexCoord2f(auxTex0 + texCoordW, auxTex1 + texCoordH);glVertex3f(r, b, -10); 
    glTexCoord2f(auxTex0            , auxTex1 + texCoordH);glVertex3f(l, b, -10); 
    glEnd();
   
}


I’m using a Mobile Intel® Graphics Media Accelerator X3100 card,
which has no dedicated graphics memory of its own.
Does anybody have any suggestions as to where the problem could be?
Is there a way to speed up the transfer on this limited card?
The program is intended for use on a tablet PC, which does not allow a better graphics card to be used.

Thanks for your help.
Niko

To better understand where your bottleneck is, try a very small viewport. If the speed increases greatly, it means you are fill limited. That would not be surprising with 10 times overdraw on such a weak card. In that case you don’t have a lot of options: maybe using GL_NEAREST texture filtering, enabling/disabling mipmaps, …
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR); // linear or nearest min filter will look pretty much the same. Try mipmaps for a change; sometimes they perform better.

for (int i = 0; i < 100; i++) // that is a lot! However, you said 10x overdraw, not 100x?
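
A rough way to put a number on the viewport test, as a sketch only: time a frame at the normal viewport size and again at a tiny one, then see how much of the frame time disappeared. `fillBoundFraction` and both timing parameters are hypothetical names, not from the posted code. It assumes both timings include the same texture uploads and were taken after glFinish(), so the GPU work is actually complete.

```cpp
#include <cassert>

// Hypothetical helper: estimates what share of the frame time scales with
// the number of pixels drawn (fill cost).
//   fullMs: frame time at the normal viewport size
//   tinyMs: frame time with a very small viewport
// Returns a value in [0, 1]. Near 1 suggests the program is fill limited;
// near 0 suggests the time goes elsewhere (for example into texture uploads).
double fillBoundFraction(double fullMs, double tinyMs) {
    assert(fullMs > 0.0 && tinyMs >= 0.0);
    if (tinyMs >= fullMs) return 0.0;  // shrinking the viewport gave no speed-up
    return (fullMs - tinyMs) / fullMs;
}
```

For example, 30 ms at full size against 3 ms at a tiny viewport gives about 0.9, which points at fill rate; 30 ms against 28 ms points at the uploads instead.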

OK, I will try this. Thanks for the tip. Yes, the 100x came into the code when I tried to measure the timing with more precision. I first had it set to 10x.

Each frame, you are uploading the 10 previous pictures to the GPU. You don’t have to! Create 10 textures and update only one of them each frame. This way you will make 10 times fewer glTexSubImage2D calls, which should be a (very) big performance boost.
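
The bookkeeping for that could look roughly like the sketch below. It contains only the index logic, no GL calls; `FrameRing`, `advance` and `drawOrder` are hypothetical names, and it assumes one texture per history slot, created once at startup.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: decides which of N history textures receives the
// newest frame, and in which order the textures should be drawn.
class FrameRing {
public:
    explicit FrameRing(std::size_t slots) : slots_(slots) {
        assert(slots > 0);
    }

    // Returns the slot to overwrite with the newest frame (the oldest one).
    // Call exactly once per captured frame.
    std::size_t advance() {
        const std::size_t slot = next_;
        next_ = (next_ + 1) % slots_;
        if (filled_ < slots_) ++filled_;
        return slot;
    }

    // Slots that hold valid frames, ordered oldest to newest.
    std::vector<std::size_t> drawOrder() const {
        std::vector<std::size_t> order;
        order.reserve(filled_);
        const std::size_t start = (next_ + slots_ - filled_) % slots_;
        for (std::size_t i = 0; i < filled_; ++i)
            order.push_back((start + i) % slots_);
        return order;
    }

private:
    std::size_t slots_;
    std::size_t next_ = 0;    // slot that will be overwritten next
    std::size_t filled_ = 0;  // number of slots holding a frame so far
};
```

Per captured frame you would bind the texture for the slot returned by advance() and make a single glTexSubImage2D call on it, then bind and draw one quad for each slot in drawOrder(). The array of 10 texture names that those slots index into is assumed here, not shown.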

You can also improve performance by using a fragment program to accumulate the 10 textures in one pass (with blending disabled). It will still need a lot of bandwidth, but it should be way faster than blending the 10 quads in the framebuffer.
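
To get the same picture as before, the fragment program would have to reproduce what glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_COLOR) does across the layers. Below is a CPU reference of that per-pixel arithmetic, not shader code. `accumulateLayers` is a hypothetical name, and the sketch assumes the framebuffer was cleared to black, the layers are drawn oldest to newest, and the values are normalized to [0, 1].

```cpp
#include <vector>

// Hypothetical CPU reference for one pixel: gives the value produced by
// drawing the layers in order with glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_COLOR)
// over a black background. Each blend step computes
//   dst = src * 1 + dst * (1 - src)
float accumulateLayers(const std::vector<float>& layers) {
    float dst = 0.0f;  // assumption: framebuffer cleared to black
    for (float src : layers)
        dst = src + dst * (1.0f - src);
    return dst;
}
```

A fragment program doing this in one pass would sample each of the 10 textures at the same coordinate and apply the same update once per sample. The result stays within [0, 1] for inputs in that range, so no extra clamping step is needed.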