Need help: multi-threaded OpenGL programming

I know there are many posts on this forum about multi-threaded OpenGL programming, but I have read nearly every article I could find through Google, on this forum and elsewhere, and I still can't resolve my problem, so I decided to post this to get help.

I have recently been working on a large project in which we use OpenGL to display video. We use PBOs to upload video data to a texture and finally display it in a window. For various reasons we have to use a multithreaded scheme to display video frames, but we get a weird result: sometimes a newer frame is displayed before an older one. To isolate the problem, I wrote a test program that simulates how our project uses OpenGL, and it reproduces the problem. Below I describe the logic of the test program.

This test program has three threads: a render thread (the UI thread), a play thread and a copy thread. Each thread has its own OpenGL rendering context, and these contexts share objects through wglShareLists(). The render thread creates a texture object and two PBOs; the PBOs are kept in a queue.

To simulate the clock of the real project, the play thread starts a periodic timer (40 ms, for PAL video) and maintains a timeline position, which is just an int that simulates real-world time. The play thread waits on the timer event. Each time the timer expires, the play thread first generates a new display task and assigns it a timeline position 8 frames later than the current timeline position (this simulates the preroll of video playback). The newly created task is not yet ready to display; it is pending transfer to the copy thread. The play thread tries to get a PBO from the queue for it; if it gets one, it notifies the copy thread, and the copy thread fills the PBO assigned to the task with some content. The copy logic can be described by the following code:

// Copy thread: map the PBO assigned to this task and fill it on the CPU.
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pTask->idPBO);
void *pBuf = glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
if (pBuf != NULL)	// glMapBufferARB returns NULL on failure
{
	FillImage(pBuf, pTask->nTimelinePos);
	glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB);
}
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
glFlush();

FillImage() draws an image into the PBO assigned to the task according to the task's timeline position. (In our real project we copy the frame image into the PBO; in this test program I just draw a rectangle that moves 10 pixels to the right each frame.) After a task has been copied, the copy thread marks it ready and notifies the play thread. The play thread plays the task at a later time; the play logic can be described by the following code:

//////////////////////////////////////////////////////////////////////////
// upload texture
//////////////////////////////////////////////////////////////////////////
{
	CAutoLock lock(&m_lockPaint);

	glBindTexture(GL_TEXTURE_2D, m_iTex);

	glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pTask->idPBO);

	glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, TEX_WIDTH, TEX_HEIGHT, GL_BGRA_EXT, GL_UNSIGNED_BYTE, 0);

	glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

	glBindTexture(GL_TEXTURE_2D, 0);

	glFlush();
}

Invalidate(FALSE);

The play thread uploads the PBO assigned to the task to the texture object, posts a WM_PAINT message to the render thread (UI thread) through Invalidate(), and then puts the PBO back on the queue. When the render thread receives the WM_PAINT message, it displays the task:

CRect rectClient;
GetClientRect(&rectClient);

{
	CAutoLock lock(&m_lockPaint);

	glEnable(GL_TEXTURE_2D);
	glBindTexture(GL_TEXTURE_2D, m_iTex);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
	glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);

	glBegin(GL_QUAD_STRIP);

	glTexCoord2f(0, 0);
	glVertex2f(0, (GLfloat)rectClient.Height());
	glTexCoord2f(0, 1);
	glVertex2f(0, 0);
	glTexCoord2f(1, 0);
	glVertex2f((GLfloat)rectClient.Width(), (GLfloat)rectClient.Height());
	glTexCoord2f(1, 1);
	glVertex2f((GLfloat)rectClient.Width(), 0);

	glEnd();

	glBindTexture(GL_TEXTURE_2D, 0);
	glDisable(GL_TEXTURE_2D);

	glFlush();
}

SwapBuffers(m_pDC->m_hDC);

When I start the test program I see a rectangle moving steadily to the right, but sometimes a weird thing happens: the rectangle suddenly jumps too far to the right, moves back to the left, and then moves right again. That is to say, a newer frame (larger timeline position) is displayed before an older frame (smaller timeline position).

For now I have two dirty tricks that work around the problem. The first is to add a glFinish() after uploading the PBO to the texture in the play thread, but this significantly increases CPU usage. The second is to remove the copy thread and move its work (filling the image into the PBO) to the play thread, but this conflicts with the design of our real project.

The explanation I can come up with is this: after a task's (a frame's) PBO has been uploaded to the texture through glTexSubImage2D(), the PBO is returned to the queue and can be reused by a newer task. The newer task is then transferred to the copy thread, which fills an image into the same PBO. But because OpenGL is asynchronous, the earlier glTexSubImage2D() is not executed by the GPU immediately; the GPU may perform the upload only after the new task has already filled the same PBO with the newer frame, which would trigger the problem I have seen (and would explain why adding a glFinish() solves it). But I thought it was the OpenGL driver's responsibility to do this synchronization for us. Is that right? And if my explanation is correct, why does doing everything in one thread (filling the PBO and uploading the texture) also solve the problem? Does that mean using PBOs from multiple threads is not safe? I am quite confused now, so I have some questions, listed below:

1) Does my usage of OpenGL conflict with the OpenGL specification?
2) I have read the shared objects part of the OpenGL specification (OpenGL 3.1 and 3.2, Appendix D). From the description, it seems that the OpenGL driver synchronizes the state of shared objects (such as PBOs) between rendering contexts for us, but it doesn't say whether multi-threaded programs can rely on this. Is this an undocumented part of the OpenGL specification?
3) Is this a bug in the OpenGL driver?
4) Can you give me some suggestions on how to solve this problem?

Thanks

BTW, I have an NVIDIA 9800 GT video card with the 190.62 driver installed.
Can I submit an attachment on this forum? I want to attach the test program, but I can't find a way to do so.

OpenGL does not synchronize operations between contexts.
The fact that an OpenGL call has returned does not mean the work is done.
When you pass an object to another context/thread, the commands issued on it are not guaranteed to have completed.

If you want to synchronize OpenGL objects between threads, you must use either glFinish() or the new sync API in OpenGL 3.2 (ARB_sync).
Create a fence with glFenceSync() after the commands that use the resource, then call glWaitSync() in the thread that is about to use a resource filled in another context.
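
A rough sketch of how the fence idea could be applied to the PBO queue from the first post: a PBO goes back into the pool together with a fence, and it is handed out again only once that fence reports the upload has finished. FencedPboPool and its methods are hypothetical names, and the GL calls appear only in comments so the snippet compiles without a GL context. The comments suggest polling with glClientWaitSync() and a zero timeout, on the assumption that it is the CPU side that must hold back before refilling the buffer. The pool is not thread-safe; it assumes only the play thread touches it.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical PBO pool: a buffer becomes reusable only once the fence
// recorded when it was released reports that the upload has completed.
class FencedPboPool {
public:
    // Returns true once the GPU has finished reading the PBO. In real code
    // this could wrap glClientWaitSync(sync, GL_SYNC_FLUSH_COMMANDS_BIT, 0)
    // and treat GL_ALREADY_SIGNALED / GL_CONDITION_SATISFIED as "done".
    using Fence = std::function<bool()>;

    // Add a PBO that has never been used for an upload (no fence needed).
    void add(unsigned pbo) { entries_.push_back({pbo, nullptr}); }

    // Call right after glTexSubImage2D(); in real code the fence would come
    // from glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0) issued at that point.
    void release(unsigned pbo, Fence uploadDone) {
        assert(uploadDone && "a released PBO must carry a fence");
        entries_.push_back({pbo, std::move(uploadDone)});
    }

    // Hand out the oldest PBO, but only if its pending upload has finished.
    bool tryAcquire(unsigned& pboOut) {
        if (entries_.empty()) return false;
        Entry& front = entries_.front();
        if (front.fence && !front.fence()) return false;  // upload still in flight
        pboOut = front.pbo;
        entries_.pop_front();
        return true;
    }

private:
    struct Entry { unsigned pbo; Fence fence; };
    std::deque<Entry> entries_;
};
```

With only two PBOs and an 8-frame preroll, tryAcquire() may fail on some timer ticks; a larger pool is one way to reduce that, at the cost of memory.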

BTW, you don't need an OpenGL context at all to fill a PBO.
Just map the PBO in your own context and let the filler thread do its job (filling memory).

Thanks a lot for your reply!

I have some more questions to ask.

Do you mean that if I upload a PBO to a texture by calling glTexSubImage2D() in one thread/context and immediately call glMapBuffer() in another thread/context, the glMapBuffer() call will not wait until the DMA transfer caused by glTexSubImage2D() has finished? Or do you mean that the thread/context calling glMapBuffer() is likely to map the buffer before the DMA transfer has even started, due to the lack of synchronization?
Thanks.

OpenGL does not define how concurrency issues are resolved so any of those, or any number of other things may happen. Maybe someday it will.

In the meantime, it would probably be best to do all GL stuff in one thread. You can map a buffer object and hand the pointer over to another thread to fill it up (though you’ll need to remember whether you’ve bound and mapped a buffer object to a particular binding point, in case you try to do it while it’s still mapped).
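
A minimal sketch of that hand-off, under the assumption that all GL calls stay on one thread and the worker only ever sees a raw pointer. The pointer passed in stands in for whatever glMapBuffer() returned; fillOnWorker is a hypothetical helper, and this FillImage is a simplified stand-in for the one in the first post (it just writes one byte value per frame). GL calls are shown in comments only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <future>

// Simplified stand-in for the poster's FillImage(): one byte value per frame.
void FillImage(void* dst, std::size_t bytes, int timelinePos) {
    std::memset(dst, timelinePos & 0xFF, bytes);
}

// Runs on the GL thread. `mapped` stands in for the pointer returned by
// glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY) on that same thread.
void fillOnWorker(void* mapped, std::size_t bytes, int timelinePos) {
    assert(mapped != nullptr);  // glMapBuffer can return NULL on failure

    // The worker needs no GL context: it only writes to the mapped memory.
    std::future<void> done =
        std::async(std::launch::async, FillImage, mapped, bytes, timelinePos);

    // The GL thread may do unrelated work here, but must not unmap or
    // otherwise use the buffer until the worker has finished.
    done.get();

    // Back on the GL thread: glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER),
    // then glTexSubImage2D(...) to upload from the PBO.
}
```

Because the map, unmap and upload are all issued from the same context, their ordering is the ordinary single-context ordering, which is what the single-thread workaround in the first post was relying on.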

A few interesting related blog posts you might find useful:

Many thanks for all your replies!
I think I have to redesign my use of OpenGL.