PBO performance (again)

babis · April 13, 2008, 2:35am

Hi,

First of all, I’m aware of the many threads about this topic, so after much reading, I tried to implement my own accelerated texture streaming. Which unfortunately isn’t accelerated much.

I use a worker thread for image loading from disk and the main gl thread. The main thread, every time it needs to upload to the texture :

1. binds the current PBO,
2. unmaps it,
3. calls texSubImage with NULL,
4. binds the next PBO in the queue,
5. maps the memory & passes the pointer to the worker thread,
6. unbinds the PBO.

The first time I start from step 4. I use 8 pbos, but I’ve tried from 2 to 32.

Is there anything wrong with the above procedure? I hoped to get noticeably higher throughput than with a plain texSubImage, but the gain I see is small.

And now some numbers :

For 10 MB of 640x480 images, throughput goes from 150 MB/s (direct) to 200 MB/s (PBOs).

For 239 MB of 360x288 images, it goes from 77 to 82 MB/s.
For 365 MB of 304x200 images, it goes from 46 to 51 MB/s.
For 3.24 GB of 1280x720 images, I get around 36 MB/s with both.
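
For a rough sense of scale, these throughput figures can be turned into frame rates. The sketch below assumes tightly packed 24-bit pixels and 1 MB = 1,000,000 bytes; neither is stated in the post, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one tightly packed frame (assumption: no row padding). */
static size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Frames per second sustained at a given throughput.
   Assumption: 1 MB = 1,000,000 bytes. */
static double frames_per_second(double mb_per_s, size_t bytes_per_frame)
{
    assert(bytes_per_frame > 0);
    return mb_per_s * 1e6 / (double)bytes_per_frame;
}
```

Under those assumptions, frame_bytes(1280, 720, 3) is 2,764,800 bytes, so 36 MB/s works out to roughly 13 frames per second.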

… Or is it just that my hard disk sucks??

Thanks,
babis

Lord_crc · April 13, 2008, 3:17am

Modern HDs manage around 60-70 MB/s (avg) under ideal conditions. That’s for sequential reads though, so if your files are scattered around the disk, that can quickly drop to 20-30 MB/s.

I’m no PBO expert, so I’m sure someone else can help you with that part

mfort · April 13, 2008, 3:21am

To find out the bottleneck I’d replace the load from disk with either a no-op or filling the memory with memset().

You should get around 2000MB/sec.
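
A minimal sketch of such a stand-in, assuming the worker normally fills one frame-sized buffer per image. The function names are hypothetical, plain malloc'd memory is used in place of a mapped PBO, and clock() measures CPU time rather than wall-clock time, so treat the number as a rough upper bound only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Throughput in MB/s (1 MB = 1,000,000 bytes here).
   Returns 0 if the interval is too short to measure. */
static double throughput_mb_per_s(double bytes, double seconds)
{
    return seconds > 0.0 ? bytes / seconds / 1e6 : 0.0;
}

/* Stand-in for the disk read: fill a frame-sized buffer `passes` times
   with memset() and report the fill rate in MB/s.
   Returns a negative value if the allocation fails. */
static double memset_fill_rate(size_t frame_size, int passes)
{
    assert(frame_size > 0 && passes > 0);

    unsigned char *frame = malloc(frame_size);
    if (frame == NULL)
        return -1.0;

    volatile unsigned char sink = 0;
    clock_t start = clock();
    for (int i = 0; i < passes; ++i) {
        memset(frame, i & 0xFF, frame_size);
        /* Read one byte back through a volatile so the fill is less
           likely to be optimized away. */
        sink = frame[frame_size / 2];
    }
    clock_t end = clock();
    (void)sink;
    free(frame);

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return throughput_mb_per_s((double)frame_size * passes, seconds);
}
```

In the real application the memset() would write into the pointer returned by mapping the PBO, so the measured rate there also includes the cost of writing to driver-owned memory.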

babis · April 13, 2008, 4:04am

Thanks for your quick replies,
nice one mfort, I’ll try it now.

babis · April 13, 2008, 4:23am

Well, far from 2000MB/sec.

Actually this is not a test app but a bit larger, so, since I measure the throughput as the ratio imgBytes / workerInterval where workerInterval is the time from one ‘finish loading’ event to another, I don’t expect extreme performance.

Anyway, if I do no texture uploading at all, I get around 876 MB/s. With direct uploading I get ~270 MB/s, and with PBOs ~560 MB/s. In all cases the worker just returns, and the same image buffer is uploaded every time.

So the pbo part seems ok now, hmm…

NiCo1 · April 13, 2008, 5:08am

I’m assuming you’re using the GL_BGRA external data format for unsigned byte textures and GL_RGBA for float textures?

babis · April 13, 2008, 10:08am

Almost. The streaming files use GL_RGB as internal & GL_BGR as pixel format, as they’re bmp’s & I want to avoid a manual conversion.

Nicolas_Lelong · April 13, 2008, 10:34am

You should try GL_RGBA8 and GL_BGRA to see if it improves your performance. Certain platforms (NVIDIA among them for sure: http://developer.nvidia.com/object/nv_ogl_texture_formats.html) have no native hardware support for RGB textures; all textures are uploaded as RGBA, which requires the driver to convert them.

You should also try DXTC compressed textures, since the driver is more likely to leave them untouched.
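
To illustrate the kind of work an RGB-to-RGBA conversion involves, here is a CPU-side sketch that expands 24-bit BGR pixels to 32-bit BGRA. It is not the driver's actual code path, which is not documented in this thread; the function name and the choice of an opaque alpha value are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand tightly packed 24-bit BGR pixels to 32-bit BGRA with opaque alpha.
   `dst` must hold pixel_count * 4 bytes and must not overlap `src`. */
static void bgr_to_bgra(const uint8_t *src, uint8_t *dst, size_t pixel_count)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < pixel_count; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];  /* blue  */
        dst[4 * i + 1] = src[3 * i + 1];  /* green */
        dst[4 * i + 2] = src[3 * i + 2];  /* red   */
        dst[4 * i + 3] = 0xFF;            /* alpha: fully opaque (assumed) */
    }
}
```

The expanded image is one third larger than the BGR original, so doing this on the CPU trades extra memory traffic for a format the hardware can take as-is.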

babis · April 13, 2008, 10:47am

I think I’ll print this & put it under my pillow, too useful…Thanks for the advice! DXT is not an option because I want to stress out the machine, so I want to load them raw.

yooyo · April 14, 2008, 4:25am

Can you tell which hw, OS and drivers you use? Is it AGP or PCIE platform?

You should create a pool of PBOs. Each PBO in the pool should be in one of 3 states:

mapped
full
pending

After the app maps a PBO, it retrieves a pointer into the PBO’s memory. This memory is not cached, so it is not a good idea to decompress a frame directly into PBO memory. It is better to decompress the frame into a system-memory buffer and then just copy that buffer into the PBO memory (using some fast memcpy function). When the worker thread has filled a PBO, it changes the state to full. Meanwhile, the render thread checks the pool, and if there are any full PBOs, the app unmaps the PBO, calls glTexSubImage to upload the texture, and sets the state to pending.

Pending state means “wait for a while before changing the state back to mapped”. How long to wait depends on the image size, but the next frame is a good choice. So: wait one frame, then change the state to mapped. If you map the buffer right after uploading, you take a performance hit, because the app will stall until the upload (glTexSubImage) has finished.

PBOs will not speed up these transfers; they just allow the app and the GPU to run asynchronously without stalling. They let the programmer build a deeper pipeline to hide latency and the real transfer speed.
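
The three-state pool described above can be modeled without any GL calls, which makes the transitions easier to check. This is a single-threaded sketch under stated assumptions: all names are hypothetical, the pool size and one-frame delay are taken from the thread, and a real implementation would need synchronization between the worker and render threads.

```c
#include <assert.h>
#include <stddef.h>

enum { POOL_SIZE = 8, REUSE_DELAY_FRAMES = 1 };

typedef enum { SLOT_MAPPED, SLOT_FULL, SLOT_PENDING } slot_state;

typedef struct {
    slot_state state;
    unsigned long submitted_frame;  /* frame on which the upload was issued */
} pbo_slot;

typedef struct {
    pbo_slot slots[POOL_SIZE];
    unsigned long frame;            /* current frame number */
} pbo_pool;

/* All slots start out mapped, i.e. ready for the worker to fill. */
static void pool_init(pbo_pool *p)
{
    assert(p != NULL);
    p->frame = 0;
    for (int i = 0; i < POOL_SIZE; ++i) {
        p->slots[i].state = SLOT_MAPPED;
        p->slots[i].submitted_frame = 0;
    }
}

/* Worker side: mark the first mapped slot as full once its memory has
   been written. Returns the slot index, or -1 if none is mapped. */
static int pool_mark_full(pbo_pool *p)
{
    for (int i = 0; i < POOL_SIZE; ++i)
        if (p->slots[i].state == SLOT_MAPPED) {
            p->slots[i].state = SLOT_FULL;
            return i;
        }
    return -1;
}

/* Render side: take the first full slot and mark it pending. A real
   renderer would unmap the buffer and issue the texture upload here.
   Returns the slot index, or -1 if nothing is full. */
static int pool_upload_one(pbo_pool *p)
{
    for (int i = 0; i < POOL_SIZE; ++i)
        if (p->slots[i].state == SLOT_FULL) {
            p->slots[i].state = SLOT_PENDING;
            p->slots[i].submitted_frame = p->frame;
            return i;
        }
    return -1;
}

/* Render side, once per frame: advance the frame counter and return
   sufficiently old pending slots to the mapped state (where a real
   renderer would map the buffer again). Returns the number recycled. */
static int pool_end_frame(pbo_pool *p)
{
    int recycled = 0;
    p->frame++;
    for (int i = 0; i < POOL_SIZE; ++i)
        if (p->slots[i].state == SLOT_PENDING &&
            p->frame - p->slots[i].submitted_frame >= REUSE_DELAY_FRAMES) {
            p->slots[i].state = SLOT_MAPPED;
            recycled++;
        }
    return recycled;
}
```

Note that this model picks slots by lowest index rather than in arrival order; a streaming player that must preserve frame order would keep a FIFO of full slots instead.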

babis · April 14, 2008, 5:40am

Thanks for your reply.

HW: 3.2 Northwood, Win XP, 7800 GS AGP & two 4-year-old HDs in RAID 0, which made me falsely think that it would run faster.

Ok I must say the pool scheme you describe is nice ( more defined than others I’ve read)

I actually have a pool of PBOs, 8 of them, and the first usage I tried was something like the one you mention. I memcpy’d to the PBO, but unmapped it immediately, before advancing the current PBO, and in the next frame I would use the last ‘full’ one to bind, upload & bind the next again.

So actually the worker thread instead of freading into pbo, fread into sys & immediately memcpy to pbo. I actually also tried this, but I get almost the same ‘speedups’ that I get now. The rest is what I already do actually. And after uploading, I map the next one in queue so it shouldn’t stall.

But AFAIK one Good Thing about the pbos is that they eliminate that extra copy, how should the way you mentioned be faster?? ( = I didn’t get the cache thing)

yooyo · April 14, 2008, 7:18am

Sequential writes are OK, but random access (read and write) in a PBO buffer is slow. In your case, fread into a PBO block is OK (it is a sequential write access pattern).

Using PBOs you can avoid the stall. Any memory transfer from SYSMEM to VRAM or from VRAM to SYSMEM is a blocking operation (the CPU waits until it finishes). With PBOs these operations become non-blocking (the driver posts the job to the GPU queue and returns immediately).

But the caveat is when you try to reuse the same PBO before the GPU side has marked it “free”. The buffer becomes free after some time (when the GPU finishes the pending job that uses that PBO). You don’t know how much time that takes, so creating several PBOs and waiting some time (1…3 frames) sounds reasonable.

babis · April 14, 2008, 8:19am

And if I’ve understood correctly from the many posts for the subject, for the last caveat, what most people do ( including me in this attempt) is also a glBufferData() with NULL in order to not wait for the GPU to finish, but discard the data & use immediately. If for example you had enough pbos, wouldn’t BufferData be redundant, as the GPU would probably be finished by then? And does it have any difference from glBufferSubData in this matter of data invalidation?

yooyo · April 15, 2008, 2:51am

glMapBuffer may try to copy the existing PBO contents to CPU-reachable memory. But if the app calls glBufferData(…) with NULL, the driver knows that the app doesn’t need the old contents anymore.

From VBO spec:


    Should there be a PRESERVE/DISCARD option on BufferSubDataARB?  On
    MapBufferARB?

        RESOLVED: NO, NO.  ATI_vertex_array_object had this option for
        UpdateObjectBufferATI, which is the equivalent of
        BufferSubDataARB, but it's unclear whether this has any utility.
        There might be some utility for MapBufferARB, but forcing the
        user to call BufferDataARB again with a NULL data pointer has
        some advantages of its own, such as forcing the user to respecify
        the size.

...

    What new usages do we need to add?

        RESOLVED.  We have defined a 3x3 matrix of usages.  The
        pixel-related terms draw, read, and copy are used to distinguish
        between three basic data paths: application to GL (draw), GL to
        application (read), and GL to GL (copy).  The terms stream,
        static, and dynamic are used to identify three data access
        patterns: specify once and use once or perhaps only a few times
        (stream), specify once and use many times (static), and specify
        and use repeatedly (dynamic).

        Note that the "copy" and "read" usage token values will become
        meaningful only when pixel transfer capability is added to
        buffer objects by a (presumed) subsequent extension.

        Note that the data paths "draw", "read", and "copy" are analogous
        in both name and meaning to the GL commands DrawPixels,
        ReadPixels, and CopyPixels, respectively.

...

        BufferData and BufferSubData are sent over the wire just as
        TexImage2D and TexSubImage2D, and GetBufferSubData does a round
        trip, just like GetTexImage.  MapBuffer goes over the wire with
        a request to map; the server replies to tell the client whether
        the map succeeded or failed, and the client returns a pointer to
        a system memory buffer in the event of success.  If the map is
        readable, the server passes back the contents of the buffer,
        while if the map is writeable, at Unmap time, the client passes
        back the new contents.  Unmap would always return TRUE.

babis · April 15, 2008, 7:50pm

If the PBO is mapped for writing, then according to the last part of the quoted spec text, nothing triggers a copy of the old contents.
The only reason that makes sense to me is that, if the GPU were still working with the data, a further overwrite wouldn’t interfere with it. But with enough PBOs in the queue, the chance of that would be minimal.
I guess it’s used just to be on the safe side, especially if it’s a cheap call. And thanks for the reference bits!