A review of a Sprite drawing strategy

Hi,

I am putting together a sprite rendering engine using C++, SDL2 and OpenGL 3.3. I wanted to present the outline of my drawing algorithm in order to receive feedback, in case I am doing something incredibly stupid.

Each instance in the engine has a Draw event. So far, all this does is draw an untransformed textured quad (2 triangles). Since this is a rendering engine, all instance sprite info could potentially change every frame (position, angle, scale, etc.), so all instance data has to be processed/changed in the VBO each frame. Each Draw event leaves some basic info in a “draw command buffer” (a simple byte array).

Then a “draw key” is generated for each instance. The info contained in this key, in order, is: Instance depth, texture ID, index in the “draw command buffer”. These keys are then sorted.
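
In code, generating and sorting the keys could look something like this (a sketch; the 16/16/32-bit field split is illustrative, not necessarily the exact widths I use):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Pack depth, texture ID and command-buffer index into one 64-bit key,
// most significant field first, so a plain sort orders sprites by depth,
// then by texture (breaking batches as rarely as possible).
uint64_t MakeDrawKey(uint16_t depth, uint16_t textureId, uint32_t cmdIndex)
{
    return (uint64_t(depth)     << 48)
         | (uint64_t(textureId) << 32)
         |  uint64_t(cmdIndex);
}

void SortDrawKeys(std::vector<uint64_t>& keys)
{
    std::sort(keys.begin(), keys.end());
}
```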

So far, each quad is 64 bytes (that will increase in the future). Each draw batch can eat up a maximum of 65536 bytes (arbitrary; I may change that). A VBO is created with three times that size, essentially 3 subsections. An index buffer is also created, using unsigned shorts as indices. There is a “section” variable denoting which of the 3 VBO subsections we are in.
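
The setup, roughly (a sketch; `vbo`, `ibo`, `quadIndices` and the constants are placeholder names):

```cpp
const GLsizeiptr BATCH_BYTES = 65536;  // max bytes per draw batch (arbitrary)
const int        SECTIONS    = 3;      // number of VBO subsections

GLuint vbo = 0, ibo = 0;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
// Allocate 3x the batch size; each third is one subsection.
glBufferData(GL_ARRAY_BUFFER, SECTIONS * BATCH_BYTES, nullptr, GL_STREAM_DRAW);

glGenBuffers(1, &ibo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
// Static unsigned-short index buffer, filled once at startup.
glBufferData(GL_ELEMENT_ARRAY_BUFFER, INDEX_BYTES, quadIndices, GL_STATIC_DRAW);
```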

So what happens is the following (a rough code sketch follows the list):

  • Map the VBO using glMapBufferRange() with GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT flags, get the correct VBO subsection mapping.
  • For each (now sorted) draw key, use the index stored in the key to fetch the quad info from the draw command buffer.
  • If adding that quad to the VBO exceeds the subsection size, or if a different texture than the last one is used, “close” that batch (i.e. draw it using glDrawElements), increment the section counter (go to the next VBO subsection), and set the new texture.
  • If we are beyond the end of the VBO, orphan it using buffer respec, set section back to zero. Map that VBO subsection (same flags as above), also setting vertex attribute pointers accordingly.
  • Write quad info in the VBO.
  • Do that until the end, do a last draw if needed.
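
Put together, the per-frame loop looks roughly like this (a sketch, not my literal code; it reuses the names from the setup sketch above, and QuadCmd / QuadFromKey / SetAttribPointers are placeholders):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct QuadCmd { GLuint texture; const uint8_t* vertexData; };

QuadCmd QuadFromKey(uint64_t key, const uint8_t* cmdBuffer); // lookup, elsewhere
void    SetAttribPointers(int section);  // attrib offsets into that subsection

const GLsizeiptr QUAD_BYTES = 64;        // vertex data per quad, for now

// Map one subsection with the flags from the list above.
uint8_t* MapSection(int section)
{
    return (uint8_t*)glMapBufferRange(GL_ARRAY_BUFFER,
        section * BATCH_BYTES, BATCH_BYTES,
        GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT);
}

void DrawSprites(const std::vector<uint64_t>& sortedKeys, const uint8_t* cmdBuffer)
{
    int        section = 0;
    GLsizeiptr used    = 0;    // bytes written into the current subsection
    GLsizei    quads   = 0;    // quads in the currently open batch
    GLuint     lastTex = 0;

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    SetAttribPointers(section);
    uint8_t* dst = MapSection(section);

    // "Close" the open batch: draw it, then move to the next subsection.
    auto flush = [&]()
    {
        glUnmapBuffer(GL_ARRAY_BUFFER);
        if (quads > 0)
        {
            glDrawElements(GL_TRIANGLES, quads * 6, GL_UNSIGNED_SHORT, nullptr);
            if (++section == SECTIONS)   // beyond the end of the VBO:
            {                            // orphan via buffer respecification
                glBufferData(GL_ARRAY_BUFFER, SECTIONS * BATCH_BYTES,
                             nullptr, GL_STREAM_DRAW);
                section = 0;
            }
            SetAttribPointers(section);
        }
        dst   = MapSection(section);
        used  = 0;
        quads = 0;
    };

    for (uint64_t key : sortedKeys)
    {
        QuadCmd cmd = QuadFromKey(key, cmdBuffer);

        if (used + QUAD_BYTES > BATCH_BYTES || cmd.texture != lastTex)
        {
            flush();
            glBindTexture(GL_TEXTURE_2D, cmd.texture);
            lastTex = cmd.texture;
        }
        std::memcpy(dst + used, cmd.vertexData, QUAD_BYTES);
        used += QUAD_BYTES;
        ++quads;
    }

    glUnmapBuffer(GL_ARRAY_BUFFER);      // one last draw if needed
    if (quads > 0)
        glDrawElements(GL_TRIANGLES, quads * 6, GL_UNSIGNED_SHORT, nullptr);
}
```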

The algorithm works. I am trying to benchmark draw speed by recording the time before and after drawing, with a glFinish() before each time recording. Switching textures is obviously a pain, but there are ways to mitigate this (bindless etc.). Also, having an arbitrary depth range can produce lots of batches!

My issue is that rendering 30K sprites with 500 different depth values and only 2 different texture pages in this way eats up half the frame time (and that with a good GPU - I am using my laptop’s NVIDIA RTX 2060). Is this to be expected? Am I being very stupid with something, or is it simply the cost of changing sprite info every frame? I am trying to put things into perspective here; I don’t know if I am close to “the norm” or very far away. Maybe some other user can draw 100K sprites updated each frame and I am way behind. I really don’t know where I stand on this, hence this post.

I am trying to use a more “legacy” way of drawing sprites in order to be more compatible, and I have also read that instancing is not a very good solution for rendering simple geometry like sprites.

Let me know your thoughts on this. Do I have a good basic idea in my hands, or should I redesign my drawing algorithm somehow?

Don’t do that. If you want to measure CPU frame time, just measure the CPU time from one swap of the framebuffer to the next. If you want to measure the GPU time of some rendering process, use a timer query object.
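
For example, a GL_TIME_ELAPSED query around the pass you care about (a minimal sketch; DrawEverything() stands in for whatever you want to measure):

```cpp
// GL_TIME_ELAPSED query (core since GL 3.3) around one rendering pass.
GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
DrawEverything();                      // the GPU work being measured
glEndQuery(GL_TIME_ELAPSED);

// Reading the result here blocks until the GPU finishes; in a real app,
// poll GL_QUERY_RESULT_AVAILABLE or read the result a frame or two later.
GLuint64 gpuTimeNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &gpuTimeNs);
glDeleteQueries(1, &query);
```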

It’s not clear why you would want to both do unsynchronized mapping and invalidate the range. The whole point of invalidation mapping is to bypass the need for synchronization.

Also, instead of using one buffer with 3 regions, consider using 3 separate buffers. Oh, and if you’re considering using bindless, then you’re clearly not wedded to hardware compatibility with GL 3.3-constrained hardware. Since only 4.x-class hardware supports bindless textures, consider using other 4.x features like persistent mapped buffers instead of mapping each frame.


In any case, your whole system seems over-designed for its purpose. They’re just quads. There’s no need for instancing or the like. Just render a sequence of quads. By which I mean go ahead and do the transforms for the quads on the CPU and write the transformed vertex data into the buffer. Then render all of those quads with a single draw call.

Your quads should have 4 vertices, with each vertex containing a 3D position (probably 3 floats), texture coordinates (selecting the image from an array texture or texture atlas), and maybe a color if you need it (4 bytes). In total, each vertex should be 20 bytes.
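
One layout that hits that budget (a sketch; normalized-ushort texcoords plus an RGBA8 color are one way to land on 20 bytes, and the names are illustrative):

```cpp
#include <cstdint>

// 12-byte position + 4-byte texcoords + 4-byte color = 20 bytes per vertex.
struct SpriteVertex
{
    float    x, y, z;     // pre-transformed position
    uint16_t u, v;        // texcoords, normalized unsigned shorts
    uint8_t  r, g, b, a;  // color
};
static_assert(sizeof(SpriteVertex) == 20, "no padding expected");

// Matching attribute setup (stride = 20 bytes, normalized integer attribs):
// glVertexAttribPointer(0, 3, GL_FLOAT,          GL_FALSE, 20, (void*)0);
// glVertexAttribPointer(1, 2, GL_UNSIGNED_SHORT, GL_TRUE,  20, (void*)12);
// glVertexAttribPointer(2, 4, GL_UNSIGNED_BYTE,  GL_TRUE,  20, (void*)16);
```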

The index list of unsigned shorts never needs to change. Ever. It can just be a single GL_TRIANGLES list of (0, 1, 2), (1, 2, 3), (4, 5, 6), (5, 6, 7) and so on, all the way up to 65535. If you have more than 16K quads to render, you can reuse the same index buffer, using base-vertex rendering functions (e.g. glDrawElementsBaseVertex) to offset the vertex fetch in multiples of 64K vertices.
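
Generating that list once could look like this (a sketch):

```cpp
#include <cstdint>
#include <vector>

// Two GL_TRIANGLES per quad, 4 vertices per quad, following the
// (0,1,2), (1,2,3) pattern above; 16384 quads fill the 16-bit index range.
std::vector<uint16_t> BuildQuadIndices()
{
    std::vector<uint16_t> indices;
    indices.reserve(16384 * 6);
    for (uint32_t q = 0; q < 16384; ++q)
    {
        const uint16_t v = uint16_t(q * 4);
        const uint16_t tri[6] = { v, uint16_t(v + 1), uint16_t(v + 2),
                                  uint16_t(v + 1), uint16_t(v + 2), uint16_t(v + 3) };
        indices.insert(indices.end(), tri, tri + 6);
    }
    return indices;
}

// Past 16K quads, reuse the same indices with a base vertex, e.g.:
// glDrawElementsBaseVertex(GL_TRIANGLES, count, GL_UNSIGNED_SHORT,
//                          nullptr, 65536 * batch);
```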

And all of this assumes that you’re rendering a field of arbitrary, individual quads. If you have something like a tilemap with a fixed relationship between tiles, that should be rendered as its own sheet without the need for CPU transformations.

Good points. To clarify a few things:

Don’t do that. If you want to measure CPU frame time, just measure the CPU time from one swap of the framebuffer to the next. If you want to measure the GPU time of some rendering process, use a timer query object.

I am also doing that; I omitted it for brevity. It just happens that both methods yield the same result.

It’s not clear why you would want to both do unsynchronized mapping and invalidate the range. The whole point of invalidation mapping is to bypass the need for synchronization.

This is why. From the OpenGL wiki:

glMapBufferRange has another flag you should know about: GL_MAP_INVALIDATE_RANGE_BIT. This is different from GL_MAP_INVALIDATE_BUFFER_BIT, which you’ve already been introduced to above.
According to Rob Barris, MAP_INVALIDATE_RANGE_BIT in combination with the WRITE bit (but not the READ bit) basically says to the driver that it doesn’t need to contain any valid buffer data, and that you promise to write the entire range you map. This lets the driver give you a pointer to scratch memory that hasn’t been initialized. For instance, driver allocated write-through uncached memory. See this post for more details.

Maybe I understood wrong? Please do let me know if I did!

Also, instead of using one buffer with 3 regions, consider using 3 separate buffers. Oh, and if you’re considering using bindless, then you’re clearly not wedded to hardware compatibility with GL 3.3-constrained hardware. Since only 4.x-class hardware supports bindless textures, consider using other 4.x features like persistent mapped buffers instead of mapping each frame.

That’s an idea I’m seriously flirting with. I just hoped I could squeeze out all the -compatible- juice I can before going pre-Vulkan hi-tech.

In any case, your whole system seems over-designed for its purpose. They’re just quads. There’s no need for instancing or the like. Just render a sequence of quads. By which I mean go ahead and do the transforms for the quads on the CPU and write the transformed vertex data into the buffer. Then render all of those quads with a single draw call.

The goal is to make a rendering engine where instances issue their own drawing commands. An instance may set a shader, draw a sprite, then reset it; another may draw ten sprites. Sprites may belong to different texture pages. If only it were as simple as just issuing a draw call :smiley:

The index list of unsigned shorts never needs to change.

That’s precisely how I do it. Set it up once, leave it be.

In any case, I think I may just be paying the price for flexibility. Maybe I’m getting too scared by reading posts that go like “Hey, I am rendering 1,000,000 quads each frame! I’m great”. Maybe it’s the same (static, never-updated) 1M quads each frame. Who knows.

Not all quads are the same - depending on what you’re drawing, you could just as easily be bottlenecking on fill rate, blending or something else.

For example, 1,000,000 quads in a particle system, where each quad is quite small (relative to the framebuffer) will perform very differently to 1,000,000 quads where each quad is much larger and there’s a lot of overlap, even if the vertex submission is otherwise identical.

The first thing to do here is try to isolate your potential performance issues, so that you’re only measuring the parts you’re concerned about, and your measurements aren’t being skewed by other factors. In your case you should be able to easily enough achieve this by setting all 4 positions of each quad to the same values.
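
For example, when writing each quad’s vertices (a sketch; `verts` and `center` are illustrative names):

```cpp
// Collapse the quad to zero area: all 4 corners get the same position, so
// no fragments are rasterized. Vertex submission and upload costs remain,
// while fill/blend costs drop out of the measurement.
for (int i = 0; i < 4; ++i)
{
    verts[i].x = center.x;
    verts[i].y = center.y;
    verts[i].z = center.z;
}
```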

Then you can try benchmarking that against purely static data, or even glBegin/glEnd code, to see if the performance you’re getting is in the kind of range you’d expect.

That doesn’t really answer my question. I’m asking about why you use that bit with unsynchronized. Invalidation is not supposed to stop and synchronize with anything; that’s kind of the point.

First, the word “instance” when used in conjunction with rendering typically means instanced rendering. You don’t seem to be talking about that.

Second, changing shaders per-quad (or anything close to per-quad) is a really bad idea if you like performance. Whatever an “instance” means in this context should not be changing shaders. Indeed, most “sprite” rendering doesn’t need to change shaders frequently, if at all.

Ok, after reading this, I expected to see that the VBO was mapped PERSISTENT possibly with COHERENT, with fences being used before writing into a new subsection, but…

Why are you trying to combine the two techniques? If you’re going to do this, just size the VBO at 1x and orphan the buffer when “full”.

I would choose one or the other. Both techniques work. The Map UNSYNCHRONIZED technique is simpler.

And if you’re going with that technique, GL_MAP_INVALIDATE_RANGE_BIT is fine, BTW.

The whole purpose of the 3 subsections thing (dropping sync object “bread crumbs” as-you-go) is to avoid writing to a subsection that the GPU hasn’t read from yet. And if you orphan when full, then this is never going to happen, because when you reset to the starting offset of the buffer object for writing, you’re scribbling on a new “page”, not the old one. However, if you’re mapping PERSISTENT, you need some other technique to avoid the CPU stomping on VBO data the GPU hasn’t read from yet when you reset to the starting offset of the buffer object. Thus the sync object “bread crumb” approach per subsection (has the GPU finished reading this subsection? Nope! Gotta wait until it does).
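
In code, the per-subsection crumbs might look roughly like this (a sketch; it assumes a buffer mapped with GL_MAP_PERSISTENT_BIT, and a subsection count like the OP’s):

```cpp
const int SECTIONS = 3;
GLsync fence[SECTIONS] = {};   // one "bread crumb" per subsection

// After issuing the draw(s) that read from this subsection:
void DropCrumb(int section)
{
    fence[section] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

// Before the CPU writes into this subsection again:
void WaitForCrumb(int section)
{
    if (!fence[section])
        return;                // never used yet; safe to write
    // Has the GPU finished reading this subsection? Nope! Gotta wait.
    while (glClientWaitSync(fence[section], GL_SYNC_FLUSH_COMMANDS_BIT,
                            1000000 /* 1 ms */) == GL_TIMEOUT_EXPIRED)
        ; // still waiting
    glDeleteSync(fence[section]);
    fence[section] = nullptr;
}
```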

Also BTW, there’s no need to create separate buffer objects for vertex attributes and indices. Just blast them all into the same buffer object, end-to-end. Simpler. Buffer objects are just arrays of bytes.
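
Something like this, for instance (a sketch; the region sizes, `quadIndices` and `count` are illustrative):

```cpp
// One buffer object holding vertex data and indices, end-to-end.
const GLsizeiptr VERT_BYTES  = 3 * 65536;                     // vertex region
const GLsizeiptr INDEX_BYTES = 16384 * 6 * sizeof(GLushort);  // index region

GLuint buf = 0;
glGenBuffers(1, &buf);
glBindBuffer(GL_ARRAY_BUFFER, buf);
glBufferData(GL_ARRAY_BUFFER, VERT_BYTES + INDEX_BYTES, nullptr, GL_STREAM_DRAW);
glBufferSubData(GL_ARRAY_BUFFER, VERT_BYTES, INDEX_BYTES, quadIndices);

// Bind the very same buffer as the index buffer too; index reads simply
// start at byte offset VERT_BYTES.
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, buf);
glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_SHORT, (void*)VERT_BYTES);
```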

If your goal is to measure worst-case CPU+GPU time to render this frame, that’s fine. It’s useful to be able to measure this in your app without having to employ a separate profiling tool. For the most consistent results, I’d place this after SwapBuffers() and glClear() of the window. This is to ensure that window frame rendering is complete and the driver has obtained a free swap chain image to render the next frame into.
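
For instance (a sketch; SDL_GL_SwapWindow stands in for SwapBuffers since you’re on SDL2):

```cpp
#include <SDL.h>
#include <chrono>
using Clock = std::chrono::steady_clock;

Clock::time_point frameStart = Clock::now();

void EndFrame(SDL_Window* window)
{
    SDL_GL_SwapWindow(window);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glFinish();   // previous frame fully rendered; swap-chain image free

    const auto now = Clock::now();
    const double frameMs =
        std::chrono::duration<double, std::milli>(now - frameStart).count();
    frameStart = now;
    // log or accumulate frameMs ...
}
```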

I don’t know. It’s not clear all of what you’re doing yet. You’re running on NVIDIA. Run your program under Nsight Systems and see where your time is going.

A few random things to think about:

  • Are you running VSync OFF?
  • How many draw calls?
  • How many tris per draw?
  • How many state changes between each draw call?
  • What kind of state changes?
  • How many “expensive” state changes (e.g. shader binds and FBO binds)?
  • How do your frame times compare when you:
    1. do / don’t upload new particles to the GPU?
    2. do / don’t render everything with one shader program?
  • How much VBO data are you uploading from CPU-to-GPU each frame?
  • Are you using NVIDIA bindless buffers for draw call dispatch?

Obviously, no uploads, few draw calls, and few state changes is best (assuming you’re not giving the GPU a bunch of useless work). Minimize expensive state changes like shader binds.

Thanks a lot for your post. To address one basic question:

The reason I chose to test that combination (apart from being a masochist) is that I didn’t want to put hypothetical strain on the driver by orphaning often, so I tried to force orphaning fewer times. Thankfully, the system was designed in such a way that a single variable controls the number of subsections, so I just set that variable to 1. Time-wise, the results were identical. However, I will follow the advice you (and others) posted and ultimately use one single VBO, if buffer orphaning isn’t such a big deal after all.

At this point I have to point out that I gave in and tried bindless textures. Now the fragment shader gets all it needs from two SSBOs: one updated each frame with an index into a sampler array per quad, the other being the sampler array block itself. I don’t know if that’s the fastest way to do it, but draw time has already been cut in half. It’s… very hard to go back now.
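
The handle setup on the C++ side looks roughly like this (a sketch; `texturePages` and the binding point are illustrative):

```cpp
#include <cstdint>
#include <vector>

// ARB_bindless_texture: one 64-bit handle per texture, made resident once,
// then blasted into an SSBO the fragment shader can index per quad.
GLuint SetupBindlessSamplers(const std::vector<GLuint>& texturePages)
{
    std::vector<GLuint64> handles;
    for (GLuint tex : texturePages)
    {
        GLuint64 h = glGetTextureHandleARB(tex);
        glMakeTextureHandleResidentARB(h);   // must be resident before use
        handles.push_back(h);
    }

    GLuint samplerSsbo = 0;
    glGenBuffers(1, &samplerSsbo);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, samplerSsbo);
    glBufferData(GL_SHADER_STORAGE_BUFFER,
                 handles.size() * sizeof(GLuint64), handles.data(),
                 GL_STATIC_DRAW);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, samplerSsbo); // binding = 0
    return samplerSsbo;
}
```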

Since I’ve gone full 4.x, in your experience, is it worth trying persistent buffer mapping? Does it yield better results in general?

EDIT: Apparently…not? It performs roughly the same as “regular” glMapBufferRange (without GL_MAP_UNSYNCHRONIZED_BIT). I may be using it the wrong way though.

My results mirror yours. My implementation supports both as well. To your question above though…:

History note: As I recall, the MAP_PERSISTENT buffer upload support wasn’t developed to fix a speed problem with MAP_UNSYNCHRONIZED. It was developed because NVIDIA found through profiling that the MAP_UNSYNCHRONIZED technique thwarted full parallelism in their driver with the multithreaded driver option enabled (Threaded Optimization = ON).

As you can see here, they really dumped on the MAP_UNSYNCHRONIZED technique, as they wanted you to flip to MAP_PERSISTENT:

Hey if it was faster, I’d have been sold! However, perf was basically the same in my experience (and I’ve re-verified that since).

MAP_UNSYNCHRONIZED is easier and conceptually simpler, as the app doesn’t have to explicitly over-allocate buffer space, fence, and wait for those fences if the GPU gets too far behind. However, it does require more driver mojo under-the-hood to multi-buffer VBO re-allocations (orphans) of the same size for fast “swap chain” like behavior. MAP_PERSISTENT OTOH is more like what you’d do for Vulkan. So if current or future Vulkan cross-compatibility support is required, then it’s the better choice.

Now, I personally always turn Threaded Optimization = OFF because doing so leads to more consistent frame times (with or without MAP_UNSYNCHRONIZED use). And in my world, it’s all about hitting 60Hz, 90Hz, or 120Hz consistently every single frame, with minimal latency and multisampling. None of this “30Hz mostly with blur/TAA” stuff. Here’s a very recent case where perf issues with Threaded Optimization = ON came up:

Interesting. I think that, regarding Persistent Mapped Buffers, there is a lot of misinformation on the internet (I know, how original). There are many blogs, articles etc. that present this feature as the “second coming” that will blast your app through the roof and leave all other buffer mapping techniques smoldering, but for all their enthusiasm they don’t provide adequate context as to which specific use cases PMBs actually benefit. It’s just that new and exciting thing!

Aside from the historical context that you pointed out, and based on my own measurements, PMBs can actually be quite beneficial if the mapped buffer is sufficiently large (say, because it contains lots of complex meshes or whatever). Quads have a small size, so if you’re rendering quads, and you go for a 64K chunk size and use unsynchronized buffer mapping/orphaning, the driver can deliver you a fresh chunk of that size, no biggie. Now try to do that same trick by mapping several megs each frame and your GPU will pack its things and leave.

So, future reader: you want to draw lots of dynamically transformable sprites? Use glMapBufferRange() with GL_MAP_WRITE_BIT and GL_MAP_UNSYNCHRONIZED_BIT in conjunction with buffer orphaning. Use the technique outlined here. Use a reasonable VBO size. Don’t use instancing; it doesn’t translate well to low-vertex-count meshes. Same with geometry shaders - basically anything fancier than drawing four verts. Don’t bother with Persistent Mapped Buffers unless there is also some other functionality you want them for.


Yeah. :slight_smile: Though either one is so much better than garden variety glMapBuffer() (or glMapBufferRange() without these special flags). I could easily see someone thinking PMBs are the holy grail if they’re coming from that starting point.

The key is to avoid implicit synchronization in the driver – however you want to do it!

An interesting update regarding drawing lots of transformable sprites:

First, bindless textures are a must, even if you use atlases. Just do it!

Second, if you are willing to require ARB_shader_storage_buffer_object, you now have the option to completely forgo Vertex Buffer Objects. One issue with drawing quads is that, if you think about it, a lot of data is unnecessarily repeated for each vertex. But a sprite is a set of 4 points with:

  • 4 Texture coordinates (use 4 unsigned shorts, pack them into 2 uints)
  • A dimension (width, height - can be inferred from the texture coordinates)
  • An offset (a “pivot point” that the sprite is drawn rotated/scaled around - again, 2 ushorts = 1 uint)
  • X/Y scale (2 floats)
  • Rotation Angle (1 float)
  • A position (2 floats)
  • A color (1 float - assuming that you colorize the quad as a whole rather than each vertex)

These are 36 bytes. Adding a “texture index” to look up another SSBO for the sampler2D if you’re going bindless, we are at 40 bytes. PER QUAD. That is about 10 bytes per vertex, which is ridiculously small.
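
For illustration, the same 40 bytes as one C++ struct (a sketch - not literally how I lay it out, since I split uints and floats across two SSBOs as described below):

```cpp
#include <cstdint>

// The 40-byte per-quad record tallied above.
struct SpriteData
{
    uint32_t uv01;         // u0,v0 as 2 packed ushorts
    uint32_t uv23;         // u1,v1 as 2 packed ushorts
    uint32_t pivot;        // pivot x,y as 2 packed ushorts
    float    scaleX, scaleY;
    float    angle;        // rotation
    float    posX, posY;
    uint32_t color;        // whole-quad RGBA packed into 4 bytes
    uint32_t textureIndex; // lookup into the bindless sampler SSBO
};
static_assert(sizeof(SpriteData) == 40, "40 bytes per quad");
```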

I found it better to use an SSBO for storing uints and another for storing floats (since interface blocks allow only one variable-length array - I haven’t tried a struct array holding a uint and a float yet, but whatever).

So use an index buffer for your VAO (create it once, use it forever), use gl_VertexID / 4 to find your data position inside the SSBO, and gl_VertexID % 4 to find which point of your quad you’re looking at (0, 1, 2 or 3). Pull data from the SSBOs. Unpack/normalize where needed. Transform your vertex on the GPU. Pass the sampler to the fragment shader (reminder: use flat out / in).
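
A sketch of that vertex shader, embedded as a C++ string (the bindings, the per-quad layout counts, and the pivot encoding are my illustrative assumptions, not necessarily the exact shader):

```cpp
// Vertex-pulling sketch (GLSL 4.30 for SSBOs + ARB_bindless_texture).
const char* kSpriteVertexShader = R"(
#version 430
#extension GL_ARB_bindless_texture : require

// Per-quad data, split into a uint SSBO and a float SSBO as described.
layout(std430, binding = 0) readonly buffer QuadUints  { uint  u[]; };
layout(std430, binding = 1) readonly buffer QuadFloats { float f[]; };
layout(std430, binding = 2) readonly buffer Samplers   { sampler2D smp[]; };

uniform mat4 viewProj;

flat out sampler2D vTex;   // reminder: samplers must be flat
out vec2 vUV;
out vec4 vColor;

const uint NU = 5u;   // uints per quad:  uv01, uv23, pivot, color, texIndex
const uint NF = 5u;   // floats per quad: scaleX, scaleY, angle, posX, posY

void main()
{
    uint quad   = uint(gl_VertexID) / 4u;   // which quad
    uint corner = uint(gl_VertexID) % 4u;   // which of its 4 points

    vec2 uv0 = unpackUnorm2x16(u[quad*NU + 0u]);
    vec2 uv1 = unpackUnorm2x16(u[quad*NU + 1u]);
    vec2 c   = vec2(corner & 1u, corner >> 1u);   // unit-quad corner

    vTex   = smp[u[quad*NU + 4u]];
    vUV    = mix(uv0, uv1, c);
    vColor = unpackUnorm4x8(u[quad*NU + 3u]);

    // Dimension inferred from the texcoords and the texture's size.
    vec2 size  = (uv1 - uv0) * vec2(textureSize(vTex, 0));
    vec2 pivot = unpackUnorm2x16(u[quad*NU + 2u]) * size; // assumed encoding

    vec2  scale = vec2(f[quad*NF + 0u], f[quad*NF + 1u]);
    float angle = f[quad*NF + 2u];
    vec2  pos   = vec2(f[quad*NF + 3u], f[quad*NF + 4u]);

    vec2 local = (c * size - pivot) * scale;
    vec2 world = pos + vec2(local.x*cos(angle) - local.y*sin(angle),
                            local.x*sin(angle) + local.y*cos(angle));

    gl_Position = viewProj * vec4(world, 0.0, 1.0);
}
)";
```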

I have tested this approach on 5 separate systems: NVIDIA, AMD (GPU/APU), even Intel (and one well-known handheld console). Almost all systems spat out the same number of quads in half the time. Worst case, even if you are dealing with a weak GPU, even if a GPU is front-loaded with a specialized vertex unit and chews through VBOs very fast, even if a specific SSBO implementation is bad, the sheer reduction in the amount of data needed to represent your quad will almost surely result in an improvement, even a slight one.


Yep, agreed!

I sure wish they’d expose support for bindless textures in SPIR-V under OpenGL. Because without that, SPIR-V’s useless to me in OpenGL.

Yep, that’s pretty good.

It’d be worth benching this against point sprites too. Like your approach, there’s only one bucket of attributes “per quad” (vs. per vertex), with the texcoord generated implicitly in the shader. You size it dynamically in the shader with gl_PointSize. Moreover, possibly even more efficient than your approach, where the shader code has to “pull” in the quad attributes via dependent SSBO reads, with the point sprite approach the GL pipeline “pushes” the quad attributes into the vertex shader. Depending on GPU/driver internals, this could yield even higher throughput.
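
A minimal sketch of what that looks like shader-side (illustrative names; gl_PointCoord supplies the implicit texcoord in the fragment shader):

```cpp
// Point-sprite sketch: one vertex per quad, size set via gl_PointSize.
const char* kPointSpriteVS = R"(
#version 330
layout(location = 0) in vec3 position;
layout(location = 1) in float sizePixels;
uniform mat4 viewProj;
void main()
{
    gl_Position  = viewProj * vec4(position, 1.0);
    gl_PointSize = sizePixels;   // requires glEnable(GL_PROGRAM_POINT_SIZE)
}
)";

const char* kPointSpriteFS = R"(
#version 330
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
    fragColor = texture(tex, gl_PointCoord);   // implicit per-quad texcoord
}
)";
```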