Vulkan Performance


I am porting my game from OpenGL to Vulkan.
I followed the marvellous Vulkan tutorials of Sascha Willems.
All is working well (validation okay, no crashes) except for the performance.
I don’t expect identical performance: rendering nothing runs at 1000 fps in OpenGL (only because I cap it at 1000) and at 600 fps in Vulkan.
But still… a 1000K triangle scene running at 250 fps in OpenGL and 135 fps in Vulkan is unacceptable.
My Vulkan shaders are not even shading, only multiplying colours.

My geometry is grouped per type for rendering (1 opaque walls, 2 opaque models, 3 alpha, 4 skybox, 5 blending, 6 2D GUI).
It is optimized (no polygon soups, vertex/index buffers (re-used when possible), ordered for performance, …).
This approach is the same for all graphic libs.

My (probably naive) question is how can I render per group, for instance:

  • render opaque walls and send this to the GPU
  • render opaque models and send this to the GPU
  • … same for all other groups.

In other words, do the commands, send them to the GPU so that it can start to work as soon as the commands are ready.

I tried the following techniques to improve the performance, without success:

  • dynamic uniform buffers.

  • threading the rendering, but this is sometimes slower because of extra mutex waits, and only about 5 fps faster in other situations.

  • a primary command buffer with secondary command buffers for the groups, but this doesn’t help much because the secondary command buffers have to be executed from the primary one.
    So this is the same as putting everything in one primary CB and submitting it to the queue.

  • multiple graphics queues, with the CBs of each group put in a primary command buffer and submitted to its own queue.
    In the end, I realized that vkQueuePresentKHR works only with 1 graphics queue :(

Thank you for your time, effort and ideas!

An Nsight screenshot, perhaps it helps.

Have you switched layers off for measuring?

Another common, embarrassing mistake is measuring the time/fps incorrectly.

Is it CPU limited or GPU limited?

Otherwise, the question needs the code of the render loop.

You should probably not allocate in the loop. Why not do it on startup?

Layers are off.
FPS measuring is correct, my measuring is identical to the NSIGHT measuring.
Everything is allocated and built on startup, except for particles.
The allocation comes from particles; it doesn’t happen often and decreases over time as the game accumulates enough pooled particles. But even with particles disabled it makes no difference.
The rendering code is huge and not centralized.
It is GPU limited, because the rendering algorithm is identical for my OpenGL implementation.

	//Bind command to the pipeline
	bool cmdBind(VkPipeline *lastVkPipeline, VkDescriptorSet *lastvkdescriptor){
		VkPipeline vkp;

			if(set->vkdescriptor != *lastvkdescriptor){
				vkCmdBindDescriptorSets(vkcurrentcmdbuf, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline->vkpipelinelayout, 0, 1, &set->vkdescriptor, 0, NULL);
				*lastvkdescriptor = set->vkdescriptor;

			//vkp = pipeline->getPipeline(getDynLights());
			vkp = pipeline->getPipeline(0);
			if(vkp != *lastVkPipeline){
				vkCmdBindPipeline(vkcurrentcmdbuf, VK_PIPELINE_BIND_POINT_GRAPHICS, vkp);
				*lastVkPipeline = vkp;

			return true;

		return false;

	BATCH_TYPE 			bt;
	bool 				swSorted;
	vk_render_isless	fn_sort;

	LIST<BATCH_ELEM*> 	elems;

	inline void reset(){
		swSorted = false;

//this is an implementation for an UBO update (the UBO VkBuffer is persistently mapped)
//it is called by be->berenderer->updateUBO()
void __fct_shader_TEX_LMAP_NRM(BERENDERER *br){
	UBO_MVP_TEX_NRM *ubo = (UBO_MVP_TEX_NRM*)br->addressUBO();


	//MVP Matrix
	ubo->mvp = br->be->mEye ? *render_mat_P() * *br->be->mEye : *render_mat_Identity();

	ubo->mModel = br->be->mObj ? *br->be->mObj : *render_mat_Identity();

	//Texture matrix
		ubo->mtex = *br->be->mesh->mat->tex->mTex;
		ubo->mtex = *render_mat_Identity();

	//x = alphaFade, y = alphaMax
	ubo->alphas.set(1, __render_alphaMax, 0);

// sort for opaque geom (there are more but they just sort differently)
bool __gl_qsort_batch_walls(BATCH_ELEM *a, BATCH_ELEM *b){
	//per distance near to far, discard fragments in shaders
	if(a->distFromCam < b->distFromCam) return true;
	if(a->distFromCam > b->distFromCam) return false;

	//per buffer, cache
	VkBuffer va = a->berenderer->getBufferIBO(), vb = b->berenderer->getBufferIBO(); //vb must come from b, not a
	if(va < vb) return true;
	if(va > vb) return false;

	//per index, cache
	unsigned int oa = a->berenderer->getRange()->idxFrom, ob = b->berenderer->getRange()->idxFrom;
	if(oa < ob) return true;
	if(oa > ob) return false;

	//Quicksort is an unstable sort, therefore
	return a < b;

// this renders a group of geometry (opaque, alpha, blends, etc. ...)
void render_batches_render(BATCH_TYPE bt){
	VkDeviceSize 	start = 0;
	VkBuffer		*pvbo, ibo, *lastpvbo = 0, lastibo = 0;
	int				nIndices;
	RANGE			*range;
	VkPipeline 		lastVkPipeline = 0;
	VkDescriptorSet lastvkdescriptor = 0;

	prof = &renderProfiles[bt];
	if(!prof || !prof->elems.n) return; //skip when there is no profile or it has no elements

	if(bt == BATCH_BLEND){
		//Update UBOs

			if(prof->fn_sort) prof->elems.sort(prof->fn_sort);
			prof->swSorted = true;

			for(int n = 0; n < prof->elems.n; n++){
				be = LIST_AT(prof->elems, n);
				__render_index_counter += be->berenderer->getRange()->range();

				} else {

				//Update the shader matrices

	} else if(bt == BATCH_OPAQUE_DYNAMIC){
		//Update UBOs

			if(prof->fn_sort) prof->elems.sort(prof->fn_sort);
			prof->swSorted = true;

		for(int n = 0; n < prof->elems.n; n++){
			be = LIST_AT(prof->elems, n);
			__render_index_counter += be->berenderer->getRange()->range();

	} else {
		//Update UBOs

			if(prof->fn_sort) prof->elems.sort(prof->fn_sort);
			prof->swSorted = true;

			for(int n = 0; n < prof->elems.n; n++){
				be = LIST_AT(prof->elems, n);
				__render_index_counter += be->berenderer->getRange()->range();

				if(bt == BATCH_ALPHA && be->mesh->mat)



	//RENDER 3D-------------------------------------------------------------------------------------
	for(int i = 0; i < __render_lstBatchTypes.n; i++){
		#ifdef LOG_BATCHES
			if(engine.swLogBatches) util_log(LL_MSG, "** RENDERING QUEUE %s ****************************************" , BATCH_TYPE_NAME[]);

		prof = &renderProfiles[];

		for(int n = 0; n < prof->elems.n; n++){
			be = LIST_AT(prof->elems, n);
			mr = be->mesh->mrenderer;
			range = be->berenderer->getRange();

			//Count the tris that will be rendered
			nIndices = range->range();

			if(be->berenderer->cmdBind(&lastVkPipeline, &lastvkdescriptor)){
				pvbo = be->berenderer->getBufferVBO();
				ibo = be->berenderer->getBufferIBO();

				if(lastpvbo != pvbo){
					vkCmdBindVertexBuffers(vkcurrentcmdbuf, 0, 1, pvbo, &start);
					lastpvbo = pvbo;
				if(lastibo != ibo){
					vkCmdBindIndexBuffer(vkcurrentcmdbuf, ibo, 0, VK_INDEX_TYPE_UINT32);
					lastibo = ibo;
				vkCmdDrawIndexed(vkcurrentcmdbuf, nIndices, 1, range->idxFrom, 0, 0);

			#ifdef LOG_BATCHES
			if(engine.swLogBatches) util_log(LL_MSG, "\t\tbatch elem: %s, dist %.3f, pipeline id %d, range %u - %u, tris %d, tex: %s", SHADERS[be->berenderer->pipeline->id].name, be->distFromCam, be->berenderer->pipeline->id, range->idxFrom, range->idxTill, nIndices / 3, (char*)(be->mesh->mat->tex ? stream::get_filename(be->mesh->mat->tex->path) : "no tex"));

		}//... next batch element



Profile in units of time, not frames per second. 250 FPS is 4 ms per frame. 135 FPS is ~7.4 ms per frame.
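
To make the conversion concrete, here is a minimal sketch. The helper names frameTimeMs and frameTimeDeltaMs are made up for illustration and are not from the poster's engine.

```cpp
#include <cassert>

// Time one frame takes, in milliseconds, for a given frame rate.
double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra time per frame when dropping from fasterFps to slowerFps.
double frameTimeDeltaMs(double fasterFps, double slowerFps) {
    return frameTimeMs(slowerFps) - frameTimeMs(fasterFps);
}
```

With the numbers from this thread, frameTimeDeltaMs(250, 135) is about 3.4 ms, while frameTimeDeltaMs(1000, 600) is only about 0.67 ms. The 400 fps drop on the empty scene therefore costs far less time per frame than the 115 fps drop on the loaded scene, which is why FPS differences are hard to compare directly.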

Having the GPU start on each command as soon as it is written is precisely what you do not do in Vulkan. That’s the whole point of a command buffer: you don’t want work on one command to start just because you wrote it to a CB.

A better question is why you believe that the lack of this is why you are having performance problems.

I tried a lot.
All we want is drawing triangles on a screen fast, nothing more nothing less.
Now, if Vulkan is slow, then so be it.

No, this doesn’t follow.

You don’t have this kind of control in GL. The GL driver is ghosting buffers, relocating them to different memory spaces, multithreading dispatch, and pulling all kinds of tricks behind the scenes to avoid implicit sync. That is special sauce you have to own in Vulkan.

Further, you’re getting different performance, right? A difference of 3.4 msec/frame. So… something’s not the same.

Ok. So you’re on your own there.

You might inspect the timing of this in Nsight Systems. Nsight Graphics has its place. But I find the former more useful when it’s a “where the heck is my frame time going” issue. It shows you cross-thread, cross-queue, cross-device timing over multiple frames. Works well in GL too.

I meant the rendering algorithm or way and data structures the game uses to render, that is identical.
Developers code BIH, PVS, OCTREES, hashing, efficient collision, and everything to be fast… and then there is one component that kills all this effort, the one that should be the fastest, the rendering.
I know I’m on my own, but your answer is already very helpful.
Thank you Dark Photon.

Maybe I am missing something in the screenshot - but it is showing 3.9ms render frame time? (250FPS?)

and 1.8ms waiting on fences if I am reading that right? Perhaps you need to double/triple buffer a dynamic updated vertex buffer/texture?

vkWaitForFences takes the longest time. It is now executed inside a thread among other things (think of a swap or glFlush operation in a thread); this improves performance, but there is much more to do.
The FPS figures I mentioned are an average for the same scene, not specific to this screenshot; I provided the screenshot because it says more than code.
I don’t understand what you mean by double/triple buffering and how it can improve vkWaitForFences. Can you explain a bit more? (ty!)

OK, let’s say you have 3 frames: A, B and C, which need to be rendered in that order.

In order to render frame A, you need to do some graphics setup work. This setup work requires resources: command buffers, uploading data to buffers for matrices and the like, etc. So you do that, submit the CBs for frame A.

Then you do the CPU/engine-side preparations to create the data for frame B. Once you’ve done the engine-side computations, you need to do the graphics setup work for frame B. Again, this requires resources.

If the resources you attempt to use are the same resources used by frame A, then you need to wait for frame A to be finished, via vkWaitForFences or the like. The problem is that frame A is still using those resources; you cannot overwrite them yet. Therefore, before you can even begin to do the setup work for frame B, frame A must have finished rendering.

That’s bad.

To avoid this, the resources used for the setup for frame B need to be different resources from those used by frame A. Different command buffer objects (and probably command pools, so that you can just reset the pools rather than individual CBs), different regions of buffers, etc. This is double buffering: you have two “buffers” worth of resources, and you alternate which gets used on every frame.

Now when frame C comes around, it does need to wait for fences. But it does not wait for B’s fence; it waits for A’s fence. But by the time you’re ready to start frame C’s graphics setup work, frame A should already be finished. So the wait time should be 0.0ms.

So each frame waits on the fences from two frames earlier.
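
The effect can be illustrated with a toy timing model. This is only a sketch under simplifying assumptions, not real Vulkan code: every frame costs a fixed cpuMs to record and a fixed gpuMs to execute, the GPU runs frames strictly in order, and the function and parameter names are invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model of frame pacing (hypothetical names, no Vulkan calls).
// framesInFlight is the number of independent resource sets
// (command pool, fence, buffer region). Returns the total time the CPU
// spends blocked on fences, i.e. what a vkWaitForFences call at the
// start of each frame would cost in this model.
double totalFenceWaitMs(int frames, int framesInFlight, double cpuMs, double gpuMs) {
    assert(frames >= 0 && framesInFlight >= 1);
    std::vector<double> gpuDone(frames, 0.0); // time each frame's fence signals
    double cpuNow = 0.0, gpuFree = 0.0, waited = 0.0;
    for (int i = 0; i < frames; ++i) {
        // Resource set (i % framesInFlight) was last used by frame i - framesInFlight.
        if (i >= framesInFlight) {
            const double fenceTime = gpuDone[i - framesInFlight];
            if (fenceTime > cpuNow) {
                waited += fenceTime - cpuNow; // CPU stalls until the fence signals
                cpuNow = fenceTime;
            }
        }
        cpuNow += cpuMs;                                 // record and submit frame i
        const double start = std::max(cpuNow, gpuFree);  // GPU executes in order
        gpuDone[i] = start + gpuMs;
        gpuFree = gpuDone[i];
    }
    return waited;
}
```

With 100 frames at 2 ms CPU and 2 ms GPU each, totalFenceWaitMs(100, 1, 2.0, 2.0) gives 198 ms of stalls (2 ms on every frame after the first), while totalFenceWaitMs(100, 2, 2.0, 2.0) gives 0. In this model, when the GPU is slower than the CPU some wait remains even with two sets, but the GPU is then kept busy instead of idling while the CPU prepares the next frame.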


So, frames are being prepared and displayed on the screen a few frames later.
That is indeed a good idea!
I will give it a shot.
Thank you!

sqrt_1 was right, double buffering improves the performance.
For those who want to do the same, don’t start messing around with your own threads… :)
It’s all there in Vulkan. A tutorial: “Double buffering” in the Vulkan Guide.
The new screenshot of the same scene shows a dramatic decrease in vkWaitForFences time.
Using a 24-bit depth format (VK_FORMAT_D24_UNORM_S8_UINT) instead of a 32-bit one also increases performance.