Poor multithreading secondary commandbuffer recording performance

I wrote a simple Vulkan app that renders 500 cubes. I get poor performance from my multithreaded secondary command buffer recording, where each thread has its own command buffer and command pool: about 900 fps, while single-threaded recording into one primary command buffer gives me about 4000 fps. What's strange is that single-threaded secondary command buffer recording performs better than the multithreaded version, at about 3000 fps. So far I've tried the CTPL thread pool, Sascha Willems' thread pool, my own thread pool code, and even plain std::thread without a pool. I also fetch function pointers dynamically with vkGetDeviceProcAddr, and of course this is a release build in VS 2022 with validation layers off.

My specs:
Intel Core i7 10700k.
Nvidia GeForce RTX 3070 driver version 496.76
Windows 11 Home.

The code snippet:

void CommandBufferVulkan::DrawMultiThread(uint32_t threadIndex, const std::vector<DrawIndexedMultiThreadInfo>& drawInfo, uint32_t firstIndex, uint32_t lastIndex)
{
    const PerThreadCommandBuffer& perThreadCommandBuffer = mPerThreadCommandBuffers[mCurrentFrame][threadIndex];
    VkCommandPool commandPool = perThreadCommandBuffer.CommandPool;
    VkCommandBuffer commandBuffer = perThreadCommandBuffer.CommandBuffer;

    vkResetCommandPool(mpDeviceVulkan->GetDeviceHandle(), commandPool, 0);

    VkCommandBufferInheritanceInfo inheriteInfo = {};
    inheriteInfo.renderPass = mpCurrentRenderPass->GetRenderPassHandle();
    inheriteInfo.framebuffer = mpCurrentFramebuffer->GetFramebufferHandle(mCurrentFrame);

    VkCommandBufferBeginInfo cmdBufferBeginInfo = {};
    cmdBufferBeginInfo.pInheritanceInfo = &inheriteInfo;

    vkBeginCommandBuffer(commandBuffer, &cmdBufferBeginInfo);
    vkCmdSetViewport(commandBuffer, 0, (uint32_t)mCurrentViewports.size(), mCurrentViewports.data());
    vkCmdSetScissor(commandBuffer, 0, (uint32_t)mCurrentScissors.size(), mCurrentScissors.data());

    std::vector<VkDescriptorSet> globalSets;

    const DrawIndexedMultiThreadInfo& di_1 = drawInfo[firstIndex];

    for (uint32_t i = 0; i < di_1.PGlobalDescriptorSetBindInfo->PDescriptorSets.size(); ++i)
    {
        // ...
    }

    vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, di_1.PGlobalDescriptorSetBindInfo->PPipelineLayout->GetPipelineLayoutHandle(),
        di_1.PGlobalDescriptorSetBindInfo->FirstSet, (uint32_t)globalSets.size(), globalSets.data(), 0, nullptr);

    PipelineVulkan* pPipeline = nullptr;

    for (uint32_t i = firstIndex; i < lastIndex; ++i)
    {
        const DrawIndexedMultiThreadInfo& dii = drawInfo[i];

        if (dii.PPipelineVulkan != pPipeline)
        {
            VkPipeline pipeline = dii.PPipelineVulkan->GetPipelineHandle();
            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline);

            pPipeline = dii.PPipelineVulkan;
        }

        std::vector<VkDescriptorSet> sets;

        for (uint32_t j = 0; j < dii.DescriptorBindInfo.PDescriptorSets.size(); ++j)
        {
            // ...
        }

        vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, dii.DescriptorBindInfo.PPipelineLayout->GetPipelineLayoutHandle(),
            dii.DescriptorBindInfo.FirstSet, (uint32_t)sets.size(), sets.data(), 0, nullptr);

        vkCmdPushConstants(commandBuffer, dii.DescriptorBindInfo.PPipelineLayout->GetPipelineLayoutHandle(), dii.PushConstantStage, 0, dii.PushConstantSize,
            /* ... */);

        VkBuffer vertexBuffer = dii.PVertexBuffer->GetBufferHandle();
        VkBuffer indexBuffer = dii.PIndexBuffer->GetBufferHandle();
        VkDeviceSize offset = 0;

        vkCmdBindVertexBuffers(commandBuffer, 0, 1, &vertexBuffer, &offset);
        vkCmdBindIndexBuffer(commandBuffer, indexBuffer, 0, VK_INDEX_TYPE_UINT32);
        vkCmdDrawIndexed(commandBuffer, dii.IndexCount, 1, 0, 0, 0);
    }
}


void CommandBufferVulkan::DrawIndexedMultiThread(const std::vector<DrawIndexedMultiThreadInfo>& info)
{
    if (info.size() < mpWorkerPool->GetWorkerCount())
    {
        printf("Draw count lower than num of thread!\n");
    }

    uint32_t threadCount = (uint32_t)mpWorkerPool->GetWorkerCount();
    uint32_t drawCountPerThread = (uint32_t)(info.size() / threadCount);
    uint32_t drawCountPerThreadMod = info.size() % threadCount;
    uint32_t firstIndex = 0;
    uint32_t lastIndex = drawCountPerThread;

    VkDevice device = mpDeviceVulkan->GetDeviceHandle();
    std::vector<uint32_t> firstIndexV(threadCount);
    std::vector<uint32_t> lastIndexV(threadCount);
    for (uint32_t i = 0; i < threadCount; ++i)
    {
        if (i == (threadCount - 1))
        {
            lastIndex += drawCountPerThreadMod;
            lastIndexV[i] = lastIndex;
        }
        else
        {
            firstIndexV[i] = firstIndex;
            lastIndexV[i] = lastIndex;

            firstIndex += drawCountPerThread;
            lastIndex += drawCountPerThread;
        }
    }

    for (uint32_t i = 0; i < threadCount; ++i)
    {
        mpWorkerPool->PushWork(i, [=]()
        {
            DrawMultiThread(i, info, firstIndexV[i], lastIndexV[i]);
        });
    }

    // ...

    for (uint32_t i = 0; i < threadCount; ++i)
    {
        DrawMultiThread(i, info, firstIndexV[i], lastIndexV[i]);
    }

    std::vector<VkCommandBuffer> commandBuffers;

    for (uint32_t i = 0; i < mPerThreadCommandBuffers[mCurrentFrame].size(); ++i)
    {
        // ...
    }

    vkCmdExecuteCommands(mCurrentVkCommandBuffer, (uint32_t)commandBuffers.size(), commandBuffers.data());
}

mt screenshot:

st screenshot:

st secondary command buffers recording screenshot:

Please guide me.

First, stop using framerate to measure performance. You need to measure time.

900FPS is 1.1ms. 4000FPS is .25ms. That’s only a difference of 0.85ms. That’s not much in absolute numbers, is it? That could easily be inter-thread synchronization time, which will likely remain constant with regard to scene complexity.

Second, 500 cubes is nothing to even an integrated GPU, let alone an RTX 3070. All you’re really measuring is the CPU overhead of the API and thread synchronization. How such overhead scales with scene complexity is not something you’re measuring.

Basically, you need to profile realistic scenarios if you want to get meaningful numbers. The point of threading the construction of command buffers is to increase CPU throughput of rendering. With only one thread, you can only spend 16.6 ms of CPU time on a single frame if you want to make 60FPS. With 4 threads, you could spend as much as 66 ms of CPU time per frame (in theory; in practice, it’ll be less, but still much more than 16).

Not to be rude, but my first suggestion would be to flip out of gamer mode, switch to graphics developer mode, and start profiling this. FPS is useless. You need to use frame time.

If we assume your numbers are accurate (they’re not; they’ve only got 1 sigfig) and that they’re exactly the same over the entire FPS averaging interval (rarely true, except when you’re doing the exact same thing frame-to-frame):

  • 4000 fps = 0.25 ms/frame
  • 3000 fps = 0.33 ms/frame (0.08 ms slower = 32%)
  •  900 fps = 1.11 ms/frame (0.86 ms slower = 344%)

Your mission, should you decide to accept it, is to determine where that 0.86 ms is coming from. That’s not a tiny amount of time, so you should be able to pin this down pretty easily even with crude profiling methods.

That’s not rude at all. That’s literally what I need: a guide. Thank you. I’ll learn how to profile it properly. Thanks again.

You are calling WaitIdle, which you probably would not do in your single-threaded code.

std::vector::push_back can also be expensive without reserve.
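
As a rough illustration of the reserve point, here is a sketch that fills a reusable vector instead of constructing a fresh one per draw. `Handle` is a stand-in type so the example compiles without the Vulkan headers, and `collectHandles` is a hypothetical helper, not something from the posted code.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for a Vulkan handle such as VkDescriptorSet or VkCommandBuffer.
using Handle = std::uint64_t;

// Copy handles into `out`, reusing its storage between calls.
// clear() leaves the capacity unchanged and reserve() allocates at most once
// up front, so the push_back calls in the loop never reallocate.
inline void collectHandles(const std::vector<Handle>& source,
                           std::vector<Handle>& out) {
    out.clear();
    out.reserve(source.size());
    for (Handle h : source) {
        out.push_back(h);
    }
}
```

If `out` lives outside the per-draw loop (for example one scratch vector per thread), the allocation happens once rather than once per draw.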

Threading has some overhead and requires more code. E.g. it’s one command buffer in the main thread either way, and everything else adds overhead. A single secondary command buffer does not really reap the benefits of multithreading, but you still pay 100 % of the overhead. Equally bad is probably to spam 500 threads with only one cube each, or something along those lines.
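
One way to avoid both extremes is to cap the number of workers by the amount of work. The sketch below is an assumption about how that could look, not the poster's code: `splitDraws` and the `minDrawsPerWorker` knob are hypothetical, and a good threshold would have to come from profiling.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Split [0, drawCount) into contiguous half-open ranges, one per worker.
// Fewer workers than maxWorkers are used when each would get less than
// minDrawsPerWorker draws. The remainder is spread over the first ranges
// instead of being added to the last one.
inline std::vector<std::pair<std::uint32_t, std::uint32_t>>
splitDraws(std::uint32_t drawCount, std::uint32_t maxWorkers,
           std::uint32_t minDrawsPerWorker) {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> ranges;
    if (drawCount == 0 || maxWorkers == 0) {
        return ranges;
    }
    const std::uint32_t minPer = std::max<std::uint32_t>(1, minDrawsPerWorker);
    const std::uint32_t workers =
        std::max<std::uint32_t>(1, std::min(maxWorkers, drawCount / minPer));
    const std::uint32_t base = drawCount / workers;
    const std::uint32_t extra = drawCount % workers;

    ranges.reserve(workers);
    std::uint32_t first = 0;
    for (std::uint32_t i = 0; i < workers; ++i) {
        const std::uint32_t count = base + (i < extra ? 1u : 0u);
        ranges.emplace_back(first, first + count);
        first += count;
    }
    return ranges;
}
```

With 500 draws, 16 available workers and a minimum of 64 draws per worker, this yields 7 ranges of 71 or 72 draws each; with only 10 draws it falls back to a single range.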

PS: Despite some militants, there is nothing wrong with FPS. But you must make sure FPS is the value you actually want to know. If you are measuring “just” inverse duration, that is not really FPS, and you should indeed be measuring the duration.

E.g. in this case you do almost nothing on the GPU (it is one or two clocks’ worth of work), so you might as well ignore that. The rest is largely synchronous code, so you might as well measure that. Just measure how long it takes to execute CommandBufferVulkan::DrawIndexedMultiThread and we can proceed from there. I assume it is going to account for 99 % of the difference between the single-threaded and multithreaded code, and since it is just synchronous code, you can use regular, trivial time-measuring code.
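
A trivial timer of the kind meant here could look like the sketch below. `timeMs` is a name made up for this example; it uses std::chrono::steady_clock because that clock is monotonic.

```cpp
#include <chrono>

// Run `fn` once and return the elapsed wall-clock time in milliseconds.
template <typename Fn>
double timeMs(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Hypothetical usage inside the render loop, where `cmd` is the
// CommandBufferVulkan instance and `info` the per-frame draw list:
//   double ms = timeMs([&] { cmd.DrawIndexedMultiThread(info); });
```

A single frame is noisy, so it is probably worth averaging the result over a few hundred frames and comparing that against the same measurement for the single-threaded path.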