Understanding execution overhead in a simple compute shader

Hello all,

I’m running the following very simple compute shader for learning purposes. It repeats the same work N times; I’m trying to understand Vulkan performance characteristics a bit better. The initialization requests Vulkan 1.2 on an Intel HD 530 (SKL GT2), without validation layers, and the program is compiled in release mode. The system is an up-to-date installation of Arch.

#version 450
#pragma shader_stage(compute)
#extension GL_EXT_shader_8bit_storage : enable

layout(push_constant, std430) uniform pc {
  uint width;
  uint height;
};

layout(std430, binding = 0) readonly buffer Image {
  uint8_t pixels[];
};

layout(std430, binding = 1) buffer ImageOut {
  uint8_t pixelsOut[];
};

layout (local_size_x = 32, local_size_y = 32, local_size_z = 1) in;

void main() {
  const uint idx = gl_GlobalInvocationID.y*width*3 + gl_GlobalInvocationID.x*3;
  const uint count = 150;
  for (int i = 0; i < count; i++) {
    for (int c = 0; c < 3; c++) {
      float vin = float(int(pixels[idx+c])) / 255.0;
      float vout = pow(vin, 2.4);
      pixelsOut[idx+c] = uint8_t(int(vout * 255.0));
    }
  }
}
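
For checking the shader's output on the host, a CPU reference of the per-channel operation can help. The following is only a sketch: the function names are hypothetical, and it assumes the buffers hold tightly packed 8-bit RGB data, as the `*3` indexing in the shader suggests. GPU `pow` precision may differ slightly from the CPU, so individual values could be off by one.

```rust
/// CPU reference for one channel of the shader: normalize to [0, 1],
/// raise to the power 2.4, and rescale to [0, 255].
/// The float-to-integer cast truncates toward zero, like the shader's int() conversion.
pub fn gamma_channel(v: u8) -> u8 {
    let vin = f32::from(v) / 255.0;
    let vout = vin.powf(2.4);
    (vout * 255.0) as u8
}

/// Applies the reference to a whole buffer of 8-bit channel values
/// (assumed layout: 3 bytes per pixel, no padding).
pub fn gamma_image(pixels: &[u8]) -> Vec<u8> {
    pixels.iter().map(|&v| gamma_channel(v)).collect()
}
```

Comparing `gamma_image(&input)` against the downloaded output buffer with a tolerance of 1 should be enough to catch indexing mistakes.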

Changing the iteration count gives the following profile (measured with Tracy). The run time scales linearly with the count, as expected, but flattens out at a minimum of around 50 ms:

What can affect that minimum time? Can I reduce it, or make what is happening more apparent by performing some operations manually?
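
One way to put a number on that fixed cost is to fit the model t(N) = overhead + per_iter * N to the measurements from the linear part of the curve. The sketch below is a plain least-squares line fit; `fit_overhead` is a hypothetical helper, and it assumes each sample is an (iteration count, time in ms) pair.

```rust
/// Least-squares fit of t = overhead + per_iter * n to (n, t) samples.
/// Returns (overhead, per_iter), or None if the samples cannot define a line
/// (fewer than two samples, or all samples share the same n).
pub fn fit_overhead(samples: &[(f64, f64)]) -> Option<(f64, f64)> {
    if samples.len() < 2 {
        return None;
    }
    let count = samples.len() as f64;
    let mean_n = samples.iter().map(|s| s.0).sum::<f64>() / count;
    let mean_t = samples.iter().map(|s| s.1).sum::<f64>() / count;
    // Sum of squared deviations in n, and the co-deviation of n and t.
    let snn: f64 = samples.iter().map(|s| (s.0 - mean_n).powi(2)).sum();
    if snn == 0.0 {
        return None;
    }
    let snt: f64 = samples.iter().map(|s| (s.0 - mean_n) * (s.1 - mean_t)).sum();
    let per_iter = snt / snn;
    let overhead = mean_t - per_iter * mean_n;
    Some((overhead, per_iter))
}
```

The intercept is the time that does not depend on the loop count, which is the part the question is about.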

The two buffers contain 2D image data and are device local, host visible and host coherent, because I need to upload to one of them and download from the other once the shader has finished. The first is mapped with vkMapMemory before shader execution, and the second after.
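
For reference, picking a memory type with those three properties usually amounts to a scan like the one below. This is a self-contained sketch rather than my actual code: `find_memory_type` is a hypothetical helper working on plain `u32` flag masks, and the constants mirror the `VkMemoryPropertyFlagBits` values from the Vulkan specification. Which memory types exist depends on the device and driver.

```rust
// Bit values of VkMemoryPropertyFlagBits from the Vulkan specification.
pub const DEVICE_LOCAL: u32 = 0x1;
pub const HOST_VISIBLE: u32 = 0x2;
pub const HOST_COHERENT: u32 = 0x4;
pub const HOST_CACHED: u32 = 0x8;

/// Returns the index of the first memory type that is permitted by `type_bits`
/// (the memoryTypeBits mask from VkMemoryRequirements) and has every flag in `required`.
/// `type_flags[i]` is the property flag mask of memory type `i`.
pub fn find_memory_type(type_flags: &[u32], type_bits: u32, required: u32) -> Option<usize> {
    type_flags
        .iter()
        .enumerate()
        .take(32) // Vulkan exposes at most 32 memory types
        .find(|&(i, &flags)| ((type_bits >> i) & 1) == 1 && (flags & required) == required)
        .map(|(i, _)| i)
}
```

Calling it with `DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT` as `required` matches the setup described above; adding `HOST_CACHED` for the download buffer is an option to try where the device offers such a type.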

The code running in the pipeline (without the boilerplate) looks like this:

let info = vk::CommandBufferBeginInfo::builder();
device_context.device.begin_command_buffer(device_context.buffer, &info)?;

let descriptor_sets = [pipeline.descriptor_set];
device_context.device.cmd_bind_pipeline(cmd.buffer, vk::PipelineBindPoint::COMPUTE, pipeline.pipeline);
device_context.device.cmd_bind_descriptor_sets(cmd.buffer, vk::PipelineBindPoint::COMPUTE, pipeline.layout, 0, &descriptor_sets, &[]);
device_context.device.cmd_push_constants(cmd.buffer, pipeline.layout, vk::ShaderStageFlags::COMPUTE, 0, &(width as u32).to_ne_bytes());
device_context.device.cmd_push_constants(cmd.buffer, pipeline.layout, vk::ShaderStageFlags::COMPUTE, 4, &(height as u32).to_ne_bytes());
device_context.device.cmd_dispatch(cmd.buffer, ((width + 16) / 32) as u32, ((height + 16)/ 32) as u32, 1);

let buffers = [device_context.buffer];
let info = vk::SubmitInfo::builder().command_buffers(&buffers);
device_context.device.end_command_buffer(cmd.buffer)?;
device_context.device.queue_submit(device_context.queue, &[info], vk::Fence::null())?;
device_context.device.queue_wait_idle(device_context.queue)?;
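
A side note on the dispatch size above: `(width + 16) / 32` rounds to the nearest group count rather than up, so a width of 47, for example, yields one group and leaves 15 columns unprocessed. A minimal sketch of the rounding-up variant follows; `group_count` is a hypothetical helper. Rounding up assumes the shader also returns early when `gl_GlobalInvocationID.x >= width` or `gl_GlobalInvocationID.y >= height`, otherwise the extra invocations would access memory past the end of the image.

```rust
/// Number of workgroups needed to cover `extent` items with groups of `local_size`,
/// rounding up so that a partial group at the edge is still dispatched.
pub fn group_count(extent: u32, local_size: u32) -> u32 {
    assert!(local_size > 0, "local_size must be non-zero");
    // Written without `extent + local_size - 1` to avoid overflow near u32::MAX.
    extent / local_size + u32::from(extent % local_size != 0)
}
```

With this, the dispatch would use `group_count(width as u32, 32)` and `group_count(height as u32, 32)` for the x and y group counts.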