Manually optimizing OpenCL/CUDA intermediate code


I am interested in optimizing OpenCL code. To that end, I went through an OpenCL optimization guide, which says the following things should be considered when optimizing code:

  1. Device utilization and occupancy: launch enough blocks to keep the device fully occupied and to hide memory latency.
  2. Maximize memory bandwidth: minimize data transfers, and overlap data transfers with device computation.
  3. Shared memory: use shared memory when data is accessed more than once, either by the same thread or by different threads within a block.
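
To check my understanding of point 1, here is a minimal C++ sketch of how I think theoretical occupancy could be estimated on the host. The `SmLimits` values a caller passes in are hypothetical placeholders, not figures for any real device. The struct and function names are my own, and the model ignores register and shared-memory allocation granularity.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-multiprocessor limits; real values depend on the device.
struct SmLimits {
    int max_warps;     // resident warps per multiprocessor
    int max_blocks;    // resident blocks per multiprocessor
    int registers;     // 32-bit registers per multiprocessor
    int shared_bytes;  // shared memory per multiprocessor, in bytes
    int warp_size;     // threads per warp
};

// Resources requested by one kernel launch (assumed known, e.g. from compiler output).
struct KernelUsage {
    int threads_per_block;
    int regs_per_thread;
    int shared_bytes_per_block;
};

// Warps needed by one block, rounded up to whole warps.
int warps_per_block(const SmLimits& sm, const KernelUsage& k) {
    assert(k.threads_per_block > 0 && sm.warp_size > 0);
    return (k.threads_per_block + sm.warp_size - 1) / sm.warp_size;
}

// Blocks that fit on one multiprocessor at once: the tightest of the
// block, warp, register, and shared-memory limits.
int resident_blocks(const SmLimits& sm, const KernelUsage& k) {
    assert(k.regs_per_thread > 0);
    const int wpb = warps_per_block(sm, k);
    const int by_warps = sm.max_warps / wpb;
    const int by_regs = sm.registers / (k.regs_per_thread * wpb * sm.warp_size);
    const int by_shared = k.shared_bytes_per_block > 0
                              ? sm.shared_bytes / k.shared_bytes_per_block
                              : sm.max_blocks;  // no shared memory used: not a limit
    return std::min({sm.max_blocks, by_warps, by_regs, by_shared});
}

// Theoretical occupancy: resident warps divided by the maximum resident warps.
double occupancy(const SmLimits& sm, const KernelUsage& k) {
    const int warps = resident_blocks(sm, k) * warps_per_block(sm, k);
    return static_cast<double>(warps) / sm.max_warps;
}
```

With hypothetical limits of 64 warps, 32 blocks, 65536 registers, and 48 KB of shared memory per multiprocessor, a 256-thread block using 32 registers per thread gives an estimate of 1.0, while 64 registers per thread gives 0.5. If this model is roughly right, launching more blocks stops helping once a per-multiprocessor resource runs out.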

There may be a few more things to consider when optimizing.

My questions are:

  1. What other ways are there to optimize OpenCL/CUDA code?
  2. Is there any way to manually optimize the IR code generated by the OpenCL/CUDA compiler? If so, what is the procedure?
  3. Regarding CUDA terminology: why do we have the concepts of warps, blocks, and grids?
  4. OpenCL guarantees that its programs are portable, but it does not guarantee optimal performance across different vendors' devices. If I want optimal performance across different vendors' devices, how should I approach this?
  5. Can I modify the LLVM IR generated by the OpenCL compiler to optimize my code?

Thanks!