No code hoisting for OpenCL?


#1

Hi all, I’m trying to find out whether code hoisting (loop-invariant code motion) is implemented at the compiler level for OpenCL code. I ran a simple example to test it, and it seems not to exist on the Nvidia and AMD platforms. Since invariant code inside a loop is a very common scenario and code hoisting is a simple concept, I would like to know if someone has a definitive answer on the current state of code hoisting in the Nvidia, AMD, and Intel compilers. If it does exist, how do we turn it on?

Here is a minimal example:


// Doubles require the cl_khr_fp64 extension on OpenCL 1.2 and earlier
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Just some complicated operations on the variable random_private
#define zero random_private[0]*random_private[1]*random_private[2]*random_private[3]*random_private[4]*random_private[5]*random_private[6]*random_private[7]*random_private[8]*random_private[9]
#define zero1 powr((double)zero*zero+zero,10)
#define zero2 zero1/(zero1+1)
#define zero3 zero2+zero2*zero2

// Test whether code hoisting exists
// C = A - B + something
kernel void matrix_add1(global double *A, global double *B, global double *C, global uint *random) {
  uint rowNum = 10000;
  uint colNum = 100;
  // Copy random into private memory so that hoisting is valid. (Otherwise another
  // thread could change random while the loop is executing, and hoisting the
  // expression would produce an incorrect answer.)
  uint random_private[10] = {random[0], random[1], random[2], random[3], random[4],
                             random[5], random[6], random[7], random[8], random[9]};
  for (uint j = 0; j < colNum; j++) {
    for (uint i = 0; i < rowNum; i++) {
      // zero3 is a macro that performs some very complicated operations on random_private
      C[i + j*rowNum] = A[i + j*rowNum] - B[i + j*rowNum] + zero3;
    }
  }
}

// Manually hoist the invariant code
kernel void matrix_add2(global double *A, global double *B, global double *C, global uint *random) {
  uint rowNum = 10000;
  uint colNum = 100;
  uint random_private[10] = {random[0], random[1], random[2], random[3], random[4],
                             random[5], random[6], random[7], random[8], random[9]};
  // Compute the loop-invariant expression once, outside the loops.
  // Note: tmp must be double, since zero3 is a double-valued expression;
  // declaring it uint would truncate the value and change the result.
  double tmp = zero3;
  for (uint j = 0; j < colNum; j++) {
    for (uint i = 0; i < rowNum; i++) {
      C[i + j*rowNum] = A[i + j*rowNum] - B[i + j*rowNum] + tmp;
    }
  }
}


I ran each kernel 20 times with just one thread; here are the results on my machine:
Nvidia 1070:
matrix_add1: 28.46 sec
matrix_add2: 4.3 sec

AMD 1600X:
matrix_add1: 5.78 sec
matrix_add2: 0.16 sec

The function matrix_add1 is much, much slower than matrix_add2. Is there anything wrong with matrix_add1, or does the hoisting simply not exist? Is there any third-party compiler that can do the code hoisting and show the intermediate code? Thanks.


#2

I found a reason that may prevent them from implementing code hoisting (besides the budget issue, of course): there is a trade-off between speed and space. Code hoisting uses registers to hold the intermediate results, which saves computation time inside the loop. That works very well as long as you do not spill registers. However, unlike a CPU, one compute unit on a GPU usually needs to run tens of threads simultaneously, so registers are very likely to spill if each thread claims tons of private variables. The benefit of code hoisting therefore depends on how complicated the intermediate result is and how many loop iterations are performed. It is hard to know the answer without knowing the details of the project, so perhaps they simply did not implement it.