Does disabling an unused code segment in a CL kernel impact the running speed?

My colleague and I were recently puzzled by the following finding: there is an if() condition inside my OpenCL kernel, and we know that for a particular run this condition is always false. However, if we leave this unused if block in the .cl file and run the simulation, the run time is almost twice as long as when we completely remove the block from the source code (or disable it with #ifdef/#endif). Yet both versions produce the same output.

My question is: is this kind of behavior common in OpenCL's JIT compilation? Is there anything we can do to ensure that such compilation-related overhead is minimized? A 2-fold difference is significant in my application.

Any comments on this would be appreciated.

It is a mystery to us too, since we can't see the code. One possibility is that the compiler doesn't know the condition is always false (perhaps it is passed in as a kernel argument) and your condition has a lot of code in it, using lots of registers. This could lower occupancy on the GPU and therefore cut the speed, even though the code is never executed. On some architectures (like older AMD) there is a "fast path" that some kernels can take if they avoid doing certain things. Perhaps you do one of those things in your condition, so when you remove it you get onto the fast path (and were not on it before). Those are just two possibilities; I'm sure there are more.

In fact, my code is available online. If you are interested, just check out this git repository:

https://github.com/fangq/mcxcl/tree/mcxlite

After git clone, switch to the mcxlite branch with 'git checkout mcxlite', then go to src and run make. It should produce a binary called mcxcl. Then go to example/quicktest and run 'run_qtest.sh' to do a benchmark.

The block that I found sensitive to performance is this one:

https://github.com/fangq/mcxcl/blob/mcxlite/src/mcx_core.cl#L372-L432

When the "MCXCL_DO_REFLECT" macro is not defined, this block is not compiled by the JIT, and the simulation speed is 19600 photons/ms. If this macro is enabled by running the following command in the quicktest folder:

../../bin/mcxcl -t 16384 -T 64 -g 10 -n 1e7 -f qtest.inp -s qtest -r 1 -a 0 -b 0 -k ../../src/mcx_core.cl -d 0 -J '-D MCXCL_DO_REFLECT'

then the speed drops to 12000 photons/ms. The output results are exactly the same. The test was done on an NVIDIA card (980 Ti), but a similar finding was observed on AMD cards.

Surely it is not always false; otherwise there would be no need to have that block in the first place.

The block is enabled by an input parameter, gcfg->isreflect, which is located in constant memory. This flag is fixed for each kernel execution.

If I set gcfg->isreflect=false, I thought the JIT would know this when building the program and automatically remove the unneeded blocks?

I did notice that my clBuildProgram is called (line #376) before I pass the gcfg constants (line #434):

https://github.com/fangq/mcxcl/blob/mcxlite/src/mcx_host.cpp#L376
https://github.com/fangq/mcxcl/blob/mcxlite/src/mcx_host.cpp#L434

I don't think I can move line 434 before line 376, because mcxkernel has not been created until lines 388/398.

Curious about this "fast path" technique; any links?

Funnily enough, the reason I thought was to blame doesn't explain the slowdown.
Your kernel is too big to run well:

Your code uses a lot of registers, which means only 512 threads can run simultaneously on AMD hardware. That is not enough to hide the high memory latency. The no-reflection variant uses only slightly fewer registers, which should not actually affect performance (perhaps 108 vs. 87 does make a difference on NVIDIA), but I can't tell without runtime data. Split your kernel into a sequence of smaller operations so the compiler can breathe more easily. It should improve general performance as well; just don't be overzealous about it.

Regarding compile-time code removal: you can add the "-D MCXCL_DO_REFLECT" compilation flag when calling clBuildProgram to give the compiler the right idea. Turning particular code paths off with language operators is a valid technique, but it should only be used when you simply have to compile too many variants of the same kernel.
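The guard itself is just the ordinary preprocessor mechanism; here is a minimal plain-C sketch of the idea (simulate_step and its 0.5f factor are made up for illustration, not from the actual kernel). When the macro is absent from the build options, the guarded block is discarded before code generation, so it cannot cost registers.

```c
#include <assert.h>

/* Sketch: when "-D MCXCL_DO_REFLECT" is NOT passed at build time, the
 * guarded block below is stripped by the preprocessor, so it cannot
 * consume registers or instruction-cache space. */
static float simulate_step(float energy) {
#ifdef MCXCL_DO_REFLECT
    /* the heavy reflection/refraction handling would live here */
    energy *= 0.5f;
#endif
    return energy; /* compiled without the macro: a plain pass-through */
}
```

The OpenCL JIT applies exactly the same preprocessing to the .cl source, so an option string passed at clBuildProgram time has the same effect as a compile-time -D flag in C.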

As a general piece of advice, use this tool if you have an AMD machine available:
http://developer.amd.com/tools-and-sdks/opencl-zone/codexl/
You may also read this.
http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf
I think NVIDIA's guide is less comprehensive, even though AMD's contradicts itself on multiple occasions.

[QUOTE=Salabar;39750]Funnily enough, the reason I thought was to blame doesn't explain the slowdown.
Your kernel is too big to run well:

Your code uses a lot of registers, which means only 512 threads can run simultaneously on AMD hardware. That is not enough to hide the high memory latency. The no-reflection variant uses only slightly fewer registers, which should not actually affect performance (perhaps 108 vs. 87 does make a difference on NVIDIA), but I can't tell without runtime data. Split your kernel into a sequence of smaller operations so the compiler can breathe more easily. It should improve general performance as well; just don't be overzealous about it.[/QUOTE]

@Salabar, thanks for looking into this and for the helpful comments. Yes, this is a heavy kernel using a lot of registers. We have been optimizing the CUDA version of this software (mcx, https://github.com/fangq/mcx ); nvvp also pointed out the high register usage. Despite this, memory latency only accounts for 3% of the total latency in the CUDA implementation (benchmarked on a 980 Ti). The kernel seems to be compute-bound. Things could be different for the CL kernel.

For the CUDA version, we did some optimization to move registers (about 15-24) into shared memory. The performance actually went down. We were quite puzzled by this and were not sure if that was the right direction to go. Perhaps we did not reduce registers enough to reach the critical point.

That's part of what I want to know here. I am glad you confirmed that this is a valid approach, although it is somewhat unexpected given what I have read. I thought the whole point of JIT compilation in CL was run-time optimization: when all parameters are provided, the JIT compiler can efficiently 'recompile' to get better performance. But it looks like it is not yet intelligent enough to recognize the settings.
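Since the JIT won't specialize on a constant-memory flag by itself, the specialization can be done by hand on the host: derive the clBuildProgram options string from the runtime flag before building. A sketch (the build_options helper is hypothetical; the macro name is the one from this thread):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: map the runtime isreflect flag to a
 * clBuildProgram options string, so the JIT only ever compiles the
 * variant that will actually run. */
static const char *build_options(int isreflect) {
    return isreflect ? "-D MCXCL_DO_REFLECT" : "";
}
```

The returned string would be passed as the `options` argument of clBuildProgram; the cost is one rebuild whenever the flag changes between runs.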

[QUOTE=Salabar;39750]As a general piece of advice, use this tool if you have an AMD machine available:
http://developer.amd.com/tools-and-sdks/opencl-zone/codexl/
You may also read this.
http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf
I think NVIDIA's guide is less comprehensive, even though AMD's contradicts itself on multiple occasions.[/QUOTE]

Yes, we have been profiling with CodeXL, but as I mentioned in the other thread (https://forums.khronos.org/showthread.php/12951-Line-by-line-time-profiling-for-an-OpenCL-kernel), the CodeXL output was too coarse to provide specific guidance. For the CUDA version, the nvvp that comes with CUDA 7.5 can already do line-by-line profiling on Maxwell. I wish I could find a similar tool for OpenCL. That would make optimization much more focused.

What are KernelOccupancy and VALUUtilization values in CodeXL?

For the CUDA version, the nvvp that comes with CUDA 7.5 can already do line-by-line profiling on Maxwell. I wish I could find a similar tool for OpenCL. That would make optimization much more focused.

That's the second reason to split your kernel into a couple of smaller ones. It's just wiser design-wise, because it allows you to debug, test, and profile operations of a manageable size.

AMD fast path mentioned in their optimization guide: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/

I am glad you asked; we actually have a pretty weird finding regarding these numbers on an AMD GPU.

A few days ago, my student submitted a patch to fix a speed regression issue; see this tracker:

https://github.com/fangq/mcxcl/pull/9

The code changes only involved moving two floating-point accumulations (energyloss and energylaunched) outside a local function (launchnewphoton); see the diff here:

https://github.com/fangq/mcxcl/pull/9/files

The two versions are essentially the same computation-wise; however, the new code runs 4x faster than the old one (3000 photons/ms vs. 800 photons/ms) on a Radeon 7970. My student also looked into the CodeXL profiling output and sent me the following table:


Method             old code     new code (energy* moved out of launchnewphoton())
ExecutionOrder     1            1
ThreadID           4594         4409
CallIndex          53           53
GlobalWorkSize     {16384 1 1}  {16384 1 1}
WorkGroupSize      {64 1 1}     {64 1 1}
Time               11234.5      2734.63     <<-
LocalMemSize       1            1
VGPRs              107          253         <<-
SGPRs              94           99
ScratchRegs        0            24          <<-
FCStacks           NA           NA
Wavefronts         256          256
VALUInsts          7.60E+08     1.00E+08    <<-
SALUInsts          1.80E+08     9690194     <<-
VFetchInsts        4009776      2432148     <<-
SFetchInsts        3.40E+07     3128574     <<-
VWriteInsts        896510       2437945     <<-
LDSInsts           0            0
GDSInsts           0            0
VALUUtilization    5.2          62.76       <<-
VALUBusy           64.13        34.7
SALUBusy           27.45        6.26
FetchSize          2316397      4.00E+07    <<-
WriteSize          6011651      1.00E+08    <<-
CacheHit           98.01        83.41
MemUnitBusy        19.53        40.31       <<-
MemUnitStalled     0.28         2.33
WriteUnitStalled   0            0.16
LDSBankConflict    0            0

I placed a "<<-" marker next to the items that differed significantly. It seems that, by simply moving those two additions outside this local function, vector operations suddenly became possible (is that true? I am not exactly sure how to interpret these numbers).

What made me even more puzzled was that, since energyloss/energylaunched were no longer needed inside launchnewphoton(), I asked my student to remove them from launchnewphoton's parameter list. Surprisingly, he found the speed went down again! The only way to get the higher speed was to keep those two parameters and pass energyabsorbed in place of energylaunched (as shown in his patch: https://github.com/fangq/mcxcl/pull/9/files).

I guess many tricky things can happen when running OpenCL (at least on the AMD card; on the NVIDIA card the difference was not significant). That's why I'd like to do line-by-line profiling and find all these hidden inefficiencies.

I agree, but it is very difficult to restructure a particle random-walk kernel into smaller ones. Each kernel run has to contain the entire life span of a particle (and repeat); otherwise, you have to save a lot of state to memory, which is expected to kill the speed.

Of course, if you happen to know any other Monte Carlo code that has successfully done so, I am happy to learn.

Forgot to attach the occupancy data:


Metric					old code	 new code
Thread ID				4594		 4409
Kernel Name				mcx_main_loop	 mcx_main_loop
Device Name				Tahiti		 Tahiti
Number of compute units			32		 32
Max. number of wavefronts per CU	40		 40
Max. number of work-group per CU	40		 40
Max. number of VGPR			256		 256
Max. number of SGPR			102		 102
Max. amount of LDS			65536		 65536
Number of VGPR used			107		 253		<<-
Number of SGPR used			94		 99
Amount of LDS used			1		 1
Size of wavefront			64		 64
Work-group size				64		 64
Wavefronts per work-group		1		 1
Max work-group size			256		 256
Max wavefronts per work-group		4		 4
Global work size			16384		 16384
Maximum global work size		16777216	 16777216
Nbr VGPR-limited waves			8		 4		<<-
Nbr SGPR-limited waves			20		 16
Nbr LDS-limited waves			40		 40
Nbr of WG limited waves			40		 40
Kernel occupancy			20		 10		<<-

The kernel occupancy actually dropped from 20 to 10, despite the 4x speed improvement.
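For what it's worth, the limiter arithmetic is consistent with the hardware limits in the table above (256 VGPRs, 40 waves max per CU; the 4-SIMDs-per-CU figure for Tahiti is my assumption). A small sketch of where the numbers plausibly come from:

```c
#include <assert.h>

/* Waves per CU allowed by VGPR usage on a GCN part like Tahiti:
 * each of the (assumed) 4 SIMDs has 256 VGPRs shared by its waves. */
static int vgpr_limited_waves(int vgprs_used) {
    const int vgprs_per_simd = 256, simds_per_cu = 4;
    return (vgprs_per_simd / vgprs_used) * simds_per_cu;
}

/* Occupancy as CodeXL appears to report it: the limiting wave count
 * over the 40-wave-per-CU maximum. */
static int occupancy_pct(int limited_waves) {
    const int max_waves_per_cu = 40;
    return 100 * limited_waves / max_waves_per_cu;
}
```

This reproduces the table: 107 VGPRs leaves floor(256/107) = 2 waves per SIMD, i.e. 8 per CU and 20% of the 40-wave maximum, while 253 VGPRs leaves only 1 per SIMD, i.e. 4 per CU and 10%.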

VALUUtilization 5.2

Oh boy, is it abysmal. No clue how the little modification changes this so dramatically. I suggest a "get rid of flow divergence first, think later" approach. Use these techniques ("bypass short-circuiting" appears to be pointless with the current compiler):
http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-521246
Second, code like this creates a ton of branching:

	if (flipdir >= 3.f) {        // transmit through z plane
	    v.xy = tmp0 * v.xy;
	    v.z  = sqrt(1.f - v.y*v.y - v.x*v.x);
	} else if (flipdir >= 2.f) { // transmit through y plane
	    v.xz = tmp0 * v.xz;
	    v.y  = sqrt(1.f - v.x*v.x - v.z*v.z);
	} else if (flipdir >= 1.f) { // transmit through x plane
	    v.yz = tmp0 * v.yz;
	    v.x  = sqrt(1.f - v.y*v.y - v.z*v.z);
	}

This could be replaced with a matrix multiplication (the matrices consist only of 0s and 1s, so they can be encoded smartly):


__constant matrix transform[3] = { /* permutation matrices */ };

v = multiply(v, transform[(int)flipdir - 1]); // permute so the z-plane case applies
v.xy = tmp0 * v.xy;
v.z  = sqrt(1.f - v.y*v.y - v.x*v.x);
v = multiply(v, transform[(int)flipdir - 1]); // permute back

Once the kernel only has the divergence the algorithm intrinsically requires, the real problems should show up. By the way, is the CUDA code much different from this? What are the CUDA profiling results?

Of course, if you happen to know any other Monte Carlo code that has successfully done so, I am happy to learn.

If there is a way to calculate an upper bound on the random numbers required by each work-item, that could be a start.

No clue how the little modification changes this so dramatically

I think I figured it out. The compiler is stuck between two options that are equally bad according to its heuristics. That little change doesn't do much in particular, but it shook something up and made the compiler believe that adding a lot of scratch registers isn't so bad anymore. It did pay off, but it looks like a coincidence to me.

As I mentioned earlier, this was a regression introduced in an earlier commit:

https://github.com/fangq/mcxcl/commit/d732ce92860044fa8994fe36b3fcdd01dceede5c#diff-3e7bff849d973dfbbbf2ff6591ee8862L191

Before this change, the utilization rate was more like the 60% of the corrected code.

I agree that it has tons of branches, and generally speaking, the less divergence the better. But to be honest, I am not convinced that these branches are the bottleneck of the code. That's why I want to find a profiler that gives me more direct evidence.

Part of my doubt comes from the profiling results of the CUDA version. The CUDA version shares almost the same structure/complexity as the OpenCL version (but recently implemented more accurate algorithms, thus slower). However, the nvvp profiler output did not seem to suggest major issues with divergence or branching. Below is the latency contribution report generated by nvvp.

From the line-by-line latency analysis in nvvp, we did find a hotspot in a device function, but I wasn't sure whether any of these metrics identified branching or divergence as the main cause of the latency. Curious if you have any thoughts on this?

Mind explaining this in more detail? I am particularly interested in understanding how you arrived at this conclusion. Perhaps those metrics are more telling than I thought.

It's only a guess. In the first variant, the optimizing compiler didn't want to allocate scratch registers; it managed exactly zero. To achieve this, it had to handle the branching very sub-optimally, which made performance degrade. In the second variant, some heuristic (X instructions in a loop, or whatever) triggered. It snapped in the compiler's head that the amount of spilled registers no longer matters as much, and optimizing whatever it intended to optimize coincidentally improved VALU utilization. It doesn't happen on NVIDIA because they use different heuristics, but it is likely possible to make their compiler do something weird as well.

As for your CUDA profiling: it takes 20% of the time simply to fetch instructions. I found that the compiled kernel is 170 KB, while Radeon's instruction cache is only 32 KB. On the other hand, CodeXL showed a great cache hit ratio, but I don't know whether it accounts for code fetches. What is this metric for a simple kernel like a reduction? If it is much smaller, you may try to restructure your bigger loops by splitting them into a few consecutive ones. This should let the GPU use its instruction cache more effectively (it makes the kernel flow more linear, so to speak, so the GPU won't jump all over the code anymore). Another measure to make the code more linear is to keep branches very short. Instead of


if (whatever) {
    a = compute_x();
} else {
    a = compute_y();
}

use

x = compute_x();
y = compute_y();
a = (whatever) ? x : y;
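As a runnable version of the pattern (compute_x/compute_y are dummy stand-ins): both sides are evaluated unconditionally, and the ternary can compile to a select/cmov instead of a divergent branch.

```c
#include <assert.h>

static int compute_x(int v) { return v * 2; }   /* dummy stand-in */
static int compute_y(int v) { return v + 10; }  /* dummy stand-in */

/* Branch-free selection: evaluate both paths, then select one result.
 * This is only profitable while both paths stay cheap. */
static int pick(int whatever, int v) {
    int x = compute_x(v);
    int y = compute_y(v);
    return whatever ? x : y;
}
```

The obvious trade-off is that you pay for both computations on every work-item, so this only wins when the branches are short relative to the cost of divergence.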

It appears to me, though, that it still comes down to the sheer size of the kernel. How to mitigate that is probably beyond my expertise in GPGPU.