2D convolution kernel

I have a homework assignment to write an optimized kernel for 2D convolution using OpenCL. I wrote it and it works fine, but I want to add an optimization called “register tiling”, which means using the registers of each thread (work-item) to reuse data, in addition to using shared (local) memory.
Has anyone heard of this optimization for 2D convolution and can help me, or is there source code that uses this optimization?
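
To make the idea concrete, here is a minimal sketch of what register tiling usually means for this problem, written in plain C so it runs on a CPU. It is not an OpenCL kernel and not a reference implementation. Everything named here is hypothetical: `K` (filter size), `TILE` (outputs per work-item), and the functions `conv2d_naive`, `conv2d_tile_item`, and `conv2d_register_tiled`. It also assumes a "valid" convolution (no padding) and no filter flip. The function `conv2d_tile_item` stands in for a single work-item: it computes `TILE` neighbouring outputs, keeps the accumulators and a short row of inputs in small local arrays (which a GPU compiler would normally place in registers once the loops are unrolled), and reads each filter weight once for all `TILE` outputs.

```c
#include <stddef.h>

#define K 3     /* filter width and height (assumed for illustration) */
#define TILE 4  /* outputs computed per work-item along x (assumed) */

/* Baseline: one output per "work-item", every operand re-read from memory. */
static void conv2d_naive(const float *in, const float *filt, float *out,
                         int w, int h)
{
    int ow = w - K + 1, oh = h - K + 1;
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x) {
            float acc = 0.0f;
            for (int fy = 0; fy < K; ++fy)
                for (int fx = 0; fx < K; ++fx)
                    acc += in[(y + fy) * w + (x + fx)] * filt[fy * K + fx];
            out[y * ow + x] = acc;
        }
}

/* Work done by ONE work-item with register tiling: TILE adjacent outputs
 * in row y, starting at column x0. */
static void conv2d_tile_item(const float *in, const float *filt, float *out,
                             int w, int x0, int y)
{
    int ow = w - K + 1;
    float acc[TILE] = {0.0f};          /* accumulators, meant to live in registers */

    for (int fy = 0; fy < K; ++fy) {
        /* Load TILE + K - 1 inputs once; they are shared by all TILE outputs. */
        float row[TILE + K - 1];
        for (int i = 0; i < TILE + K - 1; ++i) {
            int x = x0 + i;
            row[i] = (x < w) ? in[(y + fy) * w + x] : 0.0f;
        }
        for (int fx = 0; fx < K; ++fx) {
            float f = filt[fy * K + fx];   /* each weight read once per tile */
            for (int t = 0; t < TILE; ++t)
                acc[t] += row[t + fx] * f;
        }
    }
    /* Write back, skipping outputs past the right edge of a partial tile. */
    for (int t = 0; t < TILE; ++t)
        if (x0 + t < ow)
            out[y * ow + x0 + t] = acc[t];
}

/* Stand-in for the kernel launch: one call per (tile, row) "work-item". */
static void conv2d_register_tiled(const float *in, const float *filt,
                                  float *out, int w, int h)
{
    int ow = w - K + 1, oh = h - K + 1;
    for (int y = 0; y < oh; ++y)
        for (int x0 = 0; x0 < ow; x0 += TILE)
            conv2d_tile_item(in, filt, out, w, x0, y);
}
```

In an actual OpenCL kernel, `x0` and `y` would presumably come from `get_global_id`, the global size along x would shrink by a factor of `TILE`, and the `row` loads would read from the local-memory tile instead of global memory. Per filter row, the naive version reads `TILE * K` inputs for the same outputs, while the tiled version reads `TILE + K - 1`.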