# How to get vectorized performance for XOR and AND on ulong

Hi,

Problem 1:
I’ve tried a couple of tricks to get vectorized performance for XOR and AND operations on ulong data in OpenCL.
None of them gave good performance. One technique, for example, is breaking a ulong into 8 uchar values and
then XORing them (with ^), but that performs worse than the plain ulong XOR.
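
To make the byte-splitting idea concrete, here is a minimal sketch in plain host-side C (not OpenCL C, so it can be compiled and checked outside a kernel). The function name `xor_bytewise` is hypothetical and only for illustration; it is not the kernel code that was measured.

```c
#include <stdint.h>

/* Hypothetical helper: XOR two 64-bit values one byte at a time.
 * The result is identical to a ^ b; the split only adds shifts,
 * casts and ORs around the same bitwise work. */
uint64_t xor_bytewise(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t byte_a = (uint8_t)(a >> (8 * i));   /* i-th byte of a */
        uint8_t byte_b = (uint8_t)(b >> (8 * i));   /* i-th byte of b */
        /* XOR the two bytes and put the result back in position i */
        result |= (uint64_t)(uint8_t)(byte_a ^ byte_b) << (8 * i);
    }
    return result;
}
```

Since `a ^ b` already covers all 64 bits in one expression, the split version does extra shifting and reassembly for the same answer, which is consistent with it running slower.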

    // An assembly-language version of the code below gives a 20% better result
    int len = 0;
    ulong v = …
    while ((v & UCHAR_MAX) == 0) {   // UCHAR_MAX is 255, CHAR_BIT is 8
        v >>= CHAR_BIT;
        len += 1;
    }
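
For reference, the scalar loop counts how many low-order bytes of `v` are zero. A minimal sketch of it as a standalone function in plain host-side C follows (not OpenCL C). The function name `count_trailing_zero_bytes` and the early return for `v == 0` are assumptions added for illustration: the loop as posted never terminates when `v` is zero, because shifting zero leaves it zero.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: number of zero bytes at the low end of v. */
int count_trailing_zero_bytes(uint64_t v)
{
    int len = 0;
    if (v == 0) {
        return 8;   /* assumption: treat all 8 bytes as zero instead of looping forever */
    }
    while ((v & UCHAR_MAX) == 0) {   /* low byte is zero */
        v >>= CHAR_BIT;              /* drop it and look at the next byte */
        len += 1;
    }
    return len;
}
```

For example, `count_trailing_zero_bytes(0xAB000000)` returns 3, because the three lowest bytes are zero and the fourth is 0xAB.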

    int len = 0;
    ulong v = …   // Some 64-bit data
    ulong8 u8cmax = (ulong8)(CHAR_BIT);
    if ((v & UCHAR_MAX) == 0) {
        ulong uvt[8];
        uvt[0] = v;
        int i = 1;
        while (i < 8) {
            uvt[i] = uvt[i-1] >> CHAR_BIT;
            i++;
        }
        ulong8 uv8 = (ulong8)(uvt[0], uvt[1], uvt[2], uvt[3], uvt[4], uvt[5], uvt[6], uvt[7]);
        ulong8 uc = uv8 & u8cmax;   // Vectorized AND of 8 ulong data
        ulong uv[8] = {uc.s0, uc.s1, uc.s2, uc.s3, uc.s4, uc.s5, uc.s6, uc.s7};
        i = 0;
        while (uv[i++] == 0) {
            len += 1;
        }
    }

Can someone shed light on the above? I appreciate…

Thanks,
Syed Hussain