Hello everyone,
At the moment I'm working on a bit-sliced version of the AES algorithm in OpenCL. It is working quite well, but it is a bit slow. As my goal is to get very short execution times, I've run into some problems that I want to discuss. There are some very strange things that I cannot understand, but I hope it's just my lack of knowledge about how OpenCL code is compiled:
First of all: in my project I don't encrypt a stream of data, but just one block of data. Its size is 16 bytes. I run the encryption of that block with many different keys.
In fact, I have a clear text/cipher text pair and I want to find out which key has been used for encryption (a dictionary-based attack). I only use a 128-bit key size. I use uint4 as the datatype, and every vector holds one bit of the data.
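To make the data layout concrete, here is a minimal host-side sketch in plain C of one way such a bit-sliced block could be laid out. It is only a guess at the layout: it assumes one bit per 32-bit lane of a uint4 (so a 128-bit block takes 32 vectors), stores a set bit as an all-ones mask, and numbers bits from the least significant bit of byte 0. The names `u32x4`, `bitslice_block` and `unslice_block` are made up for illustration and are not from the actual kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for OpenCL's uint4: four 32-bit lanes (hypothetical helper type). */
typedef struct { uint32_t s[4]; } u32x4;

enum { BLOCK_BYTES = 16, BLOCK_BITS = 128, LANES = 4, SLICES = BLOCK_BITS / LANES };

/* Spread the 128 bits of one block over 32 vectors, one bit per lane.
 * Assumptions: a set bit becomes an all-ones mask, and bit 0 is the
 * least significant bit of block[0]. */
void bitslice_block(const uint8_t block[BLOCK_BYTES], u32x4 out[SLICES]) {
    assert(SLICES * LANES == BLOCK_BITS);
    for (int bit = 0; bit < BLOCK_BITS; ++bit) {
        uint32_t b = (uint32_t)(block[bit / 8] >> (bit % 8)) & 1u;
        out[bit / LANES].s[bit % LANES] = b ? 0xFFFFFFFFu : 0u;
    }
}

/* Inverse of bitslice_block: collect the lanes back into 16 bytes. */
void unslice_block(const u32x4 in[SLICES], uint8_t block[BLOCK_BYTES]) {
    for (int i = 0; i < BLOCK_BYTES; ++i) {
        block[i] = 0;
    }
    for (int bit = 0; bit < BLOCK_BITS; ++bit) {
        if (in[bit / LANES].s[bit % LANES] & 1u) {
            block[bit / 8] |= (uint8_t)(1u << (bit % 8));
        }
    }
}
```

With this layout, XOR-ing two sliced values lane by lane is the same as XOR-ing the original bits, which is what makes AddRoundKey cheap in a bit-sliced design.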
In my algorithm, one thread does one encryption with one key. I expand the key at the beginning of each encryption. All keys are stored in one big array, which is passed to the kernel in one buffer. The specific key for a thread is copied into a local array in the kernel. All expanded keys are also stored in a local array. The encryption result is saved in a bool array, which has as many elements as there are keys to test. This is working fine, but it is quite slow. One encryption process needs about 32ms. After some research, I found out that when I access bigger arrays (the expanded keys array has 352 elements: 16 bytes per key * 8 bits * 11 round keys / vector size), execution time rises from 0.003ms to 22ms. Also, when I access the bool array to save the result, execution time goes up by 0.2ms.
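For reference, the size of the expanded keys array can be checked with a few lines of plain C. This is only a sketch of the arithmetic: it assumes a vector size of 4 (which is what gives 352 elements) and that one uint4 occupies 16 bytes, as in OpenCL. The names `u32x4`, `expanded_key_elements` and `expanded_key_bytes` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for OpenCL's uint4: four 32-bit lanes, 16 bytes in total. */
typedef struct { uint32_t s[4]; } u32x4;

enum {
    KEY_BYTES     = 16, /* AES-128 key size */
    BITS_PER_BYTE = 8,
    ROUND_KEYS    = 11, /* initial key plus 10 rounds */
    LANES         = 4   /* assumed vector size of uint4 */
};

/* Number of uint4 elements needed for all bit-sliced round keys. */
size_t expanded_key_elements(void) {
    return (size_t)KEY_BYTES * BITS_PER_BYTE * ROUND_KEYS / LANES;
}

/* Memory one thread needs if it holds the whole array by itself. */
size_t expanded_key_bytes(void) {
    assert(sizeof(u32x4) == 16);
    return expanded_key_elements() * sizeof(u32x4);
}
```

Under these assumptions the array is 352 elements, or 5632 bytes, per thread. Whether the compiler can keep something of that size in fast storage is exactly the part I am unsure about.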
I don't have a clue why this is running so slowly. I've changed the key expansion step to expand only one round key at a time and use it directly in the AddRoundKey step. Now it's much faster, as the big array is gone, but accessing the result array is still quite slow. Without saving the result, execution time for one encryption is now down to 0.003ms; with saving, it's up to 0.22ms.
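The idea of expanding one round key at a time can be sketched as follows. Note that this is an ordinary byte-oriented AES-128 key schedule in plain C, not the bit-sliced OpenCL code from the project. It only shows that 16 bytes of key state are enough when each round key is derived from the previous one. The function names are made up for illustration, and the S-box is computed from its definition rather than stored as a table.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t rotl8(uint8_t x, int s) {
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/* Build the AES S-box: multiplicative inverse in GF(2^8) plus affine map. */
static void init_sbox(uint8_t sbox[256]) {
    uint8_t p = 1, q = 1;
    do {
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1B : 0)); /* p = p * 3 */
        q ^= (uint8_t)(q << 1);                                 /* q = q / 3 */
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        q ^= (uint8_t)((q & 0x80) ? 0x09 : 0);
        sbox[p] = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2) ^
                            rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63; /* 0 has no inverse and is handled separately */
}

/* Turn round key r into round key r + 1, in place. */
static void next_round_key(uint8_t rk[16], uint8_t rcon, const uint8_t sbox[256]) {
    /* First word: XOR with SubWord(RotWord(last word)) and the round constant. */
    rk[0] ^= (uint8_t)(sbox[rk[13]] ^ rcon);
    rk[1] ^= sbox[rk[14]];
    rk[2] ^= sbox[rk[15]];
    rk[3] ^= sbox[rk[12]];
    /* Remaining words: XOR with the word just before them. */
    for (int i = 4; i < 16; ++i) {
        rk[i] ^= rk[i - 4];
    }
}

/* Walk the key schedule up to `round` (0..10) using one 16-byte buffer. */
void round_key_at(const uint8_t key[16], int round, uint8_t out[16]) {
    uint8_t sbox[256];
    uint8_t rcon = 0x01;
    assert(round >= 0 && round <= 10);
    init_sbox(sbox);
    for (int i = 0; i < 16; ++i) {
        out[i] = key[i];
    }
    for (int r = 1; r <= round; ++r) {
        next_round_key(out, rcon, sbox);
        /* A real cipher would run the round and AddRoundKey with `out` here. */
        rcon = (uint8_t)((rcon << 1) ^ ((rcon & 0x80) ? 0x1B : 0));
    }
}
```

The trade-off is that the round keys are recomputed for every encryption instead of being read back from an array, which seems to be a good deal in my case.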
Changing the key expansion step shouldn't have to be the solution for that, though. My suspicion is that accessing big arrays is what is slow. Older C compilers might have had a problem with that, but I assume an optimizing C compiler is used for OpenCL. I don't really know.
Here is some information about my graphics card, which is used to run the OpenCL code:
Platform: NVIDIA CUDA
Device: GeForce GTX 1060
Total device memory: 6144 MB
Maximum buffer size: 1536 MB
Number of compute units: 10
I hope someone knows a bit about that.
Best regards and Happy Easter
Patrick