Return result of algorithm, eecuted multiple times

Hello everyone,

at the moment i’m working on a bit sliced version of aes algorithm in OpenCL. It it working quite well, but a bit slow. As my purpose is to get very short execution times, ive got some problems that i want to discuss. Some very strange things that i cannot understand, but i hope it’s just my lack of knowledge about compiling OpenCL code:

At first: In my project i don’t encrypt a stream of data, but just one block of data. Its size is 16 bytes. Also i run encryption with different keys.
In fact: I have a clear text/cipher pair and i want to find out which key has been used for encryption. Dictionary based attack. I only use 128bit key size. I use uint4 as datatype, and every vector holds one bit of the data.

In my algorithm, one thread does one encryption with one key. I precompute the keys at the beginning of each encryption. All keys are stored in one big array, which are handed over in one buffer. The specific key for a thread is copied into a local array in the kernel. All expanded key are also stored in a local array. Encryption result is saved in a bool array, which has as many elements as keys to encrypt. This is working fine, but quite slow. One encryption process needs about 32ms. After some research, i found out that accessing bigger arrays (expanded keys array has [352 elements, 16byte for a key * 8 * 11 rounds] / vector size), execution time exceeds from 0.003ms to 22ms. Also when i access the bool array for saving the result, execution time goes up by 0.2ms.

I don’t have a clue why this is running so slow. I’ve changed the key expansion step to expand only one key at a time and use it directy in the AddRoundKey step. Now it’s much faster, as the big array is gone, but accessing the result array is still quite slow. Without saving the result, execution time for one encryption is now down to 0.003ms, with saving it’s up to 0.22ms

Changing the key expansion step shouldn’t be a solution for that - in fact i think acessing big arrays could be quite slow. Older C compilers might have a problem with that, but i think that an optimized C compiler is used for that, but i don’t really know.

Here are some information about my graphic card, which is used to run the OpenCL code:
Plattform: NVIDIA CUDA
Device: GeForce GTX 1060
Total device memory: 6144 MB
Maximum buffer size: 1536 MB
Number of compute units: 10

I hope someone knows a bit about that.

Best regards and Happy Easter

Little update: I did some more research. Problem is still not solved. In the end of my algorithm i do a check if the original cipher is equal to the calculated state. For that i go thorugh every vector and check if their elements are equal. This is quite akward. Checking two vectors for equality with the == operator is also not useful, as it returns a vector of same data type, and i need a bool.

Anyway: This check costs a lot of time. Execution time goes up from 0.004ms (without the check) to 2.2ms.

Another thought is that the compiler optimizes my algorithm completly away when i don’t use the check at then end, because it all depents on it (the only step where i write something into a global parameter of the kernel, which is used later).


I’m not sure if this will reduce the time that it takes to do the check but why not use a data race to help you speed up the check.
Assign a bool in global (or local) memory. Initialize it as true. If there is ever a case where two values are not the same have the thread set the bool to false.
Even in the case of multiple false assignments, it should end up false. To ensure that this works, you must never have a thread try to assign a true condition.

After you go through all of the vectors, you can call a mem fence and then do a check for the first thread and have it place the bool into global memory (assuming you used a local memory bank) if not you don’t need to do this part.

I’m not sure this will give the speed up you are looking for, but it will return a bool which is what you are looking for.

Let me know if you have other questions. Sorry that I can’t answer your questions about AES encryption though, i don’t know enough about it to be helpful.

Best of luck,


Hi BHa,

thank you for your reply. Unfortunatly, your first suggestion won’t fit into my process. For example: I’ll check 100 keys in parallel, so 200 threads have to be done. If one of them got the matching key, one thread has to set the bool variable to true. For the other suggestion: i would like to avoid barriers. I was able to build up an implementation so far which doesn’t need barriers. But thank you anyway for your help.

There’s another think thats quite strange: I have another kernel which does an aes encryption, but not bitsliced - it just does the algorithm normally. At the end it performs a similar step. In this case i have a 16 byte state and 16 byte cipher, so i can compare it byte wise. The whole process needs 0.023ms of time.

For now i think that setting the global bool variable isn’t the problem… Maybe there’s something different.

Best regards