In my code, several threads need to read many variables from global memory at the same addresses. Unfortunately, the variables are too large to all fit in local memory. As a consequence, reading these variables takes 80% of the time, even though it accounts for less than 5% of the instructions.
Can anyone suggest a way to speed up the access to these shared variables?

(My procedure is somewhat similar to the multiplication of two matrices.)

Optimization is very specific to the hardware you’re targeting, and also to the problem. Without much more detail you’re only going to get vague answers. Some of the things that are generally a good idea on a GPU when accessing global memory:

bmerry is correct, this will take some work.

First (and it seems you’ve done this), code it to use global memory, to work out the algorithm.

Then, figure out how to use shared memory, up to its limited size.

If you can’t fit everything you need, figure out some subset that will be useful.

There are great examples of using shared memory for matrix multiplication. Find them and study them to figure out how to make the best use of shared memory.
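
For what it's worth, here is a minimal sketch of the tiling pattern those matrix-multiply examples typically use, written in CUDA. It is only an illustration under some assumptions, not your actual code: the names `matmul_tiled`, `run_check` and `TILE`, the square row-major layout, and the tile size of 16 are all made up for the example. The idea is that each block stages one small tile of each input in shared memory, every thread in the block reuses it, and then the block moves on to the next tile, so only a subset of the data has to fit at any one time.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile edge; tune for your hardware

// C = A * B for square n x n row-major matrices (hypothetical layout).
// Each block computes one TILE x TILE tile of C.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Each thread loads one element of each tile; out-of-range entries become 0.
        tileA[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // both tiles must be fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();  // everyone must finish reading before the tiles are overwritten
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Runs the kernel and returns the largest absolute difference from a CPU reference.
// Error checking of the CUDA calls is omitted to keep the sketch short.
float run_check(int n)
{
    size_t count = static_cast<size_t>(n) * n;
    std::vector<float> a(count), b(count), c(count), ref(count, 0.0f);
    for (size_t i = 0; i < count; ++i) {
        a[i] = static_cast<float>(i % 7) * 0.5f;
        b[i] = static_cast<float>(i % 5) - 2.0f;
    }
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                ref[i * n + j] += a[i * n + k] * b[k * n + j];

    float *dA, *dB, *dC;
    size_t bytes = count * sizeof(float);
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    float maxErr = 0.0f;
    for (size_t i = 0; i < count; ++i)
        maxErr = std::fmax(maxErr, std::fabs(c[i] - ref[i]));
    return maxErr;
}

int main()
{
    float err = run_check(100);  // 100 is not a multiple of TILE, so the edge handling is exercised
    std::printf("max abs error: %g\n", err);
    return err < 1e-3f ? 0 : 1;
}
```

With this layout each element of A and B is fetched from global memory roughly n / TILE times instead of n times, which is where the saving comes from. If you are on OpenCL rather than CUDA, the same pattern should carry over with `__local` arrays and `barrier(CLK_LOCAL_MEM_FENCE)` in place of `__shared__` and `__syncthreads()`.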