Usually the number of work items is determined by the problem itself, e.g. one work item per output pixel.
Otherwise, a reasonable general rule of thumb for all GPUs is N×64 work items per compute unit, where N is 4-16 (or more). The ideal N depends as much on the code you're running as on the architecture it runs on, so just experiment.
You must have many threads per core in order to hide memory latency and keep the ALUs busy. If the code is ALU-bound rather than memory-bound, though, this matters less.
A GTX 560, for instance, has 7 compute units, so should I create only 7 work-groups with more work-items inside each?
Or can I create more work-groups with fewer work-items, provided the number of work-groups is a multiple of 7?
Because when I work with a CPU, I always set the number of threads equal to the number of processing elements; creating more just causes contention. And on a CPU the number of processing elements is generally a power of two, so you split your workload into a power-of-two number of pieces.
On the GPU, taking the GTX 560 with its 7 compute units as an example, should I follow the same pattern?