Automatic Load-balancing compiler?

I am a beginner with parallelizing compilers.
I am looking for a compiler or utility that automatically detects parallelizable code segments and compiles them accordingly.
For example, when the compiler finds a simple “for” loop, I imagine it could determine whether the loop can run efficiently on OpenCL devices, and then automatically generate either OpenCL kernel code or ordinary CPU code. (Building on LLVM might make this somewhat easier.)
However, I found that typical CUDA and OpenCL compilers require the parallel kernel code to be marked explicitly.
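
To make this concrete, here is a minimal sketch of what I mean, not taken from any real tool. The names vec_add_loop and vec_add_kernel_src are made up for illustration. The first function is the kind of plain loop I would like a compiler to recognize on its own; the string below it is roughly the OpenCL C kernel that currently has to be written and marked by hand. The host-side setup (context, buffers, enqueueing the kernel) is left out.

```c
#include <stddef.h>

/* Plain C loop: each iteration touches only element i, so the iterations
   are independent and could in principle be run in parallel. */
void vec_add_loop(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

/* The same computation as hand-written OpenCL C kernel source.
   The __kernel qualifier is the explicit marking, and get_global_id(0)
   takes the place of the loop index. Illustrative only: the host code
   that would compile and launch this string is not shown. */
const char *const vec_add_kernel_src =
    "__kernel void vec_add(__global const float *a,\n"
    "                      __global const float *b,\n"
    "                      __global float *out)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = a[i] + b[i];\n"
    "}\n";
```

The tool I am imagining would take only the first function as input and decide by itself whether to emit something like the second form or to keep the loop on the CPU.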

Do you know of any work on this kind of automatic parallelization, or of compilers that do it?
Thanks in advance.

My guess is that this technology is still too young for such embellishments.

I don’t see why something like this would not work, at least in many common situations. But since some vendors are still working out bugs in their OpenCL 1.0 implementations, I think we have a while to wait for such intelligent tools…

But, eventually, I think yes!