Hello gentlemen and gentlewomen.
I’ve gotten my very complex kernel to work on both CPU and GPU now, and that constitutes proof of concept for me.
It’s time, I think, to give this MacBook Pro a bit of a rest (I’m cooking the poor thing alive!) and get a Mac Pro with a heavy-duty graphics card to take it to the next level.
I calculate that I’ll need at least 1.26 teraflops, and I’d rather have about 2.3. The Radeon 5870 available in the Mac Pro claims about 2.7, which should be wonderlicious. But the programming I’ve done so far has been on the NVidia 300GM in my MacBook Pro.
SO, I’m wondering – should I stay with NVidia, since my stuff already runs on their branded hardware, even though it’s a different architecture? … OR, go for the ATI for the raw (claimed) processing power, and hope that the debugging I’ve already done isn’t too wasted?
The problem with NVidia is that it doesn’t support OpenCL 1.1, because they’re trying to be a monopolist, like Intel in CPUs and Microsoft in software. I don’t like these three companies. But buying a card that’s only one step up isn’t much of a gain; buying the ATI 6990, two steps up, is the better choice.
Thanks for the suggestion, uelkfr.
I like your thinking, and I also do not like monopolists.
My problem with going to the 6990 is that I’m on MacOS, and I’m pretty sure that dual-GPU cards only show up as a single GPU there (same with the 5970), so half the silicon would be wasted. Plus, I think two 8-pin power connectors are required, which the Mac Pro lacks.
SO, among single-GPU cards, there are several ATI options that claim over 2 teraflops … the highest seem to be the 5870 (2.7 Tf) and the 6970 (also 2.7 Tf). The 6970 supports double precision, but not fast enough for my purposes; besides, I have already converted my algorithms to single precision and have found that to be adequate.
The 6950 claims 2.25 Tf and the 6870 2 Tf … so, if what I want is the fastest single-precision performance available in a single-GPU card, it looks like the 5870 or the 6970. I’d love to have the entire 2.7 Tf, but of course it would be very unlikely to achieve the maximum theoretical throughput. As for needing to vectorize the kernel to achieve best performance on ATI vs. nVidia, I think that would not be too hard to do … maybe one long day to change all relevant floats to float4s while correspondingly reducing the granularity of my kernel … if I get lucky!