Where can I find graphics card specifications for OpenCL programming? - opencl

I would like to buy a new card, but I can't find much information about the cards.
I'm looking for:
- private memory size
- local memory size
- constant memory size
- texture memory size
Thanks for the answers!

You may have more luck researching a card's architecture rather than specific cards. Graphics cards of a particular generation often share the same on-chip building blocks, just in varying quantities. For example, the architecture of a core is often the same across a series of cards, while the number of cores differs from product to product.
I also recommend reading AnandTech graphics card reviews. They tend to go into great detail when they benchmark graphics cards.
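If you already have access to a card (or once you have bought one), the OpenCL runtime itself reports most of these limits through clGetDeviceInfo. A minimal sketch, assuming a single GPU platform and omitting error checking:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_ulong local_mem, const_mem, global_mem;
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof(const_mem), &const_mem, NULL);
        clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);

        size_t img_w, img_h;
        clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_WIDTH, sizeof(img_w), &img_w, NULL);
        clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_HEIGHT, sizeof(img_h), &img_h, NULL);

        printf("Local memory:    %llu KB\n", (unsigned long long)(local_mem / 1024));
        printf("Constant buffer: %llu KB\n", (unsigned long long)(const_mem / 1024));
        printf("Global memory:   %llu MB\n", (unsigned long long)(global_mem / (1024 * 1024)));
        printf("Max 2D image:    %zu x %zu\n", img_w, img_h);
        /* Private (per-work-item) memory maps to registers and is not exposed
           by the spec; for that you still need the vendor's architecture docs. */
        return 0;
    }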

Related

Radeon HD 4850 and OpenCL: will cl_khr_fp64 work on this videocard?

This video card (Radeon HD 4850) conforms only to OpenCL 1.0, according to AMD's compatibility table. I need some hardware to conduct intensive financial calculations with doubleN types (no floats at all!). According to that same table, the card is able to work with double types, and I now have the chance to buy it at quite an attractive price.
I'd greatly appreciate hearing from anyone with real experience using this card for OpenCL with the fp64 extension. Of course, if there are problems with this card, please mention them here.
Thank you and sorry for my English.
I haven't used this card with DP before, but if the spec says it is supported, then it's worth a try.
In my opinion, you should go with a newer model card though. There are a lot of cheap cards out there that will outperform the 4850, and they will support some newer features as well.
This card supports double precision, but the 4xxx series doesn't include local memory on the chip. Since the standard mandates local memory support, it is emulated in global memory, which is very slow. Many algorithms need local memory to obtain a good speed-up, so a newer card (5xxx or higher) is a lot better.
In addition, some combinations of older cards and older SDK versions only support double precision through the cl_amd_fp64 extension (not the official cl_khr_fp64 extension), because a few minor requirements of the standard are not met. For the most part this doesn't matter much, except that you need to change the extension name in your code to make doubles work.
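For reference, that change usually comes down to the pragma at the top of the kernel. A minimal sketch that enables whichever extension the compiler reports (the trivial kernel is only for illustration):

    // Enable whichever double-precision extension the device exposes.
    // Newer cards/SDKs define cl_khr_fp64; some older AMD combinations
    // only provide cl_amd_fp64.
    #ifdef cl_khr_fp64
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    #elif defined(cl_amd_fp64)
    #pragma OPENCL EXTENSION cl_amd_fp64 : enable
    #endif

    __kernel void scale(__global double *data, double factor) {
        size_t i = get_global_id(0);
        data[i] *= factor;
    }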
As a general tip, I would avoid the 4xxx series if you intend to do serious GPGPU development. Keep in mind also that the newer 7xxx series is much more optimized for GPU computation than both the 5xxx and 6xxx series, closing much of the gap with NVIDIA cards. So, if you can, aim for a 7xxx card with double precision support.

Is there any benefit in nVidia Tesla cards?

I'm planning to buy a serious GPU for running a parallel algorithm (budget 2k-4k). Now I see supercomputers everywhere featuring nVidia Tesla GPU cards "made especially for GPGPU".
While this seems very nice at first sight, a closer reading gives me serious second thoughts: compared to e.g. a Radeon HD 7970, its performance (in terms of FLOPS) is significantly lower, its price is significantly higher, and I can't seem to find any benchmark comparison between the Tesla and normal gaming GPUs.
I have found that the Tesla features ECC-memory. Is this the only difference? Or am I missing a deeper architectural difference between both? Perhaps relevant info: I will be using OpenCL, not Cuda.
There are two technical differences I know of between the brands when comparing similar cards.
1) Nvidia cards tend to have better double precision FLOPS than AMD - by a factor of 2 sometimes. AMD usually does better for single precision FLOPS.
2) ECC memory is available for both brands for the GDDR5 memory. The difference is that Nvidia uses ECC on the internal memory (registers and such) as well, where AMD does not.
In my opinion, choose the card based on your application. If you use more single than double precision, go AMD; otherwise Nvidia. If you need ECC for high fault tolerance, maybe Nvidia is your best choice. Sometimes many cheaper cards do better than one or two top-of-the-line cards - think of PCIe bandwidth. Read up on benchmarks and try to determine which card is best suited for your needs.
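If it helps the comparison, both the ECC and the double-precision points can be checked programmatically on whatever device you have at hand. A small sketch, again with error handling omitted:

    #include <stdio.h>
    #include <string.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_bool ecc;
        clGetDeviceInfo(device, CL_DEVICE_ERROR_CORRECTION_SUPPORT, sizeof(ecc), &ecc, NULL);

        char extensions[4096];
        clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);

        printf("ECC supported:    %s\n", ecc ? "yes" : "no");
        printf("Double precision: %s\n",
               strstr(extensions, "fp64") ? "yes (khr or amd extension)" : "no");
        return 0;
    }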
I don't know if your problem is similar to mining bitcoins, but there is a LOT of info on parallel GPU setups here...
https://en.bitcoin.it/wiki/Mining_hardware_comparison

Qt and "SGX Out of mem event" on Maemo

I'm still fighting with Qt and I managed to get an "SGX Out of mem event" on Nokia N900. It happens when I load some .obj models in my QGraphicsScene (usually after the fourth-fifth). Any idea on what is causing it or how I could trace it?
My guess is that you are running out of graphics memory (the N900's GPU is an Imagination Technologies PowerVR SGX 530).
As far as I can see, the N900 does not have any EGL extensions for directly querying graphics memory usage. In this case, the best you can do may be to reduce graphics memory usage by limiting the complexity of the scene you are trying to render - in other words, load fewer OBJ models, or reduce the complexity (number of polygons) of individual models.

Local data store vs. Texture cache in Cayman Architecture for scientific computation

I am trying to write a GEMM implementation using the AMD APP SDK 2.4 on an ATI HD 6990 card (Cayman architecture).
One of the optimizing techniques is the use of blocking/tiling.
In this implementation, is it faster to store the sub-matrices in shared local memory, or is it faster to use the texture cache? If possible, please also give the reason.
Please also suggest which is easier to implement.
Thanks.
P.S. I want it for single precision only, if it matters!
Note: The size of the sub-matrix is not an issue; if anything, I feel the larger it is, the better. The only factor to take into consideration is that if the unit of memory is 128 bits (4 single-precision values), then the block size should be a multiple of 4.
The Cypress chips were used in the 5800 series Radeons. The 6900 series uses the Cayman core, which has several important differences, most notably that it is a VLIW4 architecture instead of the VLIW5 configuration used in earlier cores.
As always, the only definitive way to know which method is faster is to benchmark it. In particular, since you give no information about the size of the sub-matrices, it is hard to say where they will best fit.
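For what it's worth, the local-memory variant of blocking is generally the easier of the two to write and makes a reasonable baseline for that benchmark. Below is a minimal sketch of a tiled single-precision kernel (square N x N matrices, N assumed to be a multiple of TILE, no vectorization), purely as a starting point; a texture-cache version would instead read A and B through image2d_t arguments with read_imagef:

    #define TILE 16

    __kernel void sgemm_tiled(const int N,
                              __global const float *A,
                              __global const float *B,
                              __global float *C) {
        // Each work-group computes one TILE x TILE block of C,
        // staging blocks of A and B through local memory.
        __local float Asub[TILE][TILE];
        __local float Bsub[TILE][TILE];

        const int row = get_global_id(1);   // row of C
        const int col = get_global_id(0);   // column of C
        const int lr  = get_local_id(1);
        const int lc  = get_local_id(0);

        float acc = 0.0f;
        for (int t = 0; t < N / TILE; ++t) {
            // Cooperative load of one tile of A and one tile of B.
            Asub[lr][lc] = A[row * N + (t * TILE + lc)];
            Bsub[lr][lc] = B[(t * TILE + lr) * N + col];
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int k = 0; k < TILE; ++k)
                acc += Asub[lr][k] * Bsub[k][lc];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[row * N + col] = acc;
    }

Launch it with a global size of (N, N) and a local size of (TILE, TILE); whether the image path beats it on Cayman is exactly what the benchmark should tell you.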

Developing with OpenCL on ATI and Nvidia at the same time

Our workgroup is slowly trying out a little OpenCL in a side project. So far 'everybody' is working on an NVIDIA Quadro FX 580. Now we are planning to buy new computers for new colleagues, and instead of the FX 580 we could buy the ATI FirePro V4800, which costs only 15 EUR more and gives us 1 GB of RAM instead of 512 MB, which will be beneficial for our data-intensive tasks.
So, how much trouble is it to develop OpenCL code for Nvidia and ATI at the same time?
I read the following SO question, Running OpenCL on hardware from mixed vendors, which was very pessimistic about developing on/for different vendors. On the other hand, the question is already a year old.
What do you recommend?
I have previously worked extensively with the CUDA programming language.
I have been planning to start developing apps using OpenCL. As you mentioned, one of the best features of OpenCL is that it runs on hardware from many vendors (Intel, AMD and Nvidia).
One project I came across that uses OpenCL extensively for large-scale development is http://sourceforge.net/projects/hypgad/. It might be a good idea to look at this group's source code to understand how they developed their application for so many kinds of hardware, including the Sony Cell processor.
Another approach would be to use PyOpenCL, which provides a higher level of abstraction than plain OpenCL and can significantly reduce the coding effort.
Do you need the code to run unchanged on both bits of hardware? If so you may have to develop for a limited subset of common functions.
If you can run slightly different code on each, you will probably get better performance - in CUDA/OpenCL you generally have to tune the algorithms for the amount of RAM and the number of GPU engines anyway, so it shouldn't be much more work to also tweak for NVidia/AMD.
The biggest problem is workgroup sizes. Some ATI cards I have used crash at sizes above 64, but then it may be the Apple OS X 10.6 drivers I am using.
Developing for both ATI and NVIDIA is actually not too difficult, so long as you avoid using any part of either vendor's SDK. Stick to OpenCL as it is defined in the OpenCL spec (www.khronos.org/opencl) and your code will stay syntax-portable. Due to differences in the underlying architectures, performance portability may be an issue. Local and global work sizes really have to be determined independently for each card to maximize performance. Another thing to pay attention to is the types being used. Vector types (float2, float4) are especially useful on ATI cards, as each processing element actually contains 4 execution units (one for each RGB color channel, plus alpha).
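On the work-group-size point, one portable habit is to query the limits at runtime instead of hard-coding 64 or 256. A small sketch of a helper (pick_local_size is just an illustrative name; note that CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE needs OpenCL 1.1 or later):

    #include <CL/cl.h>

    /* Pick a 1D local work size that is valid for this kernel on this device,
       instead of hard-coding a value that only one vendor's cards accept. */
    static size_t pick_local_size(cl_kernel kernel, cl_device_id device) {
        size_t device_max = 0, kernel_max = 0, multiple = 1;
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(device_max), &device_max, NULL);
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(kernel_max), &kernel_max, NULL);
        /* Preferred multiple (warp/wavefront granularity), OpenCL 1.1+. */
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(multiple), &multiple, NULL);

        size_t limit = kernel_max < device_max ? kernel_max : device_max;
        size_t local = (limit / multiple) * multiple;
        return local ? local : limit;
    }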

Resources