Developing with OpenCL on ATI and Nvidia at the same time - opencl

Our workgroup is slowly trying out a little OpenCL in a side project. So far 'everybody' is working on an NVIDIA Quadro FX 580. Now we are planning to buy new computers for new colleagues, and instead of the FX 580 we could buy the ATI FirePro V4800, which costs only 15 EUR more and gives us 1 GB instead of 512 MB of RAM, which would be beneficial for our data-intensive tasks.
So, how much trouble is it to develop OpenCL code at the same time on Nvidia and ATI?
I read the following SO question, Running OpenCL on hardware from mixed vendors, which was very pessimistic about developing on/for different vendors. On the other hand, that question is already a year old.
What do you recommend?

I have previously worked extensively with CUDA.
I have been planning to start developing apps using OpenCL. As you mentioned, one of the best features of OpenCL is that it runs on hardware from many vendors (Intel, AMD and Nvidia).
One project I came across that uses OpenCL extensively for large-scale development is http://sourceforge.net/projects/hypgad/. It might be a good idea to look at the source code from this group and understand how they have developed their application for so many kinds of hardware, including the Sony Cell processor.
Another approach would be to use PyOpenCL, which provides a higher level of abstraction than the OpenCL C API and can significantly reduce the coding effort.

Do you need the code to run unchanged on both bits of hardware? If so, you may have to develop for a limited subset of common functions.
If you can run slightly different code on each, you will probably get better performance. In CUDA/OpenCL you generally have to tune the algorithms for the amount of RAM and the number of GPU engines anyway, so it shouldn't be much more work to also tweak for Nvidia/AMD.
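As a rough illustration of tuning for the amount of RAM, here is a minimal sketch that assumes the data is a flat array which can be processed in independent chunks. The name `plan_chunks` and its parameters are hypothetical; on a real device the limit would come from a device query such as CL_DEVICE_MAX_MEM_ALLOC_SIZE.

```python
def plan_chunks(n_items, item_bytes, max_alloc_bytes):
    """Split n_items into (start, count) chunks that each fit in one buffer.

    max_alloc_bytes is an assumed per-buffer limit supplied by the caller.
    """
    if n_items < 0 or item_bytes <= 0:
        raise ValueError("n_items must be >= 0 and item_bytes must be > 0")
    per_chunk = max_alloc_bytes // item_bytes
    if per_chunk == 0:
        raise ValueError("a single item does not fit in the buffer limit")

    chunks = []
    start = 0
    while start < n_items:
        # The last chunk may be shorter than the others.
        count = min(per_chunk, n_items - start)
        chunks.append((start, count))
        start += count
    return chunks
```

The same host code then runs on both cards; only the limit passed in differs, so a card with more memory simply needs fewer chunks.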

The biggest problem is workgroup sizes. Some ATI cards I have used crash at above 64, but then it may be the Apple OSX 10.6 drivers I am using.

Developing for both ATI and NVIDIA is actually not too difficult so long as you avoid using any part of either vendor's SDK. Stick to OpenCL as it is defined in the OpenCL spec (www.khronos.org/opencl) and your code will stay syntactically portable. Due to differences in the underlying architectures, performance portability may be an issue. Local and global work sizes really have to be determined independently for each card to maximize performance. Another thing to pay attention to is the types being used. Vector types (float2, float4) are especially useful on ATI cards, as each processing element actually contains 4 execution units (one for each RGB color channel, plus alpha).
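To make the work-size point concrete, here is a minimal sketch for a 1-D launch. The helper name `pick_work_sizes` is hypothetical, and the device limit is assumed to come from a query such as CL_DEVICE_MAX_WORK_GROUP_SIZE (or CL_KERNEL_WORK_GROUP_SIZE for a specific kernel). It clamps the preferred local size to that limit and pads the global size, since OpenCL 1.x requires the global size to be a multiple of the local size.

```python
def pick_work_sizes(n_items, preferred_local, device_max_local):
    """Return (global_size, local_size) for a 1-D kernel launch.

    preferred_local is the size you tuned for; device_max_local is the
    limit reported by the device (an input here, not queried).
    """
    if n_items <= 0 or preferred_local <= 0 or device_max_local <= 0:
        raise ValueError("all sizes must be positive")
    local_size = min(preferred_local, device_max_local)
    # Round the global size up to the next multiple of the local size.
    global_size = ((n_items + local_size - 1) // local_size) * local_size
    return global_size, local_size
```

Because the global size is padded, the kernel has to ignore the extra work-items, for example by returning early when get_global_id(0) is not below the real item count.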

Related

Shipping reliable OpenCL applications - Tools/Techniques/Tips?

I want to ship OpenCL code that should work on all OpenCL 1.1 compatible GPUs. Rather than buying a bunch of GPUs and testing on them, are there any tools that can help ensure reliability?
If anyone has experience shipping OpenCL applications to a wide hardware base, I'd be interested in knowing about any other methods for testing reliability.
I have a bit of experience with this. Unfortunately, the answer is: it depends on what the kernel is doing.
My biggest gripe is with NVIDIA and OpenCL, since they don't seem to support vectors (float2, float4, etc.) and global offsets. Kind of obnoxious. Intel and ATI are both solid, but even then vector sizes can differ. The above doesn't really matter if you are doing image convolution.
It matters if you want to run AMD FFT on an NVIDIA card, are doing matrix math, etc. To address the vector issue, you can write multiple kernels that each have a different vector size and call the right one: MatrixMult_float4(...).
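A minimal sketch of "call the right one", assuming each kernel variant is named with a `_floatN` suffix as in `MatrixMult_float4`. The scalar name `MatrixMult_float` and the helper `kernel_name_for` are hypothetical, and the preferred width is assumed to come from a device query such as CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT.

```python
def kernel_name_for(base, preferred_width, available=(4, 2, 1)):
    """Pick the widest kernel variant that does not exceed preferred_width.

    available lists the vector widths you actually wrote kernels for.
    """
    for width in sorted(available, reverse=True):
        if width <= preferred_width:
            suffix = str(width) if width > 1 else ""
            return base + "_float" + suffix
    raise ValueError("no kernel variant fits width %r" % (preferred_width,))
```

With the default variants, a device reporting a preferred width of 8 still gets `MatrixMult_float4`, and a device reporting 1 gets the scalar kernel.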
You can check whether your code compiles by using the AMD KernelAnalyzer2, although this does need some component of the Catalyst drivers, so it only works for me on PCs with AMD GPUs. There is also the Intel Kernel Builder, which works for devices with Intel OpenCL SDK support. Nvidia's implementation has bugs in it, especially on newer GPUs in my experience, so there the best approach is to test one GPU from each generation.
To avoid extensions and validate CL language versions, one could try to test-compile the code using LLVM, or just get the grammar for validation, e.g. as BNF.
There's a promising open source project, which probably contains useful stuff: http://bazaar.launchpad.net/~pocl/pocl/master/files/head:/lib/CL/
However, the problems I encountered were:
Newline characters (CR, LF, CRLF) in OpenCL source files caused build failures on certain implementations. Specifying one of these as the only valid line ending would be just stupid. If one is editing source files on different platforms in conjunction with an SCM, it could get inconvenient. So I remove comments and clean up line breaks before compilation.
Performance: Feeding the GPU efficiently using multithreading; different hardware configurations have different bottlenecks. Here I needed a client-side pipeline with multiple dispatcher threads. Of course, the amount of work that remains for the CPU depends on the task and on the capabilities, number and resources of the computing devices. Things that needed serialized execution or dynamic loop counts were typical candidates for staying on the CPU.
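For the newline problem in the first item, a minimal sketch of that kind of cleanup, run on the kernel source string before it is handed to the compiler. The function name `clean_kernel_source` is hypothetical, and the sketch assumes the source contains no string literals with comment markers inside them.

```python
import re

# Matches a // line comment or a /* ... */ block comment, whichever starts first.
_COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def _blank(match):
    # Keep the newlines of a multi-line comment so that compiler
    # error messages still point at the right source line.
    newlines = match.group(0).count("\n")
    return "\n" * newlines if newlines else " "


def clean_kernel_source(src):
    """Normalize CR and CRLF to LF, then blank out all comments."""
    src = src.replace("\r\n", "\n").replace("\r", "\n")
    return _COMMENT.sub(_blank, src)
```

Doing this at load time means the files in the SCM can keep whatever line endings each platform's editor produces.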

Radeon HD 4850 and OpenCL: will cl_khr_fp64 work on this videocard?

This video card (Radeon HD 4850) conforms only to OpenCL 1.0, per the AMD compatibility table. I need some hardware to conduct intensive financial calculations with doubleN types (no floats at all!). According to this card table, this card is able to work with double types. Now I have the possibility to buy it at quite an attractive price.
I'd greatly appreciate it if an answerer has real experience working with this card for OpenCL with the fp64 extension. Of course, if there are problems with this card, please drop a couple of lines here.
Thank you and sorry for my English.
I haven't used this card with DP before, but if the spec says it is supported, then it's worth a try.
In my opinion, you should go with a newer model card though. There are a lot of cheap cards out that will outperform the 4850, and they will support some new features as well.
This card supports double precision, but the 4xxx series doesn't include local memory on the chip. As the standard mandates local memory support, it is emulated with global memory and is very slow. Many algorithms require local memory to obtain a good speed-up. So a newer card (5xxx and higher) is a lot better.
In addition, some combinations of older cards/older SDK versions only support double precision through the cl_amd_fp64 extension (not the official cl_khr_fp64 extension) because of some small things from the standard that are not supported. For the most part, this doesn't matter much except that you need to change the extension name in your code to make it work with doubles.
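A minimal sketch of handling the two extension names in one code base, assuming the device's space-separated extension string (as reported by CL_DEVICE_EXTENSIONS) has already been obtained. The helper name `fp64_pragma` is hypothetical; its result would be prepended to the kernel source before building.

```python
def fp64_pragma(device_extensions):
    """Return the pragma line enabling doubles, or None if unsupported.

    device_extensions is the space-separated extension string of the device.
    The official cl_khr_fp64 is preferred over the vendor cl_amd_fp64.
    """
    extensions = device_extensions.split()
    for name in ("cl_khr_fp64", "cl_amd_fp64"):
        if name in extensions:
            return "#pragma OPENCL EXTENSION %s : enable\n" % name
    return None
```

A result of None means the device reports neither extension, so the kernels should not be built with double types for it.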
As a general tip, I would try to avoid the 4xxx series if you intend to do serious GPGPU development. Keep in mind also that the newer 7xxx series is much more optimized for GPU computations than both the 5xxx and 6xxx series, closing much of the gap with NVIDIA cards. So, if you can, try to aim for a 7xxx with double precision support.

Is there any benefit in nVidia Tesla cards?

I'm planning to buy a serious GPU for running a parallel algorithm on (budget 2k-4k). Now I see everywhere supercomputers featuring nVidia Tesla GPU cards "made especially for GPGPU".
While this seems very nice at first sight, a closer reading gives me serious second thoughts: compared to e.g. a Radeon HD 7970, its performance (in terms of flops) is significantly lower, its price is significantly higher, and I can't seem to find any benchmark comparison between the Tesla and normal gaming GPUs.
I have found that the Tesla features ECC-memory. Is this the only difference? Or am I missing a deeper architectural difference between both? Perhaps relevant info: I will be using OpenCL, not Cuda.
There are two technical differences I know of between the brands when you are comparing similar cards.
1) Nvidia cards tend to have better double precision FLOPS than AMD - by a factor of 2 sometimes. AMD usually does better for single precision FLOPS.
2) ECC memory is available for both brands for the GDDR5 memory. The difference is that Nvidia uses ECC on the internal memory (registers and such) as well, where AMD does not.
In my opinion, choose the card based on your application. If you use more single than double precision, go AMD, otherwise Nvidia. If you need the ECC for high fault tolerance, maybe Nvidia is your best choice. Sometimes many cheaper cards do better than one or two top-of-the-line cards; think of PCI-e bandwidth. Read up on benchmarks, and try to determine which card is best suited to your needs.
I don't know if your problem is similar to mining bitcoins, but there is a LOT of info on parallel GPU setups here...
https://en.bitcoin.it/wiki/Mining_hardware_comparison

Are there any current non-Harvard architecture microcontrollers?

I have used and like the Atmel ATmega and ATtiny series microcontrollers, and think they are quite good. One thing I am not terribly fond of, though, is the fact that they (and the Microchip PIC family as well) are all Harvard machines, meaning I can't really put external memory to use or execute out of RAM, only out of flash.
While there are obvious advantages to this design, it makes it technically very difficult to do things like FORTH using an AVR or PIC. (I know there is at least one implementation, but it does not work like a normal FORTH and will wear out the flash rather rapidly)
FORTH was originally created for interactive machine-control type systems where lots of flexibility was needed, so things like the Z80 or 6809 were used as microcontrollers with the control program executing out of RAM or some other storage device.
Does anyone know of current devices of similar complexity (preferably available in DIP packages) to the AVR/PIC that are von Neumann machines?
In addition to Freescale processors (that starblue has already pointed out), the Texas Instruments MSP430 family uses a von Neumann architecture. However, only the smallest ones are available in a DIP package.
UPDATE to include PIC32:
In my original post, I had forgotten that PIC32 microcontrollers have always been able to execute out of RAM, as demonstrated by this code example; and now Microchip has come out with the new PIC32MZ line of microcontrollers, with up to 2 MB of Flash and 512K of RAM, which makes them feasible for fairly large RAM-based programs. Unfortunately none of these chips are available in DIP packages.
However Olimex, sort of the Bulgarian equivalent of SparkFun and Adafruit, has a PIC32-HMZ144 development board for 21.95 EUR, which is about $24. This is a smoking hot deal since the processor alone costs over $12 at Digi-Key. (There are other boards available from US suppliers from around $50 and up.)
The original PIC32MX line has twenty variants in 28-pin DIP packages, but they are limited to a maximum of 64K of RAM, which is still useful for some projects.
Farnell has a nice search function that lets you search for microcontrollers in DIP packages, though you'll have to figure out which families are non-Harvard by looking at the data sheets.
Take a look at the 68K ones and the HCS08.
Update: In the meantime some ARM Cortex-M controllers in DIP packages have become available, the LPC810M021FN8 and the LPC1114FN28 from NXP.
You might want to peruse the designs available at the OpenCores project. That is an open source project devoted to CPU core designs implemented in VHDL, Verilog, and similar FPGA design languages. There are complete and respectable implementations of classic 8-bit CPUs such as the 8080, 6502, and 8051. The 6502 I linked to claims to be cycle-accurate compared to the original chip. Others are functionally complete, but often have more modern buses and signals.
They won't (I think) be available in DIP packages, but you can always find breakout boards.
The designs are all open source, under a wide variety of licenses.
You may also have a look at the Zilog eZ80. Since they're binary-compatible with the old Z80, you should be able to find a FORTH implementation that runs on them, but you'd probably need to run it on top of good old CP/M :)
Also, these are the only ones that I found that have the memory bus accessible from the outside, i.e. allow code execution from external memory.
The ARM-based ones, even the Cortex-M3, claim to be Harvard, but you can load programs into data RAM and execute from that RAM, so it is really not Harvard. Other ARMs are normally not Harvard, and some have external memory interfaces you can use to expand the internal resources.
This is actually not a question, but more of a related query. Why would you go to von Neumann in a microcontroller if the previous generation was Harvard? Isn't it all win-win in terms of performance? Other than complexity (which, if the original PICs can handle it, should not be that great), what are the downsides of having a Harvard architecture?
The new Kinetis line of microcontrollers from Freescale puts an ARM Cortex-M4 inside a microcontroller package, and program code can be located anywhere in addressable space (RAM or FLASH, or even Flex Memory.)
The Kinetis Solution Advisor is a powerful selector guide that can help you find the micro you want. Memory from 32kB to 1MB, all the peripherals you could want, and pricing from under a dollar to around 10.

OpenCL vs. DirectCompute?

I'm looking for comparisons between OpenCL and DirectCompute, but I haven't found anything. OpenCL's advantages of being cross-platform and having a wider range of supported GPUs don't matter to me. I'm fine with coding on Windows against DX11 GPUs only. Assuming that, what are the pros and cons of each API?
I know this question was raised before, but I'm looking for more details.
I'm not interested in CUDA, since I don't want to restrict myself to only Nvidia hardware.
Probably the biggest difference for a coder is that DirectCompute is programmed by a language which is similar to HLSL, and OpenCL is programmed via a C-like language.
Another difference to consider is that, generally, for commodity level GPUs, the DirectX support is better (faster and less buggy) than OpenGL support on Windows. This may translate to more stable support for DirectCompute, but really, this is just speculation.
Well the major advantage of OpenCL is that it is not just limited to graphics cards. You can make use of your multicore CPU, Graphics Card and potentially any number of other hardware acceleration devices (DSPs etc) all from the same program.
I'm not sure if DirectCompute allows that freedom.
The OpenCL cross-platform-ness is not just a detail, as the host code (the one calling the OpenCL API and submitting kernels) can itself be cross-platform.
Write once, run on any GPGPU, anywhere.
Otherwise the OpenCL tooling is really getting better, with an ATI Stream plugin for Visual Studio, the NVidia and ATI SDKs that contain tons of samples, etc.
Another option now is C++ AMP, which gives you modern C++ syntax without the need for a separate compiler while still preserving hardware portability. Please follow links from here for more info and feel free to post questions as you have them: http://blogs.msdn.com/b/nativeconcurrency/archive/2011/09/13/c-amp-in-a-nutshell.aspx
I use OpenCL because I can easily port my app to Linux, but with DirectCompute this is not possible.
I also think that the performance of the OpenCL implementations will increase with time (so that it reaches the same level as CUDA on NVidia cards) and that the (driver) bugs will (hopefully ;) ) be eliminated with time.

Resources