Any new ideas on using OpenCL with multiple GPUs?

My question is:
Has there been any new advancement (or perhaps a tool or library) for using OpenCL with multiple GPUs? I understand that someone who wants to write OpenCL code targeting multiple GPUs can do so, but I have been told that the way you arrange the communication between them is a little "primitive". What I want to know is whether there is something out there that puts a level of abstraction between the programmer and all that arrangement of communication between the GPUs.
I am working on stochastic simulations with pretty big lattices, and I would like to be able to break them up across different GPUs, each of which does its share of the computing and communicates when necessary. Writing this efficiently is difficult enough, so if I can avoid the low-level work of doing it the standard way through OpenCL, it would be a big help.
Thanks!

On the academic side, there is this paper from Seoul National University in South Korea:
Achieving a single compute device image in OpenCL for multiple GPUs, http://dl.acm.org/citation.cfm?id=1941591
The authors propose an automatic mechanism for dividing a kernel across multiple GPUs. Unfortunately, their framework has not been released yet.
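For reference, the "primitive" standard approach looks roughly like the sketch below: one context, one command queue per GPU, and an explicit split of the global work via the work offset. The function and the fixed queue array are illustrative only, error checking is omitted, and any halo exchange between slices has to be coded by hand.

    #include <CL/cl.h>
    #include <stddef.h>

    /* Run a 1-D kernel split across several GPUs in one context.
       The kernel's buffers are assumed to be set up already and the
       kernel indexes data by get_global_id(0); error checks omitted. */
    void run_on_all_gpus(cl_context ctx, cl_kernel kernel,
                         cl_device_id *devs, cl_uint ndevs, size_t global)
    {
        cl_command_queue queues[16];     /* assumes ndevs <= 16 */
        size_t chunk = global / ndevs;

        for (cl_uint i = 0; i < ndevs; ++i)
            queues[i] = clCreateCommandQueue(ctx, devs[i], 0, NULL);

        /* Give each device a contiguous slice via the global work offset. */
        for (cl_uint i = 0; i < ndevs; ++i) {
            size_t offset = i * chunk;
            size_t size   = (i == ndevs - 1) ? global - offset : chunk;
            clEnqueueNDRangeKernel(queues[i], kernel, 1,
                                   &offset, &size, NULL, 0, NULL, NULL);
        }

        /* Any boundary/halo exchange between the slices is up to you,
           e.g. clEnqueueCopyBuffer between iterations. */
        for (cl_uint i = 0; i < ndevs; ++i) {
            clFinish(queues[i]);
            clReleaseCommandQueue(queues[i]);
        }
    }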

Related

Combine OpenMP and MPI

I am working on a particle physics program and want to parallelize it on a cluster. I attended some HPC lectures, so I managed to create parallel tasks within a node using OpenMP. This still does not scale well enough.
Hence I want to use several nodes that exchange messages over the network, while within each node I parallelize with OpenMP.
So far I only know about MPI, but I have heard about something called OpenMPI as a combined approach. I would be grateful for good sources about such an approach that are understandable for someone with no background in HPC. I would also love to hear other suggestions regarding my idea. A minimal sketch of the hybrid pattern is shown below.
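For what it's worth, the usual hybrid pattern is one MPI rank per node with OpenMP threads inside each node. Here is a minimal sketch, assuming an MPI implementation such as MPICH or Open MPI and a compiler with OpenMP support (compile with something like mpicc -fopenmp):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* FUNNELED: only the main thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* MPI between nodes, OpenMP threads within the node. */
        #pragma omp parallel
        {
            printf("rank %d of %d, thread %d of %d\n",
                   rank, nranks, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

Launched with, say, OMP_NUM_THREADS=8 mpirun -np 4 ./hello, this prints one line per (rank, thread) pair, so you can verify the placement before porting the real computation.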

Local data store vs. Texture cache in Cayman Architecture for scientific computation

I am trying to implement GEMM using AMD-APP-SDK 2.4 on an ATI HD 6990 card (Cayman architecture).
One of the optimizing techniques is the use of blocking/tiling.
In its implementation, is it faster if we store the sub-matrices in shared local memory, or is it faster when we use the texture cache? If possible, please give the reason as well.
Please also suggest which is easier to implement.
Thanks.
P.S. I want it for single precision only, if it matters!
Note: The size of the sub-matrix is not an issue; in any case, I feel that the larger it is, the better. The only factor to take into consideration is that if the unit of memory is 128 bits (4 single-precision values), then the block size should be a multiple of 4.
The Cypress chips were used in the 5800 series Radeons. The 6900 series uses the Cayman core, which has several important differences, most notably that it is a VLIW4 architecture instead of the VLIW5 configuration used in earlier cores.
As always, the only definitive way to know which method is faster is to benchmark it. In particular, since you give no information about the size of the sub-matrices, it is hard to say where they will best fit.
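To make the local-memory option concrete, here is a sketch of a tiled GEMM kernel in OpenCL C. TILE is an illustrative tile size, N is assumed to be a multiple of TILE, and the matrices are row-major; the texture-cache variant would instead read A and B through image2d_t objects.

    #define TILE 16

    /* C = A * B for N x N row-major single-precision matrices.
       Each TILE x TILE work-group computes one block of C from
       sub-matrices staged in local memory (LDS). */
    __kernel void gemm_tiled(const int N,
                             __global const float *A,
                             __global const float *B,
                             __global float *C)
    {
        __local float As[TILE][TILE];
        __local float Bs[TILE][TILE];

        int row = get_global_id(1);
        int col = get_global_id(0);
        int lr  = get_local_id(1);
        int lc  = get_local_id(0);

        float acc = 0.0f;
        for (int t = 0; t < N / TILE; ++t) {
            /* Stage one tile of A and one tile of B. */
            As[lr][lc] = A[row * N + t * TILE + lc];
            Bs[lr][lc] = B[(t * TILE + lr) * N + col];
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int k = 0; k < TILE; ++k)
                acc += As[lr][k] * Bs[k][lc];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[row * N + col] = acc;
    }

On a VLIW part like Cayman, a float4 version of the same scheme is usually a further win, but again, benchmark both variants on your card.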

Suggest a benchmark program to compare MPICH and OpenMPI

I am new to HPC, and the task at hand is to do a performance analysis and comparison between MPICH and OpenMPI on a cluster of IBM servers equipped with dual-core AMD Opteron processors, running ClusterVisionOS.
Which benchmark program should I pick to compare the MPICH and OpenMPI implementations?
I am not sure whether the High-Performance Linpack benchmark can help, as I am not attempting to measure the performance of the cluster itself. Kindly suggest.
Thank you
The classic examples are:
NAS Parallel Benchmarks - they are representative numerical kernels that you'd see in a lot of scientific computing applications. These admittedly have a lot of computation but also have the communications patterns you'd expect to see in real applications, so they are fairly relevant.
Or, if you really just want MPI "microbenchmarks", the OSU benchmarks or the Intel MPI Benchmarks are well-known choices. These run zillions of tests -- ping-pong, broadcast, etc. -- of various sizes and configurations, so you end up with a very large amount of data. The good news is that if you run these with the two MPIs, you'll know exactly where each one is stronger or weaker.
MPICH and OpenMPI are both actively maintained and very solid, and have a long-standing friendly rivalry, so I'd be very surprised if you found one to be consistently faster than the other. We have had both on our system, and there were differences with the default settings on real applications, but they were usually fairly small, some favouring one and some the other. But to really find out which is better for a particular application, you need to do more than run with the default parameters; both implementations have a large number of tunable variables governing how they handle collectives (OpenMPI 1.5.x has very interesting-looking hierarchical collectives I haven't played with yet), and so on.
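To get a feel for what those microbenchmarks measure, here is a bare-bones ping-pong latency test in plain MPI; this is just a sketch of the kind of measurement osu_latency makes, not a replacement for it. Build it with the mpicc from each implementation and run with mpirun -np 2:

    #include <mpi.h>
    #include <stdio.h>

    #define REPS 10000

    int main(int argc, char **argv)
    {
        int rank;
        char byte = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; ++i) {
            if (rank == 0) {            /* send, then wait for the echo */
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {     /* echo everything back */
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)  /* half the round trip = one-way latency */
            printf("latency: %.2f us\n", (t1 - t0) / REPS / 2 * 1e6);

        MPI_Finalize();
        return 0;
    }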
What I would do is search the ACM Digital Library. You will find objective material there.
Some tips for the search:
Sort by relevance.
Read the Abstract (at the bottom) to see if it matches what you are looking for.
If a paper matches your search, buy it; it is usually cheap. Another option is to subscribe to the ACM if you plan to search often, as you will get a better price.
Hope this helps someone.

Developing with OpenCL on ATI and Nvidia at the same time

Our workgroup is slowly trying out a little OpenCL in a side project. So far 'everybody' is working on an NVIDIA Quadro FX 580. Now we are planning to buy new computers for new colleagues, and instead of the FX 580 we could buy the ATI FirePro V4800, which costs only 15 EUR more and gives us 1 GB instead of 512 MB of RAM, which will be beneficial for our data-intensive tasks.
So, how much trouble is it to develop OpenCL code on Nvidia and ATI at the same time?
I read the following SO question, Running OpenCL on hardware from mixed vendors, which was very pessimistic about developing on/for different vendors. On the other hand, that question is already a year old.
What do you recommend?
I have previously worked extensively with CUDA.
I have been planning to start developing apps using OpenCL. As you mentioned, one of the best features of OpenCL is that it runs on hardware from many vendors (Intel, AMD, and Nvidia).
One project I came across that uses OpenCL extensively for large-scale development is http://sourceforge.net/projects/hypgad/. It might be a good idea to look at the source code from this group and understand how they have developed their application for so many kinds of hardware, including the Sony Cell processor.
Another approach would be to use PyOpenCL, which provides a higher level of abstraction than OpenCL and can significantly reduce the coding effort.
Do you need the code to run unchanged on both bits of hardware? If so, you may have to develop for a limited subset of common functions.
If you can run slightly different code on each, you will probably get better performance - in CUDA/OpenCL you generally have to tune the algorithms for the amount of RAM and the number of GPU engines anyway, so it shouldn't be much more work to also tweak for Nvidia/AMD.
The biggest problem is work-group sizes. Some ATI cards I have used crash with work-group sizes above 64, but then it may be the Apple OS X 10.6 drivers I am using.
Developing for both ATI and NVIDIA is actually not too difficult, so long as you avoid using any part of either vendor's SDK. Stick to OpenCL as it is defined in the OpenCL spec (www.khronos.org/opencl) and your code will stay syntax-portable. Due to differences in the underlying architectures, performance portability may be an issue; local and global work sizes really have to be determined independently for each card to maximize performance. Another thing to pay attention to is the types being used. Vector types (float2, float4) are especially useful on ATI cards, as each processing element actually contains 4 execution units (one for each RGB color channel, plus alpha).
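To make that concrete, a vendor-neutral host program can enumerate whatever platforms and devices are present at run time and query limits like the maximum work-group size instead of hard-coding them. A minimal sketch, with error handling mostly omitted:

    #include <CL/cl.h>
    #include <stdio.h>

    /* List every OpenCL platform (NVIDIA, AMD, Intel, ...) and its GPUs,
       querying the work-group limit rather than assuming one. */
    int main(void)
    {
        cl_platform_id plats[8];
        cl_uint nplats = 0;
        clGetPlatformIDs(8, plats, &nplats);

        for (cl_uint p = 0; p < nplats; ++p) {
            char pname[256];
            clGetPlatformInfo(plats[p], CL_PLATFORM_NAME,
                              sizeof pname, pname, NULL);

            cl_device_id devs[8];
            cl_uint ndevs = 0;
            if (clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_GPU,
                               8, devs, &ndevs) != CL_SUCCESS)
                continue;   /* no GPU on this platform */

            for (cl_uint d = 0; d < ndevs; ++d) {
                char dname[256];
                size_t wg = 0;
                clGetDeviceInfo(devs[d], CL_DEVICE_NAME,
                                sizeof dname, dname, NULL);
                clGetDeviceInfo(devs[d], CL_DEVICE_MAX_WORK_GROUP_SIZE,
                                sizeof wg, &wg, NULL);
                printf("%s / %s: max work-group size %zu\n",
                       pname, dname, wg);
            }
        }
        return 0;
    }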

Are there any current non-Harvard architecture microcontrollers?

I have used and like the Atmel ATMEGA and ATTINY series microcontrollers, and think them quite good. One thing I am not terribly fond of though is the fact that they (and Microchip PIC uC family also) are all Harvard machines, meaning I can't really put external memory to use or execute out of RAM, only the flash.
While there are obvious advantages to this design, it makes it technically very difficult to do things like FORTH using an AVR or PIC. (I know there is at least one implementation, but it does not work like a normal FORTH and will wear out the flash rather rapidly)
FORTH was originally created for interactive machine-control systems where lots of flexibility was needed, so things like the Z80 or 6809 were used as microcontrollers, with the control program executing out of RAM or some other storage device.
Does anyone know of current devices of similar complexity (preferably available in DIP packages) to the AVR/PIC that are von Neumann machines?
In addition to Freescale processors (which starblue has already pointed out), the Texas Instruments MSP430 family uses a von Neumann architecture. However, only the smallest ones are available in a DIP package.
UPDATE to include PIC32:
In my original post, I had forgotten that PIC32 microcontrollers have always been able to execute out of RAM, as demonstrated by this code example; and now Microchip has come out with the new PIC32MZ line of microcontrollers, with up to 2 MB of flash and 512K of RAM, which makes them feasible for fairly large RAM-based programs. Unfortunately, none of these chips are available in DIP packages.
However Olimex, sort of the Bulgarian equivalent of SparkFun and Adafruit, has a PIC32-HMZ144 development board for 21.95 EUR, which is about $24. This is a smoking hot deal, since the processor alone costs over $12 at Digi-Key. (There are other boards available from US suppliers from around $50 and up.)
The original PIC32MX line has twenty variants in 28-pin DIP packages, but they are limited to a maximum of 64K of RAM, still useful for some projects.
Farnell has a nice search function that lets you search for microcontrollers in DIP packages, though you'll have to figure out which families are non-Harvard by looking at the data sheets.
Take a look at the 68K ones and the HCS08.
Update: In the meantime some ARM Cortex-M controllers in DIP packages have become available, the LPC810M021FN8 and the LPC1114FN28 from NXP.
You might want to peruse the designs available at the OpenCores project. That is an open source project devoted to CPU core designs implemented in VHDL, Verilog, and similar FPGA design languages. There are complete and respectable implementations of classic 8-bit CPUs such as the 8080, 6502, and 8051. The 6502 I linked to claims to be cycle-accurate compared to the original chip. Others are functionally complete, but often have more modern buses and signals.
They won't (I think) be available in DIP packages, but you can always find breakout boards.
The designs are all open source, under a wide variety of licenses.
You may also have a look at the Zilog eZ80. Since they're binary-compatible with the old Z80, you should be able to find a FORTH implementation that runs on them, but you'd probably need to run it on top of good old CP/M :)
Also, these are the only ones that I found that have the memory bus accessible from the outside, i.e. allow code execution from external memory.
The ARM-based ones: even the Cortex-M3 claims to be Harvard, but you can load programs into data RAM and execute from that RAM, so it is really not Harvard. Other ARMs are normally not Harvard; some have external memory interfaces you can use to expand the internal resources.
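To illustrate executing from RAM on such a part, here is a crude, hypothetical sketch for a Cortex-M-class device (GCC assumed): it copies a tiny routine into a RAM buffer and branches to it with the Thumb bit set. A real version would use a dedicated linker section, know the routine's exact size, and add any required cache maintenance and barriers.

    #include <stdint.h>
    #include <string.h>

    /* Tiny routine whose machine code we will run from RAM. */
    __attribute__((noinline)) static int add_one(int x) { return x + 1; }

    /* Buffer in .bss, i.e. data RAM on a typical MCU linker script. */
    static uint8_t ram_code[64] __attribute__((aligned(4)));

    int call_from_ram(int x)
    {
        /* Strip the Thumb bit to get the routine's real address, then
           copy its code into RAM (crudely assumes it fits in 64 bytes). */
        uintptr_t src = (uintptr_t)&add_one & ~(uintptr_t)1;
        memcpy(ram_code, (const void *)src, sizeof ram_code);

        /* Set bit 0 so the call stays in Thumb state, then jump. */
        int (*fn)(int) = (int (*)(int))((uintptr_t)ram_code | 1);
        return fn(x);   /* executes out of data RAM, not flash */
    }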
This is actually not an answer, but more of a related query. Why would you go to von Neumann in a microcontroller if the previous generation was Harvard? Isn't it all win-win in terms of performance? Other than complexity (which, if the original PICs can handle it, should not be that great), what are the downsides of the Harvard architecture?
The new Kinetis line of microcontrollers from Freescale puts an ARM Cortex-M4 inside a microcontroller package, and program code can be located anywhere in addressable space (RAM or FLASH, or even Flex Memory.)
The Kinetis Solution Advisor is a powerful selector guide that can help you find the micro you want: memory from 32 kB to 1 MB, all the peripherals you could want, and pricing from under a dollar to around $10.