Programming Intel IGP (e.g. Iris Pro 5200) hardware without OpenCL

The peak GFLOPS of the cores of a desktop i7-4770K @ 4GHz is 4 GHz * 8 (AVX single-precision lanes) * 4 (two FMA units, 2 flops per FMA) * 4 cores = 512 GFLOPS. But the latest Intel IGP (Iris Pro 5100/5200) has a peak of over 800 GFLOPS. Some algorithms will therefore run even faster on the IGP, and combining the cores and the IGP would be better still. Additionally, the IGP keeps eating up more silicon; the Iris Pro 5100 takes up over 30% of the silicon now. It seems clear which direction Intel desktop processors are headed.
As far as I have seen, however, the Intel IGP is mostly ignored by programmers, with the exception of OpenCL/OpenGL. I'm curious to know how one can program the Intel HD Graphics hardware for compute (e.g. SGEMM) without OpenCL.
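A small sketch spelling out the peak-GFLOPS arithmetic above; the per-core factors (8 single-precision AVX lanes, two FMA units, 2 flops per FMA) are assumptions for a Haswell-class core:

    #include <stdio.h>

    /* Rough peak single-precision GFLOPS estimate for a Haswell-class CPU. */
    int main(void)
    {
        double ghz       = 4.0;  /* core clock in GHz */
        int    lanes     = 8;    /* 32-bit floats per 256-bit AVX register */
        int    fma_units = 2;    /* assumed FMA pipes per core */
        int    flops_fma = 2;    /* one FMA = multiply + add */
        int    cores     = 4;

        double gflops = ghz * lanes * fma_units * flops_fma * cores;
        printf("Peak: %.0f GFLOPS\n", gflops);  /* 4 * 8 * 2 * 2 * 4 = 512 */
        return 0;
    }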
Added comment:
There is no Intel support for HD Graphics and OpenCL on Linux. I found Beignet, which is an open-source attempt to add support on Linux, at least for Ivy Bridge HD Graphics. I have not tried it. The people developing Beignet presumably know how to program the HD Graphics hardware without OpenCL, then.

Keep in mind that there is a performance hit to copy the data to the video card and back, so this must be taken into account. AMD is close to releasing APU chips that have unified memory for the CPU and GPU on the same die, which will go a long way towards alleviating this problem.
The way the GPU used to be utilized before CUDA and OpenCL was to represent the memory to be operated on as a texture via DirectX or OpenGL. Thank goodness we don't have to do that anymore!
AMD is really pushing the APU / OpenCL model, so more programs should take advantage of the GPU via OpenCL - if the performance trade-off is there. Currently, GPU computing is a bit of a niche market, relegated to high-performance computing or number crunching that just isn't needed for web browsing and word processing.

It doesn't make sense anymore for vendors to let you program using a low-level ISA.
It's very hard and most programmers won't use it.
It keeps them from adjusting the ISA in future revisions.
So programmers use a language (like C99 in OpenCL) and the runtime does ISA-specific optimizations right on the user's machine.
An example of what this enables: AMD switched from VLIW vector machines to scalar machines and existing kernels still ran (most ran faster). You couldn't do this if you wrote ISA directly.
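As a concrete illustration of that model, here is a minimal host-side sketch of handing a C99 kernel to the vendor runtime, which compiles it for whatever ISA the device actually implements (error handling omitted; the kernel and function names are only examples):

    #include <CL/cl.h>

    /* Kernel shipped as plain C99 source; the vendor runtime compiles it
     * for the device's current ISA at run time (VLIW, scalar, whatever). */
    static const char *src =
        "__kernel void scale(__global float *x, float a) {"
        "    size_t i = get_global_id(0);"
        "    x[i] = a * x[i];"
        "}";

    void build_for_current_isa(cl_context ctx, cl_device_id dev)
    {
        cl_int err;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dev, "", NULL, NULL);  /* ISA-specific codegen happens here */
        cl_kernel k = clCreateKernel(prog, "scale", &err);
        /* ... set arguments and enqueue with clEnqueueNDRangeKernel ... */
        clReleaseKernel(k);
        clReleaseProgram(prog);
    }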

Programming a coprocessor like Iris without OpenCL is rather like driving a car without a steering wheel.
OpenCL is designed to expose the requisite parallelism that Iris needs to achieve its theoretical performance. You can't just spawn hundreds of threads or processes on it and expect performance. Having blocks of threads doing the same thing, at the same time, on similar memory addresses, is the whole crux of the matter.
Maybe you can think of a better paradigm than OpenCL for achieving that goal; but until you do, I suggest you try learning some OpenCL. If you are into Python, PyOpenCL is a great place to start.
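To make the point about blocks of threads doing the same thing on neighbouring addresses concrete, a minimal OpenCL C kernel looks like this (the names are only illustrative):

    /* Every work-item runs the same code on a different element; work-items
     * in a work-group touch adjacent addresses, so memory accesses coalesce. */
    __kernel void saxpy(__global const float *x,
                        __global float *y,
                        const float a)
    {
        size_t i = get_global_id(0);   /* this work-item's element index */
        y[i] = a * x[i] + y[i];
    }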

Related

Use OpenCL on AMD APU but use discrete GPU for the X server

Is it possible to enable OpenCL on an A10-7800 without using it for the X server? I have a Linux box that I use for GPGPU programming. A discrete GeForce 740 card is used both for the X server and for running the OpenCL and CUDA programs I develop. I would also like the option of running OpenCL code on the APU's integrated GPU cores.
Everything I've read so far implies that if I want to use the APU for OpenCL, I have to install Catalyst and, AFAIK, that means using it for the X server. Is this true? Would there be an advantage to using the APU for my X server and using the GeForce solely for GPGPU code?
I had a similar goal, so I built a system with an AMD APU (4 regular cores + 6 GPU cores) and an Nvidia discrete graphics board. Sorry to say it wasn't easy to make it work. I asked a question on the Ask Ubuntu forum, didn't get any answers, experimented a lot with the hardware and software setup, and finally posted my own answer to my question.
I'll describe my setup again here - who knows what might happen to my self-answered question on Ask Ubuntu?
First, I had to enable the integrated graphics hardware via a BIOS flag. This flag is called IGFX Multi-Monitor on my motherboard (ASUS A88X-PRO).
The second step was to find the right mix of a low-level graphics driver and a high-level OpenCL implementation. The low-level driver for AMD processors is called AMD Catalyst and ships as fglrx. I didn't install this driver from the Ubuntu software center - instead I used version 15.302, downloaded directly from the AMD site. I had to install a significant number of prerequisites for this driver. The most important finding was that I had to skip running the aticonfig command after the fglrx installation - this command actually configures the X server to use this driver for graphics output, and I didn't want that.
Then I installed the AMD APP SDK v3.0 (release 130.136; earlier releases didn't work with my fglrx) - it's the OpenCL implementation from AMD. The clinfo command now reports both the CPU and GPU devices with the correct number of cores.
So I have a hybrid AMD processor supported by OpenCL, with all graphics output handled by a discrete graphics card with an Nvidia processor.
Good luck!
I maintain a Linux server (openSUSE, but the distribution shouldn't matter) containing both an NVIDIA and a (discrete) AMD GPU. It's headless, so technically I do not know whether the X server will create additional problems, but I don't think so. You can always configure xorg.conf to use exactly the driver you want. Or, for that matter, install Catalyst but delete the X server driver file itself, which is not the same file you need for OpenCL.
There is one problem with a mixed-vendor system that I noticed, however: AMD's OpenCL driver (ICD) will go spelunking for a libGL.so library, I guess in order to do OpenCL/OpenGL interop. If it finds any of the NVIDIA-supplied libGL.so's, it gets confused and hangs - at least on my machine. I "solved" this by deleting all libGL.so's (I do not need them on a headless compute server), but that might not be an acceptable solution for you. Maybe you can arrange things such that the AMD-supplied libGL.so's take precedence, possibly by installing the AMD driver last.

Why doesn't Intel design its SIMD ISAs in a more compatible or universal way?

Intel has several SIMD ISAs, such as SSE, AVX, AVX2, AVX-512 and IMCI on Xeon Phi. These ISAs are supported on different processors. For example, AVX-512 BW, AVX-512 DQ and AVX-512 VL are only supported on Skylake, not on Xeon Phi. AVX-512F, AVX-512 CDI, AVX-512 ERI and AVX-512 PFI are supported on both Skylake and Xeon Phi.
Why doesn't Intel design a more universal SIMD ISA that can run on all of its advanced processors?
Also, Intel removes some intrinsics and adds new ones as it develops these ISAs. A lot of intrinsics come in many flavours; for example, some work on packed 8-bit values while others work on packed 64-bit values. Some flavours are not widely supported: Xeon Phi, for example, is not going to be able to process packed 8-bit values, while Skylake will.
Why does Intel alter its SIMD intrinsics in such an inconsistent way?
If the SIMD ISAs were more compatible with each other, existing AVX code could be ported to AVX-512 with much less effort.
I see the reason as three-fold.
(1) When they originally designed MMX they had very little area to work with, so they made it as simple as possible. They also did it in a way that was fully compatible with the existing x86 ISA (precise interrupts + some state saving on context switches). They hadn't anticipated that they would continually enlarge the SIMD register widths and add so many instructions. Every generation, when they added wider SIMD registers and more sophisticated instructions, they had to maintain the old ISA for compatibility.
(2) The weird thing you're seeing with AVX-512 comes from the fact that they are trying to unify two disparate product lines. Skylake comes from Intel's PC/server line, so its path can be seen as MMX -> SSE/2/3/4 -> AVX -> AVX2 -> AVX-512. The Xeon Phi was based on an x86-compatible graphics card called Larrabee that used the LRBni instruction set. This is more or less the same as AVX-512, but with fewer instructions and not officially compatible with MMX/SSE/AVX/etc.
(3) They have different products for different demographics. For example, (as far as I know) the AVX-512 CD instructions won't be available in the regular Skylake processors for PCs, just in the Skylake Xeon processors used for servers, in addition to the Xeon Phi used for HPC. I can understand this to an extent, since the CD extensions are targeted at things like parallel histogram generation; that is more likely to be a critical hotspot in servers/HPC than in general-purpose PCs.
I do agree it's a bit of a mess. Intel are beginning to see the light and are planning better for additional expansions; AVX-512 is supposedly ready to scale to 1024 bits in a future generation. Unfortunately it's still not really good enough, and Agner Fog discusses this on the Intel forums.
Personally, I would have liked to see a model that can be upgraded without the user having to recompile their code each time. For example, instead of defining the AVX register width as 512 bits in the ISA, this should be a parameter stored in the microarchitecture and retrievable by the programmer at runtime. The user asks "what is the maximum SIMD width available on this machine?", the architecture returns XYZ, and the user has generic control flow to cope with whatever that XYZ is. This would be much cleaner and more scalable than the current technique, which uses several versions of the same function for every possible SIMD version. :-/
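The closest you can get to that today is querying the CPU at runtime and dispatching to a suitably compiled version of the function. A minimal sketch, assuming separate translation units built with different -m flags (the sum_* helpers are hypothetical) and using the GCC/Clang builtin __builtin_cpu_supports:

    #include <stddef.h>

    /* Hypothetical helpers, each compiled in its own translation unit
     * with the matching target flags (-mavx512f, -mavx2, -msse2). */
    void sum_avx512(const float *x, size_t n, float *out);
    void sum_avx2  (const float *x, size_t n, float *out);
    void sum_sse2  (const float *x, size_t n, float *out);

    /* Pick the widest implementation the running CPU supports;
     * __builtin_cpu_supports queries CPUID under the hood. */
    void sum(const float *x, size_t n, float *out)
    {
        if (__builtin_cpu_supports("avx512f"))
            sum_avx512(x, n, out);
        else if (__builtin_cpu_supports("avx2"))
            sum_avx2(x, n, out);
        else
            sum_sse2(x, n, out);
    }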
There is SIMD ISA convergence between Xeon and Xeon Phi, and ultimately they may become identical. But I doubt you will ever get the same SIMD ISA across the whole Intel CPU line - bear in mind that it stretches from a tiny Quark SoC to Xeon Phi. It will be a long time, possibly forever, before AVX-1024 migrates from Xeon Phi to Quark or a low-end Atom CPU.
To get better portability between different CPU families, including future ones, I advise you to use higher-level concepts than bare SIMD instructions or intrinsics: use OpenCL, OpenMP, Cilk Plus, C++ AMP and an autovectorizing compiler. Quite often they will do a good job of generating platform-specific SIMD instructions for you.
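For example, a loop written at this level stays the same in source form while the compiler emits SSE, AVX2 or AVX-512 code depending on the target; a minimal sketch using the standard OpenMP simd pragma:

    #include <stddef.h>

    /* The compiler chooses the SIMD width; the source never mentions it. */
    void axpy(float a, const float *x, float *y, size_t n)
    {
        #pragma omp simd
        for (size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }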

How to use 2 OpenCL runtimes

I want to use 2 OpenCL runtimes in one system together (in my case AMD and Nvidia, but the question is pretty generic).
I know that I can compile my program with any SDK. But when running the program, I need to provide libOpenCL.so. How can I provide the libs of both runtimes so that I see 3 devices (AMD CPU, AMD GPU, Nvidia GPU) in my OpenCL program?
I know that it must be possible somehow, but I haven't found a description of how to do it on Linux yet.
Thanks a lot,
Tomas
You're not thinking of it right. SDKs are not provided by the application and are not needed for running a compiled program. OpenCL runtimes are provided by the client system, and that's what gives your program platforms and devices to use in clGetPlatformIDs and clGetDeviceIDs.
If the user does not have an Nvidia graphics card, you are simply not going to be able to use an Nvidia platform and device on his system, because he doesn't have the Nvidia OpenCL runtime or hardware.
All the different OpenCL SDKs give you are vendor-specific extensions, which are then understood by the vendor's runtime.
The Khronos OpenCL working group defined an ICD (installable client driver) layer that allows multiple vendor drivers to be installed on the system. The application accesses the vendor drivers through the ICD layer. For more details see cl_khr_icd.txt.
The Smith and Thomas answers are correct; this just expands on that information: when you enumerate the OpenCL platforms, you'll get one for each installed driver. Within each platform you enumerate the devices. The AMD and Intel drivers also expose CPU devices. So on a fully populated machine you might see an AMD platform (with CPU and GPU devices), an NVIDIA platform (with a GPU device), and an Intel platform (with CPU and GPU devices). Your code creates a context on whichever devices you want to use, and one or more command queues to feed them work. You can keep them all busy, but you can only share data buffers between devices from the same platform; to share data across platforms, it must go through CPU memory in between.
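A minimal host-side sketch of that enumeration, going through the ICD loader to list every installed platform and its devices (error handling omitted):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint nplat = 0;
        clGetPlatformIDs(8, platforms, &nplat);   /* one platform per installed ICD */

        for (cl_uint p = 0; p < nplat; ++p) {
            char pname[256];
            clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof pname, pname, NULL);

            cl_device_id devices[16];
            cl_uint ndev = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &ndev);
            printf("Platform %u: %s (%u devices)\n", p, pname, ndev);

            for (cl_uint d = 0; d < ndev; ++d) {
                char dname[256];
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof dname, dname, NULL);
                printf("  Device %u: %s\n", d, dname);
            }
        }
        return 0;
    }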
Regarding running on multiple OpenCL devices at the same time: if you want to run on multiple devices, create a separate context for each device/vendor and run each one in a separate thread. For example, I have a GTX 590, which shows up as two GTX 590 devices, and an Intel i7 processor. I create three contexts - two for the 590 devices and one for the CPU - and run each context/device in three threads using SDL_CreateThread (pthreads works well too). You have to weight the number of jobs for each device in proportion to its "speed" if you want good results, e.g. 45% for each GTX 590 and 10% for the CPU. The best weights to use depend on the application.
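A rough sketch of just the weighting step described above (each chunk would then be submitted to its own context/queue from its own thread; the 45/45/10 split is the example from the answer):

    #include <stddef.h>

    /* Split n jobs across devices according to per-device weights,
     * e.g. {0.45, 0.45, 0.10} for two GTX 590 halves and the CPU. */
    void split_work(size_t n, const double *weights, size_t *counts, int ndev)
    {
        size_t assigned = 0;
        for (int d = 0; d < ndev - 1; ++d) {
            counts[d] = (size_t)(n * weights[d]);
            assigned += counts[d];
        }
        counts[ndev - 1] = n - assigned;   /* last device takes the remainder */
    }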

AMD APP OpenCL SDK on Intel

I have seen that AMD APP SDK samples work on a machine that has only an Intel CPU.
How can this happen? How does the compiler target a different machine architecture?
Do I not need Intel's set of compilers for running the code on the Intel CPU?
I thought that to run an OpenCL application on specific hardware, I had to (re)compile it using the device vendor's compiler.
Where is my understanding wrong?
Firstly, OpenCL is built to work on CPUs and GPUs. You can compile and run the same source code on either type of device. However, it's very likely that CPU code will be sub-optimal for a GPU and vice-versa.
AMD hardware makes up 7%-14% of total x86/x64 CPUs, so AMD must develop compilers for both AMD and Intel chips to stay relevant, and AMD has a history of developing compilers for both sets of chips. Conversely, Intel has developed compilers that either don't work on AMD chips or don't work that well. That's no surprise.
With OpenCL, the AMD APP SDK is the most flexible: it works well on AMD and Intel CPUs and on AMD GPUs. Intel's OpenCL SDK doesn't even install on AMD x86 hardware.
If you compile an OpenCL program to a binary, you can save and reuse it as long as it matches the OpenCL platform and device that created it. So if you compile for one device and use the binary on another, you are very likely to get an error.
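A minimal sketch of that save-and-reuse path for a single device, using the standard clGetProgramInfo and clCreateProgramWithBinary calls; the cached binary is only valid for the platform/device/driver that produced it:

    #include <stdlib.h>
    #include <CL/cl.h>

    /* Extract the device binary from an already-built program so it can be cached. */
    unsigned char *get_binary(cl_program prog, size_t *size)
    {
        clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof *size, size, NULL);
        unsigned char *bin = malloc(*size);
        clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof bin, &bin, NULL);
        return bin;
    }

    /* Recreate the program from the cached binary; if the platform/device/driver
     * no longer matches, this fails and you must rebuild from source. */
    cl_program load_binary(cl_context ctx, cl_device_id dev,
                           const unsigned char *bin, size_t size)
    {
        cl_int binary_status, err;
        cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size, &bin,
                                                    &binary_status, &err);
        clBuildProgram(prog, 1, &dev, "", NULL, NULL);  /* still required before use */
        return prog;
    }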
The power of OpenCL is that it abstracts the underlying hardware and offers massive, parallel, heterogeneous computing power.
Some SDKs and platforms offer specific features to "optimize" the code. I honestly think such features are mostly marketing; they introduce boilerplate code that makes the application less portable.
There are also some pseudo-new technologies that are just wrappers around OpenCL, or very similar in concept, like Intel Quick Sync.
As for Intel, I should say that at first they supported every Core generation and even some Core 2 Duo parts, but the new SDK only supports 3rd-generation Core processors. I honestly don't get their strategy. Intel is probably the last option if you want to adopt OpenCL and target the biggest possible audience, and their SDK doesn't seem to be very good either.
Stick with the standard and you will avoid both possible legal and performance issues, and your code will also be more portable.
The bottom line is that the AMD SDK includes a compiler for targeting x86 CPUs for OpenCL. That means that even though you are running an Intel CPU the generated code will run on it. It's the same concept as compiling a C program to run on an x86 CPU: it works on Intel and AMD CPUs (or any that implement the x86 instruction set).
The vendor's compiler might have specific optimizations, like user827992 mentions, but in my experience the performance of AMD's CPU compiler isn't that bad when running on an Intel CPU. I haven't tried Intel's OpenCL implementation.
It is true that for some (maybe most in the future) hardware, only the vendor's compiler will support it. AMD's SDK won't build code that will run on an NVIDIA card, and vice-versa. CPUs happen to be a bit of a special case in that the basic instruction set is so widely deployed that the CPU compiler will work on most machines you're likely to come in contact with.

AMD CPU versus Intel CPU for OpenCL

Some friends and I want to use OpenCL. For this we are looking to buy a new computer, but we asked ourselves which is better for OpenCL, AMD or Intel. The graphics card will be an Nvidia one and we have no choice about the graphics card, so we started out wanting to buy an Intel CPU; but after some research we figured that AMD CPUs may be better with OpenCL. We couldn't find benchmarks comparing the two.
So here are our questions:
Is AMD better than Intel with OpenCL?
Is it a problem for OpenCL performance to have an Nvidia card with an AMD CPU?
Thank you,
GrWEn
You shouldn't care as much about what CPU you use as about what GPU you use. You would need to choose between an AMD/ATI GPU or an nVidia GPU.
I would personally recommend an nVidia GPU, as in addition to OpenCL support you can experiment with their more proprietary CUDA technology, which offers a far richer development experience than OpenCL does today. While you're at it, take a look at the new AMP technology just announced by Microsoft for C++, which aims to bring language extensions akin to nVidia's CUDA. nVidia also has enterprise offerings with their Tesla GPUs, with several vendors offering GPU clusters, and you can even get a GPU compute cluster on Amazon EC2 now, all based on nVidia hardware.
You want to buy a new computer with your friends? What kind of project do you plan to do? The hardware question is answered by the needs you have. If you give some more information, we can provide better suggestions.
As written before, the CPU is not the important point, as long as you do not want to buy a multiprocessor, multicore system like one with four quad-core processors. The difference in performance comes mostly from the GPUs used, and there you can find cards for all needs, from a cheap GPU to the nVidia Tesla cards.
It is definitely not a problem to run an nVidia board in an AMD system; I do it here. You can also use the OpenCL devices from the AMD multicore CPU and the nVidia GPU in parallel.
You should pay attention to this: even if you plan to buy a potent system to run your software (like a web server), every developer of OpenCL software still needs a system for testing. So every developer needs at least a modern multi-core CPU with an OpenCL SDK. Where the OpenCL kernels are developed does not matter; OpenCL is platform independent.
Both Intel and AMD have good OpenCL support for their CPUs, so currently it does not really matter which you choose. If you want to use the embedded GPU on AMD Fusion or Intel Sandy Bridge, then I suggest you go for Fusion, since Intel does not have a driver for their GPUs (yet). Depending on what you are going to use OpenCL for, I could suggest a GPU - sometimes NVidia is faster, sometimes AMD.
AMP, CUDA, RenderScript and the many, many others all work nicely, but they don't work on all hardware the way OpenCL does. CUDA certainly has advantages, but in the time it takes you to learn OpenCL, I can assure you the tools around OpenCL will have caught up.
The CPU has no influence on GPU OpenCL performance.
You might also want to try running the OpenCL kernels on the CPU. Check out the Intel OpenCL compiler beta. You can even run kernels on both the CPU and the GPU.
