OpenCL for beginners: what do you advise? - opencl

I want to learn OpenCL for graphics computing, but I am a newbie in heterogeneous computing.
What do you advise? What is best to read?
Could you please give me links to Amazon?

Starting out in OpenCL I'd recommend
OpenCL Programming Guide - Aaftab Munshi
and perhaps
Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 Edition - Ben Gaster
although I got that one free.

Related

Microsoft rxNeuralNet acceleration

In Microsoft's R NeuralNet package, there is an option for acceleration, where you can specify "GPU" or "SSE". I know what a GPU is, but does anyone know what SSE acceleration is?
Thanks!
SSE stands for "Streaming SIMD Extensions". SIMD stands for "Single Instruction Multiple Data".
https://www.intel.com/content/www/us/en/support/processors/000005779.html
David
I cannot comment yet, so I am posting an "answer".
From what I gather, SSE is nowadays the "default" unless you are able to use a GPU. See the Wikipedia article: https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
Niels

Minimum SSE/AVX version required to compare 2 64-bit integers, atomically?

Besides the title... is there an easy way to find this information myself? Preferably in a tabular format.
Easy way to find it yourself:
Intel Intrinsics Guide
Don't be confused by the title; the Intrinsics Guide is actually very convenient for finding ISA-specific instructions.
pcmpeqq was introduced with SSE4.1 and pcmpgtq with SSE4.2, if that's what you're looking for.
x64 with its REX.W CMP has been around for longer though.
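To make concrete what these two instructions compute, here is a minimal pure-Python model of their lane-wise semantics. It is only a sketch: the function names mirror the mnemonics but are hypothetical helpers, not a real API, and the model says nothing about atomicity. Each 128-bit register is modelled as a list of two 64-bit lanes, and each comparison yields an all-ones lane where it holds and zero otherwise.

```python
MASK64 = (1 << 64) - 1  # bit pattern of an all-ones 64-bit lane


def to_signed64(x):
    """Interpret the low 64 bits of x as a two's-complement signed integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def pcmpeqq(a, b):
    """Model of PCMPEQQ: per-lane equality on 64-bit lanes."""
    return [MASK64 if (x & MASK64) == (y & MASK64) else 0 for x, y in zip(a, b)]


def pcmpgtq(a, b):
    """Model of PCMPGTQ: per-lane signed greater-than on 64-bit lanes."""
    return [MASK64 if to_signed64(x) > to_signed64(y) else 0 for x, y in zip(a, b)]


# Two lanes per operand, as in one 128-bit XMM register.
a = [5, 7]
b = [5, -3]
eq = pcmpeqq(a, b)  # lane 0 is equal, lane 1 is not
gt = pcmpgtq(a, b)  # 7 > -3 only because the comparison is signed
```

The second lane shows why the signed interpretation matters: as an unsigned bit pattern, -3 would be larger than 7.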
See also
Intel's manuals
AMD's developer guides
ref.x86asm.net

Advice about inversion of large sparse matrices

I just got a Windows box set up with two 64-bit Intel Xeon X5680 3.33 GHz processors (6 cores each) and 12 GB of RAM. I've been using SAS on some large data sets, but it's just too slow, so I want to set up R to do parallel processing. I want to be able to carry out matrix operations, e.g., multiplication and inversion. Most of my data sets are not huge, in the 3-4 GB range, but one file is around 50 GB.
It's been a while since I used R, so I looked around on the web, including the CRAN HPC, to see what was available. I think a foreach loop and the bigmemory package will be applicable. I came across this post, "Is there a package for parallel matrix inversion in R", which had some interesting suggestions.
I was wondering if anyone has experience with the HiPLAR packages. It looks like HiPLARM adds functionality to the Matrix package and HiPLARb adds new functions altogether. Which of these would be recommended for my application? Furthermore, there is a reference to the PLASMA library. Is this of any help? My matrices have a lot of zeros, so I think they could be considered sparse. I didn't see any examples of how to pass data from R to PLASMA, and the PLASMA docs say it does not support sparse matrices, so I'm thinking that I don't need this library. Am I on the right track here? Any suggestions on other approaches?
EDIT: It looks like HiPLAR and the pbdR packages will not be helpful. I'm leaning more toward bigmemory, although it looks like I/O may be a problem: http://files.meetup.com/1781511/bigmemoryRandLinearAlgebra_BryanLewis.pdf. This article talks about a package vam for virtual associative matrices, but it must be proprietary. Would package ff be of any help here? My R skills are just not current enough to know what direction to pursue. I'm pretty sure I can read the data using bigmemory, but I'm not sure the processing will be very fast.
If you want to use HiPLAR (MAGMA and PLASMA libraries in R), it is only available for Linux at the moment. For this and many other things, I suggest switching your OS to the penguin.
That being said, Intel MKL optimization can do wonders for these sorts of operations. For most practical uses, it is the way to go. Python built with MKL optimization, for example, can process large matrices about 20x faster than IDL, which was designed specifically for image processing. R has similarly shown vast improvements when built with MKL optimization. You can also install R Open from Revolution Analytics, which includes MKL optimization, but I am not sure that it has quite the same effect as building it yourself using Intel tools: https://software.intel.com/en-us/articles/build-r-301-with-intel-c-compiler-and-intel-mkl-on-linux
I would definitely consider the type of operations you are looking to perform. GPU workloads are those that lend themselves well to high parallelism (many of the same little computations running at once, as with matrix algebra), but they are limited by bus speeds. Intel MKL optimization is similar in that it can help use all of your CPU cores, but it is really optimized for Intel CPU architecture. Hence, it should provide basic memory optimization too. I think that is the simplest route. HiPLAR is certainly the future, as it is CPU-GPU by design, especially with highly parallel heterogeneous architectures making their way into consumer systems. I think most consumer systems today cannot fully utilize this, though.
Cheers,
Adam

inverse FFT in shader language?

Does anyone know of an implementation of the inverse FFT in HLSL/GLSL/Cg ... ?
It would save me much work.
Best,
heinrich
Do you already have an FFT implementation? You may already be aware, but the inverse can be computed by reversing the order of the N inputs modulo N (element 0 stays in place and elements 1 to N-1 are reversed), taking the FFT of the result, and dividing by N.
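The trick above can be sketched in a few lines of plain Python. This is a minimal illustration, not shader code: the naive O(N^2) `dft` merely stands in for whatever forward FFT you already have, and both function names are hypothetical. Note that the reversal is taken modulo N, so index 0 is left where it is.

```python
import cmath


def dft(x):
    """Naive O(N^2) forward DFT; a stand-in for any existing forward FFT."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def idft_via_forward(spectrum):
    """Inverse DFT computed using only the forward transform."""
    n = len(spectrum)
    # Reverse modulo N: keep element 0, reverse elements 1..N-1.
    reversed_mod_n = [spectrum[0]] + spectrum[:0:-1]
    # Forward transform of the reversed input, scaled by 1/N.
    return [value / n for value in dft(reversed_mod_n)]


signal = [1.0, 2.0, 3.0, 4.0]
recovered = idft_via_forward(dft(signal))  # matches signal up to rounding error
```

On a GPU the same idea means the forward FFT shader can be reused unchanged; only the input indexing and a final scale by 1/N need to be added.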
DirectX11 comes with an FFT example for compute shaders (see DX11 August SDK Release Notes). As PereAllenWebb points out, this can also be used for the inverse FFT.
Edit: If you just want a fast FFT, you could try CUFFT, which runs on the GPU. It's part of the CUDA SDK. The ACML from AMD also has an FFT, which is currently not GPU accelerated, but this will likely be added soon.
I implemented a 1D FFT on 7800GTX hardware back in 2005. This was before CUDA etc so I had to resort to using Cg and manually implementing the FFT.
I have two FFT implementations. One is a Radix-2 Decimation in Time FFT and the other a Stockham Autosort FFT. The Stockham version would perform around 2-4x faster than a CPU (at the time a 3 GHz single-core P4) for larger sizes (> 8192), but for smaller sizes the CPU was faster as it doesn't have to shift data to/from the GPU.
If you're interested in the shader code feel free to contact me and I'll send it over by email. It was from a personal project so not covered by any commercial copyright. I would imagine that CUDA (and similar) implementations would massively outperform my implementation, however from a learning perspective you can't get better than to write or study the code yourself!
Maybe you could take a look at OpenCL which is a standard for general purpose computing on graphics (and other) hardware.
The Wikipedia article contains an OpenCL example for a standard FFT:
http://en.wikipedia.org/wiki/OpenCL#Example
If you are on a Mac with OS X 10.6, you just need to install the developer tools to get started with OpenCL development.
I also heard that hardware vendors already provide basic OpenCL driver support on Windows.

Intel MKL vs. AMD Core Math Library

Does anybody have experience programming for both the Intel Math Kernel Library (MKL) and the AMD Core Math Library (ACML)? I'm building a personal computer for high performance statistical computations and am debating which components to buy. An appeal of the AMD Core Math Library is that it is free, but I am in academia so the MKL is not that expensive. But I'd be interested in hearing thoughts on:
Which provides a better API?
Which provides better performance, on average, per dollar, including licensing and hardware costs?
Is the ACML-GPU a factor I should consider?
Intel MKL and ACML have similar APIs but MKL has a richer set of supported functionality including BLAS (and CBLAS)/LAPACK/FFTs/Vector and Statistical Math/Sparse direct and iterative solvers/Sparse BLAS, and so on. Intel MKL is also optimized for both Intel and AMD processors and has an active user forum you can turn to for help or guidance. An independent assessment of the two libraries is posted here: (http://www.advancedclustering.com/company-blog/high-performance-linpack-on-xeon-5500-v-opteron-2400.html)
• Shane Corder, Advanced Clustering, (also carried by HPCWire: Benchmark Challenge: Nehalem Versus Istanbul): “In our recent testing and through real world experience, we have found that the Intel compilers and Intel Math Kernel Library (MKL) usually provide the best performance. Instead of just settling on Intel's toolkit we tried various compilers including: Intel, GNU compilers, and Portland Group. We also tested various linear algebra libraries including: MKL, AMD Core Math Library (ACML), and libGOTO from the University of Texas. All of the testing showed we could achieve the highest performance when using both the Intel Compilers and Intel Math Library--even on the AMD system--so these were used them as the base of our benchmarks.” [Benchmark testing showed 4-core Nehalem X5550 2.66GHz at 74.0GFs vs. Istanbul 2435 2.6GHz at 99.4GFs; Istanbul only 34% faster despite 50% more cores]
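The percentages in the bracketed note can be checked with a little arithmetic. This is only a sketch of that check: the GFLOPS figures and the 4-core count come from the note above, while the 6-core count for Istanbul is an assumption inferred from the "50% more cores" remark.

```python
# Figures quoted in the benchmark note above.
nehalem_gflops, nehalem_cores = 74.0, 4
# Assumption: 6 cores, inferred from "50% more cores" than the 4-core Nehalem.
istanbul_gflops, istanbul_cores = 99.4, 6

# Relative advantage of Istanbul in throughput and in core count.
throughput_gain = istanbul_gflops / nehalem_gflops - 1
core_gain = istanbul_cores / nehalem_cores - 1

# Throughput per core, to compare the two chips core for core.
nehalem_per_core = nehalem_gflops / nehalem_cores
istanbul_per_core = istanbul_gflops / istanbul_cores

print(f"throughput gain: {throughput_gain:.1%}, core gain: {core_gain:.0%}")
print(f"GFLOPS per core: Nehalem {nehalem_per_core:.2f}, Istanbul {istanbul_per_core:.2f}")
```

Under that assumption, the per-core figure comes out lower for Istanbul, which is the point the note is making.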
Hope this helps.
In fact, there are two versions of LAPACK routines in ACML. The ones without trailing underscore (_) are the C-version routines, which as Victor said, don't require workspace arrays and you can just pass values instead of references for the parameters. The ones with the underscore however are just vanilla Fortran routines. Do a "dumpbin /exports" on libacml_dll.dll and you'll see.
I have used ACML for its BLAS/LAPACK routines, so this will probably not answer your question, but I hope it's useful for someone. Compared to vanilla BLAS/LAPACK, its performance was a factor of 2-3 better in my particular use case. I used it for dense nonsymmetric complex matrices, for both linear solves and eigensystem computations. You should know that the function declarations are not identical to the vanilla routines. This required a substantial number of preprocessor macros to allow me to freely switch between the two. In particular, none of the LAPACK routines in ACML require work arrays. This is a major convenience if ACML is the only library you will use.
