Microsoft rxNeuralNet acceleration (R)

In Microsoft's R NeuralNet package there is an option for acceleration, where you can specify "GPU" or "SSE". I know what GPU acceleration is, but does anyone know what SSE acceleration is?
Thanks!

SSE stands for "Streaming SIMD Extensions". SIMD stands for "Single Instruction Multiple Data".
https://www.intel.com/content/www/us/en/support/processors/000005779.html
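To give a rough idea of what that means in practice, here is a minimal C++ sketch using SSE intrinsics. This is not rxNeuralNet's actual internals, which aren't public; it just illustrates the 4-wide float arithmetic that SSE provides:

    #include <xmmintrin.h>  // SSE intrinsics
    #include <cstdio>

    int main() {
        // A single SSE instruction operates on four packed floats at once
        // ("Single Instruction, Multiple Data").
        alignas(16) float a[4] = {1.f, 2.f, 3.f, 4.f};
        alignas(16) float b[4] = {10.f, 20.f, 30.f, 40.f};
        alignas(16) float r[4];

        __m128 va = _mm_load_ps(a);
        __m128 vb = _mm_load_ps(b);
        _mm_store_ps(r, _mm_add_ps(va, vb));  // four additions in one instruction

        std::printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
    }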
David

I cannot comment yet, so I am posting this as an "answer".
From what I gather, SSE is nowadays the default unless you have the ability to use a GPU. See the Wikipedia article: https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
Niels

Related

When to use Eigen and when to use BLAS

I did some basic reading on Eigen and BLAS. Both libraries support matrix-matrix and matrix-vector multiplication, but I don't understand which one I should use in which case. To me it seems that both have almost the same performance. It would be nice if someone could point me to a resource, or just tell me what advantages one library has over the other, and how the two differ for matrix and vector manipulation. Thanks in advance.
Use Eigen, it's more complete and much easier to use. Then, if you wonder whether another fully optimized BLAS implementation could give you higher performance, just recompile your code with -DEIGEN_USE_BLAS, link to your favorite BLAS, and see for yourself.
Also, when using Eigen, don't forget to enable compiler optimizations, e.g. -O3, as well as the instruction sets your hardware supports, e.g. -mavx -mfma when using a recent Eigen.
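For illustration, a minimal sketch of this advice in code (the matrix sizes are arbitrary; the point is that switching to an external BLAS via -DEIGEN_USE_BLAS requires no source changes):

    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        // Compile with e.g.:  g++ -O3 -mavx -mfma example.cpp
        // Adding -DEIGEN_USE_BLAS and linking a BLAS library swaps the
        // backend for products like this without changing the code.
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(512, 512);
        Eigen::MatrixXd B = Eigen::MatrixXd::Random(512, 512);
        Eigen::MatrixXd C = A * B;  // dense matrix-matrix product
        std::cout << "C(0,0) = " << C(0, 0) << "\n";
    }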
So the answer to this question is here.
http://eigen.tuxfamily.org/index.php?title=FAQ#How_does_Eigen_compare_to_BLAS.2FLAPACK.3F
More or less, I use Eigen mostly because it has a comfortable interface. If you need speed and multicore parallelism, or if you have only a little linear algebra in your code but it is time-consuming, go for GotoBLAS2. It is usually the fastest on Intel machines.

OpenCL for beginners. What you advise?

I want to learn OpenCL for graphics computing, but I am a newbie in heterogeneous computing.
What do you advise? What is best to read?
Could you please give me some links to Amazon?
Starting out in OpenCL I'd recommend
OpenCL Programming Guide - Aaftab Munshi
and perhaps
Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 Edition - Ben Gaster
although I got that one free.

What algorithms do FPUs use to compute transcendental functions?

What methods would a modern FPU use to compute transcendental functions?
For example, Intel CPUs provide instructions such as FSIN, FCOS, FYL2X, etc. I am curious as to what algorithms would be used to actually implement these in hardware.
My naïve guess would be Taylor series perhaps combined with some lookup tables, but that's nothing more than a wild guess. Please enlighten me.
P.S. This question is more general than just Intel hardware.
One place to start could be "New Algorithms for Improved Transcendental Functions on IA-64" by Shane Story and Ping Tak Peter Tang, both from Intel. It probably doesn't have as many details as you might like, but it includes several references.
Update 08/13/2014
The original link is broken. IEEE's public abstract/citation page can be found here:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=762822&tag=1
In hardware (as well as in software where a hardware multiply instruction is not available), these functions are usually implemented with CORDIC, since it requires only addition, subtraction, bit shifts, and table lookups.
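A minimal floating-point sketch of CORDIC in rotation mode, for illustration only (real hardware uses fixed-point registers, so the multiplications by 2^-i below become bit shifts, and the atan values come from a small precomputed table):

    #include <cmath>
    #include <cstdio>

    // Rotation-mode CORDIC: after enough iterations, x ~ cos(angle)
    // and y ~ sin(angle), valid for |angle| < ~1.74 rad.
    void cordic_sincos(double angle, int iters, double* s, double* c) {
        // Accumulated gain of the rotations, folded into the start value.
        double k = 1.0;
        for (int i = 0; i < iters; ++i)
            k /= std::sqrt(1.0 + std::ldexp(1.0, -2 * i));

        double x = k, y = 0.0, z = angle;
        for (int i = 0; i < iters; ++i) {
            double d = (z >= 0.0) ? 1.0 : -1.0;  // rotate toward z = 0
            double p = std::ldexp(1.0, -i);      // 2^-i: a shift in hardware
            double xn = x - d * y * p;
            y += d * x * p;
            x = xn;
            z -= d * std::atan(p);               // table lookup in hardware
        }
        *c = x;
        *s = y;
    }

    int main() {
        double s, c;
        cordic_sincos(0.5, 32, &s, &c);
        std::printf("sin: %f (ref %f)  cos: %f (ref %f)\n",
                    s, std::sin(0.5), c, std::cos(0.5));
    }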
Related:
How does C compute sin() and other math functions?
How are sin and cos implemented hardware wise?
What algorithm is used by computers to calculate logarithms?
How do computers calculate sin values?
How does the computer calculate Square roots?

General sparse iterative solver libraries

What are some of the better libraries for large sparse iterative (conjugate gradient, MINRES, GMRES, etc.) linear algebra system solving? I've often coded my own routines, but I'm interested to know which "off-the-shelf" packages people prefer. I've heard of PETSc, TAUCS, IML++, and a few others. I'm wondering how these stack up, and what else is out there. My preference is for ease of use, and freely available software.
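(As a concrete illustration of the "off-the-shelf" style of interface these packages offer, here is a conjugate gradient solve using Eigen's iterative module; Eigen is not one of the packages named above, so treat this as just an example of the usage pattern, not a recommendation over them.)

    #include <Eigen/Sparse>
    #include <iostream>
    #include <vector>

    int main() {
        // Small SPD tridiagonal system standing in for a real problem.
        const int n = 5;
        std::vector<Eigen::Triplet<double>> entries;
        for (int i = 0; i < n; ++i) {
            entries.emplace_back(i, i, 2.0);
            if (i > 0)     entries.emplace_back(i, i - 1, -1.0);
            if (i < n - 1) entries.emplace_back(i, i + 1, -1.0);
        }
        Eigen::SparseMatrix<double> A(n, n);
        A.setFromTriplets(entries.begin(), entries.end());

        Eigen::VectorXd b = Eigen::VectorXd::Ones(n);
        Eigen::ConjugateGradient<Eigen::SparseMatrix<double>> cg;
        cg.compute(A);
        Eigen::VectorXd x = cg.solve(b);
        std::cout << "iterations: " << cg.iterations()
                  << ", estimated error: " << cg.error() << "\n";
    }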
Victor Eijkhout's Overview of Iterative Linear System Solver Packages would probably be a good place to start.
You may also wish to look at Trilinos
http://trilinos.sandia.gov/
It is designed by some great software craftsmen, using modern design techniques.
Moreover, from within Trilinos, you can call PETSc if you desire.
NIST has some sparse linear algebra software you can download here: http://math.nist.gov/sparselib++/ and here: http://math.nist.gov/spblas/
I haven't used those packages myself, but I've heard good things about them.
http://www.cise.ufl.edu/research/sparse/umfpack/
UMFPACK is a set of routines for solving unsymmetric sparse linear systems, Ax=b, using the Unsymmetric MultiFrontal method. Written in ANSI/ISO C, with a MATLAB (Version 6.0 and later) interface. Appears as a built-in routine (for lu, backslash, and forward slash) in MATLAB. Includes a MATLAB interface, a C-callable interface, and a Fortran-callable interface. Note that "UMFPACK" is pronounced in two syllables, "Umph Pack". It is not "You Em Ef Pack".
I'm using it for FEM code.
I would check out Microsoft's Solver Foundation. It ranges from free to cheap, even for pretty big problems. The unlimited version is industrial strength and is based on Gurobi, and of course isn't cheap.
http://code.msdn.microsoft.com/solverfoundation

inverse FFT in shader language?

Does anyone know of an implementation of the inverse FFT in HLSL/GLSL/Cg ... ?
It would save me a lot of work.
Best,
heinrich
Do you already have an FFT implementation? You may already be aware, but the inverse can be computed by reversing the order of the N inputs (keeping the first element in place), taking the forward FFT over those, and dividing the result by N.
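A small self-contained sketch of that trick, with a naive O(N^2) DFT standing in for whatever forward FFT you already have (any FFT with the same sign convention works):

    #include <algorithm>
    #include <complex>
    #include <cstdio>
    #include <vector>

    using cd = std::complex<double>;
    const double kPi = 3.141592653589793;

    // Naive forward DFT, standing in for a real FFT implementation.
    std::vector<cd> dft(const std::vector<cd>& x) {
        const std::size_t n = x.size();
        std::vector<cd> out(n);
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                out[k] += x[j] * std::polar(1.0, -2.0 * kPi * k * j / n);
        return out;
    }

    // Inverse built from the forward transform: keep element 0, reverse
    // the rest, run the forward transform, divide by N.
    std::vector<cd> idft(const std::vector<cd>& x) {
        std::vector<cd> rev(x);
        std::reverse(rev.begin() + 1, rev.end());
        std::vector<cd> out = dft(rev);
        for (cd& v : out) v /= static_cast<double>(x.size());
        return out;
    }

    int main() {
        std::vector<cd> x = {1, 2, 3, 4};
        std::vector<cd> back = idft(dft(x));  // should reproduce x
        for (const cd& v : back)
            std::printf("(%.3f, %.3f) ", v.real(), v.imag());
        std::printf("\n");
    }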
DirectX11 comes with an FFT example for compute shaders (see the DX11 August SDK Release Notes). As PereAllenWebb points out, this can also be used for the inverse FFT.
Edit: If you just want a fast FFT, you could try CUFFT, which runs on the GPU and is part of the CUDA SDK. ACML from AMD also has an FFT, which is currently not GPU-accelerated, but this will likely be added soon.
I implemented a 1D FFT on 7800GTX hardware back in 2005. This was before CUDA etc so I had to resort to using Cg and manually implementing the FFT.
I have two FFT implementations. One is a radix-2 decimation-in-time FFT and the other is a Stockham autosort FFT. The Stockham version performed around 2-4x faster than a CPU (at the time, a single-core 3 GHz P4) for larger sizes (> 8192), but for smaller sizes the CPU was faster, since it doesn't have to shift data to/from the GPU.
If you're interested in the shader code feel free to contact me and I'll send it over by email. It was from a personal project so not covered by any commercial copyright. I would imagine that CUDA (and similar) implementations would massively outperform my implementation, however from a learning perspective you can't get better than to write or study the code yourself!
Maybe you could take a look at OpenCL, which is a standard for general-purpose computing on graphics (and other) hardware.
The Wikipedia article contains an OpenCL example of a standard FFT:
http://en.wikipedia.org/wiki/OpenCL#Example
If you are on a Mac with OS X 10.6, you just need to install the developer tools to get started with OpenCL development.
I also heard that hardware vendors already provide basic OpenCL driver support on Windows.
