I am looking for a definitive answer as to whether the Intel DAAL libraries are compatible with the x100 Knights Corner Xeon Phi co-processor.
I have searched high and low on the internet and can't tell either way, and I can't seem to make it work on my x100 Xeon Phi.
Okay. Found this. Only the KNL is mentioned in the list of Xeon Phi processors. It is not explicit though that the KNC is not supported.
From DAAL supports KNL
My notebook is a 2014 Acer with an Intel Core i7 running Windows 10, and it has two GPUs: one is an Intel Graphics Family GPU and the other is an AMD Radeon HD 8670M (part of the 8600 family, I think). While plotting charts with hundreds of thousands of points in ggplot2 or gganimate, I noticed that only the Intel GPU is active while the plots are being rendered. The AMD GPU sits idle, even though the Intel GPU is sometimes quite busy. I tried googling, to no avail. Would anyone have some pointers to share? Is there any further hardware/software info I should post to make this question "answerable"? Thanks in advance!
In Microsoft's R NeuralNet package, there is an option for acceleration, where you can specify "GPU" or "SSE". I know what GPU is, but does anyone know what SSE acceleration is?
Thanks!
SSE stands for "Streaming SIMD Extensions". SIMD stands for "Single Instruction Multiple Data".
https://www.intel.com/content/www/us/en/support/processors/000005779.html
David
I cannot comment yet, so I am posting an "answer".
From what I gather, SSE is nowadays the default unless you are able to use a GPU. See the Wikipedia article: https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
Niels
To my surprise, I cannot find a comparison of these products using open source OpenCL benchmark suites, such as Rodinia and SHOC. Such a comparison could be more interesting than the comparisons I have been able to find, which cover only theoretical peak performance or performance in simple matrix multiplication kernels.
Does anyone know where such results might be available? Failing that, do any Stack Overflow users have access to one or both products, and the time and inclination to run the benchmarks and share the results? Results for any of the versions of either card would be interesting.
CLBenchmark.com now has some results for the Xeon Phi, and a complete set for the K20c.
Here is a side-by-side comparison.
Here is a comparison of the Xeon Phi with a GTX Titan.
http://clbenchmark.com/compare.jsp?config_0=14470292&config_1=15887974
The Xeon Phi basically gets completely destroyed in 10 of the 12 benchmarks and is on par in the other 2. So the 300 watt 22 nm Phi part does not fare well against the 250 watt 28 nm GPU.
Basically the Phi seems to be having major trouble utilizing its bandwidth capacity, and vectorizing the code seems to be another issue.
Here is a benchmark comparing sparse matrix multiplication performance:
http://uk.arxiv.org/abs/1302.1078
It partly answers my question, but I would rather see more than one algorithm, and I would like to see how portable OpenCL performance is. I will still accept any answer that can provide that information.
SHOC benchmark suite for Xeon Phi is on github here:
Intel Xeon Phi SHOC Benchmark Suite
Plenty of benchmark postings starting to go public and "googlable", but here is the standard Intel communication on benchmarking of Xeon Phi versus a dual socket E5-2670:
Intel Xeon Phi Performance Doc.
When comparing the performance of Xeon Phi to a regular Xeon, or any other platform, make sure you take into account the power envelope of the platform (dual-socket Xeon) and whether the application was already tuned for a Xeon. One of the big selling points of Xeon Phi is that tuning for it typically brings Xeon improvements in addition to Xeon Phi improvements. Pretty sweet.
Does anyone know an implementation of the inverse FFT in HLSL/GLSL/Cg ... ?
It would save me much work.
Best,
heinrich
Do you already have an FFT implementation? You may already be aware, but the inverse can be computed by reversing the order of the N inputs (index n maps to -n mod N, so element 0 stays in place), taking the FFT over those, and dividing the result by N.
DirectX11 comes with an FFT example for compute shaders (see DX11 August SDK Release Notes). As PereAllenWebb points out, this can also be used for the inverse FFT.
Edit: If you just want a fast FFT, you could try CUFFT, which runs on the GPU. It's part of the CUDA SDK. The ACML from AMD also has an FFT, which is currently not GPU accelerated, but this will likely be added soon.
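To make the trick above concrete, here is a minimal CPU-side sketch in plain Python rather than shader code. The names fft and ifft_via_fft are made up for this illustration, and the forward transform is a simple recursive radix-2 FFT that assumes the length is a power of two. The same reordering should carry over to any forward FFT you already have on the GPU.

```python
import cmath


def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor for the forward transform (negative exponent).
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft_via_fft(spectrum):
    """Inverse FFT built only from the forward FFT (illustrative helper)."""
    n = len(spectrum)
    # Index reversal modulo n: element 0 stays put, the rest are reversed.
    reordered = [spectrum[0]] + spectrum[:0:-1]
    # Forward FFT of the reordered input, then scale by 1/n.
    return [v / n for v in fft(reordered)]


signal = [1, 2, 3, 4, 0, -1, -2, 5]
roundtrip = ifft_via_fft(fft(signal))
print(all(abs(a - b) < 1e-9 for a, b in zip(signal, roundtrip)))
```

Note that a plain reversal of the whole array (without keeping element 0 in place) would give a result that is shifted by one sample.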
I implemented a 1D FFT on 7800GTX hardware back in 2005. This was before CUDA etc so I had to resort to using Cg and manually implementing the FFT.
I have two FFT implementations. One is a radix-2 decimation-in-time FFT and the other a Stockham autosort FFT. The Stockham version would perform around 2-4x faster than a CPU (at the time a 3GHz single-core P4) for larger sizes (> 8192), but for smaller sizes the CPU was faster as it doesn't have to shift data to/from the GPU.
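For readers who have not seen the Stockham autosort variant, here is a rough CPU sketch in plain Python. It is not the Cg shader code mentioned above; the function names are made up for this illustration, and the length is assumed to be a power of two. The point it tries to show is that each pass reads one buffer and writes another (ping-pong), and the output ends up in natural order without a separate bit-reversal pass, which is the property that tends to suit render-to-texture style GPU code.

```python
import cmath


def stockham_fft(x):
    """Iterative radix-2 Stockham autosort FFT (illustrative sketch)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    src = [complex(v) for v in x]
    dst = [0j] * n
    length, stride = n, 1  # current sub-transform length and element stride
    while length > 1:
        half = length // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / length)
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                # Butterfly results go to their final positions for this pass.
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src  # ping-pong the two buffers
        length, stride = half, stride * 2
    return src


def naive_dft(x):
    """O(n^2) reference DFT used only to check the sketch."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


data = [1, 2, 3, 4, 0, -1, -2, 5]
print(all(abs(a - b) < 1e-9 for a, b in zip(stockham_fft(data), naive_dft(data))))
```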
If you're interested in the shader code feel free to contact me and I'll send it over by email. It was from a personal project so not covered by any commercial copyright. I would imagine that CUDA (and similar) implementations would massively outperform my implementation, however from a learning perspective you can't get better than to write or study the code yourself!
Maybe you could take a look at OpenCL which is a standard for general purpose computing on graphics (and other) hardware.
The Wikipedia article contains an OpenCL example for a standard FFT:
http://en.wikipedia.org/wiki/OpenCL#Example
If you are on a Mac with OS X 10.6, you just need to install the developer tools to get started with OpenCL development.
I also heard that hardware vendors already provide basic OpenCL driver support on Windows.
Does anybody have experience programming for both the Intel Math Kernel Library (MKL) and the AMD Core Math Library (ACML)? I'm building a personal computer for high performance statistical computations and am debating which components to buy. An appeal of the ACML is that it is free, but I am in academia so the MKL is not that expensive. But I'd be interested in hearing thoughts on:
Which provides a better API?
Which provides better performance, on average, per dollar, including licensing and hardware costs?
Is the ACML-GPU a factor I should consider?
Intel MKL and ACML have similar APIs but MKL has a richer set of supported functionality including BLAS (and CBLAS)/LAPACK/FFTs/Vector and Statistical Math/Sparse direct and iterative solvers/Sparse BLAS, and so on. Intel MKL is also optimized for both Intel and AMD processors and has an active user forum you can turn to for help or guidance. An independent assessment of the two libraries is posted here: (http://www.advancedclustering.com/company-blog/high-performance-linpack-on-xeon-5500-v-opteron-2400.html)
• Shane Corder, Advanced Clustering, (also carried by HPCWire: Benchmark Challenge: Nehalem Versus Istanbul): “In our recent testing and through real world experience, we have found that the Intel compilers and Intel Math Kernel Library (MKL) usually provide the best performance. Instead of just settling on Intel's toolkit we tried various compilers including: Intel, GNU compilers, and Portland Group. We also tested various linear algebra libraries including: MKL, AMD Core Math Library (ACML), and libGOTO from the University of Texas. All of the testing showed we could achieve the highest performance when using both the Intel Compilers and Intel Math Library--even on the AMD system--so these were used as the base of our benchmarks.” [Benchmark testing showed 4-core Nehalem X5550 2.66GHz at 74.0 GFLOPS vs. Istanbul 2435 2.6GHz at 99.4 GFLOPS; Istanbul only 34% faster despite 50% more cores]
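The arithmetic behind the bracketed note can be checked with a few lines of Python. This only uses the two GFLOPS figures quoted above. The core count of 6 for the Istanbul part is inferred here from the "50% more cores" remark, so treat it as an assumption.

```python
# Figures quoted above; the Istanbul core count (6) is inferred from
# "50% more cores" than the 4-core Nehalem and is an assumption.
nehalem = {"gflops": 74.0, "cores": 4}
istanbul = {"gflops": 99.4, "cores": 6}


def relative_gain(new, old):
    """Fractional increase of new over old (0.5 means 50% more)."""
    return new / old - 1.0


throughput_gain = relative_gain(istanbul["gflops"], nehalem["gflops"])
core_gain = relative_gain(istanbul["cores"], nehalem["cores"])
per_core_nehalem = nehalem["gflops"] / nehalem["cores"]
per_core_istanbul = istanbul["gflops"] / istanbul["cores"]

print(f"throughput gain: {throughput_gain:.1%}")
print(f"core count gain: {core_gain:.1%}")
print(f"GFLOPS per core: Nehalem {per_core_nehalem:.2f}, Istanbul {per_core_istanbul:.2f}")
```

Seen per core, the Nehalem part comes out ahead in this benchmark, which is the point the note is making.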
Hope this helps.
In fact, there are two versions of LAPACK routines in ACML. The ones without trailing underscore (_) are the C-version routines, which as Victor said, don't require workspace arrays and you can just pass values instead of references for the parameters. The ones with the underscore however are just vanilla Fortran routines. Do a "dumpbin /exports" on libacml_dll.dll and you'll see.
I have used ACML for its BLAS/LAPACK routines, so this will probably not answer your question, but I hope it's useful for someone. Compared to vanilla BLAS/LAPACK, its performance was a factor of 2-3 better in my particular use case. I used it for dense nonsymmetric complex matrices, for both linear solves and eigensystem computations. You should know that the function declarations are not identical to the vanilla routines. This required a substantial number of preprocessor macros to allow me to freely switch between the two. In particular, none of the LAPACK routines in ACML require work arrays. This is a major convenience if ACML is the only library you will use.