Does anyone know of an implementation of the inverse FFT in HLSL/GLSL/Cg?
It would save me much work.
Best,
heinrich
Do you already have an FFT implementation? You may already be aware, but the inverse can be computed by reversing the order of the N inputs modulo N (index 0 stays in place and index k swaps with index N - k), taking the FFT over those, and dividing the result by N.
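To illustrate the trick, here is a minimal sketch in plain Python rather than shader code. The function names `fft` and `ifft_via_fft` are made up for this example, and the recursive radix-2 `fft` is only a stand-in for whatever forward transform you already have; it assumes the length is a power of two.

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor for the forward transform (negative exponent).
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft_via_fft(X):
    """Inverse FFT computed with the forward FFT only."""
    n = len(X)
    # Reverse modulo n: element 0 stays put, element k swaps with element n - k.
    reversed_mod_n = [X[(-k) % n] for k in range(n)]
    # Forward transform of the reversed input, then scale by 1/n.
    return [v / n for v in fft(reversed_mod_n)]
```

Calling `ifft_via_fft(fft(x))` should give back `x` up to floating-point rounding. The same index reversal and 1/N scaling carry over to a GPU implementation, where the reversal can usually be folded into the read addressing of the first pass.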
DirectX11 comes with an FFT example for compute shaders (see DX11 August SDK Release Notes). As PereAllenWebb points out, this can also be used for the inverse FFT.
Edit: If you just want a fast FFT, you could try CUFFT, which runs on the GPU. It's part of the CUDA SDK. The ACML from AMD also has an FFT, which is currently not GPU accelerated, but this will likely be added soon.
I implemented a 1D FFT on 7800GTX hardware back in 2005. This was before CUDA etc so I had to resort to using Cg and manually implementing the FFT.
I have two FFT implementations. One is a radix-2 decimation-in-time FFT and the other a Stockham autosort FFT. The Stockham version would perform around 2-4x faster than a CPU (at the time a 3GHz single-core P4) for larger sizes (> 8192), but for smaller sizes the CPU was faster as it doesn't have to shift data to/from the GPU.
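For readers who have not met the Stockham autosort formulation, here is a rough sketch in plain Python. It is not the Cg code mentioned above, just an illustration of the idea; the name `stockham_fft` is made up, and the length is assumed to be a power of two. Each pass reads from one buffer and writes to another (ping-pong), and the output ends up in natural order without a separate bit-reversal step, which is what makes the scheme attractive on render-to-texture style hardware.

```python
import cmath

def stockham_fft(x):
    """Iterative radix-2 Stockham autosort FFT; len(x) must be a power of two."""
    n = len(x)
    src = [complex(v) for v in x]   # buffer read in the current pass
    dst = [0j] * n                  # buffer written in the current pass
    stride = 1
    while n > 1:
        half = n // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / n)  # twiddle factor for this butterfly
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src         # ping-pong: the output becomes the next input
        n = half
        stride *= 2
    return src
```

On a GPU the two inner loops would typically become one shader invocation per output element, with `src` and `dst` as two textures or buffers that swap roles after every pass.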
If you're interested in the shader code feel free to contact me and I'll send it over by email. It was from a personal project so not covered by any commercial copyright. I would imagine that CUDA (and similar) implementations would massively outperform my implementation, however from a learning perspective you can't get better than to write or study the code yourself!
Maybe you could take a look at OpenCL which is a standard for general purpose computing on graphics (and other) hardware.
The Wikipedia article contains an OpenCL example for a standard FFT:
http://en.wikipedia.org/wiki/OpenCL#Example
If you are on a Mac with OS X 10.6, you just need to install the developer tools to get started with OpenCL development.
I also heard that hardware vendors already provide basic OpenCL driver support on Windows.
Related
(Please don't recommend a specific product or service kthxbye)
I'm considering getting a discrete OpenCL-oriented GPU (i.e. not NVIDIA). Now, with CUDA GPUs (i.e. NVIDIA...), you have the 'Compute Capability' figure, which you can easily translate into concrete compute-related features - but I can't seem to find something parallel in the OpenCL world. You can find overall bandwidth, or a sort-of-cooked figure for maximum number of work-items executing in parallel (it's a cooked number for reasons such as not telling you what you can do with each of these between clock cycles. I can double the figure by doubling the number of cycles per op) - but not the very long and specific set of micro-features (which are mostly independent from the GPU's macro-features).
I'm interested in an answer regarding both integrated and discrete, and not just in NVIDIA's contender AMD. It's specifically interesting for me to look at supposedly 'weak' GPUs since I care more about the architecture than how much I can actually crunch with it.
Just got a Windows box set up with two 64-bit Intel Xeon X5680 3.33 GHz processors (6 cores each) and 12 GB of RAM. I've been using SAS on some large data sets, but it's just too slow, so I want to set up R to do parallel processing. I want to be able to carry out matrix operations, e.g., multiplication and inversion. Most of my data are not huge, 3-4 GB range, but one file is around 50 GB. It's been a while since I used R, so I looked around on the web, including the CRAN HPC, to see what was available. I think a foreach loop and the bigmemory package will be applicable. I came across this post, "Is there a package for parallel matrix inversion in R", which had some interesting suggestions. I was wondering if anyone has experience with the HiPLAR packages. It looks like HiPLARM adds functionality to the Matrix package and HiPLARb adds new functions altogether. Which of these would be recommended for my application? Furthermore, there is a reference to the PLASMA library. Is this of any help? My matrices have a lot of zeros, so I think they could be considered sparse. I didn't see any examples of how to pass data from R to PLASMA, and the PLASMA docs say it does not support sparse matrices, so I'm thinking that I don't need this library. Am I on the right track here? Any suggestions on other approaches?
EDIT: It looks like HiPLAR and the pbdR packages will not be helpful. I'm leaning more toward bigmemory, although it looks like I/O may be a problem: http://files.meetup.com/1781511/bigmemoryRandLinearAlgebra_BryanLewis.pdf. This article talks about a package vam for virtual associative matrices, but it must be proprietary. Would package ff be of any help here? My R skills are just not current enough to know what direction to pursue. Pretty sure I can read this using bigmemory, but not sure the processing will be very fast.
If you want to use HiPLAR (MAGMA and PLASMA libraries in R), it is only available for Linux at the moment. For this and many other things, I suggest switching your OS to the penguin.
That being said, Intel MKL optimization can do wonders for these sorts of operations. For most practical uses, it is the way to go. Python built with MKL optimization, for example, can process large matrices about 20x faster than IDL, which was designed specifically for image processing. R has similarly shown vast improvements when built with MKL optimization. You can also install R Open from Revolution Analytics, which includes MKL optimization, but I am not sure that it has quite the same effect as building it yourself using Intel tools: https://software.intel.com/en-us/articles/build-r-301-with-intel-c-compiler-and-intel-mkl-on-linux
I would definitely consider the type of operations one is looking to perform. GPU processes are those that lend well to high parallelism (many of the same little computations running at once, as with matrix algebra), but they are limited by bus speeds. Intel MKL optimization is similar in that it can help use all of your CPU cores, but it is really optimized to Intel CPU architecture. Hence, it should provide basic memory optimization too. I think that is the simplest route. HiPLAR is certainly the future, as it is CPU-GPU by design, especially with highly parallel heterogeneous architectures making their way into consumer systems. Most consumer systems today cannot fully utilize this though I think.
Cheers,
Adam
What methods would a modern FPU use to compute transcendental functions?
For example, Intel CPUs provide instructions such as FSIN, FCOS, FYL2X, etc. I am curious as to what algorithms would be used to actually implement these in hardware.
My naïve guess would be Taylor series perhaps combined with some lookup tables, but that's nothing more than a wild guess. Please enlighten me.
P.S. This question is more general than just Intel hardware.
One place to start could be "New Algorithms for Improved Transcendental Functions on IA-64" by Shane Story and Ping Tak Peter Tang, both from Intel. It probably doesn't have as many details as you might like, but it includes several references.
Update 08/13/2014
The original link is broken. IEEE's public abstract/citation page can be found here:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=762822&tag=1
In hardware (as well as in software where a hardware multiply instruction is not available), these functions are usually implemented with CORDIC, since this requires only addition, subtraction, bit shifts and table lookups.
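As a rough illustration of how CORDIC works in rotation mode, here is a floating-point sketch in Python. The function name `cordic_sin_cos` and the iteration count are made up for the example. Real hardware would use fixed-point integers, so the multiplications by `2.0 ** -i` below would be plain right shifts and the `atan` values would come from a small precomputed ROM table. This is a sketch of the technique, not a description of what any particular FPU does.

```python
import math

def cordic_sin_cos(theta, iterations=32):
    """Return (sin(theta), cos(theta)) via CORDIC rotation mode.

    Valid for |theta| <= pi/2; larger angles need range reduction first.
    """
    # Lookup table of elementary rotation angles atan(2^-i).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; pre-scale by the inverse of the total gain.
    scale = 1.0
    for i in range(iterations):
        scale /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = scale, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0   # rotate toward zero residual angle
        shift = 2.0 ** -i               # a right shift by i bits in fixed point
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * angles[i]
    return y, x
```

Each iteration contributes roughly one more bit of accuracy, which is why the number of iterations is normally tied to the word length.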
Related:
How does C compute sin() and other math functions?
How are sin and cos implemented hardware wise?
What algorithm is used by computers to calculate logarithms?
How do computers calculate sin values?
How does the computer calculate Square roots?
I have a linear algebra code that I am trying to get to run faster. It's an iterative algorithm with a loop and matrix-vector multiplications within it.
So far, I have used MATMUL (Fortran lib.) and DGEMV, and tried writing my own matrix-vector code in OpenMP, but the algorithm is doing no better in terms of scalability. Speedups are barely 3.5 - 4 irrespective of how many processors I am allotting to it (I have tried up to 64 processors).
The profiling shows significant time being spent in Matrix-Vector and the rest is fairly nominal.
My question is:
I have a shared memory system with tons of RAM and processors. I have tried tweaking the OpenMP implementation of the code (including the matrix-vector product) but it has not helped. Will it help to code in MPI? I am not a pro at MPI, but the ability to fine-tune the message communication might help a bit, though I can't be sure. Any comments?
More generally, from the literature I have read, MPI = Distributed, OpenMP = Shared but can they perform well in the others' territory? Like MPI in Shared? Will it work? Will it be better than the OpenMP implementation if done well?
You're best off just using a linear algebra package that is already well optimized for a multicore environment and using that for your matrix-vector multiplication. The ATLAS package, GotoBLAS (if you have a Nehalem or older; sadly it's no longer being updated), or vendor BLAS implementations (like MKL for Intel CPUs, ACML for AMD, or VecLib for Apple, which all cost money) all have good, well-tuned, multithreaded BLAS implementations. Unless you have excellent reason to believe that you can do better than those full-time development teams can, you're best off using them.
Note that you'll never get the parallel speedup with DGEMV that you do with DGEMM, just because the vector is smaller than another matrix and so there's less work; but you can still do quite well, and you'll find you get much better performance with these libraries than you do with anything hand-rolled unless you were already doing multi-level cache blocking.
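One way to put numbers on the DGEMV vs DGEMM difference is to compare floating-point operations per byte of memory traffic. The sketch below is a back-of-the-envelope model in Python. It assumes square n x n double-precision operands and that each operand is moved between memory and the cores exactly once, which is an idealization; the function names are made up for the example.

```python
def gemv_intensity(n, bytes_per_element=8):
    """Flops per byte for y = A @ x with an n x n matrix A (idealized)."""
    flops = 2 * n * n                               # one multiply and one add per matrix entry
    bytes_moved = (n * n + 2 * n) * bytes_per_element  # read A and x, write y
    return flops / bytes_moved

def gemm_intensity(n, bytes_per_element=8):
    """Flops per byte for C = A @ B with n x n matrices (idealized)."""
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_element     # read A and B, write C
    return flops / bytes_moved

for n in (1024, 4096):
    print(n, round(gemv_intensity(n), 3), round(gemm_intensity(n), 1))
```

In this model the matrix-vector product stays just under 0.25 flops per byte no matter how large n gets, while the matrix-matrix product grows like n/12. A kernel with that little work per byte tends to be limited by memory bandwidth rather than core count, which would be consistent with a speedup that flattens out at around 4 on a shared-memory machine.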
You can use MPI in a shared environment (though not OpenMP in a distributed one). However, achieving a good speedup depends a lot more on your algorithms and data dependencies than the technology used. Since you have a lot of shared memory, I'd recommend you stick with OpenMP, and carefully examine whether you're making the best use of your resources.
Is there any general FFT lib available for running on the GPU using OpenCL? As far as my knowledge goes, Apple sample code for power-of-two OpenCL FFT is the only such code available?
Does any such library exist for non-power-of-two transform sizes? If not, how easy or difficult is it to modify the Apple OpenCL sample?
I am looking at image processing applications, with non-power-of-two transform sizes, and I will have to do a whole bunch of FFTs, a batched FFT.
Try clFFT, developed by AMD. It is aimed at AMD graphics cards, but should work on NVIDIA GPUs too. It can transform arrays whose length is a product of powers of 2, 3 and 5 (radix 2, 3 and 5, and combinations thereof).
https://github.com/clMathLibraries/clFFT
There are python bindings available
https://github.com/geggo/gpyfft
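If your image dimensions are arbitrary, a quick way to see whether a length fits the radix 2/3/5 restriction described above, and how far you would have to pad otherwise, is something like the following. This is plain Python with made-up helper names; it does not call clFFT or gpyfft.

```python
def is_supported_length(n, radices=(2, 3, 5)):
    """True if n is a product of powers of the given radices only."""
    if n < 1:
        return False
    for r in radices:
        while n % r == 0:
            n //= r
    return n == 1

def next_supported_length(n, radices=(2, 3, 5)):
    """Smallest length >= n that factors entirely into the given radices."""
    m = max(n, 1)
    while not is_supported_length(m, radices):
        m += 1
    return m

print(is_supported_length(480))     # 480 = 2^5 * 3 * 5
print(next_supported_length(487))   # 487 is prime, so padding is needed
```

Because lengths of this form are much denser than powers of two, the padding overhead is usually small; 487 only needs to grow to 500, whereas the next power of two is 512.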
I know of an OpenCL FFT library that is currently under development, but they don't plan on having non-power-of-two transform sizes in the first release.
Can you provide any information about your application? It might help to get the priority for that feature raised if it's something a lot of people can use.
You can download some OpenCL code samples including FFT from the SHOC benchmark suite.
Zero-padding can be used to make arbitrary-length data fit a power-of-two FFT algorithm. Consider whether that would suit your application.
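Increasing the number of samples decreases the "step size" (bin spacing) in the output domain, so the spectrum is sampled on a finer grid. Note that padding only interpolates the spectrum; it does not add new frequency information.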
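A small sketch of the padding step and its effect on bin spacing, in plain Python. The helper names and the 48000 Hz sample rate are made up for illustration.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

def zero_pad(samples):
    """Append zeros until the length is a power of two."""
    target = next_power_of_two(len(samples))
    return list(samples) + [0.0] * (target - len(samples))

def bin_spacing(sample_rate, n):
    """Distance between adjacent FFT output bins, in the units of sample_rate."""
    return sample_rate / n

signal = [0.5] * 1000                   # hypothetical 1000-sample signal
padded = zero_pad(signal)
print(len(padded))                      # length after padding
print(bin_spacing(48000.0, len(signal)), bin_spacing(48000.0, len(padded)))
```

For images the same padding would be applied per dimension. Keep in mind that padding with zeros introduces an edge where the data stops, so a window function or mirrored padding may be preferable depending on what the FFT is used for.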
OpenMM (https://simtk.org/home/openmm) contains a 3D FFT for OpenCL. It may not work for you directly, since it's designed for a specific case: 3D FFTs where each dimension is small enough to be stored in local memory (e.g. a 100x100x100 grid). But it does support non-power-of-two sizes (radix 2, 3, 4, and 5), so you might be able to adapt it.
VexCL provides an implementation of FFT for OpenCL that accepts arbitrary vector expressions as input, allows one to perform multidimensional transforms (of any number of dimensions), and supports arbitrary sized vectors. Here is a link to the relevant part of its README.
Have a look at the APPML-FFT library, though it is still only for power-of-two transforms.