OpenCL FFT lib for GPUs?

Is there a general FFT library available for running on the GPU using OpenCL? As far as I know, Apple's sample code for a power-of-two OpenCL FFT is the only such code available.
Does any such library exist for non-power-of-two transform sizes? If not, how easy or difficult would it be to modify the Apple OpenCL sample?
I am looking at image-processing applications with non-power-of-two transform sizes, and I will have to do a whole bunch of FFTs at once, i.e. a batched FFT.

Try clFFT, developed by AMD. It is aimed at AMD graphics cards, but should work on NVIDIA GPUs too. It can transform arrays whose lengths are products of the radices 2, 3, and 5; a host-code sketch is given after the links below.
https://github.com/clMathLibraries/clFFT
Python bindings are also available:
https://github.com/geggo/gpyfft
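Since the question mentions batched transforms, here is a minimal, hedged sketch (modeled on the example in clFFT's README) of how a batched 1D complex-to-complex plan could be set up through the C API. The length 960 = 2^6 * 3 * 5, the batch size of 128, and the use of the first GPU on the first platform are illustrative choices only, and error checking is omitted; treat it as a sketch rather than tested code.

```c
#include <stdlib.h>
#include <clFFT.h>   /* also pulls in the OpenCL headers */

int main(void)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };

    /* Plain OpenCL setup: first platform, first GPU device. */
    err = clGetPlatformIDs(1, &platform, NULL);
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    props[1] = (cl_context_properties)platform;
    cl_context ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* Initialise the clFFT library. */
    clfftSetupData fftSetup;
    err = clfftInitSetupData(&fftSetup);
    err = clfftSetup(&fftSetup);

    /* Transform geometry: length 960 = 2^6 * 3 * 5, 128 transforms per batch
     * (illustrative values only). */
    size_t lengths[1] = { 960 };
    size_t batch = 128;
    size_t bytes = lengths[0] * batch * 2 * sizeof(float); /* interleaved complex */

    float *host = calloc(lengths[0] * batch * 2, sizeof(float));
    /* ... fill host[] with your image rows here ... */

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);

    /* Create and configure a batched, in-place, single-precision 1D plan. */
    clfftPlanHandle plan;
    err = clfftCreateDefaultPlan(&plan, ctx, CLFFT_1D, lengths);
    err = clfftSetPlanPrecision(plan, CLFFT_SINGLE);
    err = clfftSetLayout(plan, CLFFT_COMPLEX_INTERLEAVED, CLFFT_COMPLEX_INTERLEAVED);
    err = clfftSetResultLocation(plan, CLFFT_INPLACE);
    err = clfftSetPlanBatchSize(plan, batch);
    err = clfftBakePlan(plan, 1, &queue, NULL, NULL);

    /* Run the whole batch in one call and read the result back. */
    err = clfftEnqueueTransform(plan, CLFFT_FORWARD, 1, &queue,
                                0, NULL, NULL, &buf, NULL, NULL);
    err = clFinish(queue);
    err = clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);

    /* Cleanup. */
    clfftDestroyPlan(&plan);
    clfftTeardown();
    clReleaseMemObject(buf);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    free(host);
    return 0;
}
```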

I know of an OpenCL FFT library that is currently under development, but they don't plan on having non-power-of-two transform sizes in the first release.
Can you provide any information about your application? It might help to get the priority for that feature raised if it's something a lot of people can use.

You can download some OpenCL code samples including FFT from the SHOC benchmark suite.

Zero-padding can be used to make arbitrary-length data fit a power-of-two FFT algorithm; consider whether that would suit your application.
Note that padding increases the number of output bins, so it decreases the bin spacing ("step size") in the frequency domain. This interpolates the spectrum rather than adding genuine resolution, since no new information is added, but for many applications that is perfectly acceptable.
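As an illustration, here is a small C helper (hypothetical names, not from any library mentioned here) that rounds a signal length up to the next power of two and zero-pads an interleaved complex buffer:

```c
#include <stdlib.h>
#include <string.h>

/* Round n up to the next power of two. */
static size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Copy an interleaved complex signal (re, im pairs) of length n into a
 * zero-padded buffer of length next_pow2(n). Caller frees the result. */
static float *zero_pad_complex(const float *x, size_t n, size_t *padded_n)
{
    size_t m = next_pow2(n);
    float *y = calloc(m * 2, sizeof *y);   /* zero-initialised */
    if (y)
        memcpy(y, x, n * 2 * sizeof *y);
    *padded_n = m;
    return y;
}
```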

OpenMM (https://simtk.org/home/openmm) contains a 3D FFT for OpenCL. It may not work for you directly, since it's designed for a specific case: 3D FFTs where each dimension is small enough to be stored in local memory (e.g. a 100x100x100 grid). But it does support non-power-of-two sizes (radix 2, 3, 4, and 5), so you might be able to adapt it.

VexCL provides an FFT implementation for OpenCL that accepts arbitrary vector expressions as input, supports multidimensional transforms (any number of dimensions), and works with arbitrarily sized vectors. Here is a link to the relevant part of its README.

Have a look at the APPML-FFT library, though it is still limited to power-of-two transforms.

Related

Simple OpenCL example in R with R code?

Is it possible to use OpenCL but with R code? I still don't have a good understanding of OpenCL and GPU programming. For example, suppose I have the following R code:
aaa <- function(x) mean(rnorm(1000000))
sapply(1:10, aaa)
I like that I can kind of use mclapply as a drop-in replacement for lapply. Is there a way to do that with OpenCL? Or to use OpenCL as a backend for mclapply? I'm guessing this is not possible, because I have not been able to find an example, so I have two questions:
Is this possible and if so can you give a complete example using my function aaa above?
If this is not possible, can you please explain why? I do not know much about GPU programming. I view GPUs as being just like CPUs, so why can't I run R code on them in parallel?
I would start by looking at the High Performance Computing CRAN task view, in particular the Parallel computing: GPUs section.
There are a number of packages listed there which take advantage of GPGPU for specific tasks that lend themselves to massive parallelisation (e.g. gputools, HiPLARM). Most of these use NVIDIA's own CUDA rather than OpenCL.
There is also a more generic OpenCL package, but it requires you to learn how to write OpenCL code yourself, and merely provides an interface to that code from R.
It isn't possible because GPUs work differently from CPUs, which means you can't give them the same instructions that you'd give a CPU.
Nvidia puts on a good show in this video describing the difference between CPU and GPU processing. Essentially the difference is that GPUs typically have, by orders of magnitude, more cores than CPUs, but each core is much simpler.
Your example is one that can be extended to GPU code because it is highly parallel.
Here's some code to create random numbers (although they aren't normally distributed) http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
Once you create the random numbers, you could break them into chunks, sum each chunk in parallel, and then add the chunk sums to get the overall total (see: Is it possible to run the sum computation in parallel in OpenCL?).
I realize that your code builds the random-number vector and takes its mean serially, and then parallelizes that operation 10 times; but with GPU processing, a mere 10 tasks isn't very efficient, since you'd leave most cores idle.
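For reference, a parallel sum on a GPU is normally written as a work-group tree reduction. The OpenCL C kernel below is a generic sketch (not tied to the MWC64X code linked above); it assumes the work-group size is a power of two, and the host (or a second pass) still has to add up the per-group partial sums.

```c
/* OpenCL C kernel: each work-group reduces one chunk of the input to a
 * single partial sum; the host (or a second pass) adds the partial sums. */
__kernel void partial_sum(__global const float *in,
                          __local float *scratch,
                          const unsigned int n,
                          __global float *partial)
{
    unsigned int gid = get_global_id(0);
    unsigned int lid = get_local_id(0);

    /* Load one element per work-item (0 if past the end of the data). */
    scratch[lid] = (gid < n) ? in[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction within the work-group. */
    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* First work-item writes this group's partial sum. */
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}
```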

Where/how can I obtain the computational features of a non-CUDA GPU?

(Please don't recommend a specific product or service kthxbye)
I'm considering getting a discrete OpenCL-oriented GPU (i.e. not NVIDIA). Now, with CUDA GPUs (i.e. NVIDIA...), you have the 'Compute Capability' figure, which you can easily translate into concrete compute-related features, but I can't seem to find anything comparable in the OpenCL world. You can find the overall bandwidth, or a sort-of-cooked figure for the maximum number of work-items executing in parallel (it's a cooked number because, for example, it doesn't tell you what you can do with each of those work-items between clock cycles; I could double the figure by doubling the number of cycles per op), but not the long, specific list of micro-features, which are mostly independent of the GPU's macro-features.
I'm interested in an answer regarding both integrated and discrete GPUs, and not just NVIDIA's contender AMD. It's specifically interesting to me to look at supposedly 'weak' GPUs, since I care more about the architecture than about how much I can actually crunch with it. (The sketch below shows the kind of coarse figures OpenCL does expose.)
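For concreteness, the coarse per-device figures OpenCL itself exposes come from clGetDeviceInfo; the sketch below (illustrative queries only, no error handling) shows the kind of macro-level information you can read out, as opposed to a CUDA-style compute-capability table.

```c
#include <stdio.h>
#include <CL/cl.h>

/* Print a few of the coarse, macro-level figures OpenCL exposes for the
 * first GPU device on the first platform (no error handling). */
int main(void)
{
    cl_platform_id platform;
    cl_device_id dev;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    char name[256];
    cl_uint cus, mhz;
    size_t wg;
    cl_ulong gmem, lmem;

    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof name, name, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof cus, &cus, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof wg, &wg, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof mhz, &mhz, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof gmem, &gmem, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE, sizeof lmem, &lmem, NULL);

    printf("%s: %u compute units @ %u MHz, max work-group %zu,\n"
           "global mem %llu MiB, local mem %llu KiB\n",
           name, cus, mhz, wg,
           (unsigned long long)(gmem >> 20), (unsigned long long)(lmem >> 10));
    return 0;
}
```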

Parallel arithmetic on large integers

Are there any software tools for performing arithmetic on very large numbers in parallel? What I mean by parallel is that I want to use all available cores on my computer for this.
The constraints are wide open for me. I don't mind trying any language or tech.
Please and thanks.
It seems like you are either dividing really huge numbers or using a suboptimal algorithm. Parallelizing across a fixed number of cores only tweaks the constant factors; it has no effect on the asymptotic behavior of your operation. And if you're talking about hours for a single division, asymptotic behavior is what matters most. So I suggest you first make sure your asymptotic complexity is as good as it can be, and only then start looking for ways to improve the constants, perhaps by parallelizing.
Wikipedia suggests Barrett division, and GMP has a variant of that. I'm not sure whether what you've tried so far is on a similar level, but unless you are sure that it is, I'd give GMP a try.
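If you haven't used GMP before, dividing two multi-thousand-bit numbers takes only a few lines; the operands below are arbitrary examples (link with -lgmp). Note that GMP's division itself runs on a single core, so this addresses the asymptotics, not the parallelism.

```c
#include <stdio.h>
#include <gmp.h>

int main(void)
{
    mpz_t a, b, q, r;
    mpz_inits(a, b, q, r, NULL);

    /* Build two large operands: a = 2^100000 - 1, b = 3^20000 (arbitrary). */
    mpz_ui_pow_ui(a, 2, 100000);
    mpz_sub_ui(a, a, 1);
    mpz_ui_pow_ui(b, 3, 20000);

    /* Quotient and remainder, truncated towards zero. */
    mpz_tdiv_qr(q, r, a, b);

    printf("quotient has %zu bits, remainder has %zu bits\n",
           mpz_sizeinbase(q, 2), mpz_sizeinbase(r, 2));

    mpz_clears(a, b, q, r, NULL);
    return 0;
}
```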
See also Parallel Modular Multiplication on Multi-core Processors for recent research. Haven't read into that myself, though.
The only GPU effort I am aware of is a CUDA library called CUMP. However, the library only provides support for addition, subtraction, and multiplication. You can still use multiplication to perform division on the GPU (e.g. via Newton-Raphson iteration on the reciprocal) and check whether the accuracy of the result is sufficient for your particular problem.

What algorithms do FPUs use to compute transcendental functions?

What methods would a modern FPU use to compute transcendental functions?
For example, Intel CPUs provide instructions such as FSIN, FCOS, FYL2X, etc. I am curious as to what algorithms would be used to actually implement these in hardware.
My naïve guess would be Taylor series perhaps combined with some lookup tables, but that's nothing more than a wild guess. Please enlighten me.
P.S. This question is more general than just Intel hardware.
One place to start could be "New Algorithms for Improved Transcendental Functions on IA-64" by Shane Story and Ping Tak Peter Tang, both from Intel. It probably doesn't have as many details as you might like, but it includes several references.
Update 08/13/2014
The original link is broken. IEEE's public abstract/citation page can be found here:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=762822&tag=1
In hardware (as well as in software where a hardware multiply instruction is not available), these functions are often implemented with CORDIC, since CORDIC requires only addition, subtraction, bit shifts, and a small lookup table.
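To make the CORDIC idea concrete, here is a small fixed-point, rotation-mode sketch in C for sin/cos on roughly [-pi/2, pi/2]. In hardware the arctangent table and the gain constant would be wired in as ROM constants; here they are precomputed with the math library purely for convenience, and the Q16.16 format and 16 iterations are arbitrary choices for the sketch.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define CORDIC_ITERS 16
#define FIX(x) ((int32_t)((x) * 65536.0))   /* Q16.16 fixed point */

/* Rotation-mode CORDIC: after the loop, x ~ cos(angle), y ~ sin(angle).
 * Valid for angles within roughly [-pi/2, pi/2]. */
void cordic_sincos(double angle, double *s, double *c)
{
    /* In hardware these would be ROM constants; computed lazily here. */
    static int32_t atan_tab[CORDIC_ITERS];
    static int32_t gain_inv = 0;
    if (gain_inv == 0) {
        double k = 1.0;
        for (int i = 0; i < CORDIC_ITERS; ++i) {
            atan_tab[i] = FIX(atan(ldexp(1.0, -i)));
            k /= sqrt(1.0 + ldexp(1.0, -2 * i));
        }
        gain_inv = FIX(k);          /* ~0.60725, compensates the CORDIC gain */
    }

    int32_t x = gain_inv, y = 0, z = FIX(angle);
    for (int i = 0; i < CORDIC_ITERS; ++i) {
        /* Only shifts, adds/subs and a table lookup per iteration. */
        int32_t dx = y >> i, dy = x >> i;
        if (z >= 0) { x -= dx; y += dy; z -= atan_tab[i]; }
        else        { x += dx; y -= dy; z += atan_tab[i]; }
    }
    *c = x / 65536.0;
    *s = y / 65536.0;
}

int main(void)
{
    double s, c;
    cordic_sincos(0.5, &s, &c);
    printf("cordic: sin=%.6f cos=%.6f   libm: sin=%.6f cos=%.6f\n",
           s, c, sin(0.5), cos(0.5));
    return 0;
}
```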
Related:
How does C compute sin() and other math functions?
How are sin and cos implemented hardware wise?
What algorithm is used by computers to calculate logarithms?
How do computers calculate sin values?
How does the computer calculate Square roots?

inverse FFT in shader language?

Does anyone know of an implementation of the inverse FFT in HLSL/GLSL/Cg...?
It would save me much work.
Best,
heinrich
Do you already have a forward FFT implementation? You may already be aware of this, but the inverse can be computed from the forward transform: keep element 0 in place, reverse the order of the remaining N-1 inputs (i.e. index k maps to (N-k) mod N), take the forward FFT of that, and divide the result by N.
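Here is a tiny C check of that identity using a deliberately naive O(N^2) DFT (illustration only; the test signal and N = 8 are arbitrary): reversing the spectrum, taking the forward transform again, and dividing by N reproduces the original samples.

```c
#include <complex.h>
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 8

/* Naive forward DFT: Y[k] = sum_n x[n] * exp(-2*pi*i*k*n/N). */
static void dft(const double complex *x, double complex *Y)
{
    for (int k = 0; k < N; ++k) {
        Y[k] = 0;
        for (int n = 0; n < N; ++n)
            Y[k] += x[n] * cexp(-2.0 * M_PI * I * k * n / N);
    }
}

int main(void)
{
    double complex x[N], X[N], rev[N], back[N];

    for (int n = 0; n < N; ++n)              /* some arbitrary test signal */
        x[n] = cos(2 * M_PI * n / N) + 0.5 * I * n;

    dft(x, X);                               /* forward transform */

    /* "Reverse" the spectrum: index k -> (N - k) mod N, i.e. element 0
     * stays put and the rest are reversed. */
    for (int k = 0; k < N; ++k)
        rev[k] = X[(N - k) % N];

    dft(rev, back);                          /* forward DFT again */

    for (int n = 0; n < N; ++n)              /* divide by N: should equal x */
        printf("%2d: % .4f%+.4fi  vs  % .4f%+.4fi\n", n,
               creal(back[n] / N), cimag(back[n] / N),
               creal(x[n]), cimag(x[n]));
    return 0;
}
```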
DirectX 11 comes with an FFT example for compute shaders (see the DX11 August SDK release notes). As PereAllenWebb points out, this can also be used for the inverse FFT.
Edit: If you just want a fast FFT, you could try CUFFT, which runs on the GPU and is part of the CUDA SDK. AMD's ACML also has an FFT, which is currently not GPU-accelerated, but that will likely be added soon.
I implemented a 1D FFT on 7800GTX hardware back in 2005. This was before CUDA etc so I had to resort to using Cg and manually implementing the FFT.
I have two FFT implementations. One is a radix-2 decimation-in-time FFT and the other a Stockham autosort FFT. The Stockham version performed around 2-4x faster than a CPU (at the time, a 3 GHz single-core P4) for larger sizes (> 8192), but for smaller sizes the CPU was faster, since it doesn't have to shift data to/from the GPU.
If you're interested in the shader code feel free to contact me and I'll send it over by email. It was from a personal project so not covered by any commercial copyright. I would imagine that CUDA (and similar) implementations would massively outperform my implementation, however from a learning perspective you can't get better than to write or study the code yourself!
Maybe you could take a look at OpenCL, which is a standard for general-purpose computing on graphics (and other) hardware.
The Wikipedia article contains an OpenCL example of a standard FFT:
http://en.wikipedia.org/wiki/OpenCL#Example
If you are on a Mac with OS X 10.6, you just need to install the developer tools to get started with OpenCL development.
I also heard that hardware vendors already provide basic OpenCL driver support on Windows.
