How to count instructions in CUDA (or instructions per cycle)?

I'm doing research on GPGPU.
The purpose of the research is to measure IPC (instructions per cycle).
However, I can't use the Nsight tool (I have it, but I can't use it), and I am programming only in a Linux terminal.
I can measure the clock cycles by adding the clock() function, but I can't measure the instruction count.
How do I get the number of instructions executed when running a CUDA program?
Thank you.

Related

Profiling valid for parallel efficiency study?

I have been puzzled by the following matter:
I am trying to check the weak scaling of an in-house parallel Fortran code. Initially I tried to use the time command, but it reported real times significantly higher than the sys+user times. So I ended up using gprof to do the timing (although it may slow down the execution).
Is gprof a valid approach for benchmarking parallel efficiency (granted that it is not the ideal approach)?
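For illustration only, here is a minimal Python sketch of the two quantities in the question: wall-clock ("real") time versus CPU time, and the usual weak-scaling efficiency T(1)/T(N). The function names and the timings in the demo are hypothetical, not taken from the Fortran code. A real time well above the CPU time generally means the process spent time waiting (I/O, communication, sleeping) rather than computing.

```python
import time


def weak_scaling_efficiency(t_one, t_n):
    """Weak scaling: work per process is fixed, so ideally t_n equals t_one."""
    return t_one / t_n


def timed(fn):
    """Run fn once and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    # Sleeping consumes wall-clock time but almost no CPU time.
    wall, cpu = timed(lambda: time.sleep(0.2))
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s")

    # Hypothetical wall-clock timings: 1 process vs. N processes.
    print(weak_scaling_efficiency(10.0, 12.5))
```

For a scaling study the wall-clock time is normally the quantity of interest, since it includes the waiting that parallel overhead causes.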

Shorten the time of Qt make process

I ran make after completing all the previous steps correctly, and it takes too long to build all the libraries and tools. Is it possible to shorten that time, for example by leaving out unnecessary libraries and tools? It looks as if the make time goes to infinity...
Try running make with parallel processes:
make -jN
where N is the number of parallel processes you want, for example the number of CPUs in your computer.
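As a rough sketch, the job count can be derived from the machine instead of typed by hand. The helper name make_command is hypothetical; it only builds the command line and does not run the build itself.

```python
import os


def make_command(jobs=None):
    """Build a 'make -jN' command line; N defaults to the CPU count."""
    if jobs is None:
        # os.cpu_count() may return None on unusual platforms.
        jobs = os.cpu_count() or 1
    return ["make", f"-j{jobs}"]


if __name__ == "__main__":
    print(" ".join(make_command()))
    # To actually start the build from the source directory:
    #   import subprocess
    #   subprocess.run(make_command(), check=True)
```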

Simple OpenCL example in R with R code?

Is it possible to use OpenCL but with R code? I still don't have a good understanding of OpenCL and GPU programming. For example, suppose I have the following R code:
aaa <- function(x) mean(rnorm(1000000))
sapply(1:10, aaa)
I like that I can more or less use mclapply as a drop-in replacement for lapply. Is there a way to do that for OpenCL? Or to use OpenCL as a backend for mclapply? I'm guessing this is not possible because I have not been able to find an example, so I have two questions:
Is this possible, and if so, can you give a complete example using my function aaa above?
If this is not possible, can you please explain why? I do not know much about GPU programming. I view GPUs as being just like CPUs, so why can't I run R code in parallel on them?
I would start by looking at the High Performance Computing CRAN task view, in particular the Parallel computing: GPUs section.
There are a number of packages listed there which take advantage of GPGPU for specific tasks that lend themselves to massive parallelisation (e.g. gputools, HiPLARM). Most of these use NVIDIA's own CUDA rather than OpenCL.
There is also a more generic OpenCL package, but it requires you to learn how to write OpenCL code yourself, and merely provides an interface to that code from R.
It isn't possible because GPUs work differently from CPUs, which means you can't give them the same instructions that you'd give a CPU.
Nvidia puts on a good show in this video describing the difference between CPU and GPU processing. Essentially, the difference is that GPUs typically have orders of magnitude more cores than CPUs.
Your example is one that can be extended to GPU code because it is highly parallel.
Here's some code to create random numbers (although they aren't normally distributed): http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
Once you have created the random numbers, you could break them into chunks, sum each of the chunks in parallel, and then add the chunk sums to get the overall sum (see the question "Is it possible to run the sum computation in parallel in OpenCL?").
I realize that your code would build the random number vector and its sum serially and parallelize that operation 10 times, but with GPU processing, having a mere 10 tasks isn't very efficient since you'd leave so many cores idle.
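The chunk-and-combine idea described in this answer can be sketched on the CPU as follows. This is only an illustration of the structure, not OpenCL code: the names chunks and chunked_sum are hypothetical, and Python threads are used merely to show the shape of the reduction (they do not give a real speedup for this kind of arithmetic in CPython).

```python
import random
from concurrent.futures import ThreadPoolExecutor


def chunks(values, size):
    """Yield consecutive slices of at most `size` elements."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def chunked_sum(values, chunk_size, workers=4):
    """Sum each chunk independently, then add the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks(values, chunk_size)))
    return sum(partial_sums)


if __name__ == "__main__":
    # Same idea as mean(rnorm(n)) in the question, with a smaller n.
    n = 100_000
    samples = [random.gauss(0.0, 1.0) for _ in range(n)]
    print(chunked_sum(samples, chunk_size=4096) / n)
```

On a GPU, each chunk would map to a work-group and the final addition of partial sums would be a second, much smaller reduction step.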

Where/how can I obtain the computational features of a non-CUDA GPU?

(Please don't recommend a specific product or service kthxbye)
I'm considering getting a discrete OpenCL-oriented GPU (i.e. not NVIDIA). Now, with CUDA GPUs (i.e. NVIDIA...), you have the 'Compute Capability' figure, which you can easily translate into concrete compute-related features, but I can't seem to find anything comparable in the OpenCL world. You can find overall bandwidth, or a sort-of-cooked figure for the maximum number of work-items executing in parallel (it's cooked because, for example, it doesn't tell you what each of these can do between clock cycles; I could double the figure by doubling the number of cycles per op), but not the very long and specific set of micro-features (which are mostly independent of the GPU's macro-features).
I'm interested in an answer regarding both integrated and discrete, and not just in NVIDIA's contender AMD. It's specifically interesting for me to look at supposedly 'weak' GPUs since I care more about the architecture than how much I can actually crunch with it.

inverse FFT in shader language?

Does anyone know of an implementation of the inverse FFT in HLSL/GLSL/Cg?
It would save me much work.
Best,
heinrich
Do you already have an FFT implementation? You may already be aware, but the inverse can be computed by reversing the order of the N inputs (index reversal modulo N, so element 0 stays in place and elements 1 to N-1 are reversed), taking the FFT over those, and dividing the result by N.
DirectX11 comes with an FFT example for compute shaders (see DX11 August SDK Release Notes). As PereAllenWebb points out, this can also be used for the inverse FFT.
Edit: If you just want a fast FFT, you could try CUFFT, which runs on the GPU. It's part of the CUDA SDK. The ACML from AMD also has an FFT, which is currently not GPU accelerated, but this will likely be added soon.
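The trick in this answer can be checked with a small sketch. It uses a naive O(N^2) DFT in place of a real FFT so that it stays self-contained; the function names are hypothetical. One detail matters in practice: the reversal is taken modulo N, meaning index k maps to (-k) mod N, so element 0 stays where it is. A plain end-to-end reversal would leave a phase factor in the result.

```python
import cmath


def dft(x):
    """Naive forward DFT: X[j] = sum_k x[k] * exp(-2*pi*i*j*k/N)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def idft_via_dft(spectrum):
    """Inverse DFT using only the forward transform."""
    n = len(spectrum)
    # Index reversal modulo N: position k takes the value at (-k) mod N.
    reversed_mod_n = [spectrum[(-k) % n] for k in range(n)]
    # Forward transform of the reversed input, then scale by 1/N.
    return [value / n for value in dft(reversed_mod_n)]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 0.5, -1.0]
    recovered = idft_via_dft(dft(signal))
    print(max(abs(a - b) for a, b in zip(signal, recovered)))
```

The same index mapping applies when the forward transform is a shader-based FFT: only the input ordering and the final scaling change, not the FFT kernel itself.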
I implemented a 1D FFT on 7800GTX hardware back in 2005. This was before CUDA etc so I had to resort to using Cg and manually implementing the FFT.
I have two FFT implementations. One is a radix-2 decimation-in-time FFT and the other a Stockham autosort FFT. The Stockham version would perform around 2-4x faster than a CPU (at the time, a 3GHz single-core P4) for larger sizes (> 8192), but for smaller sizes the CPU was faster, as it doesn't have to move data to and from the GPU.
If you're interested in the shader code feel free to contact me and I'll send it over by email. It was from a personal project so not covered by any commercial copyright. I would imagine that CUDA (and similar) implementations would massively outperform my implementation, however from a learning perspective you can't get better than to write or study the code yourself!
Maybe you could take a look at OpenCL which is a standard for general purpose computing on graphics (and other) hardware.
The Wikipedia article contains an OpenCL example for a standard FFT:
http://en.wikipedia.org/wiki/OpenCL#Example
If you are on a Mac with OS X 10.6, you just need to install the developer tools to get started with OpenCL development.
I also heard that hardware vendors already provide basic OpenCL driver support on Windows.

Resources