How would I write argmin or argmax with PyOpenCL? I figure I would need to calculate the argmin/min for each workgroup, and then reduce these using subsequent invocations.
Adapt this to collect the minimum and its location rather than just the minimum.
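The two-stage idea can be sketched in plain Python before moving it to the device. This is only an emulation of what each work-group would compute, not PyOpenCL code; the names `group_argmin`, `argmin_two_stage` and the `group_size` parameter are hypothetical and chosen for illustration.

```python
def group_argmin(values, start, stop):
    """Return (min_value, index) for values[start:stop], as one work-group would."""
    best_val, best_idx = values[start], start
    for i in range(start + 1, stop):
        if values[i] < best_val:
            best_val, best_idx = values[i], i
    return best_val, best_idx


def argmin_two_stage(values, group_size=4):
    """Stage 1: one (value, index) pair per group. Stage 2: reduce the pairs."""
    if not values:
        raise ValueError("argmin of an empty sequence")
    partials = [
        group_argmin(values, s, min(s + group_size, len(values)))
        for s in range(0, len(values), group_size)
    ]
    # Tuples compare by value first, then by index, so ties pick the lowest index.
    return min(partials)
```

The point of carrying the index along with the value is that the second stage needs no access to the original data: it only reduces the small list of per-group pairs.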
I have a computation loop with adaptive time stepping and I need to store the results at each iteration. In other words, I do not know the vector size before the computation, so I cannot preallocate a vector to store the data. Right now, I build a vector using the push! function:
function comp_loop(model_time)
    run_time = 0.0   # elapsed simulation time
    clock = [0.0]
    data = [0.0]
    while run_time < model_time
        # Calculate new timestep
        timestep = rand(Float64) # Be sure to add Random
        run_time += timestep
        # Build vector
        push!(clock, run_time)
        push!(data, timestep)
    end
    return clock, data
end
Is there a more efficient way to go about this? Again, I know the optimal choice is to preallocate the vector, but I do not have that luxury available. Buffers are theoretically not an option either, as I don't know how large to make them. I'm looking for something more "optimal" on how to implement this in Julia (i.e. maybe some advanced feature available in the language).
Theoretically, you can use a linked list such as the one from DataStructures.jl to get O(1) appending. And then optionally write that out to a Vector afterwards (probably in reverse order, though).
In practice, push!ing to a Vector is often efficient enough: Vectors use a doubling strategy to manage their dynamic size, which leads to amortized constant time per append and the advantage of contiguous memory access.
So you could try the linked list, but be sure to benchmark whether it's worth the effort.
Now, the above is about time complexity. When you care about allocation, the argument is quite similar; with a vector, you are going to end up with memory proportional to the next power of two after your actual requirements. Most often, that can be considered amortized, too.
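To make the amortization argument concrete, here is a small sketch that counts how many element copies a growable array performs. It assumes a plain doubling policy starting from capacity 1; the function name `simulate_appends` is hypothetical, and the growth policy of a real Vector may differ in detail.

```python
def simulate_appends(n, growth=2):
    """Count element copies made while appending n items to a growable array."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            capacity *= growth   # allocate a larger buffer...
            copies += size       # ...and copy the existing elements over
        size += 1
    return capacity, copies


capacity, copies = simulate_appends(1000)
print(capacity, copies)
```

Under this assumption the total number of copies stays below twice the number of appends, and the final capacity is the next power of two at or above the element count.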
I am doing a learning exercise. I want to calculate the number of primes in a range from 0 to N. Which MPI function can I use to distribute ranges of numbers to each process? In other words, each process calculates the number of primes within a sub-range of the main range.
You could simply use a for loop and MPI_Send on the root (and MPI_Recv on receivers) to send to each process the number at which it should start and how many numbers it should check.
Another possibility, even better, is to send N to each process with MPI_Bcast (on root and receivers) and let each process compute which numbers it should check using its own rank (something like start = N / size * rank and length = N / size, where size and rank come from MPI_Comm_size and MPI_Comm_rank, plus some adequate rounding, etc.).
You can probably optimize load balancing even more but you should get it working first.
At the end you should call MPI_Reduce with a sum.
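The rank-based split, including the rounding, might look like the sketch below. It is plain Python that loops over the ranks in one process instead of calling MPI; `block_range`, `is_prime` and `count_primes` are hypothetical helper names, and the range is assumed to be 0..n-1.

```python
def block_range(n, size, rank):
    """Return (start, length) of the block of 0..n-1 owned by `rank` out of `size`."""
    base, extra = divmod(n, size)
    # The first `extra` ranks take one additional number each.
    start = rank * base + min(rank, extra)
    length = base + (1 if rank < extra else 0)
    return start, length


def is_prime(k):
    """Trial division; fine for a learning exercise."""
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True


def count_primes(start, length):
    """Local work of one process: count primes in [start, start + length)."""
    return sum(1 for k in range(start, start + length) if is_prime(k))


# Stand-in for MPI_Reduce with a sum: add up the per-rank counts.
size = 3
total = sum(count_primes(*block_range(100, size, rank)) for rank in range(size))
```

In a real MPI program each process would evaluate `block_range` with its own rank, and the final sum would be done by the reduction rather than by a loop.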
I'm new to OpenCL and I'm trying to implement the power iteration method (described over here)
matrix sizes over 100000x100000!
Actually I have no idea how to implement this.
That's because a work-group is limited by CL_DEVICE_MAX_WORK_GROUP_SIZE (so I can't make one work-group with 1000000 work-items)
But on each iteration step I need to synchronize and normalize the vector.
1) So is it possible to make all calculations inside one kernel? (I think the answer is no if the matrix size is more than CL_DEVICE_MAX_WORK_GROUP_SIZE)
2) Can I make a "while" loop in the host code? And is it still profitable to use the GPU in this case?
something like:
while (condition)
{
kernel calling
synchronization
}
2: Yes, you can make a while loop in host code. Whether this is still profitable in terms of performance depends on whether the kernel that is called achieves a good speedup. My personal preference is not to pack too much logic into a single kernel, because smaller kernels are easier to maintain and sometimes easier to optimize. But of course, invoking a kernel has a (small) overhead that has to be taken into account. And whether combining two kernels into one can bring a speedup (or new potential for optimizations) depends on what the kernels are actually doing. But in this case (matrix multiplication and vector normalization) I'd personally start with two different kernels that are invoked from the host in a while-loop.
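The structure of such a host loop can be sketched in plain Python, with two ordinary functions standing in for the two kernels. This is not OpenCL code; `matvec`, `normalize`, `power_iteration`, `tol` and `max_iter` are hypothetical names, and the sketch assumes a dense matrix with a positive dominant eigenvalue.

```python
import math


def matvec(a, x):
    """Stand-in for kernel 1: dense matrix-vector product."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def normalize(x):
    """Stand-in for kernel 2: scale x to unit length, also return its norm."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x], norm


def power_iteration(a, tol=1e-12, max_iter=1000):
    x, _ = normalize([1.0] * len(a))
    eigenvalue = 0.0
    iterations = 0
    while iterations < max_iter:       # the host-side while loop
        y = matvec(a, x)               # "kernel" 1
        x, norm = normalize(y)         # "kernel" 2
        iterations += 1
        converged = abs(norm - eigenvalue) < tol
        eigenvalue = norm              # the norm of A*x estimates the eigenvalue
        if converged:
            break
    return eigenvalue, x
```

The only value the host has to inspect per iteration is the scalar norm, so the vector itself could stay on the device between the two calls.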
1: Since a 100000x100000 matrix with float values will take at least 40GB of memory, you'll have to think about the approach in general anyhow. There is a vast amount of literature on matrix operations, their parallelization, and the corresponding implementations on the GPU. One important aspect from the "high level" point of view is whether the matrices are dense or sparse ( http://en.wikipedia.org/wiki/Sparse_matrix ). Depending on the sparsity, it might even be possible to handle 100000x100000 matrices in main memory. Apart from that, you might consider having a look at a library for matrix operations (e.g. http://viennacl.sourceforge.net/ ) because implementing an efficient matrix multiplication is challenging, particularly for sparse matrices.
But if you want to go the whole way on your own: Good luck ;-) and ... the CL_DEVICE_MAX_WORK_GROUP_SIZE imposes no limitation on the problem size. In fact, the problem size (that is, the total number of work-items) in OpenCL is virtually infinitely large. If your CL_DEVICE_MAX_WORK_GROUP_SIZE is 256, and you want to handle 10000000000 elements, then you create 10000000000/256 work groups and let OpenCL take care of how they are actually dispatched and executed. For matrix operations, the CL_DEVICE_MAX_WORK_GROUP_SIZE is primarily relevant when you want to use local memory (and you will have to, in order to achieve good performance): the size of the work groups thus implicitly defines how large your chunks of local memory may be.
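The work-group arithmetic from the answer can be written down as a tiny helper. This is a plain Python sketch of the host-side calculation only; `launch_dimensions` is a hypothetical name. It rounds the global size up to a multiple of the work-group size, which OpenCL 1.x requires, so a kernel launched this way would typically guard against indices past the real element count.

```python
def launch_dimensions(num_elements, work_group_size):
    """Return (num_groups, global_size) covering num_elements work-items."""
    num_groups = -(-num_elements // work_group_size)  # ceiling division
    global_size = num_groups * work_group_size        # padded to a full group
    return num_groups, global_size


print(launch_dimensions(10_000_000_000, 256))
print(launch_dimensions(1000, 256))
```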
Can I get the maximum number of warps/work-groups on one compute unit through some function like clGetDeviceInfo?
From what I've found, the number depends only on the compute capability. So is there any function that can detect it?
thx
jikra
I think you are looking for clGetKernelWorkGroupInfo.
Specifically, CL_KERNEL_WORK_GROUP_SIZE and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will help you tune your work group sizes.
I want to write an app to transpose the key a wav file plays in (for fun, I know there are apps that already do this)... my main understanding of how this might be accomplished is to
1) chop the audio file into very small blocks (say 1/10 of a second)
2) run an FFT on each block
3) phase shift the frequency space up or down depending on what key I want
4) use an inverse FFT to return each block to the time domain
5) glue all the blocks together
But now I'm wondering if the transformed blocks would no longer be continuous when I try to glue them back together. Are there ideas how I should do this to guarantee continuity, or am I just worrying about nothing?
Overlap the time samples for each block by half so that each block after the first consists of the last N/2 samples from the previous block and N/2 new samples. Be sure to apply some window to the samples before the transform.
After shifting the frequency, perform an inverse FFT and use the middle N/2 samples from each block. You'll need to adjust the final gain after the IFFT.
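Why half-overlapped, windowed blocks glue back together without seams can be checked with a short sketch. Note that this uses the windowed overlap-add variant, where the overlapping blocks are summed, rather than keeping only the middle N/2 samples as described above. It assumes a periodic Hann window and an even block size; `hann`, `overlap_add` and the `process` callback are hypothetical names, and `process` is where the FFT, frequency shift and inverse FFT would go.

```python
import math


def hann(n):
    """Periodic Hann window; copies shifted by n/2 sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def overlap_add(signal, block_size, process=lambda block: block):
    """Window half-overlapping blocks, process each one, and sum the results."""
    hop = block_size // 2
    window = hann(block_size)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - block_size + 1, hop):
        block = [signal[start + i] * window[i] for i in range(block_size)]
        block = process(block)  # e.g. FFT, shift, inverse FFT
        for i in range(block_size):
            out[start + i] += block[i]
    return out


signal = [math.sin(0.1 * i) for i in range(64)]
rebuilt = overlap_add(signal, 16)
```

With the identity `process`, every sample covered by two blocks is reconstructed exactly, because the two window weights at that position add up to 1. Only the first and last half block, which are covered once, come out attenuated.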
Of course, mixing the time samples with a sine wave and then low pass filtering will provide the same shift in the time domain as well. The frequency of the mixer would be the desired frequency difference.
For speech you might want to look at PSOLA - this is a popular algorithm for pitch-shifting and/or time stretching/compression which is a little more sophisticated than the basic overlap-add method, but not much more complex.
If you need to process non-speech samples, e.g. music, then there are several possibilities; however, the overlap-add FFT/modify/IFFT approach mentioned in other answers is probably the best bet.
Found this great article on the subject, for anyone trying it in the future!
You may have to find a zero-crossing between the blocks to glue the individual wavs back together. Otherwise you may find that you are getting clicks or pops between the blocks.