Effect of Work Items on Computation - opencl

I know that the number of Work Items affect the run time but should I expect to have slightly different results by using different number of work items while running a kernel many times? Thanks

If you get different results, then you have a bug in your code. If you just do norm+=element*element; for a global variable norm, what will happen is that some threads do this at the same time and part of your norm gets lost.
To make things simple, this even happens when two work items increase the same counter, a++. Both will read the current value of a, both will increase it to a+1 and both will write a+1 back to the register. So you end up with a+1 instead of a+2. What happens is similar for your vector.

Related

Flipping coin simulation with gain/loss

Suppose you successively toss a fair coin and each time the result is
heads, you win $1, while if you get tails you lose 1$. Your initial capital is
3$. The throws stop if your capital is zeroed or you reach 10$. Let X_n be the
process that describes your chapter during the nth throw.
Simulate the X_n process 1000 times and present the graph
of its evolution through R.
2. Estimate the average number of consecutive throws until you stop. Is the result expected?
Can someone help me solve this or at least understand the steps I am supposed to take?
Someone already posted a link to a solution of your homework in the comments. I fear, however, that this uncommented code is incomprehensive for you, given that you have asked the question in the first place.
I would therefore suggest to first write your own implementation with an outer for loop and an inner while loop conditioned upon the running capital, call rbinom in each run and recompute the running capital. Store the resulting runs in a numeric vector and call mean on this vector.
It will start becoming interesting when you measure the runtime of your solution, which will be surprisingly slow. To speed it up, you must use "vectorization", which the linked to solution uses, but this is a completely different topic to be left for a different lesson...

Parallel iteration over array with step size greater than 1

I'm working on a practice program for doing belief propagation stereo vision. The relevant aspect of that here is that I have a fairly long array representing every pixel in an image, and want to carry out an operation on every second entry in the array at each iteration of a for loop - first one half of the entries, and then at the next iteration the other half (this comes from an optimisation described by Felzenswalb & Huttenlocher in their 2006 paper 'Efficient belief propagation for early vision'.) So, you could see it as having an outer for loop which runs a number of times, and for each iteration of that loop I iterate over half of the entries in the array.
I would like to parallelise the operation of iterating over the array like this, since I believe it would be thread-safe to do so, and of course potentially faster. The operation involved updates values inside the data structures representing the neighbouring pixels, which are not themselves used in a given iteration of the outer loop. Originally I just iterated over the entire array in one go, which meant that it was fairly trivial to carry this out - all I needed to do was put .Parallel between Array and .iteri. Changing to operating on every second array entry is trickier, however.
To make the change from simply iterating over every entry, I from Array.iteri (fun i p -> ... to using for i in startIndex..2..(ArrayLength - 1) do, where startIndex is either 1 or 0 depending on which one I used last (controlled by toggling a boolean). This means though that I can't simply use the really nice .Parallel to make things run in parallel.
I haven't been able to find anything specific about how to implement a parallel for loop in .NET which has a step size greater than 1. The best I could find was a paragraph in an old MSDN document on parallel programming in .NET, but that paragraph only makes a vague statement about transforming an index inside a loop body. I do not understand what is meant there.
I looked at Parallel.For and Parallel.ForEach, as well as creating a custom partitioner, but none of those seemed to include options for changing the step size.
The other option that occurred to me was to use a sequence expression such as
let getOddOrEvenArrayEntries myarray oddOrEven =
seq {
let startingIndex =
if oddOrEven then
1
else
0
for i in startingIndex..2..(Array.length myarray- 1) do
yield (i, myarray.[i])
}
and then using PSeq.iteri from ParallelSeq, but I'm not sure whether it will work correctly with .NET Core 2.2. (Note that, currently at least, I need to know the index of the given element in the array, as it is used as the index into another array during the processing).
How can I go about iterating over every second element of an array in parallel? I.e. iterating over an array using a step size greater than 1?
You could try PSeq.mapi which provides not only a sequence item as a parameter but also the index of an item.
Here's a small example
let res = nums
|> PSeq.mapi(fun index item -> if index % 2 = 0 then item else item + 1)
You can also have a look over this sampling snippet. Just be sure to substitute Seq with PSeq

Iterating results back into an OpenCL kernel

I have written an openCL kernel that takes 25million points and checks them relative to two lines, (A & B). It then outputs two lists; i.e. set A of all of the points found to be beyond line A, and vice versa.
I'd like to run the kernel repeatedly, updating the input points with each of the line results sets in turn (and also updating the checking line). I'm guessing that reading the two result sets out of the kernel, forming them into arrays and then passing them back in one at a time as inputs is quite a slow solution.
As an alternative, I've tested keeping a global index in the kernel that logs which points relate to which line. This is updated at each line checking cycle. During each iteration, the index for each point in the overall set is switched to 0 (no line), A or B or so forth (i.e. the related line id). In subsequent iterations only points with an index that matches the 'live' set being checked in that cycle (i.e. tagged with A for set A) are tested further.
The problem is that, in each iteration, the kernels still have to check through the full index (i.e. all 25m points) to discover wether or not they are in the 'live' set. As a result, the speed of each cycle does not significantly improve as the size of the results set decrease over time. Again, this seems a slow solution; whilst avoiding passing too much information between GPU and CPU it instead means that a large number of the work items aren't doing very much work at all.
Is there an alternative solution to what I am trying to do here?
You could use atomics to sort the outputs into two arrays. Ie if we're in A then get my position by incrementing the A counter and put me into A, and do the same for B
Using global atomics on everything might be horribly slow (fast on amd, slow on nvidia, no idea about other devices) - instead you can use a local atomic_inc in a 0'd local integer to do exactly the same thing (but for only the local set of x work-items), and then at the end do an atomic_add to both global counters based on your local counters
To put this more clearly in code (my explanation is not great)
int id;
if(is_a)
id = atomic_inc(&local_a);
else
id = atomic_inc(&local_b);
barrier(CLK_LOCAL_MEM_FENCE);
__local int a_base, b_base;
int lid = get_local_id(0);
if(lid == 0)
{
a_base = atomic_add(a_counter, local_a);
b_base = atomic_add(b_counter, local_b);
}
barrier(CLK_LOCAL_MEM_FENCE);
if(is_a)
a_buffer[id + a_base] = data;
else
b_buffer[id + b_base] = data;
This involves faffing around with atomics which are inherently slow, but depending on how quickly your dataset reduces it might be much faster. Additionally if B data is not considered live, you can omit getting the b ids and all the atomics involving b, as well as the write back

Product of range in Prolog

I need to write a program, which calculates product of product in range:
I written the following code:
mult(N,N,R,R).
mult(N,Nt,R,Rt):-N1=Nt+1,R1=Rt*(1/(log(Nt))),mult(N,N1,R,R1).
This should implement basic product from Nt to N of 1/ln(j). As far as I understand it's got to be stopped when Nt and N are equal. However, I can't get it working due to:
?- mult(10,2,R,1), write(R).
ERROR: Out of global stack
The following error. Is there any other way to implement loop not using default libraries of SWI-Prolog?
Your program never terminates! To see this consider the following failure-slice of your program:
mult(N,N,R,R) :- false.
mult(N,Nt,R,Rt):-
N1=Nt+1,
R1=Rt*(1/(log(Nt))),
mult(N,N1,R,R1), false.
This new program does never terminate, and thus the original program doesn't terminate. To see that this never terminates, consider the two (=)/2 goals. In the first, the new variable N1 is unified with something. This will always succeed. Similarly, the second goal with always succeed. There will never be a possibility for failure prior to the recursive goal. And thus, this program never terminates.
You need to add some goal, or to replace existing goals. in the visible part. Maybe add
N > Nt.
Further, it might be a good idea to replace the two (=)/2 goals by (is)/2. But this is not required for termination, strictly speaking.
Out of global stack means you entered a too-long chain of recursion, possibly an infinite one.
The problem stems from using = instead of is in your assignments.
mult(N,N,R,R).
mult(N,Nt,R,Rt):-N1 is Nt+1, R1 is Rt*(1/(log(Nt))), mult(N,N1,R,R1).
You might want to insert a cut in your first clause to avoid going on after getting an answer.
If you have a graphical debugger (like the one in SWI) try setting 'trace' and 'debug' on and running. You'll soon realize that performing N1 = Nt+1 giving Ntas 2 yields the term 2+1. Since 2+1+1+1+(...) will never unify with with 10, that's the problem right there.

R: reference iteration number in call to sfLapply(1:N, function(x))

Is it possible to reference the iteration number in a sfLapply call as follows -
wrapper <- function(a) {
y.mat <- data.frame(get(foo[i,1]), get(foo[i,2]))
...
...
do other things....
}
results <- sfLapply(1:200000, wrapper)
Where i is the iteration number as sfLapply cycles through 1:200000.
The problem I am faced with is that I have over 200,000 cases to test, with each case requiring the construction of a data.frame to which various operations will be performed.
I have a 2 Ghz Intel Core 2 Duo processor (macbook laptop) and so I began to investigate the snowfall package to take advantage of parallel processing. This led me to sfLapply and so I started to investigate whether I could re-write my code to work with lapply(). However, I have yet to come across examples that reference the iteration number in lappy() calls.
Maybe I am heading in the wrong direction. If anyone has any suggestions I would be greatly appreciative.
You're not using parameter a in the code to wrapper. All the numbers from 1:200000 will be passed to wrapper, so it is this a that represents your iteration (instead of i).
Don't forget, though, that these will not appear in order (courtesy of sfLapply).
As far as I know, there is no way of knowing the how manyth iteration your going into, as the different processes don't know what the others are doing.

Resources