According to the specs:
If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier.
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.
In my understanding this means that, in any kernel:
if (0 == (get_local_id(0) % 2)) {
    //block a
    work_group_barrier(CLK_GLOBAL_MEM_FENCE);
    //block a part 2
} else {
    //block b
    work_group_barrier(CLK_GLOBAL_MEM_FENCE);
    //block b part 2
}
//common operations
Should one work-item reach //block a, every other work-item needs to reach it too.
By this logic it is not possible to correctly synchronize every odd local work-item with every even one (blocks a and b) so that they run at the same time.
Is this understanding correct?
What would be a good synchronization strategy for a situation like this,
where every other work-item needs to run different logic, but by //block a part 2 and //block b part 2 the work-items inside one work-group should be synced up?
In the actual use case there are more than two phases, and I'd like every phase to stay synchronized.
Would a logic like this be an acceptable solution?
__local int number_finished; // note: __local variables cannot have an initializer;
                             // one work-item must set this to 0, followed by a barrier
if (0 == (get_local_id(0) % 2)) {
    //block a
    atomic_add(&number_finished, 1);
    while (number_finished < get_local_size(0));
    //block a part 2
} else {
    //block b
    atomic_add(&number_finished, 1);
    while (number_finished < get_local_size(0));
    //block b part 2
}
work_group_barrier(CLK_GLOBAL_MEM_FENCE);
//common operations
By this logic it is not possible to correctly synchronize every odd local work-item with every even one (blocks a and b) so that they run at the same time.
Is this understanding correct?
It is not possible. When using work_group_barrier, you must ensure that all work-items in the work-group reach it. If some execution paths in your code do not reach the barrier, the behavior is undefined.
What would be a good synchronization strategy for a situation like this?
Usually, barriers are used outside of any conditional sections or loops:
if (cond)
{
    // all work items perform work that needs to be synchronized later,
    // such as local memory writes
}
barrier(CLK_LOCAL_MEM_FENCE); // note the CLK_LOCAL_MEM_FENCE flag
// now every work item can read the data the other work items wrote before the barrier.
for (...)
{
}
barrier(CLK_LOCAL_MEM_FENCE);
Would a logic like this be an acceptable solution?
It might work, but a barrier outside the conditional section would be more efficient than a busy wait.
if (0 == (get_local_id(0) % 2)) {
    //block a
} else {
    //block b
}
barrier(CLK_LOCAL_MEM_FENCE); // ensures number_finished == get_local_size(0)
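Since the question mentions more than two phases: the same pattern extends naturally by keeping the divergent work inside the conditional and ending each phase with one barrier that every work-item reaches. A minimal sketch, where NUM_PHASES is a placeholder for the actual phase count:
for (int phase = 0; phase < NUM_PHASES; phase++) {
    if (0 == (get_local_id(0) % 2)) {
        // even work-items: phase-specific logic
    } else {
        // odd work-items: phase-specific logic
    }
    // uniform control flow: every work-item reaches this barrier once per phase
    barrier(CLK_LOCAL_MEM_FENCE);
}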
Good day, everyone!
Not long ago I managed to parallelize a recursive algorithm that searches for possible ways of combining some events. At the moment, the code is as follows:
//#include's
// function announcements
// declaring a global variable:
QVector<QVector<QVector<float>>> variant; // or std::vector
int main() {
// reads data from file
// data are converted and analyzed
// the variant variable containing the current best result is filled in (here - by pre-analysis)
#pragma omp parallel shared(variant)
#pragma omp master
// occurs call a recursive algorithm of search all variants:
PEREBOR(Tabl_1, a, i_a, ..., rec_depth);
return 0;
}
void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t rec_depth)
{
// looking for the boundaries of the first cycle for some reasons
for (int i = quantity; i < another_quantity; i++) {
// the Tabl_1 is processed and modified to determine the number of steps in the subsequent for cycle
for (int k = 0; k < the_quantity_just_found; k++) {
if (rec_depth != 1) { // the recursion depth is not 1, so we go down further
// add descent to the next recursion level to the call stack:
#pragma omp task
PEREBOR(Tabl_1_COPY, a, i_a, ..., rec_depth-1);
}
else { // we went down to the lowest level
if (condition fulfilled) // condition check - READS the variant variable
variant = it_is_equal_to_that_,_to_that...;
else
continue;
}
}
}
}
At the moment this really works well: on six cores the CPU gives a speedup of more than 5.7x over the single-core version.
As you can see, with a sufficiently large number of threads there may be a failure caused by simultaneous reading/writing of the variant variable. I understand it needs to be protected. At the moment the only way out I see is to use the lock functions, since a critical section is not suitable: the variant variable is written in only one place in the code (at the lowest level of recursion), but it is read in many places.
Actually, here is the question: if I apply these constructs:
omp_lock_t lock;
int main() {
...
omp_init_lock(&lock);
#pragma omp parallel shared(variant, lock)
...
}
...
else { // we went down to the lowest level
if (condition fulfilled) { // condition check - READ variant variable
omp_set_lock(&lock);
variant = it_is_equal_to_that_,_to_that...;
omp_unset_lock(&lock);
}
else
continue;
...
will this lock protect the reading of the variable in all other places? Or will I need to manually check the lock status and pause the thread before reading elsewhere?
I will be incredibly grateful to the distinguished community for help!
In the OpenMP specification (1.4.1 The structure of the OpenMP memory model) you can read:
The OpenMP API provides a relaxed-consistency, shared-memory model.
All OpenMP threads have access to a place to store and to retrieve
variables, called the memory. In addition, each thread is allowed to
have its own temporary view of the memory. The temporary view of
memory for each thread is not a required part of the OpenMP memory
model, but can represent any kind of intervening structure, such as
machine registers, cache, or other local storage, between the thread
and the memory. The temporary view of memory allows the thread to
cache variables and thereby to avoid going to memory for every
reference to a variable.
This practically means that (as with any relaxed memory model) threads are guaranteed to have the same, consistent view of the values of shared variables only at well-defined points. Between such points, the temporary view may differ across threads.
In your code you handled the problem of simultaneous writes to the same variable, but without additional measures there is no guarantee that another thread reads the correct value of the variable.
You have three options (note that each of these solutions not only handles simultaneous reads/writes, but also provides a consistent view of the values of shared variables):
If your variable is of scalar type, the best solution is to use atomic operations. This is the fastest option, as atomic operations are typically supported by the hardware.
#pragma omp parallel
{
...
#pragma omp atomic read
tmp=variant;
....
#pragma omp atomic write
variant=new_value;
}
Use the critical construct. This solution can be used if your variable is of a complex type (such as a class) whose read/write cannot be performed atomically. Note that it is much less efficient (slower) than an atomic operation.
#pragma omp parallel
{
...
#pragma omp critical
tmp=variant;
....
#pragma omp critical
variant=new_value;
}
Use locks for each read/write of your variable. Your code is OK for the write, but you have to use the lock for reads as well. This requires the most coding, but practically the result is the same as with the critical construct. Note that OpenMP implementations typically use locks to implement critical constructs anyway.
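For illustration, a minimal sketch of option 3 that reuses the lock from the question for the reads as well; local_copy is a hypothetical name, and taking a snapshot keeps the time spent under the lock short:
// protected read: take a snapshot of variant while holding the lock
omp_set_lock(&lock);
QVector<QVector<QVector<float>>> local_copy = variant;
omp_unset_lock(&lock);
// ... evaluate the condition against local_copy, without holding the lock ...

// protected write (as in the question)
omp_set_lock(&lock);
variant = it_is_equal_to_that_,_to_that...;
omp_unset_lock(&lock);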
// down = acquire the resource
// up = release the resource
typedef int semaphore;
semaphore resource_1;
semaphore resource_2;
void process_A(void) {
down(&resource_1);
down(&resource_2);
use_both_resources();
up(&resource_2);
up(&resource_1);
}
If the resources are released in the same order as they were acquired, i.e.,
void process_A(void) {
down(&resource_1);
down(&resource_2);
use_both_resources();
up(&resource_1);
up(&resource_2);
}
would that cause any potential problem?
Thanks for any explanation!
The important part is whether you are taking the locks in the same order in different threads or not.
The order of release has no effect; nothing stops the program from releasing the second lock after the first one has been released (unless you're taking new locks in between, but then you're back at the first case: taking the locks in the correct order).
If you have two functions that try to take the same two locks, in different orders, they could grab one lock each, and wait forever for the other one to release their lock. Example code:
down(first_lock)
down(second_lock)
running concurrently with
down(second_lock)
down(first_lock)
they could both take their first lock before either takes its second one, and then they'll deadlock.
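The usual fix is to impose a single global acquisition order that every thread follows; the release order can then be whatever is convenient. A sketch in the same down/up notation as the question:
void process_A(void) {
    down(&resource_1);      // everyone acquires resource_1 first...
    down(&resource_2);      // ...and resource_2 second
    use_both_resources();
    up(&resource_1);        // release order does not matter
    up(&resource_2);
}

void process_B(void) {
    down(&resource_1);      // same global order as process_A: no deadlock
    down(&resource_2);
    use_both_resources();
    up(&resource_2);
    up(&resource_1);
}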
I have a long and busy for loop which is supposed to call addElement() on the stage iteratively. Since it takes several seconds to execute the whole loop (i = 1..N), I just want to refresh the stage at each iteration so that the output is displayed before the loop reaches its final point (N): each iteration should add a displayable element before the next iteration begins.
For this I wrote the following:
for(var i:int = 0; i < 280; i++){
addElement(...);
validateNow();
}
but it is not working like I want. Does anyone have a solution?
You need to divide up this lengthy work so that it can occur over multiple frames. Flash/Flex do not update the screen while your code is executing. Have a look at the elastic racetrack for Flash to help understand why. The Flex component life cycle adds another layer on top of that as well. By the way, calling validateNow() can be computationally expensive, possibly making your loop take even longer :)
There are a number of ways to break up the work in the loop.
Here's a simple example using callLater():
private function startWorking():void
{
// create a queue of work (whatever you are iterating over)
// in your loop you're counting to 280, you could use a
// simple counter variable here instead
var queue:Array = [ a, b, c];
callLater(doWork, [ queue ] );
}
private function doWork(workQueue:Array):void
{
var workItem:Object = workQueue.shift();
// do expensive work to create the element to add to screen
addElement(...);
    if (workQueue.length > 0)
        callLater(doWork, [ workQueue ]); // the second argument must be an args array, as above
}
You may want to process more than 1 item at a time. You could also do the same thing using the ENTER_FRAME event instead of callLater() (in essence, this is what callLater() is doing).
References:
validateClient() docs: calling validateNow() results in a call to this expensive method
I have the code snippet below in C++, which estimates pi using the classic Monte Carlo technique.
srand48((unsigned)time(0) + my_rank);
for(int i = 0 ; i < part_points; i++)
{
double x = drand48();
double y = drand48();
    if ((pow(x, 2) + pow(y, 2)) < 1) { ++count; }
}
MPI_Reduce(&count, &total_hits, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD); // note: count and total_hits must be declared as double to match MPI_DOUBLE
MPI_Barrier(MPI_COMM_WORLD);
if(my_rank == root)
{
pi = 4*(total_hits/(double)total_points);
cout << "Calculated pi: " << pi << " in " << end_time-start_time << endl;
}
I am just wondering if the MPI_Barrier call is necessary. Does MPI_Reduce make sure that the body of the if statement won't be executed before the reduce operation has completely finished? Hope I was clear. Thanks!
Yes, all collective communication calls (Reduce, Scatter, Gather, etc) are blocking. There's no need for the barrier.
Blocking, yes; a barrier, no. It can be very important to call MPI_Barrier() after MPI_Reduce() when executing in a tight loop. Without the MPI_Barrier(), the receive buffers of the reducing process will eventually run full and the application will abort. While the other participating processes only need to send and continue, the reducing process has to receive and reduce.
The above code does not need the barrier if my_rank == root == 0 (which is probably true). Anyway... MPI_Reduce() does not perform a barrier or any other form of synchronization. AFAIK even MPI_Allreduce() isn't guaranteed to synchronize (at least not by the MPI standard).
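A sketch of the tight-loop situation described above (local, global, and many_iters are placeholders): without the barrier, the sending ranks can begin further reductions before the root has drained the earlier ones, and its buffers for unexpected messages keep growing.
for (int iter = 0; iter < many_iters; ++iter) {
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD); // keeps the senders from outrunning the root
}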
Ask yourself if that barrier is needed. Suppose you are not the root: you call Reduce, which sends off your data. Is there any reason to sit and wait until the root has the result? Answer: no, so you don't need the barrier.
Suppose you're the root. You issue the reduce call. Semantically you are now forced to sit and wait until the result is fully assembled. So why the barrier? Again, no barrier call is needed.
In general, you almost never need a barrier because you don't care about temporal synchronization. The semantics guarantee that your local state is correct after the reduce call.
Recently I came across the problem mentioned in the title.
I have tried using QThread::terminate(), but I just can NOT stop
the thread, which is stuck in an endless loop (let's say, while(1)).
Thanks a lot.
Terminating the thread is the easy solution to stopping an async operation, but it is usually a bad idea: the thread could be doing a system call or could be in the middle of updating a data structure when it is terminated, which could leave the program or even the OS in an unstable state.
Try to transform your while(1) into while( isAlive() ) and make isAlive() return false when you want the thread to exit.
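A minimal sketch of that idea; the Worker class, requestStop(), and the flag name are hypothetical, and the atomic load/store calls assume Qt 5:
#include <QThread>
#include <QAtomicInt>

class Worker : public QThread
{
    QAtomicInt stopRequested; // 0 = keep running, 1 = exit requested
public:
    Worker() : stopRequested(0) {}
    void requestStop() { stopRequested.storeRelease(1); } // call from any thread
protected:
    void run()
    {
        while (stopRequested.loadAcquire() == 0) {
            // do one bounded chunk of work per iteration,
            // so the flag is re-checked regularly
        }
    }
};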
QThreads can deadlock if they finish "naturally" during termination.
For example in Unix, if the thread is waiting on a "read" call, the termination attempt (a Unix signal) will make the "read" call abort with an error code before the thread is destroyed.
That means the thread can still reach its natural exit point while being terminated. When it does so, a deadlock occurs, since some internal mutex is already locked by the "terminate" call.
My workaround is to actually make sure that the thread never returns if it was terminated.
while( read(...) > 0 ) {
// Do stuff...
}
while( wasTerminated )
sleep(1);
return;
wasTerminated here is actually implemented in a slightly more complex way, using atomic ints:
enum {
Running, Terminating, Quitting
};
QAtomicInt _state; // Initialized to Running
void myTerminate()
{
if( _state.testAndSetAcquire(Running, Terminating) )
terminate();
}
void run()
{
[...]
while(read(...) > 0 ) {
[...]
}
if( !_state.testAndSetAcquire(Running, Quitting) ) {
for(;;) sleep(1);
}
}
Have you tried exit or quit?
Did the thread call QThread::setTerminationEnabled(false)? That would cause thread termination to delay indefinitely.
EDIT: I don't know what platform you're on, but I checked the Windows implementation of QThread::terminate. Assuming the thread was actually running to begin with, and termination wasn't disabled via the above function, it's basically a wrapper around TerminateThread() in the Windows API. This function accepts disrespect from no thread, and tends to leave a mess behind with resource leaks and similar dangling state. If it's not killing the thread, you're either dealing with zombie kernel calls (most likely blocked I/O) or have even bigger problems somewhere.
To use unnamed pipes
int gPipeFdTest[2]; //create a global integer array
Wherever you intend to create the pipe, use:
if( pipe(gPipeFdTest) < 0)
{
perror("Pipe failed");
exit(1);
}
The above code will create a pipe with two ends: gPipeFdTest[0] for reading and gPipeFdTest[1] for writing. What you can do is, in your run function, watch the read end of the pipe using the select system call, and wherever you want run to exit, write to the pipe using the write system call. I used the select system call for monitoring the read end of the pipe because it suited my implementation. Try to figure all this out for your case. If you need any more help, give me a buzz.
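A rough sketch of the run-loop side under the assumptions above; the 100 ms timeout and the placement of the real work are placeholders for your application:
#include <sys/select.h>
#include <unistd.h>

// inside the thread's run function:
for (;;) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(gPipeFdTest[0], &readfds);
    struct timeval tv = { 0, 100000 }; // wake up every 100 ms to do work

    if (select(gPipeFdTest[0] + 1, &readfds, NULL, NULL, &tv) > 0 &&
        FD_ISSET(gPipeFdTest[0], &readfds))
        break; // something was written to the pipe: leave run()

    // ... do one chunk of the thread's normal work here ...
}

// elsewhere, to ask the thread to stop:
// write(gPipeFdTest[1], "x", 1);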
Edit:
My problem was just like yours. I had a while(1) loop, and the other things I tried needed mutexes and other fancy multithreading mumbo jumbo, which added complexity and made debugging a nightmare. Using pipes absolved me from those complexities and simplified the code. I am not saying it is the best option, but in my case it turned out to be the best and cleanest alternative. I was bugged by my hung application before this solution.