Here is the code I have that throws an "Iterator not incrementable" error at run time whenever I run it. If I comment out the sp++ line or the asteroids.push_back(*sp); line then it runs fine, so it has something to do with those lines. I saw in a previous post that a line like sp->getSize() might be incrementing the pointer as well and may be the cause of the issue? Thanks for the help!
while (sp != asteroids.end()) {
    if (sp->getSize() == .5 || sp->getSize() == 0.25) {
        glPushMatrix();
        glScalef(.1, .1, .1);
        glTranslatef(3, 3, 0);
        sp->display_asteriod(sp->getSize(), random, randomTwo);
        glPopMatrix();
        asteroidCount++;
        spawn.setSize(sp->getSize());
        //spawn.setLife(it->getLife());
        random = ((double) rand() / (RAND_MAX+1));
        randomTwo = ((double) rand() / (RAND_MAX+1)) * 7;
        spawn = createAsteroid(spawn);
        x_speed_asteriod = (spawn.getXDirection())*(spawn.getRandomVelocity());// + x_speed_asteriod;
        y_speed_asteriod = (spawn.getYDirection())*(spawn.getRandomVelocity());// + y_speed_asteriod;
        spawn.setXSpeed(x_speed_asteriod);
        spawn.setYSpeed(y_speed_asteriod);
        if (spawn.getRandomAxis() == 0) {
            glRotatef(spawn.getAngleRotation(), 1, 0, 0);
        } else if (spawn.getRandomAxis() == 1) {
            glRotatef(spawn.getAngleRotation(), 0, 1, 0);
        } else if (spawn.getRandomAxis() == 2) {
            glRotatef(spawn.getAngleRotation(), 0, 0, 1);
        }
        //it = asteroids.begin() + asteroidCount;
        //asteroids.insert(it, spawn);
        //asteroids.resize(asteroidCount);
        asteroids.push_back(*sp);
        glPushMatrix();
        glScalef(.1, .1, .1);
        glTranslatef(spawn.getXPosition()-3, spawn.getYPosition()-3, 0);
        spawn.display_asteriod(spawn.getSize(), random, randomTwo);
        glPopMatrix();
    } else {
        sp++;
    }
}
Your iterator sp is getting invalidated by the call to push_back. You are modifying the asteroids vector but you are still using the old iterator that you obtained before the modification.
This post contains a summary of rules for when iterators are invalidated.
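As a minimal standalone illustration (a sketch, not your code): push_back may reallocate the vector's storage, leaving any previously obtained iterator dangling.

#include <vector>

int main()
{
    std::vector<int> v = {1, 2, 3};
    std::vector<int>::iterator it = v.begin();
    v.push_back(4); // may reallocate the storage; 'it' now dangles
    //++it;         // undefined behavior; MSVC debug builds raise the
                    // "iterator not incrementable" assertion here
}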
Keeping track of new items to work on is often done using a queue (or a deque) in a safe way like this:
#include <deque>
#include <vector>

vector<Asteroid> asteroids;
deque<Asteroid> asteroid_queue;

// add all current asteroids into the queue
asteroid_queue.assign(asteroids.begin(), asteroids.end());

while (!asteroid_queue.empty())
{
    // grab the next asteroid to process
    Asteroid asteroid = asteroid_queue.front();
    // remove it from the queue
    asteroid_queue.pop_front();

    do_some_work();

    // if necessary, add a new asteroid .. be careful not to end in an infinite loop
    asteroid_queue.push_back(..);
}
Related
I am trying to use MPI's RMA scheme with fences. In some cases it works fine, but for systems with multiple nodes I get the following error:
Error message: MPI failed with Error_code = 71950898
Wrong synchronization of RMA calls , error stack:
MPI_Rget(176): MPI_Rget(origin_addr=0x2ac7b10, origin_count=1, MPI_INTEGER, target_rank=0, target_disp=0, target_count=1, MPI_INTEGER, win=0xa0000000, request=0x7ffdc1efe634) failed
(unknown)(): Wrong synchronization of RMA calls
Error from PE:0/4
This is a schematic of how I set up the code:
call MPI_init(..)
CALL MPI_WIN_CREATE(..)
do i = 1, 10
    MPI_Win_fence(0, handle, err)
    calc_values()
    MPI_Put(values)
    MPI_Put(values)
    MPI_Put(values)
    MPI_Win_fence(0, handle, err)
    MPI_Rget(values, req)
    MPI_WAIT(req)
    do_something(values)
    MPI_Rget(values, req)
    MPI_WAIT(req)
    do_something(values)
enddo
call MPI_finalize()
I know that MPI_Put is non-blocking. Is it guaranteed that the MPI_Put is finished after MPI_Win_fence(0, handle, err), or do I have to use MPI_Rput?
What does this error even mean: "Wrong synchronization of RMA calls"?
How do I fix my communication scheme?
Make sure you add the following call wherever synchronization is required (and make sure your window(s) are created before putting data into them). To answer your first question: with fence synchronization, the closing MPI_Win_fence completes all RMA operations issued during the epoch, so a plain MPI_Put is guaranteed to be finished after the fence; MPI_Rput is not required for that:
MPI_Win_fence(0, window);
Please look at the example below (source) and note that they are making two fence calls.
// Create the window
int window_buffer = 0;
MPI_Win window;
MPI_Win_create(&window_buffer, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &window);
if (my_rank == 1)
{
    printf("[MPI process 1] Value in my window_buffer before MPI_Put: %d.\n", window_buffer);
}
MPI_Win_fence(0, window);
if (my_rank == 0)
{
    // Push my value into the first integer in MPI process 1 window
    int my_value = 12345;
    MPI_Put(&my_value, 1, MPI_INT, 1, 0, 1, MPI_INT, window);
    printf("[MPI process 0] I put data %d in MPI process 1 window via MPI_Put.\n", my_value);
}
// Wait for the MPI_Put issued to complete before going any further
MPI_Win_fence(0, window);
if (my_rank == 1)
{
    printf("[MPI process 1] Value in my window_buffer after MPI_Put: %d.\n", window_buffer);
}
// Destroy the window
MPI_Win_free(&window);
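As for the MPI_Rget calls in your schematic: per the MPI standard, request-based RMA operations (MPI_Rget, MPI_Rput, ...) may only be used within a passive target epoch, not inside a fence epoch, which is a plausible source of the "Wrong synchronization of RMA calls" error. A minimal sketch of such an epoch, reusing the window from the example above (target rank and counts are illustrative):

int value;
MPI_Request req;
// Open a passive target epoch on rank 1, read one int, then close the epoch
MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, window);
MPI_Rget(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, window, &req);
MPI_Wait(&req, MPI_STATUS_IGNORE);
MPI_Win_unlock(1, window);

Alternatively, within a fence epoch a plain MPI_Get followed by the closing fence achieves the same effect without request handles.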
Suppose I have a very large array of things and I have to do some operation on all these things.
In case the operation fails for one element, I want to stop the work (which is distributed across a number of processors) across the whole array.
I want to achieve this while keeping the number of sent/received messages to a minimum.
Also, I don't want to block processors if there is no need to.
How can I do it using MPI?
This seems to be a common question with no easy answer. Both other answers have scalability issues. The ring-communication approach has linear communication cost, while in the one-sided MPI_Win solution a single process will be hammered with memory requests from all workers. This may be fine for a low number of ranks, but poses issues as the rank count increases.
Non-blocking collectives can provide a more scalable solution. The main idea is to post an MPI_Ibarrier on all ranks except one designated root. This root listens for point-to-point stop messages via MPI_Irecv and joins the MPI_Ibarrier once it receives one.
The tricky part is that there are four different cases "{root, non-root} x {found, not-found}" that need to be handled. Also, multiple ranks may send a stop message, requiring an unknown number of matching receives on the root. That can be solved with an additional reduction that counts the number of ranks that sent a stop request.
Here is an example of how this can look in C:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

const int iter_max = 10000;
const int difficulty = 20000;

int find_stuff()
{
    int num_iters = rand() % iter_max;
    for (int i = 0; i < num_iters; i++) {
        if (rand() % difficulty == 0) {
            return 1;
        }
    }
    return 0;
}

const int stop_tag = 42;
const int root = 0;

int forward_stop(MPI_Request* root_recv_stop, MPI_Request* all_recv_stop, int found_count)
{
    int flag;
    MPI_Status status;
    if (found_count == 0) {
        MPI_Test(root_recv_stop, &flag, &status);
    } else {
        // If we find something on the root, we actually wait until we receive our own message.
        MPI_Wait(root_recv_stop, &status);
        flag = 1;
    }
    if (flag) {
        printf("Forwarding stop signal from %d\n", status.MPI_SOURCE);
        MPI_Ibarrier(MPI_COMM_WORLD, all_recv_stop);
        MPI_Wait(all_recv_stop, MPI_STATUS_IGNORE);
        // We must post some additional receives if multiple ranks found something at the same time
        MPI_Reduce(MPI_IN_PLACE, &found_count, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
        for (found_count--; found_count > 0; found_count--) {
            MPI_Recv(NULL, 0, MPI_CHAR, MPI_ANY_SOURCE, stop_tag, MPI_COMM_WORLD, &status);
            printf("Additional stop from: %d\n", status.MPI_SOURCE);
        }
        return 1;
    }
    return 0;
}

int main()
{
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    srand(rank);

    MPI_Request root_recv_stop;
    MPI_Request all_recv_stop;
    if (rank == root) {
        MPI_Irecv(NULL, 0, MPI_CHAR, MPI_ANY_SOURCE, stop_tag, MPI_COMM_WORLD, &root_recv_stop);
    } else {
        // You may want to use an extra communicator here, to avoid messing with other barriers
        MPI_Ibarrier(MPI_COMM_WORLD, &all_recv_stop);
    }

    while (1) {
        int found = find_stuff();
        if (found) {
            printf("Rank %d found something.\n", rank);
            // Note: We cannot post this as blocking, otherwise there is a deadlock with the reduce
            MPI_Request req;
            MPI_Isend(NULL, 0, MPI_CHAR, root, stop_tag, MPI_COMM_WORLD, &req);
            if (rank != root) {
                // We know that we are going to receive our own stop signal.
                // This avoids running another useless iteration
                MPI_Wait(&all_recv_stop, MPI_STATUS_IGNORE);
                MPI_Reduce(&found, NULL, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
                MPI_Wait(&req, MPI_STATUS_IGNORE);
                break;
            }
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        if (rank == root) {
            if (forward_stop(&root_recv_stop, &all_recv_stop, found)) {
                break;
            }
        } else {
            int stop_signal;
            MPI_Test(&all_recv_stop, &stop_signal, MPI_STATUS_IGNORE);
            if (stop_signal) {
                MPI_Reduce(&found, NULL, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
                printf("Rank %d stopping after receiving signal.\n", rank);
                break;
            }
        }
    }
    MPI_Finalize();
}
While this is not the simplest code, it should:
Introduce no additional blocking
Scale with the implementation of a barrier (usually O(log N))
Have a worst-case latency, from one rank finding a result to all ranks stopping, of 2 * loop time (+ 1 p2p + 1 barrier + 1 reduction).
If many/all ranks find a solution at the same time, it still works, but may be less efficient.
A possible strategy to derive this global stop condition in a non-blocking fashion is to rely on MPI_Test.
Scenario
Consider that each process posts an asynchronous receive of type MPI_INT from its left rank with a given tag, building a ring. Then start your computation. If a rank encounters the stop condition, it sends its own rank to its right rank. In the meantime, each rank uses MPI_Test during the computation to check whether the MPI_Irecv has completed. If it has, the rank enters a branch that first retrieves the message and then transitively propagates the received rank to the right, except when the right rank equals the payload of the message (so the message does not loop forever).
Once this is done, all processes end up in that branch, ready to trigger an arbitrary recovery operation.
Complexity
A ring topology was retained because it minimizes the number of messages (at most n-1); however, it increases the propagation time. Other topologies could be retained, with more messages but a shorter propagation path, for example a tree with n·ln(n) message complexity.
Implementation
Something like this.
int rank, size;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

int left_rank = (rank == 0) ? (size - 1) : (rank - 1);
int right_rank = (rank == (size - 1)) ? 0 : (rank + 1) % size;

int stop_cond_rank;
MPI_Request stop_cond_request;
int stop_cond = 0;

/* Post the receive from the left neighbour once, before the work loop;
   reposting it each iteration would pile up pending requests and break MPI_Test */
MPI_Irecv(&stop_cond_rank, 1, MPI_INT, left_rank, 123, MPI_COMM_WORLD, &stop_cond_request);

while (1)
{
    /* Compute here and set stop condition accordingly */

    if (stop_cond)
    {
        /* Cancel the left recv */
        MPI_Cancel(&stop_cond_request);
        if (rank != right_rank)
            MPI_Send(&rank, 1, MPI_INT, right_rank, 123, MPI_COMM_WORLD);
        break;
    }

    int did_recv = 0;
    MPI_Test(&stop_cond_request, &did_recv, MPI_STATUS_IGNORE);
    if (did_recv)
    {
        stop_cond = 1;
        if (right_rank != stop_cond_rank)
            MPI_Send(&stop_cond_rank, 1, MPI_INT, right_rank, 123, MPI_COMM_WORLD);
        break;
    }
}

if (stop_cond)
{
    /* Handle the stop condition */
}
else
{
    /* Cleanup */
    MPI_Cancel(&stop_cond_request);
}
That is a question I've asked myself a few times without finding any completely satisfactory answer... The only thing I thought of (besides MPI_Abort(), which does it but is a bit extreme) is to create an MPI_Win storing a flag that will be raised by whichever process faces the problem, and checked regularly by all processes to see if they can go on processing. This is done using non-blocking calls, the same way as described in this answer.
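For illustration, here is a minimal sketch of that idea (it assumes the MPI-3 unified memory model; do_work_chunk, the loop bound, and the flag layout are illustrative, not from any real code):

#include <mpi.h>

int do_work_chunk(int i); /* assumed user work function; returns 1 on failure */

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int stop_flag = 0; /* each rank exposes its own copy of the flag */
    MPI_Win win;
    MPI_Win_create(&stop_flag, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_lock_all(0, win); /* one passive-target epoch for the whole run */

    for (int i = 0; i < 1000 && !stop_flag; i++) {
        if (do_work_chunk(i)) {
            int one = 1;
            for (int r = 0; r < size; r++) /* raise the flag on every rank */
                MPI_Put(&one, 1, MPI_INT, r, 0, 1, MPI_INT, win);
            MPI_Win_flush_all(win);
            break;
        }
        MPI_Win_sync(win); /* make remote updates to our local flag visible */
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
}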
The main weaknesses of this are:
This depends on the processes willingly checking the status of the flag. There is no immediate interruption of their work to notify them.
The frequency of this checking must be adjusted by hand. You have to find the trade-off between the time you waste processing data when there is no need to (because something happened elsewhere) and the time it takes to check the status...
In the end, what we would need is a way of defining a callback action triggered by an MPI call such as MPI_Abort() (basically, replacing the abort action with something else). I don't think this exists, but maybe I overlooked it.
In my code I have kernelA and kernelB. kernelB depends on kernelA's results. I am iterating over these kernels thousands of times, and each iteration depends on the results from the previous iteration.
The host-side enqueue code snippet looks like this:
for (int x = 0; x < iterations; ++x)
{
    queue.enqueueNDRangeKernel(kernelA, cl::NullRange, cl::NDRange(3*256, 1), cl::NDRange(256, 1));
    queue.enqueueNDRangeKernel(kernelB, cl::NullRange, cl::NDRange(256, 1), cl::NDRange(256, 1));
}
queue.finish();
The above code is working perfectly fine.
Now I want to port the above code to use device-side enqueue, and I'm facing issues on an AMD GPU. The kernel code:
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void kernelA(...){}

__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void kernelB(...){}

__attribute__((reqd_work_group_size(1, 1, 1)))
__kernel void kernelLauncher(...)
{
    queue_t default_queue = get_default_queue();
    clk_event_t ev1, ev2;
    for (int x = 0; x < iterations; ++x)
    {
        void(^fnKernelA)(void) = ^{ kernelA(
            ... // kernel params come here
        ); };
        if (x == 0)
        {
            enqueue_kernel(default_queue,
                           CLK_ENQUEUE_FLAGS_NO_WAIT,
                           ndrange_1D(3 * 256, 256),
                           0, NULL, &ev1,
                           fnKernelA);
        }
        else
        {
            enqueue_kernel(default_queue,
                           CLK_ENQUEUE_FLAGS_NO_WAIT,
                           ndrange_1D(3 * 256, 256),
                           1, &ev2, &ev1, // ev2 sets dependency on kernelB here
                           fnKernelA);
        }
        void(^fnKernelB)(void) = ^{ kernelB(
            ... // kernel params come here
        ); };
        enqueue_kernel(default_queue,
                       CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D(256, 256),
                       1, &ev1, &ev2, // ev1 sets dependency on kernelA here
                       fnKernelB);
    }
}
The host code:
queue.enqueueNDRangeKernel(kernelLauncher, cl::NullRange, cl::NDRange(1, 1), cl::NDRange(1, 1));
The issue is that the results returned from the kernel when run on an AMD GPU are wrong. Sometimes the kernel also hangs, which may indicate something wrong with kernel synchronization. The same code works fine on an Intel CPU; I am not sure whether that is luck or there really is something wrong with the synchronization points in the kernel.
Update: enqueue_kernel is failing on the 1025th enqueue command with error -1. I tried to get a more detailed error (I added -g during the build), but to no avail. I increased the device queue size to the maximum, but that didn't change anything (it still fails on the 1025th enqueue command). Removing the contents of kernelA and kernelB didn't change anything either. Any thoughts?
Answering an old question to hopefully save someone time in the future. If you query CL_DEVICE_MAX_ON_DEVICE_EVENTS on your device, it will return 1024. That is the maximum number of events you can queue "on device", which is why it fails on the 1025th enqueue. If you run your OpenCL code on a different GPU (like Intel), you may be lucky enough to get a real error code back, namely CLK_DEVICE_QUEUE_FULL (-161). AMD ignores the -g option and never seems to give back anything but -1 on a failed on-device enqueue.
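For reference, a host-side sketch of querying that limit (plain C API, which also works alongside the C++ bindings used in the question; 'device' stands for the cl_device_id in use):

cl_uint max_events = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_ON_DEVICE_EVENTS,
                sizeof(max_events), &max_events, NULL);
printf("Max on-device events: %u\n", max_events); // 1024 on the AMD GPU discussed above

On the kernel side, releasing events once their dependent enqueue has been issued (release_event(ev1) after the kernelB enqueue, and similarly for ev2) should keep the number of live events bounded, though I have not verified that this suffices on AMD.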
I need help with OpenCV. System: WinXP, OpenCV 2.4.7.2, Qt 5.2.1. I have a problem with a memory leak. When this function runs, after about 2 minutes I get an error like this: "std::bad_alloc at memory location 0x0012a9e8...
void EnclosingContour(IplImage* _image)
{
    assert(_image != 0);

    // Clone src image and convert to gray
    clone_image = cvCreateImage(cvGetSize(_image), IPL_DEPTH_8U, 1);
    cvConvertImage(_image, clone_image, CV_BGR2GRAY);

    // Some images for processing
    dst = cvCreateImage(cvGetSize(_image), IPL_DEPTH_8U, 1);
    temp = cvCreateImage(cvGetSize(_image), IPL_DEPTH_8U, 1);

    // Make ROI
    if (ui.chb_ROI->isChecked()) {
        cvSetImageROI(clone_image, cvRect(ui.spb_x1->value(), ui.spb_y1->value(), ui.spb_x2->value(), ui.spb_y2->value()));
    }

    // Create image for processing
    bin = cvCreateImage(cvGetSize(clone_image), IPL_DEPTH_8U, 1);
    bin = cvCloneImage(clone_image);

    // Canny before
    if (ui.chb_canny_before->isChecked()) {
        cvCanny(bin, bin, ui.hsl_threshold_1->value(), ui.hsl_threshold_2->value());
    }

    // Adaptive threshold
    if (Adaptive == true) {
        cvAdaptiveThreshold(bin, dst, ui.hsl_adaptive->value(), 0, 0, 3, 5);
        bin = cvCloneImage(dst);
        cvReleaseImage(&dst);
    }

    // Morphology operations
    if (morphology == true) {
        cvMorphologyEx(bin, bin, temp, NULL, operations, 1);
        cvReleaseImage(&temp);
    }

    // Canny after
    if (ui.chb_canny_after->isChecked()) {
        cvCanny(bin, bin, ui.hsl_threshold_1->value(), ui.hsl_threshold_2->value());
    }

    // Zero ROI
    cvZero(clone_image);
    cvCopyImage(bin, clone_image);
    cvResetImageROI(clone_image);

    // Show
    cvNamedWindow("bin", 1);
    cvShowImage("bin", clone_image);
    cvReleaseImage(&clone_image);

    // storage for contours
    storage = cvCreateMemStorage(0);
    contours = 0;

    // find contours
    if (ui.chb_ROI->isChecked()) {
        int contoursCont = cvFindContours(bin, storage, &contours, sizeof(CvContour), CV_RETR_LIST, method, cvPoint(ui.spb_x1->value(), ui.spb_y1->value()));
    } else {
        int contoursCont = cvFindContours(bin, storage, &contours, sizeof(CvContour), CV_RETR_LIST, method, cvPoint(0, 0));
    }
    assert(contours != 0);

    // How many contours
    // All contours
    for (CvSeq* current = contours; current != NULL; current = current->h_next) {
        // Draw rectangle over all contours
        CvRect r = cvBoundingRect(current, 1);
        cvRectangleR(_image, r, cvScalar(0, 0, 255, 0), 3, 8, 0);
        // Show width of rect
        ui.textEdit_2->setText(QString::number(r.width));
    }

    // Clean resources
    cvReleaseMemStorage(&storage);
    cvReleaseImage(&bin);
}
You are releasing 'temp' and 'dst' images inside 'if' blocks. This is a sure recipe for a memory leak, since they may not be released at all.
On a side note, you are using the C interface of OpenCV, which is deprecated and will be removed soon. If you switch to the C++ interface (i.e. if you use Mat instead of IplImage*), then memory leaks from unreleased images cannot happen.
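For illustration, a rough sketch of a similar pipeline with the C++ interface (the threshold values and the simplified processing chain are illustrative, not a drop-in replacement for the function above):

#include <opencv2/opencv.hpp>

void EnclosingContour(cv::Mat& image)
{
    cv::Mat gray, bin;
    cv::cvtColor(image, gray, CV_BGR2GRAY);
    cv::Canny(gray, bin, 50, 150); // illustrative thresholds

    std::vector<std::vector<cv::Point> > contours;
    cv::findContours(bin, contours, CV_RETR_LIST, CV_CHAIN_APPROX_SIMPLE);

    for (size_t i = 0; i < contours.size(); i++) {
        cv::Rect r = cv::boundingRect(contours[i]);
        cv::rectangle(image, r, cv::Scalar(0, 0, 255), 3);
    }
} // all Mats release their buffers automatically here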
I have a nested loop, and from inside the loop I call MPI_Send. I want it to send a specific value to the receiver; the receiver then takes the data and again sends MPI messages to another set of CPUs. I used something like the code below, but it looks like there is a problem in the receive and I can't see where I went wrong: "the machine goes to infinite loop somewhere...".
I am trying to make it work like this :
master CPU >> send to other CPUs >> send to slave CPUs
.
.
.
int currentCombinationsCount;
int mp;

if (rank == 0)
{
    for (int pr = 0; pr < combinationsSegmentSize; pr++)
    {
        int CblockBegin = CombinationsSegementsBegin[pr];
        int CblockEnd   = CombinationsSegementsEnd[pr];
        currentCombinationsCount = numOfCombinationsEachLoop[pr];
        prossessNum = 1; // specify which processor we are sending to

        // now substitute and send to the main processors
        for (mp = CblockBegin; mp <= CblockEnd; mp++)
        {
            MPI_Send(&mp, 1, MPI_INT, prossessNum, TAG, MPI_COMM_WORLD);
            prossessNum++;
        }
    } // this loop goes through all the specified blocks for the combinations
} // end of rank 0
else if (rank > currentCombinationsCount)
{
    // here I want to put other receives that will take values from the else below
}
else
{
    MPI_Recv(&mp, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &stat);
    // the code gets stuck here in an infinite loop
}
You've only initialised currentCombinationsCount within the if (rank == 0) branch, so all other procs will see an uninitialised variable. That results in undefined behaviour, and the outcome depends on your compiler: your program may crash, or the value may be set to 0 or to some indeterminate value.
If you're lucky, the value may be set to 0, in which case your branching reduces to:
if (rank == 0) { /* rank == 0 will enter this */ }
else if (rank > 0) { /* all other procs enter this */ }
else { /* never entered! Recvs are never called to match the sends */ }
You therefore end up with sends that are not matched by any receives. Since MPI_Send is potentially blocking, the sending proc may stall indefinitely. With procs blocking on sends, it can certainly look as though "...the machine goes to infinite loop somewhere...".
If currentCombinationsCount is given an arbitrary value (instead of 0), then the rank != 0 procs will enter arbitrary branches (with a higher chance of all entering the final else). You then end up with the second set of receives never being called, resulting in the same issue as above.
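One way to avoid the problem is to make the count known on every rank before branching, e.g. with MPI_Bcast. A sketch (simplified: the original code updates currentCombinationsCount once per block inside the pr loop, so in the real code the broadcast would have to happen once per block):

int currentCombinationsCount = 0;
if (rank == 0) {
    currentCombinationsCount = numOfCombinationsEachLoop[pr]; /* as in the original */
}
/* now every rank learns the same value, so the branching below is well defined */
MPI_Bcast(&currentCombinationsCount, 1, MPI_INT, 0, MPI_COMM_WORLD);

if (rank == 0) {
    /* ... sends as before ... */
} else if (rank > currentCombinationsCount) {
    /* ... other receives ... */
} else {
    MPI_Recv(&mp, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &stat);
}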