Concurrent Processing - Petersons Algorithm - deadlock

For those unfamiliar, the following is Peterson's algorithm used for process coordination:
int No_Of_Processes; // Number of processes
int turn; // Whose turn is it?
int interested[No_Of_Processes]; // All values initially FALSE
void enter_region(int process) {
int other; // number of the other process
other = 1 - process; // the opposite process
interested[process] = TRUE; // this process is interested
turn = process; // set flag
while(turn == process && interested[other] == TRUE); // wait
}
void leave_region(int process) {
interested[process] = FALSE; // process leaves critical region
}
My question is, can this algorithm give rise to deadlock?

No, there is no deadlock possible.
The only place you are waiting is while loop. And the process variables is not shared between threads and they are different, but turn variable is shared. So it's impossible to get true condition for turn == process for more then one thread in every single moment.
But anyway your solution is not correct at all, the Peterson's algorithm is only for two concurrent threads, not for any No_Of_Processes like in your code.
In original algorithm for N processes deadlocks are possible link.

Related

OpenCL 2.0 Device command queue keeps filling up and halting execution

I am utilizing OpenCL's enqueue_kernel() function to enqueue kernels dynamically from the GPU to reduce unnecessary host interactions. Here is a simplified example of what I am trying to do in the kernels:
kernel void kernelA(args)
{
//This kernel is the one that is enqueued from the host, with only one work item. This kernel
//could be considered the "master" kernel that controls the logic of when to enqueue tasks
//First, it checks if a condition is met, then it enqueues kernelB
if (some condition)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelB(args);});
}
else
{
//do other things
}
}
kernel void kernelB(args)
{
//Do some stuff
//Only enqueue the next kernel with the first work item. I do this because the things
//occurring in kernelC rely on the things that kernelB does, so it must take place after kernelB is completed,
//hence, the CLK_ENQUEUE_FLAGS_WAIT_KERNEL
if (get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
}
}
kernel void kernelC(args)
{
//Do some stuff. This one in particular is one step in a sorting algorithm
//This kernel will enqueue kernelD if a condition is met, otherwise it will
//return to kernelA
if (get_global_id(0) == 0 && other requirements)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelD(args);});
}
else if (get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
}
}
kernel void kernelD(args)
{
//Do some stuff
//Finally, if some condition is met, enqueue kernelC again. What this will do is it will
//bounce back and forth between kernelC and kernelD until the condition is
//no longer met. If it isn't met, go back to kernelA
if (some condition)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
}
else
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
}
}
So that is the general flow of the program, and it works perfectly and does exactly as I intended it to do, in the exact order I intended it to do it in, except for one issue. In certain cases when the workload is very high, a random one of the enqueue_kernel()s will fail to enqueue and halt the program. This happens because the device queue is full, and it cannot fit another task into it. But I cannot for the life of me figure out why this is, even after extensive research.
I thought that once a task in the queue (a kernel for instance) is finished, it would free up that spot in the queue. So my queue should really only reach a max of like 1 or 2 tasks at a time. But this program will literally fill up the entire 262,144 byte size of the device command queue, and stop functioning.
I would greatly appreciate some potential insight as to why this is happening if anyone has any ideas. I am sort of stuck and cannot continue until I get past this issue.
Thank you in advance!
(BTW I am running on a Radeon RX 590 card, and am using the AMD APP SDK 3.0 to use with OpenCL 2.0)
I don't know exactly what's going wrong, but I've noticed a few things in the code you posted and this feedback would be too long/hard to read in comments, so here goes - not a definite answer, but an attempt to get a bit closer:
Code doesn't quite do what the comments say
In kernelD, you have:
//Finally, if some condition is met, enqueue kernelC again.
…
if (get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelD(args);});
}
This actually enqueues kernelD itself again, not kernelC as the comments suggest. The other condition branch enqueues kernelA.
This could be a typo in the reduced version of your code.
Potential task explosion
This could again be down to the way you've abridged the code, but I don't quite see how
So my queue should really only reach a max of like 1 or 2 tasks at a time.
can be true. By my reading, all work items of both kernelC and kernelD will spawn new tasks; and as there seems to be more than 1 work item in each case, this seems like it could easily spawn a very large number of tasks:
For example, in kernelC:
if (get_global_id(0) == 0 && other requirements)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelD(args);});
}
else
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
}
kernelB will have created at least 256 work items running kernelC. Here, work item 0 will (if other requirements met) spawn 1 task with at least 256 more work items, and 255+ tasks with 1 work-item running kernelA. kernelD behaves similarly.
So with a few iterations, you could easily end up with a few thousand tasks for running kernelA queued. I don't really know what your code does, but it seems like a good idea to check if cutting down these hundreds of kernelA tasks improves the situation, and whether you can perhaps modify kernelA so that you just enqueue it once with a range instead of enqueueing a work size of 1 from every work item. (Or something along those lines - perhaps enqueue once per group if that makes more sense. Basically, reduce the number of times enqueue_kernel gets called.)
enqueue_kernel() return value
Have you actually checked the return value for enqueue_kernel? It tells you exactly why it failed, so even if my suggestion above isn't possible, perhaps you can set some global state which will allow kernelA to restart the calculation once more tasks have drained, if it was interrupted?

Alarms on an ESP32

I am using an ESP32 connected to Google Cloud IoT Core to control lights with a raspberry pi as a hub to send messages to the esp32 through IoT core to turn them on and off at set times of the day.
What I want to do now is just eliminate the raspberry pi and have the esp32 manage when it needs to turn them on or off but I cant see to find a way to set alarms or something like alarms that trigger at specific times of days.
The time value coming in is in milliseconds from UTC
Does something like that exists or how could I accomplish alarms on the esp32?
I don't know if you are using the ESP IDF or some sort of Arduino framework for the ESP32 (with which I am not familiar). This answer is for the ESP IDF. If you are not using the ESP-IDF and don't have access to the relevant header files and libraries, some of the ideas might still be relevant.
Activate SNTP
Ensure that the clock on the controller is updated by following the SNTP example at https://github.com/espressif/esp-idf/tree/master/examples/protocols/sntp .
Use localtime()
If you are using the ESP-IDF you will have access to the localtime_r() function. Pass this (a pointer to) the number of 'seconds since the Epoch' - as returned by time(NULL) - and it will fill a struct tm with the day of the week, day of the month, hour, minute and second.
The general procedure is (error handling ommitted) ...
#include <time.h>
// ...
struct tm now = {};
time_t tnow = time(NULL);
localtime_r(&tnow, &now);
// Compare current day, hour, minute, etc. against pre-defined alarms.
Note: You can use gmtime_r() instead of localtime_r() if you prefer to use UTC.
You can invoke this periodically from your main while() loop as outlined below.
Alarm structures
Define a structure for your alarm information. You could use something like the following - tweaking according to the level of granularity you need. Note that it includes a callback member - to specify what the program should actually do when the alarm is triggered.
typedef struct {
int week_day; // -1 for every day of week
int hour; // -1 for every hour
int minute; // -1 for every minute
void (*callback)(void);
void *callback_arg;
} alarm_t;
Define alarms and callbacks
Define your array of alarms, callback functions and arguments.
static void trigger_relay(const void *arg) {
uint32_t num = *(uint32_t *)arg;
printf("Activating relay %u ...\n", num);
// Handle GPIO etc.
}
static uint32_t relay_1 = 0;
static uint32_t relay_2 = 1;
static alarm_t alarms[] = {
// Trigger relay 1 at 9:00 on Sunday
{0, 9, 0, trigger_relay, (void *)&relay_1},
// Relay 2 at 7:30 every day
{-1, 7, 30, trigger_relay, (void *)&relay_2},
};
Periodic scan
Define a check_alarms() routine which loads a struct tm structure with the current time and compares this with each of the alarms you have defined. If the current time matches that of the alarm, the callback is invoked.
static void check_alarms(void) {
time_t tnow = time(NULL);
struct tm now = {};
localtime_r(&tnow, &now);
ESP_LOGI(TAG, "Time: %u:%u:%u", now.tm_hour, now.tm_min, now.tm_sec);
size_t n_alarms = sizeof(alarms) / sizeof(alarms[0]);
for (int i = 0; i < n_alarms; i++) {
alarm_t *alrm = &alarms[i];
if (alrm->week_day != -1 && alrm->week_day != now.tm_wday) {
continue;
}
if (alrm->hour != -1 && alrm->hour != now.tm_hour) {
continue;
}
if (alrm->minute != -1 && alrm->minute != now.tm_min) {
continue;
}
alrm->callback(alrm->callback_arg);
}
}
This would then be called from your main while() loop with a frequency depending on the granularity of your alarm. Since the alarm_t defined above has a granularity of 1 minute, that is how often I will call check_alarms(). This example is for FreeRTOS but could be adapted as needs be.
while (1) {
TickType_t tick = xTaskGetTickCount();
if (tick - g_time_last_check > pdMS_TO_TICKS(60000)) {
g_time_last_check = tick;
check_alarms();
}
}
Efficiency
The above is not very efficient if you have a lot of alarms and / or each alarm should be triggered at high frequency. In such a case, you might consider algorithms which sort the alarms based on time until elapsed, and - instead of iterating over the entire array every time - instead employ a one-shot software timer to invoke the relevant callback. It probably isn't the worth the hassle unless you have a very large number of timers though.
Alternatives
Both FreeRTOS and the ESP-IDF include APIs for software timers - but these can not be (easily) used for alarms at a specific time of the day - which you state as a requirement. On the other hand - as you may already know - the ESP32 system-on-a-chip does have a built-in RTC, but I think it is best to use a higher-level interface than communicating with it directly.
A real-time clock is what you are looking for. I've never worked with the ESP32 but apparently it has one and it is bad. However, 5% error might be good enough for you. If it's an option, I would use an external one.

Will this producer-consumer scenario with semaphores ever deadlock?

I have read through a whole bunch of producer-consumer problems that use semaphores, but I haven't been able to find an answer for this exact one. What I want to know is if this solution will ever deadlock?
semaphore loadedBuffer = 0
semaphore emptyBuffer = N where n>2
semaphore mutex = 1
Producer(){
P(emptyBuffers)
P(mutex)
//load buffer
V(loadedBuffers)
v(mutex)
Consumer(){
P(loadedBuffer)
P(mutex)
//empty buffer
V(mutex)
v(emptyBuffer)
I do believe this is a good solution, because I cannot find a circumstance where this would deadlock because any time the mutex semaphore is used, a thread cannot possibly be waiting on anything else.
Am I correct in assuming this is a good solution and will never deadlock?
It's not quite clear what are those P(), V() and v() in your algorithm, but in general you need just two semaphores as described in Wikipedia:
semaphore fillCount = 0; // items produced
semaphore emptyCount = BUFFER_SIZE; // remaining space
procedure producer()
{
while (true)
{
item = produceItem();
down(emptyCount);
putItemIntoBuffer(item);
up(fillCount);
}
}
procedure consumer()
{
while (true)
{
down(fillCount);
item = removeItemFromBuffer();
up(emptyCount);
consumeItem(item);
}
}
Source: https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem#Using_semaphores

Using MPI_Scatter with 3 processes

I am new to MPI , and my question is how the root(for example rank-0) initializes all its values (in the array) before other processes receive their i'th value from the root?
for example:
in the root i initialize: arr[0]=20,arr[1]=90,arr[2]=80.
My question is ,If i have for example process (number -2) that starts a little bit before the root process. Can the MPI_Scatter sends incorrect value instead 80?
How can i assure the root initialize all his memory before others use Scatter ?
Thank you !
The MPI standard specifies that
If comm is an intracommunicator, the outcome is as if the root executed n
send operations, MPI_Send(sendbuf+i, sendcount, extent(sendtype), sendcount, sendtype, i,...), and each process executed a receive, MPI_Recv(recvbuf, recvcount, recvtype, i,...).
This means that all the non-root processes will wait until their recvcount respective elements have been transmitted. This is also known as synchronized routine (the process waits until the communication is completed).
You as the programmer are responsible of ensuring that the data being sent is correct by the time you call any communication routine and until the send buffer available again (in this case, until MPI_Scatter returns). In a MPI only program, this is as simple as placing the initialization code before the call to MPI_Scatter, as each process executes the program sequentially.
The following is an example based in the document's Example 5.11:
MPI_Comm comm = MPI_COMM_WORLD;
int grank, gsize,*sendbuf;
int root, rbuf[100];
MPI_Comm_rank( comm, &grank );
MPI_Comm_size(comm, &gsize);
root = 0;
if( grank == root ) {
sendbuf = (int *)malloc(gsize*100*sizeof(int));
// Initialize sendbuf. None of its values are valid at this point.
for( int i = 0; i < gsize * 100; i++ )
sendbuf[i] = i;
}
rbuf = (int *)malloc(100*sizeof(int));
// Distribute sendbuf data
// At the root process, all sendbuf values are valid
// In non-root processes, sendbuf argument is ignored.
MPI_Scatter(sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);
MPI_Scatter() is a collective operation, so the MPI library does take care of everything, and the outcome of a collective operation does not depend on which rank called earlier than an other.
In this specific case, a non root rank will block (at least) until the root rank calls MPI_Scatter().
This is no different than a MPI_Send() / MPI_Recv().
MPI_Recv() blocks if called before the remote peer MPI_Send() a matching message.

segmentation fault when using shared memory created by open_shm on Xeon Phi

I have written my code for single Xeon Phi node( with 61 cores on it). I have two files. I have called MPI_Init(2) before calling any other mpi calls. I have found ntasks, rank also using mpi calls. I have also included all the required libraries. Still i get an error. Can you please help me out with this?
In file 1:
int buffsize;
int *sendbuff,**recvbuff,buffsum;
int *shareRegion;
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
sendbuff=(int *)malloc(sizeof(int)*buffsize);
if( taskid == 0 ){
recvbuff=(int **)malloc(sizeof(int *)*ntasks);
recvbuff[0]=(int *)malloc(sizeof(int)*ntasks*buffsize);
for(i=1;i<ntasks;i++)recvbuff[i]=recvbuff[i-1]+buffsize;
}
else{
recvbuff=(int **)malloc(sizeof(int *)*1);
recvbuff[0]=(int *)malloc(sizeof(int)*1);
}
for(i=0;i<buffsize;i++){
sendbuff[i]=1;
MPI_Barrier(MPI_COMM_WORLD);
call(sendbuff, buffsize, shareRegion, recvbuff[0],buffsize,taskid,ntasks);
In file 2:
void* gInit( MPI_Comm comm, int size, int num_proc)
{
int share_mem = shm_open("share_region", O_CREAT|O_RDWR,0666 );
if( share_mem == -1)
return NULL;
int rank;
MPI_Comm_rank(comm,&rank);
if( ftruncate( share_mem, sizeof(int)*size*num_proc) == -1 )
return NULL;
int* shared = mmap(NULL, sizeof(int)*size*num_proc, PROT_WRITE | PROT_READ, MAP_SHARED, share_mem, 0);
if(shared == (void*)-1)
printf("error in mem allocation (mmap)\n");
*(shared+(rank)) = 0
MPI_Barrier(MPI_COMM_WORLD);
return shared;
}
void call(int *sendbuff, int sendcount, volatile int *sharedRegion, int **recvbuff, int recvcount, int rank, int size)
{
int i=0;
int k,j;
j=rank*sendcount;
for(i=0;i<sendcount;i++)
{
sharedRegion[j] = sendbuff[i];
j++;
}
if( rank == 0)
for(k=0;k<size;k++)
for(i=0;i<sendcount;i++)
{
j=0;
recvbuff[k][i] = sharedRegion[j];
j++;
}
}
Then i am doing some computation in file 1 on this recvbuff.
I get this segmentation fault while using sharedRegion variable.
MPI represents the Message Passing paradigm. That means, processes (ranks) are isolated and are generally running on a distributed machine. They communicate via explicit communication messages, recent versions allow also one-sideded, but still explicit, data transfer. You can not assume that shared memory is available for the processes. Have a look at any MPI tutorial to see how MPI is used.
Since you did not specify on what kind of machine you are running, any further suggestion is purely speculative. If you actually are on a shared memory machine, you may want to use a real shared memory paradigm instead, e.g. OpenMP.
While it's possible to restrict MPI to only use one machine and have shared memory (see the RMA chapter, especially in MPI-3), if you're only ever going to use one machine, it's easier to use some other paradigm.
However, if you're going to use multiple nodes and have multiple ranks on one node (multi-core processes for example), then it might be worth taking a look at MPI-3 RMA to see how it can help you with both locally shared memory and remote memory access. There are multiple papers out on the subject, but because they're so new, there's not a lot of good tutorials yet. You'll have to dig around a bit to find something useful to you.
The ordering of these two lines:
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
suggest that buffsize could possibly have different values before and after the call to gInit. If buffsize as passed in the first argument to the program is larger than its initial value while gInit is called, then out-of-bounds memory access would occur later and lead to a segmentation fault.
Hint: run your code as an MPI singleton (e.g. without mpirun) from inside a debugger (e.g. gdb) or change the limits so that cores would get dumped on error (e.g. with ulimit -c unlimited) and then examine the core file(s) with the debugger. Compiling with debug information (e.g. adding -g to the compiler options) helps a lot in such cases.

Resources