Parallelizing recursive function through MPI?

Can we parallelize a recursive function using MPI?
I am trying to parallelize the quick sort function, but I don't know whether that works in MPI because the function is recursive. I would also like to know where the parallel region should go.
// quickSort.c
#include <stdio.h>

void quickSort( int[], int, int);
int partition( int[], int, int);

int main(void)
{
    int a[] = { 7, 12, 1, -2, 0, 15, 4, 11, 9};
    int i;
    printf("\n\nUnsorted array is: ");
    for(i = 0; i < 9; ++i)
        printf(" %d ", a[i]);
    quickSort( a, 0, 8);
    printf("\n\nSorted array is: ");
    for(i = 0; i < 9; ++i)
        printf(" %d ", a[i]);
    return 0;
}

void quickSort( int a[], int l, int r)
{
    int j;
    if( l < r )
    {
        // divide and conquer
        j = partition( a, l, r);
        quickSort( a, l, j-1);
        quickSort( a, j+1, r);
    }
}

int partition( int a[], int l, int r) {
    int pivot, i, j, t;
    pivot = a[l];
    i = l; j = r+1;
    while( 1)
    {
        // test the bound before reading a[i], so a[r+1] is never touched
        do ++i; while( i <= r && a[i] <= pivot );
        do --j; while( a[j] > pivot );
        if( i >= j ) break;
        t = a[i]; a[i] = a[j]; a[j] = t;
    }
    t = a[l]; a[l] = a[j]; a[j] = t;
    return j;
}
I would also really appreciate it if there is another simpler code for the quick sort.

Well, technically you can, but I'm afraid this would be efficient only on an SMP system. And does the array fit on a single node? If not, then you cannot perform even the first pass of a quick sort.
If you really need to sort an array on a parallel system using MPI, you might want to consider using merge sort instead (of course you still can use quick sort for single blocks at each node, before you begin merging the blocks).
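A minimal single-process sketch of that idea, assuming two blocks stand in for the data held by two nodes: each block is sorted locally with the C library's qsort, and the sorted blocks are then merged. The names sort_block and merge_blocks are hypothetical, and no MPI calls are shown; in a real program the blocks would travel between processes through sends and receives or a collective call.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending ints. */
static int cmp_int(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Step 1: what each node would do with its own block. */
void sort_block(int *block, size_t n)
{
    qsort(block, n, sizeof *block, cmp_int);
}

/* Step 2: merge two already sorted blocks into out (room for na + nb ints). */
void merge_blocks(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    assert(out != NULL);
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}
```

With more than two blocks, the merge step would be repeated pairwise until a single sorted sequence remains.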
If you still want to use quick sort but the recursive version confuses you, here is a sketch of a non-recursive algorithm, which hopefully can be parallelized a bit more easily, although it's essentially the same:
std::stack<std::pair<int, int> > unsorted;
unsorted.push(std::make_pair(0, size-1));
while (!unsorted.empty()) {
    std::pair<int, int> u = unsorted.top();
    unsorted.pop();
    int m = partition(A, u.first, u.second);
    // here you can send one of intervals to another node instead of
    // pushing it into the stack, so it would be processed in parallel.
    if (m+1 < u.second) unsorted.push(std::make_pair(m+1, u.second));
    if (u.first < m-1) unsorted.push(std::make_pair(u.first, m-1));
}
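The same stack-driven idea can be written in plain C, to match the code in the question. This is only a sketch under assumptions: the names quicksort_iterative and partition_range are hypothetical, the partition step is a Lomuto-style one rather than the question's, and the explicit stack is a heap-allocated array of interval bounds.

```c
#include <assert.h>
#include <stdlib.h>

/* Lomuto-style partition: returns the final index of the pivot a[r]. */
static int partition_range(int a[], int l, int r)
{
    int pivot = a[r], i = l, t;
    for (int j = l; j < r; ++j) {
        if (a[j] < pivot) {
            t = a[i]; a[i] = a[j]; a[j] = t;
            ++i;
        }
    }
    t = a[i]; a[i] = a[r]; a[r] = t;
    return i;
}

/* Quicksort driven by an explicit stack of [lo, hi] intervals. */
void quicksort_iterative(int a[], int n)
{
    if (n < 2) return;
    /* Pending intervals are disjoint and hold at least two elements each,
       so 2 * n ints is more than enough room for their bounds. */
    int *stack = malloc(2 * (size_t)n * sizeof *stack);
    assert(stack != NULL);
    int top = 0;
    stack[top++] = 0;
    stack[top++] = n - 1;
    while (top > 0) {
        int hi = stack[--top];
        int lo = stack[--top];
        int m = partition_range(a, lo, hi);
        /* Either interval could be handed to another worker here
           instead of being pushed. */
        if (m + 1 < hi) { stack[top++] = m + 1; stack[top++] = hi; }
        if (lo < m - 1) { stack[top++] = lo;    stack[top++] = m - 1; }
    }
    free(stack);
}
```

Calling quicksort_iterative(a, 9) on the array from the question sorts it in place.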

Theoretically "anything" can be parallelized using MPI, but remember that MPI isn't doing any parallelization itself. It's just providing the communication layer between processes. As long as all of your sends and receives (or collective calls) match up, it's a correct program for the most part. That being said, it may not be the most efficient thing to use MPI, depending on your algorithm. If you are going to be sorting lots and lots of data (more than can fit in the memory of one node) then it could be efficient to use MPI (you probably want to take a look at the RMA chapter in that case) or some other higher level library that might make things even simpler for this type of application (UPC, Co-array Fortran, SHMEM, etc.).


Access components of vector data type (e.g. int4) as array of scalars in OpenCL

I have a kernel which evaluates the interaction between all pairs of neighbors of an atom. Each atom has at most 4 neighbors, so I store their indexes in an int4. But in order to loop over these neighbors I need to access them by index (neighs[0] rather than neighs.x).
The loop should look something like:
int iatom = get_global_id(0);
int4 ng = neighs[iatom]; // each atom has 4 neighbors
float4 p0 = atom_pos[iatom];
float4 force = (float4)(0.f,0.f,0.f,0.f);
for(int i=0; i<4; i++){
    int ing = ng[i]; // HERE: index into vector
    float4 pi = atom_pos[ing];
    for(int j=i+1; j<4; j++){
        int jng = ng[j]; // HERE: index into vector
        float4 pj = atom_pos[jng];
        force += evalInteraction( p0, pi, pj );
    }
}
I have some ideas about how it can probably be done, but I am not sure:
Unroll the loops
Since there are just 4*3/2=6 pair interactions, it would probably be even more efficient. But it would be much less readable and more difficult to modify.
Cast int4 to int*
But is it fine? Doesn't it break something? Doesn't it cause a performance issue? I mean this:
int4 ng_ = neighs[iatom]; // make sure we copy it to local memory or register
int* ng = (int*)&ng_; // pointer to local memory can be optimized out, right ?
for(int i=0; i<4; i++){
    int ing = ng[i];
}
You can cast directly, but you can also declare a union for easier access:
union {
    int components[4];
    int4 vector;
} neighbors;
neighbors.vector = ng;
neighbors.components[i]; // Works now
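The union trick can be tried out on the host in plain C. In this sketch int4_like is a hypothetical stand-in for OpenCL's int4 (an assumption made only so the example compiles without an OpenCL toolchain), and neighbor_pairs is an illustrative helper that walks the same i < j pair loop as the kernel in the question.

```c
#include <assert.h>

/* Plain-C stand-in for OpenCL's int4 (assumption for illustration only). */
typedef struct { int x, y, z, w; } int4_like;

/* Overlay the four named fields with an array of four ints. */
typedef union {
    int4_like vector;
    int components[4];
} neighbors_t;

/* Visit each unordered pair of neighbors once. Writes the index pairs to
   out_pairs (room for 6 pairs) and returns how many pairs were written. */
int neighbor_pairs(int4_like ng, int out_pairs[6][2])
{
    neighbors_t n;
    /* The overlay only works if the struct has no padding. */
    assert(sizeof(int4_like) == sizeof n.components);
    n.vector = ng;
    int count = 0;
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            out_pairs[count][0] = n.components[i];
            out_pairs[count][1] = n.components[j];
            count++;
        }
    }
    return count;
}
```

For four neighbors the loop produces the 4*3/2 = 6 pairs mentioned in the question.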

Heap Corruption error when calling C from R, can't find the source issue

UPDATE3: Problem is solved but I'm leaving the code here as-is for future reference--I've posted an answer below with the final state of the code in case people wanted to see the final product.
UPDATE2: Refactored to use R_alloc instead of calloc for automated cleanup. Unfortunately the problem persists.
UPDATE: If I add this line right before UNPROTECT(1):
Rprintf("%p %p %p", (void *)rans, (void *)fm, (void *)corrs);
then the function executes with no corrupted heap error. Maybe there's a background garbage collection call that corrupts one of the pointers prior to execution finishing, resulting in a write to a garbage pointer? Important to note here that if I don't print out all three of the pointer addresses, the error comes back.
Also I'm running this on an M1 Mac and compiling with clang via R CMD SHLIB, in case Apple silicon is to blame.
I'm at my wits end trying to debug this issue, and I figured I'd turn to SO for help. I'm writing a function in C to optimize some parts of my R code, and I'm getting a Heap Corruption Error when running the function many times. The function trimCovar() is called from R using the .Call("trimCovar", ...) interface.
I'm having a lot of difficulty debugging this for a few reasons:
I'm on OSX, so I can't use Valgrind
C function depends on inputs from R, so I can't debug the C code on its own
Heap corruption only occurs when calling the function many times within an R function
(just running .Call directly a bunch of times has no errors)
Error point is inconsistent
I start with two sets of vectors, and I condense them into a frequency matrix, where each column is a position in the vector set, and each row is a particular character that appears. I concatenate them into one matrix prior to passing in because it makes pre-processing easier. A toy example of the frequency matrix would be:
v1_1 = 101
v1_2 = 011
v2_1 = 111
v2_2 = 110
Frequency Matrix:
position: | 1_1 | 1_2 | 1_3 | 2_1 | 2_2 | 2_3 |
0: 0.5 0.5 0.0 0.0 0.0 0.5
1: 0.5 0.5 1.0 1.0 1.0 0.5
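The toy example above can be reproduced with a small sketch. It assumes the vectors are strings over the characters '0' and '1' and that the matrix is stored column-major with two rows per position, as R would pass it; freq_matrix is a hypothetical helper, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Fill fm (column-major, 2 rows per column) with the frequency of '0'
   (row 0) and '1' (row 1) at each position of nvec equal-length strings. */
void freq_matrix(const char *const *vecs, size_t nvec, size_t len, double *fm)
{
    assert(nvec > 0);
    for (size_t pos = 0; pos < len; pos++) {
        size_t ones = 0;
        for (size_t v = 0; v < nvec; v++)
            if (vecs[v][pos] == '1') ones++;
        fm[2 * pos + 1] = (double)ones / (double)nvec;
        fm[2 * pos]     = 1.0 - fm[2 * pos + 1];
    }
}
```

Calling it once per vector set, with the second call writing at an offset of 2 * len doubles, gives the concatenated matrix shown above.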
The goal is to find the NV highest correlated positions across the vector sets, which I do by calculating pairwise KL divergence of positions. These are stored in a linked list sorted in ascending order, and at the end I take the positions corresponding to the first NV entries. The R code I have can deparse everything else, so I really just need a vector of positions at the end (duplicates are allowed).
The function takes in 5 arguments:
fMAT: a frequency matrix (RObject, so gets read in as a flat vector)
fSP : columns in matrix corresponding to positions from the first vector set
sSP : same as fSP but for second vector set
NV : Number of values to return
NR : Number of columns in fMAT
The error returned is:
R(95564,0x104858580) malloc: Heap corruption detected, free list is damaged at 0x600000f10040
*** Incorrect guard value: 4626885667169763328
R(95564,0x104858580) malloc: *** set a breakpoint in malloc_error_break to debug
This only happens when I run an R function that calls this 10+ times, so I'm assuming that I'm just missing one or two small hanging pointers corrupting a memory reference. I've tried running this with gc() called in R immediately after each call, but it doesn't fix the problem. I'm not really sure what else to do at this point, I've tried using lldb but I'm not really sure how to use that program. From running lots of print statements I've determined that it usually crashes in the main loop (identified in code below), but it's inconsistent on when it crashes. I've also tried saving off erroneous inputs--I can rerun them individually with no issues, so it must be something relatively small that only appears over many runs.
Happy to provide more details if it would help. Code is listed at the bottom.
The only thing being allocated here are linked list nodes, and I thought I had free()'d them all prior to returning. I've also double checked the input values, so I'm 99.99% sure that I'm never referencing out of bounds on firstSeqPos, secondSeqPos, ans, or fm. I've also triple checked the R code surrounding this and can confidently say it is not the source of this error.
I haven't coded in C in a long time so I feel like I'm missing something obvious. If I really have to I can try to get ahold of a Linux box to run valgrind, but if there's another option I'd prefer it. Thanks in advance!
#include <R.h>
#include <Rdefines.h>
#include <Rinternals.h>
#include <math.h>
#include <stdlib.h>
#include <stdbool.h>

typedef struct node {
  double data;
  int i1;
  int i2;
  struct node *next;
} node;

// Linked list
// data is the correlation value,
// i1 the position from first vector set,
// i2 the position from second vector set
node *makeNewNode(double data, int i1, int i2){
  node *newNode;
  newNode = (node *)R_alloc(1, sizeof(node));
  newNode->data = data;
  newNode->i1 = i1;
  newNode->i2 = i2;
  newNode->next = NULL;
  return newNode;
}

//insert link in sorted order (ascending)
void insertSorted(node **head, node *toInsert, int maxSize) {
  int ctr = 0;
  if ((*head) == NULL || (*head)->data >= toInsert->data){
    toInsert->next = *head;
    *head = toInsert;
  } else {
    node *temp = *head;
    while (temp->next != NULL && temp->next->data < toInsert->data){
      temp = temp->next;
      if (ctr == maxSize){
        // Performance optimization, if we aren't inserting in the first NR
        // positions then we can just skip since we only care about the NR
        // lowest scores overall
        return;
      }
      ctr += 1;
    }
    toInsert->next = temp->next;
    temp->next = toInsert;
  }
}

// (This is the one that crashes)
SEXP trimCovar(SEXP fMAT, SEXP fSP, SEXP sSP, SEXP NV, SEXP NR){
  // Converting input SEXPs into C-compatible values
  int nv = asInteger(NV);
  int nr = asInteger(NR);
  int sp1l = length(fSP);
  int sp2l = length(sSP);
  int *firstSeqPos = INTEGER(coerceVector(fSP, INTSXP));
  int *secondSeqPos = INTEGER(coerceVector(sSP, INTSXP));
  double *fm = REAL(fMAT);
  int colv1, colv2;
  // Using a linked list for efficient insert
  node *corrs = NULL;
  int cv1, cv2;
  double p1, p2, score=0;
  for ( int i=0; i<sp1l; i++ ){
    cv1 = firstSeqPos[i];
    colv1 = (cv1 - 1) * nr;
    for ( int j=0; j<sp2l; j++ ){
      cv2 = secondSeqPos[j];
      colv2 = (cv2 - 1) * nr;
      // KL Divergence
      score = 0;
      for ( int k=0; k<nr; k++){
        p1 = fm[colv1 + k];
        p2 = fm[colv2 + k];
        if (p1 != 0 && p2 != 0){
          score += p1 * log(p1 / p2);
        }
      }
      // Add result into LL
      node *newNode = makeNewNode(score, cv1, cv2);
      insertSorted(&corrs, newNode, nv);
    }
  }
  SEXP ans;
  PROTECT(ans = allocVector(INTSXP, 2*nv));
  int *rans = INTEGER(ans);
  int ctr=0;
  int pos1, pos2;
  node *ptr = corrs;
  for ( int i=0; i<nv; i++){
    rans[2*i] = ptr->i1;
    rans[2*i+1] = ptr->i2;
    ptr = ptr->next;
  }
  UNPROTECT(1);
  return ans;
}
int *firstSeqPos = INTEGER(coerceVector(fSP, INTSXP));
int *secondSeqPos = INTEGER(coerceVector(sSP, INTSXP));
This is not good. The SEXPs returned by the 2 calls to coerceVector() need to be protected. However it's usually considered better practice to do this coercion at the R level right before entering the .Call entry point. Note that if fSP and sSP are integer matrices, there's no need to coerce them to integer as they are already seen as integer vectors at the C level. This also avoids a possibly expensive copy (as.integer() in R and coerceVector() in C both trigger a full copy of the matrix data).
The question was answered above, but I received a couple of messages from people asking for the final code, so I'm going to include it as an answer to preserve the original question. There are a couple of optimizations here (thanks to @hpages for help and troubleshooting regarding these):
Original code fails because the output of coerceVector() wasn't protected with PROTECT(). I've refactored the R code to check for integer inputs prior to calling this C function to avoid this function call and be more efficient with memory (see the accepted answer for more details).
Original code uses R_alloc(), which gives responsibility to R to clean up memory at the end of the function call. However, this introduces substantial memory overhead during the runtime of the function, since memory allocated to nodes not inserted into the linked list isn't cleared until the end of the function call.
Allocation with calloc() isn't as simple as switching over and calling free() at the end of the function, since we have to guard the case where the user interrupts execution of the program. If an interrupt signal is thrown prior to the end of the function, we'll never free the memory.
Final C Code:
#include <R.h>
#include <Rdefines.h>
#include <Rinternals.h>
#include <math.h>
#include <stdlib.h>
#include <stdbool.h>

typedef struct node {
  double data;
  int i1;
  int i2;
  struct node *next;
} node;

// Defining the head as a static so that we can access it globally
// Important for ensuring clean up in case of interrupt
static node *corrs = NULL;

// Function to clean up memory allocations in case of interrupt
void cleanupFxn(){
  node *ptr = corrs;
  // Free allocated memory in linked list
  while (corrs != NULL){
    ptr = corrs;
    corrs = corrs->next;
    free(ptr);
  }
}

node *makeNewNode(double data, int i1, int i2){
  node *newNode;
  // very important to use calloc here so we have control of when we free it
  // R_alloc() memory won't be freed until after function finishes execution
  newNode = (node *)calloc(1, sizeof(node));
  newNode->data = data;
  newNode->i1 = i1;
  newNode->i2 = i2;
  newNode->next = NULL;
  return newNode;
}

// insert link in sorted order
// returns a bool corresponding to if we inserted
bool insertSorted(node **head, node *toInsert, int maxSize) {
  int ctr = 0;
  if ((*head) == NULL || (*head)->data >= toInsert->data){
    toInsert->next = *head;
    *head = toInsert;
    return true;
  } else {
    node *temp = *head;
    while (temp->next != NULL && temp->next->data < toInsert->data){
      temp = temp->next;
      if (ctr == maxSize){
        // Performance optimization, if we aren't inserting in the first NR
        // positions then we can just skip since we only care about the NR
        // lowest scores overall. Saves a huge amount of time and memory.
        return false;
      }
      ctr += 1;
    }
    toInsert->next = temp->next;
    temp->next = toInsert;
  }
  return true;
}

SEXP trimCovar(SEXP fMAT, SEXP fSP, SEXP sSP, SEXP NV, SEXP NR){
  // Converting inputs into C-compatible forms
  int nv = asInteger(NV);
  int nr = asInteger(NR);
  int sp1l = length(fSP);
  int sp2l = length(sSP);
  // Note here we're not using coerceVector() anymore
  // typechecking done on R side
  int *firstSeqPos = INTEGER(fSP);
  int *secondSeqPos = INTEGER(sSP);
  double *fm = REAL(fMAT);
  int colv1, colv2;
  // Using a linked list for efficient insert
  corrs = NULL;
  int cv1, cv2;
  double p1, p2, score=0;
  bool success;
  for ( int i=0; i<sp1l; i++ ){
    cv1 = firstSeqPos[i];
    colv1 = (cv1 - 1) * nr;
    for ( int j=0; j<sp2l; j++ ){
      cv2 = secondSeqPos[j];
      colv2 = (cv2 - 1) * nr;
      score = 0;
      for ( int k=0; k<nr; k++){
        p1 = fm[colv1 + k];
        p2 = fm[colv2 + k];
        if (p1 != 0 && p2 != 0){
          score += p1 * log(p1 / p2);
        }
      }
      node *newNode = makeNewNode(score, cv1, cv2);
      success = insertSorted(&corrs, newNode, nv);
      // If we don't insert, free the associated memory
      // I'm checking for NULL here just out of an abundance of caution
      if (!success && newNode != NULL){
        free(newNode);
        newNode = NULL;
      }
    }
  }
  SEXP ans;
  PROTECT(ans = allocVector(INTSXP, 2*nv));
  int *rans = INTEGER(ans);
  node *ptr=corrs;
  for ( int i=0; i<nv; i++){
    rans[2*i] = ptr->i1;
    rans[2*i+1] = ptr->i2;
    ptr = ptr->next;
  }
  // Free allocated memory in linked list
  cleanupFxn();
  UNPROTECT(1);
  return ans;
}
Assuming the C file is named trimCovar.c, we'd compile with R CMD SHLIB trimCovar.c.
R Code to run this function:
# Wrapped into a function with on.exit(...) to ensure cleanup
# in the event the user or system interrupts execution early
CorrComp_C <- function(fm, fsp, ssp, nv, nr){
  on.exit(.C('cleanupFxn'))
  # type checking to ensure input to C is integer vector
  # (could probably do more type checking here, mainly for illustration)
  stopifnot(is(fsp, 'integer'))
  stopifnot(is(ssp, 'integer'))
  a <- .Call('trimCovar', fm, fsp, ssp, nv, nr)
  return(a)
}

C code with openmp called from R gives inconsistent results

Below is a piece of C code run from R used to compare each row of a matrix to a vector. The number of identical values is stored in the first column of a two-column matrix.
I know it can easily be done in R (as done to check the results), but this is a first step for a more complex use case.
When OpenMP is not used, it works fine. When OpenMP is used, it gives results that are correlated (0.99) with the expected ones but inconsistent.
Question1: What am I doing wrong?
Question2: I use a double for loop to fill the output matrix (ret) with zeros. What would be a better solution?
Also, inconsistencies were observed when the code was used in a package. I tried to make the code reproducible using inline, but it does not recognize the openmp statements (I tried to include 'omp.h', in the parameters of cfunction, ...).
Question3: How can we make this code work with inline?
I'm (too?) far outside my comfort zone on this topic.
library(inline)
compare <- cfunction(c(x = "integer", vec = "integer"), "
  const int I = nrows(x), J = ncols(x);
  SEXP ret;
  PROTECT(ret = allocMatrix(INTSXP, I, 2));
  int *ptx = INTEGER(x), *ptvec = INTEGER(vec), *ptret = INTEGER(ret);
  for (int i=0; i<I; i++)
    for (int j=0; j<2; j++)
      ptret[j * I + i] = 0;
  int i, j;
  #pragma omp parallel for default(none) shared(ptx, ptvec, ptret) private(i,j)
  for (j=0; j<J; j++)
    for (i=0; i<I; i++)
      if (ptx[i + I * j] == ptvec[j]) {++ptret[i];}
  UNPROTECT(1);
  return ret;
")
N = 3e3
M = 1e4
m = matrix(sample(c(-1:1), N*M, replace = TRUE), nc = M)
v = sample(-1:1, M, replace = TRUE)
cc = compare(m, v)
cr = rowSums(t(t(m) == v))
all.equal(cc[,1], cr)
Thanks to the comments above, I reconsidered the data race issue.
IIUC, my loop was parallelized on j (the columns). Each thread then had its own value of i (the rows), but different threads could hold the same value of i at the same moment and so try to increment ptret[i] at the same time.
To avoid this, I now loop on i first, so that only a single thread will increment each row.
Then, I realized that I could move the zero-initialization of ptret within the first loop.
It seems to work. I get identical results, increased CPU usage, and 3-4x speedup on my laptop.
I guess that solves questions 1 and 2. I will have a closer look at the inline/openmp problem.
Code below, fwiw.
#include <omp.h>
#include <R.h>
#include <Rinternals.h>
#include <stdio.h>

SEXP c_compare(SEXP x, SEXP vec)
{
  const int I = nrows(x), J = ncols(x);
  SEXP ret;
  PROTECT(ret = allocMatrix(INTSXP, I, 2));
  int *ptx = INTEGER(x), *ptvec = INTEGER(vec), *ptret = INTEGER(ret);
  int i, j;
  #pragma omp parallel for default(none) shared(ptx, ptvec, ptret) private(i, j)
  for (i = 0; i < I; i++) {
    // init ptret to zero
    ptret[i] = 0;
    ptret[I + i] = 0;
    for (j = 0; j < J; j++)
      if (ptx[i + I * j] == ptvec[j]) {
        ++ptret[i];
      }
  }
  UNPROTECT(1);
  return ret;
}
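To check the row-parallel loop outside R, the same logic can be pulled into a function that takes plain arrays. This is a sketch under assumptions: compare_rows is a hypothetical name, the matrix is column-major as in R, and the per-row count is accumulated in a local variable before being stored.

```c
#include <assert.h>

/* x is an I-by-J column-major matrix and vec has J entries. counts[i]
   receives the number of columns j for which x[i, j] == vec[j]. */
void compare_rows(const int *x, const int *vec, int I, int J, int *counts)
{
    assert(I >= 0 && J >= 0);
    int i, j;
    /* Parallel over rows: each counts[i] is written by exactly one thread,
       so there is no data race. Without OpenMP enabled at compile time the
       pragma is ignored and the loop runs serially. */
    #pragma omp parallel for default(none) shared(x, vec, counts, I, J) private(i, j)
    for (i = 0; i < I; i++) {
        int matches = 0;
        for (j = 0; j < J; j++)
            if (x[i + I * j] == vec[j])
                matches++;
        counts[i] = matches;
    }
}
```

Because the result does not depend on the number of threads, the serial and parallel builds can be compared directly.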

Dynamically increase size of list in Rcpp

I am trying to implement a "coupling to the past" algorithm in Rcpp. For this I need to store a matrix of random numbers, and if the algorithm did not converge create a new matrix of random numbers and store that as well. This might have to be done 10+ times or something until convergence.
I was hoping I could use a List and dynamically update it, similar to what I would do in R. I was actually very surprised that it worked a bit, but I got errors whenever the list size became large. This seems to make sense, as I did not allocate the needed memory for the additional list elements, although I am not that familiar with C++ and not sure if that is the problem.
Here is an example of what I tried. However, be aware that this will probably crash your R session:
includes = '
NumericMatrix RandMat(int nrow, int ncol)
{
  int N = nrow * ncol;
  NumericMatrix Res(nrow,ncol);
  NumericVector Rands = runif(N);
  for (int i = 0; i < N; i++)
  {
    Res[i] = Rands[i];
  }
  return(Res);
}
'
code = '
void foo()
{
  // This is the relevant part, I create a list then update it and print the results:
  List x;
  for (int i=0; i<10; i++)
  {
    x[i] = RandMat(100,10);
  }
}
'
Does anyone know a way to do this without crashing R? I guess I could initiate the list at a fixed amount of elements here, but in my application the amount of elements is random.
You have to "allocate" enough space for your list. Maybe you can use something like a resize function:
List resize( const List& x, int n ){
    int oldsize = x.size() ;
    List y(n) ;
    for( int i=0; i<oldsize; i++) y[i] = x[i] ;
    return y ;
}
and whenever you want your list to be bigger than it is now, you can do:
x = resize( x, n ) ;
Your initial list is of size 0, so it is expected that you get unpredictable behavior at the first iteration of your loop.

Codility K-Sparse Test **Spoilers**

Have you tried the latest Codility test?
I felt like there was an error in the definition of what a K-Sparse number is that left me confused and I wasn't sure what the right way to proceed was. So it starts out by defining a K-Sparse Number:
In the binary number "100100010000" there are at least two 0s between
any two consecutive 1s. In the binary number "100010000100010" there
are at least three 0s between any two consecutive 1s. A positive
integer N is called K-sparse if there are at least K 0s between any
two consecutive 1s in its binary representation. (My emphasis)
So the first number you see, 100100010000 is 2-sparse and the second one, 100010000100010, is 3-sparse. Pretty simple, but then it gets down into the algorithm:
Write a function:
class Solution { public int sparse_binary_count(String S,String T,int K); }
that, given:
string S containing a binary representation of some positive integer A,
string T containing a binary representation of some positive integer B,
a positive integer K.
returns the number of K-sparse integers within the range [A..B] (both
ends included)
and then states this test case:
For example, given S = "101" (A = 5), T = "1111" (B=15) and K=2, the
function should return 2, because there are just two 2-sparse integers
in the range [5..15], namely "1000" (i.e. 8) and "1001" (i.e. 9).
Basically it is saying that 8, or 1000 in base 2, is a 2-sparse number, even though it does not have two consecutive ones in its binary representation. What gives? Am I missing something here?
I tried solving that one. The problem's assumption that binary representations of powers of two are K-sparse by default is somewhat confusing and counterintuitive.
What I understood was that 8 --> 1000 is 2 to the power 3, so 8 is 3-sparse; 16 --> 10000 is 2 to the power 4, and hence 4-sparse.
Even if we assume that is true, and if you are interested, below is my solution code (C) for this problem. It doesn't handle some cases correctly where there are powers of two between the two input numbers; I'm trying to see if I can fix that:
int sparse_binary_count (const string &S,const string &T,int K)
char buf[50];
char *str1,*tptr,*Sstr,*Tstr;
int i,len1,len2,cnt=0;
long int num1,num2;
char *pend,*ch;
Sstr = (char *)S.c_str();
Tstr = (char *)T.c_str();
str1 = (char *)malloc(300001);
tptr = str1;
num1 = strtol(Sstr,&pend,2);
num2 = strtol(Tstr,&pend,2);
buf[i] = '0';
buf[i] = '\0';
str1 = tptr;
if( (i & (i-1))==0)
if(i >= (pow((float)2,(float)K)))
str1 = myitoa(i,str1,2);
ch = strstr(str1,buf);
if(ch == NULL)
if((i % 2) != 0)
return cnt;
char* myitoa(int val, char *buf, int base){
int i = 299999;
int cnt=0;
for(; val && i ; --i, val /= base)
buf[i] = "0123456789abcdef"[val % base];
buf[i+cnt+1] = '\0';
return &buf[i+1];
There was a note within the test details showing this specific case. According to that note, any power of 2 is considered K-sparse for any K.
You can solve this simply with binary operations on integers. You can even tell that no K-sparse integers exist above some specific integer and below (or equal to) the integer represented by T.
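A sketch of such a check using shifts and masks, assuming the values fit in an unsigned long (the real task passes binary strings, which can be far longer). The function names are hypothetical. It also shows why a power of two passes: with a single 1 bit there is no pair of consecutive 1s to violate the condition.

```c
#include <assert.h>

/* Returns 1 if every pair of consecutive 1 bits in n has at least k zeros
   between them. A number with a single 1 bit has no such pair, so it
   passes for any k. */
int is_k_sparse(unsigned long n, int k)
{
    assert(k >= 0);
    int gap = -1;               /* zeros seen since the last 1; -1 = no 1 yet */
    for (; n != 0; n >>= 1) {
        if (n & 1UL) {
            if (gap >= 0 && gap < k)
                return 0;       /* two 1s too close together */
            gap = 0;
        } else if (gap >= 0) {
            gap++;
        }
    }
    return 1;
}

/* Brute-force count over [a, b]; only practical for small ranges, and
   b must be smaller than the largest unsigned long. */
unsigned long count_k_sparse(unsigned long a, unsigned long b, int k)
{
    unsigned long count = 0;
    for (unsigned long n = a; n <= b; n++)
        count += (unsigned long)is_k_sparse(n, k);
    return count;
}
```

For the example in the question, count_k_sparse(5, 15, 2) finds exactly 8 and 9.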
As far as I can see, you must also pay a lot of attention to performance, as there are sometimes hundreds of millions of integers to be checked.
My own solution, written in Python, which works very efficiently even on large ranges of integers and was successfully tested on many inputs, has failed. The results were not very descriptive, saying only that it does not work as required by the question (although it meets all the requirements in my opinion).
Solution with bitwise operators:
With 32 bits per int on a 32-bit system, check at each shift whether a 1 bit has another 1 within the K bits above it (for K=2, patterns like 1001 and 1000 pass), count the numbers that never fail this check, and repeat this for all numbers in the range.
#include <stdio.h>

int KsparseNumbers(int a, int b, int s) {
    int nbits = sizeof(int)*8;
    int slen = 0;   // mask with the s lowest bits set
    int scount = 0;
    int i = 0;
    for (; i < s; ++i) {
        slen += 1 << i;
    }
    printf("\n slen = %d\n", slen);
    for(; a <= b; ++a) {
        int num = a;
        int sparse = 1;
        for(i = 0 ; i < nbits-2; ++i) {
            // a 1 bit with another 1 among the s bits above it is not sparse
            if ( (num & 1) && ((num >> 1) & slen) ) {
                sparse = 0;
                break;
            }
            num >>=1;
        }
        if (sparse) {
            scount++;
            printf("\n Scount = %d\n", scount);
        }
    }
    return scount;
}

int main() {
    printf("\n No of 2-sparse numbers between 5 and 15 = %d\n", KsparseNumbers(5, 15, 2));
    return 0;
}
