void container_row_change(struct brick_win_size *win, int character)
{
    row_container *container = &(win->container[win->current_row]);
    /*
    int offset = win->current_column;
    char *data = win->container[win->current_row].data;
    if(offset < 0 || offset >= win->col)
        offset = container->size;
    data = realloc(data, win->container[win->current_row].size + 2);
    memmove(&data[offset + 1], &data[offset], (win->container[win->current_row].size - offset + 1));
    data[offset] = character;
    win->current_column++;
    container->size++;
    */
    int offset = win->current_column;
    if(offset < 0 || offset >= win->col)
        offset = container->size;
    win->container[win->current_row].data = realloc(win->container[win->current_row].data, win->container[win->current_row].size + 2);
    memmove(&(win->container[win->current_row].data[offset + 1]), &(win->container[win->current_row].data[offset]), win->container[win->current_row].size - offset + 1);
    win->container[win->current_row].data[offset] = character;
    win->current_column++;
    win->container[win->current_row].size++;
}
Can anybody tell me why the commented-out code fails while the other version doesn't, even though both look the same?
I am wondering whether there are any errors in the way I assign pointers and reallocate them.
In the commented-out code, you realloc the local pointer data and never update the .data pointer field in the data structure, so it continues to point at the old (now freed) memory, leading to corruption when you try to use it.
Add the line win->container[win->current_row].data = data; to the commented-out code and it will actually be equivalent to the later code.
Note that in either case, if realloc fails, you'll crash -- you should be checking for failure and doing something appropriate (probably printing an error message and trying to exit gracefully).
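For completeness, here is a minimal sketch of the commented-out version with both fixes applied: the new pointer is written back into the structure, and realloc failure is checked as suggested above. The error handling shown is illustrative only (and fprintf needs <stdio.h>); adapt it to your program's conventions.
int offset = win->current_column;
char *data = win->container[win->current_row].data;
if(offset < 0 || offset >= win->col)
    offset = container->size;
char *tmp = realloc(data, container->size + 2);
if(tmp == NULL){
    fprintf(stderr, "out of memory\n"); /* old block is still valid here */
    return;                             /* bail out gracefully */
}
data = tmp;
memmove(&data[offset + 1], &data[offset], container->size - offset + 1);
data[offset] = character;
win->container[win->current_row].data = data; /* write the new pointer back */
win->current_column++;
container->size++;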
UPDATE3: Problem is solved but I'm leaving the code here as-is for future reference--I've posted an answer below with the final state of the code in case people wanted to see the final product.
UPDATE2: Refactored to use R_alloc instead of calloc for automated cleanup. Unfortunately the problem persists.
UPDATE: If I add this line right before UNPROTECT(1):
Rprintf("%p %p %p", (void *)rans, (void *)fm, (void *)corrs);
then the function executes with no corrupted heap error. Maybe there's a background garbage collection call that corrupts one of the pointers prior to execution finishing, resulting in a write to a garbage pointer? Important to note here that if I don't print out all three of the pointer addresses, the error comes back.
Also I'm running this on an M1 Mac and compiling with clang via R CMD SHLIB, in case Apple silicon is to blame.
I'm at my wits' end trying to debug this issue, and I figured I'd turn to SO for help. I'm writing a function in C to optimize some parts of my R code, and I'm getting a Heap Corruption Error when running the function many times. The function trimCovar() is called from R using the .Call("trimCovar", ...) interface.
I'm having a lot of difficulty debugging this for a few reasons:
I'm on OSX, so I can't use Valgrind
C function depends on inputs from R, so I can't debug the C code on its own
Heap corruption only occurs when calling the function many times within an R function
(just running .Call directly a bunch of times has no errors)
Error point is inconsistent
I start with two sets of vectors, and I condense them into a frequency matrix, where each column is a position in the vector set and each row is a particular character that appears. I concatenate them into one matrix prior to passing in because it makes pre-processing easier. A toy example of the frequency matrix would be:
INPUT:
v1_1 = 101
v1_2 = 011
v2_1 = 111
v2_2 = 110
Frequency Matrix:
position: | 1_1 | 1_2 | 1_3 | 2_1 | 2_2 | 2_3 |
0: 0.5 0.5 0.0 0.0 0.0 0.5
1: 0.5 0.5 1.0 1.0 1.0 0.5
The goal is to find the NV most highly correlated positions across the vector sets, which I do by calculating pairwise KL divergence of positions. These are stored in a linked list sorted in ascending order, and at the end I take the positions corresponding to the first NV entries. The R code I have can deparse everything else, so I really just need a vector of positions at the end (duplicates are allowed).
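(For reference, the quantity being computed per position pair is the KL divergence D(P‖Q) = Σᵢ pᵢ log(pᵢ/qᵢ), with terms where either probability is zero skipped, which matches the inner loop in the code below.)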
The function takes in 5 arguments:
fMAT: a frequency matrix (RObject, so gets read in as a flat vector)
fSP : columns in matrix corresponding to positions from the first vector set
sSP : same as fSP but for second vector set
NV : Number of values to return
NR : Number of columns in fMAT
The error returned is:
R(95564,0x104858580) malloc: Heap corruption detected, free list is damaged at 0x600000f10040
*** Incorrect guard value: 4626885667169763328
R(95564,0x104858580) malloc: *** set a breakpoint in malloc_error_break to debug
This only happens when I run an R function that calls this 10+ times, so I'm assuming I'm just missing one or two small dangling pointers corrupting a memory reference. I've tried running this with gc() called in R immediately after each call, but it doesn't fix the problem. I'm not really sure what else to do at this point; I've tried using lldb, but I'm not really sure how to use that program. From running lots of print statements I've determined that it usually crashes in the main loop (identified in the code below), but it's inconsistent about when it crashes. I've also tried saving off erroneous inputs--I can rerun them individually with no issues, so it must be something relatively small that only appears over many runs.
Happy to provide more details if it would help. Code is listed at the bottom.
The only things being allocated here are linked list nodes, and I thought I had free()'d them all prior to returning. I've also double-checked the input values, so I'm 99.99% sure that I'm never referencing out of bounds on firstSeqPos, secondSeqPos, ans, or fm. I've also triple-checked the R code surrounding this and can confidently say it is not the source of this error.
I haven't coded in C in a long time so I feel like I'm missing something obvious. If I really have to I can try to get ahold of a Linux box to run valgrind, but if there's another option I'd prefer it. Thanks in advance!
Code:
#include <R.h>
#include <Rdefines.h>
#include <Rinternals.h>
#include <math.h>
#include <stdlib.h>
#include <stdbool.h>
typedef struct node {
double data;
int i1;
int i2;
struct node *next;
} node;
// Linked list
// data is the correlation value,
// i1 the position from first vector set,
// i2 the position from second vector set
node *makeNewNode(double data, int i1, int i2){
node *newNode;
newNode = (node *)R_alloc(1, sizeof(node));
newNode->data = data;
newNode->i1 = i1;
newNode->i2 = i2;
newNode->next = NULL;
return(newNode);
}
//insert link in sorted order (ascending)
void insertSorted(node **head, node *toInsert, int maxSize) {
int ctr = 0;
if ((*head) == NULL || (*head)->data >= toInsert->data){
toInsert->next = *head;
*head = toInsert;
} else {
node *temp = *head;
while (temp->next != NULL && temp->next->data < toInsert->data){
temp = temp->next;
if (ctr == maxSize){
// Performance optimization, if we aren't inserting in the first NR
// positions then we can just skip since we only care about the NR
// lowest scores overall
return;
}
ctr += 1;
}
toInsert->next = temp->next;
temp->next = toInsert;
}
}
// MAIN FUNCTION CALLED FROM R
// (This is the one that crashes)
SEXP trimCovar(SEXP fMAT, SEXP fSP, SEXP sSP, SEXP NV, SEXP NR){
// Converting input SEXPs into C-compatible values
int nv = asInteger(NV);
int nr = asInteger(NR);
int sp1l = length(fSP);
int sp2l = length(sSP);
int *firstSeqPos = INTEGER(coerceVector(fSP, INTSXP));
int *secondSeqPos = INTEGER(coerceVector(sSP, INTSXP));
double *fm = REAL(fMAT);
int colv1, colv2;
// Using a linked list for efficient insert
node *corrs = NULL;
int cv1, cv2;
double p1, p2, score=0;
// USUALLY FAILS IN THIS LOOP
for ( int i=0; i<sp1l; i++ ){
cv1 = firstSeqPos[i];
colv1 = (cv1 - 1) * nr;
for ( int j=0; j<sp2l; j++ ){
cv2 = secondSeqPos[j];
colv2 = (cv2 - 1) * nr;
// KL Divergence
score = 0;
for ( int k=0; k<nr; k++){
p1 = fm[colv1 + k];
p2 = fm[colv2 + k];
if (p1 != 0 && p2 != 0){
score += p1 * log(p1 / p2);
}
}
// Add result into LL
node *newNode = makeNewNode(score, cv1, cv2);
insertSorted(&corrs, newNode, nv);
}
R_CheckUserInterrupt();
}
SEXP ans;
PROTECT(ans = allocVector(INTSXP, 2*nv));
int *rans = INTEGER(ans);
int ctr=0;
int pos1, pos2;
node *ptr = corrs;
for ( int i=0; i<nv; i++){
rans[2*i] = ptr->i1;
rans[2*i+1] = ptr->i2;
ptr = ptr->next;
}
UNPROTECT(1);
return(ans);
}
int *firstSeqPos = INTEGER(coerceVector(fSP, INTSXP));
int *secondSeqPos = INTEGER(coerceVector(sSP, INTSXP));
This is not good. The SEXPs returned by the 2 calls to coerceVector() need to be protected. However it's usually considered better practice to do this coercion at the R level right before entering the .Call entry point. Note that if fSP and sSP are integer matrices, there's no need to coerce them to integer as they are already seen as integer vectors at the C level. This also avoids a possibly expensive copy (as.integer() in R and coerceVector() in C both trigger a full copy of the matrix data).
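If you do want to keep the coercion at the C level, a minimal sketch of the protected version would look like this (fSP_int and sSP_int are illustrative names; note the UNPROTECT count must account for them in addition to ans):
SEXP fSP_int = PROTECT(coerceVector(fSP, INTSXP));
SEXP sSP_int = PROTECT(coerceVector(sSP, INTSXP));
int *firstSeqPos = INTEGER(fSP_int);
int *secondSeqPos = INTEGER(sSP_int);
/* ... body of trimCovar unchanged ... */
UNPROTECT(3); /* ans, fSP_int, sSP_int */
return(ans);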
The question was answered above, but I received a couple of messages from people asking for the final code, so I'm going to include it as an answer to preserve the original question. There are a couple of optimizations here (thanks to @hpages for help and troubleshooting regarding these):
Original code fails because the output of coerceVector() wasn't protected with PROTECT(). I've refactored the R code to check for integer inputs prior to calling this C function, to avoid this function call and be more efficient with memory (see the accepted answer for more details).
Original code uses R_alloc(), which gives responsibility to R to clean up memory at the end of the function call. However, this introduces substantial memory overhead during the runtime of the function, since memory allocated for nodes not inserted into the linked list isn't reclaimed until the end of the function call.
Allocation with calloc() isn't as simple as switching over and calling free() at the end of the function, since we have to guard against the case where the user interrupts execution of the program. If an interrupt signal is thrown prior to the end of the function, we'd never free the memory.
Final C Code:
#include <R.h>
#include <Rdefines.h>
#include <Rinternals.h>
#include <math.h>
#include <stdlib.h>
#include <stdbool.h>
typedef struct node {
double data;
int i1;
int i2;
struct node *next;
} node;
// Defining the head as a static so that we can access it globally
// Important for ensuring clean up in case of interrupt
static node *corrs = NULL;
// Function to clean up memory allocations in case of interrupt
void cleanupFxn(){
node *ptr = corrs;
// Free allocated memory in linked list
while (corrs != NULL){
ptr = corrs;
corrs = corrs->next;
free(ptr);
}
}
node *makeNewNode(double data, int i1, int i2){
node *newNode;
// very important to use calloc here so we have control of when we free it
// R_alloc() memory won't be freed until after function finishes execution
newNode = (node *)calloc(1, sizeof(node));
newNode->data = data;
newNode->i1 = i1;
newNode->i2 = i2;
newNode->next = NULL;
return(newNode);
}
// insert link in sorted order
// returns a bool corresponding to if we inserted
bool insertSorted(node **head, node *toInsert, int maxSize) {
int ctr = 0;
if ((*head) == NULL || (*head)->data >= toInsert->data){
toInsert->next = *head;
*head = toInsert;
return(true);
} else {
node *temp = *head;
while (temp->next != NULL && temp->next->data < toInsert->data){
temp = temp->next;
if (ctr == maxSize){
// Performance optimization, if we aren't inserting in the first NR
// positions then we can just skip since we only care about the NR
// lowest scores overall. Saves a huge amount of time and memory.
return(false);
}
ctr += 1;
}
toInsert->next = temp->next;
temp->next = toInsert;
return(true);
}
}
SEXP trimCovar(SEXP fMAT, SEXP fSP, SEXP sSP, SEXP NV, SEXP NR){
// Converting inputs into C-compatible forms
int nv = asInteger(NV);
int nr = asInteger(NR);
int sp1l = length(fSP);
int sp2l = length(sSP);
// Note here we're not using coerceVector() anymore
// typechecking done on R side
int *firstSeqPos = INTEGER(fSP);
int *secondSeqPos = INTEGER(sSP);
double *fm = REAL(fMAT);
int colv1, colv2;
// Using a linked list for efficient insert
corrs = NULL;
int cv1, cv2;
double p1, p2, score=0;
bool success;
for ( int i=0; i<sp1l; i++ ){
cv1 = firstSeqPos[i];
colv1 = (cv1 - 1) * nr;
for ( int j=0; j<sp2l; j++ ){
cv2 = secondSeqPos[j];
colv2 = (cv2 - 1) * nr;
score = 0;
for ( int k=0; k<nr; k++){
p1 = fm[colv1 + k];
p2 = fm[colv2 + k];
if (p1 != 0 && p2 != 0){
score += p1 * log(p1 / p2);
}
}
node *newNode = makeNewNode(score, cv1, cv2);
success = insertSorted(&corrs, newNode, nv);
// If we don't insert, free the associated memory
// I'm checking for NULL here just out of an abundance of caution
if (!success && newNode != NULL){
free(newNode);
newNode = NULL;
}
}
R_CheckUserInterrupt();
}
SEXP ans;
PROTECT(ans = allocVector(INTSXP, 2*nv));
int *rans = INTEGER(ans);
node *ptr=corrs;
for ( int i=0; i<nv; i++){
rans[2*i] = ptr->i1;
rans[2*i+1] = ptr->i2;
ptr = ptr->next;
}
// Free allocated memory in linked list
cleanupFxn();
UNPROTECT(1);
return(ans);
}
Assuming the C file is named trimCovar.c, we'd compile with R CMD SHLIB trimCovar.c.
R Code to run this function:
dyn.load("trimCovar.so")
# Wrapped into a function with on.exit(...) to ensure cleanup
# in the event the user or system interrupts execution early
CorrComp_C <- function(fm, fsp, ssp, nv, nr){
# type checking to ensure input to C is integer vector
# (could probably do more type checking here, mainly for illustration)
stopifnot(is(fsp, 'integer'))
stopifnot(is(ssp, 'integer'))
on.exit(.C("cleanupFxn"))
a <- .Call('trimCovar', fm, fsp, ssp, nv, nr)
return(a)
}
I am learning OpenCL for the first time, and I am currently modifying a shortest-path-finding algorithm. I know that OpenCL usually uses the idea of parallel computing to solve problems. So I wonder whether I can also use this parallel idea when finding the minimum value and its position in an array?
This is my previous attempt. I thought that as long as the smallest value wins in the end, the result would be correct regardless of whether the operation is locked. Unfortunately, when I use printf to inspect the variables, valid nodes are being identified but I can't get the correct results.
__kernel void findWay(__global int* A, __global int* B, __global int* minNode, __global int* minDis, __global int* isFinish)
{
//A: weightMatrix , B: usedNode
//dijkstra algorithm , src node is 0
size_t dst = get_global_id(1);
size_t src = get_global_id(0);
size_t vCount = get_global_size(0);
int index = dst * vCount + src;
while(isFinish[0] != vCount){
if((src == minNode[0])&&(B[dst] == 0)&&(A[index] != INT_MAX)){
A[dst*vCount] = min(A[dst*vCount + 0],A[minNode[0]*vCount + 0] + A[index]);
}
minDis[0] = INT_MAX;
barrier(CLK_GLOBAL_MEM_FENCE);
//here is the bug
if((src == 0) &&(B[dst] == 0)){
if(minDis[0] > A[index]){
minDis[0] = A[index];
minNode[0] = dst;
}
}
//=========
barrier(CLK_GLOBAL_MEM_FENCE);
B[minNode[0]] = 1;
if(index == 0){
isFinish[0]++;
}
}
}
In the end, I could only implement this operation in the plain serial way:
if((src == 0) && (dst == 0)){
    for(int i = 0; i < vCount; i++){
        if(B[i] == 0 && minDis[0] > A[i*vCount]){
            minDis[0] = A[i*vCount];
            minNode[0] = i;
        }
    }
}
I would like to ask about this search process: can the looping step be omitted or parallelized?
Horizontal operations on a parallelized array are difficult. The general approach to them is binary-tree-like kernel passes. Start with the original array, make each GPU thread load 2 neighboring elements and choose the smaller one, writing it back to the position of the first of the two. The next kernel pass loads two elements from the list of every second element, compares them, and writes the smaller one to the first position of the two. Repeat until there is only one element left.
I will illustrate it below, with a kernel sketch after the example. I mark values that are no longer touched by the kernel with *.
original array: 5|2|1|6|9|3|4|8
after 1st kernel pass: 2 *|1 *|3 *|4 *
after 2nd kernel pass: 1 * * *|3 * * *
after 3rd kernel pass: 1 * * * * * * *
smallest element is 1.
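A minimal kernel sketch of one pass of this scheme, carrying positions alongside values (names are illustrative; it assumes the array length is a power of two, with the host doubling stride and halving the global size each pass):
__kernel void minReduceStep(__global int* val, __global int* pos, const int stride)
{
    size_t i = get_global_id(0) * stride * 2; // first element of this thread's pair
    if(val[i + stride] < val[i]){             // keep the smaller of the two...
        val[i] = val[i + stride];
        pos[i] = pos[i + stride];             // ...and carry its original index along
    }
}
The host initializes pos[i] = i and enqueues the kernel with global sizes n/2, n/4, ..., 1 (stride 1, 2, ..., n/2); after the last pass, val[0] holds the minimum and pos[0] its position.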
I am trying to run this fast Fourier transform implementation. It compiles fine but gives this error at runtime. I have no idea about the error or what it means. Can anyone help me out?
I compiled and run the program by:
mpicc -o exec test.c
./exec
CODE:
This is code that I found on GitHub. It's the parallel version of the fast Fourier transform algorithm.
#include <stdio.h>
#include <mpi.h> //To use MPI
#include <complex.h> //to use complex numbers
#include <math.h> //for cos() and sin()
#include "timer.h" //to use timer
#define PI 3.14159265
#define bigN 16384 //Problem Size
#define howmanytimesavg 3
int main()
{
int my_rank,comm_sz;
MPI_Init(NULL,NULL); //start MPI
MPI_Comm_size(MPI_COMM_WORLD,&comm_sz); //how many processes are we using?
MPI_Comm_rank(MPI_COMM_WORLD,&my_rank); //which process is this?
double start,finish;
double avgtime = 0;
FILE *outfile;
int h;
if(my_rank == 0) //if process 0 open outfile
{
outfile = fopen("ParallelVersionOutput.txt", "w"); //open from current directory
}
for(h = 0; h < howmanytimesavg; h++) //loop to run multiple times for AVG time.
{
if(my_rank == 0) //If it's process 0 starts timer
{
start = MPI_Wtime();
}
int i,k,n,j; //Basic loop variables
double complex evenpart[(bigN / comm_sz / 2)]; //array to save the data for EVENHALF
double complex oddpart[(bigN / comm_sz / 2)]; //array to save the data for ODDHALF
double complex evenpartmaster[(bigN / comm_sz / 2) * comm_sz]; //array to save the data for EVENHALF
double complex oddpartmaster[(bigN / comm_sz / 2) * comm_sz]; //array to save the data for ODDHALF
double storeKsumreal[bigN]; //store the K real variable so we can abuse symmetry
double storeKsumimag[bigN]; //store the K imaginary variable so we can abuse symmetry
double subtable[(bigN / comm_sz)][3]; //Each process owns a subtable from the table below
double table[bigN][3] = //TABLE of numbers to use
{
0,3.6,2.6, //n, Real,Imaginary CREATES TABLE
1,2.9,6.3,
2,5.6,4.0,
3,4.8,9.1,
4,3.3,0.4,
5,5.9,4.8,
6,5.0,2.6,
7,4.3,4.1,
};
if(bigN > 8) //Everything after row 8 is all 0's
{
for(i = 8; i < bigN; i++)
{
table[i][0] = i;
for(j = 1; j < 3;j++)
{
table[i][j] = 0.0; //set to 0.0
}
}
}
int sendandrecvct = (bigN / comm_sz) * 3; //how much to send and receive?
MPI_Scatter(table,sendandrecvct,MPI_DOUBLE,subtable,sendandrecvct,MPI_DOUBLE,0,MPI_COMM_WORLD); //scatter the table to subtables
for (k = 0; k < bigN / 2; k++) //K coefficient loop
{
/* Variables used for the computation */
double sumrealeven = 0.0; //sum of real numbers for even
double sumimageven = 0.0; //sum of imaginary numbers for even
double sumrealodd = 0.0; //sum of real numbers for odd
double sumimagodd = 0.0; //sum of imaginary numbers for odd
for(i = 0; i < (bigN/comm_sz)/2; i++) //Sigma loop EVEN and ODD
{
double factoreven , factorodd = 0.0;
int shiftevenonnonzeroP = my_rank * subtable[2*i][0]; //used to shift index numbers for correct results for EVEN.
int shiftoddonnonzeroP = my_rank * subtable[2*i + 1][0]; //used to shift index numbers for correct results for ODD.
/* -------- EVEN PART -------- */
double realeven = subtable[2*i][1]; //Access table for real number at spot 2i
double complex imaginaryeven = subtable[2*i][2]; //Access table for imaginary number at spot 2i
double complex componeeven = (realeven + imaginaryeven * I); //Create the first component from table
if(my_rank == 0) //if proc 0, dont use shiftevenonnonzeroP
{
factoreven = ((2*PI)*((2*i)*k))/bigN; //Calculates the even factor for Cos() and Sin()
// *********Reduces computational time*********
}
else //use shiftevenonnonzeroP
{
factoreven = ((2*PI)*((shiftevenonnonzeroP)*k))/bigN; //Calculates the even factor for Cos() and Sin()
// *********Reduces computational time*********
}
double complex comptwoeven = (cos(factoreven) - (sin(factoreven)*I)); //Create the second component
evenpart[i] = (componeeven * comptwoeven); //store in the evenpart array
/* -------- ODD PART -------- */
double realodd = subtable[2*i + 1][1]; //Access table for real number at spot 2i+1
double complex imaginaryodd = subtable[2*i + 1][2]; //Access table for imaginary number at spot 2i+1
double complex componeodd = (realodd + imaginaryodd * I); //Create the first component from table
if (my_rank == 0)//if proc 0, dont use shiftoddonnonzeroP
{
factorodd = ((2*PI)*((2*i+1)*k))/bigN;//Calculates the odd factor for Cos() and Sin()
// *********Reduces computational time*********
}
else //use shiftoddonnonzeroP
{
factorodd = ((2*PI)*((shiftoddonnonzeroP)*k))/bigN;//Calculates the odd factor for Cos() and Sin()
// *********Reduces computational time*********
}
double complex comptwoodd = (cos(factorodd) - (sin(factorodd)*I));//Create the second component
oddpart[i] = (componeodd * comptwoodd); //store in the oddpart array
}
/*Process ZERO gathers the even and odd part arrays and creates a evenpartmaster and oddpartmaster array*/
MPI_Gather(evenpart,(bigN / comm_sz / 2),MPI_DOUBLE_COMPLEX,evenpartmaster,(bigN / comm_sz / 2), MPI_DOUBLE_COMPLEX,0,MPI_COMM_WORLD);
MPI_Gather(oddpart,(bigN / comm_sz / 2),MPI_DOUBLE_COMPLEX,oddpartmaster,(bigN / comm_sz / 2), MPI_DOUBLE_COMPLEX,0,MPI_COMM_WORLD);
if(my_rank == 0)
{
for(i = 0; i < (bigN / comm_sz / 2) * comm_sz; i++) //loop to sum the EVEN and ODD parts
{
sumrealeven += creal(evenpartmaster[i]); //sums the realpart of the even half
sumimageven += cimag(evenpartmaster[i]); //sums the imaginarypart of the even half
sumrealodd += creal(oddpartmaster[i]); //sums the realpart of the odd half
sumimagodd += cimag(oddpartmaster[i]); //sums the imaginary part of the odd half
}
storeKsumreal[k] = sumrealeven + sumrealodd; //add the calculated reals from even and odd
storeKsumimag[k] = sumimageven + sumimagodd; //add the calculated imaginary from even and odd
storeKsumreal[k + bigN/2] = sumrealeven - sumrealodd; //ABUSE symmetry Xkreal + N/2 = Evenk - OddK
storeKsumimag[k + bigN/2] = sumimageven - sumimagodd; //ABUSE symmetry Xkimag + N/2 = Evenk - OddK
if(k <= 10) //Do the first 10 K's
{
if(k == 0)
{
fprintf(outfile," \n\n TOTAL PROCESSED SAMPLES : %d\n",bigN);
}
fprintf(outfile,"================================\n");
fprintf(outfile,"XR[%d]: %.4f XI[%d]: %.4f \n",k,storeKsumreal[k],k,storeKsumimag[k]);
fprintf(outfile,"================================\n");
}
}
}
if(my_rank == 0)
{
GET_TIME(finish); //stop timer
double timeElapsed = finish-start; //Time for that iteration
avgtime = avgtime + timeElapsed; //AVG the time
fprintf(outfile,"Time Elaspsed on Iteration %d: %f Seconds\n", (h+1),timeElapsed);
}
}
if(my_rank == 0)
{
avgtime = avgtime / howmanytimesavg; //get avg time
fprintf(outfile,"\nAverage Time Elaspsed: %f Seconds", avgtime);
fclose(outfile); //CLOSE file ONLY proc 0 can.
}
MPI_Barrier(MPI_COMM_WORLD); //wait to all proccesses to catch up before finalize
MPI_Finalize(); //End MPI
return 0;
}
ERROR:
Fatal error in PMPI_Gather: Invalid datatype, error stack:
PMPI_Gather(904): MPI_Gather(sbuf=0x7fffb62799a0, scount=8192,
MPI_DATATYPE_NULL, rbuf=0x7fffb6239980, rcount=8192, MPI_DATATYPE_NULL,
root=0, MPI_COMM_WORLD) failed
PMPI_Gather(815): Datatype for argument sendtype is a null datatype
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=537490947
:
system msg for write_line failure : Bad file descriptor
There is no MPI_DATATYPE_NULL in your code; you only use MPI_DOUBLE_COMPLEX. Note that the latter is a Fortran datatype, and strictly speaking using it in C is not correct.
My guess is that MPI_DOUBLE_COMPLEX is causing the issue (type not defined or not initialized because you invoked the C version of MPI_Init()).
You can obviously rewrite your code in Fortran, or use your own derived datatype for a C double complex number.
Meanwhile, I suggest you write simple C and Fortran hello-world programs that use MPI_DOUBLE_COMPLEX (an MPI_Bcast() of one element, for example) to confirm whether the issue is with MPI_DOUBLE_COMPLEX and whether it is restricted to C.
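As a sketch of the derived-datatype route suggested above, a C double complex can be treated as two contiguous doubles (note that MPI 2.2 and later also define MPI_C_DOUBLE_COMPLEX for exactly this purpose):
MPI_Datatype mpi_dcomplex;
MPI_Type_contiguous(2, MPI_DOUBLE, &mpi_dcomplex); // real part + imaginary part
MPI_Type_commit(&mpi_dcomplex);
// use it in place of MPI_DOUBLE_COMPLEX, e.g.:
MPI_Gather(evenpart, bigN / comm_sz / 2, mpi_dcomplex,
           evenpartmaster, bigN / comm_sz / 2, mpi_dcomplex,
           0, MPI_COMM_WORLD);
MPI_Type_free(&mpi_dcomplex);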
I'm trying to implement the Hough transform for circles in OpenCL, but I've encountered a really weird problem. Every time I run the Hough kernel, I end up with a slightly different accumulator, even though the parameters are the same and the accumulator is always a freshly zeroed table (ex. http://imgur.com/a/VcIw1). My kernel code is as below:
#define BLOCK_LEN 256
__kernel void HoughCirclesKernel(
__global int* A,
__global int* imgData,
__global int* _width,
__global int* _height,
__global int* r
)
{
__local int imgBuff[BLOCK_LEN];
int localThreadIndex = get_local_id(0); //threadIdx.x
int globalThreadIndex = get_local_id(0) + get_group_id(0) * BLOCK_LEN; //threadIdx.x + blockIdx.x * Block_Len
int width = *_width; int height = *_height;
int radius = *r;
A[globalThreadIndex] = 0;
barrier(CLK_GLOBAL_MEM_FENCE);
if(globalThreadIndex < width*height)
{
imgBuff[localThreadIndex] = imgData[globalThreadIndex];
barrier(CLK_LOCAL_MEM_FENCE);
if(imgBuff[localThreadIndex] > 0)
{
float s1, c1;
for(int i = 0; i<180; i++)
{
s1 = sincos(i, &c1);
int centerX = globalThreadIndex % width + radius * c1;
int centerY = ((globalThreadIndex - centerX) / height) + radius * s1;
if(centerX < width && centerY < height)
atomic_inc(A + centerX + centerY * width);
}
}
}
barrier(CLK_GLOBAL_MEM_FENCE);
}
Could this be the fault of how I am incrementing the accumulator?
if(globalThreadIndex < width*height)
{
imgBuff[localThreadIndex] = imgData[globalThreadIndex];
barrier(CLK_LOCAL_MEM_FENCE);
...
}
this is undefined behaviour, since there is a barrier inside a branch.
All work-items in a work-group must reach the same barrier.
Try this:
if(globalThreadIndex < width*height)
{
imgBuff[localThreadIndex] = imgData[globalThreadIndex];
...
}
barrier(CLK_LOCAL_MEM_FENCE);
Also, there could be another issue if you are using multiple devices:
get_local_id(0) + get_group_id(0)
Here get_group_id(0) returns the group ID per device, and it starts from 0 on every device, just as get_global_id does; so you should add proper offsets in the NDRange call when using multiple devices (see the host-side sketch below). Even though different devices can support the same floating-point accuracy requirements, one of them may give better accuracy than the other and produce slightly different results. If it is a single device, then you should try lowering GPU frequencies, as the card may have defects or side effects of an overclock.
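A host-side sketch of that offsetting (queues, kernel, and the even per-device split are placeholder assumptions): passing a non-zero global work offset makes each device's get_global_id(0) range disjoint:
size_t perDevice = totalWork / numDevices;  // assumes totalWork divides evenly
for (int d = 0; d < numDevices; d++) {
    size_t offset = d * perDevice;          // this device's starting global ID
    clEnqueueNDRangeKernel(queues[d], kernel, 1,
                           &offset, &perDevice, NULL,
                           0, NULL, NULL);
}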
I have managed to solve my problem by finding and correcting three issues.
First of all the kernel code, the line:
int centerY = ((globalThreadIndex - centerX) / height) + radius * s1;
should be:
int centerY = (globalThreadIndex / width) + radius * s1;
The main change here was dividing by width, not height. This caused inaccuracy problems.
if(centerX < width && centerY < height)
The above condition was changed to:
if(centerX < width && centerX >= 0)
if(centerY < height && centerY >= 0)
As for the accumulator problem, first I will post the code I used to create the clBuffer (I am using the OpenCL.net library for C#):
int[] a = new int[width*height]; //image size
ErrorCode error;
Mem cl_accumulator = (Mem)Cl.CreateBuffer(cl_context, MemFlags.ReadWrite, (IntPtr)(a.Length * sizeof(int)), out error);
CheckErr(error, "Cl.CreateBuffer");
The fix here was simple and pretty much self-explanatory:
int[] a = Enumerable.Repeat(0, width * height).ToArray();
ErrorCode error;
GCHandle accHandle = GCHandle.Alloc(a, GCHandleType.Pinned);
IntPtr accPtr = accHandle.AddrOfPinnedObject();
Mem cl_accumulator = (Mem)Cl.CreateBuffer(cl_context, MemFlags.ReadWrite | MemFlags.CopyHostPtr, (IntPtr)(a.Length * sizeof(int)), accPtr, out error);
CheckErr(error, "Cl.CreateBuffer");
I filled the accumulator table with zeros and then copied it to the device buffer each time I executed the kernel.
The above errors caused the accumulator to look different and a bit malformed each time I executed the kernel.
I am a novice at OpenCL and recently I have stumbled onto something which does not make sense to me.
I am using Intel drivers (working on linux machine) and the device is Xeon Phi coprocessor.
The problem is that when I give local_item_size as an argument to
clEnqueueNDRangeKernel(commandQueue,
forceKernel, 1,
&localItemSize, &globalItemSize,
NULL, 0, NULL, &kernelDone);
and when printing global thread id in the kernel
int tid = get_global_id(0);
The thread ids start from 1 and not from 0.
When I do not specify my local_item_size and pass NULL as that argument, it seems to start counting correctly from 0.
At the moment I am working around this by subtracting 1 from the return value of get_global_id(0).
In short: when I specify my local_item_size, the tid starts from 1; when I pass NULL, it starts from 0.
Size setting code:
// Global item size
if (n <= NUM_THREADS) {
globalItemSize = NUM_THREADS;
localItemSize = 16;
} else if (n % NUM_THREADS != 0) {
globalItemSize = (n / NUM_THREADS + 1) * NUM_THREADS;
} else {
globalItemSize = n;
}
// Local item size
localItemSize = globalItemSize / NUM_THREADS;
The 4th parameter to clEnqueueNDRangeKernel is an array of the offsets, not the local size - that's the 6th parameter. Your call should be
clEnqueueNDRangeKernel(commandQueue,
forceKernel, 1,
NULL, &globalItemSize,
&localItemSize, 0, NULL, &kernelDone);
This is also why the IDs started at 1 rather than 0 - because you passed localItemSize as an offset!
You are passing your work-group size to the wrong argument. The fourth argument of clEnqueueNDRangeKernel is the global work offset, which is why your global IDs are appearing offset. The work-group size should go in the sixth argument:
clEnqueueNDRangeKernel(commandQueue,
forceKernel, 1, NULL,
&globalItemSize, &localItemSize,
0, NULL, &kernelDone);