Persistent communication in MPI - odd behaviour

I am solving the coarsest grid of a parallel geometric multigrid solver with Jacobi iterations, using the non-blocking calls MPI_Isend() and MPI_Irecv(). That version works fine. As soon as I replace the non-blocking communication with persistent communication, the results stop converging at this level and the program goes into an infinite loop. The calls MPI_Startall() and MPI_Waitall() always return MPI_SUCCESS. Has anyone faced this problem before? Please advise.
Coarsest_grid_solve()
{
    MPI_Recv_init(&e_c_old[0][1][1], 1, x_subarray_c, X_DOWN, 10, new_comm, &recv[0]);
    MPI_Recv_init(&e_c_old[PXC+1][1][1], 1, x_subarray_c, X_UP, 20, new_comm, &recv[1]);
    MPI_Recv_init(&e_c_old[1][PYC+1][1], 1, y_subarray_c, Y_RIGHT, 30, new_comm, &recv[2]);
    MPI_Recv_init(&e_c_old[1][0][1], 1, y_subarray_c, Y_LEFT, 40, new_comm, &recv[3]);
    MPI_Recv_init(&e_c_old[1][1][PZC+1], 1, z_subarray_c, Z_AWAY_U, 50, new_comm, &recv[4]);
    MPI_Recv_init(&e_c_old[1][1][0], 1, z_subarray_c, Z_TOWARDS_U, 60, new_comm, &recv[5]);

    MPI_Send_init(&e_c_old[PXC][1][1], 1, x_subarray_c, X_UP, 10, new_comm, &send[0]);
    MPI_Send_init(&e_c_old[1][1][1], 1, x_subarray_c, X_DOWN, 20, new_comm, &send[1]);
    MPI_Send_init(&e_c_old[1][1][1], 1, y_subarray_c, Y_LEFT, 30, new_comm, &send[2]);
    MPI_Send_init(&e_c_old[1][PYC][1], 1, y_subarray_c, Y_RIGHT, 40, new_comm, &send[3]);
    MPI_Send_init(&e_c_old[1][1][1], 1, z_subarray_c, Z_TOWARDS_U, 50, new_comm, &send[4]);
    MPI_Send_init(&e_c_old[1][1][PZC], 1, z_subarray_c, Z_AWAY_U, 60, new_comm, &send[5]);

    while(rk_global/r0_global > TOL_CNORM)
    {
        coarse_iterations++;

        err = MPI_Startall(6, recv);
        if(err == MPI_SUCCESS)
            printf("success");
        err = MPI_Startall(6, send);
        if(err == MPI_SUCCESS)
            printf("success");
        err = MPI_Waitall(6, send, MPI_STATUSES_IGNORE);
        if(err == MPI_SUCCESS)
            printf("success");
        err = MPI_Waitall(6, recv, MPI_STATUSES_IGNORE);
        if(err == MPI_SUCCESS)
            printf("success");

        //do work here

        if(coarse_iterations == 1)
        {
            update_neumann_c(e_c_old, PXC, PYC, PZC, X_UP, Y_RIGHT, Z_AWAY_U);
            residual_coarsest(e_c_old, rho_c, PXC, PYC, PZC, X_UP, Y_RIGHT, Z_AWAY_U, hc, rho_temp);
            r0_local = residual_norm(rho_temp, PXC, PYC, PZC);
            start_allred = MPI_Wtime();
            MPI_Allreduce(&r0_local, &r0_global, 1, MPI_DOUBLE, MPI_SUM, new_comm);
            end_allred = MPI_Wtime();
            r0_global = r0_global/( (PXC*dims0) * (PYC*dims1) * (PZC*dims2) );
            if(rank == 0)
                printf("\nGlobal residual norm is = %f", r0_global);
            rk_global = r0_global;
        }
        else
        {
            update_neumann_c(e_c_old, PXC, PYC, PZC, X_UP, Y_RIGHT, Z_AWAY_U);
            residual_coarsest(e_c_old, rho_c, PXC, PYC, PZC, X_UP, Y_RIGHT, Z_AWAY_U, hc, rho_temp);
            rk_local = residual_norm(rho_temp, PXC, PYC, PZC);
            start_allred = MPI_Wtime();
            MPI_Allreduce(&rk_local, &rk_global, 1, MPI_DOUBLE, MPI_SUM, new_comm);
            end_allred = MPI_Wtime();
            rk_global = rk_global/( (PXC*dims0) * (PYC*dims1) * (PZC*dims2) );
            if(rank == 0)
                printf("\nGlobal residual norm is = %f", rk_global);
        }

        //do dependent work and exchange matrices
    } //while loop ends

    for(i = 0; i <= 5; i++)
    {
        MPI_Request_free(&send[i]);
        MPI_Request_free(&recv[i]);
    }
} //End coarsest grid solve
Note: strangely, the ghost data becomes zero on alternate iterations (just found this out; I don't know why).

When we create a persistent communication handle, we bind it to a specific piece of memory that we want to transfer to another process. In Jacobi iterations, however, we swap pointers at the end of each iteration so that the old matrix points to the newly updated matrix. The memory the pointer refers to therefore changes, but the persistent requests keep referring to the original locations. The way around this is to define two sets of persistent communication handles: use the first set on odd iterations and the second set on even iterations, i.e. alternate them (see the sketch below). This solved my problem, and it also widened my understanding of persistent communication in MPI.
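A minimal, self-contained sketch of the idea (not the original code: a 1-D halo exchange stands in for the 3-D subarrays, and all names here are illustrative). Each buffer gets its own set of persistent requests, and the Jacobi sweep switches between them by index instead of swapping raw pointers:
#include <mpi.h>
#include <stdlib.h>

#define N 16
#define NSTEPS 10

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    /* Two interior+halo buffers that the Jacobi sweep alternates between. */
    double *buf[2];
    buf[0] = calloc(N + 2, sizeof(double));
    buf[1] = calloc(N + 2, sizeof(double));

    /* One set of persistent requests per buffer, created once. */
    MPI_Request reqs[2][4];
    for (int b = 0; b < 2; b++) {
        MPI_Recv_init(&buf[b][0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[b][0]);
        MPI_Recv_init(&buf[b][N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[b][1]);
        MPI_Send_init(&buf[b][N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[b][2]);
        MPI_Send_init(&buf[b][1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[b][3]);
    }

    int cur = 0;                        /* index of the current "old" buffer */
    for (int it = 0; it < NSTEPS; it++) {
        MPI_Startall(4, reqs[cur]);     /* exchange halos of the old buffer */
        MPI_Waitall(4, reqs[cur], MPI_STATUSES_IGNORE);

        for (int i = 1; i <= N; i++)    /* Jacobi-style update: read old, write new */
            buf[1 - cur][i] = 0.5 * (buf[cur][i - 1] + buf[cur][i + 1]);

        cur = 1 - cur;                  /* "swap" by flipping the index, not the pointers */
    }

    for (int b = 0; b < 2; b++)
        for (int i = 0; i < 4; i++)
            MPI_Request_free(&reqs[b][i]);
    free(buf[0]);
    free(buf[1]);
    MPI_Finalize();
    return 0;
}
The same pattern applies to the 3-D code above: build one recv/send set against each of the two matrices, and pass the set that matches the buffer currently playing the role of e_c_old to MPI_Startall()/MPI_Waitall().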

Related

Why does this Adafruit screen code interfere with this code?

I have code for an artificial neural network on Arduino. I'm using an Arduino UNO at 16 MHz and an SSD1306 OLED screen. The ANN is a simple XOR logic function. The code generates two random numbers, 1.0 or 0.0, passes the values to the inputs of the ANN, runs inference and prints the inference time in milliseconds on the screen, and after that prints the input values, the output value and the inference time on the Serial monitor. The complete code is:
#include <SPI.h>
#include <Wire.h>
#include <Adafruit_GFX.h>
#include <Adafruit_SSD1306.h>
#include <math.h>
#define SCREEN_WIDTH 128 // OLED display width, in pixels
#define SCREEN_HEIGHT 64 // OLED display height, in pixels
// Declaration for an SSD1306 display connected to I2C (SDA, SCL pins)
#define OLED_RESET -1 // Reset pin # (or -1 if sharing Arduino reset pin)
Adafruit_SSD1306 display(SCREEN_WIDTH, SCREEN_HEIGHT, &Wire, OLED_RESET);
volatile unsigned long adcTime = 0;
const float NEFTIS_E = 2.71828183;
const int NEFTIS_ANN_LENGHT = 13; // Total neurons in the network.
const int NEFTIS_OUT_LENGHT = 1; // Total output neurons.
const int NEFTIS_INPUT_LENGHT = 2; // Total input neurons.
enum ENUM_ActivationFuntionType
{
AF_Logistic, // Logistic sigmoid.
AF_TanH, // Hyperbolic tangent.
AF_ArcTan, // Arc tangent.
AF_ReLU, // Rectified Linear Unit.
AF_LeakyReLU, // Leaky ReLU.
AF_ELU, // Exponential Linear Unit.
AF_SoftPlus, //
AF_Boolean, // Boolean.
};
// Synapse structure
typedef struct
{
float weight;
int inputNeuron;
} struct_synapse;
// Neuron structure
typedef struct
{
float value;
float bias;
int indexStart;
int nInputs;
bool isInput;
ENUM_ActivationFuntionType activationFunction;
float linearDelta;
} struct_neuron;
struct_neuron Neftis_Neurons[] = {
{1.0f, 0, 0, 0, true, AF_Logistic, 0.01},
{1.0f, 0, 0, 0, true, AF_Logistic, 0.01},
{0.0f, 0, 0, 2, false, AF_Logistic, 0.01},
{0.0f, 0, 2, 2, false, AF_Logistic, 0.01},
{0.0f, 0, 4, 2, false, AF_Logistic, 0.01},
{0.0f, 0, 6, 2, false, AF_Logistic, 0.01},
{0.0f, 0, 8, 2, false, AF_Logistic, 0.01},
{0.0f, 0, 10, 2, false, AF_Logistic, 0.01},
{0.8f, 0, 12, 2, false, AF_Logistic, 0.01},
{0.0f, 0, 14, 2, false, AF_Logistic, 0.01},
{0.0f, 0, 16, 2, false, AF_Logistic, 0.01},
{0.0f, 0, 18, 2, false, AF_Logistic, 0.01},
{0.0f, 0, 20, 10, false, AF_Logistic, 0.01}
};
struct_synapse Neftis_Synapses[] = {
{-5.43316f, 0},
{-5.543714f, 1},
{-4.262666f, 0},
{1.139778f, 1},
{2.446421f, 0},
{-6.2367f, 1},
{-2.993009f, 0},
{-0.5112332f, 1},
{-3.330907f, 0},
{0.02449156f, 1},
{-1.701479f, 0},
{-1.936165f, 1},
{0.5805757f, 0},
{1.178081f, 1},
{-3.712192f, 0},
{0.5605027f, 1},
{-4.816404f, 0},
{-4.946669f, 1},
{-3.816507f, 0},
{0.6793132f, 1},
{-9.65515f, 2},
{2.788617f, 3},
{5.523755f, 4},
{0.8867579f, 5},
{1.384886f, 6},
{-0.4491375f, 7},
{-3.600839f, 8},
{1.99833f, 9},
{-6.788064f, 10},
{2.101213f, 11}
};
void setup() {
// put your setup code here, to run once:
Serial.begin(115200);/*
display.begin(SSD1306_SWITCHCAPVCC, 0x3C);
display.clearDisplay();
display.display();*/
pinMode(LED_BUILTIN, OUTPUT);
}
void loop() {
// put your main code here, to run repeatedly:
float inputs[NEFTIS_INPUT_LENGHT];
inputs[0] = (float) random(0, 2);
inputs[1] = (float) random(0, 2);
Neftis_SetInputs(inputs);
unsigned long start = micros();
Neftis_Run();
adcTime = micros() - start;
display.clearDisplay();
display.setTextSize(1); // Normal 1:1 pixel scale
display.setTextColor(SSD1306_WHITE); // Draw white text
display.setCursor(1, 16); // Start at top-left corner
display.print(adcTime);
Serial.println(Neftis_Neurons[0].value);
Serial.println(Neftis_Neurons[1].value);
Serial.println(Neftis_Neurons[12].value);
display.display();
Serial.println(adcTime);
digitalWrite(LED_BUILTIN, true);
delay(1000);
digitalWrite(LED_BUILTIN, false);
delay(500);
}
void Neftis_SetInputs(float inputs[])
{
for(int i = 0; i < NEFTIS_INPUT_LENGHT; i++)
{
Neftis_Neurons[i].value = inputs[i];
}
}
void Neftis_Run()
{
// for every neuron
for(int i = 0; i < NEFTIS_ANN_LENGHT; i++)
{
float sum = 0.0f;
for(int j = 0; j < Neftis_Neurons[i].nInputs; j++)
{
// sums the inputs
sum += Neftis_Synapses[Neftis_Neurons[i].indexStart + j].weight * Neftis_Neurons[Neftis_Synapses[Neftis_Neurons[i].indexStart + j].inputNeuron].value;
}
sum += Neftis_Neurons[i].bias;
// apply activation function if is not input neuron
if(Neftis_Neurons[i].isInput == false)
{
// >> Logistic
if(Neftis_Neurons[i].activationFunction == AF_Logistic){
Neftis_Neurons[i].value = (float) (1 / (1 + pow(NEFTIS_E, -sum)));
// >> TanH.
}
}
}
}
void Neftis_GetOutputs(float* outArray)
{
for(int i = 0; i < NEFTIS_OUT_LENGHT; i++)
{
outArray[i] = Neftis_Neurons[NEFTIS_ANN_LENGHT - NEFTIS_OUT_LENGHT + i].value;
}
}
The problem is that the screen doesn't show anything and the Serial monitor shows strange values for the inputs, output and inference time, as shown in the image below.
But if I remove the lines
display.begin(SSD1306_SWITCHCAPVCC, 0x3C);
display.clearDisplay();
display.display();*/
and
display.clearDisplay();
display.setTextSize(1); // Normal 1:1 pixel scale
display.setTextColor(SSD1306_WHITE); // Draw white text
display.setCursor(1, 16); // Start at top-left corner
then the serial monitor shows the right values for inputs, output and elapsed time for inference.
Now, I modified this line of code from this:
Neftis_Neurons[i].value = (float) (1 / (1 + pow(NEFTIS_E, -sum)));
to this:
Neftis_Neurons[i].value = (float) (1 / (1 + pow(NEFTIS_E, -1.5)));
and then the screen works and shows the inference time. The question is simple: why is the screen code interfering with the pow function and the Serial monitor, or is the pow function interfering with the screen and Serial monitor code? By the way, I ran the same code on a Raspberry Pi Pico and it works fine; no issues there.
Thank you very much, I hope you find the answer.
I followed the suggestion of CherryDT and posted my question on the Arduino forum, where it was answered: https://forum.arduino.cc/t/why-this-adafruit-screen-code-interferes-with-this-other-code/1049580
You are running out of dynamic memory on the Uno, which causes unpredictable errors during run time.
This line allocates a screen buffer of 1024 bytes, which won't be shown in the memory usage report during compiling/uploading:
display.begin(SSD1306_SWITCHCAPVCC, 0x3C);
To save dynamic memory space, you can put your other data arrays in program memory using the PROGMEM keyword. Check the Arduino reference for usage.
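As an illustration only (not from the forum answer, and with a made-up table name), a minimal AVR sketch of the PROGMEM idea:
#include <avr/pgmspace.h>

// Hypothetical constant table kept in flash instead of SRAM.
const float weightTable[] PROGMEM = { -5.43316f, -5.543714f, -4.262666f, 1.139778f };

float readWeight(int i)
{
    // pgm_read_float() copies one float out of program memory at run time.
    return pgm_read_float(&weightTable[i]);
}
Reads then go through pgm_read_float() (or memcpy_P() for whole structs) instead of ordinary array indexing; this helps on AVR boards such as the Uno, where flash and SRAM are separate.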
I tried the same code but with a smaller ANN model and it worked.

When using MPI_Isend() and MPI_Irecv(), should the pair use the same request? And if the message is an array, how should it be passed?

In this code I am trying to broadcast using non-blocking send and receive as practice. I have several questions and issues:
1. Should I pair Isend() and Irecv() so that they use the same request?
2. When the message is an array, how should it be passed? In this case, message or &message?
3. Why can't I run this code on fewer or more than 8 processors? If a rank doesn't exist, shouldn't the code just continue without executing that piece of code?
4. The snippet at the bottom is there to print the total time once, but the Waitall() does not work, and I do not understand why.
5. When passing arrays longer than 2^12 I get a segmentation fault, although I have checked the limits of Isend() and Irecv() and they are supposed to handle even longer messages.
6. I used long double to record the time; is this common or good practice? When I used smaller types like float or double I would get NaN.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[]){
    MPI_Init(&argc, &argv);
    int i, rank, size, ready;
    long int N = pow(2, 10);
    float* message = (float *)malloc(sizeof(float *) * N + 1);
    long double start, end;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    //MPI_Request* request = (MPI_Request *)malloc(sizeof(MPI_Request *) * size);
    MPI_Request request[size-1];

    /*Stage I: -np 8*/
    if(rank == 0){
        for(i = 0; i < N; i++){
            message[i] = N*rand();
            message[i] /= rand();
        }
        start = MPI_Wtime();
        MPI_Isend(&message, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &request[0]);
        MPI_Isend(&message, N, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &request[1]);
        MPI_Isend(&message, N, MPI_FLOAT, 4, 0, MPI_COMM_WORLD, &request[3]);
        printf("Processor root-rank %d- sent the message...\n", rank);
    }
    if (rank == 1){
        MPI_Irecv(&message, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &request[0]);
        MPI_Wait(&request[0], MPI_STATUS_IGNORE);
        printf("Processor rank 1 received the message.\n");
        MPI_Isend(&message, N, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &request[2]);
        MPI_Isend(&message, N, MPI_FLOAT, 5, 0, MPI_COMM_WORLD, &request[4]);
    }
    if(rank == 2){
        MPI_Irecv(&message, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &request[1]);
        MPI_Wait(&request[1], MPI_STATUS_IGNORE);
        printf("Processor rank 2 received the message.\n");
        MPI_Isend(&message, N, MPI_FLOAT, 6, 0, MPI_COMM_WORLD, &request[5]);
    }
    if(rank == 3){
        MPI_Irecv(&message, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &request[2]);
        MPI_Wait(&request[2], MPI_STATUS_IGNORE);
        printf("Processor rank 3 received the message.\n");
        MPI_Isend(&message, N, MPI_FLOAT, 7, 0, MPI_COMM_WORLD, &request[6]);
    }
    if(rank == 4){
        MPI_Irecv(&message, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &request[3]);
        MPI_Wait(&request[3], MPI_STATUS_IGNORE);
        printf("Processor rank 4 received the message.\n");
    }
    if(rank == 5){
        MPI_Irecv(&message, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &request[4]);
        MPI_Wait(&request[4], MPI_STATUS_IGNORE);
        printf("Processor rank 5 received the message.\n");
    }
    if(rank == 6){
        MPI_Irecv(&message, N, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &request[5]);
        MPI_Wait(&request[5], MPI_STATUS_IGNORE);
        printf("Processor rank 6 received the message.\n");
    }
    if(rank == 7){
        MPI_Irecv(&message, N, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &request[6]);
        MPI_Wait(&request[6], MPI_STATUS_IGNORE);
        printf("Processor rank 7 received the message.\n");
    }

    /*MPI_Testall(size-1,request,&ready, MPI_STATUS_IGNORE);*/
    /* if (ready){*/
    end = MPI_Wtime();
    printf("Total Time: %Lf\n", end - start);
    /*}*/
    MPI_Finalize();
}
Each MPI task runs in its own address space, so there is no correlation between request[1] on rank 0 and request[1] on rank 2. That means you do not have to "pair" the requests. That being said, if you think "pairing" the requests improves the readability of your code, you might want to do so even though it is not required.
The buffer parameter of MPI_Isend() and MPI_Irecv() is a pointer to the start of the data; that is message (and not &message) here.
If you run with, say, 2 MPI tasks, MPI_Isend(..., dest=2, ...) on rank 0 will fail because 2 is an invalid rank in the MPI_COMM_WORLD communicator.
Many requests are uninitialized when MPI_Waitall() (well, MPI_Testall() here) is invoked. One option is to first initialize all of them to MPI_REQUEST_NULL.
Using &message results in memory corruption, and that likely explains the crash.
From the MPI standard, the prototype is double MPI_Wtime(), so you should use double here (the NaN values likely come from the memory corruption described above).
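A hedged sketch that applies these points to a simplified root-to-all version of the program (not the original tree pattern): the buffer is passed as message, every request slot is initialized to MPI_REQUEST_NULL, and timing uses double.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int N = 1 << 10;
    float *message = malloc(N * sizeof(float));          /* sizeof(float), not sizeof(float *) */

    MPI_Request *request = malloc(size * sizeof(MPI_Request));
    for (int i = 0; i < size; i++)
        request[i] = MPI_REQUEST_NULL;                    /* so MPI_Waitall() skips unused slots */

    double start = MPI_Wtime();                           /* MPI_Wtime() returns double */
    if (rank == 0) {
        for (int i = 0; i < N; i++)
            message[i] = (float)i;
        for (int dest = 1; dest < size; dest++)           /* only ranks that actually exist */
            MPI_Isend(message, N, MPI_FLOAT, dest, 0, MPI_COMM_WORLD, &request[dest]);
    } else {
        MPI_Irecv(message, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &request[0]);
    }
    MPI_Waitall(size, request, MPI_STATUSES_IGNORE);      /* message, never &message, was passed above */
    double end = MPI_Wtime();

    printf("rank %d done in %f s\n", rank, end - start);
    free(request);
    free(message);
    MPI_Finalize();
    return 0;
}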

Problem with defining linear programming constraints

My friend and I are trying to implement a paper, and the last step requires solving a linear programming problem to get the final result. We are not very familiar with LP, so I'd like to ask for your help.
Here's the objective function, which is based on the PROFSET model, and here are the proposed constraints (1) and (2) (the formulas were given as images and are not reproduced here),
where:
Pa and Qi are the binary decision variables
J are all the available categories
F are sets of frequent categories
Φ is the total number of selected categories
Constraint (1) actually says that Qi is 1 if category i is included in some itemset A where Pa = 1
Basically, we are trying to use some common open-source LP solvers (like joptimizer), but we don't know how to define those constraints, especially the ones that encode set-inclusion rules; most of those solvers seem to accept only inequalities.
So, do you have any idea how to define those constraints? Maybe transform them into inequalities or something? Any help would be appreciated.
Thank you
A constraint that is written as an equality can also be written as two inequalities, e.g.
A*x = b is the same as
A*x <= b and A*x >= b
There are two ways to write such an LP:
Hardcode it, meaning you write everything in code, for example in Java.
Write it the mathematical way in a "language" called AMPL: https://ampl.com/resources/the-ampl-book/ For this second way you don't really need to know a programming language; AMPL magically transforms your LP into code and feeds it to a solver, e.g. commercial: CPLEX, Gurobi (academic licenses available), or open source: GLPK. AMPL also provides an online platform where you can supply your model as a .mod file and your data as .dat files.
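As an illustration in my own notation (not taken from the paper): a rule like "Qi must be 1 whenever some selected frequent itemset A contains category i" is usually linearized with one inequality per containing itemset,
Qi >= Pa   for every frequent itemset A with i in A,
together with the binary (0/1) bounds on Pa and Qi.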
If you still want to hardcode your LP, GLPK has nice examples, e.g. in Java:
public class Lp {
// Minimize z = -.5 * x1 + .5 * x2 - x3 + 1
//
// subject to
// 0.0 <= x1 - .5 * x2 <= 0.2
// -x2 + x3 <= 0.4
// where,
// 0.0 <= x1 <= 0.5
// 0.0 <= x2 <= 0.5
// 0.0 <= x3 <= 0.5
public static void main(String[] arg) {
glp_prob lp;
glp_smcp parm;
SWIGTYPE_p_int ind;
SWIGTYPE_p_double val;
int ret;
try {
// Create problem
lp = GLPK.glp_create_prob();
System.out.println("Problem created");
GLPK.glp_set_prob_name(lp, "myProblem");
// Define columns
GLPK.glp_add_cols(lp, 3);
GLPK.glp_set_col_name(lp, 1, "x1");
GLPK.glp_set_col_kind(lp, 1, GLPKConstants.GLP_CV);
GLPK.glp_set_col_bnds(lp, 1, GLPKConstants.GLP_DB, 0, .5);
GLPK.glp_set_col_name(lp, 2, "x2");
GLPK.glp_set_col_kind(lp, 2, GLPKConstants.GLP_CV);
GLPK.glp_set_col_bnds(lp, 2, GLPKConstants.GLP_DB, 0, .5);
GLPK.glp_set_col_name(lp, 3, "x3");
GLPK.glp_set_col_kind(lp, 3, GLPKConstants.GLP_CV);
GLPK.glp_set_col_bnds(lp, 3, GLPKConstants.GLP_DB, 0, .5);
// Create constraints
// Allocate memory
ind = GLPK.new_intArray(3);
val = GLPK.new_doubleArray(3);
// Create rows
GLPK.glp_add_rows(lp, 2);
// Set row details
GLPK.glp_set_row_name(lp, 1, "c1");
GLPK.glp_set_row_bnds(lp, 1, GLPKConstants.GLP_DB, 0, 0.2);
GLPK.intArray_setitem(ind, 1, 1);
GLPK.intArray_setitem(ind, 2, 2);
GLPK.doubleArray_setitem(val, 1, 1.);
GLPK.doubleArray_setitem(val, 2, -.5);
GLPK.glp_set_mat_row(lp, 1, 2, ind, val);
GLPK.glp_set_row_name(lp, 2, "c2");
GLPK.glp_set_row_bnds(lp, 2, GLPKConstants.GLP_UP, 0, 0.4);
GLPK.intArray_setitem(ind, 1, 2);
GLPK.intArray_setitem(ind, 2, 3);
GLPK.doubleArray_setitem(val, 1, -1.);
GLPK.doubleArray_setitem(val, 2, 1.);
GLPK.glp_set_mat_row(lp, 2, 2, ind, val);
// Free memory
GLPK.delete_intArray(ind);
GLPK.delete_doubleArray(val);
// Define objective
GLPK.glp_set_obj_name(lp, "z");
GLPK.glp_set_obj_dir(lp, GLPKConstants.GLP_MIN);
GLPK.glp_set_obj_coef(lp, 0, 1.);
GLPK.glp_set_obj_coef(lp, 1, -.5);
GLPK.glp_set_obj_coef(lp, 2, .5);
GLPK.glp_set_obj_coef(lp, 3, -1);
// Write model to file
// GLPK.glp_write_lp(lp, null, "lp.lp");
// Solve model
parm = new glp_smcp();
GLPK.glp_init_smcp(parm);
ret = GLPK.glp_simplex(lp, parm);
// Retrieve solution
if (ret == 0) {
write_lp_solution(lp);
} else {
System.out.println("The problem could not be solved");
}
// Free memory
GLPK.glp_delete_prob(lp);
} catch (GlpkException ex) {
ex.printStackTrace();
ret = 1;
}
System.exit(ret);
}
/**
* write simplex solution
* @param lp problem
*/
static void write_lp_solution(glp_prob lp) {
int i;
int n;
String name;
double val;
name = GLPK.glp_get_obj_name(lp);
val = GLPK.glp_get_obj_val(lp);
System.out.print(name);
System.out.print(" = ");
System.out.println(val);
n = GLPK.glp_get_num_cols(lp);
for (i = 1; i <= n; i++) {
name = GLPK.glp_get_col_name(lp, i);
val = GLPK.glp_get_col_prim(lp, i);
System.out.print(name);
System.out.print(" = ");
System.out.println(val);
}
}}

OpenCL, double buffering using two command-queues for a single device

I'm creating an application with OpenCL 1.2 as a test for a bigger application. The test adds 1 to each value of a 4x4 matrix with each kernel execution. The idea is to get double buffering to work. I created two kernels that actually do the same thing; they share the same READ_WRITE buffer so each kernel execution can continue where the last one left off, but they differ in having different output buffers, allowing one output buffer to be used by a kernel while the data of the other is being read, just like this:
Diagram
The pieces of code I think are relevant or could be problematic are the following. I include queues, buffers and events just in case, but I have tried changing everything related to these:
Queues
compute_queue = clCreateCommandQueueWithProperties(context, device_id, 0, &err);
data_queue = clCreateCommandQueueWithProperties(context, device_id, 0, &err);
Buffer
input_Parametros = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(double) * 5, Parametros, NULL);
input_matA = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(double) * 4, matA_1, NULL); // The 4x4 matrix
output_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY , sizeof(double) * 4 * iteraciones_por_kernel, NULL, NULL);
output_buffer_2 = clCreateBuffer(context, CL_MEM_WRITE_ONLY , sizeof(double) * 4 * iteraciones_por_kernel, NULL, NULL);
Argument set for each kernel
clSetKernelArg(kernel_1, 0, sizeof(cl_mem), &input_matA);
clSetKernelArg(kernel_1, 1, sizeof(cl_mem), &input_Parametros);
clSetKernelArg(kernel_1, 3, sizeof(cl_mem), &output_buffer);
clSetKernelArg(kernel_2, 0, sizeof(cl_mem), &input_matA);
clSetKernelArg(kernel_2, 1, sizeof(cl_mem), &input_Parametros);
clSetKernelArg(kernel_2, 3, sizeof(cl_mem), &output_buffer_2);
Events
cl_event event_1, event_2, event_3, event_4;
Kernel and Read enqueue
////////////////////////////////////////////////////////////////
// START
////////////////////////////////////////////////////////////////
clEnqueueNDRangeKernel(compute_queue, kernel_1, 1, NULL, global, local, 0, 0, &event_1);
clEnqueueNDRangeKernel(compute_queue, kernel_2, 1, NULL, global, local, 0, 0, &event_2);
clEnqueueReadBuffer(data_queue, output_buffer, CL_FALSE, 0, sizeof(double)*4*iteraciones_por_kernel, datos_salida, 1 , &event_1, &event_3);
////////////////////////////////////////////////////////////////
// ENQUEUE LOOP
////////////////////////////////////////////////////////////////
for (int i = 1; i <= (n_iteraciones_int - 2); i++){
////////////////////////////////////////////////////////////////
// LOOP PART 1
////////////////////////////////////////////////////////////////
if (i % 2 != 0){
clEnqueueNDRangeKernel(compute_queue, kernel_1, 1, NULL, global, local, 1, &event_3, &event_1);
clEnqueueReadBuffer(data_queue, output_buffer_2, CL_FALSE, 0, sizeof(double) * 4 * iteraciones_por_kernel, &datos_salida[i*iteraciones_por_kernel_int*4], 1, &event_2, &event_4);
}
////////////////////////////////////////////////////////////////
// LOOP PART 2
////////////////////////////////////////////////////////////////
if (i % 2 == 0){
clEnqueueNDRangeKernel(compute_queue, kernel_2, 1, NULL, global, local, 1, &event_4, &event_2);
clEnqueueReadBuffer(data_queue, output_buffer, CL_FALSE, 0, sizeof(double) * 4 * iteraciones_por_kernel, &datos_salida[i*iteraciones_por_kernel_int * 4], 1, &event_1, &event_3);
}
}
////////////////////////////////////////////////////////////////
// END
////////////////////////////////////////////////////////////////
clEnqueueReadBuffer(data_queue, output_buffer_2, CL_TRUE, 0, sizeof(double) * 4 * iteraciones_por_kernel, &datos_salida[(n_iteraciones_int - 1) * 4], 1, &event_2, 0);
I just can't get this to work, even though everything seems perfectly fine. The first read gives the expected values, but from then on it's as if the kernels don't execute anymore, since I get 0's from output_buffer_2 and the same values as in the first read from the first output_buffer.
This works perfectly fine with the same kernels and a single queue that does it all with one data transfer at the end, but I don't want that.
I revised everything and investigated as much as I could, and tried every variation I could imagine. This should be easy and possible, I think... where is the problem?
I'm using an AMD HD7970 as the device, Windows 10 and Visual Studio Community 2013, if that is any help.
Thanks to huseyin tugrul buyukisik's help, the program worked with the following variations:
Events
cl_event event[20]; //adjust this to your needs
Kernel and read enqueue
////////////////////////////////////////////////////////////////
// START
////////////////////////////////////////////////////////////////
clEnqueueNDRangeKernel(compute_queue, kernel_1, 1, NULL, global, local, 0, 0, &event[0]);
clEnqueueNDRangeKernel(compute_queue, kernel_2, 1, NULL, global, local, 0, 0, &event[1]);
clEnqueueReadBuffer(data_queue, output_buffer, CL_FALSE, 0, sizeof(double)*4*iteraciones_por_kernel, datos_salida, 1 , &event[0], &event[2]);
////////////////////////////////////////////////////////////////
// LOOP
////////////////////////////////////////////////////////////////
for (int i = 1; i <= (n_iteraciones_int - 2); i++){
////////////////////////////////////////////////////////////////
// LOOP PART 1
////////////////////////////////////////////////////////////////
if (i % 2 == 1){
clEnqueueNDRangeKernel(compute_queue, kernel_1, 1, NULL, global, local, 1, &event[2+2*(i - 1)], &event[4 + 2 * (i - 1)]);
clEnqueueReadBuffer(data_queue, output_buffer_2, CL_FALSE, 0, sizeof(double) * 4 * iteraciones_por_kernel, &datos_salida[i*(iteraciones_por_kernel_int) * 4], 1, &event[1+2*(i - 1)], &event[3 + 2 * (i - 1)]);
}
////////////////////////////////////////////////////////////////
// LOOP PART 2
////////////////////////////////////////////////////////////////
if (i % 2 == 0){
clEnqueueNDRangeKernel(compute_queue, kernel_2, 1, NULL, global, local, 1, &event[3 + 2 * (i - 2)], &event[5 + 2 * (i - 2)]);
clEnqueueReadBuffer(data_queue, output_buffer, CL_FALSE, 0, sizeof(double) * 4 * iteraciones_por_kernel, &datos_salida[i*(iteraciones_por_kernel_int) * 4], 1, &event[4 + 2 * (i - 2)], &event[6 + 2 * (i - 2)]);
}
}
////////////////////////////////////////////////////////////////
// END
////////////////////////////////////////////////////////////////
clFlush(compute_queue);
clFlush(data_queue);
clEnqueueReadBuffer(data_queue, output_buffer_2, CL_TRUE, 0, sizeof(double) * 4 * iteraciones_por_kernel, &datos_salida[(n_iteraciones_int-1)*(iteraciones_por_kernel_int) * 4], 1, &event[5+2*(n_iteraciones_int-4)], 0);

Calling atan function on Blackberry 4.2 JDE

I need to calculate the arctan value from my BlackBerry Java app. Unfortunately, the BlackBerry 4.2 API doesn't have the Math.atan() function. Version 4.6 of the BlackBerry JDE has it, but not 4.2.
Does anyone know of a workaround to calculate atan?
From Arctan in J2ME by Stephen Zimmerman:
// calculation functions
public class Calculation {
// Because J2ME has no floating point numbers,
// some sort of fixed point math is required.
// My implementation is simply to shift 10 places.
// for example, 1024 (>> 10) = 1
// and 512 (>> 10) = 0.5
public static final int[] AtanTable = { 0, 1, 2, 3, 5, 6, 7, 8, 10, 11, 12,
13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29,
30, 30,31, 32, 33, 34, 35, 36, 37, 37, 38, 39, 40, 40, 41,
42, 43, 43, 44, 45 };
// / returns angle 0->359 in degrees
public static int atan(int Y, int X) {
boolean swap = false;
int top = Math.abs(Y);
int bottom = Math.abs(X);
if (top > bottom) {
int btemp = bottom;
bottom = top;
top = btemp;
swap = true;
} else if (bottom == 0)
return -300;
// this should keep index inbounds [0, 45]
int index = (top * 45) / bottom;
int angle = AtanTable[index];
if (swap)
angle = 90 - angle;
// X & Y += 180
// X & !Y = ...90
// !X & Y = ... 270
if ((X < 0) && (Y < 0))
angle += 180;
else if (Y < 0) {
angle = 90 - angle;
angle += 270;
} else if (X < 0) {
angle = 90 - angle;
angle += 90;
}
if (angle == 360)
angle = 0;
return angle;
}
}
When all else fails, one could probably obtain a decent value by estimating the result of an infinite series of the arctan function.
The Wikipedia page on inverse trigonometric functions has a section on the infinite series of inverse trigonometric functions, including arctan. To obtain an estimate, one would carry the series out until the desired precision is reached.
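For reference, the series in question is arctan(x) = x - x^3/3 + x^5/5 - x^7/7 + ... (valid for |x| <= 1), and it converges quickly only when |x| is well below 1.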
As for the reason why the arctan function is not included, it is probably because the processor in the Blackberry isn't very powerful, and would take a lot of processor resources to perform the calculation.
Also, looking at the Blackberry JDE 4.2 API documentation, there appears to be a fixed-point math library called Fixed32 which offers two flavors of arctan. They perform the calculation with 32-bit integers, so they probably offer some performance advantages over performing floating-point arithmetic.
Here is the function I use (no guarantees that it is very fast):
/** Square root from 3 */
final static public double SQRT3 = 1.732050807568877294;
static public double atan(double x)
{
boolean signChange=false;
boolean Invert=false;
int sp=0;
double x2, a;
// check up the sign change
if(x<0.)
{
x=-x;
signChange=true;
}
// check for inversion
if(x>1.)
{
x=1/x;
Invert=true;
}
// process shrinking the domain until x<PI/12
while(x>Math.PI/12)
{
sp++;
a=x+SQRT3;
a=1/a;
x=x*SQRT3;
x=x-1;
x=x*a;
}
// calculation core
x2=x*x;
a=x2+1.4087812;
a=0.55913709/a;
a=a+0.60310579;
a=a-(x2*0.05160454);
a=a*x;
// process until sp=0
while(sp>0)
{
a=a+Math.PI/6;
sp--;
}
// inversion took place
if(Invert) a=Math.PI/2-a;
// sign change took place
if(signChange) a=-a;
//
return a;
}
I had the same problem... the missing math functions can be found in the following package:
net.rim.device.api.util.MathUtilities
First implement the standard arctan(x) using the Taylor series (as described at http://en.wikipedia.org/wiki/Inverse_trigonometric_functions#Infinite_series).
Then do the following before calling arctan:
1) First do this check:
if (x == 0) {
    return 0;
}
2) If |x| > 1, compute arctan(1/x) and subtract the result from Pi/2.
3) If |x| is close to 1, compute arctan of the half angle using the half-angle formula
arctan(x) = 2*arctan(x/(1+sqrt(1+x*x))). That is, first compute the half angle and then multiply the result by 2. Without this, arctan converges very slowly for |x| close to 1.
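A compact sketch of that recipe (written in C here for brevity rather than BlackBerry Java, not tuned for speed; porting the same structure to Java is straightforward):
#include <math.h>   /* only for sqrt(); atan itself is built below */

#define PI 3.14159265358979323846

/* Truncated Taylor series; accurate when |x| is well below 1. */
static double atan_series(double x)
{
    double term = x, sum = x, x2 = x * x;
    for (int n = 1; n < 25; n++) {
        term *= -x2;                    /* term is now (-1)^n * x^(2n+1) */
        sum  += term / (2 * n + 1);
    }
    return sum;
}

double my_atan(double x)
{
    if (x == 0.0)
        return 0.0;                                           /* step 1 */
    if (x < 0.0)
        return -my_atan(-x);                                  /* arctan is odd */
    if (x > 1.0)
        return PI / 2 - my_atan(1.0 / x);                     /* step 2 */
    if (x > 0.5)
        return 2.0 * my_atan(x / (1.0 + sqrt(1.0 + x * x)));  /* step 3: half-angle */
    return atan_series(x);
}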
