How are you supposed to have multiple attributes in one service? - bluetooth-lowenergy

I'm building a BLE automation I/O card, so I'm trying to use the BLE Automation IO service:
https://www.bluetooth.com/wp-content/uploads/Sitecore-Media-Library/Gatt/Xml/Services/org.bluetooth.service.automation_io.xml
It says the service supports one or more Analog (0x2A58) or Digital (0x2A56) characteristics... so I created a whole list of analog and digital characteristics:
Digital - "Output energized" [RO]
Analog - "Current Temp" [RO]
Digital - "Self-test" [RW]
Digital - "Self-test results" [RO]
Analog - "Temp setpoint" [RW]
-- but since they are the Bluetooth SIG assigned numbers, they all come up with the same UUID, and it confuses the client software.
Should I make them "custom" with my own 128-bit UUIDs, or is there an "index" or some such field that I should use to serialize them properly?
I'm using the BlueNRG-1 chip, and the code I'm currently using is
suuid.Service_UUID_16 = 0x1815; //Automation
ret = aci_gatt_add_service(UUID_TYPE_16, &suuid, PRIMARY_SERVICE, 35, &AutomationHandle);
if (ret != BLE_STATUS_SUCCESS) {PRINTF("j"); goto fail;}
cuuid.Char_UUID_16=0x2A56; //[0]Digital: Bar Detected
ret = aci_gatt_add_char(AutomationHandle, UUID_TYPE_16, &cuuid, 1, CHAR_PROP_READ, ATTR_PERMISSION_NONE, GATT_NOTIFY_READ_REQ_AND_WAIT_FOR_APPL_RESP, 16, CHAR_VALUE_LEN_CONSTANT, &OutCharHandle);
if (ret != BLE_STATUS_SUCCESS) {PRINTF("k"); goto fail;}
cuuid.Char_UUID_16= 0x2A58; //[1]Analog: Current Temp
ret = aci_gatt_add_char(AutomationHandle, UUID_TYPE_16, &cuuid, 2, CHAR_PROP_READ, ATTR_PERMISSION_NONE, GATT_NOTIFY_READ_REQ_AND_WAIT_FOR_APPL_RESP, 16, CHAR_VALUE_LEN_CONSTANT, &CurDetectCharHandle);
if (ret != BLE_STATUS_SUCCESS) {PRINTF("l"); goto fail;}
ApplyFormat_Temp16(AutomationHandle,CurDetectCharHandle);
cuuid.Char_UUID_16= 0x2A58; //[2]Analog: BarCount
ret = aci_gatt_add_char(AutomationHandle, UUID_TYPE_16, &cuuid, 2, CHAR_PROP_READ | CHAR_PROP_WRITE | GATT_NOTIFY_ATTRIBUTE_WRITE , ATTR_PERMISSION_NONE, GATT_NOTIFY_WRITE_REQ_AND_WAIT_FOR_APPL_RESP | GATT_NOTIFY_READ_REQ_AND_WAIT_FOR_APPL_RESP, 16, CHAR_VALUE_LEN_CONSTANT, &LastbarcountCharHandle);
if (ret != BLE_STATUS_SUCCESS) {PRINTF("r"); goto fail;}
cuuid.Char_UUID_16= 0x2A58; //[3]Analog: Last Bar Temp
ret = aci_gatt_add_char(AutomationHandle, UUID_TYPE_16, &cuuid, 2, CHAR_PROP_READ, ATTR_PERMISSION_NONE, GATT_NOTIFY_READ_REQ_AND_WAIT_FOR_APPL_RESP, 16, CHAR_VALUE_LEN_CONSTANT, &LastbartempCharHandle);
if (ret != BLE_STATUS_SUCCESS) {PRINTF("m"); goto fail;}
cuuid.Char_UUID_16= 0x2A56; //[5]Digital: Self Test
ret = aci_gatt_add_char(AutomationHandle, UUID_TYPE_16, &cuuid, 1, CHAR_PROP_READ | CHAR_PROP_WRITE | GATT_NOTIFY_ATTRIBUTE_WRITE, ATTR_PERMISSION_NONE, GATT_NOTIFY_WRITE_REQ_AND_WAIT_FOR_APPL_RESP | GATT_NOTIFY_READ_REQ_AND_WAIT_FOR_APPL_RESP, 16, CHAR_VALUE_LEN_CONSTANT, &SelftestCharHandle);
if (ret != BLE_STATUS_SUCCESS) {PRINTF("o"); goto fail;}
cuuid.Char_UUID_16= 0x2A56; //[6]Digital: Self Test Passed
ret = aci_gatt_add_char(AutomationHandle, UUID_TYPE_16, &cuuid, 1, CHAR_PROP_READ | CHAR_PROP_WRITE | GATT_NOTIFY_ATTRIBUTE_WRITE , ATTR_PERMISSION_NONE, GATT_NOTIFY_WRITE_REQ_AND_WAIT_FOR_APPL_RESP | GATT_NOTIFY_READ_REQ_AND_WAIT_FOR_APPL_RESP, 16, CHAR_VALUE_LEN_CONSTANT, &STpassedCharHandle);
if (ret != BLE_STATUS_SUCCESS) {PRINTF("p"); goto fail;}
...
I do see them in the app, but they all have the same UUID, so the app does not pick up the differences in access permissions.
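For reference: GATT clients are expected to tell multiple instances of the same characteristic apart by attribute handle, not by UUID, and the Automation IO service additionally expects each Digital/Analog instance to carry a Characteristic Presentation Format descriptor (0x2904) whose two-byte Description field numbers the instances, so the SIG UUIDs can be kept. Below is a minimal sketch for the first Digital characteristic; the aci_gatt_add_char_desc() prototype and the ATTR_ACCESS_READ_ONLY / GATT_DONT_NOTIFY_EVENTS constants are assumed from the BlueNRG-1 stack headers, and the value bytes follow the GATT Presentation Format layout:
/* Sketch: attach a Characteristic Presentation Format descriptor (0x2904)
 * so identical SIG UUIDs can be told apart by the Description field. */
Char_Desc_Uuid_t duuid;
uint16_t DescHandle;
uint8_t pres_fmt[7];
duuid.Char_UUID_16 = 0x2904;             /* Characteristic Presentation Format */
pres_fmt[0] = 0x1B;                      /* format: struct (Digital is a 2-bit array) */
pres_fmt[1] = 0x00;                      /* exponent */
pres_fmt[2] = 0x00; pres_fmt[3] = 0x27;  /* unit: 0x2700 = unitless (little-endian) */
pres_fmt[4] = 0x01;                      /* namespace: Bluetooth SIG */
pres_fmt[5] = 0x01; pres_fmt[6] = 0x00;  /* description: instance index 1 */
ret = aci_gatt_add_char_desc(AutomationHandle, OutCharHandle,
                             UUID_TYPE_16, &duuid,
                             sizeof(pres_fmt), sizeof(pres_fmt), pres_fmt,
                             ATTR_PERMISSION_NONE, ATTR_ACCESS_READ_ONLY,
                             GATT_DONT_NOTIFY_EVENTS, 16, 0, &DescHandle);
if (ret != BLE_STATUS_SUCCESS) {PRINTF("d"); goto fail;}
Repeating this with a different Description value (0x0002, 0x0003, ...) for each Digital/Analog characteristic gives the client a stable per-instance index, so custom 128-bit UUIDs should not be necessary.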

Related

Problems with vector addition in OpenCL

I want to learn OpenCL, so I read a tutorial with a simple vector addition: https://www.eriksmistad.no/getting-started-with-opencl-and-gpu-computing/
I'm working on Ubuntu:
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
And I have an RTX 3080 Ti, recognized by the system (output of nvidia-smi):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 On | N/A |
| 0% 54C P8 38W / 350W | 634MiB / 12288MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1766 G /usr/lib/xorg/Xorg 312MiB |
| 0 N/A N/A 2087 G /usr/bin/gnome-shell 105MiB |
| 0 N/A N/A 3343 G ...5/usr/lib/firefox/firefox 183MiB |
+-----------------------------------------------------------------------------+
I installed OpenCL with apt-get install opencl-headers, and CUDA provides the OpenCL drivers.
Here is the code:
#include <stdio.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#define MAX_SOURCE_SIZE (0x100000)
int main(void) {
    // Create the two input vectors
    int i;
    const int LIST_SIZE = 10;
    int *A = (int*)malloc(sizeof(int)*LIST_SIZE);
    int *B = (int*)malloc(sizeof(int)*LIST_SIZE);
    for(i = 0; i < LIST_SIZE; i++) {
        A[i] = i;
        B[i] = i;
    }

    // Load the kernel source code into the array source_str
    FILE *fp;
    char *source_str;
    size_t source_size;
    char str_buffer[1024];
    fp = fopen("vector_add_kernel.cl", "r");
    if (!fp) {
        fprintf(stderr, "Failed to load kernel.\n");
        exit(1);
    }
    source_str = (char*)malloc(MAX_SOURCE_SIZE);
    source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
    fclose(fp);

    // Get platform and device information
    cl_platform_id platform_id = NULL;
    cl_device_id device_id = NULL;
    cl_uint ret_num_devices;
    cl_uint ret_num_platforms;
    cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1,
                         &device_id, &ret_num_devices);

    // Create an OpenCL context
    cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);

    // Create a command queue
    cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret);

    // Create memory buffers on the device for each vector
    cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                      LIST_SIZE * sizeof(int), NULL, &ret);
    cl_mem b_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                      LIST_SIZE * sizeof(int), NULL, &ret);
    cl_mem c_mem_obj = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                      LIST_SIZE * sizeof(int), NULL, &ret);

    // Copy the lists A and B to their respective memory buffers
    ret = clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0,
                               LIST_SIZE * sizeof(int), A, 0, NULL, NULL);
    ret = clEnqueueWriteBuffer(command_queue, b_mem_obj, CL_TRUE, 0,
                               LIST_SIZE * sizeof(int), B, 0, NULL, NULL);

    // Create a program from the kernel source
    cl_program program = clCreateProgramWithSource(context, 1,
            (const char **)&source_str, (const size_t *)&source_size, &ret);

    // Build the program
    ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);

    // Create the OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "vector_add", &ret);

    // Set the arguments of the kernel
    ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&a_mem_obj);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&b_mem_obj);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&c_mem_obj);

    // Execute the OpenCL kernel on the list
    size_t global_item_size = LIST_SIZE; // Process the entire lists
    size_t local_item_size = 64; // Divide work items into groups of 64
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                 &global_item_size, &local_item_size, 0, NULL, NULL);

    // Read the memory buffer C on the device to the local variable C
    int *C = (int*)malloc(sizeof(int)*LIST_SIZE);
    ret = clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0,
                              LIST_SIZE * sizeof(int), C, 0, NULL, NULL);

    // Display the result to the screen
    for(i = 0; i < LIST_SIZE; i++)
        printf("%d + %d = %d\n", A[i], B[i], C[i]);

    // Clean up
    ret = clFlush(command_queue);
    ret = clFinish(command_queue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(a_mem_obj);
    ret = clReleaseMemObject(b_mem_obj);
    ret = clReleaseMemObject(c_mem_obj);
    ret = clReleaseCommandQueue(command_queue);
    ret = clReleaseContext(context);
    free(A);
    free(B);
    free(C);
    return 0;
}
And the code of the kernel:
__kernel void vector_add(__global const int *A, __global const int *B, __global int *C) {
    // Get the index of the current element to be processed
    int i = get_global_id(0);
    // Do the operation
    C[i] = A[i] + B[i];
}
I compile with: gcc main.c -o vectorAddition -lOpenCL
And running vectorAddition gives me this:
platform name : NVIDIA CUDA
platform vendor : NVIDIA Corporation
Device name : NVIDIA Corporation
0 + 0 = 0
1 + 1 = 0
2 + 2 = 0
3 + 3 = 0
4 + 4 = 0
5 + 5 = 0
6 + 6 = 0
7 + 7 = 0
8 + 8 = 0
9 + 9 = 0
Thanks
I already read a post which is pretty much the same as mine:
https://stackoverflow.com/questions/54606449/opencl-vector-addition-program
but I think my clCreateBuffer calls are correct.
I put these lines in my code to make sure my GPU is recognized:
// Get the name of the platform and device
// (pass the handles obtained from clGetPlatformIDs/clGetDeviceIDs, not 0)
ret = clGetPlatformInfo(platform_id, CL_PLATFORM_NAME, sizeof(str_buffer), str_buffer, NULL);
printf("platform name : %s\n", str_buffer);
ret = clGetPlatformInfo(platform_id, CL_PLATFORM_VENDOR, sizeof(str_buffer), str_buffer, NULL);
printf("platform vendor : %s\n", str_buffer);
ret = clGetDeviceInfo(device_id, CL_DEVICE_NAME, sizeof(str_buffer), str_buffer, NULL);
printf("Device name : %s\n", str_buffer);
If anyone has the same issue, I found out the solution. The problem is that the tutorial author set up the work-groups with these lines:
size_t global_item_size = LIST_SIZE; // Process the entire lists
size_t local_item_size = 64; // Divide work items into groups of 64
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,&global_item_size, &local_item_size, 0, NULL, NULL);
You can get an error (or undefined behaviour) if global_item_size is not a multiple of local_item_size.
So you can either pass NULL instead of &local_item_size, or choose a LIST_SIZE that is a multiple of 64 (64, 128, ...).
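For what it's worth, here is a drop-in fix for the enqueue in the code above, using the same variable names; in OpenCL 1.x, clEnqueueNDRangeKernel returns CL_INVALID_WORK_GROUP_SIZE when the global size is not evenly divisible by the local size, so it is worth checking the return code too:
// Let the runtime pick the work-group size, and check the return code.
size_t global_item_size = LIST_SIZE; // Process the entire lists
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             &global_item_size, NULL /* local size: runtime-chosen */,
                             0, NULL, NULL);
if (ret != CL_SUCCESS)
    fprintf(stderr, "clEnqueueNDRangeKernel failed: %d\n", ret);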

Authentication after a read failure of a MIFARE card

An Arduino and an MFRC522 RFID reader make it easy to read and write MIFARE cards.
My aim is to use both keys of the card in a sector: some blocks are readable with key A, and some only with key B.
Setting the access bits properly allows this behaviour. To try it on the second sector (blocks 4 5 6 7), I set the access bits g0 [0 0 0], g1 [1 0 1], g2 [0 0 0], g3 [1 0 1] by writing block 7 with {..., 0xF5, 0xA5, 0xA0, 0x38, ...}, cf. §8.7 of https://www.nxp.com/docs/en/data-sheet/MF1S50YYX_V1.pdf. Now block 5 is no longer readable with key A but only with key B (g1), and key B itself is no longer readable (g3). So when I authenticate with key A, an attempt to read block 5 leads to an error. From that point on, the other authentications fail too, and it is not possible to read the other blocks, even though the access bits allow it.
Reading the second sector with key B for block 5 and key A for the other blocks works. But trying to read with key A first and then with key B in case of failure does not work.
Extract of the code:
// The sector of interest is the second one: blocks 4 5 6 7
Serial.println("\nAuthenticate block 0x05 with key B");
for (i = 4; i < 8; i++) {
    // Block 5 is readable with key B
    status = readblock(i==5 ? MF1_AUTHENT1B : MF1_AUTHENT1A, i, i==5 ? keyB : keyA, serial);
    if (status == MI_ERR) {
        Serial.print(" - Unable to read block nb. 0x");
        Serial.println(i, HEX);
    }
}
Serial.println("\nAuthenticate with key A then key B if failed");
for (i = 4; i < 8; i++) {
    // Try to authenticate each block first with the A key.
    status = readblock(MF1_AUTHENT1A, i, keyA, serial);
    if (status == MI_ERR) {
        Serial.print(" - Try keyB - ");
        status = readblock(MF1_AUTHENT1B, i, keyB, serial);
        if (status == MI_ERR) {
            Serial.print(" - Unable to read block nb. 0x");
            Serial.println(i, HEX);
        }
    }
}
The readblock function (authentication and read):
byte readblock(byte mode, byte block, byte *key, byte *serial)
{
    int j;
    byte data[MAX_LEN];
    byte status = MI_ERR;
    // Try to authenticate the block first
    status = nfc.authenticate(mode, block, key, serial);
    if (status == MI_OK) {
        Serial.print("Authenticated block nb. 0x");
        Serial.print(block, HEX);
        // Reading block i from the tag into data.
        status = nfc.readFromTag(block, data);
        if (status == MI_OK) {
            // If there was no error when reading, print all the hex
            // values in the data.
            Serial.print(" : ");
            for (j = 0; j < 15; j++) {
                Serial.print(data[j], HEX);
                Serial.print(", ");
            }
            Serial.println(data[15], HEX);
        } else
            Serial.print(" - Read failed");
    } else {
        Serial.print("Failed authentication block nb. 0x");
        Serial.print(block, HEX);
    }
    return status;
}
The result is
Authenticate block 0x05 with key B
Authenticated block nb. 0x4 : 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Authenticated block nb. 0x5 : AB, CD, EF, 1, 23, 45, 67, 89, 98, 76, 54, 1A, 10, FE, DC, BA
Authenticated block nb. 0x6 : 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Authenticated block nb. 0x7 : 0, 0, 0, 0, 0, 0, F5, A5, A0, 38, 0, 0, 0, 0, 0, 0
Authenticate with key A then key B if failed
Authenticated block nb. 0x4 : 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Authenticated block nb. 0x5 - Read failed - Try keyB - Failed authentication block nb. 0x5 - Unable to read block nb. 0x5
Failed authentication block nb. 0x6 - Try keyB - Failed authentication block nb. 0x6 - Unable to read block nb. 0x6
Failed authentication block nb. 0x7 - Try keyB - Failed authentication block nb. 0x7 - Unable to read block nb. 0x7
So I would like to know whether it is possible to attempt to read a block with the wrong key and then carry on reading the block with the other key, and so on.
An explanation can be found in https://www.nxp.com/docs/en/application-note/AN1304.pdf, p. 24:
Each time an Authentication operation, a Read operation or a Write operation fails, the MIFARE Classic or MIFARE Plus remains silent and it does not respond anymore to any commands. In this situation in order to continue the NDEF Detection Procedure the MIFARE Classic or MIFARE Plus needs to be re-activated and selected.
So you have to re-activate and select the tag after a failure, for instance by adding these lines to the code:
Serial.println("\nAuthenticate with key A then key B if failed");
for (i = 4; i < 8; i++) {
// Try to authenticate each block first with the A key.
status = readblock(MF1_AUTHENT1A, i, keyA, serial);
if ( status == MI_ERR) {
Serial.print(" - Try keyB - ");
/** RE ACTIVATE AND SELECT ------------------------------- **/
nfc.haltTag();
status = nfc.requestTag(MF1_REQIDL, data);
if (status == MI_OK) {
status = nfc.antiCollision(data);
memcpy(serial, data, 5);
nfc.selectTag(serial);
}
/** ------------------------------------------------------ **/
status = readblock(MF1_AUTHENT1B, i, keyB, serial);
if ( status == MI_ERR) {
Serial.print(" - Unable to read block nb. 0x");
Serial.println(i, HEX);
}
}
}
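If this comes up in more places, the re-activation can be folded into a small wrapper around readblock(). A sketch using only the calls already shown above; readblockWithRetry is a hypothetical name, and it assumes antiCollision() returns MI_OK on success, as its use above suggests:
byte readblockWithRetry(byte block, byte *keyA, byte *keyB, byte *serial)
{
    byte data[MAX_LEN];
    // First attempt with key A.
    byte status = readblock(MF1_AUTHENT1A, block, keyA, serial);
    if (status == MI_ERR) {
        // The card goes silent after a failed authentication
        // (AN1304, p. 24), so re-activate and select it first.
        nfc.haltTag();
        if (nfc.requestTag(MF1_REQIDL, data) == MI_OK &&
            nfc.antiCollision(data) == MI_OK) {
            memcpy(serial, data, 5);
            nfc.selectTag(serial);
        }
        // Second attempt with key B.
        status = readblock(MF1_AUTHENT1B, block, keyB, serial);
    }
    return status;
}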

OpenCL, double buffering using two command-queues for a single device

I'm creating an application with OpenCL 1.2 as a test for a bigger application. The test adds 1 to each value of a 4x4 matrix with each kernel execution. The idea is to get double buffering to work. I created two kernels that do the same thing. They share the same READ_WRITE buffer, so each kernel execution can continue where the last one left off, but they have different output buffers, allowing one output buffer to be used by a kernel while the data of the other is being read, like this:
[diagram]
The pieces of code I think are relevant or could be problematic are the following. I include the queues, buffers and events just in case, but I have tried changing everything related to these:
Queues
compute_queue = clCreateCommandQueueWithProperties(context, device_id, 0, &err);
data_queue = clCreateCommandQueueWithProperties(context, device_id, 0, &err);
Buffer
input_Parametros = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(double) * 5, Parametros, NULL);
input_matA = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(double) * 4, matA_1, NULL); // The 4x4 matrix
output_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY , sizeof(double) * 4 * iteraciones_por_kernel, NULL, NULL);
output_buffer_2 = clCreateBuffer(context, CL_MEM_WRITE_ONLY , sizeof(double) * 4 * iteraciones_por_kernel, NULL, NULL);
Argument set for each kernel
clSetKernelArg(kernel_1, 0, sizeof(cl_mem), &input_matA);
clSetKernelArg(kernel_1, 1, sizeof(cl_mem), &input_Parametros);
clSetKernelArg(kernel_1, 3, sizeof(cl_mem), &output_buffer);
clSetKernelArg(kernel_2, 0, sizeof(cl_mem), &input_matA);
clSetKernelArg(kernel_2, 1, sizeof(cl_mem), &input_Parametros);
clSetKernelArg(kernel_2, 3, sizeof(cl_mem), &output_buffer_2);
Events
cl_event event_1, event_2, event_3, event_4;
Kernel and Read enqueue
////////////////////////////////////////////////////////////////
// START
////////////////////////////////////////////////////////////////
clEnqueueNDRangeKernel(compute_queue, kernel_1, 1, NULL, global, local, 0, 0, &event_1);
clEnqueueNDRangeKernel(compute_queue, kernel_2, 1, NULL, global, local, 0, 0, &event_2);
clEnqueueReadBuffer(data_queue, output_buffer, CL_FALSE, 0, sizeof(double)*4*iteraciones_por_kernel, datos_salida, 1 , &event_1, &event_3);
////////////////////////////////////////////////////////////////
// ENQUEUE LOOP
////////////////////////////////////////////////////////////////
for (int i = 1; i <= (n_iteraciones_int - 2); i++){
    ////////////////////////////////////////////////////////////////
    // LOOP PART 1
    ////////////////////////////////////////////////////////////////
    if (i % 2 != 0){
        clEnqueueNDRangeKernel(compute_queue, kernel_1, 1, NULL, global, local, 1, &event_3, &event_1);
        clEnqueueReadBuffer(data_queue, output_buffer_2, CL_FALSE, 0, sizeof(double) * 4 * iteraciones_por_kernel, &datos_salida[i*iteraciones_por_kernel_int*4], 1, &event_2, &event_4);
    }
    ////////////////////////////////////////////////////////////////
    // LOOP PART 2
    ////////////////////////////////////////////////////////////////
    if (i % 2 == 0){
        clEnqueueNDRangeKernel(compute_queue, kernel_2, 1, NULL, global, local, 1, &event_4, &event_2);
        clEnqueueReadBuffer(data_queue, output_buffer, CL_FALSE, 0, sizeof(double) * 4 * iteraciones_por_kernel, &datos_salida[i*iteraciones_por_kernel_int * 4], 1, &event_1, &event_3);
    }
}
////////////////////////////////////////////////////////////////
// END
////////////////////////////////////////////////////////////////
clEnqueueReadBuffer(data_queue, output_buffer_2, CL_TRUE, 0, sizeof(double) * 4 * iteraciones_por_kernel, &datos_salida[(n_iteraciones_int - 1) * 4], 1, &event_2, 0);
I just can't get this to work, even though everything seems fine. The first read gives the expected values, but from then on it's as if the kernels no longer execute, since I get 0's from output_buffer_2 and the same values as in the first read from the first output_buffer.
This works perfectly with the same kernels and a single queue that does it all with one data transfer at the end, but I don't want that.
I revised everything, investigated as much as I could, and tried every variation I could imagine. This should be easy and possible, I think... where is the problem?
I'm using an AMD HD 7970 as the device, on Windows 10 with Visual Studio Community 2013, if that helps.
Thanks to huseyin tugrul buyukisik's help, the program worked with the following variations: a separate event per enqueued command instead of recycling the same four events, plus a flush of both queues before the final blocking read.
Events
cl_event event[20]; //adjust this to your needs
Kernel and read enqueue
////////////////////////////////////////////////////////////////
// START
////////////////////////////////////////////////////////////////
clEnqueueNDRangeKernel(compute_queue, kernel_1, 1, NULL, global, local, 0, 0, &event[0]);
clEnqueueNDRangeKernel(compute_queue, kernel_2, 1, NULL, global, local, 0, 0, &event[1]);
clEnqueueReadBuffer(data_queue, output_buffer, CL_FALSE, 0, sizeof(double)*4*iteraciones_por_kernel, datos_salida, 1 , &event[0], &event[2]);
////////////////////////////////////////////////////////////////
// LOOP
////////////////////////////////////////////////////////////////
for (int i = 1; i <= (n_iteraciones_int - 2); i++){
    ////////////////////////////////////////////////////////////////
    // LOOP PART 1
    ////////////////////////////////////////////////////////////////
    if (i % 2 == 1){
        clEnqueueNDRangeKernel(compute_queue, kernel_1, 1, NULL, global, local, 1, &event[2+2*(i - 1)], &event[4 + 2 * (i - 1)]);
        clEnqueueReadBuffer(data_queue, output_buffer_2, CL_FALSE, 0, sizeof(double) * 4 * iteraciones_por_kernel, &datos_salida[i*(iteraciones_por_kernel_int) * 4], 1, &event[1+2*(i - 1)], &event[3 + 2 * (i - 1)]);
    }
    ////////////////////////////////////////////////////////////////
    // LOOP PART 2
    ////////////////////////////////////////////////////////////////
    if (i % 2 == 0){
        clEnqueueNDRangeKernel(compute_queue, kernel_2, 1, NULL, global, local, 1, &event[3 + 2 * (i - 2)], &event[5 + 2 * (i - 2)]);
        clEnqueueReadBuffer(data_queue, output_buffer, CL_FALSE, 0, sizeof(double) * 4 * iteraciones_por_kernel, &datos_salida[i*(iteraciones_por_kernel_int) * 4], 1, &event[4 + 2 * (i - 2)], &event[6 + 2 * (i - 2)]);
    }
}
////////////////////////////////////////////////////////////////
// END
////////////////////////////////////////////////////////////////
clFlush(compute_queue);
clFlush(data_queue);
clEnqueueReadBuffer(data_queue, output_buffer_2, CL_TRUE, 0, sizeof(double) * 4 * iteraciones_por_kernel, &datos_salida[(n_iteraciones_int-1)*(iteraciones_por_kernel_int) * 4], 1, &event[5+2*(n_iteraciones_int-4)], 0);

Persistent communication in MPI - odd behaviour

I am solving the coarsest grid of a parallel geometric multigrid using Jacobi iterations and the non-blocking calls MPI_Isend() and MPI_Irecv(). That works without problems. As soon as I replace the non-blocking communications with persistent communications, the results stop converging at this level and the program goes into an infinite loop. The calls MPI_Startall() and MPI_Waitall() always return MPI_SUCCESS. Has anyone faced this problem before? Please advise.
Coarsest_grid_solve()
{
    MPI_Recv_init(&e_c_old[0][1][1], 1, x_subarray_c, X_DOWN, 10, new_comm, &recv[0]);
    MPI_Recv_init(&e_c_old[PXC+1][1][1], 1, x_subarray_c, X_UP, 20, new_comm, &recv[1]);
    MPI_Recv_init(&e_c_old[1][PYC+1][1], 1, y_subarray_c, Y_RIGHT, 30, new_comm, &recv[2]);
    MPI_Recv_init(&e_c_old[1][0][1], 1, y_subarray_c, Y_LEFT, 40, new_comm, &recv[3]);
    MPI_Recv_init(&e_c_old[1][1][PZC+1], 1, z_subarray_c, Z_AWAY_U, 50, new_comm, &recv[4]);
    MPI_Recv_init(&e_c_old[1][1][0], 1, z_subarray_c, Z_TOWARDS_U, 60, new_comm, &recv[5]);

    MPI_Send_init(&e_c_old[PXC][1][1], 1, x_subarray_c, X_UP, 10, new_comm, &send[0]);
    MPI_Send_init(&e_c_old[1][1][1], 1, x_subarray_c, X_DOWN, 20, new_comm, &send[1]);
    MPI_Send_init(&e_c_old[1][1][1], 1, y_subarray_c, Y_LEFT, 30, new_comm, &send[2]);
    MPI_Send_init(&e_c_old[1][PYC][1], 1, y_subarray_c, Y_RIGHT, 40, new_comm, &send[3]);
    MPI_Send_init(&e_c_old[1][1][1], 1, z_subarray_c, Z_TOWARDS_U, 50, new_comm, &send[4]);
    MPI_Send_init(&e_c_old[1][1][PZC], 1, z_subarray_c, Z_AWAY_U, 60, new_comm, &send[5]);

    while(rk_global/r0_global > TOL_CNORM)
    {
        coarse_iterations++;

        err = MPI_Startall(6, recv);
        if(err == MPI_SUCCESS)
            printf("success");
        err = MPI_Startall(6, send);
        if(err == MPI_SUCCESS)
            printf("success");
        err = MPI_Waitall(6, send, MPI_STATUSES_IGNORE);
        if(err == MPI_SUCCESS)
            printf("success");
        err = MPI_Waitall(6, recv, MPI_STATUSES_IGNORE);
        if(err == MPI_SUCCESS)
            printf("success");

        //do work here

        if(coarse_iterations == 1)
        {
            update_neumann_c(e_c_old, PXC, PYC, PZC, X_UP, Y_RIGHT, Z_AWAY_U);
            residual_coarsest(e_c_old, rho_c, PXC, PYC, PZC, X_UP, Y_RIGHT, Z_AWAY_U, hc, rho_temp);
            r0_local = residual_norm(rho_temp, PXC, PYC, PZC);
            start_allred = MPI_Wtime();
            MPI_Allreduce(&r0_local, &r0_global, 1, MPI_DOUBLE, MPI_SUM, new_comm);
            end_allred = MPI_Wtime();
            r0_global = r0_global/( (PXC*dims0) * (PYC*dims1) * (PZC*dims2) );
            if(rank == 0)
                printf("\nGlobal residual norm is = %f", r0_global);
            rk_global = r0_global;
        }
        else
        {
            update_neumann_c(e_c_old, PXC, PYC, PZC, X_UP, Y_RIGHT, Z_AWAY_U);
            residual_coarsest(e_c_old, rho_c, PXC, PYC, PZC, X_UP, Y_RIGHT, Z_AWAY_U, hc, rho_temp);
            rk_local = residual_norm(rho_temp, PXC, PYC, PZC);
            start_allred = MPI_Wtime();
            MPI_Allreduce(&rk_local, &rk_global, 1, MPI_DOUBLE, MPI_SUM, new_comm);
            end_allred = MPI_Wtime();
            rk_global = rk_global/( (PXC*dims0) * (PYC*dims1) * (PZC*dims2) );
            if(rank == 0)
                printf("\nGlobal residual norm is = %f", rk_global);
        }

        //do dependent work and exchange matrices
    }//while loop ends

    for(i = 0; i <= 5; i++)
    {
        MPI_Request_free(&send[i]);
        MPI_Request_free(&recv[i]);
    }
}//End coarsest grid solve
Note: strangely, the ghost data becomes zero on alternate iterations (just found out; I don't know why).
When we create a persistent handle for communication, we point to a specific piece of memory that we want to transfer to some other process. Now, in Jacobi iterations we need to exchange pointers at the end of each iteration to make the old matrix point to the newly updated matrix. Thus the memory location the pointer refers to changes, while the persistent handles keep referring to the original locations; in effect, the send and receive locations are swapped out from under the handles. The way around it is to define two sets of persistent communication handles: use the first set on odd iterations and the second on even iterations, i.e. alternate them. This solved my problem. It also widened my understanding of persistent communication in MPI.
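Below is a minimal sketch of that fix, assuming a simplified 1-D halo exchange: buf_a/buf_b stand for the two arrays whose roles are swapped each Jacobi iteration, and count, up, down and the buffer offsets are illustrative, not the 3-D subarray layout used above.
#include <mpi.h>

/* Bind one set of persistent requests to one buffer. The offsets into
 * buf are placeholders for the real halo locations. */
static void make_requests(double *buf, int count, int up, int down,
                          MPI_Comm comm, MPI_Request req[4])
{
    MPI_Recv_init(&buf[0],         count, MPI_DOUBLE, down, 10, comm, &req[0]);
    MPI_Recv_init(&buf[count],     count, MPI_DOUBLE, up,   20, comm, &req[1]);
    MPI_Send_init(&buf[2 * count], count, MPI_DOUBLE, up,   10, comm, &req[2]);
    MPI_Send_init(&buf[3 * count], count, MPI_DOUBLE, down, 20, comm, &req[3]);
}

void jacobi_persistent(double *buf_a, double *buf_b, int count,
                       int up, int down, MPI_Comm comm, int n_iter)
{
    MPI_Request reqs_a[4], reqs_b[4];
    make_requests(buf_a, count, up, down, comm, reqs_a);
    make_requests(buf_b, count, up, down, comm, reqs_b);

    for (int it = 0; it < n_iter; it++) {
        /* Pick the request set bound to the buffer currently holding
         * the "old" solution (the role alternates with the swap). */
        MPI_Request *reqs = (it % 2 == 0) ? reqs_a : reqs_b;
        MPI_Startall(4, reqs);
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        /* ... smooth, compute the residual norm, then swap the roles
         * of buf_a and buf_b (implicit here in the it % 2 choice) ... */
    }

    for (int i = 0; i < 4; i++) {
        MPI_Request_free(&reqs_a[i]);
        MPI_Request_free(&reqs_b[i]);
    }
}
The key point is that each persistent request stays bound to the buffer it was created with, so the request set has to be chosen to match whichever buffer currently plays the "old" role.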

Crash in Intel MPI_COMM_SPAWN_MULTIPLE

I am getting a crash inside MPI_COMM_SPAWN_MULTIPLE when using the Intel MPI implementation and compiling with ifort. The same code works without problems with OpenMPI compiled with gfortran. The relevant code is posted below.
ALLOCATE(commands(num_processes - 1))
ALLOCATE(args(num_processes - 1,2))
ALLOCATE(info(num_processes - 1))
ALLOCATE(max_procs(num_processes - 1))
ALLOCATE(error_array(num_processes - 1))

commands = TRIM(context%cl_parser%command)
args(:,1) = temp_string
IF (num_threads .lt. 0) THEN
   args(:,2) = '-para=-1'
ELSE IF (num_threads .lt. 10) THEN
   WRITE (temp_string, 1001) num_threads
   args(:,2) = TRIM(temp_string)
ELSE
   WRITE (temp_string, 1002) num_threads
   args(:,2) = TRIM(temp_string)
END IF
max_procs = 1

DO i = 2, num_processes
   CALL MPI_INFO_CREATE(info(i - 1), error)
   CALL MPI_INFO_SET(info(i - 1), "wdir", process_dir(i), &
   &                 error)
END DO

CALL MPI_COMM_SPAWN_MULTIPLE(num_processes - 1, commands, args, &
   &                         max_procs, info, 0, &
   &                         MPI_COMM_WORLD, child_comm, &
   &                         error_array, error)
CALL MPI_INTERCOMM_MERGE(child_comm, .false., &
   &                     context%intra_comm, error)

DO i = 1, num_processes - 1
   CALL MPI_INFO_FREE(info(i), error)
END DO

DEALLOCATE(info)
DEALLOCATE(max_procs)
DEALLOCATE(args)
DEALLOCATE(commands)
DEALLOCATE(error_array)
