How does a barrier work for an OpenCL kernel?

Kernel code:
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#pragma OPENCL EXTENSION cl_amd_printf : enable
__kernel void calculate (__global double* in)
{
    int idx = get_global_id(0); // statement 1
    printf("started for %d workitem\n", idx); // statement 2
    in[idx] = idx + 100; // statement 3
    printf("value changed to %lf in %d workitem\n", in[idx], idx); // statement 4
    barrier(CLK_GLOBAL_MEM_FENCE); // statement 5
    printf("completed for %d workitem\n", idx); // statement 6
}
I am calling the kernel with clEnqueueNDRangeKernel, passing as an argument an array of five doubles, each initialized to 0.0.
I am launching the kernel with a global_work_size of 5, so each work-item processes one element of the array.
But as per my theoretical understanding of barriers: to synchronize work-items in a work-group, OpenCL provides a similar capability
with the barrier function. This forces a work-item to wait until every other work-item
in the group reaches the barrier. By creating a barrier, you can make sure that every work-item has reached the same
point in its processing. This is a crucial concern when the work-items need to finish
computing an intermediate result that will be used in future computation.
Hence, I was expecting an output like:
started for 0 workitem
started for 1 workitem
value changed to 100.000000 in 0 workitem
value changed to 101.000000 in 1 workitem
started for 3 workitem
value changed to 103.000000 in 3 workitem
started for 2 workitem
value changed to 102.000000 in 2 workitem
started for 4 workitem
value changed to 104.000000 in 4 workitem
completed for 3 workitem
completed for 0 workitem
completed for 1 workitem
completed for 2 workitem
completed for 4 workitem
The "completed" statements should appear together at the end, because the barrier should hold back every work-item until all of them reach that point.
But the result I am getting is:
started for 0 workitem
value changed to 100.000000 in 0 workitem
completed for 0 workitem
started for 4 workitem
value changed to 104.000000 in 4 workitem
completed for 4 workitem
started for 1 workitem
started for 2 workitem
started for 3 workitem
value changed to 101.000000 in 1 workitem
value changed to 103.000000 in 3 workitem
completed for 3 workitem
value changed to 102.000000 in 2 workitem
completed for 2 workitem
completed for 1 workitem
Am I missing something in my logic? If so, how does a barrier work in an OpenCL kernel?
I added more checks to the kernel to cross-check the updated values after the barrier, instead of relying on print statements.
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#pragma OPENCL EXTENSION cl_amd_printf : enable
__kernel void calculate (__global double* in)
{
    int idx = get_global_id(0);
    in[idx] = idx + 100;
    barrier(CLK_GLOBAL_MEM_FENCE);
    if (idx == 0) {
        in[0] = in[4];
        in[1] = in[3];
        in[2] = in[2];
        in[3] = in[1];
        in[4] = in[0];
    }
}
The array should then be:
after arr[0] = 104.000000
after arr[1] = 103.000000
after arr[2] = 102.000000
after arr[3] = 101.000000
after arr[4] = 100.000000
But the results I am getting are:
after arr[0] = 0.000000
after arr[1] = 101.000000
after arr[2] = 102.000000
after arr[3] = 103.000000
after arr[4] = 104.000000

The code looks fine; what I doubt is the size of the local work-group. If you have not specified a local work-group size, the OpenCL runtime chooses one for you based on some checks (and generally it is ONE).
Compare your clEnqueueNDRangeKernel call with the one below:
size_t global_item_size = 5; //Specifies no. of total work items
size_t local_item_size = 5; // Specifies no. of work items per local group
clEnqueueNDRangeKernel( command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL );
NOTE: This answer assumes that you have either not specified the local work-group size, or that it is not set properly for your requirement.
A little more on work-groups:
A barrier blocks only the work-items within its own work-group. Since you have not specified a work-group size, each group's size is taken as one, so you end up with 5 work-groups, each containing a single work-item.

Yes, you are missing the fact that adding a printf() invalidates any ordering of the results.
In fact, OpenCL states that the use of printf() is implementation-defined, and that when printf is executed from multiple work-items concurrently, there is no guarantee of ordering with respect to the written data. Simple logic tells you the queue will be flushed in order per work-item, since that is the easiest way to serialize a flush after a parallel execution has filled many buffers (one per work-item printf).
The work-items are executing in the order you expect, but the flush to stdout occurs after the kernel has already finished, and it does not follow the original order.

Related

Why does declaring with a class take more memory than a struct?

I have a question about the memory model for .NET Core. First, a simple test:
I created a simple class "MyType" with three fields:
int a;
int b;
double c;
I created a List<MyType>, reserved 1 million elements, and filled it:
List<MyType> someList = new List<MyType>(1000000);
for (int i = 0; i < 1000000; i++)
{
someList.Add(new MyType());
}
Now, if I use a class for the "MyType" declaration, the program consumes 42 MB of memory. When using a struct, it consumes 25 MB.
Knowing that class instances live on the heap, and that the build is x86 Release, each object stored on the heap should cost a 4-byte reference held in the list. This means that 1 million objects should create 4 MB of overhead, and thus I would expect the classes to consume 29 MB, not 42 MB.
What's causing the difference in memory consumption here?
You aren't accounting for the object header size, which is 8 bytes per object on x86; so that's another 8 MB. 29 + 8 = 37 MB, which is closer; then add some padding in the allocation areas.
The object header is the metadata that sits before every heap-allocated object instance to say what the type is, etc.
A struct (when not boxed, etc) does not have an object header.

Using MPI_Scatter with 3 processes

I am new to MPI, and my question is: how does the root (for example, rank 0) initialize all its values (in the array) before the other processes receive their i-th value from the root?
for example:
in the root I initialize: arr[0]=20, arr[1]=90, arr[2]=80.
My question is: if, for example, process number 2 starts a little bit before the root process, can MPI_Scatter send an incorrect value instead of 80?
How can I ensure the root initializes all of its memory before the others call Scatter?
Thank you!
The MPI standard specifies that:
If comm is an intracommunicator, the outcome is as if the root executed n
send operations, MPI_Send(sendbuf + i*sendcount*extent(sendtype), sendcount, sendtype, i, ...), and each process executed a receive, MPI_Recv(recvbuf, recvcount, recvtype, i, ...).
This means that every non-root process will block until its recvcount elements have been transmitted. In other words, the call is blocking: each process waits until its part of the communication is complete.
You, as the programmer, are responsible for ensuring that the data being sent is correct by the time you call any communication routine, and that the send buffer is not modified until it becomes available again (in this case, until MPI_Scatter returns). In an MPI-only program this is as simple as placing the initialization code before the call to MPI_Scatter, since each process executes the program sequentially.
The following example is based on the standard's Example 5.11:
MPI_Comm comm = MPI_COMM_WORLD;
int grank, gsize;
int *sendbuf = NULL; // only allocated at the root; ignored elsewhere
int *rbuf;
int root;
MPI_Comm_rank(comm, &grank);
MPI_Comm_size(comm, &gsize);
root = 0;
if (grank == root) {
    sendbuf = (int *)malloc(gsize * 100 * sizeof(int));
    // Fill sendbuf; its contents must be valid before MPI_Scatter is called.
    for (int i = 0; i < gsize * 100; i++)
        sendbuf[i] = i;
}
rbuf = (int *)malloc(100 * sizeof(int));
// Distribute the sendbuf data.
// At the root process, all sendbuf values are valid here.
// In non-root processes, the sendbuf argument is ignored.
MPI_Scatter(sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);
MPI_Scatter() is a collective operation, so the MPI library takes care of everything, and the outcome of a collective operation does not depend on which rank called it earlier than another.
In this specific case, a non-root rank will block (at least) until the root rank calls MPI_Scatter().
This is no different from an MPI_Send()/MPI_Recv() pair:
MPI_Recv() blocks if it is called before the remote peer has sent a matching message with MPI_Send().

How to make Arduino toggle a variable after a completion of UART transmission?

I would like to have a bit of code executed (like toggling a flag variable),
after a completion of UART transmission issued by Serial.write(buf, len).
I tried several things, with no success. Could someone suggest the best way to do this?
Thanks.
According to the code on the Arduino repository (you can see it here, particularly the HardwareSerial, Stream and Print classes) the Serial class functions are not blocking: they use an internal buffer which is emptied by the ISR.
Looking at the code, you have two possibilities.
The first and easiest one is to use the built-in Serial.flush function. This function waits for the uC to complete the sending and returns only when everything has been sent.
The second way is to mimic the flush behavior; this is what you need if you don't want the program to stop (i.e. it can perform other tasks) and just want to check whether the board is still sending data. If bit UDRIE0 (register UCSR0B) is set, the library still has data in its queue. At the end, however, you have to wait for the final byte to go out, so you also have to check that bit TXC0 (register UCSR0A) is clear.
To implement and test this, I wrote this simple program:
unsigned long thisMillis = -1;
unsigned long lastMillis = -1;
char mybuffer[256];
void setup() {
Serial.begin(9600);
}
bool isSending()
{
return bit_is_set(UCSR0B, UDRIE0) || bit_is_clear(UCSR0A, TXC0);
}
void loop() {
int printed = snprintf(mybuffer, 256, "time: %lu\r\n", millis());
Serial.write(mybuffer, printed);
// Just enable ONE of the following methods
//Serial.flush();
//while(isSending());
lastMillis = millis();
}
Then I simulated it.
When neither the flush nor the while is enabled, the board buffers the string as long as it can. When the buffer becomes full, it waits for some chars to drain before filling it again. Consequently we see a lot of strings in the first milliseconds, then almost equally spaced strings (once the buffer is full). And in fact the output of the program is this one:
time: 0
time: 0
time: 1
time: 1
time: 1
time: 1
time: 2
time: 2
time: 7
time: 17
time: 27
When, however, we uncomment the Serial.flush call, the loop waits for all the data to leave the uC before looping again. This leads to equally spaced timestamps right from the beginning, and in fact this is the output:
time: 0
time: 9
time: 19
time: 29
time: 39
time: 51
When enabling the while loop instead, the behavior is exactly the same:
time: 0
time: 9
time: 19
time: 29
time: 39
time: 51
So, summing up: if you want to wait for the board to finish sending, just go with Serial.flush(); if you want to merely test for it (and then do something else), test the two bits reported above.
Try this:
void setup() {
Serial.begin(9600);
_SFR_BYTE(UCSR0B) |= _BV(TXCIE0); //Enable TXCIE0 interrupts
Serial.print("When UART finishes to send this text, it will call the routine ISR(USART0_TX_vect) ");
}
void loop() {
}
ISR(USART0_TX_vect)
{
//do whatever you want, but as quick as you can.
}

OpenCL reduction from private to local then global?

The following kernel computes an acoustic pressure field, with each thread computing its own private instance of the pressure vector, which then needs to be summed down into global memory.
I'm pretty sure the code which computes the pressure vector is correct, but I'm still having trouble making this produce the expected result.
int gid = get_global_id(0);
int lid = get_local_id(0);
int nGroups = get_num_groups(0);
int groupSize = get_local_size(0);
int groupID = get_group_id(0);
/* Each workitem gets private storage for the pressure field.
* The private instances are then summed into local storage at the end.*/
private float2 pressure[HYD_DIM_TOTAL];
local float2 pressure_local[HYD_DIM_TOTAL];
/* Code which computes value of 'pressure' */
//wait for all workgroups to finish accessing any memory
barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
/// sum all results in a workgroup into local buffer:
for(i=0; i<groupSize; i++){
//each thread sums its own private instance into the local buffer
if (i == lid){
for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
pressure_local[iHyd] += pressure[iHyd];
}
}
//make sure all threads in workgroup get updated values of the local buffer
barrier(CLK_LOCAL_MEM_FENCE);
}
/// copy all the results into global storage
//1st thread in each workgroup writes the group's local buffer to global memory
if(lid == 0){
for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
pressure_global[groupID +nGroups*iHyd] = pressure_local[iHyd];
}
}
barrier(CLK_GLOBAL_MEM_FENCE);
/// sum the various instances in global memory into a single one
// 1st thread sums global instances
if(gid == 0){
for(iGroup=1; iGroup<nGroups; iGroup++){
//we only need to sum the results from the 1st group onward
for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
pressure_global[iHyd] += pressure_global[iGroup*HYD_DIM_TOTAL +iHyd];
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
}
Some notes on data dimensions:
The total number of threads will vary between 100 and 2000, but may occasionally lie outside this interval.
groupSize will depend on the hardware, but I'm currently using values between 1 (CPU) and 32 (GPU).
HYD_DIM_TOTAL is known at compile time and varies between 4 and 32 (it will generally, but not necessarily, be a power of 2).
Is there anything blatantly wrong with this reduction code?
PS: I run this on an i7 3930k with AMD APP SDK 2.8 and on an NVIDIA GTX580.
I notice two issues here, one big, one smaller:
This code suggests that you have a misunderstanding of what a barrier does. A barrier never synchronizes across multiple workgroups. It only synchronizes within a workgroup. The CLK_GLOBAL_MEM_FENCE makes it look like it is global synchronization, but it really isn't. That flag just fences all of the current work item's accesses to global memory. So outstanding writes will be globally observable after a barrier with this flag. But it does not change the barrier's synchronization behavior, which is only at the scope of a workgroup. There is no global synchronization in OpenCL, beyond launching another NDRange or Task.
The accumulation into pressure_local uses +=, but pressure_local is never initialized, so each work-group is summing onto undefined initial values. In addition, the barrier inside the if (gid == 0) block is itself undefined behavior: a barrier must be reached by every work-item in a work-group, so it must not sit inside divergent control flow. This will produce undefined results.
Hope this helps.
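As a hedged sketch (reusing the question's variable names), the local stage could first zero pressure_local cooperatively, keep the serialized per-group accumulation, and leave the cross-group sum to a second kernel launch or the host, since no barrier can span workgroups:

```c
/* Zero the local buffer cooperatively before accumulating into it. */
for (iHyd = lid; iHyd < HYD_DIM_TOTAL; iHyd += groupSize)
    pressure_local[iHyd] = (float2)(0.0f, 0.0f);
barrier(CLK_LOCAL_MEM_FENCE);

/* Serialized per-workgroup accumulation, as in the question. */
for (i = 0; i < groupSize; i++) {
    if (i == lid)
        for (iHyd = 0; iHyd < HYD_DIM_TOTAL; iHyd++)
            pressure_local[iHyd] += pressure[iHyd];
    barrier(CLK_LOCAL_MEM_FENCE);
}

/* First work-item of each group writes the group's partial result.
 * Summing the nGroups partial results belongs in a SECOND kernel
 * launch (or on the host), because barriers cannot synchronize
 * across workgroups. */
if (lid == 0)
    for (iHyd = 0; iHyd < HYD_DIM_TOTAL; iHyd++)
        pressure_global[groupID + nGroups*iHyd] = pressure_local[iHyd];
```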

MSP430 not able to handle double

I am trying to program a MSP430 with a simple "FIR filter" program, that looks like the following:
#include "msp430x22x4.h"
#include "legacymsp430.h"
#define FILTER_LENGTH 4
#define TimerA_counter_value 12000 // 12000 counts/s -> 12000 counts ~ 1 Hz
int i;
double x[FILTER_LENGTH+1] = {0,0,0,0,0};
double y = 0;
double b[FILTER_LENGTH+1] = {0.0338, 0.2401, 0.4521, 0.2401, 0.0338};
signed char floor_and_convert(double y);
void setup(void)
{
WDTCTL = WDTPW + WDTHOLD; // Stop WDT
BCSCTL1 = CALBC1_8MHZ; // Set DCO
DCOCTL = CALDCO_8MHZ;
/* Setup Port 3 */
P3SEL |= BIT4 + BIT5; // P3.4,5 = USART0 TXD/RXD
P3DIR |= BIT4; // P3.4 output direction
/* UART */
UCA0CTL1 = UCSSEL_2; // SMCLK
UCA0BR0 = 0x41; // 9600 baud from 8Mhz
UCA0BR1 = 0x3;
UCA0MCTL = UCBRS_2;
UCA0CTL1 &= ~UCSWRST; // **Initialize USCI state machine**
IE2 |= UCA0RXIE; // Enable USCI_A0 RX interrupt
/* Setup TimerA */
BCSCTL3 |= LFXT1S_2; // LFXT1S_2: Mode 2 for LFXT1 = VLO
// VLO provides a typical frequency of 12kHz
TACCTL0 = CCIE; // TACCR0 Capture/compare interrupt enable
TACCR0 = TimerA_counter_value; // Timer A Capture/Compare 0: -> 25 Hz
TACTL = TASSEL_1; // TASSEL_1: Timer A clock source select: 1 - ACLK
TACTL |= MC_1; // Start Timer_A in up mode
__enable_interrupt();
}
void main(void) // Beginning of program
{
setup(); // Call Function setup (see above)
_BIS_SR(LPM3_bits); // Enter LPM3
}
/* USCIA interrupt service routine */
/*#pragma vector=USCIAB0RX_VECTOR;*/
/*__interrupt void USCI0RX_ISR(void)*/
interrupt (USCIAB0RX_VECTOR) USCI0RX_ISR(void)
{
TACTL |= MC_1; // Start Timer_A in up mode
x[0] = (double)((signed char)UCA0RXBUF); // Read received sample and perform type casts
y = 0;
for(i = 0;i <= FILTER_LENGTH;i++) // Run FIR filter for each received sample
{
y += b[i]*x[i];
}
for(i = FILTER_LENGTH-1;i >= 0;i--) // Roll x array in order to hold old sample inputs
{
x[i+1] = x[i];
}
while (!(IFG2&UCA0TXIFG)); // Wait until USART0 TX buffer is ready?
UCA0TXBUF = (signed char) y;
TACTL |= TACLR; // Clear TimerA (prevent interrupt during receive)
}
/* Timer A interrupt service routine */
/*#pragma vector=TIMERA0_VECTOR;*/
/*__interrupt void TimerA_ISR (void)*/
interrupt (TIMERA0_VECTOR) TimerA_ISR(void)
{
for(i = 0;i <= FILTER_LENGTH;i++) // Clear x array if no data has arrived after 1 sec
{
x[i] = 0;
}
TACTL &= ~MC_1; // Stops TimerA
}
The program interacts with MATLAB code that sends 200 doubles to the MSP for processing in the FIR filter. My problem is that the MSP is not able to deal with the doubles.
I am using MSPGCC to compile the code. When I send an int to the MSP, it responds by sending an int back.
Your problem looks like it is in the way the data is being sent to the MSP.
The communication from MATLAB is, according to your code, a sequence of single binary byte values that you then take from the serial port and cast straight to a double. Each incoming value will have a range of -128 to +127.
If your source data is any other size, your program will be broken. If your data source is providing binary "double" data, then each value may be 4 or 8 bytes long, depending on its internal representation. Sending one of those values over the serial port would be interpreted by the MSP as a full set of 4 input samples, resulting in absolute garbage for a set of answers.
The really big question is: WHY ON EARTH ARE YOU DOING THIS IN FLOATING POINT, on a 16-bit integer processor, many versions of which have integer multiplier hardware?
As Ian said, you're taking an 8-bit value (UCA0RXBUF is only 8 bits wide anyway) and expecting to get a 32-bit or 64-bit value out of it.
In order to get a proper sample you would need to read UCA0RXBUF multiple times and concatenate the 8-bit values into 32/64 bits, which you would then cast to a double.
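As a sketch of that reassembly (assemble_float is my own name, and it assumes the PC really does send each value as 4 raw IEEE-754 bytes, least-significant byte first):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Hypothetical helper: rebuild a 4-byte IEEE-754 float from the bytes
 * received over the UART, least-significant byte first. The same idea
 * extends to an 8-byte double with a uint64_t accumulator. */
float assemble_float(const uint8_t bytes[4]) {
    uint32_t raw = 0;
    for (int i = 0; i < 4; i++)
        raw |= (uint32_t)bytes[i] << (8 * i);
    float f;
    memcpy(&f, &raw, sizeof f); /* type-pun safely via memcpy */
    return f;
}
```

You would call this once for every 4 received bytes, instead of casting each byte on its own.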
Like Ian, I would also question the wisdom of doing floating-point math on a low-power embedded microcontroller; this type of task is much better suited to a DSP.
At the very least you should use fixed-point math, see Wikipedia (even on a DSP you would use fixed-point arithmetic).
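A minimal fixed-point sketch of the same filter (the names and the Q15 scaling are my choices, not from the original code): the taps are pre-scaled by 2^15, the products are accumulated in 32 bits, and the result is shifted back down.

```c
#include <stdint.h>
#include <assert.h>

#define FILTER_LENGTH 4

/* The taps {0.0338, 0.2401, 0.4521, 0.2401, 0.0338} rounded to Q15,
 * i.e. multiplied by 32768: */
static const int16_t b_q15[FILTER_LENGTH + 1] = {1108, 7868, 14814, 7868, 1108};

/* One FIR step in integer math: x holds the last five int8 samples,
 * newest first. Each int16 * int16 product fits comfortably in the
 * int32 accumulator; shifting right by 15 undoes the Q15 scaling. */
int16_t fir_q15(const int8_t x[FILTER_LENGTH + 1]) {
    int32_t acc = 0;
    for (int i = 0; i <= FILTER_LENGTH; i++)
        acc += (int32_t)b_q15[i] * x[i];
    return (int16_t)(acc >> 15);
}
```

On the MSP430 this uses only the integer multiplier, with no floating-point library calls.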
Hmm. Actually the code was written by my teacher; I'm just trying to make it work on my Mac, and not in AIR :-)
The MATLAB code is like this:
function FilterTest(comport)
Fs = 100; % Sampling Frequency
Ts = 1/Fs; % Sampling Period
L = 200; % Number of samples
N = 4; % Filter order
Fcut = 5; % Cut-off frequency
B = fir1(N,Fcut/(Fs/2)) % Filter coefficients in length N+1 vector B
t = [0:L-1]*Ts; % time array
A_m = 80; % Amplitude of main component
F_m = 5; % Frequency of main component
P_m = 80; % Phase of main component
y_m = A_m*sin(2*pi*F_m*t - P_m*(pi/180));
A_s = 40; % Amplitude of secondary component
F_s = 40; % Frequency of secondary component
P_s = 20; % Phase of secondary component
y_s = A_s*sin(2*pi*F_s*t - P_s*(pi/180));
y = round(y_m + y_s); % sum of main and secondary components (rounded to integers)
y_filt = round(filter(B,1,y)); % filtered data (rounded to integers)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Serial_port_object = serial(comport); % create Serial port object
set(Serial_port_object,'InputBufferSize',L) % set InputBufferSize to length of data
set(Serial_port_object,'OutputBufferSize',L) % set OutputBufferSize to length of data
fopen(Serial_port_object) % open Com Port
fwrite(Serial_port_object,y,'int8'); % send out data
data = fread(Serial_port_object,L,'int8'); % read back data
fclose(Serial_port_object) % close Com Port
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
subplot(2,1,1)
hold off
plot(t,y)
hold on
plot(t,y_filt,'r')
plot(t,y_filt,'ro')
plot(t,data,'k.')
ylabel('Amplitude')
legend('y','y filt (PC)','y filt (PC)','y filt (muP)')
subplot(2,1,2)
hold off
plot(t,data'-y_filt)
hold on
xlabel('time')
ylabel('muP - PC')
figure(1)
It is also not advisable to do long processing inside interrupt routines, because it hurts interrupt latency. Bytes coming from the PC can easily be lost to buffer overruns on the serial port.
The best approach is to build a FIFO buffer holding a reasonable number of input values. The USCI routine fills the FIFO, while the main program keeps checking it for data and processes values as they become available.
This way, while the data is being processed, the USCI can still interrupt to handle new incoming bytes.
When the FIFO is empty, you can put the main process into a suitable LPM mode to conserve power (and this is the best MSP430 feature). The USCI routine will wake the CPU up when data is ready (just put the WAKEUP attribute on the USCI handler if you are using MSPGCC).
In such a scenario, be sure to declare volatile every variable shared between the interrupt routines and the main process.
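A minimal sketch of such a FIFO (all names are mine; the size and element type are assumptions, not from the original code). The RX ISR would call fifo_put(UCA0RXBUF) and the main loop would drain it with fifo_get:

```c
#include <stdint.h>
#include <assert.h>

#define FIFO_SIZE 32 /* power of two, so the index wrap is a cheap mask */

/* Single-producer (ISR) / single-consumer (main loop) byte FIFO.
 * 'volatile' is essential: each side must see the other's updates. */
static volatile uint8_t fifo_buf[FIFO_SIZE];
static volatile uint8_t fifo_head = 0; /* ISR writes here  */
static volatile uint8_t fifo_tail = 0; /* main reads here  */

/* Called from the USCI RX interrupt: drops the byte if the FIFO is full. */
int fifo_put(uint8_t byte) {
    uint8_t next = (fifo_head + 1) & (FIFO_SIZE - 1);
    if (next == fifo_tail)
        return 0; /* full: byte lost, count overruns here if you care */
    fifo_buf[fifo_head] = byte;
    fifo_head = next;
    return 1;
}

/* Called from the main loop: returns 1 and stores a byte if one is available. */
int fifo_get(uint8_t *byte) {
    if (fifo_tail == fifo_head)
        return 0; /* empty */
    *byte = fifo_buf[fifo_tail];
    fifo_tail = (fifo_tail + 1) & (FIFO_SIZE - 1);
    return 1;
}
```

With one producer and one consumer, head is written only by the ISR and tail only by the main loop, so no further locking is needed.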
