I want to implement a systolic array for matrix multiplication. My objective is to use a single kernel for every Processing Element, so I will launch the same kernel from the host side multiple times.
To communicate between kernels I would like to use channels or pipes. The problem is that the "channels extension does not support dynamic indexing into arrays of channel IDs". The number of kernels depends on the size of the matrix, so I need some method to connect the channels to the corresponding kernels automatically.
Summarizing, I am looking for a method to create this functionality:
channel float c0[32];

__kernel void producer(__global float *data_in) {
    for (int i = 0; i < 32; i++) {
        write_channel_altera(c0[i], data_in[i]);
    }
}

__kernel void consumer(__global float *ret_buf) {
    for (int i = 0; i < 32; i++) {
        ret_buf[i] = read_channel_altera(c0[i]);
    }
}
Thanks in advance!
OpenCL channels (an Intel FPGA extension) do not support "true" dynamic indexing, but in most cases you can work around this limitation with a switch statement or with #pragma unroll.
The switch approach is described in the Intel FPGA SDK for OpenCL Programming Guide:
channel int ch[WORKGROUP_SIZE];

__kernel void consumer() {
    int gid = get_global_id(0);
    int value;
    switch (gid) {
        case 0: value = read_channel_intel(ch[0]); break;
        case 1: value = read_channel_intel(ch[1]); break;
        case 2: value = read_channel_intel(ch[2]); break;
        case 3: value = read_channel_intel(ch[3]); break;
        // ... one case per channel ...
        case WORKGROUP_SIZE-1: value = read_channel_intel(ch[WORKGROUP_SIZE-1]); break;
    }
}
You can also use #pragma unroll if you have a loop over channels:
__kernel void consumer() {
    int values[WORKGROUP_SIZE];
    #pragma unroll
    for (int i = 0; i < WORKGROUP_SIZE; ++i) {
        values[i] = read_channel_intel(ch[i]);
    }
}
As far as I know, we need to know the maximum number of channels we will require well before compiling the program for the board, since we cannot program an FPGA the way we do other computing systems and allocate resources on the go. Once we know that maximum (at least), we can use
#pragma unroll
before the loop that reads/writes the channels.
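For example, the producer/consumer pair from the question could look like the following once the loops are fully unrolled. This is a minimal sketch, assuming the maximum number of Processing Elements is fixed at compile time via a macro (NUM_PE is my name for it) and using the newer read/write_channel_intel names; after full unrolling, every channel index is a compile-time constant:

#define NUM_PE 32  // maximum number of PEs, fixed before compilation (assumption)

channel float c0[NUM_PE];

__kernel void producer(__global float *data_in) {
    #pragma unroll
    for (int i = 0; i < NUM_PE; i++) {
        write_channel_intel(c0[i], data_in[i]);  // i is a constant after unrolling
    }
}

__kernel void consumer(__global float *ret_buf) {
    #pragma unroll
    for (int i = 0; i < NUM_PE; i++) {
        ret_buf[i] = read_channel_intel(c0[i]);
    }
}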
This is a simple program designed to plot both Ur and Uc on the serial monitor at the same time. The Arduino runs through the first for loop and plots the F1 function, and after that does the same with F2. My objective is to plot them both at the same time.
My idea is to take a small fraction of time, say 10 ms, to plot F1, and the next 10 ms to plot F2, but I don't know how to write this down. I think the millis() function is the solution, but I'm not quite sure how to implement it.
const short int R = 5000;
const float C = 0.0005;
const float TE = 0.1;
const float Tau = R*C;
const short int E = 5;
float t, Tinit, Tfin;

void setup() {
  // put your setup code here, to run once:
  Serial.begin(9600);
}

void loop() {
  // F1
  for (t = 0; t <= 20; t = t + TE) {
    float Ur = E*exp(-t/Tau);
    Serial.println(Ur);
  }
  // F2
  for (t = 0; t <= 20; t = t + TE) {
    float Uc = E*(1 - exp(-t/Tau));
    Serial.println(Uc);
  }
}
The Thread library can be used to solve your problem. It has extensive documentation and is a widely used (unofficial) library for Arduino.
Give it a try.
It will be easy for you if you look at these:
Example - 1 (thread instance example)
Example - 2 (callback example)
Example - 3 (it is still buggy, but I think it will help)
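If it helps, here is a minimal sketch of how such a thread is typically wired up. This assumes the common ArduinoThread API (Thread.h with onRun/setInterval/shouldRun) and reuses E, Tau and TE from your program; verify the calls against the library version you install:

#include <Thread.h>

Thread plotThread = Thread();

void plotBoth() {
  // print both values for the current t, then advance time
  static float t = 0;
  Serial.println(E * exp(-t / Tau));        // Ur
  Serial.println(E * (1 - exp(-t / Tau)));  // Uc
  t += TE;
}

void setup() {
  Serial.begin(9600);
  plotThread.onRun(plotBoth);  // callback to execute
  plotThread.setInterval(10);  // run every 10 ms
}

void loop() {
  if (plotThread.shouldRun())
    plotThread.run();
}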
If you want to do it without libraries, then you need to create two functions without those loops, like:
void f1() {
  float Ur = E*exp(-t/Tau);
  Serial.println(Ur);
}

void f2() {
  float Uc = E*(1 - exp(-t/Tau));
  Serial.println(Uc);
}
Now inside void loop() you can implement the basic logic of threading, which will be pretty rough but fulfills your requirements. Note that t still has to be advanced somewhere (for example t += TE after each update), otherwise Ur and Uc never change. Like:
void loop() {
  unsigned long now = millis();

  static unsigned long last_finger_update;
  if (now - last_finger_update >= FINGER_UPDATE_PERIOD) {
    last_finger_update = now;
    f1();
  }

  static unsigned long last_wrist_update;
  if (now - last_wrist_update >= WRIST_UPDATE_PERIOD) {
    last_wrist_update = now;
    f2();
  }
}
You have to declare the two period constants:
const unsigned long FINGER_UPDATE_PERIOD = 1000;
const unsigned long WRIST_UPDATE_PERIOD = 1000;
All time units are in milliseconds. This is a common strategy found all over the internet.
The most deterministic way of handling this is simply:
for (t = 0; t <= 20; t = t + TE) {
  float Ur = E*exp(-t/Tau);
  float Uc = E*(1 - exp(-t/Tau));
  Serial.println(Ur);
  Serial.println(Uc);
}
More generally, you can implement a primitive resource scheduler:
while (true) {
  task_one();
  task_two();
}
You can do that easily if you run an RTOS on the MCU; as you know, the other solutions will still be sequential...
I've been using the TridentTD_EasyFreeRTOS library; it provides an easy way of having multiple tasks and controlling them in different sketch files.
I have two pointers in memory and I want to swap them atomically, but atomic operations in CUDA support only integer types. Is there a way to do the following swap?
classA* a1 = malloc(...);
classA* a2 = malloc(...);
atomicSwap(a1,a2);
When writing device-side code...
While CUDA provides atomics, they can't cover multiple (possibly remote) memory locations at once.
To perform this swap, you will need to "protect" access to both these values with something like a mutex, and have whoever wants to write values to them take hold of the mutex for the duration of the critical section (as with C++'s host-side std::lock_guard). This can be done using CUDA's actual atomic facilities, e.g. compare-and-swap, and is the subject of this question:
Implementing a critical section in CUDA
A caveat to the above is mentioned by @RobertCrovella: if you can make do with, say, a pair of 32-bit offsets rather than two 64-bit pointers, then by storing them in a single 64-bit aligned struct you can use compare-and-exchange on the whole struct to implement an atomic swap.
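A minimal device-side sketch of that idea (my own illustration, with made-up names, not @RobertCrovella's exact code): the two 32-bit offsets live in one 64-bit word, so a single 64-bit atomicCAS can swap them in one shot:

#include <cstring>

struct Pair { unsigned int a, b; };   // two 32-bit offsets

__device__ unsigned long long packed; // holds a Pair, 8-byte aligned

__device__ void atomicSwapHalves()
{
    unsigned long long assumed, old = packed;
    do {
        assumed = old;
        Pair p;
        memcpy(&p, &assumed, sizeof p);                // unpack
        unsigned int tmp = p.a; p.a = p.b; p.b = tmp;  // swap the offsets
        unsigned long long desired;
        memcpy(&desired, &p, sizeof p);                // repack
        old = atomicCAS(&packed, assumed, desired);    // 64-bit compare-and-swap
    } while (old != assumed);                          // retry if another thread raced us
}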
... but is it really device-side code?
Your code actually doesn't look like something one would run on the device: memory allocation is usually (though not always) done from the host side before you launch your kernel and do the actual work. If you can make sure these alterations happen only on the host side (think CUDA events and callbacks), and that device-side code will not be interfered with by them, then you can just use your plain vanilla C++ facilities for concurrent programming (like the lock_guard I mentioned above).
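For completeness, a host-side sketch under that assumption (classA and the globals are illustrative):

#include <mutex>
#include <utility>

struct classA { /* ... */ };

std::mutex ptr_mutex;   // guards a1 and a2
classA *a1 = nullptr;
classA *a2 = nullptr;

void swap_pointers()
{
    // the lock_guard holds the mutex for the duration of the critical section
    std::lock_guard<std::mutex> guard(ptr_mutex);
    std::swap(a1, a2);
}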
I managed to get the behaviour I needed. It is not an atomic swap, but it is still safe in my context, which was a monotonic linked list working on both CPU and GPU:
template<typename T>
union readablePointer
{
    T* ptr;
    unsigned long long int address;
};

template<typename T>
struct LinkedList
{
    struct Node
    {
        T value;
        readablePointer<Node> previous;
    };

    Node start;
    Node end;
    int size;

    __host__ __device__ void initialize()
    {
        size = 0;
        start.previous.ptr = nullptr;
        end.previous.ptr = &start;
    }

    __host__ __device__ void push_back(T value)
    {
        Node* node = (Node*)malloc(sizeof(Node));
        readablePointer<Node> nodePtr;
        nodePtr.ptr = node;
        nodePtr.ptr->value = value;
#ifdef __CUDA_ARCH__
        // device path: publish the new tail with a 64-bit atomic exchange
        nodePtr.ptr->previous.address = atomicExch(&end.previous.address, nodePtr.address);
        atomicAdd(&size, 1);
#else
        // host path: plain (non-atomic) updates
        nodePtr.ptr->previous.address = end.previous.address;
        end.previous.address = nodePtr.address;
        size += 1;
#endif
    }

    __host__ __device__ T pop_back()
    {
        assert(end.previous.ptr != &start);
        readablePointer<Node> lastNodePtr;
        lastNodePtr.ptr = nullptr;
#ifdef __CUDA_ARCH__
        lastNodePtr.address = atomicExch(&end.previous.address, end.previous.ptr->previous.address);
        atomicSub(&size, 1);
#else
        lastNodePtr.address = end.previous.address;
        end.previous.address = end.previous.ptr->previous.address;
        size -= 1;
#endif
        T toReturn = lastNodePtr.ptr->value;
        free(lastNodePtr.ptr);
        return toReturn;
    }

    __host__ __device__ void clear()
    {
        while (size > 0)
        {
            pop_back();
        }
    }
};
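A hypothetical usage sketch (my names, not from the original code); note that device-side malloc requires a sufficiently large device heap, configured with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...):

__global__ void fill(LinkedList<int> *list)
{
    // each thread appends its index; the resulting order is unspecified
    list->push_back((int)threadIdx.x);
}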
I'm doing the following in an OpenCL kernel (simplified example):
__kernel void step(const uint count, __global int *map, __global float *sum)
{
    const uint i = get_global_id(0);
    if (i < count) {
        sum[map[i]] += 12.34;
    }
}
Here, sum is some quantity I want to calculate (previously set to zero in another kernel) and map is a mapping from integers i to integers j, such that multiple i's can map to the same j.
(map could be in constant memory rather than global, but it seems the amount of constant memory on my GPU is incredibly limited)
Will this work? Is a "+=" implemented in an atomic way, or is there a chance of concurrent operations overwriting each other?
It will not work. When work-items access memory written to by other work-items, you need to explicitly resort to atomic operations; in this case, atomic_add.
Something like the following. Note that the built-in atomic_add has no floating-point overload in OpenCL, so the accumulator here is an integer kept in fixed point (12.34 scaled by 100 becomes 1234):

__kernel void step(const uint count, __global int *map, __global int *sum)
{
    const uint i = get_global_id(0);
    if (i < count) {
        atomic_add(&sum[map[i]], 1234); // += 12.34 in units of 0.01
    }
}
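If you really need a floating-point sum, a common workaround (my addition, not part of the answer above) is to emulate the atomic add with atomic_cmpxchg in a retry loop:

// Emulated float atomic add: reinterpret the float's bits as a uint and
// compare-exchange until no other work-item has raced with us.
inline void atomic_add_float(volatile __global float *addr, float val)
{
    union { uint u; float f; } old_val, new_val;
    do {
        old_val.f = *addr;
        new_val.f = old_val.f + val;
    } while (atomic_cmpxchg((volatile __global uint *)addr,
                            old_val.u, new_val.u) != old_val.u);
}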
I am trying to implement atomic functions in my OpenCL kernel. Multiple work-items are trying to write to a single memory location in parallel, and I want them to execute that particular line of code serially. I have never used an atomic function before.
I found similar problems on many blogs and forums, and I am trying one solution, i.e. the use of two different functions, 'acquire' and 'release', for locking and unlocking the semaphore. I have included the necessary OpenCL extensions, which are all surely supported by my device (NVIDIA GeForce GTX 630M).
My kernel execution configuration:
global_item_size = 8;
ret = clEnqueueNDRangeKernel(command_queue2, kernel2, 1, NULL, &global_item_size2, &local_item_size2, 0, NULL, NULL);
Here is my code: reducer.cl
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable

typedef struct data
{
    double dattr[10];
    int d_id;
    int bestCent;
} Data;

typedef struct cent
{
    double cattr[5];
    int c_id;
} Cent;

void acquire(__global int* mutex)
{
    int occupied;
    do {
        occupied = atom_xchg(mutex, 1);
    } while (occupied > 0);
}

void release(__global int* mutex)
{
    atom_xchg(mutex, 0); // the previous value, which is returned, is ignored
}

__kernel void reducer(__global int *keyMobj, __global int *valueMobj,
                      __global Data *dataMobj, __global Cent *centMobj,
                      __global int *countMobj, __global double *sumMobj,
                      __global int *mutex)
{
    __local double sum[2][2];
    __local int cnt[2];
    int i = get_global_id(0);
    int n, j;

    if (i < 2)
        cnt[i] = countMobj[i];
    barrier(CLK_GLOBAL_MEM_FENCE);

    n = keyMobj[i];
    for (j = 0; j < 2; j++)
    {
        barrier(CLK_GLOBAL_MEM_FENCE);
        acquire(mutex);
        sum[n][j] += dataMobj[i].dattr[j];
        release(mutex);
    }

    if (i < 2)
    {
        for (j = 0; j < 2; j++)
        {
            sum[i][j] = sum[i][j] / countMobj[i];
            centMobj[i].cattr[j] = sum[i][j];
        }
    }
}
Unfortunately the solution does not seem to be working for me. When I read centMobj back into host memory, using
ret = clEnqueueReadBuffer(command_queue2, centMobj, CL_TRUE, 0, (sizeof(Cent) * 2), centNode, 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue2, sumMobj, CL_TRUE, 0, (sizeof(double) * 2 * 2), sum, 0, NULL, NULL);
it gives me error code -5 (CL_OUT_OF_RESOURCES) for both centMobj and sumMobj.
I cannot tell whether the problem is in my atomic function code or in reading the data back into host memory. If I am using the atomic functions incorrectly, please correct me.
Thank you in advance.
In OpenCL, synchronization between work items can be done only inside a work-group. Code trying to synchronize work-items across different work-groups may work in some very specific (and implementation/device dependent) cases, but will fail in the general case.
The solution is to either use atomics to serialize accesses to the same memory location (but without blocking any work item), or redesign the code differently.
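For instance, here is a sketch of the redesign option (my own illustration, not code from the question): reduce within each work-group using local memory and barriers, and let a single work-item publish the result, so no mutex is needed:

// Classic work-group tree reduction; assumes the work-group size is a power of two.
__kernel void reduce(__global const float *in, __global float *out,
                     __local float *scratch)
{
    const uint lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];  // one partial sum per work-group
}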
About atomic access of __local variables:
I know global operations are slow compared with local ones. With that in mind, I'd like to perform atomic accesses on some local variables.
I know I can do atomic operations in OpenCL:
// Program A:
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable

__kernel void test(global int * num)
{
    atom_inc(&num[0]);
}
How do I share atomic data between work-items within a given work-group?
For example, I'd like to do something like this:
// Program B: (it doesn't work, just to show how I'd like it to be)
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable

__kernel void test(global int * num, const int numOperations)
{
    __local int num;
    if (get_global_id(0) < numOperations) {
        atom_inc(&num);
    }
}
In the end the num value should be numOperations - 1.
Isn't this possible? If not, how could I do it?
Typically, you have one thread which initializes the shared (local) atomic, followed by a barrier. I.e. your kernel starts like this:
__local int sharedNum;
if (get_local_id(0) == 0) {
    sharedNum = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);

// Now you can use sharedNum
while (is_work_left()) {
    atomic_inc(&sharedNum);
}
There's not much magic to it: all items in a work-group can see the same local variables, so you can just access them as usual.
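Putting it together for the example above: a minimal, self-contained sketch (kernel and argument names are mine) that counts the work-items in a group with a local atomic and has one work-item publish the result:

__kernel void count_items(__global int *num)
{
    __local int sharedNum;
    if (get_local_id(0) == 0)
        sharedNum = 0;                    // one work-item initializes
    barrier(CLK_LOCAL_MEM_FENCE);

    atomic_inc(&sharedNum);               // every work-item increments once

    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == 0)
        num[get_group_id(0)] = sharedNum; // equals the work-group size
}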