Port SYCL/DPC++ code originally written for GPUs to FPGAs (Intel)

I'm kind of new to the world of FPGAs, and I'm trying to port some code written for GPUs to FPGAs to compare the performance.
From my understanding, using parallel_for isn't good practice on FPGAs (in fact it runs very slowly); instead, I think, I should use a single_task with an unrolled for loop. I'm struggling to make it work properly, though.
So, I have
q.submit([&](sycl::handler &h) {
    h.parallel_for<class Foo>(sycl::nd_range<1>(n_blocks * n_threads, n_threads),
        [=](auto& it) {
            some_kernel(it, <other params here ...> );
        });
}).wait();
and my attempt is
q.submit([&](sycl::handler &h) {
    h.single_task<class Foo>([=]() {
        #pragma unroll
        for (int i = 0; i < n_blocks * n_threads; ++i)
            some_kernel(...);
    });
}).wait();
But I'm not sure how to adapt what I was previously doing with a sycl::item (for instance, how do I use the loop index to replace the calls to the get_group and get_local_id methods?).
Should I entirely change the design of the kernel? In other words, is the "work_groups - work_group_size" approach not appropriate for FPGAs?
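For what it's worth, here is a minimal sketch of how the nd_range indices could be recovered from loop counters inside a single_task. It is untested and makes two assumptions: the kernel only used get_group(0), get_local_id(0) and get_global_id(0), and some_kernel can be rewritten to take plain indices (some_kernel_flat below is a hypothetical variant of it):
q.submit([&](sycl::handler &h) {
    h.single_task<class FooFlat>([=]() {
        for (int g = 0; g < n_blocks; ++g) {       // replaces it.get_group(0)
            #pragma unroll
            for (int t = 0; t < n_threads; ++t) {  // replaces it.get_local_id(0)
                int gid = g * n_threads + t;       // replaces it.get_global_id(0)
                some_kernel_flat(g, t, gid /*, other params */);
            }
        }
    });
}).wait();
Note that #pragma unroll can only fully unroll the inner loop when n_threads is a compile-time constant, and that this one-to-one translation only works if some_kernel used no barriers or work-group local memory; otherwise the loop nest has to be redesigned rather than translated.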

Using OpenMP with GPU

Good day, everyone!
I would like to ask the advice of this respected community about using GPU computing power instead of, or together with, the CPU.
I have a well-functioning program based on a recursive search over all possible combinations of certain events, parallelized using OpenMP to run on all available processor cores.
The C++ pseudocode is as follows:
// #includes
// function declarations
// declaration of a global variable:
QVector<QVector<QVector<float>>> variant; // (or "std::vector")

int main() {
    // read data from a file
    // the data are converted and analyzed
    // the variant variable containing the current best result is filled in (here - by pre-analysis)
    #pragma omp parallel shared(variant)
    #pragma omp master
    // call the recursive search over all variants:
    PEREBOR(Tabl_1, a, i_a, ..., rec_depth);
    return 0;
}
void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t rec_depth)
{
    // determine the boundaries of the first loop for various reasons
    for (int i = quantity; i < another_quantity; i++) {
        // Tabl_1 is processed and modified to determine the number of steps of the inner loop
        for (int k = 0; k < the_quantity_just_found; k++) {
            if (rec_depth != 1) { // not at the lowest level yet: descend further
                // add the descent to the next recursion level to the task pool:
                #pragma omp task
                PEREBOR(Tabl_1_COPY, a, i_a, ..., rec_depth - 1);
            }
            else { // we went down to the lowest level
                if (condition fulfilled) // condition check - READS the variant variable
                    variant = it_is_equal_to_that_,_to_that...; // WRITES the variant variable
                else
                    continue;
            }
        }
    }
}
Unfortunately, I don't have a CPU with a thousand cores at my disposal, and without one the algorithm runs for a very long time. At my workplace I was advised to think about using a GPU to speed up the calculations. I learned that OpenMP can work with video cards (especially NVIDIA ones), and that OpenACC does this well too.
In this regard, my main question: is it possible to run a recursive algorithm on a GPU simply and, at the same time, effectively? Can this give a noticeable speedup relative to the CPU? If so, would OpenACC perhaps do it better? And can work be handed to the video card through "#pragma omp task", or are other directives REQUIRED? And how could computations on the CPU and GPU be combined?
Thank you so much for any help!
P.S. I apologize for my English, which is not my native language :)
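One note up front: "#pragma omp task" only distributes work across CPU threads; offloading to a GPU requires the "target" constructs, and those work on flat, data-parallel loops, not on recursive task trees. A common approach is therefore to enumerate (part of) the search tree into a flat list of variants on the CPU and offload their evaluation. Below is a minimal sketch under those assumptions, with an offload-capable compiler assumed and evaluate_variant and n_variants as hypothetical stand-ins for the per-variant scoring and the flattened search size:
#include <cstdio>

/* evaluate_variant is a hypothetical stand-in for scoring one enumerated
 * variant; "declare target" makes it callable from device code */
#pragma omp declare target
float evaluate_variant(long i) { return static_cast<float>(i % 97); }
#pragma omp end declare target

int main() {
    const long n_variants = 1000000; /* hypothetical size of the flattened search */
    float best = -1.0f;

    /* the flat loop is offloaded to the device; "#pragma omp task" alone
     * never sends work to a GPU - only "target" constructs do */
    #pragma omp target teams distribute parallel for reduction(max: best)
    for (long i = 0; i < n_variants; ++i) {
        float v = evaluate_variant(i);
        if (v > best) best = v;
    }

    std::printf("best score: %f\n", best);
    return 0;
}
Combining CPU and GPU then amounts to splitting the flat index range between a "#pragma omp target" loop and an ordinary "#pragma omp parallel for" loop.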

Why does memcpy not work in this set<int> array case?

a is an array of set<int>, and I want to copy it to b. But...
#include <iostream>
#include <set>
#include <cstring>
using namespace std;

int main() {
    set<int> a[10];
    a[1].insert(99);
    a[3].insert(99);
    if (a[1] == a[3]) cout << "echo" << endl;
    set<int> b[10];
    memcpy(b, a, sizeof(a));
    if (b[1] == b[3]) cout << "echo" << endl; // hangs here, what happened?
    return 0;
}
Do you know what the computer is doing?
I assume the 'set' class you are using is std::set? What makes you think that simply memcpying the raw bytes of a std::set (or an array of them, in this case) will work properly? A std::set stores its elements in heap-allocated tree nodes linked by pointers; memcpy copies only those pointers, so the sets in b still point into a's nodes, and those nodes point back at a's internals, leaving b in an inconsistent state. Any operation on b, such as the == comparison above, walks this broken structure, which is why it hangs. Doing this sort of raw byte manipulation with anything more complicated than primitives (or arrays of primitives) is almost guaranteed to give unexpected results.
To do this properly, you should iterate over the sets and use their '=' operator to assign them, which knows how to copy the contents properly:
for(int i = 0; i < 10; ++i) {
b[i] = a[i];
}
Even better, you can use std::copy (from <algorithm>):
std::copy(std::begin(a), std::end(a), std::begin(b));
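As a side note, if the C-style arrays are replaced with std::array, whole-container assignment works directly. A small sketch:
#include <array>
#include <iostream>
#include <set>

int main() {
    std::array<std::set<int>, 10> a, b;
    a[1].insert(99);
    a[3].insert(99);
    b = a; // element-wise copy via std::set's operator=
    if (b[1] == b[3]) std::cout << "echo" << std::endl;
    return 0;
}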

How can MPI be used efficiently in a portion of code?

My application runs myfuns() as part of a serial execution. It calls dothings(...), passing an object instance and other parameters. That function contains a loop in which each iteration does a breadth-first search, and it is really time-consuming. I have used OpenMP on the loop, and it speeds things up only a little, not enough. I am thinking of using MPI to bring more processes to bear, but I am not sure how to use it efficiently for this portion of code embedded deep inside sequential code.
void dothings(object obji, ...) {
    std::vector<int> retvec;
    for (int i = 0; i < somenumber; i++) {
        /* this function does a breadth-first search using std::queue */
        int retval = compute(obji, i);
        retvec.push_back(retval);
    }
}

/* myfuns() gets called in a sequential manner */
void myfuns() {
    dothings(objectInstance, ...);
}
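Not knowing the surrounding code, here is only a minimal sketch of the usual pattern, under two assumptions: the loop iterations are independent, and compute returns an int (the compute below is a stand-in for the real breadth-first search). Each rank computes a contiguous block of the iterations, then the partial results are gathered:
#include <mpi.h>
#include <algorithm>
#include <vector>

/* hypothetical stand-in for the real breadth-first-search computation */
int compute(int i) { return i * i; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int somenumber = 1000;

    /* block-distribute the iterations: rank r gets [begin, begin + count) */
    int base = somenumber / size, rem = somenumber % size;
    int begin = rank * base + std::min(rank, rem);
    int count = base + (rank < rem ? 1 : 0);

    std::vector<int> local(count);
    for (int i = 0; i < count; ++i)
        local[i] = compute(begin + i);

    /* collect the per-rank result counts, then the results, on rank 0 */
    std::vector<int> counts(size), displs(size), retvec;
    MPI_Gather(&count, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        int offset = 0;
        for (int r = 0; r < size; ++r) { displs[r] = offset; offset += counts[r]; }
        retvec.resize(offset);
    }
    MPI_Gatherv(local.data(), count, MPI_INT,
                retvec.data(), counts.data(), displs.data(), MPI_INT,
                0, MPI_COMM_WORLD);
    /* retvec on rank 0 now matches the sequential result order */

    MPI_Finalize();
    return 0;
}
Whether this pays off depends on how expensive each compute(i) is compared to process startup and data movement; since MPI_Init may only be called once, it is usually called at the start of the whole program even if only this one portion uses the communicator.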

OpenCL trying to use semaphore crashes drivers

While writing a simple OpenCL kernel I tried to use semaphores, and it crashed my GPU drivers (AMD 12.10). After checking the examples, I found out that the crash happens only when the local work size is not equal to 1.
This code is taken from the example:
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable

void GetSemaphor(__global int * semaphor)
{
    int occupied = atom_xchg(semaphor, 1);
    while (occupied > 0)
    {
        occupied = atom_xchg(semaphor, 1);
    }
}

void ReleaseSemaphor(__global int * semaphor)
{
    int prevVal = atom_xchg(semaphor, 0);
}

__kernel void kernelNoAtomInc(__global int * num,
                              __global int * semaphor)
{
    int i = get_global_id(0);
    GetSemaphor(&semaphor[0]);
    {
        num[0]++;
    }
    ReleaseSemaphor(&semaphor[0]);
}
In the example the author uses
CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 1 }, null);
where N = global_work_size and local_work_size = 1.
Now if I change the 1 to null, or to 2 or 4 or any other number I tried, the AMD drivers crash:
CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 2 }, null);
I do not have another PC to test on at the moment. However, it seems strange that the author deliberately left local_group_size = 1, which is why I think I'm missing something here. Can someone please explain this to me? Also, as far as I understand, leaving local_group_size at 1 will hurt performance greatly, won't it?
Thanks.
Host: Win8 x64, HD6870
Your problem is not reproducible, and furthermore I cannot find your source from the link, but here are a few ideas on why it could crash, which should still be helpful (nine years on).
It probably crashes because...
... the driver thinks you want the local version of that atom_xchg() function to be executed, when instead you want the global one.
... your spin loop never terminates once the local work size is greater than 1: work-items of the same wavefront execute in lockstep, so the item that grabbed the semaphore cannot advance to release it while its siblings keep spinning. The kernel then exceeds the driver's internal execution-time limit (the watchdog), causing the driver to terminate it.
What I can suggest as a possible fix:
do not enable the local version of the atom function in your kernel
try running it on the CPU
Beyond that, there is no way to fix this unless we could access your computer and debug on it.
You were also asking why the author chose a local_group_size of one. This is because the global work size needs to be evenly divisible by the local work size. Dividing a natural number by one always yields a natural number, therefore one is perfect for experimenting. You are completely correct in saying that it will affect performance greatly. (And maybe with other sizes the maths simply didn't add up, so the kernel didn't crash but never even started.)
A few other notes:
To make the incrementing functionally correct, you should use atom_inc() on your num buffer. I don't see how this could lead to a crash, but it definitely makes your program not work as intended.
I would go and use the atomic functions from the OpenCL 2.0 standard, since they already provide semaphore-like functions: bool atomic_flag_test_and_set(volatile atomic_flag *object) and void atomic_flag_clear(volatile atomic_flag *object).

What are the possible ways of intercepting system calls on unix environments?

I'm looking to do this on AIX.
Thanks
I'm not familiar with AIX, but the following works on Linux and Solaris. You can use the LD_PRELOAD environment variable, which tells ld.so to load a shared library before libc; in that library you write your own version of the system call and, optionally, call the original. See man ld.so for more information. Something along the lines of:
#define _GNU_SOURCE /* for RTLD_NEXT */
#include <dlfcn.h>

/* note: the real ioctl() is variadic; this simplified
 * three-argument signature covers the common case */
typedef int (*ioctl_fn)(int, int, void*);

static int
my_ioctl(int fildes,
         int request,
         void* argp,
         ioctl_fn fn_ptr)
{
    int result = 0;
    /* call the original, or do my stuff */
    if (request == INTERESTED) /* INTERESTED: placeholder for the request you want to intercept */
    {
        result = 0;
    }
    else
    {
        result = (*fn_ptr)(fildes, request, argp);
    }
    return result;
}

/*
 * override ioctl() - on first call get a pointer to the "real" one
 * and then pass it on to our version of the function
 */
int
ioctl(int fildes,
      int request,
      void* argp)
{
    static ioctl_fn S_fn_ptr = 0;
    if (S_fn_ptr == 0)
    {
        S_fn_ptr = (ioctl_fn)dlsym(RTLD_NEXT, "ioctl");
    }
    return my_ioctl(fildes, request, argp, S_fn_ptr);
}
Carved this out of some code I had lying around; apologies if I have made it incorrect.
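For completeness, building and injecting such a hook typically looks like this on Linux (assuming the code above is saved as ioctl_hook.c; -ldl is needed for dlsym on older glibc):
gcc -shared -fPIC -o ioctl_hook.so ioctl_hook.c -ldl
LD_PRELOAD=$PWD/ioctl_hook.so ./program_to_trace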
Well, there's always systrace.
I'm not sure about AIX, but I've done it on Linux.
On Linux, the system call table is contained in the sys_call_table array.
We first need to find out the address of this table. This is tricky, and there are multiple ways to do it.
We can find its address by looking at the System.map file:
punb200m2labs08vm1:/ # cat /boot/System.map-4.4.21-69-default | grep sys_call_table
ffffffff81600180 R sys_call_table
Hence, ffffffff81600180 is the address of sys_call_table on my machine.
In your kernel module, you can simply replace the entry for a given system call number in this table with your own function.
e.g. Suppose you want to intercept the 'open' system call, whose number on Linux is __NR_open. After you get the sys_call_table address as above, just assign your function to index __NR_open of sys_call_table:
sys_call_table[__NR_open] = your_function;
where your_function is implemented by you to intercept 'open' system call.
From now on, every open system call will go through this function.
The details will differ on AIX, but the overall idea should be similar, I guess. You just need to find the AIX-specific procedure for achieving this.
