How to define and execute an array of functions with SYCL, OpenCL, and DPC++

In my program, I defined an array of functions:
#include <CL/sycl.hpp>
#include <iostream>
#include <tbb/tbb.h>
#include <tbb/parallel_for.h>
#include <vector>
#include <string>
#include <queue>
#include<tbb/blocked_range.h>
#include <tbb/global_control.h>
#include <chrono>
using namespace tbb;
template<class Tin, class Tout, class Function>
class Map {
private:
Function fun;
public:
Map() {}
Map(Function f):fun(f) {}
std::vector<Tout> operator()(bool use_tbb, std::vector<Tin>& v) {
std::vector<Tout> r(v.size());
if(use_tbb){
// Start measuring time
auto begin = std::chrono::high_resolution_clock::now();
tbb::parallel_for(tbb::blocked_range<Tin>(0, v.size()),
[&](tbb::blocked_range<Tin> t) {
for (int index = t.begin(); index < t.end(); ++index){
r[index] = fun(v[index]);
}
});
// Stop measuring time and calculate the elapsed time
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin);
printf("Time measured: %.3f seconds.\n", elapsed.count() * 1e-9);
return r;
} else {
sycl::queue gpuQueue{sycl::gpu_selector()};
sycl::range<1> n_item{v.size()};
sycl::buffer<Tin, 1> in_buffer(&v[0], n_item);
sycl::buffer<Tout, 1> out_buffer(&r[0], n_item);
gpuQueue.submit([&](sycl::handler& h){
//local copy of fun
auto f = fun;
sycl::accessor in_accessor(in_buffer, h, sycl::read_only);
sycl::accessor out_accessor(out_buffer, h, sycl::write_only);
h.parallel_for(n_item, [=](sycl::id<1> index) {
out_accessor[index] = f(in_accessor[index]);
});
}).wait();
}
return r;
}
};
template<class Tin, class Tout, class Function>
Map<Tin, Tout, Function> make_map(Function f) { return Map<Tin, Tout, Function>(f);}
typedef int(*func)(int x);
//define different functions
auto function = [](int x){ return x; };
auto functionTimesTwo = [](int x){ return (x*2); };
auto functionDivideByTwo = [](int x){ return (x/2); };
auto lambdaFunction = [](int x){return (++x);};
int main(int argc, char *argv[]) {
std::vector<int> v = {1,2,3,4,5,6,7,8,9};
//auto f = [](int x){return (++x);};
//Array of functions
func functions[] =
{
function,
functionTimesTwo,
functionDivideByTwo,
lambdaFunction
};
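// sizeof(functions) is the size in bytes, so divide by the element size to get the count
for(std::size_t i = 0; i < sizeof(functions) / sizeof(functions[0]); i++){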
auto m1 = make_map<int, int>(functions[i]);
//auto m1 = make_map<int, int>(f);
std::vector<int> r = m1(true, v);
//print the result
for(auto &e:r) {
std::cout << e << " ";
}
}
return 0;
}
Instead of defining a function each time, I want to define an array of functions and then execute each of them in my program. However, the SYCL part that runs on the GPU produces an error, and I do not know how to fix it.
The error:
SYCL kernel cannot call through a function pointer

In particular, SYCL device code, as defined by this specification, does not support virtual function calls, function pointers in general, exceptions, runtime type information or the full set of C++ libraries that may depend on these features or on features of a particular host compiler. Nevertheless, these basic restrictions can be relieved by some specific Khronos or vendor extensions.
As per the SYCL 2020 specification, function pointers cannot be called in a SYCL kernel or in any function called by the kernel.
Please refer to https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#introduction
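One way to keep several operations without function pointers is to preserve each lambda's own closure type, for example in a std::tuple, and let a template call each one directly. The sketch below is host-only standard C++17: apply_map, run_all and make_functions are hypothetical names, and the plain loop in apply_map merely stands in for the question's h.parallel_for. Whether a given SYCL compiler accepts the same pattern on the device is an assumption to check, not something this example demonstrates.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the question's Map: applies f to every element.
// F is a template parameter, so f(x) is a direct call, not a call through a pointer.
template <class F>
std::vector<int> apply_map(const std::vector<int>& v, F f) {
    std::vector<int> r(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        r[i] = f(v[i]);
    }
    return r;
}

// Runs every callable stored in the tuple over v, in tuple order,
// and collects one result vector per callable.
template <class... Fs>
std::vector<std::vector<int>> run_all(const std::vector<int>& v,
                                      const std::tuple<Fs...>& fs) {
    std::vector<std::vector<int>> results;
    std::apply([&](const auto&... f) { (results.push_back(apply_map(v, f)), ...); }, fs);
    assert(results.size() == sizeof...(Fs));
    return results;
}

// The four operations from the question, kept as lambdas with distinct closure types.
inline auto make_functions() {
    return std::make_tuple(
        [](int x) { return x; },
        [](int x) { return x * 2; },
        [](int x) { return x / 2; },
        [](int x) { return x + 1; });
}
```

Calling run_all(v, make_functions()) replaces the loop over the func array. In the question's code, the equivalent change would be to pass each tuple element to make_map<int, int>, so that Function is deduced as a closure type rather than the func pointer type.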

Related

Using `.C` Interface of R to handle read/write files

I am trying to filter a huge text file line by line, which pure R is not very good at. So I wrote a C function that I hope will speed up the process. Below is a minimal working example, filter.c, for demonstration purposes.
So far I have tried the .C interface without luck. Here is my attempt:
I built the shared library using gcc -shared -o lfilter.so -fPIC filter.c
dyn.load("lfilter.so")
.C("filter", as.character("I1.txt"), as.character("I1.out.txt"), as.character("filter.txt"))
R crashed on the third step. Unfortunately, I have to stay within R.
Any help or suggestions are welcome.
filter.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define LL 256
int get_row(char *filename)
{
char line[LL];
int i = 0;
FILE *stream = fopen(filename, "r");
while (fgets(line, LL, stream))
{
i++;
}
fclose(stream);
return i;
}
void filter(char *R1_in,
char *R1_out,
char *filter)
{
char R1_line[LL];
FILE *R1_stream = fopen(R1_in, "r");
FILE *R1_out_stream = fopen(R1_out,"w");
/*****************loading filters*******************/
int nrows = get_row(filter);
FILE *filter_stream = fopen(filter, "r");
char **filter_list = (char **)malloc(nrows * sizeof(*filter_list));
for(int i = 0; i <nrows; i++)
{
filter_list[i] = malloc(LL * sizeof(char));
fgets(filter_list[i], LL, filter_stream);
}
fclose(filter_stream);
/*****************filtering*******************/
while (fgets(R1_line, LL, R1_stream))
{
// printf("%s", R1_line);
for(int i = 0; i<nrows; i++)
{
if(strcmp(R1_line, filter_list[i])==0)
{
fprintf(R1_out_stream, "%s", R1_line);
break;
}
}
}
printf("\n");
for(int i=0; i<nrows; i++)
{
free(filter_list[i]);
}
free(filter_list);
fclose(R1_stream);
fclose(R1_out_stream);
}
// int main()
// {
// char R1_in[] = "I1.txt";
// char R1_out[] = "I1.out.txt";
//
// char filters[] = "filter.txt";
//
// filter(R1_in, R1_out, filters);
// return 0;
// }
I1.txt
aa
baddf
ca
daa
filter.txt
ca
cb
Expected Output I1.out.txt
ca
I had never used R before. But, I was a bit intrigued. So, I installed R and did a little research.
Everything in R [using the .C interface] is passed to the C function as a pointer.
From: https://www.r-bloggers.com/2014/02/three-ways-to-call-cc-from-r/ we have:
Inside a running R session, the .C interface allows objects to be directly accessed in an R session’s active memory. Thus, to write a compatible C function, all arguments must be pointers. No matter the nature of your function’s return value, it too must be handled using pointers. The C function you will write is effectively a subroutine.
So, if we pass an integer, the C function argument must be:
int *
I guessed that:
char *
needed to be:
char **
and then tested it with:
#include <stdio.h>
#define SHOW(_sym) \
show(#_sym,_sym)
static void
show(const char *sym,char **ptr)
{
char *str;
/* %p expects a void pointer, so cast explicitly */
printf("%s: ptr=%p",sym,(void *) ptr);
str = *ptr;
printf(" str=%p",(void *) str);
printf(" '%s'\n",str);
}
void
filter(char **R1_in,char **R1_out,char **filt)
{
SHOW(R1_in);
SHOW(R1_out);
SHOW(filt);
}
Here is the output:
> dyn.load("filter.so");
> .C("filter",
+ as.character("abc"),
+ as.character("def"),
+ as.character("ghi"))
R1_in: ptr=0x55a9f8cb1798 str=0x55a9f9de9760 'abc'
R1_out: ptr=0x55a9f8cb1818 str=0x55a9f9de9728 'def'
filt: ptr=0x55a9f8cb1898 str=0x55a9f9de96f0 'ghi'
[[1]]
[1] "abc"
[[2]]
[1] "def"
[[3]]
[1] "ghi"
> q()
So, you want:
void
filter(char **R1_in, char **R1_out, char **filt)
{
FILE *R1_stream = fopen(*R1_in, "r");
// ...
}
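For completeness, here is a sketch of the whole filter written against the char ** signature. It is C++ rather than C, exported with extern "C" so that the symbol name stays filter. It also swaps the nested strcmp loop for a hash set, which is a change of technique and not what the original code did. It assumes the library is built with a C++ compiler and loaded with dyn.load as before. Lines are compared without their trailing newline.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <unordered_set>

// Assumed to be called through R's .C interface, so every argument arrives
// as a pointer: an R character vector becomes char **.
extern "C" void filter(char **R1_in, char **R1_out, char **filt) {
    assert(R1_in && R1_out && filt);

    std::ifstream filter_stream(*filt);
    std::ifstream in_stream(*R1_in);
    if (!filter_stream || !in_stream) {
        return;  // an input could not be opened; this sketch reports nothing back to R
    }
    std::ofstream out_stream(*R1_out);
    if (!out_stream) {
        return;
    }

    // Load every filter line once; lookups are then constant time on average.
    std::unordered_set<std::string> wanted;
    std::string line;
    while (std::getline(filter_stream, line)) {
        wanted.insert(line);
    }

    // Copy through only the input lines that appear in the filter set.
    while (std::getline(in_stream, line)) {
        if (wanted.count(line) != 0) {
            out_stream << line << '\n';
        }
    }
}
```

The R call stays the same as in the question. Only the C-side parameter types change.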

Why it is not possible to use unique_ptr in QFuture?

Here is my sample code, I am using std::vector<std::unique_ptr<std::string>> as future result.
#include "mainwindow.h"
#include <QLineEdit>
#include <QtConcurrent/QtConcurrent>
#include <vector>
MainWindow::MainWindow(QWidget *parent) :
QMainWindow(parent) {
auto model = new QLineEdit(this);
this->setCentralWidget(model);
auto watcher = new QFutureWatcher<std::vector<std::unique_ptr<std::string>>>(/*this*/);
auto future = QtConcurrent::run([this]() -> std::vector<std::unique_ptr<std::string>> {
std::vector<std::unique_ptr<std::string>> res;
for (int k = 0; k < 100; ++k) {
auto str = std::make_unique<std::string>("Hi");
res.push_back(std::move(str));
}
return res;
});
connect(watcher, &QFutureWatcher<std::vector<std::unique_ptr<std::string>>>::finished, this, [=]() {
for (const auto &item : future.result()){
model->setText(model->text() + QString::fromStdString(*item));
}
delete watcher;
});
watcher->setFuture(future);
}
MainWindow::~MainWindow() {
}
But this code does not compile.
Here is the log:
/Users/ii/QT/qt-everywhere-src-6.2.0-beta4/include/QtCore/qfuture.h:328:12: note: in instantiation of member function 'std::vector<std::unique_ptr<std::string>>::vector' requested here
return d.resultReference(0);
^
/Users/ii/CLionProjects/simpleQT/mainwindow.cpp:22:40: note: in instantiation of function template specialization 'QFuture<std::vector<std::unique_ptr<std::string>>>::result<std::vector<std::unique_ptr<std::string>>, void>' requested here
for (const auto &item : future.result()){
^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.1.sdk/usr/include/c++/v1/__memory/base.h:103:16: note: candidate template ignored: substitution failure [with _Tp = std::unique_ptr<std::string>, _Args = <std::unique_ptr<std::string> &>]: call to implicitly-deleted copy constructor of 'std::unique_ptr<std::string>'
constexpr _Tp* construct_at(_Tp* __location, _Args&& ...__args) {
^
You need to use takeResult instead of result in order to use move-only types.
QFuture isn't really designed with move-only types in mind, and you might be better off using a copyable type or std::future instead.
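As an illustration of the std::future alternative, the sketch below produces the same vector of unique_ptr on another thread using only the standard library. The names Strings, make_strings_async and join_strings are hypothetical, and the example leaves out the Qt signal handling from the question.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <string>
#include <vector>

using Strings = std::vector<std::unique_ptr<std::string>>;

// Builds the same move-only result as the question, on another thread.
inline std::future<Strings> make_strings_async(int count) {
    return std::async(std::launch::async, [count] {
        Strings res;
        for (int k = 0; k < count; ++k) {
            res.push_back(std::make_unique<std::string>("Hi"));
        }
        return res;
    });
}

// std::future::get() hands the stored value out by move, so no copy of the
// unique_ptr elements is attempted. It may be called only once per future.
inline std::string join_strings(std::future<Strings> fut) {
    assert(fut.valid());
    std::string joined;
    for (const auto& item : fut.get()) {
        joined += *item;
    }
    return joined;
}
```

For example, join_strings(make_strings_async(100)) concatenates the hundred strings. The result would then still need to be passed back to the GUI thread before updating a widget.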

QtConcurrent mapped with index

I wondered whether there is a way to also pass the index of the element currently being processed when using QtConcurrent::mapped(someVector, &someFunction) (and likewise filter, filtered, map, ...).
What I want: to do something with the elements in someVector based on their index in it. However, someFunction only receives a value of type T, the element type of the QVector<T>.
What I did: because I needed this, I created a QVector<std::pair<int, T>> and filled in the index for each element manually.
Since this requires more space and is not a nice solution, I thought there might be a better one.
Docs: https://doc.qt.io/qt-5/qtconcurrent-index.html
If your input is a QVector, you can make use of the fact that QVector stores all the elements contiguously. This means that given a reference to an element e in a QVector v, then the index of e can be obtained by:
std::ptrdiff_t idx = &e - &v.at(0);
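The same pointer arithmetic can be tried without Qt on a std::vector, which also stores its elements contiguously. This is only a sketch: index_of and times_index are hypothetical helper names, and the trick works only when the function receives a reference into the container itself, not a copy of the element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Recovers the index of e, which must be a reference to an element stored
// in v, by subtracting the address of the first element.
inline std::ptrdiff_t index_of(const std::vector<int>& v, const int& e) {
    assert(!v.empty());
    return &e - v.data();
}

// Maps every element x at index idx to x * idx.
inline std::vector<long long> times_index(const std::vector<int>& v) {
    std::vector<long long> out;
    out.reserve(v.size());
    for (const int& x : v) {  // iterate by reference so &x points into v
        out.push_back(static_cast<long long>(x) * index_of(v, x));
    }
    return out;
}
```

Iterating with `for (int x : v)` instead would copy each element, and the subtraction would no longer yield a meaningful index.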
Below is a complete example using QtConcurrent::mapped:
#include <iterator>
#include <numeric>
#include <type_traits>
#include <utility>
#include <QtCore>
#include <QtConcurrent>
// lambda functions are not directly usable in QtConcurrent::mapped, the
// following is a necessary workaround
// see https://stackoverflow.com/a/49821973
template <class T> struct function_traits :
function_traits<decltype(&T::operator())> {};
template <typename ClassType, typename ReturnType, typename... Args>
struct function_traits<ReturnType(ClassType::*)(Args...) const> {
// specialization for pointers to member function
using functor_type = ClassType;
using result_type = ReturnType;
using arg_tuple = std::tuple<Args...>;
static constexpr auto arity = sizeof...(Args);
};
template <class Callable, class... Args>
struct CallableWrapper : Callable, function_traits<Callable> {
CallableWrapper(const Callable &f) : Callable(f) {}
CallableWrapper(Callable &&f) : Callable(std::move(f)) {}
};
template <class F, std::size_t ... Is, class T>
auto wrap_impl(F &&f, std::index_sequence<Is...>, T) {
return CallableWrapper<F, typename T::result_type,
std::tuple_element_t<Is, typename T::arg_tuple>...>(std::forward<F>(f));
}
template <class F> auto wrap(F &&f) {
using traits = function_traits<F>;
return wrap_impl(std::forward<F>(f),
std::make_index_sequence<traits::arity>{}, traits{});
}
int main(int argc, char* argv[]) {
QCoreApplication app(argc, argv);
// a vector of numbers from 0 to 500
QVector<int> seq(500, 0);
std::iota(seq.begin(), seq.end(), 0);
qDebug() << "input: " << seq;
QFuture<int> mapped = QtConcurrent::mapped(seq, wrap([&seq](const int& x) {
// the index of the element in a QVector is the difference between
// the address of the first element in the vector and the address of
// the current element
std::ptrdiff_t idx = std::distance(&seq.at(0), &x);
// we can then use x and idx however we want
return x * idx;
}));
qDebug() << "output: " << mapped.results();
QTimer::singleShot(100, &app, &QCoreApplication::quit);
return app.exec();
}
See this question for a related discussion. Note that the linked question has a cleaner answer that involves the usage of zip and counting iterators from boost (or possibly their C++20 ranges counterparts), but I don't think that this would play well with QtConcurrent::map when map slices the sequence into blocks, and distributes these blocks to multiple threads.

Looking for Segmentation Fault in C script

Hi, I am trying to learn C, specifically how to use pointers.
I wrote this program to practice the ideas I've learned, but it crashes with a segmentation fault.
A bit of research suggests that I am trying to access something I should not be accessing. I think it is an uninitialized pointer, but I can't find it.
#include <stdio.h>
struct IntItem {
struct IntItem* next;
int value;
};
struct IntList {
struct IntItem* head;
struct IntItem* tail;
};
void append_list(struct IntList* ls, int item){
struct IntItem* last = ls->tail;
struct IntItem addition = {NULL,item};
last->next = &addition;
ls->tail = &addition;
if (!ls->head) {
ls->head = &addition;
}
}
int sum(int x, int y){
return x + y;
}
int max(int x, int y){
return x*(x>y) + y*(y>x);
}
int reduce(struct IntList xs, int (*opy)(int, int)){
struct IntItem current = *xs.head;
int running = 0;
while (current.next) {
running = opy(running,current.value);
current = *current.next;
}
return running;
}
int main(void) {
struct IntList ls = {NULL, NULL};
printf("Start Script\n");
append_list(&ls, 1);
append_list(&ls, 2);
append_list(&ls, 3);
printf("List Complete\n");
printf("Sum: %i",reduce(ls,sum));
printf("Max: %i",reduce(ls,max));
return 0;
}
Hints:
When you call append_list(&ls, 1), then inside append_list, what is the value of last?
What does last->next = &addition do?
And for your next bug:
What happens to addition after append_list returns? What does that mean for pointers to it?
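For readers who want to check their answer to the hints, here is one possible corrected version, written as a sketch in C-style C++ (the malloc cast is what makes it valid C++). It allocates each node on the heap so that it outlives append_list, and it handles the empty list, where tail is still NULL. It also walks the list by pointer so the last node is included. The question's reduce stops as soon as current.next is NULL and therefore skips the final value. free_list is a hypothetical addition that was not in the question.

```cpp
#include <cassert>
#include <cstdlib>

struct IntItem {
    IntItem* next;
    int value;
};

struct IntList {
    IntItem* head;
    IntItem* tail;
};

// Heap-allocates the node so it outlives this call, and handles the empty
// list, where there is no tail to link from.
inline void append_list(IntList* ls, int item) {
    assert(ls != nullptr);
    IntItem* addition = static_cast<IntItem*>(std::malloc(sizeof(IntItem)));
    if (addition == nullptr) {
        std::abort();  // out of memory
    }
    addition->next = nullptr;
    addition->value = item;
    if (ls->tail != nullptr) {
        ls->tail->next = addition;
    } else {
        ls->head = addition;
    }
    ls->tail = addition;
}

inline int sum(int x, int y) { return x + y; }

// Folds op over every node, including the last; returns 0 for an empty list.
inline int reduce(const IntList* xs, int (*op)(int, int)) {
    int running = 0;
    for (const IntItem* cur = xs->head; cur != nullptr; cur = cur->next) {
        running = op(running, cur->value);
    }
    return running;
}

// Releases every node and resets the list to empty.
inline void free_list(IntList* ls) {
    IntItem* cur = ls->head;
    while (cur != nullptr) {
        IntItem* next = cur->next;
        std::free(cur);
        cur = next;
    }
    ls->head = ls->tail = nullptr;
}
```

With the values 1, 2 and 3 appended, reduce(&ls, sum) visits all three nodes.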

QFuture Memoryleak

I want to parallelize a function, but after a few hours the program exhausts my memory.
The test program calculates something simple and works so far; only its memory usage keeps increasing.
Qt project file:
QT -= gui
QT += concurrent widgets
CONFIG += c++11 console
CONFIG -= app_bundle
DEFINES += QT_DEPRECATED_WARNINGS
SOURCES += main.cpp
Qt program file:
#include <QCoreApplication>
#include <qdebug.h>
#include <qtconcurrentrun.h>
double parallel_function(int instance){
return (double)(instance)*10.0;
}
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
int nr_of_threads = 8;
double result_sum = 0.0, temp_var; // initialize the accumulator before summing into it
for(qint32 i = 0; i<100000000; i++){
QFuture<double> * future = new QFuture<double>[nr_of_threads];
for(int thread = 0; thread < nr_of_threads; thread++){
future[thread] = QtConcurrent::run(parallel_function,thread);
}
for(int thread = 0; thread < nr_of_threads; thread++){
future[thread].waitForFinished();
temp_var = future[thread].result();
qDebug()<<"result: " << temp_var;
result_sum += temp_var;
}
}
qDebug()<<"total: "<<result_sum;
return a.exec();
}
As I have observed, QtConcurrent::run(parallel_function,thread) allocates memory, but does not release memory after future[thread].waitForFinished().
What's wrong here?
You have a memory leak because the future array is never deleted. Add delete[] future at the end of the outer for loop.
for(qint32 i = 0; i<100000000; i++)
{
QFuture<double> * future = new QFuture<double>[nr_of_threads];
for(int thread = 0; thread < nr_of_threads; thread++){
future[thread] = QtConcurrent::run(parallel_function,thread);
}
for(int thread = 0; thread < nr_of_threads; thread++){
future[thread].waitForFinished();
temp_var = future[thread].result();
qDebug()<<"result: " << temp_var;
result_sum += temp_var;
}
delete[] future; // <--
}
Here's how this might look - note how much simpler everything can be! You're dead set on doing manual memory management: why? First of all, QFuture is a value. You can store it very efficiently in any vector container that will manage the memory for you. You can iterate such a container using range-for. Etc.
QT = concurrent # dependencies are automatic, you don't use widgets
CONFIG += c++14 console
CONFIG -= app_bundle
SOURCES = main.cpp
Even though the example is synthetic and the map_function is very simple, it's worth considering how to do things most efficiently and expressively. Your algorithm is a typical map-reduce operation, and blockingMappedReduced has half the overhead of manually doing all of the work.
First of all, let's recast the original problem in C++, instead of some C-with-pluses Frankenstein.
// https://github.com/KubaO/stackoverflown/tree/master/questions/future-ranges-49107082
/* QtConcurrent will include QtCore as well */
#include <QtConcurrent>
#include <algorithm>
#include <iterator>
#include <numeric> // std::accumulate
using result_type = double;
static result_type map_function(int instance){
return instance * result_type(10);
}
static void sum_modifier(result_type &result, result_type value) {
result += value;
}
static result_type sum_function(result_type result, result_type value) {
return result + value;
}
result_type sum_approach1(int const N) {
QVector<QFuture<result_type>> futures(N);
int id = 0;
for (auto &future : futures)
future = QtConcurrent::run(map_function, id++);
return std::accumulate(futures.cbegin(), futures.cend(), result_type{}, sum_function);
}
There is no manual memory management, and no explicit splitting into "threads" - that was pointless, since the concurrent execution platform is aware of how many threads there are. So this is already better!
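The idea that futures are values that a container can own also carries over to the standard library. The sketch below mirrors sum_approach1 with std::async and std::future in place of QtConcurrent::run and QFuture. sum_with_std_futures is a hypothetical name, and this swap is only an illustration, not something the benchmark measures.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

using result_type = double;

inline result_type map_function(int instance) {
    return instance * result_type(10);
}

// Launches one task per item and sums the results. The vector owns the
// futures, so nothing needs new[] or delete[].
inline result_type sum_with_std_futures(int n) {
    assert(n >= 0);
    std::vector<std::future<result_type>> futures;
    futures.reserve(static_cast<std::size_t>(n));
    for (int id = 0; id < n; ++id) {
        futures.push_back(std::async(map_function, id));
    }
    result_type total = 0;
    for (auto& future : futures) {
        total += future.get();  // blocks until that task has finished
    }
    return total;
}
```

When futures goes out of scope, every future is destroyed automatically, which is the property the leaking new[] version lacked.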
But this seems quite wasteful: each future internally allocates at least once (!).
Instead of using futures explicitly for each result, we can use the map-reduce framework. To generate the sequence, we can define an iterator that provides the integers we wish to work on. The iterator can be a forward or a bidirectional one, and its implementation is the bare minimum needed by QtConcurrent framework.
#include <iterator>
template <typename tag> class num_iterator : public std::iterator<tag, int, int, const int*, int> {
int num = 0;
using self = num_iterator;
using base = std::iterator<tag, int, int, const int*, int>;
public:
explicit num_iterator(int num = 0) : num(num) {}
self &operator++() { num ++; return *this; }
self &operator--() { num --; return *this; }
self &operator+=(typename base::difference_type d) { num += d; return *this; }
friend typename base::difference_type operator-(self lhs, self rhs) { return lhs.num - rhs.num; }
bool operator==(self o) const { return num == o.num; }
bool operator!=(self o) const { return !(*this == o); }
typename base::reference operator*() const { return num; }
};
using num_f_iterator = num_iterator<std::forward_iterator_tag>;
result_type sum_approach2(int const N) {
auto results = QtConcurrent::blockingMapped<QVector<result_type>>(num_f_iterator{0}, num_f_iterator{N}, map_function);
return std::accumulate(results.cbegin(), results.cend(), result_type{}, sum_function);
}
using num_b_iterator = num_iterator<std::bidirectional_iterator_tag>;
result_type sum_approach3(int const N) {
auto results = QtConcurrent::blockingMapped<QVector<result_type>>(num_b_iterator{0}, num_b_iterator{N}, map_function);
return std::accumulate(results.cbegin(), results.cend(), result_type{}, sum_function);
}
Could we drop the std::accumulate and use blockingMappedReduced instead? Sure:
result_type sum_approach4(int const N) {
return QtConcurrent::blockingMappedReduced(num_b_iterator{0}, num_b_iterator{N},
map_function, sum_modifier);
}
We can also try a random access iterator:
using num_r_iterator = num_iterator<std::random_access_iterator_tag>;
result_type sum_approach5(int const N) {
return QtConcurrent::blockingMappedReduced(num_r_iterator{0}, num_r_iterator{N},
map_function, sum_modifier);
}
Finally, we can switch from using range-generating iterators, to a precomputed range:
#include <numeric>
result_type sum_approach6(int const N) {
QVector<int> sequence(N);
std::iota(sequence.begin(), sequence.end(), 0);
return QtConcurrent::blockingMappedReduced(sequence, map_function, sum_modifier);
}
Of course, our point is to benchmark it all:
template <typename F> void benchmark(F fun, double const N) {
QElapsedTimer timer;
timer.start();
auto result = fun(N);
qDebug() << "sum:" << fixed << result << "took" << timer.elapsed()/N << "ms/item";
}
int main() {
const int N = 1000000;
benchmark(sum_approach1, N);
benchmark(sum_approach2, N);
benchmark(sum_approach3, N);
benchmark(sum_approach4, N);
benchmark(sum_approach5, N);
benchmark(sum_approach6, N);
}
On my system, in release build, the output is:
sum: 4999995000000.000000 took 0.015778 ms/item
sum: 4999995000000.000000 took 0.003631 ms/item
sum: 4999995000000.000000 took 0.003610 ms/item
sum: 4999995000000.000000 took 0.005414 ms/item
sum: 4999995000000.000000 took 0.000011 ms/item
sum: 4999995000000.000000 took 0.000008 ms/item
Note how using map-reduce on a random-iterable sequence has over 3 orders of magnitude lower overhead than using QtConcurrent::run, and is 2 orders of magnitude faster than non-random-iterable solutions.
