Using vloadn (OpenCL) to load unallocated memory

I am using vloadn to load data, passing the range I want to read as a parameter, and it works. However, I am wondering what the behavior of vload4 is when it reads more elements than the buffer holds. Could this cause unexpected issues, or am I perfectly safe doing this? An example might be something like this:
__kernel void myKernel(__global float* data_ptr, int size)
{
float4 vec = vload4(0, data_ptr);
float sum = 0.f;
// data_ptr points to an array of 2 floats in global mem
if (size == 2) {
sum += vec.s1;
sum += vec.s0;
}
else if (size == 1) {
sum += vec.s0;
}
}
data_ptr is an array of 2 floats in global memory, but even though I am accessing only those 2 floats, I am loading 4 floats using vload4. The reason I am asking is that I want to use a single vloadn and decide afterwards how much of it I actually want to use, rather than choosing the vloadn variant based on size (e.g. vload4 for size==4, vload8 for size==8, etc.).

If the read stays within the buffer that data_ptr points to, it will be fine; you don't have to use all the data you read. However, if you read past either end of that buffer, you can run into problems (a memory read exception, for example, or some other device-dependent error). In your example the buffer holds only 2 floats, so vload4 reads past its end. Note: check the address alignment requirements for vloadn to see whether you are allowed to read at any address or only at multiples of some size.
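One way to keep the "load once, decide later" style without reading past the buffer is to pad the vector yourself. The following is a minimal host-side C++ sketch of that idea, not OpenCL code; load4_padded and sum_first are hypothetical helpers invented for this illustration, and the same logic would have to be ported to a kernel by hand.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an OpenCL built-in): copies at most 4 floats from
// `src`, never reads beyond `available` elements, and zero-fills the rest.
std::array<float, 4> load4_padded(const float* src, std::size_t available) {
    assert(src != nullptr || available == 0);
    std::array<float, 4> out{};  // value-initialised, so every lane starts at 0.0f
    const std::size_t n = available < out.size() ? available : out.size();
    for (std::size_t i = 0; i < n; ++i) out[i] = src[i];
    return out;
}

// Mirrors the kernel in the question: sum the first `size` lanes.
float sum_first(const float* data, std::size_t size) {
    const std::array<float, 4> vec = load4_padded(data, size);
    float sum = 0.0f;
    for (std::size_t i = 0; i < size && i < vec.size(); ++i) sum += vec[i];
    return sum;
}
```

With a 2-element buffer, sum_first(data, 2) touches only data[0] and data[1]; the two extra lanes are padding rather than out-of-bounds reads.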

Related

Assigning Rcpp::XPtr on the R side

I'm learning about external pointers, XPtr, in Rcpp. I made the following test functions:
// [[Rcpp::export]]
Rcpp::XPtr<arma::Mat<int>> create_xptr(int i, int j) {
arma::Mat<int>* ptr(new arma::Mat<int>(i, j));
Rcpp::XPtr<arma::Mat<int>> p(ptr, true);
return p;
}
// [[Rcpp::export]]
void fill_xptr(Rcpp::XPtr<arma::Mat<int>>& xptr, int k) {
(*xptr).fill(k);
}
// [[Rcpp::export]]
arma::Mat<int> return_val(Rcpp::XPtr<arma::Mat<int>>& xptr) {
return *xptr;
}
Now on the R side I can of course create an instance and work with it:
x <- create_xptr(1000, 1000) # (1)
Say that, for some reason, I accidentally called create_xptr again and assigned the result to x, i.e.
x <- create_xptr(1000, 1000) # (2)
Then I no longer have access to the pointer I created in (1), which makes sense, and hence I cannot free the memory. What I would like is for the second call (2) to simply overwrite the first one (1). Secondly, if I create an external pointer in some local scope (say, a simple for loop), should the memory be freed automatically when it goes out of scope? I've tried the following:
for (i in 1:3) {
a <- create_xptr(1000, 100000)
fill_xptr(a, 1)
}
But it just adds to the memory usage for each i.
I have tried reading code in various Git repos and old posts here on Stack Overflow, and I have read a little about finalizers and garbage collection in R, but I can't seem to put the puzzle together.
We use external pointers for things that do not already have existing interfaces, such as database connections or objects from other new libraries. You may be better off looking at some existing uses of XPtr (or the other external pointer variants) in other packages; there are two small ones on CRAN.
I can't think of an example that directly references this in R. It is mostly for "wrapping" external objects to hold on to them and to pass them around for further use elsewhere. And you are correct that you need to read up a little on finalizers. I find reading Writing R Extensions, as dense as it is, to be the best source because you need to get the initial "basics" in C right first.
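To make the finalizer idea a little more concrete, here is a plain C++ analogy; it is not Rcpp code. BigMatrix, Finalizer, Handle, create_handle and rebind_three_times are all hypothetical names made up for this sketch. A std::unique_ptr runs its deleter as soon as the handle is rebound, whereas R only runs an external pointer's finalizer when the garbage collector gets around to collecting the old object. That would be consistent with memory appearing to grow inside the for loop in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Counts how many times the hypothetical finalizer has run.
static int finalized_count = 0;

// Stand-in for the large matrix held behind the external pointer.
struct BigMatrix {
    std::vector<int> cells;
    BigMatrix(int rows, int cols) : cells(static_cast<std::size_t>(rows) * cols) {}
};

// Plays the role of a finalizer: release the object and record that it ran.
struct Finalizer {
    void operator()(BigMatrix* m) const {
        delete m;
        ++finalized_count;
    }
};

using Handle = std::unique_ptr<BigMatrix, Finalizer>;

Handle create_handle(int rows, int cols) {
    return Handle(new BigMatrix(rows, cols));
}

// Rebinds one handle three times and reports how many finalizers ran in the loop.
int rebind_three_times() {
    const int before = finalized_count;
    Handle x;
    for (int i = 0; i < 3; ++i) {
        x = create_handle(10, 10);  // the previously held object is finalized here
    }
    return finalized_count - before;  // the last object is still alive at this point
}
```

The design point is only that release is tied to the handle's lifetime rather than to an explicit free call; when that release happens differs between C++ (immediately) and R (at garbage collection).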

OpenCL program freezes every time private variables are used

I have a kernel which, simplified, looks like this:
__kernel void calculate (__global float *a, __global float *b, __global int *res) {
int workItem = get_global_id(0); // Syntax may not be exactly right, but you get the idea
int found = 0;
for (int i = 0; i<100000000; i++) {
float c = a[i]*3;
float d = b[i]*2;
if (c<d) {
found++;
}
}
res[workItem]= found;
}
So, nothing more than a simple calculation and a very big loop. The problem is that the program freezes every time I run this code, and I have to force-reset the computer whenever it happens.
But if I make some changes, like this:
if (true) {
found++;
}
Or
if (1<2) {
found++;
}
Then the program works like a charm, and very fast! So I wonder whether there is anything wrong with the variables c and d. I tried things like
__private float c= ..;
__private float d= ..;
That didn't work either.
I check the return code of every step while creating the program and the kernel, so that's not the problem.
What did I do wrong here?
You have several issues here.
You are running a giant loop, presumably with a single work item, which is exactly what you want to avoid in OpenCL, as the entire concept of parallelism is lost. OpenCL will not magically make your code fast if you don't use it correctly.
Even if you are using multiple work items, you are not making use of the value you retrieve from get_global_id() except for output. But, as they're all using the same input, you'll get the same output for every single work item!
Work items and their associated global IDs are intended to allow you to partition your processing into discrete units, rather than one big monolithic loop. I suggest you look at some tutorials like this one to understand the concept better. Don't start writing your own code until you understand this.
As for why your program freezes your PC, I can only speculate without seeing your host code. Perhaps you are getting a buffer overrun: the loop reads a[i] and b[i] for 100000000 values of i, whatever the actual size of the buffers.
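The partitioning described above can be sketched in ordinary C++ by simulating work items one after another. This is an illustration only, not OpenCL host or kernel code. The names count_for_work_item and count_all are hypothetical, and the strided split (gid, gid + global_size, ...) is just one possible way of dividing the index range.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Body of one simulated work item: handles indices gid, gid + global_size, ...
// and never reads past the real element count.
int count_for_work_item(const std::vector<float>& a, const std::vector<float>& b,
                        std::size_t gid, std::size_t global_size) {
    assert(a.size() == b.size() && global_size > 0);
    int found = 0;
    for (std::size_t i = gid; i < a.size(); i += global_size) {
        if (a[i] * 3.0f < b[i] * 2.0f) ++found;
    }
    return found;
}

// Host-side stand-in: runs every work item in turn and adds up the partial counts.
int count_all(const std::vector<float>& a, const std::vector<float>& b,
              std::size_t global_size) {
    int total = 0;
    for (std::size_t gid = 0; gid < global_size; ++gid) {
        total += count_for_work_item(a, b, gid, global_size);
    }
    return total;
}
```

Each work item now uses its global ID to select its own slice of the input, and the loop bound comes from the buffer size instead of a hard-coded constant.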

Is there a minimum string length for F() to be useful?

Is there a length below which using the F() macro on a short string costs more RAM than it saves?
For (a contrived) example:
Serial.print(F("\n"));
Serial.print(F("Hi"));
Serial.print(F("there!"));
Serial.print(F("How do you doyou how?"));
Would any one of those be more efficient without the F()?
I imagine it uses some RAM to iterate over the string and copy it from PROGMEM to RAM. I guess the question is: how much? Also, is heap fragmentation a concern here?
I'm looking at this purely from SRAM-conserving perspective.
From a purely SRAM-conserving perspective all of your examples are identical in that no static RAM is used. At run-time some RAM is used, but only momentarily on the stack. Keep in mind that calling println() (without any parameters) also uses some stack RAM.
For a single character it will take up less space in flash if a char is passed into print or println. For example:
Serial.print('\n');
The char will be in flash (not static RAM).
Using
Serial.print(F("\n"));
will create a string in flash memory that is two bytes long (newline char + null terminator) and will additionally pass a pointer to that string to print which is probably two bytes long.
Additionally at runtime, using the F macro will result in two fetches ('\n' and the null terminator) from flash. While fetches from flash are fast, passing in a char results in zero fetches from flash, which is a tiny bit faster.
I don't think there is any minimum size of the string to be useful. If you look at how the outputting is implemented in Print.cpp:
size_t Print::print(const __FlashStringHelper *ifsh)
{
PGM_P p = reinterpret_cast<PGM_P>(ifsh);
size_t n = 0;
while (1) {
unsigned char c = pgm_read_byte(p++);
if (c == 0) break;
n += write(c);
}
return n;
}
You can see from there that only one byte of RAM is used at a time (plus a couple of variables), as it pulls the string from PROGMEM a byte at a time. These are all on the stack so there is no ongoing overhead.
"I imagine it uses some RAM to iterate over the string and copy it from PROGMEM to RAM. I guess the question is: how much?"
No, it doesn't as I showed above. It outputs a byte at a time. There is no copying (in bulk) of the string into RAM first.
"Also, is heap fragmentation a concern here?"
No, the code does not use the heap.
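The byte-at-a-time behavior can be mimicked on a desktop compiler. The sketch below is not Arduino code: read_flash_byte is a hypothetical stand-in for pgm_read_byte(), print_from_flash is a made-up name, and a std::string plays the role of the serial port (it does allocate, unlike the real write()). It only shows that the loop keeps a single character live at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for pgm_read_byte(): a desktop has one address space,
// so "reading flash" is just dereferencing the pointer.
unsigned char read_flash_byte(const char* p) {
    return static_cast<unsigned char>(*p);
}

// Same loop shape as Print::print above; `out` stands in for the serial port.
std::size_t print_from_flash(const char* flash_str, std::string& out) {
    assert(flash_str != nullptr);
    std::size_t n = 0;
    while (true) {
        const unsigned char c = read_flash_byte(flash_str++);  // one byte live at a time
        if (c == 0) break;                                     // stop at the terminator
        out.push_back(static_cast<char>(c));
        ++n;
    }
    return n;
}
```

For a one-character string such as "\n", the loop performs two reads (the character and the terminator), which matches the fetch count mentioned earlier.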

Exit early on found in OpenCL

I'm trying to write an OpenCL implementation of memchr to help me learn how OpenCL works. What I'm planning to do is to assign each work item a chunk of memory to search. Then, inside each work item, it loops through the chunk searching for the character.
Especially if the buffer is large, I don't want the other threads to keep searching after an occurrence has already been found (assume there is only one occurrence of the character in any given buffer).
What I'm stuck on is how does a work item indicate, both to the host and other threads, when it has found the character?
Thanks,
One way you could do this is to use a global flag variable. You atomically set it to 1 when you find the value and other threads will check on that value when they are doing work.
For example:
__kernel void test(__global int* buffer, __global volatile int* flag)
{
int tid = get_global_id(0);
int sx = get_global_size(0);
int i = tid;
while(buffer[i] != 8) //Whatever value we're trying to find.
{
int stop = atomic_add(flag, 0); //Read the flag atomically (flag is already a pointer)
if(stop)
break;
i = i + sx; //Stride to this work item's next element
}
atomic_xchg(flag, 1); //Set the flag so the other work items stop
}
This might add more overhead than just running the whole kernel (unless you are doing a lot of work on every iteration). In addition, this method won't work if each thread checks only a single value in the array; each thread must have multiple iterations of work. Note also that the loop has no bounds check, so it assumes the value is present somewhere in the buffer.
Finally, I've seen instances where writing to an atomic variable doesn't immediately commit, so you need to check to see if this code will deadlock on your system because the write isn't committing.
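The flag pattern can be tried out on the host before moving it into a kernel. The following is a sequential C++ simulation, not real parallel code: work items take turns, one element per round, and check a shared std::atomic flag before each read. SearchResult and strided_search are hypothetical names, and the bounds check and the recorded match index are additions that the kernel above does not have.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

struct SearchResult {
    long index;         // position of the match, or -1 if none was found
    std::size_t reads;  // how many buffer elements were inspected in total
};

SearchResult strided_search(const std::vector<int>& buffer, int target,
                            std::size_t work_items) {
    assert(work_items > 0);
    std::atomic<int> flag{0};
    std::atomic<long> found_at{-1};
    std::size_t reads = 0;
    // Each round advances every simulated work item by one stride.
    for (std::size_t base = 0; base < buffer.size(); base += work_items) {
        for (std::size_t tid = 0; tid < work_items; ++tid) {
            if (flag.load()) return {found_at.load(), reads};  // someone already found it
            const std::size_t i = base + tid;
            if (i >= buffer.size()) break;  // bounds check the kernel lacks
            ++reads;
            if (buffer[i] == target) {
                found_at.store(static_cast<long>(i));
                flag.store(1);  // tell the remaining work items to stop
            }
        }
    }
    return {found_at.load(), reads};
}
```

Comparing reads with buffer.size() gives a rough feel for how much work the early exit saves; on a real device the saving also depends on how quickly the flag write becomes visible to other work items.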

How to reverse a QList?

I see qCopy and qCopyBackward, but neither seems to let me make a copy in reverse order. qCopyBackward only copies in reverse order, but keeps the darn elements in the same order! All I want to do is return a copy of the list in reverse order. There has to be a function for that, right?
If you don't like the QTL, just use the STL. They might not have a Qt-ish API, but the STL API is rock-stable :) That said, qCopyBackward is just std::copy_backward, so at least they're consistent.
Answering your question:
template <typename T>
QList<T> reversed( const QList<T> & in ) {
QList<T> result;
result.reserve( in.size() ); // reserve is new in Qt 4.7
std::reverse_copy( in.begin(), in.end(), std::back_inserter( result ) );
return result;
}
EDIT 2015-07-21: Obviously (or maybe not), if you want a one-liner (and people seem to prefer that, looking at the relative upvotes of different answers after five years) and you have a non-const list the above collapses to
std::reverse(list.begin(), list.end());
But I guess the index fiddling stuff is better for job security :)
Reverse your QList with a single line:
for(int k = 0; k < (list.size()/2); k++) list.swap(k,list.size()-(1+k));
[Rewrite from original]
It's not clear if OP wants to know "How [do I] reverse a QList?" or actually wants a reversed copy. User mmutz gave the correct answer for a reversed copy, but if you just want to reverse the QList in place, there's this:
#include <algorithm>
And then
std::reverse(list.begin(), list.end());
Or in C++11:
std::reverse(std::begin(list), std::end(list));
The beauty of the C++ standard library (and templates in general) is that the algorithms and containers are separate. At first it may seem annoying that the standard containers (and to a lesser extent the Qt containers) don't have convenience functions like list.reverse(), but consider the alternatives: Which is more elegant: Provide reverse() methods for all containers, or define a standard interface for all containers that allow bidirectional iteration and provide one reverse() implementation that works for all containers that support bidirectional iteration?
To illustrate why this is an elegant approach, consider the answers to some similar questions:
"How do you reverse a std::vector<int>?":
std::reverse(std::begin(vec), std::end(vec));
"How do you reverse a std::deque<int>?":
std::reverse(std::begin(deq), std::end(deq));
What about portions of the container?
"How do you reverse the first seven elements of a QList?": Even if the QList authors had given us a convenience .reverse() method, they probably wouldn't have given us this functionality, but here it is:
if (list.size() >= 7) {
std::reverse(std::begin(list), std::next(std::begin(list), 7));
}
But it gets better: Because the iterator interface is the same as C pointer syntax, and because C++11 added the free std::begin() and std::end() functions, you can do these:
"How do you reverse an array float x[10]?":
std::reverse(std::begin(x), std::end(x));
or pre C++11:
std::reverse(x, x + sizeof(x) / sizeof(x[0]));
(That is the ugliness that std::end() hides for us.)
Let's go on:
"How do you reverse a buffer float* x of size n?":
std::reverse(x, x + n);
"How do you reverse a null-terminated string char* s?":
std::reverse(s, s + strlen(s));
"How do you reverse a not-necessarily-null-terminated string char* s in a buffer of size n?":
std::reverse(s, std::find(s, s + n, '\0'));
Note that std::reverse uses swap() so even this will perform pretty much as well as it possibly could:
QList<QList<int> > bigListOfBigLists;
....
std::reverse(std::begin(bigListOfBigLists), std::end(bigListOfBigLists));
Also note that these should all perform as well as a hand-written loop since, when possible, the compiler will turn these into pointer arithmetic. Also, you can't cleanly write a reusable, generic, high-performance reverse function like this in C.
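The separation of algorithms and containers described in this answer can be exercised with standard containers alone. In the sketch below, reversed_copy and reverse_prefix are hypothetical helper names; they assume only that the container offers rbegin()/rend(), size(), and an iterator-pair constructor, as std::vector and std::deque do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Reversed copy: works for any container with rbegin()/rend()
// and a constructor taking an iterator pair.
template <typename Container>
Container reversed_copy(const Container& in) {
    return Container(in.rbegin(), in.rend());
}

// In-place reversal of the first `count` elements; does nothing if the
// container is shorter than `count`.
template <typename Container>
void reverse_prefix(Container& c, std::size_t count) {
    if (c.size() < count) return;
    std::reverse(std::begin(c),
                 std::next(std::begin(c), static_cast<std::ptrdiff_t>(count)));
}
```

Both helpers are a few lines each because all the real work is done by iterators and std::reverse; nothing in them is specific to one container type.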
@Marc Jentsch's answer is good. And if you want to get an additional 30% performance boost you can change his one-liner to:
for(int k=0, s=list.size(), max=(s/2); k<max; k++) list.swap(k,s-(1+k));
On a ThinkPad W520 with a QList of 10 million QTimers I got these numbers:
reversing list stack overflow took 194 ms
reversing list stack overflow with max and size took 136 ms
The boost is a result of
the expression (list.size()/2) being calculated only once when initializing the loop, not after every step
the expression list.size() in the swap() call being evaluated only once when initializing the loop, not after every step
You can use the Java style iterator. Complete example here (http://doc.qt.digia.com/3.2/collection.html). Look for the word "reverse".
QList<int> list; // initial list
list << 1;
list << 2;
list << 3;
QList<int> rlist; // reversed list
QListIterator<int> it(list);
it.toBack(); // a new iterator starts at the front, so move it to the end first
while (it.hasPrevious()) {
rlist << it.previous();
}
Reversing a QList is going to be O(n) however you do it, since QList isn't guaranteed to have its data stored contiguously in memory (unlike QVector). You might consider just traversing the list in backwards order where you need to, or use something like a QStack which lets you retrieve the elements in the opposite order they were added.
For standard library lists it would look like this
std::list<X> result;
std::copy(list.rbegin(), list.rend(), std::back_inserter(result));
Unfortunately, Qt doesn't have rbegin and rend functions that return reverse iterators (the ones that go from the end of the container to its beginning). You may write them, or you can just write a copying function on your own; reversing a list is a nice exercise. Or you can note that QList is actually an array, which makes writing such a function trivial. Or you can convert the list to std::list, and use rbegin and rend. Choose whatever you like.
As of Qt 5.14 (circa 2020), QList provides a constructor that takes iterators, so you can just construct a reversed copy of the list with the reverse iterators of the source:
QList<int> backwards(forwards.rbegin(), forwards.rend());
Or if you want to be able to inline it, more generically (replace QList<I> with just I if you want to be super duper generic):
template <typename I> QList<I> reversed (const QList<I> &forwards) {
return QList<I>(forwards.rbegin(), forwards.rend());
}
Which lets you do fun one-liners with temporaries like:
QString badDay = "reporter covers whale exploding";
QString worseDay = reversed(badDay.split(' ')).join(' ');

Resources