OpenCL - Storing a large array in private memory

I have a large array of floats called source_array with a size of around 50,000. I am currently trying to implement a collection of modifications on the array and evaluate the result. Basically, in pseudocode:
__kernel void doSomething (__global float *source_array, __global bool *res, __global int *mod_value) {
// Modify values of source_array with mod_value;
// Evaluate the modified array.
}
So in the process I would need a variable to hold the modified array, because source_array should stay constant for all work items; if I modify it directly it might interfere with other work items (not sure if I am right here).
The problem is that the array is too big for private memory, so I can't create it in kernel code. What should I do in this case?
I considered adding another parameter to the kernel to serve as a placeholder for the modified array, but again it would interfere with other work items.

Private "memory" on GPUs literally consists of registers, which generally are in short supply. So the __private address space in OpenCL is not suitable for this as I'm sure you've found.
Victor's answer is correct - if you really need temporary memory for each work item, you will need to create a (global) buffer object. If all work items need to independently mutate it, it will need a size of <WORK-ITEMS> * <BYTES-PER-ITEM> and each work-item will need to use its own slice of the buffer. If it's only temporary, you never need to copy it back to host memory.
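For illustration, here is a minimal sketch of that layout. The scratch buffer, the ARRAY_SIZE constant and the argument names are all made up for the example; scratch would be a global buffer of get_global_size(0) * ARRAY_SIZE floats created on the host:

#define ARRAY_SIZE 50000   // assumption; could also be passed at build time with -DARRAY_SIZE=...

__kernel void doSomething(__global const float *source_array,
                          __global float *scratch,        // global_size * ARRAY_SIZE floats
                          __global const int *mod_value,
                          __global int *res)              // int result per item (bool isn't a portable kernel-arg type)
{
    size_t gid = get_global_id(0);
    __global float *my_copy = scratch + gid * (size_t)ARRAY_SIZE; // this work-item's private slice

    for (int i = 0; i < ARRAY_SIZE; ++i)
        my_copy[i] = source_array[i] + *mod_value;   // "modify" without touching the shared input

    // ... evaluate my_copy[] and store the outcome in res[gid] ...
}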
However, this sounds like an access pattern that will work very inefficiently on GPUs. You will do much better if you decompose your problem differently. For example, you may be able to make whole work-groups coordinate work on some subrange of the array: copy the subrange into local (group-shared) memory, divide the work between the work items in the group, write the results back to global memory, then read the next subrange into local memory, and so on. Coordinating between the work-items in a group is much more efficient than having each work item access a huge range of global memory. We can only help you further with this algorithmic approach if you are more specific about the computation you are trying to perform.
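A rough sketch of that pattern (names are made up, and TILE is an assumed chunk size that must fit in local memory):

#define TILE 256   // assumed chunk size

__kernel void process_in_tiles(__global const float *source_array,
                               __global float *result,
                               const int n)
{
    __local float tile[TILE];
    const int lid  = get_local_id(0);
    const int lsz  = get_local_size(0);
    const int base = get_group_id(0) * TILE;   // this group's subrange

    // Cooperative copy: each work-item loads a strided part of the subrange.
    for (int i = lid; i < TILE && base + i < n; i += lsz)
        tile[i] = source_array[base + i];
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... the group works on tile[] here, dividing the work between its items ...

    barrier(CLK_LOCAL_MEM_FENCE);
    // Cooperative write-back of the (modified) subrange.
    for (int i = lid; i < TILE && base + i < n; i += lsz)
        result[base + i] = tile[i];
}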

Why not initialize this array in an OpenCL host memory buffer? I.e.:
const size_t buffer_size = 50000 * sizeof(float);
/* malloc, new float[50000], or a static initializer like {0.1f, 0.2f, ...} */
float *host_array_ptr = (float*)malloc(buffer_size);
/*
put your data into host_array_ptr here
*/
cl_int err_code;
cl_mem my_array = clCreateBuffer( my_cl_context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, buffer_size, host_array_ptr, &err_code );
Then you can use this cl_mem my_array in the OpenCL kernel.
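For completeness, binding the buffer to the kernel and launching it would then look something like this (my_kernel and my_queue are placeholders and error handling is omitted):

clSetKernelArg(my_kernel, 0, sizeof(cl_mem), &my_array);

size_t global_work_size = 50000;
clEnqueueNDRangeKernel(my_queue, my_kernel, 1, NULL,
                       &global_work_size, NULL, 0, NULL, NULL);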

Related

(CAPL) How do I assign an array length using a parameter?

void func(int a){
    byte arr[a];
}
This code is not working. How do I assign the array length using a parameter?
In CAPL you have several options, but first you should step back and ask yourself whether you really need a variable array size at runtime. Measurement performance is what you should be concerned about; declaring a suitably sized array at design time may be the safer approach.
A global array of parametric size could be something like this:
variables
{
    int arraySize = 256;
    byte arr[arraySize];
}
From the docs,
Declaration of arrays (arrays, vectors, matrices) is permitted in CAPL. They are used and initialized in a manner analogous to C language.
In C, array size is constant:
Array is a type consisting of a contiguously allocated nonempty sequence of objects with a particular element type. The number of those objects (the array size) never changes during the array lifetime. [source]
This is why your code is not working: you cannot create an array whose size is determined at runtime. Similarly, from the same source:
Variable-length arrays
If expression is not an integer constant expression, the declarator is for an array of variable size. Each time the flow of control passes over the declaration, expression is evaluated (and it must always evaluate to a value greater than zero), and the array is allocated (correspondingly, lifetime of a VLA ends when the declaration goes out of scope). The size of each VLA instance does not change during its lifetime, but on another pass over the same code, it may be allocated with a different size.
This is why you should be able to define a parametric array like the one I showed you: even if arraySize changes later in the code, arr will have 256 elements for the entire execution of your CAPL script.
void func(int a){
    byte arr[a];
}
Will throw an error, because int a is not a compile-time constant, thus violating the requirement above. What you can do is memcpy parts of a larger array to a location of your choice, for example a smaller array, or employ a number of "buffer" arrays, as you often see in CAPL scripts.
The gist of it is: use a larger, fixed-size array, and be precise about where you put your information inside it. You must be precise, because every element in the array contains some kind of data; at initialization most of it is nonsense, and there is no safeguard against this digital noise.
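For illustration, the idea in plain C (CAPL is C-like; the names and sizes below are invented for the example):

#include <string.h>

#define MAX_RECORDS 32          /* pick a worst-case size up front          */
#define RECORD_SIZE 8           /* fixed number of bytes per logical record */

unsigned char pool[MAX_RECORDS * RECORD_SIZE];   /* the one large, fixed-size array */

/* Copy one record's bytes into slot i of the pool - you choose the offsets. */
void put_record(int i, const unsigned char *record)
{
    memcpy(&pool[i * RECORD_SIZE], record, RECORD_SIZE);
}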

Arduino Zero - region RAM overflowed with stack

I have some code that uses nested structs to store device parameters, see below.
This is running on an Arduino Zero (Atmel SAMD21).
The code declares storage for up to 3 networks, each network with 64 devices.
I would like to use 5 networks, but when I increase the network count to 4 the code no longer compiles:
I get "region RAM overflowed with stack / RAM overflowed by 4432 bytes".
I understand that this means it is taking more RAM than I have. Is there a different way to achieve the same thing that will fit?
struct device {
    int stat;
    bool changed;
    char data[51];
    char state[51];
    char atime[14];
    char btime[14];
};
struct outputs {
    device fitting[64];
};
struct storage {
    int deviceid = 0;
    int addstore = 0;
    bool set;
    bool run_events = false;
    char authkey[10];
    outputs network[3];
};
storage data_store;
Well, the usual approaches are:
Consider if all or any of the data is actually read-only, and thus can be made const (which should move it to read-only memory, if that fails you can usually force it by adding compiler-specific magic).
Figure out means of representing the data using fewer bits. For instance, using 14 bytes for each of the two timestamps might seem excessive; switching these to 32-bit timestamps and generating the strings only when needed would save around 70% of that space (see the sketch at the end of this answer).
If there are duplicates, then perhaps each storage doesn't need three unique outputs, but can instead store pointers into a shared "pool" of unique configurations.
If not all 64 fittings are used, that array could also be refactored into having non-constant length.
It's hard to be more specific since I don't know your data or application well enough.
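As a concrete sketch of the "fewer bits" point above (a possibility only, assuming the two timestamps can be kept as 32-bit epoch values and formatted into strings on demand):

#include <stdint.h>

struct device {
    int      stat;
    bool     changed;
    char     data[51];
    char     state[51];
    uint32_t atime;   // was char atime[14]
    uint32_t btime;   // was char btime[14]
};
// Each device shrinks by roughly 20 bytes, i.e. over 1 KB per network of
// 64 fittings (exact figures depend on alignment and padding).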
Your struct is simply taking too much space. Assuming chars and bools are 1 byte each and ints are 4 bytes (as on the SAMD21), your device struct takes about 136 bytes with padding, so your outputs struct takes roughly 8.5 KB. Three networks already use about 25.5 KB, and four would need around 34 KB, but your unit only has 32 KB of RAM...

QList crashes when size is large

I am using a QList to store the data read from a SQL Table. The table has more than a million records. I need to get them in a list and then do some processing on the list.
QList<QVariantMap> list;
QString selectNewDB = QString("SELECT * FROM newDatabase.M106SRData");
QSqlQuery selectNewDBQuery = QSqlDatabase::database("CurrentDBConn").exec(selectNewDB);
while (selectNewDBQuery.next())
{
    QSqlRecord selectRec = selectNewDBQuery.record();
    QVariantMap varMap;
    QString key;
    QVariant value;
    for (int i=0; i < selectRec.count(); ++i)
    {
        key = selectRec.fieldName(i);
        value = selectRec.value(i);
        varMap.insert(key, value);
    }
    list << varMap;
}
I get "qvector.h, line 534: Out of memory" error.
The program crashes when the list reaches the size of <1197762 items>. I tried using reserve() but it didn't work. Is QList limited to a specific size?
You've run out of memory because the C++ runtime has reported that it cannot allocate any more. It's not a problem with the Qt containers. They are limited to 2^31 - 1 items due to the size of the int they use for the index, but you're nowhere near that.
At the very least:
Use a QVector instead of QList as it has much lower overhead for the QVariantMap element.
Attempt to reserve the space if the query allows it: this will almost halve the memory requirements!
Compile for a 64 bit target if you can.
QVector<QVariantMap> list;
QString selectNewDB = QString("SELECT * FROM newDatabase.M106SRData");
QSqlQuery selectNewDBQuery = QSqlDatabase::database("CurrentDBConn").exec(selectNewDB);
auto const size = selectNewDBQuery.size();
if (size > 0) list.reserve(size);
while (selectNewDBQuery.next())
{
    auto selectRec = selectNewDBQuery.record();
    QVariantMap varMap;
    for (int i=0; i < selectRec.count(); ++i)
    {
        auto const key = selectRec.fieldName(i);
        auto const value = selectRec.value(i);
        varMap.insert(key, value);
    }
    list.append(varMap);
}
You either don't have enough RAM, or more likely you are using a 32-bit Qt build, which cannot utilize more than 4 GB of RAM. Or maybe both. Size-wise, the container itself should be able to handle more than 2 billion elements.
QList ain't helping either, as in your case it will likely store every element as a pointer and do an additional heap allocation for the actual variant map. So you end up with a sizeable additional heap allocation overhead.
And since the query already contains a significant amount of data, it probably eats a decent amount of ram itself.
Unless you have disabled the pagefile, running out of RAM on its own should not result in a crash; it would just start paging and ruin performance, but keep running. So you are likely hitting the memory limit for a 32-bit process, which may be as low as a mere 2 GB.
Aside from doing the things Kuba suggested in his answer, you might want to split your query into smaller pieces and get the results in a few queries rather than one, if possible, processing them one at a time. This reduces the memory used by the query results and lets you free the memory of each query once you are done with it.
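A rough sketch of that chunked approach, assuming the server accepts LIMIT/OFFSET (the exact syntax depends on your SQL dialect):

const int chunk = 100000;
for (int offset = 0; ; offset += chunk) {
    QSqlQuery q = QSqlDatabase::database("CurrentDBConn").exec(
        QString("SELECT * FROM newDatabase.M106SRData LIMIT %1 OFFSET %2")
            .arg(chunk).arg(offset));
    int rows = 0;
    while (q.next()) {
        ++rows;
        // ... process this record here, without keeping it in a big list ...
    }
    if (rows < chunk)
        break;   // last (possibly partial) chunk
}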
There is also the option of saving RAM on QString, in case you have a lot of repeating strings. Since QString is implicitly shared, a bunch of identical strings can all use the same underlying data. You can take advantage of this by using a QSet to keep a collection of unique strings, with a quick check whether a string is already present. Then, instead of using the string from the query result, use the one from the set; all identical strings copied by value from the set will reuse the same string data. In contrast, your current approach uses n amounts of space for every n duplicated strings.
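A minimal sketch of that trick (the intern helper is hypothetical, not a Qt API):

#include <QSet>
#include <QString>

// Return a copy of s that shares its data with the one stored in the pool,
// so n identical strings end up using a single allocation.
static QString intern(QSet<QString> &pool, const QString &s)
{
    const auto it = pool.constFind(s);
    if (it != pool.constEnd())
        return *it;            // share the data of the copy already in the pool
    pool.insert(s);            // first occurrence: remember it
    return s;
}

In the query loop above, the field names returned by selectRec.fieldName(i) are an obvious candidate, since every row repeats the same column names.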

How to free the resources of a QString when using it inside std::vector

I have a structure "rs" for every record of my dataset.
All records are in a vector "r".
My record count is in “rc”.
....
struct rs{
    uint ip_i;        //index
    QString ip_addr;  //ip address
};
std::vector <rs> r;//rows ordered by key
int rc;//row count
....
I would like to control this memory usage.
That's why I don't want to use r.insert and r.erase.
When I need to insert a record, I will:
Increase size of r by r.resize(..);r.shrink_to_fit() (if needed).
Shift elements of r to the right (if needed) by std::rotate.
Put new values: r[i].ip_i=...;r[i].ip_addr=...
When I need to delete a record, I will:
Shift elements of r to the left (if needed) by std::rotate.
For example, std::rotate(r.begin()+i,r.begin()+i+1,r.begin()+rc);.
Free resources of r[rc].ip_addr.
How do I free the resources of QString r[rc].ip_addr?
I've tried r[i].ip_addr.~QString() and got a runtime error.
Call r.resize() (if needed).
I don't want to lose memory because of QString copies left behind after deleting rows.
How can I control them?
Thanks.
QString handles all memory control for you. Just treat it as a regular object and you'll be fine. std::vector is OO-aware, so it will call destructors when freeing elements.
The only thing you should not do is use low-level memory manipulation routines like memcpy or memset. std::vector operations are safe.
If you really want to free a string for a record that is within [0..size-1] range (that is, you do not actually decrease size with resize() after moving elements), then calling r[i].ip_addr.clear() would suffice. Or better yet, introduce the clear() method in your structure that will call ip_addr.clear() (in case you add more fields that need to be cleared). But you can only call it on a valid record, of course, not one beyond your actual vector size (no matter what the underlying capacity is, it's just an implementation detail).
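For instance, the deletion step from the question could look roughly like this (a sketch only, using the rs struct and the rc bookkeeping from the question):

#include <algorithm>   // std::rotate
#include <vector>

// Remove record i without shrinking the vector; clear() lets QString free
// its own heap data, no explicit ~QString() call needed.
void remove_record(std::vector<rs> &r, int &rc, int i)
{
    std::rotate(r.begin() + i, r.begin() + i + 1, r.begin() + rc);
    --rc;
    r[rc].ip_addr.clear();
}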
On a side note, it probably makes sense to use QList instead, since you're using Qt anyway, unless you have specific reasons to use std::vector. As far as memory control goes, QList offers a reserve method which allows you to reserve exactly as many elements as you need. Inserting then would look like:
list.reserve(list.size() + 1);
list.insert(i, r);

Specifying multiple address spaces for a type in a function's argument list

I wrote a function in OpenCL:
void sort(int* array, int size)
{
}
and I need to call the function once over a __private array and once over a __global array. Apparently, it's not allowed in OpenCL to specify multiple address spaces for a type. Therefore, I have to duplicate the function declaration, even though both versions have exactly the same body:
void sort_g(__global int* array, int size)
{
}
void sort_p(__private int* array, int size)
{
}
This makes the code hard to maintain, and I am wondering whether there is a better way to manage multiple address spaces in OpenCL.
P.S.: I don't see why OpenCL doesn't allow multiple address spaces for a type. The compiler could generate multiple instances of the function (one per address space) and use the appropriate one wherever it's called in the kernel.
For OpenCL < 2.0, this is how the language is designed and there is no getting around it, regrettably.
For OpenCL >= 2.0, with the introduction of the generic address space, your first piece of code works as you would expect.
In short, upgrading to 2.0 would solve your problem (and bring in other niceties), otherwise you're out of luck (you could perhaps wrap your function in a macro, but ew, macros).
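For what it's worth, the pre-2.0 macro workaround might look roughly like this (a sketch: still two functions after preprocessing, but only one body in the source):

// Expand one copy of the function per address space.
#define DEFINE_SORT(SUFFIX, AS)                      \
    void sort_##SUFFIX(AS int *array, int size)      \
    {                                                \
        /* ...the one shared body goes here... */    \
    }

DEFINE_SORT(g, __global)
DEFINE_SORT(p, __private)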
