OpenCL 1.2: mem_fence() or barrier() or both

OpenCL 1.2: mem_fence() or barrier() or both - opencl

I have just started openCL C programming. All work items of a work group update unique locations of local memory. Later, a private variable of a work item is updated based on local data updated by two other work items. Something like this:
__kernel MyKernel(__global int *in_ptr)
{
/* Define a variable in private address space */
int priv_data;
/* Define two indices in private address space */
int index1, index2;
/* index1 and index2 are legitimate local work group indices */
index1 = SOME_CORRECT_VALUE;
index2 = ANOTHER_CORRECT_VALUE;
/* Define storage in local memory large enough to cater to all work items of this work group */
__local int tempPtr[WORK_GROUP_SIZE];
tempPtr[get_local_id(0)] = SOME_RANDOM_VALUE;
/* Do not proceed until the update of tempPtr by this WI has completed */
mem_fence(CLK_LOCAL_MEM_FENCE);
/* Do not proceed until all WI of this WG have updated tempPtr */
barrier(CLK_LOCAL_MEM_FENCE);
/* Update private data */
priv_data = tempPtr[index1] + tempPtr[index2];
}
Although the snippet above is conservative, wouldn't barrier have done the job as it internally does fencing?

Yes, barrier already does fencing.
A barrier will sync the execution in that point. So, all previous instructions have to be executed, therefore memory is consistent at that point.
A fence will only ensure all reads/writes are finished before any further read/write is performed, but the workers may be executing different instructions.
In some cases you can go with a single fencing. If you do not care about local workers going out of sync, and you just want the previous memory writes/read be completed. In your case a fence would be enough. (unless that code is running in a loop and there is extra code you have not put in the example).

Related

OpenCL - Storing a large array in private memory

I have a large array of float called source_array with the size of around 50.000. I am current trying to implement a collections of modifications on the array and evaluate it. Basically in pseudo code:
__kernel void doSomething (__global float *source_array, __global boolean *res. __global int *mod_value) {
// Modify values of source_array with mod_value;
// Evaluate the modified array.
}
So in the process I would need to have a variable to hold modified array, because source_array should be a constant for all work item, if i modify it directly it might interfere with another work item (not sure if I am right here).
The problem is the array is too big for private memory therefore I can't initialize in kernel code. What should I do in this case ?
I considered putting another parameter into the method, serves as place holder for modified array, but again it would intefere with another work items.

Private "memory" on GPUs literally consists of registers, which generally are in short supply. So the __private address space in OpenCL is not suitable for this as I'm sure you've found.
Victor's answer is correct - if you really need temporary memory for each work item, you will need to create a (global) buffer object. If all work items need to independently mutate it, it will need a size of <WORK-ITEMS> * <BYTES-PER-ITEM> and each work-item will need to use its own slice of the buffer. If it's only temporary, you never need to copy it back to host memory.
However, this sounds like an access pattern that will work very inefficiently on GPUs. You will do much better if you decompose your problem differently. For example, you may be able to make whole work-groups coordinate work on some subrange of the array - copy the subrange into local (group-shared) memory, the work is divided between the work items in the group, and the results are written back to global memory, and the next subrange is read to local, etc. Coordinating between work-items in a group is much more efficient than each work item accessing a huge range of global memory We can only help you with this algorithmic approach if you are more specific about the computation you are trying to perform.

Why not to initialize this array in OpenCL host memory buffer. I.e.
const size_t buffer_size = 50000 * sizeof(float);
/* cl_malloc, malloc or new float [50000] or = {0.1f,0.2f,...} */
float *host_array_ptr = (float*)cl_malloc(buffer_size);
/*
put your data into host_array_ptr hear
*/
cl_int err_code;
cl_mem my_array = clCreateBuffer( my_cl_context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, buffer_size, host_array_ptr, &err_code );
Then you can use this cl_mem my_array in OpenCL kernel
Find out more

QList crashes when size is large

I am using a QList to store the data read from a SQL Table. The table has more than a million records. I need to get them in a list and then do some processing on the list.
QList<QVariantMap> list;
QString selectNewDB = QString("SELECT * FROM newDatabase.M106SRData");
QSqlQuery selectNewDBQuery = QSqlDatabase::database("CurrentDBConn").exec(selectNewDB);
while (selectNewDBQuery.next())
{
QSqlRecord selectRec = selectNewDBQuery.record();
QVariantMap varMap;
QString key;
QVariant value;
for (int i=0; i < selectRec.count(); ++i)
{
key = selectRec.fieldName(i);
value = selectRec.value(i);
varMap.insert(key, value);
}
list << varMap;
}
I get "qvector.h, line 534: Out of memory" error.
The program crashes when the list reaches the size of <1197762 items>. I tried using reserve() but it didn't work. Is QList limited to a specific size?

You've ran out of memory because the C++ runtime has reported that it cannot allocate any more memory. It's not a problem with Qt containers. The containers are limited to 2^31-1 items due to the size of int the use for the index. You're nowhere near that.
At the very least:
Use a QVector instead of QList as it has much lower overhead for the QVariantMap element.
Attempt to reserve the space if the query allows it: this will almost halve the memory requirements!
Compile for a 64 bit target if you can.
QVector<QVariantMap> list;
QString selectNewDB = QString("SELECT * FROM newDatabase.M106SRData");
QSqlQuery selectNewDBQuery = QSqlDatabase::database("CurrentDBConn").exec(selectNewDB);
auto const size = selectNewDBQuery.size();
if (size > 0) list.reserve(size);
while (selectNewDBQuery.next())
{
auto selectRec = selectNewDBQuery.record();
QVariantMap varMap;
for (int i=0; i < selectRec.count(); ++i)
{
auto const key = selectRec.fieldName(i);
auto const value = selectRec.value(i);
varMap.insert(key, value);
}
list.append(varMap);
}

You either don't have enough ram, or more likely are using a 32bit Qt build, which cannot utilize more than 4 GB of ram. Or maybe both. Size wise the container itself should be able to handle more than 2 billion elements.
QList ain't helping either, as in your case it will likely store every element as a pointer and do an additional heap allocation for the actual variant map. So you end up with a sizeable additional heap allocation overhead.
And since the query already contains a significant amount of data, it probably eats a decent amount of ram itself.
Unless you have disabled pagefile, running out of ram on its own should not result in a crash, as it would just start paging and ruin performance, but keep running, so you are likely hitting the memory limit for a 32 bit process, which may be as low as a mere 2 GB.
Aside from doing the things Kuba suggested in his answer, you might want to split your query into smaller pieces, and get the results in a few queries rather than one if possible, and process them one at a time, reducing the memory used by the query results and freeing the memory for a query once you are done with it.
There is also the option of saving on RAM from QString, in case you have a lot of repeating strings. As it is implicitly shared, you can have a bunch of identical strings that all use the same underlying data. You can take advantage of this, by using a QSet to keep a collection of unique strings and a quick check if a string is already present. Then instead of using the string from the query result, use the one from the set. All identical strings copied by value from the set will reuse the same string data. In contrast, your current approach will use n amount of space for every n duplicated strings.

Mysterious EXC_BAD_ACCESS on dispatch_async serial queue

I have a location-based app that gets location every 1 second, and saves a batch of the locations at a time to the CoreData DB so as not to make the locations array too large. However, for some reason it crashes with EXC_BAD_ACCESS even though I use dispatch_async on a SERIAL QUEUE:
Created a global serial custom queue like this:
var MyCustomQueue : dispatch_queue_t =
dispatch_queue_create("uniqueID.for.custom.queue", DISPATCH_QUEUE_SERIAL);
in the didUpdateToLocation protocol function, I have this bit of code that saves the latest batch of CLLocation items onto the disk, if the batch size is greater than a preset number:
if(myLocations.count == MAX_LOCATIONS_BUFFER)
{
dispatch_async(MyCustomQueue)
{
for myLocation in self.myLocations
{
// the next line is where it crashes with EXC_BAD_ACCESS
newLocation = NSEntityDescription.insertNewObjectForEntityForName(locationEntityName, inManagedObjectContext: context as NSManagedObjectContext) ;
newLocation.setValue(..)
// all the code to set the attributes
}
appDelegate.saveContext();
dispatch_async(dispatch_get_main_queue()) {
// once the latest batch is saved, empty the in-memory array
self.myLocations = [];
}
}
}
The mystery is, isn't a SERIAL QUEUE supposed to execute all the items on it in order, even though you use dispatch_async (which simply makes the execution concurrent with the MAIN thread, that doesn't touch the SQLite data)?
Notes:
The crash happens at random times...sometimes it'll crash after 0.5 miles, other times after 2 miles, etc.
If I just take out any of the dispatch_async code (just make everything executed in order on the main thread), there is no issue.

You can take a look at this link. Your issue is probably because there is no MOC at the time you are trying to insert item into it.
Core Data : inserting Objects crashed in global queue [ARC - iPhone simulator 6.1]

Is sending a signal sending one bit of information?

Descriptions of the traditional signal sending facility implemented in UNIX systems sometimes identify the action of ”sending a signal” with ”sending one bit of information.” Is this identification accurate?
Example of such a description
An example of such a description is provided by Advanced Programming in the UNIX Environment, (W. Richard Stevens,Stephen A. Rago), where the description of the sigqueue function (10.20) claims that ”Generally a signal carries one bit of information: the signal itself.”
Why a signal actually carries 0 bit of information
The signal has no state, so its state is described by 0 bits of information. What carries information is merely that a transmission actually occurred.
Sending 2 bits of information is the same as sending 2 times in a row 1 bit of information, so if a signal would convey 1 bit of information, we could send 2 bits of informations to a process by sending it two times (the same) signal, which looks impossible.

I don't think one bit of information from the quote means literally one bit of information in information theory.
Here's my understanding. For a certain signal, there are only two possible states in a certain moment: a signal happens, or it doesn't happen. This binary states can be represented by one bit (0 or 1). By sending a signal to a process, this information is sent to it.
Again, I think you are overthinking this statement, after all, it's a book about programming, not information theory.

According to man 2 kill, the kill function takes pid_t pid and int sig as parameters. So with kill, you can send information via the signal type (int) and that's it.
If you take a look at man 2 sigaction, it describes the information structure that you can handle in your signal handlers:
siginfo_t {
int si_signo; /* Signal number */
int si_errno; /* An errno value */
int si_code; /* Signal code */
int si_trapno; /* Trap number that caused
hardware-generated signal
(unused on most architectures) */
pid_t si_pid; /* Sending process ID */
uid_t si_uid; /* Real user ID of sending process */
int si_status; /* Exit value or signal */
clock_t si_utime; /* User time consumed */
clock_t si_stime; /* System time consumed */
sigval_t si_value; /* Signal value */
int si_int; /* POSIX.1b signal */
void *si_ptr; /* POSIX.1b signal */
int si_overrun; /* Timer overrun count; POSIX.1b timers */
int si_timerid; /* Timer ID; POSIX.1b timers */
void *si_addr; /* Memory location which caused fault */
long si_band; /* Band event (was int in
glibc 2.3.2 and earlier) */
int si_fd; /* File descriptor */
short si_addr_lsb; /* Least significant bit of address
(since Linux 2.6.32) */
}
This would imply that you can send some additional info by choosing who (uid) you send the signal as and where from (pid -- you can tell if it is your child who's sending you the signal).
As for composing messages out of signal chains, that's theoretically possible, but you'd need to make sure your signal chain won't get randomly interrupted by another process (you can't prevent certain signal such as SIGKILL, but then, you're done when you receive a SIGKILL so no more translating message chains into ASCII or whatever :)
There's also sigqueue, which you can use to attach an integer to a signal you send (http://www.kernel.org/doc/man-pages/online/pages/man2/sigqueue.2.html), as I've learned in the following very good article on Linux signals:
http://www.kernel.org/doc/man-pages/online/pages/man2/sigqueue.2.html

How do game trainers change an address in memory that's dynamic?

Lets assume I am a game and I have a global int* that contains my health. A game trainer's job is to modify this value to whatever in order to achieve god mode. I've looked up tutorials on game trainers to understand how they work, and the general idea is to use a memory scanner to try and find the address of a certain value. Then modify this address by injecting a dll or whatever.
But I made a simple program with a global int* and its address changes every time I run the app, so I don't get how game trainers can hard code these addresses? Or is my example wrong?
What am I missing?

The way this is usually done is by tracing the pointer chain from a static variable up to the heap address containing the variable in question. For example:
struct CharacterStats
{
int health;
// ...
}
class Character
{
public:
CharacterStats* stats;
// ...
void hit(int damage)
{
stats->health -= damage;
if (stats->health <= 0)
die();
}
}
class Game
{
public:
Character* main_character;
vector<Character*> enemies;
// ...
}
Game* game;
void main()
{
game = new Game();
game->main_character = new Character();
game->main_character->stats = new CharacterStats;
// ...
}
In this case, if you follow mikek3332002's advice and set a breakpoint inside the Character::hit() function and nop out the subtraction, it would cause all characters, including enemies, to be invulnerable. The solution is to find the address of the "game" variable (which should reside in the data segment or a function's stack), and follow all the pointers until you find the address of the health variable.
Some tools, e.g. Cheat Engine, have functionality to automate this, and attempt to find the pointer chain by themselves. You will probably have to resort to reverse-engineering for more complicated cases, though.

Discovery of the access pointers is quite cumbersome and static memory values are difficult to adapt to different compilers or game versions.
With API hooking of malloc(), free(), etc. there is a different method than following pointers. Discovery starts with recording all dynamic memory allocations and doing memory search in parallel. The found heap memory address is then reverse matched against the recorded memory allocations. You get to know the size of the object and the offset of your value within the object. You repeat this with backtracing and get the jump-back code address of a malloc() call or a C++ constructor. With that information you can track and modify all objects which get allocated from there. You dump the objects and compare them and find a lot more interesting values. E.g. the universal elite game trainer "ugtrain" does it like this on Linux. It uses LD_PRELOAD.
Adaption works by "objdump -D"-based disassembly and just searching for the library function call with the known memory size in it.
See: http://en.wikipedia.org/wiki/Trainer_%28games%29
Ugtrain source: https://github.com/sriemer/ugtrain
The malloc() hook looks like this:
static __thread bool no_hook = false;
void *malloc (size_t size)
{
void *mem_addr;
static void *(*orig_malloc)(size_t size) = NULL;
/* handle malloc() recursion correctly */
if (no_hook)
return orig_malloc(size);
/* get the libc malloc function */
no_hook = true;
if (!orig_malloc)
*(void **) (&orig_malloc) = dlsym(RTLD_NEXT, "malloc");
mem_addr = orig_malloc(size);
/* real magic -> backtrace and send out spied information */
postprocess_malloc(size, mem_addr);
no_hook = false;
return mem_addr;
}
But if the found memory address is located within the executable or a library in memory, then ASLR is likely the cause for the dynamic. On Linux, libraries are PIC (position-independent code) and with latest distributions all executables are PIE (position-independent executables).

EDIT: never mind it seems it was just good luck, however the last 3 numbers of the pointer seem to stay the same. Perhaps this is ASLR kicking in and changing the base image address or something?
aaahhhh my bad, i was using %d for printf to print the address and not %p. After using %p the address stayed the same
#include <stdio.h>
int *something = NULL;
int main()
{
something = new int;
*something = 5;
fprintf(stdout, "Address of something: %p\nValue of something: %d\nPointer Address of something: %p", &something, *something, something);
getchar();
return 0;
}

Example for a dynamicaly allocated varible
The value I want to find is the number of lives to stop my lives from being reduced to 0 and getting game over.
Play the Game and search for the location of the lifes variable this instance.
Once found use a disassembler/debugger to watch that location for changes.
Lose a life.
The debugger should have reported the address that the decrement occurred.
Replace that instruction with no-ops
Got this pattern from the program called tsearch
A few related websites found from researching this topic:
http://deviatedhacking.com/index.php?/topic/75-dynamic-memory-allocation/
http://www.edgeofnowhere.cc/viewforum.php?f=183
http://www.oldschoolhack.de/tutorials/Theories%20and%20methods%20of%20code-caves.htm
http://webcache.googleusercontent.com/search?q=cache:4wzMzFIZx54J:gamehacking.com/forums/tutorials-beginners/11597-c-making-game-trainer.html+reading+a+dynamic+memory+address+game+trainer&cd=2&hl=en&ct=clnk&gl=au&client=firefox-a (A google cache version)
http://www.codeproject.com/KB/cpp/codecave.aspx

The way things like Gameshark codes were figured out were by dumping the memory image of the application, then doing one thing, then looking to see what changed. There might be a few things changing, but there should be patterns to look for. E.g. dump memory, shoot, dump memory, shoot again, dump memory, reload. Then look for changes and get an idea for where/how ammo is stored. For health it'll be similar, but a lot more things will be changing (since you'll be moving at the very least). It'll be easiest though to do it when minimizing the "external effects," e.g. don't try to diff memory dumps during a firefight because a lot is happening, do your diffs while standing in lava, or falling off a building, or something of that nature.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex