Optimizing mask function with ARM SIMD instructions - mask

I was wondering if you could help me use NEON intrinsics to optimize this mask function. I already tried auto-vectorization with the gcc -O3 flag, but the function ran slower than with -O2, which leaves auto-vectorization turned off. For some reason the assembly code produced with -O3 is 1.5 times longer than the one produced with -O2.
void mask(unsigned int xsize, unsigned int ysize, uint32_t *s, uint32_t *m)
{
    unsigned int ixy;
    ixy = xsize * ysize;
    while (ixy--)
        *(s++) &= *(m++);
}
Probably I have to use the following commands:
vld1q_u32 // to load 4 integers from s and m
vandq_u32 // to execute logical and between the 4 integers from s and m
vst1q_u32 // to store them back into s
However, I don't know how to do it in the most optimal way. For instance, should I increase s and m by 4 after loading, ANDing, and storing? I am quite new to NEON, so I would really appreciate some help.
I am using gcc 4.8.1 and I am compiling with the following cmd:
arm-linux-gnueabihf-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -O3 -fprefetch-loop-arrays name.c -o name
I would probably do it like this. I've included 4x loop unrolling. Preloading the cache is always a good idea and can speed things up another 25%. Since there's not much processing going on (it's mostly spending time loading and storing), it's best to load lots of registers, then process them as it gives time for the data to actually load. It assumes the data is an even multiple of 16 elements.
#include <arm_neon.h>

void fmask(unsigned int xsize, unsigned int ysize, uint32_t *s, uint32_t *m)
{
    unsigned int ixy;
    uint32x4_t srcA, srcB, srcC, srcD;
    uint32x4_t maskA, maskB, maskC, maskD;

    ixy = xsize * ysize;
    ixy /= 16; // process 16 at a time
    while (ixy--)
    {
        __builtin_prefetch(&s[64]); // preload the cache
        srcA = vld1q_u32(&s[0]);
        maskA = vld1q_u32(&m[0]);
        srcB = vld1q_u32(&s[4]);
        maskB = vld1q_u32(&m[4]);
        srcC = vld1q_u32(&s[8]);
        maskC = vld1q_u32(&m[8]);
        srcD = vld1q_u32(&s[12]);
        maskD = vld1q_u32(&m[12]);
        srcA = vandq_u32(srcA, maskA);
        srcB = vandq_u32(srcB, maskB);
        srcC = vandq_u32(srcC, maskC);
        srcD = vandq_u32(srcD, maskD);
        vst1q_u32(&s[0], srcA);
        vst1q_u32(&s[4], srcB);
        vst1q_u32(&s[8], srcC);
        vst1q_u32(&s[12], srcD);
        s += 16;
        m += 16; // the mask pointer has to advance together with s
    }
}
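If the element count is not guaranteed to be a multiple of 16, one possible variation is to vectorize 4 elements at a time and finish the remainder with scalar code. This is only a sketch: mask_any_length is a made-up name, and the NEON path is guarded by a preprocessor check so that the same source also builds (as a plain scalar loop) on a machine without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* AND n words of m into s. n may be any length.
 * mask_any_length is a hypothetical helper name, not part of any library. */
void mask_any_length(uint32_t *s, const uint32_t *m, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (s != NULL && m != NULL));

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Vector body: 4 words per iteration, s and m are indexed together. */
    for (; i + 4 <= n; i += 4) {
        uint32x4_t sv = vld1q_u32(s + i);
        uint32x4_t mv = vld1q_u32(m + i);
        vst1q_u32(s + i, vandq_u32(sv, mv));
    }
#endif

    /* Scalar tail (or the whole array when NEON is not available). */
    for (; i < n; i++)
        s[i] &= m[i];
}
```

Calling mask_any_length(s, m, xsize * ysize) would then work for any image size; whether it is faster than the unrolled version above has to be measured on the target.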
I would start with the simplest version and use it as a reference to compare future routines against.
A good rule of thumb is to calculate things as early as possible, not at the exact moment they are needed.
An instruction may take X cycles to execute, but its result is not always ready immediately, so scheduling is important.
As an example, a simple scheduling schema for your case would be (pseudocode)
nn=n/4 // Assuming n is a multiple of 4
LOADI_S(0) // Load and immediately after increment pointer
LOADI_M(0) // Load and immediately after increment pointer
for( k=1; k<nn;k++){
    AND_SM(k-1) // Inner op
    LOADI_S(k) // Load and increment after
    LOADI_M(k) // Load and increment after
    STORE_S(k-1) // Store and increment after
}
AND_SM(nn-1) // Last inner op, done outside the loop
STORE_S(nn-1) // Store. Not needed to increment
Moving the first loads and the last store out of the inner loop means that the ops inside it don't depend on the result of the op immediately before them.
This schema can be extended further to make use of the time that would otherwise be lost waiting for the result of the previous op.
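For reference, the schedule above could look roughly like this in plain C. It is only a sketch: vec4, load4, and4, store4 and mask_pipelined are made-up names standing in for a 128-bit register and the NEON load/AND/store intrinsics, s and m are assumed not to overlap, and an optimizing compiler is free to reorder the statements anyway.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit NEON register (4 x uint32_t). */
typedef struct { uint32_t lane[4]; } vec4;

static vec4 load4(const uint32_t *p)
{
    vec4 v;
    for (int i = 0; i < 4; i++) v.lane[i] = p[i];
    return v;
}

static vec4 and4(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] & b.lane[i];
    return r;
}

static void store4(uint32_t *p, vec4 v)
{
    for (int i = 0; i < 4; i++) p[i] = v.lane[i];
}

/* s[i] &= m[i] for n words, n a multiple of 4, with the loads for block k
 * issued before the store of block k-1. */
void mask_pipelined(uint32_t *s, const uint32_t *m, size_t n)
{
    assert(n % 4 == 0);
    size_t nn = n / 4;
    if (nn == 0)
        return;

    vec4 sv = load4(s);                    /* prologue: LOADI_S(0) */
    vec4 mv = load4(m);                    /* prologue: LOADI_M(0) */
    for (size_t k = 1; k < nn; k++) {
        vec4 r = and4(sv, mv);             /* AND_SM(k-1)  */
        sv = load4(s + 4 * k);             /* LOADI_S(k)   */
        mv = load4(m + 4 * k);             /* LOADI_M(k)   */
        store4(s + 4 * (k - 1), r);        /* STORE_S(k-1) */
    }
    store4(s + 4 * (nn - 1), and4(sv, mv)); /* epilogue: last AND and store */
}
```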
Using OpenMP with GPU

Good day, everyone!
I would like to ask the community for advice on using GPU computing power instead of, or together with, the CPU.
I have a well-functioning program based on a recursive search over all possible combinations of some events, parallelized with OpenMP to run on all available processor cores.
The C++ pseudocode is as follows:
// #includes
// function declarations
// declaring a global variable:
QVector<QVector<QVector<float>>> variant; // (or "std::vector")

int main() {
    // reads data from file
    // data are converted and analyzed
    // the variant variable containing the current best result is filled in (here - by pre-analysis)
    #pragma omp parallel shared(variant)
    {
        #pragma omp master
        {
            // call the recursive algorithm that searches all variants:
            PEREBOR(Tabl_1, a, i_a, ..., rec_depth);
        }
    }
    return 0;
}

void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t rec_depth)
{
    // looking for the boundaries of the first cycle for some reasons
    for (int i = quantity; i < another_quantity; i++) {
        // the Tabl_1 is processed and modified to determine the number of steps in the subsequent for cycle
        for (int k = 0; k < the_quantity_just_found; k++) {
            if the recursion depth is not 1, we go down further: {
                // add descent to the next recursion level to the call stack:
                #pragma omp task
                PEREBOR(Tabl_1_COPY, a, i_a, ..., rec_depth-1);
            }
            else (if we went down to the lowest level): {
                if (condition fulfilled) // condition check - READ variant variable
                    variant = it_is_equal_to_that_,_to_that...;
            }
        }
    }
}
Unfortunately, I don't have a CPU with a thousand cores at my disposal, and without one the algorithm takes a very long time to run. At work, I was advised to think about using a GPU to speed up the calculations. I learned that OpenMP can work with video cards (especially NVidia ones), but that OpenACC also does this well.
So my main question is: is it possible to run a recursive algorithm on a GPU simply and, at the same time, efficiently? Can this give a noticeable speedup relative to the CPU? If so, would OpenACC do better? Is it possible to give instructions to the video card through "#pragma omp task", or are other directives REQUIRED? And how could calculations on the CPU and GPU be combined?
Thank you so much for any help!
Finding pointer with 'find out what writes to this address' strange offset

I'm trying to find a base pointer for UrbanTerror42.
My setup is as follows: I have a server with 2 players.
Cheat Engine runs on client A.
I climb a ladder with client B and then scan for increased/decreased values.
When I have found the values, I use "find out what writes to this address".
But the offsets are very high and point to empty memory.
I don't really know how to proceed.
For the sake of clarity: I have looked up several other values, and they have the same problem.
I've already looked at a number of tutorials and forums, but they always deal with values where the offsets are between 0 and 100, not 80614.
I would really appreciate it if someone could tell me why this happens and what I have to do/learn to proceed.
Thanks in advance.
Urban Terror uses the Quake Engine. Early versions of this engine use the Quake Virtual Machine, and the game logic is implemented as bytecode which is compiled into assembly by the Quake Virtual Machine. Custom allocation routines are used to load these modules into memory. Relative and hardcoded offsets/addresses are created at runtime to accommodate these relocations; they do not use the normal relocation table method of the portable executable file format. This is why you see these seemingly strange numbers that change every time you run the game.
Quake Virtual Machine modules use the .qvm file format, and the QVMs loaded in memory are tracked in the QVM table. You must find the QVM table to uncover this mystery. Once you find the 2-3 QVMs and record their addresses, finding the table is easy: you simply scan for pointers that point to these addresses and narrow down the results to those that are close to each other in memory.
The QVM is defined like:
struct vmTable_t
{
    vm_t vm[3];
};

struct vm_s {
    int programStack; // the vm may be recursively entered
    intptr_t(*systemCall)(intptr_t *parms);
    char name[MAX_QPATH];
    // for dynamic linked modules
    void *dllHandle;
    intptr_t entryPoint; //(QDECL *entryPoint)(int callNum, ...);
    void(*destroy)(vm_s* self);
    // for interpreted modules
    qboolean currentlyInterpreting;
    qboolean compiled;
    byte *codeBase;
    int codeLength;
    int *instructionPointers;
    int instructionCount;
    byte *dataBase;
    int dataMask;
    int stackBottom; // if programStack < stackBottom, error
    int numSymbols;
    struct vmSymbol_s *symbols;
    int callLevel; // counts recursive VM_Call
    int breakFunction; // increment breakCount on function entry to this
    int breakCount;
    BYTE *jumpTableTargets;
    int numJumpTableTargets;
};

typedef struct vm_s vm_t;
The value in EAX in your original screenshot should be the same as either the codeBase or dataBase member variable of the QVM structure. The offsets are just relative to these addresses. Similarly to how you deal with ASLR, you must calculate the addresses at runtime.
Here is a truncated version of my code that does exactly this and additionally grabs important structures from memory, as an example:
void OA_t::GetVM()
{
    cg = nullptr;
    cgs = nullptr;
    cgents = nullptr;
    bLocalGame = false;
    cgame = nullptr;
    for (auto &vm : vmTable->vm)
    {
        if (strstr(vm.name, "qagame")) { bLocalGame = true; continue; }
        if (strstr(vm.name, "cgame"))
        {
            cgame = &vm;
            gamestatus = GSTAT_GAME;
            //char* gamestring = Cvar_VariableString("fs_game");
            switch (cgame->instructionCount)
            {
            case 136054: //version 88
                cgents = (cg_entities*)(cgame->dataBase + 0x1649c);
                cg = (cg_t*)(cgame->dataBase + 0xCC49C);
                cgs = (cgs_t*)(cgame->dataBase + 0xf2720);
                break;
            }
        }
    }
}
The full source code is available for reference at OpenArena Aimbot Source Code; it even includes a video overview of the code.
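The base-plus-offset calculation itself can be tried in isolation with a few lines of C. This is a sketch under assumptions: vm_view and vm_resolve are made-up names, only two of the vm_s fields are mirrored, and the masking with dataMask reflects how, as far as I know, the engine wraps VM addresses into its power-of-two sized data segment; treat that part as an assumption and check it against your engine version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down view of the two vm_s fields used here. */
typedef struct {
    uint8_t *dataBase; /* start of the VM data segment in host memory */
    int      dataMask; /* assumed: data segment size minus one        */
} vm_view;

/* Turn a VM-relative offset (the "strange" number Cheat Engine shows)
 * into a host address: base + (offset wrapped into the data segment). */
void *vm_resolve(const vm_view *vm, uint32_t offset)
{
    assert(vm != NULL && vm->dataBase != NULL);
    return vm->dataBase + (offset & (uint32_t)vm->dataMask);
}
```

With dataBase read from the QVM table at runtime, vm_resolve(&vm, 0x1649c) would correspond to the cgents line in the code above.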
OpenCL usage of register value crashes program

I finished writing an OpenCL kernel for thermodynamics calculations and observed a really weird bug.
My kernel looks like this:
__kernel void energy(... __global float3 *dest, int nlocal, ...){
    int i = get_global_id(0);
    float3 ev = {0.0f, 0.0f, 0.0f};
    //some thermo calculations, adding values to ev.x and ev.y
    ev.x +=...;
    ev.y +=...;
    //Then I want to save the result in dest[i].
    //Program exits at the next two lines
    dest[i].x = ev.x;
    dest[i].y = ev.y;
}
I get an "unmapped Memory" and segfault error. I get the same error when trying to print out the value using printf. It seems like the program can't read the value. Writing to it works, though! (Maybe because of some compiler optimizations.)
Now if I use another float register value, I get the same error. But if I change the last lines to something like this (no use of ev.x or ev.y):
dest[i].x = i/nlocal*3.1f;
dest[i].y = ...;
everything works as expected and I get no error.
This works too:
int i = ...
float3 ev = {0.0f, ...};
dest[i].x = ev.x;
But somehow, after the actual calculation, it is no longer possible.
The program is running on a Nvidia K40m, Kepler architecture.
This looks suspicious in your code:
kernel(... __global int* neigh
__global int* neighs = neigh+i;
int j = neighs[k*n];
It seems like you are passing an array of pointers in neigh, then fetching a pointer and using it.
Host pointers stored in a buffer are not valid in CL; if you pass such pointers, you end up addressing outside the GPU memory and therefore crash.
It is also possible that your buffers are simply not sized correctly; the sizes should be:
res, nneigh = GLOBAL_SIZE
neighs = max(nneigh[])*n
x = max(neighs[])
It is also possible that you created the buffers smaller than they should be (remember they hold float and float3 elements, which use 32 bits and 128 bits per element). CL API calls take sizes in bytes (you should use sizeof()), not in elements.
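To illustrate the bytes-versus-elements point, here is a minimal host-side sketch in C. It does not call the OpenCL API: float3_host is a made-up stand-in that mirrors the 16-byte size of cl_float3 from the OpenCL headers, and buffer_bytes is a hypothetical helper that computes the value you would pass as the size argument when creating a buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cl_float3: three floats padded to the size
 * of four, i.e. 16 bytes (128 bits) per element. */
typedef struct { float s[4]; } float3_host;

/* Size in bytes of a buffer holding element_count elements. */
size_t buffer_bytes(size_t element_count, size_t element_size)
{
    assert(element_size != 0);
    return element_count * element_size;
}
```

For example, a dest buffer for 1024 work items needs buffer_bytes(1024, sizeof(float3_host)) bytes, which is four times what a buffer of 1024 plain floats needs; passing the element count alone would allocate far too little.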
Okay, I found the answer, and the code above works. I changed the kernel parameters for better understanding and unconsciously corrected the mistake when I posted the code here.
int numneigh = nneigh[i] (stands for number of neighbors) is correct
in the original code I did this:
int numneigh = neigh[i] (the neighbors)
segmentation fault when using shared memory created by open_shm on Xeon Phi

I have written my code for a single Xeon Phi node (with 61 cores on it). I have two files. I call MPI_Init(2) before making any other MPI calls. I also obtain ntasks and rank using MPI calls, and I have included all the required libraries. Still, I get an error. Can you please help me out with this?
In file 1:
int buffsize;
int *sendbuff,**recvbuff,buffsum;
int *shareRegion;
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
sendbuff=(int *)malloc(sizeof(int)*buffsize);
if( taskid == 0 ){
    recvbuff=(int **)malloc(sizeof(int *)*ntasks);
    recvbuff[0]=(int *)malloc(sizeof(int)*ntasks*buffsize);
}
else{
    recvbuff=(int **)malloc(sizeof(int *)*1);
    recvbuff[0]=(int *)malloc(sizeof(int)*1);
}
call(sendbuff, buffsize, shareRegion, recvbuff[0],buffsize,taskid,ntasks);
In file 2:
void* gInit( MPI_Comm comm, int size, int num_proc)
{
    int share_mem = shm_open("share_region", O_CREAT|O_RDWR,0666 );
    if( share_mem == -1)
        return NULL;
    int rank;
    if( ftruncate( share_mem, sizeof(int)*size*num_proc) == -1 )
        return NULL;
    int* shared = mmap(NULL, sizeof(int)*size*num_proc, PROT_WRITE | PROT_READ, MAP_SHARED, share_mem, 0);
    if(shared == (void*)-1)
        printf("error in mem allocation (mmap)\n");
    *(shared+(rank)) = 0;
    return shared;
}

void call(int *sendbuff, int sendcount, volatile int *sharedRegion, int **recvbuff, int recvcount, int rank, int size)
{
    int i=0;
    int k,j;
    sharedRegion[j] = sendbuff[i];
    if( rank == 0)
        recvbuff[k][i] = sharedRegion[j];
}
Then I do some computation in file 1 on this recvbuff.
I get this segmentation fault when using the sharedRegion variable.
MPI represents the message-passing paradigm. That means processes (ranks) are isolated and generally run on a distributed machine. They communicate via explicit messages; recent versions also allow one-sided, but still explicit, data transfer. You cannot assume that shared memory is available to the processes. Have a look at any MPI tutorial to see how MPI is used.
Since you did not specify on what kind of machine you are running, any further suggestion is purely speculative. If you actually are on a shared memory machine, you may want to use a real shared memory paradigm instead, e.g. OpenMP.
While it's possible to restrict MPI to only use one machine and have shared memory (see the RMA chapter, especially in MPI-3), if you're only ever going to use one machine, it's easier to use some other paradigm.
However, if you're going to use multiple nodes and have multiple ranks on one node (multi-core processes for example), then it might be worth taking a look at MPI-3 RMA to see how it can help you with both locally shared memory and remote memory access. There are multiple papers out on the subject, but because they're so new, there aren't many good tutorials yet. You'll have to dig around a bit to find something useful to you.
The ordering of these two lines:
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
suggests that buffsize could have different values before and after the call to gInit. If buffsize, as passed in the first argument to the program, is larger than the value it had when gInit was called, then out-of-bounds memory access would occur later and lead to a segmentation fault.
OpenCL trying to use semaphore crashes drivers

While writing a simple OpenCL kernel, I tried to use semaphores and it crashed my GPU drivers (AMD 12.10). After checking out examples, I found that the crash happens only when the local work size is not equal to 1.
This code is taken from an example:
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable
void GetSemaphor(__global int * semaphor)
{
    int occupied = atom_xchg(semaphor, 1);
    while(occupied > 0)
        occupied = atom_xchg(semaphor, 1);
}
void ReleaseSemaphor(__global int * semaphor)
{
    int prevVal = atom_xchg(semaphor, 0);
}
__kernel void kernelNoAtomInc(__global int * num,
                              __global int * semaphor)
{
    int i = get_global_id(0);
}
In the example, the author uses
CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 1 }, null);
where N = global_work_size and local_work_size = 1.
Now if I change 1 to null, 2, 4, or any other number I tried, the AMD drivers crash.
CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 2 }, null);
I do not have another PC to test on at the moment. However, it seems strange that the author deliberately left local_group_size = 1, which is why I think I am missing something here. Can someone please explain this to me? Also, as far as I understand, leaving local_group_size at 1 will greatly affect performance, or won't it?
Host: Win8 x64, HD6870
Your problem is not reproducible, and I also cannot find your source from the link, but here are a few ideas on why it could crash, which should be helpful (9 years in the past).
It probably crashes because...
... the driver thinks you want the local version of the atom_xchg() function to be executed, when you actually want the global one.
... your loop slows down execution of the kernel so drastically on an old machine that an internal execution time limit is exceeded, causing the driver to terminate the kernel.
What I can suggest as a possible fix:
Do not activate the local version of the atom function in your kernel.
Try running it on the CPU.
Beyond that, there is no way to fix this unless we could access your computer and debug on it.
You were also asking why the author chose a local_group_size of one. This is because the global work size needs to be divisible by the local work size, so that the division results in a natural number. Dividing a natural number by one always results in a natural number, therefore this is perfect for experimenting. You are completely correct in saying that it will affect performance greatly. (It may also be that the sizes simply did not divide evenly, in which case the kernel would not crash but would not even start.)
Other notes:
To make the increment functionally correct, you should use atom_inc() on your num buffer. I don't see how this could lead to a crash, but it definitely makes your program not work as intended.
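As a rough host-side analogy (this is C11 with stdatomic.h, not OpenCL kernel code), the two ways of incrementing a shared counter can be put side by side. All function names here are made up for illustration: acquire and release mirror GetSemaphor and ReleaseSemaphor, and increment_atomic plays the role that atom_inc() would play in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Spin until the flag changes from 0 to 1 (the GetSemaphor idea). */
void acquire(atomic_int *sem)
{
    assert(sem != NULL);
    while (atomic_exchange(sem, 1) != 0)
        ; /* busy-wait until the previous owner releases the flag */
}

/* Reset the flag so that someone else can take it (ReleaseSemaphor). */
void release(atomic_int *sem)
{
    assert(sem != NULL);
    atomic_store(sem, 0);
}

/* Variant 1: plain increment protected by the flag.
 * Returns the new value of *num. */
int increment_locked(atomic_int *sem, int *num)
{
    acquire(sem);
    int result = ++*num;
    release(sem);
    return result;
}

/* Variant 2: one atomic read-modify-write, no flag and no spinning.
 * Returns the new value of *num. */
int increment_atomic(atomic_int *num)
{
    return atomic_fetch_add(num, 1) + 1;
}
```

The second variant never waits on another thread, which is why a single atomic increment is usually the safer choice when all you need is a counter.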
