segmentation fault when using shared memory created by shm_open on Xeon Phi - mpi

I have written my code for a single Xeon Phi node (with 61 cores on it). I have two files. I call MPI_Init (with its two arguments) before calling any other MPI routines, and I obtain ntasks and the rank via MPI calls. I have also included all the required libraries. Still I get a segmentation fault. Can you please help me out with this?
In file 1:
int buffsize;
int *sendbuff, **recvbuff, buffsum;
int *shareRegion;

shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks);  /* gInit is in file 2 */
buffsize = atoi(argv[1]);

sendbuff = (int *)malloc(sizeof(int)*buffsize);
if( taskid == 0 ){
    recvbuff = (int **)malloc(sizeof(int *)*ntasks);
    recvbuff[0] = (int *)malloc(sizeof(int)*ntasks*buffsize);
    for(i=1; i<ntasks; i++) recvbuff[i] = recvbuff[i-1] + buffsize;
}
else{
    recvbuff = (int **)malloc(sizeof(int *)*1);
    recvbuff[0] = (int *)malloc(sizeof(int)*1);
}

for(i=0; i<buffsize; i++){
    sendbuff[i] = 1;
}

MPI_Barrier(MPI_COMM_WORLD);
call(sendbuff, buffsize, shareRegion, recvbuff[0], buffsize, taskid, ntasks);
In file 2:
void* gInit( MPI_Comm comm, int size, int num_proc)
{
    int share_mem = shm_open("share_region", O_CREAT|O_RDWR, 0666);
    if( share_mem == -1)
        return NULL;

    int rank;
    MPI_Comm_rank(comm, &rank);

    if( ftruncate( share_mem, sizeof(int)*size*num_proc) == -1 )
        return NULL;

    int* shared = mmap(NULL, sizeof(int)*size*num_proc, PROT_WRITE | PROT_READ, MAP_SHARED, share_mem, 0);
    if(shared == (void*)-1)
        printf("error in mem allocation (mmap)\n");

    *(shared+(rank)) = 0;

    MPI_Barrier(MPI_COMM_WORLD);
    return shared;
}
void call(int *sendbuff, int sendcount, volatile int *sharedRegion, int **recvbuff, int recvcount, int rank, int size)
{
    int i = 0;
    int k, j;

    j = rank * sendcount;
    for(i = 0; i < sendcount; i++)
    {
        sharedRegion[j] = sendbuff[i];
        j++;
    }

    if( rank == 0)
        for(k = 0; k < size; k++)
            for(i = 0; i < sendcount; i++)
            {
                j = 0;
                recvbuff[k][i] = sharedRegion[j];
                j++;
            }
}
Then I am doing some computation in file 1 on this recvbuff.
I get the segmentation fault when using the sharedRegion variable.

MPI represents the Message Passing paradigm. That means processes (ranks) are isolated and generally run on a distributed machine. They communicate via explicit communication messages; recent versions also allow one-sided, but still explicit, data transfer. You cannot assume that shared memory is available to the processes. Have a look at any MPI tutorial to see how MPI is used.
Since you did not specify what kind of machine you are running on, any further suggestion is purely speculative. If you actually are on a shared-memory machine, you may want to use a real shared-memory paradigm instead, e.g. OpenMP.
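To make the contrast concrete, here is a tiny sketch of the explicit message-passing style (my own illustration, not part of the original answer): rank 1 sends a single int and rank 0 receives it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* explicit send */
    } else if (rank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* explicit receive */
        printf("rank 0 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}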

While it's possible to restrict MPI to a single machine and use shared memory there (see the RMA chapter, especially in MPI-3), if you're only ever going to use one machine, it's easier to use some other paradigm.
However, if you're going to use multiple nodes with multiple ranks per node (multi-core processors, for example), then it might be worth taking a look at MPI-3 RMA to see how it can help you with both locally shared memory and remote memory access. There are multiple papers out on the subject, but because they're so new, there aren't many good tutorials yet. You'll have to dig around a bit to find something useful to you.
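As a rough sketch (mine, not from the answer) of what the MPI-3 shared-memory window approach looks like for ranks on the same node; error handling is omitted and the function name node_shared_alloc is made up for illustration:

#include <mpi.h>

/* Returns a pointer to the start of a node-wide shared block that holds
 * buffsize ints per rank; assumes MPI_Init has already been called. */
int *node_shared_alloc(int buffsize, MPI_Win *win)
{
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    int *mybase;   /* start of this rank's own segment */
    MPI_Win_allocate_shared((MPI_Aint)buffsize * sizeof(int), sizeof(int),
                            MPI_INFO_NULL, nodecomm, &mybase, win);

    /* query rank 0's segment to get the base of the whole contiguous region */
    MPI_Aint segsize;
    int disp_unit;
    int *region;
    MPI_Win_shared_query(*win, 0, &segsize, &disp_unit, &region);
    return region;
}

Every rank on the node can then address region directly; window synchronisation calls (e.g. MPI_Win_fence) still have to be used to order the accesses.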

The ordering of these two lines:
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
suggests that buffsize could have different values before and after the call to gInit. If the buffsize passed as the first argument to the program is larger than its initial value at the time gInit is called, then out-of-bounds memory access would occur later and lead to a segmentation fault.
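A minimal fix sketch (assuming argv[1] is the intended buffer size): read the argument before sizing the shared region.

buffsize = atoi(argv[1]);                                       /* read the size first           */
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks);    /* then create the shared region */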
Hint: run your code as an MPI singleton (e.g. without mpirun) from inside a debugger (e.g. gdb) or change the limits so that cores would get dumped on error (e.g. with ulimit -c unlimited) and then examine the core file(s) with the debugger. Compiling with debug information (e.g. adding -g to the compiler options) helps a lot in such cases.

Related

Finding pointer with 'find out what writes to this address' strange offset

I'm trying to find a base pointer for UrbanTerror42.
My setup is as follows: I have a server with 2 players.
Cheat Engine runs on client A.
I climb a ladder with client B and then scan for increase/decrease.
When I have found the values, I use "find out what writes to this address".
But the offsets are very high and point to empty memory.
I don't really know how to proceed.
For the sake of clarity: I have looked up several other values and they have the same problem.
I've already looked at a number of tutorials and forums, but those always deal with values where the offsets are between 0 and 100, not 80614.
I would really appreciate it if someone could tell me why this happens and what I have to do/learn to proceed.
Thanks in advance.
Urban Terror uses the Quake engine. Early versions of this engine use the Quake Virtual Machine, and the game logic is implemented as bytecode which is compiled into assembly by the Quake Virtual Machine. Custom allocation routines are used to load these modules into memory; relative and hardcoded offsets/addresses are created at runtime to accommodate these relocations and do not use the normal relocation-table mechanism of the portable executable file format. This is why you see these seemingly strange numbers that change every time you run the game.
The Quake Virtual Machine modules use the .qvm file format, and the QVMs loaded in memory are tracked in a QVM table. You must find the QVM table to uncover this mystery. Once you find the 2-3 QVMs and record their addresses, finding the table is easy: you simply scan for pointers that point to these addresses and narrow down your results by finding those which are close to each other in memory.
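That pointer scan can be sketched roughly like this (my own illustration, not the author's code; buf/len stand for a block of process memory you have already read out, and qvm_addr is one of the recorded QVM addresses):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Record every offset in buf where a pointer-sized value equals qvm_addr;
 * such locations are candidates for a vm_t* entry in the QVM table. */
size_t find_pointers_to(const uint8_t *buf, size_t len, uintptr_t qvm_addr,
                        size_t *hits, size_t max_hits)
{
    size_t n = 0;
    for (size_t off = 0; off + sizeof(uintptr_t) <= len; off += sizeof(uintptr_t)) {
        uintptr_t value;
        memcpy(&value, buf + off, sizeof value);   /* avoid unaligned reads */
        if (value == qvm_addr && n < max_hits)
            hits[n++] = off;
    }
    return n;
}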
The QVM is defined like:
struct vmTable_t
{
    vm_t vm[3];
};

struct vm_s {
    // DO NOT MOVE OR CHANGE THESE WITHOUT CHANGING THE VM_OFFSET_* DEFINES
    // USED BY THE ASM CODE
    int programStack;                 // the vm may be recursively entered
    intptr_t (*systemCall)(intptr_t *parms);

    //------------------------------------

    char name[MAX_QPATH];

    // for dynamic linked modules
    void *dllHandle;
    intptr_t entryPoint;              //(QDECL *entryPoint)(int callNum, ...);
    void (*destroy)(vm_s* self);

    // for interpreted modules
    qboolean currentlyInterpreting;
    qboolean compiled;
    byte *codeBase;
    int codeLength;
    int *instructionPointers;
    int instructionCount;
    byte *dataBase;
    int dataMask;
    int stackBottom;                  // if programStack < stackBottom, error
    int numSymbols;
    struct vmSymbol_s *symbols;
    int callLevel;                    // counts recursive VM_Call
    int breakFunction;                // increment breakCount on function entry to this
    int breakCount;
    BYTE *jumpTableTargets;
    int numJumpTableTargets;
};
typedef struct vm_s vm_t;
The value in EAX in your original screenshot should be the same as either the codeBase or dataBase member variable of the QVM structure. The offsets are just relative to these addresses. Similarly to how you deal with ASLR, you must calculate the addresses at runtime.
Here is a truncated version of my code that does exactly this and additionally grabs important structures from memory, as an example:
void OA_t::GetVM()
{
    cg = nullptr;
    cgs = nullptr;
    cgents = nullptr;
    bLocalGame = false;
    cgame = nullptr;

    for (auto &vm : vmTable->vm)
    {
        if (strstr(vm.name, "qagame")) { bLocalGame = true; continue; }
        if (strstr(vm.name, "cgame"))
        {
            cgame = &vm;
            gamestatus = GSTAT_GAME;
            //char* gamestring = Cvar_VariableString("fs_game");
            switch (cgame->instructionCount)
            {
            case 136054: //version 88
                cgents = (cg_entities*)(cgame->dataBase + 0x1649c);
                cg = (cg_t*)(cgame->dataBase + 0xCC49C);
                cgs = (cgs_t*)(cgame->dataBase + 0xf2720);
                return;
Full source code for reference available at OpenArena Aimbot Source Code, it even includes a video overview of the code.
Full disclosure: that is a link to my website and the only viable resource I know of that covers this topic.

Do I need to worry about memory corruption if I DON'T use atomic functions in opencl?

I have a kernel where, for each execution, I might write a result into an output array.
All the LowerValues[] entries will be initialised to a magic number so I can see whether an entry has been touched.
I don't care which value ends up in LowerValues[N], as long as it's valid.
kernel void HasLower( global int* Values, global int* LowerValues )
{
    int ThisIndex = get_global_id(0);
    int Max = get_global_size(0);
    int ThisValue = Values[ThisIndex];

    for ( int ThatIndex = ThisIndex + 1; ThatIndex < Max; ThatIndex++ )
    {
        int ThatValue = Values[ThatIndex];
        if ( ThatValue < ThisValue )
        {
            LowerValues[ThisValue] = ThatValue;
        }
        else if ( ThatValue > ThisValue )
        {
            LowerValues[ThatValue] = ThisValue;
        }
    }
}
My multithreaded head says: use an atomic function to set the value in LowerValues[] so the memory does not get corrupted when two threads write to the same memory location.
But... what I'm wondering is: do I NEED to use atomic functions?
I don't care if one execution overwrites another, but I don't want corrupted values. Is this something I need to be concerned about with OpenCL? Is it driver/device dependent?
I couldn't find any information regarding this, and so far I haven't come across any memory corruption when I omit atomics.
I could play it safe, but perhaps there are optimisations to be had.
You do not need to use atomic functions here: plain stores of basic types have no side effects on neighbouring memory, so nothing gets corrupted. However, as you've already guessed, when multiple work-items write to the same LowerValues element, the value that ends up there has to be considered a random, non-repeatable choice among the valid answers.
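If you ever do want a deterministic result, for instance always keeping the smallest candidate, the plain store could be swapped for an atomic built-in. A sketch (mine, not from the answer), assuming LowerValues is initialised to a large magic value such as INT_MAX:

// Sketch only: atomic_min makes the winning value well-defined (the minimum)
// instead of an arbitrary one of the competing writes.
kernel void HasLowerAtomic( global int* Values, volatile global int* LowerValues )
{
    int ThisIndex = get_global_id(0);
    int Max = get_global_size(0);
    int ThisValue = Values[ThisIndex];

    for ( int ThatIndex = ThisIndex + 1; ThatIndex < Max; ThatIndex++ )
    {
        int ThatValue = Values[ThatIndex];
        if ( ThatValue < ThisValue )
            atomic_min( &LowerValues[ThisValue], ThatValue );
        else if ( ThatValue > ThisValue )
            atomic_min( &LowerValues[ThatValue], ThisValue );
    }
}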

OpenCL trying to use semaphore crashes drivers

While writing a simple OpenCL kernel I tried to use semaphores, and it crashed my GPU drivers (AMD 12.10). After checking out examples I found out that the crash happens only when the local work size is not equal to 1.
This code is taken from the example:
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable
void GetSemaphor(__global int * semaphor)
{
    int occupied = atom_xchg(semaphor, 1);
    while(occupied > 0)
    {
        occupied = atom_xchg(semaphor, 1);
    }
}

void ReleaseSemaphor(__global int * semaphor)
{
    int prevVal = atom_xchg(semaphor, 0);
}

__kernel void kernelNoAtomInc(__global int * num,
                              __global int * semaphor)
{
    int i = get_global_id(0);
    GetSemaphor(&semaphor[0]);
    {
        num[0]++;
    }
    ReleaseSemaphor(&semaphor[0]);
}
In the example the author uses
CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 1 }, null);
where N = global_work_size and local_work_size = 1.
Now if I change the 1 to null, or to 2 or 4 or any other number I tried, the AMD drivers crash:
CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 2 }, null);
I do not have another PC to test on at the moment. However, it seems strange that the author deliberately left local_group_size = 1; that's why I think I am missing something here. Can someone please explain this to me? Also, as far as I understand, leaving local_group_size at 1 will affect performance greatly, won't it?
Thanks.
Host: Win8 x64, HD6870
Your problem is not reproducible for me, and I furthermore cannot find your source from the link, but here are a few ideas on why it could crash, which should be helpful (even if 9 years late).
It probably crashes because...
... the driver thinks you want the local version of that atom_xchg() function to be executed, when instead you want the global one.
... your loop slows down execution of the kernel so drastically on an old machine that an internal execution-time limit is exceeded, causing the driver to terminate the kernel.
What I can suggest as a possible fix:
do not enable the local version of the atom functions in your kernel
try running it on the CPU
There is no certain way to fix this unless we could access your computer and debug on it.
You were also asking why the author chose a local_group_size of one. This is because the global work size needs to be evenly divisible by the local work size. Dividing a natural number by one always yields a natural number, therefore this is perfect for experimenting. You are completely correct in saying that it will affect performance greatly. (Maybe with other sizes the maths simply didn't add up, and the kernel didn't crash but never even started.)
Different notes:
To make the incrementing functionally correct, you should use atom_inc() on your num buffer. I don't see how this could lead to a crash, but it definitely makes your program not work as intended.
I would go and use the atomic functions from the OpenCL 2.0 standard, since they already feature semaphore-like functions: bool atomic_flag_test_and_set(volatile atomic_flag *object) and void atomic_flag_clear(volatile atomic_flag *object). A rough sketch of how these could replace the hand-rolled semaphore follows.
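A minimal sketch (mine, assuming an OpenCL 2.0 device and a semaphore buffer zero-initialised on the host). Note that spinning on a global lock can still deadlock on GPUs because work-items in a wavefront execute in lockstep, so this is for illustration rather than a guaranteed fix:

void GetSemaphor2(volatile __global atomic_flag * semaphor)
{
    // spin until the flag was previously clear
    while (atomic_flag_test_and_set(semaphor))
        ;
}

void ReleaseSemaphor2(volatile __global atomic_flag * semaphor)
{
    atomic_flag_clear(semaphor);
}

__kernel void kernelAtomicFlagInc(__global int * num,
                                  volatile __global atomic_flag * semaphor)
{
    GetSemaphor2(&semaphor[0]);
    num[0]++;               // critical section
    ReleaseSemaphor2(&semaphor[0]);
}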

Effect of using pageable memory for asynchronous memory copy?

In CUDA C Best Practices Guide Version 5.0, Section 6.1.2, it is written that:
In contrast with cudaMemcpy(), the asynchronous transfer version
requires pinned host memory (see Pinned Memory), and it contains an
additional argument, a stream ID.
That means the cudaMemcpyAsync function should fail if I use ordinary pageable memory.
But this is not what happened.
Just for testing purpose, I tried the following program:
Kernel:
__global__ void kernel_increment(float* src, float* dst, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if(tid < n)
        dst[tid] = src[tid] + 1.0f;
}
Main:
int main()
{
    float *hPtr1, *hPtr2, *dPtr1, *dPtr2;
    const int n = 1000;
    size_t bytes = n * sizeof(float);
    cudaStream_t str1, str2;

    hPtr1 = new float[n];
    hPtr2 = new float[n];

    for(int i=0; i<n; i++)
        hPtr1[i] = static_cast<float>(i);

    cudaMalloc<float>(&dPtr1,bytes);
    cudaMalloc<float>(&dPtr2,bytes);

    dim3 block(16);
    dim3 grid((n + block.x - 1)/block.x);

    cudaStreamCreate(&str1);
    cudaStreamCreate(&str2);

    cudaMemcpyAsync(dPtr1,hPtr1,bytes,cudaMemcpyHostToDevice,str1);
    kernel_increment<<<grid,block,0,str2>>>(dPtr1,dPtr2,n);
    cudaMemcpyAsync(hPtr2,dPtr2,bytes,cudaMemcpyDeviceToHost,str1);

    printf("Status: %s\n",cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();
    printf("Status: %s\n",cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(str1);
    cudaStreamDestroy(str2);

    cudaFree(dPtr1);
    cudaFree(dPtr2);

    for(int i=0; i<n; i++)
        std::cout<<hPtr2[i]<<std::endl;

    delete[] hPtr1;
    delete[] hPtr2;

    return 0;
}
The program gave correct output. The array incremented successfully.
How did cudaMemcpyAsync execute without page locked memory?
Am I missing something here?
cudaMemcpyAsync is fundamentally an asynchronous version of cudaMemcpy. This means that it doesn't block the calling host thread when the copy call is issued. That is the basic behaviour of the call.
Optionally, if the call is launched into the non default stream, and if the host memory is a pinned allocation, and the device has a free DMA copy engine, the copy operation can happen while the GPU simultaneously performs another operation: either kernel execution or another copy (in the case of a GPU with two DMA copy engines). If any of these conditions are not satisfied, the operation on the GPU is functionally identical to a standard cudaMemcpy call, ie. it serialises operations on the GPU, and no simultaneous copy-kernel execution or simultaneous multiple copies can occur. The only difference is that the operation doesn't block the calling host thread.
In your example code, the host source and destination memory are not pinned. So the memory transfer cannot overlap with kernel execution (ie. they serialise operations on the GPU). The calls are still asynchronous on the host. So what you have is functionally equivalent to:
cudaMemcpy(dPtr1,hPtr1,bytes,cudaMemcpyHostToDevice);
kernel_increment<<<grid,block>>>(dPtr1,dPtr2,n);
cudaMemcpy(hPtr2,dPtr2,bytes,cudaMemcpyDeviceToHost);
with the exception that all the calls are asynchronous on the host, so the host thread blocks at the cudaDeviceSynchronize() call rather than at each of the memory transfer calls.
This is absolutely expected behaviour.
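If the goal is for the transfers to actually overlap with the kernel, the host buffers would need to be page-locked. A minimal sketch of just the changed lines (not a complete program):

// replace
//     hPtr1 = new float[n];
//     hPtr2 = new float[n];
// with pinned allocations:
float *hPtr1 = nullptr, *hPtr2 = nullptr;
cudaMallocHost((void**)&hPtr1, bytes);   // page-locked host memory
cudaMallocHost((void**)&hPtr2, bytes);

// and at the end, replace delete[] with:
cudaFreeHost(hPtr1);
cudaFreeHost(hPtr2);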

JIT compilation and DEP

I was thinking of trying my hand at some JIT compilation (just for the sake of learning), and it would be nice to have it work cross-platform since I run all three major OSes at home (Windows, OS X, Linux).
With that in mind, I want to know if there is any way to avoid using the Windows virtual-memory functions to allocate memory with execution permissions. It would be nice to just use malloc or new and point the processor at such a block.
Any tips?
Any tips?
DEP just removes execution permission from every non-code page of memory. The code of an application is loaded into memory which has execution permission, and there are lots of JITs which work on Windows/Linux/MacOSX even when DEP is active. This is because there is a way to dynamically allocate memory with the needed permissions set.
Usually, plain malloc should not be used, because permissions are per-page. Aligning malloc'ed memory to pages is still possible at the price of some overhead. If you don't use malloc, you will need some custom memory management (for executable code only); custom management is a common way of doing JIT.
There is a solution from the Chromium project, which uses JIT for the JavaScript V8 VM and which is cross-platform. To be cross-platform, the needed function is implemented in several files and the right one is selected at compile time.
Linux: (chromium src/v8/src/platform-linux.cc) flag is PROT_EXEC of mmap().
void* OS::Allocate(const size_t requested,
                   size_t* allocated,
                   bool is_executable) {
  const size_t msize = RoundUp(requested, AllocateAlignment());
  int prot = PROT_READ | PROT_WRITE | (is_executable ? PROT_EXEC : 0);
  void* addr = OS::GetRandomMmapAddr();
  void* mbase = mmap(addr, msize, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mbase == MAP_FAILED) {
    /** handle error */
    return NULL;
  }
  *allocated = msize;
  UpdateAllocatedSpaceLimits(mbase, msize);
  return mbase;
}
Win32 (src/v8/src/platform-win32.cc): flag is PAGE_EXECUTE_READWRITE of VirtualAlloc
void* OS::Allocate(const size_t requested,
                   size_t* allocated,
                   bool is_executable) {
  // The address range used to randomize RWX allocations in OS::Allocate
  // Try not to map pages into the default range that windows loads DLLs
  // Use a multiple of 64k to prevent committing unused memory.
  // Note: This does not guarantee RWX regions will be within the
  // range kAllocationRandomAddressMin to kAllocationRandomAddressMax
#ifdef V8_HOST_ARCH_64_BIT
  static const intptr_t kAllocationRandomAddressMin = 0x0000000080000000;
  static const intptr_t kAllocationRandomAddressMax = 0x000003FFFFFF0000;
#else
  static const intptr_t kAllocationRandomAddressMin = 0x04000000;
  static const intptr_t kAllocationRandomAddressMax = 0x3FFF0000;
#endif

  // VirtualAlloc rounds allocated size to page size automatically.
  size_t msize = RoundUp(requested, static_cast<int>(GetPageSize()));
  intptr_t address = 0;

  // Windows XP SP2 allows Data Excution Prevention (DEP).
  int prot = is_executable ? PAGE_EXECUTE_READWRITE : PAGE_READWRITE;

  // For exectutable pages try and randomize the allocation address
  if (prot == PAGE_EXECUTE_READWRITE &&
      msize >= static_cast<size_t>(Page::kPageSize)) {
    address = (V8::RandomPrivate(Isolate::Current()) << kPageSizeBits)
        | kAllocationRandomAddressMin;
    address &= kAllocationRandomAddressMax;
  }

  LPVOID mbase = VirtualAlloc(reinterpret_cast<void *>(address),
                              msize,
                              MEM_COMMIT | MEM_RESERVE,
                              prot);
  if (mbase == NULL && address != 0)
    mbase = VirtualAlloc(NULL, msize, MEM_COMMIT | MEM_RESERVE, prot);

  if (mbase == NULL) {
    LOG(ISOLATE, StringEvent("OS::Allocate", "VirtualAlloc failed"));
    return NULL;
  }

  ASSERT(IsAligned(reinterpret_cast<size_t>(mbase), OS::AllocateAlignment()));

  *allocated = msize;
  UpdateAllocatedSpaceLimits(mbase, static_cast<int>(msize));
  return mbase;
}
MacOS (src/v8/src/platform-macos.cc): flag is PROT_EXEC of mmap, just like Linux or other posix.
void* OS::Allocate(const size_t requested,
                   size_t* allocated,
                   bool is_executable) {
  const size_t msize = RoundUp(requested, getpagesize());
  int prot = PROT_READ | PROT_WRITE | (is_executable ? PROT_EXEC : 0);
  void* mbase = mmap(OS::GetRandomMmapAddr(),
                     msize,
                     prot,
                     MAP_PRIVATE | MAP_ANON,
                     kMmapFd,
                     kMmapFdOffset);
  if (mbase == MAP_FAILED) {
    LOG(Isolate::Current(), StringEvent("OS::Allocate", "mmap failed"));
    return NULL;
  }
  *allocated = msize;
  UpdateAllocatedSpaceLimits(mbase, msize);
  return mbase;
}
I also want to note that the bcdedit.exe-like approach should be used only for very old programs which create new executable code in memory but do not set the Exec permission on those pages. For newer programs, like Firefox or Chrome/Chromium, or any modern JIT, DEP should stay active, and the JIT will manage memory permissions in a fine-grained manner.
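For illustration, here is a minimal POSIX sketch (my own, not from the answer) of that fine-grained pattern: write the generated code while the page is writable, then flip it to read+execute before calling it.

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*jit_fn)(void);

int run_generated(const uint8_t *code, size_t len)
{
    size_t size = 4096;                        /* assume len fits in one page */
    if (len > size)
        return -1;
    void *page = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    memcpy(page, code, len);                   /* emit the generated code     */
    if (mprotect(page, size, PROT_READ | PROT_EXEC) != 0)  /* W^X: now executable */
        return -1;
    int result = ((jit_fn)page)();             /* call the generated code     */
    munmap(page, size);
    return result;
}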
One possibility is to make it a requirement that Windows installations running your program be either configured for DEP AlwaysOff (bad idea) or DEP OptOut (better idea).
This can be configured (under WinXp SP2+ and Win2k3 SP1+ at least) by changing the boot.ini file to have the setting:
/noexecute=OptOut
and then configuring your individual program to opt out by choosing (under XP):
Start button
Control Panel
System
Advanced tab
Performance Settings button
Data Execution Prevention tab
This should allow you to execute code from within your program that's created on the fly in malloc() blocks.
Keep in mind that this makes your program more susceptible to attacks that DEP was meant to prevent.
It looks like this is also possible in Windows 2008 with the command:
bcdedit.exe /set {current} nx OptOut
But, to be honest, if you just want to minimise platform-dependent code, that's easy to do just by isolating the code into a single function, something like:
void *MallocWithoutDep(size_t sz) {
#if defined _IS_WINDOWS
    return VirtualMalloc(sz, OPT_DEP_OFF); // or whatever
#elif defined IS_LINUX
    // Do linuxy thing
#elif defined IS_MACOS
    // Do something almost certainly inexplicable
#endif
}
If you put all your platform dependent functions in their own files, the rest of your code is automatically platform-agnostic.
