JIT compilation and DEP - jit

I was thinking of trying my hand at some jit compilataion (just for the sake of learning) and it would be nice to have it work cross platform since I run all the major three at home (windows, os x, linux).
With that in mind, I want to know if there is any way to get out of using the virtual memory windows functions to allocate memory with execution permissions. Would be nice to just use malloc or new and point the processor at such a block.
Any tips?

DEP is just turning off Execution permission from every non-code page of memory. The code of application is loaded to memory which has execution permission; and there are lot of JITs which works in Windows/Linux/MacOSX, even when DEP is active. This is because there is a way to dynamically allocate memory with needed permissions set.
Usually, plain malloc should not be used, because permissions are per-page. Aligning of malloced memory to pages is still possible at price of some overhead. If you will not use malloc, some custom memory management (only for executable code). Custom management is a common way of doing JIT.
There is a solution from Chromium project, which uses JIT for javascript V8 VM and which is cross-platform. To be cross-platform, the needed function is implemented in several files and they are selected at compile time.
Linux: (chromium src/v8/src/platform-linux.cc) flag is PROT_EXEC of mmap().
void* OS::Allocate(const size_t requested,
size_t* allocated,
bool is_executable) {
const size_t msize = RoundUp(requested, AllocateAlignment());
int prot = PROT_READ | PROT_WRITE | (is_executable ? PROT_EXEC : 0);
void* addr = OS::GetRandomMmapAddr();
void* mbase = mmap(addr, msize, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (mbase == MAP_FAILED) {
/** handle error */
return NULL;
}
*allocated = msize;
UpdateAllocatedSpaceLimits(mbase, msize);
return mbase;
}
Win32 (src/v8/src/platform-win32.cc): flag is PAGE_EXECUTE_READWRITE of VirtualAlloc
void* OS::Allocate(const size_t requested,
size_t* allocated,
bool is_executable) {
// The address range used to randomize RWX allocations in OS::Allocate
// Try not to map pages into the default range that windows loads DLLs
// Use a multiple of 64k to prevent committing unused memory.
// Note: This does not guarantee RWX regions will be within the
// range kAllocationRandomAddressMin to kAllocationRandomAddressMax
#ifdef V8_HOST_ARCH_64_BIT
static const intptr_t kAllocationRandomAddressMin = 0x0000000080000000;
static const intptr_t kAllocationRandomAddressMax = 0x000003FFFFFF0000;
#else
static const intptr_t kAllocationRandomAddressMin = 0x04000000;
static const intptr_t kAllocationRandomAddressMax = 0x3FFF0000;
#endif
// VirtualAlloc rounds allocated size to page size automatically.
size_t msize = RoundUp(requested, static_cast<int>(GetPageSize()));
intptr_t address = 0;
// Windows XP SP2 allows Data Excution Prevention (DEP).
int prot = is_executable ? PAGE_EXECUTE_READWRITE : PAGE_READWRITE;
// For exectutable pages try and randomize the allocation address
if (prot == PAGE_EXECUTE_READWRITE &&
msize >= static_cast<size_t>(Page::kPageSize)) {
address = (V8::RandomPrivate(Isolate::Current()) << kPageSizeBits)
| kAllocationRandomAddressMin;
address &= kAllocationRandomAddressMax;
}
LPVOID mbase = VirtualAlloc(reinterpret_cast<void *>(address),
msize,
MEM_COMMIT | MEM_RESERVE,
prot);
if (mbase == NULL && address != 0)
mbase = VirtualAlloc(NULL, msize, MEM_COMMIT | MEM_RESERVE, prot);
if (mbase == NULL) {
LOG(ISOLATE, StringEvent("OS::Allocate", "VirtualAlloc failed"));
return NULL;
}
ASSERT(IsAligned(reinterpret_cast<size_t>(mbase), OS::AllocateAlignment()));
*allocated = msize;
UpdateAllocatedSpaceLimits(mbase, static_cast<int>(msize));
return mbase;
}
MacOS (src/v8/src/platform-macos.cc): flag is PROT_EXEC of mmap, just like Linux or other posix.
void* OS::Allocate(const size_t requested,
size_t* allocated,
bool is_executable) {
const size_t msize = RoundUp(requested, getpagesize());
int prot = PROT_READ | PROT_WRITE | (is_executable ? PROT_EXEC : 0);
void* mbase = mmap(OS::GetRandomMmapAddr(),
msize,
prot,
MAP_PRIVATE | MAP_ANON,
kMmapFd,
kMmapFdOffset);
if (mbase == MAP_FAILED) {
LOG(Isolate::Current(), StringEvent("OS::Allocate", "mmap failed"));
return NULL;
}
*allocated = msize;
UpdateAllocatedSpaceLimits(mbase, msize);
return mbase;
}
And I also want note, that bcdedit.exe-like way should be used only for very old programs, which creates new executable code in memory, but not sets an Exec property on this page. For newer programs, like firefox or Chrome/Chromium, or any modern JIT, DEP should be active, and JIT will manage memory permissions in fine-grained manner.

One possibility is to make it a requirement that Windows installations running your program be either configured for DEP AlwaysOff (bad idea) or DEP OptOut (better idea).
This can be configured (under WinXp SP2+ and Win2k3 SP1+ at least) by changing the boot.ini file to have the setting:
/noexecute=OptOut
and then configuring your individual program to opt out by choosing (under XP):
Start button
Control Panel
System
Advanced tab
Performance Settings button
Data Execution Prevention tab
This should allow you to execute code from within your program that's created on the fly in malloc() blocks.
Keep in mind that this makes your program more susceptible to attacks that DEP was meant to prevent.
It looks like this is also possible in Windows 2008 with the command:
bcdedit.exe /set {current} nx OptOut
But, to be honest, if you just want to minimise platform-dependent code, that's easy to do just by isolating the code into a single function, something like:
void *MallocWithoutDep(size_t sz) {
#if defined _IS_WINDOWS
return VirtualMalloc(sz, OPT_DEP_OFF); // or whatever
#elif defined IS_LINUX
// Do linuxy thing
#elif defined IS_MACOS
// Do something almost certainly inexplicable
#endif
}
If you put all your platform dependent functions in their own files, the rest of your code is automatically platform-agnostic.

Related

Finding pointer with 'find out what writes to this address' strange offset

I'm trying to find a base pointer for UrbanTerror42.
My setup is as followed, I have a server with 2 players.
cheat-engine runs on client a.
I climb a ladder with client b and then scan for incease/decrease.
When I have found the values, I use find out what writes to this address.
But the offset are very high and point to empty memory.
I don't really know how to proceed
For the sake of clarity, I have looked up several other values and they have the same problem
I've already looked at a number of tutorials and forums, but that's always about values where the offsets are between 0 and 100 and not 80614.
I would really appreciate it if someone could tell me why this happened and what I have to do/learn to proceed.
thanks in advance
Urban Terror uses the Quake Engine. Early versions of this engine use the Quake Virtual Machine and the game logic is implemented as bytecode which is compiled into assembly by the Quake Virtual Machine. Custom allocation routines are used to load these modules into memory, relative and hardcoded offsets/addresses are created at runtime to accommodate these relocations and do not use the normal relocation table method of the portable executable file format. This is why you see these seemingly strange numbers that change every time you run the game.
The Quake Virtual Machines are file format .qvm and these qvms in memory are tracked in the QVM table. You must find the QVM table to uncover this mystery. Once you find the 2-3 QVMs and record their addresses, finding the table is easy, as you're simply doing a scan for pointers that point to these addresses and narrowing down your results by finding those which are close in memory to each other.
The QVM is defined like:
struct vmTable_t
{
vm_t vm[3];
};
struct vm_s {
// DO NOT MOVE OR CHANGE THESE WITHOUT CHANGING THE VM_OFFSET_* DEFINES
// USED BY THE ASM CODE
int programStack; // the vm may be recursively entered
intptr_t(*systemCall)(intptr_t *parms);
//------------------------------------
char name[MAX_QPATH];
// for dynamic linked modules
void *dllHandle;
intptr_t entryPoint; //(QDECL *entryPoint)(int callNum, ...);
void(*destroy)(vm_s* self);
// for interpreted modules
qboolean currentlyInterpreting;
qboolean compiled;
byte *codeBase;
int codeLength;
int *instructionPointers;
int instructionCount;
byte *dataBase;
int dataMask;
int stackBottom; // if programStack < stackBottom, error
int numSymbols;
struct vmSymbol_s *symbols;
int callLevel; // counts recursive VM_Call
int breakFunction; // increment breakCount on function entry to this
int breakCount;
BYTE *jumpTableTargets;
int numJumpTableTargets;
};
typedef struct vm_s vm_t;
The value in EAX in your original screenshot should be the same as either the codeBase or dataBase member variable of the QVM structure. The offsets are just relative to these addresses. Similarly to how you deal with ASLR, you must calculate the addresses at runtime.
Here is a truncated version of my code that does exactly this and additionally grabs important structures from memory, as an example:
void OA_t::GetVM()
{
cg = nullptr;
cgs = nullptr;
cgents = nullptr;
bLocalGame = false;
cgame = nullptr;
for (auto &vm : vmTable->vm)
{
if (strstr(vm.name, "qagame")) { bLocalGame = true; continue; }
if (strstr(vm.name, "cgame"))
{
cgame = &vm;
gamestatus = GSTAT_GAME;
//char* gamestring = Cvar_VariableString("fs_game");
switch (cgame->instructionCount)
{
case 136054: //version 88
cgents = (cg_entities*)(cgame->dataBase + 0x1649c);
cg = (cg_t*)(cgame->dataBase + 0xCC49C);
cgs = (cgs_t*)(cgame->dataBase + 0xf2720);
return;
Full source code for reference available at OpenArena Aimbot Source Code, it even includes a video overview of the code.
Full disclosure: that is a link to my website and the only viable resource I know of that covers this topic.

CL_OUT_OF_RESOURCES error is returned by clEnqueueNDRangeKernel() with dynamic parallelism

Kernel codes that produce the error:
__kernel void testDynamic(__global int *data)
{
int id=get_global_id(0);
atomic_add(&data[1],2);
}
__kernel void test(__global int * data)
{
int id=get_global_id(0);
atomic_add(&data[0],2);
if (id == 0) {
queue_t q = get_default_queue();
ndrange_t ndrange = ndrange_1D(1,1);
void (^my_block_A)(void) = ^{testDynamic(data);};
enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
ndrange,
my_block_A);
}
}
I tested below code to be sure OpenCL 2.0 compiler is working.
__kernel void test2(__global int *data)
{
int id=get_global_id(0);
data[id]=work_group_scan_inclusive_add(id);
}
scan function gives 0,1,3,6 as outputs so OpenCL 2.0 reduction functions are working.
Is dynamic parallelism an extension to OpenCL 2.0? If I remove enqueue_kernel command, results are equal the the expected values(omitting child kernel).
Device: Amd RX550, driver: 17.6.2
Is there a special command that needs to be run on host side, to run child kernel on get_default_queue queue? For now, command queue is created with an OpenCL 1.2 way as below:
commandQueue = cl::CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
Does get_default_queue() have to be the same command queue which calls the parent kernel? Asking this because I'm using same command queue to upload data to GPU and then download results, in a single synchronization.
Moved solution from question to answer:
Edit: below API command was the solution:
commandQueue = cl::CommandQueue(context, device,
CL_QUEUE_ON_DEVICE|
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
CL_QUEUE_ON_DEVICE_DEFAULT, &err);
after creating this queue(only 1 per device), didn't use it for anything else and also the parent kernel is enqueued on any other host queue so it looks like get_default_queue() doesn't have to be the parent-calling queue.
Documentation says CL_INVALID_QUEUE_PROPERTIES will be thrown if CL_QUEUE_ON_DEVICE is specified but for my machine, dynamic parallelism works with it and doesn't throw that error(as the upper commandQueue constructor parameters).

segmentation fault when using shared memory created by open_shm on Xeon Phi

I have written my code for single Xeon Phi node( with 61 cores on it). I have two files. I have called MPI_Init(2) before calling any other mpi calls. I have found ntasks, rank also using mpi calls. I have also included all the required libraries. Still i get an error. Can you please help me out with this?
In file 1:
int buffsize;
int *sendbuff,**recvbuff,buffsum;
int *shareRegion;
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
sendbuff=(int *)malloc(sizeof(int)*buffsize);
if( taskid == 0 ){
recvbuff=(int **)malloc(sizeof(int *)*ntasks);
recvbuff[0]=(int *)malloc(sizeof(int)*ntasks*buffsize);
for(i=1;i<ntasks;i++)recvbuff[i]=recvbuff[i-1]+buffsize;
}
else{
recvbuff=(int **)malloc(sizeof(int *)*1);
recvbuff[0]=(int *)malloc(sizeof(int)*1);
}
for(i=0;i<buffsize;i++){
sendbuff[i]=1;
MPI_Barrier(MPI_COMM_WORLD);
call(sendbuff, buffsize, shareRegion, recvbuff[0],buffsize,taskid,ntasks);
In file 2:
void* gInit( MPI_Comm comm, int size, int num_proc)
{
int share_mem = shm_open("share_region", O_CREAT|O_RDWR,0666 );
if( share_mem == -1)
return NULL;
int rank;
MPI_Comm_rank(comm,&rank);
if( ftruncate( share_mem, sizeof(int)*size*num_proc) == -1 )
return NULL;
int* shared = mmap(NULL, sizeof(int)*size*num_proc, PROT_WRITE | PROT_READ, MAP_SHARED, share_mem, 0);
if(shared == (void*)-1)
printf("error in mem allocation (mmap)\n");
*(shared+(rank)) = 0
MPI_Barrier(MPI_COMM_WORLD);
return shared;
}
void call(int *sendbuff, int sendcount, volatile int *sharedRegion, int **recvbuff, int recvcount, int rank, int size)
{
int i=0;
int k,j;
j=rank*sendcount;
for(i=0;i<sendcount;i++)
{
sharedRegion[j] = sendbuff[i];
j++;
}
if( rank == 0)
for(k=0;k<size;k++)
for(i=0;i<sendcount;i++)
{
j=0;
recvbuff[k][i] = sharedRegion[j];
j++;
}
}
Then i am doing some computation in file 1 on this recvbuff.
I get this segmentation fault while using sharedRegion variable.
MPI represents the Message Passing paradigm. That means, processes (ranks) are isolated and are generally running on a distributed machine. They communicate via explicit communication messages, recent versions allow also one-sideded, but still explicit, data transfer. You can not assume that shared memory is available for the processes. Have a look at any MPI tutorial to see how MPI is used.
Since you did not specify on what kind of machine you are running, any further suggestion is purely speculative. If you actually are on a shared memory machine, you may want to use a real shared memory paradigm instead, e.g. OpenMP.
While it's possible to restrict MPI to only use one machine and have shared memory (see the RMA chapter, especially in MPI-3), if you're only ever going to use one machine, it's easier to use some other paradigm.
However, if you're going to use multiple nodes and have multiple ranks on one node (multi-core processes for example), then it might be worth taking a look at MPI-3 RMA to see how it can help you with both locally shared memory and remote memory access. There are multiple papers out on the subject, but because they're so new, there's not a lot of good tutorials yet. You'll have to dig around a bit to find something useful to you.
The ordering of these two lines:
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
suggest that buffsize could possibly have different values before and after the call to gInit. If buffsize as passed in the first argument to the program is larger than its initial value while gInit is called, then out-of-bounds memory access would occur later and lead to a segmentation fault.
Hint: run your code as an MPI singleton (e.g. without mpirun) from inside a debugger (e.g. gdb) or change the limits so that cores would get dumped on error (e.g. with ulimit -c unlimited) and then examine the core file(s) with the debugger. Compiling with debug information (e.g. adding -g to the compiler options) helps a lot in such cases.

OpenCL trying to use semaphore crashes drivers

While writing simple OpenCL kernel I tried to use semaphores and it crushed my GPU Drivers (AMD 12.10). After checking out examples I found out, that crash happens only when local work size is not equal to 1.
This code taken from example:
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable
void GetSemaphor(__global int * semaphor)
{
int occupied = atom_xchg(semaphor, 1);
while(occupied > 0)
{
occupied = atom_xchg(semaphor, 1);
}
}
void ReleaseSemaphor(__global int * semaphor)
{
int prevVal = atom_xchg(semaphor, 0);
}
__kernel void kernelNoAtomInc(__global int * num,
__global int * semaphor)
{
int i = get_global_id(0);
GetSemaphor(&semaphor[0]);
{
num[0]++;
}
ReleaseSemaphor(&semaphor[0]);
}
In example author uses
CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 1 }, null);
Where N = global_work_size and local_work_size = 1
Now if I change 1 to null or 2 or 4 or any other number i tried - AMD drivers will crush.
CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 2 }, null);
I do not have other PC to test on it at the moment. However it seems strange that author deliberately left local_group_size = 1, that's why I think I missing something here. Can someone please explain this to me? Also, as far as I understand, leaving local_group_size at 1 will affect performance greatly or it won't?
Thanks.
Host: Win8 x64, HD6870
Your problem is not reproducible and I can furthermore not find your source from the link, but here are a few ideas on why it could crash, which should be helpful (9 years in the past).
It propably crashes, because...
... the driver thinks you want the local version of that atom_xchg() function to be executed, when instead you want the global one.
... your loop slows down execution of that kernel so drastically on an old machine, that an internal limit of execution time got passed, causing the driver to terminate the kernel.
What I can suggest for a possible fix:
do not activate the local version of the atom function in your kernel
Try running it on CPU
There is no way to fix this, unless we could access your computer and debug on it.
You were also asking, why the author chose the local_group_size of one. This is because the global work size needs to be divisible by the local work size, such that the division results in a natural number. Dividing a natural number by one always results in a natural number, therefor this is perfect for experimenting. You are completely correct by saying that it will affect performance greatly. (Just maybe the maths didn't add up and it didn't crash, but not even start)
Different notes:
To make the incrementing be functionally correct, you should use an atom_inc() on your num buffer. I don't see how this could lead to a crash, but it definitely makes your program not work as intended
I would go and use the atomic functions from the 2.0 standard, since they already feature a semaphore-like functions: bool atomic_flag_test_and_set(volatile atomic_flag *object) and void atomic_flag_clear(volatile atomic_flag *object)

Condition Variable in Shared Memory - is this code POSIX-conformant?

Does the POSIX standard allow a named shared memory block to contain a mutex and condition variable?
We've been trying to use a mutex and condition variable to synchronise access to named shared memory by two processes on a LynuxWorks LynxOS-SE system (POSIX-conformant).
One shared memory block is called "/sync" and contains the mutex and condition variable, the other is "/data" and contains the actual data we are syncing access to.
We're seeing failures from pthread_cond_signal() if both processes don't perform the mmap() calls in exactly the same order, or if one process mmaps in some other piece of shared memory before it mmaps the "/sync" memory.
This example code is about as short as I can make it:
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/file.h>
#include <stdlib.h>
#include <pthread.h>
#include <errno.h>
#include <iostream>
#include <string>
using namespace std;
static const string shm_name_sync("/sync");
static const string shm_name_data("/data");
struct shared_memory_sync
{
pthread_mutex_t mutex;
pthread_cond_t condition;
};
struct shared_memory_data
{
int a;
int b;
};
//Create 2 shared memory objects
// - sync contains 2 shared synchronisation objects (mutex and condition)
// - data not important
void create()
{
// Create and map 'sync' shared memory
int fd_sync = shm_open(shm_name_sync.c_str(), O_CREAT|O_RDWR, S_IRUSR|S_IWUSR);
ftruncate(fd_sync, sizeof(shared_memory_sync));
void* addr_sync = mmap(0, sizeof(shared_memory_sync), PROT_READ|PROT_WRITE, MAP_SHARED, fd_sync, 0);
shared_memory_sync* p_sync = static_cast<shared_memory_sync*> (addr_sync);
// init the cond and mutex
pthread_condattr_t cond_attr;
pthread_condattr_init(&cond_attr);
pthread_condattr_setpshared(&cond_attr, PTHREAD_PROCESS_SHARED);
pthread_cond_init(&(p_sync->condition), &cond_attr);
pthread_condattr_destroy(&cond_attr);
pthread_mutexattr_t m_attr;
pthread_mutexattr_init(&m_attr);
pthread_mutexattr_setpshared(&m_attr, PTHREAD_PROCESS_SHARED);
pthread_mutex_init(&(p_sync->mutex), &m_attr);
pthread_mutexattr_destroy(&m_attr);
// Create the 'data' shared memory
int fd_data = shm_open(shm_name_data.c_str(), O_CREAT|O_RDWR, S_IRUSR|S_IWUSR);
ftruncate(fd_data, sizeof(shared_memory_data));
void* addr_data = mmap(0, sizeof(shared_memory_data), PROT_READ|PROT_WRITE, MAP_SHARED, fd_data, 0);
shared_memory_data* p_data = static_cast<shared_memory_data*> (addr_data);
// Run the second process while it sleeps here.
sleep(10);
int res = pthread_cond_signal(&(p_sync->condition));
assert(res==0); // <--- !!!THIS ASSERT WILL FAIL ON LYNXOS!!!
munmap(addr_sync, sizeof(shared_memory_sync));
shm_unlink(shm_name_sync.c_str());
munmap(addr_data, sizeof(shared_memory_data));
shm_unlink(shm_name_data.c_str());
}
//Open the same 2 shared memory objects but in reverse order
// - data
// - sync
void open()
{
sleep(2);
int fd_data = shm_open(shm_name_data.c_str(), O_RDWR, S_IRUSR|S_IWUSR);
void* addr_data = mmap(0, sizeof(shared_memory_data), PROT_READ|PROT_WRITE, MAP_SHARED, fd_data, 0);
shared_memory_data* p_data = static_cast<shared_memory_data*> (addr_data);
int fd_sync = shm_open(shm_name_sync.c_str(), O_RDWR, S_IRUSR|S_IWUSR);
void* addr_sync = mmap(0, sizeof(shared_memory_sync), PROT_READ|PROT_WRITE, MAP_SHARED, fd_sync, 0);
shared_memory_sync* p_sync = static_cast<shared_memory_sync*> (addr_sync);
// Wait on the condvar
pthread_mutex_lock(&(p_sync->mutex));
pthread_cond_wait(&(p_sync->condition), &(p_sync->mutex));
pthread_mutex_unlock(&(p_sync->mutex));
munmap(addr_sync, sizeof(shared_memory_sync));
munmap(addr_data, sizeof(shared_memory_data));
}
int main(int argc, char** argv)
{
if(argc>1)
{
open();
}
else
{
create();
}
return (0);
}
Run this program with no args, then another copy with args, and the first one will fail at the assert checking the pthread_cond_signal().
But change the order of the open() function to mmap() the "/sync" memory before the "/data" and it will all work fine.
This seems like a major bug in LynxOS to me, but LynuxWorks claim that using mutex and condition variable within named shared memory in this way is not covered by the POSIX standard, so they are not interested.
Can anyone determine if this code does actually violate POSIX?
Or does anyone have any convincing documentation that it is POSIX compliant?
Edit: we know that PTHREAD_PROCESS_SHARED is POSIX and is supported by LynxOS. The area of contention is whether mutexes and semaphores can be used within named shared memory (as we have done) or if POSIX only allows them to be used when one process creates and mmaps the shared memory and then forks the second process.
The pthread_mutexattr_setpshared function may be used to allow a pthread mutex in shared memory to be accessed by any thread which has access to that memory, even threads in different processes. According to this link, pthread_mutex_setpshared conforms to POSIX P1003.1c. (Same thing goes for the condition variable, see pthread_condattr_setpshared.)
Related question: pthread condition variables on Linux, odd behaviour
I can easily see how PTHREAD_PROCESS_SHARED can be tricky to implement on the OS-level (e.g. MacOS doesn't, except for rwlocks it seems). But just from reading the standard, you seem to have a case.
For completeness, you might want to assert on sysconf(_SC_THREAD_PROCESS_SHARED) and the return value of the *_setpshared() function calls— maybe there's another "surprise" waiting for you (but I can see from the comments that you already checked that SHARED is actually supported).
#JesperE: you might want to refer to the API docs at the OpenGroup instead of the HP docs.
May be there is some pointers in pthread_cond_t (without pshared), so you must place it into the same addresses in both threads/processes. With same-ordered mmaps you may get a equal addresses for both processes.
In glibc the pointer in cond_t was to thread descriptor of thread, owned mutex/cond.
You can control addresses with non-NULL first parameter to mmap.

Resources