I do not want to use mpiexec -n 4 ./a.out to run my program on my core i7 processor (with 4 cores). Instead, I want to run ./a.out, have it detect the number of cores and fire up MPI to run a process per core.
This SO question and answer MPI Number of processors? led me to use mpiexec.
The reason I want to avoid mpiexec is because my code is destined to be a library inside a larger project I'm working on. The larger project has a GUI and the user will be starting long computations that will call my library, which will in turn use MPI. The integration between the UI and the computation code is not trivial... so launching an external process and communicating via a socket or some other means is not an option. It must be a library call.
Is this possible? How do I do it?
This is quite a nontrivial thing to achieve in general. Also, there is hardly any portable solution that does not depend on some MPI implementation specifics. What follows is a sample solution that works with Open MPI and possibly with other general MPI implementations (MPICH, Intel MPI, etc.). It involves a second executable, or a means for the original executable to directly call your library when passed some special command-line argument. It goes like this.
Assume the original executable was started simply as ./a.out. When your library function is called, it calls MPI_Init(NULL, NULL), which initialises MPI. Since the executable was not started via mpiexec, it falls back to the so-called singleton MPI initialisation, i.e. it creates an MPI job that consists of a single process. To perform distributed computations, you have to start more MPI processes and that's where things get complicated in the general case.
MPI supports dynamic process management, in which one MPI job can start a second one and communicate with it using intercommunicators. This happens when the first job calls MPI_Comm_spawn or MPI_Comm_spawn_multiple. The first one is used to start simple MPI jobs that use the same executable for all MPI ranks, while the second one can start jobs that mix different executables. Both need information as to where and how to launch the processes. This comes from the so-called MPI universe, which provides information not only about the started processes, but also about the available slots for dynamically started ones. The universe is constructed by mpiexec or by some other launcher mechanism that takes, e.g., a host file with a list of nodes and the number of slots on each node. In the absence of such information, some MPI implementations (Open MPI included) will simply start the executables on the same node as the original process.
MPI_Comm_spawn[_multiple] has an MPI_Info argument that can be used to supply a list of key-value pairs with implementation-specific information. Open MPI supports the add-hostfile key that can be used to specify a hostfile to be used when spawning the child job. This is useful for, e.g., allowing the user to specify via the GUI a list of hosts to use for the MPI computation. But let's concentrate on the case where no such information is provided and Open MPI simply runs the child job on the same host.
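As an aside, passing such a hostfile through the MPI_Info argument could look roughly like the sketch below; the add-hostfile key is Open MPI-specific, and "hosts.txt" and N here are placeholders.
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "add-hostfile", "hosts.txt"); /* Open MPI-specific key */
MPI_Comm child_comm;
MPI_Comm_spawn("./worker", MPI_ARGV_NULL, N-1, info, 0,
               MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);
MPI_Info_free(&info);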
Assume the worker executable is called worker, or that the original executable can serve as a worker when called with some special command-line option, -worker for example. If you want to perform the computation with N processes in total, you need to launch N-1 workers. This is simple:
(separate executable)
MPI_Comm child_comm;
MPI_Comm_spawn("./worker", MPI_ARGV_NULL, N-1, MPI_INFO_NULL, 0,
MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);
(same executable, with an option)
MPI_Comm child_comm;
char *argv[] = { "-worker", NULL };
MPI_Comm_spawn("./a.out", argv, N-1, MPI_INFO_NULL, 0,
MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);
If everything goes well, child_comm will be set to the handle of an intercommunicator that can be used to communicate with the new job. As intercommunicators are kind of tricky to use and the parent-child job division requires complex program logic, one could simply merge the two sides of the intercommunicator into a "big world" communicator that replaces MPI_COMM_WORLD. On the parent's side:
MPI_Comm bigworld;
MPI_Intercomm_merge(child_comm, 0, &bigworld);
On the child's side:
MPI_Comm parent_comm, bigworld;
MPI_Get_parent(&parent_comm);
MPI_Intercomm_merge(parent_comm, 1, &bigworld);
After the merge is complete, all processes can communicate using bigworld instead of MPI_COMM_WORLD. Note that child jobs do not share their MPI_COMM_WORLD with the parent job.
To put it all together, here is a complete working example with two separate programs.
main.c
#include <stdio.h>
#include <mpi.h>
int main (void)
{
MPI_Init(NULL, NULL);
printf("[main] Spawning workers...\n");
MPI_Comm child_comm;
MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);
MPI_Comm bigworld;
MPI_Intercomm_merge(child_comm, 0, &bigworld);
int size, rank;
MPI_Comm_rank(bigworld, &rank);
MPI_Comm_size(bigworld, &size);
printf("[main] Big world created with %d ranks\n", size);
// Perform some computation
int data = 1, result;
MPI_Bcast(&data, 1, MPI_INT, 0, bigworld);
data *= (1 + rank);
MPI_Reduce(&data, &result, 1, MPI_INT, MPI_SUM, 0, bigworld);
printf("[main] Result = %d\n", result);
MPI_Barrier(bigworld);
MPI_Comm_free(&bigworld);
MPI_Comm_free(&child_comm);
MPI_Finalize();
printf("[main] Shutting down\n");
return 0;
}
worker.c
#include <stdio.h>
#include <mpi.h>
int main (void)
{
MPI_Init(NULL, NULL);
MPI_Comm parent_comm;
MPI_Comm_get_parent(&parent_comm);
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("[worker] %d of %d here\n", rank, size);
MPI_Comm bigworld;
MPI_Intercomm_merge(parent_comm, 1, &bigworld);
MPI_Comm_rank(bigworld, &rank);
MPI_Comm_size(bigworld, &size);
printf("[worker] %d of %d in big world\n", rank, size);
// Perform some computation
int data;
MPI_Bcast(&data, 1, MPI_INT, 0, bigworld);
data *= (1 + rank);
MPI_Reduce(&data, NULL, 1, MPI_INT, MPI_SUM, 0, bigworld);
printf("[worker] Done\n");
MPI_Barrier(bigworld);
MPI_Comm_free(&bigworld);
MPI_Comm_free(&parent_comm);
MPI_Finalize();
return 0;
}
Here is how it works:
$ mpicc -o main main.c
$ mpicc -o worker worker.c
$ ./main
[main] Spawning workers...
[worker] 0 of 2 here
[worker] 1 of 2 here
[worker] 1 of 3 in big world
[worker] 2 of 3 in big world
[main] Big world created with 3 ranks
[worker] Done
[worker] Done
[main] Result = 6
[main] Shutting down
The child job has to use MPI_Comm_get_parent to obtain the intercommunicator to the parent job. When a process is not part of such a child job, the returned value will be MPI_COMM_NULL. This allows for an easy way to implement both the main program and the worker in the same executable. Here is a hybrid example:
#include <stdio.h>
#include <mpi.h>
MPI_Comm bigworld_comm = MPI_COMM_NULL;
MPI_Comm other_comm = MPI_COMM_NULL;
int parlib_init (const char *argv0, int n)
{
MPI_Init(NULL, NULL);
MPI_Comm_get_parent(&other_comm);
if (other_comm == MPI_COMM_NULL)
{
printf("[main] Spawning workers...\n");
MPI_Comm_spawn(argv0, MPI_ARGV_NULL, n-1, MPI_INFO_NULL, 0,
MPI_COMM_SELF, &other_comm, MPI_ERRCODES_IGNORE);
MPI_Intercomm_merge(other_comm, 0, &bigworld_comm);
return 0;
}
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("[worker] %d of %d here\n", rank, size);
MPI_Intercomm_merge(other_comm, 1, &bigworld_comm);
return 1;
}
int parlib_dowork (void)
{
int data = 1, result = -1, size, rank;
MPI_Comm_rank(bigworld_comm, &rank);
MPI_Comm_size(bigworld_comm, &size);
if (rank == 0)
{
printf("[main] Doing work with %d processes in total\n", size);
data = 1;
}
MPI_Bcast(&data, 1, MPI_INT, 0, bigworld_comm);
data *= (1 + rank);
MPI_Reduce(&data, &result, 1, MPI_INT, MPI_SUM, 0, bigworld_comm);
return result;
}
void parlib_finalize (void)
{
MPI_Comm_free(&bigworld_comm);
MPI_Comm_free(&other_comm);
MPI_Finalize();
}
int main (int argc, char **argv)
{
if (parlib_init(argv[0], 4))
{
// Worker process
(void)parlib_dowork();
printf("[worker] Done\n");
parlib_finalize();
return 0;
}
// Main process
// Show GUI, save the world, etc.
int result = parlib_dowork();
printf("[main] Result = %d\n", result);
parlib_finalize();
printf("[main] Shutting down\n");
return 0;
}
And here is an example output:
$ mpicc -o hybrid hybrid.c
$ ./hybrid
[main] Spawning workers...
[worker] 0 of 3 here
[worker] 2 of 3 here
[worker] 1 of 3 here
[main] Doing work with 4 processes in total
[worker] Done
[worker] Done
[main] Result = 10
[worker] Done
[main] Shutting down
Some things to keep in mind when designing such parallel libraries:
MPI can only be initialised once. If necessary, call MPI_Initialized to check if MPI has already been initialised.
MPI can only be finalized once. Again, MPI_Finalized is your friend. It can be used in something like an atexit() handler to implement a universal MPI finalisation on program exit.
When used in threaded contexts (as is usually the case when GUIs are involved), MPI must be initialised with thread support. See MPI_Init_thread. A minimal sketch of these guards follows.
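Here is a minimal sketch of such guards; the helper names parlib_ensure_mpi and finalize_mpi_at_exit are illustrative and not part of the examples above.
#include <stdlib.h>
#include <mpi.h>
/* Finalise MPI at program exit, but only if nobody has done so already. */
static void finalize_mpi_at_exit(void)
{
    int finalized;
    MPI_Finalized(&finalized);
    if (!finalized)
        MPI_Finalize();
}
/* Initialise MPI once, with thread support, and register the exit handler. */
void parlib_ensure_mpi(void)
{
    int initialized, provided;
    MPI_Initialized(&initialized);
    if (!initialized)
    {
        MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
        atexit(finalize_mpi_at_exit);
    }
}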
You can get the number of CPUs by using, for example, this solution, and then start the MPI processes by calling MPI_Comm_spawn. But you will need to have a separate executable file.
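A rough sketch of that idea, assuming sysconf(_SC_NPROCESSORS_ONLN) as one possible way to count cores (the linked solution is not reproduced here); "./worker" and the helper name are placeholders.
#include <unistd.h>
#include <mpi.h>
/* Spawn one worker per core beyond the current singleton process. */
MPI_Comm spawn_per_core(void)
{
    int ncores = (int)sysconf(_SC_NPROCESSORS_ONLN);
    MPI_Comm child_comm;
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, ncores - 1, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);
    return child_comm;
}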
I have an OpenCL kernel and I want to run it on all detected OpenCL-capable devices (like all available GPUs) on different systems. I'd be thankful to know if there is any straightforward method, for instance creating a single command queue for all devices.
Thanks in advance :]
You can't create a single command queue for all devices; a given command queue is tied to a single device. However, you can create separate command queues for each OpenCL device and feed them work, which should execute concurrently.
As Dithermaster points out, you first create a separate command queue for each device (for instance, you might have multiple GPUs). You can then place these in an array, e.g. here is a pointer to an array that you can set up:
cl_command_queue* commandQueues;
However, in my experience it has not always been a "slam dunk" to get the various command queues executing concurrently. This can be verified using event timing information (checking for overlap), which you can get through your own profiling or with third-party profiling tools. You should do this step anyway to verify what does or does not work on your setup.
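For reference, creating one queue per device with profiling enabled (so the event-timing check above is possible) might look roughly like this, using the classic OpenCL 1.x C API; error handling is omitted and the helper name is illustrative.
#include <stdlib.h>
#include <CL/cl.h>
/* One in-order command queue per device, with profiling timestamps enabled. */
cl_command_queue* make_queues(cl_context ctx, cl_device_id *devices,
                              cl_uint numDevices)
{
    cl_command_queue *commandQueues =
        malloc(numDevices * sizeof(cl_command_queue));
    for (cl_uint i = 0; i < numDevices; ++i)
        commandQueues[i] = clCreateCommandQueue(ctx, devices[i],
                                                CL_QUEUE_PROFILING_ENABLE, NULL);
    return commandQueues;
}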
An alternative approach which can work quite nicely is to use OpenMP to execute the command queues concurrently, e.g., you do something like:
#pragma omp parallel for default(shared)
for (int i = 0; i < numDevices; ++i) {
someOpenCLFunction(commandQueues[i], ....);
}
Suppose you have N devices and 100 elements of work (jobs). What you should do is something like this:
#define SIZE 3
std::vector<cl::CommandQueue> queues(SIZE); //One queue for each device (same context)
std::vector<cl::Kernel> kernels(SIZE); //One kernel for each device (same context)
std::vector<cl::Buffer> buf_in(SIZE), buf_out(SIZE); //One buffer set for each device (same context)
// Initialize the queues, kernels, buffers etc....
//Create the kernel, buffers and queues, then set the kernel[0] args to point to buf_in[0] and buf_out[0], and so on...
// Create the events in a finished state
std::vector<cl::Event> events;
cl::UserEvent ev; ev.setStatus(CL_COMPLETE);
for(int i=0; i<queues.size(); i++)
events.push_back(ev);
//Run all the elements (a "first empty, first run" scheduler)
for(int i=0; i<jobs.size(); i++){
bool found = false;
int x = -1;
//Try all the queues
while(!found){
for(int j=0; j<queues.size(); j++)
if(events[j].getInfo<CL_EVENT_COMMAND_EXECUTION_STATUS>() == CL_COMPLETE){
found = true;
x = j;
break;
}
if(!found) Sleep(50); //Sleep a while if not all the queues have completed, other options are possible (like assigning the job to a random one)
}
//Run it
events[x] = cl::Event(); //Clean it
queues[x].enqueueWriteBuffer(...); //Copy buf_in
queues[x].enqueueNDRangeKernel(kernels[x], .... ); //Launch the kernel
queues[x].enqueueReadBuffer(... , &events[x]); //Read buf_out
}
//Wait for completion
for(int i=0; i<queues.size(); i++)
queues[i].finish();
I have written my code for a single Xeon Phi node (with 61 cores on it). I have two files. I have called MPI_Init before calling any other MPI functions. I have found ntasks and rank, also using MPI calls. I have also included all the required libraries. Still I get an error. Can you please help me out with this?
In file 1:
int buffsize;
int *sendbuff,**recvbuff,buffsum;
int *shareRegion;
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
sendbuff=(int *)malloc(sizeof(int)*buffsize);
if( taskid == 0 ){
recvbuff=(int **)malloc(sizeof(int *)*ntasks);
recvbuff[0]=(int *)malloc(sizeof(int)*ntasks*buffsize);
for(i=1;i<ntasks;i++)recvbuff[i]=recvbuff[i-1]+buffsize;
}
else{
recvbuff=(int **)malloc(sizeof(int *)*1);
recvbuff[0]=(int *)malloc(sizeof(int)*1);
}
for(i=0;i<buffsize;i++){
sendbuff[i]=1;
MPI_Barrier(MPI_COMM_WORLD);
call(sendbuff, buffsize, shareRegion, recvbuff[0],buffsize,taskid,ntasks);
In file 2:
void* gInit( MPI_Comm comm, int size, int num_proc)
{
int share_mem = shm_open("share_region", O_CREAT|O_RDWR,0666 );
if( share_mem == -1)
return NULL;
int rank;
MPI_Comm_rank(comm,&rank);
if( ftruncate( share_mem, sizeof(int)*size*num_proc) == -1 )
return NULL;
int* shared = mmap(NULL, sizeof(int)*size*num_proc, PROT_WRITE | PROT_READ, MAP_SHARED, share_mem, 0);
if(shared == (void*)-1)
printf("error in mem allocation (mmap)\n");
*(shared+(rank)) = 0;
MPI_Barrier(MPI_COMM_WORLD);
return shared;
}
void call(int *sendbuff, int sendcount, volatile int *sharedRegion, int **recvbuff, int recvcount, int rank, int size)
{
int i=0;
int k,j;
j=rank*sendcount;
for(i=0;i<sendcount;i++)
{
sharedRegion[j] = sendbuff[i];
j++;
}
if( rank == 0)
for(k=0;k<size;k++)
for(i=0;i<sendcount;i++)
{
j=0;
recvbuff[k][i] = sharedRegion[j];
j++;
}
}
Then i am doing some computation in file 1 on this recvbuff.
I get this segmentation fault while using sharedRegion variable.
MPI implements the message-passing paradigm. That means processes (ranks) are isolated and generally run on a distributed machine. They communicate via explicit communication messages; recent versions also allow one-sided, but still explicit, data transfer. You cannot assume that shared memory is available to the processes. Have a look at any MPI tutorial to see how MPI is used.
Since you did not specify what kind of machine you are running on, any further suggestion is purely speculative. If you actually are on a shared memory machine, you may want to use a real shared memory paradigm instead, e.g. OpenMP.
While it's possible to restrict MPI to only use one machine and have shared memory (see the RMA chapter, especially in MPI-3), if you're only ever going to use one machine, it's easier to use some other paradigm.
However, if you're going to use multiple nodes and have multiple ranks on one node (multi-core processors, for example), then it might be worth taking a look at MPI-3 RMA to see how it can help you with both locally shared memory and remote memory access. There are multiple papers out on the subject, but because they're so new, there aren't a lot of good tutorials yet. You'll have to dig around a bit to find something useful to you.
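For a taste of the locally-shared-memory part, here is a hedged sketch using MPI-3's shared-memory windows (MPI_Win_allocate_shared / MPI_Win_shared_query); the function name and sizes are illustrative only.
#include <mpi.h>
void shared_window_example(int nints)
{
    /* Group the ranks that can share memory, i.e. those on the same node. */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    int noderank;
    MPI_Comm_rank(nodecomm, &noderank);
    /* Only node rank 0 contributes memory; the others attach to it. */
    MPI_Win win;
    int *base;
    MPI_Aint size = (noderank == 0) ? nints * sizeof(int) : 0;
    MPI_Win_allocate_shared(size, sizeof(int), MPI_INFO_NULL, nodecomm,
                            &base, &win);
    if (noderank != 0)
    {
        MPI_Aint qsize;
        int disp_unit;
        MPI_Win_shared_query(win, 0, &qsize, &disp_unit, &base);
    }
    /* base now points to the same shared segment on every rank of the node. */
    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
}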
The ordering of these two lines:
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
suggests that buffsize can have different values before and after the call to gInit. If the buffer size passed as the first program argument is larger than the (uninitialised) value buffsize holds when gInit is called, out-of-bounds memory access will occur later and lead to a segmentation fault.
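In other words, reading the buffer size before creating the mapping avoids sizing it with an uninitialised value:
buffsize = atoi(argv[1]);
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks);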
Hint: run your code as an MPI singleton (i.e. without mpirun) from inside a debugger (e.g. gdb), or change the limits so that core files are dumped on error (e.g. with ulimit -c unlimited) and then examine the core file(s) with the debugger. Compiling with debug information (e.g. adding -g to the compiler options) helps a lot in such cases.
Does the POSIX standard allow a named shared memory block to contain a mutex and condition variable?
We've been trying to use a mutex and condition variable to synchronise access to named shared memory by two processes on a LynuxWorks LynxOS-SE system (POSIX-conformant).
One shared memory block is called "/sync" and contains the mutex and condition variable, the other is "/data" and contains the actual data we are syncing access to.
We're seeing failures from pthread_cond_signal() if both processes don't perform the mmap() calls in exactly the same order, or if one process mmaps in some other piece of shared memory before it mmaps the "/sync" memory.
This example code is about as short as I can make it:
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/file.h>
#include <fcntl.h>    // O_CREAT, O_RDWR for shm_open
#include <unistd.h>   // ftruncate, sleep
#include <stdlib.h>
#include <pthread.h>
#include <errno.h>
#include <cassert>    // assert
#include <iostream>
#include <string>
using namespace std;
static const string shm_name_sync("/sync");
static const string shm_name_data("/data");
struct shared_memory_sync
{
pthread_mutex_t mutex;
pthread_cond_t condition;
};
struct shared_memory_data
{
int a;
int b;
};
//Create 2 shared memory objects
// - sync contains 2 shared synchronisation objects (mutex and condition)
// - data not important
void create()
{
// Create and map 'sync' shared memory
int fd_sync = shm_open(shm_name_sync.c_str(), O_CREAT|O_RDWR, S_IRUSR|S_IWUSR);
ftruncate(fd_sync, sizeof(shared_memory_sync));
void* addr_sync = mmap(0, sizeof(shared_memory_sync), PROT_READ|PROT_WRITE, MAP_SHARED, fd_sync, 0);
shared_memory_sync* p_sync = static_cast<shared_memory_sync*> (addr_sync);
// init the cond and mutex
pthread_condattr_t cond_attr;
pthread_condattr_init(&cond_attr);
pthread_condattr_setpshared(&cond_attr, PTHREAD_PROCESS_SHARED);
pthread_cond_init(&(p_sync->condition), &cond_attr);
pthread_condattr_destroy(&cond_attr);
pthread_mutexattr_t m_attr;
pthread_mutexattr_init(&m_attr);
pthread_mutexattr_setpshared(&m_attr, PTHREAD_PROCESS_SHARED);
pthread_mutex_init(&(p_sync->mutex), &m_attr);
pthread_mutexattr_destroy(&m_attr);
// Create the 'data' shared memory
int fd_data = shm_open(shm_name_data.c_str(), O_CREAT|O_RDWR, S_IRUSR|S_IWUSR);
ftruncate(fd_data, sizeof(shared_memory_data));
void* addr_data = mmap(0, sizeof(shared_memory_data), PROT_READ|PROT_WRITE, MAP_SHARED, fd_data, 0);
shared_memory_data* p_data = static_cast<shared_memory_data*> (addr_data);
// Run the second process while it sleeps here.
sleep(10);
int res = pthread_cond_signal(&(p_sync->condition));
assert(res==0); // <--- !!!THIS ASSERT WILL FAIL ON LYNXOS!!!
munmap(addr_sync, sizeof(shared_memory_sync));
shm_unlink(shm_name_sync.c_str());
munmap(addr_data, sizeof(shared_memory_data));
shm_unlink(shm_name_data.c_str());
}
//Open the same 2 shared memory objects but in reverse order
// - data
// - sync
void open()
{
sleep(2);
int fd_data = shm_open(shm_name_data.c_str(), O_RDWR, S_IRUSR|S_IWUSR);
void* addr_data = mmap(0, sizeof(shared_memory_data), PROT_READ|PROT_WRITE, MAP_SHARED, fd_data, 0);
shared_memory_data* p_data = static_cast<shared_memory_data*> (addr_data);
int fd_sync = shm_open(shm_name_sync.c_str(), O_RDWR, S_IRUSR|S_IWUSR);
void* addr_sync = mmap(0, sizeof(shared_memory_sync), PROT_READ|PROT_WRITE, MAP_SHARED, fd_sync, 0);
shared_memory_sync* p_sync = static_cast<shared_memory_sync*> (addr_sync);
// Wait on the condvar
pthread_mutex_lock(&(p_sync->mutex));
pthread_cond_wait(&(p_sync->condition), &(p_sync->mutex));
pthread_mutex_unlock(&(p_sync->mutex));
munmap(addr_sync, sizeof(shared_memory_sync));
munmap(addr_data, sizeof(shared_memory_data));
}
int main(int argc, char** argv)
{
if(argc>1)
{
open();
}
else
{
create();
}
return (0);
}
Run this program with no args, then another copy with args, and the first one will fail at the assert checking the pthread_cond_signal().
But change the order of the open() function to mmap() the "/sync" memory before the "/data" and it will all work fine.
This seems like a major bug in LynxOS to me, but LynuxWorks claim that using a mutex and condition variable within named shared memory in this way is not covered by the POSIX standard, so they are not interested.
Can anyone determine if this code does actually violate POSIX?
Or does anyone have any convincing documentation that it is POSIX compliant?
Edit: we know that PTHREAD_PROCESS_SHARED is POSIX and is supported by LynxOS. The area of contention is whether mutexes and semaphores can be used within named shared memory (as we have done) or if POSIX only allows them to be used when one process creates and mmaps the shared memory and then forks the second process.
The pthread_mutexattr_setpshared function may be used to allow a pthread mutex in shared memory to be accessed by any thread which has access to that memory, even threads in different processes. According to this link, pthread_mutexattr_setpshared conforms to POSIX P1003.1c. (The same goes for the condition variable, see pthread_condattr_setpshared.)
Related question: pthread condition variables on Linux, odd behaviour
I can easily see how PTHREAD_PROCESS_SHARED can be tricky to implement at the OS level (e.g. macOS doesn't support it, except for rwlocks, it seems). But just from reading the standard, you seem to have a case.
For completeness, you might want to assert on sysconf(_SC_THREAD_PROCESS_SHARED) and on the return values of the *_setpshared() calls; maybe there's another "surprise" waiting for you (but I can see from the comments that you already checked that SHARED is actually supported).
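Such checks could look roughly like this, reusing the m_attr and cond_attr attribute objects from the question's create() function (and assuming <unistd.h> and <cassert> are included):
assert(sysconf(_SC_THREAD_PROCESS_SHARED) > 0);  // option available at runtime
assert(pthread_mutexattr_setpshared(&m_attr, PTHREAD_PROCESS_SHARED) == 0);
assert(pthread_condattr_setpshared(&cond_attr, PTHREAD_PROCESS_SHARED) == 0);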
@JesperE: you might want to refer to the API docs at the OpenGroup instead of the HP docs.
Maybe there are pointers inside pthread_cond_t (without pshared), so you must place it at the same address in both threads/processes. With same-ordered mmaps you may get equal addresses in both processes.
In glibc, the pointer in cond_t was to the thread descriptor of the thread that owned the mutex/cond.
You can control the mapping address with a non-NULL first parameter to mmap.
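A sketch of that workaround, applied to the question's open() function; the hint address is hypothetical and platform-dependent, and the kernel is free to ignore it.
void* hint = (void*)0x7f0000000000;  // agreed-upon address used by both processes
void* addr_sync = mmap(hint, sizeof(shared_memory_sync),
                       PROT_READ|PROT_WRITE, MAP_SHARED, fd_sync, 0);
if (addr_sync != hint)
{
    // The hint was not honoured; MAP_FIXED would force it, but it silently
    // replaces any existing mapping at that address, so use it with care.
}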
I have C sources that must compile as 32-bit and 64-bit for multiple platforms.
A structure takes the address of a buffer, and I need to fit that address into a 32-bit value.
Obviously, where possible these structures will use natural-sized void * or char * pointers.
However, for some parts an API specifies the size of these pointers as 32 bits.
On x86_64 Linux with -m64 -mcmodel=small, both static data and malloc()'d data fit within the 2 GB range. Data on the stack, however, still starts in high memory.
So, given a small utility _to_32() such as:
int _to_32( long l ) {
int i = l & 0xffffffff;
assert( i == l );
return i;
}
then:
char *cp = malloc( 100 );
int a = _to_32( (long)cp );
will work reliably, as would:
static char buff[ 100 ];
int a = _to_32( (long)buff );
but:
char buff[ 100 ];
int a = _to_32( (long)buff );
will fail the assert().
Does anyone have a solution for this without writing custom linker scripts?
Or any ideas on how to arrange the linker section for stack data? It would appear it is being put in this section in the linker script:
.lbss :
{
*(.dynlbss)
*(.lbss .lbss.* .gnu.linkonce.lb.*)
*(LARGE_COMMON)
}
thanks!
The stack location is most likely specified by the operating system and has nothing to do with the linker.
I can't imagine why you are trying to force a pointer on a 64 bit machine into 32 bits. The memory layout of structures is mainly important when you are sharing the data with something which may run on another architecture and saving to a file or sending across a network, but there are almost no valid reasons that you would send a pointer from one computer to another. Debugging is the only valid reason that comes to mind.
Even storing a pointer to be used later by another run of your program on the same machine would almost certainly be wrong, since where your program is loaded can differ. Making any use of such a pointer would be undefined and unpredictable.
The short answer appears to be that there is no easy answer; at least there is no easy way to reassign the range/location of the stack pointer.
The loader 'ld-linux.so', at a very early stage in process activation, gets the address in the hurd loader; in the glibc sources, search elf/ and sysdeps/x86_64/ for elf_machine_load_address() and elf_machine_runtime_setup().
This happens in the preamble of calling your _start() entry and the related setup to call your main(). It is not for the faint-hearted; even I couldn't convince myself this was a safe route.
As it happens, the resolution presents itself in some other old-school tricks... pointer deflation/inflation...
With -mcmodel=small, automatic variables, alloca() addresses, and things like argv[] and envp are assigned from high memory, from where the stack will grow down. Those addresses are verified in this example code:
#include <stdlib.h>
#include <stdio.h>
#include <alloca.h>
extern char etext, edata, end;
char global_buffer[128];
int main( int argc, const char *argv[], const char *envp[] )
{
char stack_buffer[128];
static char static_buffer[128];
char *cp = malloc( 128 );
char *ap = alloca( 128 );
char *xp = "STRING CONSTANT";
printf("argv[0] %p\n",argv[0]);
printf("envp %p\n",envp);
printf("stack %p\n",stack_buffer);
printf("global %p\n",global_buffer);
printf("static %p\n",static_buffer);
printf("malloc %p\n",cp);
printf("alloca %p\n",ap);
printf("const %p\n",xp);
printf("printf %p\n",printf);
printf("First address past:\n");
printf(" program text (etext) %p\n", &etext);
printf(" initialized data (edata) %p\n", &edata);
printf(" uninitialized data (end) %p\n", &end);
}
produces this output:
argv[0] 0x7fff1e5e7d99
envp 0x7fff1e5e6c18
stack 0x7fff1e5e6a80
global 0x6010e0
static 0x601060
malloc 0x602010
alloca 0x7fff1e5e69d0
const 0x400850
printf 0x4004b0
First address past:
program text (etext) 0x400846
initialized data (edata) 0x601030
uninitialized data (end) 0x601160
All access to/from the 32-bit parts of structures must be wrapped with inflate() and deflate() routines, e.g.:
void *inflate( unsigned long );
unsigned int deflate( void *);
deflate() tests for bits set in the range 0x7fff00000000 and marks the pointer so that inflate() will recognize how to reconstitute the actual pointer.
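The bodies of these routines are not shown above, so here is only a speculative sketch of the idea as described; the tag bit, base capture, and offset scheme are assumptions, not the original code.
#include <assert.h>
#include <stdint.h>

#define HIGH_MASK 0x7fff00000000UL   /* range observed for stack/high addresses */
#define TAG_BIT   0x80000000U        /* marks a deflated high pointer */

static uintptr_t high_base;          /* captured once from a known high address */

unsigned int deflate( void *p ) {
    uintptr_t v = (uintptr_t)p;
    if( v & HIGH_MASK ) {            /* high address: store a tagged offset */
        uintptr_t off = v - high_base;
        assert( off < TAG_BIT );     /* offset must fit below the tag bit */
        return (unsigned int)( off | TAG_BIT );
    }
    assert( v <= 0xffffffffUL );     /* low address must already fit in 32 bits */
    return (unsigned int)v;
}

void *inflate( unsigned long v ) {
    if( v & TAG_BIT )                /* tagged: add the captured base back */
        return (void *)( high_base + ( v & ~(unsigned long)TAG_BIT ) );
    return (void *)v;
}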
Hope that helps anyone who similarly must support structures with 32-bit storage for 64-bit pointers.