Why am I getting a seg fault with ILOG CP Optimizer - constraints

I am brand new to constraint programming and am trying to model my first problem as a constraint program using the ILOG CP Optimizer. I've installed ILOG Optimization Suite 20.1 and have successfully compiled and run many of the included example files.
What I'd like to do is schedule 5 tasks on 2 units. Each task has a release time and a due time. Moreover, on each unit, each task has a processing time and a processing cost.
The following is my code.
#include <ilcp/cp.h>
#include <iostream>

using namespace std;

int main(int argc, const char * argv[])
{
    IloEnv env;
    try
    {
        IloModel model(env);
        IloInt numUnits = 2;
        IloInt numTasks = 5;
        IloIntervalVarArray dummyTasks(env, numTasks);
        IloInt taskReleaseTimes[] = {0, 21, 15, 37, 3};
        IloInt taskDueTimes[] = {200, 190, 172, 194, 161};
        IloArray<IloBoolArray> unitTaskAssignment(env, numUnits);
        unitTaskAssignment[0] = IloBoolArray(env, numTasks, IloTrue, IloTrue, IloFalse, IloTrue, IloTrue);
        unitTaskAssignment[1] = IloBoolArray(env, numTasks, IloTrue, IloFalse, IloTrue, IloFalse, IloTrue);
        IloArray<IloIntArray> unitTaskTimes(env, numUnits);
        IloArray<IloNumArray> unitTaskCosts(env, numUnits);
        IloIntArray minTaskTimes(env, numTasks);
        IloIntArray maxTaskTimes(env, numTasks);
        IloNumExpr totalCost(env);
        unitTaskTimes[0] = IloIntArray(env, numTasks, 51, 67, 0, 24, 76);
        unitTaskTimes[1] = IloIntArray(env, numTasks, 32, 0, 49, 0, 102);
        unitTaskCosts[0] = IloNumArray(env, numTasks, 3.1, 3.7, 0.0, 3.4, 3.6);
        unitTaskCosts[1] = IloNumArray(env, numTasks, 3.2, 0.0, 3.9, 0.0, 3.2);
        IloArray<IloIntervalVarArray> taskUnits(env, numTasks);
        IloArray<IloIntervalVarArray> unitTasks(env, numUnits);
        for(IloInt i = 0; i < numTasks; i++)
        {
            IloInt minTaskTime = unitTaskTimes[0][i];
            IloInt maxTaskTime = unitTaskTimes[0][i];
            for(IloInt j = 1; j < numUnits; j++)
            {
                if(unitTaskTimes[j][i] < minTaskTime) minTaskTime = unitTaskTimes[j][i];
                if(unitTaskTimes[j][i] > maxTaskTime) maxTaskTime = unitTaskTimes[j][i];
            }
            minTaskTimes[i] = minTaskTime;
            maxTaskTimes[i] = maxTaskTime;
            /* cout << "Minimum task time for task " << i << ": " << minTaskTimes[i] << endl;*/
            /* cout << "Maximum task time for task " << i << ": " << maxTaskTimes[i] << endl;*/
        }
        char name[128];
        for(IloInt i = 0; i < numTasks; i++)
        {
            taskUnits[i] = IloIntervalVarArray(env, numUnits);
            sprintf(name, "dummyTask_%ld", i);
            dummyTasks[i] = IloIntervalVar(env, minTaskTimes[i], maxTaskTimes[i]);
            dummyTasks[i].setStartMin(taskReleaseTimes[i]);
            dummyTasks[i].setEndMax(taskDueTimes[i]);
            for(IloInt j = 0; j < numUnits; j++)
            {
                sprintf(name, "task%ld_in_unit%ld", j, i);
                taskUnits[i][j] = IloIntervalVar(env, unitTaskTimes[j][i], name);
                taskUnits[i][j].setOptional();
                taskUnits[i][j].setStartMin(taskReleaseTimes[i]);
                taskUnits[i][j].setEndMax(taskDueTimes[i]);
                if(!unitTaskAssignment[j][i]) taskUnits[i][j].setAbsent();
                totalCost += unitTaskCosts[j][i]*IloPresenceOf(env, taskUnits[i][j]);
            }
            model.add(IloAlternative(env, dummyTasks[i], taskUnits[i]));
        }
        for(IloInt j = 0; j < numUnits; j++)
        {
            unitTasks[j] = IloIntervalVarArray(env, numTasks);
            for(IloInt i = 1; i < numTasks; i++)
            {
                unitTasks[j][i] = taskUnits[i][j];
            }
            model.add(IloNoOverlap(env, unitTasks[j]));
        }
        model.add(IloMinimize(env, totalCost));
        IloCP cp(model);
        cp.setParameter(IloCP::TimeLimit, 20);
        if (cp.solve())
        {
            cout << "There's a solution." << endl;
        }
        else
        {
            cp.out() << "No solution found. " << std::endl;
        }
    }
    catch (IloException & ex)
    {
        env.out() << "Caught " << ex << std::endl;
    }
    env.end();
    return 0;
}
Perhaps there is better logic that could be applied, and I'm happy to take suggestions on how to better model the problem, but that isn't really the question.
The question is, if I comment out the line
model.add(IloNoOverlap(env, unitTasks[j]));
everything compiles and runs fine, but if I leave it in, the program still compiles but seg faults at runtime. Why?
Here's the valgrind output if it helps:
==361075== Memcheck, a memory error detector
==361075== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==361075== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==361075== Command: ./CP_test
==361075==
! --------------------------------------------------- CP Optimizer 20.1.0.0 --
! Minimization problem - 15 variables, 5 constraints
! TimeLimit = 20
! Initial process time : 1.00s (0.99s extraction + 0.02s propagation)
! . Log search space : 18.3 (before), 18.3 (after)
! . Memory usage : 335.4 kB (before), 335.4 kB (after)
! Using parallel search with 8 workers.
! ----------------------------------------------------------------------------
! Best Branches Non-fixed W Branch decision
0 15 -
+ New bound is 17.30000
^C==361075==
==361075== Process terminating with default action of signal 2 (SIGINT)
==361075== at 0x4882A9A: futex_wait (futex-internal.h:141)
==361075== by 0x4882A9A: futex_wait_simple (futex-internal.h:172)
==361075== by 0x4882A9A: pthread_barrier_wait (pthread_barrier_wait.c:184)
==361075== by 0xCEFA06: IlcParallel::SynchronizedMaster::workerSynchronize(IlcParallel::ThreadWorker*) (in /home/nate/Dropbox/Princeton/Maravelias/DCA_Branch_and_Cut/cplex_implementation/CP_optimizer_test_instance/CP_test)
==361075== by 0x66F224: IlcParallelEngineI::SynchronizedMaster::workerSynchronize(IlcParallel::ThreadWorker*) (in /home/nate/Dropbox/Princeton/Maravelias/DCA_Branch_and_Cut/cplex_implementation/CP_optimizer_test_instance/CP_test)
==361075== by 0xCF3B60: IlcParallel::SynchronizedMaster::ThreadIO::workerWaitInput() (in /home/nate/Dropbox/Princeton/Maravelias/DCA_Branch_and_Cut/cplex_implementation/CP_optimizer_test_instance/CP_test)
==361075== by 0xCF4182: IlcParallel::ThreadWorker::startup() (in /home/nate/Dropbox/Princeton/Maravelias/DCA_Branch_and_Cut/cplex_implementation/CP_optimizer_test_instance/CP_test)
==361075== by 0xCF5929: IlcThread::CallStartup(void*) (in /home/nate/Dropbox/Princeton/Maravelias/DCA_Branch_and_Cut/cplex_implementation/CP_optimizer_test_instance/CP_test)
==361075== by 0x487A608: start_thread (pthread_create.c:477)
==361075== by 0x4D0A292: clone (clone.S:95)
==361075==
==361075== HEAP SUMMARY:
==361075== in use at exit: 1,161,688 bytes in 353 blocks
==361075== total heap usage: 942 allocs, 589 frees, 2,260,411 bytes allocated
==361075==
==361075== LEAK SUMMARY:
==361075== definitely lost: 19,728 bytes in 1 blocks
==361075== indirectly lost: 9,752 bytes in 13 blocks
==361075== possibly lost: 2,304 bytes in 8 blocks
==361075== still reachable: 1,129,904 bytes in 331 blocks
==361075== of which reachable via heuristic:
==361075== stdstring : 35 bytes in 1 blocks
==361075== suppressed: 0 bytes in 0 blocks
==361075== Rerun with --leak-check=full to see details of leaked memory
==361075==
==361075== For lists of detected and suppressed errors, rerun with: -s
==361075== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Any advice is appreciated. Please note that I attempted to post to the IBM forum, but I apparently don't have permission for some reason.

You have a segmentation fault because unitTasks[j][0] is an empty (null) handle: your array-filling loop starts at index 1, so slot 0 of each unitTasks[j] is never assigned, and IloNoOverlap then trips over it.
Also, you should compile in debug mode (without -DNDEBUG); you would then get an assertion failure instead of a crash when the empty handle is used.
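A minimal sketch of the corrected loop, keeping the question's variable names:
for (IloInt j = 0; j < numUnits; j++)
{
    unitTasks[j] = IloIntervalVarArray(env, numTasks);
    for (IloInt i = 0; i < numTasks; i++)   // was i = 1, which left slot 0 unset
    {
        unitTasks[j][i] = taskUnits[i][j];
    }
    model.add(IloNoOverlap(env, unitTasks[j]));
}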

Related

What does MPI_File_open do?

I can't understand what MPI_File_open and the related MPI_File_seek and MPI_File_read do. Can someone help me?
With MPI_File_open, is the file read simultaneously by all processes when I run the program? For example, if I specify mpirun -n 4 and so on, is the file read by all four processes?
This is the code:
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, image, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
for (i = 1 ; i <= rows ; i++) {
    MPI_File_seek(fh, 3*(start_row + i-1) * width + 3*start_col, MPI_SEEK_SET);
    tmpbuf = offset(src, i, 3, cols*3+6);
    MPI_File_read(fh, tmpbuf, cols*3, MPI_BYTE, &status);
}
MPI_File_close(&fh);
Is there a way I can turn this into OpenMP or otherwise optimize it? I tried to modify the code like this:
#pragma omp parallel for num_threads(2)
for (i = 1 ; i <= rows ; i++) {
    MPI_File_seek(fh, 3*(start_row + i-1) * width + 3*start_col, MPI_SEEK_SET);
    tmpbuf = offset(src, i, 3, cols*3+6);
    MPI_File_read(fh, tmpbuf, cols*3, MPI_BYTE, &status);
}
Running the program with this modification, I don't get any speedup; it runs in the same time as the MPI-only code. What am I doing wrong?
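One detail worth checking (an observation on the snippet above, with MPI_File_read_at as an assumed replacement, not the original code): MPI_File_seek followed by MPI_File_read uses the process's individual file pointer, so two OpenMP threads in the same process race on that shared pointer. A sketch that passes the offset explicitly instead:
// Sketch only; assumes fh, rows, cols, width, start_row, start_col, src and the
// helper offset() from the question. MPI_File_read_at takes an explicit offset,
// so it never touches the per-process file pointer shared between threads.
// Calling MPI from several threads also requires MPI_Init_thread with
// MPI_THREAD_MULTIPLE.
#pragma omp parallel for num_threads(2)
for (int i = 1; i <= rows; i++) {
    MPI_Status status;   // per-thread status
    MPI_Offset off = (MPI_Offset)3 * (start_row + i - 1) * width + 3 * start_col;
    void *tmpbuf = offset(src, i, 3, cols*3 + 6);
    MPI_File_read_at(fh, off, tmpbuf, cols*3, MPI_BYTE, &status);
}
Even with that fixed, the loop is I/O-bound, so two threads reading the same file may not run measurably faster than one.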

Writing to Global Memory Causing Crash in OpenCL in For Loop

One of my OpenCL helper functions writes to global memory. In one place the write runs just fine and the kernel executes normally, but the same write placed directly after a later line freezes/crashes the kernel, and my program can't function.
The values in this function change (different values for an NDRange of 2^16), and therefore the loops change as well, and not all threads can execute the same code because of the conditionals.
Why exactly is this an issue? Am I missing some kind of memory blocking or something?
void add_world_seeds(yada yada yada...., const uint global_id, __global long* world_seeds)
{
    for (; indexer < (1 << 16); indexer += increment) {
        long k = (indexer << 16) + c;
        long target2 = (k ^ e) >> 16;
        long second_addend = get_partial_addend(k, x, z) & MASK_16;
        if (ctz(target2 - second_addend) < mult_trailing_zeroes) { continue; }
        long a = (((first_mult_inv * (target2 - second_addend)) >> mult_trailing_zeroes) ^ (J1_MUL >> 32)) & mask;
        for (; a < (1 << 16); a += increment) {
            world_seeds[global_id] = (a << 32) + k; //WORKS HERE
            if (get_population_seed((a << 32) + k, x, z) != population_seed_state) { continue; }
            world_seeds[global_id] = (a << 32) + k; //DOES NOT WORK HERE
        }
    }
}
There was in fact a bug causing the undefined behavior: the main reversal kernel took an argument called "increment", and inside that same kernel I declared another variable called increment. It compiled fine but led to completely wrong results and memory crashes.
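To illustrate the kind of shadowing that was going on (hypothetical names, not the actual kernel): a declaration in an inner block hides the parameter of the same name, compiles cleanly, and quietly changes the loop stride.
// Hypothetical sketch of parameter shadowing: compiles, behaves wrongly.
void fill(long *out, unsigned gid, long increment)   // caller passes, say, 4
{
    {
        long increment = 1;                          // shadows the parameter in this block
        for (long a = 0; a < 16; a += increment)     // stride is 1 here, not 4
            out[gid] = a;
    }
}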

MPI Send and receive a pointer in MPI_Type_struct

I want to send a set of data with MPI_Type_struct, and one of its fields is a pointer to an array (the matrices I'm going to use will be very large, so I need malloc). The problem is that all the data is passed correctly except the matrix. I know it is possible to pass the matrix through the pointer, because if I send only the pointer of the matrix, correct results are observed.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int size, rank;
    int m,n;
    m=n=2;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    typedef struct estruct
    {
        float *array;
        int sizeM, sizeK, sizeN, rank_or;
    } ;
    struct estruct kernel, server;

    MPI_Datatype types[5] = {MPI_FLOAT, MPI_INT,MPI_INT,MPI_INT,MPI_INT};
    MPI_Datatype newtype;
    int lengths[5] = {n*m,1,1,1,1};
    MPI_Aint displacements[5];
    displacements[0] = (size_t) & (kernel.array[0]) - (size_t)&kernel;
    displacements[1] = (size_t) & (kernel.sizeM) - (size_t)&kernel;
    displacements[2] = (size_t) & (kernel.sizeK) - (size_t)&kernel;
    displacements[3] = (size_t) & (kernel.sizeN) - (size_t)&kernel;
    displacements[4] = (size_t) & (kernel.rank_or) - (size_t)&kernel;
    MPI_Type_struct(5, lengths, displacements, types, &newtype);
    MPI_Type_commit(&newtype);

    if (rank == 0)
    {
        kernel.array = (float *)malloc(m * n * sizeof(float));
        for(int i = 0; i < m*n; i++) kernel.array[i] = i;
        kernel.sizeM = 5;
        kernel.sizeK = 5;
        kernel.sizeN = 5;
        kernel.rank_or = 5;
        MPI_Send(&kernel, 1, newtype, 1, 0, MPI_COMM_WORLD);
    }
    else
    {
        server.array = (float *)malloc(m * n * sizeof(float));
        MPI_Recv(&server, 1, newtype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%i \n", server.sizeM);
        printf("%i \n", server.sizeK);
        printf("%i \n", server.sizeN);
        printf("%i \n", server.rank_or);
        for(int i = 0; i < m*n; i++) printf("%f\n",server.array[i]);
    }
    MPI_Finalize();
}
Assuming that only two processes are executed, I expect the process with rank = 1 to receive and display the correct elements of the matrix (the other fields are received correctly), but the actual output is:
5
5
5
5
0.065004
0.000000
0.000000
0.000000
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 26206 RUNNING AT pmul
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
I hope someone can help me.
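For reference, a common workaround sketch (my own illustration using the question's variable names, not a verified fix): a derived datatype built from offsets inside the struct cannot describe the heap buffer that the array member points to, so the scalar fields and the matrix have to be communicated from their own addresses, for example as two messages:
// Sketch: send the scalar fields from the struct and the matrix from
// kernel.array itself, since the heap buffer is not part of the struct's memory.
if (rank == 0) {
    int meta[4] = { kernel.sizeM, kernel.sizeK, kernel.sizeN, kernel.rank_or };
    MPI_Send(meta, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    MPI_Send(kernel.array, m * n, MPI_FLOAT, 1, 1, MPI_COMM_WORLD);
} else {
    int meta[4];
    MPI_Recv(meta, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    server.sizeM = meta[0]; server.sizeK = meta[1];
    server.sizeN = meta[2]; server.rank_or = meta[3];
    MPI_Recv(server.array, m * n, MPI_FLOAT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
Alternatively, MPI_Get_address together with MPI_BOTTOM can be used to build a datatype with absolute addresses, but that type has to be rebuilt whenever the buffer is reallocated.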

CUFFT wrong result for batch 1D ifft

I am new to CUDA and cuFFT. When I try to recover the FFT result of cufftExecR2C(...) by applying the corresponding cufftExecC2R(...), it goes wrong: the recovered data and the original data are not identical.
Here is the code; the CUDA library I used was cuda-9.0.
#include "device_launch_parameters.h"
#include "cuda_runtime.h"
#include "cuda.h"
#include "cufft.h"
#include <iostream>
#include <sys/time.h>
#include <cstdio>
#include <cmath>
using namespace std;
// cuda error check
#define gpuErrchk(ans) {gpuAssrt((ans), __FILE__, __LINE__);}
inline void gpuAssrt(cudaError_t code, const char* file, int line, bool abort=true) {
if (code != cudaSuccess) {
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) {
exit(code);
}
}
}
// ifft scale for cufft
__global__ void IFFTScale(int scale_, cufftReal* real) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
real[idx] *= 1.0 / scale_;
}
void batch_1d_irfft2_test() {
const int BATCH = 3;
const int DATASIZE = 4;
/// RFFT
// --- Host side input data allocation and initialization
cufftReal *hostInputData = (cufftReal*)malloc(DATASIZE*BATCH*sizeof(cufftReal));
for (int i = 0; i < BATCH; ++ i) {
for (int j = 0; j < DATASIZE; ++ j) {
hostInputData[i * DATASIZE + j] = (cufftReal)(i * DATASIZE + j + 1);
}
}
// DEBUG:print host input data
cout << "print host input data" << endl;
for (int i = 0; i < BATCH; ++ i) {
for (int j = 0; j < DATASIZE; ++ j) {
cout << hostInputData[i * DATASIZE + j] << ", ";
}
cout << endl;
}
cout << "=====================================================" << endl;
// --- Device side input data allocation and initialization
cufftReal *deviceInputData;
gpuErrchk(cudaMalloc((void**)&deviceInputData, DATASIZE * BATCH * sizeof(cufftReal)));
// --- Device side output data allocation
cufftComplex *deviceOutputData;
gpuErrchk(cudaMalloc(
(void**)&deviceOutputData,
(DATASIZE / 2 + 1) * BATCH * sizeof(cufftComplex)));
// Host sice input data copied to Device side
cudaMemcpy(deviceInputData,
hostInputData,
DATASIZE * BATCH * sizeof(cufftReal),
cudaMemcpyHostToDevice);
// --- Batched 1D FFTs
cufftHandle handle;
int rank = 1; // --- 1D FFTs
int n[] = {DATASIZE}; // --- Size of the Fourier transform
int istride = 1, ostride = 1; // --- Distance between two successive input/output elements
int idist = DATASIZE, odist = DATASIZE / 2 + 1; // --- Distance between batches
int inembed[] = { 0 }; // --- Input size with pitch (ignored for 1D transforms)
int onembed[] = { 0 }; // --- Output size with pitch (ignored for 1D transforms)
int batch = BATCH; // --- Number of batched executions
cufftPlanMany(
&handle,
rank,
n,
inembed, istride, idist,
onembed, ostride, odist,
CUFFT_R2C,
batch);
cufftExecR2C(handle, deviceInputData, deviceOutputData);
// **************************************************************************
/// IRFFT
cufftReal *deviceOutputDataIFFT;
gpuErrchk(cudaMalloc((void**)&deviceOutputDataIFFT, DATASIZE * BATCH * sizeof(cufftReal)));
// --- Batched 1D IFFTs
cufftHandle handleIFFT;
int n_ifft[] = {DATASIZE / 2 + 1}; // --- Size of the Fourier transform
idist = DATASIZE / 2 + 1; odist = DATASIZE; // --- Distance between batches
cufftPlanMany(
&handleIFFT,
rank,
n_ifft,
inembed, istride, idist,
onembed, ostride, odist,
CUFFT_C2R,
batch);
cufftExecC2R(handleIFFT, deviceOutputData, deviceOutputDataIFFT);
/* scale
// dim3 dimGrid(512);
// dim3 dimBlock(max((BATCH * DATASIZE + 512 - 1) / 512, 1));
// IFFTScale<<<dimGrid, dimBlock>>>((DATASIZE - 1) * 2, deviceOutputData);
*/
// host output data for ifft
cufftReal *hostOutputDataIFFT = (cufftReal*)malloc(DATASIZE*BATCH*sizeof(cufftReal));
cudaMemcpy(hostOutputDataIFFT,
deviceOutputDataIFFT,
DATASIZE * BATCH * sizeof(cufftReal),
cudaMemcpyDeviceToHost);
// print IFFT recovered host output data
cout << "print host output IFFT data" << endl;
for (int i=0; i<BATCH; i++) {
for (int j=0; j<DATASIZE; j++) {
cout << hostOutputDataIFFT[i * DATASIZE + j] << ", ";
}
printf("\n");
}
cufftDestroy(handle);
gpuErrchk(cudaFree(deviceOutputData));
gpuErrchk(cudaFree(deviceInputData));
gpuErrchk(cudaFree(deviceOutputDataIFFT));
free(hostOutputDataIFFT);
free(hostInputData);
}
int main() {
batch_1d_irfft2_test();
return 0;
}
I compile the 'rfft_test.cu' file with nvcc -o rfft_test rfft_test.cu -lcufft. The result is as below:
print host input data
1, 2, 3, 4,
5, 6, 7, 8,
9, 10, 11, 12,
=====================================================
print IFFT recovered host output data
6, 8.5359, 15.4641, 0,
22, 24.5359, 31.4641, 0,
38, 40.5359, 47.4641, 0,
Specifically, to check the scaling issue with cufftExecC2R(...), I commented out the IFFTScale() kernel. With that, I assumed the recovered output would be DATASIZE * the original batched 1D data, but even so, the result is not as expected.
I have checked the cuFFT manual and my code several times, and I have also searched NVIDIA forums and StackOverflow answers, but I didn't find any solution. Any help is greatly appreciated.
Thanks in advance.
The size of your inverse transform is incorrect: it should be DATASIZE, not DATASIZE/2+1.
The following sections of the cuFFT docs should help:
https://docs.nvidia.com/cuda/cufft/index.html#data-layout
https://docs.nvidia.com/cuda/cufft/index.html#multi-dimensional
"In C2R mode an input array ( x 1 , x 2 , … , x ⌊ N 2 ⌋ + 1 ) of only non-redundant complex elements is required." - N is transform size you pass to plan function

MPI_Wtime timer runs about 2 times faster in OpenMPI 2.0.2

After updating OpenMPI from 1.8.4 to 2.0.2 I ran into erroneous time measurements with MPI_Wtime(). With version 1.8.4 the result was the same as that returned by the omp_get_wtime() timer; now MPI_Wtime runs about 2 times faster.
What can cause such a behaviour?
My sample code:
#include <omp.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
int some_work(int rank, int tid){
    int count = 10000;
    int arr[count];
    for( int i=0; i<count; i++)
        arr[i] = i + tid + rank;
    for( int val=0; val<4000000; val++)
        for(int i=0; i<count-1; i++)
            arr[i] = arr[i+1];
    return arr[0];
}

int main (int argc, char *argv[]) {
    MPI_Init(NULL, NULL);
    int rank, size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("there are %d mpi processes\n", size);
    MPI_Barrier(MPI_COMM_WORLD);
    double omp_time1 = omp_get_wtime();
    double mpi_time1 = MPI_Wtime();
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        if ( tid == 0 ) {
            int nthreads = omp_get_num_threads();
            printf("There are %d threads for process %d\n", nthreads, rank);
            int result = some_work(rank, tid);
            printf("result for process %d thread %d is %d\n", rank, tid, result);
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);
    double mpi_time2 = MPI_Wtime();
    double omp_time2 = omp_get_wtime();
    printf("process %d omp time: %f\n", rank, omp_time2 - omp_time1);
    printf("process %d mpi time: %f\n", rank, mpi_time2 - mpi_time1);
    printf("process %d ratio: %f\n", rank, (mpi_time2 - mpi_time1)/(omp_time2 - omp_time1) );
    MPI_Finalize();
    return EXIT_SUCCESS;
}
Compiling
g++ -O3 src/example_main.cpp -o bin/example -fopenmp -I/usr/mpi/gcc/openmpi-2.0.2/include -L /usr/mpi/gcc/openmpi-2.0.2/lib -lmpi
And running
salloc -N2 -n2 mpirun --map-by ppr:1:node:pe=16 bin/example
Gives something like
there are 2 mpi processes
There are 16 threads for process 0
There are 16 threads for process 1
result for process 1 thread 0 is 10000
result for process 0 thread 0 is 9999
process 1 omp time: 5.066794
process 1 mpi time: 10.098752
process 1 ratio: 1.993125
process 0 omp time: 5.066816
process 0 mpi time: 8.772390
process 0 ratio: 1.731342
The ratio is not as consistent as I first wrote, but it is still large.
Results for OpenMPI 1.8.4 are OK:
g++ -O3 src/example_main.cpp -o bin/example -fopenmp -I/usr/mpi/gcc/openmpi-1.8.4/include -L /usr/mpi/gcc/openmpi-1.8.4/lib -lmpi -lmpi_cxx
Gives
result for process 0 thread 0 is 9999
result for process 1 thread 0 is 10000
process 0 omp time: 4.655244
process 0 mpi time: 4.655232
process 0 ratio: 0.999997
process 1 omp time: 4.655335
process 1 mpi time: 4.655321
process 1 ratio: 0.999997
I've got similar behavior on my cluster (same OpenMPI version as yours, 2.0.2), and the problem was the default governor for the CPU frequencies, the 'conservative' one.
Once I set the governor to 'performance', the output of MPI_Wtime() aligned with the correct timings (the output of 'time', in my case).
It appears that, for some older Xeon processors (like the Xeon E5620), some clocking function becomes skewed when overly aggressive policies for dynamic frequency adjustment are used; the same OpenMPI version does not suffer from this problem on newer Xeons within the same cluster.
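As a quick sanity check (my own suggestion, not part of the original answers), you can compare MPI_Wtime() against std::chrono::steady_clock over a fixed sleep on the affected nodes; a ratio far from 1 means the MPI clock itself is skewed regardless of what the benchmark does.
// Hedged diagnostic sketch: compare MPI_Wtime() with the C++ steady clock.
#include <mpi.h>
#include <chrono>
#include <thread>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    double t0 = MPI_Wtime();
    auto c0 = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(std::chrono::seconds(2));   // known wall-clock interval
    double mpi_elapsed = MPI_Wtime() - t0;
    double chrono_elapsed =
        std::chrono::duration<double>(std::chrono::steady_clock::now() - c0).count();
    std::printf("MPI_Wtime: %f s, steady_clock: %f s, ratio: %f\n",
                mpi_elapsed, chrono_elapsed, mpi_elapsed / chrono_elapsed);
    MPI_Finalize();
    return 0;
}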
Maybe MPI_Wtime() is a costly operation in itself?
Do the results get more consistent if you avoid measuring the time consumed by MPI_Wtime() as part of the OpenMP time?
E.g.:
double mpi_time1 = MPI_Wtime();
double omp_time1 = omp_get_wtime();
/* do something */
double omp_time2 = omp_get_wtime();
double mpi_time2 = MPI_Wtime();
