OpenMP and MPI hybrid program

OpenMP and MPI hybrid program - mpi

I have a machine with 8 processors. I want to alternate using OpenMP and MPI on my code like this:
OpenMP phase:
ranks 1-7 wait on a MPI_Barrier
rank 0 uses all 8 processors with OpenMP
MPI phase:
rank 0 reaches barrier and all ranks use one processor each
So far, I've done:
set I_MPI_WAIT_MODE 1 so that ranks 1-7 don't use the CPU while on the barrier.
set omp_set_num_threads(8) on rank 0 so that it launches 8 OpenMP threads.
It all worked. Rank 0 did launch 8 threads, but all are confined to one processor. On the OpenMP phase I get 8 threads from rank 0 running on one processor and all other processors are idle.
How do I tell MPI to allow rank 0 to use the other processors? I am using Intel MPI, but could switch to OpenMPI or MPICH if needed.

The following code shows an example on how to save the CPU affinity mask before the OpenMP part, alter it to allow all CPUs for the duration of the parallel region and then restore the previous CPU affinity mask. The code is Linux specific and it makes no sense if you do not enable process pinning by the MPI library - activated by passing --bind-to-core or --bind-to-socket to mpiexec in Open MPI; deactivated by setting I_MPI_PIN to disable in Intel MPI (the default on 4.x is to pin processes).
#define _GNU_SOURCE
#include <sched.h>
...
cpu_set_t *oldmask, *mask;
size_t size;
int nrcpus = 256; // 256 cores should be more than enough
int i;
// Save the old affinity mask
oldmask = CPU_ALLOC(nrcpus);
size = CPU_ALLOC_SIZE(nrcpus);
CPU_ZERO_S(size, oldmask);
if (sched_getaffinity(0, size, oldmask) == -1) { error }
// Temporary allow running on all processors
mask = CPU_ALLOC(nrcpus);
for (i = 0; i < nrcpus; i++)
CPU_SET_S(i, size, mask);
if (sched_setaffinity(0, size, mask) == -1) { error }
#pragma omp parallel
{
}
CPU_FREE(mask);
// Restore the saved affinity mask
if (sched_setaffinity(0, size, oldmask) == -1) { error }
CPU_FREE(oldmask);
...
You can also tweak the pinning arguments of the OpenMP run-time. For GCC/libgomp the affinity is controlled by the GOMP_CPU_AFFINITY environment variable, while for Intel compilers it is KMP_AFFINITY. You can still use the code above if the OpenMP run-time intersects the supplied affinity mask with that of the process.
Just for the sake of completeness - saving, setting and restoring the affinity mask on Windows:
#include <windows.h>
...
HANDLE hCurrentProc, hDupCurrentProc;
DWORD_PTR dwpSysAffinityMask, dwpProcAffinityMask;
// Obtain a usable handle of the current process
hCurrentProc = GetCurrentProcess();
DuplicateHandle(hCurrentProc, hCurrentProc, hCurrentProc,
&hDupCurrentProc, 0, FALSE, DUPLICATE_SAME_ACCESS);
// Get the old affinity mask
GetProcessAffinityMask(hDupCurrentProc,
&dwpProcAffinityMask, &dwpSysAffinityMask);
// Temporary allow running on all CPUs in the system affinity mask
SetProcessAffinityMask(hDupCurrentProc, &dwpSysAffinityMask);
#pragma omp parallel
{
}
// Restore the old affinity mask
SetProcessAffinityMask(hDupCurrentProc, &dwpProcAffinityMask);
CloseHandle(hDupCurrentProc);
...
Should work with a single processor group (up to 64 logical processors).

Thanks all for the comments and answers. You are all right. It's all about the "PIN" option.
To solve my problem, I just had to:
I_MPI_WAIT_MODE=1
I_MPI_PIN_DOMAIN=omp
Simple as that. Now all processors are available to all ranks.
The option
I_MPI_DEBUG=4
shows which processors each rank gets.

Related

Equivalent of cudaSetDevice in OpenCL?

I have a function that I wrote for 1 gpu, and it runs for 10 seconds with one set of args, and I have a very long list of args to go through. I would like to use both my AMD gpus, so I have some wrapper code that launches 2 threads, and runs my function on thread 0 with an argument gpu_idx 0 and on thread 1 with an argument gpu_idx 1.
I have a cuda version for another machine, and I just run checkCudaErrors(cudaSetDevice((unsigned int)device_id)); to get my desired behavior.
With openCL I have tried to do the following:
void createDevice(int device_idx)
{
cl_device_id *devices;
ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
HANDLE_CLERROR_G(ret);
ret = clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_ALL, 0, NULL, &ret_num_devices);
HANDLE_CLERROR_G(ret);
devices = (cl_device_id*)malloc(ret_num_devices*sizeof(cl_device_id));
ret = clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_ALL, ret_num_devices, devices, &ret_num_devices);
HANDLE_CLERROR_G(ret);
if (device_idx >= ret_num_devices)
{
fprintf(stderr, "Found %i devices but asked for device at index %i\n", ret_num_devices, device_idx);
exit(1);
}
device_id = devices[device_idx];
// usleep(((unsigned int)(500000*(1-device_idx)))); // without this line multithreaded 2 gpu execution does not work.
context = clCreateContext( NULL, 1, &device_id, NULL, NULL, &ret);
HANDLE_CLERROR_G(ret);
}
context is a static variable in my *c file that I then use later again when I create the kernel.
This code works when I run only with device_idx 0, or only with device_idx 1, and even if I manually in two terminal windows run the executable "simultaneously" with device_idx 0 and device_idx 1.
BUT, there is something about the threads being "too" concurrent that prevents this code from working. In fact, depending on the amount of sleep (commented above), I get different behavior (sometimes both threads do work on gpu 0, sometimes both threads do work on gpu 1, sometimes threads are balanced on both gpus). If I sleep for too little time, I either get: CL_INVALID_CONTEXT and if I don't sleep at all I get CL_INVALID_KERNEL_NAME.
Like I said, I don't get any errors when running on gpu 0 or gpu 1 alone, only when spawning multiple threads that call this code (as an *so with an extern C function from go) simultaneously with device_idx 0 in thread 0 and device_idx 1 in thread 1.
How can I solve my problem? I am attached to the idea that I have an executable that works on 1 gpu, for which I specify which gpu, and that specification should be respected.
What is the proper way to pick the device when both devices need to be used, one completely separate from the other?

Whoops! Instead of saving device_id into a static variable I started returning from the above code and using it as a local variable, and everything works as expected, and is now thread safe.

Why doesn't OpenCL Nvidia compiler (nvcc) use the registers twice?

I'm doing a small OpenCL benchmark using Nvidia drivers,
my kernel performs 1024 fuse multiply-adds and store the result in an array:
#define FLOPS_MACRO_1(x) { (x) = (x) * 0.99f + 10.f; } // Multiply-add
#define FLOPS_MACRO_2(x) { FLOPS_MACRO_1(x) FLOPS_MACRO_1(x) }
#define FLOPS_MACRO_4(x) { FLOPS_MACRO_2(x) FLOPS_MACRO_2(x) }
#define FLOPS_MACRO_8(x) { FLOPS_MACRO_4(x) FLOPS_MACRO_4(x) }
// more recursive macros ...
#define FLOPS_MACRO_1024(x) { FLOPS_MACRO_512(x) FLOPS_MACRO_512(x) }
__kernel void ocl_Kernel_FLOPS(int iNbElts, __global float *pf)
{
for (unsigned i = get_global_id(0); i < iNbElts; i += get_global_size(0))
{
float f = (float) i;
FLOPS_MACRO_1024(f)
pf[i] = f;
}
}
But when I look in the PTX generated, I see this:
.entry ocl_Kernel_FLOPS(
.param .u32 ocl_Kernel_FLOPS_param_0,
.param .u32 .ptr .global .align 4 ocl_Kernel_FLOPS_param_1
)
{
.reg .f32 %f<1026>; // 1026 float registers !
.reg .pred %p<3>;
.reg .s32 %r<19>;
ld.param.u32 %r1, [ocl_Kernel_FLOPS_param_0];
// some more code unrelated to the problem
// ...
BB1_1:
and.b32 %r13, %r18, 65535;
cvt.rn.f32.u32 %f1, %r13;
fma.rn.f32 %f2, %f1, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f3, %f2, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f4, %f3, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f5, %f4, 0f3F7D70A4, 0f41200000;
// etc
// ...
If I am correct, the PTX uses 1026 float registers to perform the 1024 operations and never reuse a register twice even if it could perform all the multiply-add operations using only 2 registers. 1026 is far above the maximum number of registers a thread is allow to have (according to the specs), so I guess this ends up in memory spilling.
Is it a compiler bug or am I totally missing something ?
I am using nvcc version 6.5 on a Quadro K2000 GPU.
EDIT
Actually I did miss something in the specs:
"Since PTX supports virtual registers, it is quite common for a compiler frontend to generate
a large number of register names. Rather than require explicit declaration of every name,
PTX supports a syntax for creating a set of variables having a common prefix string
appended with integer suffixes. For example, suppose a program uses a large number, say
one hundred, of .b32 variables, named %r0, %r1, ..., %r99"

The PTX file format is intended to describe a virtual machine and instruction set architecture:
PTX defines a virtual machine and ISA for general purpose parallel thread execution. PTX programs are translated at install time to the target hardware instruction set. The PTX-to-GPU translator and driver enable NVIDIA GPUs to be used as programmable parallel computers.
So the PTX output that you are obtaining there is not a form of "GPU assembler". It is only an intermediate representation, intended to be capable of describing virtually any form of parallel computation.
The PTX representation is then compiled into actual binaries for the respective target GPU. This is important in order to be possible to abstract from the actual architecture - specifically, regarding your example: It should be possible to use the same PTX representation of a program, regardless of the number of registers that are available on a specific target machine. The 1026 "registers" that you see there are "virtual" registers, and in the end, may be mapped to the (few) real hardware registers that are actually available. You may add the --ptxas-options=-v argument to the NVCC during the compilation to obtain addition information about the register usage.
(This is roughly the same idea as that behind the LLVM - namely, to have a representation that can be optimized on and argued about, both abstracting from the original source code and from the actual target architecture).

Using a barrier causes a CL_INVALID_WORK_GROUP_SIZE error

If I use a barrier (no matter if CLK_LOCAL_MEM_FENCE or CLK_GLOBAL_MEM_FENCE) in my kernel, it causes a CL_INVALID_WORK_GROUP_SIZE error. The global work size is 512, the local work size is 128, 65536 items have to be computed, the max work group size of my device is 1024, I am using only one dimension. For Java bindings I use JOCL.
The kernel is very simple:
kernel void sum(global float *input, global float *output, const int numElements, local float *localCopy
{
localCopy[get_local_id(0)] = grid[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE); // or barrier(CLK_GLOBAL_MEM_FENCE)
}
I run the kernel on the Intel(R) Xeon(R) CPU X5570 # 2.93GHz and can use OpenCL 1.2. The calling method looks like
kernel.putArg(aCLBuffer).putArg(bCLBuffer).putArg(elementCount).putNullArg(localWorkSize);
queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize);
But the error is always the same:
[...]can not enqueue 1DRange CLKernel [...] with gwo: null gws: {512} lws: {128}
cond.: null events: null [error: CL_INVALID_WORK_GROUP_SIZE]
What I am doing wrong?

This is expected behaviour on some OpenCL platforms. For example, on my Apple system, the CPU device has a maximum work-group size of 1024. However, if a kernel has a barrier inside, then the maximum work-group size for that specific kernel is reduced to 1.
You can query the maximum work-group size for a specific kernel by using the clGetKernelWorkGroupInfo function with the CL_KERNEL_WORK_GROUP_SIZE parameter. The value returned will be no more than the value returned by clGetDeviceInfo and CL_DEVICE_MAX_WORK_GROUP_SIZE, but is allowed to be less (as it is in this case).

How to measure the amount of memory or RAM consumed by a code on Arduino Mega or Due

Can anybody tell me how to measure the consumed RAM for a particular code running on Arduino Mega or Due.

There is two kinds of numbers to this question:
Global static usage and current run time.
The static estimated usage can be determined by adding the following line to (if it does not already exist)
.\arduino-1.5.5\hardware\arduino\avr\boards.txt
uno.upload.maximum_ram_size=2048
This then allows the compiler to output the additional 2nd line in the following example in the IDE's result window
Binary sketch size: 25,880 bytes (of a 32,256 byte maximum)
Estimated used SRAM memory: 990 bytes (of a 2048 byte maximum)
To see the amount of memory used at any given point. Including memory space currently in use, that exists while only in functions and members. This includes the HEAP and such. I use the following MemoryFree library at specific points in the code to reveal the high-water. The readme explains how to save unnecessarily/unintentionally used RAM by prints.
Note: That while the original Arduino IDE 1.0.5's boards.txt files does contain these ram_sizes, it does not actually use display usage. Where the original Arduino IDE 1.5.5 does, along with Arduino ERW 1.0.5 does (an non-supported fork).

In my Arduino IDE 2.1.0
I edit the file: /usr/share/arduino/hardware/arduino/boards.txt
but the second line don't appear
After read:
check-ram-memory-usage-arduino-optimization
measuring-free-memory
I tried:
Show vervose output during compilation
and use avr-size /tmp/build4042914391435450796.tmp/XXXXXXX.cpp.elf
then i get my memory used
Best Regards!

int freeRam () {
extern int __heap_start, *__brkval;
int v;
int fr = (int) &v - (__brkval == 0 ? (int) &__heap_start : (int) __brkval);
Serial.print("Free ram: ");
Serial.println(fr);
}

Queued kernels slower than expected on AMD gpus only

I am performing a benchmark like show below
CHECK( context = clCreateContext(props, 1, &device, NULL, NULL, &_err); );
CHECK( queue = clCreateCommandQueue(context, device, 0, &_err); );
#define SYNC() clFinish(queue)
#define LAUNCH(glob, loc, kernel) OCL(clEnqueueNDRangeKernel(queue, kernel, 2,\
NULL, glob, loc,\
0, NULL, NULL))
/* Build program, set arguments over here */
START;
for (int i = 0; i < iter; i++) {
LAUNCH(global, local, plus_kernel);
}
SYNC();
STOP;
printf("Time taken (plus) : %lf\n", uSec / iter);
START;
for (int i = 0; i < iter; i++) {
LAUNCH(global, local, minus_kernel);
}
SYNC();
STOP;
printf("Time taken (minus): %lf\n", uSec / iter);
START;
for (int i = 0; i < iter; i++) {
LAUNCH(global, local, plus_kernel);
LAUNCH(global, local, minus_kernel);
}
SYNC();
STOP;
printf("Time taken (both) : %lf\n", uSec / iter);
The results look weird:
Time taken (plus) : 31.450000
Time taken (minus): 28.120000
Time taken (both) : 2256.380000
START, and STOP are just macros that start and stop a timer.
Here are the relevant macros.
I am not sure why queuing up is the kernels is slowing them down (and only on AMD GPUs)!
EDIT I am using Radeon 7970
EDIT Both kernels are operating on independent memory. Also here is the system information.
OS: Ubuntu 11.10
fglrxinfo:
display: :0 screen: 0
OpenGL vendor string: Advanced Micro Devices, Inc.
OpenGL renderer string: AMD Radeon HD 7900 Series
OpenGL version string: 4.2.11762 Compatibility Profile Context

I think the answer has to do with caching of data on newer GPUs (Specifically the Radeon 7970, which uses the Graphics Compute Next (GCN) architecture.
One of the advantages of this architecture is it's caching capabilities (somewhat close to CPU caching at this point). If you perform calls like this:
PLUS
PLUS
PLUS
....
Then the memory that is resident in the inner caches of the GPU. On the other hand if you make calls like this:
PLUS
MINUS
PLUS
MINUS
...
Where the two kernels have different memory objects associated with them, then the data is kicked out of the hardware devices on each CU, causing a need for them to be brought in from the very sluggish global memory.
Two easy ways to test if this is the case:
Run only Pluses with varying numbers of iterations. As the number of iterations increases, the average time will go down because the cost of the first run (which brings the data in) is amortized. Also, you should notice that all calls after the first should be relatively equal.
Make the Plus and Minus kernels run on the same memory objects. If the reason for the slowdown is because of the caching of memory objects, then the overall run time should be the average of the individual running times of PLUS and MINUS (depending perhaps on experiment 1).
Let me know if you find out if this is actually the case!

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

OpenMP and MPI hybrid program - mpi

Thanks all for the comments and answers. You are all right. It's all about the "PIN" option. To solve my problem, I just had to: I_MPI_WAIT_MODE=1 I_MPI_PIN_DOMAIN=omp Simple as that. Now all processors are available to all ranks. The option I_MPI_DEBUG=4 shows which processors each rank gets.

Related

Equivalent of cudaSetDevice in OpenCL?

Why doesn't OpenCL Nvidia compiler (nvcc) use the registers twice?

Using a barrier causes a CL_INVALID_WORK_GROUP_SIZE error

How to measure the amount of memory or RAM consumed by a code on Arduino Mega or Due

Queued kernels slower than expected on AMD gpus only

Categories

Resources