Writing a chunk of MPI-distributed data via HDF5 in Fortran

I have a 3d array distributed into different MPI processes:
real :: DATA(i1:i2, j1:j2, k1:k2)
where i1, i2, ... are different for each MPI process, but the MPI grid is Cartesian.
For simplicity let's assume I have a 120 x 120 x 120 array and 27 MPI processes distributed as 3 x 3 x 3 (so that each process holds an array of size 40 x 40 x 40).
Using the HDF5 library I need to write only a slice of that data, say a slice that goes through the middle, perpendicular to the second axis. The resulting (global) array would be of size 120 x 1 x 120.
I'm a bit confused about how to use HDF5 properly here, and how to generalize from writing the full DATA (which I can do). The problem is that not every MPI process is going to write: in the case above, only 9 processes have to write something; the others (the layers on either side along the second axis) do not, since they don't contain any part of the slab I need.
I tried the chunking technique described here, but it looks like that is meant for a single process.
I would be very grateful if the HDF5 community could help me with this :)

When writing an HDF5 dataset in parallel, all MPI processes must participate in the operation (even if a certain MPI process does not have values to write).
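With the plain HDF5 Fortran API, the usual way to satisfy this requirement is to let the ranks that own part of the slice select a hyperslab of the file dataspace, let all other ranks call H5Sselect_none on both the file and memory dataspaces, and have every rank join the collective H5Dwrite. Below is a rough sketch of that idea for the 3 x 3 x 3 example above; the file/dataset names and the assumption that ranks 0-8 own the middle plane (with offsets derived from the rank) are made up for illustration.
PROGRAM write_slice
    ! minimal sketch: every rank joins the collective write, non-writers select nothing
    USE mpi
    USE hdf5
    IMPLICIT NONE

    INTEGER :: mpierr, rank, hdferr
    INTEGER(HID_T) :: fapl_id, xfer_id, file_id, filespace, memspace, dset_id
    INTEGER(HSIZE_T) :: dims(3), cnt(3), offs(3)
    REAL(KIND=8), ALLOCATABLE :: slab(:,:,:)

    CALL MPI_Init(mpierr)
    CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
    CALL h5open_f(hdferr)

    ! open the file collectively with the MPI-IO driver
    CALL h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, hdferr)
    CALL h5pset_fapl_mpio_f(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL, hdferr)
    CALL h5fcreate_f("slice.h5", H5F_ACC_TRUNC_F, file_id, hdferr, access_prp=fapl_id)
    CALL h5pclose_f(fapl_id, hdferr)

    ! global dataset holding the 120 x 1 x 120 slice
    dims = (/120_HSIZE_T, 1_HSIZE_T, 120_HSIZE_T/)
    CALL h5screate_simple_f(3, dims, filespace, hdferr)
    CALL h5dcreate_f(file_id, "slice", H5T_NATIVE_DOUBLE, filespace, dset_id, hdferr)

    IF (rank < 9) THEN
        ! hypothetical mapping: ranks 0-8 own the middle j-plane, laid out 3 x 3 in (i,k)
        cnt  = (/40_HSIZE_T, 1_HSIZE_T, 40_HSIZE_T/)
        offs = (/INT(MOD(rank,3)*40, HSIZE_T), 0_HSIZE_T, INT((rank/3)*40, HSIZE_T)/)
        ALLOCATE(slab(40, 1, 40))
        slab = REAL(rank, 8)                      ! in practice: copy DATA(:, jmid, :) here
        CALL h5screate_simple_f(3, cnt, memspace, hdferr)
        CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offs, cnt, hdferr)
    ELSE
        ! ranks that do not touch the slice select nothing but still take part
        cnt = (/1_HSIZE_T, 1_HSIZE_T, 1_HSIZE_T/)
        ALLOCATE(slab(1, 1, 1))
        CALL h5screate_simple_f(3, cnt, memspace, hdferr)
        CALL h5sselect_none_f(memspace, hdferr)
        CALL h5sselect_none_f(filespace, hdferr)
    END IF

    ! collective write: note that every rank calls h5dwrite_f
    CALL h5pcreate_f(H5P_DATASET_XFER_F, xfer_id, hdferr)
    CALL h5pset_dxpl_mpio_f(xfer_id, H5FD_MPIO_COLLECTIVE_F, hdferr)
    CALL h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, slab, cnt, hdferr, &
                    file_space_id=filespace, mem_space_id=memspace, xfer_prp=xfer_id)

    CALL h5pclose_f(xfer_id, hdferr)
    CALL h5dclose_f(dset_id, hdferr)
    CALL h5sclose_f(memspace, hdferr)
    CALL h5sclose_f(filespace, hdferr)
    CALL h5fclose_f(file_id, hdferr)
    CALL h5close_f(hdferr)
    CALL MPI_Finalize(mpierr)
END PROGRAM write_slice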
If you are not bound to a specific library, take a look at HDFql. Based on what I could understand from the use case you have posted, here is an example of how to write data in parallel in Fortran using HDFql.
PROGRAM Example

    ! use HDFql module (make sure it can be found by the Fortran compiler)
    USE HDFql

    ! declare variables
    REAL(KIND=8), DIMENSION(40, 40, 40) :: values
    CHARACTER(2) :: start
    INTEGER :: state
    INTEGER :: x
    INTEGER :: y
    INTEGER :: z

    ! create an HDF5 file named "example.h5" and use (i.e. open) it in parallel
    state = hdfql_execute("CREATE AND USE FILE example.h5 IN PARALLEL")

    ! create a dataset named "dset" of data type double with three dimensions (size 120x120x120)
    state = hdfql_execute("CREATE DATASET dset AS DOUBLE(120, 120, 120)")

    ! populate variable "values" with certain values
    DO x = 1, 40
        DO y = 1, 40
            DO z = 1, 40
                values(z, y, x) = hdfql_mpi_get_rank() * 100000 + (x * 1600 + y * 40 + z)
            END DO
        END DO
    END DO

    ! register variable "values" for subsequent use (by HDFql)
    state = hdfql_variable_register(values)

    IF (hdfql_mpi_get_rank() < 3) THEN
        ! insert (i.e. write) values from variable "values" into dataset "dset" using a hyperslab
        ! that depends on the MPI rank (each rank writes 40x40x40 values)
        WRITE(start, "(I0)") hdfql_mpi_get_rank() * 40
        state = hdfql_execute("INSERT INTO dset(" // TRIM(start) // ":1:1:40) IN PARALLEL VALUES FROM MEMORY 0")
    ELSE
        ! if the MPI rank is equal to or greater than 3, nothing is written
        state = hdfql_execute("INSERT INTO dset IN PARALLEL NO VALUES")
    END IF

END PROGRAM
Please check the HDFql reference manual for additional information on how to work with HDF5 files in parallel (i.e. with MPI) using this library.

Related

Unintuitive behavior (to me) in Reactive.jl

My question relates to the Reactive package https://github.com/JuliaLang/Reactive.jl
I have read the tutorial and am experimenting, learning about the reactive programming approach. I try the following code and it works as expected:
using Reactive
x = Signal(100)
z = map(v -> v + 500, x; typ=Int64, init=500)
dar = rand(90:110,10)
for i in dar
    push!(x, i)
    println(value(z))
end
This leads, as expected, to 10 random numbers between 590 and 610 being printed:
500
591
609
609
605
593
602
596
590
594
So far, so good. Now, suppose I want to collect the outputs of signal z after each update, say in a vector c:
using Reactive
x = Signal(100)
z = map(v -> v + 500, x; typ=Int64, init=500)
dar = rand(90:110,10)
c = Vector()
for i in dar
    push!(x, i)
    push!(c, value(z))
end
However, instead of Vector c having ten random numbers between 590 and 610, c is a Vector containing the value 500 ten times:
10-element Array{Any,1}:
500
500
500
500
500
500
500
500
500
500
I am trying to understand if this behavior is caused by something I don't understand about Reactive programming; maybe combining for loops and Signals is a no-no? I would appreciate any insight into what is causing this behavior.
For reference, I am using Julia 0.4.5 inside an IJulia notebook.
Julia's Reactive.jl library is designed to allow reactive programming in Julia. The crux is that only signals are reactive, and they are either independent of or dependent on other signals. In the example, x is an independent signal, and the default way to update it is to call push!. z is a signal dependent on x, which is why it gets automatically updated when x changes.
Now, these are the only two signals. Notice that c is defined as Vector(), which is not a signal but a normal Julia array, so any code involving it runs only once, as in ordinary non-reactive code. Therefore
for i in dar
    push!(x, i)
    push!(c, value(z))
end
will only push the value of z into c as the loop runs, and at that point z still holds its default value of 500 (due to init=500 in the code). This makes intuitive sense: if c changed whenever z did, we would have made the underlying behaviour of Julia itself reactive, which would be volatile and therefore undesirable.
So how do we make c update whenever z does? The correct way is to use reactive programming all the way down, so c should be a signal dependent on z. Since c maintains the state of z over time, the right construct in reactive programming is foldp, which stands for "fold over past values".
The following code works:
using Reactive
x = Signal(100)
z = map(v -> v + 500, x)
c = foldp((acc, value) -> push!(acc, value), Int[], z)
for i in rand(90:110, 10)
    push!(x, i)
    yield()   # can also use Reactive.run_till_now()
    println(value(z))
end
@show value(c)
You will get c as the array of all previous values of z (except the initial value, by choice; if you want the initial value too, that is easily done). Notice that reactive programming retains a similar code complexity but adds the reactive capability, which makes the code more elegant and easier to maintain.
Using yield() or Reactive.run_till_now() is as recommended in the comments, so I shall keep that explanation brief. My opinion, though, is that if you need to do that, you are likely not using reactive programming correctly, or your problem fits another paradigm better.
This is how I would write it:
using Reactive
x = Signal(100)
z = map(x) do v
    result = v + 500
    println(result)
    return result
end
c = foldp(Int[], z) do acc, value
    push!(acc, value)
end
for i in rand(90:110, 10)
    push!(x, i)
end
@show value(c)
Notice that the reactive part is self-describing. Now z prints itself whenever it is updated, which is descriptive rather than imperative, so even though updates are asynchronous we still capture every update to z. The imperative code that pushes to x stands on its own, keeping things modular. The code stays high level and easy to read, without low-level plumbing calls like yield() inside such a high-level script.

OpenCL: multiple work items saving results to the same global memory address

I'm trying to do a reduce-like cumulative calculation where 4 different values need to be stored depending on certain conditions. My kernel receives long arrays as input and needs to store only 4 values, which are "global sums" obtained from each data point on the input. For example, I need to store the sum of all the data values satisfying certain condition, and the number of data points that satisfy said condition. The kernel is below to make it clearer:
__kernel void photometry(__global float* stamp,
                         __constant float* dark,
                         __global float* output)
{
    int x = get_global_id(0);
    int s = n * n;
    if(x < s){
        float2 curr_px = (float2)((x / n), (x % n));
        float2 center = (float2)(centerX, centerY);
        int dist = (int)fast_distance(center, curr_px);
        if(dist < aperture){
            output[0] += stamp[x]-dark[x];
            output[1]++;
        }else if (dist > sky_inner && dist < sky_outer){
            output[2] += stamp[x]-dark[x];
            output[3]++;
        }
    }
}
All the values not declared in the kernel are previously defined by macros. s is the length of the input arrays stamp and dark, which are nxn matrices flattened down to 1D.
I get results but they are different from my CPU version of this. Of course I am wondering: is this the right way to do what I'm trying to do? Can I be sure that each pixel data is only being added once? I can't think of any other way to save the cumulative result values.
Atomic operations are needed in your case; otherwise data races will make the results unpredictable.
The problem is here:
output[0] += stamp[x]-dark[x];
output[1]++;
You can imagine that work items in the same wavefront execute in lockstep, so things may appear to work within one wavefront: they read the same output[0] value using a single global load (a broadcast), and when they finish the computation and try to store into the same memory address (output[0]) the writes are serialized. Up to this point you may still get correct results (for the work items inside the same wavefront).
However, your program almost certainly launches more than one wavefront (in most applications this is the case). Different wavefronts may execute in an unknown order, and when they access the same memory address the behavior becomes more complicated. For example, wavefront 0 and wavefront 1 may both read output[0] before any other computation happens, which means they fetch the same value; after computing, each stores its accumulated result into output[0], so the result from one wavefront is overwritten by the other, as if only the later writer had executed. With many more wavefronts in a real application, a wrong result is not surprising.
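For illustration, an atomic version of the kernel might look roughly like the sketch below (this is not the original kernel: the counters are moved to a separate unsigned-int buffer so that atomic_inc can be used, the float sums use a compare-and-swap loop because OpenCL 1.x has no built-in atomic add for floats, and the host is assumed to zero-initialize both output buffers; the macros n, centerX, centerY, aperture, sky_inner and sky_outer are the same as in the question):
// helper: atomic add for floats via compare-and-swap on the 32-bit pattern
inline void atomic_add_float(volatile __global float* addr, float val)
{
    union { unsigned int u; float f; } old_val, new_val;
    do {
        old_val.f = *addr;
        new_val.f = old_val.f + val;
    } while (atomic_cmpxchg((volatile __global unsigned int*)addr,
                            old_val.u, new_val.u) != old_val.u);
}

__kernel void photometry(__global float* stamp,
                         __constant float* dark,
                         __global float* sums,          // sums[0]: aperture, sums[1]: sky
                         __global unsigned int* counts) // counts[0]: aperture, counts[1]: sky
{
    int x = get_global_id(0);
    int s = n * n;
    if(x < s){
        float2 curr_px = (float2)((x / n), (x % n));
        float2 center = (float2)(centerX, centerY);
        int dist = (int)fast_distance(center, curr_px);
        if(dist < aperture){
            atomic_add_float(&sums[0], stamp[x] - dark[x]);
            atomic_inc(&counts[0]);
        }else if(dist > sky_inner && dist < sky_outer){
            atomic_add_float(&sums[1], stamp[x] - dark[x]);
            atomic_inc(&counts[1]);
        }
    }
}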
You can also do this as a reduction, concurrently in O(log2(n)) steps. A conceptual sketch:
You have 16 inputs (1, 2, 3, ..., 16) and you want their sum computed concurrently.
First you concurrently sum 1 into 2, 3 into 4, 5 into 6, 7 into 8, 9 into 10, 11 into 12, 13 into 14, 15 into 16;
then you concurrently sum 2 into 4, 6 into 8, 10 into 12, 14 into 16;
then, again concurrently, 4 into 8 and 12 into 16;
and finally 8 into 16.
Everything is done in O(log2(n)) steps, in our case 4 passes.
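As a concrete illustration of that idea (a generic sketch, not tied to the kernel above): a classic work-group reduction keeps partial sums in local memory, halves the stride each pass, and lets each work-group write one partial result that the host (or a second kernel) adds up.
// assumes the work-group size is a power of two; "scratch" is local memory
// sized to one float per work item (set via clSetKernelArg with a NULL pointer)
__kernel void reduce_sum(__global const float* in,
                         __global float* partial,
                         __local float* scratch,
                         const unsigned int len)
{
    unsigned int gid = get_global_id(0);
    unsigned int lid = get_local_id(0);

    scratch[lid] = (gid < len) ? in[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    // pairwise sums: the stride halves each step, O(log2(group size)) passes
    for (unsigned int stride = get_local_size(0) / 2; stride > 0; stride >>= 1) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}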

How do you do parallel matrix multiplication in Julia?

Is there a good way to do parallel matrix multiplication in Julia? I tried using DArrays, but it was significantly slower than a single-threaded multiplication.
Parallel in what sense? If you mean single-machine, multi-threaded, then Julia does this by default as OpenBLAS (the underlying linear algebra library used) is multithreaded.
If you mean multiple-machine, distributed-computing-style, then you will be encountering a lot of communications overhead that will only be worth it for very large problems, and a customized approach might be needed.
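For the single-machine case, a quick way to see the effect of BLAS threading (a sketch; it assumes the default OpenBLAS build) is to time the same product after starting Julia with different values of the OPENBLAS_NUM_THREADS environment variable:
# e.g. run once with OPENBLAS_NUM_THREADS=1 and once with OPENBLAS_NUM_THREADS=4
n = 2000
sa = rand(n, n); sb = rand(n, n)
sa * sb            # warm-up (compilation)
@time sa * sb      # the BLAS call is what gets multithreaded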
The problem is most likely that plain matrix multiplication is normally performed by an optimized library function. In the case of OpenBLAS, this is already multithreaded. For arrays of size 2000x2000, the simple matrix multiplication
@time c = sa * sb;
takes about 0.3 seconds multithreaded and 0.7 seconds single-threaded.
Splitting the multiplication along a single dimension makes the times even worse, reaching around 17 seconds in single-threaded mode:
@time for j = 1:n
    sc[:,j] = sa[:,:] * sb[:,j]
end
Shared arrays
The solution to your problem might be the use of shared arrays, which share the same data across your processes on a single computer. Please note that shared arrays are still marked as experimental.
# create shared arrays and initialize them with random numbers
sa = SharedArray(Float64,(n,n),init = s -> s[localindexes(s)] = rand(length(localindexes(s))))
sb = SharedArray(Float64,(n,n),init = s -> s[localindexes(s)] = rand(length(localindexes(s))))
sc = SharedArray(Float64,(n,n));
Then you have to create a function, which performs a cheap matrix multiplication on a subset of the matrix.
@everywhere function mymatmul!(n, w, sa, sb, sc)
    # works only for 4 workers and n divisible by 4
    range = 1+(w-2) * div(n,4) : (w-1) * div(n,4)
    sc[:,range] = sa[:,:] * sb[:,range]
end
Finally, the main process tells the workers to work on their part.
@time @sync begin
    for w in workers()
        @async remotecall_wait(w, mymatmul!, n, w, sa, sb, sc)
    end
end
which takes around 0.3 seconds, the same as the multithreaded single-process time.
It sounds like you're interested in dense matrices, in which case see the other answers. Should you be (or become) interested in sparse matrices, see https://github.com/madeleineudell/ParallelSparseMatMul.jl.

How to compute the size of the allocated memory for a general type

I need to work with some databases read with read.table from CSV (comma-separated values) files, and I wish to know how to compute the size of the memory allocated for each type of variable.
How can I do that?
Edit: in other words, how much memory does R allocate for a general data frame read from a .csv file?
You can get the amount of memory allocated to an object with object.size. For example:
x = 1:1000
object.size(x)
# 4040 bytes
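The same call works on a whole data frame read from a CSV file, and sapply gives a per-column breakdown (the file name below is just a placeholder):
df <- read.table("data.csv", header = TRUE, sep = ",")
object.size(df)            # memory used by the whole data frame
sapply(df, object.size)    # memory used by each column (i.e. each variable)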
This script might also be helpful- it lets you view or graph the amount of memory used by all of your current objects.
In answer to your question of why object.size(4) is 48 bytes: the reason is that there is some overhead in each numeric vector. (In R, the number 4 is not just an integer as in other languages; it is a numeric vector of length 1.) But that doesn't hurt performance, because the overhead does not grow with the size of the vector. If you try:
> object.size(1:100000) / 100000
4.0004 bytes
This shows you that each integer itself requires only 4 bytes (as you expect).
In summary:
For a numeric vector of length n, the size in bytes is typically 40 + 8 * floor(n / 2). However, on my version of R and OS there is a single slight discontinuity, where it jumps to 168 bytes sooner than you would expect (see the plot below). Beyond that, the linear relationship holds, even up to a vector of length 10000000.
plot(sapply(1:50, function(n) object.size(1:n)))
For a categorical variable, you can see a very similar linear trend, though with a bit more overhead (see below). Outside of a few slight discontinuities, the relationship is quite close to 400 + 60 * n.
plot(sapply(1:100, function(n) object.size(factor(1:n))))

How to efficiently convert a few bytes into an integer between a range?

I'm writing something that reads bytes (just a List<int>) from a remote random number generation source that is extremely slow. For that and my personal requirements, I want to retrieve as few bytes from the source as possible.
Now I am trying to implement a method which signature looks like:
int getRandomInteger(int min, int max)
I have two theories how I can fetch bytes from my random source, and convert them to an integer.
Approach #1 is naïve. Fetch (max - min) / 256 bytes and add them up. It works, but it is going to fetch a lot of bytes from the slow random number generator source I have. For example, if I want a random integer between zero and a million, it is going to fetch almost 4000 bytes... that's unacceptable.
Approach #2 sounds ideal to me, but I'm unable to come up with the algorithm. It goes like this:
Lets take min: 0, max: 1000 as an example.
Calculate ceil(rangeSize / 256) which in this case is ceil(1000 / 256) = 4. Now fetch one (1) byte from the source.
Scale this one byte from the 0-255 range to 0-3 range (or 1-4) and let it determine which group we use. E.g. if the byte was 250, we would choose the 4th group (which represents the last 250 numbers, 750-1000 in our range).
Now fetch another byte and scale from 0-255 to 0-250 and let that determine the position within the group we have. So if this second byte is e.g. 120, then our final integer is 750 + 120 = 870.
In that scenario we only needed to fetch 2 bytes in total. However, it gets much more complex when our range is 0-1000000, since we would need several "groups".
How do I implement something like this? I'm okay with Java/C#/JavaScript code or pseudo code.
I'd also like to keep the result from losing entropy/randomness, so I'm slightly worried about scaling integers.
Unfortunately your Approach #1 is broken. For example, if min is 0 and max is 510, you'd add 2 bytes. There is only one way to get a result of 0: both bytes zero. The chance of this is (1/256)^2. However, there are many ways to get other values, say 100 = 100+0, 99+1, 98+2, ... So the chance of a 100 is much larger: 101 * (1/256)^2.
The more-or-less standard way to do what you want is to:
Let R = max - min + 1 -- the number of possible random output values
Let N = 2^k >= mR, m>=1 -- a power of 2 at least as big as some multiple of R that you choose.
loop
b = a random integer in 0..N-1 formed from k random bits
while b >= mR -- reject b values that would bias the output
return min + floor(b/m)
This is called the method of rejection. It throws away randomly selected binary numbers that would bias the output. If max - min + 1 happens to be a power of 2, then you'll have zero rejections.
If you have m = 1 and max - min + 1 is just one more than a biggish power of 2, then rejections will be near half. In this case you'd definitely want a bigger m.
In general, bigger m values lead to fewer rejections, but of course they require slightly more bits per number. There is a probabilistically optimal algorithm for picking m.
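To make that concrete, here is a small sketch of the rejection method in Java with m = 1; nextByte() is a hypothetical stand-in for your slow byte source:
// rejection method with m = 1: gather just enough random bits, reject biased values
static int getRandomInteger(int min, int max) {
    long r = (long) max - min + 1;                  // number of possible outputs
    if (r == 1) return min;
    int k = 64 - Long.numberOfLeadingZeros(r - 1);  // bits needed, ceil(log2(r))
    int nBytes = (k + 7) / 8;                       // bytes to fetch per attempt
    while (true) {
        long b = 0;
        for (int i = 0; i < nBytes; i++)
            b = (b << 8) | (nextByte() & 0xFF);     // nextByte(): hypothetical slow source
        b >>>= nBytes * 8 - k;                      // keep exactly k random bits
        if (b < r)                                  // otherwise reject and retry
            return (int) (min + b);
    }
}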
Some of the other solutions presented here have problems, but I'm sorry right now I don't have time to comment. Maybe in a couple of days if there is interest.
Three bytes (together) give you a random integer in the range 0..16777215. You can use 20 bits of this value to get the range 0..1048575 and throw away values > 1000000.
For a range 1 to r:
first find the smallest 'a' such that 256^a >= r
get 'a' bytes into array A[]
num = 0
for i = 0 to len(A)-1
    num += A[i] << (8*i)    -- i.e. A[i] * 256^i
next
random number = num mod range
Your random source gives you 8 random bits per call. For an integer in the range [min,max] you would need ceil(log2(max-min+1)) bits.
Assume that you can get random bytes from the source using some function:
bool RandomBuf(BYTE* pBuf , size_t nLen); // fill buffer with nLen random bytes
Now you can use the following function to generate a random value in a given range:
// --------------------------------------------------------------------------
// produce a uniformly-distributed integral value in range [nMin, nMax]
// T is char/BYTE/short/WORD/int/UINT/LONGLONG/ULONGLONG
template <class T> T RandU(T nMin, T nMax)
{
    static_assert(std::numeric_limits<T>::is_integer, "RandU: integral type expected");
    if (nMin > nMax)
        std::swap(nMin, nMax);
    if (0 == (T)(nMax - nMin + 1))                  // all range of type T
    {
        T nR;
        return RandomBuf((BYTE*)&nR, sizeof(T)) ? *(T*)&nR : nMin;
    }
    ULONGLONG nRange = (ULONGLONG)nMax - (ULONGLONG)nMin + 1;     // number of discrete values
    UINT nRangeBits = (UINT)ceil(log((double)nRange) / log(2.));  // bits for storing nRange discrete values
    ULONGLONG nR;
    do
    {
        if (!RandomBuf((BYTE*)&nR, sizeof(nR)))
            return nMin;
        nR = nR >> ((sizeof(nR)<<3) - nRangeBits);  // keep nRangeBits random bits
    }
    while (nR >= nRange);                           // ensure value in range [0..nRange-1]
    return nMin + (T)nR;                            // [nMin..nMax]
}
Since you always get a multiple of 8 bits, you can save the extra bits between calls (for example, you may need only 9 bits out of 16). It requires some bit manipulation, and it is up to you to decide if it is worth the effort.
You can save even more if you use "half bits": let's assume you want to generate numbers in the range [1..5]. You'll need log2(5) = 2.32 bits for each random value. Using 32 random bits you can actually generate floor(32/2.32) = 13 random values in this range, though it requires some additional effort.
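To make the first of those ideas (saving leftover bits between calls) concrete, here is a small C++ sketch of a bit pool: a request for, say, 9 bits consumes two bytes but leaves 7 bits buffered for the next request. NextByte() is a hypothetical stand-in for the slow byte source.
#include <cstdint>

extern uint8_t NextByte();    // assumed: returns one random byte from the source

class BitPool
{
    uint64_t m_bits  = 0;     // buffered random bits (lowest m_count bits are unused)
    unsigned m_count = 0;     // how many buffered bits are still unused
public:
    // return n (<= 32) random bits, topping the buffer up one byte at a time
    uint32_t GetBits(unsigned n)
    {
        while (m_count < n)
        {
            m_bits = (m_bits << 8) | NextByte();
            m_count += 8;
        }
        m_count -= n;
        return (uint32_t)((m_bits >> m_count) & ((1ull << n) - 1));
    }
};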
