MPI scatter and gather multiple times

I have a problem understanding scatter and gather. Let's say I have one table[tableSize]. I want to do some calculation on every block of 25 elements of that table, and I want to divide the work among all the processes in my MPI run. I tried something like this:
MPI_Scatter(table, 25, MPI_INT, tmpTable, 25, MPI_INT, 0, MPI_COMM_WORLD);
tmpTable[12] = doTheCalculation(tmpTable);
MPI_Gather(tmpTable, 25, MPI_INT, table, 25, MPI_INT, 0, MPI_COMM_WORLD);
But it works only if 25 * number of processes = tableSize holds. How should I proceed if I want a tableSize of 125 but only 3 processes are running? My goal would be for processes 0 and 1 to compute twice (process 0 on elements 1-25 and then 76-100, and process 1 on elements 26-50 and 101-125). Are scatter and gather capable of doing this? Should I look into something else? Thanks in advance.
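The distribution described in the question (chunk i goes to rank i mod number of processes) can be sketched without MPI. The following is a minimal Python sketch of the bookkeeping only, not MPI code. The function names `round_robin_chunks` and `scatterv_layout` are hypothetical, ranges are 0-based and half-open, and the table size is assumed to be a multiple of the chunk size. The second function computes per-rank counts and displacements of the kind that MPI_Scatterv and MPI_Gatherv accept, as a contiguous alternative to the round-robin layout.

```python
def round_robin_chunks(table_size, chunk, nprocs):
    """Map each rank to the (start, end) ranges it would process, round-robin."""
    n_chunks = table_size // chunk
    plan = {rank: [] for rank in range(nprocs)}
    for i in range(n_chunks):
        # chunk i is handled by rank i mod nprocs
        plan[i % nprocs].append((i * chunk, (i + 1) * chunk))
    return plan


def scatterv_layout(table_size, chunk, nprocs):
    """Contiguous alternative: per-rank element counts and displacements."""
    n_chunks = table_size // chunk
    base, extra = divmod(n_chunks, nprocs)
    # the first `extra` ranks get one chunk more than the others
    counts = [(base + (1 if rank < extra else 0)) * chunk for rank in range(nprocs)]
    displs = [sum(counts[:rank]) for rank in range(nprocs)]
    return counts, displs


plan = round_robin_chunks(125, 25, 3)
counts, displs = scatterv_layout(125, 25, 3)
```

With 125 elements and 3 ranks, the round-robin plan gives ranks 0 and 1 two chunks each and rank 2 one chunk, which is the split the question asks for. The contiguous layout gives the same amount of work per rank but keeps each rank's elements adjacent in the table.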

Related

Find the best combination of items based on multiple ordered criteria in Lua

I'm trying to make an algorithm in Lua to find the optimal combination of items to meet multiple ordered criteria.
The constraints are:
Find the combination of items closest to the criteria (each criterion is the sum of the same variable across all slots), with exactly 1 item per slot
15 slots
Maximum 5 criteria, minimum 1 criterion
All values are positive integers
Criteria are prioritized, e.g. A > B > C ...
Number of possible items per slot is theoretically between 0 and 15
For example, I have this list:
{slot: 1, valueA: 10, valueB: 20, valueC: 0}
{slot: 1, valueA: 10, valueB: 20, valueC: 16}
{slot: 2, valueA: 10, valueB: 40, valueD: 29}
{slot: 2, valueA: 30, valueB: 460, valueK: 47}
{slot: 2, valueA: 40, valueB: 50, valueC: 32}
{slot: 3, valueA: 55, valueB: 0, valueJ: 50}
With criteria such as: TotalA = 50, TotalB = 20, TotalC = 90
I want to get the best combination of items to meet TotalA first, then TotalB and finally TotalC.
I tried to brute force this using loops, but it takes too much time.
I've found a few discussions about the Knapsack problem and how to solve it using dynamic programming or an ILP solver (I didn't find one in Lua, however), but I'm not good enough at mathematics to figure out a working solution.
The Knapsack problem is also missing one dimension: the ordering of the criteria.
If someone can guide me with some simple words, pseudocode or Lua, it would be awesome.
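The ordering of the criteria can be expressed as a lexicographic comparison of tuples. The following is a minimal sketch in Python rather than Lua, and it is only a brute-force baseline, so it does not solve the speed problem. It assumes that "closest" means the smallest absolute deviation from each target and that a value missing from an item counts as 0. The function name `best_combination` and the shortened keys A, B, C are hypothetical.

```python
from itertools import product


def best_combination(items_by_slot, targets):
    """Pick one item per slot, minimising deviations in priority order."""
    criteria = list(targets)  # dict order is the priority order: A, then B, then C
    best_key, best_pick = None, None
    for pick in product(*items_by_slot.values()):
        totals = {c: sum(item.get(c, 0) for item in pick) for c in criteria}
        # tuples compare element by element, so A dominates B, and B dominates C
        key = tuple(abs(totals[c] - targets[c]) for c in criteria)
        if best_key is None or key < best_key:
            best_key, best_pick = key, pick
    return best_pick, best_key


# The example items from the question, grouped by slot
items_by_slot = {
    1: [{"A": 10, "B": 20, "C": 0}, {"A": 10, "B": 20, "C": 16}],
    2: [{"A": 10, "B": 40, "D": 29}, {"A": 30, "B": 460, "K": 47},
        {"A": 40, "B": 50, "C": 32}],
    3: [{"A": 55, "B": 0, "J": 50}],
}
targets = {"A": 50, "B": 20, "C": 90}
pick, deviations = best_combination(items_by_slot, targets)
```

The same tuple key could be reused as the comparison inside a faster search, for example a dynamic program over slots, which is where the real saving would have to come from.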

Digits between 1 and 1000 that sum up to 3

I am trying to find the numbers between 1 and 1,000 whose digits sum to 3. I am looking for any formula that can help me calculate this. For example, 111 and 12 qualify: the ones, tens, and hundreds digits added together equal 3.
Any help will be appreciated.

Think of every number as having 3 digits, e.g. 001 and 002.
We can narrow this down quickly. No digit in such a number can be greater than 3.
So we immediately rule out all numbers >= 400.
Within each group of a hundred, we can also rule out any number >= ?40 (e.g. 140, 340).
Then we can start to dig into the numbers a bit.
We only have to look at ?00 - ?39 for the leading digits 0, 1, 2 and 3.
Start with 00?. Only one number works here: 0 + 0 + x = 3 gives 003.
Moving up to the next set of ten, 01?, again only one number works: 012.
The logic is that each leading two-digit combination leads to at most one solution. The leading digit can only be 0??, 1??, 2?? or 3??, and the second digit can only be ?0?, ?1?, ?2? or ?3?.
So we can be comfortable listing: 3, 12, 21, 30, 102, 111, 120, 201, 210, 300 (13? and 22? are crossed out because their first two digits already sum to more than 3).
If you don't want to use math, use python:
a = []
for x in range(10):
    for y in range(10):
        for z in range(10):
            if x + y + z == 3:
                a.append('%d%d%d' % (x, y, z))
# a is now ['003', '012', '021', '030', '102', '111', '120', '201', '210', '300']
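Since the question asks for a formula, the count can also be cross-checked with the standard stars-and-bars argument: the number of ways to write 3 as an ordered sum of three non-negative digits is C(3 + 3 - 1, 3 - 1) = C(5, 2) = 10. This works without any correction here because the target 3 is below 10, so no digit can overflow. The sketch below compares that formula with a direct scan of 1 to 1000; the helper name `digit_sum` is hypothetical.

```python
from math import comb


def digit_sum(n):
    """Sum of the decimal digits of n."""
    return sum(int(d) for d in str(n))


# Direct scan of the range given in the question (1000 has digit sum 1)
matches = [n for n in range(1, 1001) if digit_sum(n) == 3]

# Stars and bars: solutions of x + y + z = 3 with x, y, z >= 0
formula_count = comb(3 + 3 - 1, 3 - 1)
```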

sending multiple messages with different length to the same rank

Let's say I have 3 ranks.
Rank 0 receives 1 MPI_INT from rank 1 and receives 10 MPI_INT from rank 2:
MPI_Recv(buf1, 1, MPI_INT,
1, 0, MPI_COMM_WORLD, &status);
MPI_Recv(buf2, 10, MPI_INT,
2, 0, MPI_COMM_WORLD, &status);
Rank 1 and rank 2 send 1 and 10 MPI_INT to rank 0, respectively. MPI_Recv is a blocking call. Let's say the 10 MPI_INT from rank 2 arrive before the 1 MPI_INT from rank 1. At this point, rank 0 is blocked waiting for data from rank 1.
In this case, could the first MPI_Recv return? Data from rank 2 arrives first, but the data couldn't fit into buf1 which could contain one integer.
And then the message from rank 1 arrives. Is MPI able to pick this message and let the first MPI_Recv return?

Since you specify a source when calling MPI_Recv(), you do not have to worry about the order of the messages. The first MPI_Recv() will return when 1 MPI_INT is received from rank 1, and the second MPI_Recv() will return when 10 MPI_INT are received from rank 2.
If you had MPI_Recv(..., source=MPI_ANY_SOURCE, ...) it would have been a different story.
Feel free to write a simple program with some sleep() here and there if you still need to convince yourself.
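The matching rule can also be pictured with a toy model. The sketch below is plain Python and not real MPI: the class `ToyMailbox` and its methods are hypothetical, and it only models the idea that a receive with an explicit source skips messages from other ranks, while a receive from any source takes whatever arrived first.

```python
class ToyMailbox:
    """Toy model of the messages waiting at rank 0 (NOT real MPI)."""

    def __init__(self):
        self.pending = []  # (source, data) pairs in arrival order

    def deliver(self, source, data):
        self.pending.append((source, data))

    def recv(self, source=None):
        """Return the earliest pending message matching source (None = any source)."""
        for i, (src, data) in enumerate(self.pending):
            if source is None or src == source:
                return self.pending.pop(i)
        # a real blocking receive would wait here instead of failing
        raise LookupError("no matching message yet")


box = ToyMailbox()
box.deliver(2, list(range(10)))  # the 10 ints from rank 2 arrive first
box.deliver(1, [42])             # the single int from rank 1 arrives later

first = box.recv(source=1)       # matches rank 1's message despite arrival order
second = box.recv(source=2)      # then rank 2's message

any_box = ToyMailbox()
any_box.deliver(2, list(range(10)))
any_box.deliver(1, [42])
first_any = any_box.recv()       # any source: the earliest arrival wins
```

In the any-source case the first receive gets the 10 integers from rank 2, which is why a one-element buffer would be a problem there.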

OpenCL: multiple work items saving results to the same global memory address

I'm trying to do a reduce-like cumulative calculation where 4 different values need to be stored depending on certain conditions. My kernel receives long arrays as input and needs to store only 4 values, which are "global sums" obtained from each data point of the input. For example, I need to store the sum of all the data values satisfying a certain condition, and the number of data points that satisfy that condition. The kernel is below to make it clearer:
__kernel void photometry(__global float* stamp,
                         __constant float* dark,
                         __global float* output)
{
    int x = get_global_id(0);
    int s = n * n;
    if(x < s){
        float2 curr_px = (float2)((x / n), (x % n));
        float2 center = (float2)(centerX, centerY);
        int dist = (int)fast_distance(center, curr_px);
        if(dist < aperture){
            output[0] += stamp[x]-dark[x];
            output[1]++;
        }else if (dist > sky_inner && dist < sky_outer){
            output[2] += stamp[x]-dark[x];
            output[3]++;
        }
    }
}
All the values not declared in the kernel are previously defined by macros. s is the length of the input arrays stamp and dark, which are nxn matrices flattened down to 1D.
I get results but they are different from my CPU version of this. Of course I am wondering: is this the right way to do what I'm trying to do? Can I be sure that each pixel data is only being added once? I can't think of any other way to save the cumulative result values.

You need atomic operations in your case; otherwise data races will make the results unpredictable.
The problem is here:
output[0] += stamp[x]-dark[x];
output[1]++;
Threads in the same wave execute in lockstep, so they all read the same output[0] value with a single global load (a broadcast). When they finish the computation and store to the same address (output[0]), the writes are serialized. Note that serialized stores are not the same as an atomic add: each thread writes the value it read plus only its own contribution, so updates can already be lost inside one wave.
On top of that, your program most likely launches more than one wave, and different waves may execute in an unknown order, so the behavior becomes more complicated when they access the same memory address. For example, wave0 and wave1 may both read output[0] before any other computation happens, so they fetch the same value; then they do their computation and store their accumulated results back into output[0]. The result of one wave is overwritten by the other, as if only the wave that wrote last had executed. A real application has many more waves, so a wrong result is not surprising.
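The overwrite between waves can be reproduced deterministically outside OpenCL. The following Python sketch is only a model of the interleaving, under the assumption that every wave loads output[0] before any wave stores to it; the function names `racy_accumulate` and `atomic_accumulate` are hypothetical.

```python
def racy_accumulate(wave_partials, initial=0.0):
    """All waves read output[0] first, then each stores snapshot + its own partial."""
    output0 = initial
    snapshots = [output0 for _ in wave_partials]  # every wave loads the same value
    for snapshot, partial in zip(snapshots, wave_partials):
        output0 = snapshot + partial              # a later store overwrites an earlier one
    return output0


def atomic_accumulate(wave_partials, initial=0.0):
    """Each read-modify-write completes before the next one starts."""
    output0 = initial
    for partial in wave_partials:
        output0 += partial
    return output0


racy = racy_accumulate([10.0, 32.0])       # only the last store survives
correct = atomic_accumulate([10.0, 32.0])  # both contributions are kept
```

In the kernel, the integer counters could use OpenCL's built-in atomic functions such as atomic_inc, while the float sums would probably need a different approach, such as a per-work-group reduction.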

You can do this concurrently in O(log2(n)) passes. The idea:
You have 16 inputs (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16) and you want to compute their sum concurrently.
You can concurrently sum 1 into 2, 3 into 4, 5 into 6, 7 into 8, 9 into 10, 11 into 12, 13 into 14, 15 into 16,
then concurrently sum 2 into 4, 6 into 8, 10 into 12, 14 into 16,
then, again concurrently, 4 into 8 and 12 into 16,
and finally 8 into 16.
Everything is done in O(log2(n)) passes, in our case 4 passes.
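The passes listed above can be written down as a short loop. This is a sequential Python sketch of the pairwise scheme, assuming the input length is a power of two; the name `tree_reduce` is hypothetical. On a GPU, all additions inside one pass touch distinct pairs and could run concurrently, with a barrier between passes.

```python
def tree_reduce(values):
    """Pairwise in-place sum; returns (total, number_of_passes)."""
    data = list(values)
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    stride, passes = 1, 0
    while stride < n:
        # each addition in this pass uses a distinct pair of positions,
        # so the additions are independent of one another
        for right in range(2 * stride - 1, n, 2 * stride):
            data[right] += data[right - stride]
        stride *= 2
        passes += 1
    return data[-1], passes


total, passes = tree_reduce(range(1, 17))
```

For the 16 inputs 1 to 16, the first pass adds position 1 into 2, 3 into 4 and so on, and the last pass adds position 8 into 16, where the total ends up.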

Increment number stored as array of digit-counters

I'm trying to store a counter that can become very large (well over 32 and probably 64-bit limits), but rather than use a single integer, I'd like to store it as an array of counters for each digit. This should be pretty language-agnostic.
In this form, 0 would be [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] (one zero, none of the other digits up to 9). 1 would be [0, 1, 0, ...] and so on. 10 would therefore be [1, 1, 0, ...].
I can't come up with a way to keep track of which digits should be decremented (moving from 29 to 30, for example) and how those should be moved. I suspect that it can't be done without another counter, either a single value representing the last cell touched, or an array of 10 more counters to flag when each digit should be touched.
Is it possible to represent a number in this fashion and count up without using a simple integer value?

No, this representation by itself would be useless because it fails to encode digit position, leading to many numbers having the same representation (e.g. 121 and 211).
Either use a bignum library, or 80 bits of raw binary (that being sufficient to store your declared range of 10e23).
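Both points are easy to check. The Python sketch below builds the digit-count array described in the question (the helper name `digit_counts` is hypothetical), shows that 121 and 211 collide, and uses Python's arbitrary-precision integers to confirm how many bits 10e23 (that is, 10**24) needs.

```python
from collections import Counter


def digit_counts(n):
    """10-element list: how often each digit 0-9 occurs in the decimal form of n."""
    counts = Counter(str(n))
    return [counts.get(str(d), 0) for d in range(10)]


# Two different numbers, one representation: digit position is lost
collision = digit_counts(121) == digit_counts(211)

# Python ints are arbitrary precision, so this is exact
bits_needed = (10 ** 24).bit_length()
```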
