MPI reduce optimisation - mpi

I have a rather simple MPI program where each node does a calculation, and in the end I need the sum of all the calculations. The nodes have no need to communicate anything other than the final sum each one has calculated.
Currently this is what I am doing and it is working.
MPI_Init(&argc, &argv); // start up "virtual machine"
MPI_Comm_size(MPI_COMM_WORLD, &p); // get size of VM
MPI_Comm_rank(MPI_COMM_WORLD, &id); // get own rank in VM
int localsum[1] = {0};
int globalsum[1] = {0};
for (i = lower+id; i <= upper; i=i+p) {
localsum[0] = localsum[0] + getResult(i);
}
MPI_Reduce(localsum,globalsum,1,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);
if(id==0)
{
printf("globalsum1 = %d \n",globalsum[0]);
}
So each node strides through the range in steps of the VM size. However, here's the problem: getResult(i) always takes less time to compute than getResult(i+1). This means that some nodes will have a much bigger workload than others.
Is there any way to balance this out better, or to let nodes steal work from other nodes when they are done?

As Wesley Bland points out in the comments, this is a hard question to answer without knowing more about what getResult() does and how much extra time we are talking about.
However, one suggestion I have is to pair expensive calls to getResult() with cheaper ones.
Example: pair getResult(lower) with getResult(upper), and getResult(lower+1) with getResult(upper-1).
Sample loop (will need some modifications to fix some edge cases):
for (i = id; i <= (upper-lower)/2; i=i+p) {
localsum[0] = localsum[0] + getResult(lower+i) + getResult(upper-i) ;
}
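One possible way to handle the edge cases mentioned above is sketched below in plain C, without MPI calls so that the arithmetic can be checked on its own. The function name paired_local_sum, the result_fn type and the stand-in demo_result are hypothetical; the real getResult() is not shown in the question. The loop stops before the middle of the range so that no index is counted twice, and when the range has an odd number of elements exactly one rank picks up the unpaired middle index.

```c
/* Stand-in for getResult(); the real function is not shown in the question. */
typedef long (*result_fn)(int);

static long demo_result(int i) { return i; }

/* Sum this rank's share of [lower, upper], pairing the i-th cheapest
 * index with the i-th most expensive one. id is the rank, p the number
 * of ranks. */
static long paired_local_sum(int id, int p, int lower, int upper,
                             result_fn get_result)
{
    int n = upper - lower + 1;   /* number of indices in the range */
    long sum = 0;

    if (n <= 0)
        return 0;

    /* Stop before the middle so that no index is visited twice. */
    for (int i = id; i < n / 2; i += p)
        sum += get_result(lower + i) + get_result(upper - i);

    /* Odd count: the middle index has no partner, so one rank takes it. */
    if (n % 2 == 1 && id == (n / 2) % p)
        sum += get_result(lower + n / 2);

    return sum;
}
```

Each rank would then pass its result to MPI_Reduce exactly as in the original program. Whether pairing balances the load well depends on how the cost of getResult() actually grows with i.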

Related

OpenCL Atomic add for vector types?

I'm updating a single element in a buffer from two lanes and need an atomic for float4 types. (More specifically, I launch twice as many threads as there are buffer elements, and each successive pair of threads updates the same element.)
For instance (this pseudocode does nothing useful, but hopefully illustrates my issue):
int idx = get_global_id(0);
int mapIdx = floor (idx / 2.0);
float4 toAdd;
// ...
if (idx % 2)
{
toAdd = (float4)(0,1,0,1);
}
else
{
toAdd = (float4)(1,0,1,0);
}
// avoid race condition here?
// I'd like to atomic_add(map[mapIdx],toAdd);
map[mapIdx] += toAdd;
In this example, map[0] should be incremented by (1,1,1,1). (0,1,0,1) from thread 0, and (1,0,1,0) from thread 1.
Suggestions? I haven't found any reference to vector atomics in the CL documents. I suppose I could do this on each individual vector component separately:
atomic_add(map[mapIdx].x, toAdd.x);
atomic_add(map[mapIdx].y, toAdd.y);
atomic_add(map[mapIdx].z, toAdd.z);
atomic_add(map[mapIdx].w, toAdd.w);
... but that just feels like a bad idea. (And requires a cmpxchg hack since there are no float atomics.)
Suggestions?
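For reference, the "cmpxchg hack" mentioned in the question usually means a compare-and-swap loop on the raw bits of the float. The sketch below shows the idea in host-side C11 rather than OpenCL C, so it is an illustration of the technique and not kernel code; the names atomic_add_float, float_to_bits and bits_to_float are made up for this example. In a kernel the same loop would typically be written with atomic_cmpxchg on the value reinterpreted as an unsigned int, and it would have to be applied to each vector component separately.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back (assumes 32-bit float). */
static uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Atomically add delta to the float stored as raw bits in *target.
 * Returns the value that was stored before the addition. */
static float atomic_add_float(_Atomic uint32_t *target, float delta)
{
    uint32_t old_bits = atomic_load(target);
    uint32_t new_bits;

    do {
        new_bits = float_to_bits(bits_to_float(old_bits) + delta);
        /* On failure, old_bits is refreshed with the current contents
         * and the sum is recomputed on the next iteration. */
    } while (!atomic_compare_exchange_weak(target, &old_bits, new_bits));

    return bits_to_float(old_bits);
}
```

The loop retries whenever another thread changed the value in between, which is why heavy contention on a single element can make this approach slow.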
Alternatively, you could try using local memory like this:
__local float4 local_map[LOCAL_SIZE/2];
if(idx < LOCAL_SIZE/2) // Splitting by halves rather than by every second work item (idx%2) should be better, since neighbouring work items run together in a warp/wavefront anyway; the alternating pattern may hurt performance
local_map[mapIdx] = toAdd;
barrier(CLK_LOCAL_MEM_FENCE);
if(idx >= LOCAL_SIZE/2)
local_map[mapIdx - LOCAL_SIZE/2] += toAdd;
barrier(CLK_LOCAL_MEM_FENCE);
Which will be faster, atomics or local memory, and whether the local-memory version is even possible (the local buffer may be too big), depends on the actual kernel, so you will need to benchmark and choose the right solution.
Update:
Answering your question from comments - to write later back to global buffer do:
if(idx < LOCAL_SIZE/2)
map[mapIdx] = local_map[mapIdx];
Or you can try without introducing local buffer and write directly into global buffer:
if(idx < LOCAL_SIZE/2)
map[mapIdx] = toAdd;
barrier(CLK_GLOBAL_MEM_FENCE); // <- notice that now we use barrier related to global memory
if(idx >= LOCAL_SIZE/2)
map[mapIdx - LOCAL_SIZE/2] += toAdd;
barrier(CLK_GLOBAL_MEM_FENCE);
Aside from that, I can now see a problem with the indexes. To use the code from my answer, the previous code should look like:
if(idx < LOCAL_SIZE/2)
{
toAdd = (float4)(0,1,0,1);
}
else
{
toAdd = (float4)(1,0,1,0);
}
If you need to use idx%2, though, then all of the code must follow that scheme, or you will have to do some index arithmetic so that the values land in the right places in map.
If I understand the issue correctly, I would do the following.
Get rid of the ifs by making an array of offsets:
float4 offsets[2] = {(float4)(1,0,1,0), (float4)(0,1,0,1)};
and use idx % 2 as the index into it.
Move map into local memory and use barrier(CLK_LOCAL_MEM_FENCE) to make sure all threads in the group are synced.

Calculating the average of Sensor Data (Capacitive Sensor)

So I am starting to mess around with capacitive sensors because it's some pretty cool stuff.
I have followed some tutorials online about how to set up and use the CapSense library for Arduino, and I just had a quick question about this code I wrote to get the average of that data.
void loop() {
long AvrNum;
int counter = 0;
AvrNum += cs_4_2.capacitiveSensor(30);
counter++;
if (counter = 10) {
long AvrCap = AvrNum/10;
Serial.println(AvrCap);
counter = 0;
}
}
This is my loop function, and in the Serial monitor it seems like it's working, but the numbers just look suspiciously low to me. I'm using a 10M resistor (brown, black, black, green, brown) and am touching a piece of foil that both the send and receive pins are attached to (with electrical tape), and I am getting numbers around 650, give or take 30.
Basically I'm asking if this code looks right and if these numbers make sense...?
The language used in the Arduino environment is really just an unenforced subset of C++ with the main() function hidden inside the framework code supplied by the IDE. Your code is a module that will be compiled and linked to the framework. When the framework starts running, it first initializes itself and then your module by calling the function setup(). Once initialized, the framework enters an infinite loop, calling your module's function loop() on each iteration.
Your code is using local variables in loop() and expecting that they will hold their values from call to call. While this might happen in practice (and likely does since that part of framework's main() is probably just while(1) loop();), this is invoking the demons of Undefined Behavior. C++ does not make any promises about the value of an uninitialized variable, and even reading it can cause anything to happen. Even apparently working.
To fix this, the accumulator AvrNum and the counter must be stored somewhere other than on loop()'s stack. They could be declared static, or moved outside the function to module scope. Outside is better IMHO, especially in the constrained Arduino environment.
You also need to clear the accumulator after you finish an average. This is the simplest form of an averaging filter, where you sum up fixed-length blocks of N samples and produce one average for every Nth sample.
I believe this fragment (untested) will work for you:
long AvrNum;
int counter;
void setup() {
AvrNum = 0;
counter = 0;
}
void loop() {
AvrNum += cs_4_2.capacitiveSensor(30);
counter++;
if (counter == 10) {
long AvrCap = AvrNum/10;
Serial.println(AvrCap);
counter = 0;
AvrNum = 0;
}
}
I provided a setup(), although it is redundant with the C++ language's guarantee that the global variables begin life initialized to 0.
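The same block-averaging idea can be sketched as a small, hardware-independent C helper so that the logic can be checked away from the board. The names block_avg, block_avg_push and BLOCK_LEN are made up for this illustration and are not part of the CapSense library; the state lives in a struct that the caller keeps alive between calls, which plays the role of the global variables above.

```c
#define BLOCK_LEN 10   /* samples per average, matching the sketch above */

struct block_avg {
    long sum;     /* running total of the current block */
    int  count;   /* samples accumulated so far */
};

/* Feed one sample. Returns 1 and stores the block mean in *out when a
 * block of BLOCK_LEN samples completes; otherwise returns 0 and leaves
 * *out untouched. */
static int block_avg_push(struct block_avg *f, long sample, long *out)
{
    f->sum += sample;
    f->count++;

    if (f->count == BLOCK_LEN) {
        *out = f->sum / BLOCK_LEN;
        f->sum = 0;      /* clear the accumulator for the next block */
        f->count = 0;
        return 1;
    }
    return 0;
}
```

In loop(), each reading from capacitiveSensor() would be passed to block_avg_push(), and the result printed only when the function returns 1.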
Your line if (counter = 10) is wrong. It should be if (counter == 10).
The first assigns 10 to counter and will (of course) evaluate to true.
The second tests whether counter equals 10 and will not evaluate to true until counter is, indeed, equal to 10.
Also, kaylum mentions the other problem: AvrNum is never initialized.
This is what I ended up with after spending some more time on it. I checked it against some manual calculations and it gets all the data.
long AvrArray[10]; // ten samples, indices 0..9
for(int x = 0; x <= 10; x++){
if(x == 10){
long AvrMes = (AvrArray[0] + AvrArray[1] + AvrArray[2] + AvrArray[3] + AvrArray[4] + AvrArray[5] + AvrArray[6] + AvrArray[7] + AvrArray[8] + AvrArray[9]);
long AvrCap = AvrMes/x;
Serial.print("\t");
Serial.println(AvrCap);
x = 0; // start the next block of ten
}
AvrArray[x] = cs_4_2.capacitiveSensor(30);
Serial.println(AvrArray[x]);
delay(500);
}

Minimum number of jumps required to climb stairs

I recently had an interview with Microsoft for an internship and I was asked this question in the interview.
It's basically like this: you have 2 parallel staircases, and both staircases have n steps. You start from the bottom and you may move upwards on either of the staircases. Each step on a staircase has a penalty attached to it.
You can also move across both the staircases with some other penalty.
I had to find the minimum penalty that will be imposed for reaching the top.
I tried writing a recurrence relation but I couldn't write anything because of so many variables.
I recently read about dynamic programming and I think this question is related to that.
With some googling, I found that this question is the same as
https://www.hackerrank.com/contests/frost-byte-final/challenges/stairway
Can you please give a solution or an approach for this problem?
Create two arrays to keep track of the minimal cost to reach every position. Fill both arrays with huge numbers (e.g. 1000000000) and the start of the arrays with the cost of the first step.
Then iterate over all possible steps, and use an inner loop to iterate over all possible jumps.
foreach step in (0, N) {
// we're now sure we know minimal cost to reach this step
foreach jump in (1,K) {
// updating minimal costs here
}
}
Each time we reach the update step, there are 4 possible moves to consider:
from A[step] to A[step+jump]
from A[step] to B[step+jump]
from B[step] to A[step+jump]
from B[step] to B[step+jump]
For each of these moves you need to compute the cost. Because you already know the optimal cost to reach A[step] and B[step], this is easy. It's not guaranteed that the new move is an improvement, so only update the target cost in your array if the new cost is lower than the cost already there.
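The approach described above can be sketched in C as follows. This is a minimal illustration under assumptions that the question does not confirm: jumps of 1 to k steps are allowed, switching staircases during a move adds a fixed penalty cross, and the cost of a move is the penalty of the step you land on. The function name min_penalty and its parameters are hypothetical.

```c
#include <limits.h>
#include <stdlib.h>

static long long min_ll(long long x, long long y) { return x < y ? x : y; }

/* a[i], b[i]: penalty of step i on each staircase; n: steps per staircase;
 * k: largest allowed jump; cross: extra penalty for switching staircases.
 * Returns the minimum total penalty to reach the top step of either
 * staircase, or -1 on invalid input or allocation failure. */
static long long min_penalty(const long long *a, const long long *b,
                             int n, int k, long long cross)
{
    const long long INF = LLONG_MAX / 4;   /* "huge number" placeholder */

    if (n <= 0 || k <= 0)
        return -1;

    long long *ca = malloc(sizeof *ca * (size_t)n);
    long long *cb = malloc(sizeof *cb * (size_t)n);
    if (!ca || !cb) {
        free(ca);
        free(cb);
        return -1;
    }

    for (int i = 0; i < n; i++)
        ca[i] = cb[i] = INF;
    ca[0] = a[0];   /* cost of standing on the first step */
    cb[0] = b[0];

    for (int step = 0; step < n; step++) {
        /* ca[step] and cb[step] are final here: only lower steps feed them. */
        for (int jump = 1; jump <= k && step + jump < n; jump++) {
            int t = step + jump;
            /* A->A and B->A, then B->B and A->B; keep only improvements. */
            ca[t] = min_ll(ca[t], min_ll(ca[step], cb[step] + cross) + a[t]);
            cb[t] = min_ll(cb[t], min_ll(cb[step], ca[step] + cross) + b[t]);
        }
    }

    long long best = min_ll(ca[n - 1], cb[n - 1]);
    free(ca);
    free(cb);
    return best;
}
```

The running time is O(n * k). If the actual problem charges penalties differently (for example a crossing cost that depends on the step), only the two update lines need to change.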
Isn't this just a directed graph search? Any kind of simple pathfinding algorithm could handle this. See
https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
https://en.wikipedia.org/wiki/A*_search_algorithm
Just make sure you enforce the direction of the stairs (up only) and account for the penalties (edge weights).
Worked solution
Of course, you could do it with dynamic programming, but I wouldn't be the one to ask for that...
import java.io.*;
import java.util.*;
public class Main {
public static int csmj(int []a,int src,int[] dp){
if(src>=a.length){
return Integer.MAX_VALUE-1;
}
if(src==a.length-1){
return 0;
}
if(dp[src]!=0){
return dp[src];
}
int count=Integer.MAX_VALUE-1;
for(int i=1;i<=a[src];i++){
count = Math.min( count , csmj(a,src+i,dp)+1 );
}
dp[src] = count;
return count;
}
public static void main(String args[] ) throws Exception {
Scanner s = new Scanner(System.in);
int n = s.nextInt();
int a[] = new int[n];
for(int i=0;i<n;i++){
a[i] = s.nextInt();
}
int minJumps = csmj(a,0,new int[n]);
System.out.println(minJumps);
}
}
bro you can have look at that solution my intuition is that

OpenCL reduction from private to local then global?

The following kernel computes an acoustic pressure field, with each thread computing its own private instance of the pressure vector, which then needs to be summed down into global memory.
I'm pretty sure the code which computes the pressure vector is correct, but I'm still having trouble making this produce the expected result.
int gid = get_global_id(0);
int lid = get_local_id(0);
int nGroups = get_num_groups(0);
int groupSize = get_local_size(0);
int groupID = get_group_id(0);
/* Each workitem gets private storage for the pressure field.
* The private instances are then summed into local storage at the end.*/
private float2 pressure[HYD_DIM_TOTAL];
local float2 pressure_local[HYD_DIM_TOTAL];
/* Code which computes value of 'pressure' */
//wait for all workgroups to finish accessing any memory
barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
/// sum all results in a workgroup into local buffer:
for(i=0; i<groupSize; i++){
//each thread sums its own private instance into the local buffer
if (i == lid){
for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
pressure_local[iHyd] += pressure[iHyd];
}
}
//make sure all threads in workgroup get updated values of the local buffer
barrier(CLK_LOCAL_MEM_FENCE);
}
/// copy all the results into global storage
//1st thread in each workgroup writes the group's local buffer to global memory
if(lid == 0){
for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
pressure_global[groupID +nGroups*iHyd] = pressure_local[iHyd];
}
}
barrier(CLK_GLOBAL_MEM_FENCE);
/// sum the various instances in global memory into a single one
// 1st thread sums global instances
if(gid == 0){
for(iGroup=1; iGroup<nGroups; iGroup++){
//we only need to sum the results from the 1st group onward
for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
pressure_global[iHyd] += pressure_global[iGroup*HYD_DIM_TOTAL +iHyd];
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
}
Some notes on data dimensions:
The total number of threads will vary between 100 and 2000, but may on occasion lie outside this interval.
groupSize will depend on hardware, but I'm currently using values between 1 (CPU) and 32 (GPU).
HYD_DIM_TOTAL is known at compile time and varies between 4 and 32 (will generally, but not necessarily, be a power of 2).
Is there anything blatantly wrong with this reduction code?
PS: I run this on an i7 3930k with AMD APP SDK 2.8 and on an NVIDIA GTX580.
I notice two issues here, one big, one smaller:
This code suggests that you have a misunderstanding of what a barrier does. A barrier never synchronizes across multiple workgroups. It only synchronizes within a workgroup. The CLK_GLOBAL_MEM_FENCE makes it look like it is global synchronization, but it really isn't. That flag just fences all of the current work item's accesses to global memory. So outstanding writes will be globally observable after a barrier with this flag. But it does not change the barrier's synchronization behavior, which is only at the scope of a workgroup. There is no global synchronization in OpenCL, beyond launching another NDRange or Task.
The first for loop causes multiple work items to overwrite each others' computation. The indexing of pressure_local with iHyd will be done by each work item with the same iHyd. This will produce undefined results.
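Since the only global synchronization point is the end of a kernel launch, a reduction across workgroups is commonly split into two passes. The sketch below simulates that structure in plain host-side C; it is not OpenCL kernel code, and the names reduce_groups and reduce_partials are made up for illustration. The first function stands in for a kernel in which every workgroup writes one partial result, and the second stands in for a later launch (or a host loop) that runs only after all groups have finished.

```c
#include <stddef.h>

/* Pass 1: stands in for the first kernel launch. Each "workgroup" reduces
 * its own slice of items into one slot of partial. group_size must be
 * non-zero and partial must have room for one value per group. */
static void reduce_groups(const float *items, size_t n_items,
                          size_t group_size, float *partial)
{
    size_t n_groups = (n_items + group_size - 1) / group_size;

    for (size_t g = 0; g < n_groups; g++) {
        float acc = 0.0f;
        for (size_t lid = 0; lid < group_size; lid++) {
            size_t gid = g * group_size + lid;
            if (gid < n_items)          /* last group may be short */
                acc += items[gid];
        }
        partial[g] = acc;               /* one write per group */
    }
}

/* Pass 2: stands in for a second launch or a host-side loop. It starts
 * only after every group from pass 1 has written its partial result. */
static float reduce_partials(const float *partial, size_t n_groups)
{
    float total = 0.0f;
    for (size_t g = 0; g < n_groups; g++)
        total += partial[g];
    return total;
}
```

In the kernel from the question, this would correspond to ending the kernel after each group has written its buffer to pressure_global and doing the final sum in a separate step.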
Hope this helps.

Modifying motion vectors in ffmpeg H.264 decoder

For research purposes, I am trying to modify H.264 motion vectors (MVs) for each P- and B-frame prior to motion compensation during the decoding process. I am using FFmpeg for this purpose. An example of a modification is replacing each MV with its original spatial neighbors and then using the resultant MVs for motion compensation, rather than the original ones. Please direct me appropriately.
So far, I have been able to do a simple modification of MVs in the file /libavcodec/h264_cavlc.c. In the function, ff_h264_decode_mb_cavlc(), modifying the mx and my variables, for instance, by increasing their values modifies the MVs used during decoding.
For example, as shown below, the mx and my values are increased by 50, thus lengthening the MVs used in the decoder.
mx += get_se_golomb(&s->gb)+50;
my += get_se_golomb(&s->gb)+50;
However, in this regard, I don't know how to access the neighbors of mx and my for my spatial mean analysis that I mentioned in the first paragraph. I believe that the key to doing so lies in manipulating the array, mv_cache.
Another experiment that I performed was in the file libavcodec/error_resilience.c. Based on the guess_mv() function, I created a new function, mean_mv(), that is executed in ff_er_frame_end() within the first if-statement. That first if-statement exits the function ff_er_frame_end() if any of its conditions holds, one of which is a zero error count (s->error_count == 0). I decided to insert my mean_mv() function at this point so that it is always executed when there is a zero error count. This experiment partly yielded the results I wanted, as I could start seeing artifacts in the top portions of the video, but they were restricted to the upper-right corner. I'm guessing that my inserted function is not being completed, perhaps so as to meet playback deadlines or something similar.
Below is the modified if-statement. The only addition is my function, mean_mv(s).
if(!s->error_recognition || s->error_count==0 || s->avctx->lowres ||
s->avctx->hwaccel ||
s->avctx->codec->capabilities&CODEC_CAP_HWACCEL_VDPAU ||
s->picture_structure != PICT_FRAME || // we dont support ER of field pictures yet, though it should not crash if enabled
s->error_count==3*s->mb_width*(s->avctx->skip_top + s->avctx->skip_bottom)) {
//av_log(s->avctx, AV_LOG_DEBUG, "ff_er_frame_end in er.c\n"); //KG
if(s->pict_type==AV_PICTURE_TYPE_P)
mean_mv(s);
return;
}
And here's the mean_mv() function I created based on guess_mv().
static void mean_mv(MpegEncContext *s){
//uint8_t fixed[s->mb_stride * s->mb_height];
//const int mb_stride = s->mb_stride;
const int mb_width = s->mb_width;
const int mb_height= s->mb_height;
int mb_x, mb_y, mot_step, mot_stride;
//av_log(s->avctx, AV_LOG_DEBUG, "mean_mv\n"); //KG
set_mv_strides(s, &mot_step, &mot_stride);
for(mb_y=0; mb_y<s->mb_height; mb_y++){
for(mb_x=0; mb_x<s->mb_width; mb_x++){
const int mb_xy= mb_x + mb_y*s->mb_stride;
const int mot_index= (mb_x + mb_y*mot_stride) * mot_step;
int mv_predictor[5][2]={{0}}; // up to 4 neighbours plus one slot for their mean
int ref[5]={0};
int pred_count=0;
int m, n;
if(IS_INTRA(s->current_picture.f.mb_type[mb_xy])) continue;
//if(!(s->error_status_table[mb_xy]&MV_ERROR)){
//if (1){
if(mb_x>0){
mv_predictor[pred_count][0]= s->current_picture.f.motion_val[0][mot_index - mot_step][0];
mv_predictor[pred_count][1]= s->current_picture.f.motion_val[0][mot_index - mot_step][1];
ref [pred_count] = s->current_picture.f.ref_index[0][4*(mb_xy-1)];
pred_count++;
}
if(mb_x+1<mb_width){
mv_predictor[pred_count][0]= s->current_picture.f.motion_val[0][mot_index + mot_step][0];
mv_predictor[pred_count][1]= s->current_picture.f.motion_val[0][mot_index + mot_step][1];
ref [pred_count] = s->current_picture.f.ref_index[0][4*(mb_xy+1)];
pred_count++;
}
if(mb_y>0){
mv_predictor[pred_count][0]= s->current_picture.f.motion_val[0][mot_index - mot_stride*mot_step][0];
mv_predictor[pred_count][1]= s->current_picture.f.motion_val[0][mot_index - mot_stride*mot_step][1];
ref [pred_count] = s->current_picture.f.ref_index[0][4*(mb_xy-s->mb_stride)];
pred_count++;
}
if(mb_y+1<mb_height){
mv_predictor[pred_count][0]= s->current_picture.f.motion_val[0][mot_index + mot_stride*mot_step][0];
mv_predictor[pred_count][1]= s->current_picture.f.motion_val[0][mot_index + mot_stride*mot_step][1];
ref [pred_count] = s->current_picture.f.ref_index[0][4*(mb_xy+s->mb_stride)];
pred_count++;
}
if(pred_count==0) continue;
if(pred_count>=1){
int sum_x=0, sum_y=0, sum_r=0;
int k;
for(k=0; k<pred_count; k++){
sum_x+= mv_predictor[k][0]; // Sum all the MVx from MVs avail. for EC
sum_y+= mv_predictor[k][1]; // Sum all the MVy from MVs avail. for EC
sum_r+= ref[k];
// if(k && ref[k] != ref[k-1])
// goto skip_mean_and_median;
}
mv_predictor[pred_count][0] = sum_x/k;
mv_predictor[pred_count][1] = sum_y/k;
ref [pred_count] = sum_r/k;
}
s->mv[0][0][0] = mv_predictor[pred_count][0];
s->mv[0][0][1] = mv_predictor[pred_count][1];
for(m=0; m<mot_step; m++){
for(n=0; n<mot_step; n++){
s->current_picture.f.motion_val[0][mot_index + m + n * mot_stride][0] = s->mv[0][0][0];
s->current_picture.f.motion_val[0][mot_index + m + n * mot_stride][1] = s->mv[0][0][1];
}
}
decode_mb(s, ref[pred_count]);
//}
}
}
}
I would really appreciate some assistance on how to go about this properly.
It has been a long time since I was in touch with FFmpeg's internal code.
However, given my experience with the horrors inside FFmpeg (you would know what I mean), I would rather give you some simple, pragmatic advice.
Suggestion #1
The best option is probably this: as the motion vector of each block is identified, store it in an additional array of your own inside the FFmpeg encoder context (a.k.a. s). When your algorithm runs, it picks up the values from there.
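A minimal sketch of that suggestion follows, written as standalone C so that it does not depend on any FFmpeg internals. The struct mv type, the function mean_of_neighbours and the flat row-major layout are all hypothetical and are not FFmpeg API. The point it illustrates is that the averaging reads from an untouched copy of the vectors, whereas mean_mv() in the question appears to read neighbours from the same motion_val array it is overwriting, so later blocks may see already-modified values.

```c
#include <stddef.h>
#include <string.h>

struct mv { int x, y; };   /* one motion vector per block (hypothetical) */

/* Replace every vector in field (w * h entries, row-major) with the mean
 * of its left/right/up/down neighbours. Neighbours are read from snapshot,
 * a caller-provided buffer of the same size that receives a copy of the
 * original vectors first. */
static void mean_of_neighbours(struct mv *field, struct mv *snapshot,
                               int w, int h)
{
    memcpy(snapshot, field, sizeof *field * (size_t)w * (size_t)h);

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int sx = 0, sy = 0, count = 0;
            int idx = y * w + x;

            if (x > 0)     { sx += snapshot[idx - 1].x; sy += snapshot[idx - 1].y; count++; }
            if (x + 1 < w) { sx += snapshot[idx + 1].x; sy += snapshot[idx + 1].y; count++; }
            if (y > 0)     { sx += snapshot[idx - w].x; sy += snapshot[idx - w].y; count++; }
            if (y + 1 < h) { sx += snapshot[idx + w].x; sy += snapshot[idx + w].y; count++; }

            if (count > 0) {   /* integer division truncates toward zero */
                field[idx].x = sx / count;
                field[idx].y = sy / count;
            }
        }
    }
}
```

Inside the decoder, the snapshot would be the extra array stored in the context, filled as the vectors are decoded.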
Suggestion #2
Another thing I read (I am not sure if I read it right):
the mx and my values are increased by 50
I think 50 is a very large change to a motion vector. Usually, the F-value range of the motion vector coding is fairly restrictive. Altering things by +/- 8 (or even +/- 16) might just be OK, but +50 could be so high that the end result may not encode things properly.
I didn't quite understand your objective with mean_mv() and what failure you expect from it. Please rephrase that a bit.

Resources