MPI_Gather of columns

I have an array which is split up by columns between the processes for my calculation. Afterwards I want to gather this array in one process (0).
Each process has its columns saved in an array A; process 0 has an array F for collecting the data. The F array is of size n*n; each process has part_size columns, so the local arrays A are n*part_size. Columns are distributed to the processes in a round-robin fashion - c0 goes to p0, c1 to p1, c2 to p0 again, and so on.
I created new datatypes for sending and receiving the columns.
On all processes:
MPI_Type_vector(n, 1, part_size, MPI::FLOAT, &col_send);
MPI_Type_commit(&col_send);
On process 0:
MPI_Type_vector(n, 1, n, MPI::FLOAT, &col_recv);
MPI_Type_commit(&col_recv);
Now I would like to gather the array as follows:
MPI_Gather(&A, part_size, col_send, &F, part_size, col_recv, 0, MPI::COMM_WORLD);
However, the result is not as expected. My example has n = 4 and two processes. The values from p0 should end up in columns 0 and 2 of F and those from p1 in columns 1 and 3. Instead, both columns of p0 are stored in columns 0 and 1, while the values of p1 are not there at all.
0: F[0][0]: 8.31786
0: F[0][1]: 3.90439
0: F[0][2]: -60386.2
0: F[0][3]: 4.573e-41
0: F[1][0]: 0
0: F[1][1]: 6.04768
0: F[1][2]: -60386.2
0: F[1][3]: 4.573e-41
0: F[2][0]: 0
0: F[2][1]: 8.88266
0: F[2][2]: -60386.2
0: F[2][3]: 4.573e-41
0: F[3][0]: 0
0: F[3][1]: 0
0: F[3][2]: -60386.2
0: F[3][3]: 4.573e-41
I'll admit that I'm out of ideas on this one. I obviously misunderstood how Gather or Type_vector work and store their values. Could someone point me in the right direction? Any help would be much appreciated.

The problem that I see is that the datatype created with MPI_Type_vector() has an extent going from the first to the last item. For example:
The extent for your col_recv datatype is between > and < (I hope this representation of the mask is clear enough):
>x . . .
x . . .
x . . .
x<. . .
That is 13 MPI_FLOAT items (must be read by row, that's C ordering).
Receiving two of them will lead to:
>x . . .
x . . .
x . . .
x y . .
. y . .
. y . .
. y . .
That clearly is not what you want.
To let MPI_Gather() properly skip data on the receiver, you need to set the extent of col_recv to exactly ONE ELEMENT. You can do this by using MPI_Type_create_resized():
>x<. . .
x . . .
x . . .
x . . .
so that receiving successive blocks gets correctly interleaved:
x y . .
x y . .
x y . .
x y . .
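For reference, a minimal sketch of that resize step (col_recv is the committed vector type from the question; the resized type name is mine):
MPI_Datatype col_recv_resized;
// Keep the layout of col_recv but shrink its extent to a single float,
// so consecutive received columns start one element apart.
MPI_Type_create_resized(col_recv, 0, sizeof(float), &col_recv_resized);
MPI_Type_commit(&col_recv_resized);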
However receiving two columns instead of one will lead to:
x x y y
x x y y
x x y y
x x y y
That again is not what you want, even if closer.
Since you want interleaved columns, you need to create a more complex datatype, capable of describing all the columns, with a 1-item extent as before.
Each column is separated by a stride of one ELEMENT, i.e. the extent (not the size, which is 4 elements) of the previously defined column:
>x<. x .
x . x .
x . x .
x . x .
Receiving one of them per processor, you'll get what you want:
x y x y
x y x y
x y x y
x y x y
You can do it with MPI_Type_create_darray() as well, since it allows creating datatypes suitable for the block-cyclic distribution of ScaLAPACK, your case being a 1D subcase of it.
I have also tried it. Here is working code for two processors:
#include <mpi.h>
#include <stdio.h>

#define N 4
#define NPROCS 2
#define NPART (N/NPROCS)

int main(int argc, char **argv) {
    float a_send[N][NPART];
    float a_recv[N][N] = {0};

    MPI_Datatype column_send_type;
    MPI_Datatype column_recv_type;
    MPI_Datatype column_send_type1;
    MPI_Datatype column_recv_type1;
    MPI_Datatype matrix_columns_type;
    MPI_Datatype matrix_columns_type1;

    MPI_Init(&argc, &argv);

    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    // Fill the local columns with rank-dependent values for easy checking
    for(int i=0; i<N; ++i) {
        for(int j=0; j<NPART; ++j) {
            a_send[i][j] = my_rank*100+10*(i+1)+(j+1);
        }
    }

    // One column of the local N x NPART array, resized to an extent of one float
    MPI_Type_vector(N, 1, NPART, MPI_FLOAT, &column_send_type);
    MPI_Type_commit(&column_send_type);
    MPI_Type_create_resized(column_send_type, 0, sizeof(float), &column_send_type1);
    MPI_Type_commit(&column_send_type1);

    // One column of the global N x N array, resized to an extent of one float
    MPI_Type_vector(N, 1, N, MPI_FLOAT, &column_recv_type);
    MPI_Type_commit(&column_recv_type);
    MPI_Type_create_resized(column_recv_type, 0, sizeof(float), &column_recv_type1);
    MPI_Type_commit(&column_recv_type1);

    // All NPART columns owned by one rank, spaced NPROCS elements apart,
    // again resized to an extent of one float so the ranks interleave correctly
    MPI_Type_vector(NPART, 1, NPROCS, column_recv_type1, &matrix_columns_type);
    MPI_Type_commit(&matrix_columns_type);
    MPI_Type_create_resized(matrix_columns_type, 0, sizeof(float), &matrix_columns_type1);
    MPI_Type_commit(&matrix_columns_type1);

    MPI_Gather(a_send, NPART, column_send_type1, a_recv, 1, matrix_columns_type1, 0, MPI_COMM_WORLD);

    if (my_rank==0) {
        for(int i=0; i<N; ++i) {
            for(int j=0; j<N; ++j) {
                printf("%4.0f ", a_recv[i][j]);
            }
            printf("\n");
        }
    }

    MPI_Finalize();
}
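The key design point is that every receive-side type is resized to an extent of one float, so the displacement MPI_Gather applies for rank r (r times the extent of the receive type, since the receive count is 1) lands exactly on column r of a_recv. To try it, compile with the MPI wrapper and launch two ranks, e.g. mpicc gather.c && mpirun -np 2 ./a.out (wrapper and launcher names vary between MPI installations; gather.c is just a placeholder file name).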

Maximum cost without cycles

Given an undirected graph with positive edge costs, choose a subset of edges such that there are no cycles and the sum of the costs is maximal.
The input consists of several graphs, each defined by its number of vertices n, its number of edges m, and m triples x, y, c indicating an edge between x and y with cost c. The vertices are numbered from 0 to n - 1. It is assumed that 1 ≤ n ≤ 10^4, 0 ≤ m ≤ 5n, and 1 ≤ c ≤ 10^5. There may be more than one edge between two vertices, and even edges with x = y.
#include <iostream>
#include <vector>
using namespace std;

using P = pair<int,int>;
using VE = vector<int>;
using VP = vector<P>;
using VVE = vector<VP>;

int n,m;
VVE G;
VE cost;
VE vist;
VE pare;

int maxim(int x){
    if(cost[x] != -1) return cost[x];
    cost[x] = 0;
    for(P y: G[x]){
        if(cost[x] <= y.second + maxim(y.first)){
            cost[x] = y.second + maxim(y.first);
        }
    }
    return cost[x];
}

int main() {
    while(cin >> n >> m){
        G = VVE(n);
        cost = VE(n,-1);
        pare = VE(n,-1);
        for(int i = 0; i < m; ++i){
            int x,y,c; cin >> x >> y >> c;
            G[x].push_back(P(y,c));
            G[y].push_back(P(x,c));
        }
        int mx = -1;
        for(int i = 0; i < n; ++i){
            if(mx <= maxim(i)){
                mx = maxim(i);
            }
        }
        cout << mx << endl;
    }
}
This is my code and I don't know how to solve the problem; I would appreciate some help. As you can see, the graph is read as a vector of vectors, in which each pair indicates that node x connects to node y with cost c.
As a commenter pointed out, this is the maximum spanning tree problem (which is the same as the minimum spanning tree problem, just negate the costs). That problem can be solved with a greedy algorithm. Initially place every node into a disjoint set of its own. Then, in a loop, consider the edges in decreasing cost order. If the two endpoints of the considered edge are already in the same set, discard the edge; otherwise select it and merge the two sets (a union-find structure does this efficiently). When only one set is left you can stop, and the selected edges form your solution; a sketch follows below.
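A minimal sketch of that greedy approach (Kruskal's algorithm with union-find), reading the same input format as the question; since it simply processes all edges, it also handles disconnected graphs by building a maximum-weight spanning forest:
#include <algorithm>
#include <iostream>
#include <numeric>
#include <tuple>
#include <vector>
using namespace std;

struct DSU {                       // disjoint-set (union-find) structure
    vector<int> parent;
    DSU(int n) : parent(n) { iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    bool unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return false;  // same component: this edge would close a cycle
        parent[a] = b;
        return true;
    }
};

int main() {
    int n, m;
    while (cin >> n >> m) {
        vector<tuple<int,int,int>> edges(m);      // (cost, x, y)
        for (auto& [c, x, y] : edges) cin >> x >> y >> c;
        sort(edges.rbegin(), edges.rend());       // decreasing cost
        DSU dsu(n);
        long long total = 0;
        for (auto& [c, x, y] : edges)
            if (dsu.unite(x, y)) total += c;      // keep the edge only if it joins two components
        cout << total << endl;
    }
}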

Rcpp submat from a big sparse matrix

I am trying to multiply a vec by a subset of a very big sparse matrix (as in the script below), but it fails to compile when using sourceCpp; it reports error: no matching function for call to ‘arma::SpMat<double>::submat(arma::uvec&, arma::uvec&)’. It would be much appreciated if someone could do me a favour.
#include <RcppArmadillo.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
double myf(sp_mat X, vec g, uvec xi){
    double u = g(xi).t() * X.submat(xi, xi) * g(xi);
    return u;
}
So, as @RalfStubner mentioned, submatrix access for sparse matrices is restricted to contiguous views. That said, the access pattern taken here is symmetric on the sparse matrix since the same index vector is used for rows and columns. So, in this case, it makes sense to fall back to the standard element accessor (x, y). As a result, the summation reduction can be done with a single loop.
#include <RcppArmadillo.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
double submat_multiply(const arma::sp_mat& X,
                       const arma::vec& g, const arma::uvec& xi){

  // Add an assertion
  if(X.n_rows != g.n_elem) {
    Rcpp::stop("Mismatched row and column dimensions of X (%s) and g (%s).",
               X.n_rows, g.n_elem);
  }

  // Reduction
  double summed = 0;
  for (unsigned int i = 0; i < xi.n_elem; ++i) {
    // Retrieve indexing element
    arma::uword index_at_i = xi(i);
    // Add components together
    summed += g(index_at_i) * X(index_at_i, index_at_i) * g(index_at_i);
  }

  // Return result
  return summed;
}
Another approach, but potentially more costly, would be to extract out the diagonal of the sparse matrix and convert it to a dense vector. From there apply an element-wise multiplication and summation.
#include <RcppArmadillo.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
double submat_multiply_v2(const arma::sp_mat& X,
                          const arma::vec& g, const arma::uvec& xi){

  // Add an assertion
  if(X.n_rows != g.n_elem) {
    Rcpp::stop("Mismatched row and column dimensions of X (%s) and g (%s).",
               X.n_rows, g.n_elem);
  }

  // Copy sparse diagonal to dense vector
  arma::vec x_diag(X.diag());

  // Obtain the subset
  arma::vec g_sub = g.elem(xi);

  // Perform element-wise multiplication and then sum.
  double summed = arma::sum(g_sub % x_diag.elem(xi) % g_sub);

  // Return result
  return summed;
}
Test code:
# Sparse matrix
library(Matrix)
i <- c(1,4:8,10); j <- c(2, 9, 6:10); x <- 7 * (1:7)
X <- sparseMatrix(i, j, x = x)
X
# 10 x 10 sparse Matrix of class "dgCMatrix"
#
# [1,] . 7 . . . . . . . .
# [2,] . . . . . . . . . .
# [3,] . . . . . . . . . .
# [4,] . . . . . . . . 14 .
# [5,] . . . . . 21 . . . .
# [6,] . . . . . . 28 . . .
# [7,] . . . . . . . 35 . .
# [8,] . . . . . . . . 42 .
# [9,] . . . . . . . . . .
# [10,] . . . . . . . . . 49
# Vector
g <- 1:10
# Indices
xi <- c(0, 3, 4, 9)
# Above function
submat_multiply(X, g, xi)
# [1] 4900
submat_multiply_v2(X, g, xi)
# [1] 4900

Why isn't column-wise operation much faster than row-wise operation (as it should be) for a matrix in R

Consider the following functions, which store values row-wise and column-wise.
#include <Rcpp.h>
using namespace Rcpp;

const int m = 10000;
const int n = 3;

// [[Rcpp::export]]
SEXP rowWise() {
  SEXP A = Rf_allocMatrix(INTSXP, m, n);
  int* p = INTEGER(A);
  int i, j;
  for (i = 0; i < m; i++){
    for(j = 0; j < n; j++) {
      p[m * j + i] = j;
    }
  }
  return A;
}

// [[Rcpp::export]]
SEXP columnWise() {
  SEXP A = Rf_allocMatrix(INTSXP, n, m);
  int* p = INTEGER(A);
  int i, j;
  for(j = 0; j < m; j++) {
    for (i = 0; i < n; i++){
      p[n * j + i] = i;
    }
  }
  return A;
}

/*** R
library(microbenchmark)
gc()
microbenchmark(
  rowWise(),
  columnWise(),
  times = 1000
)
*/
The above code yields
Unit: microseconds
expr min lq mean median uq max neval
rowWise() 12.524 18.631 64.24991 20.4540 24.8385 10894.353 1000
columnWise() 11.803 19.434 40.08047 20.9005 24.1585 8590.663 1000
Assigning values row-wise is just as fast (if not faster) than assigning them column-wise, which is counter-intuitive to what I believed.
However, it does seem to depend magically on the values of m and n. So I guess my question is: why is columnWise not much faster than rowWise?
The dimension (shape) of the matrix has an impact.
When we do a row-wise scan of a 10000 x 3 integer matrix A, we can still cache effectively. For simplicity of illustration, I assume that each column of A is aligned to a cache line.
--------------------------------------
A[1, 1] A[1, 2] A[1, 3] M M M
A[2, 1] A[2, 2] A[2, 3] H H H
. . . . . .
. . . . . .
A[16,1] A[16,2] A[16,3] H H H
--------------------------------------
A[17,1] A[17,2] A[17,3] M M M
A[18,1] A[18,2] A[18,3] H H H
. . . . . .
. . . . . .
A[32,1] A[32,2] A[32,3] H H H
--------------------------------------
A[33,1] A[33,2] A[33,3] M M M
A[34,1] A[34,2] A[34,3] H H H
. . . . . .
. . . . . .
A 64-byte cache line can hold 16 integers. When we access A[1, 1], a full cache line is filled, that is, A[1, 1] to A[16, 1] are all loaded into cache. When we scan a row A[1, 1], A[1, 2], A[1, 3], a 16 x 3 block of A is loaded into cache, and it is much smaller than the cache capacity (32 KB). While we have a cache miss (M) for each element in the 1st row, when we start to scan the 2nd row we have a cache hit (H) for every element. So we have a periodic pattern as such:
[3 Misses] -> [45 Hits] -> [3 Misses] -> [45 Hits] -> ...
That is, we have on average a cache miss ratio of 3 / 48 = 1 / 16 = 6.25%. In fact, this equals the cache miss ratio we get if we scan A column-wise, where we have the following periodic pattern:
[1 Miss] -> [15 Hits] -> [1 Miss] -> [15 Hits] -> ...
Now try a 5000 x 5000 matrix. In that case, after reading the first row, 16 x 5000 elements have been fetched into cache, but that is much larger than the cache capacity, so cache eviction has already kicked out A[1, 1] to A[16, 1] (most caches apply a "least recently used" cache line replacement policy). When we come back to scan the 2nd row, we have to fetch A[2, 1] from RAM again. So a row-wise scan gives a cache miss ratio of 100%. In contrast, a column-wise scan only has a cache miss ratio of 1 / 16 = 6.25%. In this example, we will observe that the column-wise scan is much faster.
In summary, with a 10000 x 3 matrix, we have the same cache performance whether we scan it by row or by column. I don't see that rowWise is faster than columnWise from the median times reported by microbenchmark. Their execution times may not be exactly equal, but the difference is too small to be of concern.
For a 5000 x 5000 matrix, rowWise is much slower than columnWise.
Thanks for verification.
Remark
The "golden rule" that we should ensure sequential memory access in the innermost loop is a general guideline for efficiency. But don't understand it in the narrow sense.
In fact, if you treat the three columns of A as three vectors x, y, z, and consider the element-wise addition (i.e., the row-wise sum of A): z[i] = x[i] + y[i], are we not having a sequential access for all three vectors? Doesn't this fall into the "golden rule"? Scanning a 10000 x 3 matrix by row is no difference from alternately reading three vectors sequentially. And this is very efficient.
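A minimal sketch of that access pattern (plain C++ rather than Rcpp, just to show the memory traversal):
#include <vector>

int main() {
    const int m = 10000;
    // Three "columns" stored as separate vectors: walking them in lockstep
    // is the same memory pattern as a row-wise scan of a 10000 x 3
    // column-major matrix -- each stream is read or written sequentially.
    std::vector<int> x(m, 1), y(m, 2), z(m);
    for (int i = 0; i < m; ++i)
        z[i] = x[i] + y[i];
}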

How to get the first x leading binary digits of 5**x without big integer multiplication

I want to efficiently and elegantly compute, with perfect precision, the first x leading binary digits of 5**x.
For example 5**20 is 10101101011110001110101111000101101011000110001. The first 8 leading binary digits is 10101101.
In my use case, x is only up to 1-60. I don't want to create a table. A solution using 64-bit integers would be fine. I just don't want to use big integers.
first x leading binary digits of 5**x without big integer multiplication
efficiently and elegantly compute with perfect precision the first x leading binary digits of 5**x?
"compute with perfect precision" leaves out pow(). Too many implementations will return an imperfect result and FP math might not use 64 bit precision, even with long double.
Form a fixed-point value with a 64-bit whole number part .ms and a 64-bit fraction part .ls. Then loop 60 times, multiplying by 5 and dividing by 2 as needed to keep the leading bits from growing too big.
Note that some precision is lost in the fraction for N > 42, yet that is not significant enough to affect the whole number part OP is seeking.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
  uint64_t ms, ls;
} uint128;

// Simplifications possible here, leave for OP
uint128 times5(uint128 x) {
  uint128 y = x;
  for (int i=1; i<5; i++) {
    // y += x
    y.ms += x.ms;
    y.ls += x.ls;
    if (y.ls < x.ls) y.ms++;
  }
  return y;
}

uint128 div2(uint128 x) {
  x.ls = (x.ls >> 1) | (x.ms << 63);
  x.ms >>= 1;
  return x;
}

int main(void) {
  uint128 y = {.ms = 1};
  uint64_t pow2 = 2;
  for (unsigned x = 1; x <= 60; x++) {
    y = times5(y);
    while (y.ms >= pow2) {
      y = div2(y);
    }
    printf("%2u %16" PRIX64 ".%016" PRIX64 "\n", x, y.ms, y.ls);
    pow2 <<= 1;
  }
}
Output
whole part.fraction
1 1.4000000000000000
2 3.2000000000000000
3 7.D000000000000000
4 9.C400000000000000
...
57 14643E5AE44D12B.8F5FEE5AA432560D
58 32FA9BE33AC0AEC.E66FD3E29A7DD720
59 7F7285B812E1B50.401791B6823A99D0
60 9F4F2726179A224.501D762422C94044
^-------------^ This is the part OP is seeking.
The key to solving this task is divide and conquer: form an algorithm (which is simply *5 and /2 as needed) and code a type and functions to do each small step.
Is a loop of 60 efficient? Perhaps not. Another approach would use exponentiation by squaring (sketched below). That would certainly be worth it for large N, yet for N == 60 a simple loop was enough for a quick turn.
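For illustration, a minimal sketch of exponentiation by squaring on plain 64-bit integers; applying it to the uint128 fixed-point type above would additionally require a full 128-bit multiply routine:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t ipow(uint64_t base, unsigned exp) {
    uint64_t result = 1;
    while (exp) {
        if (exp & 1) result *= base;   // fold in the current exponent bit
        base *= base;                  // square for the next bit
        exp >>= 1;
    }
    return result;
}

int main(void) {
    // 5**20 still fits in 64 bits; larger powers would need the wider type
    printf("%" PRIu64 "\n", ipow(5, 20));   // 95367431640625
}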
5^n = 2^(-n) * 10^n
Using this identity, we can easily compute the leading N base-2 digits of (the nearest integer to) any given power of 5.
This code example is in C, but it's the same idea in any other language.
Example output: https://wandbox.org/permlink/Fs205DDzQR0gaLSo
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>

#define STATIC_ASSERT(CONDITION) ((void)sizeof(int[(CONDITION) ? 1 : -1]))

uint64_t pow5_leading_digits(double power, uint8_t ndigits)
{
    STATIC_ASSERT(DBL_MANT_DIG <= 64);
    double pow5 = exp2(-power) * pow(10, power);
    const double binary_digits = ceil(log2(pow5));
    assert(ndigits <= DBL_MANT_DIG);
    if (!ndigits || binary_digits < 0)
        return 0;
    // If pow5 can fit in the number of digits requested, return it
    if (binary_digits <= ndigits)
        return pow5;
    // If pow5 is too big to return, divide by 2 until it fits
    if (binary_digits > DBL_MANT_DIG)
        pow5 /= exp2(binary_digits - DBL_MANT_DIG + 1);
    return (uint64_t)pow5 >> (DBL_MANT_DIG - ndigits);
}
Edit: Now limits the returned value to those exactly representable with doubles.

Given XOR & SUM of two numbers. How to find the numbers?

Given the XOR and SUM of two numbers, how can you find the numbers?
For example, x = a + b, y = a ^ b; if x and y are given, how do you get a and b?
And if it can't be done, what is the reason?
This cannot be done reliably. A single counter-example is enough to destroy any theory and, in your case, that example is 0, 100 and 4, 96. Both of these sum to 100 and xor to 100 as well:
  0 = 0000 0000            4 = 0000 0100
100 = 0110 0100           96 = 0110 0000
      ---- ----                ---- ----
xor   0110 0100 = 100    xor   0110 0100 = 100
Hence given a sum of 100 and an xor of 100, you cannot know which of the possibilities generated that situation.
For what it's worth, this program checks the possibilities with just the numbers 0..255:
#include <stdio.h>

static void output (unsigned int a, unsigned int b) {
    printf ("%u:%u = %u %u\n", a+b, a^b, a, b);
}

int main (void) {
    unsigned int limit = 256;
    unsigned int a, b;

    output (0, 0);
    for (b = 1; b != limit; b++)
        output (0, b);
    for (a = 1; a != limit; a++)
        for (b = 1; b != limit; b++)
            output (a, b);
    return 0;
}
You can then take that output and massage it to give you all the repeated possibilities:
testprog | sed 's/ =.*$//' | sort | uniq -c | grep -v ' 1 ' | sort -k1 -n -r
which gives:
255 255:255
128 383:127
128 319:191
128 287:223
128 271:239
128 263:247
:
and so on.
Even in that reduced set, there are quite a few combinations which generate the same sum and xor, the worst being the large number of possibilities that generate a sum/xor of 255/255, which are:
255:255 = 0 255
255:255 = 1 254
255:255 = 2 253
255:255 = <n> <255-n>, for n = 3 thru 255 inclusive
It has already been shown that it can't be done, but here are two further reasons why.
For the (rather large) subset of a's and b's with (a & b) == 0, you have a + b == (a ^ b), because there can be no carries (the reverse implication does not hold). In such a case you can, for each bit that is 1 in the sum, choose whether a or b contributed that bit; for example, with a + b == a ^ b == 5 (binary 101), both (1, 4) and (5, 0) work. Obviously this subset does not cover the entire input, but it at least proves that it can't be done in general.
Furthermore, there exist many pairs (x, y) such that there is no solution to a + b == x && (a ^ b) == y, for example (there are more than just these) all pairs (x, y) where ((x ^ y) & 1) == 1 (i.e. one is odd and the other is even), because the lowest bit of the xor and of the sum must be equal (the lowest bit has no carry-in). By a simple counting argument, that means at least some pairs (x, y) must have multiple solutions: clearly every pair (a, b) has some pair (x, y) associated with it, so if not all pairs (x, y) can occur, some pairs (x, y) must be shared.
Here is a solution that finds all such pairs.
Logic:
Let the numbers be a and b. We know
s = a + b
x = a ^ b
therefore
x = (s - b) ^ b
Since we know x and s, for every int b going from 0 to s we just check whether this last equation is satisfied.
Here is the code for this:
public List<Pair<Integer>> pairs(int s, int x) {
    List<Pair<Integer>> pairs = new ArrayList<Pair<Integer>>();
    for (int i = 0; i <= s; i++) {
        int calc = (s - i) ^ i;
        if (calc == x) {
            pairs.add(new Pair<Integer>(i, s - i));
        }
    }
    return pairs;
}
The class Pair is defined as:
class Pair<T> {
    T a;
    T b;

    public String toString() {
        return a.toString() + "," + b.toString();
    }

    public Pair(T a, T b) {
        this.a = a;
        this.b = b;
    }
}
Code to test this:
public static void main(String[] args) {
    List<Pair<Integer>> pairs = new Test().pairs(100,100);
    for (Pair<Integer> p : pairs) {
        System.out.println(p);
    }
}
Output:
0,100
4,96
32,68
36,64
64,36
68,32
96,4
100,0
If you have a and b, then a + b = (a ^ b) + (a & b) * 2. This identity may be useful for you; a small sketch using it follows below.
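For illustration, a minimal sketch (not from the answer above) that uses this identity to recover one candidate pair from a given sum s and xor x, when any pair exists:
#include <cstdint>
#include <iostream>

int main() {
    // From s = x + 2*(a & b), the shared bits are (s - x) / 2. A solution
    // exists only if s >= x, s - x is even, and the shared bits do not
    // overlap the xor bits. One valid pair is then (shared, shared + x);
    // all other pairs come from distributing the xor bits differently.
    uint64_t s = 100, x = 100;                // example values from the question
    uint64_t diff = s - x;
    if (s < x || (diff & 1) || ((diff / 2) & x)) {
        std::cout << "no solution\n";
    } else {
        uint64_t a = diff / 2;                // a takes only the shared bits
        uint64_t b = a + x;                   // b takes the shared bits plus all xor bits
        std::cout << a << " " << b << "\n";   // prints 0 100
    }
}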
