I would like to compute the element-wise means across multiple blocks of the same dataframe. My input table looks like this, and it consists of 3 (3x3) blocks, with each block having a diagonal of ones:
input = data.frame(
  var1 = c(1,7,4,1,2,9,1,8,3),
  var2 = c(3,1,9,4,1,8,3,1,8),
  var3 = c(3,9,1,6,8,1,3,5,1) )
The output should be a 3x3 table containing the means of the elements located at the same position in each block. E.g. the first row of the output table should be c(1, 3.33, 4). Any idea how to code this neatly? Thank you.
do.call(rbind, lapply(split(input, 1:3), colMeans))
var1 var2 var3
1 1.000000 3.333333 4.000000
2 5.666667 1.000000 7.333333
3 5.333333 8.333333 1.000000
You could use tapply or even aggregate:
tapply(unlist(input), list((row(input)-1)%%3,col(input)), mean)
1 2 3
0 1.000000 3.333333 4.000000
1 5.666667 1.000000 7.333333
2 5.333333 8.333333 1.000000
aggregate(.~id, cbind(id=rep(1:3,3),input),mean)
id var1 var2 var3
1 1 1.000000 3.333333 4.000000
2 2 5.666667 1.000000 7.333333
3 3 5.333333 8.333333 1.000000
If all of the blocks have the same dimensions, we can also go the array route:
t(apply(array(as.matrix(input), c(3, 3, 3)), 1, colMeans))
Output:
[,1] [,2] [,3]
[1,] 1.000000 3.333333 4.000000
[2,] 5.666667 1.000000 7.333333
[3,] 5.333333 8.333333 1.000000
I have multiple different phenotypes and xy coordinates for each cell. What would be the easiest way to calculate distances between each of my cells within the same slide? My dataset has 100,000+ cells so I'm trying to figure out the most efficient way to do this.
An example dataframe would be:
Xposition <- c(1,6,4,7,9,4,8,6,4)
Yposition <- c(6,3,2,6,3,6,1,3,7)
Phenotype <- c("A", "A", "B", "C", "C", "A", "A", "B", "B")
SlideID <- c(111,111,111,111,111,112,112,112,112)
df <- data.frame(Xposition, Yposition, Phenotype, SlideID)
I'm looking for something that could give me a dataframe where the outputs are something like:
CellType1 <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "C", "A", "A", "A", "A", "A", "B")
Celltype2 <- c("A", "B", "C", "C", "B", "C", "C", "C", "C", "C", "A", "B", "B", "B", "B", "B")
Distance <- c("5.83", "5", "6", "8.54", "2.23", "3.16", "3", "5", "5.09", "3.6", "6.4", "3.6", "1", "2.82", "7.21", "4.47")
SlideID <- c("111", "111", "111", "111", "111", "111", "111", "111", "111", "111", "112", "112", "112", "112", "112", "112")
distancedf <- data.frame(CellType1, Celltype2, Distance, SlideID)
Thanks for your help!
I think there is room for ambiguity here, but ...
res <- as.data.frame.table(as.matrix(dist(df[,1:2])))
res$Var2 <- df$Phenotype[res$Var2]
res$SlideID <- df$SlideID[res$Var1]
res$Var1 <- df$Phenotype[res$Var1]
head(res)
# Var1 Var2 Freq SlideID
# 1 A A 0.000000 111
# 2 A A 5.830952 111
# 3 B A 5.000000 111
# 4 C A 6.000000 111
# 5 C A 8.544004 111
# 6 A A 3.000000 112
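From this, you should be able to filter out the self-pairs (the 0s on the diagonal) fairly easily, but I wanted to keep them here to show what is actually happening. Note that filtering on Freq == 0 would also drop distinct cells that share coordinates: cells 2 and 8 are both at (6, 3), so their distance is 0 as well. Filtering on the Var1/Var2 indices before relabelling them avoids that. Effectively, that as.data.frame.table(...) is going from this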
dist(df[,1:2])
# 1 2 3 4 5 6 7 8
# 2 5.830952
# 3 5.000000 2.236068
# 4 6.000000 3.162278 5.000000
# 5 8.544004 3.000000 5.099020 3.605551
# 6 3.000000 3.605551 4.000000 3.000000 5.830952
# 7 8.602325 2.828427 4.123106 5.099020 2.236068 6.403124
# 8 5.830952 0.000000 2.236068 3.162278 3.000000 3.605551 2.828427
# 9 3.162278 4.472136 5.000000 3.162278 6.403124 1.000000 7.211103 4.472136
through this:
as.matrix(dist(df[,1:2]))
# 1 2 3 4 5 6 7 8 9
# 1 0.000000 5.830952 5.000000 6.000000 8.544004 3.000000 8.602325 5.830952 3.162278
# 2 5.830952 0.000000 2.236068 3.162278 3.000000 3.605551 2.828427 0.000000 4.472136
# 3 5.000000 2.236068 0.000000 5.000000 5.099020 4.000000 4.123106 2.236068 5.000000
# 4 6.000000 3.162278 5.000000 0.000000 3.605551 3.000000 5.099020 3.162278 3.162278
# 5 8.544004 3.000000 5.099020 3.605551 0.000000 5.830952 2.236068 3.000000 6.403124
# 6 3.000000 3.605551 4.000000 3.000000 5.830952 0.000000 6.403124 3.605551 1.000000
# 7 8.602325 2.828427 4.123106 5.099020 2.236068 6.403124 0.000000 2.828427 7.211103
# 8 5.830952 0.000000 2.236068 3.162278 3.000000 3.605551 2.828427 0.000000 4.472136
# 9 3.162278 4.472136 5.000000 3.162278 6.403124 1.000000 7.211103 4.472136 0.000000
ultimately to this
head(as.data.frame.table(as.matrix(dist(df[,1:2]))))
# Var1 Var2 Freq
# 1 1 1 0.000000
# 2 2 1 5.830952
# 3 3 1 5.000000
# 4 4 1 6.000000
# 5 5 1 8.544004
# 6 6 1 3.000000
and the 0.000s are the diagonals of the distance matrix (that are masked in the default representation of dist(...)).
Per SlideID:
lapply(split(df, df$SlideID), function(x) {
  res <- as.data.frame.table(as.matrix(dist(x[,1:2])))
  res$Var2 <- x$Phenotype[res$Var2]
  res$SlideID <- x$SlideID[res$Var1]
  res$Var1 <- x$Phenotype[res$Var1]
  res
})
# $`111`
# Var1 Var2 Freq SlideID
# 1 A A 0.000000 111
# 2 A A 5.830952 111
# 3 B A 5.000000 111
# 4 C A 6.000000 111
# 5 C A 8.544004 111
# 6 A A 5.830952 111
# 7 A A 0.000000 111
# 8 B A 2.236068 111
# 9 C A 3.162278 111
# 10 C A 3.000000 111
# 11 A B 5.000000 111
# 12 A B 2.236068 111
# 13 B B 0.000000 111
# 14 C B 5.000000 111
# 15 C B 5.099020 111
# 16 A C 6.000000 111
# 17 A C 3.162278 111
# 18 B C 5.000000 111
# 19 C C 0.000000 111
# 20 C C 3.605551 111
# 21 A C 8.544004 111
# 22 A C 3.000000 111
# 23 B C 5.099020 111
# 24 C C 3.605551 111
# 25 C C 0.000000 111
# $`112`
# Var1 Var2 Freq SlideID
# 1 A A 0.000000 112
# 2 A A 6.403124 112
# 3 B A 3.605551 112
# 4 B A 1.000000 112
# 5 A A 6.403124 112
# 6 A A 0.000000 112
# 7 B A 2.828427 112
# 8 B A 7.211103 112
# 9 A B 3.605551 112
# 10 A B 2.828427 112
# 11 B B 0.000000 112
# 12 B B 4.472136 112
# 13 A B 1.000000 112
# 14 A B 7.211103 112
# 15 B B 4.472136 112
# 16 B B 0.000000 112
I am currently learning to use MPI and trying to scatter blocks of my array to the rest of my processors.
My root processor is the last one (nproc-1) and I am generating the array in that processor. In my next iteration of my code it will be a random array.
For all my processors I am allocating contiguous memory using calloc both for 'array' and 'grain'.
Grain stores the data to process and since I need the above and below rows from the original array, I made it of size grain_length+2.
My issue is that I get the correct data from the original array except for the last two values (see output example below).
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int i, j, m;
    int array_size, grain_length;
    int rc, rank, nproc;
    MPI_Status status;

    rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS)
    {
        printf("Error starting MPI Program.\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    array_size = 8;
    grain_length = array_size / nproc;

    /* one calloc for the row pointers, then one calloc per row */
    double **array= (double **) calloc(array_size, sizeof (double *));
    for (i = 0; i < array_size; i++)
        array[i] = (double *) calloc(array_size, sizeof (double));
    double **grain = (double **) calloc(grain_length+2, sizeof (double *));
    for (i = 0; i < grain_length + 2; i++)
        grain[i] = (double *) calloc(array_size, sizeof (double));
    if (array == NULL || grain == NULL)
    {
        printf("Memory could not be allocated for the arrays.");
        exit(EXIT_FAILURE);
    }

    /* the root (last rank) fills the array */
    if (rank == nproc-1)
    {
        for (i = 0; i < array_size; i++)
        {
            for (j = 0; j < array_size; j++)
            {
                //array[i][j] = rand() % 10;
                array[i][j] = i+j;
            }
        }
    }

    MPI_Scatter(
        &array[0][0], grain_length*array_size, MPI_DOUBLE,
        &grain[1][0], grain_length*array_size, MPI_DOUBLE,
        nproc-1, MPI_COMM_WORLD);

    /* print each grain in rank order */
    for (m = 0; m < nproc; m++)
    {
        if (rank == m)
        {
            printf("Grain from processor %d:\n", rank);
            for (i = 0; i < grain_length+2; i++)
            {
                for (j = 0; j < array_size; j++)
                {
                    printf("%f\t", grain[i][j]);
                }
                printf("\n");
            }
            printf("\n");
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }

    if (rank == nproc-1)
    {
        printf("Array from processor %d:\n", rank);
        for (i = 0; i < array_size; i++)
        {
            for (j = 0; j < array_size; j++)
            {
                printf("%f\t", array[i][j]);
            }
            printf("\n");
        }
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}
Here is the output. In Grain 0, the first and last row are 0s as expected since the above and below rows will be sent and placed there. Then the second row is correct but the third row is missing the 7 and 8 values which are the first values in Grain 1.
Are the two 0s in Grain 0 the array's two pointers addresses? I don't understand why I am getting incomplete data when the array in memory is stored contiguously.
I tried to use scatterv with the displacement but I am not sure I understand how it works.
I also tried to create an MPI type but didn't get far with that either.
What I managed to do is broadcast each row of the array to all the other processors, but I think that is quite inefficient. This is how I did it:
for (i=0; i < array_size; i++)
    MPI_Bcast(&array[i][0], array_size, MPI_DOUBLE, nproc-1, MPI_COMM_WORLD);
Many thanks in advance for your help!!
Grain from processor 0:
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Grain from processor 1:
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
7.000000 8.000000 0.000000 0.000000 2.000000 3.000000 4.000000 5.000000
8.000000 9.000000 0.000000 0.000000 3.000000 4.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Grain from processor 2:
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 0.000000 0.000000
6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Grain from processor 3:
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000
0.000000 0.000000 6.000000 7.000000 8.000000 9.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Array from processor 3:
0.000000 1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000
2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000
3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000
4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000
5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 12.000000
6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 12.000000 13.000000
7.000000 8.000000 9.000000 10.000000 11.000000 12.000000 13.000000 14.000000
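On the question of why the data arrives incomplete: each row of array and grain comes from its own calloc call, so only the elements within a row are guaranteed to be adjacent. Nothing in C places row i+1 directly after row i, and MPI_Scatter simply reads grain_length*array_size consecutive doubles starting at &array[0][0]. Below is a minimal sketch of one common alternative, assuming it is acceptable to change the allocation: put all the values in a single block and build the row pointers on top of it. The names alloc_contiguous and free_contiguous are hypothetical helpers, not part of MPI or of the original code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a rows x cols matrix of doubles whose
   elements live in ONE zero-initialised block, so that &m[0][0] points at
   rows*cols consecutive doubles. Returns NULL if allocation fails. */
double **alloc_contiguous(int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    double *data = calloc((size_t)rows * (size_t)cols, sizeof *data);
    double **m = malloc((size_t)rows * sizeof *m);
    if (data == NULL || m == NULL) {
        free(data);
        free(m);
        return NULL;
    }
    /* row i starts i*cols doubles into the single data block */
    for (int i = 0; i < rows; i++)
        m[i] = data + (size_t)i * (size_t)cols;
    return m;
}

/* Hypothetical helper: release a matrix created by alloc_contiguous. */
void free_contiguous(double **m)
{
    if (m != NULL) {
        free(m[0]);  /* the single data block */
        free(m);     /* the table of row pointers */
    }
}
```

With both array and grain (the latter with grain_length+2 rows) allocated like this, array[i][j] indexing is unchanged, and the original MPI_Scatter call with a count of grain_length*array_size should deliver whole rows into grain[1] onwards.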
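On my machine I was able to get the output you want by increasing the count sent to each process. Be aware that this only works by accident of the memory layout: every row comes from its own calloc call, so the rows are not adjacent in memory. Judging by your output, consecutive rows sit 10 doubles apart (8 values plus 2 doubles' worth of allocator bookkeeping), so two rows span 20 doubles rather than 16. The larger count therefore also copies those bookkeeping words, which is undefined behaviour and may break with a different allocator, array size or number of processes. Allocating the matrix as a single block and keeping the original count is the portable fix. The change is from: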
MPI_Scatter(
&array[0][0], grain_length*array_size, MPI_DOUBLE,
&grain[1][0], grain_length*array_size, MPI_DOUBLE,
nproc-1, MPI_COMM_WORLD);
To:
MPI_Scatter(
&array[0][0], 4+grain_length*array_size, MPI_DOUBLE,
&grain[1][0], 4+grain_length*array_size, MPI_DOUBLE,
nproc-1, MPI_COMM_WORLD);
The result:
Grain from processor 0:
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Grain from processor 1:
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000
3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Grain from processor 2:
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000
5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 12.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Grain from processor 3:
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 12.000000 13.000000
7.000000 8.000000 9.000000 10.000000 11.000000 12.000000 13.000000 14.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Array from processor 3:
0.000000 1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000
2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000
3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000
4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000
5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 12.000000
6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 12.000000 13.000000
7.000000 8.000000 9.000000 10.000000 11.000000 12.000000 13.000000 14.000000
I hope it will help you.
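Regarding the MPI_Scatterv attempt mentioned in the question: its counts and displacements are both expressed in elements of the send type, measured from the start of the send buffer, so they only describe whole rows if the matrix is stored in one contiguous block (which separately calloc'd rows are not). A minimal sketch, under that assumption, of how the two arrays could be filled when the number of rows does not divide evenly. The function name row_block_layout is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: fill sendcounts[r] and displs[r] (both in units of
   doubles) for scattering the rows of a rows x cols contiguous matrix over
   nproc ranks. The first (rows % nproc) ranks receive one extra row. */
void row_block_layout(int rows, int cols, int nproc,
                      int *sendcounts, int *displs)
{
    assert(rows >= 0 && cols > 0 && nproc > 0);
    int base = rows / nproc;    /* rows every rank gets */
    int extra = rows % nproc;   /* leftover rows to hand out */
    int offset = 0;             /* running displacement in doubles */
    for (int r = 0; r < nproc; r++) {
        int nrows = base + (r < extra ? 1 : 0);
        sendcounts[r] = nrows * cols;
        displs[r] = offset;
        offset += sendcounts[r];
    }
}
```

The call would then look something like MPI_Scatterv(&array[0][0], sendcounts, displs, MPI_DOUBLE, &grain[1][0], sendcounts[rank], MPI_DOUBLE, nproc-1, MPI_COMM_WORLD), with grain sized to hold sendcounts[rank]/array_size rows plus the two halo rows.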
Consider we have two different datasets:
X1 = c(1,2,4,5,1,3,1)
Y1 = c(3,5,6,3,1,5,1)
df1= data.frame(X1,Y1)
X2 = c(2,3,4,3,2,3,2)
Y2 = c(3,4,2,6,4,3,4)
df2= data.frame(X2,Y2)
These data are represented in this scatterplot:
I would like to calculate the distances between the 7 XY coordinates in df1 (black open dots) and the 7 XY coordinates in df2 (red open triangles).
I know how to calculate the distances between the XY coordinates within a dataset using dist() and cbind(). But I don't know how to do the same but with XY coordinates in two different datasets.
Using the two datasets, we would obtain a table of 7 columns and 7 rows, filled with the distances between all these coordinates. Column names would be the coordinates in df1 and row names would be the coordinates in df2.
How can I get this data frame with all t
Maybe this strategy will help:
X1 = c(1,2,4,5,1,3,1)
Y1 = c(3,5,6,3,1,5,1)
df1= data.frame(X1,Y1)
X2 = c(2,3,4,3,2,3,2)
Y2 = c(3,4,2,6,4,3,4)
df2= data.frame(X2,Y2)
library(tidyverse)
# keep df_type in select(), otherwise the label added by mutate() is dropped
df1 = df1 %>% mutate(df_type = "data1") %>% select(X = X1, Y = Y1, df_type)
df2 = df2 %>% mutate(df_type = "data2") %>% select(X = X2, Y = Y2, df_type)
# link data frames by row
df = bind_rows(df1, df2)
dist(cbind(df$X,df$Y))
1 2 3 4 5 6 7 8 9 10 11 12 13
2 2.236068
3 4.242641 2.236068
4 4.000000 3.605551 3.162278
5 2.000000 4.123106 5.830952 4.472136
6 2.828427 1.000000 1.414214 2.828427 4.472136
7 2.000000 4.123106 5.830952 4.472136 0.000000 4.472136
8 1.000000 2.000000 3.605551 3.000000 2.236068 2.236068 2.236068
9 2.236068 1.414214 2.236068 2.236068 3.605551 1.000000 3.605551 1.414214
10 3.162278 3.605551 4.000000 1.414214 3.162278 3.162278 3.162278 2.236068 2.236068
11 3.605551 1.414214 1.000000 3.605551 5.385165 1.000000 5.385165 3.162278 2.000000 4.123106
12 1.414214 1.000000 2.828427 3.162278 3.162278 1.414214 3.162278 1.000000 1.000000 2.828427 2.236068
13 2.000000 2.236068 3.162278 2.000000 2.828427 2.000000 2.828427 1.000000 1.000000 1.414214 3.000000 1.414214
14 1.414214 1.000000 2.828427 3.162278 3.162278 1.414214 3.162278 1.000000 1.000000 2.828427 2.236068 0.000000 1.414214
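Rows and columns 1 to 7 are the points from df1, and 8 to 14 are the points from df2. From this you can create a data.frame with the distances between the df1 points and the df2 points. First we need to transform the dist object into a data frame: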
df_dist = data.frame(as.matrix(dist(cbind(df$X,df$Y))))
With a bit of manipulation we can keep only the df1-to-df2 distances. Here X1..X7 label the points of df1 (columns) and Y1..Y7 the points of df2 (rows 8 to 14):
df_dist_x = df_dist %>% select(X1:X7) %>%
  mutate(row.1 = 1:nrow(df_dist)) %>%
  filter(row.1 >= 8) %>%
  mutate(Y = paste0("Y",row_number())) %>%
  gather(X, distance, X1:X7) %>%
  select(X, Y, distance)
head(df_dist_x)
X Y distance
1 X1 Y1 1.000000
2 X1 Y2 2.236068
3 X1 Y3 3.162278
4 X1 Y4 3.605551
5 X1 Y5 1.414214
6 X1 Y6 2.000000
My vector is:
a <- c(2,0,0,1,0,3,0,0,0,0,5,6)
The zeros represent missing values.
I want to interpolate the missing values only where the run of consecutive zeros is shorter than 4, and I am looking for a function that lets me do this.
The desired output is [1] 2.00 1.67 1.33 1.00 2.00 3.00 0.00 0.00 0.00 0.00 5.00 6.00
library(zoo)
temp1 = na.approx(replace(a, a == 0, NA))
temp2 = inverse.rle(with(rle(a), list(values = replace(values, values == 0 & lengths < 4, NA),
                                      lengths = lengths)))
replace(temp2, is.na(temp2), temp1[is.na(temp2)])
# [1] 2.000000 1.666667 1.333333 1.000000 2.000000 3.000000 0.000000 0.000000
# [9] 0.000000 0.000000 5.000000 6.000000
The following (as suggested in a comment by G.Grothendieck) is better:
temp = na.approx(object = replace(a, a == 0, NA), maxgap = 3)
replace(temp, is.na(temp), 0)
# [1] 2.000000 1.666667 1.333333 1.000000 2.000000 3.000000 0.000000 0.000000
# [9] 0.000000 0.000000 5.000000 6.000000