R apply function on symmetrical matrix - r

I'd like to utilise one of the apply set of functions to do some calculations.
First off, I have two matrices, mat1 and mat2:
mat1:
   a  b  c
1 NA NA NA
2  1  1  1
3  1  1 NA
4 NA  1 NA
mat2:
     a    b   c
a  1.0  0.2 0.3
b -0.7  1.0 0.8
c -0.1 -0.3 1.0
mat2 is calculated from mat1 using a function that is irrelevant here. Essentially, I'd like to derive a weighting from mat1 that penalizes the results in mat2 where there is less data (and they are therefore less accurate).
So to achieve this, I want to, for some coordinate x,y in mat2, calculate the pairwise completeness of two columns of mat1.
For example: mat2["a","b"] or mat2["b","a"] (which should be the same) would become the original value * (the number of rows of mat1 complete in both a and b / the total number of rows of mat1).
So really the question is how can I apply a function to a matrix that loops every column for every column (double loop) and store this in a weight matrix to multiply against another matrix?
I can already compare two rows using rollapply from zoo package like so:
rowSums(rollapply(is.na(t(mat1)), 2, function(x) !any(x)))
I get:
[1] 2 1
As in, comparing a and b, 2 rows are complete and comparing b and c, 1 row is complete. So how can I compare a to b, a to c and b to c?
Thanks.

I was looking at your question again, and it appears that you want a matrix X with the same dimensions as mat2, where X[i,j] is given by the number of complete cases in mat1[,c(i,j)]. Then mat2 will be multiplied by X.
The number of complete cases is given by sum(complete.cases(mat1[,c(i,j)])). I want to use this in outer which requires a vectorized function, so this is passed through Vectorize:
outer(seq(nrow(mat2)), seq(ncol(mat2)),
      Vectorize(function(x, y) sum(complete.cases(mat1[, c(x, y)]))))
##      [,1] [,2] [,3]
## [1,]    2    2    1
## [2,]    2    3    1
## [3,]    1    1    1
This is your desired symmetric matrix.
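If the weight is meant to be the fraction of complete rows rather than the raw count, one possible follow-up (a sketch, assuming mat1 is the 4-row matrix printed in the question) is to divide by nrow(mat1) before multiplying into mat2:
# Rebuild mat1 as printed in the question (NA/1 pattern assumed)
mat1 <- cbind(a = c(NA, 1, 1, NA),
              b = c(NA, 1, 1, 1),
              c = c(NA, 1, NA, NA))
# Pairwise count of complete rows for every pair of columns, as above
X <- outer(seq_len(ncol(mat1)), seq_len(ncol(mat1)),
           Vectorize(function(i, j) sum(complete.cases(mat1[, c(i, j)]))))
W <- X / nrow(mat1)     # completeness as a fraction of all rows
# weighted <- mat2 * W  # element-wise penalty applied to mat2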

Related

Using distGeo with two sets of coordinates

I have two sets of coordinates (loc and stat), both in the following format:
         x         y
1 49.68375  8.978462
2 49.99174  8.238287
3 51.30842 12.411870
4 50.70487  6.627252
5 50.70487  6.627252
6 50.37381  8.040766
For each location in the first data set (location of observation) I want to know the location in the second data set (weather stations), that is closest to it. Basically matching the locations of observations to the closest weather station for later analysis of weather effects.
I tried using the distGeo function simply by putting in
distGeo(loc, stat, a=6378137, f=1/298.257223563)
But that didn't work, because loc and stat are not in the right format.
Thanks for your help!
Try this:
outer(seq_len(nrow(loc)), seq_len(nrow(stat)),
      function(a, b) geosphere::distGeo(loc[a, ], stat[b, ]))
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0.00 88604.79 419299.1 283370.9 283370.9 128560.08
# [2,] 88604.79 0.00 483632.9 194784.6 194784.6 47435.65
# [3,] 419299.12 483632.85 0.0 643230.3 643230.3 494205.86
# [4,] 283370.91 194784.62 643230.3 0.0 0.0 160540.63
# [5,] 283370.91 194784.62 643230.3 0.0 0.0 160540.63
# [6,] 128560.08 47435.65 494205.9 160540.6 160540.6 0.00
Brief explanation:
outer(1:3, 1:4, ...) expands its two index arguments into all pairwise combinations (a Cartesian product), very similar to
expand.grid(1:3, 1:4)
# Var1 Var2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 1 3
# 8 2 3
# 9 3 3
# 10 1 4
# 11 2 4
# 12 3 4
(using expand.grid only for demonstration of the expansion)
the anonymous function I defined (function(a,b) ...) is called once, where a is assigned the integer vector c(1,2,3,1,2,3,1,2,3,1,2,3) (using my 1:3 and 1:4 example) and b is assigned the integer vector c(1,1,1,2,2,2,3,3,3,4,4,4).
within the anonymous function, loc[a,] results in a much longer frame: if loc has m rows and stat has n rows, then loc[a,] has m*n rows; similarly, stat[b,] has m*n rows as well. This works well, because distGeo (and the other dist* functions in geosphere::) operates in one of two ways:
If either of the arguments has 1 row, then its distance is calculated against all rows of the other argument. Unfortunately, unless you know that loc or stat will always have just one row, this method doesn't work.
Otherwise, both arguments must have the same number of rows, and the distance is calculated row by row (1st row of the 1st arg with the 1st row of the 2nd arg, 2nd row with 2nd row, etc.). This is the method we're prepared for.
In general, the anonymous function given to outer must deal with vectorized arguments on its own. For instance, if you needed distGeo to be called once for each pair (so it would be called m*n times), you would have to handle that yourself; outer will not do it for you. There are constructs in R that support this (e.g., mapply, Map) or that replace outer (Map, expand.grid, and do.call), but that's for another question.
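Coming back to the original matching goal, a minimal sketch for picking the nearest station (assuming rows of the distance matrix index loc and columns index stat, as in the outer() call above) could be:
d <- outer(seq_len(nrow(loc)), seq_len(nrow(stat)),
           function(a, b) geosphere::distGeo(loc[a, ], stat[b, ]))
nearest <- apply(d, 1, which.min)  # for each row of loc, the index of the closest row in stat
loc$nearest_station <- nearest     # "nearest_station" is just an illustrative column name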

increasing correlation values - R

I would like to identify the pairs of values that decrease the correlation between two vectors below a threshold, and also to find the values that maximize the correlation, with the restriction that at least three pairs of values remain. I am using R.
For example, suppose to have this dataframe:
df <- data.frame(val1 = c(1,2,4,8,10,20), val2 = c(2,4,8,16, 35, 14))
rownames(df) <- c('a','b','c','d','e','f')
I would like to remove the pairs of values that don't allow me to obtain a correlation greater than 0.6; in this case I would like to find that the f element (row) decreases my correlation. Lastly, if it is easy, I would like to find that by using the a, b, c, d elements (rows) I can obtain the highest correlation.
Do you have any idea how I can do it?
Thank you in advance for your kind help.
Best
The restriction of at least 3 rows helps. There are two ways to approach the problem. Which one is best depends a bit on what you are trying to accomplish. We can start with all of the points and remove one at a time or we can start with 3 points and add one at a time. Your example has 6 points so it does not make that much difference. Here is code to find the best 3 point combination:
combos <- combn(6, 3)
corrs <- combn(6, 3, function(x) cor(df[x, ])[1, 2])
results <- cbind(t(combos), corrs)
head(results[order(corrs, decreasing=TRUE), ])
# corrs
# [1,] 1 2 3 1.0000000
# [2,] 1 2 4 1.0000000
# [3,] 2 3 4 1.0000000
# [4,] 1 3 4 1.0000000
# [5,] 1 2 5 0.9988739
# [6,] 1 2 6 0.9940219
We use the combn() function twice: once to get a matrix of the possible combinations of 3 items out of 6, and a second time to apply the correlation function to each combination.
Then we combine the results and list the best 6. There are four best 3-point solutions, each with a correlation of +1. For the 5-point solutions we get the following:
combos <- combn(6, 5)
corrs <- combn(6, 5, function(x) cor(df[x, ])[1, 2])
results <- cbind(t(combos), corrs)
head(results[order(corrs, decreasing=TRUE), ])
# corrs
# [1,] 1 2 3 4 5 0.9381942
# [2,] 1 2 3 4 6 0.7514174
# [3,] 1 2 3 5 6 0.4908234
# [4,] 1 2 4 5 6 0.4639890
# [5,] 1 3 4 5 6 0.4062324
# [6,] 2 3 4 5 6 0.3591037
Now there is one clear solution, which excludes point 6 ("f"), with a correlation of +0.938. In general the magnitude of the correlation will increase as points are removed, until it reaches +1 or -1. As the number of points increases, it will take more processing time to compute all of the alternatives. A shortcut would be to look at deviations from the first principal component:
df.pca <- prcomp(df)
abval <- abs(df.pca$x[, "PC2"])
df.pca$x[order(abval, decreasing=TRUE), "PC2"]
#           f         e         a         b         c         d
# -11.4055987 5.3497271 2.1507072 1.9191656 1.4560825 0.5299163
Point f (the 6th point) has the largest deviation from the first principal component so removing it should improve the correlation. Likewise removing e and f gives the best 4-point correlation. This is simpler, but generally you would want to remove a point, compute the principal components with that point removed and then identify the next point for removal.
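To tie this back to the 0.6 threshold in the question, a rough sketch of that iterative removal (assuming df as defined above) could look like:
keep <- df
# drop the point with the largest |PC2| deviation until the correlation
# clears the threshold or only 3 rows remain
while (nrow(keep) > 3 && cor(keep$val1, keep$val2) < 0.6) {
  pca   <- prcomp(keep)
  worst <- which.max(abs(pca$x[, "PC2"]))
  keep  <- keep[-worst, ]
}
cor(keep$val1, keep$val2)
rownames(keep)  # with this data, "f" should be removed first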

Sum two rows in a matrix

Consider the following matrix:
mat <- cbind(c(5,2,5),c(6,3,2))
I want to sum the first two rows, so that I get the following matrix:
7 9
5 2
How do I do that in R?
You should use rowsum:
> rowsum(mat, c(1,1,2))
[,1] [,2]
1 7 9
2 5 2
The first argument is your matrix mat, the second one specifies how the rows should be grouped together. Here c(1,1,2) specifies that the first two rows are in one group (and summed together) and the third row is in another group.
Note: Do not confuse this with rowSums - a different function.
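To make the note concrete, rowSums simply sums within each row instead of collapsing groups of rows:
rowSums(mat)  # 5+6, 2+3, 5+2
# [1] 11  5  7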
We can use colSums to sum the first n rows and rbind the remaining ones:
n <- 2
rbind(colSums(mat[seq_len(n), ]), mat[(n + 1):nrow(mat), ])
# [,1] [,2]
#[1,] 7 9
#[2,] 5 2
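If you want the rowsum() approach from the first answer but parameterised by n like this one, one option (a sketch, assuming the first n rows are collapsed and the rest kept as-is) is to build the grouping vector programmatically:
n <- 2
grp <- c(rep(1, n), seq(2, length.out = nrow(mat) - n))  # c(1, 1, 2) for this mat
rowsum(mat, grp)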

Function with random experiment related to value pairs

I have two vectors x1 and p:
x1 <- c(1,2,3,1,2,3)
p <- c(0.1,0.9,0.9,0.1,0.5,0.7)
Both vectors form pairs of values, see df1:
df1 <- data.frame(x1,p)
> df1
  x1   p
1  1 0.1
2  2 0.9
3  3 0.9
4  1 0.1
5  2 0.5
6  3 0.7
The following function is used to update the vector df1$x1 to a vector df1$x2, depending on a random experiment and a probability p:
rexp <- function(x, p) {
  if (runif(1) <= p) return(x + 1)
  return(x)
}
Using lapply, the function "rexp" is applied to every df1$x1 value. Depending on the random experiment, the value for x2 either remains equal to x1 or increases by 1.
In the following example, p equals 0.5:
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,rexp,0.5))
> df1
  x1   p x2
1  1 0.1  2
2  2 0.9  2
3  3 0.9  4
4  1 0.1  1
5  2 0.5  2
6  3 0.7  4
Now to my problem: I want the argument "p" in "rexp" to refer to the vector df1$p.
For example, p for df1$x1[1] should be 0.1 (as can be seen in df1$p[1]): unlist(lapply(df1$x1[1],rexp,df1$p[1])).
p for df1$x1[5] should be df1$p[5], which is 0.5: unlist(lapply(df1$x1[5],rexp,df1$p[5]))
Desired output should be something like:
> unlist(lapply(df1$x1,rexp,df1$p))
[1] 1 3 4 1 2 4
#where 1 refers to rexp(df1$x1[1],df1$p[1]),
#3 refers to rexp(df1$x1[2],df1$p[2]),
#4 refers to rexp(df1$x1[3],df1$p[3]) and so on...
Doing that "manually" leads to:
set.seed(123)
> unlist(lapply(df1$x1[1],rexp,df1$p[1]))
[1] 1
> unlist(lapply(df1$x1[2],rexp,df1$p[2]))
[1] 3
> unlist(lapply(df1$x1[3],rexp,df1$p[3]))
[1] 4
> unlist(lapply(df1$x1[4],rexp,df1$p[4]))
[1] 1
> unlist(lapply(df1$x1[5],rexp,df1$p[5]))
[1] 2
> unlist(lapply(df1$x1[6],rexp,df1$p[6]))
[1] 4
How can "rexp" be adjusted so that the function uses the specific df1$p-value for each df1$x1-value?
Note: At this point, using "lapply" is important, because a new random number should be drawn for every df1$x1 value passed to "rexp".
I am happy about any help!
Using your defined function, you may do
df1$x2 <- mapply(rexp, df1$x1, df1$p)
However, you may also exploit vectorization and use simply
df1$x2 <- df1$x1 + (runif(nrow(df1)) <= df1$p)
In this manner we element-wise add the vector df1$x1 and the logical vector runif(nrow(df1)) <= df1$p, which is coerced to a binary vector (TRUE becomes 1 and FALSE becomes 0). The comparison <= is done element-wise as well, and we draw exactly as many values from the uniform distribution as there are rows.
Regarding your approach, notice that when p is fixed there is no need for lapply, which returns a list; you may instead use
df1$x2 <- sapply(df1$x1, rexp, 0.5)
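If you want to stay close to the list-based style from the question, Map() is another option: it iterates over x1 and p in parallel like mapply but returns a list, so it pairs naturally with unlist():
set.seed(123)
df1$x2 <- unlist(Map(rexp, df1$x1, df1$p))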

How to use outer() to calculate the median between each vector?

I hope the title isn't too confusing...
Basically, I have two vectors, each of length n. I want to turn these two vectors into an n*n matrix (i.e. 2 vectors that contain 2 numbers each become a 2*2 matrix), where each position in the matrix is the median of the corresponding pair of values from the two vectors.
For example:
a<-as.vector(1,5)
b<-as.vector(1,5)
Using outer() gives me a 2*2 matrix
1 5
1
5
But, how do I fill the empty matrix with median values between each unique combination? The answer should look something like this:
1 3
3 5
Try
outer(a, b, FUN= Vectorize(function(x,y) median(c(x,y))))
# [,1] [,2]
#[1,] 1 3
#[2,] 3 5
data
a <- c(1,5)
b <- a
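As a side note, since the median of exactly two numbers equals their mean, the same matrix can also be obtained without Vectorize by letting outer work on the whole vectors:
outer(a, b, "+") / 2
#     [,1] [,2]
#[1,]    1    3
#[2,]    3    5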
