Using distGeo with two sets of coordinates - r

I have two sets of coordinates (loc and stat), both in the following format:
x y
1 49.68375 8.978462
2 49.99174 8.238287
3 51.30842 12.411870
4 50.70487 6.627252
5 50.70487 6.627252
6 50.37381 8.040766
For each location in the first data set (locations of observation), I want to know which location in the second data set (weather stations) is closest to it: basically, matching the locations of observations to the closest weather station for later analysis of weather effects.
I tried using the distGeo function by simply calling
distGeo(loc, stat, a=6378137, f=1/298.257223563)
But that didn't work, because loc and stat are not in the right format.
Thanks for your help!

Try this:
outer(seq_len(nrow(loc)), seq_len(nrow(stat)),
      function(a, b) geosphere::distGeo(loc[a, ], stat[b, ]))
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0.00 88604.79 419299.1 283370.9 283370.9 128560.08
# [2,] 88604.79 0.00 483632.9 194784.6 194784.6 47435.65
# [3,] 419299.12 483632.85 0.0 643230.3 643230.3 494205.86
# [4,] 283370.91 194784.62 643230.3 0.0 0.0 160540.63
# [5,] 283370.91 194784.62 643230.3 0.0 0.0 160540.63
# [6,] 128560.08 47435.65 494205.9 160540.6 160540.6 0.00
Brief explanation:
outer(1:3, 1:4, ...) produces two vectors that form a Cartesian product, very similar to
expand.grid(1:3, 1:4)
# Var1 Var2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 1 3
# 8 2 3
# 9 3 3
# 10 1 4
# 11 2 4
# 12 3 4
(using expand.grid only for demonstration of the expansion)
The anonymous function I defined (function(a,b)...) is called once, where a is assigned the integer vector c(1,2,3,1,2,3,1,2,3,1,2,3) (using my 1:3 and 1:4 example) and b is assigned the integer vector c(1,1,1,2,2,2,3,3,3,4,4,4).
Within the anonymous function, loc[a,] results in a much longer frame: if loc has m rows and stat has n rows, then loc[a,] has m*n rows; similarly, stat[b,] has m*n rows as well. This works well because distGeo (and the other dist* functions in geosphere) operates in one of two ways:
If either argument has 1 row, its distance is calculated against all rows of the other argument. Unless you know that loc or stat will always have just one row, this method doesn't apply.
Otherwise, both arguments must have the same number of rows, and the distance is calculated pairwise (1st row of 1st arg with 1st row of 2nd arg; 2nd row of 1st arg with 2nd row of 2nd arg; etc.). This is the method we're set up for.
In general, the anonymous function given to outer must handle vectorized arguments on its own. For instance, if you needed distGeo to be called once for each pair (so it would be called m*n times), you would have to arrange that yourself; outer will not do it for you. There are constructs in R that support this (e.g., mapply, Map) or that replace outer (Map, expand.grid, and do.call), but that's for another question.
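To finish the original matching task: the nearest station for each observation is the column index of each row's minimum in the distance matrix. A minimal sketch (the names d, nearest, and nearest_dist are mine); note that geosphere::distGeo expects coordinates ordered as (longitude, latitude), so check your column order first:
# distance matrix: rows are observations (loc), columns are stations (stat)
d <- outer(seq_len(nrow(loc)), seq_len(nrow(stat)),
           function(a, b) geosphere::distGeo(loc[a, ], stat[b, ]))
# index of the closest station for each observation, and its distance
nearest <- apply(d, 1, which.min)
nearest_dist <- d[cbind(seq_len(nrow(d)), nearest)]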

Related

increasing correlation values - R

Using a threshold, I would like to identify the pairs of values that decrease the correlation between two vectors, and also to find the values that maximize the correlation, with the restriction of keeping at least three pairs of values. I am using R.
For example, suppose we have this data frame:
df <- data.frame(val1 = c(1,2,4,8,10,20), val2 = c(2,4,8,16, 35, 14))
rownames(df) <- c('a','b','c','d','e','f')
I would like to remove the pairs of values that prevent me from obtaining a correlation greater than 0.6; in this case, I would like to find that element (row) f decreases my correlation. Lastly, if it is easy, I would like to find that using elements (rows) a, b, c, d I can obtain the highest correlation.
Do you have any idea how I can do it?
Thank you in advance for your kind help.
Best
The restriction of at least 3 rows helps. There are two ways to approach the problem, and which one is best depends a bit on what you are trying to accomplish: we can start with all of the points and remove one at a time, or we can start with 3 points and add one at a time. Your example has 6 points, so it does not make that much difference. Here is code to find the best 3-point combination:
combos <- combn(6, 3)
corrs <- combn(6, 3, function(x) cor(df[x, ])[1, 2])
results <- cbind(t(combos), corrs)
head(results[order(corrs, decreasing=TRUE), ])
# corrs
# [1,] 1 2 3 1.0000000
# [2,] 1 2 4 1.0000000
# [3,] 2 3 4 1.0000000
# [4,] 1 3 4 1.0000000
# [5,] 1 2 5 0.9988739
# [6,] 1 2 6 0.9940219
We use the combn() function twice: once to get a matrix of the possible combinations of 3 items out of 6, and a second time to apply the correlation function to each combination.
Then we combine the results and list the best 6. There are four 3-point solutions having correlations of +1. For the 5-point solutions we get the following:
combos <- combn(6, 5)
corrs <- combn(6, 5, function(x) cor(df[x, ])[1, 2])
results <- cbind(t(combos), corrs)
head(results[order(corrs, decreasing=TRUE), ])
# corrs
# [1,] 1 2 3 4 5 0.9381942
# [2,] 1 2 3 4 6 0.7514174
# [3,] 1 2 3 5 6 0.4908234
# [4,] 1 2 4 5 6 0.4639890
# [5,] 1 3 4 5 6 0.4062324
# [6,] 2 3 4 5 6 0.3591037
Now there is one clear solution, excluding point 6 ("f"), with a correlation of +.938. In general the correlation will increase as points are removed, until it reaches +1 or -1. As the number of points increases, it will take more processing time to compute all of the alternatives. A shortcut would be to look at deviations from the first principal component:
df.pca <- prcomp(df)
abval <- abs(df.pca$x[, "PC2"])
df.pca$x[order(abval, decreasing=TRUE), "PC2"]
# f e a b c d
# -11.4055987 5.3497271 2.1507072 1.9191656 1.4560825 0.5299163
Point f (the 6th point) has the largest deviation from the first principal component, so removing it should improve the correlation. Likewise, removing e and f gives the best 4-point correlation. This is simpler, but generally you would want to remove a point, recompute the principal components with that point removed, and then identify the next point for removal.
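A minimal sketch of that iterative idea combined with the question's 0.6 threshold (the function name prune_to_threshold and the greedy loop are my own construction, not from a package): repeatedly drop the row whose removal most improves the correlation, stopping once the threshold is met or only 3 rows remain.
prune_to_threshold <- function(df, threshold = 0.6) {
  keep <- seq_len(nrow(df))
  while (cor(df[keep, ])[1, 2] < threshold && length(keep) > 3) {
    # correlation obtained after dropping each remaining row in turn
    cand <- sapply(seq_along(keep), function(i) cor(df[keep[-i], ])[1, 2])
    keep <- keep[-which.max(cand)]
  }
  keep
}
prune_to_threshold(df)
# [1] 1 2 3 4 5   (dropping row 6, "f", already pushes the correlation past 0.6)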

How to find all comparisons from a large data set

I have a large data set from an experiment with about 25000 spectra. I want to see if there is any feature common to all spectra. There is a built-in function, compareSpectra, for comparing two specific spectra, but I want to write a loop that gives me the results of all possible comparisons. Finally, I want to make a data.frame or list that also records the identities of the compared spectra.
I wrote a simple loop that compares spectra 1 and 2, 2 and 3, 3 and 4, and 4 and 5.
for (i in 1:4) {
  comparison <- compareSpectra(raw_25kda[[i]], raw_25kda[[i + 1]], fun = "common")
  print(as.list(comparison))
}
From the loop, I get four numbers, 2, 5, 6, 2, for the four comparisons (1 and 2, 2 and 3, 3 and 4, 4 and 5).
The first comparison is between 1 and 2, and there are 2 common features. Is there any way to explicitly print that 1 and 2 were compared and that there are 2 common features between them?
I also want comparisons of 1 and 3, 1 and 4, 2 and 4, and 3 and 4 as well.
When I recall comparison later in a different R chunk, it gives me only a single value, such as the last value 2. How can I save the list inside the loop for future analysis? Any help will be appreciated.
I don't have the data or packages you are using, so this might be a little off, but it should hopefully point you in the right direction.
Here are all the combinations of 5 data sets:
my_data_sets <- 1:5
combos <- combn(my_data_sets, m = 2, simplify = TRUE)
combos
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,] 1 1 1 1 2 2 2 3 3 4
#[2,] 2 3 4 5 3 4 5 4 5 5
There are 10. Now we can initialize a list with ten elements to store our results in.
combo_len <- dim(combos)[2]
output <- vector("list", combo_len)
for (i in seq_len(combo_len)) {  # iterate over 1..10, not just the single value combo_len
  set1 <- combos[1, i]
  set2 <- combos[2, i]
  output[[i]] <- compareSpectra(raw_25kda[[set1]], raw_25kda[[set2]], fun = "common")
}
The output object should now have ten elements, each representing its respective combination.
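To keep the identities together with the counts, the list can then be flattened into a data.frame; a sketch, assuming compareSpectra(..., fun = "common") returns a single number per pair, as your output of 2, 5, 6, 2 suggests:
results <- data.frame(spectrum1 = combos[1, ],
                      spectrum2 = combos[2, ],
                      common    = unlist(output))
results  # one row per comparison, e.g. spectrum1 = 1, spectrum2 = 2, common = 2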

R apply function on symmetrical matrix

I'd like to utilise one of the apply set of functions to do some calculations.
First off, I have two matrices, mat1 and mat2:
mat1:
a b c
1 NA NA NA
2 1 1 1
3 1 1 NA
4 NA 1 NA
mat2:
a b c
a 1.0 0.2 0.3
b -0.7 1.0 0.8
c -0.1 -0.3 1.0
mat2 is calculated from mat1 using a function that is irrelevant here; essentially I'd like to apply a weighting, based on mat1, that penalizes the results in mat2 when there is less data (and they are therefore less accurate).
So to achieve this I want, for some coordinate x,y in mat2, to calculate the pairwise completeness of the two corresponding columns of mat1.
For example: mat2["a","b"] or mat2["b","a"] (which should be the same) would become the original value * (the number of rows of mat1 complete in both a and b / the total number of rows of mat1).
So really the question is: how can I apply a function over every column for every column of a matrix (a double loop) and store the result in a weight matrix to multiply against another matrix?
I can already compare adjacent columns using rollapply from the zoo package, like so:
rowSums(rollapply(is.na(t(mat1)), 2, function(x) !any(x)))
I get:
[1] 2 1
As in: comparing a and b, 2 rows are complete, and comparing b and c, 1 row is complete. So how can I compare a to b, a to c, and b to c?
Thanks.
Looking at your question again, it appears that you want a matrix X with the same dimensions as mat2, where X[i,j] is given by the number of complete cases in mat1[,c(i,j)]. Then mat2 will be multiplied by X.
The number of complete cases is given by sum(complete.cases(mat1[,c(i,j)])). I want to use this in outer, which requires a vectorized function, so it is passed through Vectorize:
outer(seq(nrow(mat2)), seq(ncol(mat2)),
Vectorize(function(x,y) sum(complete.cases(mat1[,c(x,y)])))
)
## [,1] [,2] [,3]
## [1,] 2 2 1
## [2,] 2 3 1
## [3,] 1 1 1
This is your desired symmetric matrix.
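From there, the weighting described in the question (complete rows divided by total rows, multiplied against mat2) takes one more line; a sketch reusing the X computed above (the name weighted is mine):
X <- outer(seq(nrow(mat2)), seq(ncol(mat2)),
           Vectorize(function(x, y) sum(complete.cases(mat1[, c(x, y)]))))
weighted <- mat2 * (X / nrow(mat1))  # completeness fraction as a penalty weight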

Extract a portion of 1 column from data.frame/matrix

I get flummoxed by some of the simplest things. In the following code I want to extract just a portion of one column in a data.frame called 'a'. I get the right values, but the result is padded with NAs, which I don't want. 'b' is the extracted column; 'c' is the correct portion of data but has extra NA padding at the end.
How do I best do this so that 'c' naturally ends up only 9 elements long (i.e., the 15 original minus the 6 I skipped)?
NumBars = 6
a = as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))
a[,2] = c(11,12,13,14,15,16,17,18,19,20,21,22,23,24,25)
names(a)[1] = "Data1"
names(a)[2] = "Data2"
# Use 1st column of data only
b = as.matrix(a[,1])
c = as.matrix(b[NumBars+1:length(b)])
The immediate reason you're getting NAs is that the sequence operator : takes precedence over the addition operator +, as detailed in the R Language Definition. Therefore NumBars+1:length(b) is not the same as (NumBars+1):length(b): the first adds NumBars to the vector 1:length(b), while the second adds first and then takes the sequence.
ind.1 <- 1+1:3 # == 2:4
ind.2 <- (1+1):3 # == 2:3
When you index with this longer vector, you get all the elements you want, but you are also asking for entries like b[length(b)+1], which the R Language Definition tells us returns NA. That's why you have trailing NAs:
If i is positive and exceeds length(x) then the corresponding
selection is NA. A negative out of bounds value for i causes an error.
b <- c(1,2,3)
b[ind.1]
#[1] 2 3 NA
b[ind.2]
#[1] 2 3
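Applied to the original code, the direct fix is therefore to parenthesize the addition:
c = as.matrix(b[(NumBars + 1):length(b)])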
From a design perspective, the other solutions listed here are good choices to help avoid this mistake.
It is often easier to think of what you want to remove from your vector / matrix. Use negative subscripts to remove items.
c = as.matrix(b[-1:-NumBars])
c
## [,1]
## [1,] 7
## [2,] 8
## [3,] 9
## [4,] 10
## [5,] 11
## [6,] 12
## [7,] 13
## [8,] 14
## [9,] 15
If your goal is to remove NAs from a column, you can also do something like
c <- na.omit(a[,1])
E.g.
> x
[1] 1 2 3 NA NA
> na.omit(x)
[1] 1 2 3
attr(,"na.action")
[1] 4 5
attr(,"class")
[1] "omit"
You can ignore the attributes - they are there to let you know what elements were removed.

Interpolating data in R

Let's suppose I have a 3 by 5 matrix in R:
4 5 5 6 8
3 4 4 5 6
2 3 3 3 4
I would like to interpolate between these values to create a 15 by 25 matrix, and I would also like to specify whether the interpolation is linear, Gaussian, etc. How can I do this?
For example, if I have a small matrix like this
2 3
1 3
and I want it to become 3 by 3, then it might look like
2 2.5 3
1.5 2.2 3
1 2 3
app <- function(x, n) approx(x, n = n)$y  # or whatever interpolation you want
x <- matrix(c(2, 1, 3, 3), nrow = 2)      # the small 2 by 2 example above
nr <- 3; nc <- 3                          # target dimensions
apply(t(apply(x, 1, function(r) app(r, nc))), 2, function(col) app(col, nr))
[,1] [,2] [,3]
[1,] 2.0 2.50 3
[2,] 1.5 2.25 3
[3,] 1.0 2.00 3
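The same two-pass call scales to the 3 by 5 example; a sketch, assuming the matrix is stored in m:
m <- matrix(c(4, 5, 5, 6, 8,
              3, 4, 4, 5, 6,
              2, 3, 3, 3, 4), nrow = 3, byrow = TRUE)
big <- apply(t(apply(m, 1, function(r) app(r, 25))), 2, function(col) app(col, 15))
dim(big)
# [1] 15 25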
A long time ago I wrote a similar toy, except I never got around to defining the interpolation function. There's also raster::disaggregate.
zexpand <- function(inarray, fact = 2, interp = FALSE, ...) {
  # do same analysis of fact to allow one or two values, fact >= 1 required, etc.
  fact <- as.integer(round(fact))
  switch(as.character(length(fact)),
         '1' = xfact <- yfact <- fact,
         '2' = {xfact <- fact[1]; yfact <- fact[2]},
         {xfact <- fact[1]; yfact <- fact[2]
          warning('fact is too long. First two values used.')})
  if (xfact < 1) stop('fact[1] must be > 0')
  if (yfact < 1) stop('fact[2] must be > 0')
  # column expansion
  bigtmp <- matrix(rep(t(inarray), each = xfact),
                   nrow(inarray), ncol(inarray) * xfact, byrow = TRUE)
  # row expansion
  bigx <- t(matrix(rep(bigtmp, each = yfact),
                   ncol(bigtmp), nrow(bigtmp) * yfact, byrow = TRUE))
  # the interpolation would go here. Or use interp.loess on the output (won't
  # handle complex data). Also look at fields::Tps, which probably does a much
  # better job anyway; just do the Re and Im data separately.
  return(invisible(bigx))
}
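For example, a quick call on a 2 by 2 matrix shows the pure block expansion (each element becomes a fact-by-fact block; note the invisible() return, so print explicitly):
print(zexpand(matrix(1:4, nrow = 2), fact = 2))
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    3    3
# [2,]    1    1    3    3
# [3,]    2    2    4    4
# [4,]    2    2    4    4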
