Let's suppose I have a 3 by 5 matrix in R:
4 5 5 6 8
3 4 4 5 6
2 3 3 3 4
I would like to interpolate between these values to create a matrix of size 15 by 25. I would also like to specify whether the interpolation is linear, Gaussian, etc. How can I do this?
For example, if I have a small matrix like this
2 3
1 3
and I want it to become 3 by 3, then it might look like
2 2.5 3
1.5 2.2 3
1 2 3
app <- function(v, n) approx(v, n = n)$y  # or whatever interpolation you want
# x is the input matrix; nr and nc are the target numbers of rows and columns
apply(t(apply(x, 1, function(v) app(v, nc))), 2, function(v) app(v, nr))
[,1] [,2] [,3]
[1,] 2.0 2.50 3
[2,] 1.5 2.25 3
[3,] 1.0 2.00 3
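If it helps to generalize this to the 3 by 5 to 15 by 25 case, here is a hedged sketch wrapping the same row-then-column idea into a function; resize_mat and its arguments are my own names, and it offers linear (approx) or spline interpolation rather than a Gaussian kernel:
resize_mat <- function(m, nr, nc, method = c("linear", "spline")) {
  method <- match.arg(method)
  app <- function(v, n) {
    if (method == "linear") approx(seq_along(v), v, n = n)$y
    else spline(seq_along(v), v, n = n)$y
  }
  rows_done <- t(apply(m, 1, app, n = nc))  # interpolate along each row first
  apply(rows_done, 2, app, n = nr)          # then along each column
}

m <- matrix(c(4, 5, 5, 6, 8,
              3, 4, 4, 5, 6,
              2, 3, 3, 3, 4), nrow = 3, byrow = TRUE)
dim(resize_mat(m, 15, 25, method = "spline"))
## expected: 15 25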
A long time ago I wrote a similar toy, except I never got around to defining the interpolation function. There's also raster::disaggregate.
zexpand <- function(inarray, fact = 2, interp = FALSE, ...) {
  # do the same analysis of fact to allow one or two values, fact >= 1 required, etc.
  fact <- as.integer(round(fact))
  switch(as.character(length(fact)),
         '1' = xfact <- yfact <- fact,
         '2' = {xfact <- fact[1]; yfact <- fact[2]},
         {xfact <- fact[1]; yfact <- fact[2]
          warning('fact is too long. First two values used.')})
  if (xfact < 1) stop('fact[1] must be >= 1')
  if (yfact < 1) stop('fact[2] must be >= 1')
  # column expansion: repeat each element xfact times across each row
  bigtmp <- matrix(rep(t(inarray), each = xfact),
                   nrow(inarray), ncol(inarray) * xfact, byrow = TRUE)
  # row expansion: repeat each element yfact times down each column
  bigx <- t(matrix(rep(bigtmp, each = yfact),
                   ncol(bigtmp), nrow(bigtmp) * yfact, byrow = TRUE))
  # The interpolation would go here. Or use interp.loess on the output (won't
  # handle complex data). Also look at fields::Tps, which probably does
  # a much better job anyway; just apply it separately to the Re and Im parts.
  return(invisible(bigx))
}
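For what it's worth, a quick check of zexpand against the 3 by 5 matrix from the question; with fact = 5 it should return a 15 by 25 block-replicated matrix (no interpolation, since interp is not implemented):
m <- matrix(c(4, 5, 5, 6, 8,
              3, 4, 4, 5, 6,
              2, 3, 3, 3, 4), nrow = 3, byrow = TRUE)
dim(zexpand(m, fact = 5))
## expected: 15 25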
I have two sets of coordinates (loc and stat) both in the following format
x y
1 49.68375 8.978462
2 49.99174 8.238287
3 51.30842 12.411870
4 50.70487 6.627252
5 50.70487 6.627252
6 50.37381 8.040766
For each location in the first data set (location of observation) I want to know the location in the second data set (weather stations), that is closest to it. Basically matching the locations of observations to the closest weather station for later analysis of weather effects.
I tried using the distGeo function simply by putting in
distGeo(loc, stat, a=6378137, f=1/298.257223563)
But that didn't work, because loc and stat are not in the right format.
Thanks for your help!
Try this:
outer(seq_len(nrow(loc)), seq_len(nrow(stat)),
function(a,b) geosphere::distGeo(loc[a,], stat[b,]))
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0.00 88604.79 419299.1 283370.9 283370.9 128560.08
# [2,] 88604.79 0.00 483632.9 194784.6 194784.6 47435.65
# [3,] 419299.12 483632.85 0.0 643230.3 643230.3 494205.86
# [4,] 283370.91 194784.62 643230.3 0.0 0.0 160540.63
# [5,] 283370.91 194784.62 643230.3 0.0 0.0 160540.63
# [6,] 128560.08 47435.65 494205.9 160540.6 160540.6 0.00
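To answer the original question, the closest station for each observation is then the column index of the row-wise minimum of that matrix; a small sketch (d is just the matrix computed above):
d <- outer(seq_len(nrow(loc)), seq_len(nrow(stat)),
           function(a, b) geosphere::distGeo(loc[a, ], stat[b, ]))
nearest_station <- apply(d, 1, which.min)                    # index of the closest station per location
nearest_dist <- d[cbind(seq_len(nrow(d)), nearest_station)]  # its distance in metres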
Brief explanation:
outer(1:3, 1:4, ...) expands its two arguments into vectors covering their cartesian product, very similar to
expand.grid(1:3, 1:4)
# Var1 Var2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 1 3
# 8 2 3
# 9 3 3
# 10 1 4
# 11 2 4
# 12 3 4
(using expand.grid only for demonstration of the expansion)
the anonymous function I defined (function(a,b)...) is called once, where a is assigned the integer vector c(1,2,3,1,2,3,1,2,3,1,2,3) (using my 1:3 and 1:4 example), and b is assigned the integer vector c(1,1,1,2,2,2,3,3,3,4,4,4).
within the anon func, loc[a,] results in a much longer frame: if loc has m rows and stat has n rows, then loc[a,] should have m*n rows; similarly stat[b,] should have m*n rows as well. This works well, because distGeo (and other dist* functions in geosphere::) operates in one of two ways:
If either of the arguments has 1 row, then its distance is calculated against all rows of the other argument. Unfortunately, unless you know that loc or stat will always have just one row, this method doesn't apply here.
otherwise, both arguments must have the same number of rows, where the distance is calculated piecewise (1st row of 1st arg with 1st row of 2nd arg; 2nd row 1st arg with 2nd row 2nd arg; etc). This is the method we're prepared for.
in general, the anonymous function given to outer must deal with vectorized arguments on its own. For instance, if you needed distGeo to be called once for each pair (so it would be called m*n times), then you have to handle that yourself; outer will not do it for you. There are constructs in R that support this (e.g., mapply, Map) or that replace outer (Map, expand.grid, and do.call), but that's for another question.
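That said, a minimal sketch of the per-pair variant (my own illustration, not part of the vectorized answer above): build the index grid explicitly and call distGeo once per pair.
idx <- expand.grid(a = seq_len(nrow(loc)), b = seq_len(nrow(stat)))
d_pairwise <- mapply(function(a, b) geosphere::distGeo(loc[a, ], stat[b, ]),
                     idx$a, idx$b)
d_mat <- matrix(d_pairwise, nrow = nrow(loc))  # same shape as the outer() result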
I would like to identify the pairs of values that decrease the correlation between two vectors below a given threshold, and also to find the values that maximize the correlation, with the restriction of keeping at least three pairs of values. I am using R.
For example, suppose to have this dataframe:
df <- data.frame(val1 = c(1,2,4,8,10,20), val2 = c(2,4,8,16, 35, 14))
rownames(df) <- c('a','b','c','d','e','f')
I would like to remove the pairs of values that prevent me from obtaining a correlation greater than 0.6; in this case I would like to find that element (row) f decreases my correlation. Lastly, if it is easy, I would like to find that using elements (rows) a, b, c, d I can obtain the highest correlation.
Do you have any idea how I can do it?
Thank you in advance for your kind help.
Best
The restriction of at least 3 rows helps. There are two ways to approach the problem. Which one is best depends a bit on what you are trying to accomplish. We can start with all of the points and remove one at a time or we can start with 3 points and add one at a time. Your example has 6 points so it does not make that much difference. Here is code to find the best 3 point combination:
combos <- combn(6, 3)
corrs <- combn(6, 3, function(x) cor(df[x, ])[1, 2])
results <- cbind(t(combos), corrs)
head(results[order(corrs, decreasing=TRUE), ])
# corrs
# [1,] 1 2 3 1.0000000
# [2,] 1 2 4 1.0000000
# [3,] 2 3 4 1.0000000
# [4,] 1 3 4 1.0000000
# [5,] 1 2 5 0.9988739
# [6,] 1 2 6 0.9940219
We use the combn() function twice: once to get a matrix of the possible combinations of 3 items out of 6, and a second time to apply the correlation function to each combination.
Then we combine the results and list the best 6. There are four 3-point solutions with a correlation of +1 (any three of the first four points, which lie exactly on val2 = 2*val1). For the 5-point solutions we get the following:
combos <- combn(6, 5)
corrs <- combn(6, 5, function(x) cor(df[x, ])[1, 2])
results <- cbind(t(combos), corrs)
head(results[order(corrs, decreasing=TRUE), ])
# corrs
# [1,] 1 2 3 4 5 0.9381942
# [2,] 1 2 3 4 6 0.7514174
# [3,] 1 2 3 5 6 0.4908234
# [4,] 1 2 4 5 6 0.4639890
# [5,] 1 3 4 5 6 0.4062324
# [6,] 2 3 4 5 6 0.3591037
Now there is one clear solution, which excludes point 6 ("f"), with a correlation of +.938. In general, the size of the correlation will increase as points are removed, until it reaches +1 or -1. As the number of points increases, it will take more processing time to compute all of the alternatives. A shortcut would be to look at deviations from the first principal component:
df.pca <- prcomp(df)
abval <- abs(df.pca$x[, "PC2"])
df.pca$x[order(abval, decreasing=TRUE), "PC2"]
# f e a b c d
# -11.4055987 5.3497271 2.1507072 1.9191656 1.4560825 0.5299163
Point f (the 6th point) has the largest deviation from the first principal component, so removing it should improve the correlation. Likewise, removing e and f gives the best 4-point correlation. This is simpler, but generally you would want to remove a point, recompute the principal components with that point removed, and then identify the next point for removal, as sketched below.
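A rough sketch of that iterative idea (my own illustration; prune_by_pca and the threshold argument are made-up names, using the question's 0.6 cutoff as the default): drop the worst PC2 outlier, refit, and repeat until the correlation clears the threshold or only three rows remain.
prune_by_pca <- function(df, threshold = 0.6) {
  while (nrow(df) > 3 && cor(df$val1, df$val2) < threshold) {
    pc <- prcomp(df)
    worst <- which.max(abs(pc$x[, "PC2"]))  # row furthest from the first principal component
    df <- df[-worst, ]                      # remove it and refit on the next pass
  }
  df
}
prune_by_pca(df)
On the example data this should drop row f first and then stop, since the remaining five rows already correlate above 0.6.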
I have a problem finding a vectorized representation of a specific loop in R. My objective is to improve the performance of the loop, because it has to be run thousands of times in the algorithm.
I want to find the position of the lowest value in a particular array section defined by a vector 'Level' for each row.
Example:
Level = c(2,3)
Let the first row of array X be c(2, -1, 3, 0.5, 4).
Searching for the position of the lowest value in the range 1:Level[1] of the row (that is, (2, -1)), I get 2, because -1 < 2 and -1 is in the second position of the row. Then, searching for the position of the lowest value in the second range (Level[1]+1):(Level[1]+Level[2]) (that is, (3, 0.5, 4)), I get 4, because 0.5 < 3 < 4 and 0.5 is in the fourth position of the row.
I have to perform this over each row in the array.
My solution to the problem works as follows:
Level = c(2,3,3) #elements per section, here: 3 sections with 2,3 and 3 levels
rows = 10 #number of rows in array X
X = matrix(runif(rows*sum(Level),-5,5),rows,sum(Level)) #array with 10 rows and sum(Level) columns, here: 8
Position_min = matrix(0,rows,length(Level)) #array in which the position of minimum values for each section and row are stored
for (i in 1:rows) {
  for (j in 1:length(Level)) {   # length(Level) is the number of sections, here: 3
    coeff <- if (j == 1) 0 else 1
    start <- coeff * sum(Level[1:(j - 1)]) + 1   # first column of section j
    end   <- sum(Level[1:j])                     # last column of section j
    Position_min[i, j] <- (start - 1) +
      which(X[i, start:end] == min(X[i, start:end]))
  }
}
It works fine but I would prefer a solution with better performance. Any ideas?
This will remove the outer level of the loop:
Level1 <- c(0, cumsum(Level))
for (j in 1:(length(Level1) - 1)) {
  Position_min[, j] <- max.col(-X[, (Level1[j] + 1):Level1[j + 1]]) + Level1[j]
}
Here is a "fully vectorized" solution with no explicit loops:
findmins <- function(x, level) {
  series <- rep(1:length(level), level)              # section id for each column
  x <- split(x, factor(series))                      # split the row into its sections
  minsSplit <- as.numeric(sapply(x, which.min))      # within-section minimum positions
  minsSplit + c(0, cumsum(level[-length(level)]))    # shift back to full-row positions
}
Position_min_vectorized <- t(apply(X, 1, findmins, Level))
identical(Position_min, Position_min_vectorized)
## [1] TRUE
You can get better performance by making your matrix into a list, and then using parallel's mclapply():
X_list <- split(X, factor(1:nrow(X)))
do.call(rbind, parallel::mclapply(X_list, findmins, Level))
## [,1] [,2] [,3]
## 1 1 5 6
## 2 2 3 6
## 3 1 4 7
## 4 1 5 6
## 5 2 5 7
## 6 2 4 6
## 7 1 5 8
## 8 1 5 8
## 9 1 3 8
## 10 1 3 8
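If you want to compare the candidates at a more realistic size, a rough timing sketch (using base system.time; the relative results will depend on your dimensions and on whether mclapply can fork on your platform):
X_big <- matrix(runif(1e4 * sum(Level), -5, 5), 1e4, sum(Level))

# row-wise apply of findmins
system.time(t(apply(X_big, 1, findmins, Level)))

# max.col over each section (the loop-over-sections answer above)
system.time({
  Level1 <- c(0, cumsum(Level))
  sapply(seq_len(length(Level1) - 1),
         function(j) max.col(-X_big[, (Level1[j] + 1):Level1[j + 1]]) + Level1[j])
})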
I have a vector to append to, and here is the code, which is pretty slow because nrow is big.
All I want is to speed it up. I have tried c() and append(), and both seem not fast enough.
And I checked Efficiently adding or removing elements to a vector or list in R?
Here is the code:
compare <- vector()
for (i in 1:nrow(domin)) {
  for (j in 1:nrow(domin)) {
    a <- 0
    if ((domin[i, ]$GPA > domin[j, ]$GPA) & (domin[i, ]$SAT > domin[j, ]$SAT)) {
      a <- 1
    }
    compare <- c(compare, a)
  }
  print(i)
}
I found it hard to figure out the right index into compare if I pre-allocate it with
#compare<-rep(0,times=nrow(opt_predict)*nrow(opt_predict))
The information you want would be better placed in a matrix:
v1 <- 1:3
v2 <- c(1,2,2)
mat1 <- outer(v1,v1,`>`)
mat2 <- outer(v2,v2,`>`)
both <- mat1 & mat2
To see which positions the inequality holds for, use which:
which(both,arr.ind=TRUE)
# row col
# [1,] 2 1
# [2,] 3 1
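If you still need the flattened 0/1 vector that your loop builds (i over rows in the outer loop, j in the inner), it can be recovered from the logical matrix; a sketch assuming domin has the GPA and SAT columns used in your loop:
mat1 <- outer(domin$GPA, domin$GPA, `>`)
mat2 <- outer(domin$SAT, domin$SAT, `>`)
both <- mat1 & mat2
compare <- as.integer(t(both))  # t() because your loop varies j fastest within each i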
Comments:
This answer should be a lot faster than your loop. However, you are really just sorting two vectors, so there is probably a faster way to do this than taking the exhaustive set of inequalities...
In your case, there is only a partial ordering (since, for a given i and j, it is possible that neither one is strictly greater than the other in both dimensions). If you were satisfied with sorting first on v1 and then on v2, you could use the data.table package to easily get a full ordering:
set.seed(1)
v1 <- sample.int(10,replace=TRUE)
v2 <- sample.int(10,replace=TRUE)
require(data.table)
DT <- data.table(v1,v2)
setkey(DT)
DT[,rank:=.GRP,by='v1,v2']
which gives
v1 v2 rank
1: 1 8 1
2: 3 3 2
3: 3 8 3
4: 4 2 4
5: 6 7 5
6: 7 4 6
7: 7 10 7
8: 9 5 8
9: 10 4 9
10: 10 8 10
It depends on what you were planning to do next.
I am implementing a statistical method from an academic paper (see the end for a citation) in R. I think there's a way to do one of the steps without using a loop, but I'm having trouble deciding how to attack it.
This method operates on a data frame with three variables: x, n, and p. It can only operate if p[i] <= p[i+1] for all i. If a pair of points violates that, they are smoothed out by setting both p[i] and p[i+1] equal to their weighted average
(n[i]*p[i]+n[i+1]*p[i+1])/(n[i]+n[i+1])
This smoothing is iterated until the p_i are a nondecreasing sequence.
The problem with this smoothing is that (a) loops are bad form in R, and (b) if there are multiple points in a row such that p_i > p_(i+1) >= p_(i+2), the method can fail to terminate or take a very long time to converge. For instance, if a sequence like this occurs:
x n p
2 10 0.6
5 10 0.5
10 10 0.5
the smooth will set the first two values of p to 0.55, then the second two to 0.525, then the first two to 0.5375, and so on, looping forever (or, if I'm lucky, reaching the limit of significance after a bajillion iterations). There should be a mathematically equivalent but more efficient way to do this by identifying adjacent decreasing data points and averaging them as a group, but I'm not sure how to approach that in R.
If you need more background, the paper in question is Martin A. Hamilton, Rosemarie C. Russo, Robert V. Thurston.
"Trimmed Spearman-Karber method for estimating median lethal concentrations in toxicity bioassays." Environ. Sci. Technol., 1977, 11 (7), pp 714–719. I'm referring to the "first step" section on page 716.
As I understand the algorithm, you need to locate positions where p is decreasing and, starting from each of these, find out for how long the (cumulative) weighted average keeps decreasing, so that p can be updated block by block. I do not see how this can be done without a loop of some sort. Some solutions might hide the loop under lapply or an equivalent, but IMHO this is one of those algorithms that are complex enough that I prefer a good old loop. You may lose a bit of efficiency, but the code reads nicely. My attempt, using a while loop:
smooth.p <- function(df) {
while (any(diff(df$p) < 0)) {
# where does it start decreasing
idx <- which(diff(df$p) < 0)[1]
# from there, compute the cumulative weighted average
sub <- df[idx:nrow(df), ]
cuml.wavg <- cumsum(sub$n * sub$p) / cumsum(sub$n)
# and see for how long it is decreasing
bad.streak.len <- rle(diff(cuml.wavg) <= 0)$lengths[1]
# these are the indices for the block to average
block.idx <- seq(from = idx, length = bad.streak.len + 1)
# compute and apply the average p
df$p[block.idx] <- sum(df$p[block.idx] * df$n[block.idx]) /
sum(df$n[block.idx])
}
return(df)
}
Here is some data, including a rough patch like you suggested:
df <- data.frame(x = 1:9,
n = rep(1, 9),
p = c(0.1, 0.3, 0.2, 0.6, 0.5, 0.5, 0.8, 1.0, 0.9))
df
# x n p
# 1 1 1 0.1
# 2 2 1 0.3
# 3 3 1 0.2
# 4 4 1 0.6
# 5 5 1 0.5
# 6 6 1 0.5
# 7 7 1 0.8
# 8 8 1 1.0
# 9 9 1 0.9
And the output:
smooth.p(df)
# x n p
# 1 1 1 0.1000000
# 2 2 1 0.2500000
# 3 3 1 0.2500000
# 4 4 1 0.5333333
# 5 5 1 0.5333333
# 6 6 1 0.5333333
# 7 7 1 0.8000000
# 8 8 1 0.9500000
# 9 9 1 0.9500000
Following Glen_b above, what's described in Hamilton's paper is equivalent to gpava from the CRAN package isotone.
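For reference, a minimal sketch of that equivalence (assuming the isotone package is installed; I believe the fitted values come back in the x component of the returned object, but check ?gpava):
library(isotone)
fit <- gpava(df$x, df$p, weights = df$n)  # weighted monotone (isotonic) regression of p on x
fit$x                                     # pooled, nondecreasing p values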