Increasing correlation values - R

I would like to find the pairs of values that decrease the correlation between two vectors, using a threshold, and also the pairs that maximize the correlation, with the restriction that at least three pairs of values are kept. I am using R.
For example, suppose we have this data frame:
df <- data.frame(val1 = c(1,2,4,8,10,20), val2 = c(2,4,8,16, 35, 14))
rownames(df) <- c('a','b','c','d','e','f')
I would like to remove the pairs of values that prevent me from obtaining a correlation greater than 0.6, so in this case I would like to find that the f element (row) decreases my correlation. Lastly, if it is easy, I would like to find that using the a, b, c, d elements (rows) gives the highest correlation.
Do you have any idea how I can do it?
Thank you in advance for your kind help.
Best

The restriction of at least 3 rows helps. There are two ways to approach the problem, and which one is best depends a bit on what you are trying to accomplish: we can start with all of the points and remove one at a time, or we can start with 3 points and add one at a time. Your example has 6 points, so it does not make that much difference. Here is code to find the best 3-point combinations:
combos <- combn(6, 3)
corrs <- combn(6, 3, function(x) cor(df[x, ])[1, 2])
results <- cbind(t(combos), corrs)
head(results[order(corrs, decreasing=TRUE), ])
# corrs
# [1,] 1 2 3 1.0000000
# [2,] 1 2 4 1.0000000
# [3,] 2 3 4 1.0000000
# [4,] 1 3 4 1.0000000
# [5,] 1 2 5 0.9988739
# [6,] 1 2 6 0.9940219
We use the combn() function twice, once to get a matrix of the possible combinations of 3 items out of 6, and a second time to apply the correlation function to each combination.
Then we combine the results and list the best 6. There are four 3-point solutions with a correlation of +1. For the 5-point solutions we get the following:
combos <- combn(6, 5)
corrs <- combn(6, 5, function(x) cor(df[x, ])[1, 2])
results <- cbind(t(combos), corrs)
head(results[order(corrs, decreasing=TRUE), ])
# corrs
# [1,] 1 2 3 4 5 0.9381942
# [2,] 1 2 3 4 6 0.7514174
# [3,] 1 2 3 5 6 0.4908234
# [4,] 1 2 4 5 6 0.4639890
# [5,] 1 3 4 5 6 0.4062324
# [6,] 2 3 4 5 6 0.3591037
Now there is one clear solution, which excludes point 6 ("f") and has a correlation of +0.938. In general, the correlation will increase as points are removed, until it reaches +1 or -1. As the number of points increases, it takes more processing time to compute all of the alternatives. A shortcut is to look at deviations from the first principal component:
df.pca <- prcomp(df)
abval <- abs(df.pca$x[, "PC2"])
df.pca$x[order(abval, decreasing=TRUE), "PC2"]
# f e a b c d
# -11.4055987 5.3497271 2.1507072 1.9191656 1.4560825 0.5299163
Point f (the 6th point) has the largest deviation from the first principal component, so removing it should improve the correlation. Likewise, removing e and f gives the best 4-point correlation. This is simpler, but generally you would want to remove a point, recompute the principal components with that point removed, and then identify the next point for removal.
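If it helps, here is a rough sketch of that iterative idea (my own illustration, not tested beyond this example): drop the row with the largest absolute PC2 score, recompute the PCA, and repeat until the correlation exceeds your 0.6 threshold or only three rows remain:
df <- data.frame(val1 = c(1, 2, 4, 8, 10, 20),
                 val2 = c(2, 4, 8, 16, 35, 14),
                 row.names = c('a', 'b', 'c', 'd', 'e', 'f'))
keep <- df
while (nrow(keep) > 3 && cor(keep)[1, 2] <= 0.6) {
  pca <- prcomp(keep)
  worst <- which.max(abs(pca$x[, "PC2"]))  # largest deviation from PC1
  keep <- keep[-worst, ]                   # drop that row and recompute
}
keep             # rows retained (here a-e, dropping f)
cor(keep)[1, 2]  # resulting correlation (about 0.938 for this example)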

Related

Brute force computation of agent-item assignments for auction algorithms

I am working with various auction algorithms to assess the assignment of n items to n agents through a bidding mechanism, such that each agent is assigned exactly one item and each item is assigned to exactly one agent. I would like to assess the performance of the algorithms I am testing by comparing them with a brute-force approach. The comparison is via the sum of the assignment values, which I am trying to maximize.
set.seed(1)
#Assume:
n <- 3
agents <- 1:3 # agent IDs
items <- 1:3 # item IDs
# each agent has a preference score for each item
# create agent x item matrix w/ cells containing pref score
(m <- matrix(data = round(runif(9, 0, 1),3), nrow = n, ncol = n))
## [,1] [,2] [,3]
## [1,] 0.266 0.908 0.945
## [2,] 0.372 0.202 0.661
## [3,] 0.573 0.898 0.629
# Given these sample data, here is one possible assignment
s <- data.frame(agent = agents, item = NA, score = NA)
# assign item & corresponding score to agent
s[1,"item"] <- 1; s[1,"score"] <- m[1,1]
s[2,"item"] <- 2; s[2,"score"] <- m[2,2]
s[3,"item"] <- 1; s[3,"score"] <- m[3,3]
s
## agent item score
## 1 1 1 0.266
## 2 2 2 0.202
## 3 3 3 0.629
# The value/score of this particular assignment s
(total_score <- sum(s$score))
## [1] 1.097
What I would like to do, given my agents and items vectors, is create a data structure that holds every possible combination of agent-item assignments. By my calculations there should be factorial(n) possible combinations; thus, in the example where n <- 3, the final structure should have 6 rows.
Here is a symbolic representation of what I want. Each row corresponds to a specific full assignment, where agents are columns and their corresponding items are cell values:
# a1 a2 a3 <- cols are agents
# ______________
# s1 | 1 2 3 <- corresponds to assignment s above
# s2 | 1 3 2
# s3 | 2 1 3
# s4 | 2 3 1
# s5 | 3 2 1
# s6 | 3 1 2
I am unclear on the best way to achieve this generically for any positive value of n. I have tried expand.grid(), but that doesn't seem to fit what I want to achieve. Is there a function I can use for this, or does anybody have suggestions for an algorithm I can implement to this end?
expand.grid() won't work here because it creates all possible combinations of agents and items, so it will, for example, spawn a combination where all agents get the first item.
I suggest using permutations instead. It is enough to permute the items, leaving the agents in the same spots. I am using the combinat package to generate permutations:
library(combinat)
permn(1:3)
[[1]]
[1] 1 2 3
[[2]]
[1] 1 3 2
[[3]]
[1] 3 1 2
[[4]]
[1] 3 2 1
[[5]]
[1] 2 3 1
[[6]]
[1] 2 1 3
Each element of the list corresponds to one possible permutation of the items. So 2 1 3 means that the first agent gets the second item, the second agent gets the first item, and the third agent gets the third item.
To find the corresponding scores, we can subset our score matrix with column-permuted copies of a boolean identity matrix:
# create the scores matrix
n <- 3
m <- matrix(data = round(runif(9, 0, 1), 3), nrow = n, ncol = n)
# create a boolean identity matrix
E <- matrix(as.logical(diag(1, n, n)), nrow = n, ncol = n)
m[E[, c(1, 3, 2)]]  # this shows the scores of the 1 3 2 item combination
#[1] 0.472 0.039 0.223
Finally, we compute individual scores and total score for all permutations and store the result in a neat data.table:
library(data.table)
dt <- data.table(items = permn(1:n))
dt[, scores := lapply(items, function(x) m[E[, x]])]
dt[, totalScore := lapply(scores, sum)]
dt
# items scores totalScore
#1: 1,2,3 0.472,0.239,0.517 1.228
#2: 1,3,2 0.472,0.039,0.223 0.734
#3: 3,1,2 0.658,0.064,0.223 0.945
#4: 3,2,1 0.658,0.239,0.994 1.891
#5: 2,3,1 0.326,0.039,0.994 1.359
#6: 2,1,3 0.326,0.064,0.517 0.907
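As a small follow-up (my addition, not part of the original answer), the score-maximizing assignment can then be pulled out with which.max(); unlist() is needed because totalScore was created as a list column:
dt[which.max(unlist(totalScore))]
#   items            scores totalScore
#1: 3,2,1 0.658,0.239,0.994      1.891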

Nearest Neighbors from KKNN package in R giving garbage indices values when the entire dataset is used

I am using "kknn" package in R to find all of the nearest neighbors for every row in the data set. For some odd reasons, the last row in the test dataset is always ignored. Below, is the R code and the output obtained.
X1 <- c(0.6439659, 0.1923593, 0.3905551, 0.7728847, 0.7602632)
X2 <- c(0.9147394, 0.6181713, 0.8515923, 0.8459367, 0.9296278)
Class <- c(1, 1, 0, 0, 0)
Data <- data.frame(X1,X2,Class)
Data$Class <- as.factor(Data$Class)
library("kknn")
### Here, both the training and the testing data set are the object Data
Neighbors.KNN <- kknn(Data$Class ~ ., Data, Data, k = 5, distance = 2, kernel = "gaussian")
## Output
## Column 5 in the output below is filled with garbage values, and the first value in the last row is 4 when it should be 5.
Neighbors.KNN$C
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 3 2 3245945
[2,] 2 3 4 1 3245945
[3,] 3 1 4 2 3245945
[4,] 4 1 3 2 3245945
[5,] 1 4 3 2 3245945
Could someone let me know if I am doing something wrong or if that is a bug in the package?
The current implementation (silently) assumes that k is smaller than n, the number of rows. In general k << n, and in that case this is no problem. The (k+1)th neighbor is used to scale the distances. I should have mentioned this in the documentation.
Regards,
Klaus
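A possible workaround (my own sketch, not from Klaus's reply): since the (k+1)th neighbor is needed internally, simply request fewer neighbors than there are rows, e.g. k = nrow(Data) - 1:
## assumes Data and library("kknn") from above
Neighbors.KNN <- kknn(Class ~ ., Data, Data, k = nrow(Data) - 1,
                      distance = 2, kernel = "gaussian")
Neighbors.KNN$C  # neighbor indices; the garbage column should be gone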

R apply function on symmetrical matrix

I'd like to utilise one of the apply set of functions to do some calculations.
First off, I have two matrices, mat1 and mat2:
mat1:
a b c
1 NA NA NA
2 1 1 1
3 1 1 NA
4 NA 1 NA
mat2:
a b c
a 1.0 0.2 0.3
b -0.7 1.0 0.8
c -0.1 -0.3 1.0
mat2 is calculated from mat1 using a function which is irrelevant here. Essentially, I'd like to apply a weighting, derived from mat1, that penalizes the values in mat2 where there is less data (and they are therefore less accurate).
So to achieve this, I want, for a coordinate x,y in mat2, to calculate the pairwise completeness of the two corresponding columns of mat1.
For example: mat2["a","b"] or mat2["b","a"] (which should be the same) would become the original value * (the number of rows of mat1 complete in both a and b / the total number of rows of mat1).
So really the question is: how can I apply a function over every pair of columns of a matrix (a double loop) and store the result in a weight matrix to multiply against another matrix?
I can already compare two adjacent columns using rollapply from the zoo package like so:
rowSums(rollapply(is.na(t(mat1)), 2, function(x) !any(x)))
I get:
[1] 2 1
That is, comparing a and b, 2 rows are complete, and comparing b and c, 1 row is complete. So how can I compare a to b, a to c, and b to c?
Thanks.
Looking at your question again, it appears that you want a matrix X with the same dimensions as mat2, where X[i,j] is the number of complete cases in mat1[,c(i,j)]. Then mat2 will be multiplied by X.
The number of complete cases is given by sum(complete.cases(mat1[,c(i,j)])). I want to use this in outer(), which requires a vectorized function, so it is passed through Vectorize():
outer(seq(nrow(mat2)), seq(ncol(mat2)),
      Vectorize(function(x, y) sum(complete.cases(mat1[, c(x, y)]))))
## [,1] [,2] [,3]
## [1,] 2 2 1
## [2,] 2 3 1
## [3,] 1 1 1
This is your desired symmetric matrix.
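To close the loop on the weighting described in the question (my own follow-up, reconstructing mat1 and mat2 from the example), the counts can then be scaled by the total number of rows and multiplied into mat2:
mat1 <- matrix(c(NA, 1, 1, NA,   NA, 1, 1, 1,   NA, 1, NA, NA),
               nrow = 4, dimnames = list(1:4, c("a", "b", "c")))
mat2 <- matrix(c(1.0, -0.7, -0.1,   0.2, 1.0, -0.3,   0.3, 0.8, 1.0),
               nrow = 3, dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
X <- outer(seq(ncol(mat1)), seq(ncol(mat1)),
           Vectorize(function(i, j) sum(complete.cases(mat1[, c(i, j)]))))
mat2 * (X / nrow(mat1))  # pairs with fewer complete rows are penalized more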

Interpolating data in R

Let's suppose I have a 3 by 5 matrix in R:
4 5 5 6 8
3 4 4 5 6
2 3 3 3 4
I would like to interpolate between these values to create a matrix of size 15 by 25. I would also like to specify whether the interpolation is linear, Gaussian, etc. How can I do this?
For example, if I have a small matrix like this
2 3
1 3
and I want it to become 3 by 3, then it might look like
2 2.5 3
1.5 2.2 3
1 2 3
x <- matrix(c(2, 1, 3, 3), nrow = 2)      # the small example matrix
nr <- 3; nc <- 3                          # target dimensions
app <- function(v, n) approx(v, n = n)$y  # or whatever interpolation you want
apply(t(apply(x, 1, function(row) app(row, nc))), 2, function(col) app(col, nr))
[,1] [,2] [,3]
[1,] 2.0 2.50 3
[2,] 1.5 2.25 3
[3,] 1.0 2.00 3
A long time ago I wrote a similar toy, except I never got around to defining the interpolation function. There's also raster::disaggregate.
zexpand <- function(inarray, fact = 2, interp = FALSE, ...) {
  # do same analysis of fact to allow one or two values, fact >= 1 required, etc.
  fact <- as.integer(round(fact))
  switch(as.character(length(fact)),
         '1' = xfact <- yfact <- fact,
         '2' = {xfact <- fact[1]; yfact <- fact[2]},
         {xfact <- fact[1]; yfact <- fact[2]
          warning('fact is too long. First two values used.')})
  if (xfact < 1) stop('fact[1] must be > 0')
  if (yfact < 1) stop('fact[2] must be > 0')
  # column expansion
  bigtmp <- matrix(rep(t(inarray), each = xfact),
                   nrow(inarray), ncol(inarray) * xfact, byrow = TRUE)
  # row expansion
  bigx <- t(matrix(rep(bigtmp, each = yfact),
                   ncol(bigtmp), nrow(bigtmp) * yfact, byrow = TRUE))
  # the interpolation would go here. Or use interp.loess on the output (won't
  # handle complex data). Also, look at fields::Tps which probably does
  # a much better job anyway. Just do separately on Re and Im data.
  return(invisible(bigx))
}
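For what it's worth, a quick usage sketch on the small 2 x 2 example from the question (block expansion only, since the interpolation step is left unimplemented):
small <- matrix(c(2, 1, 3, 3), nrow = 2)  # the 2 x 2 example above
big <- zexpand(small, fact = 2)           # returned invisibly
big
#      [,1] [,2] [,3] [,4]
# [1,]    2    2    3    3
# [2,]    2    2    3    3
# [3,]    1    1    3    3
# [4,]    1    1    3    3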

Smoothing a sequence without using a loop in R

I am implementing a statistical method from an academic paper (see the end for a citation) in R. I think there's a way to do one of the steps without using a loop, but I'm having trouble deciding how to attack it.
This method operates on a data frame with three variables: x, n, and p. It can only operate if p[i] <= p[i+1] for all i. If a pair of points violates that, they are smoothed out by setting both p[i] and p[i+1] equal to their weighted average
(n[i]*p[i]+n[i+1]*p[i+1])/(n[i]+n[i+1])
This smoothing is iterated until the p_i form a nondecreasing sequence.
The problem with this smoothing is that (a) loops are bad form in R, and (b) if there are multiple points in a row such that p_i > p_(i+1) >= p_(i+2), the method can fail to terminate or take a very long time to converge. For instance, if a sequence like this occurs:
x n p
2 10 0.6
5 10 0.5
10 10 0.5
the smooth will set the first two values of p to 0.55, then the second two to 0.525, then the first two to 0.5375, and so on, looping forever (or, if I'm lucky, reaching the limit of numerical precision after a bajillion iterations). There should be a mathematically equivalent but more efficient way to do this by identifying adjacent decreasing data points and averaging them as a group, but I'm not sure how to approach that in R.
If you need more background, the paper in question is Martin A. Hamilton, Rosemarie C. Russo, Robert V. Thurston.
"Trimmed Spearman-Karber method for estimating median lethal concentrations in toxicity bioassays." Environ. Sci. Technol., 1977, 11 (7), pp 714–719. I'm referring to the "first step" section on page 716.
As I understand the algorithm, you need to locate positions where p is decreasing and, starting from each of these, find out for how long the (cumulative) weighted average keeps decreasing, so that p can be updated block by block. I do not see how this can be done without a loop of some sort. Some solutions might hide the loop under lapply or an equivalent, but IMHO this is one of those algorithms that are complex enough that I prefer a good old loop. You may lose a bit of efficiency, but the code reads nicely. Here is my attempt, using a while loop:
smooth.p <- function(df) {
  while (any(diff(df$p) < 0)) {
    # where does it start decreasing
    idx <- which(diff(df$p) < 0)[1]
    # from there, compute the cumulative weighted average
    sub <- df[idx:nrow(df), ]
    cuml.wavg <- cumsum(sub$n * sub$p) / cumsum(sub$n)
    # and see for how long it is decreasing
    bad.streak.len <- rle(diff(cuml.wavg) <= 0)$lengths[1]
    # these are the indices for the block to average
    block.idx <- seq(from = idx, length = bad.streak.len + 1)
    # compute and apply the average p
    df$p[block.idx] <- sum(df$p[block.idx] * df$n[block.idx]) /
      sum(df$n[block.idx])
  }
  return(df)
}
Here is some data, including a rough patch like you suggested:
df <- data.frame(x = 1:9,
                 n = rep(1, 9),
                 p = c(0.1, 0.3, 0.2, 0.6, 0.5, 0.5, 0.8, 1.0, 0.9))
df
# x n p
# 1 1 1 0.1
# 2 2 1 0.3
# 3 3 1 0.2
# 4 4 1 0.6
# 5 5 1 0.5
# 6 6 1 0.5
# 7 7 1 0.8
# 8 8 1 1.0
# 9 9 1 0.9
And the output:
smooth.p(df)
# x n p
# 1 1 1 0.1000000
# 2 2 1 0.2500000
# 3 3 1 0.2500000
# 4 4 1 0.5333333
# 5 5 1 0.5333333
# 6 6 1 0.5333333
# 7 7 1 0.8000000
# 8 8 1 0.9500000
# 9 9 1 0.9500000
Following Glen_b above, what's described in Hamilton's paper is equivalent to gpava from the CRAN package isotone.
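For reference, a minimal sketch of that equivalence (assuming the isotone package is installed; I believe the fitted values are returned in the x component of the result):
library(isotone)
fit <- gpava(z = df$x, y = df$p, weights = df$n)
fit$x  # monotone fit; should match smooth.p(df)$p above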
