I have the below dataset, where I am trying to do a rolling 3 days correlation across x,y,z,a. So the code should do rolling correlations of xy,xz,xa, yx, yz,ya and so on. Also, as you can see below, the data for y and a is incomplete, but I would wish to do rolling correlations of them starting from the date where they first had values (i.e. id 3 and id 4).
How should I accomplish this? Don't know where to start...
set.seed(42)
n <- 10
dat <- data.frame(id=1:n,
date=seq.Date(as.Date("2020-12-22"), as.Date("2020-12-31"), "day"),
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
a=rnorm(n))
dat$y[1:2] <- NA
dat$a[1:3] <- NA
I am able to find this set of code from stack, but it only helps in finding the answer for 1st column and not all the columns
rollapplyr(x, 5, function(x) cor(x[, 1], x[, -1]), by.column = FALSE)
Create a data frame with only the columns wanted and then use rollapplyr with cor. cor takes a use= argument that specifies how missing values are to be handled. See ?cor for the values it can take since you may or may not wish to use the value we used below.
The result r is a matrix whose i-th row describes the correlation matrix of the 5 dat2 rows ending in and including row i. That is, matrix(r[i, ], 4, 4) is the correlation matrix of dat2[i-(4:0), ].
We can also create ar which is a 3d array which is such that ar[i,,] is the correlation matrix of the 5 rows of dat2 ending in and including row i.
That is these are equal for each i in 5, ..., nrow(dat2). (The first 4 rows of r are all NA since there do not exist 5 rows leading to those rows.)
1. cor(dat2[i-(4:0), ], use = "pairwise")
2. matrix(r[i, ], 4, 4)
3. ar[i,,]
We run checks for these equivalences for i=5 below.
library(zoo)
w <- 5
dat2 <- dat[c("x", "y", "z", "a")]
nr <- nrow(dat2)
nc <- ncol(dat2)
r <- rollapplyr(dat2, w, cor, use = "pairwise", by.column = FALSE, fill = NA)
colnames(r) <- paste(names(dat2)[c(row(diag(nc)))],
names(dat2)[c(col(diag(nc)))], sep = ".")
ar <- array(r, c(nr, nc, nc),
dimnames = list(NULL, names(dat2), names(dat2)))
# run some checks
cor5 <- cor(dat2[1:w, ], use = "pairwise") # cor of 1st w rows
# same except for names
all.equal(unname(cor5), matrix(r[w, ], nc))
## [1] TRUE
all.equal(cor5, ar[w,,])
## [1] TRUE
The above shows a matrix whose rows are strung out correlation matrices and a 3d array whose slices are correlation matrices. Another possibility for output is to create a list of correlation matrices.
lapply(1:nr, function(i) {
if (i >= w) cor(dat2[i-((w-1):0), ], use = "pairwise")
})
combn produces all the combinations.
cols <- c("x", "y", "z", "a")
combn(cols, 2)
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "x" "x" "x" "y" "y" "z"
# [2,] "y" "z" "a" "z" "a" "a"
combn has a function argument where you first na.omit all rows with NA's. Then subset with mapply over incrementing sequences 1:3 and calculate correlations, until nrow is reached.
w <- 3 ## size of the rolling window
combn(dat[cols], 2, function(x) {
X <- na.omit(x)
n <- nrow(X)
mapply(function(y, z) cor(X[y + z, 1], X[y + z, 2]), list(1:w), 0:(n - w))
}, simplify=FALSE)
# [[1]]
# [1] 0.5307784 -0.9874843 -0.8364802 0.2407730 0.3655328 -0.4458231
#
# [[2]]
# [1] 0.8121466 0.9652715 0.3304100 0.8278965 -0.1425097 0.5832558 0.9959705
# [8] 0.8696023
#
# [[3]]
# [1] 0.6733985 0.2194488 0.5593983 -0.6589249 -0.9291184
#
# [[4]]
# [1] 0.97528684 -0.90599558 -0.42319742 0.92882443 0.28058418 0.05427966
#
# [[5]]
# [1] -0.7815678 -0.7182037 -0.6698260 0.4592962 0.7452225
#
# [[6]]
# [1] 0.9721521 0.9343926 -0.3470329 -0.7237291 -0.6253825
Related
Using the r package "effsize" I am trying to calculate cohens d between all pairs of groups in my data outputting all the pairwise d estimates as a matrix. I have provided some test data to illustrate this. I would want a matrix of d estimates for all pairs of groups 1, 2, and 3.
I am struggling to find where to start with this. I know that it could be done using loops but since my real data contains 1000 groups each with 6000 data points I think this would be slow.
library("effsize")
test <- data.frame(
score=c(2,3,42,1,2,3,4,5,5,6,8,2),
group=c(1,1,1,1,2,2,2,2,3,3,3,3)
)
This would be similar functionality to what is provided for wilcox rank sum using pairwise.wilcox.test().
All you have to do is to note that function combn outputs the combinations of n elements taken k at a time and can also apply a function to each resulting combination. In this case the question asks for combinations of 2 groups at a time and the function fun is applied to each one.
fun <- function(x) {
cohen.d(x[[1]]$score, x[[2]]$score)
}
sp <- split(test, test$group)
cmb <- combn(sp, 2, fun)
cmb[, 1]
#[[1]]
#[1] "Cohen's d"
#
#[[2]]
#[1] "d"
#
#[[3]]
#[1] 0.5992954
#
#[[4]]
# lower upper
#-1.169345 2.367936
#
#[[5]]
#[1] 0.95
#
#[[6]]
#[1] medium
#Levels: negligible < small < medium < large
The code above can be written as a function that does all the work and returns a matrix.
cohen.d.pairwise.test <- function(DF, scoreCol, groupCol){
fun <- function(x) {
eff <- cohen.d(x[[1]][[scoreCol]], x[[2]][[scoreCol]])
c(eff[["estimate"]],
eff[["conf.int"]][1],
eff[["conf.int"]][2],
eff[["conf.level"]])
}
sp <- split(DF, DF[[groupCol]])
cmb <- combn(sp, 2, fun)
rownames(cmb) <- c("estimate", "lower", "upper", "conf.level")
t(cmb)
}
cohen.d.pairwise.test(test, scoreCol = "score", groupCol = "group")
# estimate lower upper conf.level
#[1,] 0.5992954 -1.169345 2.3679357 0.95
#[2,] 0.4732232 -1.281054 2.2275008 0.95
#[3,] -0.8795932 -2.691556 0.9323698 0.95
Let's assume four data frames, each with 3 vectors, e.g.
setA <- data.frame(
a1 = c(6,5,2,4,5,3,4,4,5,3),
a2 = c(4,3,1,4,5,1,1,6,3,2),
a3 = c(5,4,5,6,4,6,5,5,3,3)
)
setB <- data.frame(
b1 = c(5,3,4,3,3,6,4,4,3,5),
b2 = c(4,3,1,3,5,2,5,2,5,6),
b3 = c(6,5,4,3,2,6,4,3,4,6)
)
setC <- data.frame(
c1 = c(4,4,5,5,6,4,2,2,4,6),
c2 = c(3,3,4,4,2,1,2,3,5,4),
c3 = c(4,5,4,3,5,5,3,5,5,6)
)
setD <- data.frame(
d1 = c(5,5,4,4,3,5,3,5,5,4),
d2 = c(4,4,3,3,4,3,4,3,4,5),
d3 = c(6,5,5,3,3,4,2,5,5,4)
)
I'm trying to find n number of vectors in each data frame, that have the highest correlation among each other. For this simple example, let's say want to find the n = 1 vectors in each of the k = 4 data frames, that show the overall strongest, positive correlation cor().
I'm not interested in the correlation of vectors within a data frame, but the correlation between data frames, since i wish to pick 1 variable from each set.
Intuitively, I would sum all the correlation coefficients for each combination, i.e.:
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setC$d1)))
sum(cor(cbind(setA$a1, setB$b2, setC$c1, setC$d1)))
sum(cor(cbind(setA$a1, setB$b1, setC$c2, setC$d1)))
... # and so on...
...but this seems like brute-forcing a solution that might be solvable more elegantly, with some kind of clustering-technique?
Anyhow, I was hoping to find a dynamic solution like function(n = 1, ...) where (... for data frames) which would return a list of the highest correlating vector names.
Base on your example I would not go with a really complicated algorithm unless your actual data is huge. This is a simple approach I think gets what you want.
So base on your 4 data frames a creates the list_df and then in the function I just generate all the possible combinations of variables an calculate their correlation. At the end I select the n combinations with highest correlation.
list_df = list(setA,setB,setC,setD)
CombMaxCor = function(n = 1,list_df){
column_names = lapply(list_df,colnames)
mat_comb = expand.grid(column_names)
mat_total = do.call(cbind,list_df)
vec_cor = rep(NA,nrow(mat_comb))
for(i in 1:nrow(mat_comb)){
vec_cor[i] = sum(cor(mat_total[,as.character(unlist(mat_comb[i,]))]))
}
pos_max_temp = rev(sort(vec_cor))[1:n]
pos_max = vec_cor%in%pos_max_temp
comb_max_cor = mat_comb[pos_max,]
return(comb_max_cor)
}
You could use comb function:
fun = function(x){
nm = paste0(names(x),collapse="")
if(!grepl("(.)\\d.*\\1",nm,perl = T))
setNames(sum(cor(x)),nm)
}
unlist(combn(a,4,fun,simplify = FALSE))[1:3]#Only printed the first 3
a1b1c1d1 a1b1c1d2 a1b1c1d3
3.246442 4.097532 3.566949
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
[1] 3.246442
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d2)))
[1] 4.097532
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d3)))
[1] 3.566949
Here is a function we can use to get n non-repeating columns from each data frame to get the max total correlation:
func <- function(n, ...){
list.df <- list(...)
n.df <- length(list.df)
# 1) First get the correlations
get.two.df.cors <- function(df1, df2) apply(df1, 2,
function(x) apply(df2, 2, function(y) cor(x,y))
)
cor.combns <- lapply(list.df, function(x)
lapply(list.df, function(y) get.two.df.cors(x,y))
)
# 2) Define function to help with aggregating the correlations.
# We will call them for different combinations of selected columns from each df later
# cmbns: given a df corresponding columns to be selected each data frame
# (i-th row corresponds to i-th df),
# return the "total correlation"
get.cmbn.sum <- function(cmbns, cor.combns){
# a helper matrix to help aggregation
# each row represents which two data frames we want to get the correlation sums
df.df <- t(combn(seq(n.df), 2, c))
# convert to list of selections for each df
cmbns <- split(cmbns, seq(nrow(cmbns)))
sums <- apply(df.df, 1,
function(dfs) sum(
cor.combns[[dfs[1]]][[dfs[2]]][cmbns[[dfs[2]]], cmbns[[dfs[1]]]]
)
)
# sum of the sums give the "total correlation"
sum(sums)
}
# 3) Now perform the aggragation
# get the methods of choosing n columns from each of the k data frames
if (n==1) {
cmbns.each.df <- lapply(list.df, function(df) matrix(seq(ncol(df)), ncol=1))
} else {
cmbns.each.df <- lapply(list.df, function(df) t(combn(seq(ncol(df)), n, c)))
}
# get all unique selection methods
unique.selections <- Reduce(function(all.dfs, new.df){
all.dfs.lst <- rep(list(all.dfs), nrow(new.df))
all.new.rows <- lapply(seq(nrow(new.df)), function(x) new.df[x,,drop=F])
for(i in seq(nrow(new.df))){
for(j in seq(length(all.dfs.lst[[i]]))){
all.dfs.lst[[i]][[j]] <- rbind(all.dfs.lst[[i]][[j]], all.new.rows[[i]])
}
}
do.call(c, all.dfs.lst)
}, c(list(list(matrix(numeric(0), nrow=0, ncol=n))), cmbns.each.df))
# for each unique selection method, calculate the total correlation
result <- sapply(unique.selections, get.cmbn.sum, cor.combns=cor.combns)
return( unique.selections[[which.max(result)]] )
}
And now we have:
# n = 1
func(1, setA, setB, setC, setD)
# [,1]
# [1,] 1
# [2,] 2
# [3,] 3
# [4,] 2
# n = 2
func(2, setA, setB, setC, setD)
# [,1] [,2]
# [1,] 1 2
# [2,] 2 3
# [3,] 2 3
# [4,] 2 3
Suppose I have a matrix,
mat <- matrix((1:9)^2, 3, 3)
I can slice the matrix like so
> mat[2:3, 2]
[1] 25 36
How does one store the subscript as a variable? That is, what should my_sub be, such that
> mat[my_sub]
[1] 25 36
A list gets "invalid subscript type" error. A vector will lose the multidimensionality. Seems like such a basic operation to not have a primitive type that fits this usage.
I know I can access the matrix via vector addressing, which means converting from [2:3, 2] to c(5, 6), but that mapping presumes knowledge of matrix shape. What if I simply want [2:3, 2] for any matrix shape (assuming it is at least those dimensions)?
Here are some alternatives. They both generalize to higher dimenional arrays.
1) matrix subscripting If the indexes are all scalar except possibly one, as in the question, then:
mi <- cbind(2:3, 2)
mat[mi]
# test
identical(mat[mi], mat[2:3, 2])
## [1] TRUE
In higher dimensions:
a <- array(1:24, 2:4)
mi <- cbind(2, 2:3, 3)
a[mi]
# test
identical(a[mi], a[2, 2:3, 3])
## [1] TRUE
It would be possible to extend this to eliminate the scalar restriction using:
L <- list(2:3, 2:3)
array(mat[as.matrix(do.call(expand.grid, L))], lengths(L))
however, in light of (2) which also uses do.call but avoids the need for expand.grid it seems unnecessarily complex.
2) do.call This approach does not have the scalar limitation. mat and a are from above:
L2 <- list(2:3, 1:2)
do.call("[", c(list(mat), L2))
# test
identical(do.call("[", c(list(mat), L2)), mat[2:3, 1:2])
## [1] TRUE
L3 <- list(2, 2:3, 3:4)
do.call("[", c(list(a), L3))
# test
identical(do.call("[", c(list(a), L3)), a[2, 2:3, 3:4])
## [1] TRUE
This could be made prettier by defining:
`%[%` <- function(x, indexList) do.call("[", c(list(x), indexList))
mat %[% list(2:3, 1:2)
a %[% list(2, 2:3, 3:4)
Use which argument arr.ind = TRUE.
x <- c(25, 36)
inx <- which(mat == x, arr.ind = TRUE)
Warning message:
In mat == x :
longer object length is not a multiple of shorter object length
mat[inx]
#[1] 25 36
This is an interesting question. The subset function can actually help. You cannot subset directly your matrix using a vector or a list, but you can store the indexes in a list and use subset to do the trick.
mat <- matrix(1:12, nrow=4)
mat[2:3, 1:2]
# example using subset
subset(mat, subset = 1:nrow(mat) %in% 2:3, select = 1:2)
# double check
identical(mat[2:3, 1:2],
subset(mat, subset = 1:nrow(mat) %in% 2:3, select = 1:2))
# TRUE
Actually, we can write a custom function if we want to store the row- and column- indexes in the same list.
cust.subset <- function(mat, dim.list){
subset(mat, subset = 1:nrow(mat) %in% dim.list[[1]], select = dim.list[[2]])
}
# initialize a list that includes your sub-setting indexes
sbdim <- list(2:3, 1:2)
sbdim
# [[1]]
# [1] 2 3
# [[2]]
# [1] 1 2
# subset using your custom f(x) and your list
cust.subset(mat, sbdim)
# [,1] [,2]
# [1,] 2 6
# [2,] 3 7
I often need deal with large correlation matrices between sets of variables, and I want to know which correlations meet a given condition (e.g., are above .2 or .3 or absolute .2 or .3) and so on. So given a correlation matrix, it would be useful if I could get the pairs of variables that form correlations that satisfy a condition.
To make it a little more concrete, here is matrix
x <- matrix(1:9, nrow = 3)
rownames(x) <- colnames(x) <- c("a", "b", "c")
x
# x
# a b c
# a 1 4 7
# b 2 5 8
# c 3 6 9
I want a function that allows me to specify a cell condition, and then will return names for matching cells.
e.g., > 8 returns "c:c"
odd number returns "a:a", "c:a", ... "c:c"
The following function takes a matrix and a function. The function should return TRUE/FALSE for each cell value.
Using the sample matrix:
x <- matrix(1:9, nrow = 3)
rownames(x) <- colnames(x) <- c("a", "b", "c")
The function is:
cell_matches <- function(x, FUN = function(X) X > .2) {
cellnames <- outer(row.names(x), colnames(x), function(X, Y) paste0(X, ":", Y))
cellnames[FUN(x) ]
}
Thus, using the matrix above following works:
cell_matches(x, function(X) X > 8)
# [1] "c:c"
cell_matches(x, function(X) X %% 2 == 1)
# [1] "a:a" "c:a" "b:b" "a:c" "c:c"
Application to correlation matrix:
# correlations above .80
cell_matches(cor(mtcars), function(X) X > .80 & X != 1)
# [1] "disp:cyl" "hp:cyl" "cyl:disp" "wt:disp" "cyl:hp" "disp:wt"
check_cor <- function(mat,FUN)
{
apply(which(FUN(x), arr.ind = TRUE),1,
function(i)
{
paste0(row.names(mat)[i[1]],':',colnames(mat)[i[2]])
}
)
}
check_cor(cor(mtcars), function(X) X > .80 & X != 1)
I did not find an answer in other posts, nor did i understand them if they handled similar topics, since i am relatively new to R and to programming in general. I have the following survey output X that i am working with (extract):
A1B1 A1B2 A1B3 A1B4 A2B1 A2B2 A2B3 ...
-0.37014356 1.08841141 -0.126574243 -0.59169360 1.682673457 -0.427706432 -0.76091938 ...
3.03017573 1.39812421 0.243516558 -4.67181650 -0.378640756 2.039940436 -0.40785893 ...
3.50183121 1.51249433 -0.775449944 -4.23887560 -0.456911873 0.431838943 0.91108052 ...
...
I want to compute the difference of the maximum range diff(range(X[i,n:m])) of the first 4 (with n:m equals 1:4), the second 4 (5:8) and the third 4 (9:12) columns of every row i of X and put the results into a second matrix with i rows and 3 columns.
E.g. for the first row and the first 4 cols, it would be 1.08841141+0.59169360=1.68010501.
For this purpose i created a new matrix and tried to fill it up with the values:
newmatrix <- matrix(0,nrow(X),3)
newmatrix[1:nrow(X),1] <- for (i in (1:nrow(X))) {diff(range(X[i,1:4]))}
newmatrix[1:nrow(X),2] <- for (i in (1:nrow(X))) {diff(range(X[i,5:8]))}
newmatrix[1:nrow(X),3] <- for (i in (1:nrow(X))) {diff(range(X[i,9:12]))}
I get the output error:
Error in newmatrix[1:nrow(RBetas), 1] <- for (i in (1:nrow(RBetas))) { :
number of items to replace is not a multiple of replacement length
Thank you for your help!
Assuming that the block of columns are based on the first two characters, i.e. A1, A2, we can split this into different blocks by using substr to extract the first two characters from the column names and use this as index to split. Then, we can either use apply with range and diff to get the result or use pmax and pmin.
indx <- substr(colnames(df), 1,2)
If the grouping is not based on the column names but on the position, this should also work
indx <- (1:ncol(df)-1)%/%4 +1
res1 <- sapply(split(seq_len(ncol(df)), indx),
function(i) do.call(pmax,df[,i, drop=FALSE])-
do.call(pmin, df[,i, drop=FALSE]))
Or
res2 <- sapply(split(seq_len(ncol(df)), indx),
function(i) apply(df[,i, drop=FALSE], 1,
function(x) diff(range(x))) )
identical(res1, res2)
#[1] TRUE
res1
# A1 A2
#[1,] 1.680105 2.443593
#[2,] 7.701992 2.447799
#[3,] 7.740707 1.367992
Or using your code
newmatrix <- matrix(0, nrow(df), 2) #here the example dataset is only 7 columns
for(i in (1:nrow(df))) newmatrix[i,1] <- diff(range(df[i,1:4]))
for(i in (1:nrow(df))) newmatrix[i,2] <- diff(range(df[i,5:7]))
newmatrix
# [,1] [,2]
#[1,] 1.680105 2.443593
#[2,] 7.701992 2.447799
#[3,] 7.740707 1.367992
If you have many blocks of columns, you can try a double for loop
lst <- split(seq_len(ncol(df)), indx) #keep the columns to group in a `list`
newmatrix <- matrix(0, nrow(df), 2) #he
for(i in 1:nrow(df)){
for(j in seq_along(lst)){
newmatrix[i,j] <- diff(range(df[i, lst[[j]]]))
}
}
newmatrix
# [,1] [,2]
#[1,] 1.680105 2.443593
#[2,] 7.701992 2.447799
#[3,] 7.740707 1.367992
data
df <- structure(list(A1B1 = c(-0.37014356, 3.03017573, 3.50183121),
A1B2 = c(1.08841141, 1.39812421, 1.51249433), A1B3 = c(-0.126574243,
0.243516558, -0.775449944), A1B4 = c(-0.5916936, -4.6718165,
-4.2388756), A2B1 = c(1.682673457, -0.378640756, -0.456911873
), A2B2 = c(-0.427706432, 2.039940436, 0.431838943), A2B3 = c(-0.76091938,
-0.40785893, 0.91108052)), .Names = c("A1B1", "A1B2", "A1B3",
"A1B4", "A2B1", "A2B2", "A2B3"), class = "data.frame", row.names = c(NA,
-3L))