Iteratively/sequentially drop and replace df variable and feed into function - r

I am trying to assess the stability of a correlation analysis by iteratively dropping a variable, and re-running the analysis.
As I understand it, this requires me to (1) create matrices with p-1 columns by iteratively/sequentially dropping a variable from a data frame, (2) run a correlation function over the series of matrices, and (3) feed the output into a common data frame or list for subsequent analysis.
I am able to achieve each of these steps manually, as follows:
#required library for cc function
library(CCA)
#set seed
set.seed(123)
#X and Y dataframes
X_df <- data.frame(replicate(4,sample(1:10,10,rep=TRUE)))
Y_df <- data.frame(replicate(3,sample(1:10,10,rep=TRUE)))
#X and Y as scaled matrices
X <- scale(X_df)
Y <- scale(Y_df)
#manually omit a variable/column from the X df
X1 <- scale(X_df[2:4])
X2 <- scale(X_df[c(1, 3:4)])
X3 <- scale(X_df[c(1:2, 4)])
X4 <- scale(X_df[1:3])
#manually omit a variable/column from the Y df
Y1 <- scale(Y_df[2:3])
Y2 <- scale(Y_df[c(1, 3)])
Y3 <- scale(Y_df[1:2])
#perform canonical correlation - X sets and Y
cX1 <- cc(X1,Y)$cor
cX2 <- cc(X2,Y)$cor
cX3 <- cc(X3,Y)$cor
cX4 <- cc(X4,Y)$cor
#perform canonical correlation - Y sets and X
cY1 <- cc(X,Y1)$cor
cY2 <- cc(X,Y2)$cor
cY3 <- cc(X,Y3)$cor
#get canonical correlation values into a df
XVALS <- as.data.frame(rbind(cX1, cX2, cX3, cX4))
YVALS <- as.data.frame(rbind(cY1, cY2, cY3))
Of course, I know it's very bad to do this manually, and my real data is much larger.
Unfortunately, I am pretty new to R (and coding) and have been struggling to achieve any of these steps in a better way. I am familiar with (the existence of) the apply functions and also with some functions in dplyr that I think are likely relevant (e.g., select), but I just can't get it to work despite reading documentation and seemingly similar posts for hours -- any guidance would be greatly appreciated.

Don't scale.
First of all, there is no need for scaled vectors as the code below shows.
The reason why vectors are scaled is a variant of R FAQ 7.31; see also this SO post.
With older processors the precision loss was a real problem, leading to clearly wrong results. This is no longer true, at least not in the general case.
#perform canonical correlation - original X sets and Y
cX1b <- cc(X_df[2:4], Y)$cor
cX2b <- cc(X_df[c(1, 3:4)], Y)$cor
cX3b <- cc(X_df[c(1:2, 4)], Y)$cor
cX4b <- cc(X_df[1:3], Y)$cor
XVALSb <- as.data.frame(rbind(cX1b, cX2b, cX3b, cX4b))
XVALS and XVALSb row names are different; make them equal in order to please all.equal().
row.names(XVALS) <- 1:4
row.names(XVALSb) <- 1:4
The results are not exactly equal but are within floating-point accuracy. In this case I'm testing equality with all.equal's default of .Machine$double.eps^0.5.
identical(XVALS, XVALSb)
#[1] FALSE
all.equal(XVALS, XVALSb)
#[1] TRUE
XVALS - XVALSb
# V1 V2 V3
#1 0.000000e+00 1.110223e-16 0.000000e+00
#2 -1.110223e-16 1.110223e-16 5.551115e-17
#3 1.110223e-16 -2.220446e-16 2.220446e-16
#4 1.110223e-16 4.440892e-16 1.110223e-16
The question.
To get all combinations of columns that leave one out, there is the function combn.
The function cc_df_one_out below first calls combn on each of its arguments and then applies an anonymous function computing CCA::cc to each set of column indices.
Note that the row order is not the same as in your posted example, since combn does not follow your order of column indices.
cc_df_one_out <- function(X, Y){
f <- function(x) combn(ncol(x), ncol(x) - 1)
X_inx <- f(X)
Y_inx <- f(Y)
ccX <- t(apply(X_inx, 2, function(i) cc(X[, i], Y)$cor))
ccY <- t(apply(Y_inx, 2, function(i) cc(X, Y[, i])$cor))
list(XVALS = as.data.frame(ccX), YVALS = as.data.frame(ccY))
}
cc_df_one_out(X_df, Y_df)
#$XVALS
# V1 V2 V3
#1 0.8787169 0.6999526 0.5073979
#2 0.8922514 0.7244302 0.2979096
#3 0.8441566 0.7807032 0.3331449
#4 0.9059585 0.7371382 0.1344559
#
#$YVALS
# V1 V2
#1 0.8975949 0.7309265
#2 0.8484323 0.7488632
#3 0.8721945 0.7452478
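One possible refinement (my own addition, not part of the original answer): because combn(p, p - 1) omits the last column first, the dropped variables run in reverse column order, so the rows can be labelled with the variable that was left out.
# Label each row with the dropped variable (combn drops columns in reverse order)
res <- cc_df_one_out(X_df, Y_df)
rownames(res$XVALS) <- paste0("drop_", rev(names(X_df)))
rownames(res$YVALS) <- paste0("drop_", rev(names(Y_df)))
res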

Related

R - Cleanest way to run statistical test on every permutation of multiple populations

I have three populations stored as individual vectors. I need to run a statistical test (wilcoxon, if it matters) on each pair of these three populations.
I want to input three vectors into some block of code and get as output a vector of 6 p-values (each p-value is the result of one test and is a double).
I have a method that works, but I am new to R and, from what I've been reading, I feel there should be a better way to write this code, possibly involving storing the vectors as a data frame and using vectorization.
Here is the code I have:
library(arrangements)
runAllTests <- function(pop1,pop2,pop3) {
populations <- list(pop1=pop1,pop2=pop2,pop3=pop3)
colLabels <- c("pop1", "pop2", "pop3")
#This line makes a data frame where each column is a pair of labels
perms <- data.frame(t(permutations(colLabels,2)))
pvals <- vector()
#This for loop gets each column of that data frame
for (pair in perms[,]) {
pair <- as.vector(pair)
p1 <- as.numeric(unlist(populations[pair[1]]))
p2 <- as.numeric(unlist(populations[pair[2]]))
pvals <- append(pvals, wilcox.test(p1, p2,alternative=c("less"))$p.value)
}
return(pvals)
}
What is a more R appropriate way to write this code?
Note: Generating populations and comparing them all to each other is a common enough thing (and tricky enough to code) that I think this question will apply to more people than myself.
EDIT: I forgot that my actual populations are of different sizes. This means I cannot make a data frame out of the vectors (as far as I know). I can make a list of vectors though. I have updated my code with a version that works.
Yes, this is indeed common; so common, in fact, that R has a built-in function for exactly this scenario: pairwise.table.
p <- list(pop1, pop2, pop3)
pairwise.table(function(i, j) {
wilcox.test(p[[i]], p[[j]])$p.value
}, 1:3)
There are also specific versions for t tests, proportion tests, and Wilcoxon tests; here's an example using pairwise.wilcox.test.
p <- list(pop1, pop2, pop3)
d <- data.frame(x=unlist(p), g=rep(seq_along(p), sapply(p, length)))
with(d, pairwise.wilcox.test(x, g))
Also, make sure you look into the p.adjust.method parameter to correctly adjust for multiple comparisons.
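For instance, using the d data frame built above (an illustration; the default adjustment is "holm", and "none" and "bonferroni" are just two of the available choices):
# Raw p-values versus a stricter family-wise correction
with(d, pairwise.wilcox.test(x, g, p.adjust.method = "none"))
with(d, pairwise.wilcox.test(x, g, p.adjust.method = "bonferroni"))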
Per your comments, you're interested in tests where the order matters; that's really hard to imagine (and isn't true for the Wilcoxon test you mentioned) but still...
This is the pairwise.table function, edited to do tests in both directions.
pairwise.table.all <- function (compare.levels, level.names, p.adjust.method) {
ix <- setNames(seq_along(level.names), level.names)
pp <- outer(ix, ix, function(ivec, jvec)
sapply(seq_along(ivec), function(k) {
i <- ivec[k]; j <- jvec[k]
if (i != j) compare.levels(i, j) else NA }))
pp[] <- p.adjust(pp[], p.adjust.method)
pp
}
This is a version of pairwise.wilcox.test which uses the above function, and also runs on a list of vectors, instead of a data frame in long format.
pairwise.lazerbeam.test <- function(dat, p.adjust.method=p.adjust.methods) {
p.adjust.method <- match.arg(p.adjust.method)
level.names <- if(!is.null(names(dat))) names(dat) else seq_along(dat)
PVAL <- pairwise.table.all(function(i, j) {
wilcox.test(dat[[i]], dat[[j]])$p.value
}, level.names, p.adjust.method = p.adjust.method)
ans <- list(method = "Lazerbeam's special method",
data.name = paste(level.names, collapse=", "),
p.value = PVAL, p.adjust.method = p.adjust.method)
class(ans) <- "pairwise.htest"
ans
}
Output, both before and after tidying, looks like this:
> p <- list(a=1:5, b=2:8, c=10:16)
> out <- pairwise.lazerbeam.test(p)
> out
Pairwise comparisons using Lazerbeam's special method
data: a, b, c
a b c
a - 0.2821 0.0101
b 0.2821 - 0.0035
c 0.0101 0.0035 -
P value adjustment method: holm
> pairwise.lazerbeam.test(p) %>% broom::tidy()
# A tibble: 6 x 3
group1 group2 p.value
<chr> <chr> <dbl>
1 b a 0.282
2 c a 0.0101
3 a b 0.282
4 c b 0.00350
5 a c 0.0101
6 b c 0.00350
Here is an example of one approach that uses combn(), whose function argument makes it easy to apply wilcox.test() to all variable combinations.
set.seed(234)
# Create dummy data
df <- data.frame(replicate(3, sample(1:5, 100, replace = TRUE)))
# Apply wilcox.test to all combinations of variables in data frame.
res <- combn(names(df), 2,
             function(x) list(data = paste(x[1], x[2]),
                              p = wilcox.test(x = df[[x[1]]], y = df[[x[2]]])$p.value),
             simplify = FALSE)
# Bind results
do.call(rbind, res)
data p
[1,] "X1 X2" 0.45282
[2,] "X1 X3" 0.06095539
[3,] "X2 X3" 0.3162251

How to find out the best combination of a given vector whose sum is closest to a given number

My question is quite similar to this one: Find a subset from a set of integer whose sum is closest to a value
It discussed the algorithm only, but I want to solve it with R. I'm quite new to R and tried to work out a solution, but I wonder whether there is a more efficient way.
Here is my example:
# Define a vector, to findout a subset whose sum is closest to the reference number 20.
A <- c(2,5,6,3,7)
# display all the possible combinations
y1 <- combn(A,1)
y2 <- combn(A,2)
y3 <- combn(A,3)
y4 <- combn(A,4)
y5 <- combn(A,5)
Y <- list(y1,y2,y3,y4,y5)
# calculate the distance to the reference number of each combination
s1 <- abs(apply(y1,2,sum)-20)
s2 <- abs(apply(y2,2,sum)-20)
s3 <- abs(apply(y3,2,sum)-20)
s4 <- abs(apply(y4,2,sum)-20)
s5 <- abs(apply(y5,2,sum)-20)
S <- list(s1,s2,s3,s4,s5)
# find the minimum difference
M <- sapply(S,FUN=function(x) list(which.min(x),min(x)))
Mm <- which.min(as.numeric(M[2,]))
# return the right combination
data.frame(Y[Mm])[as.numeric(M[,Mm[1]])]
So the answer is 2, 5, 6, 7.
How can I refine this program? In particular, is there a way to handle the five combn() and five apply() calls at once? I would also like it to generalize, so that when A has more items I can simply use length(A).
Here is another way to do it,
l1 <- sapply(seq_along(A), function(i) combn(A, i))
l2 <- sapply(l1, function(i) abs(colSums(i) - 20))
Filter(length,
       Map(function(x, y) x[, y],
           l1,
           sapply(l2, function(i) i == Reduce(min, l2))))
#[[1]]
#[1] 2 5 6 7
The last line uses Map to index l1 based on a logical list created by finding the minimum value from list l2.
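For readability, the same logic can be spelled out step by step (an equivalent sketch, not part of the original answer):
# Assumes l1 and l2 from above
min_dist <- Reduce(min, l2)                        # smallest |sum - 20| over all subset sizes
hits <- lapply(l2, function(d) d == min_dist)      # logical selector per subset size
picks <- Map(function(m, sel) m[, sel], l1, hits)  # matching columns of each combn matrix
Filter(length, picks)                              # keep only the sizes that had a match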
The combiter library has the isubsetv iterator, which goes through all subsets of a vector. Combined with foreach, it simplifies the code.
library(combiter)
library(foreach)
A <- c(2,5,6,3,7)
res <- foreach(x = isubsetv(A), .combine = c) %do% sum(x)
absdif <- abs(res-20)
ind <- which(absdif==min(absdif))
as.list(isubsetv(A))[ind]

R correlation matrix by group using spearman

I am trying to create a set of correlation matrices by different levels of a factor variable.
This question has previously been answered (spearman correlation by group in R), but not for a matrix, and the vector result doesn't seem to generalize as far as I can see.
The code below works, but it can't be written to a csv because by() outputs a list; the error is "cannot coerce class ""by"" to a data.frame".
cor1<- by(data, INDICES=data$factor0, FUN = function(x) cor(x[,c("x","y","z","a",
"b","c")],method="spearman",use="pairwise"))
So I am looking for a method to either coerce the above into a data.frame so I can write it to a csv, or to produce the above result by an alternative method which outputs a data frame.
Any help greatly appreciated.
The reason you get a list is that if x is a matrix, then cor(x) will be a matrix as well, not a scalar. In this case it will be a 6x6 matrix.
This is the natural way to represent the result, it seems to me. You can make it into a single data frame if you want, though I'm not sure what you want the rows and columns to represent exactly. Here is one option.
data<-matrix(rnorm(500),100,5)
colnames(data)<-letters[1:5]
factors<-sample(LETTERS[1:3],100,T)
cors<-by(data,factors,cor)
cors[[1]]
# a b c d e
# a 1.00000000 0.05389618 -0.16944040 0.25747174 0.21660217
# b 0.05389618 1.00000000 0.22735796 -0.06002965 -0.30115444
# c -0.16944040 0.22735796 1.00000000 -0.06625523 -0.01120225
# d 0.25747174 -0.06002965 -0.06625523 1.00000000 0.10402791
# e 0.21660217 -0.30115444 -0.01120225 0.10402791 1.00000000
corsMatrix<-do.call(rbind,lapply(cors,function(x)x[upper.tri(x)]))
names<-outer(colnames(data),colnames(data),paste,sep="X")
colnames(corsMatrix)<-names[upper.tri(names)]
corsMatrix
# aXb aXc bXc aXd bXd cXd
# A 0.05389618 -0.16944040 0.22735796 0.25747174 -0.06002965 -0.06625523
# B -0.34231682 -0.14225269 0.20881053 -0.14237661 0.25970138 0.27254840
# C 0.27199944 -0.01333377 0.06402734 0.02583126 -0.03336077 -0.02207024
# aXe bXe cXe dXe
# A 0.216602173 -0.3011544 -0.01120225 0.10402791
# B 0.347006942 -0.2207421 0.33123175 -0.05290809
# C 0.007748369 -0.1257357 0.23048709 0.16037247
I'm not sure if this is what you are looking for. Another option is to export each correlation matrix to its own csv file.
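If the one-file-per-level route is preferred, a minimal sketch using the cors object from above (file names are only illustrative):
# Write one csv per factor level
levs <- dimnames(cors)[[1]]
for (k in seq_along(levs)) {
  write.csv(cors[[k]], file = paste0("cor_", levs[k], ".csv"))
}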
You can use ddply from the plyr package:
library(plyr)
n <- 1e2
mdat <- data.frame(factor0 = factor(LETTERS[sample(26, n, TRUE)]), x = rnorm(n),
y = rnorm(n), z = rnorm(n), a = rnorm(n), b = rnorm(n),
c = rnorm(n))
ddply(mdat, .(factor0), function(d) {
ret <- as.data.frame(cor(d[, letters[c(1:3, 24:26)]], method="spearman",use="pairwise"))
ret$col <- letters[c(1:3, 24:26)]
ret[, c(7, 1:6)]})
Your query is not that clear, at least to me. If I understood it correctly, you may need to build a pairwise matrix first, before computing the correlation.
You may want to try the following function from the SciencesPo package.
require(SciencesPo)
m<-rprob(mtcars, df = nrow(mtcars) - 2)
The following will stack your matrix, so it becomes easier to check r and the related p-values.
rstack(m)

How do you find the sample sizes used in calculations on r?

I am running correlations between variables, some of which have missing data, so the sample sizes for each correlation are likely different. I tried print and summary, but neither of these shows me how big my n is for each correlation. This is a fairly simple problem that I cannot find the answer to anywhere.
like this..?
x <- c(1:100,NA)
length(x)
length(x[!is.na(x)])
you can also get the degrees of freedom like this...
y <- c(1:100,NA)
x <- c(1:100,NA)
cor.test(x,y)$parameter
But I think it would be best if you showed the code for how you are estimating the correlation, so we can give exact help.
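As an aside (my own addition, not part of the original answer): for a Pearson correlation, cor.test reports n - 2 degrees of freedom, so the pairwise sample size can be recovered from that value.
# Pearson cor.test uses df = n - 2, so the pairwise n is the df plus 2
cor.test(x, y)$parameter + 2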
Here's an example of how to find the pairwise sample sizes among the columns of a matrix. If you want to apply it to (certain) numeric columns of a data frame, combine them accordingly, coerce the resulting object to matrix and apply the function.
# Example matrix:
xx <- rnorm(3000)
# Generate some NAs
vv <- sample(3000, 200)
xx[vv] <- NA
# reshape to a matrix
dd <- matrix(xx, ncol = 3)
# find the number of NAs per column
apply(dd, 2, function(x) sum(is.na(x)))
# tack on some column names
colnames(dd) <- paste0("x", seq(3))
# Function to find the number of pairwise complete observations
# among all pairs of columns in a matrix. It returns a data frame
# whose first two columns comprise all column pairs
pairwiseN <- function(mat)
{
u <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
h <- expand.grid(x = u, y = u)
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
h$n <- mapply(f, h[, 1], h[, 2])
h
}
# Call it
pairwiseN(dd)
The function can easily be improved; for example, you could set h <- expand.grid(x = u[-1], y = u[-length(u)]) to cut down on the number of calculations, or you could return an n x n matrix instead of a three-column data frame, etc.
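If a square layout is wanted, one option (a small sketch building on pairwiseN above) is to reshape its long output with xtabs:
# Turn the three-column result into an n x n table of pairwise counts
xtabs(n ~ x + y, data = pairwiseN(dd))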
Here is a for-loop implementation of Dennis' function above that outputs an n x n matrix directly, rather than having to pivot_wider() that result. On my Databricks cluster it cut the compute time for a 1865-row x 69-column matrix from 2.5-3 minutes down to 30-40 seconds.
Thanks for your answer, Dennis; this helped me with my work.
pairwise_nxn <- function(mat)
{
cols <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
nn <- data.frame(matrix(nrow = length(cols), ncol = length(cols)))
rownames(nn) <- colnames(nn) <- cols
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
for (i in 1:nrow(nn))
for (j in 1:ncol(nn))
nn[i,j] <- f(rownames(nn)[i], colnames(nn)[j])
nn
}
If your variables are vectors named a and b, would something like sum(is.na(a) | is.na(b)) help you?
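That expression counts the pairs excluded by missing values; if the number of pairs actually used is what's wanted (my reading of the question), its complement gives it, for example:
# Pairwise complete observations, i.e. the n used with use = "pairwise"
sum(complete.cases(a, b))
# equivalently
length(a) - sum(is.na(a) | is.na(b))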

tapply on matrices of data and indices

I am calculating sums of matrix columns to each group, where the corresponding group values are contained in matrix columns as well. At the moment I am using a loop as follows:
index <- matrix(c("A","A","B","B","B","B","A","A"),4,2)
x <- matrix(1:8,4,2)
for (i in 1:2) {
tapply(x[,i], index[,i], sum)
}
At the end of the day I need the following result:
1 2
A 3 15
B 7 11
Is there a way to do this using matrix operations, without a loop? On top of that, the real data is large (e.g. 500 x 10000), so it has to be fast.
Thanks in advance.
Here are a couple of solutions:
# 1
ag <- aggregate(c(x), data.frame(index = c(index), col = c(col(x))), sum)
xt <- xtabs(x ~., ag)
# 2
m <- mapply(rowsum, as.data.frame(x), as.data.frame(index))
dimnames(m) <- list(levels(factor(index)), 1:ncol(index))
The second only works if every column of index has at least one of each level and also requires that there be at least 2 levels; however, it's faster.
This is ugly but it works; there's probably a much better way to do it that is more generalizable. Just getting the ball rolling.
data.frame("col1"=as.numeric(table(rep(index[,1], x[,1]))),
"col2"=as.numeric(table(rep(index[,2], x[,2]))),
row.names=names(table(index)))
I still suspect there's a better option, but this seems reasonably fast actually:
index <- matrix(sample(LETTERS[1:4],size = 500*1000,replace = TRUE),500,10000)
x <- matrix(sample(1:10,500*10000,replace = TRUE),500,10000)
rs <- matrix(NA,4,10000)
rownames(rs) <- LETTERS[1:4]
for (i in LETTERS[1:4]){
tmp <- x
tmp[index != i] <- 0
rs[i,] <- colSums(tmp)
}
It runs in ~0.8 seconds on my machine. I upped the number of categories to four and scaled it up to the size of data you have. But I don't like having to copy x each time.
You can get clever with matrix multiplication, but I think you still have to do one row or column at a time.
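A rough sketch of that matrix-multiplication idea, one column at a time (my own illustration, not from the original answer):
# Group sums via an indicator matrix, applied column by column
groups <- sort(unique(c(index)))
rs2 <- sapply(seq_len(ncol(x)), function(i) {
  ind <- outer(groups, index[, i], "==") * 1  # groups x rows indicator matrix
  drop(ind %*% x[, i])                        # sums of x[, i] within each group
})
rownames(rs2) <- groups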
You used tapply. If you add mapply, you can complete your objective.
It does the same thing as that for loop.
index <- matrix(c("A","A","B","B","B","B","A","A"),4,2)
x <- matrix(1:8,4,2)
mapply( function(i) tapply(x[,i], index[,i], sum), 1:2 )
result:
[,1] [,2]
A 3 15
B 7 11
