How to optimize my correlation problem in R?

I have three dataframes in R, let's call them A, B, and C.
Dataframe C contains two columns: the first contains row names from dataframe A and the second contains row names from dataframe B:
C <- data.frame(col1 = c("a12", "a9"), col2 = c("b6","b54"))
I want to calculate the correlation coefficient and p-value for each row of table C, using the corresponding rows of tables A and B (i.e. correlating the values of row a12 of table A with those of row b6 of table B, row a9 of table A with row b54 of table B, and so on), and store the results in additional columns of table C. This is my current naive and highly inefficient code:
for (i in 1:nrow(C)) {
  correlation <- cor.test(unlist(A[C[i, 1], ]), unlist(B[C[i, 2], ]), method = "spearman")
  C[i, 3] <- correlation$estimate
  C[i, 4] <- correlation$p.value
}
The main problem is that with my current large datasets this analysis can literally take months, so I'm looking for a more efficient way to accomplish the task. I also tried the following code using the Hmisc package, but the server I'm working on can't handle the large vectors:
library(Hmisc)
A <- t(A)
B <- t(B)
ind.A <- match(C[, 1], colnames(A))
A <- A[, ind.A]
ind.B <- match(C[, 2], colnames(B))
B <- B[, ind.B]
C[, 3] <- diag(rcorr(as.matrix(A), as.matrix(B), type = "spearman")$r[1:ncol(A), 1:ncol(A)])
C[, 4] <- diag(rcorr(as.matrix(A), as.matrix(B), type = "spearman")$P[1:ncol(A), 1:ncol(A)])

Based on the comment by @HYENA, I tried parallelizing the processing. This approach accelerated the computation roughly 4-fold (with 8 cores). The code:
library(foreach)
library(doParallel)

cl <- makeCluster(detectCores())
registerDoParallel(cl)
cor.res <- foreach(i = 1:nrow(C)) %dopar% {
  a <- C[i, 1]
  b <- C[i, 2]
  correlation <- cor.test(unlist(A[a, ]), unlist(B[b, ]), method = "spearman")
  c(correlation$estimate, correlation$p.value)
}
stopCluster(cl)  # release the worker processes

cor.res <- data.frame(Reduce("rbind", cor.res))
C[, c(3, 4)] <- cor.res

Extract just the part you need from cor.test, giving cor_test1, and use that instead. Or, in addition, create a lookup table for the p values, giving cor_test2, which is slightly faster than cor_test1.
Based on the median column, with vectors of length 10 these run about 3x faster than cor.test. Although cor_test2 is only slightly faster than cor_test1 here, we have included it since the speedup may depend on the size of the input, which we don't have; you can try it out yourself with whatever sizes you have.
# given correlation and degrees of freedom output p value
r2pval <- function(r, dof) {
  tval <- sqrt(dof) * r / sqrt(1 - r^2)
  min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
}

# faster version of cor.test
cor_test1 <- function(x, y) {
  r <- cor(x, y)
  dof <- length(x) - 2
  tval <- sqrt(dof) * r / sqrt(1 - r^2)
  pval <- min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
  c(r, pval)
}

# even faster version of cor.test.
# Given x, y and the pvals table calculate a 2-vector of r and p value
cor_test2 <- function(x, y, pvals) {
  r <- cor(x, y)
  c(r, pvals[100 * round(r, 2) + 101])
}
# test
set.seed(123)
n <- 10
x <- rnorm(n); y <- rnorm(n)
dof <- n - 2
# pvals is the 201 p values for r = -1, -0.99, -0.98, ..., 1
pvals <- sapply(seq(-1, 1, 0.01), r2pval, dof = dof)
library(microbenchmark)
microbenchmark(cor.test(x, y), cor_test1(x, y), cor_test2(x, y, pvals))
giving:
Unit: microseconds
                   expr   min    lq    mean median     uq     max neval cld
         cor.test(x, y) 253.7 256.7 346.278 266.05 501.45   650.6   100   a
        cor_test1(x, y)  84.8  87.2 346.777  89.10 107.40 22974.4   100   a
 cor_test2(x, y, pvals)  72.4  75.0 272.030  79.45  91.25 17935.8   100   a
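To tie this back to the original question, here is a minimal sketch (an addition, not part of the answer above) of how cor_test1 could replace cor.test in the row-by-row loop; it assumes A, B and C are the data frames from the question:
res <- vapply(seq_len(nrow(C)), function(i) {
  # look up the A row and B row named in row i of C, then correlate them
  cor_test1(unlist(A[C[i, 1], ]), unlist(B[C[i, 2], ]))
}, numeric(2))
C[, 3] <- res[1, ]  # correlation estimates
C[, 4] <- res[2, ]  # p values
Note that cor_test1 uses cor() with its default Pearson method; for rank correlations as in the question you could call cor(x, y, method = "spearman") inside it, in which case the t-based p value is only an approximation.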

Related

Fast QR Factorization in R

I have a large number of matrices for which I need to perform a QR factorization and store the resulting Q matrices (normalized such that the R matrix has positive numbers on its diagonal). Is there a method other than using the qr() function?
Here is a working example:
system.time({
  # Parameters for the matrix to be generated
  row_number <- 1e6 / 4
  col_number <- 4
  # Generate large matrix of random numbers normally distributed.
  # Basically it's a matrix containing 4x4 matrices for which I will perform the QR factorization:
  RND <- matrix(data = rnorm(n = 1e6, mean = 0, sd = 1), nrow = row_number, ncol = col_number)
  # Allocate a 0 matrix where the result will be entered
  QSTACK <- matrix(0, nrow = row_number, ncol = col_number)
  number_of_blocks <- row_number / 4  # the number of 4x4 matrices in RND => 62,500
  for (k in 1:number_of_blocks) {
    l1 <- 1 + col_number * (k - 1)
    l2 <- col_number * k
    QR <- qr(RND[l1:l2, ])  # perform QR factorization
    R <- qr.R(QR)
    # normalize such that R diagonal elements are positive
    QSTACK[l1:l2, ] <- qr.Q(QR) %*% diag(sign(diag(R)))
  }
})
#  user  system elapsed
#  3.04    0.03    3.07
So that took 3.07 seconds to compute 62,500 QR factorizations. I'm wondering if there is something faster.
If you want (1) the R factor to have positive diagonal elements, and (2) to explicitly form the Q factor (rather than keep it in its sequential Householder vectors format), then you can cheat by using Cholesky factorization:
cheatQR <- function(X) {
  XtX <- crossprod(X)
  R <- chol(XtX)
  Q <- t(forwardsolve(R, t(X), upper.tri = TRUE, transpose = TRUE))
  list(Q = Q, R = R)
}
The raw QR:
rawQR <- function(X) {
  QR <- qr(X)
  Q <- qr.Q(QR)
  R <- qr.R(QR)
  sgn <- sign(diag(R))
  R <- sgn * R
  Q <- Q * rep(sgn, each = nrow(Q))
  list(Q = Q, R = R)
}
Benchmark:
X <- matrix(rnorm(10000 * 4), nrow = 10000, ncol = 4)
ans1 <- rawQR(X)
ans2 <- cheatQR(X)
all.equal(ans1, ans2)
#[1] TRUE
library(microbenchmark)
microbenchmark(rawQR(X), cheatQR(X))
#Unit: microseconds
#       expr      min       lq     mean    median       uq      max neval
#   rawQR(X) 3083.537 3109.222 3796.191 3123.2230 4782.583 13895.81   100
# cheatQR(X)  828.025  837.491 1421.211  865.9085 1434.657 32577.01   100
For further speedup, it is often advised to link your R installation to an optimized BLAS library such as OpenBLAS. But more relevant to your context, where you compute a large number of QR factorizations of small matrices, it is more worthwhile to parallelize your for loop.
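For example, here is a rough sketch (an addition, not part of the answer) of how the block-wise loop from the question could be parallelized with the base parallel package; it reuses RND, col_number and number_of_blocks from the question's code:
library(parallel)
qr_block <- function(k) {
  l1 <- 1 + col_number * (k - 1)
  l2 <- col_number * k
  QR <- qr(RND[l1:l2, ])
  R <- qr.R(QR)
  # normalize so that the diagonal elements of R are positive
  qr.Q(QR) %*% diag(sign(diag(R)))
}
# mclapply forks worker processes on Unix-alikes; on Windows use parLapply with a cluster
Q_list <- mclapply(seq_len(number_of_blocks), qr_block, mc.cores = 4)
QSTACK <- do.call(rbind, Q_list)
Whether this pays off depends on the block size: for 4x4 blocks the per-task overhead can dominate, so dispatching chunks of many blocks per worker may be necessary.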

How to write a for loop to compute max of each column for a dataset in R?

A similar question: How to write a double for loop in r with choosing maximal element in one loop?
The same setup:
I want to first sample theta[j] for j = 1, 2, ..., 71, and then draw replicates (say 1000 of them) yrep[k] from Bin(n[j], theta[j]), where n[j] is known.
For theta[1] we have yrep[1,1], yrep[1,2], ..., yrep[1,1000]. Doing this for all theta[j] gives a matrix yrep[i,j], i = 1, ..., 71, j = 1, ..., 1000. Then computing the mean, max or min of each column (yrep[1,j], ..., yrep[71,j]) gives 1000 means, maxes or mins.
How to write this for loop?
I first tried to write a loop to sample theta[j] and yrep. I do not know how to add code that computes the maximum, mean, and minimum inside this loop, and I am not sure this code is right:
theta <- NULL
yrep <- NULL
test <- NULL
k <- 1
for (i in 1:1000) {
  for (j in 1:71) {
    theta[j] <- rbeta(1, samp_A + y[j], samp_B + n[j] - y[j])
    yrep[k] <- rbinom(1, n[j], theta[j])
    k <- k + 1
  }
  t <- c(test, max(yrep))
}
Data is given in How to write a double for loop in r with choosing maximal element in one loop?:
#Data
y <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,
2,1,5,2,5,3,2,7,7,3,3,2,9,10,4,4,4,4,4,4,4,10,4,4,4,5,11,12,
5,5,6,5,6,6,6,6,16,15,15,9,4)
n <-
c(20,20,20,20,20,20,20,19,19,19,19,18,18,17,20,20,20,20,19,19,18,18,25,24,
23,20,20,20,20,20,20,10,49,19,46,27,17,49,47,20,20,13,48,50,20,20,20,20,
20,20,20,48,19,19,19,22,46,49,20,20,23,19,22,20,20,20,52,46,47,24,14)
#Evaluate densities in grid
x <- seq(0.0001, 0.9999, length.out = 1000)
#Compute the marginal posterior of alpha and beta in hierarchical model Use grid
A <- seq(0.5, 15, length.out = 100)
B <- seq(0.3, 45, length.out = 100)
#Make vectors that contain all pairwise combinations of A and B
cA <- rep(A, each = length(B))
cB <- rep(B, length(A))
#Use logarithms for numerical accuracy!
lpfun <- function(a, b, y, n) log(a + b) * (-5/2) +
  sum(lgamma(a + b) - lgamma(a) - lgamma(b) + lgamma(a + y) + lgamma(b + n - y) -
        lgamma(a + b + n))
lp <- mapply(lpfun, cA, cB, MoreArgs = list(y, n))
#Subtract maximum value to avoid over/underflow in exponentiation
df_marg <- data.frame(x = cA, y = cB, p = exp(lp - max(lp)))
#Sample from the grid (with replacement)
nsamp <- 100
samp_indices <- sample(length(df_marg$p), size = nsamp,
                       replace = TRUE, prob = df_marg$p / sum(df_marg$p))
samp_A <- cA[samp_indices[1:nsamp]]
samp_B <- cB[samp_indices[1:nsamp]]
# dplyr and tidyr are needed for %>% and gather()
library(dplyr)
library(tidyr)
df_psamp <- mapply(function(a, b, x) dbeta(x, a, b),
                   samp_A, samp_B, MoreArgs = list(x = x)) %>%
  as.data.frame() %>% cbind(x) %>% gather(ind, p, -x)
This is not very well tested.
There is no need for loops to sample from the distributions included in base R; those functions are vectorized over their arguments. Code along the lines below should be able to do what the question asks for.
Ni <- 1000
Nj <- 71  # one group per element of y and n
theta <- rbeta(Ni * Nj, rep(samp_A + y, each = Ni), rep(samp_B + n - y, each = Ni))
yrep <- rbinom(Ni * Nj, n, theta)
test1 <- matrix(yrep, nrow = Ni)
mins1 <- matrixStats::colMins(test1)
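If, in addition to the minima, you also want the column-wise maxima and means, the same matrix can be reused (a small extension, not part of the original answer):
maxs1 <- matrixStats::colMaxs(test1)
means1 <- colMeans(test1)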

Memory efficient way to print summary of dataset

I have a dataset with the following structure:
require(data.table)
train <- data.table(sample(1:10, 10), runif(10, 0, 10))
However, the dataset is ~7.5 GB in memory and has ~630 million rows. Attempting summary(train) yields an error: Error: cannot allocate vector of size 2.3 Gb. I can extract some information by manually calling train[, mean(V2)], train[, min(V2)] and train[, max(V2)], but the median and quantiles result in an out-of-memory error. Is there a way to perform these operations on a 16 GB RAM machine?
An idea would be to split the dataset, but that would be a bit cumbersome with respect to the median and quantiles.
So I came up with the function summaryI, to which we supply the name of the column of interest:
summaryI <- function(i2) {
  setorderv(train, i2)
  N <- train[, .N]
  # count NAs:
  # nas <- is.na(train[[i2]])
  # nNA <- sum(nas)
  # OR
  i <- 1L
  nNA <- 0L
  while (is.na(train[[i2]][i])) {
    nNA <- nNA + 1L
    i <- i + 1L
  }
  nNA
  # will be slow if many NAs, but more memory efficient
  # (will not create additional vector)
  n <- N - nNA
  probs <- seq(0, 1, 0.25)
  # quantiles, only type = 7
  index <- 1 + (n - 1) * probs
  lo <- floor(index)
  hi <- ceiling(index)
  qs <- train[[i2]][lo + nNA]
  i <- which(index > lo)
  h <- (index - lo)[i]
  qs[i] <- (1 - h) * qs[i] + h * train[[i2]][hi[i] + nNA]
  qs  # quantile results
  rmean <- sum(train[[i2]], na.rm = TRUE) / n
  qq <- c(qs[1L:3L], rmean, qs[4L:5L])
  digits <- max(3L, getOption("digits") - 3L)
  qq <- signif(qq, digits)
  names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
  if (nNA > 0L) {  # to match summary output
    c(qq, `NA's` = nNA)
  } else {
    qq
  }
}
The basic idea is to order the column of interest in place (with setorderv from data.table) and then do all the calculations without copying data.
As mentioned in the comments, if your data have a lot of NAs then this will be slow.
But hopefully you will be able to run it on all of the data. Also, note that NA handling is hard-coded inside the function.
Example:
summaryI('V2')
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's
# 2.398e-08 2.501e-01 5.000e-01 5.000e-01 7.500e-01 1.000e+00 1.000e+02
or run over multiple columns, like:
sapply(colnames(train), summaryI)
The source code of summary and quantile, which I used as a basis, can be found at:
github quantile.R
github summary.R

For loops for nested variables within function in R

I would like to iterate through vectors of values and calculate something for every value while inside a function environment in R. For example:
# I have costs for 3 companies
c <- c(10, 20, 30)
# I have the same revenue across all 3
r <- 100
# I want to obtain the profits for all 3 within one variable
result <- list()
# I could do this in a for loop
for (i in 1:3) {
  result[i] <- r - c[i]
}
Now let's assume I have a model that is very long, and I define everything as a function which is to be solved with various random draws for the costs.
# Random draws
n <- 1000
r <- rnorm(n, mean = 100, sd = 10)
c1 <- rnorm(n, mean = 10, sd = 1)
c2 <- rnorm(n, mean = 20, sd = 2)
c3 <- rnorm(n, mean = 30, sd = 3)
X <- data.frame(r, c1, c2, c3)

fun <- function(x) {
  r <- x[1]
  c <- c(x[2], x[3], x[4])
  for (i in 1:3) {
    result[i] <- r - c[i]
  }
  return(result)
}
I could then evaluate the result for all draws by iterating through the rows of randomly sampled input data.
for (j in 1:n) {
  x <- X[j, ]
  y <- fun(x)
}
In this example, the output variable y would contain the nested result variable, which comprises the results for all 3 companies. However, my line of thinking results in an error, and I think it has to do with the fact that I try to return a nested variable. Hence my question: how would you approach something like this?
I would suggest rethinking your coding approach. This is a very un-R-like way of doing things.
For example, the first for loop can be written much more succinctly as
x <- c(10, 20, 30)
r <- 100
result <- lapply(-x, `+`, r)
Then fun becomes something like
fun <- function(x) lapply(-x[-1], `+`, x[1])
To then operate over the rows of a data.frame (which is what you seem to do in the last step), you can use something like
apply(X, 1, fun)
where the MARGIN = 1 argument in apply ensures that you are applying a function per row (as opposed to per column).
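For completeness, a short usage sketch (added here; it assumes X and fun as defined above):
profits <- apply(X, 1, fun)  # a list with one element per row of X
length(profits)              # 1000
profits[[1]]                 # the list of three per-company profits for row 1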
Here's an approach using your function and a for loop:
# Random draws
n <- 1000
r <- rnorm(n, mean = 100, sd = 10)
c1 <- rnorm(n, mean = 10, sd = 1)
c2 <- rnorm(n, mean = 20, sd = 2)
c3 <- rnorm(n, mean = 30, sd = 3)
X <- data.frame(r, c1, c2, c3)

result <- list()
fun <- function(x) {
  r <- x[[1]]
  c <- c(x[[2]], x[[3]], x[[4]])
  for (i in 1:3) {
    result[i] <- r - c[i]
  }
  return(result)
}

# Create a list to store the results
profits <- rep(rep(list(1:3)), nrow(X))
# Loop through each row of the dataframe and store the results in profits
for (i in 1:nrow(X)) {
  profits_temp <- fun(list(X[i, "r"], X[i, "c1"], X[i, "c2"], X[i, "c3"]))
  for (j in 1:3)
    profits[[i]][[j]] <- profits_temp[[j]]
}
# Eye results
profits[[1]]
#> [1] 93.23594 81.25731 70.27699
profits[[2]]
#> [1] 80.50516 69.27517 63.36439

Sampling repeatedly with different probability

In the following code, "Weight" is a large matrix of weight sets. This matrix consists of, let's say, 1000 rows and 4 columns. Each row is a set of weights (the elements of each row sum to one).
In addition, there are four objects, and I want to select one of them based on each weight set. In other words, this random selection should be repeated for all of the weight sets.
Right now I have solved the problem with a for loop, but is there a more efficient way to code this in R?
y <- c("a", "b", "c", "d")
# selection is a vector with the same number of rows as Weight
for (i in 1:nrow(Weight)) {
  selection[i] <- sample(y, 1, prob = Weight[i, ])
}
A more efficient way is to first compute the row-wise cumulative sums of your weights, then draw a number between 0 and 1 and see where it lands within that cumulative sum. This way you only need a single call to runif to get your random data, versus 1000 calls with the other methods.
Weight <- matrix(sample(1:100, 1000 * 4, TRUE), 1000, 4)
x <- runif(nrow(Weight))
cumul.w <- Weight %*% upper.tri(diag(ncol(Weight)), diag = TRUE) / rowSums(Weight)
i <- rowSums(x > cumul.w) + 1L
selection <- y[i]
Also note how the cumulative sums are computed by multiplying by a triangular matrix instead of using the slower apply(Weight, 1, cumsum); a small check of this trick is shown below. Everything is vectorized, so it should be much faster than using an apply or for loop.
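The check (an added illustration, not part of the answer): the triangular-matrix product reproduces row-wise cumulative sums on a toy matrix:
W <- matrix(1:8, nrow = 2)  # toy 2 x 4 matrix
all.equal(W %*% upper.tri(diag(4), diag = TRUE),
          t(apply(W, 1, cumsum)))
#> [1] TRUE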
Benchmark comparison with apply and for:
f_runif <- function(Weight, y) {
  x <- runif(nrow(Weight))
  cumul.w <- Weight %*% upper.tri(diag(ncol(Weight)), diag = TRUE) /
    rowSums(Weight)
  i <- rowSums(x > cumul.w) + 1L
  y[i]
}

f_for <- function(Weight, y) {
  selection <- rep(NA, nrow(Weight))
  for (i in 1:nrow(Weight)) {
    selection[i] <- sample(y, 1, prob = Weight[i, ])
  }
  selection  # return the filled vector
}

f_apply <- function(Weight, y) {
  apply(Weight, 1, function(w) sample(y, 1, prob = w))
}
y <- c("a", "b", "c", "d")
Weight <- matrix(sample(1:100, 1000 * 4, TRUE), 1000, 4)
library(microbenchmark)
microbenchmark(f_runif(Weight, y),
               f_for(Weight, y),
               f_apply(Weight, y))
# Unit: microseconds
#               expr       min        lq    median         uq       max neval
# f_runif(Weight, y)   223.635   231.111   274.531   281.2165  1443.208   100
#   f_for(Weight, y) 10220.674 11238.660 11574.039 11917.1610 14583.028   100
# f_apply(Weight, y)  9006.974 10016.747 10509.150 10879.9245 27060.189   100
Wrap your sample into a function that lets you pass only one argument, a row from Weight:
myfun <- function(w) {
  sample(y, 1, prob = w)
}
Then you can use one of the apply family:
apply(Weight, 1, myfun)
However, so long as you have pre-allocated selection (as in the sketch below), your method is not terribly inefficient.
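A sketch of that pre-allocated version (assuming Weight and y as defined earlier):
selection <- character(nrow(Weight))
for (i in seq_len(nrow(Weight))) {
  selection[i] <- sample(y, 1, prob = Weight[i, ])
}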
