Optimize variance calculation, for loop too slow - r

This is a follow-up to the question answered in "Apply function too slow in r".
I have to evaluate a specific formula per row for a large number of species. The formula corresponds to a variance calculation, so it needs the result obtained in the question linked above.
My current script uses a for loop, which is very slow. I simplified the problem in the following script, using a small data frame called az.
az=data.frame(c(1,2,10),c(2,4,20),c(3,6,30))
colnames(az)=c("a","b","c")
# Necessary number calculated in step 1 (see link above)
m <- as.matrix(az)
m[is.na(m)] <- 0 #remove NA from sums
step1 = as.vector(m %*% m[nrow(m),])
# Initial for loop
prov=0 # prov for provisional number
for (i in 1:nrow(az)){
  for (j in 1:ncol(az)){
    prov=prov+az[i,j]*az[nrow(az),j]
    prov=prov+az[i,j]*(az[nrow(az),j]-step1[i])^2
  }
  print(prov)
  prov=0
}
As I have to repeat the operation for a huge number of species, I was wondering if anyone has a more efficient solution, maybe using vectorized expressions.
Kind regards.

This code will return the same values that your code prints out, but more efficiently.
> n<-nrow(m)
> mm<-t(m)
> prov<-mm*mm[,n]
> prov<-prov+mm*(mm[,n]-step1[col(mm)])^2
> colSums(prov)
[1] 82140 791480 113717400
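As a cross-check, here is a minimal base-R sketch of the same calculation without the transpose. It assumes the setup from the question (az, m, step1); the names last, dev2 and res are introduced here for illustration only. It relies on the fact that the first term of the loop, summed over j, is just step1[i].

```r
az <- data.frame(a = c(1, 2, 10), b = c(2, 4, 20), c = c(3, 6, 30))
m <- as.matrix(az)
m[is.na(m)] <- 0                     # treat NA as 0, as in the question
last <- m[nrow(m), ]                 # last row, reused for every i
step1 <- as.vector(m %*% last)       # step1[i] = sum_j m[i, j] * last[j]

# dev2[i, j] = (last[j] - step1[i])^2 (illustrative helper matrix)
dev2 <- outer(step1, last, function(s, l) (l - s)^2)

# First loop term sums to step1[i]; second term is a row sum
res <- step1 + rowSums(m * dev2)
res
```

For the toy data this gives the same three values as the loop (82140, 791480, 113717400). Note that dev2 has the same dimensions as m, so memory use grows with the size of the input.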

Related

R: Efficiently Calculate Deviations from the Mean Using Row Operations on a DF (Without Using a For Loop)

I am generating a very large data frame consisting of a large number of combinations of values. As such, my coding has to be as efficient as possible or else 1) I get errors like - R cannot allocate vector of size XX or 2) the calculations take forever.
I am at the point where I need to calculate r deviations from the mean (r = 3 in the example below) for each sample (one sample per row of the df), labeled dev1 to dev3 in the picture below:
These are my data in R:
I tried this (r is the number of values in each sample, here set to 3):
X2<-apply(X1[,1:r],1,function(x) x-X1$x.bar)
When I try this, I get:
I am guessing that this code is attempting to calculate the difference between each row of X1 (x) and the entire vector of X1$x.bar instead of 81 for the 1st row, 81.25 for the 2nd row, etc.
Once again, I can easily do this using for loops, but I'm assuming that is not the most efficient way.
Can someone please steer me in the right direction? Any assistance is appreciated.
Here is the whole code for the small sample version with r<-3. WARNING: This computes all possible combinations, so the data frames get very large very quickly.
options(scipen = 999)
dp <- function(x) {
dp1<-nchar(sapply(strsplit(sub('0+$', '', as.character(format(x, scientific = FALSE))), ".",
fixed=TRUE),function(x) x[2]))
ifelse(is.na(dp1),0,dp1)
}
retain1<-function(x,minuni) length(unique(floor(x)))>=minuni
# =======================================================
r<-3
x0<-seq(80,120,.25)
X0<-data.frame(t(combn(x0,r)))
names(X0)<-paste("x",1:r,sep="")
X<-X0[apply(X0,1,retain1,minuni=r),]
rm(X0)
gc()
X$x.bar<-rowMeans(X)
dp1<-dp(X$x.bar)
X1<-X[dp1<=2,]
rm(X)
gc()
X2<-apply(X1[,1:r],1,function(x) x-X1$x.bar)
Because R is vectorized, you only need to subtract x.bar from x1, x2 and x3 collectively:
devs <- X1[ , 1:3] - X1[ , 4]
X1devs <- cbind(X1, devs)
That's it...
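To see why this works, here is a small sketch on a made-up stand-in for X1 (the numbers are invented for illustration, not taken from the question's data). When a vector with one element per row is subtracted from a data frame, it is recycled down each column, so every row has its own mean subtracted.

```r
# Toy stand-in for X1: two samples of r = 3 values each (made-up numbers)
r <- 3
X1 <- data.frame(x1 = c(80, 80.25), x2 = c(81, 81.25), x3 = c(82, 82.25))
X1$x.bar <- rowMeans(X1[, 1:r])

# A vector of length nrow(X1) is recycled down each column,
# so row i has x.bar[i] subtracted from each of its values
devs <- X1[, 1:r] - X1$x.bar
names(devs) <- paste0("dev", 1:r)
X1devs <- cbind(X1, devs)
X1devs
```

Both rows here have deviations of -1, 0 and 1, since each sample is evenly spaced around its own mean.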
I think you just got the margin wrong. In apply you're using 1, which means row-wise, but you want to work column-wise, so use 2:
X2<-apply(X1[,1:r], 2, function(x) x-X1$x.bar)
But from what I quickly searched, the apply family isn't better than loops in performance, only in clarity. Check this post: Is R's apply family more than syntactic sugar?
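The effect of the margin can be checked on a small made-up data frame (six invented samples standing in for X1). This is only a sketch of the shapes involved: with margin 1 each row of length r is recycled against the whole x.bar vector, which is probably where the memory problem on the real data comes from.

```r
r <- 3
X1 <- data.frame(x1 = c(80, 80.25, 90, 95, 100, 110),
                 x2 = c(81, 81.25, 91, 96, 101, 111),
                 x3 = c(82, 82.25, 92, 97, 102, 112))
X1$x.bar <- rowMeans(X1[, 1:r])

# Margin 2: x is one column (length 6), matched element by element with x.bar
by_col <- apply(X1[, 1:r], 2, function(x) x - X1$x.bar)

# Margin 1: x is one row (length 3), recycled against all 6 means
by_row <- apply(X1[, 1:r], 1, function(x) x - X1$x.bar)

dim(by_col)  # one row per sample, r columns
dim(by_row)  # nrow(X1) by nrow(X1), which grows quadratically with the data
```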

Understanding Vectorized Code In R

I'm trying to understand the answer to this question using R and I'm struggling a lot.
The dataset for the R code can be found with this code
library(devtools)
install_github("genomicsclass/GSE5859Subset")
library(GSE5859Subset)
data(GSE5859Subset) ##this loads the three tables you need
Here is the question
Write a function that takes a vector of values e and a binary vector group coding two groups, and returns the p-value from a t-test: t.test( e[group==1], e[group==0])$p.value.
Now define g to code cases (1) and controls (0) like this g <- factor(sampleInfo$group)
Next use the function apply to run a t-test for each row of geneExpression and obtain the p-value. What is smallest p-value among all these t-tests?
The answer provided is
myttest <- function(e,group){
  x <- e[group==1]
  y <- e[group==0]
  return( t.test(x,y)$p.value )
}
g <- factor(sampleInfo$group)
pvals <- apply(geneExpression,1,myttest, group=g)
min( pvals )
Which gives you the answer of 1.406803e-21.
What exactly is the input of the "e" argument of the myttest function when you run this? Is it possible to write this function as a formula like
t.test(DV ~ sampleInfo$group)
The t test compares the gene expression values of the 24 people (which I believe are in the "geneExpression" matrix) by the group they were in, which you can find in sampleInfo's "group" column. I've run t tests so many times in R, but for some reason I can't wrap my mind around what's going on in this code.
Your question seems to be about understanding the function apply().
For the technical description, see ?apply.
My quick explanation: the apply() line of code in your question applies the following function to each of the rows of geneExpression
myttest(e=x, group=g)
where x is a placeholder for each row.
To help make sense of it, a for loop version of that apply() line would look something like:
N <- nrow(geneExpression) # so we don't have to type this twice
pvals <- numeric(N) # empty vector to store results
# what 'apply' does (but it does it very quickly and with less typing from us)
for(i in 1:N) {
  # e is row i of the matrix; group is the full grouping vector every time
  pvals[i] <- myttest(geneExpression[i,], group=g)
}
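The equivalence can be tried without installing the dataset. The sketch below uses simulated values as a stand-in for geneExpression and a made-up grouping g (both are assumptions for illustration, not the real GSE5859Subset data). It also shows one way the formula interface asked about in the question could be used inside apply.

```r
set.seed(1)
# Simulated stand-in: 5 "genes" (rows) measured on 24 "samples" (columns)
geneExpression <- matrix(rnorm(5 * 24), nrow = 5)
g <- factor(rep(c(0, 1), each = 12))   # made-up controls (0) and cases (1)

myttest <- function(e, group) {
  t.test(e[group == 1], e[group == 0])$p.value
}

# apply() passes each row of the matrix as the argument e
pvals_apply <- apply(geneExpression, 1, myttest, group = g)

# The same thing written as an explicit loop
pvals_loop <- numeric(nrow(geneExpression))
for (i in seq_len(nrow(geneExpression))) {
  pvals_loop[i] <- myttest(geneExpression[i, ], group = g)
}

# Formula interface: e is the response, g the two-level grouping factor
pvals_formula <- apply(geneExpression, 1, function(e) t.test(e ~ g)$p.value)
```

All three vectors should agree, because the two-sided p-value does not depend on which group is listed first.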

How to counter the 'non-numeric matrix extent' error in R?

I'm trying to generate a data frame of simulated values from the student's t distribution using the standard stochastic equation. The function I use is as follows:
matgen<-function(means,chi,covariancematrix)
{
cols<-ncol(means);
normals<-mvrnorm(n=500,mu=means,Sigma = covariancematrix);
invgammas<-rigamma(n=500,alpha=chi/2,beta=chi/2);
gen<-as.data.frame(matrix(data=NA,ncol=cols,nrow=500));
i<-1;
while(i<=500)
{
gen[i,]<-t(means)+normals[i,]*sqrt(invgammas[i]);
i<=i+1;
}
return(gen);
}
If it's not clear, I'm trying to create an empty data frame, that takes in values in cols number of columns and 500 rows. The values are numeric, of course, and R tells me that in the 9th row:
gen<-as.data.frame(matrix(data=NA,ncol=cols,nrow=500));
There's an error: 'non-numeric matrix extent'.
I remember using as.data.frame() to convert matrices into data frames in the past, and it worked quite smoothly. Even with numbers. I have been out of touch for a while, though, and can't seem to recollect or find online a solution to this problem. I tried is.numeric(), as.numeric(), 0s instead of NA there, but nothing works.
As Roland pointed out, one problem is that cols doesn't seem to be numeric. Please check whether means is a data frame or matrix, e.g. with str(means). If it is, ncol(means) returns a number and your code should not produce the 'non-numeric matrix extent' error; for a plain vector, ncol() returns NULL.
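A short sketch of that check, using made-up inputs (means_vec and means_mat are illustrative names, not from the question):

```r
means_vec <- c(1, 2, 3)                      # plain vector: no dim attribute
means_mat <- cbind(c(1, 2, 3), c(4, 5, 6))   # 3 x 2 matrix

ncol(means_vec)   # NULL
ncol(means_mat)   # 2
NCOL(means_vec)   # 1, because NCOL treats a vector as a one-column matrix

# Passing NULL as ncol makes matrix() fail
err <- tryCatch(matrix(data = NA, ncol = ncol(means_vec), nrow = 500),
                error = function(e) e)
conditionMessage(err)
```

If means is meant to be a vector of column means, length(means) is probably the value wanted for cols.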
You also have some other issues in your code. I created a simplified example and pointed out the bugs I found as comments in the code:
library(MASS)
library(LearnBayes)
means <- cbind(c(1,2,3),c(4,5,6))
chi <- 10
matgen<-function(means,chi,covariancematrix)
{
cols <- ncol(means) # if means is a dataframe or matrix, this should work
normals <- rnorm(n=20,mean=100,sd=10) # changed example for simplification
# normals<-mvrnorm(n=20,mu=means,Sigma = covariancematrix)
# input to mu of mvrnorm should be a vector, see ?mvrnorm; but this means that ncol(means) is always 1 !?
invgammas<-rigamma(n=20,a=chi/2,b=chi/2) # changed alpha= to a and beta= to b
gen<-as.data.frame(matrix(data=NA,ncol=cols,nrow=20))
i<-1
while(i<=20)
{
gen[i,]<-t(means)+normals[i]*sqrt(invgammas[i]) # changed normals[i,] to normals [i], because it is a vector
i<-i+1 # changed <= to <-
}
return(gen)
}
matgen(means,chi,covariancematrix)
I hope this helps.
P.S. You don't need ";" at the end of every line in R
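The while loop can also be avoided altogether. The following is only a sketch under stated assumptions: it uses the usual construction of a multivariate t draw as mean plus a zero-mean normal scaled by the square root of an inverse-gamma variable, and it swaps in base R's rgamma() and chol() for rigamma() and mvrnorm(). The values of mu, Sigma and chi are made up.

```r
set.seed(42)
n <- 500
mu <- c(1, 2, 3)        # assumed mean vector (made up)
Sigma <- diag(3)        # assumed covariance matrix (made up)
chi <- 10               # degrees of freedom

# Zero-mean normals with covariance Sigma, one draw per row
Z <- matrix(rnorm(n * length(mu)), nrow = n) %*% chol(Sigma)

# Inverse-gamma(chi/2, chi/2) draws, as reciprocals of gamma draws
w <- 1 / rgamma(n, shape = chi / 2, rate = chi / 2)

# sqrt(w) has one element per row, so it scales row i by sqrt(w[i]);
# sweep() then adds mu[j] to column j
gen <- as.data.frame(sweep(Z * sqrt(w), 2, mu, "+"))
dim(gen)
```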

R: create new matrix with outcomes from mathematical operations within another matrix through loops

I am trying to generate a function that conducts various mathematical operations within a matrix and stores the outcomes of these operations in a new matrix with similar dimensions.
Here's an example matrix (a lot of silly computations in it to get sufficient variability in the data)
test<-matrix(1:290,nrow=10,ncol=29) ; colnames(test)<-1979+seq(1,29)
rownames(test)<-c("a","b","c","d","e","f","g","h","i","j")
test[,4]<-rep(8)
test[7,]<-seq(1,29)
test[c(3,5,9),]<-test[c(3,5,9),] * 1/2
test[,c(4,6,8,9,10,15,16,18)]<-test[,c(4,6,8,9,10,15,16,18)]*1/3
I want for instance to be able to calculate the difference between the value in (a,1999) and the average of the 3 values before (a, 1999). This needs to be flexible and for every rowname (firm) and every column (year).
The code I am trying to build looks something like this (I guess):
for(year in 1:29)
for (k in 1:10)
qw<-matrix((test[k, year] + 1/3*(- test[k, year-1] - test[k,year -2] - test[k, year-3])), nrow=10, ncol=29)
When I run it, this code generates a matrix but the value in that matrix is always the one for the last calculation (i.e. 20 in my example) while every matrix value should be stored in qw.
Any suggestions on how I can achieve this (maybe via an apply function)?
Thanks in advance
You are creating a new matrix qw in every iteration, and each new matrix overwrites the previous one. Here's how to do what I think you want, although I didn't know how you want to handle the first 3 years.
qw <- matrix(nrow=10, ncol=29)
colnames(qw)<-1979+seq(1,29)
rownames(qw)<-c("a","b","c","d","e","f","g","h","i","j")
for(year in 4:29){
  for (k in 1:10){
    qw[k, year] <- (test[k, year] + 1/3*(- test[k, year-1] - test[k,year -2] - test[k, year-3]))
  }
}
qw
In R, explicit loops are often slower than vectorized functions. Here is a more idiomatic way of doing this, using the package zoo.
require(zoo)
qw <- matrix(nrow=10, ncol=29)
colnames(qw)<-1979+seq(1,29)
rownames(qw)<-c("a","b","c","d","e","f","g","h","i","j")
qw[,4:29] <- test[,4:29]-t(head(rollmean(t(test), 3),-1))
qw
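If adding a package is not an option, the same result can be sketched in base R by subtracting shifted column blocks. This assumes, like the answers above, that the first 3 years are left as NA; yrs is an illustrative name.

```r
# Example matrix from the question
test <- matrix(1:290, nrow = 10, ncol = 29)
colnames(test) <- 1979 + seq(1, 29)
rownames(test) <- c("a","b","c","d","e","f","g","h","i","j")
test[, 4] <- rep(8)
test[7, ] <- seq(1, 29)
test[c(3, 5, 9), ] <- test[c(3, 5, 9), ] * 1/2

yrs <- 4:ncol(test)   # columns that have 3 preceding years
qw <- matrix(NA_real_, nrow = nrow(test), ncol = ncol(test),
             dimnames = dimnames(test))

# Value in each year minus the average of the 3 years before it,
# computed for all rows and years at once
qw[, yrs] <- test[, yrs] - (test[, yrs - 1] + test[, yrs - 2] + test[, yrs - 3]) / 3
```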

Make nested loops more efficient?

I'm analyzing large sets of data using the following script:
M <- c_alignment
c_check <- function(x){
  if (x == c_1) {
    1
  }else{
    0
  }
}
both_c_check <- function(x){
  if (x[res_1] == c_1 && x[res_2] == c_1) {
    1
  }else{
    0
  }
}
variance_function <- function(x,y){
  sqrt(x*(1-x))*sqrt(y*(1-y))
}
frames_total <- nrow(M)
cols <- ncol(M)
c_vector <- apply(M, 2, max)
freq_vector <- matrix(nrow = sum(c_vector))
co_freq_matrix <- matrix(nrow = sum(c_vector), ncol = sum(c_vector))
insertion <- 0
res_1_insertion <- 0
for (res_1 in 1:cols){
  for (c_1 in 1:c_vector[res_1]){
    res_1_insertion <- res_1_insertion + 1
    insertion <- insertion + 1
    res_1_subset <- sapply(M[,res_1], c_check)
    freq_vector[insertion] <- sum(res_1_subset)/frames_total
    res_2_insertion <- 0
    for (res_2 in 1:cols){
      if (is.na(co_freq_matrix[res_1_insertion, res_2_insertion + 1])){
        for (c_2 in 1:max(c_vector[res_2])){
          res_2_insertion <- res_2_insertion + 1
          both_res_subset <- apply(M, 1, both_c_check)
          co_freq_matrix[res_1_insertion, res_2_insertion] <- sum(both_res_subset)/frames_total
          co_freq_matrix[res_2_insertion, res_1_insertion] <- sum(both_res_subset)/frames_total
        }
      }
    }
  }
}
covariance_matrix <- (co_freq_matrix - crossprod(t(freq_vector)))
variance_matrix <- matrix(outer(freq_vector, freq_vector, variance_function), ncol = length(freq_vector))
correlation_coefficient_matrix <- covariance_matrix/variance_matrix
A model input would be something like this:
1 2 1 4 3
1 3 4 2 1
2 3 3 3 1
1 1 2 1 2
2 3 4 4 2
What I'm calculating is the binomial covariance for each state found in M[,i] with each state found in M[,j]. Each row is the state found for that trial, and I want to see how the state of the columns co-vary.
Clarification: I'm finding the covariance of two multinomial distributions, but I'm doing it through binomial comparisons.
The input is a 4200 x 510 matrix, and the c value for each column is about 15 on average. I know for loops are terribly slow in R, but I'm not sure how I can use the apply function here. If anyone has a suggestion as to properly using apply here, I'd really appreciate it. Right now the script takes several hours. Thanks!
I thought of writing a comment, but I have too much to say.
First of all, if you think apply goes faster, look at "Is R's apply family more than syntactic sugar?". It might be, but it's far from guaranteed.
Next, please don't grow matrices as you move through your code; that slows it down incredibly. Preallocate the matrix and fill it up, which can speed up your code more than tenfold. You're growing different vectors and matrices throughout your code, and that's insane (forgive me the strong language).
Then, look at the help page of ?subset and the warning given there:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
[, and in particular the non-standard evaluation of argument subset
can have unanticipated consequences.
Always. Use. Indices.
Further, you recalculate the same values over and over again. fre_res_2, for example, is calculated for every res_2 and state_2 as many times as you have combinations of res_1 and state_1. That's just a waste of resources. Move out of your loops whatever you don't need to recalculate, and save it in matrices you can just access again.
Heck, now that I'm at it: please use vectorized functions. Think again and see what you can drag out of the loops. This is what I see as the core of your calculation:
cov <- (freq_both - (freq_res_1)*(freq_res_2)) /
(sqrt(freq_res_1*(1-freq_res_1))*sqrt(freq_res_2*(1-freq_res_2)))
As I see it, you can construct a matrix freq_both, freq_res_1 and freq_res_2 and use them as input for that one line. And that will be the whole covariance matrix (don't call it cov, cov is a function). Exit loops. Enter fast code.
Given the fact I have no clue what's in c_alignment, I'm not going to rewrite your code for you, but you definitely should get rid of the C way of thinking and start thinking R.
Let this be a start: The R Inferno
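As one possible reading of that advice, the frequencies and co-occurrence frequencies can be computed from a 0/1 indicator matrix with a single matrix product. This is a sketch only: it uses the small model input from the question, assumes states are coded 1 to the column maximum, and introduces the names ind, freq, co_freq and sds for illustration.

```r
# Model input from the question (rows are trials, columns are positions)
M <- matrix(c(1, 2, 1, 4, 3,
              1, 3, 4, 2, 1,
              2, 3, 3, 3, 1,
              1, 1, 2, 1, 2,
              2, 3, 4, 4, 2), nrow = 5, byrow = TRUE)
n <- nrow(M)
c_vector <- apply(M, 2, max)   # number of states per column

# One 0/1 indicator column per (column, state) pair
ind <- do.call(cbind, lapply(seq_len(ncol(M)), function(j) {
  outer(M[, j], seq_len(c_vector[j]), "==") * 1
}))

freq <- colMeans(ind)            # frequency of each state
co_freq <- crossprod(ind) / n    # joint frequency of every pair of states

covariance <- co_freq - tcrossprod(freq)
sds <- sqrt(freq * (1 - freq))
correlation <- covariance / tcrossprod(sds)
```

For the real 4200 x 510 input, ind would have roughly 510 * 15 columns, so crossprod() produces a large but single dense matrix instead of millions of apply() calls. States that never occur (or always occur) give a zero standard deviation and therefore NaN entries.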
It's not really the 4 way nested loops but the way your code is growing memory on each iteration. That's happening 4 times where I've placed # ** on the cbind and rbind lines. Standard advice in R (and Matlab and Python) in situations like this is to allocate in advance and then fill it in. That's what the apply functions do. They allocate a list as long as the known number of results, assign each result to each slot, and then merge all the results together at the end. In your case you could just allocate the correct size matrix in advance and assign into it at those 4 points (roughly speaking). That should be as fast as the apply family, and you might find it easier to code.
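A minimal sketch of the difference between growing and preallocating, with made-up contents (the row values c(i, i^2) are placeholders, not the asker's data):

```r
n <- 1000

# Growing: rbind() copies the whole matrix on every iteration
grown <- NULL
for (i in seq_len(n)) {
  grown <- rbind(grown, c(i, i^2))
}

# Preallocating: the matrix is created once and filled in place
filled <- matrix(NA_real_, nrow = n, ncol = 2)
for (i in seq_len(n)) {
  filled[i, ] <- c(i, i^2)
}
```

Both loops build the same matrix. Wrapping each loop in system.time() with a larger n shows how the cost of the growing version increases much faster than that of the preallocated one.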
