Faster for loop in R

I want to extract only the rows matched by a particular index from the matrix.
Is there a way to speed up the for loop?
for (x in 1:dim(gene)[1]) {
  for (y in 1:dim(geno)[1]) {
    if (grepl(gene[x, 2], geno[y], fixed = TRUE)) {
      geno_gene <- rbind(geno_gene, geno[y, 2:dim(geno)[2]])
      next
    }
  }
}

Share a bit of the gene and geno datasets. What is the actual problem? Is it slow? Compared to what?
At the very least, your loop is subject to R's copy-on-modify behaviour. Using the address function from pryr we can see that after we call rbind, your object m is a copy of its former self:
library(pryr)
m <- matrix(1:6, nrow = 3, ncol = 2)
address(m)
new_row <- c(8, 9)
m <- rbind(m, new_row)
address(m)
Matrix m changes address after rbind:
> m <- matrix(1:6, nrow = 3, ncol = 2)
> address(m)
[1] "0x18a0d4737f0"
> new_row <- c(8, 9)
> m <- rbind(m, new_row)
> address(m)
[1] "0x18a0fcb7450"
This means that R has created a copy of m, bound the new row to it, assigned the name m to the copy, and left it to the garbage collector to discard the original version of m. It does this on every iteration of your loop, which is a well-known reason why R loops can be slow. Read Hadley Wickham here for more information.
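A common fix is to avoid growing the object inside the loop altogether: collect the matched rows in a pre-sized list and bind them once at the end. A minimal sketch, assuming gene and geno are as in the question (note it keeps each matching geno row once, even if several genes match it):
# Sketch: accumulate matches in a pre-sized list, bind once at the end.
pieces <- vector("list", nrow(geno))
for (y in seq_len(nrow(geno))) {
  # Assumption: the patterns live in gene[, 2], as in the question's loop.
  if (any(vapply(gene[, 2], grepl, logical(1), x = geno[y], fixed = TRUE))) {
    pieces[[y]] <- geno[y, 2:ncol(geno)]
  }
}
geno_gene <- do.call(rbind, pieces)  # NULL slots are dropped by rbind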
Another way out is to assess whether the function your loop implements can be replaced by a vectorized function in R (and there are many vectorized functions).
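For example, grepl() is vectorized over the strings it searches, so the inner loop can collapse into a single call per gene. A rough sketch, assuming the searched text sits in the first column of geno (which is what geno[y] picks out in the question's loop):
# Sketch: one vectorized grepl() per pattern, one rbind() at the end.
hits <- lapply(gene[, 2], function(pat) {
  keep <- grepl(pat, geno[, 1], fixed = TRUE)   # tests every row at once
  geno[keep, 2:ncol(geno), drop = FALSE]
})
geno_gene <- do.call(rbind, hits)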

Related

Two-way frequency table followed by matrix multiplication - high running time

I'm new to R, and I am trying to calculate the product of a fixed matrix and a two-way frequency table for every pair of columns in a data frame or matrix, divided by the sequence length (i.e. the number of rows, which is 15). The problem is that the running time increases dramatically when performing it on 1K sequences (1K columns): more than 10 minutes, and for 10K it could be more than an hour. The goal is to use it with as many sequences as possible.
mat1 <- matrix(sample(LETTERS), ncol = 100, nrow = 15)
mat2 <- matrix(sample(abs(rnorm(26, 0, 3))), ncol = 26, nrow = 26)
rownames(mat2) <- LETTERS
colnames(mat2) <- LETTERS
diag(mat2) <- 0
test_vec <- c()
for (i in seq(ncol(mat1) - 1)) {
  for (j in seq(i + 1, ncol(mat1))) {
    s2 <- table(mat1[, i], mat1[, j])  # create 2-way frequency table
    mat2_1 <- mat2
    mat2_1 <- mat2_1[rownames(mat2_1) %in% rownames(s2),
                     colnames(mat2_1) %in% colnames(s2)]
    calc <- (1 / nrow(mat1)) * sum(mat2_1 * s2)
    test_vec <- append(test_vec, calc)
  }
}
Thanks for the help.
Here is an approach that converts mat1 to a data.table, converts all the columns to factors with the full set of levels, and uses table(..., exclude = NULL). Since every factor carries all 26 levels, each table is 26 x 26 and no subsetting of mat2 is needed.
library(data.table)
m <- as.data.table(mat1)[, lapply(.SD, factor, levels = LETTERS)]
g <- combn(colnames(m), 2, simplify = FALSE)
result <- sapply(g, function(x) sum(table(m[[x[1]]], m[[x[2]]], exclude = NULL) * mat2) / nrow(m))
Check equality:
sum(result - test_vec > 1e-10)
[1] 0
Here there are 4950 combinations (100*99/2), but the number of combinations will increase quickly as ncol(mat1) increases (as you point out). You might find in that case that a parallelized version works well.
library(doParallel)
library(data.table)
registerDoParallel()
m <- as.data.table(mat1)[, lapply(.SD, factor, levels = LETTERS)]
g <- combn(colnames(m), 2, simplify = FALSE)
result <- foreach(i = 1:length(g), .combine = c) %dopar%
  sum(table(m[[g[[i]][1]]], m[[g[[i]][2]]], exclude = NULL) * mat2)
result <- result / nrow(m)
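Optionally, verify the parallel result against the question's test_vec (assuming it is still in the workspace) and release the workers that registerDoParallel() started:
# Optional sanity check and cleanup.
stopifnot(isTRUE(all.equal(result, test_vec)))
stopImplicitCluster()  # frees the implicit workers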

R - change matrix values based on another matrix indices

I have two matrices:
m1 <- matrix(runif(750), nrow = 50, byrow=T)
m2 <- matrix(rep(TRUE,750), nrow = 50, byrow=T)
For each m1 row, I need to find the indices of the two lowest values. Then, I need to use the remaining indices (i.e. not the two lowest values) to assign FALSE in m2.
It is fairly easy to do for one row:
ind <- order(m1[1,], decreasing=FALSE)[1:2]
m2[1,][-ind] <- FALSE
Therefore, I can use a loop to do the same for all rows:
for (i in 1:dim(m1)[1]) {
  ind <- order(m1[i, ], decreasing = FALSE)[1:2]
  m2[i, ][-ind] <- FALSE
}
However, in my data set this loop runs slower than I would like, since my matrices are quite large (500000 x 150000).
Is there any faster, R way to achieve the same result without the use of loops?
You can try the code below
m2 <- t(apply(m1, 1, function(x) x %in% head(sort(x), 2)))
You can try apply since you have a matrix:
val <- rep(TRUE, ncol(m1))
m3 <- t(apply(m1, 1, function(x) {val[-order(x)[1:2]] <- FALSE; val}))
You can do:
m2 <- t(apply(m1, 1, function(x) rank(x)<3))
Using pmap (note that min_rank() comes from dplyr):
library(dplyr)
library(purrr)
pmap_dfr(as.data.frame(m1), ~ min_rank(c(...)) < 3)
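For matrices as large as the ones in the question, even a row-wise apply() pays per-row overhead. A fully vectorized sketch using the matrixStats package (an assumption: the package is installed) ranks all rows in one call:
library(matrixStats)
# rowRanks() ranks every row at once; ties.method = "first" guarantees
# exactly two TRUEs per row even when values tie.
m2 <- rowRanks(m1, ties.method = "first") <= 2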

Avoiding a loop when populating data frames in R

I have an empty data frame T_modelled with 2784 columns and 150 rows.
T_modelled <- data.frame(matrix(ncol = 2784, nrow = 150))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))
where
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
I filled T_modelled by column with a nested for loop, based on a formula:
for (i in 1:ncol(T_modelled)) {
  col_tmp <- colnames(T_modelled)[i]
  for (j in 1:nrow(T_modelled)) {
    z_tmp <- z[j] - 0.1
    T_tmp <- MANSRT + As*e^(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
    T_modelled[j, col_tmp] <- T_tmp
  }
}
where
MANSRT <- -2.051185
As <- 11.59375
omega <- (2*pi)/(347.875*24*60*60)
c <- 790
k <- 0.00219
pb <- 2600
K <- (k*1000)/(c*pb)
e <- exp(1)
I do get the desired results, but I keep thinking there must be a more efficient way of filling that data frame. The loop is quite slow and looks cumbersome to me. I guess there is an opportunity to take advantage of R's vectorized way of calculating; I just cannot see how to incorporate the formula in an easier way to fill T_modelled.
Anyone got any ideas how to get the same result in a faster, more "R-like" manner?
I believe this does it.
Run this first instruction right after creating T_modelled; it will be needed to test that the results are equal.
Tm <- T_modelled
Now run your code then run the code below.
z_tmp <- z - 0.1
for (i in 1:ncol(Tm)) {
  T_tmp <- MANSRT + As*exp(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
  Tm[, i] <- T_tmp
}
all.equal(T_modelled, Tm)
#[1] TRUE
You don't need the inner loop; that's the only difference.
(I also used exp directly but that is of secondary importance.)
Much like the solution to your previous question, which you accepted, consider simply using sapply, iterating through the vector t_sec_ERT, which is the same length as your desired data frame's number of columns. But first adjust every element of z by 0.1. Plus, there's no need to create the empty data frame beforehand.
z_adj <- z - 0.1
T_modelled2 <- data.frame(sapply(t_sec_ERT, function(ert)
  MANSRT + As*e^(-z_adj*(omega/(2*K))^0.5)*sin(omega*ert - ((omega/(2*K))^0.5)*z_adj)))
colnames(T_modelled2) <- paste0("t=", t_sec_ERT)
rownames(T_modelled2) <- paste0("z=", z)
all.equal(T_modelled, T_modelled2)
# [1] TRUE
Rui is of course correct; I just want to suggest a way of reasoning when writing a loop like this.
You have two numeric vectors. Functions for numerics in R are usually vectorized. By which I mean you can do stuff like this
x <- c(1, 6, 3)
sum(x)
not needing something like this
x_ <- 0
for (i in x) {
  x_ <- i + x_
}
x_
That is, no need for looping in R. Of course looping takes place nonetheless; it just happens in the underlying C, Fortran, etc. code, where it can be done more efficiently. This is usually what we mean when we call a function vectorized: looping takes place "under the hood", as it were. The output of Vectorize() thus isn't strictly vectorized by this definition.
When you have two numeric vectors you want to loop over, you first have to see whether the constituent functions are vectorized, usually by reading the docs.
If they are, you continue by constructing that central vectorized compound function and start testing it with one vector and one scalar. In your case it would be something like this (testing with just the first element of t_sec_ERT).
z_tmp <- z - 0.1
i <- 1
T_tmp <- MANSRT + As *
  exp(-z_tmp*(omega/(2*K))^0.5) *
  sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
Looks OK. Then you start looping over the elements of t_sec_ERT.
T_tmp <- matrix(nrow = length(z), ncol = length(t_sec_ERT))
for (i in 1:length(t_sec_ERT)) {
  T_tmp[, i] <- MANSRT + As *
    exp(-z_tmp*(omega/(2*K))^0.5) *
    sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
}
Or you can do it with sapply() which is often neater.
f <- function(x) {
  MANSRT + As *
    exp(-z_tmp*(omega/(2*K))^0.5) *
    sin(omega*x - ((omega/(2*K))^0.5)*z_tmp)
}
T_tmp <- sapply(t_sec_ERT, f)
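A quick sanity check, assuming the loop-filled T_modelled from the question is still around:
# The sapply() result carries no dimnames, so compare values only.
all.equal(unname(as.matrix(T_modelled)), T_tmp)  # should be TRUE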
I would prefer to put the data in a long format, with all combinations of z and t_sec_ERT as two columns, in order to take advantage of vectorization. Although I usually prefer tidyr for switching between long and wide formats, I've tried to keep this as a base solution:
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
v <- expand.grid(t_sec_ERT, z)
names(v) <- c("t_sec_ERT", "z")
v$z_tmp <- v$z - 0.1
v$T_tmp <- MANSRT + As*e^(-v$z_tmp*(omega/(2*K))^0.5)*sin(omega*v$t_sec_ERT - ((omega/(2*K))^0.5)*v$z_tmp)
T_modelled <- data.frame(matrix(v$T_tmp, nrow = length(z), ncol = length(t_sec_ERT), byrow = TRUE))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))
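Since the formula is vectorized in both z and t, the whole matrix can also be produced by a single outer() call. A minimal sketch, assuming the constants defined in the question:
a <- sqrt(omega/(2*K))  # damping/phase factor, factored out for readability
z_tmp <- z - 0.1
Tm2 <- outer(z_tmp, t_sec_ERT,
             function(zt, t) MANSRT + As*exp(-a*zt)*sin(omega*t - a*zt))
dimnames(Tm2) <- list(paste0("z=", z), paste0("t=", t_sec_ERT))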

Efficient algorithm to turn matrix subdiagonals into columns

I have a non-square matrix and need to do some calculations on its subdiagonals. I figured out that the best way is to turn the subdiagonals into columns/rows and use functions like cumprod. Right now I use a for loop and the exdiag function defined below:
exdiag <- function(mat, off=0) {mat[row(mat) == col(mat)+off]}
However, it does not seem very efficient. Do you know any other algorithm to achieve that kind of result?
A little example to show what I am doing:
exdiag <- function(mat, off=0) {mat[row(mat) == col(mat)+off]}
mat <- matrix(1:72, nrow = 12, ncol = 6)
newmat <- matrix(nrow = 11, ncol = 6)
for (i in 1:11) {
  newmat[i, ] <- c(cumprod(exdiag(mat, i)), rep(0, max(6 - 12 + i, 0)))
}
Best regards,
Artur
The fastest, but by far the most cryptic, solution to get all possible diagonals from a non-square matrix would be to treat your matrix as a vector and simply construct an id vector for selection. In the end you can transform it back to a matrix if you want.
The following function does that:
exdiag <- function(mat){
  NR <- nrow(mat)
  NC <- ncol(mat)
  smalldim <- min(NC, NR)

  if (NC > NR) {
    id <- seq_len(NR) +
      seq.int(0, NR - 1) * NR +
      rep(seq.int(1, NC - 1), each = NR) * NR
  } else if (NC < NR) {
    id <- seq_len(NC) +
      seq.int(0, NC - 1) * NR +
      rep(seq.int(1, NR - 1), each = NC)
  } else {
    return(diag(mat))
  }

  out <- matrix(mat[id], nrow = smalldim)
  id <- (ncol(out) + 1 - row(out)) - col(out) < 0
  out[id] <- NA
  return(out)
}
Keep in mind you have to take into account how your matrix is formed.
In both cases I follow the same logic: first construct a sequence indicating positions along the smallest dimension, then add 0, 1, 2, ... times the row length to it. This creates the first diagonal. After that, you simply add a sequence that shifts the entire previous sequence by 1 (either down or to the right) until you reach the end of the matrix; to shift right, the sequence has to be multiplied by the number of rows.
In the end you can use these indices to select the correct positions from mat and return all that as a matrix. Due to the vectorized nature of this code, you have to check that the last subdiagonals are correct. These contain fewer elements than the first, so you have to replace the values that are not part of a subdiagonal with NA. Here too you can simply use an indexing trick.
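The id trick relies on R's column-major storage: element (r, c) of a matrix with NR rows sits at position r + (c - 1)*NR of the underlying vector. A tiny illustration:
m <- matrix(seq_len(12), nrow = 4)
m[2 + (3 - 1)*4] == m[2, 3]  # TRUE: linear index and (row, col) index agree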
You can use it as follows:
> diag1 <- exdiag(amatrix)
> diag2 <- exdiag(t(amatrix))
> identical(diag1, diag2)
[1] TRUE
In order to arrive at your result:
amatrix <- matrix(1:72, ncol = 6)
diag1 <- exdiag(amatrix)
res <- apply(diag1, 2, cumprod)
res[is.na(res)] <- 0
t(res)
You can modify the diag() function.
exdiag <- function(mat, off=0) {mat[row(mat) == col(mat)+off]}
exdiag2 <- function(matrix, off) {diag(matrix[-1:-off, ])}
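A quick check, using the question's example matrix, that the shortcut extracts the same subdiagonal:
mat <- matrix(1:72, nrow = 12, ncol = 6)
identical(exdiag(mat, 4), exdiag2(mat, 4))  # TRUE for this example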
Speed Test:
mat <- diag(10, 10000, 10000)
off <- 4
> system.time(exdiag(mat, 4))
   user  system elapsed
  7.083   2.973  10.054
> system.time(exdiag2(mat, 4))
   user  system elapsed
  5.370   0.155   5.524
> system.time(diag(mat))
   user  system elapsed
  0.002   0.000   0.002
It looks like the subsetting from the matrix takes a lot of time, but it still performs better than your implementation. Maybe there are other subsetting approaches that outperform my solution. :)

How to store data from for loop inside of for loop? (rolling correlation in r)

require(quantmod)
require(TTR)
iris2 <- iris[1:4]
b <- NULL
for (i in 1:ncol(iris2)) {
  for (j in 1:ncol(iris2)) {
    a <- runCor(iris2[, i], iris2[, j], n = 21)
    b <- cbind(b, a)
  }
}
I want to calculate a rolling correlation of different columns within a data frame and store the data separately by column. Although the code above stores the data in variable b, it is not very useful, as it just dumps all the results together. What I would like is to be able to create a different data frame for each i.
In this case, as I have 4 columns, what I would ultimately want are 4 data frames, each containing 4 columns showing rolling correlations, i.e. df1 = corr of col 1 vs cols 1, 2, 3, 4; df2 = corr of col 2 vs cols 1, 2, 3, 4; etc.
I thought of using lapply or rollapply, but ran into the same problem.
d <- NULL
for (i in 1:ncol(iris2)) {
  for (j in 1:ncol(iris2)) {
    c <- rollapply(iris2, 21, function(x) cor(x[, i], x[, j]), by.column = FALSE)
    d <- cbind(d, c)
  }
}
Would really appreciate any inputs.
If you want to keep the expanded loop, how about a list of dataframes?
e <- vector("list", length = ncol(iris2))  # pre-size the list, one slot per column
for (i in 1:ncol(iris2)) {
  d <- matrix(0, nrow = nrow(iris2), ncol = ncol(iris2))
  for (j in 1:ncol(iris2)) {
    d[, j] <- runCor(iris2[, i], iris2[, j], n = 21)
  }
  e[[i]] <- d
}
It's also a good idea to allocate the amount of space you want with placeholders and put items into that space rather than use rbind or cbind.
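Optionally, naming the list elements after the source columns makes the results self-describing (a small sketch, assuming the loop above has run):
names(e) <- colnames(iris2)
head(e[["Sepal.Length"]])  # rolling correlations of column 1 vs. all four columns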
Although it is not good practice to create data frames on the fly in R (you should prefer putting them in a list, as in the other answer), the way to do so is to use the assign and get functions.
for (i in 1:ncol(iris2)) {
  d <- NULL
  for (j in 1:ncol(iris2)) {
    d <- cbind(d, runCor(iris2[, i], iris2[, j], n = 21))
  }
  # Assign the accumulated matrix to the name df1, df2, ...
  assign(paste0("df", i), as.data.frame(d))
}
# to have access to the data frame:
get("df1")
# or inside a loop
get(paste0("df", i))
Since you stated your computation was slow, I wanted to provide you with a parallel solution. If you have a modern computer, it probably has 2 cores, if not 4 (or more!). You can easily check this via:
require(parallel) # for parallelization
detectCores()
Now the code:
require(quantmod)
require(TTR)
iris2 <- iris[,1:4]
Parallelization requires that the functions and variables be placed into a special environment that is created and destroyed with each process. That means a wrapper function must be created to define the variables and functions.
wrapper <- function(data, n) {
  # variables placed into the environment
  force(data)
  force(n)
  # functions placed into the environment
  # same inner loop as written in the earlier answer
  runcor <- function(data, n, i) {
    d <- matrix(0, nrow = nrow(data), ncol = ncol(data))
    for (j in 1:ncol(data)) {
      d[, j] <- TTR::runCor(data[, i], data[, j], n = n)
    }
    return(d)
  }
  # call function to loop over iterator i
  worker <- function(i) {
    runcor(data, n, i)
  }
  return(worker)
}
Now create a cluster on your local computer. This allows the multiple cores to run separately.
parallelcluster <- makeCluster(parallel::detectCores())
models <- parallel::parLapply(parallelcluster, 1:ncol(iris2),
                              wrapper(data = iris2, n = 21))
Stop and close the cluster when finished.
stopCluster(parallelcluster)
