How can I speed up this sapply for cross checking samples? - r

I'm trying to speed up a QC function for checking similarity between samples. I wanted to know if there is a faster way to compare the way I am doing below? I know there have been answers to this kind of question that are pretty definitive (on SO or otherwise) but I can't find them. I know I should investigate plyr but I'm still getting a hold of sapply.
The following sample data is representative of what I work with, but randomized. I don't think the randomization affects the question.
## sample data
nSamples <- 1000
nSamplesQC <- 100
nAssays <- 96
microarrayScores <- matrix(sample(c("G:G", "T:G", "T:T", NA),nSamples * nAssays,replace = TRUE), nrow = nSamples, ncol = nAssays)
microarrayScoresQC <- matrix(sample(c("G:G", "T:G", "T:T", NA),nSamples * nAssays,replace = TRUE), nrow = nSamples, ncol = nAssays)
mycombs <- data.frame(Experiment = rep(1:nSamples,nSamplesQC),QC = sort(rep(1:nSamplesQC,nSamples)))
## testing function
system.time(
sapply(seq(length(mycombs[,1])), function(x) {compare <- microarrayScores[mycombs[x,1],]==microarrayScoresQC[mycombs[x,2],];
sum(compare[!is.na(compare)])/sum(!is.na(compare))})
)

Here is a vectorized version of your code, about 20 times faster on my machine:
rowMeans(microarrayScores[mycombs[,1], ] ==
microarrayScoresQC[mycombs[,2], ], na.rm = TRUE)
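To check that the vectorized form returns the same numbers as the original sapply call, both can be run side by side on a smaller data set. This is only a sketch: the reduced sizes, the seed and the use of expand.grid to build mycombs are assumptions made so that the check runs quickly.

```r
set.seed(42)
nSamples   <- 50   # reduced from 1000 so the slow version finishes quickly
nSamplesQC <- 10
nAssays    <- 96
calls <- c("G:G", "T:G", "T:T", NA)

microarrayScores   <- matrix(sample(calls, nSamples * nAssays, replace = TRUE),
                             nrow = nSamples)
microarrayScoresQC <- matrix(sample(calls, nSamplesQC * nAssays, replace = TRUE),
                             nrow = nSamplesQC)

# every Experiment row paired with every QC row
mycombs <- expand.grid(Experiment = seq_len(nSamples), QC = seq_len(nSamplesQC))

# original approach: one comparison per pair
slow <- sapply(seq_len(nrow(mycombs)), function(x) {
  compare <- microarrayScores[mycombs[x, 1], ] == microarrayScoresQC[mycombs[x, 2], ]
  sum(compare[!is.na(compare)]) / sum(!is.na(compare))
})

# vectorized approach: index both matrices once, compare whole matrices,
# and let rowMeans() drop the NA comparisons
fast <- rowMeans(microarrayScores[mycombs[, 1], ] ==
                   microarrayScoresQC[mycombs[, 2], ], na.rm = TRUE)

all.equal(slow, fast)
```

Note that the vectorized form builds two matrices with nrow(mycombs) rows, so it trades memory for speed.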

Something like this:
foo <- function(x){
compare <- microarrayScores[x[1],]==microarrayScoresQC[x[2],]
sum(compare[!is.na(compare)])/sum(!is.na(compare))
}
system.time(apply(mycombs,1,foo))
appears to be modestly faster. (Maybe 2-3x)

Related

Avoiding a loop when populating data frames in R

I have an empty data frame T_modelled with 2784 columns and 150 rows.
T_modelled <- data.frame(matrix(ncol = 2784, nrow = 150))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))
where
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
I filled T_modelled by column with a nested for loop, based on a formula:
for (i in 1:ncol(T_modelled)) {
col_tmp <- colnames(T_modelled)[i]
for (j in 1:nrow(T_modelled)) {
z_tmp <- z[j]-0.1
T_tmp <- MANSRT+As*e^(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i]-((omega/(2*K))^0.5)*z_tmp)
T_modelled[j ,col_tmp] <- T_tmp
}
}
where
MANSRT <- -2.051185
As <- 11.59375
omega <- (2*pi)/(347.875*24*60*60)
c <- 790
k <- 0.00219
pb <- 2600
K <- (k*1000)/(c*pb)
e <- exp(1)
I do get the desired results but I keep thinking there must be a more efficient way of filling that data frame. The loop is quite slow and looks cumbersome to me. I guess there is an opportunity to take advantage of R's vectorized way of calculating. I just cannot see myself how to incorporate the formula in an easier way to fill T_modelled.
Anyone got any ideas how to get the same result in a faster, more "R-like" manner?
I believe this does it.
Run this first instruction right after creating T_modelled, it will be needed to test that the results are equal.
Tm <- T_modelled
Now run your code then run the code below.
z_tmp <- z - 0.1
for (i in 1:ncol(Tm)) {
T_tmp <- MANSRT + As*exp(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i]-((omega/(2*K))^0.5)*z_tmp)
Tm[ , i] <- T_tmp
}
all.equal(T_modelled, Tm)
#[1] TRUE
You don't need the inner loop, that's the only difference.
(I also used exp directly but that is of secondary importance.)
Much like your previous question's solution which you accepted, consider simply using sapply, iterating through the vector t_sec_ERT, which has the same length as the number of columns in your desired data frame. But first adjust every element of z by 0.1. Plus, there's no need to create an empty data frame beforehand.
z_adj <- z - 0.1
T_modelled2 <- data.frame(sapply(t_sec_ERT, function(ert)
MANSRT+As*e^(-z_adj*(omega/(2*K))^0.5)*sin(omega*ert-((omega/(2*K))^0.5)*z_adj)))
colnames(T_modelled2) <- paste0("t=", t_sec_ERT)
rownames(T_modelled2) <- paste0("z=", z)
all.equal(T_modelled, T_modelled2)
# [1] TRUE
Rui is of course correct, I just want to suggest a way of reasoning when writing a loop like this.
You have two numeric vectors. Functions for numerics in R are usually vectorized. By which I mean you can do stuff like this
x <- c(1, 6, 3)
sum(x)
not needing something like this
x_ <- 0
for (i in x) {
x_ <- i + x_
}
x_
That is, no need for looping in R. Of course looping takes place nonetheless; it just happens in the underlying C, Fortran etc. code, where it can be done more efficiently. This is usually what we mean when we call a function vectorized: looping takes place "under the hood" as it were. The output of Vectorize() thus isn't strictly vectorized by this definition.
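The distinction can be illustrated with a small sketch. The function names below are made up for the example; Vectorize() wraps a scalar function in an R-level mapply() call, whereas pmax() does its looping in compiled code.

```r
# works only for a single number, because if() needs a length-1 condition
clamp_scalar <- function(x) if (x < 0) 0 else x

# accepts a vector, but still calls clamp_scalar() once per element from R
clamp_wrapped <- Vectorize(clamp_scalar)

# accepts a vector and loops in compiled code
clamp_native <- function(x) pmax(x, 0)

x <- c(-2, 0.5, 3)
clamp_wrapped(x)
clamp_native(x)
```

Both calls give the same values; the difference only shows up in run time on long vectors.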
When you have two numeric vectors you want to loop over you have to first see if the constituent functions are vectorized, usually by reading the docs.
If they are, you continue by constructing that central vectorized compound function and start testing it with one vector and one scalar. In your case it would be something like this (testing with just the first element of t_sec_ERT).
z_tmp <- z - 0.1
i <- 1
T_tmp <- MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
Looks OK. Then you start looping over the elements of t_sec_ERT.
T_tmp <- matrix(nrow=length(z), ncol=length(t_sec_ERT))
for (i in 1:length(t_sec_ERT)) {
T_tmp[, i] <- MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
}
Or you can do it with sapply() which is often neater.
f <- function(x) {
MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*x - ((omega/(2*K))^0.5)*z_tmp)
}
T_tmp <- sapply(t_sec_ERT, f)
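Because every function in the formula is vectorized, the remaining loop over t_sec_ERT can also be dropped by evaluating the formula on all (depth, time) pairs at once with outer(). This is a sketch using the constants from the question; the name d for the shared factor (omega/(2*K))^0.5 is introduced here only for readability.

```r
MANSRT <- -2.051185
As     <- 11.59375
omega  <- (2 * pi) / (347.875 * 24 * 60 * 60)
K      <- (0.00219 * 1000) / (790 * 2600)

t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z_tmp     <- seq(from = 0.1, to = 15, by = 0.1) - 0.1
d         <- sqrt(omega / (2 * K))   # factor that appears twice in the formula

# outer() expands both vectors to all combinations and calls the function once
T_outer <- outer(z_tmp, t_sec_ERT,
                 function(zz, tt) MANSRT + As * exp(-zz * d) * sin(omega * tt - d * zz))

# reference: one column per time step, as in the sapply() version
T_sapply <- sapply(t_sec_ERT, function(tt)
  MANSRT + As * exp(-z_tmp * d) * sin(omega * tt - d * z_tmp))

all.equal(T_outer, T_sapply)
dim(T_outer)   # rows are depths, columns are time steps
```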
I would prefer to put the data in a long format, with all combinations of z and t_sec_ERT as two columns, in order to take advantage of vectorization. Although I usually prefer tidyr for switching between long and wide formats, I've tried to keep this as a base solution:
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
v <- expand.grid(t_sec_ERT, z)
names(v) <- c("t_sec_ERT", "z")
v$z_tmp <- v$z-0.1
v$T_tmp <- MANSRT+As*e^(-v$z_tmp*(omega/(2*K))^0.5)*sin(omega*v$t_sec_ERT-((omega/(2*K))^0.5)*v$z_tmp)
T_modelled <- data.frame(matrix(v$T_tmp, nrow = length(z), ncol = length(t_sec_ERT), byrow = TRUE))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))

Optimise code for a simple monte carlo like simulation

I run the following code, which works but just takes ages, and I'm sure there is a way to get the same results much faster.
runs <- 1000
prediction <- runif(77,0,1)
n <- length(prediction)
df.all <- data.frame(Preds = rep(prediction, runs),
simno=rep(1:runs,each=n))
for (x in 1:runs) {
for (i in 1:length(df.all$Preds)){
df.all$rand[i] <- sample(1:100,1)
df.all$Win[i] <- ifelse(df.all$rand[i]<df.all$Preds[i]*100,1,0)
}
}
library(dplyr)  # provides %>%, group_by() and summarise()
df.all %>% group_by(simno) %>% summarise(Wins=sum(Win)) -> output
This can easily be vectorised by:
Performing a single sample operation (note the additional replace = TRUE argument).
Performing a single comparison with <.
You can remove the inner for loop to get
for (x in 1:runs) {
df.all$rand = sample(1:100, size = nrow(df.all), replace=TRUE)
df.all$Win = df.all$rand < df.all$Preds*100
}
You can then take it one step further and remove that loop
df.all$rand = sample(1:100, size = nrow(df.all), replace=TRUE)
df.all$Win = df.all$rand < df.all$Preds*100
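Put together, the whole simulation including the per-run summary might look like the sketch below. The seed is an assumption added for reproducibility, and base R's aggregate() is swapped in for the dplyr pipeline from the question so that the example has no package dependencies.

```r
set.seed(1)
runs       <- 1000
prediction <- runif(77, 0, 1)
n          <- length(prediction)

df.all <- data.frame(Preds = rep(prediction, runs),
                     simno = rep(1:runs, each = n))

# one draw per row, all at once
df.all$rand <- sample(1:100, size = nrow(df.all), replace = TRUE)
# TRUE counts as 1 when summed, so ifelse() is not needed
df.all$Win  <- df.all$rand < df.all$Preds * 100

# number of wins in each simulation run
output <- aggregate(Win ~ simno, data = df.all, FUN = sum)
names(output)[2] <- "Wins"
head(output)
```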

Discarding samples according to a condition

I wish to write a loop in R within which Poisson samples are simulated, but I wish to discard samples that do not contain any zeros and "have another go". How may I do this?
For example:
X<-rep(999,100)
for(j in 1:100){
x<-rpois(100,4)
X[j]<-mean(x)
}
Is there any way I could discard samples for which length(x[x==0])==0, then draw a replacement sample, and continue until 100 means from samples which do contain zeros are obtained?
As @Frank suggested, a while loop is your best approach, though I don't think if is the best way to go.
NN <- 100
kk <- 100
lam <- 4
draws <- matrix(rpois(kk * NN, lam), ncol = NN)
while (!all(idx <- apply(draws, 2, all))){
draws[ , nidx] <- matrix(rpois(sum(nidx <- !idx) * kk, lam), nrow = kk)
}
Then to finish:
colMeans(draws)
An alternative is to use replicate:
colMeans(replicate(NN, {draws <- rpois(kk, lam)
while (!all(draws)) draws <- rpois(kk, lam)
draws}))
My quick benchmarks suggest this latter is actually faster.
Even more savvy would be to simply eliminate all bad draws from the start (and essentially draw from the truncated distribution).
We know that the probability of getting 0 on a given draw is exp(-lambda), so if we invert uniform draws on (exp(-lambda), 1], we'll be set:
colMeans(matrix(qpois(runif(kk * NN, min = exp(-lam)), lam), ncol = NN))
Also competitive with this is to use data.table:
library(data.table)
grps <- rep(1:NN, each = kk)
data.table(qpois(runif(kk * NN, min = exp(-lam)), lam))[ , mean(V1), grps]
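The inverse-CDF approach above can be sanity-checked with a short sketch. Restricting the uniform draws to values above P(X = 0) = exp(-lam) means qpois() can never return 0, so the draws follow a zero-truncated Poisson distribution whose mean is lam / (1 - exp(-lam)). The seed is an assumption for reproducibility.

```r
set.seed(1)
NN  <- 100
kk  <- 100
lam <- 4

u     <- runif(kk * NN, min = exp(-lam))   # uniform above P(X = 0)
draws <- matrix(qpois(u, lam), ncol = NN)  # one sample per column

min(draws)                                 # no zeros can appear
means <- colMeans(draws)

# overall mean compared with the zero-truncated Poisson mean
c(simulated = mean(draws), theoretical = lam / (1 - exp(-lam)))
```

Note that this produces samples with no zeros at all, which matches the answer's reading of the question rather than samples that must contain a zero.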
Just to say that I have realised that if I edit Micheal's code to:
replicate(NN, {draws <- rpois(kk, lam)
while (all(draws)) draws <- rpois(kk, lam)
draws})
It will do what I wish. Thanks to all who answered.
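For completeness, a sketch of the whole task as originally asked (100 means, each from a sample that contains at least one zero) could look like this. The helper name draw_with_zero and the seed are made up for the example.

```r
set.seed(1)
NN  <- 100   # number of means wanted
kk  <- 100   # size of each sample
lam <- 4

# hypothetical helper: redraw until the sample contains at least one zero
draw_with_zero <- function(size, lambda) {
  repeat {
    x <- rpois(size, lambda)
    if (any(x == 0)) return(x)
  }
}

samples <- replicate(NN, draw_with_zero(kk, lam))  # kk x NN matrix
X <- colMeans(samples)
```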

Optimizing a vectorized function using apply, compiler, or other techniques

I'm seeking to optimize the function smartWindow and the process in which I apply it. My original post explains some context around the function and how I got here:
Vectorizing a loop through lines of data frame R while accessing multiple variables the dataframe
This currently takes 240 seconds to run on my actual data. I've tried some profiling with Rprof, and it seems that the chg2 <- line of smartWindow is eating the most time. I've also tried the compiler in R using cmpfun. I'm wondering whether there's a way to significantly improve the speed of what I'm trying to do.
What I'm really looking for is a technique to accomplish what I've done below in something closer to 20 seconds than 240 seconds. I've shaved 1-5% off the computation time using various things, but what I'm really wondering is whether I can decrease the time by a factor greater than 2.
## the function
smartWindow <- function(tdate, aid, chgdf, datev='Submit.Date', assetv='Asset.ID', fdays=30, bdays=30) {
fdays <- tdate+fdays
bdays <- tdate-bdays
chg2 <- chgdf[chgdf[,assetv]==aid & chgdf[,datev]<fdays & chgdf[,datev]>bdays, ]
ret <- nrow(chg2)
return(ret)
}
## set up some data #################################################
dates <- seq(as.Date('2011-01-01'), as.Date('2013-12-31'), by='days')
aids <- paste(rep(letters[1:26], 3), 1:3, sep='')
n <- 3000
inc <- data.frame(
Submit.Date = sample(dates, n, replace=T),
Asset.ID = sample(aids, n, replace=T))
chg <- data.frame(
Submit.Date = sample(dates, n, replace=T),
Asset.ID = sample(aids, n, replace=T))
## applying function to just one incident ###########################
smartWindow(inc$Submit.Date[1], inc$Asset.ID[1], chgdf=chg, bdays=100)
## applying to every incident... this is process i seek to optimize #########
system.time({
inc$chg_b30 <- apply(inc[,c('Submit.Date', 'Asset.ID')], 1, function(row) smartWindow(as.Date(row[1]), row[2], chgdf=chg,
datev='Submit.Date', assetv='Asset.ID', bdays=30, fdays=0))
})
table(inc$chg_b30)
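One direction that may help, offered as a sketch and not as a measured result: most of the work in the chg2 <- line is subsetting a data frame for every incident. Splitting the change dates by asset once, and then counting dates inside the window on plain numeric vectors, avoids that. The names chg_dates and count_in_window are made up for this example, the data is smaller than in the question, and stringsAsFactors = FALSE is set so that asset IDs are character on any R version.

```r
set.seed(1)
dates <- seq(as.Date("2011-01-01"), as.Date("2013-12-31"), by = "days")
aids  <- paste0(rep(letters, 3), 1:3)
n     <- 300

inc <- data.frame(Submit.Date = sample(dates, n, replace = TRUE),
                  Asset.ID    = sample(aids, n, replace = TRUE),
                  stringsAsFactors = FALSE)
chg <- data.frame(Submit.Date = sample(dates, n, replace = TRUE),
                  Asset.ID    = sample(aids, n, replace = TRUE),
                  stringsAsFactors = FALSE)

# split the change dates (as day numbers) by asset, once
chg_dates <- split(as.numeric(chg$Submit.Date), chg$Asset.ID)

# count changes for one asset strictly inside (tdate - bdays, tdate + fdays)
count_in_window <- function(tdate, aid, bdays = 30, fdays = 0) {
  d <- chg_dates[[aid]]
  if (is.null(d)) return(0L)   # asset has no changes at all
  sum(d < tdate + fdays & d > tdate - bdays)
}

fast <- mapply(count_in_window, as.numeric(inc$Submit.Date), inc$Asset.ID,
               USE.NAMES = FALSE)

# reference: the data frame subsetting used by smartWindow()
slow <- vapply(seq_len(n), function(i) {
  td <- inc$Submit.Date[i]
  nrow(chg[chg$Asset.ID == inc$Asset.ID[i] &
             chg$Submit.Date < td + 0 & chg$Submit.Date > td - 30, ])
}, integer(1))

all(fast == slow)
```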

working with large lists that become too big for RAM when operated on

Short of working on a machine with more RAM, how can I work with large lists in R, for example put them on disk and then work on sections of it?
Here's some code to generate the type of lists I'm using
n = 50; i = 100
WORD <- vector(mode = "character", length = n)
# loop over j so that i (the number of data frames) is not overwritten
for (j in 1:n){
WORD[j] <- paste(sample(c(rep(0:9,each=5),LETTERS,letters),5,replace=TRUE),collapse='')
}
dat <- data.frame(WORD = WORD,
COUNTS = sample(1:50, n, replace = TRUE))
dat_list <- lapply(1:i, function(i) dat)
In my actual use case each data frame in the list is unique, unlike the quick example here. I'm aiming for n = 4000 and i = 100,000
This is one example of what I want to do with this list of dataframes:
FUNC <- function(x) {rep(x$WORD, times = x$COUNTS)}
la <- lapply(dat_list, FUNC)
With my actual use case this runs for a few hours, fills up the RAM and most of the swap and then RStudio freezes and shows a message with a bomb on it (RStudio was forced to terminate due to an error in the R session).
I see that bigmemory is limited to matrices and ff doesn't seem to handle lists. What are the other options? Is sqldf or a related out-of-memory method possible here, and if so, how might I get started? I can't get enough out of the documentation to make any progress and would be grateful for any pointers. Note that instructions to "buy more RAM" will be ignored! This is for a package that I'm hoping will be suitable for average desktop computers (i.e. undergrad computer labs).
UPDATE Following up on the helpful comments from SimonO101 and Ari, here's some benchmarking comparing data frames and data.tables, loops and lapply, and with and without gc
# self-contained speed test of untable
n = 50; i = 100
WORD <- vector(mode = "integer", length = n)
for (i in 1:n){
WORD[i] <- paste(sample(c(rep(0:9,each=5),LETTERS,letters),5,replace=TRUE),collapse='')
}
# as data table
library(data.table)
dat_dt <- data.table(WORD = WORD, COUNTS = sample(1:50, n, replace = TRUE))
dat_list_dt <- lapply(1:i, function(i) dat_dt)
# as data frame
dat_df <- data.frame(WORD = WORD, COUNTS = sample(1:50, n, replace = TRUE))
dat_list_df <- lapply(1:i, function(i) dat_df)
# increase object size
y <- 10
dt <- c(rep(dat_list_dt, y))
df <- c(rep(dat_list_df, y))
# untable
untable <- function(x) rep(x$WORD, times = x$COUNTS)
# preallocate objects for loop to fill
df1 <- vector("list", length = length(df))
dt1 <- vector("list", length = length(dt))
df3 <- vector("list", length = length(df))
dt3 <- vector("list", length = length(dt))
# functions for lapply
df_untable_gc <- function(x) { untable(df[[x]]); if (x%%10) invisible(gc()) }
dt_untable_gc <- function(x) { untable(dt[[x]]); if (x%%10) invisible(gc()) }
# speedtests
library(microbenchmark)
microbenchmark(
for(i in 1:length(df)) { df1[[i]] <- untable(df[[i]]); if (i%%10) invisible(gc()) },
for(i in 1:length(dt)) { dt1[[i]] <- untable(dt[[i]]); if (i%%10) invisible(gc()) },
df2 <- lapply(1:length(df), function(i) df_untable_gc(i)),
dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i)),
for(i in 1:length(df)) { df3[[i]] <- untable(df[[i]])},
for(i in 1:length(dt)) { dt3[[i]] <- untable(dt[[i]])},
df4 <- lapply(1:length(df), function(i) untable(df[[i]]) ),
dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]]) ),
times = 10)
And here are the results. Without explicit garbage collection, data.table is much faster and lapply is slightly faster than a loop. With explicit garbage collection (as I think SimonO101 might be suggesting) they all run at much the same speed, which is a lot slower! I know that using gc is a bit controversial and probably not helpful in this case, but I'll give it a shot with my actual use case and see if it makes any difference. Of course I don't have any data on memory use for any of these functions, which is really my main concern. It seems that there is no function for memory benchmarking equivalent to the timing functions (for Windows, anyway).
Unit: milliseconds
expr                                                                                          min           lq       median           uq         max  neval
for (i in 1:length(df)) { df1[[i]] <- untable(df[[i]]) if (i%%10) invisible(gc()) }  37436.433962 37955.714144 38663.120340 39142.350799 39651.88118     10
for (i in 1:length(dt)) { dt1[[i]] <- untable(dt[[i]]) if (i%%10) invisible(gc()) }  37354.456809 38493.268121 38636.424561 38914.726388 39111.20439     10
df2 <- lapply(1:length(df), function(i) df_untable_gc(i))                            36959.630896 37924.878498 38314.428435 38636.894810 39537.31465     10
dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i))                            36917.765453 37735.186358 38106.134494 38563.217919 38751.71627     10
for (i in 1:length(df)) { df3[[i]] <- untable(df[[i]]) }                                28.200943    29.221901    30.205502    31.616041    34.32218     10
for (i in 1:length(dt)) { dt3[[i]] <- untable(dt[[i]]) }                                10.230519    10.418947    10.665668    12.194847    14.58611     10
df4 <- lapply(1:length(df), function(i) untable(df[[i]]))                               26.058039    27.103217    27.560739    28.189448    30.62751     10
dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]]))                                8.835168     8.904956     9.214692     9.485018    12.93788     10
If you really are going to be using very large data you can use the h5r package to write hdf5 files. You would be writing to and reading from your hard drive on the fly instead of using RAM. I have not used this, so I can be of little help on its general usage; I mention it because I think there is no tutorial for it. I got this idea by thinking about pytables. Not sure if this solution is appropriate for you.
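The general idea of working on sections of the list and writing each result to disk can be sketched without h5r. The example below swaps in base R's saveRDS() and readRDS(), which is a different technique from hdf5; the chunk size of 10, the file names and the helper make_word are all assumptions made for illustration. In a real use case the input data frames would also be read from disk one chunk at a time, so that neither the full input nor the full output is ever held in RAM.

```r
set.seed(1)
n        <- 50    # rows per data frame
n_frames <- 100   # number of data frames in the list

make_word <- function() {
  paste(sample(c(0:9, LETTERS, letters), 5, replace = TRUE), collapse = "")
}
dat_list <- lapply(seq_len(n_frames), function(k) {
  data.frame(WORD   = replicate(n, make_word()),
             COUNTS = sample(1:50, n, replace = TRUE),
             stringsAsFactors = FALSE)
})

untable <- function(x) rep(x$WORD, times = x$COUNTS)

out_dir <- file.path(tempdir(), "untable_chunks")
dir.create(out_dir, showWarnings = FALSE)

# indices grouped into chunks of 10 data frames each
chunks <- split(seq_along(dat_list), ceiling(seq_along(dat_list) / 10))

# process one chunk at a time and keep only the file path in memory
files <- vapply(names(chunks), function(id) {
  res  <- lapply(dat_list[chunks[[id]]], untable)
  path <- file.path(out_dir, paste0("chunk_", id, ".rds"))
  saveRDS(res, path)
  path
}, character(1))

# later: load a single chunk back when it is needed
first <- readRDS(files[[1]])
```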

Resources