As part of my program in R, I have to compare a huge number of pairs of sentences with some functions (the one I'm showing here checks whether two sentences with the same number of words differ by exactly one word).
To make things faster, I have already converted all words into integers, so I am dealing with integer vectors, and the example function is a very simple one:
is_sub_num <- function(a,b){sum(!(a==b))==1}
where a and b are numeric vectors such as
a = c(1,2,3); b=c(1,4,3)
is_sub_num(a,b)
# [1] TRUE
My data will be stored in a data.table:
Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
$ ID: int 1 2 3 4 5 6 7 8 9 10 ...
$ V2:List of 100
..$ : int 4 4 3 4
..$ : int 1 2 3 1
The length of each entry may differ (in the simplified example below, the entries are all of size 4).
I have a table with candidate pair IDs, to test the corresponding entries in DT with the function above, as follows:
is_pair_ok <- function(pair){
  is_sub_num(DT[ID==pair[1],V2][[1]], DT[ID==pair[2],V2][[1]])
}
Here is a simplification of what I'm trying to do:
set.seed(234)
z = lapply(1:100, function(x) sample(1:4,size=4,replace=TRUE))
is_sub_num <- function(a,b){sum(!(a==b))==1}
is_pair_ok <- function(pair){
  is_sub_num(DT[ID==pair[1],V2][[1]], DT[ID==pair[2],V2][[1]])
}
pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),sample(1:100,10000,replace=TRUE)))
DT <- as.data.table(1:100)
DT$V2 <- z
colnames(DT) <- c("ID","V2")
print(system.time(tmp <-apply(pair_list,1,is_pair_ok)))
This takes around 22 seconds on my laptop, even though there are only 10,000 pairs and the functions are very basic.
Do you have any advice on how to speed up the code?
I have delved further into this issue myself, and here is my answer.
I think it's an important one, and worth knowing.
The code for the answer is below. I have added some new parameters to make the problem a bit more general.
The key point is to use the unlist function.
Whenever we apply over a list object, we get very bad performance in R.
It's a bit of a pain to explode objects and do manual indexing into one long vector, but the speedup is phenomenal.
set.seed(234)
N=100
nobs=10000
z = lapply(1:N, function(x) sample(1:4, size=sample(3:5,1), replace=TRUE))  # entries of random length 3 to 5
is_sub_num <- function(a,b){sum(!(a==b))==1}
is_pair_ok <- function(pair){
  is_sub_num(DT[ID==pair[1],V2][[1]], DT[ID==pair[2],V2][[1]])
}
# same test, but on the exploded vector zzz, with manual start/length bookkeeping
is_pair_ok1 <- function(pair){
  is_sub_num(zzz[pos_table[pair[1]]:(pos_table[pair[1]] + length_table[pair[1]] - 1)],
             zzz[pos_table[pair[2]]:(pos_table[pair[2]] + length_table[pair[2]] - 1)])
}
pair_list <- as.data.table(cbind(sample(1:N,nobs,replace=TRUE),sample(1:N,nobs,replace=TRUE)))
DT <- as.data.table(1:N)
DT$V2 <- z
setnames(DT, c("ID","V2"))
setkey(DT, ID)
length_table <- sapply(z, length)            # length of each entry
pos_table <- c(0, cumsum(length_table)) + 1  # start position of each entry within zzz
zzz <- unlist(z)                             # all entries exploded into one vector
print(system.time(tmp_ref <- apply(pair_list,1,is_pair_ok)))
print(system.time(tmp <- apply(pair_list,1,is_pair_ok1)))
identical(tmp,tmp_ref)
Here is the output:
   user  system elapsed
  20.96    0.00   20.96
   user  system elapsed
   0.70    0.00    0.71
There were 50 or more warnings (use warnings() to see the first 50)
[1] TRUE
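The warnings are most likely recycling warnings from is_sub_num(): with entry lengths drawn from 3:5, some pairs compare vectors of unequal length, and == then recycles the shorter one. A minimal illustration:
c(1, 2, 3, 4) == c(1, 2, 3)
# [1]  TRUE  TRUE  TRUE FALSE
# Warning: longer object length is not a multiple of shorter object length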
EDIT
It would be a bit too long to post here. I tried to draw conclusions from the above and speed up my actual program by switching to unlist and manual indexing.
The new implementation is actually slower, which came as a surprise to me, and I fail to understand why...
Now I have about 60% of the time to spare:
library(data.table)
set.seed(234)
is_sub_num <- function(a,b) sum(!(a==b))==1
is_pair_ok2 <- function(p1, p2) is_sub_num(DT[p1,V2][[1]],DT[p2,V2][[1]])
DT <- as.data.table(1:100)
DT$V2 <- lapply(1:100, function(x) sample(1:4,size=4,replace=TRUE))
setnames(DT, c("ID","V2"))
setkey(DT, ID)
pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),sample(1:100,10000,replace=TRUE)))
print(system.time(tmp <- mapply(FUN=is_pair_ok2, pair_list$V1, pair_list$V2)))
The biggest effect came from setting the key on DT and using fast keyed indexing in is_pair_ok2().
A little bit more can be gained by dropping the function is_sub_num():
is_pair_ok3 <- function(p1, p2) sum(DT[p1,V2][[1]]!=DT[p2,V2][[1]])==1
print(system.time(tmp <- mapply(FUN=is_pair_ok3, pair_list$V1, pair_list$V2)))
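Going one step further in the same direction, here is a sketch (is_pair_ok4 is a hypothetical name, and it assumes, as in this example, that the IDs coincide with the row positions): extract the list column once, so no data.table lookup happens per pair at all.
v2 <- DT$V2   # plain list, extracted once
is_pair_ok4 <- function(p1, p2) sum(v2[[p1]] != v2[[p2]]) == 1
print(system.time(tmp <- mapply(FUN=is_pair_ok4, pair_list$V1, pair_list$V2)))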
In R, I have a reasonably large data frame (d) which is 10500 by 6000. All values are numeric.
It has many NA elements in both its rows and columns, and I am looking to replace these values with zero. I have used:
d[is.na(d)] <- 0
but this is rather slow. Is there a better way to do this in R?
I am open to using other R packages.
I would prefer it if the discussion focused on computational speed rather than, for example, "why would you replace NAs with zeros". And, while I realize a similar question has been asked (How do I replace NA values with zeros in an R dataframe?), the focus there was not on computational speed for a large data frame with many missing values.
Thanks!
Edited solution:
As helpfully suggested, converting d to a matrix before applying is.na sped up the computation by an order of magnitude.
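A minimal sketch of that matrix detour (assuming all columns of d are numeric):
m <- as.matrix(d)       # one conversion; is.na() on a matrix avoids data.frame overhead
m[is.na(m)] <- 0
d <- as.data.frame(m)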
You can get a considerable performance increase using the data.table package.
It is much faster, in general, with a lot of manipulations and transformations.
The downside is the learning curve of the syntax.
However, if you are looking for a speed performance boost, the investment could be worth it.
Generate fake data
r <- 10500
c <- 6000
x <- sample(c(NA, 1:5), r * c, replace = TRUE)
df <- data.frame(matrix(x, nrow = r, ncol = c))
Base R
df1 <- df
system.time(df1[is.na(df1)] <- 0)
user system elapsed
4.74 0.00 4.78
tidyr - replace_na()
dfReplaceNA <- function (df) {
  require(tidyr)
  # build a named list of zeros, one entry per column, as replace_na() expects
  l <- setNames(lapply(vector("list", ncol(df)), function(x) x <- 0), names(df))
  replace_na(df, l)
}
system.time(df2 <- dfReplaceNA(df))
user system elapsed
4.27 0.00 4.28
data.table - set()
dtReplaceNA <- function (df) {
  require(data.table)
  dt <- data.table(df)
  # set() assigns by reference, column by column, without copying dt
  for (j in 1:ncol(dt)) {set(dt, which(is.na(dt[[j]])), j, 0)}
  setDF(dt)  # return back a data.frame object
}
system.time(df3 <- dtReplaceNA(df))
user system elapsed
0.80 0.31 1.11
Compare data frames
all.equal(df1, df2)
[1] TRUE
all.equal(df1, df3)
[1] TRUE
I guess that all columns must be numeric, or assigning 0s to NAs wouldn't be sensible.
I get the following timings, with approximately 10,000 NAs:
> M <- matrix(0, 10500, 6000)
> set.seed(54321)
> r <- sample(1:10500, 10000, replace=TRUE)
> c <- sample(1:6000, 10000, replace=TRUE)
> M[cbind(r, c)] <- NA
> D <- data.frame(M)
> sum(is.na(M)) # check
[1] 9999
> sum(is.na(D)) # check
[1] 9999
> system.time(M[is.na(M)] <- 0)
user system elapsed
0.19 0.12 0.31
> system.time(D[is.na(D)] <- 0)
user system elapsed
3.87 0.06 3.95
So, with this number of NAs, I get about an order of magnitude speedup by using a matrix. (With fewer NAs, the difference is smaller.) But the time using a data frame is just 4 seconds on my modest laptop -- much less time than it took to answer the question. If the problem really is of this magnitude, why is that slow?
I hope this helps.
I have a big time series full in one dataframe and a list of timestamps in a different dataframe test. I need to subset full with the data points surrounding the timestamps in test. My first instinct (as an R noob) was to write the below, which turned out to be wrong:
subs <- subset(full,(full$dt>test$dt-i) & (full$dt<test$dt+i))
Looking at the result, I realized that R recycles the shorter vector and compares the two element-wise, giving the wrong result (see the illustration after the code below). My fallback is to write a loop like this:
subs <- data.frame()
for (j in test$dt)
  subs <- rbind(subs, subset(full, full$dt>(j-i) & full$dt<(j+i)))
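For reference, a minimal illustration of the recycling that makes the first attempt wrong: the shorter vector is silently reused element by element.
1:6 > c(1, 4)   # the right-hand side is recycled to c(1, 4, 1, 4, 1, 4)
# [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE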
I feel that there might be a better way than explicit loops, and this article implores us to avoid R loops as much as possible. The other reason is that I might be hitting performance issues, as this would be at the heart of an optimization algorithm. Any suggestions from gurus would be greatly appreciated.
EDIT:
Here is some reproducible code that shows the wrong approach as well as the approach that works but could be better.
# create a time series
full <- data.frame(1:200, rnorm(200,0,1))
colnames(full) <- c("dt","val")
# my smaller array of points of interest
test <- data.frame(seq(5,200,by=23))
colnames(test) <- c("dt")
# my range around the points of interest
i <- 3
# the wrong approach
subs <- subset(full, (full$dt>test$dt-i) & (full$dt<test$dt+i))
# this works, but not sure it is the best way to go about it
subs <- data.frame()
for (j in test$dt)
  subs <- rbind(subs, subset(full, full$dt>(j-i) & full$dt<(j+i)))
EDIT:
I updated the values to better reflect my use case, and I see #mrdwab's solution pulling ahead unexpectedly and by a wide margin.
I am using the benchmark code from #mrdwab, and the initialization is as follows:
set.seed(1)
full <- data.frame(
  dt = 1:15000000,
  val = floor(rnorm(15000000,0,1))
)
test <- data.frame(dt = floor(runif(24,1,15000000)))
i <- 500
The benchmarks are:
test replications elapsed relative
2 mrdwab 2 1.31 1.00000
3 spacedman 2 69.06 52.71756
1 andrie 2 93.68 71.51145
4 original 2 114.24 87.20611
Totally unexpected. Mind = blown. Can someone please shed some light on this dark corner and explain what is happening?
Important: As #mrdwab notes below, his solution works only if the vectors are integers. If not, #spacedman has the right solution.
Here's a real R way to do it. Functionally. No loops...
Starting with Andrie's example data.
First, an interval comparison function:
> cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
An OR composition function:
> OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
Now there's sort of a loop here, to construct a list of those comparison functions:
> funs = mapply(cf,test$dt-i,test$dt+i)
Now combine all those into one function:
> anyF = Reduce(OR,funs)
And now we apply the OR composition to our interval testing functions:
> head(full[anyF(full$dt),])
dt val
3 3 -0.83562861
4 4 1.59528080
5 5 0.32950777
6 6 -0.82046838
7 7 0.48742905
26 26 -0.05612874
What you've got now is a function of a single variable that tests if the value is in the ranges you defined.
> anyF(1:10)
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
I don't know if this is faster, or better, or what. Someone do some benchmarks!
I don't know if it's any more efficient, but I would think you could also do something like this to get what you want:
subs <- apply(test, 1, function(x) c((x-2):(x+2)))
full[which(full$dt %in% subs), ]
I had to adjust your "3" to "2" since x would be included both ways.
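For the example data (test$dt = 5, 28, 51, ...), subs is a 5-by-9 matrix, one column of candidate dt values per point of interest, and %in% treats it as one flat vector. The first three columns:
subs[, 1:3]
#      [,1] [,2] [,3]
# [1,]    3   26   49
# [2,]    4   27   50
# [3,]    5   28   51
# [4,]    6   29   52
# [5,]    7   30   53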
Benchmarking (just for fun)
#Spacedman leads the way!
First, the required data and functions.
## Data
set.seed(1)
full <- data.frame(
  dt = 1:200,
  val = rnorm(200,0,1)
)
test <- data.frame(dt = seq(5,200,by=23))
i <- 3
## Spacedman's functions
cf = function(l,u){force(l);force(u);function(x){x>l & x<u}}
OR = function(f1,f2){force(f1);force(f2);function(x){f1(x)|f2(x)}}
funs = mapply(cf,test$dt-i,test$dt+i)
anyF = Reduce(OR,funs)
Second, the benchmarking.
## Benchmarking
require(rbenchmark)
benchmark(
  andrie = do.call(rbind,
                   lapply(test$dt,
                          function(j) full[full$dt > (j-i) &
                                           full$dt < (j+i), ])),
  mrdwab = {subs <- apply(test, 1,
                          function(x) c((x-(i-1)):(x+(i-1))))
            full[which(full$dt %in% subs), ]},
  spacedman = full[anyF(full$dt),],
  original = {subs <- data.frame()
              for (j in test$dt)
                subs <- rbind(subs,
                              subset(full, full$dt > (j-i) &
                                           full$dt < (j+i)))},
  columns = c("test", "replications", "elapsed", "relative"),
  order = "relative")
# test replications elapsed relative
# 3 spacedman 100 0.064 1.000000
# 2 mrdwab 100 0.105 1.640625
# 1 andrie 100 0.520 8.125000
# 4 original 100 1.080 16.875000
There is nothing inherently wrong with your code. To achieve your aim, you need a loop of some sort around a vectorised subset operation.
But here is a more R-ish way to do it, which might well be faster:
do.call(rbind,
        lapply(test$dt, function(j) full[full$dt > (j-i) & full$dt < (j+i), ])
)
PS: You can significantly simplify your reproducible example:
set.seed(1)
full <- data.frame(
  dt = 1:200,
  val = rnorm(200,0,1)
)
test <- data.frame(dt = seq(5,200,by=23))
i <- 3
xx <- do.call(rbind,
              lapply(test$dt, function(j) full[full$dt > (j-i) & full$dt < (j+i), ])
)
head(xx)
dt val
3 3 -0.83562861
4 4 1.59528080
5 5 0.32950777
6 6 -0.82046838
7 7 0.48742905
26 26 -0.05612874
One more way, using data.table:
{
  # collect every relevant time point in a sorted, keyed table
  temp <- data.table(x=unique(c(full$dt, (test$dt-i), (test$dt+i))), key="x")
  temp[, index:=1:nrow(temp)]
  # binary-search joins locate the row range covered by each interval
  startpoints <- temp[J(test$dt-i), index]$index
  endpoints <- temp[J(test$dt+i), index]$index
  allpoints <- as.vector(mapply(FUN=function(x,y) x:y, x=startpoints, y=endpoints))
  setkey(x=temp, index)
  ans <- temp[J(allpoints)]$x
}
Benchmarks:
number of rows in test: 9
number of rows in full: 10000
test replications elapsed relative
1 spacedman 100 0.406 1.000
2 new 100 1.179 2.904
number of rows in full: 100000
test replications elapsed relative
2 new 100 2.374 1.000
1 spacedman 100 3.753 1.581
Say I have a data frame like this:
Df <- data.frame(
  V1 = c(1,2,3,NA,5),
  V2 = c(1,2,NA,4,5),
  V3 = c(NA,2,NA,4,NA)
)
Now I want to count the number of valid observations for every combination of two variables. For that, I wrote a function sharedcount:
sharedcount <- function(x, ...){
  nx <- names(x)
  alln <- combn(nx, 2)
  out <- apply(alln, 2,
               function(y) sum(complete.cases(x[y]))
  )
  data.frame(t(alln), out)
}
This gives the output:
> sharedcount(Df)
X1 X2 out
1 V1 V2 3
2 V1 V3 1
3 V2 V3 2
All fine, but the function itself takes pretty long on big data frames (600 variables and about 10,000 observations). I have the feeling I'm overlooking an easier approach, especially since cor(..., use='pairwise') still runs a whole lot faster while it has to do something similar:
> require(rbenchmark)
> benchmark(sharedcount(TestDf),cor(TestDf,use='pairwise'),
+ columns=c('test','elapsed','relative'),
+ replications=1
+ )
test elapsed relative
2 cor(TestDf, use = "pairwise") 0.25 1.0
1 sharedcount(TestDf) 1.90 7.6
Any tips are appreciated.
Note : Using Vincent's trick, I wrote a function that returns the same data frame. Code in my answer below.
The following is slightly faster:
x <- !is.na(Df)
t(x) %*% x
# test elapsed relative
# cor(Df) 12.345 1.000000
# t(x) %*% x 20.736 1.679708
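As a sanity check on the small Df above, the cross-product reproduces the pairwise counts from sharedcount(), with the diagonal giving the number of non-NA values per column:
x <- !is.na(Df)
t(x) %*% x
#    V1 V2 V3
# V1  4  3  1
# V2  3  4  2
# V3  1  2  2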
I thought Vincent's looked really elegant, not to mention faster than my sophomoric for-loop, except that it seems to need an extraction step, which I added below. This is just an example of the heavy overhead in the apply method when used with data frames.
shrcnt <- function(Df) {
  Comb <- t(combn(1:ncol(Df), 2))
  shrd <- 1:nrow(Comb)
  for (i in seq_along(shrd)){
    shrd[i] <- sum(complete.cases(Df[, Comb[i,1]], Df[, Comb[i,2]]))}
  return(shrd)
}
benchmark(
shrcnt(Df), sharedcount(Df), {prs <- t(x) %*% x; prs[lower.tri(prs)]},
cor(Df,use='pairwise'),
columns=c('test','elapsed','relative'),
replications=100
)
#--------------
test elapsed relative
3 { 0.008 1.0
4 cor(Df, use = "pairwise") 0.020 2.5
2 sharedcount(Df) 0.092 11.5
1 shrcnt(Df) 0.036 4.5
Based on the lovely trick of Vincent and the additional lower.tri() suggestion of DWin, I came up with the following function that gives me the same output (i.e. a data frame) as my original one, and runs a whole lot faster:
sharedcount2 <- function(x, stringsAsFactors=FALSE, ...){
  counts <- crossprod(!is.na(x))
  id <- lower.tri(counts)
  count <- counts[id]
  X1 <- colnames(counts)[col(counts)[id]]
  X2 <- rownames(counts)[row(counts)[id]]
  data.frame(X1, X2, count, stringsAsFactors=stringsAsFactors)
}
Note the use of crossprod(): it gives a small improvement compared to %*%, but computes exactly the same thing.
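A quick check of that equivalence (crossprod(x) is defined as t(x) %*% x):
x <- !is.na(Df)
all.equal(crossprod(x), t(x) %*% x)
# [1] TRUE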
The timings:
> benchmark(sharedcount(TestDf),sharedcount2(TestDf),
+ replications=5,
+ columns=c('test','replications','elapsed','relative'))
test replications elapsed relative
1 sharedcount(TestDf) 5 10.00 90.90909
2 sharedcount2(TestDf) 5 0.11 1.00000
Note: I supplied TestDf in the question, as I noticed that the timings differ depending on the size of the data frames. As shown here, the time increase is a lot more dramatic than when compared using a small data frame.
I set myself a little challenge on my way to learning R. The question was: given a sample of 500 numbers from a normal distribution with mean 20, how many numbers below zero would I get for standard deviations from 6 to 10? Just to learn more, I decided to draw 4 samples for each sd. So by the end I should have:
sd6samp1:...
sd6samp2:...
....
sd10samp4:...
My first approach, which worked, was:
ddss <- c(6:10)  # sd's
sam <- c(1:4)    # 4 samples for each
k = 0            # counter at 0
for (i in ddss) {    # for each sd
  for (j in sam) {   # for each sample
    nam <- paste("sam", i, ".", j, sep="")  # building a name
    n <- assign(nam, rnorm(500, 20, i))     # the great assign function
    k <- k + sum(n <= 0)
  }
  print(assign(paste("ds", i, sep=""), k))  # ohh assign you're great
  k = 0  # reset counter
}
While looking for how to create variable names with the looping 'i', I found that 'assign' does the job, but the documentation also said:
Note though that if you are planning some simulations,
many guRus would say that you should use a list.
So I thought it would be good to learn lists...
In the meanwhile I also discovered another great option...
ddss <- c(6:10)
for (i in ddss) {
  print(paste('prob. x<=0, with sd=', i))
  print(pnorm(0, mean=20, sd=i) * 500)
}
This worked to answer the question, but the lists were still to be done... and a lot of R had yet to be learned. The main idea wasn't to get the very probability or number of negatives, but to learn R and specifically some looping.
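Since pnorm() is vectorized over sd, the same analytic check collapses to one line (the values shown are approximate):
round(pnorm(0, mean=20, sd=6:10) * 500, 2)
# roughly: 0.21  1.07  3.10  6.57  11.38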
So, I've been trying to go with the mentioned lists
My closest approach has been:
ddss <- c(6:10)  # sd's to be calculated
sam <- c(1:4)    # 4 samples for each sd
liss <- list()   # initializing the list
for (i in ddss) {    # for each sd
  liss[[i]] <- list()
  for (j in sam) {   # for each sample
    liss[[i]][[j]] <- rnorm(500, 20, i)
    print(paste('ds', i, 'samp', j, '=', sum(liss[[i]][[j]] < 0)))
  }
}
With this one I get the information, but I'm wondering about two issues (1 & 2) and some other questions (3 & 4):
1. I get a list of 10 elements, the first 5 empty and then 5 with sublists. I can't seem to find out how to work with the filled elements (at positions 6:10) under the sd values (6:10) as names.
2. Even though I tried, I couldn't manage to name the list elements from within the 'for' loops. Any insight on these issues would be great.
3. In this context of simulations, what do you think is better: nested lists (lists with sublists) or simple (longer) lists?
4. I wondered whether the 'apply' functions would be of any help here. I tried something like:
vbv<-matrix(c(6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9))
lsl<-apply(vbv, 2, function(x) rnorm(500,20,x))
But it looks like I'm not getting even close...
Thanks for your time if you've read this far!
You may as well take some more time to reply ;-).
The problem is in your indexes: you are running the indexer i over ddss, which goes from 6 to 10. So on the first pass of your outer loop, your first statement really says liss[[6]] <- list(), implying that the first 5 elements are NULL.
So if you insist on working with loops, this is what you should do (check ?seq_along):
ddss <- c(6:10)  # sd's to be calculated
sam <- c(1:4)    # 4 samples for each sd
liss <- list()   # initializing the list
for (i in seq_along(ddss)) {  # now, i runs from 1 to 5
  liss[[i]] <- list()
  for (j in sam) {            # for each sample
    liss[[i]][[j]] <- rnorm(500, 20, ddss[i])
    print(paste('ds', ddss[i], 'samp', j, '=', sum(liss[[i]][[j]] < 0)))
  }
  names(liss[[i]]) <- as.character(sam)  # this should solve your naming issue (1/2)
}
names(liss) <- as.character(ddss)        # this should solve your naming issue (2/2)
names(liss)<-as.character(ddss)#this should solve your naming issue (2/2)
Note that, as always, it is a good idea to name your variables something more useful than i or j: if you'd named it curds, maybe you wouldn't have used it immediately as an indexer in a list?
Now, if you are really aiming for improvement (but want to stick to lists), you indeed want to go with the apply style functions:
liss <- lapply(ddss, function(curds){   # apply the inline function to each sd and store the results in a list
  return(lapply(sam, function(cursam){  # apply the inline function to each sample and store the results in a list
    rv <- rnorm(500, 20, curds)
    cat('ds', curds, 'samp', cursam, '=', sum(rv < 0), "\n")  # maybe better for your purposes
    return(rv)
  }))
})
Finally, for your case there is not a lot of reason to actually use lists (nor do you even need to keep the sampled data for each sd/sample): you could store everything in a three-dimensional array. But since you framed this as a learning exercise (hey, maybe the array thing can be your next exercise :-)), I'll leave it at that.
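For completeness, a minimal sketch of that array layout (the dimension ordering, observations x samples x sd values, is my assumption, not from the original post):
# rnorm() recycles the sd vector over the draws, so each 500x4 slab gets one sd
arr <- array(
  rnorm(500 * 4 * 5, mean = 20, sd = rep(6:10, each = 500 * 4)),
  dim = c(500, 4, 5),
  dimnames = list(NULL, paste0("samp", 1:4), paste0("sd", 6:10))
)
apply(arr < 0, c(2, 3), sum)  # count of negatives per sample/sd combination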
lapply() is helpful here, where we can just apply over the set of values for the SD. It helps to write a custom wrapper around the rnorm() function so we can pass in different values for the various arguments of rnorm(), and handle the k replicates (k = 4 in your example) in a nice fashion also. That wrapper is foo() below:
foo <- function(sd, n, mean, reps = 1) {
  rands <- rnorm(n * reps, mean = mean, sd = sd)
  if(reps > 1)
    rands <- matrix(rands, ncol = reps)
  rands
}
We use it in an lapply() call like so:
sims <- lapply(6:10, FUN = foo, mean = 20, n = 500, reps = 4)
Which gives:
R> str(sims)
List of 5
$ : num [1:500, 1:4] 30.3 22 15.6 20 19.4 ...
$ : num [1:500, 1:4] 20.9 21.7 17.7 35 30 ...
$ : num [1:500, 1:4] 17.88 26.48 5.19 19.25 15.59 ...
$ : num [1:500, 1:4] 27.41 12.72 9.38 35.09 11.08 ...
$ : num [1:500, 1:4] 16.2 11.6 20.5 35.4 27.3 ...
We can then compute the number of observations < 20 per SD:
names(sims) <- paste("SD", 6:10, sep = "")
out <- lapply(sims, function(x) colSums(x < 20))
Which gives:
R> out
$SD6
[1] 218 251 253 227
$SD7
[1] 250 242 233 232
$SD8
[1] 258 241 246 274
$SD9
[1] 252 245 249 258
$SD10
[1] 253 259 241 242
#Joris suggests I show how to access elements of the list. For example, if you want the results of the simulations for SD = 10, we could do out[[5]], because 10 was the 5th value in the vector of SDs we applied over, or, because I named the elements of the output list out, we can ask for the results of that simulation using out[["SD10"]].
To answer some of the specific points about your loops etc.:
To add names to a list, use names(), e.g. names(mylist) <- c("foo","bar"). You'd be better off calling names() once, after the loop, to set up the names in a single shot; you probably wouldn't want to fill the names in as you go along, as that would be inefficient.
I don't think it makes too much difference whether you use a nested list or a list containing a matrix, as per my example. To alter foo() to return a list, so that the output of lapply() is a list of lists, we could do:
Code:
bar <- function(sd, n, mean, reps = 1) {
  rands <- rnorm(n * reps, mean = mean, sd = sd)
  if(reps > 1)
    rands <- split(rands, rep(seq_len(reps), each = n))
  rands
}
sims2 <- lapply(6:10, FUN = bar, mean = 20, n = 500, reps = 4)
names(sims2) <- paste("SD", 6:10, sep = "")
out2 <- lapply(sims2, function(x) sapply(x, function(y) sum(y < 20)))
which gives the same output as before.
I am going to throw in another solution using the plyr package, which I think is tailor-made for such exercises.
library(plyr)
# generate a data frame of parameters, repeating some as required
parameters = data.frame(mean = 20, sd = rep(6:10, each = 4))
# generate sample data for each combination of parameters
sample_data = mdply(parameters, rnorm, n = 500)
# generate answer by counting number of observations less than 20
answer = data.frame(
  parameters,
  obs_less_20 = rowSums(sample_data[, -c(1, 2)] < 20)
)
head(answer)
mean sd obs_less_20
1 20 6 247
2 20 6 250
3 20 6 242
4 20 6 259
5 20 7 240
6 20 7 237
I am new to using R, so I am not sure about how to use apply.
I would like to speed up my function by using apply:
for(i in 1:ncol(exp)){
  for (j in 1:length(fe)){
    tmp = TRUE
    id = strsplit(colnames(exp)[i], "\\.")
    if(id == fe[j]){
      tmp = FALSE
    }
    if(tmp == TRUE){
      only = cbind(only, c(names(exp)[i], exp[,i]))
    }
  }
}
How can I use the apply function to do this above?
EDIT:
Thank you so much for the very good explanation, and sorry for my bad description. You guessed everything right, but I want to delete the matches in fe.
Exp <- data.frame(A.x=1:10,B.y=10:1,C.z=11:20,A.z=20:11)
fe<-LETTERS[1:2]
then the result should contain only the column whose name starts with 'C'. Everything else should be deleted.
1 C.z
2 11
3 12
4 13
5 14
6 15
7 16
8 17
9 18
10 19
11 20
EDIT: If you only want to delete the columns whose names appear in fe, you can simply do:
Exp <- data.frame(A.x=1:10,B.y=10:1,C.z=11:20,A.z=20:11)
fe<-LETTERS[1:2]
id <- sapply(strsplit(names(Exp), "\\."),
             function(i) !i[1] %in% fe)
Exp[id]
This code does exactly what your (updated) for-loop does as well, only a lot more efficiently. You don't have to loop through fe; the %in% function is vectorized.
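A quick illustration, using the first letters of the column names of Exp:
c("A", "B", "C", "A") %in% fe   # one call checks all elements against fe
# [1]  TRUE  TRUE FALSE  TRUE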
In case the name can appear anywhere between the dots, then:
id <- sapply(strsplit(names(Exp), "\\."),
             function(i) sum(i %in% fe) == 0)
Your code does some very funny things, and I have no clue what exactly you're trying to do. For one, strsplit gives a list, so id == fe[j] will always return FALSE, unless fe[j] is a list itself. And I doubt it is... So I'd correct your code as:
id = strsplit(colnames(Exp)[i],"\\.")[[1]][1]
in case you want to compare with everything that is before the dot, or to
id = unlist(strsplit(colnames(Exp)[i],"\\."))
if you want to compare with everything in the string. In that case, you should use %in% instead of == as well.
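To see why the original comparison could never match, note the shape of what strsplit() returns:
strsplit("A.x", "\\.")           # a list, not a character vector
# [[1]]
# [1] "A" "x"
strsplit("A.x", "\\.")[[1]][1]   # the part before the dot
# [1] "A"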
Second, what you get is a character matrix, which essentially multiplies rows. If all elements in fe are unique, you could as well do:
mat <- rbind(names(exp), exp)
only <- do.call(cbind, lapply(mat, function(x)
  matrix(rep(x, ncol(exp)-1), nrow=nrow(exp)+1)
))
Assuming that the logic in your code does make sense (as you didn't supply sample data, this is impossible to know), the optimization runs:
mat <- rbind(names(Exp), Exp)
do.call(cbind,
        lapply(mat, function(x){
          n <- sum(!fe %in% strsplit(x[1], "\\.")[[1]][1])
          matrix(rep(x, n), nrow=nrow(mat))
        }))
Note that, in case you are interested in whether fe[j] appears anywhere in the name, you can change the code to:
do.call(cbind,
        lapply(mat, function(x){
          n <- sum(!fe %in% unlist(strsplit(x[1], "\\.")))
          matrix(rep(x, n), nrow=nrow(mat))
        }))
If this doesn't return what you want, then your code doesn't do that either. I checked with the following sample data, and all variants give the same result:
Exp <- data.frame(A.x=1:10,B.y=10:1,C.z=11:20,A.z=20:11)
fe <- LETTERS[1:4]
The apply() family of functions are convenience functions. They will not necessarily be faster than a well-written for loop or vectorized functions. For example:
set.seed(21)
x <- matrix(rnorm(1e6),5e5,2)
system.time({
  yLoop <- x[,1]*0  # preallocate result
  for(i in 1:NROW(yLoop)) yLoop[i] <- mean(x[i,])
})
# user system elapsed
# 13.39 0.00 13.39
system.time(yApply <- apply(x, 1, mean))
# user system elapsed
# 16.19 0.28 16.51
system.time(yRowMean <- rowMeans(x))
# user system elapsed
# 0.02 0.00 0.02
identical(yLoop, yApply)
# TRUE
all.equal(yApply, yRowMean)
# TRUE
The reason your code is so slow is that--as Gavin pointed out--you're growing your array for every loop iteration. Preallocate the entire array before the loop and you will see a significant speedup.
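A minimal sketch of the difference (illustrative only; exact timings depend on the machine):
n <- 1e4
system.time({ grow <- NULL; for (i in 1:n) grow <- c(grow, i) })  # copies the whole vector each iteration
system.time({ pre <- numeric(n); for (i in 1:n) pre[i] <- i })    # writes in place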