I am trying to get the best possible performance out of a piece of code, but I am getting almost identical performance from mapply and from a for loop. Why is that? Would plyr or data.table be faster? Or is there a more efficient way to write my function?
The first chunk of code creates a list of 1000 lists, each nested list containing 1-10 random letters.
testlist <- list()
for (i in 1:1000) {
  testlist[i] <- list(paste(sample(c(letters), sample(1:10, 1))))
}
The script that I am trying to optimize is one that counts the number of intersections across all possible combinations (1,000,000) of my nested lists. Below illustrates the mapply syntax I used for this.
# Function for mapply
intersectfunction <- function(x, y) {
  length(intersect(x, y))
}

# mapply syntax
T1 <- Sys.time()
Intersects <- mapply(intersectfunction,
                     x = rep(testlist, length(testlist)),
                     y = rep(testlist, each = length(testlist)))
mapplytime <- Sys.time() - T1
Below illustrates a nested for loop syntax that produces essentially identical output.
T1 <- Sys.time()
Intersects <- vector(length = length(testlist)^2)
for (i in 1:length(testlist)) {
  for (j in 1:length(testlist)) {
    Intersects[j + ((i - 1) * length(testlist))] <- length(intersect(testlist[[i]], testlist[[j]]))
  }
}
forlooptime <- Sys.time() - T1
The weird thing is that each syntax takes almost the same amount of time, even though it seems like mapply should be more efficient. This suggests to me that I am either doing something wrong with mapply, or that mapply is not the right tool for accomplishing my goal.
> mapplytime
Time difference of 20.97202 secs
> forlooptime
Time difference of 23.29733 secs
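For what it's worth, mapply is itself essentially an R-level loop, so both versions end up calling intersect() a million times; any real gain has to come from doing less work rather than from swapping the looping construct. One option (a sketch only, not benchmarked against this data) is to exploit the fact that length(intersect(x, y)) is symmetric and evaluate each unordered pair just once with combn():
# Sketch: compute each unordered pair once (~499,500 intersect() calls instead of 1,000,000)
pair_counts <- combn(seq_along(testlist), 2,
                     function(idx) length(intersect(testlist[[idx[1]]],
                                                    testlist[[idx[2]]])))
# The "diagonal" (each element intersected with itself) is just the element length,
# because sample() draws the letters without replacement
self_counts <- lengths(testlist)
Reassembling these into the same 1,000,000-long vector the loop produces is a matter of indexing, but the intersect() work itself is roughly halved.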
This has probably been answered already, and in that case I am sorry to repeat the question, but unfortunately I couldn't find an answer to my problem. I am currently trying to improve the readability of my code and to use functions more often, yet I am not that familiar with them.
I have a data.frame in which some columns contain NAs that I want to interpolate, in this case with a simple Kalman filter.
require(imputeTS)

# some test data
col <- c("Temp", "Prec")
df_a <- data.frame(c(10, 13, NA, 14, 17),
                   c(20, NA, 30, NA, NA))
names(df_a) <- col

# this is the function I'd like to use
gapfilling <- function(df, col) {
  print(sum(is.na(df[, col])))
  df[, col] <- na_kalman(df[, col])
}

# this is my for loop over the columns
for (i in col) {
  gapfilling(df_a, i)
}
I have two problems:
My for loop works, yet it doesn't overwrite the data.frame. Why?
How can I achieve this without a for-loop? As far as I am aware you should avoid for-loops if possible and I am sure it's possible in my case, I just don't know how.
How can I achieve this without a for-loop? As far as I am aware you should avoid for-loops if possible and I am sure it's possible in my case, I just don't know how.
You most definitely do not have to avoid for loops. What you should avoid is using a loop to perform actions that could be vectorized. Loops in general are just fine; they are (much) slower than loops in compiled languages such as C++, but comparable to loops in languages such as Python.
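As a tiny illustration of that distinction (a generic sketch, not tied to your data):
x <- rnorm(1e6)
y <- rnorm(1e6)

z_vec <- x + y                  # vectorized: one call, the iteration happens in C

z_loop <- numeric(length(x))    # looped: same result, a million R-level iterations
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}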
My for loop works, yet it doesn't overwrite the data.frame. Why?
This is a problem with overwriting values within a function, or what is referred to as scope. Basically any assignment is restricted to its current environment (or scope). Take the example below:
f <- function(x) {
  a <- x
  cat("a is equal to ", a, "\n")
  return(3)
}
x <- 4
f(x)
a is equal to 4
[1] 3
print(a)
Error in print(a) : object 'a' not found
As you can see, "a" definitely exists, but it stops existing once the function call has finished. It is restricted to the environment (or scope) of the function, and that environment only lives for as long as the function is running.
To alleviate this, you have to overwrite the value in the global environment:
for (i in col) {
  df_a[, i] <- gapfilling(df_a, i)
}
Now, for readability (not speed), one could change this to an lapply:
df_a[, col] <- lapply(df_a[, col], na_kalman)
I want to stress that this is not faster than using a loop. lapply iterates over each column, just as you would in a loop. Speed would only be gained if, say, na_kalman were written to take multiple columns at once, possibly saving time with optimized C or C++ code.
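Another option along the same lines (a sketch, keeping your function and column names) is to have gapfilling() return the modified data frame explicitly and reassign the result at the call site:
gapfilling <- function(df, col) {
  print(sum(is.na(df[, col])))
  df[, col] <- na_kalman(df[, col])
  df  # return the modified copy explicitly
}

for (i in col) {
  df_a <- gapfilling(df_a, i)
}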
So I have a customer survey, and I need to determine if there are significant differences between the four areas. I obviously want to do a t-test on these, and here is my current R solution.
for (i in colnames(c_survey)) {
  assign(i, subset(c_survey, select = i))
}
elements <- list(quality, ease.of.use, price, service)
elements_alt <- list(service, price, ease.of.use, quality)
for (i in elements) {
  print(names(elements)[i])
  for (j in elements_alt) {
    print(t.test(i, j)$p.value)
  }
}
(Edit) I figured out the nested loop, but I still think there's a faster way to do what I want than this whole two-list, nested-loop nonsense. Also, my output has no names on it, so I have no idea what's being compared to what, and it includes all the duplicate inverse comparisons. I also can't save the results. I think my solution certainly lies elsewhere.
Also, would producing so many t-test p values even be the best statistical way to accomplish what I want? It seems like there should be something easier than this...
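For what it's worth, one way to hit each unique pair exactly once, keep the names, and store the results is to iterate over pairs of column names with combn() (a sketch assuming the four columns of c_survey are named quality, ease.of.use, price and service):
vars <- c("quality", "ease.of.use", "price", "service")
pairs <- combn(vars, 2)   # the 6 unique pairs, no duplicates or self-comparisons
pvals <- apply(pairs, 2, function(p) t.test(c_survey[[p[1]]], c_survey[[p[2]]])$p.value)
names(pvals) <- paste(pairs[1, ], pairs[2, ], sep = " vs ")
pvals
If the data were reshaped to long format (one column of values, one column of area labels), pairwise.t.test() would give the same comparisons with a multiple-comparison adjustment built in, which also speaks to the concern about running many t-tests.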
I have a data frame df that has 15 columns and 1000000 rows of all ints. My code is:
for (i in 1:nrow(df)) {
  if (is.null(df$col1[i]) || .... || is.null(df$col9[i])) {
    df[-i, ]  # to delete the row if one of those columns is null
  }
}
This has been running for an hour and is still going. Why? It seems like it should be relatively fast code to run. How can I speed it up?
The reason it is slow is that R is relatively slow at looping through vectors. Most functions in R are vectorized, which means you can apply them to an entire vector at once much faster than you can loop through the elements one by one. On a side note, I don't think you have NULLs in your data frame; I think you have NAs, so that is what I'm going to assume. Even if you do have NULLs, the following should still work.
This syntax should give you a nice speed boost.
This will take advantage of rowSums producing NA for every row that has missing values in it.
df <- subset(df, !is.na(rowSums(df[, 1:9])))
This syntax should also work.
df <- df[rowSums(is.na(df[, 1:9])) == 0, ]
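Base R also has complete.cases(), which expresses the same idea directly; a sketch, again assuming the columns you check are the first nine:
df <- df[complete.cases(df[, 1:9]), ]  # keep rows with no NA in columns 1 to 9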
I am totally convinced that an efficient R program should avoid using loops whenever possible and should instead use the big family of apply functions.
But this cannot happen without pain.
For example, I am facing a problem whose solution involves a sum in the applied function; as a result, the list of results is reduced to a single value, which is not what I want.
To be concrete, I will try to simplify my problem.
Assume N = 100:
sapply(list(1:N), function(n) (
  choose(n, (floor(n/2) + 1):n) *
    eps^((floor(n/2) + 1):n) *
    (1 - eps)^(n - ((floor(n/2) + 1):n))))
As you can see, the function inside causes the length of the resulting vector to explode, whereas putting a sum inside would collapse everything to a single value:
sapply(list(1:N), function(n) sum(
  choose(n, (floor(n/2) + 1):n) *
    eps^((floor(n/2) + 1):n) *
    (1 - eps)^(n - ((floor(n/2) + 1):n))))
What I would like to have is a list of length N, one value for each n.
So what do you think? How can I repair it?
Your question doesn't contain reproducible code (what's "eps"?), but on the general point about for loops and optimising code:
For loops are not inherently slow. For loops are incredibly slow when used improperly, because of how memory is assigned to objects. For primitive objects (like vectors), modifying a value in place has a tiny cost, but expanding the length of the vector is fairly costly, because what you're actually doing is creating an entirely new object, finding space for that object, copying the old values over, removing the old object, and so on. For non-primitive objects (say, data frames), it's even more costly, because every modification, even one that doesn't alter the length of the data.frame, triggers this process.
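A minimal sketch of the difference (generic, not taken from the question):
grown <- c()                   # grows on every iteration: each c() call allocates
for (i in 1:1e5) {             # a new, longer vector and copies the old values across
  grown <- c(grown, sqrt(i))
}

prealloc <- numeric(1e5)       # allocated once up front
for (i in 1:1e5) {
  prealloc[i] <- sqrt(i)       # each write just fills an existing slot
}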
But: there are ways to optimise a for loop and make them run quickly. The easiest guidelines are:
Do not run a for loop that writes to a data.frame. Use plyr or dplyr, or data.table, depending on your preference.
If you are using a vector and can know the length of the output in advance, it will work a lot faster. Specify the size of the output object before writing to it.
Do not twist yourself into knots avoiding for loops.
So in this case - if you're only producing a single value for each thing in N, you could make that work perfectly nicely with a vector:
# Create the output object. We're specifying the length in advance so that
# writing to it is cheap.
output <- numeric(length = length(N))

# Start the for loop
for (i in seq_along(output)) {
  output[i] <- your_computations_go_here(N[i])
}
This isn't actually particularly slow - because you're writing to a vector and you've specified the length in advance. And since data.frames are actually lists of equally-sized vectors, you can even work around some issues with running for loops over data.frames using this; if you're only writing to a single column in the data.frame, just create it as a vector and then write it to the data.frame via df$new_col <- output. You'll get the same output as if you had looped through the data.frame, but it'll work faster because you'll only have had to modify it once.
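A short sketch of that last pattern (the column name and helper function are made up for illustration):
output <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  output[i] <- some_row_computation(df$existing_col[i])  # hypothetical helper and column
}
df$new_col <- output  # the data.frame is modified once, not nrow(df) times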
Suppose that you have a data frame with many rows and many columns.
The columns have names. You want to access rows by number, and columns by name.
For example, one (possibly slow) way to loop over the rows is
for (i in 1:nrow(df)) {
  print(df[i, "column1"])
  # do more things with the data frame...
}
Another way is to create "lists" for separate columns (like column1_list = df[["column1"]]), and access those lists in one loop. This approach might be fast, but also inconvenient if you want to access many columns.
Is there a fast way of looping over the rows of a data frame? Is some other data structure better for looping fast?
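To make that second approach concrete, a small sketch (column names are placeholders):
column1 <- df[["column1"]]      # plain atomic vector, extracted once
for (i in seq_along(column1)) {
  print(column1[i])             # vector indexing avoids the cost of df[i, "column1"]
}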
I think I need to make this a full answer because I find comments harder to track, and I already lost one comment on this... There is an example by nullglob that demonstrates the differences among for and the apply family functions much better than the other examples do. When the function being applied is very slow, that is where all the time is spent and you won't find differences among the variations on looping. But when the function is trivial, you can see how much the looping itself influences things.
I'd also like to add that some members of the apply family that go unexplored in the other examples have interesting performance properties. First I'll show replications of nullglob's relative results on my machine.
n <- 1e6
sinI <- numeric(n)  # preallocate so the element-wise assignment works
system.time(for (i in 1:n) sinI[i] <- sin(i))
user system elapsed
5.721 0.028 5.712
lapply runs much faster for the same result
system.time(sinI <- lapply(1:n,sin))
user system elapsed
1.353 0.012 1.361
He also found sapply much slower. Here are some others that weren't tested.
Plain old apply to a matrix version of the data...
mat <- matrix(1:n, ncol = 1)
system.time(sinI <- apply(mat, 1, sin))
user system elapsed
8.478 0.116 8.531
So, the apply() command itself is substantially slower than the for loop. (The for loop is not slowed down appreciably if I use sin(mat[i, 1]) inside it.)
Another one that doesn't seem to be tested in other posts is tapply.
system.time(sinI <- tapply(1:n, 1:n, sin))
user system elapsed
12.908 0.266 13.589
Of course, one would never use tapply this way, and its utility goes far beyond any such speed problem in most cases.
The fastest way is to not loop (i.e. vectorized operations). One of the only instances in which you need to loop is when there are dependencies (i.e. one iteration depends on another). Otherwise, try to do as much vectorized computation outside the loop as possible.
If you do need to loop, then using a for loop is essentially as fast as anything else (lapply can be a little faster, but other apply functions tend to be around the same speed as for).
Exploiting the fact that data.frames are essentially lists of column vectors, one can use do.call to call a function with one argument per column across the columns of the data.frame (similar to "zipping" over a list in other languages).
do.call(paste, data.frame(x = c(1, 2), y = c("a", "b"), z = c(5, 6)))
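Here paste() receives the three columns as parallel vectors, so the call above should return something like:
[1] "1 a 5" "2 b 6"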