I am struggling to improve the performance of the code below, which runs over about 2M entries. At first the condition was inside the loop; moving it outside brought some improvement, but not enough.
Do you have any other ideas?
if (Floor=="Yes") {
for (i in 1:length(X)){
base_short_term[i] <- pmax(numeric_vector1[i],(1+numeric_vector2[i])^((numeric_vector3[i])/(1+numeric_vector4[i]))
}
} else {
for (i in 1:length(X)){
base_short_term[i] <- pmin(numeric_vector5[i],(1+numeric_vector3[i])^((numeric_vector5[i])/(1+numeric_vector7[i]))
}
}
Explicit loops in R are best avoided whenever a vectorized alternative exists, and here one does: a vectorized operation avoids the per-iteration interpreter and memory overhead and gives far more efficient, more readable code:
df <- data.frame(x1 = numeric_vector1,
                 x2 = numeric_vector2,
                 x3 = numeric_vector3,
                 x4 = numeric_vector4,
                 x5 = numeric_vector5,
                 x7 = numeric_vector7)

if (Floor == "Yes") {
  df$base_short_term <- pmax(df$x1, (1 + df$x2)^(df$x3/(1 + df$x4)))
} else {
  df$base_short_term <- pmin(df$x5, (1 + df$x3)^(df$x5/(1 + df$x7)))
}
If loops cannot be avoided, it's better to use lapply() or, for heavy numeric work, Rcpp.
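For illustration, here is a minimal Rcpp sketch of the Floor == "Yes" branch (assuming the same hypothetical vectors as above; base_floor is just an illustrative name):
library(Rcpp)

cppFunction('
NumericVector base_floor(NumericVector v1, NumericVector v2,
                         NumericVector v3, NumericVector v4) {
  int n = v1.size();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) {
    // compound growth term, floored at v1[i]
    double a = v1[i];
    double b = std::pow(1.0 + v2[i], v3[i] / (1.0 + v4[i]));
    out[i] = (a > b) ? a : b;
  }
  return out;
}')

# base_short_term <- base_floor(numeric_vector1, numeric_vector2,
#                               numeric_vector3, numeric_vector4)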
Update
If the vectors have different lengths, you will lose some performance because you first need to slice them from 1 to length(X), or use lapply.
Slicing vector
df <- data.frame(x1 = numeric_vector1[seq_along(X)],
                 x2 = numeric_vector2[seq_along(X)],
                 x3 = numeric_vector3[seq_along(X)],
                 x4 = numeric_vector4[seq_along(X)],
                 x5 = numeric_vector5[seq_along(X)],
                 x7 = numeric_vector7[seq_along(X)])
(This works because even if the vectors do not all have the same length, you only ever use indices up to length(X) for each of them.)
lapply
This really looks like your for loop, but it is more efficient since it avoids creating and discarding objects at each iteration.
For instance, if Floor is "Yes":
base_short_term <- lapply(seq_along(X), function(i) {
  pmax(numeric_vector1[i], (1 + numeric_vector2[i])^(numeric_vector3[i]/(1 + numeric_vector4[i])))
})
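Note that lapply() returns a list; if you want base_short_term as a plain numeric vector you can unlist() the result, or use vapply(), which also checks the return type (a sketch with the same hypothetical vectors):
base_short_term <- vapply(seq_along(X), function(i) {
  pmax(numeric_vector1[i], (1 + numeric_vector2[i])^(numeric_vector3[i]/(1 + numeric_vector4[i])))
}, numeric(1))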
Although there are some questions about this topic (e.g. this question), none of them answer my particular questions (as far as I could tell anyway).
Suppose I have a function which depends on a lot of parameters. For demonstration purposes I chose 3 parameters:
myfun <- function(x1, x2, x3){
  # some code containing x1, x2, x3
}
Often the input parameters are already contained in a list:
xlist <- list(x1 = 1, x2= 2, x3 = 3)
I want to run myfun with the inputs contained in xlist like this:
myfun(xlist$x1, xlist$x2, xlist$x3)
However, this seems like too much effort (because of the high number of parameters).
So I decided to modify myfun: instead of all the individual input parameters, it now gets the whole list as one single input, and at the beginning of the function I use attach in order to reuse the same code as above.
myfun2 <- function(xlist){
  attach(xlist)
  # same code as in myfun containing x1, x2, x3
  detach(xlist)
}
I thought that this would be quite a neat solution, but a lot of users advise to not use attach.
What do you think? Are there any arguments to prefer myfun over myfun2?
Thanks in advance.
I think you'd be better off using do.call. do.call accepts a list and converts its elements into arguments.
myfun <- function(x1, x2, x3){
  x1 + x2 + x3
}
xlist <- list(x1 = 1, x2 = 2, x3 = 3)
do.call(myfun, xlist)
This has the benefit of being explicit about what the arguments are, which makes it much easier to reason with the code, maintain it, and debug it.
The place where this gets tricky is if xlist has more values in it than just those required by the function. For example, the following throws an error:
xlist <- list(x1 = 1, x2 = 2, x3 = 3, x4 = 4)
do.call(myfun, xlist)
You can circumvent this by matching the list's names against the function's formals:
do.call(myfun, xlist[names(xlist) %in% names(formals(myfun))])
It's still a bit of typing, but if you're talking about 10+ arguments, it's still a lot easier than xlist$x1, xlist$x2, xlist$x3, etc.
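If you find yourself doing this a lot, you could wrap the filtering into a small helper (a sketch; call_with is just an illustrative name):
call_with <- function(f, args) {
  # keep only the list elements that match the function's formal arguments
  do.call(f, args[names(args) %in% names(formals(f))])
}

xlist <- list(x1 = 1, x2 = 2, x3 = 3, x4 = 4)
call_with(myfun, xlist)
# [1] 6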
LAP gives a useful solution as well, but it is cleaner to put with() outside the call:
with(xlist, myfun(x1, x2, x3))
You could just use with():
xlist <- list(x1 = 1, x2 = 2, x3 = 3)

FOO <- function(mylist){
  with(mylist,
    x1 + x2 + x3
  )
}
> FOO(xlist)
[1] 6
I'm not convinced of this approach, though. The function would depend on the correctly named elements within the list.
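For example (a hypothetical call), a list whose elements are named differently fails, unless objects with the expected names happen to exist in the enclosing environments:
FOO(list(a = 1, b = 2, c = 3))
# Error: object 'x1' not found  (assuming no x1/x2/x3 in scope)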
My approach would be something like this:
testfun <- function(a_list) {
  args <- a_list
  print(args$x1)
  print(args$x2)
  print(args$x3)
}
my_list <- list(x1 = 2, x2 = 3, x3 = 4)
testfun(my_list)
However, you would need to know the names of the parameters within the function.
Perhaps the do.call() function can come into play here.
do.call('fun', list)
You could assign the list to the environment of the function:
myfun <- function(xlist) {
  for (i in seq_along(xlist)) {
    assign(names(xlist)[i], xlist[[i]], envir = environment())
  }
  # or if you dislike for-loops:
  # lapply(seq_along(xlist), function(i) assign(names(xlist)[i], xlist[[i]], envir = parent.env(environment())))
  print(paste0(x2, x3)) # do something with x2 and x3
  print(x1 * x3)        # do something with x1 and x3
}
myfun(list(x1 = 4, x2 = "dc", x3 = c(3,45,21)))
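A related base R shortcut is list2env(), which copies all list elements into the function's environment in one call (a sketch of the same idea):
myfun <- function(xlist) {
  list2env(xlist, envir = environment())
  print(paste0(x2, x3))
  print(x1 * x3)
}
myfun(list(x1 = 4, x2 = "dc", x3 = c(3, 45, 21)))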
I have an empty data frame T_modelled with 2784 columns and 150 rows.
T_modelled <- data.frame(matrix(ncol = 2784, nrow = 150))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))
where
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
I filled T_modelled by column with a nested for loop, based on a formula:
for (i in 1:ncol(T_modelled)) {
  col_tmp <- colnames(T_modelled)[i]
  for (j in 1:nrow(T_modelled)) {
    z_tmp <- z[j] - 0.1
    T_tmp <- MANSRT + As*e^(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
    T_modelled[j, col_tmp] <- T_tmp
  }
}
where
MANSRT <- -2.051185
As <- 11.59375
omega <- (2*pi)/(347.875*24*60*60)
c <- 790
k <- 0.00219
pb <- 2600
K <- (k*1000)/(c*pb)
e <- exp(1)
I do get the desired results but I keep thinking there must be a more efficient way of filling that data frame. The loop is quite slow and looks cumbersome to me. I guess there is an opportunity to take advantage of R's vectorized way of calculating. I just cannot see myself how to incorporate the formula in an easier way to fill T_modelled.
Anyone got any ideas how to get the same result in a faster, more "R-like" manner?
I believe this does it.
Run this first instruction right after creating T_modelled; it will be needed to test that the results are equal.
Tm <- T_modelled
Now run your code then run the code below.
z_tmp <- z - 0.1
for (i in 1:ncol(Tm)) {
  T_tmp <- MANSRT + As*exp(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
  Tm[, i] <- T_tmp
}
all.equal(T_modelled, Tm)
#[1] TRUE
You don't need the inner loop, that's the only difference.
(I also used exp directly but that is of secondary importance.)
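For what it's worth, the remaining loop can also be dropped by letting outer() build the whole matrix in one call (a sketch reusing the constants defined in the question; only the conversion back to a named data frame is extra):
Tm2 <- outer(z - 0.1, t_sec_ERT,
             function(zz, tt) MANSRT + As*exp(-zz*(omega/(2*K))^0.5) *
               sin(omega*tt - ((omega/(2*K))^0.5)*zz))
Tm2 <- as.data.frame(Tm2)
names(Tm2) <- paste0("t=", t_sec_ERT)
rownames(Tm2) <- paste0("z=", z)
all.equal(T_modelled, Tm2)
# should also be TRUE (within all.equal's numerical tolerance)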
Much like the solution to your previous question which you accepted, consider simply using sapply, iterating through the vector t_sec_ERT, which has the same length as your desired data frame's number of columns. But first subtract 0.1 from every element of z. Also, there's no need to create the empty data frame beforehand.
z_adj <- z - 0.1
T_modelled2 <- data.frame(sapply(t_sec_ERT, function(ert)
  MANSRT + As*e^(-z_adj*(omega/(2*K))^0.5)*sin(omega*ert - ((omega/(2*K))^0.5)*z_adj)))
colnames(T_modelled2) <- paste0("t=", t_sec_ERT)
rownames(T_modelled2) <- paste0("z=", z)
all.equal(T_modelled, T_modelled2)
# [1] TRUE
Rui is of course correct; I just want to suggest a way of reasoning when writing a loop like this.
You have two numeric vectors. Functions for numerics in R are usually vectorized. By which I mean you can do stuff like this
x <- c(1, 6, 3)
sum(x)
not needing something like this
x_ <- 0
for (i in x) {
x_ <- i + x_
}
x_
That is, there is no need for looping in R. Of course looping takes place nonetheless; it just happens in the underlying C, Fortran, etc. code, where it can be done much more efficiently. This is usually what we mean when we call a function vectorized: the looping takes place "under the hood", as it were. The output of Vectorize() thus isn't strictly vectorized by this definition.
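For instance, Vectorize() merely wraps mapply(), i.e. an R-level loop, so the resulting function is convenient to call on whole vectors but gains none of the compiled-loop speed (a small illustration):
vmax <- Vectorize(function(a, b) if (a > b) a else b)
vmax(c(1, 6, 3), c(4, 2, 5))
# [1] 4 6 5
# pmax(c(1, 6, 3), c(4, 2, 5)) gives the same result but loops in C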
When you have two numeric vectors you want to loop over, first check whether the constituent functions are vectorized, usually by reading the docs.
If they are, you continue by constructing the central vectorized compound function and start testing it with one vector and one scalar. In your case it would be something like this (testing with just the first element of t_sec_ERT).
z_tmp <- z - 0.1
i <- 1
T_tmp <- MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
Looks OK. Then you start looping over the elements of t_sec_ERT.
T_tmp <- matrix(nrow = length(z), ncol = length(t_sec_ERT))
for (i in 1:length(t_sec_ERT)) {
  T_tmp[, i] <- MANSRT + As *
    exp(-z_tmp*(omega/(2*K))^0.5) *
    sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
}
Or you can do it with sapply() which is often neater.
f <- function(x) {
  MANSRT + As *
    exp(-z_tmp*(omega/(2*K))^0.5) *
    sin(omega*x - ((omega/(2*K))^0.5)*z_tmp)
}
T_tmp <- sapply(t_sec_ERT, f)
I would prefer to put the data in a long format, with all combinations of z and t_sec_ERT as two columns, in order to take advantage of vectorization. Although I usually prefer tidyr for switching between long and wide formats, I've tried to keep this as a base solution:
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
v <- expand.grid(t_sec_ERT, z)
names(v) <- c("t_sec_ERT", "z")
v$z_tmp <- v$z-0.1
v$T_tmp <- MANSRT+As*e^(-v$z_tmp*(omega/(2*K))^0.5)*sin(omega*v$t_sec_ERT-((omega/(2*K))^0.5)*v$z_tmp)
T_modelled <- data.frame(matrix(v$T_tmp, nrow = length(z), ncol = length(t_sec_ERT), byrow = TRUE))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))
As part of a project I made a smoother to smooth out missing data. I use the previous slope of the last data points to calculate new values, and after calculating each new point I use it to calculate the next one (and so on). Hence I used a while loop to calculate each value, both from left to right and from right to left, and eventually take an average of the two values. This script works fine!
Although I expect I could significantly accelerate this with a function from the apply family, I still want to use this while loop. The script is, however, really slow (about 3 days for ~2,500,000 data points). Do you have any tips for speeding up the current script?
#Loop from: bottom -> top
number_rows <- nrow(weight_id)
i <- nrow(weight_id)
while (i >= 1) {
  j <- as.integer(weight_id[i, 1])
  prev1 <- temp[j+1, ]$new_MAP_bottom
  if (j < max(weight_id)) {
    previous_slope <- ifelse((temp[j+2, ]$duration - temp[j+1, ]$duration) > 0, prev1 - temp[j+2, ]$new_MAP_bottom, 0)
  } else {
    previous_slope <- 0
  }
  new_MAP <- round(prev1 + ((previous_slope - (factor*temp[j, ]$steps))/(1 + factor)), digits = 2)
  temp[j, ]$new_MAP_bottom <- new_MAP
  i <- i - 1
}
#Loop from: top -> bottom
weight_factor <- 0
i <- 1
while (i <= nrow(weight_id)) {
  j <- as.integer(weight_id[i, 1])
  prev1 <- temp[j-1, ]$new_MAP_top
  if (j > 2) {
    previous_slope <- ifelse((temp[j-1, ]$duration - temp[j-2, ]$duration) > 0, prev1 - temp[j-2, ]$new_MAP_top, 0)
  } else {
    previous_slope <- 0
  }
  new_MAP <- round(prev1 + ((previous_slope + (factor*temp[j, ]$steps))/(1 + factor)), digits = 2)
  temp[j, ]$new_MAP_top <- new_MAP

  #Take weighted average of two approaches (top -> bottom / bottom -> top)
  if (weight_factor < 1) { weight_factor <- temp[j, ]$weight - 1 }
  weight_top <- weight_factor
  weight_bottom <- temp[j, ]$weight - weight_factor
  if (weight_top > weight_bottom) { weight_top <- weight_top - 1 }
  if (weight_top < weight_bottom) { weight_bottom <- weight_bottom - 1 }
  temp[j, ]$MAP <- round(((new_MAP*weight_top) + (temp[j, ]$new_MAP_bottom*weight_bottom))/(weight_top + weight_bottom), digits = 0)
  weight_factor <- weight_factor - 1
  i <- i + 1
}
I did not read all of your code, especially without example data, but from the textual description it sounds like plain linear interpolation. Please check whether the built-in functions approx and approxfun already do what you are trying to implement yourself, as these will be optimized more than you can manage with reasonable effort.
par(mfrow=c(2,1))
example <- data.frame(x = 1:14,
                      y = c(3, 4, 5, NA, NA, NA, 6, 7, 8.1, 8.2, NA, 8.4, 8.5, NA))
plot(example)
f <- approxfun(example)
plot(example$x, f(example$x))
The apply family tends to give you shorter, more succinct code, but not necessarily much more speed than loops. If you are into speed, first check if somebody else has already implemented what you need, then try vectorization.
Edit:
The following runs in about a second on my computer. If this does something close enough to your own "linear smoother" so that you can replace yours with this, that is a speed increase of about 3 days.
n <- 2500000
example <- data.frame(x = 1:n,
                      y = sample(1:1000, n, replace = TRUE))
example$y[sample(1:n, n/5)] <- NA
print(Sys.time())
f <- approxfun(example)
mean(f(example$x))
print(Sys.time())
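As a side note, system.time() gives a more direct timing than the two print(Sys.time()) calls:
system.time({
  f <- approxfun(example)
  mean(f(example$x))
})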
I have a large data.frame with thousands of rows, over which I am looping: on each iteration I create a subset from row 1 up to one additional row compared with the previous iteration.
On each subset I perform several tasks which, coming from a pythonic understanding of programming, I carry out with 'for' loops. E.g.
df <- data.frame(a = c(1:10), b = c(11:20), c = c(21:30), d = c(31:40))

for (index in 1:nrow(df)) {
  thisSubset <- df[1:index, ]

  #loop 1
  new1 <- numeric(nrow(thisSubset))
  for (i in 1:nrow(thisSubset)) {
    var1 = 5 - thisSubset$b[i]
    new1[i] <- 1/exp(var1*log(2))
  }

  #loop 2
  new2 <- numeric()
  for (i in 1:nrow(thisSubset)) if (thisSubset$c[i] > 25) {
    new2 <- c(new2, (thisSubset$a[i]/exp(5*log(2))))
  }

  #loop 3
  new3 <- numeric(nrow(thisSubset))
  for (i in 1:nrow(thisSubset)) if (thisSubset$a[i] < 5) {
    new3[i] <- thisSubset$d[i] - thisSubset$d[i+1]/2
  } else {
    new3[i] <- thisSubset$d[i] - thisSubset$d[i-1]/2
  }

  #loop x
  #...
}
As my datasets grow larger, processing time increases exponentially to a few hours. I appreciate there are preferable ways of working in R to perform similar tasks (e.g. apply), but can I still use anything other than 'for' when several things are happening in each loop and also when multiple elements of each line are used? I would appreciate if someone could give me an example with each or any of the loops presented above.
For loops aren't really any slower than their *apply cousins (in fact, sometimes they can even be faster!). The real speed gains come from converting explicit for loops into vectorized code whenever possible. For example, loop 1 and loop 2 in your code can be converted to vectorized statements like so:
#loop 1
new1 <- 1 / exp((5 - thisSubset$b) * log(2))
#loop 2
new2 <- thisSubset$a[thisSubset$c > 25] / exp(5 * log(2))
Things get a little trickier to vectorize when your computations depend on the index of the vector (such as in loop 3), but there do exist packages with efficient implementations of various "rolling" functions (see the roll and zoo packages for details).
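For instance, a centred 3-point rolling mean with zoo (assuming the package is installed) is a one-liner:
library(zoo)
rollmean(c(1, 4, 2, 8, 5), k = 3)
# [1] 2.333333 4.666667 5.000000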
When you do need to use for loops, keep in mind that you should always pre-allocate your "results" vector before beginning iteration rather than growing it during iteration:
#bad
new2 <- numeric()
for (i in 1:nrow(thisSubset)) {
  if (thisSubset$c[i] > 25) {
    new2 <- c(new2, (thisSubset$a[i]/exp(5*log(2))))
  }
}

#good
new2 <- numeric(length = nrow(thisSubset))
for (i in 1:nrow(thisSubset)) {
  if (thisSubset$c[i] > 25) {
    new2[i] <- (thisSubset$a[i]/exp(5*log(2)))
  }
}
This prevents R from having to copy new2 after every iteration and results in much faster code.
R is a vectorized language and thus most functions can be applied to the entire column in a single statement without iterating down each row. For example I vectorized your three inner loops:
df <- data.frame(a=c(1:10), b=c(11:20), c=c(21:30), d=c(31:40))
thisSubset <- df
#loop 1
new1 <- numeric(nrow(thisSubset))
var1 <- 5 - thisSubset$b
new1 <- 1/exp(var1*log(2))

#loop 2
new2 <- numeric()
new2 <- (thisSubset$a[thisSubset$c > 25]/exp(5*log(2)))

#loop 3
new3 <- numeric(nrow(thisSubset))
dplus1 <- c(thisSubset$d[-1], thisSubset$d[length(thisSubset$d)])
dminus1 <- c(thisSubset$d[1], thisSubset$d[-length(thisSubset$d)])
new3 <- ifelse((thisSubset$a < 5), thisSubset$d - dplus1/2, thisSubset$d - dminus1/2)
#loop x
#...
Your outer loop was unnecessary.
See Mark's solution for additional comments.
I'm doing some small calculations and I decided to fill the data into a data.table, since it's much faster than a data.frame with rbind.
So basically my code is something like this:
df is a data.frame used in the calculation; what it contains is shown in the edit below.
l <- 12000
dti <- 1
dt <- data.table(ni = 0, nj = 0, regerr = 0)

for (i in seq(1, 12000, 200)) {
  for (j in seq(1, 12000, 200)) {

    for (ind in 1:nrow(df)) {
      if (i + j >= l/2) {
        df[ind, ]$X <- df[ind, ]$pos * 2
      } else {
        df[ind, ]$X <- df[ind, ]$pos/l
      }
    }

    for (i in 1:100) { # 100 sample
      sample(df$X, nrow(df), replace = FALSE)
      fit <- lm(X ~ gx, df) # linear regression calculation
      regerror <- sum(residuals(fit)^2)
      print(paste(i, j, regerror))
      set(dt, dti, 1L, as.double(i))
      set(dt, dti, 2L, as.double(j))
      set(dt, dti, 3L, regerror)
      dti <- dti + 1
    }
  }
}
The code prints the first few rounds of print(paste(i,j,regerror)) and then it quits with this error:
*** caught segfault ***
address 0x3ff00008, cause 'memory not mapped'
Segmentation fault (core dumped)
EDIT
structure(list(ax = c(-0.0242214, 0.19770304, 0.01587302, -0.0374415,
0.05079826, 0.12209738), gx = c(-0.3913043, -0.0242214, -0.4259067,
-0.725, -0.0374415, 0.01587302), pos = c(11222, 13564, 16532,
12543, 12534, 14354)), .Names = c("ax", "gx", "pos"), row.names = c(NA,
-6L), class = "data.frame")
Any ideas are appreciated.
Without meaning to sound rude, I think you may benefit from reading a few R tutorials before going forward. This question is also very likely to be closed as too localized. Also, segfaults are almost always a bug somewhere, but you can avoid a lot of this headache by understanding what each piece of your code is doing. Since it's Friday, let's walk through some of it:
if( i+j >= l/2 ){
data[ind,]$X = df[ind,]$pos * 2
}
else{
data[ind,]$X = df[ind,]$pos/l
}
I'll assume data is meant to be df and go from there. We're inside two loops of i and j that both go from 1 through 12000 in steps of 200. They will never sum to less than 1/2, so you will always execute the first statement. Also, if you ever expected the FALSE case to occur, you would need the else on the same line as your closing brace:
if (i + j >= 1/2) {
  df$X <- df$pos * 2
} else {
  df$X <- df$pos
}
R is vectorized, so doing the above is the same as looping through every value and multiplying by 2. I also removed the / 1 since it doesn't do anything. This whole section can be moved outside of the loop, since it's a constant operation: adding a column X that is double the column pos.
Next, your loop where you do a fit:
for (i in 1:100) { # 100 sample
  sample(df$X, nrow(df), replace = FALSE)
  fit <- lm(X ~ gx, df) # linear regression calculation
  regerror <- sum(residuals(fit)^2)
  print(paste(i, j, regerror))
  set(dt, dti, 1L, as.double(i))
  set(dt, dti, 2L, as.double(j))
  set(dt, dti, 3L, regerror)
  dti <- dti + 1
}
sample(df$X, nrow(df), replace=FALSE) only computes a reordered copy of the column and discards it; it doesn't actually assign anything. Instead use df$X <- sample(df$X, nrow(df), replace=FALSE).
Now, it looks like you're going to assign into dt (which is the name of a function, much like df, and should be avoided as a variable name) at row dti the result of this fit error as well as your indices? As far as I can tell, nothing depends on i or j. Instead, you're going to perform a randomly ordered fit 60 * 60 * 100 times... If that is what you want to do, by all means go for it! But do it in an efficient way:
df$X <- df$pos * 2

fit.fun <- function(n, dat) {
  jumble <- sample(nrow(dat))
  dat$X <- dat$X[jumble]
  sum(residuals(lm(X ~ gx, dat))^2)
}

sapply(1:10, fit.fun, dat = df)
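And if you do still want the (ni, nj, regerr) table that the original loop was building, you can get it without growing anything row by row; here is a sketch (hypothetical, reusing fit.fun from above, and note it still runs 60 * 60 * 100 = 360,000 regressions):
grid <- expand.grid(ni = seq(1, 12000, 200), nj = seq(1, 12000, 200), rep = 1:100)
grid$regerr <- sapply(seq_len(nrow(grid)), fit.fun, dat = df)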