How would the following be written using apply?
# Variables
age <- 1:100
Y <- age+5
d <- 0.25
dx <- 5
a_x <- 1:dx
Yd <- matrix( 0, nrow=max(age), ncol=dx )
# Nested loop is computationally inefficient?
for (a in age){
  for (ax in a_x){
    Yd[a,ax] <- (Y[[a]] * (1 - d) ** (ax-1))
  }
}
My model has a lot of these nested for loop structures, because I am incompetent. I am hoping to improve the computation time using apply, but I find the apply functions rather confusing to get into. I am looking for a solution that illustrates how to build such nested structures using apply. Hopefully, from there I can apply (pun intended) the solution to even more complicated nested for loops (4-5 loops within each other).
For example
Ydi <- vector("list", 6)
for (i in 1:6){
  Ydi[[i]] <- matrix( 0, nrow=max(age), ncol=dx )
}
# Nested loop is computationally inefficient?
for (i in 1:6){
  for (a in age){
    for (ax in a_x){
      Ydi[[i]][a,ax] <- (Y[[a]] * (1 - d) ** (ax-1)) + i
    }
  }
}
I would use expand.grid instead:
df <- data.frame(expand.grid(a = age, ax = a_x))
df[['Yd']] <- (df[['a']] + 5) * (1 - d) ** (df[['ax']] - 1)
This is infinitely extendable (subject to memory constraints) - each additional nested loop will just be an additional variable in your expand.grid call. For example:
new_col <- 1:2
df_2 <- data.frame(expand.grid(a = age, ax = a_x, nc = new_col))
df_2[['Yd']] <- (df_2[['a']] + 5) * (1 - d) ** (df_2[['ax']] - 1) + df_2[['nc']]
This essentially switches to a tidy data format, which is an easier way of storing multi-dimensional data.
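If you still need the matrix layout afterwards, the tidy column can be reshaped back (a minimal sketch using the objects defined above; Yd_from_df is just an illustrative name):
# expand.grid varies its first argument fastest, so df$Yd is already in
# column-major order for a max(age) x dx matrix
Yd_from_df <- matrix(df$Yd, nrow = max(age), ncol = dx)
all.equal(Yd_from_df, Yd)  # TRUE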
For easier syntax, and faster speed, you can use the data.table package:
library(data.table)
dt_3 <- data.table(expand.grid(a = age, ax = a_x, nc = new_col))
dt_3[ , Yd := (a + 5) * (1 - d) ** (ax - 1) + nc]
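And since the question asked specifically about apply: the same results can be produced with the apply family directly, one call per loop level (a sketch using the objects defined in the question; Yd_apply and Ydi_apply are just illustrative names):
# iterate over ax only: Y is already a vector, so the loop over a vanishes
Yd_apply <- sapply(a_x, function(ax) Y * (1 - d) ** (ax - 1))
# the three-level example becomes a list of such matrices, one per i
Ydi_apply <- lapply(1:6, function(i)
  sapply(a_x, function(ax) Y * (1 - d) ** (ax - 1) + i))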
Why does the computation of the following code in R take so much time? It takes many minutes, so I interrupted the calculation.
My aim is to adapt my simulated random numbers (sumzv, dim(sumzv) = 1000000 x 10) to my market model S_t (geometric Brownian motion).
The vectors m and s describe the drift and the standard deviation of the GBM; each contains 10 numbers. DEL is the time-step variable, and S_0 is a vector containing 10 stock prices at time 0.
n <- 1000000
k <- 10
S_t <- data.frame(matrix(0, nrow = n, ncol = k))
i <- 1
j <- 1
t <- 10
for (j in 1:k) {
  for (i in 1:n) {
    S_t[i, j] <- S_0[j] * exp(m[j] * t * DEL + s[j] * sqrt(DEL) * sumzv[i, j])
  }
}
Thank you for your help. Please keep in mind that I'm a beginner :)
Unfortunately, I couldn't find any helpful information on the internet so far. Some pages said vectorization is helpful for speeding up R code, but I couldn't see how to make that work here.
I tried to break down the data frames into vectors, but this got very complex.
The following code, with a vectorized inner loop, is equivalent to the posted code.
It also pre-computes some inner loop vectors, fac1 and fac2.
S_t <- data.frame(matrix(0, nrow = n, ncol = k))
fac1 <- m * t * DEL
fac2 <- s * sqrt(DEL)
for (j in 1:k) {
  S_t[, j] <- S_0[j] * exp(fac1[j] + fac2[j] * sumzv[, j])
}
The fully vectorized version of the loop on j above is the one-liner below. The transposes are needed because R is column major and we are multiplying by row vectors indexed on j = 1:k.
S_t2 <- t(S_0 * exp(fac1 + fac2 * t(sumzv)))
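A quick equivalence check on small toy inputs (a minimal sketch; the values of m, s, S_0, DEL and sumzv below are made up purely for the comparison):
n <- 5; k <- 3; t <- 10; DEL <- 1/250
set.seed(1)
m <- runif(k); s <- runif(k); S_0 <- runif(k, 50, 150)
sumzv <- matrix(rnorm(n * k), n, k)
fac1 <- m * t * DEL
fac2 <- s * sqrt(DEL)
S_t <- data.frame(matrix(0, nrow = n, ncol = k))
for (j in 1:k) S_t[, j] <- S_0[j] * exp(fac1[j] + fac2[j] * sumzv[, j])
S_t2 <- t(S_0 * exp(fac1 + fac2 * t(sumzv)))
all.equal(as.matrix(S_t), S_t2, check.attributes = FALSE)  # TRUE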
I am writing a loop to calculate a numerator which is part of a larger formula. I used a for loop, but it is taking a lot of time to compute. What would be a better way to do this?
city is a data frame with the following columns: pop, not.white, pct.not.white
n <- nrow(city)
numerator = 0
for(i in 1:n) {
  ti <- city$pop[i]
  pi <- city$pct.not.white[i]
  for(j in 1:n) {
    tj <- city$pop[j]
    pj <- city$pct.not.white[j]
    numerator = numerator + (ti * tj) * abs(pi - pj)
  }
}
Use the following toy data for result validation.
set.seed(0)
city <- data.frame(pop = runif(101), pct.not.white = runif(101))
The most obvious "vectorization":
# n <- nrow(city)
titj <- tcrossprod(city$pop)
pipj <- outer(city$pct.not.white, city$pct.not.white, "-")
numerator <- sum(titj * abs(pipj))
This will probably run into memory problems if n > 5000, since each n x n double matrix takes 8n^2 bytes (already 200 MB at n = 5000).
A clever workaround (exploiting symmetry; a more memory-efficient "vectorization"):
## see https://stackoverflow.com/a/52086291/4891738 for function: tri_ind
n <- nrow(city)
ij <- tri_ind(n, lower = TRUE, diag = FALSE)
titj <- city$pop[ij$i] * city$pop[ij$j]
pipj <- abs(city$pct.not.white[ij$i] - city$pct.not.white[ij$j])
numerator <- 2 * crossprod(titj, pipj)[1]
The ultimate solution is to write the loop in C / C++, which I will not showcase here.
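For reference, a quick way to validate the vectorized result against the posted double loop on the toy data (a minimal sketch; numerator_loop is just an illustrative name for the brute-force value):
set.seed(0)
city <- data.frame(pop = runif(101), pct.not.white = runif(101))
# brute-force reference value from the original nested loop
numerator_loop <- 0
for (i in 1:nrow(city)) {
  for (j in 1:nrow(city)) {
    numerator_loop <- numerator_loop + city$pop[i] * city$pop[j] *
      abs(city$pct.not.white[i] - city$pct.not.white[j])
  }
}
# the "most obvious" vectorization from above
numerator <- sum(tcrossprod(city$pop) *
                 abs(outer(city$pct.not.white, city$pct.not.white, "-")))
all.equal(numerator, numerator_loop)  # TRUE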
I am doing a simulation study and I wrote the following R code. Is there any way to write this code without using the two for loops, or to make it more efficient (run faster)?
S = 10000
n = 100
v = c(5,10,50,100)
beta0.mle = matrix(NA,S,length(v)) # creating four S x length(v) matrices of NAs
beta1.mle = matrix(NA,S,length(v))
beta0.lse = matrix(NA,S,length(v))
beta1.lse = matrix(NA,S,length(v))
for (j in 1:length(v)){
  for (i in 1:S){
    set.seed(i)
    beta0 = 50
    beta1 = 10
    x = rnorm(n)
    e.t = rt(n,v[j])
    y.t = e.t + beta0 + beta1*x
    func1 = function(betas){
      beta0 = betas[1]
      beta1 = betas[2]
      sum = sum(log(1+1/v[j]*(y.t-beta0-beta1*x)^2))
      return((v[j]+1)/2*sum)
    }
    beta0.mle[i,j] = nlm(func1,c(1,1),iterlim = 1000)$estimate[1]
    beta1.mle[i,j] = nlm(func1,c(1,1),iterlim = 1000)$estimate[2]
    beta0.lse[i,j] = lm(y.t~x)$coef[1]
    beta1.lse[i,j] = lm(y.t~x)$coef[2]
  }
}
The function func1 inside the inner for loop is used by nlm (to find the MLE when the errors are t-distributed).
I wanted to use the parallel package in R, but I didn't find any functions there that seemed useful.
The key to getting anything to run faster in R is replacing for loops with vectorized functions (such as the apply family). Additionally, as in any programming language, you should look for places where you are calling expensive functions (such as nlm) more than once with the same parameters, and store the results rather than recomputing them each time.
Here I am starting, as you did, by defining the parameters. Also, since beta0 and beta1 are always 50 and 10, I am going to define those here as well.
S <- 10000
n <- 100
v <- c(5,10,50,100)
beta0 <- 50
beta1 <- 10
Next we will define func1 outside the loop to avoid redefining it each time. func1 now has three extra parameters, v, y.t, and x, so that it can be called with the new values.
func1 <- function(betas, v, y.t, x){
  beta0 <- betas[1]
  beta1 <- betas[2]
  sum <- sum(log(1+1/v*(y.t-beta0-beta1*x)^2))
  return((v+1)/2*sum)
}
Now we actually do the real work. Rather than having nested loops, we use nested apply statements. The outer lapply will make a list for each value of v, and the inner vapply will make a matrix with the four values you want (beta0.mle, beta1.mle, beta0.lse, beta1.lse) for each of the S simulations.
values <- lapply(v, function(j) vapply(1:S, function(s) {
  # This should look familiar; it is taken from your code
  set.seed(s)
  x <- rnorm(n)
  e.t <- rt(n,j)
  y.t <- e.t + beta0 + beta1*x
  # Rather than running `nlm` and `lm` twice each, we run each once and store the results
  nlmmod <- nlm(func1, c(1,1), j, y.t, x, iterlim = 1000)
  lmmod <- lm(y.t~x)
  # now we return the four values of interest
  c(beta0.mle = nlmmod$estimate[1],
    beta1.mle = nlmmod$estimate[2],
    beta0.lse = lmmod$coef[1],
    beta1.lse = lmmod$coef[2])
}, numeric(4)) # this tells `vapply` what to expect out of the function
)
Finally we can reorganize everything into the four matrices.
beta0.mle <- vapply(values, function(x) x["beta0.mle", ], numeric(S))
beta1.mle <- vapply(values, function(x) x["beta1.mle", ], numeric(S))
beta0.lse <- vapply(values, function(x) x["beta0.lse.(Intercept)", ], numeric(S))
beta1.lse <- vapply(values, function(x) x["beta1.lse.x", ], numeric(S))
As a final note, it may be possible to reorganize this to run even faster, depending on why you are using the S index to set the seed. If it is important to know exactly which seed was used to generate your x with rnorm, then this may be the best I can do. However, if you are only doing it to ensure that all of your values of v are tested on the same values of x, then there may be more reorganizing we can do for additional speed-up using replicate, as sketched below.
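A minimal sketch of that replicate-based reorganization (an assumption on my part: the per-iteration seed only matters for giving every v the same x, so the exact random streams, and hence the numbers, will differ from the set.seed(i) version; values2 is just an illustrative name):
set.seed(1)
xs <- replicate(S, rnorm(n))  # one n-vector of predictors per simulation, shared across all v
values2 <- lapply(v, function(j) vapply(1:S, function(s) {
  x <- xs[, s]
  y.t <- rt(n, j) + beta0 + beta1 * x
  nlmmod <- nlm(func1, c(1, 1), j, y.t, x, iterlim = 1000)
  lmmod <- lm(y.t ~ x)
  c(beta0.mle = nlmmod$estimate[1],
    beta1.mle = nlmmod$estimate[2],
    beta0.lse = lmmod$coef[1],
    beta1.lse = lmmod$coef[2])
}, numeric(4)))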
I'm trying to make a model using deSolve with a fairly large number of states. One of the states, 'foo', is actually made up of 15 different states, foo[1,1:5], foo[2,1:5], and foo[3,1:5], so I thought it would be easiest to pass the function a matrix of states instead of typing them out individually, and then refer to them by indexing:
par <- rep(NA,3)
par_names <- c('alpha','prog','death_rate')
names(par) <-par_names
par['alpha'] <- 0.7
par['prog'] <- 0.8
par['death_rate'] <- 0.3
foo <- matrix(0,nrow = 3,ncol = 5)
states <- foo
my_func <- function(t,states,par){
  with(as.list(c(states,par)),{
    for (j in 1:5){
      dfoo[1,j] <- par['alpha']*par['prog']*foo[1,j] - par['death_rate']*foo[1,j]
      dfoo[2,j] <- par['prog']*foo[1,j] - par['prog']*foo[2,j] - par['death_rate']*foo[2,j]
      dfoo[3,j] <- par['prog']*foo[2,j] - par['prog']*foo[3,j] - par['death_rate']*foo[3,j]
    }
    list(c(
      dfoo[]
    ))
  })
}
times <- seq(1,365,by=1)
library(deSolve)
alldata <- as.data.frame(ode(y=states,times=times,func=my_func,parms=par))
I've tried to fix it but I just keep getting the same error:
Error in dfoo[1, j] <- par["alpha"] * par["prog"] * foo[1, j] - par["death_rate"] * :
object 'dfoo' not found
So does anyone know how to make this work, or an easier way of doing it?
Yes, you can pass a matrix in as your states. But every time ode calls your function (except for the first time), it will pass a vector rather than a matrix, so you need to convert it back to a matrix at the beginning of your function.
Your code also uses some unnecessary contortions to create the data. More importantly, as pointed out in the comments, your function never initializes dfoo, which is what produces the 'object not found' error. Finally, the for loop in your function can be handled more cleanly with a few vectorized operations. Here is an example:
my_func <- function(t, states, par){
  foo <- matrix(states, nrow = 3, ncol = 5)
  dfoo <- with(as.list(par), rbind(
    (prog * alpha * foo[1,]) - (death_rate * foo[1,]),
    (prog * foo[-nrow(foo),]) - (prog * foo[-1,]) - (death_rate * foo[-1,])
  ))
  list(dfoo)
}
library(deSolve)
par <- c(alpha = 0.7, prog = 0.8, death_rate = 0.3)
states <- matrix(0,nrow = 3,ncol = 5)
out <- ode(y=states, times=1:365, func=my_func, parms=par)  # avoid naming the result `ode`
alldata <- as.data.frame(out)
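If you later need the states back in their 3 x 5 layout, any row of the output can be reshaped the same way (a minimal sketch; foo_end is just an illustrative name, and column 1 of the ode output is time):
# states at the final time point, dropping the time column and restoring
# the column-major 3 x 5 layout used for the initial values
foo_end <- matrix(as.numeric(out[nrow(out), -1]), nrow = 3, ncol = 5)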
I have a recursive computation for a matrix A (this will be a hat-matrix), for example:
A(i) = A(i-1) + crossprod(B,A(i-1))
For each step i I need the trace of A(i). Is there a faster way to implement this in R than the following implementation:
# define random matrices
set.seed(123)
n <- 7^2*10^4
steps <- 10
A <- matrix(rnorm(n), ncol=sqrt(n))
B <- matrix(rnorm(n), ncol=sqrt(n))
# preallocation
Amat <- traceA <- vector("list", steps)
Amat[[1]] <- A
# recursive computation for matrix A(i)
ptm <- proc.time()
for(i in 2:steps){
  Amat[[i]] <- Amat[[i-1]] + crossprod(B,Amat[[i-1]])
  traceA[[i]] <- sum(diag(Amat[[i]]))
}
proc.time() - ptm
I would like to mention that the matrices A(i) and B are symmetric and idempotent (because they are hat matrices of a linear model) and can be extremely big. I guess that parallel computation will fail here, because the for loop needs the matrix A(i-1) from the previous step.
The idea behind this is a likelihood-based boosting algorithm, where I need the trace of the hat matrix in each boosting iteration, computed as above.
Looks like your Amat_i can be written as Amat_i = (I + t(B))^(i-1) %*% A, and since B is idempotent (B %*% B = B, hence also t(B) %*% t(B) = t(B)), the binomial expansion collapses:
(I + B)^n = I + choose(n,1)*B + choose(n,2)*B^2 + ... + choose(n,n)*B^n
          = I + B * (choose(n,1) + choose(n,2) + ... + choose(n,n))   # B^k = B for all k >= 1
          = I + B * (2^n - 1)
Putting it all together:
tr(Amat_i) = tr(A) + (2^(i-1) - 1) * tr(t(B) %*% A)
So just calculate the two traces and then you won't need to do any more matrix multiplications to get all of the tr(Amat_i)'s.
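A quick numerical sanity check of the closed form, using small hat matrices so that A and B really are symmetric and idempotent (a minimal sketch; X1 and X2 are just arbitrary design matrices):
set.seed(1)
X1 <- matrix(rnorm(30), 10, 3)
X2 <- matrix(rnorm(30), 10, 3)
A <- X1 %*% solve(crossprod(X1)) %*% t(X1)  # hat matrix of X1
B <- X2 %*% solve(crossprod(X2)) %*% t(X2)  # hat matrix of X2
trBA <- sum(diag(crossprod(B, A)))
Amat <- A
for (i in 2:6) {
  Amat <- Amat + crossprod(B, Amat)                # the original recursion
  closed <- sum(diag(A)) + (2^(i-1) - 1) * trBA    # the closed form
  stopifnot(isTRUE(all.equal(sum(diag(Amat)), closed)))
}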