I am doing a simulation study and I wrote the following R code. Is there any way to write this code without using two for loops, or to make it more efficient (run faster)?
S = 10000
n = 100
v = c(5,10,50,100)
beta0.mle = matrix(NA,S,length(v)) # creating four S-by-length(v) NA matrices
beta1.mle = matrix(NA,S,length(v))
beta0.lse = matrix(NA,S,length(v))
beta1.lse = matrix(NA,S,length(v))
for (j in 1:length(v)){
  for (i in 1:S){
    set.seed(i)
    beta0 = 50
    beta1 = 10
    x = rnorm(n)
    e.t = rt(n,v[j])
    y.t = e.t + beta0 + beta1*x
    func1 = function(betas){
      beta0 = betas[1]
      beta1 = betas[2]
      sum = sum(log(1+1/v[j]*(y.t-beta0-beta1*x)^2))
      return((v[j]+1)/2*sum)
    }
    beta0.mle[i,j] = nlm(func1,c(1,1),iterlim = 1000)$estimate[1]
    beta1.mle[i,j] = nlm(func1,c(1,1),iterlim = 1000)$estimate[2]
    beta0.lse[i,j] = lm(y.t~x)$coef[1]
    beta1.lse[i,j] = lm(y.t~x)$coef[2]
  }
}
The function func1 inside the inner for loop is the objective passed to nlm (to find the MLE when the errors are t-distributed).
I wanted to use the parallel package in R, but I couldn't find any functions that fit this problem.
The key to getting anything to run faster in R is replacing for loops with vectorized functions (such as the apply family). Additionally, as in any programming language, you should look for places where you are calling expensive functions (such as nlm) more than once with the same parameters, and store the result rather than recomputing it each time.
Here I start, as you did, by defining the parameters. Since beta0 and beta1 are always 50 and 10, I define those here as well.
S <- 10000
n <- 100
v <- c(5,10,50,100)
beta0 <- 50
beta1 <- 10
Next we define func1 outside the loop to avoid redefining it each time. func1 now has three extra parameters, v, y.t, and x, so that it can be called with the new values; nlm forwards any arguments given after the starting values on to the objective function.
func1 <- function(betas, v, y.t, x){
  beta0 <- betas[1]
  beta1 <- betas[2]
  sum <- sum(log(1+1/v*(y.t-beta0-beta1*x)^2))
  return((v+1)/2*sum)
}
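As a quick sanity check (an illustrative sketch with made-up data, not from the original code), the objective should be much smaller near the true parameters than far from them:
set.seed(1)
x <- rnorm(100)
y.t <- rt(100, 5) + 50 + 10*x
func1(c(50, 10), 5, y.t, x)  # near the truth: small value
func1(c(1, 1), 5, y.t, x)    # far from the truth: much larger value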
Now we do the real work. Rather than nested loops, we use nested apply calls. The outer lapply makes a list with one element per value of v, and the inner vapply builds a 4-by-S matrix of the four values you want (beta0.mle, beta1.mle, beta0.lse, beta1.lse), one column per replication.
values <- lapply(v, function(j) vapply(1:S, function(s) {
  # This should look familiar, it is taken from your code
  set.seed(s)
  x <- rnorm(n)
  e.t <- rt(n, j)
  y.t <- e.t + beta0 + beta1*x
  # Rather than running `nlm` and `lm` twice, we run each once and store the results
  nlmmod <- nlm(func1, c(1,1), j, y.t, x, iterlim = 1000)
  lmmod <- lm(y.t~x)
  # now we return the four values of interest
  c(beta0.mle = nlmmod$estimate[1],
    beta1.mle = nlmmod$estimate[2],
    beta0.lse = lmmod$coef[1],
    beta1.lse = lmmod$coef[2])
}, numeric(4))) # this tells `vapply` what to expect out of the function
Finally we can reorganize everything into the four matrices.
beta0.mle <- vapply(values, function(x) x["beta0.mle", ], numeric(S))
beta1.mle <- vapply(values, function(x) x["beta1.mle", ], numeric(S))
beta0.lse <- vapply(values, function(x) x["beta0.lse.(Intercept)", ], numeric(S))
beta1.lse <- vapply(values, function(x) x["beta1.lse.x", ], numeric(S))
As a final note, it may be possible to reorganize this to run even faster, depending on why you use the S index to set the seed. If it is important to know which seed generated each x from rnorm, then this may be the best I can do. However, if you only set it to ensure that every value of v is tested on the same draws of x, then there is more reorganizing we can do, possibly for more speed-up, using replicate.
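For example (a minimal sketch of that idea, assuming the seed only needs to pin down x and not the t-distributed errors, and using lapply rather than replicate for clarity), you could draw each x once and reuse it across all values of v:
xs <- lapply(seq_len(S), function(s) { set.seed(s); rnorm(n) })
values <- lapply(v, function(j) vapply(seq_len(S), function(s) {
  x <- xs[[s]]
  y.t <- rt(n, j) + beta0 + beta1*x
  nlmmod <- nlm(func1, c(1,1), j, y.t, x, iterlim = 1000)
  lmmod <- lm(y.t~x)
  c(beta0.mle = nlmmod$estimate[1],
    beta1.mle = nlmmod$estimate[2],
    beta0.lse = lmmod$coef[1],
    beta1.lse = lmmod$coef[2])
}, numeric(4)))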
I am fairly new to programming in R, so I apologize if this question is too basic. I am trying to study the properties of OLS with error terms created by three different processes (i.e., normal1, normal2, and chi-square). I include these in a list, 'fun_list'.
I would like to iterate through 1,000 (iter) regressions, each with sample size 500 (n). I would like to save all 1,000 × 500 observations in a dataset (big_data), as well as the regression results (reg_results).
At the end of the program, I would like 1,000 regressions for each of the three processes (for a total of 3,000 regressions). I have set up nested loops: the three functions on one level and the 1,000 iterations on a (sub-)level. I am having trouble getting the program to loop through the three different functions; I am not sure how to call each element of the list in this embedded loop. Any help would be greatly appreciated!
library(psych)
library(arm)
library(dplyr)
library(fBasics)
library(sjstats)
#set sample size and number of iterations
set.seed(12345)
n <- 500
iter <- 1000
#setting empty vectors. Probably a better way to do this. :)
bn <- rep(NA,iter)
sen <- rep(NA,iter)
#these are the three functions I want to use to generate en,
#which is the error term below. I want one loop for each of the three.
# I can get f1, f2 and f3 to work independently, but I can't get the list
#to work to cycle through all three.
f1 <- function (n) {rnorm(n, 0, 2)}
f2 <- function (n) {rnorm(n, 0, 10)}
f3 <- function (n) {rchisq(n, 2)}
fun_list <- list(f1, f2, f3)
#following line starting point for saving all iterations in one big
#dataset
datalist = list()
#if I remove the following line (for (j ....)), I can get this to work by
#referencing each function independently (i.e., using 'en <- f1(n)').
for (j in fun_list) {
  for (s in 1:iter) {
    # en <- f1(n)
    en <- fun_list[[1]]
    x <- rnorm(n, 0, .5)
    yn <- .3*x + en
    #this is the part that saves the data#
    dat <- data.frame(yn, x, en)
    dat$s <- s
    datalist[[s]] <- dat
    #### run model for normal data and save parameters###
    lm1n <- lm(yn ~ x)
    int.hatn <- coef(lm1n)[1]
    b.hatn <- coef(lm1n)[2]
    se.hatn <- se.coef(lm1n)[2]
    ##save them for each iteration
    bn[s] = b.hatn
    sen[s] = se.hatn
  }
}
reg_results <- tibble(bn, sen)
big_data <- do.call(rbind, datalist)
When using the loop, I get the following error:
Error in 0.3 * x + en : non-numeric argument to binary operator
I am assuming this is because I do not fully understand how to call each of the three functions in the list.
Here is a complete solution which wraps the multiple points discussed in the comments:
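First, the root cause of the error: en <- fun_list[[1]] assigns the function object itself, not a numeric vector, so .3*x + en tries to add a function to numbers. A minimal illustration (hypothetical snippet, not from the post):
class(f1)      # "function" - the function object itself
class(f1(10))  # "numeric"  - the result of calling it
The fix below keeps the assignment but then calls the function as en(n).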
library(psych)
library(arm)
library(dplyr)
library(fBasics)
library(sjstats)
#set sample size and number of iterations
set.seed(12345)
n <- 500
iter <- 1000
#setting empty vectors. Probably a better way to do this. :)
bn <- c()
sen <- c()
#these are the three functions I want to use to generate en,
#which is the error term below. I want one loop for each of the three.
# I can get f1, f2 and f3 to work independently, but I can't get the list
#to work to cycle through all three.
f1 <- function (n) {rnorm(n, 0, 2)}
f2 <- function (n) {rnorm(n, 0, 10)}
f3 <- function (n) {rchisq(n, 2)}
fun_list <- list(f1, f2, f3)
#following line starting point for saving all iterations in one big
#dataset
datalist = list()
#if I remove the following line (for (j ....)), I can get this to work by
#referencing each function independently (i.e., using 'en <- f1(n)').
for (j in seq_along(fun_list)) {
  en <- fun_list[[j]]
  for (s in 1:iter) {
    x <- rnorm(n, 0, .5)
    random_part <- en(n)
    yn <- .3*x + random_part
    #this is the part that saves the data#
    dat <- data.frame(yn, x, random_part)
    dat$s <- s
    #index by both j and s so iterations from different
    #error processes do not overwrite each other
    datalist[[(j - 1) * iter + s]] <- dat
    #### run model for normal data and save parameters###
    lm1n <- lm(yn ~ x)
    int.hatn <- coef(lm1n)[1]
    b.hatn <- coef(lm1n)[2]
    se.hatn <- se.coef(lm1n)[2]
    ##save them for each iteration
    bn <- c(bn, b.hatn)
    sen <- c(sen, se.hatn)
  }
}
reg_results <- tibble(bn, sen)
big_data <- do.call(rbind, datalist)
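One more note: growing bn and sen with c() reallocates the vectors on every iteration, which gets slow for large iter. Since the final length is known in advance (iter values per function), a preallocated version might look like this (a sketch; only the lines that change are shown):
bn  <- numeric(iter * length(fun_list))
sen <- numeric(iter * length(fun_list))
# ...and inside the inner loop, replacing the two c() calls:
k <- (j - 1) * iter + s
bn[k]  <- b.hatn
sen[k] <- se.hatn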
I have a system of ODEs and some data, and I am using the R packages deSolve and FME to fit the parameters of the ODE system to the data. I get a singular matrix result when I fit the full parameter set, so I went back and examined the collinearity of the parameters using a collinearity index cut-off of 20, as suggested in the FME package documentation, and picked a few models with subsets of parameters to fit. Then when I run modFit I get this error:
Error in approx(xMod, yMod, xout = xDat) :
need at least two non-NA values to interpolate
Can anyone enlighten me as to a fix for this? Everything else is working fine, so this is not a coding problem.
Here is a minimal working example (removing r = 2 in modFit creates the error; I can fix it in the minimal working example but not in my actual problem, so I doubt the minimal working example helps much here):
## =======================================================================
## Now suppose we do not know K and r and they are to be fitted...
## The "observations" are the analytical solution
## =======================================================================
# You need these packages
library('deSolve')
library('FME')
## logistic growth model
TT <- seq(1, 100, 2.5)
N0 <- 0.1
r <- 0.5
K <- 100
## analytical solution
Ana <- cbind(time = TT, N = K/(1 + (K/N0 - 1) * exp(-r*TT)))
time <- 0:100
parms <- c(r = r, K = K)
x <- c(N = N0)
logist <- function(t, x, parms) {
  with(as.list(parms), {
    dx <- r * x[1] * (1 - x[1]/K)
    list(dx)
  })
}
## Run the model with initial guess: K = 10, r = 2
parms["K"] <- 10
parms["r"] <- 2
init <- ode(x, time, logist, parms)
## FITTING algorithm uses modFit
## First define the objective function (model cost) to be minimised
## more general: using modFit
Cost <- function(P) {
  parms["K"] <- P[1]
  parms["r"] <- P[2]
  out <- ode(x, time, logist, parms)
  return(modCost(out, Ana))
}
(Fit <- modFit(p = c(K = 10, r = 2), f = Cost))
summary(Fit)
I think the problem is in your Cost function. If you don't provide both K and r, the Cost function overwrites the stored value of r with NA (because P[2] does not exist). You can test this:
Cost <- function(P) {
  parms["K"] <- P[1]
  parms["r"] <- P[2]
  print(parms)
  #out <- ode(x, time, logist, parms)
  #return(modCost(out, Ana))
}
Cost(c(K=10, r = 2))
Cost(c(K=10))
This function works:
Cost <- function(P) {
  parms[names(P)] <- P
  out <- ode(x, time, logist, parms)
  return(modCost(out, Ana))
}
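With this version you can also fit just a subset of the parameters, because anything you leave out of P keeps the value already stored in parms. For example (a sketch, not tested against the original data):
# fit K alone; r stays at whatever parms["r"] currently holds
Fit <- modFit(p = c(K = 10), f = Cost)
summary(Fit)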
The vignette FMEdyna is very helpful: https://cran.r-project.org/web/packages/FME/vignettes/FMEdyna.pdf - see page 14 on how to specify the objective (Cost) function.
I am trying to construct a new variable, z, using two pre-existing variables, x and y. Suppose for simplicity that there are only 5 observations (corresponding to 5 time periods) and that x = c(5,7,9,10,14) and y = c(0,2,1,2,3). I am really only using the first observation in x as the initial value, and then constructing the new variable z using depreciated values of x[1] (depreciation rate of 0.05 per annum) and each of the observations over time in the vector y. The variable I am constructing is a new 5-by-1 vector, z, and it can be obtained using the following simple commands in R:
z = NULL
for(i in 1:length(x)){
  n = seq(1, i, by = 1)
  z[i] = sum(c(0.95^(i-1)*x[1], 0.95^(i-n)*y[n]))
}
The problem I am having is that I need to define this operation as a function: a function f that returns the vector z whenever arbitrary vectors x and y are plugged in as f(x, y). I've been going around in circles for days now, and I was wondering if someone would be kind enough to suggest how to proceed. Thanks in advance.
I hope the following will work for you...
x=c(5,7,9,10,14)
y=c(0,2,1,2,3)
getZ = function(x, y){
  z = NULL
  for(i in 1:length(x)){
    n = seq(1, i, by = 1)
    z[i] = sum(c(0.95^(i-1)*x[1], 0.95^(i-n)*y[n]))
  }
  return(z)
}
z = getZ(x,y)
z
[1]  5.000000  6.750000  7.412500  9.041875 11.589781
This version allows 0.05 (or any other depreciation rate) to be passed in as r.
ConstructZ <- function(x, y, r){
  n <- length(y)
  d <- 1 - r
  Z <- vector(length = n)
  for(i in seq_along(x)){
    k <- seq_len(i)  # renamed from n to avoid shadowing length(y)
    Z[i] <- sum(c(d^(i-1)*x[1], d^(i-k)*y[k]))
  }
  return(Z)
}
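Called with the sample data and r = 0.05 (so d = 0.95), this reproduces the result above:
x <- c(5,7,9,10,14)
y <- c(0,2,1,2,3)
ConstructZ(x, y, 0.05)
# [1]  5.000000  6.750000  7.412500  9.041875 11.589781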
Here is a cool (if I say so myself) way to implement this as an infix operator (since you called it an operation).
ff = function (x, y, i) {
  n = seq.int(i)
  sum(c(0.95 ^ (i - 1) * x[[1]],
        0.95 ^ (i - n) * y[n]))
}
`%dep%` = function (x, y) sapply(seq_along(x), ff, x = x, y = y)
x %dep% y
[1] 5.000000 6.750000 7.412500 9.041875 11.589781
Recalculating the powers of 0.95 on every loop iteration may be inefficient. Here's another way to implement your calculation:
getval <- function(x, y, lambda = .95) {
  n <- length(y)
  pp <- lambda^(0:(n-1))  # powers computed once, outside the loop
  yy <- sapply(1:n, function(i) {
    sum(y * c(pp[i:1], rep.int(0, n-i)))
  })
  pp*x[1] + yy
}
Testing with @vrajs5's sample data:
x=c(5,7,9,10,14)
y=c(0,2,1,2,3)
getval(x,y)
# [1] 5.000000 6.750000 7.412500 9.041875 11.589781
but it appears to be about 10x faster when tested on larger data, such as:
set.seed(15)
x <- rpois(200,20)
y <- rpois(200,20)
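One rough way to check the speed claim yourself (a sketch; exact timings vary by machine) is to time repeated calls of the loop-based getZ from above against getval:
system.time(for (k in 1:100) getZ(x, y))     # loop version
system.time(for (k in 1:100) getval(x, y))   # precomputed-powers version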
I'm not sure how often you will run this or on what size of data, so perhaps efficiency isn't a concern for you. Readability is often more important for long-term maintenance.
I have the following code to create a sample function and to generate simulated data:
mean_detects <- function(obs, cens) {
  detects <- obs[cens == 0]
  nondetects <- obs[cens == 1]
  res <- mean(detects)
  return(res)
}
mu <- log(1); sigma <- log(3); n_samples <- 10; n_iterations <- 5; p <- 0.10
dset2 <- function (mu, sigma, n_samples, n_iterations, p) {
  X_after <- matrix(NA_real_, nrow = n_iterations, ncol = n_samples)
  delta <- matrix(NA_real_, nrow = n_iterations, ncol = n_samples)
  lod <- quantile(rlnorm(100000, mu, sigma), p = p)
  pct_cens <- numeric(n_iterations)
  count <- 1
  while(count <= n_iterations) {
    X_before <- rlnorm(n_samples, mu, sigma)
    X_after[count, ] <- pmax(X_before, lod)
    delta[count, ] <- X_before <= lod
    pct_cens[count] <- mean(delta[count, ])
    if (pct_cens[count] > 0 & pct_cens[count] < 1) count <- count + 1
  }
  ave_detects <- mean_detects(X_after, delta) ## how can I use apply or other functions here?
  return(ave_detects)
}
If I specify n_iterations = 1, I will have a 1x10 X_after matrix and a 1x10 delta matrix. Then the mean_detects function works fine using this command:
ave_detects <- mean_detects(X_after,delta)
However, when I increase n_iterations to, say, 5, I have two 5x10 matrices, X_after and delta, and the mean_detects function no longer works: it only gives me output for 1 iteration instead of 5. My real simulation has thousands of iterations, so speed and memory must also be taken into account.
Edit: I edited my code based on your comments. The mean_detects function I created was meant to show an example of using the X_after and delta matrices simultaneously. The real function is very long; that's why I did not post it here.
Your actual question isn't really clear. So:
"My function only takes in 1 dataframe." Actually, your function takes in two vectors.
"Write code that can use both X_after and delta." This doesn't really mean anything - sorry.
"Speed and memory must be taken into account." This is vague. Will you run out of memory? As a suggestion, you could think about a rolling mean. For example,
x = runif(5)
total = 0
for(i in seq_along(x)) {
  total = (i-1)*total/i + x[i]/i
  cat(i, ": mean ", total, "\n")
}
1 : mean 0.4409
2 : mean 0.5139
3 : mean 0.5596
4 : mean 0.6212
5 : mean 0.6606
Aside
Your dset2 function requires the variable n (which you haven't defined).
Your dset2 function doesn't return an obvious value.
Your mean_detects function can be simplified to:
mean(obs[cens == 0])
I want to use arms() to get one sample each time, inside a loop like the following one in my function. It runs very slowly. How can I make it run faster? Thanks.
library(HI)
dmat <- matrix(0, nrow=100,ncol=30)
system.time(
  for (d in 1:100){
    for (j in 1:30){
      y <- rep(0, 101)
      for (i in 2:100){
        y[i] <- arms(0.3, function(x) (3.5+0.000001*d*j*y[i-1])*log(x)-x,
                     function(x) (x>1e-4)*(x<20), 1)
      }
      dmat[d, j] <- sum(y)
    }
  }
)
This is a version based on Tommy's answer but avoiding all loops:
library(parallel) # on R older than 2.14.x, library(multicore)
set.seed(42)
m = 100
n = 30
system.time({
  arms.C <- getNativeSymbolInfo("arms")$address
  bounds <- 0.3 + convex.bounds(0.3, dir = 1, function(x) (x>1e-4)*(x<20))
  if (diff(bounds) < 1e-07) stop("pointless!")
  # create the vector of z values
  zval <- 0.00001 * rep(seq.int(n), m) * rep(seq.int(m), each = n)
  # apply the inner function to each grid point and return the matrix
  dmat <- matrix(unlist(mclapply(zval, function(z)
    sum(unlist(lapply(seq.int(100), function(i)
      .Call(arms.C, bounds, function(x) (3.5 + z * i) * log(x) - x,
            0.3, 1L, parent.frame())
    )))
  )), m, byrow = TRUE)
})
On a multicore machine this will be really fast, since it spreads the load across cores. On a single-core machine (or for poor Windows users) you can replace mclapply above with lapply and get only a slight speedup compared to Tommy's answer. But note that the result will differ for the parallel version, since it uses different RNG sequences.
Note that any C code that needs to evaluate R functions will be inherently slow (because interpreted code is slow). I added the arms.C call just to remove all R->C overhead to make moli happy ;), but it doesn't make any difference.
You could squeeze out a few more milliseconds by using column-major processing (the question's code was row-major, which requires re-copying, as R matrices are always column-major).
Edit: I noticed that moli changed the question slightly since Tommy answered - instead of the sum(...) part you now have to use a loop, since the y[i] are dependent, so function(z) would look like:
function(z) {
  y <- 0
  for (i in seq.int(99))
    y <- y + .Call(arms.C, bounds, function(x) (3.5 + z * y) * log(x) - x,
                   0.3, 1L, parent.frame())
  y
}
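Dropped into the mclapply call above in place of the sum(...) expression, the whole thing would look like this (a sketch, reusing the same zval grid and bounds):
dmat <- matrix(unlist(mclapply(zval, function(z) {
  y <- 0
  for (i in seq.int(99))
    y <- y + .Call(arms.C, bounds, function(x) (3.5 + z * y) * log(x) - x,
                   0.3, 1L, parent.frame())
  y
})), m, byrow = TRUE)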
Well, one effective way is to get rid of the overhead inside arms. It does some checks and calls indFunc every time, even though the result is always the same in your case.
Some other evaluations can also be done outside the loop. These optimizations bring the time down from 54 secs to around 6.3 secs on my machine, and the answer is identical.
set.seed(42)
#dmat2 <- ##RUN ORIGINAL CODE HERE##
# Now try this:
set.seed(42)
dmat <- matrix(0, nrow=100,ncol=30)
system.time({
  e <- new.env()
  bounds <- 0.3 + convex.bounds(0.3, dir = 1, function(x) (x>1e-4)*(x<20))
  f <- function(x) (3.5+z*i)*log(x)-x
  if (diff(bounds) < 1e-07) stop("pointless!")
  for (d in seq_len(nrow(dmat))) {
    for (j in seq_len(ncol(dmat))) {
      y <- 0
      z <- 0.00001*d*j
      for (i in 1:100) {
        y <- y + .Call("arms", bounds, f, 0.3, 1L, e)
      }
      dmat[d, j] <- y
    }
  }
})
all.equal(dmat, dmat2) # TRUE
Why not like this?
dat <- expand.grid(d=1:10, j=1:3, i=1:10)
arms.func <- function(vec) {
  require(HI)
  dji <- vec[1]*vec[2]*vec[3]
  arms.out <- arms(0.3,
                   function(x, params) (3.5 + 0.00001*params)*log(x) - x,
                   function(x, params) (x>1e-4)*(x<20),
                   n.sample = 1,
                   params = dji)
  return(arms.out)
}
dat$arms <- apply(dat,1,arms.func)
library(plyr)
out <- ddply(dat,.(d,j),summarise, arms=sum(arms))
matrix(out$arms,nrow=length(unique(out$d)),ncol=length(unique(out$j)))
However, it's still single-core and time consuming. But that isn't R being slow; it's the arms function.