Use apply function on 2 separate lists in R

I have the following code to create a sample function and to generate simulated data
mean_detects<- function(obs,cens) {
detects <- obs[cens==0]
nondetects <- obs[cens==1]
res <- mean(detects)
return(res)
}
mu <- log(1); sigma <- log(3); n_samples <- 10; n_iterations <- 5; p <- 0.10
dset2 <- function (mu, sigma, n_samples, n_iterations, p) {
X_after <- matrix(NA_real_, nrow = n_iterations, ncol = n_samples)
delta <- matrix(NA_real_, nrow = n_iterations, ncol = n_samples)
lod <- quantile(rlnorm(100000, mu, sigma), p = p)
pct_cens <- numeric(n_iterations)
count <- 1
while(count <= n_iterations) {
X_before <- rlnorm(n_samples, mu, sigma)
X_after[count, ] <- pmax(X_before, lod)
delta [count, ] <- X_before <= lod
pct_cens[count] <- mean(delta[count,])
if (pct_cens [count] > 0 & pct_cens [count] < 1 ) count <- count + 1 }
ave_detects <- mean_detects(X_after,delta) ## how can I use apply or other functions here?
return(ave_detects)
}
If I specify n_iterations = 1, X_after and delta are both 1x10 matrices, and the mean_detects function works fine with this command:
ave_detects <- mean_detects(X_after,delta)
However, when I increase n_iterations to, say, 5, X_after and delta are both 5x10 matrices and mean_detects no longer works as intended: it returns a single value instead of one value per iteration. My real simulation has thousands of iterations, so speed and memory must also be taken into account.
Edit: I revised my code based on your comments. The mean_detects function is only meant as an example of a function that uses the X_after and delta matrices simultaneously. The real function is very long, which is why I did not post it here.

Your actual question isn't really clear. So,
"My function only takes in 1 dataframe". Actually, your function takes in two vectors.
"Write code that can use both X_after and delta". This doesn't really mean anything, sorry.
"speed and memory must be taken into account". This is vague. Will you run out of memory? As a suggestion, you could think about a running mean, which is updated one value at a time. For example,
x = runif(5)
total = 0
for(i in seq_along(x)) {
total = (i-1)*total/i + x[i]/i
cat(i, ": mean ", total, "\n")
}
1 : mean 0.4409
2 : mean 0.5139
3 : mean 0.5596
4 : mean 0.6212
5 : mean 0.6606
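As a quick sanity check on the update rule above, the same running means can be computed in one vectorized step with cumsum. This is only a sketch; the seed and the names loop_means and running are illustrative and not part of the original answer.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
x <- runif(5)

# Loop version: update the mean one observation at a time
total <- 0
loop_means <- numeric(length(x))
for (i in seq_along(x)) {
  total <- (i - 1) * total / i + x[i] / i
  loop_means[i] <- total
}

# Vectorized version: cumulative sum divided by the number of values seen so far
running <- cumsum(x) / seq_along(x)

all.equal(loop_means, running)    # both approaches agree
```

The loop form is the one to keep when values arrive one at a time and should not all be stored; the cumsum form is simpler when the whole vector is already in memory.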
Aside
Your dset2 function requires the variable n (which you haven't defined).
Your dset2 function doesn't return an obvious value.
Your mean_detects function can be simplified to:
mean(obs[cens==0])
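For the question in the code comment (how to call mean_detects when X_after and delta have several rows), one option is to loop over the row index so that each call receives the matching row of each matrix. The sketch below uses small made-up matrices; mean_detects_rows is a hypothetical helper name, and the rowSums variant only works for this simple mean, not for an arbitrary function of both inputs.

```r
mean_detects <- function(obs, cens) mean(obs[cens == 0])

# Toy data: 3 iterations (rows) of 4 samples each; delta == 1 marks a nondetect
X_after <- matrix(c(1, 2, 3, 4,
                    5, 5, 6, 7,
                    2, 8, 2, 9), nrow = 3, byrow = TRUE)
delta   <- matrix(c(1, 0, 0, 0,
                    1, 1, 0, 0,
                    1, 0, 1, 0), nrow = 3, byrow = TRUE)

# General pattern: one call per row, passing matching rows of both matrices
mean_detects_rows <- function(X, d) {
  vapply(seq_len(nrow(X)),
         function(i) mean_detects(X[i, ], d[i, ]),
         numeric(1))
}
by_row <- mean_detects_rows(X_after, delta)

# Fully vectorized alternative for this particular statistic
detected <- delta == 0
by_row_fast <- rowSums(X_after * detected) / rowSums(detected)
```

The vapply pattern carries over to a longer function that needs both inputs, since only the inner call changes.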

Related

Area Under the Curve using Simpson's rule in R

I would like to compute the Area Under the Curve defined by a set of experimental values. I created a function to calculate an approximation of the AUC using Simpson's rule, as I saw in this post. However, the function only works when it receives a vector of odd length. How can I modify the code to add the area of the last trapezoid when the input vector has an even length?
AUC <- function(x, h=1){
# AUC function computes the Area Under the Curve of a time series using
# Simpson's Rule (numerical method).
# https://link.springer.com/chapter/10.1007/978-1-4612-4974-0_26
# Arguments
# x: (vector) time series values
# h: (int) temporal resolution of the time series. default h=1
n = length(x)-1
xValues = seq(from=1, to=n, by=2)
sum <- list()
for(i in 1:length(xValues)){
n_sub <- xValues[[i]]-1
n <- xValues[[i]]
n_add <- xValues[[i]]+1
v1 <- x[[n_sub+1]]
v2 <- x[[n+1]]
v3 <- x[[n_add+1]]
s <- (h/3)*(v1+4*v2+v3)
sum <- append(sum, s)
}
sum <- unlist(sum)
auc <- sum(sum)
return(auc)
}
Here a data example:
smoothed = c(0.3,0.317,0.379,0.452,0.519,0.573,0.61,0.629,0.628,0.613,0.587,0.556,0.521,
0.485,0.448,0.411,0.363,0.317,0.273,0.227,0.185,0.148,0.12,0.103,0.093,0.086,
0.082,0.079,0.076,0.071,0.066,0.059,0.053,0.051,0.052,0.057,0.067,0.081,0.103,
0.129,0.165,0.209,0.252,0.292,0.328,0.363,0.398,0.431,0.459,0.479,0.491,0.494,
0.488,0.475,0.457,0.43,0.397,0.357,0.316,0.285,0.254,0.227,0.206,0.189,0.181,
0.171,0.157,0.151,0.162,0.192,0.239)
One recommended way to handle an even number of points and still achieve precision is to combine Simpson's 1/3 rule with Simpson's 3/8 rule, which can handle an even number of points. Such approaches can be found in (at least one or perhaps more) engineering textbooks on numerical methods.
However, as a practical matter, you can write a code chunk to check the data length and add a single trapezoid at the end, as was suggested in the last comment of the post to which you linked. I wouldn't assume that it is necessarily as precise as combining Simpson's 1/3 and 3/8 rules, but it is probably reasonable for many applications.
I would double-check my code edits below, but this is the basic idea.
AUC <- function(x, h=1){
# AUC function computes the Area Under the Curve of a time series using
# Simpson's Rule (numerical method).
# https://link.springer.com/chapter/10.1007/978-1-4612-4974-0_26
# Arguments
# x: (vector) time series values
# h: (int) temporal resolution of the time series. default h=1
#jh edit: check for even data length
#and chop off last data point if even
nn = length(x)
if(length(x) %% 2 == 0){
xlast = x[length(x)]
x = x[-length(x)]
}
n = length(x)-1
xValues = seq(from=1, to=n, by=2)
sum <- list()
for(i in 1:length(xValues)){
n_sub <- xValues[[i]]-1
n <- xValues[[i]]
n_add <- xValues[[i]]+1
v1 <- x[[n_sub+1]]
v2 <- x[[n+1]]
v3 <- x[[n_add+1]]
s <- (h/3)*(v1+4*v2+v3)
sum <- append(sum, s)
}
sum <- unlist(sum)
auc <- sum(sum)
##jh edit: add trapezoid for last two data points to result
if(nn %% 2 == 0){
auc <- auc + (x[length(x)] + xlast)/2 * h
}
return(auc)
}
sm = smoothed[-length(smoothed)]
length(sm)
[1] 70
#even data as an example
AUC(sm)
[1] 20.17633
#original odd data
AUC(smoothed)
[1] 20.389
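The loop in AUC can also be written with vectorized sums. The following is a sketch of the same idea (composite Simpson's 1/3 rule, plus one trapezoid when the number of points is even); simpson13 and auc_simpson are hypothetical names, and the test data are made up so that the exact answer is known.

```r
# Composite Simpson's 1/3 rule for an odd number of equally spaced points
simpson13 <- function(x, h = 1) {
  n <- length(x)
  stopifnot(n >= 3, n %% 2 == 1)
  w4 <- seq(2, n - 1, by = 2)                               # points with weight 4
  w2 <- if (n > 3) seq(3, n - 2, by = 2) else integer(0)    # points with weight 2
  (h / 3) * (x[1] + 4 * sum(x[w4]) + 2 * sum(x[w2]) + x[n])
}

# Even length: Simpson on all but the last point, then one trapezoid
auc_simpson <- function(x, h = 1) {
  n <- length(x)
  if (n %% 2 == 1) return(simpson13(x, h))
  stopifnot(n >= 4)
  simpson13(x[-n], h) + h * (x[n - 1] + x[n]) / 2
}

t_odd  <- 0:4                        # 5 points
t_even <- 0:5                        # 6 points
auc_odd  <- auc_simpson(t_odd^2)     # Simpson's rule is exact for quadratics
auc_even <- auc_simpson(t_even^2)    # exact part on [0, 4] plus a trapezoid on [4, 5]
```

For the odd case the result equals the exact integral of t^2 on [0, 4], which is 64/3; the even case adds the trapezoid (16 + 25)/2 for the last interval.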
There may be a good reason for you to prefer using Simpson's rule, but if you're just looking for a quick and efficient estimate of AUC, the trapezoid rule is far easier to implement, and does not require an even number of breaks:
AUC <- function(x, h = 1) sum((x[-1] + x[-length(x)]) / 2 * h)
AUC(smoothed)
#> [1] 20.3945
Here, I show example code that uses the Simpson's 1/3 and 3/8 rules in tandem for the numerical integration of data. As always, the usual caveats about the possibility of coding errors or compatibility issues apply.
The output at the end compares the numerical estimates from this algorithm and from the trapezoidal rule against the value returned by R's "integrate" function.
#Algorithm adapted from:
#Numerical Methods for Engineers, Seventh Edition,
#By Chapra and Canale, page 623
#Modified to accept data instead of functional values
#Modified by: Jeffrey Harkness, M.S.
##Begin Simpson's rule function code
simp13 <- function(dat, h = 1){
ans = 2*h*(dat[1] + 4*dat[2] + dat[3])/6
return(ans)}
simp13m <- function(dat, h = 1){
summ <- dat[1]
n <- length(dat)
nseq <- seq(2,(n-2),2)
for(i in nseq){
summ <- summ + 4*dat[i] + 2*dat[i+1]}
summ <- summ + 4*dat[n-1] + dat[n]
result <- (h*summ)/3
return(result)}
simp38 <- function(dat, h = 1){
ans <- 3*h*(dat[1] + 3*sum(dat[2:3]) + dat[4])/8
return(ans)}
simpson = function(dat, h = 1){
hin = h
len = length(dat)
comp <- len %% 2
##number of segments
if(len == 2){
ans = sum(dat)/2*h} ##n = 2 is the trapezoidal rule
if(len == 3){
ans = simp13(dat, h = hin)}
if(len == 4){
ans = simp38(dat,h = hin)}
if(len == 6){
ans <- simp38(dat[1:4],h = hin) + simp13(dat[4:len],h = hin)}
if(len > 6 & comp == 0){
ans = simp38(dat[1:4],h = hin) + simp13m(dat[4:len],h = hin)}
if(len >= 5 & comp == 1){
ans = simp13m(dat,h = hin)}
return(ans)}
##End Simpson's rule function code
This next section of code shows the performance comparison. This code can easily be altered for different test functions and cases.
The precision difference tends to change with the sample size and test function used; this example is not intended to imply that the difference is always this pronounced.
#other algorithm for comparison purposes, from Allan Cameron above
oa <- function(x, h = 1) sum((x[-1] + x[-length(x)]) / 2 * h)
#Testing and algorithm comparison code
simans = NULL; oaans = NULL; simerr = NULL; oaerr = NULL; mp = NULL
for( j in 1:10){
n = j
#f = function(x) cos(x) + 2 ##Test functions
f = function(x) 0.2 + 25*x - 200*x^2 + 675*x^3 - 900*x^4 + 400*x^5
a = 0;b = 10
h = (b-a)/n
datain = seq(a,b,by = h)
preans = integrate(f,a,b)$value #precise numerical estimate of test function
simans[j] = simpson(f(datain), h = h)
oaans[j] = oa(f(datain), h = h)
(simerr[j] = abs(simans[j] - preans)/preans * 100)
(oaerr[j] = abs(oaans[j] - preans)/preans * 100)
mp[j] = simerr[j] < oaerr[j]
}
(outframe = data.frame("simpsons percent diff" = simerr,"trapezoidal percent diff" = oaerr, "more precise?" = mp, check.names = F))
simpsons percent diff trapezoidal percent diff more precise?
1 214.73489738 214.734897 FALSE
2 15.07958148 64.993410 TRUE
3 6.70203621 29.816799 TRUE
4 0.94247384 16.955208 TRUE
5 0.54830021 10.905620 TRUE
6 0.18616767 7.593825 TRUE
7 0.12051767 5.588209 TRUE
8 0.05890462 4.282980 TRUE
9 0.04087107 3.386525 TRUE
10 0.02412733 2.744500 TRUE

Trying to simulate Poisson samples using inverse CDF method but my R function produces wrong results

I wrote some R code for simulating random samples from a Poisson distribution, based on the description of an algorithm (see attached image). But my code does not seem to work correctly, because the generated random samples are of a different pattern compared with those generated by R's built-in rpois() function. Can anybody tell me what I did wrong and how to fix my function?
r.poisson <- function(n, l=0.5)
{
U <- runif(n)
X <- rep(0,n)
p=exp(-l)
F=p
for(i in 1:n)
{
if(U[i] < F)
{
X[i] <- i
} else
{
p=p*l/(i+1)
F=F+p
i=i+1
}
}
return(X)
}
r.poisson(50)
The output is very different from rpois(50, lambda = 0.5). The algorithm I followed is:
(Thank you for your question. Now I know how a Poisson random variable is simulated.)
You had a misunderstanding. The inverse CDF method (with recursive computation) you referenced is used to generate a single Poisson random sample. So you need to fix this function to produce a single number. Here is the correct function, commented to help you follow each step.
rpois1 <- function (lambda) {
## step 1
U <- runif(1)
## step 2
i <- 0
p <- exp(-lambda)
F <- p
## you need an "infinite" loop
## no worry, it will "break" at some time
repeat {
## step 3
if (U < F) {
X <- i
break
}
## step 4
i <- i + 1
p <- lambda * p / i ## I have incremented i, so it is `i` not `i + 1` here
F <- F + p
## back to step 3
}
return(X)
}
Now to get n samples, you need to call this function n times. R has a nice function called replicate to repeat a function many times.
r.poisson <- function (n, lambda) {
## use `replicate()` to call `rpois1` n times
replicate(n, rpois1(lambda))
}
Now we can make a reasonable comparison with R's own rpois.
x1 <- r.poisson(1000, lambda = 0.5)
x2 <- rpois(1000, lambda = 0.5)
## set breaks reasonably when making a histogram
xmax <- max(x1, x2) + 0.5
par(mfrow = c(1, 2))
hist(x1, main = "proof-of-concept-implementation", breaks = seq.int(-0.5, xmax))
hist(x2, main = "R's rpois()", breaks = seq.int(-0.5, xmax))
Remark:
Kudos to jblood94 for showing how to find vectorization opportunities in an R loop without converting everything to C/C++. R's rpois is coded in C, which is why it is fast.
A vectorized version will run much faster than a non-vectorized function using replicate. The idea is to iteratively drop the uniform random samples as i is incremented.
r.poisson1 <- function(n, l = 0.5) {
U <- runif(n)
i <- 0L
X <- integer(n)
p <- exp(-l)
F <- p
idx <- 1:n
while (length(idx)) {
bln <- U < F
X[idx[bln]] <- i
p <- l*p/(i <- i + 1L)
F <- F + p
idx <- idx[!bln]
U <- U[!bln]
}
X
}
#Zheyuan Li's non-vectorized functions:
rpois1 <- function (lambda) {
## step 1
U <- runif(1)
## step 2
i <- 0
p <- exp(-lambda)
F <- p
## you need an "infinite" loop
## no worry, it will "break" at some time
repeat {
## step 3
if (U < F) {
X <- i
break
}
## step 4
i <- i + 1
p <- lambda * p / i
F <- F + p
## back to step 3
}
return(X)
}
r.poisson2 <- function (n, lambda) {
## use `replicate()` to call `rpois1` n times
replicate(n, rpois1(lambda))
}
Benchmark:
microbenchmark::microbenchmark(r.poisson1(1e5),
r.poisson2(1e5, 0.5),
rpois(1e5, 0.5))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r.poisson1(1e+05) 3.063202 3.129151 3.782200 3.225402 3.734600 18.377700 100
#> r.poisson2(1e+05, 0.5) 217.631002 244.816601 269.692648 267.977001 287.599251 375.910601 100
#> rpois(1e+05, 0.5) 1.519901 1.552300 1.649026 1.579551 1.620451 7.531401 100
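Another way to vectorize the same inverse-CDF idea is to tabulate the CDF once and look every uniform draw up in that table. This is a sketch, not part of the answers above: r.poisson.table is a hypothetical name, the optional u argument exists only so the lookup can be checked by hand, and truncating the support at a very high quantile is an assumption.

```r
# Inverse-CDF sampling with a precomputed CDF table.
# u can be supplied for checking; by default it is drawn with runif().
r.poisson.table <- function(n, lambda, u = runif(n)) {
  # Assumption: truncating the support at a very high quantile is acceptable
  kmax <- qpois(1 - 1e-12, lambda) + 10
  cdf <- ppois(0:kmax, lambda)          # F(0), F(1), ..., F(kmax)
  # The number of CDF values <= u equals the smallest k with u < F(k)
  findInterval(u, cdf)
}

r.poisson.table(3, lambda = 0.5, u = c(0.5, 0.7, 0.95))
#> [1] 0 1 2
```

With lambda = 0.5, F(0) is about 0.607, F(1) about 0.910 and F(2) about 0.986, so the three uniforms map to 0, 1 and 2, which matches step 3 of the algorithm.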

Manually simulating Poisson Process in R

The following problem tells us to generate a Poisson process step by step from ρ (inter-arrival time), and τ (arrival time).
One of the theoretical results presented in the lectures gives the following direct method for simulating a Poisson process:
• Let τ_0 = 0.
• Generate i.i.d. exponential random variables ρ_1, ρ_2, ...
• Let τ_n = ρ_1 + ... + ρ_n for n = 1, 2, ...
• For each k = 0, 1, ..., let N_t = k for τ_k ≤ t < τ_{k+1}.
Using this method, generate a realization of a Poisson process (N_t)_t with λ = 0.5 on the interval [0, 20].
Generate 10000 realizations of a Poisson process (N_t)_t with λ = 0.5 and use your results to estimate E(N_t) and Var(N_t). Compare the estimates with the theoretical values.
My attempted solution:
First, I have generated the values of ρ using rexp() function in R.
rhos <-function(lambda, max1)
{
vec <- vector()
for (i in 1:max1)
{
vec[i] <- rexp(0.5)
}
return (vec)
}
then, I created τs by progressive summing of ρs.
taos <- function(lambda, max)
{
rho_vec <- rhos(lambda, max)
#print(rho_vec)
vec <- vector()
vec[1] <- 0
sum <- 0
for(i in 2:max)
{
sum <- sum + rho_vec[i]
vec[i] <- sum
}
return (vec)
}
The following function is for finding the value of Nt=k when the value of k is given. Say, it is 7, etc.
Ntk <- function(lambda, max, k)
{
tao_vec <- taos(lambda, max)
val <- max(tao_vec[tao_vec < k])
}
y <- taos(0.5, 20)
x <- seq(0, 20-1, by=1)
plot(x,y, type="s")
Output:
As you can see, the plot of the Poisson process is blank rather than a staircase.
If I change rexp to exp, I get the following output:
.. which is a staircase function but all steps are equal.
Why is my source code not producing the expected output?
It looks like you're using max1 to indicate how many times to sample the exponential distribution in your rhos function. I would recommend something like this:
rhosGen <- function(lambda, maxTime){
rhos <- NULL
i <- 1
while(sum(rhos) < maxTime){
samp <- rexp(n = 1, rate = lambda)
rhos[i] <- samp
i <- i+1
}
return(head(rhos, -1))
}
This will continue to sample from the exponential distribution until the sum of the holding times exceeds the length of the given interval. head then removes the last sample, so that every event we keep track of occurs within the time interval of interest.
From here you have to generate the taos by summing the previous holding times (rhos):
taosGen <- function(lambda, maxTime){
rhos <- rhosGen(lambda, maxTime)
# arrival times are the cumulative sums of the holding times;
# cumsum also copes with the case where no event falls in the interval
taos <- cumsum(rhos)
return(taos)
}
Now that you have the taos we know at what time each event in the time interval (0,maxTime) occurs. This leads us to generating the associated Poisson Process by finding the value of the Nt for each t in the time interval:
ppGen <- function(lambda, maxTime){
taos <- taosGen(lambda, maxTime)
pp <- NULL
for(i in 1:maxTime){
pp[i] <- sum(taos <= i)
}
return(pp)
}
This generates the value of the Poisson process at each integer time in the interval. I suspect that part of your issue was putting the tao values on the y-axis instead of the count of events that had already occurred. The following code worked for me to produce a random-looking staircase, similar to your example.
y <- ppGen(0.5, 20)
x <- seq(1, 20, by=1) # ppGen returns the counts at times 1, 2, ..., maxTime
plot(x,y, type="s")
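The second part of the exercise (estimating E(N_t) and Var(N_t) from many realizations) is not covered above. Here is a minimal sketch, assuming we only need the count at the end of the interval, t = 20; count_arrivals is a hypothetical helper and the seed is arbitrary. For a Poisson process both E(N_t) and Var(N_t) equal λt, here 0.5 * 20 = 10.

```r
# Number of arrivals in [0, t_max] for one realization
count_arrivals <- function(lambda, t_max) {
  arrival <- 0   # current arrival time tau_k
  k <- 0         # arrivals seen so far
  repeat {
    arrival <- arrival + rexp(1, rate = lambda)  # add the next inter-arrival time
    if (arrival > t_max) break
    k <- k + 1
  }
  k
}

set.seed(123)    # arbitrary seed, only for reproducibility
counts <- replicate(10000, count_arrivals(lambda = 0.5, t_max = 20))
c(mean = mean(counts), var = var(counts), theory = 0.5 * 20)
```

The sample mean and variance should both land close to 10; the size of the gap shrinks as the number of realizations grows.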
Here's another possible implementation. The idea is to generate a vector of cumulative wait times (tau), one for each of the n events we're waiting for, and plot it as a step function.
poi.process <- function(lambda,n){
# initialize vector of total wait time for the arrival of each event:
s<-numeric(n+1)
# set S_0 = 0
s[1] <-0
# generate vector of iid Exp random variables:
x <-replicate(n,rexp(1,lambda))
# assign wait time to vector s in for loop:
for (k in 1:n){
s[k+1] <-sum(x[1:k])
}
# return vector of wait time
return(s)
}
Plotting it using stepfun will get us something like this:
n<-20
lambda <-3
# simulate list of wait time:
s_list <-poi.process(lambda,n)
# plot function:
plot(stepfun(s_list[-1], 0:n), # jumps at the arrival times, counts from 0 to n
do.points = TRUE,
pch = 16,
col.points = "red",
verticals = FALSE,
main = 'Realization of a Poisson process with lambda = 3',
xlab = 'Time of arrival',
ylab = 'Number of arrivals')
Sample Poisson process:

Sequential Quadratic Programming in R to find optimal weights of an Equally-Weighted Risk Contribution Portfolio

Introduction to the problem
I am trying to write R code to obtain the weights of an Equally-Weighted Risk Contribution (ERC) portfolio. As some of you may know, this portfolio construction was presented by Maillard, Roncalli and Teiletche.
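Skipping technicalities, in order to find the optimal weights of an ERC portfolio one needs to solve the following Sequential Quadratic Programming problem:
$$x^{\star} = \arg\min_{x} f(x) \quad \text{subject to} \quad \mathbf{1}^{\top} x = 1, \quad 0 \le x \le 1$$
with:
$$f(x) = \sum_{i=1}^{N} \sum_{j=1}^{N} \left( x_i (\Sigma x)_i - x_j (\Sigma x)_j \right)^2$$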
Suppose we are analysing N assets. In the above formulas, we have that x is a (N x 1) vector of portfolio weights and Σ is the (N x N) variance-covariance matrix of asset returns.
What I have done so far
Using the function slsqp of the package nloptr which solves SQP problems, I would like to solve the above minimisation problem. Here is my code. Firstly, the objective function to be minimised:
ObjFuncERC <- function (x, Sigma) {
sum <- 0
R <- Sigma %*% x
for (i in 1:N) {
for (j in 1:N) {
sum <- sum + (x[i]*R[i] - x[j]*R[j])^2
}
}
}
Secondly, the starting point (we start by an equally-weighted portfolio):
x0 <- matrix(1/N, nrow = N, ncol = 1)
Then, the equality constraint (weights must sum to one, that is: sum of the weights minus one equal zero):
heqERC <- function (x) {
h <- numeric(1)
h[1] <- (t(matrix(1, nrow = N, ncol = 1)) %*% x) - 1
return(h)
}
Finally, the lower and upper bounds constraints (weights cannot exceed one and cannot be lower than zero):
lowerERC <- matrix(0, nrow = N, ncol = 1)
upperERC <- matrix(1, nrow = N, ncol = 1)
So that the function which should output optimal weights is:
slsqp(x0 = x0, fn = ObjFuncERC, Sigma = Sigma, lower = lowerERC, upper = upperERC, heq = heqERC)
Unfortunately, I do not know how to share my variance-covariance matrix with you so that you can reproduce my result (it is named Sigma and is a (29 x 29) matrix, so that N = 29); still, you can simulate one.
The output error
Running the above code yields the following error:
Error in nl.grad(x, fn) :
Function 'f' must be a univariate function of 2 variables.
I have no idea what to do guys. Probably, I have misunderstood how things must be written down in order for the function slsqp to understand what to do. Can someone help me understand how to fix the problem and get the result I want?
UPDATE ONE: as pointed out by @jogo in the comments, I have updated the code, but it still produces an error. The code and the error above are now updated.
UPDATE 2: as requested by @jaySf, here is the full code that allows you to reproduce my error.
## ERC Portfolio Test
# Preliminary Operations
rm(list=ls())
require(quantmod)
require(nloptr)
# Load Stock Data in R through Yahoo! Finance
stockData <- new.env()
start <- as.Date('2014-12-31')
end <- as.Date('2017-12-31')
tickers <-c('AAPL','AXP','BA','CAT','CSCO','CVX','DIS','GE','GS','HD','IBM','INTC','JNJ','JPM','KO','MCD','MMM','MRK','MSFT','NKE','PFE','PG','TRV','UNH','UTX','V','VZ','WMT','XOM')
getSymbols.yahoo(tickers, env = stockData, from = start, to = end, periodicity = 'monthly')
# Create a matrix containing the price of all assets
prices <- do.call(cbind,eapply(stockData, Op))
prices <- prices[-1, order(colnames(prices))]
colnames(prices) <- tickers
# Compute Returns
returns <- diff(prices)/lag(prices)[-1,]
# Compute variance-covariance matrix
Sigma <- var(returns)
N <- 29
# Set up the minimization problem
ObjFuncERC <- function (x, Sigma) {
sum <- 0
R <- Sigma %*% x
for (i in 1:N) {
for (j in 1:N) {
sum <- sum + (x[i]*R[i] - x[j]*R[j])^2
}
}
}
x0 <- matrix(1/N, nrow = N, ncol = 1)
heqERC <- function (x) {
h <- numeric(1)
h[1] <- t(matrix(1, nrow = N, ncol = 1)) %*% x - 1
}
lowerERC <- matrix(0, nrow = N, ncol = 1)
upperERC <- matrix(1, nrow = N, ncol = 1)
slsqp(x0 = x0, fn = ObjFuncERC, Sigma = Sigma, lower = lowerERC, upper = upperERC, heq = heqERC)
I spotted several mistakes in your code. For instance, ObjFuncERC is not returning any value. You should use the following instead:
# Set up the minimization problem
ObjFuncERC <- function (x, Sigma) {
sum <- 0
R <- Sigma %*% x
for (i in 1:N) {
for (j in 1:N) {
sum <- sum + (x[i]*R[i] - x[j]*R[j])^2
}
}
sum
}
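The double loop in the objective can also be written without loops, which matters because the optimiser evaluates it many times. The sketch below compares both forms on a made-up 3-asset covariance matrix; ObjFuncERC_loop, ObjFuncERC_vec, Sigma_toy and x_toy are hypothetical names, and N is taken from length(x) rather than from a global variable.

```r
# Loop version, with N taken from length(x) instead of a global variable
ObjFuncERC_loop <- function(x, Sigma) {
  N <- length(x)
  total <- 0
  R <- Sigma %*% x
  for (i in 1:N) {
    for (j in 1:N) {
      total <- total + (x[i] * R[i] - x[j] * R[j])^2
    }
  }
  total
}

# Vectorized version: all pairwise differences of the risk contributions
ObjFuncERC_vec <- function(x, Sigma) {
  rc <- as.vector(x * (Sigma %*% x))   # x_i * (Sigma x)_i for each asset
  sum(outer(rc, rc, "-")^2)
}

# Made-up 3-asset covariance matrix, only for the comparison
Sigma_toy <- matrix(c(0.04, 0.01, 0.00,
                      0.01, 0.09, 0.02,
                      0.00, 0.02, 0.16), nrow = 3, byrow = TRUE)
x_toy <- rep(1/3, 3)
all.equal(ObjFuncERC_loop(x_toy, Sigma_toy), ObjFuncERC_vec(x_toy, Sigma_toy))
```

Both functions return a single number, which is what the optimiser expects from fn.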
heqERC has no explicit return value either, so I also simplified that function a bit:
heqERC <- function (x) {
sum(x) - 1
}
I made those changes and tried slsqp without lower and upper and it worked. Still, another thing to consider is that you set lowerERC and upperERC as matrices. Use the following instead:
lowerERC <- rep(0,N)
upperERC <- rep(1,N)
Hope this helps.

How to recode while loop to optimize performance for large simulation in R?

I need to generate simulated data where the percent censored cannot be 0 or 1, which is why I use a while loop. The problem is that if I increase count to 10,000 (instead of 5), the program is very slow. I have to repeat this for 400 different scenarios, so it is extremely slow. I'm trying to figure out where I can vectorize my code piece by piece. How can I avoid the while loop and still keep the condition?
Another approach is to keep the while loop, generate a list of 10,000 datasets that meet my criteria, and then apply the function to the list. Here I use the summary function as an example, but my real function uses both X_after and delta (i.e. mle(X_after, delta)). Is this a better option if I have to use a while loop?
Another concern is memory. How can I avoid using up memory while running such a large simulation?
mu=1 ; sigma=3 ; n=10 ; p=0.10
dset <- function (mu,sigma, n, p) {
Mean <- array()
Median <- array()
Pct_cens_array <- array()
count = 0
while(count < 5) {
lod <- quantile(rlnorm(100000, log(mu), log(sigma)), p = p)
X_before <- rlnorm(n, log(mu), log(sigma))
X_after <- ifelse(X_before <= lod, lod, X_before)
delta <- ifelse(X_before <= lod, 1, 0)
pct_cens <- sum(delta)/length(delta)
# print(pct_cens)
if (pct_cens == 0 | pct_cens == 1 ) next
else {
count <- count +1
if (pct_cens > 0 & pct_cens < 1) {
sumStats <- summary(X_after)
Median[count] <- sumStats[3]
Mean [count]<- sumStats[4]
Pct_cens_array [count] <- pct_cens
print(list(pct_cens=pct_cens,X_after=X_after, delta=delta, Median=Median,Mean=Mean,Pct_cens_array=Pct_cens_array))
}
}
}
return(data.frame(Pct_cens_array=Pct_cens_array, Mean=Mean, Median=Median))
}
I've made a few little tweaks to your code without changing the whole style of it. It would be good to heed Yoong Kim's advice and try to break up the code into smaller pieces, to make it more readable and maintainable.
Your function now gets two "n" arguments: how many samples you have in each row (n_samples), and how many iterations (rows) you want (n_iterations).
You were growing the arrays Median and Mean in the loop, which requires a lot of messing about reallocating memory and copying things, which slows everything down. I've predefined X_after and moved the mean and median calculations after the loop to avoid this. (As a bonus, mean and median only get called once instead of n_iterations times.)
The calls to ifelse weren't really needed.
It is a little quicker to call rlnorm once, generating enough values for x and the lod, than to call it twice.
Here's the updated function.
dset2 <- function (mu, sigma, n_samples, n_iterations, p) {
X_after <- matrix(NA_real_, nrow = n_iterations, ncol = n_samples)
pct_cens <- numeric(n_iterations)
count <- 1
while(count <= n_iterations) {
random_values <- rlnorm(2L * n_samples, log(mu), log(sigma))
lod <- quantile(random_values[1:n_samples], p = p)
X_before <- random_values[(n_samples + 1L):(2L * n_samples)]
X_after[count, ] <- pmax(X_before, lod)
delta <- X_before <= lod
pct_cens[count] <- mean(delta)
if (pct_cens[count] > 0 && pct_cens[count] < 1 ) count <- count + 1
}
Median <- apply(X_after, 1, median)
Mean <- rowMeans(X_after)
data.frame(Pct_cens_array=pct_cens, Mean=Mean, Median=Median)
}
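Compare timings with, for example (this assumes the original dset has been changed to take n_samples and n_iterations in place of n and the hard-coded 5),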
mu=1
sigma=3
n_samples=10L
n_iterations = 1000L
p=0.10
system.time(dset(mu,sigma, n_samples, n_iterations, p))
system.time(dset2(mu,sigma, n_samples, n_iterations, p))
On my machine, there is a factor of 3 speedup.
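To push the vectorization further, whole batches of rows can be drawn at once and the rows with 0% or 100% censoring discarded afterwards. Each row is still accepted or rejected on its own, so the kept rows follow the same rule as in the loop. This is a sketch under stated assumptions: sim_censored is a hypothetical name, the exact lognormal quantile from qlnorm is swapped in for the simulated lod, and the oversampling factor is an arbitrary choice.

```r
sim_censored <- function(mu, sigma, n_samples, n_iterations, p) {
  meanlog <- log(mu)
  sdlog <- log(sigma)
  # Assumption: the exact lognormal quantile replaces the simulated one
  lod <- qlnorm(p, meanlog, sdlog)
  X_keep <- matrix(numeric(0), nrow = 0, ncol = n_samples)
  d_keep <- matrix(logical(0), nrow = 0, ncol = n_samples)
  while (nrow(X_keep) < n_iterations) {
    # Oversample so that one pass is usually enough
    batch <- 2 * (n_iterations - nrow(X_keep)) + 10
    X <- matrix(rlnorm(batch * n_samples, meanlog, sdlog), nrow = batch)
    d <- X <= lod                      # TRUE marks a censored value
    frac <- rowMeans(d)                # fraction censored in each row
    ok <- frac > 0 & frac < 1          # keep rows that are partly censored
    X[d] <- lod                        # replace censored values by the lod
    X_keep <- rbind(X_keep, X[ok, , drop = FALSE])
    d_keep <- rbind(d_keep, d[ok, , drop = FALSE])
  }
  rows <- seq_len(n_iterations)
  list(X_after = X_keep[rows, , drop = FALSE],
       delta = d_keep[rows, , drop = FALSE],
       lod = lod)
}

set.seed(1)      # arbitrary seed, only for reproducibility
out <- sim_censored(mu = 1, sigma = 3, n_samples = 10, n_iterations = 200, p = 0.10)
dim(out$X_after)
```

The function returns both X_after and delta, so a statistic that needs the two matrices can be applied row by row afterwards.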
The first rule I learnt with C programming: divide and conquer! I mean you should first create several functions and call them from your loop, because this loop does too many different things.
And I am worried about your algorithm:
if (pct_cens == 0 | pct_cens == 1 ) next
else {count <- count +1
Is there any reason you use while instead of for?
There is a difference between while and for loops: a for loop runs a fixed number of times, whereas a while loop keeps running until its condition fails.
Finally, about your problem: trade memory for speed by storing intermediate results in an array.
Example:
lod <- quantile(rlnorm(100000, log(mu), log(sigma)), p = p)
X_before <- rlnorm(n, log(mu), log(sigma))
log(mu) and log(sigma) are computed twice: store the results in variables and you will save time, at the cost of a little more memory.

Resources