I would like to repeat a process until a condition is met, storing the outcomes along the way.
Here is a simple case where I know the number of cycles to perform in the loop:
# just example data
smpls <- rnorm(100,50,50)
ncycles <- 1000
outm <- matrix(nrow=ncycles, ncol = 1)
# repeat the process for n cycles
for(i in 1:ncycles){
outm[i] <- mean(sample(smpls, 50))
}
# get the average of outm
mean(outm)
But my case is different in that I do not know ncycles. I want to continue sampling until the sample means have very low variance, i.e. until they converge (I guess this calls for a "while" loop), for example until vsd is less than 1 in the following case:
vsd <- NULL
outm <- mean(sample(smpls, 50))
while (vsd > 1){
outm[i] <- mean(sample(smpls, 50))
vsd <- sd(outm)
}
I do not know what value of i to set here. Help appreciated.
Edits:
smpls <- rnorm(100,50,50)
iter <- 0
# maximum iteration
itermax <- 1000
outm <- rep(NA, itermax)
vsd <- 2
while((vsd > 1 ) && (iter < itermax)) {
outm[iter] <- mean(sample(smpls, 50))
vsd <- sd(outm)
iter <- iter+1
}
Error in while ((vsd > 1) && (iter < itermax)) { :
missing value where TRUE/FALSE needed
The main idea of stopping when it reaches convergence is to save time. Although the above example with just the mean function is quick, my original function needs significant time per iteration and I want to stop it when it converges.
Two problems in your code:
1) you need sd(... , na.rm = TRUE)
2) you need to be sure that there are at least two non-missing numbers in outm, otherwise sd(outm, na.rm = TRUE) is still NA
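For instance, a quick illustration of point 2 using base R only:
sd(c(1.2, NA), na.rm = TRUE)   # NA: only one non-missing value, so the sd is undefined
sd(c(1.2, 3.4), na.rm = TRUE)  # a proper number once two values are available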
Just by the way, given the sd you specify to rnorm, I don't think you'll ever need more than a couple of dozen iterations
sim <- function() {
  smpls <- rnorm(100, 50, 5)
  itermax <- 1000
  outm <- rep(NA, itermax)
  outm[1] <- mean(sample(smpls, 50))
  iter <- 1
  vsd <- 2
  while ((vsd > 1) && (iter < itermax)) {
    iter <- iter + 1
    outm[iter] <- mean(sample(smpls, 50))
    vsd <- sd(outm, na.rm = TRUE)
  }
  iter
}
set.seed(666)
iters <- replicate(100000, sim() )
range(iters) # c(2, 11)
Cheers.
Here is a solution:
data
set.seed(123) # so that you can replicate what I did
smpls <- rnorm(100,50,50)
I think you need some initialization cycles (a minimum number of iterations) so that you do not get false convergence from a small number of samples; run a few initial cycles, say miniter. You also need a maximum number of iterations so that your loop does not run wild, say maxiter.
meanconverge <- function(data, miniter, maxiter, tolerance) {
  outm <- rep(NA, maxiter)
  for (i in 1:miniter) {
    outm[i] <- mean(sample(data, 50))
  }
  # sd of the initial cycles
  vsd <- sd(outm, na.rm = TRUE)
  if (vsd > tolerance) {
    iter <- miniter
    sdout <- rep(NA, maxiter)
    while ((vsd > tolerance) && (iter < maxiter)) {
      iter <- iter + 1
      outm[iter] <- mean(sample(data, 50))
      vsd <- sd(outm, na.rm = TRUE)
      sdout[iter] <- vsd
    }
    out <- list(outm, sdout)
    return(out)
  } else {
    return(outm)
  }
}
out <- meanconverge (data = smpls, miniter = 50, maxiter = 100000, tolerance = 3)
plot(unlist(out[2]), pch = ".", col = "red")
plot(unlist(out[1]), pch = ".", col = "red")
Checking for convergence is a tricky thing. A great way to get started is to look at how the value changes while you are computing. Convergence is all about getting arbitrarily close to a boundary; programmatically, you have to make a choice what "arbitrary" means. You also need to make a decision about how you will measure convergence.
To illustrate, suppose I want to know if my estimates for meeting my condition are getting really close to each other. I may have something like:
# inside my function or method that performs this convergence feat
while (while_condition && i < itermax) {
  i <- i + 1
  outcomes[i] <- some_complicated_foo(bar)
  if (abs(outcomes[i - 1] - outcomes[i]) <= tolerance) {
    while_condition <- FALSE   # i.e. STOP LOOPING
  }
}
outcomes
Here tolerance is your definition of arbitrary closeness. Now, this seems like a hammer for your nail, right? Well, what happens if you converge to the wrong answer? How will you know? Does this thing even converge? The trick with these kinds of problems is making clever guesses about your functions or the data-generating process you are analyzing. But having a maximum iteration bound will definitely save some computing time, as long as it is reasonable. The real way to know if you are right is to use tests (statistical tests or unit tests) to check for garbage-in-garbage-out, or for getting something different than you'd expect on a contrived example with a well-known answer.
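As a concrete (if simplistic) sketch of that idea applied to the sampling example above: the absolute-difference criterion and the tolerance value are my own arbitrary choices here, not the only reasonable ones.
# Sketch only: stop when successive sample means differ by less than `tolerance`,
# or when `itermax` is reached, whichever comes first.
smpls <- rnorm(100, 50, 50)
tolerance <- 0.5
itermax <- 1000
outcomes <- rep(NA_real_, itermax)
outcomes[1] <- mean(sample(smpls, 50))
i <- 1
converged <- FALSE
while (!converged && i < itermax) {
  i <- i + 1
  outcomes[i] <- mean(sample(smpls, 50))
  if (abs(outcomes[i - 1] - outcomes[i]) <= tolerance) converged <- TRUE
}
outcomes <- outcomes[1:i]   # keep only what was actually computed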
Check out optimization algorithms and see how they do it. See ?optim or some other optimization package to see how the pros do it.
Related
I am looking to sample repeatedly from a distribution with a specific condition.
I am sampling 50 values for four iterations and saving the results. However, I need each individual result from an iteration to be smaller than the result at the same position in the previous iteration.
mu.c <- c(7,6,5,3) # Means of control chains
chains.sim <- function(vector, N) {
  all.list <- list()
  for (i in 1:length(vector)) {
    Y <- MASS::rnegbin(n = N, mu = vector[i], theta = 4)
    name <- paste('position:', i, sep = '')
    all.list[[name]] <- Y
  }
  all.list
}
chains.sim(mu.c, 50)
The sampling part works fine, but the individual Y results are of course not always smaller than the results from the previous iteration ("position").
Is there a way to repeat the sampling process until the result is smaller?
I would really appreciate your help!
I would add a while loop inside your for loop which samples data sets until the condition is met.
mu.c <- c(7,6,5,3) # Means of control chains
chain.sim <- function(vector, N) {
  all.list <- list()
  all.list[[1]] <- MASS::rnegbin(n = N, mu = vector[1], theta = 4)
  for (i in 2:length(vector)) {
    is_smaller <- FALSE
    while (!is_smaller) {
      Y <- MASS::rnegbin(n = N, mu = vector[i], theta = 4)
      if (all(all.list[[i - 1]] >= Y)) is_smaller <- TRUE
    }
    all.list[[i]] <- Y
  }
  all.list
}
chain.sim(mu.c, 3)
Note that I changed the condition to >=, because if a 0 is generated in any round, the loop would otherwise never find a strictly smaller value. Also, with 50 elements this code will practically never stop, because it is really unlikely to get two samples where every value is smaller, let alone 4 such samples.
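As a rough sanity check of how unlikely the whole-sample condition is, here is my own quick simulation using the first two means from mu.c; the exact probability is not important, only that it is tiny.
set.seed(1)
# Estimate the probability that every one of 50 new values is <= the value at the
# same position in the previous sample.
mean(replicate(10000, {
  previous <- MASS::rnegbin(50, mu = 7, theta = 4)
  current  <- MASS::rnegbin(50, mu = 6, theta = 4)
  all(previous >= current)
}))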
Edit:
It can be made much faster by sampling the values individually, as you pointed out:
chain.sim <- function(vector, N) {
  all.list <- list()
  all.list[[1]] <- MASS::rnegbin(n = N, mu = vector[1], theta = 4)
  for (i in 2:length(vector)) {
    Y <- numeric(N)
    for (j in 1:N) {
      previous_value <- all.list[[i - 1]][j]
      if (previous_value == 0) {
        Y[j] <- 0
        next
      }
      is_smaller <- FALSE
      while (!is_smaller) {
        val <- MASS::rnegbin(1, mu = vector[i], theta = 4)
        if (val <= previous_value) is_smaller <- TRUE
        Y[j] <- val
      }
    }
    all.list[[i]] <- Y
  }
  all.list
}
chain.sim(mu.c, 50)
If a 0 is encountered anywhere, no more simulation is necessary, as we know the next value can only be 0. This makes the simulation much faster.
I would like to apply the rejection sampling method to simulate a random vector Y = (Y_1, Y_2) uniformly distributed on the unit disc D = \{(y_1, y_2) \in R^2 : \sqrt{y_1^2 + y_2^2} \le 1\}, where X = (X_1, X_2) is a random vector uniformly distributed on the square S = [-1, 1]^2 and the joint density is f(y_1, y_2) = \frac{1}{\pi} 1_{D}(y_1, y_2).
In the rejection method, we accept a draw x from the proposal g with probability f(x) / (C g(x)), where C is chosen so that f(x) \le C g(x) for all x. I am using the following code:
x=runif(100,-1,1)
y=runif(100,-1,1)
d=data.frame(x=x,y=y)
disc_sample=d[(d$x^2+d$y^2)<1,]
plot(disc_sample)
I have two questions:
Using the above code, logically, the size of d should be greater than the size of disc_sample, but when I call both of them I see there are 100 elements in each one. How is this possible? Why are the sizes the same? THIS PART IS SOLVED, thanks to the comment below.
The question now:
How could I reformulate my code so that it also gives me the total number of samples needed to get 100 samples that satisfy the condition, i.e. the number of samples drawn (and rejected) until I got the 100 needed samples?
Thanks to the answer of r2evans, but I am looking to write something simpler: a while loop that stores all samples inside a matrix or a data frame instead of a list, and then selects from that data frame just the samples that satisfy the condition. I modified the code from the answer to avoid lists and sapply, but it is not giving the needed result; it yields only one row.
i=0
samps <- data.frame()
goods <- data.frame()
nr <- 0L
sampsize <- 100L
needs <- 100L
while (i < needs) {
samps <- data.frame(x = runif(1, -1, 1), y = runif(1, -1, 1))
goods <- samps[(samps$x^2+samps$y^2)<1, ]
i = i+1
}
and I also thought about this:
i=0
j=0
samps <- matrix()
goods <- matrix()
needs <- 100
while (j < needs) {
samps[i,1] <- runif(1, -1, 1)
samps[i,2] <- runif(1, -1, 1)
if (( (samps[i,1])**2+(samps[i,2])**2)<1){
goods[j,1] <- samps[i,1]
goods[j,2] <- samps[i,2]
}
else{
i = i+1
}
}
but it is not working.
I would be very grateful for any help to modify the code.
As to your second question ... you cannot reformulate your code to know precisely how many samples it will take to get (at least) 100 qualifying combinations. You can use a while loop and concatenate results until you have at least 100 such rows, and then truncate those over 100. Because using entropy piecewise (at scale) is "expensive", you might prefer to always over-estimate the rows you need and grab them all at once (a sketch of that one-shot variant follows the loop example below).
(Edited to reduce "complexity" based on homework constraints.)
set.seed(42)
samps <- vector(mode = "list")
goods <- vector(mode = "list")
nr <- 0L
iter <- 0L
sampsize <- 100L
needs <- 100L
while (nr < needs && iter < 50) {
  iter <- iter + 1L
  samps[[iter]] <- data.frame(x = runif(sampsize, -1, 1), y = runif(sampsize, -1, 1))
  rows <- (samps[[iter]]$x^2 + samps[[iter]]$y^2) < 1
  goods[[iter]] <- samps[[iter]][rows, ]
  nr <- nr + sum(rows)
}
iter # number of times we looped
# [1] 2
out <- head(do.call(rbind, goods), n = 100)
NROW(out)
# [1] 100
head(out) ; tail(out)
# x y
# 1 0.8296121 0.2524907
# 3 -0.4277209 -0.5668654
# 4 0.6608953 -0.2221099
# 5 0.2834910 0.8849114
# 6 0.0381919 0.9252160
# 7 0.4731766 0.4797106
# x y
# 221 -0.65673577 -0.2124462
# 231 0.08606199 -0.7161822
# 251 -0.37263236 0.1296444
# 271 -0.38589120 -0.2831997
# 28 -0.62909284 0.6840144
# 301 -0.50865171 0.5014720
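And here is a minimal sketch of the "over-estimate and grab all at once" alternative mentioned above. Since the disc covers pi/4 (about 78.5%) of the square, drawing a comfortable margin more than needs/(pi/4) points usually yields at least 100 accepted rows in a single pass; the 20% margin factor is my own choice, not part of the original answer.
set.seed(42)
needs <- 100L
n_draw <- ceiling(needs / (pi / 4) * 1.2)   # over-estimate with a 20% margin
samps <- data.frame(x = runif(n_draw, -1, 1), y = runif(n_draw, -1, 1))
goods <- samps[samps$x^2 + samps$y^2 < 1, ]
nrow(goods)                                  # usually >= 100; if not, simply draw again
out <- head(goods, needs)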
I found out that the boot function of the boot package does not work with complex numbers. I am trying to bootstrap data by taking the eigenvalues of a bivariate matrix. The problem with the eigenvalues is that they often come out complex, and boot then gives an error. Is there a way to avoid complex numbers?
Here is my code,
Data <- read.table('http://ubuntuone.com/6n1igcHXq4EnOm4x2zeqFb', header = FALSE)
Mat <- cbind(Data[["V1"]],Data[["V2"]])
Data.ts <- as.ts(Mat)
Below are some functions needed,
library(mvtnorm)
var1.sim <- function(T, n.start = 100, phi1 = matrix(c(0.7, 0.2, 0.2, 0.7), nr = 2),
                     err.mu = c(0, 0), err.sigma2 = matrix(c(1, 0.5, 0.5, 1), nr = 2),
                     errors = NULL) {
  e <- rmvnorm(n.start + T, err.mu, err.sigma2)  # (n.start+T) x 2 matrix
  y <- matrix(0, nrow = n.start + T, ncol = 2)
  if (!is.null(errors) && is.matrix(errors) && ncol(errors) == 2) {
    rows <- nrow(errors)
    if (rows < n.start + T) {
      # replace the last nrow(errors) errors
      e[seq.int(n.start + T - rows + 1, n.start + T), ] <- errors
    } else {
      e <- errors[seq.int(n.start + T + 1, rows)]
    }
  }
  for (t in seq.int(2, n.start + T)) {
    y[t, ] <- phi1 %*% y[t - 1, ] + e[t, ]
  }
  return(ts(y[seq.int(n.start + 1, n.start + T), ]))
}
########
coef.var1 <- function(var.fit) {
k <- coef(var.fit)
rbind(k[[1]][,"Estimate"], k[[2]][,"Estimate"])
}
And here is the main method,
library(vars)
library(boot)
y.var <- VAR(Data.ts, p=1, type="none")
y.resid <- resid(y.var)
rm(y.boot)
y.boot <- boot(y.resid, R=100, statistic=function(x,i) {
resid.boot <- x[i,]
y.boot1 <- var1.sim(T=nrow(x), errors=resid.boot)
min(eigen(coef.var1(VAR(y.boot1, p=1, type="none")))$values)
}, stype="i")
y.boot$t
y.ci <- boot.ci(y.boot, type="norm", conf=0.95)$normal[2:3]
list(t=y.boot$t,ci=y.ci)
The problem occurs in the y.boot object, particularly in this line:
min(eigen(coef.var1(VAR(y.boot1, p=1, type="none")))$values)
When the obtained minimum eigenvalue is complex, boot returns this error:
Error in min(eigen(coef.var1(VAR(y.boot1, p = 1, type = "none")))$values) :
invalid 'type' (complex) of argument
Otherwise, there is no problem. Now, it would be fine if these 100 bootstrap replicates were performed only once, but I am actually going to loop this about 100 times as well, so there is a big chance that complex values will occur somewhere in those loops and we will hit the above error again.
Is there a way to avoid these complex values?
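A minimal illustration of why the error happens, plus one possible workaround; whether the modulus is the right summary depends on what the statistic is supposed to measure, so treat this as an assumption rather than the fix:
# min() is not defined for complex vectors, which is exactly what boot trips over.
ev <- eigen(matrix(c(0, -1, 1, 0), nrow = 2))$values  # a purely imaginary pair
# min(ev)      # would fail: invalid 'type' (complex) of argument
min(Mod(ev))   # the smallest eigenvalue modulus is real, so boot() can store it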
I have the following code to create a sample function and to generate simulated data
mean_detects<- function(obs,cens) {
detects <- obs[cens==0]
nondetects <- obs[cens==1]
res <- mean(detects)
return(res)
}
mu <- log(1); sigma <- log(3); n_samples <- 10; n_iterations <- 5; p <- 0.10
dset2 <- function(mu, sigma, n_samples, n_iterations, p) {
  X_after <- matrix(NA_real_, nrow = n_iterations, ncol = n_samples)
  delta <- matrix(NA_real_, nrow = n_iterations, ncol = n_samples)
  lod <- quantile(rlnorm(100000, mu, sigma), p = p)
  pct_cens <- numeric(n_iterations)
  count <- 1
  while (count <= n_iterations) {
    X_before <- rlnorm(n_samples, mu, sigma)
    X_after[count, ] <- pmax(X_before, lod)
    delta[count, ] <- X_before <= lod
    pct_cens[count] <- mean(delta[count, ])
    if (pct_cens[count] > 0 & pct_cens[count] < 1) count <- count + 1
  }
  ave_detects <- mean_detects(X_after, delta)  ## how can I use apply or other functions here?
  return(ave_detects)
}
If I specify n_iterations = 1, I will have a 1x10 X_after matrix and also a 1x10 delta matrix. Then the mean_detects function works fine using this command:
ave_detects <- mean_detects(X_after,delta)
However, when I increase n_iterations to, say, 5, I will have two 5x10 matrices, X_after and delta, and then the mean_detects function does not work any more: it only gives me output for 1 iteration instead of 5. My real simulation has thousands of iterations, so speed and memory must also be taken into account.
Edits: I edited my code based on your comments. The mean_detects function that I created was meant to show an example of using the X_after and delta matrices simultaneously. The real function is very long, which is why I did not post it here.
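One possible way to get one value per iteration inside dset2, assuming mean_detects stays exactly as defined above (this is only a sketch of how an apply-style call could look, not necessarily what the answer below recommends):
# Inside dset2, after the while loop has filled X_after and delta:
ave_detects <- vapply(
  seq_len(nrow(X_after)),
  function(k) mean_detects(X_after[k, ], delta[k, ]),
  numeric(1)
)   # one mean of the detects per iteration (row)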
Your actual question isn't really clear. So,
"My function only takes in 1 dataframe".
Actually, your function takes in two vectors.
"Write code that can use both X_after and delta." This doesn't really mean anything - sorry.
"Speed and memory must be taken into account." This is vague. Will you run out of memory? As a suggestion, you could think about a rolling mean. For example,
x = runif(5)
total = 0
for(i in seq_along(x)) {
total = (i-1)*total/i + x[i]/i
cat(i, ": mean ", total, "\n")
}
1 : mean 0.4409
2 : mean 0.5139
3 : mean 0.5596
4 : mean 0.6212
5 : mean 0.6606
Aside
Your dset2 function requires the variable n (which you haven't defined).
Your dset2 function doesn't return an obvious value.
Your mean_detects function can be simplified to:
mean(obs[cens==0])
I need to generate simulated data where the percent censored cannot be 0 or 1. That's why I use a while loop. The problem is that if I increase the count to 10,000 (instead of 5), the program is very slow, and I have to repeat this for 400 different scenarios, so it becomes extremely slow. I'm trying to figure out places where I can vectorize my code piece by piece. How can I avoid the while loop and still keep the condition?
Another approach is to keep the while loop and generate a list of 10,000 datasets that meet my criteria, and then apply the function to that list. Here I use the summary function as an example, but my real function uses both X_after and delta (i.e. mle(X_after, delta)). Is this a better option if I have to use a while loop?
Another concern I have is memory: how can I avoid using up memory while doing such a large simulation?
mu=1 ; sigma=3 ; n=10 ; p=0.10
dset <- function(mu, sigma, n, p) {
  Mean <- array()
  Median <- array()
  Pct_cens_array <- array()
  count <- 0
  while (count < 5) {
    lod <- quantile(rlnorm(100000, log(mu), log(sigma)), p = p)
    X_before <- rlnorm(n, log(mu), log(sigma))
    X_after <- ifelse(X_before <= lod, lod, X_before)
    delta <- ifelse(X_before <= lod, 1, 0)
    pct_cens <- sum(delta) / length(delta)
    # print(pct_cens)
    if (pct_cens == 0 | pct_cens == 1) next
    else {
      count <- count + 1
      if (pct_cens > 0 & pct_cens < 1) {
        sumStats <- summary(X_after)
        Median[count] <- sumStats[3]
        Mean[count] <- sumStats[4]
        Pct_cens_array[count] <- pct_cens
        print(list(pct_cens = pct_cens, X_after = X_after, delta = delta,
                   Median = Median, Mean = Mean, Pct_cens_array = Pct_cens_array))
      }
    }
  }
  return(data.frame(Pct_cens_array = Pct_cens_array, Mean = Mean, Median = Median))
}
I've made a few little tweaks to your code without changing the whole style of it. It would be good to heed Yoong Kim's advice and try to break up the code into smaller pieces, to make it more readable and maintainable.
Your function now gets two "n" arguments, for how many samples you have in each row, and how many iterations (rows) you want.
You were growing the arrays Median and Mean in the loop, which requires a lot of messing about with reallocating memory and copying things, and that slows everything down. I've predefined X_after and moved the mean and median calculations after the loop to avoid this. (As a bonus, mean and median only get called once each instead of n_iterations times.)
The calls to ifelse weren't really needed.
It is a little quicker to call rlnorm once, generating enough values for x and the lod, than to call it twice.
Here's the updated function.
dset2 <- function(mu, sigma, n_samples, n_iterations, p) {
  X_after <- matrix(NA_real_, nrow = n_iterations, ncol = n_samples)
  pct_cens <- numeric(n_iterations)
  count <- 1
  while (count <= n_iterations) {
    random_values <- rlnorm(2L * n_samples, log(mu), log(sigma))
    lod <- quantile(random_values[1:n_samples], p = p)
    X_before <- random_values[(n_samples + 1L):(2L * n_samples)]
    X_after[count, ] <- pmax(X_before, lod)
    delta <- X_before <= lod
    pct_cens[count] <- mean(delta)
    if (pct_cens[count] > 0 && pct_cens[count] < 1) count <- count + 1
  }
  Median <- apply(X_after, 1, median)
  Mean <- rowMeans(X_after)
  data.frame(Pct_cens_array = pct_cens, Mean = Mean, Median = Median)
}
Compare timings with, for example,
mu=1
sigma=3
n_samples=10L
n_iterations = 1000L
p=0.10
system.time(dset(mu,sigma, n_samples, n_iterations, p))
system.time(dset2(mu,sigma, n_samples, n_iterations, p))
On my machine, there is a factor of 3 speedup.
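If you want to see the preallocation point in isolation, a toy comparison along these lines (my own illustration, not part of the answer; how big the gap is depends on your R version) contrasts growing a vector with filling a preallocated one:
grow <- function(n) { x <- numeric(0); for (i in 1:n) x[i] <- i; x }
prealloc <- function(n) { x <- numeric(n); for (i in 1:n) x[i] <- i; x }
system.time(grow(1e5))      # the vector has to be extended as it grows
system.time(prealloc(1e5))  # fills space that was set aside once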
First rule I learnt in C programming: divide and conquer! I mean you should first create multiple functions and call them in your loop, because this loop does too many different things.
And I am worried about your algorithm:
if (pct_cens == 0 | pct_cens == 1 ) next
else {count <- count +1
Is there any reason you use while instead of for?
There is a difference between the while and for loops: with while, the number of iterations is not fixed in advance, whereas with for it is.
Finally, about your problem: you can trade a little memory for speed by storing repeated computations in variables.
Example:
lod <- quantile(rlnorm(100000, log(mu), log(sigma)), p = p)
X_before <- rlnorm(n, log(mu), log(sigma))
log(mu) and log(sigma) are computed twice: store the results in variables and you will save time, at the cost of a little more memory of course.
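A minimal sketch of that suggestion, reusing the variables defined in the question:
log_mu <- log(mu)         # compute once...
log_sigma <- log(sigma)
lod <- quantile(rlnorm(100000, log_mu, log_sigma), p = p)
X_before <- rlnorm(n, log_mu, log_sigma)   # ...and reuse inside the loop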