Hello,
I've had problems with the following while loop in R. I'm trying to find out with what number of samples (n) I can achieve a variance less than 0.01 (dtest), and to get the resulting values of n, m, s and d:
n <- 100
x <- rnorm(n, 0, 1)
sd(x)
d <- sd(x)/sqrt(n)
dtest <- 0.01
while (dtest <= 0.01) {
  x <- rnorm(n, 0, 1)
  n <- n + 1
  m <- mean(x)
  s <- sd(x)
  d <- s/sqrt(n)
  return(output <- data.frame(n, m, s, d))
}
The first time I ran it, the loop finished without problems and reported an n of approximately 27,000. But now, every time I execute the loop, it just keeps running.
There are a number of issues:
Your condition should compare d to dtest. Currently it compares two values that never change inside the loop, so the loop will run forever.
Increment n at the start of the loop. Otherwise you're using a different n to compute x than to compute d.
Create your results data frame once, after the loop, rather than creating and discarding one on each iteration. And don't use return(), which is meant for use inside functions.
Note that sd(x)/sqrt(n) is standard error, not variance. Variance would be sd(x)^2.
set.seed(13)
n <- 99
x <- rnorm(n, 0, 1)
d <- sd(x)/sqrt(n)
dtest <- 0.01
while (dtest <= d) {
  n <- n + 1
  x <- rnorm(n, 0, 1)
  s <- sd(x)
  d <- s/sqrt(n)
}
output <- data.frame(n,m = mean(x),s,d)
output
n m s d
1 9700 0.01906923 0.9848469 0.009999605
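Incidentally, if the goal really is a variance below 0.01 (see the note above about standard error vs variance), the natural quantity to monitor is the variance of the mean, s^2/n, which is just d^2. A minimal sketch of the same loop with that stopping rule (not part of the original answer):
set.seed(13)
n <- 99
x <- rnorm(n, 0, 1)
dtest <- 0.01
while (dtest <= var(x) / n) {
  # var(x)/n is the variance of the sample mean, i.e. d^2
  n <- n + 1
  x <- rnorm(n, 0, 1)
}
n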
I need to simulate the probabilities that are computed by the function transitionProbability1D from the isingLenzMC package. I want to evaluate them for 10 values of bF at once and receive a vector of results, but I still receive only one number and I don't know why. Here is my code:
N <- 100
conf0 <- genConfig1D(N)
conf1 <- flipConfig1D(conf0)
# transition probability at J = H = 1/kBT = 1.0; last argument: 1 = Metropolis, 2 = Glauber
bF <- 1:10
J <- h <- rep(1,10)
# HERE IT DOESN'T WORK EVEN THOUGH bF IS A VECTOR
transitionProbability1D(bF, conf0, conf1, J, h, 1)
# [1] 0.298615
You might want to look at how to vectorize a function.
For your example, the following would probably give you what you expect:
library(isingLenzMC)
N <- 100
conf0 <- genConfig1D(N)
conf1 <- flipConfig1D(conf0)
# transition probability at J = H = 1/kBT = 1.0; last argument: 1 = Metropolis, 2 = Glauber
bF <- 1:10
# Here I changed these inputs to single values
J <- h <- 1
# HERE IT DOESN'T WORK EVEN THOUGH bF IS A VECTOR
transitionProbability1D(bF, conf0, conf1, J, h, 1)
# Vectorize on the first argument
transitionProbability1D_vectorized <- Vectorize(transitionProbability1D, vectorize.args = "ikBT")
# Now there are as many results as input values
transitionProbability1D_vectorized(ikBT = bF, x = conf0, xflip = conf1, J = J, H = h, probSel = 1)
You could also use a (for) loop!
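For instance, a sketch of the loop version, assuming the same objects as above:
# Loop over the bF values one at a time and collect the results
probs <- numeric(length(bF))
for (i in seq_along(bF)) {
  probs[i] <- transitionProbability1D(bF[i], conf0, conf1, J, h, 1)
}
probs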
The pricing of the Asian option is approximated, using Monte Carlo simulation, by:
delta <- 1/12
T <- 2
S0 <- 100
sigma <- 0.20
K <- 100
r <- 0.01
n <- 10^4
m <- T/delta
S <- S0
for (i in 1:n) {
  for (j in 1:m) {
    W <- rnorm(1)
    Si <- S[length(S)]*exp((r - 0.5*sigma^2)*delta + sigma*sqrt(delta)*W)
    S <- c(S, Si)
  }
  Si.bar <- mean(S[-1])
  Ci <- exp(-r*T)*max(Si.bar - K, 0)
}
mean(Ci)
The inner for(j in 1:m) loop runs perfectly, I think. But when I run it n times with for(i in 1:n), the values in S get smaller and smaller as n grows, decreasing to almost zero. This leads to a mean (Si.bar <- mean(S[-1])) well below the strike price, K = 100.
I can't figure out what is wrong with the last two lines of code. I'm getting a value of 0 for the Asian call option, due to the payoff function, while the correct value is approximately 7 (mean(Ci)).
There are a couple of issues with your code. Firstly, it's inefficient in R to build a vector by repeated concatenation. Instead, you should allocate the vector up front and then assign to its members.
Secondly, as I understand it, the aim is to repeat the inner loop n times and store the output into members of a vector C before taking the mean. That's not what you're doing at the moment - each iteration of the outer loop makes S longer and overwrites Ci such that the last statement, mean(Ci) is meaningless.
Here's an amended version of the code. I've used plyr partly to make the code neater, and partly for its progress bar functionality.
library(plyr)
delta <- 1/12
T <- 2
S0 <- 100
sigma <- 0.20
K <- 100
r <- 0.01
n <- 10^4
m <- T/delta
S <- numeric(m + 1)
S[1] <- S0
asian_price <- function() {
  for (j in 1:m) {
    W <- rnorm(1)
    S[j + 1] <- S[j] * exp((r - 0.5 * sigma^2) * delta + sigma * sqrt(delta) * W)
  }
  Si.bar <- mean(S[-1])
  exp(-r * T) * max(Si.bar - K, 0)
}
C <- raply(n, asian_price(), .progress = "text")
mean(C)
# [1] 7.03392
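The inner time loop can also be eliminated entirely. Here is a sketch of a fully vectorized version (my own, not from the original answer), which draws all n * m normals at once and builds each price path from cumulative sums of the log-returns; it assumes the same parameter objects as above:
asian_price_vec <- function(n, m, S0, r, sigma, delta, K, T) {
  # n paths x m time steps of log-returns
  logret <- (r - 0.5 * sigma^2) * delta +
    sigma * sqrt(delta) * matrix(rnorm(n * m), n, m)
  S <- S0 * exp(t(apply(logret, 1, cumsum)))  # price paths, excluding S0
  Si.bar <- rowMeans(S)                       # average price along each path
  exp(-r * T) * pmax(Si.bar - K, 0)           # discounted payoff per path
}
mean(asian_price_vec(n, m, S0, r, sigma, delta, K, T))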
Suppose I have the following data frame
set.seed(36)
n <- 300
dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y),]
For each value of d$y <= 300, I have to create a variable res in which the numerator is the sum of the indicator (d$x <= d$y[i]) and the denominator is the sum of the indicator (d$y >= d$y[i]). I have written the code as a for loop:
res <- NULL
for (i in seq_len(sum(d$y <= 300))) {
  numerator <- sum(d$x <= d$y[i])
  denominator <- sum(d$y >= d$y[i])
  res[i] <- numerator / denominator
}
But my concern is that when the number of observations of x and y is large, that is, when the number of rows of the data frame increases, the for loop will be slow. Additionally, if I simulate the data 1000 times and run the for loop each time, the program will be very inefficient.
What can be the more efficient solution of the code?
This depends on d already being sorted as it is:
# example data
set.seed(36)
n <- 1e5
dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y),]
My suggestion (thanks to #alexis_laz for the denominator):
system.time(res3 <- {
  xs <- sort(d$x)        # sorted x
  yt <- d$y[d$y <= 300]  # truncated y
  num <- findInterval(yt, xs)
  den <- length(d$y) - match(yt, d$y) + 1L
  num/den
})
# user system elapsed
# 0 0 0
OP's approach:
system.time(res <- {
  res <- NULL
  for (i in seq_len(sum(d$y <= 300))) {
    numerator <- sum(d$x <= d$y[i])
    denominator <- sum(d$y >= d$y[i])
    res[i] <- numerator / denominator
  }
  res
})
# user system elapsed
# 50.77 1.13 52.10
# verify it matched
all.equal(res,res3) # TRUE
#d.b's approach:
system.time(res2 <- {
  numerator <- rowSums(outer(d$y, d$x, ">="))
  denominator <- rowSums(outer(d$y, d$y, "<="))
  res2 <- numerator/denominator
  res2 <- res2[d$y <= 300]
  res2
})
# Error: cannot allocate vector of size 74.5 Gb
# ^ This error is common when using outer() on large-ish problems
Vectorization. Generally, tasks are faster in R if they can be vectorized. The key functions related to ordered vectors have confusing names (findInterval, sort, order and cut), but fortunately they all work on vectors.
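As a quick illustration with made-up values: when xs is sorted, findInterval returns, for each query value, how many elements of xs are less than or equal to it.
xs <- c(1, 3, 4, 7)
findInterval(c(2, 5), xs)
# [1] 1 3   (one element <= 2, three elements <= 5)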
Continuous vs discrete. The match above should be a fast way to compute the denominator whether the data are continuous or have mass points / repeated values. If the data are continuous (and so have no repeats), the denominator can simply be seq(length(xs), length = length(yt), by = -1). If they are fully discrete with a lot of repetition (like the example here), there might be some way to make it faster still, perhaps like one of these:
# run-length encoding approach
den2 <- inverse.rle(with(rle(yt), list(
  values = length(xs) - length(yt) + rev(cumsum(rev(lengths))),
  lengths = lengths)))
# tabulation approach
tab <- unname(table(yt))
den3 <- rep(rev(cumsum(rev(tab))) + length(xs) - length(yt), tab)
# verify
all.equal(den,den2) # TRUE
all.equal(den,den3) # TRUE
findInterval will still work for the numerator for continuous data. It's not ideal for the repeated-values case considered here I guess (since we're redundantly finding the interval for many repeated yt values). Similar ideas for speeding that up likely apply.
Other options. As #chinsoon suggested, the data.table package might be a good fit if findInterval is too slow, since it has a lot of features focused on sorted data, but it's not obvious to me how to apply it here.
Instead of running a loop, generate all the numerators and denominators at once. This also allows you to keep track of which res is associated with which x and y. Later, you can keep only the ones you want.
You can use outer for element wise comparison between vectors.
numerator = rowSums(outer(d$y, d$x, ">=")) #Compare all y against all x
denominator = rowSums(outer(d$y, d$y, "<=")) #Compare all y against itself
res2 = numerator/denominator #Obtain 'res' for all rows
#I would first 'cbind' res2 to d and only then remove the ones for 'y <=300'
res2 = res2[d$y <= 300] #Keep only those 'res' that you want
Since this uses rowSums rather than an explicit loop, it should be faster (though note that outer builds n-by-n matrices, so memory use grows quadratically with the number of rows).
I'm trying to create a combinatoric function in R, and it depends on what number I set a variable to, which tells me I need a loop. Here is an example of the equation I am trying to implement, with the numbers from the example filled in:
The equation changes depending on what number is put in for b. I'm guessing I would need a sum and a loop.
comb = function(n, x) {return(factorial(n) / (factorial(x) * factorial(n-x)))}
a <- 8
b <- 4
c <- 0:b
p <- 0.05
total = function(n) {
return(((comb(a,b + c)*comb((n-a), (c - b)*(-1))/comb(n,a) - 0.05)
}
I will then find out what n is by setting the expression equal to 0.
You don't need to define comb -- this is provided by the function choose in base R. I think all you need is to combine choose and sum to compute your numerator:
total <- function(n) {
  sum(choose(a, b:(2*b)) * choose(n-a, b:0)) / choose(n, a) - p
}
# Plot total(n) for various n values:
a <- 8
b <- 4
p <- 0.05
n <- 20:100
plot(n, sapply(n, total))
For these parameters, total(n) crosses 0 between n = 36 and n = 37.
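To locate the crossing numerically rather than by eye, note that R's choose() is defined for non-integer n (via the gamma function), so uniroot() can treat total() as a continuous function of n; a sketch:
# Root of total(n) on the plotted range; expect a value between 36 and 37
uniroot(total, interval = c(20, 100))$root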
Trying to wrap my mind around vectorizing and to make some simulations faster, I found this very basic epidemic simulation. The code is from the book http://www.amazon.com/Introduction-Scientific-Programming-Simulation-Using/dp/1420068725/ref=sr_1_1?ie=UTF8&qid=1338069156&sr=8-1
#program spuRs/resources/scripts/SIRsim.r
SIRsim <- function(a, b, N, T) {
  # Simulate an SIR epidemic
  # a is infection rate, b is removal rate
  # N initial susceptibles, 1 initial infected, simulation length T
  # returns a matrix size (T+1)*3 with columns S, I, R respectively
  S <- rep(0, T+1)
  I <- rep(0, T+1)
  R <- rep(0, T+1)
  S[1] <- N
  I[1] <- 1
  R[1] <- 0
  for (i in 1:T) {
    S[i+1] <- rbinom(1, S[i], (1 - a)^I[i])
    R[i+1] <- R[i] + rbinom(1, I[i], b)
    I[i+1] <- N + 1 - R[i+1] - S[i+1]
  }
  return(matrix(c(S, I, R), ncol = 3))
}
The core of the simulation is the for loop. My question is: since the code produces the S[i+1] and R[i+1] values from the S[i] and R[i] values, is it possible to vectorize it with an apply function?
Many thanks.
It's hard to 'vectorize' iterative calculations, but this is a simulation, and simulations are likely to be run many times. So write this to do all the simulations at the same time by adding an argument M (the number of simulations to perform), allocating M x (T + 1) matrices, and then filling in successive columns (time points) of each simulation. The changes are remarkably straightforward (so I've probably made a mistake; I'm particularly concerned about the use of vectors in the second and third arguments to rbinom, though this is consistent with the documentation).
SIRsim <- function(a, b, N, T, M) {
  ## Simulate an SIR epidemic
  ## a is infection rate, b is removal rate
  ## N initial susceptibles, 1 initial infected, simulation length T
  ## M is the number of simulations to run
  ## returns a list of S, I, R matrices, each M simulations
  ## by T + 1 time points
  S <- I <- R <- matrix(0, M, T + 1)
  S[, 1] <- N
  I[, 1] <- 1
  for (i in seq_len(T)) {  # seq_along(T) would run only once, since T is a scalar
    S[, i+1] <- rbinom(M, S[, i], (1 - a)^I[, i])
    R[, i+1] <- R[, i] + rbinom(M, I[, i], b)
    I[, i+1] <- N + 1 - R[, i+1] - S[, i+1]
  }
  list(S = S, I = I, R = R)
}
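A quick usage sketch (the parameter values here are illustrative, not from the original):
out <- SIRsim(a = 0.0005, b = 0.1, N = 1000, T = 100, M = 1000)
# mean number of infected individuals at each time point, across simulations
plot(0:100, colMeans(out$I), type = "l", xlab = "time", ylab = "mean infected")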