Monte Carlo simulation of the correlation between two Brownian motions (continuous random walks) in R

y <- cumsum(rnorm(100,0,1)) # random walk: cumulative sum of N(0,1) increments (no drift)
y.ts <- ts(y)
x <- cumsum(rnorm(100,0,1))
x
x.ts <- ts(x)
ts.plot(y.ts, x.ts) # plot the two random walks
Regression.Q1 <- lm(y ~ x)
summary(Regression.Q1)
t.test1 <- summary(Regression.Q1)$coef[2, 3] # t-statistic of the slope coefficient
y[t] = y[t-1] + epsilon[t]
epsilon[t] ~ N(0,1)
set.seed(1)
t=1000
epsilon=sample(c(-1,1), t, replace = 1) # Generate k random walks across time {0, 1, ... , T}
N=T=1e3
y=t(apply(matrix(sample(c(-1,1),N*T,rep=TRUE),ncol=T),1,cumsum))
y[1]<-0
for (i in 2:t) {
y[i]<-y[i-1]+epsilon[i]
}
I need to:
Repeat the process 1,000 times (Monte Carlo simulations), namely build a loop around the previous program and save the t-statistic each time. This gives a sequence of 1,000 t-tests: S = (t-test1, t-test2, ..., t-test1000). Count the number of times the absolute value of the 1,000 t-tests exceeds 1.96, the critical value at the 5% significance level. If the series were I(0) you would find roughly 5%. That won't be the case here (spurious regression).
What do I need to add to save the respective coefficients?

Your posted code related to y[t] = y[t-1] + epsilon[t] is not real working code, but I can see that you are trying to store all 1000 * 2 random walks. There is no need to do this: we only care about the t-scores, not about what those realizations of the random walks are.
For this kind of problem, where we aim to replicate a procedure many times, it is handy to first write a function that executes the procedure a single time. You already have good working code for this; we just need to wrap it in a function (removing unnecessary parts like plotting):
sim <- function () {
  y <- cumsum(rnorm(100,0,1))
  x <- cumsum(rnorm(100,0,1))
  coef(summary(lm(y ~ x)))[2,3]
}
This function takes no input; it only returns the t-score for one experiment.
Now, we are going to repeat this 1000 times. We could write a for loop, but the function replicate is easier (read ?replicate if necessary):
S <- replicate(1000, sim())
Note this will take some time, much slower than it should be for such a simple task, because both lm and summary.lm are slow. A much faster way will be shown later.
Now, S is a vector with 1000 values, which is the "sequence of 1,000 t-tests" you want. To get "the number of times the absolute value of the 1,000 t-tests > 1.96", we can just use
sum(abs(S) > 1.96)
# [1] 756
The result 756 is just what I get; you will get something different as the simulation is random. But it will always be quite a large number as expected.
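The question also asks how to save the respective coefficients. A minimal sketch along the same lines (an addition, not part of the original answer): have the function return both the slope and its t-score, so that replicate() produces a 2 x 1000 matrix:
sim2 <- function () {
  y <- cumsum(rnorm(100, 0, 1))
  x <- cumsum(rnorm(100, 0, 1))
  est <- coef(summary(lm(y ~ x)))
  c(slope = est[2, 1], t = est[2, 3])  # slope estimate and its t-score
}
res <- replicate(1000, sim2())         # 2 x 1000 matrix with rows "slope" and "t"
sum(abs(res["t", ]) > 1.96)            # same count of |t| > 1.96 as before
res["slope", ] then holds the 1,000 saved coefficients.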
A faster version of sim:
fast_sim <- function () {
  y <- cumsum(rnorm(100,0,1))
  x <- cumsum(rnorm(100,0,1))
  y <- y - mean(y)                           # centring removes the intercept
  x <- x - mean(x)
  xty <- crossprod(x,y)[1]
  xtx <- crossprod(x)[1]
  b <- xty / xtx                             # slope estimate
  sigma <- sqrt(sum((y - x * b) ^ 2) / 98)   # residual sd on 100 - 2 df
  b * sqrt(xtx) / sigma                      # t-score: b / (sigma / sqrt(xtx))
}
This function computes the simple linear regression without lm, and the t-score without summary.lm.
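As a quick sanity check (an addition, not part of the original answer), the manual t-score can be compared with summary.lm on one fixed dataset:
set.seed(42)
y <- cumsum(rnorm(100, 0, 1)); x <- cumsum(rnorm(100, 0, 1))
coef(summary(lm(y ~ x)))[2, 3]                 # t-score via lm
yc <- y - mean(y); xc <- x - mean(x)
b <- crossprod(xc, yc)[1] / crossprod(xc)[1]
sigma <- sqrt(sum((yc - xc * b)^2) / 98)
b * sqrt(crossprod(xc)[1]) / sigma             # should agree with the value above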
S <- replicate(1000, fast_sim())
sum(abs(S) > 1.96)
# [1] 778
An alternative way is to use cor.test:
fast_sim2 <- function () {
  y <- cumsum(rnorm(100,0,1))
  x <- cumsum(rnorm(100,0,1))
  unname(cor.test(x, y)[[1]])  # the t-statistic of the Pearson correlation test
}
S <- replicate(1000, fast_sim2())
sum(abs(S) > 1.96)
# [1] 775
Let's have a benchmark:
system.time(replicate(1000, sim()))
# user system elapsed
# 1.860 0.004 1.867
system.time(replicate(1000, fast_sim()))
# user system elapsed
# 0.088 0.000 0.090
system.time(replicate(1000, fast_sim2()))
# user system elapsed
# 0.308 0.004 0.312
cor.test is much faster than lm + summary.lm, but manual computation is even faster!

Related

Implement a Monte Carlo Simulation Method to Estimate an Integral in R

I am trying to implement a Monte Carlo simulation method to estimate an integral in R. However, I still get the wrong answer. My code is as follows:
f <- function(x){
((cos(x))/x)*exp(log(x)-3)^3
}
t <- integrate(f,0,1)
n <- 10000 #Assume we conduct 10000 simulations
int_gral <- Monte_Car(n)
int_gral
You are not doing Monte Carlo here. Monte Carlo is a simulation method that helps you approximate integrals using sums/means based on random variables.
You should do something in this flavor (you might have to verify that it is correct to say that the mean of the f output approximates your integral):
f <- function(n){
  x <- runif(n)  # n uniform draws on (0, 1)
  return(((cos(x)) / x) * exp(log(x) - 3)^3)
}
int_gral <- mean(f(10000))
What your code does is take a number n and return ((cos(n))/n)*exp(log(n)-3)^3; there is no randomness in that.
Update
Now, to get a more precise estimate, you need to replicate this step K times. Rather than using a loop, you can use the replicate function:
K <- 100
dist <- data.frame(
  int = replicate(K, mean(f(10000)))
)
You get a distribution of estimators for your integral:
library(ggplot2)
ggplot(dist) + geom_histogram(aes(x = int, y = ..density..))
and you can use mean to have a numerical value:
mean(dist$int)
# [1] 2.95036e-05
You can evaluate the precision of your estimates with
sd(dist$int)
# [1] 2.296033e-07
Here it is small because n is already large, giving you good precision at the first step.
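As a further check (an addition, not part of the original answer), the Monte Carlo estimate can be compared with R's deterministic quadrature on the original integrand:
f_det <- function(x) ((cos(x)) / x) * exp(log(x) - 3)^3  # the integrand from the question
integrate(f_det, 0, 1)$value                             # should be close to mean(dist$int)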
I have managed to change the code as follows. Kindly confirm that I am doing the right thing.
Regards.
f <- function(x){
((cos(x))/x)*exp(log(x)-3)^3
}
set.seed(234)
n<-10000
for (i in 1:10000) {
x<-runif(n)
I<-sum(f(x))/n
}
I

How to generate samples from MVN model?

I am trying to run some code in R based on this paper, following example 5.1. I want to simulate the following:
My background in R isn't great, so I have the following code below. How can I generate a histogram and samples from this?
xseq<-seq(0, 100, 1)
n<-100
Z<- pnorm(xseq,0,1)
U<- pbern(xseq, 0.4, lower.tail = TRUE, log.p = FALSE)
Beta <- (-1)^U*(4*log(n)/(sqrt(n)) + abs(Z))
Some demonstrations of tools that will be of use:
rnorm(1) # generates one standard normal variable
rnorm(10) # generates 10 standard normal variables
rnorm(1, 5, 6) # generates 1 normal variable with mu = 5, sigma = 6
# not needed for this problem, but perhaps worth saying anyway
rbinom(5, 1, 0.4) # generates 5 Bernoulli variables that are 1 w/ prob. 0.4
So, to generate one instance of a beta:
n <- 100 # using the value you gave; I have no idea what n means here
u <- rbinom(1, 1, 0.4) # make one Bernoulli variable
z <- rnorm(1) # make one standard normal variable
beta <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
But now, you'd like to do this many times for a Monte Carlo simulation. One way you might do this is by building a function, having beta be its output, and using the replicate() function, like this:
n <- 100 # putting this here because I assume it doesn't change
genbeta <- function(){ # output of this function will be one copy of beta
  u <- rbinom(1, 1, 0.4)
  z <- rnorm(1)
  return((-1)^u * (4 * log(n) / sqrt(n) + abs(z)))
}
# note that we don't need to store beta anywhere directly;
# rather, it is just the return()ed value of the function we defined
betadraws <- replicate(5000, genbeta())
hist(betadraws)
This will have the effect of making 5000 copies of your beta variable and putting them in a histogram.
There are other ways to do this -- for instance, one might just make a big matrix of the random variables and work directly with it -- but I thought this would be the clearest approach for starting out.
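For completeness, here is a minimal sketch of that matrix-style approach (an addition; it vectorises the same draws, so no loop or replicate() is needed):
n <- 100
m <- 5000                      # number of Monte Carlo draws
u <- rbinom(m, 1, 0.4)         # all Bernoulli variables at once
z <- rnorm(m)                  # all standard normals at once
betadraws <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
hist(betadraws)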
EDIT: I realized that I ignored the second equation entirely, which you probably didn't want.
We've now made a vector of beta values, and you can control the length of the vector in the first parameter of the replicate() function above. I'll leave it as 5000 in my continued example below.
To get random samples of the Y vector, you could use something like:
x <- replicate(5000, rnorm(17))
# makes a 17 x 5000 matrix of independent standard normal variables
epsilon <- rnorm(17)
# vector of 17 standard normals
y <- x %*% betadraws + epsilon
# y is now a 17 x 1 matrix (morally equivalent to a vector of length 17)
and if you wanted to get many of these, you could wrap that inside another function and replicate() it.
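A minimal sketch of that wrapping (an addition; the dimensions follow the example above):
gen_y <- function() {
  x <- matrix(rnorm(17 * 5000), nrow = 17)  # a fresh 17 x 5000 matrix of standard normals
  epsilon <- rnorm(17)                      # a fresh error vector of 17 standard normals
  x %*% betadraws + epsilon                 # one simulated Y (a 17 x 1 matrix)
}
ydraws <- replicate(100, gen_y())           # a 17 x 1 x 100 array of simulated Y's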
Alternatively, if you didn't want the Y vector, but just a single Y_i component:
x <- rnorm(5000)
# x is a vector of 5000 iid standard normal variables
epsilon <- rnorm(1)
# epsilon_i is a single standard normal variable
y <- t(x) %*% betadraws + epsilon
# t() is the transpose function; y is now a 1 x 1 matrix

How to find the correlation coefficient in a for loop repeated 5000 times, and save the statistic?

For 2 independent normally distributed variables x and y, generated using x = rnorm(50) and y = rnorm(50), calculate the correlation 5000 times and save the result each time. What is the likelihood that a correlation with absolute value greater than 0.3 is computed? (Use set.seed(42) and plot a histogram of the coefficient spread.)
This is what I have tried so far...
set.seed(42)
n <- 50 #length of random sequence
x_norm <- rnorm(n)
y_norm <- rnorm(n)
nrun <- 5000
corr <- numeric(nrun)
for (i in 1:nrun) {
corrxy <- cor(x_norm,y_norm)
corr[i] <- sum(abs(corrxy > 0.3)) / n #save statistic in the vector
}
hist(corr)
It is expected that I get 5000 different coefficient values saved in corr[i], and when plotted using hist(), these coefficients should approximately follow a normal distribution. But I do not understand how the for loop works and how to incorporate the check for a coefficient being greater than 0.3.
I think you were nearly there. You just had to shift some code outside and inside the for loop.
You want new data for each run of the loop (otherwise you get the same correlation 5000 times) and you need to save the correlation each time the loop runs. This results in a vector of 5000 correlations which you can use to look at the proportion of correlations (divide by the number of runs, not the number of observations) that are higher than .3 outside of the for loop.
Edit: One final correction is needed in the bracketing of the absolute value: you want to count the absolute correlations > .3, not take the absolute value of the logical condition corrxy > 0.3.
set.seed(42)
n <- 50 #length of random sequence
nrun <- 5000
corrxy <- numeric(nrun) # The correlation is the statistic you want to save
for (i in 1:nrun) {
  x_norm <- rnorm(n)  # Compute a new dataset for each run (otherwise you get the same correlation)
  y_norm <- rnorm(n)
  corrxy[i] <- cor(x_norm, y_norm)  # Calculate the correlation
}
hist(corrxy)
sum(abs(corrxy) > 0.3) / nrun # look at the proportion of runs that have cor > .3
The histogram from hist(corrxy) shows the spread of the 5000 correlations. The proportion of correlations larger than .3 in absolute value is 0.034 in this case.
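As a rough analytic cross-check (an addition, not from the original answer): under independence, the sample correlation of n = 50 pairs has variance approximately 1/(n - 1), so the tail probability can be approximated with a normal calculation:
2 * pnorm(-0.3 / sqrt(1 / (50 - 1)))  # roughly 0.036, in line with the simulated 0.034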
Here's another way of doing this kind of simulation without explicitly writing a loop.
First, define your simulation:
my_sim <- function(n) {  # n is the size of each normal sample
  x <- rnorm(n)
  y <- rnorm(n)
  corrxy <- cor(x, y)
  corrxy  # return the correlation (single value)
}
Now we can call this function many times with replicate():
set.seed(123)
nrun <- 10
my_results <- replicate(nrun, my_sim(n=50))
#my_results
# [1] -0.0358698314 -0.0077403045 -0.0512509071 -0.0998484901 0.1230261286 0.1001124010 -0.0002023124
# [8] 0.2017120443 0.0644662387 0.0567232640
Now my_results holds the correlation from each simulation (just 10 in this example).
And you can compute your statistics:
sum(abs(my_results)> 0.3) / nrun # nrun is 10
or plot:
hist(my_results)

Simulate a compound Poisson process in R

I'm trying to simulate a compound Poisson process in R. The process is defined by $\sum_{j=1}^{N_t} Y_j$, where $(Y_j)$ is an i.i.d. sequence of $N(0,1)$ values and $N_t$ is a Poisson process with rate $1$. I have tried to simulate this in R without luck. I have an algorithm to compute it as follows:
Simulate the compound Poisson process from 0 to T:
Initiate: $k = 0$
Repeat while $\sum_{i=1}^k T_i < T$:
Set $k = k + 1$
Simulate $T_k \sim \mathrm{Exp}(\lambda)$ (in my case $\lambda = 1$)
Simulate $Y_k \sim N(0,1)$ (this is just a special case; I would like to be able to change this to any distribution)
The trajectory is given by $X_t = \sum_{j=1}^{N_t} Y_j$ where $N_t = \sup\{k : \sum_{i=1}^k T_i \leq t\}$
Can someone help me simulate this in R so that I can plot the process? I have tried, but can't get it done.
Use cumsum for the cumulative sums that determine the times N_t as well as the X_t. This illustrative code specifies the number of times to simulate, n, simulates the times in n.t and the values in x, and (to display what it has done) plots the trajectory.
n <- 1e2                      # number of jumps to simulate
n.t <- cumsum(rexp(n))        # event times: cumulative sums of Exp(1) inter-arrival times
x <- c(0,cumsum(rnorm(n)))    # process values: cumulative sums of the N(0,1) jumps
plot(stepfun(n.t, x), xlab="t", ylab="X")
This algorithm, since it relies on low-level optimized functions, is fast: the six-year-old system I tested it on will generate over three million (time, value) pairs per second.
That's usually good enough for simulation, but it doesn't quite satisfy the problem, which asks to generate a simulation out to time T. We can leverage the preceding code, but the solution is a little trickier. It computes a reasonable upper limit on how many times will occur in the Poisson process before time T. It generates the inter-arrival times. This is wrapped in a loop that will repeat the procedure in the (rare) event the time T is not actually reached.
The additional complexity doesn't change the asymptotic calculation time.
T <- 1e2 # Specify the end time
T.max <- 0 # Last time encountered
n.t <- numeric(0) # Inter-arrival times
while (T.max < T) {
#
# Estimate how many random values to generate before exceeding T.
#
T.remaining <- T - T.max
n <- ceiling(T.remaining + 3*sqrt(T.remaining))
#
# Continue the Poisson process.
#
n.new <- rexp(n)
n.t <- c(n.t, n.new)
T.max <- T.max + sum(n.new)
}
#
# Sum the inter-arrival times and cut them off after time T.
#
n.t <- cumsum(n.t)
n.t <- n.t[n.t <= T]
#
# Generate the iid random values and accumulate their sums.
#
x <- c(0,cumsum(rnorm(length(n.t))))
#
# Display the result.
#
plot(stepfun(n.t, x), xlab="t", ylab="X", sub=paste("n =", length(n.t)))
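Since the question asks to be able to swap the N(0,1) jumps for any other distribution, here is a hedged sketch (an addition, not from the original answer) that wraps the same logic in a function taking the jump sampler as an argument:
sim_cpp <- function(T, lambda = 1, rjump = rnorm) {
  # first batch of inter-arrival times, sized so it usually reaches T
  n <- ceiling(lambda * T + 3 * sqrt(lambda * T))
  n.t <- cumsum(rexp(n, rate = lambda))
  while (tail(n.t, 1) < T)                   # extend in the rare case we fell short of T
    n.t <- c(n.t, tail(n.t, 1) + cumsum(rexp(n, rate = lambda)))
  n.t <- n.t[n.t <= T]                       # keep only the arrivals up to time T
  x <- c(0, cumsum(rjump(length(n.t))))      # jump sizes from the supplied sampler
  stepfun(n.t, x)
}
plot(sim_cpp(100), xlab = "t", ylab = "X")                                  # N(0,1) jumps
plot(sim_cpp(100, rjump = function(k) rexp(k, 2)), xlab = "t", ylab = "X")  # Exp(2) jumps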

Speed up Monte Carlo simulation with nested loop (2)

I would like to know if there is a more efficient way to speed up the code below. It uses a procedure where subsampling is required in the nested loop (which a previous answer, https://stackoverflow.com/a/13629611/1176697, helped to make more efficient).
R has a tendency to hang when B = 500, although the computer's OS isn't unduly affected.
The goal is to run the code below with B = 1000 and with larger m values (m = 75, m = 100, m = 150).
I have detailed the procedure in the code below and included a link to a reproducible data set.
#Estimation of order m `leave one out' hyperbolic efficiency scores
#The procedure sequentially works through each observation in `IOs' and
#calculates a DEA order M efficiency score by leaving out the observation
# under investigation in the DEA reference set
# Step 1: Load the packages, create Inputs (x) and Outputs (y), choose
# m(the order of the partial frontier or reference set to be used in DEA)
# and B the number of monte carlo simulations of each m order DEA estimate
# Step 2: For each observation a in x, matrices x1 and y1
# are created which 'leave out' this observation.
# Step 3: From these matrices subsamples (xref, yref) of size [m,] are
# taken and used in DEA estimation.
# Step 4: The DEA estimation uses the m subsample from step 3
# as a reference set and evaluates the efficiency of the observation that
# has been 'left out'
# (thus the first two arguments to dea() are matrices of order [1, 3])
# Step 5: Steps 3 and 4 are repeated B times to obtain B simulations of the
# order m efficiency score and a mean and standard deviation are
# calculated and placed in effm.
# IOs data can be found here: https://dl.dropbox.com/u/1972975/IOs.txt
# From IOs an Input matrix (x[1376,3]) and an Output matrix (y[1376,3])
# are created.
library(Benchmarking)
x <- IOs[, 1:3]
y <- IOs[, 4:6]
A <- nrow(x)
effm <- matrix(nrow = A, ncol = 2)
m <- 50
B <- 500
pb <- txtProgressBar(min = 0, max = A, style = 3)
for (a in 1:A) {
  x1 <- x[-a, ]
  y1 <- y[-a, ]
  theta <- numeric(B)
  xynrow <- nrow(x1)
  mB <- m * B
  # get all of your samples at once (https://stackoverflow.com/a/13629611/1176697)
  xrefm <- x1[sample(1:xynrow, mB, replace = TRUE), ]
  yrefm <- y1[sample(1:xynrow, mB, replace = TRUE), ]
  deaX <- as.matrix(x[a, ], ncol = 3)
  deaY <- as.matrix(y[a, ], ncol = 3)
  for (i in 1:B) {
    theta[i] <- dea(deaX, deaY, RTS = 'vrs', ORIENTATION = 'graph',
                    xrefm[(1:m) + (i - 1) * m, ], yrefm[(1:m) + (i - 1) * m, ], FAST = TRUE)
  }
  effm[a, 1] <- mean(theta)
  effm[a, 2] <- sd(theta) / sqrt(B)
  setTxtProgressBar(pb, a)
}
close(pb)
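One direction worth trying (a sketch, not a tested solution): each observation's B evaluations are independent of every other observation's, so the outer loop over a can be run in parallel with the base parallel package, keeping the dea() call and the sampling scheme from the question unchanged. The core count and the forking backend (mclapply does not parallelise on Windows) are assumptions to adjust for your machine.
library(parallel)
one_obs <- function(a) {
  x1 <- x[-a, ]; y1 <- y[-a, ]
  xrefm <- x1[sample(nrow(x1), m * B, replace = TRUE), ]
  yrefm <- y1[sample(nrow(y1), m * B, replace = TRUE), ]
  deaX <- as.matrix(x[a, ], ncol = 3)
  deaY <- as.matrix(y[a, ], ncol = 3)
  theta <- vapply(seq_len(B), function(i)
    dea(deaX, deaY, RTS = 'vrs', ORIENTATION = 'graph',
        xrefm[(1:m) + (i - 1) * m, ], yrefm[(1:m) + (i - 1) * m, ], FAST = TRUE),
    numeric(1))
  c(mean(theta), sd(theta) / sqrt(B))  # mean and standard error, as in effm
}
effm <- do.call(rbind, mclapply(seq_len(A), one_obs, mc.cores = 4))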
