Generating covariates from a Bernoulli distribution with a particular correlation structure in R

I am trying to reproduce a simulation result from an article using R and I am stuck at the very beginning. I am supposed to generate, say, 10 covariates, all following a Bernoulli distribution, correlated through the regression structure P(X_j = 1 | X_{j-1}) = 0.1 + 0.1(X_{j-1} - 0.15) for j = 2, ..., 10, with X_1 ~ Binomial(1, 0.5), for n = 100 individuals. I thought it was supposed to be easy, but I think I am missing something and have no idea how to proceed. Any suggestion or idea (even just clarifying what they are trying to say) would be really helpful!

Recursive definitions like these are messy. You can use a loop:
draw <- function() {
  N <- 10
  x <- numeric(N)
  x[1] <- runif(1) < 0.5
  for (i in 2:N) {
    # note x[i-1], not x[i]: each covariate depends on the previous one
    x[i] <- runif(1) < 0.1 + 0.1*(x[i-1] - 0.15)
  }
  x  # return the vector; without this the function returns NULL
}
draw()
or a call to Reduce:
Reduce(function(xjm1, x) {
  as.numeric(runif(1) < 0.1 + 0.1*(xjm1 - 0.15))
}, rep(0, 9), init = runif(1) > 0.5, accumulate = TRUE)
If you needed to generate a bunch of these values and calculate the correlation, you could do
xx <- replicate(100, draw())
cor(t(xx))

Related

Monte Carlo simulation for Geometric Brownian motion in R giving negative numbers

I was under the impression that simulations involving geometric Brownian motion are not supposed to yield negative numbers. However, I was trying the following Monte Carlo simulation in R for a GBM, where my initial asset price is 98.78, mu = 0.208, sigma = 0.824. I initialized my data frame as such (I am just doing 1000 simulations over 5 years, simulating the price each year):
V = matrix(0, nrow = 1000, ncol = 6)
V_df = data.frame(V)
Then:
V[, 1] <- 98.78
I then perform the simulations (with dt = 1):
for (i in 1:1000) {
  for (j in 1:5) {
    V_df[i,j+1] <- V_df[i,j]*(mu*dt + sigma*sqrt(dt)*rnorm(1)) + V_df[i,j]
  }
}
When I then check V_df there are many negative entries, which is not supposed to be the case. Would anyone have an idea as to why this is so?
Thanks.
Your solution to the GBM is not correct. One step should read
V_df[i,j+1] <- V_df[i,j]*exp((mu - sigma^2/2)*dt + sigma*sqrt(dt)*rnorm(1))
However, doing this with a double loop is very inefficient. You can create a matrix of random numbers and use cumprod or cumsum to generate the paths. Which function you use depends on when you take the exp.
See also https://en.m.wikipedia.org/wiki/Geometric_Brownian_motion
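For illustration, a minimal vectorized sketch along those lines, using the question's parameters (the variable names here are just for exposition):
mu <- 0.208; sigma <- 0.824; dt <- 1
nsim <- 1000; nstep <- 5
# one exact log-normal step per cell
Z <- matrix(rnorm(nsim * nstep), nrow = nsim)
steps <- exp((mu - sigma^2/2)*dt + sigma*sqrt(dt)*Z)
# accumulate the multiplicative steps along each row with cumprod
V <- 98.78 * cbind(1, t(apply(steps, 1, cumprod)))
all(V > 0)  # TRUE: log-normal steps can never go negative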

Exponential distribution in R

I want to simulate some data from an exp(1) distribution, but the values have to be > 0.5, so I used a while loop. It does not seem to work as I would like it to. Thanks in advance for your responses!
x1 <- c()
w <- rexp(1)
while (length(x1) < 100) {
  if (w > 0.5) {
    x1 <- w
  } else {
    w <- rexp(1)
  }
}
1) The code in the question has these problems:
we need a new random variable on each iteration, but a new one is only generated when the if condition is FALSE
x1 is repeatedly overwritten rather than extended
although while could be used, repeat seems better here, since having the test at the end is a better fit than having it at the beginning
We can fix this up like this:
x1 <- c()
repeat {
  w <- rexp(1)
  if (w > 0.5) {
    x1 <- c(x1, w)
    if (length(x1) == 100) break
  }
}
1a) A variation would be the following. Note that an if whose condition is FALSE evaluates to NULL when there is no else leg, so if the condition on the line marked ## is FALSE, nothing is concatenated to x1.
x1 <- c()
repeat {
  w <- rexp(1)
  x1 <- c(x1, if (w > 0.5) w) ##
  if (length(x1) == 100) break
}
2) Alternatively, generate 200 exponential random variables at once, keeping only those greater than 0.5; if fewer than 100 remain, repeat. At the end, take the first 100 from the last batch generated. We have chosen 200 to be sufficiently large that on most runs only one iteration of the loop will be needed.
repeat {
  r <- rexp(200)
  r <- r[r > 0.5]
  if (length(r) >= 100) break
}
r <- head(r, 100)
Alternative (2) is actually faster than (1) or (1a) because it is more highly vectorized. This is despite it throwing away more exponential random variables than the other solutions.
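If you want to verify the speed claim on your own machine, one rough way is to wrap (1) and (2) as functions and time them (exact timings will of course vary):
f1 <- function() {
  x1 <- c()
  repeat {
    w <- rexp(1)
    if (w > 0.5) {
      x1 <- c(x1, w)
      if (length(x1) == 100) break
    }
  }
  x1
}
f2 <- function() {
  repeat {
    r <- rexp(200)
    r <- r[r > 0.5]
    if (length(r) >= 100) break
  }
  head(r, 100)
}
system.time(replicate(1000, f1()))
system.time(replicate(1000, f2()))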
I would advise against a while (or any other accept/reject) loop; instead use the methods from truncdist:
# Sample 1000 observations from a truncated exponential
library(truncdist);
x <- rtrunc(1000, spec = "exp", a = 0.5);
# Plot
library(ggplot2);
ggplot(data.frame(x = x), aes(x)) + geom_histogram(bins = 50) + xlim(0, 10);
It's also fairly straightforward to implement a sampler using inverse transform sampling to draw samples from a truncated exponential distribution that avoids rejecting samples in a loop. This will be a more efficient method than any accept/reject-based sampling method, and works particularly well in your case, since there exists a closed form of the truncated exponential cdf.
See for example this post for more details.
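As a concrete sketch of that idea for the exp(1) case truncated at a = 0.5 (rtexp is just an illustrative name, not a library function): solving (F(x) - F(a)) / (1 - F(a)) = u for x gives x = a - log(1 - u) / rate, so
rtexp <- function(n, rate = 1, a = 0.5) {
  u <- runif(n)
  # invert the truncated-exponential CDF; by memorylessness this is
  # equivalent to a + rexp(n, rate)
  a - log(1 - u) / rate
}
x1 <- rtexp(100)  # 100 draws, all > 0.5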

Simulating thousands of regressions and obtaining p-values

I'm looking to do some basic simulation in R to examine the nature of p-values. My goal is to see whether large sample sizes trend towards small p-values. My thought is to generate random vectors of 1,000,000 data points, regress them on each other, and then plot the distribution of p-values and look for skew.
This is what I was thinking so far:
x1 = runif(1000000, 0, 1000)
x2 = runif(1000000, 0, 1000)
model1 = lm(x2~x1)
Using code taken from another thread:
lmp <- function(modelobject) {
  if (class(modelobject) != "lm") stop("Not an object of class 'lm'")
  f <- summary(modelobject)$fstatistic
  p <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  attributes(p) <- NULL
  return(p)
}
lmp(model1)
[1] 0.3874139
Any suggestions for how I might do this for 1000 models or even more? Thanks!
See ?replicate ... but note that the p-value you're calculating assumes Gaussian errors, not uniform ones (not that this will matter much at n = 10^6).
Specifically, something like this:
nrep <- 1000
ndat <- 1000000
results <- replicate(nrep, {
  x1 <- runif(ndat, 0, 1000)
  x2 <- runif(ndat, 0, 1000)
  model1 <- lm(x1 ~ x2)
  lmp(model1)
})
should work, but it will take a long time to run; I'd suggest making nrep and ndat smaller to try it out.

Fill matrix with loop

I am trying to create an n-by-k matrix of k MVN covariates using a loop.
Quite simple, but it is not working so far... Here is my code:
n=1000
k=5
p=100
mu=0
sigma=1
x=matrix(data=NA, nrow=n, ncol=k)
for (i in 1:k){
  x[[i]] = mvrnorm(n, mu, sigma)
}
What's missing?
I see several things here:
You may want to set the random seed for replicability (set.seed(20430)). This means that every time you run the code, you will get exactly the same set of pseudorandom variates.
Next, your data will just be independent variates; they won't actually have any multivariate structure (although that may be what you want). In general, if you want to generate multivariate data, you should use ?mvrnorm, from the MASS package; a sketch appears at the end of this answer. (For more info, see here.)
As a minor point, if you want standard normal data, you don't need to specify mu = 0 and sigma = 1, as those are the default values for rnorm().
You don't need a loop to fill a matrix in R; just generate as many values as you like and pass them directly via the data= argument of the matrix() function. If you really were committed to using a loop, you should probably use a double loop, so that you are looping over the columns and, within each column, looping over the rows. (Note that this is a very inefficient way to code in R, although I do things like that all the time ;-).
Lastly, I can't tell what p is supposed to be doing in your code.
Here is a basic way to do what you seem to be going for:
set.seed(20430)
n = 1000
k = 5
dat = rnorm(n*k)
x = matrix(data=dat, nrow=n, ncol=k)
If you really wanted to use loops you could do it like this:
mu = 0
sigma = 1
x = matrix(data=NA, nrow=n, ncol=k)
for(j in 1:k){
  for(i in 1:n){
    x[i,j] = rnorm(1, mu, sigma)
  }
}
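As mentioned above, here is a minimal MASS::mvrnorm sketch of genuinely multivariate draws (the exchangeable correlation of 0.3 is just an illustrative assumption):
library(MASS)
# k-by-k exchangeable correlation matrix with off-diagonal 0.3
Sigma <- matrix(0.3, nrow = k, ncol = k)
diag(Sigma) <- 1
x_mv <- mvrnorm(n, mu = rep(0, k), Sigma = Sigma)  # n rows of k correlated normals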
Define the matrix first:
E <- matrix(data = 0, nrow = 10, ncol = 10)
Then run two loops, iterating i over rows and j over columns; mine is an exchangeable correlation structure:
for (i in 1:10) {
  for (j in 1:10) {
    if (i == j) {
      E[i, j] <- 1
    } else {
      E[i, j] <- 0.6
    }
  }
}
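The same matrix can also be built without any loops; a loop-free equivalent:
E <- matrix(0.6, nrow = 10, ncol = 10)
diag(E) <- 1  # set the diagonal to 1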
A <- c(2, 3, 4, 5)  # in your case, the row terms
B <- c(3, 4, 5, 6)  # in your case, the column terms
x <- matrix(nrow = length(A), ncol = length(B))
for (i in 1:length(A)) {
  for (j in 1:length(B)) {
    x[i, j] <- A[i] * B[j]  # or apply your similarity function, simi(A[i], B[j])
  }
}
x  # matrix is filled
I was thinking of this from my own problem's perspective.

How do I best simulate an arbitrary univariate random variate using its probability function?

In R, what's the best way to simulate an arbitrary univariate random variate if only its probability density function is available?
Here is a (slow) implementation of the inverse cdf method when you are only given a density.
den <- dnorm  # replace with your own density
# calculates the cdf by numerical integration
cdf <- function(x) integrate(den, -Inf, x)[[1]]
# inverts the cdf
inverse.cdf <- function(x, cdf, starting.value = 0) {
  lower.found <- FALSE
  lower <- starting.value
  while (!lower.found) {
    if (cdf(lower) >= (x - .000001))
      lower <- lower - (lower - starting.value)^2 - 1
    else
      lower.found <- TRUE
  }
  upper.found <- FALSE
  upper <- starting.value
  while (!upper.found) {
    if (cdf(upper) <= (x + .000001))
      upper <- upper + (upper - starting.value)^2 + 1
    else
      upper.found <- TRUE
  }
  uniroot(function(y) cdf(y) - x, c(lower, upper))$root
}
# generates 1000 random variables of distribution 'den'
vars <- apply(matrix(runif(1000)), 1, function(x) inverse.cdf(x, cdf))
hist(vars)
To clarify the "use Metropolis-Hastings" answer above: suppose ddist() is your probability density function. Something like:
n <- 10000
cand.sd <- 0.1
init <- 0
vals <- numeric(n)
vals[1] <- init
oldprob <- 0
for (i in 2:n) {
  newval <- rnorm(1, mean = vals[i-1], sd = cand.sd)
  newprob <- ddist(newval)
  if (runif(1) < newprob/oldprob) {
    vals[i] <- newval
  } else vals[i] <- vals[i-1]
  oldprob <- newprob
}
Notes:
completely untested
efficiency depends on the candidate distribution (i.e. the value of cand.sd); for maximum efficiency, tune cand.sd to an acceptance rate of 25-40%
results will be autocorrelated ... (although I guess you could always sample() the results to scramble them, or thin)
may need to discard a "burn-in", if your starting value is weird
The classical approach to this problem is rejection sampling (see e.g. Press et al., Numerical Recipes).
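A minimal rejection-sampling sketch, assuming the target density den is bounded above by M on [lo, hi] and negligible outside that interval (the defaults here are illustrative and happen to cover dnorm):
rejection_sample <- function(n, den, lo = -5, hi = 5, M = 0.5) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- runif(n, lo, hi)           # proposals from a uniform envelope
    keep <- runif(n) < den(x) / M   # accept with probability den(x)/M
    out <- c(out, x[keep])
  }
  out[seq_len(n)]
}
vars <- rejection_sample(1000, dnorm)
hist(vars)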
Use the cumulative distribution function: http://en.wikipedia.org/wiki/Cumulative_distribution_function
Then just use its inverse. See here for a better picture: http://en.wikipedia.org/wiki/Normal_distribution
That means: pick a random number from [0, 1], treat it as a value of the CDF, and look up the corresponding variate. The inverse CDF is also called the quantile function.
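When the quantile function is already available in R, this recipe is a one-liner; for example, for the normal distribution:
vars <- qnorm(runif(1000))  # inverse-CDF sampling via the built-in quantile function
hist(vars)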
This is a comment, but I don't have enough reputation to leave a comment on Ben Bolker's answer.
I am new to Metropolis, but IMHO this code is wrong because:
a) newval is drawn from a normal distribution, whereas in other implementations it is drawn from a uniform distribution; the value must be drawn from the range covered by the random variable. For example, for a Gaussian distribution this should be something like runif(1, -5, +5).
b) the prob value must be updated only on acceptance.
Hope this helps, and I hope someone with enough reputation can correct this answer (especially mine, if I am wrong).
# the distribution
ddist <- dnorm
# number of random numbers
n <- 100000
# the center of the range is taken as init
init <- 0
# the following should go into a function
vals <- numeric(n)
vals[1] <- init
oldprob <- 0
for (i in 2:n) {
  newval <- runif(1, -5, +5)
  newprob <- ddist(newval)
  if (runif(1) < newprob/oldprob) {
    vals[i] <- newval
    oldprob <- newprob  # updated only on acceptance
  } else vals[i] <- vals[i-1]
}
# final view
hist(vals, breaks = 100)
# and comparison
hist(rnorm(length(vals)), breaks = 100)
