Simulations in R for Heteroskedasticity

I want to simulate heteroskedasticity and see how White's test performs for various sample sizes (this is similar to Greene, but testing different kinds).
Now, the model I wish to test is:
Replications: n = 1000
y = 1.5x + e
x ~ N(25, 15)
e ~ N(0, sigma_i)
sigma_i = 1 + alpha*sqrt(x_i^2)
Initially, say, the sample size is 20 (varying this part is easy).
So, using white.test from the bstats package, I wrote my code as:
for (j in 1:n) {
  for (i in 1:20) {
    x[i] = rnorm(1, 25, 15)
    sigma[i] = 1 + 0*sqrt(x[i]^2)
    epsilon[i] = rnorm(1, 0, sigma[i])
    y[i] = 1.5*x[i] + epsilon[i]
  }
  lm1[j] <- lm(y ~ x); white.test(lm1[j])
  if (white.test(lm1)$p.value > 0.05) {
    ind = ind
  } else {
    if (white.test(lm1)$p.value < 0.05) { ind = ind + 1 }
  }
  IND1 = ind/1000
}
Now, I know there is a problem here: the inner i loop works, but I can't get the outer for loop to work. Does anyone have any suggestions?
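For reference, one way to restructure that double loop, assuming the white.test(x, y) from tseries that the answer below uses (alpha = 0 here reproduces the sigma in the question's code), is to pre-allocate a vector of rejection indicators instead of trying to index lm objects with lm1[j]:

library(tseries)   # provides white.test

n <- 1000          # replications
size <- 20         # sample size per replication
alpha <- 0         # heteroskedasticity strength; the question's code uses 0
reject <- logical(n)

for (j in 1:n) {
  x <- rnorm(size, 25, 15)
  sigma <- 1 + alpha*sqrt(x^2)          # i.e. 1 + alpha*|x|
  y <- 1.5*x + rnorm(size, 0, sigma)    # rnorm recycles the vector of sds
  reject[j] <- white.test(x, y)$p.value < 0.05
}

mean(reject)  # empirical rejection rate across replications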

require(tseries)
# If you don't have the above package, install it by running:
# install.packages('tseries')
########### function that simulates m replications under heteroskedasticity
test_gen <- function(n, m) {
  eps <- numeric(n)
  saved <- numeric(m)
  for (i in 1:m) {                        # simulate m times
    X <- rnorm(n, 25, 1)
    for (j in 1:length(X)) {
      eps[j] <- rnorm(1, 0, sqrt(X[j]^2)) # sd = |X[j]|
    }
    Y <- 1.5*X + eps
    temp <- white.test(X, Y)
    saved[i] <- temp$p.value
  }
  return(saved)
}
# red dots mark p-values less than 0.05
# n controls the sample size
# m controls the number of simulations
n <- 20
m <- 100
out <- test_gen(n, m)
plot(1:length(out), out, main = "p-value (White test) for each simulated dataset")
ind <- out < 0.05
points((1:length(out))[ind], out[ind], col = 'red', pch = 16)
out2 <- numeric(81)
# simulate 100 times for each sample size from 20 to 100
for (i in 20:100) {
  k <- i - 19
  temp <- test_gen(i, 100)
  out2[k] <- sum(temp < 0.05)/100
}
plot(20:100, out2, main = "error rate", xlab = "sample size") # error rate
I feel like I'm doing your homework... anyway, good luck!


R's code to obtain a histogram following a chi-square distribution from uniform random numbers

I had some code in my textbook (written in Japanese) to generate a chi-square distribution with 3 degrees of freedom from a uniform distribution. I built on this and wrote code to get a histogram that follows a chi-square distribution with 4 degrees of freedom. It agrees well with R's distribution function, so I think it probably works correctly (see Box1 below).
I then tried to generalize Box1's code to obtain a histogram following a chi-square distribution with a specified number of degrees of freedom, but it fails with many errors (see Box2).
My question:
Box2's code to generate a chi-square distribution from a uniform distribution does not work well. Please help me fix the errors in Box2's code.
The generalization of y<-ifelse(x<0.2,1,ifelse(x<0.4,2,ifelse(x<0.6,3,ifelse(x<0.8,4,5)))) is probably what went wrong in Box2.
Box1: Code for obtaining a histogram that follows a chi-square distribution with 4 degrees of freedom (probably works correctly)
ite <- 10000
sc <- numeric(ite) #★1
A <- c(20,20,20,20,20) #★2
for (i in 1:ite) {
  s <- runif(sum(A)*5) #★3
  y <- ifelse(s<0.2,1,ifelse(s<0.4,2,ifelse(s<0.6,3,ifelse(s<0.8,4,5)))) #★4
  z1 <- table(y)
  z2 <- A*5
  z3 <- (z1-z2)^2/z2
  sc[i] <- sum(z3)
}
hist(sc, ylim=c(0,0.35), breaks="Scott", freq=F)
curve(dchisq(x,4), add=T)
The code for Box1 is based on the following facts:
If 500 = sum(A)*5 uniform random numbers are divided into five rooms of equal size, the expected number entering each room is 100.
Here the 1st room, 2nd room, ..., and 5th room are defined by 0 ≤ x < 0.2, 0.2 ≤ x < 0.4, ..., and 0.8 ≤ x ≤ 1. We can see this from the output of table(y) in Box1' below. Of course, sum(table(y)) in Box1' always results in 500.
Box1': Logic for converting uniform random numbers (x) into room labels (y) in Box1's code
A<- c(20,20,20,20,20)
s<- runif(sum(A)*5) #★3
y<-ifelse(s<0.2,1,ifelse(s<0.4,2,ifelse(s<0.6,3,ifelse(s<0.8,4,5))))
table(y)
sum(table(y))
Box2: Code for obtaining a histogram following a chi-square distribution with n degrees of freedom (with many errors)
chiq_dist_n <- function(numb, itr) {
  A <- numeric(numb) #★2
  aa <- numeric(numb) #★4-1
  for (i in 1:numb) {
    A[i] = 20
  } #★2
  ntot = sum(A)
  for (i in 1:numb) {
    if (i == 1) {
      aa[i] = A[i]/ntot
    } else {
      aa[i] = aa[i-1] + (A[i]/ntot)
    }
  } #★4-2
  sc <- numeric(itr) #★1
  y <- numeric(ntot*numb) #★4-3
  for (i in 1:itr) {
    x <- runif(ntot*numb)
    for (k in 1:ntot*numb) {
      for (j in 1:numb) {
        if (x[k] < aa[numb-j+1]) {
          y[k] <- j
        } else {}
      }
    } #★3
    z1 <- table(y)
    z2 <- A*ntot
    z3 <- (z1-z2)^2/z2
    sum(z3)
    sc[i] <- sum(z3)
  }
  return(sc)
}
hist(chiq_dist(10,1000),ylim=c(0,0.35),breaks="Scott",freq=F)
The part of Box2's code that generates y is cut out into Box2' below.
If you look at table(y) in Box2', you can see that too many of the y[i] are zero.
I want the output of table(y) in Box2' to be roughly the same as the output of table(y) in Box1'.
Box2': Logic for converting uniform random numbers (x) into room labels (y) in Box2's code
A <- c(20,20,20,20,20)
ntot = sum(A)
numb = length(A)
aa <- numeric(numb)
for (i in 1:numb) {
  if (i == 1) {
    aa[i] = A[i]/ntot
  } else {
    aa[i] = aa[i-1] + (A[i]/ntot)
  }
} #★4-2
y <- numeric(ntot*numb)
x <- runif(ntot*numb)
for (k in 1:ntot*numb) {
  for (j in 1:numb) {
    if (x[k] < aa[numb-j+1]) {
      y[k] <- j
    } else {}
  }
} #★3
table(y)
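As an aside, the immediate reason table(y) in Box2' shows so many zeros is an operator-precedence bug in the loop header: 1:ntot*numb parses as (1:ntot)*numb, so with ntot = 100 and numb = 5 the loop visits only k = 5, 10, ..., 500 and never touches the other 400 entries of y, which stay at their initial 0. Parenthesizing the product shows the difference:

ntot <- 100; numb <- 5
head(1:ntot*numb)    # 5 10 15 20 25 30 -- only 100 indices, all multiples of 5
head(1:(ntot*numb))  # 1 2 3 4 5 6     -- all 500 indices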
You don't need an ifelse to bin a uniform random sample; you can just use cut() and specify the breaks, for example:
set.seed(111)
v = runif(10)
[1] 0.59298128 0.72648112 0.37042200 0.51492383 0.37766322 0.41833733
[7] 0.01065785 0.53229524 0.43216062 0.09368152
cut(v, breaks=seq(0,1,length.out=6), labels=1:5)  # here numb = 4, so numb+2 = 6 breaks
[1] 3 4 2 3 2 3 1 3 3 1
I am not so sure about A or what it does, but for simulating a chi-square statistic, I suppose you take a random sample of the labels 1:(df+1), where df is the degrees of freedom. If we fix the number of samples at 500, then the expected count for each bin is 500/(df+1).
So, without changing too much of your code:
chiq_dist_n <- function(numb, ite) {
  sc <- numeric(ite)
  for (i in 1:ite) {
    x <- runif(500) #★3
    y <- cut(x, breaks=seq(0,1,length.out=numb+2), labels=1:(numb+1))
    z1 <- table(y)
    z2 <- length(x)/(numb+1)
    z3 <- (z1-z2)^2/z2
    sc[i] <- sum(z3)
  }
  hist(sc, ylim=c(0,0.35), breaks="Scott", freq=F, main=paste0("df=",numb))
  curve(dchisq(x,numb), add=T)
}
And we try df = 2, 4, ..., 12:
par(mfrow=c(3,2))
par(mar=c(2.5,2.5,2.5,2.5))
for (i in seq(2,12,2)) {
  chiq_dist_n(i, 10000)
}

Function to Produce Repeating Spikes

I asked a similar question on Cross Validated but did not get a response. I went ahead anyway and built out a function, but am having a problem with replication.
The original question, posted here, was:
I am seeking a function (or short algorithm, ideally implemented in R) that produces something similar to the following:
I would like to be able to generate a vector of n items that follows this sort of pattern, mapped to a set of inputs (say, seq(1:n)). Ideally, I would be able to tell the algorithm to "spike" to a maximum height h every kth time period and decay at rate r. However, I would be sufficiently happy with simply being able to generate a spike pattern that occurs periodically.
I wrote some code in R, which is included here, that works fairly well...
## Neural Networks / Deep Learning ##
# first, must install Python from:
# https://www.anaconda.com/download/#windows
# https://www.python.org/downloads/
if (!require(keras)) devtools::install_github("rstudio/keras") ; library(keras)
# install_tensorflow()
spikes_model <- function(maxiter, total_spikes = 10, max_height = 0.001, min_height = 0.000005, decay_rate = 1) {
  value_at_iteration <- rep(0, maxiter)
  spike_at <- maxiter / total_spikes
  current_rate <- min_height
  holder_timeval <- 0
  for (i in 1:maxiter) {
    spike_indicator <- i / spike_at
    if (is.integer(spike_indicator)) {
      current_rate <- max_height
      value_at_iteration[i] <- current_rate
      holder_timeval <- spike_indicator
    } else if (i < spike_at) {
      current_rate <- min_height
      value_at_iteration[i] <- current_rate
    } else {
      timeval <- i - (holder_timeval*spike_at)
      current_rate <- max_height*exp(-decay_rate*timeval) + min_height
      value_at_iteration[i] <- current_rate
    }
  }
  return(value_at_iteration)
}
asdf <- spikes_model(maxiter = 100)
plot(asdf, type="l")
... which results in the following plot:
This is exactly what I want, except there is only one spike. I know there is a code or logic error somewhere, but I cannot find where I am going wrong. Please help me replicate this spike procedure across time.
The code this scheduler is used in:
eps <- 1000
sch <- spikes_model(eps)
lr_schedule <- function(epoch, lr) {
  lrn <- sch[as.integer(epoch)]
  lrn <- k_cast_to_floatx(lrn)
  return(lrn)
}
## Add callback to automatically adjust learning rate downward when training reaches a plateau ##
reduce_lr <- callback_learning_rate_scheduler(lr_schedule)
## Fit model using training data, validate with validation data ##
mod1.hst <- mod1 %>% fit(
  x = X.train, y = Y.train,
  epochs = eps, batch_size = nrow(X.train),
  validation_data = list(X.val, Y.val),
  shuffle = TRUE, callbacks = list(checkpoint, reduce_lr)
)
Wow, I just figured out my own error. I was using the is.integer() function, which does not work how I wanted. I needed to use the is.whole.number() function from mosaic.
Fixing that single error, I find the following chart, which is exactly what I wanted.
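If you'd rather not pull in mosaic just for that, the usual tolerance-based check from the R FAQ does the same job (the helper name here is my own):

# is.integer() tests the storage type, not whether a value is mathematically
# whole, so i / spike_at (a double) never passes it. Compare against the
# rounded value with a tolerance instead.
is_whole <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol

is.integer(10 / 5)  # FALSE: the result of division is a double
is_whole(10 / 5)    # TRUE

Swapping is_whole(spike_indicator) in for is.integer(spike_indicator) inside spikes_model produces the repeating spikes.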

Exponential distribution in R

I want to simulate some data from an Exp(1) distribution, but the values have to be > 0.5, so I used a while loop. However, it does not seem to work the way I would like. Thanks in advance for your responses!
x1 <- c()
w <- rexp(1)
while (length(x1) < 100) {
  if (w > 0.5) {
    x1 <- w
  } else {
    w <- rexp(1)
  }
}
1) The code in the question has these problems:
- we need a new random variable on each iteration, but it only generates one when the if condition is FALSE
- x1 is repeatedly overwritten rather than extended
- although while could be used, repeat seems better here, since having the test at the end of the loop fits this problem better than having it at the beginning
We can fix this up like this:
x1 <- c()
repeat {
  w <- rexp(1)
  if (w > 0.5) {
    x1 <- c(x1, w)
    if (length(x1) == 100) break
  }
}
1a) A variation would be the following. Note that an if whose condition is FALSE evaluates to NULL if there is no else leg so if the condition is FALSE on the line marked ## then nothing is concatenated to x1.
x1 <- c()
repeat {
  w <- rexp(1)
  x1 <- c(x1, if (w > 0.5) w) ##
  if (length(x1) == 100) break
}
2) Alternatively, this generates 200 exponential random variables, keeping only those greater than 0.5; if fewer than 100 remain, it repeats. At the end it takes the first 100 from the last batch generated. We chose 200 to be large enough that on most runs only one iteration of the loop is needed.
repeat {
  r <- rexp(200)
  r <- r[r > 0.5]
  if (length(r) >= 100) break
}
r <- head(r, 100)
Alternative (2) is actually faster than (1) or (1a) because it is more highly vectorized, despite throwing away more exponential random variables than the other solutions.
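If you want to check that speed claim yourself, a rough timing run (wrapping solutions (1) and (2) in throwaway functions) is straightforward:

f_loop <- function() {          # solution (1): one draw at a time
  x1 <- c()
  repeat {
    w <- rexp(1)
    if (w > 0.5) {
      x1 <- c(x1, w)
      if (length(x1) == 100) break
    }
  }
  x1
}

f_vec <- function() {           # solution (2): batch of 200, filter, retry
  repeat {
    r <- rexp(200)
    r <- r[r > 0.5]
    if (length(r) >= 100) break
  }
  head(r, 100)
}

system.time(for (i in 1:1000) f_loop())
system.time(for (i in 1:1000) f_vec())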
I would advise against a while (or any other accept/reject) loop; instead use the methods from truncdist:
# Sample 1000 observations from a truncated exponential
library(truncdist)
x <- rtrunc(1000, spec = "exp", a = 0.5)
# Plot
library(ggplot2)
ggplot(data.frame(x = x), aes(x)) + geom_histogram(bins = 50) + xlim(0, 10)
It's also fairly straightforward to implement a sampler that uses inverse transform sampling to draw from a truncated exponential distribution, avoiding the rejection loop entirely. This is more efficient than any accept/reject-based method and works particularly well in your case, since the truncated exponential CDF has a closed form.
See for example this post for more details.
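For this particular distribution the inverse-transform sampler is a one-liner. A minimal sketch (the function name is mine): for X ~ Exp(rate) truncated to x > a, the CDF is F(x) = 1 - exp(-rate*(x - a)), and solving u = F(x) gives x = a - log(1 - u)/rate.

# Draw n values from Exp(rate) conditioned on exceeding a, with no
# rejection step: invert the truncated CDF at uniform draws.
rtexp <- function(n, rate = 1, a = 0.5) {
  a - log(1 - runif(n)) / rate
}

x <- rtexp(1000)
min(x)  # always > 0.5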

Simulating thousands of regressions and obtaining p-values

I'm looking to do some basic simulation in R to examine the nature of p-values. My goal is to see whether large sample sizes trend towards small p-values. My thought is to generate random vectors of 1,000,000 data points, regress them on each other, and then plot the distribution of p-values and look for skew.
This is what I was thinking so far:
x1 = runif(1000000, 0, 1000)
x2 = runif(1000000, 0, 1000)
model1 = lm(x2~x1)
Using code taken from another thread:
lmp <- function(modelobject) {
  if (class(modelobject) != "lm") stop("Not an object of class 'lm'")
  f <- summary(modelobject)$fstatistic
  p <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  attributes(p) <- NULL
  return(p)
}
lmp(model1)
0.3874139
Any suggestions for how I might do this for 1000 models or even more? Thanks!
See ?replicate ... but note that the p-value you're calculating assumes Gaussian errors, not uniform ones (not that this will matter much at n = 10^6).
Specifically, something like this:
nrep <- 1000
ndat <- 1000000
results <- replicate(nrep, {
  x1 <- runif(ndat, 0, 1000)
  x2 <- runif(ndat, 0, 1000)
  model1 <- lm(x1 ~ x2)
  lmp(model1)
})
This should work, but it will take a long time to run; I'd suggest making nrep and ndat smaller to try it out.
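Once a (smaller) run finishes, the skew check from the question is a one-liner. Since the null hypothesis actually holds here, the p-values should look roughly Uniform(0, 1) at any sample size:

hist(results, breaks = 20, main = "p-values under the null")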

Generating covariates from Bernoulli Distribution with a particular correlation structure in R

I am trying to reproduce a simulation result from an article using R, and I am stuck at the very beginning. I am supposed to generate some 10 covariates, all following a Bernoulli distribution, correlated through the regression structure P(X_j = 1 | X_{j-1}) = 0.1 + 0.1(X_{j-1} - 0.15) for j = 2, ..., 10, with X_1 ~ Binomial(1, 0.5), for n = 100 individuals. I thought it was supposed to be easy, but I think I am missing something and have no idea how to proceed. Any suggestion or idea (even just clarifying what they are trying to say) would be really helpful!
Recursive definitions like these are messy. You can use a loop:
draw <- function() {
  N <- 10
  x <- numeric(N)
  x[1] <- runif(1) < .5
  for (i in 2:N) {
    x[i] <- runif(1) < 0.1 + 0.1*(x[i-1] - 0.15)
  }
  x
}
draw()
or a Reduce() call:
Reduce(function(xjm1, x) {
  as.numeric(runif(1) < .1 + .1*(xjm1 - 0.15))
}, rep(0, 9), init = runif(1) > .5, accumulate = TRUE)
If you need to generate a bunch of these vectors and calculate the correlation, you could do
xx <- replicate(100, draw())
cor(t(xx))
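As a quick sanity check on the recursion (using draw() from above), the empirical conditional probabilities should land near 0.1 + 0.1*(0 - 0.15) = 0.085 and 0.1 + 0.1*(1 - 0.15) = 0.185:

xx <- replicate(10000, draw())   # 10 x 10000 matrix of 0/1 draws
prev <- as.vector(xx[1:9, ])     # X_{j-1}
curr <- as.vector(xx[2:10, ])    # X_j
tapply(curr, prev, mean)         # ~0.085 when prev = 0, ~0.185 when prev = 1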
