R code to obtain a histogram following a chi-square distribution from uniform random numbers

My textbook (written in Japanese) contains code to generate a chi-square distribution with 3 degrees of freedom from a uniform distribution. I adapted it into code that produces a histogram following a chi-square distribution with 4 degrees of freedom. The histogram agrees well with R's chi-square density, so I think it probably works correctly (see Box 1 below).
I then tried to generalize Box 1's code to obtain a histogram following a chi-square distribution with any specified degrees of freedom, but it fails with many errors (see Box 2).
My question:
The code in Box 2, which should generate a chi-square distribution from a uniform distribution, does not work. Please help me fix the errors in the Box 2 code. The problem is probably in my generalization of y<-ifelse(x<0.2,1,ifelse(x<0.4,2,ifelse(x<0.6,3,ifelse(x<0.8,4,5)))) in Box 2.
Box 1: Code for obtaining a histogram that follows a chi-square distribution with 4 degrees of freedom (probably works correctly)
ite <- 10000
sc <- numeric(ite)             #★1
A <- c(20, 20, 20, 20, 20)     #★2
for (i in 1:ite) {
  s <- runif(sum(A) * 5)       #★3
  y <- ifelse(s < 0.2, 1, ifelse(s < 0.4, 2, ifelse(s < 0.6, 3, ifelse(s < 0.8, 4, 5)))) #★4
  z1 <- table(y)               # observed counts per bin
  z2 <- A * 5                  # expected counts (100 per bin)
  z3 <- (z1 - z2)^2 / z2
  sc[i] <- sum(z3)             # chi-square statistic for this replicate
}
hist(sc, ylim = c(0, 0.35), breaks = "Scott", freq = F)
curve(dchisq(x, 4), add = T)
The code in Box 1 is based on the following facts: if 500 = sum(A)*5 uniform random numbers are divided into five bins of equal width, the expected count in each bin is 100. Here the 1st, 2nd, ..., and 5th bins are defined by 0 ≤ x < 0.2, 0.2 ≤ x < 0.4, ..., and 0.8 ≤ x ≤ 1. You can see this from the output of table(y) in Box 1' below; of course, sum(table(y)) in Box 1' always equals 500.
Box 1': Logic for turning uniform random numbers (s) into bin labels (y) in Box 1's code
A <- c(20, 20, 20, 20, 20)
s <- runif(sum(A) * 5)         #★3
y <- ifelse(s < 0.2, 1, ifelse(s < 0.4, 2, ifelse(s < 0.6, 3, ifelse(s < 0.8, 4, 5))))
table(y)
sum(table(y))
Box 2: Code for obtaining a histogram following a chi-square distribution with n degrees of freedom (with many errors)
chiq_dist_n <- function(numb, itr) {
  A <- numeric(numb)               #★2
  aa <- numeric(numb)              #★4-1
  for (i in 1:numb) {
    A[i] = 20
  }                                #★2
  ntot = sum(A)
  for (i in 1:numb) {
    if (i == 1) {
      aa[i] = A[i] / ntot
    } else {
      aa[i] = aa[i - 1] + (A[i] / ntot)
    }
  }                                #★4-2
  sc <- numeric(itr)               #★1
  y <- numeric(ntot * numb)        #★4-3
  for (i in 1:itr) {
    x <- runif(ntot * numb)
    for (k in 1:ntot*numb) {
      for (j in 1:numb) {
        if (x[k] < aa[numb - j + 1]) {
          y[k] <- j
        } else {}
      }
    }                              #★3
    z1 <- table(y)
    z2 <- A * ntot
    z3 <- (z1 - z2)^2 / z2
    sc[i] <- sum(z3)
  }
  return(sc)
}
hist(chiq_dist(10, 1000), ylim = c(0, 0.35), breaks = "Scott", freq = F)
The part of the Box 2 code that generates y is excerpted in Box 2'. If you look at table(y) from Box 2', you can see that far too many y[i] are zero. I want the output of table(y) in Box 2' to be roughly the same as the output of table(y) in Box 1'.
Box 2': Logic for turning uniform random numbers (x) into bin labels (y) in Box 2's code
A <- c(20, 20, 20, 20, 20)
ntot = sum(A)
numb = length(A)
aa <- numeric(numb)
for (i in 1:numb) {
  if (i == 1) {
    aa[i] = A[i] / ntot
  } else {
    aa[i] = aa[i - 1] + (A[i] / ntot)
  }
}                                  #★4-2
y <- numeric(ntot * numb)
x <- runif(ntot * numb)
for (k in 1:ntot*numb) {
  for (j in 1:numb) {
    if (x[k] < aa[numb - j + 1]) {
      y[k] <- j
    } else {}
  }
}                                  #★3
table(y)

You don't need ifelse to bin a uniform random sample; you can just use cut() and specify the breaks, for example:
set.seed(111)
numb = 4                          # numb + 1 = 5 bins, matching labels 1:5 below
v = runif(10)
v
# [1] 0.59298128 0.72648112 0.37042200 0.51492383 0.37766322 0.41833733
# [7] 0.01065785 0.53229524 0.43216062 0.09368152
cut(v, breaks = seq(0, 1, length.out = numb + 2), labels = 1:5)
# [1] 3 4 2 3 2 3 1 3 3 1
I am not so sure about A or what it does, but to simulate a chi-square statistic, I suppose you take a random sample of the labels 1:(df+1), where df is the degrees of freedom. If we fix the number of samples at 500, then the expected count for each bin is 500/(df+1). So, without changing too much of your code:
chiq_dist_n <- function(numb, ite) {
  sc <- numeric(ite)
  for (i in 1:ite) {
    x <- runif(500)                #★3
    y <- cut(x, breaks = seq(0, 1, length.out = numb + 2), labels = 1:(numb + 1))
    z1 <- table(y)                 # observed counts
    z2 <- length(x) / (numb + 1)   # expected count per bin
    z3 <- (z1 - z2)^2 / z2
    sc[i] <- sum(z3)
  }
  hist(sc, ylim = c(0, 0.35), breaks = "Scott", freq = F, main = paste0("df=", numb))
  curve(dchisq(x, numb), add = T)
}
And we try df = 2, 4, ..., 12:
par(mfrow = c(3, 2))
par(mar = c(2.5, 2.5, 2.5, 2.5))
for (i in seq(2, 12, 2)) {
  chiq_dist_n(i, 10000)
}
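Incidentally, the immediate failure in the question's Box 2' looks like an operator-precedence bug rather than a logic problem: in for(k in 1:ntot*numb), the expression 1:ntot*numb parses as (1:ntot)*numb, so k only visits every numb-th index and the remaining y[k] stay zero, exactly the symptom described. A minimal sketch of Box 2' with only that loop range repaired (the bin labels come out reversed, which is harmless for the chi-square statistic):
A <- c(20, 20, 20, 20, 20)
ntot <- sum(A)
numb <- length(A)
aa <- cumsum(A) / ntot            # same cumulative boundaries 0.2, 0.4, ..., 1.0 as the question's loop
x <- runif(ntot * numb)
y <- numeric(ntot * numb)
for (k in seq_len(ntot * numb)) { # was 1:ntot*numb, i.e. (1:ntot)*numb
  for (j in 1:numb) {
    if (x[k] < aa[numb - j + 1]) y[k] <- j
  }
}
table(y)                          # now roughly 100 per label, with no zeros
Two further fixes would then be needed in the full function: the expected count per bin is ntot (i.e. A*numb), not A*ntot, and the final call should be chiq_dist_n(10, 1000) to match the function's name.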

Related

Combining probability density functions in R

Suppose I have some observations of variables and a model. The model result is not directly observable, but from physical assumptions I know roughly what to expect. Here is a simplified example; the real model is quite complex:
its <- 1000                                # number of iterations
obs1 <- rnorm(n = its, mean = 20, sd = 1)  # an observation of one parameter
obs2 <- runif(n = its, min = 3, max = 12)  # an observation of a second parameter
mod <- function(obs1, obs2) {
  res <- (obs1 + obs2)^2
  return(res)
}                                          # a model; the result cannot be directly observed
result <- mod(obs1 = obs1, obs2 = obs2)
## but I know from physical principles something about the result: it must follow a specific distribution, here for example a normal one:
res.info <- density(rnorm(1e4, mean = 600, sd = 100))
## and I also know the ratio obs1/obs2 can't be greater than 4
res.info2 <- density(runif(n = 1e4, min = 0, max = 4))
layout(mat = matrix(1:4, nrow = 2))
par(mar = c(3, 5, 1, 1))
hist(result, xlim = c(200, 1400))
plot(res.info, xlim = c(200, 1400))
hist(obs1/obs2, xlim = c(0, 8))
plot(res.info2, xlim = c(0, 8))
My question: how do I obtain a probability density function of the result, given what I know about obs1, obs2, and the result? Is this a situation in which Bayes' theorem can be applied? How would I do this programmatically?

Implement a Monte Carlo Simulation Method to Estimate an Integral in R

I am trying to implement a Monte Carlo simulation method to estimate an integral in R. However, I still get the wrong answer. My code is as follows:
f <- function(x) {
  ((cos(x)) / x) * exp(log(x) - 3)^3
}
t <- integrate(f, 0, 1)
n <- 10000  # assume we conduct 10000 simulations
int_gral <- Monte_Car(n)
int_gral
You are not doing Monte Carlo here. Monte Carlo is a simulation method that approximates integrals using sums/means of random variables. You should do something in this flavor; the mean of the f output approximates your integral because, for U ~ Uniform(0, 1), E[f(U)] equals the integral of f over (0, 1):
f <- function(n) {
  x <- runif(n)
  return(((cos(x)) / x) * exp(log(x) - 3)^3)
}
int_gral <- mean(f(10000))
What your original code does is take a single number n and return ((cos(n))/n)*exp(log(n)-3)^3; there is no randomness in that.
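As a quick sanity check (a sketch reusing the question's own f, which is vectorized), you can compare the Monte Carlo mean against integrate():
f <- function(x) ((cos(x)) / x) * exp(log(x) - 3)^3
set.seed(1)
mean(f(runif(10000)))     # Monte Carlo estimate of the integral over (0, 1)
integrate(f, 0, 1)$value  # deterministic quadrature, for comparison
Both should land near the 2.95e-05 value reported below.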
Update
Now, to get a more precise estimate, you need to replicate this step K times. Rather than using a loop, you can use the replicate function:
K <- 100
dist <- data.frame(
int = replicate(K, mean(f(10000)))
)
You get a distribution of estimates for your integral:
library(ggplot2)
ggplot(dist) + geom_histogram(aes(x = int, y = ..density..))
and you can use mean to get a single numerical value:
mean(dist$int)
# [1] 2.95036e-05
You can evaluate the precision of your estimates with
sd(dist$int)
# [1] 2.296033e-07
Here it is small because each batch already uses a large n (10000), giving good precision at the first step.
I have managed to change the code as follows. Kindly confirm that I am doing the right thing.
Regards.
f <- function(x) {
  ((cos(x)) / x) * exp(log(x) - 3)^3
}
set.seed(234)
n <- 10000
for (i in 1:10000) {
  x <- runif(n)
  I <- sum(f(x)) / n
}
I
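One caution about the snippet above: the loop overwrites I on every pass, so after 10000 iterations only the last batch of draws contributes to the answer. If the intent was to pool many batches, a minimal sketch (reusing the f defined above) in the spirit of the replicate answer:
set.seed(234)
estimates <- replicate(100, mean(f(runif(10000))))  # one estimate per batch
mean(estimates)  # pooled estimate of the integral
sd(estimates)    # Monte Carlo spread across batches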

Generating Random Variables with given correlations between pairs of them

I want to generate 2 continuous random variables Q1, Q2 (quantitative traits, each normal) and 2 binary random variables Z1, Z2 (binary traits), with given pairwise correlations between all possible pairs of them.
Say
(Q1,Q2):0.23
(Q1,Z1):0.55
(Q1,Z2):0.45
(Q2,Z1):0.4
(Q2,Z2):0.5
(Z1,Z2):0.47
Please help me generate such data in R.
This is crude but might get you started in the right direction.
library(copula)
options(digits = 3)
probs <- c(0.5, 0.5)
corrs <- c(0.23, 0.55, 0.45, 0.4, 0.5, 0.47)  ## lower triangle
Simulate correlated values (first two quantitative, last two transformed to binary):
sim <- function(n, probs, corrs) {
  tmp <- normalCopula(corrs, dim = 4, "un")
  getSigma(tmp)  ## test
  x <- rCopula(n, tmp)  # was rCopula(1000, tmp), which ignored the n argument
  x2 <- x
  x2[, 3:4] <- qbinom(x[, 3:4], size = 1, prob = rep(probs, each = nrow(x)))
  x2
}
Test SSQ distance between observed and target correlations:
objfun <- function(corrs, targetcorrs, probs, n = 1000) {
  cc <- try(cor(sim(n, probs, corrs)), silent = TRUE)
  if (is(cc, "try-error")) return(NA)
  sum((cc[lower.tri(cc)] - targetcorrs)^2)
}
See how bad things are when input corrs=target:
cc0 <- cor(sim(1000,probs=probs,corrs=corrs))
cc0[lower.tri(cc0)]
corrs
objfun(corrs,corrs,probs=probs) ## 0.112
Now try to optimize.
opt1 <- optim(fn = objfun, par = corrs,
              targetcorrs = corrs, probs = c(0.5, 0.5))
opt1$value ## 0.0208
Stops after 501 iterations with "max iterations exceeded". This will never work really well because we're trying to use a deterministic hill-climbing algorithm on a stochastic objective function ...
cc1 <- cor(sim(1000,probs=c(0.5,0.5),corrs=opt1$par))
cc1[lower.tri(cc1)]
corrs
Maybe try simulated annealing?
opt2 <- optim(fn = objfun, par = corrs,
              targetcorrs = corrs, probs = c(0.5, 0.5),
              method = "SANN")
It doesn't seem to do much better than the previous value. Two possible problems (left as an exercise for the reader) are: (1) we have specified a set of correlations that is not feasible with the marginal distributions we have chosen, or (2) the noise in the objective function surface is getting in the way; to do better we would have to average over more replicates (i.e. increase n).
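As a sketch of option (2), assuming the sim() and objfun() defined above are in scope, you can smooth the stochastic objective by averaging it over several replicates before handing it to optim (the helper name objfun_avg is made up here):
## average the noisy objective over R replicates to smooth the surface
objfun_avg <- function(corrs, targetcorrs, probs, n = 1000, R = 20) {
  mean(replicate(R, objfun(corrs, targetcorrs, probs, n = n)), na.rm = TRUE)
}
opt3 <- optim(fn = objfun_avg, par = corrs,
              targetcorrs = corrs, probs = c(0.5, 0.5))
This trades R-fold extra simulation per evaluation for a less jagged surface; whether that beats simply increasing n is an empirical question.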

Simulations in R for Heteroskedasticity

I want to simulate heteroskedasticity and see how White's test performs for various sample sizes (this is similar to Greene, but testing different kinds).
Now, the model I wish to test is:
replications: n = 1000
y = 1.5x + e
x ~ N(25, 15)
e ~ N(0, sigma_i)
sigma_i = 1 + alpha*sqrt(x[i]^2)
And say the initial sample size is 20 (varying this part is easy).
So, using the white.test command from the bstats package, I wrote my code as:
for (j in 1:n) {
  for (i in 1:20) {
    x[i] = rnorm(1, 25, 15)
    sigma[i] = 1 + 0*sqrt(x[i]^2)
    epsilon[i] = rnorm(1, 0, sigma[i])
    y[i] = 1.5*x[i] + epsilon[i]
  }
  lm1[j] <- lm(y ~ x); white.test(lm1[j])
  if (white.test(lm1)$p.value > 0.05) {
    ind = ind
  } else {
    if (white.test(lm1)$p.value < 0.05) { ind = ind + 1 }
  }
  IND1 = ind/1000
}
Now, I know this is a problem because the inner loop over i works, but I can't get the outer for loop to work. Does anyone have any suggestions?
require(tseries)
# if you don't have the above package, install it with:
# install.packages('tseries')

## function that simulates m data sets and applies the White test to each
test_gen = function(n, m) {
  eps = 0        # was "esp = 0", a typo that left eps undefined
  saved = 0
  for (i in 1:m) {                  # simulate m times
    X = rnorm(n, 25, 1)
    for (j in 1:length(X)) {
      eps[j] = rnorm(1, 0, sqrt((X[j])^2)[1])
    }
    Y = 1.5*X + eps
    temp = white.test(X, Y)
    saved[i] = temp$p.value
  }
  return(saved)
}
# red dots are the p-values less than 0.05
# n controls the sample size
# m controls the number of simulations
n = 20
m = 100
out = test_gen(n, m)
plot(c(1:length(out)), out, main = "p-value (White test) for each simulated data set")
ind = out < 0.05
points(c(1:length(out))[ind], out[ind], col = 'red', pch = 16)
out2 = 0
# simulate 100 times for each sample size
for (i in 20:100) {
  k = i - 19
  temp = test_gen(i, 100)
  out2[k] = sum(temp < 0.05)/100  # was temp < 0.5, which does not match the 0.05 cutoff above
}
plot(20:100, out2, main = "error rate", xlab = "sample size")  # error rate
I feel like I'm doing your homework...anyway good luck
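As an aside on the original question (not part of the answer above): the specific reason the asker's outer loop fails is that lm1[j] <- lm(y ~ x) tries to store a model object in an atomic vector; fitted models have to go in a list via [[ ]]. A sketch, assuming white.test accepts an lm object as in the question's bstats usage:
# assumes library(bstats), as in the question
n <- 1000
pvals <- numeric(n)
lm1 <- vector("list", n)  # lists can hold model objects; atomic vectors cannot
for (j in 1:n) {
  x <- rnorm(20, 25, 15)
  epsilon <- rnorm(20, 0, 1 + 0*sqrt(x^2))
  y <- 1.5*x + epsilon
  lm1[[j]] <- lm(y ~ x)   # note [[ ]] rather than [ ]
  pvals[j] <- white.test(lm1[[j]])$p.value
}
IND1 <- mean(pvals < 0.05)  # rejection rate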

How do I best simulate an arbitrary univariate random variate using its probability function?

In R, what's the best way to simulate an arbitrary univariate random variate if only its probability density function is available?
Here is a (slow) implementation of the inverse cdf method when you are only given a density.
den <- dnorm  # replace with your own density

# calculates the cdf by numerical integration
cdf <- function(x) integrate(den, -Inf, x)[[1]]

# inverts the cdf by bracketing the root and calling uniroot
inverse.cdf <- function(x, cdf, starting.value = 0) {
  lower.found <- FALSE
  lower <- starting.value
  while (!lower.found) {
    if (cdf(lower) >= (x - .000001))
      lower <- lower - (lower - starting.value)^2 - 1
    else
      lower.found <- TRUE
  }
  upper.found <- FALSE
  upper <- starting.value
  while (!upper.found) {
    if (cdf(upper) <= (x + .000001))
      upper <- upper + (upper - starting.value)^2 + 1
    else
      upper.found <- TRUE
  }
  uniroot(function(y) cdf(y) - x, c(lower, upper))$root
}

# generates 1000 random variables with density 'den'
vars <- apply(matrix(runif(1000)), 1, function(x) inverse.cdf(x, cdf))
hist(vars)
To clarify the "use Metropolis-Hastings" answer above: suppose ddist() is your probability density function; then something like:
n <- 10000
cand.sd <- 0.1
init <- 0
vals <- numeric(n)
vals[1] <- init
oldprob <- 0
for (i in 2:n) {
  newval <- rnorm(1, mean = vals[i-1], sd = cand.sd)
  newprob <- ddist(newval)
  if (runif(1) < newprob/oldprob) {
    vals[i] <- newval
  } else vals[i] <- vals[i-1]
  oldprob <- newprob
}
Notes:
- completely untested
- efficiency depends on the candidate distribution (i.e. the value of cand.sd); for maximum efficiency, tune cand.sd to an acceptance rate of 25-40%
- results will be autocorrelated ... (although I guess you could always sample() the results to scramble them, or thin)
- may need to discard a "burn-in" if your starting value is weird
The classical approach to this problem is rejection sampling (see e.g. Press et al., Numerical Recipes).
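Not from Numerical Recipes itself, but a minimal rejection-sampling sketch under two assumptions: the density is effectively supported on a known interval [a, b], and it is bounded above by a known constant M:
# rejection sampling: propose uniformly on [a, b], accept with probability den(x)/M
rejection_sample <- function(n, den, a = -5, b = 5, M = 0.5) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- runif(n, a, b)                      # a batch of proposals
    out <- c(out, x[runif(n) < den(x) / M])  # keep the accepted ones
  }
  out[1:n]
}
vars <- rejection_sample(10000, dnorm)  # dnorm peaks at about 0.399 < M; mass outside [-5, 5] is negligible
hist(vars, breaks = 50, freq = FALSE)
curve(dnorm(x), add = TRUE)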
Use the cumulative distribution function (http://en.wikipedia.org/wiki/Cumulative_distribution_function), then just use its inverse (see http://en.wikipedia.org/wiki/Normal_distribution for a picture). That means: pick a random number from [0, 1], treat it as a value of the CDF, and look up the corresponding quantile. The inverse CDF is also called the quantile function.
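In R terms, a one-line illustration of the idea, using qnorm as an example of a known quantile function:
u <- runif(10000)  # uniform draws, interpreted as CDF values
x <- qnorm(u)      # the quantile function maps them to N(0, 1) draws
hist(x, breaks = 50, freq = FALSE)
curve(dnorm(x), add = TRUE)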
This is a comment, but I don't have enough reputation to drop a comment on Ben Bolker's answer. I am new to Metropolis, but IMHO this code is wrong because:
a) the newval is drawn from a normal distribution, whereas in other codes it is drawn from a uniform distribution; the value must be drawn from the range covered by the random variable. For example, for a Gaussian distribution this should be something like runif(1, -5, +5).
b) the prob value must be updated only on acceptance.
Hope this helps, and I hope that someone with reputation can correct this answer (especially mine, if I am wrong).
# the distribution
ddist <- dnorm
# number of random numbers
n <- 100000
# the center of the range is taken as init
init <- 0
# the following should go into a function
vals <- numeric(n)
vals[1] <- init
oldprob <- 0
for (i in 2:n) {
  newval <- runif(1, -5, +5)
  newprob <- ddist(newval)
  if (runif(1) < newprob/oldprob) {
    vals[i] <- newval
    oldprob <- newprob
  } else vals[i] <- vals[i-1]
}
# final view
hist(vals, breaks = 100)
# and comparison
hist(rnorm(length(vals)), breaks = 100)
