Generating correlated ordinal data - r

I'm using the package GenOrd to generate correlated ordinal data. The basic idea is to generate correlated ordinal data with correlation 0.5. Now I want to repeat the whole code 1000 times and save the resulting correlations, to see how close I get to the target of 0.5, then change the sample size and the marginal probabilities and see what changes.
library(GenOrd)
R<-matrix(c(1,0.5,0.5,1),2,2)
Marginal<-list(c(0.2,0.5,0.7,0.9),c(0.1,0.3,0.4,0.5))
DataOrd<-ordsample(100,Marginal,R)
correlation<-cor(DataOrd)
correlation[1,2] # 0.5269

Here is a simple solution:
sim.cor <- function(R, Marginal, n, K)
{
  # repeat the ordinal sampling K times and keep the sample correlation each time
  res <- numeric(length = K)
  for (i in 1:K)
    res[i] <- cor(ordsample(n, Marginal, R))[1, 2]
  res
}
where n is the sample size and K is the number of times you want to repeat. So, in your example, you can call this function and save the result (a vector of size K with the correlations) in an object:
set.seed(1234)
correlations <- sim.cor(R = R, Marginal = Marginal, n = 100, K = 1000)
mean(correlations)
[1] 0.5009389
A faster and more elegant solution is to use the replicate function as suggested by jaysunice3401:
set.seed(1234)
n <- 100
correlations <- replicate(n = 1000, expr = cor(ordsample(n, Marginal, R))[1,2])
mean(correlations)
[1] 0.5009389
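To explore the effect of the sample size (and, in the same way, of the marginal probabilities), you can wrap the same idea in a loop over candidate values of n. A minimal sketch reusing sim.cor from above (the vector of sample sizes is just an illustrative choice):
set.seed(1234)
sizes <- c(50, 100, 500, 1000)   # illustrative sample sizes
avg.cor <- sapply(sizes, function(n) mean(sim.cor(R = R, Marginal = Marginal, n = n, K = 1000)))
data.frame(n = sizes, mean.correlation = avg.cor)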
I hope this can help!

Related

R generating binomial Random variables from exponential random variables

I have 100000 exponential random variables generated with rexp, and I am asked to generate 100000 binomial random variables from them using built-in R functions.
I really don't know how I can generate one random variable from another. I searched some resources on the internet, but they were mostly about generating Poisson from exponential, which is closely related because the exponential distribution can be interpreted as the time intervals of a Poisson process. Generating Poisson counts can easily be achieved by applying cumsum to the exponentials and using the cut function to make bins containing the number of occurrences in each time interval.
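For instance, a quick sketch of that Poisson-from-exponential idea (the unit-length time intervals are just an illustrative choice):
set.seed(1)
arrivals <- cumsum(rexp(100000))                                           # arrival times of a rate-1 Poisson process
counts <- as.vector(table(cut(arrivals, breaks = 0:floor(max(arrivals))))) # events per unit interval
mean(counts); var(counts)                                                  # both should be close to 1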
But I don't know how it is possible to generate binomial from exponential.
The function rbin below generates binomial random variables from exponential random variables. The reason why it works might be a question for CrossValidated rather than StackOverflow, which is about code.
rbin <- function(n, size, p){
  onebin <- function(i, size, thres){
    # increase I until a sum of I + 1 scaled exponential draws exceeds the threshold
    I <- 0L
    repeat{
      S <- sum(rexp(I + 1)/(size + 1 - seq_len(I + 1)))
      if(S > thres) break
      I <- I + 1L
    }
    I
  }
  thres <- -log(1 - p)   # threshold determined by p
  sapply(seq_len(n), onebin, size, thres)
}
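One piece of intuition (my addition, not part of the original answer): for size = 1 the function draws a single Exp(1) variable E and returns 1 exactly when E <= thres, and since P(E <= -log(1 - p)) = 1 - exp(log(1 - p)) = p, that indicator is Bernoulli(p), i.e. Binomial(1, p).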
set.seed(1234)
u <- rbin(100000, 1, 0.5)
v <- rbinom(100000, 1, 0.5)
X <- cbind(u, v)
cbind(Mean = colMeans(X), Var = apply(X, 2, var))
# Mean Var
#u 0.50124 0.2500010
#v 0.49847 0.2500002

How to run monte carlo simulation from a custom distribution in R

I would like to pull 1000 samples from a custom distribution in R. I have the following custom distribution:
library(gamlss)
mu <- 1
sigma <- 2
tau <- 3
kappa <- 3
rate <- 1
Rmax <- 20
x <- seq(1, 2e1, 0.01)
points <- Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) * pgamma(x, shape = kappa, rate = rate)
plot(points ~ x)
How can I randomly sample via Monte Carlo simulation from this distribution?
My first attempt was the following code which produced a histogram shape I did not expect.
hist(sample(points, 1000), breaks = 51)
This is not what I was looking for as it does not follow the same distribution as the pdf.
If you want a Monte Carlo simulation, you'll need to sample from the distribution a large number of times, not take a large sample one time.
Your object, points, has values that increase as the index increases up to a threshold around 400, level off, and then decrease. That's what plot(points ~ x) shows. It may describe a distribution, but the actual distribution of the values stored in points is different: that distribution describes how often the values fall within a certain range. You'll notice that your x axis for the histogram is similar to the y axis of the plot(points ~ x) plot. The actual distribution of values in the points object is easy enough to see, and it is similar to what you're seeing when sampling 1000 values at random, without replacement, from an object with 1900 values in it. Here's the distribution of values in points (no simulation required):
hist(points, 100)
I used 100 breaks on purpose so you could see some of the fine details.
Notice the little bump in the tail at the top, which you may not be expecting if you want the histogram to look like the plot of the values against the index (or some increasing x). It means there are more values in points around 2 than around 1. Look at how the curve of plot(points ~ x) flattens out when the value is around 2, and how steep it is between 0.5 and 1.5. Notice also the large hump at the low end of the histogram, and look at the plot(points ~ x) curve again: most of the values, whether they sit at the low end or the high end of that curve, are close to 0, or at least less than 0.25. If you look at those details, you may be able to convince yourself that the histogram is, in fact, exactly what you should expect :)
If you want a Monte Carlo simulation of a sample from this object, you might try something like:
samples <- replicate(1000, sample(points, 100, replace = TRUE))
If you want to generate data using points as a probability density function, that question has been asked and answered here
Let's define your (not normalized) probability density function as a function:
library(gamlss)
fun <- function(x, mu = 1, sigma = 2, tau = 3, kappa = 3, rate = 1, Rmax = 20)
  Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) *
    pgamma(x, shape = kappa, rate = rate)
Now one approach is to use some MCMC (Markov chain Monte Carlo) method. For instance,
simMCMC <- function(N, init, fun, ...) {
  out <- numeric(N)
  out[1] <- init
  for (i in 2:N) {
    # random-walk proposal centred on the previous draw
    pr <- out[i - 1] + rnorm(1, ...)
    # Metropolis acceptance ratio; fun does not need to be normalized
    r <- fun(pr) / fun(out[i - 1])
    out[i] <- ifelse(runif(1) < r, pr, out[i - 1])
  }
  out
}
It starts from the point init and gives N draws. The approach can be improved in many ways, but I'm simply going to start from init = 5, discard a burn-in period of 20000 draws, and keep every second remaining draw to reduce the number of repeated values:
d <- tail(simMCMC(20000 + 2000, init = 5, fun = fun), 2000)[c(TRUE, FALSE)]
plot(density(d))
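As a rough visual check (my addition, not part of the original answer), you can overlay the normalized target on a density estimate of the draws; the normalizing constant of fun is computed numerically:
Z <- integrate(fun, 0, Inf)$value   # fun is zero below 0, so this is the full mass
plot(density(d), main = "MCMC draws vs normalized target")
curve(fun(x) / Z, from = 0, to = 20, add = TRUE, col = "red")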
Alternatively, you can invert the ECDF of the distribution:
ecd.points <- ecdf(points)
invecdfpts <- with(environment(ecd.points), approxfun(y, x))
samp.inv.ecd <- function(n = 100) invecdfpts(runif(n))
plot(density(samp.inv.ecd(100)))
plot(density(points))
png(); layout(matrix(1:2, 1))
plot(density(samp.inv.ecd(100)), main = "The Sample")
plot(density(points), main = "The Original"); dev.off()
Here's another way to do it, drawing on R: Generate data from a probability density distribution and How to create a distribution function in R?:
x <- seq(1, 2e1, 0.01)
points <- 20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1)
f <- function (x) (20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1))
C <- integrate(f, -Inf, Inf)
C$value
# [1] 11.50361
# normalize by C$value
f <- function(x)
  (20*dexGAUS(x, mu = 1, sigma = 2, nu = 3)*pgamma(x, shape = 3, rate = 1)/11.50361)
random.points <- approx(cumsum(points)/sum(points), x, runif(10000))$y
hist(random.points, 1000)
hist(random.points*40, 1000) will match the scaling of your original function.
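As a quick sanity check (my addition, assuming the normalized f defined just above), you can overlay that density on a probability-scaled histogram of the draws:
hist(random.points, breaks = 100, freq = FALSE, main = "Draws vs normalized f")
curve(f(x), from = 0, to = 20, add = TRUE, col = "red")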

Solve the MLE for Binomial Distribution

This problem is about how to write a log-likelihood function that computes the MLE for the binomial distribution. The exact log-likelihood function is the following: $\ell(p) = \sum_i \left[ \log\binom{n}{x_i} + x_i \log p + (n - x_i) \log(1 - p) \right]$.
Find the MLE estimate by writing a function that calculates the negative log-likelihood and then using nlm() to minimize it. Find the MLE estimate in this way on your data from part 1.b. Use an initial guess of $p= 0.5$.
My thought process is that, when using the MLE, only terms that involve p (the parameter) matter, so we can ignore the first (n choose x) term.
neg.loglik <- function(p, data){
  n <- 1000
  # return a single numeric value
  return(-sum(data*log(p) + (n - data)*log(1 - p)))
}
# nlm(f, p, ...)
nlm(neg.loglik, p = 0.5, data = data)
nlm(neg.loglik, p = 0.5, data = data)$estimate
I am not sure if my neg.loglik function is correct since the estimate seems to be so low...
The true p is 1/3, but I am getting 0.013 for my MLE.
DATA:
You have an urn with 30 balls -- 10 are red, 10 are blue, and 10 are green. Write a single line of code to simulate randomly picking 400 balls from the urn with replacement. Create a variable num_green that records the number of green balls selected in the 400 draws.
set.seed(1)
urn <- c(rep("red", 10), rep("blue", 10), rep("green", 10))
sample <- sample(urn, size = 400, replace = TRUE)
num_green <- length(sample[sample == "green"])
num_green
Now repeat the above experiment 1000 times. Create a vector data, such that each element in data is the result (counting the number of green balls) from an independent trial like that described in 1.a.
set.seed(2)
n <- 1000   # repeat the experiment 1000 times
data <- rep(NA, n)
for (i in 1:n){
  sample <- sample(urn, size = 400, replace = TRUE)
  data[i] <- length(sample[sample == "green"])
}

R - Inverse cumulative distribution method with given function

I have a given function (let's call it f(x)) and I used the Monte Carlo method to normalize it. I calculated the probability density function, and I got the cumulative distribution function by integrating that.
f = function(x) ...
plot(f,xlim = c(0, 5), ylim = c(0, 1),main="f(x)")
mc.integral = function(f, n.iter = 1000, interval){
  x = runif(n.iter, interval[1], interval[2])
  y = f(x)
  mean(y)*(interval[2] - interval[1])
}
MC = mc.integral(f, interval = c(0, 8))
print(MC)
densityFunction <- function(x){
  return(f(x)/MC)
}
distributionFunction <- function(x){
  return(integrate(densityFunction, 0, x)$value)
}
vd <- Vectorize(distributionFunction)
plot(vd, xlim = c(0, 8), ylim = c(0, 1), ylab = "y", main = "E(f(x))")
Now my next task is to use the inverse transform method / inverse cumulative distribution method to generate samples and test them with the Kolmogorov-Smirnov test, but I don't know how I should do this in R.
Can you please give me some help?
Well, this thread shows us how to generate a sample using the inverse transform method:
sample <- vd(runif(1000))
> head(sample)
[1] 0.28737403 0.59295499 0.30814305 0.27998306 0.07601228 0.52753327
Therefore, generating 10 different random samples could be done with:
sample <- list()
for (i in 1:10){
  set.seed(i)
  sample[[i]] <- vd(runif(1000))
}
Afterwards, loop ks.test over the list:
lapply(sample, function(x) ks.test(x, pnorm))
will give you the output of a test against normality for each sample. Choose the size of your samples wisely, as most normality tests tend to come out significant for large samples even when the differences are small (reference here).
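As a side note (my addition, not part of the original answer): if you instead want to apply the inverse transform to the fitted CDF vd itself, you can invert it numerically, assuming it is monotone on the support [0, 8]:
u <- runif(1000, min = 0, max = vd(8))   # guard against vd(8) not being exactly 1
inv.vd <- Vectorize(function(u) uniroot(function(x) vd(x) - u, interval = c(0, 8))$root)
sample.inv <- inv.vd(u)
ks.test(sample.inv, vd)   # compare the draws against the CDF they were drawn from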

Random data generation leading to good prediction on random labels

I've been playing around with implementing CV in R but encountered a weird problem with the returned values across folds in LOOCV.
First I'll randomly generate data as well as labels, then I'll fit a randomForest on what should be just noise. From the loop below I get not only a good AUC but a significant p-value from a t-test. I don't understand how this could be happening theoretically, so I was curious whether the way I attempted to generate the data/labels is sound.
Here is a code snippet that shows my issue.
library(randomForest)
library(pROC)
n=30
p=900
set.seed(3)
XX=matrix(rnorm(n*p, 0, 1) , nrow=n)
YY=as.factor(sample(c('P', 'C'), n, replace=T))
resp = vector()
for (i in 1:n){
  fit = randomForest(XX[-i,], YY[-i])
  pred = predict(fit, XX[i,], type = "prob")[2]
  resp[i] <- pred
}
t.test(resp~YY)$p.value
roc(YY, resp)$auc
I tried multiple ways of generating data, all of which result in the same thing:
XX=matrix(runif(n*p), nrow=n)
XX=matrix(rnorm(n*p, 0, 1) , nrow=n)
and
random_data = matrix(0, n, p)
for (i in 1:n){
  random_data[i,] = jitter(runif(p), factor = 1, amount = 10)
}
XX = as.matrix(random_data)
Since the randomForest is finding relevant predictors in this scenario, that leads me to believe the data may not be truly random. Is there a better way I could generate the data, or generate the random labels? Is it possible that this is an issue with R?
This is a partial answer: I modified your roc function call so that the distribution of AUC values can span the whole interval between 0 and 1. Then I ran it 20 times. The mean AUC and mean p-value are 0.73 and 0.12, respectively. Improved, but still better than random...
library(ROCR)
library(randomForest)
library(pROC)
n = 30
p = 900
pvs = vector()
aucs = vector()
for (j in seq(20)){
  XX = matrix(rnorm(n*p, 0, 1), nrow = n)
  YY = as.factor(sample(c('C', 'P'), n, replace = T))
  resp = vector()
  for (i in 1:n){
    fit = randomForest(XX[-i,], YY[-i])
    pred = predict(fit, XX[i,], type = "prob")[2]
    resp[i] <- pred
  }
  pvs[j] = t.test(resp~YY)$p.value
  aucs[j] = roc(YY, resp, direction = '>')$auc
}
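The summary numbers quoted above are just the averages of the stored vectors (exact values will vary with the random seed):
mean(aucs)   # mean AUC over the 20 repetitions
mean(pvs)    # mean p-value over the 20 repetitions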
