R: Distribution of Random Samples vs. 1 Random Sample - r

I have a question about random sampling.
Are the two following results (A and B) statistically the same?
nobs <- 1000
A <- rt(n=nobs, df=3, ncp=0)
simulations <- 50
B <- unlist(lapply(rep.int(nobs/simulations, times=simulations),function(y) rt(n=y, df=3, ncp=0) ))
I thought it would be but now I've been going back and forth.
Any help would be appreciated.
Thanks

With some small changes, you can even make them numerically equal. You only need to seed the RNG and omit specifying the ncp parameter and use the default value (of 0) instead:
nobs <- 1000
set.seed(42)
A <- rt(n=nobs, df=3)
simulations <- 50
set.seed(42)
B <- unlist(lapply(rep.int(nobs/simulations, times=simulations),function(y) rt(n=y, df=3) ))
all.equal(A, B)
#[1] TRUE
Why don't you get equal results when you specify ncp=0?
Because then rt assumes that you actually want a non-central t-distribution and the values are calculated with rnorm(n, ncp)/sqrt(rchisq(n, df)/df). That means when creating 1000 values at once rnorm is called once and rchisq is called once subsequently. If you create 50 times 20 values, you have alternating calls to these RNGs, which means the RNG states are different for the rnorm and rchisq calls than in the first case.

Related

how to find correlation coefficient in a for loop that is to be repeated 5000 times? and save the statistic

for 2 independent normally distributed variables x and y, they are found using x = rnorm(50) and y = rnorm(50). calculate the correlation 5000 times and save the result each time. What is the likelihood that a correlation with absolute value greater than 0.3 is computed? (default set.seed(42) and to plot a histogram of the coefficient spread)
This is what i have tried so far...
set.seed(42)
n <- 50 #length of random sequence
x_norm <- rnorm(n)
y_norm <- rnorm(n)
nrun <- 5000
corr <- numeric(nrun)
for (i in 1:nrun) {
corrxy <- cor(x_norm,y_norm)
corr[i] <- sum(abs(corrxy > 0.3)) / n #save statistic in the vector
}
hist(corr)
it is expected that i get 5000 different coefficient numbers saved in [i], and when plotted using hist(0), these coefficients should follow approx a normal distribution. but i do not understand how the for loop works and how to incorporate the value of coefficient being greater than 0.3.
I think you were nearly there. You just had to shift some code outside and inside the for loop.
You want new data for each run of the loop (otherwise you get the same correlation 5000 times) and you need to save the correlation each time the loop runs. This results in a vector of 5000 correlations which you can use to look at the proportion of correlations (divide by the number of runs, not the number of observations) that are higher than .3 outside of the for loop.
Edit: One final correction is needed in the bracketing of the absolute function. You want to find the absolute correlations > .3 not the absolute value of corrxy > .3.
set.seed(42)
n <- 50 #length of random sequence
nrun <- 5000
corrxy <- numeric(nrun) # The correlation is the statistic you want to save
for (i in 1:nrun) {
x_norm <- rnorm(n) # Compute a new dataset for each run (otherwise you get the same correlation)
y_norm <- rnorm(n)
corrxy[i] <- cor(x_norm,y_norm) # Calculate the correlation
}
hist(corrxy)
sum(abs(corrxy) > 0.3) / nrun # look at the proportion of runs that have cor > .3
Below is the resulting histogram of the 5000 correlations. The proportion of correlations that is higher than |.3| is 0.034 in this case.
Here's another way of doing this kind of simulations without explicitly calling a loop:
Define first your simulation:
my_sim <- function(n) { # n is the norm distribution size
x <- rnorm(n)
y <- rnorm(n)
corrxy <- cor(x, y)
corrxy # return the correlation (single value)
}
Now we can call this function many times with replicate():
set.seed(123)
nrun <- 10
my_results <- replicate(nrun, my_sim(n=50))
#my_results
# [1] -0.0358698314 -0.0077403045 -0.0512509071 -0.0998484901 0.1230261286 0.1001124010 -0.0002023124
# [8] 0.2017120443 0.0644662387 0.0567232640
Now in my_results you have all the correlations from each simulations (just 10 for example).
And you can compute your statistics:
sum(abs(my_results)> 0.3) / nrun # nrun is 10
or plot:
hist(my_results)

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (i.e. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but only get very small values, which I cannot transform so the make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5. with a mean of 0. I would like to have much more distributed data, so I could simply transform it by adding 50 and have a quite simililar range compared to my first variable.
Does anyone of you know a way to generate this kind of - more or less -meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert values in A to have the new desired mean 50,000 and have an inverse relationship (ie subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
This only results in r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range 0 100,000. You might need to filter for values outside your range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
If you're happy with the correlation and the marginal distribution (ie, shape) of the generated values, multiply the values (that fall between (-.5, +.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
edit: this approach, or any thing else where 100,000 & 50,000 are exchanged for different numbers, will be an example of a 'linear transformation' recommended by #gregor-de-cillia.

How to vectorise sampling from non-identically distributed Bernoulli random variables?

Given a sequence of independent but not identically distributed Bernoulli trials with success probabilities given by a vector, e.g.:
x <- seq(0, 50, 0.1)
prob <- - x*(x - 50)/1000 # trial probabilities for trials 1 to 501
What is the most efficient way to obtain a random variate from each trial? I am assuming that vectorisation is the way to go.
I know of two functions that give Bernoulli random variates:
rbernoulli from the package purr, which does not accept a vector of success probabilities as an input. In this case it may be possible to wrap the function in an apply type operation.
rbinom with arguments size = 1 gives Bernoulli random variates. It also accepts a vector of probabilities, so that:
rbinom(n = length(prob), size = 1, prob = prob)
gives an output with the right length. However, I am not entirely sure that this is actually what I want. The bits in the helpfile ?rbinom that seem relevant are:
The length of the result is determined by n for rbinom, and is the
maximum of the lengths of the numerical arguments for the other
functions.
The numerical arguments other than n are recycled to the length of the
result. Only the first elements of the logical arguments are used.
However, n is a parameter with no default, so I am not sure what the first sentence means. I presume the second sentence means that I get what I want, since only size = 1 should be recycled. However this thread seems to suggest that this method does not work.
This blog post gives some other methods as well. One commentator mentions my suggested idea using rbinom.
Another way to test that rbinom is vectorised for prob, taking advantage of the fact that the sum of N bernoulli random variables is a binomial random variable with denominator N:
x <- seq(0, 50, 0.1)
prob <- -x*(x - 50)/1000
n <- rbinom(prob, size=1000, prob)
par(mfrow=c(1, 2))
plot(prob ~ x)
plot(n ~ x)
If you don't trust random strangers on the internet and do not understand documentation, maybe you can convince yourself by testing. Just set the random seed to get reproducible results:
x <- seq(0, 50, 0.1)
prob <- - x*(x - 50)/1000
#501 seperate draws of 1 random number
set.seed(42)
res1 <- sapply(prob, rbinom, n = 1, size = 1)
#501 "simultaneous" (vectorized) draws
set.seed(42)
res2 <- rbinom(501, 1, prob)
identical(res1, res2)
#[1] TRUE

Generating Random Variables with given correlations between pairs of them:

I want to generate 2 continuous random variables Q1, Q2 (quantitative traits, each are normal) and 2 binary random variables Z1, Z2 (binary traits) with given pairwise correlations between all possible pairs of them.
Say
(Q1,Q2):0.23
(Q1,Z1):0.55
(Q1,Z2):0.45
(Q2,Z1):0.4
(Q2,Z2):0.5
(Z1,Z2):0.47
Please help me generate such data in R.
This is crude but might get you started in the right direction.
library(copula)
options(digits=3)
probs <- c(0.5,0.5)
corrs <- c(0.23,0.55,0.45,0.4,0.5,0.47) ## lower triangle
Simulate correlated values (first two quantitative, last two transformed to binary)
sim <- function(n,probs,corrs) {
tmp <- normalCopula( corrs, dim=4 , "un")
getSigma(tmp) ## test
x <- rCopula(1000, tmp)
x2 <- x
x2[,3:4] <- qbinom(x[,3:4],size=1,prob=rep(probs,each=nrow(x)))
x2
}
Test SSQ distance between observed and target correlations:
objfun <- function(corrs,targetcorrs,probs,n=1000) {
cc <- try(cor(sim(n,probs,corrs)),silent=TRUE)
if (is(cc,"try-error")) return(NA)
sum((cc[lower.tri(cc)]-targetcorrs)^2)
}
See how bad things are when input corrs=target:
cc0 <- cor(sim(1000,probs=probs,corrs=corrs))
cc0[lower.tri(cc0)]
corrs
objfun(corrs,corrs,probs=probs) ## 0.112
Now try to optimize.
opt1 <- optim(fn=objfun,
par=corrs,
targetcorrs=corrs,probs=c(0.5,0.5))
opt1$value ## 0.0208
Stops after 501 iterations with "max iterations exceeded". This will never work really well because we're trying to use a deterministic hill-climbing algorithm on a stochastic objective function ...
cc1 <- cor(sim(1000,probs=c(0.5,0.5),corrs=opt1$par))
cc1[lower.tri(cc1)]
corrs
Maybe try simulated annealing?
opt2 <- optim(fn=objfun,
par=corrs,
targetcorrs=corrs,probs=c(0.5,0.5),
method="SANN")
It doesn't seem to do much better than the previous value. Two possible problems (left as an exercise for the reader are) (1) we have specified a set of correlations that are not feasible with the marginal distributions we have chosen, or (2) the error in the objective function surface is getting in the way -- to do better we would have to average over more replicates (i.e. increase n).

Impossible to create correlated variables from this correlation matrix?

I would like to generate correlated variables specified by a correlation matrix.
First I generate the correlation matrix:
require(psych)
require(Matrix)
cor.table <- matrix( sample( c(0.9,-0.9) , 2500 , prob = c( 0.8 , 0.2 ) , repl = TRUE ) , 50 , 50 )
k=1
while (k<=length(cor.table[1,])){
cor.table[1,k]<-0.55
k=k+1
}
k=1
while (k<=length(cor.table[,1])){
cor.table[k,1]<-0.55
k=k+1
}
ind<-lower.tri(cor.table)
cor.table[ind]<-t(cor.table)[ind]
diag(cor.table) <- 1
This correlation matrix is not consistent, therefore, eigenvalue decomposition is impossible.
TO make it consistent I use nearPD:
c<-nearPD(cor.table)
Once this is done I generate the correlated variables:
fit<-principal(c, nfactors=50,rotate="none")
fit$loadings
loadings<-matrix(fit$loadings[1:50, 1:50],nrow=50,ncol=50,byrow=F)
loadings
cases <- t(replicate(50, rnorm(10)) )
multivar <- loadings %*% cases
T_multivar <- t(multivar)
var<-as.data.frame(T_multivar)
cor(var)
However the resulting correlations are far from anything that I specified initially.
Is it not possible to create such correlations or am I doing something wrong?
UPDATE from Greg Snow's comment it became clear that the problem is that my initial correlation matrix is unreasonable.
The question then is how can I make the matrix reasonable. The goal is:
each of the 49 variables should correlate >.5 with the first variable.
~40 of the variables should have a high >.8 correlation with each other
the remaining ~9 variables should have a low or negative correlation with each other.
Is this whole requirement impossible ?
Try using the mvrnorm function from the MASS package rather than trying to construct the variables yourself.
**Edit
Here is a matrix that is positive definite (so it works as a correlation matrix) and comes close to your criteria, you can tweak the values from there (all the Eigen values need to be positive, so you can see how changing a number affects things):
cor.mat <- matrix(0.2,nrow=50, ncol=50)
cor.mat[1,] <- cor.mat[,1] <- 0.55
cor.mat[2:41,2:41] <- 0.9
cor.mat[42:50, 42:50] <- 0.25
diag(cor.mat) <- 1
eigen(cor.mat)$values
Some numerical experimentation based on your specifications above suggests that the generated matrix will never (what never? well, hardly ever ...) be positive definite, but it also doesn't look far from PD with these values (making lcor below negative will almost certainly make things worse ...)
rmat <- function(n=49,nhcor=40,hcor=0.8,lcor=0) {
m <- matrix(lcor,n,n) ## fill matrix with 'lcor'
## select high-cor variables
hcorpos <- sample(n,size=nhcor,replace=FALSE)
## make all of these highly correlated
m[hcorpos,hcorpos] <- hcor
## compute min real part of eigenvalues
min(Re(eigen(m,only.values=TRUE)$values))
}
set.seed(101)
r <- replicate(1000,rmat())
## NEVER pos definite
max(r)
## [1] -1.069413e-15
par(las=1,bty="l")
png("eighist.png")
hist(log10(abs(r)),breaks=50,col="gray",main="")
dev.off()

Resources