Sample from one of two distributions - r

I want to repeatedly sample values based on a certain condition. For example, I want to create a sample of 100 values.
With probability 0.7 each value is drawn from one distribution, and otherwise from another distribution.
Here is a way to do what I want:
set.seed(20)
A <- vector()
for (i in 1:100) {
  A[i] <- ifelse(runif(1, 0, 1) > 0.7, rnorm(1, mean = 100, sd = 20), runif(1, min = 0, max = 1))
}
I am sure there are other, more elegant ways that do not use a for loop.
Any suggestions?

You can sample an indicator that determines which distribution each value is drawn from.
ind <- sample(0:1, size = 100, prob = c(0.3, 0.7), replace = TRUE)
A <- ind * rnorm(100, mean = 100, sd = 20) + (1 - ind) * runif(100, min = 0, max = 1)
This avoids the for loop, but it draws more random numbers than it uses: 100 values from each distribution, of which only the selected ones contribute to A.
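One way to avoid the unused draws is to sample the indicator first and then fill each group with exactly as many values as it needs. This is a minimal sketch, assuming the normal distribution is the one chosen with probability 0.7; from_norm is an illustrative name, not part of the original answer.

```r
set.seed(20)
n <- 100

# TRUE with probability 0.7: draw from the normal (assumed assignment)
from_norm <- runif(n) < 0.7

A <- numeric(n)
# Fill each group with exactly the number of draws it needs
A[from_norm]  <- rnorm(sum(from_norm), mean = 100, sd = 20)
A[!from_norm] <- runif(sum(!from_norm), min = 0, max = 1)
```

Logical indexing keeps the positions random while each distribution is sampled only once, in a single vectorized call.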

If the proportion drawn from each distribution is fixed rather than random, you can draw the right number of values from each distribution and then shuffle the result:
n <- 100
A <- sample(c(rnorm(0.7*n, mean = 100, sd = 20), runif(0.3*n, min = 0, max = 1)))
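The counts passed to rnorm() and runif() should be whole numbers. When 0.7 * n is not an integer, the two pieces may not add up to n. A small sketch that rounds one count and derives the other from it (n_norm and n_unif are illustrative names, and n = 101 is chosen only to show the non-integer case):

```r
set.seed(20)
n <- 101

n_norm <- round(0.7 * n)  # number of normal draws, rounded to a whole number
n_unif <- n - n_norm      # the remainder comes from the uniform

# Draw both groups, then shuffle so the order is random
A <- sample(c(rnorm(n_norm, mean = 100, sd = 20),
              runif(n_unif, min = 0, max = 1)))
```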

Related

Creating a matrix with random entries with given probabilities in R

I want to create a 100x100 matrix A with entry a_ij being randomly selected from the set {0,1} with P(a_ij=1)=0.2 and P(a_ij=0)=0.8.
This is what I’ve tried so far:
n<-100
matrix<-matrix(0,100,100)
mynumbers<-c(1,0)
myprobs<-c(0.2,0.8)
for (i in 1:100) {
  for (j in 1:100) {
    matrix[i, j] <- sample(mynumbers, 1, replace = TRUE, prob = myprobs)
  }
}
matrix
I’m not sure about the sample size being 1, but this way only seems to work if I choose size=1... Is this the correct way to do it? Thank you in advance!
As @akrun noted, there are much easier ways. A 100 x 100 matrix has 10,000 entries. prob = .2 sets the success probability, i.e. P(a_ij=1)=0.2, and size = 1 means each entry is a single trial. The arguments to matrix() should be self-explanatory.
set.seed(2020)
trials <- rbinom(n = 10000, size = 1, prob = .2)
my.matrix <- matrix(trials, nrow = 100, ncol = 100)
Or, to more closely resemble your code:
n <- 10000
mynumbers<-c(1,0)
myprobs<-c(0.2,0.8)
trials2 <- sample(x = mynumbers,
                  size = n,
                  replace = TRUE,
                  prob = myprobs)
my.matrix2 <- matrix(trials2, nrow = 100, ncol = 100)
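Either version can be sanity-checked by looking at the dimensions and the share of ones, which should be close to 0.2 but will not match it exactly. A short sketch using the rbinom() variant:

```r
set.seed(2020)

# 10,000 Bernoulli(0.2) trials arranged as a 100 x 100 matrix
my.matrix <- matrix(rbinom(n = 100 * 100, size = 1, prob = 0.2),
                    nrow = 100, ncol = 100)

dim(my.matrix)    # number of rows and columns
mean(my.matrix)   # empirical share of ones, close to 0.2
table(my.matrix)  # counts of zeros and ones
```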

How do I generate 5000 synthetic Gaussian data sets in R with 1000 observations each?

For each data set, I need to set σ² = 10 and µ_j = j, where j = 1, ..., 5000 is the index of the data set.
We can use lapply to loop over 1 to 5000 with a small anonymous function that passes each index to rnorm as the mean.
lapply(1:5000, function(x) rnorm(n = 1000, mean = x, sd = sqrt(10)))
You can also use purrr::map(). Since σ² = 10, the standard deviation is sqrt(10).
library(purrr)
map(1:5000, ~ rnorm(n = 1000, mean = .x, sd = sqrt(10)))
If you want to iterate over two different arguments to rnorm, use map2():
n_arg <- c(rep(1000, 2500), rep(2000, 2500))
map2(1:5000, n_arg, ~ rnorm(n = .y, mean = .x, sd = sqrt(10)))
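When every data set has the same size, the whole simulation can also be done in one vectorized rnorm() call, because rnorm() accepts a vector of means. This is a sketch of that alternative, not part of the answers above; n_obs, n_sets and sims are illustrative names.

```r
set.seed(1)
n_obs  <- 1000  # observations per data set
n_sets <- 5000  # number of data sets

# One mean per draw: 1 repeated n_obs times, then 2 repeated n_obs times, ...
mu <- rep(seq_len(n_sets), each = n_obs)

# matrix() fills column by column, so column j holds data set j
sims <- matrix(rnorm(n_obs * n_sets, mean = mu, sd = sqrt(10)),
               nrow = n_obs, ncol = n_sets)

dim(sims)
mean(sims[, 3])  # should be close to 3
var(sims[, 3])   # should be close to 10
```

A matrix is convenient for column-wise summaries such as colMeans(sims); the list returned by lapply() or map() is the better fit when the data sets differ in length.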

R - Coding Function for Bootstrap CI Coverage Property

I need to write a function that runs a simulation to evaluate the coverage of a bootstrap confidence interval for the variance of n samples from a normal distribution. Below is what I've attempted, but it keeps returning a mean of 0 or 0.002 for the proportion of simulations in which the CI contains the true variance...
Var_CI_Coverage <- function(true_mean, true_var, nsim, nboot, alpha, nsamples){
  cover = NULL
  for(k in 1:nsim){
    Var = as.numeric()
    y <- rnorm(1, mean = true_mean, sd = sqrt(true_var))
    for(i in 1:nboot){
      resample_y <- sample(y, size = nsamples, replace = TRUE)
      Var[i] <- var(resample_y)
    }
    LB <- quantile(Var, probs=c(alpha/2))
    UB <- quantile(Var, probs=c(1 - (alpha/2)))
    cover[k] <- ifelse(LB <= true_var & UB >= true_var, 1, 0)
  }
  return(mean(cover))
}
Var_CI_Coverage(true_mean= 0, true_var = 4, nsim = 500, nboot = 1000, alpha = 0.05, nsamples = 10)
The main problem is you generate y using
y <- rnorm(1, mean = true_mean, sd = sqrt(true_var))
which means y is a single value, and all your bootstrap samples are just that single y value repeated nsamples times. You need
y <- rnorm(nsamples, mean = true_mean, sd = sqrt(true_var))
Then you get samples with actual variance, and you get a coverage estimate that looks more in the right ballpark (no comment on whether it's correct, I haven't tried to check).
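Putting the fix into the whole function gives something like the sketch below. It keeps the asker's percentile interval and only changes how y is drawn; the inner loop is replaced by replicate() for brevity. The lower-case function name and the smaller nsim and nboot values are choices made here to keep the run short, and whether the resulting coverage is statistically satisfactory is not assessed.

```r
var_ci_coverage <- function(true_mean, true_var, nsim, nboot, alpha, nsamples) {
  cover <- logical(nsim)
  for (k in seq_len(nsim)) {
    # Draw a full sample of size nsamples (the fix from the answer)
    y <- rnorm(nsamples, mean = true_mean, sd = sqrt(true_var))

    # Bootstrap distribution of the sample variance
    boot_var <- replicate(nboot, var(sample(y, size = nsamples, replace = TRUE)))

    # Percentile interval, as in the original attempt
    ci <- quantile(boot_var, probs = c(alpha / 2, 1 - alpha / 2))
    cover[k] <- ci[1] <= true_var && true_var <= ci[2]
  }
  mean(cover)  # proportion of simulations whose CI contains true_var
}

set.seed(1)
coverage <- var_ci_coverage(true_mean = 0, true_var = 4, nsim = 200,
                            nboot = 500, alpha = 0.05, nsamples = 10)
coverage
```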

rnorm is generating non-random looking realizations

I was debugging my simulation and found that when I run rnorm(), the random normal values don't look random to me at all. ccc is a vector holding the mean and sd, which are given parametrically. How can I get truly random normal realizations? Since my original simulation is quite long, I don't want to go into Gibbs sampling... Do you know why I get non-random-looking realizations of normal random variables?
> ccc
# [1] 144.66667 52.52671
> rnorm(20, ccc)
# [1] 144.72325 52.31605 144.44628 53.07380 144.64438 53.87741 144.91300 54.06928 144.76440
# [10] 52.09181 144.61817 52.17339 145.01374 53.38597 145.51335 52.37353 143.02516 52.49332
# [19] 144.27616 54.22477
> rnorm(20, ccc)
# [1] 143.88539 52.42435 145.24666 50.94785 146.10255 51.59644 144.04244 51.78682 144.70936
# [10] 53.51048 143.63903 51.25484 143.83508 52.94973 145.53776 51.93892 144.14925 52.35716
# [19] 144.08803 53.34002
This comes down to how values are matched to a function's parameters. Take rnorm() for example:
Its signature is rnorm(n, mean = 0, sd = 1). mean and sd are two separate parameters, so each needs its own value. Here is a situation where it is easy to get stuck:
arg <- c(5, 10)
rnorm(1000, arg)
This actually means rnorm(n = 1000, mean = c(5, 10), sd = 1). arg is matched by position to the parameter mean, and because sd is not supplied, rnorm() uses its default value of 1. But what does mean = c(5, 10) mean? Let's check:
x <- rnorm(1000, arg)
hist(x, breaks = 50, prob = TRUE)
# lines(density(x), col = 2, lwd = 2)
mean = c(5, 10) and sd = 1 are recycled to length 1000, i.e.
rnorm(n = 1000, mean = c(5, 10, 5, 10, ...), sd = c(1, 1, 1, 1, ...))
and hence the final sample x is actually a blend of 500 N(5, 1) samples and 500 N(10, 1) samples which are drawn alternately, i.e.
c(rnorm(1, 5, 1), rnorm(1, 10, 1), rnorm(1, 5, 1), rnorm(1, 10, 1), ...)
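The alternating pattern can be checked directly by splitting the sample into its odd and even positions. A small sketch (odd and even are illustrative names):

```r
set.seed(1)
arg <- c(5, 10)
x <- rnorm(1000, arg)

# Odd positions were drawn with mean 5, even positions with mean 10
odd  <- x[seq(1, 1000, by = 2)]
even <- x[seq(2, 1000, by = 2)]

c(mean(odd), mean(even))  # roughly 5 and 10
c(sd(odd), sd(even))      # both roughly 1, the default sd
```

This is the same pattern as in the question's output, where values near 144.67 and 52.53 alternate.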
As for your question, it should be:
arg <- c(5, 10)
rnorm(1000, arg[1], arg[2])
and this means rnorm(n = 1000, mean = 5, sd = 10). Check it again, and you will get a normal distribution with mean = 5 and sd = 10.
x <- rnorm(1000, arg[1], arg[2])
hist(x, breaks = 50, prob = TRUE)
# curve(dnorm(x, arg[1], arg[2]), col = 2, lwd = 2, add = TRUE)

How to extract the variance covariance matrix for particular values?

I would like to extract the variance-covariance matrix for variables b and c and am struggling to find the right command. My original data frame has more than 100 variables, so a command that extracts it would be great.
Given data:
a<-rnorm(1000, mean = 0, sd = 1)
b<-rnorm(1000, mean = 0, sd = 1)
c<-rnorm(1000, mean = 0, sd = 1)
d<-rbinom(1000, size = 1, prob = .5)
e<-rbinom(1000, size = 1, prob = .5)
f<-rbinom(1000, size = 1, prob = .5)
data<-data.frame(a,b,c,d,e,f)
test<-glm(a~b+c+d+e+f,data=data)
pe.glmCube<-test$coefficients[2:3] # point estimates
I tried the same approach with the variance-covariance matrix, but it does not seem to make sense that way:
vc.glmCube <- vcov(test[2:3]) # var-cov matrix
Call vcov() on the fitted model itself, then subset the resulting matrix by coefficient name:
vcov(test)[c("b", "c"), c("b", "c")]
#               b             c
# b  1.083964e-03 -2.532682e-05
# c -2.532682e-05  9.779278e-04
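With many variables it can help to keep the names of interest in one vector and reuse it for both the point estimates and the matrix. A self-contained sketch along those lines; dat, fit, keep, pe_bc and vc_bc are illustrative names, and the simulated data only mirror the question's setup.

```r
set.seed(1)
dat <- data.frame(
  a = rnorm(1000), b = rnorm(1000), c = rnorm(1000),
  d = rbinom(1000, size = 1, prob = 0.5),
  e = rbinom(1000, size = 1, prob = 0.5),
  f = rbinom(1000, size = 1, prob = 0.5)
)
fit <- glm(a ~ b + c + d + e + f, data = dat)

keep <- c("b", "c")           # coefficients of interest

pe_bc <- coef(fit)[keep]      # point estimates
vc_bc <- vcov(fit)[keep, keep]  # matching 2 x 2 variance-covariance block

pe_bc
vc_bc
sqrt(diag(vc_bc))             # standard errors of b and c
```

Indexing by name rather than by position keeps the selection correct even if the order of terms in the formula changes.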
