Creating a binary variable with probability in R

I'm trying to make a variable, Var, that takes the value 0 60% of the time and 1 otherwise, with 50,000 observations.
For a normal distribution, I remember doing the following, where the first argument sets n:
Var <- rnorm(50000, 0, 1)
Is there a way I could combine an ifelse command with the above to specify n as well as the probability of Var being 0?

I would use rbinom like this:
n_ <- 50000
p_ <- 0.4 # it's probability of 1s
Var <- rbinom(n=n_, size=1, prob=p_)
By using variables, you can change the number of observations and/or the probability just by changing those variables. Hope that's what you are looking for.
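To connect this back to the ifelse idea in the question, here is a minimal sketch. The seed and the names n_obs, p_zero, var_ifelse and var_binom are illustrative assumptions, not part of the original posts. It thresholds uniform draws with ifelse and compares the share of zeros with the rbinom approach.

```r
set.seed(42)        # arbitrary seed, only for reproducibility
n_obs  <- 50000     # number of observations
p_zero <- 0.6       # probability that the variable is 0

# ifelse on uniform draws: 0 when the draw falls below p_zero, 1 otherwise
var_ifelse <- ifelse(runif(n_obs) < p_zero, 0, 1)

# rbinom approach: prob is the probability of a 1, i.e. 1 - p_zero
var_binom <- rbinom(n = n_obs, size = 1, prob = 1 - p_zero)

# both shares of zeros should be close to 0.6
c(ifelse = mean(var_ifelse == 0), rbinom = mean(var_binom == 0))
```

Both vectors are independent draws from the same distribution, so the two shares will differ slightly from 0.6 and from each other.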

If by 60% you mean a probability equal to 0.6 (rather than an empirical frequency), then
Var <- sample(0:1, 50000, prob = c(6, 4), replace = TRUE)
gives the desired sequence of independent Bernoulli realizations with P(0) = 0.6 and P(1) = 0.4 (sample normalizes the weights 6 and 4 internally).

I'm picking nits here, but it actually isn't completely clear exactly what you want.
Do you want to simulate a sample of 50000 from the distribution you describe?
Or, do you want 50000 replications of simulating an observation from the distribution you describe?
These are different things that, in my opinion, should be approached differently.
To simulate a sample of size 50000 from that distribution you would use:
sample(c(0,1), size = 50000, prob = c(0.6, 0.4), replace = TRUE)
To replicate 50000 simulations of sampling from the distribution you describe I would recommend:
replicate(50000, sample(c(0,1), size = 1, prob = c(0.6, 0.4)))
This might seem silly, since in this case both lines of code produce the same kind of result: 50000 independent draws from the same distribution.
But suppose your goal was to investigate properties of samples of size 50000? Then you would run a bunch (say, 1000) of replications of that first line of code, wrapped inside replicate:
replicate(1000, sample(c(0,1), size = 50000, prob = c(0.6, 0.4), replace = TRUE))
I hope I haven't been too pedantic about this. Having seen simulations go awry it has become my belief that one should keep separate the thing being simulated from the number of simulations you decide to do. The former is fundamental to your problem, while the latter only affects the accuracy of the simulation study and how long it takes.
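As a minimal sketch of that separation (the seed, the names n_sample, n_reps, p_one and prop_ones, and the choice of 200 replications are assumptions made for illustration), the code below keeps the sample size and the number of replications in separate variables and summarises each simulated sample by its proportion of 1s.

```r
set.seed(1)          # arbitrary seed
n_sample <- 50000    # size of each simulated sample (part of the problem)
n_reps   <- 200      # number of replications (affects accuracy and run time only)
p_one    <- 0.4      # probability of drawing a 1

# one sample proportion of 1s per replication
prop_ones <- replicate(
  n_reps,
  mean(sample(c(0, 1), size = n_sample, prob = c(1 - p_one, p_one), replace = TRUE))
)

# spread across replications vs. the theoretical standard error sqrt(p(1-p)/n)
c(simulated = sd(prop_ones), theoretical = sqrt(p_one * (1 - p_one) / n_sample))
```

Changing n_reps should only change how precisely the spread is estimated, while changing n_sample changes the quantity being studied.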

Related

Two exponential distributions

I am trying to simulate two exponential distributions. For example two CPUs processing jobs e.g. one having average service time 10 min (lambda = 0.1) and another one 20 min (lambda = 0.05) and they work independently. Both of them are busy when a new job arrives.
I would like to simulate the waiting time of a new job.
Here is what I did so far.
cpu1 = rexp(n = 10000, rate = .1)
cpu2 = rexp(n = 10000, rate = .05)
I generate 10K data points from each exponential distribution. For each i, the new job has to wait min(cpu1[i], cpu2[i]). I store all of these in a data frame and compute the mean.
for (i in seq(1, 10000)) {
  if (i == 1) {
    df1 <- data.frame(waiting_time = min(cpu1[i], cpu2[i]))
  } else {
    df1 <- rbind(df1, data.frame(waiting_time = min(cpu1[i], cpu2[i])))
  }
}
mean(df1$waiting_time)
Is this the right way to do the simulation? or am I doing something wrong?
The simulation is already done, by your definitions. You are asking how to compute the final results, which can be done with mean(pmin(cpu1,cpu2)).
As has been pointed out, mean(pmin(cpu1,cpu2)) is equivalent to the for loop and mean(df1$waiting_time), but much, much faster.
Or you could skip the simulation altogether since the minimum of two independent random exponential variables is also exponentially distributed with a rate equal to the sum of the two rates. Furthermore, the sum of n iid exponential random variables is gamma-distributed with the same rate parameter and a shape parameter equal to n.
So we can simply do rgamma(1, 1e4, 0.15)/1e4 or, equivalently, rgamma(1, 1e4, 0.15*1e4) instead of mean(pmin(cpu1,cpu2)), and the results will have identical distributions.
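A quick way to check this, sketched under the assumption of an arbitrary seed and illustrative names (rate1, rate2, n_jobs, sim_mean, theo_mean, gamma_mean), is to compare the simulated mean with the theoretical mean 1 / (0.1 + 0.05) of the minimum and with a single gamma draw:

```r
set.seed(123)                 # arbitrary seed
rate1  <- 0.1                 # CPU 1: mean service time 10 min
rate2  <- 0.05                # CPU 2: mean service time 20 min
n_jobs <- 10000

cpu1 <- rexp(n_jobs, rate = rate1)
cpu2 <- rexp(n_jobs, rate = rate2)

# vectorised element-wise minimum replaces the loop and the data frame
sim_mean <- mean(pmin(cpu1, cpu2))

# the minimum is exponential with rate rate1 + rate2, so its mean is 1 / 0.15
theo_mean <- 1 / (rate1 + rate2)

# one draw from the distribution of the sample mean itself
gamma_mean <- rgamma(1, shape = n_jobs, rate = (rate1 + rate2) * n_jobs)

c(simulated = sim_mean, theoretical = theo_mean, gamma_draw = gamma_mean)
```

The simulated mean and the gamma draw are both random, so they should land near 6.67 minutes without matching it exactly.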

How do I simulate a class of 100 students?

An exam has 20 multiple-choice questions, each answered correctly with probability P = 0.25. How do I simulate a class of 100 students, and what is the class average? If the class is increased to 1000 students, what happens to the average?
I'm not sure where to begin, other than trying to solve this manually.
n_experiments<-100
n_samples<-c(1:20)
means_of_sample_n<-c()
hist(rbinom( n = 100, size = 20, prob = 0.25 ))
I'm not sure what to do after this?
Well, you just have to find a way to set the probability of answering correctly to 0.25. You can do that easily by generating uniform random numbers and checking whether each one is below 0.25:
n_experiments<-100
n_samples<-c(1:20)
means_of_sample_n<-c()
hist(rbinom( n = 100000, size = 20, prob = 0.25 ))
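Nstu <- 100000   # number of simulated students
Nquest <- 20     # questions per exam
Results <- matrix(as.numeric(runif(Nstu * Nquest) < 0.25), ncol = Nquest)
hist(rowSums(Results))   # distribution of scores per student
mean(rowSums(Results))   # average score, should be close to 5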
from the properties of the binomial distribution:
the mean is n*p, so mu = 20*0.25, giving a mean of 5. this is independent of the class size
the variance is n*p*(1-p), and the standard deviation is the usual sqrt of this, so sigma = sqrt(20*0.25*0.75), i.e. ~1.94.
the standard error of the mean is sigma / sqrt(k), where k is your class size, so we get SEMs of 0.19 and 0.061 for class sizes of 100 and 1000 respectively
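these analytical values can be computed directly. the sketch below uses illustrative names (n_questions, p_correct, class_sizes, sem) that are assumptions, not part of the original answer:

```r
n_questions <- 20      # questions per exam
p_correct   <- 0.25    # probability of a correct answer

mu    <- n_questions * p_correct                          # binomial mean
sigma <- sqrt(n_questions * p_correct * (1 - p_correct))  # binomial sd

class_sizes <- c(100, 1000)
sem <- sigma / sqrt(class_sizes)   # standard error of the class mean

round(c(mu = mu, sigma = sigma), 3)
round(setNames(sem, class_sizes), 3)
```

the mean does not depend on the class size, while the standard error shrinks with the square root of it.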
it's often useful to check things via simulation, and we can simulate a single class as you were doing.
x <- rbinom(100, 20, 0.25)
plot(table(x))
I'm using plot(table(x)) above instead of using hist, because this is a discrete distribution. hist is more suited to continuous distributions, while table is better for discrete distributions with a small number of distinct values.
next, we can simulate things many times using replicate. in this case you're after the mean of the binomial draw:
y <- replicate(1000, mean(rbinom(100, 20, 0.25)))
c(mu=mean(y), se=sd(y))
which happened to give me mu=5.002 and se=0.201, but will change every run. increasing the class size to 1000, I get mu=5.002 again, and se=0.060. because these are random samples from the distribution they are subject to "monte-carlo error" but given enough replicates they should approach the analytical answers above. that said, they're close enough to the analytical results to give me confidence I've not made any silly typos

How should I specify argument "prob" when using sample() for resampling?

In short
I'm trying to better understand the argument prob as part of the function sample in R. In what follows, I both ask a question, and provide a piece of R code in connection with my question.
Question
Suppose I have generated 10,000 random standard rnorms. I then want to draw a sample of size 5 from this mother 10,000 standard rnorms.
How should I set the prob argument within the sample such that the probability of drawing these 5 numbers from the mother rnorm considers that the middle areas of the mother rnorm are denser but tail areas are thinner (so in drawing these 5 numbers it would draw from the denser areas more frequently than the tail areas)?
x = rnorm(1e4)
sample( x = x, size = 5, replace = TRUE, prob = ? ) ## what should be "prob" here?
# OR I leave `prob` to be the default by not using it:
sample( x = x, size = 5, replace = TRUE )
Overthinking is the devil.
You want to resample these samples, following the original distribution or an empirical distribution. Think about how an empirical CDF is obtained:
plot(sort(x), 1:length(x)/length(x))
In other words, the empirical PDF is just
plot(sort(x), rep(1/length(x), length(x)))
So, we want prob = rep(1/length(x), length(x)) or simply, prob = rep(1, length(x)) as sample normalizes prob internally. Or, just leave it unspecified as equal probability is default.
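A small check of this, sketched with an arbitrary seed and illustrative names (n_draws, resample_default, resample_uniform), is to resample with and without an explicit uniform prob and compare the share of values within one standard deviation of zero:

```r
set.seed(2024)         # arbitrary seed
x <- rnorm(1e4)        # the "mother" sample
n_draws <- 1e5         # large resample so the shares are stable

# default: every element of x is equally likely
resample_default <- sample(x, size = n_draws, replace = TRUE)

# explicit equal weights; sample() normalizes them internally
resample_uniform <- sample(x, size = n_draws, replace = TRUE,
                           prob = rep(1, length(x)))

# share of values in [-1, 1]; the dense middle is drawn more often
# simply because more elements of x lie there
c(original = mean(abs(x) <= 1),
  default  = mean(abs(resample_default) <= 1),
  uniform  = mean(abs(resample_uniform) <= 1),
  normal   = pnorm(1) - pnorm(-1))
```

All four shares should be close to each other, which illustrates that equal weights already reproduce the density of the original sample.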

Sampling Distribution from a data-set with one column

I want to create a sampling distribution for a mean. I have a variable x with at least ten thousand values. I want to take 500 samples (n = 10 each) and then show the distribution of the sample means in a histogram. I think it worked with the following, but can anyone check whether this does what I meant and tell me what the 2 within the apply function stands for?
x <- rnorm(10000, 7.5, 1.5)
draws = sample(x, size = 10 * 500, replace = TRUE)
draws = matrix(draws, 10)
drawmeans = apply(draws, 2, mean)
hist(drawmeans)
Any help would be sincerely appreciated!
You could do this using replicate if you wanted; it's one of lots of different ways. For a data frame df with a column Scores:
out = replicate(500, mean(sample(df$Scores,10)))
hist(out)
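On the 2 inside apply: the second argument of apply is MARGIN, where 1 means "over rows" and 2 means "over columns". In the question's code each column of the 10 x 500 matrix is one sample of size 10, so apply(draws, 2, mean) returns the 500 sample means. A minimal sketch (the seed and the names means_apply and means_col are assumptions) that confirms this against colMeans:

```r
set.seed(7)                                   # arbitrary seed
x <- rnorm(10000, 7.5, 1.5)

# 10 rows x 500 columns: each column is one sample of size 10
draws <- matrix(sample(x, size = 10 * 500, replace = TRUE), nrow = 10)

means_apply <- apply(draws, 2, mean)   # MARGIN = 2: one mean per column
means_col   <- colMeans(draws)         # same result, vectorised

dim(draws)
all.equal(means_apply, means_col)
```

Passing either vector of means to hist gives the histogram of the sampling distribution.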

how to create a random loss sample in r using if function

I am working currently on generating some random data for a school project.
I have created a variable in R using a binomial distribution to determine if an observation had a loss yes=1 or not=0.
Afterwards I am trying to generate the loss amount using a random distribution for all observations which already had a loss (=1).
As my loss amount is a percentage it can be anywhere between 0 and 1 (see "What Is The Intuition Behind Beta Distribution" on stats.stackexchange).
In a third step I am looking for an if statement, which combines my two variables.
Please find below my code (which is only working for the Loss_Y_N variable):
Loss_Y_N = rbinom(1000000,1,0.01)
Loss_Amount = dbeta(x, 10, 990, ncp = 0, log = FALSE)
ideally I can combine the two into something like
if(Loss_Y_N=1 then Loss_Amount=dbeta(...) #... is meant to be a random variable with mean=0.15 and should be 0<x=<1
else Loss_Amount=0)
Any input highly appreciated!
Create a vector for your loss proportion. Fill up the elements corresponding to losses with draws from the beta. Tweak the parameters for the beta until you get the desired result.
N <- 100000
loss_indicator <- rbinom(N, 1, 0.1)
loss_prop <- numeric(N)
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 10, 990)
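As one possible way to hit the mean of 0.15 mentioned in the question, the sketch below uses Beta(15, 85), whose mean is 15 / (15 + 85) = 0.15. The shape values, the seed and the names n_obs and n_losses are assumptions chosen for illustration; other shape pairs with the same ratio give the same mean with a different spread.

```r
set.seed(99)          # arbitrary seed
n_obs <- 100000       # number of observations

# 1 = loss, 0 = no loss, using the question's 1% loss probability
loss_indicator <- rbinom(n_obs, 1, 0.01)
n_losses <- sum(loss_indicator)

# start with zero loss everywhere, then fill in only the rows with a loss
loss_prop <- numeric(n_obs)
loss_prop[loss_indicator == 1] <- rbeta(n_losses, shape1 = 15, shape2 = 85)

# average loss proportion among observations that had a loss
mean(loss_prop[loss_indicator == 1])
```

This plays the role of the if/else in the question without a loop: rows without a loss keep their initial value of 0.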
