Why does my function only sometimes work? - r

I have the following function:
samp315<-function(n=30, desmean=86, distance=3.4995) {
x = seq(from = 0, to = 100, by = 0.1)
samp<-0
while (!between(mean(samp),desmean-distance,desmean+distance)) samp<-sample(x,n,replace=TRUE)
samp
}
percent <- samp315()
So pretty much I want to generate 30 numbers within 0-100 that have a mean of 86 +/- 3.4995. However, whenever I run the last line it either loads forever or, when I am lucky, generates a list of the desired results. Any idea how I could change the function to improve it?

As suggested by Parfait in the comments, you're using a randomization strategy that gives a low probability of providing the condition you're interested in. Did no other answer to this question help you out?
Some other possible strategies for you to try out.
n = 30
# Using truncated normal
library(truncnorm)
x = round(rtruncnorm(n, a = -0.0495, b = 100.0495, mean = 85, sd = 3.5*2), 1)
# Using beta
sig = 3
x = round(100*rbeta(n, (0.85)*sig, (1-0.85)*sig), 1)
The round(..., 1) is meant to align with your vector x. These methods would both produce very few values far from 85. It's a trade-off you have to consider: if you want a mean in 85 +/- 3.5, then you can't have too many values below 10, for example, so you have to lower the probability of such values being selected. With your function, once it completes, you'll probably find that values closer to 85 are more heavily represented.
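To tie this back to the original function, here is a minimal sketch that keeps the "redraw until the mean fits" loop but draws from the beta proposal above instead of a uniform one. It uses base R only. The name samp_beta and the max_tries safety cap are hypothetical additions, and the target mean of 86 is taken from the question rather than the 85 used above.

```r
# Hypothetical helper: rejection sampling with a beta proposal centred on the
# desired mean, so most draws already satisfy the condition.
samp_beta <- function(n = 30, desmean = 86, distance = 3.4995,
                      sig = 3, max_tries = 10000) {
  lo <- desmean - distance
  hi <- desmean + distance
  for (i in seq_len(max_tries)) {
    # Scale Beta draws to 0-100 and round to one decimal, like the grid x
    samp <- round(100 * rbeta(n, (desmean / 100) * sig,
                              (1 - desmean / 100) * sig), 1)
    m <- mean(samp)
    if (m >= lo && m <= hi) return(samp)
  }
  # max_tries is an assumed cap so the function can never hang
  stop("no sample met the condition within max_tries")
}

set.seed(1)
percent <- samp_beta()
mean(percent)
```

The cap turns an endless loop into an explicit error, which is easier to debug than a session that never returns.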

Related

Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what this distribution might look like. Below is what I'm trying to do; any help is much appreciated.
Generate a set of 1000 individuals with a variable x following the gaussian distribution, draw 50 random samples of 5 individuals from this distribution with replacement, calculate the means and standard deviation of x for each sample, create an interaction variable named y which is calculated by multiplying the mean and standard deviation of x for each sample, plot the distribution of y.
Beginners version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
  # draw 5 samples from population (without replacement)
  # I assume you want to replace for each turn of taking 5
  # If you want to replace between drawing each of the 5,
  # I think it should be obvious how to adapt the following code
  smpl <- sample(stat_pop, size = 5, replace = FALSE)
  # the data.frame currently has two columns. In each row i, we put mean and sd
  samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
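A quick way to convince yourself of this is to fix the seed and compare the three calls. This is only an illustrative check; the variable names are made up for the example.

```r
# Same seed before each call, so any difference would come from the arguments
set.seed(42); a <- rnorm(5, mean = 0, sd = 1)  # all arguments named
set.seed(42); b <- rnorm(5, 0, 1)              # purely positional
set.seed(42); d <- rnorm(5)                    # defaults mean = 0, sd = 1

# Named arguments can also be given in any order
set.seed(42); e <- rnorm(sd = 1, n = 5, mean = 0)

identical(a, b) && identical(b, d) && identical(a, e)
```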
Somewhat more efficient version
In R, explicit loops that fill an object one row at a time are often slower than vectorized functions, so they are worth avoiding where performance matters. In the case of your question, it does not make any noticeable difference. However, for large data sets, performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
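Since the question mentions drawing with replacement, here is a variant sketch that sets replace = TRUE and computes the mean and sd inside replicate, so the raw draws never need to be stored. The seed and the name stats are assumptions for the example.

```r
set.seed(123)  # assumed seed, only for reproducibility
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N <- 50
n <- 5

# Each replication returns a named vector, so replicate() yields a 2 x N matrix
stats <- replicate(N, {
  smpl <- sample(stat_pop, n, replace = TRUE)
  c(mean = mean(smpl), sd = sd(smpl))
})

# Transpose so that each row is one sample, then build the interaction
samples <- data.frame(t(stats))
samples$y <- samples$mean * samples$sd
hist(samples$y)
```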
General note
Usually, you should do some research on the problem before posting here. Then you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

Creating a binary variable with probability in R

I'm trying to make a variable, Var, that takes the value 0 60% of the time, and 1 otherwise, with 50,000 observations.
For a normally distributed variable, I remember doing the following to define n:
Var <- rnorm(50000, 0, 1)
Is there a way I could combine an ifelse command with the above to specify the number of n as well as the probability of Var being 0?
I would use rbinom like this:
n_ <- 50000
p_ <- 0.4 # it's probability of 1s
Var <- rbinom(n=n_, size=1, prob=p_)
By using variables, you can change the size and/or probability just by changing those variables. Hope that's what you are looking for.
If by 60% you mean a probability equal to 0.6 (rather than an empirical frequency), then
Var <- sample(0:1, 50000, prob = c(6, 4), replace = TRUE)
gives the desired sequence of independent Bernoulli draws, each equal to 0 with probability 0.6 and 1 with probability 0.4.
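To make the probability versus empirical frequency distinction concrete, here is a small sketch. The names var_random, var_exact and n_zero are hypothetical; the second approach (shuffling a vector with a fixed count of zeros) is an alternative not given above, for the case where exactly 60% zeros are wanted.

```r
set.seed(1)  # assumed seed, only for reproducibility
n_ <- 50000

# Probability 0.6 of a 0: the observed share is close to 0.6, but random
var_random <- sample(0:1, n_, prob = c(0.6, 0.4), replace = TRUE)
mean(var_random == 0)

# Empirical frequency of exactly 60%: fix the counts, then shuffle the order
n_zero <- round(0.6 * n_)
var_exact <- sample(rep(c(0, 1), times = c(n_zero, n_ - n_zero)))
mean(var_exact == 0)
```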
I'm picking nits here, but it actually isn't completely clear exactly what you want.
Do you want to simulate a sample of 50000 from the distribution you describe?
Or, do you want 50000 replications of simulating an observation from the distribution you describe?
These are different things that, in my opinion, should be approached differently.
To simulate a sample of size 50000 from that distribution you would use:
sample(c(0,1), size = 50000, prob = c(0.6, 0.4), replace = TRUE)
To replicate 50000 simulations of sampling from the distribution you describe I would recommend:
replicate(50000, sample(c(0,1), size = 1, prob = c(0.6, 0.4)))
This might seem silly since these two lines of code produce exactly the same thing, in this case.
But suppose your goal was to investigate properties of samples of size 50000? Then you would run a bunch (say, 1000) of replications of that first line of code above, wrapped inside replicate:
replicate(1000, sample(c(0,1), size = 50000, prob = c(0.6, 0.4), replace = TRUE))
I hope I haven't been too pedantic about this. Having seen simulations go awry it has become my belief that one should keep separate the thing being simulated from the number of simulations you decide to do. The former is fundamental to your problem, while the latter only affects the accuracy of the simulation study and how long it takes.
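One way to keep the two apart in code is to put the thing being simulated in its own function and pass the number of simulations separately. This is a sketch only: draw_sample, sample_size and n_sims are hypothetical names, and 200 replications is an arbitrary choice to keep it fast.

```r
# The thing being simulated: one sample from the 0/1 distribution
draw_sample <- function(size, p_zero = 0.6) {
  sample(c(0, 1), size = size, prob = c(p_zero, 1 - p_zero), replace = TRUE)
}

set.seed(2024)        # assumed seed, only for reproducibility
sample_size <- 50000  # fundamental to the problem
n_sims <- 200         # only affects accuracy and run time

# Share of zeros in each simulated sample
prop_zero <- replicate(n_sims, mean(draw_sample(sample_size) == 0))
summary(prop_zero)
```

Changing n_sims here leaves the definition of the sample untouched, which is the separation argued for above.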

Sampling Distribution from a data-set with one column

I want to create a sampling distribution for a mean. I have a variable x with at least ten thousand values. I want to take 500 samples (n=10) and then show the distribution of the sample means in a histogram. I think it worked with the following, but can anyone check whether this does what I meant and tell me what the 2 within the apply function stands for?
x <- rnorm(10000, 7.5, 1.5)
draws = sample(x, size = 10 * 500, replace = TRUE)
draws = matrix(draws, 10)
drawmeans = apply(draws, 2, mean)
hist(drawmeans)
Any help would be sincerely appreciated!
You could do this using replicate if you wanted; it is one of lots of different ways. For a data frame df:
out = replicate(500, mean(sample(df$Scores,10)))
hist(out)
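On the 2 in apply: the second argument of apply is MARGIN, where 1 means "apply over rows" and 2 means "apply over columns". Because matrix(draws, 10) has 10 rows, each of the 500 columns holds one sample of size 10, so apply(draws, 2, mean) returns one mean per sample. A small check, with an assumed seed:

```r
set.seed(10)  # assumed seed, only for reproducibility
x <- rnorm(10000, 7.5, 1.5)

# 10 rows x 500 columns: each column is one sample of size 10
draws <- matrix(sample(x, size = 10 * 500, replace = TRUE), nrow = 10)
dim(draws)

# MARGIN = 2 applies mean() to every column
drawmeans <- apply(draws, 2, mean)

# colMeans() is the specialised equivalent
all.equal(drawmeans, colMeans(draws))
```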

how to create a random loss sample in r using if function

I am working currently on generating some random data for a school project.
I have created a variable in R using a binomial distribution to determine if an observation had a loss yes=1 or not=0.
Afterwards I am trying to generate the loss amount using a random distribution for all observations which already had a loss (=1).
As my loss amount is a percentage, it can be anywhere between 0 and 1 (0 < x <= 1).
What Is The Intuition Behind Beta Distribution # stats.stackexchange
In a third step I am looking for an if statement, which combines my two variables.
Please find below my code (which is only working for the Loss_Y_N variable):
Loss_Y_N = rbinom(1000000,1,0.01)
Loss_Amount = dbeta(x, 10, 990, ncp = 0, log = FALSE)
ideally I can combine the two into something like
if(Loss_Y_N=1 then Loss_Amount=dbeta(...) #... is meant to be a random variable with mean=0.15 and should be 0<x=<1
else Loss_Amount=0)
Any input highly appreciated!
Create a vector for your loss proportion. Fill up the elements corresponding to losses with draws from the beta. Tweak the parameters for the beta until you get the desired result.
N <- 100000
loss_indicator <- rbinom(N, 1, 0.1)
loss_prop <- numeric(N)
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 10, 990)
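As a starting point for the tweaking, a beta distribution with parameters shape1 and shape2 has mean shape1 / (shape1 + shape2), so a mean of 0.15 can be obtained by fixing the sum and splitting it 15/85. The sketch below does that and uses ifelse as the vectorized form of the if statement asked for. The concentration value of 100 is an arbitrary assumption (larger values give a tighter spread), and the loss probability 0.01 is taken from the question.

```r
set.seed(99)  # assumed seed, only for reproducibility
N <- 100000
loss_indicator <- rbinom(N, 1, 0.01)

target_mean <- 0.15
concentration <- 100  # hypothetical: shape1 + shape2, controls the spread
shape1 <- target_mean * concentration
shape2 <- (1 - target_mean) * concentration

# Vectorized if/else: a beta draw where a loss occurred, 0 otherwise
loss_amount <- ifelse(loss_indicator == 1, rbeta(N, shape1, shape2), 0)

mean(loss_amount[loss_indicator == 1])
```

Note that ifelse draws N beta values and discards most of them; the indexing approach above avoids that and is preferable for very large N.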

R, rounding, ceiling and floors

Suppose that one has a bunch of data returned from pnorm(), such that you've got numbers between .0003ish and .9999ish.
numbers <- round(rnorm(n = 10000, mean = 100, sd = 15))
percentiles <- pnorm(numbers, mean = 100, sd = 15)*100
And then further suppose that one is interested in rounding the percentiles such that .0003 or whatevs will come out to 1 (so ceiling()), but 99.999 will come out to 99 (so floor()).
I guess what I'm looking for is round() that somehow brilliantly knows to reverse it in the extreme cases, but as far as I know, no such thing exists. Am I going to have to ugly it up with an if statement? Is there a better method of handling such a thing?
You could use round and force things into 1 or 99 at the extremities using pmin and pmax:
pmax(1, pmin(99, round(percentiles)))
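A tiny worked example on a few hand-picked values (chosen only for illustration) shows what happens at both ends:

```r
percentiles <- c(0.0003, 0.4, 50.2, 99.6, 99.999)

round(percentiles)
# 0 0 50 100 100

# pmin caps at 99, pmax then lifts anything below 1 up to 1
clamped <- pmax(1, pmin(99, round(percentiles)))
clamped
# 1 1 50 99 99
```

Note that this also pushes values such as 0.4 up to 1, which matches the ceiling behaviour asked for at the low end.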
