Distribution of mean*standard deviation of sample from gaussian - r

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I havent seen before. The variable essentially is an interaction between the mean and standard deviation of a sample drawn from a gaussian, and im trying to see what this distribution might look like. Below is what im trying to do, any help is much appreciated.
Generate a set of 1000 individuals with a variable x following the gaussian distribution, draw 50 random samples of 5 individuals from this distribution with replacement, calculate the means and standard deviation of x for each sample, create an interaction variable named y which is calculated by multiplying the mean and standard deviation of x for each sample, plot the distribution of y.

Beginners version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
# draw 5 samples from population (without replacement)
# I assume you want to replace for each turn of taking 5
# If you want to replace between drawing each of the 5,
# I think it should be obvious how to adapt the following code
smpl <- sample(stat_pop, size = 5, replace = FALSE)
# the data.frame currently has two columns. In each row i, we put mean and sd
samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
Somewhat more efficient version
In R, loops are very inefficient and, thus, ought to be avoided. In case of your question, it does not make any noticeable difference. However, for large data sets, performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
General note
Usually, you should do some research on the problem before posting here. Then, you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" in order to find out about the function rnorm().
Run ?rnorm to get help on the function in RStudio.

Related

R function to find difference in mean greater than or equal to a specific number

I have just started my basic statistic course using R and we're studying using R for paired t-tests. I have come across questions where we're given two sets of data and we're asked to find whether the difference in mean is equal to 0 or greater than 0 so on so forth. The function we use for two samples x and y with an unknown variance is similar to the one below;
t.test(x, y, var.equal=TRUE, alternative="greater")
My question is, how would we to do this if we wanted to test the difference in mean is more than or equal to a specified number against the alternative that its less than a specific number and not 0.
For example, say we're given two datas for before and after weights of 10 people. How do we test that the mean difference in weight is more than or equal to say 3kg against the alternative where the mean difference in weight is less than 3kg. Is there a way to do this? Would really appreciate any guidance on this matter.
It might be worthwhile posting on https://stats.stackexchange.com/ as well if you're in need of more theoretical proof. Is it ok to add/subtract the 3kg from either x or y and then use the t-test to check for similarity? I think this would tell you at least which outcome is more likely, if that's the end goal. It would be good to get feedback on this
# number of obs, and rnorm dist for simulating
N <- 10
mu <- 70
sd <- 10
set.seed(1)
x <- round(rnorm(N, mu, sd), 1)
# three outcomes
# (1) no change
y_same <- x + round(rnorm(N, 0, 5), 1)
# (2) average increase of 3
y_imp <- x + rnorm(N, 3, 5)
# (3) average decrease of 3
y_dec <- x + rnorm(N, -3, 5)
# say y_imp is true
y_act <- y_imp
# can we test whether we're closer to the output by altering
# the original data? or conversely, altering y_imp
t_inc <- t.test(x+3, y_act, var.equal=TRUE, alternative="two.sided")
t_dec <- t.test(x-3, y_act, var.equal=TRUE, alternative="two.sided")
t_inc$p.value
[1] 0.8279801
t_dec$p.value
[1] 0.0956033
# one with the highest p.value has the closest distribution, so
# +3 kg more likely than -3kg
You can set mu=3 to change the null hypothesis from 0 to 3 assuming your x variables are in the units you describe above.
t.test(x, y, mu=3, alternative="greater", paired=TRUE)
More (general) information on Stack Exchange [here].(https://stats.stackexchange.com/questions/206316/can-a-paired-or-two-group-t-test-test-if-the-difference-between-two-means-is-l/206317#206317)

How to vary multiple parameters with lapply in R

In an attempt to avoid nesting for loops 6-7 times, I am trying to use lapply to find the proportion of randomly drawn values (that are combined in a certain way) that exceed some arbitrary thresholds values. The problem is that I have several parameters that each vary a certain number of ways, and these, in turn, will affect how the values are combined. The goal is to use the results in an ANOVA to see how varying these parameters contributes to reaching those thresholds. However, I don't understand how to do this. I have a feeling that anonymous functions could be useful, but I don't understand how they work with more than 1 parameter.
I tried to simplify the code as much as possible. But again, there are just so many parameters that must be included.
trials = 10
data_means = c(0,1,2,3)
prior_samples = c(2, 8, 32)
data_SD = c(0.5, 1, 2)
thresholds = c(10, 30, 80)
The idea is that there are two distributions, data and prior, which I draw values from. I always draw one from data, but I draw a sample (see prior_samples) of values from the prior distribution. There are four different values that determine the mean of the data distribution (see data_means), but the values are drawn the same number of times (determined by trials) from each of these four "versions" of the data distribution. These are then put into nested lists:
set.seed(123)
data_list = list()
for (nMean in data_means){ #the data values
for (nTrial in 1:trials){
data_list[[paste(nMean, sep="_")]][[paste(nTrial, sep="_")]] = rnorm(1, nMean, 1)
}
}
prior_list = list()
for (nSamples in prior_samples){ #the prior values
for (nTrial in 1:trials){
prior_list[[paste(nSamples, sep="_")]][[paste(nTrial, sep="_")]] = rnorm(nSamples, 0, 1)
}
}
Then I create another list for the prior values, because I want to calculate the means and standard deviations (SD) of the samples of prior values. I include normal SD, as well as SD/2 and SD*2:
prior_SD = list("mean"=0, "standard_devations"=list("SD/2"=0, "SD"=0, "SD*2"=0))
prior_mean_SD = rep(list(prior_SD), trials)
prior_nested_list = list("2"=prior_mean_SD, "8"=prior_mean_SD, "32"=prior_mean_SD)
for (nSamples in 1:length(prior_samples)){
for (nTrial in 1:trials){
prior_nested_list[[nSamples]][[nTrial]][["mean"]]=mean(prior_list[[nSamples]][[nTrial]])
prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD/2"]]=sum(sd(prior_list[[nSamples]][[nTrial]])/2)
prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD"]]=sd(prior_list[[nSamples]][[nTrial]])
prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD*2"]]=sum(sd(prior_list[[nSamples]][[nTrial]])*2)
}
}
Then I combinde the values from the data list and the last list, using list.zip from rlist:
library(rlist)
dataMean0 = list.zip(dMean0=data_list[["0"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean1 = list.zip(dMean1=data_list[["1"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean2 = list.zip(dMean2=data_list[["2"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean3 = list.zip(dMean3=data_list[["3"]], pSample2=prior_nested_list[["2"]],
pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
all_values = list(mean_difference0=dataMean0, mean_difference1=dataMean1,
mean_difference2=dataMean2, mean_difference3=dataMean3)
Now comes the tricky part. I combine the data values and the prior values in all_values by using this custom function for the Kullback-Leibler divergence. As you can see, there are 6 parameters that varies:
mean_diff refers to the means of the data distribution (data_means). It is named mean_diff beacsue it refers to the difference in mean between the prior distribution (which is always 0), and the data distribution (which can be 0, 1, 2 or 3).
trial refers to trials,
pSample refers to the numbers of samples drawn from the prior distribution (prior_samples)
p_SD refers to the calculations of the SD based on the prior samples (normal SD, SD/2, SD*2)
data_SD refers to the SD of the data distribution, determined by data_SD
threshold refers to thresholds
The Kullback-Leibler divergence function:
kld = function(mean_diff, trial, pSample, p_SD, data_SD, threshold){
prior_mean = all_values[[mean_diff]][[trial]][[pSample]][["mean"]]
data_mean = all_values[[mean_diff]][[trial]][["mean"]]
prior_SD = all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]
posterior_SD = sqrt(1/(1/
((all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]
*all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]))
+1/(data_SD*data_SD)))
length(
which(
(log(prior_SD/posterior_SD) +
(((posterior_SD*posterior_SD) +
(prior_mean -
(((data_SD*data_SD))/
((data_SD*data_SD)+(prior_SD*prior_SD))*prior_mean +
((prior_SD*prior_SD))/
((data_SD*data_SD)+(prior_SD*prior_SD))*data_mean))^2)
/(2*(prior_SD*prior_SD)))-0.5
+
log(posterior_SD/prior_SD) +
((((prior_SD*prior_SD)) +
(prior_mean -
(((data_SD*data_SD))/
((data_SD*data_SD)+(prior_SD*prior_SD))*prior_mean +
((prior_SD*prior_SD))/
((data_SD*data_SD)+(prior_SD*prior_SD))*data_mean))^2)
/(2*(posterior_SD*posterior_SD)))-0.5
)>=threshold))/trials
}
So the question is how can one use lapply on the list with all the values (all_values) while using all the different combinations of the six parameters that are included? The data I want to end up with is the proportions of values (percentage of trials) that exceed the thresholds in all the parameter combinations.
I can't find the info I need, so any tips would be appreciated.

Sampling Distribution from a data-set with one column

I want to create a sampling distribution for a mean. I have a variable x with at least ten thousand values. I want take 500 samples (n=10) and then show the distribution of the sample means in a histogram. I think it worked with the following, but can anyone check if this is what i meant and tell me what the 2 within the apply function stands for?
x <- rnorm(10000, 7.5, 1.5)
draws = sample(x, size = 10 * 500, replace = TRUE)
draws = matrix(draws, 10)
drawmeans = apply(draws, 2, mean)
hist(drawmeans)
would be sincerely appreciated!
You could do this using replicate if you wanted. One of lots of different ways. For data frame df
out = replicate(500, mean(sample(df$Scores,10)))
hist(out)

how to create a random loss sample in r using if function

I am working currently on generating some random data for a school project.
I have created a variable in R using a binomial distribution to determine if an observation had a loss yes=1 or not=0.
Afterwards I am trying to generate the loss amount using a random distribution for all observations which already had a loss (=1).
As my loss amount is a percentage it can be anywhere between 0
What Is The Intuition Behind Beta Distribution # stats.stackexchange
In a third step I am looking for an if statement, which combines my two variables.
Please find below my code (which is only working for the Loss_Y_N variable):
Loss_Y_N = rbinom(1000000,1,0.01)
Loss_Amount = dbeta(x, 10, 990, ncp = 0, log = FALSE)
ideally I can combine the two into something like
if(Loss_Y_N=1 then Loss_Amount=dbeta(...) #... is meant to be a random variable with mean=0.15 and should be 0<x=<1
else Loss_Amount=0)
Any input highly appreciated!
Create a vector for your loss proportion. Fill up the elements corresponding to losses with draws from the beta. Tweak the parameters for the beta until you get the desired result.
N <- 100000
loss_indicator <- rbinom(N, 1, 0.1)
loss_prop <- numeric(N)
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 10, 990)

log- and z-transforming my data in R

I'm preparing my data for a PCA, for which I need to standardize it. I've been following someone else's code in vegan but am not getting a mean of zero and SD of 1, as I should be.
I'm using a data set called musci which has 13 variables, three of which are labels to identify my data.
log.musci<-log(musci[,4:13],10)
stand.musci<-decostand(log.musci,method="standardize",MARGIN=2)
When I then check for mean=0 and SD=1...
colMeans(stand.musci)
sapply(stand.musci,sd)
I get mean values ranging from -8.9 to 3.8 and SD values are just listed as NA (for every data point in my data set rather than for each variable). If I leave out the last variable in my standardization, i.e.
log.musci<-log(musci[,4:12],10)
the means don't change, but the SDs now all have a value of 1.
Any ideas of where I've gone wrong?
Cheers!
You data is likely a matrix.
## Sample data
dat <- as.matrix(data.frame(a=rnorm(100, 10, 4), b=rexp(100, 0.4)))
So, either convert to a data.frame and use sapply to operate on columns
dat <- data.frame(dat)
scaled <- sapply(dat, scale)
colMeans(scaled)
# a b
# -2.307095e-16 2.164935e-17
apply(scaled, 2, sd)
# a b
# 1 1
or use apply to do columnwise operations
scaled <- apply(dat, 2, scale)
A z-transformation is quite easy to do manually.
See below using a random string of data.
data <- c(1,2,3,4,5,6,7,8,9,10)
data
mean(data)
sd(data)
z <- ((data - mean(data))/(sd(data)))
z
mean(z) == 0
sd(z) == 1
The logarithm transformation (assuming you mean a natural logarithm) is done using the log() function.
log(data)
Hope this helps!

Resources