How to get the intervals for ntile() - r

I was trying to figure out if there is a way to get the intervals used for when ntile() is used.
I have a sample that I want to use as a basis for getting the percentile values of a larger sample, and I was hoping to find a way to get the value of the intervals for when I use ntile().
Any enlightenment on this would be appreciated.

I really wanted to post this as a comment, but I still can't comment.
How about using quantile to generate the interval, like this:
# create fake data; 100 samples randomly picked from 1 to 500
fakeData <- runif(100, 1, 500)
# create percentile values; tweak the probs to specify the quantiles that you want
# (101 points give the 0%, 1%, ..., 100% break points)
x <- quantile(fakeData, probs = seq(0, 1, length.out = 101))
Then you can apply those break points to the larger data set, e.g., using cut. This might give results similar to dplyr's ntile, though not necessarily identical ones.
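A minimal sketch of that idea, assuming four bins and made-up data (the names small, large, breaks and bins are only for illustration). The outer break points are widened to -Inf and Inf so that values of the larger sample that fall outside the range of the small sample still get a bin:

```r
set.seed(1)
small <- runif(100, 1, 500)     # reference sample (made-up data)
large <- runif(10000, 1, 500)   # larger sample to be binned (made-up data)

# 4 bins need 5 break points, taken from the small sample
breaks <- quantile(small, probs = seq(0, 1, length.out = 5))

# widen the outer breaks so no value of the larger sample is left without a bin
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf

# labels = FALSE returns the bin number (1 to 4) instead of a factor label
bins <- cut(large, breaks = breaks, labels = FALSE)
table(bins)
```

Note that, as far as I know, ntile assigns buckets by rank so that they have (nearly) equal size within the data it is given, so bins built from the small sample's quantiles will generally not be equally filled in the larger sample.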


Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what its distribution might look like. Below is what I'm trying to do; any help is much appreciated.
Generate a set of 1000 individuals with a variable x following the Gaussian distribution, draw 50 random samples of 5 individuals from this set with replacement, calculate the mean and standard deviation of x for each sample, create an interaction variable named y by multiplying the mean and standard deviation of x for each sample, and plot the distribution of y.
Beginners version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
  # draw 5 samples from population (without replacement)
  # I assume you want to replace for each turn of taking 5
  # If you want to replace between drawing each of the 5,
  # I think it should be obvious how to adapt the following code
  smpl <- sample(stat_pop, size = 5, replace = FALSE)
  # the data.frame currently has two columns. In each row i, we put mean and sd
  samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
Somewhat more efficient version
In R, explicit loops are often slower than vectorized functions, so it is usually worth avoiding them where a vectorized alternative exists. In the case of your question, it does not make any noticeable difference. However, for large data sets, performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
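If you only need y itself, a more compact sketch is to let replicate compute the product directly. Here I assume that "with replacement" means the same individual may be drawn more than once within a sample of 5 (hence replace = TRUE); set.seed is only added to make the run reproducible:

```r
set.seed(42)
stat_pop <- rnorm(1000, mean = 0, sd = 1)

# each repetition draws 5 individuals and returns mean * sd for that sample
y <- replicate(50, {
  smpl <- sample(stat_pop, size = 5, replace = TRUE)
  mean(smpl) * sd(smpl)
})

hist(y)
```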
General note
Usually, you should do some research on the problem before posting here. Then, you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" in order to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

R boxplot with already computed mean, confidence intervals and min max

I am trying to generate a boxplot in R using already computed confidence intervals, minima and maxima. For times 1, 2, 3, 4, 5 (x-axis), I have MN, an array of 5 elements, each describing the mean at one time point. I also have CI1, CI2, MINIM, and MAXM, each an array of 5 elements, one per time step, representing the upper CI, lower CI, minimum and maximum.
I want to generate 5 box plots, one at each time step.
I have tried the usual boxplot function, but I could not get it to work with already computed CIs, minima and maxima.
It would be great if the method works with the normal plot functions, though ggplot would be fine too.
Since you have not posted data, I will use the builtin iris dataset, keeping the first 4 columns.
data(iris)
iris2 <- iris[-5]
The function boxplot computes the statistics it uses and then calls bxp to do the drawing, passing it those computed values.
If you want a different set of statistics you will have to compute them and pass them to bxp manually.
I am assuming that by CI you mean normal 95% confidence intervals. For that you need to compute the standard errors and the mean values first.
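# standard error of the mean = sd / sqrt(n)
s <- apply(iris2, 2, sd) / sqrt(nrow(iris2))
mn <- colMeans(iris2)
# qnorm(0.975) is the quantile for a two-sided 95% normal interval
ci1 <- mn - qnorm(0.975)*s
ci2 <- mn + qnorm(0.975)*s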
minm <- apply(iris2, 2, min)
maxm <- apply(iris2, 2, max)
Now have boxplot create the data structure used by bxp: a list whose stats component is a matrix with one column per box.
bp <- boxplot(iris2, plot = FALSE)
And fill the matrix with the values computed earlier.
bp$stats <- matrix(c(
  minm,
  ci1,
  mn,
  ci2,
  maxm
), nrow = 5, byrow = TRUE)
Finally, plot it.
bxp(bp)
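If you already have the five vectors from the question, a sketch that skips boxplot altogether is to build the list for bxp by hand. The numbers below are invented placeholders for MN, CI1 (upper), CI2 (lower), MINIM and MAXM; the rows of stats must run from bottom to top (minimum, lower CI, mean, upper CI, maximum):

```r
# hypothetical precomputed values for time points 1 to 5
MN    <- c(10, 12, 11, 14, 13)
CI1   <- c(12, 14, 13, 16, 15)   # upper CI
CI2   <- c( 8, 10,  9, 12, 11)   # lower CI
MINIM <- c( 5,  7,  6,  9,  8)
MAXM  <- c(15, 18, 16, 19, 17)

# one column per box; rows ordered from lower whisker to upper whisker
z <- list(
  stats = rbind(MINIM, CI2, MN, CI1, MAXM),
  n     = rep(1, 5),             # only used for variable-width boxes
  conf  = rbind(CI2, CI1),       # only used when notch = TRUE
  out   = numeric(0),            # no outlier points
  group = numeric(0),
  names = as.character(1:5)      # labels on the x-axis
)

bxp(z, xlab = "Time", ylab = "Value")
```

Keep in mind that the line inside each box is then the mean and the box edges are the CI limits, not the usual median and quartiles, so the plot should be labelled accordingly.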

Sampling Distribution from a data-set with one column

I want to create a sampling distribution for a mean. I have a variable x with at least ten thousand values. I want to take 500 samples (n = 10) and then show the distribution of the sample means in a histogram. I think it worked with the following, but can anyone check if this does what I meant and tell me what the 2 within the apply function stands for?
x <- rnorm(10000, 7.5, 1.5)
draws = sample(x, size = 10 * 500, replace = TRUE)
draws = matrix(draws, 10)
drawmeans = apply(draws, 2, mean)
hist(drawmeans)
Any help would be sincerely appreciated!
You could do this using replicate if you wanted; it is one of many possible ways. For a data frame df with a column Scores:
out = replicate(500, mean(sample(df$Scores,10)))
hist(out)
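As for the 2 in apply: it is the MARGIN argument. 1 means "apply the function to each row", 2 means "apply it to each column". Since matrix(draws, 10) puts each sample of 10 values into one column, apply(draws, 2, mean) gives one mean per sample. A small illustration (the matrix m is just a toy example):

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns
m

row_means <- apply(m, 1, mean)   # MARGIN = 1: one value per row
col_means <- apply(m, 2, mean)   # MARGIN = 2: one value per column

row_means   # 3 4
col_means   # 1.5 3.5 5.5
```

For means specifically, colMeans(m) gives the same result as apply(m, 2, mean).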

Confidence Interval of Sample Means using R

My dataframe contains sampling means of 500 samples of size 100 each. Below is the snapshot. I need to calculate the confidence interval at 90/95/99 for mean.
head(Means_df)
  Means
1 14997
2 11655
3 12471
4 12527
5 13810
6 13099
I am using the below code but only getting the confidence interval for one row only. Can anyone help me with the code?
tint <- matrix(NA, nrow = dim(Means_df)[2], ncol = 2)
for (i in 1:dim(Means_df)[2]) {
  temp <- t.test(Means_df[, i], conf.level = 0.9)
  tint[i, ] <- temp$conf.int
}
colnames(tint) <- c("lcl", "ucl")
For any single mean, e.g. 14997, you cannot compute a 95% CI without knowing the variance or the standard deviation of the data the mean was computed from. If you have access to the standard deviation of each sample, you can then compute the standard error of the mean and, with that, easily the 95% CI. Apparently, you lack the information needed for the task.
Means_df is a data frame with 500 rows and 1 column. Therefore
dim(Means_df)[2]
will give the value 1.
Which is why you only get one value.
Solve the problem by using dim(Means_df)[1] or even better nrow(Means_df) instead of dim(Means_df)[2].
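Note that looping over rows also means the indexing inside the loop has to change, and t.test cannot work on a single value. If what you are after is a confidence interval for the overall mean of the 500 sample means (this is my assumption about the goal), a sketch is to run t.test on the whole column once per confidence level. The data below are simulated stand-ins, and conf_levels is just an illustrative name:

```r
set.seed(123)
# simulated stand-in for the real Means_df (500 sample means)
Means_df <- data.frame(Means = rnorm(500, mean = 13000, sd = 1200))

conf_levels <- c(0.90, 0.95, 0.99)

# one t.test per confidence level, each on the whole column
tint <- t(sapply(conf_levels, function(lv) {
  t.test(Means_df$Means, conf.level = lv)$conf.int
}))
dimnames(tint) <- list(paste0(conf_levels * 100, "%"), c("lcl", "ucl"))

tint
```

Each row of tint then holds the lower and upper limit for one confidence level, and the intervals get wider as the level increases.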

how to create a random loss sample in r using if function

I am working currently on generating some random data for a school project.
I have created a variable in R using a binomial distribution to determine if an observation had a loss yes=1 or not=0.
Afterwards I am trying to generate the loss amount using a random distribution for all observations which already had a loss (=1).
As my loss amount is a percentage, it can be anywhere between 0 and 1 (0 < x <= 1).
Reference: "What Is The Intuition Behind Beta Distribution" (stats.stackexchange)
In a third step I am looking for an if statement, which combines my two variables.
Please find below my code (which is only working for the Loss_Y_N variable):
Loss_Y_N = rbinom(1000000,1,0.01)
Loss_Amount = dbeta(x, 10, 990, ncp = 0, log = FALSE)
ideally I can combine the two into something like
if(Loss_Y_N=1 then Loss_Amount=dbeta(...) #... is meant to be a random variable with mean=0.15 and should be 0<x=<1
else Loss_Amount=0)
Any input highly appreciated!
Create a vector for your loss proportion. Fill up the elements corresponding to losses with draws from the beta. Tweak the parameters for the beta until you get the desired result.
N <- 100000
loss_indicator <- rbinom(N, 1, 0.1)
loss_prop <- numeric(N)
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 10, 990)
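On the tweaking: a Beta(a, b) distribution has mean a / (a + b), so for the mean of 0.15 mentioned in the question you could pick, for example, a = 15 and b = 85. The sketch below does that via a hypothetical concentration value (a + b = 100, which controls how tightly the draws cluster around the mean; it is my choice, not something from the question) and uses the vectorized ifelse as the "if statement":

```r
set.seed(1)
N <- 100000
loss_indicator <- rbinom(N, 1, 0.01)   # 1 = loss, 0 = no loss

target_mean   <- 0.15
concentration <- 100                   # hypothetical choice: a + b
shape1 <- target_mean * concentration          # a = 15
shape2 <- (1 - target_mean) * concentration    # b = 85

# loss amount is a beta draw where a loss occurred, 0 otherwise
loss_amount <- ifelse(loss_indicator == 1, rbeta(N, shape1, shape2), 0)

# should be close to 0.15
mean(loss_amount[loss_indicator == 1])
```

This draws N beta values and throws most of them away, which is simple but wasteful; the indexing approach above only draws as many values as there are losses.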
