Consider a random set of numbers that are normally distributed:
x <- rnorm(n=1000, mean=10)
We'd like to know the mean and the standard error on the mean so we do the following:
se <- function(x) { sd(x)/length(x) }
mean(x) # something near 10.0 units
se(x) # something near 0.001 units
Great!
However, let's assume we don't necessarily know that our original distribution follows a normal distribution. We log-transform the data and perform the same standard error calculation.
z <- log(x, base=10)
mean(z) # something near 1 log units
se(z) # something near 0.000043 log units
Cool, but now we need to back-transform to get our answer in units NOT log units.
10^mean(z) # something near 10.0 units
10^se(z) # something near 1.00 units
My question: Why, for a normal distribution, does the standard error differ depending on whether it was calculated from the distribution itself or if it was transformed, calculated, and back-transformed? In this example, it is interesting that the difference is almost exactly 3 orders of magnitude. Note: the means came out the same regardless of the transformation.
EDIT #1: Ultimately, I am interested in calculating a mean and confidence intervals for non-normally distributed data, so if you can give some guidance on how to calculate 95% CI's on transformed data including how to back-transform to their native units, I would appreciate it!
END EDIT #1
EDIT #2: I tried using the quantile function to get the 95% confidence intervals:
quantile(x, probs = c(0.05, 0.95)) # around [8.3, 11.6]
10^quantile(z, probs = c(0.05, 0.95)) # around [8.3, 11.6]
So, that converged on the same answer, which is good. However, using this method doesn't provide the exact same interval using non-normal data with "small" sample sizes:
t <- rlnorm(10)
mean(t) # around 1.46 units
10^mean(log(t, base=10)) # around 0.92 units
quantile(t, probs = c(0.05, 0.95)) # around [0.211, 4.79]
10^(quantile(log(t, base=10), probs = c(0.05, 0.95))) # around [0.209, 4.28]
Which method would be considered "more correct"? I assume one would pick the most conservative estimate?
As an example, would you report this result for the non-normal data (t) as having a mean of 0.92 units with a 95% confidence interval of [0.211, 4.79]?
END EDIT #2
Thanks for your time!
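On the back-transformation asked about in EDIT #1, here is a minimal sketch of one common approach, not a definitive recipe. It assumes the data are strictly positive and that an interval for the geometric mean is acceptable. The helper name sem is hypothetical; note that it divides by sqrt(length(v)), the usual definition of the standard error of the mean, whereas se() above divides by length(x).

```r
set.seed(42)
x <- rnorm(n = 1000, mean = 10)
z <- log10(x)

# Hypothetical helper: standard error of the mean (sd over sqrt(n), not n)
sem <- function(v) sd(v) / sqrt(length(v))

# 95% t-interval for the mean on the log scale
n <- length(z)
ci_log <- mean(z) + c(-1, 1) * qt(0.975, df = n - 1) * sem(z)

# Back-transform the interval endpoints, not the standard error itself
ci_back <- 10^ci_log   # interval for the geometric mean of x, in native units
gm <- 10^mean(z)       # geometric mean of x
```

Because 10^(a + b) = 10^a * 10^b, a standard error on the log scale back-transforms to a multiplicative factor rather than an additive quantity in native units, which is why 10^se(z) lands near 1.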
When calculating the 95% CI for the sample mean in R, I get different results when using the CI function CI(mydata) and when I use qnorm(.975, mean = mean(mydata), sd = sd(mydata)) for the upper bound and qnorm(.025, mean = mean(mydata), sd = sd(mydata)) for the lower bound.
Why would there be a difference? The qnorm function provides results that make sense when looking at the plot of the probability distribution for my data.
Here is the code that I am using to generate 500 normal random variables, calculate the mean(xbar), standard deviation(s) and 95% CI:
mydata <- rnorm(500)
xbar <- mean(mydata)
xbar
[1] -0.0376074
s <- sd(mydata)
s
[1] 1.004922
CI(mydata)
upper mean lower
0.05069041 -0.03760740 -0.12590521
Then using qnorm I get the following:
qnorm(.975, mean=xbar, sd=s)
[1] 1.932003
qnorm(.025, mean=xbar, sd=s)
[1] -2.007218
From https://rdrr.io/cran/Rmisc/src/R/CI.R, the code underlying CI turns out to be as follows.
CI <-
function(x,ci=.95) {
a<-mean(x)
s<-sd(x)
n<-length(x)
error<-qt(ci+(1-ci)/2,df=n-1)*s/sqrt(n)
return(c(upper=a+error,mean=a,lower=a-error))
}
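The use of qt indicates that the quantile of a t-distribution with n-1 degrees of freedom is used to determine the lower and upper bound of the confidence interval, whereas qnorm() uses the normal distribution; that difference is most noticeable when n is small. More importantly here, CI scales by s/sqrt(n), the standard error of the mean, while your qnorm() calls use sd = s, the standard deviation of the individual observations. Your qnorm() bounds therefore describe where roughly 95% of the observations fall, not the uncertainty in the mean, and with n = 500 this accounts for almost all of the difference.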
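To see the two effects separately, here is a small sketch that recomputes the interval by hand using the same formula as the CI source quoted above. The variable names ci_t, ci_z and range_obs are made up for illustration, and the seed is arbitrary.

```r
set.seed(1)
mydata <- rnorm(500)
n <- length(mydata)
xbar <- mean(mydata)
s <- sd(mydata)
se <- s / sqrt(n)                      # standard error of the mean

# t-based interval for the mean (same formula as in the CI source above)
ci_t <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * se

# normal-based interval for the mean: nearly identical when n = 500
ci_z <- qnorm(c(0.025, 0.975), mean = xbar, sd = se)

# normal quantiles with sd = s: the spread of individual observations
range_obs <- qnorm(c(0.025, 0.975), mean = xbar, sd = s)
```

Comparing ci_t with ci_z isolates the qt versus qnorm effect, while comparing either with range_obs shows the effect of dividing by sqrt(n).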
I hope this answers your question.
I have a dataset with several groups, where I want to calculate a median value for each group using dplyr. The data are weighted, and the weights need to be taken into account in calculating the median. I found the weighted.median function from spatstat which seems to work fine. Consider the following simplified example:
library(spatstat)
library(dplyr)
tst <- data.frame(group = rep(c(1:5), each = 100))
tst$val = runif(500) * tst$group
tst$wt = runif(500) * tst$val
tst %>%
group_by(group) %>%
summarise(weighted.median(val, wt))
# A tibble: 5 × 2
  group `weighted.median(val, wt)`
  <int>                      <dbl>
1     1                      0.752
2     2                      1.36
3     3                      1.99
4     4                      2.86
5     5                      3.45
However, I would also like to add 95% confidence intervals to these values, and this has me stumped. Things I've considered:
Spatstat also has a weighted.var function but there's no documentation, and it's not even clear to me whether this is variance around the median or mean.
This rcompanion post suggests various methods for calculating CIs around medians, but as far as I can tell none of them handle weights.
This blog post suggests a function for calculating CIs and a median for weighted data, and is the closest I can find to what I need. However, it doesn't work with my dplyr groupings. I suppose I could write a loop to do this one group at a time and build the output data frame, but that seems cumbersome. I'm also not totally sure I understand the function in the post and am slightly suspicious of its results: for instance, testing this out I get wider estimates for alpha=0.1 than for alpha=0.05, which seems backwards to me. Edit to add: upon further investigation, I think this function works as intended if I use alpha=0.95 for 95% CIs, rather than alpha = 0.05 (at least, this returns values that feel intuitively about right). I can also make it work with dplyr by editing it to return just a single moe value rather than a pair of high/low estimates. So this may be a good option, but I'm also considering others.
Is there an existing function in some library somewhere that can do what I want, or an otherwise straightforward way to implement this?
There are several approaches.
You could use the asymptotic formula for standard error of the sample median. The sample median is asymptotically normal with standard error 1/sqrt(4 n f(m)^2) = 1/(2 sqrt(n) f(m)), where n is the number of observations, m is the true median, and f(x) is the probability density of the (weighted) random variable. You could estimate the probability density using the base R function density.default with the weights argument (the weights should be normalised to sum to 1). If x is the vector of observed values and w the corresponding vector of weights, then
med <- weighted.median(x, w)
f <- density(x, weights=w/sum(w))  # density() expects weights that sum to 1
fmed <- approx(f$x, f$y, xout=med)$y  # estimated density at the median
samplesize <- length(x)
se <- 1/sqrt(4 * samplesize * fmed^2)
ci <- med + c(-1,1) * 1.96 * se
This relies on several asymptotic approximations so it may be inaccurate. Also the sample size depends on the interpretation of the weights. In some cases the sample size could be equal to sum(w).
If there is very little data in each group, you could use the even simpler normal reference approximation,
med <- weighted.median(x, w)
v <- weighted.var(x, w)
sdm <- sqrt(pi/2) * sqrt(v)
samplesize <- length(x)
se <- sdm/sqrt(samplesize)
ci <- med + c(-1,1) * 1.96 * se
Alternatively you could use bootstrapping - generate random resamples of the input data (by choosing random resamples of the indices 1, 2, ..., n), extract the corresponding weighted observations (x_i, w_i), compute the weighted median of each resampled dataset, and construct the 95% confidence interval.
(This approach implicitly assumes the sample size is equal to n)
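The bootstrap procedure described above could be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: wmedian is a hypothetical stand-in for weighted.median (it takes the smallest value whose cumulative weight share reaches one half, so results may differ slightly from spatstat's interpolating version), and boot_wmedian_ci, B and level are names invented here.

```r
# Hypothetical helper: simple (non-interpolating) weighted median
wmedian <- function(x, w) {
  o <- order(x)
  cw <- cumsum(w[o]) / sum(w)          # cumulative share of total weight
  x[o][which(cw >= 0.5)[1]]
}

# Percentile bootstrap CI for the weighted median
boot_wmedian_ci <- function(x, w, B = 2000, level = 0.95) {
  n <- length(x)
  meds <- replicate(B, {
    i <- sample.int(n, n, replace = TRUE)  # resample the indices 1..n
    wmedian(x[i], w[i])                    # keep each value with its weight
  })
  alpha <- 1 - level
  quantile(meds, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(1)
x <- runif(200)
w <- runif(200)
ci <- boot_wmedian_ci(x, w)
```

Because boot_wmedian_ci returns a length-2 vector, it can be called once per group (for example inside a grouped summary) and the lower and upper bounds taken from the result.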
My code:
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sample1 <- sample(population1, 30, replace=F)
mean(sample1)
sample_bs <- replicate(200, mean(sample(sample1, 30, replace=T)))
mean(sample_bs)
gmodels::ci(sample_bs)
My results:
population mean: 0.6
sample mean: 0.4666
results from my bootstrapping procedures:
Estimate CI lower CI upper Std. Error
0.467500000 0.450542365 0.484457635 0.008599396
So my question is: why are the results from this procedure still far from the original population?
You are not sampling from your original population, you are sampling from your first sample! So the mean will be close to the mean of sample1, not population1.
Try this instead, and you will see how both results are close:
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sample1 <- sample(population1, 30, replace=F)
mean(sample1)
sample_bs <- replicate(200, mean(sample(population1, 30, replace=T)))
mean(sample_bs)
gmodels::ci(sample_bs)
Your main issue is that the sample size is too small. We can make some educated guesses at the precision you can expect from a sample size of 30, and what sample size you'd need for a given expected precision.
We will assume a normal distribution, but for this to be reasonable, the distribution of what you want to measure needs to be reasonably normal. That is, the distribution of the sample means needs to be reasonably normal; the distribution within a given sample need not be.
So…
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sm30 <- scale(replicate(10000, mean(sample(population1, 30, replace=FALSE))))
s <- seq(-3, 3, by=0.01)
plot(s, dnorm(s, 0, 1), type="l", col="grey80", lwd=9)
lines(density(sm30), col=3, lwd=2)
That looks fairly normal (ish) to me. Some skew, but close enough for government work, as they say (CERN might demand better)
Assuming this is normal enough, we can continue estimating (this is Stat101 stuff):
z <- qnorm(1-(0.05/2))
0.6 + z*c(-1, 1)*sqrt(0.6/30)
# 0.32281924 0.87718076
With a significance level of 5% (that is, a 95% confidence level) and a sample size of 30, we get some pretty wide confidence intervals (CI), as you experienced.
By rearranging the expression a bit, we can figure out what sample size we need for a given precision at a given confidence level. If we keep the 5% and say we want the CI to be at 0.6±0.1 we get:
(n <- ceiling((z*sqrt(0.6)/0.1)^2))
# 231
0.6 + z*c(-1, 1)*sqrt(0.6/n)
# 0.50011099 0.69988901
That's a bit more than 30.
While bootstrapping can help you estimate confidence intervals for non-normal distributions, it can't make up for missing information. The amount of information is determined by the variance in your data and the size of your sample.
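For completeness, here is a hedged sketch of how a percentile bootstrap interval is usually read off the resampled means, using the same setup as the question. The names boot_means and ci_boot are made up, and 2000 replicates is an arbitrary choice. The interval comes from the quantiles of the bootstrap distribution, not from applying gmodels::ci to the bootstrap means, which only measures how precisely the mean of the resamples is known.

```r
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sample1 <- sample(population1, 30, replace = FALSE)

# Resample the observed sample (not the population) with replacement
boot_means <- replicate(2000, mean(sample(sample1, replace = TRUE)))

# Percentile bootstrap 95% CI for the population mean
ci_boot <- quantile(boot_means, probs = c(0.025, 0.975), names = FALSE)
```

With only 30 observations this interval is wide, which is consistent with the normal-approximation calculation above.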
I want to pick up 50 samples from (TRUNCATED) Normal Distribution (Gaussian) in a range 15-85 with mean=35, and sd=30. For reproducibility:
num = 50 # number of samples
rng = c(15, 85) # the range to pick the samples from
mu = 35 # mean
std = 30 # standard deviation
The following code gives 50 samples:
rnorm(n = num, mean = mu, sd = std)
However, I want these numbers to be strictly between the range 15-85. How can I achieve this?
UPDATE: Some people made great points in the comment section that this problem can not be solved as this will no longer be Gaussian Distribution. I added the word TRUNCATED to the original post so it makes more sense (Truncated Normal Distribution).
As Limey said in the comments, by imposing a bounded region the distribution is no longer normal. There are several ways to achieve this.
library("MCMCglmm")
rtnorm(n = 50, mean = mu, sd = std, lower = 15, upper = 85)
is one method. If you want a more manual approach, you could simulate uniform values between the cumulative probabilities of the two bounds and then apply the normal quantile function
bounds <- c(pnorm(15, mu, std), pnorm(85, mu, std))
samples <- qnorm(runif(50, bounds[1], bounds[2]), mu, std)
The idea is very basic: simulate the quantile levels of the outcome, and then compute the value at each simulated quantile level from the distribution. The advantage of this approach over the one linked by GKi is that it ensures a "normal-ish" distribution, whereas simulating and then bounding the resulting vector puts additional mass at the bounds compared to the normal distribution.
Note the outcome is not normal, as it is bounded.
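As a rough illustration of the point about extra mass at the bounds, the sketch below compares the inverse-CDF approach with simply clamping ordinary normal draws to the range. Reading "simulating and bounding" as clamping is an assumption made here, and the names lower, upper, inv_cdf and clamped are invented for the example.

```r
set.seed(1)
mu <- 35; std <- 30
lower <- 15; upper <- 85
n <- 10000

# Inverse-CDF sampling: uniform draws between the CDF values at the bounds
p <- pnorm(c(lower, upper), mu, std)
inv_cdf <- qnorm(runif(n, p[1], p[2]), mu, std)

# Clamping: draw from the untruncated normal, then force values into range
clamped <- pmin(pmax(rnorm(n, mu, std), lower), upper)

# Share of draws sitting exactly on a bound
share_clamped <- mean(clamped == lower | clamped == upper)
share_inv_cdf <- mean(inv_cdf == lower | inv_cdf == upper)
```

A histogram of each vector (for example hist(inv_cdf) and hist(clamped)) makes the spikes at 15 and 85 in the clamped version easy to see.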
I have a data set and one of the columns contains random numbers ranging from 300 to 400. I'm trying to find what proportion of this column is between 320 and 350 using R. To my understanding, I need to standardize this data and create a bell curve first. I have the mean and standard deviation, but when I do (X - mean)/SD and plot a histogram of this column it's still not a bell curve.
This is the code I tried.
myData$C1 <- (myData$C1 - C1_mean) / C1_SD
If you are simply counting the number of observations in that range, there's no need to do any standardization and you may directly use
mean(myData$C1 >= 320 & myData$C1 <= 350)
As for the standardization, it definitely doesn't create any "bell curves": it only shifts the distribution (centering) and rescales the data (dividing by the standard deviation). Other than that, the shape itself of the density function remains the same.
For instance,
x <- c(rnorm(100, mean = 300, sd = 20), rnorm(100, mean = 400, sd = 20))
mean(x >= 320 & x <= 350)
# [1] 0.065
hist(x)
hist((x - mean(x)) / sd(x))
I suspect that what you are looking for is an estimate of the true, unobserved proportion. The standardization procedure would then be applicable if you had to use tabulated values of the standard normal distribution function. However, in R we can compute this directly without standardizing. In particular,
pnorm(350, mean = mean(x), sd = sd(x)) - pnorm(320, mean = mean(x), sd = sd(x))
# [1] 0.2091931
That's the probability P(320 <= X <= 350), where X is normally distributed with mean mean(x) and standard deviation sd(x). The figure is quite different from that above since we misspecified the underlying distribution by assuming it to be normal; it actually is a mixture of two normal distributions.