Why is the bootstrap technique not accurate for my non-normal distribution? - r

My code:
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sample1 <- sample(population1, 30, replace=F)
mean(sample1)
sample_bs <- replicate(200, mean(sample(sample1, 30, replace=T)))
mean(sample_bs)
gmodels::ci(sample_bs)
My results:
population mean: 0.6
sample mean: 0.4666
results from my bootstrapping procedure:
Estimate CI lower CI upper Std. Error
0.467500000 0.450542365 0.484457635 0.008599396
So my question is: why are the results from this procedure still so far from the original population mean?

You are not sampling from your original population; you are sampling from your first sample! So the mean will be close to the mean of sample1, not of population1.
Try this instead, and you will see how both results are close:
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sample1 <- sample(population1, 30, replace=F)
mean(sample1)
sample_bs <- replicate(200, mean(sample(population1, 30, replace=T)))
mean(sample_bs)
gmodels::ci(sample_bs)
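For reference, the usual percentile bootstrap stays within sample1 and takes quantiles of the replicate means directly, rather than computing a CI on their mean. A minimal sketch (my addition, reusing sample1 from above; the interval will be centred near mean(sample1), not near 0.6):
set.seed(1234)
boots <- replicate(2000, mean(sample(sample1, length(sample1), replace=TRUE)))
quantile(boots, c(0.025, 0.975)) # 95% percentile interval for the mean of sample1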

Your main issue is that the sample size is too small. We can make some educated guesses at the precision you can expect from a sample size of 30, and at the sample size you'd need for a given target precision.
We will assume a normal distribution, but for this to be reasonable, the distribution of what you want to measure needs to be reasonably normal. That is, the distribution of the sample means needs to be reasonably normal; the distribution of any given sample need not be.
So…
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sm30 <- scale(replicate(10000, mean(sample(population1, 30, rep=FALSE))))
s <- seq(-3, 3, by=0.01)
plot(s, dnorm(s, 0, 1), type="l", col="grey80", lwd=9)
lines(density(sm30), col=3, lwd=2)
That looks fairly normal(ish) to me. Some skew, but close enough for government work, as they say (CERN might demand better).
Assuming this is normal enough, we can continue estimating (this is Stat101 stuff):
z <- qnorm(1-(0.05/2))
0.6 + z*c(-1, 1)*sqrt(0.6/30)
# 0.32281924 0.87718076
With a significance level of 5% (i.e. 95% confidence) and a sample size of 30, we get some pretty wide confidence intervals (CI), as you experienced.
By rearranging the expression a bit, we can figure out what sample size we need for a given precision at a given confidence level. If we keep the 5% and say we want the CI to be 0.6 ± 0.1, we get:
(n <- ceiling((z*sqrt(0.6)/0.1)^2))
# 231
0.6 + z*c(-1, 1)*sqrt(0.6/n)
# 0.50011099 0.69988901
That's a bit more than 30.
While bootstrapping can help you estimate confidence intervals for non-normal distributions, it can't make up for missing information. The amount of information is inherent in the variance of your data and the size of your sample.
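To make that concrete, here is a sketch (my addition, reusing population1 from above) that repeats the bootstrap with the n = 231 computed earlier; the percentile interval comes out markedly narrower than with n = 30:
set.seed(1234)
sample2 <- sample(population1, 231, replace=FALSE)
sample2_bs <- replicate(2000, mean(sample(sample2, 231, replace=TRUE)))
quantile(sample2_bs, c(0.025, 0.975)) # centred near mean(sample2), width around 0.2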

Related

Multiple random values between specific ranges in R?

I want to draw 50 samples from a (TRUNCATED) normal distribution (Gaussian) in the range 15-85, with mean=35 and sd=30. For reproducibility:
num = 50 # number of samples
rng = c(15, 85) # the range to pick the samples from
mu = 35 # mean
std = 30 # standard deviation
The following code gives 50 samples:
rnorm(n = num, mean = mu, sd = std)
However, I want these numbers to be strictly within the range 15-85. How can I achieve this?
UPDATE: Some people made great points in the comment section that this problem cannot be solved as originally stated, since the result would no longer follow a Gaussian distribution. I added the word TRUNCATED to the original post so it makes more sense (truncated normal distribution).
As Limey said in the comments, by imposing a bounded region the distribution is no longer normal. There are several ways to achieve this.
library("MCMCglmm")
rtnorm(n = 50, mean = mu, sd = std, lower = 15, upper = 85)
is one method. If you want a more manual approach, you can simulate uniform draws between the CDF values of the bounds and push them through the normal quantile function:
bounds <- c(pnorm(15, mu, std), pnorm(85, mu, std))
samples <- qnorm(runif(50, bounds[1], bounds[2]), mu, std)
The idea is very basic: simulate quantiles of the outcome uniformly between the bounds' CDF values, then map each quantile back to a value through the inverse CDF. The value of this approach over the approach linked by GKi is that it preserves a "normal-ish" shape between the bounds, whereas simulating and then clipping the resulting vector piles additional mass onto the bounds compared to the truncated normal distribution.
Note the outcome is not normal, as it is bounded.
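As a quick sanity check (my addition, reusing mu and std from the question), the inverse-CDF draws always land inside the bounds and keep the normal-ish shape:
set.seed(1)
chk <- qnorm(runif(10000, pnorm(15, mu, std), pnorm(85, mu, std)), mu, std)
range(chk) # strictly within [15, 85]
hist(chk, breaks=50) # bell shape, cut off at the bounds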

Calculating the probability of a sample mean using R

I'm in an intro to stats class right now, and have absolutely no idea what's going on. How would I solve the following problem using R?
Let x be a continuous random variable that has a normal distribution with a mean of 71 and a standard deviation of 15. Assuming n/N ≤ 0.05, find the probability that the sample mean, x-bar, of a random sample of 24 taken from this population will be between 68.1 and 78.3.
I'm really struggling on this one and I still have to get through other problems in the same format. Any help would be greatly appreciated!
For R coding this might set you up:
# Children's IQ scores are normally distributed with a
# mean of 100 and a standard deviation of 15. What
# proportion of children are expected to have an IQ between
# 80 and 120?
mean=100; sd=15
lb=80; ub=120
x <- seq(-4, 4, length=100)*sd + mean
hx <- dnorm(x, mean, sd)
plot(x, hx, type="n", xlab="IQ Values", ylab="",
     main="Normal Distribution", axes=FALSE)
i <- x >= lb & x <= ub
lines(x, hx)
polygon(c(lb, x[i], ub), c(0, hx[i], 0), col="red")
area <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd)
result <- paste("P(", lb, "< IQ <", ub, ") =",
                signif(area, digits=3))
mtext(result, 3)
axis(1, at=seq(40, 160, 20), pos=0)
There is also a nice introductory course on R and data analysis by DataCamp, which might come in handy:
https://www.datacamp.com/courses/exploratory-data-analysis
And another tutorial on R and statistics:
http://www.cyclismo.org/tutorial/R/confidence.html
In terms of the code: the population standard deviation is given (15), so the standard error of the mean for n = 24 is known exactly and no simulated sample is needed:
se_pop <- 15/sqrt(24)
pnorm(78.3, 71, se_pop) - pnorm(68.1, 71, se_pop) # ~0.82
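If you want to double-check that number, a quick simulation sketch (my addition, not part of the original answer) reproduces it:
set.seed(42)
xbar <- replicate(1e5, mean(rnorm(24, 71, 15))) # 100,000 sample means with n = 24
mean(xbar > 68.1 & xbar < 78.3) # ~0.82, matching the pnorm calculation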
In terms of the stats... you should probably refer to stats.stackexchange.com or your professor.

Issue with ROC curve where 'test positive' is below a certain threshold

I am working on evaluating a screening test for osteoporosis, and I have a large set of data where we measured values of bone density. We classified individuals as being 'disease positive' for osteoporosis if they had a vertebral fracture present on the images when we took the bone density measure.
The 'disease positive' has a lower distribution of the continuous value than the disease negative group.
We want to determine which threshold for the continuous variable is best for determining if an individual is at a higher risk for future fractures. We've found that the lower the value is, the higher the risk. I used Stata to create some tables to calculate sensitivity and specificity at a few different thresholds. Again, a person is 'test positive' if their value is below the threshold. I made this table here:
We wanted to show this in graphical form, so I decided to make an ROC curve, and I used the ROCR package to do so. Here is the code I used in R:
library(ROCR)
prevalentfx <- read.csv("prevalentfxnew.csv", header = TRUE)
pred <- prediction(prevalentfx$l1_hu, prevalentfx$fx)
perf <- performance(pred, "tpr", "fpr")
plot(perf, print.cutoffs.at = c(50,90,110,120), points.pch = 20, points.col = "darkblue",
text.adj=c(1.2,-0.5))
And here is what comes out:
Not what I expected!
This didn't make sense to me because according to the few thresholds where I calculated sensitivity and specificity manually (in the table), 50 HU is the least sensitive threshold and 120 is the most sensitive. Additionally, I feel like the curve is flipped along the diagonal axis. I know that this test is not that poor.
I figured this issue was due to the fact that a person is 'test positive' if the value is below the threshold, not above it. So I created a new vector with the binary classification flipped and re-created the ROC plot, and got a figure which aligns much better with the data. However, the threshold values are still the opposite of what they should be.
Is there something fundamentally wrong with how I'm looking at this? I have double checked our data several times to make sure I wasn't miscalculating the sensitivity and specificity values, and it all looks right. Thanks.
EDIT:
Here is a working example:
library(ROCR)
low <- rnorm(200, mean = 73, sd = 42)
high<- rnorm(3000, mean = 133, sd = 51.5)
measure <- c(low, high)
df = data.frame(measure)
df$fx <- rep.int(1, 200)
df$fx[201:3200] <- rep.int(0,3000)
pred <- prediction(df$measure, df$fx)
perf <- performance(pred, "tpr", "fpr")
plot(perf,print.cutoffs.at=c(50,90,110,120), points.pch = 20, points.col = "darkblue",
text.adj=c(1.2,-0.5))
The easiest solution (although inelegant) might be to use the negative values (rather than reversing your classification):
pred <- prediction(-df$measure, df$fx)
perf <- performance(pred, "tpr", "fpr")
plot(perf,
print.cutoffs.at=-c(50,90,110,120),
cutoff.label.function=`-`,
points.pch = 20, points.col = "darkblue",
text.adj=c(1.2,-0.5))
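If you'd rather avoid negating anything, a sketch using the pROC package (my suggestion, not part of the original answer) handles 'lower value = test positive' through its direction argument and keeps the thresholds on their original scale:
library(pROC)
roc_obj <- roc(df$fx, df$measure, direction=">") # controls (fx = 0) have the higher values
plot(roc_obj, print.thres=c(50, 90, 110, 120)) # note pROC plots specificity on the x-axis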

Calculating standard error after a log-transform

Consider a random set of numbers that are normally distributed:
x <- rnorm(n=1000, mean=10)
We'd like to know the mean and the standard error on the mean so we do the following:
se <- function(x) { sd(x)/sqrt(length(x)) }
mean(x) # something near 10.0 units
se(x) # something near 0.03 units
Great!
However, let's assume we don't necessarily know that our original distribution follows a normal distribution. We log-transform the data and perform the same standard error calculation.
z <- log(x, base=10)
mean(z) # something near 1 log units
se(z) # something near 0.0014 log units
Cool, but now we need to back-transform to get our answer in units NOT log units.
10^mean(z) # something near 10.0 units
10^se(z) # something near 1.003 units
My question: Why, for a normal distribution, does the standard error differ depending on whether it was calculated from the distribution itself or whether the data were transformed, the error calculated, and then back-transformed? In this example, it is striking how different the two values are. Note: the means came out the same regardless of the transformation.
EDIT #1: Ultimately, I am interested in calculating a mean and confidence intervals for non-normally distributed data, so if you can give some guidance on how to calculate 95% CIs on transformed data, including how to back-transform them to their native units, I would appreciate it!
END EDIT #1
EDIT #2: I tried using the quantile function to get the 95% confidence intervals:
quantile(x, probs = c(0.05, 0.95)) # around [8.3, 11.6]
10^quantile(z, probs = c(0.05, 0.95)) # around [8.3, 11.6]
So, that converged on the same answer, which is good. However, this method doesn't produce the exact same interval for non-normal data with "small" sample sizes:
t <- rlnorm(10)
mean(t) # around 1.46 units
10^mean(log(t, base=10)) # around 0.92 units
quantile(t, probs = c(0.05, 0.95)) # around [0.211, 4.79]
10^(quantile(log(t, base=10), probs = c(0.05, 0.95))) # around [0.209, 4.28]
Which method would be considered "more correct"? I assume one would pick the more conservative estimate?
As an example, would you report this result for the non-normal data (t) as having a mean of 0.92 units with a 95% confidence interval of [0.211, 4.79]?
END EDIT #2
Thanks for your time!

trying to compare two distributions

I found this code on the internet that compares a normal distribution to various Student's t-distributions:
x <- seq(-4, 4, length=100)
hx <- dnorm(x)
degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")
plot(x, hx, type="l", lty=2, xlab="x value",
ylab="Density", main="Comparison of t Distributions")
for (i in 1:4){
lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}
I would like to adapt this to my situation where I would like to compare my data to a normal distribution. This is my data:
library(quantmod)
getSymbols("^NDX",src="yahoo", from='1997-6-01', to='2012-6-01')
daily <- allReturns(NDX)[, c('daily')]
dailySerieTemporel<-ts(data=daily)
ss<-na.omit(dailySerieTemporel)
The objective is to see whether my data is normal or not... Can someone help me out a bit with this? Thank you very much, I really appreciate it!
If you are only concerned with whether your data is normally distributed, you can apply the Jarque-Bera test. Under the null hypothesis of this test, your data is normally distributed (see details here). You can perform the test using the jarque.bera.test function.
library(tseries)
jarque.bera.test(ss)
Jarque Bera Test
data: ss
X-squared = 4100.781, df = 2, p-value < 2.2e-16
Clearly, from the result, you can see that your data is not normally distributed, since the null has been rejected even at the 1% level.
To see why your data is not normally distributed, you can take a look at the descriptive statistics:
library(fBasics)
basicStats(ss)
ss
nobs 3776.000000
NAs 0.000000
Minimum -0.105195
Maximum 0.187713
1. Quartile -0.009417
3. Quartile 0.010220
Mean 0.000462
Median 0.001224
Sum 1.745798
SE Mean 0.000336
LCL Mean -0.000197
UCL Mean 0.001122
Variance 0.000427
Stdev 0.020671
Skewness 0.322820
Kurtosis 5.060026
From the last two rows, one can see that ss has excess kurtosis and nonzero skewness, and these two quantities are precisely what the Jarque-Bera test is built on.
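As a sketch of that connection (my arithmetic, plugging in the figures above), the Jarque-Bera statistic can be recomputed by hand from the skewness S, the excess kurtosis K, and the sample size n:
n <- 3776; S <- 0.322820; K <- 5.060026
(JB <- n/6 * (S^2 + K^2/4)) # ~4094, essentially the X-squared reported by the test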
But if you are interested in comparing the actual distribution of your data against a normally distributed random variable with the same mean and variance as your data, you can first estimate the empirical density function from your data using a kernel and plot it, then generate a normal random sample with the same mean and variance as your data and overlay its density. Something like this:
plot(density(ss, kernel='epanechnikov'))
set.seed(125)
lines(density(rnorm(length(ss), mean(ss), sd(ss)), kernel='epanechnikov'), col=2)
In this fashion you can overlay curves from other probability distributions.
The tests suggested by #Alex Reynolds will help you if your interest is in finding out which distribution your data were drawn from. If that is your goal, take a look at the goodness-of-fit tests in any statistics textbook. Nevertheless, if you just want to know whether your variable is normally distributed, the Jarque-Bera test is good enough.
Take a look at Q-Q plots and the Shapiro-Wilk or Kolmogorov-Smirnov tests to see if your data are normally distributed.
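A minimal sketch of those checks (my addition, assuming the ss series from above; note shapiro.test is limited to 5000 observations, and strictly the K-S test should not use parameters estimated from the same data):
ssv <- as.numeric(ss)
qqnorm(ssv); qqline(ssv, col=2) # heavy tails bend away from the line
shapiro.test(ssv) # Shapiro-Wilk normality test
ks.test(ssv, "pnorm", mean(ssv), sd(ssv)) # K-S against a fitted normal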
