How do I calculate a 95% confidence interval for a nominal variable? For instance, I have a sample size of 987 (300 smokers, 687 non-smokers). I need to calculate the 95% confidence interval for the percentage of smokers in the population.
Do read the Wikipedia link provided by Otto. An easy to use implementation in R can be found in the Hmisc package.
library(Hmisc)
binconf(300, 987, alpha = 0.05)  # 300 smokers out of 987 people sampled
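For readers without Hmisc, here is a minimal base-R sketch of the same idea. It assumes the Wilson score interval is wanted (which, as far as I know, is binconf's default method) and compares a hand-computed interval against prop.test with the continuity correction switched off. The variable names are only illustrative.

```r
# Wilson score interval for the proportion of smokers (300 of 987), base R only
x <- 300                      # number of smokers in the sample
n <- 987                      # sample size
conf <- 0.95                  # confidence level
z <- qnorm((1 + conf) / 2)    # two-sided normal critical value

p_hat  <- x / n
denom  <- 1 + z^2 / n
centre <- (p_hat + z^2 / (2 * n)) / denom
half   <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / denom
wilson <- c(lower = centre - half, upper = centre + half)

# prop.test without continuity correction should give the same interval
reference <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)

print(wilson)
print(reference)
```

Multiply the limits by 100 to express them as percentages.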
Related
Below is my code, which counts how often the population mean falls within the confidence intervals of samples drawn from that population. Basically, I am hoping to show that 95% confidence intervals work.
# x: population data, s: number of samples taken from the population, n: size of each sample
rp <- function(x, s, n) {
  m <- mean(x)
  ci.mat <- NULL
  tot <- 0
  for (i in 1:s) {
    cix <- t.test(sample(x, n))$conf.int  # confidence interval for one sample of x
    if (cix[1] < m & m < cix[2]) { tot <- tot + 1 }  # count intervals containing the population mean
    ci.mat <- rbind(ci.mat, cbind(cix[1], cix[2]))
  }
  par(mfrow = c(2, 1))
  hist(ci.mat[, 1], main = "Lower Limits for Sample Confidence Intervals", xlab = "Lower Limit")
  hist(ci.mat[, 2], main = "Upper Limits for Sample Confidence Intervals", xlab = "Upper Limit")
  return(data.frame(mean(x), tot/s))
}
I am hoping to add the population mean to my histograms, so I can show where the confidence intervals did not include the mean. So wherever the mean lies on the histogram with the lower limits, any values to the right of it would be part of a confidence interval that did not include the mean. I have no experience with modifying plots in R, so I don't even know if this is possible. Thanks for any help!
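One possible way to do this, sketched under assumptions: abline(v = ...) draws a vertical line on the most recent plot, so calling it after each hist() marks the population mean. The population below (pop) is simulated purely for illustration and is not the asker's data.

```r
# Mark the population mean on histograms of confidence limits
set.seed(42)
pop <- rnorm(10000, mean = 50, sd = 10)   # hypothetical population
m <- mean(pop)

# 1000 samples of size 30; each row holds one interval (lower, upper)
ci <- t(replicate(1000, as.numeric(t.test(sample(pop, 30))$conf.int)))

par(mfrow = c(2, 1))
hist(ci[, 1], main = "Lower Limits for Sample Confidence Intervals", xlab = "Lower Limit")
abline(v = m, col = "red", lwd = 2)       # lower limits to the right of this line miss the mean
hist(ci[, 2], main = "Upper Limits for Sample Confidence Intervals", xlab = "Upper Limit")
abline(v = m, col = "red", lwd = 2)       # upper limits to the left of this line miss the mean

# Proportion of intervals that contain the population mean
coverage <- mean(ci[, 1] < m & m < ci[, 2])
print(coverage)
```

Inside a function such as rp, the same abline(v = m, ...) calls would go directly after each hist() call.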
In R I am having a problem obtaining exact confidence intervals for the rate ratio as calculated by the poisson.test function. This function uses the method in binom.test to calculate the confidence limits between two rates e.g.
poisson.test(x = c(10000,20000), T = c(15000,15000), r = 1, conf.level = 0.95)
This works fine when x (events) is lower than T (observations). However, at very high x (events) and relatively low T (observations), the variance of the Poisson distribution will approach infinity. This is not accounted for in the poisson.test function, as the binom.test method has no such limits. Consequently, with progressively higher event rates, the confidence intervals of both the individual rates and the ratio of these rates become progressively narrower, when they should asymptotically widen.
Would anybody know an alternative way to test the ratio of two very high rates using the Uniformly Most Powerful method and obtain their correct confidence limits?
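To make the mechanism concrete, here is a sketch of how the two-sample interval appears to be derived, based on my reading of the poisson.test source in the stats package (treat the details as an assumption). Conditional on the total count, the first count is binomial, so the binom.test interval for that proportion is mapped back to a rate ratio. The names tb and r0 are illustrative.

```r
# Reproduce the poisson.test rate-ratio interval from binom.test
x  <- c(10000, 20000)    # event counts
tb <- c(15000, 15000)    # time bases (observations)
r0 <- 1                  # hypothesised rate ratio

pt <- poisson.test(x, T = tb, r = r0, conf.level = 0.95)

# Conditional on sum(x), x[1] ~ Binomial(sum(x), p) with p = r*T1 / (r*T1 + T2)
bt <- binom.test(x[1], sum(x), p = r0 * tb[1] / (r0 * tb[1] + tb[2]),
                 conf.level = 0.95)

# Map the interval for p back to the rate-ratio scale
p_ci <- as.numeric(bt$conf.int)
ratio_ci <- p_ci / (1 - p_ci) * tb[2] / tb[1]

print(as.numeric(pt$conf.int))
print(ratio_ci)
```

Because the interval depends on the counts only through this binomial proportion, larger counts always tighten it, which is the behaviour described above.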
I have been struggling with how R calculates quantiles and the normal fitting of data.
I have data (NDVI values) that follows a truncated normal distribution (see figure)
I am interested in getting the lowest 10th percentile value (p=0.1) from the data and from the fitting normal distribution curve.
In my understanding, because the data is truncated, the two should be quite different: I expect the quantile from the data to be higher than the one calculated from the normal distribution, but this is not so. From what I understand of the help page for the quantile function, the quantile from the data should come from the default quantile call:
q=quantile(y, p=0.1)
while the quantile from the normal distribution is :
qx=quantile(y, p=0.1, type=9)
However, the two results are very close in all cases, which makes me wonder what type of distribution R fits to the data to calculate the quantile (a truncated normal distribution?).
I have also tried to calculate the quantile based on the fitting normal curve as:
fitted=fitdist(as.numeric(y), "norm", discrete = T)
fit.q=as.numeric(quantile(fitted, p=0.1)[[1]][1])
but I obtain no difference.
So my questions are:
To what curve does R fit the data for calculating quantiles, in particular for type=9?
How can I calculate the quantile based on the complete normal distribution (including the lower tail)?
I don't know how to generate a reproducible example for this, but the data is available at https://dl.dropboxusercontent.com/u/26249349/data.csv
Thanks!
R is using the empirical ordering of the data when determining quantiles, rather than assuming any particular distribution.
The 10th percentile for your truncated data and a normal distribution fit to your data happen to be pretty close, although the 1st percentile is quite a bit different. For example:
# fitdist comes from the fitdistrplus package
library(fitdistrplus)
# Load data
df = read.csv("data.csv", header=TRUE, stringsAsFactors=FALSE)
# Fit a normal distribution to the data
df.dist = fitdist(df$x, "norm", discrete = T)
Now let's get quantiles of the fitted distribution and the original data. I've included the 1st percentile in addition to the 10th percentile. You can see that the fitted normal distribution's 10th percentile is just a bit lower than that of the data. However, the 1st percentile of the fitted normal distribution is much lower.
quantile(df.dist, p=c(0.01, 0.1))
Estimated quantiles for each specified probability (non-censored data)
           p=0.01    p=0.1
estimate 1632.829 2459.039
quantile(df$x, p=c(0.01, 0.1))
     1%     10%
2064.79 2469.90
quantile(df$x, p=c(0.01, 0.1), type=9)
      1%      10%
2064.177 2469.400
You can also see this by direct ranking of the data and by getting the 1st and 10th percentiles of a normal distribution with mean and sd equal to the fitted values from fitdist:
# 1st and 10th percentiles of data by direct ranking
df$x[order(df$x)][round(c(0.01,0.1)*5780)]
[1] 2064 2469
# 1st and 10th percentiles of fitted distribution
qnorm(c(0.01,0.1), df.dist$estimate[1], df.dist$estimate[2])
[1] 1632.829 2459.039
Let's plot histograms of the original data (blue) and of fake data generated from the fitted normal distribution (red). The area of overlap is purple.
# Histogram of data (blue)
hist(df$x, xlim=c(0,8000), ylim=c(0,1600), col="#0000FF80")
# Overlay histogram of random draws from fitted normal distribution (red)
set.seed(685)
x.fit = rnorm(length(df$x), df.dist$estimate[1], df.dist$estimate[2])
hist(x.fit, add=TRUE, col="#FF000080")
Or we can plot the empirical cumulative distribution function (ecdf) for the data (blue) and the random draws from the fitted normal distribution (red). The horizontal grey line marks the 10th percentile:
plot(ecdf(df$x), xlim=c(0,8000), col="blue")
lines(ecdf(x.fit), col="red")
abline(0.1,0, col="grey40", lwd=2, lty="11")
Now that I've gone through this, I'm wondering if you were expecting fitdist to return the parameters of the normal distribution we would have gotten had your data really come from a normal distribution and not been truncated. Rather, fitdist returns a normal distribution with the mean and sd of the (truncated) data at hand, so the distribution returned by fitdist is shifted to the right compared to where we might have "expected" it to be.
c(mean=mean(df$x), sd=sd(df$x))
     mean        sd
3472.4708  790.8538
df.dist$estimate
     mean        sd
3472.4708  790.7853
Or, another quick example: x is normally distributed with mean ~ 0 and sd ~ 1. xtrunc removes all values less than -1, and xtrunc.dist is the output of fitdist on xtrunc:
set.seed(55)
x = rnorm(6000)
xtrunc = x[x > -1]
xtrunc.dist = fitdist(xtrunc, "norm")
round(cbind(sapply(list(x=x,xtrunc=xtrunc), function(x) c(mean=mean(x),sd=sd(x))),
            xtrunc.dist=xtrunc.dist$estimate),3)
          x xtrunc xtrunc.dist
mean -0.007  0.275       0.275
sd    1.009  0.806       0.806
And you can see in the ecdf plot below that the truncated data and the normal distribution fitted to the truncated data have about the same 10th percentile, while the 10th percentile of the untruncated data is (as we would expect) shifted to the left.
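If the goal is the quantile of the underlying (untruncated) normal distribution, one option is to fit a truncated normal by maximum likelihood. The sketch below uses base R only and assumes the truncation point is known (the hypothetical variable a, set to -1 to match the toy example); it is not what fitdist does with "norm".

```r
# Recover the parameters of the untruncated normal from left-truncated data
set.seed(55)
x <- rnorm(6000)
a <- -1                       # assumed known lower truncation point
xtrunc <- x[x > a]

# Negative log-likelihood of a normal left-truncated at a;
# sigma is optimised on the log scale to keep it positive
negloglik <- function(par) {
  mu <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(xtrunc, mu, sigma, log = TRUE)) +
    length(xtrunc) * pnorm(a, mu, sigma, lower.tail = FALSE, log.p = TRUE)
}

fit <- optim(c(mean(xtrunc), log(sd(xtrunc))), negloglik)
mu_hat <- fit$par[1]
sigma_hat <- exp(fit$par[2])

# 10th percentile of the complete normal, including the missing lower tail
q10_full <- qnorm(0.1, mu_hat, sigma_hat)
# 10th percentile of the truncated data, for comparison
q10_data <- as.numeric(quantile(xtrunc, 0.1))

print(c(mean = mu_hat, sd = sigma_hat))
print(c(full_normal = q10_full, truncated_data = q10_data))
```

With real data the truncation point would have to come from knowledge of how the data were collected, and the estimate is only as good as the assumption that the untruncated values are normal.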
I would like to calculate a summary odds ratio value for two or more papers where the only information I have is the individual odds ratios with their 95% confidence intervals. Is this possible? I have been poking around in the meta package, and only figured out how to do it with crude counts.
Thanks so much!
It is quite simple.
You just need to use the natural logarithm of the odds ratio (logOR) and its standard error (and corresponding variance). These can be easily back-calculated from the 95% confidence intervals, assuming logOR is normally distributed. Finally, pool the logORs using their variances.
For instance, after you have built a data frame (eg called mydata) with logOR and variance for each study, you can easily proceed with a random effect meta-analysis with the metafor package in R as follows:
library(metafor)
# logOR holds the log odds ratios, variance their sampling variances
res <- rma(logOR, variance, data=mydata, method="DL")
forest(res)
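The back-calculation step can be sketched in base R as follows. The three odds ratios and their limits are made-up illustrative values, and the pooling shown is a plain fixed-effect inverse-variance average rather than the random-effects model fitted by rma, so treat it as a check on the arithmetic only.

```r
# Back-calculate logOR and its variance from reported ORs and 95% CIs
or    <- c(1.50, 2.00, 1.20)   # hypothetical odds ratios
lower <- c(1.10, 1.20, 0.80)   # hypothetical lower 95% limits
upper <- c(2.05, 3.33, 1.80)   # hypothetical upper 95% limits

z <- qnorm(0.975)              # about 1.96 for a 95% interval
logOR <- log(or)
se <- (log(upper) - log(lower)) / (2 * z)   # CI width on the log scale is 2 * z * se
variance <- se^2
mydata <- data.frame(logOR, variance)

# Fixed-effect inverse-variance pooling (not the DL random-effects model)
w <- 1 / mydata$variance
pooled_log <- sum(w * mydata$logOR) / sum(w)
pooled_se  <- sqrt(1 / sum(w))
pooled <- exp(pooled_log + c(OR = 0, lower = -1, upper = 1) * z * pooled_se)

print(mydata)
print(pooled)
```

The resulting mydata has the two columns that the rma call expects.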
In the future, you may consider posting similar questions on Cross Validated.
I want to obtain the limits that determine the significance of autocorrelation coefficients and partial autocorrelation coefficients, but I don't know how to do it.
I obtained the partial autocorrelogram using the function pacf(data). I want R to print the values indicated in the figure.
The limits that determine the significance of autocorrelation coefficients are: ± (exp(2*1.96/√(N-3)) - 1) / (exp(2*1.96/√(N-3)) + 1).
Here N is the length of the time series, and I used the 95% confidence level.
The correlation values that correspond to the m % confidence intervals chosen for the test are given by 0 ± i/√N where:
N is the length of the time series
i is the number of standard deviations we expect m % of the correlations to lie within under the null hypothesis that there is zero autocorrelation.
Since the observed correlations are assumed to be normally distributed:
i=1.96 (roughly 2) for a 95% confidence level (acf's default),
i=2.58 for a 99% confidence level (i=3 corresponds to about 99.7%),
and so on as dictated by the properties of a Gaussian distribution
Figure A1, Page 1011 here provides a nice example of how the above principle applies in practice.
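As a rough sketch of the rule above, the limit can be computed directly with qnorm and compared against the values returned by pacf. The series x is simulated white noise, used only for illustration.

```r
# Significance limits for (partial) autocorrelations under the null of no autocorrelation
set.seed(1)
x <- rnorm(200)              # hypothetical time series
N <- length(x)
conf <- 0.95                 # confidence level

i <- qnorm((1 + conf) / 2)   # number of standard deviations, about 1.96
lim <- i / sqrt(N)           # limits are 0 +/- lim

r <- pacf(x, plot = FALSE)   # partial autocorrelations without drawing the plot
signif_lags <- which(abs(r$acf) > lim)

print(c(lower = -lim, upper = lim))
print(signif_lags)           # lags whose coefficient falls outside the limits
```

The same lim applies to the coefficients returned by acf(x, plot = FALSE), apart from lag 0, which is always 1.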
After investigating the acf and pacf functions and the psychometric package with its CIz and CIr functions, I found this simple code to do the task:
Compute confidence interval for z Fisher:
ciz = c(-1,1)*(-qnorm((1-alpha)/2)/sqrt(N-3))
Here alpha is the confidence level (typically 0.95) and N is the number of observations.
Compute confidence interval for R:
cir = (exp(2*ciz)-1)/(exp(2*ciz)+1)
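A complete version of the two steps, as a sketch with an assumed series length of N = 100. The back-transformation from Fisher's z to r is the hyperbolic tangent, so tanh can serve as a cross-check.

```r
# Limits for a correlation of zero via Fisher's z transformation
N <- 100                     # assumed number of observations
conf <- 0.95                 # confidence level

# Interval on the z scale: 0 +/- critical value * standard error 1/sqrt(N - 3)
ciz <- c(-1, 1) * qnorm((1 + conf) / 2) / sqrt(N - 3)

# Back-transform to the correlation scale
cir <- (exp(2 * ciz) - 1) / (exp(2 * ciz) + 1)

print(ciz)
print(cir)
print(all.equal(cir, tanh(ciz)))   # the back-transform equals tanh
```

For long series these limits are close to the simpler ±1.96/√N rule.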