R - calculate confidence interval of grouped data

Suppose you have a dataset named data as follows:
Gender   Pneumonia_Incidence   lower_CI   upper_CI
Male     38000                 30000      44000
Female   34000                 32000      38000
I would now like to calculate the total pneumonia incidence, which can be done easily:
sum(data$Pneumonia_Incidence)
But how can I calculate the lower and upper CI for this estimate? Is it valid to just sum the lower_CI and upper_CI values, or would I need to do something else?

How can I calculate lower and upper CI for this estimate?
You cannot with the information you have provided. You would need to know the variance of each estimated incidence. These must be known, since someone has calculated the confidence intervals. With them you could obtain an estimate of the variance of the total, and then calculate the overall confidence interval.
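For illustration only, here is a minimal sketch of that pooling step. It assumes the published intervals are symmetric, normal-based 95% CIs and that the two estimates are independent -- note the male interval above is visibly asymmetric (-8000/+6000), so these assumptions are already suspect for this data:

# ASSUMPTIONS: symmetric normal-based 95% CIs, independent groups
data <- data.frame(
  Gender = c("Male", "Female"),
  Pneumonia_Incidence = c(38000, 34000),
  lower_CI = c(30000, 32000),
  upper_CI = c(44000, 38000)
)
se <- (data$upper_CI - data$lower_CI) / (2 * 1.96)  # back out each standard error
total <- sum(data$Pneumonia_Incidence)
total_se <- sqrt(sum(se^2))  # variance of a sum of independent estimates
total + c(-1, 1) * 1.96 * total_se  # roughly (64400, 79600)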
If this is important to you, I strongly suggest you consult a qualified statistician, or at least a reputable textbook. This is not simple high school math. There may be other issues involved, such as sampling weights. I certainly wouldn't seek statistical advice here, and probably not even at that other place often mentioned. What if your boss asked how you calculated the confidence interval? Would you say that you consulted the internet?
Is it valid to just sum lower_CI and upper_CI values ...
No. Variances involve sample sizes. Consider this: imagine two groups, one with a very large sample size and one with a very small one. The group with the large sample size will have a narrower confidence interval than the group with the small sample size. If you just added the two intervals, you would end up with an overall interval that was equally weighted by both groups, which intuitively isn't correct; it would be a biased estimate.
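A toy illustration of why summing the endpoints misbehaves, assuming independent, normal-based intervals with hypothetical standard errors:

# Under independence, the half-width for the sum is 1.96 * sqrt(se1^2 + se2^2);
# summing the endpoints instead gives 1.96 * (se1 + se2), which is always wider
se1 <- 10; se2 <- 10
1.96 * sqrt(se1^2 + se2^2)  # ~27.7 (correct under independence)
1.96 * (se1 + se2)          # ~39.2 (naive endpoint sum: too wide)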
... or would I need to do something else?
Consult a statistician. :)

Related

Number of Likert scale points: impact on statistical inferences

I have a survey in which we are debating whether to use a 5- or 7-point Likert scale for questions about agreement (strongly agree to strongly disagree). The question is whether the 7-point scale would help or hinder the statistical inferences we could make from the data with a sample size of, say, 1,800.
One may assume a 7-point Likert scale would give you more variability, but at the cost of a wider confidence interval, especially when stratifying by demographic variables.
A back-of-the-envelope calculation of the confidence interval, given a random distribution along a 7-point scale and a 5-point scale for a sample size of 1,800, gives ~9% and ~6.5% respectively. They both seem high, and a 9% CI seems like a high cost for the added variability, but I am interested in others' takes.
My thoughts:
The 5-category Likert scale is the standard choice. If you need a sample size of 1,800 to get a CI width of ~6.5%, I'd go with the 5-point scale. That's a lot of people to get only a ~9% width for your CI with 7 points, which lets you estimate little more than which decile your outcome variable falls in.
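For what it's worth, the questioner's back-of-the-envelope figures can be reproduced if one assumes responses uniformly distributed over the scale points and reads the percentages as 95% half-widths in raw scale units (an assumption on my part):

# half-width of a 95% CI for the mean of a discrete uniform on 1..k
half_width <- function(k, n = 1800) {
  sigma <- sqrt((k^2 - 1) / 12)  # sd of a discrete uniform on 1..k
  1.96 * sigma / sqrt(n)
}
half_width(5)  # ~0.065
half_width(7)  # ~0.092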

Odds Ratio and 95% Confidence Intervals for Binary Matched Outcome after Propensity Score Matching

I am conducting a propensity score matching analysis on the outcome of two different new cancer treatments, where the outcome is binary (cancer-free or not cancer-free). Following successful matching, I get the paired 2x2 contingency table for the outcome between my matched pairs, which looks like this:
                                     **Treatment 1**
                                 Not-Cancer Free   Cancer Free
**Treatment 2**  Not-Cancer Free        50              39
                 Cancer Free            53              60
I'd like to compare the outcomes to figure out whether one treatment is better than the other by comparing the odds ratios of being cancer-free. I've been advised to conduct McNemar's test due to the matched nature of the data, which I did, getting a p-value of 0.17 (non-significant). However, I've also been advised that, instead of simply using the odds ratio normally used for such 2x2 tables (b/c --> 39/53 = 0.74 OR), I should calculate the odds ratio and 95% confidence interval using the methods shown in Agresti, A. and Min, Y. (2004). Effects and non-effects of paired identical observations in comparing proportions with binary matched-pairs data. Statistics in Medicine, 23(1), 65-75, as they account for the matched nature of the data.
Unfortunately, after reading this paper numerous times (especially its odds ratio section), I can't figure out which of the given equations are the ones for the odds ratio and 95% CI calculations being referred to, but I know they must be in there somewhere, as other papers cite this paper for their odds ratios without sharing their methodology, which makes it difficult to trace back.
If anyone has read this paper or has experience with odds ratios for matched binary data, could you please let me know how to obtain matched-pair odds ratios? Thank you incredibly much in advance!
You can use the exact McNemar test for the paired data. A point made in the paper, and what the exact test uses, is that only the off-diagonal elements (b, c) enter the calculations. You can use the exact2x2 package (https://cran.r-project.org/web/packages/exact2x2/exact2x2.pdf) to get the test results with a 95% CI:
library(exact2x2)
# Set up your data as a matrix; matrix() fills column-wise, so this
# puts b = 39 (top right) and c = 53 (bottom left)
x <- matrix(c(50, 53, 39, 60), 2, 2)
mcnemar.exact(x)
Gives:
Exact McNemar test (with central confidence intervals)
data: x
b = 39, c = 53, p-value = 0.175
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.4738071 1.1339142
sample estimates:
odds ratio
0.7358491
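For intuition, the point estimate and central exact CI above can be reproduced by hand. The matched-pairs (conditional) odds ratio depends only on the discordant counts, and the central exact interval is, to my understanding, the Clopper-Pearson interval for p = b/(b+c) transformed via OR = p/(1-p). A sketch:

b <- 39; c <- 53  # the discordant counts from the table
b / c  # conditional odds ratio: ~0.736
ci_p <- binom.test(b, b + c)$conf.int  # Clopper-Pearson CI for b/(b+c)
ci_p / (1 - ci_p)  # ~ (0.474, 1.134), matching the output above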

Change significance level of MannKendall trend test -- R

I want to perform the Mann-Kendall test at the 99% and 90% confidence levels. When running the lines below, the analysis is based on a 95% confidence level. How do I change the code to use 99% and 90%?
vec = c(1,2,3,4,5,6,7,8,9,10)
MannKendall(vec)
I cannot comment yet, but I have a question: what do you mean when you say that you need to perform the analysis at a 99% and 90% CI? Do you want to know if your value is significant at the 99% and 90% significance levels?
If you just need to know whether your score is significant at the 99% and 90% levels, then r2evans was right: the alpha, or significance level, is just an arbitrary threshold that you use to define how small your probability should be before you reject the assumption that there "is no effect", or in this case that there is independence between the observations. More importantly, the calculation of the p-value is independent of the confidence level you select, so if you want to know whether your result is significant at different confidence levels, just compare your p-value against those levels.
I checked how the function works and did not see any indication that the selected alpha level affects the results. If you check the source code of MannKendall(x) (by typing MannKendall without parentheses or anything), you can see that it is just Kendall(1:length(x), x). The function Kendall calculates a statistic tau, which "measures the strength of monotonic association between the vectors x and y", then returns a p-value by calculating how likely your observed tau is under the assumption that there is no relation between 1:length(x) and x -- in other words, how likely it is that you would obtain that tau just by chance. As you can see, this does not depend on the confidence level at all; the confidence level only matters at the end, when you decide how small the probability of your tau should be before you conclude that it cannot have been obtained just by chance.
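As a concrete illustration: the object returned by the Kendall package's MannKendall stores the two-sided p-value in $sl, so you can compare it against any threshold you like:

library(Kendall)
vec <- c(1,2,3,4,5,6,7,8,9,10)
res <- MannKendall(vec)
res$sl          # the two-sided p-value
res$sl < 0.01   # significant at the 99% level?
res$sl < 0.10   # significant at the 90% level?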

Is it possible to specify the correlation between two distributions?

For context, say there were two academic exams -- morning and afternoon. I'm only given the summary statistics -- mean, median, skew, and kurtosis -- for the scores on both exams, so I'm unable to say exactly how many students passed, but I can estimate it by fitting the moments and creating a custom Pearson distribution. I can estimate, for example, how many students passed the first and the second exam, along with a standard deviation to quantify my error.
What I would like to do is estimate the number of students who pass the course, defined as having an average score over the morning and afternoon exams above 60%. If students' performance on the two tests is completely independent, I suppose this would be easy: I just generate scores for both tests as two lists, average them, count the number of items over 60%, and repeat, say, 10,000 times.
If the two tests are completely dependent, I suppose I would have to sort both lists, because the student scoring highest on the morning exam should also score highest on the afternoon exam. What I'm missing is how to measure the degree of randomness/interdependence in between (maybe it has something to do with entropy?), where students who score highly on exam 1 also tend to score highly on exam 2, and whether there is a package in R that I can use to specify an arbitrary degree of entropy between two variables.
A well-known concept for measuring how different two distributions are is the KL divergence:
In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution is different from a second, reference probability distribution.
To make the measure symmetric, you can use Jensen-Shannon divergence as well.
For the implementation of KL divergence, you can use this package in R.
A special case of KL divergence is mutual information, which is a better measure of interdependence and is what you're looking for. Instead of measuring the divergence from a reference distribution, mutual information is the KL divergence between the joint distribution and the product of the marginal distributions. Mutual information is also equal to the sum of the marginal entropies minus the joint entropy, meaning you can estimate the individual and joint entropies first and combine them to estimate mutual information.
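To make that definition concrete, here is a minimal base-R sketch that estimates mutual information by binning two samples and applying the KL formula for the joint versus the product of marginals (a crude estimator; the bin count of 10 is an arbitrary choice):

# estimate MI by discretizing both variables into bins
mi_binned <- function(x, y, bins = 10) {
  pxy <- table(cut(x, bins), cut(y, bins)) / length(x)  # joint distribution
  px <- rowSums(pxy)  # marginal of x
  py <- colSums(pxy)  # marginal of y
  nz <- pxy > 0       # avoid 0 * log(0)
  sum(pxy[nz] * log(pxy[nz] / outer(px, py)[nz]))
}

set.seed(1)
x <- rnorm(5000)
y <- x + rnorm(5000)          # dependent pair
mi_binned(x, y)               # clearly > 0
mi_binned(x, rnorm(5000))     # near 0 for independent data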
Here is one implementation of mutual information for R, although there have been many other estimators introduced:
https://github.com/majianthu/copent
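A hedged usage sketch for the copent package linked above, assuming its exported copent() function, which per the package description estimates the copula entropy of the columns of a data matrix (mutual information is, up to sign convention, the negative of the copula entropy -- check the package docs):

# install.packages("copent")  # ASSUMPTION: copent() as described above
library(copent)
set.seed(1)
x <- rnorm(500)
y <- x + rnorm(500)
copent(cbind(x, y))           # dependent pair
copent(cbind(x, rnorm(500)))  # compare against an independent pair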

calculating aggregate confidence intervals for forecasts

I'm using ARIMA models to produce a sales forecast for a company. The company's sales are broken down into 4 channels, and I'm running 4 different models to estimate the sales for each channel. Eventually, I'm going to aggregate the sales of these channels to find the total forecasted sales for the whole company. My question is: how should I go about finding the confidence interval for the overall forecast? Adding up the confidence intervals of each channel is not correct, since that will give me a very large interval.
I'd really appreciate it if anyone could give me some idea of how to approach this sort of issue. Thanks in advance!
The confidence interval is based on the variance. For a 95% interval, it is the forecast plus/minus 1.96*sqrt(variance). So you need to estimate the variance of the sum. It is not equal to the sum of the variances unless the four channels are uncorrelated: Var(X1+X2+X3+X4) equals the sum of the individual variances plus twice the sum of all pairwise covariances. You will need to compute the correlations between the four channels, and then use this equation for the variance of a sum (check any intro stats book).
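A minimal sketch of that computation with hypothetical numbers -- the forecasts, standard errors, and correlation matrix below are placeholders you would replace with your own (e.g., correlations estimated from the models' residuals):

fc <- c(120, 80, 60, 40)  # hypothetical channel point forecasts
se <- c(10, 8, 6, 5)      # hypothetical forecast standard errors
R  <- diag(4)             # replace off-diagonals with estimated correlations
Sigma <- diag(se) %*% R %*% diag(se)  # covariance matrix of forecast errors
total_se <- sqrt(sum(Sigma))  # Var(sum) = sum of all covariance entries
sum(fc) + c(-1, 1) * 1.96 * total_se  # 95% interval for the aggregate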
