Calculate vaccine efficacy confidence interval using the exact method in R

I'm trying to calculate confidence intervals for vaccine efficacy studies.
All the studies I am looking at claim that they use the exact method and cite this free PDF: Statistical Methods in Cancer Research, Volume II: The Design and Analysis of Cohort Studies. It is my understanding that the exact method is also sometimes called the Clopper-Pearson method.
The data I have are: person-years of vaccinated, person-years of unvaccinated, number of cases among vaccinated, and number of cases among unvaccinated.
Efficacy is easy to calculate: Efficacy (%) = (1 - (cases among vaccinated / person-years of vaccinated) / (cases among unvaccinated / person-years of unvaccinated)) * 100.
But calculating the confidence interval is harder.
At first I thought that this website gave the code I needed:
testall <- binom.test(8, 8 + 162)       # exact binomial test on the split of cases
(theta <- testall$conf.int)             # Clopper-Pearson CI for the share of cases in the vaccinated group
(VE <- (1 - 2 * theta) / (1 - theta))   # maps that share to efficacy; valid only with equal person-time per group
In this example, 8 is the number of cases in the vaccinated group and 162 is the number of cases in the unvaccinated group. But I have had a few problems with this.
(1) There are some studies where the sizes of the two cohorts (vaccinated vs. unvaccinated) differ. I don't think this code works for those cohorts.
(2) I want to be able to adjust the type of confidence interval. For example, one study used a "one-sided α risk of 2·5%" whereas another study used "a two-sided α level of 5%". I'm not clear whether this affects the numbers.
Either way, when I tried to run the numbers, it didn't work.
Here is an example of a data set I am trying to validate:
Number of cases among vaccinated: 176
Number of cases among unvaccinated: 221
Person-years of vaccinated: 11,793
Person-years of unvaccinated: 5,809
Efficacy: 60.8%
Two-sided 95% CI: 52.0–68.0
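
Update: the following sketch reproduces the published numbers using the person-time-adjusted exact approach: conditional on the total number of cases, the count in the vaccinated group is binomial with p = PY_v * RR / (PY_v * RR + PY_u), where RR is the incidence-rate ratio, so a Clopper-Pearson interval for p can be inverted into an interval for RR and hence for efficacy. This is my own framing of the conditional-binomial method, not code taken from the cited monograph.

x_v  <- 176      # cases among vaccinated
x_u  <- 221      # cases among unvaccinated
py_v <- 11793    # person-years, vaccinated
py_u <- 5809     # person-years, unvaccinated

(ve <- (1 - (x_v / py_v) / (x_u / py_u)) * 100)   # 60.8, the published point estimate

# Clopper-Pearson (exact) CI for p = x_v / (x_v + x_u); conf.level = 0.95 gives
# the two-sided 95% interval, and each of its bounds is also the bound from a
# one-sided alpha of 2.5%, so the two phrasings give the same numbers
p_ci  <- binom.test(x_v, x_v + x_u, conf.level = 0.95)$conf.int
rr_ci <- (p_ci / (1 - p_ci)) * (py_u / py_v)      # invert p to the rate ratio
(ve_ci <- rev(1 - rr_ci) * 100)                   # about 52.0 to 68.0, as published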

Related

Hypothesis testing for three groups

Based on the data, is the average sale amount statistically the same for the A, B, and C groups?
I performed t.test on the pairs A-B, B-C, and C-A. For C-A the p-value was > 0.05, so I concluded that for C-A we can't reject the null hypothesis, and the averages may be the same.
H1, the alternative hypothesis, was: the true difference in means between group 36-45 and group 46-50 is not equal to 0.
My question is: did I do this correctly, or is there another way to check the hypothesis for three groups?
If the population means of the groups are denoted mu_A, mu_B, and mu_C, then you are actually interested in the single joint null hypothesis H_0: mu_A = mu_B = mu_C. The problem with conducting three pairwise tests is that it is difficult to control the probability of a type I error. That is, how do you know that three tests, each at a significance level of 5%, will still falsely reject the H_0 above with only 5% probability when it is true?
The test you are looking for is called an Analysis of Variance (ANOVA). It provides a single test statistic and a single p-value for the joint hypothesis above. If you search for "ANOVA statistical test", you will find many explanations (and the appropriate commands to run the analysis in R). I hope this helps.
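
For concreteness, a minimal sketch in R, assuming a data frame sales with a numeric column amount and a factor group with levels A, B, and C (the names are hypothetical):

# one-way ANOVA of the joint null hypothesis H_0: mu_A = mu_B = mu_C
fit <- aov(amount ~ group, data = sales)
summary(fit)    # a single F statistic and a single p-value
TukeyHSD(fit)   # if H_0 is rejected, pairwise comparisons with
                # family-wise error control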

Sample size estimation for three armed clinical trial in R for comparing the Restricted Mean Survival Time

I have a three-armed clinical trial with two treatments and one placebo. I am to compare the restricted mean survival time (RMST) among the three.
I need to find the sample size for each arm (equal allocation).
I know that R has the SSRMST package for calculating sample size based on the RMST for a two-armed trial. The code as used is:
library(SSRMST)
ssrmst(ac_rate = ac_rate, ac_period = ac_period, tot_time = tot_time, tau = tau,
       shape0 = shape0, scale0 = scale0, shape1 = shape1, scale1 = scale1,
       margin = margin, seed = seed)
So, my question is: how do I use this package to calculate the sample size for a three-armed trial (with equal allocation)?
How would the above code need to be modified?
Any guidance will be very helpful.
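
One common workaround, offered only as a sketch under assumptions rather than a settled answer: treat each treatment-vs-placebo contrast as its own two-arm calculation and split the alpha across the two comparisons (Bonferroni). Here shape2 and scale2 are hypothetical names for the second treatment arm's Weibull parameters, and one_sided_alpha is the argument name given in the SSRMST documentation (verify with ?ssrmst).

library(SSRMST)
# treatment 1 vs placebo, with the one-sided 2.5% alpha split in half
fit1 <- ssrmst(ac_rate = ac_rate, ac_period = ac_period, tot_time = tot_time,
               tau = tau, shape0 = shape0, scale0 = scale0,
               shape1 = shape1, scale1 = scale1, margin = margin,
               one_sided_alpha = 0.025 / 2, seed = seed)
# treatment 2 vs placebo, same placebo parameters
fit2 <- ssrmst(ac_rate = ac_rate, ac_period = ac_period, tot_time = tot_time,
               tau = tau, shape0 = shape0, scale0 = scale0,
               shape1 = shape2, scale1 = scale2, margin = margin,
               one_sided_alpha = 0.025 / 2, seed = seed)
# with equal allocation, take the larger per-arm size of the two calculations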

Should I use a Friedman test or a mixed model for my data in R? Nested or not?

I have a response variable which is the proportion of range exposed to extreme events for terrestrial mammal species in the future. More precisely, it is the difference in proportion of range exposed (DPRE) between the historical period and future greenhouse gas emission scenarios (a measure of the increase or decrease in the percentage of range exposed): my response variable therefore runs from -1 to 1 (where +1 implies that the range will experience a +100% increase in the proportion exposed: from 0% in the historical period to 100% in the future scenario).
As said, I am analyzing these differences for all terrestrial mammals (5311 species), across different scenarios and for two time periods: near future (means of 2021-2040) and far future (means of 2081-2100).
So, my explanatory variables are:
3 scenarios of greenhouse gas emissions (Representative Concentration Pathways: RCP2.6, RCP4.5 and RCP8.5);
Time Periods (Near Future and Far Future): NF and FF;
Species: 5311 individuals.
I am not very expert in statistics, so I'm not sure which of the two suggestions I received to follow:
A Friedman test with Species as blocks (but in which I should somehow build a nested model, with RCPs as groups nested within time periods; or a sort of two-way Friedman, with RCP and TimePeriod as the two factors).
A linear mixed model with RCP*TimePeriod as fixed effects and (TimePeriod | Species) as random effects.
I ran normality checks, and all distributions turned out to be non-normal; this is why I was advised to use a Friedman test instead of ANOVA. I also ran pairwise Wilcoxon rank-sum tests and found significant differences between NF and FF for all RCPs.
I should say I ran 3 Wilcoxon tests, one for every RCP, so maybe a third option would be to create 3 different models, one for every RCP, but this would also depart from the standard "repeated measures" setting of the Friedman test.
One last consideration: I have to run another model where the response variable is the difference in proportion of subrange exposed. The other explanatory variables are maintained, but in this case the analysis is not global and takes into consideration the differences that may be present across 14 IUCN biomes. So every analysis is made across RCPs, for NF and FF, and for all biomes. Should I create and run 14 (biomes) x 3 (RCPs) x 2 (time periods) = 84 models in this case? Or a sort of doubly nested (time periods and biomes) model?
If necessary I can provide the large dataframe.
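
For orientation, a minimal sketch of both suggestions, assuming a long-format data frame dpre_df with columns DPRE, RCP, TimePeriod, and Species (the names are hypothetical):

# suggestion 1: Friedman test with Species as blocks; friedman.test() handles
# one unreplicated factor per block, so each time period is tested separately
friedman.test(DPRE ~ RCP | Species, data = subset(dpre_df, TimePeriod == "NF"))

# suggestion 2: linear mixed model with an RCP-by-TimePeriod interaction
library(lme4)
fit <- lmer(DPRE ~ RCP * TimePeriod + (TimePeriod | Species), data = dpre_df)
summary(fit)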

MatchIt function returns equal continuous variable but unequal categorical variable

I used the matchit() function to derive a 1:4 treated:untreated dataset, attempting to achieve a similar average age and gender frequency.
I have a small treated group (n = 44) and a much larger control group (n = 980). To reduce the size of the control group and exclude age and gender as confounders, I attempted to use matchit() to create a control group of 176 with an average age and a gender balance similar to the treated group.
m.out <- matchit(Treated ~ AGE + SEX, data = d,
method = "optimal",
ratio = 4)
The summary of the output is:
Summary of balance for matched data:
Means Treated Means Control SD Control Mean Diff eQQ Med
distance 0.0602 0.0603 0.0250 -0.0001 0
AGE 57.5227 58.4034 7.9385 -0.8807 1
SEXF 0.4318 0.1477 0.3558 0.2841 0
SEXM 0.5682 0.8523 0.3558 -0.2841 0
The AGE variable worked great (it is not significantly different), but the gender seemed off (85% male in the control group vs. 57% in the treated group), so I performed a chi-square test on the treated ~ gender data. It showed a highly significant difference in gender:
chisq <- with(m.data, chisq.test(SEX, Treated))
data: SEX and Treated
X-squared = 15.758, df = 1, p-value = 7.199e-05
How do I account for the difference here? Is my problem with the matchit() function (an incorrect method?), or has it worked and I've applied the chi-square test to the wrong problem?
There are many reasons why propensity score matching didn't "work" in this case. In general, it isn't guaranteed to balance covariates in small samples; the theoretical properties of the propensity score apply in large samples and with the correct propensity score model (and yours is almost certainly not correct).
One more specific reason could be that with 4:1 matching, many control units that are far from the treated units get matched to your treated units. You could see whether matching fewer control units fixes this by changing the ratio. It could also be that optimal matching is not a good matching method to use here: optimal matching finds optimal pairs based on the propensity score, but you want balance on the covariates, not on the propensity score. You could try genetic matching (i.e., method = "genetic"), though this will probably fail as well (it's like using a hammer on a thumbtack).
One recommendation is to use the designmatch package to perform cardinality matching, which lets you impose balance constraints and perform the matching without having to estimate a propensity score. With only two covariates, though, exact matching on sex and nearest-neighbor matching on age should do a fairly good job: set exact = ~ SEX and distance = d$AGE in matchit() and see if that works better. You don't need a propensity score for this problem.
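
A sketch of that last suggestion with this question's variable names, assuming a recent MatchIt version in which exact accepts a formula and distance accepts a numeric vector:

library(MatchIt)
m.out2 <- matchit(Treated ~ AGE + SEX, data = d,
                  method = "nearest",
                  distance = d$AGE,   # nearest-neighbor matching directly on age
                  exact = ~ SEX,      # exact matching on sex
                  ratio = 4)
summary(m.out2)   # judge balance from this output, not from hypothesis tests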
Finally, don't use hypothesis tests to assess balance. The balance output is enough. Don't stop trying to find better matches until your balance can't improve any more. See Ho, Imai, King, & Stuart (2007) for more information; they are the authors of MatchIt too.
Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference. Political Analysis, 15(3), 199–236. https://doi.org/10.1093/pan/mpl013

How to understand RandomForestExplainer output (R package)

I have the following code, which tries to predict Species from the iris data using randomForest. What I'm really interested in is finding which features (variables) best explain the species classification. I found that the package randomForestExplainer serves this purpose well.
library(randomForest)
library(randomForestExplainer)
forest <- randomForest::randomForest(Species ~ ., data = iris, localImp = TRUE)
importance_frame <- randomForestExplainer::measure_importance(forest)
randomForestExplainer::plot_multi_way_importance(importance_frame, size_measure = "no_of_nodes")
The code produces this plot (not reproduced here):
Based on the plot, the measures that explain why Petal.Length and Petal.Width are the best factors are these (the definitions are from the vignette):
mean_min_depth – mean minimal depth calculated in one of three ways specified by the parameter mean_sample,
times_a_root – total number of trees in which Xj is used for splitting the root node (i.e., the whole sample is divided into two based on the value of Xj),
no_of_nodes – total number of nodes that use Xj for splitting (it is usually equal to no_of_trees if trees are shallow),
It's not entirely clear to me why high times_a_root and no_of_nodes values are better, and why a low mean_min_depth is better. What is the intuitive explanation for that? The vignette's description doesn't help.
You would like a statistical model or measure to be a balance between "power" and "parsimony". The randomForest is designed internally to do penalization as its statistical strategy for achieving parsimony. Furthermore, the number of variables selected in any given sample will be less than the total number of predictors. This allows model building when the number of predictors exceeds the number of cases (rows) in the dataset. Early splitting or classification rules can be applied relatively easily, but subsequent splits become increasingly difficult to meet criteria of validity. "Power" is the ability to correctly classify items that were not in the subsample, for which a proxy, the so-called OOB or "out-of-bag" items, is used. The randomForest strategy is to do this many times to build up a representative set of rules that classify items, under the assumption that the out-of-bag samples will be a fair representation of the "universe" from which the whole dataset arose.
The times_a_root would fall into the category of measuring the "relative power" of a variable compared to its "competitors". The times_a_root statistic measures the number of times a variable is "at the top" of a decision tree, i.e., how likely it is to be chosen first in the process of selecting split criteria. The no_of_nodes statistic measures the number of times the variable is chosen at all as a splitting criterion among all of the subsampled trees.
From:
?randomForest # to find the names of the object leaves
forest$ntree
[1] 500
... we get a denominator for assessing the meaning of the roughly 200 values on the y-axis of the plot. About two-fifths of the subsampled trees had Petal.Length as the top split criterion, while another two-fifths had Petal.Width as the top variable selected as the most important one. About 75 of 500 had Sepal.Length, while only about 8 or 9 had Sepal.Width (remember, it's a log scale). In the case of the iris dataset, each subsample would have ignored at least one of the variables, so the maximum possible value of times_a_root would have been less than 500. Scores of around 200 are pretty good in this situation, and we can see that both of these variables have comparable explanatory ability.
The no_of_nodes statistic totals up the number of nodes that used that variable as a splitting criterion across all of the trees, remembering that the number of nodes is constrained by the penalization rules.
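
To put raw numbers to this, the counts behind the plot can be read from the importance_frame computed in the question (column names as in the randomForestExplainer documentation; verify with str(importance_frame)):

importance_frame[, c("variable", "mean_min_depth", "no_of_nodes", "times_a_root")]
# low mean_min_depth and high no_of_nodes / times_a_root tell the same story:
# the variable wins early and often when it competes for splits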
