MatchIt returns a balanced continuous variable but an unbalanced categorical variable - r

I used the matchit() function from the MatchIt package to derive a 1:4 ratio treated:untreated dataset, attempting to achieve a similar average age and gender frequency in both groups.
I have a small treated group (n = 44) and a much larger control group (n = 980). To reduce the size of the control group and rule out age and gender as confounders, I used matchit() to create a control group of 176 with an average age and gender balance similar to the treated group.
m.out <- matchit(Treated ~ AGE + SEX, data = d,
                 method = "optimal",
                 ratio = 4)
The summary of the output is:
Summary of balance for matched data:
         Means Treated Means Control SD Control Mean Diff eQQ Med
distance        0.0602        0.0603     0.0250   -0.0001       0
AGE            57.5227       58.4034     7.9385   -0.8807       1
SEXF            0.4318        0.1477     0.3558    0.2841       0
SEXM            0.5682        0.8523     0.3558   -0.2841       0
The AGE variable matched well (its means are not significantly different), but gender looks off (85% male in the control group vs. 57% in the treated group), so I ran a chi-square test of gender against treatment status. It showed a highly significant difference in gender:
chisq <- with(m.data, chisq.test(SEX, Treated))
data: SEX and Treated
X-squared = 15.758, df = 1, p-value = 7.199e-05
How do I account for the difference here? Is the problem with how I used matchit() (an incorrect method?), or has the matching worked and I have applied the chi-square test to the wrong question?

There are many reasons why propensity score matching didn't "work" in this case. In general, it isn't guaranteed to balance covariates in small samples; the theoretical properties of the propensity score apply in large samples and with the correct propensity score (and yours is almost certainly not correct).
Some more specific reasons: when doing 4:1 matching, many control units that are far from the treated units get matched to them; you could see whether matching fewer control units fixes this by lowering the ratio. It could also be that optimal matching is not a good matching method to use here. Optimal matching finds optimal pairs based on the propensity score, but you want balance on the covariates, not on the propensity score. You could try genetic matching (i.e., method = "genetic"), though this will probably fail as well (it's like using a hammer on a thumbtack).
One recommendation is to use the designmatch package to perform cardinality matching, which allows you to impose balance constraints and perform the matching without having to estimate a propensity score. With only two covariates, though, exact matching on gender combined with nearest-neighbor matching on age should do a fairly good job: supply SEX to exact and age as the distance in matchit(), as sketched below. You don't need a propensity score for this problem.
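A minimal sketch of that suggestion, assuming the data frame d and the variables from your call above (in recent versions of MatchIt, distance also accepts a numeric vector to match on directly):
m.out2 <- matchit(Treated ~ AGE + SEX, data = d,
                  method = "nearest",
                  exact = ~ SEX,      # matched controls must have the same sex
                  distance = d$AGE,   # nearest neighbors on age itself
                  ratio = 4)
summary(m.out2)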
Finally, don't use hypothesis tests to assess balance; the balance output is enough. Don't stop trying to find good matches until your balance can't improve any more. See Ho, Imai, King, & Stuart (2007), who are also the authors of MatchIt, for more information on this.
Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference. Political Analysis, 15(3), 199–236. https://doi.org/10.1093/pan/mpl013

Related

Use of svyglm and svydesign with R for multistage stratified cluster design

I have a complicated dataset that was collected with a multistage stratified cluster design. I originally analysed it using glm(), but now realise that I have to use svyglm(). I'm not quite sure how best to model the data with svyglm(), and I was wondering if anyone could help shed some light.
I am attempting to see the effect that a variety of covariates taken at time 1 have on a binary outcome taken at time 2.
The sampling strategy was as follows: state -> urban/rural -> district -> subdistrict -> village. Within each village, individuals were randomly selected, with each of these having an id (uniqid).
I have a variable in the df for each of these stages of the sampling strategy. I also have the following variables: outcome, age, sex, income, marital_status, urban_or_rural_area, uniqid, weights. The formula that I want for my regression equation is outcome ~ age + sex + income + marital_status + urban_or_rural_area . Weights are coded by the weights variable. I had set the family to binomial(link = logit).
If anyone has any idea how such an approach could be coded in R with svyglm I would be most appreciative. I'm quite confused about what should be passed as id, fpc and nest. Do I have to specify all levels of the stratified design or just some?
Any direction, or resources which explain this well would be massively appreciated.
You don't really give enough information about the design: which of the geographical units are strata and which are clusters? For example, my guess is that you sample both urban and rural areas in all states, and that you don't sample all villages, but I don't know whether you sample all districts or subdistricts. I also don't know whether your overall sampling fraction is large or small (and so whether the with-replacement approximation is ok).
Let's pretend you sample just some districts, so districts are your Primary Sampling Units, and that the overall sampling fraction of people is small. The design command is
your_design <- svydesign(id = ~district, weights = ~weights,
                         strata = ~interaction(state, urban_rural, drop = TRUE),
                         data = your_data_frame)
That is, the strata are combinations of state and urban/rural and any combinations that aren't in your data set don't exist in the population (maybe some states are all-rural or all-urban). Within each stratum you have districts, and only some of these appear in the sample. In your geographical hierarchy, districts are then the first level that is sampled rather than exhaustively enumerated.
You don't need fpc unless you want to specify the full multistage design without replacement.
The nest option is not about how the survey was done but is about how variables are coded. The US National Center for Health Statistics (bless their hearts) set up a lot of designs that have many strata and two primary sampling units per stratum. They call these primary sampling units 1 and 2; that is, they reuse the names 1 and 2 in every stratum. The svydesign function is set up to expect different sampling unit names in different strata, and to verify that each sampling unit name appears in just one stratum, as a check against data errors. This check has to be disabled for NCHS surveys and perhaps some others that also reuse sampling unit names. You can always leave out the nest option at first; svydesign will tell you if it might be needed.
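As a sketch with hypothetical column names psu and stratum, disabling the check looks like this:
# PSU labels are reused across strata (e.g., 1 and 2 in every stratum),
# so nest = TRUE tells svydesign to treat PSU ids as nested within strata
your_design <- svydesign(id = ~psu, strata = ~stratum, weights = ~weights,
                         nest = TRUE, data = your_data_frame)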
Finally, the models:
svyglm(outcome ~ age + sex + income + marital_status + urban_or_rural_area,
       design = your_design, family = quasibinomial)
Using binomial or quasibinomial will give identical answers, but binomial produces a harmless warning about non-integer weights; quasibinomial simply suppresses that warning.

Calculate vaccine efficacy confidence Interval using the exact method

I'm trying to calculate confidence intervals for vaccine efficacy studies.
All the studies I am looking at claim that they use the exact method and cite this free PDF: Statistical Methods in Cancer Research Volume II: The Design and Analysis of Cohort Studies. It is my understanding that the exact method is also sometimes called the Clopper-Pearson method.
The data I have is: Person-years of vaccinated, Person-years of unvaccinated, Number of cases among vaccinated, Number of cases among unvaccinated,
Efficacy is easy to calculate: (1 - ( (Number of cases among vaccinated / Person-years of vaccinated) / (Number of cases among unvaccinated / Person-years of unvaccinated) )) * 100
But calculating the confidence interval is harder.
At first I thought that this website gave the code I needed:
testall <- binom.test(8, 8+162)
(theta <- testall$conf.int)
(VE <- (1-2*theta)/(1-theta))
In this example, 8 is the number of cases in the vaccinated group and 162 is the number of cases in the unvaccinated group. But I have had a few problems with this.
(1) There are some studies where the sizes of the two cohorts (vaccinated vs. not vaccinated) are different. I don't think that this code works for those cohorts.
(2) I want to be able to adjust the type of confidence interval. For example, one study used a "one-sided α risk of 2·5%" whereas another study used "a two-sided α level of 5%". I'm not clear whether this affects the numbers.
Either way, when I tried to run the numbers, it didn't work.
Here is an example of a data set I am trying to validate:
Number of cases among vaccinated: 176
Number of cases among unvaccinated: 221
Person-years of vaccinated: 11,793
Person-years of unvaccinated: 5,809
Efficacy: 60.8%
Two-sided 95% CI: 52.0–68.0
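For what it's worth, here is a sketch of how the binom.test() approach extends to unequal person-time (variable names are mine). Conditional on the total number of cases, the count among the vaccinated is binomial with success probability (rate_v * PY_v) / (rate_v * PY_v + rate_u * PY_u), so the exact Clopper-Pearson interval for that proportion converts limit-by-limit into one for the rate ratio, and hence for VE:
cases_v <- 176      # cases among vaccinated
cases_u <- 221      # cases among unvaccinated
py_v    <- 11793    # person-years, vaccinated
py_u    <- 5809     # person-years, unvaccinated

test  <- binom.test(cases_v, cases_v + cases_u)      # two-sided 95% by default
theta <- c(test$estimate, test$conf.int)             # proportion and CI limits

rate_ratio <- (theta / (1 - theta)) * (py_u / py_v)  # adjust for person-time
VE <- 1 - rate_ratio
round(100 * VE, 1)  # 60.8, with limits 68.0 and 52.0 (order flips under 1 - x)
This reproduces the published 60.8% (52.0–68.0). On point (2): a two-sided 95% Clopper-Pearson interval puts 2.5% in each tail, so each limit is also a valid one-sided bound at α = 2.5%; to change the level, pass conf.level to binom.test().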

predict.coxph() and survC1::Est.Cval -- type for predict() output

Given a coxph() model, I want to use predict() to predict hazards and then use survC1::Est.Cval( . . . nofit=TRUE) to get a c-value for the model.
The Est.Cval() documentation is rather terse, but says that "nofit=TRUE: If TRUE, the 3rd column of mydata is used as the risk score directly in calculation of C."
Say, for simplicity, that I want to predict on the same data I built the model on. For
coxModel a Cox regression model from coxph();
time a vector of times (positive reals), the same times that coxModel was built on; and
event a 0/1 vector, the same length, of event/censor indicators, the same events that coxModel was built on --
does this indicate that I want
predictions <- predict(coxModel, type="risk")
dd <- cbind(time, event, pred)
Est.Cval(mydata=dd, tau=tau, nofit=TRUE)
or should that first line be
predictions <- predict(coxModel, type="lp")
?
Thanks for any help,
The answer is that it doesn't matter.
Basically, the concordance value tests, for all comparable pairs of times (events and censored observations), how probable it is that the later time has the lower risk score (for a really good model, almost always). But since $e^u$ is a monotonic function of real $u$, and the c-value only tests comparisons, it doesn't matter whether you provide the hazard ratio, $e^{\sum_i \beta_i x_i}$, or the linear predictor, $\sum_i \beta_i x_i$.
Since #42 motivated me to come up with a minimal working example, we can test this. We'll compare the values that Est.Cval() provides using one input versus using the other; and we can compare both to the value we get from coxph().
(That last value won't match exactly, because Est.Cval() uses the method of Uno et al. 2011 (Uno, H., Cai, T., Pencina, M. J., D’Agostino, R. B. & Wei, L. J. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statist. Med. 30, 1105–1117 (2011), https://onlinelibrary.wiley.com/doi/full/10.1002/sim.4154) but it can serve as a sanity check, since the values should be close.)
The following is based on the example worked through in Survival Analysis with R, 2017-09-25, by Joseph Rickert, https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/.
library("survival")
library("survC1")
# Load dataset included with survival package
data("veteran")
# The variable `time` records survival time; `status` indicates whether the
# patient’s death was observed (status=1) or that survival time was censored
# (status = 0).
# The model they build in the example:
coxModel <- coxph(Surv(time, status) ~ trt + celltype + karno + diagtime +
                  age + prior, data=veteran)
# The results
summary(coxModel)
Note the c-score it gives us:
Concordance= 0.736 (se = 0.021 )
Now, we calculate the c-score given by Est.Cval() on the two types of values:
# The value from Est.Cval(), using a risk input
cvalByRisk <- Est.Cval(mydata=cbind(time=veteran$time, event=veteran$status,
predictions=predict(object=coxModel, newdata=veteran, type="risk")),
tau=2000, nofit=TRUE)
# The value from Est.Cval(), using a linear predictor input
cvalByLp <- Est.Cval(mydata=cbind(time=veteran$time, event=veteran$status,
predictions=predict(object=coxModel, newdata=veteran, type="lp")),
tau=2000, nofit=TRUE)
And we compare the results:
cvalByRisk$Dhat
[1] 0.7282348
cvalByLp$Dhat
[1] 0.7282348

R: Propensity Score Matching using MatchIt. How to specify desired matching accuracy for different covariates?

I'm rather new to R and especially to the method of matching by propensity scores. My dataset includes two groups of people that differ in whether they were treated or not. Unfortunately, they also differ significantly in age and disease duration, hence my wish to match them.
So far this is my code:
set.seed(2208)
mod_match <- matchit(TR ~ age + disease_duration + sex + partner + work + academic,
data = Data_nomiss,
method = "nearest",
caliper = .025)
summary(mod_match)
This code works fine, but I wondered whether there is a way to weight the importance of the covariates for the matching accuracy. For me it is crucial that the groups are as close as possible on age and disease duration (numeric), whereas the remaining variables (factors) should also be matched but may differ in their means a little more than the first two.
While searching for a solution to my problem I came across a post from someone who had basically the same problem: http://r.789695.n4.nabble.com/matchit-can-I-weight-the-parameters-td4633907.html
In that case it was proposed to combine nearest-neighbor and exact matching, but transferred to my dataset this leads to a disproportionate reduction of my sample. In the end, what I'd like to have is some sort of customized matching process focusing on age and disease duration while also involving the last three variables, but in a weaker way.
Does anyone happen to have an idea how this could be realized? I'd be really glad to receive any kinds of tips on this matter and thank you for your time!
Unfortunately, MatchIt does not provide this functionality. There are two ways to do this without MatchIt, but they are slightly advanced. Note that neither uses propensity scores. The point of propensity score matching is to match on a single number, the propensity score, which makes the matching procedure blind to the original covariates for which balance is desired.
The first is to use the Matching package and supply your own weight matrix to the Weight.matrix argument of Match(). You could upweight age and disease duration in the weight matrix, as sketched below.
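A minimal sketch, assuming the Data_nomiss data frame and variables from your code; the factors are coerced to numeric for the distance computation, and the 10-vs-1 weights are purely illustrative:
library(Matching)

# Covariate matrix (factors coerced to numeric codes for the distance)
X <- with(Data_nomiss, cbind(age, disease_duration,
                             as.numeric(sex), as.numeric(partner),
                             as.numeric(work), as.numeric(academic)))

# Diagonal weight matrix: age and disease duration count 10x more
# than the other covariates in the matching distance
W <- diag(c(10, 10, 1, 1, 1, 1))

m <- Match(Tr = Data_nomiss$TR, X = X,
           Weight = 3,          # Weight = 3 makes Match() use Weight.matrix
           Weight.matrix = W)

MatchBalance(TR ~ age + disease_duration + sex + partner + work + academic,
             data = Data_nomiss, match.out = m)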
The second is to use the designmatch package to do cardinality matching, which allows you to specify balance constraints; it uses optimization to find the largest sample that meets those constraints. In bmatch(), enter your covariates of interest into the mom argument, which also allows you to set a specific balance constraint for each variable, so you can require stricter constraints for age and disease duration. A sketch follows.
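A sketch under the assumption that TR is a 0/1 indicator (bmatch() requires the data sorted with treated units first) and that tolerances of 0.05 vs. 0.20 standardized differences suit your purposes:
library(designmatch)

# Sort so treated units come first, as bmatch() requires
d <- Data_nomiss[order(-Data_nomiss$TR), ]
t_ind <- d$TR

# data.matrix() turns factor columns into numeric codes
mom_covs <- data.matrix(d[, c("age", "disease_duration", "sex",
                              "partner", "work", "academic")])

# Tolerances on mean differences: 0.05 SD for age and disease duration,
# a looser 0.20 SD for the factor variables
mom_tols <- c(0.05, 0.05, 0.2, 0.2, 0.2, 0.2) * apply(mom_covs, 2, sd)

out <- bmatch(t_ind = t_ind, mom = list(covs = mom_covs, tols = mom_tols))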

R matchit on 7 variables with different seeds

I am using the following code to match 2 cohorts (2800 controls, 460 treated) of different patients:
set.seed(99)
m.out <- matchit(treatment ~ gender + age + VarC + diseaseDuration +
pastActivity + activity + country, data = Pat.match,
method = "nearest", ratio = 5, discard = "control",
caliper = 0.1, m.order = "smallest")
After matching, the cohorts are reduced to about 1230 controls vs. 400 treated.
These numbers are similar when I change the seed. However, if I check more closely (by patient ID), the matched cohorts for different seeds differ in about 20% of the patients. To be more precise:
set.seed(99) results in a cohort, that has an overlap of only 80% with the resulting cohort of set.seed(27).
And this might have a huge impact on further models and statistical analyses. Have I overlooked something?
Regards!
Sometimes this occurs when units have the same or very similar propensity scores; I believe MatchIt breaks such ties with a randomly selected match. I actually disagree with #dash2 that you shouldn't change the seed until you get a result you like. You should perform the matching procedure as many times as you want until you arrive at covariate balance. If your data are balanced and all your treated units are retained (or at least the same ones across matching specifications), then your effect estimate will not vary systematically with your matched set. Just remember that once you have estimated your treatment effect, you can't go back and redo your matching (which is probably what #dash2 is getting at), but at the matching phase this is not a concern.
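To see what actually changes between seeds, here is a sketch that compares the matched control IDs across two runs (it assumes a patient identifier column ID in Pat.match):
library(MatchIt)

matched_controls <- function(seed) {
  set.seed(seed)
  m <- matchit(treatment ~ gender + age + VarC + diseaseDuration +
                 pastActivity + activity + country, data = Pat.match,
               method = "nearest", ratio = 5, discard = "control",
               caliper = 0.1, m.order = "smallest")
  md <- match.data(m)          # the matched subset of Pat.match
  md$ID[md$treatment == 0]     # IDs of the matched controls
}

c99 <- matched_controls(99)
c27 <- matched_controls(27)
# Proportion of matched controls shared between the two seeds
length(intersect(c99, c27)) / length(union(c99, c27))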
So the computing side of this is that matchit() is probably doing something random, even though you haven't, e.g., specified m.order = "random". What that could be is probably easiest to find out by looking through the source code...
The statistical side is really for Cross Validated, not here, but I would suggest:
if any of your results depend non-trivially on the seed, don't trust them; they are not robust.
in particular (I'm sure you know this, but it is worth reiterating) do NOT try different seeds until you get a result you like!
