I'm running a diff-in-diff estimation and using the MatchIt package to match my treatment and control groups on their distance to a certain location (nearest-neighbour matching, logit model, caliper = 0.25).
Everything is fine with the actual matching; however, I ran across this kind of plot in a paper I read:
I'm a bit confused: how is it possible to plot propensity scores before matching, when the matching itself produces the propensity scores? If anyone is familiar with this kind of plot, I'd appreciate the help. Here's my code so far, which only gives the density functions after matching for the treatment (Near) and control groups.
# Nearest-neighbour matching on the logit propensity score with a 0.25 caliper
m.df <- matchit(Near ~ Distance_to_center, data = df, method = "nearest",
                distance = "logit", caliper = 0.25)
# Keep only the matched data; store the propensity score as "pscore"
mdf <- match.data(m.df, distance = "pscore")
df <- mdf
# Density of the propensity scores after matching, treated and control
plot(density(df$pscore[df$Near == 1]))
plot(density(df$pscore[df$Near == 0]))
Matching does not give the propensity scores. The propensity scores are estimated first; then matchit() matches units on them.
You can extract the propensity scores for the whole sample from the matchit object. What you did when you used match.data() was to extract the propensity scores for only the matched data. The propensity scores for the whole sample are stored in m.df$distance. So, to manually generate those plots, you can use:
plot(density(m.df$distance[df$Near == 1]))
plot(density(m.df$distance[df$Near == 0]))
before using match.data(), so that df still refers to the full, unmatched sample.
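For completeness, a minimal sketch (my own composition, not part of the answer) that draws both groups in each panel, assuming df still holds the original data (i.e., the df <- mdf reassignment is skipped) and mdf is the matched data from match.data() above:

# Before/after density plots: solid = treated (Near == 1), dashed = control
par(mfrow = c(1, 2))
plot(density(m.df$distance[df$Near == 1]),
     main = "Before matching", xlab = "Propensity score")
lines(density(m.df$distance[df$Near == 0]), lty = 2)
plot(density(mdf$pscore[mdf$Near == 1]),
     main = "After matching", xlab = "Propensity score")
lines(density(mdf$pscore[mdf$Near == 0]), lty = 2)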
You can also use the cobalt package to automatically generate these plots:
bal.plot(m.df, var.name = "distance", which = "both")
will generate the same density plots in one simple line of code.
I'm trying to do propensity score matching (full matching) on complex survey data. I first estimated the propensity scores using the MatchIt package in R and diagnosed the covariate balance.
A problem occurred, however, when I tried to use svyglm in the survey package to estimate the treatment effect. To use svyglm, the complex survey design must be declared through svydesign in the survey package, and I wonder which weight should be used at this point.
After full matching there are two kinds of weights: the survey weight and the matching weight.
[Figure: a snippet of the data showing both weight columns]
In the figure above, wt_itvex is the survey weight and weights is the matching weight generated after matching.
# The weights = ~NULL part is what I am unsure about
design_org <- svydesign(id = ~psu, strata = ~kstrata, weights = ~NULL,
                        data = full_att1_data)
fit_boot <- svyglm(formula, data = md_boot,
                   family = quasibinomial(link = "logit"),
                   design = design_org)
How do I specify the weights argument of svydesign() so that both weights are applied (the weights = ~NULL part above)?
My first thought was to use the product of the two weights, as in the code below.
design_org <- svydesign(id = ~psu, strata = ~kstrata,
                        weights = ~wt_itvex * weights, data = full_att1_data)
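One note from me (not from the original thread): arithmetic inside a one-sided formula is not reliably evaluated by svydesign(), so a safer way to try this idea is to compute the combined weight as a column first. A sketch, where comb_wt is a hypothetical column name:

# Hypothetical combined weight: survey weight times matching weight
full_att1_data$comb_wt <- full_att1_data$wt_itvex * full_att1_data$weights
design_org <- svydesign(id = ~psu, strata = ~kstrata,
                        weights = ~comb_wt, data = full_att1_data)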
I am doing a counterfactual impact evaluation on survival data. More precisely, I am trying to evaluate the impact of vocational training on time spent in unemployment, using the Kaplan-Meier estimator of the survival curve (package survival).
Before the Kaplan-Meier step, I use coarsened exact matching (the target is the ATT) to make the control and treatment groups comparable in terms of pretreatment covariates (package MatchIt).
For the Kaplan-Meier estimator, I have to use the weights from the matching, which works well using the weights option and robust standard errors of survfit:
library(survival)
library(survminer)
# Weighted Kaplan-Meier curves with robust standard errors
kp_cem <- survfit(Surv(time = time_cem, event = status_cem) ~ treatment_cem,
                  data = data_impact_cem, robust = TRUE, weights = weights)
However, when I try to use a log-rank test to check for a difference in survival curves between the treatment and control groups, I cannot take the frequency weights from the matching into account, so the test statistic is not correct.
# survdiff has no weights argument, so the matching weights are ignored here
log_rank <- survdiff(Surv(time = time_cem, event = status_cem) ~ treatment_cem,
                     data = data_impact_cem, rho = 0)
I tried the pval = TRUE option of ggsurvplot (package survminer), but the problem is the same: the frequency weights are not taken into account.
How can I include frequency weights in survdiff? Are there other packages that compute a log-rank test while taking into account frequency weights (obtained after matching)?
There are at least two ways to do this:
First, you can use the survey::svylogrank function, as @IRTFM suggests. This will treat the weights as sampling weights, but I think that's ok with the robust standard errors that svylogrank uses.
Second, you can use survival::coxph. The log-rank test is the score test in a Cox model, and coxph takes frequency weights. Use robust = TRUE if you want a robust score test; it will be at the bottom of the output of summary(your_cox_model), and you can extract it as summary(your_cox_model)$robscore.
Thank you very much @Thomas Lumley and @IRTFM for your answers.
Here is how I applied your two suggestions (I added some comments and references).
1. Using survey::svylogrank
I don't feel very comfortable using sampling weights when what I actually have are frequency weights.
How should I specify the survey design? The weights come from coarsened exact matching (matchit with method = "cem"), which is a form of stratum matching. Should I specify both the strata and the weights in the survey design? The MatchIt vignette Estimating Effects After Matching suggests using only the weights and robust standard errors in the survival analysis, not the strata (p. 27).
Here is how I specify the design and obtain the log-rank test with the survey package, taking the weights from matching into account:
library(survey)
# Design built from the matching output: subclasses as strata,
# matching weights as design weights
design_weights <- svydesign(id = ~ibis, strata = ~subclass,
                            weights = ~weights, data = data_impact_cem)
log_rank <- svylogrank(Surv(time = time_cem, event = status_cem) ~ treatment_cem,
                       design = design_weights, rho = 0)
2. Using survival::coxph
Thank you for this piece of information. Being quite new to survival analysis, I had overlooked this nice equivalence between the score test of a Cox model and the log-rank test. For anyone wanting more detail on the subject, I found this book very instructive: Moore, D. (2016). Applied Survival Analysis Using R. New York, NY: Springer (p. 58).
I find this second option more attractive than the first, which involves survey. Here is how I apply it:
library(survival)
# Weighted Cox model; robust = TRUE yields a robust score test,
# the weighted analogue of the log-rank test
cox_cem <- coxph(Surv(time = time_cem, event = status_cem) ~ treatment_cem,
                 data = data_impact_cem, robust = TRUE, weights = weights)
sum_cox_cem <- summary(cox_cem)
# Extract the robust score statistic and its p-value
score_test <- round(sum_cox_cem$robscore[["test"]], 3)
pvalue <- sum_cox_cem$robscore[["pvalue"]]
pvalue <- if (pvalue < 0.001) "<0.001" else round(pvalue, 3)
Here is the difference between the two test statistics (quite close in the end):
[Figure: the two log-rank test statistics side by side]
Still, I wonder why the weights option does not exist in survdiff.
I'm trying to perform propensity score matching on survey data. I'm aware that the MatchIt package can carry out the matching procedure, but can I include the individual survey weights in some way? If I don't account for them, a less relevant observation can be matched with a more relevant one. Thank you!
Update 2020-11-25: see the note at the end of this answer.
Survey weights cannot be used with matching in this way. You might consider using weighting, which can accommodate survey weights. With weighting, you estimate the propensity score weights using a model that accounts for the survey weights, and then multiply the estimated weights by the survey weights to arrive at your final set of weights.
This can be done using the weighting companion to the MatchIt package, WeightIt (of which I am the author). With your treatment A, outcome Y (I assume continuous for this demonstration), covariates X1 and X2, and sampling weights S, you could run the following:
library(WeightIt)

# Estimate the propensity score weights
w.out <- weightit(A ~ X1 + X2, data = data, s.weights = "S",
                  method = "ps", estimand = "ATT")

# Combine the estimated weights with the survey weights
att.weights <- w.out$weights * data$S

# Fit the outcome model with the weights
fit <- lm(Y ~ A, data = data, weights = att.weights)

# Estimate the effect of treatment and its robust standard error
lmtest::coeftest(fit, vcov. = sandwich::vcovHC)
It's critical that you assess balance after estimating the weights; you can do that using the cobalt package, which works with WeightIt objects and automatically incorporates the sampling weights into the balance statistics. Prior to estimating the effect, you would run the following:
cobalt::bal.tab(w.out, un = TRUE)
Only if balance was achieved would you continue on to estimating the treatment effect.
There are other ways to estimate weights besides using logistic regression propensity scores. WeightIt provides support for many methods, and almost all of them support sampling weights. The documentation for each method explains whether sampling weights are supported.
MatchIt 4.0.0 now supports survey weights through the s.weights argument, just like WeightIt. This supplies the survey weights to the model used to estimate the propensity scores but otherwise does not affect the matching. If you want units to be paired with other units that have similar survey weights, you should enter the survey weights as a variable to match on or to place a caliper on.
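A minimal sketch of the updated interface, reusing the A, X1, X2, and S placeholders from the answer above (assuming MatchIt >= 4.0.0):

library(MatchIt)
# S informs the propensity score model via s.weights but does not
# otherwise affect how units are paired
m.out <- matchit(A ~ X1 + X2, data = data, s.weights = ~S,
                 method = "nearest", estimand = "ATT")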
Here is how I do propensity score matching in R:
library(MatchIt)
library(MASS)  # for glm.nb
# Subclassification on the propensity score with 6 subclasses
m.out <- matchit(treat ~ x1 + x2, data = Newdata, method = "subclass", subclass = 6)
dta_m <- match.data(m.out)
# Negative binomial outcome model on the matched data
propensity <- glm.nb(y ~ treat + x1 + x2 + treat:x1 + treat:x2, data = dta_m)
summary(propensity)
Thereinto,"treat" is a dummy variable.
I want to see the accuracy of matching function (matchit), Hence I want to get Area under the ROC curve. My question is how to get AUC in PSM?
Thank you.
You should not do this. See my answer here. Several studies have shown that there is no correspondence between the AUC of a propensity score model (aka the C-statistic) and its performance. That said, the propensity scores are stored in the distance component of the matchit output object, so you can take those and the treatment vector and put them into a function that computes the AUC from these values. I don't know of a function to do this because, as I mentioned, it's not good practice to do this with propensity scores.
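For the curious, a minimal sketch of that computation (my addition; it uses the third-party pROC package, which the answer does not endorse, together with the m.out object and Newdata from the question):

library(pROC)
# AUC of the propensity score model: treatment vector vs. estimated scores
roc_obj <- roc(response = Newdata$treat, predictor = m.out$distance)
auc(roc_obj)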
I'm using svydesign from the survey package in R to run survey-weighted logit regressions as follows:
# Declare the survey design: no clusters (id = ~0), stratified, weighted
sdobj <- svydesign(id = ~0, weights = ~chweight, strata = ~strata, data = svdat)
model1 <- svyglm(formula = formula1, design = sdobj, family = quasibinomial)
However, the documentation states a caveat about regressions without a finite population correction (FPC):
"If fpc is not specified then sampling is assumed to be with replacement at the top level and only the first stage of clustering is used in computing variances."
Unfortunately, I do not have enough information to specify the population size at each level (of which I am sampling very little). Any advice on how to specify survey weights without FPC information would be very helpful.
You're doing it right. "With replacement" is survey statistics jargon for what you want in this case.
If the sampling fraction is low, it is standard to use an approximation that would be exact if the sampling fraction were infinitesimal or sampling were with replacement. No-one actually does surveys with replacement, but the approximation is almost universal. With this approximation you don't need to supply fpc, and conversely, if you don't supply fpc, svydesign() assumes you want this approximation.
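For contrast, a sketch of what supplying an FPC would look like if the stratum population sizes were known (pop_size is a hypothetical column; omitting fpc, as in the question, gives the with-replacement approximation):

# Hypothetical: only possible when stratum population sizes are available
sdobj_fpc <- svydesign(id = ~0, weights = ~chweight, strata = ~strata,
                       fpc = ~pop_size, data = svdat)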