Propensity score matching with individual weights - r

I'm trying to perform propensity score matching on survey data. I'm aware of the MatchIt package, which can carry out the matching procedure, but is there a way to include the individual (survey) weights? If I don't take them into account, a less relevant observation can be matched with a more relevant one. Thank you!

Update 2020-11-25 below this answer.
Survey weights cannot be used with matching in this way. You might consider using weighting, which can accommodate survey weights. With weighting, you estimate the propensity score weights using a model that accounts for the survey weights, and then multiply the estimated weights by the survey weights to arrive at your final set of weights.
This can be done using the weighting companion to the MatchIt package, WeightIt (of which I am the author). With your treatment A, outcome Y (I assume continuous for this demonstration), covariates X1 and X2, and sampling weights S, you could run the following:
#Estimate the propensity score weights
w.out <- weightit(A ~ X1 + X2, data = data, s.weights = "S",
method = "ps", estimand = "ATT")
#Combine the estimated weights with the survey weights
att.weights <- w.out$weights * data$S
#Fit the outcome model with the weights
fit <- lm(Y ~ A, data = data, weights = att.weights)
#Estimate the effect of treatment and its robust standard error
lmtest::coeftest(fit, vcov. = sandwich::vcovHC)
It's critical that you assess balance after estimating the weights; you can do that using the cobalt package, which works with WeightIt objects and automatically incorporates the sampling weights into the balance statistics. Prior to estimating the effect, you would run the following:
cobalt::bal.tab(w.out, un = TRUE)
Only if balance was achieved would you continue on to estimating the treatment effect.
There are other ways to estimate weights besides using logistic regression propensity scores. WeightIt provides support for many methods, and almost all of them support sampling weights. The documentation for each method explains whether sampling weights are supported.
Update 2020-11-25: MatchIt 4.0.0 now supports survey weights through the s.weights argument, just like WeightIt. This supplies survey weights to the model used to estimate the propensity scores but otherwise does not affect the matching. If you want units to be paired with other units that have similar survey weights, you should enter the survey weights as a variable to match on or to place a caliper on.

Related

Covariate balance after weighting by the inverse of the propensity score

My analysis focuses on causal inference. I am using the inverse of the propensity scores to form weights (the propensity score is the probability of receiving a treatment, or intervention, given a set of covariates).
My question is: how do I assess covariate balance before and after weighting?
I know there are packages that can do this, but I want to write it by hand rather than use a package.
Here is an example:
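X1 <- c(1,1,1,0,0,1) # Covariate
X2 <- c(0,1,1,0,1,0) # Covariate
X3 <- c(1,0,1,1,1,0) # Treatment
X4 <- c(1,0,1,1,0,0) # Outcome
data <- data.frame(X1, X2, X3, X4)
# Fit the propensity score model on all units (treated and untreated)
model <- glm(X3 ~ X1 + X2, family = "binomial", data = data)
propensity_score <- predict(model, newdata = data, type = "response")
# Inverse probability weights: 1/ps for treated, 1/(1 - ps) for untreated
weights <- ifelse(data$X3 == 1, 1 / propensity_score, 1 / (1 - propensity_score))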
The goal is to see whether the covariates are balanced after weighting by the inverse of the propensity score (I know the general idea but am not familiar with the theory behind it).
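One way to check balance by hand is to compute the standardized mean difference (SMD) of each covariate between treatment groups, before and after weighting. The following is a minimal sketch in base R on simulated data. The helper name smd, the simulated variables x and a, and the choice to standardize by the unweighted pooled standard deviation are assumptions made for illustration, not part of the question.

```r
# Simulated example: one confounder x and a binary treatment a (illustrative names)
set.seed(123)
n <- 2000
x <- rnorm(n)
a <- rbinom(n, 1, plogis(x))   # treatment probability depends on x

# Propensity scores from a logistic regression fit on all units
ps <- fitted(glm(a ~ x, family = binomial))

# Inverse probability weights: 1/ps for treated, 1/(1 - ps) for untreated
w <- ifelse(a == 1, 1 / ps, 1 / (1 - ps))

# Standardized mean difference with optional weights.
# The denominator uses the unweighted group variances so that the
# before and after values are on the same scale.
smd <- function(x, a, w = rep(1, length(x))) {
  m1 <- weighted.mean(x[a == 1], w[a == 1])
  m0 <- weighted.mean(x[a == 0], w[a == 0])
  s  <- sqrt((var(x[a == 1]) + var(x[a == 0])) / 2)
  (m1 - m0) / s
}

smd_before <- smd(x, a)      # unweighted
smd_after  <- smd(x, a, w)   # weighted
c(before = smd_before, after = smd_after)
```

An absolute SMD close to zero after weighting (0.1 is a commonly used threshold) suggests that the covariate is balanced. The same calculation would be repeated for each covariate.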

Longitudinal analysis using sampling weights in R

I have longitudinal data from two surveys and I want to do a pre-post analysis. Normally, I would use survey::svyglm() or svyVGAM::svy_vglm (for multinomial family) to include sampling weights, but these functions don't account for the random effects. On the other hand, lme4::lmer accounts for the repeated measures, but not the sampling weights.
For continuous outcomes, I understand that I can do
w_data_wide <- svydesign(ids = ~1, data = data_wide, weights = data_wide$weight)
svyglm((post-pre) ~ group, w_data_wide)
and get the same estimates that I would get if I could use lmer(outcome ~ group*time + (1|id), data_long) with weights [please correct me if I'm wrong].
However, for categorical variables, I don't know how to do the analyses. WeMix::mix() has a parameter weights, but I'm not sure if it treats them as sampling weights. Still, this function can't support multinomial family.
To summarize: how can I do a pre-post analysis of categorical outcomes with 2 or more levels? Any tips about R packages/functions and how to use or write them would be appreciated.
I give below some data sets with binomial and multinomial outcomes:
library(data.table)
set.seed(1)
data_long <- data.table(
id=rep(1:5,2),
time=c(rep("Pre",5),rep("Post",5)),
outcome1=sample(c("Yes","No"),10,replace=T),
outcome2=sample(c("Low","Medium","High"),10,replace=T),
outcome3=rnorm(10),
group=rep(sample(c("Man","Woman"),5,replace=T),2),
weight=rep(c(1,0.5,1.5,0.75,1.25),2)
)
data_wide <- dcast(data_long, id~time, value.var = c('outcome1','outcome2','outcome3','group','weight'))[, `:=` (weight_Post = NULL, group_Post = NULL)]
EDIT
As I said below in the comments, I've been using lmer and glmer with the variables used to calculate the weights as predictors. It turns out that glmer runs into a lot of problems (convergence, high eigenvalues...), so I took another look at Thomas Lumley's answer in this post and others (https://stat.ethz.ch/pipermail/r-help/2012-June/315529.html | https://stats.stackexchange.com/questions/89204/fitting-multilevel-models-to-complex-survey-data-in-r).
So my question now is whether I can use participant ids as clusters in svydesign
library(survey)
w_data_long_cluster <- svydesign(ids = ~id, data = data_long, weights = data_long$weight)
summary(svyglm(factor(outcome1) ~ group*time, w_data_long_cluster, family="quasibinomial"))
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.875e+01  1.000e+00  18.746   0.0339 *
groupWoman         -1.903e+01  1.536e+00 -12.394   0.0513 .
timePre             5.443e-09  5.443e-09   1.000   0.5000
groupWoman:timePre  2.877e-01  1.143e+00   0.252   0.8431
and still interpret groupWoman:timePre as differences in the average rate of change/improvement in the outcome over time between sex groups, as if I was using mixed models with participants as random effects.
Thank you once again!
A linear model with svyglm does not give the same parameter estimates as lme4::lmer. It does estimate the same parameters as lme4::lmer if the model is correctly specified, though.
Generalised linear models with svyglm or svy_vglm don't estimate the same parameters as lme4::glmer, as you note. However, they do estimate perfectly good regression parameters, and if you aren't specifically interested in the variance components or in estimating the realised random effects (BLUPs), I would recommend just using svyglm or svy_vglm.
Another option if you have non-survey software for random effects versions of the models is to use that. If you scale the weights to sum to the sample size and if all the clustering in the design is modelled by random effects in the model, you will get at least a reasonable approximation to valid inference. That's what I've seen recommended for Bayesian survey modelling, for example.
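Rescaling the weights so that they sum to the sample size is a one-line computation. A minimal sketch, assuming a numeric vector of sampling weights (the name w and its values are hypothetical):

```r
# Hypothetical sampling weights for 5 respondents
w <- c(120, 60, 180, 90, 150)

# Rescale so that the weights sum to the number of observations
w_scaled <- w * length(w) / sum(w)

w_scaled       # 1.00 0.50 1.50 0.75 1.25
sum(w_scaled)  # 5
```

The rescaled vector could then be passed to the weights argument of the random-effects software. Whether that argument is interpreted in a way that suits sampling weights depends on the software, so its documentation should be checked.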

AUC of Propensity Score Matching in R

Here is how I do propensity score matching in R:
library(MatchIt)
library(MASS) # provides glm.nb()
m.out <- matchit(treat ~ x1 + x2, data = Newdata, method = "subclass", subclass = 6)
dta_m <- match.data(m.out)
propensity <- glm.nb(y ~ treat + x1 + x2 + treat:x1 + treat:x2, data = dta_m)
summary(propensity)
Here, "treat" is a dummy variable.
I want to assess the accuracy of the matching function (matchit), so I would like to compute the area under the ROC curve (AUC). How can I get the AUC in propensity score matching?
Thank you.
You should not do this. See my answer here. Several studies have shown that there is no correspondence between the AUC of a propensity score model (aka the C-statistic) and its performance. That said, the propensity scores are stored in the distance component of the matchit output object, so you can take those and the treatment vector and put them into a function that computes the AUC from these values. I don't know of a function to do this because, as I mentioned, it's not good practice to do this with propensity scores.
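For completeness, the C-statistic can be computed in base R from a vector of scores and a binary treatment vector through its rank (Mann-Whitney) form. This is only a sketch: the function name auc_rank and the toy vectors are made up for illustration, and with a matchit object the inputs would be the distance component and the treatment column. As stated above, the result says little about how well the propensity score model performs for matching.

```r
# AUC via the Mann-Whitney rank formula (ties receive average ranks)
auc_rank <- function(score, treat) {
  r  <- rank(score)
  n1 <- sum(treat == 1)
  n0 <- sum(treat == 0)
  (sum(r[treat == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy inputs (illustrative only)
score <- c(0.10, 0.40, 0.35, 0.80)
treat <- c(0, 0, 1, 1)
auc_rank(score, treat)  # 0.75
```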

Cox regression with Inverse Propensity Treatment Weighting

A standard Cox regression is fit as follows:
coxph(formula = Surv(time, status) ~ v1 + v2 + v3, data = x)
I've calculated inverse probability of treatment weighting (IPTW) weights from propensity scores.
Propensity scores can be calculated as follows:
ps_model <- glm(treat ~ v1 + v2 + v3, family = "binomial", data = x)
ps <- predict(ps_model, type = "response") # fitted probabilities of treatment
The weights used for IPTW are calculated as follows:
x$weight <- ifelse(x$treat == 1, 1 / ps, 1 / (1 - ps))
Every subject in the dataset gets a specific weight, calculated as above, but I see no place to put the weights in the 'normal' Cox regression formula.
Is there a version of Cox regression in which the calculated weights can be assigned to each subject, and which R package or code is used for these calculations?
Propensity score weighting method
(inverse probability weighting method)
R was used for the following statistical analysis.
Load the following R packages:
library(ipw)
library(survival)
Estimate the propensity score for each ID in your data frame (base_model) from the covariates.
The propensity score is the probability of being assigned to treatment given the covariates (v).
PS estimation
ps_model <- glm(treatment~v1+v2+v3...., family = binomial, data = base_model)
summary(ps_model)
# view propensity score values
pscore <- ps_model$fitted.values
base_model$propensityScore <- predict(ps_model, type = "response")
Calculate weights
#estimate weight for each patient
base_model$weight.ATE <- ifelse((base_model$treatment=="1"),(1/base_model$propensityScore), (1/(1-base_model$propensityScore)))
base_weight <- ipwpoint(exposure = treatment, family = "binomial", link = "logit", numerator = ~1, denominator = ~v1+v2+v3....vn, data = base_model, trunc = 0.05) #truncation of 5% for few extreme weights if needed
base_model$weights.trunc <- base_weight$weights.trunc #store the truncated weights in the data frame
Survival analysis: Cox regression
#time to event analysis with weights
HR5 <- coxph(Surv(time, event) ~ as.factor(treatment), weights = weights.trunc, data = base_model)
summary(HR5)
The weights argument takes the weights estimated earlier.
The cobalt or tableone packages can help you check the balance in characteristics before and after propensity score weighting.
Good luck!
You can do this using the DIVAT dataset from the IPWsurvival package:
##Generate ID
DIVAT$ID<- 1:nrow(DIVAT)
Here we compute the IPTW weights for the average treatment effect (ATE) rather than for the average treatment effect among the treated (ATT):
DIVAT$p.score <- glm(retransplant ~ age + hla, data = DIVAT,
family = "binomial")$fitted.values
DIVAT$ate.weights <- with(DIVAT, retransplant * 1/p.score + (1-retransplant)* 1/(1-p.score))
Then we can fit a Cox regression
####COX without weight
coxph(Surv(times, failures)~ retransplant, data=DIVAT)->fit
summary(fit)
Adding weight is quite easy
###COX with weight naive model
coxph(Surv(times, failures)~ retransplant, data=DIVAT, weights = ate.weights)->fit
summary(fit)
###COX with weight and robust estimation
coxph(Surv(times, failures)~ retransplant + cluster(ID), data=DIVAT, weights = ate.weights)->fit
summary(fit)
However, the standard error estimated this way is biased (see Austin, Peter C. "Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis." Statistics in Medicine 35.30 (2016): 5642-5655).
Austin suggests relying on a bootstrap estimator. However, I'm stuck too, since I haven't found a way to perform this kind of analysis. If you find an answer, please let me know.
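One possible way to set up such a bootstrap is to resample subjects with replacement and, within each resample, re-estimate the propensity scores, recompute the weights, and refit the weighted Cox model. The spread of the resulting log hazard ratios then serves as the standard error estimate. The sketch below uses simulated data rather than DIVAT, and the data-generating values, the helper name iptw_loghr, and the number of replicates are all assumptions for illustration.

```r
library(survival)

# Simulated data (illustrative values only)
set.seed(42)
n <- 300
dat <- data.frame(age = rnorm(n, 50, 10), hla = rbinom(n, 1, 0.3))
dat$treat <- rbinom(n, 1, plogis(-0.5 + 0.03 * (dat$age - 50) + 0.5 * dat$hla))
event_time <- rexp(n, rate = 0.1 * exp(0.4 * dat$treat + 0.02 * (dat$age - 50)))
cens_time  <- rexp(n, rate = 0.05)
dat$times    <- pmin(event_time, cens_time)
dat$failures <- as.integer(event_time <= cens_time)

# Full IPTW pipeline on one data set: propensity model, ATE weights,
# weighted Cox model; returns the log hazard ratio for treatment
iptw_loghr <- function(d) {
  ps  <- fitted(glm(treat ~ age + hla, family = binomial, data = d))
  d$w <- ifelse(d$treat == 1, 1 / ps, 1 / (1 - ps))
  fit <- coxph(Surv(times, failures) ~ treat, data = d, weights = w)
  unname(coef(fit))
}

est <- iptw_loghr(dat)   # point estimate on the original data

# Resample rows with replacement and repeat the whole pipeline each time
B    <- 200
boot <- replicate(B, iptw_loghr(dat[sample(nrow(dat), replace = TRUE), ]))

se_boot <- sd(boot)                          # bootstrap standard error
ci_boot <- quantile(boot, c(0.025, 0.975))   # percentile interval
```

Re-estimating the propensity scores inside each replicate is the point of the exercise: it lets the bootstrap reflect the uncertainty in the estimated weights, which the naive weighted model ignores.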

Survey Weighted Regression Without FPC in R

I'm using the svydesign() function from the survey package in R to run survey-weighted logit regressions as follows:
sdobj <- svydesign(id = ~0, weights = ~chweight, strata = ~strata, data = svdat)
model1 <- svyglm(formula=formula1,design=sdobj,family = quasibinomial)
However, the documentation states a caveat about regressions without specifying finite population corrections (FPC):
If fpc is not specified then sampling is assumed to be
with replacement at the top level and only the first stage of
cluster is used in computing variances.
Unfortunately, I do not have sufficient information to specify my populations at each level (of which I am sampling very little). Any information on how to specify survey weights without FPC information would be very helpful.
You're doing it right. "With replacement" is survey statistics jargon for what you want in this case.
If the sampling fraction is low, it is standard to use an approximation that would be exact if the sampling fraction were infinitesimal or sampling were with replacement. No-one actually does surveys with replacement, but the approximation is almost universal. With this approximation you don't need to supply fpc, and conversely, if you don't supply fpc, svydesign() assumes you want this approximation.
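The size of the approximation can be checked with the textbook formula for simple random sampling without replacement, where the variance of a sample mean is multiplied by the finite population correction 1 - n/N. A small sketch with made-up sample and population sizes:

```r
# Hypothetical sizes: 500 sampled units from a population of 1,000,000
n <- 500
N <- 1e6

fpc_factor <- 1 - n / N         # multiplies the variance
se_ratio   <- sqrt(fpc_factor)  # SE with FPC divided by SE without FPC

fpc_factor  # 0.9995
se_ratio
```

With a sampling fraction this small, the two standard errors differ by well under a tenth of a percent, which is why leaving out fpc does no harm in that situation.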
