I would like to use nearest neighbour matching with MatchIt in R.
So far I have used the following code:
Matching <- matchit(Treatment ~ Size + Age + Expenses, data = data, method = "nearest", distance = "glm", replace = TRUE)
I have two questions:
Question 1.)
When I run this code and then print Matching, I get a summary.
One line then says
A matchit object
- method: 1:1 nearest neighbor matching with replacement
I want the same control observation to be matched multiple times if needed.
Is the code above doing that?
I am confused because it says "1:1 nearest neighbor matching with replacement". I don't know whether the "1:1" part means that an observation in the control group is used no more than once. However, since I set replace=TRUE in the code, I thought that one observation in the control group could be matched several times.
Could someone explain to me if my understanding is correct?
Question 2.)
After having run
Matching <- matchit(Treatment ~ Size + Age + Expenses, data = data, method = "nearest", distance = "glm", replace = TRUE)
I would like to estimate the average treatment effect.
I use the following document as a reference on how to estimate it:
https://cran.r-project.org/web/packages/MatchIt/vignettes/estimating-effects.html#after-pair-matching-with-replacement
However, I would like to use clustered standard errors by subclass and id.
Therefore, I need to first write the code:
Matching_gm <- get_matches(Matching)
When I look at the weights of Matching_gm they are always 1. However, when I run summary(Matching$weights) there are many weights that are different from 1.
Why do the weights change when I use get_matches ? As far as I know, this should not be the case.
It is called 1:1 matching because each treated unit gets one match, but it is possible that control units are reused as matches for multiple treated units. If you set ratio to something other than 1, you would be doing k:1 matching, where each treated unit gets k matches, composed of controls that may be reused for other treated units.
get_matches() produces a row for each unit for each time it is matched. If a control unit is matched twice (i.e., to two different treated units), it will have two rows each with a weight of 1 in the get_matches() output, but it will have a weight of 2 in the matchit() output (though this weight may be scaled to be different from 2). If you use match.data() instead of get_matches(), you will see that each unit receives only one row and the weights for each control unit are the same as in the matchit() output.
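To make the two layouts concrete, here is a minimal base-R sketch that does not call MatchIt at all. The pair table `pairs` and its unit names are made up for illustration, and dividing by the mean is an assumption about how the scaling works, based on the MatchIt documentation.

```r
# Hypothetical 1:1 matches with replacement: each row is one treated-control pair.
pairs <- data.frame(
  treated = c("t1", "t2", "t3", "t4"),
  control = c("c1", "c1", "c2", "c1"),
  stringsAsFactors = FALSE
)

# get_matches()-style layout: one row per control per pairing, each with weight 1.
long <- data.frame(id = pairs$control, subclass = seq_len(nrow(pairs)), weights = 1)

# match.data()-style layout: one row per control, weight = number of times it was used.
times_used <- table(long$id)
wide <- data.frame(id = names(times_used), weights = as.numeric(times_used))

# Assumed scaling step: divide control weights by their mean so that they
# sum to the number of distinct matched controls.
wide$scaled <- wide$weights / mean(wide$weights)

print(long)
print(wide)
```

In this toy case c1 appears in three rows of `long` with weight 1 each, but in a single row of `wide` with an unscaled weight of 3.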
I'm using the MatchIt package in R to perform propensity score matching. Matching method is nearest neighbour and distance measure is glm. When I leave ratio and caliper set to the defaults or if I enter a non default value for just one of them, I get the distribution of weights that I expect without replacement, i.e. all 0 or 1. However, when I set both to non default values I get some weights higher than 1 and some between 0 and 1, which is a distribution I associate with replacement.
Am I misunderstanding the difference between sampling with and without replacement, or is this a case that overrides the replace=F argument? I have read the package documentation but it is very possible that I missed or misunderstood the section that explains this. If so, feel free to simply direct me to the relevant section!
A (hopefully) reproducible example:
set.seed(42)
DF<-data.frame(Group=factor(c(rep("Treatment",40),rep("Control",360))),
mVar1=factor(c(sample(LETTERS[c(1,1,1,2)],40,replace=T),sample(LETTERS[c(1,2)],360,replace=T))),
mVar2=factor(c(sample(LETTERS[c(3,3,4,4,4,5)],40,replace=T),sample(LETTERS[c(3,4,5)],360,replace=T))),
mVar3=c(rpois(40,3),rpois(360,1)))
str(DF)
(m1<-matchit(Group~mVar1+mVar2+mVar3,data=DF,method="nearest",distance="glm",ratio=3,replace=F))
plot(m1,type="jitter",interactive=FALSE)
hist(m1$weights)
(m2<-matchit(Group~mVar1+mVar2+mVar3,data=DF,method="nearest",distance="glm",caliper=0.1,replace=F))
plot(m2,type="jitter",interactive=FALSE)
hist(m2$weights)
(m3<-matchit(Group~mVar1+mVar2+mVar3,data=DF,method="nearest",distance="glm",ratio=3,caliper=0.1,replace=F))
plot(m3,type="jitter",interactive=FALSE)
hist(m3$weights)
Thanks!
Matching weights are computed using the formula described at ?matchit. This formula is used when matching without replacement, as you are doing. The formula is as follows:
Each unit is assigned to a subclass, which represents the pair they
are a part of (in the case of k:1 matching) or the stratum they belong
to (in the case of exact matching, coarsened exact matching, full
matching, or subclassification). The formula for computing the weights
depends on the argument supplied to estimand. A new stratum
"propensity score" (p) is computed as the proportion of units in each
stratum that are in the treated group, and all units in that stratum
are assigned that propensity score. Weights are then computed using
the standard formulas for inverse probability weights: for the ATT,
weights are 1 for the treated units and p/(1-p) for the control units;
for the ATC, weights are (1-p)/p for the treated units and 1 for the
control units; for the ATE, weights are 1/p for the treated units and
1/(1-p) for the control units.
...
In each treatment group, weights are divided by the mean of the
nonzero weights in that treatment group to make the weights sum to the
number of units in that treatment group.
When using a constant matching ratio (e.g., every treated unit gets 1 match or every treated unit gets 3 matches), the weights will be constant for all control units. Otherwise, the weights will vary across control units, which is what you are seeing: with both ratio = 3 and a caliper, some treated units likely receive fewer than three matches, so the matching ratio is no longer constant. Varying weights also occur when matching with replacement, but they can equally arise from variable-ratio matching or full matching, both of which are done without replacement.
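As a worked illustration of the quoted formula, the following base-R sketch computes ATT weights for three made-up matched sets in which the treated unit received 3, 2, and 1 controls. The set sizes are assumptions chosen for the example, not output from the question's data.

```r
# Hypothetical matched sets under ATT: one treated unit plus 3, 2, or 1 controls.
n_treated  <- c(1, 1, 1)
n_controls <- c(3, 2, 1)

# Stratum "propensity score": share of treated units in each matched set.
p <- n_treated / (n_treated + n_controls)

# ATT weights: 1 for treated units, p / (1 - p) for each control in the set.
w_control_by_set <- p / (1 - p)

# Expand to one weight per control unit, then divide by the mean so that
# the control weights sum to the number of matched controls.
w_control <- rep(w_control_by_set, times = n_controls)
w_scaled  <- w_control / mean(w_control)

print(round(w_scaled, 3))
```

Controls from the set of three end up below 1, the control matched alone ends up above 1, and no control is used twice, so varying weights by themselves do not indicate replacement.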
To see if you are actually matching with replacement, run table(table(m3$match.matrix)). table(m3$match.matrix) tells you how many times each control unit is used as a match, and running table() on that output tells you how many times a control unit is used each number of times. You will see that each control unit is used only once, so there will be only one entry in the output of table(table()), indicating matching is being done without replacement.
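The same check can be tried on a small hand-made matrix. This is only a sketch: `mm` is a hypothetical stand-in for m3$match.matrix, with NA marking match slots that were left unfilled.

```r
# Hypothetical match matrix: rows are treated units, columns are their matches;
# NA marks a slot that could not be filled.
mm <- matrix(c("c1", "c2", "c3",
               "c4", "c5", NA,
               "c6", NA,   NA),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("t1", "t2", "t3"), NULL))

uses <- table(mm)             # times each control appears (NA is dropped by default)
reuse_summary <- table(uses)  # how many controls appear once, twice, ...
print(reuse_summary)

# Number of matches each treated unit received; unequal counts mean a variable ratio.
matches_per_treated <- rowSums(!is.na(mm))
print(matches_per_treated)
```

A single entry named "1" in `reuse_summary` means no control was reused, while the unequal row counts show why the weights can still differ across controls.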
MatchIt in R: how to match at a ratio of 1 control to 2 treated units? (20 controls for 40 treated subjects)
The documentation describes ratio as: for methods that allow it, how many control units should be matched to each treated unit in k:1 matching.
Setting ratio = 0.5 is not possible.
How can I do this?
I think this isn't possible because it doesn't make sense. Your question implies that treated units can be paired, but that presupposes a prior step in which treated units are matched with each other, which is not part of what MatchIt is designed to do (and doesn't really make sense from an experimental design standpoint anyway).
Either that, or you have pairs of treated cases that match exactly, in which case you're not really gaining any analytical leverage by trying to use them both. If that's the case, I would just reduce the pairs to single observation and go with 1:1 matching.
This is actually a much harder question than it may seem. There is definitely no way to do this with nearest neighbor matching. In theory, it should be possible with optimal matching, e.g., using method = "full" or the optmatch package directly. Unfortunately, in my experiments, I am unable to get optmatch to do what you want. Ideally, you should be able to run
fullmatch(., min.controls = 1/2, max.controls = 1/2, mean.controls = 1/2)
but that doesn't actually produce the desired results. Maybe you can contact the optmatch authors with this problem, as it's an interesting one that their package should be able to solve.
There is another method you can use called cardinality matching, which will be implemented in the next release of MatchIt but can be accessed in the development version using devtools::install_github("ngreifer/MatchIt"). Cardinality matching uses optimization to select a subset of units that satisfies a sample size constraint and balance constraints set by the user. In this case, your sample size constraint is that the number of treated units is twice the number of control units. You need to set balance constraints, which are the maximum imbalance (in standardized mean difference units) allowable between the treated and control groups. If A is your treatment variable and X1 and X2 are the variables you are trying to match on, you could run the following code:
m <- matchit(A ~ X1 + X2, data = data, method = "cardinality",
estimand = "ATC", ratio = 2, tols = .01)
If a solution is possible, it will be produced. If not, you will have to relax the tolerance (i.e., increase tols) until it is. You can then run
summary(m, un = FALSE)
which will produce a summary of balance and the remaining sample size. If not all of your treated units are matched, then continue to increase tols until they are.
I am trying to match two samples on several covariates using MatchIt, but I am having difficulty creating samples that are similar enough. Both my samples are plenty large (~1000 in the control group, ~5000 in the comparison group).
I want to get a matched sample with participants as closely matched as possible and I am alright with losing sample size in the control group. Right now, MatchIt only returns two groups of 1000, whereas I want two groups that are very closely matched and would be fine with smaller groups (e.g., 500 instead of 1000).
Is there a way to do this through either MatchIt or another package? I would rather avoid using random sampling and then match if possible because I want as close a match between groups as possible.
Apologies for not having a reproducible example, I am still pretty new to using R and couldn't figure out how to make a sample of this issue...
Below is the code I have for matching the two groups.
data<- na.omit(data)
data$Group<- as.numeric(data$Group)
data$Group<- recode(data$Group, '1 = 1; 2 = 0')
m.out <- matchit(Group ~ Age + YearsEdu + Income + Gender, data = data, ratio = 1)
s.out <- summary(m.out, standardize = TRUE)
plot(s.out)
matched.data <- match.data(m.out)
MatchIt, like other similar packages, offers several matching routines that enable you to play around with the settings. Check out the argument method, which is set to method = 'nearest' by default. This means that, unless you specify otherwise, it will look for the best match for each of the treatment observations. In your case, you will always have 1000 paired matches with this setting.
You can choose to set it to method = 'exact', which is much more restrictive. In the documentation you will find:
This technique matches each treated unit to all
possible control units with exactly the same values on all the covariates, forming subclasses
such that within each subclass all units (treatment and control) have the same covariate values.
On the lalonde dataset, you can run:
m.out <- matchit(treat ~ educ + black + hispan, data = lalonde, method = 'exact')
summary(m.out)
As a consequence, it discards some of the treatment observations that could not get matched. Have a look at the other possibilities for method; maybe you will find something you like better.
That being said, be mindful not to discard too many treatment observations. If you do, you will make the treatment group look like the control group (instead of the opposite), which might lead to unwanted results.
You should look into the package designmatch, which implements a form of matching called cardinality matching that does what you want (i.e., find the largest matched set that yields desired balance). Unlike MatchIt, designmatch doesn't use a distance variable; instead, it uses optimization to solve the matching problem. You select exactly how balanced you want each covariate to be, and it will do its best to solve the problem while retaining as many matches as possible. The methodology is described in Zubizarreta, Paredes, & Rosenbaum (2014).
Hello I'm using the MatchIt package in R
I have a total of 116 unmatched treated cases and 462 unmatched non-treated cases
with the command
mod_match_logit = matchit(f.build("treatement_yes_or_no", covariates),
method = "nearest", distance = "logit", data = df, caliper = 0.05, ratio = 3)
I then get 91 matched treated cases and 248 matched non-treated cases. What I don't understand is that, with 3:1 matching, I should have 91*3 = 273 matched non-treated cases (and not 248). By default, replace is set to FALSE in MatchIt, so that doesn't explain the difference. What am I missing?
Without seeing the data I am only guessing but it is most likely due to your caliper setting.
MatchIt defines the caliper as "the number of standard deviations of the distance measure within which to draw control units (default = 0, no caliper matching)" (p. 26).
Therefore my guess is you have some units in the treatment group with high propensity scores that cannot be matched to those in the untreated group (at least within 0.05 standard deviations as you specified). The reason why you are not getting 273 subjects in your matched data set is because of the caliper = 0.05 setting in your MatchIt call. Some of the treated subjects with higher propensity scores that are still getting matched to at least one untreated are unable to get matched to a second or third because they are beyond the 0.05 caliper specification.
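The arithmetic behind this explanation can be sketched in a few lines of R. Only the totals (91 treated, 248 controls) come from the question; the split into units with three, two, or one match is hypothetical and is just one of many splits that reproduce those totals.

```r
# Hypothetical split of the 91 matched treated units by the number of
# controls found within the caliper. Only the totals come from the question.
n_with_3 <- 70
n_with_2 <- 17
n_with_1 <- 4

n_treated_matched  <- n_with_3 + n_with_2 + n_with_1
n_controls_matched <- 3 * n_with_3 + 2 * n_with_2 + 1 * n_with_1

# Controls "missing" relative to a full 3:1 match for every treated unit.
shortfall <- 3 * n_treated_matched - n_controls_matched

cat("treated:", n_treated_matched,
    "controls:", n_controls_matched,
    "shortfall:", shortfall, "\n")
```

Any split in which some treated units receive fewer than three matches produces a control count below 273, which is consistent with the caliper explanation.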
Maybe increasing the caliper would retain more treated subjects but I would not go any higher than 0.25 based on best practices documented in the literature.
Depending on your research design you could consider using other matching methods. For example, you could use distances other than Euclidean, such as Mahalanobis, which is an option in MatchIt. Alternatively, you could use either optimal full matching or optimal pair matching from the 'optmatch' package, which you can also call through the MatchIt function. There are many other approaches, but these are easily accessible from MatchIt. The literature does suggest trying a few different methods and then checking for balance, as long as you do not "cherry-pick" the one that gives you the largest effect. In other words, select your matched set based on covariate balance and not on the outcome variable in your study. There is definitely a bit of an art to propensity score matching, but that is why I think it is so interesting!
I'm rather new to R and especially to the method of matching by propensity scores. My dataset includes two groups of people that differ in whether they were treated or not- unfortunately they also differ significantly in age and disease duration, therefore my wish to match them.
So far this is my code:
set.seed(2208)
mod_match <- matchit(TR ~ age + disease_duration + sex + partner + work + academic,
data = Data_nomiss,
method = "nearest",
caliper = .025)
summary(mod_match)
This code works fine, but I wondered whether there is a possibility to weight the importance of the covariates regarding the accuracy of matching? For me it is crucial that the groups are as close as possible concerning age and disease duration (numeric), whereas the rest of the variables (factors) should also be matched, but for my purposes might differ in means a little more than the first two.
While searching for a solution to my problem I came across the request of this one guy, who had basically the same problem http://r.789695.n4.nabble.com/matchit-can-I-weight-the-parameters-td4633907.html
In this case it was proposed to combine nearest neighbor and exact matching, but applied to my dataset this leads to a disproportionate reduction of my sample. In the end what I'd like to have is some sort of customized matching process focusing on age and disease duration while also involving the remaining variables, but in a weaker way.
Does anyone happen to have an idea how this could be realized? I'd be really glad to receive any kinds of tips on this matter and thank you for your time!
Unfortunately, MatchIt does not provide this functionality. There are two ways to do this without MatchIt, but they are slightly advanced. Note that neither uses propensity scores. The point of propensity score matching is to match on a single number, the propensity score, which makes the matching procedure blind to the original covariates for which balance is desired.
The first is to use the package Matching and include your own weight matrix to Weight.matrix in Match(). You could upweight age and disease duration in the weight matrix.
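The idea of upweighting covariates can be sketched in base R without the Matching package. This is not Match() itself but a swapped-in weighted Euclidean distance on standardized covariates; the data, the weights in `w`, and the nearest-control rule (with replacement) are all assumptions made for illustration.

```r
set.seed(1)
# Made-up data: 5 treated and 20 control units with three covariates.
n_t <- 5
n_c <- 20
X <- data.frame(
  age              = c(rnorm(n_t, 50, 5), rnorm(n_c, 45, 8)),
  disease_duration = c(rnorm(n_t, 10, 2), rnorm(n_c, 8, 3)),
  sex              = rep(c(0, 1), length.out = n_t + n_c)
)
treat <- c(rep(1, n_t), rep(0, n_c))

# Illustrative importance weights: age and disease duration count 5x more than sex.
w <- c(age = 5, disease_duration = 5, sex = 1)

Z  <- scale(X)  # standardize so the weights act on comparable scales
Zt <- Z[treat == 1, , drop = FALSE]
Zc <- Z[treat == 0, , drop = FALSE]

# Weighted squared distance from each treated unit (rows) to each control (columns).
d <- sapply(seq_len(nrow(Zc)), function(j) {
  diff_sq <- sweep(Zt, 2, Zc[j, ])^2      # squared covariate differences
  rowSums(sweep(diff_sq, 2, w, `*`))      # apply the importance weights
})

# Index of the nearest control for each treated unit (controls may be reused).
nearest <- apply(d, 1, which.min)
print(nearest)
```

Raising the entries of `w` for age and disease duration makes mismatches on those covariates more costly, which is the role a diagonal weight matrix would play.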
The second is to use the package designmatch to do cardinality matching, which allows you to specify balance constraints, and it will use optimization to find the largest sample that meets those constraints. In bmatch(), enter your covariates of interest into the mom argument, which also allows you to include specific balance constraints for each variable. You can require stricter balance constraints for age and disease duration.