MatchIt: Full Matching - Long Vector Error - r

I am running an analysis to assess the impact of a land conservation policy on land use change at the parcel level. To address the non-random nature of conservation program enrollment, I am running a matching analysis between treated and non-treated parcel-level data. I get the following error when I try to run full matching with the MatchIt package.
Error in cbind(treatmentids, controlids) :
long vectors not supported yet: ../include/Rinlinedfuns.h:535
The configuration I am using is:
m1.out <- matchit(formula = Y ~ X1 + X2 + ..... Xn, data = dataframe,
method = "full", distance = 'glm', link = 'logit',
estimand = 'ATT', ratio = 1, pop.size = 16)
Where X1 ... Xn are continuous covariates and Y is a binary treatment variable. The dataset contains 121,226 rows, of which 51,693 are treated and the rest are control samples.
I am running R 4.0.2 with MatchIt 4.3.4 on a Windows machine. Genetic and nearest neighbor matching run without any issues. I appreciate any help on this.

This is an error from optmatch. The problem is too big for optmatch::fullmatch(), the function matchit() calls with method = "full", to handle. This is because fullmatch() searches over all pairwise treated-control distances, which in this case number in the billions (51,693 treated × 69,533 controls ≈ 3.6 billion). The problem may simply be infeasible for full matching. See here for the same problem.
Some things you can try are to impose a very strict caliper, which reduces the number of eligible pairings, or to add an exact matching constraint using the exact argument, which splits the problem into smaller chunks that may be more manageable. You can also try subclassification with a large number of subclasses, which approximates full matching.
Also note that the ratio and pop.size arguments do nothing with full matching, so you should exclude them from your call to matchit().
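For concreteness, the three suggestions might look roughly like the following sketch. The covariate names (X1, X2) and the exact-matching variable (exact_var) stand in for your own variables, and the caliper width and number of subclasses are illustrative rather than recommendations.
# 1) Full matching with a strict propensity score caliper (in standard deviation units)
m_cal <- matchit(Y ~ X1 + X2, data = dataframe,
                 method = "full", distance = "glm", link = "logit",
                 estimand = "ATT", caliper = 0.1)
# 2) Full matching within exact-matching strata, which splits the problem
#    into smaller, more tractable pieces
m_ex <- matchit(Y ~ X1 + X2, data = dataframe,
                method = "full", distance = "glm", link = "logit",
                estimand = "ATT", exact = ~ exact_var)
# 3) Propensity score subclassification with many subclasses,
#    which approximates full matching
m_sub <- matchit(Y ~ X1 + X2, data = dataframe,
                 method = "subclass", distance = "glm", link = "logit",
                 estimand = "ATT", subclass = 50)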

Related

R: Nearest neighbour matching with MatchIt

I would like to use nearest neighbour matching with MatchIt in R.
So far I have used the following code:
Matching<- matchit(Treatment ~ Size+ Age + Expenses, data=data, method = "nearest", distance="glm", replace=TRUE)
I have two questions:
Question 1.)
When I run this code and then print Matching, I get a summary.
One of the lines says:
A matchit object
- method: 1:1 nearest neighbor matching with replacement
I want the same control observation to be matched multiple times if needed.
Is the code above doing that?
I am confused because the output says 1:1 nearest neighbor matching with replacement, so I don't know whether the 1:1 part means that an observation in the control group is used no more than once. However, since I use replace = TRUE in the code, I thought that this is exactly what allows one observation in the control group to be matched several times.
Could someone explain to me if my understanding is correct?
Question 2.)
After having run
Matching<- matchit(Treatment ~ Size+ Age + Expenses, data=data, method = "nearest", distance="glm", replace=TRUE)
I would like to estimate the average treatment effect.
I use the following document as a reference on how to estimate it:
https://cran.r-project.org/web/packages/MatchIt/vignettes/estimating-effects.html#after-pair-matching-with-replacement
However, I would like to use clustered standard errors by subclass and id.
Therefore, I need to first write the code:
Matching_gm <- get_matches(Matching)
When I look at the weights of Matching_gm they are always 1. However, when I run summary(Matching$weights) there are many weights that are different from 1.
Why do the weights change when I use get_matches()? As far as I know, this should not be the case.
It is called 1:1 matching because each treated unit gets one match, but it is possible for control units to be reused as matches for multiple treated units. If you set ratio to something other than 1, you would be doing k:1 matching, where each treated unit gets k matches, composed of controls that may be reused for other treated units.
get_matches() produces a row for each unit for each time it is matched. If a control unit is matched twice (i.e., to two different treated units), it will have two rows each with a weight of 1 in the get_matches() output, but it will have a weight of 2 in the matchit() output (though this weight may be scaled to be different from 2). If you use match.data() instead of get_matches(), you will see that each unit receives only one row and the weights for each control unit are the same as in the matchit() output.
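To see the difference concretely, here is a minimal sketch using the Matching object from the question (id, subclass, and weights are the columns MatchIt adds; your own covariates are carried along as well):
library(MatchIt)
# One row per unit per time it is matched; a reused control appears in
# multiple rows, each with a weight of 1
Matching_gm <- get_matches(Matching)
head(Matching_gm[, c("id", "subclass", "weights")])
# One row per matched unit; a control matched to two treated units gets
# a (possibly rescaled) weight of 2 here
Matching_md <- match.data(Matching)
summary(Matching_md$weights)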

Matchit R - how to run ratio of 1 CONTROL: 2 TREATED units?

How can I run matching with a ratio of 1 control to 2 treated units (20 controls for 40 treated subjects)?
The documentation describes ratio as: for methods that allow it, how many control units should be matched to each treated unit in k:1 matching.
So setting ratio = 0.5 is not possible. How can I achieve this?
I think this isn't possible because it doesn't make sense. Your question implies that treated units can be paired, but that presupposes a prior step in which treated units are matched with each other, which is not part of what MatchIt is designed to do (and doesn't really make sense from an experimental design standpoint anyway).
Either that, or you have pairs of treated cases that match exactly, in which case you're not really gaining any analytical leverage by trying to use them both. If that's the case, I would just reduce each pair to a single observation and go with 1:1 matching.
This is actually a much harder question than it may seem. There is definitely no way to do this with nearest neighbor matching. In theory, it should be possible with optimal matching, e.g., using method = "full" or the optmatch package directly. Unfortunately, in my experiments, I am unable to get optmatch to do what you want. Ideally, you should be able to run
fullmatch(., min.controls = 1/2, max.controls = 1/2, mean.controls = 1/2)
but that doesn't actually produce the desired results. Maybe you can contact the optmatch authors with this problem, as it's an interesting one that their package should be able to solve.
There is another method you can use called cardinality matching, which will be implemented in the next release of MatchIt but can be accessed in the development version using devtools::install_github("ngreifer/MatchIt"). Cardinality matching uses optimization to select a subset of units that satisfy a sample size constraint and balance constraints set by the user. In this case, your sample size constraint is that the number of treated units is twice the number of control units. You need to set balance constraints, which are the maximum imbalance (in standardized mean difference units) allowable between the treated and control groups. If A is your treatment variable and X1 and X2 are the variables you are trying to match on, you could run the following code:
m <- matchit(A ~ X1 + X2, data = data, method = "cardinality",
             estimand = "ATC", ratio = 2, tols = .01)
If a solution is possible, it will be produced. If not, you will have to relax the tolerance (i.e., increase tols) until it is. You can then run
summary(m, un = FALSE)
which will produce a summary of balance and the remaining sample size. If not all of your treated units are matched, then continue to increase tols until they are.

R and SPSS: Different results for Hierarchical Cluster Analysis

I'm performing hierarchical cluster analysis using Ward's method on a dataset containing 1000 observations and 37 variables (all are 5-point Likert scales).
First, I ran the analysis in SPSS via
CLUSTER Var01 to Var37
/METHOD WARD
/MEASURE=SEUCLID
/ID=ID
/PRINT CLUSTER(2,10) SCHEDULE
/PLOT DENDROGRAM
/SAVE CLUSTER(2,10).
FREQUENCIES CLU2_1.
I additionally performed the analysis in R:
datA <- subset(dat, select = Var01:Var37)
dist <- dist(datA, method = "euclidean")
hc <- hclust(d = dist, method = "ward.D2")
table(cutree(hc, k = 2))
The resulting cluster sizes are:
        Cluster 1   Cluster 2
SPSS          712         288
R             610         390
These results are obviously confusing to me, as they differ substantially (which becomes highly visible when comparing the dendrograms; the same applies to the 3-10 cluster solutions). "ward.D2" takes into account the squared distance, if I'm not mistaken, so I passed in the simple distance matrix here. However, I tried several combinations of distance and clustering methods, e.g. EUCLID instead of SEUCLID, squaring the distance matrix in R, applying the "ward.D" method, and so on. I also looked at the distance matrices generated by SPSS and R, which are identical (when applying the same method). Ultimately, I excluded duplicate cases (N=29) from my data, guessing that those might have caused differences when being allocated (randomly) at a certain point. None of this produced matching outputs in R and SPSS.
I also tried running the analysis with the agnes() function from the cluster package, which - again - gave results different from both SPSS and hclust() (but that's a topic for another post, I guess).
Are the underlying clustering procedures that different between the programs/packages? Or did I overlook a crucial detail? Is there a "correct" procedure that replicates the results yielded in SPSS?
If the distance matrices are identical and the merging methods are identical, the only thing that should create different outcomes is having tied distances handled differently in two algorithms. Tied distances might be present with the original full distance matrix, or might occur during the joining process. If one program searches the matrix and finds two or more distances tied at the minimum value at that step, and it selects the first one, while another program selects the last one, or one or both select one at random from among the ties, different results could occur.
I'd suggest starting with a small example with some data with randomness added to values to make tied distances unlikely and see if the two programs produce matching results on those data. If not, there's a deeper problem. If so, then tie handling might be the issue.
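One way to carry out that check in R, as a rough sketch (assuming the 37 items are in datA as in the question; the jitter magnitude is arbitrary, just small enough not to change the substantive distances), is shown below. The jittered data could then be exported and re-clustered in SPSS to see whether the two programs agree once ties are gone.
set.seed(1)
# Add tiny random noise so that tied distances become essentially impossible
datA_jit <- datA + matrix(runif(nrow(datA) * ncol(datA), 0, 1e-6),
                          nrow = nrow(datA))
hc_jit <- hclust(dist(datA_jit, method = "euclidean"), method = "ward.D2")
table(cutree(hc_jit, k = 2))
# How many pairwise distances are exactly tied in the original data?
d <- dist(datA, method = "euclidean")
sum(duplicated(as.numeric(d)))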

Reducing "treatment" sample size through MatchIt (or another package) to increase sample similarity

I am trying to match two samples on several covariates using MatchIt, but I am having difficulty creating samples that are similar enough. Both my samples are plenty large (~1000 in the control group, ~5000 in the comparison group).
I want to get a matched sample with participants as closely matched as possible and I am alright with losing sample size in the control group. Right now, MatchIt only returns two groups of 1000, whereas I want two groups that are very closely matched and would be fine with smaller groups (e.g., 500 instead of 1000).
Is there a way to do this through either MatchIt or another package? I would rather avoid random sampling followed by matching if possible, because I want as close a match between groups as possible.
Apologies for not having a reproducible example, I am still pretty new to using R and couldn't figure out how to make a sample of this issue...
Below is the code I have for matching the two groups.
data<- na.omit(data)
data$Group<- as.numeric(data$Group)
data$Group<- recode(data$Group, '1 = 1; 2 = 0')
m.out <- matchit(Group ~ Age + YearsEdu + Income + Gender, data = data, ratio = 1)
s.out <- summary(m.out, standardize = TRUE)
plot(s.out)
matched.data <- match.data(m.out)
MatchIt, like other similar packages, offers several matching routines that let you play around with the settings. Check out the argument method, which is set to method = 'nearest' by default. This means that unless you specify otherwise, it will look for the best match for each of the treatment observations. In your case, you will always have 1000 paired matches with this setting.
You can choose to set it to method = 'exact', which is much more restrictive. In the documentation you will find:
This technique matches each treated unit to all possible control units with exactly the same values on all the covariates, forming subclasses such that within each subclass all units (treatment and control) have the same covariate values.
On the lalonde dataset, you can run:
m.out <- matchit(treat ~ educ + black + hispan, data = lalonde, method = 'exact')
summary(m.out)
As a consequence, it discards some of the treatment observations that could not be matched. Have a look at the other possibilities for method; maybe you will find something you like better.
That being said, be mindful not to discard too many treatment observations. If you do, you will make the treatment group look like the control group (instead of the opposite), which might lead to unwanted results.
You should look into the package designmatch, which implements a form of matching called cardinality matching that does what you want (i.e., find the largest matched set that yields desired balance). Unlike MatchIt, designmatch doesn't use a distance variable; instead, it uses optimization to solve the matching problem. You select exactly how balanced you want each covariate to be, and it will do its best to solve the problem while retaining as many matches as possible. The methodology is described in Zubizarreta, Paredes, & Rosenbaum (2014).
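If you prefer to stay within MatchIt, newer versions also expose cardinality matching directly (as mentioned in an earlier answer). A rough sketch with the question's variables, where the balance tolerance tols = .05 is purely illustrative and an optimization solver package (e.g., Rglpk) needs to be installed:
m.card <- matchit(Group ~ Age + YearsEdu + Income + Gender, data = data,
                  method = "cardinality", tols = .05)
summary(m.card, un = FALSE)
matched.card <- match.data(m.card)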

R package bnlearn: cpquery vs predict - different results?

I want to use my Bayesian network as a classifier, first on complete evidence data (predict), but also on incomplete data (bnlearn::cpquery). But it seems that, even when working with the same evidence, the functions give different results (not just slight deviations due to sampling).
With complete data, one can easily use R's predict function:
predict(object = BN,
node = "TargetVar",
data = FullEvidence ,
method = "bayes-lw",
prob = TRUE)
By analyzing the prob attribute, I understood that the predict function simply chooses the factor level with the highest assigned probability.
When it comes to incomplete evidence (only outcomes of some nodes are known), predict doesn't work anymore:
Error in check.fit.vs.data(fitted = fitted, data = data, subset = setdiff(names(fitted), :
  required variables [.....] are not present in the data.
So, I want to use bnlearn::cpquery with a list of known evidence:
cpquery(fitted = BN,
event = TargetVar == "TRUE",
evidence = evidenceList,
method = "lw",
n = 100000)
Again, I simply want to use the factor with the highest probability as prediction. So if the result of cpquery is higher than 0.5, I set the prediction to TRUE, else to FALSE.
I tried to monitor the process by giving the same (complete) data to both functions, but they don't give me back the same results. There are large differences, e.g. predict's prob attribute gives me p(FALSE) = 27% whereas cpquery gives me p(FALSE) = 2.2%.
What is the "right" way of doing this? Using only cpquery, also for complete data? Why are there large differences?
Thanks for your help!
As user20650 put it, increasing the number of samples in the predict call was the solution to get very similar results. So just provide the argument n = ... in your function call.
Of course that makes sense, I just didn't know about that argument in the predict() function.
There is no documentation of it in the bn.fit utilities, and none in the rather generic documentation of predict either.
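For reference, this is roughly what the fixed call looks like (BN, FullEvidence, and "TargetVar" are the placeholders from the question; the sample size matches the one used in the cpquery call above):
predict(object = BN,
        node = "TargetVar",
        data = FullEvidence,
        method = "bayes-lw",
        prob = TRUE,
        n = 100000)   # draw far more likelihood-weighting samples than the default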
