R matchit on 7 variables with different seeds

I am using the following code to match 2 cohorts (2800 controls, 460 treated) of different patients:
set.seed(99)
m.out <- matchit(treatment ~ gender + age + VarC + diseaseDuration +
                   pastActivity + activity + country,
                 data = Pat.match, method = "nearest", ratio = 5,
                 discard = "control", caliper = 0.1, m.order = "smallest")
After matching, the cohorts are reduced to about 1230 controls vs. 400 treated.
These numbers are similar when I change the seed. However, if I check more closely (by patient ID), the matched cohorts for different seeds differ in about 20% of the patients. To be more precise:
set.seed(99) results in a cohort that has only an 80% overlap with the cohort produced by set.seed(27).
This might have a huge impact on downstream models and statistical analyses. Have I overlooked something?
Regards!

Sometimes this occurs when units have the same or very similar propensity scores; I believe MatchIt breaks such ties with a randomly selected match. I actually disagree somewhat with @dash2's warning about changing the seed: you should feel free to perform the matching procedure as many times as you want until you arrive at covariate balance. If your data are balanced and all your treated units are retained (or at least the same ones across matching specifications), then your effect estimate will not vary systematically with the particular matched set. Just remember that once you have estimated your treatment effect, you can't go back and redo the matching (which is probably what @dash2 is getting at). But at the matching phase, this is not a concern.
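If it helps, here is a minimal sketch (my own illustration, reusing the code from the question) of how to check that balance is comparable across seeds and how to quantify the overlap of the matched sets; match.data() keeps the original row names, so no separate ID column is needed:
library(MatchIt)

run_match <- function(seed) {
  set.seed(seed)
  matchit(treatment ~ gender + age + VarC + diseaseDuration +
            pastActivity + activity + country,
          data = Pat.match, method = "nearest", ratio = 5,
          discard = "control", caliper = 0.1, m.order = "smallest")
}

m99 <- run_match(99)
m27 <- run_match(27)

# Balance (standardized mean differences) should look similar for both runs
summary(m99, un = FALSE)
summary(m27, un = FALSE)

# Share of patients (by row name) common to both matched samples
ids99 <- rownames(match.data(m99))
ids27 <- rownames(match.data(m27))
length(intersect(ids99, ids27)) / length(union(ids99, ids27))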

On the computing side: matchit is probably doing something random, even though you haven't, e.g., specified m.order = "random". What exactly that is, is probably easiest to find out by looking through the source code...
The statistical side is really a question for Cross Validated, not here, but I would suggest:
if any of your results depend non-trivially on the seed, don't trust the results; they are not robust.
in particular - I'm sure you know this, but it is worth reiterating - do NOT try different seeds until you get a result you like!

Related

Use of svyglm and svydesign with R for multistage stratified cluster design

I have a complicated data set which was generated by a multistage stratified cluster design. I had originally analysed this using glm, but now realise that I have to use svyglm. I'm not quite sure how best to model the data using svyglm, and was wondering if anyone could help shed some light.
I am attempting to see the effect that a variety of covariates taken at time 1 have on a binary outcome taken at time 2.
The sampling strategy was as follows: state -> urban/rural -> district -> subdistrict -> village. Within each village, individuals were randomly selected, with each of these having an id (uniqid).
I have a variable in the df for each of these stages of the sampling strategy. I also have the following variables: outcome, age, sex, income, marital_status, urban_or_rural_area, uniqid, weights. The formula that I want for my regression equation is outcome ~ age + sex + income + marital_status + urban_or_rural_area . Weights are coded by the weights variable. I had set the family to binomial(link = logit).
If anyone has any idea how such an approach could be coded in R with svyglm, I would be most appreciative. I'm quite confused as to what should be supplied as id, fpc and nest. Do I have to specify all levels of the stratified design, or just some?
Any direction, or resources which explain this well would be massively appreciated.
You don't really give enough information about the design: which of the geographical units are strata and which are clusters. For example, my guess is that you sample both urban and rural areas in all states, and that you don't sample all villages, but I don't know whether you sample all districts or subdistricts. I also don't know whether your overall sampling fraction is large or small (i.e., whether the with-replacement approximation is OK).
Let's pretend you sample just some districts, so districts are your Primary Sampling Units, and that the overall sampling fraction of people is small. The design command is
your_design <- svydesign(id = ~district, weights = ~weights,
                         strata = ~interaction(state, urban_rural, drop = TRUE),
                         data = your_data_frame)
That is, the strata are combinations of state and urban/rural and any combinations that aren't in your data set don't exist in the population (maybe some states are all-rural or all-urban). Within each stratum you have districts, and only some of these appear in the sample. In your geographical hierarchy, districts are then the first level that is sampled rather than exhaustively enumerated.
You don't need fpc unless you want to specify the full multistage design without replacement.
The nest option is not about how the survey was done but is about how variables are coded. The US National Center for Health Statistics (bless their hearts) set up a lot of designs that have many strata and two primary sampling units per stratum. They call these primary sampling units 1 and 2; that is, they reuse the names 1 and 2 in every stratum. The svydesign function is set up to expect different sampling unit names in different strata, and to verify that each sampling unit name appears in just one stratum, as a check against data errors. This check has to be disabled for NCHS surveys and perhaps some others that also reuse sampling unit names. You can always leave out the nest option at first; svydesign will tell you if it might be needed.
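Purely for illustration (an assumption about how your data might be coded, not something stated in the question): if your district labels were reused across strata, the same design call would just gain nest = TRUE:
# Same design as above, but treating district labels as unique only within a
# stratum (NCHS-style coding) rather than globally unique.
your_design <- svydesign(id = ~district, weights = ~weights,
                         strata = ~interaction(state, urban_rural, drop = TRUE),
                         nest = TRUE,
                         data = your_data_frame)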
Finally, the models:
svyglm(outcome ~ age + sex + income + marital_status + urban_or_rural_area,
       design = your_design, family = quasibinomial)
Using binomial or quasibinomial will give identical answers, but using binomial will give you a harmless warning about non-integer weights. If you use quasibinomial, the harmless warning is suppressed.
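As a small usage sketch (not part of the original answer), the fitted object can then be summarised in the usual way, with design-based standard errors and confidence intervals:
fit <- svyglm(outcome ~ age + sex + income + marital_status + urban_or_rural_area,
              design = your_design, family = quasibinomial)
summary(fit)                               # coefficients with design-based standard errors
exp(cbind(OR = coef(fit), confint(fit)))   # odds ratios with confidence intervals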

Latent class growth modelling in R/flexmix with multinomial outcome variable

How to run Latent Class Growth Modelling (LCGM) with a multinomial response variable in R (using the flexmix package)?
And how to stratify each class by a binary/categorical dependent variable?
The idea is to let gender shape the growth curve by cluster (cf. Mikolai and Lyons-Amos (2017, p. 194/3), where the stratification is done by education; they used Mplus).
I think I might have come close with the following syntax:
lcgm_formula <- as.formula(rel_stat ~ age + I(age^2) + gender + gender:age)
lcgm <- flexmix::stepFlexmix(. ~ . | id,
                             data = d,
                             k = nr_of_classes, # would be 1:12 in real analysis
                             nrep = 1,          # would be 50 in real analysis to avoid local maxima
                             control = list(iter.max = 500, minprior = 0),
                             model = flexmix::FLXMRmultinom(lcgm_formula, varFix = T, fixed = ~0))
which is close to what Wardenaar (2020, p. 10) suggests in his methodological paper for a continuous outcome:
stepFlexmix(. ~ . | ID, k = 1:4, nrep = 50,
            model = FLXMRglmfix(y ~ time, varFix = TRUE),
            data = mydata, control = list(iter.max = 500, minprior = 0))
The only difference is that FLXMRmultinom probably does not support the varFix and fixed parameters, although adding them does produce different results. The binomial equivalent of FLXMRmultinom in flexmix might be FLXMRglm (with family="binomial") as opposed to FLXMRglmfix, so I suspect that the restrictions of the LCGM (e.g. fixed slope and intercept per class) are not specified the way they should be.
The results are otherwise sensible, but the model fails to put men and women with similar trajectories in the same classes (below are the fitted probabilities for each relationship status in each class by gender):
We should have the following matches by cluster and gender...
1<->1
2<->2
3<->3
...but instead we have
1<->3
2<->1
3<->2
That is, if, for example, men in class one and women in class three were forced into the same group, the resulting group would be more homogeneous than the current first row of the plot grid.
Here is the full MVE to reproduce the code.
I got similar results with another dataset with a different number of classes and up to 50 iterations per class. I have tried two alternative ways to predict the probabilities, with identical results. I conclude that the problem is most likely in the model specification (stepFlexmix(..., model = FLXMRmultinom(...))), or that this is some sort of label-switching issue.
If the model is specified correctly and the issue is just that similar trajectories for men/women end up in different classes, is there a way to fix that, for example by restricting the parameters?
Any assistance will be highly appreciated.
This seems to be an identifiability issue apparently common in mixture modelling. In other words, the labels are switched: while there might not be a problem with the modelling as such, men and women end up in different groups, and that will have to be dealt with one way or another.
In the new linked code, I have swapped the order manually and calculated the predictions by hand.
I will be happy to hear if someone has an alternative approach to deal with the label-switching issue (like restricting parameters or switching labels algorithmically). I'm also curious whether the model could/should be specified in some other way.
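One algorithmic way to switch the labels (a self-contained sketch of my own, not taken from the linked code): treat the relabelling as a small assignment problem and pick the permutation of one group's classes that minimizes the distance between the class-specific fitted trajectories. With only a handful of classes an exhaustive search over permutations is enough; the two matrices below (one per gender, one row per class, fitted values in the columns) are hypothetical inputs:
all_perms <- function(v) {
  # All permutations of v, one per row (fine for a small number of classes)
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(i) cbind(v[i], all_perms(v[-i]))))
}

align_classes <- function(traj_a, traj_b) {
  # Permutation p of traj_b's rows that best matches traj_a (least squares)
  perms <- all_perms(seq_len(nrow(traj_a)))
  cost  <- apply(perms, 1, function(p) sum((traj_a - traj_b[p, , drop = FALSE])^2))
  perms[which.min(cost), ]
}

# Toy check: traj_b is a shuffled copy of traj_a
traj_a <- rbind(c(0.1, 0.2, 0.3), c(0.5, 0.5, 0.5), c(0.9, 0.8, 0.7))
traj_b <- traj_a[c(2, 3, 1), ]
align_classes(traj_a, traj_b)  # 3 1 2, i.e. traj_b[c(3, 1, 2), ] lines up with traj_a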
A few remarks:
I believe that this is indeed performing an LCGM, as we do not specify random effects for the slopes or intercepts. Therefore I assume that intercepts and slopes are fixed within classes for both sexes, which would mean that the model performs LCGM as intended. By the same token, it seems that running a GMM with a random intercept, slope or both is not possible.
Since we are calculating the predictions by hand, we need to be able to separate the parameters between the sexes. Therefore I also added an interaction term gender x age^2 (a generic sketch of the by-hand calculation follows the model specification below). The calculations seem to slow down somewhat, but the estimates are similar to the original. It also makes sense conceptually to include the interaction for age^2 if we already have it for age.
varFix = T, fixed = ~0 seem to be redundant: specifying them does not change anything. The subsampling procedure (of my real data) was unaffected by the set.seed() command for some reason.
The new model specification becomes:
lcgm_formula <- as.formula(rel_stat ~ age + I(age^2) + gender + age:gender + I(age^2):gender)
lcgm <- flexmix::flexmix(. ~ . | id,
                         data = d,
                         k = nr_of_classes, # would be 1:12 in real analysis
                         # nrep = 1,        # would be 50 in real analysis to avoid local maxima
                         #                  # (and we would use the stepFlexmix function instead)
                         control = list(iter.max = 500, minprior = 0),
                         model = flexmix::FLXMRmultinom(lcgm_formula))
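For reference, the "by hand" prediction step mentioned in the remarks above boils down to a softmax over the linear predictors. A generic sketch (the coefficient layout here is an assumption for illustration, not flexmix's exact output format):
# X: model matrix (intercept, age, age^2, gender, interactions, ...)
# beta: coefficient matrix with one column per non-reference category of the
#       outcome; the reference category's coefficients are implicitly 0.
softmax_probs <- function(X, beta) {
  eta <- cbind(0, X %*% beta)  # linear predictors, reference category first
  p   <- exp(eta)
  p / rowSums(p)               # fitted probability for each category
}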
And the plots:

Matchit R - how to run ratio of 1 CONTROL: 2 TREATED units?

How can I run matching with a ratio of 1 control to 2 treated units (20 controls for 40 treated subjects)? The documentation describes the ratio argument as: for methods that allow it, how many control units should be matched to each treated unit in k:1 matching. A ratio of 0.5 is not accepted, so how can I achieve this?
I think this isn't possible because it doesn't make sense. Your question implies that treated units can be paired, but that presupposes a prior step in which treated units are matched with each other, which is not part of what MatchIt is designed to do (and doesn't really make sense from an experimental design standpoint anyway).
Either that, or you have pairs of treated cases that match exactly, in which case you're not really gaining any analytical leverage by trying to use them both. If that's the case, I would just reduce each pair to a single observation and go with 1:1 matching.
This is actually a much harder question than it may seem. There is definitely no way to do this with nearest neighbor matching. In theory, it should be possible with optimal matching, e.g., using method = "full" or the optmatch package directly. Unfortunately, in my experiments, I am unable to get optmatch to do what you want. Ideally, you should be able to run
fullmatch(., min.controls = 1/2, max.controls = 1/2, mean.controls = 1/2)
but that doesn't actually produce the desired results. Maybe you can contact the optmatch authors with this problem, as it's an interesting one that their package should be able to solve.
There is another method you can use called cardinality matching, which will be implemented in the next release of MatchIt but can be accessed in the development version using devtools::install_github("ngreifer/MatchIt"). Cardinality matching uses optimization to select a subset of units that satisfy a sample size constraint and balance constraints set by the user. In this case, your sample size constraint is that the number of treated units is twice the number of control units. You need to set balance constraints, which are the maximum imbalance (in standardized mean difference units) allowable between the treated and control groups. If A is your treatment variable and X1 and X2 are the variables you are trying to match on, you could run the following code:
m <- matchit(A ~ X1 + X2, data = data, method = "cardinality",
             estimand = "ATC", ratio = 2, tols = .01)
If a solution is possible, it will be produced. If not, you will have to relax the tolerance (i.e., increase tols) until it is. You can then run
summary(m, un = FALSE)
which will produce a summary of balance and the remaining sample size. If not all of your treated units are matched, then continue to increase tols until they are.

Reducing "treatment" sample size through MatchIt (or another package) to increase sample similarity

I am trying to match two samples on several covariates using MatchIt, but I am having difficulty creating samples that are similar enough. Both my samples are plenty large (~1000 in the control group, ~5000 in the comparison group).
I want to get a matched sample with participants as closely matched as possible and I am alright with losing sample size in the control group. Right now, MatchIt only returns two groups of 1000, whereas I want two groups that are very closely matched and would be fine with smaller groups (e.g., 500 instead of 1000).
Is there a way to do this through either MatchIt or another package? I would rather avoid using random sampling and then match if possible because I want as close a match between groups as possible.
Apologies for not having a reproducible example; I am still pretty new to using R and couldn't figure out how to construct a small example of this issue...
Below is the code I have for matching the two groups.
data <- na.omit(data)
data$Group <- as.numeric(data$Group)
data$Group <- recode(data$Group, '1 = 1; 2 = 0')
m.out <- matchit(Group ~ Age + YearsEdu + Income + Gender, data = data, ratio = 1)
s.out <- summary(m.out, standardize = TRUE)
plot(s.out)
matched.data <- match.data(m.out)
MatchIt, like other similar packages, offers several matching routines that let you play around with the settings. Check out the argument method, which is set to method = 'nearest' by default. This means that unless you specify otherwise, it will look for the best match for each of the treatment observations. In your case, you will always end up with 1000 paired matches with this setting.
You can choose to set it to method = 'exact', which is much more restrictive. In the documentation you will find:
This technique matches each treated unit to all possible control units with exactly the same values on all the covariates, forming subclasses such that within each subclass all units (treatment and control) have the same covariate values.
On the lalonde dataset, you can run:
m.out <- matchit(treat ~ educ + black + hispan, data = lalonde, method = 'exact')
summary(m.out)
As a consequence, it discards some of the treatment observations that could not be matched. Have a look at the other possibilities for method; maybe you will find something you like better.
That being said, be mindful not to discard too many treatment observations. If you do, you will make the treatment group look like the control group (instead of the opposite), which might lead to unwanted results.
You should look into the package designmatch, which implements a form of matching called cardinality matching that does what you want (i.e., find the largest matched set that yields desired balance). Unlike MatchIt, designmatch doesn't use a distance variable; instead, it uses optimization to solve the matching problem. You select exactly how balanced you want each covariate to be, and it will do its best to solve the problem while retaining as many matches as possible. The methodology is described in Zubizarreta, Paredes, & Rosenbaum (2014).

R: Propensity Score Matching using MatchIt. How to specify desired matching accuracy for different covariates?

I'm rather new to R and especially to the method of matching by propensity scores. My dataset includes two groups of people that differ in whether they were treated or not- unfortunately they also differ significantly in age and disease duration, therefore my wish to match them.
So far this is my code:
set.seed(2208)
mod_match <- matchit(TR ~ age + disease_duration + sex + partner + work + academic,
                     data = Data_nomiss,
                     method = "nearest",
                     caliper = .025)
summary(mod_match)
This code works fine, but I wondered whether there is a possibility to weight the importance of the covariates regarding the accuracy of matching? For me it is crucial that the groups are as close as possible concerning age and disease duration (numeric), whereas the rest of the variables (factors) should also be matched, but for my purposes might differ in means a little more than the first two.
While searching for a solution to my problem I came across the request of this one guy, who had basically the same problem http://r.789695.n4.nabble.com/matchit-can-I-weight-the-parameters-td4633907.html
In this case it was proposed to combine nearest neighbor and exact matching, but transferred to my dataset this leads to a disproportionate reduction of my sample. In the end, what I'd like to have is some sort of customized matching process focusing on age and disease duration while also involving the last three variables, but in a weaker way.
Does anyone happen to have an idea how this could be realized? I'd be really glad to receive any kinds of tips on this matter and thank you for your time!
Unfortunately, MatchIt does not provide this functionality. There are two ways to do this without MatchIt, but they are slightly more advanced. Note that neither uses propensity scores. The point of propensity score matching is to match on a single number, the propensity score, which makes the matching procedure blind to the original covariates for which balance is desired.
The first is to use the Matching package and supply your own weight matrix via the Weight.matrix argument of Match(). You could upweight age and disease duration in the weight matrix.
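A rough sketch of this approach (my own illustration using the question's variable names; the weight of 10 is arbitrary, and the factor covariates are assumed to be coded numerically, otherwise create dummy variables first):
library(Matching)

# Covariates in the order used for the weight matrix below
X <- as.matrix(Data_nomiss[, c("age", "disease_duration", "sex",
                               "partner", "work", "academic")])

# Diagonal weight matrix: distances on age and disease_duration count ten times
# as much as distances on the remaining covariates (10 is purely illustrative).
W <- diag(c(10, 10, 1, 1, 1, 1))

m.w <- Match(Tr = Data_nomiss$TR, X = X, M = 1,
             Weight = 3, Weight.matrix = W)  # Weight = 3: use the user-supplied matrix

# Check covariate balance in the weighted-distance matched sample
MatchBalance(TR ~ age + disease_duration + sex + partner + work + academic,
             data = Data_nomiss, match.out = m.w)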
The second is to use the package designmatch to do cardinality matching, which allows you to specify balance constraints, and it will use optimization to find the largest sample that meets those constraints. In bmatch(), enter your covariates of interest into the mom argument, which also allows you to include specific balance constraints for each variable. You can require stricter balance constraints for age and disease duration.
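And a rough sketch of the designmatch approach (again my own illustration; the tolerance values are arbitrary, the factor covariates are assumed to be numeric, and the argument conventions should be checked against ?bmatch):
library(designmatch)

# designmatch expects the data sorted so that treated units come first
d <- Data_nomiss[order(Data_nomiss$TR, decreasing = TRUE), ]
t_ind <- d$TR

# Covariates whose means we want to balance
mom_covs <- as.matrix(d[, c("age", "disease_duration", "sex",
                            "partner", "work", "academic")])

# Maximum allowed difference in means per covariate (in each covariate's own
# units): tight for age and disease_duration, looser for the rest.
mom_tols <- c(1, 1, 0.1, 0.1, 0.1, 0.1)

out <- bmatch(t_ind = t_ind,
              subset_weight = 1,  # cardinality matching: keep as many matches as possible
              mom = list(covs = mom_covs, tols = mom_tols))

# Matched sample (t_id and c_id index the matched treated and control rows of d)
matched <- d[c(out$t_id, out$c_id), ]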
