R package MatchIt with factor variables - r

I'm using the R package MatchIt to calculate propensity score weights to be used in a straightforward survival analysis, and I'm noticing very different behavior depending on whether the covariates entering the propensity score calculation are factors or numeric.
An example: simple code for three variables, one numeric (tumor size) and two factors (say, tumor stage and smoking habits). The treatment variable is a factor indicating the type of surgery.
Example 1: with tumor stage as a factor and smoking habit as an integer:
> sapply(surg.data[,confounders], class)
tumor_size  TNM.STAGE smoking_hx 
 "numeric"   "factor"  "integer" 
I calculate the propensity scores with the following code and extract the weights:
data.for.ps <- surg.data[, c('record_id', 'surgeries_combined_n', confounders)]

# Full matching on a logistic-regression propensity score
match.it.1 <- matchit(as.formula(paste0('surgeries_combined_n ~ ', paste0(confounders, collapse = '+'))),
                      data = data.for.ps, method = 'full', distance = 'logit')
match.it.1$nn

# Matched data set; note that match.data() already adds a 'weights' column
m.data <- match.data(match.it.1)
m.data$weights <- match.it.1$weights
No big problems. The result of the corresponding weighted survival analysis is the following (it does not matter here what "blue" and "red" mean):
Example 2 is exactly the same, but with tumor stage now numeric:
> sapply(surg.data[,confounders], class)
tumor_size  TNM.STAGE smoking_hx 
 "numeric"  "numeric"  "integer" 
Exactly the same code for the matching, exactly the same code for the survival analysis; the result is the following:
not very different, but different.
Example 3 is exactly the same code, but with both tumor stage and smoking habit as factors:
> sapply(surg.data[,confounders], class)
tumor_size  TNM.STAGE smoking_hx 
 "numeric"   "factor"   "factor" 
The result, using exactly the same code, is the following:
totally different.
Now, there is no reason why one of these two variables should be numeric rather than a factor: they could both be factors, but the results are unquestionably different.
Can anybody help me understand:
1. Why does this happen? I don't think it's a coding problem, but rather a matter of understanding which class is the correct one to pass to matchit().
2. What is the "correct" approach with MatchIt, keeping in mind that in the package vignette all the variables entering the propensity score calculation are numeric or integer, even those that could be coded as factors (such as education level or marital status)?
3. Should factors always stay factors? What if a factor is coded as, say, 0, 1, 2, 3 (numeric values, but class "factor"): should it stay a factor?
Thank you so much for your help!
EM

This is not a bug in MatchIt but rather a genuine issue that can arise when analyzing any kind of data. Numeric variables carry hidden assumptions: in particular, that the values have a meaningful order and that the spacing between consecutive values is the same. When you use a numeric variable in a model, you are assuming a linear relationship between the variable and the outcome of the model. If those assumptions are invalid, there is a risk that your results will be as well.
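To see concretely what changes, it can help to look at the design matrix the propensity score model is built from. A minimal sketch with made-up values:
# The same variable enters the model very differently depending on its class
stage_num <- c(1, 2, 3, 4)
stage_fac <- factor(stage_num)
model.matrix(~ stage_num)  # one column: a single linear slope is assumed
model.matrix(~ stage_fac)  # k - 1 dummy columns: a separate effect per level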
It's smart of you to assess the sensitivity of your results to these kinds of assumptions. It's hard to know what the right answer is. The most conservative approach is to treat the variables as factors, which requires no assumption about the functional form of an otherwise numeric variable (though a flexibly modeled numeric predictor could be effective as well). The cost is that you lose precision in your estimates if the assumptions behind the numeric coding are in fact valid.
Because propensity score matching really just relies on a good propensity score, and the role of the covariates in that model is mostly a nuisance, you should determine which propensity score model yields the best balance on your covariates. Assessing balance itself requires assumptions about how the variables are distributed, but it is entirely feasible, and advisable, to assess balance on the covariates under a variety of transformations and functional forms. If one propensity score specification yields better balance across transformations of the covariates, that is the model to trust. Going beyond standardized mean differences and looking at the full distribution of each covariate in both groups will help you make a more informed decision.
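For example (a sketch only, reusing the match.it.1 object from the question; the cobalt package is one common way to do this and is not part of MatchIt itself):
# Balance statistics (standardized mean differences, etc.) before vs. after matching;
# factor covariates are expanded into one row per level
summary(match.it.1)
# The cobalt package reports the same kind of information in compact form
library(cobalt)
bal.tab(match.it.1, un = TRUE)              # balance table, unadjusted vs. adjusted
love.plot(match.it.1, stats = "mean.diffs") # graphical summary
# Refit matchit() with the covariates coded the other way (factor vs. numeric),
# rerun these checks, and prefer the specification with the better balance.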

Related

How to specify contrasts in lme to test hypotheses with interactions

I have a generalized mixed model that has 2 factors (fac1 (2 levels), fac2 (3 levels)) and 4 continuous variables (x1,x2,x3,x4) as fixed effects and a continuous response. I am interested in answering:
1. Are the main effects of x1-x4 (slopes) significant, ignoring fac1 and fac2?
2. Are the fac1 and fac2 levels significantly different from the model mean and from each other?
3. Is there a difference in slopes between fac1 levels, fac2 levels and fac1*fac2 levels?
This means I would need to include interactions in my model (random effects ignored here),
say: Y~x1+x2+x3+x4+fac1+fac2+x1:fac1+x2:fac1+x3:fac1+x4:fac1+x1:fac2+x2:fac2+x3:fac2+x4:fac2
but now my coefficients for x1-x4 are relative to the reference levels, so interpreting them as overall main effects is not possible.
Also, do I have to include xi:fac1:fac2 + fac1:fac2 in my model as well to answer 3)?
Is there an R package that can do this? I thought about refitting the model (e.g. without the interactions) to answer 1), but the number of data points in each factor level is not the same, so if I ignore this and fit Y~x1+x2+x3+x4, the slope of the most common factor combination may dominate the result and the inference. I know you can use contrasts, e.g. coding a factor with 2 levels as -0.5, 0.5 instead of the usual dummy coding 0, 1, but I'm not sure what that would look like in this case.
Would it be better to simplify the model by combining the factors first, e.g.
fac3<-interaction(fac1,fac2) #and then
Y~x1+x2+x3+x4+x1:fac3+x2:fac3+x3:fac3+x4:fac3
But how do I answer 1-3 from that?
Thanks a lot for your advice
I think you have to take a step back and ask yourself which hypotheses exactly you want to test here. Taken word for word, your 3-point list results in a lot (!) of hypothesis tests, some of which can be done in the same model, some requiring different parameterizations. Given that the question at hand is about the hypotheses and not how to code them in R, this is more about statistics than programming and may be better suited to CrossValidated.
Nevertheless, for starters, I would propose the following procedure:
1. To test x1-x4 alone, just add all of them to your model, then use drop1() to check which of them actually add to the model fit.
2. In order to reduce the number of hypothesis tests (and different models to fit), I suggest you also test each factor and the interaction as a whole for relevance. Again, add all three terms (both factors and their interaction, so just fac1*fac2 if they are coded as factors) to the model and use drop1().
3. This point alone includes many potential hypotheses/contrasts to test. Depending on the parameterization (dummy or effect coding), for each of the 4 continuous predictors you have 3 or 5 first-order interactions with the factor dummies/effects and 2 or 6 second-order interactions, given that you test each group against all others. That is a total of 20 or 44 tests, which makes false positives very likely (if you test at the 95% confidence level). Also ask yourself whether these interactions can even be interpreted in a meaningful way. I would therefore advise you either to focus on a few specific interactions that you expect to be relevant, or, if you really want to explore all interactions, to first test entire interaction terms (e.g. fac1:x1, not specific contrasts). For this you fit 8 models, each including one factor-by-continuous interaction, and compare each of them to the no-interaction model using anova(); see the sketch after this list.
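A rough sketch of this procedure (assuming an lme4-style fit; the data frame dat, response Y and grouping variable id are placeholders to be replaced by your own):
library(lme4)
# Fit with ML (REML = FALSE) so the likelihood-based comparisons below are valid
m_main <- lmer(Y ~ x1 + x2 + x3 + x4 + fac1 * fac2 + (1 | id),
               data = dat, REML = FALSE)
# Which droppable terms (x1-x4 and the fac1:fac2 interaction) add to the fit?
drop1(m_main, test = "Chisq")
# Test one whole factor-by-slope interaction at a time
m_x1fac1 <- update(m_main, . ~ . + x1:fac1)
anova(m_main, m_x1fac1)   # likelihood-ratio test for the x1:fac1 interaction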
One last thing: I have assumed that you have already figured out the random-effects structure of your model (i.e. which cluster variable(s) to consider and whether there should be random slopes). If not, do that first.

Regression with factor variables [duplicate]

I want to do linear regression with the lm function. My dependent variable is a factor called AccountStatus:
1: 0 days in arrears, 2: 30-60 days in arrears, 3: 60-90 days in arrears and 4: 90+ days in arrears (4 levels).
As independent variables I have several numeric variables: loan to value, debt to income and interest rate.
Is it possible to do a linear regression with these variables? I looked on the internet and found something about dummies, but those were all for the independent variables.
This did not work:
fit <- lm(factor(AccountStatus) ~ OriginalLoanToValue, data=mydata)
summary(fit)
Linear regression does not accept a categorical variable for the dependent part; it has to be continuous. Considering that your AccountStatus variable has only four levels, it is not feasible to treat it as continuous. Before commencing any statistical analysis, one should be aware of the measurement levels of one's variables.
What you can do is use multinomial logistic regression, see here for instance. Alternatively, you can recode the AccountStatus as dichotomous and use simple logistic regression.
Sorry to disappoint you, but this is just an inherent restriction of multiple regression, nothing to do with R really. If you want to learn more about which statistical technique is appropriate for different combinations of measurement levels of dependent and independent variables, I can wholeheartedly advise this book.
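A minimal sketch of the multinomial suggestion, using only the variable names that appear in the question:
library(nnet)
mydata$AccountStatus <- factor(mydata$AccountStatus)   # unordered factor with 4 levels
fit_multi <- multinom(AccountStatus ~ OriginalLoanToValue, data = mydata)
summary(fit_multi)   # one set of coefficients per non-reference level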
Expanding a little bit on @MaximK's answer: multinomial approaches are appropriate when the levels of the factor are unordered. In your case, however, where the measurement level is ordinal (i.e. ordered, but the distance between the levels is unknown/undefined), you can get more out of your data by doing ordinal regression, e.g. with the polr() function in the MASS package or with functions in the ordinal package. However, since ordinal regression has different/more complex underlying theory than simple linear regression, you should probably read more about it (e.g. at the Wikipedia article linked above, in the vignettes of the ordinal package, at the UCLA stats consulting page on ordinal regression, or by browsing related questions on CrossValidated).
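A minimal sketch of the ordinal approach, again with only the variable names from the question:
library(MASS)
mydata$AccountStatus <- factor(mydata$AccountStatus, levels = 1:4, ordered = TRUE)
fit_ord <- polr(AccountStatus ~ OriginalLoanToValue, data = mydata, Hess = TRUE)
summary(fit_ord)   # Hess = TRUE lets summary() compute standard errors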
If you can give a numeric value to the variable's levels, then you might have a solution: rename the values to numbers, then convert the variable to numeric. Here is how:
library(plyr)
# Recode the factor levels to numbers, then convert the result to numeric
my.data2$islamic_leviathan_score <- revalue(my.data2$islamic_leviathan,
    c("(1) Very Suitable" = "3", "(2) Suitable" = "2",
      "(3) Somewhat Suitable" = "1", "(4) Not Suitable At All" = "-1"))
my.data2$islamic_leviathan_score_1 <- as.numeric(as.character(my.data2$islamic_leviathan_score))
This recodes the levels while converting the variable to numeric. The results I get are consistent with the original values contained in the dataset when the variables are factors. You can use this approach to rename the values to whatever you like while converting the variables to numeric.
Finally, this is worth doing because it allows you to draw histograms or fit regressions on the numeric scale, which is not possible with factor variables.
Hope this helps!

emmeans using weights to account for sample size differences in one factor

I'd like to obtain emmeans of a continuous response for each level of a several-level factor (numerous different populations) while "correcting" for differences in the frequency of another factor (gender) across those populations, assuming no interaction between the two.
The model I am working with is x <- lm(response ~ size*population + gender).
As I understand it, weights=equal and weights=proportional do not take into account differences in the frequency of the gender factor across different populations, but use either equal frequencies or the frequencies in the entire sample, respectively. The description of weights=outer is rather obscure to me, but it doesn't sound like it is exactly what I'm looking for either; the emmeans package documentation states: "All except "cells" uses the same set of weights for each mean."
But it seems like weights=cells is also not what I'm looking for, as then the emmeans will be closer to the ordinary marginal means, whereas I want them to be further away in cases where gender is unbalanced in certain populations. If I understand correctly, I would like the weighting to be the 'reverse' of this option. The emmean for each population should reflect what the mean of that population would be if gender had been sampled equally in each.
Perhaps I don't understand these weights fully, but is there an option to set weights to accomplish this?
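(For reference, a small sketch of how the different weighting options can be compared side by side; dat is a placeholder data frame containing the variables named above.)
library(emmeans)
x <- lm(response ~ size * population + gender, data = dat)
emmeans(x, ~ population, weights = "equal")         # each gender level weighted equally
emmeans(x, ~ population, weights = "proportional")  # weights from the overall gender frequencies
emmeans(x, ~ population, weights = "cells")         # each population's own observed gender mix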

When are factors necessary/appropriate in r

I've been using the aov() function in R for ages. I always input my data via .csv files, and have never bothered converting any of the variables to 'factor'.
Recently I did just that: I converted the variables to factors and repeated the aov(), and the results of the aov() are now different.
My data are ordered categories, 0, 1, 2. Using unordered or ordered levels makes no difference; both give results different from using the variable without converting it to a factor.
Are factors always appropriate? Why does this conversion make such a large difference?
Please let me know if more information is necessary to make my question clearer.
This is really a statistical question, but yes, it can make a difference. If R treats the variable as numeric, in a model it accounts for only a single degree of freedom. If the levels of that numeric variable were 0, 1, 2, then as a factor it would use two degrees of freedom. This alters the statistical outputs from the model. The difference in model complexity between the numeric and factor representations increases markedly if you have multiple factors coded numerically or the variables have more than a few levels. Whether the increase in explained sums of squares from including a variable is statistically significant depends on the magnitude of the increase and the change in the complexity of the model. A numeric representation of a class variable increases the model complexity by a single degree of freedom, whereas a factor with k levels uses k - 1 degrees of freedom. Hence, for the same improvement in model fit, coding a variable as numeric or as a factor can change whether it is judged to have a significant effect on the response.
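For instance, a toy example with simulated data (names made up) that shows the degrees-of-freedom difference:
set.seed(1)
dose <- rep(0:2, each = 20)   # an ordered category coded 0, 1, 2
y    <- 1 + 0.5 * (dose == 1) + 2 * (dose == 2) + rnorm(60)
summary(aov(y ~ dose))          # dose as numeric: 1 df, assumes a linear trend
summary(aov(y ~ factor(dose)))  # dose as factor: 2 df, separate group means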
Conceptually, models based on numerics or factors differ. With factors you have a small set of groups or classes that have been sampled, and the aim is to see whether the response differs between these groupings; the model is tied to the set of sampled groups, so you can only predict for the groups you observed. With numerics, you are saying that the response varies linearly with the numeric variable(s), and from the fitted model you can predict for new values of the numeric variable that were not observed.
(Note that the inference for fixed factors assumes you are fitting a fixed-effects model. Treating a factor variable as a random effect shifts the focus from the exact set of groups sampled to the set of all groups in the population from which the sample was taken.)
