When are factors necessary/appropriate in R

I've been using the aov() function in R for ages. I always input my data via .csv files, and have never bothered converting any of the variables to 'factor'.
Recently I did just that: I converted the variables to factors and repeated the aov(), and the results are now different.
My data are ordered categories: 0, 1, 2. Whether the levels are ordered or unordered makes no difference, but both give results different from using the variable without converting it to a factor.
Are factors always appropriate? Why does this conversion make such a large difference?
Please let me know if more information is necessary to make my question clearer.

This is really a statistical question, but yes, it can make a difference. If R treats the variable as numeric, it accounts for only a single degree of freedom in the model. If the levels of that numeric variable are 0, 1, 2, then as a factor it uses two degrees of freedom, which alters the statistical output of the model. The difference in model complexity between the numeric and factor representations increases markedly if you have multiple factors coded numerically, or if the variables have more than a few levels. Whether the increase in explained sum-of-squares from including a variable is statistically significant depends on the magnitude of the increase and on the change in the complexity of the model. A numeric representation of a class variable increases the model complexity by a single degree of freedom, whereas the class variable with k levels uses k-1 degrees of freedom. Hence, for the same improvement in model fit, you could be in a situation where coding a variable as numeric or as a factor changes whether it has a significant effect on the response.
Conceptually, the models based on numerics or factors differ; with factors you have a small set of groups or classes that have been sampled, and the aim is to see whether the response differs between these groupings. The model is fixed on the set of sampled groups; you can only predict for the groups observed. With numerics, you are saying that the response varies linearly with the numeric variable(s), and from the fitted model you can predict for new values of the numeric variable that were not observed.
(Note that the inference for fixed factors assumes you are fitting a fixed-effects model. Treating a factor variable as a random effect moves the focus from the exact set of groups sampled to the set of all groups in the population from which the sample was taken.)
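A minimal sketch of the degrees-of-freedom difference, using invented data in which dose takes the values 0, 1, 2 and the group means are deliberately non-linear in dose:
set.seed(1)
dose <- rep(c(0, 1, 2), each = 10)
y <- rnorm(30, mean = c(5, 9, 6)[dose + 1])  # made-up response, non-linear in dose
summary(aov(y ~ dose))           # numeric: 1 df, fits a straight line in dose
summary(aov(y ~ factor(dose)))   # factor: 2 df, allows each group its own mean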

Related

How to specify contrasts in lme to test hypotheses with interactions

I have a generalized mixed model that has 2 factors (fac1 (2 levels), fac2 (3 levels)) and 4 continuous variables (x1,x2,x3,x4) as fixed effects and a continuous response. I am interested in answering:
1. Are the main effects x1-x4 (slopes) significant, ignoring fac1 and fac2?
2. Are the fac1 and fac2 levels significantly different from the model mean and from each other?
3. Is there a difference in slopes between fac1 levels, fac2 levels and fac1*fac2 levels?
This means I would need to include interactions in my model (random effects ignored here),
say: Y~x1+x2+x3+x4+fac1+fac2+x1:fac1+x2:fac1+x3:fac1+x4:fac1+x1:fac2+x2:fac2+x3:fac2+x4:fac2
but now my coefficients for x1-x4 are based on the reference level, and interpreting the overall main effects is not possible.
Also do I have to include xi:fac1:fac2+fac1:fac2 in my model as well to answer 3)?
Is there an R package that can do this? I thought about refitting the model (e.g. without the interactions) to answer 1), but the number of data points in each factor level is not the same, so if I ignore this and fit Y~x1+x2+x3+x4, the slope of the most common factor combination may dominate the result and the inference. I know you can use contrasts, e.g. coding a factor with 2 levels as -0.5, 0.5 instead of the dummy coding 0 and 1, but I am not sure what that would look like in this case.
Would it be better to simplify the model by combining the factors first, e.g.
fac3<-interaction(fac1,fac2) #and then
Y~x1+x2+x3+x4+x1:fac3+x2:fac3+x3:fac3+x4:fac3
But how would I answer 1)-3) from that?
Thanks a lot for your advice
I think you have to take a step back and ask yourself what hypotheses exactly you want to test here. Taken word for word, your 3-point list results in a lot (!) of hypothesis tests, some of which can be done in the same model, some requiring different parameterizations. Given that the question at hand is about the hypotheses and not how to code them in R, this is more about statistics than programming and may be better moved to CrossValidated.
Nevertheless, for starters, I would propose the following procedure:
1. To test x1-x4 alone, just add all of them to your model, then use drop1() to check which of them actually add to the model fit.
2. In order to reduce the number of hypothesis tests (and different models to fit), I suggest you also test, for each factor and the interaction as a whole, whether it is relevant. Again, add all three terms (both factors and the interaction, so just fac1*fac2 if they are formatted as factors) to the model and use drop1().
3. This point alone includes many potential hypotheses/contrasts to test. Depending on the parameterization (dummy or effect coding), for each of the 4 continuous predictors you have 3 or 5 first-order interactions with the factor dummies/effects and 2 or 6 second-order interactions, given that you test each group against all others. That is a total of 20 or 44 tests, which makes false positives very likely (if you test at the 95% confidence level). Additionally, ask yourself whether these interactions can even be interpreted in a meaningful way. I would therefore advise you to focus on the specific interactions you expect to be relevant. If you really want to explore all interactions, test entire interaction terms first (e.g. fac1:x1, not specific contrasts). For this you have to fit 8 models, each including one factor-continuous interaction, and compare each of them to the no-interaction model using anova().
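A minimal sketch of steps 1 and 2, assuming a linear mixed model fit with lme4; dat and the random intercept (1 | cluster) are placeholders for your actual data and random-effects structure:
library(lme4)
# Fit with ML rather than REML, so that drop1()'s likelihood-ratio tests are valid:
m1 <- lmer(Y ~ x1 + x2 + x3 + x4 + (1 | cluster), data = dat, REML = FALSE)
drop1(m1, test = "Chisq")  # step 1: which continuous predictors add to the fit?
m2 <- update(m1, . ~ . + fac1 * fac2)
drop1(m2, test = "Chisq")  # step 2: drop1() respects marginality, so this tests fac1:fac2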
One last thing: I have assumed that you have already figured out the random-effects structure of your model (i.e. what cluster variable(s) to consider and whether there should be random slopes). If not, do that first.

R package MatchIt with factor variables

I'm using the R package MatchIt to calculate propensity score weights to be used in a straightforward survival analysis, and I'm noticing very different behaviors depending on whether the covariates entering the propensity score calculation are factors or numeric.
An example: simple code for 3 variables, one numeric (tumor size) and two factors (say, tumor stage and smoking habits). The treatment variable is a factor indicating the type of surgery.
Example 1: with stage as factor and smoking habit as integer,
> sapply(surg.data[,confounders], class)
tumor_size TNM.STAGE smoking_hx
"numeric" "factor" "integer"
I calculate the propensity scores with the following code and extract the weights
data.for.ps = surg.data[,c('record_id','surgeries_combined_n', confounders)]
match.it.1 <- matchit(as.formula(paste0('surgeries_combined_n ~',paste0(confounders, collapse='+'))),
data=data.for.ps, method='full', distance='logit')
match.it.1$nn
m.data = match.data(match.it.1)
m.data$weights = match.it.1$weights
No big problems. The result of the corresponding weighted survival analysis is the following (it does not matter here what "blue" and "red" mean): [survival plot not shown]
Example 2 is exactly the same, but with tumor stage now a numeric
> sapply(surg.data[,confounders], class)
tumor_size TNM.STAGE smoking_hx
"numeric" "numeric" "integer"
Exactly the same code for matching and exactly the same code for the survival analysis; the resulting plot (not shown) is not very different, but it is different.
Example 3 is exactly the same code, but with both tumor stage and smoking habit as factors:
> sapply(surg.data[,confounders], class)
tumor_size TNM.STAGE smoking_hx
"numeric" "factor" "factor"
The resulting plot (not shown), produced with exactly the same code, is totally different.
Now, there is no reason why one of the two potential factors should be numeric: they can both be factors, but the results are unquestionably different.
Can anybody help me understand:
1. Why does this happen? I don't think it's a coding problem; it's more a matter of understanding which class is correct to feed into matchit.
2. Which is the "correct" solution with MatchIt, keeping in mind that in the package vignette all the variables entering the propensity score calculations are numeric or integer, even those potentially coded as factors (such as education level or marital status)?
3. Should factors always stay factors? What if a factor is coded, say, 0,1,2,3 (numeric values but class=factor): should it stay a factor?
Thank you so much for your help!
EM
This is not a bug in MatchIt but rather a real phenomenon that can occur when analyzing any kind of data. Numeric variables carry hidden assumptions; in particular, that the values have a meaningful order and that the spacing between consecutive values is the same. When using numeric variables in a model, you are also assuming a linear relationship between the variable and the outcome of the model. If these assumptions are invalid, there is a risk that your results will be as well.
It's smart of you to assess the sensitivity of your results to these kinds of assumptions. It's hard to know what the right answer is. The most conservative approach is to treat the variables as factors, which requires no assumption about the functional form of an otherwise numeric variable (though a flexibly modeled numeric predictor could be effective as well). The cost is that you lose precision in your estimates if the assumptions behind the numeric coding are in fact valid.
Because propensity score matching really just relies on a good propensity score and the role of the covariates in the model is mostly a nuisance, you should determine which propensity score model yields the best balance on your covariates. Again, assessing balance requires assumptions to be made about how the variables are distributed, but it's totally feasible and advisable to assess balance on the covariates under a variety of transformations and forms. If one propensity score specification yields better balance across transformations of the covariate, then that is the propensity score model that should be trusted. Going beyond standardized mean differences and looking at the full distribution of the covariate in both groups will help you make a more informed decision.
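As a sketch of that balance check, assuming the match.it.1 object from the question, the cobalt package (a companion to MatchIt) can report balance on transformations of the covariates:
library(cobalt)
# Standardized mean differences, plus squared terms of the continuous
# covariates, to probe balance beyond the first moment:
bal.tab(match.it.1, poly = 2)
# Full distribution of one covariate in both groups, before and after matching:
bal.plot(match.it.1, var.name = "tumor_size", which = "both")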

Why convert numbers to factors while model building

I was following a tutorial on model building using logistic regression.
In the tutorial, columns with a numeric data type and 3 levels were converted into factors using the as.factor function. I wanted to know the reason for this conversion.
If vectors of class "numeric" with a small number of unique values are left in that form, logistic regression, i.e. glm(form, family="binomial", ...), will return a single coefficient. Generally, that is not what the data will support, so the authors of that tutorial are advising that these vectors be converted to factors, so that the glm function's default handling of categorical values will occur. It's possible that those authors already know that the underlying data-gathering process encoded categorical data with numeric levels and that the data input process was not "told" to treat them as categorical. That could have been done using the colClasses parameter to whichever read.* function was used.
The default handling of factors by most R regression routines uses the first level as part of the baseline (Intercept) estimate and estimates a coefficient for each of the other levels. If you had left that vector as numeric, you would have gotten an estimate that could be interpreted as the slope of an effect of an ordinal variable. The statistical test associated with such an encoding of an ordinal relationship is often called a "linear test of trend", and it is sometimes a useful result when the data situation in the "real world" can be interpreted as an ordinal relationship.
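A minimal sketch of the difference with invented data, where x takes the values 0-2 and the response is deliberately non-monotone in x:
set.seed(42)
x <- sample(0:2, 200, replace = TRUE)
y <- rbinom(200, 1, prob = c(0.2, 0.7, 0.4)[x + 1])  # made-up probabilities, not linear in x
coef(glm(y ~ x, family = "binomial"))          # one slope: a linear trend on the logit scale
coef(glm(y ~ factor(x), family = "binomial"))  # level 0 in the intercept; one coefficient per other level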

Regression model fails with factors having a large number of levels

I have mixed data (both quantitative and categorical) predicting a quantitative variable. I converted the categorical data into factors before feeding it into a glm model in R. Most of my categorical variables have more than 150 levels, and when I try to feed them to the glm model it fails with memory issues because of the factors having so many levels. We could set a threshold and accept only variables with up to a certain number of levels, but I need to include these many-level factors in the model. Is there any methodology to address this issue?
Edit: The dataset has 120000 rows and 50 columns. When the data is expanded with model.matrix there are 4772 columns.
If you have a lot of data, the easiest thing to do is sample from your matrix/data frame, then run the regression.
From sampling theory, we know that the standard error of a proportion p is sqrt(p(1-p)/n). So if you have 150 levels and the observations are distributed evenly across them, you would want to be able to detect proportions as small as 1/150, i.e. roughly .005. If you take a 10,000-row sample, the standard error for one of those factor levels is roughly:
sqrt((.005*.995)/10000) = 0.0007053368
That's really not much additional variance added to your regression estimates. Especially when you are doing exploratory analysis, sampling the rows of your data, say a 12,000-row sample, should still give you plenty of data to estimate quantities while making estimation feasible. Reducing your rows by a factor of 10 should also help R do the estimation without running out of memory. Win-win.
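A minimal sketch of that approach; dat stands in for the asker's 120,000-row data frame and y ~ . for the actual model formula:
set.seed(1)
idx <- sample(nrow(dat), 12000)       # keep roughly 10% of the rows
fit <- glm(y ~ ., data = dat[idx, ])  # the model matrix is now about 10x smaller
summary(fit)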

How to find correlation in mixed data including continuous, categorical and date types in R

I have data including different types:
a <- data.frame(x = c("a","b","b","c","c","c","d","d","e","f"),
                y = c(1,2,2,2,3,1,4,7,10,2),
                m = c("a","d","ab","ac","ac","vc","ed","ed","e","df"),
                n = c(2,1,5,3,3,2,8,10,10,1))
Actually, the data are more complex than this, probably including dates as well. Furthermore, this is an unsupervised problem, so there are no "class labels" here, which is why I cannot use methods such as ANOVA. So, how can I find the correlation between each pair of columns?
P.S. I found a function called mixed.cor in the psych package, but I cannot understand how to use it.
Furthermore, correlation only captures linear relations. What function should I use if I want to know the importance of every column?
The measure of correlation that most people use for numeric variables (i.e. Pearson correlation) is not defined for categorical data. If you want to measure the association between a numeric variable and a categorical variable, you can use ANOVA. If you want to measure the association between two categorical variables, you can use a Chi-Squared test. If your categorical variable is ordered (e.g. low, medium, high), you can use Spearman rank correlation.
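A short sketch of those suggestions, applied to the example data frame a from the question (on such a tiny sample, chisq.test() will warn about small expected counts):
summary(aov(y ~ x, data = a))       # numeric vs. categorical: one-way ANOVA
chisq.test(table(a$x, a$m))         # categorical vs. categorical: chi-squared test
cor(a$y, a$n, method = "spearman")  # ordered or numeric pairs: Spearman rank correlation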
