I was following a tutorial on model building using logistic regression.
In the tutorial, columns having numeric data type and with levels 3, were converted into factors using as.factor function. I wanted to know the reason for this conversion.
If vectors of class-"numeric" with a small number of unique values are left in that form, logistic regression, i.e. glm( form, family="binomial", ...), will return a single coefficient. Generally, that is not what the data will support, so the authors of that tutorial are advising that these vectors be converted to factors so that the default handling of categorical values by the glm function will occur. It's possible that those authors already know for a fact that the underlying data-gathering process has encoded categorical data with numeric levels and the data input process was not "told" to process as categorical. That could have been done using the colClasses parameter to whichever read.* function was used.
The default handling of factors by most R regression routines uses the first level as part of the baseline (Intercept) estimate and estimates a coefficient for each of the other levels. If you had left that vector as numeric you would have gotten an estimate that could have been interpreted as the slope of an effect of an ordinal variable. The statistical tests associated with such an encoding of an ordinal relationship is often called a "linear test of trend" and is sometimes a useful result when the data situation in the "real world" can be interpreted as an ordinal relationship.
Related
I want to do linear regression with the lm function. My dependent variable is a factor called AccountStatus:
1:0 days in arrears, 2:30-60 days in arrears, 3:60-90 days in arrears and 4:90+ days in arrears. (4)
As independent variable I have several numeric variables: Loan to value, debt to income and interest rate.
Is it possible to do a linear regression with these variables? I looked on the internet and found something about dummy's, but those were all for the independent variable.
This did not work:
fit <- lm(factor(AccountStatus) ~ OriginalLoanToValue, data=mydata)
summary(fit)
Linear regression does not take categorical variables for the dependent part, it has to be continuous. Considering that your AccountStatus variable has only four levels, it is unfeasible to treat it is continuous. Before commencing any statistical analysis, one should be aware of the measurement levels of one's variables.
What you can do is use multinomial logistic regression, see here for instance. Alternatively, you can recode the AccountStatus as dichotomous and use simple logistic regression.
Sorry to disappoint you, but this is just an inherent restriction of multiple regression, nothing to do with R really. If you want to learn more about which statistical technique is appropriate for different combinations of measurement levels of dependent and independent variables, I can wholeheartedly advise this book.
Expanding a little bit on #MaximK's answer: multinomial approaches are appropriate when the levels of the factor are unordered: in your case, however, when the measurement level is ordinal (i.e. ordered, but the distance between the levels is unknown/undefined), you can get more out of your data by doing ordinal regression, e.g. with the polr() function in the MASS package or with functions in the ordinal package. However, since ordinal regression has different/more complex underlying theory than simple linear regression, you should probably read more about it (e.g. at the Wikipedia article linked above, or in the vignettes of the ordinal package, or at the UCLA stats consulting page on ordinal regression, or browsing related questions on CrossValidated.
If you can give a numeric value to the variables then you might have a solution. You have to rename the values to numbers, then convert the variable into a numeric one. Here is how:
library(plyr)
my.data2$islamic_leviathan_score <- revalue(my.data2$islamic_leviathan,
c("(1) Very Suitable"="3", "(2) Suitable"="2", "(3) Somewhat Suitable"="1", "(4) Not Suitable At All"="-1"))
my.data2$islamic_leviathan_score_1 <- as.numeric(as.character(my.data2$islamic_leviathan_score))
This revaluates the potential values while transforming the variable as numeric ones. The results I get are consistent with the original values contained in the dataset when the variables are as factor variables. You can use this solution to change the name of the variables to whatever you may like, while transforming them to numeric variables.
Finally, this is worth doing because it allows you to draw histograms or regressions, something that is impossible to do with factor variables.
Hope this helps!
modelo <- lm( P3J_IOP~ PräOP_IOP +OPTyp + P3J_Med, data = na.omit(df))
summary(modelo)
Error:
Fehler in step(modelo, direction = "backward") :
Number of lines used has changed: remove missing values?
I have a lot of missing values in my dependent variable P3J_IOP.
Has anyone any idea how to create the model?
tl;dr unfortunately, this is going to be hard.
It is fairly difficult to make linear regression work smoothly with missing values in the predictors/dependent variables (this is true of most statistical modeling approaches, with the exception of random forests). In case it's not clear, the problem with stepwise approaches with missing data in the predictor is:
incomplete cases (i.e., observations with missing data for any of the current set of predictors) must be dropped in order to fit a linear model;
models with different predictor sets will typically have different sets of incomplete cases, leading to the models being fitted on different subsets of the data;
models fitted to different data sets aren't easily comparable.
You basically have the following choices:
drop any predictors with large numbers of missing values, then drop all cases that have missing values in any of the remaining predictors;
use some form of imputation, e.g. with the mice package, to fill in your missing data (in order to do proper statistical inference, you need to do multiple imputation, which may be hard to combine with stepwise regression).
There are some advanced statistical techniques that will allow you to simultaneously do the imputation and the modeling, such as the brms package (here is some documentation on imputation with brms), but it's a pretty big hammer/jump in statistical sophistication if all you want to do is fit a linear model to your data ...
I have run a logit model through glmnet. I am extracting the coefficients from the minimum lambda, and it gives me the results I expect. However I have a factor variable with nine unique values, and glmnet produces a single coefficient for this, which is expected for a binary variable but not factor...
library(glmnet)
coef(model.obj, s = 'lambda.min')
#output:
TraumaticInj 2.912419e-02
Toxin .
OthInj 4.065266e-03
CurrentSTDYN 7.601812e-01
GeoDiv 1.372628e-02 #this is a factor variable w/ 9 options...
so my questions:
1) how should I interpret a single coefficient from a factor variable in glmnet?
2) is there a method to extract the coefficients for the different factors of the variable?
Glmnet doesn't handle factor variables. You have to convert them to dummies using eg model. Matrix. So the results you are seeing is glmnet treating your factor variable as a single real variable.
Can't be done, b/c glmnet doesn't treat factor variables. This is pretty much answered here: How does glmnet's standardize argument handle dummy variables?
This comment by #R_User in the answer is particularly insightful:
#DTRM - In general, one does not standardize categorical variables to
retain the interpretability of the estimated regressors. However, as
pointed out by Tibshirani here:
statweb.stanford.edu/~tibs/lasso/fulltext.pdf, "The lasso method
requires initial standardization of the regressors, so that the
penalization scheme is fair to all regressors. For categorical
regressors, one codes the regressor with dummy variables and then
standardizes the dummy variables" - so while this causes arbitrary
scaling between continuous and categorical variables, it's done for
equal penalization treatment. – R_User Dec 6 '13 at 1:20
I want to do linear regression with the lm function. My dependent variable is a factor called AccountStatus:
1:0 days in arrears, 2:30-60 days in arrears, 3:60-90 days in arrears and 4:90+ days in arrears. (4)
As independent variable I have several numeric variables: Loan to value, debt to income and interest rate.
Is it possible to do a linear regression with these variables? I looked on the internet and found something about dummy's, but those were all for the independent variable.
This did not work:
fit <- lm(factor(AccountStatus) ~ OriginalLoanToValue, data=mydata)
summary(fit)
Linear regression does not take categorical variables for the dependent part, it has to be continuous. Considering that your AccountStatus variable has only four levels, it is unfeasible to treat it is continuous. Before commencing any statistical analysis, one should be aware of the measurement levels of one's variables.
What you can do is use multinomial logistic regression, see here for instance. Alternatively, you can recode the AccountStatus as dichotomous and use simple logistic regression.
Sorry to disappoint you, but this is just an inherent restriction of multiple regression, nothing to do with R really. If you want to learn more about which statistical technique is appropriate for different combinations of measurement levels of dependent and independent variables, I can wholeheartedly advise this book.
Expanding a little bit on #MaximK's answer: multinomial approaches are appropriate when the levels of the factor are unordered: in your case, however, when the measurement level is ordinal (i.e. ordered, but the distance between the levels is unknown/undefined), you can get more out of your data by doing ordinal regression, e.g. with the polr() function in the MASS package or with functions in the ordinal package. However, since ordinal regression has different/more complex underlying theory than simple linear regression, you should probably read more about it (e.g. at the Wikipedia article linked above, or in the vignettes of the ordinal package, or at the UCLA stats consulting page on ordinal regression, or browsing related questions on CrossValidated.
If you can give a numeric value to the variables then you might have a solution. You have to rename the values to numbers, then convert the variable into a numeric one. Here is how:
library(plyr)
my.data2$islamic_leviathan_score <- revalue(my.data2$islamic_leviathan,
c("(1) Very Suitable"="3", "(2) Suitable"="2", "(3) Somewhat Suitable"="1", "(4) Not Suitable At All"="-1"))
my.data2$islamic_leviathan_score_1 <- as.numeric(as.character(my.data2$islamic_leviathan_score))
This revaluates the potential values while transforming the variable as numeric ones. The results I get are consistent with the original values contained in the dataset when the variables are as factor variables. You can use this solution to change the name of the variables to whatever you may like, while transforming them to numeric variables.
Finally, this is worth doing because it allows you to draw histograms or regressions, something that is impossible to do with factor variables.
Hope this helps!
I've been using the aov() function in R for ages. I always input my data via .csv files, and have never bothered converting any of the variables to 'factor'.
Recently I've done just that, converting variables to factors and repeated the aov(), and the results of the aov() are now different.
My data are ordered categories, 0,1,2. Un-ordered or ordered levels makes no difference, both are different than using the variable without converting to a factor.
Are factors always appropriate? Why does this conversion make such a large difference?
Please let me know if more information is necessary to make my question clearer.
This is really a statistical question, but yes, it can make a difference. If R treated the variable as numeric, in a model it would account for only a single degree of freedom. If the levels of the numeric were 0, 1, 2, as a factor it would use two degrees of freedom. This would alter the statistical outputs from the model. The difference in model complexity between the numeric and factor representations increase markedly if you multiple factors coded numerically or the variables have more than a few levels. Whether the increase in explained sums-of-squared from the inclusion of a variable is statistically significant depends on the magnitude of the increase and the change in the complexity of the model. Using a numeric representation of a class variable would increase the model complexity by a single degree of freedom, but the class variable would use k-1 degrees of freedom. Hence for the same improvement in model fit, you could be in a situation where whether coding a variable a numeric or a factor changes whether it has a significant effect on the response.
Conceptually, the models based on numerics or factors differ; with factors you have a small set of groups or classes that have been sampled and the aim is to see whether the response differs between these groupings. The model is fixed on the set of samples groups; you can only predict for those groups observed. With numerics, you are saying that the response varies linearly with the numeric variable(s). From the fitted model you can predict for some new values of the numeric variable not observed.
(Note that the inference for fixed factors assumes you are fitting a fixed effects model. Treating a factor variables as a random effect moves the focus from the exact set of groups sampled on to the set of all groups in the population from which the sample was taken.)