I have run a logit model through glmnet. I am extracting the coefficients at the minimum lambda, and it gives me the results I expect. However, I have a factor variable with nine unique values, and glmnet produces a single coefficient for it, which is expected for a binary variable but not for a factor...
library(glmnet)
coef(model.obj, s = 'lambda.min')
#output:
TraumaticInj 2.912419e-02
Toxin .
OthInj 4.065266e-03
CurrentSTDYN 7.601812e-01
GeoDiv 1.372628e-02 #this is a factor variable w/ 9 options...
So my questions:
1) How should I interpret a single coefficient from a factor variable in glmnet?
2) Is there a method to extract the coefficients for the different levels of the factor?
glmnet doesn't handle factor variables. You have to convert them to dummies yourself, e.g. with model.matrix. So the result you are seeing is glmnet treating your factor variable as a single numeric variable.
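For example, a minimal sketch (not the poster's actual data): here df is a hypothetical data frame with a binary outcome y and predictors that include the factor GeoDiv.
library(glmnet)
# model.matrix() expands each factor into dummy columns; drop the intercept column
x <- model.matrix(y ~ ., data = df)[, -1]
fit <- cv.glmnet(x, df$y, family = "binomial", alpha = 1)
coef(fit, s = "lambda.min")  # now one coefficient per dummy level of GeoDiv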
It can't be done, because glmnet doesn't handle factor variables. This is pretty much answered here: How does glmnet's standardize argument handle dummy variables?
This comment by @R_User in the answer is particularly insightful:
@DTRM - In general, one does not standardize categorical variables to retain the interpretability of the estimated regressors. However, as pointed out by Tibshirani here: statweb.stanford.edu/~tibs/lasso/fulltext.pdf, "The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables" - so while this causes arbitrary scaling between continuous and categorical variables, it's done for equal penalization treatment. – R_User Dec 6 '13 at 1:20
Related
I'm running a model without an intercept, and I only have categorical predictors. I have also included an interaction term between my main predictor and another covariate. But the output shows a reference group for the interaction terms, even though I suppress the intercept. Therefore, I was wondering if R has an equivalent of Stata's ibn. factor-variable operator?
The ibn. factor-variable operator in Stata ensures that all levels of the factor variable are included in the model (the factor variable "loses" its base level).
In R you can achieve this by resetting the contrasts for the factor variables. How to do this is described by fabians here: All Levels of a Factor in a Model Matrix in R
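A hedged sketch of that approach, using a hypothetical data frame d with a factor grp and an outcome y: turning off the contrast coding keeps every level in the model matrix.
d$grp <- factor(d$grp)
X <- model.matrix(y ~ grp, data = d,
                  contrasts.arg = list(grp = contrasts(d$grp, contrasts = FALSE)))
head(X)  # one indicator column per level of grp; no base level is dropped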
I want to do linear regression with the lm function. My dependent variable is a factor called AccountStatus:
1: 0 days in arrears, 2: 30-60 days in arrears, 3: 60-90 days in arrears, and 4: 90+ days in arrears (four levels).
As independent variable I have several numeric variables: Loan to value, debt to income and interest rate.
Is it possible to do a linear regression with these variables? I looked on the internet and found something about dummies, but those were all for the independent variables.
This did not work:
fit <- lm(factor(AccountStatus) ~ OriginalLoanToValue, data=mydata)
summary(fit)
Linear regression does not take categorical variables for the dependent part; it has to be continuous. Considering that your AccountStatus variable has only four levels, it is unfeasible to treat it as continuous. Before commencing any statistical analysis, one should be aware of the measurement levels of one's variables.
What you can do is use multinomial logistic regression, see here for instance. Alternatively, you can recode the AccountStatus as dichotomous and use simple logistic regression.
Sorry to disappoint you, but this is just an inherent restriction of multiple regression, nothing to do with R really. If you want to learn more about which statistical technique is appropriate for different combinations of measurement levels of dependent and independent variables, I can wholeheartedly advise this book.
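For illustration, a hedged sketch of the multinomial route with nnet::multinom, reusing the variable names from the question (the data frame mydata is assumed):
library(nnet)
mydata$AccountStatus <- factor(mydata$AccountStatus)
multi_fit <- multinom(AccountStatus ~ OriginalLoanToValue, data = mydata)
summary(multi_fit)  # one set of coefficients per non-reference level of AccountStatus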
Expanding a little bit on @MaximK's answer: multinomial approaches are appropriate when the levels of the factor are unordered. In your case, however, the measurement level is ordinal (i.e. ordered, but the distance between the levels is unknown/undefined), so you can get more out of your data by doing ordinal regression, e.g. with the polr() function in the MASS package or with functions in the ordinal package. However, since ordinal regression has different/more complex underlying theory than simple linear regression, you should probably read more about it (e.g. in the Wikipedia article linked above, in the vignettes of the ordinal package, at the UCLA stats consulting page on ordinal regression, or by browsing related questions on CrossValidated).
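As a hedged sketch of the ordinal route with MASS::polr, again reusing the question's variable names and assuming the data frame mydata with AccountStatus coded 1-4:
library(MASS)
mydata$AccountStatus <- factor(mydata$AccountStatus, levels = 1:4, ordered = TRUE)
ord_fit <- polr(AccountStatus ~ OriginalLoanToValue, data = mydata, Hess = TRUE)
summary(ord_fit)  # threshold (cut-point) estimates plus one slope per predictor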
If you can give a numeric value to the levels then you might have a solution. You have to recode the values as numbers and then convert the variable to numeric. Here is how:
library(plyr)
my.data2$islamic_leviathan_score <- revalue(my.data2$islamic_leviathan,
c("(1) Very Suitable"="3", "(2) Suitable"="2", "(3) Somewhat Suitable"="1", "(4) Not Suitable At All"="-1"))
my.data2$islamic_leviathan_score_1 <- as.numeric(as.character(my.data2$islamic_leviathan_score))
This recodes the factor levels while converting the variable to numeric. The results I get are consistent with the original values contained in the dataset when the variables are stored as factors. You can use this solution to rename the levels to whatever you like while converting them to numeric variables.
Finally, this is worth doing because it allows you to draw histograms or run regressions, something you cannot do directly with factor variables.
Hope this helps!
I was following a tutorial on model building using logistic regression.
In the tutorial, columns with a numeric data type but only three unique values were converted into factors using the as.factor function. I wanted to know the reason for this conversion.
If vectors of class-"numeric" with a small number of unique values are left in that form, logistic regression, i.e. glm( form, family="binomial", ...), will return a single coefficient. Generally, that is not what the data will support, so the authors of that tutorial are advising that these vectors be converted to factors so that the default handling of categorical values by the glm function will occur. It's possible that those authors already know for a fact that the underlying data-gathering process has encoded categorical data with numeric levels and the data input process was not "told" to process as categorical. That could have been done using the colClasses parameter to whichever read.* function was used.
The default handling of factors by most R regression routines uses the first level as part of the baseline (Intercept) estimate and estimates a coefficient for each of the other levels. If you had left that vector as numeric, you would have gotten an estimate that could have been interpreted as the slope of an effect of an ordinal variable. The statistical test associated with such an encoding of an ordinal relationship is often called a "linear test of trend" and is sometimes a useful result when the data situation in the "real world" can be interpreted as an ordinal relationship.
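A toy illustration (made-up data) of the difference: the same three-valued predictor yields a single slope when left numeric and one coefficient per non-reference level when wrapped in as.factor().
set.seed(1)
d <- data.frame(x = sample(1:3, 200, replace = TRUE))
d$y <- rbinom(200, 1, plogis(-1 + 0.8 * (d$x == 3)))
coef(glm(y ~ x, data = d, family = binomial))             # single "trend" slope for x
coef(glm(y ~ as.factor(x), data = d, family = binomial))  # separate coefficients for levels 2 and 3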
I have a training dataset that has 3233 rows and 62 columns. The independent variable is Happy (train$Happy), which is a binary variable. The other 61 columns are categorical independent variables.
I've created a logistic regression model as follows:
logModel <- glm(Happy ~ ., data = train, family = binomial)
However, I want to reduce the number of independent variables that go into the model, perhaps down to 20 or so. I would like to start by getting rid of collinear categorical variables.
Can someone shed some light on how to determine which categorical variables are collinear and what threshold I should use when removing a variable from a model?
Thank you!
If your variables were numeric, the obvious solution would be penalized logistic regression (the lasso); in R it is implemented in glmnet.
With categorical variables the problem is much more difficult.
I was in a similar situation and I used the importance plot from the randomForest package in order to reduce the number of variables.
This would not help you to find collinearity, but only to rank the variables by importance.
You have only 60 variables, and maybe you have knowledge of the field, so you can try to add to your model some variables that make sense to you (like z = x1 - x3 if you think that the value x1 - x3 is important) and then rank them according to a random forest model.
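A hedged sketch of that ranking idea, assuming the train data frame from the question with its binary Happy column and factor predictors:
library(randomForest)
rf <- randomForest(as.factor(Happy) ~ ., data = train, importance = TRUE)
varImpPlot(rf)        # plot of variable importance
head(importance(rf))  # the underlying importance measures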
You could use Cramer's V, or the related Phi or contingency coefficient (see a great paper at http://www.harding.edu/sbreezeel/460%20files/statbook/chapter15.pdf), to measure collinearity among categorical variables. If two or more categorical variables have a Cramer's V value close to 1, it means they're highly "correlated" and you may not need to keep all of them in your logistic regression model.
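A small helper for Cramer's V between two categorical variables, written from the standard chi-squared formula (the column names var1 and var2 below are just placeholders):
cramers_v <- function(x, y) {
  tbl  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE)$statistic)
  n    <- sum(tbl)
  k    <- min(nrow(tbl), ncol(tbl)) - 1
  as.numeric(sqrt(chi2 / (n * k)))
}
cramers_v(train$var1, train$var2)  # values near 1 suggest the two variables are redundant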