Subset selection with LASSO involving categorical variables - r

I ran a LASSO algorithm on a dataset that has multiple categorical variables. When I used model.matrix() function on the independent variables, it automatically created dummy values for each factor level.
For example, I have a variable "worker_type" that has three values: FTE, contr, other. Here, reference is modality "FTE".
Some other categorical variables have more or fewer factor levels.
When I output the coefficients results from LASSO, I noticed that worker_typecontr and worker_typeother both have coefficients zero. How should I interpret the results? What's the coefficient for FTE in this case? Should I just take this variable out of the formula?

Perhaps this question is suited more for Cross Validated.
Ridge Regression and the Lasso are both "shrinkage" methods, typically used to deal with high dimensional predictor space.
The fact that your Lasso regression reduces some of the beta coefficients to zero indicates that the Lasso is doing exactly what it is designed for! By its mathematical definition, the Lasso assumes that a number of the coefficients are truly equal to zero. The interpretation of coefficients that go to zero is that these predictors do not explain any of the variance in the response compared to the non-zero predictors.
Why does the Lasso shrink some coefficients to zero? We need to investigate how the coefficients are chosen. The Lasso is essentially a multiple linear regression problem that is solved by minimizing the Residual Sum of Squares, plus a special L1 penalty term that shrinks coefficients to 0. This is the term that is minimized:
where p is the number of predictors, and lambda is a a non-negative tuning parameter. When lambda = 0, the penalty term drops out, and you have a multiple linear regression. As lambda becomes larger, your model fit will have less bias, but higher variance (ie - it will be subject to overfitting).
A cross-validation approach should be taken towards selecting the appropriate tuning parameter lambda. Take a grid of lambda values, and compute the cross-validation error for each value of lambda and select the tuning parameter value for which the cross-validation error is the lowest.
The Lasso is useful in some situations and helps in generating simple models, but special consideration should be paid to the nature of the data itself, and whether or not another method such as Ridge Regression, or OLS Regression is more appropriate given how many predictors should be truly related to the response.
Note: See equation 6.7 on page 221 in "An Introduction to Statistical Learning", which you can download for free here.

Related

How to get each independent variables R^2 in a multivariate linear regression

When you do a multivariate linear regression you get the multiple R-squared, like this:
My question is, if I can get the R-squared for each independent variable, without having to make a regression for each of the predictor variables.
For example, is it possible to get the R-squared for each of the predictor variables, next to the p value:
In regression models, individual variables do not have an R-squared. There is only ever an R-squared for a complete model. The variance explained by any single independent variable in a regression model is depending on the other independent variables.
If you need some added value of an independent variable, that is, the variance this IV explains above all others, you can compute two regression models. One with this IV and one without. The difference in R-squared is the variance this IV explains after all others have explained their share. But if you do this for all variables, the differences won't add up to the total R-squared.
Alternatively, you may use squared Beta weights to roughly estimate the effect size of a variable in a model. But this value is not directly comparable to R-squared.
This said, this question would better be posted in CrossValidated than StackOverflow.

How does fixest handle negative values of the demeaned dependent variable in poisson estimations?

I need to perform glm (poisson) estimations with fixed-effects (say merely unit FE) and several regressors (RHS variables). I have an unbalanced panel dataset where most (~90%) observations have missing values (NA) for some but not all regressors.
fixest::feglm() can handle this and returns my fitted model.
However, to do so, it (and fixest::demean too) removes observations that have at least one regressor missing, before constructing the fixed-effect means.
In my case, I am afraid this implies not using a significant share of available information in the data.
Therefore, I would like to demean my variables by hand, to be able to include as much information as possible in each fixed-effect dimension's mean, and then run feglm on the demeaned data. However, this implies getting negative dependent variable values, which is not compatible with Poisson. If I run feglm with "poisson" family and my manually demeaned data, I (coherently) get: "Negative values of the dependent variable are not allowed for the "poisson" family.". The same error is returned with data demeaned with the fixest::demean function.
Question:
How does feglm handle negative values of the demeaned dependent variable? Is there a way (like some data transformation) to reproduce fepois on a fixed-effect in the formula with fepois on demeaned data and a no fixed-effect formula?
To use the example from fixest::demean documentation (with two-way fixed-effects):
data(trade)
base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)
# We center the two variables ln_dist and ln_euros
# on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
fe = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean
and I would like to reproduce
est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)
with
est = fepois(ln_euros_dm ~ ln_dist_dm, base)
I think there are two main problems.
Modelling strategy
In general, it is important to be able to formally describe the estimated model.
In this case it wouldn't be possible to write down the model with a single equation, where the fixed-effects are estimated using all the data and other variables only on the non-missing observations. And if the model is not clear, then... maybe it's not a good model.
On the other hand, if your model is well defined, then removing random observations should not change the expectation of the coefficients, only their variance. So again, if your model is well specified, you shouldn't worry too much.
By suggesting that observations with missing values are relevant to estimate the fixed-effects coefficients (or stated differently, that they are used to demean some variables) you are implying that these observations are not randomly distributed. And now you should worry.
Just using these observations to demean the variables wouldn't remove the bias on the estimated coefficients due to the selection to non-missingness. That's a deeper problem that cannot be removed by technical tricks but rather by a profound understanding of the data.
GLM
There is a misunderstanding with GLM. GLM is a super smart trick to estimate maximum likelihood models with OLS (there's a nice description here). It was developed and used at a time when regular optimization techniques were very expensive in terms of computational time, and it was a way to instead employ well developed and fast OLS techniques to perform equivalent estimations.
GLM is an iterative process where typical OLS estimations are performed at each step, the only changes at each iteration concern the weights associated to each observation. Therefore, since it's a regular OLS process, techniques to perform fast OLS estimations with multiple fixed-effects can be leveraged (as is in the fixest package).
So actually, you could do what you want... but only within the OLS step of the GLM algorithm. By no means you should demean the data before running GLM because, well, it makes no sense (the FWL theorem has absolutely no hold here).

How to assess the model and prediction of random forest when doing regression analysis?

I know when random forest (RF) is used for classification, the AUC normally is used to assess the quality of classification after applying it to test data. However,I have no clue the parameter to assess the quality of regression with RF. Now I want to use RF for the regression analysis, e.g. using a metrics with several hundreds samples and features to predict the concentration (numerical) of chemicals.
The first step is to run randomForest to build the regression model, with y as continuous numerics. How can I know whether the model is good or not, based on the Mean of squared residuals and % Var explained? Sometime my % Var explained is negative.
Afterwards, if the model is fine and/or used straightforward for test data, and I get the predicted values. Now how can I assess the predicted values good or not? I read online some calculated the accuracy (formula: 1-abs(predicted-actual)/actual), which also makes sense to me. However, I have many zero values in my actual dataset, are there any other solutions to assess the accuracy of predicted values?
Looking forward to any suggestions and thanks in advance.
The randomForest R package comes with an importance function which can used to determine the accuracy of a model. From the documentation:
importance(x, type=NULL, class=NULL, scale=TRUE, ...), where x is the output from your initial call to randomForest.
There are two types of importance measurements. One uses a permutation of out of bag data to test the accuracy of the model. The other uses the GINI index. Again, from the documentation:
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
For further information, one more simple importance check you may do, really more of a sanity check than anything else, is to use something called the best constant model. The best constant model has a constant output, which is the mean of all responses in the test data set. The best constant model can be assumed to be the crudest model possible. You may compare the average performance of your random forest model against the best constant model, for a given set of test data. If the latter does not outperform the former by at least a factor of say 3-5, then your RF model is not very good.

Interpreting zero-inflated regression summary

The output of the zeroinfl regression from pscl provides a list of coefficients under "count model coefficients" as well as a list of coefficients under "zero-inflation model coefficients."
Given the interest is to follow the z inflated model, what is the utility of the count model coefficients? Is it simply provided for reference?
Your zero inflated regression consists of two models. The zero part is usually a binomial part, such as a logit or probit model, and accounts for the probability that Y is not zero. The count part is usually a model for count data (usually integers), such as a poisson or negative binomial model, and only considers those observations that are not zero. When you compare the number of observations of both models, e.g. using summary(fit), you will see the difference. In sum, your zero model calculates the probability that an observations is not zero, the count model fits a model on those observations that are not zero.
This zero inflated regression is similar to a hurdle model. You can read more on this at Cross Validated: What is the difference between zero-inflated and hurdle models?. BTW that platform is actually better suited for this kind of merely statistical questions.

Orange Canvas: Regression models gives the r^2 coefficients out of [-1,1] interval

I am using the Orange canvas with its regression methods to make some estimations about my data set. The regression coefficients r^2 must be inside of the interval [-1,1] for being meaningful according to statistics field. But sometimes, I've got the regression coefficients -50,.. or 26,.. etc. So, I am confused about that. How can I interprete such the coefficients ? Thank you all already.
From Wikipedia:
Important cases where the computational definition of R2 can yield negative values, depending on the definition used, arise where the predictions that are being compared to the corresponding outcomes have not been derived from a model-fitting procedure using those data, and where linear regression is conducted without including an intercept. Additionally, negative values of R2 may occur when fitting non-linear functions to data.
There is nothing in the definition of R² that theoretically prevents it from having arbitrarily negative values. I guess you can interpret -50 as even worse than -1. But with regard to R² = 26, I'm clueless.

Resources