My data set has a binary dependent variable (0/1) and many continuous independent variables, observed for many individuals over three time periods. I am therefore facing a panel data set with a binary dependent variable, which calls for a non-linear panel data model. However, I also have many independent variables, which calls for a variable selection method. Hence I want to apply the lasso to a fixed effects logit model.
As far as I know, the only possibility in cv.glmnet(x, y, weights, offset, lambda, type.measure, nfolds, foldid, grouped, keep, parallel, ...) to estimate a logit lasso model is to set family='binomial'. This estimation procedure pools all individuals, as it is a cross-sectional procedure, and does not take the panel component of my data set into account.
Therefore, I would like to adjust the cv.glmnet function so that it accepts, for example, family='fe binomial' as input and then runs a fixed effects logit lasso model.
In conclusion, it is possible to run a fixed effects logit model and a lasso model separately but I want to combine both. How can I do this in R?
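For reference, this is the kind of pooled (cross-sectional) logit lasso I can already run with cv.glmnet, here on a small simulated data set; what is missing is the fixed-effects part:
# Small simulated example: 100 individuals x 3 periods, 50 regressors (made-up data)
library(glmnet)
set.seed(1)
n_id <- 100; n_t <- 3; p <- 50
x <- matrix(rnorm(n_id * n_t * p), ncol = p)
y <- rbinom(n_id * n_t, 1, plogis(x[, 1] - x[, 2]))
cv_fit <- cv.glmnet(x, y, family = "binomial", type.measure = "deviance")
coef(cv_fit, s = "lambda.min")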
(Also, in the attachment "Explanation model" I wrote my model down in more detail.)
I need to perform GLM (Poisson) estimations with fixed-effects (say merely unit FE) and several regressors (RHS variables). I have an unbalanced panel dataset where most (~90%) of the observations have missing values (NA) for some, but not all, regressors.
fixest::feglm() can handle this and returns my fitted model.
However, to do so, it (and fixest::demean too) removes observations that have at least one regressor missing, before constructing the fixed-effect means.
In my case, I am afraid this implies not using a significant share of available information in the data.
Therefore, I would like to demean my variables by hand, so as to include as much information as possible in each fixed-effect dimension's mean, and then run feglm on the demeaned data. However, this produces negative values of the dependent variable, which is not compatible with Poisson. If I run feglm with the "poisson" family on my manually demeaned data, I get, as expected: "Negative values of the dependent variable are not allowed for the "poisson" family." The same error is returned with data demeaned with the fixest::demean function.
Question:
How does feglm handle negative values of the demeaned dependent variable? Is there a way (such as some data transformation) to reproduce fepois with fixed-effects in the formula by running fepois on demeaned data with a formula that has no fixed-effects?
To use the example from fixest::demean documentation (with two-way fixed-effects):
data(trade)
base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)
# We center the two variables ln_dist and ln_euros
# on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
                  fe = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean
and I would like to reproduce
est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)
with
est = fepois(ln_euros_dm ~ ln_dist_dm, base)
I think there are two main problems.
Modelling strategy
In general, it is important to be able to formally describe the estimated model.
In this case it wouldn't be possible to write down the model with a single equation, where the fixed-effects are estimated using all the data and other variables only on the non-missing observations. And if the model is not clear, then... maybe it's not a good model.
On the other hand, if your model is well defined, then removing random observations should not change the expectation of the coefficients, only their variance. So again, if your model is well specified, you shouldn't worry too much.
By suggesting that observations with missing values are relevant to estimate the fixed-effects coefficients (or stated differently, that they are used to demean some variables) you are implying that these observations are not randomly distributed. And now you should worry.
Just using these observations to demean the variables wouldn't remove the bias on the estimated coefficients due to the selection to non-missingness. That's a deeper problem that cannot be removed by technical tricks but rather by a profound understanding of the data.
GLM
There is a misunderstanding with GLM. GLM is a super smart trick to estimate maximum likelihood models with OLS (there's a nice description here). It was developed and used at a time when regular optimization techniques were very expensive in terms of computational time, and it was a way to instead employ well developed and fast OLS techniques to perform equivalent estimations.
GLM is an iterative process in which a typical OLS estimation is performed at each step; the only thing that changes between iterations is the weights associated with each observation. Therefore, since it is a regular OLS process, techniques for fast OLS estimation with multiple fixed-effects can be leveraged (as is done in the fixest package).
So actually, you could do what you want... but only within the OLS step of the GLM algorithm. By no means should you demean the data before running the GLM because, well, it makes no sense (the FWL theorem has absolutely no hold here).
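To make the point concrete, here is a minimal sketch of the iteratively reweighted least-squares loop behind a plain Poisson GLM (no fixed-effects, simulated data): each iteration is just a weighted OLS step on a working response, and that weighted OLS step is where any within-transformation would have to happen.
# Minimal IRLS sketch for a Poisson GLM with log link (illustration only)
irls_poisson <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(max_iter)) {
    eta <- drop(X %*% beta)
    mu  <- exp(eta)                     # inverse link
    z   <- eta + (y - mu) / mu          # working response
    w   <- mu                           # working weights
    beta_new <- coef(lm.wfit(X, z, w))  # weighted OLS step
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  beta
}
# Check against glm() on simulated data
set.seed(1)
X <- cbind(1, x = rnorm(100))
y <- rpois(100, exp(0.5 + 0.3 * X[, "x"]))
irls_poisson(X, y)
coef(glm(y ~ X[, "x"], family = poisson))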
I'm trying to estimate a linear regression with ARIMA errors, but my regressors are highly collinear and thus the regression model suffers from multicollinearity. Since my ultimate goal is to interpret the individual regression coefficients as elasticities and to use them for ex-ante forecasting, I need to deal with the multicollinearity somehow in order to trust the coefficients of the regressors. I know that transforming the regressor variables, e.g. by differencing, might help to reduce the multicollinearity. I have also understood that auto.arima performs the same differencing for both the response variable and the regressors defined in xreg (see: Do we need to do differencing of exogenous variables before passing to xreg argument of Arima() in R?).
So my question is: does it do the transformation before estimating the regression coefficients, or is the regression estimated on the untransformed data with the transformation applied only before fitting the ARIMA model to the errors? And if the transformation is done before estimating the regression, how can I extract those transformed values into a table, so that I can run the multicollinearity test on them rather than on the original data?
The auto.arima() function will do the differencing before estimation to ensure consistent estimators. The estimation of the regression coefficients and ARIMA model is done simultaneously using MLE.
You can manually difference the data yourself using the diff() function.
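For illustration, a small simulated sketch (the series are made up) of differencing the regressors with diff() and checking their correlation before and after, then passing them to auto.arima():
library(forecast)

set.seed(42)
n <- 120
trend <- 0.5 * seq_len(n)
x1 <- trend + cumsum(rnorm(n))   # two trending regressors, highly correlated in levels
x2 <- trend + cumsum(rnorm(n))
y  <- ts(2 + 0.3 * x1 + 0.2 * x2 + arima.sim(list(ar = 0.6), n))

xreg <- cbind(x1, x2)
cor(xreg)        # strong correlation in levels, driven by the common trend
cor(diff(xreg))  # typically much weaker after first differencing

fit <- auto.arima(y, xreg = xreg)  # differences y and xreg together if it selects d > 0
summary(fit)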
I ran a LASSO algorithm on a dataset that has multiple categorical variables. When I used model.matrix() function on the independent variables, it automatically created dummy values for each factor level.
For example, I have a variable "worker_type" that has three values: FTE, contr, other. Here, the reference level is "FTE".
Some other categorical variables have more or fewer factor levels.
When I output the coefficients results from LASSO, I noticed that worker_typecontr and worker_typeother both have coefficients zero. How should I interpret the results? What's the coefficient for FTE in this case? Should I just take this variable out of the formula?
Perhaps this question is suited more for Cross Validated.
Ridge Regression and the Lasso are both "shrinkage" methods, typically used to deal with a high-dimensional predictor space.
The fact that your Lasso regression reduces some of the beta coefficients to zero indicates that the Lasso is doing exactly what it is designed for! By its mathematical definition, the Lasso assumes that a number of the coefficients are truly equal to zero. The interpretation of coefficients that go to zero is that these predictors do not explain any of the variance in the response compared to the non-zero predictors.
Why does the Lasso shrink some coefficients to zero? We need to look at how the coefficients are chosen. The Lasso is essentially a multiple linear regression problem that is solved by minimizing the residual sum of squares plus a special L1 penalty term that shrinks coefficients to 0. This is the quantity that is minimized:
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert \;=\; \mathrm{RSS} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$
where p is the number of predictors and lambda is a non-negative tuning parameter. When lambda = 0, the penalty term drops out and you have an ordinary multiple linear regression. As lambda becomes larger, the coefficients are shrunk harder towards zero, so the fit has more bias but less variance; with lambda close to zero the fit stays close to OLS and is more subject to overfitting.
A cross-validation approach should be taken towards selecting the appropriate tuning parameter lambda. Take a grid of lambda values, and compute the cross-validation error for each value of lambda and select the tuning parameter value for which the cross-validation error is the lowest.
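As a small illustration (simulated data, with a worker_type factor like the one in the question), the cross-validation step with glmnet could look like the following; dummy coefficients that are shrunk exactly to zero print as "." in the sparse coefficient matrix:
library(glmnet)

set.seed(1)
n <- 200
df <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  worker_type = factor(sample(c("FTE", "contr", "other"), n, replace = TRUE),
                       levels = c("FTE", "contr", "other"))
)
df$price <- 2 * df$x1 + rnorm(n)               # only x1 truly drives the response here

x <- model.matrix(price ~ ., data = df)[, -1]  # dummies; "FTE" is the reference level
y <- df$price

cv_fit <- cv.glmnet(x, y, alpha = 1)           # alpha = 1 is the lasso penalty
cv_fit$lambda.min                              # lambda with the lowest CV error
coef(cv_fit, s = "lambda.min")                 # worker_typecontr / worker_typeother may be zero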
The Lasso is useful in some situations and helps in generating simple models, but special consideration should be paid to the nature of the data itself, and whether or not another method such as Ridge Regression, or OLS Regression is more appropriate given how many predictors should be truly related to the response.
Note: See equation 6.7 on page 221 in "An Introduction to Statistical Learning", which you can download for free here.
I am analyzing a panel data set and I am interested in some time-independent explanatory variables (z). The Hausman test shows that I should use a fixed effects model instead of a random effects model.
The downside is that the fixed effects model will not estimate any coefficients for the time-independent explanatory variables.
So one idea is to take the estimated coefficients (b) for the time-dependent variables (x) from the FE model and apply them to the raw data, i.e. take out the effect of the already estimated explanatory variables. Then use these corrected values as the dependent variable in an OLS model with the time-independent variables as explanatory variables. This leads to:
y - x'b = z'j + u (with j as the coefficients of interest)
Does this two-step approach violate any necessary assumption of either model, or is it just that the standard errors of the OLS model need to be corrected?
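For concreteness, here is a rough sketch of the two-step procedure I have in mind, using plm and its built-in Grunfeld data (the time-independent variable z is artificial):
library(plm)
data("Grunfeld", package = "plm")

set.seed(1)
firm_z <- data.frame(firm = unique(Grunfeld$firm),
                     z = rnorm(length(unique(Grunfeld$firm))))  # artificial time-independent variable
Grunfeld <- merge(Grunfeld, firm_z, by = "firm")

# Step 1: fixed effects (within) estimation for the time-dependent variables
fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")

# Step 2: take out x'b from the raw dependent variable and regress on z
Grunfeld$y_corrected <- Grunfeld$inv -
  drop(as.matrix(Grunfeld[, c("value", "capital")]) %*% coef(fe))
second <- lm(y_corrected ~ z, data = Grunfeld)
summary(second)  # plain OLS output; the question is whether these standard errors need correcting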
Thanks for every hint!
Let's imagine that for a target value 'price', I have predictive variables of x, y, z, m, and n.
I have been able to analyse different models that I could fit through the following methods:
Forward, backward, and stepwise selection
Grid and Lasso
KNN (IBk)
For each, I got the RMSE and MSE of the predictions, so I can choose the best model.
All these are helpful with linear models.
I'm just wondering if there is any chance to do the same for polynomial regressions (squared, cubic, ...), so I can fit and analyse them as well on the same dataset.
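To make it concrete, here is a rough simulated sketch of what I mean: fitting squared terms with poly() and comparing hold-out RMSE against the plain linear fit (the variable names mirror mine, but the data are made up):
set.seed(1)
d <- data.frame(x = rnorm(300), y = rnorm(300), z = rnorm(300),
                m = rnorm(300), n = rnorm(300))
d$price <- 1 + 2 * d$x + 0.5 * d$x^2 - d$y + rnorm(300)  # true relation has a squared term

train <- d[1:200, ]; test <- d[201:300, ]
fit_lin <- lm(price ~ x + y + z + m + n, data = train)
fit_sq  <- lm(price ~ poly(x, 2) + y + z + m + n, data = train)

rmse <- function(fit) sqrt(mean((test$price - predict(fit, test))^2))
c(linear = rmse(fit_lin), squared = rmse(fit_sq))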
Have you seen the caret package? It's very powerful and wraps a lot of machine learning models. It can compare different models and also tune their hyperparameters.
http://topepo.github.io/caret/index.html
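For instance, a small sketch on the built-in mtcars data, comparing a cross-validated linear fit against one with a quadratic term added via poly():
library(caret)

set.seed(1)
folds <- createFolds(mtcars$mpg, k = 5, returnTrain = TRUE)  # same folds for both models
ctrl  <- trainControl(method = "cv", index = folds)

fit_lin  <- train(mpg ~ hp + wt,          data = mtcars, method = "lm", trControl = ctrl)
fit_poly <- train(mpg ~ poly(hp, 2) + wt, data = mtcars, method = "lm", trControl = ctrl)

summary(resamples(list(linear = fit_lin, quadratic = fit_poly)))  # compare CV RMSE / R-squared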