How does fixest handle negative values of the demeaned dependent variable in poisson estimations?

How does fixest handle negative values of the demeaned dependent variable in poisson estimations? - r

I need to perform glm (poisson) estimations with fixed-effects (say merely unit FE) and several regressors (RHS variables). I have an unbalanced panel dataset where most (~90%) observations have missing values (NA) for some but not all regressors.
fixest::feglm() can handle this and returns my fitted model.
However, to do so, it (and fixest::demean too) removes observations that have at least one regressor missing, before constructing the fixed-effect means.
In my case, I am afraid this implies not using a significant share of available information in the data.
Therefore, I would like to demean my variables by hand, to be able to include as much information as possible in each fixed-effect dimension's mean, and then run feglm on the demeaned data. However, this implies getting negative dependent variable values, which is not compatible with Poisson. If I run feglm with "poisson" family and my manually demeaned data, I (coherently) get: "Negative values of the dependent variable are not allowed for the "poisson" family.". The same error is returned with data demeaned with the fixest::demean function.
Question:
How does feglm handle negative values of the demeaned dependent variable? Is there a way (like some data transformation) to reproduce fepois on a fixed-effect in the formula with fepois on demeaned data and a no fixed-effect formula?
To use the example from fixest::demean documentation (with two-way fixed-effects):
data(trade)
base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)
# We center the two variables ln_dist and ln_euros
# on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
fe = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean
and I would like to reproduce
est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)
with
est = fepois(ln_euros_dm ~ ln_dist_dm, base)

I think there are two main problems.
Modelling strategy
In general, it is important to be able to formally describe the estimated model.
In this case it wouldn't be possible to write down the model with a single equation, where the fixed-effects are estimated using all the data and other variables only on the non-missing observations. And if the model is not clear, then... maybe it's not a good model.
On the other hand, if your model is well defined, then removing random observations should not change the expectation of the coefficients, only their variance. So again, if your model is well specified, you shouldn't worry too much.
By suggesting that observations with missing values are relevant to estimate the fixed-effects coefficients (or stated differently, that they are used to demean some variables) you are implying that these observations are not randomly distributed. And now you should worry.
Just using these observations to demean the variables wouldn't remove the bias on the estimated coefficients due to the selection to non-missingness. That's a deeper problem that cannot be removed by technical tricks but rather by a profound understanding of the data.
GLM
There is a misunderstanding with GLM. GLM is a super smart trick to estimate maximum likelihood models with OLS (there's a nice description here). It was developed and used at a time when regular optimization techniques were very expensive in terms of computational time, and it was a way to instead employ well developed and fast OLS techniques to perform equivalent estimations.
GLM is an iterative process where typical OLS estimations are performed at each step, the only changes at each iteration concern the weights associated to each observation. Therefore, since it's a regular OLS process, techniques to perform fast OLS estimations with multiple fixed-effects can be leveraged (as is in the fixest package).
So actually, you could do what you want... but only within the OLS step of the GLM algorithm. By no means you should demean the data before running GLM because, well, it makes no sense (the FWL theorem has absolutely no hold here).

Related

How to assess the model and prediction of random forest when doing regression analysis?

I know when random forest (RF) is used for classification, the AUC normally is used to assess the quality of classification after applying it to test data. However,I have no clue the parameter to assess the quality of regression with RF. Now I want to use RF for the regression analysis, e.g. using a metrics with several hundreds samples and features to predict the concentration (numerical) of chemicals.
The first step is to run randomForest to build the regression model, with y as continuous numerics. How can I know whether the model is good or not, based on the Mean of squared residuals and % Var explained? Sometime my % Var explained is negative.
Afterwards, if the model is fine and/or used straightforward for test data, and I get the predicted values. Now how can I assess the predicted values good or not? I read online some calculated the accuracy (formula: 1-abs(predicted-actual)/actual), which also makes sense to me. However, I have many zero values in my actual dataset, are there any other solutions to assess the accuracy of predicted values?
Looking forward to any suggestions and thanks in advance.

The randomForest R package comes with an importance function which can used to determine the accuracy of a model. From the documentation:
importance(x, type=NULL, class=NULL, scale=TRUE, ...), where x is the output from your initial call to randomForest.
There are two types of importance measurements. One uses a permutation of out of bag data to test the accuracy of the model. The other uses the GINI index. Again, from the documentation:
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
For further information, one more simple importance check you may do, really more of a sanity check than anything else, is to use something called the best constant model. The best constant model has a constant output, which is the mean of all responses in the test data set. The best constant model can be assumed to be the crudest model possible. You may compare the average performance of your random forest model against the best constant model, for a given set of test data. If the latter does not outperform the former by at least a factor of say 3-5, then your RF model is not very good.

Subset selection with LASSO involving categorical variables

I ran a LASSO algorithm on a dataset that has multiple categorical variables. When I used model.matrix() function on the independent variables, it automatically created dummy values for each factor level.
For example, I have a variable "worker_type" that has three values: FTE, contr, other. Here, reference is modality "FTE".
Some other categorical variables have more or fewer factor levels.
When I output the coefficients results from LASSO, I noticed that worker_typecontr and worker_typeother both have coefficients zero. How should I interpret the results? What's the coefficient for FTE in this case? Should I just take this variable out of the formula?

Perhaps this question is suited more for Cross Validated.
Ridge Regression and the Lasso are both "shrinkage" methods, typically used to deal with high dimensional predictor space.
The fact that your Lasso regression reduces some of the beta coefficients to zero indicates that the Lasso is doing exactly what it is designed for! By its mathematical definition, the Lasso assumes that a number of the coefficients are truly equal to zero. The interpretation of coefficients that go to zero is that these predictors do not explain any of the variance in the response compared to the non-zero predictors.
Why does the Lasso shrink some coefficients to zero? We need to investigate how the coefficients are chosen. The Lasso is essentially a multiple linear regression problem that is solved by minimizing the Residual Sum of Squares, plus a special L1 penalty term that shrinks coefficients to 0. This is the term that is minimized:
where p is the number of predictors, and lambda is a a non-negative tuning parameter. When lambda = 0, the penalty term drops out, and you have a multiple linear regression. As lambda becomes larger, your model fit will have less bias, but higher variance (ie - it will be subject to overfitting).
A cross-validation approach should be taken towards selecting the appropriate tuning parameter lambda. Take a grid of lambda values, and compute the cross-validation error for each value of lambda and select the tuning parameter value for which the cross-validation error is the lowest.
The Lasso is useful in some situations and helps in generating simple models, but special consideration should be paid to the nature of the data itself, and whether or not another method such as Ridge Regression, or OLS Regression is more appropriate given how many predictors should be truly related to the response.
Note: See equation 6.7 on page 221 in "An Introduction to Statistical Learning", which you can download for free here.

R vs. SPSS mixed model repeated measures code [from Cross Validated]

NOTE: This question was originally posted on Cross Validated, where it was suggested that it should be asked in StackOverflow instead.
I am trying to model a 3-way repeated measures experiment, FixedFactorA * FixedFactorB * Time[days]. There are no missing observations, but my groups (FactorA * FactorB) are unequal (close, but not completely balanced). From reading online, the best way to model a repeated measures experiment in which observation order matters (due to the response mean and variance changing in a time-dependent way) and for unequal groups is to use a mixed model and specify an appropriate covariance structure. However, I am new to the idea of mixed models and I am confused as to whether I am using the correct syntax to model what I am trying to model.
I would like to do a full factorial analysis, such that I could detect significant time * factor interactions. For example, for subjects with FactorA = 1, their responses over time might have a different slope and/or intercept than subjects with FactorA =2. I also want to be able to check whether certain combinations of FactorA and FactorB have significantly different responses over time (hence the full three-way interaction term).
From reading online, it seems like AR1 is a reasonable covariance structure for longitudinal-like data, so I decided to try that. Also, I saw that one is supposed to use ML if one plans to compare two different models, so I chose that approach in anticipation of needing to fine-tune the model. It is also my understanding that the goal is to minimize the AIC during model selection.
This is the code in the log for what I tried in SPSS (for long-form data), which yielded an AIC of 2471:
MIXED RESPONSE BY FactorA FactorB Day
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0,
ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
/FIXED=FactorA FactorB Day FactorA*FactorB FactorA*Day FactorB*Day FactorA*FactorB*Day | SSTYPE(3)
/METHOD=ML
/PRINT=SOLUTION TESTCOV
/REPEATED=Day | SUBJECT(Subject_ID) COVTYPE(AR1)
This is what I tried in R, which yielded an AIC of 2156:
require(nlme)
#output error fix: https://stats.stackexchange.com/questions/40647/lme-error-iteration-limit-reached
ctrl <- lmeControl(opt='optim') #I used this b/c otherwise I get the iteration limit reached error
fit1 <- lme(RESPONSE ~ Day*FactorA*FactorB, random = ~ Day|Subject_ID, control=ctrl,
correlation=corAR1(form=~Day), data, method="ML")
summary(fit1)
These are my questions:
The SPSS code above yielded a model with AIC = 2471, while the R code yielded a model with AIC = 2156. What is it about the codes that makes the models different?
From what I described above, are either of these models appropriate for what I am trying to test? If not, what would be a better way, and how would I do it in both programs to get the same results?
Edits
Another thing to note is that I didn't dummy-code my factors. I don't know if this is a problem for either software, or if the built-in coding is different in SPSS vs R. I also don't know if this will be a problem for my three-way interaction term.
Also, when I say "factor", I mean an unchanging group or characteristic (like "sex").

Start with an unconditional model, one with an identity variance-covariance structure at level-1 and one with an AR(1) var-covar structure at level 1:
unconditional.identity<-lme(RESPONSE~Day, random=~Day|Subject_ID, data=data, method='ML')
unconditional.ar1<-lme(RESPONSE~Day, random=~Day|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
Find the intra-class correlation coefficient of this unconditional model, which is the level-2 error divided by the sum of level-1 and level-2 errors. This is probably easier in a spreadsheet program, but in R:
intervals(unconditional.identity)$reStruct$Subject_ID[2]^2/(intervals(unconditional.identity)$reStruct$Subject_ID[2]^2+intervals(unconditional.identity)$sigma[2]^2)
intervals(unconditional.ar1)$reStruct$Subject_ID[2]^2/(intervals(unconditional.ar1)$reStruct$Subject_ID[2]^2+intervals(unconditional.ar1)$sigma[2]^2)
It depends on your field, but in educational research, an ICC below 0.2, definitely below 0.1, is considered not ready for hierarchical linear models. That is to say, multiple regression would be better because the assumption of independence is confirmed. If your ICC is below a cutoff for your field, then do not use a hierarchical longitudinal model.
If your ICC is acceptable for hierarchical linear models, then add in your control grouping variable with identity and AR(1) var-covar matrix:
conditional1.identity<-lme(RESPONSE~Day+Group, random=~Day+Group|Subject_ID, data=data, method='ML')
conditional1.ar1<-lme(RESPONSE~Day+Group, random=~Day+Group|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
If your factors are time-invariant (which you said on Cross Validated), then your model gets bigger because time and group are nested in these fixed effects:
conditional2.identity<-lme(RESPONSE~Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group+FactorB, random=~Day+Group|Subject_ID, data=data, method='ML')
conditional2.ar1<-lme(Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group+FactorB, random=~Day+Group|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
You can get confidence intervals on the coefficients with intervals() or p-values with summary(). Remember, lme reports error terms in standard deviation format.
I do not know your area of study, so I can't say if your three-way interaction effect makes theoretical sense. But your model is getting quite dense at this point. The more parameters you estimate, the more degrees of freedom the model has when you compare them, so the statistical significance will be biased. If you are really interested in a three-way interaction effect, I suggest you consider the theoretical meaning of such an interaction and what it would mean if such an interaction did occur. Nonetheless, you can estimate it by adding it to the code above:
conditional3.identity<-lme(RESPONSE~Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group+FactorB+Day*FactorA*FactorB, random=~Day+Group|Subject_ID, data=data, method='ML')
conditional3.ar1<-lme(Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group+FactorB+Day*FactorA*FactorB, random=~Day+Group|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
Finally, compare the nested models:
anova(unconditional.identity,conditional1.identity,conditional2.identity,conditional3.identity)
anova(unconditional.ar1,conditional1.ar1,conditional2.ar1,conditional3.ar1)
Like I said, the more parameters you estimate, the more biased your statistical significance will be: i.e., more parameters = more degrees of freedom = less chance of a statistically significant model.
HOWEVER, the best part about multi-level models is comparing effect sizes, so then you don't have to worry about p-values at all. Effect sizes are in the form of a "proportional reduction in variance explained."
This is comparing models. For example, to comapre the proportional reduction in variance explained in level 1 from the unconditional model to the conditional1 model:
(intervals(unconditional.identity)$sigma[2]^2 - intervals(conditional1.identity)$sigma[2]^2) / intervals(unconditional.identity)$sigma[2]^2
Hopefully you can "plug and play" the same code for the number of level-2 error terms you have (which is more than one in some of your cases). Make sure to compare only nested models in this way.

evaluate forecast by the terms of p-value and pearson correlation

I am using R to do some evaluations for two different forecasting models. The basic idea of the evaluation is do the comparison of Pearson correlation and it corresponding p-value using the function of cor.() . The graph below shows the final result of the correlation coefficient and its p-value.
we suggestion that model which has lower correlation coefficient with corresponding lower p-value(less 0,05) is better(or, higher correlation coefficient but with pretty high corresponding p-value).
so , in this case, overall, we would say that the model1 is better than model2.
but the question here is, is there any other specific statistic method to quantify the comparison?
Thanks a lot !!!

Assuming you're working with time series data since you called out a "forecast". I think what you're really looking for is backtesting of your forecast model. From Ruey S. Tsay's "An Introduction to Analysis of Financial Data with R", you might want to take a look at his backtest.R function.
backtest(m1,rt,orig,h,xre=NULL,fixed=NULL,inc.mean=TRUE)
# m1: is a time-series model object
# orig: is the starting forecast origin
# rt: the time series
# xre: the independent variables
# h: forecast horizon
# fixed: parameter constriant
# inc.mean: flag for constant term of the model.
Backtesting allows you to see how well your models perform on past data and Tsay's backtest.R provides RMSE and Mean-Absolute-Error which will give you another perspective outside of correlation. Caution depending on the size of your data and complexity of your model, this can be a very slow running test.
To compare models you'll normally look at RMSE which is essentially the standard deviation of the error of your model. Those two are directly comparable and smaller is better.
An even better alternative is to set up training, testing, and validation sets before you build your models. If you train two models on the same training / test data you can compare them against your validation set (which has never been seen by your models) to get a more accurate measurement of your model's performance measures.
One final alternative, if you have a "cost" associated with an inaccurate forecast, apply those costs to your predictions and add them up. If one model performs poorly on a more expensive segment of data, you may want to avoid using it.
As a side-note, your interpretation of a p value as less is better leaves a little to be [desired] quite right.
P values address only one question: how likely are your data, assuming a true null hypothesis? It does not measure support for the alternative hypothesis.

predict and multiplicative variables / interaction terms in probit regressions

I want to determine the marginal effects of each dependent variable in a probit regression as follows:
predict the (base) probability with the mean of each variable
for each variable, predict the change in probability compared to the base probability if the variable takes the value of mean + 1x standard deviation of the variable
In one of my regressions, I have a multiplicative variable, as follows:
my_probit <- glm(a ~ b + c + I(b*c), family = binomial(link = "probit"), data=data)
Two questions:
When I determine the marginal effects using the approach above, will the value of the multiplicative term reflect the value of b or c taking the value mean + 1x standard deviation of the variable?
Same question, but with an interaction term (* and no I()) instead of a multiplicative term.
Many thanks

When interpreting the results of models involving interaction terms, the general rule is DO NOT interpret coefficients. The very presence of interactions means that the meaning of coefficients for terms will vary depending on the other variate values being used for prediction. The right way to go about looking at the results is to construct a "prediction grid", i.e. a set of values that are spaced across the range of interest (hopefully within the domain of data support). The two essential functions for this process are expand.grid and predict.
dgrid <- expand.grid(b=fivenum(data$b)[2:4], c=fivenum(data$c)[2:4]
# A grid with the upper and lower hinges and the medians for `a` and `b`.
predict(my_probit, newdata=dgrid)
You may want to have the predictions on a scale other than the default (which is to return the linear predictor), so perhaps this would be easier to interpret if it were:
predict(my_probit, newdata=dgrid, type ="response")
Be sure to read ?predict and ?predict.glm and work with some simple examples to make sure you are getting what you intended.
Predictions from models containing interactions (at least those involving 2 covariates) should be thought of as being surfaces or 2-d manifolds in three dimensions. (And for 3-covariate interactions as being iso-value envelopes.) The reason that non-interaction models can be decomposed into separate term "effects" is that the slopes of the planar prediction surfaces remain constant across all levels of input. Such is not the case with interactions, especially those with multiplicative and non-linear model structures. The graphical tools and insights that one picks up in a differential equations course can be productively applied here.