I'm trying to use gbm for the first time (actually, any kind of regression tree for the first time) on my data, which consists of 14 continuous predictor variables and a factor with 13 levels as the response variable. I came to gbm via a very good description by Elith et al., who, however, used a modification of the basic gbm package that can't handle multinomial distributions. The gbm help claims that it can handle this:
"distribution: either a character string specifying the name of the distribution to use or a list
with a component name specifying the distribution and any additional parameters needed. If not specified, gbm will try to guess: if the response has only
2 unique values, bernoulli is assumed; otherwise, if the response is a factor,
multinomial is assumed; otherwise, if the response has class "Surv", coxph is
assumed; otherwise, gaussian is assumed.
Currently available options are "gaussian" (squared error), "laplace" (absolute
loss), "tdist" (t-distribution loss), "bernoulli" (logistic regression for 0-1 outcomes), "huberized" (huberized hinge loss for 0-1 outcomes), "multinomial"
(classification when there are more than 2 classes), "adaboost" (the AdaBoost
exponential loss for 0-1 outcomes), "poisson" (count outcomes), "coxph" (right
censored observations), "quantile", or "pairwise" (ranking measure using the
LambdaMart algorithm)."
Nevertheless, it doesn't work, no matter whether I specify "multinomial" or let it guess. Does anyone have an idea what I am doing wrong? Or am I misunderstanding something completely: does a multinomial distribution of my data not mean that my loss function is also multinomial? It runs if I choose "gaussian", but I guess in that case something completely different is calculated?
I'd appreciate any help!

Are you using the newest version of gbm? I had a similar issue which was resolved after re-installing the gbm package.
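A minimal sketch of how one might check this, assuming gbm is installed from CRAN. The iris data and the tuning values below are placeholders, not the setup from the question. Depending on the installed version, gbm may print a warning about its multinomial support, which is worth reading before relying on the results.

```r
# Sketch: check the installed gbm version, then fit a small multinomial model.
library(gbm)

print(packageVersion("gbm"))  # compare against the current CRAN release

set.seed(1)
# iris: factor response with 3 levels, 4 continuous predictors (stand-in data)
fit <- gbm(Species ~ ., data = iris,
           distribution = "multinomial",
           n.trees = 200, interaction.depth = 2,
           shrinkage = 0.05, n.minobsinnode = 5)

# type = "response" returns class probabilities as a rows x classes x 1 array
prob <- predict(fit, newdata = iris, n.trees = 200, type = "response")
pred <- dimnames(prob)[[2]][apply(prob[, , 1], 1, which.max)]
print(table(predicted = pred, actual = iris$Species))
```

If this runs but the same call fails on your own data, the problem is more likely in the data (for example, a response that is not stored as a factor) than in the package version.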


Estimating a heteroskedastic logit model

Suppose (following Train, Discrete Choice Methods with Simulation, chapter 3.2) you have a discrete choice model, observations from several cities, and you suspect that the variances of the unobserved factors differ over cities. The suggested appropriate model is heteroskedastic logit (or probit). Can this setup be estimated in R? As far as I can see, the mlogit package doesn't do this (it doesn't allow you to specify that the source of the heterogeneity is the cities). Is that correct, or am I missing something? If it is right, any suggestions for an R package?

How does fixest handle negative values of the demeaned dependent variable in poisson estimations?

I need to perform glm (poisson) estimations with fixed-effects (say merely unit FE) and several regressors (RHS variables). I have an unbalanced panel dataset where most (~90%) observations have missing values (NA) for some but not all regressors.
fixest::feglm() can handle this and returns my fitted model.
However, to do so, it (and fixest::demean too) removes observations that have at least one regressor missing, before constructing the fixed-effect means.
In my case, I am afraid this implies not using a significant share of available information in the data.
Therefore, I would like to demean my variables by hand, to be able to include as much information as possible in each fixed-effect dimension's mean, and then run feglm on the demeaned data. However, this implies getting negative dependent variable values, which is not compatible with Poisson. If I run feglm with "poisson" family and my manually demeaned data, I (coherently) get: "Negative values of the dependent variable are not allowed for the "poisson" family.". The same error is returned with data demeaned with the fixest::demean function.
How does feglm handle negative values of the demeaned dependent variable? Is there a way (like some data transformation) to reproduce the results of fepois with fixed effects in the formula by running fepois on demeaned data with a formula that has no fixed effects?
To use the example from fixest::demean documentation (with two-way fixed-effects):
library(fixest)
data(trade)  # example data shipped with fixest
base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)
# We center the two variables ln_dist and ln_euros
# on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
                  fe = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean
and I would like the second estimation below (est) to reproduce the first one (est_fe):
est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)
est = fepois(ln_euros_dm ~ ln_dist_dm, base)
I think there are two main problems.
Modelling strategy
In general, it is important to be able to formally describe the estimated model.
In this case it wouldn't be possible to write down the model with a single equation, where the fixed-effects are estimated using all the data and other variables only on the non-missing observations. And if the model is not clear, then... maybe it's not a good model.
On the other hand, if your model is well defined, then removing random observations should not change the expectation of the coefficients, only their variance. So again, if your model is well specified, you shouldn't worry too much.
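The point about random deletion can be illustrated with a small simulation. This is only a sketch on made-up data, assuming a correctly specified linear model and rows dropped completely at random; the sample sizes and the 10% retention rate are arbitrary choices.

```r
# Sketch (simulated data): dropping rows completely at random leaves the
# slope estimate centered on the true value and only makes it noisier.
set.seed(42)
true_slope <- 2
n <- 500
n_rep <- 300

one_rep <- function() {
  x <- rnorm(n)
  y <- 1 + true_slope * x + rnorm(n)
  keep <- runif(n) < 0.10          # keep about 10% of rows, chosen at random
  c(full   = unname(coef(lm(y ~ x))[2]),
    subset = unname(coef(lm(y[keep] ~ x[keep]))[2]))
}

est <- t(replicate(n_rep, one_rep()))
print(colMeans(est))        # both averages should sit near true_slope
print(apply(est, 2, sd))    # the subset estimate should be more variable
```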
By suggesting that observations with missing values are relevant to estimate the fixed-effects coefficients (or stated differently, that they are used to demean some variables) you are implying that these observations are not randomly distributed. And now you should worry.
Just using these observations to demean the variables wouldn't remove the bias on the estimated coefficients due to the selection to non-missingness. That's a deeper problem that cannot be removed by technical tricks but rather by a profound understanding of the data.
GLM
There is a misunderstanding with GLM. GLM is a super smart trick to estimate maximum likelihood models with OLS (there's a nice description here). It was developed and used at a time when regular optimization techniques were very expensive in terms of computational time, and it was a way to employ well-developed and fast OLS techniques instead to perform equivalent estimations.
GLM is an iterative process where a typical (weighted) OLS estimation is performed at each step; the only things that change from one iteration to the next are the working dependent variable and the weights associated with each observation. Therefore, since it's a regular OLS process, techniques to perform fast OLS estimations with multiple fixed-effects can be leveraged (as is done in the fixest package).
So actually, you could do what you want... but only within the OLS step of the GLM algorithm. By no means should you demean the data before running GLM because, well, it makes no sense (the FWL theorem has absolutely no hold here).
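To make the "GLM as repeated weighted OLS" idea concrete, here is a minimal sketch in base R for a Poisson model with log link and no fixed effects, on simulated data. It is not how fixest implements things internally; it only shows the generic iteratively reweighted least squares loop, in which any demeaning would have to happen inside the weighted OLS step.

```r
# Sketch: Poisson regression (log link) fitted by iteratively reweighted
# least squares, compared with glm(). Data are simulated.
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))
X <- cbind(1, x)

beta <- c(log(mean(y)), 0)          # crude starting values
for (iter in 1:50) {
  eta <- drop(X %*% beta)           # linear predictor
  mu  <- exp(eta)                   # fitted mean
  z   <- eta + (y - mu) / mu        # working dependent variable
  w   <- mu                         # working weights
  beta_new <- lm.wfit(X, z, w)$coefficients   # one weighted OLS step
  if (max(abs(beta_new - beta)) < 1e-10) { beta <- beta_new; break }
  beta <- beta_new
}

print(unname(beta))
print(unname(coef(glm(y ~ x, family = poisson))))
```

Note that the working dependent variable z is always built from the original, non-negative y; nothing in the loop ever feeds a demeaned count into the Poisson likelihood.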

How to assess the model and prediction of random forest when doing regression analysis?

I know that when random forest (RF) is used for classification, the AUC is normally used to assess the quality of the classification after applying it to test data. However, I have no clue which measure to use to assess the quality of a regression with RF. Now I want to use RF for regression analysis, e.g. using a data set with several hundred samples and features to predict the (numerical) concentration of chemicals.
The first step is to run randomForest to build the regression model, with y as a continuous numeric variable. How can I know whether the model is good or not, based on the Mean of squared residuals and % Var explained? Sometimes my % Var explained is negative.
Afterwards, if the model is fine, I apply it to the test data and get the predicted values. How can I assess whether the predicted values are good or not? I read online that some people calculate the accuracy (formula: 1-abs(predicted-actual)/actual), which also makes sense to me. However, I have many zero values in my actual dataset, so are there any other ways to assess the accuracy of the predicted values?
Looking forward to any suggestions and thanks in advance.
The randomForest R package comes with an importance function which can be used to gauge how much each variable contributes to the accuracy of a model. From the documentation:
importance(x, type=NULL, class=NULL, scale=TRUE, ...), where x is the output from your initial call to randomForest.
There are two types of importance measurements. One uses a permutation of out-of-bag data to test the accuracy of the model. The other uses the Gini index. Again, from the documentation:
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
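A short sketch of how the two measures can be pulled out for a regression forest, assuming the randomForest package is installed. The mtcars data is only a stand-in for the chemical concentration data in the question.

```r
library(randomForest)

set.seed(1)
# Regression example on a built-in data set; importance = TRUE is needed
# for the permutation-based measure.
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500, importance = TRUE)

imp_perm   <- importance(fit, type = 1)  # permutation measure (%IncMSE)
imp_purity <- importance(fit, type = 2)  # node impurity (IncNodePurity)
print(imp_perm[order(-imp_perm[, 1]), , drop = FALSE])
print(imp_purity[order(-imp_purity[, 1]), , drop = FALSE])
```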
For further information, one more simple check you can do, really more of a sanity check than anything else, is to compare against something called the best constant model. The best constant model has a constant output, which is the mean of all responses in the test data set. It can be regarded as the crudest model possible. You can compare the average performance of your random forest model against the best constant model for a given set of test data. If your RF model does not outperform the best constant model by at least a factor of, say, 3-5, then it is not very good.
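The comparison against the best constant model can be sketched in base R as follows. The helper name compare_to_constant is hypothetical, the data are simulated with many zeros, and lm() merely stands in for the forest so that the example has no package dependencies. The metrics used (RMSE, MAE, R squared) stay defined when actual values are zero, unlike the relative accuracy formula from the question.

```r
# Sketch: error metrics that stay defined when actual values are zero,
# plus a comparison against the best constant model (hypothetical helper).
compare_to_constant <- function(actual, predicted) {
  baseline <- rep(mean(actual), length(actual))  # best constant prediction
  rmse <- function(a, p) sqrt(mean((a - p)^2))
  mae  <- function(a, p) mean(abs(a - p))
  c(rmse_model    = rmse(actual, predicted),
    rmse_constant = rmse(actual, baseline),
    mae_model     = mae(actual, predicted),
    mae_constant  = mae(actual, baseline),
    r_squared     = 1 - sum((actual - predicted)^2) /
                        sum((actual - baseline)^2))
}

# Simulated test data with many zeros; lm() stands in for the forest here.
set.seed(7)
train <- data.frame(x = runif(200))
train$y <- pmax(0, 3 * train$x - 1 + rnorm(200, sd = 0.3))
test <- data.frame(x = runif(100))
test$y <- pmax(0, 3 * test$x - 1 + rnorm(100, sd = 0.3))

model <- lm(y ~ x, data = train)
res <- compare_to_constant(test$y, predict(model, newdata = test))
print(round(res, 3))
```

A negative r_squared here means the model predicts worse than the constant, which is the same reading as a negative % Var explained in the randomForest output.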

Multivariate Analysis on random forest results

Apologies in advance for no data samples:
I built out a random forest of 128 trees with no tuning having 1 binary outcome and 4 explanatory continuous variables. I then compared the AUC of this forest against a forest already built and predicting on cases. What I want to figure out is how to determine what exactly is lending predictive power to this new forest. Univariate analysis with the outcome variable led to no significant findings. Any technique recommendations would be greatly appreciated.
EDIT: To summarize, I want to perform multivariate analysis on these 4 explanatory variables to identify what interactions are taking place that may explain the forest's predictive power.
Random Forest is what's known as a "black box" learning algorithm, because there is no good way to interpret the relationship between input and outcome variables. You can however use something like the variable importance plot or partial dependence plot to give you a sense of what variables are contributing the most in making predictions.
Here are some discussions on variable importance plots, also here and here. It is implemented in the randomForest package as varImpPlot() and in the caret package as varImp(). The interpretation of this plot depends on the metric you are using to assess variable importance. For example if you use MeanDecreaseAccuracy, a high value for a variable would mean that on average, a model that includes this variable reduces classification error by a good amount.
Here are some other discussions on partial dependence plots for predictive models, also here. It is implemented in the randomForest package as partialPlot().
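As a rough illustration of both plots, assuming the randomForest package is installed: the two-species subset of iris below is only a stand-in for a binary outcome with 4 continuous predictors, and 128 trees mirrors the question.

```r
library(randomForest)

set.seed(1)
# Binary outcome, 4 continuous predictors (two iris species as stand-in data)
dat <- droplevels(subset(iris, Species != "setosa"))
fit <- randomForest(Species ~ ., data = dat, ntree = 128, importance = TRUE)

varImpPlot(fit)   # MeanDecreaseAccuracy and MeanDecreaseGini side by side

# Partial dependence of the predicted class on one predictor
pd <- partialPlot(fit, pred.data = dat, x.var = "Petal.Width",
                  which.class = "virginica")
```

Partial dependence plots show one variable at a time, so they give only an indirect view of interactions between the explanatory variables.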
In practice, 4 explanatory variables is not many, so you can easily run a binary logistic regression (possibly with an L2 regularization) for a more interpretable model, and compare its performance against a random forest. See this discussion about variable selection. Regularized logistic regression is implemented in the glmnet package. Basically an L2 regularization, also known as ridge, is a penalty term added to your loss function that shrinks your coefficients for reduced variance, at the expense of increased bias. This effectively reduces prediction error if the amount of reduced variance more than compensates for the bias (this is often the case). Since you only have 4 input variables, I suggested L2 instead of L1 (also known as lasso, which also does automatic feature selection). See this answer for ridge and lasso shrinkage parameter tuning using cv.glmnet: How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables?
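A minimal sketch of the ridge logistic regression, assuming the glmnet package is installed. The data are again a two-species subset of iris used as a stand-in, and the number of folds is an arbitrary choice.

```r
library(glmnet)

set.seed(1)
# Stand-in data: binary outcome with 4 continuous predictors
dat <- droplevels(subset(iris, Species != "setosa"))
x <- as.matrix(dat[, 1:4])
y <- dat$Species

# alpha = 0 is the ridge (L2) penalty; alpha = 1 would be the lasso (L1)
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0,
                    type.measure = "auc", nfolds = 5)

print(cv_fit$lambda.min)                 # penalty with the best CV AUC
print(coef(cv_fit, s = "lambda.min"))    # shrunken coefficients
prob <- predict(cv_fit, newx = x, s = "lambda.min", type = "response")
```

The cross-validated AUC from cv_fit can then be set against the AUC of the forest on the same data.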

Can't generate correlated random numbers from binomial distributions in R using rmvbin

I am trying to get a sample of correlated random numbers from binomial distributions in R. I tried to use rmvbin and it worked fine with some probabilities:
> rmvbin(100, margprob = c(0.1,0.1), bincorr=0.5*diag(2)+0.5)
while the next call which is quite similar one raises an error:
> rmvbin(100, margprob = c(0.01,0.01), bincorr=0.5*diag(2)+0.5)
Error in commonprob2sigma(commonprob, simulvals) :
Extrapolation occurred ... margprob and commonprob not compatible?
I can't find any justification for this.
This is a math/stats "problem" and not an R problem (in the sense that it's not a problem but a consequence of the model).
Short version: For bivariate binary data there is a link between the marginal probabilities and the correlations that can be observed. You can see it if you do a bit of boring juggling with the marginal probabilities $p_A$ and $p_B$ and the simultaneous probability $p_{AB}$: since $\max(0, p_A + p_B - 1) \le p_{AB} \le \min(p_A, p_B)$, the correlation $(p_{AB} - p_A p_B)/\sqrt{p_A(1-p_A)p_B(1-p_B)}$ cannot take every value in $[-1, 1]$. In other words: the marginal probabilities put restrictions on the range of allowed correlations (and vice versa), and the error message suggests that your call is running into this.
For bivariate Gaussian random variables the marginals and the correlations are separate and can be specified independently of each other.
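The restriction for the binary case can be sketched in a few lines of base R. The helper name corr_range is hypothetical; it only applies the standard bounds on the joint probability, $\max(0, p_A + p_B - 1) \le p_{AB} \le \min(p_A, p_B)$, and converts them to correlations.

```r
# Sketch: attainable correlation range for two binary variables with
# marginals p_a and p_b (hypothetical helper, bounds on p_ab as above).
corr_range <- function(p_a, p_b) {
  p_ab_min <- max(0, p_a + p_b - 1)
  p_ab_max <- min(p_a, p_b)
  denom <- sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
  c(lower = (p_ab_min - p_a * p_b) / denom,
    upper = (p_ab_max - p_a * p_b) / denom)
}

print(corr_range(0.1, 0.5))    # unequal marginals: upper bound below 1
print(corr_range(0.01, 0.01))  # equal marginals: upper bound is 1
```

For the call in the question (both marginals 0.01, correlation 0.5), this check reports the correlation as attainable in principle. It may therefore be worth checking whether the error comes from the range of marginals that the package's internal interpolation covers; that last point is an assumption, not something confirmed here.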
The question should probably be moved to stats exchange.
