Boosted trees and Variable Interactions in R

How can one see, in a boosted-trees classification model (e.g. AdaBoost), which variables interact with each other and how strongly? I would like to do this with the R gbm package if possible.

To estimate interactions between input variables in an ordinary regression, you can fit interaction terms with lm(); see http://www.r-bloggers.com/r-tutorial-series-regression-with-interaction-variables/

You can use interact.gbm() (see ?interact.gbm), which computes Friedman's H-statistic for a pair of variables in a fitted gbm model. See also this Cross Validated question, which points to a vignette on a related technique from the package dismo.
In general, these interactions need not agree with the interaction terms estimated in a linear model.
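For example, a minimal sketch on simulated data in which x1 and x2 truly interact (the data and model settings are just for illustration):

library(gbm)

# Simulate a binary outcome driven by an x1:x2 interaction
set.seed(1)
d <- data.frame(x1 = rnorm(500), x2 = rnorm(500), x3 = rnorm(500))
d$y <- rbinom(500, 1, plogis(d$x1 * d$x2))

# interaction.depth > 1 lets the trees capture interactions at all
fit <- gbm(y ~ x1 + x2 + x3, data = d, distribution = "bernoulli",
           n.trees = 500, interaction.depth = 3, shrinkage = 0.05)

# Friedman's H-statistic per pair; larger values = stronger interaction
interact.gbm(fit, data = d, i.var = c("x1", "x2"), n.trees = 500)  # high
interact.gbm(fit, data = d, i.var = c("x1", "x3"), n.trees = 500)  # near 0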

Related

Obtaining glmer coefficient confidence intervals via bootstrapping

This is my first experience using mixed models in R for statistical analysis. Because my outcome variable is binary, I built a logistic model using the glmer function of the lme4 package, and I think it works as I intended.
I am now aiming to investigate the statistical significance of my model coefficients. I have read that the best approach for generalized mixed models is generally to bootstrap confidence intervals, but I haven't managed to find a good, clear explanation of how to do this in R.
Would anyone have any suggestions? Are there any packages in R that expedite this process, or do people generally build their own functions for this? I haven't really done any bootstrapping before so I'd appreciate some more in-depth answers.
If you want to compute parametric bootstrap confidence intervals, the built-in functionality
confint(fitted_model, method = "boot")
should work (see ?confint.merMod).
Also see this answer (which illustrates both parametric and nonparametric bootstrapping for user-defined quantities).
If you have multiple cores, you can speed this up by adding parallel = "multicore", ncpus = parallel::detectCores()-1 (or some other appropriate number of cores to use): see ?lme4::bootMer for details.
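Putting this together, a minimal sketch using the cbpp data and the binomial GLMM from ?glmer (nsim is kept small here so it runs quickly; use more in practice):

library(lme4)

gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)

# Parametric bootstrap CIs for all parameters
# (note: parallel = "multicore" is not available on Windows)
set.seed(101)
confint(gm1, method = "boot", nsim = 200,
        parallel = "multicore", ncpus = parallel::detectCores() - 1)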

Is XGBoost effective for variable selection?

(Edit: I have since understood the use of XGBoost; I realize this was an amateur question.)
Can XGBoost be used for variable elimination and selection, like LASSO, or do we need to use LASSO first to eliminate variables and then use XGBoost to get the final prediction?
XGBoost is quite effective for prediction in the presence of redundant variables (features), as the underlying gradient-boosting algorithm is itself robust to multicollinearity.
Still, it is highly recommended to remove (or engineer away) redundant features from any training dataset, whatever the algorithm of choice (LASSO or XGBoost).
Additionally, you can combine the two methods using ensemble learning.
xgboost also has built-in LASSO-like (L1) regularization that is applied during training.
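As a quick sketch of both points: alpha (L1, LASSO-like) and lambda (L2) control the regularization, and xgb.importance() gives a gain-based ranking that can guide feature elimination (the mtcars example is purely illustrative):

library(xgboost)

x <- as.matrix(mtcars[, c("mpg", "wt", "hp", "disp", "qsec")])
y <- mtcars$am   # 0/1 outcome

bst <- xgboost(data = x, label = y, nrounds = 50, verbose = 0,
               params = list(objective = "binary:logistic",
                             alpha = 0.5,    # L1 penalty on leaf weights
                             lambda = 1))    # L2 penalty

# Features with low gain are candidates for removal
xgb.importance(model = bst)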

Random Effects with count Models

I'm trying to fit a hurdle model with random effects in either R or Stata. I've looked at the glmmADMB package, but I'm running into problems getting it to download in R, and I can't find any documentation for the package on CRAN. Is this package still available? Has anyone used it successfully to estimate a hurdle model with random effects?
Alternatively, is there a way to estimate this in Stata? Is there a way to estimate random effects with any type of count data in Stata?
Any advice would be greatly appreciated.
Jennifer
In Stata, xtnbreg and xtpoisson have the random-effects estimator as the default option. You can always estimate the two parts separately by hand. See the count-data chapter of Cameron and Trivedi's Stata book for cross-sectional examples.
You also have the user-written hplogit and hnlogit commands for hurdle count models. These use a logit/probit for the first stage and a zero-truncated Poisson/negative binomial for the second stage. A finite mixture model might also be a nice approach (see the user-written fmm), and there's also ztpnm. All of these are cross-sectional models.
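On the R side, the "estimate the two parts separately" route also works: a mixed-effects logit for zero vs. positive, then a truncated count model on the positives. The sketch below uses lme4 together with glmmTMB's truncated_poisson family; glmmTMB is my suggestion, not something from the thread, and the data are simulated placeholders:

library(lme4)
library(glmmTMB)

# Simulated data: grouping factor g, predictor x, count outcome y
set.seed(1)
d <- data.frame(g = factor(rep(1:20, each = 10)), x = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.5 * d$x)) * (1 + rpois(200, exp(0.3 * d$x)))

# Part 1: hurdle (zero vs. positive), with a random intercept per group
zero_part <- glmer(I(y > 0) ~ x + (1 | g), data = d, family = binomial)

# Part 2: level of the positive counts, zero-truncated Poisson
count_part <- glmmTMB(y ~ x + (1 | g), data = subset(d, y > 0),
                      family = truncated_poisson)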

Splitting rules in mvpart vs rpart

I would like to build classification trees to predict the presence/absence of one bird species based on several variables. I know that rpart handles univariate partitioning and mvpart handles multivariate partitioning, but I'd like to use mvpart for my one-variable tree because of its more flexible output. Does anyone know of a reason that I should not do this? Will the splits be different in rpart vs mvpart given the exact same input?
It cannot be guaranteed that the splits will be the same; mvpart() minimises the within-group sums of squares, whereas rpart() for a classification tree minimises the Gini index (by default, IIRC).
You may end up with the same model/splits, but as the two functions use different measures of node impurity, that would just be a fluke.
Also note that mvpart fits a regression model, whereas you want a classification model.
Finally, consider the party package and its function ctree; it has much nicer output than rpart by default, but it is, again, doing something slightly different in terms of model fitting.
As an aside, also look into the plotmo package which includes enhanced plots for a number of tree-like models including, IIRC, rpart ones.
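To see the difference concretely, here is a small sketch on the kyphosis data that ships with rpart (standing in for a presence/absence outcome):

library(rpart)
library(party)

# Gini-based classification splits
fit_rpart <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class")

# Conditional-inference splits; may pick different variables/cutpoints
fit_ctree <- ctree(Kyphosis ~ Age + Number + Start, data = kyphosis)

print(fit_rpart)
plot(fit_ctree)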

Panel data with binary dependent variable in R

Is it possible to do regressions in R using a panel data set with a binary dependent variable? I am familiar with using glm for logit and probit and plm for panel data, but am not sure how to combine the two. Are there any existing code examples?
EDIT
It would also be helpful if I could figure out how to extract the matrix that plm() uses when it does a regression. For instance, you could use plm to do fixed effects, or you could create a matrix with the appropriate dummy variables and then run that through glm(). In a case like this, however, it is annoying to generate the dummies yourself, and it would be easier to have plm do it for you.
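(For concreteness, the dummy-variable route described above looks like the following; the data frame and variable names are placeholders:)

# "By hand" fixed effects: factor(id) expands into the unit dummies,
# and glm() handles the binary outcome
fit_fe <- glm(y ~ x1 + x2 + factor(id), data = mydata, family = binomial)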
The package pglm might be what you need:
http://cran.r-project.org/web/packages/pglm/pglm.pdf
This package offers glm-like models for panel data.
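A minimal sketch of what a pglm call might look like (the data frame, index variables, and predictors are placeholders):

library(pglm)

# Random-effects probit for a binary outcome on a panel indexed by id/year
fit <- pglm(y ~ x1 + x2, data = mydata, index = c("id", "year"),
            family = binomial("probit"), model = "random")
summary(fit)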
Maybe the package lme4 is what you are looking for.
It seems to be possible to run generalized regressions with group (random) effects using the command glmer.
But you should be aware that panel data with a binary dependent variable behave differently from the usual linear models.
This site may be helpful.
Best regards,
Manoel
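A minimal sketch of the glmer route (placeholder names again; the random intercept per panel unit plays the role of the individual effect):

library(lme4)

# Mixed-effects logit: (1 | id) gives each panel unit its own intercept
fit <- glmer(y ~ x1 + x2 + (1 | id), data = mydata, family = binomial)
summary(fit)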
model.frame(plmmodel)
will give you the data frame that plm actually uses to fit the model (i.e. after listwise deletion if you have NAs, etc.).
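For instance, with the Grunfeld data that ships with plm:

library(plm)
data("Grunfeld", package = "plm")

fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
head(model.frame(fe))   # rows and columns actually used in estimation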
I don't think that plm has implemented functions to estimate models with binary outcomes, but I may be wrong. Check out the reference manual at: http://cran.r-project.org/web/packages/plm/index.html
If I'm right, this would suggest that you can't "combine the two" without considerable work in extending the functions provided by plm.
