Defining classes using random forest models in R

I am pretty new to machine learning, and I've stumbled upon an issue and can't seem to find a solution no matter how hard I google.
I have performed a multiclass classification using the randomForest algorithm and found a model that predicts my test sample adequately. I then used varImpPlot() to determine which predictors are most important in determining the class assignments.
My problem: I would like to know why those predictors are most important. Specifically, I would like to be able to report that cases that fall into Class X have Characteristics A (e.g., are male), B (e.g., are older), and C (e.g., have high IQ), while cases that fall into Class Y have Characteristics D (female), E (younger), and F (low IQ), and so on for the rest of my classes.
I know that standard binary logistic regression allows you to say that cases with high values on Characteristic A are more likely to fall into class X, for example. So, I was hoping for something conceptually similar, but from a random forest classification model on multiple classes.
Is this a thing that can be done using random forest models? If yes, is there a function in randomForest or in caret (or even elsewhere) that can help me get past the varImpPlot() and varImp() table?
Thanks!
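For reference, here is a minimal sketch of the setup described above, with iris standing in for the real data (it has three classes). Note that randomForest's partialPlot() takes a which.class argument, which already gives a rough per-class picture of how a single predictor pushes cases toward a class:
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Overall and per-class variable importance
varImpPlot(rf)
importance(rf)

# Class-specific partial dependence: how Petal.Width moves predictions
# toward the "versicolor" class
partialPlot(rf, pred.data = iris, x.var = "Petal.Width",
            which.class = "versicolor")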

There is a package named ExplainPrediction that promises explanations for random forest models. Here's the top of its DESCRIPTION file; the URL listed there links to an extensive citation list:
Package: ExplainPrediction
Title: Explanation of Predictions for Classification and Regression Models
Version: 1.3.0
Date: 2017-12-27
Author: Marko Robnik-Sikonja
Maintainer: Marko Robnik-Sikonja <marko.robnik@fri.uni-lj.si>
Description: Generates explanations for classification and regression models and visualizes them.
Explanations are generated for individual predictions as well as for models as a whole. Two explanation methods
are included, EXPLAIN and IME. The EXPLAIN method is fast but might miss explanations expressed redundantly
in the model. The IME method is slower as it samples from all feature subsets.
For the EXPLAIN method see Robnik-Sikonja and Kononenko (2008) <doi:10.1109/TKDE.2007.190734>,
and the IME method is described in Strumbelj and Kononenko (2010, JMLR, vol. 11:1-18).
All models in package 'CORElearn' are natively supported, for other prediction models a wrapper function is provided
and illustrated for models from packages 'randomForest', 'nnet', and 'e1071'.
License: GPL-3
URL: http://lkm.fri.uni-lj.si/rmarko/software/
Imports: CORElearn (>= 1.52.0), semiArtificial (>= 2.2.5)
Suggests: nnet, e1071, randomForest
There is also the DALEX package:
Package: DALEX
Title: Descriptive mAchine Learning EXplanations
Version: 0.1.1
Authors@R: person("Przemyslaw", "Biecek", email = "przemyslaw.biecek@gmail.com", role = c("aut", "cre"))
Description: Machine Learning (ML) models are widely used and have various applications in classification
or regression. Models created with boosting, bagging, stacking or similar techniques are often
used due to their high performance, but such black-box models usually lack interpretability.
'DALEX' package contains various explainers that help to understand the link between input variables and model output.
The single_variable() explainer extracts conditional response of a model as a function of a single selected variable.
It is a wrapper over packages 'pdp' and 'ALEPlot'.
The single_prediction() explainer attributes parts of a model prediction to particular variables used in the model.
It is a wrapper over the 'breakDown' package.
The variable_dropout() explainer assesses variable importance based on consecutive permutations.
All these explainers can be plotted with generic plot() function and compared across different models.
Depends: R (>= 3.0)
License: GPL
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1.9000
Imports: pdp, ggplot2, ALEPlot, breakDown
Suggests: gbm, randomForest, xgboost
URL: https://pbiecek.github.io/DALEX/
BugReports: https://github.com/pbiecek/DALEX/issues
Author: Przemyslaw Biecek [aut, cre]
Maintainer: Przemyslaw Biecek <przemyslaw.biecek@gmail.com>
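To make this concrete, here is an untested sketch written against the DALEX 0.1.1 interface quoted above (iris is just a stand-in dataset). For a multiclass forest, one option is to explain the model one class at a time via its predicted class probability:
library(randomForest)
library(DALEX)

rf <- randomForest(Species ~ ., data = iris)

# Explainer for one class ("versicolor"), using its predicted probability
# as the model output
expl <- explain(rf,
                data = iris[, -5],
                y = as.numeric(iris$Species == "versicolor"),
                predict_function = function(m, d)
                  predict(m, d, type = "prob")[, "versicolor"])

# Permutation-based variable importance
plot(variable_dropout(expl))

# Marginal (partial-dependence style) response to one predictor
plot(single_variable(expl, variable = "Petal.Width"))

# Which variables drive the prediction for a single case?
plot(single_prediction(expl, observation = iris[1, -5]))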

Related

Is it possible to build a random forest with model based trees i.e., `mob()` in partykit package

I'm trying to build a random forest using model-based regression trees from the partykit package. I have built a model-based tree using the mob() function with a user-defined fit() function that returns an object at each terminal node.
In partykit there is cforest(), which uses only ctree()-type trees. I want to know if it is possible to modify cforest(), or to write a new function, that builds random forests from model-based trees and returns objects at the terminal nodes. I want to use the objects in the terminal nodes for predictions. Any help is much appreciated. Thank you in advance.
Edit: The tree I have built is similar to the one here -> https://stackoverflow.com/a/37059827/14168775
How do I build a random forest using a tree similar to the one in above answer?
At the moment, there is no canned solution for general model-based forests using mob(), although most of the building blocks are available. However, we are currently reimplementing the backend of mob() so that we can leverage the infrastructure underlying cforest() more easily. Also, mob() is quite a bit slower than ctree(), which is somewhat inconvenient when learning forests.
The best alternative, currently, is to use cforest() with a custom ytrafo. These can also accommodate model-based transformations, very much like the scores in mob(). In fact, in many situations ctree() and mob() yield very similar results when provided with the same score function as the transformation.
A worked example is available in this conference presentation:
Heidi Seibold, Achim Zeileis, Torsten Hothorn (2017).
"Individual Treatment Effect Prediction Using Model-Based Random Forests."
Presented at Workshop "Psychoco 2017 - International Workshop on Psychometric Computing",
WU Wirtschaftsuniversität Wien, Austria.
URL https://eeecon.uibk.ac.at/~zeileis/papers/Psychoco-2017.pdf
The special case of model-based random forests for individual treatment effect prediction was also implemented in a dedicated package model4you that uses the approach from the presentation above and is available from CRAN. See also:
Heidi Seibold, Achim Zeileis, Torsten Hothorn (2019).
"model4you: An R Package for Personalised Treatment Effect Estimation."
Journal of Open Research Software, 7(17), 1-6.
doi:10.5334/jors.219
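For the treatment-effect case, a hedged sketch of what the model4you workflow looks like (function names taken from the paper above; the simulated data set and the lm() base model are purely illustrative and untested):
library(model4you)

# Simulated data: the treatment effect depends on age
set.seed(1)
n   <- 500
trt <- factor(rbinom(n, 1, 0.5), labels = c("control", "active"))
age <- runif(n, 20, 70)
y   <- 1 + 0.5 * (trt == "active") * (age > 45) + rnorm(n)
dat <- data.frame(y, trt, age)

# Base model for the overall treatment effect
bmod <- lm(y ~ trt, data = dat)

# Forest of base models, partitioned by the remaining covariates
frst <- pmforest(bmod, ntree = 50)

# Personalised (observation-wise) coefficients, i.e. individual treatment effects
head(pmodel(frst))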

R alternatives to JAGS/BUGS

Is there an R package I could use for Bayesian parameter estimation as an alternative to JAGS? I found an old question regarding JAGS/BUGS alternatives in R; however, the last post is already 9 years old. So maybe there are new and flexible Gibbs sampling packages available in R? I want to use it to get parameter estimates for novel hierarchical hidden Markov models with random effects, covariates, etc. I highly value the flexibility of JAGS and think that JAGS is simply great; however, I want to write R functions that facilitate model specification and am looking for a package that I can use for parameter estimation.
There are some alternatives:
Stan, via the rstan R package. Stan is well optimized, but it cannot sample discrete parameters, so models with latent discrete variables (like binomial/Poisson mixture models) have to be reformulated by marginalizing those variables out.
nimble
If you want highly optimized sampling based on C++, you may want to check the Rcpp-based solutions from Dirk Eddelbuettel.
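For concreteness, a minimal rstan sketch (assuming rstan and a working C++ toolchain are installed), estimating the mean and standard deviation of a normal sample:
library(rstan)

model_code <- "
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  y ~ normal(mu, sigma);
}
"

set.seed(1)
fit <- stan(model_code = model_code,
            data = list(N = 50, y = rnorm(50, mean = 2, sd = 1)),
            chains = 2, iter = 1000)
print(fit)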

Citing caret R package in APA style

I used the caret package to do a neural network analysis and need to cite the package in APA style. But `citation("caret")` doesn't produce a typical APA-style reference. Can anyone convert it to APA 6th style? Thanks.
To cite package ‘caret’ in publications use:
Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams,
Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton
Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew
Ziem, Luca Scrucca, Yuan Tang and Can Candan. (2016). caret:
Classification and Regression Training. R package version 6.0-71.
https://CRAN.R-project.org/package=caret
Kuhn, M. (2008). Caret package. Journal of Statistical Software, 28(5)
Here is the citation format in APA:
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1-26. https://doi.org/10.18637/jss.v028.i05
The citation in BibTeX format (for LaTeX):
@article{JSSv028i05,
author = {Max Kuhn},
title = {Building Predictive Models in R Using the caret Package},
journal = {Journal of Statistical Software, Articles},
volume = {28},
number = {5},
year = {2008},
keywords = {},
abstract = {The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques. It also includes methods for pre-processing training data, calculating variable importance, and model visualizations. An example from computational chemistry is used to illustrate the functionality on a real data set and to benchmark the benefits of parallel processing with several types of models.},
issn = {1548-7660},
pages = {1--26},
doi = {10.18637/jss.v028.i05},
url = {https://www.jstatsoft.org/v028/i05}
}
Please refer to the following website for other formats:
https://www.jstatsoft.org/rt/captureCite/v028i05/0/ApaCitationPlugin
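As an aside, a BibTeX entry for any installed package can be generated directly in R, which avoids copy-and-paste errors:
# Generate a BibTeX entry for the caret package from within R
toBibtex(citation("caret"))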

How do you perform a goodness of link test for a generalized linear model in R?

I'm working on fitting a generalized linear model in R (using glm()) for data with two predictors in a full factorial design. I'm confident that the Gamma family is the right error distribution, but I'm not sure which link function to use, so I'd like to test all possible link functions against one another. Of course, I can do this manually by fitting a separate model for each link function and then comparing deviances, but I imagine there is an R function that will do this and compile the results. I have searched CRAN, SO, Cross Validated, and the web; the closest function I found was clm2, but I do not believe I want a cumulative link model, based on my understanding of what CLMs are.
My current model looks like this:
CO2_med_glm_alf_gamma <- glm(flux_median_mod_CO2~PercentH2OGrav+
I(PercentH2OGrav^2)+Min_Dist+
I(Min_Dist^2)+PercentH2OGrav*Min_Dist,
data = NC_alf_DF,
family=Gamma(link="inverse"))
How do I code this model into an R function that will do such a 'goodness-of-link' test?
(As far as the statistical validity of such a test goes, this discussion as well as a discussion with a stats post-doc lead me to believe that is valid to compare AIC or deviances between generalized linear models that are identical except for having different link functions)
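For reference, the manual comparison described in the question can be scripted as a short loop over candidate links. This is only a sketch and assumes the NC_alf_DF data frame from the model above; some links may fail to converge for a given data set:
links <- c("inverse", "log", "identity")

# Refit the same Gamma GLM with each candidate link
fits <- lapply(links, function(lnk)
  glm(flux_median_mod_CO2 ~ PercentH2OGrav + I(PercentH2OGrav^2) +
        Min_Dist + I(Min_Dist^2) + PercentH2OGrav:Min_Dist,
      data   = NC_alf_DF,
      family = Gamma(link = lnk)))

# Compare AIC and deviance across links
data.frame(link     = links,
           AIC      = sapply(fits, AIC),
           deviance = sapply(fits, deviance))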
This is not "all possible links", it's testing against a specified class of links, but there is a goodness-of-link test by Pregibon that is implemented in the LDdiag package. It's not on CRAN, but you can install it from the archives via
devtools::install_version("LDdiag","0.1")
The example given (not that exciting) is
library(LDdiag)
data(quine, package = "MASS")   # school absenteeism data used in the example
quine$Days <- ifelse(quine$Days == 0, 1, quine$Days)
ex <- glm(Days ~ ., family = Gamma(link = "log"), data = quine)
pregibon(ex)
The pregibon family of link functions is implemented in the glmx package. As pointed out by Achim Zeileis in the comments, the package provides various parametric link functions and supports general estimation and inference based on such parametric links (or, more generally, parametric families). To see a worked example of how this can be employed for a variety of goodness-of-link assessments, see example("WECO", package = "glmx"). This replicates the analyses from two papers by Koenker and Yoon (see below).
This example might be useful too.
Koenker R (2006). “Parametric Links for Binary Response.” R News, 6(4), 32--34; link to page with supplementary materials.
Koenker R, Yoon J (2009). “Parametric Links for Binary Choice Models: A Fisherian-Bayesian Colloquy.” Journal of Econometrics, 152, 120--130; PDF.
I have learned that the dredge() function (MuMIn package) can be used to perform goodness-of-link tests on glms, lms, etc. More generally it is a model selection function, but it allows a good deal of customization. In this case, you can use the varying option to compare models fit with different link functions. See the worked Beetle example in the package documentation for details.

Copy the required data in text file using R

Question: The input data is a text file. Copy only the Statistics data and paste it into another text file.
As the expected output below shows, only the Statistics paragraphs should be kept; the Packages paragraphs should be ignored.
Input:
Statistics - R is statistical software which is used for data analysis. It includes a huge number of statistical procedures such as
t-test, chi-square tests, standard linear models, instrumental
variables estimation, local polynomial regressions, etc. It also
provides high-level graphics capabilities.
R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests,
time-series analysis, classification, clustering, and others.
R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of
packages.
Packages - The capabilities of R are extended through user-created
packages, which allow specialized statistical techniques, graphical
devices (ggplot2), import/export capabilities, reporting tools (knitr,
Sweave), etc.
These packages are developed primarily in R, and sometimes in Java, C
and Fortran. A core set of packages is included with the installation
of R, with more than 5,800 additional packages and 120,000 functions
Statistics - R is an object oriented programming language.
S-PLUS is a commercial version of the same S programming language that R is a free version
SAS is proprietary software that can be used with very large datasets such as census data.
Packages - Other R package resources include Crantastic, a community
site for rating and reviewing all CRAN packages, and R-Forge.
Version 0.16 – This is the last alpha version developed primarily by
Ihaka and Gentleman. Much of the basic functionality from the "White
Book" (see S history) was implemented. The mailing lists commenced on
April 1, 1997.
Output:
Statistics - R is statistical software which is used for data analysis. It includes a huge number of statistical procedures such as
t-test, chi-square tests, standard linear models, instrumental
variables estimation, local polynomial regressions, etc. It also
provides high-level graphics capabilities.
R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests,
time-series analysis, classification, clustering, and others.
R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of
packages.
Statistics - R is an object oriented programming language.
S-PLUS is a commercial version of the same S programming language that R is a free version
SAS is proprietary software that can be used with very large datasets such as census data.
R Code:
setwd("xxx")
text <- readLines("data.txt")
q3 <- data.frame(text)
df <- q3[!(is.na(q3$text) | q3$text == ""), ]
q4 <- data.frame(df)
a <- Search(q4, "Statistics")
View(a)
Only the line containing the word "Statistics" is captured, not the rest of the paragraph. I need help building the R code.
You can use str_extract_all() from the stringr package:
library(stringr)

left.border <- "Statistics"
right.border <- "Packages"
pattern <- paste0(left.border, "(.*?)", right.border)
str_extract_all(text, pattern)
[[1]]
[1] "Statistics - R is statistical software which is used for data analysis. It includes a huge number of statistical procedures such as t-test, chi-square tests, standard linear models, instrumental variables estimation, local polynomial regressions, etc. It also provides high-level graphics capabilities.\n\nR provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.\n\nR is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of packages.\n\nPackages"
[2] "Statistics - R is an object oriented programming language.\n\nS-PLUS is a commercial version of the same S programming language that R is a free version\n\nSAS is proprietary software that can be used with very large datasets such as census data.\n\nPackages"
Then you can replace right.border with an empty string to remove "Packages" at the end of each match.
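Putting it together, a sketch of a complete script (assuming stringr is installed and the input lives in data.txt; the output filename is arbitrary) that extracts the Statistics blocks and writes them to a new text file, as asked:
library(stringr)

# Collapse the file into one string so a match can span multiple lines
text <- paste(readLines("data.txt"), collapse = "\n")

# dotall = TRUE lets '.' match newlines inside a paragraph block
pattern <- regex("Statistics(.*?)Packages", dotall = TRUE)
blocks  <- str_extract_all(text, pattern)[[1]]

# Drop the trailing "Packages" marker picked up by the lazy match
blocks <- str_remove(blocks, "Packages$")

writeLines(blocks, "statistics_only.txt")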
