rpart variable importance shows more variables than decision tree plots - r

I fitted an rpart model with leave-one-out cross-validation on my data using the caret library in R. Everything is OK, but I want to understand the difference between the model's variable importance and the decision tree plot.
Calling the variable importance with the function varImp() shows nine variables. Plotting the decision tree using functions such as fancyRpartPlot() or rpart.plot() shows a decision tree that uses only two variables to classify all subjects.
How can that be? Why does the decision tree plot not show the same nine variables that appear in the variable importance table?
Thank you.

Similar to rpart(), caret has a cool property: it deals with surrogate variables, i.e. variables that are not chosen for splits but that were close to winning the competition.
Let me be clearer. Say that at a given split the algorithm decided to split on x1. Suppose there is also another variable, say x2, which would have been almost as good as x1 for splitting at that stage. We call x2 a surrogate, and we assign it a variable importance contribution just as we do for x1.
This is why the importance ranking can contain variables that are never actually used for splitting. You can even find that such variables are ranked as more important than variables that are actually used!
The rationale for this is explained in the documentation for rpart(): suppose we have two identical covariates, say x3 and x4. Then rpart() is likely to split on one of them only, e.g., x3. How can we say that x4 is not important?
To conclude, variable importance considers the increase in fit for both primary variables (i.e., the ones actually chosen for splitting) and surrogate variables. So the importance for x1 accounts both for splits where x1 is chosen as the splitting variable and for splits where another variable is chosen but x1 is a close competitor.
Hope this clarifies your doubts. For more details, see here. Just a quick quotation:
The following methods for estimating the contribution of each variable to the model are available [speaking of how variable importance is computed]:
[...]
- Recursive Partitioning: The reduction in the loss function (e.g. mean squared error) attributed to each variable at each split is tabulated and the sum is returned. Also, since there may be candidate variables that are important but are not used in a split, the top competing variables are also tabulated at each split. This can be turned off using the maxcompete argument in rpart.control.
I am not very familiar with caret, but from this quote it appears that the package actually uses rpart() to grow trees, thus inheriting the behaviour regarding surrogate and competing variables.
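To see the difference concretely, here is a minimal sketch (the data frame df and outcome y are hypothetical) comparing the importance table with the plotted tree, and showing where the competitor/surrogate tabulation can be switched off:
library(caret)
library(rpart.plot)

set.seed(1)
# rpart model tuned with leave-one-out cross-validation
fit <- train(y ~ ., data = df,
             method = "rpart",
             trControl = trainControl(method = "LOOCV"))

varImp(fit)                 # may rank variables that never appear in the tree
rpart.plot(fit$finalModel)  # the plotted tree may use only a couple of them

# To restrict the tabulation to variables actually used for splitting,
# fit rpart directly and suppress competitors/surrogates:
# rpart(y ~ ., data = df,
#       control = rpart.control(maxcompete = 0, maxsurrogate = 0))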

Related

Finding how variables affect the output of a time-series random-forest regression model

I created a random-forest regression model for time-series data in R that has three predictors and one output variable.
Is there a way to find (perhaps in more absolute terms) how changes in a specific variable affect the prediction output?
I know about variable importance; I am not trying to find the variables that have the biggest effect. Instead, I want to see how increasing (or decreasing) the value of an input variable X_1 would change the prediction output.
Does it even make sense to do this? Is it even possible with a random-forest model? Rereading my question a few times made me doubtful, but any insight/recommendation would be greatly appreciated.
I would guess that what this question is actually about is exploratory data analysis (EDA). For starters, I would calculate the correlations between the variables to get a feeling for the strength of the [linear] relationship between each pair of variables. Further, I would look at scatter plots between the variables to get a feeling for the relationships. Depending on the variables, [linear] regression could then tell you how an increase in variable x1 is associated with a change in variable x2.
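As a rough sketch of that workflow (df, y, and x1 to x3 are hypothetical names):
cor(df)                  # pairwise [linear] correlations between all variables
pairs(df)                # scatter-plot matrix to eyeball the relationships
plot(y ~ x1, data = df)  # the output against one predictor of interest

# A linear model then quantifies how a unit change in x1 is associated
# with a change in y, holding the other predictors fixed:
summary(lm(y ~ x1 + x2 + x3, data = df))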

Extracting normal-distributed subset from a dataset in R

I am working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are normally distributed. Is it possible to extract a data subset in which at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distributions of individual variables, and there are no other reasons against using a particular method (such as logistic regression) on your data, you might want to study the nature of these "weird" observations before deciding which analysis method to use.
I would:
- carry out the desired regression analysis (e.g. logistic regression) and, as is always required, carry out residual analysis (Q-Q normal plot, Tukey-Anscombe plot, leverage plot; also see here) to check the model assumptions; a minimal sketch follows this list. Check whether the residuals are normally distributed: the normality of the model residuals is the actual assumption in linear regression, not that each variable is normally distributed (of course you might have e.g. bimodally distributed data if there are differences between groups). See whether there are observations that could be regarded as outliers, study them (see e.g. here), and if justified remove them from the final dataset before re-fitting the model without them.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
Whether it is a good idea to remove outliers, or better to use robust methods, was discussed here.
- as suggested by GuedesBF, you may want to find a test or modelling method that has no normality assumption.
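Here is a minimal sketch of the residual-analysis step, assuming a binary outcome y and two predictors x1 and x2 in a data frame df:
# fit the logistic regression
fit <- glm(y ~ x1 + x2, data = df, family = binomial)

# standard diagnostic plots: residuals vs fitted (Tukey-Anscombe),
# Q-Q plot of the residuals, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)

# flag potentially influential observations (a common rule of thumb)
which(cooks.distance(fit) > 4 / nrow(df))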
Before modelling anything or removing any data, I would always plot the data by treatment / outcome groups and check for missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good way to get an overview of all the data, their distributions and pairwise correlations (provided you don't have more than around 20 variables) is the psych library and its pairs.panels function.
# read the tab-delimited data (no header row)
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)
library(psych)
# scatter-plot matrices with histograms and pairwise correlations,
# split into two batches of columns so the panels stay readable
psych::pairs.panels(dat[, 1:12])
psych::pairs.panels(dat[, 13:23])
You can then quickly see the distribution of each variable and the presence of correlations among each pair of variables. You can tune the arguments of that function to use different correlation methods and different displays (an example follows). Happy exploratory data analysis :)
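For example, to switch to Spearman correlations and adjust the display (the argument values here are just one possible choice):
psych::pairs.panels(dat[, 1:12],
                    method   = "spearman",  # rank-based correlations
                    density  = TRUE,        # density curves on the diagonal
                    ellipses = FALSE)       # drop the correlation ellipses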

How to account for two covariates in differential gene expression of single cell RNA seq data?

I have human data from different ages and genders. After using integration with Seurat, how would I best control for these confounding factors during differential gene expression analysis? I see the latent.vars option in the FindMarkers function. Can I give latent.vars = c("Age", "gender") to account for both together, or can I only use one at a time?
Is there alternative package to do the test better?
Thanks in advance!
You can use that argument, but it means you are shifting to a GLM-based model rather than the default Wilcoxon test. You can also see this in the help page (?FindMarkers):
latent.vars: Variables to test, used only when ‘test.use’ is one of
'LR', 'negbinom', 'poisson', or 'MAST'
You can see how the GLM is called in the source code, under GLMDETest. Basically, the two covariates are included in the GLM to account for their effects on the dependent variable. What is also important is how you treat the covariate age in this case: should it be categorical or continuous? This could affect your results.
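A minimal sketch, assuming an integrated Seurat object seu with metadata columns "Age" and "gender" (the object and cluster names are hypothetical):
library(Seurat)

markers <- FindMarkers(seu,
                       ident.1     = "cluster1",
                       ident.2     = "cluster2",
                       test.use    = "LR",   # or "negbinom", "poisson", "MAST"
                       latent.vars = c("Age", "gender"))

# Whether Age is stored as numeric or factor in seu@meta.data determines
# whether it enters the model as continuous or categorical.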

R: Which variables to include in model?

I'm fairly new to R and am currently trying to find the best model to predict my dependent variable from a number of predictor variables. I have 20 predictor variables and I want to see which ones I should include in my model and which ones I should exclude.
I am currently just running models with different predictor variables in each and comparing them to see which one has the lowest AIC, but this is taking a really long time. Is there an easier way to do this?
Thank you in advance.
This is more of a theoretical question actually...
In principle, if all of the predictors are actually exogenous to the model, they can all be included together; assuming you have enough data (N >> 20) and the predictors are not too similar (which could give rise to multicollinearity), that should help prediction. In practice, you need to think about whether each of (or any of) your predictors is actually exogenous to the model (that is, independent of the error term in the model). If they are not, they will bias the estimates. (Omitting explanatory variables that are actually necessary also introduces bias.)
If predictive accuracy (even spurious in-sample accuracy) is the goal, then techniques like LASSO (as mentioned in the comments) could also help.
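A minimal LASSO sketch with glmnet, assuming a data frame df with the outcome y and your 20 predictors:
library(glmnet)

x <- model.matrix(y ~ ., data = df)[, -1]  # predictor matrix, intercept dropped
cvfit <- cv.glmnet(x, df$y, alpha = 1)     # alpha = 1 is the LASSO penalty

# predictors with non-zero coefficients at the chosen lambda are "selected"
coef(cvfit, s = "lambda.1se")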

Formula interface for glmnet

In the last few months I've worked on a number of projects where I've used the glmnet package to fit elastic net models. It's great, but the interface is rather bare-bones compared to most R modelling functions. In particular, rather than specifying a formula and data frame, you have to give a response vector and predictor matrix. You also lose out on many quality-of-life things that the regular interface provides, eg sensible (?) treatment of factors, missing values, putting variables into the correct order, etc.
So I've generally ended up writing my own code to recreate the formula/data frame interface. Due to client confidentiality issues, I've also ended up leaving this code behind and having to write it again for the next project. I figured I might as well bite the bullet and create an actual package to do this. However, a couple of questions before I do so:
Are there any issues that complicate using the formula/data frame interface with elastic net models? (I'm aware of standardisation and dummy variables, and wide datasets maybe requiring sparse model matrices.)
Is there any existing package that does this?
Well, it looks like there's no pre-built formula interface, so I went ahead and made my own. You can download it from Github: https://github.com/Hong-Revo/glmnetUtils
Or in R, using devtools::install_github:
install.packages("devtools")
library(devtools)
install_github("hong-revo/glmnetUtils")
library(glmnetUtils)
From the readme:
Some quality-of-life functions to streamline the process of fitting
elastic net models with glmnet, specifically:
- glmnet.formula provides a formula/data frame interface to glmnet.
- cv.glmnet.formula does a similar thing for cv.glmnet.
- Methods for predict and coef for both the above.
- A function cvAlpha.glmnet to choose both the alpha and lambda parameters via cross-validation, following the approach described in the help page for cv.glmnet. Optionally does the cross-validation in parallel.
- Methods for plot, predict and coef for the above.
Incidentally, while writing the above, I think I realised why nobody has done this before. Central to R's handling of model frames and model matrices is a terms object, which includes a matrix with one row per variable and one column per main effect and interaction. In effect, that's (at minimum) roughly a p x p matrix, where p is the number of variables in the model. When p is 16000, which is common these days with wide data, the resulting matrix is about a gigabyte in size.
Still, I haven't had any problems (yet) working with these objects. If it becomes a major issue, I'll see if I can find a workaround.
Update Oct-2016
I've pushed an update to the repo, to address the above issue as well as one related to factors. From the documentation:
There are two ways in which glmnetUtils can generate a model matrix out of a formula and data frame. The first is to use the standard R machinery comprising model.frame and model.matrix; and the second is to build the matrix one variable at a time. These options are discussed and contrasted below.
Using model.frame
This is the simpler option, and the one that is most compatible with other R modelling functions. The model.frame function takes a formula and data frame and returns a model frame: a data frame with special information attached that lets R make sense of the terms in the formula. For example, if a formula includes an interaction term, the model frame will specify which columns in the data relate to the interaction, and how they should be treated. Similarly, if the formula includes expressions like exp(x) or I(x^2) on the RHS, model.frame will evaluate these expressions and include them in the output.
The major disadvantage of using model.frame is that it generates a terms object, which encodes how variables and interactions are organised. One of the attributes of this object is a matrix with one row per variable, and one column per main effect and interaction. At minimum, this is (approximately) a p x p square matrix where p is the number of main effects in the model. For wide datasets with p > 10000, this matrix can approach or exceed a gigabyte in size. Even if there is enough memory to store such an object, generating the model matrix can take a significant amount of time.
Another issue with the standard R approach is the treatment of factors. Normally, model.matrix will turn an N-level factor into an indicator matrix with N-1 columns, with one column being dropped. This is necessary for unregularised models as fit with lm and glm, since the full set of N columns is linearly dependent. With the usual treatment contrasts, the interpretation is that the dropped column represents a baseline level, while the coefficients for the other columns represent the difference in the response relative to the baseline.
This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.
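A quick base-R illustration of this baseline-dropping behaviour:
f <- factor(c("a", "b", "c"))
model.matrix(~ f)
# a 3-level factor becomes an intercept plus 2 indicator columns (fb, fc);
# level "a" is absorbed into the baseline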
Manually building the model matrix
To deal with the problems above, glmnetUtils by default will avoid using model.frame, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be noticeably faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually).
The main downside of not using model.frame is that the formula can only be relatively simple. At the moment, only straightforward formulas like y ~ x1 + x2 + ... + x_p are handled by the code, where the x's are columns already present in the data. Interaction terms and computed expressions are not supported. Where possible, you should compute such expressions beforehand.
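For reference, a short usage sketch with the built-in mtcars data; the function names follow the readme quoted above and may differ in later versions of the package:
library(glmnetUtils)

# elastic net and cross-validated fits via the formula interface
fit   <- glmnet(mpg ~ cyl + hp + wt, data = mtcars, alpha = 0.5)
cvfit <- cv.glmnet(mpg ~ cyl + hp + wt, data = mtcars)

predict(fit, newdata = mtcars, s = 0.1)
coef(cvfit, s = "lambda.min")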
Update Apr-2017
After a few hiccups, this is finally on CRAN.

Resources