I am trying to use the random forests package for classification in R.
The Variable Importance Measures listed are:
mean raw importance score of variable x for class 0
mean raw importance score of variable x for class 1
MeanDecreaseAccuracy
MeanDecreaseGini
Now I know what these "mean" as in I know their definitions. What I want to know is how to use them.
What I really want to know is what these values mean in only the context of how accurate they are, what is a good value, what is a bad value, what are the maximums and minimums, etc.
If a variable has a high MeanDecreaseAccuracy or MeanDecreaseGini does that mean it is important or unimportant? Also any information on raw scores could be useful too.
I want to know everything there is to know about these numbers that is relevant to the application of them.
An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful then a simpler explanation that didn't involve any discussion of how random forests works.
Like if I wanted someone to explain to me how to use a radio, I wouldn't expect the explanation to involve how a radio converts radio waves into sound.
An explanation that uses the words 'error', 'summation', or 'permutated'
would be less helpful then a simpler explanation that didn't involve any
discussion of how random forests works.
Like if I wanted someone to explain to me how to use a radio, I wouldn't
expect the explanation to involve how a radio converts radio waves into sound.
How would you explain what the numbers in WKRP 100.5 FM "mean" without going into the pesky technical details of wave frequencies? Frankly parameters and related performance issues with Random Forests are difficult to get your head around even if you understand some technical terms.
Here's my shot at some answers:
-mean raw importance score of variable x for class 0
-mean raw importance score of variable x for class 1
Simplifying from the Random Forest web page, raw importance score measures how much more helpful than random a particular predictor variable is in successfully classifying data.
-MeanDecreaseAccuracy
I think this is only in the R module, and I believe it measures how much inclusion of this predictor in the model reduces classification error.
-MeanDecreaseGini
Gini is defined as "inequity" when used in describing a society's distribution of income, or a measure of "node impurity" in tree-based classification. A low Gini (i.e. higher descrease in Gini) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes. It's a hard one to describe without talking about the fact that data in classification trees are split at individual nodes based on values of predictors. I'm not so clear on how this translates into better performance.
For your immediate concern: higher values mean the variables are more important. This should be true for all the measures you mention.
Random forests give you pretty complex models, so it can be tricky to interpret the importance measures. If you want to easily understand what your variables are doing, don't use RFs. Use linear models or a (non-ensemble) decision tree instead.
You said:
An explanation that uses the words
'error', 'summation', or 'permutated'
would be less helpful then a simpler
explanation that didn't involve any
discussion of how random forests
works.
It's going to be awfully tough to explain much more than the above unless you dig in and learn what about random forests. I assume you're complaining about either the manual, or the section from Breiman's manual:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
To figure out how important a variable is, they fill it with random junk ("permute" it), then see how much predictive accuracy decreases. MeanDecreaseAccuracy and MeanDecreaseGini work this way. I'm not sure what the raw importance scores are.
Interpretability is kinda tough with Random Forests. While RF is an extremely robust classifier it makes its predictions democratically. By this I mean you build hundreds or thousands of trees by taking a random subset of your variables and a random subset of your data and build a tree. Then make a prediction for all the non-selected data and save the prediction. Its robust because it deals well with the vagaries of your data set, (ie it smooths over randomly high/low values, fortuitous plots/samples, measuring the same thing 4 different ways, etc). However if you have some highly correlated variables, both may seem important as they are not both always included in each model.
One potential approach with random forests may be to help whittle down your predictors then switch to regular CART or try the PARTY package for inference based tree models. However then you must be wary about data mining issues, and making inferences about parameters.
Related
Working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are distributed normally. If it possible to extract a data subset where at least one desired variable will be distributed normally? Want to do some statistics after (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and no other reasons speaking against using a particular method (such as logistic regression) on your data, you might want to study the nature of "weird" observations before deciding on which analysis method to use eventually.
I would:
carry out the desired regression analysis (e.g. logistic regression), and as it's always required, carry out residual analysis (Q-Q Normal plot, Tukey-Anscombe plot, Leverage plot, also see here) to check the model assumptions. See whether the residuals are normally distributed (the normal distribution of model residuals is the actual assumption in linear regression, not that each variable is normally distributed, of course you might have e.g. bimodally distributed data if there are differences between groups), see if there are observations which could be regarded as outliers, study them (see e.g. here), and if possible remove them from the final dataset before re-fitting the linear model without outliers.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
as suggested by GuedesBF, you may want to find a test or model method which has no assumption of normality.
Before modelling anything or removing any data, I would always plot the data by treatment / outcome groups, and inspect the presence of missing values. After quickly looking at your dataset, it seems that quite some variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good start to get an overview of all data, their distribution and pairwise correlation (and if you don't have more than around 20 variables) is to use the psych library and pairs.panels.
dat <- read.delim("~/Downloads/dput.txt", header = F)
library(psych)
psych::pairs.panels(dat[,1:12])
psych::pairs.panels(dat[,13:23])
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)
I am trying to use the random forests package for classification in R.
The Variable Importance Measures listed are:
mean raw importance score of variable x for class 0
mean raw importance score of variable x for class 1
MeanDecreaseAccuracy
MeanDecreaseGini
Now I know what these "mean" as in I know their definitions. What I want to know is how to use them.
What I really want to know is what these values mean in only the context of how accurate they are, what is a good value, what is a bad value, what are the maximums and minimums, etc.
If a variable has a high MeanDecreaseAccuracy or MeanDecreaseGini does that mean it is important or unimportant? Also any information on raw scores could be useful too.
I want to know everything there is to know about these numbers that is relevant to the application of them.
An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful then a simpler explanation that didn't involve any discussion of how random forests works.
Like if I wanted someone to explain to me how to use a radio, I wouldn't expect the explanation to involve how a radio converts radio waves into sound.
An explanation that uses the words 'error', 'summation', or 'permutated'
would be less helpful then a simpler explanation that didn't involve any
discussion of how random forests works.
Like if I wanted someone to explain to me how to use a radio, I wouldn't
expect the explanation to involve how a radio converts radio waves into sound.
How would you explain what the numbers in WKRP 100.5 FM "mean" without going into the pesky technical details of wave frequencies? Frankly parameters and related performance issues with Random Forests are difficult to get your head around even if you understand some technical terms.
Here's my shot at some answers:
-mean raw importance score of variable x for class 0
-mean raw importance score of variable x for class 1
Simplifying from the Random Forest web page, raw importance score measures how much more helpful than random a particular predictor variable is in successfully classifying data.
-MeanDecreaseAccuracy
I think this is only in the R module, and I believe it measures how much inclusion of this predictor in the model reduces classification error.
-MeanDecreaseGini
Gini is defined as "inequity" when used in describing a society's distribution of income, or a measure of "node impurity" in tree-based classification. A low Gini (i.e. higher descrease in Gini) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes. It's a hard one to describe without talking about the fact that data in classification trees are split at individual nodes based on values of predictors. I'm not so clear on how this translates into better performance.
For your immediate concern: higher values mean the variables are more important. This should be true for all the measures you mention.
Random forests give you pretty complex models, so it can be tricky to interpret the importance measures. If you want to easily understand what your variables are doing, don't use RFs. Use linear models or a (non-ensemble) decision tree instead.
You said:
An explanation that uses the words
'error', 'summation', or 'permutated'
would be less helpful then a simpler
explanation that didn't involve any
discussion of how random forests
works.
It's going to be awfully tough to explain much more than the above unless you dig in and learn what about random forests. I assume you're complaining about either the manual, or the section from Breiman's manual:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
To figure out how important a variable is, they fill it with random junk ("permute" it), then see how much predictive accuracy decreases. MeanDecreaseAccuracy and MeanDecreaseGini work this way. I'm not sure what the raw importance scores are.
Interpretability is kinda tough with Random Forests. While RF is an extremely robust classifier it makes its predictions democratically. By this I mean you build hundreds or thousands of trees by taking a random subset of your variables and a random subset of your data and build a tree. Then make a prediction for all the non-selected data and save the prediction. Its robust because it deals well with the vagaries of your data set, (ie it smooths over randomly high/low values, fortuitous plots/samples, measuring the same thing 4 different ways, etc). However if you have some highly correlated variables, both may seem important as they are not both always included in each model.
One potential approach with random forests may be to help whittle down your predictors then switch to regular CART or try the PARTY package for inference based tree models. However then you must be wary about data mining issues, and making inferences about parameters.
I am using randomForest in R, I have a training model with R^2 of 0.94 , however , the prediction capacity for testing data is quite low. I would like to know if I can still use this training model only for determining which variable is more important/effective for output prediction.
Thanks
Based on what little information you provide, the question is hard to answer (think about providing more detail and background). Low prediction quality can result from wrong algorithm tuning, or it can be inherent in the data, i.e. your predictors themselves are not very strongly related to the outcome. In the first case, the prediction could be better with different parameters, e.g. more or less trees, different values for mtry, etc. If this is the case, then your importance measures are just as biased as your prediction (and should be used with caution). If the predictors themselves are weak, that means that your low quality prediction is as good as it gets. In this case, I would say the importance measures can be used, but they only tell you which of your overall weak predictors are more or less weak.
I know that there dosens of similar questions/answers, and lots of papers. But please read till the end.
Non-statisticians tend to use stepwise regressions which is strongly argued by statisticians. This is stomething that I don't understand, but I just obey them. "Ok this is not a good way to do your modelling".
Here is (was) my model:
b <- lmer(metric1~a+b+c+d+e+f+g+h+i+j+k+l+(1|X/Y) + (1|Z), data = dataset)
drop1 (b, test="Chisq")
(Just a small note: Watch out for the random effects in my model; random effects are Year, Month, Sampling.location; one of my variables is 1/0: I allready log-transformed my variables)
I am trying to find a exploratory model (with drop1 to reach final model) and evaluating it with my biological knowledge to see if the dependent ("metric" in this case) seems to be responding variables. I will repeat this process with 100 metrics just to evaulate which metrics seems to be responding environmental variables.
I was in the search for an acceptable model instead of stepwise according to the suggestions of statistics gurus.
However, there are lots of alternatives. I read alot, but still feel myself lost. Some say Lasso, some say elastic modelling, some say ridge regression... Which one fits for my purpose?
Any advise for a better alternative and an easy model or a help page for dummies, or examples (that could be better) would be much appreciated.
Thanks in advance.
I have a few hundred thousand measurements where the dependent
variable is a probability, and would like to use logistic regression.
However, the covariates I have are all categorical, and worse, are all
nested. By this I mean that if a certain measurement has "city -
Phoenix" then obviously it is certain to have "state - Arizona" and
"country - U.S." I have four such factors - the most granular has
some 20k levels, but if need be I could do without that one, I think.
I also have a few non-nested categorical covariates (only four or so,
with maybe three different levels each).
What I am most interested in
is prediction - given a new observation in some city, I would like to
know the relevant probability/dependent variable. I am not interested
as much in the related inferential machinery - standard deviations,
etc - at least as of now. I am hoping I can afford to be sloppy.
However, I would love to have that information unless it requires
methods that are more computationally expensive.
Does anyone have any advice on how to attack this? I have looked into
mixed effects, but am not sure it is what I am looking for.
I think this is more of model design question than on R specifically; as such, I'd like to address the context of the question first then the appropriate R packages.
If your dependent variable is a probability, e.g., $y\in[0,1]$, a logistic regression is not data appropriate---particularly given that you are interested in predicting probabilities out of sample. The logistic is going to be modeling the contribution of the independent variables to the probability that your dependent variable flips from a zero to a one, and since your variable is continuous and truncated you need a different specification.
I think your latter intuition about mixed effects is a good one. Since your observations are nested, i.e., US <-> AZ <-> Phoenix, a multi-level model, or in this case a hierarchical linear model, may be the best specification for your data. The best R packages for this type of modeling are multilevel and nlme, and there is an excellent introduction to both multi-level models in R and nlme available here. You may be particularly interested in the discussion of data manipulation for multi-level modeling, which begins on page 26.
I would suggest looking into penalised regressions like the elastic net. The elastic net is used in text mining where each column represents the present or absence of a single word, and there maybe hundreds of thousands of variables, an analogous problem to yours. A good place to start with R would be the glmnet package and its accompanying JSS paper: http://www.jstatsoft.org/v33/i01/.