randomForest's importance only contains MeanDecreaseGini - r

I have two scripts which both generate random forests in R, which as far as I can work out have the same inputs, although my problem suggests this isn't the case. One of them returns an importance table containing
row.names importance.blue importance.red importance.MeanDecreaseAccuracy importance.MeanDecreaseGini
the other importance table just contains
row.names MeanDecreaseGini
Whats the difference between these two forests, and more importantly what's causing the difference given what I thought were identical inputs?
(The scripts are too large to paste here, but both are trying to predict a factor on the basis of a bunch of continuous variables)

The help page of randomForest tells us, that importance (when used for classification) is a matrix with nclass + 2 columns. The first nclass columns are the class-specific measures computed as mean descrease in accuracy. The nclass + 1st column is the mean descrease in accuracy over all classes. The last column is the mean decrease in Gini index.
If importance=FALSE, the last measure is still returned as a vector.
So, it seems to me, that you called randomForest once with importance=TRUE and once with importance=FALSE.

Related

Negative Binomial model offset seems to be creating a 2 level factor

I am trying to fit some data to a negative binomial model and run a pairwise comparison using emmeans. The data has two different sample sizes, 15 and 20 (num_sample in the example below).
I have set up two data frames: good.data which produces the expected result of offset() using random sample sizes between 15 and 20, and bad.data using a sample size of either 15 or 20, which seems to produce a factor of either 15 or 20. The bad.data pairwise comparison produces way too many comparisons compared to the good.data, even though they should produce the same number?
set.seed(1)
library(dplyr)
library(emmeans)
library(MASS)
# make data that works
data.frame(site=c(rep("A",24),
rep("B",24),
rep("C",24),
rep("D",24),
rep("E",24)),
trt_time=rep(rep(c(10,20,30),8),5),
pre_trt=rep(rep(c(rep("N",3),rep("Y",3)),4),5),
storage_time=rep(c(rep(0,6),rep(30,6),rep(60,6),rep(90,6)),5),
num_sample=sample(c(15,17,20),24*5,T),# more than 2 sample sizes...
bad=sample(c(1:7),24*5,T,c(0.6,0.1,0.1,0.05,0.05,0.05,0.05)))->good.data
# make data that doesn't work
data.frame(site=c(rep("A",24),
rep("B",24),
rep("C",24),
rep("D",24),
rep("E",24)),
trt_time=rep(rep(c(10,20,30),8),5),
pre_trt=rep(rep(c(rep("N",3),rep("Y",3)),4),5),
storage_time=rep(c(rep(0,6),rep(30,6),rep(60,6),rep(90,6)),5),
num_sample=sample(c(15,20),24*5,T),# only 2 sample sizes...
bad=sample(c(1:7),24*5,T,c(0.6,0.1,0.1,0.05,0.05,0.05,0.05)))->bad.data
# fit models
good.data%>%
mutate(trt_time=factor(trt_time),
pre_trt=factor(pre_trt),
storage_time=factor(storage_time))%>%
MASS::glm.nb(bad~trt_time:pre_trt:storage_time+offset(log(num_sample)),
data=.)->mod.good
bad.data%>%
mutate(trt_time=factor(trt_time),
pre_trt=factor(pre_trt),
storage_time=factor(storage_time))%>%
MASS::glm.nb(bad~trt_time:pre_trt:storage_time+offset(log(num_sample)),
data=.)->mod.bad
# pairwise comparison
emmeans::emmeans(mod.good,pairwise~trt_time:pre_trt:storage_time+offset(log(num_sample)))$contrasts%>%as.data.frame()
emmeans::emmeans(mod.bad,pairwise~trt_time:pre_trt:storage_time+offset(log(num_sample)))$contrasts%>%as.data.frame()
First , I think you should look up how to use emmeans.The intent is not to give a duplicate of the model formula, but rather to specify which factors you want the marginal means of.
However, that is not the issue here. What emmeans does first is to setup a reference grid that consists of all combinations of
the levels of each factor
the average of each numeric predictor; except if a
numeric predictor has just two different values, then
both its values are included.
It is that exception you have run against. Since num_samples has just 2 values of 15 and 20, both levels are kept separate rather than averaged. If you want them averaged, add cov.keep = 1 to the emmeans call. It has nothing to do with offsets you specify in emmeans-related functions; it has to do with the fact that num_samples is a predictor in your model.
The reason for the exception is that a lot of people specify models with indicator variables (e.g., female having values of 1 if true and 0 if false) in place of factors. We generally want those treated like factors rather than numeric predictors.
To be honest I'm not exactly sure what's going on with the expansion (276, the 'correct' number of contrasts, is choose(24,2), the 'incorrect' number of contrasts is 1128 = choose(48,2)), but I would say that you should probably be following the guidance in the "offsets" section of one of the emmeans vignettes where it says
If a model is fitted and its formula includes an offset() term, then by default, the offset is computed and included in the reference grid. ...
However, many users would like to ignore the offset for this kind of model, because then the estimates we obtain are rates per unit value of the (logged) offset. This may be accomplished by specifying an offset parameter in the call ...
The most natural choice for setting the offset is to 0 (i.e. make predictions etc. for a sample size of 1), but in this case I don't think it matters.
get_contr <- function(x) as_tibble(x$contrasts)
cfun <- function(m) {
emmeans::emmeans(m,
pairwise~trt_time:pre_trt:storage_time, offset=0) |>
get_contr()
}
nrow(cfun(mod.good)) ## 276
nrow(cfun(mod.bad)) ## 276
From a statistical point of view I question the wisdom of looking at 276 pairwise comparisons, but that's a different issue ...

how is obtained the observed cell counts in svychisq in R?

I'm using the function survey::svychisq() to test for independence in a two-way contingency table for complex samples.
With svytable() I get the observed counts considering the weights defined in design, and I would assume that the observed values saved in svychisq() objects would be the same, but they are not:
svytable(~row.var+col.var, design)
# 330.6634 867.6478 177.1630
# 687.4503 962.5404 228.2926
and
svychisq(~row.var+col.var, design)$observed
# 404.6712 1061.8411 216.8149
# 841.3126 1177.9722 279.3881
provide different results and I couldn't really understand why.
Could someone explain to me how the latter observed values are calculated?
Thanks!
The $observed component is a weighted table with weights scaled to sum to the sample size.

How to use the "how" function for an unbalanced repeated design

I have a set of control and treated plots which had been sampled during years. I run the prc function in the vegan package and want to perform a permutation test to check whether control vs treated plots significantly differ during years. As my data is unbalanced, I can not use strata function. my code look like:
library(vegan)
year=as.factor(c(rep(1995,8),rep(1999,8),rep(2001,8),rep(2013,4),rep(1995,4),
rep(1999,4),rep(2001,4),rep(2013,4)))
treatment=as.factor(c(rep("control",28),rep("treated",16)))
I've written this, but I'm sure that it is wrong because the treatment is missing here:
h1 <- how(within = Within(type = "series", mirror = F),
blocks = year, nperm = 999
)
Any suggestions is greatly appreciated.
Under the null hypothesis, samples from the control or treated groups are exchangeable and hence you don't want them in the permutation design; you really want to permute them to generate the permutation-based null distribution for the test statistic.
The permutation design is there to indicate what isn't exchangeable.
You haven't explained why you want samples within the blocks to be permuted in series; why are samples within years also time series? If they're not, you don't want this.
You only need to worry about imbalance if you want to permute the strata. Whilst using blocks is similar in some respects to strata, blocks are never permuted so if you can use blocks you can use strata as you won't be permuting them.
If you want to permute the years as groups of samples, then you'll need strata and you'll need balance at the year level, which you don't have.
What you have defined with your call to how() is:
groups samples by year and as such samples will never be swapped between years, and
samples within the levels of year will be permuted in series, keeping their temporal order intact after applying cyclic shift permutations.
If that's not what you want to do, you need to explain in words what you want to do. By "do" I mean what is it you want to test? What is your model in vegan?

How to understand RandomForestExplainer output (R package)

I have the following code, which basically try to predict the Species from iris data using randomForest. What I'm really intersed in is to find what are the best features (variable) that explain the species classification. I found the package randomForestExplainer is the best
to serve the purpose.
library(randomForest)
library(randomForestExplainer)
forest <- randomForest::randomForest(Species ~ ., data = iris, localImp = TRUE)
importance_frame <- randomForestExplainer::measure_importance(forest)
randomForestExplainer::plot_multi_way_importance(importance_frame, size_measure = "no_of_nodes")
The result of the code produce this plot:
Based on the plot, the key factor to explain why Petal.Length and Petal.Width is the best factor are these (the explanation is based on the vignette):
mean_min_depth – mean minimal depth calculated in one of three ways specified by the parameter mean_sample,
times_a_root – total number of trees in which Xj is used for splitting the root node (i.e., the whole sample is divided into two based on the value of Xj),
no_of_nodes – total number of nodes that use Xj for splitting (it is usually equal to no_of_trees if trees are shallow),
It's not entirely clear to me why the high times_a_root and no_of_nodes is better? And low mean_min_depth is better?
What are the intuitive explanation for that?
The vignette information doesn't help.
You would like a statistical model or measure to be a balance between "power" and "parsimony". The randomForest is designed internally to do penalization as its statistical strategy for achieving parsimony. Furthermore the number of variables selected in any given sample will be less than the the total number of predictors. This allows model building when hte number of predictors exceeds the number of cases (rows) in the dataset. Early splitting or classification rules can be applied relatively easily, but subsequent splits become increasingly difficult to meet criteria of validity. "Power" is the ability to correctly classify items that were not in the subsample, for which a proxy, the so-called OOB or "out-of-bag" items is used. The randomForest strategy is to do this many times to build up a representative set of rules that classify items under the assumptions that the out-of-bag samples will be a fair representation of the "universe" from which the whole dataset arose.
The times_a_root would fall into the category of measuring the "relative power" of a variable compared to its "competitors". The times_a_root statistic measures the number of times a variable is "at the top" of a decision tree, i.e., how likely it is to be chosen first in the process of selecting split criteria. The no_of_node measures the number of times the variable is chosen at all as a splitting criterion among all of the subsampled.
From:
?randomForest # to find the names of the object leaves
forest$ntree
[1] 500
... we can see get a denominator for assessing the meaning of the roughly 200 values in the y-axis of the plot. About 2/5ths of the sample regressions had Petal.Length in the top split criterion, while another 2/5ths had Petal.Width as the top variable selected as the most important variable. About 75 of 500 had Sepal.Length while only about 8 or 9 had Sepal.Width (... it's a log scale.) In the case of the iris dataset, the subsamples would have ignored at least one of the variables in each subsample, so the maximum possible value of times_a_root would have been less than 500. Scores of 200 are pretty good in this situation and we can see that both of these variables have a comparable explanatory ability.
The no_of_nodes statistic totals up the total number of trees that had that variable in any of its nodes, remembering that the number of nodes would be constrained by the penalization rules.

In R, what is the difference between ICCbare and ICCbareF in the ICC package?

I am not sure if this is a right place to ask a question like this, but Im not sure where to ask this.
I am currently doing some research on data and have been asked to find the intraclass correlation of the observations within patients. In the data, some patients have 2 observations, some only have 1 and I have an ID variable to assign each observation to the corresponding patient.
I have come across the ICC package in R, which calculates the intraclass correlation coefficient, but there are 2 commands available: ICCbare and ICCbareF.
I do not understand what is the difference between them as they do give completely different ICC values on the same variables. For example, on the same variable, x:
ICCbare(ID,x) gave me a value of -0.01035216
ICCbareF(ID,x) gave me a value of 0.475403
The second one using ICCbareF is almost the same as the estimated correlation I get when using random effects models.
So I am just confused and would like to understand the algorithm behind them so I could explain them in my research. I know one is to be used when the data is balanced and there are no NA values.
In the description it says that it is either calculated by hand or using ANOVA - what are they?
By: https://www.rdocumentation.org/packages/ICC/versions/2.3.0/topics/ICCbare
ICCbare can be used on balanced or unbalanced datasets with NAs. ICCbareF is similar, however ICCbareF should not be used with unbalanced datasets.

Resources