Minimal depth interaction from the randomForestExplainer package in R

When using the minimal depth interaction feature of the randomForestExplainer package in R, I'm getting some hard-to-interpret results.
I simulated some data (x1, x2, ..., x5) where x1 is binary and x2-x5 are continuous. In my model, there are no interactions.
I'm using the randomForest package to create a random forest and then running it through the randomForestExplainer package.
Here's the code I'm using to simulate the data and random forest:
library(randomForest)
library(randomForestExplainer)
n <- 100
p <- 4
# Create data:
xrandom <- matrix(rnorm(n * p) + 5, nrow = n)
colnames(xrandom) <- paste0("x", 2:5)
d <- data.frame(xrandom)
d$x1 <- factor(sample(1:2, n, replace = TRUE))
# Equation:
y <- d$x2 + rnorm(n)/5
y[d$x1 == 1] <- y[d$x1 == 1] + 5
d$y <- y
# Random Forest:
fr <- randomForest(y ~ ., data = d, localImp = TRUE)
# Random Forest Explainer:
interactions_frame <- min_depth_interactions(fr, names(d)[-6])
head(interactions_frame, 2)
This produces the following:
  variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1       x1            x1       4.670732           0       x1:x1              1.703252
2       x1            x2       2.606190         221       x2:x1              1.703252
So, my question is: if x1:x1 has 0 occurrences (which is expected), how can it also have a mean_min_depth?
Surely if it has 0 occurrences, it can't possibly have a minimum depth? (Or rather, the minimum depth should be 0 or NA.)
What's going on here? Am I misinterpreting something?
Thanks

My understanding is that this has to do with the mean_sample argument of min_depth_interactions. The default choice replaces the NAs with the depth of the maximal subtree whose root is x1. Details below.
What is the mean_sample argument for? It specifies how to deal with trees where the interaction of interest is not present. There are three options:
relevant_trees. This only considers the trees where the interaction of interest is present. In your example, this gives NA for the mean_min_depth of interaction x1:x1, which is the behavior you were looking for.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "relevant_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 NA 0 x1:x1 1.947475
2 x1 x2 1.426606 218 x2:x1 1.947475
all_trees. There is a major problem with relevant_trees: for an interaction that shows up in only a small number of trees, taking the mean of the conditional minimum depth over just those trees ignores the fact that the interaction is not that important. In that case, a small mean conditional minimum depth doesn't mean the interaction is important. To address this, specifying mean_sample = "all_trees" replaces the conditional minimum depth, in trees where the interaction of interest is absent, with the mean depth of the maximal subtree of the root variable. Basically, if we are looking at the interaction x1:x2, then for a tree where this interaction is absent we fill in the mean depth of the maximal subtrees whose root is x1. This gives a (hopefully large) numeric value to the mean_min_depth of interaction x1:x2, thus making it look less important.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "all_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.97568
2 x1 x2 3.654522 218 x2:x1 1.97568
top_trees. This is the default choice for mean_sample. My understanding is that it's similar to all_trees, but tries to down-weight the contribution of the replaced missing values. The motivation is that all_trees pulls mean_min_depth towards the same value when there are many parameters but not enough observations, i.e. shallow trees. To reduce the contribution of the replacements, top_trees only calculates the mean conditional minimal depth on a subset of n trees, where n is the number of trees in which ANY interaction with the specified root is present. Say that in your example, out of the 500 trees only 300 contain some interaction x1:whatever; then we only consider those 300 trees when filling in the value for x1:x1. Because there are 0 occurrences of this interaction, replacing 500 NAs vs. replacing 300 NAs with the same value doesn't affect the mean, so it's the same value, 4.787879. (There's a slight difference between our results; I think it has to do with seed values.)
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "top_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.947475
2 x1 x2 2.951051 218 x2:x1 1.947475
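To see why the replacement doesn't change the value here, a toy illustration of the averaging argument (my own numbers, not output from the package): with 0 occurrences, every tree contributes the same fill-in value, so the mean is that value whether it is taken over all 500 trees or only the 300 relevant ones.
# Toy check of the averaging argument above (hypothetical fill-in value)
fill_value <- 4.787879
mean(rep(fill_value, 500))  # all_trees-style mean: 4.787879
mean(rep(fill_value, 300))  # top_trees-style mean: 4.787879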
This answer is based on my understanding of the package author's thesis: https://rawgit.com/geneticsMiNIng/BlackBoxOpener/master/randomForestExplainer_Master_thesis.pdf

Related

How to make a mixed anova and test its residuals for normal distribution? (R)

I want to run a mixed ANOVA with the within-subject factors mzp and cond and the between-subject factors cond_order and video_order.
I have 3 time points of a repeated measurement, indicated by mzp.
library(afex)  # aov_car() comes from the afex package
anova.h1 <- aov_car(ee ~ cond_order + video_order + Error(code/mzp*cond), data = dat_long)
Three things I can't find a solution for:
How do I separate the within-subject factors in the error term? A lot of the code I found used *, but I fear it might only apply to specific cases. Are there other separating operators?
mzp actually has 3 levels (i.e. times of measurement) for the dependent variable, but cond has only 2 (because no baseline was measured). So I made up a 3rd time point for that variable by setting its values to NA at baseline, but that seems to cause issues now:
Error: Empty cells in within-subjects design (i.e., bad data structure).
table(data[c("mzp", "cond")])
# cond
# mzp s1st t1st s2nd t2nd
# X0 0 0 0 0
# X1 44 43 0 0
# X2 0 0 43 44
I need to examine the relations between all 3 times of measurement of the dependent variable and their interactions with the independent variables cond, cond_order and video_order. So is there a way of ignoring the NAs in cond while still including all 3 time points of the dependent variable, so I can examine its progress over time?
Above all, I need this ANOVA so I can examine the residuals and test them for normality. I tried the functions I know and could google (for a model without the cond variable), but they don't work for this model/this function, so I have to examine the residuals graphically. What works for this ANOVA function?
hist(rstandard(anova.h1))
plot(anova.h1, 2)
anova.h1.pr <- proj(anova.h1)
# Error in proj.default(anova.h1) : argument does not contain 'qr' component
res <- anova.h1.pr[["Within"]][, "Residuals"]
qqnorm(res)

How to find the minimum floating-point value accepted by betareg package?

I'm doing a beta regression in R, which requires values between 0 and 1, endpoints excluded, i.e. (0,1) instead of [0,1].
I have some 0 and 1 values in my dataset, so I'd like to convert them to the closest possible interior values, such as 0.0000...0001 and 0.9999...9999. I've used .Machine$double.xmin (which gives 2.225074e-308), but betareg() still gives an error:
invalid dependent variable, all observations must be in (0, 1)
If I use 0.000001 and 0.999999, I get a different set of errors:
1: In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
failed to invert the information matrix: iteration stopped prematurely
2: In sqrt(wpp) :
Error in chol.default(K) :
the leading minor of order 4 is not positive definite
Only if I use 0.0001 and 0.9999 can I run without errors. Is there any way I can improve these minimum values with betareg, or should I just be happy with that?
Try it with eps (the displacement from 0 and 1) first equal to 1e-4 (as you have here) and then 1e-3. If the results of the two models don't differ in any way you care about, that's great. If they do differ, you need to be very careful, because it suggests your answers will be very sensitive to assumptions.
In the example below the dispersion parameter phi changes a lot, but the intercept and slope parameters don't change very much.
If you do find that the parameters change by a worrying amount for your particular data, then you need to think harder about the process by which zeros and ones arise, and model that process appropriately, e.g.
a censored-data model: zeros/ones arise through a minimum/maximum detection threshold; model the zero/one values as actually lying somewhere in the tails, or
a hurdle/zero-one inflation model: zeros and ones arise through a separate process from the rest of the data; use a binomial or multinomial model to characterize zero vs. (0,1) vs. one, then use a beta regression on the (0,1) component (a sketch of this appears after the plotting example below).
Questions about these steps are probably more appropriate for CrossValidated than for SO.
sample data
set.seed(101)
library(betareg)
dd <- data.frame(x=rnorm(500))
rbeta2 <- function(n, prob=0.5, d=1) {
rbeta(n, shape1=prob*d, shape2=(1-prob)*d)
}
dd$y <- rbeta2(500,plogis(1+5*dd$x),d=1)
dd$y[dd$y<1e-8] <- 0
trial fitting function
ss <- function(eps) {
  dd <- transform(dd, y = pmin(1 - eps, pmax(eps, y)))
  m <- try(betareg(y ~ x, data = dd))
  if (inherits(m, "try-error")) return(rep(NA, 3))
  return(coef(m))
}
ss(0) ## fails
ss(1e-8) ## fails
ss(1e-4)
## (Intercept) x (phi)
## 0.3140810 1.5724049 0.7604656
ss(1e-3) ## also fails
ss(1e-2)
## (Intercept) x (phi)
## 0.2847142 1.4383922 1.3970437
ss(5e-3)
## (Intercept) x (phi)
## 0.2870852 1.4546247 1.2029984
try it for a range of values
evec <- seq(-4, -1, length = 51)
res <- t(sapply(evec, function(e) ss(10^e)))
library(ggplot2)
ggplot(data.frame(e = 10^evec, reshape2::melt(res)),
       aes(e, value, colour = Var2)) +
  geom_line() + scale_x_log10()
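If the estimates do turn out to be sensitive to eps, one concrete way to pursue the zero-one-inflation route mentioned above is an inflated beta model. Below is a minimal sketch using the gamlss package's BEINF family on the dd data simulated earlier; this is my own illustration under that assumption, not something betareg itself provides.
# Sketch only: a zero/one-inflated beta fit, so the exact 0s (and any exact 1s)
# can stay in the data instead of being displaced by eps
library(gamlss)  # assumes gamlss is installed; BEINF allows y in [0, 1]
m_inf <- gamlss(y ~ x, family = BEINF, data = dd)
summary(m_inf)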

Generating different percentages of MAR data in R

The following two R functions are from the book "Flexible Imputation of Missing Data" (pages 59 and 63). The first one generates missing completely at random (MCAR) data and the second one generates missing at random (MAR) data. Both functions give approximately 50% missing values.
In the MCAR function, we can generate different percentages of missing data by changing the p value. But in the MAR function, I don't understand which parameter we should change to generate different percentages of missing data, like 10% or 30%.
MCAR
makemissing <- function(data, p = 0.5) {
  rx <- rbinom(nrow(data), 1, p)
  data[rx == 0, "y"] <- NA
  return(data)
}
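Note that in this function, as written, p is the probability that a row keeps its y value, so roughly 1 - p of the rows end up missing. A quick check (my own example, not from the book):
# p = 0.9 gives roughly 10% missing, p = 0.7 roughly 30%
set.seed(1)
dat <- data.frame(y = rnorm(1000), x = rnorm(1000))
mean(is.na(makemissing(dat, p = 0.9)$y))  # ~0.10
mean(is.na(makemissing(dat, p = 0.7)$y))  # ~0.30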
MAR
library(MASS)  # for mvrnorm()
logistic <- function(x) exp(x) / (1 + exp(x))
set.seed(32881)
n <- 10000
y <- mvrnorm(n = n, mu = c(5, 5), Sigma = matrix(c(1, 0.6, 0.6, 1), nrow = 2))
p2.marright <- 1 - logistic(-5 + y[, 1])
r2.marright <- rbinom(n, 1, p2.marright)
yobs <- y
yobs[r2.marright == 0, 2] <- NA
The probability of an observation being missing is 50% for every case in the MCAR function because, by definition, the missingness is completely random. For the MAR version, the probability of an observation being missing differs between observations, since it depends on the value of y[,1]. In your code, p2.marright holds, for each case, the probability that y[,2] stays observed, so the probability of missingness is 1 - p2.marright. You can perhaps see this more easily by lining up all of the values in a data frame:
df <- data.frame(y1 = y[,1], y2_ori = y[,2], y2_mis = yobs[,2], p2.marright = p2.marright, r2.marright)
head(df)
y1 y2_ori y2_mis p2.marright r2.marright
1 2.086475 3.432803 3.432803 0.9485110 1
2 3.784675 5.005584 5.005584 0.7712399 1
3 4.818409 5.356688 NA 0.5452733 0
4 2.937422 3.898014 3.898014 0.8872124 1
5 6.422158 5.032659 5.032659 0.1943236 1
6 4.115106 5.083162 5.083162 0.7078354 1
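As a quick check (my addition), the overall missingness rate implied by this setup is indeed about 50%:
mean(is.na(yobs[, 2]))   # ~0.5, overall missingness on y2
mean(r2.marright == 0)   # the same rate, via the response indicator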
You can see in the table that whether or not an observation will be NA on y2 is encoded in r2.marright, which is a probabilistic binary version of p2.marright: for higher values of p2.marright, r2.marright is more likely to be 1. To change the overall rate of missingness, you can change the calculation of p2.marright to bias it higher or lower.
You can manipulate p2.marright by changing the constant in the logistic transformation (-5 in the example). If you increase it (make it less negative, e.g. -4), p2.marright will decrease, resulting in more missing values on y2. If you decrease it (make it more negative, e.g. -6), you'll end up with fewer missing values on y2. (The reason -5 results in about 50% missingness is that 5 is the mean of the variable being transformed, y1.) This works, but the mechanism is rather opaque, and it can be difficult to control precisely. For example, it's not obvious what the constant should be if you want 20% missingness on y2; one option is to solve for it numerically, as sketched below.
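Here is a minimal sketch (my own addition, reusing y, n and logistic() from the simulation above) that picks the logistic constant numerically with uniroot() to hit a target missingness rate:
# Expected missingness for a given constant:
# P(missing) = 1 - p2.marright = logistic(const + y[, 1])
miss_rate <- function(const) mean(logistic(const + y[, 1]))
# Find the constant that gives ~20% expected missingness on y[, 2]
const20 <- uniroot(function(const) miss_rate(const) - 0.20,
                   interval = c(-20, 10))$root
p2 <- 1 - logistic(const20 + y[, 1])
r2 <- rbinom(n, 1, p2)
mean(r2 == 0)  # should be close to 0.20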

How to force rpart to do exactly 1 Split

Having a problem similar to this, I am trying to force rpart to do exactly one split. Here is a toy example that reproduces my problem:
require(rpart)
y  <- factor(c(1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
x1 <- c(12,18,15,10,10,10,20,6,7,34,7,11,10,22,4,19,10,8,13,6,7,47,6,15,7,7,21,7,8,10,15)
x2 <- c(318,356,341,189,308,236,290,635,550,287,261,472,282,262,1153,435,402,182,415,544,251,281,378,498,142,566,152,560,284,213,326)
data <- data.frame(y = y, x1 = x1, x2 = x2)
tree <- rpart(y ~ .,
              data = data,
              control = rpart.control(maxdepth = 1,   # at most 1 split
                                      cp = 0,         # any positive improvement will do
                                      minsplit = 1,
                                      minbucket = 1,  # even leaves with 1 point are accepted
                                      xval = 0))      # I don't need cross-validation
length(tree$frame$var)  # == 1, so there are no splits
Isolating a single point should be possible (minbucket = 1), and even the most marginal improvement (isolating one point always decreases the misclassification rate) should lead to the split being kept (cp = 0).
Why does the result not include any splits? And how do I have to alter the code to always get exactly one split? Could it be that splits are not kept if both child nodes predict the same factor level?
Change cp = 0 to cp = -1.
Apparently the cp of the first split (found when fitting with maxdepth = 3) is 0.0000000, so going negative allows it to show up with maxdepth = 1.
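For example (my own check of the suggestion above), refitting with a negative cp should now keep exactly one split:
# Same call as before, but with cp = -1 so a zero-improvement split is kept
tree1 <- rpart(y ~ .,
               data = data,
               control = rpart.control(maxdepth = 1, cp = -1,
                                       minsplit = 1, minbucket = 1, xval = 0))
nrow(tree1$frame)  # expect 3: the root plus two leaves, i.e. exactly one split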

Error with the sem function in R: differences in factors

I wanted to use the function sem (from the lavaan package) on my data in R:
Model1 <- 'Transfer ~ Amotivation + Gender + Age
           Amotivation ~ Gender + Age'
Transfer: 4 questions with a 5-point Likert scale
Amotivation: 4 questions with a 5-point Likert scale
Gender: 0 (= male) and 1 (= female)
Age: just the different ages
And I got the following warning:
in getDataFull(data = data, group = group, group.label = group.label, :
lavaan WARNING: some observed variances are (at least) a factor 100 times larger than others; please rescale
Is anybody familiar with this warning? Does it influence my results? Do I have to change anything? I really don't know what it means.
Your scales are not equivalent. Your gender variable is constrained to be either 0 or 1. Amotivation is constrained to be between 1 and 5, but age is far less constrained. I created some sample data for gender, age, and amotivation. You can see that the variance of the age variable is over 4,000 times larger than the variance of gender, and about 500 times larger than the variance of the sample amotivation data.
gender <- c(0,1,1,1,0,0,1,1,0,1,1,0,0,1,1,1)
age <- c(18,42,87,12,24,26,98,84,23,12,95,44,54,23,10,16)
set.seed(42)
amotivation <- rnorm(16, 3, 1.5)
var(gender) # 0.25 variance
var(age) # 1017.27 variance
var(amotivation) # 2.21 variance
I'm not sure how the unequal variances influence your results, or if you need to do anything at all. To make your age variable more closely match the amotivation scale, you could transform the data so that it's also roughly on a 5-point scale.
newage <- age/max(age)*5
var(newage) # 2.65 variance
You could try running the analysis both ways (using your original data and the transformed data) and see if there are differences.
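As a rough sketch of what that comparison might look like with lavaan (the data-frame name mydata and the use of Transfer and Amotivation as ready-made columns are assumptions for illustration, not the asker's actual objects):
# Sketch only: rescale Age so its variance is comparable to the Likert items,
# then refit the same model
library(lavaan)
mydata$Age_rescaled <- mydata$Age / max(mydata$Age) * 5
Model1_rescaled <- 'Transfer ~ Amotivation + Gender + Age_rescaled
                    Amotivation ~ Gender + Age_rescaled'
fit_rescaled <- sem(Model1_rescaled, data = mydata)
summary(fit_rescaled)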
