How to make a mixed anova and test its residuals for normal distribution? (R) - r

I want to make a mixed anova with the within-subject-factors mzp and cond besides the between-subject-factors cond_order and video_order.
I have 3 timepoints of a repeated measurement, indicated by mzp.
anova.h1 <- aov_car(ee ~ cond_order + video_order + Error(code/mzp*cond), data=dat_long)
Three things I can't find a solution for:
How to separate between within-factors in the error term? A lot of codes I found used *, but I fear it might only be for specific cases? Are there other separating-operators?
mzp has actually 3 levels (i.e. times of measurement) for measuring the dependent variable, cond has only 2 (because there were no baseline measured). So I made for that variable a 3rd timepoint up, by setting values to NA at baseline for cond. But it seems to cause issues now:
Error: Empty cells in within-subjects design (i.e., bad data structure).
table(data[c("mzp", "cond")])
# cond
# mzp s1st t1st s2nd t2nd
# X0 0 0 0 0
# X1 44 43 0 0
# X2 0 0 43 44
I need to examine relations between all 3 times of measuring the dependent variable and its interactions with the independet variables cond, cond_order and video_order. So is there a way of ignoring the NAs in cond, but include every 3 timepoints of the dependent variable for examining the progress of the dependent variable?
Above all, I need this anova to examine the residuals, to test for normality. I tried functions I know and googled (for a model without the cond-variable), but they won't work for this model/this function. I have to examine it graphically. So what works for this anova function?
hist(rstandard(anova.h1))
plot(anova.h1,2)
anova.h1.pr <- proj(anova.h1)
# Error in proj.default(anova.h1) : argument does not contain 'qr' component
res <- anova.h1.pr[["Within"]][ , "Residuals"].
qqnorm(res)

Related

inputs of 2 separate predict() return the same set of fitted values

Confession: I attempted to ask this question yesterday, but used a sample, congruent dataset which resembles the my "real" data in hopes this would be more convenient for readers here. One issue was resolved, but another remains that appears immutable.
My objective is creating a linear model of two predicted vectors: "yC.hat", and "yT.hat" which are meant to project average effects for unique observed values of pri2000v as a function of the average poverty level "I(avgpoverty^2) under control (treatment = 0) and treatment (treatment = 1) conditions.
While I appear to have no issues running the regression itself, the inputs of my data argument have no effect on predict(), and only the object itself affects the output. As a result, treatment = 0 and treatment = 1 in the data argument result in the same fitted values. In fact, I can plug in ANY value into the data argument and it makes do difference. So I suspect my failure to understand issue starts here.
Here is my code:
q6rega <- lm(pri2000v ~ treatment + I(log(pobtot1994)) + I(avgpoverty^2)
#interactions
+ treatment:avgpoverty + treatment:I(avgpoverty^2), data = pga)
## predicted PRI support under the Treatment condition
q6.yT.hat <- predict(q6rega,
data = data.frame(I(avgpoverty^2) = 9:25, treatment = 1))
## predicted PRI support rate under the Control condition
q6.yC.hat <- predict(q6rega,
data = data.frame(I(avgpoverty^2) = 9:25, treatment = 0))
q6.yC.hat == q6.yT.hat
TRUE[417]
dput(pga has been posted on my github, if needed
EDIT: There were a few things wrong with my code above, but not specifying pobtot1994 somehow resulted in R treating it as newdata being omitted. Since I'm fairly new to statistics, I confused fitted values with the prediction output that I was actually trying to achieve. I would have expected that an unexpected input is to produce an error instead.
I'm surprised you are able to run a prediction when it is lacking the required variable (pobtot1994) for your model in the new data frame for prediction.
Anyway, you would need to create a new data frame with the three variables in untransformed form used in the model. Since you are interested to compare the fitted values of avgpoverty 3 to 5 for treatment 1 and 0, you need to force the third variable pobtot1994 as a constant. I use the mean of pobtot9994 here for simplicity.
newdat <- expand.grid(avgpoverty=3:5, treatment=factor(c(0,1)), pobtot1994=mean(pga$pobtot1994))
avgpoverty treatment pobtot1994
1 3 0 2037.384
2 4 0 2037.384
3 5 0 2037.384
4 3 1 2037.384
5 4 1 2037.384
6 5 1 2037.384
The prediction will show you the different values for the two conditions.
newdat$fitted <- predict(q6rega, newdata=newdat)
avgpoverty treatment pobtot1994 fitted
1 3 0 2037.384 38.86817
2 4 0 2037.384 50.77476
3 5 0 2037.384 55.67832
4 3 1 2037.384 51.55077
5 4 1 2037.384 49.03148
6 5 1 2037.384 59.73910

Minimal depth interaction from randomForestExplainer package

So when using the minimal depth interaction feature of the randomForestExplainer package, in R, I'm getting some hard to interpret results.
I simulated some data (x1, x2,..., x5) where x1 is binary and x2-x5 are continuous. In my model, there are no interactions.
Im using the randomForest package to create a random forest and then running it through the randomForestExplainer package.
Here's the code I'm using to simulate the data and random forest:
library(randomForest)
library(randomForestExplainer)
n <- 100
p <- 4
# Create data:
xrandom <- matrix(rnorm(n*p)+5, nrow=n)
colnames(xrandom)<- paste0("x",2:5)
d <- data.frame(xrandom)
d$x1 <- factor(sample(1:2, n, replace=T))
# Equation:
y <- d$x2 + rnorm(n)/5
y[d$x1==1] <- y[d$x1==1]+5
d$y <- y
# Random Forest:
fr <- randomForest(y ~ ., data=d,localImp=T)
# Random Forest Explainer:
interactions_frame <- min_depth_interactions(fr, names(d)[-6])
head(interactions_frame, 2)
This produces the following:
variable root_variable mean_min_depth occurrences interaction
1 x1 x1 4.670732 0 x1:x1
2 x1 x2 2.606190 221 x2:x1
uncond_mean_min_depth
1 1.703252
2 1.703252
So, my question is, if x1:x1 has 0 occurrence ( which is expected) then how can it also have a mean_min_depth?
Surely if it has 0 occurrences, then it can't possibly have a minimum depth? [or rather, the min depth = 0 or NA]
What's going on here? Am I misinterpreting something?
Thanks
My understanding is this has to do with the choice of the mean_sample argument of min_depth_interactions. The default choice replaces NAs with the depth of maximum subtree whose root is x1. Details below.
What is this argument mean_sample for? It specifies how to deal with trees where the interaction of interest is not present. There are three options:
relevant_trees. This only considers the trees where the interaction of interest is present. In your example, this gives NA for mean_min_depth of interaction x1:x1, which is the behavior you were looking for.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "relevant_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 NA 0 x1:x1 1.947475
2 x1 x2 1.426606 218 x2:x1 1.947475
all_trees. There is a major problem with relevant_trees, that is for an interaction only showing up in a small number of trees, taking the mean of conditional minimum depth ignores the fact that this interaction is not that important. In this case, a small mean conditional minimum depth doesn't mean an interaction is important. To address this, specifying mean_sample = "all_trees" replaces the conditional minimum depth for the interaction of interest by the mean depth of maximal subtree of the root variable. Basically, if we are looking at the interaction of x1:x2, it says for a tree where this interaction is absent, give it a value of the deepest tree whose root is x1. This gives a (hopefully large) numeric value to mean_min_depth of interaction x1:x2 thus making it less important.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "all_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.97568
2 x1 x2 3.654522 218 x2:x1 1.97568
top_trees. Now this is the default choice for mean_sample. My understanding is it's similar to all_trees, but tries to down-weight the contribution of replacing missing values. The motivation, is all_trees pulls mean_min_depth close to the same value when there are many parameters but not enough observations, i.e. shallow trees. To reduce the contribution of replacing missing values, top_trees only calculates the mean conditional minimal depth on a subset of n trees, where n is the number of trees where ANY interactions with specified roots are present. Let's say in your example, out of those 500 trees only 300 have any interaction x1:whatever, then we only consider those 300 trees when filling in value for x1:x1. Because there are 0 occurrence of this interaction, replacing 500 NAs vs replacing 300 NAs with the same value doesn't affect the mean, so it's the same value 4.787879. (There's a slight difference between our results, I think it has to do with seed values).
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "top_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.947475
2 x1 x2 2.951051 218 x2:x1 1.947475
This answer is based on my understanding of the package author's thesis: https://rawgit.com/geneticsMiNIng/BlackBoxOpener/master/randomForestExplainer_Master_thesis.pdf

GAM model error

My data frame looks like:
head(bush_status)
distance status count
0 endemic 844
1 exotic 8
5 native 3
10 endemic 5
15 endemic 4
20 endemic 3
The count data is non-normally distributed. I'm trying to fit a generalized additive model to my data in two ways so i can use anova to see if the p-value supports m2.
m1 <- gam(count ~ s(distance) + status, data=bush_status, family="nb")
m2 <- gam(count ~ s(distance, by=status) + status, data=bush_status, family="nb")
m1 works fine, but m2 sends the error message:
"Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons,
scale.penalty = scale.penalty, :
Can't find by variable"
This is pretty beyond me so if anyone could offer any advice that would be much appreciated!
From your comments it became clear that you passed a character variable to by in the smoother. You must pass a factor variable there. This has been a frequent gotcha for me too and I consider it a design flaw (because base R regression functions deal with character variables just fine).

R multiclass/multinomial classification ROC using multiclass.roc (Package ‘pROC’)

I am having difficulties understanding how the multiclass.roc parameters should look like.
Here a snapshot of my data:
> head(testing.logist$cut.rank)
[1] 3 3 3 3 1 3
Levels: 1 2 3
> head(mnm.predict.test.probs)
1 2 3
9 1.013755e-04 3.713862e-02 0.96276001
10 1.904435e-11 3.153587e-02 0.96846413
12 6.445101e-23 1.119782e-11 1.00000000
13 1.238355e-04 2.882145e-02 0.97105472
22 9.027254e-01 7.259787e-07 0.09727389
26 1.365667e-01 4.034372e-01 0.45999610
>
I tried calling multiclass.roc with:
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs,
formula=response~predictor
)
but naturally I get an error:
Error in roc.default(response, predictor, levels = X, percent = percent, :
Predictor must be numeric or ordered.
When it's a binary classification problem I know that 'predictor' should contain probabilities (one per observation). However, in my case, I have 3 classes, so my predictor is a list of rows that each have 3 columns (or a sublist of 3 values) correspond to the probability for each class.
Does anyone know how should my 'predictor' should look like rather than what it's currently look like ?
The pROC package is not really designed to handle this case where you get multiple predictions (as probabilities for each class). Typically you would assess your P(class = 1)
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs[,1])
And then do it again with P(class = 2) and P(class = 3). Or better, determine the most likely class:
predicted.class <- apply(mnm.predict.test.probs, 1, which.max)
multiclass.roc(
response=testing.logist$cut.rank,
predictor=predicted.class)
Consider multiclass.roc as a toy that can sometimes be helpful but most likely won't really fit your needs.

R decision tree using all the variables

I would like to perform a decision tree analysis. I want that the decision tree uses all the variables in the model.
I also need to plot the decision tree. How can I do that in R?
This is a sample of my dataset
> head(d)
TargetGroup2000 TargetGroup2012 SmokingGroup_Kai PA_Score wheeze3 asthma3 tres3
1 2 2 4 2 0 0 0
2 2 2 4 3 1 0 0
3 2 2 5 1 0 0 0
4 2 2 4 2 1 0 0
5 2 3 3 1 0 0 0
6 2 3 3 2 0 0 0
>
I would like to use the formula
myFormula <- wheeze3 ~ TargetGroup2000 + TargetGroup2012 + SmokingGroup_Kai + PA_Score
Note that all the variables are categorical.
EDIT:
My problem is that some variables do not appear in the final decision tree.
The deap of the tree should be defined by a penalty parameter alpha. I do not know how to set this penalty in order that all the variables appear in my model.
In other words I would like a model that minimize the training error.
As mentioned above, if you want to run the tree on all the variables you should write it as
ctree(wheeze3 ~ ., d)
The penalty you mentioned is located at the ctree_control(). You can set the P-value there and the minimum split and bucket size. So in order to maximize the chance that all the variables will be included you should do something like that:
ctree(wheeze3 ~ ., d, controls = ctree_control(mincriterion = 0.85, minsplit = 0, minbucket = 0))
The problem is that you'll get into risk of overfitting.
The last thing you need to understand is, that the reason that you may not see all the variables in the output of the tree is because they don't have a significant influence on the dependend variable. Unlike linear or logistic regression, that will show all the variables and give you the P-value in order to determine if they are significant or not, the decision tree does not return the unsiginifcant variables, i.e, it doesn't split by them.
For better understanding of how ctree works, please take a look here: https://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees
The easiest way is to use the rpart package that is part of the core R.
library(rpart)
model <- rpart( wheeze3 ~ ., data=d )
summary(model)
plot(model)
text(model)
The . in the formula argument means use all the other variables as independent variables.
plot(ctree(myFormula~., data=sta))

Resources