How to set specific contrasts in multinom() in the nnet package? - R

I have a 3-class classification problem and want to use multinomial logistic regression from the nnet package.
The Class outcome is a factor with 3 levels: P, Q, R. I want to treat Q as the base level.
So I tried to set the contrasts like this:
P <- c(1,0,0)
R <- c(0,0,1)
contrasts(trainingLR$Class) <- cbind(P,R)
I checked it:
> contrasts(trainingLR$Class)
P R
P 1 0
Q 0 0
R 0 1
Now multinom():
library(nnet)
multinom(Class ~., data=trainingLR)
Output:
> multinom(Class ~., data=trainingLR)
# weights: 39 (24 variable)
initial value 180.172415
iter 10 value 34.990665
iter 20 value 11.765136
iter 30 value 0.162491
iter 40 value 0.000192
iter 40 value 0.000096
iter 40 value 0.000096
final value 0.000096
converged
Call:
multinom(formula = Class ~ ., data = trainingLR)
Coefficients:
(Intercept) IL8 IL17A IL23A IL23R
Q -116.2881 -16.562423 -34.80174 3.370051 6.422109
R 203.2414 6.918666 -34.40271 -10.233787 31.446915
EBI3 IL6ST IL12A IL12RB2 IL12B
Q -8.316808 12.75168 -7.880954 5.686425 -9.665776
R 5.135609 -20.48971 -2.093231 37.423452 14.669226
IL12RB1 IL27RA
Q -6.921755 -1.307048
R 15.552842 -7.063026
Residual Deviance: 0.0001922658
AIC: 48.00019
Question:
As you can see, the P class didn't appear in the output, which means it was treated as the base level (being first in alphabetical order, as expected for factor variables in R), and the Q class was not treated as the base level. How do I make Q the base level relative to the other two?

I tried to avoid using contrasts and discovered the relevel function for choosing a desired level as the baseline.
The following code
trainingLR$Class <- relevel(trainingLR$Class, ref = "P")
should set the "P" level as your baseline.
Therefore, try the same thing with the "Q" or "R" levels.
The R documentation (?relevel) mentions: "This is useful for contr.treatment contrasts which take the first level as the reference."
It might be too late to answer now, but since others might be interested, I thought it worthwhile to share the above option.
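For this question specifically, a minimal sketch (assuming trainingLR$Class is a factor with the levels P, Q, R shown above):
library(nnet)
# Make "Q" the reference (base) level, then refit the model:
trainingLR$Class <- relevel(trainingLR$Class, ref = "Q")
fit <- multinom(Class ~ ., data = trainingLR)
summary(fit)  # coefficients are now reported for P and R relative to Q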

Related

Minimal depth interaction from randomForestExplainer package

When using the minimal depth interaction feature of the randomForestExplainer package in R, I'm getting some hard-to-interpret results.
I simulated some data (x1, x2, ..., x5) where x1 is binary and x2-x5 are continuous. In my model, there are no interactions.
I'm using the randomForest package to create a random forest and then running it through the randomForestExplainer package.
Here's the code I'm using to simulate the data and random forest:
library(randomForest)
library(randomForestExplainer)
n <- 100
p <- 4
# Create data:
xrandom <- matrix(rnorm(n*p) + 5, nrow = n)
colnames(xrandom) <- paste0("x", 2:5)
d <- data.frame(xrandom)
d$x1 <- factor(sample(1:2, n, replace = TRUE))
# Equation:
y <- d$x2 + rnorm(n)/5
y[d$x1 == 1] <- y[d$x1 == 1] + 5
d$y <- y
# Random forest:
fr <- randomForest(y ~ ., data = d, localImp = TRUE)
# Random forest explainer:
interactions_frame <- min_depth_interactions(fr, names(d)[-6])
head(interactions_frame, 2)
This produces the following:
  variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1       x1            x1       4.670732           0       x1:x1              1.703252
2       x1            x2       2.606190         221       x2:x1              1.703252
So, my question is: if x1:x1 has 0 occurrences (which is expected), how can it also have a mean_min_depth? Surely if it has 0 occurrences, it can't possibly have a minimum depth? (Or rather, the min depth should be 0 or NA.)
What's going on here? Am I misinterpreting something?
Thanks
My understanding is that this has to do with the choice of the mean_sample argument of min_depth_interactions. The default choice replaces NAs with the depth of the maximal subtree whose root is x1. Details below.
What is the mean_sample argument for? It specifies how to deal with trees where the interaction of interest is not present. There are three options:
relevant_trees. This only considers the trees where the interaction of interest is present. In your example, this gives NA for the mean_min_depth of the interaction x1:x1, which is the behavior you were looking for.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "relevant_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 NA 0 x1:x1 1.947475
2 x1 x2 1.426606 218 x2:x1 1.947475
all_trees. There is a major problem with relevant_trees: for an interaction that only shows up in a small number of trees, taking the mean of the conditional minimum depth over just those trees ignores the fact that the interaction is rare, so a small mean conditional minimum depth doesn't mean the interaction is important. To address this, specifying mean_sample = "all_trees" replaces the missing conditional minimum depth for the interaction of interest with the mean depth of the maximal subtree of the root variable. Basically, if we are looking at the interaction x1:x2, then for a tree where this interaction is absent, it fills in the depth of the deepest subtree whose root is x1. This gives a (hopefully large) numeric value to the mean_min_depth of x1:x2, thus making it look less important.
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "all_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.97568
2 x1 x2 3.654522 218 x2:x1 1.97568
top_trees. This is the default choice for mean_sample. My understanding is that it's similar to all_trees but tries to down-weight the contribution of replacing missing values. The motivation is that all_trees pulls mean_min_depth toward the same value for everything when there are many parameters but not enough observations, i.e. shallow trees. To reduce the contribution of replacing missing values, top_trees calculates the mean conditional minimal depth on only a subset of n trees, where n is the number of trees in which ANY interaction with the specified root is present. Say in your example that, out of those 500 trees, only 300 have any interaction x1:whatever; then we only consider those 300 trees when filling in the value for x1:x1. Because there are 0 occurrences of this interaction, replacing 500 NAs vs. 300 NAs with the same value doesn't affect the mean, so it's the same value, 4.787879. (There's a slight difference between our results; I think it has to do with seed values.)
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "top_trees")
head(interactions_frame, 2)
variable root_variable mean_min_depth occurrences interaction uncond_mean_min_depth
1 x1 x1 4.787879 0 x1:x1 1.947475
2 x1 x2 2.951051 218 x2:x1 1.947475
This answer is based on my understanding of the package author's thesis: https://rawgit.com/geneticsMiNIng/BlackBoxOpener/master/randomForestExplainer_Master_thesis.pdf
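On the seed point: the forest (and hence every depth statistic) is random, so results only reproduce when the RNG seed is fixed before growing it. A minimal sketch:
set.seed(42)  # fix the RNG so the forest, and hence the depth statistics, are reproducible
fr <- randomForest(y ~ ., data = d, localImp = TRUE)
interactions_frame <- min_depth_interactions(fr, names(d)[-6], mean_sample = "top_trees")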

How to run cross validation for logistic regression in R?

I'm trying to run a cross validation (leave-one-out and k-fold) using logistic regression in R, with a binary outcome.
I have a problem with the cost function. I do not understand the cost function in the R help, and I found a more intuitive one here on Stack Overflow, but I don't know how to call it, more specifically, how to pass on the arguments.
library(ISLR)
D <- Default
mycost <- function(r, pi) {
  weight1 <- 1  # cost of getting a 1 wrong
  weight0 <- 1  # cost of getting a 0 wrong
  c1 <- (r == 1) & (pi < 0.5)   # logical vector - TRUE if actual 1 but predicted 0
  c0 <- (r == 0) & (pi >= 0.5)  # logical vector - TRUE if actual 0 but predicted 1
  return(mean(weight1*c1 + weight0*c0))
}
glm.fit1 <- glm(default ~ balance + student, data = D, family = binomial)
The problem is: if R runs several logistic regressions in the background (for example 3 for K = 3), how can I pass in the vectors of predicted probabilities (pi) and the vector of actual values (r)?
I am confused...
Is there a way to use a for loop and do it manually instead of using cv.glm?
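Assuming the cost function above is intended for boot::cv.glm: you don't pass r and pi yourself. cv.glm calls cost(observed, fitted) internally, with the held-out observed responses and the cross-validated predicted probabilities, on each fold. A minimal sketch (the response is converted to numeric 0/1 so it matches the r == 1 / r == 0 checks in mycost):
library(boot)
library(ISLR)
D <- Default
D$default01 <- ifelse(D$default == "Yes", 1, 0)  # numeric 0/1 response for mycost
fit <- glm(default01 ~ balance + student, data = D, family = binomial)
set.seed(1)
cv.err <- cv.glm(D, fit, cost = mycost, K = 3)  # cv.glm supplies r and pi to mycost
cv.err$delta  # cross-validated misclassification rate (raw and bias-corrected)
And yes, you can do it manually with a for loop instead of cv.glm:
set.seed(1)
K <- 3
folds <- sample(rep(1:K, length.out = nrow(D)))  # random fold assignment
errs <- numeric(K)
for (k in 1:K) {
  fit_k <- glm(default01 ~ balance + student, data = D[folds != k, ], family = binomial)
  pi_hat <- predict(fit_k, newdata = D[folds == k, ], type = "response")
  errs[k] <- mycost(D$default01[folds == k], pi_hat)  # cost on the held-out fold
}
mean(errs)  # k-fold estimate of the misclassification rate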

How to create manual contrasts with emmeans? - R

Suppose I have these data:
library(MASS)  # for the oats data
library(lme4)  # for lmer()
m <- lmer(Y ~ N*V + (1|B), data = oats)
How can I create a manual contrast in emmeans? For example:
Victoria_0.2cwt 1
Victoria_0.4cwt -1
Marvellous_0.2cwt -1
Marvellous_0.4cwt 1
library(emmeans)
emm <- emmeans(m, ~ V * N)
emm
contrast(emm, list(con = c(0,0,0,0,-1,1,0,0,-1,0,0,0)))
However, this is actually a linear function, not a contrast, because the coefficients do not sum to zero.
Note: I may have mis-remembered the factor levels, and if so, the coefficients may need to be rearranged. They should correspond to the combinations you see when printing emm in the second line.
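Following that note, a hedged sketch of how to check the ordering and build coefficients that do sum to zero (assuming the MASS::oats levels Golden.rain, Marvellous, Victory for V, with Victory standing in for "Victoria", N running 0.0cwt-0.6cwt, and V varying fastest in the printout):
emm  # print first: the rows show the V:N combinations in order
# Under that ordering, positions 5/6 are Marvellous/Victory at 0.2cwt and 8/9 at 0.4cwt:
con <- c(0, 0, 0, 0, -1, 1, 0, 1, -1, 0, 0, 0)
sum(con)  # 0, so this is a proper contrast
contrast(emm, list(con = con))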

How to send a confusion matrix to caret's confusionMatrix?

I'm looking at this data set: https://archive.ics.uci.edu/ml/datasets/Credit+Approval. I built a ctree:
library(party)  # for ctree()
library(caret)  # for confusionMatrix()
myFormula <- class ~ .  # class is a factor of "+" or "-"
ct <- ctree(myFormula, data = train)
And now I'd like to put that data into caret's confusionMatrix method to get all the stats associated with the confusion matrix:
testPred <- predict(ct, newdata = test)
#### This is where I'm doing something wrong ####
confusionMatrix(table(testPred, test$class),positive="+")
#### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ####
$positive
[1] "+"
$table
td
testPred - +
- 99 6
+ 20 88
$overall
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue McnemarPValue
8.779343e-01 7.562715e-01 8.262795e-01 9.186911e-01 5.586854e-01 6.426168e-24 1.078745e-02
$byClass
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1
0.9361702 0.8319328 0.8148148 0.9428571 0.8148148 0.9361702 0.8712871
Prevalence Detection Rate Detection Prevalence Balanced Accuracy
0.4413146 0.4131455 0.5070423 0.8840515
$mode
[1] "sens_spec"
$dots
list()
attr(,"class")
[1] "confusionMatrix"
So Sensitivity is A/(A+C) (from caret's confusionMatrix doc, which lays the table out with predictions as rows and the reference as columns: A = predicted +/actual +, B = predicted +/actual -, C = predicted -/actual +, D = predicted -/actual -).
If you take my confusion matrix:
$table
td
testPred - +
- 99 6
+ 20 88
You can see this doesn't add up: Sensitivity = A/(A+C) = 99/(99+20) = 99/119 = 0.8319328, yet in my confusionMatrix results that value is reported as the Specificity. Meanwhile Specificity = D/(B+D) = 88/(88+6) = 88/94 = 0.9361702, the value reported as the Sensitivity.
I've tried this confusionMatrix(td,testPred, positive="+") but got even weirder results. What am I doing wrong?
UPDATE: I also realized that my confusion matrix is different than what caret thought it was:
Mine:                 Caret:
        td                    testPred
testPred  -   +       td   -   +
       - 99   6          - 99  20
       + 20  88          +  6  88
As you can see, it thinks my False Positive and False Negative are backwards.
UPDATE: I found it's a lot better to send the data, rather than a table, as a parameter. From the confusionMatrix docs:
reference
a factor of classes to be used as the true results
I took this to mean the symbol that constitutes a positive outcome (in my case, that would have been +). However, 'reference' refers to the actual outcomes from the data set, i.e. the dependent variable.
So I should have used confusionMatrix(testPred, test$class). If your data is out of order for some reason, it will shift it into the correct order (so the positive and negative outcomes/predictions align correctly in the confusion matrix).
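Putting that together, a minimal sketch of the corrected call (assuming train/test are already split and class has the levels "-" and "+"):
library(party)
library(caret)
ct <- ctree(class ~ ., data = train)
testPred <- predict(ct, newdata = test)
# First argument = predictions, second = true classes; naming them makes the order explicit:
confusionMatrix(data = testPred, reference = test$class, positive = "+")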
However, if you are worried about the outcome being the correct factor, install the plyr library and use revalue to change the factor levels:
install.packages("plyr")
library(plyr)
newDF <- df  # df is your original data frame
newDF$class <- revalue(newDF$class, c("+"=1, "-"=0))
# You'd have to rerun your model using newDF
I'm not sure why this worked, but I just removed the positive parameter:
confusionMatrix(table(testPred, test$class))
My Confusion Matrix:
td
testPred - +
- 99 6
+ 20 88
Caret's Confusion Matrix:
td
testPred - +
- 99 6
+ 20 88
Although now it says $positive: "-" so I'm not sure if that's good or bad.

Confidence interval for regression error in R

Can someone please explain what I am doing wrong here. I want to find a confidence interval for an average response of my variable "list1". R has an example online using the 'faithful' dataset, and it works fine. However, whenever I try to find a confidence/prediction interval, I ALWAYS get this error message. I have been at this for 5 hours and tried a million different things; nothing works.
> list1 <- c(1,2,3,4,5) #first data set
> list2 <- c(2,4,5,6,7) # second data set
> frame <- data.frame(list1,list2) # made a data.frame object
> reg <- lm(list1~list2,data=frame) # regression
> newD = data.frame(list1 = 2.3) #new data input for confidence/prediction interval estimation
> predict(reg,newdata=newD,interval="confidence")
fit lwr upr
1 0.7297297 -0.08625234 1.545712
2 2.3513514 1.88024388 2.822459
3 3.1621622 2.73210185 3.592222
4 3.9729730 3.45214407 4.493802
5 4.7837838 4.09033237 5.477235
Warning message:
'newdata' had 1 row but variables found have 5 rows #Why does this keep happening??
The problem is that your newdata names the dependent variable (list1) rather than the predictor. The formula syntax in the regression is y ~ x, here list1 ~ list2, and when you use the predict() function you pass new values of the independent (x) variable, so newdata needs a list2 column. See the Details section of ?predict for more.
This however seems to work:
newD2 = data.frame(list2 = 2.3) #note the name is list2 and not list1
predict(reg, newdata = newD2, interval = "confidence")
---
fit lwr upr
1 0.972973 0.2194464 1.7265
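The same pattern gives a prediction interval for a single new observation, if that's what you're after:
predict(reg, newdata = newD2, interval = "prediction")  # wider than the confidence interval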
