How do you obtain a confusion matrix for a glmnet multinomial logistic regression in R?

I fit a multinomial logistic regression model and I'd like to compute the confusion matrix so I can report the accuracy:
library("glmnet")
x=data.matrix(train[-1])
y= data.matrix(train[1])
x_test=data.matrix(test[-1])
y_test=unlist(test[1])
fit.glm=glmnet(x,y,family="multinomial",alpha = 1, type.multinomial = "grouped")
cvfit=cv.glmnet(x, y, family="multinomial", type.multinomial = "grouped", parallel = TRUE)
y_predict=unlist(predict(cvfit, newx = x_test, s = "lambda.min", type = "class"))
Then, to calculate the confusion matrix, I use the caret library:
library("lattice")
library("ggplot2")
library("caret")
confusionMatrix(data=y_predict,reference=y_test)
I am getting this error and I do not know how to solve it:
Error in confusionMatrix.default(data = y_predict, reference = y_test) :
  The data must contain some levels that overlap the reference.
Here is the str() of y_predict and y_test; it might be helpful:
str(y_predict)
chr [1:301, 1] "6" "2" "7" "9" "3" "2" "3" "6" "6" "8" "6" "5" "6" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr "1"
str(y_test)
Factor w/ 10 levels "accessory","activation",..: 6 8 2 9 3 2 3 5 10 8 ...
- attr(*, "names")= chr [1:301] "category1" "category2" "category3" "category4" ...
I use unlist to avoid this other error: Error: x must be atomic for 'sort.list'.

It would make sense to keep track of your labels and use them to convert the glmnet results back to labels before building the confusion matrix. I use the iris dataset, which has 3 labels:
idx = sample(nrow(iris),100)
train = iris[idx,]
test = iris[-idx,]
We convert the response into a numeric:
x = data.matrix(train[,-5])
y = as.numeric(train[,5]) - 1
x_test = data.matrix(test[,-5])
y_test = as.numeric(test[,5]) - 1
The fit is a bit different here; we get back the probabilities:
cvfit=cv.glmnet(x, y, family="multinomial")
y_predict=predict(cvfit, newx = x_test, s = "lambda.min", type = "response")
In this example the response is the column Species; in yours it will be test[,1]:
ref_labels = test$Species
pred_labels = levels(test$Species)[max.col(y_predict[,,1])]
caret::confusionMatrix(table(pred_labels,ref_labels))
Confusion Matrix and Statistics
ref_labels
pred_labels setosa versicolor virginica
setosa 20 0 0
versicolor 0 12 0
virginica 0 0 18
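Applying the same idea to your data: assuming train[1] and test[1] were built from the same underlying factor, data.matrix() turned the classes into the integer codes 1-10, so the predicted strings "1".."10" index the levels of y_test. A minimal sketch under that assumption:
# hedged: map the integer-coded predictions back onto the factor levels of y_test
pred_labels <- factor(levels(y_test)[as.integer(y_predict)],
                      levels = levels(y_test))
caret::confusionMatrix(data = pred_labels, reference = y_test)
Note also that glmnet accepts a factor response for family = "multinomial", in which case type = "class" predictions come back as the original labels and no conversion is needed.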

Related

How to plot the Hazard Ratio + CI over time of survival data in ggplot in R?

Background
I want to plot the hazard ratio over time, including its confidence intervals, of a survival dataset. As an example, I will take a simplified dataset from the survival package: the colon dataset.
library(survival)
library(tidyverse)
# Colon survival dataset
data <- colon %>%
  filter(etype == 2) %>%
  select(c(id, rx, status, time)) %>%
  filter(rx == "Obs" | rx == "Lev+5FU") %>%
  mutate(rx = factor(rx))
The dataset contains patients that received a treatment (i.e., "Lev+5FU") and patients that did not (i.e., "Obs"). The survival curves are as follows:
fit <- survfit(Surv(time, status) ~ rx, data = data )
plot(fit)
Attempt
Using the cox.zph function, you can plot the hazard ratio of a cox model.
cox <- coxph(Surv(time, status) ~ rx, data = data)
plot(cox.zph(cox))
However, I want to plot the hazard ratio including 95% CI for this survival dataset using ggplot.
Question(s)
How do you extract the hazard ratio data and the 95% CIs from this cox.zph object to plot them in ggplot?
Are there other R packages that enable doing the same in a more convenient way?
Note: it's important to take into account Dion Groothof's correction. The lines and CIs are not really hazard ratios; they are estimates and bounds around time-varying log hazard ratios. You would need to exponentiate to get HRs.
The values are in the result returned from cox.zph:
str(cox.zph(cox))
#----------------------
List of 7
$ table : num [1:2, 1:3] 1.188 1.188 1 1 0.276 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "rx" "GLOBAL"
.. ..$ : chr [1:3] "chisq" "df" "p"
$ x : num [1:291] 0 0.00162 0.00323 0.00485 0.00646 ...
$ time : num [1:291] 23 34 45 52 79 113 125 127 138 141 ...
$ y : num [1:291, 1] 2.09 2.1 2.1 2.1 2.11 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:291] "23" "34" "45" "52" ...
.. ..$ : chr "rx"
$ var : num [1, 1] 4.11
$ transform: chr "km"
$ call : language cox.zph(fit = cox)
- attr(*, "class")= chr "cox.zph"
To get a plot in any of the paradigms (base, lattice, or ggplot2), use time as the x axis, draw x as the solid line, and plot y as the points:
z <- cox.zph(cox)
ggdf <- data.frame(unclass(z)[c("time", "x", "y")])
ggplot(data = ggdf, aes(x = time, y = -x)) +
  geom_line() +
  ylim(range(z$y)) +
  geom_point(aes(x = time, y = z$y))
To get the CI, look at the source with getAnywhere(plot.cox.zph); the relevant excerpts are:
xx <- x$x
yy <- x$y
df <- max(df)
nvar <- ncol(yy)
pred.x <- seq(from = min(xx), to = max(xx), length = nsmo)
# ------------ (lines elided) ------------
if (se) {
    bk <- backsolve(qmat$qr[1:df, 1:df], diag(df))
    xtx <- bk %*% t(bk)
    seval <- ((pmat %*% xtx) * pmat) %*% rep(1, df)
    temp <- 2 * sqrt(x$var[i, i] * seval)
    yup <- yhat + temp
    ylow <- yhat - temp
    yr <- range(yr, yup, ylow)
# --------------- (lines elided) ---------------
if (se) {
    lines(pred.x, exp(yup), col = col[2], lty = lty[2], lwd = lwd[2])
    lines(pred.x, exp(ylow), col = col[2], lty = lty[2], lwd = lwd[2])
}
The survminer package will do this for you:
library(survminer)
ggcoxzph(cox.zph(cox))
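If you prefer to stay in plain ggplot2 with a band on the hazard-ratio scale, here is a minimal sketch. It is an approximation, not the package's exact CI: loess stands in for the natural-spline smoother plot.cox.zph uses internally, and the 1.96 * SE band is a normal approximation on the log-HR scale, exponentiated afterwards per the note above:
library(survival)
library(ggplot2)

z  <- cox.zph(cox)                       # cox model fit above
df <- data.frame(time = z$time, loghr = as.numeric(z$y))

lo <- loess(loghr ~ time, data = df)     # stand-in smoother (assumption)
pr <- predict(lo, se = TRUE)
df$fit <- pr$fit
df$lwr <- pr$fit - 1.96 * pr$se.fit      # approximate 95% band on log-HR scale
df$upr <- pr$fit + 1.96 * pr$se.fit

ggplot(df, aes(time, exp(fit))) +        # exponentiate to the HR scale
  geom_ribbon(aes(ymin = exp(lwr), ymax = exp(upr)), alpha = 0.2) +
  geom_line() +
  labs(x = "Time", y = "Hazard ratio")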

Unable to get Residuals for AOV in R

I have a dataframe in R.
This is part of the head of the DF:
Sujet T E O P meanTR
1 1 0 0 0 0.97
1 1 0 0 0 1.44
2 0 1 0 1 0.94
Sujet: from 1 to 12
T, E, O, P: 1 or 0
meanTR: numeric
I want to get the anova table, so I tried this:
model_all <- aov(meanTR ~ E*O*P+ Error(Sujet/E*O*P), data = df)
After that, I want to extract the residuals of my model to plot them, so I tried this:
res <- residuals(model_all)  # returns NULL
So I found people online suggesting this solution:
model_all.pr <- proj(model_all)
res <- model_all.pr[[3]][, "Residuals"]
But this returns "subscript out of bounds", so I also tried:
res <- model_all.pr[[3]]["Residuals"]
But this returns NA.
I don't know what I'm doing wrong. I'm really confused
Any help would be appreciated.
The main goal is to be able to run this:
plot(res)
qqnorm(res)
With aov(), you'll get a top-level $residuals attribute for some fits but not others.
For example, with a simple model like the following, you can access the residuals directly (use str() to see the structure of an object, including which attributes can be accessed):
fit1 <- aov(Sepal.Length ~ Sepal.Width, data=iris)
str(fit1$residuals)
## Named num [1:150] -0.644 -0.956 -1.111 -1.234 -0.722 ...
## - attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
But in the more complex model specification you're using (i.e. with an explicit/custom error term), there are separate residual values in each of the top-level attributes:
fit2 <- aov(Sepal.Length ~ Sepal.Width + Error(Species/Sepal.Length), data=iris)
fit2$residuals # NULL
names(fit2)
## [1] "(Intercept)" "Species" "Sepal.Length:Species" "Within"
fit2$Species$residuals
## 2 3
## -1.136219 5.179749
str(fit2$Within$residuals)
## Named num [1:144] -1.83e-15 -2.49e-15 -1.90e-15 -2.55e-15 -2.89e-15 ...
## - attr(*, "names")= chr [1:144] "7" "8" "9" "10" ...
## ...
## ...
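So for the original goal of plotting the residuals, one option is to pull them from the stratum you care about. For this iris example (where the Within residuals happen to be essentially zero because of the artificial Error() term), that would be:
# plot the within-stratum residuals, the usual target for diagnostics
res <- fit2$Within$residuals
plot(res)
qqnorm(res)
qqline(res)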
I haven't thought about the underlying statistics enough to say why this is, but I'm sure there is a good reason.
Hope that helps!

getting sensitivity after fitting decision tree with caret

I am trying to obtain the sensitivity, specificity, etc. after fitting a decision tree using caret (or even caret::confusionMatrix). I am using code along these lines:
fit <- rpart(
  Bla ~ ...,
  method = "class",
  data = OrginalData
)
preds <- predict(fit, SomeData)
caret::sensitivity(factor(preds[,2]), factor(OrginalData$Bla))
str(preds)
Unfortunately, I get NA as the sensitivity. str() returns:
num [1:40, 1:2] 0.926 0.926 0.926 0.926 0.926 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:40] "1" "2" "3" "4" ...
..$ : chr [1:2] "NO" "YESR"
Any ideas?
The problem is that you are providing a numeric matrix of probabilities as predictions, while caret::sensitivity expects a vector of classes. Here is working code using the Sonar data from mlbench:
library(mlbench)
library(rpart)
library(caret)
data(Sonar)
Split the data:
inds <- createDataPartition(Sonar$Class, p = 0.7, list = FALSE)
train <- Sonar[inds,]
test <- Sonar[-inds,]
Create the model:
fit <- rpart(Class ~ .,
             method = "class",
             data = train)
pred <- predict(fit, test)
Convert the probabilities to class predictions:
caret::sensitivity(as.factor(ifelse(pred[,2] > 0.5, "R", "M")), test$Class)
#output
[1] 0.6969697
Do note that a threshold of 0.5 might not be appropriate; it should be tuned according to the desired model behavior. If there is no preference towards a certain type of error, balanced accuracy is a solid metric for tuning the probability threshold.
Alternatively, specify type = "class" when predicting with rpart:
pred <- predict(fit, test, type = "class")
caret::sensitivity(pred, test$Class)
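Since the question also asks for specificity and other statistics, note that caret::confusionMatrix reports them all in one call (including the balanced accuracy mentioned above):
# sensitivity, specificity, balanced accuracy, etc. in one call;
# by default the first factor level ("M" for Sonar) is the positive class
caret::confusionMatrix(pred, test$Class)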

R - Extract ns spline object from lmer model and predict on new data

I'm looking to predict 'terms', especially ns splines, from an lmer model. I've replicated the problem with the mtcars dataset (technically a poor example, but it works to get the point across).
Here is what I'm trying to do with a linear model:
library(splines)
data(mtcars)
mtcarsmodel <- lm(wt ~ ns(drat, 2) + hp + as.factor(gear), data = mtcars)
summary(mtcarsmodel)
coef(mtcarsmodel)
test <- predict(mtcarsmodel, type = "terms")
Perfect. However, there is no equivalent 'terms' option for lmer predict (unresolved issue here).
mtcarsmodellmer <- lmer(wt ~ ns(drat,2) + (hp|as.factor(gear)), data= mtcars)
summary(mtcarsmodellmer)
coef(mtcarsmodellmer)
ranef(mtcarsmodellmer)
Given there is no equivalent 'predict, terms' function, I was going to extract the fixed and random coefficients above and apply them to the mtcars data, but I have no idea how to extract an ns spline object from a model and 'predict' it on new data. The same goes for a 'poly'-transformed variable, e.g. poly(drat, 2); extra kudos if you can get this as well.
It is not difficult to do it yourself.
library(lme4)
library(splines)
X <- with(mtcars, ns(drat, 2)) ## design matrix for splines (without intercept)
## head(X)
# 1 2
#[1,] 0.5778474 -0.1560021
#[2,] 0.5778474 -0.1560021
#[3,] 0.5738625 -0.1792162
#[4,] 0.2334329 -0.1440232
#[5,] 0.2808520 -0.1704002
#[6,] 0.0000000 0.0000000
## str(X)
# ns [1:32, 1:2] 0.578 0.578 0.574 0.233 0.281 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:2] "1" "2"
# - attr(*, "degree")= int 3
# - attr(*, "knots")= Named num 3.7
# ..- attr(*, "names")= chr "50%"
# - attr(*, "Boundary.knots")= num [1:2] 2.76 4.93
# - attr(*, "intercept")= logi FALSE
# - attr(*, "class")= chr [1:3] "ns" "basis" "matrix"
fit <- lmer(wt ~ X + (hp|gear), data= mtcars)
beta <- coef(fit)
#$gear
# hp (Intercept) X1 X2
#3 0.010614406 2.455403 -2.167337 -0.9246454
#4 0.014601363 2.455403 -2.167337 -0.9246454
#5 0.006342761 2.455403 -2.167337 -0.9246454
#
#attr(,"class")
#[1] "coef.mer"
If we want to predict the ns term, just do
## use `predict.ns`; read `?predict.ns`
x0 <- seq(1, 5, by = 0.2) ## example `newx`
Xp <- predict(X, newx = x0) ## prediction matrix
b <- with(beta$gear, c(X1[1], X2[1])) ## coefficients for spline
y <- Xp %*% b ## predicted mean
plot(x0, y, type = "l")

How to plot MASS::qda scores

From this question, I was wondering whether it's possible to extract the quadratic discriminant analysis (QDA) scores and reuse them afterwards, like PCA scores.
## follow example from ?lda
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
Sp = rep(c("s","c","v"), rep(50,3)))
set.seed(1) ## remove this line if you want it to be pseudo random
train <- sample(1:150, 75)
table(Iris$Sp[train])
## your answer may differ
## c s v
## 22 23 30
Using QDA here:
z <- qda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
## get the whole prediction object
pred <- predict(z)
## show first few sample scores on LDs
Here you can see that it doesn't work:
head(pred$x)
# NULL
plot(LD2 ~ LD1, data = pred$x)
# Error in eval(expr, envir, enclos) : object 'LD2' not found
NOTE: Too long/formatted for a comment. NOT AN ANSWER
You may want to try the rrcov package:
library(rrcov)
z <- QdaCov(Sp ~ ., Iris[train,], prior = c(1,1,1)/3)
pred <- predict(z)
str(pred)
## Formal class 'PredictQda' [package "rrcov"] with 4 slots
## ..# classification: Factor w/ 3 levels "c","s","v": 2 2 2 1 3 2 2 1 3 2 ...
## ..# posterior : num [1:41, 1:3] 5.84e-45 5.28e-50 1.16e-25 1.00 1.48e-03 ...
## ..# x : num [1:41, 1:3] -97.15 -109.44 -54.03 2.9 -3.37 ...
## ..# ct : 'table' int [1:3, 1:3] 13 0 1 0 16 0 0 0 11
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ Actual : chr [1:3] "c" "s" "v"
## .. .. ..$ Predicted: chr [1:3] "c" "s" "v"
It also has robust PCA methods that may be useful.
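If the @x slot shown in the str() output above is the kind of score you're after (one column per class), you can reuse it much like PCA scores. A minimal sketch:
scores <- pred@x                          # per-class scores from the S4 result
pairs(scores, col = pred@classification)  # color points by predicted class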
Unfortunately, not every model in R conforms to the same object structure/API, and since QDA is not a linear model it is unlikely to conform to the structure of linear-model fits.
There's an example of how to visualize qda results here: http://ramhiser.com/2013/07/02/a-brief-look-at-mixture-discriminant-analysis/
And, you can do:
library(klaR)
partimat(Sp ~ ., data=Iris, method="qda", subset=train)
for a partition plot of the qda results.
