Saving output of confusionMatrix as a .csv table - r

I have the following code, which produces a table-like output:
lvs <- c("normal", "abnormal")
truth <- factor(rep(lvs, times = c(86, 258)),
                levels = rev(lvs))
pred <- factor(
  c(rep(lvs, times = c(54, 32)),
    rep(lvs, times = c(27, 231))),
  levels = rev(lvs))
xtab <- table(pred, truth)
library(caret)
confusionMatrix(xtab)
confusionMatrix(pred, truth)
confusionMatrix(xtab, prevalence = 0.25)
I would like to export the part of the output below as a .csv table:
Accuracy : 0.8285
95% CI : (0.7844, 0.8668)
No Information Rate : 0.75
P-Value [Acc > NIR] : 0.0003097
Kappa : 0.5336
Mcnemar's Test P-Value : 0.6025370
Sensitivity : 0.8953
Specificity : 0.6279
Pos Pred Value : 0.8783
Neg Pred Value : 0.6667
Prevalence : 0.7500
Detection Rate : 0.6715
Detection Prevalence : 0.7645
Balanced Accuracy : 0.7616
Attempting to write it as a .csv table results in an error message:
write.csv(confusionMatrix(xtab), file = "file.csv")
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
  cannot coerce class '"confusionMatrix"' to a data.frame
Doing all of this manually is, for obvious reasons, impractical and prone to human error.
Any suggestions on how to export it as a .csv?

Using the caret package:
results <- confusionMatrix(pred, truth)
as.table(results) gives
          Reference
Prediction  X1  X0
        X1  36  29
        X0 218 727
as.matrix(results, what = "overall") gives
Accuracy 7.554455e-01
Kappa 1.372895e-01
AccuracyLower 7.277208e-01
AccuracyUpper 7.816725e-01
AccuracyNull 7.485149e-01
AccuracyPValue 3.203599e-01
McnemarPValue 5.608817e-33
and
as.matrix(results, what = "classes") gives
Sensitivity 0.8953488
Specificity 0.6279070
Pos Pred Value 0.8783270
Neg Pred Value 0.6666667
Precision 0.8783270
Recall 0.8953488
F1 0.8867562
Prevalence 0.7500000
Detection Rate 0.6715116
Detection Prevalence 0.7645349
Balanced Accuracy 0.7616279
Using these together with the write.csv command, you can export the entire confusionMatrix info, as in the sketch below.
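A minimal sketch (the file names are my own) that stacks the overall and per-class statistics into one named column and writes both pieces out:
stats <- rbind(as.matrix(results, what = "overall"),
               as.matrix(results, what = "classes"))
write.csv(stats, file = "confusion_stats.csv")              # all statistics
write.csv(as.table(results), file = "confusion_table.csv")  # the 2x2 table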

OK, so if you inspect the output of confusionMatrix(xtab, prevalence = 0.25), you'll see it's a list:
cm <- confusionMatrix(pred, truth)
str(cm)
List of 5
$ positive: chr "abnormal"
$ table : 'table' int [1:2, 1:2] 231 27 32 54
..- attr(*, "dimnames")=List of 2
.. ..$ Prediction: chr [1:2] "abnormal" "normal"
.. ..$ Reference : chr [1:2] "abnormal" "normal"
$ overall : Named num [1:7] 0.828 0.534 0.784 0.867 0.75 ...
..- attr(*, "names")= chr [1:7] "Accuracy" "Kappa" "AccuracyLower" "AccuracyUpper" ...
$ byClass : Named num [1:8] 0.895 0.628 0.878 0.667 0.75 ...
..- attr(*, "names")= chr [1:8] "Sensitivity" "Specificity" "Pos Pred Value" "Neg Pred Value" ...
$ dots : list()
- attr(*, "class")= chr "confusionMatrix"
From here you select the objects you want in the .csv and build a data.frame with one column per variable. In your case, that is:
tocsv <- data.frame(cbind(t(cm$overall), t(cm$byClass)))
# You can then use
write.csv(tocsv, file = "file.csv")

I found that capture.output works best for me.
It simply writes the printed output to a file; note that the result is plain text rather than true comma-separated values, so a .txt extension works just as well.
capture.output(
  confusionMatrix(xtab, prevalence = 0.25),
  file = "F:/Home Office/result.csv")

The absolute easiest solution is to simply write the object out using readr::write_rds. You can export and re-import it while keeping the confusionMatrix structure intact (note that this produces a binary .rds file rather than a .csv).
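A minimal sketch of the round trip, assuming cm is the confusionMatrix object from the earlier answer (the file name is illustrative):
readr::write_rds(cm, "confusion_matrix.rds")     # binary .rds, not .csv
cm_restored <- readr::read_rds("confusion_matrix.rds")
class(cm_restored)                               # still "confusionMatrix"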

If A is a caret::confusionMatrix object, then:
broom::tidy(A) %>% writexl::write_xlsx("mymatrix.xlsx")
Optionally, replace writexl::write_xlsx() with write.csv().
To also include the table on a separate sheet:
broom::tidy(A) %>% list(as.data.frame(A$table)) %>% writexl::write_xlsx("mymatrix.xlsx")
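A sketch of the write.csv() variant; since CSV has no sheets, each piece goes to its own file (the file names are illustrative):
write.csv(broom::tidy(A), "mymatrix.csv", row.names = FALSE)
write.csv(as.data.frame(A$table), "mytable.csv", row.names = FALSE)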

Related

Return value of fitdistr function explanation

I have the following generated dataset:
library(MASS)
df <- data.frame(
  product = sample(x = c("toyota", "honda", "nissan", "bmw"), size = 1000, replace = TRUE),
  parameter = sample(x = c("X", "Y", "A"), size = 1000, replace = TRUE),
  value = rgamma(1000, shape = 5, rate = 0.1))
and I want to fit a lognormal distribution to the column "value", using the following code:
dist_par <- fitdistr(unlist(df["value"]), "lognormal")
The result looks something like this:
meanlog sdlog
3.8416 0.4292
(0.0458) (0.0324)
I have two questions:
I read the help, and I gather that the meanlog and sdlog estimates are shown in the first row:
meanlog sdlog
3.8416 0.4292
but the second row of numbers (the ones in parentheses) is confusing. What are they?
meanlog sdlog
.... ....
(0.0458) (0.0324)
I know the result of fitdistr is a list, but I don't know how to access those four values. For instance, how can I get 3.8416?
If I run
dist_par[1]
then I get
meanlog sdlog
3.842 0.429
and if I run:
dist_par[1,1]
then I get the following error:
Error in dist_par[1, 1] : incorrect number of dimensions
According to the ?fitdistr documentation:
An object of class "fitdistr", a list with four components,
estimate - the parameter estimates,
sd - the estimated standard errors,
vcov - the estimated variance-covariance matrix, and
loglik - the log-likelihood.
This is evident if we check the structure (here the fitted object is named out):
str(out)
List of 5
$ estimate: Named num [1:2] 3.801 0.455
..- attr(*, "names")= chr [1:2] "meanlog" "sdlog"
$ sd : Named num [1:2] 0.0144 0.0102
..- attr(*, "names")= chr [1:2] "meanlog" "sdlog"
$ vcov : num [1:2, 1:2] 0.000207 0 0 0.000103
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "meanlog" "sdlog"
.. ..$ : chr [1:2] "meanlog" "sdlog"
That is, the print method displays the estimate values with the sd values underneath in parentheses. Because the object is a list, [1, 1] doesn't work; we need the standard extraction methods, i.e. either $ or [[:
> out
meanlog sdlog
3.80075311 0.45468543
(0.01437842) (0.01016708)
> out$estimate
meanlog sdlog
3.8007531 0.4546854
> out$estimate[["meanlog"]]
[1] 3.800753
> out$sd
meanlog sdlog
0.01437842 0.01016708
That is, inside the list the elements are just named vectors, so use [ or [[ to extract by name.
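Putting it together, a small sketch that collects the printed numbers into one data frame (the column names are my own):
fit_df <- data.frame(estimate = out$estimate, se = out$sd)
fit_df["meanlog", "estimate"]   # 3.800753, the single value asked about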

Standard errors for smooth coefficient kernel regression with npscoef {np}

While fitting a smooth coefficient kernel regression with npscoef {np} in R, I cannot get it to output the standard errors for the regression estimates.
The Help states that if errors = TRUE, asymptotic standard errors should be computed and returned in the resulting smoothcoefficient object.
Based on the example provided by the authors of the np package:
library("np")
data(wage1)
NP.Ydata <- wage1$lwage
NP.Xdata <- wage1[c("educ", "tenure", "exper", "expersq")]
NP.Zdata <- wage1[c("female", "married")]
NP.bw.scoef <- npscoefbw(xdat = NP.Xdata, ydat = NP.Ydata, zdat = NP.Zdata)
NP.scoef <- npscoef(NP.bw.scoef,
                    betas = TRUE,
                    residuals = TRUE,
                    errors = TRUE)
The coefficients are available via coef(NP.scoef), saved because betas = TRUE:
> str(coef(NP.scoef))
num [1:526, 1:5] 0.146 0.504 0.196 0.415 0.415 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] "Intercept" "educ" "tenure" "exper" ...
But shouldn't the standard errors for the estimates be saved when errors = TRUE?
I see only one column vector, not five (intercept + 4 explanatory variables).
> str(se(NP.scoef))
num [1:526] 0.015 0.0155 0.0155 0.0268 0.0128 ...
I am confused and would appreciate a clarification.

Evaluating a statistical model in R

I have a very big data set (ds). One of its columns is Popularity, a factor ('High'/'Low').
I split the data 70/30 to create a training set (ds_tr) and a test set (ds_te).
I created the following model using logistic regression:
mdl <- glm(popularity ~ . - url, family = "binomial", data = ds_tr)
Then I created predictions (I will do the same for ds_te):
y_hat <- predict(mdl, newdata = ds_tr, type = 'response')
I want to find the precision and recall values that correspond to a cutoff threshold of 0.5, so I did:
library(ROCR)
pred <- prediction(y_hat, ds_tr$popularity)
perf <- performance(pred, "prec", "rec")
The result is an object holding many values:
str(perf)
Formal class 'performance' [package "ROCR"] with 6 slots
..@ x.name      : chr "Recall"
..@ y.name      : chr "Precision"
..@ alpha.name  : chr "Cutoff"
..@ x.values    :List of 1
.. ..$ : num [1:27779] 0.00 7.71e-05 7.71e-05 1.54e-04 2.31e-04 ...
..@ y.values    :List of 1
.. ..$ : num [1:27779] NaN 1 0.5 0.667 0.75 ...
..@ alpha.values:List of 1
.. ..$ : num [1:27779] Inf 0.97 0.895 0.89 0.887 ...
How do I find the specific precision and recall values corresponding to a cutoff threshold of 0.5?
Access the slots of the performance object (they are S4 slots, reached with @, combined with list extraction).
We create a data frame with all possible values:
probab.cuts <- data.frame(cut = perf@alpha.values[[1]],
                          prec = perf@y.values[[1]],
                          rec = perf@x.values[[1]])
You can view all associated values
probab.cuts
Selecting the requested values is then trivial:
tail(probab.cuts[probab.cuts$cut > 0.5,], 1)
Manual check
tab <- table(ds_tr$popularity, y_hat > 0.5)
tab[4]/(tab[4]+tab[2]) # recall
tab[4]/(tab[4]+tab[3]) # precision
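Because positional indices like tab[4] depend on how the table happens to be laid out, a sketch with explicit dimension names may be safer (assuming 'High' is the positive class of your factor):
conf <- table(truth = ds_tr$popularity, predicted = y_hat > 0.5)
recall    <- conf["High", "TRUE"] / sum(conf["High", ])   # TP / (TP + FN)
precision <- conf["High", "TRUE"] / sum(conf[, "TRUE"])   # TP / (TP + FP)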

Predict probability from Cox PH model

I am trying to use a Cox model to predict the probability of failure after time 3 (the time variable is named stop).
library(survival)
bladder1 <- bladder[bladder$enum < 5, ]
coxmodel <- coxph(Surv(stop, event) ~ (rx + size + number) + cluster(id),
                  data = bladder1)
range(predict(coxmodel, bladder1, type = "lp"))
range(predict(coxmodel, bladder1, type = "risk"))
range(predict(coxmodel, bladder1, type = "terms"))
range(predict(coxmodel, bladder1, type = "expected"))
However, none of the outputs of the predict function are in the 0-1 range. Is there a function for this, or how can I use the lp prediction and the baseline hazard function to calculate a probability?
Please read the help page for predict.coxph. None of those are supposed to be probabilities. The linear predictor for a specific set of covariates is the log-hazard-ratio relative to a hypothetical (and very possibly non-existent) case with the mean of all the predictor values. The 'expected' type comes closest to a probability, since it is a predicted number of events, but it would require specifying the time and then dividing by the number at risk at the beginning of observation.
In the case of the example offered on that help page for predict, you can see that the sum of predicted events is close to the actual number:
> sum(predict(fit,type="expected"), na.rm=TRUE)
[1] 163
> sum(lung$status==2)
[1] 165
I suspect you may want to be working instead with the survfit function, since the probability of an event is 1 minus the probability of survival.
?survfit.coxph
The code for a similar question appears here: Adding column of predicted Hazard Ratio to dataframe after Cox Regression in R
Since you suggested using the bladder1 dataset, this would be the code for a specification of time = 5:
summary(survfit(coxmodel), time=5)
#------------------
Call: survfit(formula = coxmodel)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 302 26 0.928 0.0141 0.901 0.956
That returns a list, with the survival prediction in the element named $surv:
> str(summary(survfit(coxmodel), time=5))
List of 14
$ n : int 340
$ time : num 5
$ n.risk : num 302
$ n.event : num 26
$ conf.int: num 0.95
$ type : chr "right"
$ table : Named num [1:7] 340 340 340 112 NA 51 NA
..- attr(*, "names")= chr [1:7] "records" "n.max" "n.start" "events" ...
$ n.censor: num 19
$ surv : num 0.928
$ std.err : num 0.0141
$ lower : num 0.901
$ upper : num 0.956
$ cumhaz : num 0.0744
$ call : language survfit(formula = coxmodel)
- attr(*, "class")= chr "summary.survfit"
> summary(survfit(coxmodel), time=5)$surv
[1] 0.9282944
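Applied to the original question (probability of failure by time 3), a minimal sketch is then:
# 1 - predicted survival = predicted probability of an event by time 3,
# for the reference covariate pattern that survfit() uses by default
1 - summary(survfit(coxmodel), times = 3)$surv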

How to calculate average sensitivity and specificity at specified cutoff in ROCR package?

I use the ROCR package to draw the ROC curve. The code is as follows:
pred <- prediction(my.pred, my.label)
perf <- performance(pred, 'tpr', 'fpr')  # performance() takes the prediction object, not the raw scores
plot(perf, avg = "threshold")
My pred and perf objects hold lists rather than single vectors, so I can get an averaged ROC curve.
Can anyone tell me how to calculate average sensitivity and specificity at a specified cutoff in ROCR package?
Actually, ROCR is overkill for this task. The performance function of ROCR returns performance metrics at every score that is present in its input, so theoretically you could do the following:
library(ROCR)
set.seed(123)
N <- 1000
POSITIVE_CASE <- 'case A'
NEGATIVE_CASE <- 'case B'
CUTOFF <- 0.456
scores <- rnorm(n=N)
labels <- ifelse(runif(N) > 0.5, POSITIVE_CASE, NEGATIVE_CASE)
pred <- prediction(scores, labels)
perf <- performance(pred, 'sens', 'spec')
At this point perf contains a lot of useful information:
> str(perf)
Formal class 'performance' [package "ROCR"] with 6 slots
..@ x.name      : chr "Specificity"
..@ y.name      : chr "Sensitivity"
..@ alpha.name  : chr "Cutoff"
..@ x.values    :List of 1
.. ..$ : num [1:1001] 1 1 0.998 0.996 0.996 ...
..@ y.values    :List of 1
.. ..$ : num [1:1001] 0 0.00202 0.00202 0.00202 0.00405 ...
..@ alpha.values:List of 1
.. ..$ : num [1:1001] Inf 3.24 2.69 2.68 2.58 ...
Now you can search for your score cut-off in perf@alpha.values and find the corresponding sensitivity and specificity values. If you don't find the exact cut-off value in perf@alpha.values, you'll have to settle for the nearest value or interpolate (an approx() sketch follows the output below):
ix <- which.min(abs(perf@alpha.values[[1]] - CUTOFF)) # nearest cutoff; good enough in our case
sensitivity <- perf@y.values[[1]][ix] # note the order of arguments to `performance` and of x and y in `perf`
specificity <- perf@x.values[[1]][ix]
Which gives you:
> sensitivity
[1] 0.3319838
> specificity
[1] 0.6956522
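If the exact cutoff does not appear among the alpha values, one possible interpolation sketch with base R's approx(), dropping the leading Inf cutoff first:
alpha <- perf@alpha.values[[1]][-1]                        # drop the Inf entry
sens_interp <- approx(alpha, perf@y.values[[1]][-1], xout = CUTOFF)$y
spec_interp <- approx(alpha, perf@x.values[[1]][-1], xout = CUTOFF)$y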
But there is a much simpler and faster way: just convert your label string to a binary vector and calculate the metrics directly:
binary.labels <- labels == POSITIVE_CASE
tp <- sum((scores > CUTOFF) & binary.labels)
sensitivity <- tp / sum(binary.labels)
tn <- sum((scores <= CUTOFF) & (!binary.labels))
specificity <- tn / sum(!binary.labels)
Which gives you:
> sensitivity
[1] 0.3319838
> specificity
[1] 0.6956522
