How to produce a classification table of predicted vs actual values in R

I'm new to R and neural networks and haven't been able to figure out how to predict a variable from a trained neural network and produce a classification table of predicted vs actual values. I just haven't been able to understand what the code for a classification table of predicted vs actual values means. I would greatly appreciate an explanation of the code. This is what I have done so far:
library(caret)  # train() comes from the caret package
model_Wine <- train(wine ~ ., data = wine_df, method = 'nnet', trace = FALSE)
prediction <- predict(model_Wine, wine_df)
table(prediction, wine_df$wine)
prediction   A   B   C
         A  59   0   0
         B   0  71   0
         C   0   0  48
How do I produce a classification table of predicted vs actual "wine" values?
Thank you!
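The `table(prediction, wine_df$wine)` call above already produces the classification table: rows are the predicted classes, columns the actual ones, so the diagonal holds the correct predictions. A minimal sketch with toy factors (not the actual `wine_df`, which isn't shown here) of how to read such a table:

```r
# Toy predicted/actual labels standing in for prediction and wine_df$wine
actual    <- factor(c("A", "A", "B", "B", "C", "C"))
predicted <- factor(c("A", "B", "B", "B", "C", "C"))

tab <- table(predicted, actual)  # rows = predicted, columns = actual
tab
# The diagonal counts agreements, so overall accuracy is:
accuracy <- sum(diag(tab)) / sum(tab)  # 5/6 here: one actual A was predicted as B
```

With caret loaded, `confusionMatrix(prediction, wine_df$wine)` gives the same table plus accuracy and per-class statistics.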

Related

credit score from SVM probabilities in R

I am trying to calculate credit scores for the germancredit data frame in R. I used a linear SVM classifier to predict 0 and 1 (0 = good, 1 = bad).
I managed to produce probabilities from the SVM classifier using the following code.
final_pred = predict(classifier, newdata = data_treated[1:npredictors], decision.values = TRUE, probability = TRUE)
probs = attr(final_pred,"probabilities")
I want to know how to read this probability output; a sample is shown below. Does the output mean that, since the prediction in the fifth row is 1 (default), its probability is 0.53601166?
           0          1 Prediction
1 0.90312125 0.09687875          0
2 0.57899408 0.42100592          0
3 0.93079172 0.06920828          0
4 0.76600082 0.23399918          0
5 0.46398834 0.53601166          1
Can I then use these probabilities to develop a credit scorecard, as is usually done with a logistic regression model?
You get a probability for each outcome, 0 and 1; the two probability columns in each row sum to one. Your interpretation seems correct to me: with a probability of 0.536 a default is more likely than no default (p = 0.464).
Yes, you could use that model to develop a credit scorecard. Keep in mind that you don't necessarily need to use 0.5 as the cutoff for deciding whether company or person X is going to default.
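To act on that last point: the second column (P(default)) can be compared against any cutoff, 0.5 is only the conventional choice. A small sketch with probabilities made up to resemble the sample output:

```r
# Made-up probability matrix shaped like the SVM output: columns "0" and "1"
probs <- matrix(c(0.90, 0.58, 0.46,   # P(no default)
                  0.10, 0.42, 0.54),  # P(default)
                ncol = 2, dimnames = list(NULL, c("0", "1")))

pred_50 <- ifelse(probs[, "1"] > 0.5, 1, 0)  # standard 0.5 cutoff
pred_30 <- ifelse(probs[, "1"] > 0.3, 1, 0)  # stricter cutoff flags more defaults
```

Lowering the cutoff trades false alarms for fewer missed defaults, which is often the right trade-off in credit scoring.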

Output "randomForest" with changing MeanDecreaseAccuracyValues

I have a question about the "randomForest" package in R. I am trying to build a model with ecological variables that best explain my species occupancy data for 41 sites in the field (gathered from camera traps). My ultimate goal is to do species occupancy modeling using the "unmarked" package, but before I get to that stage I need to select the variables that best explain my occupancy, since I have many. To gain some understanding of the randomForest package I generated a fake occupancy dataset and a fake variable dataset (with variables A and D being good predictors of my occupancy and B and C being bad predictors). When I run the randomForest my output looks like this:
           0        1 MeanDecreaseAccuracy MeanDecreaseGini
A 25.3537667 27.75533           26.9634018       20.6505920
B  0.9567857  0.00000            0.9665287        0.0728273
C  0.4261638  0.00000            0.4242409        0.1411643
D 32.1889374 35.52439           34.0485837       27.0691574

OOB estimate of error rate: 29.02%
Confusion matrix:
    0   1 class.error
0 250 119   0.3224932
1   0  41   0.0000000
I did not make separate train and test sets; I put extra weight on the model to correctly predict the "1"s, and the variables are scaled.
I understand that this output tells me that A and D are important variables because they have high MeanDecreaseAccuracy values. However, D is the inverse of A (they are perfectly correlated) so why does D have a higher MeanDecreaseAccuracy value?
Moreover, when I run the randomForest with only A and D as variables, these values change while the confusion matrix stays the same:
         0        1 MeanDecreaseAccuracy MeanDecreaseGini
A 28.79540 29.77911             29.00879         23.58469
D 29.75068 30.79498             29.97520         24.53415

OOB estimate of error rate: 29.02%
Confusion matrix:
    0   1 class.error
0 250 119   0.3224932
1   0  41   0.0000000
When I run the model with only one good predictor (A or D), or with one good and one bad predictor (A and B, or C and D), the confusion matrix stays the same but the MeanDecreaseAccuracy values of my predictors change.
Why do these values change and how should I approach the selection of my variables? (I am a beginner in occupancy modeling).
Thanks a lot!
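For context on why the values move: MeanDecreaseAccuracy is a permutation-based estimate. Each variable is shuffled in the out-of-bag samples and the resulting drop in accuracy is recorded, so the values are stochastic and shift between runs and between variable subsets even when the confusion matrix stays stable; with two perfectly correlated predictors, the trees also split randomly between them, so their importances need not match. A minimal sketch (assuming the randomForest package and toy data, not the real dataset) showing that fixing the seed makes the values reproducible:

```r
library(randomForest)

set.seed(123)  # fix the RNG: both the forest and the OOB permutations are random
toy <- data.frame(A = rnorm(200), B = rnorm(200))
toy$occ <- factor(ifelse(toy$A + rnorm(200, sd = 0.5) > 0, 1, 0))  # A is informative, B is noise

rf <- randomForest(occ ~ A + B, data = toy, importance = TRUE)
importance(rf)  # identical on every rerun with the same seed; varies without it
```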

Correlation of categorical data to binomial response in R

I'm looking to analyze the correlation between a categorical input variable and a binomial response variable, but I'm not sure how to organize my data or if I'm planning the right analysis.
Here's my data table (variables explained below):
species<-c("Aaeg","Mcin","Ctri","Crip","Calb","Tole","Cfus","Mdes","Hill","Cpat","Mabd","Edim","Tdal","Tmin","Edia","Asus","Ltri","Gmor","Sbul","Cvic","Egra","Pvar")
scavenge<-c(1,1,0,1,1,1,1,0,1,0,1,1,1,0,0,1,0,0,0,0,1,1)
dung<-c(0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0)
pred<-c(0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0)
nectar<-c(1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0)
plant<-c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)
blood<-c(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0)
mushroom<-c(0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0)
loss<-c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0) #1 means yes, 0 means no
data <- data.frame(species, scavenge, dung, pred, nectar, plant, blood, mushroom, loss)  # data.frame, not cbind(): cbind() would coerce everything to a character matrix
data #check data table
data table explanation
I have individual species listed, and the next columns are their annotated feeding types. A 1 in a given column means yes, and a 0 means no. Some species have multiple feeding types, while some have only one feeding type. The response variable I am interested in is "loss," indicating loss of a trait. I'm curious to know if any of the feeding types predict or are correlated with the status of "loss."
thoughts
I wasn't sure if there was a good way to include feeding types as one categorical variable with multiple categories. I don't think I can organize it as a single variable with the types c("scavenge","dung","pred", etc...) since some species have multiple feeding types, so I split them up into separate columns and indicated their status as 1 (yes) or 0 (no). At the moment I was thinking of trying to use a log-linear analysis, but examples I find don't quite have comparable data... and I'm happy for suggestions.
Any help or pointing in the right direction is much appreciated!
You have too few samples: only 4 species with loss == 1 against 18 with loss == 0. You will run into problems fitting a full logistic regression (i.e. including all variables). I suggest testing for association between each feeding habit and loss using a Fisher test:
library(dplyr)
library(purrr)
library(tibble)  # for add_column()
# function for the fisher test
FISHER <- function(x, y) {
  FT <- fisher.test(table(x, y))
  data.frame(
    pvalue         = FT$p.value,
    oddsratio      = as.numeric(FT$estimate),
    lower_limit_OR = FT$conf.int[1],
    upper_limit_OR = FT$conf.int[2]
  )
}
# define variables to test
FEEDING <- c("scavenge","dung","pred","nectar","plant","blood","mushroom")
# we loop through and test association between each variable and "loss"
results <- data[, FEEDING] %>%
  map_dfr(FISHER, y = data$loss) %>%
  add_column(var = FEEDING, .before = 1)
You get the results for each feeding habit:
> results
var pvalue oddsratio lower_limit_OR upper_limit_OR
1 scavenge 0.264251538 0.1817465 0.002943469 2.817560
2 dung 1.000000000 1.1582683 0.017827686 20.132849
3 pred 0.263157895 0.0000000 0.000000000 3.189217
4 nectar 0.535201640 0.0000000 0.000000000 5.503659
5 plant 0.002597403 Inf 2.780171314 Inf
6 blood 1.000000000 0.0000000 0.000000000 26.102285
7 mushroom 0.337662338 5.0498688 0.054241930 467.892765
The pvalue column is the p-value from fisher.test. Broadly, an odds ratio > 1 means the variable is positively associated with loss. Of all your variables, plant is the strongest, and you can check:
> table(loss,plant)
plant
loss 0 1
0 18 0
1 1 3
All three species with plant = 1 also have loss = 1 (though one loss = 1 species has plant = 0). So with your current dataset, I think this is the best you can do; you should collect a larger sample to see whether this still holds.
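Since seven separate Fisher tests are run here, it may also be worth adjusting the p-values for multiple testing; plant is the only variable that survives a Benjamini-Hochberg correction. A sketch using the p-values copied from the results table above:

```r
# p-values copied from the Fisher-test results above
pvals <- c(scavenge = 0.264251538, dung = 1.000000000, pred = 0.263157895,
           nectar   = 0.535201640, plant = 0.002597403,
           blood    = 1.000000000, mushroom = 0.337662338)

p_adj <- p.adjust(pvals, method = "BH")  # Benjamini-Hochberg FDR correction
round(p_adj, 3)  # only plant stays below 0.05
```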

Calculate AUC for test set (keras model in R)

Is there a way (a function) to calculate the AUC value for a keras model in R on a test set?
I have searched on Google but nothing showed up.
From a keras model, we can extract the predicted values as either class or probability, as follows:
Probability:
[1,] 9.913518e-01 1.087829e-02
[2,] 9.990101e-01 1.216531e-03
[3,] 9.445553e-01 6.256607e-02
[4,] 9.928864e-01 6.808311e-03
[5,] 9.993126e-01 1.028240e-03
[6,] 6.075442e-01 3.926141e-01
Class:
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
Many thanks,
Ho
Generally it does not really matter which classifier (keras or not) produced the prediction. All you need to estimate the AUC are two things: the predicted probabilities from some classifier and the actual category (for example, dead "yes" vs. "no"). With these you can calculate both the true positive rate and the false positive rate, so you can also draw a ROC plot and estimate the AUC. You can use
library(pROC)
roc_obj <- roc(category, prediction)
auc(roc_obj)
See here for some more explanation.
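If pROC is unavailable, the same AUC can be computed in base R via the Wilcoxon/Mann-Whitney formulation: the probability that a randomly chosen positive case gets a higher predicted probability than a randomly chosen negative one. A self-contained sketch with made-up labels and probabilities:

```r
# Made-up test-set labels and predicted probabilities of class 1
labels <- c(0, 0, 0, 0, 1, 1, 1, 1)
probs  <- c(0.10, 0.20, 0.35, 0.60, 0.40, 0.70, 0.80, 0.90)

pos <- probs[labels == 1]
neg <- probs[labels == 0]
# AUC = P(positive scores higher than negative), ties counted as half
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc  # 15/16 = 0.9375: one positive (0.40) is outscored by one negative (0.60)
```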
I'm not sure this will answer your needs, as it depends on your data structure and keras output format, but have a look at the dismo package's evaluate function. You need to set up something like this:
library(dismo)
predictors <- ...  # stack of explanatory variables
pres_test <- ...   # a subset of presence data held out from model fitting for testing
backg_test <- ...  # true or random (background) absence data
model <- ...       # output of your model
AUC <- evaluate(pres_test, backg_test, model, predictors)  # you may bootstrap this step x times by randomly re-selecting 'pres_test' and 'backg_test'

How to construct ROC curve in r with a small clinical dataset

I need help with a ROC curve in R. My data is not difficult, but I don't know how to get the ROC curve and the AUC. I typed the dataset here in case you need to have a look. The cutoff comes from the CMMS score (e.g. a score below 5/10/20 means the patient has dementia).
Table - Relationship of clinical dementia to outcome on the Mini-Mental Status Test
CMMS score   Nondemented   Demented
0–5                    0          2
6–10                   0          1
11–15                  3          4
16–20                  9          5
21–25                 16          3
26–30                 18          1
Total                 46         16
Please let me know any ideas. Thank you.
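One way to get a ROC curve from a grouped table like this is to expand the counts into one row per patient, using the interval midpoints as stand-in scores (an assumption, since individual scores aren't given), and then feed them to pROC::roc as in the previous question. A sketch under that assumption, with the AUC also computed by hand in base R; since lower CMMS scores indicate dementia, the comparison direction is reversed:

```r
# Interval midpoints stand in for individual CMMS scores (an assumption)
mid    <- c(2.5, 8, 13, 18, 23, 28)
nondem <- c(0, 0, 3, 9, 16, 18)
dem    <- c(2, 1, 4, 5, 3, 1)

score  <- c(rep(mid, nondem), rep(mid, dem))        # one entry per patient
status <- c(rep(0, sum(nondem)), rep(1, sum(dem)))  # 1 = demented

# With pROC: roc_obj <- pROC::roc(status, score); pROC::auc(roc_obj)
# By hand: a demented patient "wins" a pair when their score is LOWER
# than a nondemented patient's; ties count as half
pos <- score[status == 1]
neg <- score[status == 0]
auc <- mean(outer(pos, neg, "<") + 0.5 * outer(pos, neg, "=="))
auc  # about 0.81 for this table
```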
