Order confusion matrix in R

I've created a confusion matrix from the observations and their predictions, using 3 classes.
classes=c("Underweight", "Normal", "Overweight")
When I compute the confusion matrix, it orders the classes in the table alphabetically. Here is my code.
# Confusion matrix
Observations <- bmi_classification(cross.m$bmi)
Predicted <- bmi_classification(cross.m$cvpred)
conf <- table(Predicted, Observations)
library(caret)
f.conf <- confusionMatrix(conf)
print(f.conf)
This produces this output:
Confusion Matrix and Statistics
Observations
Predicted Normal Overweight Underweight
Normal 17 0 1
Overweight 1 4 0
Underweight 1 0 1
So, I would like the order to be Underweight first, then Normal, and finally Overweight. I tried passing the order to the matrix as an argument, but had no luck with that.
EDIT:
I tried reordering it,
conf <- table(Predicted, Observations)
reorder = matrix(c(9, 7, 8, 3, 1, 2, 6, 4, 5), nrow=3, ncol=3)
conf.reorder <- conf[reorder]
but I'm getting, [1] 1 1 0 1 17 1 0 0 4

Try converting both vectors to factors with the levels in the order you want, then redo your code:
Observations <- factor( Observations,
levels=c("Underweight","Normal","Overweight") )
Predicted <- factor( Predicted,
levels=c("Underweight","Normal","Overweight") )
conf <- table(Predicted, Observations)
library(caret)
f.conf <- confusionMatrix(conf)
print(f.conf)
Ordinary matrix methods would probably not work since a caret confusion matrix object is a list.
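The effect of the factor levels on table() can be checked without caret. Below is a minimal sketch with made-up labels (the vectors obs and pred are invented for illustration and are not the asker's data):

```r
# Desired class order
lv <- c("Underweight", "Normal", "Overweight")

# Invented example labels
obs  <- c("Normal", "Overweight", "Underweight", "Normal", "Normal")
pred <- c("Normal", "Overweight", "Normal", "Normal", "Underweight")

# Character vectors: table() sorts the classes alphabetically
conf_alpha <- table(Predicted = pred, Observations = obs)

# Factors with explicit levels: table() follows the level order
conf_ordered <- table(Predicted    = factor(pred, levels = lv),
                      Observations = factor(obs,  levels = lv))

print(conf_alpha)
print(conf_ordered)

# An existing table can also be reordered by indexing with the class names
conf_by_name <- conf_alpha[lv, lv]
```

Indexing by name, as in the last line, reorders rows and columns together, which is what the numeric index matrix in the question's EDIT did not do.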

Related

The cv.glmnet() prediction is the opposite of using "class" and "response"

I'm trying to plot a ROC curve from a lasso logistic regression result, so I used predict() with type="response" to get probabilities. However, the result was the opposite of what I get with type = "class".
First of all, this is my dataset. My outcome variable has 2 levels.
selected_data$danger <- factor(selected_data$danger, levels = c(1,0))
lasso_data<-selected_data
str(lasso_data$danger)
# Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
# partition
input_train <- createDataPartition(y=lasso_data$danger, p=0.8, list=FALSE)
train_dataset <- lasso_data[input_train,]
test_dataset <- lasso_data[-input_train,]
dim(train_dataset)
# [1] 768 62
dim(test_dataset)
# [1] 192 62
I ran both cases (type = "class" and type = "response") to compare.
lasso_model <- cv.glmnet( x=data.matrix(train_dataset[,-length(train_dataset)]), y = train_dataset[,length(train_dataset)],
family = "binomial" , type.measure = "auc",alpha=1, nfolds=5)
lasso_pred <- predict(lasso_model, newx=data.matrix(test_dataset[,-length(test_dataset)]),
s=lasso_model$lambda.min, type= "class", levels=c(1,0))
lasso_pred_resp <- predict(lasso_model, s="lambda.1se", newx=data.matrix(test_dataset[,-length(test_dataset)]), type="response", levels=c(1,0))
threshold <- 0.5 # or whatever threshold you use
pred <- ifelse(lasso_pred_resp>threshold, 1, 0)
table(lasso_pred, pred)
# pred
# lasso_pred 0 1
# 0 11 95
# 1 76 10
I have no idea why this is happening...
Any help would be greatly appreciated.
For logistic regression in R, the probability returned as "response" always refers to the probability of being in the 2nd factor level, which in your case is "0".
So your predictions should be:
pred <- ifelse(lasso_pred_resp>threshold, 0, 1)
To avoid confusion, you can also do:
lvl <- levels(lasso_data$danger)
pred <- ifelse(lasso_pred_resp>threshold,lvl[2],lvl[1])
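The same behaviour can be reproduced with a small simulated example. This sketch uses base R's glm() rather than cv.glmnet(), so that it runs without extra packages. The data are simulated and the variable names are made up for illustration.

```r
set.seed(1)
x <- rnorm(200)

# Outcome with "1" as the first level and "0" as the second, as in the question
y <- factor(ifelse(x + rnorm(200, sd = 0.5) > 0, 1, 0), levels = c(1, 0))

fit <- glm(y ~ x, family = binomial)

# type = "response" gives P(y == second level), i.e. P(y == "0") here
p <- predict(fit, type = "response")

# Map probabilities to labels through the level order instead of hard-coding 0/1
lvl  <- levels(y)
pred <- ifelse(p > 0.5, lvl[2], lvl[1])

# Share of observations classified correctly
accuracy <- mean(pred == as.character(y))
print(accuracy)
```

Because large x values produce "1" (the first level) while the model describes the probability of "0", the fitted slope comes out negative.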

Why do results of matching depend on order of data (MatchIt package)?

When using the matchit function for full matching, the results depend on the row order of the input data frame. That is, if the order of the data is changed, the results change, too. This is surprising because, in my understanding, the optimal full matching algorithm should yield one single best solution.
Am I missing something or is this an error?
Similar differences occur with the optimal algorithm.
Below is a reproducible example. The subclasses should be identical for the two data sets, but they are not.
Thank you for your help!
# create data
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
x2 <- c(rep("a", 20),rep("b", 60), rep("c", 20))
x3 <- rnorm(100, mean=230, sd=2)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50),rep(1, 50))
df <- data.frame(x1=x1, x2=x2, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df[order(outcome),] # re-order data.frame
# perform matching
model_oldorder <- matchit(group~x1, data=df, method="full", distance ="logit")
model_neworder <- matchit(group~x1, data=df_neworder, method="full", distance ="logit")
# store matching results
matcheddata_oldorder <- match.data(model_oldorder, distance="pscore")
matcheddata_neworder <- match.data(model_neworder, distance="pscore")
# Results based on original data.frame
head(matcheddata_oldorder[order(nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 27
2 63.949637 a 529.2733 0 2 0.5283582 1.0 32
3 52.217666 a 526.7928 0 3 0.5028106 0.5 17
4 48.936397 a 492.9255 0 4 0.4956569 1.0 9
5 36.501507 a 512.9301 0 5 0.4685876 1.0 16
# Results based on re-ordered data.frame
head(matcheddata_neworder[order(matcheddata_neworder$nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 25
2 63.949637 a 529.2733 0 2 0.5283582 1.0 31
3 52.217666 a 526.7928 0 3 0.5028106 0.5 15
4 48.936397 a 492.9255 0 4 0.4956569 1.0 7
5 36.501507 a 512.9301 0 5 0.4685876 2.0 14
Apparently, the assignment of objects to subclasses differs. In my understanding, this should not be the case.
The developers of the optmatch package (which the matchit function calls) provided useful help:
I think what we're seeing here is the result of the tolerance argument
that fullmatch has. The matching algorithm requires integer distances,
so we have to scale then truncate floating point distances. For a
given set of integer distances, there may be multiple matchings that
achieve the minimum, so the solver is free to pick among these
non-unique solutions.
Developing your example a little more:
library(optmatch)
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50),rep(1, 50))
df_oldorder <- data.frame(x1=x1, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df_oldorder[order(outcome),] # re-order data.frame
glm_oldorder <- match_on(glm(group~x1, data=df_oldorder), data = df_oldorder)
glm_neworder <- match_on(glm(group~x1, data=df_neworder), data = df_neworder)
fm_old <- fullmatch(glm_oldorder, data=df_oldorder)
fm_new <- fullmatch(glm_neworder, data=df_neworder)
mean(sapply(matched.distances(fm_old, glm_oldorder), mean))
## 0.06216174
mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.062058
mean(sapply(matched.distances(fm_old, glm_oldorder), mean)) -
mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.00010373
The difference is smaller than the default tolerance of 0.001. You can always decrease the tolerance level, which may
require increased run time, in order to get closer to the true
floating point minimum. We found 0.001 seemed to work well in practice,
but there is nothing special about this value.
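The tie-breaking effect described in the quote can be illustrated with a toy calculation. This is only a sketch: the distance matrix is made up, and the scale factor of 1000 with floor() is a stand-in chosen for illustration, not optmatch's actual internal scaling.

```r
# Made-up distances: rows are 2 treated units, columns are 2 controls
d <- matrix(c(0.1004, 0.2004,
              0.2001, 0.3006), nrow = 2, byrow = TRUE)

# Two possible pairings and their exact total distances
total_a <- d[1, 1] + d[2, 2]   # treated 1 with control 1, treated 2 with control 2
total_b <- d[1, 2] + d[2, 1]   # treated 1 with control 2, treated 2 with control 1

# Hypothetical integer conversion: scale, then truncate
d_int <- floor(d * 1000)
int_a <- d_int[1, 1] + d_int[2, 2]
int_b <- d_int[1, 2] + d_int[2, 1]

print(c(exact_a = total_a, exact_b = total_b, int_a = int_a, int_b = int_b))
```

On the floating point scale the two pairings differ slightly, but on the integer scale they tie, so a solver working with the integers has no reason to prefer one over the other and the row order of the input can decide which one is returned.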

How can I use SOM algorithm for classification prediction

I would like to see if the SOM algorithm can be used for classification prediction.
I used the code below, but the classification results are far from right. For example, in the test dataset I get many more than the 3 values that I have in the training target variable. How can I create a prediction model that is aligned with the training target variable?
library(kohonen)
library(HDclassif)
data(wine)
set.seed(7)
training <- sample(nrow(wine), 120)
Xtraining <- scale(wine[training, ])
Xtest <- scale(wine[-training, ],
center = attr(Xtraining, "scaled:center"),
scale = attr(Xtraining, "scaled:scale"))
som.wine <- som(Xtraining, grid = somgrid(5, 5, "hexagonal"))
som.prediction$pred <- predict(som.wine, newdata = Xtest,
trainX = Xtraining,
trainY = factor(Xtraining$class))
And the result:
$unit.classif
[1] 7 7 1 7 1 11 6 2 2 7 7 12 11 11 12 2 7 7 7 1 2 7 2 16 20 24 25 16 13 17 23 22
[33] 24 18 8 22 17 16 22 18 22 22 18 23 22 18 18 13 10 14 15 4 4 14 14 15 15 4
This might help:
SOM is an unsupervised algorithm, so you shouldn't expect it to be trained on a dataset that contains a class label (if you do that, it will need this information to work and will be useless with unlabelled datasets).
The idea is that it "converts" an input numeric vector into a network unit number (try running your code again with a 1 by 3 grid and you'll get the output you expected).
You then need to convert those unit numbers back into the categories you are looking for (that is the key part missing in your code).
The reproducible example below outputs the share of correctly classified test observations. It includes one implementation option for the "convert back" part missing in your original post.
Note that, for this particular dataset, the model overfits pretty quickly: 3 units give the best results.
library(kohonen)
library(HDclassif) #provides the wine dataset
#Set and scale a training set (-1 to drop the classes)
data(wine)
set.seed(7)
training <- sample(nrow(wine), 120)
Xtraining <- scale(wine[training, -1])
#Scale a test set (-1 to drop the classes)
Xtest <- scale(wine[-training, -1],
center = attr(Xtraining, "scaled:center"),
scale = attr(Xtraining, "scaled:scale"))
#Set 2D grid resolution
#WARNING: it overfits pretty quickly
#Share correctly classified: 36% for 1 unit, 63% for 2, 93% for 3, 89% for 4
som_grid <- somgrid(xdim = 1, ydim=3, topo="hexagonal")
#Create a trained model
som_model <- som(Xtraining, som_grid)
#Make a prediction on test data
som.prediction <- predict(som_model, newdata = Xtest)
#Put together original classes and SOM classifications
error.df <- data.frame(real = wine[-training, 1],
predicted = som.prediction$unit.classif)
#Return the category number that has the strongest association with each unit
#number in the training data (0 stands for ambiguous)
train.df <- data.frame(real = wine[training, 1], predicted = som_model$unit.classif)
unit_map <- sapply(unique(som_model$unit.classif), function(x, df){
  cls <- as.numeric(names(which.max(table(
    df[df$predicted==x,1]))))
  if(length(cls)<1){
    cls <- 0
  }
  return(c(x, cls))
}, df = train.df)
#Translate unit numbers into classes
error.df$corrected <- apply(error.df, MARGIN = 1, function(x, unit_map){
  cls <- unit_map[2, which(unit_map[1,] == x["predicted"])]
  if(length(cls)<1){
    cls <- 0
  }
  return(cls)
}, unit_map = unit_map)
#Compute the classification accuracy (share of correct predictions)
sum(error.df$corrected == error.df$real)/length(error.df$real)
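The "convert back" step can also be written as a majority vote per unit. The sketch below uses base R only and invented unit assignments and class labels (it does not call kohonen), so it only illustrates the mapping idea, not the full model:

```r
# Invented training data: SOM unit of each observation and its true class
train_unit  <- c(1, 1, 1, 2, 2, 3, 3, 3)
train_class <- c(1, 1, 2, 2, 2, 3, 3, 1)

# Cross-tabulate units against classes, then pick the most frequent class per unit
votes <- table(train_unit, train_class)
unit_to_class <- as.numeric(colnames(votes)[apply(votes, 1, which.max)])
names(unit_to_class) <- rownames(votes)

# Invented test data: units assigned to new observations
test_unit <- c(3, 1, 2, 2)

# Look up the class of each test unit (NA if the unit never occurred in training)
test_pred <- unname(unit_to_class[as.character(test_unit)])
print(test_pred)
```

Note that which.max() resolves ties by taking the first class, so a unit with evenly split votes is not flagged as ambiguous in this version.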

Adjust implausible imputed values in an optimized way

I have a dataset with some imputed values. According to a predefined edit rule, some of these imputed values are implausible. For that reason, I want to adjust these implausible imputed values, but the adjustment should be as small as possible.
Here is a simplified example:
# Seed
set.seed(111)
# Example data
data <- data.frame(x1 = round(rnorm(200, 5, 5), 0),
x2 = factor(round(runif(200, 1, 3), 0)),
x3 = round(rnorm(200, 2, 10), 0),
x4 = factor(round(runif(200, 0, 5), 0)))
data[data$x1 > 5 & data$x2 == 1, ]$x3 <- 4
data[data$x1 > 5 & data$x2 == 1, ]$x4 <- 5
# Missings
data$x1[sample(1:nrow(data), 25)] <- NA
data$x2[sample(1:nrow(data), 50)] <- NA
data$x3[sample(1:nrow(data), 40)] <- NA
data$x4[sample(1:nrow(data), 35)] <- NA
# Imputation
library("mice")
imp <- mice(data, m = 1)
# Imputed data
data_imp <- complete(imp, "repeated")
# So far everything works well.
# However, there is a predefined edit rule, which should not be violated.
# Edit Rule:
# If x1 > 5 and x2 == 1
# Then x3 > 3 and x4 > 4
# Because of the imputation, some of the observations have implausible values.
implausible <- data_imp[data_imp$x1 > 5 & data_imp$x2 == 1 &
(data_imp$x3 <= 3 | (data_imp$x4 != 4 & data_imp$x4 != 5)), ]
implausible
# Example 1)
# In row 26 x1 has a value > 5 and x2 equals 1.
# For that reason, x3 would have to be larger than 3 (here x3 is -17).
# Like you can see in the original data, x2 has been imputed in row 26.
data[rownames(implausible), ]
# Hence, x2 would have to be adjusted, so that it randomly gets a different category.
# Example 2)
# In row 182 are also implausible values.
# Three of the variables have been imputed in this row.
# Therefore, all/some of the imputed cells would have to be adjusted,
# but the adjustment should be as small as possible.
I have already done some research and found some relevant papers/books that describe optimization algorithms for this:
Pannekoek & Zhang (2011): https://www.researchgate.net/publication/269410841_Partial_donor_Imputation_with_Adjustments
de Waal, Pannekoek & Scholtus (2011): Handbook of Statistical Data Editing and Imputation
However, I am struggling with the implementation of these algorithms in R. Is there a package available that helps with these kinds of calculations? I'd really appreciate some help with my code or some hints about the topic!
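As a starting point only, the cells that are candidates for adjustment can be located by combining the edit rule with the missingness pattern of the original data. The sketch below is not one of the optimization algorithms from the cited literature. It uses small made-up data frames in place of data and data_imp, and it assumes x4 is numeric for simplicity.

```r
# Made-up stand-ins for the original data (with NAs) and the imputed data
orig    <- data.frame(x1 = c(7, 2, NA), x2 = c(NA, 1, 1),
                      x3 = c(-17, 5, 1), x4 = c(5, 2, NA))
imputed <- data.frame(x1 = c(7, 2, 9),  x2 = c(1, 1, 1),
                      x3 = c(-17, 5, 1), x4 = c(5, 2, 3))

# Edit rule: if x1 > 5 and x2 == 1, then x3 > 3 and x4 > 4
violates <- with(imputed, x1 > 5 & x2 == 1 & !(x3 > 3 & x4 > 4))

# Cells that were missing originally, i.e. that hold imputed values
was_imputed <- is.na(orig)

# Imputed cells in violating rows: only these may be changed
candidates <- was_imputed & violates
print(which(candidates, arr.ind = TRUE))
```

Restricting changes to these cells keeps observed values untouched. Choosing the smallest adjustment among them is the part that still needs an optimization method.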

How to use predict on a test set?

I am eventually going to do multivariate regression for a very large set of predictors. To make sure that I am putting the data in correctly and getting the expected results, I started with a toy model. However, when I try to use predict, it does not predict on the new data, and since the size of the new data is different from the training set, it also gives me a warning.
I have looked up and tried various things from the Internet and none have worked. I am almost ready to give up and write my own functions, but I am also building models with the pls package, which I am guessing probably calls this internally already, so I want to be consistent. Here is the short script I wrote:
x1<-c(1.1,3.4,5.6,1.2,5,6.4,0.9,7.2,5.4,3.1) # Original Variables
x2<-c(10,21,25,15.2,18.9,19,16.2,22.1,18.6,22)
y<-2.0*x1+1.12*x2+rnorm(10,mean=0,sd=0.2) # Define output variable
X<-data.frame(x1,x2)
lfit<-lm(y~.,X) # fit model
n_fit<-lfit$coefficients
xg1<-runif(15,1,10) # define new data
xg2<-runif(15,10,30)
X<-data.frame(xg1,xg2)# put into data frame
y_guess<-predict(lfit,newdata=X) #Predict based on fit
y_actual<-2.0*xg1+1.12*xg2 # actual values because I know the coefficients
y_pred=n_fit[1]+n_fit[2]*xg1+n_fit[3]*xg2 # What predict should give me based on fit
print(y_guess-y_actual) #difference check
print(y_guess-y_pred)
These are the values I am getting and the error message:
[1] -4.7171499 -16.9936498 6.9181074 -6.1964788 -11.1852816 0.9257043 -13.7968731 -6.6624086 15.5365141 -8.5009428
[11] -22.8866505 2.0804016 -1.8728602 -18.7670797 1.2251849
[1] -4.582645 -16.903164 7.038968 -5.878723 -11.149987 1.162815 -13.473351 -6.483111 15.731694 -8.456738
[11] -22.732886 2.390507 -1.662446 -18.627342 1.431469
Warning messages:
1: 'newdata' had 15 rows but variables found have 10 rows
2: In y_guess - y_actual :
longer object length is not a multiple of shorter object length
3: In y_guess - y_pred :
longer object length is not a multiple of shorter object length
The estimated coefficients are 1.97 and 1.13, with an intercept of -0.25. The intercept should be 0, but I added noise, and that alone would not cause such a big discrepancy. How do I get predictions for an independent test set?
From the help - documentation, ?predict.lm:
"Variables are first looked for in newdata and then searched for in the usual way (which will include the environment of the formula used in the fit)."
The data frame created by X <- data.frame(xg1, xg2) has different column names (xg1, xg2). predict() cannot find the original names (x1, x2) in newdata, so it looks them up in the environment of the formula instead. The result is that you obtain the fitted values for your original data.
Solve this by making your names in the newdata consistent with the original:
X <- data.frame(x1=xg1, x2=xg2) :
x1 <- c(1.1, 3.4, 5.6, 1.2, 5, 6.4, 0.9, 7.2, 5.4, 3.1) # Original Variables
x2 <- c(10, 21, 25, 15.2, 18.9, 19, 16.2, 22.1, 18.6, 22)
y <- 2.0*x1 + 1.12*x2 + rnorm(10, mean=0, sd=0.2) # Define output variable
X <- data.frame(x1, x2)
lfit <- lm(y~., X) # fit model
n_fit <- lfit$coefficients
xg1 <- runif(15, 1, 10) # define new data
xg2 <- runif(15, 10, 30)
X <- data.frame(x1=xg1, x2=xg2) # put into data frame
y_guess <- predict(lfit, newdata=X) #Predict based on fit
y_actual <- 2.0*xg1 + 1.12*xg2 # actual values because I know the coefficients
y_pred = n_fit[1] + n_fit[2]*xg1 + n_fit[3]*xg2 # What predict should give me based on fit
> print(y_guess - y_actual) #difference check
1 2 3 4 5 6 7 8 9 10 11 12 13
-0.060223916 -0.047790535 -0.018274280 -0.096190467 -0.079490487 -0.063736231 -0.047506981 -0.009523583 -0.047774006 -0.084276807 -0.106322290 -0.030876942 -0.067232989
14 15
-0.023060651 -0.041264431
> print(y_guess - y_pred)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
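One way to guard against this kind of silent fallback is to fit the model with a data argument and to check the column names of the new data before predicting. The sketch below uses simulated data, and the names train and new_data are made up for illustration:

```r
set.seed(42)

# Simulated training data, kept inside a data frame
train <- data.frame(x1 = runif(10, 1, 10), x2 = runif(10, 10, 30))
train$y <- 2 * train$x1 + 1.12 * train$x2 + rnorm(10, sd = 0.2)

fit <- lm(y ~ x1 + x2, data = train)

# New data must use the same predictor names as the formula
new_data <- data.frame(x1 = runif(15, 1, 10), x2 = runif(15, 10, 30))

# Check that every predictor in the model is present in the new data
needed <- all.vars(delete.response(terms(fit)))
stopifnot(all(needed %in% names(new_data)))

pred <- predict(fit, newdata = new_data)

# Manual prediction from the coefficients, for comparison
manual <- drop(cbind(1, as.matrix(new_data[, needed])) %*% coef(fit))
print(all.equal(unname(pred), unname(manual)))
```

Since x1 and x2 exist only as columns of train here, a misnamed newdata would raise an "object not found" error instead of quietly returning the fitted values.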
