Confusion matrix needs to be clarified - R
Hi, I tried to print the confusion matrix for a dataset using R. Following are my results.
My class variable contains binary values. Medv is my class variable, binarized so that a house with a Medv value greater than 230k is coded 1, and 0 otherwise. At the end of the confusion matrix output, the positive class is reported as 0. What does this mean? Do these results misrepresent my data?
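For context, a hedged note: in caret, confusionMatrix() treats the first factor level as the positive class by default, and with a 0/1 outcome the level 0 sorts first, so "Positive class: 0" only means that sensitivity, specificity, etc. are computed with 0 as the event of interest; the counts in the table are unaffected. A minimal sketch of making 1 the positive class instead (assuming pred and test from the code below):

confusionMatrix(table(factor(pred, levels = 0:1),
                      factor(test$Medv, levels = 0:1)),
                positive = "1")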
My R code so far:
# Load CART packages
library(rpart)
library(rpart.plot)
library(caTools)
library(caret)
library(pROC)
housing_data = read.csv('housing.csv')
summary(housing_data)
housing_data = na.omit(housing_data)
# CART model
latlontree = rpart(Medv ~ Crim + Rm, data=housing_data , method = "class")
# Plot the tree using prp command defined in rpart.plot package
prp(latlontree)
# Split the data for Machine Learning
set.seed(123)
split = sample.split(housing_data$Medv, SplitRatio = 0.8)
train = subset(housing_data, split==TRUE)
test = subset(housing_data, split==FALSE)
#print (train)
#print (test)
# Create a CART model
tree = rpart(Medv ~ Crim + Zn + Indus + Chas + Nox + Rm + Age + Dis + Rad + Tax + Ptratio + B + Lstat, data = train, method = "class")
prp(tree)
#Decision tree prediction
#tree.pred = predict(tree, test)
pred = predict(tree,test, type="class")
#print (pred)
table(pred, test$Medv)
table(factor(pred, levels=min(test$Medv):max(test$Medv)),
factor(test$Medv, levels=min(test$Medv):max(test$Medv)))
# If p exceeds threshold of 0.5, M else R: m_or_r
#m_or_r <- ifelse(p > 0.5, 1, 0)
#print (m_or_r)
# Convert to factor: p_class
#p_class <- factor(m_or_r, levels = test$Medv)
# Create confusion matrix
confusionMatrix(table(factor(pred, levels=min(test$Medv):max(test$Medv)),
factor(test$Medv, levels=min(test$Medv):max(test$Medv))))
#print (tree.sse)
#ROC Curve
# Obtaining predicted probabilities for the test data
tree.probs = predict(tree, test, type = "prob")
head(tree.probs)
#Calculate ROC curve
rocCurve.tree <- roc(test$Medv,tree.probs[,2])
#plot the ROC curve
plot(rocCurve.tree,col=c(4))
auc <- auc(test$Medv, tree.probs[,2])
print(auc)
#creating a dataframe with a single row
x <- data.frame("Crim"= c(0.03), "Zn"=c(13), "Indus"=c(3.5), "Chas"=c(0.3), "Nox"=c(0.58), "Rm"=c(4.1), "Age"=c(68), "Dis"=c(4.98), "Rad" =c(3), "Tax"=c(225), "Ptratio"=c(17), "B"=c(396), "Lstat"=c(7.56))
# Obtaining predicted probabilities for the single-row data frame
probability2 = predict(tree, x, type = "prob")
print(probability2)
# Obtaining the predicted class for the single-row data frame
probability3 = predict(tree, x, type = "class")
print(probability3)
[Image of the dataset]
Related
Stata vs. R: Delta Method provides different results for relative risk SE's from logit model
I've been trying to estimate the conditional mean treatment effect of covariates in a logit regression (using relative risk) along with their standard errors for inference purposes. The delta method is necessary to calculate the standard errors for these treatment effects. I've been trying to recreate, in R, the results of the Stata user-written command adjrr, which calculates the adjusted relative risk (the conditional mean treatment effect of interest for my project) and its SEs using the delta method. The deltamethod command in R should produce the same results, but this is not the case. How can I replicate the results from Stata in R? I used the following self-generated data (https://migariane.github.io/DeltaMethodEpiTutorial.nb.html).

R code below:

generateData <- function(n, seed){
  set.seed(seed)
  age <- rnorm(n, 65, 5)
  age65p <- ifelse(age >= 65, T, F)
  cmbd <- rbinom(n, size = 1, prob = plogis(1 - 0.05 * age))
  Y <- rbinom(n, size = 1, prob = plogis(1 - 0.02 * age - 0.02 * cmbd))
  data.frame(Y, cmbd, age, age65p)
}
# Describing the data
data <- generateData(n = 1000, seed = 777)
str(data)
logfit <- glm(Y ~ age65p + cmbd, data = data, family = binomial)
summary(logfit)
p1 <- predict(logfit, newdata = data.frame(age65p = T, cmbd = 0), type = "response")
p0 <- predict(logfit, newdata = data.frame(age65p = F, cmbd = 0), type = "response")
rr <- p1 / p0
rr # 0.8123348
library(msm)
se_rr_delta <- deltamethod(~ (1 + exp(-x1)) / (1 + exp(-x1 - x2)), coef(logfit), vcov(logfit))
se_rr_delta # 0.6314798

Stata code (using the same data):

logit Y i.age65p i.cmbd
adjrr age65p

// results below
R1 = 0.3685 (0.0218), 95% CI (0.3259, 0.4112)
R0 = 0.4524 (0.0222), 95% CI (0.4090, 0.4958)
ARR = 0.8146 (0.0626), 95% CI (0.7006, 0.9471)
ARD = -0.0839 (0.0311), 95% CI (-0.1449, -0.0229)
p-value (R0 = R1): 0.0071
p-value (ln(R1/R0) = 0): 0.0077
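For reference, a minimal sketch of the same delta-method standard error computed by hand (an addition, assuming logfit from the code above and the numDeriv package for the numerical gradient; the names g and se_rr_manual are illustrative):

# delta method by hand: Var(g(beta)) is approximated by grad(g)' %*% Sigma %*% grad(g)
g <- function(b) plogis(b[1] + b[2]) / plogis(b[1]) # RR as a function of the coefficients
grad_g <- numDeriv::grad(g, coef(logfit))           # numerical gradient at the estimates
se_rr_manual <- sqrt(drop(t(grad_g) %*% vcov(logfit) %*% grad_g))
se_rr_manual

This should agree with msm::deltamethod when the transformation formula matches, so comparing it against both outputs is one way to localize where the discrepancy with Stata arises.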
How to build an Exponential Smoothing Model
My data:

library(forecast)
library(Mcomp)
# Time Series
# Subset the M3 data to contain the relevant series
ts.data <- subset(M3, 12)[[551]]
print(ts.data)

I have selected time series 551 of the monthly data of the M3 competition. I want to build an exponential smoothing model (ETS) and then calculate the in-sample error and out-of-sample error of the model. How can I get through this? Any help?

EDITED! My code:

# Exponential Smoothing Model
library(forecast)
library(Mcomp)
# My data is
# Time Series
# Subset the M3 data to contain the relevant series
ts.data <- subset(M3, 12)[[551]]
print(ts.data)
library(tidyverse)
library(fpp2)
# Holt's Method
# create training and validation sets of the data
data.train <- window(???)
data.test <- window(???)
holt.data <- holt(data.train, h = 100)
autoplot(holt.data)
# holt's method
holt.data$model
# accuracy of the model
accuracy(holt.data, data.test)
# try to find the optimal value of beta through a loop ranging from 0.0001 to 0.5 that will minimize the RMSE test
# identify optimal beta parameter
beta <- seq(.0001, .5, by = .001)
RMSE <- NA
for(i in seq_along(beta)) {
  fit <- holt(data.train, beta = beta[i], h = 100)
  RMSE[i] <- accuracy(fit, data.test)[2,2]
}
# convert to a data frame and identify the min beta value
beta.fit <- data_frame(beta, RMSE)
beta.min <- filter(beta.fit, RMSE == min(RMSE))
# plot RMSE vs. beta
ggplot(beta.fit, aes(beta, RMSE)) +
  geom_line() +
  geom_point(data = beta.min, aes(beta, RMSE), size = 2, color = "red")
# Refit the model with the obtained optimal value of beta
# and compare the predictive accuracy with our original model.
# new model with optimal beta
holt.data.opt <- holt(data.train, h = 100, beta = 0.0601)
# accuracy of first model
accuracy(holt.data, data.test)
# accuracy of new optimal model
accuracy(holt.data.opt, data.test)
p1 <- autoplot(holt.data) +
  ggtitle("Original Holt's Model") +
  coord_cartesian(ylim = c(400, 1000))
p2 <- autoplot(holt.data.opt) +
  ggtitle("Optimal Holt's Model") +
  coord_cartesian(ylim = c(400, 1000))
gridExtra::grid.arrange(p1, p2, nrow = 1)

My problem is that I can't create my data.train and data.test samples:

data.train <- window(???)
data.test <- window(???)
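A note on the split (a sketch, inferred from the Mcomp structure used in the answer below): each M3 series already comes divided into an in-sample part x and an out-of-sample part xx, so window() is not needed at all:

data.train <- ts.data$x  # in-sample (training) part of series 551
data.test <- ts.data$xx  # out-of-sample (test) part

The rest of the posted code can then use these two objects directly.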
From this good stack exchange answer you could do:

library(forecast)
library(Mcomp)
# Time Series
# Subset the M3 data to contain the relevant series
ts.data <- subset(M3, 12)[[551]]
ts.data
mod1 <- HoltWinters(ts.data$x, alpha = 0.1, beta = FALSE, gamma = FALSE)
pred <- predict(mod1, n.ahead = 30)
abs_error <- abs(pred - ts.data$xx)
mae <- sum(abs_error) / 30
mae
# with forecast
mod2 <- forecast::ses(ts.data$x, h = 30, alpha = 0.1, initial = "simple")
pred2 <- predict(mod2, n.ahead = 30)
abs_error2 <- abs(pred2$mean - ts.data$xx)
mae2 <- sum(abs_error2) / 30
mae2
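As a side note (an addition, not part of the original answer), the forecast package can also compute these error measures directly; accuracy() reports MAE alongside RMSE, MAPE, and others for both the training and test periods:

# a sketch, assuming mod2 and ts.data from the answer above
forecast::accuracy(mod2, ts.data$xx)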
How to compute log loss in machine learning
The following code is used to produce the probability output of binary classification with Random Forest:

library(randomForest)
rf <- randomForest(train, train_label, importance = TRUE, proximity = TRUE)
prediction <- predict(rf, test, type = "prob")

The result of prediction is a probability matrix (shown as an image in the original post). The true labels of the test data are known (named test_label). Now I want to compute the logarithmic loss for the probability output of the binary classification. The function for LogLoss is as follows:

LogLoss = function(actual, predicted) {
  result = -1/length(actual) * (sum((actual*log(predicted) + (1-actual)*log(1-predicted))))
  return(result)
}

How do I compute the logarithmic loss with the probability output of the binary classification? Thank you.
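A minimal sketch of the call (an addition, assuming test_label is coded 0/1 and that the second column of prediction holds the probability of the positive class):

LogLoss(actual = test_label, predicted = prediction[, 2])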
library(randomForest)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE, proximity = TRUE)
prediction <- predict(rf, iris, type = "prob")
# bound the results, otherwise you might get infinity results
prediction <- apply(prediction, c(1, 2), function(x) min(max(x, 1E-15), 1 - 1E-15))
# model.matrix generates a true probabilities matrix, where an element is either 1 or 0
# we subtract the prediction, and, if the result is bigger than 0, that's the correct class
logLoss = function(pred, actual){
  -1 * mean(log(pred[model.matrix(~ actual + 0) - pred > 0]))
}
logLoss(prediction, iris$Species)
I think the logLoss formula is a little bit wrong.

model <- glm(vs ~ mpg, data = mtcars, family = "binomial")

### OP's formula (wrong)
logLoss1 <- function(pred, actual){
  -1 * mean(log(pred[model.matrix(~ actual + 0) - pred > 0]))
}
logLoss1(actual = model$y, pred = model$fitted.values)
# [1] 0.4466049

### Correct formula in native R
logLoss2 <- function(pred, actual){
  -mean(actual * log(pred) + (1 - actual) * log(1 - pred))
}
logLoss2(actual = model$y, pred = model$fitted.values)
# [1] 0.3989584

## Results from various packages to verify the correct answer

### From the ModelMetrics package
ModelMetrics::logLoss(actual = model$y, pred = model$fitted.values)
# [1] 0.3989584

### From the MLmetrics package
MLmetrics::LogLoss(y_pred = model$fitted.values, y_true = model$y)
# [1] 0.3989584

### From the reticulate package (calling scikit-learn)
library(reticulate)
sklearn.metrics <- import("sklearn.metrics")
sklearn.metrics$log_loss(y_true = model$y, y_pred = model$fitted.values)
# [1] 0.3989584

I used R version 4.1.0 (2021-05-18).
Getting errors while using the neuralnet function
I tried neuralnet in R on the Boston data set:

data("Boston", package = "MASS")
data <- Boston

Retaining only those variables we want to use:

keeps <- c("crim", "indus", "nox", "rm", "age", "dis", "tax", "ptratio", "lstat", "medv")
data <- data[keeps]

In this case the formula is stored in an R object called f. The response variable medv is to be "regressed" against the remaining nine attributes. I have done it as below:

f <- medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat

To set up the training sample, 400 of the 506 rows of data are drawn without replacement using the sample method:

set.seed(2016)
n = nrow(data)
train <- sample(1:n, 400, FALSE)

The neuralnet function of R is fitted:

library(neuralnet)
fit <- neuralnet(f, data = data[train,], hidden = c(10, 12, 20),
                 algorithm = "rprop+", err.fct = "sse",
                 act.fct = "logistic", threshold = 0.1,
                 linear.output = TRUE)

But a warning message is displayed saying the algorithm did not converge:

Warning message:
algorithm did not converge in 1 of 1 repetition(s) within the stepmax

I tried prediction using compute:

pred <- compute(fit, data[-train, 1:9])

The following error message is displayed:

Error in nrow[w] * ncol[w] : non-numeric argument to binary operator
In addition: Warning message:
In is.na(weights) : is.na() applied to non-(list or vector) of type 'NULL'

Why is this error coming up, and how can I recover from it for prediction? I want to use the neuralnet function on that data set.
When neuralnet doesn't converge, the resulting neural network is not complete. You can tell by calling attributes(fit)$names. When training converges, it will look like this:

[1] "call"                "response"            "covariate"           "model.list"          "err.fct"
[6] "act.fct"             "linear.output"       "data"                "net.result"          "weights"
[11] "startweights"       "generalized.weights" "result.matrix"

When it doesn't, some attributes will not be defined:

[1] "call"          "response"      "covariate"     "model.list"    "err.fct"       "act.fct"       "linear.output"
[8] "data"

That explains why compute doesn't work. When training doesn't converge, the first possible solution could be to increase stepmax (default 100000). You can also add lifesign = "full" to get better insight into the training process. Also, looking at your code, I would say three layers with 10, 12 and 20 neurons is too much. I would start with one layer with the same number of neurons as the number of inputs, in your case 9.

EDIT: With scaling (remember to scale both training and test data, and to 'de-scale' compute results), it converges much faster. Also note that I reduced the number of layers and neurons, and still lowered the error threshold.

data("Boston", package = "MASS")
data <- Boston
keeps <- c("crim", "indus", "nox", "rm", "age", "dis", "tax", "ptratio", "lstat", "medv")
data <- data[keeps]
f <- medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat
set.seed(2016)
n = nrow(data)
train <- sample(1:n, 400, FALSE)
# Scale data. Scaling parameters are stored in this matrix for later.
scaledData <- scale(data)
fit <- neuralnet::neuralnet(f, data = scaledData[train,], hidden = 9,
                            algorithm = "rprop+", err.fct = "sse",
                            act.fct = "logistic", threshold = 0.01,
                            linear.output = TRUE, lifesign = "full")
pred <- neuralnet::compute(fit, scaledData[-train, 1:9])
scaledResults <- pred$net.result * attr(scaledData, "scaled:scale")["medv"] +
  attr(scaledData, "scaled:center")["medv"]
cleanOutput <- data.frame(Actual = data$medv[-train],
                          Prediction = scaledResults,
                          diff = abs(scaledResults - data$medv[-train]))
# Show some results
summary(cleanOutput)
The problem seems to be in your argument linear.output = TRUE. With your data, but changing the code a bit (not defining the formula separately and adding some explanatory comments):

library(neuralnet)
fit <- neuralnet(formula = medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat,
                 data = data[train,],
                 hidden = c(10, 12, 20),  # number of vertices (neurons) in each hidden layer
                 algorithm = "rprop+",    # resilient backprop with weight backtracking
                 err.fct = "sse",         # calculates error based on the sum of squared errors
                 act.fct = "logistic",    # smoothing the cross product of neurons and weights with the logistic function
                 threshold = 0.1,         # stopping criterion on the partial derivatives of the error function
                 linear.output = FALSE)   # act.fct applied to output neurons
print(fit)

Call: neuralnet(formula = medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat, data = data[train, ], hidden = c(10, 12, 20), threshold = 0.1, rep = 10, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = FALSE)

10 repetitions were calculated.

          Error Reached Threshold Steps
1  108955.0318     0.03436116236     4
5  108955.0339     0.01391790099     8
3  108955.0341     0.02193379592     3
9  108955.0371     0.01705056758     6
8  108955.0398     0.01983134293     8
4  108955.0450     0.02500006437     5
6  108955.0569     0.03689097762     5
7  108955.0677     0.04765829189     5
2  108955.0705     0.05052776877     5
10 108955.1103     0.09031966778     7

# now compute will work
pred <- compute(fit, data[-train, 1:9])
predict() R function caret package errors: "newdata" rows different, "type" not accepted
I am running a logistic regression analysis using the caret package. Data is input as an 18x6 matrix. Everything is fine so far except the predict() function. R is telling me the type parameter is supposed to be raw or prob, but raw just spits out an exact copy of the last column (the values of the binomial variable). prob gives me the following error:

Error in dimnames(out)[[2]] <- modelFit$obsLevels :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
'newdata' had 7 rows but variables found have 18 rows

install.packages("pbkrtest")
install.packages("caret")
install.packages('e1071', dependencies = TRUE)
#install.packages('caret', dependencies = TRUE)
require(caret)
library(caret)
A = matrix(
  c(64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946,
    66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627,
    68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755,
    69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500,
    73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500,
    1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1),
  nrow = 18, ncol = 6, byrow = FALSE) # "bycol" does NOT exist
################### data set as vectors
a <- c(64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946)
b <- c(66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627)
c <- c(68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755)
d <- c(69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500)
e <- c(73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500)
f <- c(1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1)
######################
n <- nrow(A); K <- ncol(A) - 1;
Train <- createDataPartition(f, p = 0.6, list = FALSE) # 60% of data set is used as training.
training <- A[Train, ]
testing <- A[-Train, ]
nrow(training)
# this is the logistic formula:
# estimates from logistic regression characterize the relationship between the predictor and response variable on a log-odds scale
mod_fit <- train(f ~ a + b + c + d + e, data = training, method = "glm", family = "binomial")
mod_fit
# this is the exponential function to calculate the odds ratios for each predictor:
exp(coef(mod_fit$finalModel))
predict(mod_fit, newdata = training)
predict(mod_fit, newdata = testing, type = "prob")
I'm not very sure I understand, but A is already a matrix of (a, b, c, d, e, f), so you don't need to create the two sets of objects:

install.packages("pbkrtest")
install.packages("caret")
install.packages('e1071', dependencies = TRUE)
#install.packages('caret', dependencies = TRUE)
require(caret)
library(caret)
A = matrix(
  c(64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946,
    66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627,
    68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755,
    69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500,
    73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500,
    1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1),
  nrow = 18, ncol = 6, byrow = FALSE) # "bycol" does NOT exist
A <- data.frame(A)
colnames(A) <- c('a','b','c','d','e','f')
A$f <- as.factor(A$f)
Train <- createDataPartition(A$f, p = 0.6, list = FALSE) # 60% of data set is used as training.
training <- A[Train, ]
testing <- A[-Train, ]
nrow(training)

And to predict a variable you must supply the explanatory variables, not the variable to be predicted:

mod_fit <- train(f ~ a + b + c + d + e, data = training, method = "glm", family = "binomial")
mod_fit
# this is the exponential function to calculate the odds ratios for each predictor:
exp(coef(mod_fit$finalModel))
predict(mod_fit, newdata = training[, -which(colnames(training) == "f")])
predict(mod_fit, newdata = testing[, -which(colnames(testing) == "f")])
Short answer: you should not include the explained variable, which is f, in your predict equation. So you should do:

predict(mod_fit, newdata = training[, -ncol(training)])
predict(mod_fit, newdata = testing[, -ncol(testing)])

The issue with the warning message 'newdata' had 11 rows but variables found have 18 rows is that you run the regression using the whole data set (18 observations) but predict using just part of it (either 11 or 7).

EDIT: To simplify the data creation and glm processes we can do:

library(caret)
A <- data.frame(a = c(64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946),
                b = c(66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627),
                c = c(68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755),
                d = c(69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500),
                e = c(73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500),
                f = c(1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1))
Train <- createDataPartition(A$f, p = 0.6, list = FALSE) # 60% of data set is used as training.
training <- A[Train, ]
testing <- A[-Train, ]
mod_fit <- train(f ~ a + b + c + d + e, data = training, method = "glm", family = "binomial")
I am trying to run a logistic regression model. I wrote this code:

install.packages('caret')
library(caret)
setwd('C:\\Users\\BAHOZ\\Documents\\')
D <- read.csv(file = "D.csv", header = T)
D <- read.csv(file = 'DataSet.csv', header = T)
names(D)
set.seed(111134)
Train <- createDataPartition(D$X, p = 0.7, list = FALSE)
training <- D[Train,]
length(training$age)
testing <- D[-Train,]
length(testing$age)
mod_fit <- train(X ~ age + gender + total.Bilirubin + direct.Bilirubin + total.proteins + albumin + A.G.ratio + SGPT + SGOT + Alkphos, data = training, method = "glm", family = "binomial")
summary(mod_fit)
exp(coef(mod_fit$finalModel))

And I received this output for the last command:

    (Intercept)             age          gender total.Bilirubin direct.Bilirubin  total.proteins         albumin       A.G.ratio
     0.01475027      1.01596886      1.03857883      1.00022899       1.78188072      1.00065332      1.01380334      1.00115742
           SGPT            SGOT         Alkphos
     3.93498241      0.05616662     38.29760014

By running this command I could predict my data:

predict(mod_fit, newdata = testing)

But if I set type = "prob" or type = "raw":

predict(mod_fit, newdata = testing, type = "prob")

it fails with the error:

Error in dimnames(out) <- *vtmp* :
  length of 'dimnames' [2] not equal to array extent
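A hedged note (an addition, not verified against this data set): caret's type = "prob" predictions require the classification outcome to be a factor whose levels are valid R variable names, so a numeric or 0/1-labelled X often triggers exactly this dimnames error. A minimal sketch of the usual fix, with hypothetical labels "no" and "yes":

# convert the outcome to a factor with valid level names before training
D$X <- factor(D$X, labels = c("no", "yes")) # labels are hypothetical
Train <- createDataPartition(D$X, p = 0.7, list = FALSE)
training <- D[Train,]
testing <- D[-Train,]
mod_fit <- train(X ~ age + gender + total.Bilirubin + direct.Bilirubin + total.proteins + albumin + A.G.ratio + SGPT + SGOT + Alkphos, data = training, method = "glm", family = "binomial")
predict(mod_fit, newdata = testing, type = "prob") # one column per class level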