The following code is used to produce the probability output of a binary classification with Random Forest.
library(randomForest)
rf <- randomForest(train, train_label, importance=TRUE, proximity=TRUE)
prediction <- predict(rf, test, type="prob")
The result, prediction, is a two-column matrix of class probabilities.
The true labels of the test data are known (named test_label). Now I want to compute the logarithmic loss for the probability output of the binary classification. My LogLoss function is as follows:
LogLoss = function(actual, predicted)
{
  result = -1/length(actual) * sum(actual*log(predicted) + (1-actual)*log(1-predicted))
  return(result)
}
How do I compute the logarithmic loss from the probability output of a binary classification? Thank you.
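For concreteness, here is a minimal sketch of how the LogLoss function above could be applied to the random forest output, assuming test_label is a two-level factor whose second level is the positive class (train, test, and test_label are the names from the question):
prob_pos <- prediction[, 2]                 # P(second factor level)
actual <- as.numeric(test_label) - 1        # two-level factor -> 0/1
# clip the probabilities to avoid log(0)
LogLoss(actual, pmin(pmax(prob_pos, 1e-15), 1 - 1e-15))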
library(randomForest)
rf <- randomForest(Species~., data = iris, importance=TRUE, proximity=TRUE)
prediction <- predict(rf, iris, type="prob")
#bound the results, otherwise you might get infinity results
prediction <- apply(prediction, c(1,2), function(x) min(max(x, 1E-15), 1-1E-15))
#model.matrix generates a true probabilities matrix, where an element is either 1 or 0
#we subtract the prediction, and, if the result is bigger than 0 that's the correct class
logLoss = function(pred, actual){
  -1*mean(log(pred[model.matrix(~ actual + 0) - pred > 0]))
}
logLoss(prediction, iris$Species)
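For intuition about the indexing trick: model.matrix(~ actual + 0) one-hot encodes the factor, so subtracting the prediction matrix leaves a positive entry exactly at the true-class position (the bounded predictions lie strictly between 0 and 1). A tiny standalone demonstration:
y <- factor(c("setosa", "versicolor", "setosa"))
model.matrix(~ y + 0)
#   ysetosa yversicolor
# 1       1           0
# 2       0           1
# 3       1           0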
I think the logLoss formula is a little bit wrong.
model <- glm(vs ~ mpg, data = mtcars, family = "binomial")
### OP's formula (Wrong)
logLoss1 <- function(pred, actual){
  -1*mean(log(pred[model.matrix(~ actual + 0) - pred > 0]))
}
logLoss1(actual = model$y, pred = model$fitted.values)
# [1] 0.4466049
### Correct formula in native R
logLoss2 <- function(pred, actual){
  -mean(actual * log(pred) + (1 - actual) * log(1 - pred))
}
logLoss2(actual = model$y, pred = model$fitted.values)
# [1] 0.3989584
## Results from various packages to verify the correct answer
### From ModelMetrics package
ModelMetrics::logLoss(actual = model$y, pred = model$fitted.values)
# [1] 0.3989584
### From MLmetrics package
MLmetrics::LogLoss(y_pred = model$fitted.values, y_true = model$y)
# [1] 0.3989584
### From reticulate package
library(reticulate)
sklearn.metrics <- import("sklearn.metrics")
sklearn.metrics$log_loss(y_true = model$y, y_pred = model$fitted.values)
# [1] 0.3989584
I used R version 4.1.0 (2021-05-18).
I need to bootstrap a beta regression model to check its robustness, because of a data point with a large Cook's distance, with the boot package (other suggestions welcome).
I have the following error:
Error in t.star[r, ] <- res[[r]] :
incorrect number of subscripts on matrix
Here's a reproducible example:
library(betareg)
library(boot)
fake_data <- data.frame(diet = as.factor(c(rep("A",10), rep("B",10))),
                        fat = c(runif(10,.1,.5), runif(10,.4,.9)))
plot(fat~diet, data = fake_data)
my_beta_reg <- function(data, i){
  data_i <- data[i,]
  mod <- betareg(data_i[,"fat"] ~ data_i[,"diet"])
  return(mod$coef)
}
b = boot(fake_data, statistic = my_beta_reg, R= 50)
Error in t.star[r, ] <- res[[r]] :
incorrect number of subscripts on matrix
What's the issue?
Thanks in advance.
The issue is that mod$coef is a list:
betareg(fat ~ diet, data = fake_data)$coef
#$mean
#(Intercept) dietB
# -1.275793 2.490126
#
#$precision
# (phi)
#20.59014
You need to unlist it, or preferably use coef(), the function intended for extracting coefficients:
my_beta_reg <- function(data, i){
  mod <- betareg(fat ~ diet, data = data[i,])
  #unlist(mod$coef)
  coef(mod)
}
b = boot(fake_data, statistic = my_beta_reg, R= 50)
print(b)
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = fake_data, statistic = my_beta_reg, R = 50)
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* -1.275793 -0.019847377 0.2003523
#t2* 2.490126 0.009008892 0.2314521
#t3* 20.590142 8.265394485 17.2271497
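From here, per-coefficient bootstrap confidence intervals can be pulled from the boot object; a minimal sketch (the index argument follows the t1*, t2*, t3* order printed above):
boot.ci(b, conf = 0.95, type = "perc", index = 1)  # (Intercept), mean model
boot.ci(b, conf = 0.95, type = "perc", index = 2)  # dietB
boot.ci(b, conf = 0.95, type = "perc", index = 3)  # precision (phi)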
Hi, I tried to print the confusion matrix for a dataset using R; following are my results.
My class variable contains binary values: Medv is my class variable, the house value binarized to 1 when greater than 230k and 0 otherwise. At the end, the confusion matrix reports the positive class as 0. What does this mean? Do these results misrepresent my data?
My R code so far:
# Load CART packages
library(rpart)
library(rpart.plot)
library(caTools)
library(caret)
library(pROC)
housing_data = read.csv('housing.csv')
summary(housing_data)
housing_data = na.omit(housing_data)
# CART model
latlontree = rpart(Medv ~ Crim + Rm, data=housing_data , method = "class")
# Plot the tree using prp command defined in rpart.plot package
prp(latlontree)
# Split the data for Machine Learning
set.seed(123)
split = sample.split(housing_data$Medv, SplitRatio = 0.8)
train = subset(housing_data, split==TRUE)
test = subset(housing_data, split==FALSE)
#print (train)
#print (test)
# Create a CART model
tree = rpart(Medv ~ Crim + Zn + Indus + Chas + Nox + Rm + Age + Dis + Rad + Tax + Ptratio + B + Lstat , data=train , method = "class")
prp(tree)
#Decision tree prediction
#tree.pred = predict(tree, test)
pred = predict(tree,test, type="class")
#print (pred)
table(pred, test$Medv)
table(factor(pred, levels=min(test$Medv):max(test$Medv)),
      factor(test$Medv, levels=min(test$Medv):max(test$Medv)))
# If p exceeds threshold of 0.5, M else R: m_or_r
#m_or_r <- ifelse(p > 0.5, 1, 0)
#print (m_or_r)
# Convert to factor: p_class
#p_class <- factor(m_or_r, levels = test$Medv)
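# A runnable version of the two commented steps above, as a sketch:
# p would be the positive-class probability column, e.g.
#p <- predict(tree, test, type="prob")[,2]
#m_or_r <- ifelse(p > 0.5, 1, 0)
#p_class <- factor(m_or_r, levels = c(0, 1))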
# Create confusion matrix
confusionMatrix(table(factor(pred, levels=min(test$Medv):max(test$Medv)),
                      factor(test$Medv, levels=min(test$Medv):max(test$Medv))))
#print (tree.sse)
#ROC Curve
#Obtaining predicted probabilites for Test data
tree.probs = predict(tree, test, type="prob")
head(tree.probs)
#Calculate ROC curve
rocCurve.tree <- roc(test$Medv,tree.probs[,2])
#plot the ROC curve
plot(rocCurve.tree,col=c(4))
auc <- auc(test$Medv, tree.probs[,2])
print (auc)
#creating a dataframe with a single row
x <- data.frame("Crim"= c(0.03), "Zn"=c(13), "Indus"=c(3.5), "Chas"=c(0.3), "Nox"=c(0.58), "Rm"=c(4.1), "Age"=c(68), "Dis"=c(4.98), "Rad" =c(3), "Tax"=c(225), "Ptratio"=c(17), "B"=c(396), "Lstat"=c(7.56))
#Obtaining predicted probabilites for Test data
probability2 = predict(tree, x, type="prob")
print (probability2)
#Obtaining predicted class for Test data
probability3 = predict(tree, x, type="class")
print (probability3)
[Image of the dataset]
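As a side note on the positive-class question: caret's confusionMatrix() treats the first factor level as the positive class by default, and its positive argument overrides that. A minimal sketch, assuming the 1 level (Medv greater than 230k) should be treated as positive:
confusionMatrix(table(factor(pred, levels = 0:1),
                      factor(test$Medv, levels = 0:1)),
                positive = "1")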
I am trying to get the five types of bootstrap intervals for linear and quantile regression. I was able to bootstrap and find the five bootstrap intervals (quantile, normal, basic, studentized, and BCa) for the linear regression using Boot from car and boot.ci from boot. When I tried to do the same for quantile regression using rq from quantreg, it throws up an error. Here is the sample code.
Creating the model
library(car)
library(quantreg)
library(boot)
newdata = Prestige[,c(1:4)]
education.c = scale(newdata$education, center=TRUE, scale=FALSE)
prestige.c = scale(newdata$prestige, center=TRUE, scale=FALSE)
women.c = scale(newdata$women, center=TRUE, scale=FALSE)
new.c.vars = cbind(education.c, prestige.c, women.c)
newdata = cbind(newdata, new.c.vars)
names(newdata)[5:7] = c("education.c", "prestige.c", "women.c" )
mod1 = lm(income ~ education.c + prestige.c + women.c, data=newdata)
mod2 = rq(income ~ education.c + prestige.c + women.c, data=newdata)
Booting linear and quantile regression
mod1.boot <- Boot(mod1, R=999)
boot.ci(mod1.boot, level = .95, type = "all")
dat2 <- newdata[5:7]
mod2.boot <- boot.rq(cbind(1,dat2),newdata$income,tau=0.5, R=10000)
boot.ci(mod2.boot, level = .95, type = "all")
Error in if (ncol(boot.out$t) < max(index)) { :
argument is of length zero
1) Why does boot.ci not work for quantile regression?
2) Using this solution I got from Stack Exchange, I was able to find the quantile (percentile) CI for rq:
t(apply(mod2.boot$B, 2, quantile, c(0.025,0.975)))
How do I obtain the other bootstrap CIs (normal, basic, studentized, BCa)?
3) Also, my boot.ci command for linear regression produces this warning
Warning message:
In sqrt(tv[, 2L]) : NaNs produced
What does this signify?
Using summary.rq you can calculate bootstrap standard errors of the model coefficients.
Five bootstrap methods (bsmethods) are available (see ?boot.rq).
summary(mod2, se = "boot", bsmethod= "xy")
# Call: rq(formula = income ~ education.c + prestige.c + women.c, data = newdata)
#
# tau: [1] 0.5
#
# Coefficients:
# Value Std. Error t value Pr(>|t|)
# (Intercept) 6542.83599 139.54002 46.88860 0.00000
# education.c 291.57468 117.03314 2.49139 0.01440
# prestige.c 89.68050 22.03406 4.07009 0.00010
# women.c -48.94856 5.79470 -8.44712 0.00000
To calculate bootstrap confidence intervals, you can use the following trick:
mod1.boot <- Boot(mod1, R=999)
set.seed(1234)
boot.ci(mod1.boot, level = .95, type = "all")
dat2 <- newdata[5:7]
set.seed(1234)
mod2.boot <- boot.rq(cbind(1,dat2),newdata$income,tau=0.5, R=10000)
# Create an object with the same structure as mod1.boot
# but with bootstrap replicates given by boot.rq
mod3.boot <- mod1.boot
mod3.boot$R <- 10000
mod3.boot$t0 <- coef(mod2)
mod3.boot$t <- mod2.boot$B
boot.ci(mod3.boot, level = .95, type = "all")
# BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
# Based on 10000 bootstrap replicates
#
# CALL :
# boot.ci(boot.out = mod3.boot, type = "all", level = 0.95)
#
# Intervals :
# Level Normal Basic Studentized
# 95% (6293, 6838 ) (6313, 6827 ) (6289, 6941 )
#
# Level Percentile BCa
# 95% (6258, 6772 ) (6275, 6801 )
Thanks to everyone who helped. I was able to figure out the solution myself: I ran a loop calculating the coefficients of the quantile regression and then used boot and boot.ci respectively. Here is the code.
Booting commands only; model creation is in the question.
mod3 <- formula(income ~ education.c + prestige.c + women.c)
coefsf <- function(data, ind){
  rq(mod3, data = data[ind,])$coef
}
boot.mod <- boot(newdata, coefsf, R = 10000)
myboot.ci <- list()
for (i in 1:ncol(boot.mod$t)){
  myboot.ci[[i]] <- boot.ci(boot.mod, level = .95,
                            type = c("norm","basic","perc", "bca"), index = i)
}
I did this as I wanted CIs on all variables, not just the intercept.
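To pull out the intervals for one coefficient by name, the list can be labelled after the model terms; a small convenience sketch (coefficient names taken from the mod3 formula above):
names(myboot.ci) <- c("(Intercept)", "education.c", "prestige.c", "women.c")
myboot.ci[["education.c"]]  # normal, basic, percentile and BCa intervals for education.c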
I'm trying to use the naiveBayes() function from the e1071 package. When I add a non-zero laplace argument, my resulting probability estimates are not changing and I don't understand why.
Example:
library(e1071)
# Generate data
train.x <- data.frame(x1=c(1,1,0,0), x2=c(1,0,1,0))
train.y <- factor(c("cat", "cat", "dog", "dog"))
test.x <- data.frame(x1=c(1), x2=c(1))
# without laplace smoothing
classifier <- naiveBayes(x=train.x, y=train.y, laplace=0)
predict(classifier, test.x, type="raw") # returns (1, 0.00002507)
# with laplace smoothing
classifier <- naiveBayes(x=train.x, y=train.y, laplace=1)
predict(classifier, test.x, type="raw") # returns (1, 0.00002507)
I would expect the probabilities to change in this case since all the training instances for the "dog" class have 0 for x1. To check this, here's the same thing using Python
Python example:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
train_x = pd.DataFrame({'x1':[1,1,0,0], 'x2':[1,0,1,0]})
train_y = np.array(["cat", "cat", "dog", "dog"])
test_x = pd.DataFrame({'x1':[1,], 'x2':[1,]})
# alpha ~ 0 (i.e. laplace = 0)
classifier = BernoulliNB(alpha=.00000001)
classifier.fit(X=train_x, y=train_y)
classifier.predict_proba(X=test_x) # returns (1, 0)
# alpha = 1 (i.e. laplace = 1)
classifier = BernoulliNB(alpha=1)
classifier.fit(X=train_x, y=train_y)
classifier.predict_proba(X=test_x) # returns (.75, .25)
Why am I getting this unexpected result using e1071?
Laplace smoothing is only applied to categorical features, not numerical ones. You can see in the source code:
## estimation-function
est <- function(var)
  if (is.numeric(var)) {
    cbind(tapply(var, y, mean, na.rm = TRUE),
          tapply(var, y, sd, na.rm = TRUE))
  } else {
    tab <- table(y, var)
    (tab + laplace) / (rowSums(tab) + laplace * nlevels(var))
  }
that Gaussian estimates are used for numerical values. So convert your data to factors and you are good to go:
train.x <- data.frame(x1=c("1","1","0","0"), x2=c("1","0","1","0"))
train.y <- factor(c("cat", "cat", "dog", "dog"))
test.x <- data.frame(x1=c("1"), x2=c("1"))
# without laplace smoothing
classifier <- naiveBayes(x=train.x, y=train.y, laplace=0)
predict(classifier, test.x, type="raw") # returns ~100% for cat
# with laplace smoothing
classifier <- naiveBayes(x=train.x, y=train.y, laplace=1)
predict(classifier, test.x, type="raw") # returns 75% for cat, 25% for dog
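To see where the 75/25 split comes from, the smoothed conditional table can be reproduced by hand from the est() formula above; a small sketch using the factor-encoded training data:
tab <- table(train.y, train.x$x1)
(tab + 1) / (rowSums(tab) + 1 * nlevels(train.x$x1))
# P(x1 = "1" | dog) becomes (0 + 1) / (2 + 2) = 0.25 instead of 0;
# combined with P(x2 = "1" | class) = 0.5 for both classes, the
# posterior works out to 0.75 for cat and 0.25 for dog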
Major facepalm on this one. The naiveBayes() method was interpreting x1 and x2 as numeric variables and thus trying to use Gaussian conditional probability distributions internally (I think). Encoding my variables as factors solved my problem.
train.x <- data.frame(x1=factor(c(1,1,0,0)), x2=factor(c(1,0,1,0)))
train.y <- factor(c("cat", "cat", "dog", "dog"))
test.x <- data.frame(x1=factor(c(1)), x2=factor(c(1)))
I'm using the fGarch package in R in order to fit an ARMA-GARCH(1,1) model to a time series. I want to extract the standardized residuals, meaning the residuals divided by the corresponding daily volatility estimate. I tried two things:
res <- residuals(m1, standardize=FALSE)
vol <- volatility(m1)
stand.res <- res/vol
and
stand.res <- residuals(m1, standardize=TRUE)
If I plot both results, they differ from each other. Why is that?
Thank you very much.
I'm having a similar problem; please consider:
rm(list=ls(all=TRUE))
library(fGarch)
set.seed(4)
x <- runif(6587, -0.10, 0.10)
gfit <- garchFit(formula = ~ garch(2,2), cond.dist = "std", data = x, include.shape=TRUE, trace=FALSE)
and
condVar <- gfit@h.t
resid <- x / sqrt(condVar)
tail(resid) # Standardized Residuals
#[1] -0.4201041 -0.8342208 1.5639541 1.0237848 -0.1779349 -0.7820030
#or
tail(x / gfit@sigma.t)
#[1] -0.4201041 -0.8342208 1.5639541 1.0237848 -0.1779349 -0.7820030
vs
tail(residuals(gfit, standardize = TRUE))
#[1] -0.4156200 -0.8297368 1.5684382 1.0282689 -0.1734509 -0.7775190
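A plausible explanation, stated here as an assumption rather than a confirmed diagnosis: residuals(gfit, standardize = TRUE) subtracts the fitted mean (the mu coefficient of garchFit's default mean equation) before dividing by the conditional standard deviation, while x / gfit@sigma.t does not, which would explain the small constant offset between the two series. A minimal check:
mu <- coef(gfit)["mu"]          # fitted mean of the default mean equation
tail((x - mu) / gfit@sigma.t)   # should match residuals(gfit, standardize = TRUE)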