Easy way of counting precision, recall and F1-score in R

I am using an rpart classifier in R. I want to test the trained classifier on test data, which is fine: I can use the predict.rpart function.
But I also want to calculate precision, recall and F1 score.
Do I have to write functions for those myself, or is there a function in R or in any CRAN package for that?

Using the caret package:
library(caret)
y <- ... # factor of positive / negative cases
predictions <- ... # factor of predictions
precision <- posPredValue(predictions, y, positive="1")
recall <- sensitivity(predictions, y, positive="1")
F1 <- (2 * precision * recall) / (precision + recall)
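For a quick sanity check of these formulas without any package, the same three numbers can be read off a confusion table in base R. This is only a minimal sketch: the vectors y and predictions are made-up data, and "1" is assumed to be the positive class.

```r
# Toy data (hypothetical): true labels and predicted labels, "1" = positive
y           <- factor(c(1, 1, 1, 0, 0, 0, 1, 0), levels = c(0, 1))
predictions <- factor(c(1, 1, 0, 0, 0, 1, 1, 0), levels = c(0, 1))

# Rows are predictions, columns are the truth
cm <- table(predicted = predictions, actual = y)

tp <- cm["1", "1"]  # predicted positive, actually positive
fp <- cm["1", "0"]  # predicted positive, actually negative
fn <- cm["0", "1"]  # predicted negative, actually positive

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
F1        <- (2 * precision * recall) / (precision + recall)
```

With this toy data there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and F1 all come out to 0.75.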
A generic function that works for binary and multi-class classification without using any package is:
f1_score <- function(predicted, expected, positive.class="1") {
  # Give both factors the same levels in the same order, so that the
  # diagonal of the table really holds the correctly classified counts
  expected <- as.factor(expected)
  predicted <- factor(as.character(predicted), levels=levels(expected))
  cm <- as.matrix(table(expected, predicted))
  precision <- diag(cm) / colSums(cm)
  recall <- diag(cm) / rowSums(cm)
  f1 <- ifelse(precision + recall == 0, 0, 2 * precision * recall / (precision + recall))
  # Assuming that F1 is zero when it's not possible to compute it
  f1[is.na(f1)] <- 0
  # Binary F1 or multi-class macro-averaged F1
  ifelse(nlevels(expected) == 2, f1[positive.class], mean(f1))
}
Some comments about the function:
It's assumed that an F1 = NA is zero
positive.class is used only in binary F1
for multi-class problems, the macro-averaged F1 is computed
If predicted and expected had different levels, predicted will receive the expected levels
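To see what macro-averaging does, here is a small self-contained three-class example in base R. It is only a sketch with made-up labels; the names expected, predicted and macro_f1 are illustrative.

```r
# Hypothetical three-class labels
expected  <- factor(c("a", "a", "b", "b", "c", "c"))
predicted <- factor(c("a", "b", "b", "b", "c", "a"), levels = levels(expected))

# Rows = expected, columns = predicted, both in the same level order
cm <- table(expected, predicted)

precision <- diag(cm) / colSums(cm)   # per class: TP / (TP + FP)
recall    <- diag(cm) / rowSums(cm)   # per class: TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

# Macro-averaged F1: unweighted mean of the per-class scores
macro_f1 <- mean(f1)
```

Here the per-class F1 values are 0.5, 0.8 and 2/3, so the macro average is about 0.656. Every class counts equally, regardless of how many cases it has.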

The ROCR library calculates all these and more (see also http://rocr.bioinf.mpi-sb.mpg.de):
library (ROCR);
...
y <- ... # logical array of positive / negative cases
predictions <- ... # array of predictions
pred <- prediction(predictions, y);
# Recall-Precision curve
RP.perf <- performance(pred, "prec", "rec");
plot (RP.perf);
# ROC curve
ROC.perf <- performance(pred, "tpr", "fpr");
plot (ROC.perf);
# ROC area under the curve
auc.tmp <- performance(pred,"auc");
auc <- as.numeric(auc.tmp@y.values)
...

Just to update this, as I came across this thread now: the confusionMatrix function in caret computes all of these things for you automatically.
cm <- confusionMatrix(prediction, reference = test_set$label)
# extract F1 score for all classes
cm[["byClass"]][ , "F1"] #for multiclass classification problems
You can substitute any of the following for "F1" to extract the relevant values as well:
"Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value", "Precision", "Recall", "F1", "Prevalence", "Detection Rate", "Detection Prevalence", "Balanced Accuracy"
This behaves slightly differently for a binary classification problem, where byClass is a named vector rather than a matrix (so cm[["byClass"]]["F1"] is the form to use), but in both cases all of these values are computed for you and stored in the confusionMatrix object under $byClass.

confusionMatrix() from the caret package can be used together with the optional argument positive, which specifies which factor level should be treated as the positive class.
confusionMatrix(predicted, Funded, mode = "prec_recall", positive="1")
This code will also give additional values such as F1, Accuracy, etc.

I noticed the comment about F1 score being needed for binary classes. I suspect that it usually is. But a while ago I wrote this in which I was doing classification into several groups denoted by number. This may be of use to you...
calcF1Scores=function(act,prd){
#treats the vectors like classes
#act and prd must be whole numbers
df=data.frame(act=act,prd=prd);
scores=list();
for(i in seq(min(act),max(act))){
tp=nrow(df[df$prd==i & df$act==i,]);
fp=nrow(df[df$prd==i & df$act!=i,]);
fn=nrow(df[df$prd!=i & df$act==i,]);
f1=(2*tp)/(2*tp+fp+fn)
scores[[i]]=f1;
}
print(scores)
return(scores);
}
print(mean(unlist(calcF1Scores(c(1,1,3,4,5),c(1,2,3,4,5)))))
print(mean(unlist(calcF1Scores(c(1,2,3,4,5),c(1,2,3,4,5)))))

We can simply get the F1 value from caret's confusionMatrix function:
result <- confusionMatrix(Prediction, Lable)
# View confusion matrix overall
result
# F1 value (binary case, where byClass is a named vector)
result$byClass["F1"]

You can also use the confusionMatrix() function provided by the caret package. The output includes, among others, Sensitivity (also known as recall) and Pos Pred Value (also known as precision). Then F1 can be easily computed, as stated above, as:
F1 <- (2 * precision * recall) / (precision + recall)

library(caret)
result <- confusionMatrix(Prediction, label)
#This shows all the measures you need including precision, recall and F1
result$byClass

Related

GRG Non-Linear Least Squares (Optimization)

I am trying to convert an Excel spreadsheet that uses the Solver function with GRG Non-Linear to optimize 2 variables so that they give the lowest sum of squared errors. I have 4 known times (B) at 4 known distances (A). I need to create an optimization function to find which combination of values for Vmax and Tau produces the lowest sum of squared errors. I have looked at the nls function and the nloptr package but can't quite seem to piece them together. The current values for Vmax and Tau are the ones determined via the Excel Solver function; I just need to replicate them in R. Any and all help would be greatly appreciated. Thank you.
A <- c(0,10, 20, 40)
B <- c(0,1.51, 2.51, 4.32)
Measured <- as.data.frame(cbind(A, B))
Corrected <- Measured
Corrected$B <- Corrected$B + .2
colnames(Corrected) <- c("Distance (yds)", "Time (s)")
Corrected$`X (m)` <- Corrected$`Distance (yds)`*.9144
Vmax = 10.460615006988
Tau = 1.03682513806393
Predicted_X <- c(Vmax * (Corrected$`Time (s)`[1] - Tau + Tau*exp(-Corrected$`Time (s)`[1]/Tau)),
Vmax * (Corrected$`Time (s)`[2] - Tau + Tau*exp(-Corrected$`Time (s)`[2]/Tau)),
Vmax * (Corrected$`Time (s)`[3] - Tau + Tau*exp(-Corrected$`Time (s)`[3]/Tau)),
Vmax * (Corrected$`Time (s)`[4] - Tau + Tau*exp(-Corrected$`Time (s)`[4]/Tau)))
Corrected$`Predicted X (m)` <- Predicted_X
Corrected$`Squared Error` <- (Corrected$`X (m)`-Corrected$`Predicted X (m)`)^2
#Sum_Squared_Error <- sum(Corrected$`Squared Error`)
Is your issue still unsolved?
I'm working on a similar problem and I think I could help.
First you have to define a function that will be the sum of the errors, which has for variables Vmax and Tau.
Then you can call an optimisation algorithm that will change these variables and look for a minimum of your function. optim() might be sufficient for your application, but here is the documentation for nloptr:
https://www.rdocumentation.org/packages/nloptr/versions/1.0.4/topics/nloptr
and here is a list of optimisation packages in R:
https://cran.r-project.org/web/views/Optimization.html
Edit:
I quickly recoded the way I would do it. I'm a beginner, so it's probably not the best way but it still works.
A <- c(0,10, 20, 40)
B <- c(0,1.51, 2.51, 4.32)
Measured <- as.data.frame(cbind(A, B))
Corrected <- Measured
Corrected$B <- Corrected$B + .2
colnames(Corrected) <- c("Distance (yds)", "Time (s)")
Corrected$`X (m)` <- Corrected$`Distance (yds)`*.9144
#initialize values
Vmax0 = 15
Tau0 = 5
x0 = c(Vmax0,Tau0)
#define function to optimise: optim will minimize the output
f <- function(x) {
y=0
#variables will be optimised to find the minimum value of f
Vmax = x[1]
Tau = x[2]
Predicted_X <- Vmax * (Corrected$`Time (s)` - Tau + Tau*exp(-Corrected$`Time (s)`/Tau))
y = sum((Predicted_X - Corrected$`X (m)`)^2)
return(y)
}
#call optim: results will be available in variable Y
Y<-optim(x0,f)
If you type Y into the console, you will find that the solver finds the same values as Excel, and convergence is achieved.
In R, there is no need to define columns in data frames with brackets as you did; use vectors instead. You should probably follow a tutorial about this first.
Also, it is misleading to set the initial values to ones that are already optimal. If you do this, optim() will not optimise further.
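As an illustration of the advice to work with plain vectors, the same fit might be written as in the sketch below. The data values are the ones from the question; the names time_s, x_m, sse, start and fit are made up for this example.

```r
# Measured data as plain vectors (values taken from the question)
time_s <- c(0, 1.51, 2.51, 4.32) + 0.2   # corrected times in seconds
x_m    <- c(0, 10, 20, 40) * 0.9144      # distances converted from yards to metres

# Sum of squared errors as a function of par = c(Vmax, Tau)
sse <- function(par) {
  Vmax <- par[1]
  Tau  <- par[2]
  predicted <- Vmax * (time_s - Tau + Tau * exp(-time_s / Tau))
  sum((predicted - x_m)^2)
}

start <- c(15, 5)           # deliberately not the optimum
fit   <- optim(start, sse)  # default method is Nelder-Mead

fit$par          # estimated Vmax and Tau
fit$value        # sum of squared errors at the estimate
fit$convergence  # 0 indicates successful convergence
```

Because the model formula is vectorised, no data frame or per-row indexing is needed, and the objective function depends only on the two parameters being optimised.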
Here is the documentation for optim:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/optim.html
and a tutorial on how to use functions:
https://www.datacamp.com/community/tutorials/functions-in-r-a-tutorial
Cheers

Fast nonnegative quantile and Huber regression in R

I am looking for a fast way to do nonnegative quantile and Huber regression in R (i.e. with the constraint that all coefficients are >= 0). I tried using the CVXR package for quantile & Huber regression and the quantreg package for quantile regression, but CVXR is very slow and quantreg seems buggy when I use nonnegativity constraints. Does anybody know of a good and fast solution in R, e.g. using the Rcplex package or the R gurobi API, thereby using the faster CPLEX or gurobi optimizers?
Note that I need to run a problem size like below 80 000 times, whereby I only need to update the y vector in each iteration, but still use the same predictor matrix X. In that sense, I feel it's inefficient that in CVXR I now have to do obj <- sum(quant_loss(y - X %*% beta, tau=0.01)); prob <- Problem(Minimize(obj), constraints = list(beta >= 0)) within each iteration, when the problem is in fact staying the same and all I want to update is y. Any thoughts to do all this better/faster?
Minimal example:
## Generate problem data
n <- 7 # n predictor vars
m <- 518 # n cases
set.seed(1289)
beta_true <- 5 * matrix(stats::rnorm(n), nrow = n)+20
X <- matrix(stats::rnorm(m * n), nrow = m, ncol = n)
y_true <- X %*% beta_true
eps <- matrix(stats::rnorm(m), nrow = m)
y <- y_true + eps
Nonnegative quantile regression using CVXR :
## Solve nonnegative quantile regression problem using CVX
require(CVXR)
beta <- Variable(n)
quant_loss <- function(u, tau) { 0.5*abs(u) + (tau - 0.5)*u }
obj <- sum(quant_loss(y - X %*% beta, tau=0.01))
prob <- Problem(Minimize(obj), constraints = list(beta >= 0))
system.time(beta_cvx <- pmax(solve(prob, solver="SCS")$getValue(beta), 0)) # estimated coefficients; note that they can occasionally go slightly negative, so I clip them at 0
# 0.47s
cor(beta_true,beta_cvx) # correlation=0.99985, OK but very slow
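As a side check on the loss used above: the expression 0.5*abs(u) + (tau - 0.5)*u is the usual quantile ("pinball") loss written without a branch. A small base-R sketch confirms that the two forms agree; pinball is an illustrative name here, not a function from any package.

```r
quant_loss <- function(u, tau) 0.5 * abs(u) + (tau - 0.5) * u

# Textbook form: tau * u for u >= 0, (tau - 1) * u for u < 0
pinball <- function(u, tau) ifelse(u >= 0, tau * u, (tau - 1) * u)

u <- seq(-3, 3, by = 0.5)
all.equal(quant_loss(u, tau = 0.01), pinball(u, tau = 0.01))
```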
Syntax for nonnegative Huber regression is the same but would use
M <- 1 ## Huber threshold
obj <- sum(CVXR::huber(y - X %*% beta, M))
Nonnegative quantile regression using quantreg package :
### Solve nonnegative quantile regression problem using quantreg package with method="fnc"
require(quantreg)
R <- rbind(diag(n),-diag(n))
r <- c(rep(0,n),-rep(1E10,n)) # specify bounds of coefficients, I want them to be nonnegative, and 1E10 should ideally be Inf
system.time(beta_rq <- coef(rq(y~0+X, R=R, r=r, tau=0.5, method="fnc"))) # estimated coefficients
# 0.12s
cor(beta_true,beta_rq) # correlation=-0.477, no good, and even worse with tau=0.01...
To speed up CVXR, you can get the problem data once in the beginning, then modify it within a loop and pass it directly to the solver's R interface. The code for this is
prob_data <- get_problem_data(prob, solver = "SCS")
Then, parse out the arguments and pass them to scs from the scs library. (See Solver.solve in solver.R). You'll have to dig into the details of the canonicalization, but I expect if you're just changing y at each iteration, it should be a straightforward modification.

Function to calculate R2 (R-squared) in R

I have a dataframe with observed and modelled data, and I would like to calculate the R2 value. I expected there to be a function I could call for this, but can't locate one. I know I can write my own and apply it, but am I missing something obvious? I want something like
obs <- 1:5
mod <- c(0.8,2.4,2,3,4.8)
df <- data.frame(obs, mod)
R2 <- rsq(df)
# 0.85
You need a little statistical knowledge to see this. R squared between two vectors is just the square of their correlation. So you can define your function as:
rsq <- function (x, y) cor(x, y) ^ 2
Sandipan's answer will return you exactly the same result (see the following proof), but as it stands it appears more readable (due to the evident $r.squared).
Let's do the statistics
Basically we fit a linear regression of y over x, and compute the ratio of regression sum of squares to total sum of squares.
lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x)
lemma 2: beta = cov(x, y) / var(x)
lemma 3: R.square = cor(x, y) ^ 2
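The three lemmas chain together as follows. This is a sketch for simple linear regression with an intercept, writing $\hat\beta$ for the fitted slope and $\hat y_i$ for the fitted values (notation introduced here for illustration only):

```latex
\[
\hat\beta = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)},
\qquad
\hat y_i - \bar y = \hat\beta\,(x_i - \bar x)
\]
\[
R^2
= \frac{\sum_i (\hat y_i - \bar y)^2}{\sum_i (y_i - \bar y)^2}
= \hat\beta^2\,\frac{\sum_i (x_i - \bar x)^2}{\sum_i (y_i - \bar y)^2}
= \frac{\operatorname{cov}(x, y)^2}{\operatorname{var}(x)^2}\cdot\frac{\operatorname{var}(x)}{\operatorname{var}(y)}
= \frac{\operatorname{cov}(x, y)^2}{\operatorname{var}(x)\,\operatorname{var}(y)}
= \operatorname{cor}(x, y)^2
\]
```

The second identity in the first line is lemma 1 (the fitted line passes through the point of means), and the ratio of sums of squares equals the ratio of variances because the common factor n - 1 cancels.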
Warning
R squared between two arbitrary vectors x and y (of the same length) is just a goodness measure of their linear relationship. Think twice!! R squared between x + a and y + b is identical for any constant shifts a and b. So it is a weak or even useless measure of "goodness of prediction". Use MSE or RMSE instead:
How to obtain RMSE out of lm result?
R - Calculate Test MSE given a trained model from a training set and a test set
I agree with 42-'s comment:
The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.
R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training one, you can get a valid R squared value on training part, but you can't legitimately compute an R squared on the test part. Some people did this, but I don't agree with it.
Here is a very extreme example:
preds <- 1:4/4
actual <- 1:4
The R squared between those two vectors is 1. Yes of course, one is just a linear rescaling of the other so they have a perfect linear relationship. But, do you really think that the preds is a good prediction on actual??
In reply to wordsforthewise
Thanks for your comments and your detailed answer.
You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x then compute regression sum of squares and total sum of squares. It looks like you skip this regression step and go straight to the sum of square computation. That is false, since the partition of sum of squares does not hold and you can't compute R squared in a consistent way.
As you demonstrated, this is just one way for computing R squared:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] 0.25
But there is another:
regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
regss / tss
#[1] 0.75
Also, your formula can give a negative value (the proper value should be 1 as mentioned above in the Warning section).
preds <- 1:4 / 4
actual <- 1:4
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] -2.375
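To make the point about the regression step concrete, the sketch below first fits actual ~ preds and then computes the sums of squares from the fitted values. The variable names fit and fitted_vals are illustrative.

```r
preds  <- c(1, 2, 3)
actual <- c(2, 2, 4)

# Regress actual on preds first, then work with the fitted values
fit         <- lm(actual ~ preds)
fitted_vals <- fitted(fit)

tss   <- sum((actual - mean(actual))^2)       # total sum of squares
rss   <- sum((actual - fitted_vals)^2)        # residual sum of squares
regss <- sum((fitted_vals - mean(actual))^2)  # regression sum of squares

1 - rss / tss           # 0.75
regss / tss             # 0.75
cor(preds, actual)^2    # 0.75
summary(fit)$r.squared  # 0.75
```

Once the fitted values are used, the partition tss = rss + regss holds and all four expressions agree.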
Final remark
I had never expected that this answer could eventually be so long when I posted my initial answer 2 years ago. However, given the high views of this thread, I feel obliged to add more statistical details and discussions. I don't want to mislead people that just because they can compute an R squared so easily, they can use R squared everywhere.
Why not this:
rsq <- function(x, y) summary(lm(y~x))$r.squared
rsq(obs, mod)
#[1] 0.8560185
It is not something obvious, but the caret package has a function postResample() that will calculate "A vector of performance estimates" according to the documentation. The "performance estimates" are
RMSE
Rsquared
mean absolute error (MAE)
and have to be accessed from the vector like this
library(caret)
vect1 <- c(1, 2, 3)
vect2 <- c(3, 2, 2)
res <- caret::postResample(vect1, vect2)
rsq <- res[2]
However, this is using the correlation squared approximation for r-squared as mentioned in another answer. I'm not sure why Max Kuhn didn't just use the conventional 1-SSE/SST.
caret also has an R2() method, although it's hard to find in the documentation.
The way to implement the normal coefficient of determination equation is:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
Not too bad to code by hand of course, but why isn't there a function for it in a language primarily made for statistics? I'm thinking I must be missing the implementation of R^2 somewhere, or no one cares enough about it to implement it. Most of the implementations, like this one, seem to be for generalized linear models.
You can also use the summary for linear models:
summary(lm(obs ~ mod, data=df))$r.squared
Here is the simplest solution, based on https://en.wikipedia.org/wiki/Coefficient_of_determination:
# 1. 'Actual' and 'Predicted' data
df <- data.frame(
y_actual = c(1:5),
y_predicted = c(0.8, 2.4, 2, 3, 4.8))
# 2. R2 Score components
# 2.1. Average of actual data
avr_y_actual <- mean(df$y_actual)
# 2.2. Total sum of squares
ss_total <- sum((df$y_actual - avr_y_actual)^2)
# 2.3. Regression sum of squares
ss_regression <- sum((df$y_predicted - avr_y_actual)^2)
# 2.4. Residual sum of squares
ss_residuals <- sum((df$y_actual - df$y_predicted)^2)
# 3. R2 Score
r2 <- 1 - ss_residuals / ss_total
Not sure why this isn't implemented directly in R, but this answer is essentially the same as Andrii's and Wordsforthewise's; I just turned it into a function for the sake of convenience, in case somebody uses it a lot like me.
r2_general <-function(preds,actual){
return(1- sum((preds - actual) ^ 2)/sum((actual - mean(actual))^2))
}
I use the function MLmetrics::R2_Score from the MLmetrics package to compute R2; it uses the vanilla 1-(RSS/TSS) formula.

Sampling a multidimensional posterior distribution using the MCMC Metropolis-Hastings algorithm in R

I am quite new to sampling posterior distributions (and therefore to the Bayesian approach) using an MCMC technique based on the Metropolis-Hastings algorithm.
I am using the mcmc library in R for this. My distribution is multidimensional. To check that the metrop function works for a multivariate distribution, I applied it successfully to a multidimensional Student-t distribution (package mvtnorm, function dmvt).
Now I want to apply the same thing to my multivariate distribution (2 variables, x and y), but it doesn't work; I get an error: Error in X[, 1] : incorrect number of dimensions
Here is my code:
library(mcmc)
library(mvtnorm)
my.seed <- 123
logprior<-function(X,...)
{
ifelse( (-50.0 <= X[,1] & X[,1]<=50.0) & (-50.0 <= X[,2] & X[,2]<=50.0), return(0), return(-Inf))
}
logpost<-function(X,...)
{
log.like <- log( exp(-((X[,1]^2 + X[,2]^2 - 4)/10 )^2) * sin(4*atan(X[,2]/X[,1])) )
log.prior<-logprior(X)
log.post<-log.like + log.prior # if flat prior, the posterior distribution is the likelihood one
return (log.post)
}
x <- seq(-5,5,0.15)
y <- seq(-5,5,0.15)
X<-cbind(x,y)
#out <- metrop(function(X) dmvt(X, df=3, log=TRUE), 0, blen=100, nbatch=100) ; this works
out <- metrop(function(X) logpost(X), c(0,0), blen=100, nbatch=100)
out <- metrop(out)
out$accept
So I tried to follow the same format as for the MWE, but it still doesn't work; I get the error mentioned above.
Another thing is that applying logpost to X works perfectly.
Thanks in advance for your help, best
The metrop function passes individual samples, and therefore a simple vector to logpost, not a matrix (which is what X is). Hence, the solution is to change X[,1] and X[,2] to X[1] and X[2], respectively.
I ran it like this, and it leads to other issues (X[2]/X[1] is NaN for the initialization), but that has more to do with your specific likelihood model and is out of the scope of your question.
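A vector-indexed version of the log posterior could look like the sketch below. The guard against NaN and non-positive values is an assumption of this sketch, not part of the original model: it simply treats such points as having zero density, which may or may not be what the model intends. The name logpost_vec is made up for illustration.

```r
# Log posterior written for a single state vector X = c(x, y),
# which is the form in which each sample is passed to the function
logpost_vec <- function(X, ...) {
  # Flat prior on [-50, 50] x [-50, 50]
  if (any(abs(X) > 50)) return(-Inf)
  lik <- exp(-((X[1]^2 + X[2]^2 - 4) / 10)^2) * sin(4 * atan(X[2] / X[1]))
  # Guard (assumption of this sketch): treat NaN or non-positive
  # "likelihood" values as zero density instead of returning NaN
  if (!is.finite(lik) || lik <= 0) return(-Inf)
  log(lik)
}

logpost_vec(c(1, 0.5))  # finite
logpost_vec(c(0, 0))    # -Inf, because 0/0 is NaN
```

With this in place, a call such as metrop(logpost_vec, initial = c(1, 0.5), nbatch = 100, blen = 100) would start from a point with finite log density rather than from c(0, 0); the starting point c(1, 0.5) is a hypothetical choice.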

How does glmnet compute the maximal lambda value?

The glmnet package uses a range of LASSO tuning parameters lambda scaled from the maximal lambda_max under which no predictors are selected. I want to find out how glmnet computes this lambda_max value. For example, in a trivial dataset:
set.seed(1)
library("glmnet")
x <- matrix(rnorm(100*20),100,20)
y <- rnorm(100)
fitGLM <- glmnet(x,y)
max(fitGLM$lambda)
# 0.1975946
The package vignette (http://www.jstatsoft.org/v33/i01/paper) describes in section 2.5 that it computes this value as follows:
sx <- as.matrix(scale(x))
sy <- as.vector(scale(y))
max(abs(colSums(sx*sy)))/100
# 0.1865232
Which clearly is close but not the same value. So, what causes this difference? And in a related question, how could I compute lambda_max for a logistic regression?
To get the same result you need to standardize the variables using a standard deviation with n instead of n-1 denominator.
mysd <- function(y) sqrt(sum((y-mean(y))^2)/length(y))
sx <- scale(x,scale=apply(x, 2, mysd))
sx <- as.matrix(sx, ncol=20, nrow=100)
sy <- as.vector(scale(y, scale=mysd(y)))
max(abs(colSums(sx*sy)))/100
## [1] 0.1758808
fitGLM <- glmnet(sx,sy)
max(fitGLM$lambda)
## [1] 0.1758808
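The n versus n-1 distinction can be checked directly in base R. This is a small sketch using the same mysd definition as above; base R's sd() divides by n-1, so the two differ by a factor of sqrt((n-1)/n).

```r
set.seed(1)
y <- rnorm(100)
n <- length(y)

mysd <- function(y) sqrt(sum((y - mean(y))^2) / length(y))  # divides by n

sd(y)                      # base R: divides by n - 1
mysd(y)                    # divides by n
sd(y) * sqrt((n - 1) / n)  # same value as mysd(y)
```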
For the unscaled (original) x and y, the maximum lambda should be
mysd <- function(y) sqrt(sum((y-mean(y))^2)/length(y))
sx <- scale(x,scale=apply(x, 2, mysd))
norm(t(sx) %*% y, 'i') / nrow(x)
## [1] 0.1975946
# norm of infinity is also equal to
max(abs(colSums(sx*y)))/100
## [1] 0.1975946
max(fitGLM$lambda) - norm(t(sx) %*% y, 'i') / nrow(x)
## [1] 2.775558e-17
It seems lambda_max for a logistic regression is calculated similarly as for linear regression, but with weights based on class proportions:
set.seed(1)
library("glmnet")
x <- matrix(rnorm(100*20),100,20)
y <- rnorm(100)
mysd <- function(y) sqrt(sum((y-mean(y))^2)/length(y))
sx <- scale(x, scale=apply(x, 2, mysd))
sx <- as.matrix(sx, ncol=20, nrow=100)
y_bin <- factor(ifelse(y<0, -1, 1))
prop.table(table(y_bin))
# y_bin
# -1 1
# 0.62 0.38
fitGLM_log <- glmnet(sx, y_bin, family = "binomial")
max(fitGLM_log$lambda)
# [1] 0.1214006
max(abs(colSums(sx*ifelse(y<0, -.38, .62))))/100
# [1] 0.1214006
For your second question, look to Friedman et al's paper, "Regularization paths for generalized linear models via coordinate descent". In particular, see equation (10), which is equality at equilibrium. Just check under what conditions the numerator $S(\cdot,\cdot)$ is zero for all parameters.
Sorry, been a while, but maybe still of help:
You can calculate the maximum lambda value for any problem with L1-regularization by finding the highest absolute value of the gradient of the objective function (i.e. the score function for likelihoods) at the optimized parameter values for the completely regularized model (e.g. all penalized parameters set to zero).
Sadly, I can't help with the difference in values. I can say, though, that I try to use a max lambda value that is a bit higher (say 5%) than the calculated maximum lambda, so that the model with all penalized parameters constrained to zero will surely be among the estimated models. Maybe this is what is being done in glmnet.
Edit: sorry, I confused the non-regularized with the fully penalized model. Edited it above now.
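The gradient rule described in this answer can be sketched in base R without glmnet. The sketch assumes the loss is scaled by 1/n and that the columns of x are standardised with the n-denominator standard deviation, as in the computations earlier in this thread; lambda_max and y01 are illustrative names.

```r
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 20), n, 20)
y <- rnorm(n)

# Standardise columns with the n-denominator standard deviation
mysd <- function(v) sqrt(sum((v - mean(v))^2) / length(v))
sx <- scale(x, scale = apply(x, 2, mysd))

# Largest absolute gradient component at the intercept-only fit
lambda_max <- function(X, resid) max(abs(crossprod(X, resid))) / nrow(X)

# Gaussian case: residuals of the intercept-only model
lambda_max(sx, y - mean(y))

# Binomial case: 0/1 response minus its overall proportion
y01 <- as.numeric(y >= 0)
lambda_max(sx, y01 - mean(y01))
```

In both cases the residual vector is the response minus the fitted value of the model in which every penalized coefficient is zero. For the binomial case this is the same quantity as the class-proportion weights used in the logistic example above.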
According to help("glmnet") the maximal lambda value is "the smallest value for which all coefficients are zero":
sum(fitGLM$beta[, which.max(fitGLM$lambda)])
#[1] 0
sum(glmnet(x,y, lambda=max(fitGLM$lambda)*0.999)$beta)
#[1] -0.0001809804
At a quick glance the value seems to be calculated by the Fortran code called by elnet.

Resources