Custom AUC in R with different thresholds and binary predictions

I am looking to plot a FPR vs TPR point on an AUC graph for different thresholds.
For example, if data$C2 is my data frame with the true response column (either 0 or 1), I want to make a vector with predicted values (0 or 1) when data$C1 (a different measurement column) is above or below the specified threshold. Here is the function I've attempted with the ROCR package.
fun <- function(data, col1, col2) {
  perfc <- NULL  # Create null vectors for prediction and performance
  perfs <- NULL
  temp <- NULL
  d <- seq(0.10, 0.30, 0.01)  ## Various thresholds to be tested
  for (i in length(d)) {
    temp <- ifelse(data[, col1] > d, 1, 0)   ## Create predicted responses
    pred <- prediction(temp, data[, col2])   # Predict responses over true values
    perf <- performance(pred, "tpr", "fpr")  # Store performance information
    predc[i] <- pred  # Do this i times for every d in the sequence
    perfc[i] <- perf
    preds <- prediction.class(predc, col2)            # Combine to make prediction class
    perfs <- performance.class(preds, "tpr", "fpr")   # Combine to make performance class
  }
  plot(perfs)  # Plot TPR against FPR
}
Is the problem because temp is a list vector and the true labels are from a matrix? Am I applying this for loop incorrectly?
Thanks in advance!
Edit: Here's my attempt to do this manually without the ROCR package.
for (t in seq(0.40, 0.60, 0.01)) {  # I want to do this for every t in the sequence
  TP <- 0
  FP <- 0
  p <- sum(data$C2 == 1, na.rm = TRUE)  # Total number of actual positives
  n <- sum(data$C2 == 0, na.rm = TRUE)  # Total number of actual negatives
  list <- data$C1                       # Column to vector
  test <- ifelse(list > t, 1, 0)        # Make prediction vector
  for (i in 1:nrow(data)) {
    if (test[i] == 1 & data$C2[i] == 1) {
      TP <- TP + 1  # Count the true positives
    }
    if (test[i] == 1 & data$C2[i] == 0) {
      FP <- FP + 1  # Count the false positives
    }
  }
  plot(x = FP/n, y = TP/p)  # Plot every FP,TP pair
}

I hope I understand your question right, but I think that by "AUC graph" you mean a ROC curve. The ROC curve already takes the different thresholds into account when making those classification decisions; see the Wikipedia page on the receiver operating characteristic, which has a particularly helpful illustration.
If the above is right, then all you need to do in your code is:
library(ROCR)
pred <- prediction(data[, col1], data[, col2])
perf <- performance(pred, "tpr", "fpr")
plot(perf)
If you would like to add a different curve to that plot, perhaps because you used a different classification technique (e.g. a decision tree instead of logistic regression), use plot(perf2, add = TRUE), where perf2 is created in the same way as perf. See the documentation.
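For example, a minimal sketch of overlaying two curves (scores2 here is a hypothetical second vector of predicted scores, not something from the question):
library(ROCR)
pred <- prediction(data[, col1], data[, col2])
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = "blue")                    # first model's ROC curve
pred2 <- prediction(scores2, data[, col2])  # scores2: hypothetical second model
perf2 <- performance(pred2, "tpr", "fpr")
plot(perf2, col = "red", add = TRUE)        # overlay the second curve on the same axes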

Related

How to fill a non-zero coefficient/gamma plot when optimizing with CVXR package?

I'm replicating an article by Kozak, Nagel, and Santosh, "Shrinking the Cross-Section". For this I'm creating a model that selects a few characteristics out of a large set which together are a good representation of an SDF.
In this model I make use of ridge and lasso techniques, and my supervisor advised me to use the CVXR package. I minimize an objective in which two loss functions are multiplied by two sets of gammas. The main goal of my code is to end with a plot that has one of the gammas on the x-axis (the ridge one) and the number of non-zero coefficients on the y-axis (so not the lasso parameter).
However, since the number of non-zero coefficients is an outcome of the optimizer, I cannot simply request an output with n non-zero coefficients.
Does anyone know how to produce my desired outcome? The code I used is below.
# Grid for L1 penalty
cv.gamma_1 <- seq(0.005, 0.02, by = 0.0075/15)
# Grid for L2 penalty
cv.kappa <- 10^seq(-2, 0.5, 2/24)
cv.Tt <- nrow(cv.train)
cv.tau <- sum(diag(cv.Sigma.train))
cv.gamma_2 <- as.numeric(cv.tau / ((cv.kappa^2) * cv.Tt))
# Create results matrix
coef_matrix <- matrix(nrow = length(cv.gamma_2), ncol = Nn, data = 0)
for (i in 1:length(cv.gamma_1)) {
  for (j in 1:length(cv.gamma_2)) {
    objective <- loss_1 + cv.gamma_2[j] * loss_2 + cv.gamma_1[i] * loss_3
    prob <- Problem(Minimize(objective))
    result <- solve(prob)
    model_betas <- result$getValue(beta)
    # Compute R-squared of model with these betas
    r_score <- Rsquared(Mu_OOS = cv.Mu.test, Sigma_OOS = cv.Sigma.test, betas = model_betas)
    # Record the best R-squared seen for this (gamma_2, non-zero count) cell
    non_zeros <- sum(round(model_betas, 2) != 0.00)
    if (non_zeros != 0) {
      if (coef_matrix[j, non_zeros] < r_score) {
        coef_matrix[j, non_zeros] <- r_score
      }
    }
  }
}
For now I ran my optimizer, counted the number of non-zeros, and made a matrix with non-zeros on the y-axis and gamma on the x-axis. As a result, not every non-zero count in the matrix has a corresponding value.
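One possible way to get from coef_matrix to the desired plot, as a rough sketch only (this assumes coef_matrix has been filled by the loop above; it is not the original poster's verified solution):
idx <- which(coef_matrix != 0, arr.ind = TRUE)  # (gamma_2 index, non-zero count) pairs that were filled
plot(cv.gamma_2[idx[, "row"]], idx[, "col"], log = "x",
     xlab = "gamma_2 (ridge penalty)", ylab = "number of non-zero coefficients")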
(The original post included images of the current plot and the desired plot; they are not reproduced here.)

How do I add new columns to a data set for each regression loop iteration?

I'm trying to test the predictive power of a model by splitting the observations into a test group (1/4) and a training group (3/4), running a first-order regression on the training sample, using the fitted coefficients to produce predicted values from the independent-variable test sample, and then adding these predicted values as a new column to the dependent-variable test data on each iteration of the loop.
For context: TSIP500 is the full sample; iv is the independent variable; dv is the dependent variable; a max of 50 iterations is simply a test quantity that isn't too large.
I was having trouble with the predict function, so I computed the equation manually. My code is below:
for (i in 1:50) {
  test_index <- sample(nrow(TSIP500iv), (1/4) * nrow(TSIP500iv), replace = FALSE)
  train_500iv <- TSIP500[-test_index, "distance"]
  test_500iv <- TSIP500[test_index, "distance"]
  train_500dv <- TSIP500[-test_index, "percent_of_max"]
  test_500dv <- TSIP500[test_index, "percent_of_max"]
  reg_model <- lm(train_500dv ~ train_500iv)
  int <- reg_model$coeff[1]
  B1 <- reg_model$coeff[2]
  predicted <- int + B1 * test_500iv
  predicted <- data.frame(predicted)
  test_500dv <- data.frame(test_500dv)
  test_500dv[, i] <- apply(predicted)
}
I've tried different approaches for the last line, but I always just get a single column added. Any help would be tremendously appreciated. Here is another attempt:
for (i in 1:50) {
  test_index <- sample(nrow(TSIP500iv), (1/4) * nrow(TSIP500iv), replace = FALSE)
  train_500iv <- TSIP500[-test_index, "distance"]
  test_500iv <- TSIP500[test_index, "distance"]
  train_500dv <- TSIP500[-test_index, "percent_of_max"]
  test_500dv <- TSIP500[test_index, "percent_of_max"]
  reg_model <- lm(train_500dv ~ train_500iv)
  int <- reg_model$coeff[1]
  B1 <- reg_model$coeff[2]
  temp_results <- paste('pred', i, sep = '_')
  assign(temp_results, as.data.frame(int + B1 * test_500iv))
  test_500dv <- cbind(data.frame(test_500dv), temp_results)
}
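One possible fix, as a minimal sketch (assuming TSIP500 contains both the distance and percent_of_max columns): collect each iteration's predictions in a list, since the sampled test rows differ between runs, and bind them into columns at the end.
pred_list <- vector("list", 50)
for (i in 1:50) {
  test_index <- sample(nrow(TSIP500), (1/4) * nrow(TSIP500), replace = FALSE)
  train_dat <- TSIP500[-test_index, ]
  test_dat <- TSIP500[test_index, ]
  reg_model <- lm(percent_of_max ~ distance, data = train_dat)
  pred_list[[i]] <- predict(reg_model, newdata = test_dat)
}
pred_mat <- do.call(cbind, pred_list)  # one column of predictions per iteration
colnames(pred_mat) <- paste0("pred_", 1:50)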

How to perform a multivariate linear regression when y is an indicator matrix in r?

This is the first time I am posting a question; I hope it isn't confusing, and thanks very much for your time.
I am working on a zipcode dataset, which can be downloaded here:
http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.train.gz
http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.test.gz
In general, my goal is to fit a principal component regression model with the top 3 PCs on the training data, restricted to observations whose response is one of the handwritten digits 2, 3, 5, and 8, and then predict using the test data. My main problem is that after performing PCA on the X matrix, I am not sure I did the regression part correctly. I have turned the response variable into a 2487*4 indicator matrix and want to fit a multivariate linear regression model. But the predictions are not binary indicators, so I am confused about how to map them back to the original response variable, i.e., which observations are predicted as 2, 3, 5, or 8. Or did I do the regression part totally wrong? Here is my code:
First of all, I built the subset where the response variable equals 2, 3, 5, or 8:
zip_train <- read.table(gzfile("zip.train.gz"))
zip_test <- read.table(gzfile("zip.test.gz"))
train <- data.frame(zip_train)
train_sub <- train[which(train$V1 == 2 | train$V1 == 3 | train$V1 == 5 | train$V1 == 8),]
test <- data.frame(zip_test)
test_sub <- test[which(test$V1 == 2 | test$V1 == 3 | test$V1 == 5 | test$V1 == 8),]
xtrain <- train_sub[,-1]
xtest <- test_sub[,-1]
ytrain <- train_sub$V1
ytest <- test_sub$V1
Second, I centered the X matrix, and calculated the top 3 principal components by using svd:
cxtrain <- scale(xtrain)
svd.xtrain <- svd(cxtrain)
cxtest <- scale(xtest)
svd.xtest <- svd(cxtest)
utrain.r3 <- svd.xtrain$u[,c(1:3)] # this is the u_r
vtrain.r3 <- svd.xtrain$v[,c(1:3)] # this is the v_r
dtrain.r3 <- svd.xtrain$d[c(1:3)]
Dtrain.r3 <- diag(x=dtrain.r3,ncol=3,nrow=3) # create the diagonal matrix D with r=3
ztrain.r3 <- cxtrain %*% vtrain.r3 # this is the scores, the new components
utest.r3 <- svd.xtest$u[,c(1:3)]
vtest.r3 <- svd.xtest$v[,c(1:3)]
dtest.r3 <- svd.xtest$d[c(1:3)]
Dtest.r3 <- diag(x=dtest.r3,ncol=3,nrow=3)
ztest.r3 <- cxtest %*% vtest.r3
Third, and this is the part I was not sure I did correctly, I turned the response variable into an indicator matrix and performed a multivariate linear regression like this:
ytrain.ind <-cbind(I(ytrain==2)*1,I(ytrain==3)*1,I(ytrain==5)*1,I(ytrain==8)*1)
ytest.ind <- cbind(I(ytest==2)*1,I(ytest==3)*1,I(ytest==5)*1,I(ytest==8)*1)
mydata <- data.frame(cbind(ztrain.r3,ytrain.ind))
model_train <- lm(cbind(X4,X5,X6,X7)~X1+X2+X3,data=mydata)
new <- data.frame(ztest.r3)
pred <- predict(model_train,newdata=new)
However, pred is not an indicator matrix, so I am lost as to how to map the predictions back to the digits and compare them with the real test data to calculate the prediction error.
I finally figured out how to perform multivariate linear regression with a categorical y. First we turn y into an indicator matrix, so we can interpret its 0s and 1s as probabilities. We then regress y on x to build a linear model, and finally use this model to predict on the test set of x. The result is a matrix with the same dimensions as the indicator matrix, and all its entries should be interpreted as probabilities too, although they can be larger than 1 or smaller than 0 (that's what confused me before). So we find the maximum per row to see which predicted y has the highest probability, and that y is our final prediction. In this way we convert the continuous numbers back into categories, and can then make a table to compare with the test set of y. I updated my previous code as below.
First of all, I built the subset where the response variable equals 2, 3, 5, or 8 (the code remains the same as the one I posted in my question):
zip_train <- read.table(gzfile("zip.train.gz"))
zip_test <- read.table(gzfile("zip.test.gz"))
train <- data.frame(zip_train)
train_sub <- train[which(train$V1 == 2 | train$V1 == 3 | train$V1 == 5 | train$V1 == 8),]
test <- data.frame(zip_test)
test_sub <- test[which(test$V1 == 2 | test$V1 == 3 | test$V1 == 5 | test$V1 == 8),]
xtrain <- train_sub[,-1]
xtest <- test_sub[,-1]
ytrain <- train_sub$V1
ytest <- test_sub$V1
Second, I centered the X matrix and calculated the top 3 principal components using eigen(). I updated this part of the code because I had standardized x instead of centering it in my previous code, leading to a wrong computation of the covariance matrix of x and of the eigenvectors of cov(x).
cxtrain <- scale(xtrain, center = TRUE, scale = FALSE)
eigenxtrain <- eigen(t(cxtrain) %*% cxtrain / (nrow(cxtrain) - 1)) # same as eigen(cov(xtrain)), because x has already been centered
cxtest <- scale(xtest, center = TRUE, scale = FALSE)
eigenxtest <- eigen(t(cxtest) %*% cxtest / (nrow(cxtest) - 1))
r <- 3 # set r=3 to keep the top 3 principal components
vtrain <- eigenxtrain$vectors[,c(1:r)]
ztrain <- cxtrain %*% vtrain # these are the scores, the new components
vtest <- eigenxtrain$vectors[,c(1:r)]
ztest <- cxtest %*% vtest
Third, I turned the response variable into an indicator matrix, performed a multivariate linear regression on the training set, and then used this linear model to predict:
ytrain.ind <- cbind(I(ytrain==2)*1,I(ytrain==3)*1,I(ytrain==5)*1,I(ytrain==8)*1)
ytest.ind <- cbind(I(ytest==2)*1,I(ytest==3)*1,I(ytest==5)*1,I(ytest==8)*1)
mydata <- data.frame(cbind(ztrain,ytrain.ind))
model_train <- lm(cbind(X4,X5,X6,X7)~X1+X2+X3,data=mydata)
new <- data.frame(ztest)
pred<- predict(model_train,newdata=new)
pred is a matrix whose entries are all "probabilities", so we need to convert it back into a vector of categorical y values.
pred.ind <- matrix(0, nrow = 690, ncol = 4) # build a matrix with the same dimensions as pred, all entries 0
for (i in 1:690) {
  j <- which.max(pred[i, ]) # j is the column number of the highest probability in this row
  pred.ind[i, j] <- 1 # set a 1 in the highest-probability column, turning pred back into an indicator matrix
}
pred.col1 <- as.matrix(pred.ind[, 1] * 2) # first column: those predicted as digit 2
pred.col2 <- as.matrix(pred.ind[, 2] * 3)
pred.col3 <- as.matrix(pred.ind[, 3] * 5)
pred.col4 <- as.matrix(pred.ind[, 4] * 8)
pred.col5 <- cbind(pred.col1, pred.col2, pred.col3, pred.col4)
pred.list <- NULL
for (i in 1:690) {
  pred.list[i] <- max(pred.col5[i, ])
} # this finally gives a vector of categorical y values
tt <- table(pred.list, ytest)
err <- (sum(tt) - sum(diag(tt))) / sum(tt) # error rate was 0.3289855
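As an aside, a more compact equivalent of the indicator round trip above (a sketch; the digit labels follow the same column order as ytrain.ind):
labels <- c(2, 3, 5, 8)
pred.list2 <- labels[apply(pred, 1, which.max)] # pick the digit with the highest fitted "probability" per row
table(pred.list2, ytest)                        # should match the table above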
For the third part, we could instead perform a multinomial logistic regression. That way we don't need to convert y into an indicator matrix; we just factor it. So the code looks as below:
library(nnet)
trainmodel <- data.frame(cbind(ztrain, ytrain))
mul <- multinom(factor(ytrain) ~., data=trainmodel)
new <- as.matrix(ztest)
colnames(new) <- colnames(trainmodel)[1:r]
pred.mul <- predict(mul, new) # avoid naming the result `predict`, which would shadow the function
tt <- table(pred.mul, ytest)
err <- (sum(tt) - sum(diag(tt))) / sum(tt) # error rate was 0.2627907
So the logistic model does perform better than the linear model here.

Cross validation for glm() models

I'm trying to do a 10-fold cross-validation for some glm models that I built earlier in R. I'm a little confused about the cv.glm() function in the boot package, although I've read a lot of help files. When I provide the following call:
library(boot)
cv.glm(data, glmfit, K=10)
Does the "data" argument here refer to the whole dataset or only to the test set?
The examples I have seen so far provide the test set as the "data" argument, but that does not really make sense: why do 10 folds on the same test set? They would all give exactly the same result (I assume!).
Unfortunately ?cv.glm explains it in a foggy way:
data: A matrix or data frame containing the data. The rows should be
cases and the columns correspond to variables, one of which is the
response
My other question would be about the $delta[1] result. Is this the average prediction error over the 10 trials? What if I want to get the error for each fold?
Here's what my script looks like:
##data partitioning
sub <- sample(nrow(data), floor(nrow(data) * 0.9))
training <- data[sub, ]
testing <- data[-sub, ]
##model building
model <- glm(formula = groupcol ~ var1 + var2 + var3,
family = "binomial", data = training)
##cross-validation
cv.glm(testing, model, K=10)
I am always a little cautious about using various packages' 10-fold cross-validation methods. I have my own simple script to create the test and training partitions manually for any machine-learning package:
#Randomly shuffle the data
yourData <- yourData[sample(nrow(yourData)), ]
#Create 10 equally sized folds
folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)
#Perform 10-fold cross validation
for (i in 1:10) {
  #Segment your data by fold using the which() function
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  #Use the test and train data partitions however you desire, e.g. as in the sketch below...
}
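For example, a minimal sketch of what could go inside that loop, assuming a binomial glm with the question's variable names (groupcol, var1, var2, var3) and using cv.glm's default average-squared-error cost; this also gives you the error for each fold:
foldErrors <- numeric(10)
for (i in 1:10) {
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  fit <- glm(groupcol ~ var1 + var2 + var3, family = "binomial", data = trainData)
  phat <- predict(fit, newdata = testData, type = "response")
  foldErrors[i] <- mean((testData$groupcol - phat)^2) # average squared error in this fold
}
foldErrors        # per-fold errors
mean(foldErrors)  # roughly comparable to cv.glm's delta[1] when folds are equal size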
Roman provided some answers in his comments; however, the answer to your questions is provided by inspecting the code of cv.glm:
I believe this bit of code splits the data set up randomly into the K folds, adjusting for rounding as necessary when K does not divide n:
if ((K > n) || (K <= 1))
stop("'K' outside allowable range")
K.o <- K
K <- round(K)
kvals <- unique(round(n/(1L:floor(n/2))))
temp <- abs(kvals - K)
if (!any(temp == 0))
K <- kvals[temp == min(temp)][1L]
if (K != K.o)
warning(gettextf("'K' has been set to %f", K), domain = NA)
f <- ceiling(n/K)
s <- sample0(rep(1L:K, f), n)
This bit here shows that the delta value is NOT the root mean squared error. It is, as the help file says, "the average squared error function" by default. What does this mean? We can see by inspecting the function declaration:
function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2),
K = n)
which shows that within each fold we calculate the average of the squared error, where error is in the usual sense: predicted response minus actual response.
delta[1] is simply the weighted average of these per-fold cost terms; see my inline comments in the code of cv.glm:
for (i in seq_len(ms)) {
  j.out <- seq_len(n)[(s == i)]
  j.in <- seq_len(n)[(s != i)]
  Call$data <- data[j.in, , drop = FALSE]
  d.glm <- eval.parent(Call)
  p.alpha <- n.s[i]/n # create weight for the weighted average used later
  cost.i <- cost(glm.y[j.out],
                 predict(d.glm, data[j.out, , drop = FALSE], type = "response"))
  CV <- CV + p.alpha * cost.i # add the weighted fold error to the running total
  cost.0 <- cost.0 - p.alpha * cost(glm.y,
                 predict(d.glm, data, type = "response"))
}

Avoiding a for loop in R in an attempt to evaluate percentage of true positives/negatives when using logistic regression

What I have: a matrix containing the predicted probability of an outcome (from a logistic regression model) and the known outcome. For those curious, I actually have two regression models and an independent test dataset, and I wish to compare the two models this way.
> head(matrixComb)
probComb outComb
[1,] 0.9999902 1
[2,] 0.9921736 0
[3,] 0.9901175 1
[4,] 0.9815581 0
[5,] 0.7692992 0
[6,] 0.7369990 0
What I want: a graph where I can plot how often my prediction model yields correct outcomes (one line for positives and one line for negatives) as a function of the cutoff value for the probability. My problem is that I am unable to figure out how to do this without switching to Perl and using a for-loop to iterate through the matrix.
In Perl I would just start at probability 0.1 and increase the value by 0.1 in each run of the for-loop. In the first iteration I would count all probabilities < 0.1 with outcome = 0 as true negatives, probability < 0.1 with outcome = 1 as false negatives, probability > 0.1 with outcome = 0 as false positives, and probability > 0.1 with outcome = 1 as true positives.
The process would then be repeated, and the results of each iteration would be printed as [probability, true positives/total positives, true negatives/total negatives], making it easy to plot in OpenOffice Calc.
The reason I am asking is that the operation is too complex for me to find a similar case here on Stack Overflow or in a tutorial, but I would really like to learn an efficient way to do this in the R environment.
You can get R to draw the curves, which are based on ROC analysis. This is a crude version using the ROCR package and could easily be made prettier:
ss <- 1000 # sample size
mydf <- data.frame(probComb = runif(ss)) # predictions illustration
mydf$outComb <- 0 + (runif(ss) < mydf$probComb) # actuals illustration
library(ROCR)
pred <- prediction(mydf$probComb, mydf$outComb)
perfp <- performance(pred, "tpr")
perfn <- performance(pred, "tnr")
plot(perfp, col="green", ylab="True positive (green) and true negative (red) rates")
plot(perfn, col="red", ylab="True negative rate", add=TRUE)
to produce a plot of both rates against the cutoff (the resulting image is not reproduced here).
If you must, you can find the data in perfp and perfn.
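For instance, a small sketch of pulling the underlying numbers out of those performance objects (ROCR stores them in S4 slots whose elements are lists):
cutoffs <- perfp@x.values[[1]] # probability cutoffs
tpr <- perfp@y.values[[1]]     # true positive rate at each cutoff
tnr <- perfn@y.values[[1]]     # true negative rate at each cutoff
head(data.frame(cutoffs, tpr, tnr))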
Here's a way to do this manually:
#Create some sample data
dat <- data.frame(x = runif(100), y = sample(0:1, 100, replace = TRUE))
#Function to compute tp and tn
myFun <- function(x) {
  tbl <- table(dat$x > x, dat$y)
  marg <- margin.table(tbl, 2)
  tn <- tbl[1, 1] / marg[1]
  tp <- tbl[2, 2] / marg[2]
  rs <- c(tp, tn)
  names(rs) <- c('truePos', 'trueNeg')
  return(rs)
}
#Decision thresholds
thresh <- seq(0.1, 0.9, by = 0.1)
#Loop using lapply
temp <- as.data.frame(do.call(rbind, lapply(thresh, myFun)))
temp$thresh <- thresh
#Melt and plot using ggplot (melt comes from reshape2)
library(reshape2)
library(ggplot2)
tempMelt <- melt(temp, id.vars = "thresh")
ggplot(tempMelt, aes(x = thresh, y = value)) +
  geom_line(aes(group = variable, colour = variable))
Alternatively, as mentioned above in the comments, there are a plethora of ROC functions in R, which can be found using ??ROC. For example, using roc from the caret package:
temp <- as.data.frame(roc(dat$x,factor(dat$y)))
tempMelt <- melt(temp,id.vars="cutoff")
ggplot(tempMelt,aes(x=cutoff,y=value)) +
geom_line(aes(group=variable,colour=variable))
Maybe something like this:
# A function for counting outcomes for a certain probability
f <- function(d, p) {
  lp <- d$prob < p
  c(TNeg = sum(lp & d$out == 0), TPos = sum(!lp & d$out == 1))
}
# Make it accept a vector of probabilities
vf <- Vectorize(f, 'p')
# Sample data
n <- 100
d <- data.frame(prob=runif(n), out=round(runif(n)))
# Probabilities to plot
p <- seq(0,1, len=20)
res <- vf(d, p)
colnames(res) <- paste('p(', p, ')', sep='')
matplot(p, t(res), type='l', xlab='prob', ylab='count')
