multiclass.roc from predict.gbm in R

I am having a hard time understanding how to format and use the output from predict.gbm ('gbm' package) with the multiclass.roc function ('pROC' package).
I used a multinomial gbm to predict a validation dataset; the output appears to be the probability of each data point belonging to each factor level (correct me if I am wrong):
preds2 <- predict.gbm(density.tc5.lr005, ProxFiltered, n.trees=best.iter, type="response")
> head(as.data.frame(preds2))
1.2534 2.2534 3.2534 4.2534 5.2534
1 0.62977743 0.25756095 0.09044278 0.021497259 7.215793e-04
2 0.16992912 0.24545691 0.45540153 0.094520208 3.469224e-02
3 0.02633356 0.06540245 0.89897614 0.009223098 6.474949e-05
The factor levels are 1-5; I am not sure why ".2534" is appended to each column name.
I am trying to compute the multi-class AUC as defined by Hand and Till (2001) using multiclass.roc but I'm not sure how to supply the predicted values in the single vector it requires.
I can try to work up an example if necessary, though I assume this is routine for some and I am missing something as a novice with the procedure.

Pass in the response variable as-is, and use the most likely class as the predictor:
multiclass.roc(ProxFiltered$response_variable, apply(preds2, 1, which.max))
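If your version of pROC is recent enough (matrix support for multiclass.roc was added around release 1.16, if I recall correctly), you can also pass the per-class probabilities directly instead of collapsing them to a single predicted class. A sketch, assuming the response is a factor with levels 1-5:
library(pROC)
# Columns must be named after the levels of the response; predict.gbm
# appended ".2534" to the class labels, so rename them first
probs <- as.matrix(as.data.frame(preds2))
colnames(probs) <- levels(as.factor(ProxFiltered$response_variable))
multiclass.roc(ProxFiltered$response_variable, probs)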

An alternative is to define a custom scoring function, for instance the ratio between the probabilities of two classes, and do the averaging yourself:
preds2 <- as.data.frame(preds2)
names(preds2) <- 1:5
aucs <- combn(1:5, 2, function(X) {
  auc(roc(ProxFiltered$response_variable, preds2[[X[1]]] / preds2[[X[2]]], levels = X))
})
mean(aucs)
Yet another (better) option is to move away from pairwise binary comparisons altogether and ask directly: is the best prediction (or some weighted best prediction) correlated with the true class?
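For instance, a rough sketch of that framing, using the same objects as above (this assumes the probability columns are ordered by factor level):
# How often does the most probable class match the true class?
pred_class <- apply(as.data.frame(preds2), 1, which.max)
mean(pred_class == as.integer(ProxFiltered$response_variable))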


How to obtain Brier Score in Random Forest in R?

I am having trouble getting the Brier score for my machine-learning predictive models. The outcome "y" is categorical (1 or 0). The predictors are a mix of continuous and categorical variables.
I have created four models with different predictors; I will call them "model_1"-"model_4" here (apart from the predictors, the other parameters are the same). Example code for my model:
library(randomForestSRC)
Model_1 <- rfsrc(y ~ ., data = TrainTest, ntree = 1000,
                 mtry = 30, nodesize = 1, nsplit = 1,
                 na.action = "na.impute", nimpute = 3, seed = 10,
                 importance = TRUE)
When I run "Model_1" in R, I get the fitted model output (screenshot not shown).
My question is: how can I get the predicted probability for those 412 people? And how do I find the observed probability for each person? Do I need to calculate it by hand? I found the function BrierScore() in the "DescTools" package, but "BrierScore(Model_1)" gives no results.
The code I added:
library(scoring)
library(DescTools)
BrierScore(Raw_SB)
class(TrainTest$VL_supress03)
TrainTest$VL_supress03_nu<-as.numeric(as.character(TrainTest$VL_supress03))
class(TrainTest$VL_supress03_nu)
prediction_Raw_SB = predict(Raw_SB, TrainTest)
BrierScore(prediction_Raw_SB, as.numeric(TrainTest$VL_supress03) - 1)
BrierScore(prediction_Raw_SB, as.numeric(as.character(TrainTest$VL_supress03)) - 1)
BrierScore(prediction_Raw_SB, TrainTest$VL_supress03_nu - 1)
These attempts all gave me error messages.
One assumption I am making about your approach is that you want to compute the Brier score on the data you trained your model on, which is usually not the correct approach (look up train-test splits if you need more background). You should reflect on whether that is really what you want.
The BrierScore method in DescTools only has a dedicated method for glm models; otherwise it expects a vector of predicted probabilities and a vector of true values as input (see ?BrierScore).
What you need to do, then, is predict on your data:
prediction = predict(model_1, TrainTest, na.action = "na.impute")
and then compute the Brier score:
BrierScore(as.numeric(TrainTest$y) - 1, prediction$predicted[, 1L])
(Note that we transform TrainTest$y into a numeric vector of 0's and 1's in order to compute the Brier score.)
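As a sanity check, the Brier score is just the mean squared difference between the predicted probability and the 0/1 outcome, so you can also compute it by hand. A minimal sketch; the column of prediction$predicted chosen here is an assumption, so verify with colnames() that it matches the class coded as 1:
# Manual Brier score: mean squared error between predicted probability
# and the observed 0/1 outcome
y01 <- as.numeric(TrainTest$y) - 1
p <- prediction$predicted[, "1"]  # column named after the positive class; check colnames()
mean((p - y01)^2)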
Note: the randomForestSRC package also prints a normalized Brier score when you call print(prediction).
In general, using one of the available machine-learning workbenches in R (mlr3, tidymodels, caret) can simplify this kind of workflow and prevent many errors, which is especially valuable if you are less experienced in ML.
See e.g. this chapter in the mlr3 book for more information.
For reference, here is some very similar code using the mlr3 package, which also takes care of the train-test split automatically:
data(breast, package = "randomForestSRC")  # target variable is "status"
library(mlr3)
library(mlr3extralearners)
task = TaskClassif$new(id = "breast", backend = breast, target = "status")
algo = lrn("classif.rfsrc", na.action = "na.impute", predict_type = "prob")
# 80/20 holdout split, scored with the binary Brier score measure
resample(task, algo, rsmp("holdout", ratio = 0.8))$score(msr("classif.bbrier"))

R - linear model does not match experimental data

I am trying to perform a linear regression on experimental data consisting of replicate measures of the same condition (for several conditions) to check the reliability of the experimental data. For each condition I have ~5k-10k observations stored in a data frame df:
      cond1_repA    cond1_repB    cond2_repA  cond2_repB  ...
1     4.158660e+06  4454400.703   ...
2     1.458585e+06  4454400.703   ...
3     NA            887776.392    ...
...
5023  9571785.382   9.679092e+06  ...
I use the following code to plot the scatterplot + lm fit + R² value (stored in rdata) for each pair of replicates:
rdata <- matrix(0, nrow = 1, ncol = 7)  # stores the adjusted R^2 per condition
for (i in seq(1, 13, 2)) {
  # pair replicate A (column i) with replicate B (column i + 1)
  vec <- matrix(0, nrow = nrow(df), ncol = 2)
  vec[, 1] <- df[, i]
  vec[, 2] <- df[, i + 1]
  vec <- na.exclude(vec)
  plot(log10(vec[, 1]), log10(vec[, 2]),
       xlab = 'rep A', ylab = 'rep B', col = "#00000033")
  abline(fit <- lm(log10(vec[, 2]) ~ log10(vec[, 1])), col = 'red')
  rdata[1, (i + 1) / 2] <- format(summary(fit)$adj.r.squared, digits = 4)
  legend("topleft", bty = "n", legend = paste("R2 is", rdata[1, (i + 1) / 2]))
}
However, the fitted line appears shifted, so that it does not match the trend I see in the experimental data (plot not shown).
This occurs consistently for every condition. I unsuccessfully tried to find an explanation by looking at the source code and browsing different forums and posts (this or here).
I would have liked to simply comment/ask a few questions, but I can't.
From what I understand, both repA and repB are measured with error. Hence, you cannot fit your data using an ordinary least squares (OLS) procedure, which only accounts for error in Y (some might argue a weighted OLS may work, but I'm not skilled enough to discuss that). Your question seems linked to this one.
What you can use is a total least squares (TLS) procedure: it takes into account the error in both X and Y. In the example below, I used a "normal" TLS, assuming the error is the same in X and Y (hence error.ratio = 1). If it is not, you can specify the error ratio with error.ratio = var(y1)/var(x1) (at least I think it's var(Y)/var(X); check the documentation to make sure).
library(mcr)
# x1 and y1 are the two replicate measurement vectors
MCR_reg <- mcreg(x1, y1, method.reg = "Deming", error.ratio = 1, method.ci = "analytical")
MCR_intercept <- getCoefficients(MCR_reg)[1, 1]
MCR_slope <- getCoefficients(MCR_reg)[2, 1]
# CI for predicted values
x_to_predict <- seq(0, 35)
predicted_values <- MCResultAnalytical.calcResponse(MCR_reg, x_to_predict, alpha = 0.05)
CI_low <- predicted_values[, 4]
CI_up <- predicted_values[, 5]
Please note that in Deming/TLS regression, your x and y errors are supposed to follow a normal distribution, as explained here. If that's not the case, go for a Passing-Bablok regression (the R code is here).
Also note that R² is not defined for Deming or Passing-Bablok regressions (see here). A correlation coefficient is a good proxy, although it does not provide exactly the same information. Since you are studying a linear correlation between two factors, see Pearson's product-moment correlation coefficient, and use e.g. the rcorr function from the Hmisc package.
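To see why OLS misbehaves here, a small self-contained simulation (not from the original post) shows the slope attenuation when X carries measurement error, and how Deming regression recovers the true slope of 1:
library(mcr)
set.seed(1)
truth <- runif(500, 0, 35)        # true underlying values
x1 <- truth + rnorm(500, sd = 5)  # replicate A, measured with error
y1 <- truth + rnorm(500, sd = 5)  # replicate B, same error variance
coef(lm(y1 ~ x1))                 # OLS slope is biased towards 0 (~0.8 here)
getCoefficients(mcreg(x1, y1, method.reg = "Deming", error.ratio = 1,
                      method.ci = "analytical"))  # Deming slope is close to 1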

R Function for Rounding Imputed Binary Variables

There is an ongoing discussion about the reliable methods of rounding imputed binary variables. Still, the so-called Adaptive Rounding Procedure developed by Bernaards and colleagues (2007) is currently the most widely accepted solution.
The Adaptive Rounding Procedure involves a normal approximation to the binomial distribution. That is, the imputed values in a binary variable are assigned the value 0 or 1, based on a threshold derived from the formula below, where x is the (partially imputed) binary variable:
threshold <- mean(x) - qnorm(mean(x))*sqrt(mean(x)*(1-mean(x)))
To the best of my knowledge, major R packages on imputation (such as Amelia or mice) have yet to include functions that help with the rounding of binary variables. This shortcoming makes it difficult especially for researchers who intend to use the imputed values in logistic regression analysis, given that their dependent variable is coded in binary.
Therefore, it makes sense to write an R function for the Bernaards formula above:
bernaards <- function(x) {
  mean(x) - qnorm(mean(x)) * sqrt(mean(x) * (1 - mean(x)))
}
With this function, it is much easier to calculate the threshold for an imputed binary variable with a mean of, say, .623:
bernaards(.623)
[1] 0.4711302
After calculating the threshold, the usual next step is to round the imputed values in variable x.
My question is: how can the above function be extended to include that task as well?
In other words, one can do all of the above in R with three lines of code:
threshold <- mean(df$x) - qnorm(mean(df$x)) * sqrt(mean(df$x) * (1 - mean(df$x)))
df$x[df$x >= threshold] <- 1
df$x[df$x < threshold] <- 0
It would be best if the function included the above recoding/rounding, as repeating the same process for each binary variable is time-consuming, especially when working with large data sets. With such a function, one could simply run one extra line of code (as below) after imputation, and continue with the analyses:
bernaards(dummy1, dummy2, dummy3)
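A minimal sketch of such an extension (the function and column names are illustrative, not an established API): compute the threshold and round in one step, then apply it across columns.
# Compute the Bernaards et al. (2007) threshold from the imputed variable
# and round it in one step; returns a 0/1 integer vector
bernaards_round <- function(x) {
  threshold <- mean(x) - qnorm(mean(x)) * sqrt(mean(x) * (1 - mean(x)))
  as.integer(x >= threshold)
}
# Apply to several imputed dummy variables at once (column names hypothetical)
vars <- c("dummy1", "dummy2", "dummy3")
df[vars] <- lapply(df[vars], bernaards_round)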

glm summary not giving coefficients values

I'm trying to apply glm to a given dataset, but summary(model1) is not giving me the expected output: there are no coefficient values for Estimate, Std. Error, z value, Pr(>|z|), etc.; it just returns NA for the individual attributes.
TEXT <- c('Learned a new concept today : metamorphic testing. t.co/0is1IUs3aW','BMC Bioinformatics BioMed Central: Detecting novel ncRNAs by experimental #RNomics is not an easy task... http:/t.co/ui3Unxpx #bing #MyEN','BMC Bioinformatics BioMed Central: small #RNA with a regulatory function as a scientific ... Detecting novel… http:/t.co/wWHOEkR0vc #bing','True or false? link(#Addition, #Classification) http:/t.co/zMJuTFt8iq #Oxytocin','Biologists do have a sense of humor, especially computational bio people http:/t.co/wFZqaaFy')
NAME <- c('QSoft Consulting','Fabrice Leclerc','Sungsam Gong','Frederic','Zach Stednick')
SCREEN_NAME <-c ('QSoftConsulting','rnomics','sunggong','rnomics','jdwasmuth')
FOLLOWERS_COUNT <- c(734,1900,234,266,788)
RETWEET <- c(1,3,5,0,2)
FRIENDS_COUNT <-c(34,532,77,213,422)
STATUSES_COUNT <- c(234,643,899,222,226)
FAVOURITES_COUNT <- c(144,2677,445,930,254)
df <- data.frame(TEXT,NAME,SCREEN_NAME,RETWEET,FRIENDS_COUNT,STATUSES_COUNT,FAVOURITES_COUNT)
mydata<-df
mydata$FAVOURITES_COUNT <- ifelse( mydata$FAVOURITES_COUNT >= 445, 1, 0) #converting fav_count to binary values
Splitting data
library(caret)
split=0.60
trainIndex <- createDataPartition(mydata$FAVOURITES_COUNT, p=split, list=FALSE)
data_train <- mydata[ trainIndex,]
data_test <- mydata[-trainIndex,]
glm model
library(e1071)
model1 <- glm(FAVOURITES_COUNT~.,family = binomial, data = data_train)
summary(model1)
I want to get the p-values for further analysis. So far I think my code is right; how can I get the correct output?
A binomial distribution will only work if the dependent variable has two outcomes. You should consider a Poisson distribution when the dependent variable is a count. See here for more details: http://www.statmethods.net/advstats/glm.html
Your code for fitting the GLM is programmatically correct. However, there are a few issues:
As mentioned in the comments, every categorical variable should be converted to a factor with as.factor(); a GLM doesn't know what a "string" variable is (see the sketch after this list).
As MorganBall indicated, if your data truly are counts, you may consider fitting a Poisson GLM instead of converting to binary and using logistic regression.
You indicate that you have 13 parameters and 1000 observations. While this may seem like enough data, some of these parameters may have very few (close to 0?) observations in some levels. This is a problem.
In addition, check that your data do not perfectly separate the response: if some combination of parameters separates the response perfectly, the maximum likelihood estimate does not converge and theoretically goes to infinity. In practice, you will see very large standard errors for your estimates.
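To make the first point concrete, here is a sketch using the question's column names (which columns to keep is a judgment call, not part of the original answer): convert categorical columns to factors and drop free-text or identifier columns that are unique per row, since those produce one dummy per observation and hence NA coefficients.
# Keep only predictors that can plausibly be estimated from this data
data_train$SCREEN_NAME <- as.factor(data_train$SCREEN_NAME)
model1 <- glm(FAVOURITES_COUNT ~ RETWEET + FRIENDS_COUNT + STATUSES_COUNT,
              family = binomial, data = data_train)
summary(model1)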

Lasso, glmnet, preprocessing of the data

I'm trying to use the glmnet package to fit a lasso (L1 penalty) on a model with a binary outcome (a logit). My predictors are all binary (1/0, not ordered, ~4000 of them) except for one continuous variable.
I need to convert the predictors into a sparse matrix, since it takes forever and a day otherwise.
My question is: it seems that people use sparse.model.matrix rather than just converting their matrix into a sparse matrix. Why is that? Do I need to do that here? The outcome is a little different for the two methods.
Also, do my factors need to be coded as factors (for both the outcome and the predictors), or is it sufficient to use the sparse matrix and specify family = "binomial" in the glmnet model?
Here's what I'm doing so far:
# Create a random dataset: y is the outcome, x_d holds the dummies
# (10 here for simplicity) and x_c is the continuous variable
library(Matrix)
library(glmnet)
y <- sample(0:1, 200, replace = TRUE)
x_d <- matrix(sample(0:1, 2000, replace = TRUE), nrow = 200, ncol = 10)
x_c <- sample(60:90, 200, replace = TRUE)
# FIRST: scale the one continuous variable
scaled <- scale(x_c, center = TRUE, scale = TRUE)
# then bind the predictors together
x <- cbind(x_d, scaled)
# HERE'S MY MAIN QUESTION: what I currently do is
xt <- Matrix(x, sparse = TRUE)
# then run the cross-validation...
cv_lasso_1 <- cv.glmnet(xt, y, family = "binomial", standardize = FALSE)
# which gives slightly different results from (here the outcome variable
# must be included in the data so the formula can find it)
xt <- sparse.model.matrix(y ~ ., data = data.frame(y, x))
# then run CV.
So to sum up, my two questions are:
1 - Do I need to use sparse.model.matrix even if my factors are just binary and not ordered? (And if yes, what does it actually do differently from just converting the matrix to a sparse matrix?)
2 - Do I need to code the binary variables as factors?
The reason I ask is that my dataset is huge; it saves a lot of time to just do it without coding them as factors.
I don't think you need sparse.model.matrix: all it really gives you over a regular sparse matrix is the expansion of factor terms, and if your variables are already binary that gains you nothing. You certainly don't need to code them as factors; I frequently use glmnet on a regular (non-model) sparse matrix containing only 1's. At the end of the day, glmnet is a numerical method, so a factor will get converted to a number in the end regardless.
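To see the difference concretely, here is a tiny sketch (illustrative data): sparse.model.matrix expands factor columns into dummy columns and adds an intercept, which is redundant when your matrix is already 0/1.
library(Matrix)
d <- data.frame(f = factor(c("a", "b", "c", "a")), z = c(1, 0, 1, 1))
# f is expanded into dummy columns fb and fc, plus an (Intercept) column
sparse.model.matrix(~ ., data = d)
# For columns that are already 0/1 dummies (x_d and scaled as defined in
# the question), a plain sparse conversion suffices
Matrix(cbind(x_d, scaled), sparse = TRUE)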
