R visualization of correct predictions

I have trained SVM classification models, based on probability prediction, to recognise the digits 0-9.
I have a visualization of the probabilities for every model. For the digit 0 it looks like this (the probabilities are in the variable prediction0).
Then I trained a final classifier and got 1423 correct observations (out of 1499). I have a vector c containing the correctly predicted numbers.
What I need to do is mark a point in red on this graph whenever 0 was correctly predicted in vector c. If it helps, I have "ck", which contains the probabilities of all digit predictions for every test sample; the maximum probability was my final prediction.

You can do this by using the col argument. I'll use the mtcars dataset as an example
plot(
mpg~disp,
data=mtcars,
col=ifelse(mtcars$am==0,"red","blue")
)
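Applied to the question, the same idea might look like the sketch below. Everything in it is made up for illustration: truth, final_pred and the simulated prediction0 are hypothetical stand-ins, since the asker's actual objects (c, ck) are not shown. The only technique used is the one above: build a logical condition and pass it through ifelse() to col.

```r
# Hypothetical stand-ins for the asker's objects (not from the original post)
set.seed(1)
n <- 20
truth <- sample(0:9, n, replace = TRUE)             # true digit of each test sample
final_pred <- truth
final_pred[c(3, 8)] <- (truth[c(3, 8)] + 1) %% 10   # two deliberate misclassifications
prediction0 <- runif(n)                             # simulated "probability of 0" per sample

# TRUE where the final classifier predicted 0 and the true digit was 0
correct0 <- final_pred == 0 & truth == 0

# One colour per point: red for correctly predicted zeros, black otherwise
point_col <- ifelse(correct0, "red", "black")
plot(prediction0, col = point_col, pch = 19,
     xlab = "Test sample", ylab = "P(digit = 0)")
```

If the correct predictions are stored as indices rather than as a logical vector, something like seq_along(prediction0) %in% idx would give the same kind of condition.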

Related

How to make predictions using an LDA (Linear discriminant analysis) model in R

As the title suggests, I am trying to make predictions using an LDA model in R. I have two sets of data that I'm working with. The first set is a series of entries associated with 16 predictor variables and 1 outcome variable (the outcome variable is the "group" each entry belongs to, which I've assigned myself). The second set also consists of entries associated with the same 16 predictor variables, but with no outcome variable. What I would like to do is predict the group membership of the entries in the second set of data.
So far I've successfully managed to create an LDA model by separating the first dataset into a "training set" and a "test set". However, now that I have the model I don't know how I would go about predicting the group membership of the entries in my second data set.
Thanks for the help! Please let me know if any more information is required, this is my first post on stack overflow so I am still learning the ropes.
Short example based on An Introduction to Statistical Learning, chapter 4. Say you have fitted a model lda_model on a training_set, with dependent variable Group, which you aim to predict, and predictors Predictor1 and Predictor2:
library(MASS)
lda_model <- lda(Group ~ Predictor1 + Predictor2, data = training_set)
You can then make predictions with the lda_model using the predict function on the testing_set
lda_predictions <- predict(lda_model, testing_set)
lda_predictions then holds the posterior probabilities in $posterior that the observation is part of Group j.
You could then apply a threshold of, for instance (but not limited to), 50% probability. E.g.
sum(lda_predictions$posterior[, 7] >= .5)
returns the number of observations for which the probability that the observation is part of Group 7 is at least 50%.
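For the situation in the question (a second data set with predictors but no outcome), a minimal sketch could look like this. It uses the built-in iris data as a stand-in: Species plays the role of Group, and the names labelled and unlabelled are invented for the example. The point is that predict() only needs the predictor columns in newdata, and the predicted membership is in $class.

```r
library(MASS)

# iris stands in for the labelled data; Species plays the role of Group
set.seed(42)
train_idx <- sample(nrow(iris), 100)
labelled <- iris[train_idx, ]
unlabelled <- iris[-train_idx, names(iris) != "Species"]  # predictors only, no outcome

# Fit on the labelled data, then predict the rows that have no outcome
lda_model <- lda(Species ~ ., data = labelled)
lda_predictions <- predict(lda_model, newdata = unlabelled)

head(lda_predictions$class)                # predicted group for each new row
head(round(lda_predictions$posterior, 3))  # posterior probability per group
```

The column names of the new data must match the predictor names used when fitting the model.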

How to improve a Zero-Inflated Negative Binomial regression model?

everybody!
I have a response variable that counts successful days in a month and has a peculiar distribution (see above). About 50% of the values are zeros, and there is a heavy tail. Because of the overdispersion and the excess of zeros, I was advised to predict it with a Zero-Inflated Negative Binomial regression model.
However, no matter how significant a model I obtain, it reflects little of those distributional features (see below). For example, the peaks are always around 4, and no predictions fall beyond 20.
Is this usual in fitting overdispersed, heavy-tailed count data? Are there other ways to improve the fitting? Any suggestions would be appreciated. Thank you!
P. S.
I also tried logistic regression to predict zero/non-zero only. But none of the fitted models perform better than simply guessing zeros for all cases.
I suppose you plotted a histogram of the fitted values. That only reflects the fitted means, possibly multiplied by the ratio of being zero, depending on the model you use. It is not supposed to recreate the observed distribution, because the spread of your data is captured by the dispersion parameter rather than by the fitted means.
We can use an example from the pscl package:
library(pscl)
data("bioChemists")
fit <- hurdle(art ~ ., data = bioChemists,dist="negbin",zero.dist="binomial")
par(mfrow=c(1,2))
hist(fit$y,main="Observed")
hist(fit$fitted.values,main="Fitted")
As mentioned before, in this hurdle model, the fitted values you see, are the predicted means multiplied by the ratio of being zero (see more here):
head(fit$fitted.values)
1 2 3 4 5 6
1.9642025 1.2887343 1.3033753 1.3995826 2.4560884 0.8783207
head(predict(fit,type="zero")*predict(fit,type="count"))
1 2 3 4 5 6
1.9642025 1.2887343 1.3033753 1.3995826 2.4560884 0.8783207
To simulate the data based on the fitted model, we extract out the parameters:
Theta=fit$theta
Means=predict(fit,type="count")
Zero_p = predict(fit,type="prob")[,1]
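Define a function to simulate the counts:
simulateCounts = function(mu,theta,zero_p){
  N = length(mu)
  # draw from the negative binomial count component
  x = rnbinom(N,mu=mu,size=theta)
  # set each draw to zero with probability zero_p
  x[runif(N)<zero_p] = 0
  x
}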
So run this simulation a number of times to get the spectrum of values:
set.seed(100)
simulated = replicate(10,simulateCounts(Means,Theta,Zero_p))
simulated = as.vector(simulated)
par(mfrow=c(1,2))
hist(bioChemists$art,main="Observed")
hist(simulated,main="simulated")
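The same point can be checked without any fitted model. The sketch below uses only base R and entirely hypothetical values (mu, theta and p0 are invented, not estimated from any data): it compares the spread of the means with the spread of counts drawn around those means.

```r
set.seed(1)
n <- 5000
mu <- exp(rnorm(n, mean = 1, sd = 0.5))  # hypothetical fitted means
theta <- 0.8                             # hypothetical dispersion parameter
p0 <- 0.3                                # hypothetical extra-zero probability

# Draw counts around the means, then force a share p0 of them to zero
draws <- rnbinom(n, mu = mu, size = theta)
draws[runif(n) < p0] <- 0

expected <- mu * (1 - p0)                # mean of each simulated observation

# The means are never zero and vary little; the draws have many zeros and a long tail
c(var_means = var(expected), var_draws = var(draws))
c(zero_share_means = mean(expected == 0), zero_share_draws = mean(draws == 0))
```

A histogram of expected would look much like the "Fitted" panel, while a histogram of draws would show the zeros and the tail.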

How do I calculate AUC from two continuous variables in R?

I have the following data:
# actual value:
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)
I already calculated MSE and RMSE for these two, but they're asking for AUC and ROC curve. How can I calculate it from this data using R? I thought AUC is for classification problems, was I mistaken? Can we still calculate AUC for numeric values like above?
Question:
I thought AUC is for classification problems, was I mistaken?
You are not mistaken. The area under the receiver operating characteristic curve can't be computed for two numeric vectors like the ones in your example. It's used to determine how well your binary classifier stands up to a gold standard binary classifier. You need a vector of cases vs. controls, or levels for the a vector that put each value in one of two categories.
Here's an example of how you'd do this with the pROC package:
library(pROC)
# actual value
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)
df <- data.frame(a = a, p = p)
# order the data frame according to the actual values
odf <- df[order(df$a),]
# convert the actual values to an ordered binary classification
odf$a <- odf$a > 12 # arbitrarily decided to use 12 as the threshold
# construct the roc object
roc_obj <- roc(odf$a, odf$p)
auc(roc_obj)
# Area under the curve: 0.9615
Here, we have arbitrarily decided that the threshold for the gold standard (a) is 12. If that's the case, then observations with a value lower than 12 are controls. The prediction (p) classifies very well, with an AUC of 0.9615. We don't have to decide on the threshold for our prediction classifier in order to determine the AUC, because it's independent of that threshold decision. We can slide it up and down depending on whether it's more important to find cases or to not misclassify a control.
Important Note
I completely made up the threshold for the gold standard classifier. If you choose a different threshold (for the gold standard), you'll get a different AUC. For example, if we chose 28, the AUC would be 1. The AUC is independent of the threshold for the predictor, but absolutely depends on the threshold for the gold standard.
EDIT
To clarify the above note, which was apparently misunderstood, you were not mistaken. This kind of analysis is for classification problems. You cannot use it here without more information. In order to do it, you need a threshold for your a vector, which you don't have. You CAN'T make one up and expect to get a non made up result for the AUC. Because the AUC depends on the threshold for the gold standard classifier, if you just make up the threshold, as we did in the exercise above, you are also just making up the AUC.
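To see why the result moves with the gold standard threshold, the AUC can be computed by hand as the share of (case, control) pairs in which the case gets the higher predicted value. The helper auc_by_pairs below is written for this illustration and is not part of pROC; the thresholds 12 and 28 are the same made-up ones used above.

```r
# actual value
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,
       10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,
       18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)

# AUC as the share of (case, control) pairs ranked correctly; ties count as 0.5
auc_by_pairs <- function(is_case, score) {
  cases <- score[is_case]
  controls <- score[!is_case]
  cmp <- outer(cases, controls, FUN = function(x, y) (x > y) + 0.5 * (x == y))
  mean(cmp)
}

auc_12 <- auc_by_pairs(a > 12, p)  # 25 of 26 pairs ordered correctly
auc_28 <- auc_by_pairs(a > 28, p)  # every case outranks every control
c(threshold_12 = auc_12, threshold_28 = auc_28)
```

With the threshold at 12 there are 13 cases and 2 controls, and only one of the 26 pairs is ordered wrongly, which is where 0.9615 comes from.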

Boosting classification tree in R

I'm trying to boost a classification tree using the gbm package in R and I'm a little bit confused about the kind of predictions I obtain from the predict function.
Here is my code:
#Load packages, set random seed
library(gbm)
set.seed(1)
#Generate random data
N<-1000
x<-rnorm(N)
y<-0.6^2*x+sqrt(1-0.6^2)*rnorm(N)
z<-rep(0,N)
for(i in 1:N){
if(x[i]-y[i]+0.2*rnorm(1)>1.0){
z[i]=1
}
}
#Create data frame
myData<-data.frame(x,y,z)
#Split data set into train and test
train<-sample(N,800,replace=FALSE)
test<-(-train)
#Boosting
boost.myData<-gbm(z~.,data=myData[train,],distribution="bernoulli",n.trees=5000,interaction.depth=4)
pred.boost<-predict(boost.myData,newdata=myData[test,],n.trees=5000,type="response")
pred.boost
pred.boost is a vector with elements from the interval (0,1).
I would have expected the predicted values to be either 0 or 1, as my response variable z also consists of dichotomous values - either 0 or 1 - and I'm using distribution="bernoulli".
How should I proceed with my prediction to obtain a real classification of my test data set? Should I simply round the pred.boost values or is there anything I'm doing wrong with the predict function?
Your observed behavior is correct. From documentation:
If type="response" then gbm converts back to the same scale as the
outcome. Currently the only effect this will have is returning
probabilities for bernoulli.
So you should be getting probabilities when using type="response", which is exactly what you see. Also, distribution="bernoulli" merely states that the labels follow a Bernoulli (0/1) pattern. You can omit it and the model will still run fine.
To proceed, use predict_class <- pred.boost > 0.5 (cutoff = 0.5), or plot the ROC curve to decide on a cutoff yourself.
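As a small illustration of the cutoff step, here is a sketch with made-up numbers. pred_prob and actual are hypothetical vectors standing in for pred.boost and the true labels of the test set; no gbm model is fitted here.

```r
# Hypothetical predicted probabilities and true 0/1 labels (invented for the example)
pred_prob <- c(0.03, 0.81, 0.47, 0.66, 0.12, 0.95)
actual <- c(0, 1, 1, 1, 0, 1)

# Turn probabilities into classes with a 0.5 cutoff
predict_class <- as.integer(pred_prob > 0.5)

# Confusion table and accuracy at this cutoff
table(predicted = predict_class, actual = actual)
accuracy <- mean(predict_class == actual)
accuracy
```

Here the third observation (probability 0.47) is misclassified at a cutoff of 0.5, which is the kind of trade-off a ROC curve helps you inspect before settling on a cutoff.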
Try using adabag. Class, probabilities, votes and error are built into adabag, which makes the output easy to interpret and, of course, takes fewer lines of code.

How do I get individual tree probabilities from Random Forests in R?

I'm using the randomForest package in R on a classification problem (outcome is binary).
I want to get the probability output of each one of the trees (to get a prediction interval).
I've set the predict.all=TRUE argument in the predictions, but it gives me a matrix of 800 columns (=the number of trees in my forest) and each of them is a 1 or a 0. How do I get the probability output rather than class?
PS: my node size is 1, which means that this should make sense. However, I changed the node size to 50 and still got all 0's and 1's, no probabilities.
Here's what I'm doing:
#build model (node size=1)
rf<-randomForest(y~. ,data=train, ntree=800,replace=TRUE, proximity=TRUE, keep.inbag=TRUE)
#get the predictions
#store the predictions from all the trees
all_tree_train<-predict(rf, test, type="prob", predict.all= TRUE)$individual
This gives a matrix of 0's and 1's rather than probabilities.
I realise this question is old, but it might help anyone with a similar question.
If you query the trees for their results, you'll always get the end classifications, which are deterministic given an initialised forest. You can extract the probabilities by setting predict.all to TRUE, as you've done, and averaging the votes across trees.
In general, the forest classifies an item 'm' as class 'x' with probability
(number of trees which bin m as x)/(number of trees)
As you only have a binary classification, the share of trees voting for class 1 for each observation (the row means of the prediction matrix, where each row is an observation and each column is a tree) gives you the probability of being in class 1.
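A minimal sketch of that vote counting, assuming the individual matrix has one row per observation and one column per tree, as the documentation of predict.randomForest describes. The matrix below is a toy stand-in typed by hand, not the output of a real forest.

```r
# Toy stand-in for predict(rf, newdata, predict.all = TRUE)$individual:
# rows are observations, columns are trees, entries are class labels as text
individual <- matrix(
  c("1", "0", "1", "1",
    "0", "0", "1", "0",
    "1", "1", "1", "1"),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("obs", 1:3), paste0("tree", 1:4))
)

# Share of trees voting for class "1", per observation
prob_class1 <- rowMeans(individual == "1")
prob_class1
```

With only four trees the proportions are coarse (0.75, 0.25 and 1 here); with 800 trees they become much finer.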
So the documentation for predict.randomForest states:
If predict.all=TRUE, then the individual component of the returned
object is a character matrix where each column contains the predicted
class by a tree in the forest.
...so it does not appear that it is possible to have a probability returned for each individual tree.
If you want something like a prediction interval for classification, you might try fitting a random forest with many more trees and then generating predictions from many different (random?) subsets of the forest.
One thing you need to be careful of though is that you appear to be feeding your training data to predict.randomForest. This will of course give you biased predictions, unless you use the information from the inbag component of the random forest object to only select trees on which that observation was out of bag.

Resources