How to plot PCA with caret in R

I am getting PCA components using preProcess() from caret in R, and getting quantitative results.
# parenthesize ncol(...) - 1: `:` binds tighter than `-`, so 1:ncol(data)-1 is 0:(ncol(data)-1)
dataPCA <- preProcess(data[1:(ncol(data) - 1)], method = "pca", thresh = 0.95)
print(dataPCA)
print(dataPCA$rotation)
PCATrain <- predict(dataPCA, dataTrain[, 1:(ncol(dataTrain) - 1)])
PCATest <- predict(dataPCA, dataTest[, 1:(ncol(dataTest) - 1)])
However, I'd like to plot the components and the variance explained, as you would with prcomp using plot(pca, type="l"). Is that possible using preProcess(), or should I run prcomp()?

preProcess doesn't save that info. I'd use prcomp.
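As a rough sketch of the prcomp route: the code below assumes the built-in iris data as a stand-in for your data, with the outcome in the last column, and reuses the 0.95 cut-off from thresh above. scale. = TRUE is set because preProcess(method = "pca") also centers and scales the predictors.

```r
# Stand-in data (assumption): iris replaces the asker's `data`;
# the last column is treated as the outcome and dropped.
predictors <- iris[, 1:(ncol(iris) - 1)]

# Center and scale before PCA, as preProcess(method = "pca") does.
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and its running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative variance reaches 95%.
n_keep <- which(cum_var >= 0.95)[1]

# Scree plot of the component variances.
plot(pca, type = "l", main = "Scree plot")

# Cumulative proportion of variance, with the 95% threshold marked.
plot(cum_var, type = "b", ylim = c(0, 1),
     xlab = "Principal component",
     ylab = "Cumulative proportion of variance")
abline(h = 0.95, lty = 2)
```

n_keep should then be comparable to the number of components preProcess reports for the same threshold.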

Related

Worm plot residuals graph in ggplot2

I'm trying to draw a worm plot of the residuals of a model fitted with the gamlss function from the gamlss package. The graph I am interested in looks like the one below:
Below is the code I tried, applied to a dataset that ships with R. It uses the wormplot_gg function from the childsds package, but the result does not look like the example above.
library(ggplot2)
library(gamlss)
library(childsds)
head(Orange)
Dados <- Orange
Model <- gamlss(circumference~age, family=NO,data=Dados); Model
wp(Model)
wormplot_gg(m = Model)
Below is the traditional result from the wp function in the gamlss package.
And finally, here is the result from the wormplot_gg function in the childsds package. As already described, it does not have the visual structure of the first figure, which is what I am after.
Use qqplotr (https://aloy.github.io/qqplotr/index.html) with the detrend = TRUE option:
library(qqplotr)
set.seed(1)
df <- data.frame(z=rnorm(50))
ggplot(df, aes(sample=z)) +
  stat_qq_point(detrend = TRUE) +
  stat_qq_band(detrend = TRUE, color = 'black', fill = NA, size = 0.5)
You can also add geom_hline(yintercept = 0).
Edit:
To use this with a gamlss model, you first have to extract the (randomized) quantile residuals from the model, which for gamlss is done simply with the residuals function. So you can do, e.g., df <- data.frame(z = residuals(Model)) and then continue with the rest of the code.
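If it helps to see what "detrended" means here, below is a minimal base-R sketch of the same idea, not the qqplotr or gamlss implementation. It assumes simulated normal values as a stand-in for residuals(Model), and the band is only an approximate pointwise 95% band based on the asymptotic standard error of a sample quantile.

```r
set.seed(1)
z <- rnorm(50)                 # stand-in (assumption) for residuals(Model)
n <- length(z)

theo <- qnorm(ppoints(n))      # theoretical standard normal quantiles
dev  <- sort(z) - theo         # detrended: sample quantile minus theoretical

# Approximate pointwise 95% band: the asymptotic standard error of a
# sample quantile at probability p is sqrt(p * (1 - p) / n) / dnorm(q).
grid <- seq(-3, 3, length.out = 200)
p    <- pnorm(grid)
se   <- sqrt(p * (1 - p) / n) / dnorm(grid)

plot(theo, dev, xlim = c(-3, 3), ylim = c(-1.5, 1.5),
     xlab = "Unit normal quantile", ylab = "Deviation")
lines(grid,  1.96 * se, lty = 2)
lines(grid, -1.96 * se, lty = 2)
abline(h = 0, lty = 3)
```

Points that stay inside the dashed band and close to the zero line suggest the residuals are consistent with a standard normal.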

plot one of 500 trees in randomForest package

How can I plot individual trees from the output of the randomForest function, from the package of the same name, in R? For example, I use the iris data and want to plot the first of the 500 trees. My code is:
model <-randomForest(Species~.,data=iris,ntree=500)
You can use the getTree() function in the randomForest package (official guide: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf). Note that it returns the tree as a table of splits rather than drawing it.
On the iris dataset:
require(randomForest)
data(iris)
## we have a look at the k-th tree in the forest
k <- 10
getTree(randomForest(iris[, -5], iris[, 5], ntree = 10), k, labelVar = TRUE)
You can use cforest to plot the trees, as below. I have hardcoded the number of trees to 5; change it as required.
ntree <- 5
library("party")
cf <- cforest(Species~., data=iris,controls=cforest_control(ntree=ntree))
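for (i in 1:ntree) {
  # prettytree() is internal to party, hence the ::: access
  pt <- party:::prettytree(cf@ensemble[[i]], names(cf@data@get("input")))
  # wrap the i-th tree in a BinaryTree object so that plot() can draw it
  nt <- new("BinaryTree")
  nt@tree <- pt
  nt@data <- cf@data
  nt@responses <- cf@responses
  # one PDF per tree: filex1.pdf, filex2.pdf, ...
  pdf(file = paste0("filex", i, ".pdf"))
  plot(nt, type = "simple")
  dev.off()
}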
cforest is another implementation of random forests. It can't be said which one is better in general, but there are a few differences. cforest uses conditional inference trees and puts more weight on the terminal nodes, whereas the randomForest implementation gives equal weight to the terminal nodes.
In short, cforest uses a weighted mean and randomForest uses a plain average.

Predictive model decision tree

I want to build a predictive model using decision tree classification in R. I used this code:
library(rpart)
library(caret)
DataYesNo <- read.csv('DataYesNo.csv', header=T)
summary(DataYesNo)
worktrain <- sample(1:50, 40)
worktest <- setdiff(1:50, worktrain)
DataYesNo[worktrain,]
DataYesNo[worktest,]
M <- ncol(DataYesNo)
input <- names(DataYesNo)[1:(M-1)]
target <- "YesNo"
tree <- rpart(YesNo~Var1+Var2+Var3+Var4+Var5,
data=DataYesNo[worktrain, c(input,target)],
method="class",
parms=list(split="information"),
control=rpart.control(usesurrogate=0, maxsurrogate=0))
summary(tree)
plot(tree)
text(tree)
I got just one root (Var3) and two leaves (yes, no), and I'm not sure about this result. How can I get the confusion matrix, accuracy, sensitivity, and specificity?
Can I get them with the caret package?
If you use your model to make predictions on your test set, you can use confusionMatrix() to get the measures you're looking for.
Something like this...
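# predict classes for the held-out rows (worktest holds row indices, not data)
predictions <- predict(tree, DataYesNo[worktest, ], type = "class")
# confusionMatrix() needs two factors with the same levels
actual <- factor(DataYesNo$YesNo[worktest], levels = levels(predictions))
cmatrix <- confusionMatrix(predictions, actual)
print(cmatrix)
The printed output reports accuracy, sensitivity, and specificity; the same values are stored in cmatrix$overall and cmatrix$byClass.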
Following your example, the confusion matrix can be obtained as follows. Note type = "class": without it, predict() on an rpart classification tree returns class probabilities rather than labels.
fitted <- predict(tree, DataYesNo[worktest, c(input,target)], type = "class")
actual <- DataYesNo[worktest, target]
confusion <- table(fitted = fitted, actual = actual)
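For reference, here is a small sketch of how accuracy, sensitivity, and specificity fall out of such a table. The two label vectors are made up for illustration (they are not from DataYesNo.csv), and "yes" is assumed to be the positive class.

```r
# Made-up labels (assumption) standing in for the test-set truth and predictions.
lv     <- c("no", "yes")
actual <- factor(c("yes", "yes", "no", "no", "yes",
                   "no", "yes", "no", "no", "yes"), levels = lv)
fitted <- factor(c("yes", "no", "no", "no", "yes",
                   "yes", "yes", "no", "no", "yes"), levels = lv)

# Rows are predictions, columns are the truth.
confusion <- table(fitted = fitted, actual = actual)

# Share of all cases on the diagonal (correct predictions).
accuracy <- sum(diag(confusion)) / sum(confusion)

# Share of actual "yes" cases predicted as "yes".
sensitivity <- confusion["yes", "yes"] / sum(confusion[, "yes"])

# Share of actual "no" cases predicted as "no".
specificity <- confusion["no", "no"] / sum(confusion[, "no"])

print(confusion)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

With these ten made-up cases all three measures come out at 0.8.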

Plot in SVM model (e1071 Package) using DocumentTermMatrix

I'm trying to create a plot for a model built with SVM from the e1071 package.
My code to build the model, predict, and build the confusion matrix is:
ptm <- proc.time()
svm.classifier = svm(x = train.set.list[[0.999]][["0_0.1"]],
y = train.factor.list[[0.999]][["0_0.1"]],
kernel ="linear")
pred = predict(svm.classifier, test.set.list[[0.999]][["0_0.1"]], decision.values = TRUE)
time[["svm"]] = proc.time() - ptm
confmatrix = confusionMatrix(pred,test.factor.list[[0.999]][["0_0.1"]])
confmatrix
train.set.list and test.set.list contain the train and test sets for several conditions. train.factor.list and test.factor.list hold the true labels for each set. The train and test sets are both DocumentTermMatrix objects.
Then I tried to plot my data with
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]])
but i got the message:
"Error in plot.svm(svm.classifier, train.set.list[[0.999]][["0_0.1"]]) :
missing formula."
What am I doing wrong? The confusion matrix looks good to me even without using the formula parameter in the svm function.
Without code to run, it's hard to say exactly what the problem is. My guess, given
?plot.svm
which says
formula: formula selecting the visualized two dimensions. Only needed if more than two input variables are used.
is that your data has more than two predictors. In that case, pass a formula naming the two predictors to plot (predictor1 and predictor2 below are placeholders for your own column names):
plot(svm.classifier, train.set.list[[0.999]][["0_0.1"]], predictor1 ~ predictor2)

ROC curve in R using ROCR package

Can someone please explain how to plot a ROC curve with ROCR?
I know that I should first run:
prediction(predictions, labels, label.ordering = NULL)
and then:
performance(prediction.obj, measure, x.measure="cutoff", ...)
I am just not clear on what is meant by predictions and labels. I created models with ctree and cforest, and I want the ROC curve for both of them so I can compare them in the end. In my case the class attribute is y_n, which I suppose should be used for the labels. But what about the predictions? Here are the steps I take (dataset name = bank_part):
pred<-cforest(y_n~.,bank_part)
tablebank<-table(predict(pred),bank_part$y_n)
prediction(tablebank, bank_part$y_n)
After running the last line I get this error:
Error in prediction(tablebank, bank_part$y_n) :
Number of cross-validation runs must be equal for predictions and labels.
Thanks in advance!
Here's another example: I have a training dataset (bank_training) and a testing dataset (bank_testing), and I ran a randomForest as below:
bankrf<-randomForest(y~., bank_training, mtry=4, ntree=2,
keep.forest=TRUE,importance=TRUE)
bankrf.pred<-predict(bankrf, bank_testing, type='response')
Now bankrf.pred is a factor object with levels c("0", "1"). Still, I don't know how to plot the ROC curve, because I get stuck at the prediction part. Here's what I do:
library(ROCR)
pred<-prediction(bankrf.pred$y, bank_testing$c(0,1)
But this is still incorrect, because I get the error message
Error in bankrf.pred$y_n : $ operator is invalid for atomic vectors
The predictions are your continuous predictions of the classification (scores or probabilities); the labels are the binary truth for each observation.
So something like the following should work:
> pred <- prediction(c(0.1,.5,.3,.8,.9,.4,.9,.5), c(0,0,0,1,1,1,1,1))
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
to generate an ROC.
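To make it clearer why the predictions have to be continuous, here is a hand-rolled base-R sketch of how a ROC curve is built from the same toy numbers: sweep a cut-off over the scores and record the true and false positive rates at each one. This is only an illustration, not ROCR's implementation.

```r
score <- c(0.1, 0.5, 0.3, 0.8, 0.9, 0.4, 0.9, 0.5)   # continuous predictions
truth <- c(0, 0, 0, 1, 1, 1, 1, 1)                   # binary labels

# Cut-offs from strictest to loosest; Inf gives the (0, 0) corner.
thr <- c(Inf, sort(unique(score), decreasing = TRUE))

# At each cut-off, call everything with score >= cut-off "positive".
tpr <- sapply(thr, function(t) mean(score[truth == 1] >= t))
fpr <- sapply(thr, function(t) mean(score[truth == 0] >= t))

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # chance line

# Area under the curve by the trapezoid rule (0.9 for these numbers).
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

Each distinct score contributes one point to the curve, which is why a vector of class labels is not enough.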
EDIT: It may be helpful for you to include sample reproducible code in the question (I'm having a hard time interpreting your comment).
There's no new code here, but... here's a function I use quite often for plotting an ROC:
plotROC <- function(truth, predicted, ...){
pred <- prediction(abs(predicted), truth)
perf <- performance(pred,"tpr","fpr")
plot(perf, ...)
}
Like @Jeff said, your predictions need to be continuous for ROCR's prediction function. require(randomForest); ?predict.randomForest shows that, by default, predict.randomForest returns a prediction on the original scale (class labels, in classification), whereas predict.randomForest(..., type = 'prob') returns probabilities of each class. So:
require(randomForest)
require(ROCR)
data(iris)
iris$setosa <- factor(1*(iris$Species == 'setosa'))
iris.rf <- randomForest(setosa ~ ., data=iris[,-5])
summary(predict(iris.rf, iris[,-5]))
summary(iris.preds <- predict(iris.rf, iris[,-5], type = 'prob'))
preds <- iris.preds[,2]
plot(performance(prediction(preds, iris$setosa), 'tpr', 'fpr'))
gives you what you want. Different classification packages require different commands for getting predicted probabilities -- sometimes it's predict(..., type='probs'), predict(..., type='prob')[,2], etc., so just check out the help files for each function you're calling.
This is how you can do it:
Have your data in a CSV file ("data_file.csv"; you may need to give the full path). The file should have column headers, which here I will take to be
"default_flag", "var1", "var2", "var3", where default_flag is 0 or 1 and the other variables can take any value.
R code:
rm(list=ls())
df <- read.csv("data_file.csv") #use the full path if needed
mylogit <- glm(default_flag ~ var1 + var2 + var3, family = "binomial" , data = df)
summary(mylogit)
library(ROCR)
df$score<-predict.glm(mylogit, type="response" )
pred<-prediction(df$score,df$default_flag)
perf<-performance(pred,"tpr", "fpr")
plot(perf)
auc<- performance(pred,"auc")
auc
Note that df$score will give you the probability of default.
If you want to use this logit (same regression coefficients) to score another data set, df2, for validation, use
df2 <- read.csv("data_file2.csv")
df2$score<-predict.glm(mylogit,newdata=df2, type="response" )
pred<-prediction(df2$score,df2$default_flag)
perf<-performance(pred,"tpr", "fpr")
plot(perf)
auc<- performance(pred,"auc")
auc
The problem is, as pointed out by others, prediction in ROCR expects numerical values. If you are inserting predictions from randomForest (as the first argument into prediction in ROCR), that prediction needs to be generated by type='prob' instead of type='response', which is the default. Alternatively, you could take type='response' results and convert to numerical (that is, if your responses are, say 0/1). But when you plot that, ROCR generates a single meaningful point on ROC curve. For having many points on your ROC curve, you really need the probability associated with each prediction - i.e. use type='prob' in generating predictions.
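A quick base-R illustration of the "single point" issue, using made-up scores and a 0.5 cut-off as assumptions: once probabilities are collapsed to 0/1 labels there is only one cut-off left that separates the two groups, so the ROC curve has a single interior point.

```r
truth <- c(0, 0, 0, 1, 1, 1, 1, 1)
prob  <- c(0.1, 0.5, 0.3, 0.8, 0.9, 0.4, 0.9, 0.5)   # like type = 'prob'
hard  <- as.numeric(prob >= 0.5)                      # like type = 'response', as 0/1

# Each distinct value is a usable cut-off, i.e. a point on the ROC curve.
n_cut_prob <- length(unique(prob))   # several cut-offs
n_cut_hard <- length(unique(hard))   # only 0 and 1

# The one interior point you get from hard labels:
fpr_hard <- mean(hard[truth == 0] == 1)
tpr_hard <- mean(hard[truth == 1] == 1)

c(n_cut_prob = n_cut_prob, n_cut_hard = n_cut_hard,
  fpr_hard = fpr_hard, tpr_hard = tpr_hard)
```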
The problem may be that you would like to run the prediction function on multiple runs, for example for cross-validation.
In that case, the "predictions" and "labels" arguments of prediction(predictions, labels, label.ordering = NULL) should be lists or matrices.
Try this one:
library(ROCR)
pred<-ROCR::prediction(bankrf.pred$y, bank_testing$c(0,1)
The function prediction is present in many packages. You should explicitly specify the namespace (ROCR::) to use the one in ROCR. This worked for me.
