How to plot SVM classifier using RTextTools package? - r

I am using the RTextTools package to create a document-term matrix, and then using the associated container in a range of classification models.
I have reviewed the package information and associated articles, but I cannot find any indication of how to plot the results of tuning and predicting with the classification models. For example, I am building a linear SVM model using svm.fit <- train_model(container, "SVM", kernel="linear", cost=1). How can I visualise svm.fit?
Ideally, I would like results similar to those from plot.svm in the e1071 package. However, I cannot use it here because class(container) is matrix_container rather than the data frame that plot.svm expects.
The code that I am utilising is below. Thanks for your help.
library(RTextTools)

# Create training container
dtMatrix <- create_matrix(cbind.data.frame(Train.df$Keyword1, Train.df$Keyword2), removeSparseTerms=.998)
Train_container <- create_container(dtMatrix, Train.df$Result, trainSize=1:10000, virgin=FALSE)

# Create validation container
# Interactive step: opens create_matrix in an editor so it can be patched by hand
trace("create_matrix", edit=TRUE)
Validate_dtMatrix <- create_matrix(cbind.data.frame(Validate.df$Keyword1, Validate.df$Keyword2), originalMatrix=dtMatrix)
predSize <- nrow(Validate.df)
ValidateContainer <- create_container(Validate_dtMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE)

# === Support vector machine ===
# Train on the training container created above
svm_linear <- train_model(Train_container, "SVM", kernel="linear", cost=1)
predict_SVM.Linear <- classify_model(ValidateContainer, svm_linear)

Related

Relative variable importance from CoxBoost

I am fitting time-to-event survival data using surv.CoxBoost in the mlr package. My question: is there any way to get relative importance for the variables in the fitted model? I have seen this post detailing variable importance for cv.glmnet but haven't seen any on CoxBoost.
Any ideas?
Below is an example of a model using CoxBoost. You may need to install CoxBoost from here, as it no longer seems to be on CRAN.
library(randomForestSRC)
library(mlr)
library(survival)
library(CoxBoost)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
set.seed(9512)
train <- sample(1:nrow(data), round(nrow(data)*0.7))
data.train <- data[train, ]
data.test <- data[-train, ]
task <- makeSurvTask(data=data.train, target=c("days", "status"))
learner <- makeLearner("surv.CoxBoost")
trained.learner <- train(learner, task)
CoxBoostfit <- trained.learner$learner.model
CoxBoostfit$coefficients

How can I calculate survival function in gbm package analysis?

I would like to analyze my data with a gradient boosted model.
However, because my data come from a cohort study, I am having trouble understanding the results of this model.
Here is my code. The analysis uses example data.
install.packages("randomForestSRC")
install.packages("gbm")
install.packages("survival")
library(randomForestSRC)
library(gbm)
library(survival)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
set.seed(9512)
train <- sample(1:nrow(data), round(nrow(data)*0.7))
data.train <- data[train, ]
data.test <- data[-train, ]
set.seed(9741)
# Store the fit under its own name so it does not shadow the gbm() function
gbm.fit <- gbm(Surv(days, status) ~ .,
               data=data.train,
               interaction.depth=2,
               shrinkage=0.01,
               n.trees=500,
               distribution="coxph")
summary(gbm.fit)
set.seed(9741)
gbm.pred <- predict(gbm.fit,
                    n.trees=500,
                    newdata=data.test,
                    type="response")
As I read the package documentation, "gbm.pred" is the result of Cox's partial likelihood.
set.seed(9741)
lambda0 <- basehaz.gbm(t=data.test$days,
                       delta=data.test$status,
                       t.eval=sort(data.test$days),
                       cumulative=FALSE,
                       f.x=gbm.pred,
                       smooth=TRUE)
hazard <- lambda0*exp(gbm.pred)
In this code, lambda0 is the baseline hazard function.
So, according to the formula h(t|x)=lambda0(t)*exp(f(x)),
"hazard" is the hazard function.
However, what I wanted to calculate was the survival function,
because I would like to compare the outcome in the original data (data$status) with the prediction (the survival function).
Please let me know how to calculate the survival function.
Thank you.
Actually, basehaz.gbm returns the cumulative baseline hazard function by default (cumulative = TRUE), i.e. the integral part \int_0^t \lambda_0(z) dz, and the survival function can be computed as below:
S(t|X) = \exp\{-e^{f(X)} \int_0^t \lambda_0(z) dz\}
f(X) is the gbm prediction, which is the log relative hazard.
I think this tutorial about gbm-based survival analysis would be helpful:
https://github.com/liupei101/Tutorial-Machine-Learning-Based-Survival-Analysis/blob/master/Tutorial_Survival_GBM.ipynb
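As a minimal sketch of the formula above in base R, assuming the cumulative baseline hazard has already been evaluated on a time grid (for example with basehaz.gbm and cumulative = TRUE) and that f(X) is the gbm prediction on the link scale. The numbers are made-up toy values, and the names t.eval, H0, f.x and surv are hypothetical, not taken from the question's data.

```r
# Toy inputs (hypothetical values, for illustration only)
t.eval <- c(100, 500, 1000)      # evaluation times
H0     <- c(0.05, 0.20, 0.45)    # cumulative baseline hazard at t.eval
f.x    <- c(-0.8, 0.0, 1.1)      # log relative hazard f(X) for three subjects

# S(t|X) = exp(-exp(f(X)) * H0(t)); rows are subjects, columns are times
surv <- exp(-outer(exp(f.x), H0))
dimnames(surv) <- list(paste0("subject", seq_along(f.x)),
                       paste0("t=", t.eval))
print(round(surv, 3))
```

Each row is one subject's survival curve on the time grid. With real data, H0 would come from basehaz.gbm and f.x from predict on the fitted model, and both would have to refer to the same time grid and subject ordering.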

Why are the predict values of gbm (R package) negative?

I analyzed my data with the 'gbm' R package. My data come from a cohort study, so I fitted the 'gbm' model with the 'coxph' distribution.
After constructing the model, I would like to see how well it predicts. However, as the code below shows, the predicted values are negative, and I am having trouble understanding this.
Please let me know how to interpret these values.
Here is my code.
install.packages("survival")
install.packages("randomForestSRC")
install.packages("gbm")
library(survival)
library(randomForestSRC)
library(gbm)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
# All columns except the survival time and event indicator
exposure <- setdiff(names(data), c("days", "status"))
formula <- as.formula(paste("Surv(days, status)~", paste(exposure, collapse="+")))
set.seed(123)
ex <- gbm(Surv(days, status) ~ .,
          data=data,
          distribution="coxph",
          cv.folds=5,
          shrinkage=.01,
          n.trees=1000)
set.seed(123)
pred <- predict(ex, n.trees=1000, type="response")
Read the ?predict.gbm help page, particularly the parameter type. By default predictions are on the link scale.
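To illustrate why negative values are expected: as far as the gbm documentation describes it, predictions for distribution="coxph" are on the log hazard scale, so a negative value means a hazard below the baseline rather than an error. A small base R sketch with made-up predictions (the vector pred.link is hypothetical and does not come from the model above):

```r
# Hypothetical link-scale predictions f(x) from a coxph gbm model
pred.link <- c(-1.5, -0.2, 0, 0.9)

# Relative hazard compared with the baseline: exp(f(x)) is always positive
rel.hazard <- exp(pred.link)
print(data.frame(pred.link, rel.hazard = round(rel.hazard, 3)))
```

Values below 1 in rel.hazard correspond to negative predictions, i.e. lower risk than the baseline.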

r caretEnsemble - passing a fit param to one specific model in caretList

I have some code which fits several (cross-validated) models to some data, as below.
library(datasets)
library(caret)
library(caretEnsemble)
# load data
data("iris")
# establish cross-validation structure
set.seed(32)
trainControl <- trainControl(method="repeatedcv",
                             number=5, repeats=3, # 3x 5-fold CV
                             search="random")
algorithmList <- c('lda',       # Linear Discriminant Analysis
                   'rpart',     # Classification and Regression Trees
                   'svmRadial') # SVM with RBF Kernel
# cross-validate models from algorithmList
models <- caretList(Species~., data=iris, trControl=trainControl, methodList=algorithmList)
So far so good. However, if I add 'gbm' to my algorithmList, I get a ton of extraneous log messages because gbm seems to have verbose=TRUE as a default fit parameter.
According to the caret docs, if I were running train on method='gbm' by itself (not along with several models trained in a caretList), I could simply add verbose=FALSE to train(), which would flow through to gbm. But this throws an error when I try it in caretList.
So I would like to pass verbose=FALSE (or any other fit param, in theory) specifically to one particular model from caretList's methodList. How can I accomplish this?
OK, this is actually addressed well in the docs.
?caretList
includes:
tuneList: optional, a NAMED list of caretModelSpec objects. This is
much more flexible than methodList and allows the specification of
model-specific parameters
And I've confirmed my problem is solved if instead of:
algorithmList <- c('lda', # Linear Discriminant Analysis
'rpart' , # Classification and Regression Trees
'svmRadial', # SVM with RBF Kernel
'gbm') # Gradient-boosted machines
I use:
modelTypes <- list(lda = caretModelSpec(method="lda"),
                   rpart = caretModelSpec(method="rpart"),
                   svmRadial = caretModelSpec(method="svmRadial"),
                   gbm = caretModelSpec(method="gbm", verbose=FALSE))
...then the models <- caretList(... line goes from:
models <- caretList(... methodList=algorithmList)
to:
models <- caretList(... tuneList = modelTypes)

Random Forest Predictions

I am looking for some guidance on a homework assignment for a class. We are given a dataset with 14K observations and asked to build a prediction model. Using the caret package, I split the dataset into training and testing sets (4909 observations in the testing set); the model predicts the last variable, "classe". I removed the near-zero-variance variables and built the model, but when I try to predict I only get 97 predictions back. I have reviewed the help files but still can't figure out where I am going wrong. Any hints would be appreciated.
Here is the Code:
set.seed(1234)
pml.training <- read.csv("./data/pml-training.csv")
#
library(caret)
inTrain <- createDataPartition(y=pml.training$classe, p=0.75, list=FALSE)
training <- pml.training[inTrain,]
testing <- pml.training[-inTrain,]
# Remove the near-zero-variance (NZV) predictors
nzv <- nearZeroVar(training, saveMetrics=TRUE)
omit <- which(nzv$nzv==TRUE)
training <- training[,-omit]
testing <- testing[,-omit]
# Fit the model
modFit <- train(classe ~., method="rf", data=training)
modFit
print(modFit$finalModel)
plot(modFit)
# Predict on the testing set
pred <- predict(modFit, newdata=testing)
testing$predRight <- pred==testing$classe
print(table(pred, testing$classe))
Thanks, Pat C.
Have you checked
sum(complete.cases(subset(testing, select = -classe)))
?
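The check above matters because rows with missing predictor values can be dropped at prediction time; as far as I know, caret's predict.train uses na.action = na.omit by default, which would explain getting fewer predictions than test rows. A minimal base R sketch of the check on a made-up data frame (toy.testing and its columns are hypothetical stand-ins for the real testing set):

```r
# Hypothetical stand-in for the testing set: 5 rows, some predictors missing
toy.testing <- data.frame(
  roll   = c(1.2, NA, 0.7, 3.1, NA),
  pitch  = c(8.0, 7.5, NA, 6.2, 5.9),
  classe = factor(c("A", "B", "A", "C", "B"))
)

# Count rows whose predictors (everything except the outcome) are all present
predictors <- subset(toy.testing, select = -classe)
n.complete <- sum(complete.cases(predictors))
print(n.complete)

# Rows that survive NA removal; only these would receive a prediction
usable <- na.omit(predictors)
print(nrow(usable))
```

If the count on the real testing set matches the number of predictions returned, missing values are the likely cause, and columns that are mostly NA would need to be dropped or imputed before training.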
