Support Vector Machine in R

I've been working on a Support Vector Machine model in RStudio, but I'm ending up with a low accuracy rate and I don't know how to fix it. I'm expecting an accuracy rate higher than 90%.
Here is my code:
install.packages("caTools")
install.packages("class")
library(caTools)
library(class)
install.packages("ISLR")
library(ISLR)
Collegedata <- College[,-1]                        # drop the Private factor column
Collegedata[,-17] <- scale(Collegedata[,-17])      # scale all predictors except Grad.Rate
med <- median(Collegedata$Grad.Rate)
Grad.Rate <- Collegedata$Grad.Rate >= med          # binarise Grad.Rate at its median
Grad.Rate1 <- as.numeric(Grad.Rate)
Collegedata <- data.frame(Collegedata, Grad.Rate1)
corcollege <- cor(Collegedata)                     # correlation matrix used to pick columns to drop
Collegedata <- Collegedata[,-2:-3]
Collegedata <- Collegedata[,-4]
Collegedata <- Collegedata[,-7]
Collegedata <- Collegedata[,-13]                   # this last removal drops the original Grad.Rate
# SVM
install.packages("e1071")
library("e1071")
collegesplit <- sample.split(Collegedata$Grad.Rate1, SplitRatio = 0.8)   # split on the label vector
collegetrain <- subset(Collegedata, collegesplit == TRUE)
collegetest <- subset(Collegedata, collegesplit == FALSE)
collegetrain <- data.frame(collegetrain)
svm.model.college <- svm(Grad.Rate1 ~ ., data = collegetrain, type = "C-classification",
                         cost = 1, gamma = 0.125, cross = 10)
svm.pred.college <- predict(svm.model.college, collegetest[,-13])
table(pred = svm.pred.college, true = collegetest[,13])
install.packages('ROCR')
library(ROCR)
ROC=predict(svm.model.college,newdata=collegetest)
ROC<-as.vector(ROC)
ROC<-as.numeric(ROC)
pred=prediction(ROC,collegetest$Grad.Rate1)
perf=performance(pred,'tpr','fpr')
plot(perf)
as.numeric(performance(pred, 'auc')@y.values)

It's been a while since I've coded SVMs, so I won't be able to provide you with any code; however, I can say that SVMs are incredibly sensitive to your choice of hyperparameters, e.g. cost and gamma. You'll want to perform a grid search over a sequence of values to determine the optimal ones; I recommend using the tune.svm() or best.tune() functions in the e1071 package to do this. Further, while the default Gaussian (RBF) kernel is often the optimal kernel, it is not guaranteed to be the best, so you could also try the linear kernel instead.
You may find this paper useful in helping develop a framework for building your model.
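For illustration, a grid search with tune.svm() might look roughly like the sketch below (not tested on your data; the cost/gamma grids are just common starting points, and the response is converted to a factor for classification):
library(e1071)
collegetrain$Grad.Rate1 <- as.factor(collegetrain$Grad.Rate1)   # classification needs a factor response
tuned <- tune.svm(Grad.Rate1 ~ ., data = collegetrain,
                  cost = 2^(-2:6), gamma = 2^(-6:1))             # grids are assumptions; widen or refine as needed
summary(tuned)                                                   # 10-fold CV error for every cost/gamma pair
best.svm <- tuned$best.model
svm.pred.college <- predict(best.svm, collegetest)
table(pred = svm.pred.college, true = collegetest$Grad.Rate1)
The best cost/gamma pair reported by summary(tuned) can then be compared against your original cost = 1, gamma = 0.125.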

Related

How to obtain Brier Score in Random Forest in R?

I am having trouble getting the Brier score for my machine learning predictive models. The outcome "y" is categorical (1 or 0), and the predictors are a mix of continuous and categorical variables.
I have created four models with different predictors; I'll call them "Model_1" to "Model_4" here (apart from the predictors, the other parameters are the same). Example code for one of my models is:
Model_1 = rfsrc(y ~ ., data = TrainTest, ntree = 1000,
                mtry = 30, nodesize = 1, nsplit = 1,
                na.action = "na.impute", nimpute = 3, seed = 10,
                importance = T)
When I run Model_1 in R, I get the model output (not shown here).
My question is: how can I get the predicted probability for each of those 412 people? And how do I find the observed outcome for each person? Do I need to calculate it by hand? I found the function BrierScore() in the "DescTools" package, but when I tried BrierScore(Model_1), it gave me no results.
Code I added:
library(scoring)
library(DescTools)
BrierScore(Raw_SB)
class(TrainTest$VL_supress03)
TrainTest$VL_supress03_nu<-as.numeric(as.character(TrainTest$VL_supress03))
class(TrainTest$VL_supress03_nu)
prediction_Raw_SB = predict(Raw_SB, TrainTest)
BrierScore(prediction_Raw_SB, as.numeric(TrainTest$VL_supress03) - 1)
BrierScore(prediction_Raw_SB, as.numeric(as.character(TrainTest$VL_supress03)) - 1)
BrierScore(prediction_Raw_SB, TrainTest$VL_supress03_nu - 1)
I tried this code but got many error messages.
One assumption I am making about your approach is that you want to compute the Brier score on the data you trained your model on (which is usually not the correct approach; google train-test split if you need more info there). In general, you should therefore reflect on whether your approach is correct there.
The BrierScore method in DescTools only has a defined method for glm models; otherwise it expects as input a vector of predicted probabilities and a vector of true values (see ?BrierScore).
What you would need to do though is to predict on your data using:
prediction = predict(Model_1, TrainTest, na.action = "na.impute")
and then compute the brier score using
BrierScore(as.numeric(TrainTest$y) - 1, prediction$predicted[, 1L])
(Note that we transform TrainTest$y into a numeric vector of 0s and 1s in order to compute the Brier score.)
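If you want to sanity-check the result, the Brier score is just the mean squared difference between the predicted probabilities and the 0/1 outcomes, so a hand computation with the objects above would look roughly like this (which column of prediction$predicted corresponds to the event is an assumption you should verify with colnames()):
y01 <- as.numeric(TrainTest$y) - 1      # observed outcomes recoded as 0/1
p   <- prediction$predicted[, 1L]       # predicted probability (column choice assumed; check colnames(prediction$predicted))
mean((p - y01)^2)                       # Brier score by hand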
Note: the randomForestSRC package also prints a normalized Brier score when you call print(prediction).
In general, using one of the available machine-learning frameworks in R (mlr3, tidymodels, caret) might simplify this for you and prevent a lot of errors of this kind, especially if you are less experienced in ML.
See e.g. this chapter in the mlr3 book for more information.
For reference, here is some very similar code using the mlr3 package, automatically also taking care of train-test splits.
data(breast, package = "randomForestSRC") # with target variable "status"
library(mlr3)
library(mlr3extralearners)
task = TaskClassif$new(id = "breast", backend = breast, target = "status")
algo = lrn("classif.rfsrc", na.action = "na.impute", predict_type = "prob")
resample(task, algo, rsmp("holdout", ratio = 0.8))$score(msr("classif.bbrier"))

Validate Accuracy of Test Data

I have fit my model on my training data and assessed it using R-squared. However, I want to assess the model on my test data. How do I do this?
My predicted value is continuous. I'm quite new to this, so I'm open to suggestions.
LR_swim <- lm(racetime_mins ~ event_month + gender + place +
                clocktime_mins + handicap_mins +
                Wind_Speed_knots +
                Air_Temp_Celsius + Water_Temp_Celsius + Wave_Height_m,
              data = SwimmingTrain)
family=gaussian(link = "identity")
summary(LR_swim)
rsq(LR_swim) #Returns- 0.9722331
#Predict Race_Time Using Test Data
pred_LR <- predict(LR_swim, SwimmingTest, type ="response")
#Add predicted Race_Times back into the test dataset.
SwimmingTest$Pred_RaceTime <- pred_LR
To start with, as already pointed out in the comments, the term accuracy is actually reserved for classification problems. What you are actually referring to is the performance of your model. And the truth is that, for regression problems (such as yours), there are several such performance measures available.
For good or bad, R^2 is still the standard measure in several implementations; nevertheless, it may be helpful to keep in mind what I have argued elsewhere:
the whole R-squared concept comes in fact directly from the world of statistics, where the emphasis is on interpretative models, and it has little use in machine learning contexts, where the emphasis is clearly on predictive models; at least AFAIK, and beyond some very introductory courses, I have never (I mean never...) seen a predictive modeling problem where the R-squared is used for any kind of performance assessment; nor is it an accident that popular machine learning introductions, such as Andrew Ng's Machine Learning at Coursera, do not even bother to mention it. And, as noted in the Github thread above (emphasis added):
In particular when using a test set, it's a bit unclear to me what the R^2 means.
with which I certainly concur.
There are several other performance measures that are arguably more suitable in a predictive task, such as yours; and most of them can be implemented with a simple line of R code. So, for some dummy data:
preds <- c(1.0, 2.0, 9.5)
actuals <- c(0.9, 2.1, 10.0)
the mean squared error (MSE) is simply
mean((preds-actuals)^2)
# [1] 0.09
while the mean absolute error (MAE), is
mean(abs(preds-actuals))
# [1] 0.2333333
and the root mean squared error (RMSE) is simply the square root of the MSE, i.e.:
sqrt(mean((preds-actuals)^2))
# [1] 0.3
These measures are arguably more useful for assessing the performance on unseen data. The last two have an additional advantage of being in the same scale as your original data (not the case for MSE).
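Applied to your own objects (reusing the names from your snippet and assuming SwimmingTest$racetime_mins holds the observed race times), that would look roughly like:
actuals <- SwimmingTest$racetime_mins     # observed values in the test set (column name assumed)
preds   <- SwimmingTest$Pred_RaceTime     # the predictions you stored above
mean((preds - actuals)^2)                 # MSE
mean(abs(preds - actuals))                # MAE
sqrt(mean((preds - actuals)^2))           # RMSE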

How to find the best measures for lda

Using the LDA example from the quanteda package:
require(quanteda)
require(quanteda.corpora)
require(lubridate)
require(topicmodels)
corp_news <- download('data_corpus_guardian')
corp_news_subset <- corpus_subset(corp_news, year(date) >= 2016)
ndoc(corp_news_subset)
dfmat_news <- dfm(corp_news, remove_punct = TRUE, remove = stopwords('en')) %>%
  dfm_remove(c('*-time', '*-timeUpdated', 'GMT', 'BST')) %>%
  dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")
dfmat_news <- dfmat_news[ntoken(dfmat_news) > 0,]
dtm <- convert(dfmat_news, to = "topicmodels")
lda <- LDA(dtm, k = 10)
Are there any metrics that can help determine the appropriate number of topics? I need this because my texts are small and I don't know whether the performance is right. Also, is there any way to get a performance measure (e.g. precision/recall) to compare the performance of LDA with different features?
There are several goodness-of-fit (GoF) metrics you can use to assess an LDA model. The most common is perplexity, which you can compute through the function perplexity() in the topicmodels package. The idea, common to unsupervised methods, is to fit multiple LDA models with different numbers of topics. As the number of topics increases, the perplexity should decrease; you select the optimal model by looking for a "knee" in the plot, stopping either there or when the incremental decrease becomes negligible. Think of the scree plot you use when running a Principal Component Analysis.
Having said that, there is an R package called ldatuning which implements four additional metrics based on density-based clustering and on Kullback-Leibler divergence. Three of them can be used with both VEM and Gibbs inference, while the method by Griffiths can only be used with Gibbs. For some of these metrics you look for the minimum, for others the maximum. Also, you can always compute the log-likelihood of your model, which you want to maximize. Extracting the likelihood from an LDA object is straightforward. Let's assume you have an LDA model called ldamodel:
loglikelihood = as.numeric(logLik(ldamodel))
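To make the perplexity approach concrete, here is a minimal sketch (the candidate values of k and the seed are arbitrary; dtm is the document-term matrix from your question, and evaluating perplexity on a held-out dtm would be better than reusing the training data):
library(topicmodels)
ks <- c(5, 10, 20, 40)                              # candidate numbers of topics (arbitrary)
perps <- sapply(ks, function(k) {
  m <- LDA(dtm, k = k, control = list(seed = 1))    # one model per k
  perplexity(m, dtm)                                # lower is better
})
plot(ks, perps, type = "b", xlab = "Number of topics", ylab = "Perplexity")   # look for the knee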
There is a lot of research around this topic. For instance, you can have a look at these papers:
Gerlach et al. (2018)
Fortunato 2010
In addition, you can have a look at the preprint of a paper I am working on with a colleague of mine, which uses simple parametric tests to evaluate GoF. We also developed an R package which can be used over a list of LDA models of class LDA from topicmodels. You can find the paper here and the package here. You are more than welcome to submit any issue you may find in the package. The paper is under review at the moment, but again, comments are more than welcome.
Hope this helps!

number of trees in h2o.gbm

In the traditional gbm package, we can use
predict.gbm(model, newdata=..., n.trees=...)
so that I can compare results for different numbers of trees on the test data.
In h2o.gbm, although it has n.tree to set, it seems to have no effect on the result; it's all the same as the default model:
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=100))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=10))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
Does anybody have a similar problem? How can I solve it? h2o.gbm is much faster than gbm, so it would be great if I could get detailed results for each number of trees.
I don't think H2O supports what you are describing.
BUT, if what you are after is to get the performance against the number of trees used, that can be done at model building time.
library(h2o)
h2o.init()
iris <- as.h2o(iris)
parts <- h2o.splitFrame(iris,c(0.8,0.1))
train <- parts[[1]]
valid <- parts[[2]]
test <- parts[[3]]
m <- h2o.gbm(1:4, 5, train,
             validation_frame = valid,
             ntrees = 100,            # max desired
             score_tree_interval = 1)
h2o.scoreHistory(m)
plot(m)
The score history will show the evaluation after adding each new tree. plot(m) will show a chart of this. Looks like 20 is plenty for iris!
BTW, if your real purpose was to find out the optimum number of trees to use, then switch early stopping on, and it will do that automatically for you. (Just make sure you are using both validation and test data frames.)
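For reference, a hedged sketch of what switching early stopping on could look like here (the stopping parameter values are illustrative, not tuned):
m2 <- h2o.gbm(1:4, 5, train,
              validation_frame = valid,
              ntrees = 1000,                  # generous upper bound
              score_tree_interval = 1,
              stopping_rounds = 5,            # stop after 5 scoring rounds with no improvement
              stopping_metric = "logloss",    # pick a metric suited to your problem
              stopping_tolerance = 1e-4)
h2o.scoreHistory(m2)                          # shows where training actually stopped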
As of 3.20.0.6, H2O does support this. The method you are looking for is staged_predict_proba. For classification models it produces predicted class probabilities after each iteration (tree), for every observation in your testing frame. For regression models (i.e. when the response is numerical), although not really documented, it produces the actual prediction for every observation in your testing frame.
From these predictions it is also easy to compute various performance metrics (AUC, r2 etc), assuming that's what you're after.
Python API:
staged_predict_proba = model.staged_predict_proba(test)
R API:
staged_predict_proba <- h2o.staged_predict_proba(model, prostate.test)
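As a rough sketch of how you might turn those staged predictions into a per-tree performance curve in R (the one-column-per-tree layout of the returned frame is an assumption, so inspect it with head() first; R2() is the same helper used in the question):
staged <- as.data.frame(h2o.staged_predict_proba(h2o.gbm.model, test.frame))  # layout assumed; check head(staged)
y <- test.mat$y                                                               # true response, as in the question
r2_by_tree <- sapply(staged, function(p) R2(as.numeric(p), y))
plot(seq_along(r2_by_tree), r2_by_tree, type = "b",
     xlab = "Number of trees", ylab = "R2")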

Tune glmnet hyperparameters and evaluate performance using nested cross-validation in mlr?

I'm trying to use the R package mlr to train a glmnet model on a binary classification problem with a large dataset (about 850,000 rows and about 100 features) on very modest hardware (my laptop with 4GB RAM --- I don't have access to more CPU muscle). I decided to use mlr because I need nested cross-validation to tune the hyperparameters of my classifier and evaluate the expected performance of the final model. To the best of my knowledge, neither caret nor h2o offers nested cross-validation at present, but mlr provides the infrastructure to do this. However, I find the huge number of functions provided by mlr extremely overwhelming, and it's difficult to know how to slot everything together to achieve my goal. What goes where? How do they fit together? I've read through the entire documentation here: https://mlr-org.github.io/mlr-tutorial/release/html/ and I'm still confused. There are code snippets that show how to do specific things, but it's unclear (to me) how to stitch these together. What's the big picture? I looked for a complete worked example to use as a template and only found this: https://www.bioconductor.org/help/course-materials/2015/CSAMA2015/lab/classification.html which I have been using as my starting point. Can anyone help fill in the gaps?
Here's what I want to do:
Tune the hyperparameters (the l1 and l2 regularisation parameters) of a glmnet model using grid search or random grid search (or anything faster if it exists -- iterated F-racing? adaptive resampling?) with a stratified k-fold cross-validation inner loop, and an outer cross-validation loop to assess the expected final performance. I want to include a feature preprocessing step in the inner loop with centering, scaling, and a Yeo-Johnson transformation, plus fast filter-based feature selection (the latter is a necessity because I have very modest hardware and I need to slim the feature space to decrease training time). I have imbalanced classes (the positive class is about 20%), so I have opted to use AUC as my optimisation objective, but this is only a surrogate for the real metric of interest, which is the false positive rate at a small number of fixed true positive rates (i.e., I want to know the FPR for TPR = 0.6, 0.7, 0.8). I'd also like to tune the probability thresholds to achieve those TPRs; this is possible in nested CV, but it's not clear exactly what is being optimised here:
https://github.com/mlr-org/mlr/issues/856
I'd like to know where the cut should be without incurring information leakage, so I want to pick this using CV.
I'm using glmnet because I'd rather spend my CPU cycles on building a robust model than a fancy model that produces over-optimistic results. GBM or Random Forest can be done later if I find it can be done fast enough, but I don't expect the features in my data to be informative enough to bother investing much time in training anything particularly complex.
Finally, after I've obtained an estimate of what performance I can expect from the final model, I want to actually build the final model and obtain the coefficients of the glmnet model --- including which ones are zero, so I know which features have been selected by the LASSO penalty.
Hope all this makes sense!
Here's what I've got so far:
library(mlr)

df <- as.data.frame(DT)
task <- makeClassifTask(id = "glmnet",
                        data = df,
                        target = "Flavour",
                        positive = "quark")
task
lrn <- makeLearner("classif.glmnet", predict.type = "prob")
lrn
# Feature preprocessing -- want to do this as part of CV:
lrn <- makePreprocWrapperCaret(lrn,
                               ppc.center = TRUE,
                               ppc.scale = TRUE,
                               ppc.YeoJohnson = TRUE)
lrn
# I want to use the implementation of info gain in CORElearn, not Weka:
infGain = makeFilter(
  name = "InfGain",
  desc = "Information gain",
  pkg = "CORElearn",
  supported.tasks = c("classif", "regr"),
  supported.features = c("numerics", "factors"),
  fun = function(task, nselect, ...) {
    CORElearn::attrEval(
      getTaskFormula(task),
      data = getTaskData(task), estimator = "InfGain", ...)
  }
)
infGain
# Take top 20 features:
lrn <- makeFilterWrapper(lrn, fw.method = "InfGain", fw.abs = 20)
lrn
# Now things start to get foggy...
tuningLrn <- makeTuneWrapper(
  lrn,
  resampling = makeResampleDesc("CV", iters = 2, stratify = TRUE),
  par.set = makeParamSet(
    makeNumericParam("s", lower = 0.001, upper = 0.1),
    makeNumericParam("alpha", lower = 0.0, upper = 1.0)
  ),
  control = makeTuneControlGrid(resolution = 2)
)
rdesc <- makeResampleDesc("CV", iters = 3, stratify = TRUE)  # outer resampling loop
r2 <- resample(learner = tuningLrn,
               task = task,
               resampling = rdesc,
               measures = auc)
# Now what...?
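For what it's worth, a hedged sketch of how the remaining steps could look, based on my reading of the mlr and glmnet docs rather than code run on this data (getLearnerModel() unwraps the wrappers, and the s passed to coef() should be the tuned lambda):
# Expected performance of the whole procedure (tuning happens inside tuningLrn, so this is nested CV)
r2$aggr

# Fit the final model on all of the data with the same tuning wrapper,
# then pull out the underlying glmnet fit and inspect its coefficients.
finalModel <- train(tuningLrn, task)
glmnetFit  <- getLearnerModel(finalModel, more.unwrap = TRUE)
coef(glmnetFit, s = 0.01)   # s = the tuned lambda (0.01 is a placeholder); rows with zero coefficients were dropped by the LASSO penalty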
