Classification with One-Class SVM in R

I am trying to code an SVM for classification using a training data set that contains only one class. I want to predict whether new data is different from my data set or not.
I used the same data set for training and prediction, but unfortunately the SVM is not predicting it well.
library(e1071)
# Data set
high <- c(10,5,14,12,20)
temp <- c(12,15,20,15,9)
x <- cbind(high,temp)
# Create SVM
model <- svm(x,y=NULL,type='one-classification',kernel='linear')
# Predict training data-set
pred <- predict(model,x)
pred
It returns:
TRUE TRUE FALSE FALSE TRUE
It should be TRUE for all of them.

I am working on a similar problem. From reading the vignettes the e1071 authors provide on CRAN, I believe that by definition a one-class SVM draws a hyperplane that separates the training data into two groups; in other words, the third item is simply the one most likely to be an outlier. The SVM will essentially always flag at least one point as an outlier.
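One thing worth experimenting with (my own suggestion, not part of the answer above) is the nu parameter of e1071::svm: it defaults to 0.5 and is an upper bound on the fraction of training points treated as outliers, so a smaller value should accept more of the training data as "normal". A minimal sketch using the data from the question:
library(e1071)
high <- c(10,5,14,12,20)
temp <- c(12,15,20,15,9)
x <- cbind(high,temp)
# nu (default 0.5) bounds the fraction of training points treated as outliers;
# a smaller nu makes the model accept more of the training set
model <- svm(x, y = NULL, type = 'one-classification', kernel = 'linear', nu = 0.05)
predict(model, x)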

I'm not sure traditional supervised learning techniques such as SVMs are well suited to training data with only one class: there is nothing in the data to teach the model how to differentiate between class A and class B.
I think the best you can do with one-class training data is to learn a probability density/mass function from the data, and then check how likely a new instance is under that learned density. For more background, see the Wikipedia article on one-class classification.
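As a rough sketch of that idea (hypothetical: it assumes a multivariate normal density and an arbitrary 5% cutoff, and reuses the data from the question):
library(mvtnorm)  # assumed available; provides dmvnorm()
high <- c(10,5,14,12,20)
temp <- c(12,15,20,15,9)
x <- cbind(high,temp)
mu <- colMeans(x)
sigma <- cov(x)
dens <- dmvnorm(x, mean = mu, sigma = sigma)
cutoff <- quantile(dens, 0.05)  # tolerance for calling a point "unusual"
new_point <- cbind(high = 11, temp = 14)
dmvnorm(new_point, mean = mu, sigma = sigma) >= cutoff  # TRUE = looks like the training data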

Related

Use glm to predict on fresh data

I'm relatively new to glm - so please bear with me.
I have created a glm (logistic regression) to predict whether an individual CONTINUES studies ("0") or does NOTCONTINUE ("1"). I am interested in predicting the latter. The glm uses seven factors in the dataset, the confusion matrices are very good for what I need, and combining seven years of data has also been done. Straightforward.
However, I now need to apply the model to the current year's data, which of course does not have the NOTCONTINUE column in it. Let's say the glm model is "CombinedYears" and the new data is "Data2020".
How can I use the glm model to get predictions of who will ("0") or will not ("1") continue their studies? Do I need to insert a NOTCONTINUE column into the latest file? I have tried this:
Predict2020 <- predict(CombinedYears, data.frame(Data2020), type = 'response')
but the output only holds values <0.5.
Any help very gratefully appreciated. Thank you in advance
You mentioned that you have already created a prediction model to predict whether a particular student will continue studies or not. You used glm and your model is called CombinedYears.
Now, what you have to know is that your problem is a binary classification and you used logistic regression for it. The output of your model when you apply it to new data, or even to the same data used to fit the model, is probabilities: values between zero and one. In the development phase of your model, you need to determine the cutoff threshold for these probabilities, which you can then use when predicting on new data. For example, you may choose 0.5 as a cutoff, so every probability above it is considered NOTCONTINUE and every probability below it is CONTINUE. However, the best threshold can also be determined from your data by maximizing both specificity and sensitivity. This can be done using the receiver operating characteristic (ROC) curve and the area under it (AUC). There are many packages that can do this for you, such as the pROC and AUC packages in R; the same packages can determine the best cutoff as well.
What you have to do is the following:
Determine the cutoff threshold after calculating the AUC
library(pROC)
roc_object = roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears))
coords(roc_object, "best", ret="threshold", transpose = FALSE)
Use your model to predict on your new data year (as you did)
Predict2020 = predict(CombinedYears, data.frame(Data2020), type = 'response')
Now, the content of Predict2020 is just a probability for each student. Use the cutoff you obtained in step (1) to classify your students accordingly.
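For example (a hypothetical sketch; the cutoff value below is only a placeholder for whatever coords() returns in step 1):
best_cut <- 0.42  # replace with the threshold returned by coords()
Class2020 <- ifelse(Predict2020 >= best_cut, "NOTCONTINUE", "CONTINUE")
table(Class2020)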

R neuralnet training for a simple dataset of squares of numbers

Dear neuralnet experts,
I am studying ANN with a book and R package.
One of the examples is to train a neural network (with the neuralnet R package) on a simple set of squares of the numbers 1 to 10. It was quite quick and easy to fit them with one hidden layer of 10 neurons.
But for a larger set, 1 to 30, the algorithm does not converge. I think some parameters should be changed to train the network. At first I increased the number of neurons and hidden layers, i.e. hidden = c(20, 10), but it still failed.
Could somebody please guide me to learn more about neuralnet to train the dataset?
My code in R is given as,
library("neuralnet")
#Read the input file
mydata50=read.csv('Squares50.csv',sep=",",header=TRUE)
mydata30 <- mydata50[1:30,]
attach(mydata30)
names(mydata30)
mydata30
#Train the model based on output from input
model30 <- neuralnet(formula = Output ~ Input,
                     data = mydata30,
                     hidden = c(20, 10),
                     threshold = 0.01)
print(model30)
#Lets plot and see the layers
plot(model30)
Best regards,
Dong-Ho

R random forest feature selection based on AUC

For binary option prediction (rise, fall) I am trying random forests in R, but the importance measures and the OOB error are biased in my case.
I found this article but it is Python related.
Is there an R package approach for automatic feature selection that
is based on AUC
maybe allows me to define my own evaluation function (money earned is a function of recall and precision rates)
maybe allows me to specify the cross-validation approach: randomly selecting training and test cases is biased, as this is time-series data, where test data must be later than training data
I just came across this question; I found a package that might help you:
i. It's called AUCRF; it performs feature selection in a random forest model by optimizing the AUC.
https://cran.r-project.org/web/packages/AUCRF/AUCRF.pdf
ii. It does allow cross-validation of your AUC-based selection:
AUCRFcv(x, nCV = 5, M = 20)
where nCV is the number of folds and M is the number of repeats.
iii. Regarding allowing your own evaluation function, it does have an option where you can specify the formula using ~, but you will have to explore that more for your specific case, since you have not provided test code.
Hope this helps!
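For reference, a minimal usage sketch (the data frame df and the binary factor response Y are placeholders, not from the question):
library(AUCRF)
fit <- AUCRF(Y ~ ., data = df)           # backward elimination guided by OOB AUC
summary(fit)
OptimalSet(fit)                          # variables in the AUC-optimal subset
fitCV <- AUCRFcv(fit, nCV = 5, M = 20)   # repeated cross-validation of the selection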

Why predict() in R has to be done on test data

In the code below, do I need to follow approach 1 or approach 2?
I am confused about why the test data is used in predict() in approach 1.
It would be great if someone could explain this in detail.
idx   <- sample(nrow(sales), nrow(sales)*0.6)
train <- sales[idx, ]
test  <- sales[-idx, ]
Approach 1
fit <- lm(y~x,data=train)
predict(fit, newdata = test)
Instead can't I do this way:
Approach 2
fit <- lm(y~x,data=train)
predict(fit, newdata = train)
fit1 <- lm(y~x,data=test)
predict(fit1, newdata = test)
Speaking generally, predict() using a model trained on training data, applied to training data, can only be used for introspection about the model as trained. Applying that to (ideally independent) test data makes sense either as a validation of the trained model or a use of the model for further prediction.
In other words, they're not different approaches to the same thing; they accomplish completely different things.
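To make that concrete, here is a hedged sketch of approach 1 (variable names follow the question; mean squared error is just one common choice of metric):
set.seed(1)
idx   <- sample(nrow(sales), nrow(sales) * 0.6)
train <- sales[idx, ]
test  <- sales[-idx, ]
fit   <- lm(y ~ x, data = train)
pred  <- predict(fit, newdata = test)   # predictions on data the model never saw
mean((test$y - pred)^2)                 # out-of-sample error estimate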

Random forest evaluation in R

I am a newbie in R and I am trying to do my best to create my first model. I am working on a two-class random forest project and so far I have programmed the model as follows:
library(randomForest)
set.seed(2015)
randomforest <- randomForest(as.factor(goodkit) ~ ., data=training1, importance=TRUE,ntree=2000)
varImpPlot(randomforest)
prediction <- predict(randomforest, test,type='prob')
print(prediction)
I am not sure why I don't get the overall prediction for my model; I must be missing something in my code. I get the OOB error and the prediction per case in the test set, but not the overall prediction of the model.
library(pROC)
auc <-roc(test$goodkit,prediction)
print(auc)
This doesn't work at all.
I have been through the pROC manual but I cannot get to understand everything. It would be very helpful if anyone can help with the code or post a link to a good practical sample.
Using the ROCR package, the following code should work for calculating the AUC:
library(ROCR)
predictedROC <- prediction(prediction[,2], as.factor(test$goodkit))
as.numeric(performance(predictedROC, "auc")@y.values)
Your problem is that predict on a randomForest object with type='prob' returns two columns of predictions: each column contains the probability of belonging to one of the classes (for binary prediction).
You have to decide which of these predictions to use to build the ROC curve. Fortunately, for binary classification they carry the same information (just reversed):
auc1 <-roc(test$goodkit, prediction[,1])
print(auc1)
auc2 <-roc(test$goodkit, prediction[,2])
print(auc2)
