Query about SVM classifier in R

I am working on a music data set where I have to classify the music data into genres. I have both test and train data sets.
I have linked the datasets for you to check here.
I am working in RStudio.
Here's the code I have written. I am a beginner and have no clue what I am doing; I am shooting arrows randomly. Let me know if you need more information.
The library used is:
library("e1071")
The code:
svm.model <- svm(GENRE ~ ., data = musictraindata, cost = 62.5, gamma = 0.5)
Now my problem is what to put in the x parameter. I have put GENRE from the train data set, but it gives me the following error.
Error in svm.default(x, y, scale = scale, ..., na.action = na.action) :
Need numeric dependent variable for regression.
Someone please guide me on what I should do. Thanks.
After corrections:
I ran the code with the said corrections. I got an svm.model as follows:
svm.model
Call:
svm(formula = factor(GENRE) ~ ., data = musictraindata, cost = 62.5, gamma = 0.5, type = "C-classification",
tolerance = 0.01)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 62.5
gamma: 0.5
Number of Support Vectors: 11880
Now I try to make predictions by applying it to the test data.
svm.pred <- predict(svm.model,musictestdata)
When I plot svm.pred, I get the following graph, which looks highly unlikely:
Is this how I am supposed to proceed? Am I doing something wrong?
Let me know.

Tough to say without a reproducible example, but I would confirm that the class of your dependent variable (GENRE) is a factor and doesn't have anything goofy going on like NAs. Check this with class(musictraindata$GENRE). Also worth a note: R is case-sensitive, so "Genre" and "GENRE" are not the same.
You can also try specifying the type of SVM you want to run with type = "C-classification" and see if that throws a more helpful error.
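For instance, a minimal sketch of those checks and the explicit classification call (assuming musictraindata is the training data frame from the question):
library(e1071)
# sanity checks on the dependent variable
class(musictraindata$GENRE)        # should be "factor"
sum(is.na(musictraindata$GENRE))   # should be 0
# force it to a factor so svm() runs classification, not regression
musictraindata$GENRE <- factor(musictraindata$GENRE)
svm.model <- svm(GENRE ~ ., data = musictraindata,
                 type = "C-classification", cost = 62.5, gamma = 0.5)
summary(svm.model)                 # should report C-classification and the genre levels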

Related

Power regression with the basicTrendline package in R

I want to perform a power regression for my water quality analysis.
In one of its sections I have these two data series:
Q= 0.7409845 1.2736854 0.0713900 1.5316926 1.4607059 0.6124793 1.5902551 1.7286422
1.6547936 1.6088377 1.6054299 1.7810355 1.4429110 1.1905836 2.2374064 1.3004641
1.7137979 1.6578471 1.6386083 1.0181250
Cl= 1.6863990 0.9932518 1.7749524 1.1631508 2.0918641 0.9162907 1.1631508 1.3862944
1.2809338 1.0647107 2.3978953 1.4350845 1.6677068 1.8245493 1.7578579 1.6677068
1.4816045 1.3862944 1.2527630 1.3862944
I want to set up a power regression like the one I made in Excel:
[Excel chart showing the power trendline, its equation, and R²]
That is, show the equation, the R-squared, and the regression line on the plot in R.
I only know the basicTrendline package for this task; it works for the linear model, but not for power.
library(basicTrendline)
Q  <- log(gol$debi)
Cl <- log(gol$Cl)
trendline(x = Q, y = Cl, model = "power2P", show.pvalue = FALSE,
          ePos.x = "topleft", eDigit = 3, CI.level = 0.95,
          xlab = "Q", ylab = "Cl", type = "p", Pvalue.corrected = FALSE)
It showed this message when I ran it, even though none of my data is below zero:
Error in trendline_summary(x = x, y = y, model = model, Pvalue.corrected = Pvalue.corrected, :
'power2P' model need ALL x values greater than 0. Try other models.
Please help me with this package or another one; I just want to produce in RStudio something like the Excel plot I showed.
The results seem a bit different from the Excel version. I wonder whether those "logs" are being taken to a different base?
plot(log(Cl) ~ log(Q))
trendline <- lm(log(Cl) ~ log(Q))
abline(trendline)
trendline
#------
Call:
lm(formula = log(Cl) ~ log(Q))

Coefficients:
(Intercept)       log(Q)
    0.37431     -0.02898
#---------------
title(main = "Log(Cl) ~ Log(Q)")
Your Excel "logs" are all greater than 0, so it can't be that Excel was using log base 10; that would make the values even more negative. So how the log of 0.7 becomes positive is a great mystery. Maybe you need to explain, or offer a citation for, this "power regression" method. It doesn't look like standard statistics or mathematics.
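Setting the log mystery aside, if the goal is just to reproduce the Excel-style annotation, here is a rough sketch (my own assumption, not part of the original answer) that fits the log-log model, back-transforms it to the power form Cl = a * Q^b, and prints the equation and R-squared on the plot:
fit <- lm(log(Cl) ~ log(Q))
a   <- exp(coef(fit)[1])            # back-transformed intercept of Cl = a * Q^b
b   <- coef(fit)[2]
r2  <- summary(fit)$r.squared
plot(Cl ~ Q, xlab = "Q", ylab = "Cl")
curve(a * x^b, add = TRUE)          # fitted power curve
legend("topleft", bty = "n",
       legend = sprintf("Cl = %.3f * Q^%.3f,  R-squared = %.3f", a, b, r2))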

How to apply machine learning techniques / how to use model outputs

I am a plant scientist new to machine learning. I have had success writing code and following tutorials of machine learning techniques. My issue is trying to understand how to actually apply these techniques to answer real world questions. I don't really understand how to use the model outputs to answer questions.
I recently followed a tutorial creating an algorithm to detect credit card fraud. All of the models ran nicely and I understand how to build them, but how in the world do I take this information and translate it into a definitive answer? Following the same example, let's say I wrote this code for my job: how would I then take real credit card data and screen it using this algorithm? I really want to establish a link between running these models and generating a useful output from real data.
Thank you all.
In the name of being concise I will highlight some specific examples using the same data set found here:
https://drive.google.com/file/d/1CTAlmlREFRaEN3NoHHitewpqAtWS5cVQ/view
# Import
library(readr)                        # for read_csv()
creditcard_data <- read_csv('PATH')
# Restructure
creditcard_data$Amount = scale(creditcard_data$Amount)
NewData = creditcard_data[, -c(1)]
head(NewData)
# Split
library(caTools)
set.seed(123)
data_sample = sample.split(NewData$Class, SplitRatio = 0.80)
train_data = subset(NewData, data_sample == TRUE)
test_data  = subset(NewData, data_sample == FALSE)
1) Decision Tree
library(rpart)
library(rpart.plot)
decisionTree_model <- rpart(Class ~ . , creditcard_data, method = 'class')
predicted_val <- predict(decisionTree_model, creditcard_data, type = 'class')
probability <- predict(decisionTree_model, creditcard_data, type = 'prob')
rpart.plot(decisionTree_model)
2) Artificial Neural Network
library(neuralnet)
ANN_model = neuralnet(Class ~ ., train_data, linear.output = FALSE)
plot(ANN_model)
predANN=compute(ANN_model,test_data)
resultANN=predANN$net.result
resultANN=ifelse(resultANN>0.5,1,0)
3) Gradient Boosting
library(gbm, quietly = TRUE)
# train GBM model
system.time(
  model_gbm <- gbm(Class ~ .,
                   distribution = "bernoulli",
                   data = rbind(train_data, test_data),
                   n.trees = 100,
                   interaction.depth = 2,
                   n.minobsinnode = 10,
                   shrinkage = 0.01,
                   bag.fraction = 0.5,
                   train.fraction = nrow(train_data) / (nrow(train_data) + nrow(test_data)))
)
# best iteration
gbm.iter = gbm.perf(model_gbm, method = "test")
model.influence = relative.influence(model_gbm, n.trees = gbm.iter, sort. = TRUE)
# plot
plot(model_gbm)
# predict on the test set and compute the AUC
library(pROC)                         # for roc()
gbm_test = predict(model_gbm, newdata = test_data, n.trees = gbm.iter)
gbm_auc = roc(test_data$Class, gbm_test, plot = TRUE, col = "red")
print(gbm_auc)
You develop your model with, preferably, three data sets: training, testing, and validation. (Sometimes different terminology is used.)
Here, the train and test sets are used to develop the model.
The model you decide upon must never see any of the validation set. This set is used to see how good your model is; in effect it simulates the real-world new data that may come to you in the future. Once you decide your model performs to an acceptable level, you can go back and train on all of your data to produce the final operational model. Then any new 'live' data of interest is fed to that model, which produces an output. In the case of fraud detection it would output some probability; here you need human input to decide at what level you would flag the event as fraudulent enough to warrant further investigation.
At periodic intervals, or as new data arrives, or when your model's performance weakens (fraudsters may become more cunning!), you would repeat the whole process.
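As an illustration only (the names final_model and new_transactions are hypothetical, and the 0.9 cut-off is a business choice, not a statistical one), scoring new 'live' data with a GBM like the one above might look like this:
library(gbm)
# probability of fraud for each new transaction
scores <- predict(final_model, newdata = new_transactions,
                  n.trees = gbm.iter, type = "response")
# flag the most suspicious transactions for human review
flagged <- new_transactions[scores > 0.9, ]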

R e1071 SVM leave one out cross validation function result differ from manual LOOCV

I'm using the e1071 svm function to classify my data.
I tried two different ways of doing LOOCV.
The first one is like this:
svm.model <- svm(mem ~ ., data, kernel = "sigmoid", cost = 7, gamma = 0.009, cross = subSize)
svm.pred = data$mem
svm.pred[which(svm.model$accuracies==0 & svm.pred=='good')]=NA
svm.pred[which(svm.model$accuracies==0 & svm.pred=='bad')]='good'
svm.pred[is.na(svm.pred)]='bad'
conMAT <- table(pred = svm.pred, true = data$mem)
summary(svm.model)
I set cross = subSize (the subject number) to do LOOCV, but the classification result is different from my manual version of LOOCV, which looks like this:
for (i in 1:subSize) {
  data_Tst <- data[i, 1:dSize]
  data_Trn <- data[-i, 1:dSize]
  svm.model1 <- svm(mem ~ ., data = data_Trn, kernel = "linear", cost = 2, gamma = 0.02)
  svm.pred1 <- predict(svm.model1, data_Tst[, -dSize])
  conMAT <- table(pred = svm.pred1, true = data_Tst[, dSize])
  CMAT <- CMAT + conMAT
  CORR[i] <- sum(diag(conMAT))
}
In my opinion, LOOCV accuracy should not vary across runs of the code, because the SVM builds a model on all of the data except one observation and repeats this until the end of the loop. However, with the svm function's 'cross' argument, the accuracy differs on every run.
Which way is more accurate? Thanks for reading this post! :-)
You are using different hyper-parameters (cost, gamma) and different kernels (linear vs. sigmoid) in the two versions. If you want identical results, these should be the same in each run.
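For example, a sketch of comparing like with like, using the same kernel and hyper-parameters in both versions (variable names taken from the question; the rest is an assumption):
library(e1071)
# built-in leave-one-out CV: cross = number of rows
svm.cv <- svm(mem ~ ., data = data, kernel = "linear",
              cost = 2, gamma = 0.02, cross = nrow(data))
mean(svm.cv$accuracies) / 100       # per-fold accuracies are percentages
# manual LOOCV with identical settings
hits <- logical(nrow(data))
for (i in seq_len(nrow(data))) {
  fit     <- svm(mem ~ ., data = data[-i, ], kernel = "linear",
                 cost = 2, gamma = 0.02)
  hits[i] <- predict(fit, data[i, , drop = FALSE]) == data$mem[i]
}
mean(hits)                          # should now be close to the value above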
Also, it depends how Leave One Out (LOO) is implemented:
Does your LOO method leave one out randomly or as a sliding window over the dataset?
Does your LOO method leave one out from one class at a time or both classes at the same time?
Is the training set always the same, or are you using a randomisation procedure before splitting between a training and testing set (assuming you have a separate independent testing set)? In which case, the examples you are cross-validating would change each run.

R random forest - training set using target column for prediction

I am learning how to use various random forest packages and coded up the following from example code:
library(party)
library(randomForest)
set.seed(415)
#I'll try to reproduce this with a public data set; in the mean time here's the existing code
data = read.csv(data_location, sep = ',')
test = data[1:65] #basically data w/o the "answers"
m = sample(1:(nrow(factor)),nrow(factor)/2,replace=FALSE)
o = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)
train2 = data[m,]
train3 = data[o,]
#random forest implementation
fit.rf <- randomForest(train2[,66] ~., data=train2, importance=TRUE, ntree=10000)
Prediction.rf <- predict(fit.rf, test) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
#cforest implementation
fit.cf <- cforest(train3[,66]~., data=train3, controls=cforest_unbiased(ntree=10000, mtry=10))
Prediction.cf <- predict(fit.cf, test, OOB=TRUE) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
data[,66] is the target factor I'm trying to predict, but it seems that using "~ ." to solve for it causes the formula to include the target itself in the prediction model.
How do I solve for the dimension I want on high-ish dimensionality data, without having to spell out exactly which dimensions to use in the formula (so I don't end up with something like cforest(data[,66] ~ data[,1] + data[,2] + data[,3] + ... etc.)?
EDIT:
On a high level, I believe one basically
loads full data
breaks it down to several subsets to prevent overfitting
trains via subset data
generates a fitted model so one can predict values of the target (in my case data[,66]) given data[,1:65].
So my PROBLEM now is: if I give it a new set of test data, say test = data[,1:65], it says "Error in eval(expr, envir, enclos) :" because it is expecting data[,66]. I want to predict data[,66] given the rest of the data!
I think that if the response is included in train3 (and referenced as train3[,66] rather than by name in the formula), then it will also be picked up by "." and used as a feature.
I believe this is more like what you want:
crtl <- cforest_unbiased(ntree=1000, mtry=3)
mod <- cforest(iris[,5] ~ ., data = iris[,-5], controls=crtl)
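The same idea with randomForest (not from the original answer, just a sketch on iris): when the response is named in the formula, predict() only needs the predictor columns, so new data without the target column works.
library(randomForest)
set.seed(415)
fit <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
newx <- iris[1:5, 1:4]        # predictors only, no Species column
predict(fit, newdata = newx)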

r support vector machine e1071 training not working

I am playing around with Support Vector Machines in the R-Language. Specifically I am using the e1071 package.
As long as I follow the manual pages or the tutorial at Wikibooks, everything works. But if I try to use my own datasets with those examples, things aren't so good anymore.
It seems that the model creation fails for some reason; at least I am not getting the levels of the target column. Below you find an example for clarification.
Maybe someone can help me to figure out what I am doing wrong here. So here is all the code and data.
Test dataset
target,col1,col2
0,1,2
0,2,3
0,3,4
0,4,5
0,5,6
0,1,2
0,2,3
0,3,4
0,4,5
0,5,6
0,1,2
0,2,3
0,3,4
0,4,5
1,6,7
1,7,8
1,8,9
1,9,0
1,0,10
1,6,7
1,7,8
1,8,9
1,9,0
1,0,10
1,6,7
1,7,8
1,8,9
1,9,0
1,0,10
R-Script
library(e1071)
dataset <- read.csv("test.csv", header=TRUE, sep=',')
tuned <- tune.svm(target~., data = dataset, gamma = 10^(-6:-1), cost = 10^(-1:1))
summary(tuned)
model <- svm(target~., data = dataset, kernel="radial", gamma=0.001, cost=10)
summary(model)
Output of the summary(model) statement
+ summary(model)
Call:
svm(formula = target ~ ., data = dataset, kernel = "radial", gamma = 0.001,
cost = 10)
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 10
gamma: 0.001
epsilon: 0.1
Number of Support Vectors: 28
>
Wikibooks example
If I compare this output to the output of the Wikibooks example, it's missing some information. Please notice the "Levels" section in that output:
library(MASS)
library(e1071)
data(cats)
model <- svm(Sex~., data = cats)
summary(model)
Output
> summary(model)
Call:
svm(formula = Sex ~ ., data = cats)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.5
Number of Support Vectors: 84
( 39 45 )
Number of Classes: 2
Levels:
F M
Putting Roland's answer in the proper "answer" format:
target is numeric
Sex is a factor
Let me give a few more suggestions:
it seems as if target really should be a factor. (It has only 2 levels, 0 and 1, and I suspect you're trying to classify into either 0 or 1.) So stick a dataset$target <- factor(dataset$target) in somewhere.
right now, because target is numeric, a regression model is being run instead of a classification.
it's worthwhile to do a similar check on all of your variables before running a model. In the case you gave, for instance, it's not obvious what col1 and col2 are. If either of them is a grouping or classification, you should make it a factor too.
In R, many functions will run in multiple ways depending on the data types fed to them. If you feed factors into a model, it will run classification; if you feed numerics, regression. This is actually common in many programming languages, and is called function overloading.
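A minimal sketch of the suggested fix, using the question's own test.csv:
library(e1071)
dataset <- read.csv("test.csv", header = TRUE, sep = ",")
dataset$target <- factor(dataset$target)   # 2 levels: 0 and 1
str(dataset)    # target should now show as Factor w/ 2 levels "0","1"
model <- svm(target ~ ., data = dataset, kernel = "radial",
             gamma = 0.001, cost = 10)
summary(model)  # should now report C-classification with Levels: 0 1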
