I am playing around with Support Vector Machines in the R-Language. Specifically I am using the e1071 package.
As long as I follow the manual pages or the tutorial at wikibooks, everything works. But when I try to use my own dataset with those examples, things no longer work as expected.
It seems that the model creation fails for some reason; at least I am not getting the levels of the target column. Below you find the example for clarification.
Maybe someone can help me figure out what I am doing wrong here. So here is all the code and data.
Test dataset
target,col1,col2
0,1,2
0,2,3
0,3,4
0,4,5
0,5,6
0,1,2
0,2,3
0,3,4
0,4,5
0,5,6
0,1,2
0,2,3
0,3,4
0,4,5
1,6,7
1,7,8
1,8,9
1,9,0
1,0,10
1,6,7
1,7,8
1,8,9
1,9,0
1,0,10
1,6,7
1,7,8
1,8,9
1,9,0
1,0,10
R-Script
library(e1071)
dataset <- read.csv("test.csv", header=TRUE, sep=',')
tuned <- tune.svm(target~., data = dataset, gamma = 10^(-6:-1), cost = 10^(-1:1))
summary(tuned)
model <- svm(target~., data = dataset, kernel="radial", gamma=0.001, cost=10)
summary(model)
Output of the summary(model) statement
> summary(model)
Call:
svm(formula = target ~ ., data = dataset, kernel = "radial", gamma = 0.001,
cost = 10)
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 10
gamma: 0.001
epsilon: 0.1
Number of Support Vectors: 28
>
Wikibooks example
If I compare this output to the output of the wikibooks example, it's missing some information. Please note the "Levels" section in the output:
library(MASS)
library(e1071)
data(cats)
model <- svm(Sex~., data = cats)
summary(model)
Output
> summary(model)
Call:
svm(formula = Sex ~ ., data = cats)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.5
Number of Support Vectors: 84
( 39 45 )
Number of Classes: 2
Levels:
F M
Putting Roland's answer in the proper "answer" format:
target is numeric
Sex is a factor
Let me give a few more suggestions:
it seems as if target really should be a factor. (It has only 2 levels, 0 & 1, and I suspect you're trying to classify into either 0 or 1.) So stick in a dataset$target <- factor(dataset$target) somewhere, as in the sketch below.
right now, because target is numeric, a regression model is being run instead of a classification.
it's worthwhile to do a similar check on any of your variables before running a model. In the case you gave, for instance, it's not obvious what col1 and col2 are. If either of them is a grouping or classification variable, you should make it a factor, too.
In R, many functions behave differently depending upon the data types fed to them. If you feed a factor response into a model, it will run a classification; if you feed a numeric response, a regression. This is similar to function overloading in other programming languages.
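For example, a minimal sketch of that fix, using the test.csv shown above (the summary output noted in the comment is what I would expect, not an actual run):
library(e1071)
dataset <- read.csv("test.csv", header = TRUE)

# Convert the 0/1 target to a factor so svm() runs C-classification
# instead of eps-regression
dataset$target <- factor(dataset$target)

model <- svm(target ~ ., data = dataset, kernel = "radial", gamma = 0.001, cost = 10)
summary(model)  # should now report C-classification and Levels: 0 1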
Related
I am trying to analyze a dataset with an ordinal response (0-4) and three categorical factors. I'm interested in the interactions of all three factors as well as the main effects. I used the clm function of the package "ordinal" and checked the assumptions with the nominal_test function. It revealed a significant difference for one of the predictors, and now I don't know how to proceed. I tried to put the problematic factor and all its interactions into the "nominal" argument (see code), and R gives me warnings. Nevertheless, I ran several likelihood ratio tests, always comparing a model including an interaction with one missing it (anova(without, with, test="Chisq")), and get some nice significant results. Still, I feel like I have no clue what I'm doing here, and I don't trust the results. So my questions are: Is what I did OK? What else can I do? Or is the data just 'unanalyzable'?
Here is the code for the test:
# this is the model
res=clm(cue~ intention:outcome:age+
intention:outcome+
intention:age+
outcome:age+
intention+outcome+age+
Gender,
data=xdata)
#proportional odds assumption
nominal_test(res)
# Df logLik AIC LRT Pr(>Chi)
#<none> -221.50 467.00
#intention 3 -215.05 460.11 12.891 0.004879 **
#outcome 3 -219.44 468.87 4.124 0.248384
#age
#Gender 3 -219.50 469.00 3.994 0.262156
#intention:outcome
#intention:age
#outcome:age 6 -217.14 470.28 8.716 0.190199
#intention:outcome:age 12 -188.09 424.19 66.808 1.261e-09 ***
And here is an example of how I tried to solve it and check the 3-way interaction of all three predictors. I did the same for the 2-way interactions as well...
res=clm(cue~ outcome:age+
outcome+age+
Gender,
nominal= ~ intention:age:outcome+
intention:age+
intention:outcome+
intention,
data=xdata)
res.red=clm(cue~ outcome:age+
outcome+age+
Gender,
nominal= ~
intention:age+
intention:outcome+
intention,
data=xdata)
anova(res,res.red, test="Chisq")
# no.par AIC logLik LR.stat df Pr(>Chisq)
#res.red 26 412.50 -180.25
#res 33 424.11 -179.05 2.3945 7 0.9348
And here is the warning that R gives me when I try to fit the model:
Warning message:
(-3) not all thresholds are increasing: fit is invalid
In addition: Absolute convergence criterion was met, but relative criterion was not met
I'm especially concerned about the phrase "fit is invalid"... I don't know what to do with this and would be happy about any idea or hint!
Thank you!
Have you tried to use a more general model like the partial proportional odds model? Your data only has to be nominal, not ordinal, to use this model. If you find huge differences between the log likelihoods, your assumption of ordinality is not met.
You can use vglm() from the VGAM package. Here are a few examples.
As I don't know what your data looks like, I can't say whether it's unanalyzable, but the code would be something like this:
library(VGAM)
res <- vglm(cue ~ intention:outcome:age+
intention:outcome+
intention:age+
outcome:age+
intention+outcome+age+
Gender,
family = cumulative(parallel = FALSE ~ intention),
data = xdata)
summary(res)
I think you could use pchisq() as proposed in the example I posted above to compare both models, like you did before with anova():
pchisq(deviance(res) - deviance(res.red),
df = df.residual(res) - df.residual(res.red), lower.tail = FALSE)
I am working on a music dataset where I have to classify the music into genres. I have both test and train datasets.
I have linked the datasets for you to check
here.
I am working in RStudio.
Here's the code I have written. I am a beginner and have no clue what I am doing; I am just shooting in the dark. Let me know if you need more information.
The library used is:
library("e1071")
The code:
svm.model <- svm(GENRE ~ ., data = musictraindata, cost = 62.5, gamma = 0.5)
Now my problem is what to put in the x parameter. I have put GENRE from the train dataset, but it's giving me the following error.
Error in svm.default(x, y, scale = scale, ..., na.action = na.action) :
Need numeric dependent variable for regression.
Someone please guide me on what I should do. Thanks.
After corrections:
I ran the code with the suggested corrections and got an svm.model as follows:
svm.model
Call:
svm(formula = factor(GENRE) ~ ., data = musictraindata, cost = 62.5, gamma = 0.5, type = "C-classification",
tolerance = 0.01)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 62.5
gamma: 0.5
Number of Support Vectors: 11880
Now I try to make predictions on the test data with it:
svm.pred <- predict(svm.model,musictestdata)
When I plot svm.pred, I get a graph of the predicted genres that looks highly unlikely.
Is this how I am supposed to proceed? Am I doing something wrong?
Let me know.
Tough to say without a reproducible example, but I would confirm that the class of your dependent variable (GENRE) is a factor and doesn't have anything goofy going on like NAs. Check this with class(musictraindata$GENRE). Also worth a note, R is case-sensitive, so "Genre" and "GENRE" make a difference.
You can also try specifying the type of SVM you want to run by using
(type = "C-classification")
and see if it throws you a more helpful error.
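A minimal sketch of both checks together, assuming the musictraindata / GENRE names from the question:
library(e1071)

# Make sure the response is a factor; a numeric response makes svm()
# attempt regression, which triggers the error shown above
class(musictraindata$GENRE)
musictraindata$GENRE <- factor(musictraindata$GENRE)

svm.model <- svm(GENRE ~ ., data = musictraindata, type = "C-classification", cost = 62.5, gamma = 0.5)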
I am currently taking the "Practical Machine Learning" course on Coursera and have run across some strange behavior with the predict function. The assignment was to train a tree and then make some predictions. So that I am not posting the answer here, I have changed the dataset used for the problem. The code is as follows:
rm(list = ls())
library(caret)   # provides train()
library(rattle)  # provides fancyRpartPlot()
data(mtcars)
mtcars$vs = as.factor(mtcars$vs)
set.seed(125)
model = train(am ~ ., method = 'rpart', data = mtcars)
print(model)
fancyRpartPlot(model$finalModel)
sampleData = mtcars[1,]
sampleData[1,names(sampleData)] = rep(NA, length(names(sampleData)))
sampleData[1, c('wt')] = c(4)
predict(model, sampleData[1,], verbose = TRUE)
In the above code, there are two primary sections. The first builds the tree and the second (where sampleData starts) creates a small sample set of data to apply the model to. To make sure that I have the exact same structure as the original data I simply copy the first row of the training dataset and then set all the columns to NA. I then put data in only the columns that the decision tree needs (in this case the wt variable).
When I execute the above code, I get the following result:
Number of training samples: 32
Number of test samples: 0
rpart : 0 unknown predictions were added
numeric(0)
For reference, the following is the structure of the tree:
fancyRpartPlot(model$finalModel)
Can somebody help me to understand why the predict function is not returning a predicted value for the sampleData that I provided?
Unfortunately, even though rpart only used the wt variable in splits, prediction still requires the others to be present. Use a data set with the same columns:
> predict(model, mtcars[1,])
[1] 0.8571429
Max
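If you only want to vary wt, one way (a sketch, not part of the original answer) is to copy a complete row and overwrite just that column, so every predictor is still present and non-NA:
# Copy a full row so all columns exist, then change only wt
sampleData <- mtcars[1, ]
sampleData$wt <- 4
predict(model, sampleData)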
I have the following linear models:
gn<- lm(NA.~ I(PC^0.25) + I(((PI)^2)),data=DSET)
gndos<-update(gn,subset=-c(1,2,4,10,11,26,27,100,158))
I run K-cross validation in each model and the results are the same:
library(DAAG)
a<-CVlm(df=DSET,form.lm = gn ,m=5)
t<-CVlm(df=DSET,form.lm = gndos ,m=5)
I would like to know the error in my code.
EDIT:
Here is a reproducible example with simulated values:
set.seed(1234)
PC<-rnorm(600,mean=50,sd=4)
PI<-rnorm(600,mean=30,sd=4)
NA. <- 20*PC - 1.725*PI + rnorm(600,sd=1)
zx<-data.frame(NA.,PC,PI)
hn<- lm(NA.~ I(PC^0.25) + I(((PI)^2)),data=zx)
hndos<-update(hn,subset=-c(1,2,4,10,11,26,27,100,158))
library(DAAG)
a<-CVlm(df=zx,form.lm = hn ,m=5)
b<-CVlm(df=zx,form.lm = hndos ,m=5)
So, with this reproducible example, the results of the cross-validations are the same for each model too.
The problem is how CVlm() collects the data used to fit the model. It does this through the form.lm object, but this is not sufficient to reproduce the data used to fit the model. I would consider this a bug, but a deficiency might be more correct.
Because you used subset in the lm() call, the data used to fit that model included 591 observations. But you can't tell this from the formula alone. CVlm() does the following
form <- formula(hndos)
mf <- model.frame(form, zx)
R> nrow(mf)
[1] 600
Ahh, there are 600 observations. What it should have done was simply take
mf <- model.frame(hndos) ## model.frame(form.lm) in their code
R> nrow(mf)
[1] 591
Then the CV results would have been different. The code in CVlm() needs to be a little more nuanced than simply grabbing the formula of an lm() object if that is what is supplied to it. This would also be pretty easy to solve.
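In the meantime, one possible workaround (a sketch based on the simulated example above, not documented DAAG behaviour) is to apply the subset yourself and hand CVlm() the reduced data frame:
# Drop the same rows that were excluded via subset= in hndos, so that
# CVlm() cross-validates on the 591 observations the model was fit on
zx.sub <- zx[-c(1, 2, 4, 10, 11, 26, 27, 100, 158), ]
b <- CVlm(df = zx.sub, form.lm = formula(hndos), m = 5)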
Here is an example of my problem
library(RWeka)
iris <- read.arff("iris.arff")
Perform n-fold cross-validation to obtain the proper accuracy of the classifier:
m<-J48(class~., data=iris)
e<-evaluate_Weka_classifier(m,numFolds = 5)
summary(e)
The results provided here are obtained by building the model with one part of the dataset and testing it with another part, and therefore give an accurate estimate of precision.
Now I perform AdaBoost to optimize the parameters of the classifier:
m2 <- AdaBoostM1(class ~ ., data = iris, control = Weka_control(W = list(J48, M = 30)))
summary(m2)
The results provided here are obtained by using the same dataset both to build the model and to evaluate it, so the accuracy is not representative of real-life precision, where the model is evaluated on instances it has not seen. Nevertheless, this procedure is helpful for optimizing the model that is built.
The main problem is that I cannot optimize the model and, at the same time, test it with data that was not used to build it, or simply use an n-fold validation method to obtain the proper accuracy.
I think you are misinterpreting the function of evaluate_Weka_classifier. In both cases, evaluate_Weka_classifier only does the cross-validation based on the training data; it doesn't change the model itself. Compare the confusion matrices of the following code:
m<-J48(Species~., data=iris)
e<-evaluate_Weka_classifier(m,numFolds = 5)
summary(m)
e
m2 <- AdaBoostM1(Species ~. , data = iris ,
control = Weka_control(W = list(J48, M = 30)))
e2 <- evaluate_Weka_classifier(m2,numFolds = 5)
summary(m2)
e2
In both cases, the summary gives you the evaluation based on the training data, while the function evaluate_Weka_classifier() gives you the correct cross-validation. Neither for J48 nor for AdaBoostM1 does the model itself get updated based on the cross-validation.
Now regarding the AdaBoost algorithm itself: it does use a kind of weighted refitting of the training data to arrive at the final classifier. Wrongly classified items are given more weight in the next building step, but the evaluation is done using equal weight for all observations. So using cross-validation to optimize the result doesn't really fit the general idea behind the adaptive boosting algorithm.
If you want a true cross-validation using a training set and an evaluation set, you could do the following:
id <- sample(1:length(iris$Species),length(iris$Species)*0.5)
m3 <- AdaBoostM1(Species ~. , data = iris[id,] ,
control = Weka_control(W = list(J48, M=5)))
e3 <- evaluate_Weka_classifier(m3,numFolds = 5)
# true crossvalidation
e4 <- evaluate_Weka_classifier(m3,newdata=iris[-id,])
summary(m3)
e3
e4
If you want a model that gets updated based on internal resampling, you'll have to go to a different algorithm, e.g. randomForest() from the randomForest package: it builds an ensemble of trees on bootstrap samples and assesses them on the out-of-bag observations. It can be used in combination with the RWeka package as well.
Edit: corrected the code for a true cross-validation. Using the subset argument has an effect in evaluate_Weka_classifier() as well.
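As a rough sketch of the randomForest() route mentioned above (parameters are illustrative, using the built-in iris data rather than your own):
library(randomForest)

# Fit an ensemble of trees; the printed model includes the out-of-bag
# error rate, an internal estimate of prediction accuracy
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
print(rf)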