Support Vector Machine with 3 outcomes [closed] - r

Objective:
I would like to use a classification Support Vector Machine to model three outcomes: Win=1, Loss=0, or Draw=2. The inputs are 50 interval variables plus 2 categorical variables: isHome and isAway. The dataset comprises 23,324 instances, or rows.
What the data looks like:
Outcome isHome isAway Var1 Var2 Var3 ... Var50
1 1 0 0.23 0.75 0.5 ... 0.34
0 0 1 0.66 0.51 0.23 ... 0.89
2 1 0 0.39 0.67 0.15 ... 0.45
2 0 1 0.55 0.76 0.17 ... 0.91
0 1 0 0.35 0.81 0.27 ... 0.34
The interval variables are percentages in the range 0 to 1, hence I believe they do not require scaling. For the categorical inputs, isHome is 1 for a home game and 0 otherwise, and isAway is 1 for an away game and 0 otherwise.
Summary
Create Support Vector Machine Model
Tune gamma and cost
Questions
I will be honest, this is my first time using SVM. I have practiced on the Titanic dataset from Kaggle, but I am trying to expand off of that and try new things.
Does the data have to be transformed into a scale of [0,1]? I do not believe it does.
I have found some literature stating it is possible to predict with 3 categories, but this is outside my scope of knowledge. How would I implement this in R?
Are there too many features for this to work, or could there be a problem with noise? I know this is not a yes or no question, but I am curious to hear people's thoughts.
I understand SVM can split data with a linear, radial, or polynomial kernel. How does one make the best choice for their data?
Reproducible Code
library(e1071)
library(caret)
# set up data
set.seed(500)
isHome<-c(1,0,1,0,1)
isAway<-c(0,1,0,1,0)
Outcome<-c(1,0,2,2,0)
Var1<-abs(rnorm(5,0,1))
Var2<-abs(rnorm(5,0,1))
Var3<-abs(rnorm(5,0,1))
Var4<-abs(rnorm(5,0,1))
Var5<-abs(rnorm(5,0,1))
df<-data.frame(Outcome=factor(Outcome),isHome,isAway,Var1,Var2,Var3,Var4,Var5) # Outcome as a factor for classification
# split data into train and test
inTrain<-createDataPartition(y=df$Outcome,p=0.50,list=FALSE)
traindata<-df[inTrain,]
testdata<-df[-inTrain,]
# Train the model
svm_model<-svm(Outcome ~.,data=traindata,type='C',kernel="radial")
summary(svm_model)
# predict (drop the Outcome column from the test set)
pred <- predict(svm_model,testdata[-1])
# Confusion Matrix
table(pred,testdata$Outcome)
# Tune the model to find the best cost and gamma
svm_tune <- tune(svm, Outcome ~ ., data=traindata,
                 kernel="radial",
                 ranges=list(cost=10^(-1:2), gamma=c(.5,1,2)))
print(svm_tune)

I'll try to answer each point. In my opinion you could get different solutions for your problem(s); as it stands, it's a bit broad. You could also get answers by searching for similar topics on CrossValidated.
Does the data have to be transformed into a scale of [0,1]?
It depends; usually, yes, it would be better to scale Var1, Var2, ... One good approach would be to build two pipelines: one where you scale each variable and one where you leave them be; the best model on the validation set wins (see the sketch below).
Note, you'll frequently find this kind of approach used to decide "the best way".
Often what you're really interested in is the performance, so checking via cross-validation is a good way of evaluating your hypothesis.
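A rough sketch of those two pipelines with caret (assuming the traindata from your reproducible code, with Outcome stored as a factor and enough rows for 5-fold CV; method="svmRadial" also needs the kernlab package installed):
library(caret)
set.seed(500)
# share the CV folds so both pipelines are scored on identical splits
idx  <- createFolds(traindata$Outcome, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = idx)
# pipeline 1: variables left as they are
fit_raw    <- train(Outcome ~ ., data = traindata, method = "svmRadial",
                    trControl = ctrl)
# pipeline 2: center and scale inside each fold
fit_scaled <- train(Outcome ~ ., data = traindata, method = "svmRadial",
                    preProcess = c("center", "scale"), trControl = ctrl)
# the pipeline with the better cross-validated accuracy wins
summary(resamples(list(raw = fit_raw, scaled = fit_scaled)))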
I have found some literature stating it is possible to predict with 3 categories, but this is outside of my scope of knowledge. How would I implement this in R?
Yes it is; in fact, some functions implement this right away. See the example linked below.
Note, you could always do a multi-class classification by building more models. This is usually called the one-vs-all approach (more here).
In general you could:
First train a model to detect Wins, your labels will be [0,1], so Draws and Losses will both be counted as "zero" class, and Wins will be labeled as "one"
Repeat the same principle for the other two classes
Of course, now you'll have three models, and for each observation you'll have three predicted scores to combine.
You'll assign each observation to the class with the highest probability, or by majority vote.
Note, there are other ways, it really depends on the model you chose.
Good news is you can avoid all this. You can look here to start.
e1071::svm() generalizes to your problem right away: it handles the multi-class case automatically (internally via one-vs-one voting), so there is no need to fit multiple models yourself.
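A minimal sketch of this, using the built-in iris data (three classes) in place of your dataset:
library(e1071)
# Species has three levels; svm() fits the multi-class model directly
fit  <- svm(Species ~ ., data = iris, kernel = "radial")
pred <- predict(fit, iris)
table(pred, iris$Species)  # 3x3 confusion matrix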
Are there too many features that I am looking at in order for this to work, or could there be a problem with noise? I know this is not a yes or no question, but curious to hear people's thoughts.
It could be the case or not; again, look at the performance you get via CV. Do you have reason to suspect that Var1, ..., Var50 are too many variables? Then you could build a pipeline where, before the fit, you use PCA to reduce the dimensionality, keeping, say, 95% of the variance.
How do you know this works? You guessed it: by looking at the performance once again, on the validation set you get via CV.
My suggestion is to follow both solutions and keep the best performing one (a sketch of the PCA pipeline follows).
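With caret that PCA pipeline is essentially one extra argument (a sketch; thresh = 0.95 is the assumed variance cutoff):
library(caret)
# thresh = 0.95 keeps enough components to explain 95% of the variance
ctrl_pca <- trainControl(method = "cv", number = 5,
                         preProcOptions = list(thresh = 0.95))
# PCA (after centering/scaling) is re-estimated inside each CV fold,
# so the comparison with the no-PCA pipeline stays fair
fit_pca <- train(Outcome ~ ., data = traindata, method = "svmRadial",
                 preProcess = c("center", "scale", "pca"),
                 trControl = ctrl_pca)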
I understand SVM can split data with a linear, radial, or polynomial kernel. How does one make the best choice for their data?
You can treat the kernel choice as just another hyperparameter to be tuned. Here, once again, you need to look at performance.
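For instance, with e1071 you can loop the tuning over candidate kernels and keep the one with the lowest cross-validated error (a sketch on iris; the cost grid is a placeholder):
library(e1071)
kernels <- c("linear", "polynomial", "radial")
cv_error <- sapply(kernels, function(k) {
  # tune() uses 10-fold cross-validation by default
  tune(svm, Species ~ ., data = iris, kernel = k,
       ranges = list(cost = 10^(-1:2)))$best.performance
})
cv_error  # pick the kernel with the smallest CV error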
This is what I'd follow, given that you seem to have already selected SVM as the model of choice. I suggest looking at the caret package; it should simplify all the evaluations you need to do (example of CV with caret):
Scale Data vs not Scale Data
Perform PCA vs keep all variables
Fit all the models on the training set and evaluate via CV
Take your time to test all these pipelines (4 so far)
Evaluate again using CV which kernel is best, along with other hyperparameters (C, gamma,..)
You should find which path led you to the best result.
If you're familiar with the classic confusion matrix, you can use accuracy as a performance metric even for a multi-class classification problem.
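Using the confusion matrix from your reproducible code, that is:
cm <- table(pred, testdata$Outcome)   # 3x3 for Win/Loss/Draw
sum(diag(cm)) / sum(cm)               # correct predictions / all predictions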

Related

How do I fit a GLM using Binomial Distribution for this data in R?

I have been asked to fit a GLM using a binomial distribution for the following question:
A survey was conducted to evaluate the effectiveness of a new canine cough vaccine that had been administered in a local community. For marketing purposes, the vaccine was provided free of charge, in a two-shot sequence over a period of two weeks, to those wishing to bring their dogs in to avail of it. Some dogs received the two-shot sequence, some appeared only for the first shot, and others received neither. A survey of 600 local dog owners in the following session provided the information shown in the table below (cell counts reconstructed from the answer's code):
          shots: 0    1    2
Cough           12    4    8
No cough       175   61  340
How do I get the data into R in order to get the correct format to fit a GLM for binomial dist?
Any help would be great!
One suitable way would be:
vaccine <- c(rep(c(0,1,2),c(12,4,8)),rep(c(0,1,2),c(175,61,340)))  # number of shots per dog
cough <- c(rep(1,12+4+8),rep(0,175+61+340))                        # 1 = dog developed cough
Then you could do something like:
linfit <- glm(cough~vaccine,family=binomial)
summary(linfit)
or
factorfit <- glm(cough~as.factor(vaccine),family=binomial)
summary(factorfit)
or
ordfactorfit <- glm(cough~ordered(vaccine),family=binomial)
summary(ordfactorfit)
or perhaps some other possibilities, depending on what your particular hypotheses were.
This isn't the only way to do it (and you may not want to do it with really large data sets), but "untabulating" in this fashion makes some things easy. You can retabulate easily enough (table(data.frame(cough=cough,vaccine=vaccine))).
You may also find the signed-root-contributions-to-chi-square interesting:
t=table(data.frame(cough=cough,vaccine=vaccine))
r=rowSums(t)
c=colSums(t)
ex=outer(r,c)/sum(t)        # expected counts under independence
print((t-ex)/sqrt(ex),d=3)  # signed root contributions to chi-square
     vaccine
cough      0      1      2
    0 -0.337 -0.177  0.324
    1  1.653  0.868 -1.587
These have an interpretation somewhat analogous to standardized residuals.
A plot of the proportion of Nos against vaccine (with say $\pm$1 standard errors marked in) would be similarly useful.
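A minimal base-R sketch of that plot, assuming the table t from the snippet above:
p  <- prop.table(t, margin = 2)["0", ]  # proportion of "no cough" per shot count
se <- sqrt(p * (1 - p) / colSums(t))    # ~1 standard error of each proportion
plot(0:2, p, ylim = range(c(p - se, p + se)), pch = 19,
     xlab = "vaccine (shots)", ylab = "proportion without cough")
arrows(0:2, p - se, 0:2, p + se, angle = 90, code = 3, length = 0.05)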

How to set up a one-way repeated measures MANOVA in R with no between-subject factors

Main Question
I'm looking for help setting up a one-way repeated measures MANOVA in R for a data-set that has no between-subject factors.
Background
While there are plenty of good guides out there for setting up RM MANOVAs with between-subject factors, I have as yet been unable to find any for an entirely within-subject design. This problem seems like it should be fairly straightforward, but I am new to using MANOVA, so I am not sure if I'm approaching the problem correctly. I have been primarily using the car package in R, though I am open to suggestions for how to do this differently.
To demonstrate the problem, I'll use a subset of the OBrienKaiser data set, and I'll assume that each of the levels of the Hours within-subjects factor instead represents the measurement of a different dependent variable. I'll then take the pre and post conditions to be the two levels of my single within-subjects independent variable. To keep things concise, I'll only look at the first three levels from Hours.
So what I have for my data set is 16 subjects measured in two different conditions (pre and post) on 3 different dependent variables (1,2, and 3).
library(car)  # provides Anova() and the OBrienKaiser data
data <- subset(OBrienKaiser,select=c(pre.1,pre.2,pre.3,post.1,post.2,post.3))
My goal is to look for a difference between pre and post across the combination of the three different dependent variables. I've relied heavily on this guide...
http://socserv.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Multivariate-Linear-Models.pdf
... but the case used there is not quite the same, and includes between-subject conditions that I do not have. So far, I have been able to produce potentially correct results with the following approach, but my problem is that I'm not fully sure I've set the problem up correctly. In general, my approach was to follow the steps outlined there, but simply omit the between-subject conditions.
My Approach
I start by designing the idata matrix for the Anova() function call by treating the condition and the dependent variables as two different within-subject factors
Condition <- factor(rep(c('Pre','Post'),each=3),levels=c('Pre','Post'))
Measure <- factor(rep(c('M1','M2','M3'),2),levels=c('M1','M2','M3'))
idata <- data.frame(Condition,Measure)
Next, build the multivariate linear model on the data-set, ignoring the between-subject factors.
mod.mlm <- lm(cbind(pre.1,pre.2,pre.3,post.1,post.2,post.3)~1,data=data)
I then call Anova() on the linear-model object mod.mlm, using the idata I defined earlier, and setting my within-subjects design to be ~Condition.
av.out <- Anova(mod.mlm,idata=idata,idesign=~Condition,type=3)
This yields the following output...
Type III Repeated Measures MANOVA Tests: Pillai test statistic
            Df test stat approx F num Df den Df   Pr(>F)
(Intercept)  1   0.91438  160.189      1     15 2.08e-09 ***
Condition    1   0.37062    8.833      1     15 0.009498 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To me this process and this result seems reasonable. There are two things that give me pause.
When setting up the linear model, the lack of a predictor variable seems odd, but I think it ends up being defined in the call to Anova() with idesign. If that is just how one sets up the ANOVA with car then, great. It just seems odd to construct a linear model without a predictor when I have a predictor variable I'm explicitly interested in.
If I use summary(av.out) to take a more in-depth look at the model that comes out, I can see the contrasts that go into the design. The contrast that is produced with the above approach codes pre with 1 and post with -1. I believe that this is the appropriate contrast for what I am trying to do, but I am not completely sure. Given that one can pass in a custom contrast using imatrix or contrasts in the call to Anova(), I'd like to be sure that what I am trying to test (i.e., differences between pre and post across the three dependent variables) is what I'm actually testing.
Any help and/or advice on understanding repeated measures MANOVA in general in this context would be greatly appreciated, as well as specific advice on how to implement this in R.
Bonus
I would also like to do the same thing in Matlab, so if anyone has specific advice on that it would be appreciated (though I realize this might require its own question).

consumer surplus in r

I'm trying to do some econometric analysis using R and can't figure out how to do the analysis I'm looking for. Specifically, I want to calculate consumer surplus.
I am trying to predict number of trips (dependent) based on variables like water quality, scenery, parking, etc. I've run a regression of my independent variables on my dependent variable using:
lm()
and also got my predicted values using:
y_hat <- as.matrix(mydata[c("y")])
Now I want to calculate the consumer surplus for each individual (~260 total) from my predicted (y_hat) values.
Welcome to R. I studied economics in college and wish R was taught. You will find that the programming language is very useful in your work.
Note that R is able to accomplish vectorized operations that may speed up your analysis. Consider:
mydata <- data.frame(x=letters[1:3], y=1:3)
x y
1 a 1
2 b 2
3 c 3
Let's say your predicted 'y' is 1.25.
y_hat <- 1.25
You can subtract that number from the entire column of the dataset, and it will go row by row for you without complicated for loops:
y_hat - mydata[c("y")]
y
1 0.25
2 -0.75
3 -1.75
Without more information about your particular issue, that is all the help that I can offer. In the future, add a reproducible example that illustrates your data and the specific issue that you are stuck on.
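Putting that together with the regression step from the question (a sketch; trips, water_quality, scenery, and parking are hypothetical column names, and the actual consumer-surplus formula will depend on your demand model):
# hypothetical travel-cost style regression
fit <- lm(trips ~ water_quality + scenery + parking, data = mydata)
# predict() returns fitted values for all ~260 individuals at once
y_hat <- predict(fit)
# any per-individual arithmetic on y_hat is then vectorized, e.g.:
head(y_hat - mydata$trips)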

Classification using R in a data set with numeric and categorical variables

I'm working on a very big data set (CSV).
The data set is composed of both numeric and categorical columns.
One of the columns is my "target column", meaning I want to use the other columns to determine which value (out of 3 possible known values) is likely to be in the target column, and in the end check my classification against the real data.
My question:
I'm using R.
I am trying to find a way to select the subset of features which gives the best classification.
Going over all the subsets is impossible.
Does anyone know an algorithm, or can you think of a way to do it in R?
This seems to be a classification problem. Without knowing the number of covariates you have for your target I can't be sure, but wouldn't a neural network solve your problem?
You could use the nnet package, which implements a feed-forward neural network and works with multiple classes. Having categorical columns is not a problem, since you can just use factors.
Without a data sample I can only explain it a bit, but mainly you use the function:
newNet<-nnet(targetColumn~ . ,data=yourDataset, subset=yourDataSubset [..and more values]..)
You obtain a trained neural net. What is also important here is the size of the hidden layer, which is a tricky thing to get right. As a rule of thumb it should be roughly 2/3 of the number of inputs plus the number of outputs (3 in your case).
Then with:
myPrediction <- predict(newNet, newdata=yourDataset[-yourDataSubset,])  # the rows not used for training
You obtain the predicted values. About how to evaluate them: I use the ROCR package, but it currently only supports binary classification; I guess a Google search will show some help.
If you are adamant about eliminating some of the covariates, the cor() function may help you identify the less informative ones, as in the sketch below.
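A small sketch of that idea; findCorrelation() from caret and the 0.9 cutoff are one common (assumed) way to act on the correlation matrix:
library(caret)
num <- yourDataset[, sapply(yourDataset, is.numeric)]  # cor() needs numeric columns
drop_idx <- findCorrelation(cor(num), cutoff = 0.9)    # columns highly correlated with others
if (length(drop_idx) > 0) num <- num[, -drop_idx]      # guard: num[, -integer(0)] would drop everything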
Edit for a step by step guide:
Let's say we have this data frame:
str(df)
'data.frame': 5 obs. of 3 variables:
$ a: num 1 2 3 4 5
$ b: num 1 1.5 2 2.5 3
$ c: Factor w/ 3 levels "blue","red","yellow": 2 2 1 2 3
The column c has 3 levels, that is, 3 types of values it can take. This is done by default by a data frame when a column has strings instead of numerical values.
Now, using the columns a and b, we want to predict which value c is going to be, using a neural network. The nnet package is simple enough for this example. If you don't have it installed, use:
install.packages("nnet")
Then, to load it:
require(nnet)
After this, let's train the neural network with a sample of the data. For that, the line
portion<-sample(1:nrow(df),0.7*nrow(df))
will store in portion 70% of the row indices from the data frame. Now, let's train that net! I recommend you check the documentation for the nnet package with ?nnet for deeper knowledge. Using only the basics:
myNet<-nnet(c ~ a+b, data=df, subset=portion, size=1)
c ~ a+b is the formula for the prediction: you want to predict the column c using the columns a and b.
data= gives the data origin, in this case the data frame df.
subset= is self-explanatory.
size= is the size of the hidden layer; as I said, use about 2/3 of the number of input columns (a and b) plus the number of outputs (the 3 levels of c).
We have a trained net now; let's use it.
Using predict, you apply the trained net to new values.
newPredictedValues<-predict(myNet,newdata=df[-portion,])
After that, newPredictedValues will have the predictions.
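For a classification net like this one, a quick way to check those predictions against the truth (type="class" returns predicted labels instead of class probabilities):
pred_class <- predict(myNet, newdata=df[-portion,], type="class")
table(actual=df$c[-portion], predicted=pred_class)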
Since you have both numerical and categorical data, you may try SVM.
I am using SVM and KNN on my numerical data, and I also tried to apply a DNN, but DNN is pretty slow to train, especially on big data in R. KNN does not need to be trained, but it is used for numerical data. The following is what I am using; maybe you can have a look at it.
library(e1071)
#Train the model
y_train<-factor(data[,1])          #first col is the response variable
x_train<-subset(data,select=-1)
train_df<-data.frame(x=x_train,y=y_train)
svm_model<-svm(y~.,data=train_df,type="C")
#Test
y_test<-testdata[,1]
x_test<-subset(testdata,select=-1)
#wrap x_test the same way so the column names match the training frame
pred<-predict(svm_model,newdata=data.frame(x=x_test))
svm_t<-table(pred,y_test)
sum(diag(svm_t))/sum(svm_t)        #accuracy

R fast AUC function for non-binary dependent variable

I'm trying to calculate the AUC for a large-ish data set and having trouble finding an implementation that both handles values that aren't just 0's or 1's and works reasonably quickly.
So far I've tried the ROCR package, but it only handles 0's and 1's, and the pROC package will give me an answer but can take 5-10 minutes to calculate 1 million rows.
As a note, all of my values fall between 0 and 1, but are not necessarily 1 or 0.
EDIT: both the answers and the predictions fall between 0 and 1.
Any suggestions?
EDIT2:
ROCR can deal with situations like this:
Ex.1
actual prediction
1 0
1 1
0 1
0 1
1 0
or like this:
Ex.2
actual prediction
1 .25
1 .1
0 .9
0 .01
1 .88
but NOT situations like this:
Ex.3
actual prediction
.2 .25
.6 .1
.98 .9
.05 .01
.72 .88
pROC can deal with Ex.3 but it takes a very long time to compute. I'm hoping that there's a faster implementation for a situation like Ex.3.
So far I've tried the ROCR package, but it only handles 0's and 1's
Are you talking about the reference (actual) class memberships or the predicted class memberships?
The latter can be between 0 and 1 in ROCR; have a look at its example data set, ROCR.simple.
If your reference is in [0, 1], you could have a look at (disclaimer: my) package softclassval. You'd have to construct the ROC/AUC from sensitivity and specificity calculations, though. So unless you think of an optimized algorithm (as the ROCR developers did), it'll probably take long, too. In that case you'll also have to think about what exactly sensitivity and specificity should mean, as this is ambiguous with reference memberships in (0, 1).
Update after clarification of the question
You need to be aware that grouping the reference (actual) values together loses information. E.g., if you have actual = 0.5 and prediction = 0.8, what is that supposed to mean? Suppose these values were really actual = 5/10 and prediction = 8/10.
By summarizing the 10 tests into two numbers, you lose the information of whether the same 5 out of the 10 were meant or not. Without this, actual = 5/10 and prediction = 8/10 is consistent with anything between 30% and 70% correct recognition!
An illustration of this for sensitivity (i.e. correct recognition, e.g. of click-through) is in the poster: you can find the whole poster and two presentations discussing such issues at softclassval.r-forge.r-project.org, section "About softclassval".
Going on with these thoughts, weighted versions of mean absolute, mean squared, root mean squared etc. errors can be used as well.
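For instance, on the Ex.3 values from the question, the unweighted versions are one-liners:
actual     <- c(0.20, 0.60, 0.98, 0.05, 0.72)
prediction <- c(0.25, 0.10, 0.90, 0.01, 0.88)
mean(abs(prediction - actual))        # mean absolute error
sqrt(mean((prediction - actual)^2))   # root mean squared error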
However, all those different ways of expressing the same performance characteristic of the model (e.g. sensitivity = % correct recognition of actual click-through events) do have different meanings, and while they coincide with the usual calculation in unambiguous reference and prediction situations, they will react differently with ambiguous reference / partial reference class membership.
Note also, as you use continuous values in [0, 1] for both reference/actual and prediction, the whole test will be condensed into one point (not a line!) in the ROC or specificity-sensitivity plot.
Bottom line: the grouping of the data gets you in trouble here. So if you could somehow get the information on the single clicks, go and get it!
Can you use other error measures for assessing method performance? (e.g. Mean Absolute Error, Root Mean Square Error)?
This post might also help you out, but if you have different numbers of classes for observed and predicted values, then you might run into some issues.
https://stat.ethz.ch/pipermail/r-help/2008-September/172537.html
