factor(0) when using predict for SVM in R - r

I have a data frame trainData which contains 198 rows and looks like
Matchup Win HomeID AwayID A_TWPCT A_WST6 A_SEED B_TWPCT B_WST6 B_SEED
1 2010_1115_1457 1 1115 1457 0.531 5 16 0.567 4 16
2 2010_1124_1358 1 1124 1358 0.774 5 3 0.75 5 14
...
The testData is similar.
In order to use SVM, I have to change the response variable Win to a factor. I tried the below:
trainDataSVM <- data.frame(Win=as.factor(trainData$Win), A_WST6=trainData$A_WST6, A_SEED=trainData$A_SEED, B_WST6=trainData$B_WST6, B_SEED= trainData$B_SEED,
Matchup=trainData$Matchup, HomeID=trainData$HomeID, AwayID=trainData$AwayID)
I then want to fit an SVM and predict the probabilities, so I tried the following:
svmfit <- svm(Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainDataSVM, kernel = "linear", cost = 10, scale = FALSE)
# use CV with a range of cost values
set.seed(1)
tune.out <- tune(svm, Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainDataSVM, kernel = "linear", ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
bestmod <- tune.out$best.model
testDataSVM <- data.frame(Win=as.factor(testData$Win), A_WST6=testData$A_WST6, A_SEED=testData$A_SEED, B_WST6=testData$B_WST6, B_SEED= testData$B_SEED,
Matchup=testData$Matchup, HomeID=testData$HomeID, AwayID=testData$AwayID)
predictions_SVM <- predict(bestmod, testDataSVM, type = "response")
However, when I try to print out predictions_SVM, I get the message
factor(0)
Levels: 0 1
instead of a column of probability values. What is going on?

I haven't used this much myself, but the SVM algorithm itself does not produce class probabilities, only decision values (the signed distance from the separating hyperplane). In the documentation for the svm function, the argument probability ("logical indicating whether the model should allow for probability predictions") is FALSE by default, and you did not set it to TRUE. The documentation for predict.svm says something similar: its probability argument is a "Logical indicating whether class probabilities should be computed and returned. Only possible if the model was fitted with the probability option enabled." Hope that's helpful.
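As a minimal sketch of what the documentation quoted above describes, the following fits an e1071 SVM with probability = TRUE and then requests probabilities at prediction time. The built-in iris data and the train/test split are stand-ins (assumptions), not the asker's data.

```r
library(e1071)

# Stand-in data (assumption): iris instead of the asker's trainData/testData.
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# The model must be fitted with probability = TRUE ...
fit <- svm(Species ~ ., data = train, kernel = "linear", cost = 1,
           probability = TRUE)

# ... and predict() must be asked for probabilities as well.
pred  <- predict(fit, newdata = test, probability = TRUE)

# The class probabilities are attached to the prediction as an attribute.
probs <- attr(pred, "probabilities")
head(probs)
```

The returned factor still holds the predicted classes; the probability matrix has one column per class. When tuning with tune(), extra arguments are passed on to the training function, so adding probability = TRUE to the tune() call should carry over to the best model.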

Related

How to send a confusion matrix to caret's confusionMatrix?

I'm looking at this data set: https://archive.ics.uci.edu/ml/datasets/Credit+Approval. I built a ctree:
myFormula<-class~. # class is a factor of "+" or "-"
ct <- ctree(myFormula, data = train)
And now I'd like to put that data into caret's confusionMatrix method to get all the stats associated with the confusion matrix:
testPred <- predict(ct, newdata = test)
#### This is where I'm doing something wrong ####
confusionMatrix(table(testPred, test$class),positive="+")
#### ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ####
$positive
[1] "+"
$table
td
testPred - +
- 99 6
+ 20 88
$overall
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue McnemarPValue
8.779343e-01 7.562715e-01 8.262795e-01 9.186911e-01 5.586854e-01 6.426168e-24 1.078745e-02
$byClass
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1
0.9361702 0.8319328 0.8148148 0.9428571 0.8148148 0.9361702 0.8712871
Prevalence Detection Rate Detection Prevalence Balanced Accuracy
0.4413146 0.4131455 0.5070423 0.8840515
$mode
[1] "sens_spec"
$dots
list()
attr(,"class")
[1] "confusionMatrix"
So Sensitivity is Sensitivity = A/(A+C) (from caret's confusionMatrix doc).
If you take my confusion matrix:
$table
td
testPred - +
- 99 6
+ 20 88
You can see this doesn't add up: Sensitivity = A/(A+C) = 99/(99+20) = 99/119 = 0.8319328. In my confusionMatrix results, that value is reported as Specificity. However, Specificity = D/(B+D) = 88/(88+6) = 88/94 = 0.9361702, which is the value reported as Sensitivity.
I've tried this confusionMatrix(td,testPred, positive="+") but got even weirder results. What am I doing wrong?
UPDATE: I also realized that my confusion matrix is different than what caret thought it was:
Mine: Caret:
td testPred
testPred - + td - +
- 99 6 - 99 20
+ 20 88 + 6 88
As you can see, it thinks my False Positive and False Negative are backwards.
UPDATE: I found it's a lot better to send the data, rather than a table as a parameter. From the confusionMatrix docs:
reference
a factor of classes to be used as the true results
I took this to mean what symbol constitutes a positive outcome. In my case, this would have been a +. However, 'reference' refers to the actual outcomes from the data set, aka the dependent variable.
So I should have used confusionMatrix(testPred, test$class). If your data is out of order for some reason, it will shift it into the correct order (so the positive and negative outcomes/predictions align correctly in the confusion matrix).
However, if you are worried about the outcome being the correct factor, install the plyr library, and use revalue to change the factor:
install.packages("plyr")
library(plyr)
newDF <- df
newDF$class <- revalue(newDF$class,c("+"=1,"-"=0))
# You'd have to rerun your model using newDF
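To make the orientation issue concrete, here is a small base R sketch that recomputes sensitivity and specificity by hand from the table in the question, with predictions in rows and the reference in columns. The helper sens_spec is hypothetical and is not part of caret.

```r
# Confusion table from the question: rows = predictions, columns = reference.
cm <- matrix(c(99, 20, 6, 88), nrow = 2,
             dimnames = list(testPred  = c("-", "+"),
                             reference = c("-", "+")))

# Hypothetical helper: sensitivity and specificity for a chosen positive class.
sens_spec <- function(cm, positive) {
  negative <- setdiff(colnames(cm), positive)
  tp <- cm[positive, positive]  # predicted positive, truly positive
  fn <- cm[negative, positive]  # predicted negative, truly positive
  fp <- cm[positive, negative]  # predicted positive, truly negative
  tn <- cm[negative, negative]  # predicted negative, truly negative
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

plus  <- sens_spec(cm, "+")  # "+" as the positive class
minus <- sens_spec(cm, "-")  # "-" as the positive class
print(round(rbind(plus, minus), 7))
```

With "+" as the positive class, sensitivity is 88/94 and specificity is 99/119, which matches the caret output shown in the question. The hand calculation in the question corresponds to treating "-" as the positive class, which swaps the two values.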
I'm not sure why this worked, but I just removed the positive parameter:
confusionMatrix(table(testPred, test$class))
My Confusion Matrix:
td
testPred - +
- 99 6
+ 20 88
Caret's Confusion Matrix:
td
testPred - +
- 99 6
+ 20 88
Although now it says $positive: "-" so I'm not sure if that's good or bad.

Bug with VGAM? vglm family=posnegbinomial => "Error in if (take.half.step) { : missing value where TRUE/FALSE needed"

I have some actual data that I am afraid is somewhat nasty.
It's essentially a Positive Negative Binomial distribution (without any zero counts). However, there are some outliers that seem to cause some bad calculations to occur (maybe underflow or NaNs?) The first 8 or so entries are reasonable, but I'm guessing the last few are causing some problems with the fitting.
Here's the data:
> df
counts t
1 1968 1
2 217 2
3 55 3
4 26 4
5 11 5
6 5 6
7 8 7
8 3 8
9 1 10
10 1 11
11 1 12
12 1 13
13 1 15
14 1 18
15 1 26
16 1 59
This command runs for a while and then spits out the error message
> vglm(counts ~ t, data=df, family = posnegbinomial)
Error in if (take.half.step) { : missing value where TRUE/FALSE needed
BUT, if I rerun this cutting off the outliers, I get a solution for posnegbinomial
> vglm(counts ~ t, data=df[1:9,], family = posnegbinomial)
Call:
vglm(formula = counts ~ t, family = posnegbinomial, data = df[1:9,])
Coefficients:
(Intercept):1 (Intercept):2 t
7.7487404 0.7983811 -0.9427189
Degrees of Freedom: 18 Total; 15 Residual
Log-likelihood: -36.21064
If I try the family pospoisson (Positive Poisson: no zero values), I get a similar error: "argument is not interpretable as logical".
I do notice that there are a number of similar questions on Stack Overflow about missing values where TRUE/FALSE is needed, but with other R packages. This suggests to me that package writers need to better anticipate that calculations might fail.
I think your proximal problem is that the predicted means for the negative binomial for your extreme values are so close to zero that they are underflowing to zero, in a way that was not anticipated/protected against by the package authors. (One thing to realize about nonlinear optimization/fitting is that it is always possible to break a fitting method by giving it extreme data ...)
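One plausible way this shows up numerically can be sketched in base R. A zero-truncated negative binomial divides by 1 - P(0), and for a tiny predicted mean that quantity rounds to exactly zero in double precision. The coefficients are the fitted values reported in the comparison table later in this answer; the selection of t values is illustrative, and this is not a trace of what VGAM does internally.

```r
# Log-scale coefficients taken from the comparison table in this answer.
log_k   <- 0.8094678
log_int <- 7.7670604
slope   <- -0.9491796

t_vals <- c(1, 8, 18, 26, 59)            # a few t values from the data
mu     <- exp(log_int + slope * t_vals)  # predicted (untruncated) means
k      <- exp(log_k)                     # negative binomial size parameter

# Normalizing constant of the zero-truncated distribution: 1 - P(count = 0).
p_positive <- 1 - dnbinom(0, size = k, mu = mu)
log_norm   <- log(p_positive)            # -Inf where p_positive is exactly 0

# Putting a floor on the mean keeps the constant away from zero.
mu_floor         <- pmax(mu, 1e-7)
p_positive_floor <- 1 - dnbinom(0, size = k, mu = mu_floor)

print(data.frame(t = t_vals, mu = mu, p_positive = p_positive,
                 p_positive_floor = p_positive_floor))
```

For t = 59 the predicted mean is so small that P(0) evaluates to exactly 1, so the normalizing constant is 0 and its log is -Inf. A floor on the mean, like the pmax() used in the mle2 fit in this answer, avoids that.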
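I couldn't get this to work in VGAM, but I'll offer a couple of other suggestions. (In the code below, dd is the data frame from the question, which is called df there.) I would start by plotting the data on a log scale: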
plot(log(counts)~t,data=dd)
And eyeballing the data to get an initial estimate of parameter values (at least for the mean model):
m0 <- lm(log(counts)~t,data=subset(dd,t<10))
I thought I might be able to get vglm() to work by setting starting values, but that didn't actually pan out, even when I have fairly good values from other platforms (see below).
glmmADMB
The glmmADMB package can handle positive NB, via family="truncnbinom":
library(glmmADMB)
m1 <- glmmadmb(counts~t, data=dd, family="truncnbinom")
(there are some warning messages ...)
bbmle::mle2()
This requires a little bit more work: it failed with the standard model, but works if I set a floor on the predicted mean ...
library(VGAM) ## for dposnegbin
library(bbmle)
m2 <- mle2(counts~dposnegbin(size=exp(logk),
munb=pmax(exp(logeta),1e-7)),
parameters=list(logeta~t),
data=dd,
start=list(logk=0,logeta=0))
Again warning messages.
Compare glmmADMB, mle2, simple truncated lm fit ...
cc <- cbind(coef(m2),
c(log(m1$alpha),coef(m1)),
c(NA,coef(m0)))
dimnames(cc) <- list(c("log_k","log_int","slope"),
c("mle2","glmmADMB","lm"))
## mle2 glmmADMB lm
## log_k 0.8094678 0.8094625 NA
## log_int 7.7670604 7.7670637 7.1747551
## slope -0.9491796 -0.9491778 -0.8328487
This is in principle also possible with glmmTMB, but it runs into the same kinds of problems as vglm() ...

R e1071: Balanced Error Rate (BER) as error criterion in tune function

I'm kind of new to R and machine learning in general, so apologies if this seems stupid!
I'm using the e1071 package to tune the parameters of various models. My dataset is very unbalanced and I would like the error criterion to be the Balanced Error Rate, NOT the overall classification error. However, I'm stumped as to how to achieve this.
Here is my code:
#Find optimal value 'k' value for k-NN model (feature subset).
c <- data_train_sub[1:13]
d <- data_train_sub[,14]
knn2 <- tune.knn(c, d, k = 1:10,
                 tunecontrol = tune.control(sampling = "cross", performances = TRUE, sampling.aggregate = mean))
summary(knn2)
plot(knn2)
Which returns this:
Parameter tuning of ‘knn.wrapper’:
- sampling method: 10-fold cross validation
- best parameters:
k
1
- best performance: 0.001190476
- Detailed performance results:
k error dispersion
1 1 0.001190476 0.003764616
2 2 0.005952381 0.006274360
3 3 0.003557423 0.005728122
4 4 0.005924370 0.008352124
5 5 0.005938375 0.008407043
6 6 0.005938375 0.008407043
7 7 0.007128852 0.008315090
8 8 0.009495798 0.009343555
9 9 0.008305322 0.009751997
10 10 0.008319328 0.009795292
Has anyone any experience of altering the error being assessed in this function?
Look at the class.weights argument of the svm() function:
a named vector of weights for the different classes, used for asymmetric class sizes...
The weights can be calculated from the class frequencies. To offset the imbalance the rarer class needs the larger weight, so take the inverse of the class proportions:
class.weights = sum(table(Xcal$species))/table(Xcal$species)
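For the question as asked (changing the criterion that tune optimizes), tune.control() in e1071 has an error.fun argument that takes the true values and the predictions and returns the error to minimize. A minimal sketch of a balanced error rate for that purpose follows; the function name balanced_error_rate and the toy vectors are illustrative assumptions.

```r
# Balanced error rate: the mean of the per-class error rates, so that each
# class counts equally regardless of its size.
# Argument order (true values first, predictions second) follows error.fun.
balanced_error_rate <- function(true, pred) {
  true <- factor(true)
  pred <- factor(pred, levels = levels(true))
  tab  <- table(true, pred)                   # rows = truth, columns = prediction
  per_class_error <- 1 - diag(tab) / rowSums(tab)
  mean(per_class_error, na.rm = TRUE)         # skips classes absent from 'true'
}

# Toy check with an unbalanced sample (illustrative values only).
true <- c("a", "a", "a", "a", "b", "b")
pred <- c("a", "a", "a", "a", "a", "b")
ber     <- balanced_error_rate(true, pred)    # class a: 0, class b: 0.5
overall <- mean(true != pred)                 # plain misclassification rate
print(c(balanced = ber, overall = overall))

# Possible use with the tuning call from the question (not run here):
# tune.knn(c, d, k = 1:10,
#          tunecontrol = tune.control(sampling = "cross",
#                                     error.fun = balanced_error_rate))
```

On the toy data the plain error rate is 1/6 while the balanced error rate is 0.25, because the single mistake falls in the small class.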

Multinomial logit models and nested logit models

I am using the mlogit package in program R. I have converted my data from its original wide format to long format. Here is a sample of the converted data.frame which I refer to as 'long_perp'. All of the independent variables are individual specific. I have 4258 unique observations in the data-set.
date_id act2 grp.bin pdist ship sea avgknots shore day location chid alt
4.dive 40707_004 TRUE 2 2.250 second light 14.06809 2.30805 12 Lower 4 dive
4.fly 40707_004 FALSE 2 2.250 second light 14.06809 2.30805 12 Lower 4 fly
4.none 40707_004 FALSE 2 2.250 second light 14.06809 2.30805 12 Lower 4 none
5.dive 40707_006 FALSE 2 0.000 second light 15.12650 2.53312 12 Lower 5 dive
5.fly 40707_006 TRUE 2 0.000 second light 15.12650 2.53312 12 Lower 5 fly
5.none 40707_006 FALSE 2 0.000 second light 15.12650 2.53312 12 Lower 5 none
6.dive 40707_007 FALSE 1 1.995 second light 14.02101 2.01680 12 Lower 6 dive
6.fly 40707_007 TRUE 1 1.995 second light 14.02101 2.01680 12 Lower 6 fly
6.none 40707_007 FALSE 1 1.995 second light 14.02101 2.01680 12 Lower 6 none
'act2' is the dependent variable and consists of choices a bird floating on the water could make when approached by a ship; fly, dive, or none. I am interested in how these probabilities relate to the remaining independent variables in the data.frame, i.e. perpendicular distance to the ship path (pdist) sea conditions (sea), speed (avgknots), distance to shore (shore) etc. The independent variables are made of dichotomous, factor and continuous variables.
I ran two multinomial logit models, one including all the choice options and another including only a subset. I then compared these models with the hmftest() function to test the IIA assumption. The results were confusing, to say the least. I will include the code for the two models and the test output (in case I am misspecifying the models in the code).
# model including all choice options (fly, dive, none)
mod.1 <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
as.factor(sea) + avgknots + shore + as.factor(location),long_perp ,
reflevel = 'none')
# model including only a subset of choice options (fly, dive)
mod.alt <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
as.factor(sea) + avgknots + shore + as.factor(location),long_perp ,
reflevel = 'none', alt.subset = c("fly","dive"))
# IIA test
hmftest(mod.1, mod.alt)
# output
Hausman-McFadden test
data: long_perp
chisq = -968.7303, df = 7, p-value = 1
alternative hypothesis: IIA is rejected
As you can see the chisquare statistic is negative! I assume I am either 1. doing something wrong, or 2. IIA is violated. This result holds true for choice subset (fly, dive), but the IIA assumption is upheld with choice subset (none, dive)? This confuses me.
Next I tried to formulate a nested model as a way to relax the IIA assumption. I nested the choices as nest1 = none, nest2 = fly, dive. This makes sense to me as this seems like a logical break, the bird decides to react or not then decides which reaction to make.
I am unclear on how to run the nested logit models (even after reading the two vignettes for mlogit, Croissant vignette and Train vignette).
When I run my analysis following the example in the Croissant vignette I get the following error.
nested.1 <- mlogit(act2 ~ 0 | pdist + as.factor(grp.bin) + as.factor(ship) +
as.factor(sea) + avgknots + shore + as.factor(location),
long_perp , reflevel="none",nests = list(noact = "none",
react = c("dive","fly")), unscaled = TRUE)
# Error in solve.default(crossprod(attr(x, "gradi")[, !fixed])) :
Lapack routine dgesv: system is exactly singular: U[19,19] = 0
I have read a bit about this error message and it may occur because of complete separation. I have looked at some tables of the data and do not believe this is happening as I have 4,000+ observations and only one factor variable with more than 2 levels (it has 3).
Help on these specific problems is greatly appreciated, but I am also open to alternative analyses that I can use to answer my question. I am mainly interested in the probability of flying as a function of perpendicular distance to the ship's path.
Thanks, Tim
To get a positive chi-square statistic, include the reference level in the subset as well:
alt.subset = c("none", "fly")
This may help, though the p-value may not change much.
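As background on how a negative statistic can arise at all: a Hausman-type statistic is a quadratic form in the difference of the two coefficient vectors, weighted by the inverse of the difference of their covariance matrices. In finite samples that difference need not be positive definite, and then the quadratic form can be negative. The base R sketch below uses made-up numbers (assumptions, not the bird data) and is not a reproduction of hmftest() itself.

```r
# Hypothetical estimates for the coefficients shared by both models.
b_full <- c(0.5, -0.2)   # model with all alternatives
b_sub  <- c(0.6,  0.1)   # model on the subset of alternatives

# Hypothetical covariance matrices of those estimates.
V_full <- diag(c(0.04, 0.03))
V_sub  <- diag(c(0.05, 0.02))

d      <- b_sub - b_full
V_diff <- V_sub - V_full          # not positive definite: diagonal (0.01, -0.01)

# Hausman-type quadratic form: d' (V_sub - V_full)^{-1} d
H <- drop(t(d) %*% solve(V_diff) %*% d)
print(H)                          # negative, because V_diff has a negative eigenvalue
print(eigen(V_diff)$values)
```

A negative value of this kind is usually read as a sign that the test is not informative for that pair of models, rather than as evidence about IIA in either direction.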

Fitting two parameter observations into copulas

I have one set of observations on two variables.
How do I fit a copula to it (that is, estimate the copula parameter and the marginal distributions)?
Let's say the marginal distributions are log-normal and the copula is a Gumbel copula.
The data is as below:
1 974.0304 1010
2 6094.2672 1150
3 3103.2720 1490
4 1746.1872 1210
5 6683.7744 3060
6 6299.6832 3330
7 4784.0112 1550
8 1472.4288 607
9 3758.5728 1970
10 4381.2144 1350
library(copula)
gumbel.cop <- gumbelCopula(dim=2)
myMvd <- mvdc(gumbel.cop, c("lnorm","lnorm"), list(list(meanlog = 7.1445391,sdlog=0.4568783), list(meanlog = 7.957392,sdlog=0.559831)))
x <- rmvdc(myMvd, 1000)
fit <- fitMvdc(x, myMvd, c(7.1445391,0.4568783,7.957392,0.559831))
The meanlog and sdlog value are derived from the data set. Error message:
"Error in if (alpha - 1 < .Machine$double.eps^(1/3)) return(rCopula(n, :
missing value where TRUE/FALSE needed"
How to choose the copula parameter with the given data, and the margin distributions derived from the data set?
To close the question, here is a summary of what was worked out in the comments.
The error comes from a TRUE/FALSE test on the copula parameter, so supplying a value for that parameter resolves it. Alternatively, compute the pseudo-observations first and then fit the copula to them.
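A minimal sketch of the pseudo-observation route with the copula package, using the ten rows shown in the question, is given below. The column names x1 and x2 are assumptions, and with only ten observations the estimate will be very rough.

```r
library(copula)

# The ten observations from the question; column names are assumptions.
dat <- cbind(
  x1 = c(974.0304, 6094.2672, 3103.2720, 1746.1872, 6683.7744,
         6299.6832, 4784.0112, 1472.4288, 3758.5728, 4381.2144),
  x2 = c(1010, 1150, 1490, 1210, 3060, 3330, 1550, 607, 1970, 1350)
)

# Step 1: rank-transform each margin to pseudo-observations in (0, 1).
u <- pobs(dat)

# Step 2: fit the Gumbel copula to the pseudo-observations
# (maximum pseudo-likelihood); no parameter value has to be supplied here.
fit_cop <- fitCopula(gumbelCopula(dim = 2), u, method = "mpl")
theta   <- coef(fit_cop)

# Step 3: estimate the log-normal margins separately on the log scale.
meanlog <- colMeans(log(dat))
sdlog   <- apply(log(dat), 2, sd)

print(theta)
print(rbind(meanlog, sdlog))
```

If a joint fit with fitMvdc() is preferred, the copula inside the mvdc object needs an actual parameter value (for example gumbelCopula(param = 2, dim = 2)), and the start vector has to contain the margin parameters followed by the copula parameter, so five values here rather than four.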
