R: svm from e1071 predictions differ based on "probability" argument setting

Under certain circumstances, there are differences in predictions from e1071 package svm models depending on the setting of the probability input argument. This code example:
rm(list = ls())
library(e1071)
data(iris)
## Training and testing subsets
set.seed(73) # For reproducibility
ri = sample(seq(1, nrow(iris)), round(nrow(iris) * 0.8))
train = iris[ri, ]
test = iris[-ri, ]
## Models and predictions with probability setting F or T
set.seed(42) # Just to exclude that randomness in the algorithm itself is the cause
m1 <- svm(Species ~ ., data = train, probability = F)
pred1 = predict(m1, newdata = test, probability = F)
set.seed(42) # Just to exclude that randomness in the algorithm itself is the cause
m2 <- svm(Species ~ ., data = train, probability = T)
pred2 = predict(m2, newdata = test, probability = T)
## Accuracy (note: divided by nrow(iris) rather than nrow(test), so the values
## are scaled down, but the comparison between the two settings still holds)
acc1 = sum(test$Species == pred1) / nrow(iris)
acc2 = sum(test$Species == pred2) / nrow(iris)
will give
acc1 = 0.18666...
acc2 = 0.19333...
My conclusion is that svm() performs calculations differently based on the setting of the probability parameter.
Is that correct?
If so, why and how does it differ?
I haven't seen anything about this in the docs for the package or function.
The reason I bother with this is that the classification performance is not only different but consistently slightly worse with probability = T in a project where I classify ~800 observations of ~250 gene abundances (bioinformatics stuff). The code from that project involves data cleaning and cross-validation, making it too bulky to include here, so you'll have to take my word for it.
Any ideas folks?
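One way to probe where the two settings disagree (a small diagnostic sketch; it relies only on the documented "probabilities" attribute that predict.svm attaches when probability = T):
which(pred1 != pred2) # test rows where the two prediction paths disagree
prob <- attr(pred2, "probabilities")
## if the labels under probability = T follow the argmax of the Platt-scaled
## probabilities (an assumption worth verifying), this should be TRUE:
all(colnames(prob)[max.col(prob)] == as.character(pred2))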


How does the kernel SVM in e1071 predict?

I am trying to understand the way the e1071 package obtains its SVM predictions in a two-class classification framework. Consider the following toy example.
library(mvtnorm)
library(e1071)
n <- 50
### Gaussians
eps <- 0.05
data1 <- as.data.frame(rmvnorm(n, mean = c(0,0), sigma=diag(rep(eps,2))))
data2 <- as.data.frame(rmvnorm(n, mean = c(1,1), sigma=diag(rep(eps,2))))
### Train Model
data_df <- as.data.frame(rbind(data1, data2))
data <- as.matrix(data_df)
data_df$y <- as.factor(c(rep(-1,n), rep(1,n)))
svm <- svm(y ~ ., data = data_df, kernel = "radial", gamma=1, type = "C-classification", scale = FALSE)
Having trained the SVM, I would like to write a function that uses the coefficients and the intercept to predict on a new data point.
Recall that the kernel trick lets us write the prediction on a new point x as a weighted sum of the kernel evaluated at the support vectors and x, plus an intercept; in libsvm's convention this is f(x) = sum_i coefs_i * K(SV_i, x) - rho.
In other words: how do I combine the following three terms
supportv <- svm$SV
coefs <- svm$coefs
intercept <- svm$rho
to get the prediction associated with the corresponding SVM?
If this is not possible, or too complicated, I would also switch to a different package.
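A minimal sketch of such a function (it assumes scale = FALSE as in the fit above, an RBF kernel with gamma = 1, and libsvm's sign convention; verify it against the decision values predict() itself reports):
rbf <- function(u, v, gamma = 1) exp(-gamma * sum((u - v)^2))
manual_decision <- function(model, X, gamma = 1) {
  # kernel matrix between the new points (rows of X) and the support vectors
  K <- apply(model$SV, 1, function(sv) apply(X, 1, rbf, v = sv, gamma = gamma))
  as.vector(K %*% model$coefs) - model$rho
}
dec_manual <- manual_decision(svm, data)
dec_pkg <- attr(predict(svm, data_df, decision.values = TRUE), "decision.values")
all.equal(dec_manual, as.vector(dec_pkg)) # should be TRUE (possibly up to sign)
## the predicted class is then sign(dec_manual), mapped onto the two factor levels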

Similar to the Cox proportional hazards model, can we get survival curves and hazard ratios using survivalsvm?

I am a beginner trying to do survival analysis using machine learning on the lung cancer dataset. I know how to do survival analysis using the Cox proportional hazards model, which provides hazard ratios, i.e. the exponentials of the regression coefficients. I wonder if we can do the same thing using machine learning. As a beginner, I am trying survivalsvm from the R language. Please see the link for this. I am using the built-in cancer data for the analysis. Following is the R code, given at this link.
library(survival)
library(survivalsvm)
set.seed(123)
n <- nrow(veteran)
train.index <- sample(1:n, 0.7 * n, replace = FALSE)
test.index <- setdiff(1:n, train.index)
survsvm.reg <- survivalsvm(Surv(diagtime, status) ~ .,
                           subset = train.index, data = veteran,
                           type = "regression", gamma.mu = 1,
                           opt.meth = "quadprog", kernel = "add_kernel")
print(survsvm.reg)
pred.survsvm.reg <- predict(object = survsvm.reg,
                            newdata = veteran, subset = test.index)
print(pred.survsvm.reg)
Can anyone help me get the hazard ratios or a survival curve for this dataset? Also, how do I interpret the output of this function?
This question is kind of old now, but I'm going to answer anyway because this is a difficult problem and I struggled with {survivalsvm} when I first used it.
Depending on the type argument you get different outputs. In your case type = "regression" means you are fitting Shivaswamy's SVCR (I hope I spelled that correctly), which predicts the time until an event takes place, so these are survival time predictions.
To convert this to a survival curve you have to make an assumption about the shape of the survival distribution. For example, say you think the survival time is normally distributed, N(mu, sigma). Then you can use your predicted survival time as mu and either estimate or assume a value for sigma.
Below is an example using your code and my {distr6} package, which enables quick computation of many distributions and printing and plotting of functions:
library(survival)
library(survivalsvm)
set.seed(123)
n <- nrow(veteran)
train.index <- sample(1:n, 0.7 * n, replace = FALSE)
test.index <- setdiff(1:n, train.index)
survsvm.reg <- survivalsvm(Surv(diagtime, status) ~ .,
                           subset = train.index, data = veteran,
                           type = "regression", gamma.mu = 1,
                           opt.meth = "quadprog", kernel = "add_kernel")
print(survsvm.reg)
pred.survsvm.reg <- predict(object = survsvm.reg,
                            newdata = veteran, subset = test.index)
# load distr6
library(distr6)
# create a vector of normal distributions, each with mean equal to the
# predicted survival time and with variance 1;
# `decorators = "ExoticStatistics"` adds the survival function
v <- VectorDistribution$new(
  distribution = "Normal",
  params = data.frame(mean = as.numeric(pred.survsvm.reg$predicted)),
  shared_params = list(var = 1),
  decorators = "ExoticStatistics"
)
# survival function evaluated at times 1:10
v$survival(1:10)
# plot the survival function for the first individual
plot(v[1], fun = "survival")
# plot the hazard function for the first individual
plot(v[1], fun = "hazard")

Cannot generate predictions in mgcv when using discretization (discrete=T)

I am fitting a generalized additive model with a random site-level effect, implemented in the mgcv package for R. I had been doing this with gam(); however, to speed things up I need to shift to the bam() framework, which is basically the same as gam() but faster. I further sped up fitting by passing the options bam(nthreads = N, discrete = T), where N is the number of cores on my machine. However, when I use the discretization option and then try to make predictions with my model on new data, while ignoring the random effect, I consistently get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
# generate data
N <- 10000
x <- runif(N, 0, 1)
y <- (0.5 * x / (x + 0.2)) + rnorm(N) * 0.1 # non-linear relationship between x and y
# uninformative random effect
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
# fit models
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) # this one takes ~1 minute to fit; the rest are faster
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
# make predictions on new data
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata = newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata = newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata = newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model which uses discretization throws this error (which the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.guaranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of you not specifying random.x in the newdata, so mgcv goes looking for random.x and finds it in the global environment. You should really gather those variables into a data frame and use the data argument when fitting your models, and try not to leave similarly named objects lying around in your global environment.
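For example, a minimal sketch of that tidier pattern, keeping everything in one data frame and using the data argument (the discrete = TRUE prediction issue itself still needs the random.x workaround shown above):
dat <- data.frame(y = y, x = x, random.x = random.x)
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), data = dat, discrete = TRUE, nthreads = 2)
newdat <- data.frame(x = runif(200, 0, 1), random.x = dat$random.x[1])
test3 <- predict(fit3, newdata = newdat, exclude = c("s(random.x)"), newdata.guaranteed = TRUE)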

Calculate Prediction Intervals of a predicted value using Caret package of R

I used different neural network packages within the caret package for my predictions. The code used with the nnet package is:
library(caret)
# train a model using the nnet method
data <- na.omit(data)
xtrain <- data[, c("temperature", "prevday1", "prevday2", "prev_instant1", "prev_instant2", "prev_2_hour")]
ytrain <- data$power
train_model <- train(x = xtrain, y = ytrain, method = "nnet", linout = TRUE, na.action = na.exclude, trace = FALSE)
# prediction using the trained model
pred_ob <- predict(train_model, newdata = dframe, type = "raw")
The predict function only returns point predictions, but I also need (2-sigma) prediction intervals as well. On searching, I found a relevant answer at stackoverflow link, but it does not produce what I need. The solution suggests using the finalModel variable as
predict(train_model$finalModel, newdata = dframe, interval = "confidence", type = "raw")
Is there any other way to calculate prediction intervals? The training data used is the dput() of my previous question at stackoverflow link, and the dput() of my prediction data frame (test data) is
dframe <- structure(list(temperature = 27, prevday1 = 1607.69296666667,
                         prevday2 = 1766.18103333333, prev_instant1 = 1717.19306666667,
                         prev_instant2 = 1577.168915, prev_2_hour = 1370.14983583333),
                    .Names = c("temperature", "prevday1", "prevday2",
                               "prev_instant1", "prev_instant2", "prev_2_hour"),
                    class = "data.frame", row.names = c(NA, -1L))
UPDATE
I used the nnetpredint package as suggested at link. To my surprise it results in an error, which I find difficult to debug. Here is my updated code so far:
library(nnetpredint)
nnetPredInt(train_model, xTrain = xtrain, yTrain = ytrain, newData = dframe)
It results in the following error:
Error: Number of observations for xTrain, yTrain, yFit are not the same
[1] 0
I can check that xtrain, ytrain and dframe have the correct dimensions, but I have no idea about yFit. I don't need it according to the examples in the nnetpredint vignette.
caret doesn't generate prediction intervals; that capability relies on the individual package. If a package cannot do this, then neither can the train objects. I agree that nnetPredInt is the appropriate way to go.
Two other notes:
you most likely should center and scale your data if you have not already (see the sketch after this answer).
using the finalModel object is somewhat dangerous since it has no idea what was done to the data (e.g. dummy variables, centering and scaling, or other preprocessing methods) before it was created.
Max
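(A hedged illustration of the centering/scaling note above, using caret's own preProcess hook so that predict() reapplies the same transformation to new data:)
train_model <- train(x = xtrain, y = ytrain, method = "nnet",
                     preProcess = c("center", "scale"),
                     linout = TRUE, trace = FALSE)
pred_ob <- predict(train_model, newdata = dframe) # preprocessing is reapplied here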
Thanks for your question. A simple answer to your problem is: right now the nnetPredInt function only supports the following S3 objects: "nnet", "nn" and "rsnns", produced by different neural network packages, while the train function in the caret package returns a "train" object. That's why nnetPredInt cannot get the yFit vector (the fitted values of the training dataset) from your train_model.
1. Quick way to use the model from the caret package:
Get the finalModel result from the 'train' object:
nnetObj = train_model$finalModel # the 'nnet' model which the caret package has found
yPredInt = nnetPredInt(nnetObj, xTrain = xtrain, yTrain = ytrain, newData = dframe)
For example, use the iris dataset and the 'nnet' method from the caret package for regression prediction.
library(caret)
library(nnetpredint)
# Setosa = 0 and Versicolor = 1
ird <- data.frame(rbind(iris3[,,1], iris3[,,2]), species = c(rep(0, 50), rep(1, 50)))
samp = sample(1:100, 80)
xtrain = ird[samp,][1:4]
ytrain = ird[samp,]$species
# Training
train_model <- train(x = xtrain, y = ytrain, method = "nnet", linout = FALSE, na.action = na.exclude, trace = FALSE)
class(train_model) # [1] "train"
nnetObj = train_model$finalModel
class(nnetObj) # [1] "nnet.formula" "nnet"
# Constructing prediction intervals
xtest = ird[-samp,][1:4]
ytest = ird[-samp,]$species
yPredInt = nnetPredInt(nnetObj, xTrain = xtrain, yTrain = ytrain, newData = xtest)
# Compare results: ytest and yPredInt
ytest
yPredInt
2. The Hard Way
Use the generic nnetPredInt function and pass all the neural-net-specific parameters to it:
nnetPredInt(object = NULL, xTrain, yTrain, yFit, node, wts, newData, alpha = 0.05, lambda = 0.5, funName = 'sigmoid', ...)
xTrain  # training dataset
yTrain  # training target values
yFit    # fitted values of the training data
node    # structure of your network, like c(4, 5, 5, 1)
wts     # weight parameters found by your neural network, in the specific order expected
newData # new data for prediction
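For instance, a rough sketch of where these pieces live on a plain nnet fit (the slot names are nnet's; whether the generic accepts a NULL object this way is an assumption, so prefer the quick way above when you can):
fit <- nnet(xtrain, ytrain, size = 3, trace = FALSE) # sigmoid output, as required
nnetPredInt(xTrain = xtrain, yTrain = ytrain,
            yFit = fit$fitted.values, # fitted values of the training data
            node = fit$n,             # c(n_input, n_hidden, n_output)
            wts = fit$wts,            # weight vector in nnet's order
            newData = xtest)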
Tips:
Right now the nnetpredint package only supports standard multilayer neural network regression with an activated (sigmoid) output, not a linear output.
It will support more types of models in the future.
You can use the nnetPredInt function {package:nnetpredint}. Check out the function's help page here
If you are open to writing your own implementation, there is another option: you can get prediction intervals from a trained net using the same approach you would write for standard non-linear regression (assuming back-propagation was used for the estimation).
This paper goes through the methodology and is fairly straightforward: http://www.cis.upenn.edu/~ungar/Datamining/Publications/yale.pdf
There are, as with everything, some cons to this approach (outlined in the paper), but it is definitely worth knowing as an option.
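For what it's worth, here is a rough sketch of that delta-method recipe on a toy {nnet} fit. This is my reading of the paper, not its code: the toy x/y data are made up, and the numeric Jacobian trick relies on predict.nnet reading fit$wts.
library(nnet)
set.seed(1)
x <- matrix(runif(200), ncol = 2)       # toy inputs (100 x 2)
y <- sin(3 * x[, 1]) + 0.1 * rnorm(100) # toy target
fit <- nnet(x, y, size = 3, linout = TRUE, trace = FALSE)
# numeric Jacobian of the fitted values with respect to the weights
jacobian <- function(fit, X, h = 1e-6) {
  w0 <- fit$wts
  f0 <- as.vector(predict(fit, X))
  J <- matrix(0, length(f0), length(w0))
  for (j in seq_along(w0)) {
    fit$wts[j] <- w0[j] + h
    J[, j] <- (as.vector(predict(fit, X)) - f0) / h
    fit$wts[j] <- w0[j]
  }
  J
}
J <- jacobian(fit, x)
res <- y - as.vector(predict(fit, x))
s2 <- sum(res^2) / (length(y) - length(fit$wts))   # residual variance estimate
JtJinv <- solve(t(J) %*% J + diag(1e-8, ncol(J)))  # small ridge for stability
se <- sqrt(s2 * (1 + rowSums((J %*% JtJinv) * J))) # per-point prediction s.e.
pred <- as.vector(predict(fit, x))
head(cbind(lower = pred - 2 * se, fit = pred, upper = pred + 2 * se))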

R - Top x Important Variables for Each Individual Sample in Classification

I'm building a churn model using the C5.0 algorithm in R. After finishing the model and successfully predicting the data, how do I know the top 3 important predictors for each customer that will churn? So that I know why the model classifies, for example, customers A, B, D and F as positive and the others as negative. Is it possible?
Thanks.
Many models have built-in approaches for measuring the aggregate effect of the predictors on the model. The caret package contains a general class for calculating or returning these values, including C5.0, JRip, PART, RRF, RandomForest, bagEarth, classbagg, cubist, dsa, earth, fda, gam, gbm, glm, glmnet, lm, multinom, mvr, nnet, pamrtrained, plsda, randomForest, regbagg, rfe, rpart, sbf, and train.
For example,
> library(caret)
> set.seed(1401)
> ctrl <- trainControl(method = 'repeatedcv', number = 6, repeats = 2, classProbs = TRUE)
> C5fit <- train(x = iris[, 1:4], y = iris$Species, method = "C5.0", metric = "ROC", trControl = ctrl)
> varImp(C5fit, scale = FALSE)
C5.0 variable importance
             Overall
Petal.Width      100
Sepal.Width        0
Petal.Length       0
Sepal.Length       0
You can plot the trees within the model. If you use a single C5.0 tree, this gives you an easy way to provide the exact reasoning of the tree.
library(C50)
set.seed(1401)
C5tree <- C5.0(x = iris[, 1:4], y = iris$Species, trials = 1) # A single C50 tree
C5imp(C5tree)
plot(C5tree, trial = 0)
If you use boosting (i.e. trials > 1 when you train the trees), then this approach is likely too complicated due to the number of trees.
C5boosted <- C5.0(x = iris[, 1:4], y = iris$Species, trials = 3) # Boost three trees
C5imp(C5boosted)
# Plot each of the trees
for (i in 0:2) { # trials starts counting at 0, see ?plot.C5.0
  plot(C5boosted, trial = i)
}
Instead, you can rely on the variable importance for a general report of important variables, or use partial dependence plots that show the (non-linear) effect of one variable relative to all other variables. I suggest having a look at the pdp package on CRAN; a minimal sketch follows.
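For instance (a minimal sketch; it assumes pdp's support for C5.0 classification fits and defaults to the first class):
library(pdp)
pd <- partial(C5tree, pred.var = "Petal.Width", prob = TRUE, train = iris[, 1:4])
plotPartial(pd)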
