I am trying to understand the way the e1071 package obtains its SVM predictions in a two-class classification framework. Consider the following toy example.
library(mvtnorm)
library(e1071)
n <- 50
### Gaussians
eps <- 0.05
data1 <- as.data.frame(rmvnorm(n, mean = c(0,0), sigma=diag(rep(eps,2))))
data2 <- as.data.frame(rmvnorm(n, mean = c(1,1), sigma=diag(rep(eps,2))))
### Train Model
data_df <- as.data.frame(rbind(data1, data2))
data <- as.matrix(data_df)
data_df$y <- as.factor(c(rep(-1,n), rep(1,n)))
svm <- svm(y ~ ., data = data_df, kernel = "radial", gamma=1, type = "C-classification", scale = FALSE)
Having trained the SVM, I would like to write a function that uses the coefficients and the intercept to predict on a new data point.
Recall that, thanks to the kernel trick, the prediction for a new point can be written as a weighted sum of the kernel evaluated between each support vector and the new point, plus an intercept.
In other words: how do I combine the following three terms
supportv <- svm$SV
coefs <- svm$coefs
intercept <- svm$rho
to get the prediction associated with the corresponding SVM?
If this is not possible, or too complicated, I would also be willing to switch to a different package.
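For what it's worth, here is a minimal sketch of the kind of function I have in mind (my own attempt, using the radial kernel K(u, v) = exp(-gamma * ||u - v||^2) with gamma = 1 as in the fit above); the sign convention of the decision value should be checked against predict(), since it depends on the factor level ordering:
# Manual decision function for the fitted e1071 object: svm$coefs holds the
# weights for each support vector, svm$SV the support vectors themselves, and
# svm$rho the constant that is subtracted in the decision function.
manual_decision <- function(model, newx, gamma = 1) {
  diffs <- sweep(model$SV, 2, newx)            # subtract the new point from every support vector
  k     <- exp(-gamma * rowSums(diffs^2))      # RBF kernel evaluations K(SV_i, x)
  sum(as.vector(model$coefs) * k) - model$rho  # weighted sum minus the offset
}
# Compare with the package's own output; a positive decision value corresponds to
# one class and a negative value to the other (check which one against predict()).
manual_decision(svm, as.numeric(data_df[1, c("V1", "V2")]))
predict(svm, data_df[1, ], decision.values = TRUE)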
I am working with the wine quality dataset.
I am fitting regression trees that depend on different variables, as follows:
library(rpart)
library(rpart.plot)
library(rattle)
library(naniar)
library(dplyr)
library(ggplot2)
vinos <- read.csv(file = 'Wine.csv', header = T)
arbol0<-rpart(formula=quality~chlorides, data=vinos, method="anova")
fancyRpartPlot(arbol0)
arbol1<-rpart(formula=quality~chlorides+density, data=vinos, method="anova")
fancyRpartPlot(arbol1)
I want to calculate the mean squared error (MSE) to see whether arbol1 is better than arbol0. I will use my own dataset, since no more data is available. I have tried to do it as
aaa<-predict(object=arbol0, newdata=data.frame(chlorides=vinos$chlorides), type="anova")
bbb<-predict(object=arbol1, newdata=data.frame(chlorides=vinos$chlorides, density=vinos$density), type="anova")
and then manually subtract the last column of the data frame from aaa and bbb. However, I am getting an error. Can someone please help me?
This website could be useful for you. It's very important to split your dataset into train and test subsets before training your models. In the following code I've done it with base functions, but the sample.split function from the caTools package does the same thing. I'm also attaching this website, where you can see all the ways to split data in R.
Remember that the Mean Squared Error (MSE) is defined as:
MSE = (1/n) * Σ (y_i - ŷ_i)²
So it's very simple to compute in R: you just take the mean of the squared differences between the observed values (i.e., the response variable from your test subset) and the predicted values (i.e., the values you obtained from the model with the predict function).
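For instance, with two small placeholder vectors (values made up just for illustration):
observed  <- c(5, 6, 5, 7)          # hypothetical test-set responses
predicted <- c(5.2, 5.8, 5.5, 6.4)  # hypothetical model predictions
mean((observed - predicted)^2)      # the MSE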
A solution for your wine dataset could be this one, based on the previous website.
library(rpart)
library(dplyr)
library(data.table)
vinos <- fread(file = 'Winequality-red.csv', header = TRUE)
# Split data into train and test subsets
sample_index <- sample(nrow(vinos), size = nrow(vinos)*0.75)
train <- vinos[sample_index, ]
test <- vinos[-sample_index, ]
# Train regression trees models
arbol0 <- rpart(formula = quality ~ chlorides, data = train, method = "anova")
arbol1 <- rpart(formula = quality ~ chlorides + density, data = train, method = "anova")
# Make predictions for each model
pred0 <- predict(arbol0, newdata = test)
pred1 <- predict(arbol1, newdata = test)
# Calculate MSE for each model
mean((pred0 - test$quality)^2)
mean((pred1 - test$quality)^2)
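As a small follow-up (not strictly necessary), the root mean squared error is often easier to interpret because it is on the same scale as quality:
# RMSE for each model, on the original quality scale
rmse0 <- sqrt(mean((pred0 - test$quality)^2))
rmse1 <- sqrt(mean((pred1 - test$quality)^2))
c(arbol0 = rmse0, arbol1 = rmse1)  # the model with the smaller value fits the test set better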
I am attempting to predict tree species composition using Sentinel 2A imagery and forest plot data. From the forest plot data I have calculated the proportion of basal area (the cross-sectional area that trees of a given species occupy within a plot divided by the total cross-sectional area that all trees occupy within that plot), and I want to predict the proportion of basal area by species across the landscape as a raster, using the intensity values from Sentinel 2A. Dirichlet regression seems appropriate here, because I have more than two categories of proportional data bounded on (0,1), where the proportions for each observational unit must sum to 1. Therefore, I want to model the joint composition of proportional basal area as a function of spectral reflectance intensity using Dirichlet regression, with k-fold cross-validation using 10 folds and 5 repeats. This seems like a modeling exercise perfectly suited to caret::train() with a custom routine.
I followed the documentation for how to build a custom routine for DirichletReg::DirichReg(), but I am a bit stumped on how to feed a multivariate response variable to caret::train(). The DirichletReg::DirichReg() function requires that the response variable be formatted with DirichletReg::DR_data() before modeling. However, when I feed a DR_data object to caret::train(), I get this error:
Error: wrong model type for classification
I think this error is thrown because I have passed a multivariate response variable to the y = argument of caret::train(), and the routine assumes that this can only happen with classification algorithms. Does anyone have experience doing multivariate regression? Is there another way to change my custom routine so that caret::train() accepts multivariate response variables?
Below is my reproducible example:
##Loading Necessary Packages##
library(caret)
library(DirichletReg)
##Creating Fake Data##
set.seed(88)#For reproducibility
#Response variables#
PSME_BA<-rnorm(25,50, 15)
TSHE_BA<-rnorm(25,40,12)
ALRU2_BA<-rnorm(25,20,0.5)
Total_BA<-PSME_BA+TSHE_BA+ALRU2_BA
#Predictor variables#
B1<-runif(25, 0, 2000)
B2<-runif(25, 0, 1800)
B3<-runif(25, 0, 3000)
#Dataset for modeling#
df <- data.frame(PSME = PSME_BA/Total_BA, TSHE = TSHE_BA/Total_BA, ALRU2 = ALRU2_BA/Total_BA,
                 B1 = B1, B2 = B2, B3 = B3)
##Creating a Dirichlet regression modeling routine to feed to caret::train()##
#Method list#
Dreg <- list(type = 'Regression',
             library = 'DirichletReg',
             loop = NULL)
#Parameters element#
prm <- data.frame(parameter = c("model"),
                  class = "character")
Dreg$parameters <- prm
#Grid element#
DregGrid <- function(x, y, len = NULL, search = "grid") {
  if (search == "grid") {
    # force the strings to character, otherwise I get an error that the model
    # argument was expecting 'chr' when fitting
    out <- expand.grid(model = c("common", "alternative"), stringsAsFactors = FALSE)
  }
  out
}
Dreg$grid <- DregGrid
#Fit element#
DregFit <- function(x, y, param, ...) {
  dat <- if (is.data.frame(x)) x else as.data.frame(x)
  dat$.outcome <- y
  theDots <- list(...)
  modelArgs <- c(list(formula = as.formula(".outcome ~ ."), data = dat,
                      link = param$model, type = param$type), theDots)
  out <- do.call(DirichletReg::DirichReg, modelArgs)
  out$call <- NULL
  out
}
Dreg$fit <- DregFit
#Predict element#
DregPred <- function(modelFit, newdata, preProc = NULL, submodels = NULL) {
  if (!is.data.frame(newdata)) newdata <- as.data.frame(newdata)
  DirichletReg::predict.DirichletRegModel(modelFit, newdata)
}
Dreg$predict <- DregPred
#prob element#
DregProb <- function() {
  return(NULL)
}
Dreg$prob <- DregProb
##Modeling the data using Dirichlet regression with repeated k-folds cross validation##
trCrtl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
Y <- DR_data(df[, c(1:3)])  # Converting the response variables to DR_data format
mod <- train(x = df[, -c(1:3)], y = Y, method = Dreg, trControl = trCrtl)  # Throws error
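As a sanity check (added for reference), the model can be fitted directly with DirichletReg outside of caret, which suggests the problem lies in how caret::train() handles a multivariate response rather than in the model specification itself:
# Direct fit without caret: store the DR_data response as a column and model it
df$Y <- DR_data(df[, 1:3])                # multivariate response in DR_data format
direct_fit <- DirichReg(Y ~ B1 + B2 + B3, data = df, model = "common")
summary(direct_fit)
head(predict(direct_fit, newdata = df))   # predicted proportions; each row sums to 1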
I am a beginner trying to do survival analysis using machine learning on the lung cancer dataset. I know how to do survival analysis using the Cox proportional hazards model, which gives us hazard ratios, i.e. the exponentials of the regression coefficients. I wonder if we can do the same thing using machine learning. As a beginner, I am trying survivalsvm from R; please see the link for this. I am using the built-in cancer data for the survival analysis. Following is the R code given at this link.
library(survival)
library(survivalsvm)
set.seed(123)
n <- nrow(veteran)
train.index <- sample(1:n, 0.7 * n, replace = FALSE)
test.index <- setdiff(1:n, train.index)
survsvm.reg <- survivalsvm(Surv(diagtime, status) ~ .,
                           subset = train.index, data = veteran,
                           type = "regression", gamma.mu = 1,
                           opt.meth = "quadprog", kernel = "add_kernel")
print(survsvm.reg)
pred.survsvm.reg <- predict(object = survsvm.reg,
                            newdata = veteran, subset = test.index)
print(pred.survsvm.reg)
Can anyone help me get the hazard ratios or survival curves for this dataset? Also, how should I interpret the output of this function?
This question is kind of old now, but I'm going to answer anyway because this is a difficult problem and I struggled with {survivalsvm} when I first used it.
Depending on the type argument you get different outputs. In your case type = "regression" means you are fitting Shivaswamy's (I hope I spelled that correctly) SVCR, which predicts the time until an event takes place, so these are survival time predictions.
To convert this to a survival curve you have to make some assumptions about the shape of the survival distribution. For example, say you think the survival time is normally distributed, N(mu, sigma). Then you can use your predicted survival time as mu and either predict or make an assumption about sigma.
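To make that concrete before the {distr6} code, for a single individual the same idea can be written in base R (a quick sketch; the predicted time below is just a placeholder value):
mu <- 100                                 # placeholder: the predicted survival time for one individual
S  <- 1 - pnorm(1:10, mean = mu, sd = 1)  # survival function S(t) = P(T > t) at times 1..10
S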
Below is an example using your code and my {distr6} package, which enables quick computation of many distributions and printing and plotting of functions:
library(survival)
library(survivalsvm)
set.seed(123)
n <- nrow(veteran)
train.index <- sample(1:n, 0.7 * n, replace = FALSE)
test.index <- setdiff(1:n, train.index)
survsvm.reg <- survivalsvm(Surv(diagtime, status) ~ .,
                           subset = train.index, data = veteran,
                           type = "regression", gamma.mu = 1,
                           opt.meth = "quadprog", kernel = "add_kernel")
print(survsvm.reg)
pred.survsvm.reg <- predict(object = survsvm.reg,
                            newdata = veteran, subset = test.index)
# load distr6
library(distr6)
# create a vector of normal distributions each with
# mean as the predicted time and with variance 1
# `decorators = "ExoticStatistics"` adds survival function
v = VectorDistribution$new(distribution = "Normal",
                           params = data.frame(mean = as.numeric(pred.survsvm.reg$predicted)),
                           shared_params = list(var = 1),
                           decorators = "ExoticStatistics")
# survival function evaluated at times = 1:10
v$survival(1:10)
# plot survival function for first individual
plot(v[1], fun = "survival")
# plot hazard function for first individual
plot(v[1], fun = "hazard")
Under certain circumstances, predictions from e1071 svm() models differ depending on the setting of the probability input argument. This code example:
rm(list = ls())
library(e1071)
data(iris)
## Training and testing subsets
set.seed(73) # For reproducibility
ri = sample(seq(1, nrow(iris)), round(nrow(iris)*0.8))
train = iris[ri, ]
test = iris[-ri,]
## Models and predictions with probability setting F or T
set.seed(42) # Just to exclude that randomness in algorithm itself is the cause
m1 <- svm(Species ~ ., data = train, probability = F)
pred1 = predict(m1, newdata = test, probability = F)
set.seed(42) # Just to exclude that randomness in algorithm itself is the cause
m2 <- svm(Species ~ ., data = train, probability = T)
pred2 = predict(m2, newdata = test, probability = T)
## Accuracy
acc1 = sum(test$Species == pred1)/nrow(iris)
acc2 = sum(test$Species == pred2)/nrow(iris)
will give
acc1 = 0.18666...
acc2 = 0.19333...
My conclusion is that svm() performs calculations differently based on the setting of the probability parameter.
Is that correct?
If so, why and how does it differ?
I haven't seen anything about this in the docs for the package or function.
The reason I bother with this is that I have found the performance of the classification to be not only different, but consistently slightly worse when probability = T in a project where I do classification based on ~800 observations of ~250 gene abundances (bioinformatics stuff). The code from that project contains data cleaning and uses cross-validation, making it a bit bulky to include here, so you'll have to take my word for it.
Any ideas folks?
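For reference, here is how I check which test rows the two settings actually label differently, rather than only comparing aggregate accuracy:
table(pred1 == pred2)          # how many test predictions agree between the two models
test[which(pred1 != pred2), ]  # the rows where the two settings disagree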
I'm using a Support Vector Machine (SVM, package e1071) in R to build a classification model and predict a 7-level factor class out of sample.
The problem is that, when using the predict function, I obtain an array much larger than the number of rows in the validation set. See the code and results below.
Any suggestions about what goes wrong? Am I misinterpreting the predict function in the SVM package?
install.packages(c("e1071", "caret"))
library(e1071)
library(caret)
data <- data.frame(replicate(10,sample(0:6,1000,rep=TRUE)))
trainIndex <- createDataPartition(data[, 1], p = 0.8,
                                  list = FALSE,
                                  times = 1)
trainset <- data[trainIndex,2:10]
validationset <- data[-trainIndex,2:10]
trainlabel <- data[trainIndex,1]
validationlabel <- data[-trainIndex,1]
svmModel <- svm(x = trainset,
                y = trainlabel,
                type = "C-classification",
                kernel = "radial")
# Predict
svmPred <- predict(svmModel, x = validationset)
length(svmPred)
# 800, expected 200 since validationset has nrow = 200.
That's because predict() has no x argument: since newdata was never supplied, predict() returned the fitted values for the 800 training rows.
Try:
svmPred <- predict(svmModel, validationset)
length(svmPred)
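With that change, a quick check (not strictly needed) confirms one prediction per validation row and lets you compare against the held-out labels:
length(svmPred) == nrow(validationset)                # should be TRUE
table(predicted = svmPred, actual = validationlabel)  # simple confusion table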