svm {e1071} predict creates larger array of predicted values than expected - r

I'm using Support Vector Machine (SVM, package e1071) within R to build a classification model and out-of-sample predicting a 7-factor class.
The problem is, when using the predict function, I obtain a array, much larger than the number of rows in the validation set. See code and results below.
Any suggestions about what goes wrong? Do I miss-interpret the predict function in the SVM package?
install.packages("e1071","caret")
library(e1071)
library(caret)
data <- data.frame(replicate(10,sample(0:6,1000,rep=TRUE)))
trainIndex <- createDataPartition(data[,1], p = 0.8,
list = FALSE,
times = 1)
trainset <- data[trainIndex,2:10]
validationset <- data[-trainIndex,2:10]
trainlabel <- data[trainIndex,1]
validationlabel <- data[-trainIndex,1]
svmModel <- svm(x = trainset,
y = trainlabel,
type = "C-classification",
kernel = "radial")
# Predict
svmPred <- predict(svmModel, x = validationset)
length(svmPred)
# 800, expected 200 since validationset has nrow = 200.

It's because x doesn't exist in predict
try :
svmPred <- predict(svmModel, validationset)
length(svmPred)

Related

How does the kernel SVM in e1071 predict?

I am trying to understand the way the e1071 package obtains its SVM predictions in a two-class classification framework. Consider the following toy example.
library(mvtnorm)
library(e1071)
n <- 50
### Gaussians
eps <- 0.05
data1 <- as.data.frame(rmvnorm(n, mean = c(0,0), sigma=diag(rep(eps,2))))
data2 <- as.data.frame(rmvnorm(n, mean = c(1,1), sigma=diag(rep(eps,2))))
### Train Model
data_df <- as.data.frame(rbind(data1, data2))
data <- as.matrix(data_df)
data_df$y <- as.factor(c(rep(-1,n), rep(1,n)))
svm <- svm(y ~ ., data = data_df, kernel = "radial", gamma=1, type = "C-classification", scale = FALSE)
Having trained the SVM, I would like to write a function that uses the coefficients and the intercept to predict on a new data point.
Recall that the kernel trick guarantees that we can write the prediction on a new point as the weighted sum of the kernel evaluated at the support vectors and the new point itself (plus some intercept).
In other words: How to combine the following three terms
supportv <- svm$SV
coefs <- svm$coefs
intercept <- svm$rho
to get the prediction associated with the corresponding SVM?
If this is not possible, or too complicated, I would also switch to a different package.

Random Forest prediction error in R "No forest component in the object"

I am attempting to use a random forest regressor to classify a raster stack, but an error does not allow a prediction of "area_pct", have I not trained the model properly?
d100 is my dataset with predictor variables d100[,4:ncol(d100)] and prediction variable d100["area_pct"].
#change na values to zero
d100[is.na(d100)] <- 0
set.seed(100)
#split dataset into training (70%) and testing (30%)
id<- sample(2,nrow(d100), replace = TRUE, prob = c(0.7,0.3))
train_100<- d100[id==1,]
test_100 <- d100[id==2,]
train random forest model with randomForest package, this appears to work fine
final_CC_rf_20 = randomForest(x=train[,4:ncol(train)], y= train$area_pct,
xtest=test[,4:ncol(test)], ytest=test$area_pct, mtry=14, importance=TRUE, ntree = 600)
Then I try to predict a raster.
New raster stack with predictor variables
sentinel_2_20 <- stack( paste(getwd(), "Sentinel_SR_clip_20.tif", sep="/") )
area_classified_20_2018 <- predict(object = final_CC_rf_20 , newdata = sentinel_2_20,type = 'response', progress = 'window')
but error pops up:
#Error in predict.randomForest(object = final_CC_rf_20, newdata = sentinel_2_20, :
# No forest component in the object
any help would be extremely useful
The arguments you are using for predict (with raster data) are not correct. The first argument, object, should be the raster data, the second argument, model, should be the fitted model. There is no argument newdata.
Another problem is that you use keep.forest=FALSE which is the default when xtest is not NULL. You could set keep.forest=TRUE but that is not a good approach, generally, as you should fit your model with all data before you make a prediction (you are no longer evaluating your model). Thus, I would suggest fitting your model without xtest, like this
rfmod <- randomForest(x=d100[,4:ncol(train)], y=d100$area_pct,
mtry=14, importance=TRUE, ntree = 600)
And then do
p <- predict(sentinel_2_20, rfmod, type='response')
See ?raster::predict or ?terra::predict for working examples

SVM in R e1071 package yields different outputs for different formula notation but same data and parameters

For the purpose of doing Twitter sentiment analysis I am using the SVM function from e1071 package.
I used the RTextTools package to create a document term matrix which I split into a training and a test set. I have "train" which is a data frame of training observations excluding the predicted variable. Then I have "sentitrain", which is a vector of sentiment values corresponding to the training set. Same for testing.
Then I used three different ways for fitting the a SVM model.
Firstly, I created a container
trainmat = as.matrix(train)
# create container object
traincontainer = create_container(trainmat,
sentitrain,
trainSize = 1:nrow(trainmat),
virgin = FALSE)
# create test matrix
testmat = as.matrix(test)
testcontainer = create_container(testmat, labels = rep(0, nrow(test)),
testSize = 1:nrow(test), virgin = FALSE)
model <- train_model(traincontainer , "SVM", kernel="radial", cost=400)
results = classify_model(testcontainer, model)
preds = results[,1]
confusionMatrix(table(preds, sentitest))
This gave me approximately 76% classification accuracy.
In the second method I simply took the column names of my training matrix and then created a formula:
n = names(train)
# exclude the predicted variable
n = setdiff(n, c("sentiment"))
predictors = paste(n, collapse = " + ")
# create formula
f = as.formula(paste("sentiment ~ ", predictors))
model = svm(f, data = train, cost = 400, kernel = "radial")
preds = predict(model, test)
confusionMatrix(table(preds, sentitest))
This gave me around 69% accuracy.
Thirdly I just passed the data frames and the vectors of predicted values directly to the function call:
model = svm(train, sentitrain, data = train, cost = 400, kernel =
"radial")
preds = predict(model, test)
confusionMatrix(table(preds, sentitest))
This resulted in an astonishing 87% accuracy.
I ran each model several times with cross validation to make sure these differences were not due to randomness.
As I understand it I always used the same function from the same package (RTextTools calls the SVM function from e1071 internally) with the same parameters on the same data. The only difference is the way I passed those parameters. How can the results be so different?

How do I use predict() on new data for lme4::glmer model?

I have been trying to establish predictive performance (AUC ROC) for a glmer model. When I try and use the predict() function on a test data set, the output for this function is the length of my train data set.
folds = 10;
glmerperf=rep(0,folds); glmperf=glmerperf;
TB_Train.glmer.subset <- TB_Train.glmer %>% select(one_of(subset.vars), IDNO)
TB_Train.glmer.fs <- TB_Train.glmer.subset[,c(1:7, 22)]
TB_Train.glmer.ns <- TB_Train.glmer.subset[, 8:21]
TB_Train.glmer.cns <- TB_Train.glmer.ns %>% scale(center=TRUE, scale=TRUE) %>% cbind(TB_Train.glmer.fs)
foldsamples = caret::createFolds(TB_Train.glmer.cns$Case.Status, k = folds, list = TRUE, returnTrain = FALSE)
for (n in 1:folds)
{
testdata = TB_Train.glmer.cns[foldsamples[[n]],]
traindata = TB_Train.glmer.cns[-foldsamples[[n]],]
GLMER <- lme4::glmer(Case.Status ~ . + (1 | IDNO), data = traindata, family="binomial", control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=1000000)))
glmer.probs <- predict(GLMER, newdata=testdata$Non.TB.Case, type="response")
glmer.ROC <- roc(predictor=glmer.probs, response=testdata$Case.Status, levels=rev(levels(testdata$Case.Status)))
glmerperf[n] <- glmer.ROC$auc
}
prob <- predict(GLMER, newdata=TB_Test.glmer$Non.TB.Case, type="response", re.form=~(1|IDNO))
print(sprintf('Mean AUC ROC of model on test set for GLMER %f', mean(glmerperf)))
Both the prob and glmer.probs objects are the length of the traindata object, despite specifying the newdata argument. I have noticed issues with the predict function in the past, but none as specific as this one.
Also, when the model is run, I get several errors about needing to scale my data (which I already have) and that the model fails to converge. Any ideas on how to fix this? I have already bumped up the iterations and selected a new optimizer.
Figured out that error was arising from using the "." shortcut to specify all predictors for the model.

Calculate Prediction Intervals of a predicted value using Caret package of R

I used different neural network packages within Caret package for my predictions. Code used with nnet package is
library(caret)
# training model using nnet method
data <- na.omit(data)
xtrain <- data[,c("temperature","prevday1","prevday2","prev_instant1","prev_instant2","prev_2_hour")]
ytrain <- data$power
train_model <- train(x = xtrain, y = ytrain, method = "nnet", linout=TRUE, na.action = na.exclude,trace=FALSE)
# prediction using training model created
pred_ob <- predict(train_model, newdata=dframe,type="raw")
The predict function simply calculates the prediction value. But, I also need prediction intervals (2-sigma) as well. On searching, I found a relevant answer at stackoverflow link, but this does not result as needed. The solution suggests to use finalModelvariable as
predict(train_model$finalModel, newdata=dframe, interval = "confidence",type=raw)
Is there any other way to calculate prediction intervals? The training data used is the dput() of my previous question at stackoverflow link and the dput() of my prediction dataframe (test data) is
dframe <- structure(list(temperature = 27, prevday1 = 1607.69296666667,
prevday2 = 1766.18103333333, prev_instant1 = 1717.19306666667,
prev_instant2 = 1577.168915, prev_2_hour = 1370.14983583333), .Names = c("temperature",
"prevday1", "prevday2", "prev_instant1", "prev_instant2", "prev_2_hour"
), class = "data.frame", row.names = c(NA, -1L))
****************************UPDATE***********************
I used nnetpredintpackage as suggested at link. To my surprise it results in an error, which I find difficult to debug. Here is my updated code till now,
library(nnetpredint)
nnetPredInt(train_model, xTrain = xtrain, yTrain = ytrain,newData = dframe)
It results in the following error:
Error: Number of observations for xTrain, yTrain, yFit are not the same
[1] 0
I can check that xtrain, ytrain and dframe are with correct dimensions, but I do not have any idea about yFit. I don't need this according to the examples of nnetpredintvignette
caret doesn't generate prediction intervals; that relies on the individual package. If that package cannot do this, then neither can the train objects. I agree that nnetPredInt is the appropriate way to go.
Two other notes:
you most likely should center and scale your data if you have not already.
using the finalModel object is somewhat dangerous since it has no idea what was done to the data (e.g. dummy variables, centering and scale or other preprocessing methods, etc) before it was created.
Max
Thanks for your question. And a simple answer to your problem is: Right now the nnetPredInt function only support the following S3 object, "nnet", "nn" and "rsnns", produced by different neural network packages. And the train function in caret package return an "train" object. That's why the function nnetPredInt doesn't get the yFit vectors, which is the fitted.value of the training datasets, from your train_model.
1.Quick way to use the model from caret package:
Get the finalModel result from the 'train' object:
nnetObj = train_model$finalModel # return the 'nnet' model which the caret package has found.
yPredInt = nnetPredInt(nnetObj, xTrain = xtrain, yTrain = ytrain,newData = dframe)
For Example, Use the Iris Dataset and the 'nnet' method from caret package for regression prediction.
library(caret)
library(nnetpredint)
# Setosa 0 and Versicolor 1
ird <- data.frame(rbind(iris3[,,1], iris3[,,2]), species = c(rep(0, 50), rep(1, 50)))
samp = sample(1:100, 80)
xtrain = ird[samp,][1:4]
ytrain = ird[samp,]$species
# Training
train_model <- train(x = xtrain, y = ytrain, method = "nnet", linout = FALSE, na.action = na.exclude,trace=FALSE)
class(train_model) # [1] "train"
nnetObj = train_model$finalModel
class(nnetObj) # [1] "nnet.formula" "nnet"
# Constructing Prediction Interval
xtest = ird[-samp,][1:4]
ytest = ird[-samp,]$species
yPredInt = nnetPredInt(nnetObj, xTrain = xtrain, yTrain = ytrain,newData = xtest)
# Compare Results: ytest and yPredInt
ytest
yPredInt
2.The Hard Way
Use the generic nnetPredInt function to pass all the neural net specific parameters to the function:
nnetPredInt(object = NULL, xTrain, yTrain, yFit, node, wts, newData,alpha = 0.05 , lambda = 0.5, funName = 'sigmoid', ...)
xTrain # Training Dataset
yTrain # Training Target Value
yFit # Fitted Value of the training data
node # Structure of your network, like c(4,5,5,1)
wts # Specific order of weights parameters found by your neural network
newData # New Data for prediction
Tips:
Right now nnetpredint package only support the standard multilayer neural network regression with activated output, not the linear output,
And it will support more type of models soon in the future.
You can use the nnetPredInt function {package:nnetpredint}. Check out the function's help page here
If you are open to writing your own implementation there is another option. You can get prediction intervals from a trained net using the same implementation you would write for standard non-linear regression (assuming back propagation was used to do the estimation).
This paper goes through the methodology and is fairly straight foward: http://www.cis.upenn.edu/~ungar/Datamining/Publications/yale.pdf.
There are, as with everything,some cons (outlined in the paper) to this approach but definitely worth knowing as an option.

Resources