How to predict multiple SVM models in R?

I have separate train and test image sets. I want to fit and predict SVM models iteratively. After creating the models, when I predict, I can only see the last predicted value rather than the predictions from all n models. I would like to know how to automate creating n SVM models and collecting all of their predictions.
Thanks in advance.

If your problem is a "multi-class" problem, you can directly apply the svm() function provided by e1071 to your properly labelled training data.
If your problem is a "multi-instance" problem, you can train multiple SVM models by giving them different names. To automate the iterations, you can use paste() together with assign(). Something like
library(e1071)
for (n in 1:itr) {
  # fit the model for iteration n (substitute the data for this iteration)
  svm.model <- svm(label ~ ., data)
  assign(paste("svm.model", n, sep = "."), svm.model)
}
You will then have svm.model.1, svm.model.2, ... as separate SVM model objects.
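If the goal is to keep every prediction rather than only the last one, storing the models and predictions in lists is usually easier than assign(). A minimal sketch, assuming e1071 is available and that train.sets and test.sets are hypothetical lists holding the per-model training and test data:
library(e1071)
models <- vector("list", itr)
predictions <- vector("list", itr)
for (n in 1:itr) {
  models[[n]] <- svm(label ~ ., data = train.sets[[n]])
  predictions[[n]] <- predict(models[[n]], newdata = test.sets[[n]])
}
# predictions[[1]], predictions[[2]], ... now hold the results of every model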

Related

Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps I have developed so far, using a fictitious example based on the pima data from the faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply multiple imputation by chained equations using the mice package. For the sake of the example, I first randomly assign missing values to the pima dataset using the ampute() function from the same package. Twenty imputed datasets are generated by setting the "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputed datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and comparing the original and imputed data distributions is skipped for the sake of the example. The "test" variable is set as the binary dependent variable.
#run a logistic regression on each of the 20 imputed datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
Combine the regression estimates from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using the prediction() function from the margins package. This function can generate predicted values fixed at a specific level (for factors) or value (for continuous variables). In this example, I chose to generate new predicted probabilities, i.e. P(Y=1), while setting the pregnant variable (number of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where all observations are set to 3 on this variable. Normally, I would just pass my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model from mice, the object class is mipo rather than glm.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
  Unrecognized variable name in 'at': (1) pregnant
I thought of two solutions:
a) changing the object class to make it fit prediction()'s requirements
b) extracting the pooled regression parameters and reconstructing them in a list that fits prediction()'s requirements
However, I'm not sure how to achieve either of these and would appreciate any advice that could help me get closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
  mod <- glm(test ~ pregnant + glucose + diastolic +
               triceps + age + bmi,
             data = dat, family = binomial)
  out <- predictions(mod, newdata = datagrid(pregnant = 3))
  return(out)
}
dat_mice <- mice(pima, m = 20, printFlag = FALSE, .Random.seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
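Assuming marginaleffects supplies the tidy() methods that mice::pool() needs for these predictions objects, the pooled results can then be inspected the usual way (a sketch, not verified against the output above):
summary(mod_imputation)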

Predicting with plm function in R

I was wondering whether it is possible to predict with the plm() function from the plm package in R for a new dataset of predictor variables. I have created a model object using:
model <- plm(formula, data, index, model = 'pooling')
Now I'm hoping to predict the dependent variable for a new dataset that has not been used in estimating the model. I can do it by using the coefficients from the model object like this:
col_idx <- c(...)
df <- cbind(rep(1, nrow(df)), df[(1:ncol(df))[-col_idx]])
fitted_values <- as.matrix(df) %*% as.matrix(model_object$coefficients)
That is, I first put the index columns used in the model and the columns dropped due to collinearity into col_idx, and then construct a matrix of data to be multiplied by the coefficients from the model. However, I can see errors occurring much more easily with this manual dropping of columns.
A function designed to do this would make the code a lot more readable, I guess. I have also found the pmodel.response() function, but I can only get it to work for the dataset that was used to fit the model object.
Any help would be appreciated!
I wrote a function (predict.out.plm) to do out-of-sample predictions after estimating first-difference or fixed-effects models with plm.
The function is posted here:
https://stackoverflow.com/a/44185441/2409896
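For the "pooling" case in the question, the manual approach is less error-prone if model.matrix() builds the design matrix for the new data, so factor coding and column order automatically match the fitted coefficients. A minimal sketch using the Grunfeld data shipped with plm (not the asker's data):
library(plm)
data("Grunfeld", package = "plm")
train <- subset(Grunfeld, year <= 1950)
test <- subset(Grunfeld, year > 1950)
model <- plm(inv ~ value + capital, data = train,
             index = c("firm", "year"), model = "pooling")
# build the test design matrix from the same right-hand side as the model,
# then multiply by the estimated coefficients (intercept included)
X_test <- model.matrix(~ value + capital, data = test)
fitted_values <- X_test %*% coef(model)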

Combining Multiple Random Forest Models from Amelia Imputed Data

I just created 40 imputed data sets using the Amelia package, and they are stored in a.out.
I then used the lapply function to create randomforest models on the data sets:
rf.amelia.out = lapply(a.out$imputations, function(i) randomForest(y ~ x1 + x2, data = i))
Now I would like to combine these models to make predictions on a.test.out, which is a list of Amelia-imputed test datasets.
I can't figure out how to combine these random forest models. I've tried the randomForest combine() function, e.g. combine(rf.amelia.out), but that didn't work. The problem is that rf.amelia.out is not a model object, and neither is rf.amelia.out[1].
I also tried to use zelig to automatically combine multiple models:
rf.z.out = zelig(y~x1+x2, data = a.out, model = "rf")
But I don't think zelig supports random forest models.
How do I access and combine the multiple random forest models so that I can make one prediction?
Since rf.amelia.out is already a list, the combine function in randomForest loses its methods when it tries to convert it to a list again. I recommend one of two fixes:
Change the combine function and then use the modified version:
body(combine)[[4]] <- substitute(rflist <- (...))
rf.all <- combine(rf.amelia.out)
Or use:
combine(rf.amelia.out[[1]], rf.amelia.out[[2]], ...)
I think the first way is easier (and much less manual).
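Another way around the list problem, without editing the body of combine(), is do.call(), which passes each forest as its own argument; and because the forests come from imputed datasets, simply averaging their predictions is a common alternative. A self-contained sketch with synthetic data (object names are illustrative, not the asker's):
library(randomForest)
set.seed(1)
# stand-ins for the imputed datasets
datasets <- replicate(3, data.frame(x1 = rnorm(100), x2 = rnorm(100),
                                    y = rnorm(100)), simplify = FALSE)
rf_list <- lapply(datasets, function(d) randomForest(y ~ x1 + x2, data = d))
# pass each forest as a separate argument to combine()
rf_all <- do.call(combine, rf_list)
# or keep the forests separate and average their predictions
newdat <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
pred_mat <- sapply(rf_list, predict, newdata = newdat)
rowMeans(pred_mat)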

What is the objective of model.matrix()?

I'm currently going through the 'Introduction to Statistical Learning' MOOC by Stanford OpenX. In one of the lab exercises, it suggests creating a model matrix from the test data by explicitly using model.matrix().
Extract from textbook
We now compute the validation set error for the best model of each model size. We first make a model matrix from the test data.
test.mat = model.matrix(Salary ~ ., data = Hitters[test, ])
The model.matrix() function is used in many regression packages for building an "X" matrix from data. Now we run a loop, and for each size i, we extract the coefficients from regfit.best for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test MSE.
val.errors = rep(NA, 19)
for (i in 1:19) {
  coefi = coef(regfit.best, id = i)
  pred = test.mat[, names(coefi)] %*% coefi
  val.errors[i] = mean((Hitters$Salary[test] - pred)^2)
}
I understand that model.matrix() converts factor and character variables into indicator (dummy) columns, and that modelling functions like lm() do this conversion under the hood.
However, what are the instances that we would explicitly use model.matrix(), and why?
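To make the first point concrete, model.matrix() expands factors into indicator columns, which is what lm() does internally before fitting. A small sketch:
# a factor with three levels becomes an intercept plus two indicator columns
d <- data.frame(y = rnorm(6), grp = factor(rep(c("a", "b", "c"), 2)))
model.matrix(y ~ grp, data = d)
An explicit call is typically needed when a fitting routine has no predict() method (as with regsubsets() in the extract above, hence the manual matrix multiplication) or when a function expects a numeric matrix rather than a formula and data frame.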

Regression evaluation in R

Are there any utilities/packages for showing various performance metrics of a regression model on some labeled test data? Basic stuff I can easily write like RMSE, R-squared, etc., but maybe with some extra utilities for visualization, or reporting the distribution of prediction confidence/variance, or other things I haven't thought of. This is usually reported in most training utilities (like caret's train), but only over the training data (AFAICT). Thanks in advance.
This question is really quite broad and should be focused a bit, but here's a small subset of functions written to work with linear models:
x <- rnorm(seq(1,100,1))
y <- rnorm(seq(1,100,1))
model <- lm(x~y)
#general summary
summary(model)
#Visualize some diagnostics
plot(model)
#Coefficient values
coef(model)
#Confidence intervals
confint(model)
#predict values
predict(model)
#predict new values
predict(model, newdata = data.frame(y = 1:10))
#Residuals
resid(model)
#Standardized residuals
rstandard(model)
#Studentized residuals
rstudent(model)
#AIC
AIC(model)
#BIC
BIC(model)
#Cook's distance
cooks.distance(model)
#DFFITS
dffits(model)
#lots of measures related to model fit
influence.measures(model)
Bootstrap confidence intervals for model parameters can be computed using the recommended boot package. It is a very general package: you write a simple wrapper function that returns the parameter of interest, say one that fits the model to the supplied data and returns one of the coefficients, and boot takes care of the rest, doing the resampling and computing the intervals.
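For example, a minimal sketch of that wrapper pattern on synthetic data, refitting the model on each resample and returning the slope:
library(boot)
set.seed(1)
dat <- data.frame(x = rnorm(100), y = rnorm(100))
# statistic function: boot() supplies the resample indices
coef_fun <- function(data, idx) {
  fit <- lm(x ~ y, data = data[idx, ])
  coef(fit)[["y"]]
}
b <- boot(dat, statistic = coef_fun, R = 999)
boot.ci(b, type = "perc")   # percentile bootstrap CI for the slope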
Consider also the caret package, which is a wrapper around a large number of modelling functions, but also provides facilities to compare model performance using a range of metrics using an independent test set or a resampling of the training data (k-fold, bootstrap). caret is well documented and quite easy to use, though to get the best out of it, you do need to be familiar with the modelling function you want to employ.
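A short sketch of that workflow on synthetic data: resample on the training portion via trainControl(), then report held-out metrics with postResample():
library(caret)
set.seed(1)
dat <- data.frame(x = rnorm(200))
dat$y <- 2 * dat$x + rnorm(200)
idx <- createDataPartition(dat$y, p = 0.8, list = FALSE)
fit <- train(y ~ x, data = dat[idx, ], method = "lm",
             trControl = trainControl(method = "cv", number = 5))
# RMSE, R-squared, and MAE on the held-out 20%
postResample(pred = predict(fit, dat[-idx, ]), obs = dat$y[-idx])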
