Producing graph of training and validation sets using caret's train function - r

I am using caret's train function in R to produce a model using GBM. I have used repeated cross-validation with 5 repititions meaning there will be 50 samples. I want to ask if there is a way to plot the results in a different way such that the plot shows the boosting iterations on the x-axis and auc on the y-axis and inside it shows the results obtained from the best parameter selection but a separate line for training folds and test folds. This can be produced when you use "gbm" function from the gbm package and use "gbm.perf" along with sampling technique to plot the training and validation curve for deviance.
Is it possible to do the same with caret's train function somehow?
Thanks.

Within your caret object, if you used method='gbm', you can select the attribute 'finalModel', which is the resulting gbm object. For example, if your train object is named 'a', then
gbm_model <- a$finalModel
With gbm_model, you can then run the functions internal to the gbm package.

Related

Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps that I already developed, using a fictive example from pima data from faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply a multiple imputation by chained equation method using MICE package. For the sake of the example, I previously randomly assign missing values to pima dataset using the ampute function from the same package. A number of 20 imputated datasets were generated by setting "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputated datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputated datasets. Inspecting convergence, original and imputated data distributions is skipped for the sake of the example. "Test" variable is set as the binary dependent variable.
#run a logistic regression on each of the 20 imputated datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
Combine the regression estimations from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using prediction function from the margins package. This specific function allows to generate predicted values fixed at a specific level (for factors) or values (for continuous variables). In this example, I could chose to generate new predicted probabilites, i.e. P(Y=1), while setting pregnant variable (# of pregnancies) at 3. In other words, it would give me the distribution of the issue in the contra-factual situation where all the observations are set at 3 for this variable. Normally, I would just give my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model with MICE, the object class is a mipo and not a glm object.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) <empty>p<empty>r<empty>e<empty>g<empty>n<empty>a<empty>n<empty>t<empty
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would enjoy any advice that could help me getting closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to #Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
mod <- glm(test~pregnant+glucose+diastolic+
triceps+age+bmi,
data = dat, family = binomial)
out <- predictions(mod, newdata = datagrid(pregnant=3))
return(out)
}
dat_mice <- mice(pima, m = 20, printFlag = FALSE, .Random.seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)

Choosing prediction model with regularization, spatial cross-validation and bounded predictions

I am new to machine learning and R. I want to run a statistical model to predict daily hours of supply of electricity (y). I have several x variables to use for prediction. I have three goals to achieve:
I want to use some sort of regularization to choose the x variables that should go in the model.
y is bounded between 0 and 24. So I want the predictions to also be bounded within this range.
The data has spatial attributes and I want to use spatial cross-validation to re-sample while tuning regularization parameters.
I am planning to use the mlr package in R. Which learner can I use that can achieve the above three goals?
Many thanks.

How to predict multiple svm models in R?

I have train and test images separately. I want to predict the SVM models in an iterative way. After creating models if i predict the result, i can see only the last predicted value rather than all the predicted values for n number of models. I would like to know how to automate the process of creating n SVM models and predict all the values.
Thanks in advance.
If your problem is a "multi-class" problem, you can directly apply SVM function provided by e1071 for training your data which are properly labelled.
If your problem is a "multi-instance" problem, you can train multiple SVM models by giving them different names. For automating iterations, you can play the trick using paste(). Something like
for (n in 1:itr) {
svm.model <- svm(label~., data)
assign(paste("svm.model", n, sep = "."), svm.model)
}
You will get svm.model.1, svm.model.2, ... for multiple SVM models, respectively.

Random forest evaluation in R

I am a newbie in R and I am trying to do my best to create my first model. I am working in a 2- classes random forest project and so far I have programmed the model as follows:
library(randomForest)
set.seed(2015)
randomforest <- randomForest(as.factor(goodkit) ~ ., data=training1, importance=TRUE,ntree=2000)
varImpPlot(randomforest)
prediction <- predict(randomforest, test,type='prob')
print(prediction)
I am not sure why I don't get the overall prediction for my model.I must be missing something in my code. I get the OOB and the prediction per case in the test set but not the overall prediction of the model.
library(pROC)
auc <-roc(test$goodkit,prediction)
print(auc)
This doesn't work at all.
I have been through the pROC manual but I cannot get to understand everything. It would be very helpful if anyone can help with the code or post a link to a good practical sample.
Using the ROCR package, the following code should work for calculating the AUC:
library(ROCR)
predictedROC <- prediction(prediction[,2], as.factor(test$goodkit))
as.numeric(performance(predictedROC, "auc")#y.values))
Your problem is that predict on a randomForest object with type='prob' returns two predictions: each column contains the probability to belong to each class (for binary prediction).
You have to decide which of these predictions to use to build the ROC curve. Fortunately for binary classification they are identical (just reversed):
auc1 <-roc(test$goodkit, prediction[,1])
print(auc1)
auc2 <-roc(test$goodkit, prediction[,2])
print(auc2)

Unable to predict on a new dataset using coxph in R

I built a coxph model on training data and I am having trouble in using it for predictions on the validation dataset. I used Surv to build a survival object and used it as the response variable in coxph model. I then used predict to get predictions on both the training and test dataset but at both times it predicted using on training data.
my.surv<-Surv(train$time_to_event, train$event, type="right")
basic.coxph<-coxph(my.surv~train$basin+train$region+train$district_code)
prediction_train<-predict(basic.coxph,train,type="risk")
prediction_test<-predict(object=basic.coxph,newdata=validate,type="risk")
The results of both prediction_train and prediction_test have the same dimensions and are exactly the same, though the training and validation dataset have different dimensions but al the same columns. Any suggestion on what I am doing wrong here?

Resources