I have multiple classification machine learning models, all with different accuracies. When I run my XGBoost model (using library(caret)) in the console, I get an accuracy of 0.7586. But when I knit my R Markdown document, the accuracy of the same model is 0.8621. I have no idea why the two differ.
I followed the suggestions of this link, but nothing worked:
I also followed the suggestions in this question, but nothing worked: Statistics Result in R Markdown is different from the Knit Output (All Format: Word, HTML, PDF)
Finally I tried this, but that did not work either: sample function gives different result in console and in knitted document when seed is set
Here is my code, which I run identically in the console and in R Markdown but which gives a different accuracy:
# Data
data <- data[!is.na(data$var1),]
# Change levels of var1
#Data Preparation and Preprocessing
# Create the training and test datasets
# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(data$var1, p=0.8, list=FALSE)
# Step 2: Create the training dataset
trainset <- data[trainRowNumbers,]
# Step 3: Create the test dataset
testset <- data[-trainRowNumbers,]
# Store Y for later use.
y = trainset$var1
# Create the knn imputation model on the training data
preProcess_missingdata_model <- preProcess(trainset, method = c("knnImpute"))
# Create the knn imputation model on the testset data
preProcess_missingdata_model_test <- preProcess(testset, method = c("knnImpute"))
# Use the imputation model to predict the values of missing data points
library(RANN) # required for knnImpute
trainset <- predict(preProcess_missingdata_model, newdata = trainset)
# Use the imputation model to predict the values of missing data points
testset <- predict(preProcess_missingdata_model_test, newdata = testset)
# Append the Y variable
trainset$var1 <- y
# Run algorithms using 5-fold cross validation
control <- trainControl(method="cv",
repeats = 5,
savePredictions = "final",
search = "grid",
classProbs = TRUE)
metric <- "Accuracy"
# Make Valid Column Names
colnames(trainset) <- make.names(colnames(trainset))
colnames(testset) <- make.names(colnames(testset))
fit.xgbDART <- train(var1~., data = trainset, method = "xgbTree", metric = metric, trControl = control, verbose = FALSE, tuneLength = 7, nthread = 1)
# estimate skill of xgBOOST on the testset dataset
predictions <- predict(fit.xgbDART, testset)
cm <- caret::confusionMatrix(predictions, testset$var1, mode='everything')
My RNGkind() output is:
[1] "L'Ecuyer-CMRG" "Inversion" "Rejection"

Always add the function set.seed():
This function sets the starting number used to generate a sequence of random numbers – it ensures that you get the same result if you start with that same seed each time you run the same process. For example, if I use the sample() function immediately after setting a seed, I will always get the same sample.

This is my suggestion on where to use set.seed()
# Data
data <- data[!is.na(data$var1),]
# Change levels of var1
#Data Preparation and Preprocessing
# Create the training and test datasets
# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(data$var1, p=0.8, list=FALSE)
# Step 2: Create the training dataset
trainset <- data[trainRowNumbers,]
# Step 3: Create the test dataset
testset <- data[-trainRowNumbers,]
# Store Y for later use.
y = trainset$var1
# Create the knn imputation model on the training data
preProcess_missingdata_model <- preProcess(trainset, method = c("knnImpute"))
# Create the knn imputation model on the testset data
preProcess_missingdata_model_test <- preProcess(testset, method = c("knnImpute"))
# Use the imputation model to predict the values of missing data points
library(RANN) # required for knnImpute
trainset <- predict(preProcess_missingdata_model, newdata = trainset)
# Use the imputation model to predict the values of missing data points
testset <- predict(preProcess_missingdata_model_test, newdata = testset)
# Append the Y variable
trainset$var1 <- y
# Run algorithms using 5-fold cross validation
control <- trainControl(method="cv",
repeats = 5,
savePredictions = "final",
search = "grid",
classProbs = TRUE)
metric <- "Accuracy"
# Make Valid Column Names
colnames(trainset) <- make.names(colnames(trainset))
colnames(testset) <- make.names(colnames(testset))
fit.xgbDART <- train(
var1 ~ .,
data = trainset,
method = "xgbTree",
metric = metric,
trControl = control,
verbose = FALSE,
tuneLength = 7,
nthread = 1
)
# estimate skill of xgBOOST on the testset dataset
predictions <- predict(fit.xgbDART, testset)
cm <- caret::confusionMatrix(predictions, testset$var1, mode='everything')
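Since the question reports a non-default RNG kind ("L'Ecuyer-CMRG"), one thing worth checking is whether the console session and the knit session actually use the same generator. Below is a minimal base R sketch, not a confirmed fix for this case. It assumes the seed value 42 and R's default kinds (R >= 3.6.0) are acceptable; neither comes from the original post, and pin_seed is a hypothetical helper name. It pins the generator together with the seed and checks that the same draw repeats.

```r
# Assumption: seed 42 and R's default generators are illustrative choices only.
pin_seed <- function(seed = 42) {
  # set.seed() can fix the RNG kinds together with the seed
  set.seed(seed, kind = "Mersenne-Twister",
           normal.kind = "Inversion", sample.kind = "Rejection")
}

pin_seed()
idx_a <- sample(100, 10)

pin_seed()
idx_b <- sample(100, 10)

# Same seed and same generator give the same draw
same_draw <- identical(idx_a, idx_b)
print(same_draw)
print(RNGkind())
```

In the code above, the random steps are createDataPartition() and train(), so a call like pin_seed() would go directly before each of them, in both the console script and the R Markdown chunk.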


caret deepnet produces same value for all predictions

I am very new to deep learning. I trained a neural net using the packages deepnet and caret. For this regression problem caret uses a sigmoid function as the activation function and a linear one as the output function.
I preprocessed the predictors using preProcess = "range" (which I thought normalizes the predictors).
set.seed(123, kind = "Mersenne-Twister", normal.kind = "Inversion")
# create data
dat <- as.data.frame(ChickWeight)
dat$vari <- sample(LETTERS, nrow(dat), replace = TRUE)
dat$Chick <- as.character(dat$Chick)
preds <- dat[1:100,2:5]
response <- dat[1:100,1]
vali <- dat[101:150,]
# change format of categorical predictors to one-hot encoded format
dmy <- dummyVars(" ~ .", data = preds)
preds_dummies <- data.frame(predict(dmy, newdata = preds))
# specify trainControl for tuning with the specified folds
control <- caret::trainControl(search = "grid", method="repeatedcv", number=3,
savePredictions = TRUE)
# tune hyperparameters and build final model
tunegrid <- expand.grid(layer1 = c(5,50),
layer2 = c(0,5,50),
layer3 = c(0,5,50),
hidden_dropout = c(0, 0.1),
visible_dropout = c(0, 0.1))
model <- caret::train(x = preds_dummies,
y = response,
method = "dnn",
metric= "RMSE",
tuneGrid = tunegrid,
trControl= control,
preProcess = "range"
)
When I predict on the validation set with the tuned neural network model, it produces only a single prediction value despite the varying input predictors.
# predict with validation set
# create dummies
dmy <- dummyVars(" ~ .", data = vali)
vali_dummies <- data.frame(predict(dmy, newdata = vali))
vali_dummies <- vali_dummies[,which(names(vali_dummies) %in% model$finalModel$xNames)]
# add empty columns for the categorical preds used in the model but missing here (to have the same matrix)
not_included <- setdiff(model$finalModel$xNames, names(vali_dummies))
# fill value 0 is assumed here (dummy level not present in the validation set)
vali_add <- as.data.frame(matrix(rep(0, length(not_included)*nrow(vali_dummies)),
nrow = nrow(vali_dummies),
ncol = length(not_included)))
# change names
names(vali_add) <- not_included
# add to vali_dummies
vali_dummies <- cbind(vali_dummies, vali_add)
# put it in the same order as preds_dummies (sort the columns)
vali_dummies <- vali_dummies[names(preds_dummies)]
# normalize also the validation set
pp = preProcess(vali_dummies, method = c("range"))
vali_dummies <- predict(pp, vali_dummies)
# save obs and pred for predictions with the outer CV out-of-fold test set
temp <- data.frame(obs = vali[,1],
pred = caret::predict.train(object = model, newdata = vali_dummies))
When I am using the Boston data set from the MASS package where no categorical predictors are present, I get slightly different prediction values for all the different input predictors of the validation set.
How can I fix this and create a neural network which predicts "different" predictions when using numeric as well as categorical predictors? What else besides normalization should I try?
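On the range preprocessing mentioned above: caret's "range" method rescales each predictor to [0, 1] using its minimum and maximum. The sketch below, in base R with made-up numbers and hypothetical helper names (fit_range, apply_range), shows the usual convention of reusing the training minimum and maximum on the validation data instead of re-estimating them there. Whether this is related to the constant predictions in the question is not established here.

```r
# Hypothetical helpers; base R only, values are made up.
fit_range <- function(x) list(min = min(x), max = max(x))
apply_range <- function(x, r) (x - r$min) / (r$max - r$min)

train_x <- c(2, 4, 6, 10)
valid_x <- c(3, 8, 12)

r <- fit_range(train_x)                  # learn min/max on the training data only
train_scaled <- apply_range(train_x, r)  # 0, 0.25, 0.5, 1
valid_scaled <- apply_range(valid_x, r)  # may fall outside [0, 1]

print(train_scaled)
print(valid_scaled)
```

As far as I know, predict() on a caret train object already applies the preprocessing stored in the model to newdata, so a second, separate preProcess() step on the validation set would rescale it twice.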

Listing model coefficients in descending order

I have a dataset with both continuous and categorical variables. I am running regression to predict one of the variables based on the other variables in the dataset. After comparing the results of ridge, lasso and elastic-net regression, the lasso regression is the best model to proceed with.
I used the 'coef' function to extract the model's coefficients, however, the result is a very long list with over 800 variables (as some of my categorical variables have many levels). Is there a way I can quickly rank the coefficients from largest to smallest? This is a glmnet model output
Reproducible problem with example code:
# Libraries Needed
library(caret)
library(glmnet)
library(mlbench) # for BostonHousing data
# Data
data <- BostonHousing
# Data Partition
ind <- sample(2, nrow(data), replace = T, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]
# Custom Control Parameters
custom <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = T)
# Linear Model
lm <- train(medv ~.,
data = train,
method = 'lm',
trControl = custom)
# Results
# Ridge Regression
ridge <- train(medv ~.,
data = train,
method = 'glmnet',
tuneGrid = expand.grid(alpha = 0,
lambda = seq(0.0001, 1, length=5)), # try 5 values for lambda between 0.0001 and 1
# increasing lambda = increasing penalty and vice versa
# increasing lambda will therefore cause the coefs to shrink
trControl = custom)
# Plot Results
plot(ridge$finalModel, xvar = "lambda", label = T)
plot(ridge$finalModel, xvar = 'dev', label=T)
plot(varImp(ridge, scale=T))
# Lasso Regression
lasso <- train(medv ~.,
data = train,
method = 'glmnet',
tuneGrid = expand.grid(alpha=1,
lambda = seq(0.0001,1, length=5)),
trControl = custom)
# Plot Results
plot(lasso$finalModel, xvar = 'lambda', label=T)
plot(lasso$finalModel, xvar = 'dev', label=T)
plot(varImp(lasso, scale=T))
# Elastic Net Regression
en <- train(medv ~.,
data = train,
method = 'glmnet',
tuneGrid = expand.grid(alpha = seq(0,1,length=10),
lambda = seq(0.0001,1,length=5)),
trControl = custom)
# Plot Results
plot(en$finalModel, xvar = 'lambda', label=T)
plot(en$finalModel, xvar = 'dev', label=T)
# Compare Models
model_list <- list(LinearModel = lm, Ridge = ridge, Lasso = lasso, ElasticNet=en)
res <- resamples(model_list)
xyplot(res, metric = 'RMSE')
# Best Model
best <- en$finalModel
coef(best, s = en$bestTune$lambda)
For most models all you'd have to do would be:
sort(coef(model), decreasing=TRUE)
Since you're using glmnet it's a little bit more complicated. I'm going to replicate a minimal version of your example here (the other models, plots, etc. are not necessary in order for us to be able to reproduce your problem ...)
## Packages
library(caret)
library(glmnet)
library(mlbench) ## for BostonHousing data
# Data
data <- BostonHousing
# Data Partition
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]
# Custom Control Parameters
custom <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = TRUE)
# Elastic Net Regression
en <- train(medv ~.,
data = train,
method = 'glmnet',
tuneGrid = expand.grid(alpha = seq(0,1,length=10),
lambda = seq(0.0001,1,length=5)),
trControl = custom)
# Best Model
best <- en$finalModel
coefs <- coef(best, s = en$bestTune$lambda)
(This could probably be made simpler: for example, do you really need the custom control parameters to show us the example? This would be even simpler without using caret, just using glmnet, but I was afraid I might leave something out.)
Once you've got the coefficients, sorting does appear to work, albeit with a message about possible inefficiency:
sort(coefs, decreasing=TRUE)
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
## [1] 25.191049410 5.078589706 1.389548822 0.244605193 0.045600250
## [6] 0.008840485 0.004372752 -0.012701593 -0.028337745 -0.162794401
## [11] -0.335062819 -0.901475516 -1.395091095 -12.632336419
sort(as.numeric(coefs)) also appears to work fine.
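The sorted output above no longer shows which variable each value belongs to. A minimal sketch of one way to keep the names, assuming the coefficient object is a one-column matrix with row names (a plain base R matrix with made-up values stands in for the sparse matrix here):

```r
# Stand-in for the one-column matrix returned by coef(); values are made up.
coefs <- matrix(c(25.2, -0.16, 0.045, -12.6, 1.39), ncol = 1,
                dimnames = list(c("(Intercept)", "crim", "zn", "nox", "chas1"), "s1"))

# Turn it into a named numeric vector and drop the intercept
coef_vec <- setNames(as.numeric(coefs), rownames(coefs))
coef_vec <- coef_vec[names(coef_vec) != "(Intercept)"]

# Largest to smallest by signed value
by_value <- sort(coef_vec, decreasing = TRUE)

# Largest to smallest by absolute size
by_size <- coef_vec[order(abs(coef_vec), decreasing = TRUE)]

print(by_value)
print(by_size)
```

With 800+ coefficients, head(by_size, 20) would show only the largest ones; for a lasso fit, coef_vec[coef_vec != 0] would drop the variables that were shrunk to exactly zero.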
If you want to sort the entire matrix (i.e. keeping the values for all penalization levels), you can take advantage of the fact that the penalization doesn't change the rank-order of the parameters:
coeftab <-coef(best)
lastvals <- coeftab[,ncol(coeftab)]
coeftab_s <- coeftab[order(lastvals,decreasing=TRUE),]
## plot, leaving out the intercept

Function to calculate Precision, Recall and Accuracy for 5 different algorithms in R

I have 4 different datasets in this general form:
df <- data.frame(var1 = c(319, 77, 222, 107, 167),
var2 = c(137, 290, 237, 52, 192),
class = c(1,1,0,1,0))
Each containing var1, var2 and a class variable. I was given the following instructions:
Write an R script that takes a data table as input and returns the performance statistics (precision, recall, and accuracy) for the five difference algorithms, decision trees (rpart), naive Bayes (naiveBayes), K nearest neighbor (knn) , support vector machines (svm) and artificial neural networks (nnet). The return value of the script will be a 5 by 3 matrix of the statistics for each algorithm. For knn use k=3, for svm, using the linear kernel and for nnet use 4 hidden nodes. To calculate the statistics you will be using 10 fold cross validation.
Essentially, I believe I have to write an all encompassing function that I can pass a dataframe, and the return of that function is the precision, recall and accuracy for each of the 5 different algorithms outlined in the directions above. Is there a concise way to perform this?? Any help would be appreciated.
Assuming the class variable is the 3rd column and named "class", which it is with all my example datasets, here is what I came up with:
algStats <- function(dataset){
# Split data into test and train (80/20 train to test)
trainindex <- sample(1:nrow(dataset), 0.8 * nrow(dataset))
TrainData <- dataset[trainindex, ] # Train data length same for all datasets
TestData <- dataset[-trainindex, ] # Test data length same for all datasets
# Declare 10-fold CV (same for all datasets)
train_control <- trainControl(method="cv", number=10)
# Train a decision tree model
DecTreeMod <- train(as.factor(class)~., data=TrainData,
trControl=train_control, method="rpart")
# Predict on test data using Decision Tree model
DecTreepred <- predict(DecTreeMod, TestData[,1:2])
# Create confusion matrix for Decision Tree classifier
DecTreecf <- confusionMatrix(DecTreepred, as.factor(TestData[,3]), mode = "prec_recall", positive = "1")
# Extract Precision, Recall and Accuracy from confusion matrix
DecTreePrecision <- DecTreecf$byClass[5] # <-----Precision
DecTreeRecall <- DecTreecf$byClass[6] # <-----Recall
DecTreeAcc <- DecTreecf$overall[1] # <-----Accuracy
# Create an empty matrix to hold performance measures of each algorithm
rownames = c("Decision Tree", "Naive Bayes", "KNN", "SVM", "ANN")
colnames = c("Precision", "Recall", "Accuracy")
performance <- matrix(ncol = 3, nrow = 5,
byrow = T, dimnames = list(rownames, colnames))
# Append the metrics from the Decision Tree classifier into the matrix
# performance <- rbind(performance, c(DecTreePrecision,DecTreeRecall,DecTreeAcc))
performance[1,] <- c(DecTreePrecision,DecTreeRecall,DecTreeAcc)
# Train a Naive Bayes model
NBMod <- train(as.factor(class)~., data=TrainData,
trControl=train_control, method="nb")
# Predict on test data using Naive Bayes model
NBpred <- predict(NBMod, TestData[,1:2])
# Create confusion matrix for Naive Bayes classifier
NBcf <- confusionMatrix(NBpred, as.factor(TestData[,3]), mode = "prec_recall", positive = "1")
# Extract Precision, Recall and Accuracy from confusion matrix
NBPrecision <- NBcf$byClass[5] # <-----Precision
NBRecall <- NBcf$byClass[6] # <-----Recall
NBAcc <- NBcf$overall[1] # <-----Accuracy
# Append the metrics from the Naive Bayes classifier into the matrix
performance[2,] <- c(NBPrecision,NBRecall,NBAcc)
# Train a KNN model
KNNMod <- train(as.factor(class)~., data=TrainData, tuneGrid = expand.grid(k = 3),
trControl=train_control, method="knn", preProcess = c("center","scale"))
# Predict on test data using KNN model
KNNpred <- predict(KNNMod, TestData[,1:2])
# Create confusion matrix for KNN classifier
KNNcf <- confusionMatrix(KNNpred, as.factor(TestData[,3]), mode = "prec_recall", positive = "1")
# Extract Precision, Recall and Accuracy from confusion matrix
KNNPrecision <- KNNcf$byClass[5] # <-----Precision
KNNRecall <- KNNcf$byClass[6] # <-----Recall
KNNAcc <- KNNcf$overall[1] # <-----Accuracy
# Append the metrics from the KNN classifier into the matrix
performance[3,] <- c(KNNPrecision,KNNRecall,KNNAcc)
# Train an SVM model
SVMMod <- train(as.factor(class)~., data=TrainData,
trControl=train_control, method="svmLinear", preProcess = c("center","scale"))
# Predict on test data using the SVM model
SVMpred <- predict(SVMMod, TestData[,1:2])
# Create confusion matrix for SVM classifier
SVMcf <- confusionMatrix(SVMpred, as.factor(TestData[,3]), mode = "prec_recall", positive = "1")
# Extract Precision, Recall and Accuracy from confusion matrix
SVMPrecision <- SVMcf$byClass[5] # <-----Precision
SVMRecall <- SVMcf$byClass[6] # <-----Recall
SVMAcc <- SVMcf$overall[1] # <-----Accuracy
# Append the metrics from the SVM classifier into the matrix
performance[4,] <- c(SVMPrecision,SVMRecall,SVMAcc)
# Train an ANN model
ANNMod <- train(as.factor(class)~., data=TrainData, tuneGrid = expand.grid(
size = 4, decay = 0.1), linear.output = F, trControl=train_control, method="nnet",
preProcess = c("center","scale"))
# Predict on test data using the ANN model
ANNpred <- predict(ANNMod, TestData[,1:2])
# Create confusion matrix for ANN classifier
ANNcf <- confusionMatrix(ANNpred, as.factor(TestData[,3]), mode = "prec_recall", positive = "1")
# Extract Precision, Recall and Accuracy from confusion matrix
ANNPrecision <- ANNcf$byClass[5] # <-----Precision
ANNRecall <- ANNcf$byClass[6] # <-----Recall
ANNAcc <- ANNcf$overall[1] # <-----Accuracy
# Append the metrics from the ANN classifier into the matrix
performance[5,] <- c(ANNPrecision,ANNRecall,ANNAcc)
# Return the performance matrix
performance
}
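The function above repeats the same three extractions for every algorithm. As a cross-check, the three statistics can also be computed directly from the predicted and actual labels. This is a minimal base R sketch with made-up labels; class_metrics is a hypothetical helper, not part of caret.

```r
# Hypothetical helper: precision, recall and accuracy for a binary problem.
class_metrics <- function(pred, actual, positive = "1") {
  pred <- as.character(pred)
  actual <- as.character(actual)
  tp <- sum(pred == positive & actual == positive)  # true positives
  fp <- sum(pred == positive & actual != positive)  # false positives
  fn <- sum(pred != positive & actual == positive)  # false negatives
  c(Precision = tp / (tp + fp),
    Recall    = tp / (tp + fn),
    Accuracy  = mean(pred == actual))
}

# Made-up labels for illustration
actual <- c(1, 1, 1, 1, 0, 0, 0, 0)
pred   <- c(1, 1, 0, 0, 1, 0, 0, 0)

m <- class_metrics(pred, actual)
print(m)
```

A helper like this could be applied to each model's predictions in a loop over the five method names, with each result filling one row of the 5 by 3 matrix.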

Linear SVM and extracting the weights

I am practicing SVM in R using the iris dataset and I want to get the feature weights/coefficients from my model, but I think I may have misinterpreted something given that my output gives me 32 support vectors. I was under the assumption I would get four given I have four variables being analyzed. I know there is a way to do it when using the svm() function, but I am trying to use the train() function from caret to produce my SVM.
# Define fitControl
fitControl <- trainControl(## 5-fold CV
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary )
# Define Tune
df <- iris
head(df)
# set random seed and run the model
svmFit1 <- train(x = df[-5],
y = df$Species,
method = "svmLinear",
trControl = fitControl,
preProc = c("center","scale"),
tuneGrid=grid )
I thought it was simply svmFit1$finalModel@coef but I get 32 vectors when I believe I should get 4. Why is that?
So coef is not the weight vector W; it holds one entry per support vector, which is why you see 32 values. Here's the relevant section of the ksvm class in the docs:
coef: The corresponding coefficients times the training labels.
To get what you are looking for, you'll need to do the following:
coefs <- svmFit1$finalModel@coef[[1]]
mat <- svmFit1$finalModel@xmatrix[[1]]
coefs %*% mat
See below for a reproducible example.
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 3.5.2
# Define fitControl
fitControl <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# Define Tune
grid <- expand.grid(C = c(2^-5, 2^-3, 2^-1))
df <- iris
df<-df[df$Species != 'setosa', ]
df$Species <- as.character(df$Species)
df$Species <- as.factor(df$Species)
# set random seed and run the model
svmFit1 <- train(x = df[-5],
y = df$Species,
method = "svmLinear",
trControl = fitControl,
preProc = c("center","scale"),
tuneGrid=grid )
coefs <- svmFit1$finalModel@coef[[1]]
mat <- svmFit1$finalModel@xmatrix[[1]]
coefs %*% mat
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,] -0.1338791 -0.2726322 0.9497457 1.027411
Created on 2019-06-11 by the reprex package (v0.2.1.9000)
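The product coefs %*% mat above is the linear SVM weight vector: each support vector's row is multiplied by its coefficient (alpha times label) and the rows are summed. A minimal base R sketch with made-up numbers, where alpha_y plays the role of the coef slot and sv the role of the xmatrix slot:

```r
# Made-up numbers: 3 support vectors, 2 features.
alpha_y <- c(0.5, -1.0, 0.5)                       # one coefficient per support vector
sv <- matrix(c(1, 2,
               0, 1,
               3, 0), nrow = 3, byrow = TRUE,
             dimnames = list(NULL, c("f1", "f2"))) # one row per support vector

# Weight vector as a 1 x 2 matrix product
w <- alpha_y %*% sv

# The same thing written as an explicit weighted sum over support vectors
w_sum <- colSums(alpha_y * sv)

print(w)
print(w_sum)
```

However many support vectors the model has (32 in the question), the result has one weight per feature, which is the four values the question expected.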
As more folks start moving from caret to tidymodels, I thought I'd post a tidymodels version of the above solution (as of Aug 2020), because I haven't seen many discussions about this so far and it isn't that straightforward to do.
I'm outlining the main steps here; please review the links at the end for details on why it was done this way.
1. Get Your Final Model
# Assuming kernlab linear SVM
# Grid Search Parameters
tune_rs <- tune_grid(
grid = param_grid,
metrics = classification_measure,
control = control_grid(save_pred = TRUE)
# Finalise workflow with the parameters for best accuracy
best_accuracy <- select_best(tune_rs, "accuracy")
svm_wf_final <- finalize_workflow(
# Fit on your final model on all available data at the end of experiment
final_model <- fit(svm_wf_final, data)
# fit takes a model spec and executes the model fit routine (Parsnip)
# model_spec, formula and data to fit upon
2. Extract the KSVM Object, Pull Required Info, Calculate Variable Importance
ksvm_obj <- pull_workflow_fit(final_model)$fit
# Pull_workflow_fit returns the parsnip model fit object
# $fit returns the object produced by the fitting fn (which is what we need! and is dependent on the engine)
coefs <- ksvm_obj@coef[[1]]
# first bit of info we need are the coefficients from the linear fit
mat <- ksvm_obj@xmatrix[[1]]
# xmatrix that we need to matrix multiply against
var_impt <- coefs %*% mat
# var importance
Extracting the Weights of Support Vectors using Caret: Linear SVM and extracting the weights
Variable Importance (Last Section of this post):

Run several GLM models using for loop in R

I'm trying to run an experiment in which I fit several GLM models in R using the same variables but different training samples.
Here is some simulated data:
resp <- sample(0:1,100,TRUE)
x1 <- c(rep(5,20),rep(0,15), rep(2.5,40),rep(17,25))
x2 <- c(rep(23,10),rep(5,10), rep(15,40),rep(1,25), rep(2, 15))
dat <- data.frame(resp,x1, x2)
This is the loop I'm trying to use:
n <- 5
for (i in 1:n) {
### Create training and testing data
## 80% of the sample size
# Note that I didn't use seed so that random split is performed every iteration.
smp_sizelogis <- floor(0.8 * nrow(dat))
train_indlogis <- sample(seq_len(nrow(dat)), size = smp_sizelogis)
trainlogis <- dat[train_indlogis, ]
testlogis <- dat[-train_indlogis, ]
InitLOogModel[i] <- glm(resp ~ ., data =trainlogis, family=binomial)
}
But unfortunately, I'm getting this error:
Error in InitLOogModel[i] <- glm(resp ~ ., data = trainlogis, family = binomial) :
object 'InitLOogModel' not found
Any thoughts?
I'd suggest using caret for what you're trying to do. It takes some time to learn, but incorporates many 'best practices'. Once you've learned the basics you'll be able to quickly try models other than a glm, and easily compare the models to each other. Here's modified code from your example to get you started.
## caret
library(caret)
# your data
resp <- sample(0:1,100,TRUE)
x1 <- c(rep(5,20),rep(0,15), rep(2.5,40),rep(17,25))
x2 <- c(rep(23,10),rep(5,10), rep(15,40),rep(1,25), rep(2, 15))
dat <- data.frame(resp,x1, x2)
# so caret knows you're trying to do classification, otherwise will give you an error at the train step
dat$resp <- as.factor(dat$resp)
# create a hold-out set to use after your model fitting
# not really necessary for your example, but showing for completeness
train_index <- createDataPartition(dat$resp, p = 0.8,
list = FALSE,
times = 1)
# create your train and test data
train_dat <- dat[train_index, ]
test_dat <- dat[-train_index, ]
# repeated cross validation, repeated 5 times
# this is like your 5 loops, taking 80% of the data each time
fitControl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 5)
# fit the glm!
glm_fit <- train(resp ~ ., data = train_dat,
method = "glm",
family = "binomial",
trControl = fitControl)
# summary
# best model
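As for the error in the original loop: it occurs because InitLOogModel is never created before the first assignment into it. A fitted glm is itself a list, so the container should be a list filled with [[i]] rather than [i]. Below is a minimal base R sketch of that fix, assuming the simulated data from the question; the name models and the seed are illustrative choices, not from the original post.

```r
set.seed(1)  # assumption: fixed only so the simulated data are repeatable
resp <- sample(0:1, 100, TRUE)
x1 <- c(rep(5, 20), rep(0, 15), rep(2.5, 40), rep(17, 25))
x2 <- c(rep(23, 10), rep(5, 10), rep(15, 40), rep(1, 25), rep(2, 15))
dat <- data.frame(resp, x1, x2)

n <- 5
models <- vector("list", n)  # create the container before the loop

for (i in seq_len(n)) {
  # new random 80% training sample on every iteration
  train_ind <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
  models[[i]] <- glm(resp ~ ., data = dat[train_ind, ], family = binomial)
}

# one row of coefficients per fitted model
coef_table <- t(sapply(models, coef))
print(coef_table)
```

Each element of models is a full glm object, so summary(models[[1]]) or predict(models[[1]], newdata = ...) work as usual.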
