glmnet / glmnetUtils: Repeated cross-validation

I am trying to run repeated 10-fold CV (alpha and lambda) using glmnet / glmnetUtils. My proposed workflow is to:
a) fit a proposed model at 11 values of alpha,
b) run the process X (in this case, 10) times,
c) average the results, and
d) fit a final model with the best combination of alpha and lambda (s = "lambda.1se").
To address a-c, I used the code below; however, the results from the 10 iterations are exactly the same.
library(glmnet)
library(glmnetUtils)
library(doParallel)
library(plyr)   # for ldply() below
library(dplyr)  # for bind_rows() below
data(BinomialExample)
# In recent glmnet versions BinomialExample is a list; pull out x and y
x <- BinomialExample$x
y <- BinomialExample$y
# Create alpha sequence; fix folds
alpha <- seq(.5, 1, .05)
set.seed(1)
folds <- sample(1:10, size = length(y), replace = TRUE)
# Determine optimal combination of alpha and lambda; extract lowest CV error and associated lambda at each alpha
extractGlmnetInfo <- function(object) {
  # Find lambdas
  lambda1se <- object$lambda.1se
  # Determine where lambdas fall in path
  which1se <- which(object$lambda == lambda1se)
  # Create data frame with selected lambdas and corresponding error
  data.frame(lambda.1se = lambda1se, cv.1se = object$cvm[which1se])
}
# Run glmnet
cl <- makeCluster(detectCores())
registerDoParallel(cl)
enet <- foreach(i = 1:10,
                .inorder = FALSE,
                .multicombine = TRUE,
                .packages = "glmnetUtils") %dopar% {
  cv <- cva.glmnet(x, y,
                   foldid = folds,
                   alpha = alpha,
                   family = "binomial",
                   parallel = TRUE)
}
stopCluster(cl)
# Extract smallest CV error and lambda at each alpha for each iteration of 10-fold CV
# Calculate means (across iterations) of lowest CV error and associated lambdas for each alpha
cv.rep1 <- ldply(enet[[1]]$modlist, extractGlmnetInfo)
cv.rep2 <- ldply(enet[[2]]$modlist, extractGlmnetInfo)
cv.rep3 <- ldply(enet[[3]]$modlist, extractGlmnetInfo)
cv.rep4 <- ldply(enet[[4]]$modlist, extractGlmnetInfo)
cv.rep5 <- ldply(enet[[5]]$modlist, extractGlmnetInfo)
cv.rep6 <- ldply(enet[[6]]$modlist, extractGlmnetInfo)
cv.rep7 <- ldply(enet[[7]]$modlist, extractGlmnetInfo)
cv.rep8 <- ldply(enet[[8]]$modlist, extractGlmnetInfo)
cv.rep9 <- ldply(enet[[9]]$modlist, extractGlmnetInfo)
cv.rep10 <- ldply(enet[[10]]$modlist, extractGlmnetInfo)
cv.rep <- bind_rows(cv.rep1, cv.rep2, cv.rep3, cv.rep4, cv.rep5, cv.rep6, cv.rep7, cv.rep8, cv.rep9, cv.rep10)
cv.rep <- data.frame(cbind(alpha, cv.rep))
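For step (c), the per-alpha averages across the repetitions could then be taken along these lines (a sketch using dplyr, which is loaded above; it assumes the cv.rep data frame just built):
cv.mean <- cv.rep %>%
  group_by(alpha) %>%
  summarise(lambda.1se = mean(lambda.1se),
            cv.1se = mean(cv.1se))
cv.mean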
Questions
1. My understanding is that the folds should be fixed when cross-validating over alpha. Should I therefore call set.seed() multiple times to generate different folds for each repetition, and run each repetition separately rather than looping over them? For example:
# Set folds for first iteration
set.seed(1)
folds1 <- sample(1:10, size = length(y), replace = TRUE)
# Run first iteration
enet1 <- cva.glmnet(x, y,
                    foldid = folds1,
                    alpha = alpha,
                    family = "binomial")
# Set folds for second iteration
set.seed(2)
folds2 <- sample(1:10, size = length(y), replace = TRUE)
# Run second iteration
enet2 <- cva.glmnet(x, y,
                    foldid = folds2,
                    alpha = alpha,
                    family = "binomial")
Or is there a way to fix the folds within each repetition but still loop over the repetitions, thereby making use of parallel processing? (See the sketch after these questions.)
2. Re: the option presented in 1., how do I determine which configuration of folds I should use to fit the final model with the optimal combination of alpha and lambda? Is the decision arbitrary?
NB. I am not using caret for this specific task.
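For reference, a minimal sketch of the second option: folds fixed within each repetition but varied across repetitions, generated up front so the loop can still run in parallel (object names and seeds here are illustrative, not from the question):
# One fold vector per repetition, generated before the parallel loop
set.seed(1)
foldsList <- replicate(10, sample(1:10, size = length(y), replace = TRUE),
                       simplify = FALSE)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
enet <- foreach(f = foldsList,
                .packages = "glmnetUtils") %dopar% {
  cva.glmnet(x, y,
             foldid = f,
             alpha = alpha,
             family = "binomial")
}
stopCluster(cl)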

Related

R: Multiclass Matrices

I am working with the R programming language. I am trying to learn how to make a "confusion matrix" for multiclass variables (e.g. How to construct the confusion matrix for a multi class variable).
Suppose I generate some data and fit a decision tree model:
# load libraries
library(rpart)
library(caret)
# generate data
a <- rnorm(1000, 10, 10)
b <- rnorm(1000, 10, 5)
d <- rnorm(1000, 5, 10)
group_1 <- sample(LETTERS[1:3], 1000, replace = TRUE, prob = c(0.33, 0.33, 0.34))
e <- data.frame(a, b, d, group_1)
e$group_1 <- as.factor(e$group_1)   # note: e$group_1, not d$group_1 (d is a numeric vector)
# split data into train and test set
trainIndex <- createDataPartition(e$group_1, p = .8,
                                  list = FALSE,
                                  times = 1)
training <- e[trainIndex,]
test <- e[-trainIndex,]
fitControl <- trainControl(## 5-fold CV
                           method = "repeatedcv",
                           number = 5,
                           ## repeated once
                           repeats = 1)
# fit decision tree model
TreeFit <- train(group_1 ~ ., data = training,
                 method = "rpart2",
                 trControl = fitControl)
From here, I am able to store the results into a "confusion matrix":
pred <- predict(TreeFit,test)
table_example <- table(pred,test$group_1)
This satisfies my requirements - but this "table" requires me to manually calculate the different accuracy metrics of "A", "B" and "C" (as well as the total accuracy).
My question: Is it possible to use the caret::confusionMatrix() command for this problem?
e.g.
pred <- predict(TreeFit, test, type = "prob")
labels_example <- as.factor(ifelse(pred[,2]>0.5, "1", "0"))
con <- confusionMatrix(labels_example, test$group_1)
This way, I would be able to directly access the accuracy measurements from the confusion matrix. E.g. metric = con$overall[1]
Thanks
Is this what you're looking for?
pred <- predict(TreeFit, test)
con <- confusionMatrix(data = pred, reference = test$group_1)
con
con$overall[1]
Same output as in:
table(pred, test$group_1)
Plus accuracy metrics.
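If you also need the per-class metrics for "A", "B" and "C" without computing them by hand, they are stored in the confusion matrix object as well; a small follow-up using the con object above:
# Per-class metrics: one row per class (Sensitivity, Specificity, Precision, ...)
con$byClass
# e.g. sensitivity for class "A"
con$byClass["Class: A", "Sensitivity"]
# overall accuracy and kappa
con$overall[c("Accuracy", "Kappa")]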

How to find the optimal value for K in K-nearest neighbors using R?

My dataset contains 5851 observations and is split into a train set (3511 observations) and a test set (2340 observations). I now want to train a KNN model using two variables. I want to do 10-fold CV, repeated 5 times, using the ROC metric and the one-standard-error rule, with the variables preprocessed (centered and scaled). The code is shown below.
library(caret)
set.seed(44780)
ctrl_repcvSE <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                             summaryFunction = twoClassSummary, classProbs = TRUE,
                             selectionFunction = "oneSE")
tune_grid <- expand.grid(k = 45:75)
mod4 <- train(purchased ~ total_policies + total_contrib,
              data = mhomes_train, method = "knn",
              trControl = ctrl_repcvSE, metric = "ROC",
              tuneGrid = tune_grid, preProcess = c("center", "scale"))
The problem is that I have already tried many different ranges for K (e.g., K = 10:20, 30:40, 50:60, 150:160) as well as different tuning lengths. However, every time the output says that the chosen value for K is the last one in the range; for example, for K = 70:80 the chosen value is K = 80, every single time. This suggests I should look further, because if the boundary value is always chosen there are probably better values of K above 80. How do I eventually find that value?
The assignment only specifies: For k-nearest neighbours, explore reasonable values of k using the total_policies and total_contrib variables only.
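For reference, one way to see where the metric actually levels off is to widen the grid and inspect the resampling profile; a sketch only, assuming the mhomes_train data and the ctrl_repcvSE object defined above:
# Widen the grid and look at the ROC profile over k
tune_grid_wide <- expand.grid(k = seq(5, 305, by = 10))
mod_wide <- train(purchased ~ total_policies + total_contrib,
                  data = mhomes_train, method = "knn",
                  trControl = ctrl_repcvSE, metric = "ROC",
                  tuneGrid = tune_grid_wide,
                  preProcess = c("center", "scale"))
plot(mod_wide)      # ROC as a function of k
mod_wide$results    # full resampling table
mod_wide$bestTune   # k chosen under the one-SE rule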
Welcome to Stack Overflow. Your question isn't easy to answer.
For k-nearest neighbours I use another function, knn3, which is part of the caret library.
I'll give an example using the iris dataset: we compute the accuracy of the model for different values of k and plot those accuracies.
library(data.table)
library(tidyverse)
library(scales)
library(caret)
dt <- as.data.table(iris)
# converting and scaling data ----
dt$Species <- dt$Species %>% as.factor()
dt$Sepal.Length <- dt$Sepal.Length %>% scale()
dt$Sepal.Width <- dt$Sepal.Width %>% scale()
dt$Petal.Length <- dt$Petal.Length %>% scale()
dt$Petal.Width <- dt$Petal.Width %>% scale()
# remove in the real run ----
set.seed(1234567)
# split data into train and test - 3:1 ----
train_index <- createDataPartition(dt$Species, p = 0.75, list = FALSE)
train <- dt[train_index, ]
test <- dt[-train_index, ]
# values to check for k ----
K_VALUES <- 20:1
test_acc <- numeric(0)
train_acc <- numeric(0)
# calculate different models for each value of k ----
for (x in K_VALUES) {
  model <- knn3(Species ~ ., data = train, k = x)
  pred_test <- predict(model, test, type = "class")
  pred_test_acc <- confusionMatrix(table(pred_test,
                                         test$Species))$overall["Accuracy"]
  test_acc <- c(test_acc, pred_test_acc)
  pred_train <- predict(model, train, type = "class")
  pred_train_acc <- confusionMatrix(table(pred_train,
                                          train$Species))$overall["Accuracy"]
  train_acc <- c(train_acc, pred_train_acc)
}
data <- data.table(x = K_VALUES, train = train_acc, test = test_acc)
# plot a validation curve ----
plot_data <- gather(data, "type", "value", -x)
g <- qplot(x = x,
           y = value,
           data = plot_data,
           color = type,
           geom = "path",
           xlim = c(max(K_VALUES), min(K_VALUES) - 1))
print(g)
Now find a k with a good accuracy for your test data. That's the value you're looking for.
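To read that value off the result table rather than the plot, something like this could be appended (using the data table built above):
# k with the highest test-set accuracy
best_k <- data$x[which.max(data$test)]
best_k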
Disclosure: this is simplified, but the approach should help you solve your problem.

XGBoost custom loss function cannot reproduce binary:logistic objective

I am working in RStudio and am looking to develop a custom objective function for XGBoost. To make sure I have understood how the process works, I have tried to write an objective function which reproduces the "binary:logistic" objective. However, my custom objective function yields significantly different results (often a lot worse).
Based on the examples in the XGBoost GitHub repo, my custom objective function looks like this:
# custom objective function
logloss <- function(preds, dtrain){
  # Get the labels
  labels <- getinfo(dtrain, "label")
  # Apply logistic transform to predictions
  preds <- 1 / (1 + exp(-preds))
  # Find gradient and hessian
  grad <- preds - labels
  hess <- preds * (1 - preds)
  return(list("grad" = grad, "hess" = hess))
}
Based on this Medium blog post, this seems to match what is implemented in XGBoost's binary objective.
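For reference, those expressions do follow from differentiating the log loss with respect to the raw score (a quick check, not part of the original post): with $p = \sigma(x) = 1/(1+e^{-x})$ and label $y \in \{0, 1\}$,

$$\ell(x, y) = -\bigl[y \log p + (1 - y) \log(1 - p)\bigr], \qquad \frac{\partial \ell}{\partial x} = p - y, \qquad \frac{\partial^2 \ell}{\partial x^2} = p(1 - p),$$

which match grad and hess in the function above.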
Using some simple test data, my final training RMSE for the built-in objective is ~0.468, while with my custom objective it is ~0.72.
The code below can be used to generate the test data and reproduce the problem.
Can somebody explain why my code does not reproduce the behaviour of the "binary:logistic" objective? I am using the XGBoost R package v0.90.0.2.
library(data.table)
library(xgboost)
# Generate test data
generate_test_data <- function(n_rows = 1e5, feature_count = 5, train_fraction = 0.5){
  # Make targets
  test_data <- data.table(
    target = sign(runif(n = n_rows, min = -1, max = 1))
  )
  # Add feature columns. These are normally distributed and shifted by the target
  # in order to create a noisy signal
  for(feature in 1:feature_count){
    # Randomly create features of the noise
    mu <- runif(1, min = -1, max = 1)
    sdev <- runif(1, min = 5, max = 10)
    # Create noisy signal
    test_data[, paste0("feature_", feature) := rnorm(
      n = n_rows, mean = mu, sd = sdev) * target + target]
  }
  # Split data into test/train
  test_data[, index_fraction := .I / .N]
  split_data <- list(
    "train" = test_data[index_fraction < (train_fraction)],
    "test" = test_data[index_fraction >= (train_fraction)]
  )
  # Make vector of feature names
  feature_names <- paste0("feature_", 1:feature_count)
  # Make test/train matrix and labels
  split_data[["test_trix"]] <- as.matrix(split_data$test[, feature_names, with = FALSE])
  split_data[["train_trix"]] <- as.matrix(split_data$train[, feature_names, with = FALSE])
  split_data[["test_labels"]] <- as.logical(split_data$test$target + 1)
  split_data[["train_labels"]] <- as.logical(split_data$train$target + 1)
  return(split_data)
}
# Build the tree
build_model <- function(split_data, objective){
  # Make evaluation matrix
  train_dtrix <- xgb.DMatrix(
    data = split_data$train_trix, label = split_data$train_labels)
  # Train the model
  model <- xgb.train(
    data = train_dtrix,
    watchlist = list(train = train_dtrix),
    nrounds = 5,
    objective = objective,
    eval_metric = "rmse"
  )
  return(model)
}
split_data <- generate_test_data()
cat("\nUsing built-in binary:logistic objective.\n")
test_1 <- build_model(split_data, "binary:logistic")
cat("\n\nUsing custom objective")
test_2 <- build_model(split_data, logloss)
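One thing worth checking, as a diagnostic rather than a definitive answer: with a custom objective, predict() returns raw margin scores, whereas "binary:logistic" returns probabilities, so the training RMSE values above are, as far as I understand, computed on different scales. A sketch, reusing the objects created above, to compare the two on the probability scale:
p_builtin <- predict(test_1, split_data$test_trix)       # probabilities from binary:logistic
margin_custom <- predict(test_2, split_data$test_trix)   # raw scores from the custom objective
p_custom <- 1 / (1 + exp(-margin_custom))                # apply the logistic transform manually
summary(p_builtin)
summary(p_custom)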

Meaning: improvement in RMSE during crossvalidation, although not on testset?

In the code below I train a NN with cross-validation on the first 20,000 records in the dataset. The dataset contains 8 predictors.
First I split my data into 2 parts:
the first 20,000 rows (train set)
and the last 4003 rows (out-of-sample test set)
I did 2 runs:
run 1) a run with 3 predictors
run 2) a run with all 8 predictors (see code below).
Based on cross-validation within the 20,000 rows of the train set, the RMSE (for the optimal parameter setting) improves from 2.30 (run 1) to 2.11 (run 2).
However, when I test both models on the 4003 rows of the out-of-sample test set, the RMSE improves only negligibly, from 2.64 (run 1) to 2.63 (run 2).
What can be concluded from this contradiction in the results?
Thanks!
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Chapter 7: Non-Linear Regression Models
### Required packages: AppliedPredictiveModeling, caret, doMC (optional),
### earth, kernlab, lattice, nnet
################################################################################
library(caret)
### Load the data
mydata <- read.csv(file="data.csv", header=TRUE, sep=",")
validatiex <- mydata[20001:24003,c(1:8)]
validatiey <- mydata[20001:24003,9]
mydata <- mydata[1:20000,]
x <- mydata[,c(1:8)]
y <- mydata[,9]
parti <- createDataPartition(y, times = 1, p=0.8, list = FALSE)
x_train <- x[parti,]
x_test <- x[-parti,]
y_train <- y[parti]
y_test <- y[-parti]
set.seed(100)
indx <- createFolds(y_train, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)
## train neural net:
nnetGrid <- expand.grid(decay = c(.1),
                        size = c(5, 15, 30),
                        bag = FALSE)
set.seed(100)
nnetTune <- train(x = x_train, y = y_train,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 30 * (ncol(x_train) + 1) + 30 + 1,
                  maxit = 1000,
                  repeats = 25,
                  allowParallel = FALSE)
nnetTune
plot(nnetTune)
predictions <- predict(nnetTune, validatiex, type = "raw")
rmse <- sqrt(mean((validatiey - predictions)^2))  # RMSE on the out-of-sample test set
print(rmse)

Set seed with cv.glmnet paralleled gives different results in R

I'm running parallel cv.glmnet from the glmnet package on over 1000 data sets. In each run I set the seed to make the results reproducible. What I've noticed is that my results differ: when I run the code on the same day the results are the same, but the next day they differ.
Here is my code:
model <- function(path, file, wyniki, faktor = 0.75) {
  set.seed(2)
  dane <- read.csv(file)
  n <- nrow(dane)
  podzial <- 1:floor(faktor * n)
  ########## GLMNET ############
  nFolds <- 3
  train_sparse <- dane[podzial,]
  test_sparse <- dane[-podzial,]
  # fit with cross-validation
  tryCatch({
    wart <- c(rep(0, 6), "nie")
    model <- cv.glmnet(train_sparse[,-1], train_sparse[,1], nfolds = nFolds, standardize = FALSE)
    pred <- predict(model, test_sparse[,-1], type = "response", s = model$lambda.min)
    # fetch of AUC value
    aucp1 <- roc(test_sparse[,1], pred)$auc
  }, error = function(e) print("error"))
  results <- data.frame(auc = aucp1, n = nrow(dane))
  write.table(results, wyniki, sep = ',', append = TRUE, row.names = FALSE, col.names = FALSE)
}
path <- path_to_files
files <- list.files(path, full.names = TRUE, recursive = TRUE)
wyniki <- "wyniki_adex__samplingfalse_decl_201512.csv"
library('doSNOW')
library('parallel')
# number of threads
threads <- 5
# register the worker threads
cl <- makeCluster(threads, outfile = "")
registerDoSNOW(cl)
message("Loading packages on threads...")
clusterEvalQ(cl,library(pROC))
clusterEvalQ(cl,library(ROCR))
clusterEvalQ(cl,library(glmnet))
clusterEvalQ(cl,library(stringi))
message("Modelling...")
foreach(i = 1:length(files)) %dopar% {
  print(i)
  model(path, files[i], wyniki)
}
Does anyone know what the cause is?
I'm running CentOS Linux release 7.0.1406 (Core) / Red Hat 4.8.2-16
Found the answer in the documentation of the cv.glmnet function:
Note also that the results of cv.glmnet are random, since the folds
are selected at random.
The solution is to set the folds manually so that they are not chosen at random:
nFolds <- 3
foldid <- sample(rep(seq(nFolds), length.out = nrow(train_sparse)))
model <- cv.glmnet(x = as.matrix(train_sparse[,-1]),
                   y = train_sparse[,1],
                   nfolds = nFolds,
                   foldid = foldid,
                   standardize = FALSE)
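As an aside, not part of the original answer: if any randomness remains on the R side of the workers, the foreach loop itself can be made reproducible with the doRNG package, which gives every iteration its own seeded RNG stream. A sketch:
library(doSNOW)
library(doRNG)
cl <- makeCluster(5, outfile = "")
registerDoSNOW(cl)
registerDoRNG(2)   # reproducible RNG streams for the subsequent %dopar% loop
foreach(i = 1:length(files)) %dopar% {
  print(i)
  model(path, files[i], wyniki)
}
stopCluster(cl)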
According to Writing R Extensions, a C wrapper is needed to call R's normal random numbers from FORTRAN. I don't see any C code in the glmnet source. I'm afraid it doesn't look implemented:
6.6 Calling C from FORTRAN and vice versa
