I'm running cv.glmnet from the glmnet package in parallel on over 1000 data sets. In each run I set the seed to make the results reproducible. What I've noticed is that my results differ: when I run the code twice on the same day, the results are the same, but the next day they differ.
Here is my code:
model <- function(path, file, wyniki, faktor = 0.75) {
set.seed(2)
dane <- read.csv(file)
n <- nrow(dane)
podzial <- 1:floor(faktor*n)
########## GLMNET ############
nFolds <- 3
train_sparse <- dane[podzial,]
test_sparse <- dane[-podzial,]
# fit with cross-validation
tryCatch({
wart <- c(rep(0,6), "nie")  # fallback values (unused in this excerpt)
model <- cv.glmnet(train_sparse[,-1], train_sparse[,1], nfolds = nFolds, standardize = FALSE)
pred <- predict(model, test_sparse[,-1], type = "response", s = model$lambda.min)
# extract the AUC value
aucp1 <- roc(test_sparse[,1], as.vector(pred))$auc
}, error = function(e) print("error"))
results <- data.frame(auc = aucp1, n = nrow(dane))
write.table(results, wyniki, sep=',', append=TRUE,row.names =FALSE,col.names=FALSE)
}
path <- path_to_files
files <- list.files(path, full.names = TRUE, recursive = TRUE)
wyniki <- "wyniki_adex__samplingfalse_decl_201512.csv"
library('doSNOW')
library('parallel')
# number of worker threads
threads <- 5
# create the cluster and register it with foreach
cl <- makeCluster(threads, outfile="")
registerDoSNOW(cl)
message("Loading packages on threads...")
clusterEvalQ(cl,library(pROC))
clusterEvalQ(cl,library(ROCR))
clusterEvalQ(cl,library(glmnet))
clusterEvalQ(cl,library(stringi))
message("Modelling...")
foreach(i = 1:length(files)) %dopar% {
print(i)
model(path, files[i], wyniki)
}
Does anyone know what the cause is?
I'm running CentOS Linux release 7.0.1406 (Core) / Red Hat 4.8.2-16
Found the answer in the documentation of the cv.glmnet function:
Note also that the results of cv.glmnet are random, since the folds
are selected at random.
The solution is to set the folds manually so that they are not chosen at random:
nFolds <- 3
foldid <- sample(rep(seq(nFolds), length.out = nrow(train_sparse)))
model <- cv.glmnet(x = as.matrix(train_sparse[,-1]),
                   y = train_sparse[,1],
                   nfolds = nFolds,
                   foldid = foldid,
                   standardize = FALSE)
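Note that sample() itself draws from R's random number generator, so for the folds to be identical across sessions you still need to fix the seed immediately before generating foldid. A minimal sketch (reusing the seed from the model function above):

set.seed(2)
foldid <- sample(rep(seq(nFolds), length.out = nrow(train_sparse)))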
According to Writing R Extensions, a C wrapper is needed to call R's normal random numbers from FORTRAN. I don't see any such C code in the glmnet source, so I'm afraid this doesn't look implemented:
6.6 Calling C from FORTRAN and vice versa
I am working in RStudio and am looking to develop a custom objective function for XGBoost. In order to make sure I have understood how the process works, I have tried to write an objective function which reproduces the "binary:logistic" objective. However, my custom objective function yields significantly different results (often a lot worse).
Based on the examples on the XGBoost github repo my custom objective function looks like this:
# custom objective function
logloss <- function(preds, dtrain){
  # Get the labels
  labels <- getinfo(dtrain, "label")
  # Apply the logistic transform to the predictions
  preds <- 1/(1 + exp(-preds))
  # Find the gradient and hessian
  grad <- preds - labels
  hess <- preds * (1 - preds)
  return(list("grad" = grad, "hess" = hess))
}
Based on this Medium blog post, this seems to match what is implemented in XGBoost's binary objective.
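As a quick sanity check on that claim (my own sketch, not part of the original question; sigmoid and logloss_value are illustrative helpers), the analytic gradient sigmoid(x) - y can be compared against a numerical derivative of the log loss:

# numerical vs. analytic gradient of L(x) = -(y*log(p) + (1-y)*log(1-p)), p = sigmoid(x)
sigmoid <- function(x) 1/(1 + exp(-x))
logloss_value <- function(x, y) -(y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x)))
x <- c(-2, 0.5, 3); y <- c(1, 0, 1); eps <- 1e-6
numeric_grad <- (logloss_value(x + eps, y) - logloss_value(x - eps, y))/(2*eps)
analytic_grad <- sigmoid(x) - y
all.equal(numeric_grad, analytic_grad, tolerance = 1e-4)  # should be TRUE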
Using some simple test data, my final training-rmse for the built-in objective is ~0.468 and using my custom objective it is ~0.72.
The code below can be used to generate test data and reproduce the problem.
Can somebody explain why my code does not reproduce the behaviour of the "binary:logistic" objective? I am using the XGBoost R package v0.90.0.2.
library(data.table)
library(xgboost)
# Generate test data
generate_test_data <- function(n_rows = 1e5, feature_count = 5, train_fraction = 0.5){
# Make targets
test_data <- data.table(
target = sign(runif(n = n_rows, min=-1, max=1))
)
# Add feature columns. These are normally distributed and shifted by the target
# in order to create a noisy signal
for(feature in 1:feature_count){
# Randomly draw the noise parameters for this feature
mu <- runif(1, min=-1, max=1)
sdev <- runif(1, min=5, max=10)
# Create noisy signal
test_data[, paste0("feature_", feature) := rnorm(
n=n_rows, mean = mu, sd = sdev)*target + target]
}
# Split data into test/train
test_data[, index_fraction := .I/.N]
split_data <- list(
"train" = test_data[index_fraction < (train_fraction)],
"test" = test_data[index_fraction >= (train_fraction)]
)
# Make vector of feature names
feature_names <- paste0("feature_", 1:feature_count)
# Make test/train matrix and labels
split_data[["test_trix"]] <- as.matrix(split_data$test[, feature_names, with=FALSE])
split_data[["train_trix"]] <- as.matrix(split_data$train[, feature_names, with=FALSE])
split_data[["test_labels"]] <- as.logical(split_data$test$target + 1)
split_data[["train_labels"]] <- as.logical(split_data$train$target + 1)
return(split_data)
}
# Build the tree
build_model <- function(split_data, objective){
# Make evaluation matrix
train_dtrix <-
xgb.DMatrix(
data = split_data$train_trix, label = split_data$train_labels)
# Train the model
model <- xgb.train(
data = train_dtrix,
watchlist = list(
train = train_dtrix),
nrounds = 5,
objective = objective,
eval_metric = "rmse"
)
return(model)
}
split_data <- generate_test_data()
cat("\nUsing built-in binary:logistic objective.\n")
test_1 <- build_model(split_data, "binary:logistic")
cat("\n\nUsing custom objective")
test_2 <- build_model(split_data, logloss)
I have this code for a multiclass classification problem:
data$Class = as.factor(data$Class)
levels(data$Class) <- make.names(levels(factor(data$Class)))
trainIndex <- createDataPartition(data$Class, p = 0.6, list = FALSE, times=1)
trainingSet <- data[ trainIndex,]
testingSet <- data[-trainIndex,]
train_x <- trainingSet[, -ncol(trainingSet)]
train_y <- trainingSet$Class
testing_x <- testingSet[, -ncol(testingSet)]
testing_y <- testingSet$Class
oneRM <- OneR(trainingSet, verbose = TRUE)
oneRM
summary(oneRM)
plot(oneRM)
oneRM_pred <- predict(oneRM, testing_x)
oneRM_pred
eval_model(oneRM_pred, testing_y)
AUC_oneRM_pred <- auc(roc(oneRM_pred,testing_y))
cat ("AUC=", oneRM_pred)
# Recall-Precision curve
oneRM_prediction <- prediction(oneRM_pred, testing_y)
RP.perf <- performance(oneRM_prediction, "tpr", "fpr")
plot (RP.perf)
plot(roc(oneRM_pred,testing_y))
But the code does not work; after this line:
oneRM_prediction <- prediction(oneRM_pred, testing_y)
I get this error:
Error in prediction(oneRM_pred, testing_y) : Format of predictions is
invalid.
In addition, I don't know how I can easily get the F1-measure.
Finally, a question, does it make sense to calculate AUC in a multi-class classification problem?
Let's start with F1.
Assuming that you are using the iris dataset, first, we need to load everything, train the model and perform the predictions as you did.
library(datasets)
library(caret)
library(OneR)
library(pROC)
trainIndex <- createDataPartition(iris$Species, p = 0.6, list = FALSE, times=1)
trainingSet <- iris[ trainIndex,]
testingSet <- iris[-trainIndex,]
train_x <- trainingSet[, -ncol(trainingSet)]
train_y <- trainingSet$Species
testing_x <- testingSet[, -ncol(testingSet)]
testing_y <- testingSet$Species
oneRM <- OneR(trainingSet, verbose = TRUE)
oneRM_pred <- predict(oneRM, testing_x)
Then, you should calculate the precision, recall, and F1 for each class.
cm <- as.matrix(confusionMatrix(oneRM_pred, testing_y))
n = sum(cm) # number of instances
nc = nrow(cm) # number of classes
rowsums = apply(cm, 1, sum) # number of predictions per class (rows are predictions)
colsums = apply(cm, 2, sum) # number of instances per class (columns are the reference)
diag = diag(cm) # number of correctly classified instances per class
precision = diag / rowsums
recall = diag / colsums
f1 = 2 * precision * recall / (precision + recall)
print(" ************ Confusion Matrix ************")
print(cm)
print(" ************ Diag ************")
print(diag)
print(" ************ Precision/Recall/F1 ************")
print(data.frame(precision, recall, f1))
After that, you are able to find the macro F1.
macroPrecision = mean(precision)
macroRecall = mean(recall)
macroF1 = mean(f1)
print(" ************ Macro Precision/Recall/F1 ************")
print(data.frame(macroPrecision, macroRecall, macroF1))
To find the ROC (more precisely, the AUC), it is best to use the pROC library. Its multiclass.roc computes the multi-class AUC as defined by Hand and Till (2001), so the measure does extend to multi-class problems.
print(" ************ AUC ************")
roc.multi <- multiclass.roc(testing_y, as.numeric(oneRM_pred))
print(auc(roc.multi))
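As an aside (my addition, not from the original answer): the object returned by multiclass.roc keeps the underlying pairwise roc objects in $rocs, so you can also plot the individual curves:

rs <- roc.multi[["rocs"]]  # one roc object per pair of classes
plot.roc(rs[[1]])
sapply(2:length(rs), function(i) lines.roc(rs[[i]], col = i))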
Hope that helps.
Find details on this link for F1 and this for AUC.
If I use levels(oneRM_pred) <- levels(testing_y) in this way:
...
oneRM <- OneR(trainingSet, verbose = TRUE)
oneRM
summary(oneRM)
plot(oneRM)
oneRM_pred <- predict(oneRM, testing_x)
levels(oneRM_pred) <- levels(testing_y)
...
The accuracy is much lower than before, so I am not sure that forcing the same levels is a good solution.
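A likely explanation (my note, not part of the original exchange): levels<- merely relabels the existing levels in their current order, so if oneRM_pred's levels are ordered differently from testing_y's, each prediction is silently remapped to a different class. A safer way to align the levels without changing what each prediction means:

# re-encode via character values so each prediction keeps its class
oneRM_pred <- factor(as.character(oneRM_pred), levels = levels(testing_y))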
I am trying to run repeated 10-fold CV (alpha and lambda) using glmnet / glmnetUtils. My proposed workflow is to:
a) fit a proposed model at 11 values of alpha,
b) run the process X (in this case, 10) times,
c) average the results, and
d) fit a final model with the best combination of alpha and lambda (s = "lambda.1se").
To address a-c, I used the code below; however, the results from the 10 iterations are exactly the same.
library(glmnet)
library(glmnetUtils)
library(doParallel)
library(plyr)  # for ldply()
library(dplyr) # for bind_rows()
data(BinomialExample)
# Create alpha sequence; fix folds
alpha <- seq(.5, 1, .05)
set.seed(1)
folds <- sample(1:10, size = length(y), replace = TRUE)
# Determine optimal combination of alpha and lambda; extract lowest CV error and associated lambda at each alpha
extractGlmnetInfo <- function(object)
{
# Find lambdas
lambda1se <- object$lambda.1se
# Determine where lambdas fall in path
which1se <- which(object$lambda == lambda1se)
# Create data frame with selected lambdas and corresponding error
data.frame(lambda.1se = lambda1se, cv.1se = object$cvm[which1se])
}
#Run glmnet
cl <- makeCluster(detectCores())
registerDoParallel(cl)
enet <- foreach(i = 1:10,
.inorder = FALSE,
.multicombine = TRUE,
.packages = "glmnetUtils") %dopar%
{
cv <- cva.glmnet(x, y,
foldid = folds,
alpha = alpha,
family = "binomial",
parallel = TRUE)
}
stopCluster(cl)
# Extract smallest CV error and lambda at each alpha for each iteration of 10-fold CV
# Calculate means (across iterations) of lowest CV error and associated lambdas for each alpha
cv.rep <- bind_rows(lapply(enet, function(run) ldply(run$modlist, extractGlmnetInfo)))
cv.rep <- data.frame(cbind(alpha, cv.rep))
Questions
My understanding is that the folds should be fixed when cross-validating over alpha. Therefore, should I set.seed() multiple times to generate different folds for each iteration and run each iteration separately, rather than looping over them? For example:
# Set folds for first iteration
set.seed(1)
folds1 <- sample(1:10, size = length(y), replace = TRUE)
# Run first iteration
enet1 <- cva.glmnet(x, y,
foldid = folds1,
alpha = alpha,
family = "binomial")
# Set folds for second iteration
set.seed(2)
folds2 <- sample(1:10, size = length(y), replace = TRUE)
# Run second iteration
enet2 <- cva.glmnet(x, y,
foldid = folds2,
alpha = alpha,
family = "binomial")
Or is there a way to fix the folds and loop over the iterations, thereby making use of parallel processing?
Re: the option presented in 1., how do I determine which configuration of folds I should use to fit the final model using the optimal combination of alpha and lambda? Is the decision arbitrary?
NB. I am not using caret for this specific task.
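For what it's worth, a minimal sketch of the looped variant raised in question 1 (my illustration, not an accepted answer): pre-generate one foldid vector per repeat, then let each iteration use its own fixed folds.

# one reproducible fold assignment per repeat
set.seed(1)
fold_list <- lapply(1:10, function(i) sample(1:10, size = length(y), replace = TRUE))
cl <- makeCluster(detectCores())
registerDoParallel(cl)
enet <- foreach(i = 1:10, .packages = "glmnetUtils") %dopar% {
  cva.glmnet(x, y,
             foldid = fold_list[[i]], # folds differ across repeats but are fixed within one
             alpha = alpha,
             family = "binomial")
}
stopCluster(cl)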
I am trying to train a neural network for churn prediction with R package neuralnet. Here is the code:
data <- read.csv('C:/PredictChurn.csv')
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
scaled_temp <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
scaled <- data
scaled[, -c(1)] <- scaled_temp[, -c(1)]
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train_ <- scaled[index,]
test_ <- scaled[-index,]
library(neuralnet)
n <- names(train_[, -c(1)])
f <- as.formula(paste("CHURNED_F ~", paste(n[!n %in% "CHURNED_F"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5),linear.output=F)
It works as it should; however, when training with the full data set (in the range of millions of rows) it just takes too long. R is single-threaded by default, so I have tried researching how to parallelize the work across all the cores. Is it even possible to run this function in parallel? I have tried various packages with no success.
Has anyone been able to do this?
It doesn't have to be the neuralnet package, any solution that lets me train a neural network would work.
Thank you
I have had good experiences with the package Rmpi, and it may be applicable in your case too.
Briefly, its usage is as follows:
library(Rmpi)
nproc <- 4 # could be determined automatically
# Spawn one master and nproc-1 slaves
Rmpi::mpi.spawn.Rslaves(nslaves = nproc - 1)
# Execute "func_to_be_parallelized" on multiple CPUs; pass two variables to the function
my_fast_results <- Rmpi::mpi.parLapply(var1_passed_to_func,
                                       func_to_be_parallelized,
                                       var2_passed_to_func)
# Close the slaves
Rmpi::mpi.close.Rslaves(dellog = TRUE)
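Applied to the question (a hedged sketch, assuming train_ and f exist as defined above), note that Rmpi parallelizes across independent calls, for example one neuralnet fit per candidate hidden-layer size, rather than speeding up a single fit:

library(Rmpi)
Rmpi::mpi.spawn.Rslaves(nslaves = 3)
Rmpi::mpi.bcast.cmd(library(neuralnet))
# ship the training data and formula to the slaves
Rmpi::mpi.bcast.Robj2slave(train_)
Rmpi::mpi.bcast.Robj2slave(f)
fits <- Rmpi::mpi.parLapply(c(3, 5, 7), function(h)
  neuralnet(f, data = train_, hidden = h, linear.output = FALSE))
Rmpi::mpi.close.Rslaves(dellog = TRUE)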
You can try using the caret and doParallel packages for this. This is what I have been using. It works for some model types but may not work for all.
library(caret)
library(doParallel)
library(tictoc) # for tic()/toc() timing

layer1 = c(6,12,18,24,30)
layer2 = c(6,12,18,24,30)
layer3 = c(6,12,18,24,30)
cv.folds = 5
# In order to make models fully reproducible when using parallel processing, we need to pass seeds as a parameter
# https://stackoverflow.com/questions/13403427/fully-reproducible-parallel-models-using-caret
total.param.permutations = length(layer1) * length(layer2) * length(layer3)
seeds <- vector(mode = "list", length = cv.folds + 1)
set.seed(1)
# draw from a wide range so each model fit within each fold gets its own seed
for(i in 1:cv.folds) seeds[[i]] <- sample.int(n = .Machine$integer.max, size = total.param.permutations, replace = TRUE)
seeds[[cv.folds + 1]] <- sample.int(n = .Machine$integer.max, size = 1) # for the last model
nn.grid <- expand.grid(layer1 = layer1, layer2 = layer2, layer3 = layer3)
cl <- makeCluster(max(1, floor(detectCores() * 0.5))) # use 50% of cores only, leave rest for other tasks
registerDoParallel(cl)
train_control <- caret::trainControl(method = "cv"
                                     ,number = cv.folds
                                     ,seeds = seeds # user defined seeds for parallel processing
                                     ,verboseIter = TRUE
                                     ,allowParallel = TRUE
                                     )
tic("Total Time to NN Training: ")
set.seed(1)
# formula and scaled.train.data are assumed to be defined by the user
model.nn.caret = caret::train(form = formula,
                              data = scaled.train.data,
                              method = 'neuralnet',
                              tuneGrid = nn.grid,
                              trControl = train_control
                              )
toc()
# shut the cluster down only after training has finished
stopCluster(cl)
registerDoSEQ()
I want to run 150 multiple imputations using mice in R. However, in order to save some computing time, I would like to subdivide the process into parallel streams (as suggested by Stef van Buuren in "Flexible Imputation of Missing Data").
My question is: how to do that?
I can imagine 2 options:
opt.1:
imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
and then combine the imputations afterwards using complete() and as.mids()
opt.2:
imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
where VAL_1to150 takes a different value (1 to 150) in each call; otherwise it seems to me (I may be wrong) that if they all run with the same data set and the same seed, you will get the same result 150 times.
Are there any other options?
Thanks
So the main problem is combining the imputations, and as I see it there are three options: using ibind, using complete as described, or trying to keep the mids structure. I strongly suggest the ibind solution; the others are left in the answer for those curious.
Get parallel results
Before doing anything we need to get the parallel mice imputations. The parallel part is rather simple: all we need to do is use the parallel package and make sure that we set the seed using clusterSetRNGStream:
library(parallel)
# Using all cores can slow down the computer
# significantly, I therefore try to leave one
# core alone in order to be able to do something
# else during the time the code runs
cores_2_use <- detectCores() - 1
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
clusterExport(cl, "nhanes")
clusterEvalQ(cl, library(mice))
imp_pars <-
parLapply(cl = cl, X = 1:cores_2_use, fun = function(no){
mice(nhanes, m = 30, printFlag = FALSE)
})
stopCluster(cl)
The above will yield cores_2_use * 30 imputed datasets.
Using ibind
As #AleksanderBlekh suggested, mice::ibind is probably the best and most straightforward solution:
imp_merged <- imp_pars[[1]]
for (n in 2:length(imp_pars)){
imp_merged <-
ibind(imp_merged,
imp_pars[[n]])
}
Using foreach with ibind
Perhaps the simplest alternative is to use foreach:
library(foreach)
library(doParallel)
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl)
library(mice)
imp_merged <-
foreach(no = 1:cores_2_use,
.combine = ibind,
.export = "nhanes",
.packages = "mice") %dopar%
{
mice(nhanes, m = 30, printFlag = FALSE)
}
stopCluster(cl)
Using complete
Extracting the full data sets using complete(..., action = "long"), rbind-ing these, and then using as.mids may work well, but it generates a slimmer object than the other two approaches:
merged_df <- nhanes
merged_df <-
cbind(data.frame(.imp = 0,
.id = 1:nrow(nhanes)),
merged_df)
for (n in 1:length(imp_pars)){
tmp <- complete(imp_pars[[n]], action = "long")
tmp$.imp <- as.numeric(tmp$.imp) + max(merged_df$.imp)
merged_df <-
rbind(merged_df,
tmp)
}
imp_merged <-
as.mids(merged_df)
# Compare the est and se (the most important columns) for easier comparison
cbind(summary(pool(with(data=imp_merged,
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
summary(pool(with(data=mice(nhanes,
m = 60,
printFlag = FALSE),
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])
Gives the output:
est se est se
(Intercept) 20.41921496 3.85943925 20.33952967 3.79002725
age -3.56928102 1.35801557 -3.65568620 1.27603817
hyp 1.63952970 2.05618895 1.60216683 2.17650536
chl 0.05396451 0.02278867 0.05525561 0.02087995
Keeping a correct mids-object
My alternative approach below shows how to merge imputation objects and retain the full functionality behind the mids object. Since the ibind solution surfaced, I've left this in for anyone interested in exploring how to merge complex lists.
I've looked into mice's mids object, and there are a few steps that you have to take in order to get at least a similar mids object after running in parallel. If we examine the mids object and compare two objects with two different setups, we get:
library(mice)
imp <- list()
imp <- c(imp,
list(mice(nhanes, m = 40)))
imp <- c(imp,
list(mice(nhanes, m = 20)))
sapply(names(imp[[1]]),
function(n)
try(all(useful::compare.list(imp[[1]][[n]],
imp[[2]][[n]]))))
Here you can see that the call, m, imp, chainMean, and chainVar differ between the two runs. Of these, imp is without doubt the most important, but it seems wise to update the other components as well. We will therefore start by building a mice merger function:
mergeMice <- function (imp) {
merged_imp <- NULL
for (n in 1:length(imp)){
if (is.null(merged_imp)){
merged_imp <- imp[[n]]
}else{
counter <- merged_imp$m
# Update counter
merged_imp$m <-
merged_imp$m + imp[[n]]$m
# Rename chains
dimnames(imp[[n]]$chainMean)[[3]] <-
sprintf("Chain %d", (counter + 1):merged_imp$m)
dimnames(imp[[n]]$chainVar)[[3]] <-
sprintf("Chain %d", (counter + 1):merged_imp$m)
# Merge chains
merged_imp$chainMean <-
abind::abind(merged_imp$chainMean,
imp[[n]]$chainMean)
merged_imp$chainVar <-
abind::abind(merged_imp$chainVar,
imp[[n]]$chainVar)
for (nn in names(merged_imp$imp)){
# Non-imputed variables are not in the
# data.frame format but are null
if (!is.null(imp[[n]]$imp[[nn]])){
colnames(imp[[n]]$imp[[nn]]) <-
(counter + 1):merged_imp$m
merged_imp$imp[[nn]] <-
cbind(merged_imp$imp[[nn]],
imp[[n]]$imp[[nn]])
}
}
}
}
# TODO: The function should update the $call parameter
return(merged_imp)
}
We can now merge the two imputations generated above:
merged_imp <- mergeMice(imp)
merged_imp_pars <- mergeMice(imp_pars)
Now it seems that we get the right output:
# Compare the three alternatives
cbind(
summary(pool(with(data=merged_imp,
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
summary(pool(with(data=merged_imp_pars,
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
summary(pool(with(data=mice(nhanes,
m = merged_imp$m,
printFlag = FALSE),
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])
Gives:
est se est se
(Intercept) 20.16057550 3.74819873 20.31814393 3.7346252
age -3.67906629 1.19873118 -3.64395716 1.1476377
hyp 1.72637216 2.01171565 1.71063127 1.9936347
chl 0.05590999 0.02350609 0.05476829 0.0213819
est se
(Intercept) 20.14271905 3.60702992
age -3.78345532 1.21550474
hyp 1.77361005 2.11415290
chl 0.05648672 0.02046868
Ok, that's it. Have fun.