I want to run 150 multiple imputations by using mice in R. However, in order to save some computing time, I would like to subdivide the process in parallel streams (as suggested by Stef van Buuren in "Flexible Imputation for Missing Data").
My question is: how to do that?
I can imagine 2 options:
opt.1:
imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=1)
and then combine the imputations together by using complete and as.mids afterwards
opt.2:
imp1<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp2<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp...<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
imp150<-mice(data, m=1, pred=quicktry, maxit=15, seed=VAL_1to150)
by adding VAL_1to150 otherwise it seems to me (I may be wrong) that if they all run with the same dataset and the same seed you will have 150 times the same result.
Are there any other options?
Thanks
So the main problem is combining the imputations and as I see it there are three options, using ibind, complete as described or trying to keep the mids structure. I strongly suggest the ibind solution. The others are left in the answer for those curious.
Get parallel results
Before doing anything we need to get the parallel mice imputations. The parallel part is rather simple, all we need to do is to use the parallel package and make sure that we set the seed using the clusterSetRNGStream:
library(parallel)
# Using all cores can slow down the computer
# significantly, I therefore try to leave one
# core alone in order to be able to do something
# else during the time the code runs
cores_2_use <- detectCores() - 1
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
clusterExport(cl, "nhanes")
clusterEvalQ(cl, library(mice))
imp_pars <-
parLapply(cl = cl, X = 1:cores_2_use, fun = function(no){
mice(nhanes, m = 30, printFlag = FALSE)
})
stopCluster(cl)
The above will yield cores_2_use * 30 imputed datasets.
Using ibind
As #AleksanderBlekh suggested, the mice::ibind is probably the best, most straightforward solution:
imp_merged <- imp_pars[[1]]
for (n in 2:length(imp_pars)){
imp_merged <-
ibind(imp_merged,
imp_pars[[n]])
}
Using foreach with ibind
The perhaps the simplest alternative is to use foreach:
library(foreach)
library(doParallel)
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, 9956)
registerDoParallel(cl)
library(mice)
imp_merged <-
foreach(no = 1:cores_2_use,
.combine = ibind,
.export = "nhanes",
.packages = "mice") %dopar%
{
mice(nhanes, m = 30, printFlag = FALSE)
}
stopCluster(cl)
Using complete
Extracting the full datasets using complete(..., action="long"), rbind-ing these and then using as.mids other mice objects may work well but it generates a slimmer object than what the other two approaches:
merged_df <- nhanes
merged_df <-
cbind(data.frame(.imp = 0,
.id = 1:nrow(nhanes)),
merged_df)
for (n in 1:length(imp_pars)){
tmp <- complete(imp_pars[[n]], action = "long")
tmp$.imp <- as.numeric(tmp$.imp) + max(merged_df$.imp)
merged_df <-
rbind(merged_df,
tmp)
}
imp_merged <-
as.mids(merged_df)
# Compare the most important the est and se for easier comparison
cbind(summary(pool(with(data=imp_merged,
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
summary(pool(with(data=mice(nhanes,
m = 60,
printFlag = FALSE),
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])
Gives the output:
est se est se
(Intercept) 20.41921496 3.85943925 20.33952967 3.79002725
age -3.56928102 1.35801557 -3.65568620 1.27603817
hyp 1.63952970 2.05618895 1.60216683 2.17650536
chl 0.05396451 0.02278867 0.05525561 0.02087995
Keeping a correct mids-object
My alternative approach below shows how to merge imputation objects and retain the full functionality behind the mids object. Since the ibind solution I've left this in for anyone interested in exploring how to merge complex lists.
I've looked into mice's mids-object and there are a few step that you have to take in order to get at least a similar mids-object after running in parallel. If we examine the mids-object and compare two objects with two different setups we get:
library(mice)
imp <- list()
imp <- c(imp,
list(mice(nhanes, m = 40)))
imp <- c(imp,
list(mice(nhanes, m = 20)))
sapply(names(imp[[1]]),
function(n)
try(all(useful::compare.list(imp[[1]][[n]],
imp[[2]][[n]]))))
Where you can see that the call, m, imp, chainMean, and chainVar differ between the two runs. Out of these the imp is without doubt the most important but it seems like a wise option to update the other components as well. We will therefore start by building a mice merger function:
mergeMice <- function (imp) {
merged_imp <- NULL
for (n in 1:length(imp)){
if (is.null(merged_imp)){
merged_imp <- imp[[n]]
}else{
counter <- merged_imp$m
# Update counter
merged_imp$m <-
merged_imp$m + imp[[n]]$m
# Rename chains
dimnames(imp[[n]]$chainMean)[[3]] <-
sprintf("Chain %d", (counter + 1):merged_imp$m)
dimnames(imp[[n]]$chainVar)[[3]] <-
sprintf("Chain %d", (counter + 1):merged_imp$m)
# Merge chains
merged_imp$chainMean <-
abind::abind(merged_imp$chainMean,
imp[[n]]$chainMean)
merged_imp$chainVar <-
abind::abind(merged_imp$chainVar,
imp[[n]]$chainVar)
for (nn in names(merged_imp$imp)){
# Non-imputed variables are not in the
# data.frame format but are null
if (!is.null(imp[[n]]$imp[[nn]])){
colnames(imp[[n]]$imp[[nn]]) <-
(counter + 1):merged_imp$m
merged_imp$imp[[nn]] <-
cbind(merged_imp$imp[[nn]],
imp[[n]]$imp[[nn]])
}
}
}
}
# TODO: The function should update the $call parameter
return(merged_imp)
}
We can now simply merge the two above generated imputations through:
merged_imp <- mergeMice(imp)
merged_imp_pars <- mergeMice(imp_pars)
Now it seems that we get the right output:
# Compare the three alternatives
cbind(
summary(pool(with(data=merged_imp,
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
summary(pool(with(data=merged_imp_pars,
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")],
summary(pool(with(data=mice(nhanes,
m = merged_imp$m,
printFlag = FALSE),
exp=lm(bmi~age+hyp+chl))))[,c("est", "se")])
Gives:
est se est se
(Intercept) 20.16057550 3.74819873 20.31814393 3.7346252
age -3.67906629 1.19873118 -3.64395716 1.1476377
hyp 1.72637216 2.01171565 1.71063127 1.9936347
chl 0.05590999 0.02350609 0.05476829 0.0213819
est se
(Intercept) 20.14271905 3.60702992
age -3.78345532 1.21550474
hyp 1.77361005 2.11415290
chl 0.05648672 0.02046868
Ok, that's it. Have fun.
Related
I have already written the following code, all of which works OK:
directory_path <- "~/DAEN_698/sample_obs"
file_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE)
head(file_list, n = 2)
> head(file_list, n = 2)
[1] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-1.csv"
[2] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-2.csv"
# Create another list with the just the "n-n-n-n" part of the names of of each dataset
DS_name_list = stri_sub(file_list, 49, 55)
head(DS_name_list, n = 3)
> head(DS_name_list, n = 3)
[1] "0-5-1-1" "0-5-1-2" "0-5-1-3"
# This command reads all the data in each of the N csv files via their names
# stored in the 'file_list' list of characters.
csvs <- lapply(file_list, read.csv)
### Run a Backward Elimination Stepwise Regression on each of the N csvs.
# Assign the full model (meaning the one with all 30 candidate regressors
# included as the initial model in step 1).
# This is crucial because if the initial model has less than the number of
# total candidate factors for Stepwise to select from in the datasets,
# then it could miss 1 or more of the true factors.
full_model <- lapply(csvs, function(i) {
lm(formula = Y ~ ., data = i) })
# my failed attempt at figuring it out myself
set.seed(50) # for reproducibility
BE_fits3 <- lapply(full_model, function(i) {step(object = i[["coefficients"]],
direction = 'backward', scope = formula(full_model), trace = 0)})
When I hit run on the above 2 lines of code after setting the seed, I get
the following error message in the Console:
Error in terms`(object) : object 'i' not found
To briefly elaborate a bit further on why it is
absolutely essential that the initial model when running a Backward Elimination
version of Stepwise Regression, consider the following example:
Let us say that we start out with an initial model of 25, so, X1:X26 instead of
X1:X30, in that case, it would be possible to miss out on Stepwise Regression j
being able to select/choose 1 or more of the IVs/factors from X26 through X30,
especially if 1 or more of those really are included in the true underlying
population model that characterizes dataset j.
Instead of two lapply loops, one to fit the models and the second to run the stepwise regressions, use a for loop doing both operations one after the other. This is an environments thing, it seems that step is not finding the data when run in the environment of the lapply function.
I have also changed the code to create DS_name_list. Below it processes the full names without string position dependent code.
DS_name_list <- basename(file_list)
DS_name_list <- tools::file_path_sans_ext(DS_name_list)
head(DS_name_list, n = 2)
And here is the regressions code.
csvs <- lapply(file_list, read.csv)
names(csvs) <- DS_name_list
set.seed(50) # for reproducibility
full_model <- vector("list", length = length(csvs))
BE_fits3 <- vector("list", length = length(csvs))
for(i in seq_along(csvs)) {
full_model[[i]] <- lm(formula = Y ~ ., data = csvs[[i]])
BE_fits3[[i]] <- step(object = full_model[[i]],
scope = formula(full_model[[i]]),
direction = 'backward',
trace = 0)
}
I cross-post this question here, but it seems to me that I'm unlikely to receive any answer. So I post it here.
I'm running the classification method Bagging Tree (Bootstrap Aggregation) and compare the misclassification error rate with one from one single tree. We expect that the result from bagging tree is better then that from one single tree, i.e. error rate from bagging is lower than that of single tree.
I repeat the whole procedure M = 100 times (each time splitting randomly the original data set into a training set and a test set) to obtain 100 test errors and bagging test errors (use a for loop). Then I use boxplots to compare the distributions of these two types of errors.
# Loading package and data
library(rpart)
library(boot)
library(mlbench)
data(PimaIndiansDiabetes)
# Initialization
n <- 768
ntrain <- 468
ntest <- 300
B <- 100
M <- 100
single.tree.error <- vector(length = M)
bagging.error <- vector(length = M)
# Define statistic
estim.pred <- function(a.sample, vector.of.indices)
{
current.train <- a.sample[vector.of.indices, ]
current.fitted.model <- rpart(diabetes ~ ., data = current.train, method = "class")
predict(current.fitted.model, test.set, type = "class")
}
for (j in 1:M)
{
# Split the data into test/train sets
train.idx <- sample(1:n, ntrain, replace = FALSE)
train.set <- PimaIndiansDiabetes[train.idx, ]
test.set <- PimaIndiansDiabetes[-train.idx, ]
# Train a direct tree model
fitted.tree <- rpart(diabetes ~ ., data = train.set, method = "class")
pred.test <- predict(fitted.tree, test.set, type = "class")
single.tree.error[j] <- mean(pred.test != test.set$diabetes)
# Bootstrap estimates
res.boot = boot(train.set, estim.pred, B)
pred.boot <- vector(length = ntest)
for (i in 1:ntest)
{
pred.boot[i] <- ifelse (mean(res.boot$t[, i] == "pos") >= 0.5, "pos", "neg")
}
bagging.error[j] <- mean(pred.boot != test.set$diabetes)
}
boxplot(single.tree.error, bagging.error, ylab = "Misclassification errors", names = c("single.tree", "bagging"))
The result is
Could you please explain why the error rate for bagging trees is much higher than that of a single tree? I feel that this does not make sense. I've checked my code but could not found anything unusual.
I've received an answer from https://stats.stackexchange.com/questions/452882/why-is-the-error-rate-from-bagging-trees-much-higher-than-that-from-a-single-tre. I posted it here to close this question and for future visitors.
I am using the doPar package in an attempt to parallelise the training of machine learning algorithms as they seem to take quite a while.
My plan is to train multiple neural nets, SVMs, and decision trees (currently 10 of each, named neuralnet1 .. neuralnet10, svm1 ..., svm10, etc. The dataframe all_classifiers contains the classifier name I wish to name it and the stopping/starting training time
> head(all_classifiers,3)
classifiers train_start train_stop
1 neuralnet1 7833 8074
2 neuralnet2 45590 45682
3 neuralnet3 64341 64574
> tail(all_classifiers,3)
classifiers train_start train_stop
28 dt8 235639 235737
29 dt9 256497 257198
30 dt10 257814 258034
my loop right now looks like this
for(i in 1:trainloop{
# Select training data + remove NA
train_start <- all_classifiers[["train_start"]][i]
train_stop <- all_classifiers[["train_stop"]][i]
train_data <- na.omit(data[train_start:train_stop,])
print(paste("Using data from ", train_start,"to", train_stop))
train_scaled <- as.data.frame(train_data)
# Train appropriate model
firstLetter <- strtrim(all_classifiers[["classifiers"]][i],1)
if(firstLetter == "n"){
print("Training neural net")
trained_classifier <- neuralnet(f, data=train_scaled , hidden=c(3),
act.fct = 'logistic', linear.output=F,
stepmax=1e6, rep=1, learningrate = 0.30)
} else if(firstLetter == "s"){
print("Training SVM")
trained_classifier <- svm(upmove ~ . , data = train_scaled,
kernel = "polynomial", coef0 = 2.0)
} else if(firstLetter == "d"){
print("Training DT")
train_scaled$upmove <- as.factor(train_scaled$upmove)
trained_classifier <- C5.0(f, data = train_scaled)
}
flog.info(paste("Training",all_classifiers[["classifiers"]][i]))
assign(toString(all_classifiers[["classifiers"]][i]), trained_classifier)
}
I wish to parallelise this loop by using
foreach(i = 1:trainloop, .packages = 'neuralnet',
'e1071','C5.0','futile.logger') %dopar% { %loop here$ }
But it seems that each worker starts with iterator i=1, while my variable assignments
assign(toString(all_classifiers[["classifiers"]][i]), trained_classifier)
are dependent on the value of the iterator being used. How would I solve this? Eventually I want to end up with all the names in the first column of all_classifiers being trained classifiers on the associated starting and stopping training times.
I don't want to get too into the specifics of your code, but here is a small example that will hopefully help you understand how to translate a base R loop to foreach:
x1 <- numeric(10)
for (i in 1:10) {
x1[i] <- i^2
}
x2 <- foreach(i=1:10,.combine=rbind) %do% {
i^2
}
x1==x2
I am trying to train a neural network for churn prediction with R package neuralnet. Here is the code:
data <- read.csv('C:/PredictChurn.csv')
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
scaled_temp <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
scaled <- data
scaled[, -c(1)] <- scaled_temp[, -c(1)]
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train_ <- scaled[index,]
test_ <- scaled[-index,]
library(neuralnet)
n <- names(train_[, -c(1)])
f <- as.formula(paste("CHURNED_F ~", paste(n[!n %in% "CHURNED_F"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5),linear.output=F)
It works as it should, however when training with the full data set (in the range of millions of rows) it just takes too long. So I know R is by default single threaded, so I have tried researching on how to parallelize the work into all the cores. Is it even possible to make this function in parallel? I have tried various packages with no success.
Has anyone been able to do this?
It doesn't have to be the neuralnet package, any solution that lets me train a neural network would work.
Thank you
I have had good experiences with the package Rmpi, and it may be applicable in your case too.
library(Rmpi)
Briefly, its usage is as follows:
nproc = 4 # could be automatically determined
# Specify one master and nproc-1 slaves
Rmpi:: mpi.spawn.Rslaves(nslaves=nproc-1)
# Execute function "func_to_be_parallelized" on multiple CPUs; pass two variables to function
my_fast_results = Rmpi::mpi.parLapply(var1_passed_to_func,
func_to_be_parallelized,
var2_passed_to_func)
# Close slaves
Rmpi::mpi.close.Rslaves(dellog=T)
You can try using the caret and doParallel packages for this. This is what I have been using. It works for some of the model types but may not work for all.
layer1 = c(6,12,18,24,30)
layer2 = c(6,12,18,24,30)
layer3 = c(6,12,18,24,30)
cv.folds = 5
# In order to make models fully reproducible when using parallel processing, we need to pass seeds as a parameter
# https://stackoverflow.com/questions/13403427/fully-reproducible-parallel-models-using-caret
total.param.permutations = length(layer1) * length(layer2) * length(layer3)
seeds <- vector(mode = "list", length = cv.folds + 1)
set.seed(1)
for(i in 1:cv.folds) seeds[[i]]<- sample.int(n=1, total.param.permutations, replace = TRUE)
seeds[[cv.folds + 1]]<-sample.int(1, 1, replace = TRUE) #for the last model
nn.grid <- expand.grid(layer1 = layer1, layer2 = layer2, layer3 = layer3)
cl <- makeCluster(detectCores()*0.5) # use 50% of cores only, leave rest for other tasks
registerDoParallel(cl)
train_control <- caret::trainControl(method = "cv"
,number=cv.folds
,seeds = seeds # user defined seeds for parallel processing
,verboseIter = TRUE
,allowParallel = TRUE
)
stopCluster(cl)
registerDoSEQ()
tic("Total Time to NN Training: ")
set.seed(1)
model.nn.caret = caret::train(form = formula,
data = scaled.train.data,
method = 'neuralnet',
tuneGrid = nn.grid,
trControl = train_control
)
toc()
I'm running parallel cv.glmnet from glmnet package on over 1000 data sets. In each run I set the seed to have the results reproducible. What I've noticed is that my results differ. The thing is that when I run the code on the same day, then the results are the same. But the next day they differ.
Here is my code:
model <- function(path, file, wyniki, faktor = 0.75) {
set.seed(2)
dane <- read.csv(file)
n <- nrow(dane)
podzial <- 1:floor(faktor*n)
########## GLMNET ############
nFolds <- 3
train_sparse <- dane[podzial,]
test_sparse <- dane[-podzial,]
# fit with cross-validation
tryCatch({
wart <- c(rep(0,6), "nie")
model <- cv.glmnet(train_sparse[,-1], train_sparse[,1], nfolds=nFolds, standardize=FALSE)
pred <- predict(model, test_sparse[,-1], type = "response",s=model$lambda.min)
# fetch of AUC value
aucp1 <- roc(test_sparse[,1],pred)$auc
}, error = function(e) print("error"))
results <- data.frame(auc = aucp1, n = nrow(dane))
write.table(results, wyniki, sep=',', append=TRUE,row.names =FALSE,col.names=FALSE)
}
path <- path_to_files
files <- list.files(sciezka, full.names = TRUE, recursive = TRUE)
wyniki <- "wyniki_adex__samplingfalse_decl_201512.csv"
library('doSNOW')
library('parallel')
#liczba watkow
threads <- 5
#rejestrujemy liczbe watkow
cl <- makeCluster(threads, outfile="")
registerDoSNOW(cl)
message("Loading packages on threads...")
clusterEvalQ(cl,library(pROC))
clusterEvalQ(cl,library(ROCR))
clusterEvalQ(cl,library(glmnet))
clusterEvalQ(cl,library(stringi))
message("Modelling...")
foreach(i=1:length(pliki)) %dopar% {
print(i)
model(path, files[i], wyniki)
}
Does anyone know what is the cause?
I'm running CentOS Linux release 7.0.1406 (Core) / Red Hat 4.8.2-16
Found the answer in the documentation of the cv.glmnet function:
Note also that the results of cv.glmnet are random, since the folds
are selected at random.
The solution is to manually set the folds so that there are not chosen at random:
nFolds <- 3
foldid <- sample(rep(seq(nFolds), length.out = nrow(train_sparse))
model <- cv.glmnet(x = as.matrix(x = train_sparse[,-1],
y = train_sparse[,1],
nfolds = nFolds,
foldid = foldid,
standardize = FALSE)
According to Writing R Extensions, a C wrapper is needed to call R's normal random numbers from FORTRAN. I don't see any C code in the glmnet source. I'm afraid it doesn't look implemented:
6.6 Calling C from FORTRAN and vice versa