I am using the doParallel package in an attempt to parallelise the training of machine learning algorithms, as they take quite a while.
My plan is to train multiple neural nets, SVMs, and decision trees (currently 10 of each, named neuralnet1 .. neuralnet10, svm1 .. svm10, etc.). The data frame all_classifiers contains the name I wish to give each classifier and the start/stop times of its training window:
> head(all_classifiers,3)
classifiers train_start train_stop
1 neuralnet1 7833 8074
2 neuralnet2 45590 45682
3 neuralnet3 64341 64574
> tail(all_classifiers,3)
classifiers train_start train_stop
28 dt8 235639 235737
29 dt9 256497 257198
30 dt10 257814 258034
My loop currently looks like this:
for(i in 1:trainloop){
  # Select training data + remove NA
  train_start <- all_classifiers[["train_start"]][i]
  train_stop <- all_classifiers[["train_stop"]][i]
  train_data <- na.omit(data[train_start:train_stop,])
  print(paste("Using data from", train_start, "to", train_stop))
  train_scaled <- as.data.frame(train_data)
  # Train the appropriate model
  firstLetter <- strtrim(all_classifiers[["classifiers"]][i], 1)
  if(firstLetter == "n"){
    print("Training neural net")
    trained_classifier <- neuralnet(f, data=train_scaled, hidden=c(3),
                                    act.fct = 'logistic', linear.output=F,
                                    stepmax=1e6, rep=1, learningrate = 0.30)
  } else if(firstLetter == "s"){
    print("Training SVM")
    trained_classifier <- svm(upmove ~ ., data = train_scaled,
                              kernel = "polynomial", coef0 = 2.0)
  } else if(firstLetter == "d"){
    print("Training DT")
    train_scaled$upmove <- as.factor(train_scaled$upmove)
    trained_classifier <- C5.0(f, data = train_scaled)
  }
  flog.info(paste("Training", all_classifiers[["classifiers"]][i]))
  assign(toString(all_classifiers[["classifiers"]][i]), trained_classifier)
}
I wish to parallelise this loop by using
foreach(i = 1:trainloop,
        .packages = c('neuralnet', 'e1071', 'C50', 'futile.logger')) %dopar% {
  # loop body here
}
But it seems that each worker starts with iterator i=1, while my variable assignments
assign(toString(all_classifiers[["classifiers"]][i]), trained_classifier)
depend on the value of the iterator. How would I solve this? Eventually I want every name in the first column of all_classifiers to be a classifier trained on the associated start/stop window.
I don't want to get too into the specifics of your code, but here is a small example that will hopefully help you understand how to translate a base R loop to foreach:
x1 <- numeric(10)
for (i in 1:10) {
x1[i] <- i^2
}
x2 <- foreach(i=1:10,.combine=rbind) %do% {
i^2
}
x1==x2
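Applied to your loop: foreach returns a list holding each iteration's last value, so instead of calling assign() inside the loop (the assignment happens inside a worker process and is lost), return the trained classifier and name the collected results afterwards. A minimal sketch, assuming your loop body and all_classifiers from above (note the C5.0() function lives in the C50 package):
library(doParallel)
library(foreach)

cl <- makeCluster(4) # e.g. 4 workers
registerDoParallel(cl)

trained_list <- foreach(i = 1:trainloop,
                        .packages = c('neuralnet', 'e1071', 'C50', 'futile.logger')) %dopar% {
  # ... your training code for iteration i, ending with ...
  trained_classifier # the last expression becomes this iteration's result
}

stopCluster(cl)

# Name the results after the first column of all_classifiers
names(trained_list) <- all_classifiers[["classifiers"]]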
I have already written the following code, all of which works OK:
directory_path <- "~/DAEN_698/sample_obs"
file_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE)
head(file_list, n = 2)
> head(file_list, n = 2)
[1] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-1.csv"
[2] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-2.csv"
# Create another list with just the "n-n-n-n" part of the name of each dataset
DS_name_list = stringi::stri_sub(file_list, 49, 55)
head(DS_name_list, n = 3)
> head(DS_name_list, n = 3)
[1] "0-5-1-1" "0-5-1-2" "0-5-1-3"
# Read the data in each of the N csv files via the paths stored in
# the 'file_list' character vector.
csvs <- lapply(file_list, read.csv)
### Run a Backward Elimination Stepwise Regression on each of the N csvs.
# Fit the full model (the one with all 30 candidate regressors included)
# to use as the initial model in step 1.
# This is crucial: if the initial model contains fewer than the total
# number of candidate factors in the datasets, then Stepwise could miss
# 1 or more of the true factors.
full_model <- lapply(csvs, function(i) {
lm(formula = Y ~ ., data = i) })
# my failed attempt at figuring it out myself
set.seed(50) # for reproducibility
BE_fits3 <- lapply(full_model, function(i) {step(object = i[["coefficients"]],
direction = 'backward', scope = formula(full_model), trace = 0)})
When I hit run on the above 2 lines of code after setting the seed, I get
the following error message in the Console:
Error in terms(object) : object 'i' not found
To briefly elaborate on why it is absolutely essential that the initial model
contain all candidate regressors when running a Backward Elimination version of
Stepwise Regression, consider the following example:
Say we start out with an initial model of 25 regressors, i.e. X1:X25 instead of
X1:X30. In that case, Stepwise Regression could never select 1 or more of the
IVs/factors from X26 through X30 (backward elimination only removes terms),
even if 1 or more of those really are included in the true underlying
population model that characterizes dataset j.
Instead of two lapply loops, one to fit the models and a second to run the stepwise regressions, use a single for loop that does both operations one after the other. This is an environments issue: step does not find the data when run inside the environment of the lapply function. (Note also that your attempt passed i[["coefficients"]], a numeric vector, to step rather than the fitted model itself.)
I have also changed the code that creates DS_name_list. The version below processes the full names without relying on fixed string positions.
DS_name_list <- basename(file_list)
DS_name_list <- tools::file_path_sans_ext(DS_name_list)
head(DS_name_list, n = 2)
And here is the regression code.
csvs <- lapply(file_list, read.csv)
names(csvs) <- DS_name_list
set.seed(50) # for reproducibility
full_model <- vector("list", length = length(csvs))
BE_fits3 <- vector("list", length = length(csvs))
for(i in seq_along(csvs)) {
full_model[[i]] <- lm(formula = Y ~ ., data = csvs[[i]])
BE_fits3[[i]] <- step(object = full_model[[i]],
scope = formula(full_model[[i]]),
direction = 'backward',
trace = 0)
}
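Optionally, you can name both result lists after the datasets so each fit can be looked up by its "n-n-n-n" identifier; a small addition:
# Name the fitted-model lists after their source datasets
names(full_model) <- DS_name_list
names(BE_fits3) <- DS_name_list

# e.g. retrieve the backward elimination fit for one dataset
BE_fits3[["0-5-1-1"]]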
I cross-posted this question, but it seemed unlikely to receive an answer there, so I am posting it here.
I'm running the classification method Bagging Tree (Bootstrap Aggregation) and comparing its misclassification error rate with that of a single tree. We expect bagging to do better than a single tree, i.e. the error rate from bagging should be lower than that of the single tree.
I repeat the whole procedure M = 100 times (each time randomly splitting the original data set into a training set and a test set) to obtain 100 single-tree test errors and 100 bagging test errors (using a for loop). Then I use boxplots to compare the distributions of these two types of errors.
# Loading package and data
library(rpart)
library(boot)
library(mlbench)
data(PimaIndiansDiabetes)
# Initialization
n <- 768
ntrain <- 468
ntest <- 300
B <- 100
M <- 100
single.tree.error <- vector(length = M)
bagging.error <- vector(length = M)
# Define statistic
estim.pred <- function(a.sample, vector.of.indices)
{
current.train <- a.sample[vector.of.indices, ]
current.fitted.model <- rpart(diabetes ~ ., data = current.train, method = "class")
predict(current.fitted.model, test.set, type = "class")
}
for (j in 1:M)
{
# Split the data into test/train sets
train.idx <- sample(1:n, ntrain, replace = FALSE)
train.set <- PimaIndiansDiabetes[train.idx, ]
test.set <- PimaIndiansDiabetes[-train.idx, ]
# Train a direct tree model
fitted.tree <- rpart(diabetes ~ ., data = train.set, method = "class")
pred.test <- predict(fitted.tree, test.set, type = "class")
single.tree.error[j] <- mean(pred.test != test.set$diabetes)
# Bootstrap estimates
res.boot = boot(train.set, estim.pred, B)
pred.boot <- vector(length = ntest)
for (i in 1:ntest)
{
pred.boot[i] <- ifelse (mean(res.boot$t[, i] == "pos") >= 0.5, "pos", "neg")
}
bagging.error[j] <- mean(pred.boot != test.set$diabetes)
}
boxplot(single.tree.error, bagging.error, ylab = "Misclassification errors", names = c("single.tree", "bagging"))
The resulting boxplot shows the bagging errors sitting well above the single-tree errors.
Could you please explain why the error rate for bagging trees is much higher than that of a single tree? I feel that this does not make sense. I've checked my code but could not find anything unusual.
I've received an answer at https://stats.stackexchange.com/questions/452882/why-is-the-error-rate-from-bagging-trees-much-higher-than-that-from-a-single-tre. I'm posting it here to close this question and for future visitors.
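In short, as I understand the linked answer: boot() stores the statistic's results in the numeric matrix res.boot$t, so the factor returned by estim.pred is coerced to its underlying integer codes (1 = "neg", 2 = "pos"). The comparison res.boot$t[, i] == "pos" is therefore always FALSE, every bagged prediction becomes "neg", and the bagging error is simply the share of "pos" cases in the test set. A minimal sketch of the fix is to make the statistic return numeric 0/1 values:
# Return numeric 0/1 ("pos" = 1) so boot() stores interpretable values
estim.pred <- function(a.sample, vector.of.indices)
{
  current.train <- a.sample[vector.of.indices, ]
  current.fitted.model <- rpart(diabetes ~ ., data = current.train, method = "class")
  as.numeric(predict(current.fitted.model, test.set, type = "class") == "pos")
}

# ... and take a majority vote over the bootstrap replicates:
# pred.boot[i] <- ifelse(mean(res.boot$t[, i]) >= 0.5, "pos", "neg")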
I have written the kNN cross-validation code below using the iris dataset in R. How would I get the best value of k from this and create a confusion matrix based on it? Any help would be great.
library(class)
data("iris")
kfolds = 5
iris$folds = cut(seq(1,nrow(iris)),breaks=kfolds,labels=FALSE)
iris$folds
# Set the columns to use as predictors
pred = c("Petal.Width", "Petal.Length")
accuracies = c()
ks = c(1,3,5,7,9,11,13,15)
for (k in ks) {
k.accuracies = c()
for(i in 1:kfolds) {
# Builds the training set and test set for this fold.
train.items.this.fold = iris[iris$folds != i,]
validation.items.this.fold = iris[iris$folds == i,]
# Fit knn model on this fold.
predictions = knn(train.items.this.fold[,pred],
validation.items.this.fold[,pred],
train.items.this.fold$Species, k=k)
predictions.table <- table(predictions, validation.items.this.fold$Species)
# Work out the number of correct and incorrect predictions.
correct.list <- predictions == validation.items.this.fold$Species
nr.correct = nrow(validation.items.this.fold[correct.list,])
# Get accuracy rate of cv.
accuracy.rate = nr.correct/nrow(validation.items.this.fold)
# Adds the accuracy list.
k.accuracies <- cbind(k.accuracies, accuracy.rate)
}
# Adds the mean accuracy to the total accuracy list.
accuracies <- cbind(accuracies, mean(k.accuracies))
}
# Accuracy for each value of k: visualisation.
accuracies
Update:
predictions.table <- table(predictions == ks[which.max(accuracies)], validation.items.this.fold$Species)
Your code has some problems; this version runs:
library(class)
data("iris")
kfolds = 5
iris$folds = cut(seq(1,nrow(iris)),breaks=kfolds,labels=FALSE)
iris$folds
# Set the columns to use as predictors
pred = c("Petal.Width", "Petal.Length")
accuracies = c()
ks = c(1,3,5,7,9,11,13,15)
predictions.list = list() # out-of-fold predictions, one entry per k
labels.list = list()      # the matching true labels, one entry per k
for (k in ks) {
  k.accuracies = c()
  k.predictions = c()
  k.labels = c()
  for(i in 1:kfolds) {
    # Build the training set and validation set for this fold.
    train.items.this.fold = iris[iris$folds != i,]
    validation.items.this.fold = iris[iris$folds == i,]
    # Fit the knn model on this fold.
    predictions = knn(train.items.this.fold[,pred],
                      validation.items.this.fold[,pred],
                      train.items.this.fold$Species, k=k)
    k.predictions = c(k.predictions, as.character(predictions))
    k.labels = c(k.labels, as.character(validation.items.this.fold$Species))
    predictions.table <- table(predictions, validation.items.this.fold$Species)
    # Work out the number of correct predictions.
    correct.list <- predictions == validation.items.this.fold$Species
    nr.correct = sum(correct.list)
    # Get the accuracy rate for this fold.
    accuracy.rate = nr.correct/nrow(validation.items.this.fold)
    # Add it to the accuracy list for this k.
    k.accuracies <- c(k.accuracies, accuracy.rate)
  }
  # Store this k's predictions and labels, and its mean accuracy.
  predictions.list[[as.character(k)]] = k.predictions
  labels.list[[as.character(k)]] = k.labels
  accuracies <- c(accuracies, mean(k.accuracies))
}
accuracies
best.k <- ks[which.max(accuracies)]
predictions.table <- table(predictions.list[[as.character(best.k)]],
                           labels.list[[as.character(best.k)]])
When you call predictions.table <- table(predictions, validation.items.this.fold$Species), that is the confusion matrix for a single fold. You are using accuracy as the evaluation metric, so the best k is the one with the highest mean accuracy. You can get the best k value like this:
ks[which.max(accuracies)]
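As a quick sanity check you can also plot the mean cross-validated accuracy against each k; a small sketch using base graphics (accuracies and ks as defined above):
plot(ks, accuracies, type = "b",
     xlab = "k (number of neighbours)", ylab = "mean CV accuracy")
abline(v = ks[which.max(accuracies)], lty = 2) # mark the best k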
UPDATE
Create lists that store the out-of-fold predictions and true labels for each k, then build the confusion matrix from the k with the best mean accuracy, as done in the final lines above.
I'm trying to train a neural network on candidatesTestData, a 20177 x 14 matrix, following the procedure listed here: this answer
I am trying to avoid over-fitting the training data. This is what I have tried so far:
library(dplyr) # for sample_n() and setdiff() on data frames
library(nnet)

returnNet <- NULL
currMax <- 40
for(i in 1:10) {
  validationData <- sample_n(candidatesTrainingData, 20)
  trainingData <- setdiff(candidatesTrainingData, validationData)
  temp <- nnet(yield ~ ., data=trainingData, size = 6, linout=TRUE, skip=TRUE, MaxNWts = 10000)
  # Drop the last (yield) column explicitly; the original
  # validationData[1:length(names(validationData))-1] evaluated as
  # (1:n)-1, i.e. 0:(n-1), and only worked by accident.
  validationPrediction <- predict(temp, validationData[, -ncol(validationData)])
  errorVector <- abs(validationData$yield - validationPrediction)
  if ( min(errorVector, na.rm=TRUE) < 5 & mean(errorVector, na.rm=TRUE) < currMax ) {
    currMax <- mean(errorVector, na.rm=TRUE)
    returnNet <- temp
  }
}
return(returnNet)
In 10 minutes, 60 optimisation iterations completed for the first neural network. Is there any way this can be sped up, i.e. can the run time be improved?
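Since the ten restarts are independent of each other, one way to cut the wall-clock time is to run them in parallel and select the best net afterwards. A minimal sketch with foreach/doParallel, assuming candidatesTrainingData and the selection thresholds from the loop above:
library(dplyr)
library(nnet)
library(doParallel)
library(foreach)

cl <- makeCluster(4) # e.g. 4 workers
registerDoParallel(cl)

candidates <- foreach(i = 1:10, .packages = c("nnet", "dplyr")) %dopar% {
  validationData <- sample_n(candidatesTrainingData, 20)
  trainingData <- setdiff(candidatesTrainingData, validationData)
  temp <- nnet(yield ~ ., data = trainingData, size = 6, linout = TRUE,
               skip = TRUE, MaxNWts = 10000)
  validationPrediction <- predict(temp, validationData[, -ncol(validationData)])
  errorVector <- abs(validationData$yield - validationPrediction)
  list(net = temp, minErr = min(errorVector, na.rm = TRUE),
       meanErr = mean(errorVector, na.rm = TRUE))
}

stopCluster(cl)

# Keep the net with the lowest mean validation error among those whose
# minimum error is below 5, mirroring the sequential selection rule
ok <- Filter(function(x) x$minErr < 5 && x$meanErr < 40, candidates)
returnNet <- if (length(ok) > 0) ok[[which.min(sapply(ok, function(x) x$meanErr))]]$net else NULL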
I am trying to train a neural network for churn prediction with the R package neuralnet. Here is the code:
data <- read.csv('C:/PredictChurn.csv')
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
scaled_temp <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
scaled <- data
scaled[, -c(1)] <- scaled_temp[, -c(1)]
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train_ <- scaled[index,]
test_ <- scaled[-index,]
library(neuralnet)
n <- names(train_[, -c(1)])
f <- as.formula(paste("CHURNED_F ~", paste(n[!n %in% "CHURNED_F"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5),linear.output=F)
It works as it should; however, when training with the full data set (in the range of millions of rows) it just takes too long. I know R is single-threaded by default, so I have tried researching how to parallelise the work across all the cores. Is it even possible to run this function in parallel? I have tried various packages with no success.
Has anyone been able to do this?
It doesn't have to be the neuralnet package, any solution that lets me train a neural network would work.
Thank you
I have had good experiences with the package Rmpi, and it may be applicable in your case too.
library(Rmpi)
Briefly, its usage is as follows:
nproc = 4 # could be automatically determined
# The current session acts as the master; spawn nproc-1 slaves
Rmpi::mpi.spawn.Rslaves(nslaves=nproc-1)
# Execute "func_to_be_parallelized" on multiple CPUs; the first argument
# is the list to iterate over, further arguments are passed to the function
my_fast_results = Rmpi::mpi.parLapply(var1_passed_to_func,
                                      func_to_be_parallelized,
                                      var2_passed_to_func)
# Close slaves
Rmpi::mpi.close.Rslaves(dellog=T)
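Applied to the question, a hypothetical sketch could ship the training objects to the slaves once and train several independent nets (e.g. different random starts) in parallel; train_ and f are the objects defined in the question:
library(Rmpi)

nproc <- 4
Rmpi::mpi.spawn.Rslaves(nslaves = nproc - 1)

# Ship the training data and the formula to every slave once
Rmpi::mpi.bcast.Robj2slave(train_)
Rmpi::mpi.bcast.Robj2slave(f)

# Train one net per repetition; each slave uses its own copy of the data
nets <- Rmpi::mpi.parLapply(1:4, function(rep) {
  library(neuralnet)
  neuralnet(f, data = train_, hidden = c(5), linear.output = FALSE)
})

Rmpi::mpi.close.Rslaves(dellog = TRUE)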
You can try using the caret and doParallel packages for this. This is what I have been using. It works for some of the model types but may not work for all.
library(caret)
library(doParallel) # also attaches parallel (makeCluster, detectCores) and foreach
library(tictoc)     # provides the tic()/toc() timing calls used below

layer1 = c(6,12,18,24,30)
layer2 = c(6,12,18,24,30)
layer3 = c(6,12,18,24,30)
cv.folds = 5
# In order to make models fully reproducible when using parallel processing, we need to pass seeds as a parameter
# https://stackoverflow.com/questions/13403427/fully-reproducible-parallel-models-using-caret
total.param.permutations = length(layer1) * length(layer2) * length(layer3)
seeds <- vector(mode = "list", length = cv.folds + 1)
set.seed(1)
# One seed vector per resample, each as long as the number of tuning parameter
# combinations (note: sample.int(n=1, ...) would make every seed equal to 1)
for(i in 1:cv.folds) seeds[[i]] <- sample.int(n = 1000, total.param.permutations, replace = TRUE)
seeds[[cv.folds + 1]] <- sample.int(1000, 1, replace = TRUE) # for the last model
nn.grid <- expand.grid(layer1 = layer1, layer2 = layer2, layer3 = layer3)
cl <- makeCluster(floor(detectCores()*0.5)) # use 50% of cores only, leave the rest for other tasks
registerDoParallel(cl)
train_control <- caret::trainControl(method = "cv"
                                     ,number = cv.folds
                                     ,seeds = seeds # user-defined seeds for parallel processing
                                     ,verboseIter = TRUE
                                     ,allowParallel = TRUE
                                     )
tic("Total Time to NN Training: ")
set.seed(1)
model.nn.caret = caret::train(form = formula,          # your model formula
                              data = scaled.train.data, # your scaled training data
                              method = 'neuralnet',
                              tuneGrid = nn.grid,
                              trControl = train_control
                              )
toc()
# Stop the cluster only after training has finished
stopCluster(cl)
registerDoSEQ()
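For reference, caret expects the seeds argument to be a list of length (number of resamples + 1): one integer vector per resample, each at least as long as the number of tuning parameter combinations evaluated, plus a single integer in the final element that seeds the last model fit.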