Neural net trains very slowly (R)

I'm trying to train a neural network on candidatesTestData, a 20177 x 14 matrix, following the procedure described in this answer.
I want to avoid over-fitting the training data. This is what I have tried so far:
returnNet <- NULL
currMax <- 40
for (i in 1:10) {
  # Hold out 20 rows for validation, train on the rest
  validationData <- sample_n(candidatesTrainingData, 20)
  trainingData   <- setdiff(candidatesTrainingData, validationData)
  temp <- nnet(yield ~ ., data = trainingData, size = 6, linout = TRUE,
               skip = TRUE, MaxNWts = 10000)
  # 1:length(names(...)) - 1 evaluates to 0:(ncol - 1), i.e. every column but the last
  validationPrediction <- predict(temp, validationData[1:length(names(validationData)) - 1])
  errorVector <- abs(validationData$yield - validationPrediction)
  # Keep the net with the lowest mean validation error
  if (min(errorVector, na.rm = TRUE) < 5 & mean(errorVector, na.rm = TRUE) < currMax) {
    currMax   <- mean(errorVector, na.rm = TRUE)
    returnNet <- temp
  }
}
return(returnNet)
In 10 minutes, 60 iterations have completed for the first neural network. Is there any way this can be sped up, i.e. can the run time be improved?
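One thing worth trying, sketched below under the assumption that candidatesTrainingData is a data frame with a numeric yield column: the ten candidate fits are independent of each other, so they can be run in parallel with foreach/doParallel and the best one picked afterwards. This is a minimal sketch, not a drop-in replacement for the loop above.
library(dplyr)
library(nnet)
library(foreach)
library(doParallel)

cl <- makeCluster(2)   # adjust to the number of available cores
registerDoParallel(cl)

fits <- foreach(i = 1:10, .packages = c("dplyr", "nnet")) %dopar% {
  validationData <- sample_n(candidatesTrainingData, 20)
  trainingData   <- setdiff(candidatesTrainingData, validationData)
  fit  <- nnet(yield ~ ., data = trainingData, size = 6, linout = TRUE,
               skip = TRUE, MaxNWts = 10000)
  pred <- predict(fit, validationData)
  list(fit = fit, meanErr = mean(abs(validationData$yield - pred), na.rm = TRUE))
}
stopCluster(cl)

# Keep the fit with the lowest mean validation error (the original min() < 5
# filter can be added back in the same way)
returnNet <- fits[[which.min(sapply(fits, `[[`, "meanErr"))]]$fit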

Related

Very high memory requirements for power simulations using the R function simr::powerSim

I'm trying to do a power analysis for a project where we expect 40,000 observations (at least) across 40-60 "units".
When I do this power analysis using a much smaller number of observations within units (say, 100 per unit), the simulation runs fine. However, memory requirements seem to explode when I up the observations per unit to 300 or 500. I haven't been able to get the simulation to run even on a computer with a memory limit of 70GB.
What's going on? Why are the memory requirements so high and how would I make them more manageable?
The generic power simulation script is below, for reference.
#psacr001_power.R: Run a job using specifications determined using a shell script
#Get the arguments passed from the shell script
args <- commandArgs(trailingOnly=TRUE)
job <- as.numeric(args[[1]])
sample_size <- as.numeric(args[[2]])
number_of_labs <- as.numeric(args[[3]])
icc_size <- as.numeric(args[[4]])
slope_size <- as.numeric(args[[5]])
effect <- as.numeric(args[[6]])
#Required R libraries
library(lme4)
library(simr)
library(pbkrtest)
library(tibble)
#create dataset
simulated.df <- tibble(
  x = sample(rep(0:1, sample_size * number_of_labs / 2)),  # group assignment
  g = rep(1:number_of_labs, sample_size),                  # number of clusters
  y = rnorm(sample_size * number_of_labs)                  # outcome
)
#Create simple model
model.of.interest <- lmer(y ~ x + (x|g), data=simulated.df)
#Create a small effect of interest
fixef(model.of.interest)['x'] <- effect
#set the random-intercept variance (ICC)
VarCorr(model.of.interest)["g"][["g"]][1] <- icc_size
#set the random-slope variance
VarCorr(model.of.interest)["g"][["g"]][4] <- slope_size
#try not to be singular
attr(VarCorr(model.of.interest)[["g"]], "correlation")[2:3] <- .3
power_summary <- tibble(
  job = numeric(),
  successes = numeric(),
  trials = numeric(),
  mean = numeric(),
  lower = numeric(),
  upper = numeric()
)
#simulate those models
temp <- powerSim(fit = model.of.interest,
                 nsim = 200,
                 progress = FALSE)
power_summary[1 , 1] <- job
power_summary[1 , 2:6] <- summary(temp)
write.csv(power_summary, paste(c("res_", job, ".csv"), collapse=""), row.names=FALSE)
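One experiment that may be worth running (an assumption on my part, not something simr documents as a memory fix): split the 200 simulations into smaller batches, pool the successes and trials by hand, and recompute the confidence interval with binom.test(). The sketch below reuses model.of.interest and job from the script above.
# Hedged sketch: run powerSim() in smaller batches and pool the results.
# Whether this lowers peak memory depends on where simr allocates, so treat it
# as an experiment rather than a guaranteed fix.
batch_sims <- rep(50, 4)   # 4 batches of 50 = 200 simulations in total

batch_results <- lapply(batch_sims, function(ns) {
  out <- summary(powerSim(fit = model.of.interest, nsim = ns, progress = FALSE))
  gc()   # encourage R to release memory between batches
  out
})

successes <- sum(sapply(batch_results, `[[`, "successes"))
trials    <- sum(sapply(batch_results, `[[`, "trials"))
ci        <- binom.test(successes, trials)$conf.int

pooled <- data.frame(job = job, successes = successes, trials = trials,
                     mean = successes / trials, lower = ci[1], upper = ci[2])
write.csv(pooled, paste(c("res_", job, ".csv"), collapse = ""), row.names = FALSE)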

Why is the error rate from bagging trees much higher than that from a single tree?

I cross-posted this question elsewhere, but it seems unlikely to receive an answer there, so I'm posting it here.
I'm running bagged classification trees (bootstrap aggregation) and comparing the misclassification error rate with that of a single tree. We expect bagging to do better than a single tree, i.e. the bagging error rate should be lower than the single-tree error rate.
I repeat the whole procedure M = 100 times (each time randomly splitting the original data set into a training set and a test set) to obtain 100 single-tree test errors and 100 bagging test errors (using a for loop). Then I use boxplots to compare the distributions of these two types of errors.
# Loading package and data
library(rpart)
library(boot)
library(mlbench)
data(PimaIndiansDiabetes)
# Initialization
n <- 768
ntrain <- 468
ntest <- 300
B <- 100
M <- 100
single.tree.error <- vector(length = M)
bagging.error <- vector(length = M)
# Define statistic
estim.pred <- function(a.sample, vector.of.indices) {
  current.train <- a.sample[vector.of.indices, ]
  current.fitted.model <- rpart(diabetes ~ ., data = current.train, method = "class")
  predict(current.fitted.model, test.set, type = "class")
}

for (j in 1:M) {
  # Split the data into test/train sets
  train.idx <- sample(1:n, ntrain, replace = FALSE)
  train.set <- PimaIndiansDiabetes[train.idx, ]
  test.set  <- PimaIndiansDiabetes[-train.idx, ]

  # Train a direct tree model
  fitted.tree <- rpart(diabetes ~ ., data = train.set, method = "class")
  pred.test <- predict(fitted.tree, test.set, type = "class")
  single.tree.error[j] <- mean(pred.test != test.set$diabetes)

  # Bootstrap estimates
  res.boot <- boot(train.set, estim.pred, B)
  pred.boot <- vector(length = ntest)
  for (i in 1:ntest) {
    pred.boot[i] <- ifelse(mean(res.boot$t[, i] == "pos") >= 0.5, "pos", "neg")
  }
  bagging.error[j] <- mean(pred.boot != test.set$diabetes)
}

boxplot(single.tree.error, bagging.error,
        ylab = "Misclassification errors", names = c("single.tree", "bagging"))
The resulting boxplots show the bagging error rate sitting well above the single-tree error rate.
Could you please explain why the error rate for bagging trees is much higher than that of a single tree? This does not make sense to me. I've checked my code but could not find anything unusual.
I've since received an answer at https://stats.stackexchange.com/questions/452882/why-is-the-error-rate-from-bagging-trees-much-higher-than-that-from-a-single-tre and am posting the link here to close this question and for future visitors.
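For future visitors, a hedged diagnostic sketch (my own guess about where to look, not a restatement of the linked answer): predict(..., type = "class") returns a factor, and when boot() packs the replicates into the matrix res.boot$t a factor may be stored as its integer codes; if so, a comparison like res.boot$t[, i] == "pos" would never be TRUE and every bagged prediction would default to "neg". Inspecting the object makes this quick to check.
# Run after the loop above (or one pass of it), while res.boot is still in scope
str(res.boot$t)          # numeric codes (1/2) here instead of "neg"/"pos" would
                         # explain why the "pos" comparison never matches
table(res.boot$t[, 1])   # distribution of the first test case's bootstrap predictions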

simulation study using AIC

I have to code a simulation study in R. I have explanatory variables X1,...,X15 ~ N(0,1) and Y ~ N(2 + 2*X1 + 0.8*X2 - 1.2*X15, 1), and I need to simulate n=100 values and repeat that iter=100 times. Then, for each fitted linear model, I have to calculate the AIC and, finally, find the best model. The problem is that I can't figure out how to do this iter=100 times. I wrote the code for one simulation, which is the following:
set.seed(123)
n <- 100
p <- 15
iter <- 100

X <- matrix(rep(NA, n * p), ncol = p)
for (j in 1:p) {
  X[, j] <- rnorm(n = 100, mean = 0, sd = 1)
}

mu <- (2 + 2 * X[, 1]) + (0.8 * X[, 2]) - (1.2 * X[, 15])
Y <- rnorm(n = 100, mean = mu, sd = 1)

sim <- data.frame(Y, X)
d <- lm(Y ~ X, data = sim)
But how do I do the rest, i.e. the 100 simulations and the AIC calculations? I'm really new to R, so I am quite confused.
How about this:
nsim <- 100
nobs <- 100
nvar <- 15

results <- lapply(1:nsim, function(i) {
  X  <- matrix(rnorm(nobs * nvar), nrow = nobs)
  y  <- rnorm(nobs, mean = 2 + X[, c(1, 2, 15)] %*% c(2, .8, -1.2))
  DF <- data.frame(y, X)
  lm(y ~ ., data = DF)
})
That should give you your simulations. Now find the "best" one:
findbest <- which.min(sapply(results, AIC))
results[[findbest]]
Since all data sets are simulated from the same underlying data-generating process, any variation in AIC across replicates is essentially random variation.
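If the assignment instead means picking the best model within each simulated data set (rather than comparing AIC across replicates generated by the same process), a rough sketch along the same lines runs stepwise AIC selection inside each replicate. The selection happens inside the per-replicate function so that step() can still see the simulated data frame; this is only an illustration of the idea, not necessarily what the exercise asks for.
# Sketch: stepwise AIC selection per replicate (k = 2 gives the usual AIC penalty)
selected <- lapply(1:nsim, function(i) {
  X  <- matrix(rnorm(nobs * nvar), nrow = nobs)
  y  <- rnorm(nobs, mean = 2 + X[, c(1, 2, 15)] %*% c(2, .8, -1.2))
  DF <- data.frame(y, X)
  step(lm(y ~ ., data = DF), direction = "both", k = 2, trace = 0)
})

# How often each term survives selection across the replicates;
# X1, X2 and X15 should dominate, other terms appear only by chance
sort(table(unlist(lapply(selected, function(m) names(coef(m))))), decreasing = TRUE)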

Working with iterator value dopar R

I am using the doParallel package (with foreach's %dopar%) in an attempt to parallelise the training of machine-learning algorithms, as they take quite a while.
My plan is to train multiple neural nets, SVMs, and decision trees (currently 10 of each, named neuralnet1 ... neuralnet10, svm1 ... svm10, and so on). The data frame all_classifiers contains the name I want to give each classifier and its training start/stop times:
> head(all_classifiers,3)
classifiers train_start train_stop
1 neuralnet1 7833 8074
2 neuralnet2 45590 45682
3 neuralnet3 64341 64574
> tail(all_classifiers,3)
classifiers train_start train_stop
28 dt8 235639 235737
29 dt9 256497 257198
30 dt10 257814 258034
My loop currently looks like this:
for (i in 1:trainloop) {
  # Select training data + remove NA
  train_start <- all_classifiers[["train_start"]][i]
  train_stop  <- all_classifiers[["train_stop"]][i]
  train_data  <- na.omit(data[train_start:train_stop, ])
  print(paste("Using data from", train_start, "to", train_stop))
  train_scaled <- as.data.frame(train_data)

  # Train appropriate model
  firstLetter <- strtrim(all_classifiers[["classifiers"]][i], 1)
  if (firstLetter == "n") {
    print("Training neural net")
    trained_classifier <- neuralnet(f, data = train_scaled, hidden = c(3),
                                    act.fct = 'logistic', linear.output = F,
                                    stepmax = 1e6, rep = 1, learningrate = 0.30)
  } else if (firstLetter == "s") {
    print("Training SVM")
    trained_classifier <- svm(upmove ~ ., data = train_scaled,
                              kernel = "polynomial", coef0 = 2.0)
  } else if (firstLetter == "d") {
    print("Training DT")
    train_scaled$upmove <- as.factor(train_scaled$upmove)
    trained_classifier <- C5.0(f, data = train_scaled)
  }

  flog.info(paste("Training", all_classifiers[["classifiers"]][i]))
  assign(toString(all_classifiers[["classifiers"]][i]), trained_classifier)
}
I wish to parallelise this loop by using
foreach(i = 1:trainloop,
        .packages = c('neuralnet', 'e1071', 'C50', 'futile.logger')) %dopar% {
  # loop body here
}
But it seems that each worker starts with iterator i=1, while my variable assignments
assign(toString(all_classifiers[["classifiers"]][i]), trained_classifier)
are dependent on the value of the iterator being used. How would I solve this? Eventually I want each name in the first column of all_classifiers to refer to a classifier trained on the associated training start and stop times.
I don't want to get too into the specifics of your code, but here is a small example that will hopefully help you understand how to translate a base R loop to foreach:
x1 <- numeric(10)
for (i in 1:10) {
  x1[i] <- i^2
}
x2 <- foreach(i = 1:10, .combine = rbind) %do% {
  i^2
}
x1 == x2
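To connect that back to the original loop: foreach already collects each iteration's return value, so rather than calling assign() inside the workers (whose environments are discarded), return the trained model and name the resulting list afterwards. A rough sketch with a placeholder model standing in for the real training code:
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Each iteration returns its fitted object; foreach gathers them into a list,
# so every worker uses its own value of i and no assign() is needed.
models <- foreach(i = 1:10, .packages = "stats") %dopar% {
  lm(dist ~ speed, data = cars[sample(nrow(cars), 30), ])  # placeholder for the real training step
}
stopCluster(cl)

# Name the list entries afterwards, e.g. with all_classifiers$classifiers in the real code
names(models) <- paste0("model", 1:10)
models[["model1"]]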

R neuralnet package too slow for millions of records

I am trying to train a neural network for churn prediction with R package neuralnet. Here is the code:
data <- read.csv('C:/PredictChurn.csv')
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
scaled_temp <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
scaled <- data
scaled[, -c(1)] <- scaled_temp[, -c(1)]
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train_ <- scaled[index,]
test_ <- scaled[-index,]
library(neuralnet)
n <- names(train_[, -c(1)])
f <- as.formula(paste("CHURNED_F ~", paste(n[!n %in% "CHURNED_F"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5),linear.output=F)
It works as it should, but when training on the full data set (millions of rows) it just takes too long. I know R is single-threaded by default, so I have been researching how to parallelize the work across all cores. Is it even possible to run this function in parallel? I have tried various packages with no success.
Has anyone been able to do this?
It doesn't have to be the neuralnet package, any solution that lets me train a neural network would work.
Thank you
I have had good experiences with the package Rmpi, and it may be applicable in your case too.
library(Rmpi)
Briefly, its usage is as follows:
nproc <- 4  # could be determined automatically

# Specify one master and nproc-1 slaves
Rmpi::mpi.spawn.Rslaves(nslaves = nproc - 1)

# Execute function "func_to_be_parallelized" on multiple CPUs; pass two variables to the function
my_fast_results <- Rmpi::mpi.parLapply(var1_passed_to_func,
                                       func_to_be_parallelized,
                                       var2_passed_to_func)

# Close slaves
Rmpi::mpi.close.Rslaves(dellog = TRUE)
You can try using the caret and doParallel packages for this. This is what I have been using. It works for some model types but may not work for all.
library(caret)
library(doParallel)  # also attaches parallel, which provides makeCluster() and detectCores()
library(tictoc)      # provides the tic()/toc() timing calls used below
layer1 <- c(6, 12, 18, 24, 30)
layer2 <- c(6, 12, 18, 24, 30)
layer3 <- c(6, 12, 18, 24, 30)
cv.folds <- 5
# In order to make models fully reproducible when using parallel processing, we need to pass seeds as a parameter
# https://stackoverflow.com/questions/13403427/fully-reproducible-parallel-models-using-caret
total.param.permutations = length(layer1) * length(layer2) * length(layer3)
seeds <- vector(mode = "list", length = cv.folds + 1)
set.seed(1)
for(i in 1:cv.folds) seeds[[i]]<- sample.int(n=1, total.param.permutations, replace = TRUE)
seeds[[cv.folds + 1]]<-sample.int(1, 1, replace = TRUE) #for the last model
nn.grid <- expand.grid(layer1 = layer1, layer2 = layer2, layer3 = layer3)
cl <- makeCluster(detectCores()*0.5) # use 50% of cores only, leave rest for other tasks
registerDoParallel(cl)
train_control <- caret::trainControl(method = "cv",
                                     number = cv.folds,
                                     seeds = seeds,       # user-defined seeds for parallel processing
                                     verboseIter = TRUE,
                                     allowParallel = TRUE)
tic("Total Time to NN Training: ")
set.seed(1)
model.nn.caret <- caret::train(form = formula,
                               data = scaled.train.data,
                               method = 'neuralnet',
                               tuneGrid = nn.grid,
                               trControl = train_control)
toc()
# Shut down the parallel backend only after training has finished
stopCluster(cl)
registerDoSEQ()
