Can H2O deeplearning models in R be reproducible while remaining multithreaded? - r

I've been working on validating models developed using h2o.
Specifically, I've been testing a neural net implemented using h2o.deeplearning. I've been attempting to generate consistent results by setting a seed in the H2O function, but even so I see correlation coefficients of only 0.6 to 0.85 between different runs of the same model, even ones with identical seeds.
I did some reading and saw that I could force reproducibility by setting the reproducible flag to TRUE, but at a significant performance cost. The input to this model is too large for that to be feasible.
Has anyone else had to solve a similar problem, or found a way to force H2O neural nets to be reproducible with less performance impact?

From the technical note on this topic, "Why Deep Learning results are not reproducible":
Motivation
H2O's Deep Learning uses a technique called HOGWILD!, which greatly increases the speed of training but is not reproducible by default.
Solution
In order to obtain reproducible results, you must set reproducible = TRUE and a fixed seed, e.g. seed = 1 (any seed works, as long as you use the same one each time). Forcing reproducibility slows down training because it only runs on a single thread. By default, H2O clusters are started with the same number of threads as the number of cores (e.g. 8 is typical on a laptop).
The R example below demonstrates how to produce reproducible deep learning models:
library(h2o)
h2o.init(nthreads = -1)
# Import a sample binary outcome train/test set into R
train <- read.table("http://www.stat.berkeley.edu/~ledell/data/higgs_10k.csv", sep=",")
test <- read.table("http://www.stat.berkeley.edu/~ledell/data/higgs_test_5k.csv", sep=",")
# Convert R data.frames into H2O parsed data objects
training_frame <- as.h2o(train)
validation_frame <- as.h2o(test)
y <- "V1"
x <- setdiff(names(training_frame), y)
family <- "binomial"
training_frame[,c(y)] <- as.factor(training_frame[,c(y)]) #Force Binary classification
validation_frame[,c(y)] <- as.factor(validation_frame[,c(y)])
Now we will fit two models and show that the training AUC is the same both times (i.e., reproducible).
fit <- h2o.deeplearning(x = x, y = y,
                        training_frame = training_frame,
                        reproducible = TRUE,
                        seed = 1)
h2o.auc(fit)
#[1] 0.8715931
fit2 <- h2o.deeplearning(x = x, y = y,
                         training_frame = training_frame,
                         reproducible = TRUE,
                         seed = 1)
h2o.auc(fit2)
#[1] 0.8715931

Related

R training and tuning random forest classifier with hardware challenge

I'm sorry if this is not appropriate to ask here, but please forgive a newbie.
I'm training a multi-class (8 classes) random forest classifier using R caret on my experimental data, on my desktop with 32 GB RAM and a 4-core CPU. However, I keep getting complaints from RStudio that it cannot allocate a vector of 9 GB. So I have to reduce the training set all the way down to 1% of the data just to run k-fold CV and some grid search. As a result my model accuracy is ~50% and the selected features aren't very good at all; only 2 out of 8 classes are being distinguished somewhat reliably. Of course it could be that I don't have any good features, but I want to at least train and tune my model on a decent-sized training set first. What solutions could help? Or is there anywhere I can upload my data and train it? I'm so new that I don't know whether something like a cloud-based service could help me. Pointers will be appreciated.
Edit: I have uploaded the data table and my code, so maybe it is my bad coding that screwed things up.
Here is a link to the data:
https://drive.google.com/file/d/1wScYKd7J-KlRvvDxHAmG3_If1o5yUimy/view?usp=sharing
Here is my code:
#load libraries
library(data.table)
library(caret)
library(caTools)
library(e1071)
#read the data in
df.raw <-fread("CLL_merged_sampled_same_ctrl_40percent.csv", header =TRUE,data.table = FALSE)
#get the useful data
#subset and get rid of useless labels
df.1 <- subset(df.raw, select = c(18:131))
df <- subset(df.1, select = -c(2:4))
#As I want to build an RF model to classify drug treatments,
# make treatmentsum a factor
#there should be 7 levels
df$treatmentsum <- as.factor(df$treatmentsum)
df$treatmentsum
#find near-zero variance features
#I did not remove them. Just flagged them
nzv <- nearZeroVar(df[-1], saveMetrics= TRUE)
nzv[nzv$nzv==TRUE,]
possible.nzv.flagged <- nzv[nzv$nzv=="TRUE",]
write.csv(possible.nzv.flagged, "Near Zero Features flagged.CSV", row.names = TRUE)
#identify correlated features
df.Cor <- cor(df[-1])
highCorr <- sum(abs(df.Cor[upper.tri(df.Cor)]) > .99)
highlyCor <- findCorrelation(df.Cor, cutoff = .99,verbose = TRUE)
#Get rid of strongly correlated features
filtered.df<- df[ ,-highlyCor]
str(filtered.df)
#identify linear dependencies
linear.combo <- findLinearCombos(filtered.df[-1])
linear.combo #no linear ones detected
#split data into training and test sets
#Here is my problem: I want to use 80% of the data for training,
#but on my computer I can only use 0.002
set.seed(123)
split <- sample.split(filtered.df$treatmentsum, SplitRatio = 0.8)
training_set <- subset(filtered.df, split==TRUE)
test_set <- subset(filtered.df, split==FALSE)
#scaling numeric data
#leave the first column labels out
training_set[-1] = scale(training_set[-1])
test_set[-1] = scale(test_set[-1])
training_set[1]
#build RF
#use Cross validation for model training
#I can't use repeated CV as it fails on my machine
#I set a grid search for tuning
control <- trainControl(method="cv", number=10,verboseIter = TRUE, search = 'grid')
#default mtry below, is around 10
#mtry <- sqrt(ncol(training_set))
#I used mtry 1:12 before, but I wanted to test more; limited again by my machine
tunegrid <- expand.grid(.mtry = (1:20))
model <- train(training_set[,-1], as.factor(training_set[,1]),
               data = training_set, method = "rf",
               trControl = control, metric = "Accuracy", maximize = TRUE,
               importance = TRUE, type = "classification",
               ntree = 800, tuneGrid = tunegrid)
print(model)
plot(model)
prediction2 <- predict(model, test_set[,-1])
cm<-confusionMatrix(prediction2, as.factor(test_set[,1]), positive = "1")

How to apply machine learning techniques / how to use model outputs

I am a plant scientist new to machine learning. I have had success writing code and following tutorials on machine learning techniques. My issue is understanding how to actually apply these techniques to answer real-world questions. I don't really understand how to use the model outputs to answer questions.
I recently followed a tutorial creating an algorithm to detect credit card fraud. All of the models ran nicely and I understand how to build them, but how in the world do I take this information and translate it into a definitive answer? Following the same example, let's say I wrote this code for my job: how would I then take real credit card data and screen it using this algorithm? I really want to establish the link between running these models and generating a useful output from real data.
Thank you all.
For the sake of conciseness, I will highlight some specific examples using the same data set, found here:
https://drive.google.com/file/d/1CTAlmlREFRaEN3NoHHitewpqAtWS5cVQ/view
# Import (read_csv() is assumed to come from the readr package)
library(readr)
creditcard_data <- read_csv('PATH')
# Restructure
creditcard_data$Amount=scale(creditcard_data$Amount)
NewData=creditcard_data[,-c(1)]
head(NewData)
#Split
library(caTools)
set.seed(123)
data_sample = sample.split(NewData$Class,SplitRatio=0.80)
train_data = subset(NewData,data_sample==TRUE)
test_data = subset(NewData,data_sample==FALSE)
1) Decision Tree
library(rpart)
library(rpart.plot)
decisionTree_model <- rpart(Class ~ . , creditcard_data, method = 'class')
predicted_val <- predict(decisionTree_model, creditcard_data, type = 'class')
probability <- predict(decisionTree_model, creditcard_data, type = 'prob')
rpart.plot(decisionTree_model)
2) Artificial Neural Network
library(neuralnet)
ANN_model =neuralnet (Class~.,train_data,linear.output=FALSE)
plot(ANN_model)
predANN=compute(ANN_model,test_data)
resultANN=predANN$net.result
resultANN=ifelse(resultANN>0.5,1,0)
3) Gradient Boosting
library(gbm, quietly=TRUE)
library(pROC)   # assumed source of the roc() function used below
# train GBM model
system.time(
model_gbm <- gbm(Class ~ .
, distribution = "bernoulli"
, data = rbind(train_data, test_data)
, n.trees = 100
, interaction.depth = 2
, n.minobsinnode = 10
, shrinkage = 0.01
, bag.fraction = 0.5
, train.fraction = nrow(train_data) / (nrow(train_data) + nrow(test_data))
)
)
# best iteration
gbm.iter = gbm.perf(model_gbm, method = "test")
model.influence = relative.influence(model_gbm, n.trees = gbm.iter, sort. = TRUE)
# plot
plot(model_gbm)
# predict on the test set and compute AUC
gbm_test = predict(model_gbm, newdata = test_data, n.trees = gbm.iter)
gbm_auc = roc(test_data$Class, gbm_test, plot = TRUE, col = "red")
print(gbm_auc)
You develop your model with, preferably, three data sets.
Training, Testing and Validation. (Sometimes different terminology is used.)
Here, Train and Test sets are used to develop the model.
The model you decide upon must never see any of the Validation set. This set is used to see how good your model is; in effect, it simulates the real-world new data that may come to you in the future. Once you decide your model performs at an acceptable level, you can go back and retrain on all of your data to produce the final operational model. Any new 'live' data of interest is then fed to the model, which produces an output. In the case of fraud detection it would output some probability; here you need human input to decide at what level you would flag an event as fraudulent enough to warrant further investigation.
At periodic intervals, as new data arrives, or when your model's performance weakens (fraudsters may become more cunning!), you would repeat the whole process.
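To make that last step concrete, here is a minimal sketch (not part of the original post) of how the fitted GBM above could be used to screen transactions: convert its scores into probabilities and flag everything above an assumed threshold of 0.9 for human review. In production you would apply the same lines to genuinely new, unlabelled transactions rather than to test_data.
# Probability of fraud for each transaction, from the GBM fitted above
fraud_prob <- predict(model_gbm, newdata = test_data,
                      n.trees = gbm.iter, type = "response")
# Assumed business rule: anything above 0.9 goes to a human investigator
threshold <- 0.9
flagged <- test_data[fraud_prob > threshold, ]
nrow(flagged)   # number of transactions flagged for manual review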

Consistent results with multiple runs of h2o deeplearning

For a certain combination of parameters in the deeplearning function of h2o, I get different results each time I run it.
args <- list(list(hidden = c(200, 200, 200),
                  loss = "CrossEntropy",
                  hidden_dropout_ratio = c(0.1, 0.1, 0.1),
                  activation = "RectifierWithDropout",
                  epochs = EPOCHS))

run <- function(extra_params) {
  model <- do.call(h2o.deeplearning,
                   modifyList(list(x = columns, y = c("Response"),
                                   validation_frame = validation,
                                   distribution = "multinomial",
                                   l1 = 1e-5, balance_classes = TRUE,
                                   training_frame = training),
                              extra_params))
}

model <- lapply(args, run)
What would I need to do in order to get consistent results for the model each time I run this?
Deep learning with H2O will not be reproducible if it is run on more than a single core. The results and performance metrics may vary slightly each time you train the model. The implementation in H2O uses a technique called "Hogwild!", which increases the speed of training at the cost of reproducibility on multiple cores.
So if you want reproducible results, you will need to restrict H2O to run on a single core and make sure to use a seed in the h2o.deeplearning call.
Edit based on comment by Darren Cook:
I forgot to include the reproducible = TRUE parameter, which needs to be set in combination with the seed to make it truly reproducible. Note that this will make it a lot slower to run, and it is not advisable with a large dataset.
More information on "Hogwild!"
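Applied to the code in the question, a minimal sketch (assuming the same columns, training and validation objects as in the question) is to add reproducible = TRUE and a fixed seed to the parameter list passed through do.call; the seed value 1 is arbitrary, any fixed value works:
args <- list(list(hidden = c(200, 200, 200),
                  loss = "CrossEntropy",
                  hidden_dropout_ratio = c(0.1, 0.1, 0.1),
                  activation = "RectifierWithDropout",
                  epochs = EPOCHS,
                  reproducible = TRUE,   # forces deterministic (single-threaded) training
                  seed = 1))             # any fixed seed, reused on every run

model <- lapply(args, run)   # 'run' as defined in the question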

R e1071 SVM leave one out cross validation function result differ from manual LOOCV

I'm using the e1071 svm function to classify my data.
I tried two different ways to do LOOCV.
The first one is like this:
svm.model <- svm(mem ~ ., data, kernel = "sigmoid", cost = 7, gamma = 0.009, cross = subSize)
svm.pred = data$mem
svm.pred[which(svm.model$accuracies==0 & svm.pred=='good')]=NA
svm.pred[which(svm.model$accuracies==0 & svm.pred=='bad')]='good'
svm.pred[is.na(svm.pred)]='bad'
conMAT <- table(pred = svm.pred, true = data$mem)
summary(svm.model)
I set cross to the number of observations (subSize) to perform LOOCV, but the classification result is different from my manual version of LOOCV, which looks like this:
# accumulators (not shown in the original snippet, but needed before the loop)
CMAT <- 0
CORR <- numeric(subSize)

for (i in 1:subSize){
  data_Tst <- data[i, 1:dSize]
  data_Trn <- data[-i, 1:dSize]
  svm.model1 <- svm(mem ~ ., data = data_Trn, kernel = "linear", cost = 2, gamma = 0.02)
  svm.pred1 <- predict(svm.model1, data_Tst[, -dSize])
  conMAT <- table(pred = svm.pred1, true = data_Tst[, dSize])
  CMAT <- CMAT + conMAT
  CORR[i] <- sum(diag(conMAT))
}
In my opinion, the LOOCV accuracy should not vary across runs, because the SVM builds a model on all the data except one observation and repeats this until the end of the loop. However, with the svm function's 'cross' argument, the accuracy differs on every run.
Which way is more accurate? Thanks for reading this post! :-)
You are using different hyper-parameters (cost, gamma) and different kernels (linear, sigmoid) in the two versions. If you want identical results, these should be the same each run (see the sketch after the list below).
Also, it depends how Leave One Out (LOO) is implemented:
Does your LOO method leave one out randomly or as a sliding window over the dataset?
Does your LOO method leave one out from one class at a time or both classes at the same time?
Is the training set always the same, or are you using a randomisation procedure before splitting into training and testing sets (assuming you have a separate, independent testing set)? In that case, the examples you are cross-validating would change on each run.
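As an illustration of the first point, here is a minimal sketch (assuming the same data, mem response and e1071 package as in the question) that uses the same kernel, cost and gamma in both approaches, so that any remaining difference comes from the cross-validation procedure itself rather than from the model settings:
library(e1071)
n <- nrow(data)
# Built-in LOOCV: cross equal to the number of rows gives leave-one-out
svm.cv <- svm(mem ~ ., data = data, kernel = "sigmoid",
              cost = 7, gamma = 0.009, cross = n)
acc.builtin <- svm.cv$tot.accuracy / 100   # e1071 reports cross-validation accuracy as a percentage
# Manual LOOCV with the *same* kernel and hyper-parameters
correct <- logical(n)
for (i in 1:n) {
  fit <- svm(mem ~ ., data = data[-i, ], kernel = "sigmoid",
             cost = 7, gamma = 0.009)
  correct[i] <- predict(fit, data[i, ]) == data$mem[i]
}
acc.manual <- mean(correct)
c(builtin = acc.builtin, manual = acc.manual)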

Time series modelling: "train" function with method "nnet" is not giving satisfactory result

I was trying to use the train function in R with nnet as the method on monthly consumption data, but the predicted values all come out roughly equal to some mean value.
I have data for 24 time points (each representing a month) and have used the first 20 for training and the remaining 4 for testing the model. Here is my code:
a<-read.csv("...",header=TRUE)
tem<-a[,5]
hum<-a[,4]
con<- a[,3]
require(quantmod)
require(nnet)
require(caret)
y<-con
plot(con,type="l")
dat <- data.frame( y, x1=tem, x2=hum)
names(dat) <- c('y','x1','x2')
#Fit model
model <- train(y ~ x1 + x2,
               data = dat[1:20,],
               method = 'nnet',
               linout = TRUE,
               trace = FALSE)
ps <- predict(model, dat[21:24,])
plot(1:24,y,type="l",col = 2)
lines(1:24,c(y[1:20],ps), col=3,type="o")
legend(5, 70, c("y", "pred"), cex=1.5, fill=2:3)
Any suggestion on how can I approach this problem alternatively? Is there any way to use Neural Network more efficiently? Or is there any other better method for this?
The problem is likely not enough data. 24 data points is very low for any machine learning problem. If the curve/shape/surface of the data is, e.g., a simple sine wave, then 24 points would be enough.
But for any more complex function, the more data the better. Can you accurately model, e.g., sin^2(x) * cos^0.3(x) / sinh(x) with only 6 data points? No, because the available data does not capture enough detail.
If you can acquire daily data, use that instead.
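A quick simulation (not from the original answer) illustrates the point: fit the same caret/nnet model to 6 points and to 200 points sampled from a known curve and compare the error on a dense grid. The target curve, network size and sample sizes are arbitrary choices for illustration only.
library(caret)
library(nnet)

set.seed(1)
f <- function(x) sin(x)^2 * cos(x)   # an arbitrary "complex" target curve

fit_and_rmse <- function(n_points) {
  x <- seq(0, 2 * pi, length.out = n_points)
  d <- data.frame(x = x, y = f(x))
  m <- train(y ~ x, data = d, method = "nnet",
             tuneGrid = data.frame(size = 5, decay = 0.01),  # fixed, arbitrary settings
             trControl = trainControl(method = "none"),      # no resampling
             linout = TRUE, trace = FALSE)
  grid <- data.frame(x = seq(0, 2 * pi, length.out = 500))
  sqrt(mean((predict(m, grid) - f(grid$x))^2))               # RMSE against the true curve
}

fit_and_rmse(6)     # very few points: typically a large error
fit_and_rmse(200)   # more points: typically a much smaller error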
