MSE using h2o for anomaly detection - r

I was using the example given by h2o for ECG anomaly detection.
When I tried to compute the MSE manually, I got different results.
To demonstrate the difference I used the last test case,
but all 23 cases differ.
The full code is attached:
Thanks,
Eli.
suppressMessages(library(h2o))
localH2O = h2o.init(max_mem_size = '6g', # use 6GB of RAM of *GB available
                    nthreads = -1)       # use all CPUs (8 on my personal computer :3)
# Download and import ECG train and test data into the H2O cluster
train_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv",
                            header = FALSE,
                            sep = ",")
test_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv",
                           header = FALSE,
                           sep = ",")
# Train deep autoencoder learning model on "normal"
# training data, y ignored
anomaly_model <- h2o.deeplearning(x = names(train_ecg),
                                  training_frame = train_ecg,
                                  activation = "Tanh",
                                  autoencoder = TRUE,
                                  hidden = c(50, 20, 50),
                                  l1 = 1e-4,
                                  epochs = 100)
# Compute reconstruction error with the Anomaly
# detection app (MSE between output layer and input layer)
recon_error <- h2o.anomaly(anomaly_model, test_ecg)
# Pull reconstruction error data into R and
# plot to find outliers (last 3 heartbeats)
recon_error <- as.data.frame(recon_error)
recon_error
plot.ts(recon_error)
test_recon <- h2o.predict(anomaly_model, test_ecg)
t <- as.vector(test_ecg[23,])
r <- as.vector(test_recon[23,])
mse.23 <- sum((t-r)^2)/length(t)
mse.23
recon_error[23,]
> mse.23
[1] 2.607374
> recon_error[23,]
[1] 8.264768

It is not really an answer, but I did what @Arno Candel suggested: I combined the test and train data and normalized everything to 0-1. After that, I split the combined, normalized data back into test and train sets and ran the scripts as posted by the OP. However, I am still getting a different MSE from the manual calculation. The MSE is also different when I normalize the test and train data separately. Is there something I can do to make the manual calculation match?
suppressMessages(library(purrr))
suppressMessages(library(dplyr))
suppressMessages(library(h2o))
localH2O = h2o.init(max_mem_size = '6g', # use 6GB of RAM of *GB available
                    nthreads = -1)       # use all CPUs (8 on my personal computer :3)
# Download and import ECG train and test data into the H2O cluster
train_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv",
                            header = FALSE,
                            sep = ",")
test_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv",
                           header = FALSE,
                           sep = ",")
### adding this section
# normalize data
train_ecg <- as.data.frame(train_ecg)
test_ecg <- as.data.frame(test_ecg)
dat <- rbind(train_ecg,test_ecg)
get_desc <- function(x) {
  map(x, ~list(
    min = min(.x),
    max = max(.x),
    mean = mean(.x),
    sd = sd(.x)
  ))
}
normalization_minmax <- function(x, desc) {
  map2_dfc(x, desc, ~(.x - .y$min) / (.y$max - .y$min))
}
desc <- dat %>% get_desc()
dat <- dat %>% normalization_minmax(desc)
# Split back into train/test and convert back to H2OFrames
# (h2o.deeplearning and h2o.anomaly need H2OFrames, not plain matrices)
train_ecg <- as.h2o(dat[1:20, ])
test_ecg <- as.h2o(dat[21:43, ])
# Train deep autoencoder learning model on "normal"
# training data, y ignored
anomaly_model <- h2o.deeplearning(x = names(train_ecg),
                                  training_frame = train_ecg,
                                  activation = "Tanh",
                                  autoencoder = TRUE,
                                  hidden = c(50, 20, 50),
                                  l1 = 1e-4,
                                  epochs = 100)
# Compute reconstruction error with the Anomaly
# detection app (MSE between output layer and input layer)
recon_error <- h2o.anomaly(anomaly_model, test_ecg)
# Pull reconstruction error data into R and
# plot to find outliers (last 3 heartbeats)
recon_error <- as.data.frame(recon_error)
recon_error
plot.ts(recon_error)
test_recon <- h2o.predict(anomaly_model, test_ecg)
t <- as.vector(test_ecg[23,])
r <- as.vector(test_recon[23,])
mse.23 <- sum((t-r)^2)/length(t)
mse.23
recon_error[23,]
> mse.23
[1] 23.14947
> recon_error[23,]
[1] 8.076866

For autoencoders in H2O, the MSE math is done in the normalized space to avoid numerical scaling issues. For example, if you have categorical features or very large numbers, the neural network autoencoder can't directly operate on those numbers, but instead, it first does dummy one-hot encoding and normalization of numeric features, then it does the fwd/back propagation and computation of reconstruction errors (in the normalized and expanded space). You can manually divide each column by its range (max-min) first for purely numerical data, and your results should match.
Here is a JUnit that does this check explicitly (on that very dataset):
https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/test/java/hex/deeplearning/DeepLearningAutoEncoderTest.java#L86-L104
You can also see https://0xdata.atlassian.net/browse/PUBDEV-2078 for more info.
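For a purely numerical dataset like this one, a hedged sketch of that manual check (not part of the original post; it assumes H2O takes the per-column statistics from the training frame, and it reuses the train_ecg, test_ecg, test_recon and recon_error objects from the OP's script) could look like:
# Per-column ranges (max - min), assumed to come from the training data
rng <- apply(as.data.frame(train_ecg), 2, function(x) max(x) - min(x))
x23 <- as.numeric(as.data.frame(test_ecg)[23, ])    # original test row 23
r23 <- as.numeric(as.data.frame(test_recon)[23, ])  # its reconstruction from h2o.predict()
# MSE in the normalized space: scale the residuals by each column's range
mse.23.norm <- mean(((x23 - r23) / rng)^2)
mse.23.norm   # should now be close to recon_error[23, ]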

Related

Is there a way to store the factors selected by a (BE) Stepwise Regression run on N datasets via lapply(full_model, FUN(i) {step(i[[“Coeffs”]])})?

I have already written the following code, all of which works OK:
directory_path <- "~/DAEN_698/sample_obs"
file_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE)
head(file_list, n = 2)
> head(file_list, n = 2)
[1] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-1.csv"
[2] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-2.csv"
# Create another list with just the "n-n-n-n" part of the name of each dataset
DS_name_list = stri_sub(file_list, 49, 55)   # stri_sub() is from the stringi package
head(DS_name_list, n = 3)
> head(DS_name_list, n = 3)
[1] "0-5-1-1" "0-5-1-2" "0-5-1-3"
# This command reads all the data in each of the N csv files via their names
# stored in the 'file_list' list of characters.
csvs <- lapply(file_list, read.csv)
### Run a Backward Elimination Stepwise Regression on each of the N csvs.
# Assign the full model (meaning the one with all 30 candidate regressors
# included as the initial model in step 1).
# This is crucial because if the initial model includes fewer than the total
# number of candidate factors for Stepwise to select from,
# then it could miss 1 or more of the true factors.
full_model <- lapply(csvs, function(i) {
  lm(formula = Y ~ ., data = i) })
# my failed attempt at figuring it out myself
set.seed(50) # for reproducibility
BE_fits3 <- lapply(full_model, function(i) {
  step(object = i[["coefficients"]],
       direction = 'backward', scope = formula(full_model), trace = 0) })
When I run those two lines of code after setting the seed, I get
the following error message in the Console:
Error in terms(object) : object 'i' not found
To briefly elaborate on why it is absolutely essential that the initial model in a
Backward Elimination version of Stepwise Regression include all candidate regressors,
consider the following example: say we start out with an initial model of 25
regressors (X1:X25 instead of X1:X30). In that case Stepwise Regression could never
select any of the IVs/factors X26 through X30 for dataset j, even if one or more
of them really are included in the true underlying population model that
characterizes dataset j.
Instead of two lapply loops, one to fit the models and a second to run the stepwise regressions, use a single for loop that does both operations one after the other. This is an environments issue: step does not seem to find the data when run inside the environment of the lapply function.
I have also changed the code that creates DS_name_list. Below, it processes the full file names without depending on fixed string positions.
DS_name_list <- basename(file_list)
DS_name_list <- tools::file_path_sans_ext(DS_name_list)
head(DS_name_list, n = 2)
And here is the regressions code.
csvs <- lapply(file_list, read.csv)
names(csvs) <- DS_name_list
set.seed(50) # for reproducibility
full_model <- vector("list", length = length(csvs))
BE_fits3 <- vector("list", length = length(csvs))
for(i in seq_along(csvs)) {
full_model[[i]] <- lm(formula = Y ~ ., data = csvs[[i]])
BE_fits3[[i]] <- step(object = full_model[[i]],
scope = formula(full_model[[i]]),
direction = 'backward',
trace = 0)
}
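If the goal is to also store which regressors backward elimination kept for each dataset (as the question title asks), a small follow-up along these lines should work once the loop above has run (this snippet is an addition, not part of the original answer):
# Names of the regressors retained by backward elimination in each dataset,
# with the intercept term dropped
selected_factors <- lapply(BE_fits3, function(fit) setdiff(names(coef(fit)), "(Intercept)"))
names(selected_factors) <- DS_name_list
head(selected_factors, n = 2)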

Very high memory requirements for power simulations using the R function simr::powerSim

I'm trying to do a power analysis for a project where we expect 40,000 observations (at least) across 40-60 "units".
When I do this power analysis using a much smaller number of observations within units (say, 100 per unit), the simulation runs fine. However, memory requirements seem to explode when I up the observations per unit to 300 or 500. I haven't been able to get the simulation to run even on a computer with a memory limit of 70GB.
What's going on? Why are the memory requirements so high and how would I make them more manageable?
The generic power simulation script is below, for reference.
#psacr001_power.R: Run a job using specifications determined using a shell script
#Get the arguments passed from the shell script
args <- commandArgs(trailingOnly=TRUE)
job <- as.numeric(args[[1]])
sample_size <- as.numeric(args[[2]])
number_of_labs <- as.numeric(args[[3]])
icc_size <- as.numeric(args[[4]])
slope_size <- as.numeric(args[[5]])
effect <- as.numeric(args[[6]])
#Required R libraries
library(lme4)
library(simr)
library(pbkrtest)
library(tibble)
#create dataset
simulated.df <- tibble(
  x = sample(rep(0:1, sample_size * number_of_labs / 2)), # group assignment
  g = rep(1:number_of_labs, sample_size),                 # number of clusters
  y = rnorm(sample_size * number_of_labs)                 # outcome
)
#Create simple model
model.of.interest <- lmer(y ~ x + (x|g), data=simulated.df)
#Create a small effect of interest
fixef(model.of.interest)['x'] <- effect
# set the random-intercept variance (ICC)
VarCorr(model.of.interest)["g"][["g"]][1] <- icc_size
# set the random-slope variance
VarCorr(model.of.interest)["g"][["g"]][4] <- slope_size
#try not to be singular
attr(VarCorr(model.of.interest)[["g"]], "correlation")[2:3] <- .3
power_summary <- tibble(
job = numeric(),
successes = numeric(),
trials = numeric(),
mean = numeric(),
lower = numeric(),
upper = numeric()
)
#simulate those models
temp <- powerSim(fit = model.of.interest,
nsim = 200,
progress=FALSE
)
power_summary[1 , 1] <- job
power_summary[1 , 2:6] <- summary(temp)
write.csv(power_summary, paste(c("res_", job, ".csv"), collapse=""), row.names=FALSE)

Why is the error rate from bagging trees much higher than that from a single tree?

I cross-posted this question on Cross Validated, but it seems to me that I'm unlikely to receive any answer there, so I'm posting it here as well.
I'm running the Bagging Tree (Bootstrap Aggregation) classification method and comparing its misclassification error rate with that of a single tree. We expect bagging to do better than a single tree, i.e. the error rate from bagging should be lower than that of a single tree.
I repeat the whole procedure M = 100 times (each time randomly splitting the original data set into a training set and a test set) to obtain 100 single-tree test errors and 100 bagging test errors (using a for loop). Then I use boxplots to compare the distributions of these two types of errors.
# Loading package and data
library(rpart)
library(boot)
library(mlbench)
data(PimaIndiansDiabetes)
# Initialization
n <- 768
ntrain <- 468
ntest <- 300
B <- 100
M <- 100
single.tree.error <- vector(length = M)
bagging.error <- vector(length = M)
# Define statistic
estim.pred <- function(a.sample, vector.of.indices)
{
  current.train <- a.sample[vector.of.indices, ]
  current.fitted.model <- rpart(diabetes ~ ., data = current.train, method = "class")
  predict(current.fitted.model, test.set, type = "class")
}
for (j in 1:M)
{
  # Split the data into test/train sets
  train.idx <- sample(1:n, ntrain, replace = FALSE)
  train.set <- PimaIndiansDiabetes[train.idx, ]
  test.set <- PimaIndiansDiabetes[-train.idx, ]
  # Train a direct tree model
  fitted.tree <- rpart(diabetes ~ ., data = train.set, method = "class")
  pred.test <- predict(fitted.tree, test.set, type = "class")
  single.tree.error[j] <- mean(pred.test != test.set$diabetes)
  # Bootstrap estimates
  res.boot <- boot(train.set, estim.pred, B)
  pred.boot <- vector(length = ntest)
  for (i in 1:ntest)
  {
    pred.boot[i] <- ifelse(mean(res.boot$t[, i] == "pos") >= 0.5, "pos", "neg")
  }
  bagging.error[j] <- mean(pred.boot != test.set$diabetes)
}
boxplot(single.tree.error, bagging.error, ylab = "Misclassification errors", names = c("single.tree", "bagging"))
The resulting boxplot (image not included here) shows the bagging errors sitting well above the single-tree errors.
Could you please explain why the error rate for bagging trees is much higher than that of a single tree? I feel that this does not make sense. I've checked my code but could not find anything unusual.
I've since received an answer at https://stats.stackexchange.com/questions/452882/why-is-the-error-rate-from-bagging-trees-much-higher-than-that-from-a-single-tre and am noting it here to close this question and for future visitors.

Reproducing the Airlines Delay h2o flow example with h2o package does not match

The following script reproduces a problem equivalent to the one stated in the H2O help (Help -> View Example Flow, or Help -> Browse Installed packs.. -> examples -> Airlines Delay.flow, download), but using the h2o R package and a fixed seed (123456):
library(h2o)
# To use available cores
h2o.init(max_mem_size = "12g", nthreads = -1)
IS_LOCAL_FILE = switch(1, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
# Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id="glm_model", seed=123456, training_frame=allyears2k.hex,
ignore_const_cols = T, y = response,
family="binomial", solver="IRLSM",
alpha=0.5,lambda=0.00001, lambda_search=F, standardize=T,
non_negative=F, score_each_iteration=F,
max_iterations=-1, link="family_default", intercept=T, objective_epsilon=0.00001,
beta_epsilon=0.0001, gradient_epsilon=0.0001, prior=-1, max_active_predictors=-1
)
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
This is the Confusion Matrix for the training set:
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
         NO    YES  Error         Rate
NO        0  20887  1.000000  =20887/20887
YES       0  23091  0.000000      =0/23091
Totals    0  43978  0.474942  =20887/43978
And the metrics:
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.2473858
RMSE: 0.4973789
LogLoss: 0.6878898
Mean Per-Class Error: 0.5
AUC: 0.5550138
Gini: 0.1100276
R^2: 0.007965165
Residual Deviance: 60504.04
AIC: 60516.04
In contrast, the H2O Flow run shows much better performance (the Flow metrics and the confusion matrix for the max-F1 threshold are shown in screenshots that are not included here).
The H2O Flow performance is much better than running the same algorithm through the equivalent R package function.
Note: For the sake of simplicity I am using the Airlines Delay problem, a well-known H2O example, but I have seen this kind of significant difference in other similar situations using the GLM algorithm.
Any thoughts on why these significant differences occur?
Appendix A: Using default model parameters
Following the suggestion in @DarrenCook's answer, I used the default build parameters, except for the excluded columns and the seed:
h2o flow
Now the buildModel is invoked like this:
buildModel 'glm', {"model_id":"glm_model-default",
"seed":"123456","training_frame":"allyears2k.hex",
"ignored_columns":
["DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],
"response_column":"IsDepDelayed","family":"binomial"
}
The results and training metrics are shown in the Flow screenshots (not included here).
Running the R script
The following script allows an easy switch to the default configuration (via the IS_DEFAULT_MODEL variable) while also keeping the configuration as stated in the Airlines Delay example:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # To use available cores
IS_LOCAL_FILE = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
# Convert to factor for classification
allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response])
if (IS_DEFAULT_MODEL) {
fit1 <- h2o.glm(
x = predictors, model_id = "glm_model", seed = 123456,
training_frame = allyears2k.hex, y = response, family = "binomial"
)
} else { # Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex,
ignore_const_cols = T, y = response,
family = "binomial", solver = "IRLSM",
alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
non_negative = F, score_each_iteration = F,
max_iterations = -1, link = "family_default", intercept = T, objective_epsilon = 0.00001,
beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1, max_active_predictors = -1
)
}
# Analysis
confMatrix <- h2o.confusionMatrix(fit1)
print("Confusion Matrix for training dataset")
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
It produces the following results:
MSE: 0.2473859
RMSE: 0.497379
LogLoss: 0.6878898
Mean Per-Class Error: 0.5
AUC: 0.5549898
Gini: 0.1099796
R^2: 0.007964984
Residual Deviance: 60504.04
AIC: 60516.04
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
         NO    YES  Error         Rate
NO        0  20887  1.000000  =20887/20887
YES       0  23091  0.000000      =0/23091
Totals    0  43978  0.474942  =20887/43978
Some metrics are close, but the confusion matrix is quite different: the R script predicts all flights as delayed.
Appendix B: Configuration
Package: h2o
Version: 3.18.0.4
Type: Package
Title: R Interface for H2O
Date: 2018-03-08
Note: I also tested the R script under 3.19.0.4231 with the same results.
This is the cluster information after starting H2O from R:
> h2o.init(max_mem_size = "12g", nthreads = -1)
R is connected to the H2O cluster:
H2O cluster version: 3.18.0.4
...
H2O API Extensions: Algos, AutoML, Core V3, Core V4
R Version: R version 3.3.3 (2017-03-06)
Troubleshooting Tip: build the all-defaults model first:
mDef = h2o.glm(predictors, response, allyears2k.hex, family="binomial")
This takes 2 seconds and gives almost exactly the same AUC and confusion matrix as in your Flow screenshots.
So, we now know the problem you see is due to all the model customization you have done...
...except that when I build your fit1, I get basically the same results as my default model:
         NO    YES  Error         Rate
NO     4276  16611  0.795279  =16611/20887
YES    1573  21518  0.068122   =1573/23091
Totals 5849  38129  0.413479  =18184/43978
This was using your script exactly as given, so it fetched the remote csv file. (Oh, I removed the max_mem_size argument, as I don't have 12g on this notebook!)
Assuming you can get exactly your posted results, running exactly the code you posted (in a fresh R session, with a newly started H2O cluster), one possible explanation is that you are using 3.19.x, while the latest stable release is 3.18.0.2? (My test was with 3.14.0.1.)
Finally, I think this is the explanation: both use the same parameter configuration for building the model (that is not the problem), but the H2O Flow example uses a specific parse customization that converts some columns to Enum, which the R script did not specify.
The Airlines Delay problem, as specified in the H2O Flow example, uses the following predictor variables (the flow defines the ignored_columns):
"Year", "Month", "DayOfWeek", "UniqueCarrier",
"FlightNum", "Origin", "Dest", "Distance"
All of these predictors should be parsed as Enum except Distance. Therefore the R script needs to coerce those columns from numeric or character to factor.
Executing with the h2o R package
Here is the updated R script:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1) # To use available cores
IS_LOCAL_FILE = switch(2, FALSE, TRUE)
IS_DEFAULT_MODEL = switch(2, FALSE, TRUE)
if (IS_LOCAL_FILE) {
data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = T)
allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex")
} else {
airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex")
}
response <- "IsDepDelayed"
predictors <- setdiff(names(allyears2k.hex), response)
# Copied and pasted from the flow, then converting to R syntax
predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime",
"ArrTime", "CRSArrTime",
"TailNum", "ActualElapsedTime", "CRSElapsedTime",
"AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut",
"Cancelled", "CancellationCode", "Diverted", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
"IsArrDelayed")
predictors <- setdiff(predictors, predictors.exc)
column.asFactor <- c("Year", "Month", "DayofMonth", "DayOfWeek",
"UniqueCarrier", "FlightNum", "Origin", "Dest", response)
# Coercing as factor (equivalent to Enum from h2o Flow)
# Note: Using lapply does not work, see the answer of this question
# https://stackoverflow.com/questions/49393343/how-to-coerce-multiple-columns-to-factors-at-once-for-h2oframe-object
for (col in column.asFactor) {
allyears2k.hex[col] <- as.factor(allyears2k.hex[col])
}
if (IS_DEFAULT_MODEL) {
fit1 <- h2o.glm(x = predictors, y = response,
training_frame = allyears2k.hex,
family = "binomial", seed = 123456
)
} else { # Copied and pasted from the flow, then converting to R syntax
fit1 <- h2o.glm(
x = predictors,
model_id = "glm_model", seed = 123456,
training_frame = allyears2k.hex,
ignore_const_cols = T, y = response,
family = "binomial", solver = "IRLSM",
alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T,
non_negative = F, score_each_iteration = F,
max_iterations = -1, link = "family_default", intercept = T,
objective_epsilon = 0.00001,
beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1,
max_active_predictors = -1
)
}
# Analysis
print("Confusion Matrix for training dataset")
confMatrix <- h2o.confusionMatrix(fit1)
print(confMatrix)
print(summary(fit1))
h2o.shutdown()
Here is the result of running the R script under the default configuration (IS_DEFAULT_MODEL = T):
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.2001145
RMSE: 0.4473416
LogLoss: 0.5845852
Mean Per-Class Error: 0.3343562
AUC: 0.7570867
Gini: 0.5141734
R^2: 0.1975266
Residual Deviance: 51417.77
AIC: 52951.77
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          NO    YES  Error         Rate
NO     10337  10550  0.505099  =10550/20887
YES     3778  19313  0.163614   =3778/23091
Totals 14115  29863  0.325799  =14328/43978
Executing under H2O Flow
Now, executing the flow Airlines_Delay_GLMFixedSeed, we obtain the same results. Here are the details of the flow configuration:
The parseFiles function:
parseFiles
paths: ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
destination_frame: "allyears2k.hex"
parse_type: "CSV"
separator: 44
number_columns: 31
single_quotes: false
column_names:
["Year","Month","DayofMonth","DayOfWeek","DepTime","CRSDepTime","ArrTime",
"CRSArrTime","UniqueCarrier","FlightNum","TailNum","ActualElapsedTime",
"CRSElapsedTime","AirTime","ArrDelay","DepDelay","Origin","Dest",
"Distance","TaxiIn","TaxiOut","Cancelled","CancellationCode",
"Diverted","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed",
"IsDepDelayed"]
column_types ["Enum","Enum","Enum","Enum","Numeric","Numeric",
"Numeric","Numeric", "Enum","Enum","Enum","Numeric",
"Numeric", "Numeric","Numeric","Numeric",
"Enum","Enum","Numeric","Numeric","Numeric",
"Enum","Enum","Numeric","Numeric","Numeric",
"Numeric","Numeric","Numeric","Enum","Enum"]
delete_on_done: true
check_header: 1
chunk_size: 4194304
where the following predictor columns are converted to Enum: "Year", "Month", "DayOfWeek", "UniqueCarrier", "FlightNum", "Origin", "Dest"
Now invoking the buildModel function as follows, using the default parameters except for ignored_columns and seed:
buildModel 'glm', {"model_id":"glm_model-default","seed":"123456",
"training_frame":"allyears2k.hex",
"ignored_columns":["DayofMonth","DepTime","CRSDepTime","ArrTime",
"CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],"response_column":"IsDepDelayed",
"family":"binomial"}
and finally we get the following result and training output metrics:
model glm_model-default
model_checksum -2438376548367921152
frame allyears2k.hex
frame_checksum -2331137066674151424
description ·
model_category Binomial
scoring_time 1521598137667
predictions ·
MSE 0.200114
RMSE 0.447342
nobs 43978
custom_metric_name ·
custom_metric_value 0
r2 0.197527
logloss 0.584585
AUC 0.757084
Gini 0.514168
mean_per_class_error 0.334347
residual_deviance 51417.772427
null_deviance 60855.951538
AIC 52951.772427
null_degrees_of_freedom 43977
residual_degrees_of_freedom 43211
Comparing both results
The training metrics agree to the first 4 significant digits:
                       R-Script    H2O Flow
MSE:                   0.2001145   0.200114
RMSE:                  0.4473416   0.447342
LogLoss:               0.5845852   0.584585
Mean Per-Class Error:  0.3343562   0.334347
AUC:                   0.7570867   0.757084
Gini:                  0.5141734   0.514168
R^2:                   0.1975266   0.197527
Residual Deviance:     51417.77    51417.772427
AIC:                   52951.77    52951.772427
The confusion matrix is slightly different:
          TP     TN     FP     FN    Error
R-Script  10337  19313  10550  3778  0.325799
H2O Flow  10341  19309  10546  3782  0.3258
My understanding is that the differences are within an acceptable threshold (around 0.0001), so we can say that both interfaces provide the same result.

R neuralnet package too slow for millions of records

I am trying to train a neural network for churn prediction with R package neuralnet. Here is the code:
data <- read.csv('C:/PredictChurn.csv')
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
scaled_temp <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
scaled <- data
scaled[, -c(1)] <- scaled_temp[, -c(1)]
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train_ <- scaled[index,]
test_ <- scaled[-index,]
library(neuralnet)
n <- names(train_[, -c(1)])
f <- as.formula(paste("CHURNED_F ~", paste(n[!n %in% "CHURNED_F"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5),linear.output=F)
It works as it should, but when training with the full data set (in the range of millions of rows) it simply takes too long. R is single-threaded by default, so I have tried researching how to parallelize the work across all the cores. Is it even possible to run this function in parallel? I have tried various packages with no success.
Has anyone been able to do this?
It doesn't have to be the neuralnet package; any solution that lets me train a neural network would work.
Thank you
I have had good experiences with the Rmpi package, and it may be applicable in your case too. Briefly, its usage is as follows:
library(Rmpi)
nproc = 4  # could be determined automatically
# Specify one master and nproc - 1 slaves
Rmpi::mpi.spawn.Rslaves(nslaves = nproc - 1)
# Execute "func_to_be_parallelized" on multiple CPUs; pass two variables to the function
my_fast_results = Rmpi::mpi.parLapply(var1_passed_to_func,
                                      func_to_be_parallelized,
                                      var2_passed_to_func)
# Close the slaves
Rmpi::mpi.close.Rslaves(dellog = TRUE)
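For illustration only, here is a hypothetical adaptation of that pattern to the script in the question. It fits one neuralnet per chunk of train_ on the slaves; the per-chunk models would still have to be combined or ensembled afterwards, which neuralnet does not do for you. The objects f and train_ come from the question; everything else is an assumption:
library(Rmpi)
nproc <- 4
Rmpi::mpi.spawn.Rslaves(nslaves = nproc - 1)
Rmpi::mpi.bcast.cmd(library(neuralnet))  # make neuralnet available on every slave
# Split the scaled training data from the question into one chunk per slave
chunks <- split(train_, cut(seq_len(nrow(train_)), nproc - 1, labels = FALSE))
# Fit one network per chunk in parallel; 'f' is the formula built in the question
fit_chunk <- function(chunk, form) {
  neuralnet(form, data = chunk, hidden = c(5), linear.output = FALSE)
}
chunk_fits <- Rmpi::mpi.parLapply(chunks, fit_chunk, f)
Rmpi::mpi.close.Rslaves(dellog = TRUE)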
You can try using the caret and doParallel packages for this. This is what I have been using. It works for some of the model types but may not work for all.
library(caret)
library(doParallel)  # also loads the parallel package
library(tictoc)      # for tic()/toc() timing
# Candidate sizes for the three hidden layers
layer1 = c(6,12,18,24,30)
layer2 = c(6,12,18,24,30)
layer3 = c(6,12,18,24,30)
cv.folds = 5
# In order to make models fully reproducible when using parallel processing, we need to pass seeds as a parameter
# https://stackoverflow.com/questions/13403427/fully-reproducible-parallel-models-using-caret
total.param.permutations = length(layer1) * length(layer2) * length(layer3)
seeds <- vector(mode = "list", length = cv.folds + 1)
set.seed(1)
# one seed per tuning combination for each resample
# (draw from a wide range; sample.int(n = 1, ...) would set every seed to 1)
for(i in 1:cv.folds) seeds[[i]] <- sample.int(n = 1000, total.param.permutations, replace = TRUE)
seeds[[cv.folds + 1]] <- sample.int(1000, 1, replace = TRUE) # for the last model
nn.grid <- expand.grid(layer1 = layer1, layer2 = layer2, layer3 = layer3)
cl <- makeCluster(max(1, floor(detectCores() * 0.5))) # use ~50% of cores only, leave the rest for other tasks
registerDoParallel(cl)
train_control <- caret::trainControl(method = "cv",
                                     number = cv.folds,
                                     seeds = seeds,   # user-defined seeds for parallel processing
                                     verboseIter = TRUE,
                                     allowParallel = TRUE)
# 'formula' and 'scaled.train.data' are the model formula and the scaled
# training data from the question (f and train_ in the original script)
tic("Total Time to NN Training: ")
set.seed(1)
model.nn.caret = caret::train(form = formula,
                              data = scaled.train.data,
                              method = 'neuralnet',
                              tuneGrid = nn.grid,
                              trControl = train_control)
toc()
# Shut the cluster down only after training has finished
stopCluster(cl)
registerDoSEQ()
