MSE using h2o for anomaly detection - r
I was using the example given by h2o for ECG anomaly detection.
When trying to compute manually the MSE, I got different results.
To demonstrate the difference I used the last test case
but all 23 cases differ.
Attached is the full code:
Thanks,
Eli.
suppressMessages(library(h2o))
localH2O = h2o.init(max_mem_size = '6g', # use 6GB of RAM of *GB available
nthreads = -1) # use all CPUs (8 on my personal computer :3)
# Download and import ECG train and test data into the H2O cluster
train_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv",
header = FALSE,
sep = ",")
test_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv",
header = FALSE,
sep = ",")
# Train deep autoencoder learning model on "normal"
# training data, y ignored
anomaly_model <- h2o.deeplearning(x = names(train_ecg),
training_frame = train_ecg,
activation = "Tanh",
autoencoder = TRUE,
hidden = c(50,20,50),
l1 = 1e-4,
epochs = 100)
# Compute reconstruction error with the Anomaly
# detection app (MSE between output layer and input layer)
recon_error <- h2o.anomaly(anomaly_model, test_ecg)
# Pull reconstruction error data into R and
# plot to find outliers (last 3 heartbeats)
recon_error <- as.data.frame(recon_error)
recon_error
plot.ts(recon_error)
test_recon <- h2o.predict(anomaly_model, test_ecg)
t <- as.vector(test_ecg[23,])
r <- as.vector(test_recon[23,])
mse.23 <- sum((t-r)^2)/length(t)
mse.23
recon_error[23,]
> mse.23
[1] 2.607374
> recon_error[23,]
[1] 8.264768
it is not really an answer but I did what #Arno Candel has suggested. I have tried to combine test and train data and normalize to 0 - 1. After that, I split the combined and normalized data back to test and train data and run the scripts as generated by the OP. However, I am still getting a different MSE using manual calculation. The MSE is also different when I normalized test and train data separately. Is there something I can do to get the manual calculation correctly?
suppressMessages(library(purrr))
suppressMessages(library(dplyr))
suppressMessages(library(h2o))
localH2O = h2o.init(max_mem_size = '6g', # use 6GB of RAM of *GB available
nthreads = -1) # use all CPUs (8 on my personal computer :3)
# Download and import ECG train and test data into the H2O cluster
train_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv",
header = FALSE,
sep = ",")
test_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv",
header = FALSE,
sep = ",")
### adding this section
# normalize data
train_ecg <- as.data.frame(train_ecg)
test_ecg <- as.data.frame(test_ecg)
dat <- rbind(train_ecg,test_ecg)
get_desc <- function(x) {
map(x, ~list(
min = min(.x),
max = max(.x),
mean = mean(.x),
sd = sd(.x)
))
}
normalization_minmax <- function(x, desc) {
map2_dfc(x, desc, ~(.x - .y$min)/(.y$max - .y$min))
}
desc <- dat %>%
get_desc()
dat <- dat %>%
normalization_minmax(desc)
train_ecg <- as.matrix(dat[1:20,]) ; test_ecg <- as.matrix(dat[21:43,])
# Train deep autoencoder learning model on "normal"
# training data, y ignored
anomaly_model <- h2o.deeplearning(x = names(train_ecg),
training_frame = train_ecg,
activation = "Tanh",
autoencoder = TRUE,
hidden = c(50,20,50),
l1 = 1e-4,
epochs = 100)
# Compute reconstruction error with the Anomaly
# detection app (MSE between output layer and input layer)
recon_error <- h2o.anomaly(anomaly_model, test_ecg)
# Pull reconstruction error data into R and
# plot to find outliers (last 3 heartbeats)
recon_error <- as.data.frame(recon_error)
recon_error
plot.ts(recon_error)
test_recon <- h2o.predict(anomaly_model, test_ecg)
t <- as.vector(test_ecg[23,])
r <- as.vector(test_recon[23,])
mse.23 <- sum((t-r)^2)/length(t)
mse.23
recon_error[23,]
> mse.23
[1] 23.14947
> recon_error[23,]
[1] 8.076866
For autoencoders in H2O, the MSE math is done in the normalized space to avoid numerical scaling issues. For example, if you have categorical features or very large numbers, the neural network autoencoder can't directly operate on those numbers, but instead, it first does dummy one-hot encoding and normalization of numeric features, then it does the fwd/back propagation and computation of reconstruction errors (in the normalized and expanded space). You can manually divide each column by its range (max-min) first for purely numerical data, and your results should match.
Here is a JUnit that does this check explicitly (on that very dataset):
https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/test/java/hex/deeplearning/DeepLearningAutoEncoderTest.java#L86-L104
You can also see https://0xdata.atlassian.net/browse/PUBDEV-2078 for more info.
Related
Is there a way 2 store factors selected by a (BE) Stepwise Regression run on N datasets via lapply(full_model, FUN(i) {step(i[[“Coeffs”]])})?
I have already written the following code, all of which works OK: directory_path <- "~/DAEN_698/sample_obs" file_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE) head(file_list, n = 2) > head(file_list, n = 2) [1] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-1.csv" [2] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-2.csv" # Create another list with the just the "n-n-n-n" part of the names of of each dataset DS_name_list = stri_sub(file_list, 49, 55) head(DS_name_list, n = 3) > head(DS_name_list, n = 3) [1] "0-5-1-1" "0-5-1-2" "0-5-1-3" # This command reads all the data in each of the N csv files via their names # stored in the 'file_list' list of characters. csvs <- lapply(file_list, read.csv) ### Run a Backward Elimination Stepwise Regression on each of the N csvs. # Assign the full model (meaning the one with all 30 candidate regressors # included as the initial model in step 1). # This is crucial because if the initial model has less than the number of # total candidate factors for Stepwise to select from in the datasets, # then it could miss 1 or more of the true factors. full_model <- lapply(csvs, function(i) { lm(formula = Y ~ ., data = i) }) # my failed attempt at figuring it out myself set.seed(50) # for reproducibility BE_fits3 <- lapply(full_model, function(i) {step(object = i[["coefficients"]], direction = 'backward', scope = formula(full_model), trace = 0)}) When I hit run on the above 2 lines of code after setting the seed, I get the following error message in the Console: Error in terms`(object) : object 'i' not found To briefly elaborate a bit further on why it is absolutely essential that the initial model when running a Backward Elimination version of Stepwise Regression, consider the following example: Let us say that we start out with an initial model of 25, so, X1:X26 instead of X1:X30, in that case, it would be possible to miss out on Stepwise Regression j being able to select/choose 1 or more of the IVs/factors from X26 through X30, especially if 1 or more of those really are included in the true underlying population model that characterizes dataset j.
Instead of two lapply loops, one to fit the models and the second to run the stepwise regressions, use a for loop doing both operations one after the other. This is an environments thing, it seems that step is not finding the data when run in the environment of the lapply function. I have also changed the code to create DS_name_list. Below it processes the full names without string position dependent code. DS_name_list <- basename(file_list) DS_name_list <- tools::file_path_sans_ext(DS_name_list) head(DS_name_list, n = 2) And here is the regressions code. csvs <- lapply(file_list, read.csv) names(csvs) <- DS_name_list set.seed(50) # for reproducibility full_model <- vector("list", length = length(csvs)) BE_fits3 <- vector("list", length = length(csvs)) for(i in seq_along(csvs)) { full_model[[i]] <- lm(formula = Y ~ ., data = csvs[[i]]) BE_fits3[[i]] <- step(object = full_model[[i]], scope = formula(full_model[[i]]), direction = 'backward', trace = 0) }
Very high memory requirements for power simulations using the R function simr::powerSim
I'm trying to do a power analysis for a project where we expect 40,000 observations (at least) across 40-60 "units". When I do this power analysis using a much smaller number of observations within units (say, 100 per unit), the simulation runs fine. However, memory requirements seem to explode when I up the observations per unit to 300 or 500. I haven't been able to get the simulation to run even on a computer with a memory limit of 70GB. What's going on? Why are the memory requirements so high and how would I make them more manageable? The generic power simulation script is below, for reference. #psacr001_power.R: Run a job using specifications determined using a shell script #Get the arguments passed from the shell script args <- commandArgs(trailingOnly=TRUE) job <- as.numeric(args[[1]]) sample_size <- as.numeric(args[[2]]) number_of_labs <- as.numeric(args[[3]]) icc_size <- as.numeric(args[[4]]) slope_size <- as.numeric(args[[5]]) effect <- as.numeric(args[[6]]) #Required R libraries library(lme4) library(simr) library(pbkrtest) library(tibble) #create dataset simulated.df <- tibble( x = sample(rep(0:1, sample_size * number_of_labs / 2)), #group assignment g = rep(1:number_of_labs, sample_size), #number of clusters y = rnorm(sample_size * number_of_labs) #outcome ) #Create simple model model.of.interest <- lmer(y ~ x + (x|g), data=simulated.df) #Create a small effect of interest fixef(model.of.interest)['x'] <- effect #create various ICC intercept VarCorr(model.of.interest)["g"][["g"]][1] <- icc_size #create varioous slopes VarCorr(model.of.interest)["g"][["g"]][4] <- slope_size #try not to be singular attr(VarCorr(model.of.interest)[["g"]], "correlation")[2:3] <- .3 power_summary <- tibble( job = numeric(), successes = numeric(), trials = numeric(), mean = numeric(), lower = numeric(), upper = numeric() ) #simulate those models temp <- powerSim(fit = model.of.interest, nsim = 200, progress=FALSE ) power_summary[1 , 1] <- job power_summary[1 , 2:6] <- summary(temp) write.csv(power_summary, paste(c("res_", job, ".csv"), collapse=""), row.names=FALSE)
Why is the error rate from bagging trees much higher than that from a single tree?
I cross-post this question here, but it seems to me that I'm unlikely to receive any answer. So I post it here. I'm running the classification method Bagging Tree (Bootstrap Aggregation) and compare the misclassification error rate with one from one single tree. We expect that the result from bagging tree is better then that from one single tree, i.e. error rate from bagging is lower than that of single tree. I repeat the whole procedure M = 100 times (each time splitting randomly the original data set into a training set and a test set) to obtain 100 test errors and bagging test errors (use a for loop). Then I use boxplots to compare the distributions of these two types of errors. # Loading package and data library(rpart) library(boot) library(mlbench) data(PimaIndiansDiabetes) # Initialization n <- 768 ntrain <- 468 ntest <- 300 B <- 100 M <- 100 single.tree.error <- vector(length = M) bagging.error <- vector(length = M) # Define statistic estim.pred <- function(a.sample, vector.of.indices) { current.train <- a.sample[vector.of.indices, ] current.fitted.model <- rpart(diabetes ~ ., data = current.train, method = "class") predict(current.fitted.model, test.set, type = "class") } for (j in 1:M) { # Split the data into test/train sets train.idx <- sample(1:n, ntrain, replace = FALSE) train.set <- PimaIndiansDiabetes[train.idx, ] test.set <- PimaIndiansDiabetes[-train.idx, ] # Train a direct tree model fitted.tree <- rpart(diabetes ~ ., data = train.set, method = "class") pred.test <- predict(fitted.tree, test.set, type = "class") single.tree.error[j] <- mean(pred.test != test.set$diabetes) # Bootstrap estimates res.boot = boot(train.set, estim.pred, B) pred.boot <- vector(length = ntest) for (i in 1:ntest) { pred.boot[i] <- ifelse (mean(res.boot$t[, i] == "pos") >= 0.5, "pos", "neg") } bagging.error[j] <- mean(pred.boot != test.set$diabetes) } boxplot(single.tree.error, bagging.error, ylab = "Misclassification errors", names = c("single.tree", "bagging")) The result is Could you please explain why the error rate for bagging trees is much higher than that of a single tree? I feel that this does not make sense. I've checked my code but could not found anything unusual.
I've received an answer from https://stats.stackexchange.com/questions/452882/why-is-the-error-rate-from-bagging-trees-much-higher-than-that-from-a-single-tre. I posted it here to close this question and for future visitors.
Reproducing the Airlines Delay h2o flow example with h2o package does not match
The following script, reproduces an equivalent problem as it was stated in h2o Help (Help -> View Example Flow or Help -> Browse Installed packs.. -> examples -> Airlines Delay.flow, download), but using h2o R-package and a fixed seed (123456): library(h2o) # To use avaliable cores h2o.init(max_mem_size = "12g", nthreads = -1) IS_LOCAL_FILE = switch(1, FALSE, TRUE) if (IS_LOCAL_FILE) { data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F) allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex") } else { airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv" allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex") } response <- "IsDepDelayed" predictors <- setdiff(names(allyears2k.hex), response) # Copied and pasted from the flow, then converting to R syntax predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime", "TailNum", "ActualElapsedTime", "CRSElapsedTime", "AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut", "Cancelled", "CancellationCode", "Diverted", "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay", "IsArrDelayed") predictors <- setdiff(predictors, predictors.exc) # Convert to factor for classification allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response]) # Copied and pasted from the flow, then converting to R syntax fit1 <- h2o.glm( x = predictors, model_id="glm_model", seed=123456, training_frame=allyears2k.hex, ignore_const_cols = T, y = response, family="binomial", solver="IRLSM", alpha=0.5,lambda=0.00001, lambda_search=F, standardize=T, non_negative=F, score_each_iteration=F, max_iterations=-1, link="family_default", intercept=T, objective_epsilon=0.00001, beta_epsilon=0.0001, gradient_epsilon=0.0001, prior=-1, max_active_predictors=-1 ) # Analysis confMatrix <- h2o.confusionMatrix(fit1) print("Confusion Matrix for training dataset") print(confMatrix) print(summary(fit1)) h2o.shutdown() This is the Confusion Matrix for the training set: Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold: NO YES Error Rate NO 0 20887 1.000000 =20887/20887 YES 0 23091 0.000000 =0/23091 Totals 0 43978 0.474942 =20887/43978 And the metrics: H2OBinomialMetrics: glm ** Reported on training data. ** MSE: 0.2473858 RMSE: 0.4973789 LogLoss: 0.6878898 Mean Per-Class Error: 0.5 AUC: 0.5550138 Gini: 0.1100276 R^2: 0.007965165 Residual Deviance: 60504.04 AIC: 60516.04 On contrary the result of h2o flow has a better performance: and Confusion Matrix for max f1 threshold: The h2o flow performance is much better than running the same algorithm using the equivalent R-package function. Note: For sake of simplicity I am using Airlines Delay problem, that is a well-known problem using h2o, but I realized that such kind of significant difference are found in other similar situations using glm algorithm. Any thought about why these significant differences occur Appendix A: Using default model parameters Following the suggestion from #DarrenCook answer, just using default building parameters except for excluding columns and seed: h2o flow Now the buildModel is invoked like this: buildModel 'glm', {"model_id":"glm_model-default", "seed":"123456","training_frame":"allyears2k.hex", "ignored_columns": ["DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum", "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay", "TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted", "CarrierDelay","WeatherDelay","NASDelay","SecurityDelay", "LateAircraftDelay","IsArrDelayed"], "response_column":"IsDepDelayed","family":"binomial" } and the results are: and the training metrics: Running R-Script The following script allows for an easy switch into default configuration (via IS_DEFAULT_MODEL variable) and also keeping the configuration as it states in the Airlines Delay example: library(h2o) h2o.init(max_mem_size = "12g", nthreads = -1) # To use avaliable cores IS_LOCAL_FILE = switch(2, FALSE, TRUE) IS_DEFAULT_MODEL = switch(2, FALSE, TRUE) if (IS_LOCAL_FILE) { data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = F) allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex") } else { airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv" allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex") } response <- "IsDepDelayed" predictors <- setdiff(names(allyears2k.hex), response) # Copied and pasted from the flow, then converting to R syntax predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime", "TailNum", "ActualElapsedTime", "CRSElapsedTime", "AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut", "Cancelled", "CancellationCode", "Diverted", "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay", "IsArrDelayed") predictors <- setdiff(predictors, predictors.exc) # Convert to factor for classification allyears2k.hex[, response] <- as.factor(allyears2k.hex[, response]) if (IS_DEFAULT_MODEL) { fit1 <- h2o.glm( x = predictors, model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex, y = response, family = "binomial" ) } else { # Copied and pasted from the flow, then converting to R syntax fit1 <- h2o.glm( x = predictors, model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex, ignore_const_cols = T, y = response, family = "binomial", solver = "IRLSM", alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T, non_negative = F, score_each_iteration = F, max_iterations = -1, link = "family_default", intercept = T, objective_epsilon = 0.00001, beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1, max_active_predictors = -1 ) } # Analysis confMatrix <- h2o.confusionMatrix(fit1) print("Confusion Matrix for training dataset") print(confMatrix) print(summary(fit1)) h2o.shutdown() It produces the following results: MSE: 0.2473859 RMSE: 0.497379 LogLoss: 0.6878898 Mean Per-Class Error: 0.5 AUC: 0.5549898 Gini: 0.1099796 R^2: 0.007964984 Residual Deviance: 60504.04 AIC: 60516.04 Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold: NO YES Error Rate NO 0 20887 1.000000 =20887/20887 YES 0 23091 0.000000 =0/23091 Totals 0 43978 0.474942 =20887/43978 Some metrics are close, but the Confusion Matrix is quite diferent, the R-Script predict all flights as delayed. Appendix B: Configuration Package: h2o Version: 3.18.0.4 Type: Package Title: R Interface for H2O Date: 2018-03-08 Note: I tested the R-Script also under 3.19.0.4231 with the same results This is the cluster information after running the R: > h2o.init(max_mem_size = "12g", nthreads = -1) R is connected to the H2O cluster: H2O cluster version: 3.18.0.4 ... H2O API Extensions: Algos, AutoML, Core V3, Core V4 R Version: R version 3.3.3 (2017-03-06)
Troubleshooting Tip: build the all-defaults model first: mDef = h2o.glm(predictors, response, allyears2k.hex, family="binomial") This takes 2 seconds and gives almotst exactly the same AUC and confusion matrix as in your Flow screenshots. So, we now know the problem you see is due to all the model customization you have done... ...except when I build your fit1 I get basically the same results as my default model: NO YES Error Rate NO 4276 16611 0.795279 =16611/20887 YES 1573 21518 0.068122 =1573/23091 Totals 5849 38129 0.413479 =18184/43978 This was using your script exactly as given, so it fetched the remote csv file. (Oh, I removed the max_mem_size argument, as I don't have 12g on this notebook!) Assuming you can get exactly your posted results, running exactly the code you posted (and in a fresh R session, with a newly started H2O cluster), one possible explanation is you are using 3.19.x, but the latest stable release is 3.18.0.2? (My test was with 3.14.0.1)
Finally, I guess this is the explanation: both have the same parameter configuration for building the model (that is not the problem), but the H2o flow uses a specific parsing customization converting some variables values into Enum, that the R-script did not specify. The Airlines Delay problem how it was specified in the h2o Flow example uses as predictor variables (the flow defines the ignored_columns): "Year", "Month", "DayOfWeek", "UniqueCarrier", "FlightNum", "Origin", "Dest", "Distance" Where all of the predictors should be parsed as: Enum except Distance. Therefore the R-Script needs to convert such columns from numeric or char into factor. Executing using h2o R-package Here the R-Script updated: library(h2o) h2o.init(max_mem_size = "12g", nthreads = -1) # To use avaliable cores IS_LOCAL_FILE = switch(2, FALSE, TRUE) IS_DEFAULT_MODEL = switch(2, FALSE, TRUE) if (IS_LOCAL_FILE) { data.input <- read.csv(file = "allyears2k.csv", stringsAsFactors = T) allyears2k.hex <- as.h2o(data.input, destination_frame = "allyears2k.hex") } else { airlinesPath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv" allyears2k.hex <- h2o.importFile(path = airlinesPath, destination_frame = "allyears2k.hex") } response <- "IsDepDelayed" predictors <- setdiff(names(allyears2k.hex), response) # Copied and pasted from the flow, then converting to R syntax predictors.exc = c("DayofMonth", "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime", "TailNum", "ActualElapsedTime", "CRSElapsedTime", "AirTime", "ArrDelay", "DepDelay", "TaxiIn", "TaxiOut", "Cancelled", "CancellationCode", "Diverted", "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay", "IsArrDelayed") predictors <- setdiff(predictors, predictors.exc) column.asFactor <- c("Year", "Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "FlightNum", "Origin", "Dest", response) # Coercing as factor (equivalent to Enum from h2o Flow) # Note: Using lapply does not work, see the answer of this question # https://stackoverflow.com/questions/49393343/how-to-coerce-multiple-columns-to-factors-at-once-for-h2oframe-object for (col in column.asFactor) { allyears2k.hex[col] <- as.factor(allyears2k.hex[col]) } if (IS_DEFAULT_MODEL) { fit1 <- h2o.glm(x = predictors, y = response, training_frame = allyears2k.hex, family = "binomial", seed = 123456 ) } else { # Copied and pasted from the flow, then converting to R syntax fit1 <- h2o.glm( x = predictors, model_id = "glm_model", seed = 123456, training_frame = allyears2k.hex, ignore_const_cols = T, y = response, family = "binomial", solver = "IRLSM", alpha = 0.5, lambda = 0.00001, lambda_search = F, standardize = T, non_negative = F, score_each_iteration = F, max_iterations = -1, link = "family_default", intercept = T, objective_epsilon = 0.00001, beta_epsilon = 0.0001, gradient_epsilon = 0.0001, prior = -1, max_active_predictors = -1 ) } # Analysis print("Confusion Matrix for training dataset") confMatrix <- h2o.confusionMatrix(fit1) print(confMatrix) print(summary(fit1)) h2o.shutdown() Here the result running the R-Script under default configuraiton IS_DEFAULT_MODEL=T: H2OBinomialMetrics: glm ** Reported on training data. ** MSE: 0.2001145 RMSE: 0.4473416 LogLoss: 0.5845852 Mean Per-Class Error: 0.3343562 AUC: 0.7570867 Gini: 0.5141734 R^2: 0.1975266 Residual Deviance: 51417.77 AIC: 52951.77 Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold: NO YES Error Rate NO 10337 10550 0.505099 =10550/20887 YES 3778 19313 0.163614 =3778/23091 Totals 14115 29863 0.325799 =14328/43978 Executing under h2o flow Now executing the flow: Airlines_Delay_GLMFixedSeed, we can obtain the same results. Here the detail about the flow configuration: The parseFiles function: parseFiles paths: ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"] destination_frame: "allyears2k.hex" parse_type: "CSV" separator: 44 number_columns: 31 single_quotes: false column_names: ["Year","Month","DayofMonth","DayOfWeek","DepTime","CRSDepTime","ArrTime", "CRSArrTime","UniqueCarrier","FlightNum","TailNum","ActualElapsedTime", "CRSElapsedTime","AirTime","ArrDelay","DepDelay","Origin","Dest", "Distance","TaxiIn","TaxiOut","Cancelled","CancellationCode", "Diverted","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay", "LateAircraftDelay","IsArrDelayed", "IsDepDelayed"] column_types ["Enum","Enum","Enum","Enum","Numeric","Numeric", "Numeric","Numeric", "Enum","Enum","Enum","Numeric", "Numeric", "Numeric","Numeric","Numeric", "Enum","Enum","Numeric","Numeric","Numeric", "Enum","Enum","Numeric","Numeric","Numeric", "Numeric","Numeric","Numeric","Enum","Enum"] delete_on_done: true check_header: 1 chunk_size: 4194304 where the following predictor columns are converted to Enum: "Year", "Month", "DayOfWeek", "UniqueCarrier", "FlightNum", "Origin", "Dest" Now invoking the buildModel function as follows, using the default parameters except for ignored_columns and seed: buildModel 'glm', {"model_id":"glm_model-default","seed":"123456", "training_frame":"allyears2k.hex", "ignored_columns":["DayofMonth","DepTime","CRSDepTime","ArrTime", "CRSArrTime","TailNum", "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay", "TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted", "CarrierDelay","WeatherDelay","NASDelay","SecurityDelay", "LateAircraftDelay","IsArrDelayed"],"response_column":"IsDepDelayed", "family":"binomial"} and finally we get the following result: and Training Output Metrics: model glm_model-default model_checksum -2438376548367921152 frame allyears2k.hex frame_checksum -2331137066674151424 description · model_category Binomial scoring_time 1521598137667 predictions · MSE 0.200114 RMSE 0.447342 nobs 43978 custom_metric_name · custom_metric_value 0 r2 0.197527 logloss 0.584585 AUC 0.757084 Gini 0.514168 mean_per_class_error 0.334347 residual_deviance 51417.772427 null_deviance 60855.951538 AIC 52951.772427 null_degrees_of_freedom 43977 residual_degrees_of_freedom 43211 Comparing both results The training metrics are almost the same for first 4-significant digits: R-Script H2o Flow MSE: 0.2001145 0.200114 RMSE: 0.4473416 0.447342 LogLoss: 0.5845852 0.584585 Mean Per-Class Error: 0.3343562 0.334347 AUC: 0.7570867 0.757084 Gini: 0.5141734 0.514168 R^2: 0.1975266 0.197527 Residual Deviance: 51417.77 51417.772427 AIC: 52951.77 52951.772427 Confusion Matrix is slightly different: TP TN FP FN R-Script 10337 19313 10550 3778 H2o Flow 10341 19309 10546 3782 Error R-Script 0.325799 H2o Flow 0.3258 My understanding is that the difference are withing the acceptable threshold (around 0.0001), therefore we can say that both interfaces provide the same result.
R neuralnet package too slow for millions of records
I am trying to train a neural network for churn prediction with R package neuralnet. Here is the code: data <- read.csv('C:/PredictChurn.csv') maxs <- apply(data, 2, max) mins <- apply(data, 2, min) scaled_temp <- as.data.frame(scale(data, center = mins, scale = maxs - mins)) scaled <- data scaled[, -c(1)] <- scaled_temp[, -c(1)] index <- sample(1:nrow(data),round(0.75*nrow(data))) train_ <- scaled[index,] test_ <- scaled[-index,] library(neuralnet) n <- names(train_[, -c(1)]) f <- as.formula(paste("CHURNED_F ~", paste(n[!n %in% "CHURNED_F"], collapse = " + "))) nn <- neuralnet(f,data=train_,hidden=c(5),linear.output=F) It works as it should, however when training with the full data set (in the range of millions of rows) it just takes too long. So I know R is by default single threaded, so I have tried researching on how to parallelize the work into all the cores. Is it even possible to make this function in parallel? I have tried various packages with no success. Has anyone been able to do this? It doesn't have to be the neuralnet package, any solution that lets me train a neural network would work. Thank you
I have had good experiences with the package Rmpi, and it may be applicable in your case too. library(Rmpi) Briefly, its usage is as follows: nproc = 4 # could be automatically determined # Specify one master and nproc-1 slaves Rmpi:: mpi.spawn.Rslaves(nslaves=nproc-1) # Execute function "func_to_be_parallelized" on multiple CPUs; pass two variables to function my_fast_results = Rmpi::mpi.parLapply(var1_passed_to_func, func_to_be_parallelized, var2_passed_to_func) # Close slaves Rmpi::mpi.close.Rslaves(dellog=T)
You can try using the caret and doParallel packages for this. This is what I have been using. It works for some of the model types but may not work for all. layer1 = c(6,12,18,24,30) layer2 = c(6,12,18,24,30) layer3 = c(6,12,18,24,30) cv.folds = 5 # In order to make models fully reproducible when using parallel processing, we need to pass seeds as a parameter # https://stackoverflow.com/questions/13403427/fully-reproducible-parallel-models-using-caret total.param.permutations = length(layer1) * length(layer2) * length(layer3) seeds <- vector(mode = "list", length = cv.folds + 1) set.seed(1) for(i in 1:cv.folds) seeds[[i]]<- sample.int(n=1, total.param.permutations, replace = TRUE) seeds[[cv.folds + 1]]<-sample.int(1, 1, replace = TRUE) #for the last model nn.grid <- expand.grid(layer1 = layer1, layer2 = layer2, layer3 = layer3) cl <- makeCluster(detectCores()*0.5) # use 50% of cores only, leave rest for other tasks registerDoParallel(cl) train_control <- caret::trainControl(method = "cv" ,number=cv.folds ,seeds = seeds # user defined seeds for parallel processing ,verboseIter = TRUE ,allowParallel = TRUE ) stopCluster(cl) registerDoSEQ() tic("Total Time to NN Training: ") set.seed(1) model.nn.caret = caret::train(form = formula, data = scaled.train.data, method = 'neuralnet', tuneGrid = nn.grid, trControl = train_control ) toc()