I am getting an error when I try to run glmulti on different datasets in parallel using furrr::future_map. It works when future_map is run sequentially.
#Load packages
library(furrr)
library(future)
library(glmulti)
library(tidyverse)

#This doesn't work
plan(multisession, workers = 2) #Set number of parallel sessions (can't use multicore on Windows)
mods <- list(tibble(exposure = rnorm(100, 0:100), outcome = rbinom(100, 1, 0.5)),
             tibble(exposure = rnorm(100, 0:100), outcome = rbinom(100, 1, 0.5))) %>%
  future_map(~glmulti(
    y = "outcome",
    xr = "exposure",
    data = .x,
    level = 1,
    method = "g",           #Genetic algorithm
    fitfunction = "glm",
    family = binomial,
    confsetsize = 2,        #Maximum number of possible models, so it doesn't run indefinitely
    plotty = F, report = F  #To simplify the outputs
  ))
Here is the error message this gives:
Error in get(as.character(FUN), mode = "function", envir = envir) : object 'aic' of mode 'function' was not found
It runs fine when done sequentially:
#This works
plan(multiprocess, workers = 1) #1 worker, so normal map behaviour.

mods <- list(tibble(exposure = rnorm(1000, 0:1000), outcome = rbinom(1000, 1, 0.5)),
             tibble(exposure = rnorm(1000, 0:1000), outcome = rbinom(1000, 1, 0.5))) %>%
  future_map(~glmulti(
    y = "outcome",
    xr = "exposure",
    data = .x,
    level = 1,
    method = "g",           #Genetic algorithm
    fitfunction = "glm",
    family = binomial,
    confsetsize = 2,
    plotty = F, report = F
  ))
Is there any way to fix this, or is it just a problem with one of the two packages? If so, is the issue more likely to be with furrr or with glmulti?
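One possible workaround (a sketch, untested, assuming the error comes from glmulti's helper functions such as aic not being available in the worker sessions) is to tell furrr to attach glmulti on every worker via furrr_options():

#Possible workaround (untested sketch): attach glmulti on every worker via furrr's options
plan(multisession, workers = 2)

mods <- list(tibble(exposure = rnorm(100, 0:100), outcome = rbinom(100, 1, 0.5)),
             tibble(exposure = rnorm(100, 0:100), outcome = rbinom(100, 1, 0.5))) %>%
  future_map(~glmulti(
    y = "outcome",
    xr = "exposure",
    data = .x,
    level = 1,
    method = "g",
    fitfunction = "glm",
    family = binomial,
    confsetsize = 2,
    plotty = F, report = F
  ),
  .options = furrr_options(packages = "glmulti")) #Attach glmulti in each worker session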
I am trying to run multiscale geographically weighted regression (MGWR) using the GWmodel package in R. When running the function gwr.multiscale this error is shown:
Error in gw_weight_vec(vdist, bw, kernel, adaptive): Not compatible with requested type: [type=NULL; target=double].
An example:
library(GWmodel)
data(LondonHP)
dist <- gw.dist(coordinates(londonhp))
ab_gwr <- gwr.multiscale(PURCHASE ~ FLOORSZ + PROF,
                         data = londonhp,
                         criterion = "dCVR",
                         kernel = "gaussian",
                         adaptive = FALSE,
                         var.dMat.indx = 2,
                         bws0 = c(100, 100, 100),
                         bw.seled = rep(T, 3),
                         dMats = list(dist, dist, dist),
                         parallel.method = "omp",
                         parallel.arg = "omp")
I have also tried other settings: an adaptive bandwidth, fewer covariates, different values for the bws0 parameter, and so on. Other kinds of errors occur depending on what I try.
I am following the example from the package's PDF.
The parameter var.dMat.indx specifies which distance matrix in dMats is used for each variable, and I had used it incorrectly in my code. The solution:
library(GWmodel)
data(LondonHP)
dist <- gw.dist(coordinates(londonhp))
ab_gwr <- gwr.multiscale(PURCHASE ~ FLOORSZ + PROF,
                         data = londonhp,
                         criterion = "dCVR",
                         kernel = "gaussian",
                         adaptive = FALSE,
                         var.dMat.indx = 1:3,
                         bws0 = c(100, 100, 100),
                         bw.seled = rep(TRUE, 3),
                         dMats = list(dist, dist, dist),
                         parallel.method = "omp",
                         parallel.arg = "omp")
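To check that the fit ran as expected, the per-location coefficient estimates can then be inspected. A quick look, assuming the returned object follows GWmodel's usual structure with an SDF element holding the results:

#Inspect the per-location results (assumes the usual GWmodel result structure with an SDF element)
summary(ab_gwr$SDF)
head(ab_gwr$SDF@data)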
I tried to run a regression using caret's nnet, but I got an error.
library(tidyverse)
library(caret)
feature = rnorm(100, 0, 1) %>% as.matrix()
colnames(feature) = "x1"
outcome = rnorm(100, 0, 1) %>% as.matrix()
colnames(feature) = "y1"
model = caret::train(
x = feature, y = outcome, method = "nnet",
tuneGrid = expand.grid(size=c(1:3), decay=seq(0.1, 1, 0.1)),
weights = NULL, linout = TRUE
)
Error: Metric RMSE not applicable for classification models
Of course, I want to regress, not classify. In order to show this, I set the option linout = TRUE. What went wrong?
Also, following this question, I tried removing as.matrix, but that produces another error.
library(tidyverse)
library(caret)
feature = rnorm(100, 0, 1) %>% as.double()
outcome = rnorm(100, 0, 1) %>% as.double()
CATE_model = caret::train(
x = feature, y = outcome, method = "nnet",
tuneGrid = expand.grid(size=c(1:3), decay=seq(0.1, 1, 0.1)),
weights = NULL, linout = TRUE
)
Error: Please use column names for 'x'
Thanks a lot.
The input to the neural network seems to require column names.
This makes sense, as the trained network may later need to identify the correct columns in its input for prediction.
If you put your features into a data frame with column names (in my case x), it works.
library(tidyverse)
library(caret)
feature = data.frame(x = rnorm(100, 0, 1) %>% as.double()) # changed line
outcome = rnorm(100, 0, 1) %>% as.double()
CATE_model = caret::train(
x = feature, y = outcome, method = "nnet",
tuneGrid = expand.grid(size=c(1:3), decay=seq(0.1, 1, 0.1)),
weights = NULL, linout = TRUE
)
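As a quick follow-up (illustrative only, assuming the model above trained without error), predictions can be made from new data whose column name matches the one used for training:

#Predict on new data; the column name must match the training feature ("x")
new_data <- data.frame(x = rnorm(5, 0, 1))
predict(CATE_model, newdata = new_data)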
I am new to R and trying to do hyperparameter tuning for xgboost (binary classification). However, I am getting an error, and I would appreciate it if someone could help me.
Error in as.matrix(cv.res)[, 3] : subscript out of bounds In addition: Warning message: 'early.stop.round' is deprecated. Use 'early_stopping_rounds' instead. See help("Deprecated") and help("xgboost-deprecated").
Please find the code snippet below.
I would also appreciate it if someone could suggest an alternative to this approach in R.
X_Train <- as(X_train, "dgCMatrix")

GS_LogLoss = data.frame("Rounds" = numeric(),
                        "Depth" = numeric(),
                        "r_sample" = numeric(),
                        "c_sample" = numeric(),
                        "minLogLoss" = numeric(),
                        "best_round" = numeric())

for (rounds in seq(50, 100, 25)) {
  for (depth in c(4, 6, 8, 10)) {
    for (r_sample in c(0.5, 0.75, 1)) {
      for (c_sample in c(0.4, 0.6, 0.8, 1)) {
        for (imb_scale_pos_weight in c(5, 10, 15, 20, 25)) {
          for (wt_gamma in c(5, 7, 10)) {
            for (wt_max_delta_step in c(5, 7, 10)) {
              for (wt_min_child_weight in c(5, 7, 10, 15)) {
                set.seed(1024)
                eta_val = 2 / rounds
                cv.res = xgb.cv(data = X_Train, nfold = 2, label = y_train,
                                nrounds = rounds,
                                eta = eta_val,
                                max_depth = depth,
                                subsample = r_sample,
                                colsample_bytree = c_sample,
                                early.stop.round = 0.5 * rounds,
                                scale_pos_weight = imb_scale_pos_weight,
                                max_delta_step = wt_max_delta_step,
                                gamma = wt_gamma,
                                objective = 'binary:logistic',
                                eval_metric = 'auc',
                                verbose = FALSE)

                print(paste(rounds, depth, r_sample, c_sample, min(as.matrix(cv.res)[, 3])))

                GS_LogLoss[nrow(GS_LogLoss) + 1, ] = c(rounds,
                                                       depth,
                                                       r_sample,
                                                       c_sample,
                                                       min(as.matrix(cv.res)[, 3]),
                                                       which.min(as.matrix(cv.res)[, 3]))
              }
            }
          }
        }
      }
    }
  }
}
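As for the immediate subscript-out-of-bounds error: in recent versions of xgboost, xgb.cv() returns a list, so indexing as.matrix(cv.res)[, 3] fails; the cross-validation history lives in cv.res$evaluation_log instead. A sketch (the column names depend on the chosen eval_metric):

#Sketch: read the CV history from the evaluation log instead of as.matrix(cv.res)
#(assumes a recent xgboost; with eval_metric = 'auc' the relevant column is test_auc_mean)
best_score <- max(cv.res$evaluation_log$test_auc_mean)
best_round <- which.max(cv.res$evaluation_log$test_auc_mean)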
To do your hyperparameter selection, you could use the metapackage tidymodels, especially the packages parsnip, rsample, yardstick and tune.
A workflow like this would work:
library(tidyverse)
library(tidymodels)
# Specify the model and the parameters to tune (parsnip)
model <-
boost_tree(tree_depth = tune(), mtry = tune()) %>%
set_mode("classification") %>%
set_engine("xgboost")
# Specify the resampling method (rsample)
splits <- vfold_cv(X_train, v = 2)
# Specify the metrics to optimize (yardstick)
metrics <- metric_set(roc_auc)
# Specify the parameters grid (or you can use dials to automate your grid search)
grid <- expand_grid(tree_depth = c(4, 6, 8, 10),
                    mtry = c(2, 10, 50)) # You can add others

# Run each model (tune)
tuned <- tune_grid(formula = Y ~ .,
                   model = model,
                   resamples = splits,
                   grid = grid,
                   metrics = metrics,
                   control = control_grid(verbose = TRUE))
# Check results
show_best(tuned)
autoplot(tuned)
select_best(tuned)
# Update model
tuned_model <-
model %>%
finalize_model(select_best(tuned)) %>%
fit(Y ~ ., data = X_train)
# Make prediction
predict(tuned_model, X_train)
predict(tuned_model, X_test)
Please note that the parameter names used in the model specification differ from the original names in xgboost, because parsnip is a unified interface with consistent names across several models. See here.
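If you want to see exactly how parsnip maps these unified names onto xgboost's native arguments, translate() prints the underlying fit call for the specification above:

# Show how the unified parsnip arguments map to xgboost's native ones
boost_tree(tree_depth = tune(), mtry = tune()) %>%
  set_mode("classification") %>%
  set_engine("xgboost") %>%
  translate()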
I am running an h2o grid search in R. The model is a GLM using a gamma distribution.
I have defined the grid using the following settings.
hyper_parameters = list(alpha = c(0, .5), missing_values_handling = c("Skip", "MeanImputation"))
h2o.grid(algorithm = "glm", # Setting algorithm type
grid_id = "grid.s", # Id so retrieving information on iterations will be easier later
x = predictors, # Setting predictive features
y = response, # Setting target variable
training_frame = data, # Setting training set
validation_frame = validate, # Setting validation frame
hyper_params = hyper_parameters, # Setting apha values for iterations
remove_collinear_columns = T, # Parameter to remove collinear columns
lambda_search = T, # Setting parameter to find optimal lambda value
seed = 1234, # Setting to ensure replicateable results
keep_cross_validation_predictions = F, # Setting to save cross validation predictions
compute_p_values = F, # Calculating p-values of the coefficients
family = 'gamma', # Distribution type used
standardize = T, # Standardizing continuous variables
nfolds = 2, # Number of cross-validations
fold_assignment = "Modulo", # Specifying fold assignment type to use for cross validations
link = "log")
When I run the above script, I get the following error:
Error in hyper_names[[index2]] : subscript out of bounds
Please can you help me find where the error is?
As discussed in the comments, it is difficult to tell what the cause of the error is without sample data and code. The out-of-bounds error could mean that the code is trying to access a value that does not exist in the input, so the problem could lie in any of the inputs to h2o.grid(). I would check the columns and rows of the training and validation data sets. The hyperparameters from the question run fine with family = "binomial".
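For example, a few quick checks on the frames and column names passed to h2o.grid(), purely illustrative and using the object names from the question:

# Quick sanity checks on the inputs (object names taken from the question)
dim(data)                                        # rows/columns of the training frame
dim(validate)                                    # rows/columns of the validation frame
setdiff(c(predictors, response), colnames(data)) # predictors/response missing from the frame?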
The code below runs fine with the glm algorithm. I have made several assumptions: (1) family = binomial is used instead of family = gamma, based on the sample data created; (2) the response y is binary; (3) a particular train/test split ratio; (4) the predictors are limited to three independent variables (x1, x2, x3); (5) there is one binary response variable (y).
Import libraries
library(h2o)
library(h2oEnsemble)
Create sample data
x1 <- abs(100*rnorm(100))
x2 <- 10+abs(100*rnorm(100))
x3 <- 100+abs(100*rnorm(100))
#y <- rnorm(100)
y <- floor(runif(100,0,1.5))
df <- data.frame(x1, x2, x3,y)
df$y <- ifelse(df$y==1, 'yes', 'no')
df$y <- as.factor(df$y)
head(df)
Initialize h2o
h2o.init()
Prepare data in required h2o format
df <- as.h2o(df)
y <- "y"
x <- setdiff( names(df), y )
df<- df[ df$y %in% c("no", "yes"), ]
h2o.setLevels(df$y, c("no","yes") )
# Split data into train, validation and test sets
data <- h2o.splitFrame( df, ratios = c(.6, 0.15) )
names(data) <- c('train', 'valid', 'test')
data$train
Set parameters
grid_id <- 'glm_grid'
hyper_parameters <- list(alpha = c(0, .5, 1),
                         lambda = c(1, 0.5, 0.1, 0.01),
                         missing_values_handling = c("Skip", "MeanImputation"),
                         tweedie_variance_power = c(0, 1, 1.1, 1.8, 1.9, 2, 2.1, 2.5, 2.6, 3, 5, 7),
                         seed = 1234)
Fit h2o.grid()
h2o.grid(
algorithm = "glm",
#grid_id = grid_id,
hyper_params = hyper_parameters,
training_frame = data$train,
validation_frame = data$valid,
x = x,
y = y,
lambda_search = TRUE,
remove_collinear_columns = T,
keep_cross_validation_predictions = F,
compute_p_values = F,
standardize = T,
nfolds = 2,
fold_assignment = "Modulo",
family = "binomial"
)
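To inspect the results, assign the h2o.grid() call above to an object, e.g. grid_fit <- h2o.grid(...), and rank its models with h2o.getGrid(). A sketch (grid_fit is a hypothetical name, and the sort metric assumes the binomial family used above):

# Rank the grid's models by AUC (grid_fit is a hypothetical name for the object returned by h2o.grid)
sorted_grid <- h2o.getGrid(grid_fit@grid_id, sort_by = "auc", decreasing = TRUE)
sorted_grid
best_model <- h2o.getModel(sorted_grid@model_ids[[1]]) # retrieve the best-performing model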
Is it possible to create a deep learning net that gives multiple outputs?
The reason for doing this is to also try to capture the relationships between outputs.
In the examples given I can only create one output.
library(h2o)
localH2O = h2o.init()
irisPath = system.file("extdata", "iris.csv", package = "h2o")
iris.hex = h2o.importFile(localH2O, path = irisPath)
h2o.deeplearning(x = 1:4, y = 5, data = iris.hex, activation = "Tanh",
hidden = c(10, 10), epochs = 5)
It doesn't look like multiple response columns are currently supported in H2O (H2O FAQ and H2O Google Group topic). Their suggestion is to train a new model for each response.
(Nonsensical) example:
library(h2o)
localH2O <- h2o.init()
irisPath <- system.file("extdata", "iris.csv", package = "h2o")
iris.hex <- h2o.importFile(localH2O, path = irisPath)
m1 <- h2o.deeplearning(x = 1:2, y = 3, data = iris.hex, activation = "Tanh",
hidden = c(10, 10), epochs = 5, classification = FALSE)
m2 <- h2o.deeplearning(x = 1:2, y = 4, data = iris.hex, activation = "Tanh",
hidden = c(10, 10), epochs = 5, classification = FALSE)
However, it appears that multiple responses are available through the deepnet package (check library(sos); findFn("deep learning")).
library(deepnet)
x <- as.matrix(iris[,1:2])
y <- as.matrix(iris[,3:4])
m3 <- dbn.dnn.train(x = x, y = y, hidden = c(5,5))
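Predictions for both response columns can then be obtained with deepnet's nn.predict(), which returns a matrix with one column per output; a brief illustration:

#Predict both outputs at once; the result has one column per response
pred <- nn.predict(m3, x)
dim(pred)  # 150 x 2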