I'm new to machine learning and deep learning, and I'm working with the MNIST dataset using the keras, tensorflow, and recipes packages to build an MLP model.
I have a question about the normalization step in my preprocessing.
I first tried dividing every numeric variable (the pixels) by 255 so that each value falls into [0, 1], and then I did the following:
# load the packages and the data
library(tidymodels)   # recipes, rsample, parsnip, workflows, yardstick
library(tidyverse)    # readr, dplyr
library(janitor)      # clean_names()

digit <- read_csv("digit_train.csv") %>%
  clean_names() %>%
  mutate(label = factor(label)) %>%
  mutate_if(is.numeric, funs(. / 255))
# split the data
set.seed(52)
train_test_split <- initial_split(digit, prop = 0.7)
train <- training(train_test_split)
test <- testing(train_test_split)
# create the recipe and build the model
digit_recipe_mlp <- recipe(label ~ ., train) %>%
  update_role(id, new_role = "id")

digit_mlp <- mlp(hidden_units = 120,
                 epochs = 10,
                 dropout = .13,
                 activation = "relu") %>%
  set_engine("keras") %>%
  set_mode("classification")

digit_wf_mlp <- workflow() %>%
  add_recipe(digit_recipe_mlp) %>%
  add_model(digit_mlp) %>%
  fit(train)
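For reference, the test-set accuracy mentioned below can be computed with yardstick along these lines (a minimal sketch, assuming the objects above):

predict(digit_wf_mlp, test) %>%
  bind_cols(test %>% select(label)) %>%
  accuracy(truth = label, estimate = .pred_class)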
It turns out my model has some overfitting, and the overall accuracy is only 97.5%, so I'm thinking about how to improve it. I then tried another way to normalize my data, inside the recipe: I used step_normalize and step_range. However, neither of them worked; with either step the model still runs, but the fitted results only return NaN. I'm wondering why that happens.
digit_recipe_mlp <- recipe(label ~ ., train) %>%
  update_role(id, new_role = "id") %>%
  step_normalize(all_numeric())

digit_mlp <- mlp(hidden_units = 120,
                 epochs = 10,
                 dropout = .13,
                 activation = "relu") %>%
  set_engine("keras") %>%
  set_mode("classification")

digit_wf_mlp <- workflow() %>%
  add_recipe(digit_recipe_mlp) %>%
  add_model(digit_mlp) %>%
  fit(train)
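A likely reason for the NaN results is that many MNIST pixel columns are constant (all zeros in every image), so step_normalize() ends up dividing by a standard deviation of zero. A sketch of one way around that (assuming the objects above) is to drop zero-variance columns before normalizing:

digit_recipe_mlp <- recipe(label ~ ., train) %>%
  update_role(id, new_role = "id") %>%
  step_zv(all_numeric_predictors()) %>%         # remove constant (all-zero) pixel columns first
  step_normalize(all_numeric_predictors())      # no zero standard deviations left to divide by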
I am fitting a random forest model using tidymodels in R, and an error occurs when I try to predict on the test set using the tuned model: "Each element of `splits` must be an `rsplit` object."
# Data splitting
data(Sacramento, package = "modeldata")
set.seed(123)
data_split <- initial_split(Sacramento, prop = 0.75, strata = price)
Sac_train <- training(data_split)
Sac_test <- testing(data_split)
# Build the model
rf_mod <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("regression")

# Create the recipe
Sac_recipe <- recipe(price ~ ., data = Sac_train) %>%
  step_rm(zip, latitude, longitude) %>%
  step_corr(all_numeric_predictors(), threshold = 0.85) %>%
  step_zv(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Create the workflow
rf_workflow <- workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(Sac_recipe)
# Train and tune the model
set.seed(123)
Sac_folds <- vfold_cv(Sac_train, v = 10, repeats = 2, strata = price)
rf_res <- rf_workflow %>%
  tune_grid(grid = 2 * 2,
            resamples = Sac_folds,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(rmse))

# Extract the best model
rf_best <- rf_res %>%
  select_best(metric = "rmse")

# Last fit
last_rf_workflow <- rf_workflow %>%
  finalize_workflow(rf_best)
last_rf_fit <- last_rf_workflow %>%
  last_fit(Sac_train)

# Error: Each element of `splits` must be an `rsplit` object.
predict(last_rf_fit, Sac_test, type = "conf_int")
The error comes from these lines:
last_rf_fit <- last_rf_workflow %>%
  last_fit(Sac_train)
Now from the documentation of last_fit,
# S3 method for workflow
last_fit(object, split, ..., metrics = NULL, control = control_last_fit())
So a workflow object is passed to last_fit as the first argument via %>%, and Sac_train is passed to the split parameter.
But from the docs, the split argument needs to be,
An rsplit object created from rsample::initial_split()
So instead, try this:
last_rf_fit <- last_rf_workflow %>%
  last_fit(data_split)
Then to collect the predictions, following the docs,
collect_predictions(last_rf_fit)
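If you also want predictions for Sac_test specifically, the finalized, fitted workflow can be pulled out of the last_fit() result and used with predict() (a minimal sketch; extract_workflow() is from the tune package):

final_wf <- extract_workflow(last_rf_fit)
predict(final_wf, Sac_test)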
I am working on a classification model to predict building age. I want to train my random forest models by groups (suburbs) within the larger dataset.
I've used this as the basis of the code below.
My question is - how should I write the code to train and record the hyperparameters for each suburb?
age.rf <- rand_forest(mtry = tune(),
                      trees = tune(),
                      min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine("ranger")

age.workflow <- workflow() %>%
  add_model(age.rf)
### function for model fitting and predicting
age.predict <- function(df) {
  # split the dataset
  set.seed(1)
  split <- initial_split(df)
  train_df <- training(df)
  test_df <- testing(df)

  # create recipe
  age.recipe <- recipe(decade_built ~ ., data = train_df) %>%
    update_role(bld_index, new_role = "ID") %>%
    step_dummy(all_nominal_predictors(), -has_role("ID")) %>%
    step_zv(all_predictors()) %>%
    step_normalize(all_numeric_predictors()) %>%
    prep()

  # hyperparameters
  age.randgrid_rf <- grid_random(mtry(c(1, 20)),
                                 trees(),
                                 min_n(),
                                 size = 10)
  ctrl <- control_grid(save_pred = T, extract = extract_model)
  age_folds <- vfold_cv(train_df, strata = "suburb", v = 10)
  age.tunerandom_rf <- age.workflow %>%
    tune_grid(resamples = age_folds,
              grid = age.randgrid_rf,
              control = ctrl)

  # best parameters
  age.params_rf <- select_best(age.tunerandom_rf)

  # finalise model
  age.final_rf <- finalize_model(age.spec_rf, age.params_rf)
  age.workflowfinal_rf <- workflow() %>%
    add_recipe(age.recipe) %>%
    add_model(age.final_rf)

  # predict on test data
  predict(age.workflowfinal_rf, test_df)
}

age_nested <- final.df %>%
  group_by(suburb) %>%
  nest()

age.preds <- age_nested %>%
  mutate(prediction = map(data, possibly(age.predict, otherwise = NA)))
I've nested the dataset using the nest() function and followed the workflow from Julia's post on another page.
Any help identifying how to get the hyperparameters, as well as how to apply them to the individual models for each group, would be much appreciated.
At the moment, my output is NA.
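One way to record the hyperparameters as well as the predictions for each suburb is to have the helper function return both, and keep them as list columns after mapping. A rough sketch under the same setup (the names and the metric are illustrative, and note that training()/testing() take the split object, not the data frame):

age.predict <- function(df) {
  set.seed(1)
  split <- initial_split(df)
  train_df <- training(split)
  test_df <- testing(split)

  # recipe (prep() is not needed when the recipe goes into a workflow)
  age.recipe <- recipe(decade_built ~ ., data = train_df) %>%
    update_role(bld_index, new_role = "ID") %>%
    step_dummy(all_nominal_predictors()) %>%
    step_zv(all_predictors()) %>%
    step_normalize(all_numeric_predictors())

  # tune over a random grid
  age_folds <- vfold_cv(train_df, v = 10)
  age.tuned <- workflow() %>%
    add_recipe(age.recipe) %>%
    add_model(age.rf) %>%
    tune_grid(resamples = age_folds,
              grid = grid_random(mtry(c(1, 20)), trees(), min_n(), size = 10))

  # keep the chosen hyperparameters
  age.params <- select_best(age.tuned, metric = "accuracy")

  # finalise, fit on the training part, then predict the test part
  age.final_fit <- workflow() %>%
    add_recipe(age.recipe) %>%
    add_model(finalize_model(age.rf, age.params)) %>%
    fit(train_df)

  list(params = age.params, preds = predict(age.final_fit, test_df))
}

age.results <- age_nested %>%
  mutate(result = map(data, possibly(age.predict, otherwise = NULL)),
         params = map(result, "params"),
         preds  = map(result, "preds"))

Returning a small list this way keeps the tuned parameters and the test-set predictions together for each group.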
In time series forecasting, external regressors can make a big difference. Currently I want to track the effect of external regressors using the modeltime framework.
However, I could not find any helpful information on this topic so far. I only found out that you can add regressor variables to your recipe with a "+".
After adding the variables Transactions (the number of transactions per day and store) and Open_Closed (1 = store is closed, 0 = store is open) to my recipe, I found that they had no effect on the prediction. How can I achieve this?
Some reprex data:
suppressPackageStartupMessages(library(modeltime))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(timetk))
#### DATA
data <- data.frame(Store = c(rep("1", 365), rep("2", 365)),
                   Sales = c(seq(1, 44, length.out = 365)),
                   Date = ymd("2013-01-01") + days(0:364),
                   Transactions = c(seq(50, 100, length.out = 365)),
                   Open_Closed = sample(rep(0:1, each = 365)))
h = 42
# split
set.seed(234)
splits <- time_series_split(data, assess = "42 days", cumulative = TRUE)
# recipe
recipe_spec <- recipe(Sales ~ Date + Transactions + Open_Closed, data) %>%
  step_timeseries_signature(Date) %>%
  step_rm(matches("(iso$)|(xts$)|(day)|(hour)|(min)|(sec)|(am.pm)")) %>%
  step_dummy(all_nominal())

recipe_spec %>% prep() %>% juice()
#### MODELS
# elnet
model_spec_glmnet <- linear_reg(penalty = 1) %>%
  set_engine("glmnet")

wflw_fit_glmnet <- workflow() %>%
  add_model(model_spec_glmnet) %>%
  add_recipe(recipe_spec %>% step_rm(Date)) %>%
  fit(training(splits))

# xgboost
model_spec_xgboost <- boost_tree("regression", learn_rate = 0.35) %>%
  set_engine("xgboost")

set.seed(123)
wflw_fit_xgboost <- workflow() %>%
  add_model(model_spec_xgboost) %>%
  add_recipe(recipe_spec %>% step_rm(Date)) %>%
  fit(training(splits))

# sub tbl
submodels_tbl <- modeltime_table(
  wflw_fit_glmnet,
  wflw_fit_xgboost
)

submodels_tbl %>%
  modeltime_accuracy(testing(splits)) %>%
  table_modeltime_accuracy(.interactive = FALSE)
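For the regressors to actually influence the forecast, their future values also have to be present in the data supplied at prediction time, for example via new_data in the calibration and forecast steps. A minimal sketch, assuming the objects above:

calibration_tbl <- submodels_tbl %>%
  modeltime_calibrate(new_data = testing(splits))

calibration_tbl %>%
  modeltime_forecast(new_data    = testing(splits),  # contains Date, Transactions and Open_Closed
                     actual_data = data)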
I have a nested GBM and am looking to extract the partial dependence, trying to use the following code:
library(rsample)   # data splitting
library(gbm)       # basic implementation
library(xgboost)   # a faster implementation of gbm
library(caret)     # an aggregator package for performing many machine learning models
library(h2o)       # a java-based platform
library(pdp)       # model visualization
library(tidyverse) # nest(), map(), mutate() used below
basic_gbm <- function(data) {
  mymodel <- gbm(formula = mpg ~ .,
                 distribution = "gaussian",
                 data = data,
                 n.minobsinnode = 1,
                 bag.fraction = 1)
  return(mymodel)
}
blah_model <- mtcars %>%
  group_by() %>%
  nest() %>%
  mutate(model = map(data, basic_gbm))

blah_summary <- mtcars %>%
  group_by() %>%
  nest() %>%
  mutate(model = map(data, basic_gbm)) %>%
  mutate(summary = map(model, summary)) %>%
  mutate(all_data = pmap(list(data, summary), .f = left_join, by = character())) %>%
  select(cols = c(all_data)) %>%
  unnest(cols = c(cols)) %>%
  ungroup()
blah_model %>%
  left_join(blah_summary, by = character()) %>%
  mutate(pred = map(model, partial, pred.var = var, n.trees = model$n.trees, train = data))  # this does not work
This does work and is what I would want as a nested df for each var:
coeffs <- blah_model$model[[1]] %>%
  partial(pred.var = 'disp', n.trees = blah_model$model[[1]]$n.trees, train = blah_model$data[[1]])
However, it says it is not finding the variables in the training data, even though the data I am passing in is the training data. The var in the map comes from the summary output; those are the predictor variables.
I gave a better example
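One way to get that per-variable output inside the nested frame is to map over the predictor names taken from the gbm summary, rather than passing a bare var. A rough sketch, assuming the nested columns above:

blah_pdp <- blah_model %>%
  mutate(pdp = map2(model, data, function(m, d) {
    vars <- as.character(summary(m, plotit = FALSE)$var)   # predictor names from the gbm summary
    map(set_names(vars, vars),
        ~ partial(m, pred.var = .x, n.trees = m$n.trees, train = d))
  }))

# blah_pdp$pdp[[1]] is then a named list with one partial-dependence grid per predictor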
I know that I can do parallel computing in R; however, I am having trouble setting it up in a way that works with my modeling approach.
Here is how I load and set up the parallel computing:
library(doParallel) # Parallel Computing
cores <- detectCores() - 1
cluster <- makePSOCKcluster(cores)
registerDoParallel(cluster)
Later on in my markdown book, I have the following code to create, tune and fit a workflow for a support vector machine using a polynomial kernel.
tune_cv_folds <- vfold_cv(data = train_baked, v = 10)

tune_spec <- svm_poly(cost = tune(), degree = tune(), margin = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

tune_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_formula(win ~ .)

tune_res <- tune_wf %>%
  tune_grid(resamples = tune_cv_folds,
            grid = 10)

paramvalue <- tune_res %>% select_best("roc_auc")

final_spec <- svm_poly(cost = paramvalue$cost, degree = paramvalue$degree, margin = paramvalue$margin) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

poly_wf <- workflow() %>% add_model(final_spec) %>% add_formula(win ~ .) %>% fit(data = train_baked)

poly_fit <- poly_wf %>% pull_workflow_fit()
summary(poly_fit)
My question here is how do I enable parallel computing for the following?
tune_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_formula(win ~ .)
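With the doParallel backend registered as shown at the top, tune_grid() picks the workers up automatically; the parallelism happens during tuning, not when the workflow object is built. A sketch, assuming the objects above:

library(doParallel)
cores <- detectCores() - 1
cluster <- makePSOCKcluster(cores)
registerDoParallel(cluster)

tune_res <- tune_wf %>%
  tune_grid(resamples = tune_cv_folds,
            grid = 10,
            control = control_grid(parallel_over = "resamples"))  # distribute resamples across workers

stopCluster(cluster)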