I am trying to retain an ID on each row when predicting with a random forest model so that I can merge the predictions back onto the original dataframe. I am using step_naomit() in the recipe, which removes the rows with missing data when I bake the training data, but it also removes the rows with missing data from the testing data. Unfortunately, I don't have an ID to easily tell which records were removed, so I can't accurately merge the predictions back on.
I have tried adding an ID column to the original data, but bake() removes any variable not included in the formula (and I don't want to include the ID in the formula). I also thought I might be able to retain the row.names from the original table to merge on, but it appears the row names are reset upon baking as well.
I realize I could remove the NA values prior to the recipe to solve this problem, but then what is the point of step_naomit() in the recipe? I also tried skip = TRUE in step_naomit(), but then I get an error about missing data when fitting the model (only for random forest). Am I missing something in tidymodels that would allow me to retain all the rows prior to baking?
See example:
## R 3.6.1 ON WINDOWS 10 MACHINE
require(tidyverse)
require(tidymodels)
require(ranger)
set.seed(123)
temp <- iris %>%
dplyr::mutate(Petal.Width = case_when(
round(Sepal.Width) %% 2 == 0 ~ NA_real_, ## INTRODUCE NA VALUES
TRUE ~ Petal.Width))
mySplit <- rsample::initial_split(temp, prop = 0.8)
myRecipe <- function(dataFrame) {
recipes::recipe(Petal.Width ~ ., data = dataFrame) %>%
step_naomit(all_numeric()) %>%
prep(data = dataFrame)
}
myPred <- function(mySplit,myRecipe) {
train_set <- training(mySplit)
test_set <- testing(mySplit)
train_prep <- myRecipe(train_set)
analysis_processed <- bake(train_prep, new_data = train_set)
model <- rand_forest(
mode = "regression",
mtry = 3,
trees = 50) %>%
set_engine("ranger", importance = 'impurity') %>%
fit(Sepal.Width ~ ., data=analysis_processed)
test_processed <- bake(train_prep, new_data = test_set)
test_processed %>%
bind_cols(myPrediction = unlist(predict(model,new_data=test_processed)))
}
getPredictions <- myPred(mySplit,myRecipe)
nrow(getPredictions)
## 21 ROWS
max(as.numeric(row.names(getPredictions)))
## 21
nrow(testing(mySplit))
## 29 ROWS
max(as.numeric(row.names(testing(mySplit))))
## 150
To be able to keep track of which observations were removed, we need to give the original dataset an id variable.
temp <- iris %>%
dplyr::mutate(Petal.Width = case_when(
round(Sepal.Width) %% 2 == 0 ~ NA_real_, ## INTRODUCE NA VALUES
TRUE ~ Petal.Width),
id = row_number()) #<<<<
Then we use update_role() to first designate it as an "id variable", and then make sure it is not treated as a predictor, so it doesn't become part of the modeling process. And that is it; everything else should work as before. Below is the fully updated code, with #<<<< marking my changes.
require(tidyverse)
#> Loading required package: tidyverse
require(tidymodels)
#> Loading required package: tidymodels
#> Registered S3 method overwritten by 'xts':
#> method from
#> as.zoo.xts zoo
#> ── Attaching packages ───────────────────── tidymodels 0.0.3 ──
#> ✔ broom 0.5.2 ✔ recipes 0.1.7
#> ✔ dials 0.0.3 ✔ rsample 0.0.5
#> ✔ infer 0.5.0 ✔ yardstick 0.0.4
#> ✔ parsnip 0.0.4
#> ── Conflicts ──────────────────────── tidymodels_conflicts() ──
#> ✖ scales::discard() masks purrr::discard()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ recipes::fixed() masks stringr::fixed()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ dials::margin() masks ggplot2::margin()
#> ✖ dials::offset() masks stats::offset()
#> ✖ yardstick::spec() masks readr::spec()
#> ✖ recipes::step() masks stats::step()
require(ranger)
#> Loading required package: ranger
set.seed(1234)
temp <- iris %>%
dplyr::mutate(Petal.Width = case_when(
round(Sepal.Width) %% 2 == 0 ~ NA_real_, ## INTRODUCE NA VALUES
TRUE ~ Petal.Width),
id = row_number()) #<<<<
mySplit <- rsample::initial_split(temp, prop = 0.8)
myRecipe <- function(dataFrame) {
recipes::recipe(Petal.Width ~ ., data = dataFrame) %>%
update_role(id, new_role = "id variable") %>% #<<<<
update_role(-id, new_role = 'predictor') %>% #<<<<
step_naomit(all_numeric()) %>%
prep(data = dataFrame)
}
myPred <- function(mySplit,myRecipe) {
train_set <- training(mySplit)
test_set <- testing(mySplit)
train_prep <- myRecipe(train_set)
analysis_processed <- bake(train_prep, new_data = train_set)
model <- rand_forest(
mode = "regression",
mtry = 3,
trees = 50) %>%
set_engine("ranger", importance = 'impurity') %>%
fit(Sepal.Width ~ ., data=analysis_processed)
test_processed <- bake(train_prep, new_data = test_set)
test_processed %>%
bind_cols(myPrediction = unlist(predict(model,new_data=test_processed)))
}
getPredictions <- myPred(mySplit, myRecipe)
getPredictions
#> # A tibble: 23 x 7
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species id myPrediction
#> <dbl> <dbl> <dbl> <dbl> <fct> <int> <dbl>
#> 1 4.6 3.1 1.5 0.2 setosa 4 3.24
#> 2 4.3 3 1.1 0.1 setosa 14 3.04
#> 3 5.1 3.4 1.5 0.2 setosa 40 3.22
#> 4 5.9 3 4.2 1.5 versico… 62 2.98
#> 5 6.7 3.1 4.4 1.4 versico… 66 2.92
#> 6 6 2.9 4.5 1.5 versico… 79 3.03
#> 7 5.7 2.6 3.5 1 versico… 80 2.79
#> 8 6 2.7 5.1 1.6 versico… 84 3.12
#> 9 5.8 2.6 4 1.2 versico… 93 2.79
#> 10 6.2 2.9 4.3 1.3 versico… 98 2.88
#> # … with 13 more rows
# removed ids
setdiff(testing(mySplit)$id, getPredictions$id)
#> [1] 5 28 47 70 90 132
Created on 2019-11-26 by the reprex package (v0.3.0)
Using skip = TRUE in the step_naomit() recipe specification, and then including the recipe in a workflow, might be the proper solution. For example:
myRecipe <- recipe(Petal.Width ~ ., data = train_set) %>%
  step_naomit(all_numeric(), skip = TRUE)
# don't include the prep()
wflow <- workflow() %>%
add_model(model) %>%
add_recipe(myRecipe)
wflow_fit <- wflow %>%
fit(train_set)
preds <- predict(wflow_fit, new_data = test_set)
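For completeness, here is a fuller version of that sketch (untested against the exact package versions above; the model specification mirrors the rand_forest() call from the question, and temp/mySplit come from the earlier example). Because step_naomit() is skipped when baking new data, predict() should return one row per test-set row, so the predictions line up with testing(mySplit) directly:
train_set <- training(mySplit)
test_set <- testing(mySplit)

# recipe: drop NA rows while training, but skip this step on new data
myRecipe <- recipe(Petal.Width ~ ., data = train_set) %>%
  step_naomit(all_numeric(), skip = TRUE)

# same model specification as in the question
model <- rand_forest(mode = "regression", mtry = 3, trees = 50) %>%
  set_engine("ranger", importance = "impurity")

wflow <- workflow() %>%
  add_model(model) %>%
  add_recipe(myRecipe)

# fitting preps the recipe and fits the model in one step
wflow_fit <- wflow %>%
  fit(train_set)

# one prediction per test-set row, so bind_cols() keeps everything aligned
preds <- predict(wflow_fit, new_data = test_set) %>%
  bind_cols(test_set)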
I am trying to understand how to apply step_pca to preprocess my data. Suppose I want to build a k-nearest-neighbor classifier on the iris dataset. For the sake of simplicity, I will not split the original iris dataset into train and test. I will assume iris is the train dataset and that I have some other observations as my test dataset.
I want to apply three transformations to the predictors in my train dataset:
Center all predictor variables
Scale all predictor variables
PCA-transform all predictor variables and keep the number of components that explains at least 80% of the variance in my data
So this is what I have:
library(tidymodels)
iris_rec <-
recipe(Species ~ .,
data = iris) %>%
# center/scale
step_center(-Species) %>%
step_scale(-Species) %>%
# pca
step_pca(-Species, threshold = 0.8) %>%
# apply data transformation
prep()
iris_rec
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 4
#>
#> Training data contained 150 data points and no missing data.
#>
#> Operations:
#>
#> Centering for Sepal.Length, Sepal.Width, Petal.Length, Petal.... [trained]
#> Scaling for Sepal.Length, Sepal.Width, Petal.Length, Petal.... [trained]
#> PCA extraction with Sepal.Length, Sepal.Width, Petal.Length, Petal.W... [trained]
Created on 2022-10-13 with reprex v2.0.2
Ok, so far, so good. All the transformations are applied to my dataset. When I prepare my train dataset using juice, everything goes as expected:
# transformed training set
iris_train_t <- juice(iris_rec)
iris_train_t
#> # A tibble: 150 × 3
#> Species PC1 PC2
#> <fct> <dbl> <dbl>
#> 1 setosa -2.26 -0.478
#> 2 setosa -2.07 0.672
#> 3 setosa -2.36 0.341
#> 4 setosa -2.29 0.595
#> 5 setosa -2.38 -0.645
#> 6 setosa -2.07 -1.48
#> 7 setosa -2.44 -0.0475
#> 8 setosa -2.23 -0.222
#> 9 setosa -2.33 1.11
#> 10 setosa -2.18 0.467
#> # … with 140 more rows
Created on 2022-10-13 with reprex v2.0.2
So, I have two predictors based on PCA (PC1 and PC2) and my response variable. However, when I proceed with my modelling, I get an error: all the models I test fail, as you can see below:
# cross validation
set.seed(2022)
iris_train_cv <- vfold_cv(iris_train_t, v = 5)
# tuning
iris_knn_tune <-
nearest_neighbor(
neighbors = tune(),
weight_func = tune(),
dist_power = tune()
) %>%
set_engine("kknn") %>%
set_mode("classification")
# grid search
iris_knn_grid <-
grid_regular(neighbors(range = c(3, 9)),
weight_func(),
dist_power(),
levels = c(22, 2, 2))
# workflow creation
iris_wflow <-
workflow() %>%
add_recipe(iris_rec) %>%
add_model(iris_knn_tune)
# model assessment
iris_knn_fit_tune <-
iris_wflow %>%
tune_grid(
resamples = iris_train_cv,
grid = iris_knn_grid
)
#> x Fold1: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> x Fold2: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> x Fold3: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> x Fold4: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> x Fold5: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.
# cv results
collect_metrics(iris_knn_fit_tune)
#> Error in `estimate_tune_results()`:
#> ! All of the models failed. See the .notes column.
#> Backtrace:
#> ▆
#> 1. ├─tune::collect_metrics(iris_knn_fit_tune)
#> 2. └─tune:::collect_metrics.tune_results(iris_knn_fit_tune)
#> 3. └─tune::estimate_tune_results(x)
#> 4. └─rlang::abort("All of the models failed. See the .notes column.")
Created on 2022-10-13 with reprex v2.0.2
I suspect my problem is with the formula I defined in my iris_rec recipe. The formula there is
Species ~ ., data = iris
which means
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris
However, when I run my models, the predictor variables are PC1 and PC2, so I guess the formula should be
Species ~ ., data = iris_train_t
or
Species ~ PC1 + PC2, data = iris_train_t
How can I inform my model that my variables and dataset have changed? All the other step_* functions I have used in tidymodels have worked, but I am struggling specifically with step_pca.
There are two things here that can be confusing.
First, you don't need to prep() or juice() a recipe before using it in a model or workflow. The tuning and resampling functions will be doing that within each resample.
You can prep() and juice() if you want the training set processed to troubleshoot, visualize, or otherwise explore. But you don’t need to otherwise.
Second, the recipe is basically a replacement for the formula. It knows what the predictors and outcomes are so there is rarely the need to use an additional formula on top of that.
(The exception is models that require special formulas, but otherwise no.)
Here is updated code for you:
library(tidymodels)
iris_rec <-
recipe(Species ~ .,
data = iris) %>%
# center/scale
step_center(-Species) %>%
step_scale(-Species) %>%
# pca
step_pca(-Species, threshold = 0.8)
set.seed(2022)
iris_train_cv <- vfold_cv(iris, v = 5) #<- changes here
# tuning
iris_knn_tune <-
nearest_neighbor(
neighbors = tune(),
weight_func = tune(),
dist_power = tune()
) %>%
set_engine("kknn") %>%
set_mode("classification")
# grid search
iris_knn_grid <-
grid_regular(neighbors(range = c(3, 9)),
weight_func(),
dist_power(),
levels = c(22, 2, 2))
# workflow creation
iris_wflow <-
workflow() %>%
add_recipe(iris_rec) %>%
add_model(iris_knn_tune)
# model assessment
iris_knn_fit_tune <-
iris_wflow %>%
tune_grid(
resamples = iris_train_cv,
grid = iris_knn_grid
)
show_best(iris_knn_fit_tune, metric = "roc_auc")
#> # A tibble: 5 × 9
#> neighbors weight_func dist_power .metric .estima…¹ mean n std_err .config
#> <int> <chr> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 9 rectangular 1 roc_auc hand_till 0.976 5 0.00580 Prepro…
#> 2 7 triangular 1 roc_auc hand_till 0.975 5 0.00688 Prepro…
#> 3 9 triangular 2 roc_auc hand_till 0.975 5 0.00571 Prepro…
#> 4 8 triangular 1 roc_auc hand_till 0.975 5 0.00655 Prepro…
#> 5 9 triangular 1 roc_auc hand_till 0.975 5 0.00655 Prepro…
#> # … with abbreviated variable name ¹.estimator
Created on 2022-10-13 with reprex v2.0.2
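If you then want a single fitted model, one possible continuation (a sketch using select_best(), finalize_workflow(), and fit(); best_params and final_fit are just illustrative names) is to finalize the workflow with the best parameter combination and fit it on the full training data:
best_params <- select_best(iris_knn_fit_tune, metric = "roc_auc")

final_fit <- iris_wflow %>%
  finalize_workflow(best_params) %>%
  fit(data = iris)

# predictions now come from the recipe and the finalized kknn model together
predict(final_fit, new_data = head(iris))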
When I run the code below, it produces an incorrect ROC curve.
Prep
The code below should be runnable for anyone using RStudio. The dataframe contains characteristics of different employees: performance ratings, sales figures, and whether
or not they were promoted.
I am attempting to create a decision tree model that uses all other variables to predict whether an employee was promoted. The primary purpose of this question is to find out what I am doing incorrectly when trying to use the roc_curve() function.
library(tidyverse)
library(tidymodels)
library(peopleanalyticsdata)
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read.csv(url)
salespeople <- salespeople %>% mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
creating testing/training data
Using my own homemade train_test() function just for kicks!
train_test <- function(data, train.size = 0.7, na.rm = FALSE) {
  if (na.rm == TRUE) {
    data_nm <- na.omit(data)
    dt <- sample(x = nrow(data_nm), size = nrow(data_nm) * train.size)
    train <- data_nm[dt, ]
    test <- data_nm[-dt, ]
    set <- list(train, test)
    names(set) <- c("train", "test")
    return(set)
  } else {
    dt <- sample(x = nrow(data), size = nrow(data) * train.size)
    train <- data[dt, ]
    test <- data[-dt, ]
    set <- list(train, test)
    names(set) <- c("train", "test")
    return(set)
  }
}
tt_list <- train_test(salespeople)
sales_train <- tt_list$train
sales_test <- tt_list$test
creating decision tree model structure/final model/prediction dataframe
tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
model <- tree %>% fit(promoted ~ ., data = sales_train)
predictions <- predict(model,
sales_test,
type = "prob") %>%
bind_cols(sales_test)
Calculate & Plot the ROC curve
When I use the .pred_yes column as the estimate column, it calculates an ROC curve that is the inverse of what I want. It seems that it has identified .pred_no as the "real" estimate column.
roc <- roc_curve(predictions,
estimate = .pred_yes,
truth = promoted)
autoplot(roc)
Thoughts
The issue seems to go away when I supply .pred_no as the estimate column to roc_curve().
FYI: this is my first Stack Overflow post; if you have any suggestions to make it clearer or better formatted, please let me know!
In factor(c("yes", "no")), "no" is the first level, the level that most modeling packages assume is the one of interest. In tidymodels, you can adjust the level of interest via the event_level argument, as documented here:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read_csv(url) %>%
mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
#> Rows: 351 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): promoted, sales, customer_rate, performance
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sales_split <- initial_split(salespeople)
sales_train <- training(sales_split)
sales_test <- testing(sales_split)
tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
tree_fit <- tree %>% fit(promoted ~ ., data = sales_train)
sales_preds <- augment(tree_fit, sales_test)
sales_preds
#> # A tibble: 88 × 7
#> promoted sales customer_rate performance .pred_class .pred_no .pred_yes
#> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
#> 1 no 364 4.89 1 no 0.973 0.0267
#> 2 no 342 3.74 3 no 0.973 0.0267
#> 3 yes 716 3.16 3 yes 0 1
#> 4 no 450 3.21 3 no 0.973 0.0267
#> 5 no 372 3.87 3 no 0.973 0.0267
#> 6 no 535 4.47 2 no 0.973 0.0267
#> 7 yes 736 3.94 4 yes 0 1
#> 8 no 330 2.54 2 no 0.973 0.0267
#> 9 no 478 3.48 2 no 0.973 0.0267
#> 10 yes 728 2.66 3 yes 0 1
#> # … with 78 more rows
sales_preds %>%
roc_curve(promoted, .pred_yes, event_level = "second") %>%
autoplot()
Created on 2021-09-08 by the reprex package (v2.0.1)
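As an aside (an alternative I would consider, not something from the answer above): you can instead set the factor levels explicitly when creating promoted, so that "yes" is the first level and the default event_level already points at the outcome of interest:
salespeople <- read_csv(url) %>%
  mutate(promoted = factor(ifelse(promoted == 1, "yes", "no"),
                           levels = c("yes", "no")))  # "yes" becomes the first level

# after re-splitting, refitting, and re-predicting as above, the default works:
# sales_preds %>% roc_curve(promoted, .pred_yes) %>% autoplot()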
I know that in tidymodels you can set a custom tunable parameter space by interacting directly with the workflow object as follows:
library(tidymodels)
model <- linear_reg(
mode = "regression",
engine = "glmnet",
penalty = tune()
)
rec_cars <- recipe(mpg ~ ., data = mtcars)
wkf <- workflow() %>%
add_recipe(rec_cars) %>%
add_model(model)
wkf_new_param_space <- wkf %>%
parameters() %>%
update(penalty = penalty(range = c(0.9, 1)))
but sometimes it makes more sense to do this right at the moment I specify a recipe or a model.
Does anyone know a way to achieve this?
The parameter ranges are inherently separate from the model specification and the recipe specification in tidymodels. When you set tune() you are giving a signal to the tuning functions that this parameter will take multiple values and should be tuned over.
So, as a short answer: you cannot specify ranges of parameters when you specify a recipe or a model, but you can create the parameters object right afterwards, as you did.
In the end, you need the parameter set to construct the grid values that you are using for hyperparameter tuning, and you can create those grid values in at least four ways.
The first way is to do it the way you are doing it, by pulling the needed parameters out of the workflow and modifying them when needed.
The second way is to create a parameters object that matches the parameters you need. This option and the remaining ones require you to make sure that you create values for all the parameters you are tuning.
The third way is to skip the parameters object altogether and create the grid with a grid_*() function and dials functions.
The fourth way is to skip the dials functions altogether and create the data frame yourself. I find tidyr::crossing() a useful replacement for grid_regular(). This way is a lot easier when you are working with integer parameters and parameters that don't benefit from transformations.
library(tidymodels)
model <- linear_reg(
mode = "regression",
engine = "glmnet",
penalty = tune()
)
rec_cars <- recipe(mpg ~ ., data = mtcars)
wkf <- workflow() %>%
add_recipe(rec_cars) %>%
add_model(model)
# Option 1: using parameters() on workflow
wkf_new_param_space <- wkf %>%
parameters() %>%
update(penalty = penalty(range = c(-5, 5)))
wkf_new_param_space %>%
grid_regular(levels = 10)
#> # A tibble: 10 × 1
#> penalty
#> <dbl>
#> 1 0.00001
#> 2 0.000129
#> 3 0.00167
#> 4 0.0215
#> 5 0.278
#> 6 3.59
#> 7 46.4
#> 8 599.
#> 9 7743.
#> 10 100000
# Option 2: Using parameters() on list
my_params <- parameters(
list(
penalty(range = c(-5, 5))
)
)
my_params %>%
grid_regular(levels = 10)
#> # A tibble: 10 × 1
#> penalty
#> <dbl>
#> 1 0.00001
#> 2 0.000129
#> 3 0.00167
#> 4 0.0215
#> 5 0.278
#> 6 3.59
#> 7 46.4
#> 8 599.
#> 9 7743.
#> 10 100000
# Option 3: Use grid_*() with dials objects directly
grid_regular(
penalty(range = c(-5, 5)),
levels = 10
)
#> # A tibble: 10 × 1
#> penalty
#> <dbl>
#> 1 0.00001
#> 2 0.000129
#> 3 0.00167
#> 4 0.0215
#> 5 0.278
#> 6 3.59
#> 7 46.4
#> 8 599.
#> 9 7743.
#> 10 100000
# Option 4: Create grid values manually
tidyr::crossing(
penalty = 10 ^ seq(-5, 5, length.out = 10)
)
#> # A tibble: 10 × 1
#> penalty
#> <dbl>
#> 1 0.00001
#> 2 0.000129
#> 3 0.00167
#> 4 0.0215
#> 5 0.278
#> 6 3.59
#> 7 46.4
#> 8 599.
#> 9 7743.
#> 10 100000
Created on 2021-08-17 by the reprex package (v2.0.1)
It seems this is an old question, but I am having a hard time trying to insert this approach (option 1) into my workflow.
How is it supposed to continue?
Is wkf_new_param_space used as the grid, or as the object, in the tuning call?
model_tuned <-
tune::tune_grid(
object = wkf_new_param_space, ?
resamples = cv_folds,
grid = wkf_new_param_space, ?
metrics = model_metrics,
control = tune::control_grid(save_pred = TRUE, save_workflow = TRUE)
)
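In case it helps (this is my understanding of the tune_grid() interface rather than something stated in the answer above): the workflow goes in object, and the updated parameter set is only used to build the grid, either directly with a grid_*() function or by passing it as param_info and letting tune_grid() build the grid. A sketch, assuming cv_folds and model_metrics exist as in your snippet:
model_tuned <-
  tune::tune_grid(
    object    = wkf,                              # the workflow itself
    resamples = cv_folds,
    grid      = grid_regular(wkf_new_param_space, # grid built from the updated parameter set
                             levels = 10),
    metrics   = model_metrics,
    control   = tune::control_grid(save_pred = TRUE, save_workflow = TRUE)
  )

# or, equivalently, pass the parameter set and ask tune_grid() for a grid size:
# tune::tune_grid(wkf, resamples = cv_folds, param_info = wkf_new_param_space, grid = 10)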
I am using the tidyverse to connect to multiple databases with the same data structure (clusters). Because the sources are different databases, a union is not possible without copying the data locally.
I can do everything with long, repetitive code, but when I try to shorten it I run into the following problem: when defining the column names for the select statement, dbplyr stores the expression containing the looping variable in the lazy query rather than evaluating it and storing the resulting column names.
Here is a minimal reproducible example:
library(tidyverse)
#reproducable example with two database and two tables in memory
con1 <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
con2 <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con1, mtcars)
copy_to(con1, iris)
copy_to(con2, mtcars)
copy_to(con2, iris)
#names of the tables
tables<-c("mtcars", "iris")
#specify which columns to select from which table
columns<-list("mtcars"=c("mpg", "hp"),
"iris"=c("Sepal.Length", "Sepal.Width"))
#list to put the results in
data_list<-vector(mode="list", length=length(tables))
names(data_list)<-tables
#loop over tables
for(i in tables){
#loop over databases
for(j in 1:2)
data_list[[i]][[j]]<-tbl(get(paste0("con",j)), i)%>%select(columns[[i]])
}
This code works fine so far, but the problem arises when accessing the data stored in the list (data_list).
If I try
data_list[[1]][[1]]
R still evaluates
select(columns[[i]])
After the loop, i is "iris", so the statement gives the error message:
Error: Unknown columns Sepal.Length and Sepal.Width
For the second element of data_list it works fine, because i is set appropriately:
data_list[[2]][[1]]
How can I force the select statement to evaluate the expression instead of storing the expression with the looping variable i?
As a next step, I would like to add a filter expression too, so that I only collect the data that is needed.
If a union worked across databases without copying, that would solve all these problems.
Thanks and best regards,
Thomas
Hmm, do you mean that you want to select columns interactively after you've queried the database?
I edited your code to use the tidyverse functions (since you've already loaded them).
library(tidyverse)
# Reproducable example with two database and two tables in memory
con1 <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
con2 <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con1, mtcars)
copy_to(con1, iris)
copy_to(con2, mtcars)
copy_to(con2, iris)
# Specify which columns to select from which table
columns <-list("mtcars" = c("mpg", "hp"), "iris" = c("Sepal.Length", "Sepal.Width"))
# Loop over the table names (mtcars, iris) **and** the columns that belong to those datasets
data_list <-
map2(names(columns), columns, ~ {
# For each table/column combination, grab them from con1 and con2 and return them in a list
con1_db <- tbl(con1, .x) %>% select(.y)
con2_db <- tbl(con2, .x) %>% select(.y)
list(con1_db, con2_db)
}) %>%
setNames(names(columns))
# With this you can interactively select the columns you wanted for each dataset. Just replace the dataset that you're interested in.
data_list %>%
pluck("iris") %>%
map(select, columns[['iris']])
#> [[1]]
#> Warning: `overscope_eval_next()` is deprecated as of rlang 0.2.0.
#> Please use `eval_tidy()` with a data mask instead.
#> This warning is displayed once per session.
#> Warning: `overscope_clean()` is deprecated as of rlang 0.2.0.
#> This warning is displayed once per session.
#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.30.1 [:memory:]
#> Sepal.Length Sepal.Width
#> <dbl> <dbl>
#> 1 5.1 3.5
#> 2 4.9 3
#> 3 4.7 3.2
#> 4 4.6 3.1
#> 5 5 3.6
#> 6 5.4 3.9
#> 7 4.6 3.4
#> 8 5 3.4
#> 9 4.4 2.9
#> 10 4.9 3.1
#> # … with more rows
#>
#> [[2]]
#> # Source: lazy query [?? x 2]
#> # Database: sqlite 3.30.1 [:memory:]
#> Sepal.Length Sepal.Width
#> <dbl> <dbl>
#> 1 5.1 3.5
#> 2 4.9 3
#> 3 4.7 3.2
#> 4 4.6 3.1
#> 5 5 3.6
#> 6 5.4 3.9
#> 7 4.6 3.4
#> 8 5 3.4
#> 9 4.4 2.9
#> 10 4.9 3.1
#> # … with more rows
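For what it's worth, my understanding of why this works (and the same idea applies if you prefer to keep the for loop): dbplyr keeps the select() expression unevaluated together with the environment it was created in, and a for loop reuses a single environment, so columns[[i]] is later re-evaluated with whatever value i ended up with. map2() gives each iteration its own function environment, so the captured value stays correct. If you wanted to keep the loop, a small helper function (lazy_select() is just an illustrative name) achieves the same thing:
# each call to the helper gets its own environment, so the lazy query captures
# this call's `cols` instead of the shared loop variable
lazy_select <- function(con, table, cols) {
  tbl(con, table) %>% select(all_of(cols))
}

for (i in tables) {
  for (j in 1:2) {
    data_list[[i]][[j]] <- lazy_select(get(paste0("con", j)), i, columns[[i]])
  }
}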
Behold the painfully simple case and the error(s). Comments inline.
library(flexsurv)
#> Loading required package: survival
library(tidyverse)
library(magrittr)
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#>
#> set_names
#> The following object is masked from 'package:tidyr':
#>
#> extract
set.seed(2019)
train_data <- tribble(
~wait_time, ~called_yet, ~time_queued,
131.282999992371, 0, 1570733365.28,
358.296000003815, 1, 1570733421.187,
1352.13999986649, 1, 1570733540.923,
1761.61400008202, 0, 1570733941.343,
1208.25300002098, 0, 1570734327.11,
522.296999931335, 1, 1570734376.953,
241.75, 0, 1570734659.44,
143.156999826431, 0, 1570734809.673,
1202.79999995232, 1, 1570734942.907,
614.640000104904, 1, 1570735526.567
)
# Base survival works fine!
survival_model <- survreg(Surv(wait_time, called_yet) ~ time_queued,
data = train_data,
dist = "weibull")
survival_model
#> Call:
#> survreg(formula = Surv(wait_time, called_yet) ~ time_queued,
#> data = train_data, dist = "weibull")
#>
#> Coefficients:
#> (Intercept) time_queued
#> 4.533765e+05 -2.886352e-04
#>
#> Scale= 0.518221
#>
#> Loglik(model)= -40.2 Loglik(intercept only)= -40.5
#> Chisq= 0.5 on 1 degrees of freedom, p= 0.48
#> n= 10
# flexsurvreg can't even get a valid initializer for time_queued, even though
# the doc says it takes the mean of the data
flexsurv_model <- flexsurvreg(Surv(wait_time, called_yet) ~ time_queued,
data = train_data,
dist = "weibull")
#> Error in flexsurvreg(Surv(wait_time, called_yet) ~ time_queued, data = train_data, : Initial value for parameter 2 out of range
# Maybe the low variance of the predictor here is the problem? So let's up the
# variance just to see
train_data %<>% mutate_at("time_queued", subtract, 1.57073e9)
train_data
#> # A tibble: 10 x 3
#> wait_time called_yet time_queued
#> <dbl> <dbl> <dbl>
#> 1 131. 0 3365.
#> 2 358. 1 3421.
#> 3 1352. 1 3541.
#> 4 1762. 0 3941.
#> 5 1208. 0 4327.
#> 6 522. 1 4377.
#> 7 242. 0 4659.
#> 8 143. 0 4810.
#> 9 1203. 1 4943.
#> 10 615. 1 5527.
# Now it initializes, so that's different... but now it won't converge!
flexsurv_model <- flexsurvreg(Surv(wait_time, called_yet) ~ time_queued,
data = train_data,
dist = "weibull")
#> Warning in flexsurvreg(Surv(wait_time, called_yet) ~ time_queued, data
#> = train_data, : Optimisation has probably not converged to the maximum
#> likelihood - Hessian is not positive definite.
Created on 2019-10-19 by the reprex package (v0.3.0)
I mainly wanted to use flexsurv for its better plotting options and more standard shape and scale definitions (and the ancillary parameters are very attractive too), but now I'm mainly just wondering whether I'm doing something really wrong, and whether flexsurv is trying to tell me not to trust my base survival model either.
Marco Sandri pointed out that recentering fixes it; however, recentering without rescaling only guarantees initialization, and still results in no convergence if the variance is very large. I'm considering this a bug since survival has no problem with the exact same model with the exact same values. Created an issue here.
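Following that logic (just an experiment I would try, not a confirmed fix): scaling the predictor as well as centering it brings its variance down to 1, which may let flexsurvreg() both initialize and converge; the coefficient is then on the scaled rather than the original scale:
# center *and* scale the predictor before fitting
train_data_scaled <- train_data %>%
  mutate(time_queued = as.numeric(scale(time_queued)))

flexsurv_model <- flexsurvreg(Surv(wait_time, called_yet) ~ time_queued,
                              data = train_data_scaled,
                              dist = "weibull")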