I am working through one of the solutions for the House Prices Kaggle competition and would like to know how to get an RMSE value from the following:
# Subset the train rows and selected features
dt.train <- fulldt %>%
  filter(Set == "Train") %>%
  select("Id", "OverallQual", "TotalArea", "AreaAbvground", "GarageArea",
         "TotalBaths", "YearBuilt", "Neighborhood", "MSSubClass", "FireplaceQu",
         "ExterQual", "KitchenQual", "BsmtQual", "HouseStyle") %>%
  mutate(SalePrice = log(raw.train$SalePrice))

# Same for the test features
dt.test <- fulldt %>%
  filter(Set == "Test") %>%
  select("Id", "OverallQual", "TotalArea", "AreaAbvground", "GarageArea",
         "TotalBaths", "YearBuilt", "Neighborhood", "MSSubClass", "FireplaceQu",
         "ExterQual", "KitchenQual", "BsmtQual", "HouseStyle")

# Random Forest model
fit <- randomForest(SalePrice ~ ., data = dt.train, importance = TRUE)

# Use new model to predict SalePrice values from the test set
pred <- exp(predict(fit, newdata = dt.test))
How do you get the RMSE value from pred?
Let's calculate the RMSE for the training and test rows using the built-in iris data as a minimal example:
library(tibble)
library(randomForest)
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.
train_df <- head(iris, 100)
test_df <- tail(iris, 50)
model <- randomForest(Sepal.Length ~ ., data = train_df, importance = T)
# Test RMSE
tibble(
  truth = test_df$Sepal.Length,
  predicted = predict(model, newdata = test_df)
) %>%
  rmse(truth, predicted)
#> # A tibble: 1 x 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 rmse standard 0.836
# Train RMSE
tibble(
  truth = train_df$Sepal.Length,
  predicted = predict(model, newdata = train_df)
) %>%
  rmse(truth, predicted)
#> # A tibble: 1 x 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 rmse standard 0.265
Created on 2021-12-13 by the reprex package (v2.0.1)
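To answer the original question: Kaggle's test set has no SalePrice labels, so you can only compute an RMSE on rows whose sale price you actually know, for example a validation subset held out from the training rows. Below is a minimal sketch along those lines; dt.valid is a hypothetical hold-out data frame built like dt.train but keeping the untransformed SalePrice. Because the model was fit on log(SalePrice) and pred was exponentiated, the comparison is done back on the log scale, which is also the scale the competition uses to score RMSE.

library(dplyr)      # for %>%
library(tibble)
library(yardstick)

# hypothetical hold-out rows with a known, untransformed SalePrice column
pred.valid <- exp(predict(fit, newdata = dt.valid))

tibble(
  truth     = log(dt.valid$SalePrice),
  predicted = log(pred.valid)
) %>%
  rmse(truth, predicted)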
I have the following code as a simple example.
library(tidymodels)
library(bonsai)
train_folds <- vfold_cv(data = train, strata = target)
train_rec <- recipe(formula = target ~ ., data = train) %>%
update_role(Id, new_role = "ID")
gb_mod <- boost_tree(engine = "lightgbm",
mtry = 11,
mode = "classification",
trees = 100)
gb_workflow <- workflow(preprocessor = train_rec,
spec = gb_mod)
model_fit <- gb_workflow %>% fit_resamples(train_folds,
metrics = metric_set(kap, roc_auc, accuracy))
model_fit %>% collect_metrics()
The kap() function calculates the Kappa metric, which is unweighted by default. To calculate quadratic weighted Kappa you must pass weighting = "quadratic" as an argument, which metric_set() doesn't seem to accept. How can I include QWK in the metrics output?
Apologies if this has already been answered, but I couldn't find any examples.
You need to make an alternate function (just by wrapping the original):
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data(hpc_cv, package = "modeldata")
# See example in ?metric_set examples
kap_quad <- function(data, truth, estimate, na_rm = TRUE, ...) {
  kap(
    data = data,
    truth = !! rlang::enquo(truth),
    estimate = !! rlang::enquo(estimate),
    # set weighting = "quadratic"
    weighting = "quadratic",
    na_rm = na_rm,
    ...
  )
}
kap_quad <- new_numeric_metric(kap_quad, "maximize")
met <- metric_set(kap_quad)
hpc_cv %>%
met(obs, estimate = pred)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 kap multiclass 0.692
# no weighting
hpc_cv %>%
kap(obs, estimate = pred)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 kap multiclass 0.508
Created on 2023-01-31 by the reprex package (v2.0.1)
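To use this with the resampling code from the question, the wrapped metric can sit alongside the built-in metrics in metric_set(). A sketch, assuming the train_folds, gb_workflow, and kap_quad objects defined above:

model_fit <- gb_workflow %>%
  fit_resamples(train_folds,
                metrics = metric_set(kap_quad, roc_auc, accuracy))

model_fit %>% collect_metrics()

If your yardstick version provides metric_tweak() (used in an answer further down), kap_quad <- metric_tweak("kap_quad", kap, weighting = "quadratic") is an equivalent shortcut to the hand-written wrapper.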
I am using tidymodels for building a model where false negatives are more costly than false positives. Hence I'd like to use the yardstick::classification_cost metric for hyperparameter tuning, but with a custom classification cost matrix that reflects this fact.
Doing this after fitting a model is simple enough:
library(tidymodels)
# load simulated prediction output
data("two_class_example")
# cost matrix penalizing false negatives
cost_matrix <- tribble(
~truth, ~estimate, ~cost,
"Class1", "Class2", 2,
"Class2", "Class1", 1
)
# use function on simulated prediction output
classification_cost(
data = two_class_example,
truth = truth,
# target class probability
Class1,
# supply the function with the cost matrix
costs = cost_matrix)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 classification_cost binary 0.260
Created on 2021-11-01 by the reprex package (v2.0.1)
But using this function during hyperparameter tuning is where I run into problems. The documentation states that for setting options the metric should be wrapped in a custom function. Here's my attempt and the resulting error. Note how this wrapper works fine for evaluating a fitted model, but throws an error when I try to use it for tuning:
library(tidymodels)
# load data
data("two_class_example")
data("two_class_dat")
# create custom metric penalizing false negatives
classification_cost_penalized <- function(data,
                                          truth,
                                          class_proba,
                                          na_rm = TRUE) {
  # cost matrix penalizing false negatives
  cost_matrix <- tribble(
    ~truth, ~estimate, ~cost,
    "Class1", "Class2", 2,
    "Class2", "Class1", 1
  )
  classification_cost(
    data = data,
    truth = !! rlang::enquo(truth),
    # supply the function with the class probabilities
    !! rlang::enquo(class_proba),
    # supply the function with the cost matrix
    costs = cost_matrix,
    na_rm = na_rm
  )
}
# Use `new_numeric_metric()` to formalize this new metric function
classification_cost_penalized <- new_prob_metric(classification_cost_penalized, "minimize")
# test if this works on the simulated estimates
two_class_example %>%
classification_cost_penalized(truth = truth, class_prob = Class1)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 classification_cost binary 0.260
# test if this works with hyperparameter tuning
# specify a RF model
my_model <-
rand_forest(mtry = tune(),
min_n = tune(),
trees = 500) %>%
set_engine("ranger") %>%
set_mode("classification")
# specify recipe
my_recipe <- recipe(Class ~ A + B, data = two_class_dat)
# bundle to workflow
my_wf <- workflow() %>%
add_model(my_model) %>%
add_recipe(my_recipe)
# start tuning
tuned_rf <- my_wf %>%
# set up tuning grid
tune_grid(
resamples = vfold_cv(two_class_dat,
v = 5),
grid = 5,
metrics = metric_set(classification_cost_penalized))
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> x Fold1: internal: Error: In metric: `classification_cost_penalized`
#> unused argum...
#> x Fold2: internal: Error: In metric: `classification_cost_penalized`
#> unused argum...
#> x Fold3: internal: Error: In metric: `classification_cost_penalized`
#> unused argum...
#> x Fold4: internal: Error: In metric: `classification_cost_penalized`
#> unused argum...
#> x Fold5: internal: Error: In metric: `classification_cost_penalized`
#> unused argum...
#> Warning: All models failed. See the `.notes` column.
Created on 2021-11-01 by the reprex package (v2.0.1)
Unnesting the notes shows that there are unused arguments: "internal: Error: In metric: classification_cost_penalized\nunused arguments (estimator = ~prob_estimator, event_level = ~event_level)". But apparently the yardstick_event_level() function, which is how event_level should be set according to this documentation, does not exist? No function under that name shows up when I search for it.
I don't know how to proceed here.
Thank you for your time.
When you are tweaking an existing yardstick metric, it is much easier to use the metric_tweak() function, which allows you to hard code certain optional arguments (like cost), while keeping everything else the same. It is sort of like purrr::partial(), but for yardstick metrics.
library(tidymodels)
# load data
data("two_class_example")
data("two_class_dat")
cost_matrix <- tribble(
~truth, ~estimate, ~cost,
"Class1", "Class2", 2,
"Class2", "Class1", 1
)
classification_cost_penalized <- metric_tweak(
.name = "classification_cost_penalized",
.fn = classification_cost,
costs = cost_matrix
)
# test if this works on the simulated estimates
two_class_example %>%
classification_cost_penalized(truth = truth, class_prob = Class1)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 classification_cost_penalized binary 0.260
# specify a RF model
my_model <-
rand_forest(
mtry = tune(),
min_n = tune(),
trees = 500
) %>%
set_engine("ranger") %>%
set_mode("classification")
# specify recipe
my_recipe <- recipe(Class ~ A + B, data = two_class_dat)
# bundle to workflow
my_wf <- workflow() %>%
add_model(my_model) %>%
add_recipe(my_recipe)
# start tuning
tuned_rf <- my_wf %>%
tune_grid(
resamples = vfold_cv(two_class_dat, v = 5),
grid = 5,
metrics = metric_set(classification_cost_penalized)
)
#> i Creating pre-processing data to finalize unknown parameter: mtry
collect_metrics(tuned_rf)
#> # A tibble: 5 × 8
#> mtry min_n .metric .estimator mean n std_err .config
#> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 1 35 classification_cost… binary 0.407 5 0.0162 Preprocessor1…
#> 2 1 23 classification_cost… binary 0.403 5 0.0146 Preprocessor1…
#> 3 1 10 classification_cost… binary 0.403 5 0.0137 Preprocessor1…
#> 4 2 27 classification_cost… binary 0.396 5 0.0166 Preprocessor1…
#> 5 2 6 classification_cost… binary 0.401 5 0.0161 Preprocessor1…
Created on 2021-11-03 by the reprex package (v2.0.1)
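As a usage note, once tuning has finished you can pick the candidate that minimizes the penalized cost and finalize the workflow with it. A sketch using the standard tune helpers, reusing the tuned_rf and my_wf objects from above:

best_params <- select_best(tuned_rf, metric = "classification_cost_penalized")

final_wf  <- finalize_workflow(my_wf, best_params)
final_fit <- fit(final_wf, data = two_class_dat)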
Is there an alternative metric or approach you would recommend for a situation in which costs are different? classification_cost doesn't seem to do much to differentiate which model is best, even when extreme differences between false-positive and false-negative costs are provided. Here is an example based on your code above:
library(tidymodels)
# load data
data("two_class_example")
data("two_class_dat")
cost_matrix_1 <- tribble(
~truth, ~estimate, ~cost,
"Class1", "Class2", 10,
"Class2", "Class1", 1
)
cost_matrix_2 <- tribble(
~truth, ~estimate, ~cost,
"Class1", "Class2", 1,
"Class2", "Class1", 10
)
classification_cost_penalized_1 <- metric_tweak(
.name = "classification_cost_penalized_1",
.fn = classification_cost,
costs = cost_matrix_1
)
classification_cost_penalized_2 <- metric_tweak(
.name = "classification_cost_penalized_2",
.fn = classification_cost,
costs = cost_matrix_2
)
# test if this works on the simulated estimates
two_class_example %>%
classification_cost_penalized_1(truth = truth, class_prob = Class1)
#> # A tibble: 1 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 classification_cost_penalized binary 0.260
two_class_example %>%
classification_cost_penalized_2(truth = truth, class_prob = Class1)
# specify a RF model
my_model <-
rand_forest(
mtry = tune(),
min_n = tune(),
trees = 500
) %>%
set_engine("ranger") %>%
set_mode("classification")
# specify recipe
my_recipe <- recipe(Class ~ A + B, data = two_class_dat)
# bundle to workflow
my_wf <- workflow() %>%
add_model(my_model) %>%
add_recipe(my_recipe)
# start tuning
tuned_rf <- my_wf %>%
tune_grid(
resamples = vfold_cv(two_class_dat, v = 5),
grid = 50,
metrics = metric_set(classification_cost_penalized_1,classification_cost_penalized_2)
)
#> i Creating pre-processing data to finalize unknown parameter: mtry
# Seems to always be increasing or decreasing together, with not much difference even when the cost matrices differ a lot
collect_metrics(tuned_rf) %>%
ggplot(aes(x = min_n, y = mean, color = .metric)) +
geom_line() +
facet_grid(rows = "mtry")
[plot: mean metric values vs. min_n for both penalized cost metrics, faceted by mtry]
I'm struggling with how to obtain the AUC from a logistic regression model using tidymodels.
Here's an example using the built-in mpg dataset.
library(tidymodels)
library(tidyverse)
# Use mpg dataset
df <- mpg
# Create an indicator variable for class="suv"
df$is_suv <- as.factor(df$class == "suv")
# Create the split object
df_split <- initial_split(df, prop=1/2)
# Create the training and testing sets
df_train <- training(df_split)
df_test <- testing(df_split)
# Create workflow
rec <-
recipe(is_suv ~ cty + hwy + cyl, data=df_train)
glm_spec <-
logistic_reg() %>%
set_engine(engine = "glm")
glm_wflow <-
workflow() %>%
add_recipe(rec) %>%
add_model(glm_spec)
# Fit the model
model1 <- fit(glm_wflow, df_train)
# Attach predictions to training dataset
training_results <- bind_cols(df_train, predict(model1, df_train))
# Calculate accuracy
accuracy(training_results, truth = is_suv, estimate = .pred_class)
# Calculate AUC??
roc_auc(training_results, truth = is_suv, estimate = .pred_class)
The last line returns this error:
> roc_auc(training_results, truth = is_suv, estimate = .pred_class)
Error in metric_summarizer(metric_nm = "roc_auc", metric_fn = roc_auc_vec, :
formal argument "estimate" matched by multiple actual arguments
Since you are doing binary classification, roc_auc() is expecting a vector of class probabilities corresponding to the "relevant" class, not the predicted class.
You can get this using predict(model1, df_train, type = "prob"). Alternatively, if you are using workflows version 0.2.2 or newer, you can use augment() to get class predictions and probabilities without using bind_cols().
library(tidymodels)
library(tidyverse)
# Use mpg dataset
df <- mpg
# Create an indicator variable for class="suv"
df$is_suv <- as.factor(df$class == "suv")
# Create the split object
df_split <- initial_split(df, prop=1/2)
# Create the training and testing sets
df_train <- training(df_split)
df_test <- testing(df_split)
# Create workflow
rec <-
recipe(is_suv ~ cty + hwy + cyl, data=df_train)
glm_spec <-
logistic_reg() %>%
set_engine(engine = "glm")
glm_wflow <-
workflow() %>%
add_recipe(rec) %>%
add_model(glm_spec)
# Fit the model
model1 <- fit(glm_wflow, df_train)
# Attach predictions to training dataset
training_results <- augment(model1, df_train)
# Calculate accuracy
accuracy(training_results, truth = is_suv, estimate = .pred_class)
#> # A tibble: 1 x 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.795
# Calculate AUC
roc_auc(training_results, truth = is_suv, estimate = .pred_FALSE)
#> # A tibble: 1 x 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 roc_auc binary 0.879
Created on 2021-04-12 by the reprex package (v1.0.0)
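For completeness, the predict(..., type = "prob") route mentioned at the top also works if you would rather stick with bind_cols(); a sketch along those lines (the .pred_FALSE / .pred_TRUE column names come from the factor levels of is_suv):

training_results <- bind_cols(
  df_train,
  predict(model1, df_train),                 # class predictions (.pred_class)
  predict(model1, df_train, type = "prob")   # class probabilities
)

accuracy(training_results, truth = is_suv, estimate = .pred_class)
roc_auc(training_results, truth = is_suv, .pred_FALSE)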
# Partition the data:
library(tidymodels)
set.seed(1234)
uni_split <- initial_split(suspicious_match, strata = truth)
uni_train <- training(uni_split)
uni_test <- testing(uni_split)
uni_split
## Build a model recipe :
library(themis)
uni_rec <- recipe(truth ~ lv + lcs + qgram + jaccard + jw + cosine , data = uni_train)%>%
step_normalize(all_numeric()) %>%
step_smote(truth, skip = FALSE)%>%
prep()
uni_rec
bake(uni_rec, new_data = uni_train)
I trained the data with multiple models; here is one example:
# Train Logistic Regression :
glm_spec <- logistic_reg()%>%
set_engine("glm")
glm_fit <- glm_spec %>%
fit(truth ~ lv + lcs + qgram + cosine + jaccard + jw , data= juice(uni_rec))
glm_fit
## Model evaluation with resampling :
set.seed(123)
folds <- vfold_cv(juice(uni_rec), strata = truth)
folds
#1: Logistic Reg:
set.seed(234)
glm_rs <- glm_spec%>%
fit_resamples(truth ~ lv + lcs + qgram + cosine + jaccard + jw, folds,
metrics = metric_set(roc_auc, sens, spec, accuracy),
control = control_resamples(save_pred = TRUE))
## Model evaluation:
glm_rs %>% collect_metrics()
> glm_rs %>% collect_metrics()
# A tibble: 4 x 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.851 10 0.00514 Preprocessor1_Model1
2 roc_auc binary 0.898 10 0.00390 Preprocessor1_Model1
3 sens binary 0.875 10 0.00695 Preprocessor1_Model1
4 spec binary 0.827 10 0.00700 Preprocessor1_Model1
But when I try applying the logistic regression model to the test data, I get this error:
> glm_fit %>%
+ predict(new_data = bake(uni_rec, new_data = uni_test),
+ type = "prob")%>%
+ mutate(truth = uni_test$truth)%>%
+ roc_auc(truth, .pred_correct)
Error: Problem with `mutate()` input `truth`.
x Input `truth` can't be recycled to size 2022.
i Input `truth` is `uni_test$truth`.
i Input `truth` must be size 2022 or 1, not 1373.
Run `rlang::last_error()` to see where the error occurred.
I figured it's because of the SMOTE step in the recipe, but I can't figure out how to fix it. Please help!
When you used bake(), your test set changed. (@Emil Hvitfeldt identified why.)
I didn't have the data you used, but with my data the outcome variable (truth in your data) was left alone when bake() was applied. Therefore, you can drop the call to mutate(). Once that worked as expected, I found that truth wasn't recognized in roc_auc().
To find these errors I ran
fit.p <- glm_fit %>%
  predict(new_data = bake(uni_rec, new_data = uni_test),
          type = "prob")
Then I looked at fit.p. Here is what worked for my data:
nd <- bake(uni_rec, new_data = uni_test)

glm_fit %>%
  predict(new_data = nd, type = "prob") %>%
  roc_auc(nd$vs, .pred_0)
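Translated back to the objects in the question, that would look roughly like the sketch below. This assumes bake() keeps the truth column in the baked test set (as it did for my outcome), and that .pred_correct is the probability column for the event level, as in your own code:

nd <- bake(uni_rec, new_data = uni_test)

glm_fit %>%
  predict(new_data = nd, type = "prob") %>%
  bind_cols(nd %>% select(truth)) %>%
  roc_auc(truth, .pred_correct)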
It seems like predict() is producing a standard error that is too large. I get 0.820 with a parsnip model but 0.194 with a base R model. 0.194 seems more reasonable for a standard error, since about 2*0.194 above and below my prediction gives roughly the ends of the confidence interval. What is my problem/misunderstanding?
library(parsnip)
library(dplyr)
# example data
mod_dat <- mtcars %>%
as_tibble() %>%
mutate(cyl_8 = as.numeric(cyl == 8)) %>%
select(mpg, cyl_8)
parsnip_mod <- logistic_reg() %>%
set_engine("glm") %>%
fit(as.factor(cyl_8) ~ mpg, data = mod_dat)
base_mod <- glm(as.factor(cyl_8) ~ mpg, data = mod_dat, family = "binomial")
parsnip_pred <- tibble(mpg = 18) %>%
bind_cols(predict(parsnip_mod, new_data = ., type = 'prob'),
predict(parsnip_mod, new_data = ., type = 'conf_int', std_error = T)) %>%
select(!ends_with("_0"))
base_pred <- predict(base_mod, tibble(mpg = 18), se.fit = T, type = "response") %>%
unlist()
# these give the same prediction but different SE
parsnip_pred
#> # A tibble: 1 x 5
#> mpg .pred_1 .pred_lower_1 .pred_upper_1 .std_error
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 18 0.614 0.230 0.895 0.820
base_pred
#> fit.1 se.fit.1 residual.scale
#> 0.6140551 0.1942435 1.0000000
Created on 2020-06-04 by the reprex package (v0.3.0)
--EDIT--
As @thelatemail and @Limey said, using type = "link" for the base model will give the standard error on the logit scale (0.820). However, I want the standard error on the probability scale.
Is there an option in the parsnip documentation that I'm missing? I would like to use parsnip.
@thelatemail is correct. From the online documentation for predict.glm:
type
the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities.
The default is to report on the logit scale; "response" requests results on the raw probability scale. It's not obvious from the parsnip predict() documentation I found how it chooses the scale on which to return its results, but it's clear it's using the raw probability scale.
So both methods are returning correct answers; they're just using different scales.
I don't want to steal an accepted solution from @thelatemail, so I invite them to post a similar answer to this.
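To see how the two scales line up, the link-scale standard error from base R should match parsnip's 0.820, and multiplying it by p(1 - p) (the delta-method derivative of the logistic inverse link) takes it back to the probability-scale value of roughly 0.194. A sketch reusing base_mod from the question:

# standard error on the link (logit) scale -- about 0.820, matching parsnip's .std_error
link_pred <- predict(base_mod, tibble(mpg = 18), se.fit = TRUE, type = "link")
link_pred$se.fit

# delta method: SE on the probability scale = SE_link * p * (1 - p)
p <- plogis(link_pred$fit)
link_pred$se.fit * p * (1 - p)   # roughly 0.194, matching type = "response"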
As @thelatemail said, you can get the standard error on the probability scale with parsnip by using the arguments type = "raw", opts = list(se.fit = TRUE, type = "response"). But at that point you might as well use a base model, since the output is exactly the same. However, this is still useful if you are already working with a parsnip model and want the standard-error output of a base model.
library(parsnip)
library(dplyr)
mod_dat <- mtcars %>%
as_tibble() %>%
mutate(cyl_8 = as.numeric(cyl == 8)) %>%
select(mpg, cyl_8)
parsnip_mod <- logistic_reg() %>%
set_engine("glm") %>%
fit(as.factor(cyl_8) ~ mpg, data = mod_dat)
base_mod <- glm(as.factor(cyl_8) ~ mpg, data = mod_dat, family = "binomial")
predict(parsnip_mod, tibble(mpg = 18), type="raw",
opts=list(se.fit=TRUE, type="response")) %>%
as_tibble()
#> # A tibble: 1 x 3
#> fit se.fit residual.scale
#> <dbl> <dbl> <dbl>
#> 1 0.614 0.194 1
predict.glm(base_mod, tibble(mpg = 18), se.fit = T, type="response") %>%
as_tibble()
#> # A tibble: 1 x 3
#> fit se.fit residual.scale
#> <dbl> <dbl> <dbl>
#> 1 0.614 0.194 1
Created on 2020-06-11 by the reprex package (v0.3.0)