Suppose I am creating a recipe for my machine learning model, and I need to preprocess my outcome.
How do I reverse the preprocessing of my outcome or my predictors?
If I preprocess my outcome, how do I convert the model's output back to the original scale?
library(recipes)
data(biomass, package = "modeldata")  # biomass ships with modeldata in current releases
rec <- biomass %>%
  recipe(carbon ~ hydrogen) %>%
  step_BoxCox(all_outcomes()) %>%
  prep()
biomass_box <- rec %>% bake(new_data = biomass)
In this example I have applied a Box-Cox transformation to my outcome. How do I get biomass_box$carbon back to its original values? recipes may have an easy way of undoing it, but I've been unable to find it.
Have you tried using step_inverse()?
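library(recipes)
data(biomass, package = "modeldata")
rec <- biomass %>%
  recipe(carbon ~ hydrogen) %>%
  step_BoxCox(all_outcomes()) %>%
  # steps have to be added before prep(); note that step_inverse() computes 1/x
  # for the selected columns, it does not undo the Box-Cox step
  step_inverse(all_predictors()) %>%
  prep()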
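As far as I know, recipes has no built-in way to undo a step, so one option is to invert the Box-Cox transformation by hand using the estimated lambda. Below is a minimal base R sketch of that idea. The helper names boxcox and inv_boxcox and the lambda value are made up for illustration; for a prepped recipe, the estimated lambda should be available from tidy(rec, number = 1)$value.

```r
# Forward Box-Cox transform for a given lambda (x must be positive).
# A lambda close to zero is treated as the log transform.
boxcox <- function(x, lambda, eps = 1e-6) {
  if (abs(lambda) < eps) log(x) else (x^lambda - 1) / lambda
}

# Inverse of the transform above: maps transformed values back
# to the original scale.
inv_boxcox <- function(y, lambda, eps = 1e-6) {
  if (abs(lambda) < eps) exp(y) else (lambda * y + 1)^(1 / lambda)
}

lambda <- 0.5                      # hypothetical value, not estimated from biomass
original <- c(40.2, 47.9, 55.1)    # made-up outcome values

transformed <- boxcox(original, lambda)
recovered <- inv_boxcox(transformed, lambda)

# The round trip should reproduce the original values up to rounding error
all.equal(recovered, original)
```

The same inv_boxcox call could be applied to model predictions made on the transformed scale.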
Please consider this minimal reproducible example of a random forest regression estimate
library(randomForest)
# fix missing data
airquality <- na.roughfix(airquality)
set.seed(123)
#fit the random forest model
rf_fit <- randomForest(formula = Ozone ~ ., data = airquality)
#define new observation
new <- data.frame(Solar.R=250, Wind=8, Temp=70, Month=5, Day=5)
set.seed(123)
#use predict all on new observation
rf_predict<-predict(rf_fit, newdata=new, predict.all = TRUE)
rf_predict$aggregate
library(tidyverse)
predict_mean <- rf_predict$individual %>%
as_tibble() %>%
rowwise() %>%
transmute(avg = mean(V1:V500))
predict_mean
I was expecting to get the same value from rf_predict$aggregate and predict_mean.
Where and why am I wrong about this assumption?
My final objective is to get a confidence interval for the predicted value.
I believe your code needs to include a c_across() call for the calculation to be performed correctly.
The ?c_across documentation tells us:
c_across() is designed to work with rowwise() to make it easy to
perform row-wise aggregations.
predict_mean <- rf_predict$individual %>%
as_tibble() %>%
rowwise() %>%
transmute(avg = mean(c_across(V1:V500)))
>predict_mean
[1] 30.5
An answer to a previous question points out that mean() can't handle a data.frame. In your code, the data being provided to mean() is a row-wise data frame with class rowwise_df. c_across() allows the data in the rows to be presented to mean() as vectors (I think).
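A small base R illustration of what may be going on, using a one-row stand-in for rf_predict$individual with made-up values. Inside rowwise(), V1 and V500 are single numbers, so V1:V500 is the colon operator building a sequence from one value to the other, not a selection of columns.

```r
# One-row stand-in for rf_predict$individual (hypothetical values, 4 "trees")
individual <- matrix(c(10, 40, 25, 31), nrow = 1,
                     dimnames = list(NULL, paste0("V", 1:4)))
row <- individual[1, ]

# What mean(V1:V4) computes row by row: the mean of the sequence 10, 11, ..., 31
seq_mean <- mean(row[["V1"]]:row[["V4"]])

# What was intended: the mean over all tree predictions in the row
tree_mean <- rowMeans(individual)

seq_mean   # mean of 10:31
tree_mean  # mean of 10, 40, 25, 31
```

For a regression forest, rowMeans(rf_predict$individual) should reproduce rf_predict$aggregate without going through dplyr at all.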
I am trying to incorporate tidymodels PCA into the workflow of a model. I want to have a predictive model that uses PCA as a preprocessing step and then make predictions with that model.
I have tried the following approach,
diamonds <- diamonds %>%
select(-clarity, -cut, - color)
diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_test <-vfold_cv(diamonds_train)
diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())
preprocesados <- prep(diamonds_recipe)
linear_model <-
linear_reg() %>%
set_engine("glmnet") %>%
set_mode("regression")
pca_workflow <- workflow() %>%
add_recipe(diamonds_recipe) %>%
add_model(linear_model)
lr_fitted_workflow <- pca_workflow %>% #option A workflow full dataset
last_fit(diamonds_split)
performance <- lr_fitted_workflow %>% collect_metrics()
test_predictions <- lr_fitted_workflow %>% collect_predictions()
But I get this error:
x Resample1: model (predictions): Error: penalty should be a single numeric value. ...
Warning message:
“All models failed in [fit_resamples()]. See the .notes column.”
Following other tutorials I tried to use this other approach, but I don't know how to use the model to make new predictions, because the new data comes in the original (non-pca) form. So I tried this:
pca_fit <- juice(preprocesados) %>% #option C no work flow at all
lm(price ~ ., data = .)
prep_test <- prep(diamonds_recipe, new_data = diamonds_test)
truths <- juice(prep_test) %>%
select(price)
ans <- predict(pca_fit, new_data = prep_test)
tib <- tibble(row = 1:length(ans),ans, truths)
ggplot(data = tib) +
geom_smooth(mapping = aes(x = row, y = ans, colour = "predicted")) +
geom_smooth(mapping = aes(x = row, y = price, colour = "true"))
And it prints something that seems reasonable, but by this point I have lost confidence and some guidance would be much appreciated. :D
The problem is not in your recipe or the workflow. As described in chapter 7 of Tidy Modeling with R, the function for fitting your model is fit(), and for it to work you have to provide the data for the fitting process (here diamonds_train). The upside is that you don't have to prep your recipe yourself, because the workflow takes care of it.
So reducing your code slightly, the example below will work.
library(tidymodels)
data(diamonds)
diamonds <- diamonds %>%
select(-clarity, -cut, - color)
diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())
linear_model <-
linear_reg() %>%
set_engine("glmnet") %>%
set_mode("regression")
pca_workflow <- workflow() %>%
add_recipe(diamonds_recipe) %>%
add_model(linear_model)
pca_fit <- fit(pca_workflow, data = diamonds_train)
As for cross-validation, one has to use fit_resamples and should split the training set, not the testing set. But here I am currently getting the same error (my answer will be updated if I figure out why).
Edit
Now I've done a bit of digging, and the problem with cross-validation stems from the engine being glmnet. My guess is that, among the many different aspects of the package, this one has somehow been missed. I've opened a possible issue on the workflows package GitHub site. Answers there are often quick in coming, so one of the developers will likely reply soon.
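One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the error message says that penalty should be a single numeric value, and the linear_reg() specification above does not set one. With the glmnet engine, parsnip needs a single penalty value to make predictions, so supplying it in the model specification may be enough. A minimal sketch, where 0.01 is an arbitrary illustrative value and not a tuned one:

```r
library(parsnip)

# Same specification as above, but with an explicit penalty.
# 0.01 is a hypothetical value; in practice it would be chosen by tuning.
linear_model <- set_mode(
  set_engine(linear_reg(penalty = 0.01), "glmnet"),
  "regression"
)
```

The workflow could then be updated with update_model(pca_workflow, linear_model) before fitting or resampling.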
As for cross-validation, if you instead fit using any of the other engines described in ?linear_reg, it can be done like this:
linear_model_base <-
linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
pca_workflow <- update_model(pca_workflow, linear_model_base)
folds <- vfold_cv(diamonds_train, 10)
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds)
If the metrics are of interest, they can be collected with collect_metrics, as you did:
pca_folds_fit %>% collect_metrics()
If we are interested in the predictions, we have to tell fit_resamples to save them during the fitting process and then use collect_predictions:
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds, control = control_resamples(save_pred = TRUE))
collect_predictions(pca_folds_fit)
Note, however, that this output contains the predictions from each fold, as you are literally fitting 10 models.
Usually cross-validation is used to compare multiple models or tuning parameters (e.g. random forest vs. linear model). The model with the best cross-validation performance (collect_metrics) would then be selected, and the test dataset would be used to evaluate that model's accuracy.
This is all described in TMwR chapters 10 and 11.
I have managed to build a decision tree model using the tidymodels package but I am unsure how to pull the results and plot the tree. I know I can use the rpart and rpart.plot packages to achieve the same thing but I would rather use tidymodels as that is what I am learning. Below is an example using the mtcars data.
library(tidymodels)
library(rpart)
library(rpart.plot)
library(dplyr) # for %>% and mutate(); mtcars itself ships with base R (datasets)
#data
df <- mtcars %>%
mutate(gear = factor(gear))
#train/test
set.seed(1234)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)
df_recipe <- recipe(gear~ ., data = df) %>%
step_normalize(all_numeric())
#building model
tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
#workflow
tree_wf <- workflow() %>%
add_recipe(df_recipe) %>%
add_model(tree) %>%
fit(df_train) #results are found here
rpart.plot(tree_wf$fit$fit) #error is here
The error I get says "Error in rpart.plot(tree_wf$fit$fit) : Not an rpart object", which makes sense, but I don't know whether there is a package or step I am missing that would convert the results into a format rpart.plot can plot. This might not be possible, but any help would be much appreciated.
You can also use the workflows::pull_workflow_fit() function. It makes the code a little bit more elegant.
tree_fit <- tree_wf %>%
pull_workflow_fit()
rpart.plot(tree_fit$fit)
The following works (note the extra $fit):
rpart.plot(tree_wf$fit$fit$fit)
Not a very elegant solution, but it does plot the tree.
Tested with parsnip 0.1.3 and rpart.plot 3.0.8.
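In more recent releases of workflows, pull_workflow_fit() has been deprecated in favour of extract_fit_parsnip() and extract_fit_engine(). Below is a minimal sketch, assuming a version of workflows that exports extract_fit_engine(). It uses add_formula() instead of the recipe only to keep the example short, and tree_spec and engine_fit are names chosen for illustration.

```r
library(tidymodels)
library(rpart.plot)

df <- mtcars %>%
  mutate(gear = factor(gear))

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wf <- workflow() %>%
  add_formula(gear ~ .) %>%
  add_model(tree_spec) %>%
  fit(data = df)

# Pull out the underlying rpart object from the fitted workflow
engine_fit <- extract_fit_engine(tree_wf)

# roundint = FALSE avoids a warning rpart.plot can raise for models
# that were not fitted by calling rpart() directly
rpart.plot(engine_fit, roundint = FALSE)
```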
I have a dataframe that consists of a binary outcome column (y) and multiple independent predictor columns (x1, x2, x3...).
I would like to run many single-variable logistic regression models (e.g. y ~ x1, y ~ x2, y ~ x3), and extract the exponentiated coefficients (odds ratios), 95% confidence intervals and p-values for each model into rows of a dataframe/tibble. It seems to me that a solution should be possible using a combination of purrr and broom.
This question is similar, but I can't work out the next steps of:
extracting only the values I need and
tidying into a dataframe/tibble.
Working from the example in the referenced question:
library(tidyverse)
library(broom)
df <- mtcars
df %>%
names() %>%
paste('am~',.) %>%
map(~glm(as.formula(.x), data= df, family = "binomial"))
After sleeping on it, the solution occurred to me. It requires map_df to run each model and tidy to extract the values from each model.
Hopefully this will be useful for others:
library(tidyverse)
library(broom)
df <- mtcars
output <- df %>%
  select(-am) %>%
  names() %>%
  paste('am~', .) %>%
  map_df(~ tidy(glm(as.formula(.x),
                    data = df,
                    family = "binomial"),
                conf.int = TRUE,
                exponentiate = TRUE)) %>%
  filter(term != "(Intercept)")
I have a big dataset that I want to partition based on the values of a particular variable (in my case lifetime), and then run logistic regression on each partition. Following the answer of @tchakravarty in "Fitting several regression models with dplyr", I wrote the following code:
lifetimemodels = data %>% group_by(lifetime) %>% sample_frac(0.7)%>%
do(lifeModel = glm(churn ~., x= TRUE, family=binomial(link='logit'), data = .))
My question now is how I can use the resulting logistic models to compute the AUC on the rest of the data (the 0.3 fraction that was not chosen), which should again be grouped by lifetime.
Thanks a lot in advance!
You could adapt your dplyr approach to use the tidyr and purrr framework. Look at grouping/nesting, and at the mutate and map functions, to create list-columns that store the pieces of your workflow.
The test/training split you are looking for is part of modelr, a package built to assist modelling within the purrr framework. Specifically, see the crossv_mc and crossv_kfold functions.
A toy example using mtcars (just to illustrate the framework).
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)
analysis <- mtcars %>%
  nest(data = -cyl) %>%
  # one Monte Carlo split per group, holding out 30% as the test set
  mutate(cv = map(data, ~ crossv_mc(.x, 1, test = 0.3))) %>%
  unnest(cv) %>%
  mutate(model = map(train, ~ lm(mpg ~ wt, data = .x))) %>%
  mutate(pred = map2(model, train, predict)) %>%
  mutate(error = map2_dbl(model, test, rmse))
This:
takes mtcars;
nests it by cyl into a list-column called data;
splits each data element into training and test sets by mapping crossv_mc over it, then uses unnest to expose the train and test list-columns;
maps the lm model over each train and stores the result in model;
maps the predict function over model and train and stores the result in pred;
maps the rmse function over the model and test sets and stores the result in error.
There are probably users out there more familiar than me with the workflow, so please correct/elaborate.
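For the AUC part of the question, here is a rough base R sketch that stays with glm: sample 70% of each group for training, predict on the remaining 30%, and compute the AUC from ranks (the Mann-Whitney form). The data are simulated, and the names toy, auc_rank and auc_by_group are made up for illustration; lifetime and churn mirror the question.

```r
# Rank-based AUC: probability that a random positive case gets a
# higher score than a random negative case (ties count as one half).
auc_rank <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Simulated stand-in for the real data: three lifetime groups,
# one predictor, and a binary churn outcome.
set.seed(1)
n <- 600
toy <- data.frame(
  lifetime = rep(c("a", "b", "c"), each = n / 3),
  x = rnorm(n)
)
toy$churn <- rbinom(n, 1, plogis(0.8 * toy$x))

# For each group: fit on a 70% sample, score the held-out 30%.
auc_by_group <- sapply(split(toy, toy$lifetime), function(d) {
  train_idx <- sample(nrow(d), size = floor(0.7 * nrow(d)))
  fit <- glm(churn ~ x, family = binomial(link = "logit"),
             data = d[train_idx, ])
  held_out <- d[-train_idx, ]
  scores <- predict(fit, newdata = held_out, type = "response")
  auc_rank(held_out$churn, scores)
})

auc_by_group
```

Keeping the row indices of the training sample is what makes the held-out 30% recoverable, which the sample_frac() approach in the question does not do on its own.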