Tidymodels: How to extract importance from training data - r

I have the following code, where I do a grid search over different mtry and min_n values. I know how to extract the parameters that give the highest accuracy (see the second code box). How can I extract the importance of each feature in the training dataset? The guides I found online only show how to do it on the test dataset using "last_fit", e.g. https://www.tidymodels.org/start/case-study/#data-split
set.seed(seed_number)
data_split <- initial_split(node_strength, prop = 0.8, strata = Group)
train <- training(data_split)
test <- testing(data_split)
train_folds <- vfold_cv(train, v = 10)

rfc <- rand_forest(mode = "classification", mtry = tune(),
                   min_n = tune(), trees = 1500) %>%
  set_engine("ranger", num.threads = 48, importance = "impurity")

rfc_recipe <- recipe(data = train, Group ~ .)

rfc_workflow <- workflow() %>%
  add_model(rfc) %>%
  add_recipe(rfc_recipe)

rfc_result <- rfc_workflow %>%
  tune_grid(train_folds, grid = 40, control = control_grid(save_pred = TRUE),
            metrics = metric_set(accuracy))
best <- rfc_result %>%
  select_best(metric = "accuracy")

To do this, you will want to create a custom extract function, as outlined in this documentation.
For random forest variable importance, your function will look something like this:
get_rf_imp <- function(x) {
  x %>%
    extract_fit_parsnip() %>%
    vip::vi()
}
And then you can apply it to your resamples like so (notice that you get a new .extracts column):
library(tidymodels)

data(cells, package = "modeldata")
set.seed(123)
cell_split <- cells %>% select(-case) %>%
  initial_split(strata = class)
cell_train <- training(cell_split)
cell_test <- testing(cell_split)
folds <- vfold_cv(cell_train)

rf_spec <- rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "impurity")

ctrl_imp <- control_grid(extract = get_rf_imp)

cells_res <-
  workflow(class ~ ., rf_spec) %>%
  fit_resamples(folds, control = ctrl_imp)
cells_res
#> # Resampling results
#> # 10-fold cross-validation
#> # A tibble: 10 × 5
#>    splits             id     .metrics         .notes           .extracts
#>    <list>             <chr>  <list>           <list>           <list>
#>  1 <split [1362/152]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
#>  2 <split [1362/152]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
#>  3 <split [1362/152]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
#>  4 <split [1362/152]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
#>  5 <split [1363/151]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
#>  6 <split [1363/151]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
#>  7 <split [1363/151]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
#>  8 <split [1363/151]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
#>  9 <split [1363/151]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
#> 10 <split [1363/151]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]> <tibble [1 × 2]>
Created on 2022-06-19 by the reprex package (v2.0.1)
Once you have those variable importance score extracts, you can unnest() them (right now you have to do this twice, because they are deeply nested) and then summarize and visualize as you prefer:
cells_res %>%
  select(id, .extracts) %>%
  unnest(.extracts) %>%
  unnest(.extracts) %>%
  group_by(Variable) %>%
  summarise(Mean = mean(Importance),
            Variance = sd(Importance)) %>%
  slice_max(Mean, n = 15) %>%
  ggplot(aes(Mean, reorder(Variable, Mean))) +
  geom_crossbar(aes(xmin = Mean - Variance, xmax = Mean + Variance)) +
  labs(x = "Variable importance", y = NULL)
Created on 2022-06-19 by the reprex package (v2.0.1)
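If what you ultimately want is the importance of each feature from a single fit on the full training set (rather than averaged across resamples), a shorter route is to finalize the workflow with the selected parameters and fit it once on the training data. A minimal sketch, reusing rfc_workflow, best, and train from the question:

final_fit <- rfc_workflow %>%
  finalize_workflow(best) %>%   # plug the winning mtry / min_n into the workflow
  fit(data = train)             # one fit on the full training set

# Importance scores as learned from the training data
final_fit %>%
  extract_fit_parsnip() %>%
  vip::vi()

Because the engine was set with importance = "impurity", the ranger fit carries the impurity scores that vip::vi() reads.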

Related

How to make svm_linear work with tune_grid/tune_race_anova

So when I try to tune cost for svm_linear with the tidymodels approach, it fails every time, but it works just fine with svm_rbf, so I cannot understand where the problem comes from:
rcpsvm <- recipe(Species ~ ., data = iris)

svmlin <- svm_linear(cost = tune()) %>%
  set_engine("LiblineaR") %>%
  set_mode("classification")

svmlinwrkfl <- workflow() %>%
  add_recipe(rcpsvm) %>%
  add_model(svmlin)

gridwals <- expand_grid(cost = c(0.01, 0.1, 1, 10, 100))

folds <- vfold_cv(iris, strata = Species, 5)

tunelin <- tune_grid(svmlinwrkfl, grid = gridwals, folds)
And then it says that all models failed because of "No data available in table".
What am I doing wrong?
The specific model you are using cannot generate class probabilities, only hard class predictions, so you need to tune using a metric for classes (not a metric for probabilities). An example of this is sensitivity:
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from
#>   required_pkgs.model_spec parsnip

data(two_class_dat)

data_train <- two_class_dat[-(1:10), ]
data_test  <- two_class_dat[1:10, ]

folds <- bootstraps(data_train, times = 5)

svm_cls_spec <-
  svm_linear(cost = tune()) %>%
  set_mode("classification") %>%
  set_engine("LiblineaR")

workflow(Class ~ ., svm_cls_spec) %>%
  tune_grid(folds, grid = 5, metrics = metric_set(sensitivity))
#> # Tuning results
#> # Bootstrap sampling
#> # A tibble: 5 × 4
#>   splits            id         .metrics         .notes
#>   <list>            <chr>      <list>           <list>
#> 1 <split [781/296]> Bootstrap1 <tibble [5 × 5]> <tibble [0 × 1]>
#> 2 <split [781/286]> Bootstrap2 <tibble [5 × 5]> <tibble [0 × 1]>
#> 3 <split [781/296]> Bootstrap3 <tibble [5 × 5]> <tibble [0 × 1]>
#> 4 <split [781/291]> Bootstrap4 <tibble [5 × 5]> <tibble [0 × 1]>
#> 5 <split [781/304]> Bootstrap5 <tibble [5 × 5]> <tibble [0 × 1]>
Created on 2022-01-28 by the reprex package (v2.0.1)
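Sensitivity is only one choice; any hard-class metric from yardstick will do. As a sketch (reusing the folds and svm_cls_spec from above), you can tune against several class metrics at once:

# accuracy, sensitivity, and specificity are all computed from hard class
# predictions, unlike e.g. roc_auc, which needs class probabilities
workflow(Class ~ ., svm_cls_spec) %>%
  tune_grid(folds, grid = 5,
            metrics = metric_set(accuracy, sensitivity, specificity))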

purrr::map does not work with pipe operator

I have a data frame like this:
df <- tibble(
  i = rep(1:10, times = 5),
  t = rep(1:5, each = 10)
) %>%
  mutate(y = rnorm(50))
I want to apply a function that takes the data frame for each t as its argument:
f <- function(df){
  return(lm(y ~ +1, data = df))
}
When I apply purrr::map to the nested data frame with the pipe operator, I get an error.
# does not work
df_nested <- df %>%
  nest(data = c(t, y)) %>%
  rename(data_col = data)

df_nested %>%
  purrr::map(.x = .$data_col, .f = f)
On the other hand, when I do not use pipe operator, I get the desired result.
# Ok
purrr::map(.x = df_nested$data_col, .f = f)
To my understanding, both pieces of code should return the same result. What is wrong with the version that uses the pipe operator?
The pipe already passes the previous value (df_nested) as the first argument to map. You may wrap the call in {} to stop that from happening.
library(tidyverse)

df_nested %>%
  {purrr::map(.x = .$data_col, .f = f)}
Another way would be to map inside mutate:
df %>%
  nest(data_col = c(t, y)) %>%
  mutate(model = map(data_col, f))

#        i data_col         model
#    <int> <list>           <list>
#  1     1 <tibble [5 × 2]> <lm>
#  2     2 <tibble [5 × 2]> <lm>
#  3     3 <tibble [5 × 2]> <lm>
#  4     4 <tibble [5 × 2]> <lm>
#  5     5 <tibble [5 × 2]> <lm>
#  6     6 <tibble [5 × 2]> <lm>
#  7     7 <tibble [5 × 2]> <lm>
#  8     8 <tibble [5 × 2]> <lm>
#  9     9 <tibble [5 × 2]> <lm>
# 10    10 <tibble [5 × 2]> <lm>
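A third equivalent option (a sketch using the same df_nested as above) is to pull the list-column out first, so the pipe's implicit first argument is the list itself rather than the whole data frame:

df_nested %>%
  pull(data_col) %>%   # extract the list-column; map then receives a plain list
  purrr::map(f)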

step_rose() fails in tune grid

I noted that when training with certain engines (e.g. keras and xgboost) the recipe returns more ys than Xs.
Here you'll find a minimal reproducible example:
library(themis)
library(recipes)
library(tune)
library(parsnip)
library(workflows)
library(dials)
library(rsample)

xg_mod <- parsnip::boost_tree(mode = "classification",
                              trees = tune(),
                              tree_depth = tune(),
                              min_n = tune(),
                              loss_reduction = tune(),
                              learn_rate = tune()) %>%
  set_engine("xgboost")

xg_grid <- grid_latin_hypercube(over_ratio(range = c(0, 1)),
                                trees(),
                                tree_depth(),
                                min_n(),
                                loss_reduction(),
                                learn_rate(),
                                size = 5)

my_recipe <- recipe(class ~ ., data = circle_example) %>%
  step_rose(class, over_ratio = tune())

workflow() %>%
  add_model(xg_mod) %>%
  add_recipe(my_recipe) %>%
  tune_grid(resamples = mc_cv(circle_example, strata = class),
            grid = xg_grid)
The resulting error is "Error in data.frame(ynew, Xnew): arguments imply differing number of rows: 385, 386".
It is related to tuning the over_ratio. If you skip tuning it, the example will work with no errors.
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.1
library(themis)

data(iris)
iris_imbalance <- iris %>%
  filter(Species != "setosa") %>%
  slice_sample(n = 60, weight_by = case_when(
    Species == "virginica" ~ 60,
    TRUE ~ 1)) %>%
  mutate(Species = factor(Species))

xg_mod <- parsnip::boost_tree(mode = "classification",
                              trees = tune(),
                              tree_depth = tune(),
                              min_n = tune(),
                              loss_reduction = tune(),
                              learn_rate = tune()) %>%
  set_engine("xgboost")

xg_grid <- grid_latin_hypercube(#over_ratio(range = c(0,1)),
                                trees(),
                                tree_depth(),
                                min_n(),
                                loss_reduction(),
                                learn_rate(),
                                size = 5)

my_recipe <- recipe(Species ~ ., data = iris_imbalance) %>%
  step_rose(Species) #, over_ratio = tune())

workflow() %>%
  add_model(xg_mod) %>%
  add_recipe(my_recipe) %>%
  tune_grid(resamples = mc_cv(iris_imbalance, strata = Species),
            grid = xg_grid)
#> # Tuning results
#> # Monte Carlo cross-validation (0.75/0.25) with 25 resamples using stratification
#> # A tibble: 25 x 4
#>    splits          id         .metrics          .notes
#>    <list>          <chr>      <list>            <list>
#>  1 <split [46/14]> Resample01 <tibble [10 × 9]> <tibble [0 × 1]>
#>  2 <split [46/14]> Resample02 <tibble [10 × 9]> <tibble [0 × 1]>
#>  3 <split [46/14]> Resample03 <tibble [10 × 9]> <tibble [0 × 1]>
#>  4 <split [46/14]> Resample04 <tibble [10 × 9]> <tibble [0 × 1]>
#>  5 <split [46/14]> Resample05 <tibble [10 × 9]> <tibble [0 × 1]>
#>  6 <split [46/14]> Resample06 <tibble [10 × 9]> <tibble [0 × 1]>
#>  7 <split [46/14]> Resample07 <tibble [10 × 9]> <tibble [0 × 1]>
#>  8 <split [46/14]> Resample08 <tibble [10 × 9]> <tibble [0 × 1]>
#>  9 <split [46/14]> Resample09 <tibble [10 × 9]> <tibble [0 × 1]>
#> 10 <split [46/14]> Resample10 <tibble [10 × 9]> <tibble [0 × 1]>
#> # … with 15 more rows
Created on 2020-11-13 by the reprex package (v0.3.0)
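If you do need to compare a few over_ratio values despite this, one workaround (a sketch, not a fix for the underlying issue; tune_at_ratio is a hypothetical helper) is to keep over_ratio out of the tuned grid and loop over a handful of fixed values instead:

# Hypothetical helper: tune the xgboost parameters at one fixed over_ratio
tune_at_ratio <- function(ratio) {
  rec <- recipe(Species ~ ., data = iris_imbalance) %>%
    step_rose(Species, over_ratio = ratio)
  workflow() %>%
    add_model(xg_mod) %>%
    add_recipe(rec) %>%
    tune_grid(resamples = mc_cv(iris_imbalance, strata = Species),
              grid = xg_grid)
}

results <- purrr::map(c(0.5, 0.75, 1), tune_at_ratio)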

Predict nested model with nested newdata

Imagine a high resolution temperature and light time series taken at many locations (stations) over many days. Except, at each station temp and light are taken by different sensors, resulting in a slightly different set of timestamps.
To merge these into one data.frame, I've been trying to make a model of light for each day at each station in df.light. Then, I want to predict light values at the exact timestamps of temp readings, which are nested the same way in df.temp (the temperature dataset).
station <- rep(1:5, each=36500)
dayofyear <- rep(1:365, 5, each=100)
hourofday.light <- runif(182500, min=0, max=24)
light <- runif(182500, min=0, max=40)
hourofday.temp <- runif(182500, min=0, max=24)
temp <- runif(182500, min=0, max=40)
df.light <- data.frame(station, dayofyear, hourofday.light, light)
df.temp <- data.frame(station, dayofyear, hourofday.temp, temp)
> head(df.light)
  station dayofyear hourofday.light     light
1       1         1       10.217349  0.120381
2       1         1       12.179213 12.423694
3       1         1       16.515400  7.277784
4       1         1        3.775723 31.793782
5       1         1        7.719266 30.578220
6       1         1        9.269916 16.937042

> tail(df.light)
       station dayofyear hourofday.light      light
182495       5       365        4.712285 19.2047471
182496       5       365       11.190919 39.5921675
182497       5       365       18.710969 11.8182347
182498       5       365       20.288101 11.6874453
182499       5       365       15.466373  0.3264828
182500       5       365       12.969125 29.4429034

> head(df.temp)
  station dayofyear hourofday.temp      temp
1       1         1     12.1298554 30.862308
2       1         1     23.6226076  9.328942
3       1         1      9.3699831 28.970397
4       1         1      0.1814767  1.405557
5       1         1     23.6300014 39.875743
6       1         1      7.6999984 39.786182
I can make the light models, e.g. GAMs, for each day at each station in df.light using dplyr. But I am stuck not knowing how to feed the nested newdata from df.temp to the models to generate the per-station-per-day predictions.
library("mgcv")
library("tidyverse")
data <- as_tibble(df.light) %>%
group_by(station, dayofyear) %>%
nest()
models <- data %>%
mutate(
model = map(data, ~ gam(light ~ s(hourofday.light), data = .x)),
predicted = map(model, ~ predict.gam(.x, newdata = hourofday.temp)) # newdata doesn't look nested
)
The last line (the predicted column) does not work because newdata is not nested... I think. Please help. I'm guessing this could be a common issue when merging time series generated by multiple sources.
You can first prepare the data:
names(df.temp)[3:4] <- names(df.light)[3:4]

data1 <- df.light %>% group_by(station, dayofyear) %>% nest() %>% ungroup()
data2 <- df.temp  %>% group_by(station, dayofyear) %>% nest() %>% ungroup()
Then apply the model and get the predicted values:
result <- data1 %>%
  mutate(data2 = data2$data,
         model = map(data, ~ gam(light ~ s(hourofday.light), data = .x)),
         predicted = map2(model, data2, predict.gam))
result
# A tibble: 1,825 x 6
#    station dayofyear data               data2              model  predicted
#      <int>     <int> <list>             <list>             <list> <list>
#  1       1         1 <tibble [100 × 2]> <tibble [100 × 2]> <gam>  <dbl [100]>
#  2       1         2 <tibble [100 × 2]> <tibble [100 × 2]> <gam>  <dbl [100]>
#  3       1         3 <tibble [100 × 2]> <tibble [100 × 2]> <gam>  <dbl [100]>
#  4       1         4 <tibble [100 × 2]> <tibble [100 × 2]> <gam>  <dbl [100]>
#  5       1         5 <tibble [100 × 2]> <tibble [100 × 2]> <gam>  <dbl [100]>
#  6       1         6 <tibble [100 × 2]> <tibble [100 × 2]> <gam>  <dbl [100]>
#  7       1         7 <tibble [100 × 2]> <tibble [100 × 2]> <gam>  <dbl [100]>
#  8       1         8 <tibble [100 × 2]> <tibble [100 × 2]> <gam>  <dbl [100]>
#  9       1         9 <tibble [100 × 2]> <tibble [100 × 2]> <gam>  <dbl [100]>
# 10       1        10 <tibble [100 × 2]> <tibble [100 × 2]> <gam>  <dbl [100]>
# … with 1,815 more rows
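To finish the merge into one flat data frame (the original goal), a possible last step, sketched under the assumption that result is as above, is to attach each prediction vector to its matching temperature tibble and unnest:

merged <- result %>%
  mutate(data2 = map2(data2, predicted,
                      ~ mutate(.x, light_pred = as.numeric(.y)))) %>%  # pair each tibble with its predictions
  select(station, dayofyear, data2) %>%
  unnest(data2)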

Match two datasets by minimum geospatial distance (R)

I have the two following datasets:
houses <- data.table(house_number = c(1:3),
                     lat_decimal = seq(1.1, 1.3, by = 0.1),
                     lon_decimal = seq(1.4, 1.6, by = 0.1))
stations <- data.table(station_numer = c(1:11),
                       lat_decimal = seq(1, 2, by = 0.1),
                       lon_decimal = seq(2, 3, by = 0.1))
I want to merge houses and stations together such that the resulting station_number is the station that's closest to the corresponding house_number.
This question is very similar, but I'm not sure if they're working with latitude and longitude, and I also don't know how to calculate distances when dealing with longitude and latitude (which is why I prefer to simply use distm from the geosphere package).
I have never worked with the outer function. In case the answer from the aforementioned question would work, how can I adapt it to use the distm function instead of the sqrt function?
Use match_nrst_haversine from hutilscpp:
library(hutilscpp)
houses[, c("station_number", "dist") := match_nrst_haversine(lat_decimal,
lon_decimal,
addresses_lat = stations$lat_decimal,
addresses_lon = stations$lon_decimal,
Index = stations$station_numer,
close_enough = 0,
cartesian_R = 5)]
houses
#> house_number lat_decimal lon_decimal station_number dist
#> 1: 1 1.1 1.4 1 67.62617
#> 2: 2 1.2 1.5 1 59.87076
#> 3: 3 1.3 1.6 1 55.59026
You may want to adjust close_enough and cartesian_R for performance if your data are numerous (i.e. over a million points to match).
`cartesian_R`
The maximum radius of any address from the points to be geocoded. Used to accelerate the detection of minimum distances. Note, as the argument name suggests, the distance is in cartesian coordinates, so a small number is likely.

`close_enough`
The distance, in metres, below which a match will be considered to have occurred. (The distance that is considered "close enough" to be a match.) For example, close_enough = 10 means the first location within ten metres will be matched, even if a closer match occurs later. May be provided as a string to emphasize the units, e.g. close_enough = "0.25km". Only km and m are permitted.
Your question is a bit more complicated than a simple merge, and outer is somewhat ill-suited for the purpose. To be as thorough as possible, we want to calculate the distance between all combinations of houses and stations, then keep only the closest station per house. We'll need two packages:
library(tidyverse)
library(geosphere)
First, a bit of prep. distm expects coordinates to be ordered as longitude first, latitude second (you have the opposite), so let's fix that, name the columns better, and correct a typo while we're at it:
houses <- data.frame(house_number = c(1:3),
                     lon_house = seq(1.4, 1.6, by = 0.1),
                     lat_house = seq(1.1, 1.3, by = 0.1))

stations <- data.frame(station_number = c(1:11),
                       lon_station = seq(2, 3, by = 0.1),
                       lat_station = seq(1, 2, by = 0.1))
We'll create "nested" data frames so that it's easier to keep coordinates together:
house_nest <- nest(houses, -house_number, .key = 'house_coords')
station_nest <- nest(stations, -station_number, .key = 'station_coords')
  house_number house_coords
         <int> <list>
1            1 <data.frame [1 × 2]>
2            2 <data.frame [1 × 2]>
3            3 <data.frame [1 × 2]>

   station_number station_coords
            <int> <list>
 1              1 <data.frame [1 × 2]>
 2              2 <data.frame [1 × 2]>
 3              3 <data.frame [1 × 2]>
 4              4 <data.frame [1 × 2]>
 5              5 <data.frame [1 × 2]>
 6              6 <data.frame [1 × 2]>
 7              7 <data.frame [1 × 2]>
 8              8 <data.frame [1 × 2]>
 9              9 <data.frame [1 × 2]>
10             10 <data.frame [1 × 2]>
11             11 <data.frame [1 × 2]>
Use dplyr::crossing to combine every row from both data frames:
data.master <- crossing(house_nest, station_nest)
   house_number house_coords         station_number station_coords
          <int> <list>                        <int> <list>
 1            1 <data.frame [1 × 2]>              1 <data.frame [1 × 2]>
 2            1 <data.frame [1 × 2]>              2 <data.frame [1 × 2]>
 3            1 <data.frame [1 × 2]>              3 <data.frame [1 × 2]>
 4            1 <data.frame [1 × 2]>              4 <data.frame [1 × 2]>
 5            1 <data.frame [1 × 2]>              5 <data.frame [1 × 2]>
 6            1 <data.frame [1 × 2]>              6 <data.frame [1 × 2]>
 7            1 <data.frame [1 × 2]>              7 <data.frame [1 × 2]>
 8            1 <data.frame [1 × 2]>              8 <data.frame [1 × 2]>
 9            1 <data.frame [1 × 2]>              9 <data.frame [1 × 2]>
10            1 <data.frame [1 × 2]>             10 <data.frame [1 × 2]>
# ... with 23 more rows
With all this now in place, we can use distm on each row to calculate a distance, and keep the shortest distance per house:
data.dist <- data.master %>%
  mutate(dist = map2_dbl(house_coords, station_coords, distm)) %>%
  group_by(house_number) %>%
  filter(dist == min(dist))

  house_number house_coords         station_number station_coords         dist
         <int> <list>                        <int> <list>                <dbl>
1            1 <data.frame [1 × 2]>              1 <data.frame [1 × 2]> 67690.
2            2 <data.frame [1 × 2]>              1 <data.frame [1 × 2]> 59883.
3            3 <data.frame [1 × 2]>              1 <data.frame [1 × 2]> 55519.
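Since the question specifically asked how to swap distm in for sqrt: distm can also return the full houses-by-stations distance matrix in one call, so a compact base-R alternative (a sketch, using the renamed columns from above) is:

library(geosphere)

# Rows are houses, columns are stations; distm expects lon first, lat second
d <- distm(houses[, c("lon_house", "lat_house")],
           stations[, c("lon_station", "lat_station")])

# Nearest station per house, plus the distance in metres
houses$station_number <- stations$station_number[apply(d, 1, which.min)]
houses$dist <- apply(d, 1, min)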
