When I run the code below, it produces an incorrect ROC curve.
Prep
The code below should be runnable for anyone using RStudio. The data frame contains characteristics of different employees: performance ratings, sales figures, and whether or not they were promoted.
I am attempting to create a decision tree model that uses all other variables to predict whether an employee was promoted. The primary purpose of this question is to find out what I am doing incorrectly when trying to use the roc_curve() function.
library(tidyverse)
library(tidymodels)
library(peopleanalyticsdata)
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read.csv(url)
salespeople <- salespeople %>% mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
creating testing/training data
Using my own homemade train_test() function just for kicks!
train_test <- function(data, train.size = 0.7, na.rm = FALSE) {
  # Drop incomplete rows first so the sampled indices refer to the cleaned data
  if (na.rm) {
    data <- na.omit(data)
  }
  # Sample row indices for the training set; the rest become the test set
  dt <- sample(x = nrow(data), size = floor(nrow(data) * train.size))
  list(train = data[dt, ], test = data[-dt, ])
}
tt_list <- train_test(salespeople)
sales_train <- tt_list$train
sales_test <- tt_list$test
Creating decision tree model structure/final model/prediction dataframe
tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
model <- tree %>% fit(promoted ~ ., data = sales_train)
predictions <- predict(model,
sales_test,
type = "prob") %>%
bind_cols(sales_test)
Calculate & Plot the ROC curve
When I use the .pred_yes column as the estimate column, it calculates an ROC curve that is the inverse of what I want. It seems that it has identified .pred_no as the "real" estimate column.
roc <- roc_curve(predictions,
estimate = .pred_yes,
truth = promoted)
autoplot(roc)
Thoughts
The issue seems to go away when I supply .pred_no as the estimate column to roc_curve().
FYI: this is my first stack overflow post, if you have any suggestions to make this post more clear/better formatted please let me know!
In factor(c("yes", "no")), "no" is the first level, and the first level is the one that most modeling packages assume is the level of interest. In tidymodels, you can adjust the level of interest via the event_level argument, as in the example below:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read_csv(url) %>%
mutate(promoted = factor(ifelse(promoted == 1, "yes", "no")))
#> Rows: 351 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): promoted, sales, customer_rate, performance
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sales_split <- initial_split(salespeople)
sales_train <- training(sales_split)
sales_test <- testing(sales_split)
tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
tree_fit <- tree %>% fit(promoted ~ ., data = sales_train)
sales_preds <- augment(tree_fit, sales_test)
sales_preds
#> # A tibble: 88 × 7
#> promoted sales customer_rate performance .pred_class .pred_no .pred_yes
#> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
#> 1 no 364 4.89 1 no 0.973 0.0267
#> 2 no 342 3.74 3 no 0.973 0.0267
#> 3 yes 716 3.16 3 yes 0 1
#> 4 no 450 3.21 3 no 0.973 0.0267
#> 5 no 372 3.87 3 no 0.973 0.0267
#> 6 no 535 4.47 2 no 0.973 0.0267
#> 7 yes 736 3.94 4 yes 0 1
#> 8 no 330 2.54 2 no 0.973 0.0267
#> 9 no 478 3.48 2 no 0.973 0.0267
#> 10 yes 728 2.66 3 yes 0 1
#> # … with 78 more rows
sales_preds %>%
roc_curve(promoted, .pred_yes, event_level = "second") %>%
autoplot()
Created on 2021-09-08 by the reprex package (v2.0.1)
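An alternative to passing event_level is to make "yes" the first factor level when the outcome is created. The sketch below uses only base R to show where the default (alphabetical) level order comes from and how setting the levels explicitly changes it. The vector promoted_raw is made-up illustration data, not the salespeople data.

```r
# Made-up 0/1 outcome, standing in for the promoted column
promoted_raw <- c(1, 0, 0, 1, 0)

# Default: levels are sorted alphabetically, so "no" comes first
default_fct <- factor(ifelse(promoted_raw == 1, "yes", "no"))
levels(default_fct)

# Explicit levels: "yes" is now the first level, which yardstick treats
# as the event when event_level is left at its default of "first"
yes_first_fct <- factor(ifelse(promoted_raw == 1, "yes", "no"),
                        levels = c("yes", "no"))
levels(yes_first_fct)
```

With the levels set this way, roc_curve(promoted, .pred_yes) should give the expected curve without event_level. Either approach works, as long as the estimate column matches the level treated as the event.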
I am trying to understand how to apply step_pca to preprocess my data. Suppose I want to build a K-Nearest Neighbor classifier for the iris dataset. For the sake of simplicity, I will not split the original iris dataset into train and test. I will assume iris is the train dataset and that I have some other observations as my test dataset.
I want to apply three transformations to the predictors in my train dataset:
Center all predictor variables
Scale all predictor variables
PCA transform all predictor variables and keep enough components to explain at least 80% of the variance in my data
So this is what I have:
library(tidymodels)
iris_rec <-
recipe(Species ~ .,
data = iris) %>%
# center/scale
step_center(-Species) %>%
step_scale(-Species) %>%
# pca
step_pca(-Species, threshold = 0.8) %>%
# apply data transformation
prep()
iris_rec
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 4
#>
#> Training data contained 150 data points and no missing data.
#>
#> Operations:
#>
#> Centering for Sepal.Length, Sepal.Width, Petal.Length, Petal.... [trained]
#> Scaling for Sepal.Length, Sepal.Width, Petal.Length, Petal.... [trained]
#> PCA extraction with Sepal.Length, Sepal.Width, Petal.Length, Petal.W... [trained]
Created on 2022-10-13 with reprex v2.0.2
Ok, so far, so good. All the transformations are applied to my dataset. When I prepare my train dataset using juice, everything goes as expected:
# transformed training set
iris_train_t <- juice(iris_rec)
iris_train_t
#> # A tibble: 150 × 3
#> Species PC1 PC2
#> <fct> <dbl> <dbl>
#> 1 setosa -2.26 -0.478
#> 2 setosa -2.07 0.672
#> 3 setosa -2.36 0.341
#> 4 setosa -2.29 0.595
#> 5 setosa -2.38 -0.645
#> 6 setosa -2.07 -1.48
#> 7 setosa -2.44 -0.0475
#> 8 setosa -2.23 -0.222
#> 9 setosa -2.33 1.11
#> 10 setosa -2.18 0.467
#> # … with 140 more rows
Created on 2022-10-13 with reprex v2.0.2
So, I have two predictors based on PCA (PC1 and PC2) and my response variable. However, when I proceed with my modelling, I get an error: all the models I test fail, as you can see below:
# cross validation
set.seed(2022)
iris_train_cv <- vfold_cv(iris_train_t, v = 5)
# tuning
iris_knn_tune <-
nearest_neighbor(
neighbors = tune(),
weight_func = tune(),
dist_power = tune()
) %>%
set_engine("kknn") %>%
set_mode("classification")
# grid search
iris_knn_grid <-
grid_regular(neighbors(range = c(3, 9)),
weight_func(),
dist_power(),
levels = c(22, 2, 2))
# workflow creation
iris_wflow <-
workflow() %>%
add_recipe(iris_rec) %>%
add_model(iris_knn_tune)
# model assessment
iris_knn_fit_tune <-
iris_wflow %>%
tune_grid(
resamples = iris_train_cv,
grid = iris_knn_grid
)
#> x Fold1: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> x Fold2: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> x Fold3: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> x Fold4: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> x Fold5: preprocessor 1/1:
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training...
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.
# cv results
collect_metrics(iris_knn_fit_tune)
#> Error in `estimate_tune_results()`:
#> ! All of the models failed. See the .notes column.
#> Backtrace:
#> ▆
#> 1. ├─tune::collect_metrics(iris_knn_fit_tune)
#> 2. └─tune:::collect_metrics.tune_results(iris_knn_fit_tune)
#> 3. └─tune::estimate_tune_results(x)
#> 4. └─rlang::abort("All of the models failed. See the .notes column.")
Created on 2022-10-13 with reprex v2.0.2
I suspect my problem is with the formula I defined in my iris_rec recipe. The formula there is
Species ~ ., data = iris
which means
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris
However, when I run my models, the predictor variables are PC1 and PC2, so I guess the formula should be
Species ~ ., data = iris_train_t
or
Species ~ PC1 + PC2, data = iris_train_t
How can I inform my model that my variables and dataset changed? All the other step_* functions I have used in tidymodels have worked, but I am struggling specifically with step_pca.
Two things that are confusing.
First, you don't need to prep() or juice() a recipe before using it in a model or workflow. The tuning and resampling functions will be doing that within each resample.
You can prep() and juice() if you want the training set processed to troubleshoot, visualize, or otherwise explore. But you don’t need to otherwise.
Second, the recipe is basically a replacement for the formula. It knows what the predictors and outcomes are, so there is rarely a need to use an additional formula on top of that.
(The exception is models that require special formulas; otherwise, no.)
Here is updated code for you:
library(tidymodels)
iris_rec <-
recipe(Species ~ .,
data = iris) %>%
# center/scale
step_center(-Species) %>%
step_scale(-Species) %>%
# pca
step_pca(-Species, threshold = 0.8)
set.seed(2022)
iris_train_cv <- vfold_cv(iris, v = 5) #<- changes here
# tuning
iris_knn_tune <-
nearest_neighbor(
neighbors = tune(),
weight_func = tune(),
dist_power = tune()
) %>%
set_engine("kknn") %>%
set_mode("classification")
# grid search
iris_knn_grid <-
grid_regular(neighbors(range = c(3, 9)),
weight_func(),
dist_power(),
levels = c(22, 2, 2))
# workflow creation
iris_wflow <-
workflow() %>%
add_recipe(iris_rec) %>%
add_model(iris_knn_tune)
# model assessment
iris_knn_fit_tune <-
iris_wflow %>%
tune_grid(
resamples = iris_train_cv,
grid = iris_knn_grid
)
show_best(iris_knn_fit_tune, metric = "roc_auc")
#> # A tibble: 5 × 9
#> neighbors weight_func dist_power .metric .estima…¹ mean n std_err .config
#> <int> <chr> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 9 rectangular 1 roc_auc hand_till 0.976 5 0.00580 Prepro…
#> 2 7 triangular 1 roc_auc hand_till 0.975 5 0.00688 Prepro…
#> 3 9 triangular 2 roc_auc hand_till 0.975 5 0.00571 Prepro…
#> 4 8 triangular 1 roc_auc hand_till 0.975 5 0.00655 Prepro…
#> 5 9 triangular 1 roc_auc hand_till 0.975 5 0.00655 Prepro…
#> # … with abbreviated variable name ¹.estimator
Created on 2022-10-13 with reprex v2.0.2
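To illustrate why the recipe is passed to the workflow unprepped, here is a base R sketch of what happens conceptually within a single resample: the centering, scaling, and PCA are estimated on the analysis portion only and then applied to the held-out portion. This uses prcomp() as a stand-in and is not the tidymodels internals; the 120/30 split and the object names are assumptions made for the illustration.

```r
set.seed(2022)

# One illustrative "resample": 120 rows for analysis, 30 held out
idx <- sample(nrow(iris), 120)
analysis <- iris[idx, 1:4]
assessment <- iris[-idx, 1:4]

# Center, scale, and PCA are estimated on the analysis rows only
pca <- prcomp(analysis, center = TRUE, scale. = TRUE)

# Keep the smallest number of components explaining at least 80% of variance
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(var_explained >= 0.8)[1]

# Apply the analysis-set transformation to the held-out rows
assessment_pcs <- predict(pca, newdata = assessment)[, seq_len(n_comp), drop = FALSE]
head(assessment_pcs)
```

If the data were transformed once up front (as with prep() and juice() before vfold_cv()), the resamples would contain PC1 and PC2 instead of the original columns the recipe expects, which is consistent with the check_training_set() error in the question.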
I was getting acquainted with tidymodels by reading the book and this line in section 9.2 kept me thinking about retransformation.
It is best practice to analyze the predictions on the transformed
scale (if one were used) even if the predictions are reported using
the original units.
But I found it confusing that the examples in the book use a log transformation on the outcome without using a recipe for it (recipes have not been introduced at that point, but even later, once recipes are introduced, they still use step_log only for the predictors, not for the outcome). So I wanted to try that and found something puzzling, illustrated with the reprex below:
# So let's use most of the code from the examples in the book
library(tidymodels)
tidymodels_prefer()
set.seed(501)
# data budget
data(ames)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
ames_folds <- vfold_cv(ames_train, v = 10, strata = Sale_Price)
# IJALM
lm_model <-
linear_reg(penalty = 0) |>
set_engine("glmnet")
# And use a recipe,
# but Instead of manually transforming the outcome like this ...
# `ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))`
# let's include the outcome transformation into the recipe
simple_ames <-
recipe(
Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
data = ames_train
) |>
step_log(Gr_Liv_Area, base = 10) |>
step_dummy(all_nominal_predictors()) |>
step_log(Sale_Price, base = 10, skip = TRUE)
lm_wflow <-
workflow() |>
add_model(lm_model) |>
add_recipe(simple_ames)
lm_res <- fit_resamples(
lm_wflow,
resamples = ames_folds,
control = control_resamples(save_pred = TRUE, save_workflow = TRUE),
metrics = metric_set(rmse)
)
collect_metrics(lm_res)
#> # A tibble: 1 x 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 rmse standard 197264. 10 1362. Preprocessor1_Model1
# Now, I wanted to double-check how rmse was calculated
# It should be the mean of the rmse for each fold
# (individual values stored in this list lm_res$.metrics,
# with one element for each fold)
# and each rmse should have been calculated with the predictions of each fold
# (stored in lm_res$.predictions)
# So, this rmse corresponding to the first fold
lm_res$.metrics[[1]]
#> # A tibble: 1 x 4
#> .metric .estimator .estimate .config
#> <chr> <chr> <dbl> <chr>
#> 1 rmse standard 196074. Preprocessor1_Model1
# Should have been calculated with this data
lm_res$.predictions[[1]]
#> # A tibble: 236 x 4
#> .pred .row Sale_Price .config
#> <dbl> <int> <int> <chr>
#> 1 5.09 2 126000 Preprocessor1_Model1
#> 2 4.92 33 105900 Preprocessor1_Model1
#> 3 5.06 34 125500 Preprocessor1_Model1
#> 4 5.18 44 121500 Preprocessor1_Model1
#> 5 5.14 51 127000 Preprocessor1_Model1
#> 6 5.13 53 114000 Preprocessor1_Model1
#> 7 4.90 57 84900 Preprocessor1_Model1
#> 8 5.13 62 113000 Preprocessor1_Model1
#> 9 5.02 74 83500 Preprocessor1_Model1
#> 10 5.02 76 88250 Preprocessor1_Model1
#> # ... with 226 more rows
#> # i Use `print(n = ...)` to see more rows
# But here's the issue!
# The predictions are in the log-scale while the observed values
# are in the original units.
# This is just a quick-check to make sure the rmse reported above
# (calculated by yardstick) does in fact involve mixing-up the log-scale
# (predictions) and the original units (observed values)
yhat <- lm_res$.predictions[[1]]$.pred
yobs <- lm_res$.predictions[[1]]$Sale_Price
sqrt(mean((yhat - yobs)^2))
#> [1] 196073.6
# So, apparently, for cross-validation tidymodels does not `bake` the folds
# with the recipe to calculate the metrics
And here's where I got it (at least I think so), after spending half an hour writing this reprex. So, to not feel I wasted my time, I decided to post it anyway and put what I think is going on as an answer. Perhaps someone will find it useful, because it was not evident to me at first. Or perhaps someone can explain whether there is something else going on.
Created on 2022-08-07 by the reprex package (v2.0.1)
It was basically your fault. You explicitly told tidymodels not to bake() this step. The line below was the culprit, in particular, the skip = TRUE part.
step_log(Sale_Price, base = 10, skip = TRUE)
Hence, tidymodels will not include that step in baking the folds before making the predictions and you end up with the log-scale of the predictions mixed-up with the untransformed outcome variable. This is perhaps one of the examples they had in mind when they wrote in the documentation that:
Care should be taken when using skip = TRUE as it may affect the
computations for subsequent operations.
You probably decided to skip that step because the outcome variable may not be available in a new dataset for which you want predictions, and then the process would fail. But skipping it basically messes up the metrics for cross-validation. So it is better not to skip the step and to deal with that problem another way.
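The effect can be reproduced outside tidymodels. The base R sketch below fits a model to a log10-transformed outcome and computes the RMSE twice: once against the untransformed observed values (the mix-up described above) and once with both on the log10 scale. The data are simulated and the names area, price, and rmse() are made up for the illustration; this is not the ames data.

```r
set.seed(501)

# Simulated data: price depends on area on the log10 scale
n <- 200
area <- runif(n, 500, 3000)
price <- 10^(4 + 0.0003 * area + rnorm(n, sd = 0.05))
dat <- data.frame(area = area, price = price)

train <- dat[1:150, ]
test <- dat[151:200, ]

# Model fitted on the log10 outcome, so predictions are on the log10 scale
fit <- lm(log10(price) ~ area, data = train)
pred_log <- predict(fit, newdata = test)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Mixed scales: observed in original units, predictions in log10 units
rmse_mixed <- rmse(test$price, pred_log)

# Consistent scales: both in log10 units
rmse_log <- rmse(log10(test$price), pred_log)

c(mixed = rmse_mixed, log_scale = rmse_log)
```

One way to avoid the problem, and the one hinted at in the commented-out line of the reprex, is to transform the outcome once before splitting (e.g. with mutate()) so that every resample sees it on the same scale.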
I know that in tidymodels you can set a custom tunable parameter space by interacting directly with the workflow object as follows:
library(tidymodels)
model <- linear_reg(
mode = "regression",
engine = "glmnet",
penalty = tune()
)
rec_cars <- recipe(mpg ~ ., data = mtcars)
wkf <- workflow() %>%
add_recipe(rec_cars) %>%
add_model(model)
wkf_new_param_space <- wkf %>%
parameters() %>%
update(penalty = penalty(range = c(0.9, 1)))
but sometimes it makes more sense to do this right at the moment I specify a recipe or a model.
Does anyone know a way to achieve this?
The parameter ranges are inherently separated from the model specification and recipe specification in tidymodels. When you set tune() you are giving a signal to the tune function that this parameter will take multiple values and should be tuned over.
So as a short answer, you can not specify ranges of parameters when you specify a recipe or a model, but you can create the parameters object right after as you did.
In the end, you need the parameter set to construct the grid values that you are using for hyperparameter tuning, and you can create those grid values in at least four ways.
The first way is to do it the way you are doing it, by pulling the needed parameters out of the workflow and modifying them when needed.
The second way is to create a parameters object that will match the parameters that you will need to use. This option and the remaining ones require you to make sure that you create values for all the parameters you are tuning.
The third way is to skip the parameters object altogether and create the grid with your grid_*() function and dials functions.
The fourth way is to skip dials functions altogether and create the data frame yourself. I find tidyr::crossing() a useful replacement for grid_regular(). This way is a lot easier when you are working with integer parameters and parameters that don't benefit from transformations.
library(tidymodels)
model <- linear_reg(
mode = "regression",
engine = "glmnet",
penalty = tune()
)
rec_cars <- recipe(mpg ~ ., data = mtcars)
wkf <- workflow() %>%
add_recipe(rec_cars) %>%
add_model(model)
# Option 1: using parameters() on workflow
wkf_new_param_space <- wkf %>%
parameters() %>%
update(penalty = penalty(range = c(-5, 5)))
wkf_new_param_space %>%
grid_regular(levels = 10)
#> # A tibble: 10 × 1
#> penalty
#> <dbl>
#> 1 0.00001
#> 2 0.000129
#> 3 0.00167
#> 4 0.0215
#> 5 0.278
#> 6 3.59
#> 7 46.4
#> 8 599.
#> 9 7743.
#> 10 100000
# Option 2: Using parameters() on list
my_params <- parameters(
list(
penalty(range = c(-5, 5))
)
)
my_params %>%
grid_regular(levels = 10)
#> # A tibble: 10 × 1
#> penalty
#> <dbl>
#> 1 0.00001
#> 2 0.000129
#> 3 0.00167
#> 4 0.0215
#> 5 0.278
#> 6 3.59
#> 7 46.4
#> 8 599.
#> 9 7743.
#> 10 100000
# Option 3: Use grid_*() with dials objects directly
grid_regular(
penalty(range = c(-5, 5)),
levels = 10
)
#> # A tibble: 10 × 1
#> penalty
#> <dbl>
#> 1 0.00001
#> 2 0.000129
#> 3 0.00167
#> 4 0.0215
#> 5 0.278
#> 6 3.59
#> 7 46.4
#> 8 599.
#> 9 7743.
#> 10 100000
# Option 4: Create grid values manually
tidyr::crossing(
penalty = 10 ^ seq(-5, 5, length.out = 10)
)
#> # A tibble: 10 × 1
#> penalty
#> <dbl>
#> 1 0.00001
#> 2 0.000129
#> 3 0.00167
#> 4 0.0215
#> 5 0.278
#> 6 3.59
#> 7 46.4
#> 8 599.
#> 9 7743.
#> 10 100000
Created on 2021-08-17 by the reprex package (v2.0.1)
It seems that this is an old question, but I am having a hard time trying to fit this approach (option 1) into my workflow.
How is it supposed to continue?
Is wkf_new_param_space used as the grid or as the object when tuning the model?
model_tuned <-
tune::tune_grid(
object = wkf_new_param_space, ?
resamples = cv_folds,
grid = wkf_new_param_space, ?
metrics = model_metrics,
control = tune::control_grid(save_pred = TRUE, save_workflow = TRUE)
)
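One possible way to continue from option 1, offered as a sketch rather than the definitive answer: keep the workflow itself as object, and turn the updated parameter set into a grid data frame for the grid argument. The code assumes a recent tidymodels in which extract_parameter_set_dials() is available (older versions used parameters(), as in the posts above); cv_folds and the choice of rmse as the metric are made up for the illustration.

```r
library(tidymodels)

model <- linear_reg(
  mode = "regression",
  engine = "glmnet",
  penalty = tune()
)

rec_cars <- recipe(mpg ~ ., data = mtcars)

wkf <- workflow() %>%
  add_recipe(rec_cars) %>%
  add_model(model)

# Parameter set with a custom penalty range (on the log10 scale)
wkf_new_param_space <- wkf %>%
  extract_parameter_set_dials() %>%
  update(penalty = penalty(range = c(-5, 5)))

# Illustrative resamples
set.seed(123)
cv_folds <- vfold_cv(mtcars, v = 5)

model_tuned <- tune_grid(
  object = wkf,                                           # the workflow
  resamples = cv_folds,
  grid = grid_regular(wkf_new_param_space, levels = 10),  # grid built from the parameter set
  metrics = metric_set(rmse)
)

show_best(model_tuned, metric = "rmse")
```

tune_grid() also has a param_info argument that accepts a parameter set, which may be useful when grid is given as a number of candidates instead of a data frame.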
I need some help analyzing survey data.
Here is my code.
Data prep
library(survey)
library(srvyr)
data(api)
dclus2 <- apiclus1 %>%
as_survey_design(dnum, weights = pw, fpc = fpc)
These two pieces of code give me the same result.
One using the package survey
#Code
survey::svymean(~awards, dclus2)
#Results
mean SE
awardsNo 0.28962 0.033
awardsYes 0.71038 0.033
One using the package srvyr
#Code
dclus2 %>%
group_by(awards)%>%
summarise(m=survey_mean())
#Results
awards m m_se
No 0.2896175 0.0330183
Yes 0.7103825 0.0330183
I would like to get the survey mean of the variable "awards" (with levels No and Yes) broken down by the variable "stype".
In the survey package, interaction() is used, e.g. svymean(~interaction(awards, stype), dclus2). How do I get the same result using the srvyr package?
Thank you for your help.
How do I get the result below using the package srvyr?
#Code
svymean(~interaction(awards,stype), dclus2)
#Results
mean SE
interaction(awards, stype)No.E 0.180328 0.0250
interaction(awards, stype)Yes.E 0.606557 0.0428
interaction(awards, stype)No.H 0.043716 0.0179
interaction(awards, stype)Yes.H 0.032787 0.0168
interaction(awards, stype)No.M 0.065574 0.0230
interaction(awards, stype)Yes.M 0.071038 0.0203
You can simply imitate the recommended behavior for survey: create a new variable formed by concatenating distinct values of each of the component variables. That's all that the interaction() function is doing for svymean().
library(survey)
library(srvyr)
data(api)
# Set up design object
dclus2 <- apiclus1 %>%
as_survey_design(dnum, weights = pw, fpc = fpc)
# Create 'interaction' variable
dclus2 %>%
mutate(awards_stype = paste(awards, stype, sep = " - ")) %>%
group_by(awards_stype) %>%
summarize(
prop = survey_mean()
)
#> # A tibble: 6 x 3
#> awards_stype prop prop_se
#> <chr> <dbl> <dbl>
#> 1 No - E 0.180 0.0250
#> 2 No - H 0.0437 0.0179
#> 3 No - M 0.0656 0.0230
#> 4 Yes - E 0.607 0.0428
#> 5 Yes - H 0.0328 0.0168
#> 6 Yes - M 0.0710 0.0203
To get the various component variables split back into separate columns, you can use the separate() function from the tidyr package.
# Separate the columns afterwards
dclus2 %>%
mutate(awards_stype = paste(awards, stype, sep = " - ")) %>%
group_by(awards_stype) %>%
summarize(
prop = survey_mean()
) %>%
tidyr::separate(col = "awards_stype",
into = c("awards", "stype"),
sep = " - ")
#> # A tibble: 6 x 4
#> awards stype prop prop_se
#> <chr> <chr> <dbl> <dbl>
#> 1 No E 0.180 0.0250
#> 2 No H 0.0437 0.0179
#> 3 No M 0.0656 0.0230
#> 4 Yes E 0.607 0.0428
#> 5 Yes H 0.0328 0.0168
#> 6 Yes M 0.0710 0.0203
Created on 2021-03-30 by the reprex package (v1.0.0)
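As a possible alternative to pasting the variables together, srvyr (version 1.0.0 and later, to the best of my knowledge) provides interact() for use inside group_by(), which computes proportions over the combination of the variables and keeps them as separate columns. A minimal sketch, assuming that version or later is installed; the object name res is made up here.

```r
library(survey)
library(srvyr)

data(api)

# Set up design object
dclus2 <- apiclus1 %>%
  as_survey_design(dnum, weights = pw, fpc = fpc)

# Proportions over the awards-by-stype combinations
res <- dclus2 %>%
  group_by(interact(awards, stype)) %>%
  summarize(prop = survey_mean())

res
```

Grouping by group_by(awards, stype) without interact() would instead give the proportions of stype within each level of awards, which is a different quantity from the svymean(~interaction(awards, stype), dclus2) result.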
I am using dplyr and broom to compute kmeans for my data. My data contain a test and training set of X and Y coordinates and are grouped by some parameter value (lambda in this case):
mds.test = data.frame()
for(l in seq(0.1, 0.9, by=0.2)) {
new.dist <- run.distance.model(x, y, lambda=l)
mds <- preform.mds(new.dist, ndim=2)
mds.test <- rbind(mds.test, cbind(mds$space, design[,c(1,3,4,5)], lambda=rep(l, nrow(mds$space)), data="test"))
}
> head(mds.test)
Comp1 Comp2 Transcripts Genes Timepoint Run lambda data
7A_0_AAGCCTAGCGAC -0.06690476 -0.25519106 68125 9324 Day 0 7A 0.1 test
7A_0_AAATGACTGGCC -0.15292848 0.04310200 28443 6746 Day 0 7A 0.1 test
7A_0_CATCTCGTTCTA -0.12529445 0.13022908 27360 6318 Day 0 7A 0.1 test
7A_0_ACCGGCACATTC -0.33015913 0.14647857 23038 5709 Day 0 7A 0.1 test
7A_0_TATGTCGGAATG -0.25826098 0.05424976 22414 5878 Day 0 7A 0.1 test
7A_0_GAAAAAGGTGAT -0.24349387 0.08071162 21907 6766 Day 0 7A 0.1 test
I've shown the head of the test dataset above, but I also have one named mds.train, which contains my training data coordinates. My ultimate goal here is to run k-means for both sets grouped by lambda, then compute the within.ss, between.ss and total.ss for the test data on the training centers. Thanks to a great resource on broom, I am able to run kmeans for each lambda for the test set by simply doing the following:
test.kclusts = mds.test %>%
group_by(lambda) %>%
do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
Then I can compute the centers of this data for each cluster within each lambda:
test.clusters = test.kclusts %>%
group_by(lambda) %>%
do(tidy(.$kclust[[1]]))
This is where I am stuck. How do I compute the feature assignments, similar to what is shown on the reference page (e.g. kclusts %>% group_by(k) %>% do(augment(.$kclust[[1]], points.matrix))), where my points.matrix is mds.test, which is a data.frame with length(unique(mds.test$lambda)) times as many rows as there should be? And is there a way to use the centers from the training set to compute glance() statistics based on the test assignments?
Any help would be greatly appreciated! Thank you!
EDIT: Updating progress. I have figured out how to aggregate the test/training assignments but am still having issues trying to compute kmeans statistics from both sets (training assignments on test center and test assignments on training centers). Updated code is below:
test.kclusts = mds.test %>% group_by(lambda) %>% do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
test.clusters = test.kclusts %>% group_by(lambda) %>% do(tidy(.$kclust[[1]]))
test.clusterings = test.kclusts %>% group_by(lambda) %>% do(glance(.$kclust[[1]]))
test.assignments = left_join(test.kclusts, mds.test) %>% group_by(lambda) %>% do(augment(.$kclust[[1]], cbind(.$Comp1, .$Comp2)))
train.kclusts = mds.train %>% group_by(lambda) %>% do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
train.clusters = train.kclusts %>% group_by(lambda) %>% do(tidy(.$kclust[[1]]))
train.clusterings = train.kclusts %>% group_by(lambda) %>% do(glance(.$kclust[[1]]))
train.assignments = left_join(train.kclusts, mds.train) %>% group_by(lambda) %>% do(augment(.$kclust[[1]], cbind(.$Comp1, .$Comp2)))
test.assignments$data = "test"
train.assignments$data = "train"
merge.assignments = rbind(test.assignments, train.assignments)
merge.assignments %>% filter(., data=='test') %>% group_by(lambda) ... ?
I've attached a plot below which illustrates my progress to this point. Just to reiterate, I would like to compute kmeans statistics (within sum of squares, total sum of squares, and between sum of squares) for the training data centers on the test assignments/coordinates (the plots in which the centers look off):
One approach would be to:
Extract the table specifying the centroids of your clusters (built on the training set) via broom.
Calculate the distance of each point in the test set from each of the cluster centroids built using the training set. You could do this via the fuzzyjoin package.
Assign each test point to the cluster whose centroid is the shortest Euclidean distance away.
From there you can calculate any metrics of interest.
See below, using a simpler dataset pulled from the clustering example from tidymodels.
library(tidyverse)
library(rsample)
library(broom)
library(fuzzyjoin)
# data and train / test set-up
set.seed(27)
centers <- tibble(
cluster = factor(1:3),
num_points = c(100, 150, 50), # number points in each cluster
x1 = c(5, 0, -3), # x1 coordinate of cluster center
x2 = c(-1, 1, -2) # x2 coordinate of cluster center
)
labelled_points <-
centers %>%
mutate(
x1 = map2(num_points, x1, rnorm),
x2 = map2(num_points, x2, rnorm)
) %>%
select(-num_points) %>%
unnest(cols = c(x1, x2))
points <-
labelled_points %>%
select(-cluster)
set.seed(1234)
split <- rsample::initial_split(points)
train <- rsample::training(split)
test <- rsample::testing(split)
# Fit kmeans on train then assign clusters to test
kclust <- kmeans(train, centers = 3)
clust_centers <- kclust %>%
tidy() %>%
select(-c(size, withinss))
test_clusts <- fuzzyjoin::distance_join(mutate(test, index = row_number()),
clust_centers,
max_dist = Inf,
method = "euclidean",
distance_col = "dist") %>%
group_by(index) %>%
filter(dist == min(dist)) %>%
ungroup()
#> Joining by: c("x1", "x2")
# resulting table
test_clusts
#> # A tibble: 75 x 7
#> x1.x x2.x index x1.y x2.y cluster dist
#> <dbl> <dbl> <int> <dbl> <dbl> <fct> <dbl>
#> 1 4.24 -0.946 1 5.07 -1.10 3 0.847
#> 2 3.54 0.287 2 5.07 -1.10 3 2.06
#> 3 3.71 -1.67 3 5.07 -1.10 3 1.47
#> 4 5.03 -0.788 4 5.07 -1.10 3 0.317
#> 5 6.57 -2.49 5 5.07 -1.10 3 2.04
#> 6 4.97 0.233 6 5.07 -1.10 3 1.34
#> 7 4.43 -1.89 7 5.07 -1.10 3 1.01
#> 8 5.34 -0.0705 8 5.07 -1.10 3 1.07
#> 9 4.60 0.196 9 5.07 -1.10 3 1.38
#> 10 5.68 -1.55 10 5.07 -1.10 3 0.758
#> # ... with 65 more rows
# calc within clusts SS on test
test_clusts %>%
group_by(cluster) %>%
summarise(size = n(),
withinss = sum(dist^2),
withinss_avg = withinss / size)
#> # A tibble: 3 x 4
#> cluster size withinss withinss_avg
#> <fct> <int> <dbl> <dbl>
#> 1 1 11 32.7 2.97
#> 2 2 35 78.9 2.26
#> 3 3 29 62.0 2.14
# compare to on train
tidy(kclust) %>%
mutate(withinss_avg = withinss / size)
#> # A tibble: 3 x 6
#> x1 x2 size withinss cluster withinss_avg
#> <dbl> <dbl> <int> <dbl> <fct> <dbl>
#> 1 -3.22 -1.91 40 76.8 1 1.92
#> 2 0.0993 1.06 113 220. 2 1.95
#> 3 5.07 -1.10 72 182. 3 2.53
# plot of test and train points
test_clusts %>%
select(x1 = x1.x, x2 = x2.x, cluster) %>%
mutate(type = "test") %>%
bind_rows(
augment(kclust, train) %>%
mutate(type = "train") %>%
rename(cluster = .cluster)
) %>%
ggplot(aes(x = x1,
y = x2,
color = as.factor(cluster)))+
geom_point()+
facet_wrap(~fct_rev(as.factor(type)))+
coord_fixed()+
labs(title = "Cluster Assignment on Training and Holdout Datasets",
color = "Cluster")+
theme_bw()
Created on 2021-08-19 by the reprex package (v2.0.0)
(See comment on OP for link to conversations on making this easier within tidymodels.)
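For completeness, the same nearest-centroid assignment can be sketched in base R without fuzzyjoin, together with the within, total, and between sums of squares asked about in the question. The data are simulated and the helper make_points() is made up for the illustration. Note that the between sum of squares here is computed as total minus within, by analogy with kmeans(); because the centers come from the training set, it is only an approximate decomposition for the test points.

```r
set.seed(27)

# Hypothetical helper: n points around each of three centers
make_points <- function(n) {
  rbind(
    cbind(rnorm(n, 5), rnorm(n, -1)),
    cbind(rnorm(n, 0), rnorm(n, 1)),
    cbind(rnorm(n, -3), rnorm(n, -2))
  )
}

train <- make_points(50)
test <- make_points(20)

# Fit k-means on the training points only
kclust <- kmeans(train, centers = 3, nstart = 10)

# Squared Euclidean distance from every test point to every training centroid
sq_dist <- sapply(seq_len(nrow(kclust$centers)), function(k) {
  rowSums(sweep(test, 2, kclust$centers[k, ])^2)
})

# Each test point is assigned to its nearest training centroid
assigned <- max.col(-sq_dist, ties.method = "first")

# Within-cluster SS of the test points around the training centroids
nearest_sq_dist <- sq_dist[cbind(seq_len(nrow(test)), assigned)]
withinss <- tapply(nearest_sq_dist, assigned, sum)
tot_withinss <- sum(withinss)

# Total SS of the test points around their own mean
totss <- sum(sweep(test, 2, colMeans(test))^2)

# Between SS by analogy with kmeans(): total minus within (approximate here)
betweenss <- totss - tot_withinss

c(tot_withinss = tot_withinss, totss = totss, betweenss = betweenss)
```

To apply this per lambda, the same steps could be wrapped in a function and run on each group of the training and test data.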