Tuning XGBoost with TIdymodels - over 12 hours later still not finished - r

I have been running an XGBoost model in R on a high-powered machine (4Ghz, 16 cores, 32gb RAM) for over 12 hours and it's still not finished. I am not sure what's going wrong. I followed Julia Silge's blog to the tee. This is what my data looks like:
str(hts.facility.df)
tibble [24,422 x 47] (S3: tbl_df/tbl/data.frame)
$ patient_id : Factor w/ 24422 levels
$ datim_code : chr [1:24422]
$ sex : Factor w/ 2 levels "F","M": 2 1 1 1 1 1 1 1 2 1 ...
$ age : num [1:24422] 33 36 29 21 49 44 71 26 50 38 ...
$ age_group : Factor w/ 12 levels "< 1","1 - 4",..: 7 8 6 5 10 9 12 6 12 8 ...
$ referred_from : Factor w/ 3 levels "Self","TB","Other": 2 1 1 1 1 1 1 1 1 1 ...
$ marital_status : Factor w/ 4 levels "M","S","W","D": 1 1 2 1 1 1 3 2 1 2 ...
$ no_of_own_children_lessthan_5 : Factor w/ 2 levels "0","more_than_2_children": 2 1 1 1 1 1 1 1 1 1 ...
$ no_of_wives : Factor w/ 2 levels "0","more_than_2_wives": 2 1 1 1 1 1 1 1 1 1 ...
$ session_type : Factor w/ 2 levels "Couple","Individual": 2 2 2 2 2 2 2 2 2 2 ...
$ previously_tested_hiv_negative : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ client_pregnant : Factor w/ 2 levels "1","0": 2 1 1 1 1 1 1 1 2 1 ...
$ hts_test_result : Factor w/ 2 levels "Neg","Pos": 1 1 1 1 1 1 1 1 1 1 ...
$ hts_setting : Factor w/ 4 levels "CT","TB","Ward",..: 3 3 3 3 3 3 3 3 3 3 ...
$ tested_for_hiv_before_within_this_year: Factor w/ 2 levels "PreviouslyTestedNegative",..: 2 1 2 2 2 2 2 2 2 2 ...
$ is_surge_site : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ nga_agesex_f_15_2019 : num [1:24422] 0.0627 0.0627 0.0627 0.0627 0.0627 ...
$ nga_agesex_f_20_2019 : num [1:24422] 0.0581 0.0581 0.0581 0.0581 0.0581 ...
$ nga_agesex_f_25_2019 : num [1:24422] 0.0411 0.0411 0.0411 0.0411 0.0411 ...
$ nga_agesex_f_30_2019 : num [1:24422] 0.0314 0.0314 0.0314 0.0314 0.0314 ...
$ nga_agesex_f_35_2019 : num [1:24422] 0.0275 0.0275 0.0275 0.0275 0.0275 ...
$ nga_agesex_f_40_2019 : num [1:24422] 0.021 0.021 0.021 0.021 0.021 ...
$ nga_agesex_f_45_2019 : num [1:24422] 0.0166 0.0166 0.0166 0.0166 0.0166 ...
$ nga_agesex_m_15_2019 : num [1:24422] 0.0536 0.0536 0.0536 0.0536 0.0536 ...
$ nga_agesex_m_20_2019 : num [1:24422] 0.0632 0.0632 0.0632 0.0632 0.0632 ...
$ nga_agesex_m_25_2019 : num [1:24422] 0.0534 0.0534 0.0534 0.0534 0.0534 ...
$ nga_agesex_m_30_2019 : num [1:24422] 0.036 0.036 0.036 0.036 0.036 ...
$ nga_agesex_m_35_2019 : num [1:24422] 0.0325 0.0325 0.0325 0.0325 0.0325 ...
$ nga_agesex_m_40_2019 : num [1:24422] 0.0263 0.0263 0.0263 0.0263 0.0263 ...
$ nga_agesex_m_45_2019 : num [1:24422] 0.0236 0.0236 0.0236 0.0236 0.0236 ...
$ IHME_CONDOM_LAST_TIME_PREV_MEAN_2017 : num [1:24422] 14.1 14.1 14.1 14.1 14.1 ...
$ IHME_HAD_INTERCOURSE_PREV_MEAN_2017 : num [1:24422] 63.1 63.1 63.1 63.1 63.1 ...
$ IHME_HIV_COUNT_MEAN_2017 : num [1:24422] 0.0126 0.0126 0.0126 0.0126 0.0126 ...
$ IHME_IN_UNION_PREV_MEAN_2017 : num [1:24422] 56.9 56.9 56.9 56.9 56.9 ...
$ IHME_MALE_CIRCUMCISION_PREV_MEAN_2017 : num [1:24422] 98.7 98.7 98.7 98.7 98.7 ...
$ IHME_PARTNER_AWAY_PREV_MEAN_2017 : num [1:24422] 13.5 13.5 13.5 13.5 13.5 ...
$ IHME_PARTNERS_YEAR_MN_PREV_MEAN_2017 : num [1:24422] 13.5 13.5 13.5 13.5 13.5 ...
$ IHME_PARTNERS_YEAR_WN_PREV_MEAN_2017 : num [1:24422] 3.07 3.07 3.07 3.07 3.07 ...
$ IHME_STI_SYMPTOMS_PREV_MEAN_2017 : num [1:24422] 4.15 4.15 4.15 4.15 4.15 ...
$ wp_contraceptive : num [1:24422] 0.282 0.282 0.282 0.282 0.282 ...
$ wp_liveBirths : num [1:24422] 124 124 124 124 124 ...
$ wp_poverty : num [1:24422] 0.555 0.555 0.555 0.555 0.555 ...
$ wp_lit_men : num [1:24422] 0.967 0.967 0.967 0.967 0.967 ...
$ wp_lit_women : num [1:24422] 0.874 0.874 0.874 0.874 0.874 ...
$ wp_stunting_men : num [1:24422] 0.178 0.178 0.178 0.178 0.178 ...
$ wp_stunting_women : num [1:24422] 0.215 0.215 0.215 0.215 0.215 ...
$ road_density_km : num [1:24422] 82.3 82.3 82.3 82.3 82.3 ...
And this is the code I am running:
set.seed(4488)
hts.facility.df2 = hts.facility.df %>%
mutate(hts_test_result = as.factor(case_when(
hts_test_result == 'Pos' ~ 1,
hts_test_result == 'Neg' ~ 0
)))
# split data into training and test using hts test result column ----------------------
df.split = initial_split(hts.facility.df2, strata = hts_test_result) # default split if .75/.25
train.df = training(df.split)
test.df = testing(df.split)
# recipe for Random Forest model ------------------------------------------------
# use themis package for oversampling: https://github.com/tidymodels/themis
# for more info on SMOTE method for unbalanced data refer: https://jair.org/index.php/jair/article/view/10302/24590
hts_recipe = recipe(hts_test_result ~ ., data = train.df) %>%
# remove individual data - patient id and facility id and age since age-grouo is already in the dataset
step_rm(patient_id, datim_code, age) %>%
update_role(patient_id, new_role = "patient_ID") %>%
update_role(datim_code, new_role = "facility_id") %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
# # normalize numeric variables
step_normalize(all_predictors()) %>%
# downsample positive tests as they are 90% of the results -
themis::step_smote(hts_test_result, over_ratio = 1)
hts_tree_prep <- prep(hts_recipe)
# create the data frame
hts_juiced <- juice(hts_tree_prep)
xgb_spec <- boost_tree(
trees = 500,
tree_depth = tune(), min_n = tune(),
loss_reduction = tune(), ## first three: model complexity
sample_size = tune(), mtry = tune(), ## randomness
learn_rate = tune(), ## step size
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# set up grid for tuning values -------------------
xgb_grid <- grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), train.df),
learn_rate(),
size = 20
)
xgb_grid
# set up workflow ---------------------------------------------------------
xgb_wf <- workflow() %>%
add_formula(hts_test_result ~ .) %>%
add_model(xgb_spec)
set.seed(123)
vb_folds <- vfold_cv(train.df, strata = hts_test_result)
vb_folds
# tune the model ----------------------------------------------------------
doParallel::registerDoParallel()
set.seed(234)
xgb_res <- tune::tune_grid(
xgb_wf,
resamples = vb_folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE)
)
This is where it's been stuck for the last 12 hours. My dataset is so small, why is it taking so long?

Related

Take 20+ subsets of data?

I have a dataset and would like to take a lot of subsets based on various columns, values, and conditional operators. I think the most desirable output is a list containing all of these subsetted data frames as separate elements in the list. I attempted to do this by building a data frame that contains the subset conditions I would like to use, building a function, then using apply to feed that data frame to the function, but that didn't work. I'm sure there's probably a better method that uses an anonymous function or something like that, but I'm not sure how I would implement that. Below is an example code that should produce 8 subsets of data.
Original dataset, where x1 and x2 are scored on items that won't be used for subsetting and RT and LS are the variables that will be a subset on:
df <- data.frame(x1 = rnorm(100),
x2 = rnorm(100),
RT = abs(rnorm(100)),
LS = sample(1:10, 100, replace = T))
Dataframe containing the conditions for subsetting. E.g., the first subset of data should be any observations with values greater than or equal to 0.5 in the RT column, the second subset should be any observations greater than or equal to 1 in the subset column, etc. There should be 8 subsets, 4 done on the RT variable and 4 done on the LS variable.
subsetConditions <- data.frame(column = rep(c("RT", "LS"), each = 4),
operator = rep(c(">=", "<="), each = 4),
value = c(0.5, 1, 1.5, 2,
9, 8, 7, 6))
And this is the ugly function I wrote to attempt to do this:
subsetFun <- function(x){
subset(df, eval(parse(text = paste(x))))
}
subsets <- apply(subsetConditions, 1, subsetFun)
Thanks for any help!
Consider Map (wrapper to mapply) without any eval + parse. Since ==, <=, >=, and other operators can be used as functions with two arguments where 4 <= 5 can be written as `<=`(4,5) or "<="(4, 5), simply pass arguments elementwise and use get to reference the function by string:
sub_data <- function(col, op, val) {
df[get(op)(df[[col]], val),]
}
sub_dfs <- with(subsetConditions, Map(sub_data, column, operator, value))
Output
str(sub_dfs)
List of 8
$ RT:'data.frame': 62 obs. of 4 variables:
..$ x1: num [1:62] -1.12 -0.745 -1.377 0.848 1.63 ...
..$ x2: num [1:62] -0.257 -2.385 0.805 -0.313 0.662 ...
..$ RT: num [1:62] 0.693 1.662 0.731 2.145 0.543 ...
..$ LS: int [1:62] 5 5 1 2 9 1 5 9 3 10 ...
$ RT:'data.frame': 36 obs. of 4 variables:
..$ x1: num [1:36] -0.745 0.848 0.908 -0.761 0.74 ...
..$ x2: num [1:36] -2.3849 -0.3131 -2.4645 -0.0784 0.8512 ...
..$ RT: num [1:36] 1.66 2.15 1.74 1.65 1.13 ...
..$ LS: int [1:36] 5 2 1 5 9 10 2 7 1 3 ...
$ RT:'data.frame': 14 obs. of 4 variables:
..$ x1: num [1:14] -0.745 0.848 0.908 -0.761 -1.063 ...
..$ x2: num [1:14] -2.3849 -0.3131 -2.4645 -0.0784 -2.9886 ...
..$ RT: num [1:14] 1.66 2.15 1.74 1.65 2.63 ...
..$ LS: int [1:14] 5 2 1 5 5 6 9 4 8 4 ...
$ RT:'data.frame': 3 obs. of 4 variables:
..$ x1: num [1:3] 0.848 -1.063 0.197
..$ x2: num [1:3] -0.313 -2.989 0.709
..$ RT: num [1:3] 2.15 2.63 2.05
..$ LS: int [1:3] 2 5 6
$ LS:'data.frame': 92 obs. of 4 variables:
..$ x1: num [1:92] -1.12 -0.745 -1.377 0.848 0.612 ...
..$ x2: num [1:92] -0.257 -2.385 0.805 -0.313 0.958 ...
..$ RT: num [1:92] 0.693 1.662 0.731 2.145 0.489 ...
..$ LS: int [1:92] 5 5 1 2 1 9 1 5 9 3 ...
$ LS:'data.frame': 78 obs. of 4 variables:
..$ x1: num [1:78] -1.12 -0.745 -1.377 0.848 0.612 ...
..$ x2: num [1:78] -0.257 -2.385 0.805 -0.313 0.958 ...
..$ RT: num [1:78] 0.693 1.662 0.731 2.145 0.489 ...
..$ LS: int [1:78] 5 5 1 2 1 1 5 3 5 2 ...
$ LS:'data.frame': 75 obs. of 4 variables:
..$ x1: num [1:75] -1.12 -0.745 -1.377 0.848 0.612 ...
..$ x2: num [1:75] -0.257 -2.385 0.805 -0.313 0.958 ...
..$ RT: num [1:75] 0.693 1.662 0.731 2.145 0.489 ...
..$ LS: int [1:75] 5 5 1 2 1 1 5 3 5 2 ...
$ LS:'data.frame': 62 obs. of 4 variables:
..$ x1: num [1:62] -1.12 -0.745 -1.377 0.848 0.612 ...
..$ x2: num [1:62] -0.257 -2.385 0.805 -0.313 0.958 ...
..$ RT: num [1:62] 0.693 1.662 0.731 2.145 0.489 ...
..$ LS: int [1:62] 5 5 1 2 1 1 5 3 5 2 ...
You were actually pretty close with your function, but just needed to make an adjustment. So, with paste for each row, you need to collapse all 3 columns so that it is only 1 string rather than 3, then it can properly evaluate the expression.
subsetFun <- function(x){
subset(df, eval(parse(text = paste(x, collapse = ""))))
}
subsets <- apply(subsetConditions, 1, subsetFun)
Output
Then, it will return the 8 subsets.
str(subsets)
List of 8
$ :'data.frame': 67 obs. of 4 variables:
..$ x1: num [1:67] -1.208 0.606 -0.17 0.728 -0.424 ...
..$ x2: num [1:67] 0.4058 -0.3041 -0.3357 0.7904 -0.0264 ...
..$ RT: num [1:67] 1.972 0.883 0.598 0.633 1.517 ...
..$ LS: int [1:67] 8 9 2 10 8 5 3 4 7 2 ...
$ :'data.frame': 35 obs. of 4 variables:
..$ x1: num [1:35] -1.2083 -0.4241 -0.0906 0.9851 -0.8236 ...
..$ x2: num [1:35] 0.4058 -0.0264 1.0054 0.0653 1.4647 ...
..$ RT: num [1:35] 1.97 1.52 1.05 1.63 1.47 ...
..$ LS: int [1:35] 8 8 5 4 7 3 1 6 8 6 ...
$ :'data.frame': 16 obs. of 4 variables:
..$ x1: num [1:16] -1.208 -0.424 0.985 0.99 0.939 ...
..$ x2: num [1:16] 0.4058 -0.0264 0.0653 0.3486 -0.7562 ...
..$ RT: num [1:16] 1.97 1.52 1.63 1.85 1.8 ...
..$ LS: int [1:16] 8 8 4 6 10 2 6 6 3 9 ...
$ :'data.frame': 7 obs. of 4 variables:
..$ x1: num [1:7] 0.963 0.423 -0.444 0.279 0.417 ...
..$ x2: num [1:7] 0.6612 0.0354 0.0555 0.1253 -0.3056 ...
..$ RT: num [1:7] 2.71 2.15 2.05 2.01 2.07 ...
..$ LS: int [1:7] 2 6 9 9 7 7 4
$ :'data.frame': 91 obs. of 4 variables:
..$ x1: num [1:91] -0.952 -1.208 0.606 -0.17 -0.048 ...
..$ x2: num [1:91] -0.645 0.406 -0.304 -0.336 -0.897 ...
..$ RT: num [1:91] 0.471 1.972 0.883 0.598 0.224 ...
..$ LS: int [1:91] 6 8 9 2 1 8 4 5 3 4 ...
$ :'data.frame': 75 obs. of 4 variables:
..$ x1: num [1:75] -0.952 -1.208 -0.17 -0.048 -0.424 ...
..$ x2: num [1:75] -0.6448 0.4058 -0.3357 -0.8968 -0.0264 ...
..$ RT: num [1:75] 0.471 1.972 0.598 0.224 1.517 ...
..$ LS: int [1:75] 6 8 2 1 8 4 5 3 4 1 ...
$ :'data.frame': 65 obs. of 4 variables:
..$ x1: num [1:65] -0.9517 -0.1698 -0.048 0.2834 -0.0906 ...
..$ x2: num [1:65] -0.645 -0.336 -0.897 -2.072 1.005 ...
..$ RT: num [1:65] 0.471 0.598 0.224 0.486 1.053 ...
..$ LS: int [1:65] 6 2 1 4 5 3 4 1 7 4 ...
$ :'data.frame': 58 obs. of 4 variables:
..$ x1: num [1:58] -0.9517 -0.1698 -0.048 0.2834 -0.0906 ...
..$ x2: num [1:58] -0.645 -0.336 -0.897 -2.072 1.005 ...
..$ RT: num [1:58] 0.471 0.598 0.224 0.486 1.053 ...
..$ LS: int [1:58] 6 2 1 4 5 3 4 1 4 2 ...

How to normalize all variables in an R dataframe (except for the one variable that's a factor)

I'm having difficulty applying the max-min normalize function to the predictor variables (30 of them) in my data frame without excluding the diagnosis variable (as it is a factor and not subject to the function) from the data frame.
```{r}
cancer_data <- as.data.frame(lapply(cancer_data, normalize))
```
This won't run bc it will prompt an error message referencing the factor column, but I don't want the new data frame to be created without that column. I would just like to apply the normalize function I created to the 30 predictor variables.
Here is the structure of my data frame if it provides helpful context at all:
str(cancer_data)
## 'data.frame': 569 obs. of 31 variables:
## $ diagnosis : Factor w/ 2 levels "Benign","Malignant": 1 1 1 1 1 1 1 2 1 1 ...
## $ radius_mean : num 12.3 10.6 11 11.3 15.2 ...
## $ texture_mean : num 12.4 18.9 16.8 13.4 13.2 ...
## $ perimeter_mean : num 78.8 69.3 70.9 73 97.7 ...
## $ area_mean : num 464 346 373 385 712 ...
## $ smoothness_mean : num 0.1028 0.0969 0.1077 0.1164 0.0796 ...
## $ compactness_mean : num 0.0698 0.1147 0.078 0.1136 0.0693 ...
## $ concavity_mean : num 0.0399 0.0639 0.0305 0.0464 0.0339 ...
## $ points_mean : num 0.037 0.0264 0.0248 0.048 0.0266 ...
## $ symmetry_mean : num 0.196 0.192 0.171 0.177 0.172 ...
## $ dimension_mean : num 0.0595 0.0649 0.0634 0.0607 0.0554 ...
## $ radius_se : num 0.236 0.451 0.197 0.338 0.178 ...
## $ texture_se : num 0.666 1.197 1.387 1.343 0.412 ...
## $ perimeter_se : num 1.67 3.43 1.34 1.85 1.34 ...
## $ area_se : num 17.4 27.1 13.5 26.3 17.7 ...
## $ smoothness_se : num 0.00805 0.00747 0.00516 0.01127 0.00501 ...
## $ compactness_se : num 0.0118 0.03581 0.00936 0.03498 0.01485 ...
## $ concavity_se : num 0.0168 0.0335 0.0106 0.0219 0.0155 ...
## $ points_se : num 0.01241 0.01365 0.00748 0.01965 0.00915 ...
## $ symmetry_se : num 0.0192 0.035 0.0172 0.0158 0.0165 ...
## $ dimension_se : num 0.00225 0.00332 0.0022 0.00344 0.00177 ...
## $ radius_worst : num 13.5 11.9 12.4 11.9 16.2 ...
## $ texture_worst : num 15.6 22.9 26.4 15.8 15.7 ...
## $ perimeter_worst : num 87 78.3 79.9 76.5 104.5 ...
## $ area_worst : num 549 425 471 434 819 ...
## $ smoothness_worst : num 0.139 0.121 0.137 0.137 0.113 ...
## $ compactness_worst: num 0.127 0.252 0.148 0.182 0.174 ...
## $ concavity_worst : num 0.1242 0.1916 0.1067 0.0867 0.1362 ...
## $ points_worst : num 0.0939 0.0793 0.0743 0.0861 0.0818 ...
## $ symmetry_worst : num 0.283 0.294 0.3 0.21 0.249 ...
## $ dimension_worst : num 0.0677 0.0759 0.0788 0.0678 0.0677 ...
Assuming you already have normalize function in your environment. You can get the numeric variables in your data and apply the function to selected columns using lapply.
cols <- sapply(cancer_data, is.numeric)
cancer_data[cols] <- lapply(cancer_data[cols], normalize)
Or without creating cols.
cancer_data[] <- lapply(cancer_data, function(x)
if(is.numeric(x)) normalize(x) else x)
If you want to exclude only 1st column, you can also use :
cancer_data[-1] <- lapply(cancer_data[-1], normalize)
This should work, but do look into tidymodels
Thanks to akrun for the new shorter answer.
library(tidyverse)
cancer_data <-cancer_data %>% mutate_if(negate(is.factor), normalize)

UCI Machine Learning Repository datasets

I am new to UCI Machine Learning Repository datasets
I have tried to download the data into R, but I can not do it.
Could someone please help with this?
Note, I am using MacBook Pro.
data capture
data capture
This is the data I want to use
You need to look at the data first to understand its arrangement and whether there is any metadata like a header. Your browser should be sufficient for this. The first two lines of the ionosphere.data file are:
1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1,0.03760,0.85243,-0.17755,0.59755,-0.44945,0.60536,-0.38223,0.84356,-0.38542,0.58212,-0.32192,0.56971,-0.29674,0.36946,-0.47357,0.56811,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.42267,-0.54487,0.18641,-0.45300,g
1,0,1,-0.18829,0.93035,-0.36156,-0.10868,-0.93597,1,-0.04549,0.50874,-0.67743,0.34432,-0.69707,-0.51685,-0.97515,0.05499,-0.62237,0.33109,-1,-0.13151,-0.45300,-0.18056,-0.35734,-0.20332,-0.26569,-0.20468,-0.18401,-0.19040,-0.11593,-0.16626,-0.06288,-0.13738,-0.02447,b
So, no header, but it is a CSV file. Can use either read.table with sep="," or read.csv with header=FALSE. You might (incorrectly as did I) assume the column names are in the other file, but this is a machine learning task where there are no labels, so the read.* functions will assign generic names to the columns of the dataframe created.
You copy the link address with your browser to the datafile, then paste it into read.table in quotes and add the separator argument (since read.table's default separator values (whitespace) does not include commas:
ionosphere <- read.table( "https://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data",
sep=",") # header=FALSE is default for read.table
> str(ionosphere)
'data.frame': 351 obs. of 35 variables:
$ V1 : int 1 1 1 1 1 1 1 0 1 1 ...
$ V2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ V3 : num 0.995 1 1 1 1 ...
$ V4 : num -0.0589 -0.1883 -0.0336 -0.4516 -0.024 ...
$ V5 : num 0.852 0.93 1 1 0.941 ...
$ V6 : num 0.02306 -0.36156 0.00485 1 0.06531 ...
$ V7 : num 0.834 -0.109 1 0.712 0.921 ...
$ V8 : num -0.377 -0.936 -0.121 -1 -0.233 ...
$ V9 : num 1 1 0.89 0 0.772 ...
$ V10: num 0.0376 -0.0455 0.012 0 -0.164 ...
$ V11: num 0.852 0.509 0.731 0 0.528 ...
$ V12: num -0.1776 -0.6774 0.0535 0 -0.2028 ...
$ V13: num 0.598 0.344 0.854 0 0.564 ...
$ V14: num -0.44945 -0.69707 0.00827 0 -0.00712 ...
$ V15: num 0.605 -0.517 0.546 -1 0.344 ...
$ V16: num -0.38223 -0.97515 0.00299 0.14516 -0.27457 ...
$ V17: num 0.844 0.055 0.838 0.541 0.529 ...
$ V18: num -0.385 -0.622 -0.136 -0.393 -0.218 ...
$ V19: num 0.582 0.331 0.755 -1 0.451 ...
$ V20: num -0.3219 -1 -0.0854 -0.5447 -0.1781 ...
$ V21: num 0.5697 -0.1315 0.7089 -0.6997 0.0598 ...
$ V22: num -0.297 -0.453 -0.275 1 -0.356 ...
$ V23: num 0.3695 -0.1806 0.4339 0 0.0231 ...
$ V24: num -0.474 -0.357 -0.121 0 -0.529 ...
$ V25: num 0.5681 -0.2033 0.5753 1 0.0329 ...
$ V26: num -0.512 -0.266 -0.402 0.907 -0.652 ...
$ V27: num 0.411 -0.205 0.59 0.516 0.133 ...
$ V28: num -0.462 -0.184 -0.221 1 -0.532 ...
$ V29: num 0.2127 -0.1904 0.431 1 0.0243 ...
$ V30: num -0.341 -0.116 -0.174 -0.201 -0.622 ...
$ V31: num 0.4227 -0.1663 0.6044 0.2568 -0.0571 ...
$ V32: num -0.5449 -0.0629 -0.2418 1 -0.5957 ...
$ V33: num 0.1864 -0.1374 0.5605 -0.3238 -0.0461 ...
$ V34: num -0.453 -0.0245 -0.3824 1 -0.657 ...
$ V35: Factor w/ 2 levels "b","g": 2 1 2 1 2 1 2 1 2 1 ...

data training with R where data preprocessed into PCA components?

I would like to train a knn using caret::train to classify digits (classic problem) employing a PCA on the features before training.
control = trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
p = 0.9)
knnFit = train(x = trainingDigit,
y = label,
metric = "Accuracy",
method = "knn",
trControl = control,
preProcess = "pca")
I don't understand how to represent my data for training resulting in an error:
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
My training data is represented as follows (Rdata file):
List of 10
$ : num [1:400, 1:324] 0.934 0.979 0.877 0.853 0.945 ...
$ : num [1:400, 1:324] 0.807 0.98 0.803 0.978 0.969 ...
$ : num [1:400, 1:324] 0.745 0.883 0.776 0.825 0.922 ...
$ : num [1:400, 1:324] 0.892 0.817 0.835 0.84 0.842 ...
$ : num [1:400, 1:324] 0.752 0.859 0.881 0.884 0.855 ...
$ : num [1:400, 1:324] 0.798 0.969 0.925 0.921 0.873 ...
$ : num [1:400, 1:324] 0.964 0.93 0.97 0.857 0.926 ...
$ : num [1:400, 1:324] 0.922 0.939 0.958 0.946 0.867 ...
$ : num [1:400, 1:324] 0.969 0.947 0.916 0.861 0.86 ...
$ : num [1:400, 1:324] 0.922 0.933 0.978 0.968 0.971 ...
Labels as follows (.Rdata file):
List of 10
$ : num [1:400] 0 0 0 0 0 0 0 0 0 0 ...
$ : num [1:400] 1 1 1 1 1 1 1 1 1 1 ...
$ : num [1:400] 2 2 2 2 2 2 2 2 2 2 ...
$ : num [1:400] 3 3 3 3 3 3 3 3 3 3 ...
$ : num [1:400] 4 4 4 4 4 4 4 4 4 4 ...
$ : num [1:400] 5 5 5 5 5 5 5 5 5 5 ...
$ : num [1:400] 6 6 6 6 6 6 6 6 6 6 ...
$ : num [1:400] 7 7 7 7 7 7 7 7 7 7 ...
$ : num [1:400] 8 8 8 8 8 8 8 8 8 8 ...
$ : num [1:400] 9 9 9 9 9 9 9 9 9 9 ...
The problem is in your representation of the data. Try this before you start training:
label <- factor(c(label, recursive = TRUE))
trainingDigit <- data.frame(do.call(rbind, trainingDigit))
You need to massage your data into a data.frame or data.frame-like format with a single column representing your different outcomes with the other columns being features for each outcome.
Also, if you want to do classification, not regression, your outcomes need to be a factor.
To be clear, I tried to run the training code as follows, and it works just fine.
library(caret)
load("data.RData")
load("testClass_new.RData")
label <- factor(c(label, recursive = TRUE))
trainingDigit <- data.frame(do.call(rbind, trainingDigit))
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
p = 0.9)
knnFit <- train(x = trainingDigit,
y = label,
metric = "Accuracy",
method = "knn",
trControl = control,
preProcess = "pca")

Add a column in a list of data frames

I want to add a column to each of my data frames in my list table after I do this code :
#list of my dataframes
df <- list(df1,df2,df3,df4)
#compute stats
stats <- function(d) do.call(rbind, lapply(split(d, d[,2]), function(x) data.frame(Nb= length(x$Year), Mean=mean(x$A), SD=sd(x$A) )))
#Apply to list of dataframes
table <- lapply(df, stats)
This column which I call Source for example, include the names of my dataframes along with Nb, Mean and SD variables. So the variable Source should contain df1,df1,df1... for my table[1], and so on.
Is there anyway I can add it in my code above?
Here's a different way of doing things:
First, let's start with some reproducible data:
set.seed(1)
n = 10
dat <- list(data.frame(a=rnorm(n), b=sample(1:3,n,TRUE)),
data.frame(a=rnorm(n), b=sample(1:3,n,TRUE)),
data.frame(a=rnorm(n), b=sample(1:3,n,TRUE)),
data.frame(a=rnorm(n), b=sample(1:3,n,TRUE)))
Then, you want a function that adds columns to a data.frame. The obvious candidate is within. The particular things you want to calculate are constant values for each observation within a particular category. To do that, use ave for each of the columns you want to add. Here's your new function:
stat <- function(d){
within(d, {
Nb = ave(a, b, FUN=length)
Mean = ave(a, b, FUN=mean)
SD = ave(a, b, FUN=sd)
})
}
Then just lapply it to your list of data.frames:
lapply(dat, stat)
As you can see, columns are added as appropriate:
> str(lapply(dat, stat))
List of 4
$ :'data.frame': 10 obs. of 5 variables:
..$ a : num [1:10] -0.626 0.184 -0.836 1.595 0.33 ...
..$ b : int [1:10] 3 1 2 1 1 2 1 2 3 2
..$ SD : num [1:10] 0.85 0.643 0.738 0.643 0.643 ...
..$ Mean: num [1:10] -0.0253 0.649 -0.3058 0.649 0.649 ...
..$ Nb : num [1:10] 2 4 4 4 4 4 4 4 2 4
$ :'data.frame': 10 obs. of 5 variables:
..$ a : num [1:10] -0.0449 -0.0162 0.9438 0.8212 0.5939 ...
..$ b : int [1:10] 2 3 2 1 1 1 1 2 2 2
..$ SD : num [1:10] 1.141 NA 1.141 0.136 0.136 ...
..$ Mean: num [1:10] -0.0792 -0.0162 -0.0792 0.7791 0.7791 ...
..$ Nb : num [1:10] 5 1 5 4 4 4 4 5 5 5
$ :'data.frame': 10 obs. of 5 variables:
..$ a : num [1:10] 1.3587 -0.1028 0.3877 -0.0538 -1.3771 ...
..$ b : int [1:10] 2 3 2 1 3 1 3 1 1 1
..$ SD : num [1:10] 0.687 0.668 0.687 0.635 0.668 ...
..$ Mean: num [1:10] 0.873 -0.625 0.873 0.267 -0.625 ...
..$ Nb : num [1:10] 2 3 2 5 3 5 3 5 5 5
$ :'data.frame': 10 obs. of 5 variables:
..$ a : num [1:10] -0.707 0.365 0.769 -0.112 0.881 ...
..$ b : int [1:10] 3 3 2 2 1 1 3 1 2 2
..$ SD : num [1:10] 0.593 0.593 1.111 1.111 0.297 ...
..$ Mean: num [1:10] -0.318 -0.318 0.24 0.24 0.54 ...
..$ Nb : num [1:10] 3 3 4 4 3 3 3 3 4 4

Resources