UCI Machine Learning Repository datasets - r

I am new to UCI Machine Learning Repository datasets
I have tried to download the data into R, but I can not do it.
Could someone please help with this?
Note, I am using MacBook Pro.
data capture
data capture
This is the data I want to use

You need to look at the data first to understand its arrangement and whether there is any metadata like a header. Your browser should be sufficient for this. The first two lines of the ionosphere.data file are:
1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1,0.03760,0.85243,-0.17755,0.59755,-0.44945,0.60536,-0.38223,0.84356,-0.38542,0.58212,-0.32192,0.56971,-0.29674,0.36946,-0.47357,0.56811,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.42267,-0.54487,0.18641,-0.45300,g
1,0,1,-0.18829,0.93035,-0.36156,-0.10868,-0.93597,1,-0.04549,0.50874,-0.67743,0.34432,-0.69707,-0.51685,-0.97515,0.05499,-0.62237,0.33109,-1,-0.13151,-0.45300,-0.18056,-0.35734,-0.20332,-0.26569,-0.20468,-0.18401,-0.19040,-0.11593,-0.16626,-0.06288,-0.13738,-0.02447,b
So, no header, but it is a CSV file. Can use either read.table with sep="," or read.csv with header=FALSE. You might (incorrectly as did I) assume the column names are in the other file, but this is a machine learning task where there are no labels, so the read.* functions will assign generic names to the columns of the dataframe created.
You copy the link address with your browser to the datafile, then paste it into read.table in quotes and add the separator argument (since read.table's default separator values (whitespace) does not include commas:
ionosphere <- read.table( "https://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ionosphere.data",
sep=",") # header=FALSE is default for read.table
> str(ionosphere)
'data.frame': 351 obs. of 35 variables:
$ V1 : int 1 1 1 1 1 1 1 0 1 1 ...
$ V2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ V3 : num 0.995 1 1 1 1 ...
$ V4 : num -0.0589 -0.1883 -0.0336 -0.4516 -0.024 ...
$ V5 : num 0.852 0.93 1 1 0.941 ...
$ V6 : num 0.02306 -0.36156 0.00485 1 0.06531 ...
$ V7 : num 0.834 -0.109 1 0.712 0.921 ...
$ V8 : num -0.377 -0.936 -0.121 -1 -0.233 ...
$ V9 : num 1 1 0.89 0 0.772 ...
$ V10: num 0.0376 -0.0455 0.012 0 -0.164 ...
$ V11: num 0.852 0.509 0.731 0 0.528 ...
$ V12: num -0.1776 -0.6774 0.0535 0 -0.2028 ...
$ V13: num 0.598 0.344 0.854 0 0.564 ...
$ V14: num -0.44945 -0.69707 0.00827 0 -0.00712 ...
$ V15: num 0.605 -0.517 0.546 -1 0.344 ...
$ V16: num -0.38223 -0.97515 0.00299 0.14516 -0.27457 ...
$ V17: num 0.844 0.055 0.838 0.541 0.529 ...
$ V18: num -0.385 -0.622 -0.136 -0.393 -0.218 ...
$ V19: num 0.582 0.331 0.755 -1 0.451 ...
$ V20: num -0.3219 -1 -0.0854 -0.5447 -0.1781 ...
$ V21: num 0.5697 -0.1315 0.7089 -0.6997 0.0598 ...
$ V22: num -0.297 -0.453 -0.275 1 -0.356 ...
$ V23: num 0.3695 -0.1806 0.4339 0 0.0231 ...
$ V24: num -0.474 -0.357 -0.121 0 -0.529 ...
$ V25: num 0.5681 -0.2033 0.5753 1 0.0329 ...
$ V26: num -0.512 -0.266 -0.402 0.907 -0.652 ...
$ V27: num 0.411 -0.205 0.59 0.516 0.133 ...
$ V28: num -0.462 -0.184 -0.221 1 -0.532 ...
$ V29: num 0.2127 -0.1904 0.431 1 0.0243 ...
$ V30: num -0.341 -0.116 -0.174 -0.201 -0.622 ...
$ V31: num 0.4227 -0.1663 0.6044 0.2568 -0.0571 ...
$ V32: num -0.5449 -0.0629 -0.2418 1 -0.5957 ...
$ V33: num 0.1864 -0.1374 0.5605 -0.3238 -0.0461 ...
$ V34: num -0.453 -0.0245 -0.3824 1 -0.657 ...
$ V35: Factor w/ 2 levels "b","g": 2 1 2 1 2 1 2 1 2 1 ...

Related

Take 20+ subsets of data?

I have a dataset and would like to take a lot of subsets based on various columns, values, and conditional operators. I think the most desirable output is a list containing all of these subsetted data frames as separate elements in the list. I attempted to do this by building a data frame that contains the subset conditions I would like to use, building a function, then using apply to feed that data frame to the function, but that didn't work. I'm sure there's probably a better method that uses an anonymous function or something like that, but I'm not sure how I would implement that. Below is an example code that should produce 8 subsets of data.
Original dataset, where x1 and x2 are scored on items that won't be used for subsetting and RT and LS are the variables that will be a subset on:
df <- data.frame(x1 = rnorm(100),
x2 = rnorm(100),
RT = abs(rnorm(100)),
LS = sample(1:10, 100, replace = T))
Dataframe containing the conditions for subsetting. E.g., the first subset of data should be any observations with values greater than or equal to 0.5 in the RT column, the second subset should be any observations greater than or equal to 1 in the subset column, etc. There should be 8 subsets, 4 done on the RT variable and 4 done on the LS variable.
subsetConditions <- data.frame(column = rep(c("RT", "LS"), each = 4),
operator = rep(c(">=", "<="), each = 4),
value = c(0.5, 1, 1.5, 2,
9, 8, 7, 6))
And this is the ugly function I wrote to attempt to do this:
subsetFun <- function(x){
subset(df, eval(parse(text = paste(x))))
}
subsets <- apply(subsetConditions, 1, subsetFun)
Thanks for any help!
Consider Map (wrapper to mapply) without any eval + parse. Since ==, <=, >=, and other operators can be used as functions with two arguments where 4 <= 5 can be written as `<=`(4,5) or "<="(4, 5), simply pass arguments elementwise and use get to reference the function by string:
sub_data <- function(col, op, val) {
df[get(op)(df[[col]], val),]
}
sub_dfs <- with(subsetConditions, Map(sub_data, column, operator, value))
Output
str(sub_dfs)
List of 8
$ RT:'data.frame': 62 obs. of 4 variables:
..$ x1: num [1:62] -1.12 -0.745 -1.377 0.848 1.63 ...
..$ x2: num [1:62] -0.257 -2.385 0.805 -0.313 0.662 ...
..$ RT: num [1:62] 0.693 1.662 0.731 2.145 0.543 ...
..$ LS: int [1:62] 5 5 1 2 9 1 5 9 3 10 ...
$ RT:'data.frame': 36 obs. of 4 variables:
..$ x1: num [1:36] -0.745 0.848 0.908 -0.761 0.74 ...
..$ x2: num [1:36] -2.3849 -0.3131 -2.4645 -0.0784 0.8512 ...
..$ RT: num [1:36] 1.66 2.15 1.74 1.65 1.13 ...
..$ LS: int [1:36] 5 2 1 5 9 10 2 7 1 3 ...
$ RT:'data.frame': 14 obs. of 4 variables:
..$ x1: num [1:14] -0.745 0.848 0.908 -0.761 -1.063 ...
..$ x2: num [1:14] -2.3849 -0.3131 -2.4645 -0.0784 -2.9886 ...
..$ RT: num [1:14] 1.66 2.15 1.74 1.65 2.63 ...
..$ LS: int [1:14] 5 2 1 5 5 6 9 4 8 4 ...
$ RT:'data.frame': 3 obs. of 4 variables:
..$ x1: num [1:3] 0.848 -1.063 0.197
..$ x2: num [1:3] -0.313 -2.989 0.709
..$ RT: num [1:3] 2.15 2.63 2.05
..$ LS: int [1:3] 2 5 6
$ LS:'data.frame': 92 obs. of 4 variables:
..$ x1: num [1:92] -1.12 -0.745 -1.377 0.848 0.612 ...
..$ x2: num [1:92] -0.257 -2.385 0.805 -0.313 0.958 ...
..$ RT: num [1:92] 0.693 1.662 0.731 2.145 0.489 ...
..$ LS: int [1:92] 5 5 1 2 1 9 1 5 9 3 ...
$ LS:'data.frame': 78 obs. of 4 variables:
..$ x1: num [1:78] -1.12 -0.745 -1.377 0.848 0.612 ...
..$ x2: num [1:78] -0.257 -2.385 0.805 -0.313 0.958 ...
..$ RT: num [1:78] 0.693 1.662 0.731 2.145 0.489 ...
..$ LS: int [1:78] 5 5 1 2 1 1 5 3 5 2 ...
$ LS:'data.frame': 75 obs. of 4 variables:
..$ x1: num [1:75] -1.12 -0.745 -1.377 0.848 0.612 ...
..$ x2: num [1:75] -0.257 -2.385 0.805 -0.313 0.958 ...
..$ RT: num [1:75] 0.693 1.662 0.731 2.145 0.489 ...
..$ LS: int [1:75] 5 5 1 2 1 1 5 3 5 2 ...
$ LS:'data.frame': 62 obs. of 4 variables:
..$ x1: num [1:62] -1.12 -0.745 -1.377 0.848 0.612 ...
..$ x2: num [1:62] -0.257 -2.385 0.805 -0.313 0.958 ...
..$ RT: num [1:62] 0.693 1.662 0.731 2.145 0.489 ...
..$ LS: int [1:62] 5 5 1 2 1 1 5 3 5 2 ...
You were actually pretty close with your function, but just needed to make an adjustment. So, with paste for each row, you need to collapse all 3 columns so that it is only 1 string rather than 3, then it can properly evaluate the expression.
subsetFun <- function(x){
subset(df, eval(parse(text = paste(x, collapse = ""))))
}
subsets <- apply(subsetConditions, 1, subsetFun)
Output
Then, it will return the 8 subsets.
str(subsets)
List of 8
$ :'data.frame': 67 obs. of 4 variables:
..$ x1: num [1:67] -1.208 0.606 -0.17 0.728 -0.424 ...
..$ x2: num [1:67] 0.4058 -0.3041 -0.3357 0.7904 -0.0264 ...
..$ RT: num [1:67] 1.972 0.883 0.598 0.633 1.517 ...
..$ LS: int [1:67] 8 9 2 10 8 5 3 4 7 2 ...
$ :'data.frame': 35 obs. of 4 variables:
..$ x1: num [1:35] -1.2083 -0.4241 -0.0906 0.9851 -0.8236 ...
..$ x2: num [1:35] 0.4058 -0.0264 1.0054 0.0653 1.4647 ...
..$ RT: num [1:35] 1.97 1.52 1.05 1.63 1.47 ...
..$ LS: int [1:35] 8 8 5 4 7 3 1 6 8 6 ...
$ :'data.frame': 16 obs. of 4 variables:
..$ x1: num [1:16] -1.208 -0.424 0.985 0.99 0.939 ...
..$ x2: num [1:16] 0.4058 -0.0264 0.0653 0.3486 -0.7562 ...
..$ RT: num [1:16] 1.97 1.52 1.63 1.85 1.8 ...
..$ LS: int [1:16] 8 8 4 6 10 2 6 6 3 9 ...
$ :'data.frame': 7 obs. of 4 variables:
..$ x1: num [1:7] 0.963 0.423 -0.444 0.279 0.417 ...
..$ x2: num [1:7] 0.6612 0.0354 0.0555 0.1253 -0.3056 ...
..$ RT: num [1:7] 2.71 2.15 2.05 2.01 2.07 ...
..$ LS: int [1:7] 2 6 9 9 7 7 4
$ :'data.frame': 91 obs. of 4 variables:
..$ x1: num [1:91] -0.952 -1.208 0.606 -0.17 -0.048 ...
..$ x2: num [1:91] -0.645 0.406 -0.304 -0.336 -0.897 ...
..$ RT: num [1:91] 0.471 1.972 0.883 0.598 0.224 ...
..$ LS: int [1:91] 6 8 9 2 1 8 4 5 3 4 ...
$ :'data.frame': 75 obs. of 4 variables:
..$ x1: num [1:75] -0.952 -1.208 -0.17 -0.048 -0.424 ...
..$ x2: num [1:75] -0.6448 0.4058 -0.3357 -0.8968 -0.0264 ...
..$ RT: num [1:75] 0.471 1.972 0.598 0.224 1.517 ...
..$ LS: int [1:75] 6 8 2 1 8 4 5 3 4 1 ...
$ :'data.frame': 65 obs. of 4 variables:
..$ x1: num [1:65] -0.9517 -0.1698 -0.048 0.2834 -0.0906 ...
..$ x2: num [1:65] -0.645 -0.336 -0.897 -2.072 1.005 ...
..$ RT: num [1:65] 0.471 0.598 0.224 0.486 1.053 ...
..$ LS: int [1:65] 6 2 1 4 5 3 4 1 7 4 ...
$ :'data.frame': 58 obs. of 4 variables:
..$ x1: num [1:58] -0.9517 -0.1698 -0.048 0.2834 -0.0906 ...
..$ x2: num [1:58] -0.645 -0.336 -0.897 -2.072 1.005 ...
..$ RT: num [1:58] 0.471 0.598 0.224 0.486 1.053 ...
..$ LS: int [1:58] 6 2 1 4 5 3 4 1 4 2 ...

Tuning XGBoost with TIdymodels - over 12 hours later still not finished

I have been running an XGBoost model in R on a high-powered machine (4Ghz, 16 cores, 32gb RAM) for over 12 hours and it's still not finished. I am not sure what's going wrong. I followed Julia Silge's blog to the tee. This is what my data looks like:
str(hts.facility.df)
tibble [24,422 x 47] (S3: tbl_df/tbl/data.frame)
$ patient_id : Factor w/ 24422 levels
$ datim_code : chr [1:24422]
$ sex : Factor w/ 2 levels "F","M": 2 1 1 1 1 1 1 1 2 1 ...
$ age : num [1:24422] 33 36 29 21 49 44 71 26 50 38 ...
$ age_group : Factor w/ 12 levels "< 1","1 - 4",..: 7 8 6 5 10 9 12 6 12 8 ...
$ referred_from : Factor w/ 3 levels "Self","TB","Other": 2 1 1 1 1 1 1 1 1 1 ...
$ marital_status : Factor w/ 4 levels "M","S","W","D": 1 1 2 1 1 1 3 2 1 2 ...
$ no_of_own_children_lessthan_5 : Factor w/ 2 levels "0","more_than_2_children": 2 1 1 1 1 1 1 1 1 1 ...
$ no_of_wives : Factor w/ 2 levels "0","more_than_2_wives": 2 1 1 1 1 1 1 1 1 1 ...
$ session_type : Factor w/ 2 levels "Couple","Individual": 2 2 2 2 2 2 2 2 2 2 ...
$ previously_tested_hiv_negative : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ client_pregnant : Factor w/ 2 levels "1","0": 2 1 1 1 1 1 1 1 2 1 ...
$ hts_test_result : Factor w/ 2 levels "Neg","Pos": 1 1 1 1 1 1 1 1 1 1 ...
$ hts_setting : Factor w/ 4 levels "CT","TB","Ward",..: 3 3 3 3 3 3 3 3 3 3 ...
$ tested_for_hiv_before_within_this_year: Factor w/ 2 levels "PreviouslyTestedNegative",..: 2 1 2 2 2 2 2 2 2 2 ...
$ is_surge_site : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ nga_agesex_f_15_2019 : num [1:24422] 0.0627 0.0627 0.0627 0.0627 0.0627 ...
$ nga_agesex_f_20_2019 : num [1:24422] 0.0581 0.0581 0.0581 0.0581 0.0581 ...
$ nga_agesex_f_25_2019 : num [1:24422] 0.0411 0.0411 0.0411 0.0411 0.0411 ...
$ nga_agesex_f_30_2019 : num [1:24422] 0.0314 0.0314 0.0314 0.0314 0.0314 ...
$ nga_agesex_f_35_2019 : num [1:24422] 0.0275 0.0275 0.0275 0.0275 0.0275 ...
$ nga_agesex_f_40_2019 : num [1:24422] 0.021 0.021 0.021 0.021 0.021 ...
$ nga_agesex_f_45_2019 : num [1:24422] 0.0166 0.0166 0.0166 0.0166 0.0166 ...
$ nga_agesex_m_15_2019 : num [1:24422] 0.0536 0.0536 0.0536 0.0536 0.0536 ...
$ nga_agesex_m_20_2019 : num [1:24422] 0.0632 0.0632 0.0632 0.0632 0.0632 ...
$ nga_agesex_m_25_2019 : num [1:24422] 0.0534 0.0534 0.0534 0.0534 0.0534 ...
$ nga_agesex_m_30_2019 : num [1:24422] 0.036 0.036 0.036 0.036 0.036 ...
$ nga_agesex_m_35_2019 : num [1:24422] 0.0325 0.0325 0.0325 0.0325 0.0325 ...
$ nga_agesex_m_40_2019 : num [1:24422] 0.0263 0.0263 0.0263 0.0263 0.0263 ...
$ nga_agesex_m_45_2019 : num [1:24422] 0.0236 0.0236 0.0236 0.0236 0.0236 ...
$ IHME_CONDOM_LAST_TIME_PREV_MEAN_2017 : num [1:24422] 14.1 14.1 14.1 14.1 14.1 ...
$ IHME_HAD_INTERCOURSE_PREV_MEAN_2017 : num [1:24422] 63.1 63.1 63.1 63.1 63.1 ...
$ IHME_HIV_COUNT_MEAN_2017 : num [1:24422] 0.0126 0.0126 0.0126 0.0126 0.0126 ...
$ IHME_IN_UNION_PREV_MEAN_2017 : num [1:24422] 56.9 56.9 56.9 56.9 56.9 ...
$ IHME_MALE_CIRCUMCISION_PREV_MEAN_2017 : num [1:24422] 98.7 98.7 98.7 98.7 98.7 ...
$ IHME_PARTNER_AWAY_PREV_MEAN_2017 : num [1:24422] 13.5 13.5 13.5 13.5 13.5 ...
$ IHME_PARTNERS_YEAR_MN_PREV_MEAN_2017 : num [1:24422] 13.5 13.5 13.5 13.5 13.5 ...
$ IHME_PARTNERS_YEAR_WN_PREV_MEAN_2017 : num [1:24422] 3.07 3.07 3.07 3.07 3.07 ...
$ IHME_STI_SYMPTOMS_PREV_MEAN_2017 : num [1:24422] 4.15 4.15 4.15 4.15 4.15 ...
$ wp_contraceptive : num [1:24422] 0.282 0.282 0.282 0.282 0.282 ...
$ wp_liveBirths : num [1:24422] 124 124 124 124 124 ...
$ wp_poverty : num [1:24422] 0.555 0.555 0.555 0.555 0.555 ...
$ wp_lit_men : num [1:24422] 0.967 0.967 0.967 0.967 0.967 ...
$ wp_lit_women : num [1:24422] 0.874 0.874 0.874 0.874 0.874 ...
$ wp_stunting_men : num [1:24422] 0.178 0.178 0.178 0.178 0.178 ...
$ wp_stunting_women : num [1:24422] 0.215 0.215 0.215 0.215 0.215 ...
$ road_density_km : num [1:24422] 82.3 82.3 82.3 82.3 82.3 ...
And this is the code I am running:
set.seed(4488)
hts.facility.df2 = hts.facility.df %>%
mutate(hts_test_result = as.factor(case_when(
hts_test_result == 'Pos' ~ 1,
hts_test_result == 'Neg' ~ 0
)))
# split data into training and test using hts test result column ----------------------
df.split = initial_split(hts.facility.df2, strata = hts_test_result) # default split if .75/.25
train.df = training(df.split)
test.df = testing(df.split)
# recipe for Random Forest model ------------------------------------------------
# use themis package for oversampling: https://github.com/tidymodels/themis
# for more info on SMOTE method for unbalanced data refer: https://jair.org/index.php/jair/article/view/10302/24590
hts_recipe = recipe(hts_test_result ~ ., data = train.df) %>%
# remove individual data - patient id and facility id and age since age-grouo is already in the dataset
step_rm(patient_id, datim_code, age) %>%
update_role(patient_id, new_role = "patient_ID") %>%
update_role(datim_code, new_role = "facility_id") %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
# # normalize numeric variables
step_normalize(all_predictors()) %>%
# downsample positive tests as they are 90% of the results -
themis::step_smote(hts_test_result, over_ratio = 1)
hts_tree_prep <- prep(hts_recipe)
# create the data frame
hts_juiced <- juice(hts_tree_prep)
xgb_spec <- boost_tree(
trees = 500,
tree_depth = tune(), min_n = tune(),
loss_reduction = tune(), ## first three: model complexity
sample_size = tune(), mtry = tune(), ## randomness
learn_rate = tune(), ## step size
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# set up grid for tuning values -------------------
xgb_grid <- grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), train.df),
learn_rate(),
size = 20
)
xgb_grid
# set up workflow ---------------------------------------------------------
xgb_wf <- workflow() %>%
add_formula(hts_test_result ~ .) %>%
add_model(xgb_spec)
set.seed(123)
vb_folds <- vfold_cv(train.df, strata = hts_test_result)
vb_folds
# tune the model ----------------------------------------------------------
doParallel::registerDoParallel()
set.seed(234)
xgb_res <- tune::tune_grid(
xgb_wf,
resamples = vb_folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE)
)
This is where it's been stuck for the last 12 hours. My dataset is so small, why is it taking so long?

How to normalize all variables in an R dataframe (except for the one variable that's a factor)

I'm having difficulty applying the max-min normalize function to the predictor variables (30 of them) in my data frame without excluding the diagnosis variable (as it is a factor and not subject to the function) from the data frame.
```{r}
cancer_data <- as.data.frame(lapply(cancer_data, normalize))
```
This won't run bc it will prompt an error message referencing the factor column, but I don't want the new data frame to be created without that column. I would just like to apply the normalize function I created to the 30 predictor variables.
Here is the structure of my data frame if it provides helpful context at all:
str(cancer_data)
## 'data.frame': 569 obs. of 31 variables:
## $ diagnosis : Factor w/ 2 levels "Benign","Malignant": 1 1 1 1 1 1 1 2 1 1 ...
## $ radius_mean : num 12.3 10.6 11 11.3 15.2 ...
## $ texture_mean : num 12.4 18.9 16.8 13.4 13.2 ...
## $ perimeter_mean : num 78.8 69.3 70.9 73 97.7 ...
## $ area_mean : num 464 346 373 385 712 ...
## $ smoothness_mean : num 0.1028 0.0969 0.1077 0.1164 0.0796 ...
## $ compactness_mean : num 0.0698 0.1147 0.078 0.1136 0.0693 ...
## $ concavity_mean : num 0.0399 0.0639 0.0305 0.0464 0.0339 ...
## $ points_mean : num 0.037 0.0264 0.0248 0.048 0.0266 ...
## $ symmetry_mean : num 0.196 0.192 0.171 0.177 0.172 ...
## $ dimension_mean : num 0.0595 0.0649 0.0634 0.0607 0.0554 ...
## $ radius_se : num 0.236 0.451 0.197 0.338 0.178 ...
## $ texture_se : num 0.666 1.197 1.387 1.343 0.412 ...
## $ perimeter_se : num 1.67 3.43 1.34 1.85 1.34 ...
## $ area_se : num 17.4 27.1 13.5 26.3 17.7 ...
## $ smoothness_se : num 0.00805 0.00747 0.00516 0.01127 0.00501 ...
## $ compactness_se : num 0.0118 0.03581 0.00936 0.03498 0.01485 ...
## $ concavity_se : num 0.0168 0.0335 0.0106 0.0219 0.0155 ...
## $ points_se : num 0.01241 0.01365 0.00748 0.01965 0.00915 ...
## $ symmetry_se : num 0.0192 0.035 0.0172 0.0158 0.0165 ...
## $ dimension_se : num 0.00225 0.00332 0.0022 0.00344 0.00177 ...
## $ radius_worst : num 13.5 11.9 12.4 11.9 16.2 ...
## $ texture_worst : num 15.6 22.9 26.4 15.8 15.7 ...
## $ perimeter_worst : num 87 78.3 79.9 76.5 104.5 ...
## $ area_worst : num 549 425 471 434 819 ...
## $ smoothness_worst : num 0.139 0.121 0.137 0.137 0.113 ...
## $ compactness_worst: num 0.127 0.252 0.148 0.182 0.174 ...
## $ concavity_worst : num 0.1242 0.1916 0.1067 0.0867 0.1362 ...
## $ points_worst : num 0.0939 0.0793 0.0743 0.0861 0.0818 ...
## $ symmetry_worst : num 0.283 0.294 0.3 0.21 0.249 ...
## $ dimension_worst : num 0.0677 0.0759 0.0788 0.0678 0.0677 ...
Assuming you already have normalize function in your environment. You can get the numeric variables in your data and apply the function to selected columns using lapply.
cols <- sapply(cancer_data, is.numeric)
cancer_data[cols] <- lapply(cancer_data[cols], normalize)
Or without creating cols.
cancer_data[] <- lapply(cancer_data, function(x)
if(is.numeric(x)) normalize(x) else x)
If you want to exclude only 1st column, you can also use :
cancer_data[-1] <- lapply(cancer_data[-1], normalize)
This should work, but do look into tidymodels
Thanks to akrun for the new shorter answer.
library(tidyverse)
cancer_data <-cancer_data %>% mutate_if(negate(is.factor), normalize)

Merge 2 list of lists in R [duplicate]

This question already has answers here:
Combining elements of list of lists by index
(3 answers)
in r combine a list of lists into one list
(3 answers)
Closed 4 years ago.
I have 2 list of lists in R with the same list name as follow :
str(total_delta_final[[1]])
List of 4
$ sector1_T02 :'data.frame': 24 obs. of 3 variables:
..$ DeltaF_1: num [1:24] 0.737 0.737 0.693 0.738 0.738 ...
..$ DeltaF_2: num [1:24] 0.24 0.24 0.279 0.239 0.239 ...
..$ DeltaF_3: num [1:24] 0.0233 0.0233 0.0275 0.0232 0.0232 ...
$ sector2_T03 :'data.frame': 24 obs. of 3 variables:
..$ DeltaF_1: num [1:24] 0.582 0.582 0.568 0.69 0.69 ...
..$ DeltaF_2: num [1:24] 0.377 0.377 0.39 0.282 0.282 ...
..$ DeltaF_3: num [1:24] 0.0406 0.0406 0.0426 0.0278 0.0278 ...
$ sector3_T03 :'data.frame': 24 obs. of 3 variables:
..$ DeltaF_1: num [1:24] 0.607 0.607 0.495 0.409 0.375 ...
..$ DeltaF_2: num [1:24] 0.356 0.356 0.451 0.519 0.544 ...
..$ DeltaF_3: num [1:24] 0.0373 0.0373 0.0541 0.072 0.0809 ...
$ sector12_T02:'data.frame': 24 obs. of 3 variables:
..$ DeltaF_1: num [1:24] 0.743 0.743 0.758 0.689 0.705 ...
..$ DeltaF_2: num [1:24] 0.234 0.234 0.22 0.283 0.269 ...
..$ DeltaF_3: num [1:24] 0.0226 0.0226 0.0213 0.028 0.0263 ...
> str(total_TI_final[[1]])
List of 4
$ sector1_T02 :'data.frame': 24 obs. of 3 variables:
..$ I_1: num [1:24] NA 0.0756 0.083 0.0799 0.0799 ...
..$ I_2: num [1:24] 0.122 NA 0.163 0.172 0.172 ...
..$ I_3: num [1:24] 0.212 0.211 NA 0.266 0.273 ...
$ sector2_T03 :'data.frame': 24 obs. of 3 variables:
..$ I_1: num [1:24] NA 0.0986 0.1013 0.1011 0.101 ...
..$ I_2: num [1:24] 0.15 NA 0.184 0.211 0.211 ...
..$ I_3: num [1:24] 0.249 0.249 NA 0.331 0.337 ...
$ sector3_T03 :'data.frame': 24 obs. of 3 variables:
..$ I_1: num [1:24] NA 0.119 0.115 0.113 0.105 ...
..$ I_2: num [1:24] 0.193 NA 0.2 0.193 0.177 ...
..$ I_3: num [1:24] 0.323 0.323 NA 0.277 0.256 ...
$ sector12_T02:'data.frame': 24 obs. of 3 variables:
..$ I_1: num [1:24] NA 0.0825 0.0681 0.0723 0.0706 ...
..$ I_2: num [1:24] 0.138 NA 0.146 0.145 0.144 ...
..$ I_3: num [1:24] 0.24 0.24 NA 0.22 0.226 ...
How could I merge these 2 list of lists in a way that my final output looks like total_TI_final[[1]][1] and the second list total_delta_final[[1]][1] then total_TI_final[[1]][2] and total_delta_final[[1]][2] and so on ...
We can use Map
Map(c, total_delta_final, total_TI_final)

transform all columns that are NAs from numeric to factors

I have a data.table called td.br.2, in which some columns are completely NAs. These columns are of type numeric. What I would like to do, is only for these columns to transform them to factors.
I have tried the following, but it does not work ( I do not get an error but it does not do the job either)
td.br.2[] <- td.br.2[,lapply(.SD, function(x) {ifelse(sum(is.na(x)==nrow(td.br.2)),as.factor(x),x)})]
n=10#nr of rows
m=10#nr of cols
N<-n*m
m1<-matrix(runif(N),nrow=n,ncol = m)
dt<-data.table(m1)
names(dt)<-letters[1:m]
dt<-cbind(dt,xxx=rep(NA,nrow(dt)))#adding NA column
At this point
str(dt)
Classes ‘data.table’ and 'data.frame': 10 obs. of 11 variables:
$ a : num 0.661 0.864 0.152 0.342 0.989 ...
$ b : num 0.06036 0.67587 0.00847 0.37674 0.30417 ...
$ c : num 0.3938 0.6274 0.0514 0.882 0.1568 ...
$ d : num 0.777 0.233 0.619 0.117 0.132 ...
$ e : num 0.655 0.926 0.277 0.598 0.237 ...
$ f : num 0.649 0.197 0.547 0.585 0.685 ...
$ g : num 0.6877 0.3676 0.009 0.6975 0.0327 ...
$ h : num 0.519 0.705 0.457 0.465 0.966 ...
$ i : num 0.43777 0.00961 0.30224 0.58172 0.37621 ...
$ j : num 0.44 0.481 0.485 0.125 0.263 ...
$ xxx: logi NA NA NA NA NA NA ...
So by executing:
dt<-dt[, lapply(.SD, function(x){ if(all(is.na(x)))as.factor(as.character(x)) else x}),]
yields:
str(dt)
Classes ‘data.table’ and 'data.frame': 10 obs. of 11 variables:
$ a : num 0.0903 0.0448 0.5956 0.418 0.1316 ...
$ b : num 0.672 0.582 0.687 0.113 0.371 ...
$ c : num 0.404 0.16 0.848 0.863 0.737 ...
$ d : num 0.073 0.129 0.243 0.334 0.285 ...
$ e : num 0.485 0.186 0.539 0.486 0.784 ...
$ f : num 0.4685 0.4815 0.585 0.3596 0.0764 ...
$ g : num 0.958 0.194 0.549 0.71 0.737 ...
$ h : num 0.168 0.355 0.552 0.765 0.605 ...
$ i : num 0.665 0.88 0.23 0.575 0.413 ...
$ j : num 0.1113 0.8797 0.1244 0.0741 0.8724 ...
$ xxx: Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA
I am not sure why you would want to do that, but here you are:
naColumns <- sapply(td.br.2, function(x) { all(is.na(x)) })
for (col in which(naColumns))
set(td.br.2, j=col, value=as.factor(x[[col]]))
The factors will have no levels, but you can deal with that as necessary.
(The for loop is partly based on this.)

Resources