I want to add a column to each of the data frames in my list table after I run this code:
#list of my dataframes
df <- list(df1,df2,df3,df4)
#compute stats
stats <- function(d) do.call(rbind, lapply(split(d, d[,2]), function(x) data.frame(Nb= length(x$Year), Mean=mean(x$A), SD=sd(x$A) )))
#Apply to list of dataframes
table <- lapply(df, stats)
This column, which I call Source for example, should hold the name of the originating data frame, alongside the Nb, Mean and SD variables. So Source should contain df1, df1, df1, ... for my table[1], and so on.
Is there any way I can add it in my code above?
Here's a different way of doing things:
First, let's start with some reproducible data:
set.seed(1)
n = 10
dat <- list(data.frame(a=rnorm(n), b=sample(1:3,n,TRUE)),
            data.frame(a=rnorm(n), b=sample(1:3,n,TRUE)),
            data.frame(a=rnorm(n), b=sample(1:3,n,TRUE)),
            data.frame(a=rnorm(n), b=sample(1:3,n,TRUE)))
Then, you want a function that adds columns to a data.frame. The obvious candidate is within. The particular things you want to calculate are constant values for each observation within a particular category. To do that, use ave for each of the columns you want to add. Here's your new function:
stat <- function(d){
  within(d, {
    Nb = ave(a, b, FUN=length)
    Mean = ave(a, b, FUN=mean)
    SD = ave(a, b, FUN=sd)
  })
}
Then just lapply it to your list of data.frames:
lapply(dat, stat)
As you can see, columns are added as appropriate:
> str(lapply(dat, stat))
List of 4
$ :'data.frame': 10 obs. of 5 variables:
..$ a : num [1:10] -0.626 0.184 -0.836 1.595 0.33 ...
..$ b : int [1:10] 3 1 2 1 1 2 1 2 3 2
..$ SD : num [1:10] 0.85 0.643 0.738 0.643 0.643 ...
..$ Mean: num [1:10] -0.0253 0.649 -0.3058 0.649 0.649 ...
..$ Nb : num [1:10] 2 4 4 4 4 4 4 4 2 4
$ :'data.frame': 10 obs. of 5 variables:
..$ a : num [1:10] -0.0449 -0.0162 0.9438 0.8212 0.5939 ...
..$ b : int [1:10] 2 3 2 1 1 1 1 2 2 2
..$ SD : num [1:10] 1.141 NA 1.141 0.136 0.136 ...
..$ Mean: num [1:10] -0.0792 -0.0162 -0.0792 0.7791 0.7791 ...
..$ Nb : num [1:10] 5 1 5 4 4 4 4 5 5 5
$ :'data.frame': 10 obs. of 5 variables:
..$ a : num [1:10] 1.3587 -0.1028 0.3877 -0.0538 -1.3771 ...
..$ b : int [1:10] 2 3 2 1 3 1 3 1 1 1
..$ SD : num [1:10] 0.687 0.668 0.687 0.635 0.668 ...
..$ Mean: num [1:10] 0.873 -0.625 0.873 0.267 -0.625 ...
..$ Nb : num [1:10] 2 3 2 5 3 5 3 5 5 5
$ :'data.frame': 10 obs. of 5 variables:
..$ a : num [1:10] -0.707 0.365 0.769 -0.112 0.881 ...
..$ b : int [1:10] 3 3 2 2 1 1 3 1 2 2
..$ SD : num [1:10] 0.593 0.593 1.111 1.111 0.297 ...
..$ Mean: num [1:10] -0.318 -0.318 0.24 0.24 0.54 ...
..$ Nb : num [1:10] 3 3 4 4 3 3 3 3 4 4
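The question also asked for a Source column holding the name of each data frame, which stat() does not add. A minimal sketch of one way to do it, assuming the list elements are named (the names df1 and df2 follow the question; add_source is a made-up helper, not part of the answer above):

```r
set.seed(1)
n <- 10
# Named list, so each data frame carries the label we want in Source
dat <- list(
  df1 = data.frame(a = rnorm(n), b = sample(1:3, n, TRUE)),
  df2 = data.frame(a = rnorm(n), b = sample(1:3, n, TRUE))
)

# Same per-group statistics as stat() above
stat <- function(d) {
  within(d, {
    Nb   <- ave(a, b, FUN = length)
    Mean <- ave(a, b, FUN = mean)
    SD   <- ave(a, b, FUN = sd)
  })
}

# Hypothetical helper: tag every row with the name of its data frame
add_source <- function(d, nm) {
  d$Source <- nm
  d
}

# Map() walks the data frames and their names in parallel
res <- Map(add_source, lapply(dat, stat), names(dat))
unique(res$df1$Source)
```

If the list is built without names, as in list(df1, df2, df3, df4), the names have to be set first, for example with names(df) <- c("df1", "df2", "df3", "df4").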
I have a dataset and would like to take a lot of subsets based on various columns, values, and conditional operators. I think the most desirable output is a list containing all of these subsetted data frames as separate elements. I attempted to do this by building a data frame that contains the subset conditions I would like to use, building a function, then using apply to feed that data frame to the function, but that didn't work. I'm sure there's probably a better method that uses an anonymous function or something like that, but I'm not sure how I would implement it. Below is example code that should produce 8 subsets of data.
Original dataset, where x1 and x2 are scores on items that won't be used for subsetting, and RT and LS are the variables to subset on:
df <- data.frame(x1 = rnorm(100),
                 x2 = rnorm(100),
                 RT = abs(rnorm(100)),
                 LS = sample(1:10, 100, replace = TRUE))
Data frame containing the conditions for subsetting. E.g., the first subset of data should be any observations with values greater than or equal to 0.5 in the RT column, the second subset should be any observations with values greater than or equal to 1 in the RT column, etc. There should be 8 subsets, 4 done on the RT variable and 4 done on the LS variable.
subsetConditions <- data.frame(column = rep(c("RT", "LS"), each = 4),
operator = rep(c(">=", "<="), each = 4),
value = c(0.5, 1, 1.5, 2,
9, 8, 7, 6))
And this is the ugly function I wrote to attempt to do this:
subsetFun <- function(x){
subset(df, eval(parse(text = paste(x))))
}
subsets <- apply(subsetConditions, 1, subsetFun)
Thanks for any help!
Consider Map (wrapper to mapply) without any eval + parse. Since ==, <=, >=, and other operators can be used as functions with two arguments where 4 <= 5 can be written as `<=`(4,5) or "<="(4, 5), simply pass arguments elementwise and use get to reference the function by string:
sub_data <- function(col, op, val) {
df[get(op)(df[[col]], val),]
}
sub_dfs <- with(subsetConditions, Map(sub_data, column, operator, value))
Output
str(sub_dfs)
List of 8
$ RT:'data.frame': 62 obs. of 4 variables:
..$ x1: num [1:62] -1.12 -0.745 -1.377 0.848 1.63 ...
..$ x2: num [1:62] -0.257 -2.385 0.805 -0.313 0.662 ...
..$ RT: num [1:62] 0.693 1.662 0.731 2.145 0.543 ...
..$ LS: int [1:62] 5 5 1 2 9 1 5 9 3 10 ...
$ RT:'data.frame': 36 obs. of 4 variables:
..$ x1: num [1:36] -0.745 0.848 0.908 -0.761 0.74 ...
..$ x2: num [1:36] -2.3849 -0.3131 -2.4645 -0.0784 0.8512 ...
..$ RT: num [1:36] 1.66 2.15 1.74 1.65 1.13 ...
..$ LS: int [1:36] 5 2 1 5 9 10 2 7 1 3 ...
$ RT:'data.frame': 14 obs. of 4 variables:
..$ x1: num [1:14] -0.745 0.848 0.908 -0.761 -1.063 ...
..$ x2: num [1:14] -2.3849 -0.3131 -2.4645 -0.0784 -2.9886 ...
..$ RT: num [1:14] 1.66 2.15 1.74 1.65 2.63 ...
..$ LS: int [1:14] 5 2 1 5 5 6 9 4 8 4 ...
$ RT:'data.frame': 3 obs. of 4 variables:
..$ x1: num [1:3] 0.848 -1.063 0.197
..$ x2: num [1:3] -0.313 -2.989 0.709
..$ RT: num [1:3] 2.15 2.63 2.05
..$ LS: int [1:3] 2 5 6
$ LS:'data.frame': 92 obs. of 4 variables:
..$ x1: num [1:92] -1.12 -0.745 -1.377 0.848 0.612 ...
..$ x2: num [1:92] -0.257 -2.385 0.805 -0.313 0.958 ...
..$ RT: num [1:92] 0.693 1.662 0.731 2.145 0.489 ...
..$ LS: int [1:92] 5 5 1 2 1 9 1 5 9 3 ...
$ LS:'data.frame': 78 obs. of 4 variables:
..$ x1: num [1:78] -1.12 -0.745 -1.377 0.848 0.612 ...
..$ x2: num [1:78] -0.257 -2.385 0.805 -0.313 0.958 ...
..$ RT: num [1:78] 0.693 1.662 0.731 2.145 0.489 ...
..$ LS: int [1:78] 5 5 1 2 1 1 5 3 5 2 ...
$ LS:'data.frame': 75 obs. of 4 variables:
..$ x1: num [1:75] -1.12 -0.745 -1.377 0.848 0.612 ...
..$ x2: num [1:75] -0.257 -2.385 0.805 -0.313 0.958 ...
..$ RT: num [1:75] 0.693 1.662 0.731 2.145 0.489 ...
..$ LS: int [1:75] 5 5 1 2 1 1 5 3 5 2 ...
$ LS:'data.frame': 62 obs. of 4 variables:
..$ x1: num [1:62] -1.12 -0.745 -1.377 0.848 0.612 ...
..$ x2: num [1:62] -0.257 -2.385 0.805 -0.313 0.958 ...
..$ RT: num [1:62] 0.693 1.662 0.731 2.145 0.489 ...
..$ LS: int [1:62] 5 5 1 2 1 1 5 3 5 2 ...
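The operator-as-function idea behind the Map answer can be checked in isolation. The sketch below uses toy data rather than the data from the question, and it looks the operator up with match.fun instead of get; match.fun is a base R function that only ever returns functions, which makes it a slightly stricter choice for this job.

```r
# Comparison operators are ordinary two-argument functions
is_le <- `<=`(4, 5)              # same as 4 <= 5

# Look the function up from its name given as a string
ge <- match.fun(">=")
flags <- ge(c(0.2, 0.7, 1.3), 0.5)

# Applied to a small data frame (toy data)
toy <- data.frame(RT = c(0.2, 0.7, 1.3), LS = c(9L, 3L, 6L))
kept <- toy[match.fun("<=")(toy[["LS"]], 6), ]
kept
```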
You were actually pretty close with your function; it just needed one adjustment. Called on a row, paste(x) returns the three values as three separate strings, so you need to collapse them into a single string before the expression can be parsed and evaluated properly.
subsetFun <- function(x){
subset(df, eval(parse(text = paste(x, collapse = ""))))
}
subsets <- apply(subsetConditions, 1, subsetFun)
Output
Then, it will return the 8 subsets.
str(subsets)
List of 8
$ :'data.frame': 67 obs. of 4 variables:
..$ x1: num [1:67] -1.208 0.606 -0.17 0.728 -0.424 ...
..$ x2: num [1:67] 0.4058 -0.3041 -0.3357 0.7904 -0.0264 ...
..$ RT: num [1:67] 1.972 0.883 0.598 0.633 1.517 ...
..$ LS: int [1:67] 8 9 2 10 8 5 3 4 7 2 ...
$ :'data.frame': 35 obs. of 4 variables:
..$ x1: num [1:35] -1.2083 -0.4241 -0.0906 0.9851 -0.8236 ...
..$ x2: num [1:35] 0.4058 -0.0264 1.0054 0.0653 1.4647 ...
..$ RT: num [1:35] 1.97 1.52 1.05 1.63 1.47 ...
..$ LS: int [1:35] 8 8 5 4 7 3 1 6 8 6 ...
$ :'data.frame': 16 obs. of 4 variables:
..$ x1: num [1:16] -1.208 -0.424 0.985 0.99 0.939 ...
..$ x2: num [1:16] 0.4058 -0.0264 0.0653 0.3486 -0.7562 ...
..$ RT: num [1:16] 1.97 1.52 1.63 1.85 1.8 ...
..$ LS: int [1:16] 8 8 4 6 10 2 6 6 3 9 ...
$ :'data.frame': 7 obs. of 4 variables:
..$ x1: num [1:7] 0.963 0.423 -0.444 0.279 0.417 ...
..$ x2: num [1:7] 0.6612 0.0354 0.0555 0.1253 -0.3056 ...
..$ RT: num [1:7] 2.71 2.15 2.05 2.01 2.07 ...
..$ LS: int [1:7] 2 6 9 9 7 7 4
$ :'data.frame': 91 obs. of 4 variables:
..$ x1: num [1:91] -0.952 -1.208 0.606 -0.17 -0.048 ...
..$ x2: num [1:91] -0.645 0.406 -0.304 -0.336 -0.897 ...
..$ RT: num [1:91] 0.471 1.972 0.883 0.598 0.224 ...
..$ LS: int [1:91] 6 8 9 2 1 8 4 5 3 4 ...
$ :'data.frame': 75 obs. of 4 variables:
..$ x1: num [1:75] -0.952 -1.208 -0.17 -0.048 -0.424 ...
..$ x2: num [1:75] -0.6448 0.4058 -0.3357 -0.8968 -0.0264 ...
..$ RT: num [1:75] 0.471 1.972 0.598 0.224 1.517 ...
..$ LS: int [1:75] 6 8 2 1 8 4 5 3 4 1 ...
$ :'data.frame': 65 obs. of 4 variables:
..$ x1: num [1:65] -0.9517 -0.1698 -0.048 0.2834 -0.0906 ...
..$ x2: num [1:65] -0.645 -0.336 -0.897 -2.072 1.005 ...
..$ RT: num [1:65] 0.471 0.598 0.224 0.486 1.053 ...
..$ LS: int [1:65] 6 2 1 4 5 3 4 1 7 4 ...
$ :'data.frame': 58 obs. of 4 variables:
..$ x1: num [1:58] -0.9517 -0.1698 -0.048 0.2834 -0.0906 ...
..$ x2: num [1:58] -0.645 -0.336 -0.897 -2.072 1.005 ...
..$ RT: num [1:58] 0.471 0.598 0.224 0.486 1.053 ...
..$ LS: int [1:58] 6 2 1 4 5 3 4 1 4 2 ...
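To see why collapse matters, here is a small sketch with one hand-written row of conditions and a toy data frame (both made up for illustration):

```r
# One row of conditions, as apply() hands it to the function: a character vector
x <- c(column = "RT", operator = ">=", value = "0.5")

pieces <- paste(x)                  # still three separate strings
expr <- paste(x, collapse = "")     # one string that parse() can read

# Toy data frame standing in for df
toy <- data.frame(RT = c(0.2, 0.7, 1.3))
hits <- subset(toy, eval(parse(text = expr)))
hits
```

Passing the uncollapsed vector to parse() fails, because the second element (the bare operator) is not a valid expression on its own.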
I have been running an XGBoost model in R on a high-powered machine (4 GHz, 16 cores, 32 GB RAM) for over 12 hours and it's still not finished. I am not sure what's going wrong. I followed Julia Silge's blog to a T. This is what my data looks like:
str(hts.facility.df)
tibble [24,422 x 47] (S3: tbl_df/tbl/data.frame)
$ patient_id : Factor w/ 24422 levels
$ datim_code : chr [1:24422]
$ sex : Factor w/ 2 levels "F","M": 2 1 1 1 1 1 1 1 2 1 ...
$ age : num [1:24422] 33 36 29 21 49 44 71 26 50 38 ...
$ age_group : Factor w/ 12 levels "< 1","1 - 4",..: 7 8 6 5 10 9 12 6 12 8 ...
$ referred_from : Factor w/ 3 levels "Self","TB","Other": 2 1 1 1 1 1 1 1 1 1 ...
$ marital_status : Factor w/ 4 levels "M","S","W","D": 1 1 2 1 1 1 3 2 1 2 ...
$ no_of_own_children_lessthan_5 : Factor w/ 2 levels "0","more_than_2_children": 2 1 1 1 1 1 1 1 1 1 ...
$ no_of_wives : Factor w/ 2 levels "0","more_than_2_wives": 2 1 1 1 1 1 1 1 1 1 ...
$ session_type : Factor w/ 2 levels "Couple","Individual": 2 2 2 2 2 2 2 2 2 2 ...
$ previously_tested_hiv_negative : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ client_pregnant : Factor w/ 2 levels "1","0": 2 1 1 1 1 1 1 1 2 1 ...
$ hts_test_result : Factor w/ 2 levels "Neg","Pos": 1 1 1 1 1 1 1 1 1 1 ...
$ hts_setting : Factor w/ 4 levels "CT","TB","Ward",..: 3 3 3 3 3 3 3 3 3 3 ...
$ tested_for_hiv_before_within_this_year: Factor w/ 2 levels "PreviouslyTestedNegative",..: 2 1 2 2 2 2 2 2 2 2 ...
$ is_surge_site : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ nga_agesex_f_15_2019 : num [1:24422] 0.0627 0.0627 0.0627 0.0627 0.0627 ...
$ nga_agesex_f_20_2019 : num [1:24422] 0.0581 0.0581 0.0581 0.0581 0.0581 ...
$ nga_agesex_f_25_2019 : num [1:24422] 0.0411 0.0411 0.0411 0.0411 0.0411 ...
$ nga_agesex_f_30_2019 : num [1:24422] 0.0314 0.0314 0.0314 0.0314 0.0314 ...
$ nga_agesex_f_35_2019 : num [1:24422] 0.0275 0.0275 0.0275 0.0275 0.0275 ...
$ nga_agesex_f_40_2019 : num [1:24422] 0.021 0.021 0.021 0.021 0.021 ...
$ nga_agesex_f_45_2019 : num [1:24422] 0.0166 0.0166 0.0166 0.0166 0.0166 ...
$ nga_agesex_m_15_2019 : num [1:24422] 0.0536 0.0536 0.0536 0.0536 0.0536 ...
$ nga_agesex_m_20_2019 : num [1:24422] 0.0632 0.0632 0.0632 0.0632 0.0632 ...
$ nga_agesex_m_25_2019 : num [1:24422] 0.0534 0.0534 0.0534 0.0534 0.0534 ...
$ nga_agesex_m_30_2019 : num [1:24422] 0.036 0.036 0.036 0.036 0.036 ...
$ nga_agesex_m_35_2019 : num [1:24422] 0.0325 0.0325 0.0325 0.0325 0.0325 ...
$ nga_agesex_m_40_2019 : num [1:24422] 0.0263 0.0263 0.0263 0.0263 0.0263 ...
$ nga_agesex_m_45_2019 : num [1:24422] 0.0236 0.0236 0.0236 0.0236 0.0236 ...
$ IHME_CONDOM_LAST_TIME_PREV_MEAN_2017 : num [1:24422] 14.1 14.1 14.1 14.1 14.1 ...
$ IHME_HAD_INTERCOURSE_PREV_MEAN_2017 : num [1:24422] 63.1 63.1 63.1 63.1 63.1 ...
$ IHME_HIV_COUNT_MEAN_2017 : num [1:24422] 0.0126 0.0126 0.0126 0.0126 0.0126 ...
$ IHME_IN_UNION_PREV_MEAN_2017 : num [1:24422] 56.9 56.9 56.9 56.9 56.9 ...
$ IHME_MALE_CIRCUMCISION_PREV_MEAN_2017 : num [1:24422] 98.7 98.7 98.7 98.7 98.7 ...
$ IHME_PARTNER_AWAY_PREV_MEAN_2017 : num [1:24422] 13.5 13.5 13.5 13.5 13.5 ...
$ IHME_PARTNERS_YEAR_MN_PREV_MEAN_2017 : num [1:24422] 13.5 13.5 13.5 13.5 13.5 ...
$ IHME_PARTNERS_YEAR_WN_PREV_MEAN_2017 : num [1:24422] 3.07 3.07 3.07 3.07 3.07 ...
$ IHME_STI_SYMPTOMS_PREV_MEAN_2017 : num [1:24422] 4.15 4.15 4.15 4.15 4.15 ...
$ wp_contraceptive : num [1:24422] 0.282 0.282 0.282 0.282 0.282 ...
$ wp_liveBirths : num [1:24422] 124 124 124 124 124 ...
$ wp_poverty : num [1:24422] 0.555 0.555 0.555 0.555 0.555 ...
$ wp_lit_men : num [1:24422] 0.967 0.967 0.967 0.967 0.967 ...
$ wp_lit_women : num [1:24422] 0.874 0.874 0.874 0.874 0.874 ...
$ wp_stunting_men : num [1:24422] 0.178 0.178 0.178 0.178 0.178 ...
$ wp_stunting_women : num [1:24422] 0.215 0.215 0.215 0.215 0.215 ...
$ road_density_km : num [1:24422] 82.3 82.3 82.3 82.3 82.3 ...
And this is the code I am running:
set.seed(4488)
hts.facility.df2 = hts.facility.df %>%
mutate(hts_test_result = as.factor(case_when(
hts_test_result == 'Pos' ~ 1,
hts_test_result == 'Neg' ~ 0
)))
# split data into training and test using hts test result column ----------------------
df.split = initial_split(hts.facility.df2, strata = hts_test_result) # default split is .75/.25
train.df = training(df.split)
test.df = testing(df.split)
# recipe for Random Forest model ------------------------------------------------
# use themis package for oversampling: https://github.com/tidymodels/themis
# for more info on SMOTE method for unbalanced data refer: https://jair.org/index.php/jair/article/view/10302/24590
hts_recipe = recipe(hts_test_result ~ ., data = train.df) %>%
# remove individual data - patient id and facility id and age since age group is already in the dataset
step_rm(patient_id, datim_code, age) %>%
update_role(patient_id, new_role = "patient_ID") %>%
update_role(datim_code, new_role = "facility_id") %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
# # normalize numeric variables
step_normalize(all_predictors()) %>%
# downsample positive tests as they are 90% of the results -
themis::step_smote(hts_test_result, over_ratio = 1)
hts_tree_prep <- prep(hts_recipe)
# create the data frame
hts_juiced <- juice(hts_tree_prep)
xgb_spec <- boost_tree(
trees = 500,
tree_depth = tune(), min_n = tune(),
loss_reduction = tune(), ## first three: model complexity
sample_size = tune(), mtry = tune(), ## randomness
learn_rate = tune(), ## step size
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# set up grid for tuning values -------------------
xgb_grid <- grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), train.df),
learn_rate(),
size = 20
)
xgb_grid
# set up workflow ---------------------------------------------------------
xgb_wf <- workflow() %>%
add_formula(hts_test_result ~ .) %>%
add_model(xgb_spec)
set.seed(123)
vb_folds <- vfold_cv(train.df, strata = hts_test_result)
vb_folds
# tune the model ----------------------------------------------------------
doParallel::registerDoParallel()
set.seed(234)
xgb_res <- tune::tune_grid(
xgb_wf,
resamples = vb_folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE)
)
This is where it's been stuck for the last 12 hours. My dataset is so small, why is it taking so long?
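One thing worth checking, offered as a guess rather than a diagnosis: the workflow above is built with add_formula(hts_test_result ~ .) and tuned on folds of train.df, so the recipe and its step_rm() never reach the tuner, and patient_id (a factor with 24,422 levels) stays in as a predictor. If the formula interface expands factors into indicator columns the way base R's model.matrix() does, that one column alone becomes tens of thousands of predictors, refitted for each of the 20 grid candidates on each of the 10 default folds. The base R sketch below uses made-up data and only illustrates the expansion; it does not reproduce the tidymodels internals.

```r
# Toy data: an ID-like factor with one level per row, plus one real predictor
toy <- data.frame(
  id = factor(sprintf("p%03d", 1:200)),
  x  = rnorm(200),
  y  = factor(rep(c("Neg", "Pos"), 100))
)

# Formula with every column: the ID factor turns into 199 indicator columns
wide <- model.matrix(y ~ ., data = toy)
dim(wide)

# Dropping the ID column leaves only the intercept and x
narrow <- model.matrix(y ~ . - id, data = toy)
dim(narrow)
```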
I would like to create several datasets via a for loop.
Basically, I want to create 29 datasets: the 1st one should hold the 44th and 45th columns of DF, the 2nd one the 46th and 47th columns, and so on.
I tried it like this, with no results.
data. <- data.frame(matrix( nrow=1442, ncol=2))
for (i in 1:29){
assign(paste("data",i, sep="_"), data.)
data_[i][,1] <- DF[,c(43+i)]
data_[i][,2] <- DF[,c(44+i)]
}
Can you help me please?
Like this?
data <- list()
DF <- data.frame(matrix(runif(10000),ncol=100))
for (i in 1:29){
data[[i]] <- data.frame(DF[,c(43:44+i)])
}
str(data, list.len = 3)
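Note that 43:44+i produces overlapping pairs (columns 44 and 45, then 45 and 46, and so on). If the pairs described in the question are wanted instead (44 and 45, then 46 and 47), the start column has to advance by two. A sketch, assuming DF has at least 101 columns, since 29 such pairs starting at column 44 end at column 101; the 102-column DF here is made up:

```r
DF <- data.frame(matrix(runif(102 * 50), ncol = 102))

# Pair i starts at column 42 + 2*i: 44 for i = 1, 46 for i = 2, ...
data <- lapply(1:29, function(i) DF[, 42 + 2 * i + 0:1])

names(data[[1]])
names(data[[2]])
names(data[[29]])
```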
One solution using purrr
DF <- data.frame(matrix(runif(10000),ncol=100))
library(purrr)
res <- 0:28 %>%
# create the indices to subset
map( ~ c(44, 45) + .x) %>%
# subset the df for each group of indices
map( ~ DF[, .x])
length(res)
#> [1] 29
str(head(res))
#> List of 6
#> $ :'data.frame': 100 obs. of 2 variables:
#> ..$ X44: num [1:100] 0.477 0.0593 0.2616 0.7349 0.1202 ...
#> ..$ X45: num [1:100] 0.43 0.105 0.557 0.341 0.111 ...
#> $ :'data.frame': 100 obs. of 2 variables:
#> ..$ X45: num [1:100] 0.43 0.105 0.557 0.341 0.111 ...
#> ..$ X46: num [1:100] 0.78 0.877 0.518 0.162 0.565 ...
#> $ :'data.frame': 100 obs. of 2 variables:
#> ..$ X46: num [1:100] 0.78 0.877 0.518 0.162 0.565 ...
#> ..$ X47: num [1:100] 0.931 0.985 0.59 0.656 0.713 ...
#> $ :'data.frame': 100 obs. of 2 variables:
#> ..$ X47: num [1:100] 0.931 0.985 0.59 0.656 0.713 ...
#> ..$ X48: num [1:100] 0.82 0.899 0.359 0.809 0.329 ...
#> $ :'data.frame': 100 obs. of 2 variables:
#> ..$ X48: num [1:100] 0.82 0.899 0.359 0.809 0.329 ...
#> ..$ X49: num [1:100] 0.7982 0.0966 0.2716 0.3364 0.7295 ...
#> $ :'data.frame': 100 obs. of 2 variables:
#> ..$ X49: num [1:100] 0.7982 0.0966 0.2716 0.3364 0.7295 ...
#> ..$ X50: num [1:100] 0.83057 0.64207 0.94392 0.00904 0.26966 ...
Created on 2018-11-04 by the reprex package (v0.2.1)
Give this a try.
n = 1000
k = 120
DF = matrix(runif(n*k), n, k)
for (i in 1:29){
tmp = DF[,c(43, 43) + c(2*i-1, 2*i)]
assign(paste0("data_", i), tmp)
}
ls()
all(data_1 == DF[,c(44, 45)])
all(data_2 == DF[,c(46, 47)])
Writing data_[i] makes R look for an object called data_ and subscript it; it does not build the name data_1, so you can't just subscript the object name like that.
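Objects created with assign() can be reached again through get(), which takes the name as a string. A small sketch with placeholder values instead of real column subsets:

```r
# Create data_1, data_2, data_3 (placeholder values for illustration)
for (i in 1:3) {
  assign(paste0("data_", i), i * 10)
}

# Reach one of them by building its name as a string
second <- get(paste0("data_", 2))

# mget() collects several such objects into one named list
all_data <- mget(paste0("data_", 1:3))
names(all_data)
```

Once the objects are in a list like all_data, they can be indexed with [[i]], which is what data_[i] was trying to do.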
I would like to train a knn using caret::train to classify digits (classic problem) employing a PCA on the features before training.
control = trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
p = 0.9)
knnFit = train(x = trainingDigit,
y = label,
metric = "Accuracy",
method = "knn",
trControl = control,
preProcess = "pca")
I don't understand how to represent my data for training; my attempt results in an error:
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
My training data is represented as follows (Rdata file):
List of 10
$ : num [1:400, 1:324] 0.934 0.979 0.877 0.853 0.945 ...
$ : num [1:400, 1:324] 0.807 0.98 0.803 0.978 0.969 ...
$ : num [1:400, 1:324] 0.745 0.883 0.776 0.825 0.922 ...
$ : num [1:400, 1:324] 0.892 0.817 0.835 0.84 0.842 ...
$ : num [1:400, 1:324] 0.752 0.859 0.881 0.884 0.855 ...
$ : num [1:400, 1:324] 0.798 0.969 0.925 0.921 0.873 ...
$ : num [1:400, 1:324] 0.964 0.93 0.97 0.857 0.926 ...
$ : num [1:400, 1:324] 0.922 0.939 0.958 0.946 0.867 ...
$ : num [1:400, 1:324] 0.969 0.947 0.916 0.861 0.86 ...
$ : num [1:400, 1:324] 0.922 0.933 0.978 0.968 0.971 ...
Labels as follows (.Rdata file):
List of 10
$ : num [1:400] 0 0 0 0 0 0 0 0 0 0 ...
$ : num [1:400] 1 1 1 1 1 1 1 1 1 1 ...
$ : num [1:400] 2 2 2 2 2 2 2 2 2 2 ...
$ : num [1:400] 3 3 3 3 3 3 3 3 3 3 ...
$ : num [1:400] 4 4 4 4 4 4 4 4 4 4 ...
$ : num [1:400] 5 5 5 5 5 5 5 5 5 5 ...
$ : num [1:400] 6 6 6 6 6 6 6 6 6 6 ...
$ : num [1:400] 7 7 7 7 7 7 7 7 7 7 ...
$ : num [1:400] 8 8 8 8 8 8 8 8 8 8 ...
$ : num [1:400] 9 9 9 9 9 9 9 9 9 9 ...
The problem is in your representation of the data. Try this before you start training:
label <- factor(c(label, recursive = TRUE))
trainingDigit <- data.frame(do.call(rbind, trainingDigit))
You need to massage your data into a data.frame or data.frame-like format with one row per observation: a single column (or vector) holding the outcome of each observation, and the other columns holding its features.
Also, if you want to do classification rather than regression, your outcomes need to be a factor.
To be clear, I tried to run the training code as follows, and it works just fine.
library(caret)
load("data.RData")
load("testClass_new.RData")
label <- factor(c(label, recursive = TRUE))
trainingDigit <- data.frame(do.call(rbind, trainingDigit))
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
p = 0.9)
knnFit <- train(x = trainingDigit,
y = label,
metric = "Accuracy",
method = "knn",
trControl = control,
preProcess = "pca")
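The reshaping step can be tried without caret or the .RData files. The sketch below uses small made-up stand-ins for the two lists (3 classes, 4 rows and 5 features each) and unlist() in place of c(label, recursive = TRUE), which flattens the list in the same way:

```r
# Toy stand-ins for the lists in the question
trainingDigit <- lapply(1:3, function(k) matrix(runif(4 * 5), nrow = 4))
label <- lapply(0:2, function(k) rep(k, 4))

# Stack the matrices row-wise and flatten the labels in the same order
x <- data.frame(do.call(rbind, trainingDigit))
y <- factor(unlist(label))

dim(x)      # one row per observation, one column per feature
table(y)    # observations per class
```

The point to check is that x has as many rows as y has elements, and that the two are in the same order.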
I am analyzing the degradation of certain substances over time. The goal of the script is to get a list containing the variance of the observations at a certain time after application (the "Dag" variable).
The output of the script below is a list containing NULL values. I think the problem lies with my assignment of the variable aux1 to the list item, but the line works when I run it from the command line.
There is probably a faster way of calculating this; even after 2 months I still feel overwhelmed by R.
mydata1<-as.data.frame(matrix(rnorm(600),ncol=6))
names(mydata1)=c("a","b","c","d","e","f")
substance1<-names(mydata1)
times1<-as.data.frame(rep(seq_len(10),10),ncol=1)
names(times1)<-"Dag"
times2<-unique(times1)
mydata1<-cbind(times1,mydata1)
vartijd<-function(times,mydata,substance){
varlist<<-vector("list",length(substance))
for (j in 1:length(substance))
aux<-sapply(times,function(i)var(mydata[mydata$Dag==i,substance[j]],na.rm=TRUE))
aux1<-cbind(times,aux)
varlist[[j]]<-aux1
}
vartijd(times2,mydata1,substance1)
With basic fixes to your code, this works fine.
vartijd<-function(times,mydata,substance){
  varlist<-vector("list",length(substance))  # local `<-` assignment
  for (j in 1:length(substance)){            # opening bracket
    aux<-sapply(times,function(i)var(mydata[mydata$Dag==i,substance[j]],na.rm=TRUE))
    aux1<-cbind(times,aux)
    varlist[[j]]<-aux1
  }                                          # closing bracket
  return(varlist)                            # explicit return
}
Result:
> out <- vartijd(times2,mydata1,substance1)
> str(out)
List of 6
$ :'data.frame': 10 obs. of 2 variables:
..$ Dag: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ aux: num [1:10] 0.997 0.997 0.997 0.997 0.997 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ Dag: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ aux: num [1:10] 0.891 0.891 0.891 0.891 0.891 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ Dag: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ aux: num [1:10] 1.08 1.08 1.08 1.08 1.08 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ Dag: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ aux: num [1:10] 0.927 0.927 0.927 0.927 0.927 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ Dag: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ aux: num [1:10] 0.86 0.86 0.86 0.86 0.86 ...
$ :'data.frame': 10 obs. of 2 variables:
..$ Dag: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ aux: num [1:10] 0.874 0.874 0.874 0.874 0.874 ...
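Note that in the output above aux is the same for every Dag. The argument times is a data frame, so sapply() iterates over its columns and calls the function once with the whole Dag column; the comparison mydata$Dag==i then recycles and selects every row, giving the variance of the entire column. A sketch that iterates over the individual day values instead (vartijd2 and its days argument are made-up names, and the data are regenerated with a fixed seed):

```r
set.seed(42)
mydata1 <- as.data.frame(matrix(rnorm(600), ncol = 6))
names(mydata1) <- c("a", "b", "c", "d", "e", "f")
mydata1 <- cbind(Dag = rep(seq_len(10), 10), mydata1)

# Hypothetical variant: `days` is a plain vector of day values
vartijd2 <- function(days, mydata, substance) {
  out <- lapply(substance, function(s) {
    # one variance per day, computed only from that day's rows
    v <- sapply(days, function(d) var(mydata[mydata$Dag == d, s], na.rm = TRUE))
    data.frame(Dag = days, aux = v)
  })
  names(out) <- substance
  out
}

out <- vartijd2(sort(unique(mydata1$Dag)), mydata1, c("a", "b", "c", "d", "e", "f"))
head(out$a, 3)
```

For a single substance the same numbers come from tapply(mydata1$a, mydata1$Dag, var), which may be the faster route the question was asking about.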