Convert ordinal numbers to dates (YYYY-MM-DD) in R

I converted the dates from 1 January 1988 to 31 December 1988 into ordinal numbers, equivalent to Matlab's datenum function:
datenum_vect <- (as.numeric(as.Date(ISOdate(years_vector[1], 1, 1, 0))) + 719529):
  as.numeric(as.Date(ISOdate(tail(years_vector, n = 1), 12, 31, 0)) + 719529)
Now I need to convert back to the form YEAR-MONTH-DAY (1988-01-01). I tried this:
format(as.Date(datenum_vect - 719529, origin = "1988-01-01"), '%b-%Y')
but it does not work. Any ideas?

You can simply use as.Date, though you'll need to set the origin to year 0 to do this. There is also an off-by-one offset (Matlab's datenum counts days such that 1 January 0000 is day 1), so you need to subtract one day from datenum_vect:
years_vector <- 1988
datenum_vect <- (as.numeric(as.Date(ISOdate(years_vector[1], 1, 1, 0))) + 719529):
  as.numeric(as.Date(ISOdate(tail(years_vector, n = 1), 12, 31, 0)) + 719529)
as.Date(datenum_vect - 1, origin = "0000-01-01")
#> [1] "1988-01-01" "1988-01-02" "1988-01-03" "1988-01-04" "1988-01-05"
#> [6] "1988-01-06" "1988-01-07" "1988-01-08" "1988-01-09" "1988-01-10"
#> [11] "1988-01-11" "1988-01-12" "1988-01-13" "1988-01-14" "1988-01-15"
#> [16] "1988-01-16" "1988-01-17" "1988-01-18" "1988-01-19" "1988-01-20"
#> [21] "1988-01-21" "1988-01-22" "1988-01-23" "1988-01-24" "1988-01-25"
#> [26] "1988-01-26" "1988-01-27" "1988-01-28" "1988-01-29" "1988-01-30"
#> [31] "1988-01-31" "1988-02-01" "1988-02-02" "1988-02-03" "1988-02-04"
#> [36] "1988-02-05" "1988-02-06" "1988-02-07" "1988-02-08" "1988-02-09"
#> [41] "1988-02-10" "1988-02-11" "1988-02-12" "1988-02-13" "1988-02-14"
#> [46] "1988-02-15" "1988-02-16" "1988-02-17" "1988-02-18" "1988-02-19"
#> [51] "1988-02-20" "1988-02-21" "1988-02-22" "1988-02-23" "1988-02-24"
#> [56] "1988-02-25" "1988-02-26" "1988-02-27" "1988-02-28" "1988-02-29"
#> [61] "1988-03-01" "1988-03-02" "1988-03-03" "1988-03-04" "1988-03-05"
#> [66] "1988-03-06" "1988-03-07" "1988-03-08" "1988-03-09" "1988-03-10"
#> [71] "1988-03-11" "1988-03-12" "1988-03-13" "1988-03-14" "1988-03-15"
#> [76] "1988-03-16" "1988-03-17" "1988-03-18" "1988-03-19" "1988-03-20"
#> [81] "1988-03-21" "1988-03-22" "1988-03-23" "1988-03-24" "1988-03-25"
#> [86] "1988-03-26" "1988-03-27" "1988-03-28" "1988-03-29" "1988-03-30"
#> [91] "1988-03-31" "1988-04-01" "1988-04-02" "1988-04-03" "1988-04-04"
#> [96] "1988-04-05" "1988-04-06" "1988-04-07" "1988-04-08" "1988-04-09"
#> [101] "1988-04-10" "1988-04-11" "1988-04-12" "1988-04-13" "1988-04-14"
#> [106] "1988-04-15" "1988-04-16" "1988-04-17" "1988-04-18" "1988-04-19"
#> [111] "1988-04-20" "1988-04-21" "1988-04-22" "1988-04-23" "1988-04-24"
#> [116] "1988-04-25" "1988-04-26" "1988-04-27" "1988-04-28" "1988-04-29"
#> [121] "1988-04-30" "1988-05-01" "1988-05-02" "1988-05-03" "1988-05-04"
#> [126] "1988-05-05" "1988-05-06" "1988-05-07" "1988-05-08" "1988-05-09"
#> [131] "1988-05-10" "1988-05-11" "1988-05-12" "1988-05-13" "1988-05-14"
#> [136] "1988-05-15" "1988-05-16" "1988-05-17" "1988-05-18" "1988-05-19"
#> [141] "1988-05-20" "1988-05-21" "1988-05-22" "1988-05-23" "1988-05-24"
#> [146] "1988-05-25" "1988-05-26" "1988-05-27" "1988-05-28" "1988-05-29"
#> [151] "1988-05-30" "1988-05-31" "1988-06-01" "1988-06-02" "1988-06-03"
#> [156] "1988-06-04" "1988-06-05" "1988-06-06" "1988-06-07" "1988-06-08"
#> [161] "1988-06-09" "1988-06-10" "1988-06-11" "1988-06-12" "1988-06-13"
#> [166] "1988-06-14" "1988-06-15" "1988-06-16" "1988-06-17" "1988-06-18"
#> [171] "1988-06-19" "1988-06-20" "1988-06-21" "1988-06-22" "1988-06-23"
#> [176] "1988-06-24" "1988-06-25" "1988-06-26" "1988-06-27" "1988-06-28"
#> [181] "1988-06-29" "1988-06-30" "1988-07-01" "1988-07-02" "1988-07-03"
#> [186] "1988-07-04" "1988-07-05" "1988-07-06" "1988-07-07" "1988-07-08"
#> [191] "1988-07-09" "1988-07-10" "1988-07-11" "1988-07-12" "1988-07-13"
#> [196] "1988-07-14" "1988-07-15" "1988-07-16" "1988-07-17" "1988-07-18"
#> [201] "1988-07-19" "1988-07-20" "1988-07-21" "1988-07-22" "1988-07-23"
#> [206] "1988-07-24" "1988-07-25" "1988-07-26" "1988-07-27" "1988-07-28"
#> [211] "1988-07-29" "1988-07-30" "1988-07-31" "1988-08-01" "1988-08-02"
#> [216] "1988-08-03" "1988-08-04" "1988-08-05" "1988-08-06" "1988-08-07"
#> [221] "1988-08-08" "1988-08-09" "1988-08-10" "1988-08-11" "1988-08-12"
#> [226] "1988-08-13" "1988-08-14" "1988-08-15" "1988-08-16" "1988-08-17"
#> [231] "1988-08-18" "1988-08-19" "1988-08-20" "1988-08-21" "1988-08-22"
#> [236] "1988-08-23" "1988-08-24" "1988-08-25" "1988-08-26" "1988-08-27"
#> [241] "1988-08-28" "1988-08-29" "1988-08-30" "1988-08-31" "1988-09-01"
#> [246] "1988-09-02" "1988-09-03" "1988-09-04" "1988-09-05" "1988-09-06"
#> [251] "1988-09-07" "1988-09-08" "1988-09-09" "1988-09-10" "1988-09-11"
#> [256] "1988-09-12" "1988-09-13" "1988-09-14" "1988-09-15" "1988-09-16"
#> [261] "1988-09-17" "1988-09-18" "1988-09-19" "1988-09-20" "1988-09-21"
#> [266] "1988-09-22" "1988-09-23" "1988-09-24" "1988-09-25" "1988-09-26"
#> [271] "1988-09-27" "1988-09-28" "1988-09-29" "1988-09-30" "1988-10-01"
#> [276] "1988-10-02" "1988-10-03" "1988-10-04" "1988-10-05" "1988-10-06"
#> [281] "1988-10-07" "1988-10-08" "1988-10-09" "1988-10-10" "1988-10-11"
#> [286] "1988-10-12" "1988-10-13" "1988-10-14" "1988-10-15" "1988-10-16"
#> [291] "1988-10-17" "1988-10-18" "1988-10-19" "1988-10-20" "1988-10-21"
#> [296] "1988-10-22" "1988-10-23" "1988-10-24" "1988-10-25" "1988-10-26"
#> [301] "1988-10-27" "1988-10-28" "1988-10-29" "1988-10-30" "1988-10-31"
#> [306] "1988-11-01" "1988-11-02" "1988-11-03" "1988-11-04" "1988-11-05"
#> [311] "1988-11-06" "1988-11-07" "1988-11-08" "1988-11-09" "1988-11-10"
#> [316] "1988-11-11" "1988-11-12" "1988-11-13" "1988-11-14" "1988-11-15"
#> [321] "1988-11-16" "1988-11-17" "1988-11-18" "1988-11-19" "1988-11-20"
#> [326] "1988-11-21" "1988-11-22" "1988-11-23" "1988-11-24" "1988-11-25"
#> [331] "1988-11-26" "1988-11-27" "1988-11-28" "1988-11-29" "1988-11-30"
#> [336] "1988-12-01" "1988-12-02" "1988-12-03" "1988-12-04" "1988-12-05"
#> [341] "1988-12-06" "1988-12-07" "1988-12-08" "1988-12-09" "1988-12-10"
#> [346] "1988-12-11" "1988-12-12" "1988-12-13" "1988-12-14" "1988-12-15"
#> [351] "1988-12-16" "1988-12-17" "1988-12-18" "1988-12-19" "1988-12-20"
#> [356] "1988-12-21" "1988-12-22" "1988-12-23" "1988-12-24" "1988-12-25"
#> [361] "1988-12-26" "1988-12-27" "1988-12-28" "1988-12-29" "1988-12-30"
#> [366] "1988-12-31"
Created on 2020-08-30 by the reprex package (v0.3.0)
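If you also want the month-year labels that the question's format() call was after, the same corrected dates work; an illustrative addition, not part of the original answer:
dates <- as.Date(datenum_vect - 1, origin = "0000-01-01")
unique(format(dates, '%b-%Y')) # "Jan-1988" "Feb-1988" ... "Dec-1988"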

Related

Get the mean for every iteration

I'm new to R, and hoping someone can help me.
I am trying to get the mean of the first n values of i on the nth iteration, for example the first value on the first iteration, then the first two values on the second iteration, and so on.
How do I go about doing this?
Here is the sample data:
set.seed(1234)
i <- sample(200,100)
An alternative, perhaps simpler, solution: the running mean is just the cumulative sum divided by the number of values seen so far.
set.seed(1234)
i <- sample(200,100)
cumsum(i)/(1:100)
#> [1] 28.00000 54.00000 86.00000 89.75000 94.00000 101.16667 105.71429
#> [8] 113.25000 116.66667 118.20000 116.36364 115.25000 113.30769 110.21429
#> [15] 108.13333 108.62500 103.05882 104.33333 102.10526 97.20000 101.66667
#> [22] 103.81818 101.04348 100.70833 101.56000 105.11538 103.66667 105.96429
#> [29] 106.55172 104.60000 104.70968 105.53125 104.96970 103.08824 103.42857
#> [36] 102.55556 104.10811 102.47368 100.94872 98.47500 98.92683 101.00000
#> [43] 99.79070 99.84091 98.75556 99.52174 100.76596 101.87500 100.95918
#> [50] 101.66000 100.17647 101.03846 102.37736 100.62963 100.54545 99.14286
#> [57] 98.01754 99.20690 100.38983 100.15000 101.00000 99.53226 99.68254
#> [64] 100.34375 100.07692 101.39394 100.17910 99.75000 99.18841 99.85714
#> [71] 100.35211 100.72222 102.04110 101.02703 100.69333 101.53947 102.44156
#> [78] 101.89744 101.43038 100.61250 100.83951 102.04878 101.04819 99.95238
#> [85] 99.12941 98.70930 97.77011 98.44318 98.92135 98.46667 97.45055
#> [92] 97.31522 97.75269 97.05319 96.84211 97.02083 97.81443 97.93878
#> [99] 98.92929 99.55000
Created on 2022-03-04 by the reprex package (v2.0.1)
Here's a one-liner to get the result:
sapply(1:100, function(x) mean(i[seq(x)]))
#> [1] 28.00000 54.00000 86.00000 89.75000 94.00000 101.16667 105.71429
#> ... (identical to the cumsum output above)
#> [99] 98.92929 99.55000
Created on 2022-03-04 by the reprex package (v2.0.1)
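For completeness, dplyr also ships a cummean() helper that computes the same running mean directly; a sketch assuming dplyr is available:
library(dplyr)
set.seed(1234)
i <- sample(200, 100)
cummean(i) # identical to cumsum(i) / (1:100)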

Tuning Tidymodels’ Recipe and Model Parameters Simultaneously

We can use tidymodels to tune both recipe parameters and model parameters simultaneously, right? I'm struggling to understand what corrective action I should take based on the message: "Error: Some tuning parameters require finalization but there are recipe parameters that require tuning. Please use parameters() to finalize the parameter ranges." Any help would be most appreciated.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(themis))
suppressPackageStartupMessages(library(finetune))
suppressPackageStartupMessages(library(doParallel))
suppressPackageStartupMessages(library(titanic))
registerDoParallel()
set.seed(123)
train.df <- titanic_train %>%
  mutate(Survived = factor(ifelse(Survived == 1, 'Y', 'N')),
         Pclass = factor(Pclass, ordered = TRUE),
         Sex = factor(Sex),
         Embarked = factor(ifelse(Embarked == '', NA, Embarked))) %>%
  select(-c(Name, Ticket, Cabin))
summary(train.df)
#> PassengerId Survived Pclass Sex Age SibSp
#> Min. : 1.0 N:549 1:216 female:314 Min. : 0.42 Min. :0.000
#> 1st Qu.:223.5 Y:342 2:184 male :577 1st Qu.:20.12 1st Qu.:0.000
#> Median :446.0 3:491 Median :28.00 Median :0.000
#> Mean :446.0 Mean :29.70 Mean :0.523
#> 3rd Qu.:668.5 3rd Qu.:38.00 3rd Qu.:1.000
#> Max. :891.0 Max. :80.00 Max. :8.000
#> NA's :177
#> Parch Fare Embarked
#> Min. :0.0000 Min. : 0.00 C :168
#> 1st Qu.:0.0000 1st Qu.: 7.91 Q : 77
#> Median :0.0000 Median : 14.45 S :644
#> Mean :0.3816 Mean : 32.20 NA's: 2
#> 3rd Qu.:0.0000 3rd Qu.: 31.00
#> Max. :6.0000 Max. :512.33
#>
cv.folds <- vfold_cv(train.df, v = 4, strata = Survived)
cv.folds
#> # 4-fold cross-validation using stratification
#> # A tibble: 4 x 2
#> splits id
#> <list> <chr>
#> 1 <split [667/224]> Fold1
#> 2 <split [668/223]> Fold2
#> 3 <split [669/222]> Fold3
#> 4 <split [669/222]> Fold4
#########################################################
# Logistic Regression Model -- This Works
# Tuning Recipe Parameters: Yes
# Tuning Model Hyperparameters: No
recipe.logistic.regression <-
  recipe(Survived ~ ., data = train.df) %>%
  update_role(PassengerId, new_role = 'ID') %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_impute_knn(all_predictors(), neighbors = tune()) %>%
  step_normalize(all_predictors()) %>%
  step_downsample(Survived, seed = 456)
spec.logistic.regression <-
  logistic_reg() %>%
  set_engine("glm")
wf.logistic.regression <-
  workflow() %>%
  add_recipe(recipe.logistic.regression) %>%
  add_model(spec.logistic.regression)
wf.logistic.regression
#> == Workflow ====================================================================
#> Preprocessor: Recipe
#> Model: logistic_reg()
#>
#> -- Preprocessor ----------------------------------------------------------------
#> 4 Recipe Steps
#>
#> * step_dummy()
#> * step_impute_knn()
#> * step_normalize()
#> * step_downsample()
#>
#> -- Model -----------------------------------------------------------------------
#> Logistic Regression Model Specification (classification)
#>
#> Computational engine: glm
rs.logistic.regression <- tune_race_anova(
  wf.logistic.regression,
  resamples = cv.folds,
  grid = 25,
  metrics = metric_set(accuracy),
  control = control_race(verbose = TRUE, verbose_elim = TRUE,
                         parallel_over = "everything",
                         save_pred = TRUE,
                         save_workflow = TRUE)
)
#> i Racing will maximize the accuracy metric.
#> i Resamples are analyzed in a random order.
#> i Fold4: 1 eliminated; 9 candidates remain.
show_best(rs.logistic.regression)
#> # A tibble: 5 x 7
#> neighbors .metric .estimator mean n std_err .config
#> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 9 accuracy binary 0.791 4 0.0193 Preprocessor01_Model1
#> 2 2 accuracy binary 0.788 4 0.0186 Preprocessor08_Model1
#> 3 4 accuracy binary 0.788 4 0.0190 Preprocessor09_Model1
#> 4 1 accuracy binary 0.787 4 0.0205 Preprocessor05_Model1
#> 5 10 accuracy binary 0.787 4 0.0205 Preprocessor10_Model1
#########################################################
# Random Forest Model A -- This Works
# Tuning Recipe Parameters: No
# Tuning Model Hyperparameters: Yes
recipe.random.forest.a <-
  recipe(Survived ~ ., data = train.df) %>%
  update_role(PassengerId, new_role = 'ID') %>%
  step_impute_knn(all_predictors(),
                  neighbors = 5) %>% # <-- Manually setting value for neighbors
  step_downsample(Survived, seed = 456)
spec.random.forest.a <-
  rand_forest(mtry = tune(),
              min_n = tune(),
              trees = tune()) %>%
  set_mode("classification") %>%
  set_engine("ranger")
wf.random.forest.a <-
  workflow() %>%
  add_recipe(recipe.random.forest.a) %>%
  add_model(spec.random.forest.a)
wf.random.forest.a
#> == Workflow ====================================================================
#> Preprocessor: Recipe
#> Model: rand_forest()
#>
#> -- Preprocessor ----------------------------------------------------------------
#> 2 Recipe Steps
#>
#> * step_impute_knn()
#> * step_downsample()
#>
#> -- Model -----------------------------------------------------------------------
#> Random Forest Model Specification (classification)
#>
#> Main Arguments:
#> mtry = tune()
#> trees = tune()
#> min_n = tune()
#>
#> Computational engine: ranger
rs.random.forest.a <- tune_race_anova(
  wf.random.forest.a,
  resamples = cv.folds,
  grid = 25,
  metrics = metric_set(accuracy),
  control = control_race(verbose = TRUE, verbose_elim = TRUE,
                         parallel_over = "everything",
                         save_pred = TRUE,
                         save_workflow = TRUE)
)
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> i Racing will maximize the accuracy metric.
#> i Resamples are analyzed in a random order.
#> i Fold4: 4 eliminated; 21 candidates remain.
show_best(rs.random.forest.a)
#> # A tibble: 5 x 9
#> mtry trees min_n .metric .estimator mean n std_err .config
#> <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 4 837 18 accuracy binary 0.818 4 0.00685 Preprocessor1_Model~
#> 2 4 1968 16 accuracy binary 0.817 4 0.00738 Preprocessor1_Model~
#> 3 4 1439 25 accuracy binary 0.817 4 0.00664 Preprocessor1_Model~
#> 4 3 1769 10 accuracy binary 0.816 4 0.0130 Preprocessor1_Model~
#> 5 3 1478 13 accuracy binary 0.816 4 0.0109 Preprocessor1_Model~
#########################################################
# Random Forest Model B -- This Does Not Work
# Tuning Recipe Parameters: Yes
# Tuning Model Hyperparameters: Yes
recipe.random.forest.b <-
  recipe(Survived ~ ., data = train.df) %>%
  update_role(PassengerId, new_role = 'ID') %>%
  step_impute_knn(all_predictors(),
                  neighbors = tune()) %>% # <-- Tuning neighbors
  step_downsample(Survived, seed = 456)
spec.random.forest.b <-
  rand_forest(mtry = tune(),
              min_n = tune(),
              trees = tune()) %>%
  set_mode("classification") %>%
  set_engine("ranger")
wf.random.forest.b <-
  workflow() %>%
  add_recipe(recipe.random.forest.b) %>%
  add_model(spec.random.forest.b)
wf.random.forest.b
#> == Workflow ====================================================================
#> Preprocessor: Recipe
#> Model: rand_forest()
#>
#> -- Preprocessor ----------------------------------------------------------------
#> 2 Recipe Steps
#>
#> * step_impute_knn()
#> * step_downsample()
#>
#> -- Model -----------------------------------------------------------------------
#> Random Forest Model Specification (classification)
#>
#> Main Arguments:
#> mtry = tune()
#> trees = tune()
#> min_n = tune()
#>
#> Computational engine: ranger
rs.random.forest.b <- tune_race_anova(
  wf.random.forest.b,
  resamples = cv.folds,
  grid = 25,
  metrics = metric_set(accuracy),
  control = control_race(verbose = TRUE, verbose_elim = TRUE,
                         parallel_over = "everything",
                         save_pred = TRUE,
                         save_workflow = TRUE)
)
#> Error: Some tuning parameters require finalization but there are recipe parameters that require tuning. Please use `parameters()` to finalize the parameter ranges.
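For reference, here is a sketch of the direction the error message points in (this is not part of the original post). mtry() has a data-dependent upper bound that tune normally finalizes by prepping the recipe, which it cannot do while the recipe itself contains tuning parameters; one way out is to finalize the ranges yourself and pass them through param_info. The upper bound of 7 below assumes the seven predictors left by this recipe:
rf.params <- parameters(wf.random.forest.b) %>%
  update(mtry = mtry(range = c(1, 7))) # 7 predictors remain after the recipe (assumed)

rs.random.forest.b <- tune_race_anova(
  wf.random.forest.b,
  resamples = cv.folds,
  grid = 25,
  param_info = rf.params,
  metrics = metric_set(accuracy),
  control = control_race(verbose = TRUE, verbose_elim = TRUE,
                         parallel_over = "everything",
                         save_pred = TRUE,
                         save_workflow = TRUE)
)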
#########################################################
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19041)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] parallel stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] titanic_0.1.0 doParallel_1.0.16 iterators_1.0.13 foreach_1.5.1
#> [5] finetune_0.1.0 themis_0.1.4 yardstick_0.0.8 workflowsets_0.1.0
#> [9] workflows_0.2.3 tune_0.1.6 rsample_0.1.0 recipes_0.1.16
#> [13] parsnip_0.1.7 modeldata_0.1.1 infer_0.5.4 dials_0.0.9
#> [17] scales_1.1.1 broom_0.7.9 tidymodels_0.1.3 forcats_0.5.1
#> [21] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.0.0
#> [25] tidyr_1.1.3 tibble_3.1.3 ggplot2_3.3.5 tidyverse_1.3.1
#>
#> loaded via a namespace (and not attached):
#> ...
Created on 2021-08-07 by the reprex package (v2.0.1)

Extract names of genes expressed by at least 10% of cells in a cluster

I have a Seurat object with defined clusters. I need to extract a list of all genes that are expressed by at least 10% of cells in a cluster, and I need to repeat this for each cluster separately.
I know of code that can extract the genes expressed by at least 10% of cells from the whole Seurat object:
genes.to.keep <- Matrix::rowSums(Monocyte.integrated@assays$RNA@counts > 0) >=
  floor(0.1 * ncol(Monocyte.integrated@assays$RNA@counts))
counts.sub <- Monocyte.integrated@assays$RNA@counts[genes.to.keep, ]
But this is not what I want, and I'm not sure how to modify it to take the cluster names into account (assuming it's correct in the first place).
I store the cluster names in a metadata variable called "cluster_names".
I would appreciate any help.
BW
You could use lapply to iterate over the factor levels of your clusters, subset and filter each one individually, and use setNames to name the resulting list. Below is a reproducible example:
library(Seurat)
data("pbmc_small")
pbmc_small <- FindClusters(pbmc_small, resolution = 1)
names(pbmc_small@meta.data)[names(pbmc_small@meta.data) == "seurat_clusters"] <- "cluster_names"
levels(pbmc_small$cluster_names) <- paste0("cluster_", seq_along(levels(pbmc_small$cluster_names)))
setNames(lapply(levels(pbmc_small$cluster_names), function(x) {
  p <- subset(pbmc_small, cluster_names == x)
  rownames(p)[Matrix::rowSums(p@assays$RNA@counts > 0) >= .1 * dim(p)[2]]
}), levels(pbmc_small$cluster_names))
#> $cluster_1
#> [1] "CD79B" "HLA-DRA" "LTB" "SP100" "PPP3CC" "CXCR4"
#> [7] "STX10" "SNHG7" "CD3D" "NOSIP" "SAFB2" "CD2"
#> [13] "IL7R" "PIK3IP1" "MPHOSPH6" "KHDRBS1" "MAL" "CCR7"
#> [19] "THYN1" "TAF7" "LDHB" "TMEM123" "EPC1" "EIF4A2"
#> [25] "CD3E" "TMUB1" "BLOC1S4" "SRSF7" "ACAP1" "TNFAIP8"
#> [31] "CD7" "TAGAP" "DNAJB1" "ASNSD1" "S1PR4" "CTSW"
#> [37] "GZMK" "NKG7" "IL32" "DNAJC2" "LYAR" "CST7"
#> [43] "LCK" "CCL5" "HNRNPH1" "SSR2" "GIMAP1" "MMADHC"
#> [49] "CD8A" "GYPC" "HNRNPF" "RPL7L1" "KLRG1" "CRBN"
#> [55] "SATB1" "PMPCB" "NRBP1" "TCF7" "HNRNPA3" "S100A8"
#> [61] "S100A9" "LYZ" "FCN1" "TYROBP" "NFKBIA" "TYMP"
#> [67] "CTSS" "TSPO" "CTSB" "LGALS1" "BLVRA" "LGALS3"
#> [73] "IFI6" "HLA-DPA1" "CST3" "GSTP1" "EIF3G" "VPS28"
#> [79] "ZFP36L1" "ANXA2" "HSP90AA1" "LST1" "AIF1" "PSAP"
#> [85] "YWHAB" "MYO1G" "SAT1" "RGS2" "FCGR3A" "S100A11"
#> [91] "FCER1G" "IFITM2" "COTL1" "LGALS9" "CD68" "RHOC"
#> [97] "CARD16" "COPS6" "PPBP" "GPX1" "TPM4" "PF4"
#> [103] "SDPR" "NRGN" "SPARC" "GNG11" "CLU" "HIST1H2AC"
#> [109] "NCOA4" "GP9" "FERMT3" "ODC1" "CD9" "RUFY1"
#> [115] "TUBB1" "TALDO1" "TREML1" "NGFRAP1" "PGRMC1" "CA2"
#> [121] "ITGA2B" "MYL9" "TMEM40" "PARVB" "PTCRA" "ACRBP"
#> [127] "TSC22D1" "VDAC3" "GZMB" "GZMA" "GNLY" "FGFBP2"
#> [133] "AKR1C3" "CCL4" "PRF1" "GZMH" "XBP1" "GZMM"
#> [139] "PTGDR" "IGFBP7" "TTC38" "KLRD1" "ARHGDIA" "IL2RB"
#> [145] "CLIC3" "PPP1R18" "CD247" "ALOX5AP" "XCL2" "C12orf75"
#> [151] "RARRES3" "PCMT1" "LAMP1" "SPON2"
#>
#> $cluster_2
#> [1] "CD79B" "CD79A" "HLA-DRA" "HLA-DQB1"
#> [5] "HVCN1" "HLA-DMB" "LTB" "SP100"
#> [9] "NCF1" "EAF2" "FAM96A" "CXCR4"
#> [13] "STX10" "SNHG7" "NT5C" "NOSIP"
#> [17] "IL7R" "KHDRBS1" "TAF7" "LDHB"
#> [21] "TMEM123" "EIF4A2" "TMUB1" "BLOC1S4"
#> [25] "SRSF7" "TNFAIP8" "TAGAP" "DNAJB1"
#> [29] "S1PR4" "NKG7" "IL32" "DNAJC2"
#> [33] "LYAR" "CCL5" "SSR2" "GIMAP1"
#> [37] "MMADHC" "HNRNPF" "RPL7L1" "HNRNPA3"
#> [41] "S100A8" "S100A9" "LYZ" "CD14"
#> [45] "FCN1" "TYROBP" "ASGR1" "NFKBIA"
#> [49] "TYMP" "CTSS" "TSPO" "RBP7"
#> [53] "CTSB" "LGALS1" "FPR1" "VSTM1"
#> [57] "BLVRA" "MPEG1" "BID" "SMCO4"
#> [61] "CFD" "LINC00936" "LGALS2" "MS4A6A"
#> [65] "FCGRT" "LGALS3" "NUP214" "SCO2"
#> [69] "IL17RA" "IFI6" "HLA-DPA1" "FCER1A"
#> [73] "CLEC10A" "HLA-DMA" "RGS1" "HLA-DPB1"
#> [77] "HLA-DQA1" "RNF130" "HLA-DRB5" "HLA-DRB1"
#> [81] "CST3" "IL1B" "POP7" "HLA-DQA2"
#> [85] "GSTP1" "EIF3G" "VPS28" "LY86"
#> [89] "ZFP36L1" "ANXA2" "GRN" "CFP"
#> [93] "HSP90AA1" "LST1" "AIF1" "PSAP"
#> [97] "YWHAB" "MYO1G" "SAT1" "RGS2"
#> [101] "SERPINA1" "IFITM3" "FCGR3A" "LILRA3"
#> [105] "S100A11" "FCER1G" "TNFRSF1B" "IFITM2"
#> [109] "WARS" "IFI30" "MS4A7" "C5AR1"
#> [113] "HCK" "COTL1" "LGALS9" "CD68"
#> [117] "RP11-290F20.3" "RHOC" "CARD16" "LRRC25"
#> [121] "COPS6" "ADAR" "GPX1" "TPM4"
#> [125] "NRGN" "NCOA4" "FERMT3" "ODC1"
#> [129] "TALDO1" "PARVB" "VDAC3" "GZMB"
#> [133] "XBP1" "IGFBP7" "ARHGDIA" "PPP1R18"
#> [137] "ALOX5AP" "RARRES3" "PCMT1" "SPON2"
#>
#> $cluster_3
#> [1] "MS4A1" "CD79B" "CD79A" "HLA-DRA"
#> [5] "TCL1A" "HLA-DQB1" "HVCN1" "HLA-DMB"
#> [9] "LTB" "LINC00926" "FCER2" "SP100"
#> [13] "NCF1" "PPP3CC" "EAF2" "PPAPDC1B"
#> [17] "CD19" "KIAA0125" "CYB561A3" "CD180"
#> [21] "RP11-693J15.5" "FAM96A" "CXCR4" "STX10"
#> [25] "SNHG7" "NT5C" "BANK1" "IGLL5"
#> [29] "CD200" "FCRLA" "CD3D" "NOSIP"
#> [33] "CD2" "IL7R" "PIK3IP1" "KHDRBS1"
#> [37] "THYN1" "TAF7" "LDHB" "TMEM123"
#> [41] "CCDC104" "EPC1" "EIF4A2" "CD3E"
#> [45] "SRSF7" "ACAP1" "TNFAIP8" "CD7"
#> [49] "TAGAP" "DNAJB1" "S1PR4" "CTSW"
#> [53] "GZMK" "NKG7" "IL32" "DNAJC2"
#> [57] "LYAR" "CST7" "LCK" "CCL5"
#> [61] "HNRNPH1" "SSR2" "GIMAP1" "MMADHC"
#> [65] "CD8A" "PTPN22" "GYPC" "HNRNPF"
#> [69] "RPL7L1" "CRBN" "SATB1" "SIT1"
#> [73] "PMPCB" "NRBP1" "TCF7" "HNRNPA3"
#> [77] "S100A9" "LYZ" "FCN1" "TYROBP"
#> [81] "NFKBIA" "TYMP" "CTSS" "TSPO"
#> [85] "CTSB" "LGALS1" "BLVRA" "MPEG1"
#> [89] "BID" "CFD" "LINC00936" "LGALS2"
#> [93] "MS4A6A" "FCGRT" "LGALS3" "SCO2"
#> [97] "HLA-DPA1" "FCER1A" "CLEC10A" "HLA-DMA"
#> [101] "RGS1" "HLA-DPB1" "HLA-DQA1" "RNF130"
#> [105] "HLA-DRB5" "HLA-DRB1" "CST3" "IL1B"
#> [109] "POP7" "HLA-DQA2" "CD1C" "GSTP1"
#> [113] "EIF3G" "VPS28" "LY86" "ZFP36L1"
#> [117] "ZNF330" "ANXA2" "GRN" "CFP"
#> [121] "HSP90AA1" "FUOM" "LST1" "AIF1"
#> [125] "PSAP" "YWHAB" "MYO1G" "SAT1"
#> [129] "RGS2" "SERPINA1" "IFITM3" "FCGR3A"
#> [133] "S100A11" "FCER1G" "TNFRSF1B" "IFITM2"
#> [137] "WARS" "IFI30" "MS4A7" "HCK"
#> [141] "COTL1" "LGALS9" "CD68" "RHOC"
#> [145] "CARD16" "LRRC25" "COPS6" "ADAR"
#> [149] "GPX1" "TPM4" "NCOA4" "FERMT3"
#> [153] "ODC1" "RUFY1" "TALDO1" "VDAC3"
#> [157] "GZMA" "GNLY" "FGFBP2" "PRF1"
#> [161] "XBP1" "GZMM" "PTGDR" "ARHGDIA"
#> [165] "PPP1R18" "CD247" "ALOX5AP" "XCL2"
#> [169] "C12orf75" "RARRES3" "PCMT1" "SPON2"
Created on 2021-03-26 by the reprex package (v1.0.0)
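If subsetting the whole Seurat object per cluster is slow on a larger dataset, the count matrix can also be indexed directly; a sketch using the same pbmc_small setup as above:
counts <- pbmc_small@assays$RNA@counts
cells.by.cluster <- split(colnames(pbmc_small), pbmc_small$cluster_names)
lapply(cells.by.cluster, function(cells) {
  m <- counts[, cells, drop = FALSE]
  rownames(m)[Matrix::rowSums(m > 0) >= 0.1 * length(cells)]
})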

ARIMA fitted model gives NULL

I am trying to plot the residuals vs. the fitted values, but when I call the fitted() function on my ARMA model, the output I receive is NULL.
The data I am using is 500 values between roughly -5 and 5.
The data should be modelled well by an ARMA(1,1) process.
I am not sure what the problem is in the following code.
model <- arima(data$Z, order = c(1,0,1), include.mean=FALSE)
fitted(model)
Use the following code. stats::arima() stores no fitted values and has no fitted() method of its own, which is why you get NULL; the forecast package supplies fitted.Arima() (plus checkresiduals() for diagnostics):
library(forecast)
#> Warning: package 'forecast' was built under R version 3.5.3
z <- runif(500, -5.0, 5)
model <- arima(z, order = c(1,0,1), include.mean = F)
fitted(model)
#> Time Series:
#> Start = 1
#> End = 500
#> Frequency = 1
#> [1] -0.0015806455 0.0799286719 -0.1409297625 0.0479671123 -0.1228818961
#> [6] 0.0940340261 -0.0395403451 0.0930088194 -0.0504654231 0.0154369074
#> ...
#> [496] 0.0826917008 -0.0668356103 0.1329418861 -0.0392132173 0.0669457471
plot(residuals(model))
checkresiduals(model)
#>
#> Ljung-Box test
#>
#> data: Residuals from ARIMA(1,0,1) with zero mean
#> Q* = 3.8204, df = 8, p-value = 0.873
#>
#> Model df: 2. Total lags used: 10
Created on 2019-11-10 by the reprex package (v0.3.0)
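With forecast loaded, the residuals-vs.-fitted plot the question was after is then a single extra line:
plot(fitted(model), residuals(model))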

XGBoost gives me 100% prediction accuracy for my binary classification problem. How can I solve it?

XGBoost gives me 100% prediction accuracy for a binary classification problem. This seems too good to be true. How can I solve it?
I am using a normalized dataset (max-min or z-score) that is already split into training and validation sets, and I use the training set values to predict the validation set. The two subsets are obviously very similar, but there is nothing I can do about that, and I also avoid look-ahead bias. What else could be a possible reason for 100% accuracy, and how can I solve it? Thank you very much!
My code is:
train_x=data.matrix(tmp[,-40])
train_y=tmp[,40]
test_x=data.matrix(tmp2[,-40])
test_y=tmp2[,40]
test_y=as.factor(test_y)
xgb_train = xgb.DMatrix(data=train_x, label=train_y)
xgb_test = xgb.DMatrix(data=test_x, label=test_y)
set.seed(12345)
xgbc=xgboost(data=xgb_train, max.depth=4, nrounds=200)
print(xgbc)
preds=predict(xgbc,test_x)
preds[preds>0.5] = "1"
pred_y = as.factor(test_y)
print(pred_y)
cm = confusionMatrix(test_y, pred_y)
print(cm)
Code output is:
> xgbc=xgboost(data=xgb_train,max.depth=4, nrounds=200, nthread=2, eta=1,
objective="binary:logistic")
[1] train-error:0.415888
[2] train-error:0.390654
[3] train-error:0.368692
...
[57] train-error:0.000000
...
[100] train-error:0.000000
> print(xgbc)
##### xgb.Booster
raw: 186.6 Kb
call:
xgb.train(params = params, data = dtrain, nrounds = nrounds,
watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
early_stopping_rounds = early_stopping_rounds, maximize = maximize,
save_period = save_period, save_name = save_name, xgb_model = xgb_model,
callbacks = callbacks, max.depth = 4, nthread = 2, eta = 1,
objective = "binary:logistic")
params (as set within xgb.train):
max_depth = "4", nthread = "2", eta = "1", objective = "binary:logistic",
silent = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
cb.evaluation.log()
# of features: 38
niter: 200
nfeatures : 38
evaluation_log:
iter train_error
1 0.415888
2 0.390654
---
199 0.000000
200 0.000000
preds=predict(xgbc,test_x)
> preds
[1] 7.273692e-01 1.643806e-02 3.032141e-04 9.764441e-01 9.691942e-02
5.343258e-01 9.090783e-01
...
> preds[preds>0.5] = "1"
> preds[preds<=0.5]= "0"
> pred_y = as.factor(test_y)
> print(pred_y)
[1] 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1
1 0 1 1 1 0 1 0 1 1 1 1 0 0
[51] 1 1 0 1 0 1 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1
1 1 0 0 1 0 0 0 1 1 1 1 0 1
> test_y=as.factor(test_y)
> cm = confusionMatrix(test_y, pred_y)
> print(cm)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 421 0
1 0 497
Accuracy : 1
95% CI : (0.996, 1)
No Information Rate : 0.5414
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.4586
Detection Rate : 0.4586
Detection Prevalence : 0.4586
Balanced Accuracy : 1.0000
'Positive' Class : 0
It seems like you're seriously overfitting to your training data, and you should use cross-validation instead of a naive train/test split. There are a number of ways to do this; xgb.cv in the xgboost R package is one, and I prefer tidymodels, but that's a different rabbit hole. My guess is that if you tune a parameter like gamma, you'll end up with non-zero training loss, because gamma > 0 helps prevent overfitting by pruning your trees. You can also reduce overfitting by growing fewer, shallower trees, subsampling features, and so on; all of these options can be tuned with xgb.cv.
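As a rough illustration of that suggestion (a sketch, not a tuned configuration; it reuses the xgb_train DMatrix from the question, and every parameter value below is a placeholder):
library(xgboost)
params <- list(objective = "binary:logistic",
               max_depth = 4,          # shallower trees overfit less
               eta = 0.1,              # slower learning than eta = 1
               gamma = 1,              # minimum loss reduction to make a split
               subsample = 0.8,        # row subsampling per tree
               colsample_bytree = 0.8) # feature subsampling per tree
cv <- xgb.cv(params = params,
             data = xgb_train,
             nrounds = 200,
             nfold = 5,
             metrics = "error",
             early_stopping_rounds = 10,
             verbose = FALSE)
cv$evaluation_log # fold-averaged train vs. test error per iteration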
Try checking the correlation of the predictor variables with the outcome, and consider removing variables that are very highly correlated with it, since they can effectively leak the label into the features. This solved my own issue with 100% accuracy.
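A quick way to screen for such variables, sketched with the question's objects (this assumes train_y is still numeric, i.e. before the factor conversion):
target.cor <- cor(train_x, as.numeric(train_y))     # one correlation per predictor
head(sort(abs(target.cor[, 1]), decreasing = TRUE)) # most suspicious first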
