I'm trying to understand the intuition behind what is going on in the xgb.dump of a binary classification model with an interaction depth of 1. Specifically, how can the same split (f38 < 2.5) be used twice in a row (output lines 2 and 6)?
The resulting output looks like this:
xgb.dump(model_2,with.stats=T)
[1] "booster[0]"
[2] "0:[f38<2.5] yes=1,no=2,missing=1,gain=173.793,cover=6317"
[3] "1:leaf=-0.0366182,cover=3279.75"
[4] "2:leaf=-0.0466305,cover=3037.25"
[5] "booster[1]"
[6] "0:[f38<2.5] yes=1,no=2,missing=1,gain=163.887,cover=6314.25"
[7] "1:leaf=-0.035532,cover=3278.65"
[8] "2:leaf=-0.0452568,cover=3035.6"
Is the difference between the first use of f38 and the second use of f38 simply the residual fitting going on? It seemed weird to me at first, and I'm trying to understand exactly what's going on here.
Thanks!
Is the difference between the first use of f38 and the second use of f38 simply the residual fitting going on?
Most likely yes: it's updating the gradients after the first round and then finding the same feature and split point again in your example.
Here's a reproducible example.
Note how I lower the learning rate in the second example and it finds the same feature and the same split point again in all three rounds. In the first example the trees split on different features in each of the 3 rounds.
require(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
# high learning rate: the root split uses a different feature in each tree (f28, f59, f101)
bst <- xgboost(data = train$data, label = train$label, max_depth = 2, eta = 1, nrounds = 3,nthread = 2, objective = "binary:logistic")
xgb.dump(model = bst)
# [1] "booster[0]" "0:[f28<-9.53674e-07] yes=1,no=2,missing=1"
# [3] "1:[f55<-9.53674e-07] yes=3,no=4,missing=3" "3:leaf=1.71218"
# [5] "4:leaf=-1.70044" "2:[f108<-9.53674e-07] yes=5,no=6,missing=5"
# [7] "5:leaf=-1.94071" "6:leaf=1.85965"
# [9] "booster[1]" "0:[f59<-9.53674e-07] yes=1,no=2,missing=1"
# [11] "1:[f28<-9.53674e-07] yes=3,no=4,missing=3" "3:leaf=0.784718"
# [13] "4:leaf=-0.96853" "2:leaf=-6.23624"
# [15] "booster[2]" "0:[f101<-9.53674e-07] yes=1,no=2,missing=1"
# [17] "1:[f66<-9.53674e-07] yes=3,no=4,missing=3" "3:leaf=0.658725"
# [19] "4:leaf=5.77229" "2:[f110<-9.53674e-07] yes=5,no=6,missing=5"
# [21] "5:leaf=-0.791407" "6:leaf=-9.42142"
## lower learning rate (eta = 0.01): every tree repeats the same splits (root on f28, then f55 and f108)
bst2 <- xgboost(data = train$data, label = train$label, max_depth = 2, eta = .01, nrounds = 3,nthread = 2, objective = "binary:logistic")
xgb.dump(model = bst2)
# [1] "booster[0]" "0:[f28<-9.53674e-07] yes=1,no=2,missing=1"
# [3] "1:[f55<-9.53674e-07] yes=3,no=4,missing=3" "3:leaf=0.0171218"
# [5] "4:leaf=-0.0170044" "2:[f108<-9.53674e-07] yes=5,no=6,missing=5"
# [7] "5:leaf=-0.0194071" "6:leaf=0.0185965"
# [9] "booster[1]" "0:[f28<-9.53674e-07] yes=1,no=2,missing=1"
# [11] "1:[f55<-9.53674e-07] yes=3,no=4,missing=3" "3:leaf=0.016952"
# [13] "4:leaf=-0.0168371" "2:[f108<-9.53674e-07] yes=5,no=6,missing=5"
# [15] "5:leaf=-0.0192151" "6:leaf=0.0184251"
# [17] "booster[2]" "0:[f28<-9.53674e-07] yes=1,no=2,missing=1"
# [19] "1:[f55<-9.53674e-07] yes=3,no=4,missing=3" "3:leaf=0.0167863"
# [21] "4:leaf=-0.0166737" "2:[f108<-9.53674e-07] yes=5,no=6,missing=5"
# [23] "5:leaf=-0.0190286" "6:leaf=0.0182581"
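To make the "updating the gradient" point concrete, here is a minimal base R sketch of three boosting rounds with a single fixed stump. It is not xgboost's actual code: the toy data, the fixed split at x < 2.5, and the values of eta and lambda are assumptions chosen for illustration, and the leaf formula -eta * G / (H + lambda) is the usual second-order update for log loss.

```r
# Toy data (hypothetical): one integer feature and a label weakly tied to x < 2.5
set.seed(1)
n <- 2000
x <- sample(0:5, n, replace = TRUE)
y <- rbinom(n, 1, ifelse(x < 2.5, 0.3, 0.2))

eta    <- 0.1          # learning rate (assumed value)
lambda <- 1            # L2 regularisation on leaf weights (assumed value)
margin <- rep(0, n)    # raw score; 0 corresponds to p = 0.5
left   <- x < 2.5      # the stump's split, held fixed for illustration

leaves <- matrix(NA_real_, nrow = 3, ncol = 2,
                 dimnames = list(paste0("round_", 1:3), c("left", "right")))

for (r in 1:3) {
  p <- 1 / (1 + exp(-margin))   # current predicted probability
  g <- p - y                    # gradient of the log loss w.r.t. the margin
  h <- p * (1 - p)              # hessian of the log loss
  # Leaf value = shrunken Newton step computed from the rows in that leaf
  leaves[r, "left"]  <- -eta * sum(g[left])  / (sum(h[left])  + lambda)
  leaves[r, "right"] <- -eta * sum(g[!left]) / (sum(h[!left]) + lambda)
  # Add this round's contribution to the running score
  margin <- margin + ifelse(left, leaves[r, "left"], leaves[r, "right"])
}
leaves
```

With a small learning rate each round removes only a fraction of the remaining gradient, so the same split stays the best one and the leaf values shrink slightly from round to round, which matches the pattern in the dump from the question (-0.0366 then -0.0355).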
I am using the R programming language.
Using the following code, I am able to put two plots on the same page:
#load libraries
library(dbscan)
library(dplyr)
#specify number of plots per page
par(mfrow = c(1,2))
#generate data
n <- 100
x <- cbind(
x=runif(10, 0, 5) + rnorm(n, sd=0.4),
y=runif(10, 0, 5) + rnorm(n, sd=0.4)
)
### calculate LOF score
lof <- lof(x, k=3)
### distribution of outlier factors (first plot)
summary(lof)
hist(lof, breaks=10)
### point size is proportional to LOF (second plot)
plot(x, pch = ".", main = "LOF (k=3)")
points(x, cex = (lof-1)*3, pch = 1, col="red")
This produces the following plot:
Now, I am trying to make several plots (e.g. 6 plots, 3 pairs of 2) on the same page. I tried to implement this with a "for loop" (for k = 3, 4, 5):
par(mfrow = c(3,2))
vals <- 3:5
combine <- vector('list', length(vals))
count <- 0
for (i in vals) {
lof_i <- lof(x, k=i)
### distribution of outlier factors
summary(lof_i)
hist(lof_i, breaks=10)
### point size is proportional to LOF
plot(x, pch = ".", main = "LOF (k=i)")
points(x, cex = (lof_i-1)*3, pch = 1, col="red")
}
However, this seems to just repeat the same graph 6 times on the same page:
Can someone please show me how to correct this code?
Is it also possible to save the results as separate objects "lof_3, lof_4, lof_5"? It seems that none of these objects are created, only "lof_i" is created:
> lof_3
Error: object 'lof_3' not found
> head(lof_i)
[1] 1.223307 1.033424 1.077149 1.011407 1.040634 1.431029
Thanks
Looking at your plots, you do seem to have generated and plotted different plots for each k, but to get the labels right you need to pass a variable and not a fixed character string to your title (e.g. using the paste command).
To get the calculated values out of your loop you could either generate an empty list and assign the results in the loop to individual list elements, or use something like lapply that will automatically return the results in a list form.
To simplify things a bit you could define a function that either plots or returns the calculated values, e.g. like this:
library(dbscan)
#generate data
set.seed(123)
n <- 100
x <- cbind(
x=runif(10, 0, 5) + rnorm(n, sd=0.4),
y=runif(10, 0, 5) + rnorm(n, sd=0.4)
)
plotLOF <- function(i, plot=TRUE){
lof <- lof(x, k=i)
if (plot){
hist(lof, breaks=10)
plot(x, pch = ".", main = paste0("LOF (k=", i, ")"))
points(x, cex = (lof-1)*3, pch = 1, col="red")
} else return(lof)
}
par(mfrow = c(3,2))
invisible(lapply(3:5, plotLOF))
lapply(3:5, plotLOF, plot=FALSE)
#> [[1]]
#> [1] 1.1419243 0.9551471 1.0777472 1.1224447 0.8799095 1.0377858 0.8416306
#> [8] 1.0487133 1.0250496 1.3183819 0.9896833 1.0353398 1.3088266 1.0123238
#> [15] 1.1233530 0.9685039 1.0589151 1.3147785 1.0488644 0.9212146 1.2568698
#> [22] 1.0086274 1.0454450 0.9661698 1.0644528 1.1107202 1.0942201 1.5147076
#> [29] 1.0321698 1.0553455 1.1149748 0.9341090 1.2352716 0.9478602 1.4096464
#> [36] 1.0519127 1.0507267 1.3199825 1.2525485 0.9361488 1.0958563 1.2131615
#> [43] 0.9943090 1.0123238 1.1060491 1.0377766 0.9803135 0.9627699 1.1165421
#> [50] 0.9796819 0.9946925 2.1576989 1.6015310 1.5670315 0.9343637 1.0033725
#> [57] 0.8769431 0.9783065 1.0800050 1.2768800 0.9735274 1.0377472 1.0743988
#> [64] 1.7583562 1.2662485 0.9685039 1.1662145 1.2491499 1.1131718 1.0085023
#> [71] 0.9636864 1.1538360 1.2126138 1.0609829 1.0679010 1.0490234 1.1403292
#> [78] 0.9638900 1.1863703 0.9651060 0.9503445 1.0098536 0.8440855 0.9052420
#> [85] 1.2662485 1.4447713 1.0845415 1.0661381 0.9282678 0.9380078 1.1414628
#> [92] 1.0407138 1.0942201 1.0589805 1.0370938 1.0147094 1.1067291 0.8834466
#> [99] 1.7027132 1.1766560
#>
#> [[2]]
#> [1] 1.1667311 1.0409009 1.0920953 1.0068953 0.9894195 1.1332413 0.9764505
#> [8] 1.0228796 1.0446905 1.0893386 1.1211637 1.1029415 1.3453498 0.9712910
#> [15] 1.1635936 1.0265746 0.9480282 1.2144437 1.0570346 0.9314618 1.3345561
#> [22] 0.9816097 0.9929112 1.0322014 1.2739621 1.2947553 1.0202948 1.6153264
#> [29] 1.0790922 0.9987830 1.0378609 0.9622779 1.2974938 0.9129639 1.2601398
#> [36] 1.0265746 1.0241622 1.2420568 1.2204376 0.9297345 1.1148404 1.2546361
#> [43] 1.0059582 0.9819820 1.0342491 0.9452673 1.0369500 0.9791091 1.2000825
#> [50] 0.9878844 1.0205586 2.0057587 1.2757014 1.5347815 0.9622614 1.0692613
#> [57] 1.0026404 0.9408510 1.0280687 1.3534531 0.9669894 0.9300601 0.9929112
#> [64] 1.7567871 1.3861828 1.0265746 1.1120151 1.3542396 1.1562077 0.9842179
#> [71] 1.0301098 1.2326327 1.1866352 1.0403814 1.0577086 0.8745912 1.0017905
#> [78] 0.9904356 1.0602487 0.9501681 1.0176457 1.0405430 0.9718224 1.0046821
#> [85] 1.1909982 1.6151918 0.9640852 1.0141963 1.0270237 0.9867738 1.1474414
#> [92] 1.1293307 1.0323945 1.0859417 0.9622614 1.0290635 1.0186381 0.9225209
#> [99] 1.6456612 1.1366753
#>
#> [[3]]
#> [1] 1.1299335 1.0122028 1.2077092 0.9485150 1.0115694 1.1190314 0.9989174
#> [8] 1.0145663 1.0357546 0.9783702 1.1050504 1.0661798 1.3571416 1.0024603
#> [15] 1.1484745 1.0162149 0.9601474 1.1310442 1.0957731 1.0065501 1.2687934
#> [22] 0.9297323 0.9725355 0.9876444 1.2314822 1.2209304 0.9906446 1.4249452
#> [29] 1.2156607 0.9959685 1.0304305 0.9976110 1.1711354 1.0048161 0.9813000
#> [36] 1.0128909 0.9730295 1.1741982 1.3317209 0.9708714 1.0994309 1.1900047
#> [43] 0.9960765 0.9659553 0.9744357 0.9556112 1.0508484 0.9669406 1.3919743
#> [50] 0.9467537 1.0596883 1.7396644 1.1323109 1.6516971 0.9922995 1.0223594
#> [57] 0.9917594 0.9542419 1.0672565 1.2274498 1.0589385 0.9649404 0.9953886
#> [64] 1.7666795 1.3111620 0.9860706 1.0576620 1.2547512 1.0038281 0.9825967
#> [71] 1.0104708 1.1739417 1.1884817 1.0199412 0.9956941 0.9720389 0.9601474
#> [78] 0.9898781 1.1025485 0.9797453 1.0086780 1.0556471 1.0150204 1.0339022
#> [85] 1.1174116 1.5252177 0.9721734 0.9486663 1.0161640 0.9903872 1.2339874
#> [92] 1.0753099 0.9819882 1.0439012 1.0016272 1.0122706 1.0536213 0.9948601
#> [99] 1.4693656 1.0274264
Created on 2021-02-22 by the reprex package (v1.0.0)
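The other option mentioned above, an empty list filled inside a for loop, can be sketched as follows. The list name lof_results and the element names lof_3, lof_4, lof_5 are illustrative choices, not anything from the original code. The call lof(x, k = i) is copied from the question; depending on the installed dbscan version the neighbourhood argument may be named differently, so check ?lof.

```r
library(dbscan)

# generate data (same recipe as above)
set.seed(123)
n <- 100
x <- cbind(
  x = runif(10, 0, 5) + rnorm(n, sd = 0.4),
  y = runif(10, 0, 5) + rnorm(n, sd = 0.4)
)

vals <- 3:5
# pre-allocate a named list: one slot per value of k
lof_results <- vector("list", length(vals))
names(lof_results) <- paste0("lof_", vals)

op <- par(mfrow = c(length(vals), 2))   # one row of two plots per k
for (i in vals) {
  lof_i <- lof(x, k = i)
  lof_results[[paste0("lof_", i)]] <- lof_i   # keep the scores for later
  hist(lof_i, breaks = 10, main = paste0("LOF scores (k=", i, ")"))
  plot(x, pch = ".", main = paste0("LOF (k=", i, ")"))
  points(x, cex = (lof_i - 1) * 3, pch = 1, col = "red")
}
par(op)   # restore the previous plot layout

head(lof_results$lof_3)
```

Keeping the scores in one named list avoids creating separate lof_3, lof_4, lof_5 objects in the workspace while still letting you access each result by name.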
for (i in vector)
eval(parse(text = sprintf("plot(df$%s)",i)))
This is a very powerful line of code; it can be very handy for plotting graphs in loops.
for (i in 3:5) {
eval(parse(text= sprintf('lof_%s <- lof(x, k=%s)',i,i)))
### distribution of outlier factors
eval(parse(text=sprintf('summary(lof_%s)',i)))
eval(parse(text=sprintf('hist(lof_%s, breaks=10)',i)))
### point size is proportional to LOF
eval(parse(text=sprintf("plot(x, pch = '.', main = 'LOF (k=%s)')",i)))
eval(parse(text=sprintf("points(x, cex = (lof_%s-1)*3, pch = 1, col='red')",i)))
}
Explanation:
eval() - evaluates the expression.
parse() - parses the text into an expression that can be evaluated.
sprintf() - builds a string (text) by substituting the supplied arguments into the format string.
Your code is not working because the i inside the title string "LOF (k=i)" is treated as literal text; it does not take the value of the loop variable. If you want to understand the functions above, I would suggest running sprintf('lof_%s <- lof(x, k=%s)',i,i) on its own and looking at the output.
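As a small illustration of that suggestion, the sketch below only builds and parses the command string without evaluating it, so it runs without dbscan or the data being loaded. The variable names cmd and expr are illustrative.

```r
i <- 3

# sprintf() substitutes i into both %s placeholders
cmd <- sprintf("lof_%s <- lof(x, k=%s)", i, i)
cmd

# parse() turns the string into an unevaluated expression;
# nothing is executed until eval() is called on it
expr <- parse(text = cmd)
class(expr)
```

Printing cmd before wrapping it in eval(parse(...)) is a quick way to check that the generated code is what you intended.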
I'm having trouble with the trafo function for SMOTE {smotefamily}'s K parameter. In particular, when the number of nearest neighbours K is greater than or equal to the sample size, an error is returned (warning("k should be less than sample size!")) and the tuning process is terminated.
The user cannot control K to be smaller than the sample size during the internal resampling process. This would have to be controlled internally so that if, for instance, trafo_K = 2 ^ K >= sample_size for some value of K, then, say, trafo_K = sample_size - 1.
I was wondering if there's a solution to this or if one is already on its way?
library("mlr3") # mlr3 base package
library("mlr3misc") # contains some helper functions
library("mlr3pipelines") # create ML pipelines
library("mlr3tuning") # tuning ML algorithms
library("mlr3learners") # additional ML algorithms
library("mlr3viz") # autoplot for benchmarks
library("paradox") # hyperparameter space
library("OpenML") # to obtain data sets
library("smotefamily") # SMOTE algorithm for imbalance correction
# get list of curated binary classification data sets (see https://arxiv.org/abs/1708.03731v2)
ds = listOMLDataSets(
number.of.classes = 2,
number.of.features = c(1, 100),
number.of.instances = c(5000, 10000)
)
# select imbalanced data sets (without categorical features as SMOTE cannot handle them)
ds = subset(ds, minority.class.size / number.of.instances < 0.2 &
number.of.symbolic.features == 1)
ds
d = getOMLDataSet(980)
d
# make sure target is a factor and create mlr3 tasks
data = as.data.frame(d)
data[[d$target.features]] = as.factor(data[[d$target.features]])
task = TaskClassif$new(
id = d$desc$name, backend = data,
target = d$target.features)
task
# Code above copied from https://mlr3gallery.mlr-org.com/posts/2020-03-30-imbalanced-data/
class_counts <- table(task$truth())
majority_to_minority_ratio <- class_counts[class_counts == max(class_counts)] /
class_counts[class_counts == min(class_counts)]
# Pipe operator for SMOTE
po_smote <- po("smote", dup_size = round(majority_to_minority_ratio))
# Random Forest learner
rf <- lrn("classif.ranger", predict_type = "prob")
# Pipeline of Random Forest learner with SMOTE
graph <- po_smote %>>%
po('learner', rf, id = 'rf')
graph$plot()
# Graph learner
rf_smote <- GraphLearner$new(graph, predict_type = 'prob')
rf_smote$predict_type <- 'prob'
# Parameter set in data table format
ps_table <- as.data.table(rf_smote$param_set)
View(ps_table[, 1:4])
# Define parameter search space for the SMOTE parameters
param_set <- ps_table$id %>%
lapply(
function(x) {
if (grepl('smote.', x)) {
if (grepl('.dup_size', x)) {
ParamInt$new(x, lower = 1, upper = round(majority_to_minority_ratio))
} else if (grepl('.K', x)) {
ParamInt$new(x, lower = 1, upper = round(majority_to_minority_ratio))
}
}
}
)
param_set <- Filter(Negate(is.null), param_set)
param_set <- ParamSet$new(param_set)
# Apply transformation function on SMOTE's K (= The number of nearest neighbors used for sampling new values. See SMOTE().)
param_set$trafo <- function(x, param_set) {
index <- which(grepl('.K', names(x)))
if (sum(index) != 0){
x[[index]] <- round(3 ^ x[[index]]) # Intentionally define a trafo that won't work
}
x
}
# Define and instantiate resampling strategy to be applied within pipeline
cv <- rsmp("cv", folds = 2)
cv$instantiate(task)
# Set up tuning instance
instance <- TuningInstance$new(
task = task,
learner = rf_smote,
resampling = cv,
measures = msr("classif.bbrier"),
param_set,
terminator = term("evals", n_evals = 3),
store_models = TRUE)
tuner <- TunerRandomSearch$new()
# Tune pipe learner to find optimal SMOTE parameter values
tuner$optimize(instance)
And here's what happens
INFO [11:00:14.904] Benchmark with 2 resampling iterations
INFO [11:00:14.919] Applying learner 'smote.rf' on task 'optdigits' (iter 2/2)
Error in get.knnx(data, query, k, algorithm) : ANN: ERROR------->
In addition: Warning message:
In get.knnx(data, query, k, algorithm) : k should be less than sample size!
Session info
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] smotefamily_1.3.1 OpenML_1.10 mlr3viz_0.1.1.9002
[4] mlr3tuning_0.1.2-9000 mlr3pipelines_0.1.2.9000 mlr3misc_0.2.0
[7] mlr3learners_0.2.0 mlr3filters_0.2.0.9000 mlr3_0.2.0-9000
[10] paradox_0.2.0 yardstick_0.0.5 rsample_0.0.5
[13] recipes_0.1.9 parsnip_0.0.5 infer_0.5.1
[16] dials_0.0.4 scales_1.1.0 broom_0.5.4
[19] tidymodels_0.0.3 reshape2_1.4.3 janitor_1.2.1
[22] data.table_1.12.8 forcats_0.4.0 stringr_1.4.0
[25] dplyr_0.8.4 purrr_0.3.3 readr_1.3.1
[28] tidyr_1.0.2 tibble_3.0.1 ggplot2_3.3.0
[31] tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] utf8_1.1.4 tidyselect_1.0.0 lme4_1.1-21
[4] htmlwidgets_1.5.1 grid_3.6.2 ranger_0.12.1
[7] pROC_1.16.1 munsell_0.5.0 codetools_0.2-16
[10] bbotk_0.1 DT_0.12 future_1.17.0
[13] miniUI_0.1.1.1 withr_2.2.0 colorspace_1.4-1
[16] knitr_1.28 uuid_0.1-4 rstudioapi_0.10
[19] stats4_3.6.2 bayesplot_1.7.1 listenv_0.8.0
[22] rstan_2.19.2 lgr_0.3.4 DiceDesign_1.8-1
[25] vctrs_0.2.4 generics_0.0.2 ipred_0.9-9
[28] xfun_0.12 R6_2.4.1 markdown_1.1
[31] mlr3measures_0.1.3-9000 rstanarm_2.19.2 lhs_1.0.1
[34] assertthat_0.2.1 promises_1.1.0 nnet_7.3-12
[37] gtable_0.3.0 globals_0.12.5 processx_3.4.1
[40] timeDate_3043.102 rlang_0.4.5 workflows_0.1.1
[43] BBmisc_1.11 splines_3.6.2 checkmate_2.0.0
[46] inline_0.3.15 yaml_2.2.1 modelr_0.1.5
[49] tidytext_0.2.2 threejs_0.3.3 crosstalk_1.0.0
[52] backports_1.1.6 httpuv_1.5.2 rsconnect_0.8.16
[55] tokenizers_0.2.1 tools_3.6.2 lava_1.6.6
[58] ellipsis_0.3.0 ggridges_0.5.2 Rcpp_1.0.4.6
[61] plyr_1.8.5 base64enc_0.1-3 visNetwork_2.0.9
[64] ps_1.3.0 prettyunits_1.1.1 rpart_4.1-15
[67] zoo_1.8-7 haven_2.2.0 fs_1.3.1
[70] furrr_0.1.0 magrittr_1.5 colourpicker_1.0
[73] reprex_0.3.0 GPfit_1.0-8 SnowballC_0.6.0
[76] packrat_0.5.0 matrixStats_0.55.0 tidyposterior_0.0.2
[79] hms_0.5.3 shinyjs_1.1 mime_0.8
[82] xtable_1.8-4 XML_3.99-0.3 tidypredict_0.4.3
[85] shinystan_2.5.0 readxl_1.3.1 gridExtra_2.3
[88] rstantools_2.0.0 compiler_3.6.2 crayon_1.3.4
[91] minqa_1.2.4 StanHeaders_2.21.0-1 htmltools_0.4.0
[94] later_1.0.0 lubridate_1.7.4 DBI_1.1.0
[97] dbplyr_1.4.2 MASS_7.3-51.4 boot_1.3-23
[100] Matrix_1.2-18 cli_2.0.1 parallel_3.6.2
[103] gower_0.2.1 igraph_1.2.4.2 pkgconfig_2.0.3
[106] xml2_1.2.2 foreach_1.4.7 dygraphs_1.1.1.6
[109] prodlim_2019.11.13 farff_1.1 rvest_0.3.5
[112] snakecase_0.11.0 janeaustenr_0.1.5 callr_3.4.1
[115] digest_0.6.25 cellranger_1.1.0 curl_4.3
[118] shiny_1.4.0 gtools_3.8.1 nloptr_1.2.1
[121] lifecycle_0.2.0 nlme_3.1-142 jsonlite_1.6.1
[124] fansi_0.4.1 pillar_1.4.3 lattice_0.20-38
[127] loo_2.2.0 fastmap_1.0.1 httr_1.4.1
[130] pkgbuild_1.0.6 survival_3.1-8 glue_1.4.0
[133] xts_0.12-0 FNN_1.1.3 shinythemes_1.1.2
[136] iterators_1.0.12 class_7.3-15 stringi_1.4.4
[139] memoise_1.1.0 future.apply_1.5.0
Many thanks.
I've found a workaround.
As pointed out earlier, the problem is that SMOTE {smotefamily}'s K cannot be greater than or equal to the sample size.
I dug into the process and discovered that SMOTE {smotefamily} uses knearest {smotefamily}, which uses knnx.index {FNN}, which in turn uses get.knnx {FNN},
which is what raises the warning("k should be less than sample size!") and the error that terminates the tuning process in mlr3.
Now, within SMOTE {smotefamily}, the three arguments for knearest {smotefamily} are P_set, P_set and K. From an mlr3 resampling perspective,
data frame P_set is a subset of the cross-validation fold of the training data, filtered to only contain the records of the minority class. The 'sample size' that
the error is referring to is the number of rows of P_set.
Thus, it becomes more likely that K >= nrow(P_set) as K increases via a trafo such as some_integer ^ K (e.g. 2 ^ K).
We need to ensure that K will never be greater than or equal to nrow(P_set).
Here's my proposed solution:
Define a variable cv_folds before defining the CV resampling strategy with rsmp().
Define the CV resampling strategy where folds = cv_folds in rsmp(), before defining the trafo.
Instantiate the CV. Now, the dataset is split into training and test/validation data in each fold.
Find the minimum sample size of the minority class among all training data folds and set that as the threshold for K:
smote_k_thresh <- 1:cv_folds %>%
lapply(
function(x) {
index <- cv$train_set(x)
aux <- as.data.frame(task$data())[index, task$target_names]
aux <- min(table(aux))
}
) %>%
bind_cols %>%
min %>%
unique
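The same threshold idea can be sketched in base R without mlr3, which may make the logic easier to follow. Everything here is a stand-in: the toy target y, the manual fold assignment fold_id, and the helper vector minority_per_fold are hypothetical and only mimic what cv$train_set() and task$data() provide above.

```r
set.seed(42)

# Hypothetical imbalanced target: 90 negatives, 10 positives
y <- factor(c(rep("neg", 90), rep("pos", 10)))

cv_folds <- 2
# Randomly assign each row to one of the folds
fold_id <- sample(rep(seq_len(cv_folds), length.out = length(y)))

# For each fold, count the smallest class in the corresponding training set
minority_per_fold <- vapply(seq_len(cv_folds), function(f) {
  train_y <- y[fold_id != f]   # training rows are those outside fold f
  min(table(train_y))
}, integer(1))

# K must stay strictly below the smallest minority count seen in training
smote_k_thresh <- min(minority_per_fold)
smote_k_thresh
```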
Now define the trafo as follows:
param_set$trafo <- function(x, param_set) {
index <- which(grepl('.K', names(x)))
if (sum(index) != 0){
aux <- round(2 ^ x[[index]])
if (aux < smote_k_thresh) {
x[[index]] <- aux
} else {
x[[index]] <- sample(smote_k_thresh - 1, 1)
}
}
x
}
In other words, when the trafoed K remains smaller than the sample size, keep it. Otherwise, set its value to be any number between 1 and smote_k_thresh - 1.
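A deterministic variant of the same rule, which clamps K instead of drawing a random replacement, might look like the sketch below. This is an alternative to the random fallback used above, not the workaround itself; the helper name cap_k and its arguments are hypothetical, and it assumes the minority class has at least two rows.

```r
# Transform the raw tuning value and cap it just below the minority sample size
cap_k <- function(k_raw, n_minority) {
  k <- round(2 ^ k_raw)          # same kind of trafo as above
  pmin(k, n_minority - 1)        # never reach or exceed the sample size
}

cap_k(3, 100)    # 2^3 = 8 is below the cap, so it is kept
cap_k(10, 100)   # 2^10 = 1024 is too large, so it is clamped to 99
```

Clamping keeps the search reproducible, at the cost of mapping many large raw values onto the same K.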
Implementation
Original code slightly modified to accommodate proposed tweaks:
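library("mlr3") # mlr3 base package
library("mlr3misc") # contains some helper functions
library("mlr3pipelines") # create ML pipelines
library("mlr3tuning") # tuning ML algorithms
library("mlr3learners") # additional ML algorithms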
library("mlr3viz") # autoplot for benchmarks
library("paradox") # hyperparameter space
library("OpenML") # to obtain data sets
library("smotefamily") # SMOTE algorithm for imbalance correction
# get list of curated binary classification data sets (see https://arxiv.org/abs/1708.03731v2)
ds = listOMLDataSets(
number.of.classes = 2,
number.of.features = c(1, 100),
number.of.instances = c(5000, 10000)
)
# select imbalanced data sets (without categorical features as SMOTE cannot handle them)
ds = subset(ds, minority.class.size / number.of.instances < 0.2 &
number.of.symbolic.features == 1)
ds
d = getOMLDataSet(980)
d
# make sure target is a factor and create mlr3 tasks
data = as.data.frame(d)
data[[d$target.features]] = as.factor(data[[d$target.features]])
task = TaskClassif$new(
id = d$desc$name, backend = data,
target = d$target.features)
task
# Code above copied from https://mlr3gallery.mlr-org.com/posts/2020-03-30-imbalanced-data/
class_counts <- table(task$truth())
majority_to_minority_ratio <- class_counts[class_counts == max(class_counts)] /
class_counts[class_counts == min(class_counts)]
# Pipe operator for SMOTE
po_smote <- po("smote", dup_size = round(majority_to_minority_ratio))
# Define and instantiate resampling strategy to be applied within pipeline
# Do that BEFORE defining the trafo
cv_folds <- 2
cv <- rsmp("cv", folds = cv_folds)
cv$instantiate(task)
# Calculate max possible value for k-nearest neighbours
smote_k_thresh <- 1:cv_folds %>%
lapply(
function(x) {
index <- cv$train_set(x)
aux <- as.data.frame(task$data())[index, task$target_names]
aux <- min(table(aux))
}
) %>%
bind_cols %>%
min %>%
unique
# Random Forest learner
rf <- lrn("classif.ranger", predict_type = "prob")
# Pipeline of Random Forest learner with SMOTE
graph <- po_smote %>>%
po('learner', rf, id = 'rf')
graph$plot()
# Graph learner
rf_smote <- GraphLearner$new(graph, predict_type = 'prob')
rf_smote$predict_type <- 'prob'
# Parameter set in data table format
ps_table <- as.data.table(rf_smote$param_set)
View(ps_table[, 1:4])
# Define parameter search space for the SMOTE parameters
param_set <- ps_table$id %>%
lapply(
function(x) {
if (grepl('smote.', x)) {
if (grepl('.dup_size', x)) {
ParamInt$new(x, lower = 1, upper = round(majority_to_minority_ratio))
} else if (grepl('.K', x)) {
ParamInt$new(x, lower = 1, upper = round(majority_to_minority_ratio))
}
}
}
)
param_set <- Filter(Negate(is.null), param_set)
param_set <- ParamSet$new(param_set)
# Apply transformation function on SMOTE's K while ensuring it never equals or exceeds the sample size
param_set$trafo <- function(x, param_set) {
index <- which(grepl('.K', names(x)))
if (sum(index) != 0){
aux <- round(5 ^ x[[index]]) # Try a large value here for the sake of the example
if (aux < smote_k_thresh) {
x[[index]] <- aux
} else {
x[[index]] <- sample(smote_k_thresh - 1, 1)
}
}
x
}
# Set up tuning instance
instance <- TuningInstance$new(
task = task,
learner = rf_smote,
resampling = cv,
measures = msr("classif.bbrier"),
param_set,
terminator = term("evals", n_evals = 10),
store_models = TRUE)
tuner <- TunerRandomSearch$new()
# Tune pipe learner to find optimal SMOTE parameter values
tuner$optimize(instance)
# Here are the original K values
instance$archive$data
# And here are their transformations
instance$archive$data$opt_x
I'm running a loess.smooth method after running the spline method on it.
The input given below is the data I get after running the spline method.
However, I'm going wrong somewhere with the loess.smooth method. The first column of the output comes back as non-integer (float) values, but I need it as integers with an increment of 1.
Any help would be much appreciated.
Thanks
**input:** spline_file
1 0.157587435
2 0.146704412
3 0.129899285
4 0.138925582
5 0.104085676
out <- loess.smooth(spline_file$x, spline_file$y, span = 1, degree = 1,
family = c("gaussian"), length.out = seq(1, max_exp, by = 1), surface=
"interpolate", normalize = TRUE, method="linear")
**OUTPUT:**
0 0.150404703
1.020408163 0.154413716
2.040816327 0.158458172
3.06122449 0.162515428
4.081632653 0.166562839
5.102040816 0.170577762
**OUTPUT REQUIRED:**
x y
1 0.225926707
2 0.226026551
3 0.226241194
4 0.2265471
5 0.226920733
Not sure if the following fully answers your question, but maybe it helps. Below is some code, a demonstrative plot, and some explanations/recommendations.
You should not use a degree of 1; your data requires a higher degree.
You should check the allowed parameters via ?loess.smooth. I think you mixed up some parameters of scatter.smooth and loess.smooth and also used some parameters that do not exist for the function (e.g. normalize; please correct me if I have overlooked something).
In any case, it makes sense that the output of a smoothing function has more data points than the original data. To be able to plot a smooth curve, the smoothing function generates additional points between your data points. Check the plot generated at the end of the code below. Whether the fit is good is another question...
spline_file <- read.table(text = "
1 0.157587435
2 0.146704412
3 0.129899285
4 0.138925582
5 0.104085676
", stringsAsFactors = FALSE)
colnames(spline_file) <- c("x", "y")
spline_loess <- loess.smooth(spline_file$x, spline_file$y, span = 1, degree = 2,
family = c("gaussian")
,surface= "interpolate"
, statistics = "exact"
)
spline_loess
# $x
# [1] 1.000000 1.081633 1.163265 1.244898 1.326531 1.408163 1.489796
# [8] 1.571429 1.653061 1.734694 1.816327 1.897959 1.979592 2.061224
# [15] 2.142857 2.224490 2.306122 2.387755 2.469388 2.551020 2.632653
# [22] 2.714286 2.795918 2.877551 2.959184 3.040816 3.122449 3.204082
# [29] 3.285714 3.367347 3.448980 3.530612 3.612245 3.693878 3.775510
# [36] 3.857143 3.938776 4.020408 4.102041 4.183673 4.265306 4.346939
# [43] 4.428571 4.510204 4.591837 4.673469 4.755102 4.836735 4.918367
# [50] 5.000000
#
# $y
# [1] 0.1586807 0.1571512 0.1556485 0.1541759 0.1527367 0.1513344
# [7] 0.1499721 0.1486533 0.1473813 0.1461595 0.1449911 0.1438795
# [13] 0.1428280 0.1417881 0.1406496 0.1394364 0.1381783 0.1369053
# [19] 0.1356473 0.1344341 0.1332957 0.1322619 0.1313626 0.1306278
# [25] 0.1300873 0.1297791 0.1297453 0.1299324 0.1302747 0.1307066
# [31] 0.1311626 0.1315769 0.1318839 0.1320181 0.1319138 0.1315054
# [37] 0.1307273 0.1295270 0.1281453 0.1266888 0.1251504 0.1235232
# [43] 0.1218002 0.1199744 0.1180388 0.1159866 0.1138105 0.1115038
# [49] 0.1090594 0.1064704
plot(spline_file)
lines(spline_loess)
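If the goal is to get the smoothed values back at the integer x positions only, one option is the evaluation argument of loess.smooth, which sets how many equally spaced points between min(x) and max(x) are returned (50 by default). The sketch below assumes, as in the data above, that x runs over consecutive integers, so that number can be set to the range of x plus one; the names n_eval and out are illustrative.

```r
spline_file <- data.frame(
  x = 1:5,
  y = c(0.157587435, 0.146704412, 0.129899285, 0.138925582, 0.104085676)
)

# One evaluation point per integer between min(x) and max(x)
n_eval <- diff(range(spline_file$x)) + 1

out <- loess.smooth(spline_file$x, spline_file$y, span = 1, degree = 2,
                    family = "gaussian", evaluation = n_eval)

# Collect the result as a two-column table with integer-spaced x
data.frame(x = out$x, y = out$y)
```

For x values that are not evenly spaced, fitting with loess() and calling predict() on the desired x values is the more general route.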
I'm trying to predict whether or not an airline will add a route to their existing network by looking at their previous additions and training the model on what the previous year looked like. I've used xgboost before and it worked fine, but I removed a few cities and now xgboost is just predicting everything to be 50:50.
trainm <- sparse.model.matrix(add ~. -1, data = train)
train_label <- train[, "add"]
train_matrix <- xgb.DMatrix(data = (trainm), label = train_label)
testm <- sparse.model.matrix(add~. -1, data = test)
test_label <- test[, "add"]
test_matrix <- xgb.DMatrix(data = (testm), label = test_label)
nc <- length(unique(train_label))
xgb_params <- list("objective" = "binary:logistic",
"eval_metric" = "error",
"scale_pos_weight" = weight)
watchlist <- list(train = train_matrix, test = test_matrix)
bst_model <- xgb.train(params = xgb_params,
nthread = 2,
data = train_matrix,
nrounds = 10,
watchlist = watchlist,
booster = 'gbtree'
)
outputs:
[1] train-error:0.972469 test-error:0.972580
[2] train-error:0.972469 test-error:0.972580
[3] train-error:0.972469 test-error:0.972580
[4] train-error:0.972469 test-error:0.972580
[5] train-error:0.972469 test-error:0.972580
[6] train-error:0.972469 test-error:0.972580
[7] train-error:0.972469 test-error:0.972580
[8] train-error:0.972469 test-error:0.972580
[9] train-error:0.972469 test-error:0.972580
[10] train-error:0.972469 test-error:0.972580
It is weighted because the data is very imbalanced (~36 negatives for every 1 positive). I just don't know why it has suddenly stopped working.
Edit. It fixed itself and I have no idea why.
Edit2. It did it again and I have no idea why.
Edit3. I fixed it. It has to do with NA values in certain columns.
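For anyone hitting the same symptom, a quick way to look for NA values before building the model matrix is sketched below. The toy data frame and its columns (dist, hub) are hypothetical stand-ins for the real train data, and dropping incomplete rows is only one possible way to handle them.

```r
# Hypothetical toy data standing in for `train`
train <- data.frame(
  add  = c(0, 1, 0, 1),
  dist = c(100, NA, 250, 300),
  hub  = c(1, 0, NA, 1)
)

# Count missing values per column and show only the affected columns
na_per_col <- colSums(is.na(train))
na_per_col[na_per_col > 0]

# One option: keep only complete rows so features and labels stay aligned
train_complete <- train[complete.cases(train), ]
nrow(train_complete)
```

Checking this before calling sparse.model.matrix matters because the labels are taken from the data frame separately, so rows and labels can get out of step if incomplete rows are dropped along the way.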
Hi, I am programming in R and I want to use the xgboost function to predict a dummy variable.
Here is the code:
library(xgboost)
library(Matrix)
mydata<-read.csv(file.choose(),header = TRUE,sep=",")
names(mydata)
[1] "Factor_Check" "Cor_Check" "Cor_Check4"
[4] "Cor_Check2" "n_tokens_title" "n_tokens_content"
[7] "n_unique_tokens" "n_non_stop_words" "n_non_stop_unique_tokens"
[10] "num_hrefs" "num_self_hrefs" "num_imgs"
[13] "num_videos" "average_token_length" "num_keywords"
[16] "data_channel_is_lifestyle" "data_channel_is_entertainment" "data_channel_is_bus"
[19] "data_channel_is_socmed" "data_channel_is_tech" "data_channel_is_world"
[22] "kw_min_min" "kw_max_min" "kw_avg_min"
[25] "kw_min_max" "kw_max_max" "kw_avg_max"
[28] "kw_min_avg" "kw_max_avg" "kw_avg_avg"
[31] "self_reference_min_shares" "self_reference_max_shares" "self_reference_avg_sharess"
[34] "weekday_is_monday" "weekday_is_tuesday" "weekday_is_wednesday"
[37] "weekday_is_thursday" "weekday_is_friday" "weekday_is_saturday"
[40] "weekday_is_sunday" "is_weekend" "LDA_00"
[43] "LDA_01" "LDA_02" "LDA_03"
[46] "LDA_04" "global_subjectivity" "global_sentiment_polarity"
[49] "global_rate_positive_words" "global_rate_negative_words" "rate_positive_words"
[52] "rate_negative_words" "avg_positive_polarity" "min_positive_polarity"
[55] "max_positive_polarity" "avg_negative_polarity" "min_negative_polarity"
[58] "max_negative_polarity" "title_subjectivity" "title_sentiment_polarity"
[61] "abs_title_subjectivity" "abs_title_sentiment_polarity" "TargetVarCont"
[64] "TargetVar1" "TargetVar2"
Factor_Check is a factor; the rest of the columns are numeric.
output.var <- "TargetVar2"
vars.to.exclude <- c("Factor_Check","Cor_Check","Cor_Check4","Cor_Check2","TargetVar1", "TargetVarCont")
Building the model based on the first 80% of the data:
train<-mydata[(1:round(nrow(mydata)*(0.8))),]
train<-train[,!(names(train) %in% vars.to.exclude)]
Train<- Matrix::sparse.model.matrix(~.-1 , data=train)
xgb <- xgboost(data = Train[,!(names(Train) %in% output.var)], label = Train[,output.var],max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
Train
Error: shinyjs: could not find the Shiny session object. This usually
happens when a shinyjs function is called from a context that wasn't
set up by a Shiny session.
Does anyone know why I am getting this error?