Error while using sparse model matrix

Error while using sparse model matrix - r

Hi I am programming here in R, and I want to use the xgboost function for predicting a dummy variable.
That's the code:
library(xgboost)
library(Matrix)
mydata<-read.csv(file.choose(),header = TRUE,sep=",")
names(mydata)
[1] "Factor_Check" "Cor_Check" "Cor_Check4"
[4] "Cor_Check2" "n_tokens_title" "n_tokens_content"
[7] "n_unique_tokens" "n_non_stop_words" "n_non_stop_unique_tokens"
[10] "num_hrefs" "num_self_hrefs" "num_imgs"
[13] "num_videos" "average_token_length" "num_keywords"
[16] "data_channel_is_lifestyle" "data_channel_is_entertainment" "data_channel_is_bus"
[19] "data_channel_is_socmed" "data_channel_is_tech" "data_channel_is_world"
[22] "kw_min_min" "kw_max_min" "kw_avg_min"
[25] "kw_min_max" "kw_max_max" "kw_avg_max"
[28] "kw_min_avg" "kw_max_avg" "kw_avg_avg"
[31] "self_reference_min_shares" "self_reference_max_shares" "self_reference_avg_sharess"
[34] "weekday_is_monday" "weekday_is_tuesday" "weekday_is_wednesday"
[37] "weekday_is_thursday" "weekday_is_friday" "weekday_is_saturday"
[40] "weekday_is_sunday" "is_weekend" "LDA_00"
[43] "LDA_01" "LDA_02" "LDA_03"
[46] "LDA_04" "global_subjectivity" "global_sentiment_polarity"
[49] "global_rate_positive_words" "global_rate_negative_words" "rate_positive_words"
[52] "rate_negative_words" "avg_positive_polarity" "min_positive_polarity"
[55] "max_positive_polarity" "avg_negative_polarity" "min_negative_polarity"
[58] "max_negative_polarity" "title_subjectivity" "title_sentiment_polarity"
[61] "abs_title_subjectivity" "abs_title_sentiment_polarity" "TargetVarCont"
[64] "TargetVar1" "TargetVar2"
Factor Check is Factor the rest are numeric
output.var <- "TargetVar2"
vars.to.exclude <- c("Factor_Check","Cor_Check","Cor_Check4","Cor_Check2","TargetVar1", "TargetVarCont")
Building the model based on 80% of the data
train<-mydata[(1:round(nrow(mydata)*(0.8))),]
train<-train[,!(names(train) %in% vars.to.exclude)]
Train<- Matrix::sparse.model.matrix(~.-1 , data=train)
xgb <- xgboost(data = Train[,!(names(Train) %in% output.var)], label = Train[,output.var],max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
Train
Error: shinyjs: could not find the Shiny session object. This usually
happens when a shinyjs function is called from a context that wasn't
set up by a Shiny session.
Does anyone know why I am getting this error?

Related

Using R, How do I copy the tibble to an element of the list. for example, each element like ff[i] have a nibble at each i

Use vector() to create an empty vector called ff that is of mode “list” and length 9. Now write a for() loop to loop over the 9 files in dfiles and for each (i) read the file in to a tibble, and change the column names to x and y, and (ii) copy the tibble to an element of your list ff.
dfiles is a directory which has different files.
This is what I did.
ff <- vector(mode = "list", length = 9)
length <- length(dfiles)
for (i in 1:length) {
study <- read_csv(dfiles[i])
names(study)[1] <- "x"
names(study)[2] <- "y"
ff[i] <- c(study)
print(head(ff[i]))
}
[[1]]
[1] -0.989532202 -0.052799402 0.823610903 -0.255509103 -0.220684347
[6] 0.307726791 -0.060013253 -0.555652890 -0.138615019 1.882839792
[11] 0.873668680 -0.914597073 -1.244917622 -0.359982241 1.328774701
[16] 0.292679118 -0.701505237 0.882234568 -0.133370389 -1.120678499
[21] 0.461192454 1.524142810 0.434468298 0.192000371 -0.656243128
[26] 0.568398531 -1.070570535 -1.653149024 -0.043352768 -0.034593506
[31] 2.365055532 -1.216347308 0.170906323 0.805053094 1.050592844
[36] -0.010724485 -0.743256141 -0.065784052 1.939755992 0.482739008
[41] -2.044477073 1.423459129 0.540502661 -0.033571772 -0.017863621
[46] -0.149789720 0.256559481 -0.503866933 0.277011252 -0.931356025
[51] 0.200146875 1.106837421 0.509206114 1.033749676 -1.090868762
[56] 0.054792784 0.617250303 -1.068004868 1.565814337 -1.034808011
[61] 0.164518709 0.151832330 0.121670302 -0.210424584 0.449936787
[66] -1.031164492 -1.289364188 -0.654568638 -0.057324104 1.256747820
[71] 1.587454140 0.319481463 0.381591623 -0.243644884 0.048053084
[76] -1.404545861 0.289933729 -0.535553582 0.334678773 -0.345981339
[81] -0.661615735 -0.219111377 -0.366904911 1.094578208 0.209208082
[86] 0.432491426 -1.240853586 1.496821710 0.159370441 -0.856281403
[91] 0.309046645 0.870434030 -1.383677138 1.690106970 -0.158030705
[96] 1.121170781 0.072261319 -0.332422845 -1.834920047 -1.100172219
[101] -0.041340300 0.827852545 -1.881678654 1.375441112 1.398990464
[106] -1.143316256 0.472300562 -1.033639213 -0.125199979 0.928662739
[111] 0.868339648 -0.849174604 -0.386636454 -0.976163571 0.339543660
[116] -1.559075164 -2.629325442 1.469812282 2.273472913 -0.455033540
[121] 0.761102487 -0.007502784 1.474313800
and the following error.
1: In ff[i] <- c(study) :
number of items to replace is not a multiple of replacement length
2: In ff[i] <- c(study) :
I was expecting that it'll still have column names so I am not sure how to fix it and where I am going wrong.

Was supposed to use double brackets.
ff[[i]] <- study would fix the problem.

Split a sequence of numbers into groups of 10 digits using R

I would like for R to read in the first 10,000 digits of Pi and group every 10 digits together
e.g., I want R to read in a sequence
pi <- 3.14159265358979323846264338327950288419716939937510582097...
and would like R to give me a table where each row contains 10 digit:
3141592653
5897932384
6264338327
...
I am new to R and really don't know where to start so any help would be much appreciated!
Thank you in advance

https://rextester.com/OQRM27791
p <- strsplit("314159265358979323846264338327950288419716939937510582097", "")
digits <- p[[1]]
split(digits, ceiling((1:length(digits)) / 10));

Here's one way to do it. It's fully reproducible, so just cut and paste it into your R console. The vector result is the first 10,000 digits of pi, split into 1000 strings of 10 digits.
For this many digits, I have used an online source for the precalculated value of pi. This is read in using readChar and the decimal point is stripped out with gsub. The resulting string is split into individual characters and put in a 1000 * 10 matrix (filled row-wise). The rows are then pasted into strings, giving the result. I have displayed only the first 100 entries of result for clarity of presentation.
pi_url <- "https://www.pi2e.ch/blog/wp-content/uploads/2017/03/pi_dec_1m.txt"
pi_char <- gsub("\\.", "", readChar(url, 1e4 + 1))
pi_mat <- matrix(strsplit(pi_char, "")[[1]], byrow = TRUE, ncol = 10)
result <- apply(pi_mat, 1, paste0, collapse = "")
head(result, 100)
#> [1] "3141592653" "5897932384" "6264338327" "9502884197" "1693993751"
#> [6] "0582097494" "4592307816" "4062862089" "9862803482" "5342117067"
#> [11] "9821480865" "1328230664" "7093844609" "5505822317" "2535940812"
#> [16] "8481117450" "2841027019" "3852110555" "9644622948" "9549303819"
#> [21] "6442881097" "5665933446" "1284756482" "3378678316" "5271201909"
#> [26] "1456485669" "2346034861" "0454326648" "2133936072" "6024914127"
#> [31] "3724587006" "6063155881" "7488152092" "0962829254" "0917153643"
#> [36] "6789259036" "0011330530" "5488204665" "2138414695" "1941511609"
#> [41] "4330572703" "6575959195" "3092186117" "3819326117" "9310511854"
#> [46] "8074462379" "9627495673" "5188575272" "4891227938" "1830119491"
#> [51] "2983367336" "2440656643" "0860213949" "4639522473" "7190702179"
#> [56] "8609437027" "7053921717" "6293176752" "3846748184" "6766940513"
#> [61] "2000568127" "1452635608" "2778577134" "2757789609" "1736371787"
#> [66] "2146844090" "1224953430" "1465495853" "7105079227" "9689258923"
#> [71] "5420199561" "1212902196" "0864034418" "1598136297" "7477130996"
#> [76] "0518707211" "3499999983" "7297804995" "1059731732" "8160963185"
#> [81] "9502445945" "5346908302" "6425223082" "5334468503" "5261931188"
#> [86] "1710100031" "3783875288" "6587533208" "3814206171" "7766914730"
#> [91] "3598253490" "4287554687" "3115956286" "3882353787" "5937519577"
#> [96] "8185778053" "2171226806" "6130019278" "7661119590" "9216420198"
Created on 2020-07-23 by the reprex package (v0.3.0)

We can use str_extract:
pi <- readLines("https://www.pi2e.ch/blog/wp-content/uploads/2017/03/pi_dec_1m.txt")
library(stringr)
t <- unlist(str_extract_all(sub("\\.","", pi), "\\d{10}"))
t[1:100]
[1] "3141592653" "5897932384" "6264338327" "9502884197" "1693993751" "0582097494" "4592307816" "4062862089"
[9] "9862803482" "5342117067" "9821480865" "1328230664" "7093844609" "5505822317" "2535940812" "8481117450"
[17] "2841027019" "3852110555" "9644622948" "9549303819" "6442881097" "5665933446" "1284756482" "3378678316"
[25] "5271201909" "1456485669" "2346034861" "0454326648" "2133936072" "6024914127" "3724587006" "6063155881"
[33] "7488152092" "0962829254" "0917153643" "6789259036" "0011330530" "5488204665" "2138414695" "1941511609"
[41] "4330572703" "6575959195" "3092186117" "3819326117" "9310511854" "8074462379" "9627495673" "5188575272"
[49] "4891227938" "1830119491" "2983367336" "2440656643" "0860213949" "4639522473" "7190702179" "8609437027"
[57] "7053921717" "6293176752" "3846748184" "6766940513" "2000568127" "1452635608" "2778577134" "2757789609"
[65] "1736371787" "2146844090" "1224953430" "1465495853" "7105079227" "9689258923" "5420199561" "1212902196"
[73] "0864034418" "1598136297" "7477130996" "0518707211" "3499999983" "7297804995" "1059731732" "8160963185"
[81] "9502445945" "5346908302" "6425223082" "5334468503" "5261931188" "1710100031" "3783875288" "6587533208"
[89] "3814206171" "7766914730" "3598253490" "4287554687" "3115956286" "3882353787" "5937519577" "8185778053"
[97] "2171226806" "6130019278" "7661119590" "9216420198"

Tuning SMOTE's K with a trafo fails: 'warning("k should be less than sample size!")'

I'm having trouble with the trafo function for SMOTE {smotefamily}'s K parameter. In particular, when the number of nearest neighbours K is greater than or equal to the sample size, an error is returned (warning("k should be less than sample size!")) and the tuning process is terminated.
The user cannot control K to be smaller than the sample size during the internal resampling process. This would have to be controlled internally so that if, for instance, trafo_K = 2 ^ K >= sample_size for some value of K, then, say, trafo_K = sample_size - 1.
I was wondering if there's a solution to this or if one is already on its way?
library("mlr3") # mlr3 base package
library("mlr3misc") # contains some helper functions
library("mlr3pipelines") # create ML pipelines
library("mlr3tuning") # tuning ML algorithms
library("mlr3learners") # additional ML algorithms
library("mlr3viz") # autoplot for benchmarks
library("paradox") # hyperparameter space
library("OpenML") # to obtain data sets
library("smotefamily") # SMOTE algorithm for imbalance correction
# get list of curated binary classification data sets (see https://arxiv.org/abs/1708.03731v2)
ds = listOMLDataSets(
number.of.classes = 2,
number.of.features = c(1, 100),
number.of.instances = c(5000, 10000)
)
# select imbalanced data sets (without categorical features as SMOTE cannot handle them)
ds = subset(ds, minority.class.size / number.of.instances < 0.2 &
number.of.symbolic.features == 1)
ds
d = getOMLDataSet(980)
d
# make sure target is a factor and create mlr3 tasks
data = as.data.frame(d)
data[[d$target.features]] = as.factor(data[[d$target.features]])
task = TaskClassif$new(
id = d$desc$name, backend = data,
target = d$target.features)
task
# Code above copied from https://mlr3gallery.mlr-org.com/posts/2020-03-30-imbalanced-data/
class_counts <- table(task$truth())
majority_to_minority_ratio <- class_counts[class_counts == max(class_counts)] /
class_counts[class_counts == min(class_counts)]
# Pipe operator for SMOTE
po_smote <- po("smote", dup_size = round(majority_to_minority_ratio))
# Random Forest learner
rf <- lrn("classif.ranger", predict_type = "prob")
# Pipeline of Random Forest learner with SMOTE
graph <- po_smote %>>%
po('learner', rf, id = 'rf')
graph$plot()
# Graph learner
rf_smote <- GraphLearner$new(graph, predict_type = 'prob')
rf_smote$predict_type <- 'prob'
# Parameter set in data table format
ps_table <- as.data.table(rf_smote$param_set)
View(ps_table[, 1:4])
# Define parameter search space for the SMOTE parameters
param_set <- ps_table$id %>%
lapply(
function(x) {
if (grepl('smote.', x)) {
if (grepl('.dup_size', x)) {
ParamInt$new(x, lower = 1, upper = round(majority_to_minority_ratio))
} else if (grepl('.K', x)) {
ParamInt$new(x, lower = 1, upper = round(majority_to_minority_ratio))
}
}
}
)
param_set <- Filter(Negate(is.null), param_set)
param_set <- ParamSet$new(param_set)
# Apply transformation function on SMOTE's K (= The number of nearest neighbors used for sampling new values. See SMOTE().)
param_set$trafo <- function(x, param_set) {
index <- which(grepl('.K', names(x)))
if (sum(index) != 0){
x[[index]] <- round(3 ^ x[[index]]) # Intentionally define a trafo that won't work
}
x
}
# Define and instantiate resampling strategy to be applied within pipeline
cv <- rsmp("cv", folds = 2)
cv$instantiate(task)
# Set up tuning instance
instance <- TuningInstance$new(
task = task,
learner = rf_smote,
resampling = cv,
measures = msr("classif.bbrier"),
param_set,
terminator = term("evals", n_evals = 3),
store_models = TRUE)
tuner <- TunerRandomSearch$new()
# Tune pipe learner to find optimal SMOTE parameter values
tuner$optimize(instance)
And here's what happens
INFO [11:00:14.904] Benchmark with 2 resampling iterations
INFO [11:00:14.919] Applying learner 'smote.rf' on task 'optdigits' (iter 2/2)
Error in get.knnx(data, query, k, algorithm) : ANN: ERROR------->
In addition: Warning message:
In get.knnx(data, query, k, algorithm) : k should be less than sample size!
Session info
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] smotefamily_1.3.1 OpenML_1.10 mlr3viz_0.1.1.9002
[4] mlr3tuning_0.1.2-9000 mlr3pipelines_0.1.2.9000 mlr3misc_0.2.0
[7] mlr3learners_0.2.0 mlr3filters_0.2.0.9000 mlr3_0.2.0-9000
[10] paradox_0.2.0 yardstick_0.0.5 rsample_0.0.5
[13] recipes_0.1.9 parsnip_0.0.5 infer_0.5.1
[16] dials_0.0.4 scales_1.1.0 broom_0.5.4
[19] tidymodels_0.0.3 reshape2_1.4.3 janitor_1.2.1
[22] data.table_1.12.8 forcats_0.4.0 stringr_1.4.0
[25] dplyr_0.8.4 purrr_0.3.3 readr_1.3.1
[28] tidyr_1.0.2 tibble_3.0.1 ggplot2_3.3.0
[31] tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] utf8_1.1.4 tidyselect_1.0.0 lme4_1.1-21
[4] htmlwidgets_1.5.1 grid_3.6.2 ranger_0.12.1
[7] pROC_1.16.1 munsell_0.5.0 codetools_0.2-16
[10] bbotk_0.1 DT_0.12 future_1.17.0
[13] miniUI_0.1.1.1 withr_2.2.0 colorspace_1.4-1
[16] knitr_1.28 uuid_0.1-4 rstudioapi_0.10
[19] stats4_3.6.2 bayesplot_1.7.1 listenv_0.8.0
[22] rstan_2.19.2 lgr_0.3.4 DiceDesign_1.8-1
[25] vctrs_0.2.4 generics_0.0.2 ipred_0.9-9
[28] xfun_0.12 R6_2.4.1 markdown_1.1
[31] mlr3measures_0.1.3-9000 rstanarm_2.19.2 lhs_1.0.1
[34] assertthat_0.2.1 promises_1.1.0 nnet_7.3-12
[37] gtable_0.3.0 globals_0.12.5 processx_3.4.1
[40] timeDate_3043.102 rlang_0.4.5 workflows_0.1.1
[43] BBmisc_1.11 splines_3.6.2 checkmate_2.0.0
[46] inline_0.3.15 yaml_2.2.1 modelr_0.1.5
[49] tidytext_0.2.2 threejs_0.3.3 crosstalk_1.0.0
[52] backports_1.1.6 httpuv_1.5.2 rsconnect_0.8.16
[55] tokenizers_0.2.1 tools_3.6.2 lava_1.6.6
[58] ellipsis_0.3.0 ggridges_0.5.2 Rcpp_1.0.4.6
[61] plyr_1.8.5 base64enc_0.1-3 visNetwork_2.0.9
[64] ps_1.3.0 prettyunits_1.1.1 rpart_4.1-15
[67] zoo_1.8-7 haven_2.2.0 fs_1.3.1
[70] furrr_0.1.0 magrittr_1.5 colourpicker_1.0
[73] reprex_0.3.0 GPfit_1.0-8 SnowballC_0.6.0
[76] packrat_0.5.0 matrixStats_0.55.0 tidyposterior_0.0.2
[79] hms_0.5.3 shinyjs_1.1 mime_0.8
[82] xtable_1.8-4 XML_3.99-0.3 tidypredict_0.4.3
[85] shinystan_2.5.0 readxl_1.3.1 gridExtra_2.3
[88] rstantools_2.0.0 compiler_3.6.2 crayon_1.3.4
[91] minqa_1.2.4 StanHeaders_2.21.0-1 htmltools_0.4.0
[94] later_1.0.0 lubridate_1.7.4 DBI_1.1.0
[97] dbplyr_1.4.2 MASS_7.3-51.4 boot_1.3-23
[100] Matrix_1.2-18 cli_2.0.1 parallel_3.6.2
[103] gower_0.2.1 igraph_1.2.4.2 pkgconfig_2.0.3
[106] xml2_1.2.2 foreach_1.4.7 dygraphs_1.1.1.6
[109] prodlim_2019.11.13 farff_1.1 rvest_0.3.5
[112] snakecase_0.11.0 janeaustenr_0.1.5 callr_3.4.1
[115] digest_0.6.25 cellranger_1.1.0 curl_4.3
[118] shiny_1.4.0 gtools_3.8.1 nloptr_1.2.1
[121] lifecycle_0.2.0 nlme_3.1-142 jsonlite_1.6.1
[124] fansi_0.4.1 pillar_1.4.3 lattice_0.20-38
[127] loo_2.2.0 fastmap_1.0.1 httr_1.4.1
[130] pkgbuild_1.0.6 survival_3.1-8 glue_1.4.0
[133] xts_0.12-0 FNN_1.1.3 shinythemes_1.1.2
[136] iterators_1.0.12 class_7.3-15 stringi_1.4.4
[139] memoise_1.1.0 future.apply_1.5.0
Many thanks.

I've found a workaround.
As pointed out earlier, the problem is that SMOTE {smotefamily}'s K cannot be greater than or equal to the sample size.
I dag into the process and disovered that SMOTE {smotefamily} uses knearest {smotefamily}, which uses knnx.index {FNN}, which in turn uses get.knn {FNN},
which is what returns the error warning("k should be less than sample size!") that terminates the tuning process in mlr3.
Now, within SMOTE {smotefamily}, the three arguments for knearest {smotefamily} are P_set, P_set and K. From an mlr3 resampling perspective,
data frame P_set is a subset of the cross-validation fold of the training data, filtered to only contain the records of the minority class. The 'sample size' that
the error is referring to is the number of rows of P_set.
Thus, it becomes more likely that K >= nrow(P_set) as K increases via a trafo such as some_integer ^ K (e.g. 2 ^ K).
We need to ensure that K will never be greater than or equal to P_set.
Here's my proposed solution:
Define a variable cv_folds before defining the CV resampling strategy with rsmp().
Define the CV resampling strategy where folds = cv_folds in rsmp(), before defining the trafo.
Instantiate the CV. Now, the dataset is split into training and test/valitation data in each fold.
Find the minimum sample size of the minority class among all training data folds and set that as the threshold for K:
smote_k_thresh <- 1:cv_folds %>%
lapply(
function(x) {
index <- cv$train_set(x)
aux <- as.data.frame(task$data())[index, task$target_names]
aux <- min(table(aux))
}
) %>%
bind_cols %>%
min %>%
unique
Now define the trafo as follows:
param_set$trafo <- function(x, param_set) {
index <- which(grepl('.K', names(x)))
if (sum(index) != 0){
aux <- round(2 ^ x[[index]])
if (aux < smote_k_thresh) {
x[[index]] <- aux
} else {
x[[index]] <- sample(smote_k_thresh - 1, 1)
}
}
x
}
In other words, when the trafoed K remains smaller than the sample size, keep it. Otherwise, set its value to be any number between 1 and smote_k_thresh - 1.
Implementation
Original code slightly modified to accommodate proposed tweaks:
library("mlr3learners") # additional ML algorithms
library("mlr3viz") # autoplot for benchmarks
library("paradox") # hyperparameter space
library("OpenML") # to obtain data sets
library("smotefamily") # SMOTE algorithm for imbalance correction
# get list of curated binary classification data sets (see https://arxiv.org/abs/1708.03731v2)
ds = listOMLDataSets(
number.of.classes = 2,
number.of.features = c(1, 100),
number.of.instances = c(5000, 10000)
)
# select imbalanced data sets (without categorical features as SMOTE cannot handle them)
ds = subset(ds, minority.class.size / number.of.instances < 0.2 &
number.of.symbolic.features == 1)
ds
d = getOMLDataSet(980)
d
# make sure target is a factor and create mlr3 tasks
data = as.data.frame(d)
data[[d$target.features]] = as.factor(data[[d$target.features]])
task = TaskClassif$new(
id = d$desc$name, backend = data,
target = d$target.features)
task
# Code above copied from https://mlr3gallery.mlr-org.com/posts/2020-03-30-imbalanced-data/
class_counts <- table(task$truth())
majority_to_minority_ratio <- class_counts[class_counts == max(class_counts)] /
class_counts[class_counts == min(class_counts)]
# Pipe operator for SMOTE
po_smote <- po("smote", dup_size = round(majority_to_minority_ratio))
# Define and instantiate resampling strategy to be applied within pipeline
# Do that BEFORE defining the trafo
cv_folds <- 2
cv <- rsmp("cv", folds = cv_folds)
cv$instantiate(task)
# Calculate max possible value for k-nearest neighbours
smote_k_thresh <- 1:cv_folds %>%
lapply(
function(x) {
index <- cv$train_set(x)
aux <- as.data.frame(task$data())[index, task$target_names]
aux <- min(table(aux))
}
) %>%
bind_cols %>%
min %>%
unique
# Random Forest learner
rf <- lrn("classif.ranger", predict_type = "prob")
# Pipeline of Random Forest learner with SMOTE
graph <- po_smote %>>%
po('learner', rf, id = 'rf')
graph$plot()
# Graph learner
rf_smote <- GraphLearner$new(graph, predict_type = 'prob')
rf_smote$predict_type <- 'prob'
# Parameter set in data table format
ps_table <- as.data.table(rf_smote$param_set)
View(ps_table[, 1:4])
# Define parameter search space for the SMOTE parameters
param_set <- ps_table$id %>%
lapply(
function(x) {
if (grepl('smote.', x)) {
if (grepl('.dup_size', x)) {
ParamInt$new(x, lower = 1, upper = round(majority_to_minority_ratio))
} else if (grepl('.K', x)) {
ParamInt$new(x, lower = 1, upper = round(majority_to_minority_ratio))
}
}
}
)
param_set <- Filter(Negate(is.null), param_set)
param_set <- ParamSet$new(param_set)
# Apply transformation function on SMOTE's K while ensuring it never equals or exceeds the sample size
param_set$trafo <- function(x, param_set) {
index <- which(grepl('.K', names(x)))
if (sum(index) != 0){
aux <- round(5 ^ x[[index]]) # Try a large value here for the sake of the example
if (aux < smote_k_thresh) {
x[[index]] <- aux
} else {
x[[index]] <- sample(smote_k_thresh - 1, 1)
}
}
x
}
# Set up tuning instance
instance <- TuningInstance$new(
task = task,
learner = rf_smote,
resampling = cv,
measures = msr("classif.bbrier"),
param_set,
terminator = term("evals", n_evals = 10),
store_models = TRUE)
tuner <- TunerRandomSearch$new()
# Tune pipe learner to find optimal SMOTE parameter values
tuner$optimize(instance)
# Here are the original K values
instance$archive$data
# And here are their transformations
instance$archive$data$opt_x

Vectorize over all combinations of arguments

Is there a way to vectorize an R function over all combinations of multiple parameters and return the result as a list?
As an example, using Vectorize over rnorm produces the following, but I would like to have a list of vectors corresponding to each combination of the arguments (so it should return a list of 60 vectors instead of just 5):
> vrnorm <- Vectorize(rnorm)
> vrnorm( 10*1:5, mean = 1:4, sd = 1:3)
[[1]]
[1] 1.37858918 -0.85432372 1.87321175 2.08362291 0.02950438 1.67967249
[7] 2.25954748 1.44031251 0.09816078 0.91365201
[[2]]
[1] 1.7717267 1.7961157 2.3291686 2.6114272 2.6228930 -0.2580403
[7] 3.3232109 -0.4652434 -0.4803258 -0.1170871 0.1158350 -1.0902252
[13] -0.6400934 3.6625290 2.5924096 4.5878564 0.7265718 3.2034281
[19] -0.2499768 2.0164275
[[3]]
[1] 5.8251252 3.1089121 2.8893594 2.9079357 1.9308677 4.3359878
[7] -0.3668157 4.9728508 -0.6494110 6.7729562 6.1623976 -0.1696638
[13] 5.4664038 3.8141798 -3.1842879 2.3985010 0.3840465 4.0696628
[19] 4.8217798 3.3135100 4.9028273 3.6193840 4.8861864 3.9871897
[25] -0.1059491 3.8961742 4.8293925 3.8935335 6.3194862 4.7846143
[[4]]
[1] 3.737043 2.849215 4.611868 3.494396 2.909659 4.861474 2.000194 3.343171
[9] 4.019523 3.277575 3.885272 3.331160 4.581551 4.960162 3.061960 5.359514
[17] 4.651848 3.640535 3.612368 4.338019 5.233665 3.585976 4.018191 4.320883
[25] 2.598541 3.519587 5.231375 4.733647 2.493334 2.791483 4.330052 2.498424
[33] 3.317115 3.515012 5.079780 4.720884 3.055191 5.262385 1.939961 4.779480
[[5]]
[1] 4.31697756 0.93754587 3.96698522 -0.03680018 1.94987430 1.73985617
[7] -1.42300550 2.07764933 0.45701395 2.42548257 0.67745524 -2.42054060
[13] 1.14655845 1.60277193 -1.04636658 0.94097335 3.07688803 0.58049012
[19] 1.25812532 1.91613097 -2.95408979 3.00990345 -0.67314868 0.64746260
[25] 1.69640497 0.68493689 2.84261574 1.65290227 4.16990548 -3.30426803
[31] 3.80508273 5.95888355 -0.09021591 3.88157980 -1.19166351 2.70208228
[37] -0.56278834 -0.83943824 -0.86868493 -1.19995506 -2.30275483 1.70435276
[43] 2.67984044 -0.04976799 0.98716782 2.71171575 5.21648742 0.13860495
[49] 1.61038570 0.50679460

Use expand.grid to expand all arguments and create a data frame, and then use mapply.
dat <- expand.grid(n = 10 * 1:5, mean = 1:4, sd = 1:3)
mapply(rnorm, dat$n, dat$mean, dat$sd, SIMPLIFY = FALSE)

You can also use purrr::pmap() as an alternative to mapply
library(purrr)
dat <- expand.grid(n = 10 * 1:5, mean = 1:4, sd = 1:3)
pmap(dat, rnorm)

Error in svd(X) : infinite or missing values in 'x' . when using BacktestVaR in GAS

I want to use BacktestVaR function in GAS package
the data file is here returns
returns1 <- return[,-1]
BacktestVaR(returns1,0.9998714,0.05)
When I run the above code I get:-
"Error in svd(X) : infinite or missing values in 'x'"
Can some one please help me with this?

I guess there's a problem with your VaR argument. It expects a numeric vector containing VaR series and you are passing 0.9998714.
There's an example given in the package documentation which you can have a look at.
I tried an example given in the package on your data and it worked fine for me.
library(GAS)
returns1 <- return[,-1]
Forecast = UniGASFor(Fit, Roll = TRUE, out = returns1)
alpha = 0.05
VaR = quantile(Forecast, alpha)
BacktestVaR(returns1, VaR, alpha)
It gave an output:-
$LRuc
Test Pvalue
1.508023e+01 1.030369e-04
$LRcc
Test Pvalue
1.508023e+01 5.313369e-04
$AE
[1] 0
$AD
ADmean ADmax
NaN -Inf
$DQ
$DQ$stat
[,1]
[1,] 7.526316
$DQ$pvalue
[,1]
[1,] 0.3762069
$Loss
$Loss$Loss
[1] 0.02396477
$Loss$LossSeries
[1] 0.099206718 0.089934791 0.085306377 0.083944161 0.078122362 0.073599847 0.061635033
[8] 0.064839991 0.059715813 0.065246570 0.061809442 0.055136995 0.052664206 0.052750109
[15] 0.048890123 0.043372077 0.043033141 0.045092669 0.042304367 0.037946743 0.041365010
[22] 0.037392398 0.041342509 0.037154130 0.034005277 0.035064542 0.024833787 0.032198291
[29] 0.026951766 0.032110073 0.025078754 0.019482687 0.029475914 0.031723679 0.022555941
[36] 0.012634228 0.020209926 0.028950001 0.026904561 0.028708228 0.031490940 0.031142592
[43] 0.039528132 0.023939675 0.036555585 0.025622543 0.030231260 0.020470378 0.028312997
[50] 0.025243985 0.018476646 0.022936782 0.024340429 0.021230794 0.019276576 0.023544289
[57] 0.019724022 0.021008776 0.022342456 0.019971455 0.018544509 0.017889817 0.010320351
[64] 0.013567978 0.023370654 0.018427862 0.013352942 0.015784444 0.016032580 0.013898405
[71] 0.016405078 0.021259721 0.009921251 0.013944924 0.022224791 0.019584060 0.016001481
[78] 0.017540380 0.006435535 0.018837333 0.013470815 0.015819393 0.021200104 0.014361778
[85] 0.017106075 0.017225547 0.012276949 0.011625625 0.011784346 0.018752417 0.014791428
[92] 0.011591563 0.012849788 0.011635476 0.017176090 0.018051420 0.014120935 0.012636333
[99] 0.009913736 0.017225520 0.015406386 0.012874489 0.016547533 0.014883970 0.012906750
[106] 0.017762996 0.013358853 0.014217855 0.013441140 0.010019856 0.015160385 0.011431101
[113] 0.009502256 0.008921462 0.013421278 0.010276422 0.012234584 0.007779987 0.009893465
[120] 0.013416031 0.013245883 0.009740190 0.006903344 0.007681396 0.018183227 0.012966043
[127] 0.013923885 0.012345783 0.014619745 0.013296054 0.011492134 0.010751146 0.006154623
[134] 0.011448771 0.014871403 0.010247001 0.012144674 0.012527776 0.013672466 0.008994635
[141] 0.012822531 0.008867439 0.011508661 0.012899977 0.009832727 0.013247198 0.009932820
Warning message:
In max(series) : no non-missing arguments to max; returning -Inf

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Error while using sparse model matrix - r

Related

Using R, How do I copy the tibble to an element of the list. for example, each element like ff[i] have a nibble at each i

Split a sequence of numbers into groups of 10 digits using R

Tuning SMOTE's K with a trafo fails: 'warning("k should be less than sample size!")'

Vectorize over all combinations of arguments

Error in svd(X) : infinite or missing values in 'x' . when using BacktestVaR in GAS

Categories

Resources