I am trying to conduct a 5-fold cross validation on a generalized linear mixed model using the groupdata2 and cvms packages. This is the code I tried to run:
data <- groupdata2::fold(detect, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)

cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  control = NULL,
  REML = FALSE)
This is the error I receive:
Error in groupdata2::fold(detect, k = 4, cat_col = "outcome", id_col = "bird") %>% :
1 assertions failed:
* The value in 'data[[cat_col]]' must be constant within each ID.
In the package vignette, the following explanation is given: "A participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds." This makes sense in the example. However, my data is based on the outcome of resighting birds, so outcome varies depending on whether the bird was observed on that particular survey. Is there a way around this?
Reproducible example:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
data <- groupdata2::fold(df, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)
The full documentation says:
cat_col: Name of categorical variable to balance between folds.
E.g. when predicting a binary variable (a or b), we usually
want both classes represented in every fold.
N.B. If also passing an ‘id_col’, ‘cat_col’ should be
constant within each ID.
So in this case, where outcome varies within individual birds (id_col), you simply can't specify that the folds be balanced with respect to the outcome. (I don't 100% understand this constraint in the software: it seems it should be possible to do at least approximate balancing by selecting groups (birds) with a balanced range of outcomes, but I can see how it could make the balancing procedure a lot harder.)
In my opinion, though, the importance of balancing outcomes is somewhat overrated in general. Lack of balance means that some of the simpler metrics in ?binomial_metrics (e.g. accuracy, sensitivity, specificity) are not very useful, but others (balanced accuracy, AUC, AIC) should be fine.
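To make that concrete, here is a sketch of restricting cross_validate() to those metrics via its metrics argument; the exact argument names in binomial_metrics() are from my reading of its help page, so double-check them against your installed cvms version:
# Sketch: only compute imbalance-robust metrics (argument names assumed from
# ?binomial_metrics; verify them before relying on this)
cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  metrics = cvms::binomial_metrics(
    all = FALSE,
    balanced_accuracy = TRUE,
    auc = TRUE,
    aic = TRUE
  )
)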
A potentially greater problem is that you appear to have crossed random effects (i.e. (1|bird) + (1|obsname)). I'm guessing obsname is the name of an observer: if some observers detected (or failed to detect) multiple birds, and some birds were detected (or missed) by multiple observers, then there may be no way to define folds that are truly independent, or at least it may be very difficult.
You may be able to use the new collapse_groups() function in groupdata2 v2.0.0 instead of fold() for this. It allows you to take existing groups (e.g. bird) and collapse them to fewer groups (e.g. folds) with attempted balancing of multiple categorical columns, numeric columns, and factor columns (by number of unique levels, though the same levels might appear in multiple groups).
It does not have the constraint that fold() has with regard to changing outcomes, but on the other hand it does not come with the same "guarantees" in the "non-changing outcome" context. E.g. it does not guarantee at least one of each outcome level in every fold.
You need more birds than the number of folds though, so I've added a few to the test data:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,
4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1,0,1,
0,1,1,0,1,1,0,0,1,1,0,0,1,0,0,1,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
# Combine 'bird' groups to folds
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_cols = "outcome",
  col_name = ".folds"
) %>%
  arrange(.folds)
# Check the balance of the relevant columns
groupdata2::summarize_balances(
  data = data,
  group_cols = ".folds",
  cat_cols = "outcome"
)$Groups
> # A tibble: 3 × 6
> .group_col .group `# rows` `# bird` `# outc_0` `# outc_1`
> <fct> <fct> <int> <int> <dbl> <dbl>
> 1 .folds 1 14 3 7 7
> 2 .folds 2 10 2 6 4
> 3 .folds 3 10 2 4 6
summarize_balances() shows us that we created 3 folds, with 14 rows in the first fold and 10 in each of the others. There are 3 unique bird levels in the first fold and 2 in the others (the count is normally only of unique levels within each group, but here we know each bird ends up in exactly one fold, since that is how collapse_groups() works with its group_cols argument).
The outcome variable (here # outc_0 and # outc_1) is fairly well balanced across the folds.
With larger datasets, you might want to run multiple collapsings and choose the one with the best balance from the summary. That can be done by adding num_new_group_cols = 10 to collapse_groups() (for even better results, enable the auto_tune setting) and then listing all the created group columns when running summarize_balances().
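A sketch of that workflow follows; the names of the generated fold columns are an assumption on my part, so check names(data) after the call:
# Sketch: create 10 candidate fold columns and compare their balance.
# With num_new_group_cols = 10 and col_name = ".folds", the generated columns
# should be named along the lines of .folds_1 ... .folds_10 (check names(data)).
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_cols = "outcome",
  col_name = ".folds",
  num_new_group_cols = 10,
  auto_tune = TRUE          # slower, but tends to improve the balance
)

# Summarize all candidate fold columns and pick the best-balanced one
groupdata2::summarize_balances(
  data = data,
  group_cols = paste0(".folds_", 1:10),
  cat_cols = "outcome"
)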
Hope this helps you or others in a similar position. The constraint in fold() is hard to get around with its current internal approach, but collapse_groups() hopefully does the trick in those cases.
See more at https://rdrr.io/cran/groupdata2/man/collapse_groups.html
I get different numeric values for the Area Under the Precision-Recall Curve (PRAUC) on the dataset I am working on when computing it with two different R packages: yardstick and caret.
I am afraid I was not able to reproduce this mismatch with synthetic data, only with my dataset (which is strange in itself).
In order to make this reproducible, I am sharing the prediction output of my model, you can download it here https://drive.google.com/open?id=1LuCcEw-RNRcdz6cg0X5bIEblatxH4Rdz (don't worry, it's a small csv).
The csv contains a dataframe with 4 columns:
yes: probability estimate of being in class yes
no: 1 - yes
obs: actual class label
pred: predicted class label (with .5 threshold)
Here is the code that produces the two PRAUC values:
require(data.table)
require(yardstick)
require(caret)
pr <- fread('pred_sample.csv')
# transform to factors
# put the positive class in the first level
pr[, obs := factor(obs, levels = c('yes', 'no'))]
pr[, pred := factor(pred, levels = c('yes', 'no'))] # this is actually not needed
# compute yardstick PRAUC
pr_auc(pr, obs, yes) # 0.315
# compute caret PRAUC
prSummary(pr, lev = c('yes', 'no')) # 0.2373
I could understand a small difference due to the approximation made when computing the area (interpolating the curve), but this discrepancy seems far too large.
I even tried a third package, PRROC, and the result is different again, around 0.26.
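For reference, the PRROC computation was along these lines (I am reconstructing the call, so treat it as a sketch; pr.curve() takes the scores of the positive and negative observations as two separate vectors):
library(PRROC)
# Sketch: split the 'yes' probabilities by the true class and integrate the PR curve
pr.curve(
  scores.class0 = pr[obs == "yes", yes],  # scores of the actual positives
  scores.class1 = pr[obs == "no", yes]    # scores of the actual negatives
)$auc.integral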
I want to measure feature importance with the cforest function from the party package.
My outcome variable has roughly 2000 samples in class 0 and 100 samples in class 1.
I think a good way to avoid bias due to class imbalance is to train each tree of the forest on a subsample in which the number of elements of class 1 equals the number of elements of class 0.
Is there any way to do that? I am thinking of an option like n_samples = c(20, 20)
EDIT:
An example of code
> iris.cf <- cforest(Species ~ ., data = iris,
+ control = cforest_unbiased(mtry = 2)) #<--- Here I would like to train the forest using a balanced subsample of the data
> varimp(object = iris.cf)
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.048981818 0.002254545 0.305818182 0.271163636
>
EDIT:
Maybe my question is not clear enough.
A random forest is a set of decision trees. In general, each decision tree is constructed using only a random subsample of the data. I would like the subsample used for each tree to contain the same number of elements from class 1 and class 0.
EDIT:
The feature I am looking for is definitely available in the randomForest package:
sampsize
Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
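For comparison, this is roughly how that stratified, balanced per-tree sampling looks in randomForest (mydata and outcome are placeholders for your own data frame and class column):
library(randomForest)

# Sketch: draw 100 rows from each class (without replacement) for every tree,
# matching the minority-class size. 'mydata' and 'outcome' are placeholders.
rf <- randomForest(
  outcome ~ .,
  data = mydata,
  strata = mydata$outcome,
  sampsize = c(100, 100),   # one entry per class (stratum)
  replace = FALSE
)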
I need the same for the party package. Is there any way to get it?
I will assume you know what you want to accomplish, but don't know enough R to do it.
I am not sure whether the function offers balancing of the data as an argument, but you can do it manually. Below is code I quickly threw together; a more elegant solution might exist.
library(party)

# just in case
myData <- iris

# replicate everything *10* times. replicate() is just a "loop 10 times".
replicate(10, {
  # split the dataset by class into a list of data frames
  splitList <- split(myData, myData$Species)
  # draw *20* random rows from each class
  sampledList <- lapply(splitList, function(dat) { dat[sample(nrow(dat), 20), ] })
  # combine the sampled rows into one data.frame
  sampledData <- do.call(rbind, sampledList)
  # your code below
  res.cf <- cforest(Species ~ ., data = sampledData,
                    control = cforest_unbiased(mtry = 2))
  varimp(object = res.cf)
})
Hope you can take it from here.
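One small follow-up note: varimp() returns a named numeric vector, so if you assign the replicate() result to a variable (say vi), you get a matrix with one column per repetition and can average across repetitions, for example:
# Hypothetical follow-up, assuming the replicate() block above was assigned to `vi`:
# `vi` is then a matrix with one row per predictor and one column per repetition.
rowMeans(vi)   # mean variable importance across the 10 repetitions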
I use the ‘monmlp’ package (monotone multi-layer perceptron neural network) in R as follows.
library(monmlp)
library(ROCR)  # for prediction() and performance()

model <- monmlp.fit(trainData, trainLabs, hidden1 = 3, n.ensemble = 1, bag = FALSE, silent = TRUE)
pred  <- monmlp.predict(testData, model)

preds <- as.numeric(pred)
labs  <- as.numeric(testLabs)
pr <- prediction(preds, labs)
pf <- performance(pr, "auc")
pf@y.values[[1]]
I want to predict some new data with the trained model and keep the instances whose predicted value is higher than a threshold such as 0.9.
In brief, I want to select the instances that are most likely to be in class 1 using a threshold.
The classes are 0 and 1, and
pred = monmlp.predict(testData,model)
head(pred)
returns
[,1]
311694 0.005271582
129347 0.005271582
15637 0.005271582
125458 0.005271582
315130 0.010411831
272375 0.010411831
What are these values? Probability values?
If yes, what do these values mean?
pred[which(pred>1)]
[1] 1023.839 1023.839 1023.839
Thanks.
Regarding the output: "a matrix with number of rows equal to the number of samples and number of columns equal to the number of predictand variables. If weights is from an ensemble of models, the matrix is the ensemble mean and the attribute ensemble contains a list with predictions for each ensemble member."
Source:
http://cran.r-project.org/web/packages/monmlp/monmlp.pdf
I've never used the package nor the technique, but maybe the quoted documentation means something to you.
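If the practical goal is just to keep the rows whose predicted value exceeds a threshold such as 0.9, a minimal sketch would be the following, assuming pred is the one-column matrix returned by monmlp.predict() and is row-aligned with testData:
# Minimal sketch: keep the rows of testData whose prediction exceeds 0.9.
# Whether these values are really on a probability scale depends on how the
# network was fitted (e.g. the output-layer transfer function), so verify that first.
keep <- pred[, 1] > 0.9
selected <- testData[keep, , drop = FALSE]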
I was wondering if there is any package for generating a D-efficient balanced design with R.
I tried the AlgDesign package, but I did not manage to get a balanced design.
I am not sure if this is because the full factorial of the attributes and levels I am considering is small (32), or because balanced designs are outside the scope of AlgDesign.
I have 3 attributes: two attributes have four levels, and one has two levels:
att 1: 4 levels
att 2: 4 levels
att 3: 2 levels
I use the following R code:
library(AlgDesign)
#-----------------------------
# define attributes and levels
#-----------------------------
desVarNames <- c("esource", "certified", "cost")
desLevels <- c(4,2, 4)
n <- 6 #number of choice sets
desOpt <- 4 #num option per choice set
set.seed(123456)
#generate full factorial
dat <- gen.factorial(desLevels, length(desLevels), varNames = desVarNames, center = TRUE)
destT <- optFederov(~., dat, nTrials = (n * desOpt), criterion = "D")
destT
First, when you say "balanced" design, I suspect you mean that each factor level should appear the same number of times in the final design. (Actually, your variable "dat" above is a balanced design, too, since each possible combination appears only once).
The way optFederov works is by randomly selecting and replacing trials using Federov's exchange algorithm. As such, every time a trial is exchanged with another candidate trial, an initially balanced design will become unbalanced: if a trial "balances" a design, replacing it with any other trial will unbalance the design. If the algorithm were forced to keep the design balanced at every step, no replacement would be possible and it would get stuck.
Not only is a balanced design incompatible with the way Federov's exchange algorithm works, it is actually not desirable from a D-efficiency point of view.
For instance, if you have 4 factors with 2, 3, 5 and 7 levels respectively, the only way to have a balanced design is to include all 2*3*5*7 = 210 trials, whereas AlgDesign only suggests 19.
data = gen.factorial(c(2,3,5,7), factors = "all")
trials = optFederov(data = data, center = FALSE, criterion = "D")
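As a quick sanity check of how (un)balanced the returned design is, you can tabulate the level counts of each factor in its $design component, e.g.:
# Count how often each level of each factor appears in the selected runs;
# a balanced design would show equal counts within every factor.
lapply(trials$design, table)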
In short, "balanced" and "D-efficient" are largely antagonistic design goals.