R caret createFolds vs. createMultiFolds discrepancies

I create the folds for a cross-validation with caret.
I discovered a discrepancy between the functions createFolds and createMultiFolds. It seems to me that createFolds samples without replacement, which is the correct behaviour as far as I understand. createMultiFolds appears to have two flaws: first, it uses replacement, and second, it has many more observations per fold than expected.
Does anyone know why these discrepancies occur, or do I have to specify it differently? In the end I would like to use a repeated cross-validation.
Here is a MWE:
library(caret)
data(mtcars)
set.seed(123)
folds <- createMultiFolds(y = mtcars$am, k = 5, times = 5)
set.seed(123)
folds <- createFolds(mtcars$am, k = 5)
The output is as follows:
createMultiFolds (only the first 5 folds):
Fold1.Rep1 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 18 20 22 23 24 25 26 27 29 30 31
Fold2.Rep1 1 2 3 5 6 7 8 9 11 12 14 16 17 18 19 20 21 22 23 24 25 28 29 31 32
Fold3.Rep1 2 4 5 6 7 8 9 10 11 12 13 15 17 18 19 20 21 23 26 27 28 29 30 31 32
Fold4.Rep1 1 2 3 4 5 6 7 10 13 14 15 16 17 18 19 21 22 23 24 25 26 27 28 29 30 32
Fold5.Rep1 1 3 4 5 8 9 10 11 12 13 14 15 16 17 19 20 21 22 24 25 26 27 28 30 31 32
createFolds:
Fold1 5 17 19 21 28 32
Fold2 4 10 13 15 26 27 30
Fold3 1 3 14 16 22 24 25
Fold4 8 9 11 12 20 31
Fold5 2 6 7 18 23 29

If you inspect the source code of createMultiFolds, you will see that it calls createFolds with returnTrain = TRUE. From the documentation,
returnTrain: a logical. When true, the values returned are the sample
positions corresponding to the data used during training.
This argument only works in conjunction with ‘list = TRUE’
Therefore, if you call createFolds with returnTrain = TRUE, the two functions return the same folds:
> library(caret)
> data(mtcars)
> set.seed(123)
> multiFolds <- createMultiFolds(y = mtcars$am, k = 5, times = 2)
> set.seed(123)
> folds1 <- createFolds(mtcars$am, k = 5, returnTrain = TRUE)
> folds2 <- createFolds(mtcars$am, k = 5, returnTrain = TRUE)
> all(multiFolds$Fold1.Rep1 == folds1$Fold1)
[1] TRUE
> all(multiFolds$Fold2.Rep1 == folds1$Fold2)
[1] TRUE
> all(multiFolds$Fold3.Rep1 == folds1$Fold3)
[1] TRUE
> all(multiFolds$Fold4.Rep1 == folds1$Fold4)
[1] TRUE
> all(multiFolds$Fold5.Rep1 == folds1$Fold5)
[1] TRUE
> all(multiFolds$Fold1.Rep2 == folds2$Fold1)
[1] TRUE
> all(multiFolds$Fold2.Rep2 == folds2$Fold2)
[1] TRUE
> all(multiFolds$Fold3.Rep2 == folds2$Fold3)
[1] TRUE
> all(multiFolds$Fold4.Rep2 == folds2$Fold4)
[1] TRUE
> all(multiFolds$Fold5.Rep2 == folds2$Fold5)
[1] TRUE
createMultiFolds has two flaws, first it uses replacement [...]
Where did you get this from? If you’re talking about the 1’s, the first one is part of the name: Fold1.Rep1, Fold2.Rep1, …, Fold{k}.Rep{times}.

As noted in the question, createFolds() splits the data into k folds. However, the output from the function is a list of observation indices that are held out from each fold, not the rows included in each fold. We can see this by creating a table of all the fold data as follows.
set.seed(123)
folds <- createFolds(mtcars$am, k = 5)
table(unlist(folds))
...and the output:
> table(unlist(folds))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
27 28 29 30 31 32
1 1 1 1 1 1
If we use the returnTrain = TRUE argument with createFolds(), it returns the index of observations included in each fold, as illustrated in the other answer. For k = 5, we expect each observation to be used in 4 of the folds, and confirm this with the following code.
set.seed(123)
folds <- createFolds(mtcars$am, k = 5, returnTrain = TRUE)
table(unlist(folds))
...and the output:
> table(unlist(folds))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
27 28 29 30 31 32
4 4 4 4 4 4
As noted in the other answer, setting returnTrain = TRUE causes createFolds() to return the same output as createMultiFolds() with times = 1. We can illustrate that each observation is used in 4 of the 5 folds as follows.
set.seed(123)
folds1 <- createMultiFolds(y = mtcars$am, k = 5, times = 1)
table(unlist(folds1))
...and the output:
> table(unlist(folds1))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
27 28 29 30 31 32
4 4 4 4 4 4
We can compare the contents of folds and folds1 element by element with lapply() and all() as follows.
# compare folds to folds1
lapply(1:5,function(x){
all(folds1[[x]] == folds[[x]])
})
[[1]]
[1] TRUE
[[2]]
[1] TRUE
[[3]]
[1] TRUE
[[4]]
[1] TRUE
[[5]]
[1] TRUE
If we set times = 2, we expect each observation to be included in 8 of the 10 folds.
set.seed(123)
folds <- createMultiFolds(y = mtcars$am, k = 5, times = 2)
table(unlist(folds))
...and the output:
> table(unlist(folds))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
27 28 29 30 31 32
8 8 8 8 8 8
CONCLUSIONS: in both functions caret samples so that each observation is held out exactly once across the k folds within each repetition (times =), subject to the constraint that the values of the dependent variable passed to the function are distributed proportionally between the in-sample and out-of-sample components of each fold.
In the case of a small data set such as mtcars, it's not easy for the algorithm to split effectively, as we can see when we run tables to compare in sample / holdout vs. mtcars$am.
set.seed(123)
folds <- createFolds(mtcars$am, k = 5)
table(unlist(folds))
lapply(folds,function(x){
holdout <- rep(FALSE,nrow(mtcars))
holdout[x] <- TRUE
table(holdout,mtcars$am)
})
$Fold1
holdout 0 1
FALSE 16 10
TRUE 3 3
$Fold2
holdout 0 1
FALSE 15 10
TRUE 4 3
$Fold3
holdout 0 1
FALSE 14 11
TRUE 5 2
$Fold4
holdout 0 1
FALSE 15 11
TRUE 4 2
$Fold5
holdout 0 1
FALSE 16 10
TRUE 3 3
Each fold contains 6 or 7 observations in the hold out set, with a minimum of 2 manual transmission cars (am = 1) in each hold out set.
With default arguments, createFolds() returns the indexes of held out observations rather than included observations. createFolds(x,k,returnTrain=TRUE) behaves exactly the same as createMultiFolds(x,k,times=1).
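To make the relationship between the two kinds of index concrete, here is a minimal base R sketch. It does not use caret and does not stratify on the outcome (both are simplifications on my part), and held_out and train are hypothetical names. It only illustrates that the training indices of a fold are the complement of its held-out indices.

```r
set.seed(123)
n <- 32  # number of rows, as in mtcars
k <- 5

# Assign each row to one of k folds at random (plain, unstratified assignment).
fold_id <- sample(rep_len(seq_len(k), n))

# Held-out indices per fold: the kind of index createFolds() returns by default.
held_out <- split(seq_len(n), fold_id)

# Training indices per fold: the complement, analogous to returnTrain = TRUE.
train <- lapply(held_out, function(idx) setdiff(seq_len(n), idx))

table(unlist(held_out))  # every row is held out exactly once
table(unlist(train))     # every row is used for training in k - 1 folds
```

The larger fold sizes seen with createMultiFolds() are therefore expected: each list element holds roughly (k - 1) / k of the rows, not 1 / k.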

Related

R: How to apply a sliding conditional branch to consecutive values in sequential data

I want to apply a conditional statement to consecutive values in a sliding manner.
For example, I have a dataset like this:
data <- data.frame(ID = rep.int(c("A","B"), times = c(24, 12)),
                   time = c(1:24,1:12),
                   visit = as.integer(runif(36, min = 0, max = 20)))
which gives the table below:
> data
ID time visit
1 A 1 7
2 A 2 0
3 A 3 6
4 A 4 6
5 A 5 3
6 A 6 8
7 A 7 4
8 A 8 10
9 A 9 18
10 A 10 6
11 A 11 1
12 A 12 13
13 A 13 7
14 A 14 1
15 A 15 6
16 A 16 1
17 A 17 11
18 A 18 8
19 A 19 16
20 A 20 14
21 A 21 15
22 A 22 19
23 A 23 5
24 A 24 13
25 B 1 6
26 B 2 6
27 B 3 16
28 B 4 4
29 B 5 19
30 B 6 5
31 B 7 17
32 B 8 6
33 B 9 10
34 B 10 1
35 B 11 13
36 B 12 15
I want to flag each ID based on runs of consecutive "visit" values.
If "visit" stays below 10 for 6 consecutive time points, I'd label the ID "empty", and "busy" otherwise.
In the data above, "A" is continuously below 10 from rows 1 to 6, so it is "empty". On the other hand, "B" never has 6 consecutive single-digit values, so it is "busy".
If the condition is not fulfilled in one segment of 6 values, I want to apply it to the next segment of 6 values.
I'd like to achieve this using R. Any advice will be appreciated.
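One possible approach, sketched under the assumption that "sliding" means any window of 6 consecutive time points within an ID counts (so an ID is "empty" as soon as it contains a run of at least 6 values below 10). flag_visits() is a hypothetical helper name, and the data are copied from the printed table so the example is reproducible.

```r
# Rebuild the example data from the printed table in the question.
visit_a <- c(7, 0, 6, 6, 3, 8, 4, 10, 18, 6, 1, 13,
             7, 1, 6, 1, 11, 8, 16, 14, 15, 19, 5, 13)
visit_b <- c(6, 6, 16, 4, 19, 5, 17, 6, 10, 1, 13, 15)
data <- data.frame(ID = rep(c("A", "B"), times = c(24, 12)),
                   time = c(1:24, 1:12),
                   visit = c(visit_a, visit_b))

# Hypothetical helper: "empty" if some run of values below `threshold`
# has length >= `run_length`, "busy" otherwise.
flag_visits <- function(visit, threshold = 10, run_length = 6) {
  runs <- rle(visit < threshold)  # lengths of consecutive TRUE/FALSE stretches
  if (any(runs$values & runs$lengths >= run_length)) "empty" else "busy"
}

# Apply the helper to each ID separately.
flags <- vapply(split(data$visit, data$ID), flag_visits, character(1))
flags
```

If non-overlapping blocks of 6 are what is wanted instead, the rle() step would need to be replaced by a check on each block; this sketch does not cover that reading.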

R: take 2 random non-overlapping samples (for same indexes) of length n out of vector of length n as well

Say I have a vector named all_combinations with numbers from 1 to 20.
I need to extract 2 vectors (coding_1 and coding_2) of length equal to number_of_peptide_clusters, which happens to be 20 as well in my current case.
The 2 new vectors should be randomly sampled from all_combinations so that they never hold the same value at the same index position.
I do the following:
set.seed(3)
all_combinations=1:20
number_of_peptide_clusters=20
coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_1
[1] 5 12 7 4 10 8 11 15 17 16 18 13 9 20 2 14 19 1 3 6
coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_2
[1] 5 9 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
This is the example that gives me trouble, because only one number overlaps at the same index (5 at position 1).
What I would normally do in such cases is spot the overlapping numbers and resample them from the list of all overlapping numbers...
Imagine coding_1 and coding_2 were:
coding_1
[1] 5 9 7 4 10 8 11 15 17 16 18 13 12 20 2 14 19 1 3 6
coding_2
[1] 5 9 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
In this case I would have 5 and 9 overlapping in the same position, so I would resample them in coding_2 out of the full list of overlapping ones [resample index 1 from c(5,9) so that isn't equal to 5, and index 2 so it isn't equal to 9]. So coding_2 would be:
coding_2
[1] 9 5 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
However, in the particular case above, I cannot use this approach, because there is only one overlapping number to resample from... So what would be the best way to obtain 2 samples of length 20 from a vector of length 20, so that the samples do not overlap at the same index positions?
It would be great if I could obtain the second sample coding_2 already knowing coding_1... Otherwise, obtaining the 2 at the same time would also be acceptable if it makes things easier. Thanks!
I think the best solution is simply to use a rejection strategy:
set.seed(3)
all_combinations <- 1:20
number_of_peptide_clusters <- 20
count <- 0
repeat {
count <- count + 1
message("Try number ", count)
coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
if (!any(coding_1 == coding_2))
break
}
#> Try number 1
#> Try number 2
#> Try number 3
#> Try number 4
#> Try number 5
#> Try number 6
#> Try number 7
#> Try number 8
#> Try number 9
coding_1
#> [1] 18 16 17 12 13 8 6 15 3 5 20 9 11 4 19 2 14 7 1 10
coding_2
#> [1] 5 20 14 2 11 6 7 10 19 8 4 1 15 9 13 17 18 16 12 3
Created on 2020-11-04 by the reprex package (v0.3.0)
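If coding_1 has to stay fixed, the same rejection idea can be applied to coding_2 alone. The following is only a sketch: sample_no_overlap() and its max_tries argument are hypothetical names of mine, not part of any package. For a permutation of 20 distinct values, roughly 1 in e (about 37%) of random permutations have no position in common with the reference, so a handful of tries is usually enough.

```r
set.seed(3)
all_combinations <- 1:20
coding_1 <- sample(all_combinations)  # first coding, kept fixed from here on

# Hypothetical helper: reshuffle `reference` until no position matches it.
sample_no_overlap <- function(reference, max_tries = 1000) {
  for (i in seq_len(max_tries)) {
    candidate <- sample(reference)
    if (!any(candidate == reference)) {
      return(candidate)
    }
  }
  stop("no non-overlapping sample found in ", max_tries, " tries")
}

coding_2 <- sample_no_overlap(coding_1)
any(coding_1 == coding_2)  # FALSE by construction
```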

How to randomly split a data frame into halves that are balanced on subject and item

The following randomly splits a data frame into halves.
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
head(df, 3)
# dv iv subject item
#1 562 -0.5 1 7
#2 790 0.5 1 21
#3 NA -0.5 1 19
r <- seq_len(nrow(df))
first <- sample(r, 240)
second <- r[!r %in% first]
df_1 <- df[first, ]
df_2 <- df[second, ]
However, in this way, each data frame (df_1 and df_2) is not balanced on subject and item: e.g.,
table(df_1$subject)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
# 7 8 3 5 5 3 8 1 5 7 7 6 7 7 9 8 8 9 6 7 8 5 4 4 5 2 7 6 9
# 30 31 32 33 34 35 36 37 38 39 40
# 7 5 7 7 7 3 5 7 5 3 8
table(df_1$item)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 12 11 12 12 9 11 11 8 11 12 10 8 14 7 14 10 8 7 9 9 7 11 9 8
# There are 40 subjects and 24 items, and each subject is assigned to 12 items and each item to 20 subjects.
I would like to know how to split the data frame into halves that are balanced on subject and item (i.e., exactly 6 data points from each subject and 10 data points from each item).
You can use the createDataPartition function from the caret package to create a balanced partition of one variable.
The code below creates a balanced partition of the dataset according to the variable subject:
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
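# subject is stored as a number; wrap it in factor() so that sampling is done within each subject
partition <- caret::createDataPartition(factor(df$subject), p = 0.5, list = FALSE)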
first.half <- df[partition, ]
second.half <- df[-partition, ]
table(first.half$subject)
table(second.half$subject)
I'm not sure whether it's possible to balance two variables at once. You can try balancing for one variable and checking if you're happy with the partition of the second variable.
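For comparison, a split that is balanced on subject alone can also be sketched in base R. The data frame below is a synthetic stand-in that I made up so the example runs offline (40 subjects, 24 items, 12 items per subject); it is not the data from the question, and the item counts in each half are deliberately left uncontrolled.

```r
set.seed(1)

# Synthetic stand-in for the real data (assumption): odd-numbered subjects
# see items 1-12 and even-numbered subjects see items 13-24.
df <- data.frame(subject = rep(1:40, each = 12))
df$item <- ifelse(df$subject %% 2 == 1, 0, 12) + rep(1:12, times = 40)

# For every subject, draw half of that subject's row numbers at random.
rows_by_subject <- split(seq_len(nrow(df)), df$subject)
first <- unlist(lapply(rows_by_subject,
                       function(rows) sample(rows, length(rows) %/% 2)))

df_1 <- df[first, ]
df_2 <- df[-first, ]

table(df_1$subject)  # 6 rows per subject in the first half
table(df_1$item)     # item counts are not controlled here
```

Checking table(df_1$item) afterwards shows how far the items are from the desired 10 per half.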

How to obtain all possible sub-samples of size n from a dataframe of size N in R?

I have a dataframe with 20 classrooms (indexed 1 to 20) and the number of students in each class. How can I obtain all sub-samples of size n = 8 and store them, so that I can use them later for calculations? I used combn(), but that takes only one vector. Can I use it with a dataframe, and how? (Sorry, I'm new to R.)
dataframe below:
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
8 8 22
9 9 32
10 10 26
11 11 27
12 12 34
13 13 27
14 14 28
15 15 33
16 16 21
17 17 36
18 18 24
19 19 19
20 20 32
It is as simple as passing a function to combn. simplify = FALSE means that a list will be returned.
Assuming you want all possible combinations of 8 classrooms from the dataset classrooms
combinations <- combn(nrow(classrooms), 8, function(x, data) data[x, ],
                      simplify = FALSE, data = classrooms)
head(combinations, n = 2)
[[1]]
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
8 8 22
[[2]]
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
9 9 32
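Since there are choose(20, 8) = 125970 such subsets, it can be lighter to store only the row indices and look the rows up when needed. A small sketch of that idea, with the data typed in from the table in the question (index_sets and totals are names I chose for illustration):

```r
classrooms <- data.frame(
  classrooms = 1:20,
  students = c(29, 30, 35, 28, 32, 20, 25, 22, 32, 26,
               27, 34, 27, 28, 33, 21, 36, 24, 19, 32)
)

# Each column of index_sets holds the row numbers of one subset of 8 classrooms.
index_sets <- combn(nrow(classrooms), 8)
ncol(index_sets)  # choose(20, 8) subsets

# Example calculation: total number of students in every subset.
totals <- apply(index_sets, 2, function(idx) sum(classrooms$students[idx]))

# Recover the data frame for any single subset on demand, e.g. the first one.
classrooms[index_sets[, 1], ]
```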

stratified 10 fold cross validation

I have made a start to create some training and test sets using 10 fold crossvalidation for an artificial dataset:
rows <- 1000
X1<- sort(runif(n = rows, min = -1, max =1))
occ.prob <- 1/(1+exp(-(0.0 + 3.0*X1)))
true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)
# combine data as data frame and save
data <- data.frame(X1, true.presence)
id <- sample(1:10,nrow(data),replace=TRUE)
ListX <- split(data,id)
fold1 <- data[id==1,]
fold2 <- data[id==2,]
fold3 <- data[id==3,]
fold4 <- data[id==4,]
fold5 <- data[id==5,]
fold6 <- data[id==6,]
fold7 <- data[id==7,]
fold8 <- data[id==8,]
fold9 <- data[id==9,]
fold10 <- data[id==10,]
trainingset <- subset(data, id %in% c(2,3,4,5,6,7,8,9,10))
testset <- subset(data, id %in% c(1))
I am just wondering whether there are easier ways to achieve this and how I could perform stratified crossvalidation which ensures that the class priors (true.presence) are roughly the same in all folds?
The createFolds function of the caret package performs stratified partitioning. Here is a paragraph from the help page:
... The random sampling is done within the levels of y (=outcomes) when y is a factor in an attempt to balance the class distributions within the splits.
Here is the answer to your problem:
library(caret)
folds <- createFolds(factor(data$true.presence), k = 10, list = FALSE)
and the proportions:
> library(plyr)
> data$fold <- folds
> ddply(data, 'fold', summarise, prop=mean(true.presence))
fold prop
1 1 0.5000000
2 2 0.5050505
3 3 0.5000000
4 4 0.5000000
5 5 0.5000000
6 6 0.5049505
7 7 0.5000000
8 8 0.5049505
9 9 0.5000000
10 10 0.5050505
I'm sure that (a) there's a more efficient way to code this, and (b) there's almost certainly a function somewhere in a package that will just return the folds, but here's some simple code that gives you an idea of how one might do this:
rows <- 1000
X1<- sort(runif(n = rows, min = -1, max =1))
occ.prob <- 1/(1+exp(-(0.0 + 3.0*X1)))
true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)
# combine data as data frame and save
dat <- data.frame(X1, true.presence)
require(plyr)
createFolds <- function(x,k){
n <- nrow(x)
x$folds <- rep(1:k,length.out = n)[sample(n,n)]
x
}
folds <- ddply(dat,.(true.presence),createFolds,k = 10)
#Proportion of true.presence in each fold:
ddply(folds,.(folds),summarise,prop = sum(true.presence)/length(true.presence))
folds prop
1 1 0.5049505
2 2 0.5049505
3 3 0.5100000
4 4 0.5100000
5 5 0.5100000
6 6 0.5100000
7 7 0.5100000
8 8 0.5100000
9 9 0.5050505
10 10 0.5050505
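The same per-class assignment can be sketched without plyr, using ave() from base R. This is only an illustration of the idea above, with the seed and the names k and counts chosen by me: within each class, fold labels 1..k are recycled and then shuffled, so the class counts per fold differ by at most one.

```r
set.seed(42)
rows <- 1000
X1 <- sort(runif(rows, min = -1, max = 1))
true.presence <- rbinom(rows, size = 1, prob = 1 / (1 + exp(-3 * X1)))
dat <- data.frame(X1, true.presence)

k <- 10
# Within each level of true.presence, recycle the labels 1..k and shuffle them.
dat$fold <- ave(seq_len(rows), dat$true.presence,
                FUN = function(idx) sample(rep_len(seq_len(k), length(idx))))

# Rows per class and fold; within a class the counts differ by at most 1.
counts <- table(dat$true.presence, dat$fold)
counts

# Proportion of presences in each fold.
tapply(dat$true.presence, dat$fold, mean)
```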
@joran is right (regarding his assumption (b)). dismo::kfold() is what you are looking for.
So using data from the initial question:
require(dismo)
folds <- kfold(data, k=10, by=data$true.presence)
gives a vector of length nrow(data) containing the fold association of each row of data.
Hence, data[folds == 1, ] returns the 1st fold, which can be held out for validation, and data[folds != 1, ] returns the remaining rows for training.
edit 6/2018: I strongly support using the caret package as recommended by @gkcn. It is better integrated in the tidyverse workflow and more actively developed. Go with that!
I found splitTools is pretty useful, hope the vignette https://cran.r-project.org/web/packages/splitTools/vignettes/splitTools.html can help anyone interested in this topic.
> library(splitTools)
> y <- rep(c(letters[1:4]), each = 5)
> y
[1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b" "c" "c" "c" "c" "c" "d" "d" "d" "d" "d"
> create_folds(y)
$Fold1
[1] 1 2 3 5 6 7 8 10 12 13 14 15 17 18 19 20
$Fold2
[1] 1 2 4 5 6 8 9 10 11 12 13 14 16 17 19 20
$Fold3
[1] 2 3 4 5 6 7 9 10 11 12 13 15 16 17 18 20
$Fold4
[1] 1 2 3 4 7 8 9 10 11 13 14 15 16 18 19 20
$Fold5
[1] 1 3 4 5 6 7 8 9 11 12 14 15 16 17 18 19
> create_folds(y, m_rep = 3)
$Fold1.Rep1
[1] 1 2 4 5 6 7 8 10 11 12 13 15 16 17 19 20
$Fold2.Rep1
[1] 2 3 4 5 6 8 9 10 11 12 13 14 16 17 18 20
$Fold3.Rep1
[1] 1 2 3 5 7 8 9 10 11 12 14 15 17 18 19 20
$Fold4.Rep1
[1] 1 2 3 4 6 7 9 10 11 13 14 15 16 18 19 20
$Fold5.Rep1
[1] 1 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19
$Fold1.Rep2
[1] 1 2 3 5 6 8 9 10 11 12 13 14 16 17 18 19
$Fold2.Rep2
[1] 1 2 3 4 6 7 8 10 11 12 14 15 17 18 19 20
$Fold3.Rep2
[1] 2 3 4 5 6 7 8 9 12 13 14 15 16 17 19 20
$Fold4.Rep2
[1] 1 3 4 5 7 8 9 10 11 13 14 15 16 17 18 20
$Fold5.Rep2
[1] 1 2 4 5 6 7 9 10 11 12 13 15 16 18 19 20
$Fold1.Rep3
[1] 1 2 3 4 6 7 9 10 11 12 13 15 16 18 19 20
$Fold2.Rep3
[1] 2 3 4 5 6 8 9 10 11 12 13 14 16 17 18 19
$Fold3.Rep3
[1] 1 2 4 5 6 7 8 9 11 12 14 15 16 17 19 20
$Fold4.Rep3
[1] 1 2 3 5 7 8 9 10 12 13 14 15 17 18 19 20
$Fold5.Rep3
[1] 1 3 4 5 6 7 8 10 11 13 14 15 16 17 18 20

Resources