stratified 10 fold cross validation - r

I have made a start on creating training and test sets using 10-fold cross-validation for an artificial dataset:
rows <- 1000
X1<- sort(runif(n = rows, min = -1, max =1))
occ.prob <- 1/(1+exp(-(0.0 + 3.0*X1)))
true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)
# combine data as data frame and save
data <- data.frame(X1, true.presence)
id <- sample(1:10,nrow(data),replace=TRUE)
ListX <- split(data,id)
fold1 <- data[id==1,]
fold2 <- data[id==2,]
fold3 <- data[id==3,]
fold4 <- data[id==4,]
fold5 <- data[id==5,]
fold6 <- data[id==6,]
fold7 <- data[id==7,]
fold8 <- data[id==8,]
fold9 <- data[id==9,]
fold10 <- data[id==10,]
trainingset <- subset(data, id %in% c(2,3,4,5,6,7,8,9,10))
testset <- subset(data, id %in% c(1))
I am wondering whether there is an easier way to achieve this, and how I could perform stratified cross-validation, which ensures that the class priors (true.presence) are roughly the same in all folds.

The createFolds function from the caret package performs stratified partitioning. Here is a paragraph from the help page:
... The random sampling is done within the levels of y (=outcomes) when y is a factor in an attempt to balance the class distributions within the splits.
Here is the answer to your problem:
library(caret)
folds <- createFolds(factor(data$true.presence), k = 10, list = FALSE)
and the proportions:
> library(plyr)
> data$fold <- folds
> ddply(data, 'fold', summarise, prop=mean(true.presence))
fold prop
1 1 0.5000000
2 2 0.5050505
3 3 0.5000000
4 4 0.5000000
5 5 0.5000000
6 6 0.5049505
7 7 0.5000000
8 8 0.5049505
9 9 0.5000000
10 10 0.5050505

I'm sure that (a) there's a more efficient way to code this, and (b) there's almost certainly a function somewhere in a package that will just return the folds, but here's some simple code that gives you an idea of how one might do this:
rows <- 1000
X1<- sort(runif(n = rows, min = -1, max =1))
occ.prob <- 1/(1+exp(-(0.0 + 3.0*X1)))
true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)
# combine data as data frame and save
dat <- data.frame(X1, true.presence)
require(plyr)
createFolds <- function(x,k){
n <- nrow(x)
x$folds <- rep(1:k,length.out = n)[sample(n,n)]
x
}
folds <- ddply(dat,.(true.presence),createFolds,k = 10)
#Proportion of true.presence in each fold:
ddply(folds,.(folds),summarise,prop = sum(true.presence)/length(true.presence))
folds prop
1 1 0.5049505
2 2 0.5049505
3 3 0.5100000
4 4 0.5100000
5 5 0.5100000
6 6 0.5100000
7 7 0.5100000
8 8 0.5100000
9 9 0.5050505
10 10 0.5050505
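For anyone who prefers to avoid the plyr dependency, the same idea (shuffle a repeating 1..k sequence separately within each class) can be sketched in base R with ave(). This is only an illustration under my own assumptions: the seed and the column name fold are arbitrary choices, and the data are simulated the same way as in the question.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch

# Simulate data as in the question
rows <- 1000
X1 <- sort(runif(n = rows, min = -1, max = 1))
true.presence <- rbinom(n = rows, size = 1, prob = 1 / (1 + exp(-3 * X1)))
dat <- data.frame(X1, true.presence)

k <- 10
# Within each class, assign a shuffled, repeating sequence 1..k,
# so every fold receives (almost) the same number of rows per class
dat$fold <- ave(seq_len(rows), dat$true.presence,
                FUN = function(i) sample(rep_len(seq_len(k), length(i))))

# Proportion of true.presence in each fold
tapply(dat$true.presence, dat$fold, mean)
```

Because the sequence 1..k is recycled within each class, the per-class counts of any two folds differ by at most one row.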

@joran is right (regarding his assumption (b)). dismo::kfold() is what you are looking for.
So using data from the initial question:
require(dismo)
folds <- kfold(data, k=10, by=data$true.presence)
gives a vector of length nrow(data) containing the fold assignment of each row of data.
Hence, data[folds==1,] returns the 1st fold, which can be held out for validation, and data[folds!=1,] returns the remaining nine folds for training.
edit 6/2018: I strongly support using the caret package as recommended by @gkcn. It is better integrated in the tidyverse workflow and more actively developed. Go with that!
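Once you have a fold vector like this, the cross-validation loop itself is short. The sketch below is an assumption-laden illustration, not part of either package: the fold vector is an unstratified base R stand-in (swap in the vector returned by kfold() or createFolds(..., list = FALSE)), and the logistic regression and the 0.5 cut-off are arbitrary choices of mine.

```r
set.seed(1)  # arbitrary seed

# Simulate data as in the question
rows <- 1000
X1 <- sort(runif(n = rows, min = -1, max = 1))
true.presence <- rbinom(n = rows, size = 1, prob = 1 / (1 + exp(-3 * X1)))
data <- data.frame(X1, true.presence)

# Stand-in fold vector (NOT stratified); replace with the output of
# kfold() or createFolds(..., list = FALSE) in real use
folds <- sample(rep_len(1:10, rows))

# For each fold: train on the other nine folds, test on the held-out one
acc <- sapply(1:10, function(k) {
  train <- data[folds != k, ]
  test  <- data[folds == k, ]
  fit   <- glm(true.presence ~ X1, family = binomial, data = train)
  pred  <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred == test$true.presence)  # accuracy on the held-out fold
})

mean(acc)  # cross-validated accuracy
```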

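I found the splitTools package pretty useful; I hope the vignette at https://cran.r-project.org/web/packages/splitTools/vignettes/splitTools.html helps anyone interested in this topic. By default, create_folds() stratifies on y and returns the in-sample (training) indices of each fold.
> library(splitTools)
> y <- rep(c(letters[1:4]), each = 5)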
> y
[1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b" "c" "c" "c" "c" "c" "d" "d" "d" "d" "d"
> create_folds(y)
$Fold1
[1] 1 2 3 5 6 7 8 10 12 13 14 15 17 18 19 20
$Fold2
[1] 1 2 4 5 6 8 9 10 11 12 13 14 16 17 19 20
$Fold3
[1] 2 3 4 5 6 7 9 10 11 12 13 15 16 17 18 20
$Fold4
[1] 1 2 3 4 7 8 9 10 11 13 14 15 16 18 19 20
$Fold5
[1] 1 3 4 5 6 7 8 9 11 12 14 15 16 17 18 19
> create_folds(y, m_rep = 3)
$Fold1.Rep1
[1] 1 2 4 5 6 7 8 10 11 12 13 15 16 17 19 20
$Fold2.Rep1
[1] 2 3 4 5 6 8 9 10 11 12 13 14 16 17 18 20
$Fold3.Rep1
[1] 1 2 3 5 7 8 9 10 11 12 14 15 17 18 19 20
$Fold4.Rep1
[1] 1 2 3 4 6 7 9 10 11 13 14 15 16 18 19 20
$Fold5.Rep1
[1] 1 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19
$Fold1.Rep2
[1] 1 2 3 5 6 8 9 10 11 12 13 14 16 17 18 19
$Fold2.Rep2
[1] 1 2 3 4 6 7 8 10 11 12 14 15 17 18 19 20
$Fold3.Rep2
[1] 2 3 4 5 6 7 8 9 12 13 14 15 16 17 19 20
$Fold4.Rep2
[1] 1 3 4 5 7 8 9 10 11 13 14 15 16 17 18 20
$Fold5.Rep2
[1] 1 2 4 5 6 7 9 10 11 12 13 15 16 18 19 20
$Fold1.Rep3
[1] 1 2 3 4 6 7 9 10 11 12 13 15 16 18 19 20
$Fold2.Rep3
[1] 2 3 4 5 6 8 9 10 11 12 13 14 16 17 18 19
$Fold3.Rep3
[1] 1 2 4 5 6 7 8 9 11 12 14 15 16 17 19 20
$Fold4.Rep3
[1] 1 2 3 5 7 8 9 10 12 13 14 15 17 18 19 20
$Fold5.Rep3
[1] 1 3 4 5 6 7 8 10 11 13 14 15 16 17 18 20
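Each element of the list above holds 16 of the 20 indices, that is, the rows to train on. If you need the held-out rows instead, a small base R sketch is to take the complement. The indices below are copied from $Fold1 in the first output; the variable names are mine.

```r
y <- rep(letters[1:4], each = 5)

# In-sample indices of Fold1, copied from the create_folds(y) output above
fold1_train <- c(1, 2, 3, 5, 6, 7, 8, 10, 12, 13, 14, 15, 17, 18, 19, 20)

# Held-out indices are whatever is left
fold1_test <- setdiff(seq_along(y), fold1_train)
fold1_test     # 4 9 11 16
y[fold1_test]  # "a" "b" "c" "d": one observation per class is held out
```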

Related

R caret createFolds vs. createMultiFolds discrepancies

I create the folds for a cross-validation with caret.
I discovered a discrepancy between the functions createFolds and createMultiFolds. It seems to me that createFolds samples without replacement, which is the correct behaviour according to my understanding. createMultiFolds seems to have two flaws: first, it uses replacement, and second, it has many more observations per fold than expected.
Does anyone know why these discrepancies occur, or do I have to specify the call differently? In the end I would like to use repeated cross-validation.
Here is a MWE:
library(caret)
data(mtcars)
set.seed(123)
folds <- createMultiFolds(y = mtcars$am, k = 5, times = 5)
set.seed(123)
folds <- createFolds(mtcars$am, k = 5)
The output is as follows:
createMultiFolds (only the first 5 folds):
Fold1.Rep1 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 18 20 22 23 24 25 26 27 29 30 31
Fold2.Rep1 1 2 3 5 6 7 8 9 11 12 14 16 17 18 19 20 21 22 23 24 25 28 29 31 32
Fold3.Rep1 2 4 5 6 7 8 9 10 11 12 13 15 17 18 19 20 21 23 26 27 28 29 30 31 32
Fold4.Rep1 1 2 3 4 5 6 7 10 13 14 15 16 17 18 19 21 22 23 24 25 26 27 28 29 30 32
Fold5.Rep1 1 3 4 5 8 9 10 11 12 13 14 15 16 17 19 20 21 22 24 25 26 27 28 30 31 32
createFolds:
Fold1 5 17 19 21 28 32
Fold2 4 10 13 15 26 27 30
Fold3 1 3 14 16 22 24 25
Fold4 8 9 11 12 20 31
Fold5 2 6 7 18 23 29
If you inspect the source code of createMultiFolds, you will see that it calls createFolds with returnTrain = TRUE. From the documentation,
returnTrain: a logical. When true, the values returned are the sample
positions corresponding to the data used during training.
This argument only works in conjunction with ‘list = TRUE’
Therefore, if you call createFolds with returnTrain = TRUE, the results match:
> library(caret)
> data(mtcars)
> set.seed(123)
> multiFolds <- createMultiFolds(y = mtcars$am, k = 5, times = 2)
> set.seed(123)
> folds1 <- createFolds(mtcars$am, k = 5, returnTrain = TRUE)
> folds2 <- createFolds(mtcars$am, k = 5, returnTrain = TRUE)
> all(multiFolds$Fold1.Rep1 == folds1$Fold1)
[1] TRUE
> all(multiFolds$Fold2.Rep1 == folds1$Fold2)
[1] TRUE
> all(multiFolds$Fold3.Rep1 == folds1$Fold3)
[1] TRUE
> all(multiFolds$Fold4.Rep1 == folds1$Fold4)
[1] TRUE
> all(multiFolds$Fold5.Rep1 == folds1$Fold5)
[1] TRUE
> all(multiFolds$Fold1.Rep2 == folds2$Fold1)
[1] TRUE
> all(multiFolds$Fold2.Rep2 == folds2$Fold2)
[1] TRUE
> all(multiFolds$Fold3.Rep2 == folds2$Fold3)
[1] TRUE
> all(multiFolds$Fold4.Rep2 == folds2$Fold4)
[1] TRUE
> all(multiFolds$Fold5.Rep2 == folds2$Fold5)
[1] TRUE
createMultiFolds has two flaws, first it uses replacement [...]
Where did you get this from? If you’re talking about the 1’s, the first one is part of the name: Fold1.Rep1, Fold2.Rep1, …, Fold{k}.Rep{times}.
As noted in the question, createFolds() splits the data into k folds. However, the output from the function is a list of observation indices that are held out from each fold, not the rows included in each fold. We can see this by creating a table of all the fold data as follows.
set.seed(123)
folds <- createFolds(mtcars$am, k = 5)
table(unlist(folds))
...and the output:
> table(unlist(folds))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
27 28 29 30 31 32
1 1 1 1 1 1
If we use the returnTrain = TRUE argument with createFolds(), it returns the index of observations included in each fold, as illustrated in the other answer. For k = 5, we expect each observation to be used in 4 of the folds, and confirm this with the following code.
set.seed(123)
folds <- createFolds(mtcars$am, k = 5, returnTrain = TRUE)
table(unlist(folds))
...and the output:
> table(unlist(folds))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
27 28 29 30 31 32
4 4 4 4 4 4
As noted in the answer, setting returnTrain = TRUE causes createFolds() to return the same output as createMultiFolds() with times = 1. We can illustrate that each observation is used in 4 of the 5 folds as follows.
set.seed(123)
folds1 <- createMultiFolds(y = mtcars$am, k = 5, times = 1)
table(unlist(folds1))
...and the output:
> table(unlist(folds1))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
27 28 29 30 31 32
4 4 4 4 4 4
We can compare the contents of folds and folds1 with lapply() and all() as follows.
# compare folds to folds1 element by element
lapply(1:5,function(x){
all(folds1[[x]] == folds[[x]])
})
[[1]]
[1] TRUE
[[2]]
[1] TRUE
[[3]]
[1] TRUE
[[4]]
[1] TRUE
[[5]]
[1] TRUE
If we set times = 2, we expect each observation to be included in 8 of the 10 folds.
set.seed(123)
folds <- createMultiFolds(y = mtcars$am, k = 5, times = 2)
table(unlist(folds))
...and the output:
> table(unlist(folds))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
27 28 29 30 31 32
8 8 8 8 8 8
CONCLUSIONS: In both functions, caret samples so that each observation lands in the hold-out group exactly once across the k folds of each repetition (times =), subject to the constraint that the values of the dependent variable passed to the function are distributed proportionally between the in-sample and out-of-sample parts of each fold.
In the case of a small data set such as mtcars, it's not easy for the algorithm to split effectively, as we can see when we run tables to compare in sample / holdout vs. mtcars$am.
set.seed(123)
folds <- createFolds(mtcars$am, k = 5)
table(unlist(folds))
lapply(folds,function(x){
holdout <- rep(FALSE,nrow(mtcars))
holdout[x] <- TRUE
table(holdout,mtcars$am)
})
$Fold1
holdout 0 1
FALSE 16 10
TRUE 3 3
$Fold2
holdout 0 1
FALSE 15 10
TRUE 4 3
$Fold3
holdout 0 1
FALSE 14 11
TRUE 5 2
$Fold4
holdout 0 1
FALSE 15 11
TRUE 4 2
$Fold5
holdout 0 1
FALSE 16 10
TRUE 3 3
Each fold contains 6 or 7 observations in the hold out set, with a minimum of 2 manual transmission cars (am = 1) in each hold out set.
With default arguments, createFolds() returns the indexes of held out observations rather than included observations. createFolds(x,k,returnTrain=TRUE) behaves exactly the same as createMultiFolds(x,k,times=1).
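To make the relationship concrete without rerunning caret, here is a small base R sketch that starts from the held-out indices printed in the question (the createFolds output) and derives the training indices as complements. The list and variable names are mine; the numbers are copied from the question.

```r
# Held-out indices per fold, copied from the createFolds output in the question
held_out <- list(
  Fold1 = c(5, 17, 19, 21, 28, 32),
  Fold2 = c(4, 10, 13, 15, 26, 27, 30),
  Fold3 = c(1, 3, 14, 16, 22, 24, 25),
  Fold4 = c(8, 9, 11, 12, 20, 31),
  Fold5 = c(2, 6, 7, 18, 23, 29)
)
n <- 32  # number of rows in mtcars

# Training indices of a fold are the complement of its held-out indices
train <- lapply(held_out, function(idx) setdiff(seq_len(n), idx))

# Each observation is held out once and used for training k - 1 = 4 times
table(unlist(held_out))
table(unlist(train))

train$Fold1  # same indices as Fold1.Rep1 in the createMultiFolds output
```

So the larger folds from createMultiFolds are not a sign of sampling with replacement; they are simply the training side of the same partition.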

R: take 2 random non-overlapping samples (for same indexes) of length n out of vector of length n as well

Say I have a vector named all_combinations with numbers from 1 to 20.
I need to extract 2 vectors (coding_1 and coding_2) of length equal to number_of_peptide_clusters, which happens to be 20 as well in my current case.
The 2 new vectors should be randomly sampled from all_combinations so that they do not overlap at any index position.
I do the following:
set.seed(3)
all_combinations=1:20
number_of_peptide_clusters=20
coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_1
[1] 5 12 7 4 10 8 11 15 17 16 18 13 9 20 2 14 19 1 3 6
coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_2
[1] 5 9 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
This is the example that gives me trouble, because only one number overlaps at the same index (5 at position 1).
What I would do in these cases is spot the overlapping numbers and resample them out of the list of all overlapping numbers...
Imagine coding_1 and coding_2 were:
coding_1
[1] 5 9 7 4 10 8 11 15 17 16 18 13 12 20 2 14 19 1 3 6
coding_2
[1] 5 9 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
In this case I would have 5 and 9 overlapping in the same position, so I would resample them in coding_2 out of the full list of overlapping ones [resample index 1 from c(5,9) so that isn't equal to 5, and index 2 so it isn't equal to 9]. So coding_2 would be:
coding_2
[1] 9 5 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
However, in the particular case above, I cannot use such approach... So what would be the best way to obtain 2 samples of length 20 from a vector of length 20 as well, so that the samples aren't overlapping at the same index positions?
It would be great that I could obtain the second sample coding_2 already knowing coding_1... Otherwise obtaining the 2 at the same time would also be acceptable if it makes things easier. Thanks!
I think the best solution is simply to use a rejection strategy:
set.seed(3)
all_combinations <- 1:20
number_of_peptide_clusters <- 20
count <- 0
repeat {
count <- count + 1
message("Try number ", count)
coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
if (!any(coding_1 == coding_2))
break
}
#> Try number 1
#> Try number 2
#> Try number 3
#> Try number 4
#> Try number 5
#> Try number 6
#> Try number 7
#> Try number 8
#> Try number 9
coding_1
#> [1] 18 16 17 12 13 8 6 15 3 5 20 9 11 4 19 2 14 7 1 10
coding_2
#> [1] 5 20 14 2 11 6 7 10 19 8 4 1 15 9 13 17 18 16 12 3
Created on 2020-11-04 by the reprex package (v0.3.0)
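If coding_1 is already fixed, as the question mentions, the same rejection idea works while redrawing only coding_2. This is a minimal sketch under that assumption, not a different algorithm. For two random permutations of the same set, the chance of no match at any position is roughly 1/e (about 0.37), so on average fewer than three attempts are needed.

```r
set.seed(3)
all_combinations <- 1:20
number_of_peptide_clusters <- 20

# coding_1 is drawn once and then treated as given
coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)

# Redraw only coding_2 until no position holds the same value in both vectors
tries <- 0
repeat {
  tries <- tries + 1
  coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
  if (!any(coding_1 == coding_2)) break
}

tries
coding_1
coding_2
```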

How to randomly split a data frame into halves that are balanced on subject and item

The following randomly splits a data frame into halves.
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
head(df, 3)
# dv iv subject item
#1 562 -0.5 1 7
#2 790 0.5 1 21
#3 NA -0.5 1 19
r <- seq_len(nrow(df))
first <- sample(r, 240)
second <- r[!r %in% first]
df_1 <- df[first, ]
df_2 <- df[second, ]
However, in this way, each data frame (df_1 and df_2) is not balanced on subject and item: e.g.,
table(df_1$subject)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
# 7 8 3 5 5 3 8 1 5 7 7 6 7 7 9 8 8 9 6 7 8 5 4 4 5 2 7 6 9
# 30 31 32 33 34 35 36 37 38 39 40
# 7 5 7 7 7 3 5 7 5 3 8
table(df_1$item)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 12 11 12 12 9 11 11 8 11 12 10 8 14 7 14 10 8 7 9 9 7 11 9 8
# There are 40 subjects and 24 items, and each subject is assigned to 12 items and each item to 20 subjects.
I would like to know how to split the data frame into halves that are balanced on subject and item (i.e., exactly 6 data points from each subject and 10 data points from each item).
You can use the createDataPartition function from the caret package to create a balanced partition of one variable.
The code below creates a balanced partition of the dataset according to the variable subject:
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
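# subject is stored as a number; wrap it in factor() so that sampling is stratified per subject rather than within percentile groups
partition <- caret::createDataPartition(factor(df$subject), p = 0.5, list = FALSE)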
first.half <- df[partition, ]
second.half <- df[-partition, ]
table(first.half$subject)
table(second.half$subject)
I'm not sure whether it's possible to balance two variables at once. You can try balancing for one variable and checking if you're happy with the partition of the second variable.
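The "balance one variable, then check the other" approach can also be sketched in base R. The data frame below is a hypothetical stand-in for the CSV (40 subjects, each seeing 12 of 24 items chosen at random), so it does not reproduce the real design in which every item is seen by exactly 20 subjects. The split is exact on subject only.

```r
set.seed(1)  # arbitrary seed

# Hypothetical stand-in for the CSV: 40 subjects, 12 of 24 items each
design <- do.call(rbind, lapply(1:40, function(s) {
  data.frame(subject = s, item = sample(24, 12))
}))

# Within each subject, randomly assign half of the rows to each half
half <- ave(seq_len(nrow(design)), design$subject,
            FUN = function(i) sample(rep_len(1:2, length(i))))
df_1 <- design[half == 1, ]
df_2 <- design[half == 2, ]

table(df_1$subject)      # exactly 6 rows per subject
range(table(df_1$item))  # items are typically only roughly balanced
```

If the item counts are not acceptable, one option is simply to rerun the split with a different seed and keep a partition you are happy with; exact balance on both variables at once is a harder problem that this sketch does not solve.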

purrr; sample from multiple columns with probability list

Say I want to take a sample of values of variable length from an arbitrary number of different probability distributions, and with a weighted probability of sampling from each distribution.
Seems like I should be able to do this using purrr's map functions, but am struggling...
library(tidyverse)
set.seed(20171127)
# sample from 5 different probability distributions
dists <- tibble(
samp_distA = round(rnorm(n=1000, mean=17, sd=4)),
samp_distB = round(rnorm(n=1000, mean=13, sd=4)),
samp_distC = round(rnorm(n=1000, mean=13, sd=4)),
samp_distD = round(rbeta(n=1000, 2,8)*10),
samp_distE = round(rnorm(n=1000, mean=8, sd=3))
)
# define number of samples to be drawn for each group
n.times <- c(20,15,35,8,6)
# define weights to be used for sampling from dists
probs <- tibble(A = c(0.80, 0.05, 0.05, 0.05, 0.05),
B = c(0.05, 0.80, 0.05, 0.05, 0.05),
C = c(0.05, 0.05, 0.80, 0.05, 0.05),
D = c(0.05, 0.05, 0.05, 0.80, 0.80),
E = c(0.05, 0.05, 0.05, 0.05, 0.80)
)
# sample from dists, n.times, and using probs as weights...
output <- map2(sample, size=n.times, weight=probs, tbl=dists)
#...doesn't work
Any suggestions gratefully received.
set.seed(123)
map2(
n.times,
map(probs, rep, each = nrow(dists)),
sample, x = flatten_dbl(dists), replace = TRUE
)
# [[1]]
# [1] 15 13 18 6 15 15 12 8 9 12 7 17 14 12 15 10 18 19 24 24
#
# [[2]]
# [1] 12 2 15 16 14 17 11 11 10 12 6 19 13 12 13
#
# [[3]]
# [1] 10 9 16 12 13 11 10 18 14 19 16 16 12 19 4 15 19 19 13 14 15 10 14 12 10
# [26] 8 18 19 7 8 21 8 19 10 9
#
# [[4]]
# [1] 3 3 2 15 1 4 14 2
#
# [[5]]
# [1] 9 14 10 6 12 8
NB: I'm dubious about your answer to MrFlick's comment: "an 80% chance of selecting all values from samp_distA". To me it is much more intuitive to go the other route: "an 80% chance for each of the 10 values to come from samp_distA"... so that's what I did. Do you confirm you want the former?
Base R equivalent:
set.seed(123)
mapply(
sample,
n.times,
lapply(probs, rep, each = nrow(dists)),
MoreArgs = list(x = unlist(dists, use.names = FALSE), replace = TRUE)
)
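In case the rep(..., each = nrow(dists)) part looks cryptic: the values of all columns are pooled into one long vector, and each column's weight is repeated once per row so that it lines up with that column's values. Here is a toy sketch with made-up numbers (two tiny "distributions" and weights 0.9 and 0.1, all hypothetical):

```r
set.seed(1)  # arbitrary seed

# Two tiny, easily distinguishable "distributions" (made-up values)
dists <- data.frame(a = c(1, 2, 3), b = c(101, 102, 103))
w <- c(a = 0.9, b = 0.1)  # weight of each column

pool <- unlist(dists, use.names = FALSE)  # 1 2 3 101 102 103
prob <- rep(w, each = nrow(dists))        # 0.9 0.9 0.9 0.1 0.1 0.1

draws <- sample(pool, size = 1000, replace = TRUE, prob = prob)
mean(draws < 100)  # share of draws coming from column a, close to 0.9
```

sample() rescales prob internally, so the weights do not have to sum to 1. As long as every column has the same number of rows, each draw comes from column a with probability 0.9 and is then uniform within that column.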
Edit
Re your follow-up question in a comment ("run the function multiple times for each person, e.g. so that as output, person A had 10 lists of randomly-sampled values, each of length 20 (and similar for persons B, C, D, and E, perhaps with each person having a predefined different number of lists)"):
n.reps <- c(A = 10, B = 1, C = 3, D = 2, E = 1)
set.seed(123)
pmap(
list(n.reps, n.times, map(probs, rep, each = nrow(dists))),
function(.x, .y, .z) replicate(
.x,
sample(flatten_dbl(dists), .y, replace = TRUE, .z),
simplify = FALSE
)
)
# $A
# $A[[1]]
# [1] 15 20 16 20 16 14 17 20 21 22 18 19 15 14 18 19 16 20 9 16
#
# $A[[2]]
# [1] 13 9 11 19 25 19 11 18 16 19 16 21 15 12 11 11 9 13 20 1
#
# $A[[3]]
# [1] 15 20 13 20 13 11 16 16 14 19 18 10 21 11 12 16 18 10 20 14
#
# $A[[4]]
# [1] 16 19 14 11 17 9 20 11 19 13 11 16 8 11 10 18 27 22 20 4
#
# $A[[5]]
# [1] 12 18 16 19 13 13 23 19 21 14 22 8 9 19 16 19 9 14 13 20
#
# $A[[6]]
# [1] 18 26 16 15 21 17 15 19 14 18 19 25 5 16 7 19 21 15 23 16
#
# $A[[7]]
# [1] 12 26 20 12 7 5 13 14 19 7 16 12 11 27 22 18 11 17 11 16
#
# $A[[8]]
# [1] 21 18 24 22 18 0 15 3 9 16 16 11 16 20 22 18 18 20 16 21
#
# $A[[9]]
# [1] 15 20 11 16 16 21 12 20 17 9 18 10 22 17 12 0 18 16 23 20
#
# $A[[10]]
# [1] 16 22 15 4 7 19 18 13 15 1 7 18 21 1 20 21 15 12 20 15
#
#
# $B
# $B[[1]]
# [1] 9 5 8 17 9 10 7 13 12 11 9 21 10 15 12
#
#
# $C
# $C[[1]]
# [1] 15 15 16 13 19 14 16 15 11 15 19 16 19 12 6 12 10 12 1 18 9 10 18 11 19
# [26] 9 6 19 18 12 9 18 14 12 7
#
# $C[[2]]
# [1] 5 14 16 10 8 13 8 18 22 18 14 12 13 10 19 12 15 10 16 13 16 9 15 6 15
# [26] 14 4 9 11 11 3 15 18 10 14
#
# $C[[3]]
# [1] 13 8 12 9 6 9 2 7 8 12 2 11 20 10 1 14 14 11 11 1 13 13 18 14 12
# [26] 21 11 3 7 7 13 13 11 7 14
#
#
# $D
# $D[[1]]
# [1] 11 1 1 7 12 6 0 8
#
# $D[[2]]
# [1] 4 1 7 15 2 2 8 9
#
#
# $E
# $E[[1]]
# [1] 7 8 6 11 10 6
This seems doable with purrr, but it takes a bit of set up, particularly because there's not a sample2 function (that I'm aware of) that samples a distribution based on a vector of probabilities, and then grabs a random sample from that subset.
To do that with purrr, we have to loop twice: the outside loops through each person using a simple numerical index; inside that loop, we loop through the n.times to get random samples from the appropriate distribution.
# prep data ---------------------------------------------------------------
# pull all the controls into a single data frame
controldf <- tibble(
cols = c(1:5), n.times
) %>%
bind_cols(probs %>%
t %>%
as.tibble %>%
setNames(c("distA", "distB", "distC", "distD", "distE"))
)
# turn the distributions into long form
longdists <- dists %>%
gather(dist, val)
distnames <- c("A", "B", "C", "D", "E")
# function to do the work ---------------------------------------------------------------
getdist <- function(i) {
# get the probabilities as a numeric vector
myprobs <- controldf[i,3:7] %>% as.numeric
# how many samples do we need
myn <- controldf[[i,2]]
# use our probabilities to decide what distribution to grab from
samplestoget <- sample(distnames, myn, prob = myprobs, replace = TRUE) %>%
paste0("samp_dist", .)
# loop through our list of distributions to grab from
map_dbl(samplestoget, ~filter(
# filter on distribution key
longdists, dist == .x
) %>%
# from that distribution, select a single value at random
sample_n(1) %>%
# extract the numeric value
pluck('val') )
}
# get the values by running the function over our indexes -------------------------
results <- map(controldf$cols, ~ getdist(.x))
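The two-stage logic (first pick a distribution according to the person's probabilities, then draw one value from it) can also be sketched in base R for a single person. Everything here is a simplified, hypothetical stand-in: only two distributions, one person, and made-up probabilities.

```r
set.seed(42)  # arbitrary seed

# Hypothetical stand-ins for two of the sampling distributions
dists <- list(A = round(rnorm(1000, mean = 17, sd = 4)),
              B = round(rnorm(1000, mean = 13, sd = 4)))
probs <- c(A = 0.8, B = 0.2)  # made-up weights for one person
n <- 20                       # number of values to draw for that person

# Stage 1: decide which distribution each value comes from
picked <- sample(names(dists), n, replace = TRUE, prob = probs)

# Stage 2: draw one value from each chosen distribution
values <- vapply(picked, function(d) sample(dists[[d]], 1), numeric(1))

table(picked)
unname(values)
```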

R looping: to display numbers in an interval

I have a dataset and I want to display the numbers for each row between col1 and col2 counted by col3 using R:
dataset=data.frame(col1=c(3,9,15), col2=c(4,11,16), col3=c(2,3,2))
My result should look like:
3
3
4
4
9
9
9
10
10
10
11
11
11
15
15
16
16
Seems trivial but I cannot get a for loop work. Thanks.
Or this can be done with apply
unlist(apply(dataset, 1, function(x) rep(x[1]:x[2],
each=x[3])))
#[1] 3 3 4 4 9 9 9 10 10 10 11 11 11 15 15 16 16
Try this:
col1=c(3,9,15)
col2=c(4,11,16)
col3=c(2,3,2)
res = NULL
for (k in seq_along(col1)){
res = c(res, sort(rep(col1[k]:col2[k],col3[k])))
}
Result:
> res
[1] 3 3 4 4 9 9 9 10 10 10 11 11 11 15 15 16 16
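A loop-free variant of the same idea, as a small sketch: Map() walks over the three columns in parallel and rep(..., each = ) repeats every number of the interval. It uses the dataset from the question; the name res is mine.

```r
dataset <- data.frame(col1 = c(3, 9, 15), col2 = c(4, 11, 16), col3 = c(2, 3, 2))

# For each row: build the interval col1:col2 and repeat every number col3 times
res <- unlist(Map(function(from, to, times) rep(from:to, each = times),
                  dataset$col1, dataset$col2, dataset$col3))
res
# [1]  3  3  4  4  9  9  9 10 10 10 11 11 11 15 15 16 16
```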
