I am very new to coding in R and would like some guidance on how to write this function.
I have a pool of numbers: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
I want to draw 3 numbers from that pool and sum them.
I would like to run this 100 times.
The 3 numbers that I draw from the pool must be unique,
i.e. 9, 9, 9 cannot be drawn from the pool.
The code I currently have is this:
numbers_in_box <- c(0,1,2,3,4,5,6,7,8,9)
# sample(numbers_in_box,3, replace = FALSE)
replicate(n = 100,sample(numbers_in_box,3, replace = FALSE),simplify = FALSE)
Thank you
The code in the question is not wrong, but I would change to simplify = TRUE and start by setting the pseudo-RNG seed. With simplify = TRUE, replicate returns a 3 x 100 matrix with one draw per column, so assign its output to a variable and call colSums on it to get the sums.
set.seed(2021) # Make the results reproducible
numbers_in_box <- c(0,1,2,3,4,5,6,7,8,9)
# sample(numbers_in_box,3, replace = FALSE)
x <- replicate(n = 100,sample(numbers_in_box,3, replace = FALSE),simplify = TRUE)
colSums(x)
# [1] 19 18 15 22 10 11 8 14 8 12 18 8 14 10 13 16 12 12 3 15 12
# [22] 10 6 7 17 21 6 23 17 8 8 10 15 15 15 16 11 11 8 7 18 17
# [43] 18 10 8 12 15 17 16 20 14 14 19 17 11 14 12 14 17 19 7 6 19
# [64] 9 21 19 15 19 18 20 15 13 7 13 21 12 21 16 17 18 20 4 13 8
# [85] 17 8 15 15 15 21 14 8 11 15 17 10 20 18 9 9
Put a sum() call in your replicate():
replicate(n = 100, sum(sample(numbers_in_box,3, replace = FALSE)), simplify = TRUE)
Also, as @Rui suggested, I'd recommend changing simplify to TRUE unless there's some reason you really want a list output rather than a vector.
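As a quick sanity check, here is a minimal sketch (the seed 42 and the helper name draw_sum are my own choices, not from the question). Because the three draws are distinct digits from 0 to 9, every sum has to lie between 0 + 1 + 2 = 3 and 7 + 8 + 9 = 24.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
numbers_in_box <- 0:9

# draw_sum is a hypothetical helper: draw k distinct numbers and add them up
draw_sum <- function(pool, k = 3) {
  sum(sample(pool, k, replace = FALSE))
}

# vapply guarantees an integer vector of length 100
sums <- vapply(seq_len(100), function(i) draw_sum(numbers_in_box), integer(1))

range(sums)  # always within 3 and 24, whatever the seed
```

vapply is used here instead of replicate only because it checks the type and length of each result; replicate with simplify = TRUE should give the same kind of vector.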
We can use rerun
library(purrr)
rerun(100, sample(numbers_in_box,3, replace = FALSE))
How can I write an R script that initializes a vector with integers, rearranges the elements by interleaving the first-half elements with the second-half elements, stores the result in the same vector without using a pre-defined function, and displays the updated vector?
This sounds like a homework question, and it would be nice to see some effort on your own part, but it's pretty straightforward to do this in R.
Suppose your vector looks like this:
vec <- 1:20
vec
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Then you can just do:
c(t(cbind(vec[1:10], vec[11:20])))
#> [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
This works by joining the two vectors into a 10 x 2 matrix, then transposing that matrix and turning it into a vector.
We may use matrix directly and concatenate
c(matrix(vec, nrow = 2, byrow = TRUE))
-output
[1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
data
vec <- 1:20
Or using mapply:
vec <- 1:20
c(mapply(\(x,y) c(x,y), vec[1:10], vec[11:20]))
#> [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
We can try this using order + %%
> vec[order((seq_along(vec) - 1) %% (length(vec) / 2))]
[1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
Another way is to use rbind on the 2 halves of the vector, which creates a matrix with two rows. We can then turn the matrix into a vector, which is read column by column (i.e., 1, 11, 2, 12...). However, this only works for vectors of even length.
vec <- 1:20
c(rbind(vec[1:10], vec[11:20]))
# [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
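So, for vectors of odd length, we can use order on the two seq_along vectors, which returns the interleaved index positions. For vec2 <- 1:21 these positions equal the values; in general, use them to subset the vector.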
vec2 <- 1:21
order(c(seq_along(vec2[1:10]),seq_along(vec2[11:21])))
# [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20 21
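The order() idea can be wrapped into a small function that works for any vector, not only for 1:n. This is a sketch under my own naming (interleave is hypothetical); it relies on order() leaving ties in their original order.

```r
# interleave is a hypothetical helper name
interleave <- function(v) {
  n <- length(v)
  h <- n %/% 2  # size of the first half
  # positions 1..h for the first half, 1..(n - h) for the second;
  # order() keeps tied positions in their original order, which
  # alternates between the two halves
  v[order(c(seq_len(h), seq_len(n - h)))]
}

interleave(1:20)
interleave(letters[1:5])
```

For an odd length the extra element belongs to the second half and ends up last, matching the vec2 example.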
I create the folds for a cross-validation with caret.
I discovered a discrepancy between the functions createFolds and createMultiFolds. It seems to me that createFolds samples without replacement, which is the correct behaviour according to my understanding. createMultiFolds seems to have two flaws: first, it uses replacement, and second, it has many more observations per fold than expected.
Does anyone know why these discrepancies occur, or do I have to specify it differently? In the end I would like to use a repeated cross-validation.
Here is a MWE:
library(caret)
data(mtcars)
set.seed(123)
folds <- createMultiFolds(y = mtcars$am, k = 5, times = 5)
set.seed(123)
folds <- createFolds(mtcars$am, k = 5)
The output is as follows:
createMultiFolds (only the first 5 folds):
Fold1.Rep1 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 18 20 22 23 24 25 26 27 29 30 31
Fold2.Rep1 1 2 3 5 6 7 8 9 11 12 14 16 17 18 19 20 21 22 23 24 25 28 29 31 32
Fold3.Rep1 2 4 5 6 7 8 9 10 11 12 13 15 17 18 19 20 21 23 26 27 28 29 30 31 32
Fold4.Rep1 1 2 3 4 5 6 7 10 13 14 15 16 17 18 19 21 22 23 24 25 26 27 28 29 30 32
Fold5.Rep1 1 3 4 5 8 9 10 11 12 13 14 15 16 17 19 20 21 22 24 25 26 27 28 30 31 32
createFolds:
Fold1 5 17 19 21 28 32
Fold2 4 10 13 15 26 27 30
Fold3 1 3 14 16 22 24 25
Fold4 8 9 11 12 20 31
Fold5 2 6 7 18 23 29
If you inspect the source code of createMultiFolds, you will see that it calls createFolds with returnTrain = TRUE. From the documentation,
returnTrain: a logical. When true, the values returned are the sample
positions corresponding to the data used during training.
This argument only works in conjunction with ‘list = TRUE’
Therefore, if you call createFolds with returnTrain = TRUE, the two functions agree:
> library(caret)
> data(mtcars)
> set.seed(123)
> multiFolds <- createMultiFolds(y = mtcars$am, k = 5, times = 2)
> set.seed(123)
> folds1 <- createFolds(mtcars$am, k = 5, returnTrain = TRUE)
> folds2 <- createFolds(mtcars$am, k = 5, returnTrain = TRUE)
> all(multiFolds$Fold1.Rep1 == folds1$Fold1)
[1] TRUE
> all(multiFolds$Fold2.Rep1 == folds1$Fold2)
[1] TRUE
> all(multiFolds$Fold3.Rep1 == folds1$Fold3)
[1] TRUE
> all(multiFolds$Fold4.Rep1 == folds1$Fold4)
[1] TRUE
> all(multiFolds$Fold5.Rep1 == folds1$Fold5)
[1] TRUE
> all(multiFolds$Fold1.Rep2 == folds2$Fold1)
[1] TRUE
> all(multiFolds$Fold2.Rep2 == folds2$Fold2)
[1] TRUE
> all(multiFolds$Fold3.Rep2 == folds2$Fold3)
[1] TRUE
> all(multiFolds$Fold4.Rep2 == folds2$Fold4)
[1] TRUE
> all(multiFolds$Fold5.Rep2 == folds2$Fold5)
[1] TRUE
createMultiFolds has two flaws, first it uses replacement [...]
Where did you get this from? If you’re talking about the 1’s, the first one is part of the name: Fold1.Rep1, Fold2.Rep1, …, Fold{k}.Rep{times}.
As noted in the question, createFolds() splits the data into k folds. However, the output from the function is a list of observation indices that are held out from each fold, not the rows included in each fold. We can see this by creating a table of all the fold data as follows.
set.seed(123)
folds <- createFolds(mtcars$am, k = 5)
table(unlist(folds))
...and the output:
> table(unlist(folds))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
27 28 29 30 31 32
1 1 1 1 1 1
If we use the returnTrain = TRUE argument with createFolds(), it returns the index of observations included in each fold, as illustrated in the other answer. For k = 5, we expect each observation to be used in 4 of the folds, and confirm this with the following code.
set.seed(123)
folds <- createFolds(mtcars$am, k = 5, returnTrain = TRUE)
table(unlist(folds))
...and the output:
> table(unlist(folds))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
27 28 29 30 31 32
4 4 4 4 4 4
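The relationship between held-out and training indices can be reproduced in base R without caret. This is only a sketch with plain random folds (no stratification on the outcome, which caret adds), and the values n = 10 and k = 5 are illustrative.

```r
set.seed(1)  # arbitrary seed
n <- 10
k <- 5

# Plain random (unstratified) hold-out folds; caret additionally balances on y
holdout <- split(sample(seq_len(n)), rep(seq_len(k), length.out = n))

# Training indices for a fold are simply everything not held out
train <- lapply(holdout, function(idx) setdiff(seq_len(n), idx))

table(unlist(holdout))  # every observation held out exactly once
table(unlist(train))    # every observation used for training k - 1 = 4 times
```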
As noted in the other answer, setting returnTrain = TRUE causes createFolds() to return the same output as createMultiFolds() with times = 1. We can illustrate that each observation is used in 4 of the 5 folds as follows.
set.seed(123)
folds1 <- createMultiFolds(y = mtcars$am, k = 5, times = 1)
table(unlist(folds1))
...and the output:
> table(unlist(folds1))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
27 28 29 30 31 32
4 4 4 4 4 4
We can compare the contents of folds1 and folds element by element with lapply() and all() as follows.
# compare folds to folds1
lapply(1:5,function(x){
all(folds1[[x]] == folds[[x]])
})
[[1]]
[1] TRUE
[[2]]
[1] TRUE
[[3]]
[1] TRUE
[[4]]
[1] TRUE
[[5]]
[1] TRUE
If we set times = 2, we expect each observation to be included in 8 of the 10 folds.
set.seed(123)
folds <- createMultiFolds(y = mtcars$am, k = 5, times = 2)
table(unlist(folds))
...and the output:
> table(unlist(folds))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
27 28 29 30 31 32
8 8 8 8 8 8
CONCLUSIONS: in both functions, caret uses sampling so that each observation lands in the hold-out group exactly once across the k folds within each repetition (times =), subject to the constraint that the values of the dependent variable passed to the function are distributed proportionally between the in-sample and out-of-sample components of each fold.
In the case of a small data set such as mtcars, it's not easy for the algorithm to split effectively, as we can see when we run tables to compare in sample / holdout vs. mtcars$am.
set.seed(123)
folds <- createFolds(mtcars$am, k = 5)
table(unlist(folds))
lapply(folds,function(x){
holdout <- rep(FALSE,nrow(mtcars))
holdout[x] <- TRUE
table(holdout,mtcars$am)
})
$Fold1
holdout 0 1
FALSE 16 10
TRUE 3 3
$Fold2
holdout 0 1
FALSE 15 10
TRUE 4 3
$Fold3
holdout 0 1
FALSE 14 11
TRUE 5 2
$Fold4
holdout 0 1
FALSE 15 11
TRUE 4 2
$Fold5
holdout 0 1
FALSE 16 10
TRUE 3 3
Each fold contains 6 or 7 observations in the hold out set, with a minimum of 2 manual transmission cars (am = 1) in each hold out set.
With default arguments, createFolds() returns the indexes of held out observations rather than included observations. createFolds(x,k,returnTrain=TRUE) behaves exactly the same as createMultiFolds(x,k,times=1).
Say I have a vector named all_combinations with numbers from 1 to 20.
I need to extract 2 vectors (coding_1 and coding_2) of length equal to number_of_peptide_clusters, which happens to be 20 as well in my current case.
The 2 new vectors should be randomly sampled from all_combinations, so that they never hold the same value at the same index position.
I do the following:
set.seed(3)
all_combinations=1:20
number_of_peptide_clusters=20
coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_1
[1] 5 12 7 4 10 8 11 15 17 16 18 13 9 20 2 14 19 1 3 6
coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_2
[1] 5 9 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
This is the example that gives me trouble, because only one number is overlapping at the same index (5 at position 1).
What I would do in these cases is spot the overlapping numbers and resample them out of the list of all overlapping numbers...
Imagine coding_1 and coding_2 were:
coding_1
[1] 5 9 7 4 10 8 11 15 17 16 18 13 12 20 2 14 19 1 3 6
coding_2
[1] 5 9 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
In this case I would have 5 and 9 overlapping in the same position, so I would resample them in coding_2 out of the full list of overlapping ones [resample index 1 from c(5,9) so that it isn't equal to 5, and index 2 so that it isn't equal to 9]. So coding_2 would be:
coding_2
[1] 9 5 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
However, in the particular case above, I cannot use such approach... So what would be the best way to obtain 2 samples of length 20 from a vector of length 20 as well, so that the samples aren't overlapping at the same index positions?
It would be great that I could obtain the second sample coding_2 already knowing coding_1... Otherwise obtaining the 2 at the same time would also be acceptable if it makes things easier. Thanks!
I think the best solution is simply to use a rejection strategy:
set.seed(3)
all_combinations <- 1:20
number_of_peptide_clusters <- 20
count <- 0
repeat {
count <- count + 1
message("Try number ", count)
coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
if (!any(coding_1 == coding_2))
break
}
#> Try number 1
#> Try number 2
#> Try number 3
#> Try number 4
#> Try number 5
#> Try number 6
#> Try number 7
#> Try number 8
#> Try number 9
coding_1
#> [1] 18 16 17 12 13 8 6 15 3 5 20 9 11 4 19 2 14 7 1 10
coding_2
#> [1] 5 20 14 2 11 6 7 10 19 8 4 1 15 9 13 17 18 16 12 3
Created on 2020-11-04 by the reprex package (v0.3.0)
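If coding_1 is already known and should stay fixed, the same rejection idea can be applied to coding_2 alone. This is a sketch, and resample_no_overlap is a hypothetical helper name. When both vectors are permutations of the same pool, the chance that a random permutation has no match at any position is roughly 1/e (about 0.37), so about three tries are needed on average.

```r
# resample_no_overlap is a hypothetical helper name
resample_no_overlap <- function(coding_1, pool) {
  repeat {
    coding_2 <- sample(pool, length(coding_1), replace = FALSE)
    # accept only if no position holds the same value in both vectors
    if (!any(coding_1 == coding_2)) return(coding_2)
  }
}

set.seed(3)
all_combinations <- 1:20
coding_1 <- sample(all_combinations, 20, replace = FALSE)
coding_2 <- resample_no_overlap(coding_1, all_combinations)

any(coding_1 == coding_2)  # FALSE by construction
```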
Say I want to take a sample of values of variable length from an arbitrary number of different probability distributions, and with a weighted probability of sampling from each distribution.
Seems like I should be able to do this using purrr's map functions, but am struggling...
library(tidyverse)
set.seed(20171127)
# sample from 5 different probability distributions
dists <- tibble(
samp_distA = round(rnorm(n=1000, mean=17, sd=4)),
samp_distB = round(rnorm(n=1000, mean=13, sd=4)),
samp_distC = round(rnorm(n=1000, mean=13, sd=4)),
samp_distD = round(rbeta(n=1000, 2,8)*10),
samp_distE = round(rnorm(n=1000, mean=8, sd=3))
)
# define number of samples to be drawn for each group
n.times <- c(20,15,35,8,6)
# define weights to be used for sampling from dists
probs <- tibble(A = c(0.80, 0.05, 0.05, 0.05, 0.05),
B = c(0.05, 0.80, 0.05, 0.05, 0.05),
C = c(0.05, 0.05, 0.80, 0.05, 0.05),
D = c(0.05, 0.05, 0.05, 0.80, 0.80),
E = c(0.05, 0.05, 0.05, 0.05, 0.80)
)
# sample from dists, n.times, and using probs as weights...
output <- map2(sample, size=n.times, weight=probs, tbl=dists)
#...doesn't work
Any suggestions gratefully received.
set.seed(123)
map2(
n.times,
map(probs, rep, each = nrow(dists)),
sample, x = flatten_dbl(dists), replace = TRUE
)
# [[1]]
# [1] 15 13 18 6 15 15 12 8 9 12 7 17 14 12 15 10 18 19 24 24
#
# [[2]]
# [1] 12 2 15 16 14 17 11 11 10 12 6 19 13 12 13
#
# [[3]]
# [1] 10 9 16 12 13 11 10 18 14 19 16 16 12 19 4 15 19 19 13 14 15 10 14 12 10
# [26] 8 18 19 7 8 21 8 19 10 9
#
# [[4]]
# [1] 3 3 2 15 1 4 14 2
#
# [[5]]
# [1] 9 14 10 6 12 8
NB: I'm dubious about your answer to MrFlick's comment: "an 80% chance of selecting all values from samp_distA". To me it is much more intuitive to go the other route: "an 80% chance for each of the 10 values to come from samp_distA"... so that's what I did. Do you confirm you want the former?
Base R equivalent:
set.seed(123)
mapply(
sample,
n.times,
lapply(probs, rep, each = nrow(dists)),
MoreArgs = list(x = unlist(dists, use.names = FALSE), replace = TRUE)
)
Edit
Re your follow-up question in a comment ("run the function multiple times for each person, e.g. so that as output, person A had 10 lists of randomly-sampled values, each of length 20 (and similar for persons B, C, D, and E, perhaps with each person having a predefined different number of lists)"):
n.reps <- c(A = 10, B = 1, C = 3, D = 2, E = 1)
set.seed(123)
pmap(
list(n.reps, n.times, map(probs, rep, each = nrow(dists))),
function(.x, .y, .z) replicate(
.x,
sample(flatten_dbl(dists), .y, replace = TRUE, .z),
simplify = FALSE
)
)
# $A
# $A[[1]]
# [1] 15 20 16 20 16 14 17 20 21 22 18 19 15 14 18 19 16 20 9 16
#
# $A[[2]]
# [1] 13 9 11 19 25 19 11 18 16 19 16 21 15 12 11 11 9 13 20 1
#
# $A[[3]]
# [1] 15 20 13 20 13 11 16 16 14 19 18 10 21 11 12 16 18 10 20 14
#
# $A[[4]]
# [1] 16 19 14 11 17 9 20 11 19 13 11 16 8 11 10 18 27 22 20 4
#
# $A[[5]]
# [1] 12 18 16 19 13 13 23 19 21 14 22 8 9 19 16 19 9 14 13 20
#
# $A[[6]]
# [1] 18 26 16 15 21 17 15 19 14 18 19 25 5 16 7 19 21 15 23 16
#
# $A[[7]]
# [1] 12 26 20 12 7 5 13 14 19 7 16 12 11 27 22 18 11 17 11 16
#
# $A[[8]]
# [1] 21 18 24 22 18 0 15 3 9 16 16 11 16 20 22 18 18 20 16 21
#
# $A[[9]]
# [1] 15 20 11 16 16 21 12 20 17 9 18 10 22 17 12 0 18 16 23 20
#
# $A[[10]]
# [1] 16 22 15 4 7 19 18 13 15 1 7 18 21 1 20 21 15 12 20 15
#
#
# $B
# $B[[1]]
# [1] 9 5 8 17 9 10 7 13 12 11 9 21 10 15 12
#
#
# $C
# $C[[1]]
# [1] 15 15 16 13 19 14 16 15 11 15 19 16 19 12 6 12 10 12 1 18 9 10 18 11 19
# [26] 9 6 19 18 12 9 18 14 12 7
#
# $C[[2]]
# [1] 5 14 16 10 8 13 8 18 22 18 14 12 13 10 19 12 15 10 16 13 16 9 15 6 15
# [26] 14 4 9 11 11 3 15 18 10 14
#
# $C[[3]]
# [1] 13 8 12 9 6 9 2 7 8 12 2 11 20 10 1 14 14 11 11 1 13 13 18 14 12
# [26] 21 11 3 7 7 13 13 11 7 14
#
#
# $D
# $D[[1]]
# [1] 11 1 1 7 12 6 0 8
#
# $D[[2]]
# [1] 4 1 7 15 2 2 8 9
#
#
# $E
# $E[[1]]
# [1] 7 8 6 11 10 6
This seems doable with purrr, but it takes a bit of set up, particularly because there's not a sample2 function (that I'm aware of) that samples a distribution based on a vector of probabilities, and then grabs a random sample from that subset.
To do that with purrr, we have to loop twice: the outside loops through each person using a simple numerical index; inside that loop, we loop through the n.times to get random samples from the appropriate distribution.
# prep data ---------------------------------------------------------------
# pull all the controls into a single data frame
controldf <- tibble(
cols = c(1:5), n.times
) %>%
bind_cols(probs %>%
t %>%
as.tibble %>%
setNames(c("distA", "distB", "distC", "distD", "distE"))
)
# turn the distributions into long form
longdists <- dists %>%
gather(dist, val)
distnames <- c("A", "B", "C", "D", "E")
# function to do the work ---------------------------------------------------------------
getdist <- function(i) {
# get the probabilities as a numeric vector
myprobs <- controldf[i,3:7] %>% as.numeric
# how many samples do we need
myn <- controldf[[i,2]]
# use our probabilities to decide what distribution to grab from
samplestoget <- sample(distnames, myn, prob = myprobs, replace = T) %>%
paste0("samp_dist", .)
# loop through our list of distributions to grab from
map_dbl(samplestoget, ~filter(
# filter on distribution key
longdists, dist == .x
) %>%
# from that distribution, select a single value at random
sample_n(1) %>%
# extract the numeric value
pluck('val') )
}
# get the values by running the function over our indexes -------------------------
results <- map(controldf$cols, ~ getdist(.x))
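The two-stage idea (first pick a distribution, then pick a value from it) can also be sketched in base R. The names sample_mixture and toy_dists and the three toy distributions are assumptions made for illustration, not the question's data.

```r
set.seed(1)  # arbitrary seed

# Toy stand-ins for the question's distributions (parameters are illustrative)
toy_dists <- list(
  A = round(rnorm(1000, mean = 17, sd = 4)),
  B = round(rnorm(1000, mean = 13, sd = 4)),
  C = round(rbeta(1000, 2, 8) * 10)
)

# sample_mixture is a hypothetical helper name
sample_mixture <- function(dists, n, prob) {
  # stage 1: pick a source distribution for each of the n draws
  source_id <- sample(seq_along(dists), n, replace = TRUE, prob = prob)
  # stage 2: pick one value at random from the chosen source
  vapply(source_id, function(d) sample(dists[[d]], 1), numeric(1))
}

x <- sample_mixture(toy_dists, n = 20, prob = c(0.8, 0.1, 0.1))
length(x)
```

When every source holds the same number of values, this should be equivalent in distribution to sampling the pooled vector with weights rep(prob, each = 1000), which is the approach of the first answer.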
I have a large vector (a column of a data frame) whose values are integers from 1 to 30. I want to replace numbers from 1 to 5 with 1, 6 to 10 with 5, 11 to 15 with 9...
> x3 <- sample(1:30, 100, rep=TRUE)
> x3
[1] 13 24 16 30 10 6 15 10 3 17 18 22 11 13 29 7 25 28 17 27 1 5 6 20 15 15 8 10 13 26 27 24 3 24 5 7 10 6 28 27 1 4 22 25 14 13 2 10 4 29 23 24 30 24 29 11 2 28 23 1 1 2
[63] 3 23 13 26 21 22 11 4 8 26 17 11 20 23 6 14 24 5 15 21 11 13 6 14 20 11 22 9 6 29 4 30 20 30 4 24 23 29
As I mentioned this is a column in a data frame and with above assignment I want to create a different column. If I do the following I have to do this 30 times.
myFrame$NewColumn[myFrame$oldColumn==1] <- 1
myFrame$NewColumn[myFrame$oldColumn==2] <- 1
myFrame$NewColumn[myFrame$oldColumn==3] <- 1
...
What's a better way to do this?
We can do this with cut (assuming the '...' continues the same pattern, so that the six blocks of five map to 1, 5, 9, 13, 17, 21):
x4 <- cut(x3,
breaks = c(seq(1, 30, 5), 30), right = F, include.lowest = T, # generate correct intervals
labels = 4 * (0:5) + 1) # number to fill
# x4 is factor. We should convert it to character first then to the number
x4 <- as.numeric(as.character(x4))
Did you try:
myFrame$NewColumn[myFrame$oldColumn > 0 & myFrame$oldColumn< 6] <- 1
myFrame$NewColumn[myFrame$oldColumn > 5 & myFrame$oldColumn< 11] <- 5
...
Or even better:
myFrame$NewColumn <- as.integer((myFrame$oldColumn - 1) / 5) * 4 + 1
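To see that the arithmetic and the cut approach describe the same mapping, here is a small check over all 30 possible values. It is a sketch: the vector old stands in for the data frame column, and %/% (integer division) is used in place of as.integer(... / 5).

```r
old <- 1:30

# integer division puts 1-5 in block 0, 6-10 in block 1, and so on
by_arithmetic <- ((old - 1) %/% 5) * 4 + 1

# the same mapping via cut(), with the breaks and labels used in the cut answer
by_cut <- cut(old, breaks = c(seq(1, 30, 5), 30), right = FALSE,
              include.lowest = TRUE, labels = 4 * (0:5) + 1)
by_cut <- as.numeric(as.character(by_cut))

all(by_arithmetic == by_cut)
```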