Newbie here
I want an equidistant series of numbers between 0 and 20.
Why do I get two different sets of numbers?
0:20
#[1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
seq(0:20)
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Any help would be appreciated. Thank you
When you pass a single vector to seq(), it returns 1:length(vector), the same result as seq_along(), as in
> seq(c(2, 4, 6, 100))
[1] 1 2 3 4
> seq(c('a', 2, mean))
[1] 1 2 3
That's not how seq() is typically used. You'll get the behavior you expect if you pass the first value in the sequence, the last value, and optionally the step size (by) or the length of the output (length.out). Better would be
> seq(0, 20)
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> seq(from = 0, to = 20)
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Look at the signature of seq() in the documentation.
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)), ...)
To get the same behaviour as the first statement, you should call it with 2 arguments (from and to).
In your second statement you are calling it with only one argument. When that argument is a vector with more than one element, seq() uses its length as the number of elements to generate, starting from the default from value, which is 1.
Check this quick example, with a 6-element vector:
> seq(c(1,4,5,6,2,3))
[1] 1 2 3 4 5 6
In your case, the vector 0:20 has 21 elements, so seq(0:20) generates 21 numbers starting from 1: 1, 2, ..., 21
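To make the difference concrete, here is a small sketch that checks the length of 0:20 and shows the usual ways to ask for an equidistant series. The variable names are only for illustration.

```r
x <- 0:20
length(x)                     # 21 elements, not 20

# With a single vector argument, seq() behaves like seq_along()
all(seq(x) == seq_along(x))   # TRUE

# Equidistant series between 0 and 20: give a step size or an output length
by_step   <- seq(0, 20, by = 5)          # 0 5 10 15 20
by_length <- seq(0, 20, length.out = 5)  # 0 5 10 15 20
```

seq_along(x) is usually the safer choice when you really do want the indices of a vector, since it also behaves sensibly for vectors of length 0 or 1.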
I want to apply a conditional test to consecutive values in a sliding manner.
For example, I have a dataset like this:
data <- data.frame(ID = rep.int(c("A","B"), times = c(24, 12)),
                   time = c(1:24,1:12),
                   visit = as.integer(runif(36, min = 0, max = 20)))
and I get the table below:
> data
ID time visit
1 A 1 7
2 A 2 0
3 A 3 6
4 A 4 6
5 A 5 3
6 A 6 8
7 A 7 4
8 A 8 10
9 A 9 18
10 A 10 6
11 A 11 1
12 A 12 13
13 A 13 7
14 A 14 1
15 A 15 6
16 A 16 1
17 A 17 11
18 A 18 8
19 A 19 16
20 A 20 14
21 A 21 15
22 A 22 19
23 A 23 5
24 A 24 13
25 B 1 6
26 B 2 6
27 B 3 16
28 B 4 4
29 B 5 19
30 B 6 5
31 B 7 17
32 B 8 6
33 B 9 10
34 B 10 1
35 B 11 13
36 B 12 15
I want to flag each ID based on runs of consecutive "visit" values.
If "visit" stays below 10 for 6 consecutive rows, I'd flag the ID as "empty", and as "busy" otherwise.
In the data above, "A" is below 10 in rows 1 to 6, so it is "empty". "B" never has 6 consecutive single-digit values, so it is "busy".
If the condition isn't fulfilled in one segment of 6 values, I want to apply it to the next segment of 6 values.
I'd like to achieve this using R. Any advice will be appreciated.
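One possible reading of this rule is "an ID is empty if any run of at least 6 consecutive rows has visit below 10". Under that assumption, a minimal sketch with rle() could look like the following. The visit values are copied from the table above because the original data comes from runif() without a seed, and has_quiet_run is a hypothetical helper name, not an existing function.

```r
# visit values copied from the printed table (the runif() call is not reproducible)
visit_A <- c(7, 0, 6, 6, 3, 8, 4, 10, 18, 6, 1, 13,
             7, 1, 6, 1, 11, 8, 16, 14, 15, 19, 5, 13)
visit_B <- c(6, 6, 16, 4, 19, 5, 17, 6, 10, 1, 13, 15)
data <- data.frame(ID = rep(c("A", "B"), times = c(24, 12)),
                   time = c(1:24, 1:12),
                   visit = c(visit_A, visit_B))

# Hypothetical helper: TRUE if x has a run of at least run_length
# consecutive values strictly below threshold
has_quiet_run <- function(x, threshold = 10, run_length = 6) {
  runs <- rle(x < threshold)
  any(runs$values & runs$lengths >= run_length)
}

# One flag per ID
flags <- tapply(data$visit, data$ID,
                function(v) if (has_quiet_run(v)) "empty" else "busy")
flags
```

If the segments are meant to be fixed, non-overlapping blocks of 6 rows rather than a sliding window, the helper would need to split each ID into blocks first; the sketch above does not do that.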
How can I write an R script that initializes a vector with integers, rearranges the elements by interleaving the first-half elements with the second-half elements, stores the result in the same vector without using a pre-defined function, and displays the updated vector?
This sounds like a homework question, and it would be nice to see some effort on your own part, but it's pretty straightforward to do this in R.
Suppose your vector looks like this:
vec <- 1:20
vec
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Then you can just do:
c(t(cbind(vec[1:10], vec[11:20])))
#> [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
This works by joining the two vectors into a 10 x 2 matrix, then transposing that matrix and turning it into a vector.
We may use matrix directly and concatenate
c(matrix(vec, nrow = 2, byrow = TRUE))
-output
[1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
data
vec <- 1:20
Or using mapply:
vec <- 1:20
c(mapply(\(x,y) c(x,y), vec[1:10], vec[11:20]))
#> [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
We can try this using order + %%
> vec[order((seq_along(vec) - 1) %% (length(vec) / 2))]
[1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
Another way is to use rbind on the 2 halves of the vector, which creates a matrix with two rows. We can then turn the matrix into a vector, which goes through it column by column (i.e., 1, 11, 2, 12...). However, this will only work for vectors of even length.
vec <- 1:20
c(rbind(vec[1:10], vec[11:20]))
# [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
So, for vectors of odd length, we can use order on the two seq_along vectors. It returns the interleaved indices, which we then use to subset the vector.
vec2 <- 1:21
vec2[order(c(seq_along(vec2[1:10]), seq_along(vec2[11:21])))]
# [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20 21
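Since the question asks for a solution without pre-defined helper functions, here is a loop-based sketch for comparison. interleave_halves is a hypothetical name, and the handling of odd lengths (the last element stays at the end) is an assumption chosen to match the order() example above.

```r
# Hypothetical helper: interleave the first half of v with the second half
interleave_halves <- function(v) {
  n <- length(v)
  half <- n %/% 2
  out <- v                       # same length and type as the input
  for (i in seq_len(half)) {
    out[2 * i - 1] <- v[i]         # element from the first half
    out[2 * i]     <- v[half + i]  # matching element from the second half
  }
  out                            # for odd n the last element is left in place
}

vec <- 1:20
vec <- interleave_halves(vec)    # store the result in the same vector
vec
```

Working on a copy (out) avoids overwriting elements of v that are still needed later in the loop.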
I am creating the folds for a cross-validation with caret.
I discovered a discrepancy between the functions createFolds and createMultiFolds. It seems to me that createFolds samples without replacement, which is the correct behaviour according to my understanding. createMultiFolds appears to have two flaws: first, it uses replacement, and second, it has many more observations per fold than expected.
Does anyone know why these discrepancies occur, or do I have to specify it differently? In the end I would like to use a repeated cross-validation.
Here is a MWE:
library(caret)
data(mtcars)
set.seed(123)
folds <- createMultiFolds(y = mtcars$am, k = 5, times = 5)
set.seed(123)
folds <- createFolds(mtcars$am, k = 5)
The output is as follows:
createMultiFolds (only the first 5 folds):
Fold1.Rep1 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 18 20 22 23 24 25 26 27 29 30 31
Fold2.Rep1 1 2 3 5 6 7 8 9 11 12 14 16 17 18 19 20 21 22 23 24 25 28 29 31 32
Fold3.Rep1 2 4 5 6 7 8 9 10 11 12 13 15 17 18 19 20 21 23 26 27 28 29 30 31 32
Fold4.Rep1 1 2 3 4 5 6 7 10 13 14 15 16 17 18 19 21 22 23 24 25 26 27 28 29 30 32
Fold5.Rep1 1 3 4 5 8 9 10 11 12 13 14 15 16 17 19 20 21 22 24 25 26 27 28 30 31 32
createFolds:
Fold1 5 17 19 21 28 32
Fold2 4 10 13 15 26 27 30
Fold3 1 3 14 16 22 24 25
Fold4 8 9 11 12 20 31
Fold5 2 6 7 18 23 29
If you inspect the source code of createMultiFolds, you will see that it calls createFolds with returnTrain = TRUE. From the documentation,
returnTrain: a logical. When true, the values returned are the sample
positions corresponding to the data used during training.
This argument only works in conjunction with ‘list = TRUE’
Therefore, if you call createFolds with returnTrain = TRUE, the results match:
> library(caret)
> data(mtcars)
> set.seed(123)
> multiFolds <- createMultiFolds(y = mtcars$am, k = 5, times = 2)
> set.seed(123)
> folds1 <- createFolds(mtcars$am, k = 5, returnTrain = TRUE)
> folds2 <- createFolds(mtcars$am, k = 5, returnTrain = TRUE)
> all(multiFolds$Fold1.Rep1 == folds1$Fold1)
[1] TRUE
> all(multiFolds$Fold2.Rep1 == folds1$Fold2)
[1] TRUE
> all(multiFolds$Fold3.Rep1 == folds1$Fold3)
[1] TRUE
> all(multiFolds$Fold4.Rep1 == folds1$Fold4)
[1] TRUE
> all(multiFolds$Fold5.Rep1 == folds1$Fold5)
[1] TRUE
> all(multiFolds$Fold1.Rep2 == folds2$Fold1)
[1] TRUE
> all(multiFolds$Fold2.Rep2 == folds2$Fold2)
[1] TRUE
> all(multiFolds$Fold3.Rep2 == folds2$Fold3)
[1] TRUE
> all(multiFolds$Fold4.Rep2 == folds2$Fold4)
[1] TRUE
> all(multiFolds$Fold5.Rep2 == folds2$Fold5)
[1] TRUE
createMultiFolds has two flaws, first it uses replacement [...]
Where did you get this from? If you’re talking about the 1’s, the first one is part of the name: Fold1.Rep1, Fold2.Rep1, …, Fold{k}.Rep{times}.
As noted in the question, createFolds() splits the data into k folds. However, the output from the function is a list of observation indices that are held out from each fold, not the rows included in each fold. We can see this by creating a table of all the fold data as follows.
set.seed(123)
folds <- createFolds(mtcars$am, k = 5)
table(unlist(folds))
...and the output:
> table(unlist(folds))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
27 28 29 30 31 32
1 1 1 1 1 1
If we use the returnTrain = TRUE argument with createFolds(), it returns the index of observations included in each fold, as illustrated in the other answer. For k = 5, we expect each observation to be used in 4 of the folds, and confirm this with the following code.
set.seed(123)
folds <- createFolds(mtcars$am, k = 5, returnTrain = TRUE)
table(unlist(folds))
...and the output:
> table(unlist(folds))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
27 28 29 30 31 32
4 4 4 4 4 4
As noted in the other answer, setting returnTrain = TRUE causes createFolds() to return the same output as createMultiFolds() with times = 1. We can illustrate that each observation is used in 4 of the 5 folds as follows.
set.seed(123)
folds1 <- createMultiFolds(y = mtcars$am, k = 5, times = 1)
table(unlist(folds1))
...and the output:
> table(unlist(folds1))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
27 28 29 30 31 32
4 4 4 4 4 4
We can compare the contents of folds and folds1 with lapply() and all() as follows.
# compare folds to folds1, element by element
lapply(1:5,function(x){
all(folds1[[x]] == folds[[x]])
})
[[1]]
[1] TRUE
[[2]]
[1] TRUE
[[3]]
[1] TRUE
[[4]]
[1] TRUE
[[5]]
[1] TRUE
If we set times = 2, we expect each observation to be included in 8 of the 10 folds.
set.seed(123)
folds <- createMultiFolds(y = mtcars$am, k = 5, times = 2)
table(unlist(folds))
...and the output:
> table(unlist(folds))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
27 28 29 30 31 32
8 8 8 8 8 8
CONCLUSIONS: in both functions, caret uses sampling to ensure that each observation is in the hold out group exactly once across the k folds within each repetition (times =). The sampling is constrained so that the values of the dependent variable passed to the function are distributed proportionally between the in sample and out of sample components of each fold.
In the case of a small data set such as mtcars, it's not easy for the algorithm to split effectively, as we can see when we run tables to compare in sample / holdout vs. mtcars$am.
set.seed(123)
folds <- createFolds(mtcars$am, k = 5)
table(unlist(folds))
lapply(folds,function(x){
holdout <- rep(FALSE,nrow(mtcars))
holdout[x] <- TRUE
table(holdout,mtcars$am)
})
$Fold1
holdout 0 1
FALSE 16 10
TRUE 3 3
$Fold2
holdout 0 1
FALSE 15 10
TRUE 4 3
$Fold3
holdout 0 1
FALSE 14 11
TRUE 5 2
$Fold4
holdout 0 1
FALSE 15 11
TRUE 4 2
$Fold5
holdout 0 1
FALSE 16 10
TRUE 3 3
Each fold contains 6 or 7 observations in the hold out set, with a minimum of 2 manual transmission cars (am = 1) in each hold out set.
With default arguments, createFolds() returns the indexes of held out observations rather than included observations. createFolds(x,k,returnTrain=TRUE) behaves exactly the same as createMultiFolds(x,k,times=1).
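The relationship between held out indices and training indices can be illustrated without caret. The sketch below is plain base R and is not caret's algorithm (in particular it does no stratification on the outcome); it only shows why each observation appears once in the hold out sets and k - 1 times in the training sets.

```r
n <- 32   # number of rows, as in mtcars
k <- 5    # number of folds

set.seed(123)
# Randomly assign each row to one of k folds of nearly equal size
fold_id  <- sample(rep(seq_len(k), length.out = n))
held_out <- split(seq_len(n), fold_id)

# Training indices for a fold are simply all rows that are not held out
train <- lapply(held_out, function(idx) setdiff(seq_len(n), idx))

table(unlist(held_out))   # every row appears exactly once
table(unlist(train))      # every row appears k - 1 = 4 times
```

This mirrors the tables shown above: the longer index lists returned by createMultiFolds() are the training side of the same partition, not a sample with replacement.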
Say I have a vector named all_combinations with numbers from 1 to 20.
I need to extract 2 vectors (coding_1 and coding_2) of length equal to number_of_peptide_clusters, which happens to be 20 as well in my current case.
The 2 new vectors should be randomly sampled from all_combinations so that they never hold the same value at the same index position.
I do the following:
set.seed(3)
all_combinations=1:20
number_of_peptide_clusters=20
coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_1
[1] 5 12 7 4 10 8 11 15 17 16 18 13 9 20 2 14 19 1 3 6
coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_2
[1] 5 9 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
This is the example that gives me trouble, because only one number overlaps at the same index (5 at position 1).
What I would normally do in these cases is spot the overlapping numbers and resample them from the list of all overlapping numbers...
Imagine coding_1 and coding_2 were:
coding_1
[1] 5 9 7 4 10 8 11 15 17 16 18 13 12 20 2 14 19 1 3 6
coding_2
[1] 5 9 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
In this case I would have 5 and 9 overlapping in the same position, so I would resample them in coding_2 out of the full list of overlapping ones [resample index 1 from c(5,9) so that isn't equal to 5, and index 2 so it isn't equal to 9]. So coding_2 would be:
coding_2
[1] 9 5 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
However, in the particular case above I cannot use that approach, because there is only one overlapping number to resample from. So what would be the best way to obtain 2 samples of length 20 from a vector that also has length 20, so that the samples don't overlap at any index position?
It would be great if I could obtain the second sample coding_2 already knowing coding_1... Otherwise, obtaining the 2 at the same time would also be acceptable if it makes things easier. Thanks!
I think the best solution is simply to use a rejection strategy:
set.seed(3)
all_combinations <- 1:20
number_of_peptide_clusters <- 20
count <- 0
repeat {
count <- count + 1
message("Try number ", count)
coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
if (!any(coding_1 == coding_2))
break
}
#> Try number 1
#> Try number 2
#> Try number 3
#> Try number 4
#> Try number 5
#> Try number 6
#> Try number 7
#> Try number 8
#> Try number 9
coding_1
#> [1] 18 16 17 12 13 8 6 15 3 5 20 9 11 4 19 2 14 7 1 10
coding_2
#> [1] 5 20 14 2 11 6 7 10 19 8 4 1 15 9 13 17 18 16 12 3
Created on 2020-11-04 by the reprex package (v0.3.0)
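The question mentions a preference for drawing coding_2 when coding_1 is already known. The same rejection idea works for that: keep coding_1 fixed and redraw only coding_2 until no position matches. This is a sketch under that assumption; sample_without_matches and max_tries are hypothetical names introduced here.

```r
# Hypothetical helper: draw a permutation of pool that differs from
# reference at every index position (plain rejection sampling)
sample_without_matches <- function(reference, pool = reference, max_tries = 1000L) {
  for (attempt in seq_len(max_tries)) {
    candidate <- sample(pool, length(reference), replace = FALSE)
    if (!any(candidate == reference)) {
      return(candidate)
    }
  }
  stop("no non-overlapping sample found in ", max_tries, " tries")
}

set.seed(3)
all_combinations <- 1:20
coding_1 <- sample(all_combinations, length(all_combinations), replace = FALSE)
coding_2 <- sample_without_matches(coding_1, all_combinations)

any(coding_1 == coding_2)   # FALSE
```

For a full permutation, roughly 1 in e (about 37%) of random draws have no matching position, so only a few attempts are needed on average and the max_tries guard should rarely matter.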
I have a homework problem where I have a population of 30 men and a random sample of 10 of them:
men
[1] 15 18 14 6 22 17 20 3 16 9
From the population, I have to draw 12 random samples and determine how many times each man appears.
The problem statement, verbatim, is "Perform 12 samples of 10 men from a population of size 30 and for each man, record the number
samples in which he appears."
I have attempted a loop for the problem that would produce a vector of 10 elements, each one lined up with the appropriate index.
mtimes<-rep(0,12)
> repeat{
+ mtimes[menind]<-sum(sample(pop1,12,replace = TRUE) == men[menind])
+ menind = menind + 1
+ if (menind == 10){
+ break
+ }
+ }
This resulted in a vector:
mtimes
[1] 0 0 1 0 0 0 0 0 0 0
It seems the 3rd man should not have appeared only once while no one else appeared in the samples.
You can use replicate and table here
set.seed(1)
table(replicate(n = 12, expr = sample(30, size = 10, replace = TRUE)))
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
# 3 2 3 5 2 2 5 5 3 3 6 7 4 5 8 2 1 3 2 9 3 7 2 8 3 3 5 3 3 3
I assume that by "men" you mean 1:30.
Because the draws are made with replacement, another option would be to increase the size of a single sample to 10*12, as in
set.seed(1)
table(sample(30, size = 10*12, replace = TRUE))
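Both snippets above draw with replacement, so a man can be drawn more than once within one sample and the table counts draws rather than samples. If the problem statement means that each sample holds 10 distinct men, a sketch along these lines may be closer to what is asked. It assumes the population is 1:30 and reuses the men vector from the question.

```r
set.seed(1)
# 12 samples of 10 distinct men each; result is a 10 x 12 matrix
samples <- replicate(n = 12, expr = sample(30, size = 10, replace = FALSE))

# Number of samples in which each of the 30 men appears (0 to 12);
# factor() with explicit levels keeps men who never appear
appearances <- table(factor(samples, levels = 1:30))

# Counts for the 10 men from the question
men <- c(15, 18, 14, 6, 22, 17, 20, 3, 16, 9)
appearances[as.character(men)]
```

Because no man can repeat within a column, each count here is the number of samples containing that man, and the counts over all 30 men add up to 120.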