I'm trying to find a package in R where I can find clusters that exceed a given threshold in a dataset.
What I want to know is the cluster duration/size and the individual values of each cluster.
For example (a simple one):
I have a vector of data,
10 8 6 14 14 7 14 5 11 12 8 11 11 16 20 6 8 8 6 15
The clusters of values larger than 9 are marked in square brackets,
[10] 8 6 [14 14] 7 [14] 5 [11 12] 8 [11 11 16 20] 6 8 8 6 [15]
So here the cluster sizes in order are,
1, 2, 1, 2, 4, 1
What I want R to do is return the clusters in separate ordered vectors, e.g.
[1] 10
[2] 14 14
[3] 14
[4] 11 12
[5] 11 11 16 20
[6] 15
Is there such a package? A piece of code, for example with if statements, would also help.
Cheers
The data.table::rleid function works well for this:
vec <- c(10, 8, 6, 14, 14, 7, 14, 5, 11, 12, 8, 11, 11, 16, 20, 6, 8, 8, 6, 15)
Filter(function(a) a[1] > 9, split(vec, data.table::rleid(vec > 9)))
# $`1`
# [1] 10
# $`3`
# [1] 14 14
# $`5`
# [1] 14
# $`7`
# [1] 11 12
# $`9`
# [1] 11 11 16 20
# $`11`
# [1] 15
If you'd prefer not to load the data.table package just for that, here is a base-R approach from https://stackoverflow.com/a/33509966:
myrleid <- function(x) {
  rl <- rle(x)$lengths
  rep(seq_along(rl), times = rl)
}
Filter(function(a) a[1] > 9, split(vec, myrleid(vec > 9)))
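Since the question also asks for the cluster sizes, here is a minimal base-R sketch that returns both the clusters and their lengths. It assumes the data from the question is stored in vec; the names above, runs, run_id, clusters and cluster_sizes are only illustrative.

```r
# Data from the question (assumed to be stored in `vec`)
vec <- c(10, 8, 6, 14, 14, 7, 14, 5, 11, 12, 8, 11, 11, 16, 20, 6, 8, 8, 6, 15)

above <- vec > 9                      # TRUE where the threshold is exceeded
runs <- rle(above)                    # run lengths of consecutive TRUE/FALSE
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)

# Keep only the values above the threshold, grouped by run
clusters <- unname(split(vec[above], run_id[above]))

# Size of each cluster; equivalent to runs$lengths[runs$values]
cluster_sizes <- lengths(clusters)
cluster_sizes
#> [1] 1 2 1 2 4 1
```

The threshold (9 here) can be swapped for any other value without changing the rest of the code.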
Related
How to write an R script to initialize a vector with integers, rearrange the elements by interleaving the first-half elements with the second-half elements, store the result in the same vector without using a pre-defined function, and display the updated vector.
This sounds like a homework question, and it would be nice to see some effort on your own part, but it's pretty straightforward to do this in R.
Suppose your vector looks like this:
vec <- 1:20
vec
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Then you can just do:
c(t(cbind(vec[1:10], vec[11:20])))
#> [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
This works by joining the two vectors into a 10 x 2 matrix, then transposing that matrix and turning it into a vector.
We may use matrix directly and concatenate
c(matrix(vec, nrow = 2, byrow = TRUE))
-output
[1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
data
vec <- 1:20
Or using mapply:
vec <- 1:20
c(mapply(\(x,y) c(x,y), vec[1:10], vec[11:20]))
#> [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
We can try this using order + %%
> vec[order((seq_along(vec) - 1) %% (length(vec) / 2))]
[1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
Another way is to use rbind on the 2 halves of the vector, which creates a matrix with two rows. We can then turn the matrix into a vector, which will go through it column by column (i.e., 1, 11, 2, 12...). However, this will only work for vectors of even length.
vec <- 1:20
c(rbind(vec[1:10], vec[11:20]))
# [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
So, for vectors of odd length, we can use order on the two seq_along vectors, which returns the interleaved indices. Use these indices to subset vec2; here they happen to equal the values because vec2 is 1:21.
vec2 <- 1:21
order(c(seq_along(vec2[1:10]),seq_along(vec2[11:21])))
# [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20 21
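The order idea above can be wrapped so that it works for both even and odd lengths and for any values, not only 1:n. This is only a sketch: the function name interleave_halves is made up, and it assumes (as in the vec2 example) that the extra element of an odd-length vector belongs to the second half.

```r
# Hypothetical helper: interleave the first half of x with the second half.
# Assumption: for odd lengths the second half gets the extra element.
interleave_halves <- function(x) {
  n <- length(x)
  half <- n %/% 2
  # Positions 1..half for the first half, 1..(n - half) for the second;
  # ordering these ranks alternates between the two halves.
  idx <- order(c(seq_len(half), seq_len(n - half)))
  x[idx]
}

interleave_halves(letters[1:7])
#> [1] "a" "d" "b" "e" "c" "f" "g"
```

To store the result in the same vector, as the question asks, assign it back: vec <- interleave_halves(vec).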
Say I have a vector named all_combinations with numbers from 1 to 20.
I need to extract 2 vectors (coding_1 and coding_2) of length equal to number_of_peptide_clusters, which happens to be 20 as well in my current case.
The 2 new vectors should be randomly sampled from all_combinations, so that they do not overlap at any index position.
I do the following:
set.seed(3)
all_combinations=1:20
number_of_peptide_clusters=20
coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_1
[1] 5 12 7 4 10 8 11 15 17 16 18 13 9 20 2 14 19 1 3 6
coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
coding_2
[1] 5 9 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
This is the example that gives me trouble, because only one number overlaps at the same index (5 at position 1).
What I would do in these cases is spot the overlapping numbers and resample them out of the list of all overlapping numbers...
Imagine coding_1 and coding_2 were:
coding_1
[1] 5 9 7 4 10 8 11 15 17 16 18 13 12 20 2 14 19 1 3 6
coding_2
[1] 5 9 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
In this case I would have 5 and 9 overlapping in the same position, so I would resample them in coding_2 out of the full list of overlapping ones [resample index 1 from c(5,9) so that it isn't equal to 5, and index 2 so that it isn't equal to 9]. So coding_2 would be:
coding_2
[1] 9 5 19 16 18 12 8 6 15 3 13 14 7 2 11 20 10 4 17 1
However, in the particular case above, I cannot use such an approach... So what would be the best way to obtain 2 samples of length 20 from a vector that also has length 20, so that the samples do not overlap at the same index positions?
It would be great if I could obtain the second sample coding_2 already knowing coding_1... Otherwise obtaining the 2 at the same time would also be acceptable if it makes things easier. Thanks!
I think the best solution is simply to use a rejection strategy:
set.seed(3)
all_combinations <- 1:20
number_of_peptide_clusters <- 20
count <- 0
repeat {
  count <- count + 1
  message("Try number ", count)
  coding_1 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
  coding_2 <- sample(all_combinations, number_of_peptide_clusters, replace = FALSE)
  if (!any(coding_1 == coding_2))
    break
}
#> Try number 1
#> Try number 2
#> Try number 3
#> Try number 4
#> Try number 5
#> Try number 6
#> Try number 7
#> Try number 8
#> Try number 9
coding_1
#> [1] 18 16 17 12 13 8 6 15 3 5 20 9 11 4 19 2 14 7 1 10
coding_2
#> [1] 5 20 14 2 11 6 7 10 19 8 4 1 15 9 13 17 18 16 12 3
Created on 2020-11-04 by the reprex package (v0.3.0)
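The question also mentions wanting coding_2 when coding_1 is already known. The same rejection idea can be sketched for that case by drawing coding_1 once and resampling only coding_2 until no position matches. This is a variation on the loop above, not a different algorithm; for a full permutation, roughly one draw in e (about 37%) has no matching position, so a handful of tries is typical.

```r
set.seed(3)
all_combinations <- 1:20

# coding_1 is treated as given; here it is drawn once for illustration
coding_1 <- sample(all_combinations)

# Resample only coding_2 until it differs from coding_1 at every index
repeat {
  coding_2 <- sample(all_combinations)
  if (!any(coding_1 == coding_2))
    break
}

any(coding_1 == coding_2)
#> [1] FALSE
```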
I have two vectors which define start (from) indices and finish (to) indices:
Start = c(1, 10, 20)
Finish = c(9, 19, 30)
I want to create a list of all Start:Finish sequences along the two vectors, i.e. generate the sequences Start[1]:Finish[1] (1:9); Start[2]:Finish[2], and so on.
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9
##
## [[2]]
## [1] 10 11 12 13 14 15 16 17 18 19
##
## [[3]]
## [1] 20 21 22 23 24 25 26 27 28 29 30
Preferably in some vectorized way. The values in the 'Start' vector will always be smaller than the corresponding elements in the 'Finish' vector.
Just use mapply:
Start = c(1,10,20)
Finish = c(9,19,30)
mapply(":", Start, Finish)
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9
##
## [[2]]
## [1] 10 11 12 13 14 15 16 17 18 19
##
## [[3]]
## [1] 20 21 22 23 24 25 26 27 28 29 30
##
You could, of course, also use Vectorize, but that's just a wrapper for mapply. However, Vectorize cannot be used with primitive functions, so you'll have to specify seq.default rather than seq or seq.int.
Example:
Vectorize(seq.default)(Start, Finish)
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9
##
## [[2]]
## [1] 10 11 12 13 14 15 16 17 18 19
##
## [[3]]
## [1] 20 21 22 23 24 25 26 27 28 29 30
##
Agree with @ColonelBeauvel and @nicola, though you could use seq instead of :, hence
Start = c(1, 10, 20)
Finish = c(9, 19, 30)
Map(seq, Start, Finish)
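One thing worth knowing when choosing between mapply and Map: mapply simplifies its result by default, so it only returns a list when the sequences have different lengths, as they do in the example. A small sketch of the difference, using made-up vectors s2 and f2 whose sequences all have the same length:

```r
Start <- c(1, 10, 20)
Finish <- c(9, 19, 30)

# Map never simplifies, so the result is always a list
seqs <- Map(seq, Start, Finish)
lengths(seqs)
#> [1]  9 10 11

# Hypothetical inputs where every sequence has length 5
s2 <- c(1, 11)
f2 <- c(5, 15)

is.matrix(mapply(":", s2, f2))                  # simplified to a 5 x 2 matrix
#> [1] TRUE
is.list(mapply(":", s2, f2, SIMPLIFY = FALSE))  # forced to stay a list
#> [1] TRUE
```

If a list is always wanted, Map or mapply(..., SIMPLIFY = FALSE) is the safer choice.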
It would be very helpful to me to be able to create an R list object without having to specify the names of each element. For example:
a1 <- 1
a2 <- 20
a3 <- 1:20
b <- list(a1,a2,a3, inherit.name=TRUE)
> b
[[a1]]
[1] 1
[[a2]]
[1] 20
[[a3]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
This would be ideal. Any suggestions?
The tidyverse package tibble has a function that can do this as well. Try out tibble::lst:
tibble::lst(a1, a2, a3)
# $a1
# [1] 1
#
# $a2
# [1] 20
#
# $a3
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Coincidentally, I just wrote this function. It looks a lot like @joran's solution, but it tries not to stomp on already-named arguments.
namedList <- function(...) {
  L <- list(...)
  snm <- sapply(substitute(list(...)), deparse)[-1]
  if (is.null(nm <- names(L))) nm <- snm
  if (any(nonames <- nm == "")) nm[nonames] <- snm[nonames]
  setNames(L, nm)
}
## TESTING:
a <- b <- c <- 1
namedList(a,b,c)
namedList(a,b,d=c)
namedList(e=a,f=b,d=c)
Copied from comments: if you want something from a CRAN package, you can use Hmisc::llist:
Hmisc::llist(a, b, c, d=a, labels = FALSE)
The only apparent difference is that the individual vectors also have names in this case.
A random idea:
a1<-1
a2<-20
a3<-1:20
my_list <- function(...) {
  names <- as.list(substitute(list(...)))[-1L]
  result <- list(...)
  names(result) <- names
  result
}
> my_list(a1,a2,a3)
$a1
[1] 1
$a2
[1] 20
$a3
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
(The idea is stolen from the code in data.frame.)
Another idea:
sapply(ls(pattern='^a[0-9]'), get)
$a1
[1] 1
$a2
[1] 20
$a3
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
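A closely related base-R option is mget, which looks up several objects by name and returns them as a named list. Unlike sapply, it does not try to simplify, so the result stays a list even if all objects happen to have length 1. A minimal sketch, assuming the objects live in the environment where the code runs:

```r
a1 <- 1
a2 <- 20
a3 <- 1:20

# mget() takes a character vector of object names and returns a named list.
# envir = environment() is passed explicitly so the lookup happens
# in the environment where this code is evaluated.
b <- mget(c("a1", "a2", "a3"), envir = environment())
names(b)
#> [1] "a1" "a2" "a3"
```

The names could also be collected with ls(pattern = "^a[0-9]") as in the sapply version above.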