Select random and unique elements from a vector - r

Say I have a simple vector with repeated elements:
a <- c(1,1,1,2,2,3,3,3)
Is there a way to randomly select one element from each group of repeated elements? For example, one random draw of the indices to keep would be:
1,4,6 ## here I selected the first 1, the first 2 and the first 3
And another:
1,5,8 ## here I selected the first 1, the second 2 and the third 3
I could do this with a loop over each repeated element, but I am sure there must be a faster way.
EDIT:
Ideally the solution should also always select a particular element if it is already a unique element. I.e. my vector could also be:
b <- c(1,1,1,2,2,3,3,3,4) ## The number four is unique and should always be drawn

Using base R ave we could do something like
unique(ave(seq_along(a), a, FUN = function(x) if(length(x) > 1) head(sample(x), 1) else x))
#[1] 3 5 6
unique(ave(seq_along(a), a, FUN = function(x) if(length(x) > 1) head(sample(x), 1) else x))
#[1] 3 4 7
This generates an index for every value of a, grouped by a and then selects one random index value in each group.
Using same logic with sapply and split
sapply(split(seq_along(a), a), function(x) if(length(x) > 1) head(sample(x), 1) else x)
And it would also work with tapply
tapply(seq_along(a), a, function(x) if(length(x) > 1) head(sample(x), 1) else x)
We need to check the length (if(length(x) > 1)) because, from ?sample:
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x.
Hence, when sample() is given a single number n, it samples from 1:n rather than returning n, so we need to check its length.
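The length check can also be folded into a small helper. The sketch below follows the resample pattern given in the examples section of ?sample; the helper name and its use with tapply are illustrative and not part of the answer above. It is shown on the vector b from the question's edit, where the value 4 occurs only once.

```r
# Safe sampling helper: always samples from the elements of x,
# even when x has length 1 (pattern taken from the examples in ?sample)
resample <- function(x, ...) x[sample.int(length(x), ...)]

b <- c(1, 1, 1, 2, 2, 3, 3, 3, 4)

# One random index per distinct value of b; no length check needed
idx <- tapply(seq_along(b), b, function(i) resample(i, 1))

b[idx]  # one element from each group, in order of the distinct values
```

Because resample() indexes into x instead of passing x to sample() directly, the lone index 9 (the position of the value 4) is always returned as is.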

Related

Is there a function in R which checks if two XStrings have matching substrings of some size (n) in corresponding positions

I am trying to determine if two XStrings of equal length have the same substring of some given length in the corresponding positions.
Is there a built-in function in R for this problem?
Let's say I have strings
a <- "AACCTGCCCGGAACCT"
b <- "CCATCGCCCGGAACCT"
where the shared substring is GCCCGGAA and the given length is 8.
I need a function fun(a,b,len=8) that would return TRUE or possibly even a position where such a substring first occurs.
Of course, real strings that I am using are much longer and the given length of substring may not be 8 all the time.
This could be done with for loops, but I would prefer not to use them.
You could do this by splitting the strings into individual characters, testing equality of the resulting vectors, and performing run-length encoding on the logical vector produced:
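f <- function(a, b, n) {
  # Compare the strings character by character, then run-length encode the result
  rl <- rle(strsplit(a, "")[[1]] == strsplit(b, "")[[1]])
  # Index of the first run of matches that is at least n long
  ind <- which(rl$values & rl$lengths >= n)[1]
  if (is.na(ind)) return(NA_integer_)  # no such run
  # Start position = total length of all preceding runs + 1
  sum(rl$lengths[seq_len(ind - 1)]) + 1
}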
This will tell you the first position in the strings where there are at least n parallel matching bases:
f(a, b, 8)
#> [1] 6
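To make the run-length encoding step concrete, here is a minimal sketch of the intermediate values. The strings are the question's examples, restated so the block is self-contained; the variable names eq, rl and starts are illustrative.

```r
a <- "AACCTGCCCGGAACCT"
b <- "CCATCGCCCGGAACCT"

# TRUE wherever the two strings agree at the same position
eq <- strsplit(a, "")[[1]] == strsplit(b, "")[[1]]

# Collapse consecutive equal values into runs
rl <- rle(eq)
rl$values   # FALSE  TRUE
rl$lengths  # 5 11

# Start position of every run: 1, then 1 + cumulative lengths of earlier runs
starts <- cumsum(c(1, head(rl$lengths, -1)))

# Start positions of matching runs that are at least 8 long
starts[rl$values & rl$lengths >= 8]
# [1] 6
```

Computing all run starts at once returns every qualifying position, not just the first, and gives an empty result when no run is long enough.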
We can use rleid
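library(data.table)
Map(function(u, v) {
  i1 <- u == v       # position-wise matches
  grp <- rleid(i1)   # run id for each position
  # first matching position that belongs to a run of length >= 8
  which(i1 & ave(seq_along(grp), grp, FUN = length) >= 8)[1]
}, strsplit(a, ""), strsplit(b, ""))[[1]]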
#[1] 6

R: Can we sum a vector with a condition?

Is it possible to sum all the elements at even indices of an R vector without iterating over every element? Something like sum(vectorx[i*2]) for i in 1:5.
Multiply the vector by c(0, 1) and then add the elements. Due to vector recycling, the elements with odd indices will be multiplied by 0 and the ones in even indices will be multiplied by 1
x = 1:10
sum(x * c(0, 1))
#[1] 30
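One caveat with the multiplication approach: when the vector has odd length, recycling c(0, 1) is incomplete and R issues a warning, although the sum is still correct. The sketch below shows this for a vector of length 9 and compares it with recycling a logical index, which does not warn. The names s1, s2 and warned are illustrative.

```r
x <- 1:9  # odd length, so c(0, 1) is not recycled a whole number of times

# Arithmetic recycling warns that the longer length is not a multiple of the shorter
warned <- tryCatch({ x * c(0, 1); FALSE }, warning = function(w) TRUE)
warned
# [1] TRUE

s1 <- suppressWarnings(sum(x * c(0, 1)))  # 2 + 4 + 6 + 8
s2 <- sum(x[c(FALSE, TRUE)])              # logical index is recycled silently
c(s1, s2)
# [1] 20 20
```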
There are multiple ways to do this
set.seed(1234)
i <- sample(5)
i
#[1] 4 5 2 3 1
1) Use recycling method
sum(i[c(FALSE, TRUE)])
#[1] 8
2) Create a sequence of alternating index to subset
sum(i[seq(2, length(i), 2)])
3) Use modulo division
sum(i[seq_along(i) %% 2 == 0])
We can use seq.int
x <- 1:10
sum(x[seq.int(2, length(x), 2)])

Remove isolated elements of a vector

I have a vector of integers and I want to filter it by eliminating the components that are "isolated".
What do I mean by "isolated"? Those components that do not lie in a 4-neighbourhood of any other component.
The components in the vector are in increasing order, and there are no repetitions.
For example, if I have c(1,2,3,8,15,16,17) then I need to eliminate 8 because it is not in a 4-neighbourhood of any other element.
I've tried applying
for (p in 1:(length(index)-2))
if((index[p+1]>3+index[p])&(index[p+2]>3+index[p+1])){index[p+1]<-0}
index<-index[index!=0]
where index is my vector of interest, but there's some problem with the logical condition.
Could you please give me some hints?
Thanks in advance.
You can achieve it with a combination of outer and colSums, i.e.
x[colSums(abs(outer(x, x, `-`)) >= 4) == length(x)-1]
#[1] 8
To eliminate the values, we can do,
i1 <- colSums(outer(x, x, FUN = function(i, j) abs(i - j) >= 4)) == length(x) - 1
x[!i1]
#[1] 1 2 3 15 16 17
where,
x <- c(1,2,3,8,15,16,17)
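We keep values whose difference to the preceding or the next element is at most 4. Note that this counts a gap of exactly 4 as a neighbour, whereas the outer-based answer above treats it as isolated; replace <= with < to match it: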
v <- c(1,2,3,8,15,16,17)
v[c(FALSE, diff(v) <= 4) | c(diff(v) <= 4, FALSE)]
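Because the vector is sorted, only adjacent gaps need checking. The sketch below wraps that idea in a hypothetical helper, drop_isolated, and assumes that "isolated" means a gap of at least k = 4 on both sides (the threshold implied by the question's own loop); adjust the comparison if a gap of exactly 4 should count as a neighbour.

```r
# Hypothetical helper: drop elements of a sorted vector whose gap to
# both neighbours is k or more (assumed definition of "isolated")
drop_isolated <- function(v, k = 4) {
  close <- diff(v) < k              # TRUE where adjacent elements are neighbours
  has_prev <- c(FALSE, close)       # element is close to the one before it
  has_next <- c(close, FALSE)       # element is close to the one after it
  v[has_prev | has_next]
}

v <- c(1, 2, 3, 8, 15, 16, 17)
diff(v)
# [1] 1 1 5 7 1 1
drop_isolated(v)
# [1]  1  2  3 15 16 17
```

This uses time and memory linear in the length of the vector, whereas the outer-based approach builds a full length(x) by length(x) matrix.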

How to exclude a set from a larger set in R

Suppose that I have a set of 10 elements. Suppose that my code is able to choose only 3 elements at a time. Then, I would like it to choose another 3 elements, however, without selecting the elements that are already selected.
x <- c(4,3,5,6,-2,7,-4,10,22,-12)
Then, suppose that my condition is to select 3 elements that are less than 5. Then,
new_x <- c(4, 3, -2)
Then, I would like to select another 3 elements that are less than 5 but were not selected the first time. If fewer than 3 such elements remain, the missing positions should be filled with zero.
Hence,
new_xx <- c(-4,-12,0)
Any help, please?
Here is an option using split
f <- function(x, max = 5, n = 3) {
  x <- x[x < max]                             # keep entries below the threshold
  ret <- split(x, ceiling(seq_along(x) / n))  # chunks of at most n elements
  # zero-pad chunks that are shorter than n
  lapply(ret, function(w) replace(rep(0, n), seq_along(w), w))
}
f(x)
#$`1`
#[1] 4 3 -2
#
#$`2`
#[1] -4 -12 0
Explanation: We define a custom function that first selects entries < 5, then splits the resulting vector into chunks of length 3 and stores the result in a list, and finally 0-pads those list elements that are vectors of length < 3.
Sample data
x <- c(4,3,5,6,-2,7,-4,10,22,-12)
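As an alternative to the list of chunks, the same filter, chunk and zero-pad steps can be sketched with a matrix that holds one selection per row. This is a swapped-in variant rather than the answer's method; the names kept, pad and m are illustrative.

```r
x <- c(4, 3, 5, 6, -2, 7, -4, 10, 22, -12)
n <- 3

kept <- x[x < 5]             # 4 3 -2 -4 -12
pad  <- -length(kept) %% n   # number of zeros needed to fill the last row
m <- matrix(c(kept, rep(0, pad)), ncol = n, byrow = TRUE)
m
#      [,1] [,2] [,3]
# [1,]    4    3   -2
# [2,]   -4  -12    0
```

Row i of m is then the i-th selection, so m[2, ] gives c(-4, -12, 0).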

counting elements in a list based on another list

I have two lists looking like this:
mylist <- list(a=c(1:5),
b = c(5:12),
c = c(2:8))
list.id <- list(a=2, b=8, c=5)
I want to count the number of values in each element of mylist that are higher than the corresponding element in list.id, and divide the result by the length of that element of mylist. I have written this function:
perm.fun <- perm.fun2 = function(x,y){length(which(x[[i]] < y[[i]]))/length(x[[i]])}
However, when I do: lapply(mylist, perm.fun, list.id) I do not obtain the expected result.
Thanks
Using lapply, you would need to loop on the indices (1, 2, 3) so they can be used to extract the elements from both mylist and list.id:
perm.fun <- function(i, x, y) mean(x[[i]] > y[[i]])
lapply(seq_along(mylist), perm.fun, mylist, list.id)
But mapply is a much better tool for that task. From the doc:
mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on.
So your code can just be:
mapply(function(x, y) mean(x > y), mylist, list.id)
# a b c
# 0.6000000 0.5000000 0.4285714
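mapply pairs the elements by position, so it relies on both lists being in the same order. If that cannot be guaranteed, a minimal sketch that pairs them by name instead could look like this; it assumes every name in mylist also appears in list.id, and list.id is deliberately reordered here for illustration.

```r
mylist  <- list(a = 1:5, b = 5:12, c = 2:8)
list.id <- list(c = 5, a = 2, b = 8)  # same content as in the question, different order

# Look up each threshold by name; vapply also enforces one numeric result per element
res <- vapply(names(mylist),
              function(nm) mean(mylist[[nm]] > list.id[[nm]]),
              numeric(1))
res
#         a         b         c
# 0.6000000 0.5000000 0.4285714
```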

Resources