unexpected results when comparing Biostrings subsequences using the identical function


I'm checking if a sequence is present at the beginning and at the end of a longer sequence. I considered using identical but this gives me a surprising result:
library(Biostrings)
EcoRI <- DNAString("GAATTC")
myseq <- DNAString("GAATTCGGGGAAAATTTTCCCCGAATTC")
EcoRI
# 6-letter "DNAString" instance
#seq: GAATTC
subseq(myseq, 1, 6)
# 6-letter "DNAString" instance
#seq: GAATTC
subseq(myseq, 23, 28)
# 6-letter "DNAString" instance
#seq: GAATTC
identical(EcoRI, subseq(myseq, 1, 6))
#TRUE
identical(EcoRI, subseq(myseq, 23, 28))
#FALSE
identical(subseq(myseq, 1, 6), subseq(myseq, 23, 28))
#FALSE
An easy fix is to use:
identical(toString(EcoRI), toString(subseq(myseq, 23, 28)))
# TRUE
But I don't understand why identical on the DNAString objects returns FALSE sometimes.
Does identical also compare the offset attributes?
attributes(EcoRI)$offset
#[1] 0
attributes(subseq(myseq, 1, 6))$offset
#[1] 0
attributes(subseq(myseq, 23, 28))$offset
#[1] 22
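The suspicion above can be illustrated with a rough base-R analogy. This is only a sketch: it uses plain integer vectors with a made-up `offset` attribute, not actual Biostrings objects, and whether DNAString behaves the same way internally is an assumption. It shows that identical takes attributes into account, while an element-wise comparison looks only at the values:

```r
# Two vectors with the same values but a different (hypothetical) offset attribute
a <- structure(1:6, offset = 0L)
b <- structure(1:6, offset = 22L)

same_object <- identical(a, b)  # attributes take part in the comparison
same_values <- all(a == b)      # only the values are compared

print(c(identical = same_object, values = same_values))
```

If the same mechanism applies to the subsequences, it would explain why converting both sides with toString gives TRUE.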

Related

unique pairs or combinations from a vector

Where am I going wrong with my function?
I am trying to create a function which will count all the unique pairs in a vector. Say I have the following input:
ar <- c(10, 20, 20, 30, 30, 30, 40, 50)
The number of unique pairs is 20 = 1, 30 = 1, so I can just sum these up and the total number of unique pairs is 2.
However, everything I try counts 30 as having 2 unique pairs (since 30 occurs 3 times in the vector).
n <- 9
ar <- c(10, 20, 20, 30, 30, 30, 40, 50)
CountThePairs <- function(n, ar){
for(i in 1:length(ar)){
sum = ar[i] - ar[]
pairs = length(which(sum == 0))
}
return(sum)
}
CountThePairs(n = NULL, ar)
Is there an easier way of doing this? I prefer the base R version but interested in package versions also.

Here's a simpler way using floor and table from base R -
ar <- c(10, 20, 20, 30, 30, 30, 40, 50)
sum(floor(table(ar)/2))
[1] 2
Example 2 - Adding one more 30 to the vector, so now there are 2 pairs of 30 -
ar <- c(10, 20, 20, 30, 30, 30, 30, 40, 50)
sum(floor(table(ar)/2))
[1] 3
If two pairs of 30 count as one "unique" pair, then the original solution by @tmfmnk was correct -
sum(table(ar) >= 2)

You could use sapply on the unique values of the vector to return a logical vector indicating whether each value is repeated. The sum of that logical vector is then the number of unique pairs.
ar <- c(10, 20, 20, 30, 30, 30, 40, 50)
is_pair <- sapply(unique(ar), function(x) length(ar[ar == x]) > 1)
sum(is_pair)
#[1] 2
I'm not sure what behaviour you want if there are four 30's - does this count as one unique pair still or is it now two? If the latter, you would need a slightly different solution:
n_pair <- sapply(unique(ar), function(x) length(ar[ar == x]) %/% 2)
sum(n_pair)
#[1] 2
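The two interpretations discussed in these answers can be put side by side in one small helper. This is only a sketch: the names `count_pairs` and `distinct_only` are made up for illustration and do not come from any package.

```r
# Count pairs in a vector.
# distinct_only = FALSE: every complete pair counts (four 30s give two pairs).
# distinct_only = TRUE:  each repeated value counts once, however often it repeats.
count_pairs <- function(ar, distinct_only = FALSE) {
  counts <- table(ar)
  if (distinct_only) {
    sum(counts >= 2)
  } else {
    sum(counts %/% 2)
  }
}

ar4 <- c(10, 20, 20, 30, 30, 30, 30, 40, 50)
all_pairs <- count_pairs(ar4)
distinct_pairs <- count_pairs(ar4, distinct_only = TRUE)
print(c(all_pairs = all_pairs, distinct_pairs = distinct_pairs))
```

With four 30s the two readings differ (3 versus 2); with three 30s, as in the question, both give 2.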

Screen which element is within a range in R

I would like to ask if there is a way to check whether each element of, for example,
c(13, 20, 1, 5, 40, 15, 6, 8)
is within a range, e.g. > 5 and <= 30, giving output like the one below:
[1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE

Isn't it just this?
x <- c(13, 20, 1, 5, 40, 15, 6, 8)
x > 5 & x <= 30
#[1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
We can also use between from dplyr or data.table, but it includes both the lower and the upper boundary, so for whole numbers we can do
dplyr::between(x, 6, 30)
#[1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
Or
data.table::between(x, 6, 30)
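Shifting the lower bound to 6 only works when the data are whole numbers. For values such as 5.5, a plain comparison keeps the half-open range (> lower, <= upper) exact. A minimal sketch, where `in_range` is an illustrative helper name rather than a base R function:

```r
# TRUE where lower < x <= upper; works for non-integer values too
in_range <- function(x, lower, upper) {
  x > lower & x <= upper
}

x <- c(13, 20, 1, 5, 40, 15, 6, 8, 5.5, 30.5)
res <- in_range(x, 5, 30)
print(res)
```

Here 5.5 counts as inside the range and 30.5 as outside, which an inclusive check from 6 upwards would not get right for 5.5.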

First of all, you omitted a FALSE in your expected result.
But you can achieve that by doing this:
# named x rather than c, to avoid masking the base function c()
x <- c(13, 20, 1, 5, 40, 15, 6, 8)
a <- x > 5 & x <= 30
print(a)

Find indices of numbers of interest in a vector

I have a long vector, let's say A <- c(12, 16, 23, 15, 89, 43, ...) and I would like to find the positions of some numbers in this vector, contained in another vector, B <- c(16, 89).
In this example, I would like to obtain the vector c(2,5). At the moment I am using a for loop, but I would really like to avoid it:
C <- numeric(length(B))
for (i in 1:length(C)){
C[i] <- which(A==B[i])
}
Any suggestion?
Thanks in advance

Try with
x <- which(A %in% B)
#> x
#[1] 2 5
Hope this helps

You can use simply:
match(B,A)
#[1] 2 5
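The two answers agree for this example but are not interchangeable. A short sketch of the difference, using only the six values of A shown in the question and a made-up B that is out of order and contains a value missing from A:

```r
A <- c(12, 16, 23, 15, 89, 43)
B <- c(89, 16, 7)  # hypothetical lookup vector: unordered, and 7 is not in A

by_which <- which(A %in% B)  # positions in the order of A; missing values are dropped
by_match <- match(B, A)      # one position per element of B, in B's order; NA if absent

print(by_which)
print(by_match)
```

So match is the closer replacement for the loop in the question, since it keeps one result per element of B, while which(A %in% B) answers "which positions of A hold any value from B".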

Find nearest smaller number

I have a vector of numbers
f <- c(1, 3, 5, 8, 10, 12, 19, 27)
I want to compare the values in the vector to another number, and find the closest smaller value.
For example, if the input number is 18, then the closest, smaller value in the vector is 12 (position 6 in the vector).
If the input is 19, then the result should be the value 19, i.e. the index 7.

I think this answer is pretty straightforward:
f <- c(1,3,6,8,10,12,19,27)
x <- 18
# find the largest value in f that does not exceed x
maxless <- max(f[f <= x])
# find the position of that value in f
which(f == maxless)

If your vector f is always sorted, then you can do sum(f <= x)
f <- c(1,3,6,8,10,12,19,27)
x <- 18
sum(f <= x)
# [1] 6
x <- 19
sum(f <= x)
# [1] 7

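Try this (not a perfect solution, it assumes x is sorted in increasing order)
x<-c(1,3,6,8,10,12,19,27)
showIndex<-function(x,input){
abs.diff<-abs(x-input)
# index of the value nearest to the input
nearest<-which.min(abs.diff)
# step back one position if that value is larger than the input
index.value<-if(x[nearest]>input) nearest-1 else nearest
return(index.value)
}
showIndex(x,12)
[1] 6
showIndex(x,19)
[1] 7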

You could try:
x <- 18
f <- c(1,3,6,8,10,12,19,27)
i <- which.min(abs(f - x))
if (f[i] > x) i - 1 else i
which.min finds the index of the value nearest to x. If that value is larger than x, the previous index is returned (f must be sorted for this to work). If x is in f, its own index is returned.

Another one:
which.min(abs(18 - replace(f, f>18, Inf)))
#[1] 6
f[which.min(abs(18 - replace(f, f>18, Inf)))]
#[1] 12
Or as a function:
minsmaller <- function(x,value) which.min(abs(value - replace(x, x>value, Inf)))
minsmaller(f, 18)
#[1] 6
minsmaller(f, 19)
#[1] 7

There is findInterval:
findInterval(18:19, f)
#[1] 6 7
And building a more concrete function:
ff = function(x, table)
{
ot = order(table)
ans = findInterval(x, table[ot])
ot[ifelse(ans == 0, NA, ans)]
}
set.seed(007); sf = sample(f)
sf
#[1] 27 6 1 12 10 19 8 3
ff(c(0, 1, 18, 19, 28), sf)
#[1] NA 3 4 6 1

In a functional programming style:
f <- c(1, 3, 6, 8, 10, 12, 19, 27)
x <- 18
Position(function(fi) fi <= x, f, right = TRUE)
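As a quick sanity check, several of the approaches above can be run on the same sorted vector and compared. This is a sketch; the input values are chosen here for illustration and the variable names are arbitrary.

```r
f <- c(1, 3, 6, 8, 10, 12, 19, 27)  # must be sorted for all three approaches
inputs <- c(1, 5, 13, 18, 19, 30)

# Count of elements not exceeding each input
by_sum <- sapply(inputs, function(x) sum(f <= x))
# Vectorised interval lookup
by_interval <- findInterval(inputs, f)
# Last position whose value does not exceed the input
by_position <- sapply(inputs, function(x) {
  Position(function(fi) fi <= x, f, right = TRUE)
})

print(rbind(by_sum, by_interval, by_position))
```

All three agree on these inputs. They differ when the input is smaller than every element of f: sum and findInterval return 0, whereas Position returns NA.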

R: find nearest index

I have two vectors with a few thousand points, but generalized here:
A <- c(10, 20, 30, 40, 50)
b <- c(13, 17, 20)
How can I get the indices of A that are nearest to b? The expected outcome would be c(1, 2, 2).
I know that findInterval can only find the first occurrence, and not the nearest, and I'm aware that which.min(abs(b[2] - A)) is getting warmer, but I can't figure out how to vectorize it to work with long vectors of both A and b.

You can just put your code in a sapply. I think this has the same speed as a for loop so isn't technically vectorized though:
sapply(b,function(x)which.min(abs(x - A)))

findInterval gets you very close. You just have to pick between the index it returns and the next one:
#returns, for each x, the index of the nearest value in vec (vec must be sorted)
nearest.vec <- function(x, vec)
{
smallCandidate <- findInterval(x, vec, all.inside=TRUE)
largeCandidate <- smallCandidate + 1
#nudge is TRUE if large candidate is nearer, FALSE otherwise
nudge <- 2 * x > vec[smallCandidate] + vec[largeCandidate]
return(smallCandidate + nudge)
}
nearest.vec(b,A)
returns (1,2,2), and should be comparable to findInterval in performance.

Here's a solution that uses R's often overlooked outer function. Not sure if it'll perform better, but it does avoid sapply.
A <- c(10, 20, 30, 40, 50)
b <- c(13, 17, 20)
dist <- abs(outer(A, b, '-'))
result <- apply(dist, 2, which.min)
# [1] 1 2 2
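The three answers can be checked against each other on a slightly larger input. This is a sketch under stated assumptions: A is sorted, `b2` is a made-up test vector that includes values below and above the range of A, and nearest.vec is restated from the answer above so the block runs on its own.

```r
A <- c(10, 20, 30, 40, 50)
b2 <- c(3, 13, 17, 20, 44, 99)  # hypothetical inputs, including out-of-range values

# findInterval-based lookup, restated from the answer above
nearest.vec <- function(x, vec) {
  smallCandidate <- findInterval(x, vec, all.inside = TRUE)
  largeCandidate <- smallCandidate + 1
  nudge <- 2 * x > vec[smallCandidate] + vec[largeCandidate]
  smallCandidate + nudge
}

via_sapply <- sapply(b2, function(x) which.min(abs(x - A)))
via_interval <- nearest.vec(b2, A)
via_outer <- apply(abs(outer(A, b2, "-")), 2, which.min)

print(rbind(via_sapply, via_interval, via_outer))
```

The outer version builds a length(A) by length(b2) matrix, so with a few thousand points on each side it uses far more memory than the findInterval version, which only stores vectors the length of the input.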
