Locate different patterns in a sequence - r

If I want to find two different patterns in a single sequence, how am I supposed to do it?
e.g.:
seq="ATGCAAAGGT"
and the patterns are
pattern=c("ATGC","AAGG")
How am I supposed to find these two patterns simultaneously in the sequence?
I also want to find the locations of these patterns, for example positions 1,4 and 6,9.
Can anyone help me with this?

Let's say your sequence file is just a vector of sequences:
seq.file <- c('ATGCAAAGGT','ATGCTAAGGT','NOTINTHISONE')
You can search for both motifs and return a TRUE/FALSE vector that identifies whether both are present, using the following one-liner:
grepl('ATGC', seq.file) & grepl('AAGG', seq.file)
[1] TRUE TRUE FALSE
Let's say the vector of sequences is a column within a data frame d, which also contains a column of ID values:
id <- c('s1','s2','s3')
d <- data.frame(id,seq.file)
colnames(d) <- c('id','sequence')
You can append a column to this data frame, d, that identifies whether a given sequence contains both patterns, with this one-liner:
d$match <- grepl('ATGC',d$sequence) & grepl('AAGG', d$sequence)
> print(d)
id sequence match
1 s1 ATGCAAAGGT TRUE
2 s2 ATGCTAAGGT TRUE
3 s3 NOTINTHISONE FALSE
The following for-loop returns the positions of each of the patterns within each sequence (it uses the pattern vector from the question):
library(stringr)
pattern <- c("ATGC", "AAGG")
for (i in seq_along(d$sequence)) {
  # locate both patterns in the i-th sequence
  out <- str_locate_all(d$sequence[i], pattern)
  # start/end positions of the first pattern
  first <- c(out[[1]])
  first.o <- paste(first[1], first[2], sep = ",")
  # start/end positions of the second pattern
  second <- c(out[[2]])
  second.o <- paste(second[1], second[2], sep = ",")
  print(c(first.o, second.o))
}
[1] "1,4" "6,9"
[1] "1,4" "6,9"
[1] "NA,NA" "NA,NA"

You can try using the stringr library to do something like this:
seq = "ATGCAAAGGT"
library(stringr)
str_extract_all(seq, 'ATGC|AAGG')
[[1]]
[1] "ATGC" "AAGG"
Without knowing more specifically what output you are looking for, this is the best I can provide right now.

How about this using stringr to find start and end positions:
library(stringr)
seq <- "ATGCAAAGGT"
pattern <- c("ATGC","AAGG")
str_locate_all(seq, pattern)
#[[1]]
#     start end
#[1,]     1   4
#
#[[2]]
#     start end
#[1,]     6   9
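A small convenience (not part of the original answer): naming the result list by pattern makes the output easier to read.
setNames(str_locate_all(seq, pattern), pattern)
#$ATGC
#     start end
#[1,]     1   4
#
#$AAGG
#     start end
#[1,]     6   9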

Related

Substring match when filtering rows

I have strings in file1 that match parts of the strings in file2. I want to filter out the strings from file2 that partly match those in file1. Please see my attempt below; I am not sure how to define a substring match in this way.
file1:
V1
species1
species121
species14341
file2:
V1
genus1|species1|strain1
genus1|species121|strain1
genus1|species1442|strain1
genus1|species4242|strain1
genus1|species4131|strain1
my try:
file1[!file1$V1 %in% file2$V1]
You cannot use the %in% operator in this way in R. It is used to determine whether an element of a vector is in another vector, unlike Python's in, which can also be used to match a substring. Look at this:
"species1" %in% "genus1|species1|strain1" # FALSE
"species1" %in% c("genus1", "species1", "strain1") # TRUE
You can, however, use grepl for this (the l is for logical, i.e. it returns TRUE or FALSE).
grepl("species1", "genus1|species1|strain1") # TRUE
There's an additional complication here in that you cannot pass a vector of patterns to grepl, as it will only use the first value:
grepl(file1$V1, "genus1|species1|strain1")
[1] TRUE
Warning message:
In grepl(file1$V1, "genus1|species1|strain1") :
argument 'pattern' has length > 1 and only the first element will be used
The above simply tells you that the first element of file1$V1 is in "genus1|species1|strain1".
Furthermore, you want to compare each element in file1$V1 to an entire vector of strings, rather than just one string. That's OK but you will get a vector the same length as the second vector as an output:
grepl("species1", file2$V1)
[1] TRUE TRUE TRUE FALSE FALSE
We can just see if any() of those are a match. As you've tagged your question with tidyverse, here's a dplyr solution:
library(dplyr)
file1 |>
  rowwise() |> # This makes sure you only pass one element at a time to `grepl`
  mutate(
    in_v2 = any(grepl(V1, file2$V1))
  ) |>
  filter(!in_v2)
# A tibble: 1 x 2
# Rowwise:
# V1 in_v2
# <chr> <lgl>
# 1 species14341 FALSE
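For completeness, a base R sketch of the same filter (assuming file1 and file2 as above); sapply plays the role of rowwise(), and fixed = TRUE treats each value as a literal string rather than a regex.
keep <- !sapply(file1$V1, function(p) any(grepl(p, file2$V1, fixed = TRUE)))
file1[keep, , drop = FALSE]
#             V1
# 3 species14341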
Another way to get what you want uses the rm_between function from the qdapRegex package. You can run the following code:
# Load library
library(qdapRegex)
# Extract the names of file2$V1 you are interested in (those between | |)
v <- unlist(rm_between(file2$V1, "|", "|", extract = T))
# Which of these elements are in file1$V1?
elem.are <- which(v %in% file1$V1)
# Delete the elements in elem.are
file2$V1[-elem.are]
In v we save the names from file2$V1 we are interested in (those between the | characters).
Then we save in elem.are the positions of those names which also appear in file1$V1.
Finally, we omit those elements using file2$V1[-elem.are].

Find best match for multiple substrings across multiple candidates

I have the following sample data:
targets <- c("der", "das")
candidates <- c("sdassder", "sderf", "fongs")
Desired Output:
I would like to get sdassder as the output, since it includes the most matches for targets (as substrings).
What I tried:
x <- sapply(targets, function(target) sapply(candidates, grep, pattern = target)) > 0
which.max(rowSums(x))
Goal:
As you can see, I found some dirty code that technically yields the result, but I don't feel it's best practice. I hope this question fits here; otherwise I'll move it to Code Review.
I tried mapply, do.call, and outer, but didn't manage to find better code.
Edit:
Adding another option myself, after seeing the current answers.
Using pipes (with magrittr or dplyr loaded for %>%):
sapply(targets, grepl, candidates) %>% rowSums %>% which.max %>% candidates[.]
You can simplify it a little, I think.
matches <- sapply(targets, grepl, candidates)
matches
# der das
# [1,] TRUE TRUE
# [2,] TRUE FALSE
# [3,] FALSE FALSE
And find the number of matches using rowSums:
rowSums(matches)
# [1] 2 1 0
candidates[ which.max(rowSums(matches)) ]
# [1] "sdassder"
(Note that this last part does not really inform about ties.)
If you want to see the individual matches per-candidate, you can always apply the names manually, though this is only an aesthetic thing, adding very little to the work itself.
rownames(matches) <- candidates
matches
# der das
# sdassder TRUE TRUE
# sderf TRUE FALSE
# fongs FALSE FALSE
rowSums(matches)
# sdassder sderf fongs
# 2 1 0
which.max(rowSums(matches))
# sdassder
# 1 <------ this "1" indicates the index within the rowSums vector
names(which.max(rowSums(matches)))
# [1] "sdassder"
One stringr option could be:
library(stringr)
candidates[which.max(rowSums(outer(candidates, targets, str_detect)))]
[1] "sdassder"
We could paste the targets together and create a pattern to match.
library(stringr)
str_c(targets, collapse = "|")
#[1] "der|das"
Use it in str_count to count the number of times the pattern is matched.
str_count(candidates, str_c(targets, collapse = "|"))
#[1] 2 1 0
Get the index of the maximum value and use it to subset the original candidates:
candidates[which.max(str_count(candidates, str_c(targets, collapse = "|")))]
#[1] "sdassder"

Count number of occurrences when string contains substring

I have a string like
'abbb'
I need to understand how many times I can find the substring 'bb'.
grep('bb','abbb')
returns 1. However, the answer should be 2 (a-bb-b and ab-bb). How can I count the number of occurrences the way I need?
You can make the pattern non-consuming with '(?=bb)', as in:
length(gregexpr('(?=bb)', x, perl=TRUE)[[1]])
[1] 2
Here is an ugly approach using substr and sapply:
input <- "abbb"
search <- "bb"
# slide a window of width nchar(search) along input and count exact matches
res <- sum(sapply(1:(nchar(input) - nchar(search) + 1), function(i) {
  substr(input, i, i + nchar(search) - 1) == search
}))
We can use stri_count_regex from stringi; the lookahead keeps matches zero-width, so overlaps are counted:
library(stringi)
stri_count_regex(input, '(?=bb)')
#[1] 2
stri_count_regex(x, '(?=bb)')
#[1] 0 1 0
data
input <- "abbb"
x <- c('aa','bb','ba')
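As a small convenience, the lookahead trick can be wrapped into a helper (a sketch; the helper name is made up). Note that taking length() of a gregexpr result gives 1 even when there is no match, because gregexpr returns -1 in that case, so only positive positions are counted here.
count_overlaps <- function(x, pat) {
  # zero-width lookahead so that overlapping matches are all counted
  m <- gregexpr(paste0("(?=", pat, ")"), x, perl = TRUE)
  vapply(m, function(v) sum(v > 0), integer(1))
}
count_overlaps(c("abbb", "aa", "bb", "ba"), "bb")
# [1] 2 0 1 0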

Finding the number of r's in the vector (both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the above code, we need to find the number of r's (R and r) in rquote before the first u.
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw "before the first u". In that case, we can get the position of the first 'u' with either which or match.
Then use grepl in the 'chars' up to the position (ind) to find the logical index of 'R' with ignore.case=TRUE and use sum using the strsplit output from the OP's code.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use sub and gsub on the original string ('rquote'). The first removes the characters from the first u to the end of the string (u.*$) and the second removes all characters except R and r ([^Rr]) by replacing them with ''. We can then use nchar to count the remaining characters.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or, if we want to count the r's in the entire string, use gregexpr to get the positions of matching characters in the original string ('rquote') and take the length:
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6
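Another base R sketch, building on the chars vector from the question: cumsum(chars == 'u') flags everything from the first 'u' onwards, so only the r's before it get counted.
sum(grepl('r', chars, ignore.case = TRUE) & cumsum(chars == 'u') == 0)
#[1] 5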

R: Find pattern and get the values in between

I am using readLines() to extract the HTML code from a site. In almost every line of the code there is a pattern of the form <td>VALUE1<td>VALUE2<td>. I would like to take the values in between the <td> tags. I tried some combinations such as:
output <- gsub(pattern='(.*<td>)(.*)(<td>.*)(.*)(.*<td>)',replacement='\\2',x='<td>VALUE1<td>VALUE2<td>')
but the output gives back only one value. Any idea how to do that?
string <- "<td>VALUE1<td>VALUE2<td>"
regmatches(string , gregexpr("(?<=<td>)\\w+(?=<td>)" , string , perl = T) )
# use gregexpr function to get the match indices and the lengthes
indices <- gregexpr("(?<=<td>)\\w+(?=<td>)" , string , perl = T)
# this should be the result
# [1] 5 15
# attr(,"match.length")
# this means you have two matches the first one starts at index 5 and the
#second match starts at index 15
#[1] 6 6
#attr(,"useBytes")
# this means the first match should be with length 6 , also in this case the
#second match with length of 6
# then get the result of this match and pass it to regmatches function to
# substring your string at these indices
regmatches(string , indices)
Did you take a look at the "XML" package that can extract tables from HTML? You probably need to provide more context of the entire message that you are trying to parse so that we could see if it might be appropriate.
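For reference, a sketch of that XML route, assuming the page actually contains well-formed table markup (the <td>VALUE1<td>VALUE2<td> fragment above has no closing tags, so it would need real <tr>/<td> pairs):
library(XML)
doc <- htmlParse("<table><tr><td>VALUE1</td><td>VALUE2</td></tr></table>", asText = TRUE)
readHTMLTable(doc, stringsAsFactors = FALSE)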
