Find best match for multiple substrings across multiple candidates - r

I have the following sample data:
targets <- c("der", "das")
candidates <- c("sdassder", "sderf", "fongs")
Desired output:
I would like to find sdassder as the output, since it contains the most matches for targets (as substrings).
What I tried:
x <- sapply(targets, function(target) sapply(candidates, grep, pattern = target)) > 0
which.max(rowSums(x))
Goal:
As you can see, I found some dirty code that technically yields the result, but I don't feel it's best practice. I hope this question fits here; otherwise I will move it to Code Review.
I tried mapply, do.call, and outer, but didn't manage to find better code.
Edit:
Adding another option myself, after seeing the current answers.
Using pipes (%>% comes from magrittr):
library(magrittr)
sapply(targets, grepl, candidates) %>% rowSums %>% which.max %>% candidates[.]

You can simplify it a little, I think.
matches <- sapply(targets, grepl, candidates)
matches
#        der   das
# [1,]  TRUE  TRUE
# [2,]  TRUE FALSE
# [3,] FALSE FALSE
And find the number of matches using rowSums:
rowSums(matches)
# [1] 2 1 0
candidates[ which.max(rowSums(matches)) ]
# [1] "sdassder"
(Note that this last part does not really inform about ties.)
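To make the tie caveat concrete, here is a minimal sketch that returns every candidate reaching the maximum count instead of only the first one. The extra candidate "dasder" is hypothetical and was added only to force a tie; fixed = TRUE is an assumption that the targets are literal strings rather than regular expressions.

```r
targets <- c("der", "das")
# "dasder" is a hypothetical extra candidate, added only to create a tie
candidates <- c("sdassder", "sderf", "fongs", "dasder")

# One column per target, one row per candidate
matches <- sapply(targets, grepl, candidates, fixed = TRUE)
counts <- rowSums(matches)

# Keep every candidate that reaches the maximum count
best <- candidates[counts == max(counts)]
best
```

With which.max only the first of the tied candidates would be returned.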
If you want to see the individual matches per-candidate, you can always apply the names manually, though this is only an aesthetic thing, adding very little to the work itself.
rownames(matches) <- candidates
matches
#            der   das
# sdassder  TRUE  TRUE
# sderf     TRUE FALSE
# fongs    FALSE FALSE
rowSums(matches)
# sdassder    sderf    fongs
#        2        1        0
which.max(rowSums(matches))
# sdassder
#        1    <------ this "1" indicates the index within the rowSums vector
names(which.max(rowSums(matches)))
# [1] "sdassder"

One stringr option could be:
library(stringr)
candidates[which.max(rowSums(outer(candidates, targets, str_detect)))]
# [1] "sdassder"

We could paste the targets together and create a pattern to match.
library(stringr)
str_c(targets, collapse = "|")
#[1] "der|das"
Use it in str_count to count the number of times the pattern was matched in each candidate.
str_count(candidates, str_c(targets, collapse = "|"))
#[1] 2 1 0
Get the index of the maximum value and use it to subset the original candidates:
candidates[which.max(str_count(candidates, str_c(targets, collapse = "|")))]
#[1] "sdassder"
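One thing to keep in mind with a combined alternation pattern is that it counts occurrences, so a candidate that repeats a single target can score as high as one that contains both targets. The sketch below illustrates the difference in base R, using gregexpr as a stand-in for the occurrence count; the candidate "derder" is hypothetical and not part of the original data.

```r
targets <- c("der", "das")
# "derder" is a hypothetical candidate that repeats one target
candidates <- c("sdassder", "derder", "fongs")
pattern <- paste(targets, collapse = "|")

# Occurrence count: how many times the combined pattern matches
occurrences <- lengths(regmatches(candidates, gregexpr(pattern, candidates)))

# Distinct-target count: how many different targets appear at least once
distinct <- rowSums(sapply(targets, grepl, candidates, fixed = TRUE))

data.frame(candidates, occurrences, distinct)
```

If "most matches" means "most distinct targets", the second count is the one to maximise.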

Related

Substring match when filtering rows

I have strings in file1 that match part of the strings in file2. I want to filter out the strings from file2 that partly match those in file1. Please see my attempt below; I'm not sure how to define a substring match in this way.
file1:
V1
species1
species121
species14341
file2
V1
genus1|species1|strain1
genus1|species121|strain1
genus1|species1442|strain1
genus1|species4242|strain1
genus1|species4131|strain1
My attempt:
file1[!file1$V1 %in% file2$V1]
You cannot use the %in% operator this way in R. It tests whether each element of one vector is equal to some element of another vector; unlike the in operator in Python, it does not match substrings. Look at this:
"species1" %in% "genus1|species1|strain1" # FALSE
"species1" %in% c("genus1", "species1", "strain1") # TRUE
You can, however, use grepl for this (the l is for logical, i.e. it returns TRUE or FALSE).
grepl("species1", "genus1|species1|strain1") # TRUE
There's an additional complication here: you cannot pass a vector as the pattern to grepl, as it will only use the first element:
grepl(file1$V1, "genus1|species1|strain1")
[1] TRUE
Warning message:
In grepl(file1$V1, "genus1|species1|strain1") :
argument 'pattern' has length > 1 and only the first element will be used
The above simply tells you that the first element of file1$V1 is in "genus1|species1|strain1".
Furthermore, you want to compare each element in file1$V1 to an entire vector of strings, rather than just one string. That's OK but you will get a vector the same length as the second vector as an output:
grepl("species1", file2$V1)
[1] TRUE TRUE TRUE FALSE FALSE
We can just see if any() of those are a match. As you've tagged your question with tidyverse, here's a dplyr solution:
library(dplyr)
file1 |>
  rowwise() |> # This makes sure you only pass one element at a time to `grepl`
  mutate(
    in_v2 = any(grepl(V1, file2$V1))
  ) |>
  filter(!in_v2)
# A tibble: 1 x 2
# Rowwise:
#   V1           in_v2
#   <chr>        <lgl>
# 1 species14341 FALSE
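Since the question asks to filter file2, a base R sketch of that direction may also help. It contrasts plain substring matching (where "species1" also hits "species1442") with matching the whole middle field. The data frames are rebuilt from the question, and the assumption that the species name is always the second |-separated field is mine.

```r
file1 <- data.frame(V1 = c("species1", "species121", "species14341"),
                    stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c("genus1|species1|strain1",
                           "genus1|species121|strain1",
                           "genus1|species1442|strain1",
                           "genus1|species4242|strain1",
                           "genus1|species4131|strain1"),
                    stringsAsFactors = FALSE)

# Substring match: one column per file1 string, one row per file2 string
hit_sub <- rowSums(sapply(file1$V1, grepl, file2$V1, fixed = TRUE)) > 0

# Exact match on the second field (assumed to hold the species name)
fields <- vapply(strsplit(file2$V1, "|", fixed = TRUE), `[`, character(1), 2)
hit_exact <- fields %in% file1$V1

file2$V1[!hit_sub]    # drops anything containing a file1 string
file2$V1[!hit_exact]  # drops only exact species matches
```

Which of the two is appropriate depends on whether "species1442" should count as a match for "species1".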
Another way to get what you want is to extract the middle field of each file2 string and compare it with %in%. So, you can run the following code:
# Load library
library(qdapRegex)
# Extract the names of file2$V1 you are interested in (those between | |)
v <- unlist(rm_between(file2$V1, "|", "|", extract = TRUE))
# Which of these elements are in file1$V1?
elem.are <- which(v %in% file1$V1)
# Delete the elements in elem.are
file2$V1[-elem.are]
In v we save the names of file2$V1 we are interested in (those between | |).
Then we save in elem.are the positions of those names which appear in file1$V1.
Finally, we omit those elements using file2$V1[-elem.are]. Note that negative indexing with an empty elem.are returns an empty vector, so this assumes at least one name matches.

Compare 2 strings in R

I have data as below:
vec <- c("ABC|ADC|1","ABC|ADG|2")
I need to check whether the substrings below are present:
"ADC|DFG" should return FALSE, as I need to match the exact pattern.
"ABC|ADC|1|5" should return TRUE, as this is a child element of the first element in the vector.
I tried using grepl, but it returns TRUE even if I just pass ADC. Any help is appreciated.
grepl returns TRUE because the pipe character | is special in regex: a|b means match a or b. All you need to do is escape it.
frtest <- c("ABC|ADC", "ABC|ADC|1|2", "ABC|ADG", "ABC|ADG|2|5")
# making the last number and its pipe optional
test <- gsub('(\\|\\d)$', '(\\1)?', frtest)
# escaping all pipes (the replacement '\\\\|' puts one literal backslash before each pipe)
test <- gsub('\\|', '\\\\|', test)
# testing if any of the strings is in vec
res <- sapply(test, function(x) any(grepl(x, vec)))
# reassigning the names so they're readable
names(res) <- frtest
res
#>     ABC|ADC ABC|ADC|1|2     ABC|ADG ABC|ADG|2|5
#>        TRUE        TRUE        TRUE        TRUE
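The effect of the unescaped pipe can be checked directly. This is a small sketch using the vec from the question; it compares the default regex interpretation with two ways of treating the pipe literally (fixed = TRUE and a backslash escape).

```r
vec <- c("ABC|ADC|1", "ABC|ADG|2")

# Default: "|" is alternation, so this asks for "ADC" OR "DFG"
as_regex <- grepl("ADC|DFG", vec)

# fixed = TRUE treats the whole pattern as a literal string
as_literal <- grepl("ADC|DFG", vec, fixed = TRUE)

# Escaping the pipe has the same effect inside a regex
as_escaped <- grepl("ADC\\|DFG", vec)

rbind(as_regex, as_literal, as_escaped)
```

Only the first call reports a match, because "ADC" alone occurs in the first element.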
For two vectors vec and test, this returns a vector which is TRUE if either the corresponding element of test is the start of one of the elements of vec, or one of the elements of vec is the start of the corresponding element of test.
vec <- c("ABC|ADC|1","ABC|ADG|2")
test <- c("ADC|DFG", "ABC|ADC|1|5", "ADC|1", "ABC|ADC")
colSums(sapply(test, startsWith, vec) | t(sapply(vec, startsWith, test))) > 0
#     ADC|DFG ABC|ADC|1|5       ADC|1     ABC|ADC
#       FALSE        TRUE       FALSE        TRUE

How to grepl with two pattern objects in R

I have a vector called
vec <- c("16S_s95_S112_R2_101.fastq.gz",
"16S_s95_S112_R1_001.fastq.gz",
"16S_s94_S103_R2_021.fastq.gz",
"16S_s94_S103_R1_001.fastq.gz")
I want to grepl items with sample <- "_s95_" and R1 <- "R1".
I want to use sample and R1 objects while doing grepl and find something matching _s95_ and R1 strings both.
Result I want is 16S_s95_S112_R1_001.fastq.gz.
I tried grepl(pattern = sample&R1, x= vec) which did not work for me.
I can do this with multiple grepl calls, but I am trying to find something neater.
For your specific use case where you know the order of the patterns, it's almost certainly going to be faster to follow Jilber Urbina's suggestion to programmatically compose a single regex.
For a more general solution that works regardless of order and on any number of patterns, we can use sapply to loop across each pattern, and then use rowSums to count the number of pattern matches and find the rows where all of them match:
patterns <- c("_s95_", "R1")
sapply(patterns, function(x) grepl(x, vec))
     _s95_    R1
[1,]  TRUE FALSE
[2,]  TRUE  TRUE
[3,] FALSE FALSE
[4,] FALSE  TRUE
vec[which(rowSums(sapply(patterns, function(x) grepl(x, vec))) == length(patterns))]
[1] "16S_s95_S112_R1_001.fastq.gz"
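A variation on the same idea, sketched here as an alternative rather than as the answer's own method, is to combine the per-pattern logical vectors with Reduce and `&` instead of counting rows. fixed = TRUE is an assumption that the patterns are plain strings.

```r
vec <- c("16S_s95_S112_R2_101.fastq.gz",
         "16S_s95_S112_R1_001.fastq.gz",
         "16S_s94_S103_R2_021.fastq.gz",
         "16S_s94_S103_R1_001.fastq.gz")
patterns <- c("_s95_", "R1")

# One logical vector per pattern, then an element-wise AND across all of them
all_match <- Reduce(`&`, lapply(patterns, grepl, x = vec, fixed = TRUE))
vec[all_match]
```

This avoids building the intermediate matrix and works for any number of patterns.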
You need to work a bit more on your pattern to get the match. Assuming the sample string always comes before R1, try:
> grep(paste0(".*", sample, ".*", R1), vec, value=TRUE)
[1] "16S_s95_S112_R1_001.fastq.gz"
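If the order of the two strings is not known, a single regex can still be composed programmatically using lookaheads. This is a sketch under the assumption that Perl-style regex (perl = TRUE) is acceptable and that the patterns contain no characters that need escaping.

```r
vec <- c("16S_s95_S112_R2_101.fastq.gz",
         "16S_s95_S112_R1_001.fastq.gz",
         "16S_s94_S103_R2_021.fastq.gz",
         "16S_s94_S103_R1_001.fastq.gz")
patterns <- c("_s95_", "R1")

# Builds "^(?=.*_s95_)(?=.*R1)": each lookahead must succeed from the start
regex <- paste0("^", paste0("(?=.*", patterns, ")", collapse = ""))

grep(regex, vec, perl = TRUE, value = TRUE)
```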

Locate different patterns in a sequence

If I want to find two different patterns in a single sequence, how should I do it?
For example:
seq="ATGCAAAGGT"
the patterns are
pattern=c("ATGC","AAGG")
How am I supposed to find these two patterns simultaneously in the sequence?
I also want to find the location of these patterns like for example the patterns locations are 1,4 and 5,8.
Can anyone help me with this ?
Let's say your sequence file is just a vector of sequences:
seq.file <- c('ATGCAAAGGT','ATGCTAAGGT','NOTINTHISONE')
You can search for both motifs, and then return a true / false vector that identifies if both are present using the following one-liner:
grepl('ATGC', seq.file) & grepl('AAGG', seq.file)
[1] TRUE TRUE FALSE
Let's say the vector of sequences is a column within data frame d, which also contains a column of ID values:
id <- c('s1','s2','s3')
d <- data.frame(id,seq.file)
colnames(d) <- c('id','sequence')
You can append a column to this data frame, d, that identifies whether a given sequence matches with this one-liner:
d$match <- grepl('ATGC',d$sequence) & grepl('AAGG', d$sequence)
> print(d)
  id     sequence match
1 s1   ATGCAAAGGT  TRUE
2 s2   ATGCTAAGGT  TRUE
3 s3 NOTINTHISONE FALSE
The following for-loop can return a list of the positions of each of the patterns within the sequence:
library(stringr)
pattern <- c("ATGC", "AAGG")  # the two motifs from the question
for (i in seq_along(d$sequence)) {
  # one start/end matrix per pattern
  out <- str_locate_all(d$sequence[i], pattern)
  first <- c(out[[1]])
  first.o <- paste(first[1], first[2], sep = ',')
  second <- c(out[[2]])
  second.o <- paste(second[1], second[2], sep = ',')
  print(c(first.o, second.o))
}
[1] "1,4" "6,9"
[1] "1,4" "6,9"
[1] "NA,NA" "NA,NA"
You can try using the stringr library to do something like this:
seq = "ATGCAAAGGT"
library(stringr)
str_extract_all(seq, 'ATGC|AAGG')
[[1]]
[1] "ATGC" "AAGG"
Without knowing more specifically what output you are looking for, this is the best I can provide right now.
How about this using stringr to find start and end positions:
library(stringr)
seq <- "ATGCAAAGGT"
pattern <- c("ATGC","AAGG")
str_locate_all(seq, pattern)
#[[1]]
#     start end
#[1,]     1   4
#
#[[2]]
#     start end
#[1,]     6   9
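For comparison, the same start and end positions can be obtained in base R. The sketch below reports only the first occurrence of each pattern and collects the result in a data frame; the names seq_str and locs are illustrative. It also confirms that AAGG sits at 6 to 9 in this sequence (positions 5 to 8 hold AAAG).

```r
seq_str <- "ATGCAAAGGT"
patterns <- c("ATGC", "AAGG")

# First occurrence of each pattern; regexpr returns -1 when there is no match
starts <- vapply(patterns,
                 function(p) as.integer(regexpr(p, seq_str, fixed = TRUE)),
                 integer(1))
ends <- ifelse(starts > 0, starts + nchar(patterns) - 1L, NA_integer_)
starts[starts < 0] <- NA_integer_

locs <- data.frame(pattern = patterns,
                   start = unname(starts),
                   end = unname(ends))
locs
```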

Grep in R using OR and NOT

I have the following vector in R and I would like to find all the strings containing A's and B's but not the number 2.
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_Aa")
The following does not work:
grep("A|B|!2", vec1)
It gives me back all the strings:
[1] 1 2 3 4 5
The same is true for this example:
grep("A|B|-2", vec1)
What would be the correct syntax?
You can do this with a fairly simple regular expression:
grep("^[^2]*[AB][^2]*$", vec1)
In words, it means:
^ match the start of the string
[^2]* match anything except "2", zero or more times
[AB] match "A" or "B"
[^2]* match anything except "2", zero or more times
$ match the end of the string
I would use two grep calls:
intersect(grep("A|B",vec1),grep("2",vec1,invert=TRUE))
#[1] 1 3
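The same "A or B, but no 2" condition can also be written with logical vectors, or as one Perl-style regex with a negative lookahead. Both are sketches of alternatives rather than part of the answers above; the names keep and idx are illustrative.

```r
vec1 <- c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_Aa")

# Contains A or B, and does not contain the digit 2
keep <- grepl("[AB]", vec1) & !grepl("2", vec1, fixed = TRUE)
vec1[keep]

# Same condition in one regex: the negative lookahead rejects any string with a "2"
idx <- grep("^(?!.*2).*[AB]", vec1, perl = TRUE)
idx
```

The logical form returns a vector aligned with vec1, which is convenient for subsetting data frames.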
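OP, your attempt is pretty close, try this:
grep('^(A|B|[^2])*$', vec1)
Note that this pattern only excludes strings containing a "2"; it does not require an A or a B, so it gives the desired result here only because every element of vec1 contains one.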
grep generally does not work very well for doing a positive and a negative search in one invocation. You might be able to make it work with a complex regular expression, but you might be better off just doing:
grep '[AB]' somefile.txt | grep -v '2'
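The R equivalent of that would be:
grep("2", grep("A|B", vec1, value = TRUE), invert = TRUE)
Note that the indices returned are relative to the filtered vector, not to vec1; add value = TRUE to the outer call to get the strings themselves.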
I extended the answer provided by @eddi. I have tested it in R and it works for me. I changed the last element in your example, since otherwise every element contains A or B.
# Create the vector from the OP with one change
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_dd")
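I then ran the following code. It shows the result to expect from each grep step.
First, find which elements contain A or B:
> grepl("A|B", vec1)
[1]  TRUE  TRUE  TRUE  TRUE FALSE
Now find which elements contain a "2":
> grepl("2", vec1)
[1] FALSE  TRUE FALSE  TRUE  TRUE
Among the elements that contain A or B, those at indices 2 and 4 also contain a "2":
> grep("2", grep("A|B", vec1, value = TRUE))
[1] 2 4
Adding invert = TRUE returns the elements we actually want, those without a "2":
> grep("2", grep("A|B", vec1, value = TRUE), invert = TRUE)
[1] 1 3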
Done!

Resources