Substring match when filtering rows - r

I have strings in file1 that match part of the strings in file2. I want to filter out the strings from file2 that partially match those in file1. Please see my attempt below; I am not sure how to define a substring match in this way.
file1:
V1
species1
species121
species14341
file2:
V1
genus1|species1|strain1
genus1|species121|strain1
genus1|species1442|strain1
genus1|species4242|strain1
genus1|species4131|strain1
my try:
file1[!file1$V1 %in% file2$V1]

You cannot use the %in% operator in this way in R. It tests whether an element of a vector is in another vector; unlike Python's in, it cannot be used to match a substring. Compare:
"species1" %in% "genus1|species1|strain1" # FALSE
"species1" %in% c("genus1", "species1", "strain1") # TRUE
You can, however, use grepl for this (the l is for logical, i.e. it returns TRUE or FALSE).
grepl("species1", "genus1|species1|strain1") # TRUE
There's an additional complication here: you cannot pass a vector of patterns to grepl, as it will only use the first value:
grepl(file1$V1, "genus1|species1|strain1")
[1] TRUE
Warning message:
In grepl(file1$V1, "genus1|species1|strain1") :
argument 'pattern' has length > 1 and only the first element will be used
The above simply tells you that the first element of file1$V1 is in "genus1|species1|strain1".
Furthermore, you want to compare each element in file1$V1 to an entire vector of strings, rather than just one string. That's OK but you will get a vector the same length as the second vector as an output:
grepl("species1", file2$V1)
[1] TRUE TRUE TRUE FALSE FALSE
We can just see if any() of those are a match. As you've tagged your question with tidyverse, here's a dplyr solution:
library(dplyr)
file1 |>
  rowwise() |> # This makes sure you only pass one element at a time to `grepl`
  mutate(
    in_v2 = any(grepl(V1, file2$V1))
  ) |>
  filter(!in_v2)
# A tibble: 1 x 2
# Rowwise:
# V1 in_v2
# <chr> <lgl>
# 1 species14341 FALSE
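The pipeline above filters file1, whereas the question asks to filter file2. A minimal base R sketch of that direction, assuming file1 and file2 are data frames with a character column V1 as in the question (the data is re-created here for illustration, and fixed = TRUE is my assumption that the names should be matched as literal text rather than as regular expressions):

```r
# Re-create the example data from the question
file1 <- data.frame(V1 = c("species1", "species121", "species14341"),
                    stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c("genus1|species1|strain1",
                           "genus1|species121|strain1",
                           "genus1|species1442|strain1",
                           "genus1|species4242|strain1",
                           "genus1|species4131|strain1"),
                    stringsAsFactors = FALSE)

# For each string in file2, check whether any name from file1 occurs in it
has_match <- vapply(
  file2$V1,
  function(s) any(vapply(file1$V1, function(p) grepl(p, s, fixed = TRUE), logical(1))),
  logical(1),
  USE.NAMES = FALSE
)

# Keep only the rows of file2 without a substring match
file2_filtered <- file2[!has_match, , drop = FALSE]
print(file2_filtered)
```

Note that "species1" is also a substring of "species121" and "species1442", so the first three rows are all dropped here.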

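One way to get what you want is to extract the middle field of each string with rm_between() from the qdapRegex package and compare it using %in%. Note that this tests for an exact match of that field rather than a substring match. You can run the following code:
# Load library
library(qdapRegex)
# Extract the names of file2$V1 you are interested in (those between | |)
v <- unlist(rm_between(file2$V1, "|", "|", extract = TRUE))
# Which of these elements are in file1$V1?
elem.are <- which(v %in% file1$V1)
# Delete the elements in elem.are
file2$V1[-elem.are]
In v we save the names from file2$V1 we are interested in (those between | |).
Then we save in elem.are the positions of those names that appear in file1$V1.
Finally, we omit those elements using file2$V1[-elem.are]. Be aware that if nothing matches, elem.are is empty and file2$V1[-elem.are] returns an empty vector rather than all of file2$V1.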
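The same exact-field comparison can be sketched in base R without qdapRegex, assuming the name of interest is always the second |-separated field (an assumption based on the example data, which is re-created here). A logical index is used instead of negative positions, so the no-match case keeps everything:

```r
# Re-create the example data from the question
file1 <- data.frame(V1 = c("species1", "species121", "species14341"),
                    stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c("genus1|species1|strain1",
                           "genus1|species121|strain1",
                           "genus1|species1442|strain1",
                           "genus1|species4242|strain1",
                           "genus1|species4131|strain1"),
                    stringsAsFactors = FALSE)

# Split each string on the literal pipe and take the second field
fields <- strsplit(file2$V1, "|", fixed = TRUE)
species <- vapply(fields, function(f) f[2], character(1))

# TRUE where the species field is exactly one of the names in file1
is_listed <- species %in% file1$V1

# Logical negation is safe even when nothing matches
kept <- file2$V1[!is_listed]
print(kept)
```

Unlike a substring search, "species1442" is kept here because it is not exactly equal to any name in file1.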

Related

Compare 2 strings in R

I have data as below:
vec <- c("ABC|ADC|1","ABC|ADG|2")
I need to check whether the substrings below are present:
"ADC|DFG" should return FALSE, as I need to match the exact pattern.
"ABC|ADC|1|5" should return TRUE, as this is a child element of the first element in the vector.
I tried using grepl, but it returns TRUE even if I pass just ADC. Any help is appreciated.
grepl returns TRUE because the pipe character | is special in regex: a|b means match a or b. All you need to do is escape it.
frtest <- c("ABC|ADC","ABC|ADC|1|2","ABC|ADG","ABC|ADG|2|5")
# making the last number and its pipe optional
test <- gsub('(\\|\\d)$', '(\\1)?', frtest)
# escaping all pipes (the replacement '\\\\|' inserts one literal backslash before each pipe)
test <- gsub('\\|', '\\\\|', test)
# testing if any of the strings is in vec
res <- sapply(test, function(x) any(grepl(x, vec)) )
# reassigning the names so they're readable
names(res) <- frtest
res
#>     ABC|ADC ABC|ADC|1|2     ABC|ADG ABC|ADG|2|5
#>        TRUE        TRUE        TRUE        TRUE
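To make the effect of the unescaped pipe concrete, here is a small sketch using the vec from the question. It compares the unescaped pattern, the escaped pattern, and fixed = TRUE, which is another way to have grepl treat the pattern as literal text:

```r
vec <- c("ABC|ADC|1", "ABC|ADG|2")

# Unescaped: "ADC|DFG" means "ADC" OR "DFG", so the first element matches
unescaped <- grepl("ADC|DFG", vec)

# Escaped pipe: the literal text "ADC|DFG" is searched for
escaped <- grepl("ADC\\|DFG", vec)

# fixed = TRUE treats the whole pattern as literal text, no escaping needed
literal <- grepl("ADC|DFG", vec, fixed = TRUE)

print(rbind(unescaped, escaped, literal))
```

Only the unescaped version reports a match (for the first element), because "ADC" alone occurs in "ABC|ADC|1".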
For two vectors vec and test, this returns a vector which is TRUE if either the corresponding element of test is the start of one of the elements of vec, or one of the elements of vec is the start of the corresponding element of test.
vec <- c("ABC|ADC|1","ABC|ADG|2")
test <- c("ADC|DFG", "ABC|ADC|1|5", "ADC|1", "ABC|ADC")
colSums(sapply(test, startsWith, vec) | t(sapply(vec, startsWith, test))) > 0
# ADC|DFG ABC|ADC|1|5 ADC|1 ABC|ADC
# FALSE TRUE FALSE TRUE
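The matrix expression above can be hard to read, so here is a sketch of the same prefix logic for one query at a time. The helper name is_prefix_related is hypothetical, not part of the original answer:

```r
vec <- c("ABC|ADC|1", "ABC|ADG|2")
test <- c("ADC|DFG", "ABC|ADC|1|5", "ADC|1", "ABC|ADC")

# Hypothetical helper: TRUE if `query` is the start of some element of `vec`,
# or some element of `vec` is the start of `query`
is_prefix_related <- function(query, vec) {
  any(startsWith(vec, query) | startsWith(query, vec))
}

res <- vapply(test, is_prefix_related, logical(1), vec = vec)
print(res)
```

Because startsWith compares characters rather than |-separated fields, a partial field such as "ABC|AD" would also count as a prefix; whether that is acceptable depends on the data.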

Find best match for multiple substrings across multiple candidates

I have the following sample data:
targets <- c("der", "das")
candidates <- c("sdassder", "sderf", "fongs")
Desired output:
I would like to find sdassder as the output, since it includes the most matches for targets (as substrings).
What I tried:
x <- sapply(targets, function(target) sapply(candidates, grep, pattern = target)) > 0
which.max(rowSums(x))
Goal:
As you can see, I found some dirty code that technically yields the result, but I don't feel it is best practice. I hope this question fits here; otherwise I will move it to Code Review.
I tried mapply, do.call, and outer, but didn't manage to find better code.
Edit:
Adding another option myself, after seeing the current answers.
Using pipes:
sapply(targets, grepl, candidates) %>% rowSums %>% which.max %>% candidates[.]
You can simplify it a little, I think.
matches <- sapply(targets, grepl, candidates)
matches
# der das
# [1,] TRUE TRUE
# [2,] TRUE FALSE
# [3,] FALSE FALSE
And find the number of matches using rowSums:
rowSums(matches)
# [1] 2 1 0
candidates[ which.max(rowSums(matches)) ]
# [1] "sdassder"
(Note that this last part does not really inform about ties.)
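If ties matter, one possible sketch is to keep every candidate whose count equals the maximum instead of using which.max. The extra candidate "dasder" is hypothetical and was added only to create a tie:

```r
targets <- c("der", "das")
# "dasder" is a made-up candidate that ties with "sdassder"
candidates <- c("sdassder", "sderf", "fongs", "dasder")

# One column per target, one row per candidate
matches <- sapply(targets, grepl, candidates, fixed = TRUE)
counts <- rowSums(matches)

# All candidates that reach the maximum count, not just the first one
best <- candidates[counts == max(counts)]
print(best)
```

which.max would return only "sdassder" here, since it reports the first maximum.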
If you want to see the individual matches per-candidate, you can always apply the names manually, though this is only an aesthetic thing, adding very little to the work itself.
rownames(matches) <- candidates
matches
# der das
# sdassder TRUE TRUE
# sderf TRUE FALSE
# fongs FALSE FALSE
rowSums(matches)
# sdassder sderf fongs
# 2 1 0
which.max(rowSums(matches))
# sdassder
# 1 <------ this "1" indicates the index within the rowSums vector
names(which.max(rowSums(matches)))
# [1] "sdassder"
One stringr option could be:
library(stringr)
candidates[which.max(rowSums(outer(candidates, targets, str_detect)))]
# [1] "sdassder"
We could paste the targets together and create a pattern to match.
library(stringr)
str_c(targets, collapse = "|")
#[1] "der|das"
Use it in str_count to count the number of times pattern was matched.
str_count(candidates, str_c(targets, collapse = "|"))
#[1] 2 1 0
Get the index of maximum value and subset it from original candidates
candidates[which.max(str_count(candidates, str_c(targets, collapse = "|")))]
#[1] "sdassder"
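One caveat with a combined "der|das" pattern is that it counts occurrences, not distinct targets. The base R sketch below illustrates the difference; the candidate "derder" is hypothetical and was added only for this purpose:

```r
targets <- c("der", "das")
# "derder" is a made-up candidate containing the same target twice
candidates <- c("sdassder", "derder", "fongs")

# Count all (non-overlapping) occurrences of the combined pattern
pattern <- paste(targets, collapse = "|")
occurrences <- lengths(regmatches(candidates, gregexpr(pattern, candidates)))

# Count how many distinct targets occur in each candidate
distinct <- rowSums(sapply(targets, grepl, candidates, fixed = TRUE))

print(rbind(occurrences, distinct))
```

Here "derder" scores 2 by occurrences but only 1 by distinct targets, so the two approaches can disagree on the best match.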

Locate different patterns in a sequence

How can I find two different patterns in a single sequence? For example:
seq="ATGCAAAGGT"
the patterns are
pattern=c("ATGC","AAGG")
How can I find these two patterns simultaneously in the sequence?
I also want to find the locations of these patterns; for example, the pattern locations are 1,4 and 5,8.
Can anyone help me with this?
Let's say your sequence file is just a vector of sequences:
seq.file <- c('ATGCAAAGGT','ATGCTAAGGT','NOTINTHISONE')
You can search for both motifs, and then return a true / false vector that identifies if both are present using the following one-liner:
grepl('ATGC', seq.file) & grepl('AAGG', seq.file)
[1] TRUE TRUE FALSE
Let's say the vector of sequences is a column within data frame d, which also contains a column of ID values:
id <- c('s1','s2','s3')
d <- data.frame(id,seq.file)
colnames(d) <- c('id','sequence')
You can append a column to this data frame, d, that identifies whether a given sequence matches with this one-liner:
d$match <- grepl('ATGC',d$sequence) & grepl('AAGG', d$sequence)
> print(d)
id sequence match
1 s1 ATGCAAAGGT TRUE
2 s2 ATGCTAAGGT TRUE
3 s3 NOTINTHISONE FALSE
The following for loop prints the positions of each of the patterns within each sequence (pattern is the vector defined in the question):
library(stringr)
pattern <- c("ATGC", "AAGG")
for (i in seq_along(d$sequence)) {
  # str_locate_all returns one start/end matrix per pattern
  out <- str_locate_all(d$sequence[i], pattern)
  first <- c(out[[1]])
  first.o <- paste(first[1], first[2], sep = ',')
  second <- c(out[[2]])
  second.o <- paste(second[1], second[2], sep = ',')
  print(c(first.o, second.o))
}
[1] "1,4" "6,9"
[1] "1,4" "6,9"
[1] "NA,NA" "NA,NA"
You can try using the stringr library to do something like this:
seq = "ATGCAAAGGT"
library(stringr)
str_extract_all(seq, 'ATGC|AAGG')
[[1]]
[1] "ATGC" "AAGG"
Without knowing more specifically what output you are looking for, this is the best I can provide right now.
How about this using stringr to find start and end positions:
library(stringr)
seq <- "ATGCAAAGGT"
pattern <- c("ATGC","AAGG")
str_locate_all(seq, pattern)
#[[1]]
# start end
#[1,] 1 4
#
#[[2]]
# start end
#[1,] 6 9
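If stringr is not available, a rough base R equivalent can be built on gregexpr. This is only a sketch: the helper name locate_all is hypothetical, and fixed = TRUE is my assumption that the motifs are literal strings rather than regular expressions:

```r
dna_seq <- "ATGCAAAGGT"
patterns <- c("ATGC", "AAGG")

# Hypothetical helper: start/end of every literal, non-overlapping
# occurrence of `pattern` in `text`
locate_all <- function(pattern, text) {
  m <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  if (m[1] == -1) {
    # gregexpr signals "no match" with -1: return an empty start/end matrix
    return(matrix(integer(0), ncol = 2,
                  dimnames = list(NULL, c("start", "end"))))
  }
  start <- as.integer(m)
  end <- start + attr(m, "match.length") - 1L
  cbind(start = start, end = end)
}

# One matrix per pattern, named after the pattern
locations <- lapply(stats::setNames(patterns, patterns), locate_all, text = dna_seq)
print(locations)
```

The variable is called dna_seq rather than seq to avoid masking the base function seq().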

Count repeated words in a column using R

I have a dataframe with column 'NAME' like this:
NAME
Cybermart co
Hot burgers hot sandwiches
Landmark co
I want to add a new column to this dataframe depending on:
whether there is any word that gets repeated in the 'name' column.
So the new column would be like this:
REPEATED_WORD
No
Yes
No
Is there any way I can do this?
# x is the character vector of names, e.g. x <- df1$NAME
vapply(strsplit(tolower(x), "\\s+"), anyDuplicated, 1L) > 0L
#[1] FALSE TRUE FALSE
We can split the 'NAME' column by whitespace (\\s+), loop over the output list, and check whether the number of unique elements differs from the length of each list element to get a logical vector. Convert the logical vector to "Yes"/"No" if required.
df1$REPEATED_WORD <- c("No", "Yes")[sapply(strsplit(df1$NAME, '\\s+'),
function(x) length(unique(tolower(x)))!=length(x)) + 1L]
df1$REPEATED_WORD
#[1] "No" "Yes" "No"
If we are using regex, we can capture non-white space elements ((\\S+)) and use regex lookarounds to check if there is any repeated word.
library(stringi)
stri_detect(tolower(df1$NAME), regex="(\\S+)(?=.*\\s+\\1\\s+)")
#[1] FALSE TRUE FALSE
It is better to leave it as a logical vector instead of converting to "Yes"/"No". If that is needed, just add 1 to the logical vector (or use ifelse) and change the TRUE values to "Yes" and FALSE to "No" (as shown above).
I had a similar solution to @akrun's second one (pure regex). I'm posting it in case it's useful to future searchers:
NAME <-
c('Cybermart co',
'Hot burgers hot sandwiches',
'Landmark co'
)
grepl("(?i)\\b(\\w+)\\s+.*\\1\\b", NAME, perl=TRUE)
## [1] FALSE TRUE FALSE
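As a rough comparison of the token-based and regex-based approaches, here is a base R sketch that builds the "Yes"/"No" column from whole words. The extra name "a banana" is hypothetical; it was added to show that the regex above only requires a word boundary after the repeated text, so it can match inside a longer word:

```r
# "a banana" is a made-up extra name used only to compare the two approaches
NAME <- c("Cybermart co", "Hot burgers hot sandwiches", "Landmark co", "a banana")

# Token-based check: split into lower-case words and look for duplicates
words <- strsplit(tolower(trimws(NAME)), "\\s+")
repeated <- vapply(words, function(w) anyDuplicated(w) > 0L, logical(1))
REPEATED_WORD <- ifelse(repeated, "Yes", "No")

# Regex-based check from the answer above, for comparison
regex_result <- grepl("(?i)\\b(\\w+)\\s+.*\\1\\b", NAME, perl = TRUE)

print(data.frame(NAME, REPEATED_WORD, regex_result))
```

The two checks agree on the original three names; they differ on "a banana", where the regex finds "a" again at the end of "banana".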

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the above code, we need to find the number of r's (both R and r) in rquote.
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
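One thing to watch with lengths(gregexpr(...)) is the no-match case: gregexpr returns -1 when nothing matches, so the length is 1 rather than 0. A sketch of a helper that avoids this by counting characters directly (the name count_before and the test string "stop sign" are hypothetical):

```r
rquote <- "R's internals are irrefutably intriguing"

# Hypothetical helper: count occurrences of `char` (case-insensitive)
# that appear before the first occurrence of `stop`
count_before <- function(x, char, stop) {
  pos <- regexpr(stop, x, fixed = TRUE)
  # If `stop` never occurs, search the whole string
  prefix <- if (pos > 0) substr(x, 1, pos - 1) else x
  letters_seen <- strsplit(tolower(prefix), "")[[1]]
  sum(letters_seen == tolower(char))
}

n_quote <- count_before(rquote, "r", "u")
n_none <- count_before("stop sign", "r", "u")

# The gregexpr-based count reports 1 for a string with no "r" at all
n_gregexpr_none <- lengths(gregexpr("r", "stop sign", ignore.case = TRUE))

print(c(n_quote, n_none, n_gregexpr_none))
```

For the example quote both approaches agree; they only differ when there is no match.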
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
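To get the count of all r's before the first u (this relies on perl(), which was deprecated in stringr 1.0.0 in favour of regex() and may not be available in current releases; as far as I know, the ICU engine behind regex() does not support PCRE verbs such as (*SKIP)(*F))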
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
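The same (*SKIP)(*F) idea can be sketched in base R, which exposes PCRE through perl = TRUE. This is an illustration under that assumption, not the original answerer's code:

```r
rquote <- "R's internals are irrefutably intriguing"

# "u.*(*SKIP)(*F)" consumes everything from the first u onwards and then
# forces a failure, so only R/r before the first u can match "[Rr]"
m <- gregexpr("u.*(*SKIP)(*F)|[Rr]", rquote, perl = TRUE)

# regmatches returns an empty vector when nothing matches, so lengths() is 0 then
n_before_u <- lengths(regmatches(rquote, m))
print(n_before_u)
```

Wrapping the result in regmatches before taking the length avoids counting the -1 that gregexpr returns for a string with no match.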
EDIT: I just noticed the "before the first u" requirement. In that case, we can get the position of the first 'u' with either which or match.
Then, using the strsplit output from the OP's code, apply grepl with ignore.case=TRUE to 'chars' up to that position (ind) to get a logical index of 'r'/'R', and sum it.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use two substitutions on the original string ('rquote'). The inner sub removes everything from the first u to the end of the string (u.*$), and the outer gsub replaces every character other than R or r ([^Rr]) with ''. We can then use nchar to count the characters remaining.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or, if we want to count the 'r' in the entire string, we can use gregexpr to get the positions of the matching characters in the original string ('rquote') and take the length:
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6

Resources