String matching with wildcards, trying Biostrings package - R

Given the string patt:
patt = "AGCTTCATGAAGCTGAGTNGGACGCGATGATGCG"
We can make a collection of all its width-10 substrings, str_col:
str_col = substring(patt, 1:(nchar(patt)-9), 10:nchar(patt))
which we want to match against a subject1:
subject1 = "AGCTTCATGAAGCTGAGTGGGACGCGATGATGCGACTAGGGACCTTAGCAGC"
treating "N" in patt as a wildcard (match to any letter in subject1), so all substrings in str_col match to subject1.
I want to do this kind of string matching in a large database of strings, and I found the Bioconductor package Biostrings to be very efficient for that. But, in order to be efficient, Biostrings requires you to convert your collection of substrings (here str_col) into a dictionary of class PDict using the function PDict(). You can then use this 'dictionary' in functions like countPDict() to count matches against a subject.
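As a minimal sketch of that workflow (assuming Biostrings is installed), using only the N-free patterns str_col[1:9], since PDict() rejects wildcards unless you split off a head or tail:
library(Biostrings)
pd <- PDict(DNAStringSet(str_col[1:9]))  # preprocess the patterns once
countPDict(pd, DNAString(subject1))      # one match count per pattern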
In order to use wildcards, you have to divide your dictionary into 3 parts: a head (left), a trusted band (middle) and a tail (right). You can only have wildcards like "N" in the head or tail, not in the trusted band, and the trusted band cannot have width 0. So, for example, str_col[15] won't match if you use a trusted band of the minimum width 1, like:
> PDict(str_col[1:15],tb.start=5,tb.end=5)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr, :
non base DNA letter found in Trusted Band for pattern 15
because the "N" is right in the trusted band. Notice that the strings here are DNA sequences, so "N" is a code for "match to A, C, G, or T".
> PDict(str_col[1:14],tb.start=5,tb.end=5) #is OK
TB_PDict object of length 14 and width 10 (preprocessing algo="ACtree2"):
- with a head of width 4
- with a Trusted Band of width 1
- with a tail of width 5
Is there any way to circumvent this limitation of Biostrings? I also tried to perform this task using base R functions, but I couldn't come up with anything.

I reckon that you'll need matching against some other wildcards from the IUPAC ambiguity code at some point, no?
If you need perfect matches and base functions are enough for you, you can use the same trick as the function glob2rx(): simply use gsub() to convert the ambiguity codes into regular-expression character classes. An example:
IUPACtoRX <- function(x){
  p <- gsub("N", "[ATCG]", x)  # N matches any base
  p <- gsub("Y", "[CT]", p)    # Y matches any pyrimidine
  # add the ambiguity codes you want here
  p
}
Obviously you need a line for every ambiguity code you want to support, but it's pretty straightforward, I'd say.
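If you'd rather not hand-maintain that table, here's a sketch (with a hypothetical helper IUPACtoRX2) that derives the full conversion from the IUPAC_CODE_MAP lookup table that Biostrings exports:
library(Biostrings)
IUPACtoRX2 <- function(x){
  amb <- IUPAC_CODE_MAP[nchar(IUPAC_CODE_MAP) > 1]  # just the ambiguity codes: M, R, ..., N
  for(code in names(amb))
    x <- gsub(code, paste0("[", amb[[code]], "]"), x, fixed = TRUE)
  x
}
IUPACtoRX2("AGTNGGY")  # "AGT[ACGT]GG[CT]"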
Doing this, you can then, e.g., do something like:
> sapply(str_col, function(i) grepl(IUPACtoRX(i),subject1) )
AGCTTCATGA GCTTCATGAA CTTCATGAAG TTCATGAAGC TCATGAAGCT CATGAAGCTG ATGAAGCTGA
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
TGAAGCTGAG GAAGCTGAGT AAGCTGAGTN AGCTGAGTNG GCTGAGTNGG CTGAGTNGGA TGAGTNGGAC
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GAGTNGGACG AGTNGGACGC GTNGGACGCG TNGGACGCGA NGGACGCGAT GGACGCGATG GACGCGATGA
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
ACGCGATGAT CGCGATGATG GCGATGATGC CGATGATGCG
TRUE TRUE TRUE TRUE
To find the number of matches, you can use e.g. gregexpr(). It returns a list with one element per subject string, containing the match start positions (or -1 if there is no match), so extract that element and count the positive entries:
> sapply(str_col, function(i) sum(gregexpr(IUPACtoRX(i), subject1)[[1]] > 0))
AGCTTCATGA GCTTCATGAA CTTCATGAAG TTCATGAAGC TCATGAAGCT CATGAAGCTG ATGAAGCTGA
1 1 1 1 1 1 1
TGAAGCTGAG GAAGCTGAGT AAGCTGAGTN AGCTGAGTNG GCTGAGTNGG CTGAGTNGGA TGAGTNGGAC
1 1 1 1 1 1 1
GAGTNGGACG AGTNGGACGC GTNGGACGCG TNGGACGCGA NGGACGCGAT GGACGCGATG GACGCGATGA
1 1 1 1 1 1 1
ACGCGATGAT CGCGATGATG GCGATGATGC CGATGATGCG
1 1 1 1

Related

Ignore or display NA in a row if the search word is not available in a list - R

How do I print or display "Not Available" if any of my search terms (in Table_search) is not found in the input? In the input I have three lines, and I have 3 keywords to search for across these lines, to tell me whether each keyword is present or not. If a keyword is present, print the matching line, otherwise print NA, like I showed in the desired output.
My code just prints all the available lines, but that doesn't help, as I also need to know where a word is missing.
Table_search <- list("Table 14", "Source Data:","VERSION")
Table_match_list <- sapply(Table_search, grep, x = tablelist, value = TRUE)
Input:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry
Desired Output:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
NA
@r2evans
sapply(unlist(Table_search), grepl, x = dat)
I get a good output with this code actually, but instead of TRUE or FALSE I would like to print the actual data.
I think a single regex will do it:
replace(dat, !grepl(paste(unlist(Table_search), collapse="|"), dat), NA)
# [1] "Table 14.1.1.1 (Page 1 of 2)" "Source Data: Listing 16.2.1.1.1"
# [3] NA
One problem with using sapply(., grep) is that grep returns integer indices, and if no match is made it returns a length-0 vector. For sapply (a class-unsafe function), this means that you may or may not get an integer vector in return. Each return may be length 0 (nothing found) or length 1 (something found), and when sapply finds that the return values are not all the same length, it returns a list instead (ergo my "class-unsafe" verbiage above).
This doesn't change when you use value=TRUE: change my reasoning above about "0 or 1 integer" into "0 or 1 character", and it's the exact same problem.
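You can see the shape instability directly, as a quick sketch with the Data at the bottom of this answer:
> sapply(Table_search, grep, x = dat)       # lengths 1, 1, 0: stays a list
> sapply(Table_search[1:2], grep, x = dat)  # lengths 1, 1: simplifies to c(1, 2)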
Because of this, I suggest grepl: it always returns a logical vector indicating found or not found.
Further, since you don't appear to need to differentiate which of the patterns is found, just "at least one of them", we can use a single regex, joined with the regex-OR operator |. This works for an arbitrary number of patterns in your Table_search list.
If you somehow needed to know which of the patterns was found, then you might want something like:
sapply(unlist(Table_search), grepl, x = dat)
# Table 14 Source Data: VERSION
# [1,] TRUE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE FALSE
and then figure out what to do with the different columns (each row indicates a string within the dat vector).
One way (that does the same as my first code suggestion, albeit less efficiently) is
rowSums(sapply(unlist(Table_search), grepl, x = dat)) > 0
# [1] TRUE TRUE FALSE
where the logical return value indicates whether something was found. If, for instance, you want to know whether two or more of the patterns were found, one might use rowSums(.) >= 2.
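For instance, a quick sketch with the Data below:
> hits <- rowSums(sapply(unlist(Table_search), grepl, x = dat))
> hits
# [1] 1 1 0
> dat[hits >= 2]
# character(0)  (no line matches two patterns)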
Data
Table_search <- list("Table 14", "Source Data:","VERSION")
dat <- c("Table 14.1.1.1 (Page 1 of 2)", "Source Data: Listing 16.2.1.1.1", "Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry")

How to match strings to large dictionary and avoid memory problems

I have a dataframe with strings such as these, some of which are existing English words and others which are not:
df <- data.frame(
  strings = c("'tis", " &%##", "aah", "notexistingword", "823942", "abaxile"))
Now I'd like to check which of them are real words by matching them against a large dictionary such as GradyAugmented:
library(qdapDictionaries)
df$inGrady <- grepl(paste0("\\b(", paste(GradyAugmented[1:2500], collapse = "|"), ")\\b"), df$strings)
df
strings inGrady
1 'tis TRUE
2 &%## FALSE
3 aah TRUE
4 notexistingword FALSE
5 823942 FALSE
6 abaxile TRUE
Unfortunately, this works fine only as long as I restrict the size of GradyAugmented (the cut-off point beyond which it no longer works is around 2500 words). As soon as I use the whole dictionary, I get an error asserting there's an invalid regular expression. My hunch is that it's not so much the regex as a memory problem. How can that problem be resolved?
Are you looking for something like this?
df$inGrady <- df$strings %in% GradyAugmented
# strings inGrady
# 1 'tis TRUE
# 2 &%## FALSE
# 3 aah TRUE
# 4 notexistingword FALSE
# 5 823942 FALSE
# 6 abaxile TRUE
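This sidesteps the giant pasted regex entirely: %in% is implemented via match(), which hashes the dictionary once, so the full GradyAugmented vector is no problem. Note that it is an exact, case-sensitive, whole-string comparison; if case might differ, a sketch of a case-insensitive variant:
df$inGrady <- tolower(df$strings) %in% tolower(GradyAugmented)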

Difference between expr(mean(1:10)) and expr(mean(!!(1:10)))

Going through the metaprogramming sections of Hadley's book Advanced R (2nd ed.), I'm having quite a tough time understanding the concept. I have been programming with R for a while, but this is the first time I have come across metaprogramming. This exercise question in particular confuses me:
"The following two calls print the same, but are actually different:
(a <- expr(mean(1:10)))
#> mean(1:10)
(b <- expr(mean(!!(1:10))))
#> mean(1:10)
identical(a, b)
#> [1] FALSE
What’s the difference? Which one is more natural?"
When I eval them, they both return the same:
> eval(a)
[1] 5.5
> eval(b)
[1] 5.5
When I look inside the a and b objects, the second one does print differently, but I am not sure what this means in terms of their difference:
> a[[2]]
1:10
> b[[2]]
[1] 1 2 3 4 5 6 7 8 9 10
Also, if I just run them without eval(expr(...)), they return different results:
mean(1:10)
[1] 5.5
mean(!!(1:10))
[1] 1
My guess is that without expr(...), !!(1:10) acts as a double negation, which with coercion essentially forces all the numbers to TRUE (i.e. 1), hence a mean of 1.
My questions are:
Why does !! act differently with and without expr(...)? I would expect eval(expr(mean(!!(1:10)))) to return the same as mean(!!(1:10)), but this is not so.
I still do not quite grasp the difference between object a and object b.
Thank you in advance.
!! here is used not as double negation, but as the unquote operator from rlang.
Unquoting is one inverse of quoting. It allows you to selectively
evaluate code inside expr(), so that expr(!!x) is equivalent to x.
The difference between a and b is that the argument remains as an unevaluated call in a, while it is evaluated in b:
class(a[[2]])
[1] "call"
class(b[[2]])
[1] "integer"
The behaviour of a may be an advantage in some circumstances, because it delays evaluation, or a disadvantage for the same reason. When it is a disadvantage, it is the cause of much frustration. Also, if the argument were a larger vector, the size of b would increase, while a would stay the same.
See section 19.4 of Advanced R for more details.
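A sketch of that size difference, measured with object.size() (the exact numbers will vary):
library(rlang)
a2 <- expr(mean(1:1e6))      # stores only the call 1:1e6
b2 <- expr(mean(!!(1:1e6)))  # stores the evaluated million-element vector
object.size(a2)              # a few hundred bytes
object.size(b2)              # roughly 4 MB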
Here is the difference. When we negate (!) an integer vector, numbers other than 0 are converted to FALSE and 0 to TRUE. With another negation, i.e. double negation (!!), the FALSE values are changed to TRUE and vice versa:
!0:5
#[1] TRUE FALSE FALSE FALSE FALSE FALSE
!!0:5
#[1] FALSE TRUE TRUE TRUE TRUE TRUE
With the OP's example, it is all TRUE
!!1:10
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
and TRUE/FALSE can be otherwise 1/0
as.integer(!!1:10)
#[1] 1 1 1 1 1 1 1 1 1 1
thus the mean would be 1
mean(!!1:10)
#[1] 1
Regarding the 'a' vs. 'b'
str(a)
#language mean(1:10)
str(b)
#language mean(1:10)
Both are language objects, and both evaluate to the mean of the numbers 1:10:
all.equal(a, b)
#[1] TRUE
If we need to get the mean of 10 numbers, the first one is the correct way.
We could evaluate the second option correctly, i.e. getting a mean value of 1, by quoting:
eval(quote(mean(!!(1:10))))
#[1] 1
eval(quote(mean(1:10)))
#[1] 5.5
!! has special meaning when used inside expr(). Outside expr() you get different results because !! is a double negation. Even inside expr() the two versions are different, because 1:10 is an expression that results in an integer vector when evaluated, while !!(1:10) is the result of evaluating that same expression. An expression and its result after evaluation are different things.
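A sketch making that last point concrete:
library(rlang)
e <- expr(1:3)  # an unevaluated expression
v <- eval(e)    # its result: the integer vector 1 2 3
identical(expr(mean(!!e)), expr(mean(1:3)))  # TRUE: the call 1:3 is spliced in
identical(expr(mean(!!v)), expr(mean(1:3)))  # FALSE: the values are spliced in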

How do I find the position of a (fuzzy) match within a string?

I have a text-processing problem in R. I want to get the character position within a string at which a different string makes an exact match and/or a fuzzy match within some edit distance. For example:
A = "blahmatchblah"
B = "match"
C = "latch"
I would like to get back something telling me that the 5th character of string A is where the match begins, for a search of both B and C. All the pattern-matching tools I'm aware of will tell me whether there's a (fuzzy) match for B or C within A, but not where that match begins.
The base function aregexec() is used for approximate string position matching. Unfortunately it's not vectorized over pattern, so we'll have to use a loop to get the positions for both B and C.
sapply(c(B, C), aregexec, A)
# $match
# [1] 5
# attr(,"match.length")
# [1] 5
#
# $latch
# [1] 5
# attr(,"match.length")
# [1] 5
See help(aregexec) for more.
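As a follow-up sketch, regmatches() can pull out the matched substring as well (here with an explicit max.distance of one edit):
> m <- aregexec(C, A, max.distance = 1)
> regmatches(A, m)
# [[1]]
# [1] "match"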
I don't have enough rep to comment, but at least for the first part of your question: gregexpr(B, A)[[1]][1] will yield 5, because "match" is an exact substring of A.
A few months back I made an interface to the fuzzywuzzy Python package in R, which has the get_matching_blocks() method (it's pretty close to what you're actually asking for).
Assuming you want to find the matching blocks between two strings,
A = "blahmatchblah"
B = "match"
library(fuzzywuzzyR)
init <- SequenceMatcher$new(string1 = A, string2 = B)
init$get_matching_blocks()
returns,
[[1]]
Match(a=4, b=0, size=5)
[[2]]
Match(a=13, b=5, size=0)
The first sublist gives the matching block of the two strings. a = 4 gives the starting index of the match in string A and b = 0 gives the starting index in string B (indexing starts from 0, so a = 4 corresponds to the 5th character of A). size = 5 gives the number of characters the two strings have in common (in this case the matching block is "match", which has 5 characters).
The documentation, especially for SequenceMatcher, has more info.

Grep in R using OR and NOT

I have the following vector in R, and I would like to find all the strings containing A or B but not the number 2.
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_Aa")
The following does not work:
grep("A|B|!2", vec1)
It gives me back all the strings:
[1] 1 2 3 4 5
The same is true for this example:
grep("A|B|-2", vec1)
What would be the correct syntax?
You can do this with a fairly simple regular expression:
grep("^[^2]*[AB][^2]*$", vec1)
In words, it means:
^ match the start of the string
[^2]* match anything except "2", zero or more times
[AB] match "A" or "B"
[^2]* match anything except "2", zero or more times
$ match the end of the string
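On vec1 this picks out exactly the strings with an A or B and no "2":
> grep("^[^2]*[AB][^2]*$", vec1, value = TRUE)
#[1] "A_cont_1"  "B_treat_8"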
I would use two grep calls:
intersect(grep("A|B",vec1),grep("2",vec1,invert=TRUE))
#[1] 1 3
OP, your attempt is pretty close, try this:
grep('^(A|B|[^2])*$', vec1)
(Note that since [^2] already covers A and B, this really just matches any string without a "2"; it gives the right answer here because every 2-free string in vec1 happens to contain an A or B.)
grep generally does not work very well for doing a positive and a negative search in one invocation. You might be able to make it work with a complex regular expression, but you may be better off just doing this (with command-line grep):
grep '[AB]' somefile.txt | grep -v '2'
The R equivalent of that would be:
grep("2", grep("A|B", vec1, value = T), invert = T)
I extended the answer provided by @eddi. I have tested it in R and it works for me. I changed the last element in your example, since previously all of them contained A or B.
# Create the vector from the OP with one change
vec1<-c("A_cont_1", "A_cont_12", "B_treat_8", "AB_cont_22", "cont_21_dd")
I then ran the following code. It shows the results you should expect from each step of the grep.
First, tell me which elements contain A or B:
> grepl("A|B", vec1)
[1] TRUE TRUE TRUE TRUE FALSE
Now tell me which elements contain a "2":
> grepl("2", vec1)
[1] FALSE TRUE FALSE TRUE TRUE
The indices we want are 1 and 3 (A or B present, no "2"):
> grep("2", grep("A|B", vec1, value = T), invert = T)
[1] 1 3
Done!
