How to match strings to large dictionary and avoid memory problems - r

I have a dataframe with strings such as these, some of which are existing English words and others which are not:
df <- data.frame(
strings = c("'tis"," &%##","aah", "notexistingword", "823942", "abaxile"))
Now I'd like to check which of them are real words by matching them against a large dictionary such as GradyAugmented:
library(qdapDictionaries)
df$inGrady <- grepl(paste0("\\b(", paste(GradyAugmented[1:2500], collapse = "|"), ")\\b"), df$strings)
df
strings inGrady
1 'tis TRUE
2 &%## FALSE
3 aah TRUE
4 notexistingword FALSE
5 823942 FALSE
6 abaxile TRUE
Unfortunately, this only works as long as I restrict the size of GradyAugmented (it seems to stop working at around 2500 entries). As soon as I use the whole dictionary I get an error reporting an invalid regular expression. My hunch is that the cause is memory rather than the regex itself. How can this be resolved?

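Are you looking for something like this? %in% tests each string for an exact match against the dictionary, so no regular expression has to be built: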
df$inGrady <- df$strings %in% GradyAugmented
# strings inGrady
# 1 'tis TRUE
# 2 &%## FALSE
# 3 aah TRUE
# 4 notexistingword FALSE
# 5 823942 FALSE
# 6 abaxile TRUE
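Note that %in% only matches whole strings, whereas the original grepl call looked for dictionary words anywhere inside a string. If the strings may contain several words, one possible approach is to split them into tokens first. The following is a minimal sketch; dict is a small hypothetical stand-in for GradyAugmented, and the last test string is made up for illustration.

```r
# Hypothetical stand-in for a large dictionary such as GradyAugmented
dict <- c("'tis", "aah", "abaxile", "word")
strings <- c("'tis", " &%##", "aah", "notexistingword", "823942", "abaxile", "aah word")

# Exact match: the whole string must be a dictionary entry
exact <- strings %in% dict

# Token match: TRUE if any whitespace-separated token is a dictionary entry
tokens <- strsplit(trimws(strings), "\\s+")
any_token <- vapply(tokens, function(tok) any(tok %in% dict), logical(1))
```

Both variants rely on lookups rather than on one huge alternation pattern, so the dictionary size should not affect regex compilation.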

Related

Specifying a word followed by a specific word followed by max of 3 words in regex in R

I'm looking for a specific regex pattern which I can't seem to get right. In pseudo-notation:
pattern <- "[1 word|no word][this is][1-3 words max]"
text <- c("this guy cannot get a mortgage, this is a fake application", "this is a new application", "hi this is a specific question", "this is real", "this is not what you are looking for")
str_match(text, pattern)
The output I'd like to have is:
[1]FALSE # because there are too many words in front
[2]TRUE
[3]TRUE
[4]TRUE
[5]FALSE # because there are too many words behind it
It should be doable, but I'm struggling with expressing words and a maximum number of them in regex.
Can anyone help me with this one?
grepl("^(\\S+\\s*)?this is\\s*\\S+\\s*\\S*\\s*\\S*$", text, perl = TRUE)
# [1] FALSE TRUE TRUE TRUE FALSE
This seems a little brute-force, but it allows
^(\\S+\\s*)? zero or one word before
the literal this is (followed by zero or more whitespace characters), then
at a minimum, \\S+ one word (at least one non-whitespace character), then
possibly space-and-a-word \\s*\\S*, twice, allowing up to three words
Depending on how you intend to use this, you can extract the words into a single-column or multiple columns, using strcapture (still base R):
strcapture("^(\\S+\\s*)?this is\\s*(\\S+\\s*\\S*\\s*\\S*)$", text,
proto = list(ign="",w1=""), perl = TRUE)[,-1,drop=FALSE]
# w1
# 1 <NA>
# 2 a new application
# 3 a specific question
# 4 real
# 5 <NA>
strcapture("^(\\S+\\s*)?this is\\s*(\\S+)\\s*(\\S*)\\s*(\\S*)$", text,
proto = list(ign="",w1="",w2="",w3=""), perl = TRUE)[,-1,drop=FALSE]
# w1 w2 w3
# 1 <NA> <NA> <NA>
# 2 a new application
# 3 a specific question
# 4 real
# 5 <NA> <NA> <NA>
The [,-1,drop=FALSE] is because we need to (..) capture the words before "this is" so that it can be optional, but we don't need to keep them, so I drop them right away. (The drop=FALSE is because base R data.frame defaults to reducing a single-column return to a vector.)
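The effect of drop=FALSE can be seen on a small made-up data frame (the column names simply mirror the ones used above; the values are hypothetical):

```r
# Hypothetical two-column data frame, mirroring the strcapture result layout
df <- data.frame(ign = c("a", "b"), w1 = c("x", "y"))

# Default behaviour: a single remaining column is reduced to a plain vector
as_vector <- df[, -1]

# With drop = FALSE the result stays a one-column data frame
as_frame <- df[, -1, drop = FALSE]
```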
Slight improvement (less brute-force), that allows for programmatically determining the number of words to accept.
text2 <- c("this is one", "this is one two", "this is one two three", "this is one two three four", "this is one two three four five", "this not is", "hi this is")
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,4}$", text2, perl = TRUE)
# [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,2}$", text2, perl = TRUE)
# [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,99}$", text2, perl = TRUE)
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE
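To set the word limit from a variable, the pattern can be assembled with sprintf. This is only a sketch: make_pattern is a hypothetical helper name, and it assumes the phrase contains no regex metacharacters, since it is pasted into the pattern verbatim.

```r
# Hypothetical helper: same pattern as above, with the phrase and the
# maximum number of trailing words filled in at run time
make_pattern <- function(phrase, max_words) {
  sprintf("^(\\S+\\s*)?%s\\s*(\\S+\\s*){1,%d}$", phrase, max_words)
}

text2 <- c("this is one", "this is one two", "this is one two three",
           "this is one two three four", "this is one two three four five",
           "this not is", "hi this is")

up_to_three <- grepl(make_pattern("this is", 3), text2, perl = TRUE)
up_to_four  <- grepl(make_pattern("this is", 4), text2, perl = TRUE)
```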
This doesn't necessarily work with strcapture, since it does not have a pre-defined number of groups. Namely, it will only capture the last of the words:
strcapture("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,3}$", text2,
proto = list(ign="",w1=""), perl = TRUE)
# ign w1
# 1 one
# 2 two
# 3 three
# 4 <NA> <NA>
# 5 <NA> <NA>
# 6 <NA> <NA>
# 7 <NA> <NA>
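One possible workaround, sketched here with base R only, is to capture the whole tail in a single group and split it afterwards. The non-capturing groups (?:...) are a choice made for this sketch so that the tail is the only captured group; the shortened text2 is just for illustration.

```r
text2 <- c("this is one", "this is one two three",
           "this is one two three four", "hi this is")

# Only the tail (one to three words) is captured; everything else is non-capturing
pat <- "^(?:\\S+\\s*)?this is\\s*((?:\\S+\\s*){1,3})$"
m <- regmatches(text2, regexec(pat, text2, perl = TRUE))

# For each string: the captured words as a character vector, or NA if no match
tails <- lapply(m, function(x) {
  if (length(x) == 0) NA_character_ else strsplit(trimws(x[2]), "\\s+")[[1]]
})
```

Each list element then holds however many words were matched, which sidesteps the fixed number of groups that strcapture expects.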

Find best match for multiple substrings across multiple candidates

I have the following sample data:
targets <- c("der", "das")
candidates <- c("sdassder", "sderf", "fongs")
Desired output:
I would like to find sdassder as the output, since it includes the most matches for targets (as substrings).
What I tried:
x <- sapply(targets, function(target) sapply(candidates, grep, pattern = target)) > 0
which.max(rowSums(x))
Goal:
As you can see, I found some dirty code that technically yields the result, but I don't feel it's best practice. I hope this question fits here; otherwise I'll move it to Code Review.
I tried mapply, do.call, and outer, but didn't manage to find better code.
Edit:
Adding another option myself, after seeing the current answers.
Using pipes:
sapply(targets, grepl, candidates) %>% rowSums %>% which.max %>% candidates[.]
You can simplify it a little, I think.
matches <- sapply(targets, grepl, candidates)
matches
# der das
# [1,] TRUE TRUE
# [2,] TRUE FALSE
# [3,] FALSE FALSE
And find the number of matches using rowSums:
rowSums(matches)
# [1] 2 1 0
candidates[ which.max(rowSums(matches)) ]
# [1] "sdassder"
(Note that this last part does not really inform about ties.)
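If ties matter, one option is to compare every count against the maximum instead of using which.max, which only returns the first maximum. A minimal sketch follows; the fourth candidate is made up to force a tie, and fixed = TRUE assumes the targets are literal substrings rather than regex patterns.

```r
targets <- c("der", "das")
candidates <- c("sdassder", "sderf", "fongs", "xdasyder")  # last entry is hypothetical

# One column per target, one row per candidate
matches <- sapply(targets, grepl, candidates, fixed = TRUE)
n_hits <- rowSums(matches)

# Keep every candidate that reaches the maximum number of matched targets
best <- candidates[n_hits == max(n_hits)]
```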
If you want to see the individual matches per-candidate, you can always apply the names manually, though this is only an aesthetic thing, adding very little to the work itself.
rownames(matches) <- candidates
matches
# der das
# sdassder TRUE TRUE
# sderf TRUE FALSE
# fongs FALSE FALSE
rowSums(matches)
# sdassder sderf fongs
# 2 1 0
which.max(rowSums(matches))
# sdassder
# 1 <------ this "1" indicates the index within the rowSums vector
names(which.max(rowSums(matches)))
# [1] "sdassder"
One stringr option could be:
library(stringr)
candidates[which.max(rowSums(outer(candidates, targets, str_detect)))]
# [1] "sdassder"
We could paste the targets together and create a pattern to match.
library(stringr)
str_c(targets, collapse = "|")
#[1] "der|das"
Use it in str_count to count the number of times pattern was matched.
str_count(candidates, str_c(targets, collapse = "|"))
#[1] 2 1 0
Get the index of maximum value and subset it from original candidates
candidates[which.max(str_count(candidates, str_c(targets, collapse = "|")))]
#[1] "sdassder"
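One caveat with the combined pattern is that it counts occurrences, not distinct targets, so a candidate that repeats one target can score as high as one that contains both. The sketch below shows the difference using base R equivalents (gregexpr and regmatches in place of str_count); the candidates are made up for illustration.

```r
targets <- c("der", "das")
candidates <- c("derder", "dasder", "fongs")  # hypothetical examples

# Total number of pattern occurrences per candidate (what a combined count gives)
pattern <- paste(targets, collapse = "|")
occurrences <- lengths(regmatches(candidates, gregexpr(pattern, candidates)))

# Number of distinct targets found per candidate
distinct <- rowSums(sapply(targets, grepl, candidates, fixed = TRUE))
```

Here "derder" and "dasder" both have two occurrences, but only "dasder" contains both targets.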

Grep when pattern is found exactly n times [duplicate]

I am looking for a regular expression to capture strings where the pattern occurs exactly n times. Here is an example with expected output.
# find sentences with 2 occurrences of the word "is"
z = c("this is what it is and is not", "this is not", "this is it it is")
regex_function(z)
[1] FALSE FALSE TRUE
I have gotten this far:
grepl("(.*\\bis\\b.*){2}",z)
[1] TRUE FALSE TRUE
But this will return TRUE if there are at least 2 matches. How can I force it to look for strings with exactly 2 occurrences?
To find strings where the word is occurs exactly twice, you can remove every is with gsub and compare the string lengths with nchar. Each removed is shortens the string by 2 characters, so two occurrences give a difference of 4.
nchar(z) - nchar(gsub("(\\bis\\b)", "", z)) == 4
#[1] FALSE FALSE TRUE
or count the hits of gregexpr like:
sapply(gregexpr("\\bis\\b", z), function(x) sum(x>0)) == 2
#[1] FALSE FALSE TRUE
or with a regex in grepl
grepl("^(?!(.*\\bis\\b){3})(.*\\bis\\b){2}.*$", z, perl=TRUE)
#[1] FALSE FALSE TRUE
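The lookahead pattern can be generalised to any n by building it with sprintf. This is a sketch only: exactly_n is a hypothetical helper name, and it assumes the word contains no regex metacharacters.

```r
# Hypothetical helper: TRUE where `word` occurs exactly n times as a whole word
exactly_n <- function(x, word, n) {
  unit <- sprintf("(.*\\b%s\\b)", word)
  # Reject n + 1 occurrences via negative lookahead, then require n occurrences
  pattern <- sprintf("^(?!%s{%d})%s{%d}.*$", unit, n + 1, unit, n)
  grepl(pattern, x, perl = TRUE)
}

z <- c("this is what it is and is not", "this is not", "this is it it is")
once   <- exactly_n(z, "is", 1)
twice  <- exactly_n(z, "is", 2)
thrice <- exactly_n(z, "is", 3)
```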
This option works but needs two regex calls. I am still looking for a single compact regex that solves this correctly.
grepl("(.*\\bis\\b.*){2}",z) & !grepl("(.*\\bis\\b.*){3}",z)
Basically, add a second grepl for n+1 occurrences and keep only the strings that satisfy the first grepl but not the second.
library(stringi)
stri_count_regex(z, "\\bis\\b") == 2L
# [1] FALSE FALSE TRUE
with stringr:
library(stringr)
library(magrittr)
regex_function = function(str){
str_extract_all(str,"\\bis\\b")%>%
lapply(.,function(x){length(x) == 2}) %>%
unlist()
}
> regex_function(z)
[1] FALSE FALSE TRUE

extract words from a string into different strings

I'm very new to coding, and I have to clean a table with string variables. One of the columns I'm trying to clean contains several variables in itself. If I take one row from that column, it looks like this:
string<- ("'casual': True,'classy': False,'divey': False,'hipster': False,'intimate': False,'romantic': False,'touristy': False,'trendy': False,'upscale': False")
I'm trying to extract the Boolean values for each of the categories into separate columns. So my outcome should have 9 columns (one per category), and the rows should contain the True/False values.
What am I supposed to use in this case?
An option is to use str_extract_all to extract the word (\\w+) that follows a : and a space
library(stringr)
as.logical(str_extract_all(string, "(?<=: )\\w+")[[1]])
#[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
If we need to parse into a data.frame, it would be better to use fromJSON from jsonlite
library(jsonlite)
lst1 <- fromJSON(paste0("{", gsub("'", "", gsub("\\b(\\w+)\\b",
'"\\1"', string)), "}"))
data.frame(lapply(lst1, as.logical))
# casual classy divey hipster intimate romantic touristy trendy upscale
#1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Or in base R
as.logical(regmatches(string, gregexpr("(?<=: )\\w+", string, perl = TRUE))[[1]])
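To get one column per category for a whole column of such strings without going through JSON, the keys can be extracted the same way as the values. The sketch below uses base R only; parse_flags and the two shortened rows are hypothetical, and it assumes every row lists the same categories in the same order.

```r
# Hypothetical helper: turn one "'key': True,'key2': False" string into a 1-row data frame
parse_flags <- function(s) {
  keys <- regmatches(s, gregexpr("(?<=')\\w+(?=')", s, perl = TRUE))[[1]]
  vals <- regmatches(s, gregexpr("(?<=: )\\w+", s, perl = TRUE))[[1]]
  # as.logical() accepts "True"/"False" as written in the input
  as.data.frame(as.list(setNames(as.logical(vals), keys)))
}

# Made-up, shortened rows in the same format as the question's string
rows <- c("'casual': True,'classy': False",
          "'casual': False,'classy': True")

result <- do.call(rbind, lapply(rows, parse_flags))
```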

String matching with wildcards, trying Biostrings package

Given the string patt:
patt = "AGCTTCATGAAGCTGAGTNGGACGCGATGATGCG"
We can make a collection of shorter substrings str_col:
str_col = substring(patt,1:(nchar(patt)-9),10:nchar(patt))
which we want to match against a subject1:
subject1 = "AGCTTCATGAAGCTGAGTGGGACGCGATGATGCGACTAGGGACCTTAGCAGC"
treating "N" in patt as a wildcard (match to any letter in subject1), so all substrings in str_col match to subject1.
I want to do this kind of string matching in a large database of strings, and I found the Bioconductor package Biostrings to be very efficient for that. But, in order to be efficient, Biostrings requires you to convert your collection of substrings (here str_col) into a dictionary of class PDict using the function PDict(). You can use this 'dictionary' later in functions like countPDict() to count matches against a target.
In order to use wildcards, you have to divide your dictionary in 3 parts: a head (left), a trusted band (middle) and a tail (right). You can only have wildcards, like "N", in the head or tail, but not in the trusted band, and you cannot have a trusted band of width = 0. So, for example, str_col[15] won't match if you use a trusted band of minimum width = 1 like:
> PDict(str_col[1:15],tb.start=5,tb.end=5)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr, :
non base DNA letter found in Trusted Band for pattern 15
because the "N" is right in the trusted band. Notice that the strings here are DNA sequences, so "N" is a code for "match to A, C, G, or T".
> PDict(str_col[1:14],tb.start=5,tb.end=5) #is OK
TB_PDict object of length 14 and width 10 (preprocessing algo="ACtree2"):
- with a head of width 4
- with a Trusted Band of width 1
- with a tail of width 5
Is there any way to circumvent this limitation of Biostrings? I also tried to perform such task using R base functions, but I couldn't come up with anything.
I reckon that you'll need matching against some other wild cards from the IUPAC ambiguity code at one point, no?
If you need perfect matches and base functions are enough for you, you can use the same trick as the function glob2rx() : simply use conversion with gsub() to construct the matching patterns. An example:
IUPACtoRX <- function(x){
p <- gsub("N","[ATCG]",x) #match any base
p <- gsub("Y","[CT]",p) #match any pyrimidine
# add the ambiguity codes you want here
p
}
Obviously you need a line for every ambiguity you want to program in, but it's pretty straightforward I'd say.
Doing this, you can then do something like:
> sapply(str_col, function(i) grepl(IUPACtoRX(i),subject1) )
AGCTTCATGA GCTTCATGAA CTTCATGAAG TTCATGAAGC TCATGAAGCT CATGAAGCTG ATGAAGCTGA
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
TGAAGCTGAG GAAGCTGAGT AAGCTGAGTN AGCTGAGTNG GCTGAGTNGG CTGAGTNGGA TGAGTNGGAC
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GAGTNGGACG AGTNGGACGC GTNGGACGCG TNGGACGCGA NGGACGCGAT GGACGCGATG GACGCGATGA
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
ACGCGATGAT CGCGATGATG GCGATGATGC CGATGATGCG
TRUE TRUE TRUE TRUE
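To find the number of matches, you can use e.g. gregexpr() (the [[1]] extracts the vector of match positions, so that more than one match per pattern is counted correctly):
> sapply(str_col,function(i) sum(gregexpr(IUPACtoRX(i),subject1)[[1]] > 0 ))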
AGCTTCATGA GCTTCATGAA CTTCATGAAG TTCATGAAGC TCATGAAGCT CATGAAGCTG ATGAAGCTGA
1 1 1 1 1 1 1
TGAAGCTGAG GAAGCTGAGT AAGCTGAGTN AGCTGAGTNG GCTGAGTNGG CTGAGTNGGA TGAGTNGGAC
1 1 1 1 1 1 1
GAGTNGGACG AGTNGGACGC GTNGGACGCG TNGGACGCGA NGGACGCGAT GGACGCGATG GACGCGATGA
1 1 1 1 1 1 1
ACGCGATGAT CGCGATGATG GCGATGATGC CGATGATGCG
1 1 1 1
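Instead of one gsub() line per ambiguity code, the conversion could also be driven by a lookup table. The following is a sketch in base R covering the standard IUPAC nucleotide ambiguity codes; iupac and iupac_to_rx are hypothetical names, and the function handles a single string at a time.

```r
# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes
iupac <- c(R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]", K = "[GT]", M = "[AC]",
           B = "[CGT]", D = "[AGT]", H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Hypothetical helper: replace every ambiguity code in one sequence by its class
iupac_to_rx <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  hit <- chars %in% names(iupac)
  chars[hit] <- iupac[chars[hit]]
  paste(chars, collapse = "")
}

subject1 <- "AGCTTCATGAAGCTGAGTGGGACGCGATGATGCGACTAGGGACCTTAGCAGC"
rx <- iupac_to_rx("AAGCTGAGTN")
found <- grepl(rx, subject1)
```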
