What is the name for string matching where every character of the needle must be matched, in order, somewhere in the string?

Given a list of strings,
["floor", "forman", "barometer"]
Imagine the needle is "fo". The implementation should match both "floor" and "forman", because each contains the characters 'f' and 'o' in that order, though not necessarily adjacent.
As a second example, the needle "or" would match all three items.
Some people call this fuzzy matching, but according to Wikipedia, fuzzy string matching is usually defined in terms of distance or scoring functions, so the above is not fuzzy matching in the strict sense.
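The matching rule described above can be sketched in a few lines of Python. This is only an illustration of the behaviour in the examples; the function name `matches_in_order` is made up for this sketch.

```python
def matches_in_order(needle: str, haystack: str) -> bool:
    """Return True if every character of needle occurs in haystack, in order."""
    remaining = iter(haystack)
    # `ch in remaining` advances the iterator past the first occurrence of ch,
    # so each later character has to be found after the previous one.
    return all(ch in remaining for ch in needle)


words = ["floor", "forman", "barometer"]
print([w for w in words if matches_in_order("fo", w)])  # ['floor', 'forman']
print([w for w in words if matches_in_order("or", w)])  # all three words
```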
Is there a name used in the literature for this type of string matching?

Related

Can I extract all matches with functions like aregexec?

I've been enjoying the powerful function aregexec that allows me to mine strings in a fuzzy way.
For example, I can search for a nucleotide string such as "ATGGCTTCGTC" within a DNA section, with a defined allowance of insertions, deletions and substitutions.
However, it only shows me the first match and does not scan the rest of the string. For example, if I run
aregexec("a","adfasdfasdfaa")
only the first "a" shows up in the result. I'd like to see all the matches.
I wonder if there is a more powerful function, or an argument that can be added to this one.
Thank you very much.
P.S. I explained the fuzzy search poorly. I mean that the match doesn't have to be perfect. Say I allow a substitution of one character and search for AATTGG in ctagtactaAATGGGatctgct: the capitalized part will be considered a match. I can similarly allow insertions and deletions of a certain number of characters.
gregexpr reports every occurrence of the pattern in the string, as in this example:
gregexpr("as","adfasdfasdfaa")
There is much more information under ?grep in R, which explains the regex functions in detail.
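To make the "all matches, with one substitution allowed" idea from the P.S. concrete, here is a minimal Python sketch. It is not how aregexec works internally: it handles substitutions only (no insertions or deletions), and the function name `find_all_with_substitutions` is hypothetical.

```python
def find_all_with_substitutions(pattern, text, max_subs=1):
    """Return (start, window) for every window of text that differs from
    pattern in at most max_subs positions. Substitutions only: insertions
    and deletions are NOT handled in this sketch."""
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        window = text[start:start + len(pattern)]
        # Count the positions where the window and the pattern disagree.
        mismatches = sum(p != c for p, c in zip(pattern, window))
        if mismatches <= max_subs:
            hits.append((start, window))
    return hits


print(find_all_with_substitutions("AATTGG", "ctagtactaAATGGGatctgct"))
# [(9, 'AATGGG')]
```

With max_subs=0 the same function lists every exact occurrence, which is what the question asks for in the "a" example.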

How does the Between operator work in dynamodb with strings

I was not expecting to get back a value from the query below. 1574208000#W2 is not between 1574207999 and 1574208001. But the records are still returned. Can anyone shed light on how the between comparison is done?
The DynamoDB BETWEEN operator compares strings in lexicographic order (i.e., the order in which they would appear in a dictionary). In this order, 1574208000#W2 does fall between 1574207999 and 1574208001.
Two strings are lexicographically equal if they are the same length and contain the same characters in the same positions.
Apart from that, to determine which string comes first, compare corresponding characters of the two strings from left to right. The first character where the two strings differ determines which string comes first. Characters are compared using the Unicode character set. All uppercase letters come before lower case letters. If two letters are the same case, then alphabetic order is used to compare them.
If two strings contain the same characters in the same positions, then the shortest string comes first. Ref
To try this out, run a simple example in Java:
String a = "1574207999", b = "1574208000#W2", c = "1574208001";
System.out.println(a.compareTo(b)); // prints negative number, indicating a < b
System.out.println(b.compareTo(c)); // prints negative number, indicating b < c
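The same check can be sketched in Python, whose string comparison is also character by character from left to right. The helper name `between` is made up here, and the sketch only mirrors the comparison rule described above for these ASCII keys; it does not call DynamoDB.

```python
def between(value: str, low: str, high: str) -> bool:
    """Inclusive lexicographic range check on strings."""
    return low <= value <= high


print(between("1574208000#W2", "1574207999", "1574208001"))  # True

# The comparison is decided at the first differing character, so string
# order can disagree with numeric order when lengths differ:
print("10" < "9")  # True, because '1' sorts before '9'
```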

Extract first value in boolean search string

I would like to be able to control the hierarchy of elements I extract from a search string.
Specifically, in the string "425 million won", I would like to extract "won" first, but then "n" if "won" doesn't appear.
I want the result to be "won" for the following:
stringr::str_extract("425 million won", "won|n")
Note that specifying a space before won in my regex is inadequate because of other limitations in my data (there may not necessarily be a space between "million" and "won"). Ideally, I would like to do this using regex, as opposed to if-else clauses because of performance considerations.
You can use the following pattern:
pattern <- "^(?:(?!won).)*\\K(?:won|n)"
s <- "425 million won"
m <- gregexpr(pattern,s,perl=TRUE)
regmatches(s,m)[[1]]
Explanation
^ Assert position at the start of the line
(?:(?!won).)* Tempered greedy token: matches any character, as many times as possible, as long as that character does not begin an occurrence of won
\K Resets the starting point of the match. Any previously consumed characters are no longer included in the final match
(?:won|n) Match either won or n
If you just want to extend on the code you already have:
na.omit(stringr::str_extract("425 million won", c("won", "n")))[1]
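For comparison, the tempered-token idea can be sketched in Python. Python's re module has no \K, so this variant uses a capture group instead; the name `extract_unit` is hypothetical.

```python
import re

# Greedy tempered token: skip characters as long as they do not start "won",
# then capture "won" if it is there; otherwise backtrack to the last "n".
PATTERN = re.compile(r"(?:(?!won).)*(won|n)")


def extract_unit(s):
    """Return "won" if it occurs in s, else "n" if that occurs, else None."""
    m = PATTERN.match(s)
    return m.group(1) if m else None


print(extract_unit("425 million won"))  # won
print(extract_unit("425millionwon"))    # won
print(extract_unit("425 million"))      # n
```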

Negative lookbehind or tempered pattern for checking one two or three words before a string

I am trying to write code that identifies supporting terms (e.g. 'detect', 'evidence') unless a negation term appears up to 3 words before them.
Some examples:
"FISH tests did not detect BCL2 translocation"
"FISH tests did not provide evidence of a BCL2 translocation"
I tried using a lookbehind, but since it requires a fixed length I don't have the flexibility to look back 1-3 words.
I tried using a tempered dot, but it allows any number of words.
The code I currently have looks only a single word before the 'support diagnosis' term.
grepl("(?<!\\bnot\\b\\s|cannot\\s|n't\\s|\\bno\\b\\s|negative\\s)(reveal|seen|show|detect|demonstrate|confirm|identif|evidence|suggest|positive|observe)(?:(?!\\bnot\\b)(?!cannot)(?!n't)(?!\\bno\\b)(?!negative for)(?!, ).)*?(bcl-?2|14[q]?[;:]18)", y, perl=TRUE, ignore.case=TRUE)
A lookbehind doesn't help in this situation. What you can do instead is systematically search for the negation terms and discard those parts of the string, up to three words after each negation, using (*SKIP)(*FAIL):
(\\bnot\\b|\\bcannot\\b|n't\\b)(?:\\W++(?!(?1))\\w+){0,3}(*SKIP)(*F)|\\b(reveal|seen|show)\\b(?!\\snot\\b)
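The same rule can also be sketched without regex backtracking verbs, by splitting the text into words and checking the three words before each supporting term. This is a different technique from the (*SKIP)(*FAIL) pattern above, shown in Python with a shortened, illustrative term list; all names here are made up for the sketch.

```python
import re

NEGATIONS = {"not", "cannot", "no"}
# Prefix matching, so "detected" or "shows" also count (and so would
# unrelated words such as "shower"; acceptable for a sketch).
SUPPORT_PREFIXES = ("reveal", "seen", "show", "detect", "evidence")


def is_negation(token):
    return token in NEGATIONS or token.endswith("n't")


def has_unnegated_support(text, window=3):
    """True if a supporting term occurs with no negation in the
    `window` words immediately before it."""
    tokens = re.findall(r"[\w']+", text.lower())
    for i, token in enumerate(tokens):
        if token.startswith(SUPPORT_PREFIXES):
            preceding = tokens[max(0, i - window):i]
            if not any(is_negation(t) for t in preceding):
                return True
    return False


print(has_unnegated_support("FISH tests did not detect BCL2 translocation"))  # False
print(has_unnegated_support("FISH tests detect BCL2 translocation"))          # True
```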

Approximate String matching exclude first character

I'm trying to do approximate String matching between lists of terms terms1 and terms2 where I want to match Strings including typos, different notations, etc. I'm using
amatch(terms1, terms2, method="osa", maxDist=1, nomatch=0)
I want to match e.g. licence and license, but I don't want to match training and raining.
So I thought about excluding the 1st character from the approx. matching, so that it is not considered for deletion/substitution, but has to be the same in both Strings.
How could this be done or are there any better ways to match correctly?
Any help appreciated!
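One way to picture the idea of pinning the first character is the following Python sketch. It assumes the usual definition of optimal string alignment (OSA) distance and does not use the R package; `osa_distance` and `match_same_first` are hypothetical names, and the 1-based index with 0 for "no match" only mimics the nomatch=0 convention in the call above.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions and transpositions of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def match_same_first(term, candidates, max_dist=1):
    """1-based index of the closest candidate that shares term's first
    character and is within max_dist on the remainder; 0 if none."""
    best_idx, best = 0, max_dist + 1
    for idx, cand in enumerate(candidates, start=1):
        if term[:1] == cand[:1]:
            dist = osa_distance(term[1:], cand[1:])
            if dist < best:
                best_idx, best = idx, dist
    return best_idx


print(match_same_first("licence", ["raining", "license"]))  # 2
print(match_same_first("training", ["raining"]))            # 0
```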

Resources