I would like to be able to control the hierarchy of elements I extract from a search string.
Specifically, in the string "425 million won", I would like to extract "won" first, falling back to "n" only if "won" doesn't appear.
I want the result to be "won" for the following:
stringr::str_extract("425 million won", "won|n")
Note that specifying a space before "won" in my regex is inadequate because of other limitations in my data (there may not necessarily be a space between "million" and "won"). Ideally, I would like to do this with regex rather than if-else clauses, for performance reasons.
Here is the code in use:
pattern <- "^(?:(?!won).)*\\K(?:won|n)"
s <- "425 million won"
m <- gregexpr(pattern,s,perl=TRUE)
regmatches(s,m)[[1]]
Explanation
^ Assert position at the start of the line
(?:(?!won).)* Tempered greedy token matching any character, one at a time, at positions where won does not begin
\K Resets the starting point of the match. Any previously consumed characters are no longer included in the final match
(?:won|n) Match either won or n
If you just want to extend the code you already have:
na.omit(str_extract("425 million won", c("won", "n")))[1]
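This works because str_extract() is vectorised over patterns: it tries each pattern in priority order and returns NA where a pattern finds nothing, so na.omit() drops the misses and [1] keeps the highest-priority hit. A quick illustration (assuming stringr is loaded):
library(stringr)
str_extract("425 million won", c("won", "n"))
#> [1] "won" "n"
na.omit(str_extract("425 million", c("won", "n")))[1]
#> [1] "n"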
Related
I've been enjoying the powerful function aregexec, which allows me to mine strings in a fuzzy way.
With it I can search for a nucleotide string such as "ATGGCTTCGTC" within a section of DNA, with a defined allowance for insertions, deletions and substitutions.
However, it only shows me the first match instead of going through the whole string. For example,
If I run
aregexec("a","adfasdfasdfaa")
only the first "a" will show up from the result. I'd like to see all the matches.
I wonder if there are other, more powerful functions, or an argument that could be added to this one.
Thank you very much.
P.S. I explained the fuzzy search poorly. I mean that the match doesn't have to be perfect. Say I allow a substitution of one character and search for AATTGG in ctagtactaAATGGGatctgct; the capitalized part will be considered a match. I can similarly allow insertions and deletions of a certain number of characters.
gregexpr will report every occurrence of the pattern in the string, as in this example:
gregexpr("as", "adfasdfasdfaa")
There is much more information under ?grep in R; it explains every aspect of using regular expressions.
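gregexpr is exact, though, not fuzzy. To get all approximate matches, one option is to call aregexec repeatedly, restarting the search just past each hit. A minimal sketch, assuming base R only (find_all_approx is an illustrative name, not an existing function):
find_all_approx <- function(pattern, x, max.distance = 0.1) {
  hits <- list()
  offset <- 0L
  repeat {
    # search the remainder of the string for the next approximate match
    m <- aregexec(pattern, substring(x, offset + 1L), max.distance = max.distance)[[1]]
    if (m[1] == -1L) break
    start <- offset + m[1]              # position in the original string
    len <- attr(m, "match.length")[1]
    hits[[length(hits) + 1L]] <- c(start = start, length = len)
    offset <- start                     # resume just past this match's start
  }
  hits
}

find_all_approx("a", "adfasdfasdfaa", max.distance = 0)  # five hits, one per "a"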
I am scraping a Word document to get the frequency of "content words" only. So far, I have been able to use the tidyverse and tidytext packages to remove words that are articles, include punctuation, have a length of one, etc., with filters like:
!str_detect(word, pattern = "[[:digit:]]"), # removes any words with numeric digits
!str_detect(word, pattern = "[[:punct:]]"), # removes any remaining punctuations
!str_detect(word, pattern = "(.)\\1{2,}"), # removes any words with 3 or more repeated letters
!str_detect(word, pattern = "\\b(.)\\b") # removes any remaining single letter words
Now I no longer want to remove entire observations; instead I want to remove only certain characters from existing observations (e.g. remove "s" and "ed" endings).
Current Dataframe:
print(df)
WORD N
Happy 7
Apple 8
Coworkers 16
Customers 9
Kicked 11
Turtle 8
Desired Dataframe:
WORD N
Happy 7
Apple 8
Coworker 16
Customer 9
Kick 11
Turtle 8
Your regex may work for simple cases (nouns, verbs) but for more accurate results I recommend a proper stemmer/lemmatizer. I've had good results with spaCy's Lemmatizer.
Here is an R wrapper for spaCy: http://spacyr.quanteda.io/
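A minimal sketch of what that could look like (assuming spacyr is installed along with a working spaCy backend; spacy_install() can set one up):
library(spacyr)
spacy_initialize()

parsed <- spacy_parse("Coworkers kicked customers", lemma = TRUE)
parsed$lemma
#> e.g. "coworker" "kick" "customer"
The lemma column gives dictionary forms, which also handles irregular words (e.g. "ran" -> "run") that a suffix-stripping regex would miss.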
You can use regular expressions like
/\w+((s)|(ed))$/g
The \w+ will match one or more word characters (letters, digits, or underscore).
The ((s)|(ed))$ looks for an ending of either "s" or "ed". You can extend that list as needed.
The beginning and ending slashes aren't part of the regex; they just mark the beginning and end of the match pattern.
The final g after the last slash is a regex flag indicating that you want to match globally: in most languages that means the engine doesn't stop at the first match but finds all of them. This may not be appropriate in your case; you'll have to experiment to figure out whether it's what you need.
Note that the beginning/ending slashes and the g flag are syntax not used in every language, so I'm not sure whether they apply in R. Some languages' regex libraries make you pass the flags in as separate arguments, so read your language's documentation to figure out how that works.
Wrapping things in parentheses automatically creates capturing groups, so you can check the regex match object to see whether the first capture group (corresponding to the outer parentheses) matched; that tells you the word has an ending you need to replace. You can then perform a regex replace that strips the first capture group, which gets rid of any of those endings for you.
I recommend https://regex101.com to test your regular expressions while developing them. Here's a regex & test suite I saved pertaining to your question, if you want to use it: https://regex101.com/r/tBduP6/2
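In R the same idea might look like this (a sketch using stringr; note I also capture the stem so it can be kept with the \\1 backreference, which shifts the group numbering relative to the pattern above):
library(stringr)
words <- c("Coworkers", "Customers", "Kicked", "Turtle")
str_replace_all(words, "(\\w+)((s)|(ed))$", "\\1")
#> [1] "Coworker" "Customer" "Kick" "Turtle"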
This is my first entry on Stack Overflow, so please be indulgent if my post lacks some quality.
I want to learn some web scraping with R and started with a simple example: extracting a table from a Wikipedia page.
I managed to download the specific page and identified the HTML sections I am interested in:
<td style="text-align:right">511.000.000\n</td>
Now I want to extract the number from the table data using regex. So I created a pattern which, from my point of view, should match the structure of the number:
pattern<-"\\d*\\.\\d*\\.\\d*\\.\\d*\\."
I also tried other variations, but none of them found the number within the HTML code. I wanted to keep the pattern open-ended, as the numbers might be hundreds, thousands, millions or billions.
My questions:
1. The number is embedded in HTML code; might it be necessary to include something in the pattern for the surrounding non-number code (which should not be extracted)?
2. What would be the correct version of the pattern to identify the number correctly?
Thank you very much for your support!!
So many stars imply a lot of backtracking.
One point further: \\d* would match more than 3 digits in a group, and would also match a group with no digits at all.
Assuming your numbers are always integers, formatted using a . as thousand separator, you could use the following: \\d{1,3}(?:\\.\\d{3})* (note the usage of non-capturing group construct (?:...) - implying the use of perl = TRUE in arguments, as mentioned in Regular Expressions as used in R).
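A quick check of that pattern against the snippet from the question:
html <- '<td style="text-align:right">511.000.000\n</td>'
regmatches(html, regexpr("\\d{1,3}(?:\\.\\d{3})*", html, perl = TRUE))
#> [1] "511.000.000"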
Look closely at your regex: you are assuming the number contains four periods (\\.), but in your own example there are only two. The pattern is not going to match, because while the asterisk makes \\d optional (zero or more), the periods are not optional. If you add a ? modifier after the 3rd and 4th periods, your pattern should start matching.
I was wondering if there is an easy way in SAS to count sentences in a string?
In pseudocode, I would search for the index of every ., ?, and !, and check whether the character one or two positions before it is a letter.
Any better ideas?
Assuming that your sentences are correctly punctuated, there should be exactly one sentence per terminator (?, !, or .), so in that case you can use countc(my_string,'?!.'). The main exceptions are probably interrobangs (?! or !?) and ellipses (...).
If your string contains lots of sentences with missing stops or double stops, one option is simply to cross your fingers and hope they more or less cancel out.
If there are lots of double stops but not so many missing ones, you could apply a regex to replace any run of consecutive stops with a single . before counting those, e.g. countc(prxchange('s/[\.!\?]{2,}/./',-1,string),'?!.').
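For readers following along in R rather than SAS, a rough analogue of the same collapse-then-count idea (a sketch; count_sentences is an illustrative name):
count_sentences <- function(s) {
  collapsed <- gsub("[.!?]{2,}", ".", s)  # like the prxchange call: squash runs of stops
  lengths(regmatches(collapsed, gregexpr("[.!?]", collapsed)))  # like countc(string, '?!.')
}

count_sentences("One. Two?! Three... Four!")
#> [1] 4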
I have a double challenge.
First, I want to match lines that contain two (or possibly more) specified words within a certain distance of each other, in whatever order.
Using lookaheads I manage to select lines matching two or more words, regardless of the order in which they occur. I can also easily add more words that must be found in the same line, so this scales without much effort when more words must occur for a line to be selected. The disadvantage is that I can't specify the maximum distance between them.
^(?=.*\bjohn)(?=.*\bjack).*$
By using the pipe operator (alternation) I can specify both orders in which the terms may occur, as well as the accepted distance between them, but when more words should be matched the code becomes lengthy and error-prone.
jack.{0,100}john|john.{0,100}jack
Is there a way to combine the respective advantages of both approaches in one regular expression?
Second, ideally I would like only 'jack' and 'john' themselves to be selected, not the whole line.
Is there a possibility to do this all at once?
For this case you have to use the second approach, but building such a pattern by hand isn't practical with regex alone; you should ask your language's tools for help, like paste in R, in order to build a regex in the second format.
In Python, I would do something like the following to create the long regex.
>>> def create_reg(lis):
...     out = []
...     for i in lis:
...         # each tuple yields forward order | reversed order
...         out.append(''.join(i) + '|' + ''.join([i[2], i[1], i[0]]))
...     return '(?:' + '|'.join(out) + ')'
...
>>> # note the '.' before {0,100}: without it the quantifier would apply
>>> # to the preceding letter instead of "any character"
>>> lst = [('john', '.{0,100}', 'jack'), ('foo', '.{0,100}', 'bar')]
>>> create_reg(lst)
'(?:john.{0,100}jack|jack.{0,100}john|foo.{0,100}bar|bar.{0,100}foo)'
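Since the question is about R, a hedged equivalent using paste (build_pattern is an illustrative name, not an existing function):
build_pattern <- function(pairs, gap = ".{0,100}") {
  # each pair contributes both orders, joined by the allowed gap
  alts <- vapply(pairs, function(p) {
    paste0(p[1], gap, p[2], "|", p[2], gap, p[1])
  }, character(1))
  paste0("(?:", paste(alts, collapse = "|"), ")")
}

build_pattern(list(c("john", "jack"), c("foo", "bar")))
#> [1] "(?:john.{0,100}jack|jack.{0,100}john|foo.{0,100}bar|bar.{0,100}foo)"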