How to handle "Regular expression backtrack stack overflow. (U_REGEX_STACK_OVERFLOW)"? - r

I have a text from which I want to extract the first two paragraphs. The text consists of several paragraphs separated by empty lines. The paragraphs themselves can contain line breaks. What I want to extract is everything from the beginning of the text until the second empty line. This is the original text:
Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd

Support the GoFundMe: http://gofundme.com/f/send-money-dire...
Follow Me:
The text I want to have is:
Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd
I tried to create a regular expression doing the job, and I thought the following seemed to be a possible solution:
(.*|\n)*(?:[[:blank:]]*\n){2,}(.*|\n)*(?:[[:blank:]]*\n){2,}
When I use it in R in stri_extract_all_regex, I receive the following error:
Error in stri_extract_all_regex(video_desc_orig, "(.*|\n)*?(?:[[:blank:]]*\n){2,}(.*?|\n)*(?:[[:blank:]]*\n){2,}") :
Regular expression backtrack stack overflow. (U_REGEX_STACK_OVERFLOW)
It's the first time I'm using regex and I really don't know how to interpret this error. Any help appreciated ;)

You have nested quantifiers like (.*|\n)*, which create an enormous number of paths to explore. Such a pattern first matches all the text and then backtracks to fit in the next parts of the pattern; on a long input the backtracking stack fills up, which is exactly what U_REGEX_STACK_OVERFLOW reports.
An alternative without the nested quantifiers, matching both paragraphs (and the newlines between them) while making sure that every line contains at least a single non-whitespace character:
\A[^\S\n]*\S.*(?:\n[^\S\n]*\S.*)*\n{2,}[^\S\n]*\S.*(?:\n[^\S\n]*\S.*)*
Explanation
\A Start of string
[^\S\n]*\S.* Match a whole line with at least a single non whitespace char
(?:\n[^\S\n]*\S.*)* Optionally repeat all following lines that contain at least a single non-whitespace character
\n{2,} Match 2 or more newlines
[^\S\n]*\S.*(?:\n[^\S\n]*\S.*)* Same as the previous pattern to match the lines for the second paragraph
See a regex demo and an R demo.
Example
library(stringi)
string <- 'Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd

Support the GoFundMe: http://gofundme.com/f/send-money-dire...
Follow Me: '
stri_extract_all_regex(
  string,
  '\\A[^\\S\\n]*\\S.*(?:\\n[^\\S\\n]*\\S.*)*\\n{2,}[^\\S\\n]*\\S.*(?:\\n[^\\S\\n]*\\S.*)*'
)
Output
[[1]]
[1] "Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.\nThen I went to a nice restaurant with them.\n\nBuy me a Beer: https://www.buymeacoffee.com/johnnyfd"

In R string literals you need to double the backslashes, writing \\ for every \ in the regex.
string <- 'Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd

Support the GoFundMe: http://gofundme.com/f/send-money-dire...
Follow Me: '
library(stringr)
string |>
  str_extract('(.*|\\n)*(?:[[:blank:]]*\\n){2,}(.*|\\n)*(?:[[:blank:]]*\\n){2,}') |>
  cat()
# Output
Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd

Related

How to remove these footnotes from text

Alright, so I have minimal experience with RStudio. I've been googling this for hours now and I'm fed up -- I don't care about the pride of figuring it out on my own anymore, I just want it done. I want to do some stuff with Canterbury Tales -- the Middle English version on Gutenberg.
I downloaded the plaintext, trimmed out the metadata, etc., but it's chock-full of "helpful" footnotes and I can't figure out how to cut them out. For example:
"And shortly, whan the sonne was to reste,
So hadde I spoken with hem everichon,
That I was of hir felawshipe anon,
And made forward erly for to ryse,
To take our wey, ther as I yow devyse.
19. Hn. Bifel; E. Bifil. 23. E. were; _rest_ was. 24. E. Hn.
compaignye. 26, 32. E. felaweshipe. Hl. pilgryms; E. pilgrimes.
34. E. oure
But natheles, whyl I have tyme and space,..."
I at least have the vague notion that this is a grep/regex puzzle. Looking at the text in TextEdit, each bundle of footnotes is indented by 4 spaces, and the next verse starts with a capitalized word indented by (edit: 4 spaces as well).
So I tried downloading the package qdap and using the rm_between function to specify removal of text between four spaces and a number; and two spaces and a capital letter (" [0-9]"," "[A-Z]") to no avail.
I mean, this isn't nearly as simple as "make the text lowercase and remove all the numbers dur-hur" which all the tutorials are so helpfully offering. But I'm assuming this is a rather common thing that people have to do when dealing with big texts. Can anyone help me? Or do I have to go into textedit and just manually delete all the footnotes?
EDIT: I restarted the workspace today and all I have is a scan of the file, each line stored in a character vector, with the Gutenberg metadata trimmed out:
text <- scan("thefilepath.txt", what = "character", sep = "\n")
start <- which(text == "GROUP A. THE PROLOGUE.")
end <- which(text == "God bringe us to the Ioye . that ever schal be!")
cant.lines.v <- text[start:end]
And that's it so far. Eventually I will
cant.v <- paste(cant.lines.v, collapse = " ")
And then strsplit and unlist into a vector of individual words-- but I'm assuming, to get rid of the footnotes, I need to gsub and replace with blank space, and that will be easier with each separate line? I just don't know how to encode the pattern I need to cut. I believe it is 4 spaces followed by a number, then continuing on until you get to 4 spaces followed by a capitalized word and a second word w/o numbers and special characters and punctuation.
I hope that I'm providing enough information, I'm not well-versed in this but I am looking to become so...thanks in advance.
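A minimal sketch of one way to cut the footnotes (my own, resting on the assumptions stated above: footnote bundles start with four spaces plus a digit, and the next verse resumes with four spaces plus a capital letter). It walks the line vector with a simple flag and drops everything between those two signals:
drop <- FALSE
keep <- logical(length(cant.lines.v))
for (i in seq_along(cant.lines.v)) {
  if (grepl("^ {4}[0-9]", cant.lines.v[i])) {
    drop <- TRUE   # a footnote bundle begins on this line
  } else if (grepl("^ {4}[A-Z]", cant.lines.v[i])) {
    drop <- FALSE  # the next verse resumes
  }
  keep[i] <- !drop
}
cant.lines.v <- cant.lines.v[keep]
If the continuation lines of a footnote are indented differently, the two grepl() patterns are the only parts that need adjusting.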

How to remove a pattern of text ending with a colon in R?

I have the following sentence
review <- c("1a. How long did it take for you to receive a personalized response to an internet or email inquiry made to THIS dealership?: Approx. It was very prompt however. 2f. Consideration of your time and responsiveness to your requests.: Were a little bit pushy but excellent otherwise 2g. Your satisfaction with the process of coming to an agreement on pricing.: Were willing to try to bring the price to a level that was acceptable to me. Please provide any additional comments regarding your recent sales experience.: Abel is awesome! Took care of everything from welcoming me into the dealership to making sure I got the car I wanted (even the color)! ")
I want to remove everything before each :
I tried the following code,
gsub("^[^:]+:","",review)
However, it only removed the first segment ending with a colon.
Expected results:
Approx. It was very prompt however. Were a little bit pushy but excellent otherwise Were willing to try to bring the price to a level that was acceptable to me. Abel is awesome! Took care of everything from welcoming me into the dealership to making sure I got the car I wanted (even the color)!
Any help or suggestions will be appreciated. Thank you.
If the sentences are not complex and have no abbreviations, you may use
gsub("(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*", "", review)
See the regex demo.
Note that you may further generalize it a bit by changing \\d+[a-zA-Z] to [0-9a-zA-Z]+ / [[:alnum:]]+ to match 1+ digits or letters.
Details
(?:\d+[a-zA-Z]\.)? - an optional sequence of
\d+ - 1+ digits
[a-zA-Z] - an ASCII letter
\. - a dot
[^.?!:]* - 0 or more chars other than ., ?, !, :
[?!.] - a ?, ! or .
: - a colon
\s* - 0+ whitespaces
R test:
> gsub("(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*", "", review)
[1] "Approx. It was very prompt however. Were a little bit pushy but excellent otherwise Were willing to try to bring the price to a level that was acceptable to me.Abel is awesome! Took care of everything from welcoming me into the dealership to making sure I got the car I wanted (even the color)! "
Extending to handle abbreviations
You may enumerate the exceptions if you add alternation:
gsub("(?:\\d+[a-zA-Z]\\.)?(?:i\\.?e\\.|[^.?!:])*[?!.]:\\s*", "", review)
^^^^^^^^^^^^^^^^^^^^^^
Here, (?:i\.?e\.|[^.?!:])* matches 0 or more ie. or i.e. substrings or any chars other than ., ?, ! or :.
See this demo.
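A quick check with a made-up input (my own, hypothetical example), showing that the alternation keeps i.e. from being treated as the end of the question text:
x <- "2a. Was the car ready, i.e. cleaned and fueled?: Yes, it was."
gsub("(?:\\d+[a-zA-Z]\\.)?(?:i\\.?e\\.|[^.?!:])*[?!.]:\\s*", "", x)
# [1] "Yes, it was."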

Use R to search for a specific text pattern and return entire sentence(s) where pattern appears

So I scanned a physical document, converted it to a TIFF image, and used the tesseract package to import it into R. However, I need R to look for specific keywords, find them in the text file, and return the entire line that each keyword is in.
For example, if I had the text file:
This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”.
And I tell R to search for the keyword "straightforward", how do I get it to return "This is also straightforward...see if that matches the"?
Here is a solution using the quanteda package that breaks the text into sentences, and then uses grep() to return the sentence containing the word "straightforward".
aText <- "This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”."
library(quanteda)
aCorpus <- corpus(aText)
theSentences <- tokens(aCorpus, what = "sentence")
grep("straightforward",theSentences,value=TRUE)
and the output:
> grep("straightforward",theSentences,value=TRUE)
text1
"This is also straightforward."
To search for multiple keywords, combine them in the grep() call with the alternation operator |.
grep("straightforward|exceeds",theSentences,value=TRUE)
...and the output:
> grep("straightforward|exceeds",theSentences,value=TRUE)
text1
"This is also straightforward."
<NA>
"It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a \"5\"."
Here is one base R option:
text <- "This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”."
lst <- unlist(strsplit(text, "(?<=[a-z]\\.\\s)", perl=TRUE))
lst[grepl("\\bstraightforward\\b", lst)]
I am splitting your text on the pattern (?<=[a-z]\\.\\s), which uses a lookbehind for a lowercase letter, followed by a full stop and a space. This should work well most of the time. There is the issue of abbreviations, but most of the time they take the form of a capital letter followed by a dot, and most of the time they do not end sentences.
Demo
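For the sample text, the last line should return the following (my own expected output from running the snippet above; note the trailing space kept by the zero-width split):
# [1] "This is also straightforward. "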

Unable to replace \x.. sequence from a twitter feed in R

I've been grappling with regex for the following string:
"Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... https://t dot co/hUradDaNVX"
I am unable to remove the entire \x...\x pattern from the above string.
I'm unable to remove the https URL from the above string.
My regex expressions are:
gsub('http.* *', '', twts_array)
gsub("\\x.*\\x..","",twts_array)
My output is:
"Just beautiful let’s see how the next few days go \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... httpstcohUradDaNVX"
My expected output is:
Just beautiful, let’s see how the next few days go. Long term buying opportunities could be around the corner
P.S.: As you can see, neither of the problems got solved. I also wrote "dot" for the . in https://t dot co/hUradDaNVX, as StackOverflow does not allow me to post shortened URLs. Can someone help me tackle this problem?
On Linux you can do the following:
twts_array <- "Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner \xed\xa0\xbd\xed\xb2\xb0\xed\xa0\xbd\xed\xb3\x89\xed\xa0\xbd\xed\xb2\xb8... https://t dot co/hUradDaNVX"
twts_array_str <- enc2utf8(twts_array)
twts_array_str <- gsub('<..>', '', twts_array_str)
twts_array_str <- gsub('http.*', '', twts_array_str)
twts_array_str
# "Just beautiful, let’s see how the next few days go. \n\nLong term buying opportunities could be around the corner ... "
enc2utf8() will convert any unknown Unicode sequences to the <..> format; the first gsub() then strips those, and the second gsub() removes the URL as well.
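To get all the way to the expected output, a small follow-up sketch (my own addition, not part of the answer): drop the leftover trailing dots and collapse the remaining whitespace, including the \n\n.
twts_array_str <- gsub('\\s*\\.{3,}\\s*$', '', twts_array_str)  # drop the trailing "... "
twts_array_str <- gsub('\\s+', ' ', trimws(twts_array_str))     # collapse \n\n and runs of spaces
twts_array_str
# "Just beautiful, let’s see how the next few days go. Long term buying opportunities could be around the corner"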

How to count the number of sentences in a text in R?

I read a text into R using the readChar() function. I aim to test the hypothesis that the sentences of the text have as many occurrences of the letter "a" as of the letter "b". I recently discovered the {stringr} package, which helped me a great deal to do useful things with my text, such as counting the number of characters and the total number of occurrences of each letter in the entire text. Now I need to know the number of sentences in the whole text. Does R have any function that can help me do that? Thank you very much!
Thank you @gui11aume for your answer. A very good package I just found that can help do the work is {openNLP}. This is the code to do that:
install.packages("openNLP") ## Installs the required natural language processing (NLP) package
install.packages("openNLPmodels.en") ## Installs the model files for the English language
library(openNLP) ## Loads the package for use in the task
library(openNLPmodels.en) ## Loads the model files for the English language
text = "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again." ## This sentence has unusual punctuation as suggested by #gui11aume
x = sentDetect(text, language = "en") ## sentDetect() is the function to use. It detects and seperates sentences in a text. The first argument is the string vector (or text) and the second argument is the language.
x ## Displays the different sentences in the string vector (or text).
[1] "Dr. Brown and Mrs. Theresa will be away from a very long time!!! "
[2] "I can't wait to see them again."
length(x) ## Displays the number of sentences in the string vector (or text).
[1] 2
The {openNLP} package is really great for natural language processing in R and you can find a good and short intro to it here or you can check out the package's documentation here.
Three more languages are supported in the package. You just need to install and load the corresponding model files.
{openNLPmodels.es} for Spanish
{openNLPmodels.ge} for German
{openNLPmodels.th} for Thai
What you are looking for is sentence tokenization, and it is not as straightforward as it seems, even in English (sentences like "I met Dr. Bennett, the ex-husband of Mrs. Johnson." contain full stops that do not end them).
R is definitely not the best choice for natural language processing. If you are Python proficient, I suggest you have a look at the nltk module, which covers this and many other topics. You can also copy the code from this blog post, which does sentence tokenization and word tokenization.
If you want to stick to R, I would suggest you count the end-of-sentence characters (., ?, !), since you are able to count characters. A way of doing it with a regular expression is like so:
text <- 'Hello world!! Here are two sentences for you...'
length(gregexpr('[[:alnum:] ][.!?]', text)[[1]])
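One caveat (my addition, not part of the answer): gregexpr() returns -1 when there is no match, so the count above would be 1 even for a text with no sentence terminators at all. A small guarded wrapper, as a sketch:
count_sentences <- function(text) {
  m <- gregexpr('[[:alnum:] ][.!?]', text)[[1]]
  if (m[1] == -1) 0L else length(m)  # -1 means "no match", i.e. zero sentences
}
count_sentences('Hello world!! Here are two sentences for you...')
# [1] 2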
