R - Searching text with NEAR regex

I have a vector containing text, broken up, like the following:
words = c("Lorem Ipsum is simply dummy text of the", "printing and typesetting industry. Lorem Ipsum has been the industrys
standard dummy text ever since the 1500s", "when an unknown printer took a galley of type and scrambled it to
make a type specimen book.", "It has survived not only five ,centuries, but also the leap into electronic")
I am using the following regex to find where the words "dummy" and "text" appear within 6 words of each other:
grep("\b(?:dummy\\W+(?:\\w+\\W+){1,6}?text|text\\W+(?:\\w+\\W+){1,6}?dummy)\b", words)
However, it's returning 0 despite 'dummy text' appearing in the first element.
Any idea where I am going wrong?

In an R string literal, "\b" is a backspace char, not a regex escape. You need to double the backslash, \\b, to make it match a word boundary.
After fixing the typo, you need to pay attention to the limiting quantifiers. {1,6}? is a lazy quantifier matching one to six occurrences (as few as possible, but still as many as necessary to find a valid match) of the modified subpattern. It means there must be at least one word between dummy and text, so the adjacent "dummy text" in your first element cannot match.
So, you need to use
pattern <- "\\b(?:dummy\\W+(?:\\w+\\W+){0,6}text|text\\W+(?:\\w+\\W+){0,6}dummy)\\b"
Details
\b - a word boundary
(?: - start of a non-capturing group
dummy - a dummy word
\W+ - one or more non-word chars
(?:\w+\W+){0,6} - zero to six occurrences of one or more word chars followed with one or more non-word chars
text - a text word
| - or
text - a text word
\W+ - one or more non-word chars
(?:\w+\W+){0,6} - zero to six occurrences of one or more word chars followed with one or more non-word chars
dummy - a dummy word
) - end of the non-capturing group
\b - a word boundary
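As a quick check of the two points above, here is a minimal sketch that runs both quantifier variants against a small vector. The vector sample_words and the variable names are made up for illustration, and perl = TRUE is an assumption on my side so that the pattern is interpreted by PCRE.

```r
# Hypothetical sample data: element 1 has the two words adjacent,
# element 3 has them in reverse order with three words in between.
sample_words <- c("Lorem Ipsum is simply dummy text of the",
                  "when an unknown printer took a galley of type",
                  "the text is just a dummy")

# {1,6}? requires at least one word between the two terms
strict  <- "\\b(?:dummy\\W+(?:\\w+\\W+){1,6}?text|text\\W+(?:\\w+\\W+){1,6}?dummy)\\b"
# {0,6} also allows the two terms to be adjacent
relaxed <- "\\b(?:dummy\\W+(?:\\w+\\W+){0,6}text|text\\W+(?:\\w+\\W+){0,6}dummy)\\b"

strict_hits  <- grep(strict,  sample_words, perl = TRUE)
relaxed_hits <- grep(relaxed, sample_words, perl = TRUE)

print(strict_hits)   # indices where at least one word separates the terms
print(relaxed_hits)  # additionally includes the adjacent "dummy text" element
```

Only the relaxed variant should report the first element, since nothing separates "dummy" and "text" there.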

Related

Remove first occurrence of special characters until the first word or word character in R using regex

For my project I am looking into removing parts of text based on the pattern of special characters. I have a long .txt file that has the below structure:
mycharobj=c("---------Some text is here.---------More text is here - [3548]----- Even more text is here.-----------More text is here - [408]--------- Even more text is here again.")
String continues following the above pattern.
My target is to remove parts that start with - and end - [number], such as:
"-----------------------More text is here - [3548]"
"-----------More text is here - [408]"
I am planning to use the code below to remove these parts (it will be looped in the future):
library(stringr)
library(qdapRegex)
temp=unlist(regmatches(mycharobj, gregexpr("[[:digit:]]+", mycharobj)))
mycharobj=rm_between(mycharobj, "-", paste(temp[1],"]", sep=""))
For this to work, I need a regular expression that removes the first occurrence of "-----------" in the text up to the first word or word character. If a string starts with text (a word or word characters), the expression needs to ignore it and find the first occurrence of "-----------" for my potential loop to work.
I was wondering if this can be done with regular expressions? Any help is appreciated. I have a very computationally demanding solution for this; split the string based on the special character "-" and then identify the parts of the text that I need through a set of conditionals. But due to the fact that it takes a lot more of the processing time, this solution is not very scalable for processing a large number of such .txt files.

You can use
gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", mycharobj, perl=TRUE)
Details:
-{9,} - nine or more - chars
(?:(?!-{9}).)*? - any char other than a line break char that does not start a sequence of nine hyphens, zero or more occurrences but as few as possible
- \[ - a - [ string
\d+ - one or more digits
] - a ] char.
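To see the pattern at work, here is a small sketch on a string built to mirror the structure from the question. The input x is constructed with strrep() purely for illustration (so the hyphen counts are easy to read) and is not the original file content.

```r
# Hypothetical input that mirrors the structure described in the question
x <- paste0(strrep("-", 9),  "Some text is here.",
            strrep("-", 9),  "More text is here - [3548]",
            strrep("-", 5),  " Even more text is here.",
            strrep("-", 11), "More text is here - [408]",
            strrep("-", 9),  " Even more text is here again.")

# A run of 9+ hyphens, then text that never starts another 9-hyphen run,
# then the closing "- [digits]" marker
cleaned <- gsub("-{9,}(?:(?!-{9}).)*?- \\[\\d+]", "", x, perl = TRUE)
print(cleaned)
```

The hyphen runs that are not followed by a "- [number]" marker before the next long run are left alone, because the tempered token (?:(?!-{9}).)*? refuses to cross into another nine-hyphen sequence.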

Trouble finding string followed by variable whitespace and numbers in R with regex

I'm attempting to use some regex to find the lines in a series of documents so I can accurately subset the information. First, some sample data.
text <- c("BAR 02/ BLAHBLAH ",
" 27/ LOCATION: BLAH-TOWN",
" 2013 BLAH;BLAH",
" BAR 09/ 10/ BOOHAABLAH ",
" 25/ 14/ LOREM IPSUM, ",
" 2014 2014 LOREM LORE LOT",
" BAR BLAH MUH BLAH NO BLAH")
I am attempting to find the elements of the vector where BAR is followed ONLY by numbers. The amount of whitespace is variable, but the lines I am interested in capturing are always followed by numbers. I am using the base R grep() function and have tried a large number of patterns. No positive lookahead configuration I have found so far seems to catch it.
Some of the things I have tried so far.
grep("(BAR\\b(?=\\s*[0-9]))", text, perl= T)
grep("(BAR\\b(?=\\s*\\b[0-9]))", text, perl= T)
grep("(BAR\\b\\s*\\d\\d\/)", text, perl = T)
grep("BAR\\s*[0-9]",text,perl=T)
grep("BAR\\s*(?![^A-Za-z])",text,perl=T)
Where am I going wrong? I've heard some about tidyr, but none of what I've read on it shows any more promise than grep.

I will provide the answer based on your feedback. It appears you modify the character vector by changing BAR to VIOL and introduce Unicode whitespace into the string.
Thus, the following should work in your case:
grep("(*UCP)VIOL\\s+[0-9]", text, perl=TRUE)
The (*UCP) PCRE verb makes \s match any Unicode whitespace.
In other environments (this is not your case), where TRE (default base R regex engine) POSIX character classes are Unicode aware, one might also use
grep("VIOL[[:space:]]+[0-9]", text)
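The effect of (*UCP) can be sketched with a non-breaking space. The vector docs and the choice of U+00A0 are assumptions for illustration only; the original data may contain a different Unicode whitespace character.

```r
# Hypothetical data: element 1 has two non-breaking spaces (U+00A0) after BAR
nbsp <- "\u00a0"
docs <- c(paste0("BAR", nbsp, nbsp, "02/ BLAHBLAH"),
          " 27/ LOCATION: BLAH-TOWN",
          " BAR 09/ 10/ BOOHAABLAH",
          " BAR BLAH MUH BLAH NO BLAH")

# Without (*UCP), \s is ASCII-only under perl = TRUE
ascii_hits   <- grep("BAR\\s+[0-9]", docs, perl = TRUE)
# With (*UCP), \s also matches Unicode whitespace such as U+00A0
unicode_hits <- grep("(*UCP)BAR\\s+[0-9]", docs, perl = TRUE)

print(ascii_hits)
print(unicode_hits)
```

If the two results differ on your own data, that is a strong hint that the "spaces" after the keyword are not plain ASCII spaces.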

Regex to match words that have an upper case character not in 1st pos, and some lower case chars

My text contains terms that are pasted to each other, luckily the pasted terms mostly start with upper case.
The strings I want to match will contain at least one word which contains at least one lower case character AND at least one upper case character that wouldn't be the first one.
Please see below for the diverse cases I should handle.
my_corpus <- c("PleaseMatch this",
"And alsoThis",
"this ASWell",
"thisTOO",
"Though NOT THIS",
"Nor This")
rgx <- "..." # please help me here
grep(rgx ,my_corpus) # 1 2 3 4

You might consider the following solution:
[[:lower:]][[:upper:]]|\B[[:upper:]][[:lower:]]
Or, if Foo_Bar should not be matched (note the \B non-word boundary will match an uppercase letter after _):
[[:lower:]][[:upper:]]|[[:alnum:]][[:upper:]][[:lower:]]
Or, to also handle a1A case:
[[:lower:]][[:upper:]]|[[:alnum:]][[:upper:]][[:lower:]]|[0-9][[:upper:]]\b
Details:
[[:lower:]] - matches a lowercase letter
[[:upper:]] - matches an uppercase letter
| - an alternation operator (separates alternatives in one group)
[[:alnum:]] - matches an alphanumeric char
[0-9] - matches any ASCII digit (you may use [[:digit:]], too)
\b - a word boundary
\B - a non-word boundary.
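A short sketch of the first two patterns against the corpus from the question, plus the Foo_Bar case. The variable names rgx_basic and rgx_strict are mine, and perl = TRUE is an assumption so that \B is handled by PCRE.

```r
my_corpus <- c("PleaseMatch this", "And alsoThis", "this ASWell",
               "thisTOO", "Though NOT THIS", "Nor This")

# lower followed by upper, or an upper+lower pair that is not at a word start
rgx_basic  <- "[[:lower:]][[:upper:]]|\\B[[:upper:]][[:lower:]]"
# same idea, but the upper+lower pair must follow a letter or digit (not "_")
rgx_strict <- "[[:lower:]][[:upper:]]|[[:alnum:]][[:upper:]][[:lower:]]"

basic_hits  <- grep(rgx_basic,  my_corpus, perl = TRUE)
strict_hits <- grep(rgx_strict, my_corpus, perl = TRUE)
print(basic_hits)
print(strict_hits)

# The underscore case: "_" is a word char, so \B holds before "B"
foo_basic  <- grepl(rgx_basic,  "Foo_Bar", perl = TRUE)
foo_strict <- grepl(rgx_strict, "Foo_Bar", perl = TRUE)
print(c(foo_basic, foo_strict))
```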

R utf-8 and replace a word from a sentence based on ending character

I am working with a large data set that contains double-byte characters (Korean text). I want to look for a character and replace it. To display the Korean text correctly in the browser, I changed the locale settings in R, but I am not sure whether that also applies to the code. Below is my code to change the locale to Korean. The Korean text displays properly in the viewer, but printing it in the console gives junk characters:
Sys.setlocale(category = "LC_ALL", locale = "korean")
My data is a data.table with a column of Korean text. Example:
"광주광역시 동구 제봉로 49 (남동,(지하))"
I want to get rid of the 1st word, which ends with the "시" character. Then I want to get rid of the "(남동,(지하))" at the end. I was trying gsub, but it does not seem to be working.
New <- c("광주광역시 동구 제봉로 49 (남동,(지하))")
data <- as.data.table(New)
data[,New_trunc := gsub("\\b시", "", data$New)]
Please let me know where I am going wrong. Since I want to search at the end of a word, I am using \\b, and since I want to replace any word ending with the "시" character, I am writing it as \\b시. Is this not the right way to write it? How do I take care of the () at the end of the sentence?
What would be a good source to refer to for regular expressions?
Is a UTF-8 setting needed for the script as well? How do I do that?

Since you need to match the letter you have at the end of the word, you need to place \b (word boundary) after the letter, so as to require a transition from a letter to a non-letter (or end of string) after that letter. A PCRE pattern that will handle this is
"\\s*\\b\\p{L}*시\\b"
Details
\\s* - zero or more whitespaces
\\b - a leading word boundary
\\p{L}* - zero or more letters
시 - your specific letter
\\b - end of the word
The second issue is that you need to remove a set of nested parentheses at the end of the string. You need again to rely on the PCRE regex (perl=TRUE) that can handle recursion with the help of a subroutine call.
> sub("\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] "광주광역시 동구 제봉로 49"
Details:
\\s* - zero or more whitespaces
(\\((?:[^()]++|(?1))*\\)) - Group 1 (will be recursed) matching
\\( - a literal (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1 or more chars other than ( and ) (possessively)
| - or
(?1) - a subroutine call that repeats the whole Group 1 subpattern
\\) - a literal )
$ - end of string.
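The recursion is easier to follow on plain ASCII input. Below is a minimal sketch; the vector addr and the variable names are invented for illustration.

```r
# Hypothetical ASCII inputs
addr <- c("Main St 49 (south,(basement))",
          "no parentheses here",
          "a (b) c")

# Group 1 matches one balanced (...) block; (?1) re-enters it for nested pairs
balanced_tail <- "\\s*(\\((?:[^()]++|(?1))*\\))$"
stripped <- sub(balanced_tail, "", addr, perl = TRUE)
print(stripped)
```

Because of the $ anchor, only a balanced block at the very end of the string is removed; the "(b)" in the third element is followed by more text and stays in place.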
Now, if you need to combine both, note that by default the PCRE engine behind R's gsub treats only ASCII letters, digits and _ as word characters, so \b does not find a word boundary around Hangul letters. You must tell it to use Unicode character properties with the (*UCP) PCRE verb.
> gsub("(*UCP)\\b\\p{L}*시\\b|\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] " 동구 제봉로 49"
Or using trimws to get rid of the leading/trailing whitespace:
> trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE))
[1] "동구 제봉로 49"
See more details about the verb at PCRE Man page.
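Put together as a script, the steps above might look like the sketch below. It assumes the file is saved and read as UTF-8 so that the Korean literals survive parsing; the names no_ucp, with_ucp and result are mine.

```r
New <- "광주광역시 동구 제봉로 49 (남동,(지하))"

# Without (*UCP): \b treats only ASCII letters, digits and _ as word chars
no_ucp   <- grepl("\\b\\p{L}*시\\b", New, perl = TRUE)
# With (*UCP): \b uses Unicode properties, so Hangul counts as word chars
with_ucp <- grepl("(*UCP)\\b\\p{L}*시\\b", New, perl = TRUE)
print(c(no_ucp, with_ucp))

# Remove the word ending in 시 and the trailing nested parentheses, then trim
result <- trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$",
                      "", New, perl = TRUE))
print(result)
```

Comparing no_ucp with with_ucp on your own machine is a quick way to confirm whether the verb is what makes the word-boundary part match.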

Regular expression to match maximum of five words

I have a regular expression
^[a-zA-Z+#-.0-9]{1,5}$
which validates that the word contains only alphanumeric characters and a few special characters, and is no more than 5 characters long.
How do I make this regular expression accept a maximum of five words, each matching the above regular expression?

^[a-zA-Z+#\-.0-9]{1,5}(\s[a-zA-Z+#\-.0-9]{1,5}){0,4}$
Also, you could use for example [ ] instead of \s if you just want to accept space, not tab and newline. And you could write [ ]+ (or \s+) for any number of spaces (or whitespaces), not just one.
Edit: Removed the invalid solution and fixed the bug mentioned by unicornaddict.

I believe this may be what you're looking for. It forces at least one word of your desired pattern, then zero to four of the same, each preceded by one or more white-space characters:
^XX(\s+XX){0,4}$
where XX is your actual one-word regex.
It's separated into two distinct sections so that you're not required to have white-space at the end of the string. If you want to allow for such white-space, simply add \s* at that point. For example, allowing white-space both at start and end would be:
^\s*XX(\s+XX){0,4}\s*$
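Since the other questions on this page use R, here is a sketch in R that builds the ^XX(\s+XX){0,4}$ form from a one-word pattern. The variable names and test inputs are invented, and the hyphen is placed last in the class so that it is a literal.

```r
# One-word pattern from the question, with the hyphen moved to the end
# of the class so it is taken literally
word <- "[a-zA-Z+#.0-9-]{1,5}"

# First word, then zero to four more words, each preceded by whitespace
up_to_five <- paste0("^", word, "(\\s+", word, "){0,4}$")

inputs <- c("a b c d e", "a b c d e f", "abcdef", "C# .NET", "")
ok <- grepl(up_to_five, inputs, perl = TRUE)
print(ok)
```

Building the pattern with paste0() keeps the one-word part in a single place, so a later change to the allowed characters only has to be made once.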

Your regex has a small bug. Inside a char class, a hyphen with a character on each side acts as a range metacharacter, so #-. matches every character from # to period: # $ % & ' ( ) * + , - and . (the hyphen is matched only because it happens to fall inside that range). To avoid this, escape the hyphen:
^[a-zA-Z+#\-.0-9]{1,5}$
Or put it at the beginning or end of the char class, so that it is treated literally:
^[-a-zA-Z+#.0-9]{1,5}$
^[a-zA-Z+#.0-9-]{1,5}$
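The difference between the range and the literal hyphen can be checked with a few inputs. This is a sketch in R (my choice of language, matching the rest of the page); the inputs are made up.

```r
# Unescaped hyphen between # and . forms the ASCII range 0x23-0x2E
range_class   <- "^[a-zA-Z+#-.0-9]{1,5}$"
# Hyphen placed last, so it is taken literally
literal_class <- "^[a-zA-Z+#.0-9-]{1,5}$"

inputs <- c("$%&", "a-b", "c++", "(*)")
range_ok   <- grepl(range_class,   inputs, perl = TRUE)
literal_ok <- grepl(literal_class, inputs, perl = TRUE)
print(range_ok)
print(literal_ok)
```

Inputs made only of characters such as $ % & ( * ) are accepted by the range version but rejected once the hyphen is literal.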
Now to match a max of 5 such words you can use:
^(?:[a-zA-Z+#\-.0-9]{1,5}\s+){1,5}$
EDIT: This solution has a severe limitation: it matches only input that ends in whitespace. To overcome this limitation, see the answer by Jakob.